UNSUPERVISED LEARNING
The rise of the internet has caused people to put a massive amount of data online, and that data is growing exponentially. Social network posts, online product reviews, and news articles account for large amounts of text. Cameras, cars, and even refrigerators are connected to the internet and are feeding data into cyberspace. Governmental entities are making more and more data available over the internet too. Medical data is now making its way to the cloud in the form of imaging, gene expression data, and even doctors' notes.
To turn all this raw data into meaningful information is a considerable challenge. Supervised learning can handle some big problems, such as facial recognition and machine translation, but it cannot make sense of most of this raw data. We can create vast numbers of input columns from this giant data pool, but most of the data does not contain any labels that we can use in our output column. While it is possible to take a table of input columns and manually create the classification labels for the output column, this kind of effort is usually prohibitively expensive for large tables. A different set of algorithms is required to take all this raw, unlabeled data and make it usable.
Unsupervised learning derives insight from training tables with no output column (i.e., no category labels). For example, unsupervised learning algorithms can analyze transaction data to spot types of credit card fraud that supervised learning does not detect. Unsupervised learning can also be used to create deepfake videos of political candidates or celebrities and to generate fraudulent social media posts.
CLUSTER ANALYSIS
One technique for analyzing data using unsupervised learning is cluster analysis. In biology, for example, researchers collect data on plants and animals that have yet to be categorized. They take the different observed features (e.g., size, number of eyes, etc.) and create a table that has one row for each observed animal (or plant) and one input column for each feature.
Table 7.1 A training table with no output column.
Notice that there is no output column containing the category of each animal as we had in the supervised learning tables earlier. In supervised learning, the training table has the correct answers in the output column, and these correct answers guide (or supervise) the process that computes the optimal function. In unsupervised learning, there is no output column, and the goal is to make sense of the data in the training table without any such supervision. In our biology example, the purpose of the cluster analysis is to use the input data to figure out how to group the observed animals with known species and when to classify a group as a new species or subspecies.
If you take the unsupervised learning data from table 7.1 and plot it in three dimensions (one dimension for each column), the result will be a graph like the one in figure 7.1.
Figure 7.1 A 3D view shows distinct data clusters.
The data points in cluster 1 are from humans. The data points in cluster 2 are from horses, cluster 3 is cats, and cluster 4 is spiders.1 Now, suppose our data contained observations of animal types other than humans, horses, cats, and spiders. For example, suppose it also included dogs, foxes, and squirrels. If we plotted dog, fox, and squirrel observations, their traits would overlap with the cat cluster in all three variables (weight, lifespan, legs), and we would not be able to distinguish them. To make the cat, dog, squirrel, and fox clusters separate, we would need to add more variables, such as color, tail size, and the sounds the animals make. Plotting the observations with these three additional variables would require three more dimensions, for a total of six. The three-dimensional plot is complex but relatively easy to visualize; most of us cannot imagine six dimensions in any understandable fashion.
Worse, to include even a small fraction of the many animal types, we might need fifty or one hundred dimensions or more. In a one-hundred-dimensional space, even though we cannot visualize it, the clusters of observations for cats, dogs, foxes, squirrels, and other animals would be distinct and separate, just like the four clusters in our three-dimensional example. We cannot visualize the clusters, but we can use cluster analysis, which is a set of mathematical algorithms that can group the observations in a training table into separate and distinct clusters in a high-dimensional space.2 For example, if we had unlabeled observations of humans, horses, spiders, cats, dogs, foxes, and squirrels, the mathematical algorithm would calculate that there are seven clusters in the data and would determine how to place each observation in the correct cluster.
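As a minimal sketch of the idea, the following Python code (assuming scikit-learn is available) runs the well-known k-means clustering algorithm on a few invented rows resembling table 7.1. Note that k-means must be told how many clusters to look for; other cluster analysis algorithms, or model-selection measures such as silhouette scores, can estimate that number from the data itself.

```python
import numpy as np
from sklearn.cluster import KMeans

# Columns: weight in kg, lifespan in years, number of legs. No output column.
# These measurements are invented for illustration.
observations = np.array([
    [70.0, 79, 2],    # human-like rows
    [68.0, 81, 2],
    [500.0, 27, 4],   # horse-like rows
    [450.0, 25, 4],
    [4.0, 15, 4],     # cat-like rows
    [5.0, 14, 4],
    [0.02, 2, 8],     # spider-like rows
    [0.03, 1, 8],
])

# k-means groups the rows into four clusters in this 3-dimensional space.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
labels = kmeans.fit_predict(observations)
print(labels)  # each row is assigned a cluster number, e.g., [0 0 1 1 2 2 3 3]
```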
We can use cluster analysis algorithms to determine the number of clusters and to assign observations (rows) to the clusters for almost any training table. People in many fields besides biology use cluster analysis. In marketing, you can use it to identify homogeneous groups of customers who have similar needs and attitudes. Marketers then target the different groups with different campaigns. Golfers may see ads featuring Tiger Woods, whereas nature lovers might see ads featuring mountains. In medicine, researchers can apply cluster analysis to surveys of patient symptoms to identify groups of patients. They can then label some clusters as new diagnostic or disease categories. Insurance companies use cluster analysis to determine which types of customers are making which types of claims and which customers would be receptive to the marketing of their various insurance products. Geologists use cluster analysis to identify earthquake-prone regions. In epidemiology, researchers use it to find areas or neighborhoods with similar epidemiological profiles.
When statisticians created the first cluster analysis algorithms in the 1930s, they had to implement them by hand. As a result, they could only find clusters in low-dimensional spaces and for relatively small numbers of observations. Over the years, statisticians and, more recently, computer scientists have developed improved cluster analysis algorithms to the point where they can handle unimaginably high-dimensional problems and massive training tables.
ANOMALY DETECTION
One application of cluster analysis is anomaly detection. To better understand how this works, let’s revisit credit card fraud detection. Credit card issuers apply supervised learning classifiers to massive transaction databases to distinguish valid from fraudulent transactions based on past patterns of fraudulent activity. However, supervised learning has difficulty identifying new patterns of fraud. With unsupervised learning, credit card companies can find those new patterns. Issuers can use the multidimensional clusters in a credit card transaction database to represent patterns of valid user transactions. When the system encounters a new transaction (which happens millions of times a day), it can match most legitimate new transactions to an existing valid transaction cluster. However, if a transaction does not fit neatly into a cluster, the system can flag it as potentially fraudulent.
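As a rough sketch of the idea (not how any real issuer's system works), the following Python code clusters simulated valid transactions with k-means and then flags a new transaction as anomalous when it lies far from every cluster center. The two features, the simulated data, and the threshold are all invented for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Simulated valid transactions with two invented features:
# amount in dollars and hour of day.
valid = np.vstack([
    rng.normal([30, 13], [10, 2], size=(500, 2)),   # small daytime purchases
    rng.normal([120, 19], [25, 1], size=(500, 2)),  # larger evening purchases
])

# Learn clusters of normal behavior; no labels are involved.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(valid)

# Measure how far typical valid transactions sit from their cluster center,
# and treat anything far beyond that as suspicious.
train_dist = np.linalg.norm(valid - kmeans.cluster_centers_[kmeans.labels_], axis=1)
threshold = train_dist.mean() + 3 * train_dist.std()

def is_anomalous(transaction):
    distances = np.linalg.norm(kmeans.cluster_centers_ - transaction, axis=1)
    return distances.min() > threshold

print(is_anomalous(np.array([35.0, 14.0])))   # fits a cluster: False
print(is_anomalous(np.array([5000.0, 3.0])))  # far from all clusters: True
```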
Unsupervised learning is used to find normal patterns in data so that we can identify abnormal patterns. In a cybersecurity context, IT teams use anomaly detection to uncover abnormal patterns of computer network traffic and server or database access caused by hackers. Hospitals use anomaly detection to identify life-threatening electrocardiogram patterns and abnormal CT scans for hospital patients. Workers in other areas use anomaly detection to identify insurance and accounting fraud, to predict weather patterns, and for many other applications.
There are many unsupervised learning anomaly detection techniques besides cluster analysis, and I will mention one more: time series analysis. Time series training tables contain observations that occur over time, such as stock prices. With a time series, we can derive the output column from the data itself: we create a training table that has each price as the output column and the prices preceding it as the input columns. Because the labels come straight from the data, no human effort is required to label the observations. This way of turning an unsupervised learning task into a supervised learning task is known as self-supervised learning. We can then use supervised learning to predict each price from the prices that came before it, as sketched below.
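Here is a minimal Python sketch of turning a price series into such a self-supervised training table. The prices and the window size are invented for illustration.

```python
# Invented daily closing prices for illustration.
prices = [101.2, 102.5, 101.8, 103.1, 104.0, 103.6, 105.2]

window = 3  # how many preceding prices form the input columns
rows = []
for i in range(window, len(prices)):
    inputs = prices[i - window:i]   # the preceding prices (input columns)
    output = prices[i]              # the price to predict (output column)
    rows.append((inputs, output))

for inputs, output in rows:
    print(inputs, "->", output)
# [101.2, 102.5, 101.8] -> 103.1, and so on
```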
FAKE NEWS
The Cambridge Analytica scandal raised public awareness of what happens when bad actors create massive numbers of fake social media posts. Even though the posts in that scandal were created manually by humans, the ramifications were significant. Imagine what would happen if the technology were available to generate massive numbers of fake posts by computer. Fake news is a big problem, even when manual effort is required to create it. If computers could generate fake news, there would be much more fake content with the potential to mislead even more people. While in the early twenty-first century this idea may have been a dystopian sci-fi fantasy, an important unsupervised learning technology is scarily close to making computerized generation of fake news widely available to those who want to manipulate and mislead.
OpenAI created a technology named GPT-2 that can generate text.3 For example, I entered the following text from Wikipedia into GPT-2:4 “The United States of America is a country consisting of fifty states, a federal district, five major self-governing territories, and various possessions.” GPT-2 then wrote this continuation:
The United States of America is a country consisting of fifty states, a federal district, five major self-governing territories, and various possessions. Since its inception, the United States has been the world’s oldest continuous sovereign country, having been founded in 1776. Though many countries, and indeed most large nations, have changed hands in their histories, the United States has always remained a true nation of the people. Today, the United States constitutes the foremost military power in the world.
The post has reasonable grammar, and its facts are correct. Imagine a computer writing fake blog posts and fake news articles like this. The OpenAI system was trained by processing Reddit blog posts using self-supervised learning. It had a supervised learning goal of predicting the next word in a post based on the previous words. It is self-supervised because the labels can be found right in the text (i.e., the next word). The resulting function is known as a language model.
Researchers created a training table for this language model by programmatically breaking up the text into a series of observations, as is illustrated in table 7.2.
Table 7.2 Illustration of a training table for a language model.
There will be one row per word in the post, so a post with one thousand words would have one thousand training rows. Now imagine how big this table would be if it contained rows for not just one post but for 8 million posts. That is what the OpenAI team did. The neural network had 1.5 billion weights, and the training task was to predict the output column from the words in the input column.
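To make the construction concrete, here is a minimal Python sketch of breaking a post into training rows, with the preceding words as the input and the next word as the label. The sample post is invented.

```python
# An invented eight-word post, broken into one training row per word:
# the preceding words are the input, the next word is the label.
post = "the united states of america is a country".split()

rows = [(post[:i], post[i]) for i in range(1, len(post))]
for context, next_word in rows:
    print(" ".join(context), "->", next_word)
# the -> united
# the united -> states
# ... and so on, one row per word
```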
After training, we can present GPT-2 with the start of a text that was not in the training table, and GPT-2 predicts the next word. If we then present GPT-2 with the input text plus the word it just generated, it predicts the next word again, and so on, until we have generated several sentences.5
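The loop itself is simple, as this minimal Python sketch shows. The function predict_next_word here is a hypothetical stand-in for a trained language model such as GPT-2; it just returns words from a canned continuation so the loop is runnable.

```python
# A canned continuation standing in for real model predictions.
canned = iter("has been the world's oldest continuous sovereign country".split())

def predict_next_word(words):
    # A real model would score every vocabulary word given `words`
    # and return the most likely next word.
    return next(canned, ".")

text = "The United States of America".split()
for _ in range(8):
    text.append(predict_next_word(text))  # feed back the extended text
print(" ".join(text))
```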
A commercial marketing agency used a similar technology6 and fed it thirty human-generated headlines, such as “What Photo Filters Are Best for Instagram Marketing?” In 2019, they published the headlines and the computer-generated responses on www.thismarketingblogdoesnotexist.com. It looks like a human-generated blog site, and the posts have reasonable grammar and sentences that sort of make sense. However, when you read the posts carefully, it becomes evident that their creator does not know the subject matter. For example, the first paragraph of the photo filter post read as follows:
Instagram Stories first made people’s Instagram feeds sleeker, more colorful and just generally more fun. They could post their artistic photos in the background of someone else’s Story—and secretly make someone jealous and/or unfollow you while doing it.
First, the post has nothing to do with the human-generated headline. The headline is about photo filters, and the blog post is not. Second, it is not factual. You cannot post photos on someone else’s story. Finally, sentence segments like “and/or unfollow you while doing it” are entirely nonsensical. Why would one user unfollow another user while posting a photo to their Instagram page?
The passage that GPT-2 generated about the United States was the only fully coherent and factually correct passage that it generated in my twenty tries with the system. Let’s look at a more typical example of my GPT-2 interactions. When I entered “Traffic in Connecticut,” the text produced by GPT-2 was this:
Traffic in Connecticut and New York is running roughly at capacity, with many Long Island Expressway and Long Island Rail Road interchanges carrying tolls. That could impact the rest of the year as drivers try to figure out whether their trip will be all right. In New Jersey, drivers can expect to be paying more than $1,000 for the first time to use the Port Authority’s new toll-free bridge across the Hudson River.
The first sentence starts fine, but then it starts talking about tolls at Long Island Rail Road interchanges. This sentence violates our commonsense knowledge, because we know that railroad cars do not stop for tolls. The second sentence is okay, although it is hard to ascertain its meaning. The third sentence is where it goes off the rails. Tolls in the New York and New Jersey area are high, but they are not anywhere near $1,000.
Some researchers have suggested that GPT-2 learns commonsense knowledge about the world and learns to reason to generate these texts. If this were happening, this knowledge might serve as a foundation for the development of AGI capabilities. However, this interpretation cannot be correct, because GPT-2 gets so many facts wrong in the output text. It also does not appear to be learning to reason, because so many generated texts contain violations of commonsense reasoning. NYU professor Gary Marcus has written many papers and given many talks criticizing this interpretation. As he puts it, “Upon careful inspection, it becomes apparent the system has no idea what it is talking about.”7
A better explanation is that GPT-2 is learning statistical properties of word co-occurrences. Google Brain researchers have also demonstrated that language models have a strong tendency to memorize sentence fragments from the training data.8 GPT-2 is probably memorizing some word sequences that constitute facts. On the occasions it gets its facts right, GPT-2 is probably just regurgitating these memorized sentence fragments. When it gets its facts wrong, it is because it is stringing words together based on the statistical likelihood that one word will follow another.
The lack of commonsense reasoning does not make language models useless. On the contrary, they can be quite useful. Google uses a language model in its Smart Compose feature in Gmail: Smart Compose predicts the next words a user will type, and the user can accept them by hitting the tab key.
WORD EMBEDDINGS
Researchers create word embeddings using techniques that are similar to those used for language models. In 2013, Google researchers trained a network on a language-modeling task.9 However, instead of predicting the next word from all the words that came before, the task was for each word in a large set of news stories to predict the following two words and the previous two words. Of course, it would be impossible to learn to do this anywhere near perfectly, because the two words before and after any given word will change from article to article. However, the idea was that, as English linguist J. R. Firth said in 1957, “you shall know a word by the company it keeps.”10
After training, each word in the training text had its own set of values in the trained network; this set of values is the word embedding. When training was complete, the Google team discovered that the word embeddings encoded some interesting information about words in a high-dimensional space. For example, they found that the male/female relationship is encoded. It is possible to add and subtract word embeddings using simple vector arithmetic, and when the team computed king – man + woman on their word embeddings, the resulting embedding was very close to the word embedding for queen.11
Similarly, researchers found that the word embedding resulting from apple – apples is almost identical to the word embedding resulting from car – cars. This finding indicated that the concept of plurals is encoded.
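Here is a minimal Python sketch of that embedding arithmetic using toy three-dimensional vectors. Real embeddings (e.g., from word2vec) have hundreds of dimensions and are learned from text; these values are invented so that the king – man + woman relationship holds.

```python
import numpy as np

# Toy embeddings, invented for illustration only.
emb = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "man":   np.array([0.5, 0.1, 0.1]),
    "woman": np.array([0.5, 0.1, 0.9]),
    "queen": np.array([0.9, 0.8, 0.9]),
}

result = emb["king"] - emb["man"] + emb["woman"]

# Find the vocabulary word whose embedding is closest to the result.
def closest(vec):
    return min(emb, key=lambda w: np.linalg.norm(emb[w] - vec))

print(closest(result))  # queen
```

The same arithmetic illustrates the plurals finding: in a real embedding space, the vector apple – apples points in nearly the same direction as car – cars.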
That said, word embeddings do not capture the full meanings of words. When people learn the concept of a car, they also learn many features of the car concept. For example, a car has wheels, doors, an engine, and a steering wheel. We also learn that cars are a form of transportation, they carry passengers, and they move faster than people. Stanford University researchers have shown that word embedding representations do not capture this level of understanding.12
We now know that, for many natural language tasks, such as answering a question, networks train faster and perform better when word embeddings are input to the network instead of words. Google and Microsoft also use word embeddings to improve search engine results13 and for predictive typing on smartphones.14
UNSUPERVISED AND UNINTELLIGENT
One misunderstanding about unsupervised learning is that these algorithms have reasoning ability. For example, a Forbes magazine article said that unsupervised learning “goes into the problem blind—with only its faultless logical operations to guide it.”15 This statement makes it sound as if unsupervised learning algorithms use reasoning to explore unstructured data. Nothing could be further from the truth. Unsupervised learning algorithms are conventionally programmed and follow an exact step-by-step sequence of operations.
Unsupervised learning is used when the training table has no output column. Unsupervised techniques such as cluster analysis were first used over eighty years ago and are still in heavy use today. They can be used to create fake news, but the fake news tends to be factually inaccurate and lacks overall cohesiveness because these systems do not have commonsense knowledge and cannot apply commonsense reasoning to that knowledge.
We should all be amazed that computers can generate text that appears to be fake news—as long as no one looks too closely. They can do so because they have analyzed massive sets of texts and learned word patterns. However, they do not understand what they read, have not gained human-like world knowledge, and cannot think or reason, and the generated texts reflect those limitations.
Although fake news and falsely attributed text can have serious real-world consequences, such as influencing elections or creating controversy around a celebrity, these uses of unsupervised learning are just bad uses of a tool that is neither good nor bad on its own. A shovel can be a dangerous weapon if it is wielded by the wrong person; the same is true of AI or any other tool.