My favorite machine learning study comes not from a machine learning journal but from research on ancient China. That says a lot about the versatility of AI.
Peter Bol is the Vice Provost at Harvard where he specializes in East Asian Languages and Civilizations. I attended a talk he gave about Harvard’s China Biographical Database Project, which is building a rich visual catalogue of all of China’s biographical information using historical records that stretch back to the 9th century.
In his talk, Professor Bol described his journey studying one ancient Chinese region in particular, where researchers were having difficulty understanding how people advanced and climbed socially through the society. All they had were thousands of handwritten letters, and they knew that somewhere in there was the answer.
The ancient Chinese people who wrote these letters had an invisible shared knowledge. Based on intuition, they knew “how things worked.” They knew who the big-shots were, how to approach them, and all the nuances of what to say and why. Eventually these unwritten rules were lost to time. But their traces remained there in the pile of letters, a trove of ancient, analog big data.
The researchers fed this data through an unsupervised learning algorithm, and what it revealed were things that humans couldn’t have figured out on their own, or at least not without an impossibly large amount of time and patience. The model showed clear patterns of how people advanced through the society, and of how they rose to the top of a power structure that was made up of a small network of nodes, representing a small community of power brokers.
AI had advanced our understanding of ancient human behavior. This raises the question: what else can AI advance?
Professor Bol was using unsupervised learning for its best possible use case: to tell him what he didn’t know he needed to know. Supervised machine learning can’t do that, because it only looks at what we tell it to look at. Unsupervised learning looks at everything.
Two years ago, I joined an AI company as an advisor. Text IQ is building AI for sensitive information to help enterprises and government agencies reduce their latent risk. The problem we’re working on is a multibillion-dollar problem: the wrong kind of information slipping to the public or a competitor can topple a merger, trigger huge fines, derail high-stakes litigation, and create reputational harm, degrading shareholder value.
Like the ancient Chinese who left traces of their shared knowledge in the public record, today’s humans leave traces of latent risk in the massive datasets of their organizations. These are sensitive needles in a haystack: codewords describing insider trading, private (and risky) emails from professional accounts, privileged information as a merger approaches, or the vestiges – everywhere – of personal data that have become large liabilities in the face of new privacy regulations like Europe’s sweeping General Data Protection Regulation, or GDPR.
The compliance and risk spaces have become crowded with startups that are using artificial intelligence to help organizations protect themselves from these inside threats. Many of these startups are using supervised machine learning, because it is generally faster and less complex to develop.
At Text IQ, we are using unsupervised machine learning. We believe this technique provides significantly better results for reducing the risk, time, and cost of detecting sensitive information.
Here are three key advantages of using unsupervised machine learning to manage latent risk.
1. Unsupervised machine learning sees things we don’t tell it to see
Supervised machine learning is the basis for some of the most powerful AI technologies: speech recognition (“What did this person say?”), pharmaceutical research (“Is this molecule a drug?”), and self-driving cars (“Is this a stop sign?”). In these applications, we take what’s seen in the real world and we tell the model: here’s how you should have understood what you saw. The function takes an input and makes a prediction about the output, and by minimizing the error of that function, we’re trying to get things right as much as we can.
The benchmarks for supervised learning are clear. Did it interpret the person’s speech correctly? Did it identify the drug? Did it stop at the stop sign? There’s a quantitative measure that states the fraction that the model is missing or the percentage of time that it is incorrect.
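To make the idea of "minimizing the error" concrete, here is a minimal sketch in Python. Everything in it is invented for illustration: the scores, the labels, and the helper names `error_rate` and `fit_threshold` are assumptions, not anything from a real system. The model is nothing more than a cutoff on a single number, chosen to make the fewest mistakes on labeled examples, and then graded by the fraction it gets wrong on examples it has not seen.

```python
# Hypothetical labeled examples: (score, is_sensitive). All values are made up.
train = [(0.1, False), (0.3, False), (0.4, False),
         (0.6, True), (0.8, True), (0.9, True)]
test = [(0.2, False), (0.7, True), (0.5, True)]


def error_rate(threshold, examples):
    """Fraction of examples where 'score >= threshold' disagrees with the label."""
    wrong = sum(1 for score, label in examples if (score >= threshold) != label)
    return wrong / len(examples)


def fit_threshold(examples):
    """Try each observed score as a cutoff and keep the one with the lowest error."""
    candidates = sorted(score for score, _ in examples)
    return min(candidates, key=lambda t: error_rate(t, examples))


threshold = fit_threshold(train)
train_error = error_rate(threshold, train)
test_error = error_rate(threshold, test)
print(threshold, train_error, test_error)
```

The labels do all the steering here. The model can only ever answer the one question the labels pose.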
However, for use cases that involve understanding the nuances of human interactions, supervised learning has built-in limitations to what it can imagine, because these algorithms only see what they’re allowed to see (e.g. here’s an image), in order to make a prediction it’s instructed to make (e.g. is there a human in it?).
In some sense, unsupervised learning is held to a higher standard, because its end goal is loftier: rather than making a prediction, it is asked to extract useful and actionable information. If supervised learning is a laser, then unsupervised learning is a floodlight, ingesting all the data there is and surfacing buried patterns.
Its expansive “data view” is why unsupervised learning can provide deeper insights for understanding the hidden meanings and relationships in the artifacts of human interactions. This is true whether the artifact is an ancient Chinese letter or a modern chatlog.
2. Unsupervised machine learning can more easily leverage big datasets
Supervised learning requires examples of correct answers to learn how to accurately predict outputs for given inputs. For example, an algorithm that can detect email spam requires a large set of emails that are labeled spam or not spam. The process for “labeling” this data is slow and expensive.
One way to address these time and cost challenges is to crowd-source data labeling. If we give our email provider access, it can learn about those emails we mark as spam and those we don’t. We can also hire humans to tag emails manually. However, for many use cases – and in the enterprise space specifically – crowd-sourcing data labeling is impossible, because the information the data contains is sensitive, proprietary, or esoteric.
Unsupervised learning side-steps all these challenges. Because it simply looks for patterns in data, unsupervised learning doesn’t require a “cheat sheet” of labeled data. Also, unsupervised learning can lead us to a different kind of label: labeled patterns rather than labeled data.
Discovering new patterns is possible with unsupervised learning because the model usually uncovers a structure that is interpretable. An expert can explain what the patterns mean and how they relate to each data point, and then label them.
In a sense, these labeled patterns can lead us to a labeled dataset, because we can look at which patterns any data point exhibits, and label that data point accordingly. This is a much easier process than the other way around, because the number of patterns is far smaller than the number of data points.
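The path from labeled patterns to a labeled dataset can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a description of any production model: the two-feature data points are invented, plain k-means stands in for whatever unsupervised model is actually used, and the names `kmeans`, `pattern_labels`, and `data_labels` are hypothetical.

```python
from statistics import mean


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, centroids, iterations=10):
    """Plain k-means: returns the cluster index assigned to each point."""
    assignments = []
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        assignments = [
            min(range(len(centroids)),
                key=lambda i: squared_distance(p, centroids[i]))
            for p in points
        ]
        # Move each centroid to the mean of its members.
        new_centroids = []
        for i, old in enumerate(centroids):
            members = [p for p, a in zip(points, assignments) if a == i]
            if members:
                new_centroids.append(tuple(mean(dim) for dim in zip(*members)))
            else:
                new_centroids.append(old)
        centroids = new_centroids
    return assignments


# Invented, unlabeled data points (two made-up features per document).
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
          (5.0, 5.2), (4.8, 5.1), (5.1, 4.9)]

# Step 1: the algorithm finds the patterns without any labels.
assignments = kmeans(points, centroids=[points[0], points[-1]])

# Step 2: an expert inspects each pattern once and names it (hypothetical names).
pattern_labels = {0: "routine", 1: "needs review"}

# Step 3: every data point inherits the label of the pattern it exhibits.
data_labels = [pattern_labels[a] for a in assignments]
print(data_labels)
```

The expert labeled two patterns, not six documents. That gap is the whole saving, and it widens as the dataset grows.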
The better the model is designed, the better the patterns it will discover, and the deeper the insights it will produce. Designing sophisticated unsupervised models requires deep data science experience – along with some art.
3. Unsupervised machine learning keeps humans in the loop
It is hard to imagine how unsupervised learning could work without a human in the loop. Somehow it has to provide information to a human, almost like an advisor saying, “Look at what I found.” Supervised machine learning, on the other hand, is often designed to take humans out of the loop. Self-driving cars are the clearest example of this, and also one of the most ambitious.
However, in highly regulated industries, it’s critical to have humans in the loop, so we can be the ones to consider the early results, perform the nuanced analyses, and make the final calls.
The algorithm used in Professor Bol’s study didn’t care that the letters tended to flow to certain power centers. When it found that there were dominant nodes, it didn’t ask why. It only said: look over here at this clear pattern. And when the researchers traced these patterns, the ancient, unwritten rules began to reveal themselves, like invisible ink under lemon juice.
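For a rough sense of what "dominant nodes" means, here is a toy sketch in Python. It is not the method used in Professor Bol’s study, which the talk did not spell out in this detail. The correspondents and letters below are invented, and simply counting incoming letters is an assumption chosen as the simplest possible stand-in for a real network analysis.

```python
from collections import Counter

# Invented (sender, recipient) pairs standing in for a pile of letters.
letters = [
    ("clerk_a", "official_x"),
    ("clerk_b", "official_x"),
    ("scholar_c", "official_x"),
    ("official_y", "official_x"),
    ("clerk_a", "official_y"),
    ("scholar_d", "official_y"),
    ("scholar_c", "clerk_b"),
]

# Count how many letters each person received.
received = Counter(recipient for _, recipient in letters)

# The people who receive the most letters are candidate power centers.
hubs = received.most_common(2)
print(hubs)
```

The count says where the letters went. It says nothing about why, which is the part left to the humans.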
Text IQ utilizes unsupervised machine learning to reach multi-dimensional insights like these: not just the sensitive needle in the haystack, but the story of how it got there.