Adversarial Machine Learning

A short video on adversarial machine learning, produced as the first episode of the Center for Long-Term Cybersecurity’s “What? So What? Now What?” explainer video series. Animation by Annalise Kamegawa. Voiceover by Kathleen Dodge Doherty.


Recent years have seen a rapid increase in the use of machine learning, through which computers can be programmed to identify patterns in information and make increasingly accurate predictions over time. Machine learning is a key enabling technology behind artificial intelligence (AI), and is used for such valuable applications as email spam filters and malware detection, as well as more complex technologies like speech recognition, facial recognition, robotics, and self-driving cars.

While machine learning models have many potential benefits, they may be vulnerable to manipulation. Cybersecurity researchers refer to this risk as “adversarial machine learning,” as AI systems can be deceived (by attackers or “adversaries”) into making incorrect assessments. An adversarial attack might entail presenting a machine-learning model with inaccurate or misrepresentative data as it is training, or introducing maliciously designed data to deceive an already trained model into making errors.

“Machine learning has great power and promise to make our lives better in a lot of ways, but it introduces a new risk that wasn’t previously present, and we don’t have a handle on that,” says David Wagner, Professor of Computer Science at the University of California, Berkeley.

Some machine learning models already used in practical applications could be vulnerable to attack. For example, by placing a few small stickers on the ground in an intersection, researchers showed that they could cause a self-driving car to make an abnormal judgment and move into the opposite lane of traffic.

Other studies have shown that making imperceptible changes to an image can trick a medical imaging system into classifying a benign mole as malignant with 100% confidence, and that placing a few pieces of tape can deceive a computer vision system into wrongly classifying a stop sign as a speed limit sign.

Indeed, while much of the discussion around artificial intelligence has focused on the risks of bias (as the real-world data sets used to train the algorithms may reflect existing human prejudices), adversarial machine learning represents a different kind of challenge. As machine learning is adopted widely in business, transportation, the military, and other domains, attackers could use adversarial attacks for everything from insurance fraud to launching drone strikes on unintended targets.

Below is a brief overview of adversarial machine learning for policymakers, business leaders, and other stakeholders who may be involved in the development of machine learning systems, but who may not be aware of the potential for these systems to be manipulated or corrupted. A list of additional resources can be found at the conclusion of this article.

“What?”: Machine Learning 101

Machine learning models are computer programs that, in most cases, are designed to learn to recognize patterns in data. With help from humans supplying “training data,” algorithms known as “classifiers” can be taught how to respond to different inputs. Through repeated exposure to training data, these models are designed to make increasingly accurate assessments over time.

For example, by exposing a machine learning model to several pictures of blue objects — and pre-labeling them as “blue” — the classifier can begin to break down the unique characteristics that make the objects blue. Over time, the model “learns” to ascertain whether any other subsequent image is blue, with a degree of certainty ranging from 0% to 100%. The more data is fed into a machine-learning system, the better it learns — and the more accurate its predictions become, at least in theory. But this learning process can be unpredictable, particularly in “deep” neural networks.
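As a rough illustration of the “blue” example, the sketch below trains a very simple classifier on a handful of pre-labeled colors and then reports a confidence between 0 and 1 for new inputs. The training colors, the function names, and the centroid-based scoring rule are all assumptions chosen for brevity; real systems use far larger data sets and more sophisticated models.

```python
import math

# Hypothetical training data: (R, G, B) colors pre-labeled as blue (True) or not (False).
TRAINING_DATA = [
    ((10, 20, 240), True),
    ((30, 60, 200), True),
    ((0, 0, 180), True),
    ((220, 30, 20), False),
    ((40, 200, 50), False),
    ((240, 230, 60), False),
]

def centroid(points):
    """Average color of a list of (R, G, B) tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))

def train(data):
    """'Learning' here just means remembering the average blue and non-blue color."""
    blue = [color for color, is_blue in data if is_blue]
    other = [color for color, is_blue in data if not is_blue]
    return centroid(blue), centroid(other)

def blue_confidence(model, color):
    """Return a value in [0, 1]; closer to 1 means 'more likely blue'."""
    blue_center, other_center = model
    d_blue = math.dist(color, blue_center)
    d_other = math.dist(color, other_center)
    if d_blue + d_other == 0:
        return 0.5
    return d_other / (d_blue + d_other)

model = train(TRAINING_DATA)
print(blue_confidence(model, (20, 40, 220)))   # a bluish color scores above 0.5
print(blue_confidence(model, (230, 40, 30)))   # a reddish color scores below 0.5
```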

What are deep neural networks?

A neural network is a particular type of machine learning model loosely inspired by the biology of the human brain. “Deep” neural networks are composed of many decision-making layers that operate in sequence. Deep neural networks have proliferated in recent years, and their usage has led to major advances in the effectiveness of machine learning.

Yet the calculations that computers make within deep neural networks are highly complex and evolve rapidly as the “deep learning” process unfolds. In neural networks with a large number of layers, the calculations that lead to a given decision in some cases cannot be interpreted by humans: the process cannot be observed in real time, nor can the decision-making logic be analyzed after the fact.

A machine-learning system may be using different parameters to classify than can be intuitively understood by a human, so it looks like a “black box.” In addition, small manipulations to the data can have an outsized impact on the decision made by the neural network. That makes these systems vulnerable to manipulation, including through deliberate “adversarial attacks.”
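To see how small manipulations can add up, consider the toy sketch below. It uses a made-up linear scoring model with 100 input features rather than a real neural network, and every weight and input value is an assumption. Each feature is changed by at most 0.01, yet the many tiny changes all push the score the same way and the decision flips.

```python
# Toy linear scorer: a positive score means "class A", otherwise "class B".
# All weights and inputs are invented for illustration.
WEIGHTS = [1.0, -1.0] * 50   # 100 input features
BIAS = 0.2
EPSILON = 0.01               # largest change allowed to any single feature

def score(x):
    return sum(w * xi for w, xi in zip(WEIGHTS, x)) + BIAS

def perturb(x, epsilon):
    """Nudge every feature by epsilon in whichever direction lowers the score."""
    return [xi - epsilon if w > 0 else xi + epsilon for w, xi in zip(WEIGHTS, x)]

original = [0.5] * 100
adversarial = perturb(original, EPSILON)

print(score(original))      # positive: classified as "class A"
print(score(adversarial))   # negative: classified as "class B"
```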

What are adversarial attacks?

The term “adversary” is used in the field of computer security to describe people or machines that may attempt to penetrate or corrupt a computer network or program. Adversaries can use a variety of attack methods to disrupt a machine learning model, either during the training phase (called a “poisoning” attack) or after the classifier has already been trained (an “evasion” attack).

Poisoning Attacks

Attacks on machine-learning systems during the training phase are often referred to as “poisoning” or “contaminating.” In these cases, an adversary presents incorrectly labeled data to a classifier, causing the system to make skewed or inaccurate decisions in the future. Poisoning attacks require that an adversary has a degree of control over training data.

“Some of the poisoned data can be very subtle, and it’s difficult for a human to detect when data have been poisoned,” says Dawn Song, Professor of Computer Science at UC Berkeley. “We’ve done research demonstrating a ‘back-door attack,’ where the model is accurate for most normal inputs, but it can be trained to behave wrongly on specific types of inputs. It’s very difficult to detect when a model has learned such behaviors and what kinds of inputs will trigger a model to behave wrongly. This makes it very hard to detect.”

A poisoning attack may use a “boy who cried wolf” approach, i.e. an adversary might input data during the training phase that is falsely labeled as harmless, when it is actually malicious. “The idea is that an attacker will slowly put in instances that will cause some type of misclassification of input data and cause an erroneous result,” explained Doug Tygar, Professor of Computer Science and Information Management at UC Berkeley, in a 2018 presentation. “Adversaries can be patient in setting up their attacks and they can adapt their behavior.”
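The effect of falsely labeled training data can be sketched with a deliberately simple detector. In the hypothetical example below, each sample is a single “suspiciousness” number, and the detector places its cutoff halfway between the average benign and the average malicious sample. The data, labels, and function names are assumptions; the point is only that a few mislabeled samples can shift the cutoff enough to let a borderline malicious input through.

```python
def train_threshold(samples):
    """Place the cutoff midway between the mean benign and mean malicious value."""
    benign = [x for x, label in samples if label == "benign"]
    malicious = [x for x, label in samples if label == "malicious"]
    return (sum(benign) / len(benign) + sum(malicious) / len(malicious)) / 2

def classify(threshold, x):
    return "malicious" if x > threshold else "benign"

# Hypothetical clean training data: (suspiciousness, label).
clean = [(1, "benign"), (2, "benign"), (3, "benign"),
         (7, "malicious"), (8, "malicious"), (9, "malicious")]

# The attacker slips in malicious-looking samples that are falsely labeled benign.
poison = [(6, "benign"), (6.5, "benign"), (7, "benign")]

clean_threshold = train_threshold(clean)              # 5.0
poisoned_threshold = train_threshold(clean + poison)  # 6.125

print(classify(clean_threshold, 5.8))      # malicious
print(classify(poisoned_threshold, 5.8))   # benign
```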

Example: Poisoning a Chatbot

In 2016, Microsoft launched “Tay,” a Twitter chat bot programmed to learn to engage in conversation through repeated interactions with other users. While Microsoft’s intention was that Tay would engage in “casual and playful conversation,” internet trolls noticed the system had insufficient filters and began to feed profane and offensive tweets into Tay’s machine learning algorithm. The more these users engaged, the more offensive Tay’s tweets became. Microsoft shut the AI bot down just 16 hours after its launch.

Evasion Attacks

Evasion attacks generally take place after a machine learning system has already been trained; they occur when a model is calculating a probability around a new data input. These attacks are often developed by trial and error, as researchers (or adversaries) do not always know in advance what data manipulations will “break” a machine learning model.

For example, if attackers wanted to probe the boundaries of a machine learning model designed to filter out spam emails, they might experiment with sending different emails to see what gets through. If a model has been trained to screen for certain words (like “Viagra”) but to make exceptions for emails that contain a certain number of other words, an attacker might craft an email that includes enough extraneous words to “tip” the algorithm (i.e. to move it from being classified as “spam” to “not spam”), thus bypassing the filter.
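The spam scenario above can be sketched in a few lines. The filter below is a hypothetical one that adds up per-word weights and flags an email once the total reaches a threshold; the weights, threshold, and word list are invented for illustration. The attacker never sees the weights and simply appends words one at a time, checking after each attempt whether the email gets through.

```python
# Hypothetical filter: positive weights suggest spam, negative weights suggest
# legitimate mail. Words not listed have no effect on the score.
WORD_WEIGHTS = {"viagra": 3.0, "meeting": -0.5, "schedule": -0.5, "attached": -0.5}
SPAM_THRESHOLD = 2.0

def spam_score(email):
    return sum(WORD_WEIGHTS.get(word, 0.0) for word in email.lower().split())

def is_spam(email):
    return spam_score(email) >= SPAM_THRESHOLD

def probe_filter(base_email, candidate_words):
    """Trial and error: append words one at a time until the filter lets the email through."""
    email = base_email
    for word in candidate_words:
        if not is_spam(email):
            break
        email = email + " " + word
    return email

evasive = probe_filter("buy viagra now",
                       ["hello", "meeting", "schedule", "attached", "regards"])
print(evasive)           # buy viagra now hello meeting schedule attached
print(is_spam(evasive))  # False
```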

Some attacks may be designed to affect the integrity of a machine learning model, leading it to output an incorrect result or produce a specific outcome that is intended by an attacker. Other adversarial attacks could aim at the confidentiality of a system, and cause an AI-based model to reveal private or sensitive information. For example, Professor Dawn Song and her colleagues demonstrated that they could extract social security numbers from a language processing model that had been trained with a large volume of emails, some of which contained this sensitive personal information.
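A confidentiality leak of this kind can be illustrated with a toy text-completion model. The sketch below is not the method used in the research described above; it is a minimal stand-in that records which word followed each pair of words in its training text and completes a prompt with the most common continuation. The training sentences and the placeholder number are invented. Because the model has memorized its training data, a suitable prompt makes it repeat the sensitive value.

```python
from collections import Counter, defaultdict

def train(corpus):
    """Count which word follows each pair of consecutive words."""
    table = defaultdict(Counter)
    for line in corpus:
        words = line.split()
        for a, b, c in zip(words, words[1:], words[2:]):
            table[(a, b)][c] += 1
    return table

def complete(table, prompt, max_words=3):
    """Extend the prompt with the most frequent continuation, one word at a time."""
    words = prompt.split()
    for _ in range(max_words):
        context = tuple(words[-2:])
        if context not in table:
            break
        words.append(table[context].most_common(1)[0][0])
    return " ".join(words)

# Hypothetical training emails; the number is a placeholder, not a real identifier.
emails = [
    "please send the report by friday",
    "my social security number is 000-00-0000 thanks",
    "the meeting is at noon",
]

table = train(emails)
leaked = complete(table, "my social security number is", max_words=1)
print(leaked)   # my social security number is 000-00-0000
```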

“So What?”: Risks of Adversarial Machine Learning

Outside of research laboratories, adversarial attacks thus far have been uncommon. But cybersecurity researchers are concerned that adversarial attacks could become a serious problem in the future as machine learning is integrated into a broader array of systems, including self-driving cars and other technologies where human lives could be at risk.

“This is not something the bad guys are exploiting today, but it’s important enough that we want to get ahead of this problem,” says David Wagner. “If we embed machine learning into our life and infrastructure without having a handle on this, we might be creating a big vulnerability that a future generation is going to have to deal with.”

“Now What?”: Mitigating Adversarial Attacks

What can be done to limit or prevent adversarial machine learning? Cybersecurity researchers have been busy trying to address this problem, and hundreds of papers have been published since the field of adversarial machine learning came to the research community’s attention a few years ago.

Part of the challenge is that many machine learning systems are “black boxes” whose logic is largely inscrutable not only to the models’ designers, but also to would-be hackers. Adding to the challenge, attackers only need to find one crack in a system’s defenses for an adversarial attack to go through.

“Lots of people have come up with solutions that looked promising at first, but so far nothing seems to work,” says Wagner. “There are one or two things that help a lot but they’re not a complete solution.”

One potential approach for improving the robustness of machine learning is to generate a range of attacks against a system ahead of time, and to train the system to learn what an adversarial attack might look like, similar to building up its “immune system.” While this approach, known as adversarial training, has some benefits, it is overall insufficient to stop all attacks, as the range of possible attacks is too large and cannot be generated in advance.
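A minimal sketch of this idea, under simplifying assumptions, is shown below. It uses a hypothetical word-count spam filter, generates attacked copies of the known spam emails ahead of time (padded with innocent-looking words), and retrains with those copies correctly labeled as spam. The emails, word list, and scoring rule are invented for illustration. The retrained filter catches the attack it was shown, but a stronger variant that it never saw still gets through, which mirrors the limitation described above.

```python
import math
from collections import Counter

GOOD_WORDS = ["meeting", "schedule", "attached"]  # padding the attacker is assumed to use

def train(spam_emails, ham_emails):
    """Learn one weight per word: positive leans spam, negative leans legitimate."""
    spam_counts = Counter(w for e in spam_emails for w in e.split())
    ham_counts = Counter(w for e in ham_emails for w in e.split())
    vocab = set(spam_counts) | set(ham_counts)
    return {w: math.log((spam_counts[w] + 1) / (ham_counts[w] + 1)) for w in vocab}

def is_spam(weights, email):
    return sum(weights.get(w, 0.0) for w in email.split()) > 0

def attack(email):
    """Pad an email with innocent-looking words."""
    return email + " " + " ".join(GOOD_WORDS)

spam = ["viagra offer", "viagra cheap"]
ham = ["meeting schedule", "schedule attached",
       "meeting attached", "meeting schedule attached"]

baseline = train(spam, ham)
# Adversarial training: add attacked copies of the spam, still labeled as spam.
hardened = train(spam + [attack(e) for e in spam], ham)

probe = attack("viagra")       # the attack generated in advance
stronger = attack(probe)       # a heavier variant that was never generated

print(is_spam(baseline, probe))      # False: the attack evades the original filter
print(is_spam(hardened, probe))      # True: the retrained filter catches it
print(is_spam(hardened, stronger))   # False: an unseen variant still evades
```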

Another possible defense lies in continually altering the algorithms that a machine learning model uses to classify data, i.e. creating a “moving target” by keeping the algorithms secret and changing the model on an occasional basis. As a different tactic, researchers from Harvard who examined the risks of adversarial attacks on medical imaging software proposed that a “fingerprint” hash of the data might be “extracted and stored at the moment of capture,” then compared to the data fed through the algorithm.
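The fingerprint idea can be sketched with a standard cryptographic hash. The example below assumes SHA-256 and uses made-up byte strings in place of real image data; the function names are hypothetical, and the researchers’ proposal does not specify a particular hash or storage scheme. A hash is computed when the data is captured and checked again just before the data is passed to the model, so any alteration in between, however small, is detected.

```python
import hashlib
import hmac

def fingerprint(data: bytes) -> str:
    """Compute a SHA-256 fingerprint of the raw data."""
    return hashlib.sha256(data).hexdigest()

def is_unmodified(data: bytes, stored_fingerprint: str) -> bool:
    """Check the data against the fingerprint stored at the moment of capture."""
    return hmac.compare_digest(fingerprint(data), stored_fingerprint)

# Placeholder bytes standing in for a captured medical image.
captured = b"pixel data as recorded by the scanner"
stored = fingerprint(captured)           # saved at the moment of capture

tampered = captured.replace(b"scanner", b"scanneR")   # a one-character change

print(is_unmodified(captured, stored))   # True
print(is_unmodified(tampered, stored))   # False
```

This check only shows that the data changed after capture. It does not help if the input was manipulated before it was captured, for example by altering the physical object being photographed.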

Most importantly, developers of machine learning systems should be aware of the potential risks associated with these systems, and put in place systems for cross-checking and verifying information. They should also regularly attempt to break their own models and identify as many potential weaknesses as possible. They can also focus on developing methods for understanding how neural networks make decisions (and translating findings to users).

“Be aware of the shortcomings and don’t blindly believe the results, especially if the result is something you do not necessarily trust yourself,” says Sadia Afroz, Senior Researcher at the International Computer Science Institute. “When you are giving a decision, show at least some understanding of why this particular decision has been made, so maybe a human can look at this decision process and figure out, does this make sense or does it not? If you don’t understand how these models are making decisions and how is it processing the data and making decisions, it opens you up to adversarial attacks. Anyone can manipulate the decision-making process and cause problems.”

Additional Resources

Some additional resources for learning about AI and adversarial machine learning.

Adversarial Attacks on Medical AI Systems: Overview of March 2019 paper published in Science by researchers from Harvard and MIT, including an overview of how medical AI systems could be vulnerable to adversarial attacks.

Adversarial Machine Learning: A recently published textbook by Anthony D. Joseph, Blaine Nelson, Benjamin I.P. Rubinstein, and J.D. Tygar.

AI Now Institute: An interdisciplinary research center at New York University dedicated to understanding the social implications of artificial intelligence.

Attacking Artificial Intelligence: AI’s Security Vulnerability and What Policymakers Can Do About It: A relevant report by Marcus Comiter from the Belfer Center for Science and International Affairs, Harvard Kennedy School.

CleverHans: Compiled by TensorFlow, CleverHans is an adversarial example library for “constructing attacks, building defenses, and benchmarking both.”

Google AI: A trove of resources for learning about AI and machine learning.

Malicious Use of Artificial Intelligence: A report written by 26 authors from 14 institutions, spanning academia, civil society, and industry.

Presentation on Adversarial Machine Learning: 2018 presentation by Ian Goodfellow, a staff research scientist at Google Brain, on adversarial techniques in AI.

Skymind AI Wiki: A Beginner’s Guide to Important Topics in AI, Machine Learning, and Deep Learning.

Unrestricted Adversarial Examples Contest: Sponsored by Google Brain, this was “a community-based challenge to incentivize and measure progress towards the goal of zero confident classification errors in machine learning models.”

Wild Patterns: Ten Years After the Rise of Adversarial Machine Learning: An overview of the evolution of adversarial machine learning by Battista Biggio and Fabio Roli from the University of Cagliari, Italy.