
Text-based AI models are vulnerable to paraphrasing attacks, researchers find




Thanks to advances in natural language processing (NLP), companies and organizations are increasingly putting AI algorithms in charge of carrying out text-related tasks such as filtering spam emails, analyzing the sentiment of social media posts and online reviews, evaluating resumes, and detecting fake news.

But how far can we trust these algorithms to perform their tasks reliably? New research by IBM, Amazon, and the University of Texas shows that, with the right tools, malicious actors can attack text-classification algorithms and manipulate their behavior in potentially harmful ways.

The research, being presented today at the SysML AI conference at Stanford, looks at “paraphrasing” attacks, in which input text is modified so that an AI algorithm classifies it differently, without its actual meaning changing.

To understand how a paraphrasing attack works, consider an AI algorithm that evaluates the text of an email message and classifies it as “spam” or “not spam.” A paraphrasing attack would modify the content of a spam message so that the AI classifies it as “not spam.” Meanwhile, to a human reader, the tampered message would have the same meaning as the original one.
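For intuition, here is a minimal, hypothetical sketch of that idea. The "spam filter" below is a toy stand-in (a keyword score), not one of the models attacked in the paper; the cue words and messages are made up. It only illustrates how a rewrite that reads the same to a human can flip a classifier's decision.

```python
# Toy illustration of a paraphrasing attack: a meaning-preserving rewrite
# flips the label of a simplistic keyword-based "spam filter".
# Everything here is a hypothetical stand-in for a real NLP model.

SPAM_CUES = {"winner": 2.0, "prize": 1.5, "claim": 1.0, "free": 1.0}

def toy_spam_score(text: str) -> float:
    """Sum keyword weights; a score of 2.0 or more is treated as 'spam'."""
    tokens = text.lower().split()
    return sum(SPAM_CUES.get(tok, 0.0) for tok in tokens)

original = "You are a winner claim your free prize now"
# Same message to a human reader, but the cue words have been paraphrased away.
paraphrased = "Congratulations you have been selected collect your complimentary reward now"

for msg in (original, paraphrased):
    label = "spam" if toy_spam_score(msg) >= 2.0 else "not spam"
    print(f"{label:>8}: {msg}")
```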


The challenges of adversarial attacks against text models

In the past few years, several research groups have explored aspects of adversarial attacks, input modifications meant to cause AI algorithms to misclassify images and audio samples while preserving their original appearance and sound to human eyes and ears. Paraphrasing attacks are the text equivalent of these. Attacking text models is much more difficult than tampering with computer vision and audio recognition algorithms.

“For audio and images you have full differentiability,” says Stephen Merity, an AI researcher and expert on language models. For instance, in an image classification algorithm, you can gradually change the color of pixels and observe how these modifications affect the output of the model. This can help researchers find the vulnerabilities in a model.

“Text is traditionally harder to attack. It’s discrete. You can’t say I want 10% more of the word ‘dog’ in this sentence. You either have the word ‘dog’ or you take it out. And you can’t efficiently search a model for vulnerabilities,” Merity says. “The idea is, can you intelligently work out where the machine is vulnerable, and nudge it in that specific spot?”

“For image and audio, it makes sense to do adversarial perturbations. For text, even if you make small changes to an excerpt — like a word or two — it might not read smoothly to humans,” says Pin-Yu Chen, researcher at IBM and co-author of the research paper being presented today.
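Merity's and Chen's points can be sketched roughly as follows. The arrays, gradients, and token edits below are illustrative placeholders, not the researchers' code: continuous inputs like pixels can be nudged by tiny amounts in the direction a gradient suggests, while text only admits whole-token edits that must be searched over.

```python
import numpy as np

# Continuous case (images): inputs can be nudged by arbitrarily small amounts,
# so gradient information tells the attacker which direction to push each pixel.
pixels = np.random.rand(8, 8)        # toy "image"
gradient = np.random.randn(8, 8)     # stand-in for d(loss)/d(pixels)
perturbed = np.clip(pixels + 0.01 * np.sign(gradient), 0.0, 1.0)  # small signed nudge

# Discrete case (text): there is no "10% more of the word 'dog'".
# The only moves are whole-token edits, so the attacker must search
# over discrete substitutions instead of following a gradient.
sentence = ["the", "dog", "barked", "loudly"]
candidate_edits = [
    sentence[:1] + ["puppy"] + sentence[2:],   # substitute a word
    sentence[:2] + sentence[3:],               # delete a word
]
print(perturbed.shape, candidate_edits)
```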

Creating paraphrasing examples

Past work on adversarial attacks against text models involved changing single words in sentences. While this approach succeeded in changing the output of the AI algorithm, it often resulted in modified sentences that sounded artificial. Chen and his colleagues focused not only on changing words but also on rephrasing sentences and changing longer sequences in ways that remain meaningful.

“We are paraphrasing words and sentences. This gives the attack a larger space by creating sequences that are semantically similar to the target sentence. We then see if the model classifies them like the original sentence,” Chen says.

The researchers have developed an algorithm to find optimal changes in a sentence that can manipulate the behavior of an NLP model. “The main constraint was to make sure that the modified version of the text was semantically similar to the original one. We developed an algorithm that searches a very large space for word and sentence paraphrasing modifications that will have the most impact on the output of the AI model. Finding the best adversarial example in that space is very time consuming. The algorithm is computationally efficient and also provides theoretical guarantees that it’s the best search you can find,” says Lingfei Wu, scientist at IBM Research and another co-author of the paper.
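The paper's exact algorithm isn't reproduced here, but the general recipe Wu describes — generate candidate paraphrases, keep only those that stay semantically close to the original, and pick the one that most shifts the model's output — might be sketched like this. All function names and parameters are hypothetical, and the paper's actual search is more sophisticated, with efficiency and optimality guarantees this greedy version lacks.

```python
from typing import Callable, Iterable

def paraphrase_attack(
    text: str,
    candidates: Iterable[str],                # candidate paraphrases of `text`
    model_prob: Callable[[str], float],       # target model's confidence in the original label
    similarity: Callable[[str, str], float],  # semantic-similarity score in [0, 1]
    sim_threshold: float = 0.9,
) -> str:
    """Greedy sketch: among candidates that stay semantically close to the
    original text, return the one that most lowers the model's confidence
    in the original label. Illustrative only."""
    best_text, best_prob = text, model_prob(text)
    for cand in candidates:
        if similarity(text, cand) < sim_threshold:
            continue  # reject rewrites that drift too far in meaning
        p = model_prob(cand)
        if p < best_prob:
            best_text, best_prob = cand, p
    return best_text
```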

In their paper, the researchers provide examples of modifications that change the behavior of sentiment analysis algorithms, fake news detectors, and spam filters. For instance, in a product review, by simply swapping the sentence “The pricing is also cheaper than some of the big name conglomerates out there” with “The price is cheaper than some of the big names below,” the sentiment of the review was changed from 100% positive to 100% negative.

Humans can’t see paraphrasing attacks

The key to the success of paraphrasing attacks is that they are imperceptible to humans, since they preserve the context and meaning of the original text.

“We gave the original paragraph and modified paragraph to human evaluators, and it was very hard for them to see differences in meaning. But for the machine, it was completely different,” Wu says.

Merity points out that paraphrasing attacks don’t need to be perfectly coherent to humans, especially when they’re not anticipating a bot tampering with the text. “Humans aren’t the correct level to try to detect these kinds of attacks, because they deal with faulty input every day. Except that for us, faulty input is just incoherent sentences from real people,” he says. “When people see typos right now, they don’t think it’s a security issue. But in the near future, it might be something we will have to contend with.”

Merity also points out that paraphrasing and adversarial attacks will give rise to a new trend in security risks. “A lot of tech companies rely on automated decisions to classify content, and there isn’t actually a human-to-human interaction involved. This makes the process vulnerable to such attacks,” Merity says. “It will run in parallel to data breaches, except that we’re going to find logic breaches.”

For instance, a person might fool a hate-speech classifier into approving their content, or exploit paraphrasing vulnerabilities in a resume-processing model to push their job application to the top of the list.

“These types of issues are going to be a new security era, and I’m worried companies will spend as little on this as they do on security, because they’re focused on automation and scalability,” Merity warns.

Putting the technology to good use

The researchers also discovered that by reversing paraphrasing attacks, they can build more robust and accurate models.

After generating paraphrased sentences that a model misclassifies, developers can retrain the model on those modified sentences paired with their correct labels. This makes the model more resilient against paraphrasing attacks. It also makes the model more accurate and improves its ability to generalize.
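A rough sketch of that retraining loop, with placeholder names for the model and the paraphrase generator (none of these are APIs from the paper), might look like this:

```python
# Hypothetical sketch of adversarial data augmentation with paraphrases:
# paraphrased examples the model gets wrong are folded back into the
# training set with their correct labels, then the model is retrained.

def augment_with_paraphrases(train_texts, train_labels, model, generate_paraphrases):
    aug_texts, aug_labels = list(train_texts), list(train_labels)
    for text, label in zip(train_texts, train_labels):
        for para in generate_paraphrases(text):
            if model.predict(para) != label:   # the paraphrase fools the model
                aug_texts.append(para)         # keep it, with the correct label
                aug_labels.append(label)
    return aug_texts, aug_labels

# Retraining on (aug_texts, aug_labels) should make the model harder to fool
# and, per the researchers, may also improve how well it generalizes.
```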

“This was one of the surprising findings we had in this project. Initially, we started with the angle of robustness. But we found out that this method not only improves robustness but also improves generalizability,” Wu says. “If instead of attacks, you just think about what is the best way to augment your model, paraphrasing is a very good generalization tool to increase the capability of your model.”

The researchers tested different word and sentence models before and after adversarial training, and in all cases they observed improvements in both performance and robustness against attacks.

Ben Dickson is a software engineer and the founder of TechTalks, a blog that explores the ways technology is solving and creating problems.