How ML Model Data Poisoning Works in 5 Minutes

Training data poisoning means injecting malicious data into a model's training set. This article covers attack scenarios, previous successful attacks, and prevention mechanisms, with examples.

sreedeep

Mar 24, 2024 — 8 min read

Training data poisoning is an attack that targets machine learning models, including LLMs (Large Language Models). The attacker injects harmful data into the training set, which corrupts the model and causes it to produce wrong results.

If this harmful data is not properly detected or flagged, then the model will be vulnerable to attacks and poisoned information may be surfaced to users.

It compromises both the model and the application built on it. The risks include performance degradation, downstream software exploitation, and reputational damage.

Why data poisoning is problematic

The following three real-world incidents show how data poisoning plays out in practice.

Case study of Microsoft Chatbot

In 2016, Microsoft released a chatbot named Tay on Twitter, designed to learn from its interactions with users. Shortly after the release, it started behaving erratically.

It began using vulgar language and making hateful comments, because users deliberately fed it such content. This was one of the first widely publicized incidents of data poisoning.

Artists vs. image generators

Artists worldwide have been affected by the rise of image-generation models, which are trained on artwork scraped from different websites.

The artist community started taking measures against AI models copying their original work.
Nightshade is a tool made by researchers at the University of Chicago that lets artists add invisible changes to the pixels of their art before uploading it online.

If such an image gets scraped into an AI training set, it can cause the resulting model to break in unpredictable ways.
The tool crafts images that have the maximum effect on the model, so fewer poisoned images are needed in total.

For example, suppose you want cat images to be labeled as dogs. You first prompt the model with something simple, like "an image of a cat". The image it creates will be a very typical representation of what the model understands a cat to be.

If this image is seen in training, it has a very strong influence on the model's concept of a cat, much stronger than an atypical cat image would have. Hence, poisoning that image has a very large effect on the model's training.

VirusTotal incident

There have been instances where big companies were affected by a data poisoning attack. VirusTotal analyzes suspicious files and URLs to detect malware and other malicious content.

In this incident, an attacker used metame, a metamorphic code engine designed for arbitrary executables, to create "mutant" variants of a malware sample from a prevalent ransomware family.

These "mutant" samples were then uploaded to the platform, leading several vendors to classify them as part of the ransomware family, even though many of them could not execute.

This effectively poisoned the dataset used by an ML model to identify and classify the ransomware family, resulting in misclassifications.

Different attack scenarios

Data poisoning against large language models is caused by adding wrong data during the training stages. This causes the model to generate wrong results, as we saw in the examples above.

These attacks are stealthy and very difficult to fix, and it is hard to trace them back to the root cause, because models are repeatedly retrained with newly collected data, depending on their intended use and their owner's preference.

Since poisoning usually happens over time, and over some number of training cycles, it is hard to tell when prediction accuracy has started to shift.
Reverting the poisoning effects would require a time-consuming historical analysis of inputs for the affected class to identify all the bad data samples and remove them.

A version of the model from before the attack started would need to be retrained. When dealing with large quantities of data and a large number of attacks, however, retraining in such a way is simply not feasible and the models never get fixed.
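
To make the "identify the bad samples" step more concrete, here is a minimal sketch of one possible heuristic: flag every sample whose label disagrees with the majority label of its nearest neighbours. The function name suspicious_labels, the choice of k, and the toy 2-D points are assumptions made for illustration. The check compares every pair of samples, so it only suits small datasets or a shortlist for manual review.

```python
from collections import Counter

def suspicious_labels(points, labels, k=3):
    """Return indices of samples whose label differs from the majority
    label of their k nearest neighbours (squared Euclidean distance)."""
    flagged = []
    for i, p in enumerate(points):
        # Distance from sample i to every other sample, nearest first.
        dists = sorted(
            (sum((a - b) ** 2 for a, b in zip(p, q)), j)
            for j, q in enumerate(points) if j != i
        )
        neighbour_labels = [labels[j] for _, j in dists[:k]]
        majority, _ = Counter(neighbour_labels).most_common(1)[0]
        if majority != labels[i]:
            flagged.append(i)
    return flagged

# Toy data: two tight clusters; sample 3 sits among the '1's but is labeled '2'.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.1, 5.1)]
labels = ['1', '1', '1', '2', '7', '7', '7', '7']
print(suspicious_labels(points, labels))  # [3]
```

A flagged index is only a candidate for review, not proof of poisoning: samples that are hard to classify for legitimate reasons will be flagged too.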

This vulnerability is ranked third (LLM03: Training Data Poisoning) in the OWASP Top 10 for Large Language Model Applications.

The impacts of this vulnerability are:

Degrading Performance: Reduced accuracy across various tasks as the model's internal logic is corrupted.

Bias and Discrimination: Poisoned data can skew the model's results, potentially leading to discriminatory or harmful output.

Embedded Backdoors: For targeted attacks, hidden triggers can be introduced, making the LLM produce a specific incorrect response whenever that trigger is presented. Triggers are covered in more detail in the next section.

Backdoors

Backdoor attacks cause a model to misclassify inputs that contain a trigger. A trigger is a visual feature in images or a particular character sequence in natural language text.

For example, one might tamper with training images so that a vision system fails to identify any person wearing a shirt with the trigger symbol printed on it.

In this threat model, the attacker modifies data at both training time (by placing poisons) and at inference time (by inserting the trigger).

The trigger is a pattern that is easily applied to any input, e.g., a small patch or sticker in the case of images, or a specific phrase in the case of natural language processing.

In the accompanying image, the small box under the STOP sign is the trigger.
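
The training-time half of this threat model can be sketched in a few lines. The example below is a toy illustration, not a reproduction of any published attack: it stamps a small square patch into the corner of a fraction of the images and relabels them with the attacker's target class. The function names stamp_trigger and add_backdoor, the patch size and position, and the blank 8x8 "images" are all assumptions.

```python
import random

def stamp_trigger(image, size=3, value=255):
    """Return a copy of a 2-D image (list of rows) with a size x size
    patch of `value` pixels stamped into the bottom-right corner."""
    stamped = [row[:] for row in image]
    for r in range(len(stamped) - size, len(stamped)):
        for c in range(len(stamped[r]) - size, len(stamped[r])):
            stamped[r][c] = value
    return stamped

def add_backdoor(images, labels, target_label, fraction=0.05, seed=0):
    """Stamp the trigger on a random fraction of samples and relabel them
    as `target_label`. Returns new images, new labels, poisoned indices."""
    rng = random.Random(seed)
    n_poison = int(len(images) * fraction)
    chosen = set(rng.sample(range(len(images)), n_poison))
    new_images, new_labels = [], []
    for i, (img, lab) in enumerate(zip(images, labels)):
        if i in chosen:
            new_images.append(stamp_trigger(img))
            new_labels.append(target_label)
        else:
            new_images.append([row[:] for row in img])
            new_labels.append(lab)
    return new_images, new_labels, sorted(chosen)

# Toy data: twenty blank 8x8 images, all labeled '1'.
images = [[[0] * 8 for _ in range(8)] for _ in range(20)]
labels = ['1'] * 20
poisoned_images, poisoned_labels, poisoned_idx = add_backdoor(
    images, labels, target_label='7', fraction=0.1)
print(len(poisoned_idx), "samples now carry the trigger")
```

At inference time the attacker would apply the same stamp_trigger to an input, hoping that a model trained on the poisoned set has learned to associate the patch with the target label.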

Prevent and detect

Developers need to focus on measures that can either block attack attempts or detect malicious inputs before the next training cycle happens, such as:

Input validity checking: Implement data validation checks and employ multiple data labelers to validate the accuracy of the data labeling.
Anomaly detection: Use anomaly detection techniques to spot abnormal behavior in the training data, such as sudden changes in the data distribution or data labeling, so that data poisoning attacks are detected early on.
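
As one simple illustration of the anomaly detection idea, the sketch below compares the share of each label in a trusted baseline against a newly collected batch and flags labels whose share moved by more than a threshold. The function names and the max_shift value are assumptions chosen for this example. A check like this can only catch attacks that change label proportions, so it is a first filter rather than a complete defense.

```python
from collections import Counter

def label_shares(labels):
    """Map each label to its fraction of the dataset."""
    counts = Counter(labels)
    total = len(labels)
    return {label: n / total for label, n in counts.items()}

def flag_label_shift(baseline_labels, new_labels, max_shift=0.03):
    """Return {label: shift} for labels whose share changed by more
    than `max_shift` between the baseline and the new batch."""
    base = label_shares(baseline_labels)
    new = label_shares(new_labels)
    flagged = {}
    for label in sorted(set(base) | set(new)):
        shift = new.get(label, 0.0) - base.get(label, 0.0)
        if abs(shift) > max_shift:
            flagged[label] = round(shift, 3)
    return flagged

# Toy data: 40 of the '1' labels in the new batch have turned into '2'.
baseline = ['0'] * 100 + ['1'] * 100 + ['2'] * 100
incoming = ['0'] * 100 + ['1'] * 60 + ['2'] * 140
print(flag_label_shift(baseline, incoming))  # {'1': -0.133, '2': 0.133}
```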

Developers need to limit the public release of technical project details including data, algorithms, model architectures, and model checkpoints that are used in production.

Establish access controls on internal model registries and limit internal access to production models. Limit access to training data only to approved users.

For example, restrictions can be placed on how many inputs from a single user are accepted into the training data, or on how much weight they carry. Newly trained classifiers can also be compared against previous ones using dark launches, that is, rolling them out to only a small subset of users and comparing their outputs.
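
Both ideas fit in a few lines of code. The sketch below is hypothetical: cap_per_user limits how many samples any single user can contribute, and disagreement_rate measures how often a newly trained model disagrees with the previous one on the same inputs, which is the kind of signal a dark launch could monitor. The names, the (user_id, features, label) tuple layout, and the cap of 3 are assumptions.

```python
from collections import defaultdict

def cap_per_user(samples, max_per_user=3):
    """Keep at most `max_per_user` samples from each user, in arrival
    order. Each sample is a (user_id, features, label) tuple."""
    accepted = []
    seen = defaultdict(int)
    for user_id, features, label in samples:
        if seen[user_id] < max_per_user:
            accepted.append((user_id, features, label))
            seen[user_id] += 1
    return accepted

def disagreement_rate(old_predictions, new_predictions):
    """Fraction of inputs on which two model versions disagree."""
    if len(old_predictions) != len(new_predictions):
        raise ValueError("prediction lists must have the same length")
    if not old_predictions:
        return 0.0
    differing = sum(o != n for o, n in zip(old_predictions, new_predictions))
    return differing / len(old_predictions)

# One user floods the pipeline with ten identical samples.
submissions = [("mallory", [0.1], "2")] * 10 + [
    ("alice", [0.4], "1"), ("bob", [0.9], "7")]
accepted = cap_per_user(submissions, max_per_user=3)
print(len(accepted))  # 5

old = ["1", "1", "7", "2"]
new = ["2", "1", "7", "2"]
print(disagreement_rate(old, new))  # 0.25
```

A sudden jump in the disagreement rate after retraining does not prove poisoning, but it is a cheap signal that the new training data deserves a closer look before a full rollout.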

To perform data poisoning, attackers also need to gain information about how the model works, so it’s important to leak as little information as possible and have strong access controls in place for both the model and the training data.

In the attacker's shoes

Number predictor poisoning

We use a training set of 2000 images and a testing set of 1000 images. Each image is a handwritten digit from 0 through 9, and the task is to identify the digit.

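from sklearn.datasets import fetch_openml
from sklearn.svm import SVC
import numpy

# Fetch the original MNIST data: 70,000 handwritten digits,
# each a 28x28 image flattened to 784 features
mnist = fetch_openml('mnist_784', as_frame=False, parser="auto")
print(mnist.DESCR)
# X holds the pixel values, y holds the labels as strings ('0' to '9')
X, y = mnist.data, mnist.target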

Prepare smaller training and testing sets. The first slice takes 2000 samples for the training set, while the second takes 1000 samples, starting at index 60000, for the testing set.

# First 2000 samples for training
X_train_orig = X[:2000]
y_train_orig = y[:2000]

# 1000 samples starting at index 60000 for testing
X_test_orig = X[60000:61000]
y_test_orig = y[60000:61000]

This code snippet creates a Support Vector Machine (SVM) classifier (svm_clf) using scikit-learn's SVC class with a specified random state. It then trains the SVM classifier using the training data.

svm_clf = SVC(random_state=42)
svm_clf.fit(X_train_orig, y_train_orig)

# Print results
print("Correct\tPredicted")
for i in range(5):
  print(y_test_orig[i], "\t", svm_clf.predict([X_test_orig[i]])[0])

After training, the snippet predicts the labels for the first 5 samples of the testing set (X_test_orig) and prints them next to the correct labels (y_test_orig).

Poisoning 100 of the 1's

Execute these commands to prepare y_train_poisoned, a copy of the training labels in which the first 100 1's will be changed to 2's, and to count how often each digit occurs before poisoning:


y_train_poisoned = y_train_orig.copy()

count_before = [0,0,0,0,0,0,0,0,0,0]
count_after  = [0,0,0,0,0,0,0,0,0,0]

for yi in y_train_poisoned:
  count_before[int(yi)] += 1

The next snippet modifies the y_train_poisoned array by changing the label '1' to '2' for the first 100 occurrences of '1'. It keeps track of the number of poisoned labels in the number_poisoned variable and breaks out of the loop once 100 labels have been poisoned.


number_poisoned = 0
for i in range(2000):
  if y_train_poisoned[i] == '1':
    y_train_poisoned[i] = '2'
    number_poisoned += 1
  if number_poisoned == 100:
    break

The next snippet calculates and prints the count of each digit (0-9) in the original training labels (count_before) and in the poisoned training labels (count_after).


for yi in y_train_poisoned:
  count_after[int(yi)] += 1

print("Correct\tBefore\tAfter")
for i in range(10):
  print(i, "\t", count_before[i], "\t", count_after[i])
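
The label-flipping steps above can also be wrapped in a small reusable helper. This is a sketch that works on any sequence of labels; the function name flip_labels and its parameters are made up for illustration and are not part of the original walkthrough.

```python
def flip_labels(labels, source, target, limit):
    """Return a copy of `labels` with the first `limit` occurrences of
    `source` replaced by `target`, plus the indices that were changed."""
    poisoned = list(labels)
    flipped = []
    for i, label in enumerate(poisoned):
        if len(flipped) == limit:
            break
        if label == source:
            poisoned[i] = target
            flipped.append(i)
    return poisoned, flipped

toy = ['1', '0', '1', '2', '1', '1']
poisoned, flipped = flip_labels(toy, source='1', target='2', limit=3)
print(poisoned)  # ['2', '0', '2', '2', '2', '1']
print(flipped)   # [0, 2, 4]
```

With the arrays from this walkthrough, the equivalent call would be flip_labels(y_train_orig, '1', '2', 100). Returning the flipped indices makes it easy to verify afterwards exactly which samples were poisoned.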

Training from the Poisoned Data

Execute these commands to create and train a model on the poisoned data, and display its predictions.

This creates another SVM classifier (svm_clf_poisoned) using scikit-learn's SVC class with the same random state, trains it on the poisoned training labels, and initializes per-digit counters for correct and incorrect predictions.

svm_clf_poisoned = SVC(random_state=42)
svm_clf_poisoned.fit(X_train_orig, y_train_poisoned)

prediction_correct = [0,0,0,0,0,0,0,0,0,0]
prediction_incorrect = [0,0,0,0,0,0,0,0,0,0]

The next snippet predicts the labels for the 1000 samples of the testing set using the SVM classifier trained on the poisoned labels. It then updates the prediction_correct and prediction_incorrect lists depending on whether the predicted label matches the actual label. Note that both lists are indexed by the predicted label.

In the results, we see that the poisoning was successful: 33 test samples are incorrectly predicted as '2'.

for i in range(1000):
  p = svm_clf_poisoned.predict([X_test_orig[i]])[0]
  if p == y_test_orig[i]:
    prediction_correct[int(p)] += 1
  else:
    prediction_incorrect[int(p)] += 1
print("Value\tCorrect\tIncorrect")
for i in range(10):
  print(i, "\t", prediction_correct[i], "\t", prediction_incorrect[i])
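
Because the table above is indexed by the predicted label, it shows how many predictions of each digit were wrong but not which true digits they came from. A minimal sketch of a complementary view, using hypothetical helper names and toy labels rather than the real MNIST results, counts each (true, predicted) pair among the misclassified samples:

```python
from collections import Counter

def confusion_pairs(y_true, y_pred):
    """Count (true, predicted) pairs for misclassified samples only."""
    return Counter((t, p) for t, p in zip(y_true, y_pred) if t != p)

def accuracy(y_true, y_pred):
    """Fraction of samples whose prediction matches the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Toy labels imitating a model that has learned to call some 1's a 2.
y_true = ['1', '1', '1', '2', '7', '1']
y_pred = ['2', '1', '2', '2', '7', '2']
pairs = confusion_pairs(y_true, y_pred)
print(pairs.most_common(1))      # [(('1', '2'), 3)]
print(accuracy(y_true, y_pred))  # 0.5
```

Applied to the predictions of the poisoned classifier, a dominant ('1', '2') pair would confirm that the extra errors come from 1's being pushed into the '2' class.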

Resource source: here

Conclusion

That concludes our basic description of data poisoning attacks. You can find a collection of papers and resources on this topic in this GitHub repo: Awesome LLM Security
