Protecting AI Models from Poisoning and Evasion Attacks

Introduction

Artificial Intelligence (AI) and Machine Learning (ML) models have become integral to various applications, from image recognition to natural language processing. However, these models are vulnerable to adversarial attacks that can compromise their integrity and performance. Two prominent types of such attacks are poisoning attacks, which target the training phase by injecting malicious data, and evasion attacks, which manipulate inputs during inference to deceive the model.

Poisoning attacks aim to degrade model behavior or introduce biases by corrupting the training dataset, potentially leading to real-world harm such as biased decision-making or security breaches. Evasion attacks, on the other hand, craft subtle perturbations to inputs that cause misclassification without altering the training process. This article explores both classes of attack and outlines strategies to protect AI models, with code samples for practical implementation.


Understanding Poisoning Attacks

Data poisoning involves an adversary inserting corrupted or biased data into the model's training, fine-tuning, or retrieval processes. This can be done covertly, such as through supply chain compromises or unverified data sources, and is particularly concerning for large language models (LLMs) and other AI systems reliant on vast datasets.

Types of poisoning include:

  • Label-flipping attacks: Changing labels of training samples to mislead the model.
  • Backdoor attacks: Embedding triggers that activate malicious behavior.
  • Beta poisoning: A specific form where poisons are centered near the target class mean.

Without defenses, poisoned models can exacerbate biases or fail outright in critical domains such as military AI or healthcare.
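
To make these threats concrete, the short sketch below simulates a label-flipping attack on scikit-learn's digits dataset (chosen purely for illustration) and compares a classifier trained on clean labels with one trained on partially flipped labels:

import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Small benchmark dataset used purely for illustration
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Simulate a label-flipping attack: reassign 20% of training labels at random
rng = np.random.default_rng(0)
n_flip = int(0.2 * len(y_train))
flip_idx = rng.choice(len(y_train), n_flip, replace=False)
y_poisoned = y_train.copy()
y_poisoned[flip_idx] = rng.integers(0, 10, n_flip)  # random (often wrong) labels

# Compare test accuracy with and without the poisoned labels
clean_model = LogisticRegression(max_iter=2000).fit(X_train, y_train)
poisoned_model = LogisticRegression(max_iter=2000).fit(X_train, y_poisoned)
print(f"Clean-label accuracy:   {clean_model.score(X_test, y_test):.3f}")
print(f"Flipped-label accuracy: {poisoned_model.score(X_test, y_test):.3f}")

Even a modest fraction of flipped labels typically produces a measurable drop in test accuracy, which is exactly the kind of degradation the defenses below aim to prevent.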

Defenses Against Poisoning Attacks

Protecting against poisoning requires a multi-layered approach, focusing on data validation, model robustness, and continuous monitoring. Key strategies include:

  • Data Sanitization and Anomaly Detection: Use techniques like outlier detection to identify and remove suspicious data points.
  • Ensemble-Based Techniques: Train multiple models on disjoint data subsets to dilute the impact of poisons (a minimal sketch follows this list).
  • Risk-Based Routing: Assign data to training subsets based on an assessment of how trustworthy its source is.
  • Specialized Defenses: Techniques such as kNN Proximity-Based Defense (KPB), Neighborhood Class Comparison (NCC), Clustering-Based Defense (CBD), and Mean Distance Threshold (MDT), proposed specifically against beta poisoning.
  • Sandboxing and Filtering: Limit exposure to unverified data sources and apply differential privacy to bound the influence of any single training sample.
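
As a rough illustration of the ensemble idea (in the spirit of partition-based aggregation defenses), the sketch below trains one classifier per disjoint partition of the data and aggregates predictions by majority vote; the function names are illustrative rather than taken from any particular library:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def train_disjoint_ensemble(X, y, n_partitions=5, seed=42):
    # Each model sees a disjoint slice of the data, so a poisoned sample
    # can influence at most one of the n_partitions models.
    rng = np.random.default_rng(seed)
    partitions = np.array_split(rng.permutation(len(X)), n_partitions)
    return [DecisionTreeClassifier(random_state=0).fit(X[p], y[p]) for p in partitions]

def majority_vote(models, X):
    # Stack per-model predictions and take the most common class per sample
    # (assumes integer class labels).
    preds = np.stack([m.predict(X) for m in models])  # shape: (n_models, n_samples)
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, preds)

# Usage: models = train_disjoint_ensemble(X_train, y_train)
#        y_pred = majority_vote(models, X_test)

Because any single poisoned sample can sway at most one sub-model, its influence is diluted at voting time.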

Code Sample: Anomaly Detection for Data Sanitization

A simple way to defend against poisoning is to use outlier detection on the training data. Here's a Python example using scikit-learn's Isolation Forest to identify potential poisons in a dataset (e.g., MNIST-like features):

import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.datasets import load_digits  # Example dataset
 
# Load example data
data = load_digits()
X = data.data  # Features
y = data.target  # Labels
 
# Simulate poisoning: Add outliers
poison_indices = np.random.choice(len(X), 50, replace=False)
X[poison_indices] += np.random.normal(10, 5, X[poison_indices].shape)  # Add noise
 
# Apply Isolation Forest for anomaly detection
iso_forest = IsolationForest(contamination=0.05, random_state=42)
iso_forest.fit(X)
 
# Predict anomalies (-1 for outlier, 1 for inlier)
anomalies = iso_forest.predict(X)
 
# Filter out detected poisons
clean_indices = np.where(anomalies == 1)[0]
X_clean = X[clean_indices]
y_clean = y[clean_indices]
 
print(f"Original samples: {len(X)}, Clean samples: {len(X_clean)}")

This code fits an Isolation Forest with an assumed contamination rate of 5% and removes the flagged outliers, helping to sanitize the dataset before training.

For more advanced implementations, open-source frameworks such as the Adversarial Robustness Toolbox (ART) provide ready-made poisoning attacks and defenses, including filtering based on spectral signatures.


Understanding Evasion Attacks

Evasion attacks occur at inference time, where adversaries perturb inputs to cause incorrect outputs, such as misclassifying an image. Common methods include the Fast Gradient Sign Method (FGSM) and Projected Gradient Descent (PGD), both of which add imperceptible perturbations computed from the model's gradients.

These attacks exploit the model's sensitivity to small changes, making them a threat in real-time systems like autonomous vehicles or security scanners.
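
PGD is essentially an iterative FGSM: it repeatedly takes a small gradient-sign step and projects the result back into an epsilon-ball around the original input. A minimal PyTorch sketch (the function name and hyperparameter values are illustrative):

import torch
import torch.nn.functional as F

def pgd_attack(model, images, labels, epsilon=0.1, alpha=0.02, num_steps=10):
    # Iteratively perturb the input, keeping the total perturbation within
    # an L-infinity ball of radius epsilon and pixel values within [0, 1].
    original = images.clone().detach()
    perturbed = images.clone().detach()
    for _ in range(num_steps):
        perturbed.requires_grad_(True)
        loss = F.cross_entropy(model(perturbed), labels)
        grad = torch.autograd.grad(loss, perturbed)[0]
        step = alpha * grad.sign()
        delta = torch.clamp(perturbed.detach() + step - original, -epsilon, epsilon)
        perturbed = torch.clamp(original + delta, 0, 1)
    return perturbed.detach()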

Defenses Against Evasion Attacks

To counter evasion, models can be made more robust through:

  • Adversarial Training: Train on both clean and adversarial examples to build resilience.
  • Input Preprocessing: Use filters or transformations to remove perturbations before they reach the model (a simple sketch follows this list).
  • Detection Mechanisms: Monitor for anomalous inputs.
  • Ensemble Methods: Combine predictions from multiple models.
  • Continuous Monitoring: Regularly test and update models.
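
As one example of input preprocessing, feature squeezing reduces the bit depth of the input so that tiny adversarial perturbations are rounded away before the model ever sees them. A minimal sketch in PyTorch, to match the training example below (the function name is illustrative):

import torch

def squeeze_bit_depth(images, bits=4):
    # Quantize pixel intensities to 2**bits levels; assumes inputs in [0, 1].
    levels = 2 ** bits - 1
    return torch.round(images * levels) / levels

# Usage at inference time: preprocess inputs before the forward pass
# logits = model(squeeze_bit_depth(batch))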

Code Sample: Adversarial Training with FGSM in PyTorch

Adversarial training generates perturbed examples on the fly and includes them in the training objective. Below is an example adapted from the PyTorch FGSM tutorial, assuming a simple CNN model trained on MNIST.

First, the FGSM function:

import torch
import torch.nn.functional as F
 
def fgsm_attack(image, epsilon, data_grad):
    # Perturb the input in the direction of the sign of its loss gradient,
    # then clamp back to the valid [0, 1] pixel range.
    sign_data_grad = data_grad.sign()
    perturbed_image = image + epsilon * sign_data_grad
    perturbed_image = torch.clamp(perturbed_image, 0, 1)
    return perturbed_image

Now, integrate into a training loop (excerpt):

# Assume model, optimizer, train_loader, device, and num_epochs are defined
epsilon = 0.1  # Perturbation strength
 
for epoch in range(num_epochs):
    for data, target in train_loader:
        data, target = data.to(device), target.to(device)
        data.requires_grad = True
        
        # Forward pass on clean data
        output = model(data)
        loss = F.cross_entropy(output, target)  # use F.nll_loss if the model outputs log-probabilities
        
        # Backward pass to obtain input gradients; retain_graph keeps the graph
        # alive so the clean loss can be reused in the combined objective below
        model.zero_grad()
        loss.backward(retain_graph=True)
        data_grad = data.grad.data
        
        # Generate adversarial example (detached so it is treated as a fixed input)
        perturbed_data = fgsm_attack(data, epsilon, data_grad).detach()
        
        # Forward on adversarial data
        output_adv = model(perturbed_data)
        loss_adv = F.cross_entropy(output_adv, target)
        
        # Combined loss (e.g., 50% clean + 50% adv)
        total_loss = 0.5 * loss + 0.5 * loss_adv
        
        # Optimize
        optimizer.zero_grad()
        total_loss.backward()
        optimizer.step()

This trains the model to handle perturbations, improving robustness against evasion attacks.
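
To check whether adversarial training actually helped, robustness is usually measured as accuracy on adversarially perturbed test inputs. A small evaluation sketch that reuses the fgsm_attack function above (test_loader is assumed to be defined like train_loader):

def evaluate_fgsm_robustness(model, test_loader, epsilon, device):
    # Accuracy on FGSM-perturbed test inputs; higher means more robust
    model.eval()
    correct, total = 0, 0
    for data, target in test_loader:
        data, target = data.to(device), target.to(device)
        data.requires_grad = True
        loss = F.cross_entropy(model(data), target)
        model.zero_grad()
        loss.backward()
        perturbed = fgsm_attack(data, epsilon, data.grad.data)
        with torch.no_grad():
            preds = model(perturbed).argmax(dim=1)
        correct += (preds == target).sum().item()
        total += target.size(0)
    return correct / total

Comparing this number before and after adversarial training, and across epsilon values, gives a concrete picture of how much robustness was gained.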


Conclusion

Securing AI models against poisoning and evasion attacks is crucial for trustworthy deployment. By implementing data sanitization, ensemble methods, and adversarial training, developers can significantly mitigate these risks. Ongoing work, including NIST's adversarial machine learning guidance, emphasizes the need for robust defenses across all stages of the AI lifecycle. As AI evolves, so must our security strategies to stay ahead of adversaries.