Poisoning and Evasion Attacks: Protecting AI Models Across Training and Inference
A practical guide to the two dominant adversarial ML threats: data poisoning at training time and evasion at inference time, with defensive strategies and code samples.
Dec 23, 2025 · AI Security
Artificial Intelligence (AI) and Machine Learning (ML) models have become integral to various applications, from image recognition to natural language processing. However, these models are vulnerable to adversarial attacks that can compromise their integrity and performance. Two prominent types of such attacks are poisoning attacks, which target the training phase by injecting malicious data, and evasion attacks, which manipulate inputs during inference to deceive the model.
Poisoning attacks aim to degrade model behavior or introduce biases by corrupting the training dataset, potentially leading to real-world chaos like biased decision-making or security breaches. Evasion attacks, on the other hand, craft subtle perturbations to inputs that cause misclassification without altering the training process. This article explores these attacks and outlines strategies to protect AI models, including code samples for practical implementation.
Data poisoning involves an adversary inserting corrupted or biased data into the model's training, fine-tuning, or retrieval processes. This can be done covertly, such as through supply chain compromises or unverified data sources, and is particularly concerning for large language models (LLMs) and other AI systems reliant on vast datasets.
Types of poisoning include:
- Label flipping: the attacker changes the labels of training examples so the model learns incorrect associations.
- Backdoor (trojan) attacks: poisoned samples carry a hidden trigger pattern; the model behaves normally until the trigger appears at inference time.
- Clean-label attacks: samples are subtly perturbed but keep correct labels, steering the model toward attacker-chosen errors while evading label audits.
- Availability attacks: enough of the dataset is corrupted to degrade overall accuracy rather than target specific inputs.
Without defenses, poisoned models can exacerbate biases or fail in critical sectors like military AI or healthcare.
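To make the threat concrete, here is a minimal sketch of a label-flipping attack using scikit-learn. The dataset, flip rate, and classifier are illustrative assumptions, not specifics from any real incident:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Load a small image-classification dataset
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Baseline: train on clean labels
clean_model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
clean_acc = clean_model.score(X_test, y_test)

# Simulate label flipping: shift 40% of training labels to the next class
rng = np.random.default_rng(42)
y_poisoned = y_train.copy()
flip = rng.choice(len(y_poisoned), size=int(0.4 * len(y_poisoned)), replace=False)
y_poisoned[flip] = (y_poisoned[flip] + 1) % 10

poisoned_model = LogisticRegression(max_iter=1000).fit(X_train, y_poisoned)
poisoned_acc = poisoned_model.score(X_test, y_test)

print(f"Clean accuracy:    {clean_acc:.3f}")
print(f"Poisoned accuracy: {poisoned_acc:.3f}")
```

Even this crude attack measurably degrades test accuracy; targeted attacks in the wild are far subtler.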
Protecting against poisoning requires a multi-layered approach, focusing on data validation, model robustness, and continuous monitoring. Key strategies include:
- Data validation and provenance tracking: verify the source and integrity of every dataset before it enters the pipeline.
- Anomaly and outlier detection: statistically screen training data for samples that deviate from the expected distribution.
- Robust training methods: techniques such as trimmed losses or differentially private training that limit the influence of any single sample.
- Continuous monitoring: track model behavior after deployment to catch drift or backdoor activation early.
A simple way to defend against poisoning is to use outlier detection on the training data. Here's a Python example using scikit-learn's Isolation Forest to identify potential poisons in a dataset (e.g., MNIST-like features):
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.datasets import load_digits # Example dataset
# Load example data
data = load_digits()
X = data.data # Features
y = data.target # Labels
# Simulate poisoning: Add outliers
poison_indices = np.random.choice(len(X), 50, replace=False)
X[poison_indices] += np.random.normal(10, 5, X[poison_indices].shape) # Add noise
# Apply Isolation Forest for anomaly detection
iso_forest = IsolationForest(contamination=0.05, random_state=42)
iso_forest.fit(X)
# Predict anomalies (-1 for outlier, 1 for inlier)
anomalies = iso_forest.predict(X)
# Filter out detected poisons
clean_indices = np.where(anomalies == 1)[0]
X_clean = X[clean_indices]
y_clean = y[clean_indices]
print(f"Original samples: {len(X)}, Clean samples: {len(X_clean)}")

This code fits an Isolation Forest with an assumed contamination rate of 5% and removes the detected outliers, helping to sanitize the dataset before training.
For more advanced implementations, open-source frameworks on GitHub, such as the Adversarial Robustness Toolbox (ART), provide tools for simulating poisoning attacks and applying defenses, including spectral-signature filtering.
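The spectral-signatures idea can be sketched in a few lines of NumPy: poisoned examples tend to leave a detectable trace along the top singular direction of the (centered) feature matrix. This is a simplified illustration; in practice the features come from a trained network's penultimate layer and filtering is done per class, and `spectral_filter` is a hypothetical helper, not an ART API:

```python
import numpy as np

def spectral_filter(features, remove_frac=0.05):
    """Score examples by squared projection onto the top singular vector of
    the centered feature matrix; flag the highest-scoring fraction."""
    centered = features - features.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    scores = (centered @ vt[0]) ** 2
    n_remove = int(remove_frac * len(features))
    suspect = np.argsort(scores)[-n_remove:]
    keep = np.setdiff1d(np.arange(len(features)), suspect)
    return keep, suspect

# Toy demo: 200 "clean" feature vectors plus 10 shifted "poisoned" ones
rng = np.random.default_rng(0)
clean = rng.normal(0, 1, size=(200, 32))
poison = rng.normal(6, 1, size=(10, 32))
feats = np.vstack([clean, poison])

keep, suspect = spectral_filter(feats, remove_frac=0.05)
print(f"Flagged {len(suspect)} samples; "
      f"{np.sum(suspect >= 200)} of 10 poisons caught")
```

Because the shifted points dominate the top singular direction, nearly all of them land in the flagged set in this toy setting.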
Evasion attacks occur at inference time, where adversaries perturb inputs to cause incorrect outputs, such as misclassifying an image. Common methods include Fast Gradient Sign Method (FGSM) and Projected Gradient Descent (PGD), which add imperceptible noise based on model gradients.
These attacks exploit the model's sensitivity to small changes, making them a threat in real-time systems like autonomous vehicles or security scanners.
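PGD is essentially FGSM applied iteratively, with each step projected back into an epsilon-ball around the original input. A minimal PyTorch sketch, assuming inputs are normalized to [0, 1] (the model, step size, and iteration count below are illustrative):

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, images, labels, epsilon=0.3, alpha=0.01, steps=40):
    """Projected Gradient Descent: iterated gradient-sign steps, each
    projected back into the L-infinity epsilon-ball around the input."""
    original = images.detach()
    perturbed = original.clone()
    for _ in range(steps):
        perturbed.requires_grad_(True)
        loss = F.cross_entropy(model(perturbed), labels)
        grad, = torch.autograd.grad(loss, perturbed)
        with torch.no_grad():
            perturbed = perturbed + alpha * grad.sign()          # ascent step
            perturbed = original + torch.clamp(perturbed - original,
                                               -epsilon, epsilon)  # project
            perturbed = torch.clamp(perturbed, 0, 1)             # valid pixels
    return perturbed.detach()

# Tiny demo on a randomly initialized linear "model"
model = torch.nn.Linear(4, 3)
x = torch.rand(5, 4)
y = torch.randint(0, 3, (5,))
adv = pgd_attack(model, x, y, epsilon=0.1, alpha=0.02, steps=10)
print("max perturbation:", (adv - x).abs().max().item())
```

The projection step is what distinguishes PGD from simply repeating FGSM: it guarantees the final perturbation stays within the epsilon budget no matter how many steps are taken.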
To counter evasion, models can be made more robust through:
- Adversarial training: augmenting training with adversarially perturbed examples so the model learns to resist them.
- Input preprocessing: transformations such as quantization, smoothing, or randomization that remove small perturbations before inference.
- Ensemble methods: combining multiple models so a perturbation crafted against one is less likely to fool all of them.
- Certified defenses: techniques such as randomized smoothing that give provable robustness guarantees within a perturbation bound.
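As a quick illustration of input preprocessing, here is a sketch of bit-depth reduction, a form of feature squeezing (Xu et al.). It assumes inputs normalized to [0, 1]; `reduce_bit_depth` and the perturbation size are illustrative choices:

```python
import numpy as np

def reduce_bit_depth(x, bits=4):
    """Quantize inputs in [0, 1] to 2**bits levels, erasing
    low-amplitude adversarial noise."""
    levels = 2 ** bits - 1
    return np.round(x * levels) / levels

# A small FGSM-style perturbation is largely removed by quantization
rng = np.random.default_rng(0)
clean = rng.random((8, 8))
noise = 0.01 * np.sign(rng.standard_normal((8, 8)))  # epsilon = 0.01
adv = np.clip(clean + noise, 0, 1)

squeezed_clean = reduce_bit_depth(clean)
squeezed_adv = reduce_bit_depth(adv)
print("Fraction of pixels restored:",
      np.mean(squeezed_clean == squeezed_adv))
```

Because the quantization step (1/15 here) is much larger than the perturbation, most perturbed pixels snap back to the same bin as their clean counterparts. Stronger attacks require stronger defenses, such as the adversarial training shown next.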
Adversarial training involves generating perturbed examples during training. Below is an example adapted from the PyTorch FGSM tutorial, assuming a simple CNN model trained on MNIST.
First, the FGSM function:
import torch
import torch.nn.functional as F
def fgsm_attack(image, epsilon, data_grad):
    # Step in the direction of the gradient's sign
    sign_data_grad = data_grad.sign()
    perturbed_image = image + epsilon * sign_data_grad
    # Keep pixel values in the valid [0, 1] range
    perturbed_image = torch.clamp(perturbed_image, 0, 1)
    return perturbed_image

Now, integrate it into a training loop (excerpt):
# Assume model, optimizer, train_loader, device, and num_epochs are defined
epsilon = 0.1  # Perturbation strength

for epoch in range(num_epochs):
    for data, target in train_loader:
        data, target = data.to(device), target.to(device)
        data.requires_grad = True

        # Forward pass on clean data
        output = model(data)
        loss = F.cross_entropy(output, target)  # Or nll_loss

        # Backward pass to obtain input gradients; retain the graph so the
        # combined loss can be backpropagated again below
        model.zero_grad()
        loss.backward(retain_graph=True)
        data_grad = data.grad.data

        # Generate adversarial examples from the clean batch
        perturbed_data = fgsm_attack(data, epsilon, data_grad).detach()

        # Forward pass on adversarial data
        output_adv = model(perturbed_data)
        loss_adv = F.cross_entropy(output_adv, target)

        # Combined loss: equal weighting of clean and adversarial terms
        total_loss = 0.5 * loss + 0.5 * loss_adv

        # Optimize
        optimizer.zero_grad()
        total_loss.backward()
        optimizer.step()

This trains the model on both clean and perturbed batches, improving robustness against evasion attacks. Note that `retain_graph=True` is required on the first backward pass; without it, backpropagating the combined loss would fail because the clean loss's graph has already been freed.
Securing AI models against poisoning and evasion attacks is crucial for trustworthy deployment. By implementing data sanitization, ensemble methods, and adversarial training, developers can significantly mitigate these risks. Ongoing work, such as NIST's adversarial machine learning guidance, emphasizes the need for robust defenses across all stages of the AI lifecycle. As AI evolves, so must our security strategies to stay ahead of adversaries.