Fixing Safety Failures in Agentic AI Guard Models

When Safety Geometry Collapses: Fine-Tuning Vulnerabilities in Agentic Guard Models

A recent arXiv preprint (2605.02914v1) reports that guard models can lose their safety behavior when fine-tuned on entirely benign datasets. The failures do not arise from adversarial manipulation; they result from standard domain specialization. The study examines these failures across three safety classifiers used in agentic AI pipelines: LlamaGuard, WildGuard, and Granite Guardian.

The Role of Latent Safety Geometry

The central concept of the research is “latent safety geometry,” the structured boundary in a classifier's internal representations that separates harmful content from benign content. The study demonstrates that benign fine-tuning can destroy this geometry and, in the most severe case, cause a complete collapse of safety behavior.

LlamaGuard: This model exhibited a notable decline in safety metrics post-fine-tuning.
WildGuard: Although initially robust, it also showed signs of vulnerability after benign fine-tuning.
Granite Guardian: The most striking results were observed here, where the refusal rate plummeted from 85% to 0%, indicating a total failure of safety alignment.

Key Findings and Implications

The research identified several key findings regarding the collapse of safety in these models:

The Centered Kernel Alignment (CKA) score for Granite Guardian fell to zero, indicating a total loss of the original safety representation.
All outputs from Granite Guardian became ambiguous, underscoring the severity of the collapse.
This catastrophic failure is attributed to what the researchers call the specialization hypothesis, which suggests that concentrated safety representations are efficient but extremely brittle.
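
For reference, CKA is a similarity index between two sets of activations. The source does not say which kernel the study uses, so the following assumes the common linear variant. The symbols are illustrative: $X \in \mathbb{R}^{n \times p}$ could hold activations before fine-tuning and $Y \in \mathbb{R}^{n \times q}$ activations after it, collected on the same $n$ inputs and column-centered.

$$
\mathrm{CKA}(X, Y) = \frac{\lVert Y^\top X \rVert_F^2}{\lVert X^\top X \rVert_F \, \lVert Y^\top Y \rVert_F}
$$

The score lies between 0 and 1 and is unchanged by rotations or uniform rescaling of either representation. Under this definition, a score of zero means the two representations share no linear structure, not merely that the activations have moved.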

Proposed Solutions: Fisher-Weighted Safety Subspace Regularization (FW-SSR)

To counteract these vulnerabilities, the study proposes a novel approach called Fisher-Weighted Safety Subspace Regularization (FW-SSR). This technique introduces a training-time penalty that combines:

Curvature-aware direction weights: These are derived from diagonal Fisher information, which estimates how sensitive the model's predictions are to change along each direction, so the most safety-critical directions can be weighted most heavily.
Adaptive $\lambda_t$: This penalty strength adjusts at each training step based on the conflict between the task gradient and the safety gradient, balancing task performance against safety preservation.
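
The two ingredients above can be sketched in a few lines of NumPy. This is not the paper's formulation, which the article does not spell out. It is an illustrative stand-in that assumes an EWC-style quadratic penalty on parameter drift, weighted by a diagonal empirical Fisher estimate, and a penalty strength that grows when the task and safety gradients point in opposing directions. All function names, the `base_lambda` parameter, and the specific schedule are hypothetical.

```python
import numpy as np

def diagonal_fisher(per_example_grads):
    """Diagonal empirical Fisher estimate: mean of squared per-example gradients.

    Assumption: gradients are taken on safety data, so large entries mark
    coordinates that safety predictions are most sensitive to.
    """
    g = np.asarray(per_example_grads, dtype=float)
    return np.mean(g ** 2, axis=0)

def adaptive_lambda(task_grad, safety_grad, base_lambda=1.0, eps=1e-12):
    """Hypothetical schedule: raise the penalty when the gradients conflict."""
    t = np.asarray(task_grad, dtype=float)
    s = np.asarray(safety_grad, dtype=float)
    cos = np.dot(t, s) / (np.linalg.norm(t) * np.linalg.norm(s) + eps)
    conflict = max(0.0, -float(cos))  # 0 when aligned or orthogonal, ~1 when opposed
    return base_lambda * (1.0 + conflict)

def fw_ssr_penalty(theta, theta_ref, fisher_diag, lam):
    """Fisher-weighted quadratic penalty on drift from the reference weights."""
    drift = np.asarray(theta, dtype=float) - np.asarray(theta_ref, dtype=float)
    return lam * float(np.sum(np.asarray(fisher_diag) * drift ** 2))

# Toy usage: only the first coordinate matters for safety in this example.
fisher = diagonal_fisher([[1.0, 0.0], [3.0, 0.0]])
lam = adaptive_lambda(task_grad=[1.0, 0.0], safety_grad=[-1.0, 0.0])
penalty = fw_ssr_penalty([1.0, 10.0], [0.0, 0.0], fisher, lam)
print(fisher, lam, penalty)
```

In a training loop, a term like `penalty` would be added to the task loss at each step. In this toy case the large drift in the second coordinate costs nothing because its Fisher weight is zero, which is the sense in which the weighting leaves the model free to specialize along directions that do not carry safety information.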

With FW-SSR applied, Granite Guardian’s refusal rate recovered to 75% (against 85% before fine-tuning and 0% after unregularized fine-tuning), and its CKA score rose to 0.983. WildGuard’s Attack Success Rate fell to 3.6%, which is below its unmodified baseline.

Conclusion: Geometry-Based Monitoring as a Necessity

Across all three models studied, structural representational geometry, measured through CKA and Fisher scores, proved to be a more reliable predictor of safety behavior than traditional absolute displacement metrics. On this basis, the study argues that geometry-based monitoring should be an essential component of evaluating guard models in agentic deployments.
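
A small synthetic experiment illustrates why a displacement metric can mislead where a geometric one does not. The sketch below is not taken from the study. It assumes linear CKA, random Gaussian "activations", and a hypothetical mean L2 displacement metric. It compares a harmless rotation of the representation with a replacement by unrelated noise.

```python
import numpy as np

def linear_cka(x, y):
    """Linear CKA between two activation matrices with matching rows."""
    x = x - x.mean(axis=0, keepdims=True)  # center each feature
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, "fro") ** 2
    return float(cross / (np.linalg.norm(x.T @ x, "fro")
                          * np.linalg.norm(y.T @ y, "fro")))

def mean_displacement(x, y):
    """Average L2 distance between corresponding activation vectors."""
    return float(np.mean(np.linalg.norm(x - y, axis=1)))

rng = np.random.default_rng(0)
base = rng.normal(size=(500, 16))  # stand-in for pre-fine-tuning activations

# Case A: an orthogonal rotation. Every vector moves, but the geometry is intact.
q, _ = np.linalg.qr(rng.normal(size=(16, 16)))
rotated = base @ q

# Case B: unrelated noise. The original structure is gone.
noise = rng.normal(size=(500, 16))

cka_rotated = linear_cka(base, rotated)
cka_noise = linear_cka(base, noise)
disp_rotated = mean_displacement(base, rotated)
disp_noise = mean_displacement(base, noise)

print(f"rotation: displacement={disp_rotated:.2f}, CKA={cka_rotated:.3f}")
print(f"noise:    displacement={disp_noise:.2f}, CKA={cka_noise:.3f}")
```

Both cases produce a large displacement, so that metric flags them alike. CKA separates them: it stays near 1 for the rotation and drops toward 0 for the noise.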

The implications are practical as well as theoretical. Fine-tuning a guard model on benign data is not automatically safe, so teams that specialize these models should verify that safety behavior survives the process rather than assume it does.

RichlyAI Blog AI Guide, Tutorials, Industrial Insights, & more!
