SALLIE: Safeguarding Against Latent Language & Image Exploits
Summary: arXiv:2604.06247v1
Large Language Models (LLMs) and Vision-Language Models (VLMs) are now integral to many applications, yet they remain susceptible to exploitation. Recent studies have documented vulnerabilities to textual and visual jailbreaks and to prompt injections, all of which threaten the integrity of these systems (arXiv:2307.15043, Greshake et al., 2023, arXiv:2306.13213).
Existing defenses often degrade performance by applying complex input transformations, or they treat multimodal threats as isolated issues (arXiv:2309.00614, arXiv:2310.03684, Zhang et al., 2025). This gap calls for a defense that handles textual and visual threats together, without compromising performance or requiring extensive architectural changes.
Introducing SALLIE
To bridge this critical gap, we present SALLIE (Safeguarding Against Latent Language & Image Exploits), a lightweight runtime detection framework grounded in mechanistic interpretability (Lindsey et al., 2025, Ameisen et al., 2025). SALLIE is designed to integrate seamlessly into standard token-level fusion pipelines (arXiv:2306.13549), allowing it to extract robust signals directly from the model’s internal activations.
How SALLIE Works
SALLIE employs a three-stage architecture during inference to defend against potential exploits:
- Stage 1: Extracting internal residual stream activations to capture how the model internally represents the input.
- Stage 2: Calculating layer-wise maliciousness scores using a K-Nearest Neighbors (k-NN) classifier, which assesses the likelihood of malicious intent based on the extracted activations.
- Stage 3: Aggregating these predictions through a layer ensemble module to produce a comprehensive threat assessment.
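The three stages above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: Stage 1 is replaced by synthetic activation vectors (no real model is loaded), the k-NN score is taken to be the fraction of malicious neighbors under Euclidean distance, and the layer ensemble is taken to be a weighted mean followed by a threshold. The function names, k = 5, and the 0.5 threshold are all hypothetical choices; the source does not specify them.

```python
import math
import random


def knn_score(query, ref_acts, ref_labels, k=5):
    """Stage 2 (sketch): fraction of the k nearest reference activations
    that are labeled malicious (1). Euclidean distance is an assumption."""
    order = sorted(range(len(ref_acts)),
                   key=lambda i: math.dist(query, ref_acts[i]))
    nearest = order[:k]
    return sum(ref_labels[i] for i in nearest) / len(nearest)


def ensemble(scores, weights=None, threshold=0.5):
    """Stage 3 (sketch): weighted mean of per-layer scores, then a threshold.
    The real ensemble module may differ; this is a stand-in."""
    weights = weights or [1.0] * len(scores)
    combined = sum(s * w for s, w in zip(scores, weights)) / sum(weights)
    return combined, combined >= threshold


def detect(query_by_layer, refs_by_layer, ref_labels, k=5):
    """Score one input given its activation vector at each monitored layer."""
    scores = [knn_score(q, refs, ref_labels, k)
              for q, refs in zip(query_by_layer, refs_by_layer)]
    return ensemble(scores)


# Stage 1 stand-in: synthetic "activations" instead of real residual streams.
# Benign references cluster around 0.0, malicious ones around 3.0.
random.seed(0)
N_LAYERS, DIM, N_PER_CLASS = 3, 8, 20


def fake_activation(center):
    return [random.gauss(center, 0.5) for _ in range(DIM)]


ref_labels = [0] * N_PER_CLASS + [1] * N_PER_CLASS
refs_by_layer = [
    [fake_activation(0.0) for _ in range(N_PER_CLASS)]
    + [fake_activation(3.0) for _ in range(N_PER_CLASS)]
    for _ in range(N_LAYERS)
]

benign_query = [[0.0] * DIM for _ in range(N_LAYERS)]
suspect_query = [[3.0] * DIM for _ in range(N_LAYERS)]

benign_score, benign_flag = detect(benign_query, refs_by_layer, ref_labels)
suspect_score, suspect_flag = detect(suspect_query, refs_by_layer, ref_labels)
print(benign_score, benign_flag)
print(suspect_score, suspect_flag)
```

In a real deployment the reference vectors would presumably come from labeled benign and malicious prompts passed through the target model, with one reference set per monitored layer. How layers are selected and weighted is left open here.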
Evaluation of SALLIE
We evaluated SALLIE on several compact, open-source architectures, including Phi-3.5-vision-instruct (arXiv:2404.14219), SmolVLM2-2.2B-Instruct (arXiv:2504.05299), and gemma-3-4b-it (arXiv:2503.19786). These models were prioritized for their practical inference times and cost-effectiveness in real-world deployments.
Our evaluation pipeline spans over ten datasets and more than five strong baseline methods from the literature. The results show that SALLIE consistently outperforms these baselines across a wide range of experimental settings.
Conclusion
SALLIE offers a unified, modality-agnostic runtime defense for LLMs and VLMs. By reading signals from the model’s internal activations rather than transforming inputs, it targets both textual and visual exploits without sacrificing performance or requiring extensive architectural changes.
