Bypassing Prompt Injection Detectors through Evasive Injections
Summary: arXiv:2602.00750v2 Announce Type: replace-cross
Abstract: Large language models (LLMs) are increasingly used in interactive and retrieval-augmented systems, but they remain vulnerable to prompt injection attacks, where injected secondary prompts force the model to deviate from the user’s instructions to execute a potentially malicious task defined by the adversary. Recent work shows that ML models trained on activation shifts from LLMs’ hidden layers can detect such drift.
In this paper, we demonstrate that these detectors are not robust to adaptive adversaries. We propose a multi-probe evasion attack that appends an adversarially optimised suffix to poisoned inputs, jointly optimising a universal suffix to simultaneously fool all layer-wise drift detectors while preserving the effectiveness of the underlying injection.
The Threat of Prompt Injection Attacks
Prompt injection attacks pose a significant risk to the integrity of LLMs. These attacks manipulate the input prompts by introducing secondary instructions that redirect the model’s output towards unintended actions. As LLMs become more integrated into various systems, the need for robust security measures against such vulnerabilities intensifies.
Adaptive Evasion Techniques
The research introduces an innovative method for bypassing existing prompt injection detectors. By employing a modified Greedy Coordinate Gradient (GCG) approach, the study generates universal suffixes that enable prompt injections to evade detection across multiple probes simultaneously. This technique is particularly concerning, as it demonstrates that current activation-based task drift detectors are susceptible to adaptive adversarial strategies.
Key Findings
- Success Rates: The research highlights that a single adversarially crafted suffix achieved attack success rates of 93.91% on the Phi-3 3.8B model and 99.63% on the Llama-3 8B model in evading all detectors.
- Vulnerability of Detectors: The findings emphasize the fragility of activation-based drift detectors in the face of sophisticated prompt injection attacks.
- Need for Stronger Defenses: The results underscore the urgent requirement for enhanced defensive measures to protect LLMs from evolving adversarial techniques.
Proposed Defense Mechanisms
In response to the identified vulnerabilities, the authors propose a novel defense strategy based on adversarial suffix augmentation. This approach involves generating multiple suffixes and appending one at random during the model’s forward passes. By training detectors on the resulting activations, the model becomes more resilient against evasive attacks.
Conclusion
As LLMs continue to evolve and find applications in various domains, the implications of this research are profound. The demonstrated effectiveness of adaptive evasion techniques reveals a critical gap in the security framework surrounding LLMs, necessitating immediate attention from the research and development community. Strengthening defenses against prompt injection attacks will be vital in maintaining the reliability and safety of AI-driven systems.
