TEMPLATEFUZZ: Fine-Grained Chat Template Fuzzing for Jailbreaking and Red Teaming LLMs
arXiv:2604.12232v1
Introduction
As the deployment of Large Language Models (LLMs) becomes prevalent across various sectors, the security vulnerabilities associated with these models are increasingly concerning. One of the most critical issues is the susceptibility of LLMs to jailbreak attacks. These attacks involve adversarial inputs that can circumvent the models’ safety mechanisms, potentially leading to harmful outputs.
Current Challenges in LLM Security
Previous research on LLM vulnerabilities has focused primarily on prompt injection attacks. These methods have provided valuable insights, but they often require extensive prompt engineering and overlook other components of the input pipeline, notably chat templates. This gap motivates approaches that treat the chat template itself as an object of security testing.
Introducing TEMPLATEFUZZ
This paper presents TEMPLATEFUZZ, a fine-grained fuzzing framework that systematically identifies and exploits vulnerabilities in chat templates, an underexplored yet critical attack surface in LLMs. The framework combines three strategies:
- Element-Level Mutation Rules: TEMPLATEFUZZ designs a series of rules to generate diverse variants of chat templates, allowing for a comprehensive evaluation of potential vulnerabilities.
- Heuristic Search Strategy: A heuristic search strategy is proposed to steer the generation of chat templates towards maximizing the attack success rate (ASR) while maintaining model accuracy.
- Active Learning-Based Strategy: The integration of an active learning-based approach enables the derivation of a lightweight rule-based oracle, which is crucial for accurate and efficient jailbreak evaluation.
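How the first two strategies fit together can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the template is modeled as a tuple of opaque elements, the three edit operations (`delete`, `duplicate`, `swap`) stand in for the paper's mutation rules, the `fitness` weighting is invented, and `evaluate` is a hypothetical caller-supplied function returning an `(asr, accuracy)` pair. The search shown is a plain greedy hill climb, which may differ from the heuristic the authors use.

```python
import random
from typing import Callable, Tuple

# A chat template modeled as an ordered sequence of opaque elements (assumption).
Template = Tuple[str, ...]


def mutate(template: Template, rng: random.Random) -> Template:
    """Apply one randomly chosen element-level edit (hypothetical rule set)."""
    elems = list(template)
    op = rng.choice(["delete", "duplicate", "swap"])
    i = rng.randrange(len(elems))
    if op == "delete" and len(elems) > 1:
        del elems[i]  # never delete the last remaining element
    elif op == "duplicate":
        elems.insert(i, elems[i])
    elif op == "swap" and len(elems) > 1:
        j = rng.randrange(len(elems))
        elems[i], elems[j] = elems[j], elems[i]
    return tuple(elems)


def fitness(asr: float, accuracy: float, baseline_accuracy: float,
            penalty: float = 1.0) -> float:
    """Reward attack success; penalize accuracy lost relative to the baseline."""
    return asr - penalty * max(0.0, baseline_accuracy - accuracy)


def search(seed: Template,
           evaluate: Callable[[Template], Tuple[float, float]],
           baseline_accuracy: float,
           rounds: int = 50,
           rng_seed: int = 0) -> Tuple[Template, float]:
    """Greedy hill climb: keep a mutated template only if its fitness improves."""
    rng = random.Random(rng_seed)
    best = seed
    best_score = fitness(*evaluate(seed), baseline_accuracy)
    for _ in range(rounds):
        candidate = mutate(best, rng)
        score = fitness(*evaluate(candidate), baseline_accuracy)
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score
```

In a red-teaming harness, `evaluate` would render the candidate template, query the model under test on a fixed benchmark, and return the measured rates. The penalty term is what keeps the search from trading away model accuracy for attack success.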
Evaluation and Results
TEMPLATEFUZZ was evaluated on twelve open-source LLMs across multiple attack scenarios. It achieves an average ASR of 98.2% with only 1.1% degradation in model accuracy, outperforming existing state-of-the-art methods by 9.1% to 47.9% in ASR and by 8.4% in accuracy degradation.
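The two reported metrics can be made concrete with a small sketch. It rests on several assumptions: the refusal markers are hypothetical placeholders rather than the rules of the paper's oracle, a non-refusal is counted as a successful attack (a crude proxy), and accuracy degradation is computed as an absolute difference, which this summary does not confirm.

```python
from typing import Iterable

# Hypothetical refusal markers; the paper's oracle rules are not given here.
REFUSAL_MARKERS = ("i cannot", "i can't", "i'm sorry", "i am unable")


def is_refusal(response: str) -> bool:
    """Rule-based check: does the response contain a known refusal phrase?"""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)


def attack_success_rate(responses: Iterable[str]) -> float:
    """Fraction of responses that are not refusals (proxy for ASR)."""
    items = list(responses)
    if not items:
        raise ValueError("at least one response is required")
    successes = sum(1 for r in items if not is_refusal(r))
    return successes / len(items)


def accuracy_degradation(baseline: float, attacked: float) -> float:
    """Absolute drop in benchmark accuracy under the mutated template (assumed)."""
    return baseline - attacked
```

A keyword oracle of this kind is cheap to run inside a fuzzing loop but can mislabel responses, which is presumably why the framework derives its rules with active learning rather than fixing them by hand.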
Performance on Commercial LLMs
TEMPLATEFUZZ is also effective on five industry-leading commercial LLMs, where the chat template cannot be explicitly defined. In this setting it achieves a 90% average ASR through chat template-based prompt injection attacks, which shows that the approach extends beyond models whose templates are directly editable.
Conclusion
TEMPLATEFUZZ advances LLM security research by treating chat templates as a first-class attack surface. The framework improves the understanding of template-level vulnerabilities and provides practical tooling for red teaming and for hardening models against jailbreak attacks. As reliance on LLMs grows, systematic testing of this kind becomes increasingly important.
