GRM: Utility-Aware Jailbreak Attacks on Audio LLMs via Gradient-Ratio Masking
Source: arXiv:2604.09222v1 (cross-listed)
Introduction
Audio large language models (ALLMs) let users interact through speech as well as text, but they are also vulnerable to jailbreak attacks. Existing audio jailbreak methods prioritize attack success rate and often neglect utility preservation, which covers transcription quality and question-answering performance. This article examines that trade-off and introduces GRM, a framework designed to balance attack effectiveness against utility preservation.
The Challenge of Jailbreak Attacks
Existing jailbreak methods focus on maximizing success rate, which can degrade the overall utility of the model. To understand how attack strength relates to utility degradation, we studied the frequency domain's influence on jailbreak effectiveness by varying the perturbation coverage from partial-band to full-band. Our findings indicate that:
- Broader frequency coverage does not necessarily enhance jailbreak performance.
- Utility consistently deteriorates as the breadth of perturbation increases.
This raises an intriguing question: Can we achieve a more effective jailbreak while maintaining higher levels of utility?
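The coverage sweep described above can be illustrated with a minimal Python sketch. It assumes the perturbation is stored as a bands-by-frames grid in the Mel domain and that partial-band coverage means enabling the lowest bands first. The source specifies neither detail, and `band_mask`, `apply_mask`, and the `coverage` parameter are hypothetical names chosen for illustration.

```python
def band_mask(n_bands: int, coverage: float) -> list[int]:
    """Binary mask enabling the first round(coverage * n_bands) bands.

    Assumption: "partial-band" means a contiguous block starting at the
    lowest band; the source does not say which bands are covered.
    """
    if not 0.0 <= coverage <= 1.0:
        raise ValueError("coverage must be in [0, 1]")
    k = round(coverage * n_bands)
    return [1] * k + [0] * (n_bands - k)


def apply_mask(perturbation: list[list[float]], mask: list[int]) -> list[list[float]]:
    """Zero every row (band) of a bands-by-frames perturbation whose mask entry is 0."""
    if len(perturbation) != len(mask):
        raise ValueError("mask needs one entry per band")
    return [[value * m for value in row] for row, m in zip(perturbation, mask)]


# Toy perturbation: 4 bands x 2 frames (illustrative values only).
delta = [[0.1, -0.2], [0.3, 0.4], [-0.5, 0.6], [0.7, -0.8]]

partial = apply_mask(delta, band_mask(4, 0.5))  # only bands 0 and 1 perturbed
full = apply_mask(delta, band_mask(4, 1.0))     # every band perturbed
```

Sweeping `coverage` from a small fraction up to 1.0 and measuring attack success and utility at each step would reproduce the shape of the experiment, though not its actual setup.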
The GRM Framework
To answer this question, we propose the Gradient-Ratio Masking (GRM) framework, which is utility-aware and frequency-selective. The framework operates by:
- Ranking Mel bands by their attack contribution relative to their utility sensitivity.
- Focusing perturbations on a carefully selected subset of bands rather than applying full-band coverage.
- Learning a universal perturbation that adheres to a semantic-preservation objective.
By concentrating perturbations on a selected subset of frequency bands, GRM aims to keep the attack effective while limiting the utility loss that broader coverage causes.
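The ranking and selection steps can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes each Mel band already has a non-negative attack-gradient magnitude and a non-negative utility-gradient magnitude, and it scores a band by their plain ratio. The exact scoring rule, the `eps` stabilizer, the top-k selection, and all function names are assumptions. Learning the universal perturbation under the semantic-preservation objective is not shown.

```python
def gradient_ratio_scores(attack_grad: list[float],
                          utility_grad: list[float],
                          eps: float = 1e-8) -> list[float]:
    """Per-band score: attack contribution divided by utility sensitivity.

    Inputs are assumed to be non-negative per-band gradient magnitudes.
    eps (an assumption) avoids division by zero for bands with no utility gradient.
    """
    if len(attack_grad) != len(utility_grad):
        raise ValueError("need one attack and one utility value per band")
    return [a / (u + eps) for a, u in zip(attack_grad, utility_grad)]


def select_bands(scores: list[float], k: int) -> list[int]:
    """Indices of the k highest-scoring bands, returned in ascending order."""
    if not 0 <= k <= len(scores):
        raise ValueError("k must be between 0 and the number of bands")
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def selection_mask(n_bands: int, selected: list[int]) -> list[int]:
    """Binary mask with 1 at each selected band and 0 elsewhere."""
    chosen = set(selected)
    return [1 if i in chosen else 0 for i in range(n_bands)]


# Toy per-band magnitudes (illustrative, not measured values).
attack_grad = [0.9, 0.2, 0.6, 0.1]
utility_grad = [0.3, 0.1, 0.6, 0.5]

scores = gradient_ratio_scores(attack_grad, utility_grad)
bands = select_bands(scores, k=2)          # bands 0 and 1 have the highest ratios
mask = selection_mask(len(scores), bands)  # [1, 1, 0, 0]
```

Note that band 2 has the second-largest attack gradient in this toy example but is not selected, because perturbing it would cost proportionally more utility. That is the intuition behind ranking by a ratio instead of by attack gradient alone.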
Experimental Results
We evaluated the GRM framework on four representative ALLMs:
- GRM achieved an average Jailbreak Success Rate (JSR) of 88.46%.
- The framework demonstrated a superior attack-utility trade-off compared to existing baseline methods.
These findings illustrate the potential of frequency-selective perturbation as a means to balance attack effectiveness with utility preservation in audio jailbreak scenarios.
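For readers unfamiliar with the metric, the arithmetic behind an average JSR can be sketched as below. The sketch assumes JSR is the percentage of attack attempts judged successful and that the average is an unweighted mean over models. How success is judged is outside its scope, and the model names and outcomes are toy placeholders, not results from the paper.

```python
def jsr_percent(outcomes: list[bool]) -> float:
    """Jailbreak Success Rate: percentage of attempts marked successful."""
    if not outcomes:
        raise ValueError("need at least one attempt")
    return 100.0 * sum(outcomes) / len(outcomes)


def average_jsr(per_model_outcomes: dict[str, list[bool]]) -> float:
    """Unweighted mean of per-model JSR values (assumed averaging scheme)."""
    rates = [jsr_percent(o) for o in per_model_outcomes.values()]
    return sum(rates) / len(rates)


# Toy outcomes for two hypothetical models (True = attempt judged successful).
toy_outcomes = {
    "model_a": [True, True, False, True],
    "model_b": [True, False],
}

per_model = {name: jsr_percent(o) for name, o in toy_outcomes.items()}
average = average_jsr(toy_outcomes)
```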
Conclusion
GRM addresses a gap in existing audio jailbreak methods, which pursue attack success without accounting for utility. As audio models continue to evolve, ensuring their robustness against jailbreak attacks while maintaining utility will remain important. Future research should refine these techniques and explore their broader implications for safe and secure interaction with AI-driven audio technologies.
Content Warning: This paper includes harmful query examples and unsafe model responses.
