Jailbreak Scaling Laws for Large Language Models: Polynomial-Exponential Crossover
arXiv:2603.11331v2
Abstract: Adversarial attacks can reliably steer safety-aligned large language models toward unsafe behavior. Empirically, we find that strong adversarial prompt-injection attacks can amplify attack success rate from the slow polynomial growth observed without injection to exponential growth with the number of inference-time samples.
This study examines how the success rate of adversarial attacks on large language models (LLMs) scales with the number of inference-time samples. It focuses on the transition from polynomial to exponential growth in attack success rate that appears when a strong prompt injection is added.
Key Findings
The study reports three main results:
- Adversarial Prompt-Injection Attacks: Strong prompt-injection attacks change how the attack success rate grows with the number of inference-time samples, from slow polynomial growth to exponential growth.
- Scaling Laws: The research identifies two scaling regimes: polynomial growth without prompt injection and exponential growth with prompt injection.
- Statistical Mechanism: A minimal statistical mechanism is proposed that accounts for both regimes.
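To make the two regimes concrete, the sketch below contrasts two toy models of the failure probability (one minus the attack success rate) after k independent samples. It is an illustration under stated assumptions, not the paper's model: `failure_fixed` assumes every sample succeeds with the same probability p, which gives exponential decay in k, while `failure_beta_mixture` assumes p varies across prompts as a Beta(a, b) distribution, which gives power-law decay with exponent a. The function names and the Beta assumption are hypothetical choices made for this example.

```python
import math


def failure_fixed(p, k):
    """Probability that all k i.i.d. samples fail when each succeeds with probability p.

    Decays exponentially in k: (1 - p)^k = exp(k * log(1 - p)).
    """
    return (1.0 - p) ** k


def failure_beta_mixture(a, b, k):
    """Average of (1 - p)^k when p is drawn from Beta(a, b) (an assumed toy prior).

    The closed form is B(a, b + k) / B(a, b), which behaves like k^(-a) for
    large k, i.e. a power law rather than an exponential.
    """
    log_ratio = (
        math.lgamma(b + k) + math.lgamma(a + b)
        - math.lgamma(a + b + k) - math.lgamma(b)
    )
    return math.exp(log_ratio)


if __name__ == "__main__":
    # Compare attack success rate (1 - failure) under both toy models.
    for k in (1, 10, 100, 1000):
        asr_exp = 1.0 - failure_fixed(0.05, k)
        asr_pow = 1.0 - failure_beta_mixture(0.5, 5.0, k)
        print(f"k={k:5d}  fixed-p ASR={asr_exp:.4f}  beta-mixture ASR={asr_pow:.4f}")
```

In this toy picture, the fixed-p curve saturates quickly as k grows, while the mixture curve approaches 1 slowly because a fraction of prompts has a very small per-sample success probability.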
Theoretical Framework
The authors propose a theoretical generative model that likens the behavior of language generation to a spin-glass system operating in a replica-symmetry-breaking regime. Within this framework:
- Generations are drawn from a Gibbs measure.
- A subset of low-energy, size-biased clusters is identified as unsafe.
In this model, an injected prompt plays the role of a magnetic field that pulls the system toward unsafe cluster centers. A short injected prompt acts as a weak field and yields power-law scaling of the attack success rate with the number of inference-time samples. A longer injected prompt acts as a strong field and yields exponential scaling.
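For orientation, a Gibbs measure with an external field is conventionally written in the form below. The notation is generic statistical-physics notation assumed here for illustration and is not taken from the paper: E(x) is the energy of a generation x, β the inverse temperature, h the field strength, and m(x) a hypothetical overlap of x with an unsafe cluster center.

```latex
\mu_{\beta,h}(x) = \frac{\exp\!\big(-\beta\,[E(x) - h\, m(x)]\big)}{Z_{\beta,h}},
\qquad
Z_{\beta,h} = \sum_{x'} \exp\!\big(-\beta\,[E(x') - h\, m(x')]\big)
```

Setting h = 0 recovers the unperturbed measure, and increasing h shifts probability mass toward configurations with large m(x).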
Analytical Derivations and Observations
The researchers derive both scaling behaviors analytically and compare the predictions with empirical measurements. They report qualitatively similar trends across several large language models, which suggests that the crossover is not specific to a single model.
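One simple way to check which regime a measured curve falls into is to compare two straight-line fits of the log failure rate: against k (exponential decay) and against log k (power-law decay). The sketch below is a rough diagnostic under that assumption, not the paper's analysis procedure; `classify_decay` and `r_squared` are hypothetical helper names, and the inputs are assumed to be strictly positive failure rates that are not all equal.

```python
import math


def r_squared(xs, ys):
    """Squared Pearson correlation of xs and ys (R^2 of a simple linear fit)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return (sxy * sxy) / (sxx * syy)


def classify_decay(ks, failure_rates):
    """Label a failure-rate curve as 'exponential' or 'power-law'.

    Exponential decay makes log(failure) linear in k; power-law decay makes
    it linear in log(k). The better linear fit decides the label.
    """
    log_failure = [math.log(f) for f in failure_rates]
    r2_exponential = r_squared([float(k) for k in ks], log_failure)
    r2_power_law = r_squared([math.log(k) for k in ks], log_failure)
    return "exponential" if r2_exponential > r2_power_law else "power-law"
```

With noisy measurements the two fits can be close, so a comparison like this is better treated as a first look than as a definitive test.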
Implications for AI Safety
These findings bear directly on AI safety. They indicate that prompt injection changes how quickly an attacker's chance of success grows with the number of samples drawn, from polynomial to exponential. Understanding how adversarial prompts influence LLM behavior is therefore important for developing robust safety mechanisms, especially as LLMs become more widely integrated into applications.
Conclusion
This research characterizes a vulnerability of large language models and offers a starting point for future work on AI safety. The polynomial-exponential crossover provides a reference point for understanding adversarial dynamics in LLMs, and further investigation is needed to mitigate the risks associated with their deployment.
For further details, the full paper can be accessed on arXiv: 2603.11331v2.
