Claudini: Autoresearch Discovers State-of-the-Art Adversarial Attack Algorithms for LLMs
Researchers have presented an autoresearch pipeline powered by Claude Code, a large language model (LLM) agent that can conduct AI research and engineering autonomously. According to a recent paper on arXiv (arXiv:2603.24511v1), the pipeline discovered novel white-box adversarial attack algorithms that significantly outperform more than 30 existing methods.
Overview of the Research
LLM agents like Claude Code can do more than generate code; they can also carry out complex research tasks, including the discovery of new adversarial attack algorithms. The autoresearch pipeline used in this study identifies and optimizes attack algorithms that can be used to breach the defenses of LLMs.
Key Findings
The research has yielded several noteworthy results:
- The new attack algorithms achieve an attack success rate of up to 40% on CBRN (chemical, biological, radiological, and nuclear) queries against the GPT-OSS-Safeguard-20B model.
- By comparison, existing algorithms reach a success rate of 10% or less.
- The discovered algorithms also generalize: attacks optimized on surrogate models transfer directly to held-out models.
- In this transfer setting, the new algorithms achieved a 100% attack success rate against Meta-SecAlign-70B, compared to just 56% for the best baseline method.
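To make the metric behind these figures concrete, here is a minimal sketch in Python. It assumes the conventional definition of attack success rate (the fraction of attempted queries on which the attack succeeds); the paper's exact judging procedure is not described here. The `AttackResult` type, the `attack_success_rate` function, and the sample data are hypothetical and are not taken from the paper or its repository.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class AttackResult:
    """Outcome of one attack attempt (hypothetical record type)."""
    query_id: str
    success: bool  # True if the attack was judged successful on this query


def attack_success_rate(results: Sequence[AttackResult]) -> float:
    """Return the fraction of attempts that succeeded, between 0.0 and 1.0."""
    if not results:
        raise ValueError("cannot compute a success rate over zero attempts")
    return sum(r.success for r in results) / len(results)


# Made-up illustration: 4 successes out of 10 attempts.
sample = [AttackResult(f"q{i}", success=(i < 4)) for i in range(10)]
print(f"ASR = {attack_success_rate(sample):.0%}")
```

A rate computed this way depends on both the query set and the criterion used to judge success, so figures are only comparable when both are held fixed.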
Implications for AI Safety and Security
The study demonstrates that incremental safety and security research can be automated with the assistance of LLM agents, which could in turn help strengthen the defenses of AI systems. White-box adversarial red-teaming is particularly well-suited to this approach for two reasons: existing methods provide strong starting points, and the optimization process yields dense, quantitative feedback.
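The role of those two ingredients can be illustrated with a toy improvement loop. The sketch below is not the paper's pipeline: the LLM agent is replaced by a random perturbation (`toy_propose`), and the attack objective is replaced by a simple quadratic loss (`toy_evaluate`). All names are hypothetical. It only shows why a known starting point plus a numeric score per candidate makes incremental search straightforward.

```python
import random
from typing import Callable, List, Tuple

Params = List[float]


def improvement_loop(
    baseline: Params,
    propose: Callable[[Params, random.Random], Params],
    evaluate: Callable[[Params], float],
    rounds: int,
    rng: random.Random,
) -> Tuple[Params, List[float]]:
    """Start from a baseline and keep any proposal that lowers the loss.

    Returns the best candidate found and the best loss after each round.
    """
    best, best_loss = baseline, evaluate(baseline)
    history = [best_loss]
    for _ in range(rounds):
        candidate = propose(best, rng)
        loss = evaluate(candidate)  # dense, quantitative feedback
        if loss < best_loss:        # keep only strict improvements
            best, best_loss = candidate, loss
        history.append(best_loss)
    return best, history


def toy_evaluate(params: Params) -> float:
    """Stand-in objective: squared distance from an arbitrary target of 0.5."""
    return sum((p - 0.5) ** 2 for p in params)


def toy_propose(params: Params, rng: random.Random) -> Params:
    """Stand-in for the agent: perturb one randomly chosen coordinate."""
    out = list(params)
    i = rng.randrange(len(out))
    out[i] += rng.gauss(0.0, 0.1)
    return out


best, history = improvement_loop(
    [0.0, 1.0, 0.2], toy_propose, toy_evaluate, rounds=200, rng=random.Random(0)
)
print(f"loss: {history[0]:.3f} -> {history[-1]:.3f}")
```

Because a candidate is accepted only when its score improves, the best loss never gets worse from one round to the next. In a real setting the proposal step is where an agent would modify an existing attack implementation.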
Future Directions and Open Access
In the interest of transparency and collaboration within the AI research community, the authors have released all discovered attack algorithms, along with baseline implementations and evaluation code. These resources are freely available at https://github.com/romovpa/claudini.
Conclusion
The results reported in this research underscore the potential of LLM agents to contribute significantly to AI safety and security. As adversarial attacks become increasingly sophisticated, the ability to automate red-teaming through autoresearch, and so to find weaknesses in defenses before they are exploited, represents a promising avenue for future exploration. The study highlights the capabilities of Claude Code and sets a precedent for similar research aimed at improving the robustness and security of AI systems.
