Automated Framework to Evaluate and Harden LLM System Instructions against Encoding Attacks
Summary
arXiv:2604.01039v1
Announce Type: cross
Abstract
System Instructions in Large Language Models (LLMs) are commonly used to enforce safety policies, define agent behavior, and protect sensitive operational context in agentic AI applications. These instructions may contain sensitive information such as API credentials, internal policies, and privileged workflow definitions, making system instruction leakage a critical security risk highlighted in the OWASP Top 10 for LLM Applications.
Without incurring the overhead costs of reasoning models, many LLM applications rely on refusal-based instructions that block direct requests for system instructions, implicitly assuming that prohibited information can only be extracted through explicit queries.
We introduce an automated evaluation framework that tests whether system instructions remain confidential when extraction requests are re-framed as encoding or structured output tasks. Across four common models and 46 verified system instructions, we observe high attack success rates (> 0.7) for structured serialization where models refuse direct extraction requests but disclose protected content in the requested serialization formats.
We further demonstrate a mitigation strategy based on one-shot instruction reshaping using a Chain-of-Thought reasoning model, indicating that even subtle changes in wording and structure of system instructions can significantly reduce attack success rate without requiring model retraining.
Key Findings
- The framework effectively evaluates the confidentiality of system instructions under various attack scenarios.
- High success rates of attacks indicate vulnerabilities in refusal-based instruction models.
- The one-shot instruction reshaping method provides a viable solution to enhance security without extensive retraining.
Implications for AI Security
The findings of this study underscore the importance of robust security measures in LLM applications. As AI systems become more integrated into critical infrastructure and sensitive operations, the need for protecting internal instructions becomes paramount.
The research highlights that relying solely on refusal-based instructions can lead to false security, as attackers may find alternative approaches to extract sensitive information. Therefore, the implementation of dynamic instruction reshaping strategies could serve as an essential step in reinforcing the confidentiality of system instructions.
Future Directions
Future research could explore additional methods for enhancing the security of LLMs against encoding attacks. This may include developing more sophisticated instruction reshaping techniques or integrating additional layers of security that adapt to emerging threats.
By continually assessing and improving the integrity of system instructions, developers can better safeguard sensitive information and maintain the trustworthiness of AI applications.
