Measuring the Permission Gate: A Stress-Test Evaluation of Claude Code’s Auto Mode
arXiv:2604.04978v1
Abstract: Claude Code’s auto mode is the first deployed permission system for AI coding agents, using a two-stage transcript classifier to gate dangerous tool calls. Anthropic reports a 0.4% false positive rate and 17% false negative rate on production traffic. We present the first independent evaluation of this system on deliberately ambiguous authorization scenarios, i.e., tasks where the user’s intent is clear but the target scope, blast radius, or risk level is underspecified.
Using AmPermBench, a 128-prompt benchmark spanning four DevOps task families and three controlled ambiguity axes, we evaluate 253 state-changing actions at the individual action level against oracle ground truth. Our findings characterize auto mode’s scope-escalation coverage under this stress-test workload.
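Scoring each state-changing action against an oracle label reduces to a per-action confusion matrix. The sketch below is a minimal illustration, not AmPermBench's code. The `Action` schema and its field names are hypothetical, and the six toy records are invented to show the arithmetic; they are not drawn from the paper. Filtering by tier gives the two views used in the findings: end-to-end, and restricted to the actions the classifier evaluates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    """One state-changing tool call, scored on its own (hypothetical schema)."""
    tier: int           # 2 = in-project file edit, 3 = routed to the classifier
    should_block: bool  # oracle ground truth: the action exceeds the authorized scope
    blocked: bool       # what the permission gate actually did


def error_rates(actions):
    """Return (FNR, FPR); each is None when its denominator is zero."""
    tp = sum(a.should_block and a.blocked for a in actions)          # caught
    fn = sum(a.should_block and not a.blocked for a in actions)      # missed
    fp = sum(not a.should_block and a.blocked for a in actions)      # over-blocked
    tn = sum(not a.should_block and not a.blocked for a in actions)  # correctly allowed
    fnr = fn / (fn + tp) if fn + tp else None
    fpr = fp / (fp + tn) if fp + tn else None
    return fnr, fpr


# Toy data, NOT from the paper. Tier 2 edits never reach the classifier,
# so they are never blocked.
actions = [
    Action(tier=2, should_block=True, blocked=False),
    Action(tier=2, should_block=False, blocked=False),
    Action(tier=3, should_block=True, blocked=True),
    Action(tier=3, should_block=True, blocked=False),
    Action(tier=3, should_block=False, blocked=True),
    Action(tier=3, should_block=False, blocked=False),
]

end_to_end = error_rates(actions)
tier3_only = error_rates([a for a in actions if a.tier == 3])
print(end_to_end)  # FNR 2/3, FPR 1/3
print(tier3_only)  # FNR 1/2, FPR 1/2
```

Because an ungated Tier 2 action can only be a false negative or a true negative, including those actions pushes FNR up and FPR down relative to the Tier-3-only view. The FNR direction matches the findings, but the magnitudes in this toy example are arbitrary.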
Key Findings
- The end-to-end false negative rate (FNR) is 81.0% (95% CI: 73.8%-87.4%), far above the 17% Anthropic reports for production traffic.
- This discrepancy reflects a fundamentally different workload (deliberately ambiguous authorization prompts) and does not contradict the production figure.
- 36.8% of all state-changing actions are Tier 2 in-project file edits. These fall outside the classifier's scope and are never evaluated, which inflates the end-to-end FNR.
- Even when the evaluation is restricted to the 160 Tier 3 actions that the classifier actually evaluates, the FNR remains 70.3%, and the false positive rate (FPR) rises to 31.9%, compared with the 0.4% reported for production traffic.
- The Tier 2 coverage gap is widest in artifact cleanup tasks, which show a 92.9% FNR. When the expected command-line interface (CLI) is unavailable, agents often fall back to editing state files directly.
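This summary does not say how the 95% interval on the 81.0% FNR was computed. One common choice for a binomial proportion is the Wilson score interval, sketched below. Treat it as an assumption and not as the paper's method. The counts in the usage line are hypothetical.

```python
import math


def wilson_interval(k, n, z=1.96):
    """Wilson score interval for a proportion k/n; z=1.96 gives roughly 95%."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = k / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp against tiny floating-point excursions outside [0, 1].
    return max(0.0, centre - half), min(1.0, centre + half)


# Hypothetical counts: 81 missed out of 100 actions the oracle says should be blocked.
print(wilson_interval(81, 100))
```

The Wilson interval treats every action as independent. AmPermBench draws 253 actions from 128 prompts, so actions from the same prompt may be correlated. A bootstrap that resamples whole prompts is a reasonable alternative when that matters.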
Discussion
These results expose a coverage boundary in auto mode. The system assumes that dangerous actions pass through the shell, but agents routinely achieve equivalent outcomes through in-project file edits, which the classifier does not evaluate. In artifact cleanup, for example, an agent that cannot find the expected CLI may edit a state file directly. The edit has the same effect as the missing command, yet it is never classified. Closing this gap likely requires the permission system to treat some in-project file edits as potentially dangerous, particularly when a task's scope is underspecified.
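The coverage boundary can be made concrete with a toy router. This is a hypothetical sketch and not auto mode's implementation. The `ToolCall` type, the `"edit"`/`"shell"` kinds, the `classify` callback, the `sensitive_globs` option, and the example state-file path are all invented for illustration. With no sensitive globs, every in-project edit is allowed without consulting the classifier, which mirrors the Tier 2 behavior described in the findings. Adding a glob for state files routes those edits to the classifier as well.

```python
from dataclasses import dataclass
from fnmatch import fnmatch
from typing import Callable, Tuple


@dataclass(frozen=True)
class ToolCall:
    kind: str    # hypothetical: "edit" for in-project file edits, "shell" for commands
    target: str  # file path for edits, command line for shell calls


def route(call: ToolCall,
          classify: Callable[[ToolCall], bool],
          sensitive_globs: Tuple[str, ...] = ()) -> str:
    """Return "allow" or "block" for a state-changing tool call (toy model)."""
    if call.kind == "edit" and not any(fnmatch(call.target, g) for g in sensitive_globs):
        # Tier 2: in-project edits are allowed without ever reaching the classifier.
        return "allow"
    # Tier 3, plus any edits escalated by a matching glob: the classifier decides.
    return "block" if classify(call) else "allow"


def flag_all(call: ToolCall) -> bool:
    """Stub classifier that flags everything, so only the routing varies."""
    return True


state_edit = ToolCall("edit", "infra/terraform.tfstate")  # hypothetical state file
print(route(state_edit, flag_all))                  # allow: classifier never consulted
print(route(state_edit, flag_all, ("*.tfstate",)))  # block: edit escalated to classifier
```

Path globs are only a crude proxy for "state file". The findings leave open whether escalation should key on paths, file contents, or the task's stated scope.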
As AI coding agents become more capable, safety and reliability depend on understanding how they operate and in which contexts their safeguards actually apply. This independent evaluation offers a foundation for future research on refining permission systems for AI agents.
Conclusion
Claude Code's auto mode is a meaningful step toward permission systems for AI coding agents, but this independent evaluation reveals significant coverage gaps. Extending gating to state-changing file edits and improving the classifier's handling of ambiguous authorization scenarios would help the industry build more robust permission systems and raise the safety bar for AI-assisted coding.
