Consistency Amplifies: How Behavioral Variance Shapes Agent Accuracy
arXiv:2603.25764v1
Abstract: As LLM-based agents are deployed in production systems, understanding their behavioral consistency (whether they produce similar action sequences when given identical tasks) becomes critical for reliability. We study consistency in the context of SWE-bench, a challenging software engineering benchmark requiring complex, multi-step reasoning.
Key Findings
We compared three models (Claude 4.5 Sonnet, GPT-5, and Llama 3.1-70B) on 10 tasks with 5 runs each, for 50 runs per model. Across the three models, lower variance went with higher accuracy:
- Claude 4.5 Sonnet: Achieved the lowest variance (Coefficient of Variation: 15.2%) and the highest accuracy (58%).
- GPT-5: Fell in between, with a coefficient of variation of 32.2% and an accuracy of 32%.
- Llama 3.1-70B: Recorded the highest variance (coefficient of variation: 47.0%) and the lowest accuracy (4%).
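The coefficient of variation reported above is the standard deviation divided by the mean. The sketch below shows that calculation in Python. It assumes the quantity being measured is the number of steps per run, which this summary does not specify, and the step counts are hypothetical rather than data from the study.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Return the sample standard deviation divided by the mean, as a percentage."""
    m = mean(values)
    if m == 0:
        raise ValueError("coefficient of variation is undefined when the mean is zero")
    return 100.0 * stdev(values) / m


# Hypothetical step counts for five runs of one task (not data from the study).
steps_per_run = [40, 44, 38, 50, 48]
cv = coefficient_of_variation(steps_per_run)
print(f"CV: {cv:.1f}%")
```

A per-model figure could then be obtained by averaging the per-task values, though the aggregation the authors used is not stated here.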
The Nuance of Consistency
While consistency is generally perceived as a positive attribute, our analysis revealed a critical nuance: consistency amplifies outcomes rather than guaranteeing correctness. For instance, 71% of Claude’s failures were attributed to “consistent wrong interpretation,” where the model made the same incorrect assumption across all runs. This finding highlights that consistent performance can lead to systemic errors when the underlying interpretation is flawed.
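One rough way to surface candidates for this failure mode is to group tasks by whether every run succeeded, every run failed, or the runs disagreed. The Python sketch below assumes each run is recorded as a pass/fail boolean; the task names and outcomes are hypothetical. A task that fails on every run is only a candidate for "consistent wrong interpretation," since confirming a shared wrong assumption would still require inspecting the trajectories.

```python
def classify_task(outcomes):
    """Label a task by how its runs turned out (True means the run resolved the task)."""
    if not outcomes:
        raise ValueError("need at least one run")
    if all(outcomes):
        return "consistent_success"
    if not any(outcomes):
        return "consistent_failure"
    return "mixed"


# Hypothetical outcomes: task id -> pass/fail for each of five runs.
runs = {
    "task_a": [True, True, True, True, True],
    "task_b": [False, False, False, False, False],
    "task_c": [True, False, True, True, False],
}
labels = {task: classify_task(outcomes) for task, outcomes in runs.items()}
print(labels)
```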
Divergence in Model Behavior
GPT-5 and Claude showed similar early strategic agreement: their runs first diverged at nearly the same point (step 3.4 for GPT-5 vs. step 3.2 for Claude). However, GPT-5 exhibited 2.1 times higher variance, indicating that the timing of divergence alone does not determine a model’s overall consistency. Divergence timing is therefore not a sufficient consistency metric on its own.
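To make the divergence measure concrete, the Python sketch below finds the first step at which repeated runs of the same task stop taking the same action. It assumes each run is logged as a list of action labels and that divergence means any disagreement among the runs at a given step; the definition used in the study may differ, and the action names are hypothetical.

```python
def first_divergence_step(trajectories):
    """Return the 1-based step at which the runs first disagree, or None if they never do."""
    shortest = min(len(t) for t in trajectories)
    for i in range(shortest):
        # More than one distinct action at step i means the runs have diverged.
        if len({t[i] for t in trajectories}) > 1:
            return i + 1
    # The runs agree on the shared prefix; they still diverge if some run continues.
    if any(len(t) != shortest for t in trajectories):
        return shortest + 1
    return None


# Hypothetical action sequences for three runs of one task.
runs = [
    ["search", "read", "edit", "test"],
    ["search", "read", "edit", "submit"],
    ["search", "read", "test", "edit"],
]
step = first_divergence_step(runs)
print(step)
```

Averaging this value over tasks would give a fractional figure like the ones quoted above. Note that it says nothing about how far the runs drift apart afterwards, which is consistent with two models diverging at similar steps yet differing in variance.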
Implications for Production Deployment
The findings suggest that, for production deployments, interpreting the task correctly matters more than executing consistently. This has significant implications for how agents are evaluated and trained: developers and researchers should check not only that a model behaves consistently but also that it interprets tasks correctly.
Conclusion
As the deployment of LLM-based agents continues to grow, understanding the relationship between behavioral variance and accuracy will be crucial for reliability in production environments. This research offers a starting point for future work on agent behavior and argues for valuing both consistency and correctness.
