Claw-Eval: Toward Trustworthy Evaluation of Autonomous Agents
In recent years, the deployment of large language models as autonomous agents has gained significant traction, particularly in executing complex multi-step workflows in various software environments. However, the evaluation of these agents has exposed several critical limitations in existing benchmarks. To address these shortcomings, a new evaluation suite called Claw-Eval has been introduced, aiming to enhance the reliability and comprehensiveness of autonomous agent assessments.
Key Limitations of Current Evaluation Benchmarks
Current benchmarks for evaluating autonomous agents face three major issues:
- Trajectory-Opaque Grading: Existing methods primarily focus on assessing final outputs without considering the trajectory of decisions made by the agent throughout the task.
- Underspecified Safety and Robustness Evaluation: There is a lack of thorough evaluation regarding the safety and robustness of agents, which is crucial for real-world deployment.
- Narrow Modality Coverage: Many benchmarks fail to cover a diverse range of modalities and interaction paradigms, limiting their effectiveness in evaluating the true capabilities of autonomous agents.
Introducing Claw-Eval
Claw-Eval is designed to fill these gaps, offering a comprehensive end-to-end evaluation suite that includes 300 human-verified tasks categorized into nine distinct groups. The categories encompass:
- General Service Orchestration
- Multimodal Perception and Generation
- Multi-Turn Professional Dialogue
The innovative framework records every action taken by the agent through three independent evidence channels: execution traces, audit logs, and environment snapshots. This approach enables a trajectory-aware grading system that evaluates performance using 2,159 fine-grained rubric items.
Evaluation Metrics
The scoring protocol within Claw-Eval assesses several key areas:
- Completion: Measures if the agent successfully completes the tasks.
- Safety: Evaluates the safety of the actions taken by the agent.
- Robustness: Assesses the agent’s performance under various conditions.
The results are reported as Average Score, Pass@k, and Pass^k across three trials, allowing for a nuanced understanding of an agent’s capabilities and differentiating between genuine performance and lucky outcomes.
Experimental Findings
Initial experiments conducted on 14 frontier models have yielded significant insights:
- Trajectory-opaque evaluations were found to be systematically unreliable, missing 44% of detected safety violations and 13% of robustness failures that Claw-Eval caught.
- Controlled error injection primarily impacted consistency, leading to a drop of up to 24% in Pass^3 scores, while Pass@3 remained stable.
- Multimodal performance varied greatly, with most models exhibiting poorer performance on video tasks compared to document or image tasks, indicating that no single model excels across all modalities.
Conclusion
Beyond mere benchmarking, Claw-Eval provides actionable insights for agent development, illuminating the requirements for building autonomous agents that are not only capable but also reliable for real-world applications. As the reliance on these agents grows, so does the need for robust evaluation methods that ensure their safety and effectiveness.
