LiveClawBench: Benchmarking LLM Agents on Complex, Real-World Assistant Tasks
arXiv:2604.13072v1 (announce type: cross)
Abstract
LLM-based agents are increasingly expected to handle real-world assistant tasks, yet existing benchmarks typically evaluate them under isolated sources of difficulty, such as a single environment or fully specified instructions. This leaves a substantial gap between current evaluation settings and the compositional challenges that arise in practical deployment. To address this gap, we introduce LiveClawBench, a benchmark to evaluate LLM agents on real-world assistant tasks.
Introducing LiveClawBench
Based on an analysis of various real OpenClaw usage cases, we derive a Triple-Axis Complexity Framework that characterizes task difficulty along three dimensions:
- Environment Complexity: The intricacies of the settings in which the tasks are performed.
- Cognitive Demand: The reasoning effort required to understand and execute the tasks.
- Runtime Adaptability: The ability of the agent to adapt to changes and uncertainties during task execution.
Framework and Benchmark Construction
Guided by this framework, we construct a pilot benchmark with explicit complexity-factor annotations, covering real-world assistant tasks with compositional difficulty. Together, the framework and benchmark provide a principled foundation for evaluating LLM agents in realistic assistant settings.
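As an illustration only, the sketch below shows one way a task's complexity-factor annotations could be recorded along the three axes. The class names, the factor labels ("multiple_services", "underspecified_instruction"), and the rule that a task counts as compositional when at least two axes carry a factor are all assumptions made for this example. They are not taken from the LiveClawBench repository or its annotation format.

```python
from dataclasses import dataclass, field
from enum import Enum


class Axis(Enum):
    """The three axes named by the Triple-Axis Complexity Framework."""
    ENVIRONMENT_COMPLEXITY = "environment_complexity"
    COGNITIVE_DEMAND = "cognitive_demand"
    RUNTIME_ADAPTABILITY = "runtime_adaptability"


@dataclass
class TaskAnnotation:
    """Hypothetical record of the complexity factors attached to one task."""
    task_id: str
    # Maps each axis to a set of factor labels; the labels are illustrative.
    factors: dict[Axis, set[str]] = field(default_factory=dict)

    def active_axes(self) -> set[Axis]:
        """Return the axes that have at least one annotated factor."""
        return {axis for axis, labels in self.factors.items() if labels}

    def is_compositional(self) -> bool:
        """Assumed rule: difficulty is compositional if two or more axes are active."""
        return len(self.active_axes()) >= 2


# Example annotation with made-up factor labels.
task = TaskAnnotation(
    task_id="example-001",
    factors={
        Axis.ENVIRONMENT_COMPLEXITY: {"multiple_services"},
        Axis.COGNITIVE_DEMAND: {"underspecified_instruction"},
        Axis.RUNTIME_ADAPTABILITY: set(),
    },
)
print(sorted(axis.value for axis in task.active_axes()))
print(task.is_compositional())
```

Keeping factors grouped by axis makes it straightforward to slice results by a single axis or by combinations of axes, which is the kind of analysis explicit annotations are meant to support.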
Future Expansion
Our aim is to establish a basis for future expansion across task domains and complexity axes. We are continuing to enrich our case collections to achieve more comprehensive domain and complexity coverage. By doing so, we hope to provide a more nuanced understanding of LLM agent capabilities and limitations in real-world applications.
Project Information
For more details on LiveClawBench, please visit our project page at https://github.com/Mosi-AI/LiveClawBench.
Conclusion
LiveClawBench narrows the gap between existing evaluation settings, which test isolated sources of difficulty, and the compositional challenges that arise in practical deployment. With the Triple-Axis Complexity Framework and its complexity-factor annotations, future research can better assess how LLM agents perform in dynamic, complex real-world environments.
