vla-eval: A Unified Evaluation Harness for Vision-Language-Action Models
arXiv:2603.13966v2
Vision-Language-Action (VLA) models are increasingly evaluated across multiple simulation benchmarks. However, the process of adding each benchmark to an evaluation pipeline presents several challenges. These challenges include resolving incompatible dependencies, matching underspecified evaluation protocols, and reverse-engineering undocumented preprocessing steps. The burden of these issues scales with the number of models and benchmarks involved, rendering comprehensive evaluation impractical for most research teams.
Introduction to vla-eval
To address these challenges, we introduce vla-eval, an open-source evaluation harness for VLA models. vla-eval decouples model inference from benchmark execution, which reduces the cost of adding each benchmark. The two sides communicate over a WebSocket and msgpack protocol, and Docker-based environment isolation keeps incompatible dependencies apart.
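The decoupling described above can be illustrated with a minimal sketch in which only serialized bytes cross the boundary between the benchmark side and the model side. Everything here is an assumption made for illustration: the message fields (episode, observation, action), the helper names, and the EchoModel class are hypothetical and not taken from vla-eval. To stay dependency-free, the sketch uses json from the standard library as a stand-in for msgpack and a plain function call as a stand-in for the WebSocket round trip.

```python
import json
from typing import Any, Callable


def encode(message: dict[str, Any]) -> bytes:
    # Stand-in for msgpack packing; json keeps the sketch dependency-free.
    return json.dumps(message).encode("utf-8")


def decode(payload: bytes) -> dict[str, Any]:
    # Stand-in for msgpack unpacking.
    return json.loads(payload.decode("utf-8"))


class EchoModel:
    """Hypothetical model: returns one zero per requested action dimension."""

    def predict(self, observation: dict[str, Any]) -> list[float]:
        return [0.0] * observation["action_dim"]


def model_server_handle(model: EchoModel, payload: bytes) -> bytes:
    """Model side: decode one request frame, run predict(), encode the reply."""
    request = decode(payload)
    action = model.predict(request["observation"])
    return encode({"episode": request["episode"], "action": action})


def benchmark_client_step(
    send: Callable[[bytes], bytes], episode: int, observation: dict[str, Any]
) -> list[float]:
    """Benchmark side: `send` stands in for one WebSocket round trip."""
    reply = decode(send(encode({"episode": episode, "observation": observation})))
    return reply["action"]


# The benchmark side never touches the model object directly, only bytes.
model = EchoModel()
action = benchmark_client_step(
    lambda payload: model_server_handle(model, payload),
    episode=7,
    observation={"instruction": "pick up the cup", "action_dim": 3},
)
print(action)
```

Because the benchmark side depends only on the message format, the model and the simulator could in principle live in separate processes or containers with unrelated dependency sets, which is the property the Docker-based isolation relies on.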
Key Features of vla-eval
- Simplified Integration: A model integrates into the harness by implementing a single predict() method, and a benchmark integrates through a four-method interface.
- Automated Cross-Evaluation: The full cross-evaluation matrix is generated automatically, covering every pairing of integrated models and benchmarks.
- Support for Multiple Benchmarks: The framework currently supports 14 simulation benchmarks and six model servers, making it versatile for various research applications.
- Performance Optimization: Parallel evaluation is achieved through episode sharding and batch inference, resulting in up to 47x wall-clock speedup. For instance, the framework can complete 2,000 LIBERO episodes in approximately 18 minutes.
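The features listed above can be sketched together in a few lines. This is a toy illustration under stated assumptions, not the vla-eval API: the source confirms only that models implement predict() and that benchmarks implement four methods, so the four method names used here (episodes, reset, step, success), the toy classes, and the round-robin sharding scheme are all hypothetical. Shards are run sequentially in this sketch, where a real harness would run them in parallel.

```python
from itertools import product
from typing import Any, Protocol


class Model(Protocol):
    """Model side: a single predict() method, as stated in the source."""

    def predict(self, observation: dict[str, Any]) -> list[float]: ...


class Benchmark(Protocol):
    """Benchmark side: four methods. These names are hypothetical."""

    def episodes(self) -> list[int]: ...
    def reset(self, episode_id: int) -> dict[str, Any]: ...
    def step(self, action: list[float]) -> tuple[dict[str, Any], bool]: ...
    def success(self) -> bool: ...


class ConstantModel:
    """Toy model that always outputs the same one-dimensional action."""

    def __init__(self, value: float) -> None:
        self.value = value

    def predict(self, observation: dict[str, Any]) -> list[float]:
        return [self.value]


class ThresholdBenchmark:
    """Toy benchmark: succeed if summed actions reach `goal` within `horizon` steps."""

    def __init__(self, n_episodes: int, goal: float, horizon: int = 3) -> None:
        self.n_episodes, self.goal, self.horizon = n_episodes, goal, horizon
        self.total, self.t = 0.0, 0

    def episodes(self) -> list[int]:
        return list(range(self.n_episodes))

    def reset(self, episode_id: int) -> dict[str, Any]:
        self.total, self.t = 0.0, 0
        return {"episode": episode_id, "t": 0}

    def step(self, action: list[float]) -> tuple[dict[str, Any], bool]:
        self.total += action[0]
        self.t += 1
        done = self.t >= self.horizon or self.total >= self.goal
        return {"t": self.t}, done

    def success(self) -> bool:
        return self.total >= self.goal


def run_episode(model: Model, bench: Benchmark, episode_id: int) -> bool:
    obs, done = bench.reset(episode_id), False
    while not done:
        obs, done = bench.step(model.predict(obs))
    return bench.success()


def shard(episode_ids: list[int], num_shards: int) -> list[list[int]]:
    # Round-robin split so shards differ in size by at most one episode.
    return [episode_ids[i::num_shards] for i in range(num_shards)]


def cross_evaluate(models: dict[str, Model], benchmarks: dict[str, Benchmark],
                   num_shards: int = 2) -> dict[tuple[str, str], float]:
    matrix = {}
    for (m_name, model), (b_name, bench) in product(models.items(), benchmarks.items()):
        shards = shard(bench.episodes(), num_shards)
        # Sequential here; each shard is an independent unit of parallel work.
        results = [run_episode(model, bench, ep) for s in shards for ep in s]
        matrix[(m_name, b_name)] = sum(results) / len(results)
    return matrix


models = {"weak": ConstantModel(0.25), "strong": ConstantModel(1.0)}
benchmarks = {"easy": ThresholdBenchmark(4, goal=0.5), "hard": ThresholdBenchmark(4, goal=2.0)}
matrix = cross_evaluate(models, benchmarks)
for pair, rate in matrix.items():
    print(pair, rate)
```

The point of the sketch is that adding one more entry to either dictionary extends the whole matrix without touching the other side, which is what keeps the integration cost per model or per benchmark constant.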
Validation and Reproduction
To validate the effectiveness of the vla-eval framework, we successfully reproduced published scores across six VLA codebases and three different benchmarks. This process helped document previously undocumented pitfalls, thereby enhancing the reliability of VLA evaluations.
VLA Leaderboard
In addition to the evaluation harness, we are excited to release a VLA leaderboard that aggregates 657 published results across 17 benchmarks. This leaderboard serves as a valuable resource for researchers looking to compare their models against established benchmarks and scores.
Accessing vla-eval
The framework, the evaluation configurations, and all reproduction results are publicly available.
Conclusion
vla-eval makes the evaluation of Vision-Language-Action models more efficient and accessible. By reducing the per-benchmark integration burden, it supports more comprehensive and rigorous evaluation across models and benchmarks.
