StarVLA: A Lego-like Codebase for Vision-Language-Action Model Developing
Source: arXiv:2604.05014v1 (cross-listed)
Abstract: Building generalist embodied agents requires integrating perception, language understanding, and action, which are core capabilities addressed by Vision-Language-Action (VLA) approaches based on multimodal foundation models, including recent advances in vision-language models and world models. Despite rapid progress, VLA methods remain fragmented across incompatible architectures, codebases, and evaluation protocols, hindering principled comparison and reproducibility. We present StarVLA, an open-source codebase for VLA research.
Key Features of StarVLA
StarVLA addresses the fragmentation of VLA research in three ways:
- Modular Backbone-Action Architecture: StarVLA separates the backbone from the action head. It supports both Vision-Language Model (VLM) backbones such as Qwen-VL and world-model backbones such as Cosmos, and researchers can swap the backbone and the action head independently.
- Reusable Training Strategies: The codebase provides reusable training strategies, including cross-embodiment learning and multimodal co-training, that apply consistently across the supported paradigms.
- Integrated Major Benchmarks: StarVLA integrates major benchmarks, including LIBERO, SimplerEnv, RoboTwin 2.0, RoboCasa-GR1, and BEHAVIOR-1K, behind a unified evaluation interface that supports both simulation and real-robot deployment.
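To make the backbone/action-head split and the shared evaluation loop concrete, here is a minimal sketch in plain Python. It is not StarVLA's actual API: every class, function, and registry key below (`Backbone`, `ActionHead`, `Policy`, `build_policy`, `evaluate`, `toy_vlm`, and so on) is a hypothetical name chosen for illustration, and the toy backbones only stand in for real models such as Qwen-VL or Cosmos.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Protocol, Sequence, Tuple

# NOTE: all names here are hypothetical illustrations, not StarVLA's real API.
Features = List[float]
Action = List[float]


class Backbone(Protocol):
    def encode(self, image: Sequence[float], instruction: str) -> Features: ...


class ActionHead(Protocol):
    def predict(self, features: Features) -> Action: ...


class ToyVLMBackbone:
    """Stand-in for a VLM backbone: fuses image and text into two features."""

    def encode(self, image: Sequence[float], instruction: str) -> Features:
        return [sum(image) / len(image), float(len(instruction.split()))]


class ToyWorldModelBackbone:
    """Stand-in for a world-model backbone that emits three features."""

    def encode(self, image: Sequence[float], instruction: str) -> Features:
        return [max(image), min(image), float(len(instruction))]


class MeanActionHead:
    """Maps a feature vector of any length to a fixed-size action vector."""

    def __init__(self, action_dim: int = 7) -> None:
        self.action_dim = action_dim

    def predict(self, features: Features) -> Action:
        mean = sum(features) / len(features)
        return [mean] * self.action_dim


@dataclass
class Policy:
    """A policy is a backbone composed with an action head."""

    backbone: Backbone
    head: ActionHead

    def act(self, image: Sequence[float], instruction: str) -> Action:
        return self.head.predict(self.backbone.encode(image, instruction))


# Registries let either component be swapped by name, independently.
BACKBONES: Dict[str, Callable[[], Backbone]] = {
    "toy_vlm": ToyVLMBackbone,
    "toy_world_model": ToyWorldModelBackbone,
}
HEADS: Dict[str, Callable[[], ActionHead]] = {"mean": MeanActionHead}


def build_policy(backbone: str, head: str) -> Policy:
    return Policy(backbone=BACKBONES[backbone](), head=HEADS[head]())


def evaluate(
    policy: Policy,
    episodes: Sequence[Tuple[Sequence[float], str]],
    is_success: Callable[[Action], bool],
) -> float:
    """One evaluation loop for any policy: returns the success rate."""
    successes = sum(is_success(policy.act(img, text)) for img, text in episodes)
    return successes / len(episodes)


if __name__ == "__main__":
    episodes = [([0.0, 1.0], "pick up the cup")]
    for name in BACKBONES:
        policy = build_policy(name, "mean")
        print(name, evaluate(policy, episodes, lambda a: len(a) == 7))
```

Because the action head only depends on the feature vector and the evaluation loop only depends on `Policy.act`, changing the backbone name is the only edit needed to compare the two paradigms under the same protocol. A real codebase would replace the toy classes with neural modules and the `is_success` callback with a benchmark-specific environment.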
Performance and Reproducibility
StarVLA ships simple, fully reproducible single-benchmark training recipes that require minimal data engineering. These recipes already match or surpass prior methods on multiple benchmarks with both VLM and world-model backbones, which lowers the barrier to reproducing existing methods and prototyping new ones.
Future Developments
StarVLA is actively maintained and will be expanded with ongoing updates. Researchers and developers are encouraged to use the framework and contribute to it. The code and documentation are available at https://github.com/starVLA/starVLA.
Conclusion
In summary, StarVLA gives the VLA community a modular, open-source framework that unifies backbones, training strategies, and benchmark evaluation. It aims to make methods easier to reproduce, compare, and extend, and the project continues to be maintained and expanded.
