StarVLA: Modular Codebase for Vision-Language-Action Models

StarVLA: A Lego-like Codebase for Vision-Language-Action Model Developing

Summary: arXiv:2604.05014v1

Abstract: Building generalist embodied agents requires integrating perception, language understanding, and action, which are core capabilities addressed by Vision-Language-Action (VLA) approaches based on multimodal foundation models, including recent advances in vision-language models and world models. Despite rapid progress, VLA methods remain fragmented across incompatible architectures, codebases, and evaluation protocols, hindering principled comparison and reproducibility. We present StarVLA, an open-source codebase for VLA research.

Key Features of StarVLA

StarVLA addresses the fragmentation of VLA research in three ways:

Modular Backbone-Action Architecture: StarVLA separates the backbone from the action head. It supports both Vision-Language Model (VLM) backbones such as Qwen-VL and world-model backbones such as Cosmos, so researchers can swap the backbone and the action head independently.
Reusable Training Strategies: The codebase provides reusable training strategies, including cross-embodiment learning and multimodal co-training, that apply consistently across the supported paradigms.
Integrated Major Benchmarks: StarVLA integrates major benchmarks, including LIBERO, SimplerEnv, RoboTwin 2.0, RoboCasa-GR1, and BEHAVIOR-1K, behind a unified evaluation interface that supports both simulation and real-robot deployment.
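
To make the idea of independently swappable backbones and action heads concrete, here is a minimal sketch in plain Python. It is not StarVLA's actual API: the names Backbone, ActionHead, VLAPolicy, ToyVLMBackbone, ToyWorldModelBackbone, and LinearActionHead are all hypothetical, and the toy classes compute trivial features instead of running a real model. It only illustrates how a policy can be composed from two parts that share a small interface.

```python
from dataclasses import dataclass
from typing import List, Protocol, Sequence


class Backbone(Protocol):
    """Hypothetical interface: maps an image and an instruction to features."""

    def encode(self, image: Sequence[float], instruction: str) -> List[float]: ...


class ActionHead(Protocol):
    """Hypothetical interface: maps backbone features to an action vector."""

    def predict(self, features: Sequence[float]) -> List[float]: ...


class ToyVLMBackbone:
    """Stand-in for a VLM backbone (e.g. Qwen-VL); runs no real model."""

    def encode(self, image, instruction):
        # Toy features: mean pixel value and instruction length.
        return [sum(image) / len(image), float(len(instruction))]


class ToyWorldModelBackbone:
    """Stand-in for a world-model backbone (e.g. Cosmos); runs no real model."""

    def encode(self, image, instruction):
        # Toy features: max and min pixel value; the instruction is ignored.
        return [max(image), min(image)]


class LinearActionHead:
    """Toy action head: scales the feature sum and repeats it per action dim."""

    def __init__(self, scale: float, action_dim: int):
        self.scale = scale
        self.action_dim = action_dim

    def predict(self, features):
        return [self.scale * sum(features)] * self.action_dim


@dataclass
class VLAPolicy:
    """Composes any backbone with any action head that fit the interfaces."""

    backbone: Backbone
    action_head: ActionHead

    def act(self, image, instruction):
        features = self.backbone.encode(image, instruction)
        return self.action_head.predict(features)


image = [0.0, 0.5, 1.0]
head = LinearActionHead(scale=0.5, action_dim=2)

# Same action head, two different backbones: only one argument changes.
vlm_policy = VLAPolicy(ToyVLMBackbone(), head)
wm_policy = VLAPolicy(ToyWorldModelBackbone(), head)

print(vlm_policy.act(image, "pick"))  # [2.25, 2.25]
print(wm_policy.act(image, "pick"))   # [0.5, 0.5]
```

Because VLAPolicy depends only on the two small interfaces, replacing the backbone or the action head is a one-argument change, and a shared evaluation loop could call act() on either policy without knowing which backbone is inside.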

Performance and Reproducibility

StarVLA ships simple, fully reproducible single-benchmark training recipes that require minimal data engineering. According to the authors, these recipes already match or surpass prior methods on multiple benchmarks, with both VLM and world-model backbones. This lowers the barrier to reproducing existing methods and prototyping new ones.

Future Developments

StarVLA is actively maintained, and further updates are planned as the project evolves. Researchers and developers are encouraged to use the framework and contribute to it. The code and documentation are available at https://github.com/starVLA/starVLA.

Conclusion

In summary, StarVLA gives VLA research a shared, modular framework: swappable backbones and action heads, reusable training strategies, and a unified evaluation interface across major benchmarks. Its aim is to make VLA methods easier to compare, reproduce, and extend, and the authors' commitment to ongoing maintenance supports that goal.

RichlyAI Blog AI Guide, Tutorials, Industrial Insights, & more!
