StarVLA: Modular Codebase for Vision-Language-Action Models

StarVLA: A Lego-like Codebase for Vision-Language-Action Model Developing

Source: arXiv:2604.05014v1 (announce type: cross)

Abstract: Building generalist embodied agents requires integrating perception, language understanding, and action, which are core capabilities addressed by Vision-Language-Action (VLA) approaches based on multimodal foundation models, including recent advances in vision-language models and world models. Despite rapid progress, VLA methods remain fragmented across incompatible architectures, codebases, and evaluation protocols, hindering principled comparison and reproducibility. We present StarVLA, an open-source codebase for VLA research.

Key Features of StarVLA

StarVLA addresses this fragmentation in three ways:

  • Modular Backbone-Action Architecture: StarVLA separates the backbone from the action head. It supports both Vision-Language Model (VLM) backbones such as Qwen-VL and world-model backbones such as Cosmos, and researchers can swap the backbone and the action head independently.
  • Reusable Training Strategies: The codebase provides reusable training strategies, including cross-embodiment learning and multimodal co-training, that apply consistently across the supported paradigms.
  • Integrated Major Benchmarks: StarVLA integrates major benchmarks, including LIBERO, SimplerEnv, RoboTwin 2.0, RoboCasa-GR1, and BEHAVIOR-1K, behind a unified evaluation interface that supports both simulation and real-robot deployment.
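
To make the first and third points concrete, here is a minimal Python sketch of what independently swappable backbones and action heads behind one evaluation loop could look like. It is not StarVLA's actual API: every name in it (Backbone, ActionHead, Policy, ToyReachEnv, evaluate, and the toy classes) is hypothetical, and the "models" are trivial arithmetic stand-ins rather than Qwen-VL or Cosmos.

```python
from dataclasses import dataclass
from typing import List, Protocol, Sequence, Tuple

# All names below are hypothetical illustrations, not StarVLA identifiers.
Observation = Tuple[List[float], str]  # (flattened image, language instruction)


class Backbone(Protocol):
    """Anything that turns an image and an instruction into features."""
    def encode(self, image: Sequence[float], instruction: str) -> List[float]: ...


class ActionHead(Protocol):
    """Anything that turns backbone features into an action vector."""
    def predict(self, features: Sequence[float]) -> List[float]: ...


@dataclass
class ToyVLMBackbone:
    feature_dim: int = 4

    def encode(self, image, instruction):
        # Stand-in for a VLM: combines visual and language information.
        mean_pixel = sum(image) / len(image)
        return [mean_pixel + 0.01 * len(instruction)] * self.feature_dim


@dataclass
class ToyWorldModelBackbone:
    feature_dim: int = 4

    def encode(self, image, instruction):
        # Stand-in for a world model: uses only the visual input here.
        return [max(image) - min(image)] * self.feature_dim


@dataclass
class ToyActionHead:
    action_dim: int = 7
    scale: float = 0.5

    def predict(self, features):
        # Pool the features and broadcast to a fixed-size action vector.
        pooled = sum(features) / len(features)
        return [self.scale * pooled] * self.action_dim


@dataclass
class Policy:
    backbone: Backbone
    head: ActionHead

    def act(self, obs: Observation) -> List[float]:
        image, instruction = obs
        return self.head.predict(self.backbone.encode(image, instruction))


class ToyReachEnv:
    """Toy 1-D task: succeed once the accumulated action[0] reaches the goal."""

    def __init__(self, goal: float = 1.0):
        self.goal = goal
        self.position = 0.0

    def reset(self) -> Observation:
        self.position = 0.0
        return [0.2, 0.4, 0.6], "pick up the cup"

    def step(self, action: Sequence[float]) -> Tuple[Observation, bool]:
        self.position += action[0]
        obs = ([0.2, 0.4, 0.6], "pick up the cup")
        return obs, self.position >= self.goal


def evaluate(policy: Policy, env: ToyReachEnv,
             episodes: int = 3, max_steps: int = 5) -> float:
    """Run the same rollout loop for any policy and return its success rate."""
    successes = 0
    for _ in range(episodes):
        obs = env.reset()
        for _ in range(max_steps):
            obs, success = env.step(policy.act(obs))
            if success:
                successes += 1
                break
    return successes / episodes


# The same action head is reused while only the backbone changes.
head = ToyActionHead()
vlm_policy = Policy(ToyVLMBackbone(), head)
wm_policy = Policy(ToyWorldModelBackbone(), head)
```

Because Policy depends only on the two small interfaces, replacing the backbone does not touch the action head, and evaluate never needs to know which combination it is scoring. That separation is the property the feature list describes; how StarVLA implements it is documented in its repository.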

Performance and Reproducibility

StarVLA ships simple, fully reproducible single-benchmark training recipes that require minimal data engineering. According to the authors, these recipes already match or surpass prior methods on multiple benchmarks with both VLM and world-model backbones, which lowers the barrier to reproducing existing methods and prototyping new ones.

Future Developments

StarVLA is actively maintained, and the authors plan further updates as the project evolves. Researchers and developers are encouraged to use the framework and contribute to it. The code and documentation are available at https://github.com/starVLA/starVLA.

Conclusion

StarVLA gives VLA research a shared, modular codebase: interchangeable backbones and action heads, training strategies that carry across paradigms, and one evaluation interface for the major benchmarks. Together with its reproducible recipes and ongoing maintenance, this is intended to make VLA methods easier to compare, reproduce, and extend.

Lazarus Omolua
https://richlyai.com/blog