A Lightweight Library for Energy-Based Joint-Embedding Predictive Architectures
arXiv:2602.03604v3
Abstract: We present EB-JEPA, an open-source library for learning representations and world models using Joint-Embedding Predictive Architectures (JEPAs). JEPAs learn to predict in representation space rather than pixel space, avoiding the pitfalls of generative modeling while capturing semantically meaningful features suitable for downstream tasks.
Overview of EB-JEPA
EB-JEPA provides modular, self-contained implementations that show how representation learning techniques developed for image-level self-supervised learning transfer to video, where temporal dynamics add modeling complexity.
Key Features
- Modular Implementations: The library offers a set of easy-to-use modules that facilitate quick experimentation and learning.
- Single-GPU Training: Each example trains on a single GPU within a few hours, which keeps the library accessible to researchers and educators.
- Energy-Based Learning: The library focuses on energy-based self-supervised learning, in which the model predicts in representation space rather than pixel space in order to capture semantically meaningful features.
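To make the energy-based view concrete, here is a minimal NumPy sketch of a JEPA-style energy: the squared distance between a predicted embedding and the actual target embedding. The `encode`, `predict`, and `jepa_energy` functions, the layer shapes, and the toy data are assumptions made for illustration. They are not the EB-JEPA API.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(x, W):
    """Hypothetical encoder: one linear layer followed by tanh."""
    return np.tanh(x @ W)

def predict(z, P):
    """Hypothetical predictor: a linear map in representation space."""
    return z @ P

def jepa_energy(x, y, W, P):
    """Mean squared distance between predicted and actual target embeddings."""
    z_x = encode(x, W)
    z_y = encode(y, W)
    return float(np.mean(np.sum((predict(z_x, P) - z_y) ** 2, axis=1)))

# Toy data: a compatible target is a lightly perturbed view of x.
x = rng.normal(size=(8, 16))
y_compatible = x + 0.01 * rng.normal(size=x.shape)
y_unrelated = rng.normal(size=x.shape)

W = rng.normal(size=(16, 4)) / 4.0  # illustrative encoder weights
P = np.eye(4)                       # identity predictor, for illustration only

energy_compatible = jepa_energy(x, y_compatible, W, P)
energy_unrelated = jepa_energy(x, y_unrelated, W, P)
```

In this toy setup the compatible pair receives lower energy than the unrelated pair. Note that the comparison happens entirely between embeddings, never between pixels.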
Applications and Results
In ablation studies on the CIFAR-10 dataset, probing the learned representations yields 91% accuracy, which indicates that the model learns useful features.
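Probing means fitting a simple classifier on top of frozen representations and reporting its held-out accuracy. The sketch below illustrates the idea on synthetic features that stand in for encoder outputs. The helpers `fit_linear_probe` and `probe_accuracy` are hypothetical, and closed-form ridge regression is used here for brevity in place of whatever probe the library actually trains.

```python
import numpy as np

def fit_linear_probe(features, labels, num_classes, ridge=1e-3):
    """Fit a ridge-regression classifier on frozen features (closed form)."""
    X = np.hstack([features, np.ones((features.shape[0], 1))])  # bias column
    Y = np.eye(num_classes)[labels]                             # one-hot targets
    A = X.T @ X + ridge * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ Y)

def probe_accuracy(weights, features, labels):
    """Fraction of samples whose highest-scoring class matches the label."""
    X = np.hstack([features, np.ones((features.shape[0], 1))])
    return float(np.mean(np.argmax(X @ weights, axis=1) == labels))

rng = np.random.default_rng(0)

# Stand-in for frozen encoder outputs: two well-separated Gaussian clusters.
labels = rng.integers(0, 2, size=200)
features = rng.normal(size=(200, 8)) + 3.0 * labels[:, None]

# Fit on the first 150 samples, evaluate on the remaining 50.
weights = fit_linear_probe(features[:150], labels[:150], num_classes=2)
accuracy = probe_accuracy(weights, features[150:], labels[150:])
```

Because the encoder stays frozen, probe accuracy measures how linearly separable the classes already are in the representation, not how well the classifier itself can learn.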
Extending to Video
To extend JEPAs to video, we include a multi-step prediction example on the Moving MNIST dataset. It illustrates how the principles of representation learning adapt to the challenges of temporal modeling.
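Multi-step prediction can be sketched as an autoregressive rollout in representation space: the predictor's output is fed back as its next input, and the loss compares each predicted embedding with the embedding of the corresponding future frame. The `rollout` and `multistep_loss` helpers and the rotation dynamics below are assumptions chosen for illustration, not the library's Moving MNIST code.

```python
import numpy as np

def rollout(z0, predictor, steps):
    """Apply the predictor autoregressively, feeding each output back as input."""
    z, outputs = z0, []
    for _ in range(steps):
        z = predictor(z)
        outputs.append(z)
    return np.stack(outputs)  # shape: (steps, dim)

def multistep_loss(z0, targets, predictor):
    """Mean squared error between rolled-out and target embeddings."""
    preds = rollout(z0, predictor, steps=len(targets))
    return float(np.mean((preds - targets) ** 2))

# Toy latent dynamics: a fixed rotation in a 2-D representation space.
theta = 0.1
A = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
z0 = np.array([1.0, 0.0])
targets = np.stack([np.linalg.matrix_power(A, k) @ z0 for k in range(1, 6)])

loss_true = multistep_loss(z0, targets, lambda z: A @ z)  # correct dynamics
loss_identity = multistep_loss(z0, targets, lambda z: z)  # "nothing moves" baseline
```

A predictor that matches the true dynamics drives the multi-step loss to numerical zero, while the static baseline accumulates error that grows with the horizon.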
Action-Conditioned World Models
We also show how these learned representations can drive action-conditioned world models. Our experiments achieved a 97% planning success rate on the Two Rooms navigation task, which points to the potential of JEPAs in settings where decision-making matters.
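One simple way to plan with an action-conditioned world model is random shooting: sample candidate action sequences, roll each one out in representation space, and keep the sequence whose final embedding lands closest to the goal embedding. The sketch below assumes a toy additive dynamics function `world_model` and a hypothetical `plan_random_shooting` helper. The source does not state which planner the library uses, so this is only one possible instance.

```python
import numpy as np

def world_model(z, a):
    """Hypothetical action-conditioned latent dynamics: z_next = z + a."""
    return z + a

def plan_random_shooting(z0, z_goal, horizon, num_candidates, rng):
    """Sample action sequences, roll each out in latent space, keep the best."""
    actions = rng.uniform(-1.0, 1.0, size=(num_candidates, horizon, z0.shape[0]))
    z = np.tile(z0, (num_candidates, 1))
    for t in range(horizon):
        z = world_model(z, actions[:, t])
    cost = np.sum((z - z_goal) ** 2, axis=1)  # squared distance to goal embedding
    best = int(np.argmin(cost))
    return actions[best], float(cost[best])

rng = np.random.default_rng(0)
z0 = np.zeros(2)                 # embedding of the current observation (toy)
z_goal = np.array([2.0, -1.0])   # embedding of the goal observation (toy)

best_actions, best_cost = plan_random_shooting(
    z0, z_goal, horizon=4, num_candidates=2000, rng=rng
)
initial_cost = float(np.sum((z0 - z_goal) ** 2))
```

The planner never decodes back to observations. It scores candidates purely by distance between embeddings, which is what makes a representation-space world model usable for control.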
Importance of Regularization
Our ablation studies show that each regularization component is critical for preventing representation collapse, the failure mode in which the encoder maps every input to the same embedding. The findings suggest that careful tuning of these components can significantly improve model performance.
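The summary does not name the regularization components, so the following sketch assumes VICReg-style variance and covariance penalties as one common way to discourage collapse. The function names and constants are illustrative only. The variance term penalizes embedding dimensions whose standard deviation across the batch falls below a target, and the covariance term penalizes redundant, correlated dimensions.

```python
import numpy as np

def variance_penalty(z, target_std=1.0, eps=1e-4):
    """Hinge penalty, positive when a dimension's batch std is below target_std."""
    std = np.sqrt(z.var(axis=0) + eps)
    return float(np.mean(np.maximum(0.0, target_std - std)))

def covariance_penalty(z):
    """Sum of squared off-diagonal covariances, scaled by embedding dimension."""
    n, d = z.shape
    zc = z - z.mean(axis=0)
    cov = (zc.T @ zc) / (n - 1)
    off_diag = cov - np.diag(np.diag(cov))
    return float(np.sum(off_diag ** 2) / d)

rng = np.random.default_rng(0)
healthy = rng.normal(size=(256, 8))             # spread-out, decorrelated embeddings
collapsed = np.full((256, 8), 0.5)              # every input maps to the same vector
redundant = np.repeat(healthy[:, :1], 8, axis=1)  # all dimensions carry one signal

var_healthy = variance_penalty(healthy)
var_collapsed = variance_penalty(collapsed)
cov_healthy = covariance_penalty(healthy)
cov_redundant = covariance_penalty(redundant)
```

In this toy comparison the fully collapsed batch incurs a large variance penalty, and the redundant batch incurs a large covariance penalty. That is why dropping either term can leave a different kind of collapse unpenalized.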
Conclusion and Future Work
In summary, EB-JEPA is a tool for researchers and practitioners interested in representation learning and world modeling. The library’s design and results illustrate the potential of energy-based self-supervised learning methods across image, video, and planning tasks. We encourage the community to explore the code available at https://github.com/facebookresearch/eb_jepa and contribute to its ongoing development.
