Rays as Pixels: Learning A Joint Distribution of Videos and Camera Trajectories
arXiv:2604.09429v1 (cross-listed)
Abstract
Recovering camera parameters from images and rendering scenes from novel viewpoints have long been treated as separate tasks in computer vision and graphics. This separation breaks down when image coverage is sparse or poses are ambiguous, since each task needs what the other produces. We propose Rays as Pixels, a Video Diffusion Model (VDM) that learns a joint distribution over videos and camera trajectories.
Key Features of Rays as Pixels
The model rests on three design choices:
- Dense Ray Pixels Representation: Each camera is represented as dense ray pixels, referred to as “raxels,” so that camera geometry is expressed per pixel, alongside the video frames it is denoised with.
- Decoupled Self-Cross Attention Mechanism: A Decoupled Self-Cross Attention mechanism denoises the raxels jointly with the video frames.
- Multi-Task Capability: A single trained model performs three distinct tasks:
  - Predicting camera trajectories from video inputs.
  - Jointly generating video and camera trajectory from given images.
  - Generating video based on input images along a specified camera trajectory.
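To make the dense ray pixel idea concrete, the sketch below computes one ray per pixel for a pinhole camera. It is a minimal illustration rather than the paper's encoding: the function name `ray_map`, the camera-to-world pose convention, and the (origin, direction) layout are all assumptions, since this summary does not say how raxels are parameterized.

```python
import math


def ray_map(K, R, t, height, width):
    """Return a height x width grid of (origin, direction) rays in world space.

    K: 3x3 intrinsics [[fx, 0, cx], [0, fy, cy], [0, 0, 1]] as nested lists.
    R: 3x3 camera-to-world rotation; t: camera centre in world coordinates.
    Assumptions (not from the paper): pinhole camera, no skew, rays pass
    through pixel centres, directions are unit length.
    """
    fx, fy = K[0][0], K[1][1]
    cx, cy = K[0][2], K[1][2]
    origin = tuple(t)
    rays = []
    for v in range(height):
        row = []
        for u in range(width):
            # Back-project the pixel centre to a camera-space direction.
            d_cam = ((u + 0.5 - cx) / fx, (v + 0.5 - cy) / fy, 1.0)
            # Rotate the direction into world space.
            d = [sum(R[i][j] * d_cam[j] for j in range(3)) for i in range(3)]
            norm = math.sqrt(sum(c * c for c in d))
            row.append((origin, tuple(c / norm for c in d)))
        rays.append(row)
    return rays
```

Every pixel of a frame then carries a ray, so a camera becomes an image-shaped array that can sit next to the frame itself. Stacking one such grid per frame gives a trajectory in the same layout as the video.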
Evaluation and Results
The model is evaluated with a closed-loop self-consistency test: it predicts a camera trajectory from a video and also generates views based on its own predictions, and the two directions are then compared. The forward and inverse predictions agree closely.
Trajectory prediction also requires far fewer denoising steps than video generation. Even a few denoising steps suffice to achieve self-consistency.
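One way to quantify agreement in such a closed-loop test is to compare two camera trajectories pose by pose. The sketch below is an assumption about how that comparison could be scored, not the metric used in the paper: `rotation_angle_deg` and `trajectory_agreement` are hypothetical helpers, and both trajectories are assumed to be already expressed in the same coordinate frame and scale.

```python
import math


def rotation_angle_deg(Ra, Rb):
    """Geodesic angle in degrees between two 3x3 rotation matrices."""
    # trace(Ra^T Rb) equals the sum of element-wise products of Ra and Rb.
    trace = sum(Ra[i][j] * Rb[i][j] for i in range(3) for j in range(3))
    # Clamp to guard against values slightly outside [-1, 1] from rounding.
    cos_theta = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_theta))


def trajectory_agreement(traj_a, traj_b):
    """Mean rotation error (degrees) and mean translation error.

    Each trajectory is a list of (R, t) poses with R a 3x3 nested list and
    t a length-3 sequence. Hypothetical scoring, assuming aligned frames.
    """
    if not traj_a or len(traj_a) != len(traj_b):
        raise ValueError("trajectories must be non-empty and of equal length")
    n = len(traj_a)
    rot = sum(rotation_angle_deg(Ra, Rb)
              for (Ra, _), (Rb, _) in zip(traj_a, traj_b)) / n
    trans = sum(math.dist(ta, tb)
                for (_, ta), (_, tb) in zip(traj_a, traj_b)) / n
    return rot, trans
```

In a closed-loop setting, `traj_a` would be the trajectory the video was generated along and `traj_b` the trajectory predicted back from that video. Low values for both numbers indicate that the two directions are consistent.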
Applications in Pose Estimation and Video Generation
The model applies to both pose estimation and camera-controlled video generation:
- Pose Estimation: Because the model predicts camera trajectories from video, it can serve as a pose estimator, a capability that matters in robotics and augmented reality.
- Camera-Controlled Video Generation: Because the model generates video from input images along a specified camera trajectory, it can support creative video production and interactive media.
Conclusion
Rays as Pixels treats camera trajectory prediction and camera-controlled video generation as one problem by learning a joint distribution over videos and camera trajectories. A single trained model covers trajectory prediction, joint generation of video and trajectory, and generation along a specified trajectory, which addresses the sparse-coverage and ambiguous-pose cases where treating the tasks separately breaks down.
