Rays as Pixels: Joint Video and Camera Trajectory Model

Rays as Pixels: Learning A Joint Distribution of Videos and Camera Trajectories

Source: arXiv:2604.09429v1 (announce type: cross)

Abstract

Recovering camera parameters from images and rendering scenes from novel viewpoints have long been treated as separate tasks in computer vision and graphics. This separation breaks down when image coverage is sparse or poses are ambiguous, since each task needs what the other produces. We propose Rays as Pixels, a Video Diffusion Model (VDM) that learns a joint distribution over videos and camera trajectories.

Key Features of Rays as Pixels

The model's main contributions are:

  • Dense Ray Pixels Representation: Each camera is represented as dense ray pixels, referred to as “raxels,” so the camera trajectory takes a dense, per-pixel form that can be denoised alongside the video frames.
  • Decoupled Self-Cross Attention Mechanism: Raxels are denoised jointly with the video frames through a Decoupled Self-Cross Attention mechanism.
  • Multi-Task Capability: A single trained model performs three tasks:
    • Predicting camera trajectories from video inputs.
    • Jointly generating video and camera trajectory from input images.
    • Generating video from input images along a specified camera trajectory.
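
To make the raxel idea concrete, here is a minimal sketch of one way to turn a single camera into a dense, image-shaped ray map. It is an illustration, not the paper's implementation: this summary does not say how raxels are parameterized, so the 6-channel layout (ray origin plus unit direction), the pinhole intrinsics K, the camera-to-world pose convention, and the function name camera_to_ray_map are all assumptions.

```python
import numpy as np


def camera_to_ray_map(K, cam_to_world, height, width):
    """Return a (height, width, 6) array: ray origin (3) + unit direction (3).

    Assumptions (not from the source): pinhole intrinsics K (3x3), a 4x4
    camera-to-world pose, and rays cast through pixel centers.
    """
    # Pixel-center coordinates in homogeneous form, shape (H, W, 3).
    u, v = np.meshgrid(np.arange(width) + 0.5, np.arange(height) + 0.5)
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1)

    # Back-project through K^-1 to get ray directions in the camera frame.
    dirs_cam = pixels @ np.linalg.inv(K).T

    # Rotate into the world frame and normalize to unit length.
    R, t = cam_to_world[:3, :3], cam_to_world[:3, 3]
    dirs_world = dirs_cam @ R.T
    dirs_world /= np.linalg.norm(dirs_world, axis=-1, keepdims=True)

    # Every pixel of one camera shares the same origin (the camera center).
    origins = np.broadcast_to(t, dirs_world.shape)
    return np.concatenate([origins, dirs_world], axis=-1)


# Hypothetical camera: 100 px focal length, principal point at the image center.
K = np.array([[100.0, 0.0, 32.0],
              [0.0, 100.0, 24.0],
              [0.0, 0.0, 1.0]])
rays = camera_to_ray_map(K, np.eye(4), height=48, width=64)
print(rays.shape)  # (48, 64, 6)
```

Because the result has the same height and width as a frame, one such map per frame can be stacked next to the video, which is presumably what makes it practical to denoise camera and video together.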

Evaluation and Results

Because the model can both predict a camera trajectory from a video and generate views conditioned on a trajectory, the authors evaluate it with a closed-loop self-consistency test: the model generates views based on its own trajectory predictions, and its forward and inverse predictions are compared. The results show a high level of agreement between the two directions, indicating that the model is consistent with itself.

Notably, trajectory prediction requires significantly fewer denoising steps than video generation. Even a small number of denoising steps suffices for self-consistency, which makes trajectory prediction the cheaper of the two at inference time.
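
The closed-loop test described above ultimately comes down to comparing two camera trajectories: the one used to generate a video and the one predicted back from that video. The sketch below shows one common way to measure that gap. It is a minimal example under stated assumptions, not the paper's evaluation code: the metrics (mean geodesic rotation angle and mean translation distance), the (N, 4, 4) camera-to-world layout, and the names rotation_angle_deg and trajectory_agreement are illustrative, and both trajectories are assumed to already share a coordinate frame and scale.

```python
import numpy as np


def rotation_angle_deg(R_a, R_b):
    """Geodesic angle in degrees between two 3x3 rotation matrices."""
    cos = (np.trace(R_a.T @ R_b) - 1.0) / 2.0
    # Clip guards against values slightly outside [-1, 1] from rounding.
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))


def trajectory_agreement(traj_a, traj_b):
    """Mean rotation gap (deg) and mean translation gap (scene units)
    between two trajectories given as (N, 4, 4) camera-to-world poses."""
    rot = [rotation_angle_deg(a[:3, :3], b[:3, :3])
           for a, b in zip(traj_a, traj_b)]
    trans = np.linalg.norm(traj_a[:, :3, 3] - traj_b[:, :3, 3], axis=-1)
    return float(np.mean(rot)), float(np.mean(trans))


# Stand-in data: in a real closed loop, `target` would drive video generation
# and `recovered` would be the trajectory predicted back from that video.
target = np.tile(np.eye(4), (5, 1, 1))
target[:, 0, 3] = np.arange(5)  # camera slides along the x axis

theta = np.radians(2.0)  # inject a 2 degree yaw error
yaw = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                [np.sin(theta), np.cos(theta), 0.0],
                [0.0, 0.0, 1.0]])
recovered = target.copy()
recovered[:, :3, :3] = yaw
recovered[:, 1, 3] += 0.1  # inject a 0.1 unit offset along y

rot_gap, trans_gap = trajectory_agreement(target, recovered)
print(rot_gap, trans_gap)
```

Since a 2 degree yaw and a 0.1 unit offset were injected, the two gaps should come out close to those values. Trajectories estimated from real video are usually only defined up to a global transform and scale, so an alignment step would normally come before this comparison.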

Applications in Pose Estimation and Video Generation

The model applies to two areas, pose estimation and camera-controlled video generation:

  • Pose Estimation: Predicting camera trajectories from video makes the model usable for pose estimation, a core requirement in robotics and augmented reality.
  • Camera-Controlled Video Generation: Generating video from input images along a specified camera trajectory gives explicit control over the camera path, which is useful for video production and interactive media.

Conclusion

Rays as Pixels treats camera recovery and novel-view generation as one joint modeling problem rather than two separate tasks. A single model covers trajectory prediction, joint generation of video and trajectory, and camera-controlled generation, and its forward and inverse predictions agree under the closed-loop self-consistency test.


Lazarus Omolua
https://richlyai.com/blog