VERTIGO: Visual Preference Optimization for Cinematic Camera Trajectory Generation
arXiv:2604.02467v1
Abstract: Cinematic camera control relies on a tight feedback loop between director and cinematographer, where camera motion and framing are continuously reviewed and refined. Recent generative camera systems can produce diverse, text-conditioned trajectories, but they lack this “director in the loop” and have no explicit supervision of whether a shot is visually desirable. The result is camera motion that stays in-distribution but suffers from poor framing, off-screen characters, and undesirable visual aesthetics.
In this paper, we introduce VERTIGO, the first framework for visual preference optimization of camera trajectory generators. Our framework leverages a real-time graphics engine (Unity) to render 2D visual previews from generated camera motion. A cinematically fine-tuned vision-language model then scores these previews using our proposed cyclic semantic similarity mechanism, which aligns renders with text prompts. This process provides the visual preference signals for Direct Preference Optimization (DPO) post-training.
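To make the scoring step concrete, here is a minimal sketch of how scored previews could be turned into a preference pair. It rests on assumptions the abstract does not confirm: that the "cyclic" check works by having the vision-language model describe each render, embedding that description, and comparing it to the prompt embedding with cosine similarity. The function names, the margin threshold, and the embedding inputs are all hypothetical.

```python
import math
from typing import Optional, Sequence, Tuple

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_preference_pair(
    prompt_embedding: Vector,
    caption_embeddings: Sequence[Vector],
    min_margin: float = 0.05,  # hypothetical threshold, not from the paper
) -> Optional[Tuple[int, int]]:
    """Return (chosen_index, rejected_index) among candidate trajectories.

    ASSUMPTION: each entry of caption_embeddings embeds a description that
    a vision-language model produced for one rendered preview. Candidates
    whose scores are too close are skipped, since they carry little signal.
    """
    scores = [cosine_similarity(prompt_embedding, c) for c in caption_embeddings]
    best = max(range(len(scores)), key=scores.__getitem__)
    worst = min(range(len(scores)), key=scores.__getitem__)
    if scores[best] - scores[worst] < min_margin:
        return None
    return best, worst
```

Taking the best and worst candidates per prompt is only one way to form pairs; the paper may select them differently.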
Key Features of VERTIGO
- Real-time Graphics Engine Integration: Uses Unity to render 2D visual previews of generated camera motion.
- Cyclic Semantic Similarity Mechanism: A cinematically fine-tuned vision-language model scores each preview by how well the render aligns with the text prompt.
- Direct Preference Optimization: Uses the resulting visual preference signals for DPO post-training of camera trajectory generators.
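For readers unfamiliar with DPO, the standard pairwise objective can be sketched as follows. This is the generic DPO loss, not necessarily the exact variant used in VERTIGO. It assumes the generator and a frozen reference copy both provide a log-likelihood for each trajectory; if the generator does not expose exact log-likelihoods, some surrogate would be needed, and this summary does not say which one the authors use. The function names and the default beta are illustrative.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,  # illustrative value, not taken from the paper
) -> float:
    """Generic DPO loss for one (chosen, rejected) pair.

    loss = -log sigmoid(beta * [(log pi(y_w) - log pi_ref(y_w))
                                - (log pi(y_l) - log pi_ref(y_l))])
    """
    chosen_logratio = policy_logp_chosen - ref_logp_chosen
    rejected_logratio = policy_logp_rejected - ref_logp_rejected
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(m)) equals softplus(-m)
    return softplus(-margin)
```

When the policy and reference agree, the loss equals log 2. It falls as the policy assigns relatively more probability to the preferred trajectory than the reference does.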
Results and Findings
Quantitative evaluations and user studies, on both Unity renders and diffusion-based Camera-to-Video pipelines, show improvements in the following areas:
- Condition Adherence: Enhanced alignment between generated trajectories and specified text prompts.
- Framing Quality: Improved framing of characters and scenes, keeping crucial elements in frame.
- Perceptual Realism: Increased realism in visual aesthetics, making the shots more engaging for viewers.
- Character Off-screen Rate Reduction: The rate at which characters fall off-screen drops from 38% to nearly 0%, so characters almost always remain within the visual frame.
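As an illustration of what an off-screen rate measures, the sketch below counts the fraction of frames in which a character's position falls outside the camera's view frustum. The paper's exact definition is not given in this summary, so the per-frame counting, the single-point character model, and the default field of view and aspect ratio are all assumptions. Positions are taken to be already expressed in camera coordinates (x right, y up, z forward), and the field of view is vertical.

```python
import math
from typing import Iterable, Tuple

Point = Tuple[float, float, float]


def is_on_screen(
    point_cam: Point,
    vertical_fov_deg: float = 60.0,  # assumed default
    aspect: float = 16.0 / 9.0,      # assumed width / height
) -> bool:
    """True if a point in camera coordinates projects inside the image."""
    x, y, z = point_cam
    if z <= 0.0:
        # At or behind the camera plane: never visible.
        return False
    # Half-extent of the visible region at depth z.
    half_height = z * math.tan(math.radians(vertical_fov_deg) / 2.0)
    half_width = half_height * aspect
    return abs(x) <= half_width and abs(y) <= half_height


def off_screen_rate(points_cam: Iterable[Point]) -> float:
    """Fraction of frames in which the character point is not visible."""
    points = list(points_cam)
    if not points:
        return 0.0
    off = sum(1 for p in points if not is_on_screen(p))
    return off / len(points)


# One character position per frame, in camera coordinates.
frames = [(0.0, 0.0, 5.0), (0.0, 0.0, -1.0), (100.0, 0.0, 5.0), (1.0, 1.0, 5.0)]
rate = off_screen_rate(frames)  # two of the four frames are off-screen
```

A fuller check would test a bounding box rather than a single point and account for occlusion, which this sketch ignores.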
User Study Insights
In user studies, participants preferred VERTIGO over baseline methods on several criteria:
- Composition: Users noted better overall composition of shots.
- Consistency: Improved consistency in visual storytelling.
- Prompt Adherence: Greater fidelity to the given text prompts.
- Aesthetic Quality: Higher ratings in terms of visual appeal and engagement.
Conclusion
VERTIGO brings a stand-in for the director's feedback into cinematic camera trajectory generation: rendered previews are scored by a vision-language model, and those scores drive DPO post-training. By optimizing for visual preferences, the framework improves the technical quality of camera motion and aligns it more closely with the artistic intent expressed in the prompt. As the industry continues to explore AI in creative processes, VERTIGO offers one approach to optimizing generated camera motion for how the shot actually looks on screen.
