PISCO: Precise Video Instance Insertion with Sparse Control
AI video generation is undergoing a pivotal shift: away from open-ended generation, which relies on exhaustive prompt engineering and "cherry-picking," and toward fine-grained, controllable generation and high-fidelity post-processing. Professional AI-assisted filmmaking in particular depends on precise, targeted modifications that preserve the integrity of the final product.
A cornerstone of this transition is video instance insertion: inserting a specific instance into existing footage while maintaining scene integrity. Unlike traditional video editing, this task imposes four requirements:
- Precise Spatial-Temporal Placement: The instance must appear at the intended position and at the intended time in the existing footage.
- Physically Consistent Scene Interaction: The inserted instance should interact naturally with the surrounding elements.
- Faithful Preservation of Original Dynamics: The original movements and interactions in the video should remain intact.
- Minimal User Effort: Users should be able to achieve the desired results without extensive manual adjustments.
To meet these requirements, we propose PISCO, a video diffusion model for precise video instance insertion with arbitrary sparse keyframe control. PISCO lets users specify a single keyframe, start-and-end keyframes, or sparse keyframes at arbitrary timestamps. The model then automatically propagates object appearance, motion, and interaction across the video.
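The three control modes can be pictured as one data structure: a map from frame index to a user-specified keyframe, plus a per-frame indicator of which frames carry control. The sketch below is only an illustration of that idea, not PISCO's actual interface. The names `SparseControl`, `single_keyframe`, `start_and_end`, and `arbitrary_keyframes`, and the choice to represent a keyframe as a bounding box, are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Hypothetical keyframe payload: (x, y, width, height) of the instance.
# The real model may condition on images or masks instead.
Box = Tuple[int, int, int, int]


@dataclass
class SparseControl:
    """Hypothetical container for user-supplied sparse keyframes."""
    num_frames: int
    keyframes: Dict[int, Box]  # frame index -> instance placement

    def mask(self) -> List[int]:
        """Return 1 for frames that carry a user keyframe, 0 otherwise."""
        for t in self.keyframes:
            if not 0 <= t < self.num_frames:
                raise ValueError(
                    f"keyframe index {t} outside [0, {self.num_frames})"
                )
        return [1 if t in self.keyframes else 0 for t in range(self.num_frames)]


def single_keyframe(num_frames: int, t: int, box: Box) -> SparseControl:
    """Control given at exactly one frame."""
    return SparseControl(num_frames, {t: box})


def start_and_end(num_frames: int, first: Box, last: Box) -> SparseControl:
    """Control given at the first and last frames only."""
    return SparseControl(num_frames, {0: first, num_frames - 1: last})


def arbitrary_keyframes(num_frames: int, keyframes: Dict[int, Box]) -> SparseControl:
    """Control given at any set of timestamps."""
    return SparseControl(num_frames, dict(keyframes))


if __name__ == "__main__":
    control = start_and_end(6, (10, 20, 32, 32), (40, 20, 32, 32))
    print(control.mask())
```

In this framing, every frame whose mask entry is 0 is one where the model must infer the instance's appearance and motion on its own, which is why fewer keyframes make the conditioning sparser.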
A major hurdle in adapting pretrained video diffusion models to this task is the severe distribution shift induced by sparse conditioning. To address it, we introduce three techniques:
- Variable-Information Guidance: This technique makes conditioning robust, allowing the model to adapt effectively to sparse input.
- Distribution-Preserving Temporal Masking: This method stabilizes temporal generation, ensuring continuity and coherence in the video.
- Geometry-Aware Conditioning: This allows for realistic adaptation to the scene’s unique geometry, enhancing the natural appearance of the inserted instance.
To evaluate PISCO, we construct PISCO-Bench, a benchmark comprising verified instance annotations and paired clean background videos. We assess performance using both reference-based and reference-free perceptual metrics.
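A reference-based metric scores a generated frame against a ground-truth frame, whereas a reference-free metric scores the generated frame alone. The source does not name the specific metrics used, so the sketch below uses PSNR purely as a simple, well-known stand-in for the reference-based case. The function name `psnr` and the flat-list pixel representation are assumptions made for this example.

```python
import math
from typing import Sequence


def psnr(generated: Sequence[float],
         reference: Sequence[float],
         max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio (dB) between two equally sized pixel lists.

    Illustrative stand-in for a reference-based metric; not necessarily
    one of the metrics used in PISCO-Bench.
    """
    if len(generated) != len(reference) or not generated:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixels.
    mse = sum((g - r) ** 2 for g, r in zip(generated, reference)) / len(generated)
    if mse == 0:
        return math.inf  # identical frames
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Higher values mean the generated frame is closer to the reference. A perceptual metric would replace the pixel-wise error with a distance computed in a learned feature space.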
Experiments show that PISCO consistently outperforms strong inpainting and video editing baselines under sparse control. Performance also improves clearly and monotonically as additional control signals are provided.
For more information, see the PISCO project page.
