Generation Is Compression: Zero-Shot Video Coding via Stochastic Rectified Flow
arXiv:2603.26571v1
Abstract
Existing generative video compression methods mostly use generative models as post-hoc reconstruction modules on top of conventional codecs. We instead propose the Generative Video Codec (GVC), a zero-shot framework that turns a pretrained video generative model into the codec itself: the transmitted bitstream directly specifies the generative decoding trajectory, and no retraining is needed.
Technical Innovations
To do this, we convert the deterministic rectified-flow ordinary differential equation (ODE) used in modern video foundation models into an equivalent stochastic differential equation (SDE) at inference time. The conversion exposes a stochastic injection point at every sampling step, which makes codebook-driven compression possible. On this unified backbone we instantiate three complementary conditioning strategies:
- Image-to-Video (I2V): conditions generation on an image and uses adaptive tail-frame atom allocation.
- Text-to-Video (T2V): conditions on a text description only, so it operates with near-zero side information and relies on a pure generative prior.
- First-Last-Frame-to-Video (FLF2V): chains Groups of Pictures (GOPs) that share boundary frames, which gives dual-anchor temporal control.
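The ODE-to-SDE conversion and the per-step codebook injection described above can be illustrated with a toy sketch. Nothing below is taken from the paper. It assumes the interpolation convention x_t = (1 - t) x_0 + t eps, one standard marginal-preserving reverse-time SDE with drift v + sigma_t^2 / (2t) * (x_t + (1 - t) v), a noise schedule sigma_t = a * t, one atom per step, and a greedy inner-product rule for choosing atoms. The names toy_velocity, atoms, sde_step and run are hypothetical, and toy_velocity stands in for a pretrained video model.

```python
import math
import numpy as np

def toy_velocity(x, t):
    # Hypothetical stand-in for the pretrained model: the exact rectified-flow
    # velocity when the data distribution is a standard normal.
    return x * (2.0 * t - 1.0) / ((1.0 - t) ** 2 + t ** 2)

def atoms(step, size, dim, seed):
    # Encoder and decoder regenerate the same Gaussian codebook from a shared seed.
    return np.random.default_rng([seed, step]).standard_normal((size, dim))

def sde_step(x, t, t_next, noise, a):
    # One Euler-Maruyama step from t down to t_next (time runs from 1 to 0).
    h = t - t_next
    sigma = a * t  # assumed noise schedule, chosen to stay finite on (0, 1]
    v = toy_velocity(x, t)
    drift = v + sigma ** 2 / (2.0 * t) * (x + (1.0 - t) * v)
    return x - h * drift + sigma * math.sqrt(h) * noise

def run(dim, steps, size, seed, a, target=None, indices=None):
    # With `target` this acts as the encoder and picks atom indices;
    # with `indices` it acts as the decoder and replays them.
    ts = np.linspace(1.0, 0.0, steps + 1)
    # Shared starting noise; stream [seed, steps] is not used by any codebook.
    x = np.random.default_rng([seed, steps]).standard_normal(dim)
    chosen = []
    for i in range(steps):
        t, t_next = ts[i], ts[i + 1]
        book = atoms(i, size, dim, seed)
        if target is not None:
            x0_hat = x - t * toy_velocity(x, t)  # current clean-sample estimate
            k = int(np.argmax(book @ (target - x0_hat)))  # greedy atom choice
        else:
            k = indices[i]
        chosen.append(k)
        x = sde_step(x, t, t_next, book[k], a)
    return x, chosen

target = np.random.default_rng(1).standard_normal(16)
x_enc, idx = run(16, 20, 64, seed=0, a=1.0, target=target)
x_dec, _ = run(16, 20, 64, seed=0, a=1.0, indices=idx)
payload_bits = len(idx) * math.log2(64)  # only the indices are transmitted
```

In this sketch the decoder never sees the target: it replays the transmitted indices against the same seeded codebooks and arrives at the encoder's final state. How GVC actually selects and allocates atoms is not specified in this summary.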
Trade-Offs in Video Compression
Together, these strategies span a principled trade-off among spatial fidelity, temporal coherence, and compression efficiency. I2V anchors generation on an image, T2V needs almost no side information, and FLF2V anchors both ends of each GOP, so the choice depends on the content and the application.
Experimental Results
Experiments on standard benchmarks show that GVC achieves high-quality video reconstruction at bitrates below 0.002 bits per pixel (bpp). The bitrate is controlled through a single hyperparameter.
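To put the 0.002 bpp figure in concrete terms, the short sketch below converts a bpp target into a bit budget and into a number of codebook indices. The clip dimensions and the codebook size are illustrative assumptions, not values from the paper, and the helper names bits_per_pixel and index_budget are hypothetical.

```python
import math

def bits_per_pixel(total_bits, width, height, frames):
    # bpp is the total payload divided by the number of pixels in the clip.
    return total_bits / (width * height * frames)

def index_budget(bpp, width, height, frames, codebook_size):
    # How many codebook indices fit in the budget, assuming each index
    # costs log2(codebook_size) bits with no entropy coding.
    total_bits = bpp * width * height * frames
    return math.floor(total_bits / math.log2(codebook_size))

# Assumed example clip: 1280x720, 33 frames, 1024-entry codebook.
budget = index_budget(0.002, 1280, 720, 33, 1024)
```

For this assumed clip, 0.002 bpp corresponds to roughly 60,800 bits, or about 7.6 kB for the whole clip.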
Conclusion
GVC shows that a pretrained video generative model can serve directly as a codec, with no retraining. Future work may refine the framework and extend it to further applications.
