Versatile Transition Generation with Image-to-Video Diffusion
/ Authors
/ Abstract
Leveraging text, images, structure maps, or motion trajectories as conditional guidance, diffusion models have achieved great success in automated, high-quality video generation. However, generating smooth and plausible transition videos given the first and last video frames together with descriptive text prompts remains largely underexplored. We present VTG, a Versatile $\underline{T}$ransition video $\underline{G}$eneration framework that can generate smooth, high-fidelity, and semantically coherent video transitions. VTG introduces interpolation-based initialization that helps preserve object identity and handle abrupt content changes effectively. In addition, it incorporates dual-directional motion fine-tuning and representation alignment regularization, which mitigate the limitations of pre-trained image-to-video diffusion models in motion smoothness and generation fidelity, respectively. To evaluate VTG and facilitate future studies on unified transition generation, we collected TransitBench, a comprehensive benchmark for transition generation covering two representative transition tasks: concept blending and scene transition. Extensive experiments show that VTG achieves superior transition performance consistently across the four tasks.
Venue: 2025 IEEE/CVF International Conference on Computer Vision (ICCV)