Contrastive Sequential-Diffusion Learning: Non-linear and Multi-Scene Instructional Video Synthesis

WACV 2025
Universidade NOVA de Lisboa, Google Research

Our Contrastive Sequential Video Diffusion (CoSeD) ensures visual consistency in multi-scene videos by selecting the most relevant past scene to guide the next, improving coherence for tasks like recipes and DIY projects.

Abstract

Generated video scenes for action-centric sequence descriptions, such as recipe instructions and do-it-yourself projects, exhibit non-linear patterns: the next video scene may need to be visually consistent not with the immediately preceding scene but with earlier ones. Current multi-scene video synthesis approaches fail to meet these consistency requirements. To address this, we propose a contrastive sequential video diffusion method that selects the most suitable previously generated scene to guide and condition the denoising process of the next scene. The result is a multi-scene video that is grounded in the scene descriptions and coherent with respect to the scenes that require visual consistency. Experiments with real-world action-centric data demonstrate the practicality and improved consistency of our model compared to prior work.
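To make the selection step concrete, here is a minimal sketch, assuming CLIP-style embeddings of the previously generated scenes and of the next scene's description; the tensors and the helper `select_conditioning_scene` are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of contrastive scene selection (not the authors' code).
import torch
import torch.nn.functional as F

def select_conditioning_scene(past_scene_embeds: torch.Tensor,
                              next_desc_embed: torch.Tensor) -> int:
    """Return the index of the past scene whose embedding is most
    similar to the next scene's description under cosine similarity."""
    sims = F.cosine_similarity(past_scene_embeds,
                               next_desc_embed.unsqueeze(0), dim=-1)
    return int(sims.argmax())

# Toy usage: three previously generated scenes with 512-d embeddings.
past_scenes = torch.randn(3, 512)    # stand-in visual embeddings
next_description = torch.randn(512)  # stand-in text embedding
idx = select_conditioning_scene(past_scenes, next_description)
# The scene at `idx` would then condition the diffusion denoising of the
# next scene, rather than always conditioning on the latest scene.
```

In this framing, the contrastive score replaces the fixed "condition on the previous scene" rule used by linear multi-scene pipelines.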

Model Architecture


Examples




Results

Automatic Evaluation

BibTeX

@misc{ramos2024contrastivesequentialdiffusionlearningnonlinear,
      title={Contrastive Sequential-Diffusion Learning: Non-linear and Multi-Scene Instructional Video Synthesis},
      author={Vasco Ramos and Yonatan Bitton and Michal Yarom and Idan Szpektor and Joao Magalhaes},
      year={2024},
      eprint={2407.11814},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2407.11814},
}