
When and Where Do Events Switch in Multi-Event Video Generation?


Abstract

Text-to-video (T2V) generation has surged, yet challenging questions remain, especially when a long video must depict multiple sequential events with temporal coherence and controllable content. Existing methods that extend to multi-event generation omit an inspection of the intrinsic factors behind event shifting. This paper aims to answer the central question: when and where do multi-event prompts control event transitions during T2V generation? This work introduces MEve, a self-curated prompt suite for evaluating multi-event T2V generation, and conducts a systematic study of two representative model families, OpenSora and CogVideoX. Extensive experiments demonstrate the importance of early intervention in denoising steps and block-wise model layers, revealing an essential factor for multi-event video generation and highlighting possibilities for multi-event conditioning in future models.
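The abstract's core finding, that intervening early in the denoising trajectory is what lets a later event take hold, can be illustrated with a minimal toy sketch. The denoiser, the generate helper, and the switch_step parameter below are hypothetical stand-ins for illustration only, not the paper's method or either model family's API:

```python
# Minimal sketch of prompt switching during diffusion denoising
# (hypothetical toy denoiser, not the MEve code or a real T2V model).
import torch

def toy_denoiser(x, t, cond):
    # Stand-in for a T2V diffusion backbone: pulls the latent toward
    # the conditioning vector, more strongly at low noise levels.
    return x + 0.1 * (cond - x) * (1 - t)

def generate(cond_a, cond_b, switch_step, num_steps=50, shape=(4, 8, 8)):
    """Run a toy denoising loop, swapping event A's conditioning for
    event B's at `switch_step` (smaller values switch earlier)."""
    x = torch.randn(shape)
    for i in range(num_steps):
        t = 1.0 - i / num_steps          # noise level from 1 -> 0
        cond = cond_a if i < switch_step else cond_b
        x = toy_denoiser(x, t, cond)
    return x

cond_a, cond_b = torch.zeros(4, 8, 8), torch.ones(4, 8, 8)
early = generate(cond_a, cond_b, switch_step=5)   # switch early
late = generate(cond_a, cond_b, switch_step=45)   # switch late
print(early.mean().item(), late.mean().item())
```

In this toy setting, the early switch lands the latent close to event B's conditioning while the late switch barely moves it, loosely mirroring the paper's observation that early denoising steps dominate event transitions.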



LongVid-Foundations @ICCV 2025

1st Workshop on Long Multi-Scene Video Foundations: Generation, Understanding and Evaluation at the IEEE/CVF International Conference on Computer Vision. Honolulu, Hawai'i, Oct 19-23, 2025. To be published. Preprint available.

Authors

R. Liao • G. Huang • Q. Cheng • T. Seidl • D. Cremers • V. Tresp

Research Areas

A3 | Computational Models

B1 | Computer Vision

BibTeX Key: LHC+25
