SketchVideo: Sketch-based Video Generation and Editing

Abstract

Video generation and editing conditioned on text prompts or images have advanced significantly. However, it remains difficult to control global layout and geometric details accurately through text alone, and to support motion control and local modification through images. In this paper, we aim to achieve sketch-based spatial and motion control for video generation and to support fine-grained editing of real or synthetic videos. Building on a DiT video generation model, we propose a memory-efficient control structure whose sketch control blocks predict the residual features of skipped DiT blocks. Sketches are drawn on one or two keyframes (at arbitrary time points) for easy interaction. To propagate these temporally sparse sketch conditions across all frames, we propose an inter-frame attention mechanism that analyzes the relationship between the keyframes and each video frame. For sketch-based video editing, we design an additional video insertion module that keeps the newly edited content consistent with the spatial features and dynamic motion of the original video. During inference, we use latent fusion to preserve unedited regions accurately. Extensive experiments demonstrate that SketchVideo achieves superior performance in controllable video generation and editing.
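The abstract does not spell out how latent fusion is computed. As a rough illustration only, the sketch below assumes a common mask-weighted blend: inside a user-provided edit mask the edited latents are kept, and everywhere else the original video latents are restored. The function name `fuse_latents`, the tensor layout (frames, channels, height, width), and the use of a binary mask are all assumptions made for this example, not details taken from the paper.

```python
import numpy as np


def fuse_latents(edited, original, mask):
    """Blend edited and original video latents with an edit mask.

    Assumed shapes (hypothetical, for illustration):
      edited, original: (frames, channels, height, width)
      mask:             (frames, 1, height, width), 1 = edited region,
                        0 = region to preserve from the original video.
    The mask broadcasts over the channel axis.
    """
    mask = mask.astype(edited.dtype)
    return mask * edited + (1.0 - mask) * original


# Toy data: 2 frames, 3 latent channels, 4x4 latent grid.
rng = np.random.default_rng(0)
original = rng.standard_normal((2, 3, 4, 4))
edited = rng.standard_normal((2, 3, 4, 4))

# Mark only the top-left 2x2 patch of every frame as edited.
mask = np.zeros((2, 1, 4, 4))
mask[:, :, :2, :2] = 1.0

fused = fuse_latents(edited, original, mask)
```

In a diffusion pipeline, a blend of this kind is often applied at each denoising step, with the original latents noised to the current step's noise level; whether SketchVideo does so is not stated in the abstract.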

SketchVideo: Sketch-based Video Generation and Editing - arXiv