arXiv

CAGE: Unsupervised Visual Composition and Animation for Controllable Video Generation

Abstract

The field of video generation has expanded significantly in recent years, with controllable and compositional video generation garnering considerable interest. Most methods rely on annotations such as text, object bounding boxes, and motion cues, which require substantial human effort and thus limit scalability. In contrast, we address the challenge of controllable and compositional video generation without any annotations by introducing a novel unsupervised approach. Our model is trained from scratch on a dataset of unannotated videos. At inference time, it can compose plausible novel scenes and animate objects by placing object parts at the desired locations in space and time. The core innovation of our method lies in the unified control format and the training process, in which video generation is conditioned on a randomly selected subset of pre-trained self-supervised local features. This conditioning compels the model to learn how to inpaint the missing information in the video both spatially and temporally, thereby learning the inherent compositionality of a scene and the dynamics of moving objects. Because the conditioning input is abstract and is made invariant to minor visual perturbations, object motion can be controlled simply by using the same features at all the desired future locations. We call our model CAGE, which stands for visual Composition and Animation for video GEneration. We conduct extensive experiments to validate the effectiveness of CAGE across various scenarios, demonstrating its capability to follow the control accurately and to generate high-quality videos that exhibit coherent scene composition and realistic animation.
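The control format described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the random array stands in for pre-trained self-supervised local features, and the function names, the `keep_prob` parameter, the zero-filling of unselected locations, and the grid sizes are all assumptions made for illustration. The first function builds a training-time condition by keeping a random subset of local features; the second builds an inference-time control by placing the same feature vector at every desired future location.

```python
import numpy as np


def sample_sparse_condition(features, keep_prob, rng):
    """Keep a random subset of local features; zero out the rest.

    features: array of shape (T, H, W, D), one D-dim feature per patch per frame.
    Returns the sparse condition and the boolean mask of kept locations.
    Zero-filling dropped locations is an assumption of this sketch.
    """
    t, h, w, _ = features.shape
    mask = rng.random((t, h, w)) < keep_prob
    cond = np.where(mask[..., None], features, 0.0)
    return cond, mask


def place_feature_along_path(cond, mask, feature, path):
    """Write the same feature vector at each (frame, row, col) in `path`."""
    cond = cond.copy()
    mask = mask.copy()
    for t, r, c in path:
        cond[t, r, c] = feature
        mask[t, r, c] = True
    return cond, mask


rng = np.random.default_rng(0)
# Hypothetical stand-in for features of a 4-frame clip on an 8x8 patch grid.
features = rng.standard_normal((4, 8, 8, 16))

# Training-time style condition: a random subset of the clip's own features.
cond, mask = sample_sparse_condition(features, keep_prob=0.1, rng=rng)

# Inference-time style control: move one object part to the right over time
# by reusing its feature at each desired future location.
empty_cond = np.zeros_like(features)
empty_mask = np.zeros(features.shape[:3], dtype=bool)
object_feature = features[0, 2, 2]
path = [(0, 2, 2), (1, 2, 3), (2, 2, 4), (3, 2, 5)]
ctrl, ctrl_mask = place_feature_along_path(
    empty_cond, empty_mask, object_feature, path
)
```

In this sketch a generative model would receive `cond` (or `ctrl`) together with the mask and would have to fill in everything that was not specified, which is the inpainting behavior the abstract describes.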