Video-T1: Test-Time Scaling for Video Generation

Fangfu Liu, Yimo Cai

Abstract


By scaling up training data, model size, and computational cost, video generation has achieved impressive results in digital creation, enabling users to express creativity across various domains. Recently, researchers in Large Language Models (LLMs) have extended scaling to test time, which can significantly improve LLM performance by using more inference-time computation. Instead of scaling up video foundation models through expensive training, we explore the power of Test-Time Scaling (TTS) in video generation, aiming to answer the question: if a video generation model is allowed to use a non-trivial amount of inference-time compute, how much can it improve generation quality given a challenging text prompt? In this work, we reinterpret the test-time scaling of video generation as a search problem: sampling better trajectories from the Gaussian noise space to the target video distribution. Specifically, we build the search space with test-time verifiers that provide feedback and heuristic algorithms that guide the search process. Given a text prompt, we first explore an intuitive linear search strategy that increases the number of noise candidates at inference time. Because full-step denoising of all frames simultaneously incurs heavy test-time computation costs, we further design a more efficient TTS method for video generation called Tree-of-Frames (ToF), which adaptively expands and prunes video branches in an autoregressive manner. Extensive experiments on text-conditioned video generation benchmarks demonstrate that increasing test-time compute consistently leads to significant improvements in video quality. Project page: https://liuff19.github.io/Video-T1
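To make the two search strategies concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes a video is a list of numbers (one per frame) and uses a hypothetical generator `toy_generate` and a hypothetical verifier `toy_verifier` in place of a real video model and reward model. `linear_search` illustrates best-of-N selection over noise candidates. `tree_of_frames` is a beam-search-style stand-in for ToF, and its `branch` and `keep` parameters are assumptions introduced only for illustration.

```python
import random
from typing import List, Sequence, Tuple

Video = List[float]  # toy stand-in: one number per frame


def toy_generate(noise: Sequence[float]) -> Video:
    """Hypothetical 'denoiser': maps a noise vector to a toy video."""
    return [0.5 * z for z in noise]


def toy_verifier(video: Sequence[float]) -> float:
    """Hypothetical verifier: higher is better; prefers frames near 1.0."""
    return -sum((f - 1.0) ** 2 for f in video)


def linear_search(n_candidates: int, n_frames: int,
                  rng: random.Random) -> Tuple[Video, float]:
    """Best-of-N: denoise N full noise candidates and keep the top-scoring video."""
    best_video: Video = []
    best_score = float("-inf")
    for _ in range(n_candidates):
        noise = [rng.gauss(0.0, 1.0) for _ in range(n_frames)]
        video = toy_generate(noise)
        score = toy_verifier(video)
        if score > best_score:
            best_video, best_score = video, score
    return best_video, best_score


def tree_of_frames(n_frames: int, branch: int, keep: int,
                   rng: random.Random) -> Tuple[Video, float]:
    """Beam-search-style sketch: grow videos frame by frame, pruning weak branches.

    `branch` (children per partial video) and `keep` (branches retained after
    pruning) are illustrative assumptions, not parameters from the paper.
    """
    beams: List[List[float]] = [[]]  # each beam is a partial noise sequence
    for _ in range(n_frames):
        # Expand: every surviving branch proposes `branch` next-frame noises.
        expanded = [b + [rng.gauss(0.0, 1.0)] for b in beams for _ in range(branch)]
        # Prune: score the partial videos and keep only the best `keep`.
        expanded.sort(key=lambda z: toy_verifier(toy_generate(z)), reverse=True)
        beams = expanded[:keep]
    video = toy_generate(beams[0])
    return video, toy_verifier(video)


if __name__ == "__main__":
    _, s1 = linear_search(1, 4, random.Random(0))
    _, s8 = linear_search(8, 4, random.Random(0))
    _, st = tree_of_frames(4, branch=4, keep=2, rng=random.Random(0))
    print(f"best-of-1: {s1:.3f}  best-of-8: {s8:.3f}  tree: {st:.3f}")
```

In this toy setting, the linear search pays for a full generation per candidate, whereas the tree variant discards unpromising branches after each frame, which mirrors the efficiency motivation stated in the abstract.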