Training-free Diffusion Acceleration with Bottleneck Sampling

arXiv

Ye Tian, Xin Xia, Yuxi Ren, Shanchuan Lin, Xing Wang, Xuefeng Xiao, Yunhai Tong, Ling Yang, Bin Cui

Abstract

Diffusion models have demonstrated remarkable capabilities in visual content generation but remain challenging to deploy due to their high computational cost during inference. This computational burden primarily arises from the quadratic complexity of self-attention with respect to image or video resolution. While existing acceleration methods often compromise output quality or necessitate costly retraining, we observe that most diffusion models are pre-trained at lower resolutions, presenting an opportunity to exploit these low-resolution priors for more efficient inference without degrading performance. In this work, we introduce Bottleneck Sampling, a training-free framework that leverages low-resolution priors to reduce computational overhead while preserving output fidelity. Bottleneck Sampling follows a high-low-high denoising workflow: it performs high-resolution denoising in the initial and final stages and operates at lower resolutions in the intermediate steps. To mitigate aliasing and blurring artifacts, we further refine the resolution transition points and adaptively shift the denoising timesteps at each stage. We evaluate Bottleneck Sampling on both image and video generation tasks. Extensive experiments demonstrate that it accelerates inference by up to 3$\times$ for image generation and 2.5$\times$ for video generation while maintaining output quality comparable to the standard full-resolution sampling process across multiple evaluation metrics.
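To make the cost argument in the abstract concrete, the following is a minimal back-of-the-envelope sketch, not the authors' implementation. It assumes that the token count of a 2D latent grows with the square of the side-length scale and that self-attention cost grows with the square of the token count, so one step costs roughly scale to the fourth power. The function names, the stage scales, and the step counts are all hypothetical values chosen for illustration; the abstract does not specify them.

```python
def attention_cost(scale: float) -> float:
    """Relative self-attention cost of one denoising step.

    Assumption: tokens grow with scale**2 (2D latents) and attention
    cost grows with tokens**2, so the cost grows with scale**4.
    """
    tokens = scale ** 2
    return tokens ** 2


def schedule_speedup(stages):
    """Estimate the attention-only speedup of a staged schedule.

    `stages` is a list of (scale, num_steps) pairs, where scale 1.0
    means full resolution. The baseline runs the same total number of
    steps entirely at full resolution.
    """
    total_steps = sum(steps for _, steps in stages)
    baseline = total_steps * attention_cost(1.0)
    staged = sum(steps * attention_cost(scale) for scale, steps in stages)
    return baseline / staged


# Hypothetical high-low-high schedule (illustrative numbers only):
# 5 full-resolution steps, 15 half-resolution steps, 5 full-resolution steps.
high_low_high = [(1.0, 5), (0.5, 15), (1.0, 5)]
print(f"estimated attention-only speedup: {schedule_speedup(high_low_high):.2f}x")
# prints "estimated attention-only speedup: 2.29x"
```

This estimate counts only self-attention and ignores the other layers, the resampling between stages, and any change in the total number of steps, so it should be read as an intuition for why the intermediate low-resolution stage dominates the savings rather than as a prediction of the reported speedups.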