Boosting Resolution Generalization of Diffusion Transformers with Randomized Positional Encodings
Abstract
Resolution generalization in image generation lets a model produce higher-resolution images while training at a lower resolution, which keeps training cost down. However, a significant challenge in resolution generalization, particularly in the widely used Diffusion Transformers, is the mismatch between the positional encodings encountered at test time and those used during training. Existing methods rely on interpolation, extrapolation, or combinations of the two, but none fully resolves this issue. In this paper, we propose a novel two-dimensional randomized positional encoding (RPE-2D) framework that learns the positional order of image patches rather than the specific distances between them, enabling seamless high- and low-resolution image generation without joint training on high- and low-resolution images. Specifically, RPE-2D selects positions independently along the horizontal and vertical axes from a broader range, so that every positional encoding encountered at inference has already been trained, which improves resolution generalization. Additionally, we propose a random data augmentation technique to strengthen the modeling of positional order. Because this augmentation crops images, we introduce corresponding micro-conditioning so that the model can perceive the specific cropping pattern. On the ImageNet dataset, RPE-2D achieves state-of-the-art resolution generalization performance, outperforming existing competitive methods when trained at $256 \times 256$ and evaluated at $384 \times 384$ and $512 \times 512$, as well as when scaling from $512 \times 512$ to $768 \times 768$ and $1024 \times 1024$. It also exhibits outstanding capabilities in low-resolution image generation, multi-stage training acceleration, and multi-resolution inheritance.
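The position selection described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that each axis draws distinct integer positions from a larger index range and sorts them, so that patch order is preserved while the gaps between neighbouring positions vary. The function names, the $16 \times 16$ token grid, and the range size `max_len = 64` are hypothetical values chosen for illustration.

```python
import random


def sample_axis_positions(num_patches, max_len, rng):
    """Draw num_patches distinct indices from range(max_len), sorted ascending.

    Sorting keeps the relative order of patches intact; only the spacing
    between consecutive positions is randomized.
    """
    if num_patches > max_len:
        raise ValueError("num_patches must not exceed max_len")
    return sorted(rng.sample(range(max_len), num_patches))


def sample_rpe_2d_grid(height, width, max_len, seed=None):
    """Return a height x width grid of (row_pos, col_pos) pairs.

    Rows and columns are sampled independently of each other, following
    the per-axis selection the abstract describes (assumed sampling scheme).
    """
    rng = random.Random(seed)
    rows = sample_axis_positions(height, max_len, rng)
    cols = sample_axis_positions(width, max_len, rng)
    return [[(r, c) for c in cols] for r in rows]


# Hypothetical setting: a 16 x 16 token grid with positions drawn from 0..63.
grid = sample_rpe_2d_grid(height=16, width=16, max_len=64, seed=0)
row_positions = [row[0][0] for row in grid]
col_positions = [pos[1] for pos in grid[0]]
```

Under this assumption, repeated sampling over many training steps can expose the model to any index in `range(max_len)`, so a larger token grid at inference can be assigned positions from the same range without meeting an index that was never trained.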