arXiv

Towards Transformer-Based Aligned Generation with Self-Coherence Guidance

Shulei Wang, Wang Lin, Hai Huang, Hanting Wang, Sihang Cai, WenKang Han, Tao Jin, Jieming Zhu

Abstract

We introduce a novel, training-free approach for enhancing alignment in Transformer-based Text-Guided Diffusion Models (TGDMs). Existing TGDMs often struggle to generate semantically aligned images, particularly when dealing with complex text prompts or multi-concept attribute binding challenges. Previous U-Net-based methods primarily optimized the latent space, but their direct application to Transformer-based architectures has shown limited effectiveness. Our method addresses these challenges by directly optimizing cross-attention maps during the generation process. Specifically, we introduce Self-Coherence Guidance, a method that dynamically refines attention maps using masks derived from previous denoising steps, ensuring precise alignment without additional training. To validate our approach, we constructed more challenging benchmarks for evaluating coarse-grained attribute binding, fine-grained attribute binding, and style binding. Experimental results demonstrate the superior performance of our method, significantly surpassing other state-of-the-art methods across all evaluated tasks. Our code is available at https://scg-diffusion.github.io/scg-diffusion.
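The abstract describes refining cross-attention maps with masks derived from previous denoising steps but gives no formulas, so the following is only a toy sketch of that general idea and not the authors' implementation. Everything in it is an assumption made for illustration: the function names `mask_from_previous_step` and `refine_attention`, the relative threshold, the multiplicative `boost`, and the representation of an attention map as a plain list of per-patch rows over concept tokens.

```python
# Toy sketch only: the thresholding rule, the boost factor, and all names
# below are hypothetical and are not taken from the paper.

def mask_from_previous_step(prev_attn, token, threshold=0.5):
    """Binary patch mask: 1 where the token's previous-step attention,
    scaled by its maximum over patches, reaches the threshold."""
    column = [row[token] for row in prev_attn]
    peak = max(column) or 1.0  # avoid dividing by zero on an all-zero column
    return [1 if value / peak >= threshold else 0 for value in column]


def refine_attention(curr_attn, masks, boost=2.0):
    """Scale each masked token's attention up inside its mask and down
    outside it, then renormalise every patch row to sum to 1."""
    refined = []
    for patch, row in enumerate(curr_attn):
        new_row = list(row)
        for token, mask in masks.items():
            factor = boost if mask[patch] else 1.0 / boost
            new_row[token] = row[token] * factor
        total = sum(new_row)
        refined.append([value / total for value in new_row])
    return refined


# Toy data: 4 image patches (rows) and 2 concept tokens (columns).
prev_attn = [[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.1, 0.9]]  # previous step
curr_attn = [[0.6, 0.4], [0.5, 0.5], [0.5, 0.5], [0.4, 0.6]]  # current step

masks = {token: mask_from_previous_step(prev_attn, token) for token in (0, 1)}
refined = refine_attention(curr_attn, masks)
```

In this toy setting the first token ends up concentrated on the first two patches and the second token on the last two, which is the kind of separation between concepts that attribute binding requires. A real Transformer-based diffusion model would apply such a step to attention tensors inside the network at each denoising step; how the paper actually builds and applies its masks is not stated in the abstract.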