arXiv

Att-Adapter: A Robust and Precise Domain-Specific Multi-Attributes T2I Diffusion Adapter via Conditional Variational Autoencoder

Abstract

Text-to-Image (T2I) diffusion models have achieved remarkable performance in generating high-quality images. However, precisely controlling continuous attributes in a new domain (e.g., numeric values such as eye openness or car width) with text-only guidance remains a significant challenge, especially when multiple attributes must be controlled simultaneously. To address this, we introduce the Attribute (Att) Adapter, a novel plug-and-play module designed to enable fine-grained, multi-attribute control in pretrained diffusion models. Our approach learns a single control adapter from a set of sample images that can be unpaired and can contain multiple visual attributes. The Att-Adapter leverages a decoupled cross-attention module to naturally harmonize the multiple domain attributes with text conditioning. We further introduce a Conditional Variational Autoencoder (CVAE) into the Att-Adapter to mitigate overfitting and to match the diverse nature of the visual world. Evaluations on two public datasets show that Att-Adapter outperforms all LoRA-based baselines in controlling continuous attributes. Additionally, our method enables a broader control range and improves disentanglement across multiple attributes, surpassing StyleGAN-based techniques. Notably, Att-Adapter is flexible: it requires no paired synthetic data for training and scales easily to multiple attributes within a single model.
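The two mechanisms named in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: it assumes a common form of decoupled cross-attention in which a separate attention branch over attribute tokens is added to the text branch with a scale factor, and it assumes the CVAE side can be reduced to a Gaussian whose mean and log-variance are linear in the attribute values, sampled with the reparameterization trick. All names (`sample_attribute_tokens`, `decoupled_cross_attention`, `w_mu`, `w_logvar`, `scale`) are hypothetical, and the learned key/value projections are omitted for brevity.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(q, k, v):
    # Scaled dot-product attention for 2-D arrays (queries x dim).
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v


def sample_attribute_tokens(attrs, w_mu, w_logvar, rng):
    # Assumption: a conditional Gaussian whose parameters are linear in the
    # attribute values. Sampling uses the reparameterization trick.
    mu = attrs @ w_mu
    logvar = attrs @ w_logvar
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps


def decoupled_cross_attention(q, text_tokens, attr_tokens, scale=1.0):
    # Assumption: the text and attribute branches attend separately and are
    # summed. Tokens serve directly as keys and values here; a real adapter
    # would apply learned projections first.
    text_out = attention(q, text_tokens, text_tokens)
    attr_out = attention(q, attr_tokens, attr_tokens)
    return text_out + scale * attr_out


rng = np.random.default_rng(0)
d = 8
q = rng.standard_normal((4, d))            # image latent queries
text_tokens = rng.standard_normal((5, d))  # stand-in for text embeddings
attrs = np.array([[0.3, 0.9]])             # two illustrative attribute values
w_mu = rng.standard_normal((2, d))
w_logvar = 0.1 * rng.standard_normal((2, d))

attr_tokens = sample_attribute_tokens(attrs, w_mu, w_logvar, rng)
out = decoupled_cross_attention(q, text_tokens, attr_tokens, scale=0.5)
text_only = decoupled_cross_attention(q, text_tokens, attr_tokens, scale=0.0)
```

Setting `scale=0.0` recovers plain text cross-attention, which is one way to see why an adapter of this shape can be plug-and-play: the pretrained text pathway is left untouched and the attribute branch is purely additive.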