InPO: Inversion Preference Optimization with Reparametrized DDIM for Efficient Diffusion Model Alignment

Abstract

Direct preference optimization (DPO) fine-tunes generative models on paired human preference data without an explicit reward, and it has attracted considerable attention for large language models (LLMs). However, aligning text-to-image (T2I) diffusion models with human preferences remains underexplored. Compared with supervised fine-tuning, existing methods for aligning diffusion models suffer from low training efficiency and subpar generation quality because of the long Markov chain and the intractability of the reverse process. To address these limitations, we introduce DDIM-InPO, an efficient method for direct preference alignment of diffusion models. Our approach treats the diffusion model as a single-step generative model, which lets us selectively fine-tune the outputs of specific latent variables. To this end, we first assign implicit rewards to any latent variable directly via a reparameterization technique. We then construct an inversion technique to estimate latent variables suited to preference optimization. With this modification, the diffusion model fine-tunes only the outputs of latent variables that correlate strongly with the preference dataset. Experimental results show that DDIM-InPO achieves state-of-the-art performance with just 400 fine-tuning steps, surpassing all preference-alignment baselines for T2I diffusion models on human preference evaluation tasks.
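To make the ingredients named in the abstract concrete, the following minimal sketch shows three standard building blocks: the DDIM reparameterization that maps a latent and a predicted noise to a clean-sample estimate, one deterministic DDIM inversion step, and the generic DPO loss on a preferred/dispreferred pair. It is not the paper's exact objective. The function names, the list-based vectors, and the scalar log-ratio inputs are assumptions chosen for illustration.

```python
import math

def ddim_predict_x0(x_t, eps, alpha_bar_t):
    """DDIM reparameterization: estimate the clean sample x0 from the
    latent x_t and the predicted noise eps at cumulative noise level alpha_bar_t."""
    scale = math.sqrt(alpha_bar_t)
    sigma = math.sqrt(1.0 - alpha_bar_t)
    return [(x - sigma * e) / scale for x, e in zip(x_t, eps)]

def ddim_inversion_step(x_t, eps, alpha_bar_t, alpha_bar_next):
    """One deterministic DDIM inversion step: re-noise the x0 estimate to the
    level alpha_bar_next, reusing the noise predicted at the current level."""
    x0_hat = ddim_predict_x0(x_t, eps, alpha_bar_t)
    scale = math.sqrt(alpha_bar_next)
    sigma = math.sqrt(1.0 - alpha_bar_next)
    return [scale * x + sigma * e for x, e in zip(x0_hat, eps)]

def dpo_loss(logratio_w, logratio_l, beta):
    """Generic DPO loss, -log sigmoid(beta * (logratio_w - logratio_l)).
    Each logratio is log pi_theta / pi_ref for the preferred (w) or
    dispreferred (l) sample. Written as a numerically stable softplus."""
    margin = beta * (logratio_w - logratio_l)
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Hypothetical toy values, for illustration only.
x_t = [0.9, -0.4]
eps = [0.1, 0.3]
x_next = ddim_inversion_step(x_t, eps, alpha_bar_t=0.8, alpha_bar_next=0.6)
loss = dpo_loss(logratio_w=0.5, logratio_l=-0.5, beta=1.0)
```

In a real diffusion setting the noise `eps` would come from the network, and the log-ratios would be replaced by whatever implicit reward the method assigns to each latent; both are stand-ins here.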