Aesthetic Post-Training Diffusion Models from Generic Preferences with Step-by-step Preference Optimization
Abstract
Generating visually appealing images is fundamental to modern text-to-image generation models. One potential route to better aesthetics is direct preference optimization (DPO), which has been applied to diffusion models to improve general image quality, including prompt alignment and aesthetics. Popular DPO methods propagate preference labels from clean image pairs to all the intermediate steps along the two generation trajectories. However, the preference labels in existing datasets blend layout and aesthetic opinions, so they may conflict with purely aesthetic preference. Even if aesthetic labels were provided (at substantial cost), two-trajectory methods would still struggle to capture nuanced visual differences at different steps. To improve aesthetics economically, this paper uses existing generic preference data and introduces step-by-step preference optimization (SPO), which discards the propagation strategy and allows fine-grained image details to be assessed. Specifically, at each denoising step, we 1) sample a pool of candidates by denoising from a shared noise latent, 2) use a step-aware preference model to find a suitable win-lose pair to supervise the diffusion model, and 3) randomly select one candidate from the pool to initialize the next denoising step. Because all candidates at a step start from the same latent, they differ mainly in fine detail, so this strategy ensures that the diffusion model focuses on subtle, fine-grained visual differences rather than on layout. We find that aesthetics can be significantly enhanced by accumulating these small per-step improvements. When fine-tuning Stable Diffusion v1.5 and SDXL, SPO yields significant improvements in aesthetics compared with existing DPO methods, without sacrificing image-text alignment relative to the vanilla models. Moreover, SPO converges much faster than DPO methods because the step-aware preference model provides more accurate preference labels.
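The three per-step operations described in the abstract can be illustrated with a minimal, runnable sketch. It is not the authors' implementation: `denoise_step` and `step_aware_score` are hypothetical toy stand-ins for a stochastic diffusion sampler step and the step-aware preference model, latents are plain lists of floats, and picking the highest- and lowest-scoring candidates as the win-lose pair is an assumption, since the abstract only says a "suitable" pair is found. The sketch collects the pairs and omits the preference loss and parameter update.

```python
import random


def denoise_step(x_t, t, rng):
    """Hypothetical stand-in for one stochastic denoising step from latent x_t."""
    return [0.9 * v + 0.1 * rng.gauss(0.0, 1.0) for v in x_t]


def step_aware_score(x, t):
    """Hypothetical stand-in for the step-aware preference model (higher is better)."""
    return -sum(v * v for v in x)


def spo_collect_pairs(x_T, num_steps, pool_size, seed=0):
    """Walk one trajectory and collect a (t, shared latent, win, lose) tuple per step."""
    rng = random.Random(seed)
    x_t = list(x_T)
    pairs = []
    for t in range(num_steps, 0, -1):
        # 1) Sample a pool of candidates, all denoised from the same shared latent.
        pool = [denoise_step(x_t, t, rng) for _ in range(pool_size)]
        # 2) Score candidates at this step; assumption: best vs. worst forms the pair.
        scores = [step_aware_score(c, t) for c in pool]
        win = pool[max(range(pool_size), key=scores.__getitem__)]
        lose = pool[min(range(pool_size), key=scores.__getitem__)]
        pairs.append((t, x_t, win, lose))
        # 3) Randomly pick one candidate from the pool to start the next step.
        x_t = rng.choice(pool)
    return pairs


pairs = spo_collect_pairs([1.0, -2.0, 0.5], num_steps=4, pool_size=3)
for t, _, win, lose in pairs:
    print(t, round(step_aware_score(win, t), 3), round(step_aware_score(lose, t), 3))
```

In an actual training loop, each collected pair would feed a preference loss on the diffusion model at step `t`. Continuing from a random pool member, rather than from the winner, follows step 3 of the abstract.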