Abstract
This paper introduces MultiBooth, a novel and efficient technique for multi-concept customization in text-to-image generation. Despite significant advances in customized generation methods, particularly with the success of diffusion models, existing methods often struggle in multi-concept scenarios because of low concept fidelity and high inference cost. MultiBooth addresses these issues by dividing the multi-concept generation process into two phases: a single-concept learning phase and a multi-concept integration phase. During the single-concept learning phase, we employ a multi-modal image encoder and an efficient concept encoding technique to learn a concise and discriminative representation for each concept. In the multi-concept integration phase, we use bounding boxes to define the generation area for each concept within the cross-attention map. This enables each concept to be generated within its specified region, thereby facilitating the formation of multi-concept images. This strategy not only improves concept fidelity but also reduces additional inference cost. MultiBooth surpasses various baselines in both qualitative and quantitative evaluations, demonstrating its superior performance and computational efficiency. Project Page: https://multibooth.github.io/
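To make the multi-concept integration idea concrete, the following is a minimal NumPy sketch of region-restricted cross-attention: every spatial location first attends to a base prompt, and locations inside a concept's bounding box instead attend to that concept's embeddings. It is an illustration under stated assumptions, not the MultiBooth implementation. The function names (`cross_attention`, `box_to_mask`, `regional_cross_attention`), the normalized `(x0, y0, x1, y1)` box format, and the omission of learned query/key/value projections are all assumptions made for brevity.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, context):
    """Scaled dot-product attention without learned projections (assumption).

    queries: (N, d) image features; context: (T, d) text/concept embeddings.
    """
    scores = queries @ context.T / np.sqrt(queries.shape[-1])
    return softmax(scores, axis=-1) @ context


def box_to_mask(box, height, width):
    """Boolean (height, width) mask for a normalized (x0, y0, x1, y1) box."""
    x0, y0, x1, y1 = box
    mask = np.zeros((height, width), dtype=bool)
    mask[int(y0 * height):int(np.ceil(y1 * height)),
         int(x0 * width):int(np.ceil(x1 * width))] = True
    return mask


def regional_cross_attention(feats, base_ctx, concept_ctxs, boxes):
    """Attend to the base prompt everywhere, then overwrite each boxed
    region with attention against that region's concept embeddings.

    feats: (H, W, d); base_ctx: (T, d); concept_ctxs: list of (T_i, d);
    boxes: list of normalized boxes, one per concept (hypothetical format).
    """
    height, width, dim = feats.shape
    flat = feats.reshape(-1, dim)
    out = cross_attention(flat, base_ctx)
    for ctx, box in zip(concept_ctxs, boxes):
        mask = box_to_mask(box, height, width).reshape(-1)
        # Only queries inside the box see this concept's embeddings.
        out[mask] = cross_attention(flat[mask], ctx)
    return out.reshape(height, width, dim)
```

For example, calling `regional_cross_attention(feats, base, [dog, cat], [(0.0, 0.0, 0.5, 1.0), (0.5, 0.0, 1.0, 1.0)])` with hypothetical `dog` and `cat` embedding arrays would route the left half of the feature map to the first concept and the right half to the second. Because each region only attends to its own concept, the concepts do not compete inside a single attention map, which is the intuition behind using bounding boxes to preserve concept fidelity.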