Text-to-Model: Text-Conditioned Neural Network Diffusion for Train-Once-for-All Personalization
Abstract
Generative artificial intelligence (GenAI) has made significant progress in understanding world knowledge and generating content from human language across various modalities, such as text-to-text large language models, text-to-image Stable Diffusion, and text-to-video Sora. In this paper, we investigate the capability of GenAI for text-to-model generation, to see whether GenAI can comprehend hyper-level knowledge embedded within the parameters of AI models themselves. Specifically, we study a practical scenario termed train-once-for-all personalization, which aims to generate personalized models for diverse end-users and tasks from text prompts. Inspired by the recent emergence of neural network diffusion, we present Tina, a text-conditioned neural network diffusion method for train-once-for-all personalization. Tina leverages a diffusion transformer model conditioned on task descriptions embedded using a CLIP model. Despite the astronomical number of potential personalized tasks (e.g., $1.73\times10^{13}$), by design Tina demonstrates remarkable in-distribution and out-of-distribution generalization even when trained on small datasets ($\sim 1000$). We further verify whether and how Tina understands world knowledge by analyzing its capabilities under zero-shot/few-shot image prompts, different numbers of personalized classes, natural language description prompts, and prediction of unseen entities.