Open-Vocabulary Text-Driven Human Image Generation
International Journal of Computer Vision (IF 19.5), Pub Date: 2024-05-15, DOI: 10.1007/s11263-024-02079-7
Kaiduo Zhang, Muyi Sun, Jianxin Sun, Kunbo Zhang, Zhenan Sun, Tieniu Tan

Generating human images from open-vocabulary text descriptions is an exciting but challenging task. Previous methods (e.g., Text2Human) face two challenging problems: (1) they cannot handle the open-vocabulary setting with arbitrary text inputs (e.g., unseen clothing appearances) well and rely heavily on a limited set of preset words (e.g., pattern styles of clothing appearances); (2) the generated human images are inaccurate in open-vocabulary settings. To alleviate these drawbacks, we propose a flexible diffusion-based framework, namely HumanDiffusion, for open-vocabulary text-driven human image generation (HIG). The proposed framework mainly consists of two novel modules: the Stylized Memory Retrieval (SMR) module and the Multi-scale Feature Mapping (MFM) module. We first obtain coarse features of the local human appearance by encoding with the vision-language pretrained CLIP model. Then, the SMR module utilizes an external database containing clothing texture details to refine the initial coarse features. With this SMR-based refinement, the HIG task can be performed with arbitrary text inputs, and the range of expressible styles is greatly expanded. The MFM module, embedded in the diffusion backbone, then learns fine-grained appearance features, which enables precise, semantically coherent alignment of different body parts with appearance features and accurate expression of the desired human appearance. The seamless combination of these novel modules in HumanDiffusion enables free-form, highly accurate text-guided HIG and editing. Extensive experiments demonstrate that the proposed method achieves state-of-the-art (SOTA) performance, especially in the open-vocabulary setting.
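
To make the retrieval-refinement idea concrete, the following is a minimal sketch (not the authors' code) of how an SMR-style block could refine coarse CLIP-like appearance features against an external memory of clothing-texture features before they condition the diffusion backbone. All module names, feature dimensions, the top-k fusion scheme, and the memory construction here are assumptions for illustration only.

import torch
import torch.nn.functional as F

class StylizedMemoryRetrieval(torch.nn.Module):
    """Hypothetical SMR-style block: refine coarse features via soft top-k
    retrieval from an external texture memory (cosine similarity)."""

    def __init__(self, feat_dim: int = 512, top_k: int = 4):
        super().__init__()
        self.top_k = top_k
        # Learnable fusion of the coarse feature and the retrieved texture feature.
        self.fuse = torch.nn.Linear(2 * feat_dim, feat_dim)

    def forward(self, coarse: torch.Tensor, memory: torch.Tensor) -> torch.Tensor:
        # coarse: (B, D) CLIP-like features of the described appearance
        # memory: (N, D) precomputed features of clothing-texture exemplars
        sim = F.normalize(coarse, dim=-1) @ F.normalize(memory, dim=-1).T  # (B, N)
        weights, idx = sim.topk(self.top_k, dim=-1)                        # (B, k)
        weights = weights.softmax(dim=-1)
        retrieved = (weights.unsqueeze(-1) * memory[idx]).sum(dim=1)       # (B, D)
        return self.fuse(torch.cat([coarse, retrieved], dim=-1))           # refined (B, D)

if __name__ == "__main__":
    smr = StylizedMemoryRetrieval(feat_dim=512, top_k=4)
    coarse = torch.randn(2, 512)             # stand-in for CLIP text/image features
    texture_memory = torch.randn(100, 512)   # stand-in for the external texture database
    refined = smr(coarse, texture_memory)
    print(refined.shape)                     # torch.Size([2, 512])

The refined feature would then be injected at several resolutions of the diffusion U-Net (in the spirit of the MFM module, e.g., via cross-attention) so that different body parts can attend to the appearance features at matching scales.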



Updated: 2024-05-16