MirrorDiff: Prompt redescription for zero-shot grounded text-to-image generation with attention modulation
Full-text links: NETL, NSTL, Elsevier
Large-scale layout-conditioned text-to-image diffusion models have made significant progress in generating diverse, high-quality images in which objects appear in specified regions. However, when a prompt is complex, containing multiple objects with multiple attributes, existing methods still suffer from attribute coupling, incorrect spatial relationships, and missing objects. In addition, it is difficult for users to provide precise layout conditions for complex prompts. To address these issues, we propose MirrorDiff, a training-free grounded text-to-image-to-text framework that iteratively corrects inaccurate content in synthesized images through redescription. Specifically, we first use a large language model as a layout generator: its ability to understand visual concepts and propose plausible arrangements allows it to produce scene layouts for complex prompts, helping users obtain precise layouts more conveniently. To address missing small objects, we design a layout-guided attention modulation strategy that adjusts the attention maps during the diffusion generation process, effectively increasing the attention assigned to small objects. We further propose semantic text regeneration supervision, which constrains the redescription to remain semantically consistent with the given text, mitigating attribute coupling and incorrect spatial relationships. Extensive experiments on four benchmarks show that our method achieves the best results in all categories of the Holistic, Reliable and Scalable benchmark, demonstrating that MirrorDiff attains state-of-the-art performance both quantitatively and qualitatively compared with current leading models.
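To illustrate the layout-guided attention modulation described above, the following is a minimal sketch assuming cross-attention maps of shape (heads, H*W, tokens) and layout boxes normalized to [0, 1]. The function name modulate_cross_attention, the boost parameter, and the small-box scaling heuristic are illustrative assumptions, not the paper's exact formulation.

```python
# A minimal sketch of layout-guided attention modulation (assumed interface,
# not the paper's exact method): up-weight the cross-attention of each object
# token inside its layout box, with a stronger boost for smaller boxes.
import torch


def modulate_cross_attention(attn, hw, token_boxes, boost=2.0):
    """Modulate cross-attention maps with layout boxes.

    attn        : (heads, H*W, num_tokens) cross-attention scores after softmax
    hw          : (H, W) spatial resolution of the attention map
    token_boxes : dict {token_index: (x0, y0, x1, y1)}, coords normalized to [0, 1]
    boost       : base amplification factor (assumed heuristic)
    """
    H, W = hw
    heads, _, num_tokens = attn.shape
    attn = attn.view(heads, H, W, num_tokens).clone()

    for t, (x0, y0, x1, y1) in token_boxes.items():
        # Map the normalized box onto the attention grid (at least 1 cell).
        r0, r1 = int(y0 * H), max(int(y1 * H), int(y0 * H) + 1)
        c0, c1 = int(x0 * W), max(int(x1 * W), int(x0 * W) + 1)
        # Smaller boxes get a larger boost so small objects are not ignored.
        area = (y1 - y0) * (x1 - x0)
        scale = boost * (1.0 + (1.0 - area))
        attn[:, r0:r1, c0:c1, t] *= scale

    # Renormalize over tokens so each spatial location still sums to 1.
    attn = attn / attn.sum(dim=-1, keepdim=True).clamp_min(1e-8)
    return attn.view(heads, H * W, num_tokens)
```

In practice, a hook of this kind would be applied to the cross-attention layers of the denoising U-Net at each sampling step, with the boxes taken from the LLM-generated scene layout.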
Keywords: Text-to-image generation; Large language model; Diffusion model; Attention modulation
State Key Laboratory of Chemical Safety, Qingdao Institute of Software, College of Computer Science and Technology, China University of Petroleum (East China), Qingdao 266580, China