MirrorDiff: Prompt redescription for zero-shot grounded text-to-image generation with attention modulation
Full-text links: NETL, NSTL, Elsevier
Large-scale layout-conditioned text-to-image diffusion models have made significant progress in generating diverse, high-quality images in which objects appear in specified regions. However, when a prompt is complex, containing multiple objects with multiple attributes, existing methods still suffer from attribute coupling, incorrect spatial relationships, and missing objects. In addition, it is difficult for users to provide precise layout conditions for complex prompts. To address these issues, we propose MirrorDiff, a training-free grounded text-to-image-to-text framework that iteratively corrects inaccurate content in synthesized images through redescription. Specifically, we first use a large language model as a layout generator: its ability to understand visual concepts and propose plausible arrangements allows it to produce scene layouts for complex prompts, helping users obtain precise layouts more conveniently. To address missing small objects, we design a layout-guided attention modulation strategy that adjusts the attention maps during the diffusion generation process, effectively increasing the attention assigned to small objects. We further propose semantic text regeneration supervision, which constrains the redescription to remain semantically consistent with the given text, mitigating attribute coupling and incorrect spatial relationships. Extensive experiments on four benchmarks show that our method achieves the best results in all categories of the Holistic, Reliable and Scalable benchmark, demonstrating that MirrorDiff attains state-of-the-art performance both quantitatively and qualitatively compared with current leading models.
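To illustrate the layout-guided attention modulation described above, the following is a minimal sketch assuming cross-attention maps of shape (heads, H*W, tokens) and layout boxes normalized to [0, 1]. The function name modulate_cross_attention, the boost parameter, and the small-box scaling heuristic are illustrative assumptions, not the paper's exact formulation.

```python
# A minimal sketch of layout-guided attention modulation (assumed interface,
# not the paper's exact method): up-weight the cross-attention of each object
# token inside its layout box, with a stronger boost for smaller boxes.
import torch


def modulate_cross_attention(attn, hw, token_boxes, boost=2.0):
    """Modulate cross-attention maps with layout boxes.

    attn        : (heads, H*W, num_tokens) cross-attention scores after softmax
    hw          : (H, W) spatial resolution of the attention map
    token_boxes : dict {token_index: (x0, y0, x1, y1)}, coords normalized to [0, 1]
    boost       : base amplification factor (assumed heuristic)
    """
    H, W = hw
    heads, _, num_tokens = attn.shape
    attn = attn.view(heads, H, W, num_tokens).clone()

    for t, (x0, y0, x1, y1) in token_boxes.items():
        # Map the normalized box onto the attention grid (at least 1 cell).
        r0, r1 = int(y0 * H), max(int(y1 * H), int(y0 * H) + 1)
        c0, c1 = int(x0 * W), max(int(x1 * W), int(x0 * W) + 1)
        # Smaller boxes get a larger boost so small objects are not ignored.
        area = (y1 - y0) * (x1 - x0)
        scale = boost * (1.0 + (1.0 - area))
        attn[:, r0:r1, c0:c1, t] *= scale

    # Renormalize over tokens so each spatial location still sums to 1.
    attn = attn / attn.sum(dim=-1, keepdim=True).clamp_min(1e-8)
    return attn.view(heads, H * W, num_tokens)
```

In practice, a hook of this kind would be applied to the cross-attention layers of the denoising U-Net at each sampling step, with the boxes taken from the LLM-generated scene layout.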
Keywords: Text-to-image generation; Large language model; Diffusion model; Attention modulation
State Key Laboratory of Chemical Safety, Qingdao Institute of Software, College of Computer Science and Technology, China University of Petroleum (East China), Qingdao 266580, China