Fashion Clothing Pattern Generation Based on Improved Stable Diffusion
Clothing patterns are a window through which people express personality and fashion. In recent years, with the continuous development of multimodal technology, text-based clothing pattern generation has been studied extensively. However, existing methods have seen limited practical use because of poor semantic alignment and low resolution. Since the large-scale language-image pre-training model CLIP was proposed, combining pre-trained diffusion models with CLIP for text-to-image generation has become the mainstream approach in this field. However, the original pre-trained models generalize poorly to downstream tasks: relying solely on a pre-trained model does not allow flexible, accurate control over the color and structure of clothing patterns, and the huge number of parameters makes retraining from scratch impractical. To address these problems, this study designs FT-SDM-L (Fine Tuning-Stable Diffusion Model-Lion), a network improved upon Stable Diffusion that uses a clothing image-text dataset to update the weights of the cross-attention modules in the original model. Experimental results show that the fine-tuned model improves ClipScore and HPS v2 scores by 0.08 and 1.22 on average, validating the importance of these modules in incorporating textual information. Subsequently, to further strengthen the model's feature extraction and data mapping in the clothing domain, a lightweight adapter, Stable-Adapter, is added at the output of the cross-attention modules to maximize sensitivity to changes in the input prompt. With only 0.75% additional parameters, the adapter further improves ClipScore and HPS v2 by 0.05 and 0.38. The model achieves good fidelity and semantic consistency in clothing pattern generation.

text image generation; diffusion model; cross-attention mechanism; image generation; computer vision

赵晨阳 (ZHAO Chenyang), 薛涛 (XUE Tao), 刘俊华 (LIU Junhua)


School of Computer Science, Xi'an Polytechnic University, Xi'an 710048, Shaanxi, China


2024

Computer and Modernization (计算机与现代化)
Jiangxi Computer Society; Jiangxi Institute of Computing Technology

CSTPCD
Impact factor: 0.472
ISSN:1006-2475
Year, Volume (Issue): 2024 (12)