Text-to-image generation method based on self-supervised attention and image feature fusion


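Abstract (translated from the Chinese): Existing hierarchical text-to-image generation methods use only up-sampling for feature extraction in the initial image generation stage. Up-sampling is essentially a convolution operation, and the limitations of convolution cause global information to be ignored and prevent long-range semantic interaction. Although some methods have added self-attention mechanisms to the model, problems such as missing image details and structural errors remain. To address these problems, a generative adversarial network model based on self-supervised attention and image feature fusion, SAF-GAN, is proposed. A CotNet-based self-supervised module is added to the initial feature generation stage. It uses the attention mechanism to learn autonomous mappings between image features and guides the dynamic attention matrix through the contextual relationships of the features, tightly combining context mining with self-attention learning and improving the generation of low-resolution image features. High-resolution images are then refined through alternating training of the networks at different stages. A feature fusion enhancement module is also added: by fusing the low-resolution features from the previous stage with the features of the current stage, the generator can make full use of the rich semantic information of the low-level features and the high-resolution information of the high-level features, which better ensures semantic consistency across feature maps of different resolutions and thereby yields realistic high-resolution images. Experimental results show that, compared with the baseline model (AttnGAN), SAF-GAN improves both the IS and FID metrics: on the CUB dataset the IS score increases by 0.31 and the FID decreases by 3.45; on the COCO dataset the IS score increases by 2.68 and the FID decreases by 5.18. SAF-GAN can effectively generate more realistic images, which demonstrates the effectiveness of the method.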

English title: Text-to-image generation method based on self-supervised attention and image feature fusion

English abstract: Current hierarchical text-to-image generation methods use only up-sampling for feature extraction during the initial image generation stage. The up-sampling process is essentially a convolution operation, and the limitations of convolution cause global information to be ignored and prevent long-range semantic interaction. Although some methods add self-attention mechanisms to the model, problems such as missing image details and structural errors remain. To address these problems, a generative adversarial network model, SAF-GAN, based on self-supervised attention and image feature fusion is proposed. A self-supervised module based on CotNet is added to the initial feature generation stage, and the attention mechanism is used for autonomous mapping learning between image features. The dynamic attention matrix is guided by the contextual relationships of the features, tightly combining context mining and self-attention learning, which improves the generation of low-resolution image features. High-resolution images are subsequently refined through alternating training of the networks at different stages. In addition, a feature fusion enhancement module is added. By fusing the low-resolution features of the previous stage with the features of the current stage, the generator can make full use of the rich semantic information of the low-level features and the high-resolution information of the high-level features. This further ensures the semantic consistency of feature maps at different resolutions and thereby achieves realistic high-resolution image generation. Experimental results show that, compared with the baseline model (AttnGAN), the IS score of SAF-GAN increases by 0.31 and the FID decreases by 3.45 on the CUB dataset, while the IS score increases by 2.68 and the FID decreases by 5.18 on the COCO dataset. The proposed model can effectively generate more realistic images, which demonstrates the effectiveness of the proposed method.
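
The feature fusion step described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the fusion operator, so everything here is an assumption made for illustration: nearest-neighbour up-sampling, concatenation along the channel axis, and a 1x1 linear projection. The names `upsample_nearest`, `fuse_features`, and `weights` are hypothetical and do not come from the paper.

```python
from typing import List

Channel = List[List[float]]      # one H x W feature plane
FeatureMap = List[Channel]       # C planes, i.e. shape (C, H, W)


def upsample_nearest(fmap: FeatureMap, scale: int = 2) -> FeatureMap:
    """Nearest-neighbour up-sampling of every channel by an integer factor."""
    out = []
    for plane in fmap:
        rows = []
        for row in plane:
            wide = [v for v in row for _ in range(scale)]   # repeat columns
            rows.extend(list(wide) for _ in range(scale))   # repeat rows
        out.append(rows)
    return out


def fuse_features(prev_low: FeatureMap, current: FeatureMap,
                  weights: List[List[float]], scale: int = 2) -> FeatureMap:
    """Fuse previous-stage (low-resolution) features with current-stage features.

    Assumed operator (not taken from the paper):
    up-sample -> channel concatenation -> 1x1 projection.
    `weights` has shape (C_out, C_prev + C_cur).
    """
    up = upsample_nearest(prev_low, scale)
    height, width = len(current[0]), len(current[0][0])
    if len(up[0]) != height or len(up[0][0]) != width:
        raise ValueError("up-sampled and current feature maps differ in size")
    stacked = up + current                       # concatenate along channels
    if any(len(w) != len(stacked) for w in weights):
        raise ValueError("each weight row needs one entry per stacked channel")
    # 1x1 projection: every output pixel is a weighted sum over channels.
    return [
        [[sum(w[c] * stacked[c][i][j] for c in range(len(stacked)))
          for j in range(width)]
         for i in range(height)]
        for w in weights
    ]


# Toy example: one 1x1 low-resolution channel, one 2x2 current-stage channel.
prev_low = [[[1.0]]]
current = [[[0.0, 2.0], [4.0, 6.0]]]
fused = fuse_features(prev_low, current, weights=[[0.5, 0.5]])
```

In a real model the projection weights would be learned and the operation would run on tensors in a deep learning framework; plain Python lists are used here only to keep the sketch self-contained.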

English keywords:

computer vision; generative adversarial networks; text-to-image; CotNet; image feature fusion

Authors:

Liao Yonghui (廖涌卉), Zhang Haitao (张海涛), Jin Haibo (金海波)


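Author affiliations:

School of Software, Liaoning Technical University, Huludao, Liaoning 125105

Department of Computer Science, Shantou Polytechnic, Shantou, Guangdong 515071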

Keywords:

computer vision; generative adversarial networks; text-to-image generation; CotNet; image feature fusion

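Funding:

National Natural Science Foundation of China; General Program of the Department of Science and Technology of Liaoning Province

Grant numbers:

62173171; 2022-MS-397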

Year of publication:

2024

DOI:

10.37188/CJLCD.2023-0107

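Journal: Chinese Journal of Liquid Crystals and Displays (液晶与显示)

Sponsors: Changchun Institute of Optics, Fine Mechanics and Physics, Chinese Academy of Sciences; Liquid Crystal Branch of the China Optics and Optoelectronics Manufacturers Association; Liquid Crystal Branch of the Chinese Physical Society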

Indexed in: CSTPCD; Peking University Core Journals

Impact factor: 0.964

ISSN: 1007-2780

Year, volume (issue): 2024, 39(2)

Number of references: 3