Text-to-image generation method based on self-supervised attention and image feature fusion
Current hierarchical text-to-image generation methods rely only on up-sampling for feature extraction during the initial image generation stage. Up-sampling is essentially a convolutional operation, and the limitations of convolution cause global information to be ignored and prevent long-range semantic interaction. Although some methods add self-attention mechanisms to the model, problems such as missing image details and structural errors remain. To address these problems, a generative adversarial network (GAN) model based on self-supervised attention and image feature fusion, SAF-GAN, is proposed. A self-supervised module based on ContNet is added to the initial feature generation stage, and an attention mechanism is used to learn mappings between image features autonomously. The dynamic attention matrix is guided by the contextual relationships among features, which tightly integrates context mining with self-attention learning and improves feature generation for low-resolution images. High-resolution images are then refined and generated through alternating training of the networks at different stages. In addition, a feature fusion enhancement module is added. By fusing the low-resolution features of the previous stage with the features of the current stage, the generator can make full use of the rich semantic information of the low-level features and the high-resolution information of the high-level features. This further ensures the semantic consistency of feature maps at different resolutions and enables the generation of realistic high-resolution images. Experimental results show that, compared with the benchmark model (AttnGAN), SAF-GAN increases the Inception Score (IS) by 0.31 and decreases the Fréchet Inception Distance (FID) by 3.45 on the CUB dataset, and increases the IS by 2.68 and decreases the FID by 5.18 on the COCO dataset. These results indicate that the proposed model generates more realistic images, demonstrating the effectiveness of the proposed method.
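
The fusion of previous-stage and current-stage features described above can be illustrated with a minimal NumPy sketch. It is not the SAF-GAN implementation: the abstract does not specify the fusion operator, so nearest-neighbour up-sampling, channel-wise concatenation, and a 1x1 convolution are assumed here, and the names `upsample_nearest`, `fuse_features`, and `weight` are hypothetical.

```python
import numpy as np


def upsample_nearest(feat, scale):
    """Nearest-neighbour up-sampling of a (C, H, W) feature map."""
    return feat.repeat(scale, axis=1).repeat(scale, axis=2)


def fuse_features(low_res, current, weight):
    """Fuse previous-stage features with current-stage features.

    low_res: (C_low, H, W) features from the previous, lower-resolution stage
    current: (C_cur, s*H, s*W) features of the current stage
    weight:  (C_out, C_low + C_cur) weights of an assumed 1x1 convolution
    """
    scale = current.shape[1] // low_res.shape[1]
    up = upsample_nearest(low_res, scale)            # match the spatial size
    stacked = np.concatenate([up, current], axis=0)  # channel-wise concatenation
    # A 1x1 convolution is a per-pixel linear map over the channel axis.
    return np.einsum("oc,chw->ohw", weight, stacked)


# Toy shapes chosen only for illustration.
rng = np.random.default_rng(0)
low = rng.standard_normal((4, 8, 8))     # previous stage: 8x8 feature map
cur = rng.standard_normal((4, 16, 16))   # current stage: 16x16 feature map
w = rng.standard_normal((4, 8)) * 0.1    # hypothetical fusion weights
fused = fuse_features(low, cur, w)
print(fused.shape)  # (4, 16, 16)
```

In this sketch the fused map keeps the current stage's spatial resolution, while every output channel mixes information from both stages; a learned fusion module would train `weight` jointly with the generator.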