Enhanced Transformer based on location encoding and multi-layer aggregation
An image captioning model, an enhanced Transformer based on location encoding and multi-layer aggregation, is proposed to address two kinds of underutilized information in image captioning: the spatial correlation carried by location information and the information produced at different encoder levels. The proposed model introduces a location encoding mechanism for visual objects, which extracts the relative spatial information hidden in the locations of independent regions and thereby helps the model attend to the differences and relationships between visual objects. Meanwhile, a multi-layer aggregation attention encoding is designed, which passes the image encoding information from multiple layers to the output layer through a combination of a gated recurrent unit and self-attention, so that the acquired image features carry richer semantics. Experimental results show that the performance of the proposed model is clearly better than that of models with the traditional encoder-decoder structure, and the sentences it generates are more accurate and richer in detail.
Keywords: image captioning; Transformer; multi-layer aggregation encoding; location encoding; gated recurrent unit
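The abstract does not give implementation details, so the following is only a minimal sketch, assuming PyTorch, of how the multi-layer aggregation described above (self-attention over region features combined with a gated recurrent unit scanning across the stack of encoder layers) could be organized. The module name, dimensions, and the ordering of the attention and GRU steps are illustrative assumptions, not the authors' exact formulation.

```python
# Minimal sketch of multi-layer aggregation: self-attention refines each encoder
# layer's region features, then a GRU fuses the stack of layers per region.
# All design choices below are assumptions for illustration.
import torch
import torch.nn as nn

class MultiLayerAggregation(nn.Module):
    """Aggregate the outputs of several Transformer encoder layers."""
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        # Intra-layer self-attention over region features (assumed design).
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # GRU scans across encoder layers, gating what is passed to the decoder.
        self.gru = nn.GRU(d_model, d_model, batch_first=True)

    def forward(self, layer_outputs):
        # layer_outputs: list of (batch, num_regions, d_model), one per encoder layer
        refined = []
        for h in layer_outputs:
            attn_out, _ = self.self_attn(h, h, h)  # refine each layer's regions
            refined.append(attn_out)
        stacked = torch.stack(refined, dim=2)      # (batch, regions, layers, d_model)
        b, r, l, d = stacked.shape
        seq = stacked.view(b * r, l, d)            # treat layers as a sequence
        _, h_n = self.gru(seq)                     # final GRU state fuses all layers
        return h_n.squeeze(0).view(b, r, d)        # (batch, regions, d_model)

# Usage: three encoder layers, 36 detected regions, 512-d features
layers = [torch.randn(2, 36, 512) for _ in range(3)]
agg = MultiLayerAggregation()
print(agg(layers).shape)  # torch.Size([2, 36, 512])
```

In this sketch the GRU's final hidden state serves as the fused representation handed to the decoder; the paper's actual gating and the placement of the location encoding may differ.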