Multi-Modal Text Summarization by Positive and Negative Context Alignment and Fusion
Based on the sequence-to-sequence neural network framework, this paper proposes to model the multi-modal text summarization task using both textual and visual semantic information. Specifically, a primary text encoder and a secondary gated encoder guided by image information encode the multi-modal semantics and align the semantic information of the text and the image. By observing the source text and image content aligned by the multi-modal forward attention mechanism and reverse attention mechanism, the relevant and irrelevant features of each modality's semantic information are obtained, respectively. A forward filter removes the irrelevant information passed by the forward attention mechanism, and a reverse filter removes the relevant information passed by the reverse attention mechanism, so that the semantic information of the text and the image is selectively merged in the forward and reverse directions, respectively. Finally, based on the pointer-generator network, the relevant information is used to build a forward pointer and the irrelevant information is used to build a reverse pointer, generating text summaries compensated with multi-modal semantic information. On the JD Chinese e-commerce dataset, the summaries produced by the proposed model reach 38.40, 16.71 and 28.01 on the ROUGE-1, ROUGE-2 and ROUGE-L metrics, respectively.
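The forward/reverse attention and gated filtering described above can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the function name `forward_reverse_attention`, the dot-product scoring, and the sigmoid gate parameters `W_f` and `W_r` are all assumptions made for illustration. Forward attention weights regions most related to the query (relevant features), reverse attention weights the least related regions (irrelevant features), and the two filters gate each context before fusion.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward_reverse_attention(query, keys, values, W_f, W_r):
    """Hypothetical sketch of forward/reverse attention with gated filters.

    query  : (d,)   text-side state vector
    keys   : (n, d) image-region features used for scoring
    values : (n, d) image-region features used for the context
    W_f, W_r : (d, 2d) parameters of the forward / reverse filters (assumed)
    """
    scores = keys @ query                  # (n,) alignment scores
    a_fwd = softmax(scores)                # forward attention: relevant regions
    a_rev = softmax(-scores)               # reverse attention: irrelevant regions
    c_fwd = a_fwd @ values                 # relevant image context
    c_rev = a_rev @ values                 # irrelevant image context
    # Forward filter suppresses irrelevant residue in the forward context;
    # reverse filter suppresses relevant residue in the reverse context.
    g_fwd = sigmoid(W_f @ np.concatenate([query, c_fwd]))
    g_rev = sigmoid(W_r @ np.concatenate([query, c_rev]))
    return g_fwd * c_fwd, (1.0 - g_rev) * c_rev
```

In the full model, the two filtered contexts would then feed the forward and reverse pointers of the pointer-generator decoder; here they are simply returned.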
multi-modal text summarization; multi-modal alignment; secondary gated encoding; text-generation model