New Intelligent Systems Study Findings Recently Were Reported by Researchers at National University of Defense Technology (Exploring Better Image Captioning With Grid Features)
By a News Reporter-Staff News Editor at Robotics & Machine Learning Daily News. Research findings on Machine Learning - Intelligent Systems are discussed in a new report. According to news reporting originating from Changsha, People's Republic of China, by NewsRx correspondents, research stated, "Nowadays, Artificial Intelligence Generated Content (AIGC) has shown promising prospects in both the computer vision and natural language processing communities. Meanwhile, as an essential aspect of AIGC, image captioning has received much more attention."

Financial supporters for this research include the National Natural Science Foundation of China (NSFC), the National Key Research and Development Program of China, and the Ministry of Science and Technology, China.

Our news editors obtained a quote from the research from the National University of Defense Technology: "Recent vision-language research is developing from the bulky region visual representations based on object detectors toward more convenient and flexible grid ones. However, this kind of research typically concentrates on image understanding tasks like image classification, with less attention paid to content generation tasks. In this paper, we explore how to capitalize on the expressive features embedded in the grid visual representations for better image captioning. To this end, we present a Transformer-based image captioning model, dubbed FeiM, with two straightforward yet effective designs. We first design the feature queries that consist of a limited set of learnable vectors, which act as the local signals to capture specific visual information from global grid features. Then, taking augmented global grid features and the local feature queries as inputs, we develop a feature interaction module to query relevant visual concepts from grid features, and to enable interaction between the local signals and the overall context.
Finally, the refined grid visual representations and the linguistic features pass through a Transformer architecture for multi-modal fusion. With the two novel and simple designs, FeiM can fully leverage meaningful visual knowledge to improve image captioning performance."
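The feature-query mechanism described in the quote, a small set of learnable vectors attending over the full grid of visual features, can be read as a form of cross-attention. The sketch below illustrates that general idea in NumPy; it is a rough approximation under stated assumptions, not the paper's implementation, and all names (`feature_query_attention`, the grid size, the number of queries) are hypothetical:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def feature_query_attention(grid_feats, queries):
    """Each query vector attends over all grid features.

    grid_feats: (N, d) array, the flattened H*W grid of visual features.
    queries:    (M, d) array, M learnable query vectors with M << N,
                acting as "local signals" that pull out specific
                visual information from the global grid.
    Returns an (M, d) array of per-query visual summaries.
    """
    d = grid_feats.shape[-1]
    scores = queries @ grid_feats.T / np.sqrt(d)  # (M, N) relevance of each cell to each query
    attn = softmax(scores, axis=-1)               # each query's weights over grid cells
    return attn @ grid_feats                      # (M, d) weighted combinations of grid features

# Toy example: a 7x7 grid of 8-dim features and 4 feature queries.
rng = np.random.default_rng(0)
grid = rng.standard_normal((49, 8))
q = rng.standard_normal((4, 8))
out = feature_query_attention(grid, q)
print(out.shape)  # (4, 8): one refined visual summary per query
```

In the full model these summaries would then be fused with linguistic features in the Transformer; here the sketch only shows the query-to-grid interaction step.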
Keywords: Changsha, People's Republic of China, Asia, Intelligent Systems, Machine Learning, National University of Defense Technology