Abstract: Light field (LF) cameras produce sub-aperture images that capture 3D scenes from multiple perspectives, containing both spatial and angular information. This information can be used to improve image super-resolution (SR). However, existing multi-view LF image SR methods neglect the correlation among all LF views, which is captured by the global-view information. Moreover, due to the diversity of LF views, it is essential to model an adaptation network for each LF image to incorporate complementary information from other views. To address these issues, we propose a global-view information adaptation-guided network (LF-GIANet) for LF image SR. Our network dynamically aligns features from each view to the global domain, using information from global views as guidance, and then effectively fuses spatial and angular information from all LF views through an attention mechanism. Our LF-GIANet consists of two parts. The global-view information extraction (GIE) in the global-view adaptation-guided module (GAGM) extracts global-view information and constructs guidance factors for each view; the information fusion (IF) then achieves global feature-level alignment by using these factors as the offsets of deformable convolutions. Moreover, we utilize a multi-domain information fusion module (MIFM) to handle the high-dimensional information and to supplement distinctive spatial and angular information from different LF views. We evaluate our approach on various synthetic and real scenes and show that it exceeds other state-of-the-art approaches in SR quality and performance. We also show that LF-GIANet handles both realistic and synthetic LF scenes well.
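The core alignment idea described above, predicting deformable-convolution offsets from global-view guidance, can be illustrated with a minimal PyTorch sketch; channel sizes, module names, and the way guidance factors are combined here are assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class GlobalGuidedAlign(nn.Module):
    """Hypothetical sketch: align a per-view feature map to the global domain
    with a deformable convolution whose offsets are predicted from
    global-view guidance features (all names and shapes are assumptions)."""
    def __init__(self, channels=64, kernel_size=3):
        super().__init__()
        # guidance factors -> offsets (2 values per kernel tap per location)
        self.offset_pred = nn.Conv2d(channels * 2, 2 * kernel_size * kernel_size,
                                     kernel_size=3, padding=1)
        self.deform = DeformConv2d(channels, channels, kernel_size, padding=1)

    def forward(self, view_feat, global_feat):
        # concatenate per-view and global-view features to build the guidance
        offsets = self.offset_pred(torch.cat([view_feat, global_feat], dim=1))
        return self.deform(view_feat, offsets)

# toy usage: one sub-aperture view feature and a global-view feature map
view = torch.randn(1, 64, 32, 32)
glob = torch.randn(1, 64, 32, 32)
print(GlobalGuidedAlign()(view, glob).shape)  # torch.Size([1, 64, 32, 32])
```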
Abstract: In light of challenges such as vehicle obstruction and road inundation during floods, ascertaining vehicle locations becomes difficult. Existing object detection algorithms cannot effectively detect vehicles in flood scenarios, which greatly hinders rescue operations. To address these issues, this study creates a dataset for vehicle detection in flooding scenarios and proposes an improved vehicle detection algorithm, SDF-YOLO, based on YOLOv9. The algorithm uses SKN (Selective Kernel Networks), a multi-size convolutional-kernel module, to flexibly extract target features at different granularities, which helps improve the recognition and localisation accuracy of occluded objects. The efficient convolution DSConv (Distribution Shifting Convolution) is used to reduce memory consumption during training and to improve the convergence speed of the model, so that key vehicle features are extracted quickly. In addition, the IoU is replaced with FIIoU, which effectively improves detection accuracy on difficult-to-classify samples by introducing auxiliary bounding boxes and a scale-adjustment strategy. Experimental results demonstrate that the enhanced model, compared with the YOLOv9 baseline, improves accuracy by 4.1%, F1-score by 1.7%, and mAP by 3.1%. These findings not only help improve the efficacy of flood response efforts and reduce the associated human and property losses, but also advance deep learning-based object detection in complex environments.
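As an illustration of the multi-size kernel selection idea, below is a minimal SKNet-style block in PyTorch; the exact SKN variant, the DSConv layer, and the FIIoU loss used in SDF-YOLO are not specified in the abstract, so this sketch only covers the kernel-selection mechanism under assumed channel sizes.

```python
import torch
import torch.nn as nn

class SKBlock(nn.Module):
    """Minimal selective-kernel block: two branches with different receptive
    fields, fused by a channel-wise soft selection (SKNet-style sketch)."""
    def __init__(self, channels=64, reduction=4):
        super().__init__()
        self.branch3 = nn.Conv2d(channels, channels, 3, padding=1)
        self.branch5 = nn.Conv2d(channels, channels, 3, padding=2, dilation=2)
        hidden = max(channels // reduction, 8)
        self.squeeze = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                     nn.Conv2d(channels, hidden, 1),
                                     nn.ReLU(inplace=True))
        self.attn = nn.Conv2d(hidden, channels * 2, 1)  # one score set per branch

    def forward(self, x):
        u3, u5 = self.branch3(x), self.branch5(x)
        s = self.squeeze(u3 + u5)                        # fuse then squeeze
        a = self.attn(s).view(x.size(0), 2, -1, 1, 1)    # (N, 2, C, 1, 1)
        a = torch.softmax(a, dim=1)                      # select across kernel sizes
        return a[:, 0] * u3 + a[:, 1] * u5

x = torch.randn(1, 64, 40, 40)
print(SKBlock()(x).shape)  # torch.Size([1, 64, 40, 40])
```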
Abstract: In the information age, news serves as a critical medium for information acquisition, and its importance is undeniable. However, the widespread dissemination of fake news poses a significant threat to society. Despite the efforts of many research teams to develop automated fake news detection technologies, these methods are often limited to specific domains, such as politics or healthcare. News coverage spans a vast range of fields, each with its own terminology, language style, and themes. In real-world scenarios, news articles are typically assigned a single domain label for classification, even though they often contain content spanning multiple domains. In this paper, we propose a novel Topic-guided Multi-domain Fake News Detection Framework (TG-MFEND) to address these challenges. TG-MFEND employs an Adaptive View Fusion Module to model news from multiple perspectives, including semantics, style, and themes. Additionally, we design a Topic Domain Embedder to capture the multi-domain features of news. Leveraging these multi-domain features, TG-MFEND adaptively aggregates features from different views to help determine news veracity. Experimental results on different datasets demonstrate that TG-MFEND achieves superior performance.
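The adaptive aggregation of view-level features can be sketched as a simple attention-weighted fusion; the module name, embedding dimension, and number of views below are assumptions for illustration and may differ from the actual TG-MFEND design.

```python
import torch
import torch.nn as nn

class AdaptiveViewFusion(nn.Module):
    """Toy sketch: learn a scalar weight per view representation and take the
    softmax-weighted sum (one plausible form of adaptive view fusion)."""
    def __init__(self, dim=768):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, views):                        # (batch, num_views, dim)
        w = torch.softmax(self.score(views), dim=1)  # per-view attention weights
        return (w * views).sum(dim=1)                # fused representation

views = torch.randn(4, 3, 768)   # e.g. semantic / style / topic views of a news item
print(AdaptiveViewFusion()(views).shape)  # torch.Size([4, 768])
```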
Abstract: Combining Transformers with convolutional neural networks for semantic segmentation of remote sensing images has achieved better performance than methods based on either module alone. However, the advantages of both encoding styles are not well exploited, and the fusion modules designed so far have not achieved good results in remote sensing image semantic segmentation. In this paper, to exploit local and global pixel dependencies, improved gated recurrent units combined with a fusion module, named the feature selection and fusion module (FSFM), are proposed. Concretely, to precisely incorporate the local and global representations output by the ResNet and Swin Transformer encoders, respectively, a ConvGRU with improved reset and update gates, treated as the feature selection unit (FSU), is designed to select features that benefit the segmentation task. To merge the outputs of ResNet, Swin Transformer, and the FSU, a feature fusion unit based on stacking and sequential convolutional block operations is constructed. Experimental results on the public Vaihingen, Potsdam, and BLU datasets show that FSFM is effective and outperforms state-of-the-art methods on several well-known remote sensing image semantic segmentation tasks.
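A rough sketch of a ConvGRU-style feature selection unit is shown below, with the CNN features treated as the hidden state and the Transformer features as the input; the paper's improved reset and update gates may differ from this standard formulation, and all shapes are assumed.

```python
import torch
import torch.nn as nn

class ConvGRUSelect(nn.Module):
    """Standard ConvGRU-style gating used here to merge local (CNN) and
    global (Transformer) features; a sketch, not the paper's exact gates."""
    def __init__(self, channels=256):
        super().__init__()
        self.gates = nn.Conv2d(channels * 2, channels * 2, 3, padding=1)  # update + reset
        self.cand = nn.Conv2d(channels * 2, channels, 3, padding=1)       # candidate state

    def forward(self, local_feat, global_feat):
        zr = torch.sigmoid(self.gates(torch.cat([global_feat, local_feat], dim=1)))
        z, r = zr.chunk(2, dim=1)
        h_tilde = torch.tanh(self.cand(torch.cat([global_feat, r * local_feat], dim=1)))
        return (1 - z) * local_feat + z * h_tilde   # gated feature selection

cnn_feat = torch.randn(1, 256, 32, 32)    # ResNet encoder output
swin_feat = torch.randn(1, 256, 32, 32)   # Swin Transformer output (projected)
print(ConvGRUSelect()(cnn_feat, swin_feat).shape)  # torch.Size([1, 256, 32, 32])
```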
Abstract: Recent advances in attention-based techniques have significantly propelled the field of person re-identification. Despite this progress, accurately retrieving individuals across multiple camera views remains a substantial challenge. Current person re-identification methods predominantly emphasize effective feature extraction and the suppression of irrelevant information through local or global attention mechanisms. However, these approaches often fail to fully exploit spatial relational information from a holistic perspective. The interdependence between feature nodes in distinct regions carries spatial relational knowledge, which can significantly enhance the fuzzy inference of semantic relevance and attention, especially in alignment-constrained settings. This paper introduces a novel approach that simultaneously models both "local-global" relationships and human body topology to better capture relational information. Specifically, we propose a learning paradigm that leverages non-local attention to model relationships among different body regions, thereby improving the model's ability to capture semantic correlations between spatial regions. Experimental evaluations on several benchmark datasets demonstrate the superiority of our method, which not only achieves substantial performance gains but also outperforms existing state-of-the-art approaches. Comprehensive ablation studies further confirm the efficacy and advantages of the proposed framework.
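A minimal non-local attention block over pooled body-region features is sketched below; how the regions are obtained and the exact attention formulation used in the paper are assumptions, not details from the abstract.

```python
import torch
import torch.nn as nn

class NonLocalRegionAttention(nn.Module):
    """Simple non-local (self-attention) block that relates body-region
    features to each other and adds the relational update residually."""
    def __init__(self, dim=512):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)

    def forward(self, regions):                       # (batch, num_regions, dim)
        attn = torch.softmax(self.q(regions) @ self.k(regions).transpose(1, 2)
                             / regions.size(-1) ** 0.5, dim=-1)
        return regions + attn @ self.v(regions)       # residual relational update

regions = torch.randn(8, 6, 512)   # e.g. 6 horizontal body stripes per image
print(NonLocalRegionAttention()(regions).shape)  # torch.Size([8, 6, 512])
```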
Abstract: In recent years, image super-resolution (SR) based on deep learning has achieved great success. However, current state-of-the-art (SOTA) models still suffer from high computational cost. Lightweight image SR models are extremely important for practical applications, and reducing the number of parameters and FLOPs is the key to designing them. To this end, an Inception-like large kernel network (ILKN) is proposed in this paper. Specifically, we design Inception-like large kernel blocks as the basic components of the ILKN backbone and introduce a more efficient large kernel attention module to reduce computational cost while improving performance. Experimental results show that, for different scales (×2, ×3, ×4), ILKN outperforms most existing lightweight SR methods, achieving performance close to SOTA while keeping the number of parameters and FLOPs relatively small.
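The large kernel attention idea can be illustrated with a VAN-style decomposition (depth-wise, dilated depth-wise, and point-wise convolutions); ILKN's actual attention module is not detailed in the abstract, so the sketch below is only indicative.

```python
import torch
import torch.nn as nn

class LargeKernelAttention(nn.Module):
    """Large-kernel attention sketch: a cheap decomposition approximates a
    large receptive field and its output gates the input feature map."""
    def __init__(self, channels=48):
        super().__init__()
        self.dw = nn.Conv2d(channels, channels, 5, padding=2, groups=channels)
        self.dw_dilated = nn.Conv2d(channels, channels, 7, padding=9,
                                    dilation=3, groups=channels)
        self.pw = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        attn = self.pw(self.dw_dilated(self.dw(x)))  # large effective kernel
        return x * attn                              # attention as a gating map

x = torch.randn(1, 48, 64, 64)
print(LargeKernelAttention()(x).shape)  # torch.Size([1, 48, 64, 64])
```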
Abstract: Medical image analysis plays a pivotal role in diagnosis and treatment. However, the diverse characteristics of different imaging modalities often demand distinct processing approaches. In this study, we introduce UMSSNet, a versatile method for medical image segmentation that leverages data from heterogeneous medical images across different scales. Using Gaussian pyramid-based image processing, we transform heterogeneous medical images into a uniform multi-scale image structure. UMSSNet then integrates multi-scale image features, encompassing contextual information, and adopts a dynamic, hierarchical approach to processing images at various scales, emulating the decision-making process of human pathologists and facilitating precise segmentation. We tested UMSSNet on publicly available datasets covering various forms of medical images, including WSI, biopsy slides, CT, MRI, X-ray, colonoscopy, fundus, and CMR, as well as private datasets of immunohistochemical, immunofluorescence, and Masson staining sample images. UMSSNet demonstrated performance comparable to state-of-the-art medical image segmentation methods. Furthermore, its generalizability in segmenting heterogeneous medical images holds promise for future research on multimodal medical data analysis.
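The Gaussian pyramid preprocessing that produces a uniform multi-scale structure can be sketched as repeated blur-and-downsample; the kernel, number of levels, and input shape below are assumptions, and UMSSNet's actual preprocessing may differ.

```python
import torch
import torch.nn.functional as F

def gaussian_pyramid(img, levels=3):
    """Build a simple Gaussian pyramid: blur with a 5x5 binomial kernel,
    then downsample by a factor of 2 at each level."""
    kernel = torch.tensor([1., 4., 6., 4., 1.])
    kernel = (kernel[:, None] * kernel[None, :]) / 256.0
    kernel = kernel.view(1, 1, 5, 5).repeat(img.size(1), 1, 1, 1)  # per channel
    pyramid = [img]
    for _ in range(levels - 1):
        blurred = F.conv2d(pyramid[-1], kernel, padding=2, groups=img.size(1))
        pyramid.append(blurred[..., ::2, ::2])                     # downsample by 2
    return pyramid

img = torch.randn(1, 3, 256, 256)                   # e.g. a CT slice or WSI patch
print([p.shape[-1] for p in gaussian_pyramid(img)]) # [256, 128, 64]
```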
Abstract: In recent years, a large number of cross-modal retrieval methods have been proposed and have achieved strong performance. However, these methods rely on large-scale image-text paired datasets, and collecting such datasets is costly and time-consuming. To deal with this problem, the task of unpaired cross-modal retrieval has been proposed, in which paired annotation is unavailable during model training. We study this task and propose a two-stage common semantic space construction method, inspired by the human process of learning about the world: first grasping individual things and then understanding the relationships between them. First, we construct an entity common semantic space, which only enables coarse retrieval because it neglects the relationships between objects in a scene. We then extend this space to a fusion common semantic space that focuses on the relationships between different regions in a scene, expressing the objects in the scene more completely and calculating image-text similarity more accurately. We conducted comparisons, ablation studies, and visualizations to evaluate our method, and further validated it on a downstream task, noisy cross-modal retrieval, demonstrating significant accuracy improvements over baseline methods and confirming its effectiveness.
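Coarse retrieval in an entity common semantic space can be illustrated as matching sets of entity embeddings from the two modalities; the scoring rule and embedding dimensions below are assumptions for illustration, not the paper's actual formulation.

```python
import torch
import torch.nn.functional as F

def entity_space_similarity(img_entities, txt_entities):
    """Toy sketch: represent an image by embeddings of its detected objects
    and a text by embeddings of its entity words, then score the pair by the
    best-matching object for each word, averaged over words."""
    img = F.normalize(img_entities, dim=-1)     # (num_objects, dim)
    txt = F.normalize(txt_entities, dim=-1)     # (num_words, dim)
    sim = txt @ img.t()                         # word-object cosine similarities
    return sim.max(dim=1).values.mean()         # coarse image-text similarity

img_entities = torch.randn(5, 300)              # e.g. 5 detected objects
txt_entities = torch.randn(7, 300)              # e.g. 7 entity words
print(float(entity_space_similarity(img_entities, txt_entities)))
```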
Abstract: Emotion Recognition in Conversation (ERC) is a crucial subtask in developing dialogue systems with emotional understanding capabilities. Multimodal ERC involves several types of modality data, including text, vision, and acoustic information, which collectively compensate for the limitations of single-modality approaches. Recently, Graph Neural Networks have been widely applied in multimodal ERC because of their advantages in relational modeling. However, existing methods either fuse multimodal information directly, losing interaction information between modalities, or fail to effectively capture long-distance contextual dependencies. In this paper, we propose a novel multimodal ERC approach called the Hierarchical Heterogeneous Graph Network (HHGN), which models dialogues as both directed and undirected heterogeneous graphs to facilitate hierarchical learning. The directed graph captures contextual dependency information within dialogues, while the undirected graphs learn cross-modal interaction information. Extensive experiments on two public benchmark datasets demonstrate that our model outperforms other competitive methods.
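The two graph views can be illustrated by how their edges might be constructed: directed context edges between utterances and undirected edges between the modality nodes of the same utterance. The window size and node indexing below are assumptions, not the paper's construction.

```python
import torch

def build_dialogue_edges(num_utterances, num_modalities=3, window=2):
    """Illustrative edge construction: directed edges link each utterance to
    up to `window` preceding utterances (context); undirected (symmetric)
    edges connect the modality nodes of the same utterance (interaction)."""
    directed, undirected = [], []
    for i in range(num_utterances):
        for j in range(max(0, i - window), i):
            directed.append((j, i))                       # past -> current
        for m in range(num_modalities):
            for n in range(m + 1, num_modalities):
                u, v = i * num_modalities + m, i * num_modalities + n
                undirected += [(u, v), (v, u)]            # symmetric edge pair
    return torch.tensor(directed).t(), torch.tensor(undirected).t()

ctx_edges, modal_edges = build_dialogue_edges(4)
print(ctx_edges.shape, modal_edges.shape)  # torch.Size([2, 5]) torch.Size([2, 24])
```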
Abstract: Deep neural networks have greatly advanced research on image super-resolution. Existing super-resolution networks primarily focus on optimizing visual image quality, and little work has considered the potential use of super-resolved images for machine vision tasks such as semantic segmentation. In this paper, we propose a segmentation-aware image super-resolution network that incorporates the requirements of the machine-oriented semantic segmentation task. To achieve this, we generate super-resolved images with generative adversarial networks, where the super-resolution network serves as the generator and the discriminator consists of the usual real/fake classification branch and an additional task branch. With this specially designed multi-branch discriminator, the segmentation task requirement is used as an additional training objective for the generator, so that it produces super-resolved images with segmentation awareness. Moreover, by alternating training between the generator and the discriminator, we dynamically enhance the capacity of the task branch along with the super-resolving process, improving the generalization of the resulting super-resolved images. Experimental results show that, for the semantic segmentation task, our proposed method achieves up to 6.8 points higher mIoU than competing methods. Furthermore, our method also produces visually better super-resolved images.
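A schematic of a discriminator with a real/fake branch and an added segmentation task branch is sketched below; layer counts, channel widths, and the number of segmentation classes are assumptions for illustration.

```python
import torch
import torch.nn as nn

class MultiBranchDiscriminator(nn.Module):
    """Sketch of a two-branch discriminator: a shared backbone feeds a
    patch-level real/fake head and a segmentation task head."""
    def __init__(self, num_classes=21):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.2, inplace=True))
        self.real_fake = nn.Conv2d(128, 1, 3, padding=1)   # patch-level real/fake logits
        self.seg_head = nn.Sequential(                     # segmentation task branch
            nn.Conv2d(128, num_classes, 1),
            nn.Upsample(scale_factor=4, mode='bilinear', align_corners=False))

    def forward(self, img):
        feat = self.backbone(img)
        return self.real_fake(feat), self.seg_head(feat)

sr_img = torch.randn(1, 3, 128, 128)                 # a super-resolved image
rf_logits, seg_logits = MultiBranchDiscriminator()(sr_img)
print(rf_logits.shape, seg_logits.shape)             # (1, 1, 32, 32) (1, 21, 128, 128)
```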