Journal information

The Visual Computer
Publisher: Springer-Verlag
ISSN: 0178-2789
Status: officially published

Indexed articles

    ImpRes: implicit residual diffusion models for image super-resolution

    Shiyun Zhang, Xing Deng, Haijian Shao, Yingtao Jiang, et al.
    pp. 5223-5233
    Abstract: Single-image super-resolution (SISR) is a fundamental task in computer vision that faces challenges due to the loss of high-frequency information during image degradation, leading to a nonunique solution space. Current super-resolution (SR) methods often suffer from high-frequency texture distortion, excessive smoothing, and scale inconsistency. This study introduces an innovative implicit residual diffusion model (ImpRes) to address these issues. ImpRes enhances model convergence speed and high-frequency detail recovery through a residual prediction mechanism. It incorporates a Gaussian frequency decomposition module using Gaussian high-pass filters to emphasize high-frequency components, guiding accurate texture reconstruction. Additionally, ImpRes employs static implicit neural representation (SINR) during decoding to transform discrete image representations into a continuous local implicit image function, achieving precise content perception, flexible spatial sampling, and mitigating over-smoothing. Experimental results demonstrate that ImpRes outperforms most existing diffusion-based methods in terms of model convergence time, generation quality, and scale consistency, achieving a peak signal-to-noise ratio of 29.97 dB in 4× face super-resolution tasks. Our implementation is available at: https://github.com/fineverse/ImpRes.
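    A minimal sketch (not the authors' released code) of the Gaussian high-pass idea mentioned above: a Gaussian low-pass blur is subtracted from the input so the high-frequency residual that guides texture reconstruction can be isolated; the kernel size and sigma below are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def gaussian_kernel(size: int = 11, sigma: float = 2.0) -> torch.Tensor:
    """Build a normalized 2D Gaussian kernel of shape (size, size)."""
    coords = torch.arange(size, dtype=torch.float32) - (size - 1) / 2
    g = torch.exp(-(coords ** 2) / (2 * sigma ** 2))
    g = g / g.sum()
    return torch.outer(g, g)

def gaussian_frequency_split(img: torch.Tensor, size: int = 11, sigma: float = 2.0):
    """img: (B, C, H, W); returns (low_freq, high_freq) tensors of the same shape."""
    c = img.shape[1]
    kernel = gaussian_kernel(size, sigma).to(img).view(1, 1, size, size).repeat(c, 1, 1, 1)
    low = F.conv2d(img, kernel, padding=size // 2, groups=c)   # Gaussian low-pass (blur)
    high = img - low                                           # high-pass residual
    return low, high
```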

    Optimization of 2D and 3D facial recognition through the fusion of CBAM AlexNet and ResNeXt models

    Imen Labiadh, Larbi Boubchir, Hassene Seddik
    pp. 5235-5250
    Abstract: Face identification from truncated and obscured data is a difficult problem in many computer vision and biometrics applications. It involves identifying features under synchronous or asynchronous facial changes that truncate or conceal the face. A person’s appearance can change daily due to factors like health conditions, aging, facial structure, beard growth, hairstyle, glasses, or makeup. These variations alter a person’s facial characteristics over time and make it difficult to recognize faces. The challenge in facial recognition technology lies in creating reliable algorithms that can address a range of issues pertaining to the information flow within photographs. This paper presents a novel approach to overcome this problem and improve recognition performance. Two separate hybrid deep learning-based models, named HResxtAlex-Net and HResCBAMAlex-Net, are proposed for 2D facial recognition. These models are hybrid convolutional neural network (CNN) architectures designed for face identification that leverage the fusion of multimodal biometric features from various CNN structures. The proposed approach uses feature-level fusion, merging components from the CNN structures of ResNeXt and AlexNet as well as ResNeXt and CBAMAlexNet, amplifying their individual advantages while minimizing overall computational complexity. The proposed method has been evaluated on more sophisticated and difficult 2D and 3D databases, with respect to changes in position, asynchronous alterations, and facial expressions. In addition, a brand-new 2D YaleFace dataset has been generated using data augmentation, which includes images with hidden truncation, asynchronous face changes, variations in brightness levels, and changes in lighting conditions. The experiments conducted demonstrate the effectiveness of the proposed models on masked, truncated, and blurred images, achieving a high recognition rate of up to 100%.
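    A rough sketch, under stated assumptions, of the feature-level fusion idea: pooled features from an off-the-shelf ResNeXt branch and an AlexNet branch are concatenated before a shared classifier head. This is not the published HResxtAlex-Net/HResCBAMAlex-Net code (the CBAM variant is omitted here); backbone choices and dimensions are illustrative.

```python
import torch
import torch.nn as nn
from torchvision import models

class FusedFaceNet(nn.Module):
    def __init__(self, num_classes: int):
        super().__init__()
        resnext = models.resnext50_32x4d(weights=None)
        # Drop the final fully connected layer; keep everything up to global pooling.
        self.resnext_features = nn.Sequential(*list(resnext.children())[:-1])  # (B, 2048, 1, 1)
        alexnet = models.alexnet(weights=None)
        self.alex_features = nn.Sequential(alexnet.features, alexnet.avgpool)  # (B, 256, 6, 6)
        self.alex_pool = nn.AdaptiveAvgPool2d(1)
        self.classifier = nn.Linear(2048 + 256, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f1 = torch.flatten(self.resnext_features(x), 1)               # ResNeXt branch
        f2 = torch.flatten(self.alex_pool(self.alex_features(x)), 1)  # AlexNet branch
        return self.classifier(torch.cat([f1, f2], dim=1))            # feature-level fusion
```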

    Study on the methods of hyperspectral image saliency detection based on MBCNN

    He Yu, Kang Yan, Jiexi Chen, Xuan Li, et al.
    pp. 5251-5266
    Abstract: Hyperspectral imaging is resource-intensive due to the extensive spectral information it captures. This paper introduces an advanced hyperspectral salient object detection method via a novel multibranch convolutional neural network (MBCNN), which synergizes convolutional neural networks and random forests to utilize both spectral and spatial data more efficiently. We have optimized the MBCNN by refining channel numbers to improve feature sampling rates, enhancing the neural network layers for robust representation, and developing a new loss function that integrates mean absolute error with precision-recall metrics to overcome common discrepancies between saliency map losses and actual detection precision or recall. These enhancements have led to a marked performance increase, evidenced by a 2.5% rise in AUC score to 0.941 and a 13% improvement in MAE score to 0.146 over the baseline MBCNN. Our experimental results confirm the significant advancements these modifications contribute to the network’s detection capabilities in hyperspectral salient object detection. Code is available at https://github.com/Ccu-yankang/MBCNN_4_HSI.
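    A hedged sketch of a loss in the spirit of the one described above, mixing mean absolute error with a differentiable precision/recall (F-measure) term; the weighting and beta value are illustrative assumptions, not the paper's settings.

```python
import torch

def mae_pr_loss(pred: torch.Tensor, target: torch.Tensor,
                weight: float = 1.0, beta2: float = 0.3, eps: float = 1e-6):
    """pred, target: saliency maps in [0, 1] with shape (B, 1, H, W)."""
    mae = torch.mean(torch.abs(pred - target))
    tp = (pred * target).sum(dim=(1, 2, 3))            # soft true positives
    precision = tp / (pred.sum(dim=(1, 2, 3)) + eps)
    recall = tp / (target.sum(dim=(1, 2, 3)) + eps)
    f_beta = (1 + beta2) * precision * recall / (beta2 * precision + recall + eps)
    # Penalize low F-measure alongside the pixel-wise MAE.
    return mae + weight * (1.0 - f_beta.mean())
```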

    MCGFF-Net: a multi-scale context-aware and global feature fusion network for enhanced polyp and skin lesion segmentation

    Yanxiang Li, Wenzhe Meng, Dehua Ma, Siping Xu, et al.
    pp. 5267-5282
    Abstract: Accurate segmentation of polyps and skin lesions is crucial for clinical diagnosis. While many UNet-based methods have made significant progress, challenges such as variations in the scale of segmentation targets and difficulty distinguishing lesion regions from normal tissues persist. To address these challenges, we propose an efficient multi-scale context-aware and global feature fusion network called MCGFF-Net. Specifically, we incorporate the CBAM attention module into the encoder to enhance the ability to capture key information. We further propose a multi-scale perception module that adaptively extracts multi-scale semantic information to address the challenge of scale variation in lesion regions. To enhance the interaction of semantic information between cross-layer features, we design a cross-layer feature fusion module (CFM) to alleviate the problem of some lesion regions not being clearly distinguished from the background. Additionally, an efficient pyramid channel attention module is included in the CFM to filter noise and redundant information. We conduct extensive experiments on five publicly available skin lesion and polyp datasets, including CVC-ClinicDB, Kvasir-SEG, ISIC2017, ISIC2018, and PH2. The results indicate that our MCGFF-Net outperforms current popular methods, achieving excellent performance in polyp and skin lesion segmentation tasks. The code is available at https://github.com/liyanxiang985/MCGFF-Net.
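    For reference, a minimal CBAM (convolutional block attention module) sketch of the kind the encoder above incorporates; the reduction ratio and spatial kernel size follow common CBAM defaults and are not necessarily the paper's settings.

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    def __init__(self, channels: int, reduction: int = 16, kernel_size: int = 7):
        super().__init__()
        self.mlp = nn.Sequential(                       # shared MLP for channel attention
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels))
        self.spatial = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))              # channel attention from average pooling
        mx = self.mlp(x.amax(dim=(2, 3)))               # ... and from max pooling
        x = x * torch.sigmoid(avg + mx).view(b, c, 1, 1)
        s = torch.cat([x.mean(dim=1, keepdim=True),     # spatial attention from channel-wise
                       x.amax(dim=1, keepdim=True)], 1) # average and max maps
        return x * torch.sigmoid(self.spatial(s))
```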

    Multi-scale local regional attention fusion using visual transformers for fine-grained image classification

    Yusong Li, Bin Xie, Yuling Li, Jiahao Zhang, et al.
    pp. 5283-5298
    Abstract: Fine-grained visual classification (FGVC) poses a significant challenge due to the minute differences among visually similar categories. The objects to be distinguished are often difficult to tell apart, even for human observers, because the differences between them are so small. Traditional methods struggle with this task, prompting the development of a multi-scale local regional attention fusion scheme based on Visual Transformers. We utilize Swin Transformer as the backbone to extract fine-grained features, enhancing feature representations through a multi-headed attention mechanism over relevant local regions. Furthermore, the convolutional forward propagation network module refines global spatial and channel features. Our approach achieves state-of-the-art performance on benchmarks like CUB-200-2011, NABirds, and Oxford 102 Flowers, demonstrating the effectiveness of our multi-scale fusion strategy for FGVC. Our code will be available at https://github.com/LYSongs/RRSA.
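    A hedged illustration (not the released RRSA code) of one way such local regional attention fusion can work: the tokens that receive the most attention in the final transformer block are gathered and concatenated with a global descriptor; `top_k` and the pooling choices are assumptions.

```python
import torch

def fuse_attended_regions(tokens: torch.Tensor, attn: torch.Tensor, top_k: int = 4):
    """
    tokens: (B, N, D) patch features from the backbone.
    attn:   (B, heads, N, N) attention weights of the last block.
    Returns a fused descriptor of shape (B, (top_k + 1) * D).
    """
    # How much attention each token receives, averaged over heads and queries.
    received = attn.mean(dim=1).mean(dim=1)                  # (B, N)
    idx = received.topk(top_k, dim=1).indices                # indices of salient local regions
    idx = idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1))  # (B, top_k, D)
    local = torch.gather(tokens, 1, idx)                     # top-k local region features
    global_feat = tokens.mean(dim=1, keepdim=True)           # global average feature
    return torch.cat([global_feat, local], dim=1).flatten(1)
```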

    MFADU-Net: an enhanced DoubleU-Net with multi-level feature fusion and atrous decoder for medical image segmentation

    Yongpeng Zhao, Guangyuan Zhang, Kefeng Li, Zhenfang Zhu, et al.
    pp. 5299-5309
    Abstract: In medical image processing, adaptations of the U-Net framework such as DoubleU-Net are widely used and are considered fundamental models. Although DoubleU-Net enhances feature extraction and context comprehension by adding extra layers to U-Net, there are still some challenges to be solved. Multi-scale regions of interest (ROIs) might cause performance decline and information loss; complex contextual information may lead to inaccuracies in capturing environment details, affecting segmentation outcomes; unclear lesion or anatomical structure boundaries result in uncertain segmentations. To tackle these issues, we introduce an improved version of DoubleU-Net, named MFADU-Net, which incorporates multi-level feature fusion and an atrous decoder for more advanced segmentation of complex medical images. Firstly, the multi-level feature fusion block leverages feature extraction and addresses multi-scale ROI challenges through a dual attention mechanism, excelling in detail capture and contextual understanding. Secondly, the dynamic atrous decoder offers outstanding flexibility and accuracy, further enhanced by a gated attention module for key area identification. Experimental results on the CVC-ClinicDB and ISIC2018 datasets demonstrate that MFADU-Net outperforms current mainstream methods, achieving segmentation precision of 89.3% and 93.1%, respectively. The code is available at https://github.com/Zhao-Yp/MFADU-Net.
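    A minimal sketch of an atrous (dilated) decoder block in the spirit of the one described above: parallel dilated convolutions enlarge the receptive field at several rates before the features are merged. The dilation rates and layer layout are assumptions, not the MFADU-Net configuration.

```python
import torch
import torch.nn as nn

class AtrousDecoderBlock(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, rates=(1, 2, 4)):
        super().__init__()
        # One dilated 3x3 convolution branch per rate; padding keeps spatial size.
        self.branches = nn.ModuleList([
            nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r),
                          nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
            for r in rates])
        self.project = nn.Conv2d(out_ch * len(rates), out_ch, 1)  # merge the branches

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.project(torch.cat([branch(x) for branch in self.branches], dim=1))
```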

    Multimodal fusion and knowledge distillation for improved anomaly detection

    Meichen Lu, Yi Chai, Kaixiong Xu, Weiqing Chen, et al.
    pp. 5311-5322
    Abstract: Anomaly detection aims to distinguish normal from abnormal images, with applications in industrial defect detection and medical imaging. Current methods using textual information often focus on designing effective textual prompts but overlook their full utilization. This paper proposes a multimodal fusion network that integrates image and text information to improve anomaly detection. The network comprises an image encoder, text encoder, and stacked cross-attention module. To address the absence of text during inference, an image-only branch is introduced, guided by the multimodal fusion network through knowledge distillation. Experiments on industrial anomaly detection and medical image datasets demonstrate the effectiveness of our approach, achieving AUROC and AUPR scores of 96.5% and 89.2% on VisA, respectively. The code is available at https://github.com/lilianoa/Multimodal-guide-AD.
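    A hedged sketch of the distillation idea: an image-only student branch is trained to match the representation of the frozen multimodal (image + text) fusion network, so text is not required at inference time. The `teacher`/`student` modules and the feature-level MSE objective are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def distillation_step(image, text, teacher, student, alpha: float = 1.0):
    """Return the knowledge-distillation loss for one batch."""
    with torch.no_grad():
        teacher_feat = teacher(image, text)   # frozen multimodal fusion branch
    student_feat = student(image)             # image-only branch used at inference
    # Feature-level distillation: align the student with the fused teacher representation.
    kd_loss = F.mse_loss(student_feat, teacher_feat)
    return alpha * kd_loss
```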

    EHFusion: an efficient heterogeneous fusion model for group-based 3D human pose estimation

    Jihua Peng, Yanghong Zhou, P. Y. Mok
    pp. 5323-5345
    Abstract: Stimulated by its important applications in animation, gaming, virtual reality, augmented reality, and healthcare, 3D human pose estimation has received considerable attention in recent years. To improve the accuracy of 3D human pose estimation, most approaches have converted this challenging task into a local pose estimation problem by dividing the body joints into different groups based on human body topology. The body joint features of the different groups are then fused to predict the overall pose of the whole body, which requires a joint feature fusion scheme. Nevertheless, the joint feature fusion schemes adopted in existing methods involve the learning of extensive parameters and hence are computationally very expensive. This paper reports a new topology-based grouped method ‘EHFusion’ for 3D human pose estimation, which involves a heterogeneous feature fusion (HFF) module that integrates grouped pose features. The HFF module reduces the computational complexity of the model while achieving promising accuracy. Moreover, we introduce motion amplitude information and a camera intrinsic embedding module to provide better global information and 2D-to-3D conversion knowledge, thereby improving the overall robustness and accuracy of the method. In contrast to previous methods, the proposed new network can be trained end-to-end in a single stage. Experimental results not only demonstrate the advantageous trade-offs between estimation accuracy and computational complexity achieved by our method but also showcase the competitive performance in comparison with various existing state-of-the-art methods (e.g., transformer-based) when evaluated on two public datasets, Human3.6M and HumanEva. The data and code are available at doi:10.5281/zenodo.11113132.
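    An illustrative sketch of the topology-based grouping step such methods rely on: body joints are split into limb/torso groups whose features are estimated separately and later fused into the full-body pose. The 17-joint indices below follow a common Human3.6M layout and are an assumption, not EHFusion's exact grouping.

```python
import torch

# Assumed grouping over the common 17-joint Human3.6M skeleton layout.
JOINT_GROUPS = {
    "torso_head": [0, 7, 8, 9, 10],
    "right_leg":  [1, 2, 3],
    "left_leg":   [4, 5, 6],
    "left_arm":   [11, 12, 13],
    "right_arm":  [14, 15, 16],
}

def split_into_groups(joint_feats: torch.Tensor):
    """joint_feats: (B, 17, D) -> dict of per-group feature tensors (B, len(group), D)."""
    return {name: joint_feats[:, idx, :] for name, idx in JOINT_GROUPS.items()}
```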

    X-ray security inspection for real-world rail transit hubs: a wide-ranging dataset and detection model with incremental learning block

    Xizhuo Yu, Chaojie Fan, Jiandong Pan, Guoliang Xiang, et al.
    pp. 5347-5359
    Abstract: Security inspection plays a crucial role in maintaining public safety, and the automated, accurate identification of prohibited items in X-ray images shows significant promise. However, detection methods based on deep learning models rely on large labeled datasets and can only achieve closed-set detection with a limited number of categories. Existing public datasets suffer from poor image quality and are limited to airport security, rendering them unsuitable for railway transportation security inspections. To fill this gap, we contribute the first prohibited-item detection dataset based on real-world inspection scenarios at rail transit hubs. It includes 5,923 X-ray images, in which seven categories of 10,224 instances are manually annotated by professional security inspectors. In addition, we propose an incremental approach to open-set detection that allows the detection system to be updated online. The experimental results show that our method achieves state-of-the-art performance on different datasets and is able to detect newly added categories online. The impact of different loss functions on the model’s detection performance is also discussed in this paper.
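    A hedged sketch of one ingredient of incremental open-set detection: the detector's classification head is expanded with weights for a newly added category while the learned weights for known classes are copied over, so existing classes need not be retrained from scratch. This illustrates the general idea, not the paper's incremental learning block.

```python
import torch
import torch.nn as nn

def expand_class_head(head: nn.Linear, num_new: int = 1) -> nn.Linear:
    """Return a new linear classification head with `num_new` extra output classes."""
    old_out, in_dim = head.weight.shape
    new_head = nn.Linear(in_dim, old_out + num_new)
    with torch.no_grad():
        new_head.weight[:old_out] = head.weight   # keep the learned weights for known classes
        new_head.bias[:old_out] = head.bias
    return new_head                               # only the new rows need training
```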

    CT-UFormer: an improved hybrid decoder for image segmentation

    Junli Shen, Yuman Hai, Chongyu Lin
    pp. 5361-5371
    Abstract: Segmentation of lung nodules in medical images is crucial for early detection and treatment planning of lung cancer. The nnU-Net has achieved significant success in numerous medical image segmentation tasks due to its efficient design. However, nnU-Net demonstrates limitations in capturing long-range dependencies. In contrast, the Transformer model effectively manages long-range dependencies through global self-attention mechanisms. In this paper, we propose an improved nnU-Net with a Transformer decoder to segment lung nodules and identify regions of interest. The method includes two key components: (1) a data augmentation module in which a GAN generates lung nodules to increase the amount of available training data; (2) a hybrid decoder that captures multi-scale feature maps to enable the refinement of “organ label sets.” Our extensive evaluations on two datasets of different sizes, MSD-Lung and LIDC-IDRI, reveal the effectiveness of our contributions in terms of both efficiency and accuracy. Our results are superior to those of commonly used state-of-the-art methods. Compared with existing improved Transformers, our method performs excellently on the DSC, MASD, and HD95 metrics. In summary, this study proposes a new lung nodule segmentation method with higher accuracy and robustness than commonly used methods; it automatically performs effective data augmentation on the input data and balances global features and local details through a hybrid decoder. Code is available at https://github.com/andou6/CT-UFormer.
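    A minimal sketch of a hybrid decoder stage in the spirit described above: an upsampled feature map is fused with its skip connection, refined by a convolutional block for local detail, and passed through a transformer encoder layer for long-range dependencies. Channel sizes, the skip-connection layout, and layer choices are assumptions rather than the released CT-UFormer configuration.

```python
import torch
import torch.nn as nn

class HybridDecoderStage(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, num_heads: int = 4):
        super().__init__()
        # out_ch must be divisible by num_heads for the attention layer below.
        self.up = nn.ConvTranspose2d(in_ch, out_ch, kernel_size=2, stride=2)
        self.conv = nn.Sequential(nn.Conv2d(out_ch * 2, out_ch, 3, padding=1),
                                  nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
        self.attn = nn.TransformerEncoderLayer(d_model=out_ch, nhead=num_heads,
                                               batch_first=True)

    def forward(self, x: torch.Tensor, skip: torch.Tensor) -> torch.Tensor:
        x = self.conv(torch.cat([self.up(x), skip], dim=1))   # local-detail (convolutional) path
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)                 # (B, H*W, C) token sequence
        tokens = self.attn(tokens)                            # global self-attention refinement
        return tokens.transpose(1, 2).reshape(b, c, h, w)
```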