Abstract: Image dehazing under deficient data is an ill-posed and challenging problem. Most existing methods tackle this task by developing either CycleGAN-based hazy-to-clean translation or physics-based haze decomposition. However, geometric structure is often not effectively incorporated in their straightforward hazy-clean projection frameworks, which can incur inaccurate estimation in distant areas. In this paper, we rethink the image dehazing task and propose a depth-aware perception framework, DehazeDP, for robust haze decomposition on deficient data. DehazeDP is inspired by the Diffusion Probabilistic Model and forms an end-to-end training pipeline that seamlessly combines hazy image generation with haze disentanglement. Specifically, in the forward phase, haze is added to a clean image step by step according to the depth distribution. Then, in the reverse phase, a unified U-Net predicts the haze and recovers the clean image progressively. Extensive experiments on public datasets demonstrate that the proposed DehazeDP performs favorably against state-of-the-art approaches.
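The depth-conditioned forward phase described above can be illustrated with the standard atmospheric scattering model, I = J·t + A·(1 − t) with t = exp(−β·d). The sketch below is a minimal illustration, not the paper's actual schedule: the scattering coefficient `beta`, the airlight value, and the linear step schedule are all assumptions chosen for clarity; distant pixels (large depth) accumulate haze first.

```python
import numpy as np

def transmission(depth, beta):
    """Atmospheric scattering transmission t = exp(-beta * d)."""
    return np.exp(-beta * depth)

def forward_haze_step(clean, depth, step, num_steps, airlight=1.0, beta_max=1.5):
    """Add haze progressively: scattering strength grows with the step index,
    so pixels with larger depth haze up earlier (assumed linear schedule)."""
    beta = beta_max * (step + 1) / num_steps
    t = transmission(depth, beta)
    return clean * t + airlight * (1.0 - t)

# toy example: 2x2 image where the bottom row is far away
clean = np.full((2, 2), 0.2)
depth = np.array([[0.1, 0.1], [5.0, 5.0]])
hazy = forward_haze_step(clean, depth, step=9, num_steps=10)
```

At the final step the far pixels are pulled almost entirely toward the airlight, while near pixels retain most of the clean signal, which is the depth-aware behavior the abstract describes.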
Abstract: Recently, although the Transformer has found widespread application in computer vision, the quadratic complexity of its self-attention hinders processing in large-scale image captioning tasks. Therefore, in this paper, we propose a Learnable Linear-Attention with Fast-Normalization for large-scale image captioning (dubbed LLAFN-Generator). First, it introduces a Learnable Linear-Attention (LLA) module to handle the attention-weight learning of large-scale images; it is implemented simply with two linear layers and greatly reduces the computational complexity. Meanwhile, the Fast-Normalization (FN) method is employed in the Learnable Linear-Attention instead of the original Softmax function to improve computational speed. Additionally, a feature enhancement module is used to compensate for shallow, fine-grained information, enhancing the feature representation of the model. Finally, extensive experiments on the MS COCO dataset show that the computational complexity is reduced by 30% and the parameter count by 20% on models of the same size, with the performance metrics BLEU_1 and CIDEr increasing by 1.2% and 3.6%, respectively.
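The complexity argument above can be made concrete with a generic kernelized linear attention, shown next to standard softmax attention. This is a minimal sketch of the linear-attention idea in general, not the paper's LLA/FN design (whose exact layers are not specified in the abstract); the positive feature map `phi` is an assumption. By associating the product as φ(Q)(φ(K)ᵀV), the cost drops from O(N²d) to O(Nd²).

```python
import numpy as np

def softmax_attention(Q, K, V):
    """Standard attention: builds an N x N score matrix, O(N^2) in length N."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

def linear_attention(Q, K, V, eps=1e-6):
    """Kernelized linear attention: phi(Q) @ (phi(K).T @ V) costs O(N * d^2)."""
    phi = lambda x: np.maximum(x, 0) + eps   # simple positive feature map (assumed)
    Qp, Kp = phi(Q), phi(K)
    KV = Kp.T @ V                            # (d, d) summary, independent of N
    Z = Qp @ Kp.sum(axis=0)                  # per-query normalizer replacing softmax
    return (Qp @ KV) / Z[:, None]

rng = np.random.default_rng(0)
N, d = 64, 8
Q, K, V = rng.standard_normal((3, N, d))
out = linear_attention(Q, K, V)
```

The (d, d) summary `KV` never grows with sequence length, which is what makes large-scale inputs tractable.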
Abstract: Photometric Stereo (PS) addresses the challenge of reconstructing a three-dimensional (3D) representation of an object by estimating the 3D normals at all points on the object's surface. This is achieved through the analysis of at least three photographs, all taken from the same viewpoint but under distinct lighting conditions. This paper introduces a novel approach for Universal PS, i.e., when both the active lighting conditions and the ambient illumination are unknown. Our method employs a multi-scale encoder-decoder architecture based on Transformers that accommodates images of any resolution as well as a varying number of input images. We are able to scale up to very high-resolution images, such as 6000 by 8000 pixels, without losing performance while maintaining a decent memory footprint. Moreover, experiments on publicly available datasets establish that our proposed architecture improves the accuracy of the estimated normal field by a significant factor compared to state-of-the-art methods.
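The "at least three photographs" requirement comes from the classical Lambertian formulation, where each image gives one linear constraint per pixel on the albedo-scaled normal. The sketch below shows that classical least-squares baseline (Woodham-style calibrated PS), not the paper's Transformer method; the synthetic pixel and light directions are illustrative assumptions.

```python
import numpy as np

def estimate_normals(intensities, lights):
    """Classical Lambertian PS: I = L @ (rho * n), solved per pixel by least squares.
    intensities: (m, P) stack of m images flattened to P pixels.
    lights: (m, 3) light directions. Returns (P, 3) unit normals and (P,) albedo."""
    G, *_ = np.linalg.lstsq(lights, intensities, rcond=None)  # (3, P)
    albedo = np.linalg.norm(G, axis=0)
    normals = (G / np.maximum(albedo, 1e-12)).T
    return normals, albedo

# synthetic check: one pixel with a known upward normal, three known lights
n_true = np.array([0.0, 0.0, 1.0])
L = np.array([[0.0, 0.0, 1.0], [0.6, 0.0, 0.8], [0.0, 0.6, 0.8]])
I = (L @ n_true).reshape(3, 1)   # albedo = 1, no noise
n_est, rho = estimate_normals(I, L)
```

Universal PS removes the assumption that `L` is known, which is exactly the gap the learned architecture targets.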
Abstract: Images and videos often suffer from issues such as motion blur, video discontinuity, or rolling shutter artifacts. Prior studies typically focus on designing specific algorithms to address individual issues. In this paper, we highlight that these issues, albeit differently manifested, fundamentally stem from sub-optimal exposure processes. With this insight, we propose a paradigm termed re-exposure, which resolves the aforementioned issues by performing exposure simulation. Following this paradigm, we design a new architecture, which constructs a visual content representation from images and event camera data, and performs exposure simulation in a controllable manner. Experiments demonstrate that, using only a single model, the proposed architecture can effectively address multiple visual issues, including motion blur, video discontinuity, and rolling shutter artifacts, even when these issues co-occur.
Abstract: Hand gesture is one of the most efficient and natural interfaces in current human-computer interaction (HCI) systems. Despite the great progress achieved in hand gesture-based HCI, perceiving or tracking the hand pose from images remains challenging. In the past decade, several challenges have been identified and explored, such as the incomplete-data issue, the requirement for large-scale annotated datasets, and 3D hand pose estimation from monocular RGB images; however, there is a lack of surveys providing a comprehensive collection and analysis of these challenges and their corresponding solutions. To this end, this paper addresses the general challenges of hand gesture interpretation techniques in HCI systems based on visual sensors and elaborates on the corresponding solutions in the current state of the art, which can serve as a systematic reminder of the practical problems of hand gesture interpretation. Moreover, this paper provides informative cues on recent datasets to further point out the inherent differences and connections among them, such as the annotation of objects and the number of hands, which are important for conducting research yet ignored by previous reviews. In retrospect of recent developments, this paper also conjectures what future work will concentrate on, from the perspectives of both hand gesture interpretation and dataset construction.
Abstract: Hyperspectral (HS) images often suffer from low spatial resolution compared with conventional optical image types, which has limited their further applications in remote sensing. Therefore, HS image super-resolution (SR) techniques are broadly employed in order to observe finer spatial structures while preserving the spectra of ground covers. In this paper, a novel multi-dimensional attention-aided transposed convolutional long short-term memory (LSTM) network is proposed for the single HS image super-resolution task. The proposed network employs a convolutional bi-directional LSTM for local and non-local spatial-spectral feature exploration, and transposed convolution for image amplification and reconstruction. Moreover, a multi-dimensional attention module is proposed to capture salient features on the spectral, channel, and spatial dimensions simultaneously, further improving the learning ability of the network. Experiments on four commonly used HS images demonstrate the effectiveness of this approach compared with several state-of-the-art deep learning-based SR methods.
Abstract: While convolutional neural networks (CNNs) have achieved remarkable performance in single image deraining, it remains a very challenging task due to the CNN's limited receptive field and the unreality of the output image. In this paper, UC-former, an effective and efficient U-shaped transformer-based architecture for image deraining, is presented. UC-former contains two core designs to avoid heavy self-attention computation and inefficient communication between encoder and decoder. First, we propose a novel cross-channel Transformer block, which computes self-attention between channels. It significantly reduces the computational complexity on high-resolution rain maps while capturing global context. Second, we propose a multi-scale feature fusion module between the encoder and decoder to combine low-level local features and high-level non-local features. In addition, we employ depth-wise convolution and the H-Swish non-linear activation function in the Transformer blocks to enhance rain removal authenticity. Extensive experiments indicate that our method outperforms state-of-the-art deraining approaches on synthetic and real-world rainy datasets.
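The complexity benefit of computing self-attention between channels can be sketched generically: the attention map is C × C rather than HW × HW, so the cost grows linearly with the number of pixels. This is a minimal illustration of channel-wise ("transposed") attention in general, not UC-former's exact block; the shared projections and L2 normalization are simplifying assumptions.

```python
import numpy as np

def channel_self_attention(x):
    """Self-attention across channels on a (C, HW) feature map.
    The affinity matrix is C x C, so cost is O(C^2 * HW) -- linear in pixels."""
    C, HW = x.shape
    Q = K = V = x                        # learned projections omitted for brevity
    Qn = Q / np.linalg.norm(Q, axis=1, keepdims=True)
    Kn = K / np.linalg.norm(K, axis=1, keepdims=True)
    attn = Qn @ Kn.T                     # (C, C) channel-affinity map
    attn = np.exp(attn) / np.exp(attn).sum(axis=1, keepdims=True)
    return attn @ V                      # (C, HW), globally mixed over channels

x = np.random.default_rng(1).standard_normal((16, 128 * 128))
y = channel_self_attention(x)
```

For a high-resolution rain map, HW dominates C by orders of magnitude, which is why this design scales where spatial self-attention does not.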
Abstract: Multi-view 3D shape classification, which identifies a 3D shape based on its 2D views rendered from different viewpoints, has emerged as a promising method of shape understanding. A key building block in these methods is cross-view feature aggregation. However, existing methods predominantly follow the "extract-then-aggregate" pipeline for view-level global feature aggregation, leaving cross-view pixel-level feature interaction under-explored. To tackle this issue, we develop a "fuse-while-extract" pipeline, with a novel View-aligned Pixel-level Fusion (VPF) module to fuse cross-view pixel-level features originating from the same 3D part. We first reconstruct the 3D coordinate of each feature via the rasterization results, then match and fuse the features via spatial neighbor searching. Incorporating the proposed VPF module with a ResNet18 backbone, we build a novel view-aligned multi-view network, which conducts feature extraction and cross-view fusion alternately. Extensive experiments have demonstrated the effectiveness of the VPF module as well as the excellent performance of the proposed network.
Abstract: The widespread use of various chemical gases in industrial processes necessitates effective measures to prevent their leakage during transportation and storage, given their high toxicity. Thermal infrared-based computer vision detection techniques provide a straightforward approach to identifying gas leakage areas. However, the development of high-quality algorithms has been challenging due to the low texture in thermal images and the lack of open-source datasets. In this paper, we present the RGB-Thermal Cross Attention Network (RT-CAN), which employs an RGB-assisted two-stream network architecture to integrate texture information from RGB images and gas area information from thermal images. Additionally, to facilitate research on invisible gas detection, we introduce Gas-DB, an extensive open-source gas detection database including about 1.3K well-annotated RGB-thermal images across eight different collection scenes. Experimental results demonstrate that our method successfully leverages the advantages of both modalities, achieving state-of-the-art (SOTA) performance among RGB-thermal methods and surpassing single-stream SOTA models in terms of accuracy, Intersection over Union (IoU), and F2 metrics by 4.86%, 5.65%, and 4.88%, respectively.
Fatima Haimour, Rizik Al-Sayyed, Waleed Mahafza, Omar S. Al-Kadi...
pp. 104100.1-104100.15
Abstract: Brain imaging plays a crucial role in the diagnosis and treatment of various neurological disorders, providing valuable insights into the structure and function of the brain. Techniques such as magnetic resonance imaging (MRI) and computed tomography (CT) enable non-invasive visualization of the brain, aiding in the understanding of brain anatomy, abnormalities, and functional connectivity. However, cost and radiation dose may limit the acquisition of specific image modalities, so medical image synthesis can be used to generate required medical images without actual acquisition. CycleGAN and other GANs are valuable tools for generating synthetic images across various fields. In the medical domain, where obtaining labeled medical images is labor-intensive and expensive, addressing data scarcity is a major challenge. Recent studies propose using transfer learning to overcome this issue. This involves adapting pre-trained CycleGAN models, initially trained on non-medical data, to generate realistic medical images. In this work, transfer learning was applied to the task of MR-to-CT image translation and vice versa using 18 pre-trained non-medical models, and the models were fine-tuned to achieve the best results. The models' performance was evaluated using four widely used image quality metrics: Peak Signal-to-Noise Ratio, Structural Similarity Index, Universal Quality Index, and Visual Information Fidelity. Quantitative evaluation and qualitative perceptual analysis by radiologists demonstrate the potential of transfer learning in medical imaging and the effectiveness of the generic pre-trained model. The results provide compelling evidence of the model's exceptional performance, which can be attributed to the high quality and similarity of the training images to actual human brain images. These results underscore the significance of carefully selecting appropriate and representative training images to optimize performance in brain image analysis tasks.
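Of the four image quality metrics named above, Peak Signal-to-Noise Ratio is the simplest to state exactly: PSNR = 10·log10(R²/MSE), where R is the data range. The sketch below is a minimal reference implementation with a synthetic example (the constant-offset "synthetic image" is purely illustrative, not from the paper's data).

```python
import numpy as np

def psnr(reference, synthetic, data_range=255.0):
    """Peak Signal-to-Noise Ratio in dB; higher means closer to the reference."""
    diff = reference.astype(np.float64) - synthetic.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(data_range ** 2 / mse)

# toy example: a synthesized image off by a constant 10 gray levels
ref = np.full((64, 64), 128.0)
noisy = ref + 10.0
value = psnr(ref, noisy)   # 10 * log10(255^2 / 100) ~= 28.13 dB
```

SSIM, UQI, and VIF additionally compare local structure rather than raw pixel error, which is why papers typically report several of these metrics together.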