Abstract: Image restoration (IR) has been widely used in many computer vision applications. Model-based IR methods have clear theoretical bases, but numerous hyper-parameters must be set empirically, which is often challenging and time-consuming. Because of their powerful nonlinear fitting ability, deep convolutional neural networks (CNNs) have been widely used in IR tasks in recent years. However, it is challenging to design new network architectures that further significantly improve IR performance. Inspired by plug-and-play (P&P) methods, we first decouple the original IR problem into two subproblems with the variable splitting technique. Then, derived from the model-based methods, a novel deep CNN framework in the transformation domain is proposed to mimic the optimization process of the two subproblems. The proposed framework is driven effectively by the target vector update (TVU) module. Extensive experiments demonstrate the effectiveness of our proposed method over other state-of-the-art IR methods. (c) 2021 Elsevier Ltd. All rights reserved.
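As a rough illustration of the variable-splitting idea that underlies P&P methods in general (not the paper's TVU-driven framework), the sketch below alternates a data-fidelity step with a plug-in denoiser standing in for a learned CNN prior; all function names and the toy box-filter denoiser are hypothetical.

```python
import numpy as np

def pnp_hqs(y, degrade, degrade_adj, denoise, mu=0.5, iters=30):
    """Plug-and-play half-quadratic-splitting sketch.

    Splits min_x ||degrade(x) - y||^2 + prior(x) into a data
    subproblem (a gradient step here) and a prior subproblem
    (handled by an off-the-shelf denoiser)."""
    x = degrade_adj(y)          # initial estimate
    z = x.copy()                # auxiliary (split) variable
    for _ in range(iters):
        # Data subproblem: pull x toward consistency with y.
        grad = degrade_adj(degrade(x) - y) + mu * (x - z)
        x = x - 0.1 * grad
        # Prior subproblem: the proximal operator is replaced
        # by a denoiser -- the "plug-and-play" step.
        z = denoise(x)
    return x

# Toy usage: identity degradation, simple box-blur "denoiser".
rng = np.random.default_rng(0)
clean = np.ones((8, 8))
noisy = clean + 0.3 * rng.standard_normal((8, 8))
box = lambda u: (u + np.roll(u, 1, 0) + np.roll(u, -1, 0)) / 3.0
restored = pnp_hqs(noisy, lambda u: u, lambda u: u, box)
print(np.abs(restored - clean).mean())
```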
Abstract: It is well known that detail features and context semantics are conducive to improving object detection performance. However, current single-prediction detectors do not incorporate these two types of information well. To alleviate the limitation of single prediction on the use of multiple types of information, we propose a dual detection branch network (DDBN) with adjacent feature compensation and a customized training strategy for semantically diverse predictions. Different from conventional single-prediction models, our DDBN is a single model with two semantically different predictions. In particular, two types of adjacent feature compensation are designed to extract detail and context information from different perspectives. Also, a specialized training strategy is customized for our DDBN to fully exploit the diversity of predictions for improving object detection performance. We conduct extensive experiments on three datasets, i.e., DOTA, MS-COCO, and Pascal-VOC, and the experimental results strongly demonstrate the efficacy of our proposed model. (c) 2021 Elsevier Ltd. All rights reserved.
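To make the dual-branch idea concrete, here is a minimal PyTorch sketch of two prediction heads that compensate a mid-level feature map with its shallower and deeper neighbours; the layer sizes and fusion choices are invented for illustration and are not the DDBN architecture.

```python
import torch
import torch.nn as nn

class DualBranchHead(nn.Module):
    """Illustrative dual-prediction head: one branch borrows detail
    from the shallower (adjacent) feature map, the other borrows
    context from the deeper one."""
    def __init__(self, ch=256, num_classes=20):
        super().__init__()
        self.detail = nn.Conv2d(ch * 2, num_classes, 1)   # detail-compensated branch
        self.context = nn.Conv2d(ch * 2, num_classes, 1)  # context-compensated branch
        self.up = nn.Upsample(scale_factor=2, mode="nearest")

    def forward(self, shallow, mid, deep):
        # Adjacent feature compensation: concatenate the mid-level
        # map with resized neighbours from either side of the pyramid.
        detail_in = torch.cat([mid, nn.functional.avg_pool2d(shallow, 2)], dim=1)
        context_in = torch.cat([mid, self.up(deep)], dim=1)
        return self.detail(detail_in), self.context(context_in)

# Two semantically different predictions from one model.
head = DualBranchHead()
s, m, d = torch.rand(1, 256, 64, 64), torch.rand(1, 256, 32, 32), torch.rand(1, 256, 16, 16)
p_detail, p_context = head(s, m, d)
```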
Abstract: With the prevalence of multimedia content on the Web, which usually arrives continuously in a stream fashion, online cross-modal hashing methods have attracted extensive interest in recent years. However, most online hashing methods adopt a relaxation strategy or a real-valued auxiliary variable strategy to avoid the complex optimization of hash codes, leading to large quantization errors. In this paper, building on Discrete Latent Factor model-based cross-modal Hashing (DLFH), we propose a novel cross-modal online hashing method, i.e., Discrete Online Cross-modal Hashing (DOCH). To generate uniformly high-quality hash codes for different modalities, DOCH not only directly exploits the similarity between newly arriving data and old existing data in the Hamming space, but also utilizes fine-grained semantic information via label embedding. Moreover, DOCH can learn hash codes discretely with an efficient optimization algorithm. Extensive experiments conducted on two real-world datasets demonstrate the superiority of DOCH. (c) 2021 Elsevier Ltd. All rights reserved.
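For orientation only, the sketch below shows a generic online hashing round in which codes for newly arriving data are aligned to the fixed codes of old data. Note that it uses exactly the real-valued relaxation that DOCH avoids, so it illustrates the online setting rather than DOCH's discrete optimisation; all names and sizes are made up.

```python
import numpy as np

def online_hash_round(X_new, B_old, S, n_bits=32, W=None, lr=0.1, steps=50):
    """Toy online hashing round: fit a linear hash function so that
    Hamming-space inner products of new codes against stored old codes
    match a +/-1 similarity matrix S, then binarise with sign."""
    rng = np.random.default_rng(0)
    if W is None:
        W = rng.standard_normal((X_new.shape[1], n_bits)) * 0.01
    for _ in range(steps):
        H = X_new @ W                        # relaxed codes for the new chunk
        # Gradient of ||H B_old^T / n_bits - S||_F^2 w.r.t. W
        R = H @ B_old.T / n_bits - S
        W -= lr * X_new.T @ (R @ B_old) / (n_bits * len(X_new))
    return np.sign(X_new @ W), W             # discrete codes via sign

rng = np.random.default_rng(1)
B_old = np.sign(rng.standard_normal((50, 32)))   # codes of old data (fixed)
X_new = rng.standard_normal((10, 8))             # newly arriving features
S = np.sign(rng.standard_normal((10, 50)))       # toy new-vs-old similarity
B_new, W = online_hash_round(X_new, B_old, S)
```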
Abstract: Graph clustering based on embedding aims to divide nodes with higher similarity into several mutually disjoint groups, but it is not trivial to maximally embed the graph structure and node attributes into a low-dimensional feature space. Furthermore, most current advanced graph node clustering methods separate the graph embedding technique from the clustering algorithm, ignoring the potential relationship between them. Therefore, we propose an innovative end-to-end graph clustering framework with a joint strategy to handle this complex problem in a non-Euclidean space. For learning the graph embedding, we propose a new variational graph auto-encoder algorithm based on the Graph Convolutional Network (GCN), which takes into account the boosting influence of jointly generating the graph structure and node attributes on the embedding output. On the basis of the embedding representation, we implement a self-training mechanism through the construction of an auxiliary distribution to further enhance the prediction of node categories, thereby realizing unsupervised clustering. In addition, the loss contribution of each cluster is normalized to prevent large clusters from distorting the embedding space. Extensive experiments on real-world graph datasets validate our design and demonstrate that our algorithm is highly competitive with state-of-the-art graph clustering methods. (c) 2021 Elsevier Ltd. All rights reserved.
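The self-training mechanism described here resembles the DEC-style auxiliary distribution; assuming that reading, below is a minimal NumPy sketch of the soft assignment, the sharpened and cluster-normalised target, and the KL objective. All shapes and hyper-parameters are illustrative.

```python
import numpy as np

def soft_assign(z, centers, alpha=1.0):
    """Student's-t similarity between embeddings z (n x d) and cluster centers."""
    d2 = ((z[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    q = (1.0 + d2 / alpha) ** (-(alpha + 1.0) / 2.0)
    return q / q.sum(axis=1, keepdims=True)

def target_distribution(q):
    """Auxiliary distribution: sharpen soft assignments and normalise per
    cluster so large clusters cannot dominate the embedding space."""
    weight = q ** 2 / q.sum(axis=0)          # square, divide by cluster size
    return weight / weight.sum(axis=1, keepdims=True)

# Self-training step: minimise KL(p || q) to pull assignments toward p.
z = np.random.default_rng(0).standard_normal((100, 16))
centers = z[:4].copy()                       # toy initial centers
q = soft_assign(z, centers)
p = target_distribution(q)
kl = (p * np.log(p / q)).sum(axis=1).mean()  # clustering loss to backpropagate
```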
Abstract: Recently, deep neural networks (DNNs) have shown serious vulnerability to adversarial examples with imperceptible perturbations to clean images. To counter this issue, many powerful defensive methods (e.g., ComDefend) focus on rectifying adversarial examples with models trained on a large dataset (e.g., of clean-adversarial image pairs). However, such methods rely heavily on external priors learned from a large training dataset while neglecting the rich internal priors of the input image itself, thus limiting the generalization of the defense models to adversarial examples whose image statistics deviate from the external training dataset. Motivated by the deep image prior, which can capture rich image statistics from a single image, we propose an effective Deep Image Prior Driven Defense (DIPDefend) method against adversarial examples. With a DIP generator fitting the target/adversarial input, we find that the image reconstruction exhibits an interesting learning preference from a feature-learning perspective: the early stage primarily learns robust features resistant to adversarial perturbation, followed by non-robust features that are sensitive to it. Besides, we develop an adaptive stopping strategy that adapts our method to diverse images. In this way, the proposed model obtains a unique defender for each individual adversarial input, thus being robust to various attackers. Experimental results demonstrate the superiority of our method over state-of-the-art defense methods against white-box and black-box adversarial attacks. (c) 2021 Elsevier Ltd. All rights reserved.
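A minimal sketch of the general deep-image-prior defence pattern, assuming a small convolutional generator and a simple loss-plateau stop rule as a stand-in for the paper's adaptive stopping strategy:

```python
import torch
import torch.nn as nn

def dip_defend(adv_img, steps=500, patience=20, tol=1e-4):
    """Overfit a small conv net (fixed random input -> image) to the
    adversarial input and stop early, before the non-robust
    perturbation is reproduced. Architecture and stop rule are
    illustrative stand-ins."""
    net = nn.Sequential(
        nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
        nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
        nn.Conv2d(64, 3, 3, padding=1), nn.Sigmoid(),
    )
    z = torch.rand(1, 32, *adv_img.shape[-2:])   # fixed random code
    opt = torch.optim.Adam(net.parameters(), lr=1e-3)
    best, stall = float("inf"), 0
    out = net(z)
    for _ in range(steps):
        opt.zero_grad()
        out = net(z)
        loss = ((out - adv_img) ** 2).mean()
        loss.backward()
        opt.step()
        # Stop once the fit plateaus: early iterations capture robust
        # structure; later ones start fitting the adversarial noise.
        if best - loss.item() < tol:
            stall += 1
            if stall >= patience:
                break
        else:
            best, stall = loss.item(), 0
    return out.detach()

purified = dip_defend(torch.rand(1, 3, 32, 32), steps=100)
```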
Abstract: Image captioning is a hot research topic bridging computer vision and natural language processing over the past several decades. It has achieved great progress with the help of large-scale datasets and deep learning techniques. Despite the variety of image captioning models (ICMs), their performance has hit a bottleneck judging from publicly published results. Considering the marginal performance gains brought by recent ICMs, we raise the following question: "What performance do recent ICMs achieve on in-the-wild images?" To answer this question, we compare existing ICMs by evaluating their generalization ability. Specifically, we propose a novel method based on maximum discrepancy competition to diagnose existing ICMs. Firstly, we establish a new test set containing only informative images, selected from an arbitrary large-scale raw image set by applying maximum discrepancy competition to the existing ICMs. Secondly, a small-scale and low-cost subjective annotation experiment is conducted on the new test set. Thirdly, we rank the generalization ability of the existing ICMs by comparing their performances on the new test set. Finally, the key properties of different ICMs are identified through a detailed analysis of the experimental results. Our analysis yields several interesting findings, including that 1) using low- and high-level object features simultaneously may be an effective way to boost the generalization ability of Transformer-based ICMs; 2) the self-attention mechanism may model inter- and intra-modal data better than other attention-based mechanisms; and 3) constructing an ICM with a multistage language decoder may be a promising way to improve its performance. (c) 2021 Elsevier Ltd. All rights reserved.
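As a hedged illustration of the selection step, the following sketch ranks images by how strongly two captioning models disagree and keeps the top-k as the informative test set; the token-overlap similarity is a placeholder for whatever caption-comparison measure is actually used in the paper.

```python
import numpy as np

def max_discrepancy_select(captions_a, captions_b, score_fn, k=100):
    """Pick the k images on which two captioning models disagree most:
    the spirit of maximum discrepancy competition is that maximally
    disputed images are the most informative ones to annotate."""
    gaps = np.array([1.0 - score_fn(a, b) for a, b in zip(captions_a, captions_b)])
    return np.argsort(gaps)[::-1][:k]        # indices of most-disputed images

def overlap(a, b):
    """Toy caption similarity: Jaccard overlap of word sets."""
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / max(len(ta | tb), 1)

idx = max_discrepancy_select(
    ["a dog runs", "a cat sleeps"],
    ["a dog running", "two cars race"],
    overlap, k=1)                             # picks the second image
```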
Abstract: Most existing RGB-D salient object detectors make use of the complementary information of RGB-D images to overcome challenging scenarios, e.g., low contrast and cluttered backgrounds. However, these models generally neglect the fact that one of the input images may be of poor quality, which adversely affects the discriminative ability of cross-modal features when the two channels are fused directly. To address this issue, a novel end-to-end RGB-D salient object detection model is proposed in this paper. At the core of our model is a Semantic-Guided Modality-Weight Map Generation (SG-MWMG) sub-network, which produces modality-weight maps indicating which regions of both modalities are high-quality, given the input RGB-D images and the guidance of their semantic information. Based on it, a Bi-directional Multi-scale Cross-modal Feature Fusion (Bi-MCFF) module is presented, where the interactions of features across different modalities and scales are exploited by a novel bi-directional structure to better capture cross-scale and cross-modal complementary information. The experimental results on several benchmark datasets verify the effectiveness and superiority of the proposed method over state-of-the-art methods. (c) 2021 Published by Elsevier Ltd.
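A minimal sketch of quality-aware fusion in the spirit of a modality-weight map (without the semantic guidance that SG-MWMG adds): a per-pixel weight gates the RGB and depth features before they are fused. Layer sizes are invented.

```python
import torch
import torch.nn as nn

class WeightedFusion(nn.Module):
    """Predict a per-pixel modality weight from the concatenated
    features and use it to gate the two modalities before fusion,
    so a low-quality modality contributes less in its weak regions."""
    def __init__(self, ch=64):
        super().__init__()
        self.weight = nn.Sequential(
            nn.Conv2d(2 * ch, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, 1, 1), nn.Sigmoid())
        self.fuse = nn.Conv2d(2 * ch, ch, 1)

    def forward(self, f_rgb, f_depth):
        w = self.weight(torch.cat([f_rgb, f_depth], dim=1))  # 1 = trust RGB here
        gated = torch.cat([w * f_rgb, (1 - w) * f_depth], dim=1)
        return self.fuse(gated)

fused = WeightedFusion()(torch.rand(1, 64, 56, 56), torch.rand(1, 64, 56, 56))
```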
Abstract: Image-to-video person re-identification (I2V ReID), which aims to retrieve human targets between image-based queries and video-based galleries, has recently become a new research focus. However, the appearance misalignment and modality misalignment in images and videos, caused by pose variations, camera views, misdetections, and different data types, keep I2V ReID challenging. To this end, we propose a deep I2V ReID pipeline based on three-dimensional semantic appearance alignment (3D-SAA) and cross-modal interactive learning (CMIL) to address these two challenges. Specifically, in the 3D-SAA module, the aligned local appearance images extracted by dense 3D human appearance estimation are used in conjunction with the global image and video embedding streams to learn more fine-grained identity features. The aligned local appearance images are further semantically aggregated by the proposed multi-branch aggregation network to suppress negligible body parts. Moreover, to overcome the influence of modality misalignment, a CMIL module enables communication between the global image and video streams by interactively propagating the temporal information in videos to the channels of the image feature maps. Extensive experiments on the challenging MARS, DukeMTMC-VideoReID, and iLIDS-VID datasets show the superiority of our approach. (c) 2021 Elsevier Ltd. All rights reserved.
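To illustrate one plausible form of propagating video temporal information into image feature channels, here is a squeeze-and-excitation-style sketch; it is an assumption-laden stand-in, not the paper's CMIL module.

```python
import torch
import torch.nn as nn

class ChannelInteraction(nn.Module):
    """Pool a video clip's temporal features into a channel-attention
    vector and use it to recalibrate the image stream's feature maps,
    letting the video modality inform the image modality."""
    def __init__(self, ch=128):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(ch, ch // 4), nn.ReLU(),
                                 nn.Linear(ch // 4, ch), nn.Sigmoid())

    def forward(self, img_feat, vid_feat):
        # img_feat: (B, C, H, W); vid_feat: (B, T, C)
        gates = self.mlp(vid_feat.mean(dim=1))      # temporal average -> channel gates
        return img_feat * gates[:, :, None, None]   # modulate image channels

out = ChannelInteraction()(torch.rand(2, 128, 24, 12), torch.rand(2, 8, 128))
```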
Deshpande, Gauri; Batliner, Anton; Schuller, Bjoern W.
10 pages
Abstract: The Coronavirus (COVID-19) pandemic has impelled numerous research efforts, from collecting COVID-19 patients' data to screening them for virus detection. Some COVID-19 symptoms are related to the functioning of the respiratory system, which influences speech production; this suggests research on identifying markers of COVID-19 in speech and other human-generated audio signals. In this article, we give an overview of research on human audio signals using 'Artificial Intelligence' techniques to screen, diagnose, and monitor COVID-19 and to spread awareness about it. This overview will be useful for developing automated systems that can help in the context of COVID-19, using non-obtrusive and easy-to-use bio-signals conveyed in human non-speech and speech audio productions. (c) 2021 Elsevier Ltd. All rights reserved.
Abstract: Can one learn to diagnose COVID-19 under extremely minimal supervision? Since the outbreak of the novel COVID-19, there has been a rush to develop automatic techniques for expert-level disease identification on chest X-ray data. In particular, deep supervised learning has become the go-to paradigm. However, the performance of such models depends heavily on the availability of a large and representative labelled dataset, the creation of which is an expensive and time-consuming task, and poses a particular challenge for a novel disease. Semi-supervised learning has shown the ability to match the remarkable performance of supervised models whilst requiring a small fraction of the labelled examples. This makes the semi-supervised paradigm an attractive option for identifying COVID-19. In this work, we introduce a graph-based deep semi-supervised framework for classifying COVID-19 from chest X-rays. Our framework introduces an optimisation model for graph diffusion that reinforces the natural relation between the tiny labelled set and the vast unlabelled data. We then use the diffusion prediction outputs as pseudo-labels in an iterative scheme in a deep net. We demonstrate, through our experiments, that our model is able to outperform the current leading supervised model with a tiny fraction of the labelled examples. Finally, we provide attention maps to accommodate the radiologist's mental model, better fitting their perceptual and cognitive abilities. These visualisations aim to assist the radiologist in judging whether the diagnosis is correct and, consequently, to accelerate the decision. (C) 2021 Elsevier Ltd. All rights reserved.
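Assuming a Zhou et al.-style label-spreading formulation as a generic stand-in for the paper's graph diffusion model, the sketch below propagates a tiny labelled set over a normalised affinity graph and returns argmax pseudo-labels for the iterative deep-net training:

```python
import numpy as np

def diffuse_labels(W, y, labeled_mask, alpha=0.9, iters=50):
    """Toy graph diffusion for pseudo-labels: spread the labelled
    rows of y (n x k one-hot, zeros for unlabelled) over a
    symmetrically normalised affinity matrix W, then take argmax."""
    d = W.sum(axis=1)
    S = W / np.sqrt(np.outer(d, d))                 # D^{-1/2} W D^{-1/2}
    F = y.copy()
    for _ in range(iters):
        F = alpha * S @ F + (1 - alpha) * y         # diffuse, keep label anchors
    pseudo = F.argmax(axis=1)
    pseudo[labeled_mask] = y[labeled_mask].argmax(axis=1)  # trust true labels
    return pseudo

# 4-node toy chain graph with one labelled node per class.
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
y = np.zeros((4, 2)); y[0, 0] = 1; y[3, 1] = 1
print(diffuse_labels(W, y, np.array([True, False, False, True])))
```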