Abstract: Existing embedding-based zero-shot learning (ZSL) models usually learn a projection function from the visual feature space to a semantic embedding space, e.g. an attribute space or a word vector space. However, a projection learned from seen samples may not generalize well to unseen classes, which is known as the projection domain shift problem in ZSL. To address this issue, we propose a method named Low-rank Semantic Autoencoder (LSA), which exploits the low-rank structure of seen samples while keeping the reconstruction error sparse, further improving zero-shot learning capability. Moreover, to obtain a more robust projection for unseen classes, we propose a Specific Rank-controlled Semantic Autoencoder (SRSA) that accurately controls the rank of the projection. Extensive experiments on six benchmarks demonstrate the superiority of the proposed models over most existing embedding ZSL models under both the standard zero-shot setting and the more realistic generalized zero-shot setting. (c) 2021 Elsevier Ltd. All rights reserved.
Mohamed, Mostafa M.; Nessiem, Mina A.; Batliner, Anton; Bergler, Christian...
11 pages
Abstract: The sudden outbreak of COVID-19 has resulted in tough challenges for the field of biometrics due to its spread via physical contact and the regulations on wearing face masks. Given these constraints, voice biometrics can offer a suitable contactless biometric solution; they can benefit from models that classify whether a speaker is wearing a mask or not. This article reviews the Mask Sub-Challenge (MSC) of the INTERSPEECH 2020 COMputational PARalinguistics challengE (ComParE), which focused on the following classification task: given an audio chunk of a speaker, classify whether the speaker is wearing a mask or not. First, we report the collection of the Mask Augsburg Speech Corpus (MASC) and the baseline approaches used to solve the problem, achieving a performance of 71.8% Unweighted Average Recall (UAR). We then summarise the methodologies explored in the submitted and accepted papers, which mainly used two common patterns: (i) phonetic-based audio features, or (ii) spectrogram representations of audio combined with Convolutional Neural Networks (CNNs) typically used in image processing. Most approaches enhance their models by adapting ensembles of different models and attempting to increase the size of the training data using various techniques. We review and discuss the results of the participants of this sub-challenge, where the winner scored a UAR of 80.1%. Moreover, we present the results of fusing the approaches, leading to a UAR of 82.6%. Finally, we present a smartphone app that can be used as a proof-of-concept demonstration to detect in real time whether users are wearing a face mask; we also benchmark the run-time of the best models. (c) 2021 Elsevier Ltd. All rights reserved.
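The metric reported above, Unweighted Average Recall, is the mean of the per-class recalls, so each class counts equally regardless of how many samples it has. The following is a minimal sketch of that computation in plain Python; the function name, the labels "mask" and "clear", and the toy predictions are illustrative assumptions and are not taken from the challenge data or its baseline code.

```python
from collections import defaultdict


def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recalls; every class has equal weight."""
    total = defaultdict(int)    # number of samples per true class
    correct = defaultdict(int)  # correctly predicted samples per true class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


# Hypothetical toy labels, only to illustrate the arithmetic.
y_true = ["mask", "mask", "mask", "mask", "clear", "clear"]
y_pred = ["mask", "mask", "mask", "clear", "clear", "clear"]

# Recall is 3/4 for "mask" and 2/2 for "clear", so UAR = (0.75 + 1.0) / 2.
uar = unweighted_average_recall(y_true, y_pred)
```

On this toy input the plain accuracy is 5/6, while the UAR is 0.875, which shows how the two measures diverge when classes are imbalanced.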
Abstract: Constructing adversarial perturbations for deep neural networks is an important direction of research. Crafting image-dependent adversarial perturbations using white-box feedback has hitherto been the norm for such adversarial attacks. However, black-box attacks are much more practical for real-world applications. Universal perturbations applicable across multiple images are gaining popularity due to their innate generalizability. There have also been efforts to restrict the perturbations to a few pixels in the image, which helps to retain visual similarity with the original images and makes such attacks hard to detect. This paper marks an important step that combines all these directions of research. We propose the DEceit algorithm for constructing effective universal pixel-restricted perturbations using only black-box feedback from the target network. We conduct empirical investigations using the ImageNet validation set on state-of-the-art deep neural classifiers, varying the number of pixels to be perturbed from a meager 10 pixels to as many as all pixels in the image. We find that perturbing only about 10% of the pixels in an image using DEceit achieves a commendable and highly transferable Fooling Rate while retaining the visual quality. We further demonstrate that DEceit can be successfully applied to image-dependent attacks as well. In both sets of experiments, we outperform several state-of-the-art methods. (c) 2021 Published by Elsevier Ltd.
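The abstract reports a Fooling Rate without defining it. A minimal sketch follows, assuming the definition commonly used for universal perturbations: the fraction of images whose predicted label changes once the perturbation is added. The function name and the toy label lists are hypothetical, and this is not the DEceit algorithm itself, only the evaluation measure under that assumed definition.

```python
def fooling_rate(clean_preds, perturbed_preds):
    """Fraction of inputs whose predicted label changes under perturbation.

    Assumed definition; the abstract does not spell it out.
    """
    if len(clean_preds) != len(perturbed_preds):
        raise ValueError("prediction lists must have the same length")
    if not clean_preds:
        raise ValueError("prediction lists must not be empty")
    changed = sum(c != p for c, p in zip(clean_preds, perturbed_preds))
    return changed / len(clean_preds)


# Hypothetical predicted class indices before and after perturbation.
clean = [1, 2, 3, 4]
perturbed = [1, 5, 3, 0]
rate = fooling_rate(clean, perturbed)  # two of four labels changed
```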
Abstract: Recovering 3D voxelized shapes with fine details from single-view 2D images is an extremely challenging and ill-conditioned problem. Most existing methods learn the 3D reconstruction process by encoding the 3D shapes and the 2D images into the same low-dimensional latent vector, which lacks the capacity to capture detailed features on the surface of the 3D object shapes. To address this issue, we propose to explore a rich intermediate representation for 3D shape reconstruction by using a newly designed network architecture. We first use a two-stream network to infer the depth map and the topology-specific mean shape from the given 2D image, which forms the intermediate representation prediction branch. The intermediate representations capture the global spatial structure and the visible surface geometric structure, which are important for reconstructing high-quality 3D shapes. Based on the obtained intermediate representation, a novel shape transformation network is then proposed to reconstruct the fine details of the whole 3D object shape. The experimental results on the challenging ShapeNet and Pix3D datasets show that our approach outperforms the existing state-of-the-art methods. (c) 2021 Elsevier Ltd. All rights reserved.
Abstract: Meta-learning aims to train a classifier on collections of tasks, such that it can recognize new classes given few samples from each. However, current approaches suffer from overfitting and poor generalization, since internal representation learning is obstructed by background and noise in the limited samples. To alleviate these issues, we propose Unsupervised Descriptor Selection (UDS) to tackle few-shot learning tasks. Specifically, a descriptor selection module is proposed to localize and select semantically meaningful regions in feature maps without supervision. The selected features are then mapped into novel vectors by a task-related aggregation module to enhance internal representations. With a simple network structure, UDS makes adaptation between tasks more efficient and improves performance in few-shot learning. Extensive experiments with various backbones on Caltech-UCSD Bird and miniImageNet indicate that UDS achieves performance comparable to state-of-the-art methods and improves the performance of prior meta-learning methods. (c) 2021 Elsevier Ltd. All rights reserved.
Abstract: In real-world application scenarios, object detection usually faces two technical challenges, i.e., achieving high accuracy and high speed. Although the latest detection frameworks based on anchor-free detection have achieved outstanding performance, they cannot be widely used in real-world scenarios due to their model complexity and slow speed. In this paper, inspired by the cross-context attention mechanism of human visual systems, we propose a light but effective single-shot detection framework using a Cross-context Attention-guided Network (CCAGNet) to balance accuracy and speed. CCAGNet uses an attention-guided mechanism to highlight the interaction of object-synergy regions and suppresses non-object-synergy regions by combining a Cross-context Attention Mechanism (CCAM), a Receptive Field Attention Mechanism (RFAM), and a Semantic Fusion Attention Mechanism (SFAM). The main contribution of our work is a novel attention mechanism that takes the context information of channel, spatial, cross and adjacent regions into consideration simultaneously. Extensive experiments demonstrate the feasibility and effectiveness of our method on public benchmark datasets. To the best of our knowledge, CCAGNet obtains state-of-the-art performance on both PascalVOC and MSCOCO, with an excellent trade-off between accuracy and speed among single-shot detectors. In particular, the Average Precision (AP) metric is significantly improved by 17.0% on small object detection on MSCOCO. (c) 2021 Published by Elsevier Ltd.
Abstract: In this study, we aim to improve the accuracy of image splicing detection. We propose a progressive image splicing detection method that can detect the position and shape of the spliced region. Because image splicing is likely to destroy or change the consistent correlation pattern introduced by the color filter array (CFA) interpolation process, we first use a covariance matrix to reconstruct the R, G and B channels of the image and utilize the inconsistencies of the CFA interpolation pattern to extract forensic features. Then, these forensic features are used to perform coarse-grained detection, and texture strength features are used to perform fine-grained detection. Finally, an edge smoothing method is applied to realize precise localization. Compared to the state-of-the-art CFA-based image splicing detection methods, the proposed method has high detection accuracy and strong robustness against content-preserving manipulations and JPEG compression. (c) 2021 Elsevier Ltd. All rights reserved.
Abstract: Class imbalance is an inherent characteristic of multi-label data that hinders most multi-label learning methods. One efficient and flexible strategy to deal with this problem is to employ sampling techniques before training a multi-label learning model. Although existing multi-label sampling approaches alleviate the global imbalance of multi-label datasets, it is actually the imbalance level within the local neighbourhood of minority class examples that plays a key role in performance degradation. To address this issue, we propose a novel measure to assess the local label imbalance of multi-label datasets, as well as two multi-label sampling approaches, namely Multi-Label Synthetic Oversampling based on Local label imbalance (MLSOL) and Multi-Label Undersampling based on Local label imbalance (MLUL). By considering all informative labels, MLSOL creates more diverse and better labeled synthetic instances for difficult examples, while MLUL eliminates instances that are harmful to their local region. Experimental results on 13 multi-label datasets demonstrate the effectiveness of the proposed measure and sampling approaches for a variety of evaluation metrics, particularly in the case of an ensemble of classifiers trained on repeated samples of the original data. (c) 2021 Elsevier Ltd. All rights reserved.
Abstract: Modeling image sets as points on a Grassmann manifold has attracted increasing interest in the computer vision community and has been used in many applications. However, such approaches suffer from the high computational cost of operating on Grassmann manifolds, especially high-dimensional ones. In this paper, we propose an unsupervised robust dimensionality reduction algorithm for Grassmann manifolds based on Neighborhood Preserving Embedding (GNPE). We first introduce two strategies to construct the coefficients-based similarity graph to eliminate the effects of errors. Then, a projection is learned from the high-dimensional Grassmann manifold to a relatively low-dimensional one with more discriminative capability, where the local neighborhood structure is well preserved. To address the issue that the estimated similarity graph is unreliable in the presence of noise and outliers, we further propose a unified learning framework which performs similarity learning and projection learning simultaneously. By leveraging the interactions between these two essential tasks, we can capture accurate structures and learn discriminative projections. The proposed method can be optimized by an efficient iterative algorithm. Experiments on various image set classification and clustering tasks clearly show that our model achieves consistent improvements in terms of both effectiveness and efficiency. (c) 2021 Elsevier Ltd. All rights reserved.
Abstract: Deciding on the unimodality of a dataset is an important problem in data analysis and statistical modeling. It provides knowledge about the structure of the dataset, i.e. whether the data points have been generated by a probability distribution with a single peak or with more than one peak. Such knowledge is very useful for several data analysis problems, such as deciding on the number of clusters and determining unimodal projections. We propose a technique called UU-test (Unimodal Uniform test) to decide on the unimodality of a one-dimensional dataset. The method operates on the empirical cumulative distribution function (ecdf) of the dataset. It attempts to build a piecewise linear approximation of the ecdf that is unimodal and models the data sufficiently, in the sense that the data corresponding to each linear segment follow the uniform distribution. A unique feature of this approach is that, in the case of unimodality, it also provides a statistical model of the data in the form of a Uniform Mixture Model. We present experimental results to assess the ability of the method to decide on unimodality and perform comparisons with the well-known dip-test approach. In addition, in the case of unimodal datasets, we evaluate the Uniform Mixture Models provided by the proposed method using the test set log-likelihood and the two-sample Kolmogorov-Smirnov (KS) test. (c) 2021 Elsevier Ltd. All rights reserved.
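The ecdf that the method operates on is the step function giving, for each value x, the fraction of data points less than or equal to x. The sketch below computes only that function in plain Python. It does not implement the UU-test or its piecewise linear approximation, and the function name and the sample values are illustrative assumptions.

```python
import bisect


def make_ecdf(data):
    """Return F where F(x) is the fraction of data points <= x."""
    xs = sorted(data)
    n = len(xs)
    if n == 0:
        raise ValueError("data must not be empty")

    def F(x):
        # bisect_right counts how many sorted values are <= x.
        return bisect.bisect_right(xs, x) / n

    return F


# Hypothetical one-dimensional sample.
F = make_ecdf([3.0, 1.0, 2.0, 2.0])
values = [F(0.5), F(1.0), F(2.0), F(3.0)]  # 0.0, 0.25, 0.75, 1.0
```

On a segment where the data are uniformly distributed the ecdf rises roughly linearly, which is why a piecewise linear fit to the ecdf corresponds to a mixture of uniform components.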