Abstract: Although deep convolutional neural networks (DCNNs) have been proposed for prostate MR image segmentation, their effectiveness is often limited by inadequate semantic discrimination and spatial context modeling. To address these issues, we propose a Multi-scale Synergic Discriminative Network (MSD-Net), which comprises a shared encoder, a segmentation decoder, and a boundary detection decoder. We further design a cascaded pyramid convolutional block and a residual refinement block, and incorporate them, together with a channel attention block, into MSD-Net to exploit the multi-scale spatial contextual information and semantically consistent features of the gland. We also fuse the features from the two decoders to boost segmentation performance, and introduce a synergic multi-task loss that imposes a consistency constraint on the joint segmentation and boundary detection. We evaluated MSD-Net against several prostate segmentation methods on three public datasets and achieved improved accuracy. Our results indicate that the proposed MSD-Net outperforms existing methods, setting a new state of the art for prostate segmentation in magnetic resonance images. (c) 2022 Elsevier Ltd. All rights reserved.
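The synergic multi-task loss described above can be illustrated with a toy sketch. Binary cross-entropy for the two supervised terms and a gradient-magnitude boundary derived from the predicted mask are assumptions (the abstract does not specify these choices); the consistency term simply penalizes disagreement between the two decoders:

```python
import numpy as np

def bce(pred, target, eps=1e-7):
    # Pixel-averaged binary cross-entropy.
    pred = np.clip(pred, eps, 1 - eps)
    return -np.mean(target * np.log(pred) + (1 - target) * np.log(1 - pred))

def mask_to_boundary(mask):
    # Approximate a boundary map as the gradient magnitude of the mask.
    gy, gx = np.gradient(np.asarray(mask, float))
    return np.clip(np.hypot(gx, gy), 0.0, 1.0)

def synergic_loss(seg_pred, bnd_pred, seg_gt, bnd_gt, lam=1.0):
    # Supervised terms for both decoders plus a consistency term tying the
    # boundary implied by the predicted mask to the predicted boundary map.
    consistency = np.mean((mask_to_boundary(seg_pred) - bnd_pred) ** 2)
    return bce(seg_pred, seg_gt) + bce(bnd_pred, bnd_gt) + lam * consistency
```

A correct joint prediction keeps all three terms small, while a mask that contradicts the boundary decoder is penalized even when each decoder is individually plausible.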
Abstract: We propose a novel optimization framework that crops a given image based on a user description and aesthetics. Unlike existing image cropping methods, which typically train a deep network to regress to crop parameters or cropping actions, we propose to directly optimize the cropping parameters by repurposing pre-trained networks for image captioning and aesthetic assessment, without any fine-tuning, thereby avoiding training a separate network. Specifically, we search for the crop parameters that minimize a combined loss over the original objectives of these networks. To make the optimization stable, we propose three strategies: (i) multi-scale bilinear sampling; (ii) annealing the scale of the crop region, thereby effectively reducing the parameter space; and (iii) aggregation of multiple optimization results. Through various quantitative and qualitative evaluations, we show that our framework can produce crops that are well aligned with the intended user descriptions and aesthetically pleasing. (c) 2022 Elsevier Ltd. All rights reserved.
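The search over crop parameters can be sketched as follows. A plain random search with a user-supplied `score_fn` stands in for the combined captioning/aesthetics loss; the paper instead optimizes the parameters directly through the frozen networks, so this loop only illustrates the objective structure:

```python
import numpy as np

def search_crop(image, score_fn, n_iters=300, seed=0):
    # Random search over crop rectangles (x, y, w, h), keeping the crop that
    # maximizes score_fn (a stand-in for the negated combined loss).
    rng = np.random.default_rng(seed)
    H, W = image.shape[:2]
    best, best_score = None, -np.inf
    for _ in range(n_iters):
        w = rng.integers(W // 4, W + 1)   # lower bound avoids degenerate crops
        h = rng.integers(H // 4, H + 1)
        x = rng.integers(0, W - w + 1)
        y = rng.integers(0, H - h + 1)
        s = score_fn(image[y:y + h, x:x + w])
        if s > best_score:
            best, best_score = (x, y, w, h), s
    return best, best_score
```

Annealing, as in strategy (ii), would correspond to shrinking the sampling range of `w` and `h` over iterations.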
Abstract: In this paper we propose a novel Multi-Task Learning (MTL) framework, Split 'n' Merge Net. We draw inspiration from the multi-head attention formulation of Transformers and propose a novel, simple and interpretable pathway to process the information captured and exploited by multiple tasks. In particular, we propose a novel splitting network design, empowered with multi-head attention, that generates dynamic masks to filter task-specific information and task-agnostic shared factors from the input. To drive this generation, and to avoid over-sharing of information between the tasks, we propose a novel formulation of the mutual information loss which encourages the generated split embeddings to be as distinct as possible. A unique merging network is also introduced to fuse the task-specific and shared information and generate an augmented embedding for the individual downstream tasks in the MTL pipeline. We evaluate the proposed Split 'n' Merge Net on two distinct MTL tasks and achieve state-of-the-art results on both. Our primary, ablation and interpretation evaluations indicate the robustness and flexibility of the proposed approach and demonstrate its applicability to numerous, diverse real-world MTL applications. (c) 2022 Elsevier Ltd. All rights reserved.
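The distinctness objective for split embeddings can be conveyed with a much simpler proxy. The squared pairwise cosine similarity below merely illustrates the idea of pushing split embeddings apart; it is not the paper's mutual information formulation:

```python
import numpy as np

def distinctness_penalty(embeddings):
    # Mean squared cosine similarity over all ordered pairs of split
    # embeddings; driving it to zero pushes the splits toward orthogonality,
    # a crude stand-in for a mutual-information-style distinctness loss.
    E = np.stack([e / (np.linalg.norm(e) + 1e-12) for e in embeddings])
    sim = E @ E.T
    n = len(embeddings)
    off = sim[~np.eye(n, dtype=bool)]   # discard self-similarities
    return float(np.mean(off ** 2))
```

Orthogonal splits score zero; identical splits score one, signaling over-sharing between the embeddings.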
Abstract: Recently, baggage re-identification (ReID) has become an attractive topic in computer vision because it plays an important role in intelligent surveillance. However, the wide variations across different views of baggage items degrade baggage ReID performance. In this paper, a novel QuadNet is proposed to address the multi-view problem in baggage ReID at three levels. At the sample level, we propose a multi-view sampling strategy that samples hard examples from multiple identities in multiple views; the sampled baggage items are used to construct quadruplets. At the feature level, view-aware attentional local features are extracted from discriminative regions in each view and fused with global features to obtain better representations of the quadruplets. At the loss level, a multi-view quadruplet loss operating on these representations is proposed to reduce the intra-class distances caused by view variations and increase the inter-class distances of baggage images captured in the same view. A random local blur data augmentation is proposed to handle the motion blur often found in baggage images, and multi-task learning of surface materials is introduced to obtain discriminative features based on the materials of baggage surfaces. Extensive experiments on three ReID datasets, MVB, Market-1501 and VeRi-776, indicate the remarkable effectiveness and good generalization of the QuadNet model, which achieves state-of-the-art performance on all three datasets. Crown Copyright (c) 2022 Published by Elsevier Ltd. All rights reserved.
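The loss-level component can be conveyed through a plain (single-view) quadruplet loss; the margins `m1` and `m2` are illustrative defaults, and the paper's multi-view variant additionally conditions the sampling and terms on camera views:

```python
import numpy as np

def quadruplet_loss(a, p, n1, n2, m1=1.0, m2=0.5):
    # Classic quadruplet loss: the first hinge pulls the positive closer to
    # the anchor than the first negative; the second hinge pushes apart two
    # negatives of different identities, enlarging inter-class distances.
    d = lambda u, v: float(np.sum((np.asarray(u, float) - np.asarray(v, float)) ** 2))
    return max(0.0, d(a, p) - d(a, n1) + m1) + max(0.0, d(a, p) - d(n1, n2) + m2)
```

When the positive is already much closer than both negatives, both hinges are inactive and the loss is zero.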
Abstract: Zero-shot action recognition can recognize samples of unseen classes that are unavailable during training by exploring a common latent semantic representation of samples. However, most methods neglect the connotative and extensional relations between action classes, which leads to poor generalization in zero-shot learning. Furthermore, the learned classifier is biased toward predicting samples of seen classes, which leads to poor classification performance. To solve these problems, we propose a two-stage deep neural network for zero-shot action recognition, consisting of a feature generation sub-network serving as the sampling stage and a graph attention sub-network serving as the classification stage. In the sampling stage, we use a generative adversarial network (GAN) trained on action features and word vectors of seen classes to synthesize action features of unseen classes, which balances the training data between seen and unseen classes. In the classification stage, we construct a knowledge graph (KG) based on the relationships between word vectors of action classes and related objects, and propose a graph convolutional network (GCN) with an attention mechanism that dynamically updates the relationships between action classes and objects, enhancing the generalization ability of zero-shot learning. In both stages, we use word vectors as bridges for feature generation and classifier generalization from seen to unseen classes. We compare our method with state-of-the-art methods on the UCF101 and HMDB51 datasets. Experimental results show that our proposed method improves the classification performance of the trained classifier and achieves higher accuracy. (c) 2022 Elsevier Ltd. All rights reserved.
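One attention-weighted propagation step over such a knowledge graph might look as follows; dot-product scores and a masked softmax are assumptions standing in for the paper's specific attention design:

```python
import numpy as np

def softmax(x):
    # Numerically stable row-wise softmax.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def graph_attention_step(H, adj, W):
    # One attention-weighted message-passing step: scores between connected
    # nodes come from a dot product of transformed features, and a masked
    # softmax keeps only graph edges, so class-object relations are
    # reweighted dynamically rather than fixed as in a plain GCN.
    Z = H @ W
    scores = Z @ Z.T
    scores = np.where(adj > 0, scores, -1e9)   # mask non-edges
    return softmax(scores) @ Z
```

With only self-loops in `adj`, each node simply keeps its own transformed feature; edges to related objects mix in their features with learned weights.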
Abstract: Table structure recognition is an essential step toward making machines understand tables; its main task is to recognize the internal structure of a table. However, owing to the complexity and diversity of table structures and styles, it is very difficult to parse tabular data into a structured format that machines can understand, especially for complex tables. In this paper, we introduce Split, Embed and Merge (SEM), an accurate table structure recognizer composed of three parts: a splitter, an embedder and a merger. In the first stage, we apply the splitter to predict the potential regions of the table's row/column separators and obtain the fine grid structure of the table. In the second stage, taking full account of the textual information in the table, we fuse the output features for each table grid cell from both the vision and text modalities; providing these additional textual features yields higher precision in our experiments. Finally, we merge the basic grid cells in an auto-regressive manner, with the merging decisions learned through an attention mechanism. In our experiments, SEM achieves an average F1-measure of 97.11% on the SciTSR dataset, outperforming other methods by a large margin. We also won first place on complex tables and third place on all tables in Task B of the ICDAR 2021 Competition on Scientific Literature Parsing. Extensive experiments on other publicly available datasets further demonstrate the effectiveness of our proposed approach. (c) 2022 Elsevier Ltd. All rights reserved.
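The splitter's job of locating row/column separator regions can be mimicked with a classical ink-density heuristic; SEM's splitter is a learned neural predictor, so the projection-profile rule below is only an illustrative stand-in:

```python
import numpy as np

def separator_runs(ink, axis=0, thresh=0.05):
    # Density profile of a binary ink image along the chosen axis
    # (axis=0 averages over rows, giving per-column density, i.e. column
    # separator candidates); contiguous low-density runs are returned as
    # half-open (start, end) separator bands.
    profile = ink.mean(axis=axis)
    low = profile < thresh
    runs, start = [], None
    for i, v in enumerate(low):
        if v and start is None:
            start = i
        if not v and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(low)))
    return runs
```

Intersecting the row and column bands yields the fine grid that the merger would then refine.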
Abstract: Extracting road maps from high-resolution optical remote sensing images has received much attention recently, especially with the rapid development of deep learning methods. However, most CNN-based approaches simply focus on multi-scale encoder architectures or multiple branches in neural networks, and ignore some inherent characteristics of the road surface. In this paper, we design a novel network for road extraction based on a spatially enhanced and densely connected UNet, called SDUNet. SDUNet aggregates both the multi-level features and the global prior information of road networks by combining the strengths of spatial-CNN-based segmentation and densely connected blocks. To enhance the learning of prior information about the road surface, a structure-preserving model is designed to explore continuous clues at the spatial level. Experimental results on two benchmark datasets show that the proposed method achieves state-of-the-art performance compared with previous approaches for road extraction. Code will be made available at https://github.com/MrStrangerYang/SDUNet. (c) 2022 Elsevier Ltd. All rights reserved.
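The densely connected blocks mentioned above follow the standard DenseNet pattern, which can be sketched generically; the `layer_fns` are placeholders for convolutional layers, and channel-last NumPy arrays stand in for feature maps:

```python
import numpy as np

def dense_block(x, layer_fns):
    # Densely connected composition: each layer receives the concatenation
    # of the block input and all previous layers' outputs along the channel
    # axis, so low-level features are reused at every depth.
    feats = [x]
    for fn in layer_fns:
        feats.append(fn(np.concatenate(feats, axis=-1)))
    return np.concatenate(feats, axis=-1)
```

The growing channel count (input channels plus one group per layer here) is the mechanism by which dense blocks aggregate multi-level features.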
Abstract: Metric learning aims to learn an appropriate distance metric for a given machine learning task. Despite its impressive performance in image recognition, it may still not be discriminative enough for scene recognition because of the high within-class diversity and high between-class similarity of scene images. In this paper, we propose a novel class-specific discriminative metric learning method (CSDML) to alleviate these problems. More specifically, we learn a distinctive linear transformation for each class (or, equivalently, a Mahalanobis distance metric for each class), which projects the samples of that class into a corresponding low-dimensional discriminative space. The overall aim is to simultaneously minimize the Euclidean distances between the projections of samples of the same class (or, equivalently, the Mahalanobis distances between these samples) and maximize the Euclidean distances between the projections of samples of different classes. Additionally, we incorporate least squares regression into the optimization problem, rendering class-specific metric learning more flexible and better suited to scene recognition. Experimental results on four benchmark scene datasets demonstrate that the proposed method outperforms most state-of-the-art approaches. (c) 2022 Elsevier Ltd. All rights reserved.
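A minimal sketch of per-class Mahalanobis metrics follows, assuming each class's metric is simply the regularized inverse covariance of its samples; CSDML instead learns the transformations discriminatively with a least-squares term, which this sketch omits:

```python
import numpy as np

def fit_class_metrics(X, y, reg=1e-3):
    # Estimate a mean and a Mahalanobis metric (regularized inverse
    # covariance) separately for each class.
    metrics = {}
    for c in np.unique(y):
        Xc = X[y == c]
        mu = Xc.mean(axis=0)
        cov = np.cov(Xc, rowvar=False) + reg * np.eye(X.shape[1])
        metrics[c] = (mu, np.linalg.inv(cov))
    return metrics

def classify(x, metrics):
    # Assign x to the class with the smallest class-specific Mahalanobis
    # distance to that class's mean.
    dist = {c: float((x - mu) @ M @ (x - mu)) for c, (mu, M) in metrics.items()}
    return min(dist, key=dist.get)
```

Because each class owns its metric, a direction that is noisy for one class can still be discriminative for another, which is the core idea behind the class-specific design.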
Abstract: Trajectory prediction aims to predict the movement trends of agents such as pedestrians, bikers and vehicles. It helps to analyze and understand human activities in crowded spaces and is widely applied in areas such as surveillance video analysis and autonomous driving systems. Thanks to the success of deep learning, trajectory prediction has made significant progress. Current methods are dedicated to studying agents' future trajectories under social interactions and the physical constraints of the scene, and how to handle these factors still attracts researchers' attention. However, existing methods ignore the semantic shift phenomenon when modeling these interactions across diverse prediction scenes: several kinds of semantic deviations exist within or between social and physical interactions, which we call the "Gap". In this paper, we propose a Contextual Semantic Consistency Network (CSCNet) to predict agents' future activities with powerful and efficient context constraints. We utilize a well-designed context-aware transfer to obtain intermediate representations from the scene images and trajectories, and then eliminate the differences between social and physical interactions by aligning activity semantics with scene semantics to bridge the Gap. Experiments demonstrate that CSCNet performs better than most current methods both quantitatively and qualitatively. (c) 2022 Elsevier Ltd. All rights reserved.
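The alignment of activity and scene semantics can be illustrated with a simple cosine-alignment penalty between two embedding vectors; this is only a schematic stand-in for CSCNet's context-aware transfer, whose actual form the abstract does not specify:

```python
import numpy as np

def alignment_loss(activity_emb, scene_emb):
    # Cosine-alignment penalty between an activity-semantic embedding and a
    # scene-semantic embedding; minimizing it pulls the two representations
    # toward a shared direction, one simple way to narrow a semantic gap.
    a = np.asarray(activity_emb, float)
    s = np.asarray(scene_emb, float)
    cos = a @ s / (np.linalg.norm(a) * np.linalg.norm(s) + 1e-12)
    return 1.0 - float(cos)
```

The loss is zero for perfectly aligned embeddings and grows to one for orthogonal ones.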
Abstract: In shape from focus (SFF) methods, the quality of the depth map depends mainly on the accuracy of the image focus volume. Most SFF techniques optimize the focus volume without incorporating any prior or additional structural information about the scene, so the resulting depth maps are degraded. We mitigate this deficiency by optimizing the focus volume through energy minimization. The proposed energy function contains smoothness and structural similarity terms along with a data term: the smoothness constraint enforces spatial coherence, while the structural similarity constraint preserves structures consistent with the image sequence. This results in an optimized focus volume that reflects the underlying scene accurately. To implement our 3D objective function, we employ an efficient technique that decomposes the problem into a sequence of simple 1D sub-problems. Experiments conducted on synthetic and real image sequences from a variety of datasets demonstrate that the proposed method optimizes the focus volume effectively and thus provides improved depth maps. (c) 2022 Elsevier Ltd. All rights reserved.
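The decomposition into 1D sub-problems can be illustrated on a single scan line, assuming a quadratic data term and first-order smoothness (the paper's energy also includes a structural similarity term, omitted here); the 1D problem then has a closed-form linear solution:

```python
import numpy as np

def smooth_1d(f, lam=1.0):
    # Minimize  sum_i (u_i - f_i)^2 + lam * sum_i (u_{i+1} - u_i)^2
    # in closed form by solving (I + lam * D^T D) u = f, where D is the
    # forward-difference operator. Applying this along rows, columns and the
    # focus axis in turn mirrors splitting a 3D energy into 1D sub-problems.
    n = len(f)
    D = np.diff(np.eye(n), axis=0)          # (n-1) x n forward differences
    A = np.eye(n) + lam * D.T @ D
    return np.linalg.solve(A, np.asarray(f, float))
```

Constant signals pass through unchanged, while oscillating noise along the line is damped in proportion to `lam`.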