Abstract: Human activity recognition in videos has been widely studied and has recently seen significant advances with deep learning approaches; however, it remains a challenging task. In this paper, we propose a novel framework that simultaneously considers both implicit and explicit representations of human interactions by fusing information from the local image regions where the interaction actively occurs, the primitive motion and posture of each subject's body parts, and the co-occurrence of overall appearance change. Human interactions differ depending on how the body parts of each person interact with those of the other. The proposed method captures these subtle differences using interacting-body-part attention: semantically important body parts that interact with other objects are given more weight during feature representation. The combined feature of the attention-based individual representation and the co-occurrence descriptor of full-body appearance change is fed into a long short-term memory network to model the temporal dynamics over time in a single framework. Experimental results on five widely used public datasets demonstrate the effectiveness of the proposed method for recognizing human interactions in videos.
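A minimal sketch of the two central ideas in this abstract: attention over per-body-part features, and an LSTM over the fused per-frame descriptor. This is our own illustration, not the authors' code; all module names, tensor shapes, and layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class BodyPartAttentionLSTM(nn.Module):
    def __init__(self, part_dim=256, coocc_dim=512, hidden=512, n_classes=10):
        super().__init__()
        self.att = nn.Linear(part_dim, 1)                  # attention score per body part
        self.lstm = nn.LSTM(part_dim + coocc_dim, hidden, batch_first=True)
        self.cls = nn.Linear(hidden, n_classes)

    def forward(self, parts, coocc):
        # parts: (B, T, P, part_dim) per-frame features for P body parts
        # coocc: (B, T, coocc_dim)   full-body appearance-change descriptor
        w = torch.softmax(self.att(parts).squeeze(-1), dim=-1)   # (B, T, P) part weights
        person = (w.unsqueeze(-1) * parts).sum(dim=2)            # attention-pooled person feature
        fused = torch.cat([person, coocc], dim=-1)               # per-frame combined feature
        out, _ = self.lstm(fused)                                # temporal dynamics
        return self.cls(out[:, -1])                              # interaction logits

logits = BodyPartAttentionLSTM()(torch.randn(2, 16, 5, 256), torch.randn(2, 16, 512))
```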
Abstract: Unsupervised clustering categorizes a sample set into several groups, where the samples in the same group share high-level concepts. Since clustering performance is largely determined by the metric used to assess the similarity between sample pairs, we propose to learn a deep similarity score function and use it to capture the correlations between sample pairs for improved clustering. We formulate the learning procedure in a ranking framework and introduce two new supervisory signals to train our model. Specifically, we train the similarity score function to guarantee that 1) a sample has a higher level of similarity with its nearest neighbors than with other samples, to achieve correct clustering, and 2) the ordering of the similarity between neighboring sample pairs is preserved, to achieve robust clustering. To this end, we study not only the relevance between neighboring sample pairs for local structure learning, but also the relevance between each sample and the boundary samples for global structure learning. Extensive experiments on seven publicly available datasets, covering face image clustering, object image clustering, and real-world image clustering, validate the effectiveness of our proposed framework.
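A rough sketch of what such a ranking-supervised similarity function could look like. The pairwise scorer, the margin value, and the way neighbor/non-neighbor pairs are sampled are all our assumptions, not details from the paper.

```python
import torch
import torch.nn as nn

class SimilarityScorer(nn.Module):
    # Scores a pair of embeddings; a higher score means "same cluster".
    def __init__(self, dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * dim, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, a, b):
        return self.net(torch.cat([a, b], dim=-1)).squeeze(-1)

def ranking_loss(scorer, anchor, neighbor, non_neighbor, margin=0.5):
    # Signal 1: an anchor should score higher with a nearest neighbor
    # than with a non-neighbor (a margin ranking constraint).
    s_pos = scorer(anchor, neighbor)
    s_neg = scorer(anchor, non_neighbor)
    return torch.clamp(margin - (s_pos - s_neg), min=0).mean()

def order_preserving_loss(scorer, anchor, nearer, farther):
    # Signal 2: if one neighbor is closer than another in the input space,
    # the learned scores should preserve that ordering.
    return torch.clamp(scorer(anchor, farther) - scorer(anchor, nearer), min=0).mean()
```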
Abstract: Clustering categorical data is an important task in machine learning, since this type of data is widespread in the real world. However, the lack of an inherent order on the domains of categorical features prevents most classical clustering algorithms from being directly applied to it. Learning an appropriate representation of categorical data is therefore a key issue for the clustering task. To address this issue, we develop a categorical data clustering framework based on graph representation. In this framework, a graph-based representation method for categorical data is proposed, which learns representations of categorical values from their similarity graph so that similar categorical values receive similar representations. We compared the proposed framework with other representation methods for categorical data clustering on benchmark datasets; the experimental results show that it is highly effective compared with these methods.
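One way to realize the graph idea, sketched under our own assumptions (the paper's actual similarity graph and representation learning may differ): treat every (feature, value) pair as a graph node, weight edges by co-occurrence across samples, and represent each sample by averaging the rows of the normalized graph for its values.

```python
import numpy as np
from sklearn.cluster import KMeans

def graph_cluster_categorical(X, n_clusters):
    # X: (n, m) integer-coded categorical matrix.
    n, m = X.shape
    card = [X[:, j].max() + 1 for j in range(m)]       # cardinality per feature
    offsets = np.concatenate([[0], np.cumsum(card)])
    node = X + offsets[:-1]                            # global node id for each cell
    onehot = np.zeros((n, offsets[-1]))
    onehot[np.arange(n)[:, None], node] = 1.0
    A = onehot.T @ onehot                              # value co-occurrence graph
    A /= np.maximum(A.sum(1, keepdims=True), 1e-9)     # row-normalized similarity
    emb = onehot @ A / m                               # sample = mean of its value rows
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(emb)
```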
Abstract: Recently, RGB-D salient object detection (SOD) has aroused widespread research interest. Existing RGB-D SOD approaches mainly consider cross-modal information fusion in the decoder, and their multi-modal interaction mainly concentrates on same-level features between the RGB and depth streams; they do not deeply explore the coherence of multi-modal features across different levels. In this paper, we design a two-stream deep interleaved encoder network that extracts RGB and depth information and mixes them simultaneously. This network allows us to gradually learn multi-modal representations at different levels, from shallow to deep. Moreover, to further fuse multi-modal features in the decoding stage, we propose a cross-modal mutual guidance module and a residual multi-scale aggregation module to implement global guidance and local refinement of the salient region. Extensive experiments on six benchmark datasets demonstrate that the proposed approach performs favorably against most state-of-the-art methods under different evaluation metrics. During testing, the model runs at a real-time speed of 93 FPS.
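A minimal sketch of one encoder level of such a two-stream interleaved design, under our assumptions (the paper's actual backbone, channel widths, and mixing operator are not specified here): each stream is refined, then the two exchange information before the next level.

```python
import torch
import torch.nn as nn

class InterleavedStage(nn.Module):
    def __init__(self, c_in, c_out):
        super().__init__()
        self.rgb = nn.Sequential(nn.Conv2d(c_in, c_out, 3, 2, 1), nn.ReLU())
        self.dep = nn.Sequential(nn.Conv2d(c_in, c_out, 3, 2, 1), nn.ReLU())
        self.mix_r = nn.Conv2d(2 * c_out, c_out, 1)    # depth-aware correction for RGB
        self.mix_d = nn.Conv2d(2 * c_out, c_out, 1)    # RGB-aware correction for depth

    def forward(self, r, d):
        r, d = self.rgb(r), self.dep(d)
        both = torch.cat([r, d], dim=1)                # interleave the two streams
        return r + self.mix_r(both), d + self.mix_d(both)

stages = nn.ModuleList(InterleavedStage(c, 2 * c) for c in (3, 6, 12))
r = d = torch.randn(1, 3, 224, 224)                    # depth map replicated to 3 channels
for stage in stages:
    r, d = stage(r, d)                                 # multi-modal features, shallow to deep
```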
Abstract: Despite their promising preliminary results, existing cross-modality Visible-Infrared Person Re-IDentification (VI-PReID) models that incorporate semantic (person) masks simply use these masks as selection maps to separate person features from background regions. Such models make no dedicated effort to extract more modality-invariant person body features within the VI-PReID network itself, leading to suboptimal results. In contrast, we aim to better capture person body information in the VI-PReID network itself by exploiting the inner relations between person mask prediction and VI-PReID. To this end, a novel multi-task learning model is presented in this paper, in which person body features obtained from person mask prediction facilitate the extraction of discriminative modality-shared person body information for VI-PReID. On top of that, considering the task difference between person mask prediction and VI-PReID, we propose a novel task translation sub-network that transfers the discriminative person body information extracted by person mask prediction into VI-PReID. This enables our model to better exploit discriminative, modality-invariant person body information. Thanks to these more discriminative modality-shared features, our method outperforms previous state-of-the-art methods by a significant margin on several benchmark datasets, validating the effectiveness of extracting discriminative person body features for the VI-PReID task.
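A toy sketch of the multi-task layout described here: a shared encoder, an auxiliary mask head, and a small translation block carrying mask-branch information into the re-ID embedding. Every module name and size below is our assumption, not the paper's architecture.

```python
import torch
import torch.nn as nn

class MaskGuidedReID(nn.Module):
    def __init__(self, feat=64, emb_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Conv2d(3, feat, 3, 2, 1), nn.ReLU(),
                                     nn.Conv2d(feat, feat, 3, 2, 1), nn.ReLU())
        self.mask_head = nn.Conv2d(feat, 1, 1)            # auxiliary person-mask task
        self.translate = nn.Sequential(nn.Conv2d(feat, feat, 1), nn.ReLU())
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.emb = nn.Linear(feat, emb_dim)

    def forward(self, x):
        f = self.encoder(x)
        mask_logits = self.mask_head(f)
        # "task translation": inject mask-branch body cues into the re-ID branch
        g = f + self.translate(f) * torch.sigmoid(mask_logits)
        reid = self.emb(self.pool(g).flatten(1))          # modality-shared embedding
        return reid, mask_logits
```

Training would combine a re-ID loss on `reid` (for both visible and infrared inputs) with a segmentation loss on `mask_logits`.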
Le, Ngan; Duong, Chi Nhan; Jalata, Ibsa; Roy, Kaushik...
11 pages
Abstract: Group-level emotion recognition (ER) is a growing research area, as the demand for assessing crowds of all sizes is of interest in both the security arena and social media. This work extends earlier ER investigations, which focused on group-level ER either in single images or within a video, by fully investigating group-level expression recognition on crowd videos. In this paper, we propose an effective deep feature-level fusion mechanism to model the spatial-temporal information in crowd videos. In our approach, fusion is performed in the deep feature domain by a generative probabilistic model, Non-Volume Preserving Fusion (NVPF), which models spatial information relationships. Furthermore, we extend the proposed spatial NVPF approach to a spatial-temporal NVPF (TNVPF) approach that learns the temporal information between frames. To demonstrate the robustness and effectiveness of each component of the proposed approach, three experiments were conducted: (i) evaluation on the AffectNet database to benchmark the proposed EmoNet for recognizing facial expressions; (ii) evaluation on EmotiW2018 to benchmark the proposed deep feature-level fusion mechanism NVPF; and (iii) evaluation of the proposed TNVPF on a new Group-level Emotion on Crowd Videos (GECV) dataset composed of 627 videos collected from publicly available sources. The GECV dataset is a collection of videos containing crowds of people, each labeled with emotion categories at three levels: individual faces, the group of people, and the entire video frame.
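For readers unfamiliar with non-volume-preserving models: the building block of such a generative transform is typically an affine coupling layer in the RealNVP style, where half of a feature vector is rescaled and shifted conditioned on the other half. The sketch below shows that generic block only; it is our illustration of the family of transforms, not the authors' NVPF architecture.

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    # Non-volume-preserving transform: y2 = x2 * exp(s(x1)) + t(x1).
    # The Jacobian log-determinant is sum(s(x1)), generally nonzero,
    # hence "non-volume preserving".
    def __init__(self, dim):
        super().__init__()
        half = dim // 2
        self.s = nn.Sequential(nn.Linear(half, half), nn.Tanh())
        self.t = nn.Linear(half, half)

    def forward(self, x):
        x1, x2 = x.chunk(2, dim=-1)
        s = self.s(x1)
        y2 = x2 * torch.exp(s) + self.t(x1)
        return torch.cat([x1, y2], dim=-1), s.sum(dim=-1)   # output and log-det

fused, logdet = AffineCoupling(128)(torch.randn(4, 128))    # e.g., concatenated face features
```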
Abstract: This work addresses time series classifier recommendation for the first time in the literature by considering several recommendation forms, or meta-targets: classifier accuracies, complete ranking, top-M ranking, best set, and best classifier. For this, an ad hoc set of quick estimators of the accuracies of the candidate classifiers (landmarkers) is designed and used as predictors for the recommendation system. The performance of our recommender is compared with that of a standard method for non-sequential data and a set of baseline methods, which our method outperforms in 7 of the 9 considered scenarios. Since some meta-targets can be inferred from the predictions of other, more fine-grained meta-targets, the last part of the work addresses the hierarchical inference of meta-targets. The experimentation suggests that, in many cases, a single model is sufficient to output many types of meta-targets with competitive results.
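To make the landmarker idea concrete, here is a sketch of how quick accuracy estimators could be computed and used as meta-features; the specific cheap classifiers, subsample size, and cross-validation setup are our assumptions, not the paper's design.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def landmarkers(X, y, sample=200, seed=0):
    # Cheap, fast accuracy estimates on a subsample; these become the
    # predictor variables of the recommendation system.
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), min(sample, len(X)), replace=False)
    Xs, ys = X[idx], y[idx]
    return np.array([cross_val_score(KNeighborsClassifier(n_neighbors=k),
                                     Xs, ys, cv=3).mean() for k in (1, 3, 5)])

# A recommender then maps landmarkers(X, y) to a meta-target, e.g. predicted
# accuracies of the candidate classifiers, from which a complete ranking, a
# top-M ranking, a best set, or a single best classifier can all be derived.
```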
Abstract: Open domain recognition has attracted great attention in the last two years; it aims to assign a specific identification to each target sample in the presence of large domain discrepancy in both label space and data distributions. Most existing approaches rely on abundant prior information about the relationship between the label sets of the source and target domains, which greatly limits their application in the practical wild. In this paper, a new Adaptive Open Domain Recognition (AODR) task is introduced, which generalizes to various degrees of openness and requires no prior information on the label set. To achieve this adaptive transfer task, a two-stage Progressive Adaptation Network is designed, whose learning process consists of multiple episodes, each of which simulates an AODR task. By training and refining over multiple episodes, the base model progressively accumulates experience in predicting unseen categories under large domain discrepancy and thus generalizes well to various degrees of openness. More specifically, a Fusion Information Guided Feature Prototype Generation module is proposed to synthesize visual feature prototypes conditioned on category semantic prototypes in the training stage. Furthermore, a Class-Aware Feature Prototype Alignment module is designed in the refining stage to align the global feature prototype of each class between the two domains. Experimental results verify that the proposed model is not only superior at classifying image instances of known and unknown classes but also adapts well to various degrees of openness.
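A schematic sketch of the prototype-generation idea as we read it (synthesizing a visual prototype from a category's semantic prototype, then classifying by nearest prototype); the network shape and the use of word embeddings as semantic prototypes are our assumptions.

```python
import torch
import torch.nn as nn

class PrototypeGenerator(nn.Module):
    # Maps a category semantic prototype (e.g., a word embedding) to a
    # visual feature prototype, so unseen classes get prototypes without
    # any target images.
    def __init__(self, sem_dim=300, vis_dim=512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(sem_dim, 512), nn.ReLU(),
                                 nn.Linear(512, vis_dim))

    def forward(self, semantic):
        return self.net(semantic)

def nearest_prototype(features, prototypes):
    # features: (n, vis_dim); prototypes: (n_classes, vis_dim), mixing
    # seen-class prototypes with synthesized ones for unseen classes.
    return torch.cdist(features, prototypes).argmin(dim=1)
```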
Abstract: Kleinberg introduced three natural clustering properties, or axioms, and showed that they cannot be simultaneously satisfied by any clustering algorithm. We present a new clustering property, Monotonic Consistency, which avoids the well-known problematic behaviour of Kleinberg's Consistency axiom, and the impossibility result. Namely, we describe a clustering algorithm, Morse Clustering, inspired by Morse theory in differential topology, which satisfies Kleinberg's original axioms with Consistency replaced by Monotonic Consistency. Morse clustering uncovers the underlying flow structure on a set or graph and returns a partition into trees representing basins of attraction of critical vertices. We also generalise Kleinberg's axiomatic approach to sparse graphs, showing an impossibility result for Consistency and a possibility result for Monotonic Consistency and Morse clustering.
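The flow idea admits a compact sketch: give each vertex a weight, point every vertex "uphill" to its best neighbour, and take local maxima as critical vertices whose pointer trees are the basins of attraction. This is our reading of the construction, not the paper's exact algorithm.

```python
def morse_clusters(neighbors, weight):
    """neighbors: {v: list of adjacent vertices}; weight: {v: float}.
    Returns {v: critical vertex whose basin of attraction contains v}."""
    parent = {}
    for v, nbrs in neighbors.items():
        best = max(nbrs, key=lambda u: weight[u], default=v)
        parent[v] = best if weight[best] > weight[v] else v   # a local max points to itself

    def root(v):
        while parent[v] != v:
            v = parent[v]
        return v

    return {v: root(v) for v in neighbors}

# Example: b is the unique local maximum, so every vertex falls in its basin.
print(morse_clusters({'a': ['b'], 'b': ['a', 'c'], 'c': ['b']},
                     {'a': 1.0, 'b': 3.0, 'c': 2.0}))   # {'a': 'b', 'b': 'b', 'c': 'b'}
```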
Abstract: LiDAR-based 3D object detection has an immense influence on autonomous vehicles. Due to the intrinsic properties of LiDAR, fewer points are collected from objects farther away from the sensor. This imbalanced density of point clouds degrades detection accuracy but has generally been neglected by previous works. To address this challenge, we propose a novel two-stage 3D object detection framework named SIENet. Specifically, we design a Spatial Information Enhancement (SIE) module to predict the spatial shapes of the foreground points within proposals and extract structure information to learn representative features for further box refinement. The predicted spatial shapes are complete and dense point sets, so the extracted structure information contains more semantic representation. In addition, we design a Hybrid-Paradigm Region Proposal Network (HP-RPN) that includes multiple branches to learn discriminative features and generate accurate proposals for the SIE module. Extensive experiments on the KITTI 3D object detection benchmark show that our elaborately designed SIENet outperforms state-of-the-art methods by a large margin. Code will be publicly available at https://github.com/Liz66666/SIENet .
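A toy sketch of the SIE idea as described: encode the sparse foreground points of a proposal, decode a dense completed point set, and re-encode it into a structure feature for box refinement. Layer sizes and the max-pooling point encoder are our assumptions; the repository linked above is authoritative.

```python
import torch
import torch.nn as nn

class SpatialInfoEnhancement(nn.Module):
    def __init__(self, n_dense=256, feat=128):
        super().__init__()
        self.n_dense = n_dense
        self.point_mlp = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, feat))
        self.complete = nn.Linear(feat, n_dense * 3)       # dense spatial-shape decoder
        self.encode = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, feat))

    def forward(self, pts):
        # pts: (B, N, 3) sparse foreground points inside a proposal
        g = self.point_mlp(pts).max(dim=1).values          # permutation-invariant global code
        dense = self.complete(g).view(-1, self.n_dense, 3) # predicted dense point set
        struct = self.encode(dense).max(dim=1).values      # structure feature for refinement
        return dense, struct

dense, struct = SpatialInfoEnhancement()(torch.randn(4, 32, 3))
```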