Abstract: Existing JPEG encryption approaches pose a security risk because it is difficult to change all block-feature values while preserving format compatibility and limiting file size expansion. To address these concerns, this paper introduces a novel JPEG image encryption scheme. First, the security of sketch information against chosen-plaintext attacks is improved by increasing the change rate of block-feature values. Second, a classification global permutation approach is designed to encrypt the undivided run/size, value (RSV)-based AC groups to achieve larger changes in the block-feature values. Third, to reduce file size expansion while maintaining format compatibility, the DC coefficients are rotated based on the mapped DC differences in the same category, and the nonzero AC coefficients are mapped in the same category. Extensive experiments demonstrate that the proposed algorithm is superior to existing schemes in terms of security. Notably, the average change rate of block-feature values is increased by at least 20%. Furthermore, the proposed scheme reduces the file size by an average of 2.036% compared to existing JPEG image encryption methods.
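The phrase "mapped in the same category" in the abstract above can be illustrated with a small sketch. In baseline JPEG, the size category of a nonzero coefficient is the bit length of its magnitude, so any mapping that keeps the bit length unchanged leaves the Huffman-coded size symbol, and hence the code length, untouched. The function `map_in_category` and its `key_bits` parameter are hypothetical names chosen for illustration; the abstract does not specify the actual mapping used by the scheme.

```python
def category(v: int) -> int:
    """JPEG size category: number of bits needed for |v| (0 for v == 0)."""
    return abs(v).bit_length()


def map_in_category(v: int, key_bits: int) -> int:
    """Map a nonzero coefficient to another value in the same category.

    Illustrative assumption, not the paper's mapping: the low bits of the
    magnitude are XORed with key bits and the sign is optionally flipped.
    Applying the function twice with the same key_bits restores v.
    """
    if v == 0:
        return 0  # zero coefficients are left alone
    k = category(v)
    base = 1 << (k - 1)                  # smallest magnitude in category k
    offset = abs(v) - base               # position inside the category
    new_offset = offset ^ (key_bits & (base - 1))
    flip = bool((key_bits >> (k - 1)) & 1)
    negative = (v < 0) != flip           # flip the sign when the key bit is set
    magnitude = base + new_offset
    return -magnitude if negative else magnitude


if __name__ == "__main__":
    v, key = 37, 0b101101
    enc = map_in_category(v, key)
    print(v, enc, category(v) == category(enc), map_in_category(enc, key) == v)
```

Because the mapped value stays inside the same category, the entropy-coded length of each coefficient is unchanged, which is one way to keep the bitstream format-compatible without expanding the file.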
Abstract: Explanatory Visual Question Answering (EVQA) is a recently proposed multimodal reasoning task that consists of answering a visual question and generating multimodal explanations for the reasoning process. Unlike the traditional Visual Question Answering (VQA) task, which only aims to predict answers to visual questions, EVQA also aims to generate user-friendly explanations to improve the explainability and credibility of reasoning models. To date, existing methods for VQA and EVQA ignore the prompt in the question and force the model to predict the probabilities of all answers. Moreover, existing EVQA methods ignore the complex relationships among question words, visual regions, and explanation tokens. Therefore, in this work, we propose a Logic Integrated Neural Inference Network (LININ) to restrict the range of candidate answers based on first-order logic (FOL) and capture cross-modal relationships to generate rational explanations. First, we design a FOL-based question analysis program to fetch a small number of candidate answers. Second, we utilize a multimodal transformer encoder to extract visual and question features and conduct the prediction on the candidate answers. Finally, we design a multimodal explanation transformer to construct cross-modal relationships and generate rational explanations. Comprehensive experiments on benchmark datasets demonstrate the superiority of LININ over state-of-the-art methods for EVQA.
Abstract: Image set compression (ISC) refers to compressing sets of semantically similar images. Traditional ISC methods typically aim to eliminate redundancy among images in either the signal or the frequency domain, but often struggle to handle complex geometric deformations across different images effectively. Here, we propose a new Hybrid Neural Representation for ISC (HNR-ISC), including an implicit neural representation for Semantically Common content Compression (SCC) and an explicit neural representation for Semantically Unique content Compression (SUC). Specifically, SCC enables the conversion of semantically common contents into a small-and-sweet neural representation, along with embeddings that can be conveyed as a bitstream. SUC is composed of invertible modules for removing intra-image redundancies. The feature-level combination of SCC and SUC naturally forms the final image set. Experimental results demonstrate the robustness and generalization capability of HNR-ISC in terms of signal and perceptual quality for reconstruction and accuracy for the downstream analysis task.
Abstract: Existing High Efficiency Video Coding (HEVC) selective encryption algorithms only consider the encoding characteristics of syntax elements to keep format compliance, but ignore the semantic features of video content, which may lead to unnecessary computational and bit rate costs. To tackle this problem, we present a content-aware tunable selective encryption (CATSE) scheme for HEVC. First, a deep hashing network is adopted to retrieve groups of pictures (GOPs) containing sensitive objects. Then, the retrieved sensitive GOPs and the remaining insensitive ones are encrypted with different encryption strengths. For the former, multiple syntax elements are encrypted to ensure security, whereas for the latter, only a few bypass-coded syntax elements are encrypted to improve the encryption efficiency and reduce the bit rate overhead. The keystream sequence is extracted from the time series of a new improved logistic map with complex dynamic behavior, which is generated by our proposed sine-modular chaotification model. Finally, reversible steganography is applied to embed the flag bits of the GOP type into the encrypted bitstream, so that the decoder can distinguish the encrypted syntax elements that need to be decrypted in different GOPs. Experimental results indicate that the proposed HEVC CATSE scheme not only provides high encryption speed and low bit rate overhead, but also offers higher encryption strength than other state-of-the-art HEVC selective encryption algorithms.
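To make the idea of a chaos-derived keystream concrete, the following is a minimal sketch that uses the classic logistic map x ← r·x·(1 − x), not the improved map or the sine-modular chaotification model from the abstract above, whose definitions are not given here. The function names, the burn-in length, and the byte quantization are all assumptions made for illustration, and the classic logistic map on its own should not be treated as cryptographically secure.

```python
def logistic_keystream(x0: float, r: float, n: int, burn_in: int = 1000) -> bytes:
    """Generate n keystream bytes from the classic logistic map.

    x0 (initial state in (0, 1)) and r (control parameter, e.g. 3.99)
    play the role of the secret key in this toy example.
    """
    x = x0
    for _ in range(burn_in):          # discard the transient part of the orbit
        x = r * x * (1.0 - x)
    out = bytearray()
    for _ in range(n):
        x = r * x * (1.0 - x)
        out.append(int(x * 256) % 256)  # quantize the state to one byte
    return bytes(out)


def xor_bytes(data: bytes, keystream: bytes) -> bytes:
    """XOR data with the keystream; the same call decrypts."""
    return bytes(a ^ b for a, b in zip(data, keystream))


if __name__ == "__main__":
    payload = b"selected syntax-element bits"
    ks = logistic_keystream(0.3141, 3.99, len(payload))
    encrypted = xor_bytes(payload, ks)
    print(xor_bytes(encrypted, ks) == payload)
```

In a selective encryption setting, only the chosen syntax-element bits would be XORed with the keystream, so the number of keystream bytes consumed scales with the chosen encryption strength for each GOP.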
Abstract: Graph convolution networks (GCNs) have achieved remarkable performance in skeleton-based action recognition by exploiting the adjacency topology of body representation. However, the adaptive strategy adopted by previous methods to construct the adjacency matrix does not balance performance against computational cost. We call this imbalance the Adaptive Trap and argue that the adaptive module can be replaced by multiple autonomous submodules, thereby simultaneously enhancing the dynamic joint representation and effectively reducing network resources. To replace the adaptive model, we present two distinct strategies, both yielding comparable effects. (1) Optimization. Individuality and Commonality GCNs (IC-GCNs) are proposed to specifically optimize the construction method of the associativity adjacency matrix for adaptive processing. The uniqueness and co-occurrence between different joint points and frames in the skeleton topology are effectively captured through methodologies such as preferential fusion of physical information, extreme compression of multi-dimensional channels, and simplification of the self-attention mechanism. (2) Replacement. Auto-Learning GCNs (AL-GCNs) are proposed to remove popular adaptive modules and utilize human key points as motion compensation to provide dynamic correlation support. AL-GCNs construct a fully learnable group adjacency matrix in both spatial and temporal dimensions, resulting in an elegant and efficient GCN-based model. In addition, three effective tricks for skeleton-based action recognition (Skip-Block, Bayesian Weight Selection Algorithm, and Simplified Dimensional Attention) are presented and analyzed in this paper. Finally, we employ the variable channel and grouping method to explore the hardware resource bound of the two proposed models. IC-GCN and AL-GCN exhibit impressive performance across the NTU-RGB+D 60, NTU-RGB+D 120, NW-UCLA, and UAV-Human datasets, with an exceptional parameter-cost ratio.
Abstract: Camouflaged object detection (COD) aims to segment camouflaged objects that exhibit patterns very similar to the surrounding environment. Recent research has shown that enhancing the feature representation via frequency information can greatly alleviate the ambiguity between foreground objects and the background. With the emergence of vision foundation models, such as InternImage and the Segment Anything Model, adapting a pretrained model to COD tasks with a lightweight adapter module has become a novel and promising research direction. Existing adapter modules mainly focus on feature adaptation in the spatial domain. In this paper, we propose a novel frequency-guided spatial adaptation method for the COD task. Specifically, we transform the input features of the adapter into the frequency domain. By grouping and interacting with frequency components located within non-overlapping circles in the spectrogram, different frequency components are dynamically enhanced or weakened, so that the intensity of image details and contour features is adaptively adjusted. At the same time, the features that are conducive to distinguishing object and background are highlighted, indirectly implying the position and shape of the camouflaged object. We conduct extensive experiments on four widely adopted benchmark datasets, and the proposed method outperforms 26 state-of-the-art methods by large margins. Code will be released.
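The frequency grouping described in the abstract above can be sketched as follows: a feature map is transformed with a 2-D DFT, its frequency components are assigned to concentric, non-overlapping rings by radial distance from the zero frequency, and each ring is scaled by its own weight before transforming back. This is only an illustration under stated assumptions: the weights here are fixed numbers, whereas the abstract describes dynamic enhancement, and the names `ring_index`, `reweight_rings`, and `ring_width` are hypothetical. A naive DFT is used so the example needs only the standard library; a real implementation would use an FFT.

```python
import cmath
import math


def dft2(x, inverse=False):
    """Naive 2-D DFT (O(N^4)); adequate only for tiny illustrative inputs."""
    h, w = len(x), len(x[0])
    sign = 1 if inverse else -1
    out = []
    for u in range(h):
        row = []
        for v in range(w):
            s = 0j
            for m in range(h):
                for n in range(w):
                    angle = 2 * math.pi * (u * m / h + v * n / w)
                    s += x[m][n] * cmath.exp(sign * 1j * angle)
            row.append(s / (h * w) if inverse else s)
        out.append(row)
    return out


def ring_index(u, v, h, w, ring_width, n_rings):
    """Ring of frequency bin (u, v), measured from the zero frequency."""
    fu = min(u, h - u)                 # wrapped (signed) frequency magnitude
    fv = min(v, w - v)
    radius = math.hypot(fu, fv)
    return min(int(radius / ring_width), n_rings - 1)


def reweight_rings(feature, weights, ring_width=1.0):
    """Scale each frequency ring by its weight and return the real result."""
    h, w = len(feature), len(feature[0])
    spec = dft2(feature)
    for u in range(h):
        for v in range(w):
            spec[u][v] *= weights[ring_index(u, v, h, w, ring_width, len(weights))]
    return [[z.real for z in row] for row in dft2(spec, inverse=True)]


if __name__ == "__main__":
    feat = [[1.0, 2.0, 3.0, 4.0],
            [5.0, 6.0, 7.0, 8.0],
            [4.0, 3.0, 2.0, 1.0],
            [0.0, 1.0, 0.0, 1.0]]
    # Weights above 1 on the outer rings emphasize fine detail and contours.
    print(reweight_rings(feat, [1.0, 1.0, 2.0]))
```

With all weights equal to 1 the input is recovered, and with `ring_width=1.0` and weights `[1.0, 0.0, 0.0]` only the zero-frequency component survives, so every output value equals the mean of the input.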
Abstract: Point-supervised Temporal Action Localization (PS-TAL) detects temporal intervals of actions in untrimmed videos with a label-efficient paradigm. However, most existing methods fail to learn action completeness without instance-level annotations, resulting in fragmentary region predictions. In fact, the semantic information of snippets is crucial for detecting complete actions, meaning that snippets with similar representations should be considered as the same action category. To address this issue, we propose a novel representation refinement framework with a semantic query mechanism to enhance the discriminability of snippet-level features. Concretely, we set a group of learnable queries, each representing a specific action category, and dynamically update them based on the video context. With the assistance of these queries, we search for the optimal action sequence that agrees with their semantics. In addition, we leverage reliable proposals as pseudo labels and design a refinement and completeness module to further refine temporal boundaries, so that the completeness of action instances is captured. Finally, we demonstrate the superiority of the proposed method over existing state-of-the-art approaches on the THUMOS14 and ActivityNet1.3 benchmarks. Notably, thanks to completeness learning, our algorithm achieves significant improvements under more stringent evaluation metrics.
Abstract: Weakly supervised temporal sentence grounding aims to identify semantically relevant video moments in an untrimmed video corresponding to a given sentence query without exact timestamps. Neuropsychology research indicates that the way the human brain handles information varies based on the grammatical categories of words, highlighting the importance of separately considering nouns and verbs. However, current methodologies primarily utilize pre-extracted video features to reconstruct randomly masked queries, neglecting the distinction between grammatical classes. This oversight could hinder the formation of meaningful connections between linguistic elements and the corresponding components in the video. To address this limitation, this paper introduces the dual semantic reconstruction network (DSRN) model. DSRN processes video features by distinctly correlating object features with nouns and motion features with verbs, thereby mimicking the human brain's parsing mechanism. It begins with a feature disentanglement module that separately extracts object-aware and motion-aware features from video content. Then, in a dual-branch structure, these disentangled features are used to generate separate proposals for objects and motions through two dedicated proposal generation modules. A consistency constraint is proposed to ensure a high level of agreement between the boundaries of object-related and motion-related proposals. Subsequently, the DSRN independently reconstructs masked nouns and verbs from the sentence queries using the generated proposals. Finally, an integration block is applied to synthesize the two types of proposals, distinguishing between positive and negative instances through contrastive learning. Experiments on the Charades-STA and ActivityNet Captions datasets demonstrate that the proposed method achieves state-of-the-art performance.
Abstract: An emotional support conversation (ESC) system aims to reduce users' emotional distress by engaging in conversation using various reply strategies as guidance. To develop instructive reply strategies for an ESC system, it is essential to consider the dynamic transitions of users' emotional states across conversational turns. However, existing methods for strategy-guided ESC systems struggle to capture these transitions because they overlook the inference of fine-grained user intentions. This oversight impedes the model's ability to derive pertinent strategy information and, consequently, hinders its capacity to generate emotionally supportive responses. To tackle this limitation, we propose a novel dynamic strategy prompt reasoning model (DSR), which leverages sparse context relation deduction to acquire adaptive representations of reply strategies as prompts for guiding the response generation process. Specifically, we first perform turn-level commonsense reasoning with different approaches to extract auxiliary knowledge, which enhances the comprehension of user intention. Then we design a context relation deduction module to dynamically integrate interdependent dialogue information, capturing granular user intentions and generating effective strategy prompts. Finally, we utilize the strategy prompts to guide the generation of more relevant and supportive responses. The DSR model is validated through extensive experiments on a benchmark dataset, demonstrating superior performance compared to the latest competitive methods in the field.
Abstract: Source-Free Domain Generalization (SFDG) aims to develop a model that works for unseen target domains without relying on any source domain. Research in SFDG primarily builds upon the existing knowledge of large-scale vision-language models and utilizes the pre-trained model's joint vision-language space to simulate style transfer across domains, thus eliminating the dependency on source domain images. However, how to efficiently simulate rich and diverse styles using text prompts, and how to extract domain-invariant information useful for classification from features that contain both semantic and style information after the encoder, are directions that merit improvement. In this paper, we introduce Dynamic PromptStyler (DPStyler), comprising Style Generation and Style Removal modules to address these issues. The Style Generation module refreshes all styles at every training epoch, while the Style Removal module eliminates variations in the encoder's output features caused by input styles. Moreover, since the Style Generation module, which generates style word vectors using random sampling or style mixing, makes the model sensitive to input text prompts, we introduce a model ensemble method to mitigate this sensitivity. Extensive experiments demonstrate that our framework outperforms state-of-the-art methods on benchmark datasets.
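As a rough illustration of "random sampling or style mixing" with a refresh at every epoch, the sketch below draws style word vectors from a Gaussian and optionally replaces a vector with a convex combination of itself and another sampled style. The sampling distribution, the mixing rule, and every name and parameter (`sample_style`, `mix_styles`, `refresh_styles`, `std`, `mix_prob`) are assumptions made for illustration; the abstract above does not define them.

```python
import random


def sample_style(dim, rng, std=0.02):
    """Draw one style word vector from a zero-mean Gaussian (assumed prior)."""
    return [rng.gauss(0.0, std) for _ in range(dim)]


def mix_styles(a, b, lam):
    """Convex combination of two style vectors, with lam in [0, 1]."""
    return [lam * x + (1.0 - lam) * y for x, y in zip(a, b)]


def refresh_styles(n_styles, dim, rng, mix_prob=0.5):
    """Generate a fresh set of style vectors, e.g. once per training epoch."""
    base = [sample_style(dim, rng) for _ in range(n_styles)]
    styles = []
    for s in base:
        if rng.random() < mix_prob:
            other = rng.choice(base)               # partner style for mixing
            styles.append(mix_styles(s, other, rng.random()))
        else:
            styles.append(s)                       # keep the sampled style
    return styles


if __name__ == "__main__":
    rng = random.Random(0)
    for epoch in range(3):
        styles = refresh_styles(n_styles=4, dim=8, rng=rng)
        # In a prompt-driven setup, each vector would stand in for a style
        # pseudo-word in a text prompt before it is passed to the text encoder.
        print(epoch, len(styles), len(styles[0]))
```

Regenerating the style set each epoch exposes the classifier to a different collection of simulated styles every time, which is the behavior the abstract attributes to the Style Generation module.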