Abstract: The history of artificial intelligence (AI) has witnessed the significant impact of high-quality data on deep learning models, such as ImageNet for AlexNet and ResNet. Recently, instead of designing more complex neural architectures as model-centric approaches, the attention of the AI community has shifted to data-centric approaches, which focus on processing data better to strengthen the ability of neural models. Graph learning, which operates on ubiquitous topological data, also plays an important role in the era of deep learning. In this survey, we comprehensively review graph learning approaches from the data-centric perspective and aim to answer three crucial questions: (1) when to modify graph data, (2) what part of the graph data needs modification to unlock the potential of various graph models, and (3) how to safeguard graph models from the influence of problematic data. Accordingly, we propose a novel taxonomy based on the stages in the graph learning pipeline and highlight the processing methods for the different data structures in graph data, i.e., topology, features and labels. Furthermore, we analyze some potential problems embedded in graph data and discuss how to solve them in a data-centric manner. Finally, we provide some promising future directions for data-centric graph learning.
Abstract: The edge computing paradigm has revolutionized the healthcare sector by enabling more real-time medical data processing and analysis, but it also poses serious privacy and security risks that must be carefully considered and addressed. Based on differential privacy, we present an innovative privacy-preserving model named Edge-DPSDG (Edge-Differentially Private Synthetic Data Generator) for smart healthcare under edge computing, together with a novel privacy budget allocation mechanism. In a distributed environment, the privacy budget for local medical data is personalized by computing the Shapley value and the information entropy of each attribute in the dataset, which takes into account the trade-off between data privacy and utility. Extensive experiments on three public medical datasets evaluate the performance of Edge-DPSDG on two metrics. For utility, Edge-DPSDG achieves up to a 21.29% accuracy improvement over the state-of-the-art, and our privacy budget allocation mechanism improves existing models' accuracy by up to 6.05%. For privacy, Edge-DPSDG effectively ensures the privacy of the original datasets. In addition, Edge-DPSDG helps smooth the data, resulting in a 3.99% decrease in accuracy loss relative to the non-private model.
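The per-attribute budget allocation described above can be sketched as follows. This is a minimal illustration, not the paper's exact algorithm: it assumes each attribute's share of the total budget is proportional to the product of its Shannon entropy and a caller-supplied importance score (standing in for the Shapley value, whose exact computation the abstract does not specify).

```python
import math
from collections import Counter

def attribute_entropy(values):
    """Shannon entropy (bits) of one attribute's value distribution."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def allocate_budget(dataset, total_epsilon, importance):
    """Split a total privacy budget across attributes.

    Each attribute's share is proportional to its information entropy
    times a supplied importance score (a stand-in for the Shapley value).
    Attributes carrying no information (zero entropy) receive no budget.
    """
    weights = {a: attribute_entropy(vals) * importance[a]
               for a, vals in dataset.items()}
    z = sum(weights.values()) or 1.0
    return {a: total_epsilon * w / z for a, w in weights.items()}
```

A constant attribute (entropy 0) gets epsilon 0, while the full budget is always conserved across attributes.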
Amit Sagu, Nasib Singh Gill, Preeti Gulia, Ishaani Priyadarshini...
pp. 35-46
Abstract: The Internet of Things (IoT) is prominently used in smart cities and a wide range of applications in society. The benefits of IoT are evident, but cyber terrorism and security concerns inhibit many organizations and users from deploying it. IoT-enabled cyber-physical systems can be difficult to secure, since security solutions designed for general information/operational technology systems may not work as well in an IoT environment. Deep learning (DL) can thus serve as a powerful tool for building IoT-enabled cyber-physical systems with automatic anomaly detection. In this paper, two distinct DL models, Deep Belief Network (DBN) and Convolutional Neural Network (CNN), are employed as a hybrid classifier in a framework for detecting attacks in IoT-enabled cyber-physical systems. However, DL models need to be trained in a way that increases their classification accuracy. Therefore, this paper also presents a new hybrid optimization algorithm called "Seagull Adapted Elephant Herding Optimization" (SAEHO) to tune the weights of the hybrid classifier. The "Hybrid Classifier + SAEHO" framework takes a feature-extracted dataset as input and classifies network traffic as either attack or benign. The framework was evaluated on two datasets using sensitivity, precision, accuracy, and specificity; it outperforms conventional methods on every performance metric.
Abstract: The large-scale protein-protein interaction (PPI) network of an organism provides key insights into its cellular and molecular functionalities, signaling pathways and underlying disease mechanisms. For any organism, the unexplored protein interactions significantly outnumber all known positive and negative interactions. For Human, all known PPI datasets together contain only $\sim\!\! 5.61$ million positive and $\sim\!\! 0.76$ million negative interactions, which is $\sim\!\! 3.1$% of the potential interactions. We have implemented a distributed algorithm in Apache Spark that evaluates a Human PPI network of $\sim \!\! 180$ million potential interactions resulting from 18,994 reviewed proteins for which Gene Ontology (GO) annotations are available. The computed scores have been validated against state-of-the-art methods on benchmark datasets: FuzzyPPI performed significantly better, with an average F1 score of 0.62 compared to GOntoSim (0.39), GOGO (0.38), and Wang (0.38), when tested on the Gold Standard PPI Dataset. The resulting scores are published with a web server for non-commercial use at http://fuzzyppi.mimuw.edu.pl/. Moreover, conventional PPI prediction methods produce binary results, but this is a simplification: PPIs have strengths or probabilities, and recent studies show that protein binding affinities may prove effective in detecting protein complexes, disease association analysis, signaling network reconstruction, etc. With this in mind, our algorithm is based on a fuzzy semantic scoring function and produces probabilities of interaction.
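The idea of a graded, probability-like interaction score (as opposed to a binary prediction) can be illustrated with a toy stand-in for the paper's fuzzy semantic scoring function. The Jaccard overlap of two proteins' GO annotation sets used below is an assumption for illustration only; the actual FuzzyPPI function is semantic and ontology-aware.

```python
def fuzzy_interaction_score(go_terms_a, go_terms_b):
    """Probability-like interaction score in [0, 1] for a protein pair.

    Toy stand-in for a fuzzy semantic scoring function: the score is
    the Jaccard overlap of the two proteins' GO term sets, so partially
    overlapping annotations yield graded values rather than a 0/1 label.
    """
    a, b = set(go_terms_a), set(go_terms_b)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)
```

Unlike a binary classifier, such a score can be thresholded later or used directly as a binding-strength proxy.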
Abstract: Image hosting platforms are becoming increasingly popular due to their user-friendly features, but they are prone to causing privacy concerns. Protecting privacy alone is easy to achieve, but usability is frequently sacrificed in the process. Visual privacy protection schemes aim to strike a balance between privacy and usability, but they are often irreversible. Recently, some reversible visual privacy protection schemes have been proposed that preserve thumbnails (known as TPE). However, they either have excessive states in the Markov chain modeled by the scheme or cannot reverse losslessly. Meanwhile, images encrypted by existing TPE schemes cannot embed additional information, so their usability is limited to visual observation. In view of this, we propose a reversible and usability-enhanced visual privacy protection scheme (called PR3) based on thumbnail preservation and data hiding. In this scheme, we utilize a sum-preserving data embedding algorithm to substitute the lowest seven bits of the image without changing the sum. Any data overflow resulting from this process is stored in the vacated space of the most significant bits. The remaining space serves two purposes: embedding additional information and adjusting the image to approximate the thumbnail. Compared with existing TPE works, PR3 has fewer states in the Markov chain and supports lossless recovery of images. In addition, additional information can be embedded in the encrypted image to enhance usability.
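The core property of sum-preserving embedding — hiding data without changing a block's sum, so the thumbnail (computed from block sums) is unaffected — can be sketched with a deliberately simplified pair-swapping convention. The 0/1 ordering rule below is an assumption for illustration, not PR3's actual algorithm.

```python
def embed_bit_sum_preserving(p1, p2, bit):
    """Embed one bit in a pixel pair without changing p1 + p2.

    Illustrative convention (an assumption, not the paper's scheme):
    bit 0 -> p1 <= p2, bit 1 -> p1 > p2. Swapping the two pixels
    encodes the bit while trivially preserving their sum, so a
    thumbnail computed from block averages is unchanged.
    Limitation: a pair with p1 == p2 can only carry bit 0.
    """
    if (bit == 1) != (p1 > p2):
        p1, p2 = p2, p1
    return p1, p2

def extract_bit(p1, p2):
    """Recover the embedded bit from the pair ordering."""
    return 1 if p1 > p2 else 0
```

A real scheme operates on the low seven bits of many pixels at once and handles overflow, but the invariant — payload in, sum unchanged — is the same.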
Abstract: Vertical federated learning can aggregate participants' data features. To address the issue of insufficient overlapping data in vertical federated learning, this study presents a generative adversarial network model that allows distributed data augmentation. First, this study proposes FeCGAN, a distributed generative adversarial network for multiple participants with insufficient overlapping data, exploiting the fact that generative adversarial networks can generate simulated samples. This network is suitable for multiple data sources and can augment participants' local data. Second, to address the learning divergence caused by the differing local distributions of multiple data sources, this study proposes the aggregation algorithm FedKL. It aggregates the feedback of the local discriminators to interact with the generator and learns the local data distributions more accurately. Finally, given the data waste caused by the unavailability of nonoverlapping data, this study proposes a data augmentation method called VFeDA. It uses FeCGAN to generate pseudo features and expands the overlapping data, thereby improving data utilization. Experiments show that the proposed model is suitable for multiple data sources and can generate high-quality data.
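Divergence-aware aggregation of local discriminator feedback, as in FedKL, can be sketched as follows. The down-weighting rule (weight inversely related to each client's KL divergence from the global distribution) is an assumed illustration of the idea, not the paper's exact formula.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as equal-length lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def fedkl_weights(local_dists, global_dist):
    """Aggregation weights for local discriminators.

    Sketch of a FedKL-style rule (an assumption, not the paper's exact
    algorithm): clients whose local data distribution diverges less from
    the global distribution contribute more to the aggregated feedback.
    """
    scores = [1.0 / (1.0 + kl_divergence(d, global_dist)) for d in local_dists]
    z = sum(scores)
    return [s / z for s in scores]
```

A client matching the global distribution exactly gets the largest weight; a heavily skewed client is softly suppressed rather than discarded.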
Abstract: Crowdsourcing has been playing an essential role in machine learning, since it can obtain large numbers of labels economically and quickly for training increasingly complex learning models. However, the application of crowdsourcing learning still faces several challenges, such as the low quality of crowd labels and the urgent need for learning models that adapt to label noise. Many studies have focused on truth inference algorithms to improve the quality of labels obtained by crowdsourcing. By comparison, end-to-end predictive model learning in crowdsourcing scenarios, especially using cutting-edge deep learning techniques, is still in its infancy. In this paper, we propose a novel graph convolutional network-based framework, CGNNAT, which models the correlation of instances by combining the GCN model with an attention mechanism to learn more representative node embeddings for a better understanding of the bias tendencies of crowd workers. Furthermore, a specific projection processing layer is employed in CGNNAT to model the reliability of each crowd worker, which makes the model an end-to-end neural network directly trained on noisy crowd labels. Experimental results on several real-world and synthetic datasets show that the proposed CGNNAT outperforms state-of-the-art and classical methods in terms of label prediction.
Abstract: It is challenging to select informative features that maintain the manifold structure of the original feature space. Many unsupervised feature selection methods still suffer from poor clustering performance on the selected feature subset. To tackle this problem, a feature subspace learning-based binary differential evolution algorithm is proposed for unsupervised feature selection. First, a new unsupervised feature selection framework based on evolutionary computation is designed, in which feature subspace learning and the population search mechanism are combined into a unified unsupervised feature selection process. Second, a local manifold structure learning strategy and a sample pseudo-label learning strategy are presented to calculate the importance of the selected feature subspace. Third, a binary differential evolution algorithm is developed to optimize the selected feature subspace, in which a binary information migration mutation operator and an adaptive crossover operator are designed to promote the search for the globally optimal feature subspace. Experimental results on various types of real-world datasets demonstrate that the proposed algorithm obtains a more informative feature subset and competitive clustering performance compared with eight state-of-the-art unsupervised feature selection methods.
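In binary differential evolution for feature selection, a candidate solution is a 0/1 mask over features, and mutation/crossover must stay in binary space. The sketch below uses generic illustrative operators under that encoding; the paper's binary information migration mutation and adaptive crossover are not specified in the abstract, so these are assumptions, not its actual operators.

```python
import random

def binary_de_offspring(target, r1, r2, r3, f=0.5, cr=0.3, rng=random):
    """One trial vector in a binary DE for feature selection.

    target, r1, r2, r3: 0/1 lists of equal length, where bit i marks
    whether feature i is selected (r1 is the base vector, r2/r3 donors).
    Illustrative operators, not the paper's exact design:
    - mutation: where the donors disagree, keep the base vector's bit
      with probability f, otherwise flip it (difference information
      "migrates" into the mutant);
    - crossover: binomial mixing of the mutant with the target at rate cr.
    """
    trial = []
    for t, a, b, c in zip(target, r1, r2, r3):
        if b != c:
            m = a if rng.random() < f else 1 - a
        else:
            m = a
        trial.append(m if rng.random() < cr else t)
    return trial
```

The trial vector then replaces the target only if it scores better under the feature-subspace importance measure (manifold preservation plus pseudo-label fit).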
Abstract: Social events reflect the dynamics of society; among them, natural disasters and emergencies receive significant attention. The timely detection of these events can provide organisations and individuals with valuable information to reduce or avoid losses. However, due to the complex heterogeneity of the content and structure of social media, existing models can only learn limited information; large amounts of semantic and structural information are ignored. In addition, due to high labour costs, social media datasets rarely include high-quality labels, which also makes it challenging for models to learn from social media. In this study, we propose two hyperbolic graph representation-based methods for detecting social events in heterogeneous social media environments. For cases where a dataset has labels, we design a Hyperbolic Social Event Detection (HSED) model that converts complex social information into a unified social message graph. This model addresses the heterogeneity of social media and, with this graph, can capture structural information based on the properties of hyperbolic space. For cases where the dataset is unlabelled, we design an Unsupervised Hyperbolic Social Event Detection (UHSED) model, which builds on HSED but incorporates graph contrastive learning to work in unlabelled scenarios. Extensive experiments demonstrate the superiority of the proposed approaches.
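The hyperbolic geometry these models rely on can be made concrete with the standard distance function of the Poincaré ball, the most common model for hyperbolic graph embeddings (whether HSED uses this particular model is not stated in the abstract; the formula itself is standard).

```python
import math

def poincare_distance(u, v):
    """Geodesic distance between two points inside the unit Poincare ball.

    d(u, v) = arcosh(1 + 2|u - v|^2 / ((1 - |u|^2)(1 - |v|^2)))

    Distances blow up near the boundary, which is what lets hyperbolic
    space embed tree-like social graphs with low distortion.
    """
    uu = sum(x * x for x in u)
    vv = sum(x * x for x in v)
    duv = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.acosh(1 + 2 * duv / ((1 - uu) * (1 - vv)))
```

Moving a point from radius 0.5 to 0.9 increases its distance from the origin far more than the Euclidean change suggests, illustrating the exponential capacity hierarchical message graphs exploit.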
Abstract: As human activities shift to social media, fake news detection has become a crucial problem. Existing methods ignore the classification differences in online news and cannot take full advantage of multi-classification knowledge. For example, when coping with a post "A mouse is frightened by a cat," a model that has learned "computer" knowledge tends to misunderstand "mouse" and assign a fake label, whereas a model that has learned "animal" knowledge tends to assign a true label. Therefore, this research proposes a multi-classification division-aggregation framework for fake news detection, named $CKA$, which innovatively learns classification knowledge during the training stage and aggregates it during the prediction stage. It consists of three main components: a news characterizer, an ensemble coordinator, and a truth predictor. The news characterizer extracts news features and obtains news classifications. Cooperating with the news characterizer, the ensemble coordinator generates classification-specific models to maximally retain classification knowledge during the training stage, where each classification-specific model maximizes fake news detection performance on the corresponding news classifications. Further, to aggregate the classification knowledge during the prediction stage, the truth predictor uses truth discovery to aggregate the predictions of the different classification-specific models based on an evaluation of their reliability. Extensive experiments demonstrate that the proposed $CKA$ outperforms state-of-the-art baselines in fake news detection.
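The prediction-stage aggregation can be sketched as a reliability-weighted combination of the classification-specific models' outputs. This is a minimal illustration of the truth-discovery idea, assuming each model emits a fake-probability and a scalar reliability; the paper's actual reliability evaluation is more elaborate.

```python
def aggregate_predictions(fake_probs, reliabilities):
    """Truth-discovery-style aggregation of per-model fake-news scores.

    fake_probs: each classification-specific model's predicted
    probability that the post is fake.
    reliabilities: the estimated reliability of each model (e.g. its
    accuracy on the post's inferred news classification).
    Returns the reliability-weighted consensus probability.
    """
    z = sum(reliabilities)
    return sum(p * r for p, r in zip(fake_probs, reliabilities)) / z
```

With two models voting 1.0 and 0.0 at reliabilities 3 and 1, the consensus is 0.75: the more reliable "animal"-knowledge model dominates, matching the mouse/cat example above.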