Journal: Big Data Mining and Analytics (English Edition), 2023, Issue 1 (National Academic Search)

Journal information

Big Data Mining and Analytics (English Edition); indexed in CSCD and EI

Officially published

FingerDTA: A Fingerprint-Embedding Framework for Drug-Target Binding Affinity Prediction

Xuekai Zhu, Juan Liu, Jian Zhang, Zhihui Yang, ...

Pages 1-10

Abstract: Many efforts have been exerted toward screening potential drugs for targets, and conducting wet experiments remains a laborious and time-consuming approach. Artificial intelligence methods, such as Convolutional Neural Network (CNN), are widely used to facilitate new drug discovery. Owing to the structural limitations of CNN, features extracted from this method are local patterns that lack global information. However, both global information extracted from the whole sequence and local patterns extracted from the special domain can influence the drug-target affinity. A fusion of global information and local patterns can construct neural network calculations closer to actual biological processes. This paper proposes a Fingerprint-embedding framework for Drug-Target binding Affinity prediction (FingerDTA), which uses CNN to extract local patterns and utilizes fingerprints to characterize global information. These fingerprints are generated on the basis of the whole sequence of drugs or targets. Furthermore, FingerDTA achieves comparable performance on the Davis and KIBA data sets. In the case study of screening potential drugs for the spike protein of the coronavirus disease 2019 (COVID-19), 7 of the top 10 drugs have been confirmed as potential candidates by the literature. Ultimately, the docking experiment demonstrates that FingerDTA can find novel drug candidates for targets. All code is available at http://lanproxy.biodwhu.cn:9099/mszjaas/FingerDTA.git.

A Method for Bio-Sequence Analysis Algorithm Development Based on the PAR Platform

Haipeng Shi, Huan Chen, Qinghong Yang, Jun Wang, ...

Pages 11-20

Abstract: The problems of biological sequence analysis have great theoretical and practical value in modern bioinformatics. Numerous algorithms are used to solve these problems, and complex similarities and differences exist among the algorithms for the same problem, making it difficult for researchers to select the appropriate one. To address this situation, this paper combines the formal partition-and-recur method, component technology, domain engineering, and generic programming to present a method for developing a family of biological sequence analysis algorithms. It designs highly trustworthy, reusable domain algorithm components and then assembles them to generate specific biological sequence analysis algorithms. An experiment in developing a family of dynamic-programming-based LCS algorithms shows that the proposed method improves the reliability, understandability, and development efficiency of particular algorithms.
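
For readers unfamiliar with the LCS (longest common subsequence) problem mentioned in the abstract, the following is a minimal sketch of the textbook dynamic-programming solution. It is not the component-based implementation produced on the PAR platform; the function name `lcs_length` is an illustrative assumption.

```python
def lcs_length(a: str, b: str) -> int:
    """Return the length of the longest common subsequence of a and b."""
    # dp[i][j] holds the LCS length of the prefixes a[:i] and b[:j].
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, start=1):
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                # Matching characters extend the LCS of the shorter prefixes.
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                # Otherwise drop one character from either sequence.
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]


if __name__ == "__main__":
    print(lcs_length("ABCBDAB", "BDCABA"))  # prints 4
```

The table costs O(len(a) * len(b)) time and space; members of an LCS algorithm family typically differ in how they trade this space for recomputation.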

RF-PSSM: A Combination of Rotation Forest Algorithm and Position-Specific Scoring Matrix for Improved Prediction of Protein-Protein Interactions Between Hepatitis C Virus and Human

Xin Liu, Yaping Lu, Liang Wang, Wei Geng, ...

Pages 21-31

Abstract: The identification of hepatitis C virus (HCV)-human protein interactions will not only help us understand the molecular mechanisms of related diseases but also be conducive to discovering new drug targets. An increasing number of clinically and experimentally validated interactions between HCV and human proteins have been documented in public databases, facilitating studies based on computational methods. In this study, we propose a new computational approach, rotation forest position-specific scoring matrix (RF-PSSM), to predict the interactions between HCV and human proteins. In particular, PSSM is used to characterize each protein; two-dimensional principal component analysis (2DPCA) is then adopted for feature extraction from the PSSM. Finally, rotation forest (RF) is used to implement classification. The results of various ablation experiments show that, on independent datasets, the accuracy and area under curve (AUC) value of RF-PSSM reach 93.74% and 94.29%, respectively, outperforming almost all cutting-edge research. In addition, we used RF-PSSM to predict 9 human proteins that may interact with HCV protein E1, which can provide theoretical guidance for future experimental studies.
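
The abstract reports two evaluation metrics, accuracy and AUC. As a reminder of what they measure, here is a minimal sketch that computes both from binary labels and classifier scores. The function names, the 0.5 decision threshold, and the toy data are illustrative assumptions and have no connection to the RF-PSSM experiments.

```python
def accuracy(labels, scores, threshold=0.5):
    """Fraction of samples whose thresholded score matches the 0/1 label."""
    correct = sum((s >= threshold) == (y == 1) for y, s in zip(labels, scores))
    return correct / len(labels)


def auc(labels, scores):
    """AUC as the probability that a positive outranks a negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    y = [1, 1, 0, 0]           # toy labels (hypothetical)
    s = [0.9, 0.4, 0.6, 0.1]   # toy classifier scores (hypothetical)
    print(accuracy(y, s))  # prints 0.5
    print(auc(y, s))       # prints 0.75
```

Accuracy depends on the chosen threshold, whereas AUC depends only on how the scores rank positives against negatives, which is why the two figures are reported separately.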

Deep Convolutional Network Based Machine Intelligence Model for Satellite Cloud Image Classification

Kalyan Kumar Jena, Sourav Kumar Bhoi, Soumya Ranjan Nayak, Ranjit Panigrahi, ...

Pages 32-43

Abstract: As a huge number of satellites revolve around the earth, there is a good opportunity to observe and determine changes on the earth through real-time analysis of satellite images. Classifying satellite images therefore strongly assists remote sensing communities in predicting tropical cyclones. In this article, a classification approach is proposed using a Deep Convolutional Neural Network (DCNN), comprising numerous layers that extract features through a downsampling process, for classifying satellite cloud images. The DCNN is trained on cloud images and attains high prediction accuracy. Delivery time decreases for testing images, whereas prediction accuracy increases, when an appropriate deep convolutional network is used with a large number of training instances. The satellite images are taken from the Meteorological & Oceanographic Satellite Data Archival Centre, the organization responsible for making available satellite cloud images of India and its subcontinent. The proposed cloud image classification shows 94% prediction accuracy with the DCNN framework.

Satellite Image Classification Using a Hybrid Manta Ray Foraging Optimization Neural Network

Amit Kumar Rai, Nirupama Mandal, Krishna Kant Singh, Ivan Izonin, ...

Pages 44-54

Abstract: A semi-supervised image classification method for satellite images is proposed in this paper. Satellite images contain enormous amounts of data that can be used in various applications. Analyzing the data is a tedious task because of its volume and heterogeneity. Thus, in this paper, a Radial Basis Function Neural Network (RBFNN) trained using the Manta Ray Foraging Optimization (MRFO) algorithm is proposed. RBFNN is a three-layer network, comprising input, hidden, and output layers, that can process large amounts of data. The trained network can discover hidden patterns in unseen data. The learning algorithm and seed selection play a vital role in the performance of the network. Seed selection is done using spectral indices to further improve the performance of the network. The MRFO algorithm is inspired by the intelligent behaviour of manta rays. It emulates three unique foraging behaviours, namely chain, cyclone, and somersault foraging. Because satellite images contain an enormous amount of data, they require exploration of a large search space. The spiral movement of the MRFO algorithm enables it to explore large search spaces effectively. The proposed method is applied to pre- and post-flooding Landsat 8 Operational Land Imager (OLI) images of the New Brunswick area to identify and classify the land cover changes induced by flooding. The images are classified using the proposed method, and a change map is developed using post-classification comparison. The change map shows that a large amount of agricultural area was washed away by the flooding. The affected area is also measured in square kilometres to support mitigation activities. The results show that, after flooding, the area covered by water increased whereas the vegetated area decreased. The performance of the proposed method is compared with existing state-of-the-art methods.
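
To make the three-layer RBFNN structure concrete, the sketch below shows a generic forward pass with Gaussian hidden units. It assumes a Gaussian basis function and a single linear output, which the abstract does not specify, and it omits training entirely (the paper trains the network with MRFO). All names and numbers are illustrative.

```python
import math


def rbf_forward(x, centers, widths, weights, bias=0.0):
    """Forward pass of a three-layer RBF network with Gaussian hidden units.

    x       : input vector (input layer)
    centers : one center vector per hidden unit
    widths  : one Gaussian width (sigma) per hidden unit
    weights : one output weight per hidden unit
    """
    hidden = []
    for center, sigma in zip(centers, widths):
        # Squared Euclidean distance between the input and this unit's center.
        sq_dist = sum((xi - ci) ** 2 for xi, ci in zip(x, center))
        hidden.append(math.exp(-sq_dist / (2.0 * sigma ** 2)))
    # The output layer is a weighted sum of the hidden activations.
    return bias + sum(w * h for w, h in zip(weights, hidden))


if __name__ == "__main__":
    centers = [[0.0, 0.0], [1.0, 1.0]]   # hypothetical hidden-unit centers
    widths = [1.0, 1.0]                  # hypothetical widths
    weights = [2.0, -1.0]                # hypothetical output weights
    print(rbf_forward([0.0, 0.0], centers, widths, weights))
```

In a setup like the one described, an optimizer such as MRFO would search over `centers`, `widths`, and `weights`; seed selection would supply their starting values.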

Intelligent Segment Routing: Toward Load Balancing with Limited Control Overheads

Shu Yang, Ruiyu Chen, Laizhong Cui, Xiaolei Chang, ...

Pages 55-71

Abstract: Segment routing has emerged as a novel architecture for traffic engineering in recent years. However, segment routing brings control overheads, i.e., additional packet headers must be inserted. In a large network, these overheads can greatly reduce forwarding efficiency when segment headers become too long. To reconcile the two goals, we propose the intelligent routing scheme for traffic engineering (IRTE), which can achieve load balancing with limited control overheads. To achieve optimal performance, we first formulate the problem as a mapping problem that maps different flows to key diversion points. Second, we prove that the problem is nondeterministic polynomial (NP)-hard by reducing it to a k-dense subgraph problem. To solve this problem, we develop an improved ant colony optimization (IACO) algorithm, based on ant colony optimization, which is widely used in network optimization problems. We also design the load balancing algorithm with diversion routing (LBA-DR) and analyze its theoretical performance. Finally, we evaluate IRTE on different real-world topologies, and the results show that IRTE outperforms traditional algorithms; e.g., the maximum bandwidth is 24.6% lower than that of traditional algorithms when evaluated on the BellCanada topology.

Closed-Form Models of Accuracy Loss due to Subsampling in SVD Collaborative Filtering

Samin Poudel, Marwan Bikdash

Pages 72-84

Abstract: We postulate and analyze a nonlinear subsampling accuracy loss (SSAL) model based on the root mean square error (RMSE) and two SSAL models based on the mean square error (MSE), suggested by extensive preliminary simulations. The SSAL models predict accuracy loss in terms of subsampling parameters such as the fraction of users dropped (FUD) and the fraction of items dropped (FID). We seek to investigate whether the models depend on the characteristics of the dataset in a constant way across datasets when using the SVD collaborative filtering (CF) algorithm. The dataset characteristics considered include various densities of the rating matrix and the numbers of users and items. Extensive simulations and rigorous regression analysis led to empirical symmetrical SSAL models in terms of FID and FUD whose coefficients depend only on the data characteristics. The SSAL models came out to be multi-linear in terms of the odds ratios of dropping a user (or an item) vs. not dropping it. Moreover, one MSE deterioration model turned out to be linear in the FID and FUD odds, with a zero coefficient on their interaction term. Most importantly, the models are constant in the sense that they are written in closed form using the considered data characteristics (densities and numbers of users and items). The models are validated through extensive simulations based on 850 synthetically generated primary (pre-subsampling) matrices derived from the 25M MovieLens dataset. Nearly 460 000 subsampled rating matrices were then simulated and subjected to the singular value decomposition (SVD) CF algorithm. Further validation was conducted using the 1M MovieLens and the Yahoo! Music Rating datasets. The models were constant and significant across all 3 datasets.
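
The abstract describes models that are multi-linear in the odds of dropping a user or an item. The sketch below only illustrates that functional form: the odds of a fraction f is f / (1 - f), and a multi-linear model combines the two odds with a possible interaction term. The coefficients `a`, `b`, and `c` are hypothetical placeholders; the paper's closed-form coefficients, which depend on dataset characteristics, are not given in the abstract and are not reproduced here.

```python
def odds(fraction: float) -> float:
    """Odds of dropping vs. not dropping, for a dropped fraction in [0, 1)."""
    if not 0.0 <= fraction < 1.0:
        raise ValueError("fraction must lie in [0, 1)")
    return fraction / (1.0 - fraction)


def multilinear_loss(fud: float, fid: float, a: float, b: float, c: float = 0.0) -> float:
    """Generic multi-linear form in the FUD and FID odds.

    a, b, c are hypothetical coefficients. Setting c = 0 gives a model that
    is linear in the two odds with no interaction term.
    """
    o_user, o_item = odds(fud), odds(fid)
    return a * o_user + b * o_item + c * o_user * o_item


if __name__ == "__main__":
    # Dropping half of the users and half of the items gives odds of 1 each.
    print(multilinear_loss(0.5, 0.5, a=1.0, b=2.0, c=3.0))  # prints 6.0
```

Because the odds grow without bound as the dropped fraction approaches 1, a form like this predicts sharply rising loss under aggressive subsampling.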

WTASR: Wavelet Transformer for Automatic Speech Recognition of Indian Languages

Tripti Choudhary, Vishal Goyal, Atul Bansal

Pages 85-91

Abstract: Automatic speech recognition systems are developed for translating speech signals into the corresponding text representation. This translation is used in a variety of applications such as voice-enabled commands, assistive devices, and bots. There is a significant lack of efficient technology for Indian languages. In this paper, a wavelet transformer for automatic speech recognition (WTASR) of Indian languages is proposed. Speech signals contain both high and low frequencies that vary over time because of variation in the speaker's speech. Wavelets therefore enable the network to analyze the signal at multiple scales. The wavelet decomposition of the signal is fed into the network to generate the text. The transformer network comprises an encoder-decoder system for speech translation. The model is trained on an Indian language dataset for translating speech into the corresponding text. The proposed method is compared with other state-of-the-art methods. The results show that the proposed WTASR has a low word error rate and can be used for effective speech recognition for Indian languages.
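
To illustrate what a multiscale wavelet decomposition of a signal looks like, here is a minimal sketch using the Haar wavelet. The abstract does not say which wavelet family WTASR uses, so Haar is an assumption chosen only because it is the simplest; the function names are illustrative.

```python
import math


def haar_step(signal):
    """One level of the orthonormal Haar transform.

    Returns (approximation, detail): scaled pairwise sums capture the
    low-frequency content, scaled pairwise differences the high-frequency content.
    """
    if len(signal) % 2:
        raise ValueError("signal length must be even")
    root2 = math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) / root2 for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / root2 for i in range(0, len(signal), 2)]
    return approx, detail


def haar_decompose(signal, levels):
    """Repeatedly split the approximation to obtain a multiscale decomposition."""
    approx = list(signal)
    details = []
    for _ in range(levels):
        approx, detail = haar_step(approx)
        details.append(detail)  # details[0] is the finest scale
    return approx, details


if __name__ == "__main__":
    approx, details = haar_decompose([4.0, 2.0, 6.0, 8.0], levels=2)
    print(approx)   # coarsest approximation, one coefficient
    print(details)  # detail coefficients from fine to coarse
```

Each level halves the time resolution and isolates one frequency band, which is the multiscale view of the signal that the abstract refers to.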

Predicted Mean Vote of Subway Car Environment Based on Machine Learning

Kangkang Huang, Shihua Lu, Xinjun Li, Ke Feng, ...

Pages 92-105

Abstract: The thermal comfort of passengers in the carriage cannot be ignored. Thus, this research aims to establish a prediction model for the thermal comfort of the internal environment of a subway car and to find the optimal input combination for predicting the predicted mean vote (PMV) index. Data-driven modeling utilizes data from experiments and questionnaires conducted in the Nanjing Metro. Support vector machine (SVM), decision tree (DT), random forest (RF), and logistic regression (LR) were used to build four models. All possible combinations of 11 input variables were evaluated to determine the most accurate model, with variable selection for each model comprising 102 350 iterations. In the PMV prediction, the RF model was the best when using the squared correlation coefficient (R²) as the evaluation indicator (R²: 0.7680, mean squared error (MSE): 0.2868). The selected variables include clothing temperature (CT), convective heat transfer coefficient between the surface of the human body and the environment (CHTC), black bulb temperature (BBT), and thermal resistance of clothes (TROC). The RF model with MSE as the evaluation index also had the highest accuracy (R²: 0.7676, MSE: 0.2836). The selected variables include clothing surface area coefficient (CSAC), CT, BBT, and air velocity (AV). The results show that the RF model can efficiently predict the PMV of the subway car environment.
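
The exhaustive search over input combinations can be sketched as follows. With 11 candidate variables there are 2^11 - 1 = 2047 non-empty subsets; the abstract's figure of 102 350 iterations per model is 50 times that count, but the abstract does not say what the factor represents, so it is not modeled here. The helper names and the toy scoring function are hypothetical; a real run would score each subset by fitting a model and measuring R² or MSE.

```python
from itertools import combinations


def all_subsets(variables):
    """Yield every non-empty combination of the candidate input variables."""
    for size in range(1, len(variables) + 1):
        yield from combinations(variables, size)


def best_subset(variables, score):
    """Return the subset with the highest score (score is a caller-supplied function)."""
    return max(all_subsets(variables), key=score)


if __name__ == "__main__":
    # Eleven placeholder names; the abstract names only some of the real variables.
    candidates = [f"v{i}" for i in range(1, 12)]
    print(sum(1 for _ in all_subsets(candidates)))  # prints 2047

    # Toy score (hypothetical): reward two specific variables, penalize subset size.
    toy_score = lambda subset: len(set(subset) & {"CT", "BBT"}) - 0.1 * len(subset)
    print(best_subset(["CT", "BBT", "AV"], toy_score))  # prints ('CT', 'BBT')
```

Exhaustive search is feasible here only because 11 variables give a few thousand subsets; the count doubles with every added variable.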

Ultra-Short Wave Communication Squelch Algorithm Based on Deep Neural Network

Yuanxin Xiang, Yi Lv, Wenqiang Lei, Jiancheng Lv, ...

Pages 106-114

Abstract: The squelch problem of ultra-short wave communication under non-stationary noise and low Signal-to-Noise Ratio (SNR) in a complex electromagnetic environment is still challenging. To alleviate the problem, we propose a squelch algorithm for ultra-short wave communication based on a deep neural network and the traditional energy decision method. The proposed algorithm first predicts the speech existence probability using a three-layer Gated Recurrent Unit (GRU) with the speech banding spectrum as the feature. It then obtains the final squelch result by combining the strength of the signal energy and the speech existence probability. Multiple simulations and experiments are done to verify the robustness and effectiveness of the proposed algorithm. We simulate the algorithm in three situations: typical Amplitude Modulation (AM) and Frequency Modulation (FM) in ultra-short wave communication under different SNR environments, non-stationary burst-like noise environments, and the real received signal of an ultra-short wave radio. The experimental results show that the proposed algorithm performs better than the traditional squelch methods in all the simulations and experiments. In particular, the false alarm rate of the proposed squelch algorithm for non-stationary burst-like noise is significantly lower than that of traditional squelch methods.
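
The final decision step, combining signal energy with a speech existence probability, can be sketched as below. The abstract does not state the combination rule, so this sketch assumes a simple logical AND of two thresholds; the thresholds, function names, and the idea of passing the probability in as a plain number (rather than computing it with a GRU) are all illustrative assumptions.

```python
import math


def frame_energy_db(samples):
    """Mean power of one audio frame in decibels (relative to full scale 1.0)."""
    mean_square = sum(s * s for s in samples) / len(samples)
    # A small floor avoids log10(0) on silent frames.
    return 10.0 * math.log10(mean_square + 1e-12)


def squelch_open(samples, speech_prob, energy_threshold_db=-30.0, prob_threshold=0.5):
    """Open the squelch only if the frame is both strong enough and likely speech.

    speech_prob stands in for the output of a speech-presence model;
    both thresholds are hypothetical defaults.
    """
    strong_enough = frame_energy_db(samples) >= energy_threshold_db
    likely_speech = speech_prob >= prob_threshold
    return strong_enough and likely_speech


if __name__ == "__main__":
    loud_frame = [0.5] * 160     # about -6 dB
    quiet_frame = [0.001] * 160  # about -60 dB
    print(squelch_open(loud_frame, speech_prob=0.9))   # prints True
    print(squelch_open(loud_frame, speech_prob=0.1))   # prints False
    print(squelch_open(quiet_frame, speech_prob=0.9))  # prints False
```

The second call shows why the probability term matters: a loud burst of noise passes an energy-only squelch, but is rejected once the speech probability is low.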
