Abstract: Multivariate spectral signals are highly correlated. Variable selection techniques are often deployed for model optimization, for identifying key variables to explore the underlying physicochemical system, or for developing a cheap multi-spectral system based on key variables. However, the selected variables often do not give a good estimate of properties when tested in a new setting, such as new measurements performed on a different spectrometer, a different physical or chemical state of the samples, or a difference in the environmental factors around the experiment. Often the model based on variables selected in the first domain (specific conditions/instrument) does not generalize to the new domain (specific conditions/instrument). To deal with this, the present work proposes a new variable selection method called domain invariant covariate selection (di-CovSel). The method selects the most informative variables that are invariant to differences in the instruments, the physical or chemical state of the samples, and the environmental factors around the experiment. The method is inspired by domain invariant partial least squares (di-PLS) and covariate selection (CovSel). The potential of the method is demonstrated on four real cases related to the calibration of near-infrared (NIR) spectroscopy on agri-food materials. The results show that, in all cases, the domain invariant features selected by di-CovSel give lower prediction error than standard variable selection with CovSel when the models are tested on a new data domain.
In summary, domain invariant features selected across domains support the development of calibration models with good generalization and supply a better understanding of the system by bypassing the external factors originating from differences in the instruments, physical or chemical states of the samples and the differences in the environmental factors around the experiment. Note that one key feature of the proposed method is that the most important variables which generalize well across domains can be identified without requiring reference measurements in the target domain.
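As a rough illustration of the covariate selection idea that the abstract builds on, the sketch below implements a plain CovSel-style greedy selection for a single response in pure Python. It is not the di-CovSel method: the domain invariance step is not described in the abstract and is left out. The function name `covsel`, the helper `_dot`, the tolerance `tol` and the toy data are assumptions made for this example.

```python
def _dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(u * v for u, v in zip(a, b))


def covsel(X, y, n_select, tol=1e-12):
    """Greedy CovSel-style variable selection for one response.

    X is a list of rows (samples), y a list of floats.
    Returns the indices of the selected columns, in selection order.
    """
    n, p = len(X), len(X[0])
    # Work column-wise on mean-centred data.
    cols = []
    for j in range(p):
        col = [row[j] for row in X]
        mean = sum(col) / n
        cols.append([v - mean for v in col])
    y_mean = sum(y) / n
    resid = [v - y_mean for v in y]

    selected = []
    for _ in range(n_select):
        # Score each variable by its squared covariance with the residual response.
        scores = [_dot(c, resid) ** 2 for c in cols]
        for j in selected:
            scores[j] = -1.0
        best = max(range(p), key=scores.__getitem__)
        if scores[best] <= tol:
            break  # nothing informative is left
        selected.append(best)
        z = cols[best][:]
        zz = _dot(z, z)
        # Deflate: remove from X and y everything explained by the chosen variable.
        cols = [[cv - zv * _dot(z, c) / zz for cv, zv in zip(c, z)] for c in cols]
        resid = [rv - zv * _dot(z, resid) / zz for rv, zv in zip(resid, z)]
    return selected


# Toy data (hypothetical): column 2 is a scaled copy of column 1.
X = [[1, 1, 0.5], [-1, 2, 1.0], [0, 3, 1.5], [-1, 4, 2.0], [1, 5, 2.5]]
y = [4, 5, 9, 11, 16]
print(covsel(X, y, 2))
```

Because the data are deflated after each pick, a variable that merely duplicates an already selected one contributes nothing afterwards, which is why the redundant column in the toy data is never chosen.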
Abstract: Gene expression data analysis has always been challenging because the data are complex and high-dimensional. In microarray gene expression data, the number of samples is generally much smaller than the number of genes. Treating such imbalanced data as a machine learning task risks producing an over-fitted model, reducing predictive performance, and making the genetic data hard to interpret. These problems can be significantly reduced by choosing the more informative genes. Unsupervised gene selection techniques can estimate the relations among genes well. Although mutual information and symmetric uncertainty estimate the relevance of genes well, they are bivariate measures and ignore possible dependencies among several genes. To address this issue, we propose an unsupervised gene selection scheme based on information-theoretic measures. It uses a similarity-based algorithm for gene clustering and then introduces virtual genes as representatives of the gene clusters. These representative genes share the most information with the genes in their clusters and have the least similarity with the representatives of other clusters. Experimental results on benchmark microarray gene expression datasets demonstrate the effectiveness of our approach compared with several information-theoretic schemes as well as prototype- and density-based clustering methods, in both unsupervised and supervised scenarios.
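The bivariate measures mentioned in this abstract can be made concrete with a small sketch. The code below computes symmetric uncertainty, SU(X, Y) = 2 I(X; Y) / (H(X) + H(Y)), for two discrete variables. It assumes the expression values have already been discretized, a step the abstract does not describe, and the function names are chosen for this example only.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def symmetric_uncertainty(x, y):
    """Symmetric uncertainty of two equally long discrete sequences.

    Returns a value between 0 (independent) and 1 (fully redundant).
    """
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0:
        return 0.0  # both variables are constant
    joint = entropy(list(zip(x, y)))
    mutual_information = hx + hy - joint
    return 2 * mutual_information / (hx + hy)


# Hypothetical discretized expression levels for three genes.
gene_a = [0, 0, 1, 1]
gene_b = [0, 0, 1, 1]   # identical to gene_a
gene_c = [0, 1, 0, 1]   # independent of gene_a
print(symmetric_uncertainty(gene_a, gene_b))
print(symmetric_uncertainty(gene_a, gene_c))
```

The measure only ever compares two variables at a time, which is the limitation the abstract points to when it mentions dependencies among several genes.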
Abstract: The increasing presence of microplastics in the marine environment has raised continuing concern, and accurate identification of microplastics is a precondition for waste management and recycling. Spectroscopic analysis is widely used for microplastics identification. It works well with a known stock of materials but fails in the blind identification of realistic samples. In general, the manual work of rejecting unmatched samples cannot be avoided during spectral analysis of microplastics. This paper builds an open model for identifying diverse environmental microplastic samples. The results show that FTIR combined with a novel open classifier reaches a recognition accuracy of 0.955, with all non-plastic samples rejected automatically.
Mishra, Puneet; Roger, Jean Michel; Marini, Federico; Biancolillo, Alessandra; ...
10 pages
Abstract: Ensemble pre-processing is emerging as a potential tool to avoid the tedious task of selecting and optimizing pre-processing in near-infrared (NIR) spectral modelling. Furthermore, differently pre-processed data may carry complementary information, so ensemble pre-processing may be the best-suited modelling option for extracting all the useful information from differently pre-processed data. Recently, multi-block techniques such as sequential (SPORT) and parallel (PORTO) orthogonalized partial least squares regression were proposed to extract the complementary information present in differently pre-processed data. Although such multi-block techniques allow efficient modelling of differently pre-processed data blocks, depending on the approach, challenges related to choosing the block order, parameter tuning, block scaling and optimization time still have to be dealt with. To cope with these issues, the present study proposes using a recently developed multi-block data modelling technique that is faster, block-order independent and scale independent, called response-oriented sequential alternation (ROSA), to process the multi-block data generated by differently pre-processing the same NIR data. This new method is called PROSAC, i.e., pre-processing ensembles with ROSA calibration. The potential of the approach is demonstrated on five real NIR spectral datasets. As baselines for comparison, partial least squares regression was performed on individually pre-processed datasets, and two multi-block pre-processing fusion approaches, SPORT and PORTO, were also applied. Ensemble pre-processing with ROSA achieved either better performance than the baseline methods or comparable performance, without the need to worry about the pre-processing order, the scaling of data after pre-processing, or optimization time. PROSAC can be considered a general tool for ensemble pre-processing in NIR data modelling.
Abstract: Fruit firmness is a complex trait that develops throughout fruit development, including post-harvest, and is influenced by both ripening and dehydration. There is wide interest in predicting firmness with non-destructive sensing techniques such as spectral analysis. However, commonly used reference techniques, such as acoustic firmness (AF), limited compression (LC) and Magness-Taylor (MT), respond differently to dehydration and ripening. This study aims to disentangle how the firmness of 'Conference' pears relates to dehydration and ripening and to model ripening-related firmness using non-destructive sensing. To this end, a pear fruit matrix was created with varying firmness and dehydration levels. To model fruit firmness (LC and MT) with Vis-NIR spectroscopy and explore whether AF information could complement Vis-NIR spectroscopy, a sequential multi-block analysis was performed. Single-block Vis-NIR spectral data were made multi-block by partitioning the variance in the spectral data into acoustic-dependent and acoustic-independent parts. A variation-partitioning-based approach was also presented to select the best pre-processing operation for Vis-NIR spectral data modelling. Multi-block regression to predict firmness and classification modelling of pear fruit into different firmness classes were also performed. The results gave enhanced insight into the different fruit firmness measures and into the capability of Vis-NIR and acoustic sensing for non-destructive fruit firmness prediction. The results can benefit the scientific community working in fruit optical spectroscopy and chemometric modelling.
Abstract: Fungal infections have become a serious health concern worldwide. They usually occur when an invading fungus appears on a particular part of the body and becomes hard for the human immune system to resist. Existing antifungal treatments are considered inadequate because of their severe side effects. With the rapid growth of this chronic disease across the world, building an accurate prediction model for fungal infections has become a challenging task for scientists. To cope with these issues, several prediction methods have been established for antifungal peptides. However, due to the limited and unsatisfactory performance of these methods, it is still essential to develop an effective and reliable model for antifungal peptides. In this study, we present an intelligent learning approach for the accurate prediction of antifungal peptides. The sequential and evolutionary features are explored with three descriptors, namely conjoint triad feature (CTF), pseudo position-specific scoring matrix (PsePSSM), and position-specific scoring matrix-discrete wavelet transform (PSSM-DWT). The vectors extracted by the encoding methods are then fused to obtain multi-perspective descriptors representing both sequential and evolutionary features. To reduce the size of the multi-information vector and to remove noisy and irrelevant descriptors, we applied minimum redundancy and maximum relevance (mRMR) based feature selection to choose the optimal feature set. In the next step, the selected feature vector is evaluated with four machine learning models, i.e., fuzzy k-nearest neighbor (FKNN), random forest (RF), k-nearest neighbor (KNN), and support vector machine (SVM). The predicted labels of the individual learning algorithms are then provided to a genetic algorithm to form an ensemble classifier that further boosts the prediction results.
Furthermore, the SHAP and LIME methods were used to interpret the contribution of features to model predictions. Our proposed iAFPs-EnC-GA model achieved prediction accuracies of 97.81% and 93.92% on the training and independent datasets, respectively, which is ~4% higher than existing models. The "iAFPs-EnC-GA" model is expected to be a valuable tool for scientists and might play a key role in drug development and academic research. The source code and all datasets are publicly available at https://github.com/farmanit335/iAFPs-EnC-GA.
Aleixandre-Tudo, J. L.; Castello-Cogollos, L.; Aleixandre, J. L.; Aleixandre-Benavent, R.; ...
12 pages
Abstract: Chemometrics has been defined as the discipline that provides maximum information from chemical data. Food science and technology applications use chemical data and are therefore suitable for chemometric evaluation. Bibliometric studies provide an enhanced understanding of the progression, research status and future trends of a research field. The main aim of the study was therefore to provide a bibliometric evaluation of the research literature employing chemometric techniques in food science and technology applications. For this, a search strategy using the single term chemometric* was applied. The metadata obtained from the bibliometric search were subsequently analyzed. Indicators of scientific productivity and quality, such as the number of articles, citations, or funding activity, were obtained together with the most relevant keywords, authors, and countries. The progression of the bibliometric indicators over time was also presented and discussed. Chemometrics appeared as a prolific and healthy research field, with increased funding received in the last decade. PCA, PLS and DA are still the preferred methods for most applications. A large part of the research relates to the combined use of spectroscopy and chemometrics. Finally, China and Brazil appeared as the leading countries in applying chemometrics to foodstuffs.
Abstract: N6-methyladenosine (m6A) is a prevalent RNA methylation modification that plays an important role in various biological processes. Accurate identification of m6A sites is fundamental to a deep understanding of the biological functions and mechanisms of the modification. However, experimental methods for detecting m6A sites are usually time-consuming and expensive, and various computational methods have been developed to identify m6A sites in RNA. This paper proposes a novel cross-species computational method, StackRAM, which uses machine learning algorithms to identify m6A sites in Saccharomyces cerevisiae (S. cerevisiae), Homo sapiens (H. sapiens), Arabidopsis thaliana (A. thaliana) and Mus musculus (M. musculus). First, RNA sequence features are extracted through binary encoding, chemical property, nucleotide frequency, k-mer nucleotide frequency, pseudo dinucleotide composition, and position-specific trinucleotide propensity, and the initial feature dataset is obtained by feature fusion. Second, the Elastic Net is used for the first time to filter redundant and noisy information and retain the features that are important for m6A site classification. Finally, the output probabilities of the base classifiers and the optimal feature subset selected by the Elastic Net are combined, and the combined features are fed into the second-stage meta-classifier, an SVM. The result of the jackknife test on the S. cerevisiae training dataset indicates that the prediction performance of StackRAM is superior to the current state-of-the-art methods. The prediction accuracy of StackRAM on the independent test datasets H. sapiens, A. thaliana and M. musculus reaches 92.30%, 87.06% and 91.86%, respectively. Therefore, StackRAM has potential for cross-species prediction and can be a useful method for identifying m6A sites.
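One of the sequence encodings listed in this abstract, k-mer nucleotide frequency, is simple enough to sketch. The code below counts overlapping k-mers in an RNA sequence and normalizes the counts to frequencies. It is a generic illustration and may differ in detail from the encoding used in StackRAM; the function name, the default alphabet "ACGU" and the choice to skip windows containing other symbols are assumptions made for this example.

```python
from itertools import product


def kmer_frequencies(seq, k=2, alphabet="ACGU"):
    """Return the frequency of every possible k-mer in seq.

    The output has len(alphabet) ** k entries, ordered as produced by
    itertools.product. Windows containing symbols outside the alphabet
    are skipped.
    """
    seq = seq.upper()
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:
            counts[window] += 1
            total += 1
    return [counts[m] / total if total else 0.0 for m in kmers]


# Hypothetical short RNA fragment.
features = kmer_frequencies("GGACU", k=2)
print(len(features))
```

Vectors produced this way have a fixed length regardless of sequence length, so they can be concatenated with other encodings during feature fusion.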
Abstract: The article uses artificial neural networks (ANN) to predict the antifungal properties of quaternary ammonium salts against Candida albicans. The antifungal activity, expressed as the minimum inhibitory concentration (MIC) of microbial growth, was determined experimentally by the serial dilution method for a series of 140 new imidazole derivatives. Three-dimensional models of the test compounds were then constructed, and the chemical information was converted into numerical form using computational chemistry. In the next step, neural network models were designed to solve regression and classification problems. Both models were characterized by high predictive ability. The quality of the regression model was determined from the level of correlation between the theoretically calculated activity and the experimentally determined activity (R² = 0.91 for the learning set, R² = 0.88 for the test set and R² = 0.91 for validation). The classification model separated the compounds into active and inactive with a classification accuracy of 91.67% for the learning set, 88.57% for the test set, and 95.24% for the validation set. Artificial neural networks are a predictive tool with impressive learning properties and non-linear information processing capabilities. ANN have the potential to reduce the time and cost of discovering new antimicrobial substances and to support pharmaceutical development research.
Abstract: In this study, the combination of least absolute deviation and the least absolute shrinkage and selection operator (LAD-LASSO) was introduced as a new variable selection method for artificial neural network (ANN)-based quantitative structure-activity relationship (QSAR) studies. The biological activity of various chemical compounds was predicted using an ANN-based QSAR model combined with the efficient LAD-LASSO variable selection method. The 3224 computed DRAGON descriptors were first reduced to a smaller number using preprocessing methods. The descriptors most relevant to the biological activities were then chosen using the LAD-LASSO variable selection method. The selected descriptors were used as ANN inputs, and the designed models were optimized. The biological activity of the test set compounds was predicted using the optimum ANN models. The coefficients of determination (R²) for the test data in the different datasets were 0.87, 0.84, and 0.87, and the corresponding MSE values of the test sets were 0.13, 0.07, and 0.11, respectively. The high R² and low MSE values demonstrate the good prediction ability of the constructed QSAR models. The applicability domain (AD) and the Y-randomization test also confirmed the efficiency of the developed models. Finally, the performance of the QSAR model was evaluated by identifying novel compounds with high potency. The weak structures in the dataset were identified and modified using the effect of the selected descriptors on the biological activity, resulting in new compounds with significant potency. The response value of the newly suggested compounds was predicted using the optimum ANN models. Receptor-ligand interactions were extracted for all proposed compounds. The presence of different hydrophilic and hydrophobic interactions in the active site of the respective receptor indicates the high potential of the suggested chemical compounds.
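To make the LAD-LASSO idea concrete, the sketch below evaluates the kind of objective such a method minimizes: a sum of absolute residuals (the LAD loss) plus an L1 penalty on the coefficients (the LASSO term). It only evaluates the objective for given coefficients and does not perform the optimization. The single shared penalty weight `lam`, the absence of an intercept and the function name are simplifying assumptions for this example, not details taken from the abstract.

```python
def lad_lasso_objective(X, y, beta, lam):
    """Sum of absolute residuals plus lam times the L1 norm of beta.

    X is a list of rows (one per compound), y the list of responses,
    beta the list of coefficients, lam a non-negative penalty weight.
    """
    lad_loss = sum(
        abs(yi - sum(xij * bj for xij, bj in zip(xi, beta)))
        for xi, yi in zip(X, y)
    )
    l1_penalty = lam * sum(abs(b) for b in beta)
    return lad_loss + l1_penalty


# Hypothetical two-descriptor example.
X = [[1.0, 0.0], [0.0, 1.0]]
y = [1.0, 2.0]
print(lad_lasso_objective(X, y, beta=[1.0, 1.0], lam=0.5))
```

Descriptors whose coefficients are driven to exactly zero by the L1 term are the ones dropped during variable selection, while the absolute-value loss makes the fit less sensitive to outlying activity values than a squared-error loss.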