Abstract: Drug response classification is a major challenge in personalized medicine. Selecting a suitable drug for a cancer patient is critical, and drug response prediction is generally based on target information, genomic profiles, and chemical structure. Feature selection approaches, driven by prior knowledge of gene expression signatures, drug targets, and target pathways, are therefore essential to the classification process, and classification is then performed to assess the accuracy of drug response prediction. To the best of our knowledge, this is the first work to assess different optimization techniques for feature selection and to perform drug response classification with different classifiers on the Cancer Cell Line Encyclopaedia (CCLE) and Genomics of Drug Sensitivity in Cancer (GDSC) datasets. For feature selection, the Firefly, Whale, and Grey Wolf optimization algorithms are examined; for classification, AdaBoost, gradient boosting, and random forest classifiers are used. In addition, a newly developed classification approach, the Discriminative Weight Updated Tuned Deep Multi-Layer Perceptron (DWUT-MLP), is compared with the other classifiers. Evaluation of the optimization algorithms with the DWUT-MLP and the existing classifiers shows that pairing an effective feature selection algorithm with a suitable classifier improves the accuracy of anticancer drug response prediction. This research therefore supports the choice of an appropriate feature selection approach and offers an interpretable classifier model with the potential to improve predictive accuracy. In the comparative analysis, the proposed model performs roughly 10-15% better than existing frameworks in classifying the anticancer drug response.
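The wrapper-style feature selection described above can be illustrated with a minimal sketch. For brevity, a simple binary hill climb stands in for the Firefly/Whale/Grey Wolf swarm optimizers, and a toy correlation-based score (with a size penalty) stands in for the classifier-based fitness; both the score and the update rule are illustrative assumptions, not the paper's implementation.

```python
import random

def fitness(subset, X, y):
    # Toy wrapper score: mean absolute correlation of the selected
    # features with the response, penalized by subset size.
    # (A real wrapper would use classifier cross-validation accuracy.)
    if not subset:
        return 0.0
    n = len(y)
    my = sum(y) / n
    score = 0.0
    for j in subset:
        col = [row[j] for row in X]
        mx = sum(col) / n
        cov = sum((a - mx) * (b - my) for a, b in zip(col, y))
        vx = sum((a - mx) ** 2 for a in col) or 1e-12
        vy = sum((b - my) ** 2 for b in y) or 1e-12
        score += abs(cov) / (vx * vy) ** 0.5
    return score / len(subset) - 0.01 * len(subset)

def select_features(X, y, n_iter=200, seed=0):
    # Binary hill climb standing in for the swarm optimizers:
    # flip one feature in or out, keep the change if fitness improves.
    rng = random.Random(seed)
    d = len(X[0])
    current = set(j for j in range(d) if rng.random() < 0.5)
    best = fitness(current, X, y)
    for _ in range(n_iter):
        cand = set(current)
        cand.symmetric_difference_update({rng.randrange(d)})
        f = fitness(cand, X, y)
        if f > best:
            current, best = cand, f
    return sorted(current), best
```

On synthetic data where feature 0 equals the response and the rest are noise, the search reliably retains feature 0 and drops the noise columns.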
Abstract: A novel methodology is proposed for defining multivariate raw material specifications that provide assurance of quality, with a given confidence level, for the critical-to-quality attributes (CQAs) of the manufactured product. The capability of a raw material batch to produce final product with CQAs within specifications is estimated before a single unit of the product is produced, and the method can therefore be used as a decision-making tool to accept or reject any new supplier raw material batch. The method is based on Partial Least Squares (PLS) model inversion, takes the prediction uncertainty into account, and can be used with the historical/happenstance data typical of Industry 4.0. The methodology is illustrated using data from three real industrial processes.
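The core model-inversion step can be sketched in a few lines. Assuming scores/loadings from an already-fitted PLS-type latent model (x = P t, y = q t, with P, q taken as given here), a minimum-norm inversion maps a target quality y back to latent scores and a raw-material profile. This is only the deterministic inversion idea; the paper's treatment of prediction uncertainty and confidence levels is not reproduced.

```python
import numpy as np

def pls_invert(P, q, y_target):
    """Minimum-norm inversion of a PLS-type latent model.

    Assumed model (fitted elsewhere):
        x = P @ t,   y = q @ t
    Given a target y, return the latent scores t* and the
    reconstructed raw-material profile x* = P @ t*.
    """
    q = np.atleast_2d(q)                      # (m, a) response loadings
    # Pseudo-inverse gives the minimum-norm t* satisfying q @ t* = y.
    t_star = np.linalg.pinv(q) @ np.atleast_1d(np.asarray(y_target, float))
    return t_star, P @ t_star
```

Because the latent space usually has more directions than there are quality targets, the pseudo-inverse picks one solution out of a null space of equally valid ones; practical implementations constrain that null space with the historical data.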
Abstract: Spectroscopy has become a popular method in research devoted to cancer diagnostics, therapy, and surgery, wherever tumor cells surrounded by non-cancerous ones must be detected. Chemometric methods are usually applied to classify cancerous and non-cancerous sites, so proper validation of the classification models is required to ensure the reliability of the obtained results. In this study, we suggest using real data to simulate spectral sets with varying characteristics (size, distribution of classes), an analog of the "sandbox" used in software development, and validating the models under different conditions. Near-infrared spectra (939-1796 nm) measured from breast tumors and healthy tissues of laboratory mice (152 spectra) were used to simulate spectral data sets of different sizes (50, 100, and 150 spectra). We propose a simple simulation method based on a singular value decomposition of the real spectral dataset and rearrangement of the calculated residuals. Several algorithms for training and test set selection were applied to the simulated data (Kennard-Stone, DUPLEX, random, and Monte Carlo cross-validation), and the corresponding Support Vector Machine classification models were trained, optimized, and validated using a series of test sets with varying "healthy : tumor" class distributions (1:1, 3:1, 1:3) and sizes (10%, 30%, and 50% of the training data set). The performance of the classification models, expressed as accuracy, sensitivity, and selectivity, was compared, and a validation strategy is proposed.
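The SVD-plus-residual-rearrangement idea can be sketched as follows: decompose the real spectra, keep the low-rank structure, and build new spectra by pairing resampled low-rank rows with shuffled residual rows. The number of retained components and the resampling scheme are illustrative choices, not the authors' exact algorithm.

```python
import numpy as np

def simulate_spectra(X, n_components=5, n_out=100, seed=0):
    """Simulate new spectra from a real set X (n_samples x n_channels):
    keep the low-rank SVD structure and rearrange the residuals --
    a rough sketch of the 'sandbox' simulation described above."""
    rng = np.random.default_rng(seed)
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    low_rank = U[:, :n_components] * s[:n_components] @ Vt[:n_components]
    residuals = X - low_rank
    # Draw base spectra with replacement and add rearranged residual rows,
    # so simulated spectra share the real structure and real noise.
    base_idx = rng.integers(0, X.shape[0], size=n_out)
    resid_idx = rng.permutation(base_idx)
    return low_rank[base_idx] + residuals[resid_idx]
```

Because the residuals come from the measured data rather than a parametric noise model, the simulated sets inherit realistic instrument noise, which is the point of the "sandbox".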
Abstract: Feature selection for high-dimensional labeled data with limited observations is critical for making powerful predictive modeling accessible, scalable, and interpretable for domain experts. Spectroscopy data, which record the interaction between matter and electromagnetic radiation, hold a great deal of information in a single sample. Since acquiring such high-dimensional data is a complex task, it is crucial to exploit the best analytical tools to extract the necessary information. In this paper, we investigate the most commonly used feature selection techniques and apply recent explainable AI techniques to interpret the prediction outcomes for high-dimensional and limited spectral data. Interpreting the prediction outcome benefits domain experts because it ensures the transparency and faithfulness of the ML models to the domain knowledge. Given instrument resolution limitations, pinpointing the important regions of the spectroscopy data creates a pathway to optimizing the data collection process through miniaturization of the spectrometer device. Reducing the device size, power, and therefore cost is a requirement for real-world deployment of such a sensor-to-prediction system as a whole. Furthermore, we consider a wide range of machine learning models that have proven successful for predicting the Cetane Number of fuels. We specifically design three scenarios to ensure that the evaluation of the ML models is robust for real-time practice of the developed methodologies and to uncover the hidden effect of noise sources on the final outcome. The evaluation is performed for both the full model and reduced models obtained with different feature selection techniques on a real dataset. Finally, we propose a correctness metric for feature selection techniques to assess the conformance of the selected subset of features to the domain expertise. As a result, Support Vector Regression yields better prediction accuracy and generalization power, as it is less complex and computationally more efficient than the Neural Network model. More importantly, using a reduced subset of the original features creates a pathway to deploying less complex, scalable, and explainable prediction models.
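A conformance-to-domain-expertise score of the kind mentioned above could take the following shape: the fraction of selected feature indices that fall inside expert-identified spectral regions. This is a guess at the metric's spirit for illustration; the paper's exact definition may differ.

```python
def conformance(selected, expert_regions):
    """Hypothetical 'correctness' score for a feature selection result:
    fraction of selected feature indices lying inside expert-identified
    spectral regions, given as inclusive (low, high) index ranges."""
    expert = set()
    for lo, hi in expert_regions:
        expert.update(range(lo, hi + 1))
    if not selected:
        return 0.0
    chosen = set(selected)
    return len(chosen & expert) / len(chosen)
```

A score of 1.0 means every selected wavelength falls in a region the domain experts consider informative; lower values flag selections that may be fitting noise.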
Abstract: Terahertz (THz) spectroscopy, characterized by low photon energy, instantaneity, and spectral fingerprints, is promising for material identification. Kernel Entropy Component Analysis (KECA), unlike the commonly used THz time-domain spectroscopy analysis methods PCA and Kernel Principal Component Analysis (KPCA), decomposes the eigenvalues and eigenvectors of a kernel matrix in which the original spectral data are projected into a high-dimensional feature space, selects as projection vectors the eigenvectors that contribute most to the Renyi entropy of the original data, and thereby obtains a new data set. However, the selection of the kernel function parameters in the traditional KECA method has a significant effect on the accuracy of the final analysis results. To address this problem, this paper proposes an improved KECA method for transgenic cotton seed recognition based on terahertz spectroscopy. The improved KECA takes as its criterion function the difference between intra-class and inter-class dispersion based on the angular structure, and selects the kernel parameter that maximizes the criterion function as the best option. A clustering method based on angular structure distance is then used to identify different substance species. To test whether the proposed method is effective, THz time-domain spectroscopy was applied to three transgenic cotton seed varieties, Xinluzhong6, Xinqiu107, and Yingmian8, and the absorbance spectra of the three cotton seeds were subjected to clustering analysis with the proposed improved KECA method.
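The KECA projection step can be sketched as follows: form an RBF kernel matrix, eigendecompose it, and rank components by their contribution to the Renyi entropy estimate, lam_i * (1^T e_i)^2, rather than by eigenvalue size as in kernel PCA. The kernel choice and gamma value are illustrative, and the paper's angular-structure parameter optimization is not reproduced.

```python
import numpy as np

def keca_transform(X, n_components=2, gamma=1.0):
    """Minimal Kernel Entropy Component Analysis sketch (RBF kernel).
    Components are ranked by their entropy contribution
    lam_i * (sum of eigenvector entries)^2, not by eigenvalue size."""
    # Pairwise squared distances and Gaussian kernel matrix.
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-gamma * sq)
    lam, E = np.linalg.eigh(K)                 # ascending eigenvalues
    contrib = lam * (E.sum(axis=0) ** 2)       # entropy contributions
    order = np.argsort(contrib)[::-1][:n_components]
    # Project onto the entropy-preserving axes (kernel-PCA-style scaling).
    return E[:, order] * np.sqrt(np.clip(lam[order], 0.0, None))
```

The entropy ranking can pick different axes than KPCA when a small-eigenvalue direction carries most of the data's mass, which is the distinction the abstract draws.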
Abstract: The age-dependent variation in wood properties is a very complex phenomenon because its pattern depends strongly on tree species and wood traits. In this study, we evaluated the variation of multiple traits inclusively, based on the distribution of eigenvalues calculated from the near-infrared spectral matrix at each cambial age. The experiments were conducted on four tree species with characteristic xylem structures, aiming to clarify the intrinsic behaviour of tree aging independent of tree species. The eigenvalues diffused with age in every species, as in Dyson's Brownian motion. The gradual increase in the first eigenvalue, which is equivalent to the Helmholtz free energy, indicates that trees form more ordered wood with age. As all the variations induced by various wood properties during tree growth are aggregated into the set of eigenvalues, the Fokker-Planck equation describing the evolution of the eigenvalue distribution may provide a conclusive answer for the demarcation between juvenile and mature wood. The age dependency of the Shannon entropy and the density matrix, calculated from the probabilities associated with each energy eigenstate, offered a perspective on randomness; namely, tree aging, viewed through the variation of wood properties, is clearly an irreversible process. This result offers an important clue for sustainable forest management and the use of wood resources. The proposed method does not depend on a specific coordinate system and thus should work well with data other than near-infrared spectra.
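The Shannon entropy of an eigenvalue distribution, as used above to quantify randomness, can be computed from a spectral matrix in a few lines. Treating the normalized eigenvalues of the covariance matrix as probabilities is one standard reading of the abstract's quantity; the authors' exact normalization may differ.

```python
import numpy as np

def spectral_entropy(X):
    """Shannon entropy of the normalized eigenvalue distribution of the
    covariance of a spectral matrix X (samples x wavelengths) -- one
    way to read the 'randomness' quantity discussed above."""
    cov = np.cov(X, rowvar=False)
    lam = np.linalg.eigvalsh(cov)
    lam = np.clip(lam, 0.0, None)      # drop tiny negative round-off
    p = lam / lam.sum()                # eigenvalues as probabilities
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())
```

A matrix whose variation is concentrated in one direction (rank one) gives entropy near zero, while spectra with variation spread across many directions give higher entropy; tracking this value over cambial age mirrors the analysis in the abstract.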