Classification for the varieties of single plant based on important chemical compositions by feature selections
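Sha Yunfei (沙云菲) 1, Wang Liang (王亮) 1, Liu Tai'ang (刘太昂) 2, Yu Jie (于洁) 1, Ge Jiong (葛炯) 1, Li Minjie (李敏杰) 3, Lu Wencong (陆文聪) 3, Sun Xiang (孙翔) 1
Author information
- 1. Technology Center, Shanghai Tobacco Group Co., Ltd., Shanghai 200082
- 2. Shanghai Fanyang Information Technology Co., Ltd. (上海帆阳信息科技有限公司), Shanghai 200444
- 3. Department of Chemistry, College of Sciences, Shanghai University, Shanghai 200444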
Abstract
Varieties of the same plant species can differ markedly from one another, and these differences may lead to very different applications. To explore a method for classifying the varieties of a single plant, four groups of tobacco of different varieties were taken as an example. Using a dataset of the chemical composition of the tobacco samples, four feature selection methods were compared by the performance of the support vector machine (SVM) models built on the features they selected: max-relevance-min-redundancy (mRMR), sequential backward floating selection (SBFS), genetic algorithm (GA), and random forest (RF). SBFS gave the highest classification accuracy. Five features lie in the intersection of the feature subsets selected by the four methods: proline, potassium content, rutin, citric acid, and pH. These five features have considerable potential for classifying tobacco varieties. The same set of methods may also be applicable to studies of other plants.
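The intersection step mentioned in the abstract can be illustrated with a minimal sketch. It assumes each of the four methods returns a set of feature names; the five shared names are taken from the abstract, while `other_1` to `other_4` are hypothetical placeholders and not results from the paper.

```python
from collections import Counter

# The five shared names come from the abstract. The "other_*" entries are
# hypothetical placeholders for features picked by only some methods.
shared = {"proline", "potassium content", "rutin", "citric acid", "pH"}
selected = {
    "mRMR": shared | {"other_1", "other_2"},
    "SBFS": shared | {"other_2", "other_3"},
    "GA": shared | {"other_1", "other_4"},
    "RF": shared | {"other_3"},
}

# Features chosen by every method: the intersection of all four subsets.
consensus = set.intersection(*selected.values())

# Number of methods that picked each feature, for a softer majority view.
votes = Counter(name for subset in selected.values() for name in subset)

print(sorted(consensus))
print({name: count for name, count in votes.items() if count < len(selected)})
```

Taking the strict intersection keeps only features that every method agrees on. The vote counts show which remaining features were picked by some of the methods but not all.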
Key words
Feature selection / Chemical composition / Genetic algorithm / Max-relevance-min-redundancy
Funding
China National Tobacco Corporation Science and Technology Major Project (Zhong Yan Ban [2016] 259)
National Key Research and Development Program of China (2016YFB0700504)
Publication year
2019