Comparison and Application of Group Variable Selection Methods for High-Dimensional Small-Sample Data
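Li Dongsheng (李东升) 1, Qiu Yuting (邱宇婷) 2
Author information
- 1. School of Mathematics and Statistics, Qiannan Normal University for Nationalities, Duyun, Guizhou 558000
- 2. Xiangcai School Affiliated to Hunan Normal University, Duyun, Guizhou 558000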
Abstract
Variable selection for high-dimensional, small-sample grouped data is one of the major problems in statistics. With the rapid development of genomic informatics, high-dimensional small-sample data have become ubiquitous, which poses considerable challenges for statistical modeling. Some such data sets have a group structure; applying a variable selection method that treats variables individually ignores the grouping information and may substantially degrade selection performance. This paper therefore introduces several variable selection methods for high-dimensional and grouped data, and evaluates them through numerical simulation and empirical analysis. The results show that, for high-dimensional small-sample grouped data, grLasso gives better variable selection and goodness of fit when the number of variables is below 50, whereas grMCP, grSubset+grLasso, and grSubset give better variable selection and goodness of fit when the number of variables is above 50.
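To make concrete what "using the grouping information" means, here is a minimal sketch of a group Lasso fit by proximal gradient descent. It is an illustration only and not the authors' implementation: the function names, the simulated data, the value of `lam`, and the sqrt(group size) penalty weights are all assumptions chosen for the example. The key step is the group-wise soft threshold, which sets a whole group of coefficients to zero at once rather than thresholding each coefficient on its own.

```python
import numpy as np

def group_soft_threshold(v, t):
    """Shrink the vector v toward zero as a block; return zeros if ||v||_2 <= t."""
    norm = np.linalg.norm(v)
    if norm <= t:
        return np.zeros_like(v)
    return (1.0 - t / norm) * v

def objective(X, y, beta, groups, lam):
    """(1/2n)||y - X beta||^2 + lam * sum_g sqrt(p_g) * ||beta_g||_2."""
    n = X.shape[0]
    loss = np.sum((y - X @ beta) ** 2) / (2.0 * n)
    penalty = sum(np.sqrt(np.sum(groups == g)) * np.linalg.norm(beta[groups == g])
                  for g in np.unique(groups))
    return loss + lam * penalty

def group_lasso(X, y, groups, lam, n_iter=500):
    """Hypothetical helper: proximal gradient descent for the objective above."""
    n, p = X.shape
    beta = np.zeros(p)
    # Step size 1/L, where L = sigma_max(X)^2 / n is the gradient's Lipschitz constant.
    step = n / np.linalg.norm(X, 2) ** 2
    for _ in range(n_iter):
        z = beta - step * (X.T @ (X @ beta - y) / n)
        for g in np.unique(groups):
            idx = groups == g
            # Assumed weighting: threshold scaled by sqrt(group size).
            beta[idx] = group_soft_threshold(z[idx], step * lam * np.sqrt(idx.sum()))
    return beta

# Simulated small-sample example (assumed setup): n = 40, p = 60, 20 groups of 3.
rng = np.random.default_rng(0)
n, p, group_size = 40, 60, 3
groups = np.repeat(np.arange(p // group_size), group_size)
X = rng.standard_normal((n, p))
true_beta = np.zeros(p)
true_beta[:6] = 2.0  # only the first two groups carry signal
y = X @ true_beta + 0.5 * rng.standard_normal(n)

lam = 0.3
beta = group_lasso(X, y, groups, lam)
selected_groups = [int(g) for g in np.unique(groups) if np.any(beta[groups == g] != 0)]
```

After the fit, `selected_groups` lists the groups with nonzero coefficients; each group is either kept or dropped as a whole, which is the behavior a coefficient-by-coefficient penalty does not guarantee.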
Keywords
high-dimensional small sample / group structure / variable selection
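Funding
Youth Talent Growth Project of the Guizhou Provincial Department of Education (黔教技[2022]380号)
Qiannan Prefecture Philosophy and Social Sciences Theoretical Innovation Project (Qnsk-2022-021)
Publication year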
2024