Nearest neighbor imputation for categorical data by weighting of attributes


Abstract: Missing values are common in modern medical research on complex diseases. The data often contain nominal or categorical variables, for example single nucleotide polymorphisms (SNPs) in genetic studies. If missing values are not handled properly, the downstream statistical analysis of the incomplete data may be biased. While various imputation methods are available for metrically scaled variables, methods for categorical data are scarce. Nearest neighbor imputation has been shown to work well for high-dimensional metrically scaled variables. In this paper, we propose a weighted nearest neighbors approach to impute missing values in categorical variables in high-dimensional datasets. The proposed method explicitly uses information on the association among attributes. Its performance is compared with that of available imputation methods under different simulation settings. A variety of real data sets, covering heart, DNA, and lymphatic cancer data, is also used to support the simulation results. The results show that weighting the attributes yields smaller imputation errors than existing approaches such as random forest and MICE. (C) 2022 Elsevier Inc. All rights reserved.

Keywords:

Data preprocessing; Biomedical data; Missing values; Categorical data imputation; High-dimensional data; Missing value imputation; Multiple imputation

Authors:

Faisal, Shahla; Tutz, Gerhard

Affiliations:

Govt Coll Univ Faisalabad

Ludwig Maximilians Univ Munchen

Publication year:

2022

DOI:

10.1016/j.ins.2022.01.056

Information Sciences

EI, SCI

ISSN: 0020-0255

Year, volume (issue): 2022, 592

Citations: 8
References: 44