Nearest neighbor imputation for categorical data by weighting of attributes

Missing values are common in modern medical research on complex diseases. The data often contain nominal or categorical variables, for example, single nucleotide polymorphisms (SNPs) in genetic studies. If the missing values are not handled properly, the downstream statistical analysis of the incomplete data may be biased. While various imputation methods are available for metrically scaled variables, methods for categorical data are scarce. Nearest neighbor imputation has been shown to work well for high-dimensional metrically scaled variables. In this paper, we propose a weighted nearest neighbors approach to impute missing values in categorical variables in high-dimensional datasets. The proposed method explicitly uses information on the association among attributes. Its performance is compared with available imputation methods under different simulation settings. A variety of real data sets, covering heart, DNA, and lymphatic cancer data, is also used to support the simulation results. The results show that weighting the attributes yields smaller imputation errors than existing approaches such as random forest and MICE. © 2022 Elsevier Inc. All rights reserved.
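The abstract does not give the paper's weighting scheme or distance function, so the following is only a minimal sketch of the general idea and not the authors' method. It assumes Cramér's V as the association measure between attributes, a weighted Hamming distance over the attributes both records have observed, and a majority vote among the k nearest donors. The names `cramers_v` and `impute` are hypothetical.

```python
from collections import Counter
from math import sqrt


def cramers_v(xs, ys):
    """Cramér's V between two categorical sequences (assumed association measure).

    Only positions where both values are observed (not None) are used.
    """
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    row_counts = Counter(x for x, _ in pairs)
    col_counts = Counter(y for _, y in pairs)
    k = min(len(row_counts), len(col_counts))
    if n == 0 or k < 2:
        return 0.0  # no information about association
    joint = Counter(pairs)
    chi2 = 0.0
    for x, nx in row_counts.items():
        for y, ny in col_counts.items():
            expected = nx * ny / n
            chi2 += (joint.get((x, y), 0) - expected) ** 2 / expected
    return sqrt(chi2 / (n * (k - 1)))


def impute(rows, k=3):
    """Return a copy of rows with None entries filled by weighted k-NN voting."""
    p = len(rows[0])
    cols = [[r[j] for r in rows] for j in range(p)]
    out = [list(r) for r in rows]
    for j in range(p):
        # Weight of attribute l when imputing attribute j: its association with j.
        weights = [0.0 if l == j else cramers_v(cols[j], cols[l]) for l in range(p)]
        donors = [r for r in rows if r[j] is not None]
        if not donors:
            continue  # attribute j is missing everywhere; leave it as is
        for i, row in enumerate(rows):
            if row[j] is not None:
                continue
            scored = []
            for d in donors:
                shared = [l for l in range(p)
                          if l != j and row[l] is not None and d[l] is not None]
                total = sum(weights[l] for l in shared)
                if total == 0:
                    continue  # donor shares no informative attribute
                # Weighted Hamming distance, normalised by the shared weight mass.
                dist = sum(weights[l] for l in shared if row[l] != d[l]) / total
                scored.append((dist, d[j]))
            if scored:
                scored.sort(key=lambda t: t[0])
                votes = Counter(v for _, v in scored[:k])
            else:
                votes = Counter(d[j] for d in donors)  # fall back to the mode
            out[i][j] = votes.most_common(1)[0][0]
    return out
```

In this sketch, an attribute that is strongly associated with the attribute being imputed dominates the distance, while a weakly associated attribute contributes little. This is the intuition behind weighting the attributes, under the assumptions stated above.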

Data preprocessing; Biomedical data; Missing values; Categorical data imputation; High-dimensional data; Missing value imputation; Multiple imputation

Faisal, Shahla; Tutz, Gerhard

Govt Coll Univ Faisalabad

Ludwig Maximilians Univ Munchen

2022

Information Sciences

EI, SCI
ISSN:0020-0255
Year, volume (issue): 2022, 592