Construction and Comparison of Machine Learning-Based Risk Prediction Models for Coronary Heart Disease

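Background Coronary atherosclerotic heart disease (hereafter coronary heart disease, CHD) is one of the leading causes of death worldwide. Research on CHD risk assessment has been growing year by year. However, these studies often overlook data imbalance, and addressing it is crucial for improving how accurately classification algorithms identify CHD risk. Objective To explore the factors influencing CHD, to build CHD risk prediction models with five algorithms after applying two data-balancing methods, and to compare the predictive value of the five models. Methods From the cross-sectional survey data of the 2021 US Behavioral Risk Factor Surveillance System (BRFSS), 24 variables covering health-related risk behaviors, chronic health conditions, and related information were extracted for 112606 participants. The outcome was self-reported CHD, by which participants were divided into a CHD group and a non-CHD group. Univariate analysis and stepwise Logistic regression were used to explore the factors influencing CHD and to select the variables entered into the prediction models. A random 10% of the 112606 respondents (11261 in total) was drawn and randomly split into training and test sets at a ratio of 8:2. Two oversampling methods, random oversampling and the synthetic minority over-sampling technique (SMOTE), were used to handle the imbalanced data, and CHD prediction models were built with the k-nearest neighbors (KNN), Logistic regression, support vector machine (SVM), decision tree, and XGBoost algorithms. Results The two groups differed significantly (P<0.05) in age, sex, BMI, race, marital status, education level, income level, number of children in the household, having been told they had hypertension, having been told they had prehypertension, having been told they had pregnancy-induced hypertension, current use of antihypertensive medication, having been told they had hyperlipidemia, having been told they had diabetes, smoking status, having had at least one alcoholic drink in the past 30 days, heavy drinking, binge drinking, physical exercise in the past 30 days, mental health status, and self-rated health. Stepwise Logistic regression showed that age, sex, BMI, race, education level, income level, having been told they had hypertension, having been told they had prehypertension, having been told they had pregnancy-induced hypertension, current use of antihypertensive medication, having been told they had hyperlipidemia, having been told they had diabetes, smoking status, having had at least one alcoholic drink in the past 30 days, heavy drinking, binge drinking, and self-rated health were factors influencing CHD (P<0.05). With SMOTE used to handle the imbalanced data, KNN, Logistic regression, SVM, decision tree, and XGBoost had overall classification accuracies of 59.2%, 67.4%, 66.2%, 69.2%, and 85.9%; recall of 75.2%, 71.4%, 70.5%, 62.9%, and 34.8%; precision of 15.4%, 18.2%, 17.5%, 17.6%, and 28.7%; F-values of 0.256, 0.290, 0.280, 0.275, and 0.315; and areas under the receiver operating characteristic curve (AUC) of 0.80, 0.78, 0.72, 0.72, and 0.82, respectively. With random oversampling, the overall classification accuracies were 62.5%, 68.5%, 69.0%, 60.2%, and 70.1%; recall was 70.0%, 69.5%, 71.9%, 69.0%, and 67.6%; precision was 15.8%, 18.4%, 19.1%, 14.8%, and 19.0%; F-values were 0.258, 0.291, 0.302, 0.244, and 0.297; and AUC values were 0.80, 0.77, 0.72, 0.72, and 0.83, respectively. Conclusion This study not only confirmed known factors influencing CHD but also found that self-rated health, income level, and education level potentially influence CHD. After the two data-balancing methods were applied, the performance of the five algorithms improved markedly. XGBoost performed best and can serve as a reference for future optimization of CHD prediction models. In addition, given the strong performance of XGBoost and the ease of use and interpretability of stepwise Logistic regression, combining XGBoost and stepwise Logistic regression after data balancing is recommended for CHD risk prediction models.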
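The two balancing methods named in the Methods can be illustrated with a minimal, standard-library-only sketch. It is not the authors' code: the toy data, the function names, the 1:1 target class ratio, the choice of k = 5 neighbours, and the decision to oversample only the training split are all assumptions made for illustration, since the abstract does not specify them.

```python
import math
import random


def train_test_split(X, y, test_fraction=0.2, seed=0):
    """Randomly split rows into training and test sets (8:2 by default)."""
    rng = random.Random(seed)
    idx = list(range(len(y)))
    rng.shuffle(idx)
    cut = int(round(len(idx) * (1 - test_fraction)))
    train, test = idx[:cut], idx[cut:]
    return ([X[i] for i in train], [y[i] for i in train],
            [X[i] for i in test], [y[i] for i in test])


def random_oversample(X, y, minority=1, seed=0):
    """Duplicate randomly chosen minority rows until the classes are equal in size."""
    rng = random.Random(seed)
    minority_rows = [x for x, label in zip(X, y) if label == minority]
    deficit = (len(y) - len(minority_rows)) - len(minority_rows)
    extra = [list(rng.choice(minority_rows)) for _ in range(max(deficit, 0))]
    return list(X) + extra, list(y) + [minority] * len(extra)


def smote(X, y, minority=1, k=5, seed=0):
    """Create synthetic minority rows by interpolating between a minority row
    and one of its k nearest minority neighbours (needs >= 2 minority rows)."""
    rng = random.Random(seed)
    minority_rows = [x for x, label in zip(X, y) if label == minority]
    deficit = (len(y) - len(minority_rows)) - len(minority_rows)
    synthetic = []
    for _ in range(max(deficit, 0)):
        base = rng.choice(minority_rows)
        # Nearest minority neighbours of `base` by Euclidean distance, excluding itself.
        neighbours = sorted(
            (row for row in minority_rows if row is not base),
            key=lambda row: math.dist(base, row),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return list(X) + synthetic, list(y) + [minority] * len(synthetic)


# Hypothetical toy data: 100 rows, 2 numeric features, 10% positive class.
data_rng = random.Random(42)
X = [[data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(100)]
y = [1 if i < 10 else 0 for i in range(100)]

X_train, y_train, X_test, y_test = train_test_split(X, y)
X_ros, y_ros = random_oversample(X_train, y_train)
X_smote, y_smote = smote(X_train, y_train)

print("train class counts:", y_train.count(0), y_train.count(1))
print("after random oversampling:", y_ros.count(0), y_ros.count(1))
print("after SMOTE:", y_smote.count(0), y_smote.count(1))
```

Random oversampling only repeats existing minority rows, whereas SMOTE adds new points lying between existing minority rows. Leaving the test split untouched, as assumed here, keeps the evaluation data at the original class ratio.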
Coronary Heart Disease Risk Prediction Model Based on Machine Learning
Background Coronary atherosclerotic heart disease (CHD) is one of the leading causes of mortality worldwide, and research on risk assessment for CHD has been growing annually. However, the issue of data imbalance in these studies is often overlooked, despite its crucial role in enhancing the accuracy of CHD risk identification within classification algorithms. Objective To investigate the factors influencing CHD and to establish predictive models for CHD risk using two data balancing methods based on five algorithms, comparing the predictive value of these models for CHD risk. Methods Utilizing cross-sectional survey data from the 2021 Behavioral Risk Factor Surveillance System (BRFSS) in the United States, a cohort of 112606 participants was identified, featuring 24 variables related to risk behaviors and health status, with self-reported coronary heart disease (CHD) as the outcome measure. Factors influencing the incidence of CHD were explored through univariate analysis and stepwise logistic regression to select pertinent variables for inclusion in the predictive model. A random sample comprising 10% of the participants (11261 individuals) was drawn and then randomly divided into training and testing datasets at an 8:2 ratio. To address data imbalance, two over-sampling techniques were employed: random oversampling and the Synthetic Minority Over-sampling Technique (SMOTE). Based on these methods, CHD predictive models were constructed using five different algorithms: K-Nearest Neighbors (KNN), Logistic Regression, Support Vector Machine (SVM), Decision Tree, and XGBoost. Results Univariate analysis revealed significant differences (P<0.05) between the CHD and non-CHD groups across all input variables except for rental housing and being informed of prediabetic status. Stepwise Logistic regression identified age, gender, BMI, ethnicity, education level, income level, being informed of hypertension, being informed of prehypertension, being informed of pregnancy-induced hypertension, current use of antihypertensive medication, being informed of hyperlipidemia, being informed of diabetes, smoking status, alcohol consumption within the last 30 days, heavy drinking status, and self-assessed health as factors influencing CHD. The performance of risk models using SMOTE showed overall classification accuracies of 59.2%, 67.4%, 66.2%, 69.2%, and 85.9%; recall rates of 75.2%, 71.4%, 70.5%, 62.9%, and 34.8%; precision of 15.4%, 18.2%, 17.5%, 17.6%, and 28.7%; F-values of 0.256, 0.290, 0.280, 0.275, and 0.315; and AUC values of 0.80, 0.78, 0.72, 0.72, and 0.82, respectively. Using random oversampling, the models achieved classification accuracies of 62.5%, 68.5%, 69.0%, 60.2%, and 70.1%; recall rates of 70.0%, 69.5%, 71.9%, 69.0%, and 67.6%; precision of 15.8%, 18.4%, 19.1%, 14.8%, and 19.0%; F-values of 0.258, 0.291, 0.302, 0.244, and 0.297; and AUC values of 0.80, 0.77, 0.72, 0.72, and 0.83, respectively. Conclusion This study not only confirmed known factors affecting CHD but also identified potential impacts of self-assessed health level, income level, and education level on CHD. The performance of the five algorithms was significantly enhanced after employing two data balancing methods. Among them, the XGBoost model exhibited superior performance and can be referenced for future optimization of CHD prediction models. Additionally, considering the excellent performance of the XGBoost model and the convenience and interpretability of stepwise logistic regression, a combined use of these approaches after data balancing is recommended in CHD risk prediction models.
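The reported F-values can be cross-checked against the reported precision and recall. The sketch below assumes the F-value is the F1 score, the harmonic mean of precision and recall; the abstract does not define it, so this is an assumption. The numbers are the SMOTE results copied from the abstract, and the variable names are illustrative.

```python
def f_score(precision, recall):
    """F1 score: harmonic mean of precision and recall (assumed definition)."""
    return 2 * precision * recall / (precision + recall)


# SMOTE results as reported in the abstract, in the order the models are listed.
models = ["KNN", "Logistic regression", "SVM", "Decision tree", "XGBoost"]
precision = [0.154, 0.182, 0.175, 0.176, 0.287]
recall = [0.752, 0.714, 0.705, 0.629, 0.348]
reported_f = [0.256, 0.290, 0.280, 0.275, 0.315]

computed_f = [round(f_score(p, r), 3) for p, r in zip(precision, recall)]

for name, computed, reported in zip(models, computed_f, reported_f):
    print(f"{name}: computed {computed:.3f}, reported {reported:.3f}")
```

Under this assumption the computed values agree with the reported ones to three decimals. The same check also makes the trade-off visible: with SMOTE, XGBoost has the highest accuracy (85.9%) and precision (28.7%) but the lowest recall (34.8%) of the five models.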

Coronary disease; Machine learning; Risk prediction model; Logistic regression; K-nearest neighbor; Support vector machine; Decision tree; XGBoost

岳海涛、何婵婵、成羽攸、张森诚、吴悠、马晶

Institute for Hospital Management, Tsinghua University, Shenzhen, Guangdong 518055, China

Institute for Hospital Management, Tsinghua University; School of Medicine, Tsinghua University, Beijing 100084, China

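Coronary heart disease; Machine learning; Risk prediction model; Logistic regression; k-nearest neighbors; Support vector machine; Decision tree; XGBoost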

2025

Chinese General Practice
Chinese Hospital Association

Peking University Core Journal (北大核心)
Impact factor: 2.04
ISSN: 1007-9572
Year, volume (issue): 2025, 28(4)