Abstract
Objective: To propose a differentially private fuzzy C-means clustering algorithm based on a Gaussian kernel function (DPFCM_GF) that addresses the data privacy and security problems raised by the analysis and mining of medical data in the big-data setting, and to provide a theoretical basis for data privacy protection. Methods: Because random initialization of the fuzzy C-means membership matrix reduces the accuracy of the algorithm, the maximum distance method was used to determine the initial center points, the Gaussian values of the cluster center points were used to calculate the privacy budget allocation ratio, and Laplace noise was added to achieve differential privacy protection, yielding DPFCM_GF. The effectiveness of DPFCM_GF was verified on the public heart disease, breast cancer, thyroid disease and diabetes datasets from the machine learning repository of the University of California, Irvine. Its usability was verified on a dataset of 756 gastric cancer and lung cancer patients admitted to the Second People's Hospital of Huai'an between January 1, 2019 and December 31, 2022. The results were compared with those of the fuzzy C-means clustering algorithm (FCM) and the differentially private fuzzy C-means clustering algorithm (DPFCM). Results: On the public heart disease, breast cancer, thyroid disease and diabetes datasets, the best clustering performance of DPFCM_GF and DPFCM was comparable to that of FCM; compared with DPFCM, DPFCM_GF required less iteration time and converged significantly faster, and the differences were statistically significant (t=4.01, 4.71, 4.01, 12.38, P<0.05). On the lung cancer and gastric cancer datasets, as the privacy budget ε increased, the correct recognition rate of DPFCM_GF gradually converged to 91.9% and 93.9%, with areas under the receiver operating characteristic (ROC) curve (AUC) of 0.79 and 0.81, respectively. When the privacy budget ε was 0.1, 0.5, 1 and 2 (ε<3), DPFCM_GF clustered significantly better than DPFCM, and the differences were statistically significant (χ²=12.25, 87.12, 68.58, 7.76, P<0.05; χ²=4.74, 43.51, 42.47, 4.89, P<0.05). Conclusion: DPFCM_GF effectively protects the privacy of medical data while still supporting data analysis and mining tasks, and it has research value and prospects.
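The pipeline named in the Methods (maximum-distance initialization, fuzzy C-means updates, Laplace noise on the cluster centers) can be illustrated with a minimal Python sketch. It is not the authors' implementation. Several points are assumptions made only for illustration: features are taken to be scaled to [0, 1]; the "maximum distance method" is read as farthest-point seeding; the privacy budget is split uniformly across iterations in place of the Gaussian-value allocation ratio, whose formula the abstract does not give; and the sensitivity bound of d + 1 is assumed, not derived from the paper. The function names `max_distance_init`, `memberships` and `dp_fcm_sketch` are hypothetical. The initialization step is data-dependent and is not privatized here, so the sketch makes no formal end-to-end privacy claim.

```python
import numpy as np

def max_distance_init(X, c):
    """Farthest-point seeding (one assumed reading of the 'maximum distance method')."""
    # Assumption: start from the point farthest from the data mean.
    first = int(np.argmax(np.linalg.norm(X - X.mean(axis=0), axis=1)))
    chosen = [first]
    # d[i] = distance from point i to its nearest already-chosen center.
    d = np.linalg.norm(X - X[first], axis=1)
    for _ in range(c - 1):
        nxt = int(np.argmax(d))
        chosen.append(nxt)
        d = np.minimum(d, np.linalg.norm(X - X[nxt], axis=1))
    return X[chosen].copy()

def memberships(X, centers, m=2.0):
    """Standard FCM membership update: u_ij proportional to d_ij^(-2/(m-1))."""
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    d = np.maximum(d, 1e-12)  # avoid division by zero when a point sits on a center
    w = d ** (-2.0 / (m - 1.0))
    return w / w.sum(axis=1, keepdims=True)  # each row sums to 1

def dp_fcm_sketch(X, c, eps_total, n_iter=10, m=2.0, seed=0):
    """FCM with Laplace noise on the center update (illustrative only)."""
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    centers = max_distance_init(X, c)
    # Assumption: uniform budget split, NOT the paper's Gaussian-value allocation.
    eps_iter = eps_total / n_iter
    # Assumption: with features in [0, 1], one record shifts the weighted sums
    # by at most n_features and the weight totals by at most 1 (L1 norm).
    scale = (n_features + 1.0) / eps_iter
    for _ in range(n_iter):
        Um = memberships(X, centers, m) ** m
        num = Um.T @ X + rng.laplace(0.0, scale, size=(c, n_features))
        den = Um.sum(axis=0) + rng.laplace(0.0, scale, size=c)
        centers = num / np.maximum(den, 1e-6)[:, None]
        centers = np.clip(centers, 0.0, 1.0)  # keep centers inside the data range
    return centers, memberships(X, centers, m)
```

Calling `dp_fcm_sketch(X, c=2, eps_total=1.0)` on a scaled feature matrix `X` returns the noisy centers and the membership matrix. A smaller `eps_total` gives a larger noise scale, which is the trade-off between privacy and clustering quality that the Results describe as ε varies.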