Non-normal Data Synthesis Based on Mixed Data Type Correlation Measurement

Data plays an extremely important role in research and development in fields such as machine learning and artificial intelligence. However, factors such as privacy concerns, data scarcity, and poor data quality often prevent data consumers from obtaining real datasets that meet their requirements. To address this, this study proposes KMSI, a non-normal data synthesis algorithm that improves on the sampling-iteration (SI) technique. KMSI uses a mixed-type correlation coefficient matrix to reduce the measurement error in SI steps such as target setting and loop control, and it replaces Bootstrap sampling with kernel density estimation sampling to avoid using real data. Experimental results show that, compared with the SI technique, KMSI can handle datasets with complex distributions and mixed data types, and its synthetic output contains no real data. Compared with other improved methods, KMSI gives users more freedom to customize the sample size of the synthetic dataset.
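The difference between Bootstrap sampling and kernel density estimation (KDE) sampling mentioned in the abstract can be illustrated with a minimal one-dimensional sketch. It is not the authors' KMSI implementation: the Gaussian kernel, the Silverman rule-of-thumb bandwidth, the function names, and the toy exponential column are all assumptions made for illustration. It only shows why Bootstrap output consists of real records while KDE output generally does not, and that the KDE sample size can be chosen freely.

```python
import random
import statistics


def silverman_bandwidth(values):
    """Rule-of-thumb bandwidth for a 1-D Gaussian kernel (assumed choice)."""
    n = len(values)
    sd = statistics.stdev(values)
    q = statistics.quantiles(values, n=4)  # three quartile cut points
    iqr = q[2] - q[0]
    spread = min(sd, iqr / 1.34) if iqr > 0 else sd
    return 0.9 * spread * n ** (-0.2)


def bootstrap_sample(values, size, rng):
    """Resample with replacement: every output value is a real observation."""
    return [rng.choice(values) for _ in range(size)]


def kde_sample(values, size, rng, bandwidth=None):
    """Draw `size` values from a Gaussian KDE fitted to `values`.

    Sampling from a Gaussian KDE is equivalent to picking one observation
    uniformly at random and adding N(0, h^2) noise to it.
    """
    h = silverman_bandwidth(values) if bandwidth is None else bandwidth
    return [rng.choice(values) + rng.gauss(0.0, h) for _ in range(size)]


rng = random.Random(0)
# Hypothetical skewed (non-normal) "real" column, for illustration only.
real = [rng.expovariate(1.0) for _ in range(200)]

# The synthetic sample size (500) is independent of the real sample size (200).
boot = bootstrap_sample(real, 500, rng)
synth = kde_sample(real, 500, rng)

real_set = set(real)
print("bootstrap values that are real records:", sum(x in real_set for x in boot))
print("KDE values that are real records:", sum(x in real_set for x in synth))
```

A Gaussian kernel can place mass outside the support of the real data (for example, negative values for a strictly positive column), so a practical use of this sketch would need a boundary correction or a transformed scale.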

Keywords: synthetic dataset; privacy protection; correlation coefficient; kernel density estimation

Wang Chundong (王春东), Zhang Shipeng (张世鹏)

School of Computer Science and Engineering, Tianjin University of Technology, Tianjin 300384, China

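Funding: Joint Funds of the National Natural Science Foundation of China (U1536122); Major Project of the Tianjin Science and Technology Commission (15ZXDSGX00030)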

2024

Computer Systems & Applications (计算机系统应用)
Institute of Software, Chinese Academy of Sciences

CSTPCD
Impact factor: 0.449
ISSN: 1003-3254
Year, volume (issue): 2024, 33(3)