Non-normal Data Synthesis Based on Mixed Data Type Correlation Measurement (基于混合数据类型相关性度量的非正态数据合成)

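Chinese abstract: Data plays a central role in research and development in fields such as machine learning and artificial intelligence. In practice, however, factors such as privacy concerns, data scarcity, and poor data quality keep those who need data from obtaining real datasets that meet their requirements. To address this, a non-normal data synthesis algorithm (KMSI) is developed as an improvement on the SI (sampling-iteration) technique. The algorithm uses a mixed-type correlation coefficient matrix to reduce the measurement error of the SI technique in steps such as target setting and loop control, and it replaces Bootstrap sampling with kernel density estimation sampling to avoid using real data. Experimental results show that, compared with the SI technique, KMSI can handle datasets with complex distributions and mixed types, and its synthetic output contains no real data; compared with other improved methods, KMSI gives users more freedom in choosing the sample size of the synthetic dataset.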

English title: Non-normal Data Synthesis Based on Mixed Data Type Correlation Measurement

English abstract: Data plays an extremely important role in research and development in fields such as machine learning and artificial intelligence. However, real-world factors such as privacy issues, data scarcity, and poor data quality prevent data consumers from obtaining real datasets that meet their work requirements. In response, this study develops a non-normal data synthesis algorithm (KMSI) as an improvement to the sampling-iteration (SI) technique. The algorithm uses a mixed-type correlation coefficient matrix to reduce measurement errors in steps of the SI technique such as target setting and loop control, and it replaces Bootstrap sampling with kernel density estimation sampling to avoid using real data. Experimental results show that, compared with the SI technique, KMSI can handle datasets with complex distributions and mixed types and does not include real data in the synthetic results. Furthermore, compared with other improved methods, KMSI gives users more freedom to customize the sample size of the synthetic dataset.
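
The abstract's point about replacing Bootstrap sampling with kernel density estimation sampling can be illustrated with a minimal sketch. This is not the authors' KMSI implementation: it assumes a single continuous variable, a Gaussian kernel, and a normal-reference rule-of-thumb bandwidth, and the function names (`rule_of_thumb_bandwidth`, `bootstrap_sample`, `kde_sample`) are hypothetical.

```python
import random
import statistics


def rule_of_thumb_bandwidth(data):
    """Normal-reference bandwidth for a 1-D Gaussian kernel (an assumed choice)."""
    n = len(data)
    return 1.06 * statistics.stdev(data) * n ** (-1 / 5)


def bootstrap_sample(data, size, rng):
    """Bootstrap: resample the observed values with replacement."""
    return [rng.choice(data) for _ in range(size)]


def kde_sample(data, size, rng, bandwidth=None):
    """Sample from a Gaussian KDE: pick a kernel center, then add kernel noise."""
    h = rule_of_thumb_bandwidth(data) if bandwidth is None else bandwidth
    return [rng.gauss(rng.choice(data), h) for _ in range(size)]


rng = random.Random(0)
# Toy stand-in for a real, skewed (non-normal) variable.
real = [rng.expovariate(1.0) for _ in range(200)]

boot = bootstrap_sample(real, 500, rng)   # every value is a copy of a real record
synth = kde_sample(real, 500, rng)        # values are perturbed, not copied

print(len(set(boot) & set(real)) > 0)     # bootstrap output reuses real values
print(len(set(synth) & set(real)))        # overlap between KDE output and real values
```

In this sketch the bootstrap output consists entirely of real observations, whereas the KDE draws are new values scattered around them, and the requested size (500 here) is independent of the 200 real records.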

English keywords:

synthetic dataset; privacy protection; correlation coefficient; kernel density estimation

Authors:

Wang Chundong (王春东), Zhang Shipeng (张世鹏)

Affiliation:

School of Computer Science and Engineering, Tianjin University of Technology, Tianjin 300384, China

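Chinese keywords:

合成数据集 (synthetic dataset); 隐私保护 (privacy protection); 相关系数 (correlation coefficient); 核密度估计 (kernel density estimation)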

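Funding:

Joint Funds of the National Natural Science Foundation of China; Major Special Project of the Tianjin Science and Technology Commission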

Project numbers:

U1536122; 15ZXDSGX00030

Publication year:

2024

DOI:

10.15888/j.cnki.csa.009441

Computer Systems & Applications (计算机系统应用)

Institute of Software, Chinese Academy of Sciences

CSTPCD

Impact factor: 0.449

ISSN: 1003-3254

Year, volume (issue): 2024, 33(3)

Number of references: 27