A visual analysis approach for data imputation via multi-party tabular data correlation strategies
朱海洋 1, 韩东明 2, 潘嘉铖 2, 魏雅婷 3, 封颖超杰 2, 翁罗轩 2, 毛科添 2, 邢远凯 4, 闾建树 4, 万邱成 4, 陈为 2
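Author information
- 1. State Key Lab of CAD&CG, Zhejiang University, Hangzhou 310058, China; Wuchan Zhongda Digital Technology Co., Ltd. (物产中大数字科技有限公司), Hangzhou 310020, China
- 2. State Key Lab of CAD&CG, Zhejiang University, Hangzhou 310058, China
- 3. Wuchan Zhongda Metal Group Co., Ltd. (物产中大金属集团有限公司), Hangzhou 310005, China
- 4. Wuchan Zhongda Digital Technology Co., Ltd. (物产中大数字科技有限公司), Hangzhou 310020, China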
Abstract
Data imputation is an essential preprocessing task in data governance that aims to fill in incomplete data. However, conventional imputation methods draw on a single, isolated table, so they only partly alleviate data incompleteness and fail to strike a good balance between the accuracy and the efficiency of imputation. In this paper, we present a novel visual analysis approach for data imputation. We design a multi-party tabular data association strategy that uses intelligent algorithms to identify similar columns and establish column correlations across multiple tables. We then perform an initial imputation of the missing data using similar data entries from other tables. In addition, we develop a visual analysis system to refine the imputation candidates. The interactive system combines the multi-party imputation approach with expert knowledge and helps users better understand the relational structure of the data. This significantly improves the accuracy and efficiency of data imputation, which in turn raises the quality of data governance and the intrinsic value of data assets. Experimental validation and user surveys show that the approach supports users in applying their domain knowledge to verify and judge related columns and similar rows.
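The two-stage idea in the abstract (associate similar columns across tables, then borrow values from similar rows) can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the abstract does not specify the "intelligent algorithms", so Jaccard overlap of column values stands in for column similarity, and exact agreement on the other matched columns stands in for row similarity. The names `jaccard`, `match_columns`, `propose_candidates`, the 0.3 threshold, and the toy tables are all hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# A table is a mapping: column name -> list of cell values (None = missing).
Table = Dict[str, List[Optional[str]]]


def jaccard(a: List[Optional[str]], b: List[Optional[str]]) -> float:
    """Jaccard overlap of the non-missing value sets of two columns."""
    sa = {v for v in a if v is not None}
    sb = {v for v in b if v is not None}
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def match_columns(target: Table, other: Table, threshold: float = 0.3) -> Dict[str, str]:
    """Map each target column to its most similar column in `other`.

    Assumption: value overlap is a stand-in for the paper's column matching.
    Columns whose best score is below `threshold` stay unmatched.
    """
    mapping: Dict[str, str] = {}
    for t_name, t_vals in target.items():
        best_name, best_score = None, threshold
        for o_name, o_vals in other.items():
            score = jaccard(t_vals, o_vals)
            if score >= best_score:
                best_name, best_score = o_name, score
        if best_name is not None:
            mapping[t_name] = best_name
    return mapping


def propose_candidates(
    target: Table, other: Table, mapping: Dict[str, str]
) -> Dict[Tuple[int, str], str]:
    """Propose a value for each missing cell from the most similar row in `other`.

    Row similarity here is the number of other matched columns on which the
    two rows agree exactly. Results are candidates for expert review, not
    final values.
    """
    n_target = len(next(iter(target.values())))
    n_other = len(next(iter(other.values())))
    candidates: Dict[Tuple[int, str], str] = {}
    for i in range(n_target):
        for col, o_col in mapping.items():
            if target[col][i] is not None:
                continue  # cell is already filled
            best_j, best_agree = None, 0
            for j in range(n_other):
                if other[o_col][j] is None:
                    continue  # this row has nothing to offer for the cell
                agree = sum(
                    1
                    for c, oc in mapping.items()
                    if c != col
                    and target[c][i] is not None
                    and target[c][i] == other[oc][j]
                )
                if agree > best_agree:
                    best_j, best_agree = j, agree
            if best_j is not None:
                candidates[(i, col)] = other[o_col][best_j]
    return candidates


# Toy data (hypothetical): `orders` has a missing city that `clients` can supply.
orders: Table = {
    "customer": ["Acme", "Borel", "Cyan"],
    "city": ["Hangzhou", None, "Ningbo"],
    "grade": ["A", "B", "A"],
}
clients: Table = {
    "name": ["Acme", "Borel", "Cyan", "Delta"],
    "location": ["Hangzhou", "Shanghai", "Ningbo", "Wenzhou"],
}

mapping = match_columns(orders, clients)
candidates = propose_candidates(orders, clients, mapping)
print(mapping)     # {'customer': 'name', 'city': 'location'}
print(candidates)  # {(1, 'city'): 'Shanghai'}
```

In this sketch the output is only a dictionary of candidate values keyed by (row, column). That mirrors the abstract's point that automatic association yields an initial imputation, which an expert then confirms or corrects in the visual analysis system.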
Keywords
Data governance / Data incompleteness / Data imputation / Data visualization / Interactive visual analysis
Funding
Key R&D "Pioneer" Tackling Plan Program of Zhejiang Province, China (2023C01119)
"Ten Thousand Talents Plan" Science and Technology Innovation Leading Talent Program of Zhejiang Province, China (2022R52044)
Major Standardization Pilot Projects for the Digital Economy (Digital Trade Sector) of Zhejiang Province, China (SJ-BZ/2023053)
National Natural Science Foundation of China (62132017)
Publication year
2024