A Data Lineage Construction Method Based on Data Table Similarity Calculation

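Abstract (translated from Chinese): In the era of big data, it has become a consensus that business departments should unlock the value of data on the basis of their accumulated business data. Because data standards are not unified across business systems, problems such as disorganized metadata, data silos, and low-quality data keep emerging; they hinder the effective use of data and make governance necessary. Within this work, data lineage analysis is one of the key tasks of metadata management and is of great significance for data traceability and data governance. However, traditional data lineage construction methods often suffer from high computational complexity, poor accuracy, and high execution cost. To overcome these problems, a data lineage construction method based on data table similarity calculation is proposed: text features are built from three elements of a data table (its name, its table structure, and its data fields), TF-IDF is used to compute the similarity between tables, and an improved Jaro-Winkler distance algorithm is then used to verify field overlap and table-name similarity in order to construct the lineage relationships between tables. The results show that the algorithm is effective at constructing table lineage relationships and facilitates data governance work.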

English title: Building Method for Data Lineage Based on Data Table Similarity Calculation

English abstract: In the era of big data, it has become a consensus that business departments can unlock the value of data based on their accumulated business data. However, because data standards are not unified across business systems, problems such as disorganized metadata, data silos, and low-quality data constantly emerge, hindering the effective use of data and making governance necessary. Data lineage analysis is one of the key tasks of metadata management and is of great significance for data traceability and data governance. Traditional methods for constructing data lineage, however, often face high computational complexity, poor accuracy, and high execution costs. To overcome these issues, a data lineage construction method based on data table similarity calculation is proposed: the three elements of a data table (its name, table structure, and data fields) are represented as text features, TF-IDF is used to calculate the similarity of data tables, and an improved Jaro-Winkler distance algorithm then verifies field overlap and table-name similarity to construct the lineage relationships between tables. The results show that the algorithm is effective at constructing data table lineage and facilitates the smooth progress of data governance work.
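The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify how the Jaro-Winkler algorithm was improved, so the standard Jaro-Winkler similarity is used here, field overlap is approximated by a Jaccard index over column names, and the table definitions, tokenization, function names, and thresholds are all hypothetical. Because similarity is symmetric, the sketch only yields undirected candidate pairs, not the direction of lineage.

```python
import math
from collections import Counter
from itertools import combinations

def table_tokens(table):
    """Flatten table name, column names, and column types into one token list."""
    tokens = table["name"].lower().split("_")
    for col, col_type in table["columns"]:
        tokens += col.lower().split("_")
        tokens.append(col_type.lower())
    return tokens

def tfidf_vectors(docs):
    """Sparse TF-IDF vectors with smoothed IDF: ln((1 + n) / (1 + df)) + 1."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))
    return [{tok: cnt / len(doc) * (math.log((1 + n) / (1 + df[tok])) + 1)
             for tok, cnt in Counter(doc).items()} for doc in docs]

def cosine(u, v):
    dot = sum(w * v.get(tok, 0.0) for tok, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def jaro(s, t):
    """Standard Jaro similarity (assumption: the paper's improved variant is unknown)."""
    if s == t:
        return 1.0
    if not s or not t:
        return 0.0
    window = max(max(len(s), len(t)) // 2 - 1, 0)
    s_hit, t_hit = [False] * len(s), [False] * len(t)
    matches = 0
    for i, ch in enumerate(s):
        for j in range(max(0, i - window), min(len(t), i + window + 1)):
            if not t_hit[j] and t[j] == ch:
                s_hit[i] = t_hit[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    k = out_of_order = 0
    for i, hit in enumerate(s_hit):
        if hit:
            while not t_hit[k]:
                k += 1
            out_of_order += s[i] != t[k]
            k += 1
    transpositions = out_of_order // 2
    return (matches / len(s) + matches / len(t)
            + (matches - transpositions) / matches) / 3

def jaro_winkler(s, t, scaling=0.1):
    """Jaro similarity boosted by the length of the common prefix (up to 4 chars)."""
    base = jaro(s, t)
    prefix = 0
    for a, b in zip(s[:4], t[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * scaling * (1 - base)

def field_overlap(a, b):
    """Jaccard index over column names (a stand-in for the paper's field overlap)."""
    cols_a = {c for c, _ in a["columns"]}
    cols_b = {c for c, _ in b["columns"]}
    union = cols_a | cols_b
    return len(cols_a & cols_b) / len(union) if union else 0.0

def candidate_lineage(tables, tfidf_min=0.5, overlap_min=0.5, name_min=0.7):
    """Shortlist pairs by TF-IDF cosine, then verify overlap and name similarity."""
    vectors = tfidf_vectors([table_tokens(t) for t in tables])
    pairs = []
    for i, j in combinations(range(len(tables)), 2):
        a, b = tables[i], tables[j]
        if cosine(vectors[i], vectors[j]) < tfidf_min:
            continue
        if (field_overlap(a, b) >= overlap_min
                and jaro_winkler(a["name"], b["name"]) >= name_min):
            pairs.append((a["name"], b["name"]))
    return pairs

# Hypothetical table metadata: (column name, column type) pairs.
TABLES = [
    {"name": "ods_student_info",
     "columns": [("student_id", "bigint"), ("name", "varchar"), ("enroll_date", "date")]},
    {"name": "dwd_student_info",
     "columns": [("student_id", "bigint"), ("name", "varchar"),
                 ("enroll_date", "date"), ("major", "varchar")]},
    {"name": "ods_course_log",
     "columns": [("course_id", "bigint"), ("click_time", "timestamp"), ("device", "varchar")]},
]

print(candidate_lineage(TABLES))
```

Running the cheap TF-IDF comparison first and applying the stricter field and name checks only to the shortlisted pairs mirrors the two-stage order given in the abstract; the thresholds would need tuning on real metadata.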

English keywords:

data lineage; data governance; metadata; table similarity

Authors:

Pan Qi (潘奇), Cai Sibo (蔡斯博), Wei Fangfang (魏芳芳)

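Affiliations:

The Open University of China (国家开放大学), Beijing 100039

Engineering Research Center of Integration and Application of Digital Learning Technology, Ministry of Education (数字化学习技术集成与应用教育部工程研究中心), Beijing 100039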

Keywords:

data lineage (数据血缘); data governance (数据治理); metadata (元数据); table similarity (表相似度)

Publication year:

2024

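Journal: Computer & Telecommunication (电脑与电信)

Publisher: 广东省对外科技交流中心 (Guangdong Provincial Center for Foreign Science and Technology Exchange)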

Impact factor: 0.117

ISSN: 1008-6609

Year, volume (issue): 2024, (6)