
Building Method for Data Lineage Based on Data Table Similarity Calculation

In the era of big data, it is widely accepted that business departments should unlock the value of their accumulated business data. However, because data standards are not unified across business systems, problems such as disorganized metadata, data silos, and low-quality data keep emerging. They hinder the effective use of data and make governance necessary. Data lineage analysis is one of the key tasks of metadata management and is important for data traceability and data governance. Traditional methods for constructing data lineage, however, often suffer from high computational complexity, poor accuracy, and high execution cost. To overcome these problems, a data lineage construction method based on data table similarity calculation is proposed: the name, table structure, and data fields of each table are represented as text features, TF-IDF is used to calculate the similarity between tables, and an improved Jaro-Winkler distance algorithm then verifies field overlap and table name similarity to construct the lineage relationships between tables. The results show that the algorithm is markedly effective at constructing data table lineage and facilitates data governance work.
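The two stages named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify their tokenization, weighting, thresholds, or how Jaro-Winkler was improved, so the code uses the standard Jaro-Winkler similarity, a smoothed TF-IDF with cosine similarity, and splitting on underscores. All function names and the sample tables are hypothetical.

```python
import math
from collections import Counter


def table_tokens(name, columns):
    """Bag of lowercase tokens from a table name and its column names (assumed: split on '_')."""
    tokens = name.lower().split("_")
    for col in columns:
        tokens.extend(col.lower().split("_"))
    return [t for t in tokens if t]


def tfidf_vectors(docs):
    """One sparse TF-IDF vector (dict) per token list, using a smoothed IDF."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))
    idf = {tok: math.log((1 + n) / (1 + d)) + 1.0 for tok, d in df.items()}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({tok: (c / len(doc)) * idf[tok] for tok, c in tf.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(tok, 0.0) for tok, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def jaro(s1, s2):
    """Standard Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order // 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3.0


def jaro_winkler(s1, s2, p=0.1, max_prefix=4):
    """Standard Jaro-Winkler: Jaro boosted by the length of the common prefix."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:max_prefix], s2[:max_prefix]):
        if a != b:
            break
        prefix += 1
    return j + prefix * p * (1.0 - j)


def field_overlap(cols_a, cols_b):
    """Share of the smaller table's columns that also appear in the other table."""
    a, b = set(cols_a), set(cols_b)
    return len(a & b) / min(len(a), len(b)) if a and b else 0.0


if __name__ == "__main__":
    # Hypothetical tables: name -> column names.
    tables = {
        "ods_student_info": ["student_id", "name", "enroll_date"],
        "dwd_student_info": ["student_id", "name", "enroll_date", "campus"],
        "ods_course_fee": ["course_id", "amount"],
    }
    names = list(tables)
    vecs = tfidf_vectors([table_tokens(n, tables[n]) for n in names])
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            print(names[i], names[j],
                  round(cosine(vecs[i], vecs[j]), 3),
                  round(jaro_winkler(names[i], names[j]), 3),
                  round(field_overlap(tables[names[i]], tables[names[j]]), 3))
```

In a pipeline of this shape, the TF-IDF score would serve as a cheap first filter over all table pairs, and the name and field checks would only run on the pairs that pass it. Any cut-off values would need to be tuned on real metadata.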

data lineage; data governance; metadata; table similarity

潘奇、蔡斯博、魏芳芳


The Open University of China, Beijing 100039

Engineering Research Center of Integration and Application of Digital Learning Technology, Ministry of Education, Beijing 100039


2024

Computer & Telecommunication (电脑与电信)
Guangdong Provincial Center for International Science and Technology Exchange (广东省对外科技交流中心)


Impact factor: 0.117
ISSN:1008-6609
Year, volume (issue): 2024, (6)