Computer Technology and Development, 2023, Vol. 33, Issue 12: 49-57. DOI: 10.3969/j.issn.1673-629X.2023.12.007

A Tuple-level Data Lineage Approach Based on Word Embedding

Yang Bin (杨彬) 1, Gao Juntao (高俊涛) 1, Wang Zhibao (王志宝) 1, Li Fei (李菲) 2, Ma Qiang (马强) 2, Jiang Shutao (江树涛) 1

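Author information

  • 1. School of Computer and Information Technology, Northeast Petroleum University, Daqing 163318, Heilongjiang
  • 2. College of Information and Electrical Engineering, Heilongjiang Bayi Agricultural University, Daqing 163319, Heilongjiang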

Abstract

In the era of information explosion, data volumes grow daily, and data mining can uncover the relationships within the data, but only if the data used are correct; otherwise all subsequent work is meaningless. Data lineage technology helps data analysts quickly locate the source and processing history of erroneous data, reducing the time and difficulty of analyzing such data, and is valuable for data quality control and trustworthy data management. Existing tuple-level data lineage methods suffer from high storage overhead and low lineage efficiency, so this paper uses word embedding to improve them. First, a tuple vectorization encoding mechanism is studied, and tuple lineage relationships are identified from the similarity of tuple vectors. Second, an optimization algorithm based on attribute importance is proposed to improve lineage precision. Third, approximate nearest neighbor search and a tuple filtering optimization mechanism are introduced to reduce the time complexity of lineage tracing. Finally, a directed acyclic graph is used to display the lineage relationships of tuple data. Experimental results show that the proposed method achieves relatively high precision, relatively low time complexity, and lower storage consumption, and effectively improves tuple-level data lineage.
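
The pipeline outlined in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and everything specific in it is an assumption made for illustration: the hash-based `token_vector` merely stands in for a trained word embedding, the `weights` dictionary stands in for the attribute-importance scores, the 0.8 threshold is arbitrary, and a brute-force scan replaces the approximate nearest neighbor search mentioned in the abstract.

```python
import hashlib
import math

DIM = 16  # toy embedding size (assumption, not from the paper)


def token_vector(token):
    """Deterministic stand-in for a trained word embedding (hypothetical)."""
    digest = hashlib.sha256(token.lower().encode("utf-8")).digest()
    return [b / 255.0 - 0.5 for b in digest[:DIM]]


def tuple_vector(row, weights):
    """Encode one tuple as the importance-weighted mean of its token vectors."""
    vec = [0.0] * DIM
    total = 0.0
    for attr, value in row.items():
        w = weights.get(attr, 1.0)  # attribute importance, default 1.0
        for tok in str(value).split():
            tv = token_vector(tok)
            for i in range(DIM):
                vec[i] += w * tv[i]
            total += w
    return [x / total for x in vec] if total else vec


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def trace_lineage(targets, sources, weights, threshold=0.8):
    """Return lineage edges {target_id: [source_id, ...]} by brute-force scan.

    Edges always point from a derived (target) tuple back to source tuples,
    so the resulting graph is acyclic by construction.
    """
    source_vecs = {sid: tuple_vector(r, weights) for sid, r in sources.items()}
    edges = {}
    for tid, row in targets.items():
        tv = tuple_vector(row, weights)
        edges[tid] = [sid for sid, sv in source_vecs.items()
                      if cosine(tv, sv) >= threshold]
    return edges


# Hypothetical sample data: t1 is an unchanged copy of source tuple s1.
sources = {
    "s1": {"name": "Alice Smith", "city": "Daqing"},
    "s2": {"name": "Bob Jones", "city": "Harbin"},
}
targets = {"t1": {"name": "Alice Smith", "city": "Daqing"}}
weights = {"name": 2.0, "city": 1.0}
edges = trace_lineage(targets, sources, weights)
```

Because `t1` is identical to `s1`, their tuple vectors coincide and `s1` appears among the lineage candidates for `t1`. With a real embedding model, semantically related but non-identical values would also score highly, which is the property the abstract relies on; at scale, the inner scan would be replaced by an approximate nearest neighbor index.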

Key words

structured data / data lineage / tuple vectors / similarity comparison / word embedding

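Funding

National Natural Science Foundation of China (61902222)

Northeast Petroleum University Outstanding Young and Middle-aged Research and Innovation Team Cultivation Fund (KYCXTDQ202101)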

Publication year

2023
Computer Technology and Development
Shaanxi Computer Society

CSTPCD
Impact factor: 0.621
ISSN: 1673-629X
Reference count: 3