
Large-capacity semi-structured data extraction algorithm combining machine learning and deep learning

Semi-structured data is highly heterogeneous and very large in volume, and its structure varies across sources, which lowers the accuracy and completeness of data extraction. To address this, this paper integrates machine learning with deep learning and proposes an extraction algorithm for large-capacity semi-structured data. Principal component analysis, a machine learning method, is used to reduce the dimensionality of the data. Building on the Transformer network structure from deep learning, the embedding layer, the encoder-decoder layers, and the encoder layer are each improved, yielding two extraction algorithms: one for recognizing named entities in the data and one for extracting relations between entities. Together they carry out the extraction of large-capacity semi-structured data. Tests show that the proposed algorithm extracts data correctly, with a minimum of only 4 invalid data items extracted, low extraction complexity, and good timeliness. Ablation experiments on F-value and extraction time indicate that both techniques contribute to the result: the F-value stays above 92 and the extraction time is reduced to within 125 ms. The algorithm is therefore feasible and offers a means of improving operational efficiency and optimizing resource allocation.

Keywords: semi-structured data; machine learning; data capacity dimensionality reduction; deep learning; named entity recognition; entity relationship extraction

张磊、焦晶、李勃昕、周延杰


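School of Information, Xi'an University of Finance and Economics, Xi'an 710100

School of Economics and Management, Northwest University, Xi'an 710127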


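Funding: China (Xi'an) Silk Road Research Institute vertical project; China (Xi'an) Silk Road Research Institute vertical project; Xi'an University of Finance and Economics horizontal project

Grant numbers: 2019HZ02; 2017SY05; 2022250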

2024

Journal of Jilin University (Engineering and Technology Edition)
Jilin University


Indexed in: CSTPCD; Peking University Core Journals
Impact factor: 0.792
ISSN: 1671-5497
Year, volume (issue): 2024, 54(9)