Fast Retrieval Technology for Engineering Archive Data Based on Classification Recognition

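Abstract: Traditional archive database retrieval methods suffer from long data processing times and low efficiency, and traditional storage approaches such as relational databases are costly and lack robustness. Building on the distributed storage database HBase, this paper proposes a fast retrieval model for engineering archive data based on classification recognition. The model consists mainly of a data classification and recognition module and a fast data retrieval module. To address the insensitivity of the traditional TF-IDF algorithm to word position information, the classification and recognition module improves the algorithm with inter-class and intra-class methods and combines it with a naive Bayes network to raise classification accuracy. The fast retrieval module uses a CNN and an LSTM to extract data features and a hashing algorithm to generate hash codes for the data, which increases retrieval speed. In experiments, the improved TF-IDF algorithm achieved the best accuracy, recall, and F1 score on different datasets, retrieval time was reduced by more than 10%, and robustness was strong. The results indicate that the proposed method outperforms traditional approaches and is both efficient and stable.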
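The classification step named in the abstract (a class-aware TF-IDF weighting fed into a naive Bayes classifier) can be illustrated with a minimal sketch. The abstract does not give the paper's inter-class and intra-class formulas, so the `concentration` factor below is a hypothetical stand-in, not the authors' method: it scales each term by how strongly its document frequency is concentrated in a single class. The class name `ClassAwareTfidfNB`, the smoothing parameter `alpha`, and the toy documents are likewise assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


class ClassAwareTfidfNB:
    """Multinomial naive Bayes over TF-IDF weights scaled by a class factor.

    The class factor is an illustrative assumption, not the paper's formula.
    """

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # Laplace smoothing strength

    def fit(self, docs, labels):
        self.n_docs = len(docs)
        self.class_sizes = Counter(labels)
        self.df = Counter()                      # document frequency per term
        self.df_by_class = defaultdict(Counter)  # same, split by class
        for tokens, label in zip(docs, labels):
            for term in set(tokens):
                self.df[term] += 1
                self.df_by_class[label][term] += 1
        # Accumulate the weighted term mass of every class.
        self.mass = defaultdict(Counter)
        for tokens, label in zip(docs, labels):
            for term, weight in self.weights(tokens).items():
                self.mass[label][term] += weight
        return self

    def concentration(self, term):
        # Hypothetical inter-class factor: 1.0 when the term occurs in one
        # class only, 1/k when it is spread evenly over k classes.
        rates = [self.df_by_class[c][term] / n for c, n in self.class_sizes.items()]
        return max(rates) / sum(rates)

    def weights(self, tokens):
        # Terms unseen during training are ignored.
        counts = Counter(t for t in tokens if t in self.df)
        total = sum(counts.values())
        result = {}
        for term, count in counts.items():
            idf = math.log((1 + self.n_docs) / (1 + self.df[term])) + 1.0
            result[term] = (count / total) * idf * self.concentration(term)
        return result

    def predict(self, tokens):
        vocab_size = len(self.df)
        query = self.weights(tokens)
        best_label, best_score = None, -math.inf
        for label, size in self.class_sizes.items():
            score = math.log(size / self.n_docs)  # log prior
            denom = sum(self.mass[label].values()) + self.alpha * vocab_size
            for term, weight in query.items():
                likelihood = (self.mass[label][term] + self.alpha) / denom
                score += weight * math.log(likelihood)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy, made-up training data: tokenized archive records and their categories.
train_docs = [
    ["contract", "payment", "clause", "project"],
    ["contract", "supplier", "payment"],
    ["drawing", "turbine", "layout", "project"],
    ["drawing", "pipeline", "layout"],
]
train_labels = ["contract", "contract", "drawing", "drawing"]
model = ClassAwareTfidfNB().fit(train_docs, train_labels)
predicted = model.predict(["payment", "contract"])
```

In this sketch a term such as "project", which appears equally in both classes, is down-weighted relative to a term that appears in only one class, which is the general intent of class-aware weighting.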

English title: Fast retrieval technology for engineering archive data based on classification recognition

English abstract: The traditional archive database retrieval method has long data processing time and low efficiency, and traditional storage methods such as relational databases have high cost and poor robustness. Based on the distributed storage database HBase, a fast retrieval model for engineering archive data based on classification recognition is proposed in this article. The model mainly consists of a data classification and recognition module and a fast data retrieval module. The data classification and recognition technology addresses the shortcoming that the traditional TF-IDF algorithm is not sensitive to word position information. It improves the algorithm using inter-class and intra-class methods and combines it with a Naive Bayes network to improve classification and recognition accuracy. The fast data retrieval module uses CNN and LSTM for data feature extraction and a Hash algorithm to generate data Hash codes, improving retrieval speed. In experimental testing, the improved TF-IDF algorithm achieved the best accuracy, recall, and F1 values on different datasets. The retrieval time was reduced by more than 10% and the robustness was high. The experimental results indicate that the proposed method surpasses traditional methods and combines efficiency and stability.
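The retrieval idea in the abstract (turn each record's feature vector into a compact hash code, then compare codes instead of full vectors) can be sketched as follows. This is not the paper's deep hashing model: the CNN and LSTM feature extractor is assumed to exist already and is represented by hand-written vectors, and the learned hash function is replaced by random-hyperplane sign hashing, a simpler stand-in. All function names, the vector values, and the parameters `n_bits` and `top_k` are assumptions for illustration.

```python
import random


def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random hyperplanes; each one contributes one bit."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def hash_code(vector, hyperplanes):
    """Pack the sign of each projection into an integer bit string."""
    code = 0
    for plane in hyperplanes:
        dot = sum(v * p for v, p in zip(vector, plane))
        code = (code << 1) | (1 if dot >= 0 else 0)
    return code


def hamming(a, b):
    """Number of differing bits between two hash codes."""
    return bin(a ^ b).count("1")


def search(query_vector, index, hyperplanes, top_k=3):
    """Return the ids of the top_k records closest to the query in Hamming distance."""
    query_code = hash_code(query_vector, hyperplanes)
    ranked = sorted(index.items(), key=lambda item: hamming(query_code, item[1]))
    return [doc_id for doc_id, _ in ranked[:top_k]]


# Hypothetical feature vectors standing in for CNN + LSTM outputs.
features = {
    "doc_a": [0.9, 0.1, 0.0, 0.3],
    "doc_b": [0.8, 0.2, 0.1, 0.4],
    "doc_c": [-0.7, 0.9, -0.5, 0.0],
}
n_bits = 16
planes = make_hyperplanes(dim=4, n_bits=n_bits)
index = {doc_id: hash_code(vec, planes) for doc_id, vec in features.items()}
results = search(features["doc_a"], index, planes, top_k=2)
```

Comparing integer codes with XOR and a bit count is much cheaper than comparing floating-point vectors, which is the usual reason hashing shortens retrieval time. A production index would group records by code rather than sort every entry per query.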

English keywords:

distributed database; TF-IDF; deep Hash algorithm; Naive Bayes; data retrieval

Authors:

Wang Jianzhong (王建忠), Wu Xinda (吴昕达)

Affiliation:

Zhejiang Ninghai Pumped Storage Co., Ltd., Ningbo, Zhejiang 315621

Keywords:

distributed database; TF-IDF; deep hashing algorithm; naive Bayes; data retrieval

Year of publication:

2025

DOI: