A natural language processing data collection method for weakly regulated professional petroleum documents

Chang Qifan ¹, Yang Xumin ², Zhu Fanghui ³, Jin Long ⁴, Liu Wei ⁵, Zheng Lihui ²


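Author information

1. College of Petroleum Engineering, China University of Petroleum (Beijing), Changping, Beijing; Chengdu No. 7 High School, Chengdu, Sichuan
2. College of Petroleum Engineering, China University of Petroleum (Beijing), Changping, Beijing
3. College of Petroleum Engineering, China University of Petroleum (Beijing), Changping, Beijing; Oil and Gas Technology Research Institute, PetroChina Changqing Oilfield Company, Xi'an, Shaanxi
4. College of Petroleum Engineering, China University of Petroleum (Beijing), Changping, Beijing; Institute of Education Statistics and Analysis, China National Academy of Educational Sciences, Haidian, Beijing
5. Oil and Gas Technology Research Institute, PetroChina Changqing Oilfield Company, Xi'an, Shaanxi; National Engineering Laboratory for Exploration and Development of Low-Permeability Oil and Gas Fields, Xi'an, Shaanxi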

Abstract

(Objective) Existing data collection methods misread weakly standardized documents that are dense in specialized vocabulary, which leads to low collection accuracy and long collection times. This work builds a fast, accurate data collection model suited to the characteristics of petroleum engineering documents, addressing both problems and providing a data foundation for big data computing. (Methods) First, hierarchical structured entries were established to capture the differences in how petroleum engineering documents are recorded. Next, a dictionary of specialized vocabulary was built so that the computer can recognize petroleum engineering terms in the documents. Finally, on top of a natural language model and through training on a large amount of data, the Structure Petroleum Bidirectional Encoder Representation from Transformer (SPBERT) data collection model was constructed. Given workover-related Word documents as input, the model automatically outputs the data in the documents together with the corresponding labels. (Results) The model's accuracy was compared against two existing regular-expression methods and three general-purpose natural language models (two general BERT models and one GPT model) on field workover data from Changqing Oilfield, and the time each model took to collect data was recorded. The five comparison models averaged 40.06% accuracy, whereas SPBERT reached 82.3% on workover data collection, a relative improvement of 105.44% over that average. SPBERT took 402 ms per correctly collected data set, 27.44% less than the 554 ms average of the other models. (Conclusions) The SPBERT model collects supplementary data with high accuracy and in a short time; it can further strengthen the domain specialization of natural language models and advance the digital and intelligent development of oilfields.
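The two percentages in the results follow directly from the reported accuracies and timings. The short check below is only an illustrative sketch: the variable names and the `relative_change` helper are assumptions introduced here, not part of the paper, and the inputs are the four figures quoted in the abstract.

```python
# Figures quoted in the abstract (names are illustrative, not from the paper).
baseline_accuracy = 40.06  # mean accuracy of the five comparison models, in %
spbert_accuracy = 82.3     # SPBERT accuracy on workover data, in %
baseline_time_ms = 554     # mean time per correct data set for the other models
spbert_time_ms = 402       # SPBERT time per correct data set


def relative_change(new: float, reference: float) -> float:
    """Return the change from `reference` to `new` as a percentage of `reference`."""
    return (new - reference) / reference * 100


# Positive value: SPBERT is more accurate than the baseline average.
accuracy_gain = relative_change(spbert_accuracy, baseline_accuracy)
# Negative value: SPBERT needs less time than the baseline average.
time_change = relative_change(spbert_time_ms, baseline_time_ms)

print(f"accuracy gain: {accuracy_gain:.2f}%")
print(f"time reduction: {-time_change:.2f}%")
```

Both values are relative changes measured against the baseline average, so an accuracy gain above 100% means SPBERT's accuracy is more than double that average, not that it exceeds 100% in absolute terms.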

Key words

Data assets/Data collection/Natural language processing/Workover/Smart oilfield/New quality productivity


Publication year

2024

Oil Drilling & Production Technology

PetroChina Huabei Oilfield Company; Huabei Petroleum Administration Bureau


CSTPCD; Peking University Core Journals

Impact factor: 0.975

ISSN: 1000-7393
