Extracting Long Terms from Sparse Samples

[Objective] To address sparse training samples and the difficulty of identifying long terms in the weaponry domain, this paper proposes a method combining head-tail pointers with active learning. [Methods] First, we used the BERT pre-trained language model to obtain word vector representations and extracted long terms with a head-tail pointer network. Then, we proposed a new active-learning sampling strategy that selects high-quality samples from the unlabeled data and iteratively retrains the model, reducing its dependence on data scale. [Results] The model improved the F1 score for long-term extraction by 0.50 percentage points. With active-learning sampling, about 50% of the high-quality data was sufficient to reach the same F1 score as training on 100% of the data. [Limitations] Due to limited computing power, the data set in this paper was small; the added active-learning sampling step raises the time cost of processing large-scale data. [Conclusions] The head-tail pointer and active learning method extracts long terms effectively while reducing the cost of data annotation.
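The paper itself does not publish code, but the two core ideas in the abstract can be illustrated with a minimal sketch. The first function shows the span-decoding step a head-tail pointer scheme implies: each token gets a probability of being a term head or a term tail, and a span is formed by pairing each predicted head with the nearest following tail. The second function is a generic least-confidence sampler standing in for the paper's (unspecified) active-learning strategy. All function names, thresholds, and the toy inputs are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch, not the paper's code.

def decode_spans(head_probs, tail_probs, threshold=0.5, max_len=10):
    """Pair each predicted head token with the nearest tail at or after it.

    head_probs/tail_probs: per-token probabilities, e.g. from a head-tail
    pointer network on top of BERT token embeddings.
    """
    heads = [i for i, p in enumerate(head_probs) if p >= threshold]
    tails = [i for i, p in enumerate(tail_probs) if p >= threshold]
    spans = []
    for h in heads:
        # candidate tails: not before the head, within a maximum span length
        cands = [t for t in tails if h <= t < h + max_len]
        if cands:
            spans.append((h, cands[0]))
    return spans


def least_confident(confidences, k=2):
    """Generic uncertainty sampling: pick the k unlabeled samples the
    model is least confident about (a stand-in for the paper's strategy)."""
    order = sorted(range(len(confidences)), key=lambda i: confidences[i])
    return order[:k]


# Toy example: one long five-token term in a seven-token sentence.
tokens = ["the", "long", "range", "missile", "guidance", "system", "failed"]
head_p = [0.1, 0.9, 0.2, 0.1, 0.1, 0.1, 0.1]
tail_p = [0.1, 0.1, 0.1, 0.2, 0.1, 0.95, 0.1]
for h, t in decode_spans(head_p, tail_p):
    print(" ".join(tokens[h : t + 1]))  # -> long range missile guidance system
```

Pairing each head with the *nearest* valid tail is one simple decoding choice; enumerating all head-tail pairs and scoring them jointly is a common alternative when terms can nest.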

Term Extraction; Active Learning; Head-to-Tail Pointer Network; BERT; Weaponry

Lü Xueqiang, Yang Yuting, Xiao Gang, Li Yuxian, You Xindong


Beijing Key Laboratory of Internet Culture and Digital Dissemination Research, Beijing Information Science and Technology University, Beijing 100101, China

Key Laboratory of Complex System Simulation, Institute of Systems Engineering, Academy of Military Sciences PLA, Beijing 100101, China


Funding: National Natural Science Foundation of China; National Defense Science and Technology Key Laboratory Fund; Beijing Natural Science Foundation

6217104364120062004044212020

2024

Data Analysis and Knowledge Discovery (数据分析与知识发现)
National Science Library, Chinese Academy of Sciences

Indexed in: CSTPCD; CSSCI; CHSSCD; PKU Core Journals; EI
Impact factor: 1.452
ISSN:2096-3467
Year, Volume (Issue): 2024, 8(1)