Judicial Event Detection Model Based on Continuous Pre-training and Segment Pooling
Zhang Jiacheng^1, Sun Yuanyuan^1, Li Zhiting^2, Yang Liang^1, Lin Hongfei^1
Author Information
- 1. School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, Liaoning, China
- 2. Procuratorial Technology and Information Research Center, Supreme People's Procuratorate, Beijing 100726, China
Abstract
Event detection is an important task in natural language processing (NLP): its goal is to identify and classify trigger words in text, enabling higher-level text analysis and semantic understanding. With the advance of smart-justice initiatives, NLP models are increasingly being applied in the judicial domain. However, data in the judicial field are relatively scarce, and a single sentence often contains multiple trigger words. To address these problems, this work continues pre-training BERT on 120,000 collected judicial crime records during the pre-training phase, improving the model's grasp of judicial knowledge; during the fine-tuning phase, it proposes a segment pooling structure combined with PGD adversarial training to capture the semantic features of both the trigger-word context and the sentence as a whole. The model achieves a notable performance improvement on the CAIL 2022 event detection track, raising the average F1-score by 3.0% over a BERT-based baseline model.
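The two fine-tuning ideas named in the abstract can be sketched in simplified form. The paper's exact segment pooling and PGD settings are not given here, so the span-plus-sentence pooling layout, the use of max-pooling, and the PGD radius/step sizes below are all illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def segment_pooling(hidden, span):
    """Hypothetical segment pooling: pool the trigger-span tokens and the whole
    sentence separately, then concatenate the two vectors (an assumption about
    how trigger-context and sentence-level features are combined)."""
    start, end = span
    trigger_vec = hidden[start:end].max(axis=0)   # max-pool over the trigger span
    sentence_vec = hidden.max(axis=0)             # max-pool over the full sentence
    return np.concatenate([trigger_vec, sentence_vec])

def pgd_perturb(emb, grad, eps=1.0, alpha=0.3, steps=3):
    """Minimal PGD-style adversarial perturbation of input embeddings:
    take small steps along the (normalized) loss gradient and project the
    accumulated perturbation back into an L2 ball of radius eps.
    A real training loop would recompute grad at every step; here a fixed
    gradient stands in for that per-step computation."""
    delta = np.zeros_like(emb)
    for _ in range(steps):
        delta += alpha * grad / (np.linalg.norm(grad) + 1e-12)
        norm = np.linalg.norm(delta)
        if norm > eps:                            # project back into the eps-ball
            delta *= eps / norm
    return emb + delta
```

In a fine-tuning loop, the perturbed embeddings would be fed through the encoder a second time and the adversarial loss added to the clean loss, which is the usual way PGD adversarial training regularizes a BERT-based classifier.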
Keywords
event detection / judicial field / pre-training model
Funding
National Key Research and Development Program of China (2022YFC3301801)
Fundamental Research Funds for the Central Universities (DUT22ZD205)
Publication Year
2024