Judicial Event Detection Model Based on Continuous Pre-training and Segment Pooling
Zhang Jiacheng^1, Sun Yuanyuan^1, Li Zhiting^2, Yang Liang^1, Lin Hongfei^1
Author Information
- 1. School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, Liaoning, China
- 2. Procuratorial Technology and Information Research Center, Supreme People's Procuratorate, Beijing 100726, China
Abstract
Event detection is an important task in natural language processing (NLP): its goal is to identify and classify trigger words in text, enabling higher-level text analysis and semantic understanding. With the advance of smart-justice initiatives, NLP models are increasingly being applied in the judicial domain. However, data in the judicial field are relatively scarce, and a single sentence often contains multiple trigger words. To address these problems, this work continues pre-training BERT on 120,000 collected judicial crime records during the pre-training phase, improving the model's grasp of judicial knowledge; during the fine-tuning phase, it proposes a segment pooling structure combined with PGD adversarial training to capture the semantic features of both the trigger-word context and the sentence as a whole. The model achieves a notable performance improvement on the CAIL 2022 event detection track, raising the average F1-score by 3.0% over a BERT-based baseline model.
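The two fine-tuning ideas named in the abstract can be sketched in simplified form. The paper's exact segment pooling and PGD settings are not given here, so the span-plus-sentence pooling layout, the use of max-pooling, and the PGD radius/step sizes below are all illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def segment_pooling(hidden, span):
    """Hypothetical segment pooling: pool the trigger-span tokens and the whole
    sentence separately, then concatenate the two vectors (an assumption about
    how trigger-context and sentence-level features are combined)."""
    start, end = span
    trigger_vec = hidden[start:end].max(axis=0)   # max-pool over the trigger span
    sentence_vec = hidden.max(axis=0)             # max-pool over the full sentence
    return np.concatenate([trigger_vec, sentence_vec])

def pgd_perturb(emb, grad, eps=1.0, alpha=0.3, steps=3):
    """Minimal PGD-style adversarial perturbation of input embeddings:
    take small steps along the (normalized) loss gradient and project the
    accumulated perturbation back into an L2 ball of radius eps.
    A real training loop would recompute grad at every step; here a fixed
    gradient stands in for that per-step computation."""
    delta = np.zeros_like(emb)
    for _ in range(steps):
        delta += alpha * grad / (np.linalg.norm(grad) + 1e-12)
        norm = np.linalg.norm(delta)
        if norm > eps:                            # project back into the eps-ball
            delta *= eps / norm
    return emb + delta
```

In a fine-tuning loop, the perturbed embeddings would be fed through the encoder a second time and the adversarial loss added to the clean loss, which is the usual way PGD adversarial training regularizes a BERT-based classifier.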
Keywords
event detection / judicial field / pre-training model
Funding
National Key Research and Development Program of China (2022YFC3301801)
Fundamental Research Funds for the Central Universities (DUT22ZD205)
Publication Year
2024