Data augmentation method for entity recognition in the cyber security domain
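Lian Longying (廉龙颖) 1, Gao Chuankai (高传凯) 2, Liu Xingli (刘兴丽) 2
Author information
- 1. School of Information Management, Heilongjiang University, Harbin 150080; School of Computer and Information Engineering, Heilongjiang University of Science and Technology, Harbin 150022
- 2. School of Computer and Information Engineering, Heilongjiang University of Science and Technology, Harbin 150022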
Abstract
To address the high cost and difficulty of annotating data in the cyber security domain, and the resulting limits on model performance when training data are insufficient, this paper proposes an improved data augmentation method. Building on the EDA (Easy Data Augmentation) algorithm, it studies four strategies: same-type entity replacement based on a domain dictionary, entity-protected same-part-of-speech replacement, part-of-speech-protected random insertion, and semantics-protected random deletion. Single strategies and combined strategies are used to expand a small-sample dataset, and a BiLSTM-CRF model is used to verify the effect on entity recognition. The results indicate that the single strategies and their combinations increase the size of the dataset: the single strategy DER reaches an F1 improvement ratio of 38.18%, and the combined strategy EPR+PRI reaches an F1 improvement ratio of 31.16%. The method effectively improves the performance of entity recognition models in the cyber security domain.
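As a rough illustration of two of the strategies named in the abstract, the sketch below applies same-type entity replacement and an entity-protected random deletion to a BIO-tagged sentence. It is a minimal sketch under stated assumptions, not the authors' implementation: the dictionary contents, the tag names (`MALWARE`, `VULN`), and the function names are all hypothetical, and protecting entity tokens is only a simplified stand-in for the "semantic protection" rule, which the abstract does not define.

```python
import random

# Hypothetical domain dictionary: entity type -> known surface forms (as token lists).
DOMAIN_DICT = {
    "MALWARE": [["WannaCry"], ["Emotet"], ["Agent", "Tesla"]],
    "VULN": [["CVE-2017-0144"], ["CVE-2021-44228"]],
}

def find_spans(tags):
    """Return (start, end, type) for each BIO entity span; end is exclusive."""
    spans, start = [], None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing span
        if start is not None and not tag.startswith("I-"):
            spans.append((start, i, tags[start][2:]))
            start = None
        if tag.startswith("B-"):
            start = i
    return spans

def replace_entities(tokens, tags, rng):
    """Swap every entity for a dictionary entry of the same type and re-tag it."""
    out_tokens, out_tags, pos = [], [], 0
    for start, end, etype in find_spans(tags):
        out_tokens += tokens[pos:start]
        out_tags += tags[pos:start]
        candidates = DOMAIN_DICT.get(etype)
        # Keep the original mention when the dictionary has no entry for this type.
        new = rng.choice(candidates) if candidates else tokens[start:end]
        out_tokens += new
        out_tags += ["B-" + etype] + ["I-" + etype] * (len(new) - 1)
        pos = end
    out_tokens += tokens[pos:]
    out_tags += tags[pos:]
    return out_tokens, out_tags

def delete_non_entities(tokens, tags, rng, p=0.1):
    """Drop each non-entity token with probability p; entity tokens are never dropped."""
    kept = [(t, g) for t, g in zip(tokens, tags) if g != "O" or rng.random() >= p]
    return [t for t, _ in kept], [g for _, g in kept]

if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so repeated runs give the same augmentations
    tokens = ["WannaCry", "exploits", "CVE-2017-0144", "on", "Windows"]
    tags = ["B-MALWARE", "O", "B-VULN", "O", "O"]
    print(replace_entities(tokens, tags, rng))
    print(delete_non_entities(tokens, tags, rng, p=0.3))
```

Because both functions return tokens and tags together, the labels stay aligned with the text after augmentation, which is the property a sequence-labelling model such as BiLSTM-CRF depends on.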
Key words
entity recognition/data augmentation/cyber security
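Funding
Basic Scientific Research Funds Project for Provincial Universities of Heilongjiang Province (2022-KYYWF-0569)
Key Project (2023) of the Heilongjiang Province Education Science "14th Five-Year Plan" (GJB1423098)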
Publication year
2024