High Technology Letters (English Edition), 2024, Vol. 30, Issue 4: 389-396. DOI: 10.3772/j.issn.1006-6748.2024.04.006

An alert-situation text data augmentation method based on MLM

DING Weijie (丁伟杰)¹, MAO Tingyun², CHEN Lili², ZHOU Mingwei², YUAN Ying³, HU Wentao³
Author information

  • 1. Key Laboratory of Public Security Information Application Based on Big-Data Architecture, Ministry of Public Security, Hangzhou 310053, P.R. China; Department of Computer and Information Security, Zhejiang Police College, Hangzhou 310053, P.R. China
  • 2. Zhejiang Dahua Technology Co., Ltd, Hangzhou 310053, P.R. China
  • 3. Department of Computer and Information Security, Zhejiang Police College, Hangzhou 310053, P.R. China; Key Laboratory of Public Security Information Application Based on Big-Data Architecture, Ministry of Public Security, Hangzhou 310053, P.R. China

Abstract

The performance of deep learning models relies heavily on the quality and quantity of training data, and insufficient training data leads to overfitting. In alert-situation text classification, however, it is usually difficult to obtain a large amount of training data. This paper proposes a text data augmentation method based on a masked language model (MLM), aiming to improve the generalization capability of deep learning models by expanding the training data. The method uses a Mask strategy to randomly conceal words in the text, then uses the MLM to predict replacements for the masked words from their context, thereby generating new training data. Three Mask strategies are designed (character level, word level, and N-gram), and the performance of each strategy under different Mask ratios is analyzed. The experimental results show that the word-level Mask strategy outperforms traditional data augmentation methods.
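
The mask-and-replace idea summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `choose_spans` and `augment`, the `[MASK]` placeholder string, and the `predict` callable (a stand-in for a real masked language model) are all assumptions made for illustration. Setting `span_len=1` over characters or over words corresponds loosely to the character-level and word-level strategies, and `span_len=n` over words to an N-gram strategy; how the paper segments text and selects positions is not stated in the abstract.

```python
import random
from typing import Callable, List

MASK = "[MASK]"  # assumed placeholder; a real MLM defines its own mask token

# Hypothetical predictor type: given the tokens (with masks) and a position,
# return a replacement token for that position.
Predictor = Callable[[List[str], int], str]


def choose_spans(n_tokens: int, ratio: float, span_len: int,
                 rng: random.Random) -> List[range]:
    """Pick non-overlapping spans covering roughly `ratio` of the tokens."""
    n_spans = max(1, round(n_tokens * ratio / span_len))
    starts = list(range(0, n_tokens - span_len + 1))
    rng.shuffle(starts)
    chosen: List[range] = []
    used = set()
    for start in starts:
        span = range(start, start + span_len)
        if used.isdisjoint(span):
            chosen.append(span)
            used.update(span)
        if len(chosen) == n_spans:
            break
    return chosen


def augment(tokens: List[str], ratio: float, span_len: int,
            predict: Predictor, rng: random.Random) -> List[str]:
    """Return a new token list with the selected spans masked and refilled."""
    out = list(tokens)  # leave the input untouched
    for span in choose_spans(len(out), ratio, span_len, rng):
        for i in span:          # mask the whole span first
            out[i] = MASK
        for i in span:          # then fill it left to right from context
            out[i] = predict(out, i)
    return out


if __name__ == "__main__":
    # Toy predictor that ignores context; a real MLM would go here.
    toy = lambda toks, i: "<pred>"
    words = "the caller reported a stolen bicycle near the station".split()
    print(" ".join(augment(words, ratio=0.2, span_len=1,
                           predict=toy, rng=random.Random(0))))
```

Each call produces one augmented sample of the same length as the input, so running it several times with different random states would expand a small training set. For Chinese text, word-level masking would additionally need a word segmenter, which this sketch leaves out.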

Key words

deep learning / text data augmentation / masked language model (MLM) / alert-situation text classification

Publication year

2024
High Technology Letters (English Edition)
Institute of Scientific and Technical Information of China (ISTIC)

Impact factor: 0.058
ISSN: 1006-6748