Enhancing Offensive Language Detection with Data Augmentation and Knowledge Distillation

Offensive language detection has received considerable attention and plays a crucial role in promoting healthy communication on social platforms, as well as in the safe deployment of large language models. Training data is the basis for developing detectors; however, the available offense-related datasets in Chinese are severely limited in scale and coverage compared with English resources. This significantly affects the accuracy of Chinese offensive language detectors in practical applications, especially when dealing with hard cases or out-of-domain samples. To alleviate the limitations posed by available datasets, we introduce AugCOLD (Augmented Chinese Offensive Language Dataset), a large-scale unsupervised dataset containing 1 million samples gathered by data crawling and model generation. Furthermore, we employ a multiteacher distillation framework to enhance detection performance with unsupervised data. That is, we build multiple teachers with publicly accessible datasets and use them to assign soft labels to AugCOLD. The soft labels serve as a bridge for knowledge to be distilled from both AugCOLD and the multiple teachers to the student network, i.e., the final offensive language detector. We conduct experiments on multiple public test sets and our well-designed hard tests, demonstrating that our proposal can effectively improve the generalization and robustness of the offensive language detector.
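
The soft-labeling step described in the abstract can be illustrated with a minimal sketch. The abstract does not say how the teachers' outputs are combined or which loss trains the student, so the averaging rule, the cross-entropy loss, the function names, and the logits below are all assumptions made for illustration, not the authors' implementation.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def soft_label(teacher_logits):
    """Average the teachers' class probabilities.

    Assumption: simple unweighted averaging; the abstract does not
    specify the aggregation rule.
    """
    probs = [softmax(logits) for logits in teacher_logits]
    n = len(probs)
    return [sum(p[k] for p in probs) / n for k in range(len(probs[0]))]

def distillation_loss(student_logits, target):
    """Cross-entropy between the soft label and the student's prediction."""
    student = softmax(student_logits)
    return -sum(t * math.log(s) for t, s in zip(target, student))

# Hypothetical logits from three teachers for one unlabeled sample,
# ordered as [offensive, non-offensive].
teachers = [[2.0, -1.0], [0.5, 0.2], [1.5, -0.5]]
target = soft_label(teachers)

# A student that agrees with the teachers incurs a lower loss
# than one that disagrees.
loss_agree = distillation_loss([1.0, -1.0], target)
loss_disagree = distillation_loss([-1.0, 1.0], target)
```

In this sketch the soft label keeps the teachers' uncertainty (the second teacher is far less confident than the first), which is the information a hard 0/1 label would discard.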

Jiawen Deng, Zhuang Chen, Hao Sun, Zhexin Zhang, Jincenzi Wu, Satoshi Nakagawa, Fuji Ren, Minlie Huang

The CoAI group, DCST

Institute for Artificial Intelligence

State Key Lab of Intelligent Technology and Systems

Beijing National Research Center for Information Science and Technology

Tsinghua University, Beijing 100084, China

Graduate School of Information Science & Technology, The University of Tokyo, Tokyo 1138654, Japan

School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, Sichuan 611731, China

Funding: National Science Foundation for Distinguished Young Scholars (62125604); NSFC projects (61936010, 61876096); Guoqiang Institute of Tsinghua University (2019GQG1, 2020GQG0005); Tsinghua-Toyota Joint Research Fund

2024

Research (English edition)

CSTPCD
ISSN:
Year, volume (issue): 2024, 2024(2)
  • 43