Contrastive meta-learning framework for few-shot cross-lingual text classification
GUO Jianming 1, ZHAO Yuran 1, LIU Gongshen 1
Author information
- 1. School of Cyber Science and Engineering, Shanghai Jiao Tong University, Shanghai 200240, China
Abstract
Many security risk control issues, such as public opinion analysis in international scenarios, are essentially text classification problems, which become challenging when multiple languages are involved. Previous studies have demonstrated that the performance of few-shot text classification tasks can be enhanced through cross-lingual semantic knowledge transfer. However, the advancement of cross-lingual text classification still faces several challenges. First, it is difficult to obtain language-agnostic representations that transfer well across languages: differences in grammatical structure and syntactic rules between languages cause variations in text representation, making it hard to extract general semantic information. Additionally, labeled data for cross-lingual text classification is scarce. In many real-world scenarios, only a small amount of labeled data is available, which severely degrades the performance of many existing methods. Effective methods are therefore needed to accurately transfer knowledge in few-shot settings and to improve the generalization ability of classification models. To tackle these challenges, a novel framework integrating contrastive learning and meta-learning was proposed. Within the framework, contrastive learning is used to extract general, language-agnostic semantic information, while the rapid-generalization advantage of meta-learning is leveraged to improve knowledge transfer in few-shot settings. Furthermore, a task-based data augmentation method was proposed to further improve the performance of the framework on few-shot cross-lingual classification. Extensive experiments on two widely used multilingual text classification datasets show that the proposed method outperforms several strong baselines, indicating that it can be effectively applied in the field of risk control and security.
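The abstract names two ingredients: a contrastive objective that pulls semantically parallel texts from different languages together in embedding space, and episodic meta-learning for few-shot generalization. As a loose, hypothetical illustration of those two ingredients only (this is not the authors' implementation; the InfoNCE-style loss, the prototypical-network episode, and all names and toy data below are assumptions for the sketch), they might look like:

```python
# Hypothetical sketch of the two ingredients mentioned in the abstract,
# on toy vectors -- not the paper's released code.
import numpy as np

def info_nce(anchors, positives, temperature=0.1):
    """InfoNCE-style contrastive loss: each anchor's positive is the
    same-index row of `positives`; other rows act as in-batch negatives.
    In a cross-lingual setting, anchors/positives would be encoder outputs
    for translation pairs."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature                  # cosine similarities
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))             # -log p(positive | anchor)

def prototype_predict(support, support_labels, query):
    """One few-shot episode, prototypical-network style: classify each
    query by its nearest class prototype (mean of support embeddings)."""
    classes = np.unique(support_labels)
    protos = np.stack([support[support_labels == c].mean(axis=0)
                       for c in classes])
    dists = ((query[:, None, :] - protos[None, :, :]) ** 2).sum(axis=2)
    return classes[dists.argmin(axis=1)]
```

In the full method, each meta-learning episode would sample a task's support and query sets from labeled data, while the contrastive loss shapes the shared encoder; here both pieces are shown independently on random toy inputs.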
Key words
cross-lingual text classification / meta-learning / contrastive learning / few-shot
Funding
National Natural Science Foundation of China (U21B2020)
Shanghai Science and Technology Program (22511104400)
Publication year
2024