
Lexical Complexity Prediction Research for Multiple Domains

[Objective] To explore ways of integrating different corpora so as to improve the overall performance of lexical complexity assessment. [Methods] This study proposes a multi-domain lexical complexity assessment model. A feature generalization module adapts the model to different domains, and the model then learns lexical complexity prediction in a downstream fine-tuning task. A feature fusion module explores the value of combining hand-crafted features with deep features extracted by neural networks. [Results] On the LCP-2021 dataset, compared with the best publicly reported results, the model improved the Pearson correlation coefficient, MAE, and MSE by 0.0148, 0.0017, and 0.0004, respectively, while the Spearman correlation coefficient and the R² coefficient decreased by 0.0038 and 0.0255, respectively. Integrating hand-crafted features produced no noticeable change. After a second transfer to the CWI-2018 dataset, the model improved MAE in three domains by 0.0086, 0.0209, and 0.0174, respectively, compared with the public baseline results. [Limitations] Hand-crafted and deep features were integrated by vector concatenation, which did not fully fuse the different feature types. The choice of algorithm in designing the feature generalization module has certain limitations. Constructing a comprehensive dataset could be attempted in future work. [Conclusions] Integrating different corpora helps improve the model's overall assessment performance in new domains.
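The results above are reported with five regression metrics: Pearson, Spearman, MAE, MSE, and R². As a reading aid, the following is a minimal sketch of their standard textbook definitions in plain Python. It is not the authors' evaluation code; the function names and the toy scores are illustrative assumptions, and the exact scoring protocol used in the study is not stated in the abstract.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error (lower is better)."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Mean squared error (lower is better)."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def pearson(x, y):
    """Pearson correlation coefficient (higher is better)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def r2(y_true, y_pred):
    """Coefficient of determination (higher is better)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Toy complexity scores in [0, 1]; invented for illustration, not from the paper.
gold = [0.10, 0.25, 0.40, 0.55, 0.70]
pred = [0.15, 0.20, 0.45, 0.50, 0.75]

for name, metric in [("Pearson", pearson), ("Spearman", spearman),
                     ("MAE", mae), ("MSE", mse), ("R2", r2)]:
    print(f"{name}: {metric(gold, pred):.4f}")
```

Note that an "improvement" means a higher value for Pearson, Spearman, and R², but a lower value for MAE and MSE, which is why the abstract reports gains and losses separately per metric.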

Keywords: Multiple Domains; Lexical Complexity; Domain Generalization; Feature Fusion

Li Gang (李纲), Huang Jianfei (黄建飞), Mao Jin (毛进)


Center for Studies of Information Resources, Wuhan University, Wuhan 430072

School of Information Management, Wuhan University, Wuhan 430072


Funding: Major Program of the National Social Science Fund of China (Grant No. 22&ZD326)

2024

Data Analysis and Knowledge Discovery (数据分析与知识发现)
National Science Library, Chinese Academy of Sciences


Indexed in: CSTPCD; CSSCI; CHSSCD; PKU Core (北大核心); EI
Impact factor: 1.452
ISSN: 2096-3467
Year, volume (issue): 2024, 8(7)