Lexical Complexity Prediction Research across Multiple Domains
[Objective] To explore methods for integrating different corpora to improve the overall performance of lexical complexity prediction. [Methods] This study proposes a multi-domain lexical complexity prediction model. A feature generalization module is designed to adapt the model to different domains; in a subsequent fine-tuning task, the model learns to predict lexical complexity. A feature fusion module explores the combined contribution of hand-crafted features and deep features extracted by neural networks. [Results] On the LCP-2021 dataset, compared with the best published results, our model improved the Pearson correlation coefficient by 0.0148 and reduced MAE and MSE by 0.0017 and 0.0004, respectively; however, the Spearman correlation coefficient and the R² coefficient decreased by 0.0038 and 0.0255, respectively. Integrating hand-crafted features produced no significant change. When transferred to the CWI-2018 dataset, our model reduced MAE in three new corpus domains by 0.0086, 0.0209, and 0.0174 relative to the published baseline results. [Limitations] Simple vector concatenation could not effectively integrate the hand-crafted and deep features, and the algorithm chosen for the feature generalization module has certain limitations. Constructing a comprehensive multi-domain dataset remains a direction for further work. [Conclusions] Integrating different corpora helps improve the model's overall performance in new domains.
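The feature fusion step mentioned in [Methods] and [Limitations] can be sketched as plain vector concatenation. The sketch below is illustrative only: the feature dimensions, function name, and example hand-crafted features are assumptions, not the paper's actual configuration.

```python
import numpy as np

def fuse_features(deep_feats: np.ndarray, hand_feats: np.ndarray) -> np.ndarray:
    """Fuse deep and hand-crafted features by simple concatenation.

    deep_feats: (batch, d_deep) embeddings from a neural encoder
    hand_feats: (batch, d_hand) hand-crafted features (e.g., word length,
                frequency); all dimensions here are illustrative assumptions.
    """
    # Concatenate along the feature axis; the fused vector is then fed
    # to a downstream regression head that predicts complexity scores.
    return np.concatenate([deep_feats, hand_feats], axis=-1)

# Illustrative shapes: 768-dim deep features, 5 hand-crafted features.
deep = np.random.randn(4, 768)
hand = np.random.randn(4, 5)
fused = fuse_features(deep, hand)
print(fused.shape)  # (4, 773)
```

Because concatenation treats the two feature groups as independent blocks, it cannot model interactions between them, which is consistent with the limitation the abstract reports.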