[Objective] To investigate the performance of unsupervised part-of-speech (POS) tagging models on low-resource languages. [Methods] We conduct low-resource POS tagging experiments with unsupervised POS tagging models, including the Gaussian hidden Markov model (Gaussian HMM, GHMM), mutual information maximization (MIM), and the conditional random field autoencoder (CRF-AE). Distilling previous work, we set up two low-resource scenarios on the English Penn Treebank: a few-sample scenario and a dictionary-labeling scenario. [Results] Unsupervised POS tagging models can surpass the conditional random field (CRF) model in the few-sample scenario, but consistently fall behind it in the dictionary-labeling scenario. [Conclusions] The unsupervised loss is better at modeling high-frequency words, which gives the models better performance in the few-sample scenario; meanwhile, it tends to produce a more uniform POS distribution, which lowers model performance in the dictionary-labeling scenario.
An empirical comparison and analysis of low-resource POS tagging approaches based on unsupervised models
[Objective] Part-of-speech (POS) tagging aims to assign each word in a sentence its grammatical category, i.e., a POS tag. While POS tagging models have advanced considerably in rich-resource settings, room for improvement remains in low-resource scenarios, including the few-sample scenario and the dictionary-labeling scenario. Previous research has primarily focused on enhancements from the perspective of training data, with limited attention paid to the model itself. In this paper, we tackle the issue from the model perspective and leverage unsupervised models so that unlabeled data can be exploited.
[Methods] Building on previous work, we set up a few-sample scenario and a dictionary-labeling scenario. We then selected several representative unsupervised POS tagging models: the Gaussian hidden Markov model (GHMM), mutual information maximization (MIM), and the conditional random field autoencoder (CRF-AE). By modifying their training objectives, we adapted these models to the two low-resource scenarios; a schematic sketch of such a combined objective is given after this abstract. In addition, we chose the traditional supervised POS tagging model, the conditional random field (CRF), as the baseline for comparison.
[Results] We conduct experiments on the Penn Treebank dataset, which is widely used for unsupervised POS tagging. In the few-sample scenario, MIM achieves the best performance under the smallest sample-size setting, and CRF-AE consistently outperforms CRF when pre-trained language models are not employed. However, as the sample size increases, the advantage of CRF-AE over CRF diminishes, and GHMM and MIM also gradually fall behind CRF. After pre-trained language models are applied to both CRF-AE and CRF, both models improve significantly, but CRF-AE continues to outperform CRF under the minimal and limited sample settings. In the dictionary-labeling scenario, CRF consistently achieves the best results across all settings, whereas the unsupervised POS tagging models remain consistently inferior. After incorporating pre-trained language models, the performance of CRF improves significantly, yet CRF-AE still lags substantially behind CRF. Overall, the unsupervised model CRF-AE performs better in the few-sample scenario, whereas in the dictionary-labeling scenario the supervised CRF consistently outperforms the unsupervised models by a significant margin. Furthermore, we analyze model performance with respect to word frequency. In the few-sample scenario, unsupervised POS tagging models outperform CRF on mid-frequency and high-frequency words, especially when training samples are extremely scarce. In the dictionary-labeling scenario, by contrast, although MIM achieves slightly higher accuracy than CRF on mid-frequency and high-frequency words under the smallest dictionary setting, the unsupervised models show no overall advantage in modeling high-frequency words. To further investigate this phenomenon, we examine the POS distributions predicted by the models. The distributions predicted by the unsupervised models deviate more from the manual annotations and are flatter overall. Notably, the POS distributions of GHMM and CRF-AE are close to horizontal lines, indicating near-uniform distributions that correspond to poor performance.
[Conclusions] In the two scenarios set up herein, unsupervised POS tagging models show distinct performance characteristics. They can surpass CRF in the few-sample scenario, but in the dictionary-labeling scenario CRF significantly outperforms them. By analyzing model accuracy on words of varying frequencies and the predicted POS distributions, we find that the unsupervised training loss tends to focus on modeling high-frequency words, which enables better prediction of high-frequency words in the few-sample scenario. However, the unsupervised loss also tends to produce a more uniform POS distribution, which degrades performance in the dictionary-labeling scenario. In future work, we aim to incorporate prior knowledge of the POS distribution into the unsupervised training loss to address these issues and improve overall performance.
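To make the adapted training objective concrete, the following is a minimal sketch of one way a supervised CRF loss can be combined with the unsupervised CRF-AE reconstruction loss in the few-sample setting. The notation is illustrative rather than the paper's own: $\mathcal{D}_L$ denotes the labeled sentences, $\mathcal{D}_U$ the unlabeled sentences, $\lambda$ a balancing weight, and $p_\phi(\hat{x}\mid y)$ the reconstruction decoder.

\[
\mathcal{L}(\theta,\phi) \;=\; -\sum_{(x,y)\in\mathcal{D}_L} \log p_\theta(y\mid x) \;-\; \lambda \sum_{x\in\mathcal{D}_U} \log \sum_{y} p_\theta(y\mid x)\, p_\phi(\hat{x}\mid y)
\]

The first term is the standard supervised CRF negative log-likelihood; the second marginalizes the CRF-AE word-reconstruction probability over latent tag sequences. Because the reconstruction term sums over every token of the unlabeled corpus, it is dominated by frequent words, which is consistent with the conclusion above that the unsupervised loss is better at modeling high-frequency words.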