Introducing a Large-Scale Dataset for Vietnamese POS Tagging on Conversational Texts

Abstract: This paper introduces a large-scale human-labeled dataset for the Vietnamese POS tagging task on conversational texts. To this end, we propose a new tagging scheme (with 36 POS tags) that includes exclusive tags for special phenomena of conversational words, develop an annotation guideline, and manually annotate 16.310K sentences using this guideline. Based on this corpus, we evaluate a series of state-of-the-art tagging methods to estimate their performance. Experimental results show that the Conditional Random Fields model using both features learnt automatically by deep neural networks and handcrafted features yields the best performance. This model achieves an accuracy of 93.36%, which is 1.6% and 2.7% higher than the models using only handcrafted features or only automatically learnt features, respectively. It is also 0.94% higher in accuracy than a fine-tuned BERT model. The performance measured on each POS tag is also high, with an F1 score above 90% for 20 POS tags and above 80% for 11 POS tags. This work provides the public dataset and preliminary results for follow-up research in this direction.

Keywords:

Vietnamese POS tagging; conversational texts; CRF; neural networks

Authors:

Oanh Thi Tran, Tu Minh Pham, Vu Hoang Dang, Bang Ba Xuan Nguyen

Author affiliations:

FPT Technology Research Institute - FPT University, 82 Duy Tan, Cau Giay, Hanoi, Vietnam; International School, Vietnam National University, Hanoi, 144 Xuan Thuy, Cau Giay, Hanoi, Vietnam

FPT Technology Research Institute - FPT University, 82 Duy Tan, Cau Giay, Hanoi, Vietnam

Conference name:

International Conference on Language Resources and Evaluation

Conference location:

Marseille (FR)

Proceedings:

Twelfth International Conference on Language Resources and Evaluation

Pages:

3913-3921

Publication year:

2020