Unsupervised word segmentation method in power domain based on BERT
At present, some word segmentation tools achieve good results in the general domain; in the power domain, however, related texts are scarce, labeled data is missing, and the cost of manual labeling is high. To overcome these difficulties, this paper proposes an unsupervised word segmentation tool based on BERT, which adopts the Masked Language Model (MLM). Based on the feature codes that BERT computes for partially masked sentences, the similarity of each part of a sentence is measured, and parts with low similarity are split apart. An N-Gram model then merges the over-segmented results to realize unsupervised word segmentation in the power domain. Experimental results show that the proposed method is superior to existing general-domain word segmentation tools, especially in the power domain.
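The splitting step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: in the actual method, each position's feature code would come from BERT's MLM encoding of the partially masked sentence; here the vectors are supplied directly as toy inputs so that only the similarity-based splitting logic is shown. The function names `cosine` and `split_by_similarity` and the threshold value are assumptions for illustration.

```python
import math

def cosine(u, v):
    # Cosine similarity between two feature vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def split_by_similarity(chars, vectors, threshold=0.5):
    """Insert a word boundary between adjacent characters whose
    masked-sentence feature vectors have cosine similarity below
    `threshold`; adjacent characters with high similarity stay in
    the same word. (An N-Gram pass would then merge any pieces
    that were split too finely.)"""
    words, current = [], [chars[0]]
    for i in range(1, len(chars)):
        if cosine(vectors[i - 1], vectors[i]) < threshold:
            words.append("".join(current))
            current = [chars[i]]
        else:
            current.append(chars[i])
    words.append("".join(current))
    return words

# Toy example: the first two positions resemble each other, as do the
# last two, with a sharp similarity drop in the middle.
toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(split_by_similarity(list("abcd"), toy_vectors))  # → ['ab', 'cd']
```

With real BERT feature codes, characters belonging to the same domain term (e.g. a power-grid technical word) would tend to produce high pairwise similarity and thus stay unsplit.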
power text; Chinese word segmentation; unsupervised learning; BERT; MLM