Chinese hate speech detection method based on the replacement of homophonic noise words
Hate speech on social media often contains creatively disguised homophonic noise words. Existing methods adapt poorly to this phenomenon, lag behind it, and cannot meet the requirements of real-time detection. To address this issue, a Chinese hate speech detection method is proposed that mines the original words and uses them to replace the homophonic noise words, thereby resolving the lag of earlier methods. First, the texts are preprocessed, and candidate noise words are extracted with N-grams and filtered using pointwise mutual information and branch entropy. Then, homophonic noise words and their candidate original words are identified by calculating phonetic similarity. The original words are determined from syntactic structure and contextual semantic similarity, and the homophonic noise words are replaced accordingly. The replaced texts are then fed into the classification layer. Finally, RoBERTa-wwm-ext is used to extract semantic features and Softmax to compute the hate sentiment tendency, completing the detection task. Experimental results on the public COLDataset show that the proposed model effectively improves the performance of Chinese hate speech detection.
hate speech detection; homophonic noise words; phonetic similarity; syntactic structure; contextual semantics; RoBERTa-wwm-ext; CNN; N-gram
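
The candidate-filtering step described in the abstract (N-gram extraction followed by pointwise mutual information and branch entropy) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`ngram_stats`, `pmi`, `branch_entropy`, `filter_candidates`), the restriction to character bigrams, the use of the smaller of the left and right neighbour entropies, and the threshold values are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def ngram_stats(texts, n=2):
    """Count characters, character n-grams, and each n-gram's left/right neighbours."""
    char_counts = Counter()
    gram_counts = Counter()
    left = defaultdict(Counter)
    right = defaultdict(Counter)
    for text in texts:
        char_counts.update(text)
        for i in range(len(text) - n + 1):
            gram = text[i:i + n]
            gram_counts[gram] += 1
            if i > 0:
                left[gram][text[i - 1]] += 1
            if i + n < len(text):
                right[gram][text[i + n]] += 1
    return char_counts, gram_counts, left, right


def pmi(gram, char_counts, gram_counts):
    """Pointwise mutual information of a bigram: log2(p(xy) / (p(x) * p(y)))."""
    total_chars = sum(char_counts.values())
    total_grams = sum(gram_counts.values())
    p_xy = gram_counts[gram] / total_grams
    p_x = char_counts[gram[0]] / total_chars
    p_y = char_counts[gram[1]] / total_chars
    return math.log2(p_xy / (p_x * p_y))


def entropy(neighbours):
    """Shannon entropy (in bits) of a neighbour-character distribution."""
    total = sum(neighbours.values())
    if total == 0:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in neighbours.values())


def branch_entropy(gram, left, right):
    """Assumed definition: the smaller of the left and right neighbour entropies."""
    return min(entropy(left[gram]), entropy(right[gram]))


def filter_candidates(texts, min_pmi=1.0, min_entropy=1.0):
    """Keep bigrams that are internally cohesive (PMI) and occur in varied contexts."""
    char_counts, gram_counts, left, right = ngram_stats(texts, n=2)
    return [
        gram
        for gram in gram_counts
        if pmi(gram, char_counts, gram_counts) >= min_pmi
        and branch_entropy(gram, left, right) >= min_entropy
    ]


# Toy corpus (hypothetical): only "bx" recurs with varied neighbours on both sides.
candidates = filter_candidates(["abxy", "cbxd", "ebxf"])
```

In this sketch a high PMI indicates that the two characters co-occur more often than chance, and a high branch entropy indicates that the string appears in varied contexts, which together suggest a free-standing word. The surviving candidates would then go on to the phonetic-similarity step.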