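User Gender Classification in Chinese Microblog
WANG Jingjing (王晶晶) 1, LI Shoushan (李寿山) 1, HUANG Lei (黄磊) 1
Author information
- 1. Natural Language Processing Lab, School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu 215006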
Abstract
This paper studies gender classification of Chinese microblog users, that is, identifying the gender of a registered user from the Chinese text available on the microblog platform. Although gender classification on microblogs has received some attention, work targeting Chinese remains scarce. First, the paper builds two classifiers, one trained on user names and one on the messages posted by users, and analyzes different feature types (e.g., character features and word features). Second, building on these two classifiers, a Bayesian fusion method is used to combine them, so that both sources of textual information contribute to the gender prediction. Experimental results show that the approach achieves high recognition accuracy, and that the fused classifier clearly outperforms classifiers trained on user names alone or on messages alone.
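The fusion step can be illustrated with a minimal sketch. The abstract does not give the exact fusion formula, so the code below assumes one common form of Bayesian combination: the two classifiers' posteriors are treated as conditionally independent given the class, which gives P(c | name, text) ∝ P(c) · [P(c | name) / P(c)] · [P(c | text) / P(c)]. The function name `bayes_fuse`, the class labels, and all probability values are hypothetical and chosen only for illustration.

```python
def bayes_fuse(posteriors, prior):
    """Fuse per-classifier posteriors P(c | x_k) into one posterior.

    Assumes the classifiers' inputs are conditionally independent given
    the class (an assumption of this sketch, not a claim about the paper).
    """
    scores = {}
    for c, p_c in prior.items():
        score = p_c
        for p in posteriors:
            # Dividing by the prior turns each posterior into a
            # likelihood ratio, so the prior is counted only once.
            score *= p[c] / p_c
        scores[c] = score
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}


# Illustrative values only: hypothetical outputs of a user-name
# classifier and a message-text classifier for one user.
prior = {"male": 0.5, "female": 0.5}
p_name = {"male": 0.6, "female": 0.4}
p_text = {"male": 0.3, "female": 0.7}

fused = bayes_fuse([p_name, p_text], prior)
predicted = max(fused, key=fused.get)
```

In this toy case the two classifiers disagree, and the fused posterior follows the more confident one. With a uniform prior the rule reduces to multiplying the two posteriors and renormalizing.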
Keywords
gender classification / Sina Weibo / text classification / social media
Publication year
2014