Automated Kaomoji Extraction Based on Large-scale Danmaku Texts

As a new type of emoticon that emerged in the Internet age, kaomoji are popular among Internet users and mainstream media, are widely used in online text, and have unique value in emotional expression, cultural promotion, and other areas. Because kaomoji carry rich semantic and emotional information, taking them into account when studying online text can improve the analysis and understanding of that text and, in turn, the performance of various natural language processing tasks. Detecting and extracting kaomoji from text is the first step in any such analysis. However, because kaomoji are structurally flexible, highly varied, and quickly replaced by new forms, most existing work lacks a comprehensive analysis of kaomoji and suffers from low accuracy, difficulty in determining boundaries, and poor timeliness. Through an in-depth analysis of kaomoji features, this paper proposes Emoly, a kaomoji detection and extraction algorithm based on large-scale danmaku text. Emoly extracts preliminary candidate strings through preprocessing, combines several improved statistical indicators with filtering rules to select the final candidate strings, and ranks them by text similarity to produce the final output. Experimental results show that Emoly achieves a recall of 91% on a million-scale danmaku text dataset, detecting and extracting kaomoji from the text comprehensively and accurately, and that it demonstrates robustness, superiority, and generality. The algorithm also offers new ideas and methods for tasks such as Chinese word segmentation, sentiment analysis, and input method dictionary updates, giving it broad application value.
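The three stages named in the abstract (candidate extraction, statistical filtering, similarity ranking) can be illustrated with a toy sketch. This is not the Emoly algorithm: the abstract does not specify its statistical indicators, filtering rules, or similarity measure, so everything below is an assumption made for illustration. The candidate pattern, the `min_count` threshold, the repeated-character rule, the seed list, and the use of `difflib.SequenceMatcher` as the similarity score are all hypothetical stand-ins.

```python
import re
from collections import Counter
from difflib import SequenceMatcher

# Assumed candidate pattern: runs of 3+ characters that are not CJK ideographs,
# ASCII letters/digits, or whitespace. The paper's preprocessing may differ.
CANDIDATE = re.compile(r"[^\u4e00-\u9fffA-Za-z0-9\s]{3,}")


def extract_candidates(lines):
    """Count every candidate symbol run found in the danmaku lines."""
    counts = Counter()
    for line in lines:
        counts.update(CANDIDATE.findall(line))
    return counts


def is_repeated_char(s):
    """Assumed filtering rule: reject runs of one repeated character, e.g. '!!!'."""
    return len(set(s)) == 1


def rank(counts, seeds, min_count=2):
    """Keep frequent candidates, then sort by similarity to known seed kaomoji.

    A raw frequency threshold stands in for the paper's improved statistical
    indicators, and SequenceMatcher.ratio() stands in for its text similarity.
    """
    kept = [c for c, n in counts.items()
            if n >= min_count and not is_repeated_char(c)]

    def score(candidate):
        return max(SequenceMatcher(None, candidate, s).ratio() for s in seeds)

    return sorted(kept, key=score, reverse=True)


sample = ["太好了(≧▽≦)", "哈哈哈(≧▽≦)", "什么!!!", "!!!", "233333"]
ranked = rank(extract_candidates(sample), seeds=["(≧∇≦)", "(^_^)"])
```

In this sketch the fullwidth exclamation run is counted as a candidate but dropped by the repeated-character rule, while the digit-only comment never becomes a candidate at all. A real system would need boundary handling for kaomoji that contain letters or CJK characters, which this pattern deliberately ignores.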

Keywords: Natural language processing; Data analysis; Kaomoji; Video danmaku

MAO Xin (毛馨), LEI Zhanyao (雷瞻遥), QI Zhengwei (戚正伟)

School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai 200240, China

National Natural Science Foundation of China

62141218

2024

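Computer Science (计算机科学)
Chongqing Southwest Information Co., Ltd. (formerly the Southwest Information Center of the Ministry of Science and Technology)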

Indexed in: CSTPCD; Peking University Core Journals (北大核心)
Impact factor: 0.944
ISSN: 1002-137X
Year, volume (issue): 2024, 51(1)