Multimedia Tools and Applications, 2024, Vol. 83, Issue 42: 89607-89633. DOI: 10.1007/s11042-024-18791-y

CatRevenge: towards effective revenge text detection in online social media with paragraph embedding and CATBoost

Sayani Ghosal (1), Amita Jain (2)

Author information

  • 1. NSUT East Campus (Erstwhile A.I.A.C.T.R.), Guru Gobind Singh Indraprastha University, Dwarka, Delhi, India; KIET Group of Institutions, Ghaziabad, Delhi-NCR, India
  • 2. Netaji Subhas University of Technology, New Delhi, India

Abstract

Internet users produce and consume huge amounts of data, most of it in natural language, and they express their feelings, emotions, and thoughts on social media. It is the responsibility of the social media provider to provide a healthy communication system among users. Detecting revenge in social media text is very challenging because, in long sentences, the semantic relations between tokens dissolve. As a result, social media providers have paid no attention to identifying the users who spread revenge. This article proposes a novel model named CatRevenge, which identifies both active and passive revenge. The model preprocesses text with the Slangzy internet slang meaning dictionary to detect revenge text more efficiently. CatRevenge assigns an impact weight to each part of speech in a sentence based on its relevance and the TF-IDF scores of the words. The model also uses a paragraph embedding model for contextual semantic analysis of revenge text. In addition, this research applies the gradient boosting CATBoost classifier with categorical features to reduce model overfitting. Its feature ranking method can reduce the dimensionality of the data by ranking the most significant features. This research uses an English-language dataset of revenge posts from the Reddit social media platform, evaluated with binary and multiclass classification. Results demonstrate improved performance, with a 6-10% increase in binary classification and a 2.5-5% increase in multiclass classification on the weighted F1 metric.
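
The weighted F1 metric mentioned in the abstract is the standard support-weighted average of per-class F1 scores. The following is a minimal sketch of that general definition in plain Python, not the authors' evaluation code; the function name `weighted_f1` and the example labels are assumptions made for illustration only.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted average of per-class F1 scores.

    Illustrative helper (hypothetical name); each class's F1 is weighted
    by the fraction of true samples that belong to that class.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    support = Counter(y_true)  # number of true samples per class
    total = len(y_true)
    score = 0.0
    for label, n_true in support.items():
        # True positives: predicted as this class and actually this class.
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        n_pred = sum(1 for p in y_pred if p == label)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        score += (n_true / total) * f1
    return score


# Hypothetical three-class labels, chosen only to exercise the function.
example_true = ["active", "active", "passive", "none", "none", "none"]
example_pred = ["active", "passive", "passive", "none", "none", "active"]
example_score = weighted_f1(example_true, example_pred)
```

Because each class is weighted by its support, this average reflects class imbalance, which is one common reason to report it for both binary and multiclass settings.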

Key words

Online Social Network / Paragraph Embedding / Text Classification / CATBoost / Natural Language Processing

Publication year

2024
Multimedia Tools and Applications

EI, SCI
ISSN:1380-7501