Design of a Multi-modal Efficient Bad Web Page Clustering Algorithm Based on HDBSCAN

Since the start of the 21st century, a vast number of Web pages have been built on the Internet. They provide people with all kinds of information, speeding up information exchange and greatly reducing the cost of circulating it. At the same time, bad websites keep emerging in large numbers, yet bad Web pages are still identified mostly by hand, which cannot keep pace with their large-scale appearance. This paper therefore proposes a multi-modal efficient bad Web page clustering algorithm based on HDBSCAN. HDBSCAN first performs a preliminary clustering of the images on bad Web pages. The preliminary clusters are then further classified and merged using several additional information elements, such as the text and structure of the bad Web pages, so that similar Web pages are merged into one large and comprehensive image set. Experimental results show that, compared with HDBSCAN alone, the improved algorithm achieves higher clustering quality and a better clustering effect, and markedly improves the efficiency of handling bad websites.
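
The second stage described in the abstract, merging preliminary image clusters with the help of text and structure information, could be sketched roughly as follows. This is only an illustration under stated assumptions and not the authors' implementation: the abstract does not say which features, similarity measure, or thresholds the paper uses. Here the preliminary labels are assumed to come from an HDBSCAN run on image features (with -1 marking noise), text and structure are represented as plain sets, similarity is a weighted Jaccard score, and the names `merge_clusters`, `threshold`, and `w_text` are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def merge_clusters(labels, text_tokens, dom_tags, threshold=0.5, w_text=0.5):
    """Merge preliminary image clusters whose pages look alike.

    labels      : preliminary cluster label per page (-1 = noise)
    text_tokens : set of text tokens per page (assumed representation)
    dom_tags    : set of structural features per page (assumed representation)
    threshold, w_text : hypothetical tuning parameters
    """
    # Pool the text and structure features of each preliminary cluster.
    pooled = {}
    for label, tokens, tags in zip(labels, text_tokens, dom_tags):
        if label == -1:
            continue  # noise pages are left unassigned
        text, struct = pooled.setdefault(label, (set(), set()))
        text |= tokens
        struct |= tags

    # Union-find over cluster labels, so merges are transitive.
    parent = {label: label for label in pooled}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in combinations(sorted(pooled), 2):
        score = (w_text * jaccard(pooled[a][0], pooled[b][0])
                 + (1 - w_text) * jaccard(pooled[a][1], pooled[b][1]))
        if score >= threshold:
            ra, rb = find(a), find(b)
            parent[max(ra, rb)] = min(ra, rb)  # keep the smaller label as root

    return [label if label == -1 else find(label) for label in labels]


# Toy data: clusters 0 and 1 share text and structure, cluster 2 does not.
labels = [0, 0, 1, 2, -1]
text_tokens = [{"casino", "bonus"}, {"casino", "bet"},
               {"casino", "bonus", "bet"}, {"pharmacy", "pills"}, {"news"}]
dom_tags = [{"div.banner", "form#login"}, {"div.banner"},
            {"div.banner", "form#login"}, {"table.list"}, {"article"}]
merged = merge_clusters(labels, text_tokens, dom_tags)
print(merged)  # [0, 0, 0, 2, -1]
```

In this toy run, preliminary clusters 0 and 1 are folded into one group because their pooled text and structure sets coincide, while cluster 2 and the noise page are left as they were.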

Keywords: HDBSCAN; multi-modal; bad Web pages; clustering

史磊、邓桂英、张恒、刘宇、肖建芳

China Internet Network Information Center, Beijing 100190

2024

Microcomputer Applications (微型电脑应用)
Shanghai Microcomputer Applications Society

CSTPCD
Impact factor: 0.359
ISSN: 1007-757X
Year, volume (issue): 2024, 40(6)