Design of Multi-modal Efficient Bad Web Page Clustering Algorithm Based on HDBSCAN
Since the start of the 21st century, a large number of Web pages have been built on the Internet, providing people with various types of information. This has not only accelerated information exchange but also greatly reduced the cost of information circulation. At the same time, a large number of bad Web pages are constantly emerging. However, the identification of bad Web pages still relies mostly on manual recognition, which cannot cope with their large-scale emergence. This paper proposes a multi-modal efficient bad Web page clustering algorithm based on HDBSCAN. HDBSCAN is first used to preliminarily cluster bad Web page images. The preliminary clustering results are then combined with multiple information elements, such as the text information and structure information of the bad Web pages, to further classify and merge them. Similar Web pages are merged into a large and complete set of images. The experimental results show that, compared with HDBSCAN, the improved clustering algorithm improves the clustering quality, achieves better clustering effects, and significantly improves the processing efficiency for bad websites.
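The second stage described above, in which preliminary image clusters are merged using text and structure information, can be illustrated with a minimal sketch. It is not the paper's implementation. It assumes the per-page cluster labels have already been produced by an HDBSCAN run on image features (with -1 marking noise), and it assumes that each page's text and structure are represented as sets of tokens. The function names, the Jaccard similarity, the equal weighting of the two modalities, and the threshold of 0.5 are all illustrative assumptions.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as dissimilar."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def merge_clusters(labels, text_tokens, struct_tokens, threshold=0.5):
    """Merge preliminary image clusters whose text and structure profiles agree.

    labels        : per-page cluster id from the image stage (-1 = noise)
    text_tokens   : per-page set of text terms (hypothetical feature)
    struct_tokens : per-page set of structure markers such as tag paths
                    (hypothetical feature)
    threshold     : assumed cut-off on the averaged similarity
    """
    # Build one text profile and one structure profile per cluster.
    text_profile, struct_profile = {}, {}
    for label, text, struct in zip(labels, text_tokens, struct_tokens):
        if label == -1:
            continue  # noise pages are left untouched
        text_profile.setdefault(label, set()).update(text)
        struct_profile.setdefault(label, set()).update(struct)

    # Union-find over cluster ids, so that merges are transitive.
    parent = {label: label for label in text_profile}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in combinations(sorted(parent), 2):
        # Equal weighting of the two modalities is an assumption.
        score = 0.5 * jaccard(text_profile[a], text_profile[b]) \
            + 0.5 * jaccard(struct_profile[a], struct_profile[b])
        if score >= threshold:
            ra, rb = find(a), find(b)
            if ra != rb:
                parent[max(ra, rb)] = min(ra, rb)  # keep the smaller id

    return [label if label == -1 else find(label) for label in labels]


# Toy input: clusters 0 and 1 share text and structure, cluster 2 does not.
labels = [0, 0, 1, 2, -1]
text = [{"casino", "bonus"}, {"casino", "bet"}, {"casino", "bonus", "bet"},
        {"pharmacy", "pills"}, {"news"}]
struct = [{"div/form", "div/img"}, {"div/form"}, {"div/form", "div/img"},
          {"table/tr"}, {"p"}]
merged = merge_clusters(labels, text, struct)
print(merged)
```

In this toy input, the image stage has split visually different pages of the same site family into clusters 0 and 1. Their matching text and structure profiles cause them to be merged, while the unrelated cluster 2 and the noise page keep their labels.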