Multi-view webpage classification dataset construction and evaluation
Webpage classification is an important task in Internet data mining and plays a crucial role in information retrieval, recommendation systems, and knowledge discovery. However, existing public webpage datasets are scarce, drawn from single sources, and carry insufficient information, which hinders the development of webpage classification techniques. To address these issues, we propose Web-Minds, a publicly available webpage classification dataset that incorporates multi-view features and is built through a three-step "collection-processing-annotation" process. Specifically, relevant webpage data are first collected and integrated from the open Internet. A webpage parsing tool is then employed to extract and clean multi-view information from the collected data, including text, structure, and keywords. Finally, we design an annotation strategy that combines a large language model with a human in the loop to assign two types of labels: webpage type and webpage topic. Furthermore, we establish an algorithmic evaluation benchmark on the Web-Minds dataset, covering machine learning, text classification, and webpage classification methods. The results demonstrate that, compared with using single-view features alone, the comprehensive use of multi-view features significantly improves algorithm accuracy, with increases of 5.49% and 5.61% on the webpage type and webpage topic classification tasks, respectively.
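To make the "processing" step concrete, the following is a minimal sketch of how text, structure, and keyword views might be pulled from one HTML page. It is not the parsing tool used to build Web-Minds: the class `MultiViewParser`, the function `extract_views`, the choice of tag counts as the structure view, and the use of the `keywords` meta tag are all assumptions made for illustration, using only Python's standard `html.parser` module.

```python
from collections import Counter
from html.parser import HTMLParser


class MultiViewParser(HTMLParser):
    """Hypothetical parser collecting three views: text, structure, keywords."""

    SKIP = {"script", "style"}  # element content that is not visible page text

    def __init__(self):
        super().__init__()
        self.text_parts = []         # text view: visible text fragments
        self.tag_counts = Counter()  # structure view (assumed): tag frequencies
        self.keywords = []           # keyword view (assumed): <meta name="keywords">
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        self.tag_counts[tag] += 1
        if tag in self.SKIP:
            self._skip_depth += 1
        if tag == "meta":
            attr = dict(attrs)
            if (attr.get("name") or "").lower() == "keywords":
                content = attr.get("content") or ""
                self.keywords += [k.strip() for k in content.split(",") if k.strip()]

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that lies outside script/style elements.
        if self._skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())


def extract_views(html):
    """Return a dict holding the three illustrative views of one page."""
    parser = MultiViewParser()
    parser.feed(html)
    parser.close()
    return {
        "text": " ".join(parser.text_parts),
        "structure": dict(parser.tag_counts),
        "keywords": parser.keywords,
    }


if __name__ == "__main__":
    sample = (
        '<html><head><title>Shop</title>'
        '<meta name="keywords" content="shoes, sale">'
        '<style>p{color:red}</style></head>'
        '<body><h1>Big Sale</h1><p>Shoes on offer.</p></body></html>'
    )
    print(extract_views(sample))
```

Each view could then be encoded separately and fused by a classifier, which is the general idea behind comparing single-view and multi-view features; the actual feature encoding used in the benchmark is not specified here.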