Construction and Performance Evaluation of a Multi-View Webpage Classification Dataset

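Chinese abstract: Webpage classification is an important task in Internet data mining and plays a key role in information search, recommendation systems, and knowledge discovery. However, existing public webpage datasets lack multi-view information and are poorly suited to webpage classification tasks that involve complex features. To address this, following a "collect-process-annotate" construction pipeline, the authors propose Web-Minds, a webpage dataset covering multi-view features such as text semantics and webpage structure; it contains 21,828 webpages from more than 600 portal websites. First, relevant webpage data are collected from the open Internet through keyword search. Next, a webpage parsing tool is used to extract and clean multi-view information from the collected data, including text, the DOM structure tree, and keywords. Finally, a joint annotation strategy that combines a large language model with a human-in-the-loop process produces two kinds of labels: webpage type and webpage topic. On this basis, a range of machine learning, text classification, and webpage classification algorithms are evaluated on Web-Minds. The results show that combining multi-view features effectively improves accuracy: compared with using single-view features only, accuracy increases by 5.49% and 5.61% on the webpage type and webpage topic classification tasks, respectively.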

English title: Multi-view webpage classification dataset construction and evaluation

English abstract: Webpage classification is an important task in Internet data mining, playing a crucial role in information retrieval, recommendation systems, and knowledge discovery. However, existing public webpage datasets suffer from limitations such as scarcity, single sources, and insufficient information, which hinder the development of webpage classification techniques. To address these issues, we propose Web-Minds, a publicly available dataset for webpage classification that incorporates multi-view features, built through a three-step "collection-processing-annotation" process. Specifically, the relevant webpage data are first collected and integrated from the open Internet. Then, a webpage parsing tool is employed to extract and clean multi-view information from the collected data, including text, structure, and keywords. Finally, we design a joint annotation strategy that combines a large language model with a "human-in-the-loop" process to assign two types of labels, namely webpage type and webpage topic. Furthermore, we establish an algorithmic evaluation benchmark based on the Web-Minds dataset, covering machine learning, text classification, and webpage classification methods. The results demonstrate that, compared with using single-view features alone, the comprehensive use of multi-view features improves algorithm accuracy, with increases of 5.49% and 5.61% on the webpage type and topic classification tasks, respectively.

English keywords:

webpage dataset; webpage classification; text classification; data mining; deep learning

Authors:

孙辰星、刘伟、卢彬、梁诗宇、诸云强、甘小莺

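Author affiliations:

School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai 200240

Institute of Geographic Sciences and Natural Resources Research, Chinese Academy of Sciences, Beijing 100101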

Keywords:

webpage dataset; webpage classification; text classification; data mining; deep learning

Funding:

National Key Research and Development Program of China; National Natural Science Foundation of China (five grants)

Project numbers:

2022YFB3904204; 62272301; 42050105; 62020106005; 62061146002; 61960206002

Publication year:

2024

DOI:

10.13232/j.cnki.jnju.2024.03.005

南京大学学报(自然科学版) (Journal of Nanjing University, Natural Science edition)

Nanjing University

Indexed in: CSTPCD; Peking University Core Journals (北大核心)

Impact factor: 0.756

ISSN: 0469-5097

Year, volume (issue): 2024, 60(3)