Semi-supervised website topic classification based on heterogeneous graph neural network
The rapid growth in the number of Internet websites has made it challenging for existing methods to classify website topics accurately. URL-based methods, for example, struggle to handle topic information not reflected in the URL, while content-based methods are limited by data sparsity and by the difficulty of capturing semantic relationships. To address this, a semi-supervised website topic classification method based on a heterogeneous graph neural network, HGNN-SWT, is proposed. The method not only uses website text features to compensate for the limitations of URL-only features, but also models the sparse relationships between website text and words with a heterogeneous graph, improving classification performance by exploiting the node and edge relationships within the graph. The approach introduces a neighbor node sampling method based on random walks that considers both the local features of nodes and the global graph structure. Additionally, a feature fusion strategy is proposed to capture contextual relationships and feature interactions within website text data. Experimental results on a self-built Chinaz Website dataset show that HGNN-SWT achieves higher accuracy in website topic classification than existing methods.
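As a rough illustration of the idea of random-walk-based neighbor sampling on a text-word heterogeneous graph, the following minimal Python sketch may help. It is not the HGNN-SWT implementation: the function names `build_text_word_graph` and `sample_neighbors`, the whitespace tokenization, the parameters `num_walks`, `walk_length`, and `top_k`, and the choice to rank neighbors by visit frequency are all assumptions made for this example.

```python
import random
from collections import Counter, defaultdict


def build_text_word_graph(docs):
    """Build an undirected bipartite graph linking each website text to its words.

    Nodes are (type, name) tuples, so text nodes and word nodes stay distinct.
    Tokenization by whitespace is an assumption for illustration only.
    """
    graph = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            graph[("doc", doc_id)].add(("word", word))
            graph[("word", word)].add(("doc", doc_id))
    return graph


def sample_neighbors(graph, start, num_walks=50, walk_length=4, top_k=3, seed=0):
    """Pick up to top_k neighbors of `start` by random-walk visit frequency.

    Short walks keep the sample close to the start node (local features),
    while multi-hop steps let it reach nodes beyond direct neighbors
    (global structure). Ranking by visit count is a hypothetical choice.
    """
    rng = random.Random(seed)
    visits = Counter()
    for _ in range(num_walks):
        node = start
        for _ in range(walk_length):
            # Sort so that results are reproducible for a fixed seed.
            neighbors = sorted(graph.get(node, ()))
            if not neighbors:
                break
            node = rng.choice(neighbors)
            if node != start:
                visits[node] += 1
    return [node for node, _ in visits.most_common(top_k)]
```

Calling `sample_neighbors(graph, ("doc", "a"))` on a graph built from a few short texts returns a mix of word nodes and text nodes that share words with text `a`. Texts with no shared vocabulary are never sampled, because no walk can reach them.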