Improved crawler algorithm based on hierarchical structure preservation
To improve the validity of the Web pages grabbed by a Web crawler, this paper proposes an improved Web crawler algorithm that obtains more useful information by combining hierarchical structure preservation with a URL filter mode. The proposed algorithm saves a website's URLs hierarchically so as to store the site's overall topology, which turns the crisscrossing Web URL system from a graph structure into a tree structure. Experiments on a real BBS website show that the algorithm outperforms the basic Web crawler algorithm in both crawling speed and the usefulness of the downloaded information. Furthermore, it provides a workable structure mode for incremental crawler algorithms. As a result, the hierarchical structure strategy and the URL filter improve the Web-grabbing capability of the Web crawler algorithm at a low computational cost.
Web crawler; URL filter; hierarchical structure preservation; frequent mode
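
The abstract does not give the algorithm's details, so the following is only a minimal sketch of the general idea it describes: filter URLs, then record each accepted URL under the page where it was first discovered, so that a cyclic link graph is stored as a tree. The names `url_allowed`, `build_url_tree`, `LINKS`, the same-host rule, the blocked suffix list, and the example.com URLs are all illustrative assumptions rather than the paper's method. The link graph is held in memory so that the example runs without network access.

```python
from collections import deque
from urllib.parse import urlparse


def url_allowed(url, host, blocked_suffixes=(".jpg", ".png", ".css", ".js")):
    """Hypothetical URL filter: keep same-host URLs that are not static assets."""
    parsed = urlparse(url)
    if parsed.netloc != host:
        return False
    return not parsed.path.lower().endswith(blocked_suffixes)


def build_url_tree(root, links):
    """Breadth-first traversal that stores each URL under its first-discovered parent.

    `links` maps a page URL to the URLs it links to (a stand-in for fetching
    and parsing the page). Back links and cross links to URLs that are already
    stored are skipped, so the result is a tree rather than a graph.
    """
    host = urlparse(root).netloc
    parent = {root: None}    # URL -> parent URL (None for the root)
    children = {root: []}    # URL -> child URLs in discovery order
    queue = deque([root])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target in parent or not url_allowed(target, host):
                continue
            parent[target] = page
            children[page].append(target)
            children[target] = []
            queue.append(target)
    return parent, children


# Toy link graph with a cycle, an image link, and an off-site link (all assumed).
LINKS = {
    "http://bbs.example.com/": [
        "http://bbs.example.com/board/1",
        "http://bbs.example.com/logo.png",
        "http://ads.example.net/x",
    ],
    "http://bbs.example.com/board/1": [
        "http://bbs.example.com/thread/7",
        "http://bbs.example.com/",
    ],
    "http://bbs.example.com/thread/7": ["http://bbs.example.com/board/1"],
}

parent, children = build_url_tree("http://bbs.example.com/", LINKS)
```

In this sketch the parent map is what preserves the hierarchy: any stored URL can be traced back to the root, and a later incremental crawl could in principle revisit a single subtree rather than the whole site.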