Based on an analysis of the main sources and popular file formats of online academic literature, and of Heritrix's working principles, this paper develops a Heritrix-based program for acquiring online academic literature. It then designs and analyzes the overall program in detail, covering seed website selection, crawl task configuration, file type and file size filtering, and academic literature determination. Experiments, carried out by building an experimental platform and writing programs, verify the feasibility of this program, and the paper points out directions for future research.
Heritrix; academic literature; file format; PDF document; crawl