To reduce the impact of duplicate web pages on users and improve the ability of websites to remove them, a deduplication scheme applicable to large websites is designed. First, text preprocessing techniques are used to extract keywords and long-sentence feature codes from the web page content. Second, the Simhash algorithm is used to map the feature codes into fingerprints, and an inverted index is built from keywords to documents. Finally, documents that are highly similar to the test document are retrieved quickly through the keywords, and the fingerprint of the test document is then compared only with those of the similar documents to determine whether the web page is a duplicate. The results show that the algorithm has a high recognition rate and good practicality.
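The three steps summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the hash function, the fingerprint length, the sentence-length cutoff, or the distance threshold, so the 64-bit MD5-based hash, `min_len=20`, `max_distance=3`, and the names `long_sentences` and `DedupIndex` are all assumptions made for illustration. Keyword extraction is assumed to happen upstream, so keywords are passed in directly.

```python
import hashlib
import re
from collections import defaultdict

FINGERPRINT_BITS = 64  # assumed fingerprint length; not given in the abstract


def _hash64(token: str) -> int:
    """Stable 64-bit hash of a feature (first 8 bytes of its MD5 digest)."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def simhash(features: list[str]) -> int:
    """Unweighted Simhash: each feature votes +1/-1 on every bit position."""
    votes = [0] * FINGERPRINT_BITS
    for feature in features:
        h = _hash64(feature)
        for bit in range(FINGERPRINT_BITS):
            votes[bit] += 1 if (h >> bit) & 1 else -1
    fingerprint = 0
    for bit, vote in enumerate(votes):
        if vote > 0:
            fingerprint |= 1 << bit
    return fingerprint  # an empty feature list yields 0


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def long_sentences(text: str, min_len: int = 20) -> list[str]:
    """Hypothetical feature extractor: keep sentences of at least min_len chars."""
    parts = re.split(r"[.!?。！？]+", text)
    return [p.strip() for p in parts if len(p.strip()) >= min_len]


class DedupIndex:
    """Inverted index (keyword -> document ids) plus one fingerprint per document."""

    def __init__(self, max_distance: int = 3):
        self.max_distance = max_distance  # assumed duplicate threshold
        self.keyword_to_docs = defaultdict(set)
        self.fingerprints = {}

    def add(self, doc_id: str, keywords: list[str], text: str) -> None:
        self.fingerprints[doc_id] = simhash(long_sentences(text))
        for keyword in keywords:
            self.keyword_to_docs[keyword].add(doc_id)

    def find_duplicates(self, keywords: list[str], text: str) -> list[str]:
        """Compare fingerprints only against documents that share a keyword."""
        fingerprint = simhash(long_sentences(text))
        candidates = set()
        for keyword in keywords:
            candidates |= self.keyword_to_docs.get(keyword, set())
        return sorted(
            doc_id
            for doc_id in candidates
            if hamming_distance(fingerprint, self.fingerprints[doc_id])
            <= self.max_distance
        )
```

In this sketch the inverted index narrows the search to documents sharing at least one keyword with the test document, so the fingerprint comparison runs over a small candidate set rather than the whole collection. A page is reported as a duplicate when the Hamming distance between fingerprints is at most `max_distance`.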
Key words
web page deduplication/semantic quantification/feature fingerprint/long sentence/keyword