Research on deduplication algorithm and semantic quantization of network content (网络内容的去重算法与语义量化研究)

Wanfang Data

Abstract: To reduce the impact of websites on users while improving duplicate-removal capability, a new deduplication scheme that can be applied to large websites is designed. First, text preprocessing techniques extract keywords and long-sentence feature codes from the main content of each web page. Second, the Simhash algorithm maps the feature codes to fingerprints, and an inverted index from keywords to documents is built. Finally, the keywords are used to quickly locate documents that are highly similar to the document under test, so that only the fingerprints of the document under test and of those similar documents need to be compared to decide whether the web page is a duplicate. The results show that the algorithm achieves a high recognition rate and good practicality.

Keywords:

web page deduplication; semantic quantification; feature fingerprint; long sentence; keyword

Authors:

Xie Zhihao (谢志豪), Yang Xian (杨贤)

Affiliation:

School of Electromechanical Engineering, Guangdong University of Technology, Guangzhou 510006

Keywords (original Chinese):

网页去重; 语义量化; 特征指纹; 长句; 关键词

Publication year:

2024

DOI: