
Research on Deduplication Algorithm and Semantic Quantization of Network Content

To reduce the impact of the website on users while improving its ability to remove duplicates, an innovative deduplication scheme applicable to large websites is designed. First, text preprocessing techniques are used to extract keywords and long-sentence feature codes from the main text of a web page. Second, the Simhash algorithm maps the feature codes to fingerprints, and an inverted index from keywords to documents is built. Finally, keywords are used to quickly locate documents that are highly similar to the document under test; comparing only the fingerprints of the document under test with those candidates determines whether the web page is a duplicate. The results show that the algorithm achieves a high recognition rate and good practicality.
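The abstract specifies the pipeline only at a high level. The following Python sketch illustrates one way the retrieval-then-compare step could work; the 64-bit fingerprint width, the MD5 feature hash, the Hamming-distance threshold of 3, the keyword/long-sentence weighting, and all names (simhash, InvertedDedupIndex, is_duplicate) are assumptions for illustration, not details taken from the paper.

```python
# Illustrative sketch of the pipeline in the abstract: Simhash fingerprints
# plus a keyword -> document inverted index for fast candidate retrieval.
# The 64-bit width, MD5 hash, weights, and threshold are all assumptions.
import hashlib
from collections import defaultdict


def simhash(features, bits=64):
    """Map (feature, weight) pairs to a single integer fingerprint."""
    v = [0] * bits
    for feature, weight in features:
        # Hash each feature to a 64-bit integer (MD5 is an arbitrary choice).
        h = int.from_bytes(hashlib.md5(feature.encode("utf-8")).digest()[:8], "big")
        for i in range(bits):
            v[i] += weight if (h >> i) & 1 else -weight
    return sum(1 << i for i in range(bits) if v[i] > 0)


def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


class InvertedDedupIndex:
    """Keyword -> documents inverted index over Simhash fingerprints."""

    def __init__(self, threshold=3):
        self.threshold = threshold        # max Hamming distance for "duplicate"
        self.fingerprints = {}            # doc_id -> fingerprint
        self.postings = defaultdict(set)  # keyword -> {doc_id}

    def is_duplicate(self, doc_id, keywords, long_sentences):
        # Features combine keywords and long-sentence feature codes;
        # weighting keywords twice as heavily is an assumed scheme.
        features = [(k, 2) for k in keywords] + [(s, 1) for s in long_sentences]
        fp = simhash(features)
        # Candidate retrieval: only documents sharing at least one keyword
        # are fetched, so the whole corpus is never scanned.
        candidates = set()
        for k in keywords:
            candidates |= self.postings.get(k, set())
        if any(hamming(fp, self.fingerprints[c]) <= self.threshold
               for c in candidates):
            return True
        # New document: store its fingerprint and index its keywords.
        self.fingerprints[doc_id] = fp
        for k in keywords:
            self.postings[k].add(doc_id)
        return False
```

For example, indexing a page and then submitting a near-identical page with the same keywords and long-sentence codes returns True on the second call, after comparing only the one retrieved candidate's fingerprint rather than every stored document; that narrowing is the role the keyword inverted index plays in the scheme.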

Keywords: web page deduplication; semantic quantification; characteristic fingerprint; long sentence; keyword

Xie Zhihao (谢志豪), Yang Xian (杨贤)


School of Electromechanical Engineering, Guangdong University of Technology, Guangzhou 510006, China


Journal: Modern Computer (现代计算机)
Publisher: 中大控股
ISSN: 1007-1423
Impact factor: 0.292
Year, Volume (Issue): 2024, 30(17)