Research on the Application of a Title Similarity Calculation Model in Quality Control of Special-Collection Literature Data

金光龙¹, 张光照¹, 张银玲¹, YANG Fan²

Author information

1. Guizhou University of Finance and Economics Library, Guiyang, Guizhou 550025
2. Guizhou University of Finance and Economics Library 550025

Abstract

In building special-collection literature resources, acquisition pre-orders suffer from non-standard metadata descriptions, incomplete fields, irregular data entry, and a wide range of acquisition channels, all of which make duplicate checking difficult. This paper proposes a duplicate-checking model based on title similarity: after data preprocessing, word2vec is used to extract a feature vector for each title, and the cosine similarity between titles is computed to identify duplicate documents. The experimental results show that the model performs well and offers a feasible reference for building special-collection literature resources in libraries.
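The pipeline described in the abstract (preprocess the title, map it to a vector, compare vectors by cosine similarity) can be illustrated with a minimal sketch. Everything concrete here is an assumption for illustration only: the toy embedding table `TOY_VECTORS` stands in for vectors that a trained word2vec model would supply, averaging word vectors is one common way to pool them into a title vector (the abstract does not say how the authors do this), the regex tokenizer stands in for whatever preprocessing and Chinese word segmentation the paper uses, and the `threshold` value is hypothetical.

```python
import math
import re

# Hypothetical toy embeddings standing in for vectors learned by word2vec.
TOY_VECTORS = {
    "guizhou": [0.9, 0.1, 0.0],
    "folk": [0.1, 0.8, 0.2],
    "culture": [0.2, 0.7, 0.3],
    "history": [0.0, 0.3, 0.9],
    "economy": [0.7, 0.0, 0.6],
}


def preprocess(title):
    """Lowercase a title and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", title.lower())


def title_vector(title, vectors=TOY_VECTORS):
    """Average the vectors of known tokens (assumed pooling step).

    Returns None when no token of the title is in the vocabulary.
    """
    known = [vectors[tok] for tok in preprocess(title) if tok in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_duplicate(title_a, title_b, threshold=0.95):
    """Flag two titles as likely duplicates (threshold is hypothetical)."""
    vec_a, vec_b = title_vector(title_a), title_vector(title_b)
    if vec_a is None or vec_b is None:
        return False
    return cosine_similarity(vec_a, vec_b) >= threshold
```

With these toy vectors, two records whose titles differ only in case, spacing, or punctuation map to the same vector and are flagged, while titles on unrelated topics fall below the threshold. In practice the threshold would have to be tuned on labeled acquisition records.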

Key words

special collection/metadata/title duplicate checking/word2vec/cosine similarity

Funding

2022 Guizhou University of Finance and Economics university-level project (2022KYYB14)

Publication year

2024

长江信息通信 (Changjiang Information & Communications)

湖北通信服务公司 (Hubei Communication Services Company)

Impact factor: 0.338

ISSN: 2096-9759

Number of references: 7
