A Fusion Matrix-based Study on Text Clustering of Document Retrieval Results

Zhao Yueyang¹, Cui Lei²

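Author information

1. Library of Shengjing Hospital of China Medical University, Shenyang 110004
2. School of Health Management, China Medical University, Shenyang 110122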
Abstract

Purpose/Significance To address deficiencies in the semantic representation of medical texts and to cluster the retrieval results of the PubMed database. Method/Process The paper proposes a method for constructing a fusion matrix using the Jaccard coefficient and TF-IDF. Similarity relations between phrases, between documents, and between phrases and document contents are combined into the fusion matrix. Several clustering algorithms are then trained on it to group the set of PubMed retrieval results, and category labels are generated to describe the meaning of each cluster of documents. Result/Conclusion The fusion matrix-based clustering performs well, and the high-frequency words extracted to describe the categories distinguish their meanings well, so the method is effective for the task of clustering retrieval-result texts.
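The abstract does not say how the three similarity relations are assembled, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes a symmetric block layout in which phrase-phrase similarity is the Jaccard coefficient of the document sets in which two phrases occur, document-document similarity is the cosine of L2-normalised TF-IDF vectors, and the phrase-document block holds the normalised TF-IDF weights. The function names (`jaccard`, `tfidf_rows`, `fusion_matrix`) and the toy data are hypothetical.

```python
import math


def jaccard(a, b):
    """Jaccard coefficient |a & b| / |a | b| of two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def tfidf_rows(docs, phrases):
    """Return one L2-normalised TF-IDF vector (over `phrases`) per document."""
    n = len(docs)
    # Document frequency: number of documents containing each phrase.
    df = [sum(1 for d in docs if p in d) for p in phrases]
    rows = []
    for d in docs:
        vec = [d.count(p) * math.log(n / df[j]) if df[j] else 0.0
               for j, p in enumerate(phrases)]
        norm = math.sqrt(sum(x * x for x in vec))
        rows.append([x / norm for x in vec] if norm else vec)
    return rows


def fusion_matrix(docs, phrases):
    """Assumed layout: [[phrase-phrase, phrase-doc], [doc-phrase, doc-doc]]."""
    w = tfidf_rows(docs, phrases)
    # Set of document indices in which each phrase occurs.
    occ = [{i for i, d in enumerate(docs) if p in d} for p in phrases]
    m, n = len(phrases), len(docs)
    f = [[0.0] * (m + n) for _ in range(m + n)]
    for a in range(m):
        for b in range(m):
            f[a][b] = jaccard(occ[a], occ[b])           # phrase-phrase block
    for i in range(n):
        for j in range(n):
            # Cosine similarity of the normalised TF-IDF vectors.
            f[m + i][m + j] = sum(x * y for x, y in zip(w[i], w[j]))
    for a in range(m):
        for i in range(n):
            f[a][m + i] = f[m + i][a] = w[i][a]         # phrase-doc block
    return f


# Toy input (hypothetical): each document is a list of extracted phrases.
docs = [
    ["text clustering", "fusion matrix"],
    ["text clustering", "text similarity"],
    ["document retrieval"],
]
phrases = sorted({p for d in docs for p in d})
F = fusion_matrix(docs, phrases)
```

In this sketch the rows of `F` that correspond to documents could be passed to any clustering algorithm as feature vectors. Which algorithms the paper trains and how it weights the blocks are not stated in the abstract.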

Key words

document retrieval/text clustering/fusion matrix/text similarity

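Funding

Liaoning Provincial Social Science Planning Fund (L20BTQ003)

Publication year

2024

Journal of Medical Informatics (医学信息学杂志)

Chinese Academy of Medical Sciences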

CSTPCD

Impact factor: 1.348

ISSN: 1673-6036

Number of references: 22
