Journal of Medical Informatics, 2024, Vol. 45, Issue 3: 58-64. DOI: 10.3969/j.issn.1673-6036.2024.03.010

Clustering Retrieval Results Using Fusion Matrix-based Text Similarity Computation

A Fusion Matrix-based Study on Text Clustering of Document Retrieval Results

Zhao Yueyang (1), Cui Lei (2)

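Author information

  • 1. Library, Shengjing Hospital of China Medical University, Shenyang 110004
  • 2. School of Health Management, China Medical University, Shenyang 110122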

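Abstract (translated from the Chinese original)

Purpose/Significance: To address shortcomings in the semantic representation of medical texts and to cluster the retrieval results of the PubMed database. Method/Process: A fusion matrix is constructed using the Jaccard coefficient and TF-IDF. It combines the similarity relations between phrases, between documents, and between phrases and document content. Clustering algorithms are trained on this matrix to group the set of PubMed retrieval results, and category labels are then generated to describe the meaning of each cluster of documents. Result/Conclusion: Clustering based on the fusion matrix performs well, and the high-frequency words extracted to describe each category distinguish the category meanings well, so the approach is effective for clustering retrieval-result texts.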

Abstract

Purpose/Significance: To address deficiencies in the semantic representation of medical texts and to cluster the retrieval results of the PubMed database. Method/Process: The paper proposes a method for constructing a fusion matrix using the Jaccard coefficient and TF-IDF. Similarity relations between phrases, between documents, and between phrases and document contents are combined into the fusion matrix, and several clustering algorithms are trained to group a collection of documents retrieved from the PubMed database. Category labels are then created to describe the meaning of each cluster of documents. Result/Conclusion: Experimental results show that fusion matrix-based clustering performs well in grouping the document sets, and the high-frequency words extracted for the category descriptions distinguish the meanings of the categories well, so the fusion matrix design is effective for clustering and describing academic texts.
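
The abstract does not state how the three similarity relations are assembled, so the following Python sketch shows only one plausible reading, not the authors' implementation. It assumes each document is already reduced to a list of phrases, uses the Jaccard coefficient for the phrase-phrase and document-document blocks, uses unsmoothed TF-IDF weights for the phrase-document block, and stacks them into one symmetric block matrix. The block layout, the function names (`jaccard`, `tfidf`, `fusion_matrix`, `cosine`) and the sample phrases are all hypothetical.

```python
import math
from collections import Counter


def jaccard(a, b):
    """Jaccard coefficient |a & b| / |a | b| of two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def tfidf(docs, vocab):
    """Return a len(docs) x len(vocab) matrix of TF-IDF weights.

    Assumption: tf = count / document length, idf = log(N / df), no smoothing.
    """
    n = len(docs)
    df = {t: sum(1 for d in docs if t in d) for t in vocab}
    rows = []
    for d in docs:
        counts = Counter(d)
        rows.append([
            (counts[t] / len(d)) * math.log(n / df[t]) if d else 0.0
            for t in vocab
        ])
    return rows


def fusion_matrix(docs):
    """Assemble the (assumed) block matrix [[PP, PD], [DP, DD]].

    PP: phrase-phrase Jaccard over the sets of documents containing each phrase.
    DD: document-document Jaccard over the documents' phrase sets.
    DP: document-phrase TF-IDF weights; PD is its transpose.
    """
    vocab = sorted({t for d in docs for t in d})
    doc_sets = [set(d) for d in docs]
    postings = {t: {i for i, s in enumerate(doc_sets) if t in s} for t in vocab}

    pp = [[jaccard(postings[a], postings[b]) for b in vocab] for a in vocab]
    dd = [[jaccard(a, b) for b in doc_sets] for a in doc_sets]
    dp = tfidf(docs, vocab)

    top = [pp[i] + [dp[j][i] for j in range(len(docs))] for i in range(len(vocab))]
    bottom = [dp[j] + dd[j] for j in range(len(docs))]
    return vocab, top + bottom


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    return dot / (nu * nv) if nu and nv else 0.0


if __name__ == "__main__":
    # Hypothetical phrase lists standing in for four retrieved records.
    sample = [
        ["gene", "expression", "cancer"],
        ["gene", "mutation", "cancer"],
        ["nursing", "care", "education"],
        ["nursing", "education", "training"],
    ]
    vocab, fused = fusion_matrix(sample)
    doc_rows = fused[len(vocab):]  # the last len(sample) rows describe documents
    print(cosine(doc_rows[0], doc_rows[1]), cosine(doc_rows[0], doc_rows[2]))
```

The document rows of the fused matrix could then be passed to any clustering algorithm, and the most frequent phrases within each resulting cluster could serve as its label. Both steps are left out here because the abstract does not say which algorithms or labelling rule were used.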

Key words

document retrieval/text clustering/fusion matrix/text similarity

Funding

Liaoning Provincial Social Science Planning Fund (L20BTQ003)

Publication year

2024
Journal of Medical Informatics
Chinese Academy of Medical Sciences

CSTPCD
Impact factor: 1.348
ISSN: 1673-6036
Number of references: 22