Study on the Tibetan Information Retrieval Model Based on LaBSE

严李强¹, 吴瑜¹, 拉巴顿珠¹, 梁炜恒¹

Author information

1. School of Information Science and Technology, Tibet University, Lhasa, Tibet 850000

Abstract

As Tibetan digital resources and the demand for them grow, accurately retrieving the information users need has become an important challenge. To address semantic matching between queries and documents in Tibetan retrieval, this paper proposes a Tibetan information retrieval model based on LaBSE. A LaBSE model is first used to extract feature information from Tibetan documents; the query and the extracted features are then fed into the model together. Through pre-training tasks such as masked language modeling and translation language modeling, the model learns deep semantic information about Tibetan syllabic characters in different contexts. Finally, the model is fine-tuned to complete its construction. Experimental results show that the model reaches an accuracy of 93.57%, which is 6.37% higher than that of a BERT-based Tibetan information retrieval model. This indicates that the proposed model matches queries to Tibetan documents more effectively and offers a reference for accurate retrieval of Tibetan resources.
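The query-document matching step described in the abstract can be illustrated with a minimal sketch. This is not the authors' model: the abstract feeds the query and document features into a fine-tuned model jointly, whereas the sketch below swaps in a simpler technique, ranking documents by cosine similarity between independently computed embeddings. The names `toy_encoder`, `cosine`, and `rank` are hypothetical, and `toy_encoder` is only an offline stand-in for a LaBSE encoder (hashed character-bigram counts, with no semantic knowledge).

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Encoder = Callable[[Sequence[str]], List[Vector]]


def toy_encoder(texts: Sequence[str], dim: int = 64) -> List[Vector]:
    """Hypothetical stand-in for a LaBSE encoder (assumption, not the paper's model).

    Maps each text to hashed character-bigram counts so the ranking code
    below can run offline. It carries no semantic information.
    """
    vectors = []
    for text in texts:
        vec = [0.0] * dim
        for i in range(len(text) - 1):
            # Deterministic bucket for each character bigram.
            bucket = (ord(text[i]) * 31 + ord(text[i + 1])) % dim
            vec[bucket] += 1.0
        vectors.append(vec)
    return vectors


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank(query: str, docs: Sequence[str],
         encode: Encoder = toy_encoder) -> List[Tuple[int, float]]:
    """Return (document index, score) pairs, best match first."""
    q_vec = encode([query])[0]
    d_vecs = encode(docs)
    scored = [(i, cosine(q_vec, d)) for i, d in enumerate(d_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    documents = ["tibetan retrieval model", "weather report for today"]
    for index, score in rank("tibetan retrieval", documents):
        print(index, round(score, 3))
```

To try a real multilingual encoder instead of the toy one, one option (assuming the `sentence-transformers` package is installed and the `sentence-transformers/LaBSE` checkpoint can be downloaded) is to pass `encode=lambda texts: model.encode(list(texts)).tolist()` with `model = SentenceTransformer("sentence-transformers/LaBSE")`. That would still omit the fine-tuning step the abstract describes.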

Key words

Tibetan/information retrieval model/deep learning/LaBSE

Publication year

2024

Plateau Science Research (高原科学研究)

CSCD
