A Study on the Stability of Semantic Representation of Entities in the Technology Domain: A Comparison of Multiple Word Embedding Models
Lexical semantic analysis is crucial in the field of science and technology literature intelligence analysis. Distributed word embedding techniques (e.g., fastText, GloVe, and Word2Vec), which effectively represent lexical semantics and conveniently characterize the semantic similarity between words, have recently become the mainstream technology for technological lexical semantic analysis. The use of word embedding techniques for lexical semantic analysis depends heavily on computing the nearest semantic neighbors of words from their word vectors. However, because of the random initialization of word embedding models, the nearest semantic neighbors produced by repeated training on exactly the same data are not identical, and these randomly perturbed neighbors introduce spurious information. To minimize the impact of random initialization, enhance reproducibility, and obtain more reliable and effective semantic analysis results, this study comprehensively examined the influence of dataset size, model type, training algorithm, keyword frequency, vector dimension, and context window size, and designed a quantitative stability assessment index together with a corresponding experimental scheme. The study investigated the Microsoft Academic Graph (MAG) paper corpus in four distinct fields: artificial intelligence, immunology, monetary policy, and quantum entanglement. Specifically, we trained word embedding models on the MAG paper corpora, produced word vector representations of the papers' keywords, and calculated evaluation metrics to assess the stability of the semantic representations in light of the quantitative results. The results across the four domains show that the larger the dataset, the more stable the semantic representation; however, this does not hold for GloVe. Different models and training algorithms should be chosen depending on whether structural and grammatical information, such as lexical composition and character similarity, as well as keyword frequency, is taken into account. Furthermore, setting the vector dimension to 300 and the context window to 5 is a more appropriate choice. This empirical study offers a point of reference for intelligence analysts engaged in the semantic analysis of scientific and technological vocabulary.
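The following is a minimal sketch, not the authors' actual implementation, of the kind of stability measurement the abstract describes: two embedding models are trained on the same tokenized corpus with different random seeds (dimension 300, context window 5, as in the study), and stability is quantified as the Jaccard overlap of each keyword's top-k nearest-neighbor sets. It assumes gensim >= 4.0; the function names and the Jaccard-based index are illustrative assumptions, not the paper's exact metric.

```python
# Illustrative sketch (not the authors' code): measure nearest-neighbor
# stability across two Word2Vec runs that differ only in random seed.
from gensim.models import Word2Vec


def top_k_neighbors(model, word, k=10):
    """Return the set of the k nearest semantic neighbors of `word`."""
    return {w for w, _ in model.wv.most_similar(word, topn=k)}


def neighbor_stability(corpus, keywords, k=10):
    """Jaccard overlap of top-k neighbor sets across two runs
    (vector dimension 300, context window 5, as in the study)."""
    runs = [
        Word2Vec(corpus, vector_size=300, window=5, min_count=5, sg=1, seed=s)
        for s in (1, 2)
    ]
    scores = {}
    for kw in keywords:
        if all(kw in run.wv for run in runs):
            a, b = (top_k_neighbors(run, kw, k) for run in runs)
            scores[kw] = len(a & b) / len(a | b)
    return scores


# Example usage on a toy tokenized corpus (each document is a list of tokens):
# corpus = [["quantum", "entanglement", "state"], ["neural", "network", "model"], ...]
# print(neighbor_stability(corpus, ["quantum", "neural"]))
```

A score of 1.0 means the two runs agree perfectly on a keyword's nearest neighbors; lower values indicate semantic representations that are sensitive to random initialization.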