A Study on the Stability of Semantic Representation of Entities in the Technology Domain: A Comparison of Multiple Word Embedding Models
Lexical semantic analysis is crucial in the field of science and technology literature intelligence analysis. Distributed word embedding techniques (e.g., fastText, GloVe, and Word2Vec), which effectively represent lexical semantics and conveniently characterize the semantic similarity between words, have recently become the mainstream technology for technological lexical semantic analysis. The use of word embedding techniques for lexical semantic analysis depends heavily on computing the nearest semantic neighbors of words from their word vectors. However, because of the random initialization of word embedding models, the nearest semantic neighbors produced by repeated training on exactly the same data are not identical, and these randomly perturbed neighbors introduce spurious information. To minimize the impact of random initialization, enhance reproducibility, and obtain more reliable and effective semantic analysis results, this study comprehensively examined the influence of dataset size, model type, training algorithm, keyword frequency, vector dimension, and context window size, and designed a quantitative stability assessment index together with a corresponding experimental scheme. The study investigated the Microsoft Academic Graph (MAG) paper corpus in four distinct fields: artificial intelligence, immunology, monetary policy, and quantum entanglement. Specifically, we trained word embedding models on the MAG paper corpora, produced word vector representations of the papers' keywords, and calculated evaluation metrics to assess the stability of the semantic representations in light of the quantitative results. The results across the four domains show that the larger the dataset, the more stable the semantic representation; however, this does not hold for GloVe. Different models and training algorithms should be chosen depending on whether structural and grammatical information, such as lexical composition and character similarity, as well as keyword frequency, is taken into account. Furthermore, setting the vector dimension to 300 and the context window to 5 is a more appropriate choice. This empirical study offers a point of reference for intelligence analysts engaged in the semantic analysis of scientific and technological vocabulary.
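The following is a minimal sketch, not the authors' actual implementation, of the kind of stability measurement the abstract describes: two embedding models are trained on the same tokenized corpus with different random seeds (dimension 300, context window 5, as in the study), and stability is quantified as the Jaccard overlap of each keyword's top-k nearest-neighbor sets. It assumes gensim >= 4.0; the function names and the Jaccard-based index are illustrative assumptions, not the paper's exact metric.

```python
# Illustrative sketch (not the authors' code): measure nearest-neighbor
# stability across two Word2Vec runs that differ only in random seed.
from gensim.models import Word2Vec


def top_k_neighbors(model, word, k=10):
    """Return the set of the k nearest semantic neighbors of `word`."""
    return {w for w, _ in model.wv.most_similar(word, topn=k)}


def neighbor_stability(corpus, keywords, k=10):
    """Jaccard overlap of top-k neighbor sets across two runs
    (vector dimension 300, context window 5, as in the study)."""
    runs = [
        Word2Vec(corpus, vector_size=300, window=5, min_count=5, sg=1, seed=s)
        for s in (1, 2)
    ]
    scores = {}
    for kw in keywords:
        if all(kw in run.wv for run in runs):
            a, b = (top_k_neighbors(run, kw, k) for run in runs)
            scores[kw] = len(a & b) / len(a | b)
    return scores


# Example usage on a toy tokenized corpus (each document is a list of tokens):
# corpus = [["quantum", "entanglement", "state"], ["neural", "network", "model"], ...]
# print(neighbor_stability(corpus, ["quantum", "neural"]))
```

A score of 1.0 means the two runs agree perfectly on a keyword's nearest neighbors; lower values indicate semantic representations that are sensitive to random initialization.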