Abstract
The bond market generates massive and complex information. Constructing a digital dictionary (that is, pre-trained word embeddings) that can express the complex semantics of the bond market is key to making full use of this information and to enabling business with fintech. At present, pre-trained embeddings specific to the bond domain are lacking, and evaluating word embeddings remains a major challenge. This study proposes BondJWE, a multi-granularity word embedding training framework for the bond domain that jointly uses information from character components, characters, and words. To evaluate the embeddings scientifically, this study also designs a downstream text classification task tailored to the characteristics of the available data. The work fills a gap in research on pre-trained bond-specific embeddings, and the experimental results show that BondJWE outperforms the other baseline models, which indicates that the multi-granularity embeddings express semantics better and are more robust.
Funding
Project of the Beijing Key Laboratory of Green Development Decision Based on Big Data (dm202103)
China Postdoctoral Science Foundation (2022M723692)