Variational Semi-Supervised Baidu Encyclopedia Classification Based on Wiki Knowledge

Cross-lingual knowledge graphs are mostly built on Wikipedia, but Wikipedia contains relatively few Chinese entities, which makes it difficult to build a large-scale cross-lingual knowledge graph centered on Chinese. How to use existing large-scale Chinese encyclopedic knowledge bases such as Baidu Encyclopedia to assist in building cross-lingual knowledge graphs is therefore a pressing problem. However, Wikipedia and Baidu Encyclopedia use different classification systems, which widens the scope and raises the difficulty of cross-encyclopedia retrieval. To address this, a semi-supervised Baidu Encyclopedia classification method is proposed, guided by a small amount of Wikipedia knowledge with classification labels. Semantic features and statistical features of the encyclopedia abstract text are obtained with word embeddings and a bag-of-words (BoW) model, respectively. The two are fused as the input of a variational autoencoder to obtain a semantic representation of the text. A semi-supervised classification loss is then constructed from the classification loss on a small amount of labeled Wikipedia data and the reconstruction loss on a large amount of unlabeled Baidu Encyclopedia data, which unifies the two classification systems. Experimental results show that the proposed method can accurately transfer Baidu Encyclopedia entries to the Wikipedia classification system.
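The pipeline described in the abstract can be illustrated with a minimal, forward-only sketch. It is not the authors' implementation: the names `bow_vector`, `mean_embedding`, `TinySemiVAE`, and `semi_supervised_loss` are hypothetical, and the single linear layers, squared-error reconstruction, classification from the posterior mean, and the weight `alpha` are all assumptions made for illustration. Weights are random and no training step is shown.

```python
import math
import random

def bow_vector(tokens, vocab):
    """Statistical feature: normalized term counts over a fixed vocabulary."""
    counts = [0.0] * len(vocab)
    for t in tokens:
        if t in vocab:
            counts[vocab[t]] += 1.0
    total = sum(counts) or 1.0
    return [c / total for c in counts]

def mean_embedding(tokens, emb, dim):
    """Semantic feature: average of the word vectors found in `emb`."""
    vecs = [emb[t] for t in tokens if t in emb]
    if not vecs:
        return [0.0] * dim
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

def linear(x, W, b):
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def init(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

class TinySemiVAE:
    """Hypothetical one-layer VAE with a classifier head on the latent code."""
    def __init__(self, in_dim, latent_dim, n_classes, seed=0):
        self.rng = random.Random(seed)
        self.W_mu, self.b_mu = init(latent_dim, in_dim, self.rng), [0.0] * latent_dim
        self.W_lv, self.b_lv = init(latent_dim, in_dim, self.rng), [0.0] * latent_dim
        self.W_dec, self.b_dec = init(in_dim, latent_dim, self.rng), [0.0] * in_dim
        self.W_cls, self.b_cls = init(n_classes, latent_dim, self.rng), [0.0] * n_classes

    def encode(self, x):
        mu = linear(x, self.W_mu, self.b_mu)
        logvar = linear(x, self.W_lv, self.b_lv)
        # Reparameterization: z = mu + sigma * eps, with eps ~ N(0, 1)
        z = [m + math.exp(0.5 * lv) * self.rng.gauss(0.0, 1.0)
             for m, lv in zip(mu, logvar)]
        return mu, logvar, z

    def unlabeled_loss(self, x):
        """Reconstruction error plus KL divergence to a standard normal prior."""
        mu, logvar, z = self.encode(x)
        x_hat = linear(z, self.W_dec, self.b_dec)
        recon = sum((a - b) ** 2 for a, b in zip(x, x_hat))
        kl = -0.5 * sum(1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, logvar))
        return recon + kl

    def labeled_loss(self, x, y):
        """Cross-entropy of the class predicted from the posterior mean (assumption)."""
        mu, _, _ = self.encode(x)
        probs = softmax(linear(mu, self.W_cls, self.b_cls))
        return -math.log(probs[y])

def semi_supervised_loss(model, labeled, unlabeled, alpha=1.0):
    """alpha * mean classification loss (labeled) + mean VAE loss (unlabeled)."""
    sup = sum(model.labeled_loss(x, y) for x, y in labeled) / len(labeled)
    unsup = sum(model.unlabeled_loss(x) for x in unlabeled) / len(unlabeled)
    return alpha * sup + unsup

# Toy data: all tokens, vectors, and labels below are made up for illustration.
vocab = {"river": 0, "city": 1, "physics": 2}
emb = {"river": [0.1, 0.3], "city": [0.2, -0.1], "physics": [-0.4, 0.5]}

def features(tokens):
    # Fuse semantic and statistical features by concatenation (assumption).
    return mean_embedding(tokens, emb, 2) + bow_vector(tokens, vocab)

labeled = [(features(["city", "river"]), 0), (features(["physics"]), 1)]  # "Wikipedia"
unlabeled = [features(["river", "river", "city"])]                        # "Baidu"
model = TinySemiVAE(in_dim=5, latent_dim=2, n_classes=2)
loss = semi_supervised_loss(model, labeled, unlabeled)
```

In a real system the same loss would be minimized with an automatic-differentiation framework, with the labeled batch drawn from Wikipedia abstracts and the unlabeled batch from Baidu Encyclopedia abstracts.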

Keywords: Classification system; Text classification; Semi-supervised learning; Bag of words; Variational autoencoder

Han Peifu, Yu Zhengtao, Guo Junjun, Gao Shengxiang, Lai Hua

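Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, Yunnan, China

Yunnan Key Laboratory of Artificial Intelligence, Kunming University of Science and Technology, Kunming 650500, Yunnan, China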

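Funding: National Natural Science Foundation of China (61972186, 61762056, 61472168); Yunnan Provincial Major Science and Technology Special Plan Project (202002AD080001); Yunnan High-Tech Industry Special Project (201606); Yunnan Applied Basic Research Program Key Project (2019FA023)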

2024

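Computer Applications and Software (计算机应用与软件)
Shanghai Institute of Computing Technology; Shanghai Development Center of Computer Software Technology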

Indexed in: CSTPCD; Peking University Core Journals
Impact factor: 0.615
ISSN: 1000-386X
Year, volume (issue): 2024, 41(7)