Research on Hierarchical Topic Analysis for a Course Based on BERT Embedding and Knowledge Distillation
Tree-structured neural topic models based on the variational auto-encoder can effectively mine the hierarchical semantic features of text. However, existing tree-structured neural topic models use only statistical features such as word frequency and ignore prior external knowledge. Targeting the topic analysis of a course, we propose a tree-structured neural topic model based on BERT embedding and knowledge distillation, integrating the idea of transfer learning. First, a BERT-CRF word segmentation model is constructed, and a small amount of domain text is used to further train BERT and optimize the representation of domain words. After this second-stage training, the BERT word embeddings are dynamically fused to obtain coarse-grained domain word embeddings, alleviating the mismatch between word embeddings and the bag-of-words representation. Second, a BERT autoencoder is constructed with document reconstruction as its objective, addressing the sparsity of the bag-of-words representation; the supervised document representation it learns is distilled to guide the topic model's document reconstruction and improve topic quality. Finally, the tree-structured neural topic model is optimized to fit the auxiliary-information-rich BERT word embeddings, and the distilled supervised knowledge guides the document reconstruction of the unsupervised topic model. Experiments show that the proposed method summarizes course topics more effectively.
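The distillation-guided objective described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names (`distillation_loss`, `softmax`, `kl_div`), the softened-KL form of the distillation term, and the weighting parameter `alpha` are assumptions; the abstract does not specify the exact loss. The sketch combines a bag-of-words reconstruction term for the topic model with a KL term that pulls the topic model's document representation toward the supervised teacher's.

```python
import numpy as np

def softmax(x, temperature=1.0):
    """Numerically stable softmax with optional temperature softening."""
    z = np.exp((x - x.max()) / temperature)
    return z / z.sum()

def kl_div(p, q, eps=1e-12):
    """KL divergence KL(p || q) between two discrete distributions."""
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def distillation_loss(bow, recon_logits, teacher_logits, student_logits,
                      temperature=2.0, alpha=0.5):
    """Hypothetical combined objective for a distillation-guided topic model.

    bow            -- bag-of-words count vector for one document
    recon_logits   -- topic model's reconstructed word distribution (logits)
    teacher_logits -- supervised autoencoder's document representation (logits)
    student_logits -- unsupervised topic model's document representation (logits)
    """
    # Reconstruction term: negative log-likelihood of the observed
    # bag-of-words under the topic model's reconstructed distribution.
    recon = softmax(recon_logits)
    rec_loss = -float(np.sum(bow * np.log(recon + 1e-12)))

    # Distillation term: KL between the teacher's softened document
    # representation and the student's, scaled by temperature^2 as is
    # conventional in knowledge distillation.
    t_soft = softmax(teacher_logits, temperature)
    s_soft = softmax(student_logits, temperature)
    distill = kl_div(t_soft, s_soft) * temperature ** 2

    return alpha * rec_loss + (1 - alpha) * distill
```

When teacher and student representations agree, the distillation term vanishes and only the reconstruction term remains; as they diverge, the KL penalty steers the unsupervised topic model toward the supervised document representation.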