Effective feature representation of academic documents not only improves the efficiency of document retrieval, classification, and ranking, but also enables more intelligent and effective literature recommendation and personalized services for users. Inspired by citation proximity analysis research in library and information science, this paper proposes a citation co-occurrence hierarchical sampling algorithm within a self-supervised contrastive learning framework. The algorithm mines latent associations between documents from structured full-text data and constructs a self-supervised pretext task for training a document-level academic text representation model, CCHT (citation co-occurrence hierarchical transformer). Using the S2ORC dataset and the SPECTER training set, triplet sets are constructed at the sentence, paragraph, and section co-occurrence levels to train the corresponding models, which are then applied to the four SciDocs benchmark tasks of paper classification, user behavior prediction, citation prediction, and paper recommendation. Different evaluation metrics are adopted for the different tasks: the F1 score for paper classification; nDCG (normalized discounted cumulative gain) and MAP (mean average precision) for user behavior prediction and citation prediction; and P@1 and nDCG for paper recommendation. The results show that (1) the CCHT model outperforms the other baseline models on the SciDocs benchmark and performs best when the positive sampling level is fixed to sentence-level co-occurrence; (2) hard negative sampling based on hierarchical citation co-occurrence is susceptible to noisy data, which gradually degrades model performance.
Citation Co-occurrence Hierarchical Sampling for Academic Document Representation Learning
Effective feature representations of academic papers can be used in the classification and ranking of academic papers, thereby improving search efficiency and providing users with more intelligent and effective literature recommendations and personalized services. Inspired by the study of citation proximity analysis (CPA) in information science, we utilize a self-supervised contrastive learning framework to propose a citation co-occurrence hierarchical sampling algorithm that mines potential associations among documents from structured full-text data. A self-supervised pretext task is constructed for training the citation co-occurrence hierarchical transformer (CCHT), an academic text representation model at the document level. The S2ORC and SPECTER training sets were used to construct triplets from citation co-occurrence within the same sentence, paragraph, and section to train the proposed models, which were subsequently applied to the four major SciDocs benchmark tasks of document classification, user behavior prediction, citation prediction, and paper recommendation. Different evaluation metrics were adopted for the different tasks: the F1 metric for document classification; normalized discounted cumulative gain (nDCG) and mean average precision (MAP) for user behavior prediction and citation prediction; and P@1 and nDCG for paper recommendation. The results demonstrate that (1) the CCHT model outperformed the other baseline models on the SciDocs benchmark test set, performing best when the positive sampling level was fixed to co-occurrence within the same sentence; (2) hard negative sampling based on hierarchical citation co-occurrence may introduce noisy data during training, which degrades performance.
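As a minimal sketch of the hierarchical sampling idea described above, the following Python fragment constructs training triplets from citation co-occurrence at a chosen level (sentence, paragraph, or section). The data layout, function names, and the random negative sampling shown here are illustrative assumptions, not the paper's actual algorithm.

```python
# Minimal sketch (not the paper's code) of citation co-occurrence hierarchical sampling.
# Assumed input: each paper carries a "citations" list recording, for every in-text
# citation marker, the cited paper id and its section/paragraph/sentence position.
import random
from collections import defaultdict

KEY_FIELDS = {
    "sentence": ("section", "paragraph", "sentence"),
    "paragraph": ("section", "paragraph"),
    "section": ("section",),
}

def co_occurrence_pairs(paper, level):
    """Pairs of cited papers whose citation markers co-occur at the given level."""
    buckets = defaultdict(set)
    for cite in paper["citations"]:  # e.g. {"cited_id": "...", "section": 2, "paragraph": 5, "sentence": 1}
        key = tuple(cite[field] for field in KEY_FIELDS[level])
        buckets[key].add(cite["cited_id"])
    pairs = []
    for ids in buckets.values():
        ids = sorted(ids)
        pairs.extend((a, b) for i, a in enumerate(ids) for b in ids[i + 1:])
    return pairs

def build_triplets(corpus, all_ids, pos_level="sentence", n_neg=1, seed=0):
    """Anchor/positive from co-occurrence at pos_level; negatives drawn at random."""
    rng = random.Random(seed)
    triplets = []
    for paper in corpus:
        for anchor, positive in co_occurrence_pairs(paper, pos_level):
            for _ in range(n_neg):
                negative = rng.choice(all_ids)
                if negative not in (anchor, positive):
                    triplets.append((anchor, positive, negative))
    return triplets
```

In a SPECTER-style setup, each (anchor, positive, negative) triplet would then be encoded by the transformer and optimized with a triplet margin loss, pulling papers that co-occur at the chosen level closer in embedding space than the negative.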
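For reference, the evaluation metrics named above can be sketched as follows for a single ranked list with binary relevance labels. This is a generic illustration rather than the SciDocs evaluation code; MAP is obtained by averaging the per-query average precision.

```python
# Generic sketch of P@1, average precision, and nDCG for one ranked list of 0/1 labels.
import math

def precision_at_1(ranked_rels):
    """ranked_rels: relevance labels (1/0) in ranked order."""
    return float(ranked_rels[0]) if ranked_rels else 0.0

def average_precision(ranked_rels):
    hits, score = 0, 0.0
    for i, rel in enumerate(ranked_rels, start=1):
        if rel:
            hits += 1
            score += hits / i
    return score / hits if hits else 0.0

def ndcg(ranked_rels, k=None):
    k = k or len(ranked_rels)
    dcg = sum(rel / math.log2(i + 1) for i, rel in enumerate(ranked_rels[:k], start=1))
    ideal = sorted(ranked_rels, reverse=True)
    idcg = sum(rel / math.log2(i + 1) for i, rel in enumerate(ideal[:k], start=1))
    return dcg / idcg if idcg else 0.0

# Example: a ranking whose first and third items are relevant.
print(precision_at_1([1, 0, 1]), average_precision([1, 0, 1]), ndcg([1, 0, 1]))
```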
representation learning; contrastive learning; sampling strategy; learning to rank