Bimodal software classification model based on bidirectional encoder representations from transformers

Abstract: Existing software classification methods consider only a single classification factor and suffer from low precision. To address these limitations, a bimodal software classification method based on bidirectional encoder representations from transformers (BERT) is proposed. The method classifies software according to the latest national standard and integrates the bidirectional encoding strengths of code-based BERT (CodeBERT) and masked language model as correction BERT (MacBERT): CodeBERT analyzes source code content in depth, while MacBERT handles textual descriptions such as comments and documentation. The two modalities are used jointly to generate word embeddings. A convolutional neural network (CNN) extracts local features, and the proposed cross self-attention mechanism (CSAM) fuses the model outputs to classify complex software systems accurately. Experimental results show that the method reaches a precision of 93.3% when both text and source code data are considered, an average precision gain of 5.4% over BERT and CodeBERT models trained on the dataset collected and processed from the Orginone and Gitee platforms. These results indicate that bidirectional encoding and bimodal classification are efficient and accurate for software classification and demonstrate the practicality of the proposed method.
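
The abstract names a cross self-attention mechanism (CSAM) for fusing the code and text branches but does not define it. The sketch below is therefore only an illustration of generic bidirectional cross-attention, not the authors' CSAM. It assumes each branch has already produced per-token feature vectors of equal width; the function names (`cross_attend`, `mean_pool`, `fuse`) and the toy numbers are hypothetical, and learned projection matrices are omitted for brevity.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, keys_values):
    """Scaled dot-product attention where each query row attends over the other modality.

    Assumption: no learned projections, so keys and values are the same vectors.
    """
    dim = len(queries[0])
    attended = []
    for q in queries:
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
            for k in keys_values
        ]
        weights = softmax(scores)
        attended.append([
            sum(w * kv[i] for w, kv in zip(weights, keys_values))
            for i in range(dim)
        ])
    return attended


def mean_pool(rows):
    """Average token vectors into one fixed-size vector."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def fuse(code_features, text_features):
    """Hypothetical fusion: code attends to text, text attends to code, then concatenate."""
    code_view = mean_pool(cross_attend(code_features, text_features))
    text_view = mean_pool(cross_attend(text_features, code_features))
    return code_view + text_view


# Toy stand-ins for per-token features from the two branches (made-up numbers).
code_features = [[1.0, 0.0, 0.5, 0.2], [0.3, 0.8, 0.1, 0.0], [0.0, 0.2, 0.9, 0.4]]
text_features = [[0.5, 0.5, 0.0, 0.1], [0.1, 0.0, 0.7, 0.6]]

fused = fuse(code_features, text_features)
print(len(fused))  # 8: two pooled 4-dimensional views, concatenated
```

In a full model the fused vector would feed a classification head over the software categories; that head, like the encoders and the CNN stage, is outside the scope of this sketch.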

Keywords:

software classification; bidirectional encoder representations from transformers (BERT); convolutional neural network; bimodal; cross self-attention mechanism

Authors:

Fu Xiaofeng (付晓峰), Chen Weiqi (陈威岐), Sun Yao (孙曜), Pan Yuze (潘宇泽)

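Author affiliations:

School of Computer Science, Hangzhou Dianzi University, Hangzhou, Zhejiang 310018

School of Automation, Hangzhou Dianzi University, Hangzhou, Zhejiang 310018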

Publication year:

2024

DOI:

10.3785/j.issn.1008-973X.2024.11.005

Journal of Zhejiang University (Engineering Science)

Zhejiang University

Indexed in: CSTPCD; Peking University Core Journals (北大核心)

Impact factor: 0.625

ISSN: 1008-973X

Year, volume (issue): 2024, 58(11)