Bimodal software classification model based on bidirectional encoder representations from transformers
A bimodal software categorization method based on bidirectional encoder representations from transformers (BERT) was proposed to address the limitations of existing methods, which consider only a single factor in software categorization and suffer from low precision. The method followed the latest national standards for software classification. It integrated the advantages of bidirectional encoding from code-based BERT (CodeBERT) and masked language model as correction BERT (MacBERT). CodeBERT was used for in-depth analysis of source code content, while MacBERT handled textual description information such as comments and documents. These two modalities were used jointly to generate word embeddings. A convolutional neural network (CNN) was then applied for local feature extraction, and the proposed cross self-attention mechanism (CSAM) was employed to fuse the model results in order to achieve accurate classification of complex software systems. The experimental results demonstrate that the method achieves a precision of 93.3% with text and source code data, which is 5.4% higher on average than the BERT and CodeBERT models trained on datasets processed from the Orginone and Gitee platforms. The results demonstrate the efficiency and accuracy of bidirectional encoding and bimodal classification methods in software categorization and confirm the practicality of the proposed approach.
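To make the fusion step concrete, the following is a minimal sketch of generic bidirectional cross-attention between two feature sequences, one for source code and one for text. It is not the CSAM proposed here, whose details are not given in this abstract: it assumes plain scaled dot-product attention with no learned projections, and the names `cross_attend`, `mean_pool`, and `fuse` and the toy feature values are hypothetical stand-ins for the CNN outputs of the CodeBERT and MacBERT branches.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, keys_values):
    """Scaled dot-product attention: each query row attends over keys_values.

    Assumption: keys and values are the same vectors, and no learned
    projection matrices are applied (a real model would learn them).
    """
    dim = len(queries[0])
    attended = []
    for q in queries:
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
            for k in keys_values
        ]
        weights = softmax(scores)
        attended.append(
            [sum(w * v[j] for w, v in zip(weights, keys_values)) for j in range(dim)]
        )
    return attended


def mean_pool(rows):
    """Average a sequence of vectors into a single vector."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def fuse(code_feats, text_feats):
    """Let each modality attend to the other, then pool and concatenate."""
    code_view = cross_attend(code_feats, text_feats)  # code queries, text keys
    text_view = cross_attend(text_feats, code_feats)  # text queries, code keys
    return mean_pool(code_view) + mean_pool(text_view)


# Hypothetical toy features standing in for the two branches' outputs.
code_feats = [[1.0, 0.0, 0.5], [0.2, 0.9, 0.1]]
text_feats = [[0.3, 0.3, 0.3], [0.8, 0.1, 0.0], [0.0, 1.0, 0.4]]
fused = fuse(code_feats, text_feats)
print(len(fused))  # 6: two pooled 3-dimensional vectors concatenated
```

In a full model, the fused vector would feed a classification layer over the software categories; that layer, and the learned projections omitted above, are left out to keep the sketch short.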