Malware detection based on API sequences and pre-training
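Dou Jianmin (窦建民) 1, Shi Zhibin (师智斌) 1, Yu Mengyang (于孟洋) 1, Huo Shuai (霍帅) 1, Zhang Shujuan (张舒娟) 1
Author information
- 1. School of Computer Science and Technology, North University of China, Taiyuan, Shanxi 030051, China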
Abstract
Existing malware detection methods are limited in feature expression and cannot capture the global semantic information of API sequences. In addition, malware datasets usually contain large amounts of unlabeled data that cannot be used directly for supervised learning. To address these problems, a malware detection method based on API call sequences and a pre-trained model is proposed, drawing on natural language pre-training techniques. A tokenizer is built from the raw API sequences. A dynamic masked sequence model is then built on the BERT model for unsupervised pre-training, which also yields a global dynamic encoding representation of the API sequences. This encoding is used to construct the detection model. Experimental results show that the proposed method detects malware effectively.
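The first two steps named in the abstract (a tokenizer built from raw API sequences, then masking for unsupervised pre-training) can be illustrated with a minimal sketch. It is not the authors' implementation. The function names `build_vocab`, `encode` and `dynamic_mask`, the special-token list, the 15% masking rate, the 80/10/10 replacement split (the convention from the original BERT recipe) and the `-100` ignore label are all assumptions made for illustration. "Dynamic" is read here as re-sampling the masked positions every time a sequence is drawn, which may differ from the paper's exact scheme.

```python
import random
from collections import Counter

# Assumed special tokens, following common BERT conventions.
SPECIAL_TOKENS = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]


def build_vocab(sequences):
    """Map each API name to an integer id; special tokens take the first ids."""
    counts = Counter(api for seq in sequences for api in seq)
    vocab = {tok: i for i, tok in enumerate(SPECIAL_TOKENS)}
    # Most frequent APIs first; ties broken alphabetically for determinism.
    for api, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        vocab[api] = len(vocab)
    return vocab


def encode(seq, vocab):
    """Turn one API call sequence into ids wrapped in [CLS] ... [SEP]."""
    unk = vocab["[UNK]"]
    return [vocab["[CLS]"]] + [vocab.get(api, unk) for api in seq] + [vocab["[SEP]"]]


def dynamic_mask(ids, vocab, rng, mask_prob=0.15):
    """Return (masked_ids, labels) with freshly sampled mask positions.

    Calling this anew on every pass over the data gives a different mask
    each time. Labels are -100 (assumed "ignore" value) where no prediction
    is required, and the original token id at selected positions.
    """
    special = {vocab[t] for t in SPECIAL_TOKENS}
    masked = list(ids)
    labels = [-100] * len(ids)
    for i, tok in enumerate(ids):
        if tok in special or rng.random() >= mask_prob:
            continue
        labels[i] = tok
        r = rng.random()
        if r < 0.8:
            masked[i] = vocab["[MASK]"]  # 80%: replace with [MASK]
        elif r < 0.9:
            # 10%: replace with a random non-special token
            masked[i] = rng.randrange(len(SPECIAL_TOKENS), len(vocab))
        # remaining 10%: keep the original token
    return masked, labels
```

In a pre-training loop, `dynamic_mask(encode(seq, vocab), vocab, rng)` would be called once per sequence per epoch, and the resulting pairs fed to a BERT-style encoder trained to recover the labeled positions.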
Key words
malware detection/pre-trained model/unsupervised learning/dynamic mask/software call sequence/model fine-tuning/coded representation
Funding
Fundamental Research Program of Shanxi Province (20210302123018)
Publication year
2024