Malware detection based on API sequences and pre-trained models

Dou Jianmin (窦建民)¹, Shi Zhibin (师智斌)¹, Yu Mengyang (于孟洋)¹, Huo Shuai (霍帅)¹, Zhang Shujuan (张舒娟)¹

Author information

1. School of Computer Science and Technology, North University of China, Taiyuan, Shanxi 030051, China

Abstract

Existing malware detection methods are limited in feature expressiveness and cannot capture the global semantic information of API sequences. In addition, malware datasets usually contain large amounts of unlabeled data that cannot be used directly for supervised learning. To address these problems, a malware detection method based on API call sequences and a pre-trained model is proposed, drawing on natural language pre-training techniques. A tokenizer is built from the raw API sequences. A dynamic masked sequence model is then constructed on top of BERT and pre-trained without supervision, which also yields a global dynamic encoding representation of the API sequences. This encoding is used to build the detection model. Experimental results show that the proposed method detects malware effectively.
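The first two steps named in the abstract (building a tokenizer from raw API sequences, then pre-training with dynamic masking) can be illustrated with a minimal standard-library Python sketch. It is not the authors' implementation. The API names, the helper names `build_vocab`, `encode`, and `dynamic_mask`, the 15% masking rate, and the 80/10/10 replacement split are all assumptions borrowed from the usual BERT recipe, and "dynamic" is read here as drawing a fresh mask every time a sequence is fed to the model. The label value -100 mirrors the default ignore index of PyTorch's cross-entropy loss.

```python
import random
from collections import Counter

# Special tokens follow BERT's convention (assumption, not from the paper).
SPECIAL_TOKENS = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]


def build_vocab(sequences, min_freq=1):
    """Map each API name seen at least min_freq times to an integer id."""
    counts = Counter(api for seq in sequences for api in seq)
    vocab = {tok: i for i, tok in enumerate(SPECIAL_TOKENS)}
    for api in sorted(counts):  # sorted so that id assignment is deterministic
        if counts[api] >= min_freq:
            vocab[api] = len(vocab)
    return vocab


def encode(seq, vocab):
    """Convert an API call sequence to ids, wrapped in [CLS] ... [SEP]."""
    unk = vocab["[UNK]"]
    return [vocab["[CLS]"]] + [vocab.get(api, unk) for api in seq] + [vocab["[SEP]"]]


def dynamic_mask(ids, vocab, rng, mask_prob=0.15):
    """Draw a fresh mask on every call and return (inputs, labels).

    labels holds the original id at selected positions and -100 elsewhere.
    """
    special = {vocab[t] for t in SPECIAL_TOKENS}
    regular = [i for i in vocab.values() if i not in special]
    inputs, labels = list(ids), [-100] * len(ids)
    for pos, tok in enumerate(ids):
        if tok in special or rng.random() >= mask_prob:
            continue  # special tokens and unselected positions stay untouched
        labels[pos] = tok
        roll = rng.random()
        if roll < 0.8:
            inputs[pos] = vocab["[MASK]"]      # 80%: replace with [MASK]
        elif roll < 0.9:
            inputs[pos] = rng.choice(regular)  # 10%: replace with a random API id
        # remaining 10%: keep the original id
    return inputs, labels


# Illustrative API call traces (hypothetical data).
sequences = [
    ["NtCreateFile", "NtWriteFile", "NtClose"],
    ["RegOpenKeyExW", "RegSetValueExW", "NtClose"],
]
vocab = build_vocab(sequences)
ids = encode(sequences[0], vocab)

rng = random.Random(0)
for epoch in range(2):
    # The same sequence may receive a different mask in each epoch.
    inputs, labels = dynamic_mask(ids, vocab, rng)
```

In a full pipeline the `(inputs, labels)` pairs would feed a BERT-style encoder trained to predict the masked API ids, and the encoder's output vectors would then serve as the sequence representation for a downstream classifier.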

Key words

malware detection/pre-trained model/unsupervised learning/dynamic masking/software call sequence/model fine-tuning/encoding representation

Funding

Basic Research Program of Shanxi Province (20210302123018)

Publication year

2024

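Computer Engineering and Design (计算机工程与设计)

Institute 706, Second Academy of China Aerospace Science and Industry Corporation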

Indexed in: CSTPCD; Peking University Core Journals (北大核心)

Impact factor: 0.617

ISSN: 1000-7024

Number of references: 23
