Automated deobfuscation and family classification system for Excel 4.0 macros
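Li Chenguang 1, Yang Xiuzhang 1, Peng Guojun 1
Author information
- 1. School of Cyber Science and Engineering, Wuhan University, Wuhan 430072, Hubei, China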
Abstract
In recent years, attacks using documents that carry malicious Excel 4.0 macros (XLM) have surged. XLM code is often heavily obfuscated, which makes it difficult for existing methods and detection systems to analyze the actual functionality of large numbers of samples. To counter the obfuscation techniques used in malicious samples, XLMRevealer, an automated system for XLM deobfuscation and extraction of key indicators of compromise (IOCs), was designed and implemented on the basis of abstract syntax trees and simulated execution, with handlers for 138 macro functions. On this basis, Word and Token features were extracted according to the characteristics of XLM code, and feature fusion was used to capture multi-level, fine-grained features. A CNN-BiLSTM (convolutional neural network-bidirectional long short-term memory) model was built in XLMRevealer to mine the relationships among family samples from different dimensions and to perform family classification. Finally, a dataset of 2346 samples was constructed from five sources and used for the deobfuscation and family classification experiments. The results showed that XLMRevealer achieved a deobfuscation success rate of 71.3%, which was 20.8% and 15.8% higher than XLMMacroDeobfuscator and SYMBEXCEL, respectively. Its deobfuscation efficiency was stable, with an average processing time of only 0.512 s. The family classification accuracy on deobfuscated XLM code reached 94.88%, exceeding all baseline models and showing the advantage of fusing Word and Token features. Furthermore, to explore the effect of deobfuscation on family classification, and considering that different families may use different obfuscation techniques whose features the model could pick up, experiments were also run on XLM code before deobfuscation and on deobfuscated code that was then uniformly re-obfuscated. The family classification accuracies were 89.58% and 53.61%, respectively, indicating that the model can learn obfuscation features and confirming that deobfuscation substantially improves family classification.
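As a rough illustration of the approach the abstract describes (parse a formula into an abstract syntax tree, emulate each macro function with a handler, then extract IOCs from the result), the following minimal Python sketch may help. It is not XLMRevealer's implementation: the three handlers, the tiny grammar (string and integer literals, function calls, and the `&` concatenation operator), the function names, and the URL pattern are all assumptions chosen for illustration.

```python
import re

# Hypothetical handler table. The paper reports 138 handlers; these three
# are toy stand-ins for illustration only.
HANDLERS = {
    "CHAR": lambda args: chr(int(args[0])),
    "MID": lambda args: str(args[0])[int(args[1]) - 1:int(args[1]) - 1 + int(args[2])],
    "CONCATENATE": lambda args: "".join(str(a) for a in args),
}

TOKEN_RE = re.compile(r'\s*(?:("(?:[^"]|"")*")|(\d+)|([A-Za-z_.]+)|(.))')


def tokenize(formula):
    """Split a formula into (kind, value) tokens; a leading '=' is dropped."""
    tokens = []
    for string, number, name, symbol in TOKEN_RE.findall(formula.lstrip("=")):
        if string:
            tokens.append(("str", string[1:-1].replace('""', '"')))
        elif number:
            tokens.append(("num", int(number)))
        elif name:
            tokens.append(("name", name.upper()))
        elif symbol.strip():
            tokens.append(("sym", symbol))
    return tokens


def parse_expr(tokens, pos):
    """expr := term ('&' term)*; returns (ast_node, next_position)."""
    node, pos = parse_term(tokens, pos)
    while pos < len(tokens) and tokens[pos] == ("sym", "&"):
        right, pos = parse_term(tokens, pos + 1)
        node = ("concat", node, right)
    return node, pos


def parse_term(tokens, pos):
    """term := literal | NAME '(' expr (',' expr)* ')'."""
    kind, value = tokens[pos]
    if kind in ("str", "num"):
        return ("lit", value), pos + 1
    if kind == "name" and pos + 1 < len(tokens) and tokens[pos + 1] == ("sym", "("):
        args, pos = [], pos + 2
        while tokens[pos] != ("sym", ")"):
            arg, pos = parse_expr(tokens, pos)
            args.append(arg)
            if tokens[pos] == ("sym", ","):
                pos += 1
        return ("call", value, args), pos + 1
    raise ValueError(f"unsupported token: {value!r}")


def evaluate(node):
    """Walk the AST, dispatching each function call to its handler."""
    if node[0] == "lit":
        return node[1]
    if node[0] == "concat":
        return str(evaluate(node[1])) + str(evaluate(node[2]))
    _, name, args = node
    if name not in HANDLERS:
        raise ValueError(f"no handler for {name}")
    return HANDLERS[name]([evaluate(a) for a in args])


def deobfuscate(formula):
    ast, _ = parse_expr(tokenize(formula), 0)
    return evaluate(ast)


URL_RE = re.compile(r"https?://[^\s\"']+")  # assumed IOC pattern: URLs only


def extract_iocs(text):
    return URL_RE.findall(text)


# Made-up obfuscated formula that assembles a URL character by character.
SAMPLE = '=CHAR(104)&CHAR(116)&MID("xxtp://",3,5)&"example.test/a.dll"'
```

Calling `extract_iocs(deobfuscate(SAMPLE))` on the made-up formula recovers the URL it assembles. A real system would also need cell references, control flow, and many more functions, none of which this sketch attempts.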
Key words
malicious macro document/Excel 4.0 macro/deobfuscation/family classification
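Funding
National Natural Science Foundation of China (62172308)
National Natural Science Foundation of China (U1636107)
National Natural Science Foundation of China (61972297)
National Natural Science Foundation of China (62172144)
Student Innovation Funding Program for Cybersecurity Schools, Cyberspace Administration of China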
Publication year
2024