DUWe: Application of a Dynamic Unknown Word Embedding Method to Web Anomaly Detection


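Abstract (translated from Chinese): Existing deep-learning-based word embedding methods for Web anomaly detection usually set out-of-vocabulary (OOV) words, i.e. words that do not appear in the corpus, to unknown and feed them to the model as zero or random vectors for training, ignoring the context of the unknown word within the Web request. At the same time, when developing Web system code, programmers tend to design request path code according to certain patterns, out of personal habit and to improve readability. Taking into account the patterns in Web requests and the correlation between word semantics, this work studies DUWe (Dynamic Unknown Word Embedding), a Word2vec-based dynamic unknown word representation method that assigns an unknown word its vector representation by analyzing the contextual relationships between words in the Web request path. Experimental evaluation on the CSIC-2010 and WAF Dataset datasets shows that adding the unknown word representation method performs better than using Word2vec static feature extraction alone, improving accuracy, precision, recall, and F1-score, and reducing training time by up to a factor of 1.14.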

English title: DUWe: Dynamic Unknown Word Embedding Approach for Web Anomaly Detection

English abstract: When existing deep-learning-based word embedding methods are used for Web anomaly detection, words that do not appear in the corpus, known as out-of-vocabulary (OOV) words, are usually set to unknown and given a zero or random vector as model input for training, without considering the context of the unknown word in the Web request. Meanwhile, during code development, programmers often design request path code according to certain patterns in order to improve readability, which usually makes Web requests semantically related. Considering these request patterns and the correlation between word semantics, this paper studies and proposes DUWe, a dynamic unknown word embedding method based on Word2vec that assigns a representation to an unknown word by inferring it from the word's context. Evaluation on the CSIC-2010 and WAF Dataset datasets shows that adding the unknown word embedding method performs better than Word2vec feature extraction alone: accuracy, precision, recall, and F1-score all improve, and training time is reduced by up to a factor of 1.14.
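The idea of giving an unknown word a representation derived from its context, instead of a zero or random vector, can be illustrated with a small sketch. The abstract does not spell out how DUWe performs the inference, so the rule used here (averaging the vectors of in-vocabulary neighbours inside a fixed window) is a simple stand-in heuristic, not the authors' algorithm. The vector table `TOY_VECTORS`, the helper names `tokenize` and `embed_with_context`, the `window` parameter, and the sample request path are all hypothetical.

```python
import re

# Hypothetical, hand-made 3-dimensional vectors standing in for a trained
# Word2vec table; real embeddings would be learned from a request corpus.
TOY_VECTORS = {
    "tienda1": [0.9, 0.1, 0.0],
    "publico": [0.8, 0.2, 0.1],
    "jsp": [0.1, 0.9, 0.0],
    "id": [0.0, 0.2, 0.9],
}


def tokenize(request_path):
    """Lowercase a request path and split it on non-alphanumeric characters."""
    return [t.lower() for t in re.split(r"[^A-Za-z0-9]+", request_path) if t]


def embed_with_context(tokens, vectors, window=2):
    """Return one vector per token.

    Known tokens keep their table vector. An unknown token receives the mean
    of the known tokens within `window` positions on either side (assumed
    heuristic); if no neighbour is known, it falls back to a zero vector.
    """
    dim = len(next(iter(vectors.values())))
    embedded = []
    for i, token in enumerate(tokens):
        if token in vectors:
            embedded.append(list(vectors[token]))
            continue
        start, stop = max(0, i - window), min(len(tokens), i + window + 1)
        neighbours = [
            vectors[tokens[j]]
            for j in range(start, stop)
            if j != i and tokens[j] in vectors
        ]
        if neighbours:
            # Column-wise mean over the neighbouring vectors.
            embedded.append([sum(col) / len(neighbours) for col in zip(*neighbours)])
        else:
            embedded.append([0.0] * dim)
    return embedded


if __name__ == "__main__":
    tokens = tokenize("/tienda1/publico/anadir.jsp?id=2")
    for token, vec in zip(tokens, embed_with_context(tokens, TOY_VECTORS)):
        status = "known" if token in TOY_VECTORS else "inferred from context"
        print(token, vec, status)
```

In this sketch the unknown tokens `anadir` and `2` receive vectors that depend on the surrounding path segments, so the same unknown string can be embedded differently in different requests, which is the sense in which the representation is dynamic.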

English keywords:

Unknown word; Web anomaly detection; Dynamic unknown word embedding; Word embedding optimization; Deep learning

Authors:

王丽、陈刚、夏明山、胡皓


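Author affiliations:

Institute of High Energy Physics, Chinese Academy of Sciences, Beijing 100049

Spallation Neutron Source Science Center, Dongguan, Guangdong 523803

University of Chinese Academy of Sciences, Beijing 100049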

Keywords:

Unknown word; Web anomaly detection; Dynamic word embedding; Word embedding optimization; Deep learning

Funding:

National Natural Science Foundation of China (three grants)

Grant numbers:

11905239; 12005248; 12105303

Publication year:

2024

DOI:

10.11896/jsjkx.230300191

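Computer Science (计算机科学)

Chongqing Southwest Information Co., Ltd. (formerly the Southwest Information Center of the Ministry of Science and Technology)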


CSTPCD; Peking University Core Journals (北大核心)

Impact factor: 0.944

ISSN: 1002-137X

Year, volume (issue): 2024, 51(z1)

Number of references: 29