现有的基于深度学习模型的词嵌入方法用于Web异常检测时,通常将语料库中没有出现的未知词汇(Out of Vocabu-lary,OOV)设置为unknown,并赋予零或随机向量输入到模型中进行训练,未考虑未知词汇在Web请求语句中的上下文关系.同时,在Web系统代码开发过程中,基于个人习惯并为了增加代码的可读性,程序员设计的请求路径代码往往存在一定的模式.因此,考虑到 Web请求的模式和单词语义间的相关性,研究基于Word2vec的动态未知词表示方法DUWe(Dynamic Un-known Word Embedding),该方法通过分析 Web请求路径中单词上下文的关系来赋予未知词向量的表示内容.在CSIC-2010和WAF Dataset数据集上的实验评估表明,增加未知词表示方法比仅用Word2vec静态特征提取方法具有更好的性能,同时在准确性、精准率、召回率和F1-Score方面均有提高,在训练时间上最大降低1.14倍.
DUWe:Dynamic Unknown Word Embedding Approach for Web Anomaly Detection
When the existing deep-learning model-based word embedding methods are used to detect Web anomalies,the vocabu-lary not appearing in the corpus is usually called out of vocabulary(OOV)and is set as unknown,and given zero or random vector as the input of the depth model for training without considering the context of unknown word in the web request.In the process of code development,in order to increase the readability of code,programmers often design request path code based on a certain pattern which usually makes web requests semantically related.Considering that there are certain request patterns in web requests and pattern correlation between semantics,this paper studies and proposes a dynamic unknown word embedding method DU We based on Word2vec,which assigns unknown word representation through word context inference.Evaluation on CSIC-2010 and WAF dataset shows that adding unknown word embedding methods have better performance than word2vec feature extraction methods.The accuracy,precision,recall rate and F1-Score are improved,and the maximum reduction in training time is 1.14 times.
Unknown wordWeb anomaly detectionDynamic unknown word embeddingWord embedding optimizationDeep learning