Research on multi-granularity password analysis based on LLM

Hong Meng (洪萌)¹, Qiu Weidong (邱卫东)¹, Wang Yangde (王杨德)¹

Author information

1. School of Cyber Science and Engineering, Shanghai Jiao Tong University, Shanghai 200240, China

Abstract

Password-based authentication is a common identity authentication mechanism. However, large-scale password leaks continue to occur, showing that passwords remain exposed to risks such as guessing and theft. Because passwords can be regarded as a special form of natural language, recent work has begun to apply natural language processing techniques to password analysis. Few studies, however, have examined how the segmentation granularity of password text affects password analysis with a large language model (LLM). To address this gap, a multi-granularity password analysis framework based on an LLM is proposed. It follows the pre-training paradigm and autonomously learns prior knowledge of the password distribution from large unlabeled datasets. The framework consists of three modules: a synchronization network, a backbone network, and a tail network. The synchronization network module implements password segmentation at three granularities (char-level, template-level, and chunk-level) and extracts feature knowledge such as character distribution, structure, and chunk composition. The backbone network module builds a general password model to learn the rules of password composition. The tail network module generates candidate passwords for guessing analysis against target password databases. Extensive experiments were conducted on eight password databases, including Tianya and Twitter, and the password analysis performance of the framework under each segmentation granularity was analyzed and summarized across language environments. The results show that in Chinese user scenarios, the frameworks based on char-level and chunk-level segmentation perform nearly identically, and both significantly outperform the framework based on template-level segmentation; in English user scenarios, the framework based on chunk-level segmentation performs best.
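To make the three segmentation granularities concrete, the following is a minimal Python sketch, not the authors' implementation. It rests on assumptions that the abstract does not state: the function names are hypothetical, characters are grouped into three coarse classes (letter, digit, symbol), a "chunk" is approximated as a maximal run of same-class characters, and a "template" is written in the PCFG-style notation of class letter plus run length. The framework described above may well use a learned chunk vocabulary instead.

```python
from itertools import groupby


def char_class(ch: str) -> str:
    """Map a character to a coarse class: L(etter), D(igit), or S(ymbol)."""
    if ch.isalpha():
        return "L"
    if ch.isdigit():
        return "D"
    return "S"


def char_level(password: str) -> list[str]:
    """Char-level: every character is its own token."""
    return list(password)


def chunk_level(password: str) -> list[str]:
    """Chunk-level (assumed rule): maximal runs of same-class characters."""
    return ["".join(run) for _, run in groupby(password, key=char_class)]


def template_level(password: str) -> list[str]:
    """Template-level (assumed notation): class letter plus run length per chunk."""
    return [f"{char_class(chunk[0])}{len(chunk)}" for chunk in chunk_level(password)]


if __name__ == "__main__":
    pw = "password123!"
    print(char_level(pw))
    print(chunk_level(pw))
    print(template_level(pw))
```

Under these assumptions, char-level tokens keep every character but yield long sequences, template-level tokens keep only the structure and discard the actual characters, and chunk-level tokens sit in between by keeping the content of each run as a single unit.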

Key words

large language model / password analysis / natural language processing / word segmentation

Funding

National Natural Science Foundation of China (61972249)

National Key Research and Development Program of China (2023YFB3106501)

Publication year

2024

Chinese Journal of Network and Information Security (网络与信息安全学报)

Posts & Telecom Press (人民邮电出版社)

CSTPCD

ISSN: 2096-109X

Number of references: 28
