DySpec: Faster speculative decoding with dynamic token tree structure
Abstract
While speculative decoding has recently emerged as a promising direction for accelerating the inference of large language models (LLMs), its speedup and scalability are strongly bounded by the token acceptance rate. Prevalent methods usually organize predicted tokens as independent chains or fixed token trees, which fail to generalize to diverse query distributions. In this paper, we propose DySpec, a faster speculative decoding algorithm with a novel dynamic token tree structure. We begin by bridging the draft distribution and the acceptance rate through intuitive and empirical clues, and show that the two variables are strongly correlated. Based on this, we employ a greedy strategy to dynamically expand the token tree at run time. Theoretically, we show that our method can achieve optimal results under mild assumptions. Empirically, DySpec yields a higher acceptance rate and acceleration than fixed trees. DySpec can drastically improve throughput and reduce the latency of token generation across various data distributions and model sizes, significantly outperforming strong competitors including Specinfer and Sequoia. Under a low-temperature setting, DySpec improves throughput by up to 9.1× and reduces latency by up to 9.4× on Llama2-70B. Under a high-temperature setting, DySpec still improves throughput by up to 6.21×, despite the increasing difficulty of speculating more than one token per step for the draft model.
Keywords: Artificial intelligence; Large language models; Inference acceleration; Speculative decoding
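To make the greedy expansion idea in the abstract concrete, below is a minimal, illustrative sketch of how a token tree might be grown dynamically at run time, using the draft model's path probability as a proxy for the acceptance rate (per the correlation the abstract describes). The function and parameter names (`expand_token_tree`, `draft_top_k`, `budget`, `k`) are hypothetical and not taken from the paper; this is not the authors' implementation.

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Node:
    # heapq is a min-heap, so store the negated path probability as the key.
    neg_path_prob: float
    token: int = field(compare=False)
    parent: "Node | None" = field(compare=False, default=None)
    children: list = field(compare=False, default_factory=list)

def expand_token_tree(draft_top_k, root_token, budget=64, k=4):
    """Greedy dynamic token-tree expansion (illustrative sketch).

    `draft_top_k(prefix_tokens)` is assumed to return the draft model's
    top-k candidates as a list of (token, prob) pairs for the given path
    from the root to the current node.
    """
    root = Node(neg_path_prob=-1.0, token=root_token)
    frontier = [root]  # nodes ranked by path probability (most promising first)
    tree_size = 1
    while frontier and tree_size < budget:
        node = heapq.heappop(frontier)
        # Reconstruct the token path from the root to this node.
        path, cur = [], node
        while cur is not None:
            path.append(cur.token)
            cur = cur.parent
        path.reverse()
        for token, prob in draft_top_k(path)[:k]:
            # Product of draft probabilities along the path serves as a
            # proxy for the probability the target model accepts this branch.
            child = Node(neg_path_prob=node.neg_path_prob * prob,
                         token=token, parent=node)
            node.children.append(child)
            heapq.heappush(frontier, child)
            tree_size += 1
            if tree_size >= budget:
                break
    return root
```

The expanded tree would then be handed to the target model for one verification pass; because expansion always spends the node budget on the branches with the highest estimated acceptance probability, the tree shape adapts to the query distribution instead of being fixed in advance.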