DySpec: Faster speculative decoding with dynamic token tree structure
Abstract
While speculative decoding has recently emerged as a promising direction for accelerating the inference of large language models (LLMs), its speedup and scalability are strongly bounded by the token acceptance rate. Prevalent methods usually organize predicted tokens as independent chains or fixed token trees, which fail to generalize to diverse query distributions. In this paper, we propose DySpec, a faster speculative decoding algorithm with a novel dynamic token tree structure. We begin by bridging the draft distribution and the acceptance rate through intuitive and empirical clues, and show that the two variables are strongly correlated. Based on this, we employ a greedy strategy to dynamically expand the token tree at run time. Theoretically, we show that our method can achieve optimal results under mild assumptions. Empirically, DySpec yields a higher acceptance rate and acceleration than fixed trees. DySpec can drastically improve throughput and reduce the latency of token generation across various data distributions and model sizes, significantly outperforming strong competitors including Specinfer and Sequoia. Under a low-temperature setting, DySpec improves throughput by up to 9.1× and reduces latency by up to 9.4× on Llama2-70B. Under a high-temperature setting, DySpec still improves throughput by up to 6.21×, despite the increasing difficulty of speculating more than one token per step for the draft model.
Keywords: Artificial intelligence; Large language models; Inference acceleration; Speculative decoding
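To make the greedy expansion idea in the abstract concrete, below is a minimal, illustrative sketch of how a token tree might be grown dynamically at run time, using the draft model's path probability as a proxy for the acceptance rate (per the correlation the abstract describes). The function and parameter names (`expand_token_tree`, `draft_top_k`, `budget`, `k`) are hypothetical and not taken from the paper; this is not the authors' implementation.

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Node:
    # heapq is a min-heap, so store the negated path probability as the key.
    neg_path_prob: float
    token: int = field(compare=False)
    parent: "Node | None" = field(compare=False, default=None)
    children: list = field(compare=False, default_factory=list)

def expand_token_tree(draft_top_k, root_token, budget=64, k=4):
    """Greedy dynamic token-tree expansion (illustrative sketch).

    `draft_top_k(prefix_tokens)` is assumed to return the draft model's
    top-k candidates as a list of (token, prob) pairs for the given path
    from the root to the current node.
    """
    root = Node(neg_path_prob=-1.0, token=root_token)
    frontier = [root]  # nodes ranked by path probability (most promising first)
    tree_size = 1
    while frontier and tree_size < budget:
        node = heapq.heappop(frontier)
        # Reconstruct the token path from the root to this node.
        path, cur = [], node
        while cur is not None:
            path.append(cur.token)
            cur = cur.parent
        path.reverse()
        for token, prob in draft_top_k(path)[:k]:
            # Product of draft probabilities along the path serves as a
            # proxy for the probability the target model accepts this branch.
            child = Node(neg_path_prob=node.neg_path_prob * prob,
                         token=token, parent=node)
            node.children.append(child)
            heapq.heappush(frontier, child)
            tree_size += 1
            if tree_size >= budget:
                break
    return root
```

The expanded tree would then be handed to the target model for one verification pass; because expansion always spends the node budget on the branches with the highest estimated acceptance probability, the tree shape adapts to the query distribution instead of being fixed in advance.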