Accelerating Large-Vocabulary Language Models via Frequency-Ranked Speculative Sampling

[Submitted on 20 Feb 2025 (v1), last revised 11 Mar 2025 (this version, v2)]

Authors: Weilin Zhao, Tengyu Pan, Xu Han, Yudi Zhang, Ao Sun, Yuxiang Huang, Kaihuo Zhang, Weilun Zhao, Yuxuan Li, Jianyong Wang, Zhiyuan Liu, Maosong Sun

View a PDF of the paper titled FR-Spec: Accelerating Large-Vocabulary Language Models via Frequency-Ranked Speculative Sampling, by Weilin Zhao and 11 other authors


Abstract: Speculative sampling has emerged as an important technique for accelerating the auto-regressive generation of large language models (LLMs) by using a draft-then-verify mechanism to produce multiple tokens per forward pass. While state-of-the-art speculative sampling methods use only a single layer and a language modeling (LM) head as the draft model to achieve impressive layer compression, their efficiency gains are substantially reduced for large-vocabulary LLMs, such as Llama-3-8B with a vocabulary of 128k tokens. To address this, we present FR-Spec, a frequency-ranked speculative sampling framework that optimizes draft candidate selection by compressing the vocabulary space. By constraining the draft search to a frequency-prioritized token subset, our method reduces LM head computation overhead by 75% while ensuring the equivalence of the final output distribution. Experiments across multiple datasets demonstrate an average $1.12\times$ speedup over the state-of-the-art speculative sampling method EAGLE-2. Code is available at this https URL.
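
To make the vocabulary-compression idea concrete, here is a minimal NumPy sketch of a draft step that projects the hidden state through only the LM head rows of the most frequent tokens. It is an illustration under stated assumptions, not the authors' implementation: the function names (`build_frequency_subset`, `draft_top_k`), the toy sizes, the random token counts, and the choice to keep one quarter of the vocabulary (picked only so the arithmetic lines up with the 75% figure) are all hypothetical. The verification step, in which the target model checks the drafted tokens against its full vocabulary, is not shown.

```python
import numpy as np


def build_frequency_subset(token_counts, keep):
    """Return the ids of the `keep` most frequent tokens, most frequent first."""
    order = np.argsort(-token_counts, kind="stable")
    return order[:keep]


def draft_top_k(hidden, lm_head, subset_ids, k):
    """Pick k draft candidates using only the LM head rows in `subset_ids`."""
    # Only len(subset_ids) rows of the LM head are multiplied, not the full vocabulary.
    subset_logits = lm_head[subset_ids] @ hidden
    local_top = np.argsort(-subset_logits, kind="stable")[:k]
    # Map positions inside the subset back to ids in the full vocabulary.
    return subset_ids[local_top]


# Toy setup (assumed sizes; a real large-vocabulary model has ~128k tokens).
rng = np.random.default_rng(0)
vocab_size, hidden_size = 1000, 16
keep = vocab_size // 4  # assumption: keep a quarter of the vocabulary

token_counts = rng.integers(1, 10_000, size=vocab_size)  # stand-in corpus statistics
lm_head = rng.standard_normal((vocab_size, hidden_size))
hidden = rng.standard_normal(hidden_size)

subset_ids = build_frequency_subset(token_counts, keep)
candidates = draft_top_k(hidden, lm_head, subset_ids, k=5)

# Multiply-accumulate count of the LM head projection, full versus restricted.
full_cost = vocab_size * hidden_size
subset_cost = keep * hidden_size
saving = 1 - subset_cost / full_cost

print("draft candidates (full-vocabulary ids):", candidates)
print(f"LM head projection cost reduced by {saving:.0%}")
```

The saving in this sketch comes purely from the smaller matrix product in the draft step. Because the drafted ids are mapped back to the full vocabulary, a verifier working over all tokens can still accept or reject them as usual.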