Accelerating Large-Vocabulary Language Models via Frequency-Ranked Speculative Sampling

View a PDF of the paper titled FR-Spec: Accelerating Large-Vocabulary Language Models via Frequency-Ranked Speculative Sampling, by Weilin Zhao and 11 other authors
Abstract: Speculative sampling has emerged as an important technique for accelerating the auto-regressive generation of large language models (LLMs) by using a draft-then-verify mechanism to produce multiple tokens per forward pass. While state-of-the-art speculative sampling methods use only a single layer and a language modeling (LM) head as the draft model to achieve impressive layer compression, their efficiency gains are substantially reduced for large-vocabulary LLMs, such as Llama-3-8B with a vocabulary of 128k tokens. To address this, we present FR-Spec, a frequency-ranked speculative sampling framework that optimizes draft candidate selection through vocabulary space compression. By constraining the draft search to a frequency-prioritized token subset, our method reduces LM head computation overhead by 75% while ensuring the equivalence of the final output distribution. Experiments across multiple datasets demonstrate an average speedup of $1.12\times$ over the state-of-the-art speculative sampling method EAGLE-2. Code is available at this https URL.
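The vocabulary compression idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy LM head, and the toy corpus are all hypothetical, and the verification step by the full-vocabulary target model (which is what preserves the output distribution) is not shown. The sketch only shows how restricting the draft LM head to a frequency-ranked token subset shrinks the number of rows that must be multiplied.

```python
from collections import Counter


def top_k_token_ids(corpus_token_ids, k):
    """Return the k most frequent token ids, most frequent first."""
    counts = Counter(corpus_token_ids)
    # Sort by descending count; break ties on the smaller id for determinism.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [tok for tok, _ in ranked[:k]]


def draft_logits(hidden, lm_head, subset_ids):
    """Compute logits only for the LM head rows listed in subset_ids."""
    return [
        sum(h * w for h, w in zip(hidden, lm_head[tok]))
        for tok in subset_ids
    ]


def draft_top_token(hidden, lm_head, subset_ids):
    """Pick the best draft token and map it back to its full-vocabulary id."""
    logits = draft_logits(hidden, lm_head, subset_ids)
    best = max(range(len(logits)), key=logits.__getitem__)
    return subset_ids[best]


# Toy setup (all values are made up for illustration).
vocab_size = 8
lm_head = [[float(i), float(-i)] for i in range(vocab_size)]  # row i = [i, -i]
corpus = [3, 1, 3, 5, 1, 3, 7, 1, 0, 3]

# Assumption: keep a quarter of the vocabulary for drafting.
subset = top_k_token_ids(corpus, k=vocab_size // 4)
kept_fraction = len(subset) / vocab_size

hidden = [1.0, 0.0]
draft_token = draft_top_token(hidden, lm_head, subset)
```

In this toy setup the draft head multiplies 2 rows instead of 8, so it does a quarter of the multiply-adds. Keeping a quarter of the rows is chosen here only because it lines up arithmetically with the 75% reduction stated in the abstract; the abstract itself does not give the subset size.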
Submission history
From: Weilin Zhao
[v1]
Thursday, 20 Feb 2025 18:58:10 UTC (1,285 KB)
[v2]
Tuesday, 11 Mar 2025 08:54:55 UTC (1,285 KB)
2025-03-12 04:00:00