Unlocking the Initial Token’s Power to Enhance Large Language Models Without Training

View a PDF of the paper titled ZeroTuning: Unlocking the Initial Token's Power to Enhance Large Language Models Without Training, by Feijiang Han and 5 other authors
Abstract: Token-level attention tuning, a class of training-free methods that includes Post-hoc Attention Steering (PASTA) and the Attention Calibration Technique (ACT), has emerged as a promising way to improve frozen LLMs with interpretable interventions. However, these methods rely on auxiliary inference to identify "important" task-specific tokens, which can introduce bias and limit applicability when token importance is unclear or when optimized attention kernels make the attention maps inaccessible. We propose a simpler and more elegant alternative: acting only on the initial token (e.g., <BOS> in Llama). Theoretically, adding a lightweight bias to this token's attention logits provides an entry point for controlling the downstream attention distribution, an effect that amplifies its natural role as an attention sink. Our empirical analysis shows that tuning this token can positively affect LLMs and better unlock their pre-trained knowledge, with stronger effects in the early layers and distinct scaling preferences across attention heads. Building on these insights, we introduce ZeroTuning: a training-free method that improves LLM performance by applying head-specific attention adjustments to the initial token, requiring zero parameter updates. We present two variants: a supervised mode calibrated on validation examples, and a new unsupervised mode that directly minimizes the model's output entropy. This lightweight, kernel-agnostic method requires only four lines of modification to the standard Llama attention code. It achieves broad gains across 15 datasets and outperforms previous, more complex methods; for example, with Llama-3.1-8B it yields a 19.9% relative improvement on classification, 4.5% on question answering, and 2.1% on dialogue. ZeroTuning also works out of the box with quantized inference and maintains its performance improvements as context length increases.
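The abstract describes the intervention as a few lines of change to standard Llama attention code that bias the initial token's attention logits per head. As a rough illustration only, here is a minimal sketch of what such a modification could look like in a generic Llama-style attention function; it is not the authors' implementation, and the function name, tensor shapes, and the `init_token_scale` parameter are assumptions made for this example.

```python
# Minimal sketch (assumed, not the paper's code) of a head-specific bias on the
# initial token's attention logits inside a Llama-style attention computation.

import torch
import torch.nn.functional as F


def attention_with_initial_token_tuning(
    q: torch.Tensor,                 # (batch, n_heads, seq_len, head_dim)
    k: torch.Tensor,                 # (batch, n_heads, seq_len, head_dim)
    v: torch.Tensor,                 # (batch, n_heads, seq_len, head_dim)
    init_token_scale: torch.Tensor,  # (n_heads,) per-head multiplier, hypothetical
    causal_mask: torch.Tensor,       # (seq_len, seq_len) additive mask (0 / -inf)
) -> torch.Tensor:
    """Scaled dot-product attention plus a lightweight bias on the logits that
    every query assigns to the initial (<BOS>) token at key position 0."""
    head_dim = q.size(-1)
    # Standard causal attention logits.
    scores = q @ k.transpose(-2, -1) / head_dim**0.5
    scores = scores + causal_mask
    # The lightweight intervention: shift the logits for key position 0 by a
    # per-head log-scale before the softmax. Values above 1.0 sharpen the
    # attention sink on the initial token; values below 1.0 weaken it.
    scores[..., 0] = scores[..., 0] + torch.log(init_token_scale).view(1, -1, 1)
    attn = F.softmax(scores, dim=-1)
    return attn @ v


if __name__ == "__main__":
    B, H, T, D = 1, 4, 8, 16
    q, k, v = (torch.randn(B, H, T, D) for _ in range(3))
    mask = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
    # Example per-head scales; in ZeroTuning these would presumably be chosen
    # either on a validation set (supervised) or by minimizing output entropy
    # (unsupervised), as described in the abstract.
    scales = torch.tensor([1.5, 1.0, 0.8, 2.0])
    out = attention_with_initial_token_tuning(q, k, v, scales, mask)
    print(out.shape)  # torch.Size([1, 4, 8, 16])
```

Because `exp(score + log s) = s * exp(score)`, adding the log-scale to the logit simply multiplies the initial token's unnormalized attention weight by `s`, which is one natural way to amplify or dampen its attention-sink behavior without touching any model parameters.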
Submission history
From: Feijiang Han [view email]
[v1]
Fri, 16 May 2025 22:52:24 UTC (2,080 KB)
[v2]
Fri, 26 Sep 2025 03:55:57 UTC (3,138 KB)