[2503.07453] Is a Good Foundation Necessary for Efficient Reinforcement Learning? The Computational Role of the Base Model in Exploration

View a PDF of the paper titled Is a Good Foundation Necessary for Efficient Reinforcement Learning? The Computational Role of the Base Model in Exploration, by Dylan J. Foster and Zakaria Mhammedi and Dhruv Rohatgi

Abstract: Language model alignment (or, reinforcement learning) techniques that leverage active exploration, that is, deliberately encouraging the model to produce diverse, informative responses, offer the promise of super-human capabilities. However, current understanding of algorithm design primitives for computationally efficient exploration with language models is limited. To better understand how to leverage access to powerful pre-trained generative models to improve the efficiency of exploration, we introduce a new computational framework for RL with language models, in which the learner interacts with the model through a sampling oracle. Focusing on the linear softmax model parameterization, we provide new results that reveal the computational-statistical tradeoffs of efficient exploration:

1. Necessity of coverage: Coverage refers to the extent to which the pre-trained model covers near-optimal responses, a form of hidden knowledge. We show that coverage, while not necessary for data efficiency, lower bounds the runtime of any algorithm in our framework.

2. Inference-time exploration: We introduce a new algorithm, SpannerSampling, which obtains optimal data efficiency and is computationally efficient whenever the pre-trained model enjoys sufficient coverage, matching our lower bound. SpannerSampling leverages inference-time computation with the pre-trained model to reduce the effective search space for exploration.

3. Insufficiency of training-time interventions: We contrast the result above by showing that training-time interventions that produce proper policies cannot achieve similar guarantees in polynomial time.

4. Computational benefits of multi-turn exploration: Finally, we show that under additional representational assumptions, one can achieve improved runtime (replacing sequence-level coverage with token-level coverage) through multi-turn exploration.
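
To make the link between coverage and runtime concrete, here is a toy sketch in Python. It is not the paper's SpannerSampling algorithm; it only illustrates, under assumptions of our own, why drawing from a target policy through a sampling oracle costs roughly as many base-model samples as the coverage coefficient. The three-response space, the feature map `features`, the parameter `theta`, the temperature `beta`, and the definition of coverage as the largest density ratio between target and base policy are all illustrative assumptions, and `pi_ref` is assumed to be strictly positive.

```python
import math
import random


def linear_softmax(pi_ref, features, theta, beta=1.0):
    """Tilt the base distribution: pi_theta(y) is proportional to
    pi_ref(y) * exp(<theta, phi(y)> / beta). (Assumed toy form.)"""
    weights = [
        p * math.exp(sum(t * f for t, f in zip(theta, phi)) / beta)
        for p, phi in zip(pi_ref, features)
    ]
    total = sum(weights)
    return [w / total for w in weights]


def coverage(pi_target, pi_ref):
    """Largest density ratio pi_target(y) / pi_ref(y) (assumed definition)."""
    return max(t / r for t, r in zip(pi_target, pi_ref) if t > 0)


def sample_via_oracle(pi_target, pi_ref, rng):
    """Rejection sampling: the base model is only used to draw samples.

    Returns (response index, number of oracle calls used)."""
    c = coverage(pi_target, pi_ref)
    calls = 0
    while True:
        # One call to the sampling oracle for the base model.
        y = rng.choices(range(len(pi_ref)), weights=pi_ref)[0]
        calls += 1
        # Accept with probability pi_target(y) / (c * pi_ref(y)) <= 1.
        if rng.random() < pi_target[y] / (c * pi_ref[y]):
            return y, calls


rng = random.Random(0)
pi_ref = [0.90, 0.09, 0.01]           # base model rarely emits the best response
features = [[0.0], [1.0], [2.0]]      # hypothetical one-dimensional features
theta = [3.0]                         # hypothetical parameter favouring response 2

pi_theta = linear_softmax(pi_ref, features, theta)
c = coverage(pi_theta, pi_ref)
draws = [sample_via_oracle(pi_theta, pi_ref, rng) for _ in range(2000)]
mean_calls = sum(n for _, n in draws) / len(draws)
print(f"coverage = {c:.2f}, mean oracle calls per sample = {mean_calls:.2f}")
```

In this sketch the average number of oracle calls per accepted sample tracks the coverage value, so a base model that puts little mass on good responses makes each useful sample expensive. That is the intuition behind item 1; the paper's lower bound is a far more general statement than this toy case.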

Submission history

From: Dylan Foster
[v1] Mon, 10 Mar 2025 15:31:42 UTC (111 KB)
[v2] Thu, 13 Mar 2025 23:15:55 UTC (112 KB)

2025-03-17 04:00:00
