[2410.19982] Random Policy Enables In-Context Reinforcement Learning within Trust Horizons

View a PDF of the paper titled "Random Policy Enables In-Context Reinforcement Learning within Trust Horizons", by Weiqin Chen and 1 other authors


Abstract: Pretrained foundation models (FMs) have shown remarkable performance in in-context learning, enabling zero-shot generalization to new tasks not encountered during training. In the case of reinforcement learning (RL), in-context RL (ICRL) emerges when FMs are pretrained on decision-making problems in an autoregressively supervised manner. However, state-of-the-art ICRL algorithms, such as Algorithm Distillation, Decision Pretrained Transformer, and Decision Importance Transformer, impose stringent requirements on the pretraining dataset concerning the source policies, context information, and action labels. Notably, these algorithms either demand optimal policies or require, to varying degrees, well-trained behavior policies for all pretraining environments. This significantly hinders the application of ICRL to real-world scenarios, where acquiring optimal or well-trained policies for a substantial number of real-world training environments can be intractable. To overcome this challenge, we introduce a novel approach, termed State-Action Distillation (SAD), that allows generating an effective pretraining dataset guided solely by random policies. In particular, SAD selects query states and corresponding action labels by distilling outstanding state-action pairs from the entire state and action spaces using random policies within a trust horizon, and then inherits the classical autoregressively supervised mechanism during pretraining. To our knowledge, this is the first work that enables effective ICRL under random policies and random contexts. We also establish quantitative trustworthiness analysis as well as performance guarantees for SAD. Moreover, our empirical results across multiple popular ICRL benchmark environments demonstrate that, on average, SAD outperforms the best baseline by 236.3% in the offline evaluation and by 135.2% in the online evaluation.
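The data-generation step described in the abstract can be sketched in code. The following is a minimal toy illustration of the idea, not the authors' implementation: the chain MDP, the function names, the trust horizon of 3, and the rollout count are all invented for the example. Each query state is labeled with the action whose Monte-Carlo return under a random continuation policy, truncated at the trust horizon, is highest.

```python
import random

random.seed(0)  # reproducibility for this sketch

# Toy 1-D chain MDP: states 0..4, actions {0: left, 1: right};
# reward 1.0 each time the agent lands on the rightmost state.
N_STATES = 5
ACTIONS = [0, 1]

def step(state, action):
    nxt = max(0, min(N_STATES - 1, state + (1 if action == 1 else -1)))
    return nxt, (1.0 if nxt == N_STATES - 1 else 0.0)

def random_return(state, action, horizon, n_rollouts=30):
    """Monte-Carlo estimate of the return of taking `action` in `state`
    and then following a uniformly random policy, truncated after
    `horizon` steps (the trust horizon)."""
    total = 0.0
    for _ in range(n_rollouts):
        s, r = step(state, action)
        ret = r
        for _ in range(horizon - 1):
            s, r = step(s, random.choice(ACTIONS))
            ret += r
        total += ret
    return total / n_rollouts

def build_dataset(horizon=3):
    """Label every query state with the action whose estimated
    random-policy return within the trust horizon is highest."""
    dataset = []
    for s in range(N_STATES):
        best = max(ACTIONS, key=lambda a: random_return(s, a, horizon))
        dataset.append((s, best))
    return dataset

data = build_dataset()
print(data)
```

The resulting (state, action) pairs would then serve as query states and action labels for the usual supervised pretraining stage; everything here only ever executes random policies.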

Submission history

From: Weiqin Chen [view email]
[v1]

Fri, 25 Oct 2024 21:46:25 UTC (1,081 KB)
[v2]

Tue, 14 Jan 2025 06:18:03 UTC (614 KB)
[v3]

Fri, 2 May 2025 03:19:49 UTC (1,033 KB)
