IN-CONTEXT POLICY ITERATION

Abstract

This work presents In-Context Policy Iteration, an algorithm for performing Reinforcement Learning (RL), in-context, using foundation models. While the application of foundation models to RL has received considerable attention, most approaches rely on either (1) the curation of expert demonstrations (either through manual design or task-specific pretraining) or (2) adaptation to the task of interest using gradient methods (either fine-tuning or training of adapter layers). Both of these techniques have drawbacks. Collecting demonstrations is labor-intensive, and algorithms that rely on them do not outperform the experts from which the demonstrations were derived. All gradient techniques are inherently slow, sacrificing the "few-shot" quality that made in-context learning attractive to begin with. In this work, we present an algorithm, ICPI, that learns to perform RL tasks without expert demonstrations or gradients. Instead we present a policy-iteration method in which the prompt content is the entire locus of learning. ICPI iteratively updates the contents of the prompt from which it derives its policy through trial-and-error interaction with an RL environment. In order to eliminate the role of in-weights learning (on which approaches like Decision Transformer rely heavily), we demonstrate our algorithm using Codex (Chen et al., 2021b), a language model with no prior knowledge of the domains on which we evaluate it.

1. INTRODUCTION

In-context learning describes the ability of sequence-prediction models to generalize to novel downstream tasks when prompted with a small number of exemplars (Lu et al., 2021; Brown et al., 2020). The introduction of the Transformer architecture (Vaswani et al., 2017) has significantly increased interest in this phenomenon, since this architecture demonstrates much stronger generalization capacity than its predecessors (Chan et al., 2022). Another interesting property of in-context learning in the case of large pre-trained models (or "foundation models") is that the models are not directly trained to optimize a meta-learning objective, but demonstrate an emergent capacity to generalize (or at least specialize) to diverse downstream task-distributions (Brown et al., 2020; Wei et al., 2022a). A litany of existing work has explored methods for applying this remarkable capability to downstream tasks (see Related Work), including Reinforcement Learning (RL). Most work in this area either (1) assumes access to expert demonstrations, collected either from human experts (Huang et al., 2022a; Baker et al., 2022) or from domain-specific pre-trained RL agents (Chen et al., 2021a; Lee et al., 2022; Janner et al., 2021; Reed et al., 2022; Xu et al., 2022), or (2) relies on gradient-based methods, e.g. fine-tuning of the foundation model's parameters as a whole (Lee et al., 2022; Reed et al., 2022; Baker et al., 2022) or newly training an adapter layer or prefix vectors while keeping the original foundation model frozen (Li & Liang, 2021; Singh et al., 2022; Karimi Mahabadi et al., 2022). Our work presents an algorithm, In-Context Policy Iteration (ICPI), which relaxes these assumptions. ICPI is a form of policy iteration in which the prompt content is the locus of learning. Because our method operates on the prompt itself rather than the model parameters, we are able to avoid gradient methods.
Furthermore, the use of policy iteration frees us from expert demonstrations because suboptimal prompts can be improved over the course of training. We illustrate the algorithm empirically on six small illustrative RL tasks (chain, distractor-chain, maze, mini-catch, mini-invaders, and point-mass) in which the algorithm very quickly finds good policies. We also compare five pretrained Large Language Models (LLMs): two models of different sizes trained on natural language (OPT-30B and GPT-J) and three models trained on program code (two sizes of Codex as well as InCoder). On our six domains, we find that only the largest model (the code-davinci-001 variant of Codex) consistently demonstrates learning.

2. RELATED WORK

A common application of foundation models to RL involves tasks that have language input, for example natural language instructions/goals (Garg et al., 2022a; Hill et al., 2020) or text-based games (Peng et al., 2021; Singh et al., 2021; Majumdar et al., 2020; Ammanabrolu & Riedl, 2021). Another approach encodes RL trajectories into token sequences and processes them with a foundation model, using the model representations as input to deep RL architectures (Li et al., 2022; Tarasov et al., 2022; Tam et al., 2022). Finally, a recent set of approaches treats RL as a sequence modeling problem and uses the foundation model itself to predict states or actions. This section will focus on this last category. Huang et al. (2022a), which uses the frozen model to select good prompts from a given fixed library of goal/plan descriptions).

2.3. IN-CONTEXT LEARNING

Several recent papers have specifically studied in-context learning. Min et al. (2022) demonstrate that LLMs can learn in-context even when the labels in the prompt are randomized, problematizing the conventional understanding of in-context learning and showing that label distribution is more important than label correctness. Chan et al. (2022) and Garg et al. (2022b) provide analyses of the properties that drive in-context learning, the first in the context of image classification, the second in the context of regression onto a continuous function. These papers identify various properties, including "burstiness," model size, and model architecture, that in-context learning depends on. Chen et al. (2022) study the sensitivity of in-context learning to small perturbations of the context. They propose a novel method that uses sensitivity as a proxy for model certainty.

3. ALGORITHM

How can standard policy iteration make use of in-context learning? Policy iteration is either model-based (using a world-model to plan future trajectories in the environment) or model-free (inferring value-estimates without explicit planning). Both methods can be realized with in-context learning. We choose model-based learning because planned trajectories make the underlying logic of value-estimates explicit to our foundation model backbone, providing a concrete instantiation of a trajectory that realizes the values. This ties into recent work (Wei et al., 2022b; Nye et al., 2021) demonstrating that "chains of thought" can significantly improve few-shot performance of foundation models.

Algorithm 1 Training Loop
 1: function TRAIN(environment)
 2:     initialize $D$    ▷ replay buffer containing full history of behavior
 3:     while training do
 4:         $s_0 \leftarrow$ Reset environment.
 5:         while episode is not done do
 6:             $a_t \leftarrow \arg\max_a Q(s_t, a, D)$    ▷ policy improvement
 7:             $s_{t+1}, r_t, b_t \leftarrow$ Execute $a_t$ in environment.
 8:             $t \leftarrow t + 1$
 9:         end while
10:         $D \leftarrow D \cup (s_0, a_0, r_0, b_0, s_1, \ldots, s_t, a_t, r_t, b_t, s_{t+1})$    ▷ add trajectory to buffer
11:     end while
12: end function

Algorithm 2 Computing Q-values
 1: function Q($s_t$, $a$, $D$)
 2:     $u \leftarrow t$
 3:     $s_1 = s_t$
 4:     $a_1 = a$
 5:     repeat    ▷ All samples come from the experience buffer $D$
 6:         $D_b \sim$ time-steps with action $a_u$    ▷ balancing terminal and non-terminal
 7:         $b_u \sim \mathrm{LLM}(D_b, s_u, a_u)$
 8:         $D_r \sim$ time-steps with action $a_u$ and termination $b_u$    ▷ balancing reward
 9:         $r_u \sim \mathrm{LLM}(D_r, s_u, a_u)$
10:         $D_s \sim$ time-steps with action $a_u$ and termination $b_u$    ▷ no balancing
11:         $s_{u+1} \sim \mathrm{LLM}(D_s, s_u, a_u)$
12:         $D_a \sim$ $c$ recent trajectories
13:         $a_{u+1} \sim \mathrm{LLM}(D_a, s_{u+1})$
14:         $u \leftarrow u + 1$
15:     until $b_u$ is terminal
16:     return $\sum_{k=1}^{u} \gamma^{k-1} r_k$
17: end function

Model-based RL requires two ingredients: a rollout-policy used to act during planning and a world-model used to predict future rewards, terminations, and states. Since our approach avoids any mutation of the foundation model's parameters (this would require gradients), we must instead induce the rollout-policy and the world-model using in-context learning, i.e. by selecting appropriate prompts. We induce the rollout-policy by prompting the foundation model with trajectories drawn from the current (or recent) behavior policy (distinct from the rollout-policy). Similarly, we induce the world-model by prompting the foundation model with transitions drawn from the agent's history of experience. Note that our approach assumes access to some translation between the state-space of the environment and the medium (language, images, etc.) of the foundation model.

This explains how an algorithm might plan and estimate values using a foundation model. It also explains how the rollout-policy approximately tracks the behavior policy. How does the policy improve? When acting in the environment (as opposed to planning), we choose the action that maximizes the estimated Q-value from the current state (see Training Loop pseudocode, line 6).
At time step $t$, the agent observes the state of the environment (denoted $s_t$) and executes action $a_t = \arg\max_{a \in A} Q^{\pi_t}(s_t, a)$, where $A = \{A(1), \ldots, A(n)\}$ denotes the set of $n$ actions available, $\pi_t$ denotes the policy of the agent at time step $t$, and $Q^{\pi}$ denotes the Q-estimate for policy $\pi$. Taking the greedy (arg max) actions with respect to $Q^{\pi_t}$ implements a new and improved policy.

Computing Q-values. This section provides details on the prompts that we use in our computation of Q-values (see Computing Q-values pseudocode & Figure 1). During training, we maintain a buffer $D$ of transitions experienced by the agent. To compute $Q^{\pi_t}(s_t, a)$ at time step $t$ in the real world, we roll out a simulated trajectory $s_1 = s_t, a_1 = a, r_1, s_2, a_2, r_2, \ldots, s_T, a_T, r_T, s_{T+1}$ by predicting, at each simulation time step $u$: reward $r_u \sim \mathrm{LLM}(D_r, s_u, a_u)$; termination $b_u \sim \mathrm{LLM}(D_b, s_u, a_u)$; observation $s_{u+1} \sim \mathrm{LLM}(D_s, s_u, a_u)$; action $a_{u+1} \sim \mathrm{LLM}(D_a, s_{u+1})$. Termination $b_u$ decides whether the simulated trajectory ends at step $u$.

The prompts $D_r$, $D_b$, and $D_s$ contain data sampled from the replay buffer. For each prompt, we choose some subset of replay buffer transitions, shuffle them, convert them to text (examples are provided in Table 4.1) and clip the prompt at the 4000-token Codex context limit. We use the same method for $D_a$, except that we use random trajectory subsequences.

In order to maximize the relevance of the prompt contents to the current inference, we select transitions using the following criteria. $D_b$ contains $(s_k, a_k, b_k)$ tuples such that $a_k$ equals $a_u$, the action for which the LLM must infer termination. $D_r$ contains $(s_k, a_k, r_k)$ tuples, again constraining $a_k = a_u$ but also constraining $b_k = b_u$, so that the tuple corresponds to a terminal time-step if the LLM inferred $b_u$ = true, and to a non-terminal time-step if $b_u$ = false. For $D_s$, the prompt includes $(s_k, a_k, s_{k+1})$ tuples with $a_k = a_u$ and $b_k$ = false (only non-terminal states need to be modelled).

We also maintain a balance of certain kinds of transitions in the prompt. For termination prediction, we balance terminal and non-terminal time-steps. Since non-terminal time-steps far outnumber terminal time-steps, this eliminates a situation wherein the randomly sampled prompt time-steps are entirely non-terminal, all but ensuring that the LLM will predict non-termination. Similarly, for reward prediction, we balance the number of time-steps corresponding to each reward value stored in $D$. In order to balance two collections of unequal size, we take the smaller and duplicate randomly chosen members until the sizes are equal.

In contrast to the other predictions, we condition the rollout policy on trajectory subsequences, not individual time-steps. Prompting with sequences better enables the foundation model to apprehend the logic behind a policy. Trajectory subsequences consist of $(s_k, a_k)$ pairs, randomly clipped from the $c$ most recent trajectories. More recent trajectories will, in general, demonstrate higher performance, since they come from policies that have benefited from more rounds of improvement. Finally, the Q-value estimate is simply the discounted sum of rewards for the simulated episode. Given this description of Q-value estimation, we now return to the concept of policy improvement.

Policy-Improvement. The arg max (line 6 of Algorithm 1) drives policy improvement in ICPI. Critically, this is not simply a one-step improvement as with Trajectory Transformer (Janner et al., 2021) but a mechanism that builds improvement on top of improvement. This occurs through a cycle in which the arg max improves behavior. The improved behavior is stored in the buffer $D$, and then used to condition the rollout policy. This improves the returns generated by the LLM during planning rollouts. These improved rollouts improve the Q-estimates for each action. Completing the cycle, this improves the actions chosen by the arg max. Because this process feeds into itself, it can drive improvement without bound until optimality is achieved.

Note that this process takes advantage of properties specific to in-context learning. In particular, it relies on the assumption that the rollout policy, when prompted with trajectories drawn from a mixture of policies, will approximate something like an average of these policies. Given this assumption, the rollout policy will improve with the improvement of the mixture of policies from which its prompt-trajectories are drawn. This results in a kind of rapid policy improvement that works without any use of gradients.
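The rollout-based Q-estimate and the greedy action choice described in this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the four callables passed to `q_estimate` stand in for the prompted LLM queries, `balance` mirrors the duplication scheme for two collections of unequal size, and `max_steps`, the default `gamma = 0.8` (borrowed from the evaluation discount), and the hand-written toy chain model at the bottom are hypothetical choices made only for the example.

```python
import random
from typing import Callable, List, Sequence, Tuple


def balance(group_a: Sequence, group_b: Sequence, rng: random.Random) -> Tuple[List, List]:
    """Duplicate random members of the smaller group until both sizes match."""
    a, b = list(group_a), list(group_b)
    if not a or not b:
        return a, b  # an empty group has nothing to duplicate
    small, large = (a, b) if len(a) < len(b) else (b, a)
    originals = list(small)
    while len(small) < len(large):
        small.append(rng.choice(originals))
    return a, b


def q_estimate(state, action, predict_done: Callable, predict_reward: Callable,
               predict_state: Callable, rollout_policy: Callable,
               gamma: float = 0.8, max_steps: int = 50) -> float:
    """Discounted sum of rewards along one simulated rollout from (state, action)."""
    total, discount = 0.0, 1.0
    for _ in range(max_steps):  # max_steps is a safety cap (assumption)
        done = predict_done(state, action)                        # b_u
        total += discount * predict_reward(state, action, done)   # r_u
        if done:
            break
        state = predict_state(state, action)                      # s_{u+1}
        action = rollout_policy(state)                            # a_{u+1}
        discount *= gamma
    return total


def greedy_action(state, actions: Sequence, q: Callable, rng: random.Random):
    """Arg max over Q-estimates, breaking ties randomly."""
    values = [q(state, a) for a in actions]
    best = max(values)
    return rng.choice([a for a, v in zip(actions, values) if v == best])


# Hypothetical hand-written stand-ins for the LLM on the 8-state chain (goal = 4).
ACTIONS = ["left", "right", "try_goal"]


def toy_done(s, a):
    return a == "try_goal"


def toy_reward(s, a, done):
    return 1.0 if done and s == 4 else 0.0


def toy_state(s, a):
    return max(0, s - 1) if a == "left" else min(7, s + 1)


def toy_policy(s):
    return "try_goal" if s == 4 else ("right" if s < 4 else "left")


def toy_q(s, a):
    return q_estimate(s, a, toy_done, toy_reward, toy_state, toy_policy)


print(greedy_action(2, ACTIONS, toy_q, random.Random(0)))
```

In ICPI itself the four stand-ins would be LLM calls conditioned on prompts drawn from the buffer, so the quality of the estimate depends on how well the model predicts transitions and imitates recent behavior.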

Prompt-Format

The LLM cannot take non-linguistic prompts, so our algorithm assumes access to a textual representation of the environment (of states, actions, terminations, and rewards) and some way to recover the original action, termination, and reward values from their textual representation (we do not attempt to recover states). Since our primary results use the Codex language model (see Table 2), we use Python code to represent these values (examples are available in Table 1). In our experiments, we discovered that the LLM world-model was unable to reliably predict rewards, terminations, and next-states on some of the more difficult environments. We experimented with providing domain hints in the form of prompt formats that make useful information explicit, similar to Chain of Thought Prompting (Wei et al., 2022b). For example, for the chain domain, the hint includes an explicit comparison (== or !=) of the current state with the goal state. Note that while hints are provided in the initial context, the LLM must infer the hint content in rollouts generated from this context. We use a consistent idiom for rewards and terminations, namely assert reward == x and assert done or assert not done. Some decisions had to be made when representing states and actions. In general, we strove to use simple, idiomatic, concise Python. On the more challenging environments, we did search over several options for the choice of hint. We anticipate that in the future, stronger foundation models will be increasingly robust to these decisions.
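To make the prompt idioms concrete, the following sketch renders a chain time-step as Python-like text and recovers reward and termination values from text. Only the individual idioms (assert state == n, action functions such as left(), assert reward == x, assert done / assert not done, and the == / != goal comparison) come from the description above; the exact line layout, the way the hint is appended to the state line, and the helper names are assumptions made for illustration.

```python
import re
from typing import Optional

GOAL_STATE = 4  # goal state of the chain domain


def state_text(state: int, hint: bool = True) -> str:
    """Render a chain state; the hint compares it explicitly with the goal."""
    text = f"assert state == {state}"
    if hint:
        comparison = "==" if state == GOAL_STATE else "!="
        text += f" and state {comparison} {GOAL_STATE}"  # hypothetical hint layout
    return text


def transition_text(state: int, action: str, reward: int, done: bool,
                    hint: bool = True) -> str:
    """Render one (state, action, reward, termination) time-step as text."""
    return "\n".join([
        state_text(state, hint),
        f"{action}()",
        f"assert reward == {reward}",
        "assert done" if done else "assert not done",
    ])


def parse_reward(text: str) -> Optional[float]:
    """Recover the reward value from generated text, or None if absent."""
    match = re.search(r"assert reward == (-?\d+(?:\.\d+)?)", text)
    return float(match.group(1)) if match else None


def parse_done(text: str) -> Optional[bool]:
    """Recover the termination flag from generated text, or None if absent."""
    match = re.search(r"assert (not )?done", text)
    return None if match is None else match.group(1) is None


print(transition_text(3, "right", 0, False))
```

Keeping rendering and parsing symmetric in this way means any text the model completes in the same idiom can be mapped back to reward and termination values.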
4. EXPERIMENTS

We have three main goals in our experiments: (1) demonstrate that the agent algorithm can in fact quickly learn good policies, using pretrained LLMs, in a set of six simple illustrative domains of increasing challenge; (2) provide evidence through an ablation that the policy-improvement step (taking the arg max over Q-values computed through LLM rollouts) accelerates learning; and (3) investigate the impact of using different LLMs (see Table 2) of different sizes and trained on different data, in particular, trained on (mostly) natural language (GPT-3 and GPT-J) vs. program code (Codex and InCoder). We next describe the six domains and their associated prompt formats, and then describe the experimental methodology and results.

Chain.

In this environment, the agent occupies an 8-state chain. The agent has three actions: left, right, and try goal. The try goal action always terminates the episode, conferring a reward of 1 on state 4 (the goal state) and 0 on all other states. Episodes also terminate after 8 time-steps. States are represented as numbers from 0 to 7, as in assert state == n, with the appropriate integer substituted for n. The actions are represented as functions left(), right(), and try_goal(). For the hint, we simply indicate whether or not the current state matches the goal state, 4.

Distractor Chain.

This environment is an 8-state chain, identical to the chain environment, except that the observation is a pair of integers, the first indicating the true state of the agent and the second acting as a distractor which transitions randomly within {0, . . . , 7}. The agent must therefore learn to ignore the distractor integer and base its inferences on the information contained in the first integer. Aside from the addition of this distractor integer to the observation, all text representations and hints are identical to the chain environment.

Maze.

The agent navigates a small 3×3 gridworld with obstacles. The agent can move up, down, left, or right. The episode terminates with a reward of 1 once the agent navigates to the goal grid, or with a reward of 0 after 8 time-steps. This environment tests our algorithm's capacity to handle 2-dimensional movement and obstacles, as well as a 4-action action-space. We represent the states as namedtuples, C(x, y), with integers substituted for x and y. Similar to chain, the hint indicates whether or not the state corresponds to the goal state.

Mini Catch.

The agent operates a paddle to catch a falling ball. The ball falls from a height of 5 units, descending one unit per time step. The paddle can stay in place (not move), or move left or right along the bottom of the 4-unit wide plane. The agent receives a reward of 1 for catching the ball and 0 for other time-steps. The episode ends when the ball's height reaches the paddle, regardless of whether or not the paddle catches the ball. We chose this environment specifically to challenge the action-inference/rollout-policy component of our algorithm. Specifically, note that the success condition in Mini Catch allows the paddle to meander before moving under the ball, as long as it gets there on the final time-step. Successful trajectories that include movement away from the ball thus make good rollout policies more challenging to learn (i.e., elicit from the LLM via prompts). Again, we represent both the paddle and the ball as namedtuples C(x, y) and we represent actions as methods of the paddle object: paddle.stay(), paddle.left(), and paddle.right(). For the hint, we call out the location of the paddle's x-position, the ball's x-position, the relation between these positions (which is larger than which, or whether they are equal) and the ball's y-position. Table 1 provides an example. We also include the text ball.descend() to account for the change in the ball's position between states.
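As a concrete reference for the simplest of the domains above, the chain and distractor-chain environments could be sketched as follows. The text specifies the 8 states, the three actions, the goal state 4, and the 8-step limit; the uniformly random start state, the clamping of movement at the two ends of the chain, and the `reset`/`step` interface are assumptions of this sketch.

```python
import random
from typing import Optional, Tuple, Union

Observation = Union[int, Tuple[int, int]]


class ChainEnv:
    """8-state chain; try_goal ends the episode and pays 1 only on state 4."""

    N_STATES = 8
    GOAL = 4
    MAX_STEPS = 8
    ACTIONS = ("left", "right", "try_goal")

    def __init__(self, distractor: bool = False, seed: Optional[int] = None):
        self.distractor = distractor
        self.rng = random.Random(seed)
        self.state = 0
        self.steps = 0

    def _observe(self) -> Observation:
        if self.distractor:
            # The second integer moves randomly and carries no information.
            return (self.state, self.rng.randrange(self.N_STATES))
        return self.state

    def reset(self, state: Optional[int] = None) -> Observation:
        # Assumption: uniformly random start state unless one is supplied.
        self.state = self.rng.randrange(self.N_STATES) if state is None else state
        self.steps = 0
        return self._observe()

    def step(self, action: str) -> Tuple[Observation, int, bool]:
        if action not in self.ACTIONS:
            raise ValueError(f"unknown action: {action}")
        self.steps += 1
        if action == "try_goal":
            reward = 1 if self.state == self.GOAL else 0
            return self._observe(), reward, True
        delta = -1 if action == "left" else 1
        # Assumption: movement is clamped at the ends of the chain.
        self.state = min(self.N_STATES - 1, max(0, self.state + delta))
        return self._observe(), 0, self.steps >= self.MAX_STEPS


env = ChainEnv(seed=0)
env.reset(state=3)
print(env.step("right"), env.step("try_goal"))
```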

Mini Invaders.

The agent operates a ship that shoots down aliens which descend from the top of the screen. At the beginning of an episode, two aliens spawn at a random location in two of four columns. The episode terminates when an alien reaches the ground (resulting in 0 reward) or when the ship shoots down both aliens (the agent receives 1 reward per alien). The agent can move left, right, or shoot. This domain highlights ICPI's capacity to learn incrementally, rather than discovering an optimal policy through random exploration and then imitating that policy, which is how our "No arg max" baseline learns (see Comparison of ICPI with baseline algorithms). ICPI initially learns to shoot down one alien, and then builds on this good but suboptimal policy to discover the better policy of shooting down both aliens. In contrast, random exploration takes much longer to discover the optimal policy and the "No arg max" baseline has only experienced one or two successful trajectories by the end of training. We represent the ship by its namedtuple coordinate (C(x, y)) and the aliens as a list of these namedtuples. When an alien is shot down, we substitute None for the tuple, as in aliens == [C(x, y), None]. We add the text for a in aliens: a.descend() in order to account for the change in the aliens' positions between states. The hint includes identification of the aliens' and the ship's x-positions as well as a comparison between them.

Point-Mass.

A point-mass spawns at a random position on a continuous line between -6 and +6 with a velocity of 0. The agent can either accelerate the point-mass (increase the velocity by 1) or decelerate it (decrease the velocity by 1). The point-mass position changes by the amount of its velocity each time-step. The episode terminates with a reward of 1 once the point-mass is between -2 and +2 and its velocity is 0 once again. The episode also terminates after 8 time-steps. This domain tests the algorithm's ability to handle continuous states. States are represented as assert pos == p and vel == v, substituting floats rounded to two decimals for p and v. The actions are accel(pos, vel) and decel(pos, vel). The hint indicates whether the success conditions are met, namely the relationship of pos to -2 and +2 and whether or not vel == 0.
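The point-mass dynamics can be written down directly from this description. The sketch below is illustrative only: the text does not say whether velocity is updated before position, whether the bounds -2 and +2 are inclusive, or how a spawn that already lies inside the target region is treated, so the choices made here (velocity first, inclusive bounds, success checked only after a step) are assumptions, as are the class and function names.

```python
import random
from typing import Optional, Tuple


class PointMassEnv:
    """Point-mass on a line; succeed by coming to rest inside [-2, +2]."""

    MAX_STEPS = 8

    def __init__(self, seed: Optional[int] = None):
        self.rng = random.Random(seed)
        self.pos, self.vel, self.steps = 0.0, 0.0, 0

    def reset(self, pos: Optional[float] = None) -> Tuple[float, float]:
        # Spawn uniformly in [-6, +6] with zero velocity unless pos is given.
        self.pos = self.rng.uniform(-6.0, 6.0) if pos is None else float(pos)
        self.vel = 0.0
        self.steps = 0
        return self.pos, self.vel

    def step(self, action: str) -> Tuple[Tuple[float, float], int, bool]:
        if action not in ("accel", "decel"):
            raise ValueError(f"unknown action: {action}")
        self.vel += 1.0 if action == "accel" else -1.0
        self.pos += self.vel  # assumption: velocity is updated before position
        self.steps += 1
        # Assumption: the bounds -2 and +2 are inclusive.
        success = -2.0 <= self.pos <= 2.0 and self.vel == 0.0
        done = success or self.steps >= self.MAX_STEPS
        return (self.pos, self.vel), (1 if success else 0), done


def state_text(pos: float, vel: float) -> str:
    """Render a state with floats rounded to two decimals."""
    return f"assert pos == {pos:.2f} and vel == {vel:.2f}"


env = PointMassEnv(seed=0)
env.reset(pos=3.0)
print(env.step("decel"), env.step("accel"))
```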

4.2. EXPERIMENT METHODOLOGY AND RESULTS

Methodology and evaluation. For the results, we record the agent's regret over the course of training relative to an optimal policy computed with a discount factor of 0.8. For all experiments c = 8 (the number of most recent successful trajectories to include in the prompt). We did not have time for a hyperparameter search and chose this number based on intuition; however, the c = 16 baseline demonstrates results when this hyperparameter is doubled. All results use 4 seeds. For both versions of Codex, we used the OpenAI Beta under the API Terms of Use. For GPT-J (Wang & Komatsuzaki, 2021), InCoder (Fried et al., 2022) and OPT-30B (Zhang et al., 2022), we used the open-source implementations from Huggingface Transformers (Wolf et al., 2020), each running on one Nvidia A40 GPU. All language models use a sampling temperature of 0.1.

Comparison of ICPI with baseline algorithms. We compare ICPI with three baselines (Fig. 2). The "No arg max" baseline learns a good policy through random exploration and then imitates this policy. This baseline assumes access to a "success threshold" for each domain: an undiscounted cumulative return above which a trajectory is considered successful. The action selection mechanism emulates ICPI's rollout policy: prompting the LLM with a set of trajectories and eliciting an action as output. For this baseline, we only include trajectories in the prompt whose cumulative return exceeds the success threshold. Thus the policy improves as the number of successful trajectories in the prompt increases over time. Note that at the start of learning, the agent will have experienced too few successful trajectories to effectively populate the policy prompt. In order to facilitate exploration, we act randomly until the agent experiences 3 successes. "Tabular Q" is a standard tabular Q-learning algorithm, which uses a learning rate of 1.0 and optimistically initializes the Q-values to 1.0.
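For readers unfamiliar with the "Tabular Q" baseline, a standard tabular Q-learning update with a learning rate of 1.0 and optimistic initialization to 1.0 might look like the sketch below. The class name, the interface, and the discount `gamma = 0.8` (borrowed from the evaluation discount, since the text does not state the baseline's discount) are assumptions, and ties in `greedy` are resolved by action order here for brevity.

```python
from collections import defaultdict
from typing import Hashable, Sequence


class TabularQ:
    """Tabular Q-learning with optimistic initial values."""

    def __init__(self, actions: Sequence[Hashable], learning_rate: float = 1.0,
                 init_value: float = 1.0, gamma: float = 0.8):
        self.actions = tuple(actions)
        self.lr = learning_rate
        self.gamma = gamma  # assumption: same discount as the evaluation
        # Unseen (state, action) pairs start at the optimistic value.
        self.q = defaultdict(lambda: init_value)

    def update(self, s, a, r: float, s_next, done: bool) -> float:
        """Apply one Q-learning update and return the new Q(s, a)."""
        bootstrap = 0.0 if done else max(self.q[(s_next, b)] for b in self.actions)
        target = r + self.gamma * bootstrap
        self.q[(s, a)] += self.lr * (target - self.q[(s, a)])
        return self.q[(s, a)]

    def greedy(self, s):
        """Highest-valued action; ties go to the earliest action in the list."""
        return max(self.actions, key=lambda a: self.q[(s, a)])


agent = TabularQ(["left", "right", "try_goal"])
print(agent.update(2, "try_goal", 0, 2, True))
```

With a learning rate of 1.0 each update overwrites the old estimate with the new target, and the optimistic initial value pushes the agent toward untried actions.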
"Matching Model" is a baseline which uses the trajectory history instead of an LLM to perform modelling. This baseline searches the trajectory buffer for the most recent instance of the current state and, in the case of transition/reward/termination prediction, the current action. If a match is found, the model outputs the historical value (e.g. the reward associated with the state-action pair found in the buffer). If no match is found, the modelling rollout is terminated. Recall that ICPI breaks ties randomly during action selection, so this will often lead to random action selection.

As our results demonstrate, only ICPI learns good policies on all domains. We attribute this advantage to ICPI's ability to generalize from its context to unseen states and state/action pairs (unlike "Tabular Q" and "Matching Model"). Unlike "No arg max", ICPI is able to learn progressively, improving the policy before experiencing good trajectories.

Ablation of ICPI components. With these experiments, we ablate those components of the algorithm which are not, in principle, essential to learning (Fig. 3). "No Hints" ablates the hints described in the Prompt-Format paragraph. "No Balance" removes the balancing of different kinds of time-steps described in the Computing Q-values paragraph (for example, $D_b$ is allowed to contain an unequal number of terminal and non-terminal time-steps). The "No Constraints" baseline removes the constraints on these time-steps described in the same paragraph. For example, $D_r$ is allowed to contain a mixture of terminal and non-terminal time-steps (regardless of the model's termination prediction). Finally, "c = 16" prompts the rollout policy with the last 16 trajectories (instead of the last 8, as in ICPI). We find that while some ablations match ICPI's performance in several domains, none match its performance on all six.

Comparison of Different Language Models.
While our lab lacks the resources to do a full study of scaling properties, we did compare several language models of varying size (Fig. 4). See Table 2 for details about these models. Both code-davinci-002 and code-cushman-001 are variations of the Codex language model. The exact number of parameters in these models is proprietary according to OpenAI, but Chen et al. (2021b) describes Codex as fine-tuned from GPT-3 (Brown et al., 2020), which contains 175 billion parameters. As for the distinction between the variations, the OpenAI website describes code-cushman-001 as "almost as capable as Davinci Codex, but slightly faster."

Table 2 (excerpt):
Model | Parameters | Training data
[...] | [...] | 180B tokens of predominantly English data including "The Pile" (Gao et al., 2020) and "PushShift.io Reddit" (Baumgartner et al., 2020)
Codex (Chen et al., 2021b) | 175 billion | 179 GB of GitHub code

We found that none of the smaller models were capable of demonstrating learning on any domain but the simplest, chain. Examining the trajectories generated by agents trained using these models, we noted that in several cases, they seemed to struggle to apprehend the underlying "logic" of successful trajectories, which hampered the ability of the rollout policy to produce good actions. Since these smaller models were not trained on identical data, we are unable to isolate the role of size in these results. However, the failure of all of these smaller models to learn suggests that size has some role to play in performance. We conjecture that larger models developed in the future may demonstrate comparable improvements in performance over our Codex model.

Limitations. In principle, ICPI can work on any control task with a discrete state space. All that the method requires is a sequence prediction model (in the paper we use Codex) capable of inferring state transitions and action probabilities given a history of behavior. Such a model will ensure that the rollouts (described in Computing Q-values) are unbiased Monte-Carlo estimates of value for the current policy. Given such estimates, the algorithm will implement policy iteration, a method which is proven to converge. The domains in our paper were limited by the abilities of Codex to accurately predict transitions and actions in more complex domains. As sequence models mature, we anticipate that more domains will become tractable for ICPI.

5. CONCLUSION

Our main contribution is a gradient-free policy iteration algorithm that uses foundation models to solve RL tasks and in which the entire locus of learning is improved prompt content. The algorithm uses a foundation model as both a world model and a policy to compute Q-values via rollouts. Although we presented the algorithm here as text-based, the approach is quite general and could be applied to any foundation model that works through prompting, including multi-modal models like Reed et al. (2022) and Seo et al. (2022). In experiments we showed that the algorithm works in six illustrative domains imposing different challenges for ICPI, confirming the benefit of the LLM-rollout-based policy improvement. While the empirical results are quite modest, we believe the approach provides an important new way to use LLMs that will increase in effectiveness as the models themselves become better, as our size comparison study suggests.






Figure 1: For each possible action A(1), . . . , A(n), the LLM generates a rollout by alternately predicting transitions and selecting actions. Q-value estimates are discounted sums of rewards. The action is chosen greedily with respect to Q-values. Both state/reward prediction and next action selection use trajectories from D to create prompts for the LLM. Changes to the content of D change the prompts that the LLM receives, allowing the model to improve its behavior over time.

Algorithm 2 Computing Q-values (lines 6 to 11)
 6: $D_b \sim$ time-steps with action $a_u$    ▷ balancing terminal and non-terminal
 7: $b_u \sim \mathrm{LLM}(D_b, s_u, a_u)$
 8: $D_r \sim$ time-steps with action $a_u$ and termination $b_u$    ▷ balancing reward
 9: $r_u \sim \mathrm{LLM}(D_r, s_u, a_u)$
10: $D_s \sim$ time-steps with action $a_u$ and termination $b_u$    ▷ no balancing
11: $s_{u+1} \sim \mathrm{LLM}(D_s, s_u, a_u)$

Figure 2: Comparison of ICPI with three baselines, "No arg max", "Tabular Q," and "Nearest Neighbor" (referred to in the text as "Matching Model"). The y-axis depicts regret (normalized between 0 and 1), computed relative to an optimal return with a discount-factor of 0.8. The x-axis depicts time-steps during training. Error bars are standard errors from four seeds.

Figure 3: Comparison of ICPI with ablations. The y-axis depicts regret (normalized between 0 and 1), computed relative to an optimal return with a discount-factor of 0.8. The x-axis depicts time-steps during training. Error bars are standard errors from four seeds.

Figure 4: Comparison of ICPI using different language models. The y-axis depicts regret (normalized between 0 and 1), computed relative to an optimal return with a discount-factor of 0.8. The x-axis depicts time-steps during training. Error bars are standard errors from four seeds.

Many recent sequence-based approaches to reinforcement learning use demonstrations that come either from human experts or pretrained RL agents. For example, Huang et al. (2022a) use a frozen LLM as a planner for everyday household tasks by constructing a prefix from human-generated task instructions, and then using the LLM to generate instructions for new tasks. This work is extended by Huang et al. (2022b). Similarly, Ahn et al. (2022) use a value function that is trained on human demonstrations to rank candidate actions produced by an LLM. Baker et al. (2022) use human demonstrations to train the foundation model itself: they use video recordings of human Minecraft players to train a foundation model that plays Minecraft. Works that rely on pretrained RL agents include Janner et al. (2021), who train a "Trajectory Transformer" to predict trajectory sequences in continuous control tasks by using trajectories generated by pretrained agents, and Chen et al. (2021a), who use a dataset of offline trajectories to train a "Decision Transformer" that predicts actions from state-action-reward sequences in RL environments like Atari. Two approaches build on this method to



Table 2: Pretrained Large Language Models (LLMs) Used in Experiments

