META-REINFORCEMENT LEARNING WITH INFORMED POLICY REGULARIZATION

Abstract

Meta-reinforcement learning aims at finding a policy able to generalize to new environments. When facing a new environment, this policy must explore to identify its particular characteristics and then exploit this information for collecting reward. We consider the online adaptation setting where the agent needs to trade off between the two types of behaviour within the same episode. Even though policies based on recurrent neural networks can be used in this setting by training them on multiple environments, they often fail to model this trade-off, or solve it at a very high computational cost. In this paper, we propose a new algorithm that uses privileged information in the form of a task descriptor at train time to improve the learning of recurrent policies. Our method learns an informed policy (i.e., a policy receiving as input the description of the current task) that is used to both construct task embeddings from the descriptors, and to regularize the training of the recurrent policy through parameter sharing and an auxiliary objective. This approach significantly reduces the learning sample complexity without altering the representational power of RNNs, by focusing on the relevant characteristics of the task, and by exploiting them efficiently. We evaluate our algorithm in a variety of environments that require sophisticated exploration/exploitation strategies and show that it outperforms vanilla RNNs, Thompson sampling, and task-inference approaches to meta-reinforcement learning.

1. INTRODUCTION

Deep Reinforcement Learning has been used to successfully train agents on a range of challenging environments such as Atari games (Mnih et al., 2013; Bellemare et al., 2013; Hessel et al., 2017) or continuous control (Peng et al., 2017; Schulman et al., 2017). Nonetheless, in these problems, RL agents perform exploration strategies to discover the environment and implement algorithms to learn a policy that is tailored to solving a single task. Whenever the task changes, RL agents generalize poorly and the whole process of exploration and learning restarts from scratch. On the other hand, we expect an intelligent agent to fully master a problem when it is able to generalize from a few instances (tasks) and achieve the objective of the problem under many variations of the environment. For instance, children know how to ride a bike (i.e., the problem) when they can reach their destination irrespective of the specific bike they are riding, which requires adapting to the weight of the bike, the friction of the brakes and tires, and the road conditions (i.e., the tasks). How to enable agents to generalize across tasks has been studied in Multi-task Reinforcement Learning (e.g. Wilson et al., 2007; Teh et al., 2017), Transfer Learning (e.g. Taylor & Stone, 2011; Lazaric, 2012) and Meta-Reinforcement Learning (Finn et al., 2017; Hausman et al., 2018; Rakelly et al., 2019; Humplik et al., 2019). These works fall into two categories. Learning-to-learn approaches aim at speeding up learning on new tasks by pre-training feature extractors or learning good initializations of policy weights (Raghu et al., 2019). In contrast, in this paper we study the online adaptation setting where a single policy is trained for a fixed family of tasks. When facing a new task, the policy must then balance exploration (or probing), to reduce the uncertainty about the current task, and exploitation to maximize the cumulative reward of the task.
Agents are evaluated on their ability to manage this trade-off within a single episode of the same task. The online adaptation setting is a special case of a partially observable Markov decision process (POMDP), where the unobserved variables are the descriptors of the current task. It is thus possible to rely on recurrent neural networks (RNNs) (Bakker, 2001; Heess et al., 2015), since they can in theory represent optimal policies in POMDPs if given enough capacity.

[Figure 1 caption: Optimal informed policies are shortest paths from the start to either G1 or G2, and never visit the sign. Thompson sampling cannot represent the optimal exploration/exploitation policy (go to the sign first), since going to the sign is not feasible for any informed policy.]

Unfortunately, training RNN policies often has prohibitive sample complexity and may converge to suboptimal local minima. To overcome this drawback, efficient online adaptation methods leverage the knowledge of the task at training time. The main approach is to pair an exploration strategy with the training of informed policies, i.e., policies taking the description of the current task as input. Probe-then-Exploit (PTE) algorithms (e.g. Zhou et al., 2019) operate in two stages. They first rely on an exploration policy to identify the task. Then, they commit to the identified task by playing the associated informed policy. Thompson Sampling (TS) approaches (Thompson, 1933; Osband et al., 2016; 2019) maintain a distribution over plausible tasks and play the informed policy of a task sampled from the posterior following a predefined schedule. PTE and TS are expected to be sample-efficient relative to RNNs, as learning informed policies is a fully observable problem. However, as we discuss in Section 3, PTE and TS cannot represent effective exploration/exploitation policies in many environments. Humplik et al.
(2019) proposed an alternative approach, Task Inference (TI), which trains a full RNN policy with prediction of the current task as an auxiliary loss. TI avoids the suboptimality of PTE/TS by not constraining the structure of the exploration/exploitation policy. However, in TI the task descriptors are used as targets, not as inputs, so TI tries to reconstruct even irrelevant features of the task descriptor and does not leverage the faster learning of informed policies. In this paper, we introduce IMPORT (InforMed POlicy RegularizaTion), a novel policy architecture for efficient online adaptation that combines the rich expressivity of RNNs with the efficient learning of informed policies. At train time, a shared policy head receives as input the current observation, together with either a (learned) embedding of the current task or the hidden state of an RNN, such that the informed policy and the RNN policy are learned simultaneously. At test time, the hidden state of the RNN replaces the task embedding, and the agent acts without having access to the current task. This leads to several advantages: 1) IMPORT benefits from the informed policy to speed up learning; 2) it avoids reconstructing features of the task descriptor that are irrelevant for learning; and, as a consequence, 3) it adapts faster to unknown environments, showing better generalization capabilities. We evaluate IMPORT against the main approaches to online adaptation on environments that require sophisticated exploration/exploitation strategies. We confirm that TS suffers from its limited expressivity, and show that the policy regularization of IMPORT significantly speeds up learning compared to TI. Moreover, the learnt task embeddings of IMPORT make it robust to irrelevant or minimally informative task descriptors, and able to generalize when learning on few training tasks.

2. SETTING

Let M be the space of possible tasks. Each µ ∈ M is associated with an episodic µ-MDP M_µ = (S, A, p_µ, r_µ, γ) whose dynamics p_µ and rewards r_µ are task dependent, while the state and action spaces are shared across tasks and γ is the discount factor. The descriptor µ can be a simple identifier (µ ∈ ℕ) or a set of parameters (µ ∈ ℝ^d). When the reward function and the transition probabilities are unknown, RL agents need to devise a strategy that balances exploration, to gather information about the system, and exploitation, to maximize the cumulative reward. Such a strategy can be defined as the solution of a partially observable MDP (POMDP), where the hidden variable is the descriptor µ of the MDP. Given a trajectory τ_t = (s_1, a_1, r_1, ..., s_{t-1}, a_{t-1}, r_{t-1}, s_t), a POMDP policy π(a_t|τ_t) maps the trajectory to actions. In particular, the optimal policy in a POMDP is a history-dependent policy that uses τ_t to construct a belief state b_t, which describes the uncertainty about the task at hand, and then maps it to the action that maximizes the expected sum of rewards (e.g. Kaelbling et al., 1998). In this case, maximizing the rewards may require taking explorative actions that improve the belief state enough so that future actions are more effective in collecting reward. The task is sampled at the beginning of an episode from a distribution q(µ). After training, the agent returns a policy π(a_t|τ_t) that aims at maximizing the average performance across tasks generated from q, i.e.,

E_{µ∼q(µ)} [ Σ_{t=1}^{|τ|} γ^{t-1} r_t^µ | π ]    (1)

where the expectation is taken over a full-episode trajectory τ and the task distribution q, and |τ| is the length of the trajectory. The objective is then to find an architecture for π that is able to express strategies that perform best according to Eq. 1 and, at the same time, can be learned efficiently even for moderately short training phases. At training time, we assume the agent has unrestricted access to the task descriptor µ.
Access to such a task descriptor during training is a common assumption in the multi-task literature and captures a large variety of concrete problems. The descriptor can be of two types: (i) a vector of features corresponding to (physical) parameters of the environment/agent (for instance, such features may be available in robotics, or when learning on a simulator) (Yu et al., 2018; Mehta et al., 2019; Tobin et al., 2017); (ii) a single task identifier (i.e., an integer), which is a less restrictive assumption (Choi et al., 2001; Humplik et al., 2019) and corresponds to different concrete problems: learning on a set of M training levels in a video game, learning to control M different robots, or learning to interact with M different users.
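To make the two descriptor types concrete, here is a minimal sketch of how a task µ could be sampled at the start of an episode. All names and dimensions are illustrative, not the paper's implementation:

```python
import numpy as np

def sample_task(rng, mode="features", d=5, num_train_tasks=10):
    """Sample a task descriptor mu at the start of an episode.

    mode="features": mu is a vector of (physical) parameters in [-1, 1],
                     as in the CartPole/Acrobot setups described later.
    mode="id":       mu is a one-hot identifier over a fixed set of
                     training tasks (the weaker form of supervision).
    """
    if mode == "features":
        return rng.uniform(-1.0, 1.0, size=d)
    one_hot = np.zeros(num_train_tasks)
    one_hot[rng.integers(num_train_tasks)] = 1.0
    return one_hot

rng = np.random.default_rng(0)
mu_features = sample_task(rng, "features")  # descriptor as environment parameters
mu_id = sample_task(rng, "id")              # descriptor as a task identifier
```

In both cases µ is only visible at train time; at test time the agent must act from the history alone.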

3. RELATED WORK AND CONTRIBUTIONS

In this section, we review how the online adaptation setting has been tackled in the literature. The main approaches are depicted in Fig. 2. We first compare the different methods in terms of expressiveness and whether they leverage the efficient learning of informed policies. We then discuss learning task embeddings and how the various methods deal with unknown or irrelevant task descriptors. The last subsection summarizes our contributions. Evaluation of RL agents in Meta-Reinforcement Learning. The online adaptation evaluation setting is standard in the meta-RL literature (Yu et al., 2017; Humplik et al., 2019) but is not the only way to evaluate agents on unseen tasks. Indeed, several works consider that, given a new task, the agent is granted a number of "free" interaction episodes or steps to perform system identification, and is then evaluated on the cumulative reward over one (Bharadhwaj et al., 2019; Rakelly et al., 2019) or several execution episodes (Liu et al., 2020). This differs from what we study here, where the agent has to identify the task and solve it within one episode, the reward being accounted for during all of these steps. Online Adaptation with Deep RL. In the previous section we mentioned that the best strategy corresponds to the optimal policy of the associated POMDP. Since the belief state b_t is a sufficient statistic of the history τ_t, POMDP policies take the form π(a_t|τ_t) = π(a_t|s_t, b_t). While it is impractical to compute the exact belief state even for toy discrete problems, approximations can be learnt using Recurrent Neural Networks (RNNs) (Bakker, 2001; Heess et al., 2015). RNN-based policies are trained to maximize the cumulative reward and do not leverage task descriptors at train time. While this class of policies can represent rich exploratory strategies, their large training complexity makes them impractical.
In order to reduce the training complexity of RNN policies, existing strategies have constrained the set of possible exploratory behaviors by leveraging privileged information about the task. Probe-Then-Exploit (PTE) (e.g. Zhou et al., 2019) works in two phases. First, it executes a pure exploratory policy with the objective of identifying the underlying task µ, i.e. maximizing the likelihood of the task; it then runs the optimal policy associated with the estimated task. Both the probing and the informed policies are learned using task descriptors, leading to a much more efficient training process. PTE has two main limitations. First, similarly to explore-then-commit approaches in bandits (e.g. Garivier et al., 2016), the exploration can be suboptimal because it is not reward-driven: valuable time is wasted estimating unnecessary information. Second, the switch between probing and exploiting is hard to tune and problem-dependent. Thompson Sampling (TS) (Thompson, 1933) leverages randomization to mix exploration and exploitation. Similarly to the belief state of an RNN policy, TS maintains a distribution over task descriptors that represents the uncertainty on the current task given τ_t. The policy samples a task from the posterior and executes the corresponding informed policy for several steps. Training is limited to learning informed policies together with a maximum likelihood estimator that maps trajectories to distributions over tasks. This strategy has proved successful in a variety of problems (e.g. Chapelle & Li, 2011; Osband & Roy, 2017). However, as shown in Fig. 1, TS cannot represent certain probing policies because it is constrained to executing informed policies. Another drawback of TS approaches is that the re-sampling frequency needs to be carefully tuned. The Task Inference (TI) approach (Humplik et al., 2019) is an RNN trained to simultaneously learn a good policy and predict the task descriptor µ.
Denoting by m : H → Z the mapping from histories to a latent representation of the belief state (Z ⊆ ℝ^d), the policy π(a_t|z_t) selects the action based on the representation z_t = m(τ_t) constructed by the RNN. During training, z_t is also used to predict the task descriptor µ, using the task-identification module g : Z → M. The overall objective is:

E[ Σ_{t=1}^{|τ|} γ^{t-1} r_t^µ | π ] + β E[ Σ_{t=1}^{|τ|} ℓ(µ, g(z_t)) | π ]    (2)

where ℓ(µ, g(z_t)) is the log-likelihood of µ under the distribution g(z_t). The auxiliary loss is meant to structure the memory of the RNN m rather than be an additional reward for the policy, so training is done by ignoring the effect of m on π when computing the gradient of the auxiliary loss with respect to m. Humplik et al. (2019) proposed two variants, AuxTask and TI, described in Fig. 2 (b) and (c). In TI, the gradient of the policy sub-network is not backpropagated through the RNN (the dashed green arrow in Fig. 2c), and the policy sub-network receives the original state features as additional input. For both AuxTask and TI, the training of π is purely reward-driven, so they do not suffer from the suboptimality of PTE/TS. However, in contrast to PTE/TS, they do not leverage the smaller sample complexity of training informed policies, and the auxiliary loss is defined over the whole value of µ while only some dimensions may be relevant to solve the task. Learning Task Embeddings. While in principle the minimal requirement for the approaches above is access to task identifiers, i.e. one-hot encodings of the task, these approaches are sensitive to the encoding of task descriptors and to prior knowledge about them. In particular, irrelevant variables have a significant impact on PTE approaches since the probing policy aims at identifying the task. For instance, an agent might waste time reconstructing the full µ when only part of µ is needed to act optimally w.r.t. the reward.
Moreover, TS, TI and AuxTask are guided by a prior distribution over µ that has to be chosen by hand to fit the ground-truth distribution of tasks. Rakelly et al. (2019) proposed to use a factored Gaussian distribution over transitions as a task embedding architecture rather than an RNN. Several approaches have been proposed to learn task embeddings (Gupta et al., 2018; Rakelly et al., 2019; Zintgraf et al., 2019; Hausman et al., 2018). The usual approach is to train embeddings of task identifiers jointly with the policies. Humplik et al. (2019) mention using TI with task embeddings, but the embeddings are pre-trained separately, which requires either additional interactions with the environment or expert traces. Nonetheless, we show in our experiments that TI can be used with task identifiers, by treating task prediction as a multiclass classification problem.
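As a concrete illustration of the auxiliary term in Eq. 2, the sketch below computes a Gaussian log-likelihood of µ from a latent state. The linear heads `W_mean` and `W_logstd` are illustrative stand-ins for the task-identification module g(z_t), not the paper's implementation:

```python
import numpy as np

def ti_aux_loglik(z, mu, W_mean, W_logstd):
    """Gaussian log-likelihood of the task descriptor mu given a latent
    state z, as in the auxiliary term of Eq. 2 (diagonal covariance).
    W_mean and W_logstd are hypothetical linear heads for g(z_t)."""
    mean = z @ W_mean      # predicted mean of mu
    logstd = z @ W_logstd  # predicted per-dimension log standard deviation
    # Sum of per-dimension Gaussian log-densities of mu under g(z).
    return np.sum(-0.5 * ((mu - mean) / np.exp(logstd)) ** 2
                  - logstd - 0.5 * np.log(2.0 * np.pi))
```

In TI this quantity is maximized with respect to the RNN memory m while ignoring the effect of m on the policy π, so the auxiliary loss shapes the memory rather than acting as an extra reward.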

Summary of the contributions

Like RNN/TI, IMPORT learns an RNN policy to maximize cumulative reward, with no decoupling between probing and exploitation. As such, our approach does not suffer from the scheduling difficulties intrinsic to PTE/TS approaches. On the other hand, similarly to PTE/TS and contrary to RNN/TI, IMPORT leverages the fast training of informed policies through joint training of an RNN and an informed policy. In addition, IMPORT does not rely on probabilistic models of task descriptors. Learning task embeddings makes the approach robust to irrelevant task descriptors (contrary to TI), makes IMPORT applicable when only task identifiers are available, and allows it to generalize better when few training tasks are available.

4. METHOD

In this section, we describe the main components of the IMPORT model (depicted in Fig. 2), as well as the online optimization procedure and an additional auxiliary loss to further speed up learning. Our approach leverages the knowledge of the task descriptor µ and informed policies to construct a latent representation of the task that is purely reward-driven. Since µ is unknown at test time, we use this informed representation to train a predictor based on a recurrent neural network. To leverage the efficiency of informed policies even in this phase, we propose an architecture sharing parameters between the informed policy and the final policy, so that the final policy benefits from parameters learned with privileged information. The idea is to constrain the final policy to stay close to the informed policy while allowing it to perform probing actions when needed to effectively reduce the uncertainty about the task. We call this approach InforMed POlicy RegularizaTion (IMPORT). Formally, we denote by π_µ(a_t|s_t, µ) the informed policy and by π_H(a_t|τ_t) the history-dependent (RNN) policy that is used at test time. The informed policy π_µ = φ ∘ f_µ is the functional composition of f_µ and φ, where f_µ : M → Z projects µ into a latent space Z ⊆ ℝ^k and φ : S × Z → A selects the action based on the latent representation. The idea is that f_µ(µ) captures the relevant information contained in µ while ignoring dimensions that are not relevant for learning the optimal policy. This behavior is obtained by training π_µ directly to maximize the task reward r_µ. While π_µ leverages the knowledge of µ at training time, π_H acts based on the sole history. To encourage π_H to behave like the informed policy while preserving the ability to probe, π_H and π_µ share φ, the mapping from latent representations to actions. We thus define π_H = φ ∘ f_H, where f_H : H → Z encodes the history into the latent space.
By sharing the policy head φ, the approximate belief state constructed by the RNN is mapped to the same latent space as µ. When the uncertainty about the task is small, π_H then benefits from the joint training with π_µ.

[Figure 3 caption: Maze 3D. The goal is located either at the blue or the red box. When the back wall (not observed in the leftmost image) has a wooden texture, the correct goal is the blue box, whereas if the texture is green, the red box is the goal.]

More precisely, let θ, ω and σ be the parameters of φ, f_H and f_µ respectively, so that π_µ^{σ,θ}(a_t|s_t, µ) = φ_θ(a_t|s_t, f_µ^σ(µ)) and π_H^{ω,θ}(a_t|τ_t) = φ_θ(a_t|s_t, f_H^ω(τ_t)). The goal of IMPORT is to maximize over θ, ω, σ the objective function defined in Eq. 3:

E[ Σ_{t=1}^{|τ|} γ^{t-1} r_t^µ | π_H^{ω,θ} ]  (A)  +  E[ Σ_{t=1}^{|τ|} γ^{t-1} r_t^µ | π_µ^{σ,θ} ]  (B)  -  β E[ Σ_{t=1}^{|τ|} D(f_µ(µ), f_H(τ_t)) ]  (C)    (3)

Speeding Up the Learning. The optimization of (B) in Eq. 3 produces a reward-driven latent representation of the task through f_µ. In order to encourage the history-based policy to predict a task embedding close to the one predicted by the informed policy, we augment the objective with an auxiliary loss (C) weighted by β > 0. D is the squared 2-norm in our experiments. Note that because we treat objective (C) as an auxiliary loss, only the average gradient of D with respect to f_H is backpropagated, ignoring the effect of f_H on π_H. The expectation of (C) is optimized over trajectories generated using π_H^{ω,θ} and π_µ^{σ,θ}, respectively used to compute (A) and (B). Optimization. IMPORT is trained using Advantage Actor Critic (A2C) (Mnih et al., 2016) with generalized advantage estimation (GAE) (Schulman et al., 2015). There are two value functions, one for each objective (A) and (B). The algorithm is summarized in Alg. 1.
Each iteration collects a batch of M transitions using either π_H or π_µ. If the batch is sampled according to π_H, we update with A2C-GAE the policy parameters ω and θ according to both objectives (A) and (C), as well as the parameters of the value function associated with objective (A). If the batch is sampled according to π_µ, we update with A2C-GAE the policy parameters σ and θ according to both objectives (B) and (C), as well as the parameters of the value function associated with objective (B).
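The parameter-sharing structure described above can be sketched with plain linear maps standing in for the learned MLP/GRU components. All shapes and names here are illustrative, not the paper's architecture:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

class ImportSketch:
    """Toy sketch of IMPORT's parameter sharing:
      f_mu : mu -> z      informed task embedding (parameters sigma)
      f_H  : h  -> z      history embedding (parameters omega; here h is a
                          stand-in for the GRU state summarizing tau_t)
      phi  : (s, z) -> p  shared policy head (parameters theta)."""

    def __init__(self, rng, s_dim, mu_dim, h_dim, z_dim, n_actions):
        self.W_mu = 0.1 * rng.normal(size=(mu_dim, z_dim))              # f_mu
        self.W_h = 0.1 * rng.normal(size=(h_dim, z_dim))                # f_H
        self.W_phi = 0.1 * rng.normal(size=(s_dim + z_dim, n_actions))  # phi

    def informed(self, s, mu):      # pi_mu = phi o f_mu (train time only)
        z = np.tanh(mu @ self.W_mu)
        return softmax(np.concatenate([s, z]) @ self.W_phi)

    def recurrent(self, s, h):      # pi_H = phi o f_H (used at test time)
        z = np.tanh(h @ self.W_h)
        return softmax(np.concatenate([s, z]) @ self.W_phi)

    def aux_distance(self, mu, h):  # term (C): squared L2 between embeddings
        return np.sum((np.tanh(mu @ self.W_mu) - np.tanh(h @ self.W_h)) ** 2)
```

Because `W_phi` is shared, reward-driven updates of the informed policy also shape the head used by the recurrent policy, which is the source of the speed-up claimed above.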

5. EXPERIMENTS

We performed experiments on five environments. CartPole and Acrobot, from OpenAI Gym (Brockman et al., 2016), where the task descriptor µ represents parameters of the physical system, e.g., the weight of the cart, the size of the pole, etc. The dimension of µ is 5 for CartPole and 7 for Acrobot. The entries of µ are normalized in [-1, 1] and sampled uniformly. These environments provide basic comparison points where the optimal exploration/exploitation policy is relatively straightforward, since the dynamics can be inferred from a few actions. The Bandit environment is a standard Bernoulli multi-armed bandit problem with K arms. The vector µ ∈ ℝ^K denotes the success probabilities of the independent Bernoulli distributions. Each dimension of µ is sampled uniformly between 0 and 0.5, and a randomly selected best arm is assigned a success probability of 0.9. An episode is 100 arm pulls. At every timestep the agent pulls one of the K arms and observes the resulting reward. Although relatively simple, this environment assesses the ability of algorithms to learn nontrivial probing/exploitation strategies. The Tabular MDP environment is a finite MDP with S states and A actions such that the transition matrix is sampled from a flat Dirichlet distribution and the reward function is sampled from a uniform distribution in [0, 1], as in Duan et al. (2016). In that case, µ is the concatenation of the transition and reward functions, resulting in a vector of size S²A + SA. This environment is much more challenging: µ is high-dimensional, there is nearly complete uncertainty about the task at hand, and each task is itself a reinforcement learning problem. Finally, the Maze 3D environment is a 3D version of the toy problem depicted in Fig. 1, implemented using gym-miniworld (Chevalier-Boisvert, 2018).
It has three discrete actions (forward, left, right) and the objective is to reach one of the two possible goals (see Figure 15 in the appendix), resulting in a reward of +1 (resp. -1) when the correct (resp. wrong) goal is reached. The episode terminates when the agent touches a box or after 100 steps. The agent always starts at a random position, with a random orientation. The information about which goal to reach in each episode is encoded by two different textures on the wall located on the opposite side of the maze w.r.t. the goals. This domain allows us to evaluate the models when observations are high-dimensional (3 × 60 × 60 RGB images). The maximum episode length is 100 on CartPole, Bandit, Tabular-MDP and Maze3D, and 500 on Acrobot. To evaluate the ability of IMPORT and the baselines to deal with different types of task descriptors µ, we also perform experiments on CartPole and Tabular-MDP in the setting where µ is only a task identifier (i.e., a one-hot vector representing the index of the training task), which is a very weak supervision available at train time. We compare to the previously discussed baselines. First, a vanilla RNN policy (Heess et al., 2015) using GRUs, which never uses µ. Second, we compare to TS, TI and AuxTask, with µ only observed at train time, as for IMPORT. For TS, at train time the policy conditions on the true µ, whereas at test time it conditions on an estimated µ resampled from the posterior every k steps, where k ∈ {1, 5, 10, 20}. On bandits, UCB (Auer, 2002) with tuned exploration parameters is our topline. Implementation details. Contrary to IMPORT, TS, TI and AuxTask are based on maximizing the log-likelihood of µ. When using informative task descriptors (i.e. a vector of real values), the log-likelihood uses a Gaussian distribution with learnt mean and diagonal covariance matrix.
For the bandit setting, we have also performed experiments using a Beta distribution, which may be more suitable for this type of problem. When using task identifiers, a multinomial distribution is used. All approaches are trained using A2C with Generalized Advantage Estimation (Mnih et al., 2016; Schulman et al., 2015). All approaches use similar network architectures with the same number of hidden layers and units. Evaluation. The meta-learning scenario is implemented by sampling N training tasks, N validation tasks and 10,000 test tasks with no overlap between task sets (except in Maze3D, where there are only two possible tasks). Each sampled training task is given a unique identifier. Each model is trained on the training tasks, and the best model is selected on the validation tasks. We report the performance on the test tasks, averaged over three trials with different random seeds, corresponding to different sets of train/validation/test tasks. Training uses a discount factor, but for validation and test we compute the undiscounted cumulative reward on the validation/test tasks. The learning curves show test reward as a function of environment steps. They are the average of the three curves associated with the best validation model of each of the three seeds used to generate the different task sets. Overall performance. IMPORT performs better than its competitors in almost all settings. For instance, on CartPole with 10 tasks (see Table 1), our model reaches 94.4 reward while TI reaches only 91.5. Qualitatively similar results are found on Acrobot (Table 5 in the Appendix), as well as on Bandit with 20 arms (Table 3), even though AuxTask performs best with only 10 arms. IMPORT particularly shines when µ encodes complex information, as on Tabular-MDP (see Table 2), where it outperforms all baselines in all settings. By varying the number of training tasks on CartPole and Acrobot, we also show that IMPORT's advantage over the baselines is larger with fewer training tasks.
In all our experiments, as expected, the vanilla RNN performs worse than the other algorithms. Sample Efficiency. Figure 5 shows the convergence curves on CartPole with 10 and 100 training tasks; they are representative of what we obtain on the other environments (see Appendix). IMPORT tends to converge faster than the baselines. We also observe a positive effect of the auxiliary loss (β > 0) on sample efficiency, in particular with few training tasks. Note that the auxiliary loss is particularly effective in environments where the final policy tends to behave like the informed one. Influence of µ. The experiments with uninformative µ (i.e., task identifiers) reported in Tables 1 and 2 for CartPole and Tabular-MDP respectively show that the methods are effective even when the task descriptors do not include any prior knowledge. In both cases, IMPORT can use these task descriptors to generalize well. Moreover, experimental results on CartPole (Fig. 11) and Tabular MDP (Fig. 17) suggest that when µ is a vector of features (and not only a task identifier), it improves sample efficiency but does not change the final performance. This can be explained by the fact that informed policies are faster to learn with features in µ since, in that case, µ captures similarities between tasks. IMPORT performs equivalently with both types of task descriptors, showing that our method can deal with different (rich and weak) task descriptors. We further analyze the impact of the encoding of µ on the models by using non-linear projections of the informative µ to change the shape of the prior knowledge. Figure 5c shows the learning curves of TI and IMPORT on CartPole with task identifiers, the original µ, and polynomial expansions of µ of order 2 and 3, resulting in 21 and 56 features. IMPORT's task embedding approach is robust to the encoding of µ, while TI's log-likelihood approach underperforms with the polynomial transformation. Task embeddings.
To obtain a qualitative assessment of the task embedding learnt by IMPORT, we consider a bandit problem with 10 arms and embedding dimension 16. Figure 6 shows the clusters of task embeddings obtained with t-SNE (Maaten & Hinton, 2008). Each cluster maps to an optimal arm, showing that IMPORT structures the embedding space based on the relevant information. In addition, we have studied the influence of the β hyperparameter from Eq. 3 (in Fig. 4 and Section D). It shows that the auxiliary loss helps to speed up the learning process, but is not necessary to achieve strong performance. High-dimensional input space. We show the learning curves on the Maze3D environment in Figure 5d. IMPORT succeeds in 90% of cases (reward ≈ 0.8), while TI succeeds in only 70% of cases. This shows that IMPORT is even more effective with high-dimensional observations (here, pixels). IMPORT and TI benefit from knowing µ at train time, which allows them to rapidly identify that the wall texture behind the agent is informative, while the vanilla RNN struggles and reaches random goals. TS is not reported since this environment is a typical failure case, as discussed in Fig. 1. Additional results. In Appendix C.1, we show that IMPORT outperforms TI by a larger margin when the task embedding dimension is small. We also show that IMPORT outperforms its competitors in dynamic environments, i.e., when the task changes during the episode.
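The Bernoulli bandit task distribution used throughout these experiments is simple enough to sketch directly (illustrative code, matching the description in Section 5):

```python
import numpy as np

def sample_bandit_task(rng, K=20):
    """Bernoulli bandit descriptor from Section 5: each arm's success
    probability is sampled uniformly in [0, 0.5], then one randomly
    chosen best arm is set to 0.9."""
    mu = rng.uniform(0.0, 0.5, size=K)
    mu[rng.integers(K)] = 0.9
    return mu

def pull(rng, mu, arm):
    """One timestep: pull an arm and observe a Bernoulli reward."""
    return float(rng.random() < mu[arm])
```

The 0.4 gap between the best arm and every other arm is what makes nontrivial probing pay off: a policy that identifies the 0.9 arm early can exploit it for the rest of the 100-pull episode.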

6. CONCLUSION

We proposed a new policy architecture for meta-reinforcement learning. The IMPORT model is trained only on the reward objective, and leverages the informed policy to discover effective trade-offs between exploration and exploitation. It is thus able to learn better strategies than Thompson Sampling approaches, and faster than recurrent neural network policies and Task Inference approaches.

A THE IMPORT ALGORITHM

The algorithm is described in detail in Algorithm 2. In our implementation, the value function network used for (A) and (B) is shared; we specialize its input: for (A) the input is (s_t, f_H(τ_t)), and for (B) it is (s_t, f_µ(µ)).

Algorithm 2 (reconstructed):

for each iteration do
    if the batch is collected with π_H then
        Collect M transitions according to π_H into buffer B_H.
    else
        Collect M transitions according to π_µ into buffer B_µ.
    end if
    δ_σ, δ_ω, δ_θ, δ_ν ← 0
    R^µ ← compute_gae_returns(B_µ, γ_GAE)
    R^H ← compute_gae_returns(B_H, γ_GAE)
    // Updates from B_H (objectives (A) and (C)):
    δ_{θ,ω} += (1/|B_H|) Σ_{b∈B_H} Σ_{t=1}^T [R_t^{H,b} - V_ν(s_t^b, z_t^b)] ∇_{θ,ω} log π_H(a_t^b | s_t^b, z_t^b)
    δ_{θ,ω} += (λ_h/|B_H|) Σ_{b∈B_H} Σ_{t=1}^T ∇_{θ,ω} H[π_H(· | s_t^b, z_t^b)]
    δ_ω -= (2β/|B_H|) Σ_{b∈B_H} Σ_{t=1}^T [f_H^ω(s_t^b, z_t^b) - f_µ(s_t^b, µ^b)] ∇_ω f_H^ω(s_t^b, z_t^b)
    δ_ν -= (2λ_c/|B_H|) Σ_{b∈B_H} Σ_{t=1}^T [R_t^{H,b} - V_ν(s_t^b, z_t^b)] ∇_ν V_ν(s_t^b, z_t^b)
    // Updates from B_µ (objectives (B) and (C)):
    δ_{θ,σ} += (1/|B_µ|) Σ_{b∈B_µ} Σ_{t=1}^T [R_t^{µ,b} - V_ν(s_t^b, µ^b)] ∇_{θ,σ} log π_µ(a_t^b | s_t^b, µ^b)
    δ_{θ,σ} += (λ_h/|B_µ|) Σ_{b∈B_µ} Σ_{t=1}^T ∇_{θ,σ} H[π_µ(· | s_t^b, µ^b)]
    δ_ν -= (2λ_c/|B_µ|) Σ_{b∈B_µ} Σ_{t=1}^T [R_t^{µ,b} - V_ν(s_t^b, µ^b)] ∇_ν V_ν(s_t^b, µ^b)
    θ ← Optim(θ, δ_θ); ω ← Optim(ω, δ_ω); σ ← Optim(σ, δ_σ); ν ← Optim(ν, δ_ν)
end for

B IMPLEMENTATION DETAILS
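The compute_gae_returns routine referenced in Algorithm 2 can be sketched as follows. The signature and bootstrap handling are illustrative; the paper does not specify these details:

```python
import numpy as np

def compute_gae_returns(rewards, values, last_value, gamma=0.99, lam=0.95):
    """GAE return targets R_t = A_t^GAE + V(s_t) for one trajectory.

    rewards[t] is r_t, values[t] is V(s_t), and last_value bootstraps
    the value beyond the final step (0 for a terminal state)."""
    T = len(rewards)
    returns = np.zeros(T)
    gae, next_value = 0.0, last_value
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * next_value - values[t]  # TD error
        gae = delta + gamma * lam * gae                      # GAE recursion
        returns[t] = gae + values[t]
        next_value = values[t]
    return returns
```

With lam=1 and zero value estimates this reduces to the plain discounted return, which gives a quick sanity check.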

B IMPLEMENTATION DETAILS

B.1 DATA COLLECTION AND OPTIMIZATION

We focus on on-policy training, using the actor-critic A2C algorithm (Mnih et al., 2016) with generalized advantage estimation. We use distributed execution to accelerate experience collection: several worker processes independently collect trajectories. As workers progress, a shared replay buffer is filled with trajectories, and an optimization step happens when the buffer's capacity bs is reached. After model updates, the replay buffer is emptied and the parameters of all workers are updated to guarantee synchronization.
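A minimal numpy sketch of the `compute_gae_returns` routine used in Algorithm 2 (generalized advantage estimation over a single finished trajectory, returning value targets R_t = A_t + V(s_t)); the separate `gamma`/`lam` arguments and the zero bootstrap after the terminal step are our assumptions, not a transcription of the paper's code.

```python
import numpy as np

def compute_gae_returns(rewards, values, gamma=0.99, lam=0.95):
    """GAE over one trajectory; assumes V = 0 after the terminal step."""
    T = len(rewards)
    adv = np.zeros(T)
    last = 0.0
    for t in reversed(range(T)):
        next_v = values[t + 1] if t + 1 < T else 0.0
        delta = rewards[t] + gamma * next_v - values[t]  # TD residual
        last = delta + gamma * lam * last                # GAE recursion
        adv[t] = last
    return adv + np.asarray(values, dtype=float)         # targets R_t

# Sanity check: with zero values and gamma = lam = 1, the targets reduce
# to the reward-to-go.
returns = compute_gae_returns([1.0, 1.0, 1.0], [0.0, 0.0, 0.0], 1.0, 1.0)
assert np.allclose(returns, [3.0, 2.0, 1.0])
```

In the batched setting of Algorithm 2, this routine would simply be applied to every trajectory b in the buffer.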

B.2 NETWORK ARCHITECTURES

The architecture of the different methods remains the same in all our experiments, except that the number of hidden units changes across environments and we use convolutional neural networks for the Maze3d environment. A description of the architecture of each method is given in Fig. 2. Unless otherwise specified, MLP blocks are single linear layers activated with a tanh function, and their output size is hs. All methods aggregate the trajectory into an embedding z_t using a GRU with hidden size hs. Its input is the concatenation of representations of the last action a_{t-1} and the current state s_t, obtained separately. All methods use a softmax activation to obtain a probability distribution over actions. The use of the hidden state z_t differs across methods. While RNNs only use z_t as an input to the policy and critic, both TS and TI map z_t to a belief distribution that is problem-specific, e.g., Gaussian for control problems, Beta for bandits, and multinomial for the Maze and CartPole-task environments. For instance, z_t is mapped to a Gaussian distribution by using two MLPs whose outputs of size |µ| correspond to the mean and variance; the variance values are mapped to [0, 1] using a sigmoid activation.

(Table 4, hyperparameter ranges: λ_h ∈ {1, 1e-1} and {1e-1, 1e-2, 1e-3} depending on the environment; γ_GAE ∈ {0.0, 1.0}; gradient clipping 40; η ∈ {1e-3, 3e-4}; λ_c ∈ {1, 1e-1, 1e-2}; β ∈ {1e-1, 1e-2, 0}.)

IMPORT maps z_t to an embedding f_H, whereas the task embedding f_µ is obtained with a tanh-activated linear mapping of µ_t. Both embeddings have size hs_µ, tuned by cross-validation on a set of validation tasks. The input of the shared policy head φ is the embedding associated with the policy in use, i.e., either f_H when using π_H or f_µ when using π_µ. For the Maze3d experiment, all methods pre-process the pixel input s_t with three convolutional layers (output channels 32, stride 2, and kernel sizes 5, 5 and 4 respectively) and LeakyReLU activations.
We also use batch normalization after each convolutional layer. The output is flattened, linearly mapped to a vector of size hs, and tanh-activated.
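The wiring above can be illustrated with a small numpy sketch: a GRU step over the concatenation of the one-hot last action and the current state produces z_t, from which the history head computes f_H, while the informed branch computes f_µ = tanh(linear(µ)). All dimensions and random weights are hypothetical placeholders, not the trained model.

```python
import numpy as np

rng = np.random.default_rng(1)
n_actions, s_dim, hs, hs_mu, mu_dim = 3, 4, 8, 5, 6

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def linear(d_in, d_out):
    return rng.normal(scale=0.1, size=(d_out, d_in))

# GRU parameters: W* act on the input, U* on the hidden state.
x_dim = n_actions + s_dim
P = {k: linear(x_dim if k.startswith("W") else hs, hs)
     for k in ("Wz", "Wr", "Wn", "Uz", "Ur", "Un")}
W_fH = linear(hs, hs_mu)       # history embedding head f_H
W_fmu = linear(mu_dim, hs_mu)  # task embedding head f_µ

def gru_step(x, h):
    z = sigmoid(P["Wz"] @ x + P["Uz"] @ h)        # update gate
    r = sigmoid(P["Wr"] @ x + P["Ur"] @ h)        # reset gate
    n = np.tanh(P["Wn"] @ x + P["Un"] @ (r * h))  # candidate state
    return (1.0 - z) * n + z * h

a_prev = np.zeros(n_actions)   # last action is all-zeros at episode start
s_t = rng.normal(size=s_dim)
z_t = gru_step(np.concatenate([a_prev, s_t]), np.zeros(hs))
f_H = W_fH @ z_t                                 # history branch (test time)
f_mu = np.tanh(W_fmu @ rng.normal(size=mu_dim))  # informed branch (train time)
assert f_H.shape == f_mu.shape == (hs_mu,)
```

Both embeddings share the size hs_µ, so either one can be fed to the shared policy head φ, which is what enables the parameter sharing between π_H and π_µ.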

C EXPERIMENTS

In this section, we describe in more detail the environments and the hyperparameter sets we considered. We provide learning curves for all experiments to supplement the results from Tables 1, 2, 3 and 5, in order to study sample efficiency. Task descriptor. Note that for CartPole and Acrobot, µ is normalized to lie in [-1, 1]^D where D is the task descriptor dimension. The task distribution q is always uniform; see the description of the environments for details. For experiments with task identifiers, we associate to each sampled task an integer corresponding to its order of generation, and encode it using a one-hot vector. Hyperparameters. Hyperparameter ranges are specified in Table 4. For TS, we consider sampling µ from the posterior dynamics distribution every k steps, with k ∈ {1, 5, 10, 20}. C.1 CARTPOLE. We consider the classic CartPole control environment where the environment dynamics change within a set M (|µ| = 5) described by the following physical variables: gravity, cart mass, pole mass, pole length, magnetic force. Knowing some components of µ might not be required to behave optimally. The discrete action space is {-1, 1}. Episode length is T = 100. Final performance and sample efficiency. Table 1 shows that IMPORT's performance is marginally superior to the other methods in most settings. The learning curves in Figure 7 allow analyzing the sample efficiency of the different methods. Overall, IMPORT is more sample efficient than the other methods in the privileged information (µ) setting. Moreover, the auxiliary loss (β > 0) usually speeds up convergence by enforcing that the RNN quickly produces a coherent embedding. We can see that parameter sharing alone (β = 0) already improves over RNNs. Trajectory and task embeddings.
In Figure 11, we plot the evolution of f_H(τ_t) during an episode of the final model obtained by training IMPORT with two-dimensional task embeddings on CartPole with task identifiers (left), together with the task embedding f_µ(µ) learnt by the informed policy (right). As expected, the history embedding gets close to the task embedding after just a few timesteps (left). Interestingly, the task embeddings f_µ(µ) capture relevant information about the task. For instance, they are highly correlated with the magnetic force, which is a very strong factor to "understand" in each new environment in order to control the system correctly. In contrast, gravity is less correlated since it does not influence the optimal policy: whatever the gravity, if the pole is on the left, then you have to go right, and vice versa. Acrobot. Acrobot consists of two joints and two links, where the joint between the two links is actuated. Initially, the links hang downwards, and the goal is to swing the end of the lower link up to a given height. Environment dynamics are determined by the lengths of the two links, their masses, and their maximum velocities. Their respective pre-normalized domains are [0.5, 1.5], [0.5, 1.5], [0.5, 1.5], [0.5, 1.5], [3π, 5π] and [7π, 11π]. Unlike CartPole, the environment is stochastic because the simulator applies noise to the applied force. The action space is {-1, 0, 1}. We also add an extra dynamics parameter controlling whether the action order is inverted, i.e., {1, 0, -1}, so |µ| = 7. Episode length is 500. IMPORT outperforms all baselines in settings with small training task sets (Figure 12 and Table 5) and performs similarly to TI on larger training task sets. Maze3D. This environment allows evaluating the models in a setting where the observation is a high dimensional space (3x60x60 RGB image).
The mapping between the RGB image and the task target in {-1, 1} is challenging, and the informed policy should provide better auxiliary task targets than TI thanks to the "easy" training of the informed policy. IMPORT outperforms TI on this environment (Figure 16) in both final performance and sample efficiency. Tabular-MDP. IMPORT outperforms all baselines in all settings (Figure 17 and Table 2).
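As a concrete illustration of the task-descriptor encodings used throughout this section (normalization of µ to [-1, 1]^D and one-hot task identifiers), here is a short stdlib-only sketch; the per-component ranges are hypothetical placeholders, since the paper does not list CartPole's exact domains here.

```python
import random

# Hypothetical per-component ranges for a CartPole-like descriptor.
RANGES = {"gravity": (5.0, 15.0), "cart_mass": (0.5, 1.5),
          "pole_mass": (0.05, 0.5), "pole_length": (0.3, 0.7),
          "force": (5.0, 15.0)}

def sample_task(rng):
    """Uniformly sample µ and return it with its [-1, 1]-normalized form."""
    mu = {k: rng.uniform(lo, hi) for k, (lo, hi) in RANGES.items()}
    norm = [2.0 * (mu[k] - lo) / (hi - lo) - 1.0
            for k, (lo, hi) in RANGES.items()]
    return mu, norm

def one_hot_task_id(i, n):
    """Encode the i-th generated task among n as a one-hot vector."""
    v = [0.0] * n
    v[i] = 1.0
    return v

rng = random.Random(0)
mu, norm = sample_task(rng)
assert all(-1.0 <= v <= 1.0 for v in norm)
assert sum(one_hot_task_id(3, 10)) == 1.0
```

The one-hot identifier carries no physical information, which is precisely what makes the task-identifier setting harder than the informative-µ setting.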



In our implementation, the value network is shared and takes as input either f_µ(µ) or f_H(τ_t). In practice, data collection is multithreaded: we collect 20 transitions per thread, with 24 to 64 threads depending on the environment, based on available GPU memory.



Figure1: An environment with two tasks: the goal location (G1 or G2) changes at each episode. The sign reveals the location of the goal. Optimal informed policies are shortest paths from start to either G1 or G2, which never visit the sign. Thompson sampling cannot represent the optimal exploration/exploitation policy (go to the sign first) since going to the sign is not feasible by any informed policy.

Figure 2: Representation of the different architectures. IMPORT is composed of two models sharing parameters: The (black+blue) architecture is the informed policy π µ optimized through (B) while the (black+red) architecture is the history-based policy π H (used at test time) trained through (A)+(C).

Figure 4: Test performance of IMPORT for different values of β from Eq. 3

(a) CartPole N = 10 (b) CartPole N = 100 (c) Effect of transforming µ (CartPole, N = 20). (d) Maze 3D.

Figure 5: Learning curves on CartPole (a and b) and Maze3D (d) test tasks. Figure (c) studies the impact of the structure of the task descriptor on the performances of TI and IMPORT in CartPole.

Figure 6: Task embeddings learnt on Bandit (10 arms). Colors indicate the best arm.

Algorithm 2: IMPORT training. Initialize σ, ω, θ, ν arbitrarily. Hyperparameters: number of iterations K, number of transitions per update step M, discount factor γ, GAE parameter γ_GAE, Adam learning rate η, weight β of the (C) objective, weight λ_h of the entropy objective, weight λ_c of the critic objective. Optim = Adam(η). For k = 1, ..., K: if k is odd, collect M transitions according to π_H in buffer B_H; else collect M transitions according to π_µ in buffer B_µ; end if; then apply the update step detailed in Appendix A.

(a) CartPole with µ and N = 10 (b) CartPole with TID and N = 10 (c) CartPole with µ and N = 20 (d) CartPole with TID and N = 20 (e) CartPole with µ and N = 50 (f) CartPole with TID and N = 50 (g) CartPole with µ and N = 100 (h) CartPole with TID and N = 100

Figure 7: Evaluation on CartPole

Figure 8: CartPole (non-stationary). Figure 9: Non-stationary CartPole with N = 10.


(a) Value of f_H(τ_t) over episode steps on CartPole with task identifiers. The green circle is the value of f_µ(µ). The image shows that IMPORT starts with a random embedding and discovers the task embedding, with reasonable performance, in a few steps. (b) Task embeddings f_µ(µ) for CartPole with task identifiers. The color of each point corresponds to the value of one of the 'real' physics components of the environment (unknown to the model).

Figure 11: Visualization of task embeddings upon Cartpole

Figure 14: Learning curves on the bandit problem.

Figure 16: Learning curves on the Maze 3D environment

(a) TMDP with µ and |S| = 1 (b) TMDP with TID and |S| = 1 (c) TMDP with µ and |S| = 3 (d) TMDP with TID and |S| = 3 (e) TMDP with µ and |S| = 5 (f) TMDP with TID and |S| = 5

Figure 17: Evaluation on Tabular-MDP with different parameters and task descriptors (TID stands for task identifier).

CartPole with different numbers N of training tasks. Note that RNN does not use µ at train time.



Table 4: Hyperparameters tested per environment. At each training epoch, we run our agent on E environments in parallel, collecting Tr transitions on each of them, resulting in batches of M = E × Tr transitions. The representations of the last action and the current state are obtained separately. Actions are encoded as one-hot vectors. When an episode begins, we initialize the last action with a vector of zeros. For bandit environments, the current state corresponds to the previous reward. TS uses the same GRU architecture to aggregate the history into z_t.

The values of µ are uniformly sampled.

Acrobot


Bandit. The Bandit environment is a standard Bernoulli multi-armed bandit problem with K arms. The vector µ ∈ R^K denotes the success probabilities of the independent arm distributions. Each dimension of µ is sampled uniformly between 0 and 0.5; the best arm is then randomly selected and associated with a probability of 0.9. Although relatively simple, this environment assesses the ability of algorithms to learn nontrivial exploration/exploitation strategies. Note that it is not surprising that UCB outperforms the other algorithms in this setting: UCB is an optimal algorithm for MABs, and we have optimized it to achieve the best empirical performance. Moreover, IMPORT cannot leverage correlations between tasks since, due to the generation process, tasks are independent. We visualize the task embeddings learnt by the informed policy in Figure 13.
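The task-generation process described above is simple enough to transcribe directly (the function name and seed are ours): arm means uniform in [0, 0.5], then a randomly chosen best arm set to 0.9, which guarantees the best arm is unique.

```python
import random

def sample_bandit_task(K, rng):
    """Sample µ for a K-armed Bernoulli bandit: means uniform in
    [0, 0.5], then a random best arm set to 0.9."""
    mu = [rng.uniform(0.0, 0.5) for _ in range(K)]
    best = rng.randrange(K)
    mu[best] = 0.9
    return mu, best

rng = random.Random(0)
mu, best = sample_bandit_task(10, rng)
assert mu[best] == 0.9 and max(mu) == 0.9
assert sum(m > 0.5 for m in mu) == 1  # the best arm is unique
```

Because each task is drawn independently in this way, no structure ties the non-optimal arms together, which is why IMPORT cannot exploit correlations between tasks here.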

D IMPACT OF THE β HYPERPARAMETER

We study the sensitivity of IMPORT to the β parameter. Figure 18 clearly shows the benefits of the auxiliary objective: on all but the Tabular-MDP environments, the recurrent policy successfully leverages it to improve sample efficiency, and, on Acrobot, final performance as well. We only report performance with informative task descriptors µ.

