MINIMUM DESCRIPTION LENGTH CONTROL

Abstract

We propose a novel framework for multitask reinforcement learning based on the minimum description length (MDL) principle. In this approach, which we term MDL-control (MDL-C), the agent learns the common structure among the tasks with which it is faced and then distills it into a simpler representation which facilitates faster convergence and generalization to new tasks. In doing so, MDL-C naturally balances adaptation to each task with epistemic uncertainty about the task distribution. We motivate MDL-C via formal connections between the MDL principle and Bayesian inference, derive theoretical performance guarantees, and demonstrate MDL-C's empirical effectiveness on both discrete and high-dimensional continuous control tasks.

1. INTRODUCTION

In order to learn efficiently in a complex world with multiple rapidly changing objectives, both animals and machines must leverage past experience. This is a challenging task, as processing and storing all relevant information is computationally infeasible. How can an intelligent agent address this problem? We hypothesize that one route may lie in the dual process theory of cognition, a longstanding framework in cognitive psychology introduced by William James (James, 1890) which lies at the heart of many dichotomies in both cognitive science and machine learning. Examples include goal-directed versus habitual behavior (Graybiel, 2008), model-based versus model-free reinforcement learning (Daw et al., 2011; Sutton and Barto, 2018), and "System 1" versus "System 2" thinking (Kahneman, 2011). In each of these paradigms, a complex "control" process trades off with a simple "default" process to guide actions. Why has this been such a successful and enduring conceptual motif? Our hypothesis is that default processes often serve to distill common structure from the tasks consistently faced by animals and agents, facilitating generalization and rapid learning on new objectives. For example, drivers can automatically traverse commonly traveled roads en route to new destinations, and chefs quickly learn new dishes on the back of well-honed fundamental techniques. Importantly, even intricate tasks can become automatic if repeated often enough (e.g., the combination of fine motor commands required to swing a tennis racket): the default process must be sufficiently expressive to learn common behaviors, regardless of their complexity. In reality, most processes likely lie on a continuum between simplicity and complexity.

In reinforcement learning (RL; Sutton and Barto, 2018), improving sample efficiency on new tasks is crucial to the development of general agents which can learn effectively in the real world (Botvinick et al., 2015; Kirk et al., 2021).
Intriguingly, one family of approaches which has shown promise in this regard is regularized policy optimization (RPO) algorithms, in which a goal-specific control policy is paired with a simple yet general default policy to facilitate learning across multiple tasks (Teh et al., 2017; Galashov et al., 2019; Goyal et al., 2020; 2019; Moskovitz et al., 2022a). One difficulty in algorithm design, however, is how much or how little to constrain the default policy, and in what way. An overly simple default policy will fail to identify and exploit commonalities among tasks, while a complex model may overfit to a single task and fail to generalize. Most approaches manually specify an asymmetry between the control and default policies, such as hiding input information (Galashov et al., 2019) or constraining the model class (Lai and Gershman, 2021). Ideally, we'd like an adaptive approach that learns the appropriate degree of complexity via experience. The minimum description length principle (MDL; Rissanen, 1978), which in general holds that one should prefer the simplest model that accurately fits the data, offers a guiding framework for algorithm design that does just that, enabling the default policy to optimally trade off between adapting to information from new tasks and maintaining simplicity.

Inspired by dual process theory and the MDL principle, we propose MDL-control (MDL-C, pronounced "middle-cee"), a principled RPO framework for multitask RL. In Section 2, we formally introduce multitask RL and describe RPO approaches within this setting. In Section 3, we describe MDL and the variational coding framework, from which we extract MDL-C and derive its formal performance characteristics in Section 4. In Section 5, we demonstrate its empirical effectiveness in both discrete and continuous control settings. Finally, we discuss related ideas from the literature (Section 6) and conclude (Section 7).

2. REINFORCEMENT LEARNING PRELIMINARIES

The single-task setting We model a task as a Markov decision process (MDP; Puterman, 2010) $M = (S, A, P, r, \gamma, \rho)$, where $S$, $A$ are state and action spaces, respectively, $P : S \times A \to \mathcal{P}(S)$ is the state transition distribution, $r : S \times A \to [0, 1]$ is a reward function, $\gamma \in [0, 1)$ is a discount factor, and $\rho \in \mathcal{P}(S)$ is the starting state distribution. $\mathcal{P}(\cdot)$ is the space of probability distributions defined over a given space. The agent takes actions using a policy $\pi : S \to \mathcal{P}(A)$. In large or continuous domains, the policy is often parameterized: $\pi \to \pi_\theta$, $\theta \in \Theta$, where $\Theta \subseteq \mathbb{R}^d$ represents a particular model class with $d$ parameters. In conjunction with the transition dynamics, the policy induces a distribution over trajectories $\tau = (s_h, a_h)_{h=0}^{\infty}$, $P^{\pi_\theta}(\tau)$. In a single task, the agent seeks to maximize its value $V^{\pi_\theta} = \mathbb{E}_{\tau \sim P^{\pi_\theta}} R(\tau)$, where $R(\tau) := \sum_{h \ge 0} \gamma^h r(s_h, a_h)$ is called the return. We denote by $d^{\pi}_{\rho}$ the state-occupancy distribution induced by policy $\pi$ with starting state distribution $\rho$: $d^{\pi}_{\rho}(s) = \mathbb{E}_{\rho}\, (1 - \gamma) \sum_{h \ge 0} \gamma^h \Pr(s_h = s \mid s_0)$.

Multiple tasks There are a number of frameworks for multitask RL in the literature (Yu et al., 2019; Zahavy et al., 2021; Finn et al., 2017; Brunskill and Li, 2013). For a more detailed discussion, see Appendix Section B. In this paper, we focus primarily on what we term the sequential and parallel task settings. The objective in each case is to maximize average reward across tasks, equivalent to minimizing cumulative regret over the agent's 'lifetime.' More specifically, we assume a (possibly infinite) set of tasks (MDPs) $\mathcal{M} = \{M\}$ presented to the agent by sampling from some task distribution $P_{\mathcal{M}} \in \mathcal{P}(\mathcal{M})$. In the sequential task setting (Moskovitz et al., 2022a; Pacchiano et al., 2022), tasks (MDPs) are sampled one at a time from $P_{\mathcal{M}}$, with the agent training on each until convergence.
In the parallel task setting (Yu et al., 2019), a new MDP is sampled from $P_{\mathcal{M}}$ at the start of every episode and is associated with a particular input feature $g \in G$ that indicates to the agent which task has been sampled.
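The return and state-occupancy definitions above can be made concrete for a small tabular MDP. The following is a minimal sketch, assuming a fixed policy whose state-to-state transition matrix `P_pi` is known; the function names and the two-state chain are hypothetical and chosen only for illustration.

```python
import numpy as np

def discounted_return(rewards, gamma):
    """R(tau) = sum_h gamma^h r_h for a finite (truncated) reward sequence."""
    discounts = gamma ** np.arange(len(rewards))
    return float(np.dot(discounts, rewards))

def state_occupancy(P_pi, rho, gamma):
    """Discounted state occupancy d = (1 - gamma) * sum_h gamma^h rho^T P_pi^h.

    P_pi[s, s'] is the transition matrix under a fixed policy (an assumption
    of this sketch). The geometric series is summed in closed form by solving
    the linear system (I - gamma * P_pi)^T d = (1 - gamma) * rho.
    """
    n = P_pi.shape[0]
    return np.linalg.solve((np.eye(n) - gamma * P_pi).T, (1.0 - gamma) * rho)

# Hypothetical two-state chain, always started in state 0.
P_pi = np.array([[0.9, 0.1],
                 [0.2, 0.8]])
rho = np.array([1.0, 0.0])
d = state_occupancy(P_pi, rho, gamma=0.9)
```

Because each row of `P_pi` sums to one, the resulting `d` is itself a probability distribution over states, which is what the $(1 - \gamma)$ normalizer in the definition ensures.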

Regularized Policy Optimization

One common approach which improves performance is regularized policy optimization (RPO; Schulman et al., 2017; 2018; Levine, 2018; Agarwal et al., 2020; Pacchiano et al., 2020; Tirumala et al., 2020; Abdolmaleki et al., 2018). In RPO, a convex regularization term $\Omega(\theta)$ is added to the objective: $J^{\mathrm{RPO}}_{\lambda}(\theta) = V^{\pi_\theta} - \lambda \Omega(\theta)$. In the single-task setting, the regularization term is often used to approximate trust region (Schulman et al., 2015), proximal point (Schulman et al., 2017), or natural gradient (Kakade, 2002; Pacchiano et al., 2020; Moskovitz et al., 2020) optimization, or to prevent premature convergence to local maxima (Haarnoja et al., 2018; Lee et al., 2018). In multitask settings, the regularization term for RPO typically takes the form of a divergence measure penalizing the policy responsible for taking actions $\pi_\theta$, which we'll refer to as the control policy, for deviating from some default policy $\pi_w$, which is intended to encode generally useful behavior for some family of tasks (Teh et al., 2017; Galashov et al., 2019; Goyal et al., 2019; 2020; Moskovitz et al., 2022a). By capturing behavior which is on average useful across tasks, $\pi_w$ can provide a form of beneficial supervision to $\pi_\theta$ when obtaining reward is challenging, either because $\pi_\theta$ has been insufficiently trained or rewards are sparse.
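To illustrate the multitask form of the RPO objective, the sketch below evaluates a KL-regularized objective for tabular policies. It is an illustration under stated assumptions rather than the paper's implementation: the value estimate is passed in as a number, the state weights stand in for a state-occupancy distribution, and the KL direction (default policy first) is one possible choice among several used in the literature.

```python
import numpy as np

def kl_categorical(p, q, eps=1e-12):
    """Row-wise KL[p || q] for arrays of shape (n_states, n_actions)."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return np.sum(p * (np.log(p) - np.log(q)), axis=-1)

def rpo_objective(value, pi_control, pi_default, state_weights, lam):
    """J = V - lam * sum_s weight(s) * KL[pi_default(.|s) || pi_control(.|s)].

    `value` is assumed to be an externally computed estimate of V^{pi_theta}.
    """
    penalty = np.dot(state_weights, kl_categorical(pi_default, pi_control))
    return value - lam * penalty

# Hypothetical two-state, two-action example.
pi_default = np.array([[0.5, 0.5], [0.9, 0.1]])
pi_control = np.array([[0.8, 0.2], [0.9, 0.1]])
weights = np.array([0.5, 0.5])
j = rpo_objective(value=1.0, pi_control=pi_control, pi_default=pi_default,
                  state_weights=weights, lam=0.1)
```

The penalty vanishes in states where the two policies agree (state 1 here) and grows where the control policy departs from the default (state 0), so `j` is strictly below the unregularized value.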

3. THE MINIMUM DESCRIPTION LENGTH PRINCIPLE

General principle Storing all environment interactions across multiple tasks is computationally infeasible, so multitask RPO algorithms offer a compressed representation in the form of a default policy. However, the type of information which is compressed (and that which is lost) is often hard-coded a priori. Preferably, we'd like an approach which can distill structural regularities among tasks without needing to know what they are beforehand. The minimum description length (MDL) framework offers a principled approach to this problem. So-called "ideal" MDL seeks to find the shortest solution written in a general-purpose programming language which accurately reproduces the data, an idea rooted in the concept of Kolmogorov complexity (Li and Vitányi, 2008). Given the known impossibility of computing Kolmogorov complexity for all but the simplest cases, a more practical MDL approach instead prescribes selecting the hypothesis $H^\star$ from some hypothesis class $\mathcal{H}$ which minimizes the two-part code $H^\star = \arg\min_{H \in \mathcal{H}} L(D \mid H) + L(H)$, where $L(D \mid H)$ is the number of bits required to encode the data given the hypothesis and $L(H)$ is the number of bits needed to encode the hypothesis itself. There are a variety of so-called universal coding schemes which can be used to model these quantities.

Variational code One popular encoding scheme is the variational code (Blier and Ollivier, 2018; Hinton and Van Camp, 1993; Honkela and Valpola, 2004):

$$L^{\mathrm{var}}_{\nu}(D) = \underbrace{\mathbb{E}_{\theta \sim \nu}\left[-\log p_\theta(D)\right]}_{L^{\mathrm{var}}(D \mid H)} + \underbrace{\mathrm{KL}[\nu(\cdot), p(\cdot)]}_{L^{\mathrm{var}}(H)} \tag{3.1}$$

where the hypothesis class is a set of parametric models $\mathcal{H} = \{p_\theta(D) : \theta \in \Theta\}$. The model parameters are random variables with prior distribution $p(\theta)$, and $\nu(\theta)$ is any distribution over $\Theta$. Minimizing $L^{\mathrm{var}}_{\nu}(D)$ with respect to $\nu$ is equivalent to performing variational inference, maximizing a lower bound on the data log-likelihood: $\log p(D) = \log \int p(\theta)\, p_\theta(D)\, d\theta \ge -L^{\mathrm{var}}_{\nu}(D)$.
Roughly speaking, MDL encourages the choice of "simple" models when limited data are available (Grünwald, 2004). In the variational coding scheme, simplicity is enforced via the choice of prior.

Sparsity-inducing priors and variational dropout Sparsity-inducing priors can be used to improve the compression rate within the variational coding scheme, as they encourage the model to prune out parameters that do not contribute to reducing $L^{\mathrm{var}}(D \mid H)$. Many sparsity-inducing priors belong to the family of scale mixtures of normal distributions (Andrews and Mallows, 1974): $z \sim p(z)$, $w \sim p(w \mid z) = \mathcal{N}(w; 0, z^2)$, where $p(z)$ defines a distribution over the variance $z^2$. Common choices of $p(z)$ include the Jeffreys prior $p(z) \propto |z|^{-1}$ (Jeffreys, 1946), the inverse-Gamma distribution, and the half-Cauchy distribution (Polson and Scott, 2012; Gelman, 2006). Such priors have deep connections to MDL theory. For example, the Jeffreys prior in conjunction with an exponential family likelihood is asymptotically identical to the normalized maximum likelihood estimator, perhaps the most fundamental 'MDL' estimator (Grünwald and Roos, 2019). Variational dropout (VDO) is an effective algorithm for minimizing Equation (3.1) for these sparsity-inducing priors (Louizos et al., 2017; Kingma et al., 2015; Molchanov et al., 2017). Briefly, this involves choosing an approximate posterior distribution with the form

$$p(w, z \mid D) \approx \nu(w, z) = \mathcal{N}(z; \mu_z, \alpha \sigma_z^2)\, \mathcal{N}(w; z\mu, z^2 \sigma^2 I_d) \tag{3.2}$$

and optimizing Equation (3.1) via stochastic gradient descent on the variational parameters $\{\alpha, \mu_z, \sigma_z^2, \mu, \sigma^2\}$. As its name suggests, and importantly for its ease of application to large models, VDO can be implemented as a form of dropout (Srivastava et al., 2014) by reparameterizing the noise on the weights as activation noise (Kingma et al., 2015).
Application of VDO to Bayesian neural networks has achieved impressive compression rates, sparsifying deep neural networks while maintaining prediction performance on supervised learning problems (Molchanov et al., 2017; Louizos et al., 2017) . Equipped with a powerful approach for MDL-grounded posterior inference, we can now integrate these ideas with multitask RPO.
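As a concrete reading of Equation (3.1), the sketch below estimates the two terms of the variational code for a linear-Gaussian model. It is a simplified illustration under assumptions that are not in the text: the posterior and prior are both diagonal Gaussians (not the sparsity-inducing scale mixtures discussed above), the likelihood is Gaussian with known noise variance, and the data term is a Monte Carlo estimate. Code lengths are returned in nats; dividing by `np.log(2)` converts to bits.

```python
import numpy as np

def kl_diag_gaussians(mu_q, var_q, mu_p, var_p):
    """Closed-form KL[N(mu_q, diag var_q) || N(mu_p, diag var_p)] in nats."""
    return 0.5 * np.sum(
        np.log(var_p / var_q) + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0
    )

def variational_code_length(X, y, mu_q, var_q, prior_var=1.0,
                            noise_var=1.0, n_samples=256, seed=0):
    """Return (data cost, model cost) for y ~ N(X @ theta, noise_var)."""
    rng = np.random.default_rng(seed)
    # Sample parameters theta ~ nu, one row per sample.
    thetas = mu_q + np.sqrt(var_q) * rng.standard_normal((n_samples, mu_q.size))
    resid = y[None, :] - thetas @ X.T            # shape (n_samples, n_data)
    nll = 0.5 * np.sum(resid ** 2 / noise_var
                       + np.log(2 * np.pi * noise_var), axis=1)
    data_cost = nll.mean()                       # E_nu[-log p_theta(D)]
    model_cost = kl_diag_gaussians(              # KL[nu, p]
        mu_q, var_q, np.zeros_like(mu_q), np.full_like(mu_q, prior_var)
    )
    return data_cost, model_cost
```

When the posterior equals the prior, the model cost is zero and all of the description length is spent on the data; concentrating the posterior around a good fit shifts cost from the data term to the model term.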

4. MINIMUM DESCRIPTION LENGTH CONTROL

As part of its underlying philosophy, the MDL principle holds that 1) learning is the process of discovering regularity in data, and 2) any regularity in the data can be used to compress it (Grünwald, 2004). Applying this perspective to RL is non-obvious: from the agent's perspective, what 'data' is it trying to compress? Our hypothesis, which forms the basis for the framework we propose in this paper, is that an agent faced with a set of tasks in the world should seek to elucidate structural regularity from the environment interactions generated by the optimal policies for the tasks. This makes intuitive sense: the agent ought to compress information which indicates how to correctly perform the tasks with which it is faced. That is, we propose that the data in multitask RL are the state-action interactions generated by the optimal policies for a set of tasks: $D = \{D_M\}_{M \in \mathcal{M}} = \{(s, a) : \forall s \in S,\ a \sim \pi^\star_M(\cdot \mid s)\}_{M \in \mathcal{M}}$. This interpretation is in line with work suggesting that a useful operational definition of 'task' can be derived directly from the set of optimal (or near-optimal) policies it induces (Abel et al., 2021). It also suggests a natural mapping to the multitask RPO framework. In this view, the control policy is responsible for learning and the default policy for compression: by converging to the optimal policy for a given task, the control policy "discovers" regularity which is then distilled into a low-complexity representation by the default policy. In our approach, the default policy is encouraged to learn a compressed representation not by artificially constraining the network architecture or via hand-designed information asymmetry, but rather through a prior distribution $p(w)$ over its parameters which biases a variational posterior $\nu(w)$ towards simplicity. The default policy is therefore trained to minimize the variational code:

$$\arg\min_{\nu \in \mathcal{N}} \mathbb{E}_{(s,a) \sim D,\, w \sim \nu}\left[-\log \pi_w(a \mid s)\right] + \mathrm{KL}[\nu(\cdot), p(\cdot)] = \arg\min_{\nu \in \mathcal{N}} \mathbb{E}_{M \sim P_{\mathcal{M}}} \mathbb{E}_{s \sim d^{\pi^\star_M},\, w \sim \nu}\, \mathrm{KL}[\pi^\star_M(\cdot \mid s), \pi_w(\cdot \mid s)] + \mathrm{KL}[\nu(\cdot), p(\cdot)], \tag{4.1}$$

where $\mathcal{N}$ is the distribution family for the posterior. This suggests the approach presented in Algorithm 1, in which for each task $M_k$, the control policy $\pi_\theta$ is trained to approximate the optimal policy $\pi^\star_k$ via RPO, and the result is compressed into a new default policy distribution $\nu_{k+1}$. We now further motivate sparsity-inducing priors for the default policy in multitask settings, derive formal performance guarantees for MDL-C, and demonstrate its empirical effectiveness.

Algorithm 1: MDL-C for Sequential Multitask Learning with Persistent Replay
1: Require: task distribution $P_{\mathcal{M}}$, policy class $\Theta$, non-increasing coefficients $\{\eta_k\}_{k=1}^{K}$
2: Initialize: default policy distribution $\nu_1 \in \mathcal{N} \subseteq \mathcal{P}(\Theta)$, default policy dataset $D_0 \leftarrow \emptyset$
3: for tasks $k = 1, 2, \dots, K$ do
4:   Sample a task $M_k = (S, A, P_k, r_k, \gamma_k, \rho_k) \sim P_{\mathcal{M}}(\cdot)$
5:   Optimize control policy: $\theta_k \leftarrow \arg\max_{\theta \in \Theta} V^{\pi_\theta}_{M_k} - \alpha\, \mathbb{E}_{s \sim d^{\pi_\theta}_{\rho_k}} \mathbb{E}_{w \sim \nu_k} \mathrm{KL}[\pi_w(\cdot \mid s), \pi_\theta(\cdot \mid s)]$ (4.2)
6:   Add data to default policy replay ($M = |S|$ for finite/small state spaces): $D_k \leftarrow D_{k-1} \cup \{(s_m, \pi_{\theta_k}(s_m))\}_{m=1}^{M}$ (4.3)
7:   Update default policy distribution: $\nu_{k+1} \leftarrow \arg\min_{\nu \in \mathcal{N}} \frac{1}{\eta_{k-1}} \mathrm{KL}[\nu(\cdot), p(\cdot)] + \sum_{i=1}^{k} \sum_{m=1}^{M} \mathbb{E}_{w \sim \nu} \mathrm{KL}[\pi^\star_{\theta_i}(\cdot \mid s_m), \pi_w(\cdot \mid s_m)]$ (4.4)
8: end for
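The default policy update in Algorithm 1 (Equation (4.4)) can be illustrated with a deliberately simplified tabular sketch. This is not the authors' method: the variational posterior $\nu$ is replaced by a point estimate of softmax logits `W`, the sparsity-inducing prior is replaced by a zero-mean Gaussian (so the prior term becomes an L2 penalty scaled by `1 / eta`), and the per-task optimal policies in the replay are assumed to be given. All names are hypothetical.

```python
import numpy as np

def softmax(logits):
    """Numerically stable row-wise softmax."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distill_default_policy(replay, eta, lr=0.1, steps=500):
    """Point-estimate analogue of the distillation step.

    replay: list of arrays of shape (n_states, n_actions); entry i holds the
            (assumed known) optimal policy pi*_i(.|s) for every state s.
    Minimizes ||W||^2 / (2 * eta) + sum_i sum_s KL[pi*_i(.|s), softmax(W[s])]
    by gradient descent on the logits W.
    """
    W = np.zeros_like(replay[0], dtype=float)
    for _ in range(steps):
        pi_w = softmax(W)
        # d/dW[s] of KL[p, softmax(W[s])] equals softmax(W[s]) - p.
        grad = W / eta + sum(pi_w - target for target in replay)
        W -= lr * grad
    return softmax(W)

# Two hypothetical tasks that agree in state 0 and disagree in state 1.
task_a = np.array([[0.95, 0.05], [0.95, 0.05]])
task_b = np.array([[0.95, 0.05], [0.05, 0.95]])
pi_default = distill_default_policy([task_a, task_b], eta=10.0)
```

In this toy case the distilled policy commits to the shared action in state 0 and stays uniform in state 1, where the tasks conflict, which mirrors the intended behavior of a default policy that keeps only structure common across tasks.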

4.1. MOTIVATING THE CHOICE OF SPARSITY-INDUCING PRIORS

In Section 3, compression (via pruning extraneous parameters) is the primary motivation for using sparsity-inducing priors that belong to the family of scale mixtures of normal distributions. Intuitively, placing a distribution over the default parameters reflects the agent's epistemic uncertainty about the task distribution: when few tasks have been sampled, a sparse prior prevents the default policy from overfitting to spurious correlations in the limited data that the agent has collected. Here, we make this motivation more precise, describing an example generative model of optimal policy parameters which provides a principled interpretation for prior choice $p(z)$ in multitask RL.

Generative model of optimal policy parameters Consider a set of tasks $\mathcal{M} = \{M_{ik}\}_{i=1,k=1}^{I,K_i}$ that are clustered into $I$ groups, such that the MDPs in each group are more similar to one another than to members of other groups. As an example, the overall family $\mathcal{M}$ could be all sports, while clusters $\mathcal{M}_i \subseteq \mathcal{M}$ could consist of, say, ball sports or endurance competitions. To make this precise, we assume that the optimal policies of every MDP belong to a parametric family $\Pi = \{\pi_w(\cdot \mid s) : w \in \mathbb{R}^d, \forall s \in S\}$ (e.g., softmax policies with parameters $w$), and that the optimal policies for each group are randomly distributed within parameter space. In particular, we assume that the parameters of the optimal policies of $\mathcal{M}$ have the following generative model:

$$w_i \mid \beta, \sigma^2 \sim \mathcal{N}\left(w_i; 0, (1 - \beta)\beta^{-1}\sigma^2 I_d\right), \qquad w_{ik} \mid w_i, \sigma^2 \sim \mathcal{N}\left(w_{ik}; w_i, \sigma^2 I_d\right),$$

where $I_d$ is the $d$-dimensional identity matrix. If we marginalize out $w_i$, we get the marginal distribution $p(w_{ik} \mid \beta, \sigma^2) = \mathcal{N}(w_{ik}; 0, \sigma^2 \beta^{-1} I_d)$. We can then visualize the parameter distribution of the optimal policies for $\mathcal{M}$ as a $d$-dimensional Gaussian within which lie clusters of optimal policies for related tasks which are themselves normally distributed (see Fig. 4.1A for $d = 2$).
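The generative model above is straightforward to simulate, which gives a quick numerical check of the stated marginal variance $\sigma^2 \beta^{-1}$. The sketch below is illustrative only; the function name, sample sizes, and parameter values are assumptions.

```python
import numpy as np

def sample_task_parameters(n_groups, tasks_per_group, d, beta, sigma, seed=0):
    """Draw group means w_i and task parameters w_ik from the two-level model.

    w_i  ~ N(0, (1 - beta) / beta * sigma^2 * I_d)
    w_ik ~ N(w_i, sigma^2 * I_d)
    """
    rng = np.random.default_rng(seed)
    group_std = sigma * np.sqrt((1.0 - beta) / beta)
    w_group = group_std * rng.standard_normal((n_groups, d))
    noise = sigma * rng.standard_normal((n_groups, tasks_per_group, d))
    w_task = w_group[:, None, :] + noise   # shape (n_groups, tasks_per_group, d)
    return w_group, w_task

w_group, w_task = sample_task_parameters(
    n_groups=20000, tasks_per_group=2, d=2, beta=0.25, sigma=1.0
)
# Empirical variance of individual coordinates; the model predicts
# sigma^2 / beta = 4 for these (hypothetical) settings.
empirical_var = w_task.var()
```

Smaller values of `beta` spread the group means further apart relative to the within-group noise, so tasks in the same group become more informative about each other.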

Interpretation of β

The parameter $\beta \in (0, 1]$ can be interpreted as encoding the squared distance between optimal policy parameters within a group divided by the squared distance between optimal policies in $\mathcal{M}$. Intuitively, $\beta$ determines how much information one gains about the optimal parameters of a task in a group, given knowledge about the optimal parameters of another task in the same group. To see this, we compute our posterior belief about the value $w_i$ given observation of $w_{ik}$: $p(w_i \mid w_{ik}, \beta, \sigma^2) = \mathcal{N}\left(w_i; (1 - \beta) w_{ik}, (1 - \beta)\sigma^2 I_d\right)$. When $\beta = 1$ (the inner circle in Figure 4.1A has the same radius as the outer circle), our posterior mean estimate of $w_i$ is simply 0, suggesting we have learned nothing new about the mean of the optimal parameters in group $i$ by observing $w_{ik}$. In the other extreme, when $\beta \to 0$, the posterior mean approaches the maximum-likelihood estimator $w_{ik}$, suggesting that observation of $w_{ik}$ provides maximal information about the optimal parameters in group $i$. Any $\beta$ in between the two extremes results in an estimator that "shrinks" $w_{ik}$ towards 0. The value of $\beta$ thus has important implications for multitask learning. Suppose an RL agent learns the optimal parameters $w_{11}$ (task 1, group 1), and proceeds to learn task 2 in group 1. The value of $\beta$ determines whether $w_{11}$ can be used to inform the agent's learning of $w_{12}$. In this way, $\beta$ determines the effective degree of epistemic uncertainty the agent has about the task distribution.

Choice of $p(\beta)$ and connection to $p(z)$ Given its importance, it's natural to ask what value $\beta$ should take. Instead of treating $\beta$ as a parameter, we can choose a prior $p(\beta)$ and perform Bayesian inference. Ideally, $p(\beta)$ should (i) encode our prior belief about the extent to which the optimal parameters cluster into groups and (ii) result in a posterior mean estimator $\hat{w}^{(p(\beta))}(x) = \left(1 - \mathbb{E}[\beta \mid x]\right) x$ that is close to $w$ for $x \mid w \sim \mathcal{N}(x; w, \sigma^2)$.
This condition encourages the expected default policy (under the posterior $\nu$; Equation (4.1)) to be close to optimal policies in the same MDP group (centered at $w$). One prior choice that satisfies both conditions is $p(\beta) \propto \beta^{-1}$. It places high probability on small $\beta$ and low probability on large $\beta$, thus encoding the prior belief that the optimal task parameters are clustered (see Figure 4.1B; blue). It is instructive to compare $p(\beta) \propto \beta^{-1}$ with two extreme choices of $p(\beta)$. When $p(\beta) = \delta(\beta - 1)$, $p(z) = \delta(z - \sigma)$ and the marginal $p(w)$ is the often-used Gaussian prior over the parameters $w$ with fixed variance $\sigma^2$. This corresponds to the prior belief that knowing $w_{i1}$ provides no information about $w_{i2}$. On the other hand, $p(\beta) = \delta(\beta)$ recovers a uniform prior over the parameters $w$ and reflects the prior belief that the MDP groups are infinitely far apart. In relation to (ii), one can show that $\hat{w}^{(p(\beta))}$ strictly dominates the maximum-likelihood estimator $\hat{w}^{(\mathrm{ML})}(x) = x$ (Efron and Morris, 1973; Section D) for $p(\beta) \propto \beta^{-1}$. This means $\mathrm{MSE}(w, \hat{w}^{(p(\beta))}) \le \mathrm{MSE}(w, \hat{w}^{(\mathrm{ML})})$ for all $w$, where $\mathrm{MSE}(w, \hat{w}) = \mathbb{E}_{x \sim \mathcal{N}(x; w, \sigma^2)} \|w - \hat{w}(x)\|^2$.

Connection to $p(z)$ and application of VDO Defining $z^2 = \sigma^2 \beta^{-1}$ and applying the change-of-variable formula to $p(\beta) \propto \beta^{-1}$ gives $p(z) \propto |z|^{-1}$ and thus the Normal-Jeffreys prior in Section 3. VDO (see Section 3) can then be applied to obtain an approximate posterior $\nu(w, z)$ which minimizes the variational code in Equation (4.1). Similar correspondences may also be derived for the inverse-Gamma distribution and the half-Cauchy distribution (Figure 4.1B; Section D).
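The change of variables from $p(\beta)$ to $p(z)$ can be written out in a few lines. The following is a short worked derivation, assuming the density is transformed through the one-to-one map $\beta \mapsto z$ on $z > 0$ (the case $z < 0$ is symmetric) and dropping proportionality constants, since the prior on $\beta$ is improper.

```latex
\begin{align*}
z^2 = \sigma^2 \beta^{-1}
  \;&\Longrightarrow\; \beta(z) = \frac{\sigma^2}{z^2},
  \qquad \left|\frac{d\beta}{dz}\right| = \frac{2\sigma^2}{|z|^3}, \\
p(z) &= p\bigl(\beta(z)\bigr)\left|\frac{d\beta}{dz}\right|
  \;\propto\; \frac{z^2}{\sigma^2}\cdot\frac{2\sigma^2}{|z|^3}
  \;=\; \frac{2}{|z|} \;\propto\; |z|^{-1}.
\end{align*}
```

Under this mapping, $\beta \in (0, 1]$ corresponds to $|z| \ge \sigma$, and $\beta \to 0$ corresponds to $|z| \to \infty$.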

4.2. PERFORMANCE ANALYSIS

At a fundamental level, we'd like assurance (i) that MDL-C's default policy will be able to effectively distill the optimal policies for previously observed tasks, and (ii) that regularization using this default policy gives strong performance guarantees for the control policy on future tasks.

Default policy performance One way we can verify (i) is to obtain an upper bound on the average KL between default policies sampled from the default policy distribution and an optimal policy for a task sampled from the task distribution. This enables us to perform analysis using online convex optimization (OCO). In OCO, the learner observes a series of convex loss functions $\ell_k : \mathcal{N} \to \mathbb{R}$, $k = 1, \dots, K$, where $\mathcal{N} \subseteq \mathbb{R}^d$ is a convex set. After each round, the learner produces an output $x_k \in \mathcal{N}$ for which it will then incur a loss $\ell_k(x_k)$ (Orabona, 2019). At round $k$, the learner is usually assumed to have knowledge of $\ell_1, \dots, \ell_{k-1}$, but no other assumptions are made about the sequence of loss functions. The learner's goal is to minimize its average regret. For further background, see Section F. Crucially, the MDL-C learning procedure for the default policy distribution is equivalent to follow the regularized leader (FTRL), an OCO algorithm which enjoys sublinear regret. In each round of FTRL, the learner selects the solution $x \in \mathcal{N}$ according to the following general objective: $x_{k+1} = \arg\min_{x \in \mathcal{N}} \psi_k(x) + \sum_{i=1}^{k-1} \ell_i(x)$, where $\psi_k : \mathcal{N} \to \mathbb{R}$ is a convex regularization function. Using standard results, this connection allows us to bound MDL-C's regret in learning the default policy distribution. All proofs are provided in Section G.

Proposition 4.1 (Persistent Replay FTRL Regret). Let tasks $M_k$ be independently drawn from $P_{\mathcal{M}}$ at every round, and let them each be associated with a deterministic optimal policy $\pi^\star_k : S \to A$. We make the following mild assumptions: i) $\pi_w(a^\star \mid s) \ge \epsilon > 0$ $\forall s \in S$, where $a^\star = \pi^\star_k(s)$ and $\epsilon$ is a constant. ii) $\min_\nu \mathrm{KL}[\nu(\cdot), p(\cdot)] = 0$ asymptotically as $\mathrm{Var}[\nu] \to \infty$. Then with $\eta_{k-1} = \log(1/\epsilon)/\sqrt{k}$, Algorithm 1 guarantees

$$\frac{1}{K} \sum_{k=1}^{K} \ell_k(\nu_k) - \frac{1}{K} \sum_{k=1}^{K} \ell_k(\bar{\nu}_K) \le \frac{\left(\mathrm{KL}[\bar{\nu}_K, p] + 1\right) \log(1/\epsilon)}{\sqrt{K}},$$

where $\bar{\nu}_K = \arg\min_{\nu \in \mathcal{N}} \sum_{k=1}^{K} \ell_k(\nu)$.

Control policy performance Intuitively, this result shows that the average regret is upper-bounded by factors which depend on the divergence of the barycenter distribution from the prior and the "worst-case" prediction of the default policy. Importantly, the KL between the default policy distribution and the barycenter distribution goes to zero as $K \to \infty$. We can also now be assured of point (ii) above, in that this result can be used to obtain a sample-complexity bound for the control policy. Specifically, we can use Proposition G.1 to place an upper bound on the total variation distance between default policies sampled from $\nu$ and the KL between the maximum likelihood solution and a sparsity-inducing prior $p$. This is useful, as it allows us to translate low regret for the default policy into a sample complexity result for the control policy using Moskovitz et al. (2022a), Lemma 5.2.

Proposition 4.2 (Control Policy Sample Complexity). Under the setting described in Proposition G.1, denote by $T_k$ the number of iterations to reach $\epsilon$-error for $M_k$ in the sense that $\min_{t \le T_k} \{V^{\pi^\star_k} - V^{(t)}\} \le \epsilon$ whenever $t > T_k$. Further, denote the upper bound in Eq. (G.1) by $G(K)$. In a finite MDP, from any initial $\theta^{(0)}$, and following gradient ascent, $\mathbb{E}_{M_k \sim P_{\mathcal{M}_i}}[T_k]$ satisfies:

$$\mathbb{E}_{M_k \sim P_{\mathcal{M}_i}}[T_k] \ge \frac{80 |A|^2 |S|^2}{\epsilon^2 (1 - \gamma)^6}\, \mathbb{E}_{M_k \sim P_{\mathcal{M}_i},\, s \sim \mathrm{Unif}_S} \left[ \kappa^{\alpha_k}_{A}(s) \left\| \frac{d^{\pi^\star_k}_{\rho}}{\mu} \right\|_\infty^2 \right],$$

where $\alpha_k(s) := d_{\mathrm{TV}}(\pi^\star_k(\cdot \mid s), \pi_0(\cdot \mid s)) \le G(K)$, $\kappa^{\alpha_k}_{A}(s) = \frac{2|A|(1 - \alpha(s))}{2|A|(1 - \alpha(s)) - 1}$, and $\mu$ is a measure over $S$ such that $\mu(s) > 0$ $\forall s \in S$. Intuitively, this means that when the average number of samples is sufficiently large, the control policy is guaranteed to have reached $\epsilon$ error.
Therefore, as the agent is trained on more tasks, the default policy distribution regret, upper-bounded by $G(K)$, decreases asymptotically to zero, and as the default policy regret decreases, the control policy will learn more rapidly, as $\mathrm{poly}(G(K))$.
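The sublinear-regret behavior of FTRL can be seen in a toy problem where the update has a closed form. The sketch below is not the MDL-C default policy update: it assumes quadratic losses $\ell_k(x) = \tfrac{1}{2}\|x - c_k\|^2$ with hypothetical targets $c_k$, and a quadratic regularizer whose weight grows like $\sqrt{k}$ (one common FTRL schedule, chosen here as an assumption). It reports the average regret against the best fixed point in hindsight.

```python
import numpy as np

def ftrl_average_regret(centers, reg_scale=1.0):
    """Run FTRL on l_k(x) = 0.5 * ||x - c_k||^2 and return average regret.

    Regularizer after k rounds: psi(x) = reg_scale * sqrt(k) / 2 * ||x||^2.
    """
    K, d = centers.shape
    x = np.zeros(d)                  # first play minimizes the regularizer alone
    running_sum = np.zeros(d)
    losses = np.empty(K)
    for k in range(1, K + 1):
        c = centers[k - 1]
        losses[k - 1] = 0.5 * np.sum((x - c) ** 2)   # loss revealed after playing x
        running_sum += c
        # Closed-form minimizer of the regularizer plus all k observed losses.
        x = running_sum / (reg_scale * np.sqrt(k) + k)
    best_fixed = centers.mean(axis=0)                # best comparator in hindsight
    best_loss = 0.5 * np.sum((centers - best_fixed) ** 2)
    return (losses.sum() - best_loss) / K

rng = np.random.default_rng(0)
centers = np.array([1.0, -1.0]) + rng.standard_normal((2000, 2))
avg_regret = ftrl_average_regret(centers)
```

With more rounds the cumulative regret in this toy setting grows much more slowly than `K`, so the average regret shrinks toward zero, which is the qualitative behavior the regret bound above describes for the default policy distribution.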

5. EXPERIMENTS

We tested MDL-C applied to discrete and continuous control in both the sequential and parallel task settings. To quantify performance, in addition to measuring per-task reward, we also report the cumulative regret for each method in each experimental setting in Section I.1.

5.1. 2D NAVIGATION

We first test MDL-C in the classic FourRooms environment (Fig. 5.1a; Sutton et al., 1999). The baselines in this case are entropy-regularized policy optimization (PO), regularized policy optimization with no constraint on the default policy (RPO), an agent with VDO applied to the control policy and no default policy (VDO-PO), and MANUALIA (Galashov et al., 2019), in which the goal feature is manually withheld from the default policy. Details can be found in Section H.

Generalization Across Goals

In the first setting, we test MDL-C's ability to facilitate rapid learning on previously unseen goals. In the first phase of training, a single goal location is randomly sampled at the start of each episode, and may be placed anywhere in two of the four rooms in the environment (Fig. 5.1a, top left). In the second phase, the goal location is again randomly sampled at the start of each episode, but in this case, only in the rooms which were held out in the first phase. Additionally, the agent is limited to 25 rather than 100 steps per episode. Importantly, VDO induces the MDL-C default policy to ignore input features which are, on average, less predictive of the control policy's behavior. In this case, the default policy learns to ignore the goal feature and the reward obtained on the previous timestep. This is because, when averaging across goal locations, the agent's current position ($s_h$) and its previous direction ($a_{h-1}$) are more informative of its next action (typically, heading towards the nearest door). In contrast, the un-regularized default policy of the RPO agent does not drop these features (see Section I for a visualization and Section H for more details). By learning to ignore the goals present in phase 1 and encoding useful behavior regardless of goal location, MDL-C develops more effective regularization in phase 2, enabling it to adapt more quickly than other methods (Fig. 5.1c, top), particularly RPO, which overfits to phase 1's goals. MANUALIA also adapts quickly, as its default policy is hard-coded to ignore the goal feature.

Robustness to Rule Changes In this setting, there are only two possible goal locations, one at the top left of the environment, and the other at the bottom right. In training phase 1, the agent receives a goal feature as input which indicates the state index of the rewarded location for that episode. In phase 2, the goal feature switches from marking the reward location to marking the unrewarded location.
That is, if the reward is in the top left, the goal feature will point to the bottom right. Here, the danger for the agent isn't overfitting to a particular goal or goals, but rather "overfitting" to the reward-based rules associated with a given feature. As we saw in Fig. 5.1c (top), an un-regularized default policy will copy the control policy and overfit to a particular setting. Fortunately, the MDL-C default policy learns to ignore features which are, on average, less useful for predicting the control policy's behavior, namely the goal and previous reward features. This renders the agent more robust to contingency switches like the one described, as we can see in Fig. 5.1c (bottom). These examples illustrate that MDL-C enables agents to effectively learn the consistent structure of a group of tasks, regardless of its semantics, and "compress out" information which is less informative on average.

5.2. CONTINUOUS CONTROL

A more challenging application area is that of high-dimensional continuous control. In this setting, we presented agents with multitask learning problems using environments from the DeepMind Control Suite (DMC; Tassa et al., 2018). We used soft actor-critic (SAC; Haarnoja et al., 2018) as the base agent. We tested MDL-C in both the sequential and parallel settings on two domains from DMC: walker and cartpole (Fig. 5.2a). Additional details can be found in Section H.

Sequential Tasks In the sequential setting, tasks are sampled one at a time uniformly without replacement from the available tasks within each domain, with the default policy distribution conserved across tasks. For walker, these tasks are stand, walk, and run. In stand, the agent is rewarded for increasing the height of its center of mass, and in the latter two tasks, an additional reward is given for forward velocity. For cartpole, there are four tasks: balance, balance-sparse, swingup, and swingup-sparse. In the balance tasks, the agent must keep a rotating pole upright, and in the swingup tasks, it must additionally learn to swing the pole upwards from an initial downward orientation. Performance results for the hardest task within each domain (run in walker and swingup-sparse in cartpole) for each method are plotted in Fig. 5.2b, where $k$ indicates the task round at which the task was sampled. We can see that as $k$ increases (as more tasks have been seen previously), MDL-C's performance improves. Importantly, the RPO agent's default policy, which is un-regularized, overfits to the previous task, essentially copying the optimal policy's behavior. This can severely hinder the agent's performance when the subsequent task requires different behavior. For example, on swingup-sparse, if the previous task is swingup, the RPO agent performs well, as the goal is identical.
However, if the previous task is balance or balance-sparse, the agent never learns to swing the pole upwards, significantly reducing its average performance. Another important point to note is that because the agent is not given an explicit goal feature in this setting, methods like MANUALIA which rely on prior knowledge about the agent's inputs cannot be applied.

Parallel Tasks We also tested parallel-task versions of SAC, MANUALIA, and MDL-C based on the model of Yu et al. (2019). In this framework, a task within each domain is randomly sampled at the start of each episode and the agent learns a single control policy for all tasks. Performance is plotted in Fig. 5.2c, where we can again see that MDL-C accelerates convergence relative to the baseline methods. This marks a difference compared to the easier FourRooms environment, in which MDL-C and MANUALIA performed roughly the same. As before, one clue to the difference can be found in the input features that the MDL-C default policy chooses to ignore (Fig. 5.2d). For walker, inputs are 24-dimensional, with 14 features related to the joint orientations, 1 feature indicating the height of the agent's center of mass, and 9 features indicating velocity components. For cartpole, there are 5 input dimensions, with 3 pertaining to position and 2 to velocity. In the walker domain, where the performance difference is greatest, the MDL-C agent ignores not only the added task ID feature, but also several features related to velocity. In contrast, in the cartpole domain, MDL-C only ignores the task ID feature, just as MANUALIA does, and the performance gap is smaller. This illustrates that MDL-C learns to compress out spurious information even in settings for which it is difficult to identify a priori.
In order to test the effect of the learned asymmetry on performance more directly, we implemented a variant of MANUALIA in which all of the features which MDL-C learned to ignore were manually hidden from the default policy (Fig. I.4). Interestingly, while this method improved over standard MANUALIA, it did not completely close the gap with MDL-C, indicating there are downstream effects within the network beyond input processing which are important for the default policy's effectiveness. We hope to explore these effects in more detail in future work.

6. RELATED WORK

MDL-C can be viewed as an extension of recent approaches to learning default policies ("behavioral priors") from the optimal policies of related tasks (Teh et al., 2017; Tirumala et al., 2020). For a default policy to be useful for transfer learning, it is crucial to balance the ability of the default policy to "copy" the control policies with its expressiveness. If the default policy is too expressive, it is likely to overfit on past tasks and fail to generalize to unseen tasks. Whereas prior work primarily handcrafts structural constraints into the default policies to avoid overfitting (e.g., by hiding certain state information from the default policy; Galashov et al., 2019), MDL-C learns such a balance from data with sparsity-inducing priors via variational inference. MDL-C may also be derived from the RL-as-inference framework (Levine, 2018; Section A). MDL-C thus has close connections with algorithms such as MPO (Abdolmaleki et al., 2018) and VIREL (Fellows et al., 2020), discussed in Section A. As a general framework, MDL-C is also connected to the long and well-established literature on choosing appropriate Bayesian priors (Jeffreys, 1946; Bernardo, 2005; Casella, 1985), and more recent work that focuses on learning such priors for large-scale machine learning models (Nalisnick and Smyth, 2017; Nalisnick et al., 2021; Atanov et al., 2018). For a further discussion of related work, particularly concerning the application of MDL to the RL setting, see Section C.

7. CONCLUSION

Inspired by dual process theories and the MDL principle, we propose a regularized policy optimization framework for multitask RL which aims to learn a simple default policy encoding a low-complexity distillation of the optimal behavior for some family of tasks. By encouraging the default policy to maintain a low effective description length, MDL-C ensures that it does not overfit to spurious correlations among the (approximately) optimal policies learned by the agent. We described MDL-C's formal properties and demonstrated its empirical effectiveness in discrete and continuous control tasks. There are of course limitations of MDL-C, which we believe represent opportunities for future work (see Section E). Promising research directions include integrating MDL-C with multitask RL approaches which balance a larger set of policies (Barreto et al., 2020; Moskovitz et al., 2022b; Thakoor et al., 2022) as well as considering nonstationary environments (Parker-Holder et al., 2022). We hope MDL-C inspires further advances in multitask RL.

Minimum Description Length Control Supplementary Information

A REINFORCEMENT LEARNING AS INFERENCE

The control as inference framework (Levine, 2018) associates every time step $h$ with a binary "optimality" random variable $O_h \in \{0, 1\}$ that indicates whether $a_h$ is optimal at state $s_h$ ($O_h = 1$ for optimal, and $O_h = 0$ for not). The optimality variable has the conditional distribution $P(O_h = 1 \mid s_h, a_h) = \exp(r(s_h, a_h))$, which scales exponentially with the reward received taking action $a_h$ in state $s_h$. Denote by $O_H$ the event that $O_h = 1$ for $h = 0, \dots, H-1$. Then the likelihood that a policy $\pi_w(a \mid s)$ is optimal over a horizon $H$ is given by

$$P(O_H) = \int P(O_H \mid \tau)\, P_{\pi_w}(\tau \mid w)\, p(w)\, d\tau\, dw.$$

By performing variational inference, we can lower-bound the log-likelihood with the ELBO:

$$\log P(O_H) \ge \mathbb{E}_{\nu_\theta(\tau)} \sum_{h=0}^{H-1} \Big[ r(s_h, a_h) - \mathbb{E}_{\nu_\phi(w)} \mathrm{KL}[\pi_\theta(a_h \mid s_h), \pi_w(a_h \mid s_h)] \Big] - \mathrm{KL}[\nu_\phi(w), p(w)], \tag{A.1}$$

where $\nu_{\theta,\phi}(\tau, w) = \nu_\theta(\tau)\,\nu_\phi(w)$ is the variational posterior, $\nu_\theta(\tau) = \rho(s_0) \prod_{h=0}^{H-1} P(s_{h+1} \mid s_h, a_h)\, \pi_\theta(a_h \mid s_h)$, and $\{\theta, \phi\}$ are the variational parameters. We can maximize this objective iteratively by performing coordinate ascent on $\{\theta, \phi\}$:

$$\theta \leftarrow \theta + \eta \nabla_\theta\, \mathbb{E}_{\nu_\theta(\tau)} \sum_{h=0}^{H-1} \Big[ r(s_h, a_h) - \mathbb{E}_{\nu_\phi(w)} \mathrm{KL}[\pi_\theta(a_h \mid s_h), \pi_w(a_h \mid s_h)] \Big], \tag{A.2}$$

$$\phi \leftarrow \phi - \eta \nabla_\phi \Big( \mathbb{E}_{\nu_\theta(\tau)} \sum_{h=0}^{H-1} \mathbb{E}_{\nu_\phi(w)} \mathrm{KL}[\pi_\theta(a_h \mid s_h), \pi_w(a_h \mid s_h)] + \mathrm{KL}[\nu_\phi(w), p(w)] \Big), \tag{A.3}$$

where $\eta$ is a learning rate parameter. Note that Equation (A.3) is equivalent to Equation (4.1) and Equation (G.8), and Equation (A.2) is equivalent to Equation (G.7) with the KL reversed.

Connection to Maximum a Posteriori Policy Optimization (MPO). MDL-C is closely related to MPO (Abdolmaleki et al., 2018), with three key differences. First, MDL-C performs variational inference on the parameters of the default policy with an approximate posterior $\nu_\phi(w)$, whereas MPO performs MAP inference. Second, MPO places a normal prior on $w$, which in effect penalizes the L2 norm of $w$. In contrast, MDL-C uses sparsity-inducing priors such as the normal-Jeffreys prior.
Third, MDL-C uses a parametric $\pi_\theta$, whereas MPO uses a non-parametric one. While there is also a parametric variant of MPO, this variant does not maintain $\theta$ and $\phi$ separately. Instead, this variant directly sets $\theta$ to $\phi$ in Equation (A.2). This illustrates the key conceptual difference between MDL-C and MPO. MDL-C makes a clear distinction between the control policy $\pi_\theta$ and the default policy $\pi_w$, with the two policies serving two distinct purposes: the control policy for performing on the current task, the default policy for distilling optimal policies across tasks and generalizing to new ones. MPO, on the other hand, treats $\pi_\theta$ and $\pi_w$ as fundamentally the same object. Like MPO, VIREL (Fellows et al., 2020) can be derived from the control as inference framework. In fact, Fellows et al. (2020) showed that a parametric variant of MPO can be derived from VIREL. The key novelty that sets VIREL apart from both MPO and MDL-C is an adaptive temperature parameter that dynamically updates the influence of the KL term in Equation (A.2).
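To make the role of the KL term in the control-policy update concrete, the following sketch considers a deliberately simplified special case: a single state, a single step, and a fixed default policy. In that case the maximizer of expected reward minus KL[π, π_w] has the well-known closed form π(a) ∝ π_w(a) exp(r(a)). This is an illustration under those assumptions, not the procedure used in the paper; the function names and the example values are hypothetical.

```python
import numpy as np

def kl_divergence(p, q):
    """KL[p, q] for two strictly positive discrete distributions."""
    return float(np.sum(p * (np.log(p) - np.log(q))))

def regularized_objective(policy, rewards, default_policy):
    """Single-step analogue of the bracketed term in (A.2): E_pi[r] - KL[pi, pi_w]."""
    return float(policy @ rewards) - kl_divergence(policy, default_policy)

def optimal_control_policy(rewards, default_policy):
    """Closed-form maximizer: pi(a) proportional to pi_w(a) * exp(r(a))."""
    logits = np.log(default_policy) + rewards
    logits = logits - logits.max()  # subtract the max for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()

# Hypothetical three-action example (values chosen only for illustration).
rewards = np.array([-1.0, -0.2, -2.0])
default_policy = np.array([0.6, 0.1, 0.3])
control_policy = optimal_control_policy(rewards, default_policy)
best_value = regularized_objective(control_policy, rewards, default_policy)
```

In this toy case the control policy shifts mass toward the high-reward action but stays anchored to the default policy, and the optimal value equals the log of the default-policy-weighted average of exp(r).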

B MULTITASK RL FRAMEWORKS

We believe the objective which best captures naturalistic settings is the average reward obtained over the agent's "lifetime": $\lim_{T \to \infty} \frac{1}{T} \mathbb{E} \sum_{t=1}^{T} r(s_t, a_t)$. Typical objectives include finding either a single policy or a set of policies which maximize worst- or average-case value: $\max_\pi \min_{M \in \mathcal{M}} V^\pi_M$ (Zahavy et al., 2021) or $\max_\pi \mathbb{E}_{P_{\mathcal{M}}} V^\pi_M$ (Moskovitz et al., 2022a). When the emphasis is on decreasing the required sample complexity of learning new tasks, a useful metric is cumulative regret: the agent's total shortfall across training compared to an optimal agent. In practice, it is often simplest to consider the task distribution $P_{\mathcal{M}}$ to be a categorical distribution defined over a discrete set of tasks $\mathcal{M} := \{M_k\}_{k=1}^{K}$, though continuous densities over MDPs are also possible. Two multitask settings which we consider here are parallel task RL and sequential task RL. In typical parallel task training (Yu et al., 2019), a new MDP is sampled from $P_{\mathcal{M}}$ at the start of every episode and is associated with a particular input feature $g \in \mathcal{G}$ that indicates to the agent which task has been sampled. The agent's performance is evaluated on all tasks $M \in \mathcal{M}$ together. In the sequential task setting (Moskovitz et al., 2022a; Pacchiano et al., 2022), tasks (MDPs) are sampled one at a time from $P_{\mathcal{M}}$, with the agent training on each until convergence. In contrast to continual learning (Kessler et al., 2021), the agent's goal is simply to learn a new policy for each task more quickly as more are sampled, rather than learning a single policy which maintains its performance across tasks. Another important setting is meta-RL, which we do not consider here. In the meta-RL setting, the agent trains on each sampled task for only a few episodes with the goal of improving few-shot performance and is meta-tested on a set of held-out tasks (Yu et al., 2019; Finn et al., 2017).
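The worst-case and average-case objectives, and the cumulative regret metric, can be illustrated on a small table of values. The sketch below assumes a finite set of candidate policies whose values on each task are already known, which is a simplification made purely for illustration; the function names and numbers are hypothetical.

```python
import numpy as np

def worst_case_policy(values):
    """Index of argmax_pi min_M V[pi, M] for a [num_policies, num_tasks] table."""
    return int(np.argmax(values.min(axis=1)))

def average_case_policy(values, task_probs):
    """Index of argmax_pi E_{M ~ P_M} V[pi, M] for a categorical task distribution."""
    return int(np.argmax(values @ task_probs))

def cumulative_regret(optimal_returns, agent_returns):
    """Total shortfall of the agent relative to an optimal agent across training."""
    return float(np.sum(np.asarray(optimal_returns) - np.asarray(agent_returns)))

# Hypothetical value table: three candidate policies (rows), two tasks (columns).
values = np.array([[1.0, 0.0],
                   [0.6, 0.5],
                   [0.0, 0.9]])
task_probs = np.array([0.8, 0.2])

robust_choice = worst_case_policy(values)
average_choice = average_case_policy(values, task_probs)
regret = cumulative_regret([1.0, 1.0, 1.0], [0.2, 0.5, 1.0])
```

In this example the two objectives select different policies: the specialist for the frequent task wins on average, while the generalist wins in the worst case.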
Another strain of work in multitask RL assumes some form of shared structure in the transition dynamics (Pacchiano et al., 2022; ?; ?). Specifically, the core assumption made by these works is that the transition dynamics are linearly decodable from a set of features which is shared across tasks or in which the transition matrix admits a low-rank decomposition. This is very different from our own structural assumption, which in its simplest form is that the optimal policies of the tasks with which our agents are faced take similar actions in at least some part of the state space. Beyond this, the MDPs in $\mathcal{M}$ need only share the same state and action space, with no direct assumptions about transitions or rewards. This is important, because the assumed structures in the transition distribution made by Pacchiano et al. (2022); ?; ? act as the starting points for algorithm development. MDL-C, RPO, and TVPO, however, can leverage similarity among optimal policies when it exists, but are not dependent on it as a prerequisite. (E.g., TVPO (and RPO/MDL-C) is guaranteed to perform no worse than log-barrier regularization, which has a polynomial sample complexity guarantee.) Ideally, we would like a generalist method which can identify and exploit different types of structure in the environment on its own.

C ADDITIONAL RELATED WORK

Previous work has also applied the MDL principle in an RL context, though primarily in the context of unsupervised skill learning (Zhang et al., 2021; Thrun and Schwartz, 1994). For example, Thrun and Schwartz (1994) are concerned with a set of "skills", which are policies defined only over a subset of the state space that are reused across tasks. They consider tabular methods, measuring a pseudo-description length as

$$\mathrm{DL} = \sum_{s \in \mathcal{S}} \sum_{M \in \mathcal{M}} P^*_M(s) + \sum_{n \in N} |S_n|, \tag{C.1}$$

where $P^*_M(s)$ is the probability that no skill selects an action in state $s$ for task $M$ and the agent must compute the optimal Q-values in state $s$ for $M$, $N$ is the number of skills, and $|S_n|$ is the number of states for which skill $n$ is defined. They then trade off this description length term with performance across a series of tabular environments. One other related method is DISTRAL (Teh et al., 2017), which uses the following objective in the parallel task setting:

$$J_{\mathrm{Distral}}(\theta, \phi) = V^{\pi_\theta} - \mathbb{E}_{s \sim d^{\pi_\theta}} \big[ \alpha\, \mathrm{KL}[\pi_\theta(\cdot \mid s), \pi_\phi(\cdot \mid s)] + \beta\, H[\pi_\theta(\cdot \mid s)] \big]. \tag{C.2}$$

That is, like the un-regularized RPO method, DISTRAL can be seen as performing maximum-likelihood estimation to learn the (unconstrained) default policy, while adding an entropy bonus to the control policy. Another important method in the sequential setting is TVPO (Moskovitz et al., 2022a), in which (in the tabular case) the default policy is defined as a softmax over the average action frequencies of the optimal policies for the tasks that the agent has seen so far. That is, if the average optimal action in a state $s$ is given by $\bar{\xi}_k(s, a) = \frac{1}{k} \sum_{i=1}^{k} \mathbb{1}(\pi^\star_i(s) = a)$, then the TVPO default policy is $\pi_w(a \mid s) = \mathrm{softmax}\big(\bar{\xi}_k(s, a) / \beta(k)\big)$, where $\beta(k)$ is a temperature which decays as $k \to \infty$. In high-dimensional state and action spaces, this tabular solution can be approximated by training a default policy to predict the converged control policy's actions in each task.
Importantly, this is equivalent to using KL distillation in that the default policies will converge to the same barycenter policy (Moskovitz et al., 2022a) , as long as the distillation is only performed once the control policy has converged in each task. Using KL distillation in this way is exactly the RPO baseline that we use in this paper. Crucially, the use of the softmax with decaying temperature was introduced by Moskovitz et al. (2022a) as a useful 'hack' to prevent the default policy from overfitting to early tasks, as the optimal default policy is the barycenter policy (approximated as the number of draws from the task distribution grows). Thus, MDL-C can itself be seen as a scalable advancement of TVPO which models the agent's epistemic uncertainty about the task distribution by placing a sparse prior over the default policy parameters (and uses a distillation loss rather than action prediction). In other words, MDL-C represents a principled approach to reducing the risk of default policy overfitting in the low-data regime. Finally, Brunskill and Li (2013) consider a similar training and task structure to our own, but use a model-based approach to learn the underlying MDPs.
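The tabular TVPO default policy described above (a softmax over average optimal-action frequencies with a temperature) can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the temperature schedule β(k) is not specified in this section, so the temperature is passed in as a plain argument, and the function name and toy data are hypothetical.

```python
import numpy as np

def tvpo_default_policy(optimal_actions, num_actions, temperature):
    """Softmax over average optimal-action frequencies.

    optimal_actions: integer array of shape [k, num_states]; entry (i, s) is the
    action chosen by the optimal policy of task i in state s.
    Returns an array of shape [num_states, num_actions] whose rows sum to one.
    """
    optimal_actions = np.asarray(optimal_actions)
    one_hot = np.eye(num_actions)[optimal_actions]       # shape [k, S, A]
    frequencies = one_hot.mean(axis=0)                   # xi_bar_k(s, a), shape [S, A]
    logits = frequencies / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)

# Toy data: k = 3 tasks, 2 states, 3 actions (illustrative values only).
optimal_actions = [[0, 1],
                   [0, 2],
                   [1, 2]]
warm_policy = tvpo_default_policy(optimal_actions, num_actions=3, temperature=1.0)
cold_policy = tvpo_default_policy(optimal_actions, num_actions=3, temperature=0.1)
```

Lowering the temperature concentrates the default policy on the most frequent optimal action in each state, which is the mechanism that lets the default policy sharpen as more tasks are seen.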

D MOTIVATING THE CHOICE OF SPARSITY-INDUCING PRIORS

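As a reminder, the generative model of optimal parameters in Section 4.1 is given by

$$w_i \mid \beta, \sigma^2 \sim \mathcal{N}\!\left(0, \tfrac{1-\beta}{\beta}\, \sigma^2 I_d\right), \tag{D.1}$$

$$w_{ik} \mid w_i, \sigma^2, \beta \sim \mathcal{N}(w_i, \sigma^2 I_d), \tag{D.2}$$

with marginal and posterior densities

$$p(w_{ik} \mid \sigma^2, \beta) = \mathcal{N}(0, \sigma^2 \beta^{-1} I_d), \tag{D.3}$$

$$p(w_i \mid w_{ik}, \sigma^2, \beta) = \mathcal{N}\big((1-\beta)\, w_{ik}, (1-\beta)\, \sigma^2 I_d\big). \tag{D.4}$$

In the rest of this section, we set $\sigma^2 = 1$ for simplicity and drop the indices on the parameters to remove clutter.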
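The marginal and posterior in (D.3) and (D.4) follow from standard Gaussian conjugacy, and they can be sanity-checked by simulation. The sketch below samples from (D.1) and (D.2) and compares empirical moments to the stated ones; the sample size, seed, and function names are arbitrary choices made for illustration.

```python
import numpy as np

def sample_hierarchy(beta, sigma2=1.0, d=2, n=200_000, seed=0):
    """Draw (w_i, w_ik) pairs from the generative model (D.1)-(D.2)."""
    rng = np.random.default_rng(seed)
    prior_std = np.sqrt((1.0 - beta) / beta * sigma2)
    w_cluster = rng.normal(0.0, prior_std, size=(n, d))                   # (D.1)
    w_task = w_cluster + rng.normal(0.0, np.sqrt(sigma2), size=(n, d))    # (D.2)
    return w_cluster, w_task

def empirical_moments(w_cluster, w_task):
    """Marginal variance of w_ik, plus slope and residual variance of w_i given w_ik."""
    marginal_var = float(np.mean(w_task ** 2))
    slope = float(np.sum(w_cluster * w_task) / np.sum(w_task ** 2))
    residual_var = float(np.mean((w_cluster - slope * w_task) ** 2))
    return marginal_var, slope, residual_var

beta = 0.25
w_cluster, w_task = sample_hierarchy(beta)
marginal_var, slope, residual_var = empirical_moments(w_cluster, w_task)
# Compare with (D.3): sigma^2 / beta, and (D.4): slope 1 - beta, variance (1 - beta) sigma^2.
```

With β = 0.25 and σ² = 1, the three quantities should be close to 4, 0.75, and 0.75 respectively, which shows that β acts as a shrinkage factor on the observed task parameters.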

D.1 CORRESPONDENCE BETWEEN p(z) AND p(β)

In Section 4.1, we draw a connection between $p(\beta) \propto \beta^{-1}$ and the normal-Jeffreys prior, which is commonly used for compressing deep neural networks (Louizos et al., 2017). In Table 1, we expand on this connection and list $p(\beta)$ for the Jeffreys prior together with two other commonly used priors for scale mixtures of normal distributions: the inverse-gamma and the inverse-beta. Note that the half-Cauchy distribution $p(z) \propto (1 + z^2)^{-1}$ is a special case of the inverse-beta distribution for $s = t = 1/2$. The half-Cauchy prior is another commonly used prior for compressing Bayesian neural networks (Louizos et al., 2017).

D.2 MSE RISK

In this section, we prove that the Bayes estimators for the Jeffreys, inverse-gamma, and inverse-beta (and by extension the half-Cauchy) distributions dominate the maximum-likelihood estimator with respect to the mean-squared error. Define the mean-squared error of an estimator $\hat{w}(x)$ of $w$ as

$$\mathrm{MSE}(w, \hat{w}) = \mathbb{E}_x \|\hat{w}(x) - w\|^2, \tag{D.5}$$

where the expectation is taken over $x \sim \mathcal{N}(w, \sigma^2 I_d)$ with $\sigma^2 = 1$. Immediately, we have $\mathrm{MSE}(w, \hat{w}^{(\mathrm{ML})}) = d$, where $\hat{w}^{(\mathrm{ML})}(x) = x$ is the maximum-likelihood estimator. An estimator $\hat{w}^{(a)}(x)$ is said to dominate another estimator $\hat{w}^{(b)}(x)$ if $\mathrm{MSE}(w, \hat{w}^{(a)}) \le \mathrm{MSE}(w, \hat{w}^{(b)})$ for all $w$ and the inequality is strict for a set of positive Lebesgue measure. It is well-known that the maximum-likelihood estimator is minimax (George et al., 2006), and thus any estimator that dominates the maximum-likelihood estimator is also minimax. To compute the mean-squared error risk for an estimator $\hat{w}(x)$, observe that

$$\|\hat{w}(x) - w\|^2 = \|x - \hat{w}(x)\|^2 - \|x - w\|^2 + 2(\hat{w}(x) - w)^\top (x - w). \tag{D.6}$$

Taking expectations on both sides gives

$$\mathrm{MSE}(w, \hat{w}) = \mathbb{E}_x \|x - \hat{w}(x)\|^2 - d + 2 \sum_{i=1}^{d} \mathrm{Cov}(\hat{w}_i(x), x_i) \tag{D.7}$$

$$= \mathbb{E}_x \|x - \hat{w}(x)\|^2 - d + 2\, \mathbb{E}_x \nabla \cdot \hat{w}(x), \tag{D.8}$$

where $\nabla = (\partial/\partial x_1, \dots, \partial/\partial x_d)$ and we apply Stein's lemma, $\mathrm{Cov}(\hat{w}_i(x), x_i) = \mathbb{E}_x\, \partial \hat{w}_i / \partial x_i$, in the last line. If the estimator takes the form $\hat{w}(x) = x + \gamma(x)$, the expression simplifies to

$$\mathrm{MSE}(w, \hat{w}) = d + \mathbb{E}_x \|\gamma(x)\|^2 + 2\, \mathbb{E}_x \nabla \cdot \gamma(x). \tag{D.9}$$

Therefore, an estimator $\hat{w}(x) = x + \gamma(x)$ dominates $\hat{w}^{(\mathrm{ML})}(x)$ if

$$\mathrm{MSE}(w, \hat{w}) - \mathrm{MSE}(w, \hat{w}^{(\mathrm{ML})}) = \mathbb{E}_x \big[ \|\gamma(x)\|^2 + 2 \nabla \cdot \gamma(x) \big] \le 0 \tag{D.10}$$

for all $w$ and the inequality is strict on a set of positive Lebesgue measure.

D.2.1 JAMES-STEIN ESTIMATOR

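The famous James-Stein estimator is defined as

$$\hat{w}^{(\mathrm{JS})}(x) = x + \gamma^{(\mathrm{JS})}(x), \qquad \gamma^{(\mathrm{JS})}(x) = -(d-2)\, x / \|x\|^2, \tag{D.11}$$

with

$$\nabla \cdot \gamma^{(\mathrm{JS})}(x) = \sum_{i=1}^{d} \left[ -\frac{d-2}{\|x\|^2} + 2\, \frac{d-2}{(\|x\|^2)^2}\, x_i^2 \right] = -\frac{(d-2)^2}{\|x\|^2}, \tag{D.12}$$

$$\|\gamma^{(\mathrm{JS})}(x)\|^2 = \frac{(d-2)^2}{\|x\|^2}. \tag{D.13}$$

Substituting $\nabla \cdot \gamma^{(\mathrm{JS})}(x)$ and $\|\gamma^{(\mathrm{JS})}(x)\|^2$ into Equation (D.10), we have

$$\mathrm{MSE}(w, \hat{w}^{(\mathrm{JS})}) - \mathrm{MSE}(w, \hat{w}^{(\mathrm{ML})}) = -\mathbb{E}_x\, \frac{(d-2)^2}{\|x\|^2}. \tag{D.14}$$

Thus, the James-Stein estimator dominates the maximum-likelihood estimator for $d > 2$.

| Prior name | $p(z^2)$ | $p(\beta)$ |
|---|---|---|
| Jeffreys | $p(z^2) \propto z^{-2}$ | $p(\beta) \propto \beta^{-1}$ |
| Inverse-gamma | $p(z^2) \propto z^{-2(s+1)} e^{-t/(2z^2)}$ | $p(\beta) \propto \beta^{s-1} e^{-t\beta/2}$ |
| Inverse-beta | $p(z^2) \propto (z^2)^{t-1} (1 + z^2)^{-(s+t)}$ | $p(\beta) \propto \beta^{-(s+2t+1)} (1 + \beta)^{-(s+t)}$ |

Table 1: Correspondence between $p(z^2)$ and $p(\beta)$.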
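The dominance of the James-Stein estimator over the maximum-likelihood estimator can be checked numerically. The sketch below estimates both mean-squared errors by Monte Carlo for unit-variance Gaussian observations; the dimension, sample size, seed, and function names are illustrative assumptions.

```python
import numpy as np

def james_stein(x):
    """Apply the estimator in (D.11) to each row of x: x - (d - 2) x / ||x||^2."""
    d = x.shape[-1]
    squared_norm = np.sum(x ** 2, axis=-1, keepdims=True)
    return x - (d - 2) * x / squared_norm

def monte_carlo_mse(estimator, w, num_samples=100_000, seed=0):
    """Estimate E_x ||estimator(x) - w||^2 for x ~ N(w, I_d)."""
    rng = np.random.default_rng(seed)
    x = w + rng.standard_normal((num_samples, w.shape[0]))
    return float(np.mean(np.sum((estimator(x) - w) ** 2, axis=-1)))

w_true = np.zeros(10)  # the gain from shrinkage is largest at the origin
mse_ml = monte_carlo_mse(lambda x: x, w_true)
mse_js = monte_carlo_mse(james_stein, w_true)
```

For d = 10 the maximum-likelihood risk is d = 10, while at w = 0 the James-Stein risk is d - (d - 2) = 2, since E[1/||x||²] = 1/(d - 2) for a chi-squared variable with d degrees of freedom. The gap shrinks as ||w|| grows but never changes sign.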

D.2.2 BAYES

The Bayes estimator for a prior choice $p(\beta)$ is given by (?):

$$\hat{w}^{(p(\beta))}(x) = x + \gamma^{(p(\beta))}(x), \qquad \gamma^{(p(\beta))}(x) = \nabla \log m(x), \tag{D.15}$$

where

$$m(x) = \int \mathcal{N}(x; 0, \beta^{-1} I_d)\, p(\beta)\, d\beta \tag{D.16}$$

$$= \int (2\pi)^{-d/2} \beta^{d/2} \exp\!\left(-\beta \|x\|^2 / 2\right) p(\beta)\, d\beta. \tag{D.17}$$

Substituting $\gamma^{(p(\beta))}(x)$ into Equation (D.10), we find that the condition for the Bayes estimator to be minimax is given by (George et al., 2006):

$$\mathrm{MSE}(w, \hat{w}^{(\mathrm{B})}) - \mathrm{MSE}(w, \hat{w}^{(\mathrm{ML})}) = \mathbb{E}_x \left[ -\|\nabla \log m(x)\|^2 + 2\, \frac{\nabla^2 m(x)}{m(x)} \right] \tag{D.18}$$

$$= \mathbb{E}_x \left[ 4\, \frac{\nabla^2 \sqrt{m(x)}}{\sqrt{m(x)}} \right] \le 0, \tag{D.19}$$

where $\nabla^2 = \sum_i \partial^2 / \partial x_i^2$ is the Laplace operator. This condition holds when $\sqrt{m(x)}$ is superharmonic (i.e., $\nabla^2 \sqrt{m(x)} \le 0$, $\forall x \in \mathbb{R}^d$), suggesting a recipe for constructing Bayes estimators that dominate the maximum-likelihood estimator, summarized in the following proposition.

Proposition D.1 (Extension of Theorem 1 in Fourdrinier et al., 1998). Let $p(\beta)$ be a positive function such that $f(\beta) = \beta p'(\beta) / p(\beta)$ can be decomposed as $f_1(\beta) + f_2(\beta)$, where $f_1$ is non-decreasing, $f_1 \le A$, $0 < f_2 \le B$,

Proof. This proof largely follows the proof of Theorem 1 in Fourdrinier et al. (1998). Note that Equation (D.18) holds if

$$\nabla^2 \sqrt{m(x)} = \frac{1}{2 \sqrt{m(x)}} \left[ \nabla^2 m(x) - \frac{1}{2} \frac{\|\nabla m(x)\|^2}{m(x)} \right] \le 0 \quad \forall x \in \mathbb{R}^d, \tag{D.20}$$

or equivalently

$$\frac{\nabla^2 m(x)}{\|\nabla m(x)\|} - \frac{1}{2} \frac{\|\nabla m(x)\|}{m(x)} \le 0 \quad \forall x \in \mathbb{R}^d. \tag{D.21}$$

Computing the derivatives, we get the condition

$$\frac{\int_0^1 \left( \beta \|x\|^2 - d \right) \beta^{d/2+1} e^{-\beta \|x\|^2/2}\, p(\beta)\, d\beta}{\|x\| \int_0^1 \beta^{d/2+1} e^{-\beta \|x\|^2/2}\, p(\beta)\, d\beta} - \frac{1}{2}\, \frac{\|x\| \int_0^1 \beta^{d/2+1} e^{-\beta \|x\|^2/2}\, p(\beta)\, d\beta}{\int_0^1 \beta^{d/2} e^{-\beta \|x\|^2/2}\, p(\beta)\, d\beta} \le 0. \tag{D.22}$$

Divide both sides by $\|x\|$ and rearrange to get

$$\frac{\int_0^1 \beta^{d/2+2} e^{-\beta \|x\|^2/2}\, p(\beta)\, d\beta}{\int_0^1 \beta^{d/2+1} e^{-\beta \|x\|^2/2}\, p(\beta)\, d\beta} - \frac{1}{2}\, \frac{\int_0^1 \beta^{d/2+1} e^{-\beta \|x\|^2/2}\, p(\beta)\, d\beta}{\int_0^1 \beta^{d/2} e^{-\beta \|x\|^2/2}\, p(\beta)\, d\beta} \le \frac{d}{\|x\|^2}. \tag{D.23}$$

Next, we integrate by parts the numerator of the first term on the left-hand side to get:

$$\int_0^1 \beta^{d/2+2} e^{-\beta \|x\|^2/2}\, p(\beta)\, d\beta = -\frac{2}{\|x\|^2} \Big[ \beta^{d/2+2} e^{-\beta \|x\|^2/2}\, p(\beta) \Big]_0^1 + \frac{d+4}{\|x\|^2} \int_0^1 \beta^{d/2+1} e^{-\beta \|x\|^2/2}\, p(\beta)\, d\beta + \frac{2}{\|x\|^2} \int_0^1 \beta^{d/2+2} e^{-\beta \|x\|^2/2}\, p'(\beta)\, d\beta, \tag{D.24}$$

where the integral in the middle term is the same as the denominator of the first term in Equation (D.23). Integrating by parts the second term gives the same expression as that of the first term, but with $d - 2$ in place of $d$ everywhere. Substituting these expressions back into Equation (D.23), collecting like terms, and dividing both sides by $2/\|x\|^2$ gives:

$$\frac{\int_0^1 \beta^{d/2+2} e^{-\beta \|x\|^2/2}\, p'(\beta)\, d\beta}{\int_0^1 \beta^{d/2+1} e^{-\beta \|x\|^2/2}\, p(\beta)\, d\beta} - \frac{1}{2}\, \frac{\int_0^1 \beta^{d/2+1} e^{-\beta \|x\|^2/2}\, p'(\beta)\, d\beta}{\int_0^1 \beta^{d/2} e^{-\beta \|x\|^2/2}\, p(\beta)\, d\beta} + \kappa_0 + \kappa_1 \le \frac{d}{2} - \frac{d+4}{2} + \frac{1}{2} \cdot \frac{d+2}{2} = \frac{d-6}{4}, \tag{D.25}$$

where

$$\kappa_1 = -\frac{\lim_{\beta \to 1} \beta^{d/2+2} e^{-\beta \|x\|^2/2}\, p(\beta)}{\int_0^1 \beta^{d/2+1} e^{-\beta \|x\|^2/2}\, p(\beta)\, d\beta} + \frac{1}{2}\, \frac{\lim_{\beta \to 1} \beta^{d/2+1} e^{-\beta \|x\|^2/2}\, p(\beta)}{\int_0^1 \beta^{d/2} e^{-\beta \|x\|^2/2}\, p(\beta)\, d\beta}, \tag{D.26}$$

$$\kappa_0 = \frac{\lim_{\beta \to 0} \beta^{d/2+2} e^{-\beta \|x\|^2/2}\, p(\beta)}{\int_0^1 \beta^{d/2+1} e^{-\beta \|x\|^2/2}\, p(\beta)\, d\beta} - \frac{1}{2}\, \frac{\lim_{\beta \to 0} \beta^{d/2+1} e^{-\beta \|x\|^2/2}\, p(\beta)}{\int_0^1 \beta^{d/2} e^{-\beta \|x\|^2/2}\, p(\beta)\, d\beta}. \tag{D.27}$$

Here, both $\kappa_0$ and $\kappa_1$ are nonpositive: (i) $\kappa_0$ is nonpositive because the first term vanishes due to the boundary conditions and the second term is nonpositive, and (ii) $\kappa_1$ is nonpositive because the limits of the numerators of the two terms are equal while the denominator of the second term is larger than that of the first. We can thus drop $\kappa_0$ and $\kappa_1$ to get the sufficient condition:

$$E_d(f) - \frac{1}{2} E_{d-2}(f) \le \frac{d-6}{4}, \tag{D.28}$$

where $E_d$ denotes expectation with respect to the density

$$g_d(\beta) = \frac{\beta^{d/2+1} e^{-\beta \|x\|^2/2}\, p(\beta)}{\int_0^1 \beta^{d/2+1} e^{-\beta \|x\|^2/2}\, p(\beta)\, d\beta} \tag{D.29}$$

and where $f(\beta) = \beta p'(\beta) / p(\beta)$. Because $g_d(\beta)$ is a family with monotone increasing likelihood ratio in $d$ and $f_1$ is nonincreasing and bounded by $A$, we have $E_d(f_1) - E_{d-2}(f_1)/2 \le A/2$. We have $E_d(f_2) - E_{d-2}(f_2)/2 \le B$ because $0 < f_2 \le B$. Taken together, we have

$$E_d(f) - E_{d-2}(f)/2 \le A/2 + B \le (d-6)/4. \tag{D.30}$$

When the inequality is strict (i.e., $A/2 + B < (d-6)/4$), then $\nabla^2 \sqrt{m(x)} < 0$ and the Bayes estimator dominates the maximum-likelihood estimator.

Checking whether a given $p(\beta)$ satisfies the conditions in Proposition D.1 may be tedious. The following corollary is useful for constructing a $p(\beta)$ that satisfies the conditions in Proposition D.1.

Corollary D.1 (Extension of Corollary 1 in Fourdrinier et al., 1998). Let $\psi$ be a continuous function that can be decomposed as $\psi_1 + \psi_2$, with $\psi_1 \le C$, $\psi_1$ non-decreasing, $0 < \psi_2 \le D$, and $C/2 + D \le 0$. Let

$$p(\beta) = \exp\!\left( \frac{1}{2} \int_{\beta_0}^{\beta} \frac{2\psi(u) + d - 6}{u}\, du \right) \quad \forall \beta_0 \ge 0, \tag{D.31}$$

such that $\lim_{\beta \to 0} \beta^{d/2+2} p(\beta) = 0$ and $\beta_0 \in (0, 1)$ is a constant. Then, $p(\beta)$ results in a minimax Bayes estimator, which dominates the maximum-likelihood estimator when $C/2 + D < 0$.

Proof. The proof is the same as that of Corollary 1 in Fourdrinier et al. (1998), with Proposition D.1 in place of Theorem 1 in Fourdrinier et al. (1998).

Using Corollary D.1, we now check that the three priors listed in Table 1 and referenced in Section 4.1 lead to Bayes estimators that dominate the maximum-likelihood estimator.

Jeffreys prior. Let $\psi_1(u) = a$ for $a \le 0$ and $\psi_2(u) = 0$. We have

$$p(\beta) = \exp\!\left( \frac{1}{2} \int_{\beta_0}^{\beta} \frac{2a + d - 6}{u}\, du \right) \propto \beta^{a + (d-6)/2}. \tag{D.32}$$

To satisfy $\lim_{\beta \to 0} \beta^{d/2+2} p(\beta) = 0$, we require $1 - d < a \le 0$. We recover the improper normal-Jeffreys prior $p(\beta) \propto \beta^{-1}$ for $a = 2 - d/2$. The corresponding Bayes estimator dominates the maximum-likelihood estimator when $d > 4$.

Inverse-gamma prior. Let $\psi_1(u) = a$ and $\psi_2(u) = b(1-u)/2$ for $a \le 0$ and $b \ge 0$. We have

$$p(\beta) = \exp\!\left( \int_{\beta_0}^{\beta} \frac{a + b(1-u)/2 + (d-6)/2}{u}\, du \right) \propto \beta^{a + (b+d-6)/2}\, e^{-b\beta/2}. \tag{D.33}$$

Setting $C = a$ and $D = b/2$, we get the following conditions: $a + b \le 0$ and $1 - d \le a + b/2$. Note that when these conditions are met with $s = a + (b + d - 4)/2$ and $t = b$, we recover the inverse-gamma prior in Table 1.

Inverse-beta (half-Cauchy) prior. Let $\psi_1(u) = a$ and $\psi_2(u) = b/(u+1)$ for $a \le 0$ and $b \ge 0$. We have

$$p(\beta) = \exp\!\left( \int_{\beta_0}^{\beta} \frac{a + b/(1+u) + (d-6)/2}{u}\, du \right) \propto \beta^{a + b + (d-6)/2}\, (1 + \beta)^{-b}. \tag{D.34}$$

Setting $C = a$ and $D = b$, we get the condition $a/2 + b \le 0$. To satisfy $\lim_{\beta \to 0} \beta^{d/2+2} p(\beta) = 0$, we require $1 - d < a + b \le 0$. Note that this corresponds to the inverse-beta prior in Table 1 with $t = a + (d-8)/2$ and $s = b - t$. To recover the half-Cauchy prior, we set $b = 1$ and $a = (5-d)/2$. All conditions in Corollary D.1 are satisfied when $d > 9$.
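The link between the corollary's condition $C/2 + D \le 0$ and the bound used in the proof of Proposition D.1 can be made explicit with a short calculation. The following is a sketch under the assumption that the bound required by the proposition is the one appearing in Equation (D.30), namely $A/2 + B \le (d-6)/4$; the identification of $f_1, f_2, A, B$ with $\psi_1, \psi_2, C, D$ is our reading of the construction rather than a statement taken from the source.

```latex
\begin{align*}
p(\beta) &= \exp\!\left(\frac{1}{2}\int_{\beta_0}^{\beta} \frac{2\psi(u) + d - 6}{u}\,du\right)
\;\Longrightarrow\;
f(\beta) = \frac{\beta\, p'(\beta)}{p(\beta)} = \psi(\beta) + \frac{d-6}{2}, \\
f_1 &= \psi_1 + \frac{d-6}{2} \le C + \frac{d-6}{2} =: A, \qquad
f_2 = \psi_2 \le D =: B, \\
\frac{A}{2} + B &= \frac{C}{2} + D + \frac{d-6}{4} \le \frac{d-6}{4}
\iff \frac{C}{2} + D \le 0 .
\end{align*}
```

As a quick check, the Jeffreys case has constant $\psi = a$, so $f = a + (d-6)/2$, which equals $-1$ when $a = 2 - d/2$, consistent with $p(\beta) \propto \beta^{-1}$.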

E LIMITATIONS

One weakness of the current theoretical analysis regarding the choice of sparsity-inducing priors is the assumption of Gaussian (and in particular, isotropic Gaussian) structure in the parameter space of optimal policies for clusters of tasks. In reality, there is likely a nontrivial degree of covariance among task parameterizations. Extending our analysis to more realistic forms of task structure is an important direction for future work. In a similar vein, the assumption that tasks are drawn iid from a fixed distribution is also unrealistic in naturalistic settings. It would be interesting to introduce some form of sequential structure (e.g., tasks are drawn from a Markov process). Another direction for future work is expanding beyond the "one control policy, one default policy" setup. For example, it would be useful to have one default policy per task cluster, together with the ability to reuse and select among an actively-maintained set of control policies across tasks and task clusters (for example, using successor feature-like representations; Barreto et al., 2020; Barth-Maron et al., 2018; Moskovitz et al., 2022b).

F OCO BACKGROUND

In online convex optimization (OCO), the learner observes a series of convex loss functions $\ell_k : \mathcal{N} \to \mathbb{R}$, $k = 1, \dots, K$, where $\mathcal{N} \subseteq \mathbb{R}^d$ is a convex set. In each round, the learner produces an output $x_k \in \mathcal{N}$ for which it will then incur a loss $\ell_k(x_k)$ (Orabona, 2019). At round $k$, the learner is usually assumed to have knowledge of $\ell_1, \dots, \ell_{k-1}$, but no other assumptions are made about the sequence of loss functions. The learner's goal is to minimize its average regret:

$$\bar{R}_K := \frac{1}{K} \sum_{k=1}^{K} \ell_k(x_k) - \min_{x \in \mathcal{N}} \frac{1}{K} \sum_{k=1}^{K} \ell_k(x). \tag{F.1}$$

One OCO algorithm which enjoys sublinear regret is follow the regularized leader (FTRL). In each round of FTRL, the learner selects the solution $x \in \mathcal{N}$ according to the following objective:

$$x_k = \operatorname*{argmin}_{x \in \mathcal{N}}\; \psi_k(x) + \sum_{i=1}^{k-1} \ell_i(x), \tag{F.2}$$

where $\psi_k : \mathcal{N} \to \mathbb{R}$ is a convex regularization function.
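The FTRL update can be sketched on a toy problem where the minimizer has a closed form. The example below assumes quadratic losses ℓ_k(x) = ½‖x − z_k‖² on all of R^d and a fixed regularizer ψ(x) = ‖x‖²/(2η); these choices, the function names, and the target sequence are illustrative assumptions and are unrelated to the KL regularizer used for MDL-C.

```python
import numpy as np

def ftrl_quadratic(targets, eta=1.0):
    """FTRL iterates for losses l_k(x) = 0.5 * ||x - z_k||^2 with psi(x) = ||x||^2 / (2 * eta).

    targets: array of shape [K, d] holding z_1, ..., z_K.
    Each iterate uses only the losses revealed in earlier rounds.
    """
    num_rounds, dim = targets.shape
    running_sum = np.zeros(dim)
    iterates = np.zeros((num_rounds, dim))
    for k in range(num_rounds):
        # Closed-form minimizer of psi(x) + sum of the k previously seen losses.
        iterates[k] = running_sum / (1.0 / eta + k)
        running_sum += targets[k]
    return iterates

def average_regret(targets, iterates):
    """Average regret (F.1) against the best fixed point in hindsight (the mean target)."""
    incurred = 0.5 * np.sum((iterates - targets) ** 2, axis=1)
    best_fixed = targets.mean(axis=0)
    comparator = 0.5 * np.sum((best_fixed - targets) ** 2, axis=1)
    return float(incurred.mean() - comparator.mean())

targets = np.ones((100, 3))  # a constant target sequence keeps the example deterministic
iterates = ftrl_quadratic(targets, eta=1.0)
regret_100 = average_regret(targets, iterates)
regret_10 = average_regret(targets[:10], iterates[:10])
```

On this sequence the iterates move from the regularizer's minimizer at the origin toward the targets, and the average regret shrinks as the number of rounds grows.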

G PROOFS OF PERFORMANCE BOUNDS AND ADDITIONAL THEORETICAL RESULTS

The following result is useful.

Lemma G.1. The function $\ell(\nu) = \mathbb{E}_{w \sim \nu} f(w)$ is $L$-Lipschitz as long as $f : \mathcal{W} \to \mathbb{R}$ lies within $[0, L]$ $\forall w \in \mathcal{W}$, where $\mathcal{W} \subseteq \mathbb{R}^d$ is a Hilbert space and $L < \infty$.

Proof. We have

$$|\ell(\nu_1) - \ell(\nu_2)| = |\mathbb{E}_{w \sim \nu_1} f(w) - \mathbb{E}_{w \sim \nu_2} f(w)| = \left| \int_{\mathcal{W}} (\nu_1(w) - \nu_2(w)) f(w)\, dw \right| = |\langle f, \nu_1 - \nu_2 \rangle_{\mathcal{W}}| \le \|f\|_{\mathcal{W}} \|\nu_1 - \nu_2\|_{\mathcal{W}} \le L \|\nu_1 - \nu_2\|_{\mathcal{W}},$$

where the first inequality is due to Cauchy-Schwarz and the second is by assumption on $f$.

Proposition G.1 (Default Policy Distribution Regret). Let tasks $M_k$ be independently drawn from $P_{\mathcal{M}}$ at every round, and let them each be associated with a deterministic optimal policy $\pi^\star_k : \mathcal{S} \to \mathcal{A}$. We make the following mild assumptions: i) $\pi_w(a^\star \mid s) \ge \epsilon > 0$ $\forall s \in \mathcal{S}$, where $a^\star = \pi^\star_k(s)$ and $\epsilon$ is a constant. ii) $\min_\nu \mathrm{KL}[\nu(\cdot), p(\cdot)] \to 0$ as $\mathrm{Var}[\nu] \to \infty$ for an appropriate choice of sparsity-inducing prior $p$. Then Algorithm 2 guarantees

$$\mathbb{E}_{P_{\mathcal{M}}} [\ell_K(\nu_K) - \ell_K(\bar{\nu}_K)] \le \left( \mathbb{E}_{P_{\mathcal{M}}} \mathrm{KL}[\bar{\nu}_K, p] + 1 \right) \frac{\log(1/\epsilon)}{\sqrt{K}}, \tag{G.1}$$

where $\bar{\nu}_K = \operatorname*{argmin}_{\nu \in \mathcal{N}} \sum_{k=1}^{K} \ell_k(\nu)$.

Proof. The first part of the proof sets up an application of Orabona (2019), Corollary 7.9. To establish grounds for its application, we first note the standard result that the regularization functional $\psi(\nu) = \mathrm{KL}[\nu(w), p(w)]$ for probability measures $\nu, p \in \mathcal{P}(\mathcal{W})$ is 1-strongly convex in $\nu$ (Melbourne, 2020). Finally, assumption (i) implies that the KL between the default policy and the optimal policy is upper-bounded: $\mathrm{KL}[\pi^\star_k, \pi_w] \le \log 1/\epsilon$. Then by Lemma G.1, $\ell_k(\nu)$ is $L$-Lipschitz with respect to the TV distance, where $L = \log 1/\epsilon$. Note also that under a Gaussian parameterization for $\nu$, the distribution space $\mathcal{N}$ is the Gaussian parameter space $\mathcal{N} = \{(\mu, \Sigma) : \mu \in \mathbb{R}^d, \Sigma \in \mathbb{R}^{d \times d}, \Sigma \succeq 0\}$, which is convex (Boyd and Vandenberghe, 2004). Then Orabona (2019), Corollary 7.9 gives

$$\frac{1}{K} \sum_{k=1}^{K} \ell_k(\nu_k) - \frac{1}{K} \sum_{k=1}^{K} \ell_k(\bar{\nu}_K) \le \left( \frac{1}{\alpha} \mathrm{KL}[\bar{\nu}_K, p] + \alpha \right) \frac{L}{\sqrt{K}}, \tag{G.2}$$

where $\bar{\nu}_K = \operatorname*{argmin}_{\nu} \sum_{k=1}^{K} \ell_k(\nu)$. The constant $\alpha \in \mathbb{R}^+$ is a hyperparameter, so we are free to set it to 1 (Orabona, 2019).
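Finally, we observe that $\mathbb{E}_{P_{\mathcal{M}_i}} \frac{1}{K} \sum_{k=1}^{K} \ell_k(\nu_k) = \mathbb{E}_{P_{\mathcal{M}_i}} \ell_K(\nu_K)$ and take the expectation with respect to $P_{\mathcal{M}_i}$ of both sides of Eq. (G.2) to get the desired result:

$$\mathbb{E}_{P_{\mathcal{M}_i}} [\ell_K(\nu_K) - \ell_K(\bar{\nu}_K)] \le \left( \mathbb{E}_{P_{\mathcal{M}_i}} \mathrm{KL}[\bar{\nu}_K, p] + 1 \right) \frac{L}{\sqrt{K}}. \tag{G.3}$$

Proposition 4.2 (Control Policy Sample Complexity). Under the setting described in Proposition G.1, denote by $T_k$ the number of iterations to reach $\epsilon$-error for $M_k$ in the sense that $\min_{t \le T_k} \{V^{\pi^\star_k} - V^{(t)}\} \le \epsilon$ whenever $t > T_k$. Further, denote the upper-bound in Eq. (G.1) by $G(K)$. In a finite MDP, from any initial $\theta^{(0)}$, and following gradient ascent, $\mathbb{E}_{M_k \sim P_{\mathcal{M}}}[T_k]$ satisfies:

$$\mathbb{E}_{M_k \sim P_{\mathcal{M}_i}} [T_k] \ge \frac{80 |\mathcal{A}|^2 |\mathcal{S}|^2}{\epsilon^2 (1-\gamma)^6}\; \mathbb{E}_{M_k \sim P_{\mathcal{M}_i},\, s \sim \mathrm{Unif}_{\mathcal{S}}} \left[ \kappa^{\alpha_k}_{\mathcal{A}}(s) \left\| \frac{d^{\pi^\star_k}_\rho}{\mu} \right\|_\infty^2 \right],$$

where $\alpha_k(s) := d_{\mathrm{TV}}(\pi^\star_k(\cdot \mid s), \pi_0(\cdot \mid s)) \le G(K)$, $\kappa^{\alpha_k}_{\mathcal{A}}(s) = \frac{2|\mathcal{A}|(1 - \alpha(s))}{2|\mathcal{A}|(1 - \alpha(s)) - 1}$, and $\mu$ is a measure over $\mathcal{S}$ such that $\mu(s) > 0$ $\forall s \in \mathcal{S}$.

Note: In the above, there is a small error. It should be $\alpha_k(s) := \mathbb{E}_{w \sim \nu}\, d_{\mathrm{TV}}(\pi^\star_k(\cdot \mid s), \pi_w(\cdot \mid s)) \le \sqrt{\tfrac{1}{2} G(K)}$.

$d^\pi_\rho$ refers to the discounted state-occupancy distribution under $\pi$ with initial state distribution $\rho$:

$$d^\pi_\rho(s) = \mathbb{E}_{s_0 \sim \rho}\, (1 - \gamma) \sum_{h \ge 0} \gamma^h P^\pi(s_h = s \mid s_0). \tag{G.4}$$

Division between probability mass functions is assumed to be element-wise.

Proof. Without loss of generality, we prove the bound for a fixed state $s \in \mathcal{S}$, noting that the bound applies independently of our choice of $s$. We use the shorthand $\mathrm{KL}[\pi(\cdot \mid s), \pi_w(\cdot \mid s)] \to \mathrm{KL}[\pi, \pi_w]$ for brevity. We start by multiplying both sides of the bound from Proposition G.1 by 1/2 and rearranging:

$$\frac{1}{2} \mathbb{E}_{P_{\mathcal{M}_i}} \ell_K(\bar{\nu}_K) + \frac{L}{2\sqrt{K}} \left( \mathbb{E}_{P_{\mathcal{M}_i}} \mathrm{KL}[\bar{\nu}_K, p] + 1 \right) \ge \mathbb{E}_{P_{\mathcal{M}_i}} \left[ \frac{1}{2} \ell_K(\nu_K) \right] = \mathbb{E}_{P_{\mathcal{M}_i}} \mathbb{E}_{\nu_K} \left[ \frac{1}{2} \mathrm{KL}[\pi^\star_K, \pi_w] \right]$$

$$\overset{(i)}{=} \mathbb{E}_{P_{\mathcal{M}_i}} \left[ \mathrm{Var}_{\nu_K}\!\left[ \sqrt{\tfrac{1}{2} \mathrm{KL}[\pi^\star_K, \pi_w]} \right] + \mathbb{E}_{\nu_K}\!\left[ \sqrt{\tfrac{1}{2} \mathrm{KL}[\pi^\star_K, \pi_w]} \right]^2 \right] \overset{(ii)}{\ge} \mathbb{E}_{P_{\mathcal{M}_i}} \left[ \mathbb{E}_{\nu_K}\!\left[ \sqrt{\tfrac{1}{2} \mathrm{KL}[\pi^\star_K, \pi_w]} \right]^2 \right], \tag{G.5}$$

where (i) follows from the definition of the variance, and (ii) follows from its non-negativity.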
We can rearrange to get

$$\frac{L}{2\sqrt{K}} \left( \mathbb{E}_{P_{\mathcal{M}_i}} \mathrm{KL}[\bar{\nu}_K, p] + 1 \right) \ge \mathbb{E}_{P_{\mathcal{M}_i}} \left[ \mathbb{E}_{\nu_K}\!\left[ \sqrt{\tfrac{1}{2} \mathrm{KL}[\pi^\star_K, \pi_w]} \right]^2 \right] \overset{(ii)}{\ge} \mathbb{E}_{P_{\mathcal{M}_i}} \left[ \mathbb{E}_{\nu_K} [d_{\mathrm{TV}}(\pi^\star_K, \pi_w)]^2 \right], \tag{G.6}$$

where (ii) follows from Pinsker's inequality. Letting $\alpha_K(s) = \sqrt{\tfrac{1}{2} G(K)}$ and applying Moskovitz et al. (2022a), Lemma 5.2 gives the desired result.

This upper-bound is significant, as it shows that, all else being equal, a high complexity barycenter default policy distribution $\bar{\nu}_K$ (where complexity is measured by $\mathrm{KL}[\bar{\nu}_K, p]$) leads to a slower convergence rate in the control policy.

5: Optimize control policy: $\pi^\star_k = \operatorname*{argmax}_{\pi \in \Pi}\; V^\pi_{M_k} - \lambda\, \mathbb{E}_{s \sim d^\pi} \mathbb{E}_{w \sim \nu_k} \mathrm{KL}[\pi_w(a \mid s), \pi(a \mid s)]$ (G.7)
6: Update default policy distribution: $\nu_{k+1} = \operatorname*{argmin}_{\nu \in \mathcal{N}}\; \mathrm{KL}[\nu, p] + \mathbb{E}_{w \sim \nu} \mathrm{KL}[\pi^\star_k, \pi_w]$ (G.8)
7: end for

G.1 MDL-C WITH PERSISTENT REPLAY

Rather than rely on iid task draws to yield a bound on the expected regret under the task distribution, a more general formulation of MDL-C for sequential task learning is described in Algorithm 1. In this setting, the dataset of optimal agent-environment interactions is explicitly constructed by way of a replay buffer which persists across tasks and is used to train the default policy distribution. This is much more directly in line with standard FTRL, and we can obtain the standard FTRL bound.

Proposition G.2 (Persistent Replay FTRL Regret; Orabona, 2019, Corollary 7.9). Let tasks $M_k$ be independently drawn from $P_{\mathcal{M}}$ at every round, and let them each be associated with a deterministic optimal policy $\pi^\star_k : \mathcal{S} \to \mathcal{A}$. We make the following mild assumptions: i) $\pi_w(a^\star \mid s) \ge \epsilon > 0$ $\forall s \in \mathcal{S}$, where $a^\star = \pi^\star_k(s)$ and $\epsilon$ is a constant. ii) $\min_\nu \mathrm{KL}[\nu(\cdot), p(\cdot)] = 0$ asymptotically as $\mathrm{Var}[\nu] \to \infty$. Then with $\eta_{k-1} = L\sqrt{k}$, Algorithm 1 guarantees

$$\frac{1}{K} \sum_{k=1}^{K} \ell_k(\nu_k) - \frac{1}{K} \sum_{k=1}^{K} \ell_k(\bar{\nu}_K) \le \left( \mathrm{KL}[\bar{\nu}_K, p] + 1 \right) \frac{L}{\sqrt{K}}, \tag{G.9}$$

where $\bar{\nu}_K = \operatorname*{argmin}_{\nu \in \mathcal{N}} \sum_{k=1}^{K} \ell_k(\nu)$.

Proof. This follows directly from the arguments made in the proof of Proposition G.1.
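As before, this result can be used to obtain a performance bound for the control policy.

Proposition G.3 (Control Policy Sample Complexity for MDL-C with Persistent Replay). Under the setting described in Proposition G.2, denote by $T_k$ the number of iterations to reach $\epsilon$-error for $M_k$ in the sense that $\min_{t \le T_k} \{V^{\pi^\star_k} - V^{(t)}\} \le \epsilon$. In a finite MDP, from any initial $\theta^{(0)}$, and following gradient ascent, $\mathbb{E}_{M_k \sim P_{\mathcal{M}}}[T_k]$ satisfies:

$$\mathbb{E}_{M_k \sim P_{\mathcal{M}_i}} [T_k] \ge \frac{80 |\mathcal{A}|^2 |\mathcal{S}|^2}{\epsilon^2 (1-\gamma)^6}\; \mathbb{E}_{M_k \sim P_{\mathcal{M}_i},\, s \sim \mathrm{Unif}_{\mathcal{S}}} \left[ \kappa^{\alpha_k}_{\mathcal{A}}(s) \left\| \frac{d^{\pi^\star_k}_\rho}{\mu} \right\|_\infty^2 \right],$$

where $\alpha_k(s) := \mathbb{E}_{w \sim \nu}\, d_{\mathrm{TV}}(\pi^\star_k(\cdot \mid s), \pi_w(\cdot \mid s)) \le \sqrt{\tfrac{1}{2} G(K)}$, $G(K) := \ell_K(\bar{\nu}_K) + \sum_{k=1}^{K-1} (\ell_k(\bar{\nu}_K) - \ell_k(\nu_k)) + (\mathrm{KL}[\bar{\nu}_K, p] + 1) L \sqrt{K}$, $\kappa^{\alpha_k}_{\mathcal{A}}(s) = \frac{2|\mathcal{A}|(1 - \alpha(s))}{2|\mathcal{A}|(1 - \alpha(s)) - 1}$, and $\mu$ is a probability measure over $\mathcal{S}$ such that $\mu(s) > 0$ $\forall s \in \mathcal{S}$.

Proof. Without loss of generality, we select a single state $s \in \mathcal{S}$, observing that the same analysis applies $\forall s \in \mathcal{S}$.

Collect trajectory $\tau = (s_0, a_0, r_0, \dots, \tilde{s}_{H-1}, a_{H-1}, r_{H-1}) \sim P^{\pi_\theta}(\cdot)$, store experience $D_k \leftarrow D_{k-1} \cup \{(s_h, a_h, r_h, \tilde{s}_{h+1})\}_{h=0}^{H-1}$ (G.13), where $\tilde{s}_h := (s_h, g_k)$.
8: if $R(\tau) \ge R^\star$ (i.e., $\pi_\theta \approx \pi^\star_k$) then
9: Add to default policy replay: $D^\phi_k \leftarrow D^\phi_{k-1} \cup \{(s_h, \pi_\theta(\cdot \mid s_h))\}_{h=0}^{H-1}$ (G.14)
Note that, e.g., when $\pi_\theta(a \mid s) = \mathcal{N}(a; \mu(s, g_k), \Sigma(s, g_k))$ is a Gaussian policy, $\mu(s_h, g_k)$, $\Sigma(s_h, g_k)$ are added to the replay with $\tilde{s}_h$.
Update Q-function(s) as in Haarnoja et al. (2018).
13: Update control policy: $\theta \leftarrow \operatorname*{argmax}_{\theta'}\; \mathbb{E}_{\mathrm{Unif}\, D} \big[ V^{\pi_{\theta'}} - \alpha\, \mathbb{E}_{w \sim \nu_\phi} \mathrm{KL}[\pi_{\theta'}(\cdot \mid s_h), \pi_w(\cdot \mid s_h)] \big]$ (G.15)
14: Update default policy distribution: $\phi \leftarrow \operatorname*{argmin}_{\phi'}\; \mathrm{KL}[\nu_{\phi'}(\cdot), p(\cdot)] + \mathbb{E}_{\mathrm{Unif}\, D^\phi_k} \mathbb{E}_{w \sim \nu} \mathrm{KL}[\pi_\theta(\cdot \mid s_h), \pi_w(\cdot \mid s_h)]$ (G.16)
15: end while

We leave more in-depth theoretical analysis of this setting to future work, but note that as the task experience is interleaved, $\bar{\pi}_w = \mathbb{E}_\nu \pi_w$ will converge to the prior-weighted KL barycenter.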
If, in expectation, this distribution is a TV distance of less than 1 -1/|A| from π ⋆ k , then the control policy will converge faster than for log-barrier regularization (Moskovitz et al., 2022a) .
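Two inequalities carry much of the argument in this section: the bound KL[π*, π_w] ≤ log(1/ϵ) implied by assumption (i) for a deterministic optimal policy, and Pinsker's inequality, which converts a KL bound into a TV bound. The sketch below checks both numerically on random discrete distributions; it is a sanity check rather than part of any proof, and the function names, seed, and floor value are arbitrary.

```python
import numpy as np

def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * float(np.sum(np.abs(p - q)))

def kl_divergence(p, q):
    """KL[p, q], skipping entries where p is zero (q is assumed strictly positive)."""
    support = p > 0
    return float(np.sum(p[support] * (np.log(p[support]) - np.log(q[support]))))

def kl_from_deterministic(optimal_action, default_policy):
    """KL[pi*, pi_w] when pi* puts all its mass on one action: -log pi_w(a*)."""
    return -float(np.log(default_policy[optimal_action]))

rng = np.random.default_rng(0)
epsilon = 0.05
pinsker_holds = True
floor_bound_holds = True
for _ in range(500):
    p = rng.dirichlet(np.ones(4))
    q = rng.dirichlet(np.ones(4))
    q = (1.0 - 4 * epsilon) * q + epsilon  # enforce pi_w(a | s) >= epsilon for every action
    pinsker_holds &= total_variation(p, q) <= np.sqrt(0.5 * kl_divergence(p, q)) + 1e-12
    floor_bound_holds &= kl_from_deterministic(0, q) <= np.log(1.0 / epsilon) + 1e-12
```

Both flags remain true over the sampled pairs, which illustrates why a floor on the default policy's probability of the optimal action is enough to keep the per-task loss bounded.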

H ADDITIONAL EXPERIMENTAL DETAILS

Below, we describe experimental details for the two environment domains in the paper.

H.1 FOURROOMS

As input, the agent receives a 16-dimensional vector containing the index of the current state, a flattened 3 × 3 local view of its surrounding environment, its previous action encoded as a 4-dimensional one-hot vector, the reward on the previous timestep, and a feature indicating the goal state index. The base learning algorithm in all cases is advantage actor-critic (A2C; Mnih et al., 2016).

Environment. The FOURROOMS experiments are set in an 11 × 11 gridworld. The actions available to the agent are the four cardinal directions, up, down, left, and right, and transitions are deterministic. In both FOURROOMS experiments, the agent can begin an episode anywhere in the environment (sampled uniformly at random), and a single location with reward r = 50 is sampled at the beginning of each episode from a set of possible goal states which varies depending on the experiment and the current phase. A reward of r = -1 is given if the agent contacts the walls. All other states give a reward of zero. Episodes end when either a time limit (a maximum number of timesteps) is reached or the agent reaches the goal state. Observations were 16-dimensional vectors consisting of the current state index (1d), a flattened 3 × 3 local window surrounding the agent (includes walls, but not goals), a one-hot encoding of the action on the previous timestep (4d), the reward on the previous timestep (1d), and the index of the current goal (1d).

In the "goal generalization" experiment, goals may be sampled anywhere in either the top left or bottom right rooms in the first phase and either the top right or bottom left rooms in the second phase. Each phase comprises 20,000 episodes, and in each phase, the agent may start each episode anywhere in the environment. In the first phase, the agent was allowed 100 steps per episode, and in the second phase 25 steps. In the "contingency change" experiment, the possible reward states in each phase were the top left state and bottom right state.
In the second phase of training, however, the semantics of the goal feature change from indicating the location of the reward to the location where it is absent. Each phase consisted of 8,000 episodes with maximum length 100 timesteps. Results are averaged over 10 random seeds.

Agents. All agents were trained on-policy with advantage actor-critic (Mnih et al., 2016). The architecture was a single-layer LSTM (Hochreiter and Schmidhuber, 1997) with 128 hidden units. To produce the feature sensitivity plots in Fig. 5.1c, a gating function was added to the input layer of the network:

$$x_h = \sigma(b\kappa) \odot o_h. \tag{H.1}$$

The baseline agent objective functions are as follows:

$$\begin{aligned}
J_{\mathrm{PO}}(\theta) &= V^{\pi_\theta} + \alpha\, \mathbb{E}_{s \sim d^{\pi_\theta}} H[\pi_\theta(\cdot|s)] \\
J_{\mathrm{RPO}}(\theta, \phi) &= V^{\pi_\theta} - \alpha\, \mathbb{E}_{s \sim d^{\pi_\theta}} \mathrm{KL}[\pi_\theta(\cdot|s), \pi_\phi(\cdot|s)] \\
J_{\mathrm{VDO\text{-}PO}}(\theta) &= \mathbb{E}_{w \sim \nu_\theta} V^{\pi_w} - \beta\, \mathrm{KL}[\nu_\theta(\cdot), p(\cdot)] \\
J_{\mathrm{ManualIA}}(\theta, \phi) &= V^{\pi_\theta} - \alpha\, \mathbb{E}_{s \sim d^{\pi_\theta}} \mathrm{KL}[\pi_\theta(\cdot|s), \pi_\phi(\cdot|s_d)]; \quad s_d = s \setminus g.
\end{aligned} \tag{H.2}$$

In all cases $\alpha = 0.1$, $\beta = 1.0$, and learning rates for all agents were set to 0.0007. Agents were optimized with Adam (Kingma and Ba, 2014). Agent control policies were reset after phase 1.
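The 16-dimensional FOURROOMS observation described in this section can be assembled as in the sketch below. This is not the authors' implementation: the function name `build_observation`, the feature ordering (taken from the order in which the text lists the features), and the convention of marking walls with 1 in the local window are all assumptions made for illustration.

```python
import numpy as np

NUM_ACTIONS = 4  # up, down, left, right

def build_observation(state_idx, local_view, prev_action, prev_reward, goal_idx):
    """Assemble a 16-d observation: 1 + 9 + 4 + 1 + 1 features."""
    local_view = np.asarray(local_view, dtype=np.float32)
    if local_view.shape != (3, 3):
        raise ValueError("local_view must be a 3 x 3 window")
    # One-hot encoding of the action taken on the previous timestep.
    action_one_hot = np.zeros(NUM_ACTIONS, dtype=np.float32)
    action_one_hot[prev_action] = 1.0
    return np.concatenate([
        np.array([state_idx], dtype=np.float32),              # current state index
        local_view.ravel(),                                   # flattened 3 x 3 window
        action_one_hot,                                       # previous action
        np.array([prev_reward, goal_idx], dtype=np.float32),  # previous reward, goal index
    ])

# Hypothetical example: a wall directly above the agent (walls marked with 1).
window = [[1, 1, 1], [0, 0, 0], [0, 0, 0]]
obs = build_observation(state_idx=12, local_view=window,
                        prev_action=3, prev_reward=-1.0, goal_idx=120)
print(obs.shape)  # (16,)
```

Keeping the goal index as the final feature makes it easy to see which input a learned gate would need to close in order to ignore the goal.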

H.2 DEEPMIND CONTROL SUITE

Environments/Task Settings. We use the walker and cartpole environments from the DeepMind Control Suite (Tassa et al., 2018). We consider two multitask settings: sequential tasks and parallel tasks. All results are averaged over 10 random seeds, and agents are trained for 500k timesteps. In the sequential task setting, tasks are sampled one at a time without replacement and solved by the agent. The control policy is reset after each task, but the default policy is preserved. For methods with a default policy that can be preserved, performance on task k is averaged over runs with all possible previous tasks in all possible orders. For example, when walker-run is the third task, performance is averaged over previous tasks being stand then walk and walk then stand. In the parallel task setting, a different task is sampled randomly at the start of each episode, and a one-hot task ID vector is appended to the state observation. Learning was done directly from states, not from pixels.

Agents. The base agent in all cases was SAC with automatic temperature tuning, following Haarnoja et al. (2018). Standard SAC seeks to optimize the maximum-entropy RL objective:

$$J_{\text{max-ent}}(\pi) = V^\pi + \alpha\, \mathbb{E}_{s \sim d^\pi} H[\pi(\cdot|s)] = V^\pi - \alpha\, \mathbb{E}_{s \sim d^\pi} \mathrm{KL}[\pi(\cdot|s), \mathrm{Unif}_A] + \text{const}, \tag{H.3}$$

where the additive constant does not depend on $\pi$. Effectively, then, SAC uses a uniform default policy. The RPO algorithms with learned default policies replace $\mathrm{KL}[\pi(\cdot|s), \mathrm{Unif}_A]$ with $\mathrm{KL}[\pi(\cdot|s), \pi_w(\cdot|s)]$ (or $\mathrm{KL}[\pi_w(\cdot|s), \pi(\cdot|s)]$). As MDL-C, RPO, and TVPO require that the control policy approximate the optimal policy before being used to generate a learning signal for the default policy, in the sequential setting, the default policy is updated only after halfway through training. Because variational dropout can cause the network to over-sparsify (and fail to learn adequately) if turned on too early in training, we follow the strategy of Molchanov et al. (2017), linearly ramping up a coefficient $\beta$ on the variational dropout KL from 0 to 1 between 70% and 80% of the way through training. Note that MANUALIA is not applicable to the sequential task setting, as there is no explicit goal feature.

In the sequential task setting, we took inspiration from Haarnoja et al. (2018) and Abdolmaleki et al. (2018) and reframed the soft KL penalty for methods with learned default policies as a constraint, i.e.,

$$\max_\pi V^\pi - \alpha\, \mathbb{E}\, \mathrm{KL}[\pi_w, \pi] \longrightarrow \max_\pi V^\pi \ \ \text{s.t.}\ \ \mathbb{E}\, \mathrm{KL}[\pi_w, \pi] \le \varepsilon,$$

where $\varepsilon > 0$ was a target KL divergence. Under this formulation, $\alpha$ is treated as a dual variable via Lagrangian relaxation and optimized with the following objective:

$$\max_{\alpha \ge 0} J(\alpha) := \mathbb{E}\, \alpha\, \mathrm{KL}[\pi_w, \pi] - \alpha \varepsilon.$$

In the parallel task setting, we convert the base SAC agent into the "multitask" variant used by Yu et al. (2019), in which the agent learns a vector of temperature parameters $[\alpha_1, \ldots, \alpha_K]$, one for each task. In this setting, we found it more effective to set $\alpha$ to a constant value. Test performance was computed by averaging performance across all $K$ tasks presented to the agent. The baseline agent objectives are as in Eq. (H.2), and the Distral objective is given by

$$J_{\mathrm{Distral}}(\theta, \phi) = V^{\pi_\theta} - \alpha\, \mathbb{E}_{s \sim d^{\pi_\theta}} \mathrm{KL}[\pi_\theta(\cdot|s), \pi_\phi(\cdot|s)] + \lambda\, \mathbb{E}_{s \sim d^{\pi_\theta}} H[\pi_\theta(\cdot|s)].$$

TVPO is trained in the same way as RPO, with the difference being that the default policy objective is to predict the control policy action, rather than a distillation objective. Hyperparameters shared by all agents are listed in Table 2.

As a note on performance, Distral performs very strongly in the parallel task setting, with overall performance slightly worse than MDL-C on Walker and virtually the same on Cartpole. However, the gap is significantly greater in the sequential setting, particularly on Walker.
We hypothesize that this is due to the fact that by regularizing the control policy to be close to the default policy, but also encouraging the control policy to have high entropy (rather than regularizing the default policy as MDL-C does), Distral can in effect provide a conflicting objective to the control policy when strong structure is present. In particular, on Walker, the optimal policies for each task have significant overlap, and so by encouraging high entropy in the control policy even on the third task, Distral negates the effect of an informative default policy. As evidence, both RPO and TVPO, which only regularize the control policy to be close to the default policy, perform significantly more strongly on Walker in the sequential setting.
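Two of the scheduling details described in this section, the linear ramp of the variational dropout coefficient β and the dual update of α under the KL constraint, can be sketched as follows. This is an illustrative sketch only: the function names, the learning rate, and the choice to update α directly (rather than, say, its logarithm) are assumptions and are not taken from the paper.

```python
def vdo_kl_coefficient(step, total_steps, start_frac=0.7, end_frac=0.8):
    """Linear ramp of beta from 0 to 1 between start_frac and end_frac of training."""
    progress = step / total_steps
    ramp = (progress - start_frac) / (end_frac - start_frac)
    return min(1.0, max(0.0, ramp))

def dual_ascent_step(alpha, expected_kl, target_kl, lr=1e-3):
    """One projected gradient-ascent step on J(alpha) = alpha * (E[KL] - eps).

    The gradient of J with respect to alpha is E[KL] - eps, so alpha grows
    while the constraint is violated and shrinks (down to 0) once it is met.
    """
    grad = expected_kl - target_kl
    return max(0.0, alpha + lr * grad)

total = 500_000  # training length stated in the text
for step in (0, 350_000, 375_000, 400_000, 500_000):
    print(step, vdo_kl_coefficient(step, total))

# Constraint violated (KL above target): alpha increases.
print(dual_ascent_step(alpha=0.1, expected_kl=0.5, target_kl=0.2, lr=0.1))
```

The projection `max(0.0, ...)` enforces the constraint α ≥ 0 from the dual objective; a log-parameterized α would be an alternative way to keep it non-negative.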


To test the effect of information asymmetry alone on performance, we trained a variant of MANUALIA in which we withheld the input features that MDL-C learned to gate out (Fig. 5.2) in addition to the task ID feature. We call this modified method MANUALIA+. Average performance is plotted above over 10 seeds, with the shading representing one unit of standard error. We can see that while MANUALIA+ narrowly outperforms MANUALIA, the performance gains of MDL-C can't solely be ascribed to effective information asymmetry.



The invariance theorem (Kolmogorov, 1965) ensures that, given a sufficiently long sequence, Kolmogorov complexity is invariant to the choice of general-purpose language.

In practice, MPO parametrizes $\pi_\theta$ implicitly with a parameterized action-value function and the default policy.



Figure 4.1: (A) Illustration of a generative model of optimal policy parameters. $\hat{w}_1 = (1 - \beta) w_{11}$ shrinks towards the origin, growing closer to $w_1$ than $w_{11}$. (B) Sparsity-inducing priors over $\beta$.

Figure 5.1: MDL-C rapidly adapts to new goal locations (top row) and rule changes (bottom row). All curves represent averages taken over 10 random seeds, with the shading indicating standard error.

Figure 5.2: MDL-C improves both sequential and parallel learning in continuous control tasks. All curves represent averages taken over 8 random seeds, with the shading indicating standard error. In (b), insets show the improvement of MDL-C as k increases, and in (d), solid curves represent averages over each feature within a category.

and $A/2 + B \le (d - 6)/4$. Assume also that $\lim_{\beta \to 0} \beta^{d/2+2} p(\beta) = 0$. Then, $\nabla^2 m(x) \le 0$ and the Bayes estimator is minimax. If $A/2 + B < (d - 6)/4$, then the Bayes estimator dominates $\hat{w}^{(\mathrm{ML})}(x)$.

Published as a conference paper at ICLR 2023

Algorithm 2: Idealized MDL-C for Multitask Learning
1: require: task distribution $P_M$, policy class $\Pi$, coefficients $\{\eta_k\}$
2: initialize: default policy distribution $\nu_1 \in N$
3: for tasks $k = 1, 2, \ldots, K$ do
4: Sample a task $M_k \sim P_M(\cdot)$
5:

For simplicity, we denote $\pi(\cdot|s)$ by $\pi$. We start by multiplying each side of Eq. (G.2)

Algorithm 3: Off-Policy MDL-C for Parallel Multitask Learning
1: require: task distribution $P_M$, policy class $\Pi$
2: initialize: default policy distribution $\nu_1 \in N$, control replay $D_0 \leftarrow \emptyset$, default replay $D^\phi_0 \leftarrow \emptyset$
3: initialize control policy parameters $\theta$ and default policy distribution parameters $\phi$.
4: while not done do
5: for episodes $k = 1, 2, \ldots, K$ do
6: Sample a task $M_k \sim P_M(\cdot)$ with goal ID feature $g_k$
7:

In Eq. (H.1), $o_h$ is the current observation, $\sigma(\cdot)$ is the sigmoid function, $b \in \mathbb{R}$ is a constant (set to $b = 150$ in all experiments), $x_h \in \mathbb{R}^d$ is the filter layer output, and $\kappa \in \mathbb{R}^d$ is a parameter trained using backpropagation. In this way, as $\kappa_d \to \infty$, $\sigma(b\kappa_d) \to 1$, allowing input feature $o_{h,d}$ through the gate. As $\kappa_d \to -\infty$, the gate is shut. The plots in Fig. 5.1c track $\sigma(b\kappa_d)$ over the course of training.
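The gating layer of Eq. (H.1) is simple enough to sketch directly. The snippet below is a minimal NumPy illustration rather than the authors' code; the function names and the example values of $\kappa$ are hypothetical, while $b = 150$ follows the text.

```python
import numpy as np

B = 150.0  # gate sharpness constant b, as stated in the text

def sigmoid(z):
    """Numerically stable logistic function via the tanh identity."""
    return 0.5 * (1.0 + np.tanh(0.5 * np.asarray(z, dtype=float)))

def gate_input(obs, kappa, b=B):
    """Apply x_h = sigmoid(b * kappa) * o_h elementwise and return (x_h, gate)."""
    gate = sigmoid(b * np.asarray(kappa, dtype=float))
    return gate * np.asarray(obs, dtype=float), gate

# Hypothetical gate parameters: open, half-open, and closed.
obs = np.array([3.0, 1.0, -2.0])
kappa = np.array([0.1, 0.0, -0.1])
x, gate = gate_input(obs, kappa)
print(gate)  # close to 1, exactly 0.5, close to 0
print(x)
```

Because $b$ is large, even small positive or negative values of $\kappa_d$ push the gate close to fully open or fully shut, which is what makes the tracked quantity $\sigma(b\kappa_d)$ easy to read as a per-feature on/off indicator.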

            01e5 ± 2.01e3      1.46e5 ± 5.11e3
ManualIA    9.90e4 ± 1.87e3    1.50e5 ± 3.86e3
MDL-C       9.47e4 ± 8.36e2    1.31e5 ± 1.35e3

Table 5: DM Control Suite, Parallel: Average cumulative regret across 8 random seeds in the parallel task setting. ± values are standard error.

Figure I.1: Heatmaps of $\mathrm{KL}[\pi_\theta(\cdot|s), \pi_w(\cdot|s)]\ \forall s \in S$ for RPO and $\mathrm{KL}[\pi_\theta(\cdot|s), \pi_{\bar{w}}(\cdot|s)]\ \forall s \in S$, where $\bar{w} = \mathbb{E}_\nu w$ for MDL-C, averaged over all possible goal states. The RPO default policy nearly perfectly matches the control policy, while the MDL-C default policy diverges most strongly from the control policy at the doorways. This is because the direction chosen by the policy in the doorways is highly goal-dependent. Because the MDL-C default policy learns to ignore the goal feature, it's roughly uniform in the doorways, whereas the control policy is highly deterministic, having access to the goal feature.

Figure I.2: Without a sparse prior, RPO does not learn to ignore spurious input features.

Figure I.5: Test reward on each individual task in the walker domain over the course of parallel task training. Average performance is plotted above over 10 seeds, with the shading representing one unit of standard error. We can see the biggest performance difference on walker, run, the most challenging task.

Table 2: DM control suite hyperparameters, used for all experiments. * In the parallel setting, $\alpha$ was simply set to 0.1 for methods with learned default policies.

FourRooms: Average cumulative regret across 8 random seeds in phase 2 of the goal change and contingency change experiments for each method. ± values are standard error.

DM Control Suite, Sequential: Average cumulative regret across 8 random seeds in the sequential setting. ± values are standard error.


Acknowledgements The authors would like to thank Kevin Miller, DJ Strouse, Marcel Binz, and Alexander Galashov for useful discussions and suggested improvements to the manuscript. Work funded by the Gatsby Charitable Foundation.

annex

by $K$ and rearranging: We can multiply both sides by $1/2$ and expand $\ell_K(\nu_K)$: where (i) follows from the definition of the variance, (ii) follows from its non-negativity, and (iii) follows from Pinsker's inequality. We then have (G.12). Letting $\alpha_K(s) = \frac{1}{2} G(K)$ and applying Moskovitz et al. (2022a), Lemma 5.2 gives the desired result.

G.2 COMMENT ON IMPROVEMENT ACROSS TASKS

To gain intuition for these bounds, there are several important values of $\alpha(s)$ that we can consider. First, as $\alpha(s) \to 1 - 1/|A|$, which is the TV distance between a uniform default policy and a deterministic optimal policy, $\kappa^\alpha_A(s) \to 2$. This is an important value because it is the coefficient obtained for log-barrier regularization, that is, when the default policy is uniform and encodes no information about the task distribution. Next, as $\alpha(s) \to 0$ (that is, as the TV distance between the default policy and the optimal policy decreases), $\kappa^\alpha_A(s) \to 2|A|/(2|A| - 1) < 2$ for $|A| > 1$. So, convergence gets faster as the distance between the default policy and the optimal policy decreases, as we would hope. Another crucial point to note is that as $|A| \to \infty$ in this case, $\kappa^\alpha_A(s) \to 1$. Finally, and importantly for MDL-C, as $\alpha(s)$ … In other words, a sufficiently bad default policy can preclude convergence entirely if it puts too much mass on a suboptimal action. For an illustration of this phenomenon, see Moskovitz et al. (2022a), Figure 4.1. Indeed, this is why our Proposition 4.1 is so useful: by effectively placing an upper bound on $\alpha(s)$ which shrinks as the number of tasks $K$ increases, MDL-C's default policy is guaranteed to a) avoid putting too much mass on a suboptimal action, which would preclude or delay convergence for the control policy, and b) improve the rate as the default policy regret drops.
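The limiting values discussed above can be reproduced from the expression for $\kappa^\alpha_A(s)$ given in Proposition G.3. The sketch below is an illustration under stated assumptions: the function name is hypothetical, and returning infinity when the denominator is non-positive is a convention of this sketch, not a statement from the paper.

```python
def kappa_coefficient(alpha, num_actions):
    """kappa = 2|A|(1 - alpha) / (2|A|(1 - alpha) - 1), per Proposition G.3."""
    scaled = 2.0 * num_actions * (1.0 - alpha)
    if scaled <= 1.0:
        # Assumption of this sketch: treat the non-positive-denominator
        # regime as an unbounded coefficient.
        return float("inf")
    return scaled / (scaled - 1.0)

num_actions = 4
# Uniform default vs. deterministic optimum: alpha = 1 - 1/|A| gives kappa = 2.
log_barrier = kappa_coefficient(1.0 - 1.0 / num_actions, num_actions)
# Perfect default policy: alpha = 0 gives 2|A| / (2|A| - 1).
perfect = kappa_coefficient(0.0, num_actions)
# Large action space with alpha = 0: kappa approaches 1.
large_space = kappa_coefficient(0.0, 10**6)
# A default policy worse than uniform inflates the coefficient.
poor = kappa_coefficient(0.87, num_actions)

print(log_barrier, perfect, large_space, poor)
```

Evaluating the function over a grid of $\alpha$ values for a fixed $|A|$ gives a quick picture of how sharply the coefficient grows once the default policy becomes worse than uniform.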

G.3 PARALLEL TASK SETTING

An overview of MDL-C as applied in the parallel task setting is presented in Algorithm 3. One important feature to note is the return threshold R ⋆ . As a proxy for the control policy converging to π ⋆ k , data are only added to the default policy replay buffer when a trajectory return is above this threshold performance (on DM control suite tasks, R ⋆ corresponded to a test reward of at least 700).
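The return-thresholded replay logic described above can be sketched as follows. This is a simplified illustration, not the authors' implementation: the class name, the tuple layout `(state, action, reward, next_state)`, and the use of the undiscounted trajectory return as the quantity compared against $R^\star$ are assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class ReplayBuffers:
    """Control replay plus a default-policy replay gated by a return threshold."""
    return_threshold: float = 700.0  # R* value reported for DM control suite tasks
    control: list = field(default_factory=list)
    default: list = field(default_factory=list)

    def add_trajectory(self, transitions, policy_outputs):
        """Store a trajectory; feed the default replay only if it scored well.

        transitions: list of (state, action, reward, next_state) tuples.
        policy_outputs: per-step control policy outputs (e.g. mean and covariance).
        """
        # Every transition goes to the control replay.
        self.control.extend(transitions)
        trajectory_return = sum(t[2] for t in transitions)
        if trajectory_return >= self.return_threshold:
            # Proxy for the control policy being close to optimal on this task.
            self.default.extend(
                (t[0], out) for t, out in zip(transitions, policy_outputs)
            )

buffers = ReplayBuffers()
low = [("s0", "a0", 1.0, "s1"), ("s1", "a1", 2.0, "s2")]
high = [("s0", "a0", 400.0, "s1"), ("s1", "a1", 350.0, "s2")]
buffers.add_trajectory(low, ["p0", "p1"])    # return 3: control replay only
buffers.add_trajectory(high, ["p0", "p1"])   # return 750: both replays
print(len(buffers.control), len(buffers.default))
```

Storing the policy outputs rather than sampled actions mirrors the note in Algorithm 3 that, for a Gaussian policy, the mean and covariance are added to the default replay alongside the state.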

