MINIMUM DESCRIPTION LENGTH CONTROL

Abstract

We propose a novel framework for multitask reinforcement learning based on the minimum description length (MDL) principle. In this approach, which we term MDL-control (MDL-C), the agent learns the common structure among the tasks with which it is faced and then distills it into a simpler representation which facilitates faster convergence and generalization to new tasks. In doing so, MDL-C naturally balances adaptation to each task with epistemic uncertainty about the task distribution. We motivate MDL-C via formal connections between the MDL principle and Bayesian inference, derive theoretical performance guarantees, and demonstrate MDL-C's empirical effectiveness on both discrete and high-dimensional continuous control tasks.

1. INTRODUCTION

In order to learn efficiently in a complex world with multiple rapidly changing objectives, both animals and machines must leverage past experience. This is a challenging task, as processing and storing all relevant information is computationally infeasible. How can an intelligent agent address this problem? We hypothesize that one route may lie in the dual process theory of cognition, a longstanding framework in cognitive psychology introduced by William James (James, 1890) which lies at the heart of many dichotomies in both cognitive science and machine learning. Examples include goal-directed versus habitual behavior (Graybiel, 2008), model-based versus model-free reinforcement learning (Daw et al., 2011; Sutton and Barto, 2018), and "System 1" versus "System 2" thinking (Kahneman, 2011). In each of these paradigms, a complex "control" process trades off with a simple "default" process to guide actions. Why has this been such a successful and enduring conceptual motif? Our hypothesis is that default processes often serve to distill common structure from the tasks consistently faced by animals and agents, facilitating generalization and rapid learning on new objectives. For example, drivers can automatically traverse commonly traveled roads en route to new destinations, and chefs quickly learn new dishes on the back of well-honed fundamental techniques. Importantly, even intricate tasks can become automatic if repeated often enough (e.g., the combination of fine motor commands required to swing a tennis racket): the default process must be sufficiently expressive to learn common behaviors, regardless of their complexity. In reality, most processes likely lie on a continuum between simplicity and complexity.

In reinforcement learning (RL; Sutton and Barto, 2018), improving sample efficiency on new tasks is crucial to the development of general agents which can learn effectively in the real world (Botvinick et al., 2015; Kirk et al., 2021).
Intriguingly, one family of approaches that has shown promise in this regard is regularized policy optimization (RPO), in which a goal-specific control policy is paired with a simple yet general default policy to facilitate learning across multiple tasks (Teh et al., 2017; Galashov et al., 2019; Goyal et al., 2020; 2019; Moskovitz et al., 2022a). One difficulty in algorithm design, however, is how much or how little to constrain the default policy, and in what way. An overly simple default policy will fail to identify and exploit commonalities among tasks, while a complex model may overfit to a single task and fail to generalize. Most approaches manually specify an asymmetry between the control and default policies, such as hiding input information (Galashov et al., 2019) or constraining the model class (Lai and Gershman, 2021). Ideally, we would like an adaptive approach that learns the appropriate degree of complexity via experience. The minimum description length principle (MDL; Rissanen, 1978), which in general holds that one should prefer the simplest model that accurately fits the data, offers a guiding framework for algorithm design that does just that, enabling the default policy to optimally trade off between adapting to information from new tasks and maintaining simplicity. Inspired by dual process theory and the MDL principle, we propose MDL-control (MDL-C, pronounced "middle-cee"), a principled RPO framework for multitask RL. In Section 2, we formally introduce multitask RL and describe RPO approaches within this setting. In Section 3, we describe MDL and the variational coding framework, from which we extract MDL-C and derive its formal performance characteristics. In Section 5, we demonstrate its empirical effectiveness in both discrete and continuous control settings. Finally, we discuss related ideas from the literature (Section 6) and conclude (Section 7).

2. REINFORCEMENT LEARNING PRELIMINARIES

The single-task setting We model a task as a Markov decision process (MDP; Puterman, 2010) $M = (\mathcal{S}, \mathcal{A}, P, r, \gamma, \rho)$, where $\mathcal{S}, \mathcal{A}$ are state and action spaces, respectively, $P : \mathcal{S} \times \mathcal{A} \to \mathscr{P}(\mathcal{S})$ is the state transition distribution, $r : \mathcal{S} \times \mathcal{A} \to [0, 1]$ is a reward function, $\gamma \in [0, 1)$ is a discount factor, and $\rho \in \mathscr{P}(\mathcal{S})$ is the starting state distribution. $\mathscr{P}(\cdot)$ is the space of probability distributions defined over a given space. The agent takes actions using a policy $\pi : \mathcal{S} \to \mathscr{P}(\mathcal{A})$. In large or continuous domains, the policy is often parameterized: $\pi \to \pi_\theta$, $\theta \in \Theta$, where $\Theta \subseteq \mathbb{R}^d$ represents a particular model class with $d$ parameters. In conjunction with the transition dynamics, the policy induces a distribution over trajectories $\tau = (s_h, a_h)_{h=0}^{\infty}$, $P^{\pi_\theta}(\tau)$. In a single task, the agent seeks to maximize its value $V^{\pi_\theta} = \mathbb{E}_{\tau \sim P^{\pi_\theta}} R(\tau)$, where $R(\tau) := \sum_{h \ge 0} \gamma^h r(s_h, a_h)$ is called the return. We denote by $d^{\pi}_{\rho}$ the state-occupancy distribution induced by policy $\pi$ with starting state distribution $\rho$: $d^{\pi}_{\rho}(s) = (1 - \gamma)\, \mathbb{E}_{\rho} \sum_{h \ge 0} \gamma^h \Pr(s_h = s \mid s_0)$.

Multiple tasks There are a number of frameworks for multitask RL in the literature (Yu et al., 2019; Zahavy et al., 2021; Finn et al., 2017; Brunskill and Li, 2013). For a more detailed discussion, see Appendix Section B. In this paper, we focus primarily on what we term the sequential and parallel task settings. The objective in each case is to maximize average reward across tasks, equivalent to minimizing cumulative regret over the agent's 'lifetime.' More specifically, we assume a (possibly infinite) set of tasks (MDPs) $\mathcal{M} = \{M\}$ presented to the agent by sampling from some task distribution $P_{\mathcal{M}} \in \mathscr{P}(\mathcal{M})$. In the sequential task setting (Moskovitz et al., 2022a; Pacchiano et al., 2022), tasks (MDPs) are sampled one at a time from $P_{\mathcal{M}}$, with the agent training on each until convergence.
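To make the return and value definitions above concrete, here is a minimal sketch of a Monte Carlo value estimate over sampled trajectories. The helper names (`discounted_return`, `value_estimate`) are illustrative, not part of the paper's method.

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    """R(tau) = sum_h gamma^h * r_h for a single trajectory's reward sequence."""
    return float(sum(gamma ** h * r for h, r in enumerate(rewards)))

def value_estimate(reward_trajectories, gamma=0.99):
    """Monte Carlo estimate of V^pi = E_{tau ~ P^pi}[R(tau)],
    averaging the return over sampled trajectories."""
    return float(np.mean([discounted_return(r, gamma)
                          for r in reward_trajectories]))
```

In practice the trajectories would be generated by rolling out $\pi_\theta$ in the environment; here they are supplied directly as reward sequences for simplicity.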
In the parallel task setting (Yu et al., 2019), a new MDP is sampled from $P_{\mathcal{M}}$ at the start of every episode and is associated with a particular input feature $g \in \mathcal{G}$ that indicates to the agent which task has been sampled.
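The two settings can be sketched as follows. This is an illustrative sketch only: it assumes a finite task set with a uniform task distribution, and the dictionary-based task representation with a `"g"` goal feature is hypothetical.

```python
import random

def sample_task(tasks):
    """Draw M ~ P_M; here P_M is assumed uniform over a finite task set."""
    return random.choice(tasks)

def sequential_setting(tasks, train_to_convergence, num_rounds):
    """Sequential setting: one task is drawn at a time and the agent
    trains on it until convergence before the next draw."""
    for _ in range(num_rounds):
        train_to_convergence(sample_task(tasks))

def parallel_episode(tasks):
    """Parallel setting: a fresh task is drawn at the start of each
    episode, along with the goal feature g that identifies it."""
    task = sample_task(tasks)
    return task, task["g"]  # g in G tells the agent which task was sampled
```

In the parallel case, $g$ is typically concatenated to the agent's observations so a single policy can condition on the active task.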

Regularized Policy Optimization

One common approach to improving performance is regularized policy optimization (RPO; Schulman et al., 2017; 2018; Levine, 2018; Agarwal et al., 2020; Pacchiano et al., 2020; Tirumala et al., 2020; Abdolmaleki et al., 2018). In RPO, a convex regularization term $\Omega(\theta)$ is added to the objective: $J^{\mathrm{RPO}}_{\lambda}(\theta) = V^{\pi_\theta} - \lambda \Omega(\theta)$. In the single-task setting, the regularization term is often used to approximate trust-region (Schulman et al., 2015), proximal-point (Schulman et al., 2017), or natural gradient (Kakade, 2002; Pacchiano et al., 2020; Moskovitz et al., 2020) optimization, or to prevent premature convergence to local maxima (Haarnoja et al., 2018; Lee et al., 2018). In multitask settings, the regularization term typically takes the form of a divergence measure penalizing the policy responsible for taking actions, $\pi_\theta$, which we refer to as the control policy, for deviating from some default policy $\pi_w$, which is intended to encode generally useful behavior for some family of tasks (Teh et al., 2017; Galashov et al., 2019; Goyal et al., 2019; 2020; Moskovitz et al., 2022a). By capturing behavior which is useful on average across tasks, $\pi_w$ can provide a form of beneficial supervision to $\pi_\theta$ when obtaining reward is challenging, either because $\pi_\theta$ has been insufficiently trained or because rewards are sparse.
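A minimal sketch of the multitask RPO objective, instantiating $\Omega(\theta)$ as the KL divergence between the control policy and the default policy averaged over states. The function names and the use of explicit per-state action distributions are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def kl_divergence(p, q):
    """KL(p || q) for discrete action distributions.
    Assumes q > 0 wherever p > 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def rpo_objective(value, control_dists, default_dists, lam=0.1):
    """J_lambda^RPO = V^{pi_theta} - lam * mean_s KL(pi_theta(.|s) || pi_w(.|s)).
    control_dists / default_dists: per-state action distributions for
    the control policy pi_theta and default policy pi_w."""
    penalty = np.mean([kl_divergence(p, q)
                       for p, q in zip(control_dists, default_dists)])
    return float(value - lam * penalty)
```

When the control and default policies agree, the penalty vanishes and the regularized objective reduces to the unregularized value; larger $\lambda$ keeps $\pi_\theta$ closer to $\pi_w$.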

3. THE MINIMUM DESCRIPTION LENGTH PRINCIPLE

General principle Storing all environment interactions across multiple tasks is computationally infeasible, so multitask RPO algorithms offer a compressed representation in the form of a default policy. However, the type of information which is compressed (and that which is lost) is often

