MINIMUM DESCRIPTION LENGTH CONTROL

Abstract

We propose a novel framework for multitask reinforcement learning based on the minimum description length (MDL) principle. In this approach, which we term MDL-control (MDL-C), the agent learns the common structure among the tasks it faces and distills it into a simpler representation that facilitates faster convergence and generalization to new tasks. In doing so, MDL-C naturally balances adaptation to each task with epistemic uncertainty about the task distribution. We motivate MDL-C via formal connections between the MDL principle and Bayesian inference, derive theoretical performance guarantees, and demonstrate MDL-C's empirical effectiveness on both discrete and high-dimensional continuous control tasks.

1. INTRODUCTION

In order to learn efficiently in a complex world with multiple rapidly changing objectives, both animals and machines must leverage past experience. This is a challenging task, as processing and storing all relevant information is computationally infeasible. How can an intelligent agent address this problem? We hypothesize that one route may lie in the dual process theory of cognition, a longstanding framework in cognitive psychology introduced by William James (James, 1890) which lies at the heart of many dichotomies in both cognitive science and machine learning. Examples include goal-directed versus habitual behavior (Graybiel, 2008), model-based versus model-free reinforcement learning (Daw et al., 2011; Sutton and Barto, 2018), and "System 1" versus "System 2" thinking (Kahneman, 2011). In each of these paradigms, a complex "control" process trades off with a simple "default" process to guide actions. Why has this been such a successful and enduring conceptual motif? Our hypothesis is that default processes often serve to distill common structure from the tasks consistently faced by animals and agents, facilitating generalization and rapid learning on new objectives. For example, drivers can automatically traverse commonly traveled roads en route to new destinations, and chefs quickly learn new dishes on the back of well-honed fundamental techniques. Importantly, even intricate tasks can become automatic if repeated often enough (e.g., the combination of fine motor commands required to swing a tennis racket): the default process must be sufficiently expressive to learn common behaviors, regardless of their complexity. In reality, most processes likely lie on a continuum between simplicity and complexity.

In reinforcement learning (RL; Sutton and Barto, 2018), improving sample efficiency on new tasks is crucial to the development of general agents which can learn effectively in the real world (Botvinick et al., 2015; Kirk et al., 2021).
Intriguingly, one family of approaches that has shown promise in this regard is regularized policy optimization, in which a goal-specific control policy is paired with a simple yet general default policy to facilitate learning across multiple tasks (Teh et al., 2017; Galashov et al., 2019; Goyal et al., 2020; 2019; Moskovitz et al., 2022a). One difficulty in algorithm design, however, is deciding how much or how little to constrain the default policy, and in what way. An overly simple default policy will fail to identify and exploit commonalities among tasks, while a complex model may overfit to a single task and fail to generalize. Most approaches manually specify an asymmetry between the control and default policies, such as hiding input information (Galashov et al., 2019) or constraining the model class (Lai and Gershman, 2021). Ideally, we would like an adaptive approach that learns the appropriate degree of complexity via experience.
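As a concrete sketch of the regularized policy optimization idea, the control policy $\pi$ can be trained to maximize task reward while a divergence penalty keeps it close to a default policy $\pi_0$; the KL-regularized form below follows the general template of Teh et al. (2017) and is illustrative rather than the exact objective used in MDL-C:

```latex
% KL-regularized policy optimization (illustrative form):
% the control policy \pi maximizes expected return while a KL penalty
% keeps it close to the default policy \pi_0. The coefficient \alpha > 0
% trades off task-specific reward against closeness to the default.
J(\pi) = \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}
    \Big( r(s_t, a_t)
    - \alpha\, \mathrm{KL}\big(\pi(\cdot \mid s_t) \,\|\, \pi_0(\cdot \mid s_t)\big) \Big)\right]
```

Under this view, a small $\alpha$ lets the control policy specialize freely to each task, while a large $\alpha$ forces behavior toward the shared default; the difficulty discussed above is that the right trade-off, and the right expressiveness for $\pi_0$ itself, is typically fixed by hand rather than learned.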

