WHICH MUTUAL-INFORMATION REPRESENTATION LEARNING OBJECTIVES ARE SUFFICIENT FOR CONTROL?

Abstract

Mutual information maximization provides an appealing formalism for learning representations of data. In the context of reinforcement learning, such representations can accelerate learning by discarding irrelevant and redundant information, while retaining the information necessary for control. Much of the prior work on these methods has addressed the practical difficulties of estimating mutual information from samples of high-dimensional observations, while comparatively less is understood about which mutual information objectives are sufficient for reinforcement learning (RL) from a theoretical perspective. In this paper, we identify conditions under which representations that maximize specific mutual-information objectives are theoretically sufficient for learning and representing the optimal policy. Somewhat surprisingly, we find that several popular objectives can yield insufficient representations given mild and common assumptions on the structure of the MDP. We corroborate our theoretical results with experiments on a simulated game environment with visual observations.

1. INTRODUCTION

While deep reinforcement learning (RL) algorithms are capable of learning policies from high-dimensional observations, such as images (Mnih et al., 2013; Lee et al., 2019; Kalashnikov et al., 2018), in practice policy learning faces a bottleneck in acquiring useful representations of the observation space (Shelhamer et al., 2016). State representation learning approaches aim to remedy this issue by learning structured and compact representations on which to perform RL. While a wide range of representation learning objectives have been proposed (Lesort et al., 2018), a particularly appealing class of methods that is amenable to rigorous analysis is based on maximizing mutual information (MI) between variables. In the unsupervised learning setting, this is often realized as the InfoMax principle (Linsker, 1988; Bell & Sejnowski, 1995), which maximizes the mutual information between the input and its latent representation subject to domain-specific constraints. This approach has been widely applied in unsupervised learning in the domains of image, audio, and natural language understanding (Oord et al., 2018; Hjelm et al., 2018; Ravanelli & Bengio, 2019). In RL, the variables of interest for MI maximization are sequential states, actions, and rewards (see Figure 1). As we will discuss, several popular methods for representation learning in RL involve mutual information maximization with different combinations of these variables (Anand et al., 2019; Oord et al., 2018; Pathak et al., 2017; Shelhamer et al., 2016). A useful representation should retain the factors of variation that are necessary to learn and represent the optimal policy or the optimal value function, and discard irrelevant and redundant information.
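In symbols (our shorthand here, rather than the notation of any single cited work), with input X and latent representation Z = φ(X), the InfoMax principle can be written as:

```latex
% Mutual information between input X and latent representation Z:
I(X; Z) \;=\; \mathbb{E}_{p(x,z)}\!\left[ \log \frac{p(x,z)}{p(x)\,p(z)} \right]
% InfoMax: choose an encoder \phi from a constraint set \mathcal{C}
% (architectures, capacity limits, etc.) to maximize the information
% about the input that is retained in Z = \phi(X):
\phi^\star \;=\; \arg\max_{\phi \in \mathcal{C}} \; I\big(X; \phi(X)\big)
```

In the RL setting, X and Z are replaced by combinations of states, actions, and rewards at different timesteps.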
While much prior work has focused on how to optimize various mutual information objectives in high dimensions (Song & Ermon, 2019; Belghazi et al., 2018; Oord et al., 2018; Hjelm et al., 2018), we focus instead on whether the representations that maximize these objectives are theoretically sufficient for learning and representing the optimal policy or value function. We find that some commonly used objectives are insufficient given relatively mild and common assumptions on the structure of the MDP, and identify other objectives that are sufficient. We show these results theoretically and illustrate the analysis empirically in didactic examples in which MI can be computed exactly. Our results provide guidance to the deep RL practitioner on when and why objectives may be expected to work well or fail, and also provide a framework for analyzing newly proposed representation learning objectives based on MI. To investigate how our theoretical results pertain to deep RL, we compare the performance of RL agents in a simulated game when trained on state representations learned by maximizing each MI objective from visual inputs. The experimental results corroborate our theoretical findings and demonstrate that the sufficiency of a representation can have a substantial impact on the performance of an RL agent that uses it.

2. RELATED WORK

In this paper, we analyze several widely used mutual information objectives for control. In this section, we first review MI-based unsupervised learning, then the application of these techniques to the RL setting, and finally discuss alternative perspectives on representation learning in RL.

Mutual information-based unsupervised learning. Mutual information-based methods are particularly appealing for representation learning, as they admit both rigorous analysis and intuitive interpretation. Tracing its roots to the InfoMax principle (Linsker, 1988; Bell & Sejnowski, 1995), a common technique is to maximize the MI between the input and its latent representation subject to domain-specific constraints (Becker & Hinton, 1992). This technique has been applied to learn representations for natural language (Devlin et al., 2019), video (Sun et al., 2019), and images (Bachman et al., 2019; Hjelm et al., 2018). A major challenge to using MI maximization methods in practice is the difficulty of estimating MI from samples (McAllester & Statos, 2018) and with high-dimensional inputs (Song & Ermon, 2019). Much recent work has focused on improving MI estimation via variational methods (Song & Ermon, 2019; Poole et al., 2019; Oord et al., 2018; Belghazi et al., 2018). In this work we are concerned with analyzing the MI objectives themselves, not the estimation method. In our experiments with image observations, we make use of noise contrastive estimation methods (Gutmann & Hyvärinen, 2010), though other choices could also suffice.

Mutual information objectives in RL. Reinforcement learning adds aspects of temporal structure and control to the standard unsupervised learning problem discussed above (see Figure 1). This structure can be leveraged by maximizing MI between sequential states, actions, or combinations thereof. Some works omit the action, maximizing the MI between current and future states (Anand et al., 2019; Oord et al., 2018; Stooke et al., 2020).
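As a concrete sketch of the contrastive estimators mentioned above, the following minimal NumPy implementation computes the InfoNCE lower bound on I(X; Y) (Oord et al., 2018) from a batch of paired samples. The critic and all names here are our own illustrative choices, not code from any of the works cited.

```python
import numpy as np

def infonce_lower_bound(f, xs, ys):
    """InfoNCE lower bound on I(X; Y) from paired samples (xs[i], ys[i]),
    using the other ys in the batch as negatives. f(x, y) is a scalar
    critic; the bound cannot exceed log(batch_size)."""
    n = len(xs)
    # scores[i, j] = f(xs[i], ys[j]); the diagonal holds the positive pairs
    scores = np.array([[f(x, y) for y in ys] for x in xs])
    # log-softmax over each row, evaluated at the positive (diagonal) entry
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return np.log(n) + float(np.mean(np.diag(log_probs)))

# Toy check with a dot-product critic: strongly dependent pairs yield a
# much larger bound than independent pairs (whose true MI is zero).
rng = np.random.default_rng(0)
xs = rng.normal(size=(256, 4))
ys_dep = xs + 0.1 * rng.normal(size=(256, 4))  # Y is nearly a copy of X
ys_ind = rng.normal(size=(256, 4))             # Y independent of X
critic = lambda x, y: x @ y
bound_dep = infonce_lower_bound(critic, xs, ys_dep)
bound_ind = infonce_lower_bound(critic, xs, ys_ind)
assert bound_ind < bound_dep <= np.log(256)
```

Note that the log(batch_size) ceiling is why large batches are typically needed to estimate large MI values with this family of estimators.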
Much prior work learns latent forward dynamics models (Watter et al., 2015; Karl et al., 2016; Zhang et al., 2018b; Hafner et al., 2019; Lee et al., 2019), related to the forward information objective we introduce in Section 4. Multi-step inverse models, closely related to the inverse information objective (Section 4), have been used to learn control-centric representations (Yu et al., 2019; Gregor et al., 2016). Single-step inverse models have been deployed as regularization of forward models (Zhang et al., 2018a; Agrawal et al., 2016) and as an auxiliary loss for policy gradient RL (Shelhamer et al., 2016; Pathak et al., 2017). The MI objectives that we study have also been used as reward bonuses to improve exploration, without impacting the representation, in the form of empowerment (Klyubin et al., 2005; 2008; Mohamed & Rezende, 2015; Leibfried et al., 2019) and information-theoretic curiosity (Still & Precup, 2012).

Representation learning for reinforcement learning. In RL, the problem of finding a compact state space has been studied as state aggregation or abstraction (Bean et al., 1987; Li et al., 2006). Abstraction schemes include bisimulation (Givan et al., 2003), homomorphism (Ravindran & Barto, 2003), utile distinction (McCallum, 1996), and policy irrelevance (Jong & Stone, 2005). While efficient algorithms exist for some abstraction schemes, such as bisimulation, in MDPs with known transition models (Ferns et al., 2006; Givan et al., 2003), in general obtaining error-free abstractions is highly impractical for most problems of interest. For approximate abstractions, prior work has bounded the sub-optimality of the policy (Bertsekas et al., 1988; Dean & Givan, 1997; Abel et al., 2016) as well as the sample efficiency (Lattimore & Szepesvari, 2019; Van Roy & Dong, 2019; Du et al., 2019), with some results extending to the deep learning setting (Gelada et al., 2019; Nachum et al., 2018).
In this paper, we focus on whether a representation can be used to learn the optimal policy, not on the tractability of learning. Alternatively, priors based on the structure of the physical world can be used to guide representation learning (Jonschkowski & Brock, 2015). In deep RL, many auxiliary objectives distinct from the objectives that we study have been proposed, including meta-learning general value functions (Veeriah et al., 2019), predicting multiple value functions (Bellemare et al., 2019; Fedus et al., 2019; Jaderberg et al., 2016), and predicting domain-specific measurements (Mirowski, 2019; Dosovitskiy & Koltun, 2016). We restrict our analysis to objectives that can be expressed as MI maximization.
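To make the sufficiency question concrete, the following toy calculation (our own illustration, not an experiment from this paper) computes a forward-information quantity, I(S'; S, A), exactly in a small tabular MDP, and shows that aggregating states can strictly reduce it; by the data processing inequality, aggregation can never increase it.

```python
import numpy as np
from itertools import product

def mutual_information(p_xy):
    """I(X; Y) in nats from a joint probability table p_xy[i, j]."""
    px = p_xy.sum(axis=1, keepdims=True)
    py = p_xy.sum(axis=0, keepdims=True)
    mask = p_xy > 0
    return float((p_xy[mask] * np.log(p_xy[mask] / (px @ py)[mask])).sum())

# Toy 3-state, 2-action MDP with deterministic transitions T[s, a] = s'.
T = np.array([[1, 2],
              [0, 2],
              [1, 0]])
n_s, n_a = T.shape
# Joint p(s, a, s') under uniform state visitation and a uniform policy;
# rows index (s, a) pairs, columns index the next state s'.
joint = np.zeros((n_s * n_a, n_s))
for s, a in product(range(n_s), range(n_a)):
    joint[s * n_a + a, T[s, a]] += 1.0 / (n_s * n_a)

full_info = mutual_information(joint)        # I(S'; S, A)

# A lossy representation Z that merges states 0 and 1: rows sharing the
# same (z, a) are summed, discarding the distinction between the states.
z_of = [0, 0, 1]                             # state -> aggregated code
merged = np.zeros((2 * n_a, n_s))
for s, a in product(range(n_s), range(n_a)):
    merged[z_of[s] * n_a + a] += joint[s * n_a + a]

lossy_info = mutual_information(merged)      # I(S'; Z, A)
# Transitions are deterministic, so I(S'; S, A) = H(S') = log 3, while
# merging states 0 and 1 makes (z=0, a=0) ambiguous and loses information.
assert abs(full_info - np.log(3)) < 1e-9
assert lossy_info < full_info
```

Representations that maximize such objectives in high dimensions inherit the same failure mode: if the objective does not penalize merging states that matter for control, the learned representation can be insufficient.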

