REINFORCEMENT LEARNING WITH BAYESIAN CLASSIFIERS: EFFICIENT SKILL LEARNING FROM OUTCOME EXAMPLES

Abstract

Exploration in reinforcement learning is, in general, a challenging problem. In this work, we study a more tractable class of reinforcement learning problems defined by data that provides examples of successful outcome states. In this case, the reward function can be obtained automatically by training a classifier to distinguish successful states from unsuccessful ones. We argue that, with appropriate representation and regularization, such a classifier can guide a reinforcement learning algorithm to an effective solution. However, as we will show, this requires the classifier to make uncertainty-aware predictions, which is very difficult with standard deep networks. To address this, we propose a novel mechanism for obtaining calibrated uncertainty based on an amortized technique for computing the normalized maximum likelihood distribution. We show that the resulting algorithm has a number of intriguing connections to both count-based exploration methods and prior algorithms for learning reward functions from data, while also guiding algorithms toward the specified goal more effectively. We show how using amortized normalized maximum likelihood for reward inference provides effective reward guidance for solving a number of challenging navigation and robotic manipulation tasks that prove difficult for other algorithms.

1. INTRODUCTION

While reinforcement learning (RL) has been shown to successfully solve problems with careful reward design (Rajeswaran et al., 2018; OpenAI et al., 2019), RL in its most general form, with no assumptions on the dynamics or reward function, requires solving a challenging uninformed search problem in which rewards are sparsely observed. Techniques that explicitly provide "reward shaping" (Ng et al., 1999), or modify the reward function to guide learning, can take some of the burden off of exploration, but shaped rewards can be difficult to obtain without domain knowledge. In this paper, we aim to reformulate the reinforcement learning problem to make it easier for the user to specify the task and to provide a tractable reinforcement learning objective. Instead of requiring a reward function designed for an objective, our method assumes a user-provided set of successful outcome examples: states in which the desired task has been accomplished successfully. The algorithm aims to estimate the distribution over these states and maximize the probability of reaching states that are likely under this distribution. Prior work on learning from success examples (Fu et al., 2018b; Zhu et al., 2020) focused primarily on alleviating the need for manual reward design. In our work, we focus on the potential for this mode of task specification to produce more tractable RL problems and solve more challenging classes of tasks. Intuitively, when provided with explicit examples of successful states, the RL algorithm should be able to direct its exploration, rather than simply hope to chance upon high-reward states at random. The main challenge in instantiating this idea as a practical algorithm is performing appropriate uncertainty quantification when estimating whether a given state corresponds to a successful outcome.
Our approach trains a classifier to distinguish successful states, provided by the user, from those generated by the current policy, analogously to generative adversarial networks (Goodfellow et al., 2014) and previously proposed methods for inverse reinforcement learning (Fu et al., 2018a). In general, such a classifier is not guaranteed to provide a good optimization landscape for learning the policy. We discuss how a particular form of uncertainty quantification based on the normalized maximum likelihood (NML) distribution produces better reward guidance for learning. We also connect our approach to count-based exploration methods, showing that a classifier with suitable uncertainty estimates reduces to a count-based exploration method in the absence of any generalization across states, while also discussing how it improves over count-based exploration in the presence of good generalization. We then propose a practical algorithm to train success classifiers in a computationally efficient way with NML, and show how this form of reward inference allows us to solve difficult problems more efficiently, providing experimental results that outperform existing algorithms on a number of navigation and robotic manipulation domains.
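To make the classifier-as-reward idea concrete, the following sketch (our illustration, not the paper's implementation; all function names are ours) trains a simple logistic-regression classifier to separate user-provided success states from states visited by the current policy, and uses the classifier's log-probability of success as the reward:

```python
import numpy as np

def train_success_classifier(success_states, policy_states, lr=0.5, steps=500):
    """Fit logistic regression separating success examples (label 1)
    from states visited by the current policy (label 0)."""
    X = np.vstack([success_states, policy_states])
    y = np.concatenate([np.ones(len(success_states)),
                        np.zeros(len(policy_states))])
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted P(success | s)
        g = p - y                               # cross-entropy gradient
        w -= lr * (X.T @ g) / len(y)
        b -= lr * g.mean()
    return w, b

def reward(s, w, b):
    """Reward for the policy: log P(success | s) under the classifier."""
    z = s @ w + b
    return -np.log1p(np.exp(-z))  # numerically stable log sigmoid(z)
```

In the alternating scheme described above, the policy is updated against this reward while the classifier is periodically refit on freshly collected policy states, analogous to a GAN discriminator.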

2. RELATED WORK

A number of techniques have been proposed to improve exploration. These techniques either add reward bonuses that encourage a policy to visit novel states in a task-agnostic manner (Wiering and Schmidhuber, 1998; Auer et al., 2002; Schaul et al., 2011; Houthooft et al., 2016; Pathak et al., 2017; Tang et al., 2017; Stadie et al., 2015; Bellemare et al., 2016; Burda et al., 2018a; O'Donoghue, 2018) or perform Thompson sampling or approximate Thompson sampling based on a prior over value functions (Strens, 2000; Osband et al., 2013; 2016). While these techniques are uninformed about the actual task, we consider a constrained set of problems in which examples of successes allow for more task-directed exploration. In real-world problems, designing well-shaped reward functions makes exploration easier but often requires significant domain knowledge (Andrychowicz et al., 2020), access to privileged information about the environment (Levine et al., 2016), and/or a human in the loop providing rewards (Knox and Stone, 2009; Singh et al., 2019b). Prior work has considered specifying rewards by providing example demonstrations and inferring rewards with inverse RL (Abbeel and Ng, 2004; Ziebart et al., 2008; Ho and Ermon, 2016; Fu et al., 2018a). This requires expensive expert demonstrations to be provided to the agent. In contrast, our work has the minimal requirement of simply providing successful outcome states, which can be done cheaply and more intuitively. This subclass of problems is also related to goal-conditioned RL (Kaelbling, 1993; Schaul et al., 2015; Zhu et al., 2017; Andrychowicz et al., 2017; Nair et al., 2018; Veeriah et al., 2018; Rauber et al., 2018; Warde-Farley et al., 2018; Colas et al., 2019; Ghosh et al., 2019; Pong et al., 2020), but is more general, since it allows for a more abstract notion of task success. A core idea behind our work is using a Bayesian classifier to learn a suitable reward function.
Bayesian inference with expressive models and high-dimensional data can often be intractable, requiring assumptions on the form of the posterior (Hoffman et al., 2013; Blundell et al., 2015; Maddox et al., 2019). In this work, we build on the concept of normalized maximum likelihood (Rissanen, 1996; Shtar'kov, 1987), or NML, to learn Bayesian classifiers. Although NML is typically considered from the perspective of optimal coding (Grünwald, 2007; Fogel and Feder, 2018), we show how it can be used to learn success classifiers, and discuss its connections to exploration and reward shaping in RL.
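The NML construction can be illustrated in the simplest setting: a Bernoulli model over binary sequences. The sketch below (our toy example, not from the paper) computes the NML distribution p_NML(x) = max_θ p(x | θ) / Σ_{x'} max_θ p(x' | θ) by brute force, which is only feasible for short sequences but shows the normalization over all possible datasets that makes NML expensive in general:

```python
import itertools

def max_likelihood(seq):
    """Likelihood of a binary sequence under its own Bernoulli MLE,
    theta_hat = k / n."""
    n, k = len(seq), sum(seq)
    if k == 0 or k == n:
        return 1.0  # degenerate MLE (theta = 0 or 1) fits perfectly
    theta = k / n
    return theta ** k * (1 - theta) ** (n - k)

def nml_probability(seq):
    """NML probability: the sequence's maximized likelihood, normalized
    over the maximized likelihoods of every sequence of the same length."""
    n = len(seq)
    Z = sum(max_likelihood(s) for s in itertools.product([0, 1], repeat=n))
    return max_likelihood(seq) / Z
```

Because the normalizer Z sums over every possible dataset, exact NML scales exponentially in sequence length, which motivates the amortized approximation proposed in this paper.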

3. PRELIMINARIES

In this paper, we study a modified reinforcement learning problem, where instead of the standard reward function, the agent is provided with successful outcome examples. This reformulation not only provides a modality for task specification that may be more natural for users to provide in some settings (Fu et al., 2018b; Zhu et al., 2020; Singh et al., 2019a), but, as we will show, can also make learning easier. We also derive a meta-learned variant of the conditional normalized maximum likelihood (CNML) distribution for representing our reward function, in order to make evaluation tractable. We discuss background on successful outcome examples and CNML in this section.
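As a toy illustration of the CNML idea (our sketch, assuming a single discrete state with an independent Bernoulli success model, a simplification not taken from the paper): for a query at that state, we append each hypothetical label to the observed data, refit the maximum-likelihood parameter, and normalize the resulting augmented-dataset likelihoods across labels. With no generalization across states, the result depends only on visit counts, echoing the connection to count-based exploration noted in the introduction:

```python
def cnml_success_prob(n_success: int, n_fail: int) -> float:
    """CNML predictive probability of success at a single state under a
    Bernoulli model, given counts of observed successes and failures."""
    def augmented_likelihood(y: int) -> float:
        n = n_success + n_fail + 1  # dataset size after appending the query
        k = n_success + y           # success count after appending the query
        theta = k / n               # refit Bernoulli MLE on augmented data
        return theta ** k * (1 - theta) ** (n - k)
    l1 = augmented_likelihood(1)
    l0 = augmented_likelihood(0)
    return l1 / (l0 + l1)
```

With no data the prediction is 0.5 (maximal uncertainty, encouraging visits to novel states), and it moves toward 0 or 1 only as failures or successes accumulate at that state.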

3.1. REINFORCEMENT LEARNING WITH EXAMPLES OF SUCCESSFUL OUTCOMES

We follow the framework proposed by Fu et al. (2018b) and assume that we are provided with a Markov decision process (MDP) without a reward function, given by $\mathcal{M} = (\mathcal{S}, \mathcal{A}, \mathcal{T}, \gamma, \mu_0)$, as well as successful outcome examples $S^+ = \{s_k^+\}_{k=1}^K$, a set of states in which the desired task has been accomplished. This formalism is easiest to describe in terms of the control-as-inference framework (Levine, 2018). The relevant graphical model in Figure 9 consists of states and actions, as well as binary success variables $e_t$ which represent the occurrence of a particular

