OPTIMISTIC EXPLORATION IN REINFORCEMENT LEARNING USING SYMBOLIC MODEL ESTIMATES

Abstract

There has been increasing interest in using symbolic models alongside reinforcement learning (RL), where these coarser abstract models are used to provide RL agents with higher-level guidance. However, most of these works are inherently limited by their assumption of having access to a symbolic approximation of the underlying problem. To address this issue, we introduce a new method for learning optimistic symbolic approximations of the underlying world model. We show how these representations, coupled with fast diverse planners developed by the automated planning community, provide a new paradigm for optimistic exploration in sparse-reward settings. We also investigate the possibility of speeding up the learning process by generalizing learned model dynamics across similar actions with minimal human input. Finally, we evaluate the method on multiple benchmark domains and compare it with other RL strategies for sparse-reward settings, including hierarchical RL and intrinsic-reward-based exploration.

1. INTRODUCTION

A popular trend in recent years is the use of symbolic planning models with reinforcement learning (RL) algorithms. Prior works have shown how these models can be used to provide guidance to RL agents (Yang et al., 2018; Lee et al., 2022; Gehring et al., 2022), to provide explanations (Sreedharan et al., 2022b), and as an interface for receiving guidance and advice from humans (Kambhampati et al., 2022). Coupled with the fact that advances in automated planning have made available a number of robust tools that RL researchers can adapt directly to their problems (cf. (Francés et al., 2018; Muise et al., 2022; Silver & Chitnis, 2020)), these methods have the potential to help address many problems faced by state-of-the-art RL methods. However, a major hurdle to using these methods is the need for access to a complete and correct symbolic model of the underlying sequential decision-making problem. While there have been efforts from the planning community to learn such models (Juba & Stern, 2022; Yang et al., 2007), most of those methods focus on cases where the models are synthesized from a set of plan traces, corresponding to the traditional offline reinforcement learning setting. Interestingly, very little work has been done on synthesizing such models in the arguably more prominent RL paradigm, namely online RL. To fill this gap, in this paper we propose a novel algorithm for learning relevant fragments of symbolic models in an online fashion. We show how it can be used to address one of the central problems within RL, namely effective exploration: our method allows us to perform goal-directed optimistic exploration while providing rigorous theoretical guarantees.
The exploration mechanism leverages two distinct components: (a) a representation that captures the most optimistic model consistent with the set of observations received so far, and (b) a fast, suboptimal diverse planner that generates multiple possible exploration paths that are still goal-directed. The idea of optimistic exploration is not new within RL; the most prominent example is the R-max algorithm (Brafman & Tennenholtz, 2002), which modifies the reward function to produce agents that are optimistic under uncertainty. Our use of symbolic models, however, allows us to maintain an optimistic hypothesis regarding the underlying transition function. Coupled with a goal-directed planner, this lets us perform directed exploration in sparse-reward settings, where we have a clear specification of the goal state but no intermediate rewards. As we show in this work, for a finite-state deterministic MDP our method is guaranteed to generate a goal-reaching policy. Additionally, we investigate the use of a structured generalization rule that leverages a very simple intuition, namely that the effects of an action depend not on specific object labels but only on object types. Commonly referred to as a lifted representation in the planning literature, we show that this rule speeds up learning with minimal human input. The rest of the paper is structured as follows. We start with related work in Section 2. Section 3 provides a formal definition of the problem we investigate, Section 4 describes our method, and Section 5 presents the empirical evaluation of our method against a set of baselines. Finally, Section 6 concludes the paper with a discussion of the methods and possible future directions.
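To make the interplay between the two components concrete, the following is a minimal sketch under strong simplifying assumptions: a deterministic tabular environment, a plain BFS standing in for the diverse planner, and an optimistic model that assumes any untried state-action pair may lead directly to the goal. All names here are illustrative, not the paper's actual implementation.

```python
from collections import deque

class OptimisticModel:
    """Most optimistic transition model consistent with observations:
    any untried (state, action) pair is assumed to reach the goal."""
    def __init__(self, actions, goal):
        self.actions = actions
        self.goal = goal
        self.observed = {}  # (state, action) -> observed next state

    def successors(self, state):
        for a in self.actions:
            yield a, self.observed.get((state, a), self.goal)

    def update(self, state, action, next_state):
        self.observed[(state, action)] = next_state

def plan(model, start):
    """BFS over the optimistic model; returns an action sequence to the goal."""
    frontier, visited = deque([(start, [])]), {start}
    while frontier:
        s, path = frontier.popleft()
        if s == model.goal:
            return path
        for a, s2 in model.successors(s):
            if s2 not in visited:
                visited.add(s2)
                frontier.append((s2, path + [a]))
    return None  # goal unreachable even under optimism

def explore(step_fn, start, goal, actions, max_episodes=50):
    """Repeat: plan optimistically, execute until a prediction is refuted,
    and fold the observed transition back into the model."""
    model = OptimisticModel(actions, goal)
    for _ in range(max_episodes):
        path = plan(model, start)
        if path is None:
            return None
        s = start
        for a in path:
            predicted = dict(model.successors(s))[a]
            s2 = step_fn(s, a)
            model.update(s, a, s2)
            if s2 != predicted:
                break  # optimism refuted; replan from scratch
            s = s2
        if s == goal:
            return path  # the optimistic plan executed fully
    return None
```

In a deterministic finite environment, each episode either reaches the goal or refutes at least one optimistic guess, so the loop terminates; this mirrors, in miniature, the goal-reaching guarantee discussed above.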

2. RELATED WORK

As mentioned earlier, one of the foundational works on optimistic exploration in the context of reinforcement learning is R-max (Brafman & Tennenholtz, 2002). Even before its formulation in the current popular form, the idea of optimism under uncertainty had found several uses within the RL literature (cf. (Kaelbling et al., 1996)). R-max can be seen as an instance of a larger class of intrinsic-reward-based learning methods (Aubret et al., 2019), but one where the reward is tied to state novelty. Other forms of intrinsic reward incentivize the agent to learn potentially useful skills and new knowledge. A context where model simplification has been used in areas related to RL is stochasticity, where methods like certainty equivalence and hindsight optimization have been applied (Bertsekas, 2021; Yoon et al., 2008). In Section 6, we discuss how our methods can also be applied directly in settings with stochastic dynamics. With regard to the use of symbolic models, the most common application is in hierarchical reinforcement learning. Many works (Lee et al., 2022; Illanes et al., 2020; Yang et al., 2018; Lyu et al., 2019) have investigated the possibility of using a symbolic model to generate potential options and then using a meta-controller to learn policies over those options. While most of these works assume that the model is in some way an approximation of the true model, all inference performed at the symbolic level is carried out over the original model provided as part of the problem. While in this work we focus on cases where the symbolic model could in theory exactly capture the underlying model, the same techniques can also be applied when the planning model represents some abstraction of the true model. Another popular use of symbolic models is as a source of reward-shaping information (cf. (Gehring et al., 2022)).
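For reference, the reward modification at the heart of R-max can be sketched in a few lines; this is a textbook rendering with illustrative names and an arbitrary visit-count threshold, not the paper's code or the algorithm's full machinery.

```python
def rmax_reward(observed_reward, visit_count, r_max, known_threshold=5):
    """R-max-style optimism under uncertainty: a state-action pair visited
    fewer than `known_threshold` times is treated as "unknown" and assumed
    to yield the maximum possible reward, driving the agent toward novelty.
    Once the pair is sufficiently sampled, the observed reward is used."""
    return r_max if visit_count < known_threshold else observed_reward
```

Planning with these modified rewards makes unfamiliar state-action pairs look maximally attractive, which is what ties R-max's optimism to state novelty.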
In this context, works have also looked at symbolic models as a vehicle to precisely specify objectives (Icarte et al., 2018; Giacomo et al., 2019). In terms of learning symbolic models, the work has interestingly focused mostly on learning from plan or execution traces (Yang et al., 2007; Juba & Stern, 2022; Callanan et al., 2022; Cresswell et al., 2013). In most of these works, the theoretical guarantee aimed for is to generate more pessimistic models whose plans are always guaranteed to work, but which may overlook plausible plans. This is antithetical to the considerations one must employ when performing exploration in common online RL settings, where the agent is either operating in a safe environment or interacting with a simulator. To the best of our knowledge, all existing online methods for acquiring symbolic models (Carbonell & Gil, 1990; Lamanna et al., 2021) focus on extracting an exact representation of the true underlying model. Since our primary motivation for learning the model is to drive the exploration process, we do not have that limitation; instead, we focus on learning a (more permissive) optimistic approximation. It is also worth noting that the assumption that the system will be provided action arguments (something we leverage in Section 4.3) is commonly made by most of these works. There are also works that try to automatically acquire abstract symbolic models from an underlying MDP (including the potential symbols themselves), such as that of Konidaris et al. (2018). This direction is orthogonal to ours, as the symbols produced may be meaningless to a human, whereas we explicitly aim to leverage human intuitions about the problem.
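The contrast between pessimistic and optimistic model estimates can be illustrated with a small propositional (rather than lifted) sketch; the class and action names are illustrative and not the paper's representation. An untried action is assumed applicable everywhere and capable of adding any proposition, and each observed execution tightens the estimate toward the true STRIPS-style effects.

```python
class OptimisticOperator:
    """Optimistic STRIPS-style estimate of one action's dynamics. Before
    any observation the action is maximally permissive; observed executions
    replace the guess with effects consistent with the evidence."""

    def __init__(self, all_props):
        self.tried = False
        self.adds = frozenset(all_props)  # optimistic: may add anything
        self.dels = frozenset()           # optimistic: deletes nothing

    def apply(self, state):
        """Predicted successor under the current (possibly optimistic) estimate."""
        return (frozenset(state) - self.dels) | self.adds

    def observe(self, state, next_state):
        """Tighten the estimate with one observed deterministic transition."""
        state, next_state = frozenset(state), frozenset(next_state)
        if not self.tried:
            self.tried = True
            self.adds = next_state - state
            self.dels = state - next_state
        else:
            # keep only effects consistent with every observation so far
            self.adds = self.adds & next_state
            self.dels = self.dels - next_state
```

A pessimistic learner would instead refuse to predict any effect not confirmed by a trace; the optimistic estimate above is exactly the opposite, which is what lets a goal-directed planner propose untried actions as candidate goal-achievers.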

3.1. BLOCKSWORLD

To make the problem concrete, we will use a Blocks World problem as a running example throughout the paper. Blocks World has a long history within AI as a useful benchmark for visualizing sequential decision-making problems (going all the way back to the early 1970s with Winograd's

