LEARNING DYNAMIC ABSTRACT REPRESENTATIONS FOR SAMPLE-EFFICIENT REINFORCEMENT LEARNING

Abstract

In many real-world problems, a learning agent needs to discover a problem's abstractions and its solution simultaneously. However, most such abstractions must be designed and refined by hand for each problem and domain of application. This paper presents a novel top-down approach for constructing state abstractions while carrying out reinforcement learning. Starting with state variables and a simulator, it dynamically computes a domain-independent abstraction based on the dispersion of Q-values in abstract states as the agent continues acting and learning. Extensive empirical evaluation on multiple domains and problems shows that this approach automatically learns abstractions that are finely tuned to the problem, yield strong sample efficiency, and enable the RL agent to significantly outperform existing approaches.

1. INTRODUCTION

It is well known that good abstract representations can play a vital role in improving the scalability and efficiency of reinforcement learning (RL) (Sutton & Barto, 2018; Yu, 2018; Konidaris, 2019). However, it remains unclear how good abstract representations can be learned efficiently without extensive hand-coding. Several authors have investigated methods for aggregating concrete states based on similarities in value functions, but this approach can be difficult to scale as the number of concrete states or the transition graph grows. This paper presents a novel approach for the top-down construction and refinement of abstractions for sample-efficient reinforcement learning. Rather than aggregating concrete states based on the agent's experience, our approach starts with a default, auto-generated coarse abstraction that collapses the domain of each state variable (e.g., the location of each taxi and each passenger in the classic taxi world) to one or two abstract values. This eliminates the need to consider concrete states individually, although this initial abstraction is likely to be too coarse for most practical problems. The overall algorithm proceeds by interleaving the refinement of this abstraction with the learning and evaluation of policies, and it results in automatically generated, problem- and reward-function-specific abstractions that aid learning. This process not only helps in creating a succinct representation of cumulative value functions; it also makes learning more sample efficient by using the abstraction to locally transfer states' values and by cleaving abstract states only when an abstract state is observed to contain states with a large spread in their value functions. This approach is related to research on abstraction for reinforcement learning and on abstraction refinement for model checking (Dams & Grumberg, 2018; Clarke et al., 2000); a detailed survey of related work is presented in the next section.
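As a concrete illustration of the refinement criterion just described, the following minimal sketch flags an abstract state for cleaving when the Q-values of its member concrete states are widely dispersed, and splits it on the value of one state variable. This is our own simplification, not the paper's implementation: the threshold `theta`, the helper names, and the single-variable split are hypothetical choices.

```python
import statistics
from collections import defaultdict

def should_refine(q_values, theta=1.0):
    """Flag an abstract state for refinement when the Q-values of the
    concrete states mapped to it are widely dispersed (std dev > theta)."""
    return len(q_values) >= 2 and statistics.stdev(q_values) > theta

def split_on_variable(members, var_index):
    """Cleave an abstract state: partition its member concrete states
    (tuples of state-variable values) by the value of one chosen variable,
    yielding finer abstract states."""
    parts = defaultdict(list)
    for state in members:
        parts[state[var_index]].append(state)
    return list(parts.values())
```

A near-uniform group of Q-values leaves the abstract state intact, while a large spread triggers a split; which variable to split on, and how finely, is exactly what the paper's conditional abstraction trees decide.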
However, unlike existing streams of work, we develop a process that automatically generates conditional abstractions, where the final abstraction over the set of values of a variable can depend on the specific values of other variables. For instance, Fig. 1 shows a taxi world where, for different values of the state variables (the destination and passenger locations), meaningful conditional abstractions are constructed for the taxi location. A meaningful abstraction provides greater detail in the taxi-location variable around a passenger's location when the taxi needs to pick that passenger up (Fig. 1, middle). When the taxi is carrying a passenger, the abstraction should provide greater detail around the destination (Fig. 1, right). Furthermore, our approach goes beyond counter-example driven abstraction refinement by considering the reward function as well as stochastic dynamics, and it uses measures of dispersion, such as the standard deviation of Q-values, to drive the refinement process. The main contributions of this paper are mechanisms for building conditional abstraction trees that help compute and represent such abstractions, and a process for interleaving RL episodes with phases of abstraction and refinement. Although this process could be adapted to numerous RL algorithms, we focus on developing and investigating it with Q-learning in this paper.

Figure 1: Consider a classic taxi world with two passengers and a building as the drop-off location, where the green area is impassable (left). Meaningful conditional abstractions can be constructed, for example, for situations where both passengers are at their pickup locations (middle), or where one passenger has already been picked up (right).

The presented approach for dynamic abstractions for RL (DAR+RL) can be thought of as a dynamic abstraction scheme because the refinement is tied to the dispersion of Q-values under the agent's evolving policy during learning.
It provides adjustable degrees of compression (Abel et al., 2016), where the aggressiveness of abstraction can be controlled by tuning the definition of variation in the dispersion of Q-values. Extensive empirical evaluation on multiple domains and problems shows that this approach automatically learns abstract representations that effectively draw out similarities across the state space and yield strong sample efficiency in learning. Comparative evaluation shows that Q-learning-based RL agents enhanced with our approach outperform state-of-the-art RL approaches in both discrete and continuous domains while learning meaningful abstract representations. The rest of this paper is organized as follows. Sec. 2 summarizes related work, followed by a discussion of the necessary background in Sec. 3. Sec. 4 presents our dynamic abstraction learning method for sample-efficient RL. Empirical evaluations are presented in Sec. 5, followed by conclusions in Sec. 6.
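The interleaving of RL episodes with refinement phases described in this section can be sketched as the following loop. This is a minimal, self-contained illustration under our own simplifying assumptions, not the paper's algorithm: the function names, hyperparameters, and the naive split (giving each visited concrete state its own abstract state) are all hypothetical, whereas the paper cleaves abstract states via conditional abstraction trees.

```python
import random
import statistics
from collections import defaultdict

def train(env_reset, env_step, actions, episodes=200, max_steps=50,
          refine_every=50, alpha=0.5, gamma=0.9, eps=0.3, theta=0.5):
    """Sketch: epsilon-greedy Q-learning over an abstraction, interleaved
    with refinement phases that split abstract states whose member
    concrete states have widely dispersed Q-estimates."""
    abstraction = {}             # concrete state -> abstract id (0 = default)
    q_abs = defaultdict(float)   # (abstract id, action) -> value; drives acting
    q_con = defaultdict(float)   # (concrete state, action) -> value; drives splits
    next_id = 1

    def phi(s):                  # abstraction function
        return abstraction.get(s, 0)

    for ep in range(1, episodes + 1):
        s, done = env_reset(), False
        for _ in range(max_steps):           # cap episode length
            a = (random.choice(actions) if random.random() < eps
                 else max(actions, key=lambda b: q_abs[(phi(s), b)]))
            s2, r, done = env_step(s, a)
            # Bootstrap both tables from the abstract value of the next state,
            # so the abstraction locally transfers value estimates.
            best_next = 0.0 if done else max(q_abs[(phi(s2), b)] for b in actions)
            target = r + gamma * best_next
            q_abs[(phi(s), a)] += alpha * (target - q_abs[(phi(s), a)])
            q_con[(s, a)] += alpha * (target - q_con[(s, a)])
            s = s2
            if done:
                break
        if ep % refine_every == 0:           # refinement phase
            groups = defaultdict(list)
            for (cs, _), v in q_con.items():
                groups[phi(cs)].append((cs, v))
            for pairs in groups.values():
                vals = [v for _, v in pairs]
                if len(vals) > 1 and statistics.stdev(vals) > theta:
                    for cs, _ in pairs:      # naive split for illustration only
                        if cs not in abstraction:
                            abstraction[cs] = next_id
                            next_id += 1
    return phi, q_abs
```

On a toy chain world with actions {-1, +1} and a reward at one end, this loop learns and refines without any hand-designed abstraction. The per-concrete-state table `q_con` exists here only to measure dispersion within each abstract state; it is a simplification of the paper's use of Q-value spread to decide which abstract states to cleave.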

2. RELATED WORK

Offline State Abstraction. Most early studies focus on action-specific (Dietterich, 1999) and option-specific (Jonsson & Barto, 2000) state abstraction. Givan et al. (2003) introduced a notion of state equivalence that can reduce the size of the state space: two states can be aggregated into one abstract state if applying the same action leads to equivalent states with similar rewards. Later, Ravindran & Barto (2004) relaxed this definition of state equivalence by allowing the actions to differ as long as there is a valid mapping between them. Offline state abstraction has also been studied for generalization and transfer in RL (Karia & Srivastava, 2022) and planning (Srivastava et al., 2012).

Graph-Theoretic State Abstraction. Mannor et al. (2004) developed a graph-theoretic state abstraction approach that exploits topological similarities in a state transition graph (STG) to aggregate states in an online manner. Mannor et al.'s definition of state abstraction follows Givan et al.'s notion of equivalent states, except that they update a partial STG iteratively to find the abstractions. A comparable method proposed by Chiu & Soo (2010) performs spectral graph analysis on the STG to decompose it into multiple sub-graphs. However, most graph-theoretic analyses of STGs, such as computing the eigenvectors in Chiu & Soo's work, can become infeasible for problems with large state spaces.

Monte-Carlo Tree Search (MCTS). MCTS approaches offer viable and tractable algorithms for Markovian decision problems with large state spaces (Kocsis & Szepesvári, 2006). Jiang et al. (2014) demonstrated that proper abstraction effectively enhances the performance of MCTS algorithms. However, their clustering-based state abstraction approach is limited to the states enumerated by their algorithm within the partially expanded tree, which makes it ineffectual when limited samples are available to the planning/learning agent. Anand et al. (2015) advanced Jiang et al.'s method by comprehensively aggregating states and state-action pairs, aiming to uncover more symmetries in the domain. Owing to their novel state-action pair abstraction, which extends Givan et al.'s and Ravindran & Barto's notions of abstraction, Anand et al.'s method yields higher-quality policies than other MCTS-based approaches. However, their bottom-up abstraction scheme makes their method computation-

