LEARNING ACHIEVEMENT STRUCTURE FOR STRUCTURED EXPLORATION IN DOMAINS WITH SPARSE REWARD

Abstract

We propose Structured Exploration with Achievements (SEA), a multi-stage reinforcement learning algorithm designed for achievement-based environments, a particular type of environment with an internal achievement set. SEA first uses offline data to learn a representation of the known achievements with a determinant loss function, then recovers the dependency graph of the learned achievements with a heuristic algorithm, and finally interacts with the environment online to learn policies that master known achievements and explore new ones with a controller built on the recovered dependency graph. We empirically demonstrate that SEA can recover the achievement structure accurately and improve exploration in hard domains such as Crafter that are procedurally generated and have high-dimensional observations such as images.

1. INTRODUCTION

Exploration in complex environments with long horizons and high-dimensional input such as images has always been challenging in reinforcement learning. In recent years, multiple works (Stadie et al., 2015; Bellemare et al., 2016; Tang et al., 2017; Pathak et al., 2017; Burda et al., 2019a;b; Badia et al., 2020) have proposed using intrinsic motivation to encourage the agent to explore less frequently visited states or transitions, with success in some hard exploration tasks in RL. Go-Explore (Ecoffet et al., 2021) tackles the hard exploration problem by building a state archive and training worker policies to return to old states and explore new ones. The shared trait of these two streams of RL exploration algorithms is that both are built on the idea of increasing the visiting frequency of unfamiliar states, either by providing intrinsic motivation or by explicit planning.

However, this idea can be ineffective in procedurally generated environments, since states in such environments can be infinite and scattered, to the extent that the same state may never appear twice. This is why exploration in such environments can be extremely hard. Nevertheless, most procedurally generated environments have a finite underlying structure that is fixed across all environment instantiations. This underlying structure is essentially what makes the environment "meaningful". Therefore, discovering this structure becomes vital to solving procedurally generated environments.

In this work, we set our sights on achievement-based environments. Specifically, an achievement-based environment has a finite set of achievements, and the agent is only rewarded when it completes one of the achievements for the first time in an episode. Achievement-based environments are commonly seen in video games such as Minecraft. We argue that this type of environment also models many real-life tasks.
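The reward scheme described above can be made concrete with a small sketch. The class and achievement names below are illustrative, not part of the paper's implementation; the point is that a repeated completion within an episode yields no reward, and the reward is the same regardless of which achievement is unlocked.

```python
from dataclasses import dataclass, field


@dataclass
class AchievementRewardTracker:
    """Sketch of an achievement-based reward: 1.0 the first time an
    achievement is completed in an episode, 0.0 for repeats."""
    achievements: set                      # all achievements the environment defines
    unlocked: set = field(default_factory=set)

    def reset(self):
        """Called at the start of each episode."""
        self.unlocked.clear()

    def reward(self, completed):
        """`completed` is the set of achievements finished on this step."""
        new = (completed & self.achievements) - self.unlocked
        self.unlocked |= new
        # Same reward for every achievement, which is what makes the
        # signal both sparse and potentially confusing.
        return float(len(new))


tracker = AchievementRewardTracker({"collect_wood", "make_table", "collect_diamond"})
r1 = tracker.reward({"collect_wood"})   # 1.0: first unlock this episode
r2 = tracker.reward({"collect_wood"})   # 0.0: already unlocked this episode
tracker.reset()
r3 = tracker.reward({"collect_wood"})   # 1.0: counter resets with the episode
```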
For example, when we try to learn a new dish, the recipe usually decomposes the whole dish into multiple intermediate steps, where completing each step can be seen as an achievement. A good example of an achievement-based environment in the reinforcement learning community is Crafter (Hafner, 2021), a 2D open-world game with procedurally generated world maps that includes 22 achievements (21 in our modified version). We introduce this environment in more detail in Sec. 3.1.

Achievement-based environments bring both benefits and challenges to reinforcement learning exploration. On one hand, the achievement structure is a perfect invariant structure in a procedurally generated environment. On the other hand, the reward signal in such an environment is sparse, since the agent is only rewarded when an achievement is unlocked. The reward signal can even be confusing, since unlocking different achievements grants the same reward.

We propose Structured Exploration with Achievements (SEA), a multi-stage algorithm that effectively solves exploration in achievement-based environments. SEA first tries to recover the achievement structure of the environment from collected trajectories, gathered either from experts or from other bootstrapping RL algorithms. With the recovered structure, SEA deploys a nonparametric meta-controller to learn a set of sub-policies that reach every discovered achievement and start from there to explore new ones.

We empirically evaluate SEA in Crafter and in TreeMaze, an achievement-based environment that we designed based on MiniGrid (Chevalier-Boisvert et al., 2018). In Crafter, our algorithm completes the hard exploration achievements consistently and reaches the hardest achievement in the game (collect diamond) at a non-negligible frequency. None of our tested baselines completes any of the hard exploration achievements, and to the best of our knowledge, SEA is the first algorithm to complete this challenge.
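To illustrate the structure-recovery step, here is one simple heuristic consistent with the description above: infer an edge A → B when A is unlocked before B in (nearly) all episodes that unlock B, then transitively reduce the result. This is a minimal sketch under our own assumptions, not SEA's actual recovery algorithm, and the achievement names are made up.

```python
from collections import defaultdict


def recover_dependency_graph(trajectories, threshold=0.95):
    """Illustrative heuristic for recovering prerequisite edges.

    `trajectories` is a list of achievement sequences, each ordered by
    time of first unlock within an episode. Edge (a, b) is kept when `a`
    precedes `b` in at least `threshold` of the episodes unlocking `b`.
    """
    before = defaultdict(int)   # (a, b) -> episodes where a precedes b
    count = defaultdict(int)    # b -> episodes that unlock b at all
    for traj in trajectories:
        seen = []
        for ach in traj:
            count[ach] += 1
            for earlier in seen:
                before[(earlier, ach)] += 1
            seen.append(ach)
    edges = {(a, b) for (a, b), n in before.items() if n / count[b] >= threshold}
    # Transitive reduction: drop a -> c when a -> b -> c already exists.
    return {(a, c) for (a, c) in edges
            if not any((a, b) in edges and (b, c) in edges for b in count)}


trajs = [["wood", "table", "pickaxe"],
         ["wood", "table", "pickaxe"],
         ["wood", "pickaxe", "table"]]
graph = recover_dependency_graph(trajs)
# "wood" always precedes the other two, but "table" and "pickaxe" appear
# in both orders, so only edges out of "wood" survive.
```

With noisy trajectories the threshold trades off missing true prerequisites against keeping spurious ones; the value 0.95 here is arbitrary.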
We also show that SEA can accurately recover the achievement dependency structure in both environments. Although uncommon in the current RL scene, we argue that an achievement-based reward system can be a good environment design choice: designing a set of achievements can be easier than crafting a specific reward system for the agent to learn. With the structure recovery module in SEA, the agent can automatically discover the optimal progression route to the desired task, alleviating the need for a high-quality achievement set. Furthermore, as shown in the TreeMaze experiments, the achievements need not even be semantically relevant to the tasks.
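One way such a progression route can emerge from a recovered dependency graph is frontier selection: target an unmastered achievement whose prerequisites are all mastered, and occasionally restart exploration from a mastered one. The sketch below is our own illustration of that idea, with assumed names and a made-up `explore_prob` parameter; it is not SEA's actual meta-controller.

```python
import random


def select_goal(graph_edges, mastered, all_achievements, explore_prob=0.3):
    """Pick the next goal from a recovered dependency graph (illustrative).

    Returns ("master", ach) to practice a frontier achievement, or
    ("explore_from", ach) to explore onward from a mastered one.
    """
    # Prerequisites of each achievement according to the recovered edges.
    prereqs = {b: {a for (a, bb) in graph_edges if bb == b}
               for b in all_achievements}
    # Frontier: not yet mastered, but every prerequisite is mastered.
    frontier = [b for b in sorted(all_achievements)
                if b not in mastered and prereqs[b] <= mastered]
    if mastered and (random.random() < explore_prob or not frontier):
        return ("explore_from", random.choice(sorted(mastered)))
    if frontier:
        return ("master", random.choice(frontier))
    return ("explore_from", None)  # nothing discovered yet


mode, goal = select_goal({("wood", "table")}, {"wood"},
                         {"wood", "table"}, explore_prob=0.0)
# With exploration disabled, the only frontier achievement is "table".
```

The controller is nonparametric in the sense that it maintains no trained parameters of its own; it only reads the graph and the set of mastered achievements.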

2. RELATED WORK

Exploration in reinforcement learning. Exploration in RL has long been an important topic, dating back to the ϵ-greedy and UCB algorithms for multi-armed bandit problems. In recent years, intrinsic motivation-based exploration algorithms (Stadie et al., 2015; Bellemare et al., 2016; Tang et al., 2017; Pathak et al., 2017; Burda et al., 2019a;b; Badia et al., 2020) have gained great success in this area. These algorithms estimate the state or transition visiting frequency and use an intrinsic reward to encourage the agent to visit the less frequently visited states. However, they can be ineffective when the environment is procedurally generated, where the states and transitions can be infinite and scattered. Another line of work in this area relevant to ours is the Go-Explore (Ecoffet et al., 2021) series. Go-Explore builds an archive of explored states, which are grouped into cells, learns a set of goal-conditioned policies that reach each cell, and then starts exploration from the archived cells to find new cells, updating the archive iteratively. Go-Explore shows success in the hard exploration tasks in Atari games. However, Go-Explore can also be ineffective in procedurally generated environments, since it uses image down-sampling as its default state abstraction method. Another point where our work differs from Go-Explore is that we do not seek to group all states into cells, but instead group only the few important ones. This is helpful because archiving all the states and learning all the goal-conditioned policies can take up a huge amount of resources.

Hierarchical reinforcement learning. Hierarchical reinforcement learning (HRL) explores the idea of dividing a large and complex task into smaller sub-tasks with reusable skills (Dayan & Hinton, 1992; Sutton et al., 1999; Bacon et al., 2017; Vezhnevets et al., 2017; Nachum et al., 2018). Eysenbach et al. (2019) and Hartikainen et al. (2020) propose to first learn a set of unsupervised skills and then build a policy with the learned skills for faster learning and better generalization; Bacon et al. (2017) build a set of policy options to learn. Goal-conditioned HRL methods (Vezhnevets et al., 2017; Nachum et al., 2018; 2019) train a manager policy that proposes parameterized goals and a goal-conditioned worker policy to reach the proposed goals. Kulkarni et al. (2016); Lyu et al. (2019); Rafati & Noelle (2019); Sohn et al. (2020); Costales et al. (2022) train a set of sub-policies to solve the subtasks provided by the environment and a meta-controller to maximize the reward. This line of work is similar to ours in that they assume the subtasks are provided by the environment. However, in our work, we do not assume access to any additional subgoal information, such as subgoal completion signals from the environment, that is typically required in this line of work. Additionally, the







