LEARNING ACHIEVEMENT STRUCTURE FOR STRUCTURED EXPLORATION IN DOMAINS WITH SPARSE REWARD

Abstract

We propose Structured Exploration with Achievements (SEA), a multi-stage reinforcement learning algorithm designed for achievement-based environments, a class of environments with an internal achievement set. SEA first uses offline data to learn a representation of the known achievements with a determinant loss function, then recovers the dependency graph of the learned achievements with a heuristic algorithm, and finally interacts with the environment online to learn policies that master the known achievements and explore new ones, using a controller built on the recovered dependency graph. We empirically demonstrate that SEA recovers the achievement structure accurately and improves exploration in hard domains such as Crafter, which are procedurally generated and have high-dimensional observations such as images.

1. INTRODUCTION

Exploration in complex environments with long horizons and high-dimensional inputs such as images has always been challenging in reinforcement learning. In recent years, multiple works (Stadie et al., 2015; Bellemare et al., 2016; Tang et al., 2017; Pathak et al., 2017; Burda et al., 2019a;b; Badia et al., 2020) have proposed using intrinsic motivation to encourage the agent to explore less-frequently visited states or transitions, achieving success on several hard-exploration tasks in RL. Go-Explore (Ecoffet et al., 2021) tackles the hard-exploration problem by building a state archive and training worker policies to return to old states and explore new ones. The shared trait of these two streams of RL exploration algorithms is that both are built on the idea of increasing the visitation frequency of unfamiliar states, either by providing intrinsic motivation or by explicit planning. However, this idea can be ineffective in procedurally generated environments, where the states can be infinite and scattered, to the extent that the same state may never appear twice. This is why exploration in such environments can be extremely hard. Nevertheless, most procedurally generated environments have a finite underlying structure that is fixed across all environment instantiations. This underlying structure is essentially what makes the environment "meaningful". Therefore, discovering this structure becomes vital to solving procedurally generated environments.

In this work, we set our sights on achievement-based environments. Specifically, an achievement-based environment has a finite set of achievements, and the agent is rewarded only when it completes one of the achievements for the first time in an episode. Achievement-based environments are common in video games such as Minecraft. We argue that this type of environment also models many real-life tasks.
For example, when we learn a new dish, the recipe usually decomposes the dish into multiple intermediate steps, where completing each step can be seen as an achievement. A good example of an achievement-based environment in the reinforcement learning community is Crafter (Hafner, 2021), a 2D open-world game with procedurally generated world maps that includes 22 achievements (21 in our modified version). We introduce this environment in more detail in Sec. 3.1.

Achievement-based environments bring both benefits and challenges to reinforcement learning exploration. On one hand, the achievement structure is a perfect invariant structure in a procedurally generated environment. On the other hand, the reward signal in such an environment is sparse since
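The reward protocol described above can be sketched concretely. The following is a minimal illustration, not the SEA implementation: all class and achievement names are invented for this example, and real environments such as Crafter expose achievement completion through their own APIs.

```python
class AchievementRewardTracker:
    """Sketch of achievement-based rewards: the agent is rewarded only
    the first time it completes each achievement within an episode."""

    def __init__(self, achievements):
        self.achievements = set(achievements)  # finite, fixed achievement set
        self.completed = set()

    def reset(self):
        # Completions are cleared at the start of each new episode.
        self.completed = set()

    def reward(self, unlocked_this_step):
        # Reward 1 per achievement unlocked for the first time this
        # episode; repeat completions within the episode give no reward.
        new = (set(unlocked_this_step) & self.achievements) - self.completed
        self.completed |= new
        return float(len(new))


tracker = AchievementRewardTracker(["collect_wood", "make_table", "eat_plant"])
tracker.reward(["collect_wood"])                # 1.0: first completion
tracker.reward(["collect_wood"])                # 0.0: repeat, no reward
tracker.reward(["make_table", "eat_plant"])     # 2.0: two new achievements
```

Note that because each achievement pays out at most once per episode, the per-step reward is zero almost everywhere, which is exactly the sparsity that motivates structured exploration.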

