ACTION GUIDANCE: GETTING THE BEST OF SPARSE REWARDS AND SHAPED REWARDS FOR REAL-TIME STRATEGY GAMES

Anonymous authors
Paper under double-blind review

Abstract

Training agents using Reinforcement Learning in games with sparse rewards is a challenging problem, since large amounts of exploration are required to obtain even the first reward. A common approach to this problem is to use reward shaping to help exploration. However, an important drawback of reward shaping is that agents sometimes learn to optimize the shaped reward instead of the true objective. In this paper, we present a novel technique that we call action guidance that successfully trains agents to eventually optimize the true objective in games with sparse rewards while maintaining most of the sample efficiency that comes with reward shaping. We evaluate our approach in a simplified real-time strategy (RTS) game simulator called µRTS.

1. INTRODUCTION

Training agents using Reinforcement Learning with sparse rewards is often difficult (Pathak et al., 2017). First, due to the sparsity of the reward, the agent often spends the majority of the training time doing inefficient exploration, sometimes never reaching the first sparse reward during the entirety of its training. Second, even if the agent has successfully retrieved some sparse rewards, performing proper credit assignment is challenging among the complex sequences of actions that led to these sparse rewards.

Reward shaping (Ng et al., 1999) is a widely-used technique designed to mitigate this problem. It works by providing intermediate rewards that lead the agent towards the sparse rewards, which are the true objective. For example, the sparse reward for a game of Chess is naturally +1 for winning, -1 for losing, and 0 for drawing, while a possible shaped reward might be +1 for every enemy piece the agent takes. One of the critical drawbacks of reward shaping is that the agent sometimes learns to optimize the shaped reward instead of the real objective. In the Chess example, the agent might learn to take as many enemy pieces as possible while still losing the game. A good shaped reward strikes a balance between helping the agent find the sparse reward and being too shaped (so that the agent learns to just maximize the shaped reward), but this balance can be difficult to find.

In this paper, we present a novel technique called action guidance that successfully trains the agent to eventually optimize over sparse rewards while maintaining most of the sample efficiency that comes with reward shaping. It works by constructing a main policy that only learns from the sparse reward function R_M and some auxiliary policies that learn from the shaped reward functions R_A1, R_A2, ..., R_An.
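To make the Chess contrast concrete, the two reward functions can be sketched as follows. This is purely illustrative; the function names and episode encoding are our own and are not part of the paper's environment:

```python
def sparse_reward(outcome, game_over):
    """R_M, the true objective: +1 win, -1 loss, 0 draw; 0 on non-terminal steps."""
    if not game_over:
        return 0.0
    return {"win": 1.0, "loss": -1.0, "draw": 0.0}[outcome]

def shaped_reward(pieces_captured_this_step):
    """R_A, a possible shaped reward: +1 per enemy piece taken, regardless of outcome."""
    return float(pieces_captured_this_step)

# An agent maximizing only R_A can accumulate captures and still lose the game:
captures_per_step = [1, 0, 2]          # three moves, three pieces taken in total
total_shaped = sum(shaped_reward(n) for n in captures_per_step)   # 3.0
total_sparse = sparse_reward("loss", game_over=True)              # -1.0
```

Under R_A the losing episode above looks highly rewarding (3.0), while under R_M it is correctly penalized (-1.0); this mismatch is exactly the failure mode action guidance is designed to avoid.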
During training, we use the same rollouts to train the main and auxiliary policies, and initially set a high probability for the main policy to take action guidance from the auxiliary policies; that is, the main policy will execute actions sampled from the auxiliary policies. Both the main policy and the auxiliary policies are then updated via off-policy policy gradient. As training goes on, the main policy becomes more independent and executes more actions sampled from its own distribution. The auxiliary policies learn from shaped rewards and therefore make the training sample-efficient, while the main policy learns from the original sparse reward and therefore ensures that the agent will eventually optimize over the true objective. Action guidance can thus be seen as combining reward shaping, used to train the auxiliary policies, interleaved with a form of imitation learning that guides the main policy using these auxiliary policies.

We examine action guidance in the context of a real-time strategy (RTS) game simulator called µRTS on three sparse-reward tasks of varying difficulty. For each task, we compare the performance of training agents with the sparse reward function R_M, a shaped reward function R_A1, and action guidance with a single auxiliary policy learning from R_A1. The main highlights are:

Action guidance is sample-efficient. Since the auxiliary policy learns from R_A1 and the main policy takes action guidance from the auxiliary policy during the initial stage of training, the main policy is more likely to discover the first sparse reward quickly and learn more efficiently. Empirically, action guidance reaches almost the same level of sample efficiency as reward shaping in all three tasks tested.

The true objective is being optimized. During the course of training, the main policy never sees the shaped rewards. This ensures that the main policy, which is the agent we are really interested in, is always optimizing against the true objective and is less biased by the shaped rewards. As an example, Figure 1 shows that the main policy trained with action guidance eventually learns to win the game as fast as possible, even though it has only learned from the match outcome reward (+1 for winning, -1 for losing, and 0 for drawing). In contrast, the agents trained with reward shaping learn more diverse sets of behaviors that result in high shaped reward. To support further research in this field, we make our source code available at GitHub 1, as well as all the metrics, logs, and recorded videos 2.
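The action-selection part of this scheme can be sketched as below. The linear annealing schedule, function names, and default constants are our own simplifications for illustration; the paper's actual probability schedule and the off-policy policy-gradient update (which must account for actions executed by an auxiliary policy) are not shown:

```python
import random

def select_behavior_action(main_policy, aux_policy, state, step,
                           anneal_steps=10_000, final_prob=0.0):
    """With probability `prob`, annealed linearly from 1.0 toward `final_prob`,
    execute the auxiliary policy's action (action guidance); otherwise execute
    the main policy's own action. Returns the action and a flag recording which
    policy produced it, so the later off-policy update can account for it."""
    prob = max(final_prob, 1.0 - step / anneal_steps)
    use_aux = random.random() < prob
    behavior = aux_policy if use_aux else main_policy
    return behavior(state), use_aux
```

Early in training (step 0) the main policy almost always follows the auxiliary policy's guidance; by `anneal_steps` it acts entirely on its own, while both policies keep being updated from the same rollouts.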

2. RELATED WORK

In this section, we briefly summarize popular techniques proposed to address the challenge of sparse rewards.

Reward Shaping. Reward shaping is a common technique where the human designer uses domain knowledge to define additional intermediate rewards for the agents. Ng et al. (1999) show that a slightly more restricted form of state-based reward shaping has better theoretical properties for preserving the optimal policy.

Transfer and Curriculum Learning. Sometimes learning the target task with sparse rewards is too challenging, and it is preferable to learn some easier tasks first. Transfer learning leverages this idea by training agents on easier source tasks and later transferring the knowledge through value functions (Taylor et al., 2007) or reward shaping (Svetlik et al., 2017). Curriculum learning further extends transfer learning by automatically designing and choosing a full sequence of source tasks (i.e. a curriculum) (Narvekar & Stone, 2018).

Imitation Learning. Alternatively, it is possible to directly provide examples of human demonstration or expert replay for the agents to mimic via Behavior Cloning (BC) (Bain & Sammut, 1995), which uses supervised learning to learn a policy given the state-action pairs from expert replays. Inverse Reinforcement Learning (IRL) (Abbeel & Ng, 2004) instead recovers a reward function from expert demonstrations, which is then used to train agents.

Curiosity-driven Learning. Curiosity-driven learning seeks to design intrinsic reward functions (Burda et al., 2019) using metrics such as prediction errors (Houthooft et al., 2016) and "visit counts" (Bellemare et al., 2016; Lopes et al., 2012). These intrinsic rewards encourage the agents to explore unseen states.

Goal-oriented Learning. In certain tasks, it is possible to describe a goal state and use it in conjunction with the current state as input (Schaul et al., 2015).
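A hypothetical sketch of this goal-conditioned input: the goal description is simply concatenated with the state features and fed to a single policy or value network (the function name and feature encodings here are illustrative, not from any of the cited works):

```python
def goal_conditioned_obs(state, goal):
    """Concatenate state features and goal features into one observation
    vector, so a single network can be queried for different goals."""
    return list(state) + list(goal)

# The same agent can be asked to pursue different goals at evaluation time:
obs_goal_a = goal_conditioned_obs([0.1, 0.2], [1, 0])  # e.g. "reach location A"
obs_goal_b = goal_conditioned_obs([0.1, 0.2], [0, 1])  # e.g. "reach location B"
```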
Hindsight experience replay (HER) (Andrychowicz et al., 2017) makes better use of existing data in the experience replay buffer by replaying each episode with different goals. HER has been shown to be an effective technique for sparse-reward tasks.

Hierarchical Reinforcement Learning (HRL). If the target task is difficult to learn directly, it is also possible to hierarchically structure the task using experts' knowledge and train hierarchical agents, which generally involves a main policy that learns abstract goals, time, and actions, as well as auxiliary policies that learn primitive actions and specific goals (Dietterich, 2000). HRL is especially popular in RTS games with combinatorial action spaces (Pang et al., 2019; Ye et al., 2020). The most closely related work is perhaps Scheduled Auxiliary Control (SAC-X) (Riedmiller et al., 2018), which is an HRL algorithm that trains auxiliary policies to perform primitive actions with

1 https://github.com/anonymous-research-code/action-guidance
2 Blinded for peer review

