ACTION GUIDANCE: GETTING THE BEST OF SPARSE REWARDS AND SHAPED REWARDS FOR REAL-TIME STRATEGY GAMES

Anonymous authors
Paper under double-blind review

Abstract

Training agents using Reinforcement Learning in games with sparse rewards is a challenging problem, since large amounts of exploration are required to retrieve even the first reward. To tackle this problem, a common approach is to use reward shaping to help exploration. However, an important drawback of reward shaping is that agents sometimes learn to optimize the shaped reward instead of the true objective. In this paper, we present a novel technique, which we call action guidance, that successfully trains agents to eventually optimize the true objective in games with sparse rewards while maintaining most of the sample efficiency that comes with reward shaping. We evaluate our approach in a simplified real-time strategy (RTS) game simulator called µRTS.



Training agents using Reinforcement Learning with sparse rewards is often difficult (Pathak et al., 2017). First, due to the sparsity of the reward, the agent often spends the majority of the training time exploring inefficiently, and sometimes does not reach even the first sparse reward during the entirety of its training. Second, even if the agent has successfully retrieved some sparse rewards, performing proper credit assignment is challenging among the complex sequences of actions that led to these sparse rewards. Reward shaping (Ng et al., 1999) is a widely used technique designed to mitigate this problem. It works by providing intermediate rewards that lead the agent towards the sparse rewards, which are the true objective. For example, the sparse reward for a game of Chess is naturally +1 for winning, -1 for losing, and 0 for drawing, while a possible shaped reward might be +1 for every enemy piece the agent takes. One of the critical drawbacks of reward shaping is that the agent sometimes learns to optimize the shaped reward instead of the real objective. In the Chess example, the agent might learn to take as many enemy pieces as possible while still losing the game. A good shaped reward strikes a balance between guiding the agent towards the sparse reward and being so heavily shaped that the agent learns only to maximize the shaped reward, but this balance can be difficult to find.

In this paper, we present a novel technique called action guidance that successfully trains the agent to eventually optimize over sparse rewards while maintaining most of the sample efficiency that comes with reward shaping. It works by constructing a main policy that learns only from the sparse reward function $R_M$ and some auxiliary policies that learn from the shaped reward functions $R_{A_1}, R_{A_2}, \ldots, R_{A_n}$.
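To make the distinction between the two kinds of reward function concrete, the Chess example can be sketched as below. This is only an illustrative sketch: the `Transition` record, its fields, and the function names `sparse_reward` and `shaped_reward` are hypothetical and are not part of our implementation or of µRTS.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Transition:
    """Hypothetical summary of one move in a Chess game (illustrative only)."""
    captured_enemy_piece: bool     # did the agent take an enemy piece on this move?
    outcome: Optional[str] = None  # "win", "loss", "draw", or None while the game continues


def sparse_reward(t: Transition) -> float:
    """Sparse reward: +1 for winning, -1 for losing, 0 for drawing or mid-game."""
    return {"win": 1.0, "loss": -1.0, "draw": 0.0}.get(t.outcome, 0.0)


def shaped_reward(t: Transition) -> float:
    """Shaped reward: +1 for every enemy piece the agent takes."""
    return 1.0 if t.captured_enemy_piece else 0.0


# A game in which the agent takes three pieces but still loses.
episode = [
    Transition(True),
    Transition(False),
    Transition(True),
    Transition(True, outcome="loss"),
]
sparse_return = sum(sparse_reward(t) for t in episode)
shaped_return = sum(shaped_reward(t) for t in episode)
print(sparse_return, shaped_return)
```

In this toy episode the shaped return is positive while the sparse return is negative, which is the mismatch described above: an agent that maximizes only the shaped reward can look successful while failing the true objective.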
During training, we use the same rollouts to train the main and auxiliary policies, and we initially set a high probability that the main policy takes action guidance from the auxiliary policies; that is, the main policy executes actions sampled from the auxiliary policies. The main policy and auxiliary policies are then updated via off-policy policy gradient. As training goes on, the main policy becomes more independent and executes more actions sampled from its own policy. The auxiliary policies learn from shaped rewards and therefore make the training sample-efficient, while the main policy learns from the original sparse reward and therefore ensures that the agent eventually optimizes the true objective. Action guidance can be seen as reward shaping, used to train the auxiliary policies, interleaved with a form of imitation learning that guides the main policy using these auxiliary policies. We examine action guidance in the context of a real-time strategy (RTS) game simulator called µRTS on three sparse-reward tasks of varying difficulty. For each task, we compare the performance of training agents with the sparse reward function $R_M$, a shaped reward function $R_{A_1}$, and action guidance with a single auxiliary policy learning from $R_{A_1}$. The main highlights are:

