ADAPTING TO REWARD PROGRESSIVITY VIA SPECTRAL REINFORCEMENT LEARNING

Abstract

In this paper we consider reinforcement learning tasks with progressive rewards; that is, tasks where the rewards tend to increase in magnitude over time. We hypothesise that this property may be problematic for value-based deep reinforcement learning agents, particularly if the agent must first succeed in relatively unrewarding regions of the task in order to reach more rewarding regions. To address this issue, we propose Spectral DQN, which decomposes the reward into frequencies such that the high frequencies only activate when large rewards are found. This allows the training loss to be balanced so that it weights small- and large-reward regions more evenly. In two domains with extreme reward progressivity, where standard value-based methods struggle significantly, Spectral DQN makes much further progress. Moreover, when evaluated on a set of six standard Atari games that do not overtly favour the approach, Spectral DQN remains more than competitive: while it underperforms one of the benchmarks in a single game, it comfortably surpasses the benchmarks in three games. These results demonstrate that the approach is not overfit to its target problem, and suggest that Spectral DQN may have advantages beyond addressing reward progressivity.

1. INTRODUCTION

In decision-making tasks involving compounding returns, such as stock market investing, it is common for the rewards received by the agent to increase in magnitude over time. An investor who starts out with little capital will achieve relatively small profits or losses early on, but if they are successful, their capital, and hence their potential profits and losses, will gradually increase. This property also arises in many game settings. For example, in the Atari game Video Pinball, the player can increase a "bonus multiplier" that increases the score paid per bumper hit. Since the bonus multiplier does not reset until the player dies, the rewards typically increase with time. In this paper, we refer to any task exhibiting this phenomenon as a progressive reward task. We hypothesise that reward progressivity may be problematic for value-based deep reinforcement learning agents. Our rationale is that the temporal difference errors that arise under such algorithms typically scale with the magnitude of the training targets. Accordingly, experience collected from states with large expected returns may dominate training, causing the agent's performance to degrade in other states. While this is rational in one sense (generally, it is more important to make accurate decisions in high-stakes situations), it is potentially harmful on progressive reward tasks. This is because, on progressive reward tasks, it may be necessary for the agent to perform well in relatively unrewarding regions before it can reach more rewarding regions. For example, in stock market investing, an investor can only reach states where they have large capital if they first perform well with small capital. Note too that the rewards need not be strictly progressive for this problem to arise; all that is needed is for the rewards to be progressive over some extended period.
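The scaling intuition above can be made concrete with a toy calculation (illustrative only; the predictions and targets below are hypothetical numbers, not taken from any experiment): under a squared TD loss, the gradient with respect to the prediction grows linearly with the TD error, so transitions from high-reward regions contribute proportionally larger updates.

```python
def td_loss_grad(prediction, target):
    # Gradient of the squared TD loss, 0.5 * (target - prediction) ** 2,
    # with respect to the prediction.
    return -(target - prediction)

# Two transitions with identical (untrained) predictions, but with targets
# drawn from a low-reward and a high-reward region of a progressive task:
grad_small = td_loss_grad(0.0, 1.0)
grad_large = td_loss_grad(0.0, 1000.0)
# The high-reward transition's gradient is 1000x larger, so it dominates
# any minibatch that mixes the two regions.
```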
Algorithms that clip the reward, such as DQN (Mnih et al., 2015), are less susceptible to this problem, because they do not perceive increases in reward magnitude beyond the clipping point. However, it is straightforward to construct examples where reward clipping masks the optimal solution. For example, in Bowling, clipping the rewards to [-1, 1] makes bowling a strike appear no better than hitting a single pin (Pohlen et al., 2018). While subsequent approaches make learning feasible without reward clipping (van Hasselt et al., 2016; Pohlen et al., 2018), we show in Section 3 that these methods are negatively impacted by strong reward progressivity. Motivated by this, we seek an algorithm that is capable of learning from unclipped rewards, while mitigating the negative impact that large rewards have on the agent's ability to learn in less rewarding regions of the task. To meet this aim, we propose Spectral DQN. Under this approach, rewards are decomposed into a multi-dimensional spectral reward, where each component is bounded to [-1, 1]. The upper frequencies of the spectrum activate only when large magnitude rewards are received. For example, in Video Pinball, the lowest frequency activates on any score, while the upper frequencies only activate when the player has accumulated a large bonus multiplier. The bounded nature of the spectral rewards allows the agent to learn expected spectral returns without instability arising from large training targets. Meanwhile, the full, unclipped expected return can be recovered by summing across the return spectrum. (The full, technical definitions of these terms are provided in Section 4.) In terms of addressing reward progressivity, a key advantage of this approach is that it allows us to balance the training loss and prevent the upper frequencies from receiving undue weight.
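As a rough sketch of this kind of decomposition (the precise definition is given in Section 4; the power-of-two thresholds below are an illustrative choice, not necessarily the paper's), each frequency can carry the slice of the reward magnitude between consecutive thresholds, rescaled to [-1, 1], so that a weighted sum across frequencies recovers the original reward:

```python
def spectral_decompose(r, n_freqs=8, base=2):
    # Split a scalar reward into n_freqs components, each in [-1, 1].
    # Frequency i carries the part of |r| lying between base**i - 1 and
    # base**(i + 1) - 1, rescaled to [0, 1], with r's sign applied.
    sign = 1.0 if r >= 0 else -1.0
    mag = abs(r)
    comps = []
    for i in range(n_freqs):
        lo, hi = base ** i - 1, base ** (i + 1) - 1
        comps.append(sign * min(max(mag - lo, 0.0), hi - lo) / (hi - lo))
    return comps

def spectral_recompose(comps, base=2):
    # The weighted sum across frequencies recovers the original reward
    # (exactly, for |r| <= base**n_freqs - 1).
    return sum(c * (base ** (i + 1) - base ** i) for i, c in enumerate(comps))
```

Under this sketch, a reward of 5 activates the first three frequencies (the third only partially), while a reward of 0.5 touches only the lowest; small rewards never excite the upper frequencies, which is what allows the upper frequencies' loss terms to be down-weighted without distorting learning in low-reward regions.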
To test whether this approach helps mitigate the impact of reward progressivity, we perform two sets of experiments: First, we apply Spectral DQN to two domains that we specifically designed to exhibit strong reward progressivity. While previous methods struggle on these tasks, failing to learn almost anything at all in the more extreme domain, Spectral DQN performs markedly better, making considerable progress on both tasks. Next, we apply our approach to a less constructed set of tasks; namely, six standard Atari games. Only some of these games exhibit reward progressivity, and none of them nearly so strongly as the extreme domains from earlier. While Spectral DQN does not outperform the benchmarks as noticeably in these domains (it is beaten by one of the benchmarks in a single game, but comfortably outperforms both benchmarks in three games), the results clearly show that Spectral DQN is not overfit to constructed domains. Moreover, since its outperformance in some of the games cannot be attributed solely to reward progressivity, it appears that the approach may offer additional advantages, as explained later in our analysis of the results.

2. PREVIOUS APPROACHES TO HANDLING REWARD VARIABILITY

In the Atari domain, where the DQN algorithm was first evaluated, the agent receives a reward equal to the score increase at each frame. However, since score magnitudes vary greatly across games, DQN clips all rewards to [-1, +1]. This constrains the size of the training targets so that it is easier to find a step size that performs well across the entire suite of games. While this heuristic turns out to perform well in Atari, a clear downside of the approach is that the agent becomes unable to distinguish large rewards from small rewards. The Pop-Art algorithm (van Hasselt et al., 2016) offers a more principled solution. It trains from the unclipped reward, while normalising the training targets to have zero mean and unit variance. To ensure that the network's predictions are preserved when the normalisation parameters are updated, the output of the final layer is scaled and shifted by an offsetting amount. Pohlen et al. (2018) propose an alternative approach to handling unclipped rewards that reduces the variance of the training targets by applying a squashing function, $h$. The agent learns squashed action-values, $\tilde{Q}(s, a) = h(Q(s, a))$, via the transformed Bellman backup:

$$\tilde{Q}(s_t, a_t) \leftarrow \tilde{Q}(s_t, a_t) + \alpha \left[ h\!\left( r_t + \gamma \max_{a'} h^{-1}\!\left( \tilde{Q}(s_{t+1}, a') \right) \right) - \tilde{Q}(s_t, a_t) \right] \quad (1)$$

The authors prove that the transformed Bellman operator remains a contraction provided that both $h$ and $h^{-1}$ are Lipschitz continuous and $\gamma < 1 / (L_h L_{h^{-1}})$, where $L_h$ and $L_{h^{-1}}$ are the respective Lipschitz constants. In their experiments, they use the following squashing function:

$$h(x) = \mathrm{sign}(x)\left( \sqrt{|x| + 1} - 1 \right) + \varepsilon x \quad (2)$$

where $\varepsilon > 0$ ensures that $h^{-1}$ is Lipschitz continuous. For the remainder of this paper, we refer to this approach as target compression. In the Atari domain, Pohlen et al. found target compression to perform significantly better than Pop-Art.
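For reference, the squashing function of Eq. (2) and its inverse can be sketched as follows (a minimal sketch: eps = 1e-2 is an assumed setting in the spirit of Pohlen et al., and the closed-form inverse is obtained by solving Eq. (2) as a quadratic in sqrt(|x| + 1)):

```python
import math

def h(x, eps=1e-2):
    # Squashing function of Eq. (2): sign(x) * (sqrt(|x| + 1) - 1) + eps * x.
    return math.copysign(1.0, x) * (math.sqrt(abs(x) + 1.0) - 1.0) + eps * x

def h_inv(y, eps=1e-2):
    # Closed-form inverse of h: solve eps*u**2 + u - (1 + eps + |y|) = 0
    # for u = sqrt(|x| + 1), then recover |x| = u**2 - 1.
    return math.copysign(1.0, y) * (
        ((math.sqrt(1.0 + 4.0 * eps * (abs(y) + 1.0 + eps)) - 1.0) / (2.0 * eps)) ** 2
        - 1.0
    )
```

Because the square-root term dominates for large inputs, targets are compressed roughly to their square root (for example, h(10000) is approximately 199 with eps = 1e-2), which keeps training targets in a manageable range while remaining invertible.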
It has since become the dominant method for handling unclipped rewards in Atari, with several recent large-scale agents (Kapturowski et al., 2019; Badia et al., 2020a;b), including the state-of-the-art Agent57, favouring target compression over Pop-Art.

