REINFORCEMENT LEARNING WITH RANDOM DELAYS

Abstract

Action and observation delays occur commonly in Reinforcement Learning applications such as remote control scenarios. We study the anatomy of randomly delayed environments and show that partially resampling trajectory fragments in hindsight allows for off-policy multi-step value estimation. We apply this principle to derive Delay-Correcting Actor-Critic (DCAC), an algorithm based on Soft Actor-Critic with significantly better performance in environments with delays. We show this theoretically and demonstrate it practically on a delay-augmented version of the MuJoCo continuous control benchmark.

1. INTRODUCTION

This article is concerned with the Reinforcement Learning (RL) scenario depicted in Figure 1, which is commonly encountered in real-world applications (Mahmood et al., 2018; Fuchs et al., 2020; Hwangbo et al., 2017). Oftentimes, actions generated by the agent are not immediately applied in the environment, and observations do not immediately reach the agent. Such environments have mainly been studied under the unrealistic assumption of constant delays (Nilsson et al., 1998; Ge et al., 2013; Mahmood et al., 2018). In this setting, prior work has proposed planning algorithms that naively try to undelay the environment by simulating future observations (Walsh et al., 2008; Schuitema et al., 2010; Firoiu et al., 2018). We propose an off-policy, planning-free approach that enables low-bias and low-variance multi-step value estimation in environments with random delays. First, we study the anatomy of such environments in order to exploit their structure, defining Random-Delay Markov Decision Processes (RDMDP). Then, we show how to transform trajectory fragments collected under one policy into trajectory fragments distributed according to another policy. We demonstrate this principle by deriving a novel off-policy algorithm (DCAC) based on Soft Actor-Critic (SAC), which exhibits greatly improved performance in delayed environments. Along with this work we release our code, including a wrapper that conveniently augments any OpenAI gym environment with custom delays.

2. DELAYED ENVIRONMENTS

We frame the general setting of real-world Reinforcement Learning in terms of an agent, random observation delays, random action delays, and an undelayed environment. At the beginning of each time-step, the agent starts computing a new action from the most recent available delayed observation. Meanwhile, a new observation is sent and the most recent delayed action is applied in the undelayed environment. Real-valued delays are rounded up to the next integer time-step.
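These communication dynamics can be sketched as a wrapper around an undelayed environment. The following is a minimal, hypothetical illustration (not the released implementation): observations and actions are messages stamped with a send time and a randomly delayed arrival time, the agent always receives the most recently sent observation that has arrived, the environment applies the most recently sent action that has arrived, and real-valued maximum delays are rounded up to whole time-steps. The `DummyEnv`, delay bounds, and message-buffer details are illustrative assumptions.

```python
import math
import random


class DummyEnv:
    """Toy undelayed environment whose state is a step counter (illustration only)."""
    def reset(self):
        self.t = 0
        return self.t

    def step(self, action):
        self.t += 1
        return self.t, 0.0, False, {}


class RandomDelayWrapper:
    """Hypothetical sketch of random observation/action delays around an
    undelayed environment; delay distributions are illustrative assumptions."""

    def __init__(self, env, max_obs_delay=2.7, max_act_delay=1.3, seed=0):
        self.env = env
        self.rng = random.Random(seed)
        # Real-valued delays are rounded up to the next integer time-step.
        self.max_obs_delay = math.ceil(max_obs_delay)
        self.max_act_delay = math.ceil(max_act_delay)

    def reset(self):
        obs = self.env.reset()
        self.t = 0
        self.obs_inbox = [(0, 0, obs)]  # (send_time, arrival_time, observation)
        self.act_inbox = []             # (send_time, arrival_time, action)
        return obs

    def step(self, action):
        self.t += 1
        # The action sent now arrives after a random integer delay.
        self.act_inbox.append(
            (self.t, self.t + self.rng.randint(0, self.max_act_delay), action))
        # The undelayed environment applies the most recently *sent* action
        # that has already arrived (falling back to the current action at start).
        arrived_acts = [(s, a) for s, arr, a in self.act_inbox if arr <= self.t]
        applied = max(arrived_acts)[1] if arrived_acts else action
        obs, reward, done, info = self.env.step(applied)
        # The new observation is sent now and also arrives after a random delay;
        # the agent sees the most recently sent observation that has arrived.
        self.obs_inbox.append(
            (self.t, self.t + self.rng.randint(0, self.max_obs_delay), obs))
        arrived_obs = [(s, o) for s, arr, o in self.obs_inbox if arr <= self.t]
        delayed_obs = max(arrived_obs)[1]
        return delayed_obs, reward, done, info
```

With the counter environment above, the observation received at time-step t always corresponds to a true state between t − ⌈max_obs_delay⌉ and t, since any observation sent ⌈max_obs_delay⌉ steps ago is guaranteed to have arrived.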



Figure 1: A delayed environment can be decomposed into an undelayed environment and delayed communication dynamics.

