REINFORCEMENT LEARNING WITH RANDOM DELAYS

Abstract

Action and observation delays commonly occur in many Reinforcement Learning applications, such as remote control scenarios. We study the anatomy of randomly delayed environments, and show that partially resampling trajectory fragments in hindsight allows for off-policy multi-step value estimation. We apply this principle to derive Delay-Correcting Actor-Critic (DCAC), an algorithm based on Soft Actor-Critic that achieves significantly better performance in environments with delays. We establish this theoretically and demonstrate it empirically on a delay-augmented version of the MuJoCo continuous control benchmark.

1. INTRODUCTION

This article is concerned with the Reinforcement Learning (RL) scenario depicted in Figure 1, which is commonly encountered in real-world applications (Mahmood et al., 2018; Fuchs et al., 2020; Hwangbo et al., 2017). Oftentimes, actions generated by the agent are not immediately applied in the environment, and observations do not immediately reach the agent. Such environments have mainly been studied under the unrealistic assumption of constant delays (Nilsson et al., 1998; Ge et al., 2013; Mahmood et al., 2018). In this setting, prior work has proposed planning algorithms that naively try to undelay the environment by simulating future observations (Walsh et al., 2008; Schuitema et al., 2010; Firoiu et al., 2018). We propose an off-policy, planning-free approach that enables low-bias, low-variance multi-step value estimation in environments with random delays. First, we study the anatomy of such environments in order to exploit their structure, defining Random-Delay Markov Decision Processes (RDMDP). Then, we show how to transform trajectory fragments collected under one policy into trajectory fragments distributed according to another policy. We demonstrate this principle by deriving a novel off-policy algorithm (DCAC) based on Soft Actor-Critic (SAC), which exhibits greatly improved performance in delayed environments. Along with this work we release our code, including a wrapper that conveniently augments any OpenAI gym environment with custom delays.

2. DELAYED ENVIRONMENTS

We frame the general setting of real-world Reinforcement Learning in terms of an agent, random observation delays, random action delays, and an undelayed environment. At the beginning of each time-step, the agent starts computing a new action from the most recent available delayed observation. Meanwhile, a new observation is sent and the most recent delayed action is applied in the undelayed environment. Real-valued delays are rounded up to the next integer time-step. For a given delayed observation s_t, the observation delay ω_t refers to the number of time-steps elapsed from when s_t finishes being captured to when it starts being used to compute a new action. The action delay α_t refers to the number of time-steps elapsed from when the last action influencing s_t starts being computed to one time-step before s_t finishes being captured. We further refer to ω_t + α_t as the total delay of s_t. As a motivating illustration of the real-world delayed setting, we have collected a dataset of communication delays between a decision-making computer and a flying robot over WiFi, summarized in Figure 2. In the presence of such delays, the naive approach is to simply use the last received observation. In this case, any delay longer than one time-step violates the Markov assumption, since the last sent action becomes an unobserved part of the current state of the environment. To overcome this issue, we define a Markov Decision Process that takes the communication dynamics into account.
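The control loop above can be sketched as a toy simulation (a hypothetical illustration, not the released wrapper: the channel model, delay range, and variable names are ours). Actions and observations travel through channels with random integer delays, and the agent only ever sees the most recent observation that has arrived; the gap between `env_state` and `last_obs` is exactly the unobserved information that breaks the Markov assumption.

```python
import random

def sample_delay(rng, max_delay=2):
    # real-valued delays are rounded up to whole time-steps
    return rng.randint(0, max_delay)

def run_loop(n_steps=10, seed=0):
    rng = random.Random(seed)
    act_channel = []   # (arrival_step, action) pairs still in flight
    obs_channel = []   # (arrival_step, observation) pairs still in flight
    env_state = 0      # state of the undelayed environment
    last_obs = 0       # most recent delayed observation available to the agent
    trace = []
    for t in range(n_steps):
        # the agent computes a new action from the *delayed* observation
        action = last_obs + 1
        act_channel.append((t + sample_delay(rng), action))
        # the undelayed environment applies the most recent arrived action
        arrived = [a for (ta, a) in act_channel if ta <= t]
        if arrived:
            env_state = arrived[-1]
        # a new observation of the environment is sent back to the agent
        obs_channel.append((t + sample_delay(rng), env_state))
        # the agent receives whichever observations have arrived by now
        received = [o for (to, o) in obs_channel if to <= t]
        if received:
            last_obs = received[-1]
        trace.append((t, env_state, last_obs))
    return trace
```

Running the loop shows `last_obs` lagging behind `env_state` whenever delays exceed one time-step, which is the Markov violation discussed above.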

2.1. RANDOM DELAY MARKOV DECISION PROCESSES

To ensure the Markov property in delayed settings, it is necessary to augment the delayed observation with at least the last K sent actions, where K is the combined maximum possible observation and action delay. This is required because the oldest actions, together with the delayed observation, describe the current state of the undelayed environment, whereas the most recent actions are yet to be applied (see Appendix C). This augmentation suffices to ensure that the Markov property is met in certain delayed environments. On the other hand, it is possible to do much better when the delays themselves are also part of the state-space. First, this allows us to model self-correlated delays, e.g. discarding outdated actions and observations (see Appendix A.1). Second, this provides the model with useful information about how old an observation is and which actions have been applied since. Third, knowledge of the total delay allows for efficient credit assignment and off-policy partial-trajectory resampling, as we show in this work.

To avoid conflicting with the subscript notation, we index the elements of action buffers using square brackets: u[1] is the most recent and u[K] is the oldest action in the buffer. We denote slices by u[i:j] = (u[i], ..., u[j]) and u[i:−j] = (u[i], ..., u[K−j]). We slightly overload this notation and additionally define u[0] = a.

Definition 1. A Random-Delay Markov Decision Process RDMDP(E, p_ω, p_α) = (X, A, μ, p) augments a Markov Decision Process E = (S, A, µ, p) with: (1) state-space X = S × A^K × N², where s ∈ S is the delayed observation, u ∈ A^K is a buffer of the last K sent actions, ω ∈ N is the observation delay, and α ∈ N is the action delay as defined above, (2) action-space A, (3) initial state distribution μ(x_0) = μ(s, u, ω, α) = µ(s) δ(u − c_u) δ(ω − c_ω) δ(α − c_α),
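As a concrete illustration, the augmented state x = (s, u, ω, α) and the square-bracket indexing convention can be sketched as follows (a minimal sketch with a scalar observation and K = 4; the class and helper names are ours, not the paper's released code):

```python
from dataclasses import dataclass
from typing import Tuple

K = 4  # combined maximum observation + action delay (example value)

@dataclass
class RDMDPState:
    s: float               # delayed observation (scalar here for simplicity)
    u: Tuple[float, ...]   # buffer of the last K sent actions, most recent first
    omega: int             # observation delay of s
    alpha: int             # action delay of s

def buf(x: RDMDPState, a: float, i: int) -> float:
    """Square-bracket indexing u[i]: u[1] is the most recent buffered action,
    u[K] the oldest, and (by the overloaded convention) u[0] = a."""
    return a if i == 0 else x.u[i - 1]
```

Note that the paper's 1-based u[1] corresponds to Python index 0, which is why the helper shifts the index by one.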



Figure 1: A delayed environment can be decomposed into an undelayed environment and delayed communication dynamics.

Figure 3: Influence of actions on delayed observations in delayed environments.

Figure 2: Histogram of real-world WiFi delays.

(4) transition distribution p(s′, u′, ω′, α′, r | s, u, ω, α, a) = f_{ω−ω′}(s′, α′, r | s, u, ω, α, a) p_ω(ω′ | ω) p_u(u′ | u, a).
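Concretely, the action-buffer factor p_u is deterministic: the new action is pushed to the front of the buffer while the oldest entry is dropped, so u′ = (a, u[1:−1]). This can be sketched as follows (the delay kernel p_ω below is a hypothetical stand-in, since the true delay dynamics depend on the communication channel):

```python
import random

K = 4  # combined maximum delay (example value)

def update_buffer(u, a):
    """Deterministic buffer update p_u(u' | u, a): u' = (a,) followed by the
    first K-1 entries of u, dropping the oldest action u[K]."""
    return (a,) + tuple(u[:K - 1])

def sample_obs_delay(rng, omega, max_delay=K - 1):
    """Hypothetical Markov delay kernel p_omega(omega' | omega): a random walk
    clipped to [0, max_delay]. Real delay kernels are environment-specific."""
    return min(max(omega + rng.choice((-1, 0, 1)), 0), max_delay)
```

Pushing to the front matches the indexing convention of Definition 1, where u[1] is always the most recently sent action.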

