ACTING IN DELAYED ENVIRONMENTS WITH NON-STATIONARY MARKOV POLICIES

Abstract

The standard Markov Decision Process (MDP) formulation hinges on the assumption that an action is executed immediately after it is chosen. However, this assumption is often unrealistic and can lead to catastrophic failures in applications such as robotic manipulation, cloud computing, and finance. We introduce a framework for learning and planning in MDPs where the decision-maker commits actions that are executed with a delay of m steps. The brute-force state-augmentation baseline, where the state is concatenated with the last m committed actions, suffers from a complexity exponential in m, as we show for policy iteration. We then prove that with execution delay, deterministic Markov policies in the original state-space are sufficient for attaining maximal reward, but need to be non-stationary. As for stationary Markov policies, we show they are sub-optimal in general. Consequently, we devise a non-stationary Q-learning style model-based algorithm that solves delayed execution tasks without resorting to state-augmentation. Experiments on tabular, physical, and Atari domains reveal that it converges quickly to high performance even for substantial delays, while standard approaches that either ignore the delay or rely on state-augmentation struggle or fail due to divergence. The code is available at https://github.com/galdl/rl_delay_basic.git.

1. INTRODUCTION

The body of work on reinforcement learning (RL) and planning problem setups has grown vast in recent decades. Examples of such distinctions are different objectives and constraints, assumptions on access to the model or logged trajectories, on-policy or off-policy paradigms, etc. (Puterman, 2014). However, the study of delay in RL remains scarce. It is almost always assumed that an action is executed as soon as the agent chooses it. This assumption seldom holds in real-world applications (Dulac-Arnold et al., 2019). Latency in action execution can stem either from the increasing computational complexity of modern systems and related tasks, or from the infrastructure itself. The wide range of such applications includes robotic manipulation, cloud computing, financial trading, sensor feedback in autonomous systems, and more. To elaborate, consider an autonomous vehicle required to respond immediately to a sudden hazard on the highway. Driving at high speed, it suffers from perception-module latency when inferring the surrounding scene, as well as delay in actuation once a decision has been made. While the latter phenomenon is an instance of execution delay, the former corresponds to observation delay. These two types of delay are in fact equivalent and can thus be treated with the same tools (Katsikopoulos & Engelbrecht, 2003).

Related works. The notion of delay is prominent in control theory with linear time-invariant systems (Bar-Ilan & Sulem, 1995; Dugard & Verriest, 1998; Richard, 2003; Fridman, 2014; Bruder & Pham, 2009). While the delayed control literature is vast, our work intersects with it mostly in motivation. In the above control-theory formulations, the system evolves according to some known diffusion or stochastic differential equation. Differently, the discrete-time MDP framework does not require any structural assumption on the transition function or reward.
*Equal contribution

A few works consider a delay in the reward signal rather than in observation or execution. Delayed reward has been studied in multi-armed bandits for deterministic and stochastic latencies (Joulani et al., 2013) and for the resulting arm credit assignment problem (Pike-Burke et al., 2017). In the MDP setting, Campbell et al. (2016) proposed a Q-learning variant for reward delay that follows a Poisson distribution. Katsikopoulos & Engelbrecht (2003) considered three types of delay: observation, execution, and reward. Chen et al. (2020b) studied execution delay in multi-agent systems. The above works on MDPs employed state-augmentation, with a primary focus on empirical evaluation of the degradation introduced by the delay. In this augmentation method, all missing information is concatenated with the original state to overcome the partial observability induced by the delay. The main drawback of this embedding method is the exponential growth of the state-space with the delay value (Walsh et al., 2009; Chen et al., 2020a) and, in the case of Chen et al. (2020b), an additional growth that is polynomial in the number of agents.

Walsh et al. (2009) avoided state-augmentation in MDPs with delayed feedback via a planning approach. By assuming the transition kernel to be close to deterministic, their model-based simulation (MBS) algorithm relies on a most-likely present-state estimate. Since the Delayed-Q algorithm we devise here resembles MBS in spirit, we highlight crucial differences between them. First, MBS is a conceptual algorithm that requires the state-space to be finite or discretized. This makes it highly sensitive to the state-space size, as we shall demonstrate in Sec. 7 [Fig. 5(c)], prohibiting it from running on domains like Atari. Differently, Delayed-Q works with the original, possibly continuous state-space.
Second, MBS is an offline algorithm: it estimates a surrogate, non-delayed MDP from samples, and only then does it solve that MDP to obtain the optimal policy (Walsh et al., 2009, Alg. 2, l. 16). This is inapplicable to large continuous domains and is again in contrast to Delayed-Q. Recent studies considered a concurrent control setting where action sampling occurs simultaneously with state transition (Ramstedt & Pal, 2019; Xiao et al., 2020). Both assumed a single action selection between two consecutive observations, thus reducing the problem to an MDP with execution delay of m = 1. Chen et al. (2020a) generalized it to an arbitrary number of actions between two observations. Hester & Stone (2013) addressed execution delay in the braking control of autonomous vehicles with a relatively low delay of m ≤ 3. All these works employ state-augmentation to preserve the Markov property of the process, whereas we are interested in whether this restriction can be lifted. Additionally, they studied policy-gradient (policy-based) methods, while we introduce a Q-learning style (value-based) algorithm. Likewise, Firoiu et al. (2018) proposed a modified version of the policy-based IMPALA (Espeholt et al., 2018), which is evaluated on a single video game with delay values of m ≤ 7. To the best of our knowledge, our work is the first to tackle a delayed variant of the popular Atari suite (Bellemare et al., 2013).

Contributions. Revisiting RL with execution delay both in theory and practice, we introduce:
1. An analysis of a delayed MDP quantifying the trade-off between stochasticity and delay.
2. The first tight upper and lower complexity bounds on policy iteration for action-augmented MDPs. We stress that this is also a contribution to the general RL theory of non-delayed MDPs.
3. A new formalism of execution-delay MDPs that avoids action-embedding.
Using it, we prove that, out of the larger set of history-dependent policies, restricting to non-stationary deterministic Markov policies is sufficient for optimality in delayed MDPs. We also derive a Bellman-type recursion for a delayed value function.
4. A model-based DQN-style algorithm that yields non-stationary Markov policies. Our algorithm outperforms the alternative standard and state-augmented DDQN in 39 of 42 experiments spanning 3 environment categories and delays of up to m = 25.
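To make the drawback of the augmentation baseline concrete, a minimal sketch (illustrative only; the state and action names below are hypothetical and not part of any cited algorithm) of how the action-augmented state-space grows with the delay m. Each augmented state is the current state concatenated with the queue of m committed-but-unexecuted actions, so the augmented space has |S|·|A|^m elements:

```python
from itertools import product

def augmented_state_space(states, actions, m):
    """Enumerate augmented states (s, a_1, ..., a_m): the current state
    concatenated with the last m committed, not-yet-executed actions."""
    return [(s, *queue) for s in states for queue in product(actions, repeat=m)]

states, actions = ["s0", "s1", "s2"], ["left", "right"]
for m in range(4):
    # |S| * |A|**m grows exponentially in m: 3, 6, 12, 24
    print(m, len(augmented_state_space(states, actions, m)))
```

Even this toy setting with 3 states and 2 actions reaches over 100 million augmented states at m = 25, which is why the paper's Delayed-Q avoids augmentation altogether.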

2. PRELIMINARIES: NON-DELAYED STANDARD MDP

Here, we describe the standard non-delayed MDP setup. Later, in Sec. 5, we introduce its generalization to the delayed case. We follow and extend notations from (Puterman, 2014, Sec. 2.1). An infinite-horizon discounted MDP is a tuple (S, A, P, r, γ) where S and A are finite state and action spaces, P : S × A → Δ(S) is a transition kernel, the reward r : S × A → R is a bounded function, and γ ∈ [0, 1) is a discount factor. At time t, the agent is in s_t and draws an action a_t according to a decision rule d_t that maps past information to a probability distribution q_{d_t} over the action set. Once a_t is taken, the agent receives a reward r(s_t, a_t).
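As a concrete instance of this setup, a minimal sketch of the standard non-delayed interaction loop on a hypothetical two-state, two-action MDP (the states, actions, kernel, and rewards below are invented for illustration; the decision rule shown is a stationary Markov one that depends only on the current state):

```python
import random

# Hypothetical MDP: P[s][a] is a distribution over next states.
P = {
    "s0": {"stay": {"s0": 1.0}, "go": {"s0": 0.2, "s1": 0.8}},
    "s1": {"stay": {"s1": 1.0}, "go": {"s0": 0.9, "s1": 0.1}},
}
r = {("s0", "stay"): 0.0, ("s0", "go"): 1.0,
     ("s1", "stay"): 0.5, ("s1", "go"): 0.0}
gamma = 0.9  # discount factor in [0, 1)

def decision_rule(s):
    """Maps the current state to a distribution q over the action set."""
    return {"stay": 0.5, "go": 0.5}

def step(s, rng):
    """One non-delayed step: a_t is drawn, executed immediately,
    and the reward r(s_t, a_t) is received."""
    q = decision_rule(s)
    a = rng.choices(list(q), weights=list(q.values()))[0]
    nxt = P[s][a]
    s_next = rng.choices(list(nxt), weights=list(nxt.values()))[0]
    return a, r[(s, a)], s_next

rng = random.Random(0)
s, ret = "s0", 0.0
for t in range(50):  # accumulate the discounted return sum_t gamma^t r(s_t, a_t)
    a, rew, s = step(s, rng)
    ret += gamma**t * rew
print(round(ret, 3))
```

Under execution delay, the action executed at time t would instead be the one committed m steps earlier, which is exactly the modification the delayed formalism of Sec. 5 captures.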

