ACTING IN DELAYED ENVIRONMENTS WITH NON-STATIONARY MARKOV POLICIES

Abstract

The standard Markov Decision Process (MDP) formulation hinges on the assumption that an action is executed immediately after it is chosen. However, this assumption is often unrealistic and can lead to catastrophic failures in applications such as robotic manipulation, cloud computing, and finance. We introduce a framework for learning and planning in MDPs where the decision-maker commits actions that are executed with a delay of m steps. The brute-force state-augmentation baseline, in which the state is concatenated with the last m committed actions, suffers from complexity exponential in m, as we show for policy iteration. We then prove that with execution delay, deterministic Markov policies in the original state-space suffice for attaining maximal reward, but must be non-stationary; stationary Markov policies, in contrast, are sub-optimal in general. Consequently, we devise a non-stationary Q-learning style model-based algorithm that solves delayed-execution tasks without resorting to state-augmentation. Experiments on tabular, physical, and Atari domains reveal that it converges quickly to high performance even for substantial delays, while standard approaches that either ignore the delay or rely on state-augmentation struggle or fail due to divergence. The code is available at https://github.com/galdl/rl_delay_basic.git.
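To make the execution-delay setting concrete, the following is a minimal, hypothetical sketch (not the paper's implementation) of an m-step delay wrapper: each committed action enters a FIFO buffer and only the action committed m steps earlier is actually executed. The `step_fn`, `DelayedExecutionEnv` name, and `default_action` padding are all illustrative assumptions. It also exposes the brute-force augmented state described above, whose action component ranges over |A|^m values, i.e., exponentially many in m.

```python
from collections import deque


class DelayedExecutionEnv:
    """Hypothetical wrapper illustrating m-step execution delay.

    `step_fn(state, action) -> (next_state, reward)` is the underlying
    (undelayed) MDP dynamics. The buffer is padded with `default_action`
    so the first m executed actions are well defined.
    """

    def __init__(self, step_fn, init_state, m, default_action):
        self.step_fn = step_fn
        self.state = init_state
        self.m = m
        # Pending actions; the oldest entry is the one executed next.
        self.buffer = deque([default_action] * m)

    def augmented_state(self):
        # Brute-force augmentation: current state plus the last m committed
        # actions. The augmented space has size |S| * |A|^m.
        return (self.state, tuple(self.buffer))

    def commit(self, action):
        """Commit an action now; execute the action committed m steps ago."""
        self.buffer.append(action)
        executed = self.buffer.popleft()
        self.state, reward = self.step_fn(self.state, executed)
        return self.state, reward, executed
```

For example, in a toy chain MDP where action 1 advances the state by one, a committed `1` only moves the agent m steps later; until then the padded default actions are executed.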

1. INTRODUCTION

The body of work on reinforcement learning (RL) and planning problem setups has grown vast in recent decades. Examples of such distinctions include different objectives and constraints, assumptions on access to the model or to logged trajectories, on-policy versus off-policy paradigms, etc. (Puterman, 2014). However, the study of delay in RL remains scarce: it is almost always assumed that an action is executed as soon as the agent chooses it. This assumption seldom holds in real-world applications (Dulac-Arnold et al., 2019). Latency in action execution can stem either from the growing computational complexity of modern systems and their tasks, or from the infrastructure itself. The wide range of affected applications includes robotic manipulation, cloud computing, financial trading, sensor feedback in autonomous systems, and more. To elaborate, consider an autonomous vehicle that must respond immediately to a sudden hazard on the highway. Driving at high speed, it suffers from perception-module latency when inferring the surrounding scene, as well as from actuation delay once a decision has been made. The latter phenomenon is an instance of execution delay, while the former corresponds to observation delay. These two types of delay are in fact equivalent and can thus be treated with the same tools (Katsikopoulos & Engelbrecht, 2003).

Related works. The notion of delay is prominent in control theory of linear time-invariant systems (Bar-Ilan & Sulem, 1995; Dugard & Verriest, 1998; Richard, 2003; Fridman, 2014; Bruder & Pham, 2009). While the delayed-control literature is vast, our work intersects with it mostly in motivation. In the above control-theoretic formulations, the system evolves according to some known diffusion or stochastic differential equation. Differently, the discrete-time MDP framework does not require any structural assumption on the transition function or reward.

