ON THE GEOMETRY OF REINFORCEMENT LEARNING IN CONTINUOUS STATE AND ACTION SPACES

Anonymous

Abstract

Advances in reinforcement learning have led to its successful application in complex tasks with continuous state and action spaces. Despite these advances in practice, most theoretical work pertains to finite state and action spaces. We propose building a theoretical understanding of continuous state and action spaces through a geometric lens. Central to our work is the idea that the transition dynamics induce a low-dimensional manifold of reachable states embedded in the high-dimensional nominal state space. We prove that, under certain conditions, the dimensionality of this manifold is at most the dimensionality of the action space plus one. This is the first result of its kind, linking the geometry of the state space to the dimensionality of the action space. We empirically corroborate this upper bound on four MuJoCo environments. We further demonstrate the applicability of our result by learning a policy in this low-dimensional representation. To do so, we introduce an algorithm that learns a mapping to a low-dimensional representation, realised as a narrow hidden layer of a deep neural network, in tandem with the policy using DDPG. Our experiments show that a policy learnt this way performs on par with or better than the baseline on four MuJoCo control suite tasks.

1. INTRODUCTION

The goal of a reinforcement learning (RL) agent is to learn an optimal policy that maximises the return, i.e. the time-discounted cumulative reward (Sutton & Barto, 1998). Recent advances in RL research have led to agents successfully learning in environments with enormous state spaces, such as games (Mnih et al., 2015; Silver et al., 2016), and robotic control in simulation (Lillicrap et al., 2016; Schulman et al., 2015; 2017a) and real environments (Levine et al., 2016; Zhu et al., 2020; Deisenroth & Rasmussen, 2011). However, we do not have an understanding of the intrinsic complexity of these seemingly large problems. For example, in most popular deep RL algorithms for continuous control, the agent's policy is parameterised by a deep neural network (DNN) (Lillicrap et al., 2016; Schulman et al., 2015; 2017a), but we do not have theoretical models to guide the design of the DNN architecture required to efficiently learn an optimal policy for various environments. There have been approaches to measure the difficulty of an RL environment from a sample-complexity perspective (Antos et al., 2007; Munos & Szepesvari, 2008; Bastani, 2020), but these models fall short of providing recommendations for the policy and value function complexity required to learn an optimal policy. We view the complexity of RL environments through a geometric lens. We build on the intuition behind the manifold hypothesis, which states that most high-dimensional real-world datasets actually lie on low-dimensional manifolds (Tenenbaum, 1997; Carlsson et al., 2007; Fefferman et al., 2013; Bronstein et al., 2021); for example, the set of natural images is a very small, smoothly varying subset of all possible value assignments for the pixels. A promising geometric approach is to model the data as a low-dimensional structure, a manifold, embedded in a high-dimensional ambient space.
In supervised learning, especially deep learning theory, researchers have shown that the approximation error depends strongly on the dimensionality of the manifold (Shaham et al., 2015; Pai et al., 2019; Chen et al., 2019; Cloninger & Klock, 2020), thereby connecting the complexity of the underlying structure of the dataset to the complexity of the DNN. As in supervised learning, researchers have applied the manifold hypothesis in RL, i.e. hypothesised that the effective state space lies on a low-dimensional manifold (Mahadevan, 2005; Machado et al., 2017; 2018; Banijamali et al., 2018; Wu et al., 2019; Liu et al., 2021). Despite its fruitful applications, this assumption of a low-dimensional underlying structure has never been theoretically or empirically validated in any RL setting. Our main result provides a general proof of this hypothesis for all continuous state and action RL environments: we prove that the effective state space is a manifold and upper-bound its dimensionality by, simply, the dimensionality of the action space plus one. Although our theoretical results are for deterministic environments with continuous states and actions, we empirically corroborate this upper bound on four MuJoCo environments (Todorov et al., 2012), with sensor inputs, by applying the dimensionality estimation algorithm of Facco et al. (2017). Our empirical results suggest that in many instances the bound on the dimensionality of the effective state manifold is tight. To show the applicability and relevance of our theoretical result, we empirically demonstrate that a policy learned using this low-dimensional representation performs as well as or better than a policy learnt using the higher-dimensional representation. We present an algorithm that does two things simultaneously: 1) learns a mapping to a low-dimensional representation, called the co-ordinate chart, parameterised by a DNN, and 2) uses this low-dimensional mapping to learn the policy.
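The dimensionality estimator of Facco et al. (2017) referenced above admits a compact implementation. The sketch below is an illustrative maximum-likelihood form of the two-nearest-neighbour (TwoNN) estimator, not the exact implementation used in our experiments; the synthetic 2-D plane embedded in R^10 is a hypothetical test case:

```python
import numpy as np

def twonn_dimension(X):
    """TwoNN intrinsic-dimension estimate for points X of shape (n, D).

    The ratio mu_i = r2_i / r1_i of the second- to first-nearest-neighbour
    distance of each point follows a Pareto law with exponent equal to the
    intrinsic dimension d; its maximum-likelihood estimate is
    d = n / sum_i log(mu_i)  (Facco et al., 2017, MLE form).
    """
    # all pairwise squared distances; fine for small n (use a KD-tree for large n)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)               # exclude each point itself
    r = np.sqrt(np.sort(d2, axis=1)[:, :2])    # r1, r2 for every point
    mu = r[:, 1] / r[:, 0]
    mu = mu[np.isfinite(mu) & (mu > 1.0)]      # guard against duplicate points
    return len(mu) / np.log(mu).sum()

# hypothetical example: a 2-D linear subspace embedded in R^10
rng = np.random.default_rng(0)
Z = rng.normal(size=(2000, 2))                 # 2-D latent coordinates
X = Z @ rng.normal(size=(2, 10))               # embed into 10 ambient dimensions
print(twonn_dimension(X))                      # close to 2, the intrinsic dimension
```

The estimator only uses the two nearest neighbours of each point, which makes it robust to density variation on the manifold; this locality is why it is suitable for state samples collected along trajectories.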
Our algorithm extends DDPG (Lillicrap et al., 2016) and uses it as a baseline with the higher-dimensional representation as the input. We empirically show that a surprising new DNN architecture, with a bottleneck hidden layer of width equal to the dimensionality of the action space plus one, performs on par with or better than the wide architecture used by Lillicrap et al. (2016). These results demonstrate that our theoretical results, which speak to the underlying geometry of the problem, can be applied to learn a low-dimensional, compressed representation for learning in a data-efficient manner. Moreover, we connect DNN architectures to effectively learning a policy based on the underlying geometry of the environment.
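The bottleneck architecture described above can be sketched as follows. This is a minimal PyTorch illustration: only the width-(d_a + 1) bottleneck reflects the stated design, while the other layer widths, activations, and the example dimensions (17-dimensional observations, 6-dimensional actions, as in HalfCheetah) are assumptions for the sketch, not specifications from the paper:

```python
import torch
import torch.nn as nn

class BottleneckActor(nn.Module):
    """DDPG-style deterministic actor with a narrow hidden layer of width
    d_a + 1, matching the upper bound on the dimensionality of the
    reachable-state manifold (action dimension plus one)."""

    def __init__(self, d_s, d_a, hidden=256):
        super().__init__()
        # mapping from the nominal state to the low-dimensional co-ordinate chart
        self.chart = nn.Sequential(
            nn.Linear(d_s, hidden), nn.ReLU(),
            nn.Linear(hidden, d_a + 1),        # bottleneck: d_a + 1 units
        )
        # policy head operating on the low-dimensional representation
        self.head = nn.Sequential(
            nn.ReLU(),
            nn.Linear(d_a + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, d_a), nn.Tanh(), # actions scaled to [-1, 1]
        )

    def forward(self, s):
        return self.head(self.chart(s))

# hypothetical dimensions, e.g. HalfCheetah: 17-dim observation, 6-dim action
actor = BottleneckActor(d_s=17, d_a=6)
a = actor(torch.zeros(1, 17))
print(a.shape)  # torch.Size([1, 6])
```

Because the chart and the policy head are trained jointly end-to-end with the DDPG actor loss, the bottleneck activations serve as the learned co-ordinate chart without requiring a separate representation-learning objective.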

2. BACKGROUND AND MATHEMATICAL PRELIMINARIES

We first describe the continuous time RL model and Markov decision process (MDP). This forms the foundation upon which our theoretical result is based. Then we provide mathematical background on various ideas from the theory of manifolds that we employ in our proofs and empirical results.

2.1. CONTINUOUS-TIME REINFORCEMENT LEARNING

We analyze the setting of continuous-time reinforcement learning in a deterministic Markov decision process (MDP), defined by the tuple $\mathcal{M} = (S, A, f, f_r, s_0, \lambda)$ over time $t \in [0, T)$. $S \subset \mathbb{R}^{d_s}$ is the set of all possible states of the environment. $A \subset \mathbb{R}^{d_a}$ is the rectangular set of actions available to the agent. $f : S \times A \times \mathbb{R}^+ \to S$, with $f \in C^\infty$, is a smooth function that determines the state transitions: $s' = f(s, a, \tau)$ is the state the agent transitions to when it takes the action $a$ at state $s$ for the time period $\tau$. Note that $f(s, a, 0) = s$, meaning that the agent's state remains unchanged if an action is applied for a duration of $\tau = 0$. The reward obtained for reaching state $s$ is $f_r(s)$, determined by the reward function $f_r : S \to \mathbb{R}$. $s_t$ denotes the state of the agent at time $t$ and $a_t$ is the action it takes at time $t$. $s_0$ is the fixed initial state of the agent at $t = 0$, and the MDP terminates at $t = T$. The agent does not have access to the functions $f$ and $f_r$, and can only observe states and rewards at a given time $t \in [0, T)$. The agent is equipped with a policy, $\pi : S \to A$, that determines its decision-making process. We denote the set of all possible policies by $\Pi$. Simply put, the agent takes action $\pi(s)$ at state $s$. The goal of the agent is to maximise the discounted return $J(\pi) = \int_0^T e^{-l/\lambda} f_r(s_l)\, dl$, where $s_{t+\epsilon} = f(s_t, \pi(s_t), \epsilon)$ for infinitesimally small $\epsilon$ and all $t \in [0, T)$. We define the action tangent mapping, $g : S \times A \to \mathbb{R}^{d_s}$, for an MDP as
$$g(s, a) = \lim_{\epsilon \to 0^+} \frac{f(s, a, \epsilon) - s}{\epsilon} = \left. \frac{\partial f(s, a, \epsilon)}{\partial \epsilon} \right|_{\epsilon = 0}.$$
Intuitively, this captures the direction in which the agent's state changes at state $s$ upon taking an action $a$. For notational convenience we denote $g(s, \pi(s))$ by $g_\pi : S \to \mathbb{R}^{d_s}$ and call it the action flow of the environment defined for a policy $\pi$. Note that $g_\pi$ is a well-defined function. Intuitively, $g_\pi(s)$ is the direction of change in the agent's state upon following the policy $\pi$ at state $s$ for an infinitesimally small time.
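The defining limit of the action tangent mapping lends itself to a simple numerical check. The snippet below is a toy illustration under assumed linear dynamics (the matrices `A` and `B` and the Euler-flow transition `f` are hypothetical, used only to make the definition concrete); it is not part of the MDP model itself:

```python
import numpy as np

def action_tangent(f, s, a, eps=1e-5):
    """Finite-difference approximation of the action tangent mapping
    g(s, a) = lim_{eps -> 0+} (f(s, a, eps) - s) / eps,
    where f(s, a, tau) is the transition function with f(s, a, 0) = s."""
    return (f(s, a, eps) - s) / eps

# hypothetical deterministic dynamics: ds/dt = A s + B a
A = np.array([[0.0, 1.0],
              [-1.0, 0.0]])
B = np.array([[0.0],
              [1.0]])

def f(s, a, tau):
    # first-order (Euler) flow of the linear dynamics for duration tau
    return s + tau * (A @ s + B @ a)

s0 = np.array([1.0, 0.0])
a0 = np.array([0.5])
print(action_tangent(f, s0, a0))  # approx A @ s0 + B @ a0 = [0., -0.5]
```

For these dynamics the finite difference recovers the vector field $A s + B a$ exactly, since $f$ is linear in $\tau$; for general smooth $f$ the approximation error vanishes as `eps` shrinks.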
The curve in the set of possible states, or the state-trajectory of the agent, is a differential

