ENERGY-BASED PREDICTIVE REPRESENTATIONS FOR PARTIALLY OBSERVED REINFORCEMENT LEARNING

Abstract

In real-world applications, a reinforcement learning algorithm must usually handle partial observability, which is not captured by a Markov decision process (MDP). Although partially observable Markov decision processes (POMDPs) are motivated precisely by this requirement, they raise significant computational and statistical hardness challenges in learning and planning. In this work, we introduce the Energy-based Predictive Representation (EPR) to support a unified approach to practical reinforcement learning algorithm design for both the MDP and POMDP settings, enabling learning, exploration, and planning to be handled in a coherent way. The proposed framework relies on a powerful neural energy-based model to extract a sufficient representation, from which Q-functions can be efficiently approximated. With such a representation, confidence can be computed efficiently, allowing optimism/pessimism in the face of uncertainty to be implemented in planning and enabling effective management of the exploration-versus-exploitation tradeoff. An experimental investigation shows that the proposed algorithm surpasses state-of-the-art baselines in both MDP and POMDP settings.

1. INTRODUCTION

Reinforcement learning (RL) based on Markov Decision Processes (MDPs) has proved to be extremely effective in several real-world decision-making problems (Levine et al., 2016; Jiang et al., 2021). However, the success of most RL algorithms (Ren et al., 2022b; Zhang et al., 2022) relies heavily on the assumption that the environment state is fully observable to the agent. In practice, such an assumption is easily violated in the presence of observational noise. To address this issue, Partially Observable Markov Decision Processes (POMDPs) (Åström, 1965) have been proposed for capturing the inherent uncertainty about the state arising from partial observations. However, the flexibility of POMDPs creates significant statistical and computational hardness in terms of planning, exploration, and learning. In particular, i) partial observability induces a non-Markovian dependence over the entire history; and ii) the expanded spaces of observation sequences or state distributions incur significant representation challenges. In fact, due to the full history dependence, it has been proved that planning for even finite-horizon tabular POMDPs is NP-hard without additional structural assumptions (Papadimitriou & Tsitsiklis, 1987; Madani et al., 1998), and the sample complexity for learning POMDPs can be exponential in the horizon (Jin et al., 2020a). These complexities only become more demanding in continuous state spaces and real-world scenarios. On the other hand, despite the theoretical hardness, the widely used sliding-window policy parameterization has demonstrated impressive empirical performance (Mnih et al., 2013; Berner et al., 2019), indicating that there is sufficient structure in real-world POMDPs that can be exploited to bypass the aforementioned complexities.
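The sliding-window heuristic mentioned above can be made concrete: instead of conditioning on the full history, the policy conditions only on the last k observations, stacked into a fixed-size input. The following is a minimal sketch of that idea; the window length k, the zero-padding convention, and the linear scoring rule are illustrative assumptions, not part of any specific algorithm discussed in this paper.

```python
from collections import deque

import numpy as np


class SlidingWindowPolicy:
    """A policy conditioned on the last k observations rather than the full history.

    This is a generic sketch of the sliding-window (frame-stacking) heuristic;
    the linear scoring over the stacked window is purely illustrative.
    """

    def __init__(self, k, obs_dim, num_actions, seed=0):
        self.k = k
        self.obs_dim = obs_dim
        self.window = deque(maxlen=k)  # keeps only the k most recent observations
        rng = np.random.default_rng(seed)
        # One weight vector per action, over the concatenated window.
        self.weights = rng.normal(size=(num_actions, k * obs_dim))

    def act(self, obs):
        self.window.append(np.asarray(obs, dtype=np.float64))
        # Left-pad with zero observations until k observations have been seen.
        pad = [np.zeros(self.obs_dim)] * (self.k - len(self.window))
        stacked = np.concatenate(pad + list(self.window))  # shape (k * obs_dim,)
        return int(np.argmax(self.weights @ stacked))
```

Because the window has fixed size, the policy input space no longer grows with the horizon, which is exactly what makes this parameterization tractable despite the worst-case hardness results.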
Recently, observable POMDPs with invertible emissions have been investigated to justify the sliding-window heuristic in tabular cases (Azizzadenesheli et al., 2016; Guo et al., 2016; Jin et al., 2020a; Golowich et al., 2022a), and this line of work has been further extended with function approximation for large and continuous-state POMDPs (Wang et al., 2022; Uehara et al., 2022). Although these algorithms can exploit particular structure efficiently in terms of sample complexity, they rely on unrealistic computation oracles and are thus not applicable in practice. In this paper, we consider the following natural question: How can one design efficient and practical algorithms for structured POMDPs? In particular, we would like to exploit special structures that allow approximation to bypass inherent worst-case difficulties. By "efficient" we mean considering learning, planning, and exploration in a unified manner that can balance errors in each component and reduce unnecessary computation, while by "practical" we mean the algorithm retains sufficient flexibility and can be easily implemented and deployed in real-world scenarios. There have been many attempts to address this question. The most straightforward idea is to extend model-free RL methods, including policy gradient and Q-learning, with a memory-limited parametrization, e.g., recurrent neural networks (Wierstra et al., 2007; Hausknecht & Stone, 2015; Zhu et al., 2017). Alternatively, in model-based RL (Kaelbling et al., 1998), an approximation of the latent dynamics can be estimated and a posterior over latent states (i.e., a belief) maintained, in principle allowing an optimal policy to be extracted via dynamic programming upon beliefs. Following this idea, Deisenroth & Peters (2012) and Igl et al. (2018); Gregor et al. (2019); Zhang et al. (2019); Lee et al. (2020) consider Gaussian process or deep model parametrizations, respectively. Such methods are designed based on implicit assumptions about structure through the parameterization choices of the models.
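To make the belief notion above concrete: the belief is the Bayes-filter posterior over latent states, updated recursively after taking action $a_h$ and receiving observation $o_{h+1}$. The notation below is generic, with $\mathbb{P}$ the latent transition kernel and $\mathbb{O}$ the emission distribution; it is not specific to any one cited method:

```latex
b_{h+1}(s') \;\propto\; \mathbb{O}(o_{h+1} \mid s') \sum_{s \in \mathcal{S}} \mathbb{P}(s' \mid s, a_h)\, b_h(s).
```

Dynamic programming upon beliefs then treats $b_h$ as the state of the induced belief-state MDP; maintaining this posterior accurately is precisely where the parameterization choices of the methods above introduce approximation error.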
However, these approaches suffer from sub-optimal performance due to several compounding factors: i) approximation error from inaccurate parametrizations of the learnable components (policy, value function, model, belief); ii) a sub-optimal policy induced by approximate planning (through policy gradient or dynamic programming); and iii) the neglect of exploration when interacting with the environment. Spectral representation approaches provide an alternative strategy based on extracting a sufficient representation that can support learning, planning, and exploration. In this vein, Azizzadenesheli et al. (2016) investigate spectral methods (Anandkumar et al., 2014) for latent variable model estimation in POMDPs, but only consider tabular scenarios with finite states and actions. Predictive State Representations (PSRs) (Littman & Sutton, 2001; Boots et al., 2011) also leverage spectral decomposition, but instead of recovering an underlying latent variable model, they learn an equivalent sufficient representation of the belief. These methods have been extended to real-world settings with continuous observations and actions by exploiting kernel embeddings (Boots et al., 2013) or deep models (Downey et al., 2017; Venkatraman et al., 2017; Guo et al., 2018). However, efficient exploration and tractable planning with spectral representations have yet to be thoroughly developed (Zhan et al., 2022). In this paper, we propose the Energy-based Predictive Representation (EPR) to support efficient and tractable learning, planning, and exploration in POMDPs (and MDPs), as a solution to the aforementioned question. More specifically:

• We propose a flexible nonlinear energy-based model for induced belief-state MDPs without explicit parameterization of beliefs, providing a principled linear sufficient representation for the state-action value function.
• We reveal the connection between EPR and PSR, while also illustrating the differences, to demonstrate the modeling ability of the proposed EPR.
• We provide computationally tractable learning and planning algorithms for EPR that implement the principles of optimism and pessimism in the face of uncertainty for online and offline RL, balancing exploration and exploitation.
• We conduct a comprehensive comparison to existing state-of-the-art RL algorithms on both MDP and POMDP benchmarks, demonstrating the superior empirical performance of the proposed EPR.
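Optimism and pessimism in the face of uncertainty, as in the third point above, are commonly implemented with an elliptical confidence bonus on top of a linear representation: given a feature map φ(s, a) and the regularized empirical covariance Σ = λI + Σ_i φ_i φ_iᵀ of previously observed features, the width β·sqrt(φᵀ Σ⁻¹ φ) is added to the Q-estimate online (optimism) or subtracted offline (pessimism). The sketch below illustrates this standard construction; β, λ, and the feature map are illustrative assumptions, not the specific quantities used by EPR.

```python
import numpy as np


def elliptical_bonus(phi, features, lam=1.0, beta=1.0):
    """Confidence width beta * sqrt(phi^T Sigma^{-1} phi).

    `features` is an (n, d) array of previously observed feature vectors
    phi(s_i, a_i); Sigma = lam * I + features^T features is the regularized
    empirical covariance. A larger bonus means a less-explored direction.
    """
    d = phi.shape[0]
    sigma = lam * np.eye(d) + features.T @ features
    return beta * float(np.sqrt(phi @ np.linalg.solve(sigma, phi)))


def optimistic_q(w, phi, features, beta=1.0):
    """Optimistic Q estimate: linear value plus exploration bonus (online RL).

    For offline RL the pessimistic counterpart subtracts the bonus instead,
    penalizing actions whose features are poorly covered by the dataset.
    """
    return float(w @ phi) + elliptical_bonus(phi, features, beta=beta)
```

A feature direction visited many times contracts the bonus toward zero along that direction, while an unvisited direction retains width on the order of β/√λ, which is what drives exploration toward under-covered state-action pairs.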

2. PRELIMINARIES

In this section, we briefly introduce POMDPs and their degenerate case, MDPs, identifying the special structure that will be used to derive the proposed representation learning method.




Partially Observable Markov Decision Processes. Formally, we define a partially observable Markov decision process (POMDP) as a tuple P = (S, A, O, r, H, µ, P, O), where H is a positive integer denoting the horizon length; µ is the initial state distribution,

