ENERGY-BASED PREDICTIVE REPRESENTATIONS FOR PARTIALLY OBSERVED REINFORCEMENT LEARNING

Abstract

In real-world applications, a reinforcement learning algorithm must usually handle partial observability, which is not captured by a Markov decision process (MDP). Although partially observable Markov decision processes (POMDPs) are motivated precisely by this requirement, they raise significant computational and statistical hardness challenges in learning and planning. In this work, we introduce the Energy-based Predictive Representation (EPR), which supports a unified approach to practical reinforcement learning algorithm design for both the MDP and POMDP settings, handling learning, exploration, and planning coherently. The proposed framework relies on a powerful neural energy-based model to extract a sufficient representation, from which Q-functions can be efficiently approximated. With such a representation, confidence estimates can be computed efficiently, allowing optimism/pessimism in the face of uncertainty to be implemented in planning and thus enabling effective management of the exploration-versus-exploitation tradeoff. An experimental investigation shows that the proposed algorithm surpasses state-of-the-art baselines in both MDP and POMDP settings.

1. INTRODUCTION

Reinforcement learning (RL) based on Markov Decision Processes (MDPs) has proved extremely effective in several real-world decision-making problems (Levine et al., 2016; Jiang et al., 2021). However, the success of most RL algorithms (Ren et al., 2022b; Zhang et al., 2022) relies heavily on the assumption that the environment state is fully observable to the agent. In practice, this assumption is easily violated in the presence of observational noise. To address this issue, Partially Observable Markov Decision Processes (POMDPs) (Åström, 1965) have been proposed to capture the inherent uncertainty about the state arising from partial observations. However, the flexibility of POMDPs creates significant statistical and computational hardness in terms of planning, exploration, and learning. In particular, (i) partial observability induces a non-Markovian dependence on the entire history; and (ii) the expanded spaces of observation sequences or state distributions incur significant representation challenges. In fact, due to the full history dependence, planning for even finite-horizon tabular POMDPs has been proved NP-hard without additional structural assumptions (Papadimitriou & Tsitsiklis, 1987; Madani et al., 1998), and the sample complexity of learning POMDPs can be exponential in the horizon (Jin et al., 2020a). These difficulties only become more severe in continuous state spaces and real-world scenarios. On the other hand, despite this theoretical hardness, the widely used sliding-window policy parameterization has demonstrated impressive empirical performance (Mnih et al., 2013; Berner et al., 2019), indicating that real-world POMDPs possess sufficient structure that can be exploited to bypass the aforementioned complexities.
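The belief-state construction makes the first difficulty concrete: the non-Markovian dependence on the full history can be summarized by a posterior distribution over hidden states, but that distribution lives in a continuous simplex even for a tabular POMDP, so planning over it is still hard. A minimal Bayes-filter sketch illustrates this (a generic construction, not this paper's method; the transition matrix `T`, emission matrix `O`, and initial belief are illustrative assumptions):

```python
import numpy as np

# Tabular POMDP with 2 hidden states and 2 observations (illustrative numbers).
T = np.array([[0.9, 0.1],   # T[s, s'] = P(s' | s) under a fixed action
              [0.2, 0.8]])
O = np.array([[0.8, 0.2],   # O[s, o] = P(o | s); rows are distinct, i.e. the
              [0.3, 0.7]])  # emission carries information about the state

def belief_update(b, obs):
    """One Bayes-filter step: predict through T, then correct with O."""
    predicted = b @ T                  # P(s' | history)
    corrected = predicted * O[:, obs]  # unnormalized P(s' | history, obs)
    return corrected / corrected.sum()

b = np.array([0.5, 0.5])      # initial belief over the hidden states
for obs in [0, 0, 1]:         # a short observation sequence
    b = belief_update(b, obs)
# b now summarizes the entire history, but planning must act on this
# continuous-valued belief vector rather than a discrete state.
```

The filter collapses the history into a sufficient statistic, yet the statistic itself is continuous-valued, which is exactly the representation challenge noted in (ii) above.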
Recently, observable POMDPs with invertible emissions have been investigated to justify the sliding-window heuristic in the tabular case (Azizzadenesheli et al., 2016; Guo et al., 2016; Jin et al., 2020a; Golowich et al., 2022a), and this line of work has been further extended with function approximation to large and continuous-state POMDPs (Wang et al., 2022; Uehara et al., 2022). Although these algorithms exploit such structure with efficient sample complexity, they rely on unrealistic computational oracles and are thus not applicable in practice. In this paper, we consider the following natural question: How can one design efficient and practical algorithms for structured POMDPs?
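The sliding-window heuristic discussed above can be stated in a few lines: instead of conditioning on the full history, the policy (or Q-network) conditions only on the last k observations. The following hedged sketch shows the idea; the window length `k`, the random linear score map standing in for a learned Q-network, and the class interface are placeholders, not this paper's architecture:

```python
import numpy as np
from collections import deque

class SlidingWindowPolicy:
    """Acts on a fixed-length window of recent observations instead of the
    full history (or the true state, which is unobserved)."""

    def __init__(self, obs_dim, n_actions, k=4, seed=0):
        self.k = k
        self.window = deque(maxlen=k)            # drops stale observations
        rng = np.random.default_rng(seed)
        # Stand-in for a learned Q-network: a random linear map from the
        # flattened observation window to per-action scores.
        self.W = rng.normal(size=(n_actions, k * obs_dim))

    def act(self, obs):
        self.window.append(np.asarray(obs, dtype=float))
        # Zero-pad on the left until k observations have been seen.
        pad = [np.zeros_like(self.window[0])] * (self.k - len(self.window))
        x = np.concatenate(pad + list(self.window))
        return int(np.argmax(self.W @ x))        # greedy action

policy = SlidingWindowPolicy(obs_dim=3, n_actions=2)
a = policy.act([0.1, -0.2, 0.5])  # an action index in {0, 1}
```

The appeal of this parameterization is that the input to the policy has fixed dimension k * obs_dim regardless of episode length, sidestepping the full history dependence at the cost of discarding information older than k steps.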

