ENERGY-BASED PREDICTIVE REPRESENTATIONS FOR PARTIALLY OBSERVED REINFORCEMENT LEARNING

Abstract

In real world applications, it is usually necessary for a reinforcement learning algorithm to handle partial observability, which is not captured by a Markov decision process (MDP). Although partially observable Markov decision processes (POMDPs) have been precisely motivated by this requirement, they raise significant computational and statistical hardness challenges in learning and planning. In this work, we introduce the Energy-based Predictive Representation (EPR) to support a unified approach to practical reinforcement learning algorithm design for both the MDP and POMDP settings, which enables learning, exploration, and planning to be handled in a coherent way. The proposed framework relies on a powerful neural energy-based model to extract a sufficient representation, from which Q-functions can be efficiently approximated. With such a representation, confidence can be efficiently computed to allow optimism/pessimism in the face of uncertainty to be efficiently implemented in planning, enabling effective management of the exploration versus exploitation tradeoff. An experimental investigation shows that the proposed algorithm can surpass state-of-the-art performance in both MDP and POMDP settings in comparison to existing baselines.

1. INTRODUCTION

Reinforcement learning (RL) based on Markov Decision Processes (MDPs) has proved to be extremely effective in several real-world decision-making problems (Levine et al., 2016; Jiang et al., 2021). However, the success of most RL algorithms (Ren et al., 2022b; Zhang et al., 2022) relies heavily on the assumption that the environment state is fully observable to the agent. In practice, such an assumption can be easily violated in the presence of observational noise. To address this issue, Partially Observable Markov Decision Processes (POMDPs) (Åström, 1965) have been proposed for capturing the inherent uncertainty about the state arising from partial observations. However, the flexibility of POMDPs creates significant statistical and computational hardness in terms of planning, exploration and learning. In particular, i) partial observability induces a non-Markovian dependence over the entire history; and ii) the expanded spaces of observation sequences or state space distributions incur significant representation challenges. In fact, due to the full history dependence, it has been proved that planning for even finite-horizon tabular POMDPs is NP-hard without additional structural assumptions (Papadimitriou & Tsitsiklis, 1987; Madani et al., 1998), and the sample complexity for learning POMDPs can be exponential with respect to the horizon (Jin et al., 2020a). These complexities only become more demanding in continuous state spaces and real-world scenarios. On the other hand, despite the theoretical hardness, the widely used sliding window policy parameterization has demonstrated impressive empirical performance (Mnih et al., 2013; Berner et al., 2019), indicating that there is sufficient structure in real-world POMDPs that can be exploited to bypass the aforementioned complexities.
Recently, observable POMDPs with invertible emissions have been investigated to justify the sliding window heuristic in tabular cases (Azizzadenesheli et al., 2016; Guo et al., 2016; Jin et al., 2020a; Golowich et al., 2022a), and this line of work has been further extended with function approximation for large and continuous state POMDPs (Wang et al., 2022; Uehara et al., 2022). Although these algorithms can exploit particular structure efficiently in terms of sample complexity, they rely on unrealistic computation oracles, and are thus not applicable in practice. In this paper, we consider the following natural question: How can one design efficient and practical algorithms for structured POMDPs? In particular, we would like to exploit special structure that allows approximation to bypass inherent worst-case difficulties. By "efficient" we mean considering learning, planning and exploration in a unified manner that can balance errors in each component and reduce unnecessary computation, while by "practical" we mean that the algorithm retains sufficient flexibility and can be easily implemented and deployed in real-world scenarios. There have been many attempts to address this question. The most straightforward idea is to extend model-free RL methods, including policy gradient and Q-learning, with a memory-limited parametrization, e.g., recurrent neural networks (Wierstra et al., 2007; Hausknecht & Stone, 2015; Zhu et al., 2017). Alternatively, in model-based RL (Kaelbling et al., 1998), an approximation of the latent dynamics can be estimated and a posterior over latent states (i.e., beliefs) maintained, in principle allowing an optimal policy to be extracted via dynamic programming upon beliefs. Following this idea, Deisenroth & Peters (2012) consider Gaussian process parametrizations, while Igl et al. (2018), Gregor et al. (2019), Zhang et al. (2019) and Lee et al. (2020) consider deep model parametrizations.
Such methods are designed based on implicit assumptions about structure through the parameterization choices of the models. However, these approaches suffer from sub-optimal performance due to several compounding factors: i) approximation error from inaccurate parametrizations of the learnable components (policy, value function, model, belief), ii) a sub-optimal policy induced by approximated planning (through policy gradient or dynamic programming), and iii) the neglect of exploration when interacting with the environment. Spectral representation approaches provide an alternative strategy based on extracting a sufficient representation that can support learning, planning and exploration. In this vein, Azizzadenesheli et al. (2016) investigate spectral methods (Anandkumar et al., 2014) for latent variable model estimation in POMDPs, but only consider tabular scenarios with finite states and actions. Predictive State Representations (PSR) (Littman & Sutton, 2001; Boots et al., 2011) also leverage spectral decomposition, but instead of recovering an underlying latent variable model, they learn an equivalent sufficient representation of belief. These methods have been extended to real-world settings with continuous observations and actions by exploiting kernel embeddings (Boots et al., 2013) or deep models (Downey et al., 2017; Venkatraman et al., 2017; Guo et al., 2018). However, efficient exploration and tractable planning with spectral representations have yet to be thoroughly developed (Zhan et al., 2022). In this paper, we propose Energy-based Predictive Representation (EPR) to support efficient and tractable learning, planning, and exploration in POMDPs (and MDPs), as a solution to the aforementioned question.
More specifically:
• We propose a flexible nonlinear energy-based model for induced belief-state MDPs without explicit parameterization of beliefs, providing a principled linear sufficient representation for the state-action value function.
• We reveal the connection between EPR and PSR, while also illustrating the differences, to demonstrate the modeling ability of the proposed EPR.
• We provide computationally tractable learning and planning algorithms for EPR that implement the principles of optimism and pessimism in the face of uncertainty for online and offline RL, balancing exploration and exploitation.
• We conduct a comprehensive comparison to existing state-of-the-art RL algorithms in both MDP and POMDP benchmarks, demonstrating superior empirical performance of the proposed EPR.

2. PRELIMINARIES

In this section, we briefly introduce POMDPs and their degenerate case of MDPs, identifying the special structures that will be used to derive the proposed representation learning method.

[…] and reward $r(s_{h+1}, a_{h+1})$. Due to partial observability, the dependence between observations is non-Markovian; hence, we define a policy $\pi = \{\pi_t\}$, where $\pi_t : \mathcal{O} \times (\mathcal{A} \times \mathcal{O})^t \to \Delta(\mathcal{A})$, to depend on the whole history, i.e., $x_t = \{o_0, \{a_i, o_{i+1}\}_{i=0}^{t-1}\}$. The corresponding value for policy $\pi$ can be defined as $V^\pi = \mathbb{E}_\pi\left[\sum_{h=1}^{H} r(s_h, a_h)\right]$, and the objective is to find the optimal policy $\pi^* = \arg\max_\pi V^\pi$.

Markov decision processes (MDPs) are a degenerate case of POMDPs, where $\mathcal{S} = \mathcal{O}$ and $O(o|s) = \delta(o = s)$, and can be specified as $\mathcal{M} = (\mathcal{S}, \mathcal{A}, r, H, \mu, P)$. One can also convert a POMDP to an MDP by treating the whole history $x_t = \{o_0, \{a_i, o_{i+1}\}_{i=0}^{t-1}\}$ as the state. Specifically, following (Kaelbling et al., 1998), […] (1)

Each entry of the belief state describes the probability of the underlying state given the past history. Furthermore, with a slight abuse of notation, we use $b_t$ to denote the belief state at step $t$. Then, one can construct the equivalent belief MDP $\mathcal{M}_b = (\mathcal{X}, \mathcal{A}, R_h, H, \mu_b, T_b)$ with $\mathcal{X}$ denoting the set of possible histories, and

$$\mu_b := \int b(s|o_1)\,\mu(o_1)\,do_1, \qquad R_t(b_t, a) = \int b_t(s_t)\, r(s_t, a)\, ds_t, \tag{2}$$

$$T_b(b_{t+1}|b_t, a_t) := \int_{\mathcal{O}} \mathbf{1}_{b_{t+1} = b(x_{t+1})}\, P(o_{t+1}|b_t, a_t)\, do_{t+1}. \tag{3}$$

Therefore, the corresponding value functions $V_h^\pi(b_h)$ and $Q_h^\pi(b_h, a_h)$ for the belief MDP given a policy $\pi$ can be defined as

$$V_h^\pi(b_h) = \mathbb{E}\left[\sum_{t=h}^{H} R_t(b_t, a_t) \,\Big|\, b_h\right], \qquad Q_h^\pi(b_h, a_h) = \mathbb{E}\left[\sum_{t=h}^{H} R_t(b_t, a_t) \,\Big|\, b_h, a_h\right].$$

Following the MDP perspective, we also have the Bellman recursive equation:

$$V_h^\pi(b_h) = \mathbb{E}_\pi\left[Q_h^\pi(b_h, a_h)\right], \qquad Q_h^\pi(b_h, a_h) = R_h(b_h, a_h) + \mathbb{E}_{T_b}\left[V_{h+1}^\pi(b_{h+1})\right]. \tag{4}$$

One can still apply a dynamic programming style approach to solve POMDPs according to (4). However, since the belief depends on the entire history, the number of possible beliefs can still be infinite even if the number of states is finite. To combat these essential difficulties, we will leverage two particular structures, observability and linearity, as introduced below.

Observability in POMDPs. It has been shown (Even-Dar et al., 2007; Golowich et al., 2022b; Uehara et al., 2022) that for POMDPs with an observability assumption, one can safely relax the history dependence to a short window, bypassing the exponential sample and planning complexity w.r.t. horizon length (Golowich et al., 2022a; b). Specifically, the observability property for POMDPs is defined as follows.

Assumption 1 ((Even-Dar et al., 2007; Golowich et al., 2022b)). The POMDP with emission model $O$ satisfies $\gamma$-observability if for arbitrary beliefs $b$ and $b'$ over states, $\|\langle O, b\rangle - \langle O, b'\rangle\|_1 \geqslant \gamma \|b - b'\|_1$, where $\langle O, b\rangle := \int O(o|s)\, b(s)\, ds$.

A key consequence of observability is that the belief can be well approximated with a short history window (Golowich et al., 2022b), and one can construct an approximate MDP based on a finite belief history, which eliminates the exponential complexity induced by full history dependence. Specifically, we denote by $L$ the length of the window. Then, defining $x_t^L = \left(o_{t-L}, \{a_i, o_{i+1}\}_{i=t-L}^{t}\right) \in \mathcal{X}^L$, the approximated beliefs $b^L$ follow the same recursive definition as (1) but with only the finite history $x_t^L$, starting from the uniform belief. This immediately induces an approximate MDP $\mathcal{M}_b^L = \left(\mathcal{X}^L, \mathcal{A}, R_h^L, H, \mu_b, T_b^L\right)$ according to (3) with $b^L$ instead of $b$. Theorem 2.1 in Golowich et al. (2022a) proves that the approximation error of the finite-memory belief MDP is small for observable POMDPs. Hence, with slight abuse of notation, we still use $b$ to represent $b^L$ throughout the paper.
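To make the observability condition concrete, the following is a minimal sketch for the tabular case only, not part of the proposed method. It assumes a finite state and observation space with an emission matrix `emission[s, o]` standing in for $O(o|s)$; the helper names `observability_ratio` and `sampled_gamma_bound` are hypothetical. Because the ratio is only evaluated on randomly sampled belief pairs, the result is an upper bound on the largest admissible $\gamma$, not $\gamma$ itself.

```python
import numpy as np

def observability_ratio(emission, b1, b2):
    """Return ||<O,b1> - <O,b2>||_1 / ||b1 - b2||_1 for a tabular emission matrix.

    emission[s, o] plays the role of O(o | s); b1 and b2 are beliefs over states.
    """
    diff = np.asarray(b1, dtype=float) - np.asarray(b2, dtype=float)
    return np.abs(emission.T @ diff).sum() / np.abs(diff).sum()

def sampled_gamma_bound(emission, num_pairs=1000, seed=0):
    """Smallest ratio over sampled belief pairs (an upper bound on gamma)."""
    rng = np.random.default_rng(seed)
    n_states = emission.shape[0]
    # Shape (num_pairs, 2, n_states): each entry is a pair of random beliefs.
    pairs = rng.dirichlet(np.ones(n_states), size=(num_pairs, 2))
    return min(observability_ratio(emission, b1, b2) for b1, b2 in pairs)

# Fully observable case (an MDP): observations reveal the state exactly.
identity_emission = np.eye(3)
# Uninformative case: every state emits the same uniform distribution.
uninformative_emission = np.full((3, 3), 1.0 / 3.0)

gamma_identity = sampled_gamma_bound(identity_emission)
gamma_uninformative = sampled_gamma_bound(uninformative_emission)
```

In this toy setting the identity emission gives a ratio of one for every pair, while the uninformative emission gives a ratio that is numerically zero, which is the regime where a short history window cannot recover the belief.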

Linearity in MDPs.

To handle the complexity induced by large state spaces, linear/low-rank structures have been introduced in MDPs (Jin et al., 2020b; Agarwal et al., 2020) for effective function approximation. These structures leverage a spectral factorization of the transition dynamics and reward:

$$P(s'|s, a) = \langle \phi(s, a), \mu(s') \rangle, \qquad r(s, a) = \langle \phi(s, a), \theta \rangle,$$

where $\phi : \mathcal{S} \times \mathcal{A} \to \mathcal{H}$ and $\mu : \mathcal{S} \to \mathcal{H}$ are two feature maps to a Hilbert space $\mathcal{H}$. Under such an assumption, we can represent the state-action value function $Q^\pi$ for an arbitrary policy $\pi$ by

$$Q^\pi(s, a) = r(s, a) + \gamma \int V^\pi(s')\, P(s'|s, a)\, ds' = \left\langle \phi(s, a),\; \theta + \gamma \int V^\pi(s')\, \mu(s')\, ds' \right\rangle,$$

which implies that instead of a complicated function space defined on the raw state space, one can design a computationally efficient planning algorithm and a sample-efficient exploration algorithm in the space linearly spanned by $\phi$. In fact, from the correspondence between policy and Q-function discussed in (Ren et al., 2022a), $\phi$ can be understood as representing primitives for skill set construction. Efficient and practical algorithms have been designed for exploiting linearity in MDPs (Zhang et al., 2022; Qiu et al., 2022), which inspires us to exploit similar properties in POMDPs.

Energy-based Models. Energy-based models are among the most flexible models for representing a conditional probability measure. They take the form $p(y|x) = \exp(-f(x, y))/Z(x)$, where $f(x, y)$, which can be parametrized by deep models, is the energy of $(x, y)$, and $Z(x)$ is a partition function that depends only on $x$ and guarantees that $p(y|x)$ is a valid probability measure. When $y$ is discrete, we have $p(y|x) = \exp(-f(x, y)) / \sum_{y'} \exp(-f(x, y'))$, which corresponds to the standard softmax probability with logits $-f(x, y)$. We refer interested readers to Song & Kingma (2021) for training methods for energy-based models.
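The linear form of $Q^\pi$ can be checked numerically on a small tabular example. The sketch below is purely illustrative: it builds a random low-rank MDP (the sizes, the Dirichlet sampling and all variable names are assumptions made for the example), solves the discounted Bellman equation exactly, and compares the result with $\langle \phi(s,a), \theta + \gamma \sum_{s'} V^\pi(s')\mu(s') \rangle$.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, d, gamma = 5, 3, 2, 0.9  # toy sizes, chosen only for illustration

# phi(s, a): a distribution over d latent factors; mu_k(.): a distribution over s'.
phi = rng.dirichlet(np.ones(d), size=S * A)   # shape (S*A, d), row index s*A + a
mu = rng.dirichlet(np.ones(S), size=d)        # shape (d, S)
theta = rng.uniform(size=d)

P = phi @ mu          # P(s'|s,a) = <phi(s,a), mu(s')>, rows sum to one
r = phi @ theta       # r(s,a) = <phi(s,a), theta>
pi = rng.dirichlet(np.ones(A), size=S)        # an arbitrary policy pi(a|s)

# State-action transition under pi: P_pi[(s,a),(s',a')] = P(s'|s,a) * pi(a'|s').
P_pi = (P[:, :, None] * pi[None, :, :]).reshape(S * A, S * A)

# Exact Q^pi from the Bellman equation Q = r + gamma * P_pi Q.
Q = np.linalg.solve(np.eye(S * A) - gamma * P_pi, r)
V = (pi * Q.reshape(S, A)).sum(axis=1)

# Linear weights predicted by the low-rank structure.
w = theta + gamma * mu @ V
max_gap = np.max(np.abs(Q - phi @ w))
```

Here `max_gap` is zero up to floating-point error, which reflects that $Q^\pi$ lies in the span of $\phi$ for any policy when the transition and reward share the feature map.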

3. ENERGY-BASED PREDICTIVE REPRESENTATION

We propose Energy-based Predictive Representation (EPR), which introduces linearity into finite-history approximated POMDPs, allowing the complexity induced by large state spaces and long histories to be overcome, yielding improved efficiency for learning, planning and exploration. We emphasize that the proposed method is also applicable to MDPs. The approach builds upon recent progress in large-state MDPs (Zhang et al., 2022; Qiu et al., 2022) that leverages linear structure in the dynamics, $P(s'|s, a) = \langle \phi(s, a), \mu(s') \rangle$, to obtain an efficient and practical framework for learning, planning and exploration. Recall the construction of a finite-memory belief MDP to approximate a POMDP discussed in Section 2, which avoids full history dependence. For such a constructed belief MDP, a natural idea is to apply linear MDP algorithms, i.e., extracting the linear decomposition $T_b^L(b'|b, a) = \langle \phi(b, a), \mu(b') \rangle$, to handle the hardness of POMDPs mentioned in Section 1. However, there are several difficulties in such a straightforward extension: i) the set of beliefs is proportional to the number of states, which could be infinite; ii) the factorization of the transition dynamics (3) in the belief MDP is difficult. These difficulties hinder the extension of linear MDPs to observable POMDPs. However, note that we never explicitly require the beliefs and their dynamics, but only the representation $\phi(b, a)$. As beliefs are functions of finite-window histories, the representation can also be rewritten as $\phi(x_t, a_t)$, which suggests that one might bypass the inherent difficulties by a reparameterization trick.
Consider the energy-based parametrization (Arbel et al., 2020) for $P(o_{t+1}|b(x_t), a_t)$, where $b(x_t)$ is the belief for history $x_t$:

$$P(o_{t+1}|b(x_t), a_t) = p(o_{t+1}) \exp\left(f(x_t, a_t)^\top \left(g(o_{t+1}) + \lambda f(x_t, a_t)\right)\right), \tag{6}$$

$$\mathbb{E}_{p(o_{t+1})}\left[\exp\left(f(x_t, a_t)^\top \left(g(o_{t+1}) + \lambda f(x_t, a_t)\right)\right)\right] = 1, \quad \forall (x_t, a_t) \in \mathcal{X}^L,$$

where $\lambda$ is a scalar, $p(o)$ is a fixed distribution, and the normalization condition enforces that the energy-based model $P(o_{t+1}|b(x_t), a_t)$ is a valid distribution. We avoid any explicit parametrization and computation of beliefs $b$, while preserving dependence through $f$ and $g$, which will be learned jointly. Compared to standard parametrizations, we do not need to specify unnecessary model parameters for the transition dynamics $P$ and emission $O$, and we bypass any learning and approximation of beliefs that would induce compounding errors. As a special case, we note that the observable Linear-Quadratic Gaussian (LQG) actually follows (6) with a specific $\lambda$ and $p(o)$. See Appendix C for details.

Meanwhile, this approach also provides a linear factorization of $T_b(b_{t+1}|b_t, a_t)$ almost for free. By viewing the proposed parameterization (6) as a kernel and following the random Fourier feature trick (Rahimi & Recht, 2007; Choromanski et al., 2020; Ren et al., 2022b), one can write

$$P(o_{t+1}|b_t, a_t) = \mathbb{E}_\omega\left[\phi_\omega(x_t, a_t)\, \psi_\omega(o_{t+1})\right], \tag{8}$$

where $\omega_i \sim \mathcal{N}(0, I_d)$ and

$$\phi(x_t, a_t) = \left[\exp\left(\left(\lambda - \tfrac{1}{2}\right)\|f(x_t, a_t)\|^2 + \omega_i^\top f(x_t, a_t)\right)\right]_{i=1}^{d}, \tag{9}$$

$$\psi(o_{t+1}) = \left[p(o_{t+1}) \exp\left(\omega_i^\top g(o_{t+1}) - \frac{\|g(o_{t+1})\|^2}{2}\right)\right]_{i=1}^{d},$$

which can be derived from the softmax random feature of Choromanski et al. (2020). We also provide a derivation in Appendix B. Substituting (8) into (3) yields the factorization of $T_b$ as

$$T_b(b_{t+1}|b_t, a_t) = \int_{\mathcal{O}} \mathbf{1}_{b(\cdot|x_{t+1}) = b(\cdot|x_t, a_t, o_{t+1})}\, \mathbb{E}_\omega\left[\phi_\omega(x_t, a_t)\, \psi_\omega(o_{t+1})\right] do_{t+1} \tag{11}$$

$$= \mathbb{E}_\omega\left[\phi_\omega(x_t, a_t)\, \mu(b_{t+1})\right], \tag{12}$$

where $\mu(b_{t+1}) := \int_{\mathcal{O}} \mathbf{1}_{b(\cdot|x_{t+1}) = b(\cdot|x_t, a_t, o_{t+1})}\, \psi_\omega(o_{t+1})\, do_{t+1}$.
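The random-feature identity behind (8) can be sanity-checked with a Monte Carlo estimate. The sketch below is an illustration under stated assumptions: `f` and `g` are fixed small vectors standing in for the network outputs $f(x_t, a_t)$ and $g(o_{t+1})$, the base density $p(o_{t+1})$ is dropped because it is a common factor on both sides, and the function names `phi` and `psi` are chosen for the example. Small norms are used because the variance of the estimator grows quickly with $\|f + g\|$.

```python
import numpy as np

def phi(f, omega, lam):
    """Random feature of the history-action side, one value per sampled omega."""
    return np.exp((lam - 0.5) * (f @ f) + omega @ f)

def psi(g, omega):
    """Random feature of the next-observation side (without the factor p(o))."""
    return np.exp(omega @ g - 0.5 * (g @ g))

rng = np.random.default_rng(0)
dim, num_samples, lam = 4, 200_000, 0.3  # illustrative values only

# Stand-ins for f(x_t, a_t) and g(o_{t+1}), rescaled to norm 0.5.
f = rng.normal(size=dim)
f *= 0.5 / np.linalg.norm(f)
g = rng.normal(size=dim)
g *= 0.5 / np.linalg.norm(g)

omega = rng.normal(size=(num_samples, dim))       # omega_i ~ N(0, I)
estimate = np.mean(phi(f, omega, lam) * psi(g, omega))
exact = np.exp(f @ (g + lam * f))                 # exp(f^T (g + lam f))
relative_error = abs(estimate - exact) / exact
```

With these settings the Monte Carlo estimate should agree with the closed form to within a fraction of a percent, and the error shrinks as `num_samples` grows.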
With this formula, we obtain a valid linear representation $\phi(b_t, a_t)$ as an Energy-based Predictive Representation (EPR) for a belief MDP without any explicit beliefs. To learn the EPR given data $\mathcal{D} := \{o_{t-1}, a_t, r_t\}_{t=1}^{H}$, we exploit maximum likelihood estimation (MLE) of (6),

$$\min_{f, g}\; -\mathbb{E}_{\mathcal{D}}\left[f(x_t, a_t)^\top \left(g(o_{t+1}) + \lambda f(x_t, a_t)\right)\right], \quad \text{s.t.}\;\; \mathbb{E}_{p(o_{t+1})}\left[\exp\left(f(x_t, a_t)^\top \left(g(o_{t+1}) + \lambda f(x_t, a_t)\right)\right)\right] = 1,\;\; \forall (x_t, a_t) \in \mathcal{X}^L. \tag{14}$$

To ensure the constraints, we add a regularization term

$$\left(\log \mathbb{E}_{o}\left[\exp\left(f(x_t, a_t)^\top (g(o) + \lambda f(x_t, a_t))\right)\right]\right)^2 \tag{15}$$

$$\approx \left(\log \frac{1}{m}\sum_{i=1}^{m} \exp\left(f(x_t, a_t)^\top (g(o_i) + \lambda f(x_t, a_t))\right)\right)^2,$$

with $o_i \sim p$. The objective will be

$$\min_{f, g}\; \mathbb{E}_{\mathcal{D}}\left[-f(x_t, a_t)^\top \left(g(o_{t+1}) + \lambda f(x_t, a_t)\right) + \alpha \left(\log \frac{1}{m}\sum_{i=1}^{m} \exp\left(f(x_t, a_t)^\top (g(o_i) + \lambda f(x_t, a_t))\right)\right)^2\right].$$

In practice, we can further simplify the objective by normalizing $\tilde{f}(x_t, a_t) = \frac{f(x_t, a_t)}{\|f(x_t, a_t)\|_2}$, obtaining the final objective

$$\min_{\tilde{f}, g}\; \mathbb{E}_{\mathcal{D}}\left[-\left(\tilde{f}(x_t, a_t)^\top g(o_{t+1}) + \lambda\right) + \alpha \left(\log \sum_{i=1}^{m} \exp\left(\tilde{f}(x_t, a_t)^\top g(o_i) + \lambda\right)\right)^2\right], \tag{18}$$

which reduces to a contrastive loss that can be optimized by stochastic gradient descent with a deep network parameterization of $f$ and $g$. We obtain negative samples $\{o_i\}_{i=1}^{m} \sim p(o)$ from a mixture of the replay buffer and collected trajectories.

Before we introduce an exploration-exploitation mechanism with EPR in Section 3.1, we first discuss the relationship between the proposed EPR, predictive state representations (PSR) (Littman & Sutton, 2001; Singh et al., 2004), and spectral dynamics embedding (SPEDE) (Ren et al., 2022b).

Connection to PSR (Littman & Sutton, 2001; Singh et al., 2004): The predictive state representation (PSR) was also proposed for bypassing belief calculation by factorizing a variant of the transition dynamics operator.
Specifically, given the history $(x_t, a_t)$, the probability of observing a test, i.e., the finite sequence of events $y = (a_{t+1}, o_{t+1}, \cdots, a_{t+k}, o_{t+k})$ with $k \in \mathbb{N}$, is $p(y|x) := p(o_{t+1}^{t+k}|x_t, a_t^{t+k})$. For time step $t$, one can construct a set of core tests $U = \{u_1, \ldots, u_k\}$ as sufficient statistics for history $x_t$, such that for any test $\tau$, $p(\tau|x_t) = \langle p(U|x_t), w_\tau \rangle$ for some weights $w_\tau \in \mathbb{R}^{|U|}$. The forward dynamics can be represented in a PSR by Bayes' rule:

$$p(\tau|x_t, a_t, o_{t+1}) = \frac{w_{(\tau, a_t, o_{t+1})}^\top\, p(U|x_t)}{w_{(a_t, o_{t+1})}^\top\, p(U|x_t)},$$

which implies that a PSR updates with new observations and actions by repeating a calculation for each $u_i \in U$. Although originally defined for tabular cases, PSRs have been extended to continuous observations by introducing kernels (Boots et al., 2013) or neural networks (Guo et al., 2018; Downey et al., 2017; Hefny et al., 2018). The proposed EPR clearly shares similarities with PSR: both factorize conditional distributions defined by the dynamics. However, these representations are designed for different purposes, and thus have different usages and updates. Concretely, EPR is proposed for seeking a linear space that can represent the Q-function. The representation is designed to preserve linearity with successive observations without the need for Bayesian updates, which induce extra nonlinearity in PSRs. This linear property leads to efficient exploration and planning in EPR, while an efficient exploration and planning algorithm has not yet been discussed for PSR.

Algorithm 1 (steps 3 to 9)
3: Collect data $\{(x_{i,j}, a_{i,j}, o_{i,j}, r_{i,j})\}_{j=1}^{H}$ with the mixture policy $\tilde{\pi}_i = \xi \pi_i + (1 - \xi)\pi_0$, and add the data to $\mathcal{D}$.
4: Optimize $f$ and $g$ with (18) using the data from $\mathcal{D}$.
5: Obtain the representation $\phi(x_t, a_t)$ via (9) using $\{\omega_i\}_{i=1}^{n}$.
6: Add the bonus (19) to the reward and obtain the optimal policy $\pi_{i+1}$, with $Q(x_t, a_t)$ parameterized upon $\phi(x_t, a_t)$ and optimized via FQI.
7: (Optional) Extract policy by soft-AC from the learned $Q$.
8: end for
9: Return $\pi_K$.

Connection to SPEDE (Ren et al., 2022b): Linear random features have been proposed for solving planning in MDPs with nonlinear dynamics in (Ren et al., 2022b), where the transition operator is defined as $T(s'|s, a) \propto \exp\left(-\|s' - f(s, a)\|_2^2 / (2\sigma^2)\right)$, corresponding to dynamics $s' = f(s, a) + \epsilon$ with Gaussian noise $\epsilon \sim \mathcal{N}(0, \sigma^2 I)$. Beyond the generalization of EPR to POMDPs, even in an MDP, EPR considers a general energy-based model for the dynamics, $T(s'|s, a) \propto p(s') \exp\left(f(s, a)^\top (g(s') + \lambda f(s, a))\right)$, which is far more flexible than the Gaussian perturbation model considered in SPEDE.
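As an illustration of the simplified contrastive objective (18) from earlier in this section, the following NumPy sketch evaluates the loss for a batch of precomputed network outputs. It is a minimal sketch, not the authors' implementation: the function name `epr_contrastive_loss`, the array layout, and the reuse of one shared set of negatives for the whole batch are assumptions. The constant $\lambda$ is included in the positive term with a negative sign, following the MLE objective (14) after normalization; being constant, it does not affect gradients.

```python
import numpy as np

def epr_contrastive_loss(f_out, g_pos, g_neg, lam, alpha):
    """Evaluate a batch estimate of the normalized EPR objective.

    f_out: (B, d) outputs f(x_t, a_t) before normalization.
    g_pos: (B, d) outputs g(o_{t+1}) for the observed next observations.
    g_neg: (m, d) outputs g(o_i) for negative samples o_i ~ p(o).
    """
    # Normalize f so that lam * ||f||^2 reduces to the constant lam.
    f_tilde = f_out / np.linalg.norm(f_out, axis=1, keepdims=True)

    positive = np.sum(f_tilde * g_pos, axis=1) + lam   # shape (B,)
    negative = f_tilde @ g_neg.T + lam                 # shape (B, m)

    # Numerically stable log-sum-exp over the m negatives.
    row_max = negative.max(axis=1, keepdims=True)
    log_sum_exp = row_max[:, 0] + np.log(np.exp(negative - row_max).sum(axis=1))

    return float(np.mean(-positive + alpha * log_sum_exp ** 2))

# Tiny hand-checkable example: one sample, two all-zero negatives.
example_loss = epr_contrastive_loss(
    f_out=np.array([[3.0, 4.0]]),
    g_pos=np.array([[1.0, 0.0]]),
    g_neg=np.zeros((2, 2)),
    lam=0.0,
    alpha=1.0,
)
```

In the example, the normalized feature is $(0.6, 0.8)$, so the positive score is $0.6$ and the regularizer is $(\log 2)^2$. In a training loop the same expression would be written in an automatic differentiation framework so that gradients flow to the parameters of $f$ and $g$.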

3.1. ONLINE EXPLORATION AND OFFLINE POLICY OPTIMIZATION WITH EPR

With an EPR $\phi(x_t, a_t)$ learned for a POMDP, we can represent the Q-function linearly for the approximated belief MDP, and thus achieve computationally efficient planning, while calculating confidence bounds for implementing optimism/pessimism in the face of uncertainty.

Exploration and Exploitation with Elliptical Confidence Bound. Given the learned representation $\phi(x_t, a_t)$, the confidence bounds can be calculated efficiently, which allows efficient implementation of optimism/pessimism in the face of uncertainty via upper/lower confidence bounds (Abbasi-Yadkori et al., 2011; Jin et al., 2020b; Uehara et al., 2021). This is achieved simply by adding an elliptical bonus to the reward $R(x, a)$. Specifically, given the collected dataset $\mathcal{D} = \{(x_i^L, a_i, R_i, o_{i+1})\}_{i=1}^{n}$, we calculate the confidence bound as the bonus,

$$b(x_t, a_t) = \phi(x_t, a_t)^\top \Sigma_n^{-1} \phi(x_t, a_t), \tag{19}$$

where $\Sigma_n = \lambda I + \sum_{i=1}^{n} \phi(x_i^L, a_i)\phi(x_i^L, a_i)^\top$ and $\lambda$ is a pre-specified hyperparameter. One can then implement UCB/LCB by adding/subtracting the bonus to/from $R(x_t, a_t)$, and performing planning on the modified reward function.

Planning with Obtained Representation. Planning can be conducted by Bellman recursion within the linear space spanned by $\phi(x_t^L, a_t)$ without a bonus. However, with an additional bonus term, $Q^\pi$ no longer lies in the linear space of $\phi$, since

$$Q^\pi(x_t^L, a_t) = R(x_t^L, a_t) + b(x_t^L, a_t) + \mathbb{E}_{T_b^L, \pi}\left[Q^\pi(x_{t+1}, a_{t+1})\right].$$

As discussed in (Zhang et al., 2022), one can augment the feature space as $\psi(x, a) := \{\phi(x, a), b(x, a)\}$ to ensure that the Q-functions can be linearly represented, but with an extra $O(d^2)$ memory cost. In practice, we perform fitted Q iteration with a nonlinear component extending the linear parameterization, i.e., $Q(x, a) = \{w_1, w_2\}^\top \left[\phi(x, a), \sigma\left(w_3^\top \phi(x, a)\right)\right]$. We provide an outline of our implementation of UCB in Algorithm 1. LCB for pessimistic offline RL is similar, but uses a pre-collected dataset $\mathcal{D}$ without the data collection iteration in Step 2, and with the bonus subtracted in Step 6.

Our algorithm follows the standard interaction paradigm between the agent and the environment, where for each episode, the agent executes the policy and logs the data to the dataset. Then we perform representation learning and optimistic planning with the Q-function parameterized upon the learned representation. Finally, we also extract a policy from the learned Q by soft actor-critic (Haarnoja et al., 2018).
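The elliptical bonus described in this subsection can be computed in a few lines. The sketch below is a hedged illustration: it assumes the regularized covariance takes the ridge form $\Sigma_n = \lambda I + \sum_i \phi_i \phi_i^\top$, uses the quadratic form $\phi^\top \Sigma_n^{-1} \phi$ as written in the text (linear bandit analyses often take its square root), and the function name `elliptical_bonus` and the array layout are hypothetical.

```python
import numpy as np

def elliptical_bonus(features, query, lam):
    """Return phi^T Sigma_n^{-1} phi for every row of `query`.

    features: (n, d) representations phi(x_i^L, a_i) from the dataset.
    query:    (q, d) representations at which the bonus is evaluated.
    lam:      ridge coefficient that keeps Sigma_n invertible.
    """
    d = features.shape[1]
    sigma = lam * np.eye(d) + features.T @ features
    # Solve Sigma_n z = phi instead of forming the explicit inverse.
    solved = np.linalg.solve(sigma, query.T)        # shape (d, q)
    return np.sum(query * solved.T, axis=1)         # shape (q,)

# One data point along the first axis: that direction becomes less uncertain.
dataset_features = np.array([[1.0, 0.0]])
query_features = np.eye(2)
bonus = elliptical_bonus(dataset_features, query_features, lam=1.0)

# Optimistic (UCB) and pessimistic (LCB) rewards for some nominal rewards.
rewards = np.array([0.2, 0.2])
ucb_rewards = rewards + bonus
lcb_rewards = rewards - bonus
```

In the example the covered direction receives a bonus of 0.5 and the unseen direction a bonus of 1.0, so optimistic planning is drawn toward the direction with less data, while pessimistic planning penalizes it.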

4. RELATED WORK

Partial Observability in Reinforcement Learning. Despite the essential hardness of POMDPs in terms of learning, planning and exploration (Papadimitriou & Tsitsiklis, 1987; Madani et al., 1998; Vlassis et al., 2012; Jin et al., 2020b), the study of reinforcement learning with partial observations, from both theoretical and empirical aspects, is still attractive due to its practical importance. Algorithmically, model-based and model-free algorithms have been extended to POMDPs, explicitly or implicitly exploiting structure. Model-based RL algorithms parameterize and learn latent dynamics with an emission model explicitly, and plan through simulation upon the learned models. A variety of deep models have been proposed recently for better modeling (Watter et al., 2015; Karl et al., 2016; Igl et al., 2018; Zhang et al., 2019; Lee et al., 2020; Hafner et al., 2019a; b; 2020). Although deep models indeed provide better approximation ability, they also bring new challenges in terms of planning and exploration, which have not been fully handled. On the other hand, model-free RL algorithms have been extended to POMDPs by learning history-dependent value functions and/or policies, through temporal-difference algorithms or policy gradients. For example, deep Q-learning (Mnih et al., 2013) concatenates 4 consecutive frames as the input of a deep neural Q-net, which is then improved by recurrent neural networks for longer windows (Bakker, 2001; Hausknecht & Stone, 2015; Zhu et al., 2017). Recurrent neural networks have also been exploited for history-dependent policies (Schmidhuber, 1990; Bakker, 2001; Wierstra et al., 2007; Heess et al., 2015) in policy gradient algorithms as well as actor-critic approaches (Ni et al., 2021; Meng et al., 2021). Model-free RL for POMDPs bypasses the planning complexity of model-based RL algorithms. However, the difficulty in exploration remains, which leads to suboptimal performance in practice.
By contrast, the proposed EPR not only can be efficiently learned, but is also equipped with simple yet principled planning and exploration methods, which has not been previously achieved.

Representation Learning for RL. Successful vision-based representation learning methods have been extended to RL for extracting compact and invariant state-only information from raw pixels, e.g., (Laskin et al., 2020a; b; Kostrikov et al., 2020). However, such vision-based features are not specially designed for capturing the properties of POMDPs/MDPs that are essential for decision making. To reveal structure that is particularly helpful for RL, many representation learning methods have been designed for different purposes, such as bi-simulation (Ferns et al., 2004; Gelada et al., 2019; Zhang et al., 2020), successor features (Dayan, 1993; Barreto et al., 2017; Kulkarni et al., 2016), spectral decomposition of transition operators (Mahadevan & Maggioni, 2007; Wu et al., 2018; Duan et al., 2019), latent future prediction (Schwarzer et al., 2020; Stooke et al., 2021) and contrastive learning (Oord et al., 2018; Mazoure et al., 2020; Nachum & Yang, 2021; Yang et al., 2021). These representation methods ignore the requirement of planning tractability. Moreover, they are learned from a pre-collected dataset, which ignores the exploration issue. Features that are able to represent value functions are desirable for efficient planning and exploration. Based on the linear MDP structure (Jin et al., 2020b), several theoretical algorithms (Agarwal et al., 2020; Uehara et al., 2021) have been developed. Ren et al. (2022b), Zhang et al. (2022), Qiu et al. (2022) and Ren et al. (2022a) bridge the gap between theory and practice and bypass computational intractability via different techniques, demonstrating advantages empirically. The proposed EPR is inspired by this class of representations, but extends it to POMDPs, which is highly non-trivial.

5. EXPERIMENTS

Our experiments investigate how our algorithm performs in robotic locomotion simulation environments. We extensively evaluate the proposed approach on the Mujoco (Brockman et al., 2016) and DeepMind Control (Tassa et al., 2018) suites. We conduct experiments in both the fully observable MDP and partially observable POMDP settings.

5.1. FULLY OBSERVABLE MDP

Dense-Reward Mujoco Tasks. We first conduct experiments in the fully observable MDP setting in Mujoco locomotion tasks. This is a test suite commonly used for both model-free and model-based RL algorithms. We compare EPR with model-based RL baselines PETS (Chua et al., 2018) and ME-TRPO (Kurutach et al., 2018), and model-free RL baselines SAC (Haarnoja et al., 2018), TRPO (Schulman et al., 2015) and PPO (Schulman et al., 2017). In addition, we also compare to the representation learning RL baselines Deep Successor Feature (DeepSF) (Kulkarni et al., 2016) and SPEDE (Ren et al., 2022b). We list the best model-based RL results (except for iLQR (Li & Todorov, 2004)) in MBBL (Wang et al., 2019) for comparison. All algorithms are run for 200K environment steps. The results are averaged across four random seeds with a window size of 10K. Tab. 1 shows that EPR significantly outperforms all the baselines, including the strong previous SoTA model-free algorithm SAC. In particular, we observe that most model-based algorithms have a hard time learning the walk and hop behaviors in the Walker and Hopper environments, respectively. We suspect that this is because the quality of the data is poor during the initial data collection process (e.g., the agent often falls to the ground or has a hard time standing up). As a result, the behavior learned by most model-based algorithms can be suboptimal. For example, some model-based algorithms only learn to stand up without hopping in the Hopper environment. In contrast, EPR achieves SoTA performance in the Hopper and Ant tasks, indicating effective exploration in the task domain.

Sparse-Reward DM Control Tasks. Manually designed dense reward functions are extremely hard to obtain, and in practical real-robot settings it is difficult to gain access to a good dense reward function. Thus, exploration in sparse-reward settings is a key consideration for the success of RL in robotics.
We compare EPR with SAC and PPO in such cases, with DeepSF as an additional representation learning RL baseline. Note that the critic network used in SAC and PPO is deeper than that of EPR. From Tab. 2, we see that EPR achieves a particularly large gain over SAC and PPO in the sparse-reward task walker-run-sparse.

5.2. PARTIALLY OBSERVABLE MDP COVERING VELOCITY

Mujoco. Often in practice, it is hard to recover a full observation of the states. Thus, the ability to handle a partially observed MDP (POMDP) is also important when we can only recover partial observations. To conduct experiments in this setting, we mask the velocities in the observations (replacing them by 0). We compare to algorithms with different embedding approaches that map a given history sequence to a latent representation, which is used as the input for a SAC planner. We consider four embedding methods as baselines: Transformer (Trans), GRU, PSR (Guo et al., 2018), and finally a simple MLP baseline as a sanity check, which concatenates the history sequence and directly maps it to a latent feature using an MLP. We find that this setting is very challenging and the performance of all algorithms degrades compared to the fully observable setting. Nevertheless, the proposed algorithm still achieves SoTA performance in tasks like Halfcheetah, Ant and SlimHumanoid. This demonstrates the capability of EPR to handle partial observability, which can have an important effect in practice.

DM Control Suite. Correspondingly, we conduct POMDP experiments in the DM Control Suite. However, we find that covering all the velocities is very challenging, and thus we cover only the last 3 dimensions of the velocity.
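The masking protocol and the history concatenation used by the MLP baseline can be sketched as follows. This is only an illustration of the setup described above: which observation indices hold velocities depends on the environment and must be supplied by the user, and the helper names `mask_velocity` and `history_window`, as well as the choice of `length + 1` observations and `length` actions per window, are assumptions.

```python
import numpy as np

def mask_velocity(observation, velocity_indices):
    """Return a copy of the observation with the velocity entries set to 0."""
    masked = np.array(observation, dtype=float, copy=True)
    masked[..., velocity_indices] = 0.0
    return masked

def history_window(observations, actions, length):
    """Concatenate the last `length + 1` observations and `length` actions."""
    if length < 1:
        raise ValueError("length must be at least 1")
    recent_obs = np.asarray(observations[-(length + 1):], dtype=float)
    recent_act = np.asarray(actions[-length:], dtype=float)
    return np.concatenate([recent_obs.ravel(), recent_act.ravel()])

# Toy 4-dimensional observation whose last two entries are assumed velocities.
raw_observation = [1.0, 2.0, 3.0, 4.0]
masked_observation = mask_velocity(raw_observation, velocity_indices=[2, 3])

# Toy trajectory with 2-dimensional observations and 1-dimensional actions.
toy_observations = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
toy_actions = [[0.0], [1.0]]
window = history_window(toy_observations, toy_actions, length=1)
```

The flattened `window` vector is the kind of input that a history encoder (MLP, GRU, Transformer, or the EPR network $f$) would consume in place of the full state.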

5.3. IMAGE-BASED ENVIRONMENTS

Figure 1: EPR in an image-based environment. EPR performs well compared to all baselines (e.g., SPR and SAC+AE).

To test the capability of our method in image-based environments, we conduct an additional experiment on MetaWorld (Yu et al., 2020). We choose one of the fetch-reach tasks and compare against the model-free algorithm SAC+AE (Yarats et al., 2021) and a popular representation learning method, SPR (Schwarzer et al., 2020). We show the results in Fig. 1, using the minimum distance between the current state and the goal as the evaluation metric (a smaller distance means better performance). EPR manages to reach the distant goal within 100K steps and strictly dominates SAC+AE. Although SPR learns faster at the beginning, EPR has better final performance.
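One plausible reading of the evaluation metric is the smallest Euclidean distance to the goal reached during an episode. The sketch below follows that reading; the function name and the toy trajectory are hypothetical and the numbers are not experimental data.

```python
import math


def min_goal_distance(positions, goal):
    """Smallest Euclidean distance to the goal over one episode."""
    return min(math.dist(p, goal) for p in positions)


# Made-up end-effector positions over a short episode.
trajectory = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0), (0.9, 0.1, 0.0), (1.2, 0.3, 0.0)]
goal = (1.0, 0.0, 0.0)
metric = min_goal_distance(trajectory, goal)
```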

6. CONCLUSION

We exploit Energy-based Predictive Representation (EPR) for linearly representing value functions for arbitrary policies and supporting reinforcement learning in partially observed environments with finite memories. The proposed EPR shows that planning and strategic exploration can be implemented efficiently. The coherent design of each component brings empirical advantages in RL benchmarks considering both the MDP and POMDP settings. Such superior performance makes the theoretical understanding of EPR more intriguing, which we leave as future work.

A MORE RELATED WORK

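Provable RL for POMDPs. Besides the statistical and computational hardness results for learning and planning in POMDPs, most recent theoretical research focuses on overcoming the statistical complexity from the "curse of history" by considering tractable POMDPs (Krishnamurthy et al., 2016; Azizzadenesheli et al., 2016; Guo et al., 2016; Jin et al., 2020a; Liu et al., 2022).

We have that

$$
\begin{aligned}
P(o_{t+1} \mid x_t, a_t) &= p(o_{t+1}) \exp\left( f(x_t, a_t)^\top \left( g(o_{t+1}) + \lambda f(x_t, a_t) \right) \right) \\
&= p(o_{t+1}) \exp\left( \left( \lambda - \tfrac{1}{2} \right) \| f(x_t, a_t) \|^2 \right) \exp\left( -\frac{\| g(o_{t+1}) \|^2}{2} \right) \exp\left( \frac{\| f(x_t, a_t) + g(o_{t+1}) \|^2}{2} \right),
\end{aligned}
$$

where we have that

$$
\begin{aligned}
\exp\left( \frac{\| f(x_t, a_t) + g(o_{t+1}) \|^2}{2} \right)
&= (2\pi)^{-d/2} \exp\left( \frac{\| f(x_t, a_t) + g(o_{t+1}) \|^2}{2} \right) \int \exp\left( -\frac{\| \omega - (f(x_t, a_t) + g(o_{t+1})) \|^2}{2} \right) d\omega \\
&= (2\pi)^{-d/2} \int \exp\left( -\frac{\| \omega \|^2}{2} + \omega^\top \left( f(x_t, a_t) + g(o_{t+1}) \right) \right) d\omega \\
&= \mathbb{E}_{\omega \sim \mathcal{N}(0, I_d)} \left[ \exp\left( \omega^\top f(x_t, a_t) \right) \exp\left( \omega^\top g(o_{t+1}) \right) \right],
\end{aligned}
$$

which concludes the proof for equation 8.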
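The last identity in the proof can be checked numerically. The following is a minimal Monte Carlo sketch, assuming small fixed vectors as stand-ins for $f(x_t, a_t)$ and $g(o_{t+1})$; the helper names, the sample size, and the seed are illustrative choices and are not part of the algorithm.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def random_feature_estimate(f, g, n_samples, rng):
    """Monte Carlo estimate of E_w[exp(w.f) exp(w.g)] with w ~ N(0, I_d)."""
    d = len(f)
    total = 0.0
    for _ in range(n_samples):
        w = [rng.gauss(0.0, 1.0) for _ in range(d)]
        total += math.exp(dot(w, f)) * math.exp(dot(w, g))
    return total / n_samples


def closed_form(f, g):
    """exp(||f + g||^2 / 2), the left-hand side of the identity."""
    s = [a + b for a, b in zip(f, g)]
    return math.exp(0.5 * dot(s, s))


rng = random.Random(0)
f = [0.2, -0.1, 0.3]   # stand-in for f(x_t, a_t)
g = [0.1, 0.2, -0.2]   # stand-in for g(o_{t+1})
estimate = random_feature_estimate(f, g, n_samples=100_000, rng=rng)
exact = closed_form(f, g)
```

The variance of the estimator grows quickly with $\| f + g \|$, so the agreement between `estimate` and `exact` is tight only for embeddings of moderate norm.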

C OBSERVABLE LQG AS EPR

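Following the standard notation, the dynamics of the Linear-Quadratic Gaussian (LQG) system are defined as

$$
s_t = A s_{t-1} + B a_t + w_t, \qquad o_t = C s_{t-1} + z_t,
$$

where $w_t$ and $z_t$ are Gaussian noise. Define the matrix $G_L = [C^\top, (CA)^\top, \ldots, (CA^{L-1})^\top]^\top$ and the reduced observation

$$
\tilde{o}_t = o_t - z_t - C \left( \sum_{k=0}^{t-2} A^k B a_{t-k-1} + \sum_{k=0}^{t-2} A^k w_{t-k-2} \right).
$$

By the observability condition of LQG, $G_L$ has full column rank, so one can identify $s_0$ by

$$
s_0 = \left( G_L^\top G_L \right)^{-1} \sum_{j=1}^{L} \left( A^\top \right)^{j-1} C^\top \tilde{o}_j.
$$

Therefore, we have

$$
\begin{aligned}
s_1 &= A s_0 + B a_0 + w_0 = A \left( \left( G_L^\top G_L \right)^{-1} \sum_{j=1}^{L} \left( A^\top \right)^{j-1} C^\top \tilde{o}_j \right) + B a_1 + w_0, \\
s_2 &= A s_1 + B a_1 + w_1 = A^2 \left( \left( G_L^\top G_L \right)^{-1} \sum_{j=1}^{L} \left( A^\top \right)^{j-1} C^\top \tilde{o}_j \right) + A B a_1 + B a_2 + A w_0 + w_1, \\
s_{L+1} &= A s_L + B a_L + w_L = A^L \left( \left( G_L^\top G_L \right)^{-1} \sum_{j=1}^{L} \left( A^\top \right)^{j-1} C^\top \tilde{o}_j \right) + \sum_{j=0}^{L} A^{L-j} B a_{j+1} + \sum_{j=0}^{L} A^{L-j} w_j, \\
o_{L+1} &= C s_{t+1} + z_t = C A^L \left( \left( G_L^\top G_L \right)^{-1} \sum_{j=1}^{L} \left( A^\top \right)^{j-1} C^\top \tilde{o}_j \right) + C \sum_{j=0}^{L} A^{L-j} B a_{j+1} + C \sum_{j=0}^{L} A^{L-j} w_j + z_t,
\end{aligned}
$$

which means $o_{L+1}$ follows a Gaussian distribution with mean as a function of the history $x_L = (o_{i-1}, a_i)_{i=1}^{L}$ and the action $a_{L+1}$, and variance as a function of $\sigma_w$, $\sigma_z$, and $(A, B, C)$. Therefore, there exist functions $f_{A,B,C,\sigma_w,\sigma_z}$ and $g_{A,B,C,\sigma_w,\sigma_z}$ such that

$$
g_{A,B,C,\sigma_w,\sigma_z}(o_{L+1}) = f_{A,B,C,\sigma_w,\sigma_z}(x_L, a_{L+1}) + \xi, \qquad \xi \sim \mathcal{N}(0, I).
$$

On the other hand, setting $\lambda = -\tfrac{1}{2}$ and $p(o) = \mathcal{N}(0, I)$ in (6), we obtain

$$
p(o_{L+1} \mid x_L, a_L) \propto \exp\left( -\frac{\| g(o_{L+1}) - f(x_L, a_L) \|_2^2}{2} \right),
$$

which reproduces the observable LQG with specific $f_{A,B,C,\sigma_w,\sigma_z}$ and $g_{A,B,C,\sigma_w,\sigma_z}$.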
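The identification of $s_0$ from the reduced observations is an ordinary least-squares solve against the observability matrix, since $\sum_{j=1}^{L} (A^\top)^{j-1} C^\top \tilde{o}_j = G_L^\top [\tilde{o}_1; \ldots; \tilde{o}_L]$. The sketch below illustrates this on a toy system; the matrices, the choice $L = 2$, and the helper functions are assumptions chosen so that the example runs with the standard library only.

```python
def matmul(X, Y):
    """Product of two matrices stored as lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]


def transpose(X):
    return [list(row) for row in zip(*X)]


def inv2(M):
    """Inverse of a 2x2 matrix."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


# Toy observable system: 2-d state, scalar observation (assumed values).
A = [[1.0, 0.1], [0.0, 1.0]]
C = [[1.0, 0.0]]
L = 2

# G_L stacks C, CA, ..., CA^{L-1} row-wise.
G, CAk = [], C
for _ in range(L):
    G.extend(CAk)
    CAk = matmul(CAk, A)

s0 = [[0.7], [-0.4]]
o_tilde = matmul(G, s0)  # noise-free reduced observations: C A^{j-1} s_0

# s_0 = (G^T G)^{-1} G^T o_tilde
Gt = transpose(G)
s0_hat = matmul(inv2(matmul(Gt, G)), matmul(Gt, o_tilde))
```

If $G_L$ did not have full column rank, $G_L^\top G_L$ would be singular and `inv2` would divide by zero, which is the numerical counterpart of the observability condition.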

D EXPERIMENT DETAILS

D.1 ONLINE SETTING

In Table 9, we list all the hyperparameters and network architectures we use for our experiments. We do not use the additional exploration bonus term in the MuJoCo tasks, but it is very helpful in the DM Control Suite tasks, especially in the sparse-reward tasks. For evaluation in MuJoCo, we test our algorithm for 10 episodes in each evaluation (every 5K steps). We average the results over the last 4 evaluations and 4 random seeds. For Dreamer and Proto-RL, we change their network from a CNN to a 3-layer MLP and disable the image data augmentation (since we test on the state space). The transformer architecture follows the Trajectory Transformer (Janner et al., 2021) and uses causal attention. We tuned some of the baselines' hyperparameters (e.g., exploration steps in Proto-RL) and report the best number across our runs. However, due to limited time, it is possible that we did not tune these hyperparameters enough.
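The evaluation protocol above can be summarized by a short sketch. The function names and the return values are made up for illustration; the numbers are not experimental results.

```python
from statistics import mean


def evaluate(run_episode, n_episodes=10):
    """One evaluation: mean return over n_episodes rollouts."""
    return mean(run_episode() for _ in range(n_episodes))


def final_score(eval_returns_per_seed, last_k=4):
    """Average the last `last_k` evaluations of each seed, then average over seeds."""
    per_seed = [mean(returns[-last_k:]) for returns in eval_returns_per_seed]
    return mean(per_seed)


# Made-up evaluation returns: one list per seed, one entry per evaluation
# (an evaluation would be triggered every 5K environment steps).
runs = [
    [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
    [0.0, 10.0, 20.0, 30.0, 40.0, 50.0],
    [5.0, 15.0, 25.0, 35.0, 45.0, 55.0],
    [20.0, 30.0, 40.0, 50.0, 60.0, 70.0],
]
score = final_score(runs)
```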

D.2 LEARNING CURVES

We provide the performance curves for the online DM Control Suite experiments in Figure 2. As the figures show, the proposed EPR converges faster and achieves state-of-the-art performance in most of the environments, demonstrating the sample efficiency of EPR and its ability to balance exploration and exploitation. We also provide additional curves for the POMDP setting in Figure 3.

D.3 IMAGE-BASED EXPERIMENTS

We provide the details of the MetaWorld image-based experiments here. We first provide an illustration of the reach environment in Figure 4. We then provide further experiment details in the following section.



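, we define the belief $b : \mathcal{O} \times (\mathcal{A} \times \mathcal{O})^t \to \Delta(\mathcal{S})$, $\forall t \in \mathbb{N}_+$, which can be recursively defined as: $b(s_1 \mid o_1) = P(s_1 \mid o_1)$, and

$$
b(s_{t+1} \mid x_{t+1}) = \frac{b(s_t \mid x_t) \, P(s_{t+1} \mid s_t, a_t) \, O(o_{t+1} \mid s_{t+1})}{\int b(s_t \mid x_t) \, P(s_{t+1} \mid s_t, a_t) \, O(o_{t+1} \mid s_{t+1}) \, ds_t \, do_{t+1}}.
$$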
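For intuition, a recursion of this type can be written for a tabular POMDP as the standard Bayes filter: predict with the transition model, weight by the observation likelihood, and normalize over the next state. The sketch below uses that standard discrete form, which is an assumption on our part and not a transcription of the expression above; the model sizes and probabilities are hypothetical.

```python
def belief_update(belief, action, obs, P, O):
    """One Bayes-filter step for a tabular POMDP.

    belief[s]    : current belief b(s_t | x_t)
    P[s][a][s2]  : transition probability P(s2 | s, a)
    O[s2][o]     : observation probability O(o | s2)
    """
    n = len(belief)
    unnorm = [
        O[s2][obs] * sum(belief[s] * P[s][action][s2] for s in range(n))
        for s2 in range(n)
    ]
    z = sum(unnorm)
    return [u / z for u in unnorm]


# Hypothetical model with 2 states, 1 action, and 2 observations.
P = [[[0.9, 0.1]], [[0.2, 0.8]]]
O = [[0.8, 0.2], [0.3, 0.7]]
b1 = belief_update([0.5, 0.5], action=0, obs=1, P=P, O=O)
```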

Energy-based Predictive Representation
1: Input: History Embedding $f(x, a)$, Observation Embedding $g(o)$, Random Feature $\{\omega_i\}_{i=1}^{n}$ where $\omega_i \sim \mathcal{N}(0, I_d)$, Initial Random Policy $\pi_0$, Initial Dataset $\mathcal{D} = \emptyset$ for online setting.
2: for Episode $i = 1, \cdots, K$ do
3:

Figure 3: Performance Curves for online POMDP DM Control Suite.

Figure 4: Reach environment: Using a robot arm to reach a specific position.

Performance on various MuJoCo control tasks. All the results are averaged across 4 random seeds and a window size of 10K. Results marked with * are adopted from MBBL. EPR achieves strong performance compared with the baselines.

Performance on various DeepMind Control Suite tasks. All the results are averaged across four random seeds and a window size of 10K. Compared with SAC, our method achieves even better performance on sparse-reward tasks.


Similarly to the observability structure we exploit in our algorithm, these works bypass the curse of history through different special structures, reducing the dependence on the whole history to a finite-length memory. Recently, Uehara et al. (2022) and Wang et al. (2022) generalize these special structures with function approximation beyond tabular cases. Golowich et al. (2022a) consider the complexity of planning and exploration together with learning, but their results are valid only for tabular MDPs.

Hyperparameters used for EPR in FetchReachImage.

Hyperparameters used for SPR in FetchReachImage.

Hyperparameters used for SAC-AE in FetchReachImage.

