LATENT VARIABLE REPRESENTATION FOR REINFORCEMENT LEARNING

Abstract

Deep latent variable models have achieved significant empirical success in model-based reinforcement learning (RL) due to their expressiveness in modeling complex transition dynamics. On the other hand, it remains unclear, theoretically and empirically, how latent variable models may facilitate learning, planning, and exploration to improve the sample efficiency of RL. In this paper, we provide a representation view of latent variable models for state-action value functions, which allows both a tractable variational learning algorithm and an effective implementation of the optimism/pessimism principle in the face of uncertainty for exploration. In particular, we propose a computationally efficient planning algorithm with UCB exploration by incorporating kernel embeddings of latent variable models. Theoretically, we establish the sample complexity of the proposed approach in the online and offline settings. Empirically, we demonstrate superior performance over current state-of-the-art algorithms across various benchmarks.

1. INTRODUCTION

Reinforcement learning (RL) seeks an optimal policy that maximizes the expected accumulated reward by interacting sequentially with an unknown environment. Most research in RL is based on the framework of Markov decision processes (MDPs) (Puterman, 2014). For MDPs with finite states and actions, there is already a clear understanding, with sample- and computationally-efficient algorithms (Auer et al., 2008; Dann & Brunskill, 2015; Osband & Van Roy, 2014; Azar et al., 2017; Jin et al., 2018). However, the cost of these RL algorithms quickly becomes unacceptable for problems with large or infinite state spaces. Function approximation or parameterization is therefore a major tool for tackling the curse of dimensionality. Based on the parameterized component to be learned, RL algorithms can be roughly classified into two categories, model-free and model-based RL: algorithms in the former class directly learn a value function or policy to maximize the cumulative reward, while algorithms in the latter class learn a model to mimic the environment, and the optimal policy is obtained by planning with the learned simulator. Model-free RL algorithms exploit an end-to-end learning paradigm for policy and value function training, and have achieved empirical success in robotics (Peng et al., 2018), video games (Mnih et al., 2013), and dialogue systems (Jiang et al., 2021), to name a few, thanks to flexible deep neural network parameterizations. The flexibility of such parameterizations, however, also comes at a cost in optimization and exploration. Specifically, it is well known that temporal-difference methods become unstable or even divergent with general nonlinear function approximation (Boyan & Moore, 1994; Tsitsiklis & Van Roy, 1996). Uncertainty quantification for general nonlinear function approximators is also underdeveloped.
Although there are several theoretically interesting model-free exploration algorithms with general nonlinear function approximators (Wang et al., 2020; Kong et al., 2021; Jiang et al., 2017), a computationally friendly exploration method for model-free RL is still missing. Model-based RL algorithms, on the other hand, exploit more information from the environment during learning, and are therefore considered more promising in terms of sample efficiency (Wang et al., 2019). Equipped with powerful deep models, model-based RL can successfully reduce



* Equal Contribution. Project Website: https://rlrep.github.io/lvrep/


Published as a conference paper at ICLR 2023

approximation error, and has demonstrated strong performance in practice (Hafner et al., 2019a;b; Wu et al., 2022), along with some theoretical justifications (Osband & Van Roy, 2014; Foster et al., 2021). However, the reduction of approximation error brings new challenges in planning and exploration, which have not been treated seriously from either the empirical or the theoretical perspective. Specifically, with general nonlinear models, the planning problem itself is already intractable, and it becomes even harder once an exploration mechanism is introduced. While theoretical analyses typically assume a planning oracle that provides an optimal policy, some approximation is necessary in practice, including dyna-style planning (Chua et al., 2018; Luo et al., 2018), random shooting (Kurutach et al., 2018; Hafner et al., 2019a), and policy search with backpropagation through time (Deisenroth & Rasmussen, 2011; Heess et al., 2015). These approximations may lead to sub-optimal policies, even with perfect models, wasting potential modeling power. In sum, for both model-free and model-based algorithms, there has been insufficient work considering statistical and computational tractability and efficiency in learning, planning, and exploration from a unified and coherent perspective for algorithm design.
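To make the planning approximations above concrete, a minimal random-shooting planner can be sketched as follows. The `model` and `reward_fn` callables are hypothetical stand-ins for a learned transition model and a known reward function; all names and hyperparameters here are illustrative, not the cited papers' implementations.

```python
import numpy as np

def random_shooting_plan(model, reward_fn, s0, horizon=10, n_candidates=100,
                         action_dim=2, gamma=0.99, rng=None):
    """Return the first action of the best random open-loop action sequence,
    scored by rolling the sequence out through the (learned) model."""
    rng = np.random.default_rng() if rng is None else rng
    best_return, best_action = -np.inf, None
    for _ in range(n_candidates):
        # Sample a random open-loop action sequence in [-1, 1]^action_dim.
        actions = rng.uniform(-1.0, 1.0, size=(horizon, action_dim))
        s, ret = s0, 0.0
        for t in range(horizon):
            ret += (gamma ** t) * reward_fn(s, actions[t])
            s = model(s, actions[t])  # one simulated transition
        if ret > best_return:
            best_return, best_action = ret, actions[0]
    return best_action
```

In practice only the first action is executed and planning is repeated at the next state (model-predictive control); the sketch makes clear why an imperfect search over action sequences can return a sub-optimal policy even under a perfect model.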
This raises the question: Is there a way to design a provable and practical algorithm that remedies both the statistical and computational difficulties of RL? Here, by "provable" we mean that the statistical complexity of the algorithm can be rigorously characterized without explicit dependence on the number of states, but instead in terms of the fundamental complexity of the parameterized representation space; by "practical" we mean that the learning, planning, and exploration components of the algorithm are computationally tractable and can be implemented in real-world scenarios. This work provides an affirmative answer to the question above by establishing the representation view of latent variable dynamics models through a connection to linear MDPs. This connection immediately provides a computationally tractable approach to planning and exploration in the linear space constructed by a flexible deep latent variable model. The latent variable model view also yields a variational learning method that remedies the intractability of MLE for general linear MDPs (Agarwal et al., 2020; Uehara et al., 2022). Our main contributions are the following:
• We establish the representation view of latent variable dynamics models in RL, which naturally induces the Latent Variable Representation (LV-Rep) for linearly representing the state-action value function, and paves the way for a practical variational method for representation learning (Section 3);
• We provide computationally efficient algorithms that implement the principle of optimism and pessimism in the face of uncertainty with the learned LV-Rep for online and offline RL (Section 3.1);
• We theoretically analyze the sample complexity of LV-Rep in both online and offline settings, which reveals the essential complexity beyond the cardinality of the latent variable (Section 4);
• We empirically demonstrate that LV-Rep outperforms state-of-the-art model-based and model-free RL algorithms on several RL benchmarks (Section 6).
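The computational appeal of a linear representation of the value function can be illustrated with a generic elliptical-bonus sketch in the style of LSVI-UCB. This is not the paper's exact algorithm; `phi_sa`, `w`, `Sigma_inv`, and `beta` are illustrative names for a learned feature, regression weights, the inverse regularized feature covariance, and an exploration coefficient.

```python
import numpy as np

def ucb_q_value(phi_sa, w, Sigma_inv, beta):
    """Optimistic Q-value for a single state-action feature vector.

    The estimate is linear in the feature, so both regression and the
    uncertainty bonus reduce to cheap d-dimensional linear algebra.
    """
    # Elliptical confidence bonus: large where features are rarely visited.
    bonus = beta * np.sqrt(phi_sa @ Sigma_inv @ phi_sa)
    return phi_sa @ w + bonus
```

Pessimism for offline RL follows the same template with the bonus subtracted rather than added; in either case the cost is dominated by maintaining the d × d covariance, independent of the number of states.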

2. PRELIMINARIES

In this section, we provide a brief introduction to MDPs and linear MDPs, which play important roles in the algorithm design and theoretical analysis. We also provide the required background on functional analysis in Appendix D.

2.1. MARKOV DECISION PROCESSES

We consider the infinite-horizon discounted Markov decision process (MDP) specified by the tuple M = ⟨S, A, T*, r, γ, d₀⟩, where S is the state space, A is a discrete action space, T* : S × A → ∆(S) is the transition kernel, r : S × A → [0, 1] is the reward function, γ ∈ (0, 1) is the discount factor, and d₀ ∈ ∆(S) is the initial state distribution. Following the standard convention (e.g., Jin et al., 2020), we assume r(s, a) and d₀ are known to the agent. We aim to find a policy π : S → ∆(A) that maximizes the discounted cumulative reward
$$V^{\pi}_{T^*, r} := \mathbb{E}_{T^*, \pi}\left[\sum_{i=0}^{\infty} \gamma^{i}\, r(s_i, a_i) \,\middle|\, s_0 \sim d_0\right].$$
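Concretely, the quantity inside the expectation is a per-trajectory discounted return; a minimal helper makes the objective explicit (the trajectory itself would be generated by rolling out π under T*):

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**i * r_i along one trajectory, i.e. the random variable
    whose expectation over T* and pi defines V^pi."""
    return sum((gamma ** i) * r for i, r in enumerate(rewards))
```

Since r maps into [0, 1], the return is bounded by 1 / (1 - γ), a fact used repeatedly in sample-complexity analyses of discounted MDPs.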

