LATENT VARIABLE REPRESENTATION FOR REINFORCEMENT LEARNING

Abstract

Deep latent variable models have achieved significant empirical success in model-based reinforcement learning (RL) due to their expressiveness in modeling complex transition dynamics. On the other hand, it remains unclear, both theoretically and empirically, how latent variable models may facilitate learning, planning, and exploration to improve the sample efficiency of RL. In this paper, we provide a representation view of latent variable models for state-action value functions, which admits both a tractable variational learning algorithm and an effective implementation of the optimism/pessimism principle in the face of uncertainty for exploration. In particular, we propose a computationally efficient planning algorithm with UCB exploration by incorporating kernel embeddings of latent variable models. Theoretically, we establish the sample complexity of the proposed approach in both online and offline settings. Empirically, we demonstrate superior performance over current state-of-the-art algorithms across various benchmarks.

1. INTRODUCTION

Reinforcement learning (RL) seeks an optimal policy that maximizes the expected accumulated reward by interacting sequentially with an unknown environment. Most research in RL is based on the framework of Markov decision processes (MDPs) (Puterman, 2014). For MDPs with finite states and actions, there is already a clear understanding, with sample- and computationally-efficient algorithms (Auer et al., 2008; Dann & Brunskill, 2015; Osband & Van Roy, 2014; Azar et al., 2017; Jin et al., 2018). However, the cost of these RL algorithms quickly becomes unacceptable for problems with large or infinite state spaces. Function approximation, or parameterization, is therefore a major tool for tackling the curse of dimensionality. Based on the parameterized component to be learned, RL algorithms can be roughly classified into two categories: model-free and model-based. Algorithms in the former class directly learn a value function or policy that maximizes the cumulative reward, while algorithms in the latter class learn a model to mimic the environment and obtain the optimal policy by planning with the learned simulator. Model-free RL algorithms exploit an end-to-end learning paradigm for policy and value-function training, and have achieved empirical success in robotics (Peng et al., 2018), video games (Mnih et al., 2013), and dialogue systems (Jiang et al., 2021), to name a few, thanks to flexible deep neural network parameterizations. The flexibility of such parameterizations, however, also comes at a cost in optimization and exploration. Specifically, it is well known that temporal-difference methods become unstable or even divergent with general nonlinear function approximation (Boyan & Moore, 1994; Tsitsiklis & Van Roy, 1996). Uncertainty quantification for general nonlinear function approximators is also underdeveloped.
Although there are several theoretically interesting model-free exploration algorithms with general nonlinear function approximators (Wang et al., 2020; Kong et al., 2021; Jiang et al., 2017), a computationally friendly exploration method for model-free RL is still missing. Model-based RL algorithms, on the other hand, exploit more information from the environment during learning, and are therefore considered more promising in terms of sample efficiency (Wang et al., 2019). Equipped with powerful deep models, model-based RL can successfully reduce
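To make the optimism principle referenced above concrete, the following is a minimal sketch of a UCB-style elliptical exploration bonus for linear features. All names here (`ucb_bonus`, `beta`, `lam`, the random feature map) are illustrative assumptions for exposition, not the construction developed in this paper.

```python
import numpy as np

def ucb_bonus(phi, Sigma_inv, beta=1.0):
    """Elliptical confidence bonus: beta * sqrt(phi^T Sigma^{-1} phi).

    The bonus is large along feature directions that have been visited
    rarely, implementing optimism in the face of uncertainty.
    """
    return beta * np.sqrt(phi @ Sigma_inv @ phi)

rng = np.random.default_rng(0)
d, lam = 4, 1.0
Phi = rng.normal(size=(50, d))            # features of visited state-action pairs
Sigma = Phi.T @ Phi + lam * np.eye(d)     # regularized empirical covariance
Sigma_inv = np.linalg.inv(Sigma)

phi_new = rng.normal(size=d)              # feature of a candidate state-action pair
bonus = ucb_bonus(phi_new, Sigma_inv)     # added to Q-estimates during planning
```

In planning, such a bonus is added to the estimated value of each candidate action; as a direction accumulates observations, its bonus shrinks, so exploration concentrates on poorly-covered parts of the feature space.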
https://rlrep.github.io/lvrep/ 

