IS MODEL ENSEMBLE NECESSARY? MODEL-BASED RL VIA A SINGLE MODEL WITH LIPSCHITZ REGULARIZED VALUE FUNCTION

Abstract

Probabilistic dynamics model ensembles are widely used in existing model-based reinforcement learning methods, as they outperform a single dynamics model in both asymptotic performance and sample efficiency. In this paper, we provide both practical and theoretical insights into the empirical success of the probabilistic dynamics model ensemble through the lens of Lipschitz continuity. We find that, for a value function, the stronger the Lipschitz condition is, the smaller the gap between the true-dynamics- and learned-dynamics-induced Bellman operators is, thus enabling the converged value function to be closer to the optimal value function. Hence, we hypothesize that the key functionality of the probabilistic dynamics model ensemble is to regularize the Lipschitz condition of the value function using generated samples. To test this hypothesis, we devise two practical robust training mechanisms that directly regularize the Lipschitz condition of the value function: computing adversarial noise and regularizing the spectral norm of the value network. Empirical results show that, combined with our mechanisms, model-based RL algorithms with a single dynamics model outperform those with an ensemble of probabilistic dynamics models. These findings not only support the theoretical insight, but also provide a practical solution for developing computationally efficient model-based RL algorithms.

1. INTRODUCTION

Model-based reinforcement learning (MBRL) improves the sample efficiency of an agent by learning a model of the underlying dynamics of the real environment. One of the most fundamental questions in this area is how to learn a model that generates samples good enough to maximally boost the sample efficiency of policy learning. To address this question, various model architectures have been proposed, such as Bayesian nonparametric models (Kocijan et al., 2004; Nguyen-Tuong et al., 2008; Kamthe & Deisenroth, 2018), inverse dynamics models (Pathak et al., 2017; Liu et al., 2022), multi-step models (Asadi et al., 2019), and hypernetworks (Huang et al., 2021). Among these approaches, the most popular and common one is to use an ensemble of probabilistic dynamics models (Buckman et al., 2018; Janner et al., 2019; Lai et al., 2020; Clavera et al., 2020; Froehlich et al., 2022; Li et al., 2022). It was first proposed by Chua et al. (2018) to capture both the aleatoric uncertainty of the environment and the epistemic uncertainty of the data. In practice, MBRL methods with an ensemble of probabilistic dynamics models can often achieve higher sample efficiency and asymptotic performance than those using only a single dynamics model. However, while the uncertainty-aware perspective seems reasonable, previous works directly apply the probabilistic dynamics model ensemble in their methods without an in-depth exploration of why this structure works, and there is still insufficient theoretical evidence to explain the superiority of probabilistic neural network ensembles. In addition, extra computation time and resources are needed to train an ensemble of neural networks. In this paper, we provide a new perspective on why training a probabilistic dynamics model ensemble can significantly improve the performance of model-based algorithms.
We find that the state-transition samples generated by the ensemble model and used to train the policies and the critics are much more "diverse" than samples generated by a single deterministic model. We hypothesize that this implicitly regularizes the Lipschitz condition of the critic network, especially over local regions where the model is uncertain (i.e., model-uncertain local regions). As a result, the Bellman operator induced by the learned transition dynamics yields an update to the value function that is close to the update of the true underlying Bellman operator. We provide systematic experimental results and theoretical analysis to support our hypothesis.

Based on this insight, we propose two simple yet effective robust training mechanisms to regularize the Lipschitz condition of the value network. The first is spectral normalization (Miyato et al., 2018), which imposes a global Lipschitz constraint on the value network. It directly follows our theoretical insights by explicitly controlling the Lipschitz constant of the value function. However, both our theoretical and empirical observations suggest that only a local Lipschitz constraint around model-uncertain local regions is required for good performance. We therefore propose a second mechanism, robust regularization, which stabilizes the local Lipschitz condition of the value network by computing adversarial noise with the fast gradient sign method (FGSM) (Goodfellow et al., 2015). We run systematic experiments to compare the effectiveness of controlling the global versus the local Lipschitz constant. Experimental results on five MuJoCo environments verify that the proposed Lipschitz regularization mechanisms, combined with a single deterministic dynamics model, improve on the SOTA performance of probabilistic-ensemble MBRL algorithms with less computational time.

Our contributions are summarized as follows. (1) We propose a new insight into why an ensemble of probabilistic dynamics models can significantly improve the performance of MBRL algorithms, supported by both theoretical analysis and experimental evidence. (2) We introduce two robust training mechanisms for MBRL algorithms, which directly regularize the Lipschitz condition of the agent's value function. (3) Experimental results on five MuJoCo tasks demonstrate the effectiveness of our proposed mechanisms and validate our insights, improving SOTA asymptotic performance with less computational time and fewer resources.

2. PRELIMINARIES AND BACKGROUND

Reinforcement learning. We consider a Markov Decision Process (MDP) defined by the tuple (S, A, P, P_0, r, γ), where S and A are the state space and action space respectively, P(s′|s, a) is the transition dynamics, P_0 is the initial state distribution, r(s, a) is the reward function, and γ is the discount factor. In this paper, we focus on value-based reinforcement learning algorithms. Define the optimal Bellman operator T* such that

$$\mathcal{T}^* Q(s, a) = r(s, a) + \gamma \int P(s' \mid s, a) \max_{a'} Q(s', a') \, ds'.$$

The goal of a value-based algorithm is to learn a Q_ϕ that approximates the optimal state-action value function Q*, where Q* = T*Q* is the fixed point of the optimal Bellman operator. For model-based RL (MBRL), we denote the approximate models of the state transition and reward function as P̂_θ and r̂_θ respectively. Similarly, we define the model-induced Bellman operator T̂* such that

$$\hat{\mathcal{T}}^* Q(s, a) = \hat{r}_\theta(s, a) + \gamma \int \hat{P}_\theta(s' \mid s, a) \max_{a'} Q(s', a') \, ds'.$$

Probabilistic dynamics model ensemble. A probabilistic dynamics model ensemble consists of K neural networks P̂_θ = {P̂_θ1, P̂_θ2, ..., P̂_θK} with the same architecture but randomly initialized with different weights. Given a state-action pair (s, a), each network predicts a Gaussian distribution with diagonal covariance over the next state: P̂_θk(s′|s, a) = N(μ_θk(s, a), Σ_θk(s, a)). The model is trained using the negative log-likelihood loss (Janner et al., 2019):

$$\mathcal{L}(\theta_k) = \sum_{t=1}^{N} \left[ \mu_{\theta_k}(s_t, a_t) - s_{t+1} \right]^\top \Sigma_{\theta_k}^{-1}(s_t, a_t) \left[ \mu_{\theta_k}(s_t, a_t) - s_{t+1} \right] + \log \det \Sigma_{\theta_k}(s_t, a_t). \quad (1)$$

During model rollouts, the probabilistic dynamics model ensemble first selects a network from the ensemble uniformly at random and then samples the next state from the predicted Gaussian distribution.

Local Lipschitz constant. We now give the definition of the local Lipschitz constant, a key concept in the following discussion.

Definition 2.1. The local (X, ϵ)-Lipschitz constant of a scalar-valued function f : R^N → R over a set X is

$$L^{(p)}(f, \mathcal{X}, \epsilon) = \sup_{x \in \mathcal{X}} \; \sup_{\substack{y_1, y_2 \in B^{(p)}(x, \epsilon) \\ y_1 \neq y_2}} \frac{|f(y_1) - f(y_2)|}{\|y_1 - y_2\|_p},$$

where $B^{(p)}(x, \epsilon)$ denotes the $\ell_p$ ball of radius $\epsilon$ centered at $x$.
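Definition 2.1 can be probed numerically: since the supremum is over pairs inside ϵ-balls, sampling pairs and taking the largest observed difference quotient gives a Monte Carlo lower bound on the local Lipschitz constant. The sketch below (the function names `sample_in_ball` and `local_lipschitz_estimate` are our own illustrative choices, not from the paper) implements this for the ℓ_p norm:

```python
import numpy as np

def sample_in_ball(rng, dim, eps, p=2):
    """Draw a random point inside the l_p ball of radius eps centered at 0."""
    v = rng.standard_normal(dim)
    v = v / np.linalg.norm(v, ord=p)          # random direction on the l_p sphere
    r = eps * rng.uniform() ** (1.0 / dim)    # radius rescaled for volume
    return r * v

def local_lipschitz_estimate(f, X, eps, n_pairs=256, p=2, seed=0):
    """Monte Carlo lower bound on L^(p)(f, X, eps) from Definition 2.1:
    sample pairs y1, y2 in the eps-ball around each x in X and keep the
    largest observed difference quotient |f(y1) - f(y2)| / ||y1 - y2||_p."""
    rng = np.random.default_rng(seed)
    best = 0.0
    for x in X:
        for _ in range(n_pairs):
            y1 = x + sample_in_ball(rng, x.shape[0], eps, p)
            y2 = x + sample_in_ball(rng, x.shape[0], eps, p)
            gap = np.linalg.norm(y1 - y2, ord=p)
            if gap > 1e-12:                   # guard the y1 != y2 condition
                best = max(best, abs(f(y1) - f(y2)) / gap)
    return best
```

Because it only samples finitely many pairs, the estimate is a lower bound; for a linear function f(y) = c·y in one dimension every difference quotient equals |c|, so the estimate recovers the true constant.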

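The ensemble training objective in Eq. (1) and the rollout procedure described above can be sketched as follows; this is a minimal numpy illustration with diagonal covariance represented as a variance vector, where `models` is assumed to be a list of callables returning a `(mean, variance)` pair (the helper names are ours, not the paper's):

```python
import numpy as np

def gaussian_nll(mu, var, target):
    """Per-transition loss from Eq. (1) for one ensemble member with diagonal
    covariance: (mu - s')^T Sigma^{-1} (mu - s') + log det Sigma."""
    mahalanobis = np.sum((mu - target) ** 2 / var, axis=-1)  # squared error scaled by variance
    log_det = np.sum(np.log(var), axis=-1)                   # log det of a diagonal matrix
    return mahalanobis + log_det

def ensemble_rollout_step(models, s, a, rng):
    """One model rollout step: pick an ensemble member uniformly at random,
    then sample the next state from its predicted Gaussian."""
    k = rng.integers(len(models))
    mu, var = models[k](s, a)
    return mu + np.sqrt(var) * rng.standard_normal(mu.shape)
```

Note how the loss penalizes both over-confidence (small variance with large error inflates the Mahalanobis term) and under-confidence (large variance inflates the log-determinant term).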

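The robust-regularization mechanism from the introduction can be illustrated in miniature. FGSM takes one signed-gradient step of radius ϵ in state space and the regularizer penalizes the induced change in value, discouraging a large local Lipschitz constant around visited states. The sketch below uses a toy linear critic with an analytic gradient in place of an autodiff'd value network; `EPS` and all function names are hypothetical choices for illustration, not the paper's implementation:

```python
import numpy as np

EPS = 0.1  # FGSM perturbation radius (a hypothetical hyperparameter)

def q_value(s, w):
    """Toy linear critic Q(s) = w^T s, standing in for the value network."""
    return float(w @ s)

def q_grad_s(s, w):
    """Analytic gradient of Q w.r.t. the state (autodiff in a real critic)."""
    return w

def fgsm_value_regularizer(s, w, eps=EPS):
    """Perturb the state one FGSM step in the direction that changes Q the
    most, then penalize the squared change in value."""
    delta = eps * np.sign(q_grad_s(s, w))
    return (q_value(s + delta, w) - q_value(s, w)) ** 2
```

In training, a term like this would be added to the critic loss so that gradient descent trades value accuracy against local smoothness; spectral normalization instead constrains every layer's weight matrix, bounding the Lipschitz constant globally.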