IS MODEL ENSEMBLE NECESSARY? MODEL-BASED RL VIA A SINGLE MODEL WITH LIPSCHITZ REGULARIZED VALUE FUNCTION

Abstract

Probabilistic dynamics model ensembles are widely used in existing model-based reinforcement learning methods because they outperform a single dynamics model in both asymptotic performance and sample efficiency. In this paper, we provide both practical and theoretical insights into the empirical success of the probabilistic dynamics model ensemble through the lens of Lipschitz continuity. We find that, for a value function, the stronger the Lipschitz condition is, the smaller the gap between the Bellman operators induced by the true dynamics and by the learned dynamics, which enables the converged value function to be closer to the optimal value function. Hence, we hypothesize that the key functionality of the probabilistic dynamics model ensemble is to regularize the Lipschitz condition of the value function using generated samples. To test this hypothesis, we devise two practical robust training mechanisms that directly regularize the Lipschitz condition of the value function: computing adversarial noise and regularizing the value network's spectral norm. Empirical results show that, combined with our mechanisms, model-based RL algorithms with a single dynamics model outperform those with an ensemble of probabilistic dynamics models. These findings not only support the theoretical insight but also provide a practical solution for developing computationally efficient model-based RL algorithms.

1. INTRODUCTION

Model-based reinforcement learning (MBRL) improves the sample efficiency of an agent by learning a model of the underlying dynamics of a real environment. One of the most fundamental questions in this area is how to learn a model that generates good samples so that it maximally boosts the sample efficiency of policy learning. To address this question, various model architectures have been proposed, such as Bayesian nonparametric models (Kocijan et al., 2004; Nguyen-Tuong et al., 2008; Kamthe & Deisenroth, 2018), inverse dynamics models (Pathak et al., 2017; Liu et al., 2022), multi-step models (Asadi et al., 2019), and hypernetworks (Huang et al., 2021). Among these approaches, the most popular is to use an ensemble of probabilistic dynamics models (Buckman et al., 2018; Janner et al., 2019; Lai et al., 2020; Clavera et al., 2020; Froehlich et al., 2022; Li et al., 2022). It was first proposed by Chua et al. (2018) to capture both the aleatoric uncertainty of the environment and the epistemic uncertainty of the data. In practice, MBRL methods with an ensemble of probabilistic dynamics models can often achieve higher sample efficiency and asymptotic performance than those using only a single dynamics model. However, while the uncertainty-aware perspective seems reasonable, previous works apply the probabilistic dynamics model ensemble directly in their methods without an in-depth exploration of why this structure works. There is still not enough theoretical evidence to explain the superiority of the probabilistic neural network ensemble. In addition, extra computation time and resources are needed to train an ensemble of neural networks. In this paper, we provide a new perspective on why training a probabilistic dynamics model ensemble can significantly improve the performance of model-based algorithms.
We find that the state transition samples generated by the ensemble model, which are used for training the policies and the critics, are much more "diverse" than samples generated from a single deterministic model. We hypothesize that this diversity implicitly regularizes the Lipschitz condition of the critic network, especially over a local region where the model is uncertain (i.e., a model-uncertain local region). Therefore, the Bellman operator induced by the learned transition dynamics will yield an update to the agent's value function that is close to the update of the true underlying Bellman operator. We provide systematic experimental results and theoretical analysis to support our hypothesis. Based on this insight, we propose two simple yet effective robust training mechanisms to regularize the Lipschitz condition of the value network. The first one is spectral normalization (Miyato et al., 2018), which provides a global Lipschitz constraint for the value network. It directly follows our theoretical insights by explicitly controlling the Lipschitz constant of the value function. However, based on both our theoretical and empirical observations, only the local Lipschitz constant around a model-uncertain local region needs to be controlled for good performance. We therefore also propose a second mechanism, robust regularization, which stabilizes the local Lipschitz condition of the value network by computing adversarial noise with the fast gradient sign method (FGSM) (Goodfellow et al., 2015). We conduct systematic experiments to compare the effectiveness of controlling the global versus the local Lipschitz constant. Experimental results on five MuJoCo environments verify that the proposed Lipschitz regularization mechanisms with a single deterministic dynamics model improve on the SOTA performance of probabilistic ensemble MBRL algorithms with less computational time. Our contributions are summarized as follows.
(1) We propose a new insight into why an ensemble of probabilistic dynamics models can significantly improve the performance of the MBRL algorithms, supported by both theoretical analysis and experimental evidence. (2) We introduce two robust training mechanisms for MBRL algorithms, which directly regularize the Lipschitz condition of the agent's value function. (3) Experimental results on five MuJoCo tasks demonstrate the effectiveness of our proposed mechanisms and validate our insights, improving SOTA asymptotic performance using less computational time and resources.

Reinforcement learning.

We consider a Markov Decision Process (MDP) defined by the tuple $(\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{P}_0, r, \gamma)$, where $\mathcal{S}$ and $\mathcal{A}$ are the state space and action space respectively, $\mathcal{P}(s'|s,a)$ is the transition dynamics, $\mathcal{P}_0$ is the initial state distribution, $r(s,a)$ is the reward function, and $\gamma$ is the discount factor. In this paper, we focus on value-based reinforcement learning algorithms. Define the optimal Bellman operator $\mathcal{T}^*$ such that

$$\mathcal{T}^* Q(s,a) = r(s,a) + \gamma \int \mathcal{P}(s'|s,a) \max_{a'} Q(s',a')\, ds'.$$

The goal of the value-based algorithm is to learn a $Q_\phi$ to approximate the optimal state-action value function $Q^*$, where $Q^* = \mathcal{T}^* Q^*$ is the fixed point of the optimal Bellman operator. For model-based RL (MBRL), we denote the approximate models for the state transition and the reward function as $\hat{\mathcal{P}}_\theta$ and $\hat{r}_\theta$ respectively. Similarly, we define the model-induced Bellman operator $\hat{\mathcal{T}}^*$ such that

$$\hat{\mathcal{T}}^* Q(s,a) = r(s,a) + \gamma \int \hat{\mathcal{P}}_\theta(s'|s,a) \max_{a'} Q(s',a')\, ds'.$$

Probabilistic dynamics model ensemble. A probabilistic dynamics model ensemble consists of $K$ neural networks $\hat{\mathcal{P}}_\theta = \{\hat{\mathcal{P}}_{\theta_1}, \hat{\mathcal{P}}_{\theta_2}, \ldots, \hat{\mathcal{P}}_{\theta_K}\}$ with the same architecture but randomly initialized with different weights. Given a state-action pair $(s,a)$, the prediction of each neural network is a Gaussian distribution over the next state with diagonal covariance: $\hat{\mathcal{P}}_{\theta_k}(s'|s,a) = \mathcal{N}(\mu_{\theta_k}(s,a), \Sigma_{\theta_k}(s,a))$. The model is trained using the negative log-likelihood loss (Janner et al., 2019):

$$\mathcal{L}(\theta_k) = \sum_{t=1}^{N} \left[\mu_{\theta_k}(s_t,a_t) - s_{t+1}\right]^\top \Sigma_{\theta_k}^{-1}(s_t,a_t) \left[\mu_{\theta_k}(s_t,a_t) - s_{t+1}\right] + \log\det \Sigma_{\theta_k}(s_t,a_t).$$

During model rollouts, the probabilistic dynamics model ensemble first randomly selects a network from the ensemble and then samples the next state from the predicted Gaussian distribution.

Local Lipschitz constant. We now give the definition of the local Lipschitz constant, a key concept in the following discussion.

Definition 2.1. Define the local $(\mathcal{X}, \epsilon)$-Lipschitz constant of a scalar-valued function $f: \mathbb{R}^N \to \mathbb{R}$ over a set $\mathcal{X}$ as

$$L^{(p)}(f, \mathcal{X}, \epsilon) = \sup_{x \in \mathcal{X}} \; \sup_{\substack{y_1, y_2 \in B^{(p)}(x,\epsilon) \\ y_1 \neq y_2}} \frac{|f(y_1) - f(y_2)|}{\|y_1 - y_2\|_p}.$$

Throughout the paper, we consider $p = 2$ unless explicitly stated. If $L(f, \mathcal{X}, \epsilon) = C$ is finite, we say that $f$ is $(\mathcal{X}, \epsilon)$-locally Lipschitz with constant $C$. In particular, we can view the global Lipschitz constant as $L(f, \mathbb{R}^N, \epsilon)$, which upper bounds the local Lipschitz constant $L(f, \mathcal{X}, \epsilon)$ with $\mathcal{X} \subset \mathbb{R}^N$.

In this section, we provide insight into why a value-based reinforcement learning algorithm trained on simulated samples generated by a probabilistic ensemble model performs well in practice. We first conduct an experiment on the MuJoCo Humanoid environment with MBPO (Janner et al., 2019), the SOTA model-based algorithm using Soft Actor-Critic (SAC) (Haarnoja et al., 2018) as the backbone algorithm for policy and value optimization. We run MBPO with four different types of trained environment models: (1) a single deterministic dynamics model, (2) a single probabilistic dynamics model, (3) an ensemble of seven deterministic dynamics models, and (4) an ensemble of seven probabilistic dynamics models. As we can see from Figure 1a, in the Humanoid environment, the agent trained with an ensemble of probabilistic dynamics models indeed achieves much better performance. A similar observation is found across all the other MuJoCo environments.

Value-aware model error. Why does MBPO with an ensemble of probabilistic dynamics models perform much better? We try to answer this question by considering the effects of the learned transition models on the agent's value functions. MBPO fits an ensemble of $M$ Gaussian environment models $\hat{\mathcal{P}}^{(i)}_\theta(\cdot|s,a) \equiv \mathcal{N}\big(\mu^{(i)}_\theta(s,a), \Sigma^{(i)}_\theta(s,a)\big)$, $i = 1, \ldots, M$.
With a deep neural network as a powerful function approximator for the mean and variance of the Gaussian distribution, the mean squared error $\mathbb{E}_{(s,a)\sim\rho,\, s'\sim\mathcal{P}(\cdot|s,a),\, \hat{s}'\sim\hat{\mathcal{P}}_\theta(\cdot|s,a)}\big[\|s' - \hat{s}'\|^2\big]$ is often small. However, as argued in many previous MBRL works (Grimm et al., 2020; 2021), the mean squared error is often not a good measure of the quality of the learned transition model. Instead, the value-aware model error (Farahmand et al., 2017a), defined below, plays a crucial role here and connects directly to the suboptimality of the MBRL algorithm:

$$\begin{aligned}
\mathcal{L}_{\text{vame}}(\hat{\mathcal{P}}_\theta; Q, \rho) &= \mathbb{E}_{(s,a)\sim\rho}\Big[\big|\mathcal{T}^* Q(s,a) - \hat{\mathcal{T}}^* Q(s,a)\big|^2\Big] \\
&= \gamma^2\, \mathbb{E}_{(s,a)\sim\rho}\bigg[\Big|\int \mathcal{P}(s'|s,a) V(s')\, ds' - \int \hat{\mathcal{P}}_\theta(\hat{s}'|s,a) V(\hat{s}')\, d\hat{s}'\Big|^2\bigg] \\
&\le \gamma^2\, \mathbb{E}_{(s,a)\sim\rho,\, s'\sim\mathcal{P}(\cdot|s,a),\, \hat{s}'\sim\hat{\mathcal{P}}_\theta(\cdot|s,a)}\Big[\big|V(\hat{s}') - V(s')\big|^2\Big], \qquad V(s) = \max_a Q(s,a).
\end{aligned}$$

This value-aware model error measures the difference between the simulated and true Bellman operators acting on a given $Q$ function. In other words, even though the learned transition model $\hat{\mathcal{P}}_\theta$ is approximately accurate, the model prediction error can still be amplified by the value function so that the two Bellman operators yield completely different updates on the value function.

Value function Lipschitz regulates value-aware model error. When the value function's local Lipschitz constant is under control, the value-aware model error can be made small. As we see from Figures 1b, 1c, and 1d, although the probabilistic ensemble model and the single deterministic model achieve similar mean squared errors, the value function trained with an ensemble of probabilistic dynamics models has a significantly smaller Lipschitz constant and value-aware model error. (Note that the value-aware model error plotted in Figure 1d is in log scale.) To see why this is the case, we view the learned Gaussian transition models $\hat{\mathcal{P}}^{(i)}_\theta$ as $f^{(i)}_\theta + g^{(i)}_\theta$, where $f^{(i)}_\theta$ is a deterministic model and $g^{(i)}_\theta(\cdot|s,a) \equiv \mathcal{N}\big(0, \Sigma^{(i)}_\theta(s,a)\big)$ is a noise distribution with zero mean.
When training the value network of the MBRL agent, the target value is computed as $r(s,a) + \gamma\, \mathbb{E}_{i\sim\mathrm{Cat}(M, \frac{1}{M}),\, \epsilon_i\sim g^{(i)}_\theta(\cdot|s,a)}\big[V\big(f^{(i)}_\theta(s,a) + \epsilon_i\big)\big]$. Here we view data generated from the probabilistic ensemble model as an implicit augmentation. The augmentation comes from two sources: (i) the variation of predictions across different models in the ensemble and (ii) the noise added by each noise distribution $g^{(i)}_\theta$. Augmenting the transition with a diverse set of noise and then training the value functions on such augmented samples implicitly regularizes the local Lipschitz condition of the value network over the local region around the state where the model prediction is uncertain. In the next section, we provide some theoretical insights into how the local Lipschitz constant can play a role in the suboptimality of MBRL algorithms. Later, in the algorithm and experiment sections, we provide two practical mechanisms to regularize the Lipschitz condition of the value network and demonstrate the effectiveness of such mechanisms, which further validates our claim.
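The implicit augmentation described above can be illustrated with a minimal sketch. It assumes a toy ensemble whose per-member means and standard deviations for one state-action pair are already given as plain lists; the names `sample_next_state` and `augmented_target` and the toy value function are hypothetical and are not taken from MBPO or from the paper's implementation.

```python
import random


def sample_next_state(means, stds, rng):
    """Pick one ensemble member uniformly, then sample from its Gaussian."""
    i = rng.randrange(len(means))  # i ~ Cat(M, 1/M)
    return [rng.gauss(m, s) for m, s in zip(means[i], stds[i])]


def augmented_target(reward, gamma, means, stds, value_fn, n_samples, rng):
    """Monte Carlo estimate of r + gamma * E_{i, eps_i}[V(f_i(s, a) + eps_i)]."""
    total = 0.0
    for _ in range(n_samples):
        total += value_fn(sample_next_state(means, stds, rng))
    return reward + gamma * total / n_samples


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical predictions of M = 3 ensemble members for one (s, a) pair.
    means = [[1.0, 0.0], [1.1, -0.1], [0.9, 0.1]]
    stds = [[0.05, 0.05]] * 3

    def toy_value(state):
        # Toy value function for illustration only, not a learned critic.
        return -(state[0] ** 2 + state[1] ** 2)

    print(augmented_target(1.0, 0.99, means, stds, toy_value, 1000, rng))
```

Both sources of augmentation appear explicitly: the spread of `means` across members and the per-member noise in `stds`. Setting all standard deviations to zero and all means equal recovers the single deterministic target.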

3.2. ERROR BOUND OF MODEL-BASED VALUE ITERATION WITH LOCALLY LIPSCHITZ VALUE FUNCTIONS

In this section, we formally analyze how the Lipschitz constant of the value function affects the learning dynamics of the model-based (approximate) value iteration algorithm, the prototype for most value-based MBRL algorithms. For simplicity, following previous works (Farahmand et al., 2017b; Grimm et al., 2020), we assume that we have the true reward function $r(s,a)$, but extending the results to reward function approximation should be straightforward.

We consider the following value iteration algorithm. At the $k$-th iteration of the algorithm, we obtain a dataset $\mathcal{D}_k \triangleq \{(s_i, a_i, s'_i)\}_{i=1}^{N}$, where $(s_i, a_i)$ is sampled i.i.d. from $\rho \in \Delta(\mathcal{S}\times\mathcal{A})$, the empirical state-action distribution, and $s'_i \sim \mathcal{P}(\cdot|s_i,a_i)$ is the next state under the environment transition. Based on this dataset, we first approximate the transition kernel by minimizing the mean $L_2$ difference between the predicted and true next states:

$$\hat{\mathcal{P}}_k \leftarrow \arg\min_{\hat{\mathcal{P}}\in\mathcal{M}} \frac{1}{N}\sum_{i=1}^{N} \int \hat{\mathcal{P}}(\hat{s}'|s_i,a_i)\, \big\|\hat{s}' - s'_i\big\|\, d\hat{s}' \tag{2}$$

Then, at the $k$-th iteration, we update the value function by solving the following regression problem:

$$\hat{Q}_k \leftarrow \arg\min_{\hat{Q}\in\mathcal{F}} \mathcal{L}_{\text{reg}}(\hat{Q}; \hat{Q}_{k-1}, \hat{\mathcal{P}}_k) := \arg\min_{\hat{Q}\in\mathcal{F}} \mathbb{E}_{(s,a)\sim\rho}\Big[\big|\hat{Q}(s,a) - \hat{\mathcal{T}}^* \hat{Q}_{k-1}(s,a)\big|^2\Big] \tag{3}$$

Empirically, we update the value function such that

$$\hat{Q}_k \leftarrow \arg\min_{\hat{Q}\in\mathcal{F}} \frac{1}{N}\sum_{i=1}^{N} \bigg|\hat{Q}(s_i,a_i) - \Big(r_i + \gamma \int \hat{\mathcal{P}}_k(\hat{s}'|s_i,a_i)\, \hat{V}_{k-1}(\hat{s}')\, d\hat{s}'\Big)\bigg|^2 \tag{4}$$

where $\hat{V}_{k-1}(\hat{s}') = \max_{a'} \hat{Q}_{k-1}(\hat{s}', a')$. To simplify the analysis, we assume that we are given a fixed state-action distribution $\rho$ such that at every iteration we can sample i.i.d. from this data distribution. In practice, however, we may use a different data-collecting policy at different iterations. As argued by Farahmand et al. (2017a), a similar result can be shown in this case by considering the mixing behavior of the stochastic process.

Now we list the assumptions we make. Some assumptions are made only to simplify the finite sample analysis, while others characterize the crucial aspects of model and value learning.
First, we assume that the environment is deterministic. This is only for the purpose of the finite sample analysis. When the environment transition is stochastic, we do not have the finite sample guarantee, but our insights still hold. We provide more discussion on this later when we present our finite sample guarantee.

Assumption 3.1. The environment transition is deterministic.

Next, to apply the concentration inequality in our analysis, we make the following technical assumption on the state space and reward function.

Assumption 3.2. (Boundedness of State Space and Reward Function) There exist constants $D$, $R_{\max}$ such that for all $s\in\mathcal{S}$, $a\in\mathcal{A}$, $\|s\|_2 \le D$ and $r(s,a) < R_{\max}$.

In addition, we make the following mild approximate realizability assumption on the model class of approximated transition kernels, so that the model space contains at least one transition model close to the true underlying transition kernel.

Assumption 3.3. ($(\epsilon, \rho)$-Approximate Realizability)

$$\inf_{\hat{\mathcal{P}}\in\mathcal{M}} L_2(\hat{\mathcal{P}}) := \inf_{\hat{\mathcal{P}}\in\mathcal{M}} \int\!\!\int \hat{\mathcal{P}}(d\hat{s}'|s,a)\, \big\|\mathcal{P}(s,a) - \hat{s}'\big\|_2\, d\rho(s,a) \le \epsilon \tag{5}$$

We also make a critical assumption on the local Lipschitz condition of the value function class. In particular, for the state-action value function $Q: \mathcal{S}\times\mathcal{A}\to\mathbb{R}$, we say that $Q$ is $(\epsilon, p)$-locally Lipschitz with constant $L$ if for every $a\in\mathcal{A}$, the function $Q_a: s \mapsto Q(s,a)$ is $(\epsilon, p)$-locally Lipschitz with constant less than or equal to $L$.

Assumption 3.4. (Local Lipschitz condition of value functions) There exists a finite $L$ such that every $Q\in\mathcal{F}$ is $(\mathcal{X}, 2\epsilon)$-locally Lipschitz with a constant less than or equal to $L$, where $\mathcal{X}$ is the support of the distribution $\rho$.

Note that we only need to assume that the value functions are all $(\mathcal{X}, (1+\beta)\epsilon)$-locally Lipschitz with $\beta > 0$. We set $\beta = 1$ for simplicity.
Finally, as in previous work (Farahmand et al., 2017a), to provide a finite sample guarantee of model learning, we make the following assumption on the complexity of the model space of the approximated transition kernels.

Assumption 3.5. (Complexity of Model Space) Let $R > 0$ and let $J: \mathcal{M}_0 \to [0,\infty)$ be a pseudo-norm, where $\mathcal{M}_0$ is a space of transition kernels. Let $\mathcal{M} = \mathcal{M}_R = \{\mathcal{P} : J(\mathcal{P}) \le R\}$. There exist constants $c > 0$ and $0 < \alpha < 1$ such that for any $\epsilon, R > 0$ and all sequences of state-action pairs $z_{1:n} \triangleq z_1, \ldots, z_n \in \mathcal{S}\times\mathcal{A}$, the following metric entropy condition is satisfied:

$$\log \mathcal{N}\big(\epsilon, \mathcal{M}, L_2(z_{1:n})\big) \le c\left(\frac{R}{\epsilon}\right)^{2\alpha} \tag{6}$$

where $\mathcal{N}(\epsilon, \mathcal{M}, L_2(z_{1:n}))$ is the covering number of $\mathcal{M}$ with respect to the empirical norm $L_2(z_{1:n})$ such that $\|\mathcal{P}\|^2_{2, z_{1:n}} = \frac{1}{N}\sum_{i=1}^{N} \|\mathcal{P}(\cdot|z_i)\|_2^2$.

Under these assumptions, we can now present our main theorem, which relates the local Lipschitz constant to the suboptimality of the approximate model-based value iteration algorithm. In particular, we provide a finite sample analysis of model learning in Theorem B.2 and of the value-aware model error in Theorem B.4. Then we apply the error accumulation results from Farahmand et al. (2017a), connecting the local Lipschitz constant with the suboptimality of the algorithm.

Theorem 3.6. Suppose $\hat{Q}_0$ is initialized such that $\hat{Q}_0(s,a) \le \frac{R_{\max}}{1-\gamma}$ for all $(s,a)$. Under Assumptions 3.1, 3.2, 3.3, 3.4, and 3.5, after $K$ iterations of the model-based approximate value iteration algorithm, there exists a constant $\kappa(\alpha)$, which depends solely on $\alpha\in(0,1)$, such that

$$\mathbb{E}_{(s,a)\sim\mathcal{P}_0}\Big[\big|Q^*(s,a) - \hat{Q}_K(s,a)\big|\Big] \le \frac{2\gamma}{(1-\gamma)^2}\Bigg[C(\rho,\mathcal{P}_0)\bigg(\max_{0\le k\le K}\delta_k + \gamma^2\Big(4\epsilon^2 L^2 \xi + (1-\xi)\frac{R_{\max}^2}{(1-\gamma)^2}\Big)\bigg) + 2\gamma^K R_{\max}\Bigg] \tag{7}$$

where $\delta_k^2 = \mathcal{L}_{\text{reg}}(\hat{Q}_k; \hat{Q}_{k-1}, \hat{\mathcal{P}}_k)$ is the regression error defined in Equation (3), $\xi = 1 - \exp\Big(-\frac{\epsilon N^{\frac{1}{1+\alpha}}}{\kappa(\alpha) D^2 R^{\frac{2\alpha}{1+\alpha}}}\Big)$, and $C(\rho,\mathcal{P}_0)$ is the concentrability constant defined in Definition A.1.

Remarks.

(1) As the number of samples $N\to\infty$, $\xi\to 1$. Consequently, the value-aware model error $\mathbb{E}\big[|\mathcal{T}^*\hat{Q}_k(s,a) - \hat{\mathcal{T}}^*\hat{Q}_k(s,a)|^2\big]$ will be governed by the term $4\epsilon^2L^2\xi$, which is controlled by the local Lipschitz constant $L$. In addition, if the number of iterations $K\to\infty$, the left-hand side of Equation (7) will be bounded by $\frac{2\gamma C(\rho,\mathcal{P}_0)}{(1-\gamma)^2}\big(\max_{0\le k\le K}\delta_k + 4\epsilon^2L^2\xi\big)$. From here, we can see the role that the local Lipschitz constant plays in controlling the suboptimality of the algorithm.

(2) Although we assume that the environment transition is deterministic, our insights should still hold when it is stochastic. In particular, if the learned transition model has a low prediction error $L_2(\hat{\mathcal{P}})$ (Eq. 5), which is often the case with a neural network as the function approximator, and the local Lipschitz constant of the value function is bounded, then we can still have a small value-aware model error and obtain an error propagation result similar to Theorem 3.6.

Tradeoff between regression error and value-aware model error. This theorem also reveals an essential trade-off between the regression error and the value-aware model error through the Lipschitz condition of the value function class, i.e., the constant $L$. As $L$ gets smaller, the model-induced Bellman operator gets closer to the actual underlying Bellman operator. However, a smaller $L$ also imposes a stronger condition on the value function class. Therefore, the value function space shrinks, and the regression error $\delta_k$ is expected to get larger. To further visualize this trade-off, we conduct an experiment on the Inverted-Pendulum environment, where we run the analyzed model-based approximate value iteration algorithm for 1000 iterations. We plot the best evaluation performance over 20 episodes at every iteration (Figure 2a), the maximum regression error across all iterations (Figure 2b), and the maximum value-aware model error (Figure 2c).
As the results in Figure 2b show, the regression error indeed drops dramatically as the Lipschitz constant grows from 100 to 1000 and then levels off, which suggests that a Lipschitz constant of 1000 may already make the value function class rich enough to be Bellman-complete for this environment. Meanwhile, the value-aware model error increases with a bigger Lipschitz constant. We achieve a good balance between these two errors when the Lipschitz constant is between 1000 (left dashed line) and 4000 (right dashed line), where the algorithm is observed to perform the best.
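To make the dependence on $L$ and $N$ concrete, the following minimal sketch evaluates the value-aware term $\gamma^2\big(4\epsilon^2L^2\xi + (1-\xi)R_{\max}^2/(1-\gamma)^2\big)$ together with $\xi$ as defined in Theorem 3.6. The function names and all numerical values, in particular the value chosen for $\kappa(\alpha)$, are hypothetical placeholders; the theorem only asserts that such a constant exists.

```python
import math


def xi(eps, n_samples, alpha, kappa, d, r):
    """xi = 1 - exp(-eps * N^(1/(1+alpha)) / (kappa * D^2 * R^(2*alpha/(1+alpha))))."""
    exponent = eps * n_samples ** (1.0 / (1.0 + alpha))
    exponent /= kappa * d ** 2 * r ** (2.0 * alpha / (1.0 + alpha))
    return 1.0 - math.exp(-exponent)


def vame_term(gamma, eps, lipschitz, xi_value, r_max):
    """gamma^2 * (4 * eps^2 * L^2 * xi + (1 - xi) * R_max^2 / (1 - gamma)^2)."""
    local_part = 4.0 * eps ** 2 * lipschitz ** 2 * xi_value
    tail_part = (1.0 - xi_value) * r_max ** 2 / (1.0 - gamma) ** 2
    return gamma ** 2 * (local_part + tail_part)


if __name__ == "__main__":
    # All constants below are illustrative assumptions, not values from the paper.
    for n in (10, 1000, 100000):
        x = xi(eps=0.1, n_samples=n, alpha=0.5, kappa=1.0, d=1.0, r=1.0)
        for lip in (1.0, 10.0, 100.0):
            print(n, lip, round(x, 4), vame_term(0.99, 0.1, lip, x, 1.0))
```

As $\xi$ approaches 1 the tail part vanishes and the term grows quadratically in $L$, which is the side of the trade-off that favors a smaller Lipschitz constant; the regression error $\delta_k$ is not modeled in this sketch.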

4. METHODS

In this section, we present two different approaches to regularize the (local) Lipschitz constant of the value function.

4.1. SPECTRAL NORMALIZATION

First, to explicitly control the upper bound of the Lipschitz constant, we adapt the technique of spectral normalization, which was originally proposed to stabilize the training of Generative Adversarial Networks (GANs) (Miyato et al., 2018). By controlling the spectral norm of the weight matrix at every layer of the value network, its Lipschitz constant is upper bounded. In particular, during each forward pass, we approximate the spectral norm $\|W\|_2$ of the weight matrix with one step of power iteration. Then, we perform a projection step so that the spectral norm is clipped to $\beta$ if it is bigger than $\beta$ and left unchanged otherwise. See Appendix C for more computational details.
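The power-iteration estimate and the clipping step can be sketched as follows. This is a dependency-free illustration on a weight matrix stored as a list of rows; the helper names are hypothetical and the code is not the implementation described in Appendix C. A persistent vector `u` is carried between calls, so that one step per forward pass gradually refines the estimate.

```python
import math


def matvec(w, v):
    """Compute W v for a matrix stored as a list of rows."""
    return [sum(wij * vj for wij, vj in zip(row, v)) for row in w]


def matvec_t(w, u):
    """Compute W^T u."""
    return [sum(w[i][j] * u[i] for i in range(len(w))) for j in range(len(w[0]))]


def normalize(v):
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def power_iteration_step(w, u):
    """One power-iteration step; returns (estimate of ||W||_2, updated u)."""
    v = normalize(matvec_t(w, u))
    wv = matvec(w, v)
    u_new = normalize(wv)
    sigma = sum(a * b for a, b in zip(u_new, wv))  # equals ||W v||_2
    return sigma, u_new


def clip_spectral_norm(w, u, beta):
    """Rescale W so its estimated spectral norm is at most beta; else keep W."""
    sigma, u = power_iteration_step(w, u)
    scale = beta / sigma if sigma > beta else 1.0
    return [[scale * x for x in row] for row in w], u


if __name__ == "__main__":
    w = [[3.0, 0.0], [0.0, 1.0]]
    u = [1.0, 1.0]
    for _ in range(50):  # in training, one step would run per forward pass
        sigma, u = power_iteration_step(w, u)
    clipped, u = clip_spectral_norm(w, u, beta=2.0)
    print(sigma, clipped)
```

Because a single step underestimates $\|W\|_2$ until `u` has converged, the clipped matrix can briefly exceed $\beta$ early in training; this is a property of the one-step approximation in this sketch.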

4.2. ROBUST REGULARIZATION

Spectral normalization directly bounds the global Lipschitz constant of the value function. However, as we argued in Section 3.2, we only require the local Lipschitz condition around the model-uncertain local region. Such a strong regularization is unnecessary and can even hurt the expressive power of the value function. We now present an alternative approach that regularizes the local Lipschitz constant of the value function with a robust training mechanism. To regularize the local Lipschitz condition of the value network, we minimize the following loss:

$$\mathcal{L}^{(\alpha)}_{\text{reg}}\big(Q_\phi; \epsilon, \pi_\psi, \{s_i,a_i,s'_i\}_{i=1}^{D}\big) = \sum_{i=1}^{D} \max_{\|\tilde{s}_i - s_i\|_\alpha \le \epsilon} \Big(Q_\phi\big(\tilde{s}_i, \pi_\psi(s'_i)\big) - Q_\phi(s_i,a_i)\Big)^2 \tag{8}$$

This robust loss ensures that the local variation of the value function is small. We then combine it with the original loss of the value function $\mathcal{L}_{\text{org}}\big(Q_\phi; \pi_\psi, \{s_i,a_i,s'_i\}_{i=1}^{D}\big)$ by minimizing

$$\mathcal{L}\big(Q_\phi; \pi_\psi, \{s_i,a_i,s'_i\}_{i=1}^{D}\big) = \mathcal{L}_{\text{org}}\big(Q_\phi; \pi_\psi, \{s_i,a_i,s'_i\}_{i=1}^{D}\big) + \lambda\, \mathcal{L}^{(\alpha)}_{\text{reg}}\big(Q_\phi; \epsilon, \pi_\psi, \{s_i,a_i,s'_i\}_{i=1}^{D}\big).$$

Here, $\lambda$ and $\epsilon$ are two hyperparameters. A larger $\epsilon$ is required for a dynamics model with larger prediction errors. A bigger $\lambda$ makes the value network vary more smoothly over the local region and gives better convergence guarantees, but it also hurts the expressive power of the value network. In practice, we find that $\epsilon = 0.1$ is often enough for good performance, so we fix $\epsilon$ and search for the best $\lambda$. For a detailed discussion on the choice of $\lambda$, see Appendix D.4.

To solve for the perturbation $\tilde{s}$ in the constrained optimization within the robust loss, we use the fast gradient sign method (FGSM) (Goodfellow et al., 2015). Let $(s,a)$ be the state-action pair. We compute

$$\tilde{s} = \mathrm{proj}\Big(s_0 + \epsilon\, \mathrm{sign}\Big(\nabla_{\tilde{s}}\big|_{\tilde{s}=s_0} \big(Q_\phi(\tilde{s}, \pi_\psi(s)) - Q_\phi(s,a)\big)^2\Big),\; B^{(\alpha)}(s_0, \epsilon)\Big),$$

where $s_0 \sim \mathcal{N}(s, \epsilon^2 I)$.
We initialize $s_0$ randomly because the gradient $\nabla_{\tilde{s}} \big(Q_\phi(\tilde{s}, \pi_\psi(s)) - Q_\phi(s,a)\big)^2$ vanishes at $s$. In our experiments, we focus on the $l_\infty$ norm, but our proposed robust loss should be applicable to any $l_p$ norm. In addition, more advanced constrained optimization solvers such as Projected Gradient Descent (PGD) can be applied. Here, we use FGSM together with a single deterministic environment model mainly for efficiency. In practice, we find that PGD does not give much performance gain over FGSM, and we refer the readers to Appendix D.3 for the experimental results.
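The FGSM step above can be sketched without any deep learning library. This minimal illustration assumes a toy critic `q_fn(state, action)` and policy `policy_fn(state)` given as plain Python callables (both hypothetical), uses central finite differences in place of automatic differentiation, and follows the $l_\infty$ case with the projection ball centered at $s_0$ as written in the formula.

```python
import random


def numerical_grad(fn, x, h=1e-5):
    """Central finite-difference gradient of a scalar function of a list."""
    grad = []
    for j in range(len(x)):
        hi = x[:j] + [x[j] + h] + x[j + 1:]
        lo = x[:j] + [x[j] - h] + x[j + 1:]
        grad.append((fn(hi) - fn(lo)) / (2.0 * h))
    return grad


def sign(x):
    return (x > 0) - (x < 0)


def squared_gap(q_fn, policy_fn, s, a, s_tilde):
    """(Q(s_tilde, pi(s)) - Q(s, a))^2, the quantity maximized over s_tilde."""
    return (q_fn(s_tilde, policy_fn(s)) - q_fn(s, a)) ** 2


def fgsm_perturbation(q_fn, policy_fn, s, a, eps, rng):
    """One FGSM step from a random start s0 ~ N(s, eps^2 I); returns (s_tilde, s0)."""
    s0 = [rng.gauss(x, eps) for x in s]
    grad = numerical_grad(lambda st: squared_gap(q_fn, policy_fn, s, a, st), s0)
    stepped = [x + eps * sign(g) for x, g in zip(s0, grad)]
    # Projection onto the l_inf ball B(s0, eps); after a sign step this does
    # not move the point, and is kept only to mirror the formula.
    s_tilde = [min(max(x, c - eps), c + eps) for x, c in zip(stepped, s0)]
    return s_tilde, s0


if __name__ == "__main__":
    rng = random.Random(0)
    weights = [1.0, -2.0]

    def toy_q(state, action):
        # Linear toy critic for illustration only.
        return sum(w * x for w, x in zip(weights, state)) + 0.5 * action

    s, a, eps = [0.3, -0.2], 0.0, 0.1
    s_tilde, s0 = fgsm_perturbation(toy_q, lambda state: 1.0, s, a, eps, rng)
    print(squared_gap(toy_q, lambda state: 1.0, s, a, s0),
          squared_gap(toy_q, lambda state: 1.0, s, a, s_tilde))
```

In a training loop, the squared gap at `s_tilde` would be summed over the batch and added to the critic loss with weight $\lambda$. When the policy action equals `a`, the gradient at $s$ itself is zero, which is why the random start is needed.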

5.1. EMPIRICAL EVALUATIONS OF PROPOSED MECHANISMS

We evaluate our two proposed training mechanisms on five MuJoCo tasks: Walker, Humanoid, Ant, Hopper, and HalfCheetah. We compare the two training mechanisms with MBPO (Janner et al., 2019) using an ensemble of probabilistic transition models. We do not compare our methods with MBPO using an ensemble of deterministic transition models, mainly because it is ensemble-based and is outperformed by probabilistic ensemble models across all five tasks. In addition, since our regularization mechanisms are trained on top of a single deterministic transition model, we compare them with MBPO using a single deterministic transition model and with MBPO using a single probabilistic model. We implement our methods and the baseline methods based on a PyTorch implementation of MBPO (Lin, 2022). More implementation details are provided in Appendix C.

Improved asymptotic performance. Figure 3 presents the learning curves for all methods. These results show that, using just a single deterministic model, MBPO with our two Lipschitz regularization mechanisms achieves performance comparable to or even better than MBPO with a probabilistic ensemble model across all five tasks. In particular, the proposed robust regularization technique shows a larger advantage on the three more sophisticated tasks: Humanoid, Ant, and Walker. For example, on Humanoid, it achieves the same final performance as a probabilistic ensemble with only about 60% of the environment interaction steps. In contrast, spectral normalization shows little improvement over a single deterministic model on these three tasks, showing the limitation of constraining the global Lipschitz constant. We provide more discussion on this in Section 5.3. The significance of the experimental results is twofold. First, they further validate our insights and show the importance of the local Lipschitz condition of value functions in MBRL. Second, they demonstrate that an ensemble of transition models is not necessary.
We can save the computational time and cost of training an ensemble of transition models with a simple Lipschitz regularization mechanism. In practice, we can even use a single probabilistic model combined with our proposed mechanisms to get the best performance. See Appendix D.2 for the extra experimental results.

5.2. VISUALIZING THE VALUE-AWARE MODEL ERROR

Given the excellent performance of robust regularization, we now verify its effectiveness in controlling the value-aware model error. On Walker, we compare it with a variant that computes the perturbation with uniform random noise instead of choosing the perturbation adversarially. As we can see from the first panel of Figure 4, robust regularization has a much smaller value-aware model error than all other methods. Adding uniformly random noise can somewhat reduce the value-aware model error compared with a single deterministic model (without any noise added). However, it is still much less effective than robust regularization, which computes the perturbation adversarially. In the second panel of Figure 4, we see that the uniform random noise variant achieves a slightly better performance than a single deterministic model but is still far worse than robust regularization. The results further verify that, by adversarially choosing the noise, robust regularization is highly effective at controlling the value-aware model error, resulting in strong empirical performance. In Appendix D.1, we provide additional experimental results on the value-aware model error in the remaining four environments.

5.3. ROBUST REGULARIZATION VS. SPECTRAL NORMALIZATION

In Table 1, we already see that robust regularization achieves a much better time efficiency than spectral normalization. Now we analyze their performance difference through the lens of the value-aware model error. As we see from the plot, to reduce the value-aware model error of spectral normalization to about the same level as that of robust regularization, we need to choose the spectral norm bound $\beta$ as small as 10. However, under this constraint, it achieves a significantly worse performance than robust regularization, indicating that this constraint is perhaps too strong. To get its best performance, spectral normalization has to use a larger $\beta$, trading value-aware model error for expressive power, and thus has worse empirical performance. In addition, we observe from Figure 3 that the performance discrepancy between the two methods is even greater in more complicated environments such as Ant and Humanoid, which further shows the limitation of constraining the global Lipschitz constant.

6. RELATED WORK

Despite the popularity of the probabilistic dynamics ensemble, there is still no clear explanation of why its use brings such a large improvement to the policy. In this paper, we fill this gap and propose a novel theoretical explanation of why the probabilistic dynamics ensemble works well.

Lipschitz continuity in reinforcement learning. In the realm of model-free RL, a few recent works also apply spectral normalization to regularize the Lipschitz condition of the RL agent's value function, which results in more stable optimization dynamics (Gogianu et al., 2021; Bjorck et al., 2021). Ball & Roberts (2021) investigate the effects of added Gaussian exploration noise on the smoothness of value functions. In addition, Shen et al. (2020) and Zhang et al. (2020) propose to use a robust regularizer similar to our robust regularization method, aiming at an adversarially robust policy. In our work, we focus on MBRL, a fundamentally different setting. As explained in Section 3.1 and Appendix F, if no restrictions are imposed on the value function class, then the value-aware model error could explode, whereas the model error vanishes in the model-free setting. In MBRL, Osband & Van Roy (2014) connect the Lipschitz constant of the value function to the regret bound of posterior sampling for RL. Asadi et al. (2018) propose to learn a generalized Lipschitz transition model with respect to the Wasserstein metric, resulting in a bounded multi-step model prediction error and a Lipschitz optimal value function induced from the learned dynamics. In contrast, our paper focuses on the local Lipschitz condition of the value function instead of the underlying transition model, and studies its relation with the suboptimality of MBRL algorithms.

7. CONCLUSION AND DISCUSSION

In this paper, we provide insight into why the probabilistic ensemble model can achieve great empirical performance. We demonstrate the importance of the local Lipschitz condition of value functions in MBRL algorithms and support our hypothesis with both theoretical and empirical results. Based on our insight, we propose two training mechanisms that directly regularize the Lipschitz condition of the value function. Empirical studies demonstrate the effectiveness of the proposed mechanisms. One limitation is that if we use a single model instead of an ensemble, we cannot use the many existing uncertainty estimation methods that rely on a model ensemble, and we need to redesign the uncertainty estimation mechanism for a single model. We leave this as future work.

Then, with $c_1 = 704$ and $c_2 = 26$, for any $K > 1$ and every $x > 0$, with probability at least $1 - e^{-x}$, for any $f\in\mathcal{F}$, we have

$$\mathbb{E}[f] \le \frac{K}{K-1}\mathbb{E}_n[f] + \frac{c_1 K}{B} r^* + \frac{x\big(11(b-a) + c_2 B K\big)}{n}.$$

Also, with probability at least $1 - e^{-x}$, for any $f\in\mathcal{F}$, we have

$$\mathbb{E}_n[f] \le \frac{K}{K-1}\mathbb{E}[f] + \frac{c_1 K}{B} r^* + \frac{x\big(11(b-a) + c_2 B K\big)}{n}.$$

We will then use $r^*(\mathcal{F})$ to denote the fixed point of a sub-root function $\psi$ that satisfies (10).

Now we prove the following theorem, which provides a finite sample bound on the model learning error.

Theorem B.2. Under the four Assumptions 3.1, 3.2, 3.3, and 3.5, with the probability model learned based on Equation 2, there exists a constant $\kappa(\alpha)$, which depends solely on $\alpha\in(0,1)$, such that with probability $1-\delta$,

$$L(\hat{\mathcal{P}}) \le \epsilon + \frac{\kappa(\alpha) D^2 R^{\frac{2\alpha}{1+\alpha}} \ln(\frac{1}{\delta})}{N^{\frac{1}{1+\alpha}}},$$

where $N$ is the number of samples from the data-collection distribution $\rho$, $D$ is the size of the state space defined in Assumption 3.2, and $R$ is defined in the metric entropy condition of the model class in Assumption 3.5.

Proof.
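Given a batch of state-action transition triples $\{(s_i, a_i, s'_i)\}_{i=1}^{N}$ with $(s_i, a_i)$ sampled i.i.d. from the data distribution $\rho$, we denote the empirical loss

$$L_n(P) = \frac{1}{N} \sum_{i=1}^{N} \int_{\mathcal{S}} P(\hat{s}'_i \mid s_i, a_i)\, \|\hat{s}'_i - s'_i\| \, d\hat{s}'_i.$$

We also denote the underlying loss over the data distribution $\rho$ as

$$L(P) = \mathbb{E}_{(s,a)\sim\rho} \int_{\mathcal{S}} P(\hat{s}' \mid s, a)\, \|\hat{s}' - s'\| \, d\hat{s}'.$$

In addition, let $\bar{P} \in \mathcal{M}$ be the best model in the transition kernel class $\mathcal{M}$ defined in 3.5, and let $\hat{P}$ denote the learned model. Now we would like to apply Theorem B.1 to bound the difference between the empirical and the true underlying loss. First, let

$$\mathcal{F} = \left\{ (s \times a, s') \mapsto l(s \times a, s'; P) - l(s \times a, s'; \bar{P}) \;:\; P \in \mathcal{M} \right\}$$

be the class of functions in Theorem B.1, where $l(s \times a, s'; P)$ is the single-datapoint version of the empirical loss $L_n(P)$, so that $\mathbb{E}_{s \times a \sim \rho}[l(s \times a, s'; P)] = L(P)$. By assumption 3.2, $0 \le l(s \times a, s'; P) \le 2D$ for every $s \times a \in \mathcal{S} \times \mathcal{A}$, $s' \in \mathcal{S}$, and $P \in \mathcal{M}$. Therefore, the value of $f$ is bounded between $-2D$ and $2D$ for every $f \in \mathcal{F}$. As a consequence, for every $f \in \mathcal{F}$, $\mathrm{Var}(f) \le \mathbb{E}[f^2] \le 4D^2$. So we can set

$$T(f) = 4D^2\, \mathbb{E} \int_{\mathcal{S}} P(\hat{s}' \mid s, a)\, \|\hat{s}' - s'\| \, d\hat{s}', \qquad B = 4D^2.$$

Now we can apply Theorem B.1 (with $K = 2$, $x = \ln\frac{1}{\delta}$, and $b - a = 4D$) to conclude that with probability $1 - \delta$,

$$L(P) - L(\bar{P}) \le 2\left(L_n(P) - L_n(\bar{P})\right) + \frac{2 \times 704}{4D^2}\, r^*(\mathcal{F}) + \frac{\left(11 \times 4D + 2 \times 26 \times 4D^2\right) \ln\left(\frac{1}{\delta}\right)}{N}.$$

Since $\hat{P}$ is the minimizer of the empirical loss $L_n(P)$, we have $L_n(\hat{P}) - L_n(\bar{P}) \le 0$, and hence

$$L(\hat{P}) - L(\bar{P}) \le \frac{352}{D^2}\, r^*(\mathcal{F}) + \frac{\left(44D + 208D^2\right) \ln\left(\frac{1}{\delta}\right)}{N}.$$

We can provide an upper bound on the local Rademacher complexity $r^*(\mathcal{F})$: there exists a finite constant $\tau > 0$ such that, for a given $0 \le \alpha \le 1$,

$$r^*(\mathcal{F}) \le \frac{c_1(\alpha)\, D^4 R^{\frac{2\alpha}{1+\alpha}}}{N^{\frac{1}{1+\alpha}}} + \frac{\tau D^4 \ln N}{N}, \qquad \text{where } c_1(\alpha) = \frac{\tau}{(1-\alpha)^{\frac{2}{1+\alpha}}}.$$

The proof follows exactly the same steps as Proposition 10 in Farahmand et al. (2017a). Now back to Equation 17: by the realizability assumption 3.3, the best model $\bar{P}$ in the model class satisfies $L(\bar{P}) \le \epsilon$.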
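Therefore, with probability $1 - \delta$,

$$L(\hat{P}) \le \epsilon + \frac{352\, c_1(\alpha)\, D^2 R^{\frac{2\alpha}{1+\alpha}}}{N^{\frac{1}{1+\alpha}}} + \frac{352\, \tau D^2 \ln N}{N} + \frac{\left(44D + 208D^2\right) \ln\left(\frac{1}{\delta}\right)}{N}.$$

Finally, there should exist a sufficiently large constant $\kappa(\alpha)$ such that with probability $1 - \delta$,

$$L(\hat{P}) \le \epsilon + \frac{\kappa(\alpha)\, D^2 R^{\frac{2\alpha}{1+\alpha}} \ln\left(\frac{1}{\delta}\right)}{N^{\frac{1}{1+\alpha}}}. \tag{20}$$

Corollary B.3. Under the four assumptions 3.1, 3.2, 3.3, and 3.5, with the probability model learned based on Equation 2, there exists a constant $\kappa(\alpha)$, which depends solely on $\alpha \in (0, 1)$, such that with probability $1 - \exp\left(-\frac{\epsilon N^{\frac{1}{1+\alpha}}}{\kappa(\alpha) D^2 R^{\frac{2\alpha}{1+\alpha}}}\right)$,

$$L(\hat{P}) \le 2\epsilon.$$

Proof. This is a straightforward application of Theorem B.2, where we let $\epsilon = \frac{\kappa(\alpha) D^2 R^{\frac{2\alpha}{1+\alpha}} \ln\left(\frac{1}{\delta}\right)}{N^{\frac{1}{1+\alpha}}}$.

Next, we consider the local Lipschitz condition of the value function and provide a finite-sample bound on the value-aware model error.

Theorem B.4. Under the five assumptions 3.1, 3.2, 3.3, 3.4, and 3.5, with the probability model learned based on Equation 2, there exists a constant $\kappa(\alpha)$, which depends solely on $\alpha \in (0, 1)$, such that for any $m > 1$,

$$\int \left| T^* Q(s, a) - \hat{T}^* Q(s, a) \right|^2 d\rho(s, a) \le \gamma^2 \left( 4\epsilon^2 L^2 \xi + \frac{R_{\max}^2}{(1-\gamma)^2} (1 - \xi) \right),$$

where $\xi = 1 - \exp\left(-\frac{\epsilon N^{\frac{1}{1+\alpha}}}{\kappa(\alpha) D^2 R^{\frac{2\alpha}{1+\alpha}}}\right)$.

Proof.

$$\begin{aligned}
\left\| T^* Q - \hat{T}^* Q \right\|_{\rho}^2
&= \int \left| r(s, a) + \gamma V(s') - r(s, a) - \gamma \int \hat{P}(d\hat{s}' \mid s, a)\, V(\hat{s}') \right|^2 d\rho(s, a) \\
&= \gamma^2 \int \left| \int \hat{P}(d\hat{s}' \mid s, a) \left( V(\hat{s}') - V(s') \right) \right|^2 d\rho(s, a) \\
&\le \gamma^2 \int \int \hat{P}(d\hat{s}' \mid s, a) \left| V(\hat{s}') - V(s') \right|^2 d\rho(s, a) \\
&\le \gamma^2 \int \int \mathbb{1}\{\|s' - \hat{s}'\| \le 2\epsilon\} \left| V(\hat{s}') - V(s') \right|^2 \hat{P}(d\hat{s}' \mid s, a)\, d\rho(s, a) \\
&\quad + \gamma^2 \int \int \mathbb{1}\{\|s' - \hat{s}'\| > 2\epsilon\} \left| V(\hat{s}') - V(s') \right|^2 \hat{P}(d\hat{s}' \mid s, a)\, d\rho(s, a) \\
&\le \gamma^2\, 2^2 \epsilon^2 L^2\, \mathbb{P}\{\|s' - \hat{s}'\| \le 2\epsilon\} + \gamma^2 \frac{R_{\max}^2}{(1-\gamma)^2}\, \mathbb{P}\{\|s' - \hat{s}'\| > 2\epsilon\}.
\end{aligned}$$

Now apply Corollary B.3:

$$\left\| T^* Q - \hat{T}^* Q \right\|_{\rho}^2 \le \gamma^2\, 4\epsilon^2 L^2 \left( 1 - \exp\left(-\frac{\epsilon N^{\frac{1}{1+\alpha}}}{\kappa(\alpha) D^2 R^{\frac{2\alpha}{1+\alpha}}}\right) \right) + \gamma^2 \frac{R_{\max}^2}{(1-\gamma)^2} \exp\left(-\frac{\epsilon N^{\frac{1}{1+\alpha}}}{\kappa(\alpha) D^2 R^{\frac{2\alpha}{1+\alpha}}}\right).$$

Finally, with the value-aware model error bounded, we can apply the error propagation results of Munos (2005) and Farahmand et al. (2017a) to prove our main theorem, which relates the local Lipschitz constant to the sub-optimality of the approximate model-based value iteration algorithm.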
Theorem B.5. Suppose $\hat{Q}_0$ is initialized such that $\hat{Q}_0(s, a) \le \frac{R_{\max}}{1-\gamma}$ for all $(s, a)$. Under assumptions 3.1, 3.2, 3.3, 3.4, and 3.5, after $K$ iterations of the model-based approximate value iteration algorithm, there exists a constant $\kappa(\alpha)$, which depends solely on $\alpha \in (0, 1)$, such that

$$\mathbb{E}_{(s,a)\sim P_0} \left[ \left| Q^*(s, a) - \hat{Q}_K(s, a) \right| \right] \le \frac{2\gamma}{(1-\gamma)^2} \left[ C(\rho, P_0) \left( \max_{0 \le k \le K} \delta_k + \gamma^2 \left( 4\epsilon^2 L^2 \xi + \frac{(1-\xi) R_{\max}^2}{(1-\gamma)^2} \right) \right) + 2\gamma^K R_{\max} \right],$$

where $\delta_k^2 = L_{\mathrm{reg}}(\hat{Q}_k; \hat{Q}_{k-1}, \hat{P}_k)$ is the regression error defined in Equation (3), $\xi = 1 - \exp\left(-\frac{\epsilon N^{\frac{1}{1+\alpha}}}{\kappa(\alpha) D^2 R^{\frac{2\alpha}{1+\alpha}}}\right)$, and $C(\rho, P_0)$ is the concentrability constant defined in Definition A.1.

Proof. This follows directly from Theorem B.4 and Theorem 4 of Farahmand et al. (2017a), where the value-aware model error satisfies $e_{\mathrm{model}}(N) \le \gamma^2 \left( 4\epsilon^2 L^2 \xi + \frac{R_{\max}^2}{(1-\gamma)^2} (1 - \xi) \right)$.
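To make the role of the local Lipschitz constant $L$ concrete, the following minimal sketch evaluates the right-hand side of the Theorem B.4 bound, $\gamma^2 \left( 4\epsilon^2 L^2 \xi + \frac{R_{\max}^2}{(1-\gamma)^2}(1-\xi) \right)$, for given inputs. The function names and every numerical value are illustrative assumptions: the theorem only asserts that a constant $\kappa(\alpha)$ exists, so the value passed for `kappa` is a placeholder rather than a quantity from the paper.

```python
import math


def coverage_probability(eps, n_samples, alpha, kappa, D, R):
    """xi = 1 - exp(-eps * N^(1/(1+alpha)) / (kappa * D^2 * R^(2*alpha/(1+alpha)))).

    kappa, D and R are placeholders here; the theory does not give kappa in closed form.
    """
    exponent = eps * n_samples ** (1.0 / (1.0 + alpha))
    exponent /= kappa * D ** 2 * R ** (2.0 * alpha / (1.0 + alpha))
    return 1.0 - math.exp(-exponent)


def value_aware_error_bound(L, eps, xi, gamma, r_max):
    """Right-hand side of the squared value-aware model error bound (Theorem B.4)."""
    # Contribution of predictions within 2*eps of the true next state (local Lipschitz term).
    close_term = 4.0 * eps ** 2 * L ** 2 * xi
    # Contribution of predictions farther away, bounded by the largest possible value gap.
    far_term = (r_max / (1.0 - gamma)) ** 2 * (1.0 - xi)
    return gamma ** 2 * (close_term + far_term)


if __name__ == "__main__":
    xi = coverage_probability(eps=0.1, n_samples=1000, alpha=0.5, kappa=1.0, D=1.0, R=1.0)
    for L in (1.0, 5.0, 25.0):
        bound = value_aware_error_bound(L, eps=0.1, xi=xi, gamma=0.99, r_max=1.0)
        print(f"L = {L:5.1f}  bound = {bound:.4f}")
```

For a fixed $\xi$, the first term grows quadratically in $L$, which is the dependence that the Lipschitz regularization mechanisms aim to control.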

C IMPLEMENTATION DETAILS

In this section, we describe the implementation details of our two proposed methods. We implement our methods on top of a PyTorch implementation of MBPO (Lin, 2022). The dynamics model is an MLP with four hidden layers of size 200. In Ant and Humanoid, the hidden size is 400 because these two environments are more complex than the others. For the probabilistic dynamics model ensemble, we set the ensemble size to 7, which is the setting used in the original MBPO paper (Janner et al., 2019). The policy is optimized with Soft Actor-Critic (SAC) (Haarnoja et al., 2018). The actor and critic networks are MLPs with two hidden layers of size 256.

For robust regularization, we fix ϵ to 0.1 as discussed in Section 4.2, and we do a grid search of λ over [0.01, 0.1, 1.0], with the results presented in Table 2.

For spectral normalization, we add the normalization to each layer of the critic network. In particular, at every forward pass, we approximate the spectral norm of the weight matrix with one step of power iteration. The algorithm is sketched below, with $u$ and $v$ being the right and left singular vectors of the weight matrix $W$.

$$v \leftarrow W u^{(t-1)}; \quad \alpha \leftarrow \|v\|; \quad v^{(t)} \leftarrow \alpha^{-1} v \tag{24}$$

$$u \leftarrow W^{\top} v^{(t)}; \quad \rho \leftarrow \|u\|; \quad u^{(t)} \leftarrow \rho^{-1} u \tag{25}$$

Then we perform a projection of the parameters: $W := \max\left(1, \frac{\max(\alpha, \rho)}{\beta}\right)^{-1} W$. The spectral norm is thus clipped to β if it is larger than β and left unchanged otherwise. We do a grid search of β over [15, 20, 25, 30, 35].

In the experiment section, we combine our proposed mechanisms with a single deterministic model and compare it against MBPO using an ensemble of probabilistic models. The purpose is to verify that regularization of the local Lipschitz constant is critical in MBRL algorithms and to propose a computationally efficient MBRL algorithm without a model ensemble.
In practice, complementary to our proposed Lipschitz regularization mechanisms, we can also use a single probabilistic model to further regularize the local Lipschitz condition of the value function. In addition, a probabilistic environment model is better suited to environments with stochastic transitions. In Figure 6, we combine our two proposed mechanisms with both a probabilistic and a deterministic model on Ant and compare them with the probabilistic ensemble baseline. From Figure 6b, we see that although spectral normalization with a single deterministic model has a large value-aware model error, the error is significantly reduced when spectral normalization is combined with a probabilistic dynamics model. As a result, spectral normalization with a probabilistic model achieves much better performance and even outperforms MBPO with an ensemble of probabilistic models. For robust regularization, using a probabilistic model also improves the algorithm's value-aware model error and performance. This observation suggests that the two Lipschitz regularization approaches, explicit regularization by spectral normalization or robust regularization and implicit regularization by probabilistic models, are complementary. In practice, we can combine the two approaches to get the best performance from the MBRL algorithm.

D.3 ROBUST REGULARIZATION WITH FGSM VS. PGD

In Figure 7, we compare the performance of robust regularization with 20 steps of Projected Gradient Descent (PGD-20) against the Fast Gradient Sign Method (FGSM) on Walker. Although PGD-20 is much more computationally expensive, we do not observe an improvement from this more powerful constrained optimization solver.
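The difference between the two solvers can be illustrated on a toy problem. The sketch below is not the paper's PyTorch implementation: it assumes the regularizer penalizes the squared change of the value under an $\ell_\infty$-bounded state perturbation, uses a hand-written toy value function `toy_q` with an analytic gradient in place of the critic network, and all function names are hypothetical. Because the gradient of the squared difference vanishes at zero perturbation, the FGSM step here follows the sign of the gradient of the value itself, and PGD starts from that point; both choices are assumptions made for this illustration.

```python
import numpy as np


def robust_penalty(q, s, delta):
    """Squared change of the value under the state perturbation delta."""
    return float((q(s + delta) - q(s)) ** 2)


def fgsm_perturbation(grad_q, s, eps):
    """One signed-gradient step of size eps (stays on the l_inf ball of radius eps)."""
    return eps * np.sign(grad_q(s))


def pgd_perturbation(q, grad_q, s, eps, steps=20, step_size=None):
    """Projected signed-gradient ascent on the penalty inside the l_inf ball.

    Returns the best iterate seen, so the result is never worse than its FGSM start.
    """
    step_size = 2.5 * eps / steps if step_size is None else step_size
    delta = fgsm_perturbation(grad_q, s, eps)
    best, best_val = delta, robust_penalty(q, s, delta)
    for _ in range(steps):
        diff = q(s + delta) - q(s)
        grad = 2.0 * diff * grad_q(s + delta)  # gradient of the penalty w.r.t. delta
        delta = np.clip(delta + step_size * np.sign(grad), -eps, eps)  # projection step
        val = robust_penalty(q, s, delta)
        if val > best_val:
            best, best_val = delta, val
    return best


# Toy stand-in for a critic with the action held fixed: Q(s) = tanh(w . s).
_w = np.array([1.0, -2.0, 0.5])


def toy_q(s):
    return float(np.tanh(_w @ s))


def toy_grad_q(s):
    return (1.0 - np.tanh(_w @ s) ** 2) * _w


if __name__ == "__main__":
    s0 = np.array([0.1, 0.2, -0.3])
    d_fgsm = fgsm_perturbation(toy_grad_q, s0, eps=0.1)
    d_pgd = pgd_perturbation(toy_q, toy_grad_q, s0, eps=0.1, steps=20)
    print("FGSM penalty:", robust_penalty(toy_q, s0, d_fgsm))
    print("PGD-20 penalty:", robust_penalty(toy_q, s0, d_pgd))
```

PGD costs one gradient evaluation per step, so PGD-20 is roughly twenty times more expensive per regularizer evaluation than the single FGSM step in this sketch.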

D.4 ROBUST REGULARIZATION WITH DIFFERENT REGULARIZATION WEIGHTS

Figure 8 further visualizes how the regularization weight λ of robust regularization influences the algorithm's performance. Similar to the findings of the experiments on spectral normalization, the performance first increases and then drops as the regularization weight gets larger. This is consistent with our theoretical insight from Theorem B.5: with a small λ, the algorithm receives less regularization and thus has a large value-aware model error, but it also has a small regression error, since the regularization has little effect on the expressive power of the value function. As λ increases, the regularization has a stronger negative effect on the expressive power of the value function, but the value-aware model error also gets smaller. We observe that the algorithm performs best with λ = 0.1, which balances the value-aware model error against the value function's expressive power.



Figure 1: (a) Performance of MBPO algorithm trained with single deterministic, single probabilistic, deterministic ensemble, and probabilistic ensemble model respectively on Humanoid. (b) Model mean squared error. (c) Value-aware model error in log scale. (d) Upper bound of Lipschitz constant of value network. Results are averaged over 8 random seeds.

Figure 2: Model-based value iteration on Inverted-Pendulum across value networks with different Lipschitz constraints. All the results are the median over 5 random seeds.

Figure 4: (Left Two) Robust Regularization with a comparison of uniform random noise on Walker. (Right Two) Spectral Normalization with different spectral radius β on Walker.

Dynamics ensemble in model-based reinforcement learning. The dynamics model ensemble was first introduced in MBRL by Kurutach et al. (2018) to prevent the learned policy from exploiting regions with insufficient data. Chua et al. (2018) then proposed a probabilistic dynamics ensemble to capture both aleatoric and epistemic uncertainty, achieving a significant performance improvement over the deterministic dynamics ensemble. The probabilistic dynamics ensemble is now widely used in MBRL methods (e.g., Lai et al. (2020); Clavera et al. (2020); D'Oro & Jaśkowski (2020); Lai et al. (2021); Froehlich et al. (2022); Li et al. (

Figure 6: Spectral Normalization and Robust Regularization with a single probabilistic model on Ant. RR is short for Robust Regularization, and SN is short for Spectral Normalization.

Figure 7: Robust Regularization with 20 steps of Projected Gradient Descent (PGD-20) against the Fast Gradient Sign Method (FGSM).

Figure 8: Robust Regularization with different regularization weights λ on Walker. All experiments use 8 random seeds.

To upper bound the (local) Lipschitz constant, we constrain the spectral norm of the weight matrix for each layer of the value network.
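The layer-wise constraint can be sketched in a few lines. The example below is a NumPy stand-in, not the paper's PyTorch code: `clip_spectral_norm` mirrors the power-iteration and projection steps described in Appendix C, and `lipschitz_upper_bound` uses the standard fact that, for an MLP with 1-Lipschitz activations such as ReLU, the product of the layers' spectral norms upper-bounds the network's Lipschitz constant. Both function names and the `tiny` stabilizer are assumptions introduced for this illustration.

```python
import numpy as np


def clip_spectral_norm(W, u, beta, n_iters=1, tiny=1e-12):
    """Estimate the spectral norm of W by power iteration and clip it to beta.

    W: weight matrix of shape (out_dim, in_dim).
    u: current estimate of the leading right singular vector, shape (in_dim,).
    Returns the projected weights, the updated u, and the norm estimate.
    """
    for _ in range(n_iters):
        v = W @ u
        alpha = np.linalg.norm(v)
        v = v / (alpha + tiny)   # left singular vector estimate
        u = W.T @ v
        rho = np.linalg.norm(u)
        u = u / (rho + tiny)     # right singular vector estimate
    sigma = max(alpha, rho)
    scale = max(1.0, sigma / beta)  # equals 1 when the estimate is already below beta
    return W / scale, u, sigma


def lipschitz_upper_bound(weights):
    """Product of exact layer spectral norms (valid for 1-Lipschitz activations)."""
    return float(np.prod([np.linalg.norm(W, 2) for W in weights]))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.normal(size=(256, 256))
    u = rng.normal(size=256)
    u /= np.linalg.norm(u)
    W_clipped, u, sigma = clip_spectral_norm(W, u, beta=20.0, n_iters=1)
    print("estimate:", sigma, "exact norm after clipping:", np.linalg.norm(W_clipped, 2))
```

A single power-iteration step generally underestimates the largest singular value, so the clipped matrix can have a spectral norm somewhat above β until the running estimate of `u` has converged over successive forward passes.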

Performance of the proposed value function training mechanisms against baselines. Results are averaged over 8 random seeds and shaded regions correspond to the 95% confidence interval among seeds.

Comparison of computational time.

For a fair comparison, we set the rollout horizon during model rollouts to 1 in all environments.

Robust regularization effectively reduces the value-aware model error by constraining the local Lipschitz condition through computed adversarial perturbations. In addition, we see that the global Lipschitz constraint imposed by spectral normalization is too strong. In the two more complicated environments, Ant and Humanoid, spectral normalization has to sacrifice value-aware model error to retain the expressive power of the value function, and therefore does not achieve good performance. However, in the two easier environments, Hopper and HalfCheetah, spectral normalization can still effectively reduce the value-aware model error and achieves good empirical performance.

ACKNOWLEDGEMENT

This work is supported by National Science Foundation NSF-IIS-FAI program, DOD-ONR-Office of Naval Research, DOD Air Force Office of Scientific Research, DOD-DARPA-Defense Advanced Research Projects Agency Guaranteeing AI Robustness against Deception (GARD), Adobe, Capital One and JP Morgan faculty fellowships.

A ADDITIONAL DEFINITIONS AND NOTATION

We start with the definition of the concentrability constant.

Definition A.1. (Concentrability Constant) (Farahmand et al., 2017a) Given $\rho, \nu \in \Delta(\mathcal{S})$, an integer $k > 0$, and an arbitrary sequence of policies $(\pi_i)_{i=1}^{k}$, the distribution $\rho P^{\pi_1} P^{\pi_2} \cdots P^{\pi_k}$ denotes the future state distribution obtained when the state in the first step is distributed according to $\rho$ and the agent follows the sequence of policies $\pi_1, \ldots, \pi_k$. Define:

Here, $\|f\|_{2,\nu}^2 = \int |f(s)|^2 \, d\nu$. The derivative $\frac{d(\rho P^{\pi_1} P^{\pi_2} \cdots P^{\pi_k})}{d\nu}$ is the Radon-Nikodym derivative of two probability measures, which is well-defined up to a set of $\nu$-measure zero if $\rho P^{\pi_1} P^{\pi_2} \cdots P^{\pi_k}$ is absolutely continuous with respect to $\nu$. If it is not absolutely continuous, we set it to be $\infty$. Then, for a constant $0 \le \gamma < 1$, define the discounted weighted average concentrability coefficient as

In addition, we define the empirical norm. Given a collection points {s 1 , .

Given a collection of measurable functions $\mathcal{F}$ from $\mathcal{X}$ to $\mathbb{R}$ and a probability distribution $\mu$ over $\mathcal{X}$, we sample $n$ points $x_1, \ldots, x_n$ i.i.d. from $\mu$. Define

Then we define the Rademacher complexity of

Besides, following Bartlett et al. (2005), we define a sub-root function as a function $\psi$ that is non-negative and non-decreasing, and such that $\psi(r)/\sqrt{r}$ is non-increasing for $r > 0$.

B PROOF OF THE THEOREM

We begin by citing the following theorem.

Theorem B.1. (Bartlett et al., 2005) Let $\mathcal{F}$ be a class of functions with values in the range $[a, b]$, and assume that there are some functional $T : \mathcal{F} \to \mathbb{R}^{+}$ and some constant $B$ such that for every $f \in \mathcal{F}$,

Let $\psi$ be a sub-root function and let $r^* = r^*(\mathcal{F})$ be the fixed point of $\psi$. Assume that for any $r \le r^*$, $\psi$ satisfies

Published as a conference paper at ICLR 2023

In Section 5.2 and 5.3, we demonstrate the effectiveness of robust regularization in controlling the value-aware model error on Walker and also show the limitation of constraining the global Lipschitz constant by spectral normalization. Here, we provide the results of value-aware model error in the remaining four environments. We used the same best hyperparameters reported in Table 2 and Table 3,

Initialize the value function Q0. end for for t = 0 to H - 1 do Update value function using gradient descent with

In this paper, we provide both theoretical and empirical insights into why Lipschitz regularization is crucial in model-based RL algorithms through the lens of value-aware model error. In contrast, the model error vanishes in the model-free setting, so our theoretical insights no longer hold there. However, fitting a value function with a smaller Lipschitz constant may still be beneficial for policy optimization and for the value prediction of out-of-distribution state-action pairs.

So does the improvement shown in Section 5 indeed come from the controlled value-aware model error, which is unique to the model-based setting? We conduct an experiment in the model-free setting, adding our proposed spectral normalization and robust regularization mechanisms, respectively, to model-free SAC with the same hyperparameter settings used in the paper. As shown in Figure 9, adding the Lipschitz regularization mechanisms slightly improves the performance of the model-free SAC algorithm.
Still, the improvement is far more limited than the improvement of the model-based algorithm shown in Figure 3. The results suggest that although some additional aspects of value learning could be affected by Lipschitz regularization, our insights into the value-aware model error for model-based scenarios should still hold. Exploring how Lipschitz regularization impacts other aspects of value learning in the model-free setting is beyond the scope of our work, but it would be an interesting direction for future work.

