ON FEATURE DIVERSITY IN ENERGY-BASED MODELS

Anonymous authors
Paper under double-blind review

Abstract

Energy-based learning is a powerful learning paradigm that encapsulates various discriminative and generative approaches. An energy-based model (EBM) is typically formed of one or more inner models that learn a combination of different features to generate an energy mapping for each input configuration. In this paper, we focus on the diversity of the produced feature set. We extend the probably approximately correct (PAC) theory of EBMs and analyze the effect of redundancy reduction on the performance of EBMs. We derive generalization bounds for various learning contexts, i.e., regression, classification, and implicit regression, with different energy functions, and we show that reducing the redundancy of the feature set consistently decreases the gap between the true and empirical expectation of the energy and can boost the performance of the model.

1. INTRODUCTION

The energy-based learning paradigm was first proposed by Zhu & Mumford (1998) and LeCun et al. (2006) as an alternative to probabilistic graphical models (Koller & Friedman, 2009). As their name suggests, energy-based models (EBMs) map each input 'configuration' to a single scalar, called the 'energy'. In the learning phase, the parameters of the model are optimized by associating the desired configurations with small energy values and the undesired ones with higher energy values (Kumar et al., 2019; Song & Ermon, 2019; Yang et al., 2016). In the inference phase, given an incomplete input configuration, the energy surface is explored to find the remaining variables which yield the lowest energy. EBMs encapsulate solutions to several supervised (LeCun et al., 2006; Fang & Liu, 2016) and unsupervised learning problems (Deng et al., 2020; Bakhtin et al., 2021; Zhao et al., 2020; Xu et al., 2022) and provide a common theoretical framework for many learning models, including traditional discriminative (Zhai et al., 2016; Li et al., 2020) and generative (Zhu & Mumford, 1998; Xie et al., 2017b; Zhao et al., 2017; Che et al., 2020; Khalifa et al., 2021) approaches.

Formally, let us denote the energy function by $E(h, x, y)$, where $h = G_W(x)$ represents the model with parameters $W$ to be optimized during training and $x, y$ are sets of variables. Figure 1 illustrates how classification, regression, and implicit regression can be expressed as EBMs. In Figure 1 (a), a regression scenario is presented. The input $x$, e.g., an image, is transformed using an inner model $G_W(x)$, and its distance to the second input $y$ is computed, yielding the energy function. A valid energy function in this case can be the $L_1$ or the $L_2$ distance. In the binary classification case (Figure 1 (b)), the energy can be defined as $E(h, x, y) = -y G_W(x)$.
In the implicit regression case (Figure 1 (c)), we have two inner models and the energy can be defined as the $L_2$ distance between their outputs, $E(h, x, y) = \frac{1}{2} \|G^{(1)}_W(x) - G^{(2)}_W(y)\|_2^2$. In the inference phase, given an input $x$, the label $y^*$ can be obtained by solving the following optimization problem:

$$y^* = \arg\min_y E(h, x, y). \qquad (1)$$

[Figure 1: Energy-based formulations of (a) regression, with energy $D(G_W(x), y)$; (b) binary classification, with energy $-y G_W(x)$; (c) implicit regression, with energy $\frac{1}{2}\|G_{W_1}(x) - G_{W_2}(y)\|_2^2$.]

An EBM typically relies on an inner model, i.e., $G_W(x)$, to generate the desired energy landscape (LeCun et al., 2006). Depending on the problem at hand, this function can be constructed as a linear projection, a kernel method, or a neural network, and its parameters are optimized in a data-driven manner in the training phase. Formally, $G_W(x)$ can be written as

$$G_W(x) = \sum_{i}^{D} w_i \phi_i(x), \qquad (2)$$

where $\{\phi_1(\cdot), \cdots, \phi_D(\cdot)\}$ is the feature set, which can be hand-crafted, separately trained from unlabeled data (Zhang & LeCun, 2017), or modeled by a neural network and optimized in the training phase of the EBM model (Xie et al., 2016; Yu et al., 2020; Xie et al., 2021). In the rest of the paper, we assume that the inner models $G_W$ defined in the energy-based learning system (Figure 1) are obtained as a weighted sum of different features as expressed in equation 2.

In (Zhang, 2013), it was shown that simply minimizing the empirical energy over the training data does not theoretically guarantee the minimization of the expected value of the true energy. Thus, developing and motivating novel regularization techniques is required (Zhang & LeCun, 2017). We argue that the quality of the feature set $\{\phi_1(\cdot), \cdots, \phi_D(\cdot)\}$ plays a critical role in the overall performance of the global model. In this work, we extend the theoretical analysis of (Zhang, 2013) and focus on the 'diversity' of this set and its effect on the generalization ability of the EBM models.
Intuitively, it is clear that a less correlated set of intermediate representations is richer and thus able to capture more complex patterns in the input. Thus, it is important to avoid redundant features for achieving a better performance. However, a theoretical analysis is missing. We start by quantifying the diversity of a set of feature functions. To this end, we introduce $(\vartheta-\tau)$-diversity:

Definition 1 ($(\vartheta-\tau)$-diversity). A set of feature functions $\{\phi_1(\cdot), \cdots, \phi_D(\cdot)\}$ is called $\vartheta$-diverse if there exists a constant $\vartheta \in \mathbb{R}$ such that for every input $x$ we have

$$\frac{1}{2} \sum_{i \neq j}^{D} (\phi_i(x) - \phi_j(x))^2 \geq \vartheta^2 \qquad (3)$$

with a high probability $\tau$.

Intuitively, if two feature maps $\phi_i(\cdot)$ and $\phi_j(\cdot)$ are non-redundant, they have different outputs for the same input with a high probability. However, if, for example, the features are extracted using a neural network with a ReLU activation function, there is a high probability that some of the features associated with the input will be zero. Thus, defining a lower bound for the pair-wise diversity directly is impractical. Therefore, we quantify diversity as the lower bound over the sum of the pair-wise distances of the feature maps as expressed in equation 3, and $\vartheta$ measures the diversity of a set.

In the machine learning context, diversity has been explored in ensemble learning (Li et al., 2012; Yu et al., 2011; Li et al., 2017), sampling (Derezinski et al., 2019; Bıyık et al., 2019), ranking (Wu et al., 2019; Qin & Zhu, 2013), pruning (Singh et al., 2020; Lee et al., 2020), and neural networks (Xie et al., 2015; Shen et al., 2021). In Xie et al. (2015; 2017a), it was shown theoretically and experimentally that avoiding redundancy over the weights of a neural network using the mutual angles as a diversity measure improves the generalization ability of the model.
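The left-hand side of equation 3 can be evaluated directly on a batch of feature vectors, which gives an empirical view of Definition 1. The following is a minimal sketch, not part of the original analysis: the function names `pairwise_diversity` and `empirical_tau` are hypothetical, and the features are assumed to be available as a NumPy array with one row per input.

```python
import numpy as np

def pairwise_diversity(features):
    """Return 0.5 * sum_{i != j} (phi_i - phi_j)^2 for each row of `features`.

    `features` has shape (n_samples, D); row k holds (phi_1(x_k), ..., phi_D(x_k)).
    """
    features = np.asarray(features, dtype=float)
    # Pairwise differences between features of the same input: shape (n, D, D).
    diff = features[:, :, None] - features[:, None, :]
    # Diagonal entries (i == j) are zero, so summing over all (i, j) is safe.
    return 0.5 * np.sum(diff ** 2, axis=(1, 2))

def empirical_tau(features, theta):
    """Fraction of inputs whose diversity is at least theta^2 (an estimate of tau)."""
    return float(np.mean(pairwise_diversity(features) >= theta ** 2))

# Hypothetical toy batch: the first input has distinct features, the second
# has fully redundant (identical) features.
batch = [[1.0, 0.0, 0.0], [1.0, 1.0, 1.0]]
diversity = pairwise_diversity(batch)   # values 2.0 and 0.0
tau_hat = empirical_tau(batch, 1.0)     # 0.5: one of the two inputs reaches theta^2 = 1
```

For a fixed threshold $\vartheta$, the estimated $\tau$ is simply the fraction of inputs that satisfy equation 3, so a larger $\vartheta$ can only be certified with a smaller or equal $\tau$.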
In this work, we explore a new line of research, where diversity is defined over the feature maps directly, using the $(\vartheta-\tau)$-diversity, in the context of energy-based learning. In (Zhao et al., 2017), a similar idea was empirically explored. A "repelling regularizer" was proposed to force non-redundant or orthogonal feature representations. Moreover, the idea of learning while avoiding redundancy has been used recently in the context of self-supervised learning (Zbontar et al., 2021; Bardes et al., 2021). Reducing redundancy by minimizing the cross-correlation of features learned using a Siamese network (Zbontar et al., 2021) was empirically shown to improve the generalization ability, yet a theoretical analysis to prove this has so far been lacking.

In this paper, we close the gap between empirical experience and theory. We theoretically study the generalization ability of EBMs in different learning contexts, i.e., regression, classification, and implicit regression, and we derive new generalization bounds using the $(\vartheta-\tau)$-diversity, providing theoretical guarantees that avoiding redundancy indeed improves the generalization ability of the model. The contributions of this paper can be summarized as follows:

• We explore a new line of research, where diversity is defined over the features representing the input data and not over the model's parameters. To this end, we introduce $(\vartheta-\tau)$-diversity as a quantification of the diversity of a given feature set.
• We extend the theoretical analysis of (Zhang, 2013) and study the effect of avoiding redundancy of a feature set on the generalization of EBMs (Lemmas 3 to 7 and Theorems 1 to 4).
• We derive bounds for the expectation of the true energy in different learning contexts, i.e., regression, classification, and implicit regression, using different energy functions. Our analysis consistently shows that avoiding redundancy by increasing the diversity of the feature set can boost the performance of an EBM.
2. PAC-LEARNING OF EBMS WITH (ϑ-τ)-DIVERSITY

In this section, we derive a qualitative justification for $(\vartheta-\tau)$-diversity using probably approximately correct (PAC) learning (Valiant, 1984; Mohri et al., 2018; Li et al., 2019). The PAC-based theory for standard EBMs has been established in (Zhang, 2013). First, we start by defining Rademacher complexity:

Definition 2. (Bartlett & Mendelson, 2002; Mohri et al., 2018) For a given dataset with $m$ samples $S = \{x_i, y_i\}_{i=1}^{m}$ from a distribution $\mathcal{D}$ and for a model space $\mathcal{F} : \mathcal{X} \to \mathbb{R}$ with a single dimensional output, the empirical Rademacher complexity $R_m(\mathcal{F})$ of the set $\mathcal{F}$ is defined as follows:

$$R_m(\mathcal{F}) = \mathbb{E}_{\sigma}\Big[\sup_{f \in \mathcal{F}} \frac{1}{m} \sum_{i=1}^{m} \sigma_i f(x_i)\Big],$$

where the Rademacher variables $\sigma = \{\sigma_1, \cdots, \sigma_m\}$ are independent uniform random variables in $\{-1, 1\}$.

Based on the Rademacher complexity (Bartlett & Mendelson, 2002), several learning guarantees for EBMs have been shown (Zhang, 2013). We recall the following two lemmas related to the estimation error and the Rademacher complexity. In Lemma 2, we present the principal PAC-learning bound for energy functions with finite outputs.

Lemma 1. (Wolf, 2018) For $\mathcal{F} \subseteq \mathbb{R}^{\mathcal{X}}$, assume that $g : \mathbb{R} \to \mathbb{R}$ is an $L_g$-Lipschitz continuous function and $\mathcal{A} = \{g \circ f : f \in \mathcal{F}\}$. We have

$$R_m(\mathcal{A}) \leq L_g R_m(\mathcal{F}). \qquad (5)$$

Lemma 2. (Zhang, 2013) For a well-defined energy function $E(h, x, y)$ over hypothesis class $\mathcal{H}$, input set $\mathcal{X}$ and output set $\mathcal{Y}$ (LeCun et al., 2006), the following holds for all $h$ in $\mathcal{H}$ with a probability of at least $1 - \delta$:

$$\mathbb{E}_{(x,y)\sim\mathcal{D}}[E(h, x, y)] \leq \frac{1}{m} \sum_{(x,y)\in S} E(h, x, y) + 2R_m(\mathcal{E}) + M\sqrt{\frac{\log(2/\delta)}{2m}},$$

where $\mathcal{E}$ is the energy function class defined as $\mathcal{E} = \{E(h, x, y) \mid h \in \mathcal{H}\}$, $R_m(\mathcal{E})$ is its Rademacher complexity, and $M$ is the upper bound of $E$.

Lemma 2 provides a generalization bound for EBMs with well-defined (non-negative) and bounded energy.
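To make Definition 2 concrete, the expectation over the Rademacher variables can be approximated by Monte Carlo sampling when the function class is finite. The sketch below is illustrative only: the name `empirical_rademacher`, the number of draws, and the restriction to a finite class are assumptions made here, and the supremum over an infinite class (as in the theory above) is not computed.

```python
import numpy as np

def empirical_rademacher(outputs, n_draws=2000, seed=0):
    """Monte Carlo estimate of E_sigma[ sup_f (1/m) sum_i sigma_i f(x_i) ].

    `outputs` has shape (n_functions, m): row f holds f(x_1), ..., f(x_m)
    for one member of a finite function class.
    """
    outputs = np.asarray(outputs, dtype=float)
    m = outputs.shape[1]
    rng = np.random.default_rng(seed)
    # Each row is one draw of the Rademacher vector sigma in {-1, +1}^m.
    sigma = rng.choice([-1.0, 1.0], size=(n_draws, m))
    # Correlation of every sigma draw with every function: (n_draws, n_functions).
    correlations = sigma @ outputs.T / m
    # Supremum over the class for each draw, then average over the draws.
    return float(np.mean(np.max(correlations, axis=1)))

# Hypothetical class {f, -f} with f constant equal to 1 on m = 10 points.
f = np.ones((1, 10))
estimate = empirical_rademacher(np.vstack([f, -f]))
```

A class that contains a single function has zero Rademacher complexity in expectation, whereas a class that can match the sign pattern of $\sigma$ more closely yields a larger value, which is the sense in which $R_m$ measures the richness of the class.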
The expected energy is bounded using the sum of three terms: the first term is the empirical expectation of energy over the training data, the second term depends on the Rademacher complexity of the energy class, and the third term involves the number of training samples $m$ and the upper bound of the energy function $M$. This shows that merely minimizing the empirical expectation of energy, i.e., the first term, may not yield a good approximation of the true expectation. In (Zhang & LeCun, 2017), it has been shown that regularization using unlabeled data reduces the second and third terms, leading to better generalization. In this work, we express these two terms using the $(\vartheta-\tau)$-diversity and show that employing a diversity strategy may also decrease the gap between the true and empirical expectation of the energy.

In Section 2.1, we consider the special case of regression and derive two bounds for two energy functions based on $L_1$ and $L_2$ distances. In Section 2.2, we derive a bound for the binary classification task using as energy function $E(h, x, y) = -y G_W(x)$ (LeCun et al., 2006). In Section 2.3, we consider the case of implicit regression, which encapsulates different learning problems such as metric learning, generative models, and denoising (LeCun et al., 2006). For this case, we use the $L_2$ distance between the inner models as the energy function. In the rest of the paper, we denote the generalization gap, $\mathbb{E}_{(x,y)\sim\mathcal{D}}[E(h, x, y)] - \frac{1}{m}\sum_{(x,y)\in S} E(h, x, y)$, by $\Delta_{\mathcal{D},S}E$. All the proofs are presented in the supplementary material.

2.1. REGRESSION TASK

Regression can be formulated as an energy-based learning problem (Figure 1 (a)) using the inner model $h(x) = G_W(x) = \sum_{i=1}^{D} w_i \phi_i(x) = w^T \Phi(x)$. We assume that the feature set is positive and well-defined over the input domain $\mathcal{X}$, i.e., $\forall x \in \mathcal{X} : \|\Phi(x)\|_2 \leq A$, the hypothesis class can be defined as follows: $\mathcal{H} = \{h(x) = G_W(x) = \sum_{i=1}^{D} w_i \phi_i(x) = w^T \Phi(x) \mid \Phi \in \mathcal{F}, \forall x : \|\Phi(x)\|_2 \leq A\}$, the output set $\mathcal{Y} \subset \mathbb{R}$ is bounded, i.e., $y \leq B$, and the feature set $\{\phi_1(\cdot), \cdots, \phi_D(\cdot)\}$ is $\vartheta$-diverse with a probability $\tau$. The two valid energy functions which can be used for regression are $E_2(h, x, y) = \frac{1}{2}\|G_W(x) - y\|_2^2$ and $E_1(h, x, y) = \|G_W(x) - y\|_1$ (LeCun et al., 2006). We study these two cases separately and we show theoretically that for both energy functions avoiding redundancy improves the generalization of the EBM model.

ENERGY FUNCTION: E 2

In this subsection, we present our theoretical analysis of the effect of diversity on the generalization ability of an EBM defined with the energy function $E_2(h, x, y) = \frac{1}{2}\|G_W(x) - y\|_2^2$. We start with the following two lemmas, Lemmas 3 and 4.

Lemma 3. With a probability of at least $\tau$, we have

$$\sup_{x,W} |h(x)| \leq \|w\|_\infty \sqrt{DA^2 - \vartheta^2}.$$

Lemma 4. With a probability of at least $\tau$, we have

$$\sup_{x,y,h} |E(h, x, y)| \leq \frac{1}{2}\big(\|w\|_\infty \sqrt{DA^2 - \vartheta^2} + B\big)^2. \qquad (8)$$

Proof. We have $\sup_{x,y,h} |h(x) - y| \leq \sup_{x,y,h} (|h(x)| + |y|) \leq \|w\|_\infty \sqrt{DA^2 - \vartheta^2} + B$. Thus $\sup_{x,y,h} |E(h, x, y)| \leq \frac{1}{2}\big(\|w\|_\infty \sqrt{DA^2 - \vartheta^2} + B\big)^2$.

Lemmas 3 and 4 bound the supremum of the output of the inner model and of the energy function, respectively, as a function of $\vartheta$. As can be seen, both terms are decreasing with respect to diversity. Next, we bound the Rademacher complexity of the energy class, i.e., $R_m(\mathcal{E})$.

Lemma 5. With a probability of at least $\tau$, we have

$$R_m(\mathcal{E}) \leq 2D\|w\|_\infty \big(\|w\|_\infty \sqrt{DA^2 - \vartheta^2} + B\big) R_m(\mathcal{F}).$$

Lemma 5 expresses the bound of the Rademacher complexity of the energy class using the diversity constant and the Rademacher complexity of the features. Having expressed the different terms of Lemma 2 using diversity, we now present our main result for an energy-based model defined using $E_2$. The main result is presented in Theorem 1.

Theorem 1. For the energy function $E(h, x, y) = \frac{1}{2}\|G_W(x) - y\|_2^2$, over the input set $\mathcal{X} \subseteq \mathbb{R}^N$, hypothesis class $\mathcal{H} = \{h(x) = G_W(x) = \sum_{i=1}^{D} w_i \phi_i(x) = w^T \Phi(x) \mid \Phi \in \mathcal{F}, \forall x : \|\Phi(x)\|_2 \leq A\}$, and output set $\mathcal{Y} \subset \mathbb{R}$, if the feature set $\{\phi_1(\cdot), \cdots, \phi_D(\cdot)\}$ is $\vartheta$-diverse with a probability $\tau$, then with a probability of at least $(1 - \delta)\tau$, the following holds for all $h$ in $\mathcal{H}$:

$$\Delta_{\mathcal{D},S}E \leq 4D\|w\|_\infty \big(\|w\|_\infty \sqrt{DA^2 - \vartheta^2} + B\big) R_m(\mathcal{F}) + \frac{1}{2}\big(\|w\|_\infty \sqrt{DA^2 - \vartheta^2} + B\big)^2 \sqrt{\frac{\log(2/\delta)}{2m}},$$

where $B$ is the upper bound of $\mathcal{Y}$, i.e., $y \leq B, \forall y \in \mathcal{Y}$.

Theorem 1 expresses the special case of Lemma 2 using the $(\vartheta-\tau)$-diversity of the feature set $\{\phi_1(\cdot), \cdots, \phi_D(\cdot)\}$.
As can be seen, the bound on the generalization error decreases as $\vartheta^2$ increases. This theoretically shows that reducing redundancy, i.e., increasing $\vartheta$, reduces the gap between the true and the empirical energies and improves the generalization performance of the EBMs.
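The monotone dependence on $\vartheta$ can be checked by evaluating the right-hand side of the Theorem 1 bound for increasing diversity values. This is a minimal sketch under assumed inputs: the function name `theorem1_gap_bound` and all numerical values are hypothetical, and the Rademacher complexity of the feature class, $R_m(\mathcal{F})$, is passed in as a given number rather than computed.

```python
import math

def theorem1_gap_bound(D, A, theta, w_inf, B, rad_F, m, delta):
    """Right-hand side of the Theorem 1 bound on the generalization gap."""
    slack = D * A ** 2 - theta ** 2
    if slack < 0:
        # theta^2 is bounded by D * A^2 (see the derivation of Lemma 3).
        raise ValueError("theta^2 cannot exceed D * A^2")
    # Bound on sup |h(x) - y|, as in the proof of Lemma 4.
    sup_residual = w_inf * math.sqrt(slack) + B
    complexity_term = 4 * D * w_inf * sup_residual * rad_F
    confidence_term = 0.5 * sup_residual ** 2 * math.sqrt(math.log(2 / delta) / (2 * m))
    return complexity_term + confidence_term

# Hypothetical setting: D = 4 features with norm bound A = 1.5, so D * A^2 = 9.
bounds = [
    theorem1_gap_bound(D=4, A=1.5, theta=t, w_inf=1.0, B=1.0, rad_F=0.05, m=1000, delta=0.05)
    for t in (0.0, 1.0, 2.0, 3.0)
]
# The list is strictly decreasing: higher diversity gives a smaller bound.
```

With $A$ held fixed, both the complexity term and the confidence term shrink as $\vartheta$ grows, because they depend on the diversity only through $\sqrt{DA^2 - \vartheta^2}$.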

ENERGY FUNCTION: E 1

In this subsection, we consider the second case of regression using the energy function $E_1(h, x, y) = \|G_W(x) - y\|_1$. Similar to the previous case, we start by deriving bounds for the energy function and the Rademacher complexity of the class using diversity in Lemmas 6 and 7.

Lemma 6. With a probability of at least $\tau$, we have

$$\sup_{x,y,h} |E(h, x, y)| \leq \|w\|_\infty \sqrt{DA^2 - \vartheta^2} + B.$$

Lemma 7. With a probability of at least $\tau$, we have

$$R_m(\mathcal{E}) \leq 2D\|w\|_\infty R_m(\mathcal{F}).$$

Next, we derive the main result on the generalization of EBMs defined using the energy function $E_1$. The main finding is presented in Theorem 2.

Theorem 2. For the energy function $E(h, x, y) = \|G_W(x) - y\|_1$, over the input set $\mathcal{X} \subseteq \mathbb{R}^N$, hypothesis class $\mathcal{H} = \{h(x) = G_W(x) = \sum_{i=1}^{D} w_i \phi_i(x) = w^T \Phi(x) \mid \Phi \in \mathcal{F}, \forall x : \|\Phi(x)\|_2 \leq A\}$, and output set $\mathcal{Y} \subset \mathbb{R}$, if the feature set $\{\phi_1(\cdot), \cdots, \phi_D(\cdot)\}$ is $\vartheta$-diverse with a probability $\tau$, then with a probability of at least $(1 - \delta)\tau$, the following holds for all $h$ in $\mathcal{H}$:

$$\Delta_{\mathcal{D},S}E \leq 4D\|w\|_\infty R_m(\mathcal{F}) + \big(\|w\|_\infty \sqrt{DA^2 - \vartheta^2} + B\big) \sqrt{\frac{\log(2/\delta)}{2m}},$$

where $B$ is the upper bound of $\mathcal{Y}$, i.e., $y \leq B, \forall y \in \mathcal{Y}$.

Similar to Theorem 1, in Theorem 2 we consistently find that the bound of the true expectation of the energy is a decreasing function with respect to $\vartheta$. This shows that, for the regression task, reducing redundancy can improve the generalization performance of the energy-based model.

2.2. BINARY CLASSIFIER

Here, we consider the problem of binary classification, as illustrated in Figure 1 (b). Using the same assumption as in regression for the inner model, i.e., $h(x) = G_W(x) = \sum_{i=1}^{D} w_i \phi_i(x) = w^T \Phi(x)$, the energy function $E(h, x, y) = -y G_W(x)$ (LeCun et al., 2006), and the $(\vartheta-\tau)$-diversity of the feature set, we express Lemma 2 for this specific configuration in Theorem 3.

Theorem 3. For the energy function $E(h, x, y) = -y G_W(x)$, over the input set $\mathcal{X} \subseteq \mathbb{R}^N$, hypothesis class $\mathcal{H} = \{h(x) = G_W(x) = \sum_{i=1}^{D} w_i \phi_i(x) = w^T \Phi(x) \mid \Phi \in \mathcal{F}, \forall x : \|\Phi(x)\|_2 \leq A\}$, and output set $\mathcal{Y} \subset \mathbb{R}$, if the feature set $\{\phi_1(\cdot), \cdots, \phi_D(\cdot)\}$ is $\vartheta$-diverse with a probability $\tau$, then with a probability of at least $(1 - \delta)\tau$, the following holds for all $h$ in $\mathcal{H}$:

$$\Delta_{\mathcal{D},S}E \leq 4D\|w\|_\infty R_m(\mathcal{F}) + \|w\|_\infty \sqrt{DA^2 - \vartheta^2} \sqrt{\frac{\log(2/\delta)}{2m}}.$$

Similar to the regression task, we note that the upper bound of the true expectation is a decreasing function with respect to the diversity term. Thus, a less redundant feature set, i.e., higher $\vartheta$, has a lower upper bound for the true energy.

2.3. IMPLICIT REGRESSION

In this section, we consider the problem of implicit regression. This is a general formulation of a varied set of problems such as metric learning, where the goal is to learn a distance function between two domains, image denoising, object detection as illustrated in (LeCun et al., 2006), or self-supervised learning (Zbontar et al., 2021). This form of EBM (Figure 1 (c)) has two inner models, $G^{(1)}_W(\cdot)$ and $G^{(2)}_W(\cdot)$, which can be equal or different according to the problem at hand. Here, we consider the general case, where the two models correspond to two different combinations of different features, i.e., $G^{(1)}_W(x) = \sum_{i=1}^{D^{(1)}} w^{(1)}_i \phi^{(1)}_i(x)$ and $G^{(2)}_W(y) = \sum_{i=1}^{D^{(2)}} w^{(2)}_i \phi^{(2)}_i(y)$. Thus, we have a different $(\vartheta-\tau)$-diversity term for each set. The final result is presented in Theorem 4.

Theorem 4. For the energy function $E(h, x, y) = \frac{1}{2}\|G^{(1)}_W(x) - G^{(2)}_W(y)\|_2^2$, over the input set $\mathcal{X} \subseteq \mathbb{R}^N$, hypothesis class $\mathcal{H} = \{h^{(1)}(x) = G^{(1)}_W(x) = \sum_{i=1}^{D^{(1)}} w^{(1)}_i \phi^{(1)}_i(x) = w^{(1)T} \Phi^{(1)}(x),\ h^{(2)}(y) = G^{(2)}_W(y) = \sum_{i=1}^{D^{(2)}} w^{(2)}_i \phi^{(2)}_i(y) = w^{(2)T} \Phi^{(2)}(y) \mid \Phi^{(1)} \in \mathcal{F}_1, \Phi^{(2)} \in \mathcal{F}_2, \forall x : \|\Phi^{(1)}(x)\|_2 \leq A^{(1)}, \forall y : \|\Phi^{(2)}(y)\|_2 \leq A^{(2)}\}$, and output set $\mathcal{Y} \subseteq \mathbb{R}^N$, if the feature set $\{\phi^{(1)}_1(\cdot), \cdots, \phi^{(1)}_{D^{(1)}}(\cdot)\}$ is $\vartheta^{(1)}$-diverse with a probability $\tau_1$ and the feature set $\{\phi^{(2)}_1(\cdot), \cdots, \phi^{(2)}_{D^{(2)}}(\cdot)\}$ is $\vartheta^{(2)}$-diverse with a probability $\tau_2$, then with a probability of at least $(1 - \delta)\tau_1\tau_2$, the following holds for all $h$ in $\mathcal{H}$:

$$\Delta_{\mathcal{D},S}E \leq 8\big(\sqrt{J_1} + \sqrt{J_2}\big)\Big(D^{(1)}\|w^{(1)}\|_\infty R_m(\mathcal{F}_1) + D^{(2)}\|w^{(2)}\|_\infty R_m(\mathcal{F}_2)\Big) + (J_1 + J_2)\sqrt{\frac{\log(2/\delta)}{2m}},$$

where $J_1 = \|w^{(1)}\|_\infty^2 \big(D^{(1)} (A^{(1)})^2 - (\vartheta^{(1)})^2\big)$ and $J_2 = \|w^{(2)}\|_\infty^2 \big(D^{(2)} (A^{(2)})^2 - (\vartheta^{(2)})^2\big)$.

The upper bound of the energy model depends on the diversity variable of both feature sets.
Moreover, we note that in the bound for implicit regression the last term decreases linearly in $\vartheta^2$ through $J_1$ and $J_2$, as opposed to the classification case, for example, where the bound depends on the diversity only through $\sqrt{DA^2 - \vartheta^2}$. Thus, we can conclude that reducing redundancy improves the generalization of EBMs in the implicit regression context.

2.4. GENERAL DISCUSSION

We note that the theory developed in our paper (Theorems 1 to 4) is agnostic to the loss function (LeCun et al., 2006) or the optimization strategy used (Kumar et al., 2019; Song & Ermon, 2019; Yu et al., 2020; Xu et al., 2022). We show that reducing the redundancy of the features consistently decreases the upper bound of the true expectation of the energy and, thus, can boost the generalization performance of the energy-based model. It should also be noted that $A$, i.e., the upper bound of the features, and $\vartheta$ are connected. Nevertheless, our findings can be interpreted as follows: given two models with the same value of $A$ (maximum $L_2$ norm of the features), the model with higher diversity $\vartheta$ has a lower generalization bound and is likely to generalize better.

We note that our analysis is independent of how the features are obtained, e.g., handcrafted or optimized. In fact, in the recent state-of-the-art EBMs (Khalifa et al., 2021; Bakhtin et al., 2021; Yu et al., 2020), the features are typically parameterized using a deep learning model and optimized during training. Our contribution is twofold. First, we provide theoretical guarantees that reducing redundancy in the feature space can indeed improve the generalization of the EBM. This can pave the way toward providing theoretical guarantees for works on self-supervised learning using redundancy reduction (Zbontar et al., 2021; Bardes et al., 2021; Zhao et al., 2017). Second, our theory can be used to motivate novel redundancy reduction strategies, for example, in the form of regularization, to avoid learning redundant features. Such strategies can improve the performance of the model and improve generalization.

3. SIMPLE REGULARIZATION ALGORITHM

In general, theoretical generalization bounds can be too loose to have direct practical implications (Zhang et al., 2017; Neyshabur et al., 2017). However, they typically suggest a regularizer to promote some desired aspects of the hypothesis class (Xie et al., 2015; Li et al., 2019; Kawaguchi et al., 2017). Accordingly, inspired by the theoretical analysis in Section 2, we propose a straightforward regularizer that augments the original training loss $L$ with a term promoting feature diversity:

$$L_{aug} = L - \beta \sum_{x \in S} \sum_{i \neq j}^{D} (\phi_i(x) - \phi_j(x))^2, \qquad (16)$$

where $\beta$ is a hyper-parameter controlling the contribution of the second term in the total loss. The additional term penalizes the similarities between the distinct features, encouraging a diverse and non-redundant mapping of the data. As a result, this can improve the general performance of the model.
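The value of the augmented loss in equation 16 can be computed without forming all pairwise differences explicitly. The sketch below is an illustration under stated assumptions rather than the authors' implementation: the names `diversity_term` and `augmented_loss` are hypothetical, and NumPy only evaluates the loss value; actual training would use the same expression inside an automatic differentiation framework.

```python
import numpy as np

def diversity_term(features):
    """sum_x sum_{i != j} (phi_i(x) - phi_j(x))^2 for a batch of shape (n, D).

    Uses sum_{i != j} (a_i - a_j)^2 = 2 * D * sum_i a_i^2 - 2 * (sum_i a_i)^2,
    which avoids building the (n, D, D) tensor of pairwise differences.
    """
    features = np.asarray(features, dtype=float)
    D = features.shape[1]
    sq_norms = np.sum(features ** 2, axis=1)    # sum_i phi_i(x)^2 per input
    sq_sums = np.sum(features, axis=1) ** 2     # (sum_i phi_i(x))^2 per input
    return float(np.sum(2.0 * D * sq_norms - 2.0 * sq_sums))

def augmented_loss(base_loss, features, beta):
    """L_aug = L - beta * diversity_term, following equation 16."""
    return base_loss - beta * diversity_term(features)

# Hypothetical batch of two inputs with D = 3 features each.
batch = np.array([[1.0, 0.0, 0.0], [0.5, 2.0, -1.0]])
loss_value = augmented_loss(base_loss=1.0, features=batch, beta=1e-3)
```

Because the diversity term is subtracted, minimizing $L_{aug}$ pushes the pairwise feature distances up, and $\beta$ has to be small enough that this unbounded term does not dominate the original loss.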

3.1. TOY EXAMPLE

We first test our regularization strategy using toy data. We use an EBM to learn the distribution of a 2-D Swiss roll illustrated in Figure 2 (a). For the EBM, we use a fully connected neural network composed of two intermediate layers with 1000 units and ReLU activations. We train the models using Stochastic Gradient Langevin Dynamics (SGLD) sampling and the contrastive divergence-like algorithm proposed in (Du & Mordatch, 2019). The total objective of the standard EBM is expressed as follows:

$$L = \frac{1}{N} \sum_{n} \Big[\alpha\big(E(x^+_n)^2 + E(x^-_n)^2\big) + E(x^+_n) - E(x^-_n)\Big],$$

where $x^+_n$ denote positive samples and $x^-_n$ negative samples. We augment this loss using equation 16, where the features are the latent representations obtained at the last intermediate layer. The distributions learned using both the standard and the proposed approach are illustrated using kernel density estimation (Terrell & Scott, 1992) in Figure 2. As can be seen, avoiding redundancy boosts the performance of the EBM. Indeed, comparing the two learned distributions, the EBM trained with our approach led to a better approximation of the ground-truth distribution and was able to better capture the tail of the distribution than the original EBM.
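The standard objective above can be written as a short function of the energies assigned to positive and negative samples. This is a minimal sketch assuming the energies have already been computed by the network; the name `contrastive_ebm_loss` and the example values are hypothetical, and the SGLD sampling that produces the negative samples is not shown.

```python
import numpy as np

def contrastive_ebm_loss(energy_pos, energy_neg, alpha):
    """Mean over n of alpha * (E(x+_n)^2 + E(x-_n)^2) + E(x+_n) - E(x-_n)."""
    energy_pos = np.asarray(energy_pos, dtype=float)
    energy_neg = np.asarray(energy_neg, dtype=float)
    # Quadratic term keeps the energy magnitudes from growing without bound.
    magnitude_penalty = alpha * (energy_pos ** 2 + energy_neg ** 2)
    # Contrastive term lowers the energy of data and raises it for samples.
    contrast = energy_pos - energy_neg
    return float(np.mean(magnitude_penalty + contrast))

# Hypothetical energies for two positive and two negative samples.
loss = contrastive_ebm_loss(energy_pos=[1.0, -1.0], energy_neg=[2.0, 0.0], alpha=0.5)
# Per-sample values are 1.5 and -0.5, so the mean is 0.5.
```

The diversity regularizer of equation 16 is then subtracted from this value, with the features taken from the last intermediate layer as described above.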

3.2. IMAGE GENERATION EXAMPLE

Recently, there has been a high interest in using EBMs to solve image/text generation tasks (Du & Mordatch, 2019; Du et al., 2021; Khalifa et al., 2021; Deng et al., 2020). In this subsection, we validate the proposed regularizer on the simple example of MNIST digit image generation, as in (Du & Mordatch, 2019). For the EBM, we use a simple CNN model composed of four convolutional layers followed by a linear layer. The training protocol is the same as in (UvA; Du & Mordatch, 2019), i.e., using Langevin dynamics Markov chain Monte Carlo (MCMC) and a sampling buffer to accelerate training. The full details are available in the supplementary material. In this example, the features, i.e., the latent representation obtained at the last intermediate layer, are learned in an end-to-end way. We evaluate the performance of our approach by augmenting the contrastive divergence loss using equation 16 to penalize the feature redundancy.

Table 1: FID scores and negative log-likelihood (NLL) loss of different approaches for the generation of MNIST images. Each experiment was performed three times with different random seeds; the results are reported as the mean/SEM over these runs.

We quantitatively evaluate the image quality of EBMs with the 'Fréchet Inception Distance' (FID) score (Heusel et al., 2017) and the negative log-likelihood (NLL) loss in Table 1 for different values of $\beta$. We note that we obtain consistently better FID and NLL scores by penalizing the similarity of the learned features. The best performance is achieved by $\beta = 1e-13$, which improves the FID by more than 10% compared to the original EBM. To gain insights into the visual performance of our approach, we plot a few intermediate samples of the MCMC sampling (Langevin dynamics). The results obtained by the EBM with $\beta = 1e-13$ are presented in Figure 3. Starting from random noise, MCMC obtains reasonable figures after only 64 steps.
The digits get clearer and more realistic over the iterations. More results are presented in the supplementary material.

3.3. CONTINUAL LEARNING EXAMPLE

In this subsection, we validate the proposed regularizer on the Continual Learning (CL) problem. CL tackles the problem of catastrophic forgetting in deep learning models (Parisi et al., 2019; Li & Hoiem, 2017; Shibata et al., 2021). Its main goal is to solve several tasks sequentially without forgetting knowledge learned from the past. Thus, a continual learner is expected to learn a new task without forgetting previous tasks. Recently, an EBM-based CL approach was proposed in (Li et al., 2020) and led to superior results compared to standard approaches. We use the same models and the same experimental protocol as in (Li et al., 2020). However, here we focus only on the class-incremental learning task using CIFAR10 and CIFAR100.

We evaluate the performance of our proposed regularizer using both the boundary-aware and boundary-agnostic settings. As defined in (Li et al., 2020), the boundary-aware setting refers to the situation where the sequence of tasks has explicit separations between them which are known to the model. The boundary-agnostic setting refers to the situation where the data distribution gradually changes without a notion of task boundaries. Similar to Section 3.2, we consider as 'features' the representation obtained by the last intermediate layer. The proposed regularizer is applied on top of this representation.

In Table 2, we report the performance of the EBM trained using the original loss and using the loss augmented with our additional term for different values of $\beta$. As shown in Table 2, penalizing feature similarity and promoting the diversity of the feature set boosts the performance of the EBM and consistently leads to a superior accuracy for both datasets. In Figure 4, we display the accumulated classification accuracy, averaged over tasks, on the test set. Along the five tasks, our approach maintains higher classification accuracy than the standard EBM for both the boundary-aware and boundary-agnostic settings.

4. CONCLUSION

Energy-based learning is a powerful learning paradigm that encapsulates various discriminative and generative systems. An EBM is typically formed of one (or many) inner models which learn a combination of different features to generate an energy mapping for each input configuration. In this paper, we introduced a feature diversity concept, i.e., $(\vartheta-\tau)$-diversity, and we used it to extend the PAC theory of EBMs. We derived different generalization bounds for various learning contexts, i.e., regression, classification, and implicit regression, with different energy functions, and we consistently found that reducing the redundancy of the feature set can improve the generalization error of energy-based approaches. We also note that our theory is independent of the loss function or the training strategy used to optimize the parameters of the EBM. This provides theoretical guarantees on learning via feature redundancy reduction. Our preliminary experimental results confirm that this is indeed a promising research direction and can motivate developing other approaches to promote the diversity of the feature set. Future directions include a more extensive experimental evaluation of different feature redundancy reduction approaches.

APPENDIX A PROOF OF LEMMA 3

Lemma. With a probability of at least $\tau$, we have

$$\sup_{x,W} |h(x)| \leq \|w\|_\infty \sqrt{DA^2 - \vartheta^2},$$

where $A = \sup_x \|\Phi(x)\|_2$.

Proof.

$$h^2(x) = \Big(\sum_{i=1}^{D} w_i \phi_i(x)\Big)^2 \leq \Big(\sum_{i=1}^{D} \|w\|_\infty \phi_i(x)\Big)^2 = \|w\|_\infty^2 \Big(\sum_{i=1}^{D} \phi_i(x)\Big)^2 = \|w\|_\infty^2 \sum_{i,j} \phi_i(x)\phi_j(x) = \|w\|_\infty^2 \Big(\sum_{i} \phi_i(x)^2 + \sum_{i \neq j} \phi_i(x)\phi_j(x)\Big). \qquad (19)$$

We have $\|\Phi(x)\|_2 \leq A$. For the first term in equation 19, we have $\sum_m \phi_m(x)^2 \leq A^2$. By using the identity $\phi_m(x)\phi_n(x) = \frac{1}{2}\big(\phi_m(x)^2 + \phi_n(x)^2 - (\phi_m(x) - \phi_n(x))^2\big)$, the second term can be rewritten as

$$\sum_{m \neq n} \phi_m(x)\phi_n(x) = \frac{1}{2} \sum_{m \neq n} \Big(\phi_m(x)^2 + \phi_n(x)^2 - \big(\phi_m(x) - \phi_n(x)\big)^2\Big).$$

In addition, we have with a probability $\tau$, $\frac{1}{2}\sum_{m \neq n} (\phi_m(x) - \phi_n(x))^2 \geq \vartheta^2$. Thus, we have with a probability of at least $\tau$:

$$\sum_{m \neq n} \phi_m(x)\phi_n(x) \leq \frac{1}{2}\big(2(D-1)A^2 - 2\vartheta^2\big) = (D-1)A^2 - \vartheta^2.$$

By putting everything back into equation 19, we have with a probability $\tau$,

$$G_W^2(x) \leq \|w\|_\infty^2 \big(A^2 + (D-1)A^2 - \vartheta^2\big) = \|w\|_\infty^2 (DA^2 - \vartheta^2).$$

Thus, with a probability $\tau$,

$$\sup_{x,W} |h(x)| \leq \sqrt{\sup_{x,W} G_W^2(x)} \leq \|w\|_\infty \sqrt{DA^2 - \vartheta^2}.$$

B PROOF OF LEMMA 4

Lemma. With a probability of at least $\tau$, we have

$$\sup_{x,y,h} |E(h, x, y)| \leq \frac{1}{2}\big(\|w\|_\infty \sqrt{DA^2 - \vartheta^2} + B\big)^2.$$

Proof. We have $\sup_{x,y,h} |h(x) - y| \leq \sup_{x,y,h} (|h(x)| + |y|) \leq \|w\|_\infty \sqrt{DA^2 - \vartheta^2} + B$. Thus $\sup_{x,y,h} |E(h, x, y)| \leq \frac{1}{2}\big(\|w\|_\infty \sqrt{DA^2 - \vartheta^2} + B\big)^2$.

C PROOF OF LEMMA 5

Lemma. With a probability of at least $\tau$, we have

$$R_m(\mathcal{E}) \leq 2D\|w\|_\infty \big(\|w\|_\infty \sqrt{DA^2 - \vartheta^2} + B\big) R_m(\mathcal{F}).$$

Proof. Using the decomposition property of the Rademacher complexity (if $\phi$ is an $L$-Lipschitz function, then $R_m(\phi(\mathcal{A})) \leq L R_m(\mathcal{A})$) and given that $\frac{1}{2}\|\cdot - y\|^2$ is $K$-Lipschitz with a constant $K = \sup_{x,y,h} \|h(x) - y\| \leq \|w\|_\infty \sqrt{DA^2 - \vartheta^2} + B$, we have $R_m(\mathcal{E}) \leq K R_m(\mathcal{H}) \leq \big(\|w\|_\infty \sqrt{DA^2 - \vartheta^2} + B\big) R_m(\mathcal{H})$, where $\mathcal{H} = \{G_W(x) = \sum_{i=1}^{D} w_i \phi_i(x)\}$. We also know that $\|w\|_1 \leq D\|w\|_\infty$. Next, similar to the proof of Theorem 2.10 in (Wolf, 2018), we note that $\sum_{i=1}^{D} w_i \phi_i(x) \in (D\|w\|_\infty)\,\mathrm{conv}(\mathcal{F} + (-\mathcal{F})) := \mathcal{G}$, where conv denotes the convex hull and $\mathcal{F}$ is the set of $\phi$ functions. Thus,

$$R_m(\mathcal{H}) \leq R_m(\mathcal{G}) = D\|w\|_\infty R_m(\mathrm{conv}(\mathcal{F} + (-\mathcal{F}))) = D\|w\|_\infty R_m(\mathcal{F} + (-\mathcal{F})) = 2D\|w\|_\infty R_m(\mathcal{F}).$$
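As a numerical sanity check of the Lemma 3 derivation in Appendix A, the chain of inequalities can be evaluated per input by taking $A$ to be the norm of that input's feature vector and $\vartheta^2$ to be its exact pairwise diversity. The following sketch relies on assumptions made here for illustration: the features are drawn uniformly from $[0, 1]$ (non-negative, as assumed in Section 2.1), the weights are Gaussian, and the variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
D, n = 8, 1000
features = rng.uniform(0.0, 1.0, size=(n, D))   # non-negative features phi_i(x)
w = rng.normal(size=D)                           # arbitrary weight vector

h = features @ w                                 # h(x) = sum_i w_i phi_i(x)
w_inf = np.max(np.abs(w))                        # ||w||_inf
sq_norm = np.sum(features ** 2, axis=1)          # ||Phi(x)||_2^2, plays the role of A^2
# 0.5 * sum_{i != j} (phi_i - phi_j)^2 = D * sum_i phi_i^2 - (sum_i phi_i)^2
diversity = D * sq_norm - np.sum(features, axis=1) ** 2
# Per-input version of the bound ||w||_inf^2 * (D * A^2 - theta^2).
bound = w_inf ** 2 * (D * sq_norm - diversity)

lemma3_holds = bool(np.all(h ** 2 <= bound + 1e-9))
```

In this per-input form, $D\|\Phi(x)\|_2^2$ minus the diversity equals $(\sum_i \phi_i(x))^2$ exactly, so the only slack comes from replacing each $w_i$ by $\|w\|_\infty$; the statement of Lemma 3 then follows by bounding the norm by $A$ and the diversity by $\vartheta^2$.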

D PROOF OF THEOREM 1

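Theorem. For the energy function $E(h, x, y) = \frac{1}{2}\|G_W(x) - y\|_2^2$, over the input set $\mathcal{X} \subseteq \mathbb{R}^N$, hypothesis class $\mathcal{H} = \{h(x) = G_W(x) = \sum_{i=1}^{D} w_i \phi_i(x) = w^T \Phi(x) \mid \Phi \in \mathcal{F}, \forall x : \|\Phi(x)\|_2 \leq A\}$, and output set $\mathcal{Y} \subset \mathbb{R}$, if the feature set $\{\phi_1(\cdot), \cdots, \phi_D(\cdot)\}$ is $\vartheta$-diverse with a probability $\tau$, then with a probability of at least $(1 - \delta)\tau$, the following holds for all $h$ in $\mathcal{H}$:

$$\mathbb{E}_{(x,y)\sim\mathcal{D}}[E(h, x, y)] \leq \frac{1}{m}\sum_{(x,y)\in S} E(h, x, y) + 4D\|w\|_\infty \big(\|w\|_\infty \sqrt{DA^2 - \vartheta^2} + B\big) R_m(\mathcal{F}) + \frac{1}{2}\big(\|w\|_\infty \sqrt{DA^2 - \vartheta^2} + B\big)^2 \sqrt{\frac{\log(2/\delta)}{2m}},$$

where $B$ is the upper bound of $\mathcal{Y}$, i.e., $y \leq B, \forall y \in \mathcal{Y}$.

Proof. We replace the variables in Lemma 2 using Lemma 4 and Lemma 5.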

E PROOF OF LEMMA 6

Lemma. With a probability of at least $\tau$, we have

$$\sup_{x,y,h} |E(h, x, y)| \leq \|w\|_\infty \sqrt{DA^2 - \vartheta^2} + B.$$

Proof. We have $\sup_{x,y,h} |h(x) - y| \leq \sup_{x,y,h} (|h(x)| + |y|) \leq \|w\|_\infty \sqrt{DA^2 - \vartheta^2} + B$.

F PROOF OF LEMMA 7

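Lemma. With a probability of at least $\tau$, we have

$$R_m(\mathcal{E}) \leq 2D\|w\|_\infty R_m(\mathcal{F}). \qquad (28)$$

Proof. $|\cdot|$ is 1-Lipschitz. Thus $R_m(\mathcal{E}) \leq R_m(\mathcal{H})$. Combining this with $R_m(\mathcal{H}) \leq 2D\|w\|_\infty R_m(\mathcal{F})$, established in the proof of Lemma 5, gives the claim.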

G PROOF OF THEOREM 2

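Theorem For the energy function $E(h,x,y) = \|G_W(x) - y\|_1$, over the input set $\mathcal{X} \subset \mathbb{R}^N$, hypothesis class $\mathcal{H} = \{h(x) = G_W(x) = \sum_{i=1}^{D} w_i \phi_i(x) = w^T \Phi(x) \mid \Phi \in \mathcal{F},\ \forall x: \|\Phi(x)\|_2 \le A\}$, and output set $\mathcal{Y} \subset \mathbb{R}$, if the feature set $\{\phi_1(\cdot), \cdots, \phi_D(\cdot)\}$ is $\vartheta$-diverse with a probability $\tau$, then with a probability of at least $(1-\delta)\tau$, the following holds for all $h$ in $\mathcal{H}$:

$$\mathbb{E}_{(x,y)\sim\mathcal{D}}[E(h,x,y)] \le \frac{1}{m} \sum_{(x,y)\in S} E(h,x,y) + 4D\|w\|_\infty \mathcal{R}_m(\mathcal{F}) + \big(\|w\|_\infty \sqrt{DA^2 - \vartheta^2} + B\big) \sqrt{\frac{\log(2/\delta)}{2m}},$$

where $B$ is the upper bound of $\mathcal{Y}$, i.e., $|y| \le B,\ \forall y \in \mathcal{Y}$.

Proof. We replace the variables in Lemma 1 using Lemma 6 and Lemma 7.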

H PROOF OF THEOREM 3

Lemma 8. With a probability of at least $\tau$, we have

$$\sup_{x,y,h} |E(h,x,y)| \le \|w\|_\infty \sqrt{DA^2 - \vartheta^2}.$$

Proof. We have $\sup |-y\,G_W(x)| \le \sup |G_W(x)| \le \|w\|_\infty \sqrt{DA^2 - \vartheta^2}$.

Lemma 9. With a probability of at least $\tau$, we have

$$\mathcal{R}_m(\mathcal{E}) \le 2D\|w\|_\infty \mathcal{R}_m(\mathcal{F}).$$

Proof. We note that for $y \in \{-1, 1\}$, $\sigma$ and $-y\sigma$ follow the same distribution. Thus, we have $\mathcal{R}_m(\mathcal{E}) = \mathcal{R}_m(\mathcal{H})$. Next, we note that $\mathcal{R}_m(\mathcal{H}) \le 2D\|w\|_\infty \mathcal{R}_m(\mathcal{F})$.

Theorem 3 For a well-defined energy function $E(h,x,y)$ (LeCun et al., 2006), over hypothesis class $\mathcal{H}$, input set $\mathcal{X}$ and output set $\mathcal{Y}$, if it has upper-bound $M$, then with a probability of at least $1 - \delta$, the following holds for all $h$ in $\mathcal{H}$:

$$\mathbb{E}_{(x,y)\sim\mathcal{D}}[E(h,x,y)] \le \frac{1}{m} \sum_{(x,y)\in S} E(h,x,y) + 4D\|w\|_\infty \mathcal{R}_m(\mathcal{F}) + \|w\|_\infty \sqrt{DA^2 - \vartheta^2}\, \sqrt{\frac{\log(2/\delta)}{2m}}.$$

Proof. We replace the variables in Lemma 1 using Lemma 8 and Lemma 9.

I PROOF OF THEOREM 4

Lemma 10. With a probability of at least $\tau_1 \tau_2$, we have

$$\sup_{x,y,h} |E(h,x,y)| \le J_1 + J_2.$$

Proof. We have $\sup_{x,W} \|G^{(1)}_W(x)\|_2^2 \le \|w^{(1)}\|_\infty^2 \big(D^{(1)} (A^{(1)})^2 - (\vartheta^{(1)})^2\big) = J_1$ and $\sup_{y,W} \|G^{(2)}_W(y)\|_2^2 \le \|w^{(2)}\|_\infty^2 \big(D^{(2)} (A^{(2)})^2 - (\vartheta^{(2)})^2\big) = J_2$. We also have $E(h,x,y) = \frac{1}{2}\|G^{(1)}_W(x) - G^{(2)}_W(y)\|_2^2$. Since $\|a - b\|_2^2 \le 2(\|a\|_2^2 + \|b\|_2^2)$, the claim follows.

Lemma 11. With a probability of at least $\tau_1 \tau_2$, we have

$$\mathcal{R}_m(\mathcal{E}) \le 4\big(\sqrt{J_1} + \sqrt{J_2}\big)\Big(D^{(1)} \|w^{(1)}\|_\infty \mathcal{R}_m(\mathcal{F}_1) + D^{(2)} \|w^{(2)}\|_\infty \mathcal{R}_m(\mathcal{F}_2)\Big).$$

Proof. Let $f$ be the square function, i.e., $f(x) = \frac{1}{2}x^2$, and $\mathcal{E}_0 = \{G^{(1)}_W(x) - G^{(2)}_W(y) \mid x \in \mathcal{X}, y \in \mathcal{Y}\}$. We have $\mathcal{E} = f(\mathcal{E}_0 + (-\mathcal{E}_0))$. $f$ is Lipschitz over the input space, with a constant $L$ bounded by $\sup_{x,W} \|G^{(1)}_W(x)\|_2 + \sup_{y,W} \|G^{(2)}_W(y)\|_2 \le \sqrt{J_1} + \sqrt{J_2}$. Thus, we have

$$\mathcal{R}_m(\mathcal{E}) \le \big(\sqrt{J_1} + \sqrt{J_2}\big) \mathcal{R}_m(\mathcal{E}_0 + (-\mathcal{E}_0)) \le 2\big(\sqrt{J_1} + \sqrt{J_2}\big) \mathcal{R}_m(\mathcal{E}_0).$$

Next, we note that $\mathcal{R}_m(\mathcal{E}_0) = \mathcal{R}_m(\mathcal{H}_1 + (-\mathcal{H}_2)) = \mathcal{R}_m(\mathcal{H}_1) + \mathcal{R}_m(\mathcal{H}_2)$. Using the same technique as in the proof of Lemma 5, we have $\mathcal{R}_m(\mathcal{H}_1) \le 2D^{(1)} \|w^{(1)}\|_\infty \mathcal{R}_m(\mathcal{F}_1)$ and $\mathcal{R}_m(\mathcal{H}_2) \le 2D^{(2)} \|w^{(2)}\|_\infty \mathcal{R}_m(\mathcal{F}_2)$.

Theorem 4 For the energy function $E(h,x,y) = \frac{1}{2}\|G^{(1)}_W(x) - G^{(2)}_W(y)\|_2^2$, over the input set $\mathcal{X} \subset \mathbb{R}^N$, hypothesis class $\mathcal{H} = \{G^{(1)}_W(x) = \sum_{i=1}^{D^{(1)}} w^{(1)}_i \phi^{(1)}_i(x) = w^{(1)T} \Phi^{(1)}(x),\ G^{(2)}_W(y) = \sum_{i=1}^{D^{(2)}} w^{(2)}_i \phi^{(2)}_i(y) = w^{(2)T} \Phi^{(2)}(y) \mid \Phi^{(1)} \in \mathcal{F}_1,\ \Phi^{(2)} \in \mathcal{F}_2,\ \forall x: \|\Phi^{(1)}(x)\|_2 \le A^{(1)},\ \forall y: \|\Phi^{(2)}(y)\|_2 \le A^{(2)}\}$, and output set $\mathcal{Y} \subset \mathbb{R}^N$, if the feature set $\{\phi^{(1)}_1(\cdot), \cdots, \phi^{(1)}_{D^{(1)}}(\cdot)\}$ is $\vartheta^{(1)}$-diverse with a probability $\tau_1$ and the feature set $\{\phi^{(2)}_1(\cdot), \cdots, \phi^{(2)}_{D^{(2)}}(\cdot)\}$ is $\vartheta^{(2)}$-diverse with a probability $\tau_2$, then with a probability of at least $(1-\delta)\tau_1\tau_2$, the following holds for all $h$ in $\mathcal{H}$:

$$\mathbb{E}_{(x,y)\sim\mathcal{D}}[E(h,x,y)] \le \frac{1}{m} \sum_{(x,y)\in S} E(h,x,y) + 8\big(\sqrt{J_1} + \sqrt{J_2}\big)\Big(D^{(1)} \|w^{(1)}\|_\infty \mathcal{R}_m(\mathcal{F}_1) + D^{(2)} \|w^{(2)}\|_\infty \mathcal{R}_m(\mathcal{F}_2)\Big) + (J_1 + J_2) \sqrt{\frac{\log(2/\delta)}{2m}},$$

where $J_1 = \|w^{(1)}\|_\infty^2 \big(D^{(1)} (A^{(1)})^2 - (\vartheta^{(1)})^2\big)$ and $J_2 = \|w^{(2)}\|_\infty^2 \big(D^{(2)} (A^{(2)})^2 - (\vartheta^{(2)})^2\big)$.

Proof. We replace the variables in Lemma 1 using Lemma 10 and Lemma 11.

J IMAGE GENERATION EXAMPLE SETTINGS AND ADDITIONAL RESULTS

For the EBM model, we used a simple CNN composed of four convolutional layers followed by a linear layer. The full CNN model is presented in Table 3. The training protocol is the same as in (UvA; Du & Mordatch, 2019), i.e., using Langevin dynamics MCMC and a sampling buffer to accelerate training. All models were trained for 60 epochs using the Adam optimizer with a learning rate of lr = 1e-4 and a batch size of 128.

In addition to the results presented in the paper, Figure 5 presents additional qualitative results. For the first two examples (top), the model converges to a realistic image within a reasonable number of iterations. The last two examples (bottom) are failure cases of our approach. For these two tests, the generated image still improves over the iterations, but the model fails to converge to a clear, realistic MNIST image after 256 steps.

Figure 1: An illustration of energy-based models used to solve (a) a regression problem, (b) a binary classification problem, and (c) an implicit regression problem.

Figure 2: From left to right: (a) 2-D swiss roll ground truth distribution, (b) distribution learned using a standard EBM model, (c) distribution learned with the augmented loss using our regularizer. Relative to the ground truth, the Jensen-Shannon distances of the standard EBM distribution and ours are 0.27426 and 0.2733, respectively.

Figure 3: Qualitative results of our approach (β = 1e-13): a few intermediate samples of the MCMC sampling (Langevin dynamics).

… | … ± 0.86 | 29.02 ± 0.24 | 48.40 ± 0.80 | 34.78 ± 0.26
ours (β = 1e-11) | 39.61 ± 0.81 | 29.15 ± 0.27 | 49.63 ± 0.90 | 34.86 ± 0.30
ours (β = 1e-12) | 40.64 ± 0.79 | 29.38 ± 0.21 | 50.25 ± 0.63 | 35.20 ± 0.23
ours (β = 1e-13) | 40.15 ± 0.87 | 29.28 ± 0.28 | 50.20 ± 0.94 | 35.03 ± 0.21

Table 2: Evaluation of class-incremental learning in both the boundary-aware and boundary-agnostic settings on the CIFAR10 and CIFAR100 datasets. Each experiment was performed ten times with different random seeds; the results are reported as mean ± SEM over these runs.

Figure 4: Average test classification accuracy vs. number of observed tasks on CIFAR10 using the boundary-aware (left) and boundary-agnostic (right) settings. The results are averaged over ten random seeds.

