ON FEATURE DIVERSITY IN ENERGY-BASED MODELS

Anonymous authors
Paper under double-blind review

Abstract

Energy-based learning is a powerful learning paradigm that encapsulates various discriminative and generative approaches. An energy-based model (EBM) typically consists of one or more inner models that learn a combination of different features to generate an energy mapping for each input configuration. In this paper, we focus on the diversity of the produced feature set. We extend the probably approximately correct (PAC) theory of EBMs and analyze the effect of redundancy reduction on their performance. We derive generalization bounds for various learning contexts, i.e., regression, classification, and implicit regression, with different energy functions, and we show that reducing the redundancy of the feature set can consistently decrease the gap between the true and empirical expectation of the energy and boost the performance of the model.

1. INTRODUCTION

The energy-based learning paradigm was first proposed by Zhu & Mumford (1998); LeCun et al. (2006) as an alternative to probabilistic graphical models (Koller & Friedman, 2009). As their name suggests, energy-based models (EBMs) map each input 'configuration' to a single scalar, called the 'energy'. In the learning phase, the parameters of the model are optimized by associating the desired configurations with small energy values and the undesired ones with higher energy values (Kumar et al., 2019; Song & Ermon, 2019; Yang et al., 2016). In the inference phase, given an incomplete input configuration, the energy surface is explored to find the remaining variables which yield the lowest energy. EBMs encapsulate solutions to several supervised (LeCun et al., 2006; Fang & Liu, 2016) and unsupervised learning problems (Deng et al., 2020; Bakhtin et al., 2021; Zhao et al., 2020; Xu et al., 2022) and provide a common theoretical framework for many learning models, including traditional discriminative (Zhai et al., 2016; Li et al., 2020) and generative (Zhu & Mumford, 1998; Xie et al., 2017b; Zhao et al., 2017; Che et al., 2020; Khalifa et al., 2021) approaches.

Formally, let us denote the energy function by E(h, x, y), where h = G_W(x) represents the model with parameters W to be optimized during training, and x, y are sets of variables. Figure 1 illustrates how classification, regression, and implicit regression can be expressed as EBMs. In Figure 1(a), a regression scenario is presented. The input x, e.g., an image, is transformed using an inner model G_W(x), and its distance to the second input y is computed, yielding the energy. A valid energy function in this case can be the L1 or the L2 distance. In the binary classification case (Figure 1(b)), the energy can be defined as E(h, x, y) = -y G_W(x).
In the implicit regression case (Figure 1(c)), we have two inner models and the energy can be defined as the L2 distance between their outputs: E(h, x, y) = 1/2 ||G_W^(1)(x) - G_W^(2)(y)||_2^2. In the inference phase, given an input x, the label y* can be obtained by solving the following optimization problem:

y* = arg min_y E(h, x, y).    (1)

An EBM typically relies on an inner model, i.e., G_W(x), to generate the desired energy landscape (LeCun et al., 2006). Depending on the problem at hand, this function can be constructed as a linear projection, a kernel method, or a neural network, and its parameters are optimized in a data-driven manner in the training phase. Formally, G_W(x) can be written as

G_W(x) = Σ_{i=1}^D w_i ϕ_i(x),    (2)

where {ϕ_1(•), ..., ϕ_D(•)} is the set of feature functions (… et al., 2016; Yu et al., 2020; Xie et al., 2021). In the rest of the paper, we assume that the inner models G_W defined in the energy-based learning system (Figure 1) are obtained as a weighted sum of different features, as expressed in equation 2.

In (Zhang, 2013), it was shown that simply minimizing the empirical energy over the training data does not theoretically guarantee the minimization of the expected value of the true energy. Thus, developing and motivating novel regularization techniques is required (Zhang & LeCun, 2017). We argue that the quality of the feature set {ϕ_1(•), ..., ϕ_D(•)} plays a critical role in the overall performance of the global model. In this work, we extend the theoretical analysis of (Zhang, 2013) and focus on the 'diversity' of this set and its effect on the generalization ability of EBMs. Intuitively, a less correlated set of intermediate representations is richer and thus able to capture more complex patterns in the input. Thus, it is important to avoid redundant features in order to achieve better performance.
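The energy functions and inference rule above can be made concrete with a minimal numpy sketch. This is an illustrative implementation, not the paper's code: the function names are our own, features are assumed scalar-valued, and inference in equation 1 is approximated by searching over a finite candidate set of labels.

```python
import numpy as np

def inner_model(x, W, features):
    """Inner model G_W(x) = sum_i w_i * phi_i(x), as in equation 2.

    `features` is a list of feature functions phi_i; `W` holds the weights w_i.
    """
    return sum(w * phi(x) for w, phi in zip(W, features))

def energy_regression(g_x, y):
    """Regression energy (Figure 1(a)): squared L2 distance ||G_W(x) - y||_2^2."""
    return float(np.sum((np.asarray(g_x) - np.asarray(y)) ** 2))

def energy_classification(g_x, y):
    """Binary classification energy (Figure 1(b)) with y in {-1, +1}: -y * G_W(x)."""
    return -y * g_x

def energy_implicit(g1_x, g2_y):
    """Implicit regression energy (Figure 1(c)): 0.5 * ||G_W1(x) - G_W2(y)||_2^2."""
    return 0.5 * float(np.sum((np.asarray(g1_x) - np.asarray(g2_y)) ** 2))

def infer_label(x, W, features, candidates):
    """Inference (equation 1): y* = argmin_y E(h, x, y) over a candidate set."""
    g_x = inner_model(x, W, features)
    return min(candidates, key=lambda y: energy_classification(g_x, y))
```

For instance, with features ϕ_1(x) = x, ϕ_2(x) = x² and weights (1, 0.5), the input x = 2 gives G_W(x) = 4 > 0, so `infer_label` over candidates {-1, +1} returns +1, the label with the lower energy.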
However, a theoretical analysis is missing. We start by quantifying the diversity of a set of feature functions. To this end, we introduce (ϑ, τ)-diversity:

Definition 1 ((ϑ, τ)-diversity). A set of feature functions {ϕ_1(•), ..., ϕ_D(•)} is called (ϑ, τ)-diverse if there exists a constant ϑ ∈ ℝ such that for every input x,

1/2 Σ_{i≠j} (ϕ_i(x) - ϕ_j(x))^2 ≥ ϑ^2    (3)

with a high probability τ.

Intuitively, if two feature maps ϕ_i(•) and ϕ_j(•) are non-redundant, they produce different outputs for the same input with high probability. However, if, for example, the features are extracted using a neural network with a ReLU activation function, there is a high probability that some of the features associated with an input will be zero. Thus, defining a lower bound on the pair-wise diversity directly is impractical. Therefore, we quantify diversity as a lower bound on the sum of the pair-wise distances of the feature maps, as expressed in equation 3, where ϑ measures the diversity of the set.

In the machine learning context, diversity has been explored in ensemble learning (Li et al., 2012; Yu et al., 2011; Li et al., 2017), sampling (Derezinski et al., 2019; Bıyık et al., 2019), ranking (Wu et al., 2019; Qin & Zhu, 2013), pruning (Singh et al., 2020; Lee et al., 2020), and neural networks (Xie et al., 2015; Shen et al., 2021). In Xie et al. (2015; 2017a), it was shown theoretically and experimentally that avoiding redundancy over the weights of a neural network, using the mutual angles as a diversity measure, improves the generalization ability of the model. In this work, we explore a new line of research, where diversity is defined directly over the feature maps, using (ϑ, τ)-diversity, in the context of energy-based learning. In (Zhao et al., 2017), a similar idea was explored empirically: a "repelling regularizer" was proposed to force non-redundant or orthogonal feature representations.
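The quantity in Definition 1 is straightforward to estimate empirically. The sketch below is our own illustration, not part of the paper: given the feature values for a sample of inputs, it computes the pair-wise diversity 1/2 Σ_{i≠j} (ϕ_i(x) - ϕ_j(x))² for each input and takes the empirical (1 - τ)-quantile as an estimate of ϑ², so that the bound holds for a fraction τ of the samples.

```python
import numpy as np

def pairwise_diversity(phi_x):
    """1/2 * sum_{i != j} (phi_i(x) - phi_j(x))^2 for a single input.

    phi_x: 1-D array of the D feature values [phi_1(x), ..., phi_D(x)].
    """
    diffs = phi_x[:, None] - phi_x[None, :]   # D x D matrix of phi_i(x) - phi_j(x)
    # The matrix counts each unordered pair twice (and zeros on the diagonal),
    # so summing all entries gives sum_{i != j} and the 1/2 factor matches eq. 3.
    return 0.5 * float(np.sum(diffs ** 2))

def estimate_vartheta(feature_matrix, tau=0.9):
    """Empirical estimate of vartheta for a given probability level tau.

    feature_matrix: N x D array; row n holds the feature values for input x_n.
    Returns vartheta such that pairwise_diversity >= vartheta^2 holds for a
    fraction tau of the N samples.
    """
    per_sample = np.array([pairwise_diversity(row) for row in feature_matrix])
    lower = np.quantile(per_sample, 1.0 - tau)
    return float(np.sqrt(lower))
```

For two features with values (0, 1) on some input, the pair-wise diversity is 1/2 [(0-1)² + (1-0)²] = 1, so a feature set attaining this on every input would be (1, τ)-diverse for any τ.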
Moreover, the idea of learning while avoiding redundancy has been used recently in the context of semi-supervised learning (Zbontar et al., 2021; Bardes et al., 2021) . Reducing redundancy by minimizing the cross-correlation of features learned using a Siamese network



Figure 1: An illustration of energy-based models used to solve (a) a regression problem, (b) a binary classification problem, and (c) an implicit regression problem.

