ON FEATURE DIVERSITY IN ENERGY-BASED MODELS

Anonymous authors
Paper under double-blind review

Abstract

Energy-based learning is a powerful learning paradigm that encapsulates various discriminative and generative approaches. An energy-based model (EBM) typically consists of one or more inner models that learn a combination of different features to generate an energy mapping for each input configuration. In this paper, we focus on the diversity of the produced feature set. We extend the probably approximately correct (PAC) theory of EBMs and analyze the effect of redundancy reduction on the performance of EBMs. We derive generalization bounds for various learning contexts, i.e., regression, classification, and implicit regression, with different energy functions, and we show that reducing the redundancy of the feature set can consistently decrease the gap between the true and empirical expectations of the energy and boost the performance of the model.

1. INTRODUCTION

The energy-based learning paradigm was first proposed by Zhu & Mumford (1998) and LeCun et al. (2006) as an alternative to probabilistic graphical models (Koller & Friedman, 2009). As their name suggests, energy-based models (EBMs) map each input 'configuration' to a single scalar, called the 'energy'. In the learning phase, the parameters of the model are optimized by associating the desired configurations with small energy values and the undesired ones with higher energy values (Kumar et al., 2019; Song & Ermon, 2019; Yang et al., 2016). In the inference phase, given an incomplete input configuration, the energy surface is explored to find the remaining variables that yield the lowest energy. EBMs encapsulate solutions to several supervised (LeCun et al., 2006; Fang & Liu, 2016) and unsupervised (Deng et al., 2020; Bakhtin et al., 2021; Zhao et al., 2020; Xu et al., 2022) learning problems and provide a common theoretical framework for many learning models, including traditional discriminative (Zhai et al., 2016; Li et al., 2020) and generative (Zhu & Mumford, 1998; Xie et al., 2017b; Zhao et al., 2017; Che et al., 2020; Khalifa et al., 2021) approaches. Formally, let us denote the energy function by $E(h, x, y)$, where $h = G_W(x)$ represents the model with parameters $W$ to be optimized during training and $x$, $y$ are sets of variables. Figure 1 illustrates how classification, regression, and implicit regression can be expressed as EBMs. Figure 1 (a) presents a regression scenario. The input $x$, e.g., an image, is transformed using an inner model $G_W(x)$, and its distance to the second input $y$ is computed, yielding the energy function. A valid energy function in this case can be the $L_1$ or the $L_2$ distance. In the binary classification case (Figure 1 (b)), the energy can be defined as $E(h, x, y) = -y\,G_W(x)$.
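As a minimal sketch of the regression and binary classification energies described above, the following Python snippet uses a toy linear inner model. The function names, the linear form of the inner model, the label set {-1, +1}, and the fixed weights are all assumptions made for illustration and are not taken from the paper. Inference is shown as a search over candidate labels for the one with the lowest energy.

```python
# Illustrative sketch only. All names below are hypothetical, and the
# inner model is assumed to be a simple dot product with fixed weights.

def inner_model(w, x):
    """Toy inner model G_W(x): dot product between weights w and input x."""
    return sum(wi * xi for wi, xi in zip(w, x))


def energy_regression(w, x, y):
    """Regression energy: L1 distance between G_W(x) and the target y."""
    return abs(inner_model(w, x) - y)


def energy_classification(w, x, y):
    """Binary classification energy E = -y * G_W(x), assuming y in {-1, +1}."""
    return -y * inner_model(w, x)


def infer_label(w, x, candidates=(-1, 1)):
    """Inference: return the candidate label with the lowest energy."""
    return min(candidates, key=lambda y: energy_classification(w, x, y))


w = [0.5, -1.0]  # assumed weights; no training is performed here
x = [2.0, 0.5]   # G_W(x) = 0.5 * 2.0 + (-1.0) * 0.5 = 0.5

print(infer_label(w, x))             # label whose energy is lowest
print(energy_regression(w, x, 1.0))  # |G_W(x) - 1.0|
```

With these assumed values, the inner model output is positive, so the label +1 receives the lower classification energy and is the one selected.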
In the implicit regression case (Figure 1 (c)), we have two inner models, and the energy can be defined as the $L_2$ distance between their outputs: $E(h, x, y) = \frac{1}{2} \| G^{(1)}_W(x) - G^{(2)}_W(y) \|_2^2$. In the inference phase, given an input $x$, the label $y^*$ can be obtained by solving the following optimization problem:

$$y^* = \arg\min_{y} E(h, x, y). \tag{1}$$

An EBM typically relies on an inner model, i.e., $G_W(x)$, to generate the desired energy landscape (LeCun et al., 2006). Depending on the problem at hand, this function can be constructed as a linear projection, a kernel method, or a neural network, and its parameters are optimized in a data-driven manner in the training phase. Formally, $G_W(x)$ can be written as $G_W(x) = \sum_{i}^{D} w_i \phi_i(x)$,

