HYBRID DISCRIMINATIVE-GENERATIVE TRAINING VIA CONTRASTIVE LEARNING

Abstract

Contrastive learning and supervised learning have both seen significant progress and success. However, thus far they have largely been treated as two separate objectives, brought together only by having a shared neural network. In this paper we show that, through the perspective of hybrid discriminative-generative training of energy-based models, we can make a direct connection between contrastive learning and supervised learning. Beyond presenting this unified view, we show that our specific choice of approximation of the energy-based loss significantly improves energy-based models and contrastive-learning-based methods in confidence calibration, out-of-distribution detection, adversarial robustness, generative modeling, and image classification tasks. In addition to significantly improved performance, our method does away with SGLD training and therefore does not suffer from training instability. Our evaluations also demonstrate that our method performs better than or on par with state-of-the-art hand-tailored methods on each task.

1. INTRODUCTION

In the past few years, the field of deep learning has seen significant progress. Example successes include large-scale image classification (He et al., 2016; Simonyan & Zisserman, 2014; Srivastava et al., 2015; Szegedy et al., 2016) on the challenging ImageNet benchmark (Deng et al., 2009). The common objective for solving supervised machine learning problems is to minimize the cross-entropy loss, which is defined as the cross entropy between a target distribution and a categorical distribution called the Softmax, which is parameterized by the model's real-valued outputs known as logits. The target distribution usually consists of one-hot labels. There has been a continuing effort to improve upon the cross-entropy loss, and various methods have been proposed, motivated by different considerations (Hinton et al., 2015; Müller et al., 2019; Szegedy et al., 2016). Recently, contrastive learning has achieved remarkable success in representation learning. Contrastive learning allows learning good representations and enables efficient training on downstream tasks; an incomplete list includes image classification (Chen et al., 2020a; b; Grill et al., 2020; He et al., 2019; Tian et al., 2019; Oord et al., 2018), video understanding (Han et al., 2019), and knowledge distillation (Tian et al., 2019). Many different training approaches have been proposed to learn such representations, usually relying on visual pretext tasks. Among them, state-of-the-art contrastive methods (He et al., 2019; Chen et al., 2020a; c) are trained by reducing the distance between representations of different augmented views of the same image ('positive pairs') and increasing the distance between representations of augmented views from different images ('negative pairs'). Despite the success of the two objectives, they have been treated as two separate objectives, brought together only by having a shared neural network.
In this paper, to show a direct connection between contrastive learning and supervised learning, we consider the energy-based interpretation of models trained with the cross-entropy loss, building on Grathwohl et al. (2019). We propose a novel objective that consists of a term for the conditional of the label given the input (the classifier) and a term for the conditional of the input given the label. We optimize the classifier term the normal way. Different from Grathwohl et al. (2019), we approximately optimize the second conditional over the input with a contrastive learning objective instead of a Monte-Carlo sampling-based approximation. In doing so, we provide a unified view of existing practice. Our work takes inspiration from the work of Ng & Jordan (2002). In their 2002 paper, Ng & Jordan showed that classifiers trained with a generative loss (i.e., optimizing p(x|y), with x the input and y the classification label) can outperform classifiers with the same expressiveness trained with a discriminative loss (i.e., optimizing p(y|x)). Later it was shown that hybrid discriminative-generative model training can get the best of both worlds (Raina et al., 2004). The work by Ng & Jordan (2002) was done in the (simpler) context of Naive Bayes and Logistic Regression. Our work can be seen as lifting this work into today's context of training deep neural net classifiers. Our empirical evaluation shows our method improves both the confidence calibration and the classification accuracy of the learned classifiers, beating state-of-the-art methods. Despite its simplicity, our method outperforms competitive baselines in out-of-distribution (OOD) detection for all tested datasets. On hybrid generative-discriminative modeling tasks (Grathwohl et al., 2019), our method obtains superior performance without needing to run computationally expensive SGLD steps.
Our method learns significantly more robust classifiers than supervised training and achieves highly competitive results with hand-tailored adversarial robustness algorithms. The contributions of this paper can be summarized as: (i) To the best of our knowledge, we are the first to reveal the connection between contrastive learning and supervised learning; we connect the two objectives through energy-based models. (ii) Building on this insight, we present a novel framework for hybrid generative-discriminative modeling via contrastive learning. (iii) Our method does away with SGLD and therefore does not suffer from the training instability of energy-based models. We empirically show that our method improves confidence calibration, OOD detection, adversarial robustness, generative modeling, and classification accuracy, performing on par with or better than state-of-the-art energy-based models and contrastive learning algorithms for each task.

2. RELATED WORK

Our work falls into the category of hybrid generative-discriminative models. Ng & Jordan (2002); Raina et al. (2004); Lasserre et al. (2006); Larochelle & Bengio (2008); Tu (2007); Lazarow et al. (2017) compare and study the connections and differences between discriminative models and generative models, and show that hybrid generative-discriminative models can outperform purely discriminative models and purely generative models. Our work differs in that we propose an effective training approach in the context of deep neural networks. By using contrastive learning to optimize the generative term, our method achieves state-of-the-art performance on a wide range of tasks. Energy-based models (EBMs) have been shown to be derivable from classifiers in supervised learning in the work of Xie et al. (2016); Du & Mordatch (2019), who reinterpret the logits to define a class-conditional EBM p(x|y). Our work builds heavily on JEM (Grathwohl et al., 2019), which reveals that one can re-interpret the logits obtained from classifiers to define the EBMs p(x) and p(x, y), and shows this leads to significant improvements in OOD detection, calibration, and robustness while retaining compelling classification accuracy. Our method differs in that we optimize our generative term via contrastive learning, buying the performance of state-of-the-art canonical EBM algorithms (Grathwohl et al., 2019) without running computationally expensive and slow SGLD (Welling & Teh, 2011) at every iteration. Concurrent to our work, Winkens et al. (2020) propose to pretrain using a contrastive loss and then finetune with a joint supervised and contrastive loss, and show the SimCLR loss improves likelihood-based OOD detection. Tack et al. (2020) also demonstrate that contrastive learning improves OOD detection and calibration.
Our work differs in that instead of contrastive representation pre-training followed by supervised fine-tuning, we use the contrastive loss to approximate a hybrid discriminative-generative model. We also empirically demonstrate that our method enjoys broader usage by applying it to generative modeling, calibration, and adversarial robustness.

3.1. SUPERVISED LEARNING

In supervised learning, given a data distribution $p(x)$ and a label distribution $p(y|x)$ with $C$ categories, a classification problem is typically addressed using a parametric function $f_\theta : \mathbb{R}^D \to \mathbb{R}^C$, which maps each data point $x \in \mathbb{R}^D$ to $C$ real-valued numbers termed logits. These logits are used to parameterize a categorical distribution via the Softmax function: $q_\theta(y|x) = \frac{\exp(f_\theta(x)[y])}{\sum_{y'} \exp(f_\theta(x)[y'])}$, where $f_\theta(x)[y]$ indicates the $y$-th element of $f_\theta(x)$, i.e., the logit corresponding to the $y$-th class label. One of the most widely used objectives for learning $f_\theta$ is minimizing the negative log-likelihood: $\min_\theta -\mathbb{E}_{p_{\text{data}}(x,y)}[\log q_\theta(y|x)]$. This loss function is often referred to as the cross-entropy loss, because it corresponds to minimizing the KL-divergence with a target distribution $p(y|x)$, which consists of one-hot vectors whose non-zero element denotes the correct prediction.
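As a concrete illustration, the cross-entropy loss can be computed directly from the logits. Below is a minimal NumPy sketch (the function name and example values are our own, not from the paper):

```python
import numpy as np

def cross_entropy(logits, labels):
    """Mean negative log-likelihood -E[log q_theta(y|x)] from raw logits.

    logits: (N, C) array of f_theta(x); labels: (N,) integer class ids.
    """
    # Log-softmax with the row max subtracted for numerical stability.
    z = logits - logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    # Pick out log q_theta(y_i | x_i) for each example and average.
    return -log_probs[np.arange(len(labels)), labels].mean()

logits = np.array([[2.0, 0.5, -1.0],
                   [0.1, 3.0,  0.2]])
labels = np.array([0, 1])
loss = cross_entropy(logits, labels)
```

Note that subtracting the per-row maximum leaves the softmax unchanged but avoids overflow for large logits.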

3.2. ENERGY-BASED MODELS

Energy-based models. Energy-based models (EBMs) (LeCun et al., 2006) are based on the observation that probability densities $p(x)$ for $x \in \mathbb{R}^D$ can be expressed as $p_\theta(x) = \frac{\exp(-E_\theta(x))}{Z(\theta)}$, where $E_\theta(x) : \mathbb{R}^D \to \mathbb{R}$ maps each data point to a scalar, and $Z(\theta) = \sum_{x \in \mathcal{X}} \exp(-E_\theta(x))$ (or, for continuous $x$, $Z(\theta) = \int_{\mathcal{X}} \exp(-E_\theta(x))\,dx$) is the normalizing constant, also known as the partition function. Here $\mathcal{X}$ is the full domain of $x$. For example, in the case of (say) 16x16 RGB images, computing $Z$ exactly would require a summation over $(256 \times 256 \times 256)^{16 \times 16} \approx 10^{1849}$ terms. We can parameterize an EBM using any function that takes $x$ as input and returns a scalar. For most choices of $E_\theta$, one cannot compute or even reliably estimate $Z(\theta)$, which means estimating the normalized densities is intractable and standard maximum likelihood estimation of the parameters $\theta$ is not straightforward.

Training EBMs. The log-likelihood objective for an EBM consists of a sum of $\log p_\theta(x)$ terms, one for each data point $x$. The gradient of each term is given by $\frac{\partial \log p_\theta(x)}{\partial \theta} = \mathbb{E}_{p_\theta(x')}\left[\frac{\partial E_\theta(x')}{\partial \theta}\right] - \frac{\partial E_\theta(x)}{\partial \theta}$, where the expectation is over the model distribution $p_\theta(x')$. This expectation is typically intractable (for much the same reasons computing $Z(\theta)$ is typically intractable). However, it can be approximated through samples, assuming we can sample from $p_\theta$. Generating exact samples from $p_\theta$ is typically expensive, but there are well-established approximate (sometimes exact in the limit) methods based on MCMC (Grathwohl et al., 2019; Du & Mordatch, 2019; Hinton, 2002).
Among such sampling methods, recent success in training (and sampling from) energy-based models often relies on the Stochastic Gradient Langevin Dynamics (SGLD) approach (Welling & Teh, 2011), which generates samples by following this stochastic process: $x_0 \sim p_0(x)$, $x_{i+1} = x_i - \frac{\alpha}{2} \frac{\partial E_\theta(x_i)}{\partial x_i} + \epsilon$, $\epsilon \sim \mathcal{N}(0, \alpha)$, where $\mathcal{N}(0, \alpha)$ is the normal distribution with mean 0 and standard deviation $\alpha$, $p_0(x)$ is typically a uniform distribution over the input domain, and the step size $\alpha$ should be decayed following a polynomial schedule. The SGLD sampling steps are tractable, assuming the gradient of the energy function with respect to $x$ can be computed, which is often the case. It is worth noting that this process does not require evaluating the partition function $Z(\theta)$ (or any derivatives thereof).

Joint Energy Models. The joint energy-based model (JEM) (Grathwohl et al., 2019) shows that classifiers in supervised learning are secretly also energy-based models of $p(x, y)$. The key insight is that the logits $f_\theta(x)[y]$ in the supervised cross-entropy loss can be seen as defining an energy-based model over $(x, y)$, as follows: $p(x, y) = \frac{\exp(f_\theta(x)[y])}{Z(\theta)}$, where $Z(\theta)$ is the unknown normalization constant. That is, matching this with the typical EBM notation, we have $f_\theta(x)[y] = -E_\theta(x, y)$. Subsequently, the density model of data points $p(x)$ can be obtained by marginalizing over $y$: $p(x) = \frac{\sum_y \exp(f_\theta(x)[y])}{Z(\theta)}$, with the energy $E_\theta(x) = -\log \sum_y \exp(f_\theta(x)[y])$. JEM (Grathwohl et al., 2019) adds the marginal log-likelihood of $p(x)$, expressed with this energy-based model, to the training objective. JEM uses SGLD sampling for training.
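The SGLD update above can be sketched on a toy one-dimensional energy. This is a minimal illustration, not the paper's training loop; for simplicity it uses a fixed step size rather than the polynomially decayed schedule the theory calls for:

```python
import numpy as np

def sgld_sample(grad_energy, x0, alpha=0.1, n_steps=200, rng=None):
    """SGLD: x_{i+1} = x_i - (alpha/2) * dE/dx(x_i) + eps, eps ~ N(0, alpha).

    `grad_energy` returns dE/dx; the noise standard deviation follows the
    text's convention of N(0, alpha)."""
    rng = rng if rng is not None else np.random.default_rng(0)
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        x = x - 0.5 * alpha * grad_energy(x) + rng.normal(0.0, alpha, size=x.shape)
    return x

# Toy check: for E(x) = x^2 / 2 (so dE/dx = x), chains started far from
# the mode at 0 should drift toward it.
rng = np.random.default_rng(0)
samples = np.array([sgld_sample(lambda x: x, [5.0], rng=rng)[0] for _ in range(50)])
```

Note that the sampler only ever queries the gradient of the energy, never $Z(\theta)$, mirroring the point made above.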

3.3. CONTRASTIVE LEARNING

In contrastive learning (Hadsell et al., 2006; Gutmann & Hyvärinen, 2010; 2012; Mnih & Kavukcuoglu, 2013; Mikolov et al., 2013), it is common to optimize an objective of the following form: $\min_\theta -\mathbb{E}_{p_{\text{data}}(x)}\left[\log \frac{\exp(h_\theta(x)^\top h_\theta(x'))}{\sum_{i=1}^{K} \exp(h_\theta(x)^\top h_\theta(x_i))}\right]$, where $x$ and $x'$ are two different augmented views of the same data point, $K$ is the number of negative examples, and $h_\theta : \mathbb{R}^D \to \mathbb{R}^H$ maps each data point to a normalized representation space of dimension $H$. This objective tries to maximally distinguish an input $x$ from the alternative inputs $x_i$. The intuition is that by doing so, the representation captures important information shared between similar data points, and therefore might improve performance on downstream tasks. This is usually called the contrastive learning loss or InfoNCE loss (Oord et al., 2018) and has been successfully used for learning unsupervised representations (Sohn, 2016; Wu et al., 2018; He et al., 2019; Chen et al., 2020a). In the context of supervised learning, the Supervised Contrastive Loss (Khosla et al., 2020) shows that selecting the $x_i$ from different categories as negative examples can improve on standard cross-entropy training. Their objective for learning the representation $h_\theta(x)$ is given by: $\min_\theta -\sum_{i=1}^{2N} \frac{1}{2N_{\tilde{y}_i} - 1} \sum_{j=1}^{2N} \mathbb{1}_{i \neq j} \mathbb{1}_{\tilde{y}_i = \tilde{y}_j} \log \frac{\exp(h_\theta(x_i)^\top h_\theta(x_j))}{\sum_{k=1}^{2N} \mathbb{1}_{i \neq k} \exp(h_\theta(x_i)^\top h_\theta(x_k))}$, where $N_{\tilde{y}_i}$ is the total number of images in the minibatch that have the same label $\tilde{y}_i$ as the anchor $i$. We will see that our approach outperforms Supervised Contrastive Learning, while also simplifying it by removing the need for selecting negative examples or pre-training a representation. Through this simplification we might get a closer hint at where the leverage is coming from.
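The InfoNCE objective for a single anchor can be sketched as follows (a NumPy illustration with our own names; the positive pair is included in the denominator, as is standard in SimCLR-style implementations):

```python
import numpy as np

def info_nce(anchor, positive, negatives, tau=1.0):
    """InfoNCE for one anchor:
    -log exp(a.p / tau) / (exp(a.p / tau) + sum_i exp(a.n_i / tau)).

    anchor, positive: (H,) L2-normalized views; negatives: (K, H)."""
    logits = np.concatenate([[anchor @ positive], negatives @ anchor]) / tau
    logits = logits - logits.max()  # numerical stability
    return -(logits[0] - np.log(np.exp(logits).sum()))

a = np.array([1.0, 0.0])
negs = np.array([[0.0, 1.0], [-1.0, 0.0]])
good = info_nce(a, np.array([1.0, 0.0]), negs)  # positive agrees with anchor
bad = info_nce(a, np.array([0.0, 1.0]), negs)   # positive orthogonal to anchor
```

The loss is lower the more the anchor agrees with its positive relative to the negatives, which is exactly the "pull positives together, push negatives apart" behavior described above.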

4. HYBRID DISCRIMINATIVE GENERATIVE ENERGY-BASED MODEL (HDGE)

As in the typical classification setting, we assume we are given a dataset $(x, y) \sim p_{\text{data}}$. The primary goal is to train a model that can classify $x$ to $y$. In addition, we would like the learned model to be capable of out-of-distribution detection, to provide calibrated outputs, and to serve as a generative model. To achieve these goals, we propose to train a hybrid model, which consists of a discriminative conditional and a generative conditional, by maximizing the sum of both conditional log-likelihoods: $\min_\theta -\mathbb{E}_{p_{\text{data}}(x,y)}[\log q_\theta(y|x) + \log q_\theta(x|y)]$, where $q_\theta(y|x)$ is a standard Softmax neural net classifier, and $q_\theta(x|y) = \frac{\exp(f_\theta(x)[y])}{Z(\theta)}$, with $Z(\theta) = \sum_{x'} \exp(f_\theta(x')[y])$. The rationale for this objective originates from Ng & Jordan (2002); Raina et al. (2004), who discuss the connections between logistic regression and naive Bayes, and show that hybrid discriminative-generative models can outperform purely generative or purely discriminative counterparts. The main challenge with this objective is the intractable partition function $Z(\theta)$. Our main contribution is to propose a (crude, yet experimentally effective) approximation with a contrastive loss:

$\mathbb{E}_{p_{\text{data}}(x,y)}[\log q_\theta(x|y)]$ (11)
$= \mathbb{E}_{p_{\text{data}}(x,y)}\left[\log \frac{\exp(f_\theta(x)[y])}{Z(\theta)}\right]$ (12)
$\approx \mathbb{E}_{p_{\text{data}}(x,y)}\left[\log \frac{\exp(f_\theta(x)[y])}{\sum_{i=1}^{K} \exp(f_\theta(x_i)[y])}\right]$, (13)

where $K$ denotes the number of normalization samples. This is similar to existing contrastive learning objectives, although in our formulation we also use labels. Intuitively, in order to have an accurate approximation in Equation (13), $K$ has to be sufficiently large, becoming exact in the limit of summing over all $x \in \mathcal{X}$. We don't know of any formal guarantees for our proposed approximation, and ultimately the justification has to come from our experiments. Nevertheless, there are two main intuitions we considered: (i) We try to make $K$ as large as is practical.
Increasing $K$ is not trivial, as it requires more memory. To bring it all together, our objective can be seen as a hybrid combination of supervised learning and contrastive learning:

$\min_\theta -\mathbb{E}_{p_{\text{data}}(x,y)}[\alpha \log q_\theta(y|x) + (1 - \alpha) \log q_\theta(x|y)]$ (14)
$\approx \min_\theta -\mathbb{E}_{p_{\text{data}}(x,y)}\left[\alpha \log \frac{\exp(f_\theta(x)[y])}{\sum_{y'} \exp(f_\theta(x)[y'])} + (1 - \alpha) \log \frac{\exp(f_\theta(x)[y])}{\sum_{i=1}^{K} \exp(f_\theta(x_i)[y])}\right]$, (15)

where $\alpha$ is a weight in $[0, 1]$. When $\alpha = 1$, the objective reduces to the standard cross-entropy loss, while when $\alpha = 0$ it reduces to an end-to-end supervised version of contrastive learning. We evaluated these variants in experiments, and we found that $\alpha = 0.5$ delivers the highest performance on classification accuracy as well as robustness, calibration, and out-of-distribution detection. The resulting model, dubbed Hybrid Discriminative Generative Energy-based model (HDGE), learns to jointly optimize supervised learning and contrastive learning. A PyTorch (Paszke et al., 2019)-like pseudo code corresponding to this algorithm is included in Appendix Algorithm 1.
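The hybrid objective can be sketched on a single batch as follows. This is a simplified illustration that uses the other in-batch examples as the $K$ normalization samples (a SimCLR-style choice; the paper's default uses a larger memory of samples, per Algorithm 1), with our own function names:

```python
import numpy as np

def hdge_loss(logits, labels, alpha=0.5):
    """Hybrid loss: -mean[alpha * log q(y|x) + (1 - alpha) * log q(x|y)].

    logits: (N, C) = f_theta(x); labels: (N,).
    log q(y|x): softmax over classes (axis 1, per example).
    log q(x|y): softmax over examples  (axis 0, per label), i.e. the
    batch plays the role of the K normalization samples."""
    def log_softmax(z, axis):
        z = z - z.max(axis=axis, keepdims=True)
        return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))

    idx = np.arange(len(labels))
    log_q_y_given_x = log_softmax(logits, axis=1)[idx, labels]
    log_q_x_given_y = log_softmax(logits, axis=0)[idx, labels]
    return -(alpha * log_q_y_given_x + (1 - alpha) * log_q_x_given_y).mean()

logits = np.array([[2.0, 0.5], [0.1, 3.0], [1.0, -1.0]])
labels = np.array([0, 1, 0])
l_half = hdge_loss(logits, labels, alpha=0.5)
```

Note that both terms reuse the same logits, so the hybrid model adds no parameters over a standard classifier; only the normalization axis differs between the discriminative and generative terms.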

5.1. OUT-OF-DISTRIBUTION DETECTION

We conduct experiments to evaluate HDGE on out-of-distribution (OOD) detection tasks. In general, OOD detection is a binary classification problem, where the model is required to produce a score $s_\theta(x) \in \mathbb{R}$, where $x$ is the query and $\theta$ are the model parameters. We desire that the scores for in-distribution examples be higher than those for out-of-distribution examples. Following the setting of Grathwohl et al. (2019), we use the area under the receiver-operating curve (AUROC) (Hendrycks & Gimpel, 2016) as the evaluation metric. In our evaluation, we consider two different score functions, the input density $q(x)$ (Section 5.1.1) and the predictive distribution $q(y|x)$ (Section 5.1.2). Prior work shows that fitting a density model on the data and considering examples with low likelihood to be OOD is effective, and that the likelihoods from EBMs can be reliably used as a predictor for OOD inputs (Du & Mordatch, 2019; Grathwohl et al., 2019). We are interested in whether HDGE results in a better likelihood function for OOD detection. All the methods are based on WideResNet-28-10 (Zagoruyko & Komodakis, 2016). We follow the same experimental settings as Grathwohl et al. (2019) and remove batch normalization (BN) (Ioffe & Szegedy, 2015) from WideResNet-28-10. In addition to standard discriminative models and the hybrid model JEM, we also compare HDGE with other canonical algorithms: 1) Glow (Kingma & Dhariwal, 2018), a compelling flow-based generative model; 2) JointLoss (Winkens et al., 2020), a recent state-of-the-art method which proposes to pretrain using a contrastive loss and then finetune with a joint supervised and contrastive loss, and shows the SimCLR loss improves likelihood-based OOD detection. The results are shown in Table 1 (top); HDGE consistently outperforms all of the baselines.
The corresponding score distributions are visualized in Figure 1, which shows that HDGE correctly assigns lower scores to out-of-distribution samples and performs extremely well at detecting samples from SVHN, CIFAR-100, and CelebA.

5.1.1. INPUT DENSITY q(x)

Interestingly, while Nalisnick et al. (2019) demonstrate that powerful neural generative models trained to estimate the density $p(x)$ can perform poorly on OOD detection, often assigning higher scores to OOD data points (e.g. SVHN) than to in-distribution data points (e.g. CIFAR10), HDGE successfully assigns higher scores only to in-distribution data points, as shown in the histograms in Figure 1. We believe the improvement of HDGE over JEM comes from the fact that, compared with SGLD-sampling-based methods, HDGE can incorporate a large number of diverse samples, and their corresponding label information, to train the generative conditional $\log q(x|y)$. Compared with the contrastive learning approach of Winkens et al. (2020), HDGE differs in that the contrastive loss inside $\log q(x|y)$ utilizes label information to help contrast similar data points. The empirical advantage of HDGE over JointLoss shows the benefit of incorporating label information.

5.1.2. PREDICTIVE DISTRIBUTION p(y|x)

A widely used OOD score function is the maximum prediction probability (Hendrycks & Gimpel, 2016), given by $s_\theta(x) = \max_y p_\theta(y|x)$. Intuitively, a model with high classification accuracy tends to have better OOD performance under this score function. We compare HDGE with standard discriminative models, generative models, and hybrid models. We also evaluate a contrastive pre-training baseline, which consists of learning a representation via contrastive learning and training a linear classifier on top of the representation. The results of OOD detection are shown in Table 1 (bottom). We find HDGE outperforms a strong baseline classifier and considerably outperforms all other generative modeling and hybrid modeling methods. The OOD detection evaluation shows that it is helpful to jointly train the generative model $q(x|y)$ together with the classifier $p(y|x)$ to obtain a better classifier. HDGE provides an effective and simple approach to improve out-of-distribution detection.

Figure 1: Histograms for OOD detection using the density $q(x)$ as the score function. The model is WideResNet-28-10 (without BN). Green corresponds to the score on the (in-distribution) training dataset CIFAR-10, and red corresponds to the score on the testing dataset. cifar10interp denotes a dataset that consists of linear interpolations of the CIFAR-10 dataset.
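The maximum-probability score is straightforward to compute from the logits; a small sketch (function name ours):

```python
import numpy as np

def max_prob_score(logits):
    """s(x) = max_y q(y|x): higher for confident inputs, which tend to be
    in-distribution. logits: (N, C)."""
    z = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    return probs.max(axis=1)

scores = max_prob_score(np.array([[5.0, 0.0, 0.0],   # confident prediction
                                  [1.0, 1.0, 1.0]])) # maximally uncertain
```

A confident (peaked) prediction yields a score near 1, while a uniform prediction over $C$ classes yields the floor of $1/C$.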

5.2. CONFIDENCE-CALIBRATION

Calibration plays an important role when deploying models in real-world scenarios where outputting an incorrect decision can have catastrophic consequences (Guo et al., 2017). Calibration is usually evaluated in terms of the Expected Calibration Error (ECE), a metric that measures how well calibrated a classifier is. It works by first computing the confidence, $\max_y p(y|x_i)$, for each $x_i$ in some dataset and then grouping the examples into equally spaced buckets $\{B_m\}_{m=1}^M$ based on the classifier's output confidence. For example, if $M = 20$, then $B_1$ would contain all examples for which the classifier's confidence was between 0.0 and 0.05. The ECE is defined as:

$\text{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{n} \left|\text{acc}(B_m) - \text{conf}(B_m)\right|$,

where $n$ is the number of examples in the dataset, $\text{acc}(B_m)$ is the average accuracy of the classifier on the examples in $B_m$, and $\text{conf}(B_m)$ is the average confidence over the examples in $B_m$. For a perfectly calibrated classifier, this value will be 0 for any choice of $M$. Following Grathwohl et al. (2019), we choose $M = 20$ throughout the experiments. A classifier is considered calibrated if its predictive confidence, $\max_y p(y|x)$, aligns with its misclassification rate; thus, a calibrated classifier that predicts label $y$ with confidence 0.9 should be correct 90% of the time. We evaluate the methods on CIFAR-100, where we train HDGE and baselines of the same architecture and compute the ECE on hold-out datasets. The histograms of confidence and accuracy of each method are shown in Figure 2. While classifiers have grown more accurate in recent years, they have also grown considerably less calibrated (Guo et al., 2017), as shown in the left of Figure 2. Grathwohl et al.
(2019) significantly improve the calibration of classifiers by optimizing $q(x)$ with EBM training (Figure 2, middle); however, their method is computationally expensive due to the contrastive divergence and SGLD sampling process, and their training also sacrifices the accuracy of the classifier. In contrast, HDGE provides a computationally feasible method to significantly improve both the accuracy and the calibration at the same time (Figure 2, right).
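The ECE computation described above can be sketched directly (a NumPy illustration with our own names, assuming half-open bins $(lo, hi]$):

```python
import numpy as np

def expected_calibration_error(confidences, correct, M=20):
    """ECE = sum_m |B_m|/n * |acc(B_m) - conf(B_m)| over M equal-width bins.

    confidences: per-example max_y p(y|x); correct: per-example 0/1."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    n, ece = len(confidences), 0.0
    edges = np.linspace(0.0, 1.0, M + 1)
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.sum() / n * gap
    return ece
```

For example, a model that is 90% confident and right 90% of the time scores 0, while one that is 90% confident but right only half the time scores 0.4.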

5.3. IMAGE CLASSIFICATION

We compare HDGE with: (i) the supervised learning baseline, which uses the standard cross-entropy loss; we follow the settings of Zagoruyko & Komodakis (2016) for evaluation on CIFAR-10 and CIFAR-100, and decay the learning rate by 0.2 at epochs 60, 120, and 160. (ii) Supervised Contrastive Learning (Khosla et al., 2020), which proposes to use label information to select negative examples at the contrastive pre-training stage, and shows that incorporating the label information helps the downstream supervised training of classifiers; we adapt the official implementation of the Supervised Contrastive Loss to use WideResNet. (iii) JEM (Grathwohl et al., 2019), which proposes to combine energy-based model training with the standard cross-entropy loss. As reported in Table 2, HDGE outperforms standard supervised learning (which uses only the $\log q_\theta(y|x)$ loss term), outperforms Supervised Contrastive Learning from Khosla et al. (2020) (which uses a different approximation to $q_\theta(x|y)$), outperforms JEM (which supplements the classification loss on $q_\theta(y|x)$ with a loss on the marginal $q_\theta(x)$), and outperforms HDGE trained with only $\log q_\theta(x|y)$ (i.e., only the generative loss term). This shows the benefit of hybrid discriminative-generative modeling via jointly optimizing the discriminative (classifier) loss and the generative (contrastive) loss. In addition, when studying methods that only have the generative term $q_\theta(x|y)$, we see that HDGE ($\log q_\theta(x|y)$ only) achieves higher accuracy than Khosla et al. (2020), verifying that our method provides an improved generative loss term.

Table 2: Supervised Contrastive Learning (Khosla et al., 2020), JEM (Grathwohl et al., 2019), and our method HDGE are all based on WideResNet-28-10 (Zagoruyko & Komodakis, 2016).

5.4. HYBRID DISCRIMINATIVE-GENERATIVE MODELING TASKS

HDGE models can be sampled from with SGLD. However, during experiments we found that adding the marginal log-likelihood over $x$ (as done in JEM) improved generation. We hypothesize that this is because the contrastive-learning approximation focuses on discriminating between images of different categories rather than on estimating density. So we evaluated generative modeling through SGLD sampling from a model trained with the following objective: $\min_\theta -\mathbb{E}_{p_{\text{data}}(x,y)}[\log q_\theta(y|x) + \log q_\theta(x|y) + \log q_\theta(x)]$, where $\log q_\theta(x)$ is optimized by running SGLD sampling and contrastive divergence as in JEM, and $\log q_\theta(y|x) + \log q_\theta(x|y)$ is optimized through HDGE. We train this approach on CIFAR-10 and compare against other hybrid models as well as standalone generative and discriminative models. We report Inception Scores (IS) (Salimans et al., 2016) and Fréchet Inception Distance (FID) (Heusel et al., 2017), given that we cannot compute normalized likelihoods. The results are shown in Table 3 and Figure 3. They show that jointly optimizing $\log q_\theta(y|x) + \log q_\theta(x|y) + \log q_\theta(x)$ by HDGE (first two terms) and JEM (third term) together can outperform optimizing $\log q_\theta(y|x) + \log q_\theta(x)$ by JEM alone, and that it significantly improves generative performance over state-of-the-art generative modeling methods while retaining high classification accuracy. We believe the superior performance of HDGE + JEM is due to the fact that HDGE learns a better classifier, which JEM can exploit; optimizing $\log q(x|y)$ via HDGE may also serve as a good auxiliary objective.

5.5. ADVERSARIAL ROBUSTNESS

The commonly considered adversarial attack is the $L_p$-norm constrained adversarial example, defined as an $x' \in B(x, \epsilon)$ that changes the model's prediction, where $B(x, r)$ denotes a ball centered at $x$ with radius $r$ under the $L_p$-norm metric.
In this work, we run white-box PGD (projected gradient descent) attacks with respect to the $L_2$ and $L_\infty$ norms, giving the attacker access to gradients; PGD is used to find a local maximum within a given perturbation ball (Madry et al., 2017). We train HDGE and compare with state-of-the-art adversarial training methods, in particular Adv Training (Madry et al., 2017; Santurkar et al., 2019), which uses robust optimization to train the classifier to be robust under the norm through which it is being attacked. Results from the PGD experiments can be seen in Figure 4. We can see that HDGE achieves robustness competitive with the state-of-the-art adversarial training methods.


We note that while JEM also improves robustness by optimizing the likelihood of EBMs, it requires a computationally expensive SGLD sampling procedure. In contrast, HDGE significantly improves the robustness of standard classifiers via computationally scalable contrastive learning.

6. CONCLUSION

In this work, we develop HDGE, a new framework for supervised learning and contrastive learning through the perspective of hybrid discriminative-generative models. We propose to leverage contrastive learning to approximately optimize the model for discriminative and generative tasks. JEM (Grathwohl et al., 2019) shows that energy-based models have improved confidence calibration, out-of-distribution detection, and adversarial robustness. HDGE builds on top of JEM and contrastive learning, beats both in all of these tasks, and performs significantly better than or on par with state-of-the-art hand-tailored methods on each task. HDGE does away with SGLD and therefore does not suffer from training instability; it is also conceptually simple to implement. We hope HDGE will be useful for future research on hybrid discriminative-generative training.

To evaluate HDGE, we completed a thorough empirical investigation on several standard datasets: CIFAR-10 and CIFAR-100 (Krizhevsky et al., 2009), two labeled datasets composed of 32 × 32 images with 10 and 100 classes respectively (Sections 5.1, 5.2, 5.3 and 5.4); SVHN (Netzer et al., 2011), a labeled dataset composed of over 600,000 digit images (Section 5.1); and CelebA (Liu et al., 2015), a labeled dataset consisting of over 200,000 face images, each with 40 attribute annotations (Section 5.1).

B.2 TRAINING DETAILS

Our training settings follow exactly those of JEM, except where stated otherwise in ablation studies. Pseudo-code for our training procedure is in Algorithm 1. The cross-entropy baseline is based on the official PyTorch training code [1]. HDGE's implementation is based on the official code of MoCo [2] and JEM [3]. Our source code in PyTorch (Paszke et al., 2019) is available online [4]. In the OOD evaluation, the results of JointLoss are obtained from Winkens et al. (2020), and the results of JEM and other baselines are obtained from Grathwohl et al. (2019). Our method HDGE follows the experimental settings of JEM and has exactly the same hyperparameters in optimization and model choices as JEM. The temperature is τ = 0.1, as in the other experiments conducted in this work.

The likelihood score $\log q(x)$ is calculated by applying the LogSumExp operation to the logits within HDGE. Specifically, $\log q(x) = \log \sum_y q(x, y) = \log \frac{\sum_y \exp(f_\theta(x)[y])}{Z}$, where $Z$ is the normalization constant. Since $Z$ is constant across inputs, the score we care about is $\log \sum_y \exp(f_\theta(x)[y]) = \text{LogSumExp}_y(f_\theta(x)[y])$. A similar scheme was also proposed in recent OOD detection work (Liu et al., 2020).
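This LogSumExp score can be computed in a numerically stable way as follows (a small sketch with our own function name):

```python
import numpy as np

def ood_score(logits):
    """LogSumExp_y f_theta(x)[y], i.e. log q(x) up to the constant log Z.
    Higher values indicate the model considers the input more in-distribution.

    logits: (N, C)."""
    m = logits.max(axis=1)  # subtract the max for numerical stability
    return m + np.log(np.exp(logits - m[:, None]).sum(axis=1))

s = ood_score(np.array([[10.0, 0.0],   # high-energy-density input
                        [ 0.0, 0.0]])) # low logits everywhere
```

Because $Z$ cancels when ranking inputs, no partition function is needed to use the model's density for OOD detection.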

C SIMCLR STYLE IMPLEMENTATION OF log p(x|y)

We conducted a comparison between MoCo-style and SimCLR-style implementations of the contrastive loss in $\log p(x|y)$ within HDGE. One of the key differences between MoCo and SimCLR is that MoCo uses a gradient-disabled memory bank to store logits, while SimCLR simply increases the batch size. Chen et al. (2020a; b) demonstrate that SimCLR can outperform MoCo significantly. We use batch size 2048 in our SimCLR-style HDGE; its pseudo code, analogous to Algorithm 1, is shown in Algorithm 2. The results are shown in Table 5; we can see that the SimCLR style of HDGE performs comparably with the default MoCo style, indicating that hybrid discriminative-generative training is insensitive to these implementation choices. However, our default implementation (Algorithm 1) has lower memory requirements, which makes it more widely applicable.

D GOODNESS OF APPROXIMATION

Since we approximate the energy-based model with contrastive learning in Equation (13), we are interested in evaluating the impact of the number of negative examples K on the goodness of this approximation. We consider a classification task and a density-based OOD detection task as proxies for evaluating the approximation.

Classification. We compare the image classification accuracy of HDGE on CIFAR-100. The results are shown in Figure 5. We find that increasing the number of negative samples K improves the performance of HDGE, and with a sufficient number of negative examples HDGE significantly outperforms the cross-entropy loss. The reason may be that training with many negative examples helps the model discriminate between positive and negative samples.

OOD detection. We evaluate HDGE with different values of K by running experiments on the log p(x)-based OOD tasks, using the same experimental settings as Section 5.1. We vary the batch size of the SGLD sampling process in Grathwohl et al. (2019); effectively, we change the number of samples used to estimate the derivative of the normalization constant, E_{p_θ(x')}[∂E_θ(x')/∂θ], in the JEM update rule, Equation (4). Specifically, we increase the default batch size N from 64 to 128 and 256; because running the SGLD process is memory-intensive and we were constrained by limited CUDA memory, we were unable to increase the batch size further. We also decrease K in HDGE to {64, 128, 256} to study the effect of the approximation. The results are shown in Table 6: HDGE with a small K performs fairly well except on CelebA, probably due to the simplicity of the other datasets. We note that HDGE (K = 64) outperforms JEM on three out of four datasets, which shows that the approximation in HDGE is reasonably good. While increasing the batch size of JEM improves its performance, we found that increasing K in HDGE boosts performance more significantly on all four datasets. We also note that JEM with a large batch size is significantly more computationally expensive than HDGE; as a result, JEM runs slower than HDGE even with the largest K.
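For concreteness, the quantity whose estimate is controlled by the SGLD batch size N is the model-samples term of the standard maximum-likelihood gradient for energy-based models; in the paper's notation (our transcription, paired with the usual SGLD sampler):

```latex
\frac{\partial \log p_\theta(x)}{\partial \theta}
  = \mathbb{E}_{p_\theta(x')}\!\left[\frac{\partial E_\theta(x')}{\partial \theta}\right]
  - \frac{\partial E_\theta(x)}{\partial \theta},
\qquad
x_{t+1} = x_t - \frac{\alpha}{2}\,\frac{\partial E_\theta(x_t)}{\partial x_t} + \epsilon_t,
\quad \epsilon_t \sim \mathcal{N}(0, \alpha)
```

The expectation is approximated with N samples drawn by the SGLD chain, so a larger N gives a lower-variance but more memory-hungry estimate.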



foot_0: https://github.com/szagoruyko/wide-residual-networks/tree/master/pytorch
foot_1: https://github.com/facebookresearch/moco
foot_2: https://github.com/wgrathwohl/JEM
foot_3: anonymous during double-blind review



Figure 2: CIFAR-100 calibration results. The model is WideResNet-28-10 (without BN). Expected calibration error (ECE) (Guo et al., 2017) on the CIFAR-100 dataset under various training losses.

Figure 3: Class-conditional samples generated by running HDGE+JEM on CIFAR-10.

Figure 5: Accuracy comparison with respect to different K on CIFAR-100. The baseline is standard cross-entropy loss. The model is WideResNet-28-10. Batch size is 256.

OOD Detection Results. The model is WideResNet-28-10 (without BN) following the settings of JEM (Grathwohl et al., 2019). The comparison with JointLoss (Winkens et al., 2020) follows their setting of using ResNet-50. The results of JointLoss are obtained from its paper. The training dataset is CIFAR-10. Values are AUROC. Standard deviations are given in Table 4 (Appendix).

Comparison on three standard image classification datasets: all models use the same batch size of 256 and step-wise learning rate decay, and the number of training epochs is 200. The baselines include Supervised Contrastive learning (Khosla et al., 2020).

OOD Detection Results. The model is WideResNet-28-10 (without BN) following the settings of JEM (Grathwohl et al., 2019), except that ResNet-50 is used when comparing with JointLoss (Winkens et al., 2020). The training dataset is CIFAR-10. Values are AUROC. Results of the baselines are from Grathwohl et al. (2019) and Winkens et al. (2020).

Algorithm 2: HDGE with SimCLR-style approximation. einsum: Einstein summation; cat: concatenation; logsumexp: LogSumExp operation.

OOD Detection Results. The model is WideResNet-28-10 (without BN) following the settings of JEM (Grathwohl et al., 2019). The training dataset is CIFAR-10. Values are AUROC.

