HYBRID DISCRIMINATIVE-GENERATIVE TRAINING VIA CONTRASTIVE LEARNING

Abstract

Contrastive learning and supervised learning have both seen significant progress and success. However, thus far they have largely been treated as two separate objectives, brought together only by having a shared neural network. In this paper we show that, through the perspective of hybrid discriminative-generative training of energy-based models, we can make a direct connection between contrastive learning and supervised learning. Beyond presenting this unified view, we show that our specific choice of approximation of the energy-based loss significantly improves energy-based models and contrastive learning based methods on confidence calibration, out-of-distribution detection, adversarial robustness, generative modeling, and image classification tasks. In addition to significantly improved performance, our method dispenses with SGLD training and does not suffer from its training instability. Our evaluations also demonstrate that our method performs better than or on par with state-of-the-art hand-tailored methods on each of these tasks.

1. INTRODUCTION

In the past few years, the field of deep learning has seen significant progress. Example successes include large-scale image classification (He et al., 2016; Simonyan & Zisserman, 2014; Srivastava et al., 2015; Szegedy et al., 2016) on the challenging ImageNet benchmark (Deng et al., 2009). The standard objective for supervised machine learning problems is to minimize the cross-entropy loss, defined as the cross entropy between a target distribution and a categorical distribution, the softmax, which is parameterized by the model's real-valued outputs, known as logits. The target distribution usually consists of one-hot labels. There has been a continuing effort to improve upon the cross-entropy loss, and various methods have been proposed, motivated by different considerations (Hinton et al., 2015; Müller et al., 2019; Szegedy et al., 2016). Recently, contrastive learning has achieved remarkable success in representation learning: it learns good representations and enables efficient training on downstream tasks, including image classification (Chen et al., 2020a;b; Grill et al., 2020; He et al., 2019; Tian et al., 2019; Oord et al., 2018).

Despite the success of these two objectives, they have been treated as separate, brought together only by having a shared neural network. In this paper, to establish a direct connection between contrastive learning and supervised learning, we consider the energy-based interpretation of models trained with the cross-entropy loss, building on Grathwohl et al. (2019). We propose a novel objective that consists of a term for the conditional of the label given the input (the classifier) and a term for the conditional of the input given the label. We optimize the classifier term in the usual way. Unlike Grathwohl et al. (2019), we approximately optimize the second conditional, over the input, with a contrastive learning objective instead of a Monte-Carlo sampling-based approximation. In doing so, we provide a unified view of existing practice.
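Concretely, the objective just described can be sketched as follows. The notation here is our own shorthand, with $f_\theta(x)[y]$ denoting the logit the network assigns to class $y$ for input $x$; it is meant only to illustrate the two terms, not to reproduce the paper's exact estimator:

```latex
% Hybrid objective: a discriminative (classifier) term plus a generative term
\mathcal{L}(\theta) \;=\; \log p_\theta(y \mid x) \;+\; \log p_\theta(x \mid y),
\qquad
p_\theta(y \mid x) \;=\; \frac{\exp\!\big(f_\theta(x)[y]\big)}{\sum_{y'} \exp\!\big(f_\theta(x)[y']\big)} .
```

The first term is the usual softmax cross-entropy. The second term is intractable to normalize over all possible inputs; one natural batch-based approximation (an assumption of this sketch) replaces the normalizing integral with a sum over a batch $\mathcal{B}$ of inputs, $p_\theta(x \mid y) \approx \exp(f_\theta(x)[y]) / \sum_{x' \in \mathcal{B}} \exp(f_\theta(x')[y])$, which has the form of a contrastive objective and avoids SGLD sampling.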



, video understanding(Han et al., 2019), and knowledge distillation(Tian et al., 2019). Many different training approaches have been proposed to learn such representations, usually relying on visual pretext tasks. Among them, state-of-the-art contrastive methods(He et al., 2019; Chen et al., 2020a;c)  are trained by reducing the distance between representations of different augmented views of the same image ('positive pairs'), and increasing the distance between representations of augment views from different images ('negative pairs').
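The mechanics of pulling positive pairs together and pushing negative pairs apart can be illustrated with a minimal InfoNCE-style loss. This is a generic sketch of the idea, not the implementation used by any of the methods cited above; the function name and temperature value are our own choices:

```python
import numpy as np

def info_nce_loss(z1, z2, temperature=0.1):
    """InfoNCE-style contrastive loss for a batch of paired views.

    z1, z2: (N, D) arrays of representations of two augmented views;
    row i of z1 and row i of z2 form a positive pair, while all other
    cross-row combinations serve as negatives.
    """
    # L2-normalize so the dot product is cosine similarity
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature              # (N, N) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # positives sit on the diagonal; minimizing the loss raises their
    # similarity relative to the negatives in the same row
    return -np.mean(np.diag(log_prob))
```

Perfectly aligned pairs (identical views) yield a loss near zero, while mismatched pairs are penalized, which is exactly the pull-together/push-apart behavior described above.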

