DECOMPOSING MUTUAL INFORMATION FOR REPRESENTATION LEARNING

Abstract

Many self-supervised representation learning methods maximize mutual information (MI) across views. In this paper, we transform each view into a set of subviews and then decompose the original MI bound into a sum of bounds involving conditional MI between the subviews. For example, given two views x and y of the same input example, we can split x into two subviews, x₁ and x₂, which depend only on x but are otherwise unconstrained. By the chain rule and the data processing inequality, the following holds: I(x; y) ≥ I(x₁; y) + I(x₂; y|x₁). By maximizing both terms in the decomposition, our approach explicitly rewards the encoder for any information about y which it extracts from x₁, and for information about y extracted from x₂ in excess of the information already contained in x₁. We provide a novel contrastive lower bound on conditional MI that relies on sampling contrast sets from p(y|x₁). By decomposing the original MI into a sum of increasingly challenging MI bounds between sets of increasingly informed views, our representations can capture more of the total information shared between the original views. We empirically test the method in a vision domain and for dialogue generation.

1. INTRODUCTION

The ability to extract actionable information from data in the absence of explicit supervision seems to be a core prerequisite for building systems that can, for instance, learn from few data points or quickly make analogies and transfer to other tasks. Approaches to this problem include generative models (Hinton, 2012; Kingma & Welling, 2014) and self-supervised representation learning approaches, in which the objective is not to maximize likelihood, but to formulate a series of (label-agnostic) tasks that the model needs to solve through its representations (Noroozi & Favaro, 2016; Devlin et al., 2019; Gidaris et al., 2018; Hjelm et al., 2019). Self-supervised learning includes successful models leveraging contrastive learning, which have recently attained comparable performance to their fully-supervised counterparts (Bachman et al., 2019; Chen et al., 2020a).

Many self-supervised learning methods train an encoder such that the representations of a pair of views x and y derived from the same input example are more similar to each other than to representations of views sampled from a contrastive negative sample distribution, which is usually the marginal distribution of the data. For images, different views can be built using random flipping, color jittering and cropping (Bachman et al., 2019; Chen et al., 2020a). For sequential data such as conversational text, the views can be past and future utterances in a given dialogue. It can be shown that these methods maximize a lower bound on mutual information (MI) between the views, I(x; y), w.r.t. the encoder, i.e. the InfoNCE bound (Oord et al., 2018). One significant shortcoming of this approach is the large number of contrastive samples required, which directly impacts the total amount of information which the bound can measure (McAllester & Stratos, 2018; Poole et al., 2019). In this paper, we consider creating subviews of x by removing information from it in various ways, e.g. by masking some pixels.
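To make the InfoNCE bound concrete, the following is a minimal NumPy sketch (not the implementation used in this paper): given an N×N matrix of critic scores, where entry (i, j) is a hypothetical learned score f(xᵢ, yⱼ) and the diagonal holds the positive pairs, the bound is log N plus the average log-softmax weight assigned to each positive.

```python
import numpy as np

def info_nce_bound(scores):
    """InfoNCE lower bound on I(x; y), in nats.

    scores: [N, N] array; scores[i, j] is a critic value f(x_i, y_j).
    Diagonal entries score the positive (paired) views; off-diagonal
    entries serve as the N - 1 contrastive negatives for each row.
    The bound equals log N + mean_i [log softmax_j(scores[i])_i],
    and is therefore capped at log N regardless of the critic.
    """
    n = scores.shape[0]
    # Row-wise log-softmax; the positive for row i sits at column i.
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return np.log(n) + float(np.mean(np.diag(log_probs)))
```

With an uninformative critic (all scores equal) the bound is 0; with a near-perfect critic it approaches its cap of log N, which is the saturation issue noted above: measuring large MI requires a large number N of contrastive samples.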
Then, we use representations from less informed subviews as a source of hard contrastive samples for representations from more informed subviews. For example, in Fig. 1, one can mask a pixel region in x to obtain x₁ and ask (the representation of) x₁ to be closer to y than to random images from the corpus, and x to be closer to y than to samples from p(y|x₁). This corresponds to decomposing the MI between x and y as I(x; y) = I(x₁; y) + I(x; y|x₁). The conditional MI measures the information about y that the model gains by looking at x beyond the information already contained in x₁. In Fig. 1 (left), standard contrastive approaches
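The chain-rule decomposition underlying this objective can be checked numerically on a toy discrete distribution. The sketch below (an illustration, not part of the paper's method) takes x = (x₁, x₂) with all variables binary, so the chain rule I(x; y) = I(x₁; y) + I(x₂; y|x₁) holds with equality; for a general subview x₂ derived from x, the data processing inequality turns the second term into a lower bound.

```python
import numpy as np

def mi(pxy):
    """Mutual information (nats) of a joint distribution given as a 2-D table."""
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    mask = pxy > 0
    return float((pxy[mask] * np.log(pxy[mask] / (px * py)[mask])).sum())

# Random joint distribution over (x1, x2, y), each binary; full view x = (x1, x2).
rng = np.random.default_rng(0)
p = rng.random((2, 2, 2))
p /= p.sum()

i_xy = mi(p.reshape(4, 2))     # I(x; y), flattening x = (x1, x2)
i_x1y = mi(p.sum(axis=1))      # I(x1; y), marginalizing out x2
# I(x2; y | x1) = sum_a p(x1 = a) * I(x2; y | x1 = a)
i_cond = sum(p[a].sum() * mi(p[a] / p[a].sum()) for a in range(2))

# Chain rule: I(x; y) = I(x1; y) + I(x2; y | x1)
assert abs(i_xy - (i_x1y + i_cond)) < 1e-9
```

Since the conditional term is nonnegative, maximizing both summands can only tighten the overall bound on I(x; y).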

