MULTI-SEGMENTAL INFORMATIONAL CODING FOR SELF-SUPERVISED REPRESENTATION LEARNING

Abstract

Self-supervised representation learning aims to map high-dimensional data into a compact embedding space, where samples with similar semantics are close to each other. Currently, most representation learning methods maximize the cosine similarity or minimize the distance between different views of the same sample in an ℓ2-normalized embedding space, and reduce feature redundancy via a linear correlation constraint. In this study, we propose MUlti-Segmental Informational Coding (MUSIC) as a new embedding scheme for self-supervised representation learning. MUSIC divides an embedding vector into multiple segments representing different types of attributes, and each segment automatically learns a set of discrete and complementary attributes. MUSIC enables the estimation of the probability distribution over discrete attributes, so that the learning process can be directly guided by information measurements, reducing feature redundancy beyond linear correlation. Our theoretical analysis guarantees that MUSIC learns transform-invariant, non-trivial, diverse, and discriminative features. MUSIC does not require a special asymmetry design, a very high embedding dimension, or a deep projection head, making the training framework flexible and efficient. Extensive experiments demonstrate the superiority of MUSIC.

1. INTRODUCTION

Self-supervised representation learning (SSRL) is now recognized as a core task in machine learning, with rapid progress over the past years (Bengio et al., 2013; LeCun et al., 2015). Deep neural networks pre-trained on large-scale unlabeled datasets via SSRL have demonstrated impressive characteristics, such as strong robustness (Hendrycks et al., 2019; Liu et al., 2021) and generalizability (Mohseni et al., 2020), improving various downstream tasks. Among various pretext tasks, an effective approach is to drive semantically similar samples (i.e., different transformations of the same instance) close to each other in the embedding space (Dosovitskiy et al., 2014; Wu et al., 2018; Tian et al., 2020b; Ye et al., 2019; Dwibedi et al., 2021). However, simply maximizing the similarity or minimizing the Euclidean distance between embedding features of semantically similar samples tends to produce trivial solutions; e.g., all samples collapse to the same embedding. Recently, various excellent methods have been proposed to learn meaningful representations while avoiding trivial solutions. Contrastive-learning-based methods (Hadsell et al., 2006; Oord et al., 2018), such as SimCLR (Chen et al., 2020a;b) and MoCo (He et al., 2020), have achieved great success by additionally minimizing the similarity between embeddings of the reference and negative samples, which requires either relatively large batches or a large memory bank (Wu et al., 2018; Misra & Maaten, 2020) for negative samples. To avoid using negative samples, BYOL (Grill et al., 2020) and SimSiam (Chen & He, 2021) developed clever techniques, such as asymmetric network architectures, stop-gradients, and momentum weight updating. Subsequent theoretical analyses (Wang & Isola, 2020; Zhang et al., 2021a; Richemond et al., 2020; Tian et al., 2021) have demonstrated, from different perspectives, why these techniques avoid trivial solutions and learn meaningful representations.
Clustering-based methods, such as DeepCluster (Caron et al., 2018), SELA (Asano et al., 2019), and SwAV (Caron et al., 2020), alternately compute the cluster assignment of one view and optimize the network to predict the same assignment for other views of the same sample, where trivial solutions are avoided via the even assignment of samples over different clusters. In another direction, W-MSE (Ermolov et al., 2021) and Barlow Twins (Zbontar et al., 2021) propose to drive self- or cross-correlation matrices towards the identity matrix, reducing feature redundancy via a linear correlation constraint.

Fundamentally different from the current SSRL methods, which normalize embedding features onto the unit hypersphere via the ℓ2 norm and use cosine similarity as the metric, we propose MUlti-Segmental Informational Coding (MUSIC) for representation learning in a novel way. The motivation is based on the observation that an object can be represented by a set of attributes (Russakovsky & Fei-Fei, 2010). As illustrated in Fig. 1, we construct an embedding vector with multiple segments representing different types of attributes; e.g., Seg-1, Seg-2, and Seg-S represent object configuration, texture, and shape, respectively, and each segment instantiates a set of specific attributes; e.g., Seg-2 represents samples with different textural patterns (dots, stripes, etc.). In other words, we discretize each feature variable as a one-hot segment implemented with the softmax function, and the whole embedding vector consists of multiple such one-hot vectors. By doing so, MUSIC makes it possible to estimate the probability distribution over the discrete units of each segment, so that information measures defined on probability distributions can be computed directly for both optimization and theoretical analysis. Two general properties are desired behind the illustration in Fig. 1: 1) samples can be classified into a set of different and discrete attributes in each segment; and 2) different segments discriminate samples using different classification criteria, which means that the mutual information between different segments is minimized, or equivalently, the information/entropy of the embedding features is maximized. To automatically learn such MUSIC embeddings for SSRL, we propose an entropy-based loss function built on the empirical joint probability distribution. Our information-theoretic analysis reveals why such meaningful features are promoted while trivial solutions are avoided, consistent with the qualitative results in Appendix C. The contributions of MUSIC are as follows. (1) MUSIC presents a new embedding scheme and enables a new information-theoretic optimization framework for SSRL. (2) Theoretical analysis ensures that the MUSIC embeddings are optimized to be transform-invariant, non-trivial, diverse, and discriminative. Importantly, MUSIC can minimize any form of dependency between feature variables, beyond the linear correlation in current methods (Zbontar et al., 2021; Bardes et al., 2022). (3) Similar to Barlow Twins and VICReg, MUSIC does not require an asymmetric network architecture, negative samples in a large batch or a memory bank, gradient stopping, or momentum updating. (4) MUSIC does not need a very high embedding dimension or a deep projection head, significantly reducing the computational cost. (5) Extensive experimental results demonstrate the superiority of MUSIC on representative datasets in various evaluation settings.
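The segment-wise discretization described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name `music_embedding` and the toy dimensions are our own, and we assume a flat embedding of size S × U is reshaped into S segments of U units, each passed through a softmax.

```python
import numpy as np

def music_embedding(z, num_segments, units_per_segment):
    """Reshape flat embeddings (N, S*U) into S segments of U units each
    and apply a softmax per segment, yielding one categorical
    distribution q(s, :) per segment and per sample."""
    n, d = z.shape
    assert d == num_segments * units_per_segment
    z = z.reshape(n, num_segments, units_per_segment)
    z = z - z.max(axis=-1, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# toy usage: a batch of 4 embeddings, 3 segments with 5 units each
q = music_embedding(np.random.randn(4, 15), num_segments=3, units_per_segment=5)
print(q.shape)         # (4, 3, 5)
print(q.sum(axis=-1))  # every segment's distribution sums to 1
```

Because each segment is a proper probability distribution, empirical distributions over the batch (and hence entropies and mutual information between segments) become directly computable.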

2. METHODOLOGY

2.1 SELF-SUPERVISED LEARNING FRAMEWORK

Similar to W-MSE and Barlow Twins, in this study we adopt a twin architecture to learn the embedding features, where the same network is shared between the two branches, as shown in Fig. 2. During training, input images X = {x_i}_{i=1}^N are mapped to two distorted sets X' = {x'_i}_{i=1}^N and



Figure 1: Illustration of the MUSIC vector with piano keys. The MUSIC embedding vector consists of multiple segments (Seg-1, ..., Seg-S) representing different types of attributes shown in different colors. Each segment is associated with a set of discrete attributes; e.g., Seg-2 represents the texture attribute, and different units in Seg-2 specify different textural patterns, like dots, stripes, etc. Each segment is discretized with a one-hot vector q(s, :).
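Since each segment q(s, :) is a categorical distribution, one can estimate empirical marginals over a batch and use their entropies as information measurements, as the paper proposes. The exact loss is not reproduced here; the sketch below only shows the grounding idea that a collapsed segment has near-zero marginal entropy, while a segment whose units are used evenly attains the maximum log U. The function name `segment_entropies` is ours.

```python
import numpy as np

def segment_entropies(q):
    """q: (N, S, U) per-sample segment distributions.
    Returns the entropy of each segment's batch-averaged marginal.
    Near-zero entropy signals a collapsed segment (trivial solution);
    the maximum, log U, means all U attributes are used evenly."""
    p = q.mean(axis=0)                        # (S, U) empirical marginals
    return -(p * np.log(p + 1e-12)).sum(-1)   # (S,) entropies in nats

# collapsed segment: every sample puts all mass on unit 0
q_trivial = np.zeros((8, 1, 4)); q_trivial[:, 0, 0] = 1.0
# diverse segment: samples spread evenly over the 4 units
q_diverse = np.eye(4)[np.arange(8) % 4].reshape(8, 1, 4)
print(segment_entropies(q_trivial))  # ~ 0
print(segment_entropies(q_diverse))  # ~ log 4 ~ 1.386
```

Maximizing such entropy terms (together with a transform-invariance term between the two views) is the kind of information-theoretic objective the MUSIC scheme makes directly computable.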

