MULTI-SEGMENTAL INFORMATIONAL CODING FOR SELF-SUPERVISED REPRESENTATION LEARNING

Abstract

Self-supervised representation learning aims to map high-dimensional data into a compact embedding space, where samples with similar semantics are close to each other. Currently, most representation learning methods maximize the cosine similarity or minimize the distance between different views of the same sample in an $\ell_2$-normalized embedding space, and reduce the feature redundancy via a linear correlation constraint. In this study, we propose MUlti-Segmental Informational Coding (MUSIC) as a new embedding scheme for self-supervised representation learning. MUSIC divides an embedding vector into multiple segments to represent different types of attributes, and each segment automatically learns a set of discrete and complementary attributes. MUSIC enables the estimation of the probability distribution over discrete attributes, so the learning process can be directly guided by information measurements, reducing the feature redundancy beyond the linear correlation. Our theoretical analysis guarantees that MUSIC learns transform-invariant, non-trivial, diverse, and discriminative features. MUSIC does not require a special asymmetry design, a very high dimension of embedding features, or a deep projection head, making the training framework flexible and efficient. Extensive experiments demonstrate the superiority of MUSIC.

1. INTRODUCTION

Self-supervised representation learning (SSRL) is now recognized as a core task in machine learning, with rapid progress over the past years (Bengio et al., 2013; LeCun et al., 2015). Deep neural networks pre-trained on large-scale unlabeled datasets via SSRL have demonstrated impressive characteristics, such as strong robustness (Hendrycks et al., 2019; Liu et al., 2021) and generalizability (Mohseni et al., 2020), and improve various downstream tasks. Among various pretext tasks, an effective approach is to drive semantically similar samples (i.e., different transformations of the same instance) close to each other in the embedding space (Dosovitskiy et al., 2014; Wu et al., 2018; Tian et al., 2020b; Ye et al., 2019; Dwibedi et al., 2021). Simply maximizing the similarity or minimizing the Euclidean distance between embedding features of semantically similar samples tends to produce trivial solutions; e.g., all samples have the same embedding features. Recently, various excellent methods have been proposed to learn meaningful representations and avoid trivial solutions. Contrastive learning (Hadsell et al., 2006; Oord et al., 2018) based methods, such as SimCLR (Chen et al., 2020a; b) and MoCo (He et al., 2020), have achieved great success by additionally minimizing the similarity between embeddings of the reference and negative samples, which requires either relatively large batches or a large memory bank (Wu et al., 2018; Misra & Maaten, 2020) for negative samples. To avoid using negative samples, BYOL (Grill et al., 2020) and SimSiam (Chen & He, 2021) developed clever techniques, such as the asymmetric network architecture, stop gradients, and momentum weight updating. Subsequent theoretical analyses (Wang & Isola, 2020; Zhang et al., 2021a; Richemond et al., 2020; Tian et al., 2021) have demonstrated, from different aspects, why these techniques avoid trivial solutions and learn meaningful representations.
Clustering-based methods, such as DeepCluster (Caron et al., 2018), SELA (Asano et al., 2019), and SwAV (Caron et al., 2020), alternately compute the cluster assignment of one view and optimize the network to predict the same assignment for other views of the same sample, where trivial solutions can be avoided via the even assignment of samples over different clusters. In another direction, W-MSE (Ermolov et al., 2021) and Barlow Twins (Zbontar et al., 2021) propose to drive self- or cross-correlation matrices towards the identity matrix, reducing the feature redundancy and learning meaningful features without requiring the asymmetry design. Most recently, VICReg (Bardes et al., 2022) constructs a novel loss function with three terms, i.e., invariance, variance, and covariance constraints that explicitly suppress trivial solutions. A theoretical study (Shwartz-Ziv et al., 2022) has given a deep analysis of why VICReg works well.

Figure 1: Each segment is associated with a set of discrete attributes; e.g., Seg-2 represents the texture attribute, and different units in Seg-2 specify different textural patterns, like dots, stripes, etc. Each segment is discretized with a one-hot vector $q(s, :)$.

Fundamentally different from the current SSRL methods that normalize embedding features onto the unit hypersphere via the $\ell_2$ norm and use cosine similarity as the metric, we propose MUlti-Segmental Informational Coding (MUSIC) for representation learning in a novel way. The motivation is based on the observation that an object can be represented by a set of attributes (Russakovsky & Fei-Fei, 2010). As illustrated in Fig. 1, we construct an embedding vector with multiple segments to represent different types of attributes; e.g., Seg-1, Seg-2, and Seg-S represent object configuration, texture, and shape, respectively, and each segment instantiates a set of specific attributes; e.g., Seg-2 represents samples with different textural patterns (dots, stripes, etc.).
In other words, we discretize each feature variable as a one-hot segment implemented with the softmax function, and the whole embedding vector consists of multiple such one-hot vectors. By doing so, MUSIC makes it possible to estimate the probability distribution over the discrete units of each segment, so that information measures defined on probability distributions can be directly computed for both optimization and theoretical analysis. Two general properties are desired behind the illustration in Fig. 1: 1) samples can be classified into a set of different and discrete attributes in each segment; and 2) different segments discriminate samples using different classification criteria, which means that the mutual information between different segments is minimized, or equivalently the information/entropy of embedding features is maximized. To automatically learn such MUSIC embeddings for SSRL, we propose an entropy-based loss function based on the empirical joint probability distribution. Our information-theoretic analysis reveals why such meaningful features can be promoted while trivial solutions are avoided, which is consistent with the qualitative results in Appendix C. The contributions of MUSIC are as follows. (1) MUSIC presents a new embedding scheme and enables a new information-theoretic optimization framework for SSRL. (2) Theoretical analysis ensures that the MUSIC embeddings are optimized to be transform-invariant, non-trivial, diverse, and discriminative. Importantly, MUSIC can minimize any form of dependency between feature variables beyond the linear correlation in current methods (Zbontar et al., 2021; Bardes et al., 2022). (3) Similar to Barlow Twins and VICReg, MUSIC does not require an asymmetric network architecture, negative samples in a large batch or a memory bank, gradient stopping, or momentum updating.
(4) MUSIC does not need a very high dimension of embedding features or a deep projection head, significantly reducing the computational cost. (5) Extensive experimental results demonstrate the superiority of MUSIC on representative datasets in various evaluation settings.

2. METHODOLOGY

2.1. SELF-SUPERVISED LEARNING FRAMEWORK

Similar to W-MSE and Barlow Twins, in this study we adopt a twin architecture to learn the embedding features, where the same network is shared between two branches, as shown in Fig. 2. During training, input images $X = \{x_i\}_{i=1}^{N}$ are mapped to two distorted sets $X' = \{x'_i\}_{i=1}^{N}$ and $X'' = \{x''_i\}_{i=1}^{N}$, where $N$ is the batch size. A common transformation distribution, which covers random crops combined with color distortions, the same as that in (Bardes et al., 2022), is used to generate training samples. Then, the two sets of distorted images $X'$ and $X''$ are respectively fed to two branches, each of which consists of an encoder $f(\cdot; \theta_f)$ and a projector $g(\cdot; \theta_g)$, where $\theta_f$ and $\theta_g$ respectively denote the parameters of the encoder and projector to be optimized. The outputs of the encoder are commonly used as the representation features. The projection head maps the representation features into the embedding space. Note that the presented method is not limited to this twin architecture; it can be extended to two branches with different parameters, heterogeneous networks, or even different input modalities (e.g., text, audio, etc.), as studied in VICReg.

Figure 2: Here there are four segments and each segment consists of four units for illustration.

2.2. MULTI-SEGMENTAL INFORMATIONAL CODING

MUlti-Segmental Informational Coding (MUSIC) is a novel embedding scheme for SSRL. The embedding features of two transformed images are $z'_i = g(f(x'_i; \theta_f); \theta_g) \in \mathbb{R}^{D}$ and $z''_i = g(f(x''_i; \theta_f); \theta_g) \in \mathbb{R}^{D}$, respectively, where $D$ is the feature dimension. Most of the existing SSRL methods normalize embedding features with the $\ell_2$ norm and then maximize the cosine similarity between the two transformed versions. Motivated by the observation described in Fig. 1, we divide the embedding feature $z_i$ into multiple segments to represent different types of attributes, denoted by $z_i(s, d)$, $s = 1, \cdots, S$, $d = 1, \cdots, D_s$, where $S$ is the number of segments and $D_s$ is the dimension of the $s$-th segment. In this study, we evenly split the embedding vector, i.e., $\forall s, D_s = D_S$, and the dimension of the whole embedding space is $D = D_S \times S$. In principle, uneven partitions can be applied as well given prior knowledge. To make attributes discrete and complementary, one-hot encoding is applied to each segment. Specifically, each segment is normalized to a score vector using the softmax function:

$$q'_i(s', d') = \frac{\exp(z'_i(s', d'))}{\sum_{d=1}^{D_S} \exp(z'_i(s', d))}, \tag{1}$$

where $q'_i(s', d')$ denotes the score that the image $x'_i$ belongs to the $d'$-th instantiated attribute in the $s'$-th segment. The score vector $q''_i(s'', :)$ for the other branch is computed in the same way. Thus, the MUSIC scheme can be interpreted as a combination of multiple classifiers or cluster operators that implement different classification criteria learned in a data-driven fashion.
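The segment-wise softmax of Eq. (1) can be illustrated with a short NumPy sketch. This is only a minimal illustration under stated assumptions, not the authors' implementation (their PyTorch-style pseudo-code is in Appendix A): the function name `segment_softmax`, the array layout `(N, D)`, and the toy sizes are all hypothetical.

```python
import numpy as np

def segment_softmax(z, num_segments):
    """Apply a softmax independently within each segment of the embedding.

    z: array of shape (N, D), with D divisible by num_segments (even split).
    Returns q of shape (N, S, D_S), where q[i, s, :] sums to 1.
    """
    n, d = z.shape
    assert d % num_segments == 0, "this sketch assumes an even split, D = D_S * S"
    z = z.reshape(n, num_segments, d // num_segments)
    z = z - z.max(axis=-1, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Toy example (hypothetical sizes): N = 4 samples, S = 3 segments, D_S = 4 units.
rng = np.random.default_rng(0)
z = rng.normal(size=(4, 12))
q = segment_softmax(z, num_segments=3)
codes = q.argmax(axis=-1)  # most likely attribute index per segment, shape (4, 3)
```

Each row of `codes` can be read as the discrete attribute code of one sample, one index per segment, which matches the interpretation of MUSIC as a set of classifiers or cluster operators.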

2.3. ENTROPY LOSS

Since each type of attribute has a finite number of discrete instantiations, it is possible to estimate the probability distribution of a set of samples over each and every segment. Specifically, we can compute the empirical joint distribution $P(s', s'', d', d'')$ between every two instantiated attributes within and across segments based on a set of samples as follows:

$$P(s', s'', d', d'') = \frac{1}{N} \sum_{i=1}^{N} q'_i(s', d')\, q''_i(s'', d''), \tag{2}$$

where $P(s', s'', d', d'')$ is computed as the statistical frequency of a sample having both the attribute $d'$ in the segment $s'$ and the attribute $d''$ in the segment $s''$ over $N$ samples. With the empirical joint probability distribution, information-theoretic metrics can be directly computed. Here two versions of the loss function are defined. The first version $L_{ent}$ is a pure joint entropy loss:

$$L_{ent} = \frac{1}{S^2} \sum_{s'=1}^{S} \sum_{s''=1}^{S} \sum_{d'=1}^{D_S} \sum_{d''=1}^{D_S} \left(1 - \mathbb{1}_{s'=s'',\, d' \neq d''}\right) P(s', s'', d', d'') \log P(s', s'', d', d''), \tag{3}$$

where $\mathbb{1}_{s'=s'',\, d' \neq d''}$ is an indicator function that equals 1 if $s' = s''$ and $d' \neq d''$, and 0 otherwise. The empirical joint distribution can be denoted by a block matrix as shown in Fig. 2, where $(1 - \mathbb{1}_{s'=s'',\, d' \neq d''})$ means keeping the diagonal elements of the diagonal blocks and all elements of the off-diagonal blocks, as indicated by the orange area. Therefore, minimizing this loss function is equivalent to maximizing the joint entropy over the selected elements. The following subsection will demonstrate the properties of the embedding features learned by optimizing this loss function. To enhance the transformation invariance of features, we introduce an additional term to maximize the inner product between the embedding features from two transformations. Then, the second version of the loss function is defined as

$$L = L_{ent} - \lambda \frac{1}{NS} \sum_{i=1}^{N} \sum_{s=1}^{S} \log\left( \sum_{d=1}^{D_S} q'_i(s, d)\, q''_i(s, d) \right), \tag{4}$$

where $\lambda$ is a balancing factor.
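The empirical joint distribution and the masked entropy loss $L_{ent}$ defined above can be sketched in NumPy as follows. This is an illustrative sketch rather than the authors' code: the function names, the index order `P[s', d', s'', d'']` (the text writes $P(s', s'', d', d'')$), and the small constant `eps` that guards $\log 0$ are assumptions made here.

```python
import itertools
import numpy as np

def joint_distribution(q1, q2):
    """Empirical joint distribution over all pairs of units.

    q1, q2: arrays of shape (N, S, D_S) holding the two branches' scores.
    Returns P of shape (S, D_S, S, D_S), indexed as P[s', d', s'', d''].
    """
    n = q1.shape[0]
    return np.einsum("isd,ite->sdte", q1, q2) / n

def entropy_loss(q1, q2, eps=1e-12):
    """Masked sum of P * log(P), normalized by S^2 (negative joint entropy)."""
    _, S, D = q1.shape
    P = joint_distribution(q1, q2)
    same_segment = np.eye(S, dtype=bool)[:, None, :, None]     # s' == s''
    different_unit = ~np.eye(D, dtype=bool)[None, :, None, :]  # d' != d''
    # Drop off-diagonal elements of diagonal blocks; keep everything else.
    mask = 1.0 - (same_segment & different_unit)
    return float((mask * P * np.log(P + eps)).sum() / S**2)

# Toy check (hypothetical sizes): enumerate every possible code once, so that
# all units are used evenly and the segments are independent.
S, D = 2, 3
codes = np.array(list(itertools.product(range(D), repeat=S)))  # (D**S, S)
q = np.eye(D)[codes]                                           # one-hot, (9, 2, 3)
loss_even = entropy_loss(q, q)
expected = -(2 * S - 1) / S * np.log(D)  # value derived by hand for this construction
```

For this evenly assigned one-hot batch, each kept diagonal entry equals $1/D_S$ and each off-diagonal-block entry equals $1/(D_S)^2$, so the loss evaluates to $-\frac{2S-1}{S}\log D_S$, which is what `expected` encodes.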
By default, we set $\lambda = 1$, which is in principle neither too small nor too large for a good balance. Minimizing the transformation invariance loss enforces the embedding features of two transformations of the same image to be consistent and encourages the embedding feature within each segment to be one-hot. Clearly, this additional term promotes transformation invariance and confident assignments over different attributes. Different from the statistical entropy measurement, this transformation invariance term imposes a sample-specific constraint. The transformation invariance can also be achieved by minimizing the cross-entropy between the two embedding vectors, i.e., $-\frac{1}{NS} \sum_{i=1}^{N} \sum_{s=1}^{S} \sum_{d=1}^{D_S} q'_i(s, d) \log q''_i(s, d)$. However, using the cross-entropy degrades the performance, as reported in Subsection 4.2. Our proposed method can be easily implemented, with a PyTorch-style pseudo-code in Appendix A. Next, let us theoretically analyze why the entropy loss optimizes informational embedding features as illustrated in Fig. 1.
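The two transformation invariance terms discussed above can be compared with a small NumPy sketch. It assumes that the inner-product term is the log of the per-segment inner product $\sum_d q'_i(s, d)\, q''_i(s, d)$, which is how the text describes it; the function names and the `eps` guard are hypothetical, and `total_loss` simply takes a precomputed $L_{ent}$ value.

```python
import numpy as np

def invariance_inner_product(q1, q2, eps=1e-12):
    """-(1/(N*S)) * sum_i sum_s log( sum_d q'_i(s,d) * q''_i(s,d) )."""
    return float(-np.log((q1 * q2).sum(axis=-1) + eps).mean())

def invariance_cross_entropy(q1, q2, eps=1e-12):
    """-(1/(N*S)) * sum_i sum_s sum_d q'_i(s,d) * log q''_i(s,d)."""
    return float(-(q1 * np.log(q2 + eps)).sum(axis=-1).mean())

def total_loss(l_ent, q1, q2, lam=1.0):
    """Second version of the loss: entropy loss plus the weighted invariance term."""
    return l_ent + lam * invariance_inner_product(q1, q2)

# Toy inputs (hypothetical sizes): N = 2 samples, S = 2 segments, D_S = 4 units.
q_onehot = np.eye(4)[np.array([[0, 1], [2, 3]])]  # confident, identical views
q_uniform = np.full((2, 2, 4), 0.25)              # maximally uncertain views

ti_onehot = invariance_inner_product(q_onehot, q_onehot)
ti_uniform = invariance_inner_product(q_uniform, q_uniform)
ce_uniform = invariance_cross_entropy(q_uniform, q_uniform)
```

With identical one-hot views the inner-product term is (numerically) zero, its minimum, whereas uniform scores give $\log D_S$, so the term rewards both agreement between views and confident assignments.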

2.4. ANALYSIS

The entropy loss function consists of two parts, namely the entropy over the diagonal elements of the diagonal blocks and the entropy over all elements of the off-diagonal blocks, as illustrated by the orange area in Fig. 2, and can be formally expressed as

$$L_{ent} = \frac{1}{S} \sum_{\substack{s', s'' \\ s' = s''}} \sum_{\substack{d', d'' \\ d' = d''}} P(s', s'', d', d'') \log P(s', s'', d', d'') + \frac{1}{S(S-1)} \sum_{\substack{s', s'' \\ s' \neq s''}} \sum_{d', d''} P(s', s'', d', d'') \log P(s', s'', d', d''). \tag{5}$$

For the first part, it can be demonstrated that its optimal solution is that $\forall i, s, d$, $q'_i(s, d) = q''_i(s, d)$; $q'_i(s, :)$ and $q''_i(s, :)$ are one-hot vectors; the statistical probability of the $s$-th attribute type taking the $d$-th instantiation is $p(s, d) = \frac{1}{N} \sum_{i=1}^{N} q_i(s, d) = \frac{1}{D_S}$; and $P(s, s, d, d) = \frac{1}{D_S}$. The proof can be found in Appendix B. For the second part, it is intuitive that the optimal solution to maximize the joint entropy over the off-diagonal block items is $\forall s', s'', d', d''$ with $s' \neq s''$, $P(s', s'', d', d'') = \frac{1}{(D_S)^2}$; i.e., a batch of samples is evenly assigned over each off-diagonal block.

Transformation invariance: The solution that $\forall i, s$, $q'_i(s, :) = q''_i(s, :)$ are one-hot vectors means that the learned MUSIC embeddings are invariant to transformations, and a sample tends to be confidently represented by a single instantiated attribute within each and every segment.

Non-trivial solution: The solution that $\frac{1}{N} \sum_{i=1}^{N} q_i(s, d) = \frac{1}{D_S}$ means that a batch of samples is evenly assigned over different attributes in each segment, as $q'_i(s, :)$ and $q''_i(s, :)$ are one-hot vectors. Thus, the trivial solution that all samples have the same embedding features can be avoided. The discriminative encoding analyzed below also ensures that MUSIC embeddings are non-trivial.

Minimum redundancy: As described in Fig. 1, different segments of the MUSIC embedding vector are expected to focus on diverse and complementary attributes. In other words, the redundancy or mutual information between any two segments should be minimized, which is a popular measure for feature selection (Peng et al., 2005). It can be demonstrated that the redundancy or mutual information between any two segments is minimized when the optimal solution is obtained. Specifically, the mutual information $I(s', s'')$ between any two segments $s'$ and $s''$ is

$$\begin{aligned} I(s', s'') &= H(s') + H(s'') - H(s', s'') \\ &= -\sum_{d'=1}^{D_S} p'(s', d') \log p'(s', d') - \sum_{d''=1}^{D_S} p''(s'', d'') \log p''(s'', d'') + \sum_{d'=1}^{D_S} \sum_{d''=1}^{D_S} P(s', s'', d', d'') \log P(s', s'', d', d'') \\ &= -\log \frac{1}{D_S} - \log \frac{1}{D_S} + \log \frac{1}{(D_S)^2} = 0. \end{aligned} \tag{6}$$

Thus, the mutual information is minimized and diverse attributes are learned. From another perspective, $\forall s', s'', d', d''$ with $s' \neq s''$, we have $P(s', s'', d', d'') = \frac{1}{(D_S)^2} = p(s', d')\, p(s'', d'')$, and thus the variables $q(s', d')$ and $q(s'', d'')$ are independent. The redundancy constraint was studied for W-MSE, Barlow Twins, and VICReg by minimizing the covariance or linear correlation. In contrast, our entropy-based loss function reduces the redundancy or mutual information in a non-linear way.
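The mutual-information argument of Eq. (6) can be checked numerically on toy data. The sketch below is an assumption-laden illustration: `segment_mutual_information`, the `eps` guard, and the two hand-built batches are hypothetical and use a single view ($q' = q''$), as holds at the optimal solution.

```python
import itertools
import numpy as np

def entropy(p, eps=1e-12):
    """Shannon entropy (in nats) of a probability array of any shape."""
    p = np.asarray(p, dtype=float)
    return float(-(p * np.log(p + eps)).sum())

def segment_mutual_information(q, s1, s2):
    """I(s1; s2) = H(s1) + H(s2) - H(s1, s2) from scores q of shape (N, S, D_S)."""
    joint = np.einsum("id,ie->de", q[:, s1], q[:, s2]) / q.shape[0]
    marginal_1 = joint.sum(axis=1)  # distribution over units of segment s1
    marginal_2 = joint.sum(axis=0)  # distribution over units of segment s2
    return entropy(marginal_1) + entropy(marginal_2) - entropy(joint)

D, S = 3, 2

# Optimal-style batch: every code appears exactly once, so segments are independent.
codes = np.array(list(itertools.product(range(D), repeat=S)))
q_opt = np.eye(D)[codes]
mi_independent = segment_mutual_information(q_opt, 0, 1)

# Redundant batch: the second segment always copies the first one.
copied = np.stack([np.arange(D), np.arange(D)], axis=1)
q_copy = np.eye(D)[copied]
mi_redundant = segment_mutual_information(q_copy, 0, 1)
```

In the independent batch the mutual information vanishes, as in Eq. (6), while the fully redundant batch yields $\log D_S$, the largest value two $D_S$-way segments can share.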
Moreover, it can be derived that the optimal MUSIC embedding features have zero covariance between any two features in different segments and negative covariance between the features within the same segment; for details, please see Appendix B.

Discriminative encoding: Contrastive learning or instance discrimination has proven very effective for representation learning by maximizing the similarity between different transformations of the same instance while discriminating the reference from other instances. We emphasize that MUSIC is consistent with contrastive learning and discriminates instances in a novel way. Specifically, the optimal MUSIC embedding can encode $(D_S)^S$ different samples in total. In our default settings, $D_S = 80$ and $S = 102$ (see Section 3 for details), so MUSIC can represent $80^{102}$ different samples. Maximizing the joint entropy means that any two units from every two segments have equal probability of co-occurring; that is, a batch of samples is evenly assigned to all possible embeddings. Since the number of all possible embeddings is much larger than the batch size ($80^{102}$ vs. 2,048), different instances are forced to be encoded with different embeddings. Like contrastive learning, this ensures non-trivial solutions. The difference is that contrastive learning differentiates instances by pushing the reference away from its negative instances, while MUSIC intrinsically assigns instances different attribute codes in an information-principled manner. In Appendix C, the individual MUSIC embeddings and the empirical joint probability matrix learned on the ImageNet dataset are visualized, showing that the empirical results are consistent with the above theoretical analysis. In summary, the MUSIC embedding features optimized with the entropy-based loss are transform-invariant, non-trivial, diverse, and discriminative.
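To put the capacity argument in numbers, the short Python snippet below compares the count of possible one-hot codes under the default settings ($D_S = 80$, $S = 102$) with the batch size of 2,048 quoted above. It is plain arithmetic for illustration; the variable names are chosen here and do not come from the paper.

```python
import math

D_S, S = 80, 102       # default segment dimension and number of segments
batch_size = 2048      # batch size quoted in the text

num_codes = D_S ** S                  # exact integer: number of distinct one-hot codes
num_digits = len(str(num_codes))      # decimal digits of that count
log10_codes = S * math.log10(D_S)     # same magnitude on a log scale
codes_per_sample = num_codes // batch_size

print(f"log10(number of codes) = {log10_codes:.3f}")
print(f"number of decimal digits = {num_digits}")
```

The count is on the order of $10^{194}$, so even a very large batch occupies a vanishing fraction of the available codes, which is why an even assignment pushes distinct instances towards distinct codes.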

3. IMPLEMENTATION DETAILS

For a fair comparison, we closely followed the implementation settings in VICReg to train MUSIC models. Specifically, the standard ResNet-50 backbone (He et al., 2016) was used as the encoder that outputs a representation vector of 2,048 units in the same training settings, including the data augmentation (random cropping, horizontal flip, color jittering, grayscale, Gaussian blur, and solarization, with the same parameters as in VICReg), the LARS optimizer (You et al., 2017; Goyal et al., 2017) with a weight decay of $10^{-6}$ and a learning rate of lr = batch_size/256 × base_lr, and the cosine decay schedule (Loshchilov & Hutter, 2016) from 0 with 10 warmup epochs towards the final value of 0.002. The base learning rate base_lr was set to 0.6 in our study. By default, we used a two-layer MLP projector (8,160), the number of segments $S = 102$, the segment dimension $D_S = 80$, and $D = D_S \times S = 8{,}160$ (similar to the feature dimension of 8,192 used by VICReg and Barlow Twins). The results were respectively analyzed for different feature dimensions, depths of projectors, batch sizes, segment dimensions, and numbers of training epochs. The effect of the single extra hyperparameter $D_S$ of MUSIC was evaluated as well. The SSRL models were trained on the 1,000-class ImageNet dataset without labels and evaluated in various downstream tasks. All implementation details for downstream tasks are in Appendix G. Code will be made available.

... projector (8,192-8,160) instead of three layers (8,192). The performance of MUSIC is on par with the state-of-the-art method BYOL, which uses asymmetric techniques, such as an additional predictor and a momentum encoder. Note that the results of some excellent methods (Caron et al., 2020; Gidaris et al., 2021) based on multi-crop/multi-positive techniques are not included in Table 1. These techniques can usually boost performance further.
The comparative results show that MUSIC achieves better results than Barlow Twins and VICReg, where all these three methods trained a twin architecture without using negative pairs or any asymmetric techniques.

4.1. EVALUATION RESULTS IN DIFFERENT TASKS

Semi-Supervised Classification on ImageNet. We also evaluated MUSIC in the semi-supervised learning setting, where the ResNet-50 pre-trained with MUSIC was fine-tuned on subsets of ImageNet, including 1% and 10% of the full ImageNet dataset respectively, and all reported methods used the same subset images. Currently, MUSIC is not as good as Barlow Twins and VICReg in the semi-supervised learning settings, while it is better than BYOL and other compared methods, and significantly better than purely supervised learning without SSRL pretraining.

Table 2: Transfer Learning. For object detection and instance segmentation tasks, SSRL models pre-trained on ImageNet were used to initialize the backbone of the object detection and instance segmentation models on COCO. Mask R-CNN (He et al., 2017) with the C4 backbone variant (Wu et al., 2019) was fine-tuned using the 1× schedule. AP metrics defined by COCO are reported here. For the linear classification task, Top-1 accuracy (in %) for Places205 (Zhou et al., 2014) and mAP for VOC07 (Everingham et al., 2010) are reported.

Transfer Learning. Transfer learning is another popular way to evaluate SSRL methods, including object detection, instance segmentation, and linear classification. Our results are reported in Table 2. Note that various studies have different setups for the object detection and instance segmentation tasks. Here we closely followed (Zbontar et al., 2021), selecting the same comparison methods in the same settings. MUSIC performs on par with the current methods and slightly better than Barlow Twins on the object detection and segmentation tasks. On the other hand, the linear classification results on the VOC2007 and Places205 datasets show that MUSIC achieved better results than the selected methods. Also, similar to the other SSRL methods, MUSIC can effectively improve the downstream tasks in the transfer learning settings.
All implementation details for the reproduction of transfer learning results are in Appendix G.2.

Method                          20-NN    200-NN
NPID (Wu et al., 2018)          -        46.5
LA (Zhuang et al., 2019)        -        49.4
PCL (Li et al., 2021)           54.5     -
BYOL (Grill et al., 2020)       66.7     64.9
SwAV (Caron et al., 2020)       65.7     62.7
BT (Zbontar et al., 2021)       64.8     62.9
VICReg (Bardes et al., 2022)    64.5     62.9
MUSIC (Ours)                    67.0     64.9

Thus, MUSIC is potentially superior when applied to downstream tasks based on nearest-neighbor search. All the above results demonstrate the effectiveness and superiority of MUSIC as a new embedding strategy with the information-theoretic optimization framework. In the following subsections, the characteristics and superiority of MUSIC will be further discussed.

4.2. EMPIRICAL ANALYSIS

In this subsection, we comprehensively evaluate the proposed MUSIC method in various settings and compare it with other SSRL methods where the corresponding results in the same or comparable settings were already reported. All the models were evaluated with linear classification on ImageNet.

Effect of Batch Size. SSRL methods usually require a large batch size or a memory bank, especially for contrastive learning. Here we evaluated MUSIC with different batch sizes, and the results are reported in Table 4. They show that MUSIC achieved consistently better results than the latest method VICReg over different batch sizes. As discussed in Subsection 2.4, an intrinsic property of MUSIC is to discriminatively encode different instances, making it work well without a large number of contrastive samples.

... (Chen & He, 2021; Zbontar et al., 2021; Bardes et al., 2022) show that using a three-layer MLP as the projector achieved the best results. However, MUSIC behaves differently: a two-layer MLP achieved the best results, as shown in Table 6. This may be due to the discriminability and diversity of MUSIC embeddings, which allow it to learn informational representations more effectively. At the same time, the computational cost can be reduced, especially for the fully-connected MLP with high-dimensional inputs and outputs. The running time per 100 epochs and the peak memory per GPU for different projector depths are reported in Table 6, where the computational environment is described in Appendix E. Moreover, the comparison results of different methods in Appendix F show that MUSIC can not only reduce the running time and memory cost but also achieve better performance than Barlow Twins and VICReg.

Effect of Feature Dimension. In the previous Barlow Twins and VICReg studies, it was found that a very high-dimensional embedding vector is necessary for improving the representation learning performance. For MUSIC, the feature dimension plays an important role as well.
The results of different feature dimensions for VICReg and MUSIC are reported in Table 7, where the dimensions of MUSIC embeddings are similar to those of VICReg embeddings while keeping the dimension of each segment the same, $D_S = 80$. It can be seen that MUSIC achieved consistently better results than VICReg on different embedding feature dimensions. Importantly, when the embedding feature dimension was reasonably large (4,096 and 8,192), MUSIC achieved its best results, which are even better than the best results of VICReg using the larger dimension of 16,384. This is because minimizing linear correlation, as the existing methods do, cannot ensure that the non-linear dependency is minimized, while MUSIC can minimize any form of dependency between any two feature variables. Therefore, the redundancy between MUSIC feature variables tends to be lower than in the existing methods, so that the feature dimension can be reduced for even better results. In principle, a large embedding feature dimension (i.e., 16,384) significantly increases the computational and memory cost for Barlow Twins, VICReg, and MUSIC, which compute the covariance or joint probability matrix; this was also discussed in the Barlow Twins study (Zbontar et al., 2021). This point is demonstrated in Table 7 by evaluating running time and memory cost, where the computational environment is described in Appendix E. Thus, MUSIC is both efficient and effective.

Effect of Loss Function. The effect of different loss terms was evaluated in Table 8, where DE, OE, TIC, and TI denote the diagonal entropy loss, off-diagonal entropy loss, transformation invariance loss implemented with cross-entropy, and transformation invariance loss implemented with inner product, respectively. As described in Subsection 2.4, only optimizing the entropy loss (DE+OE) allows MUSIC to avoid trivial solutions and learn informational representations.
This theoretical analysis is consistent with the empirical results in Table 8, where 65.4% Top-1 accuracy was achieved using the entropy loss only, comparable to some methods reported in Table 1. Adding the enhanced transformation invariance constraint at the instance level significantly improved the performance, as also discussed in Subsection 2.3. Without the DE loss, the results were significantly degraded, as the DE loss not only enhances the transformation invariance but also helps learn non-trivial and complementary attributes in each and every segment. In our experiments, minimizing the cross-entropy degraded the performance compared with the inner-product implementation.

Effect of Segment Dimension. Finally, the effect of our unique hyperparameter, i.e., the segment dimension, was evaluated. Our empirical results with different segment dimensions in Table 9 indicate that $D_S = 80$ achieved the best results, where the dimension of the whole embedding vector was kept the same. The results also show that the representation performance is not sensitive to this hyperparameter.

5. DISCUSSIONS AND CONCLUSION

From pairwise independence to mutual independence. Although MUSIC promotes pairwise independence by minimizing the pairwise dependence between feature variables, mutual independence among multiple feature variables cannot be ensured (Gallager, 2013). In other words, redundancy may still exist among multiple feature variables. Similar to the study (Niu & Wang, 2022) that imposes high-order moment constraints on the embedding features, maximizing the joint entropy among multiple variables could be implemented to reduce the redundancy further for even better self-learning performance.

From representation learning to hierarchical clustering. Each segment in MUSIC can be regarded as a clustering head, while different segments promote different and independent clustering criteria. In this study, we mainly focus on the general representation learning task, where each segment has the same number of clusters and every two segments are independent. This idea could be extended to hierarchical clustering; e.g., different segments may have different numbers of hierarchical clusters, and the independence constraint between segments can be adapted with task priors. More discussions and comparisons between MUSIC and related methods are in Appendix D.

In conclusion, we have presented the MUlti-Segmental Informational Coding (MUSIC) scheme that discretizes feature variables, together with a new information-theoretic SSRL framework. Theoretical analysis ensures that the optimized embedding features are transform-invariant, non-trivial, diverse, and discriminative. Importantly, MUSIC can minimize any form of dependency between feature variables beyond the linear correlation in current methods. Various evaluation results have clearly shown the effectiveness and superiority of MUSIC in terms of both accuracy and efficiency. MUSIC could be adapted to more downstream tasks, such as clustering and dense prediction tasks.
Since every diagonal block has the same optimal solution, we only need to consider the s th diagonal block, and the objective function can be simplified as L ent (s, s) = D S d=1 P (s, s, d, d) log(P (s, s, d, d)) (B-8) where 0 ≤ P (s, s, d, d) ≤ 1, 0 ≤ D S d=1 P (s, s, d, d) ≤ 1. Then, it is easy to find the solution that minimizes this objective function; i.e., ∀s, d, P (s, s, d, d) = 1 D S . Thus, ∀s, d, we have d P (s, s, d, d) = D S d=1 1 D S = 1 (B-9) As defined in Eqs. ( 1) and ( 2), we have ∀s, d, 0 ≤ q ′ i (s, d) ≤ 1, 0 ≤ q ′′ i (s, d) ≤ 1, D S d=1 q ′ i (s, d) = 1, D S d=1 q ′′ i (s, d) = 1, and P (s, s, d, d) = 1 N N i=1 q ′ i (s, d)q ′′ i (s, d). Given the above conditions, let us next prove that for ∀s, ∃d, q ′ i (s, d) = q ′′ i (s, d) = 1 by contradic- tion. If its negation is true, i.e., ∀s, d, either 0 ≤ q ′ i (s, d) < 1 or 0 ≤ q ′′ i (s, d) < 1, then we have ∀s, d, either q ′ i (s, d) < D S d ′ =1 q ′ i (s, d ′ ) = 1, or q ′′ i (s, d) < D S d ′′ =1 q ′′ i (s, d ′′ ) = 1. For q ′′ i (s, d) < D S d ′′ =1 q ′′ i (s, d ′′ ) = 1, we have D S d=1 P (s, s, d, d) = D S d=1 1 N N i=1 q ′ i (s, d)q ′′ i (s, d) = 1 N N i=1 D S d=1 q ′ i (s, d)q ′′ i (s, d) < 1 N N i=1 D S d=1 q ′ i (s, d) D S d ′′ q ′′ i (s, d ′′ ) = 1 (B-10) That is, d P (s, s, d, d) < 1, which leads to a contradiction with Eq. (B-9). Similarly, we have the same contradiction for q ′ i (s ′ , d) < D S d ′ =1 q ′ i (s ′ , d ′ ) = 1. Therefore, the statement that ∀s, ∃d, q ′ i (s, d) = q ′′ i (s, d) = 1 is true. It means that for ∀s, q ′ i (s, :) and q ′′ i (s, :) are one-hot vectors and equal to each other. Because ∀s, d, P (s, s, d, d) = 1 D S , q ′ (s, d) = q ′′ (s, d), and q ′ (s, :) and q ′′ (s, :) are one-hot vectors, then P (s, s, d, d) = 1 N N i=1 q ′ i (s, d)q ′′ i (s, d) = 1 N N i=1 q ′ i (s, d) = p(s, d) = 1 D S . Covariance of the optimal solution. 
The optimal solution to maximize the joint entropy over the off-diagonal blocks for the second part is $\forall s' \ne s''$, $P(s', s'', d', d'') = E[q(s', d')\, q(s'', d'')] = \frac{1}{(D_S)^2}$. According to the above proof, we have $\forall s, d$, $p(s, d) = E[q(s, d)] = \frac{1}{D_S}$. We can theoretically demonstrate that the covariance is zero between any two units from different segments. Specifically, $\forall s', s'', d', d''$ with $s' \ne s''$, we have

$$\mathrm{cov}[q(s', d'), q(s'', d'')] = E[q(s', d')\, q(s'', d'')] - E[q(s', d')]\, E[q(s'', d'')] = \frac{1}{(D_S)^2} - \frac{1}{D_S} \times \frac{1}{D_S} = 0. \tag{B-11}$$

Since $\forall s$, $\sum_{d', d''} P(s, s, d', d'') = 1$ and $\forall s, d' = d''$, $P(s, s, d', d'') = \frac{1}{D_S}$, then $\forall s, d' \ne d''$, $P(s, s, d', d'') = 0$. We can demonstrate that any two units within the same segment are negatively correlated. Formally, $\forall s, d', d''$ with $d' \ne d''$, we have

$$\mathrm{cov}[q(s, d'), q(s, d'')] = E[q(s, d')\, q(s, d'')] - E[q(s, d')]\, E[q(s, d'')] = P(s, s, d', d'') - p(s, d')\, p(s, d'') = 0 - \frac{1}{D_S} \times \frac{1}{D_S} = -\frac{1}{(D_S)^2}. \tag{B-12}$$

That is, every unit within each segment encodes discriminative and complementary features, while the units from different segments encode unrelated and diverse features.

the first segment represent different types of textures; e.g., the feature unit indexed by 24 represents the dot-style textures, and the units 6, 8, and 25 correspond to other specific textures/patterns. In the second segment, some features represent different shapes; e.g., the unit 124 abstracts a "∩" shape, and the units 88, 134, and 138 represent other shapes/patterns. The first and second segments use different principles to group samples; e.g., the image containing red mushrooms is grouped in the first segment (indexed by 24) with objects having similar textures, while in the second segment it is grouped (indexed by 134) with images having twin/repeated objects.
These visual results indicate that the learned MUSIC embedding features are indeed meaningful and consistent with the general properties illustrated in Fig. 1, which are ensured by the theoretical analysis.
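The covariance results in Eqs. (B-11) and (B-12) can be checked numerically on a toy optimal solution. The sketch below is only an illustration under stated assumptions: the sizes (2 segments of 3 units) are hypothetical, and the codes are built by enumerating every combination of active units so that each segment is uniform and the segments are independent.

```python
from fractions import Fraction
from itertools import product

S, D_S = 2, 3  # hypothetical toy sizes: 2 segments with 3 units each


def one_hot(index, size):
    return [1 if k == index else 0 for k in range(size)]


# One sample per combination of active units, so each segment is uniform
# over its units and the two segments are independent of each other.
codes = []
for active in product(range(D_S), repeat=S):
    q = []
    for d in active:
        q.extend(one_hot(d, D_S))
    codes.append(q)

N = len(codes)


def unit(s, d):
    """Flat index of unit d in segment s."""
    return s * D_S + d


def mean(u):
    return Fraction(sum(q[u] for q in codes), N)


def cov(u, v):
    joint = Fraction(sum(q[u] * q[v] for q in codes), N)
    return joint - mean(u) * mean(v)


print(mean(unit(0, 0)))             # 1/3
print(cov(unit(0, 0), unit(1, 2)))  # 0
print(cov(unit(0, 0), unit(0, 1)))  # -1/9
```

Units from different segments have zero covariance, while two units in the same segment have covariance $-1/(D_S)^2$, matching the two equations for this toy case.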

APPENDIX D RELATED WORK

For self-supervised representation learning (SSRL), various pretext tasks have been designed, such as denoising auto-encoders (Vincent et al., 2008), context auto-encoders (Pathak et al., 2016), colorization and cross-channel auto-encoders (Zhang et al., 2016; 2017), masked auto-encoders (He et al., 2022), rotation (Gidaris et al., 2018; 2020), patch ordering (Noroozi & Favaro, 2016; Doersch et al., 2015; Chen et al., 2021), clustering (Caron et al., 2018; 2019; Asano et al., 2019; Yan et al., 2020; Huang et al., 2019; Zhuang et al., 2019; Gidaris et al., 2021), and instance discrimination (Dosovitskiy et al., 2014; Wu et al., 2018; Tian et al., 2020b; Ye et al., 2019; Dwibedi et al., 2021). Here we compare MUSIC with the most related SSRL methods, including contrastive learning, asymmetric non-contrastive learning, clustering-based SSRL, dense SSRL, and the more recent non-asymmetric and non-contrastive learning methods.

Contrastive learning. Contrastive learning based SSRL methods (Chen et al., 2020a; He et al., 2020) need to directly compare the features of negative pairs. Thus, large batch sizes are required, as in SimCLR (Chen et al., 2020a). MoCo (He et al., 2020) uses a memory bank to store a large number of features as negative samples so that small batch sizes can be used, but it requires a momentum updating technique. The theoretical analysis of contrastive learning is based on estimating a lower bound on the mutual information between different views (Oord et al., 2018). In contrast, MUSIC does not need negative samples, a memory bank, or momentum updating, yet it can still discriminate instances. Without directly comparing a large number of negative pairs for instance discrimination, MUSIC naturally encodes different instances with different embeddings by maximizing the joint entropy, as demonstrated in our theoretical analysis.
MUSIC can directly use information measurements for both optimization and analysis instead of estimating a lower bound.

Asymmetric non-contrastive learning. BYOL (Grill et al., 2020) and SimSiam (Chen & He, 2021) demonstrate that meaningful representations can be learned without using negative pairs. However, these methods depend on asymmetric architectures and stop-gradient techniques to avoid trivial solutions. The follow-up theoretical analyses (Wang & Isola, 2020; Zhang et al., 2021a; Richemond et al., 2020; Tian et al., 2021) leverage various concepts under different assumptions to explain why these methods can avoid trivial solutions. MUSIC requires neither negative samples nor asymmetric designs. Moreover, MUSIC enables a new information-theoretic SSRL framework in which the information measures are directly used for both numerical optimization and theoretical analysis, which intrinsically avoids trivial solutions and learns meaningful features.

Clustering-based SSRL. DeepCluster (Caron et al., 2018) iteratively performs clustering with an extra k-means algorithm on the features extracted in the previous step, and updates the weights of the network using the cluster assignments as supervision. To avoid trivial solutions, random samples are selected for each empty cluster to compute its centroid. Running k-means over a whole large dataset is time-consuming, and randomly selected samples are unlikely to form a meaningful cluster. Similarly, SELA (Asano et al., 2019) leverages the Sinkhorn-Knopp algorithm to iteratively perform clustering and optimize the clustering network with the assigned cluster labels in an online manner. SwAV (Caron et al., 2020) alternately computes the cluster assignment of one view and optimizes the network to predict the same assignment for other views of the same sample. As a contrastive learning method, SwAV still requires a large number of prototype vectors for negative comparisons between embeddings and codes.
It can be seen that all these clustering methods need to compute a large number of extra cluster centers and rely on extra algorithms to compute assignments. From the clustering perspective, each segment in the MUSIC embedding can be regarded as a cluster assignment, and multiple segments can be regarded as performing multiple clusterings simultaneously, where different segments follow different clustering principles. Different from current clustering-based methods, MUSIC does not require computing a large number of cluster centers or an extra algorithm to estimate the cluster assignments iteratively.

Dense SSRL. Some excellent studies, such as DenseCL (Wang et al., 2021) and DenseSiam (Zhang et al., 2022), design self-supervised/unsupervised learning methods for improving dense prediction tasks, including object detection (Tian et al., 2019), instance segmentation (Zang et al., 2021), and semantic segmentation (Zhang et al., 2021b). The basic idea is to enhance the pixel-level and region-level consistency in the self-supervised/unsupervised learning setting. DenseCL (Wang et al., 2021) proposes to optimize a pairwise contrastive (dis)similarity loss at the pixel level between two views of input images. Most recently, DenseSiam (Zhang et al., 2022) proposes to optimize the consistency at different levels based on a simple Siamese network without needing negative pixel pairs, momentum encoders, or heuristic masks. MUSIC could also be used for dense prediction tasks by discretizing the feature vector of each sub-region/pixel and learning dense representations in the information-theoretic framework.

Non-asymmetric and non-contrastive learning. Recently, WMSE (Ermolov et al., 2021), Barlow Twins (Zbontar et al., 2021), and VICReg (Bardes et al., 2022) propose to train a simple twin network architecture using covariance-matrix-based loss functions without needing any asymmetric or contrastive learning techniques.
Specifically, WMSE (Ermolov et al., 2021) proposes to minimize the MSE distance between different views and to enforce the self-covariance matrix to be an identity matrix. Barlow Twins (Zbontar et al., 2021) optimizes the cross-covariance matrix to be an identity matrix. VICReg (Bardes et al., 2022) proposes three loss terms: invariance, variance, and covariance. The theoretical analysis of Barlow Twins (Zbontar et al., 2021) assumes that the embedding features follow a Gaussian distribution, so that the loss function can be interpreted in terms of information measures under some approximations. One of the key ideas of these methods is to reduce the redundancy between feature variables by minimizing the linear correlation. In contrast, MUSIC discretizes the feature variables, making the probability distribution estimable, so that the information measures (entropy, mutual information, and (in)dependence) can be directly used for both numerical optimization and theoretical analysis without assuming a Gaussian distribution. Importantly, our theoretical analysis shows that MUSIC minimizes exactly the mutual information, i.e., any form of dependence, between feature variables, which goes beyond the linear correlation constraints used in current methods. That is why MUSIC can use a shorter embedding vector and achieve even better results at a lower computational cost.

Linear classification: For all evaluation experiments on ImageNet linear classification, we followed the standard procedure in which a linear classifier is trained on top of the frozen backbone of a ResNet-50 pre-trained with MUSIC. The SGD optimizer was used with a learning rate of 0.02, a cosine decay, a weight decay of $10^{-6}$, a batch size of 256, and 100 training epochs. In the training stage, the images were augmented by a composition of random cropping with a scale ratio of 0.2 to 1.0, resizing to 224 × 224, and random horizontal flips.
In the testing stage, the images were simply cropped from the image center and resized to 224 × 224.

Semi-supervised learning: In the semi-supervised learning setting, a linear classifier was appended to the backbone pre-trained with MUSIC, and the network was fine-tuned using 1% and 10% of the labels. The SGD optimizer was used with no weight decay and a batch size of 256, and the model was trained for 20 epochs. In the 1% of labels case, we used a learning rate of 0.08 for the encoder and 0.1 for the linear head. In the 10% of labels case, we used 0.02 for the encoder and 0.1 for the linear head. All learning rates followed a cosine decay schedule. The data augmentation steps for training and testing followed the same settings as the linear evaluation.
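The cosine decay mentioned above can be sketched as follows. The exact form used in the experiments is not stated in the text, so this assumes the common half-cosine schedule without warmup, and `cosine_lr` is a hypothetical helper name rather than a function from any library.

```python
import math


def cosine_lr(base_lr, epoch, total_epochs):
    """Cosine-decayed learning rate at the start of `epoch` (0-indexed).

    Assumed form: decays from base_lr at epoch 0 to 0 at total_epochs.
    """
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * epoch / total_epochs))


# Semi-supervised setting with 1% of labels: 20 epochs, with separate
# base rates for the encoder (0.08) and the linear head (0.1).
for epoch in (0, 5, 10, 15, 20):
    encoder_lr = cosine_lr(0.08, epoch, 20)
    head_lr = cosine_lr(0.1, epoch, 20)
    print(f"epoch {epoch:2d}: encoder lr={encoder_lr:.4f}, head lr={head_lr:.4f}")
```

Under this assumption both parameter groups share the same decay shape and differ only in their base rate, which is how per-group learning rates are usually combined with a single schedule.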

G.2 TRANSFER LEARNING

Object detection and instance segmentation: Mask R-CNN (He et al., 2017) with the C4 backbone was trained on the COCO 2017 train split and tested on the validation set. We used a learning rate of 0.1 and kept the other parameters the same as in the 1× schedule in detectron2.

Linear classification: We followed the exact settings from PIRL (Misra & Maaten, 2020) in evaluating linear classifiers on the Places-205 and VOC07 datasets. For Places-205, a linear classifier was trained using the SGD optimizer for 14 epochs with a learning rate of 0.01 reduced by a factor of 10 at epochs 5 and 10, a weight decay of $5 \times 10^{-4}$, and a momentum of 0.9. For the VOC07 dataset, we trained SVM classifiers, where the C values were selected using cross-validation.

Here we further evaluated the characteristics of MUSIC in terms of the segment dimension on the CIFAR10 dataset. Since they use neither asymmetric nor contrastive techniques, Barlow Twins and VICReg are regarded as the baseline methods for MUSIC. Specifically, the batch size was 512, the total dimension of the embedding vector was 1,024, the base learning rate was 0.1, the number of training epochs was 800, and all other hyper-parameters were kept at the default settings for MUSIC, Barlow Twins, and VICReg. Top-1 accuracy results are reported in Table 11, showing that the segment dimension should be adjusted according to the target dataset, much as the network architecture and the total feature dimension are usually chosen according to the scale and complexity of the target dataset. Nevertheless, MUSIC achieved better results than the baseline models over a large range of segment dimensions under the same evaluation setting.
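To make the segment-dimension trade-off concrete, the sketch below lists how a fixed 1,024-dimensional embedding splits into segments and gives an upper bound on the entropy of the resulting discrete code, which is reached only when every segment is uniform and all segments are mutually independent. The segment dimensions shown are hypothetical examples, not necessarily those evaluated in Table 11, and `segment_layout` is an illustrative helper.

```python
import math


def segment_layout(total_dim, segment_dim):
    """Return (number of segments, upper bound on code entropy in bits)."""
    if total_dim % segment_dim != 0:
        raise ValueError("segment_dim must divide total_dim")
    num_segments = total_dim // segment_dim
    # S one-hot segments with D_S states each carry at most S * log2(D_S) bits.
    max_bits = num_segments * math.log2(segment_dim)
    return num_segments, max_bits


for d_s in (2, 8, 32, 128, 1024):  # hypothetical segment dimensions
    s, bits = segment_layout(1024, d_s)
    print(f"D_S={d_s:5d}  S={s:4d}  max code entropy={bits:6.1f} bits")
```

Small segments give many coarse attributes and large segments give few fine-grained ones, which is one way to read why the best segment dimension depends on the dataset.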



Figure 1: Illustration of the MUSIC vector with piano keys. The MUSIC embedding vector consists of multiple segments (Seg-1, ..., Seg-S) representing different types of attributes shown in different colors. Each segment is associated with a set of discrete attributes; e.g., Seg-2 represents the texture attribute, and different units in Seg-2 specify different textural patterns, like dots, stripes, etc. Each segment is discretized with a one-hot vector q(s, :).

Figure 2: SSRL framework through multi-segmental informational coding optimized with maximum entropy. Here there are four segments and each segment consists of four units for illustration.

Comparison of SSRL methods on ImageNet for linear and semi-supervised classification. Top-1 and Top-5 accuracies (in %) are reported. The best three results are underlined.

Linear Classification on ImageNet. Linear probing is the commonly used evaluation protocol that trains a linear classifier on top of the frozen representations to evaluate the performance of SSRL methods. Consistent with Barlow Twins and VICReg, a ResNet-50 backbone was trained with a batch size of 2,048 for 1,000 epochs on the training set of ImageNet, and the linear classification results, including Top-1 and Top-5 accuracies of different methods on the evaluation set, are reported in Table 1. The difference from Barlow Twins and VICReg is that MUSIC used a two-layer MLP

are based on the frozen representations pre-trained on ImageNet. The best results are in bold.

KNN classification. Top-1 accuracy with 20 and 200 nearest neighbors are reported. The best results are highlighted in bold.

Batch Size.

Training epochs. Top-1 accuracy (in %) of linear classification on ImageNet using ResNet-50. The best results are highlighted in bold while the second best results are underlined.

Effect of Epoch Number. The SSRL methods in different studies do not always use the same number of training epochs due to different computational costs and environments. MUSIC was evaluated with different numbers of training epochs, as reported in Table 5. MUSIC is consistently better than most of the existing methods for all numbers of training epochs. When the number of training epochs was small (100 and 200), MUSIC converged to the best results.

Projector Depth. The best results are highlighted in bold.

Feature Dimension. The best Top-1 accuracies are highlighted in bold.

Loss terms. The best results are highlighted in bold.

Segment Dimension. The best results are highlighted in bold.

Running time and peak memory. Comparison of different methods in terms of the running time over 100 epochs, the peak memory on a single GPU, and the top-1 accuracy (%) on linear classification on top of the frozen representations. All models were distributively trained on 32 Tesla V100 GPUs.

Top-1 linear classification accuracies on CIFAR10. Here the embedding dimension for all models was set to 1024. The MUSIC results with different segment dimensions are reported.

APPENDIX A PYTORCH PSEUDOCODE

An exemplary PyTorch-style implementation of MUSIC is described in Algorithm 1.

Algorithm 1: PyTorch-style pseudocode for MUSIC

    # f: network function
    # lambda: weight on the transformation invariance loss term
    # N: batch size
    # D: dimensionality of the embedding vector
    # D_S: dimensionality of each segment
    # S = D / D_S: number of segments
    # select: select the diagonal elements of diagonal blocks
    #         and all elements of off-diagonal blocks

    for x in loader:  # load a batch with N samples
        # two randomly augmented versions of x
        x', x'' = augment(x)

        # segment-wise probabilities; this step is missing from the listing and is
        # assumed here to be a softmax over each segment, following Eqs. (1) and (2)
        q' = torch.softmax(torch.reshape(f(x'), [N, S, D_S]), dim=2)    # N × S × D_S
        q'' = torch.softmax(torch.reshape(f(x''), [N, S, D_S]), dim=2)  # N × S × D_S

        # compute transformation invariance loss
        loss_TI = -torch.log((q' * q'').sum(dim=2)).mean()

        # compute entropy loss
        q' = torch.reshape(q', [N, D])    # N × D
        q'' = torch.reshape(q'', [N, D])  # N × D
        # compute empirical joint probability distribution
        P = torch.einsum('np,nq->pq', [q', q'']) / N  # D × D
        P_s = select(P)
        loss_ent = (P_s * torch.log(P_s)).sum() / (S * S)

        # final loss
        loss = loss_ent + lambda * loss_TI  # lambda=1 by default

        # optimization step
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
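The loss computation in Algorithm 1 can also be written as a small dependency-free sketch, which makes the role of the `select` step explicit. This is an illustration under stated assumptions, not the authors' implementation: it uses plain Python lists instead of PyTorch tensors, the names `segment_softmax`, `music_loss`, `lam`, and `eps` are hypothetical, and the small `eps` inside the logarithms is an added numerical safeguard that does not appear in the pseudocode.

```python
import math
from itertools import product


def segment_softmax(z, S, D_S):
    """Softmax over each length-D_S segment of every row of z (N x D)."""
    out = []
    for row in z:
        q = []
        for s in range(S):
            seg = row[s * D_S:(s + 1) * D_S]
            m = max(seg)
            e = [math.exp(v - m) for v in seg]
            total = sum(e)
            q.extend(v / total for v in e)
        out.append(q)
    return out


def music_loss(q1, q2, S, D_S, lam=1.0, eps=1e-12):
    """Entropy loss plus lam times the transformation-invariance loss."""
    N, D = len(q1), S * D_S
    # Transformation invariance: -log of per-segment agreement, averaged over N * S.
    loss_ti = 0.0
    for a, b in zip(q1, q2):
        for s in range(S):
            agree = sum(a[k] * b[k] for k in range(s * D_S, (s + 1) * D_S))
            loss_ti -= math.log(agree + eps)
    loss_ti /= N * S
    # Entropy term over the selected entries of the empirical joint distribution P.
    loss_ent = 0.0
    for p in range(D):
        for q in range(D):
            same_block = (p // D_S) == (q // D_S)
            if same_block and p != q:
                continue  # `select` drops off-diagonal elements of diagonal blocks
            P_pq = sum(a[p] * b[q] for a, b in zip(q1, q2)) / N
            loss_ent += P_pq * math.log(P_pq + eps)
    loss_ent /= S * S
    return loss_ent + lam * loss_ti


# Toy check with 2 segments of 3 units: diverse one-hot codes versus a collapsed code.
S, D_S = 2, 3
diverse = [[1.0 if k == d else 0.0 for d in active for k in range(D_S)]
           for active in product(range(D_S), repeat=S)]
collapsed = [diverse[0]] * len(diverse)
print(music_loss(diverse, diverse, S, D_S) < music_loss(collapsed, collapsed, S, D_S))  # True
```

In this toy case the diverse, view-consistent codes reach a lower loss than the collapsed solution, in line with the claim that the objective avoids trivial solutions.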

APPENDIX B THEORETICAL ANALYSIS

The optimal solution to the first part in $L_{ent}$. As described in Subsection 2.4, the entropy loss function consists of two parts: (1) the entropy over diagonal elements of diagonal blocks and (2) the entropy over all elements of off-diagonal blocks, as illustrated in Fig. 2. Now, let us minimize the first part, which is formulated as

To qualitatively evaluate whether meaningful embedding features are learned, in Fig. C.4 we show some examples assigned to specific units in the first two segments, where the whole embedding vector has 8,160 units organized into 102 segments, and each segment has 80 units. Specifically, some features in

APPENDIX E COMPUTATIONAL ENVIRONMENT

MUSIC models were trained in a distributed manner on four nodes, each with the following system configuration:
• 2× 20-core 2.5 GHz Intel Xeon Gold 6248;
• 8× NVIDIA Tesla V100 GPUs, each with 32 GiB HBM;
• 768 GiB RAM per node;
• Dual 100 Gb EDR InfiniBand.

APPENDIX F RUNNING TIME

In Table 10, the computational cost of MUSIC was evaluated and compared with other methods. All methods were run on 32 Tesla V100 GPUs. These methods offer different trade-offs among running time, memory, and performance. SwAV with multi-crop and BYOL achieve better performance at the expense of additional computation and memory usage. Barlow Twins and VICReg have balanced results, with less memory than BYOL and SwAV (multi-crop) and faster speed than SwAV (multi-crop), but slightly worse performance. Compared with the most related Barlow Twins and VICReg methods, MUSIC not only reduces the running time and memory usage significantly, but also improves the performance. This is because MUSIC can use a shallower fully-connected MLP head while achieving better performance, as discussed in Subsection 4.2. The computational cost of MUSIC can be reduced further by using a 2× lower embedding dimension, with only a very slight degradation in performance, as discussed in Subsection 4.2.

