DINO AS A VON MISES-FISHER MIXTURE MODEL

Abstract

Self-distillation methods using Siamese networks are popular for self-supervised pre-training. DINO is one such method, based on a cross-entropy loss between K-dimensional probability vectors obtained by applying a softmax function to the dot product between representations and learned prototypes. Since the learned representations are L2-normalized, we show that DINO and its derivatives, such as iBOT, can be interpreted as a mixture model of von Mises-Fisher components. Under this interpretation, DINO assumes equal precision for all components when the prototypes are also L2-normalized. Using this insight, we propose DINO-vMF, which adds appropriate normalization constants when computing the cluster assignment probabilities. Unlike DINO, DINO-vMF is stable also for the larger ViT-Base model with unnormalized prototypes. We show that the added flexibility of the mixture model yields better image representations. The DINO-vMF pre-trained model consistently outperforms DINO on a range of downstream tasks. We obtain similar improvements for iBOT-vMF over iBOT, showing that our proposed modification is also relevant for other methods derived from DINO.
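To make the mixture-model view concrete, the following is a minimal sketch (not the authors' implementation) of computing cluster assignment probabilities as vMF mixture responsibilities. It assumes equal mixture weights and omits DINO's temperature and centering; the concentration of each component is taken as the norm of its (unnormalized) prototype, so the only change relative to a plain softmax over dot products is the added per-component log-normalization constant.

```python
import numpy as np
from scipy.special import ive  # exponentially scaled modified Bessel function


def log_vmf_normalizer(kappa, d):
    """log C_d(kappa) for the vMF density C_d(kappa) * exp(kappa * mu^T x).

    C_d(kappa) = kappa^(d/2 - 1) / ((2*pi)^(d/2) * I_{d/2-1}(kappa)).
    Uses ive (I_v(k) = ive(v, k) * exp(k)) for numerical stability.
    """
    v = d / 2 - 1
    log_bessel = np.log(ive(v, kappa)) + kappa
    return v * np.log(kappa) - (d / 2) * np.log(2 * np.pi) - log_bessel


rng = np.random.default_rng(0)
B, D, K = 4, 16, 8                                 # batch, feature dim, prototypes

# L2-normalized representations, as in DINO.
z = rng.normal(size=(B, D))
z /= np.linalg.norm(z, axis=-1, keepdims=True)

# Unnormalized prototypes w_k = kappa_k * mu_k: the norm plays the role of
# the per-component concentration, the direction is the component mean.
W = rng.normal(size=(K, D))
kappa = np.linalg.norm(W, axis=-1)

# Per-component log-density of each z: kappa_k * mu_k^T z + log C_D(kappa_k)
# = z^T w_k + log C_D(kappa_k). Plain DINO would use z @ W.T alone.
logits = z @ W.T + log_vmf_normalizer(kappa, D)

# Responsibilities = softmax over components (equal mixture weights assumed).
p = np.exp(logits - logits.max(axis=-1, keepdims=True))
p /= p.sum(axis=-1, keepdims=True)
```

With L2-normalized prototypes all `kappa` are equal, the normalizer is a constant shift that cancels in the softmax, and the computation reduces to DINO's original assignment probabilities, which is the equal-precision assumption noted above.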

1. INTRODUCTION

Self-supervised learning (SSL) is an effective approach for pre-training models on large unlabeled datasets. The main objective of SSL pre-training is to learn representations that are transferable to a range of so-called downstream tasks. Early SSL methods achieved this through handcrafted pretext tasks that act as inductive biases in the representation learning process (Komodakis & Gidaris, 2018; Noroozi & Favaro, 2016; Kim et al., 2018; Doersch et al., 2015; Larsson et al., 2016; Zhang et al., 2016). Contrastive methods (Tian et al., 2020; Wu et al., 2018), using supervisory signals in the form of augmentation invariances and instance discrimination, have produced strong performance benchmarks. In practice, contrastive methods require large batch sizes (Chen et al., 2020a) or specialized techniques like memory banks (He et al., 2020; Misra & Maaten, 2020) to achieve the best performance. Self-distillation methods based on the Mean Teacher (Tarvainen & Valpola, 2017) framework are effective representation learners that do not require such large batch sizes for pre-training. A trivial solution, where the network learns to output the same representation irrespective of the input, is known as representation collapse. In contrastive learning, the negative samples prevent representation collapse. In the absence of negative samples, self-distillation methods use explicit approaches to avoid collapse, such as asymmetric model architectures (Grill et al., 2020; Chen & He, 2021) and whitening (Ermolov et al., 2021; Zbontar et al., 2021). Transformers (Vaswani et al., 2017), originally introduced in NLP, have emerged as a strong model architecture for vision tasks as well (Dosovitskiy et al., 2020; Liu et al., 2021). Current state-of-the-art SSL methods leverage the highly flexible Vision Transformers (ViTs) (Caron et al., 2021; Bao et al., 2021; Li et al., 2021; Zhou et al., 2021; Xie et al., 2021; Chen et al., 2021).
DINO (Caron et al., 2021) is a non-contrastive SSL method that is effective for pre-training ViT models. Interestingly, ViTs pre-trained using DINO outperformed ResNets (He et al., 2016) by a significant margin at kNN classification based on the learned representations. DINO is an influential SSL method with several state-of-the-art derivatives: MSN (Assran et al., 2022) adapts DINO to produce strong few-shot performance with enhanced training efficiency; iBOT (Zhou et al., 2021) extends DINO by adding an additional masked image modeling task; EsViT extends DINO by adding patch-level tasks that also

