DINO AS A VON MISES-FISHER MIXTURE MODEL

Abstract

Self-distillation methods using Siamese networks are popular for self-supervised pre-training. DINO is one such method, based on a cross-entropy loss between $K$-dimensional probability vectors obtained by applying a softmax function to the dot product between representations and learnt prototypes. Given that the learned representations are $L_2$-normalized, we show that DINO and its derivatives, such as iBOT, can be interpreted as a mixture model of von Mises-Fisher components. With this interpretation, DINO assumes equal precision for all components when the prototypes are also $L_2$-normalized. Using this insight we propose DINO-vMF, which adds appropriate normalization constants when computing the cluster assignment probabilities. Unlike DINO, DINO-vMF is stable also for the larger ViT-Base model with unnormalized prototypes. We show that the added flexibility of the mixture model is beneficial in terms of better image representations. The DINO-vMF pre-trained model consistently performs better than DINO on a range of downstream tasks. We obtain similar improvements for iBOT-vMF vs iBOT and thereby show the relevance of our proposed modification also for other methods derived from DINO.

1. INTRODUCTION

Self-supervised learning (SSL) is an effective approach for pre-training models on large unlabeled datasets. The main objective of SSL pre-training is to learn representations that are transferable to a range of so-called downstream tasks. Early SSL methods achieved this through handcrafted pretext tasks that act as inductive biases in the representation learning process (Komodakis & Gidaris, 2018; Noroozi & Favaro, 2016; Kim et al., 2018; Doersch et al., 2015; Larsson et al., 2016; Zhang et al., 2016). Contrastive methods (Tian et al., 2020; Wu et al., 2018), using supervisory signals in the form of augmentation invariances and instance discrimination, have produced strong performance benchmarks. In practice, contrastive methods require large batch sizes (Chen et al., 2020a) or specialized techniques like memory banks (He et al., 2020; Misra & Maaten, 2020) to achieve the best performance. Self-distillation methods based on the Mean Teacher (Tarvainen & Valpola, 2017) framework are effective representation learners that do not require such large batch sizes for pre-training. A trivial solution where the network learns to output the same representation irrespective of the input is known as representation collapse. The negative samples in contrastive learning prevent representation collapse. In the absence of negative samples, self-distillation methods use explicit approaches to avoid collapse, such as asymmetric model architecture (Grill et al., 2020; Chen & He, 2021) and whitening (Ermolov et al., 2021; Zbontar et al., 2021). Transformers (Vaswani et al., 2017), originally introduced in NLP, have emerged as a strong model architecture for vision tasks as well (Dosovitskiy et al., 2020; Liu et al., 2021). Current state-of-the-art SSL methods leverage the highly flexible Vision Transformers (ViTs) (Caron et al., 2021; Bao et al., 2021; Li et al., 2021; Zhou et al., 2021; Xie et al., 2021; Chen et al., 2021).
DINO (Caron et al., 2021) is a non-contrastive SSL method that is effective for pre-training ViT models. Interestingly, ViTs pre-trained using DINO outperformed ResNets (He et al., 2016) by a significant margin at kNN classification based on learned representations. DINO is an influential SSL method with several state-of-the-art derivatives: MSN (Assran et al., 2022) adapts DINO to produce strong few-shot performance with enhanced training efficiency; iBOT (Zhou et al., 2021) extends DINO by adding an additional masked image modeling task; EsViT extends DINO by adding patch-level tasks that also use a DINO-like formulation. These methods are all trained using the Mean Teacher framework by learning to produce consistent outputs in the probability simplex. The networks output softmax-logit scores based on an inner product between a learned representation and a set of prototypes. We provide a better understanding of DINO, and its derivatives, by taking a closer look at this inner-product formulation. We interpret DINO as a von Mises-Fisher mixture model under certain assumptions. Based on this interpretation, we propose DINO-vMF, a modified version of DINO that adds flexibility in the learned latent space while keeping the training stable. DINO-vMF pre-training consistently improves performance on similar downstream tasks as DINO. We also show that the larger ViT models achieve significantly improved few-shot classification performance with our pre-training. By incorporating our vMF modification in iBOT, we achieve significantly improved performance, which suggests that our modification is applicable to DINO-derived methods as well.
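To make the mixture-model view concrete, the following sketch computes vMF cluster-assignment probabilities from unnormalized prototypes. It is an illustrative reading of the interpretation above, not the paper's implementation: we take $\mu_k = w^{(k)}/\lVert w^{(k)}\rVert$ and $\kappa_k = \lVert w^{(k)}\rVert$, and add the vMF log-normalizer $\log C_p(\kappa_k)$ to the logits, which is the term DINO omits. The function names and the use of `scipy.special.ive` are our choices.

```python
import numpy as np
from scipy.special import ive  # exponentially scaled modified Bessel I_nu


def log_vmf_normalizer(kappa, p):
    """log C_p(kappa) for a von Mises-Fisher density on the unit sphere
    in R^p, where C_p(kappa) = kappa^{p/2-1} / ((2*pi)^{p/2} I_{p/2-1}(kappa)).

    Uses the scaled Bessel function for numerical stability:
    I_nu(kappa) = ive(nu, kappa) * exp(kappa).
    """
    nu = p / 2 - 1
    return (nu * np.log(kappa)
            - (p / 2) * np.log(2 * np.pi)
            - (np.log(ive(nu, kappa)) + kappa))


def vmf_mixture_log_posterior(z, W):
    """Cluster-assignment log-probabilities of a unit vector z under a
    vMF mixture parameterized by unnormalized prototypes W of shape (K, p).

    Interpreting w_k = kappa_k * mu_k, the logit for component k is
    w_k @ z + log C_p(kappa_k); dropping the log C_p term (i.e. assuming
    equal precision for all components) recovers the plain DINO softmax.
    """
    kappa = np.linalg.norm(W, axis=1)          # per-component precision
    logits = W @ z + log_vmf_normalizer(kappa, W.shape[1])
    return logits - np.logaddexp.reduce(logits)  # log-softmax
```

For $p = 3$ the normalizer has the closed form $C_3(\kappa) = \kappa / (4\pi \sinh\kappa)$, which provides a quick sanity check of the Bessel-based computation.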

2. DINO

The self-distillation learning framework in DINO considers a teacher network $g_{\theta_t}$ and a student network $g_{\theta_s}$, with parameters $\theta_t$ and $\theta_s$, respectively. In DINO, the student network is formulated to predict a vector in the $(K-1)$-dimensional probability simplex using a softmax function. The student probability distribution is obtained as follows: $P_s^{(k)}(x) \overset{k}{\propto} \exp\big(g_{\theta_s}^{(k)}(x)\big)$, where $\overset{k}{\propto}$ indicates that the right-hand side is normalized w.r.t. the index $k$ (i.e. the equation above corresponds to a softmax). The teacher probability distribution $P_t(x)$ is computed analogously. Given an unlabeled image dataset $\mathcal{I}$, consider uniform samples $x \sim \mathcal{I}$ and two random augmentations $A_s \sim \mathcal{A}_s$, $A_t \sim \mathcal{A}_t$. By applying these augmentations, we get two views $x_s = A_s(x)$ and $x_t = A_t(x)$. The student network is trained using gradient updates to produce outputs $P_s(x_s)$ that are consistent with those of the teacher network $P_t(x_t)$ by minimizing a cross-entropy loss given by: $\min_{\theta_s} \sum_{k=1}^{K} -P_t^{(k)}(x_t) \log P_s^{(k)}(x_s)$. The teacher network parameters $\theta_t$ are only updated as an exponential moving average (EMA) of $\theta_s$. SSL methods based on Siamese networks (Bromley et al., 1993) face the representation collapse problem, where the network learns to produce the same output irrespective of the input. One approach to address this is by introducing an asymmetry between the teacher and student networks. DINO uses the same model architecture for the student and teacher networks but instead shows that adding asymmetry through centering and sharpening operations is sufficient to avoid collapse. The targets produced by the teacher network are centered to remove bias towards a cluster and sharpened using a temperature $\tau_t$ as $g_{\theta_t}(x_t) \leftarrow (g_{\theta_t}(x_t) - c)/\tau_t$. On the other hand, the student outputs are only sharpened as $g_{\theta_s}(x_s) \leftarrow g_{\theta_s}(x_s)/\tau_s$.
The centering operation prevents one of the probability components from dominating, but the solution could collapse to a uniform distribution instead. This is avoided by the sharpening operation, where the teacher uses a lower temperature value than the student, $\tau_t < \tau_s = 0.1$. The overall schematic of DINO is illustrated in Figure 1a.
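The loss and the centering/sharpening asymmetry described above can be sketched as follows. This is a minimal numpy illustration of the equations, not the official DINO code; the function names and the default temperatures ($\tau_t = 0.04$, $\tau_s = 0.1$) and center momentum ($0.9$) are assumptions consistent with the text.

```python
import numpy as np


def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def dino_loss(teacher_logits, student_logits, center, tau_t=0.04, tau_s=0.1):
    """Cross-entropy between teacher targets and student predictions.

    teacher_logits, student_logits: (batch, K) network outputs g_theta(x).
    Teacher: centered by c to remove bias towards a cluster, then
    sharpened with the lower temperature tau_t (tau_t < tau_s).
    Student: sharpened only.
    """
    p_t = softmax((teacher_logits - center) / tau_t)        # targets
    log_p_s = np.log(softmax(student_logits / tau_s))       # predictions
    return float(-(p_t * log_p_s).sum(axis=-1).mean())


def update_center(center, teacher_logits, momentum=0.9):
    """EMA update of the center c over teacher outputs."""
    return momentum * center + (1 - momentum) * teacher_logits.mean(axis=0)
```

In the full method the teacher parameters themselves are also an EMA of the student parameters; only the center update is shown here.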

3.1. DINO: A CLOSER LOOK AT THE FINAL LAYER

The teacher and student networks are designed by combining a backbone network that can be transferred to downstream tasks and a prediction head that is specific to the SSL pre-training. We take a closer look at the prediction head, as shown in Figure 1b, which is found to be important to achieve good performance in ablation experiments (Caron et al., 2021). The weight-normalization in the last linear layer (Salimans & Kingma, 2016) refers to a reparameterization of the weights $W$ as $w^{(k)} = g^{(k)} v^{(k)}$, where $w^{(k)}$ is the $k$th column of $W$, $g^{(k)}$ is a scalar magnitude, and $v^{(k)}$ is a unit vector. Optionally, the weights are $L_2$-normalized by fixing $g = 1$. Whether or not to $L_2$-normalize the prototypes is a subtle, and in our opinion ad hoc, design choice which is not discussed enough in the literature. Most prior clustering-based SSL methods do use normalized prototypes (Caron et al., 2018; 2019; 2020; Li et al., 2020; Asano et al., 2020), and this

