WEIGHTED ENSEMBLE SELF-SUPERVISED LEARNING

Abstract

Ensembling has proven to be a powerful technique for boosting model performance, uncertainty estimation, and robustness in supervised learning. Advances in self-supervised learning (SSL) enable leveraging large unlabeled corpora for state-of-the-art few-shot and supervised learning performance. In this paper, we explore how ensemble methods can improve recent SSL techniques by developing a framework that permits data-dependent weighted cross-entropy losses. We refrain from ensembling the representation backbone; this choice yields an efficient ensemble method that incurs a small training cost, requires no architectural changes, and adds no computational overhead to downstream evaluation. The effectiveness of our method is demonstrated with two state-of-the-art SSL methods, DINO (Caron et al., 2021) and MSN (Assran et al., 2022). Our method outperforms both on multiple evaluation metrics on ImageNet-1K, particularly in the few-shot setting. We explore several weighting schemes and find that those which increase the diversity of ensemble heads lead to better downstream evaluation results. Thorough experiments yield improved prior-art baselines, which our method still surpasses; e.g., our overall improvement with MSN ViT-B/16 is 3.9 p.p. for 1-shot learning.


Introduction

The promise of self-supervised learning (SSL) is to extract information from unlabeled data and leverage this information in downstream tasks (e.g., He et al., 2020; Grill et al., 2020; Caron et al., 2021; Zbontar et al., 2021; He et al., 2022). Perhaps surprisingly, however, a simple and otherwise common idea has received limited consideration: ensembling. Ensembling combines predictions from multiple trained models and has proven effective at improving model accuracy (Hansen & Salamon, 1990; Perrone & Cooper, 1992) and capturing predictive uncertainty in supervised learning (Lakshminarayanan et al., 2017; Ovadia et al., 2019). Ensembling in the SSL regime is nuanced, however: since the goal is to learn useful representations from unlabeled data, it is less obvious where and how to ensemble. We explore these questions in this work.

We develop an efficient ensemble method tailored to SSL that replicates the non-representation parts of the SSL model (e.g., the projection heads). In contrast with traditional "post-training" ensembling, our ensembles are used only during training to facilitate the learning of a single representation encoder, which incurs no extra cost in downstream evaluation. We further present a family of weighted cross-entropy losses to train the ensembles effectively. The key component of our losses is the introduction of data-dependent importance weights for ensemble members. We empirically compare different choices from our framework and find that the choice of weighting scheme critically impacts ensemble diversity, and that greater ensemble diversity correlates with improved downstream performance.

Our method is potentially applicable to many SSL methods; we focus on DINO (Caron et al., 2021) and MSN (Assran et al., 2022) to demonstrate its effectiveness. Fig. 1 shows DINO improvements from using our ensembling and weighted cross-entropy loss.
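To make the core idea concrete, the following is a minimal sketch of a weighted ensemble cross-entropy loss over multiple projection heads. The function names, array shapes, and the particular weighting scheme (a softmax over per-sample head losses) are illustrative assumptions for exposition; the paper explores several weighting schemes, and this sketch does not reproduce any specific one.

```python
import numpy as np

def softmax(z, axis=-1):
    # numerically stable softmax along the given axis
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def weighted_ensemble_ce(student_logits, teacher_probs, temperature=1.0):
    """Illustrative weighted ensemble cross-entropy (hypothetical API).

    student_logits: (H, B, K) array, one set of logits per ensemble head
    teacher_probs:  (H, B, K) array of target distributions per head
    Returns a scalar: per-head cross-entropy losses combined with
    data-dependent importance weights over heads (here, a softmax of
    the per-sample head losses, one illustrative choice).
    """
    log_p = np.log(softmax(student_logits))            # (H, B, K)
    per_head_ce = -(teacher_probs * log_p).sum(-1)     # (H, B)
    # data-dependent weights: for each sample, a distribution over heads
    w = softmax(per_head_ce / temperature, axis=0)     # (H, B), sums to 1 over heads
    return (w * per_head_ce).sum(0).mean()
```

As `temperature` grows, the weights approach uniform and the loss reduces to a plain average over heads; sharper temperatures make the weighting more data-dependent, which is the knob that controls head diversity in this sketch.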



Figure 1: Our improvements to DINO, including baseline improvements and ensembling.

