DOES DATASET LOTTERY TICKET HYPOTHESIS EXIST?

Abstract

Tuning hyper-parameters and exploring suitable training schemes for self-supervised models is usually expensive and resource-consuming, especially on large-scale datasets like ImageNet-1K. Critically, this means only a few institutions (e.g., Google, Meta, etc.) can afford the heavy experiments this task requires, which seriously hinders broader engagement and better development in this area. An ideal situation would be a subset of the full large-scale dataset that correctly reflects the performance distinctions among different training frameworks, hyper-parameters, etc. This new training manner can substantially decrease resource requirements and accelerate ablations without compromising accuracy when the configuration discovered on the subset is applied to the full dataset. We formulate this problem as the dataset lottery ticket hypothesis and the target subsets as the winning tickets. In this work, we analyze this problem by identifying partial empirical data along the class dimension that exhibits a consistent Empirical Risk Trend with the full observed dataset. We also examine multiple solutions for generating the target winning tickets, including (i) a uniform selection scheme that has been widely used in the literature; and (ii) subsets built with prior knowledge, for instance, using the sorted per-class performance of a strong supervised model, or the WordNet tree over hierarchical semantic classes. We verify this hypothesis on the self-supervised learning task across a variety of recent mainstream methods, such as MAE, DINO, and MoCo-V1/V2, with different backbones like ResNet and Vision Transformers. The supervised classification task is also examined as an extension.
We conduct extensive experiments, training more than 2K self-supervised models on the large-scale ImageNet-1K and its subsets for 1.5M GPU hours, to scrupulously deliver our discoveries and support our conclusions. According to our experimental results, the winning tickets (subsets) we find behave consistently with the original dataset, which can benefit many experimental studies and ablations, saving 10× training time and resources for hyper-parameter tuning and other ablation studies.

1. INTRODUCTION

In recent years, large deep neural networks, such as Convolutional Neural Networks (CNNs) (Lecun & Bengio, 1995; He et al., 2016) and Transformers (Vaswani et al., 2017), have achieved breakthroughs in supervised learning (Tan & Le, 2019; Dosovitskiy et al., 2020) and self-supervised learning (Kenton & Toutanova, 2019; Brown et al., 2020; Caron et al., 2021; 2020; He et al., 2022), empowered by large-scale datasets. Naturally, the computational resources required for training these models on large data are increasing accordingly. A dilemma is that many researchers do not have such resources to conduct experiments on large datasets directly, especially for expensive self-supervised learning that requires tuning hyper-parameters and exploring proper training settings and frameworks. A commonly used practice in the vision domain is to conduct ablations on relatively small datasets like CIFAR (Krizhevsky, 2009) and MNIST (Lecun et al., 1998), and then transfer the tuned optimal configurations to large datasets like ImageNet-1K (Deng et al., 2009). However, in many cases, models' behaviors and properties on small datasets are observed to be quite different from those learned on large datasets, making it improper to directly transfer hyper-parameters found on small data to the large one. Considering that self-supervised learning does not use human-annotated labels for training, another popular solution has emerged: randomly choosing a subset from the full dataset, for example, randomly selecting 100 classes among the 1,000 in ImageNet-1K for self-supervised ablations and exploratory experiments (Tian et al., 2020; Kalantidis et al., 2020; Ermolov et al., 2021). Such a scheme has shown great advantages in lowering the resource demands of costly learning frameworks for fast hyper-parameter tuning and model exploration with large backbone architectures.
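The random class-selection baseline above can be sketched in a few lines; here the class identifiers are synthetic placeholders for the 1,000 ImageNet-1K synset IDs, and the function name is our own illustration rather than code from any of the cited works:

```python
import random

def sample_class_subset(all_classes, k=100, seed=0):
    """Uniformly sample k classes (e.g., 100 of ImageNet-1K's 1,000
    synsets) to form a random subset for self-supervised ablations.
    A fixed seed keeps the subset reproducible across runs."""
    rng = random.Random(seed)
    return sorted(rng.sample(all_classes, k))

# Toy stand-in for the 1,000 ImageNet-1K synset IDs.
classes = [f"n{i:08d}" for i in range(1000)]
subset = sample_class_subset(classes, k=100, seed=42)
```

In practice one would then restrict the training set to images whose labels fall in `subset`; the paper's point is that which 100 classes land in this sample is left entirely to chance.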
However, according to the Empirical Risk Minimization (ERM) principle (Vapnik, 1991; 1999), the training convergence of ERM is guaranteed when the number of parameters of the neural network scales linearly with the number of training examples. Under this principle, it is challenging to use only a subset of data to model the properties of the much larger full training set, since in both settings models are trained to minimize their average error over the current training samples. Moreover, ERM provides no generalization guarantee from the subset's distribution to that of the whole dataset, making this strategy full of uncertainty. Consequently, a natural question arises in this work: what kind of subset is qualified for evaluating self-supervised/supervised methods intended for the full data?
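Written out in standard notation (this equation is not from the paper itself), ERM makes the concern concrete: both the subset and full-data models minimize the same form of empirical average,

```latex
\hat{R}_n(f) \;=\; \frac{1}{n}\sum_{i=1}^{n} \ell\big(f(x_i),\, y_i\big),
\qquad
\hat{f} \;=\; \arg\min_{f \in \mathcal{F}} \hat{R}_n(f),
```

where $\ell$ is the loss, $\{(x_i, y_i)\}_{i=1}^{n}$ the training sample, and $\mathcal{F}$ the hypothesis class. A model trained on a subset optimizes the identical objective over $n' \ll n$ samples drawn from fewer classes, so nothing in the objective itself ties the relative rankings of configurations on the subset to their rankings on the full data.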

Goal of Dataset Lottery Ticket Hypothesis (DLTH):

The goal of DLTH is to find a subset, the winning ticket, from a large-scale dataset such that this subset exhibits the same or similar empirical behaviors and performance trends as the original full dataset across different training approaches and hyper-parameters; that is, it truly reflects the performance changes induced by different training settings. DLTH differs from (i) data pruning (Zhang et al., 2021; Sorscher et al., 2022), which removes low-contribution and easily forgotten data to replace the full dataset, and (ii) dataset distillation (Wang et al., 2018; Cazenavette et al., 2022) and condensation (Zhao et al., 2020), which generate new compressed data. The proposed DLTH does not predict the accuracy on the full data; instead, it selects a proper subset from the original data such that models trained on the subset under various configurations have a performance trend consistent with models trained on the full data. Thus, we can further use this subset for fast hyper-parameter tuning, framework exploration, or other time-consuming tasks. More detailed discussions of the related tasks are provided in Appendix G.

Overfitting Issue on the Subset: The key observed issue with using subset data is that self-supervised pre-training on the smaller training data (subset) easily suffers from overfitting. To reveal this, we visualize the pre-training losses in Fig. 1. Following the DINO protocol (Caron et al., 2021), we train for 400 epochs with ViT-Base and 800 epochs with ViT-Small. On the full data, the training loss first drops rapidly, then rises slightly, and finally continues to descend for both the small and base models. On the randomly selected subsets (RS-ID), in contrast, the losses of the base and small models drop constantly to a plateau.
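One of the prior-knowledge selection schemes mentioned in the abstract, picking classes guided by the sorted per-class performance of a strong supervised model, might look like the following sketch. The stratified "take every C/k-th ranked class" rule, the function name, and the toy accuracies are our illustrative assumptions, not the paper's exact recipe:

```python
def select_by_class_difficulty(per_class_acc, k=100):
    """Sort classes by the per-class accuracy of a supervised reference
    model, then take every (C/k)-th class from the ranking so the subset
    spans easy-to-hard classes rather than clustering at one extreme.
    `per_class_acc` maps class id -> reference accuracy."""
    ranked = sorted(per_class_acc, key=per_class_acc.get, reverse=True)
    step = len(ranked) / k
    return [ranked[int(i * step)] for i in range(k)]

# Toy per-class accuracies for 1,000 hypothetical classes.
accs = {f"n{i:08d}": 0.5 + 0.0005 * i for i in range(1000)}
subset = select_by_class_difficulty(accs, k=100)
```

A WordNet-based scheme would replace the accuracy ranking with the hierarchical class tree, e.g., sampling one class per high-level branch to preserve semantic coverage.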
Interestingly, on our identified winning ticket, the rebound in the loss re-emerges at around epoch 400 (blue curve), and the loss magnitude is generally larger than on the random subsets, which is more aligned with the trend of the full data.



In this work, we focus on the performance trend, or relative accuracy trend, of models trained on the subset and on the full data across different train/eval configurations. Matching the absolute accuracy on an individual subset is not necessary.



Figure 1: Illustration of the loss curves from self-supervised pre-training on the full data, random subsets, and the winning ticket subset. ViT-Small and ViT-Base models are used as the backbone networks.

