HOW TO FINE-TUNE VISION MODELS WITH SGD

Abstract

SGD (with momentum) and AdamW are the two most commonly used optimizers for fine-tuning large neural networks in computer vision. When the two methods perform the same, SGD is preferable because it uses less memory (12 bytes/parameter) than AdamW (16 bytes/parameter). However, on a suite of downstream tasks, especially those with distribution shifts, we show that fine-tuning with AdamW performs substantially better than SGD on modern Vision Transformer and ConvNeXt models. We find that large gaps in performance between SGD and AdamW occur when the fine-tuning gradients in the first "embedding" layer are much larger than in the rest of the model. Our analysis suggests an easy fix that works consistently across datasets and models: merely freezing the embedding layer (less than 1% of the parameters) leads to SGD performing competitively with AdamW while using less memory. Our insights result in state-of-the-art accuracies on five popular distribution shift benchmarks: WILDS-FMoW, WILDS-Camelyon, Living-17, Waterbirds, and DomainNet.
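The embedding-layer freeze described above amounts to excluding that layer's parameters from the optimizer. A minimal PyTorch-style sketch follows; the two-layer model and layer positions here are illustrative toys, not the paper's released code, and in a real ViT the patch embedding would be selected by name:

```python
import torch
import torch.nn as nn

# Toy stand-in for a vision model: a patch-projection "embedding" layer
# followed by a linear head. (Illustrative only; in an actual ViT the
# embedding layer holds under 1% of the parameters.)
model = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=16, stride=16),  # patch "embedding" layer
    nn.Flatten(),
    nn.Linear(64 * 4 * 4, 10),                    # classification head
)

# Freeze the embedding layer so SGD never updates it.
for p in model[0].parameters():
    p.requires_grad = False

# Pass only the still-trainable parameters to the optimizer.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=1e-3, momentum=0.9)
```

Because the frozen parameters carry no gradient or momentum buffers, this also slightly reduces optimizer memory on top of SGD's baseline savings.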

1. INTRODUCTION

Fine-tuning large pretrained models on downstream tasks has become a dominant approach in deep learning (Kornblith et al., 2019; Chen et al., 2020; Zhai et al., 2020). The two most commonly used optimizers in current practice are SGD and AdamW (Kingma & Ba, 2015; Loshchilov & Hutter, 2019).¹ While most modern vision architectures (ViTs, ConvNeXts, and variants) increasingly use AdamW for pretraining, it is still common to use SGD for fine-tuning. Part of the appeal is that SGD is more memory- and compute-efficient: AdamW maintains 4 states/parameter, while SGD maintains only 3 states/parameter (Ginsburg et al., 2019; Dettmers et al., 2022). When training ultra-large models, the additional memory from even 1 extra state/parameter can be costly. At the same time, in terms of fine-tuning accuracy, prior work (Dosovitskiy et al., 2021; Steiner et al., 2021; Kumar et al., 2022) reports similar performance between AdamW and SGD on ImageNet-like domains that are close to the pretraining data. In contrast, we reach different conclusions when fine-tuning on datasets that are far from the pretraining data or involve substantial distribution shifts. We examine 7 popular models, including vision transformers (Dosovitskiy et al., 2021; Caron et al., 2021; Radford et al., 2021), ConvNeXts (Liu et al., 2022), and ResNets (Kolesnikov et al., 2020; He et al., 2016), of different sizes and pretraining modalities. When pretrained on a large corpus and then fine-tuned, these models achieve near state-of-the-art performance on downstream benchmarks. Beyond good transfer learning, we also want fine-tuned models to handle practical distribution shifts gracefully, so we focus on 5 distribution shift datasets that provide both in-distribution (ID) and out-of-distribution (OOD) evaluations: WILDS-FMoW, WILDS-Camelyon, Waterbirds, BREEDS-Living-17, and DomainNet.
These were selected to capture different types of data shift (subpopulation shifts, spurious correlations, style shifts), including two real-world shifts in medical imaging and satellite remote sensing from the WILDS benchmark (Koh et al., 2021). We find that on newer models such as ViTs and ConvNeXts, AdamW can reach significantly higher accuracies, especially OOD. For example, averaged across the datasets, fine-tuning a CLIP ViT-B/16 model with AdamW yields 2.1% higher accuracy ID and 8.1% higher accuracy OOD compared to SGD (Figure 1b). These gains are consistent across models as well: averaged across all models and datasets, AdamW gets 1.2% higher accuracy ID and 4.0% higher accuracy OOD (Tables 1 and 2).
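The 3-vs-4 states/parameter comparison is simple back-of-envelope arithmetic: each fp32 state costs 4 bytes, so SGD with momentum (weight, gradient, momentum buffer) needs 12 bytes/parameter while AdamW (weight, gradient, first and second moments) needs 16. A small sketch, using an illustrative 86M-parameter ViT-B-sized model:

```python
# Back-of-envelope optimizer memory, assuming fp32 (4-byte) states.
BYTES_FP32 = 4

def optimizer_bytes_per_param(n_states: int) -> int:
    """n_states counts weight + gradient + optimizer buffers."""
    return n_states * BYTES_FP32

# SGD with momentum: weight, gradient, momentum buffer -> 3 states.
sgd_bytes = optimizer_bytes_per_param(3)    # 12 bytes/parameter
# AdamW: weight, gradient, first and second moments -> 4 states.
adamw_bytes = optimizer_bytes_per_param(4)  # 16 bytes/parameter

# For an 86M-parameter model (roughly ViT-B/16 scale):
n_params = 86_000_000
print(sgd_bytes * n_params / 1e9, adamw_bytes * n_params / 1e9)  # ~1.0 GB vs ~1.4 GB
```

Mixed-precision training changes the per-state byte counts but not the ratio: AdamW's extra second-moment buffer always adds one state per trainable parameter.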

¹ We use SGD to refer to its usage in deep learning as minibatch stochastic gradient descent with momentum.

