NOTELA: A GENERALIZABLE METHOD FOR SOURCE-FREE DOMAIN ADAPTATION

Anonymous authors
Paper under double-blind review

Abstract

Source-free domain adaptation (SFDA) is a compelling problem, as it allows one to leverage any off-the-shelf model without requiring access to its original training set, adapting it using only unlabelled data. While several SFDA approaches have recently been proposed, their evaluation focuses on a narrow set of distribution shifts for vision tasks, and their generalizability outside that scope has not yet been investigated. We put these recent approaches to the test by evaluating them on a new set of challenging (due to extreme covariate and label shift) and naturally-occurring distribution shifts in the audio domain. We study the task of adapting a bird species classifier trained on focalized recordings of bird songs to datasets of passive recordings from various geographical locations. Interestingly, we find that some recent SFDA methods underperform doing no adaptation at all. Drawing inspiration from these findings and insights, we propose a new method that improves on noisy student approaches by adjusting the teacher's pseudo-labels through Laplacian regularization. Our approach enjoys increased stability and significantly better performance on several of our proposed distribution shifts. We then look back at SFDA benchmarks in the vision domain and find that our approach is competitive with the state of the art there as well.

1. INTRODUCTION

Deep learning has made significant progress in a wide range of application areas. An important contributing factor has been the availability of increasingly large datasets and models (Kaplan et al., 2020; Song et al., 2022). A downside of this trend is that training state-of-the-art models has also become increasingly expensive. This is not only wasteful from an environmental perspective, but also makes the training of such models inaccessible to some practitioners, due to the prohibitive resources required or potential difficulties with data access. On the other hand, directly reusing already-trained models is often not desirable, as their performance can degrade significantly in the presence of distribution shifts during deployment (Geirhos et al., 2020). A fruitful avenue is therefore to design methods that adapt a pre-trained model to succeed on a new target domain without requiring access to the original (source) training data, i.e., "source-free"; preferably, this adaptation can be performed without supervision. This is the problem of source-free domain adaptation (SFDA) that we target in this work.

Several methods have recently been proposed to tackle SFDA. However, we argue that evaluation in this area is a significant challenge in and of itself: we desire SFDA methods that are general, in that they can be used across applications to adapt an application-appropriate pre-trained model to cope with a wide range of distribution shifts. Unfortunately, the standard evaluation protocol only considers a narrow set of shifts in computer vision tasks, leaving us with a limited view of the relative merits of different SFDA methods, as well as of their generalizability. In this work, we address this limitation by studying a new set of distribution shifts. We expand on the existing evaluation methods in order to gain as much new information as possible about SFDA methods. We also argue that we should target distribution shifts that are naturally occurring.
This maximizes the chances of the resulting research advances being directly translated into progress in solving real-world problems. To that end, we propose to study a new set of distribution shifts in the audio domain. Specifically, we use a bird species classifier that was trained on a large dataset of bird song recordings as our pre-trained model.

Table 1: Relationship of problem settings. x and y denote inputs and labels, and s and t "source" and "target", respectively (note that in some cases, as in DG, s might be a union of source domains / environments). For TTA and SFDA, the * in their training data and loss reflects that they are entirely agnostic to how source training is performed, allowing the use of any off-the-shelf model.

                   DA                           DG             TTT                     TTA      SFDA
  Training data    (x_s, y_s), x_t              (x_s, y_s)     (x_s, y_s), x_t         *        *
  Training loss    L(x_s, y_s) + L(x_s, x_t)    L(x_s, y_s)    L(x_s, y_s) + L(x_t)    *        *
  Adaptation loss  -                            -              L(x_t)                  L(x_t)   L(x_t)

This dataset consists of focalized recordings, where the song of the identified bird is in the foreground of the recording. Our goal is to adapt this model to a set of passive recordings (soundscapes). The shift from focalized to soundscape recordings is substantial, as recordings in the latter often feature a much lower signal-to-noise ratio, several birds vocalizing at once, and significant distractors and environmental noise such as rain or wind. In addition, the soundscapes we consider originate from different geographical locations, inducing extreme label shifts.

Our rationale for choosing to study these shifts is threefold. Firstly, they are challenging, as evidenced by the poor performance on soundscape datasets compared to focalized recordings observed by Goëau et al. (2018); Kahl et al. (2021). Secondly, they are naturally occurring, and any progress in addressing them can support ecological monitoring and biodiversity conservation efforts and research. Finally, our resulting evaluation framework is "just different enough" from the established one: It differs in terms of i) the modality (vision vs.
audio), ii) the problem setup (single-label vs. multi-label classification), and iii) the degree and complexity of the shifts (we study extreme covariate shifts that co-occur with extreme label-space shifts). Existing SFDA methods designed for vision tasks can also be evaluated using this new framework, since audio inputs are often represented as spectrograms and can thus be treated as images.

We perform a thorough empirical investigation of established SFDA methods on our new shifts. Interestingly, not only do some of the methods fail to improve the performance of the pre-trained model, they often degrade it. Studying this striking finding generates insights that lead us to make substantial modelling improvements. Notably, in the presence of extreme shifts, we observe that the confidence of the pre-trained model drops significantly and, as a result, its calibration is poor. This in turn poses a challenge for entropy minimization. While Noisy Student (Xie et al., 2020) copes best with this shift, it exhibits poor stability, necessitating careful early stopping, which violates our assumption that labelled target data is unavailable. Our insight is that we can leverage the model's feature space as another "source of truth", as this space carries rich information about the relationships between examples. We propose Noisy Student Teacher with Laplacian Adjustment (NOTELA), a new method that enhances Noisy Student with a Laplacian regularizer. NOTELA enjoys significantly better performance on our new shifts, as well as increased stability and resilience to model miscalibration. Closing the loop, we also evaluate our method on established vision benchmarks and observe that it is competitive with the state of the art there as well; in fact, it largely surpasses it on two datasets.
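To make the core idea concrete, the following is a simplified NumPy sketch (our own illustration, not the exact NOTELA update rule) of Laplacian-adjusted pseudo-labelling: each teacher probability is blended with the average pseudo-label of its nearest neighbours in the model's feature space, so that the feature-space structure acts as a second "source of truth" alongside the teacher's predictions.

```python
import numpy as np

def laplacian_adjusted_pseudo_labels(probs, features, k=3, alpha=0.5, iters=10):
    """Simplified sketch: refine teacher probabilities `probs` (n x C) by
    pulling each example's pseudo-label toward the average pseudo-label of
    its k nearest neighbours in `features` (n x d), i.e. Laplacian smoothing
    anchored at the teacher's original prediction."""
    n = len(probs)
    # Cosine-similarity k-NN graph over the model's feature space.
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = f @ f.T
    np.fill_diagonal(sim, -np.inf)           # exclude self-edges
    nn = np.argsort(-sim, axis=1)[:, :k]     # k nearest neighbours per row
    w = np.zeros((n, n))
    w[np.repeat(np.arange(n), k), nn.ravel()] = 1.0 / k
    # Fixed-point iteration: each pseudo-label is a convex combination of
    # the teacher's prediction and the neighbourhood average.
    y = probs.copy()
    for _ in range(iters):
        y = (1.0 - alpha) * probs + alpha * (w @ y)
    return y
```

Here `k`, `alpha`, and `iters` are hypothetical knobs for illustration; the strength of the neighbourhood term trades off trust in the teacher against trust in the feature-space geometry.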

2. RELATED WORK

Our discussion of related work is summarized in Table 1.
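The adaptation loss L(x_t) in Table 1 is deliberately generic. One common instantiation in TTA and SFDA methods is entropy minimization over the model's predictions on unlabelled target data; a minimal NumPy sketch (illustrative, not tied to any particular method) is:

```python
import numpy as np

def entropy_adaptation_loss(logits):
    """Mean Shannon entropy of softmax predictions on unlabelled target
    inputs: one common instantiation of the adaptation loss L(x_t).
    Minimizing it encourages confident predictions on the target domain."""
    z = logits - logits.max(axis=1, keepdims=True)        # numerical stability
    p = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)  # softmax
    return float(-(p * np.log(p + 1e-12)).sum(axis=1).mean())
```

Note that this objective relies on the pre-trained model being reasonably calibrated on the target domain, which is precisely what breaks down under the extreme shifts studied in this work.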

Domain adaptation (DA). DA assumes a setting in which labelled data is available for a source domain, and unlabelled data for a target domain. The goal is to maximize performance on the target domain. DA methods can be roughly divided into three types (Sagawa et al., 2022): domain-invariant training (also called feature alignment) aims to ensure that the features generated by the model for the source and target domains are indistinguishable by some metric (Sun et al., 2016; Sun & Saenko, 2016; Tzeng et al., 2014; Long et al., 2015; Ganin et al., 2016; Long et al., 2018;

