AN INVESTIGATION OF DOMAIN GENERALIZATION WITH RADEMACHER COMPLEXITY

Abstract

The domain generalization (DG) setting challenges a model trained on multiple known data distributions to generalise well on unseen data distributions. Due to its practical importance, many methods have been proposed to address this challenge. However, much work in general-purpose DG is heuristically motivated, as the DG problem is hard to model formally, and recent evaluations have cast doubt on existing methods' practical efficacy, in particular compared to a well-tuned empirical risk minimisation baseline. We present a novel learning-theoretic generalisation bound for DG that bounds unseen-domain generalisation performance in terms of the model's empirical risk and Rademacher complexity, providing a sufficient condition for DG. Based on this insight, we empirically analyze the performance of several methods and show that their performance is indeed influenced by model complexity in practice. Algorithmically, our analysis suggests that tuning for domain generalisation should be achieved simply by performing regularised ERM with a leave-one-domain-out cross-validation objective. Empirical results on the DomainBed benchmark corroborate this.

1. INTRODUCTION

Machine learning systems have shown exceptional performance on numerous tasks in computer vision and beyond. However, performance drops rapidly when the standard assumption of i.i.d. training and testing data is violated. This domain-shift phenomenon occurs widely in many applications of machine learning (14; 37; 25), and often leads to disappointing results in practical machine learning deployments, since data 'in the wild' is almost inevitably different from training sets. Given the practical significance of this issue, numerous methods have been proposed that aim to improve models' robustness to deployment under train-test domain shift (37), a problem setting known as domain generalisation (DG). These span diverse approaches including specialised neural architectures, data augmentation strategies, and regularisers. Nevertheless, the DG problem setting is difficult to model formally for principled derivation and theoretical analysis of algorithms, since the target domain(s) of interest cannot be observed during training, and cannot be directly approximated by the training domains due to unknown distribution shift. Therefore many popular approaches (37) are based on poorly understood empirical heuristics, a problem highlighted by (20), who found that no DG method reliably outperforms a well-tuned empirical risk minimisation (ERM) baseline. Our first contribution is to present an intuitive learning-theoretic bound for DG performance. Intuitively, while the held-out domain of interest is unobservable during training, we can bound its performance using learning-theoretic tools similar to the standard ones used to bound the performance on (unobserved) testing data given (observed) training data. In particular, we show that the performance on a held-out target domain is bounded by the performance on known source domains, plus two additional model complexity terms that describe how much a model may have overfitted to the training domains.
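The shape of this bound can be sketched schematically as follows. The notation here is illustrative only, recording just the structure stated above (empirical risk plus a domain-level and an instance-level complexity term); the precise complexity measures, constants, and confidence factors are those of the formal result:

```latex
% Schematic sketch only: with n training domains of m examples each,
% and with probability at least 1 - \delta,
\mathbb{E}_{D \sim \mathcal{P}}\!\left[ R_D(h) \right]
  \;\le\; \frac{1}{n} \sum_{i=1}^{n} \hat{R}_{D_i}(h)
  \;+\; \underbrace{\mathfrak{R}_{n}(\mathcal{H})}_{\text{domain-level complexity}}
  \;+\; \underbrace{\mathfrak{R}_{m}(\mathcal{H})}_{\text{instance-level complexity}}
  \;+\; O\!\left( \sqrt{ \frac{\log(1/\delta)}{n} } \right)
```

Here $\mathcal{P}$ denotes the distribution over domains, $\hat{R}_{D_i}(h)$ the empirical risk on source domain $D_i$, and $\mathfrak{R}$ a Rademacher-type complexity of the hypothesis class $\mathcal{H}$.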
This provides a sufficient condition for DG and leads to several insights. Firstly, our theory suggests that DG performance is influenced by a trade-off between empirical risk and model complexity that is analogous to the corresponding and widely understood trade-off that explains generalisation in standard i.i.d. learning as an overfitting-underfitting trade-off (17). Based on this, we conjecture that the efficacy of the plethora of available strategies (37), from data augmentation to specialised optimisers, is largely influenced by explicitly or implicitly choosing different fit-complexity trade-offs; and further, that the importance of proper tuning as discussed by (20) may be mediated in part by the impact of hyper-parameters on model complexity. Analyzing these issues empirically is difficult, as model complexity is hard to carefully control in deep learning due to the large number of relevant factors (explicit regularisers, data augmentation, optimiser parameters, etc.). (20) attempted to address this by random hyper-parameter search in the DomainBed benchmark, but are hampered by the computational infeasibility of accurate hyper-parameter search. In this paper, we use linear models, random forests, and shallow MLPs to demonstrate more clearly how cross-domain performance depends on model complexity. Secondly, our theory further suggests that the model selection criterion (22) is an important factor in DG performance. In particular, regularisation should be stronger when optimizing for future DG performance than when optimizing for performance on seen domains, which we confirm empirically. Further, our theoretical and empirical results show that, contrary to the conclusion of (20), domain-wise cross-validation is a better objective to drive DG model selection than instance-wise cross-validation. In summary, the take-home messages of our analysis are: (i) when achievable, low empirical risk combined with low complexity provides a sufficient condition for DG; (ii) the model fit vs. complexity trade-off is a key factor in practical DG performance; (iii) the complexity control strategy used to determine the bias-variance trade-off is crucial, with peak DG performance achieved when optimizing model complexity based on domain-wise validation; (iv) the regularisation strength required for optimal DG is greater than for conventional optimization of within-domain performance.
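The model-selection recipe suggested by this analysis, regularised ERM tuned with a leave-one-domain-out cross-validation objective, can be sketched as follows. The code is an illustrative toy, not our experimental pipeline: the synthetic domains, ridge regressor, and regularisation grid are all hypothetical stand-ins for any learner with a tunable complexity knob.

```python
# Sketch: regularised ERM with leave-one-domain-out (LODO) cross-validation.
# All names and the synthetic data are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

def make_domain(shift, n=200, d=5):
    """A toy domain: a shared linear signal, with a domain-specific input shift."""
    X = rng.normal(shift, 1.0, size=(n, d))
    w_true = np.arange(1, d + 1, dtype=float)
    y = X @ w_true + rng.normal(0.0, 0.5, size=n)
    return X, y

domains = [make_domain(s) for s in (-1.0, 0.0, 1.0, 2.0)]

def ridge_fit(X, y, lam):
    """Closed-form ridge regression; lam is the complexity knob."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def lodo_risk(domains, lam):
    """Average held-out-domain MSE: train on all domains but one, test on it."""
    risks = []
    for i in range(len(domains)):
        Xtr = np.vstack([X for j, (X, _) in enumerate(domains) if j != i])
        ytr = np.concatenate([y for j, (_, y) in enumerate(domains) if j != i])
        Xte, yte = domains[i]
        w = ridge_fit(Xtr, ytr, lam)
        risks.append(np.mean((Xte @ w - yte) ** 2))
    return float(np.mean(risks))

# Select the regularisation strength that minimises the LODO objective,
# then refit on all source domains with that strength.
lambdas = [1e-3, 1e-1, 1.0, 10.0, 100.0]
best_lam = min(lambdas, key=lambda lam: lodo_risk(domains, lam))
w_final = ridge_fit(np.vstack([X for X, _ in domains]),
                    np.concatenate([y for _, y in domains]), best_lam)
print("selected regularisation strength:", best_lam)
```

The key difference from standard instance-wise cross-validation is that held-out splits are whole domains, so the validation objective directly simulates deployment on an unseen distribution.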

2. RELATED WORK

Theoretical Analysis of the DG Setting and Algorithms The DG problem setting was first analysed in (9). Since then there have been some attempts to analyse DG algorithms from a generalisation bound perspective (29; 8; 23; 2; 33). However, these studies have theoretical results that are either restricted to specific model classes, such as kernel machines, or make strong assumptions about how the domains seen during training will resemble those seen at test time, e.g., that all domains are convex combinations of a finite pre-determined set of prototypical domains. In contrast, our Rademacher complexity approach can be applied to a broad range of model classes (including neural networks), and makes comparatively milder assumptions about the relationship between domains, i.e., that they are i.i.d. samples from another arbitrary distribution over domains. The majority of the existing work investigating the theoretical foundations of DG follows the initial formalisation of the domain generalisation problem in (9), where the goal is to minimise the expected error over unseen domains. However, several recent works have also explored the idea of bounding the error on a single unseen domain with the most pathological distribution shift (3; 24). This type of analysis is typically rooted in methods from causal inference, rather than statistical learning theory. As a consequence, these works are able to make stronger claims for the problems they address, but the scope of their analysis is necessarily limited to the scenarios where their assumptions about the underlying causal structures are valid. For example, (24) provides bounds that assume problems conform to a specific class of structural equation models, and the analysis assumes that infinite training data is available within each observed training domain. We primarily address the standard DG formalisation of (9), which is concerned with the expected performance of a model on new domains sampled from a distribution over domains.
However, we also provide a means to transform any bound on the expected (or "average-case") risk into a high-confidence bound on the worst-case risk. Possibly the most similar work to our theoretical contributions is (1), which also provides learning-theoretic generalisation bounds for DG. However, their analysis applies only to finite hypothesis classes (which excludes, e.g., linear models and neural networks), whereas ours can be applied to any class amenable to analysis with Rademacher complexity.
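One generic way to perform such a transformation, sketched here purely for intuition (the construction used in the paper may differ in its details), is via Markov's inequality: a bound on the average risk over domains immediately yields a high-confidence bound on the risk of a single freshly sampled domain.

```latex
% Sketch: suppose the average-case bound gives
\mathbb{E}_{D \sim \mathcal{P}}\!\left[ R_D(h) \right] \le B .
% Since the risk is non-negative, Markov's inequality yields, for a new domain D,
\Pr_{D \sim \mathcal{P}}\!\left[ R_D(h) > B/\delta \right] < \delta ,
% i.e., with probability at least 1 - \delta over the draw of the domain,
% the risk on that single unseen domain is at most B/\delta.
```

The price of the transformation is the $1/\delta$ inflation of the bound, which is why average-case bounds are typically tighter than their worst-case counterparts.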

Empirical Analysis

The main existing empirical analysis of DG is (20), which compared several state-of-the-art methods using DomainBed, a common benchmark and hyper-parameter tuning protocol. They ultimately defend Empirical Risk Minimization (ERM) over more sophisticated alternatives on the grounds that no competitor consistently beats it. We also broadly defend ERM, and build on the same benchmark, but, in contrast, we provide a deeper analysis of when and why ERM works. More specifically: (i) we provide a theoretical analysis of ERM's generalisation quality, unlike the prior purely empirical evaluation; (ii) we re-use the DomainBed benchmark to directly

