PARETO INVARIANT RISK MINIMIZATION: TOWARDS MITIGATING THE OPTIMIZATION DILEMMA IN OUT-OF-DISTRIBUTION GENERALIZATION

Abstract

Recently, there has been a surge of interest in enabling machine learning systems to generalize well to Out-of-Distribution (OOD) data. Most efforts are devoted to advancing optimization objectives that regularize models to capture the underlying invariance; however, the optimization of these OOD objectives often involves compromises: i) Many OOD objectives have to be relaxed into penalty terms of Empirical Risk Minimization (ERM) for ease of optimization, but the relaxed forms can weaken the robustness of the original objective; ii) The penalty terms also require careful tuning of the penalty weights due to the intrinsic conflicts between the ERM and OOD objectives. Consequently, these compromises can easily lead to suboptimal performance on either the ERM or the OOD objective. To address these issues, we introduce a multi-objective optimization (MOO) perspective to understand the OOD optimization process, and propose a new optimization scheme called PAreto Invariant Risk minimization (PAIR). PAIR improves the robustness of OOD objectives by optimizing them cooperatively with other OOD objectives, thereby bridging the gaps caused by the relaxations. PAIR then approaches a Pareto optimal solution that properly trades off the ERM and OOD objectives. Extensive experiments on the challenging WILDS benchmarks show that PAIR alleviates these compromises and yields top OOD performance.

1. INTRODUCTION

The interplay between optimization and generalization is crucial to the success of deep learning (Zhang et al., 2017; Arora et al., 2019; Allen-Zhu et al., 2019; Jacot et al., 2021; Allen-Zhu & Li, 2021). Guided by empirical risk minimization (ERM) (Vapnik, 1991), simple optimization algorithms can find uneventful descent paths in the non-convex loss landscape of deep neural networks (Sagun et al., 2018). However, when distribution shifts are present, the optimization is usually biased by spurious signals such that the learned models can fail dramatically on Out-of-Distribution (OOD) data (Beery et al., 2018; DeGrave et al., 2021; Geirhos et al., 2020). Therefore, overcoming the OOD generalization challenge has drawn much attention recently. Most efforts are devoted to proposing better optimization objectives (Rojas-Carulla et al., 2018; Koyama & Yamaguchi, 2020; Parascandolo et al., 2021; Krueger et al., 2021; Creager et al., 2021; Liu et al., 2021; Pezeshki et al., 2021; Ahuja et al., 2021a; Wald et al., 2021; Shi et al., 2022; Rame et al., 2021; Chen et al., 2022b) that regularize the gradient signals produced by ERM, yet it has long been neglected that distribution shifts fundamentally change the nature of the interplay between optimization and generalization. In fact, the optimization process of the OOD objectives turns out to be substantially more challenging than that of ERM, and there are often compromises when applying the OOD objectives in practice. Due to the optimization difficulty, many OOD objectives have to be relaxed as penalty terms of ERM in
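To make the penalty-term relaxation concrete, the following is a minimal sketch of one common instance: an IRMv1-style composite objective of the form L_ERM + λ·L_OOD, where the OOD penalty is the squared gradient of each environment's risk with respect to a scalar dummy classifier w, evaluated at w = 1. The function name `irmv1_objective`, the squared loss, and the toy environments are illustrative assumptions for this sketch, not the paper's actual code.

```python
import numpy as np

def irmv1_objective(envs, lam):
    """Composite objective: mean per-environment risk + lam * penalty.

    Each env is a (phi, y) pair: scalar features phi(x) and targets y.
    The IRMv1-style penalty is the squared gradient of the per-env
    squared-loss risk w.r.t. a scalar dummy classifier w, at w = 1.
    """
    risks, penalties = [], []
    for phi, y in envs:
        resid = phi - y                       # prediction error at w = 1
        risks.append(np.mean(resid ** 2))     # per-environment risk
        grad_w = np.mean(2.0 * resid * phi)   # d/dw of the risk at w = 1
        penalties.append(grad_w ** 2)         # nonnegative invariance penalty
    return np.mean(risks) + lam * np.mean(penalties)

# Two toy environments (hypothetical data, for illustration only).
rng = np.random.default_rng(0)
e1 = (rng.normal(size=100), rng.normal(size=100))
e2 = (rng.normal(size=100), rng.normal(size=100))

loss_erm_only = irmv1_objective([e1, e2], lam=0.0)    # plain ERM
loss_penalized = irmv1_objective([e1, e2], lam=100.0) # ERM + OOD penalty
```

The choice of `lam` here is exactly the penalty-weight tuning the text describes: a small value recovers near-ERM behavior, while a large value emphasizes invariance at the cost of empirical risk, which is the trade-off PAIR treats from a multi-objective perspective.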

