CONTINUAL INVARIANT RISK MINIMIZATION

Abstract

Empirical risk minimization can lead to poor generalization behavior on unseen environments if the learned model does not capture invariant feature representations. Invariant risk minimization (IRM) is a recent proposal for discovering environment-invariant representations, introduced by Arjovsky et al. (2019) and extended by Ahuja et al. (2020). IRM assumes that all environments are available to the learning system at the same time. With this work, we generalize the concept of IRM to scenarios where environments are observed sequentially. We show that existing approaches, including those designed for continual learning, fail to identify the invariant features and models across sequentially presented environments. We extend IRM under a variational Bayesian and bilevel framework, creating a general approach to continual invariant risk minimization. We also describe a strategy to solve the optimization problems using a variant of the alternating direction method of multipliers (ADMM). We show empirically, using multiple datasets and multiple sequential environments, that the proposed methods outperform or are competitive with prior approaches.

1. INTRODUCTION

Empirical risk minimization (ERM) is the predominant principle for designing machine learning models. In numerous application domains, however, the test data distribution can differ from the training data distribution. For instance, at test time, the same task might be observed in a different environment. Neural networks trained by minimizing ERM objectives over the training distribution tend to generalize poorly in these situations. Improving the generalization of learning systems has become a major research topic in recent years, with many different threads of research including, but not limited to, robust optimization (e.g., Hoffman et al., 2018) and domain adaptation (e.g., Johansson et al., 2019). Both of these research directions, however, have their own intrinsic limitations (Ahuja et al., 2020). Recently, there have been proposals of approaches that learn environment-invariant representations. The motivating idea is that if the behavior of a model is invariant across environments, it is more likely that the model has captured a causal relationship between features and prediction targets. This in turn should lead to better generalization behavior. Invariant risk minimization (IRM, Arjovsky et al. (2019)), which pioneered this idea, introduces a new optimization loss function to identify non-spurious causal feature-target interactions. Invariant risk minimization games (IRMG, Ahuja et al. (2020)) expands on IRM from a game-theoretic perspective. The assumption of IRM and its extensions, however, is that all environments are available to the learning system at the same time, which is unrealistic in numerous applications. A learning agent often experiences environments sequentially rather than concurrently. For instance, in a federated learning scenario with patient medical records, the data of each hospital (environment) might be used to train a shared machine learning model which receives the data from these environments in a sequential manner.
The model might then be applied to data from an additional hospital (environment) that was not available at training time. Unfortunately, both IRM and IRMG are incompatible with such a continual learning setup, in which the learner receives training data from environments presented in a sequential manner. As already noted by Javed et al. (2020), "IRM Arjovsky et al. (2019) requires sampling data from multiple environments simultaneously for computing a regularization term pertinent to its learning objective, where different environments are defined by intervening on one or more variables of the world." The same applies to IRMG (Ahuja et al. (2020)). Our contributions are as follows:
• We expand both IRM and IRMG under a Bayesian variational framework and develop novel objectives (for the discovery of invariant models) in two scenarios: (1) the standard multi-environment scenario, where the learner receives training data from all environments at the same time; and (2) the scenario where data from each environment arrives in a sequential manner.
• We demonstrate that the resulting bilevel problem objectives have an alternative formulation, which allows us to compute a solution efficiently using the alternating direction method of multipliers (ADMM).
• We compare our method to ERM, IRM, IRMG, and various continual learning methods (EWC, GEM, MER, VCL) on a diverse set of tasks, demonstrating comparable or superior performance in most situations.
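ADMM itself is a standard operator-splitting method. As background for the second contribution, the following is a minimal, generic sketch of scaled-form ADMM applied to a textbook lasso problem; it is not the paper's continual-IRM variant, and the function name `admm_lasso` and the toy data are purely illustrative.

```python
import numpy as np

def soft_threshold(v, k):
    """Proximal operator of the l1 norm: shrink each entry toward zero by k."""
    return np.sign(v) * np.maximum(np.abs(v) - k, 0.0)

def admm_lasso(A, b, lam=0.1, rho=1.0, iters=200):
    """Scaled-form ADMM for: min_x 0.5*||A x - b||^2 + lam*||z||_1  s.t.  x = z.

    Alternates a ridge-like x-update, a soft-thresholding z-update,
    and the dual (running residual) update u <- u + x - z.
    """
    n = A.shape[1]
    x, z, u = np.zeros(n), np.zeros(n), np.zeros(n)
    AtA, Atb = A.T @ A, A.T @ b
    L = AtA + rho * np.eye(n)  # x-update system matrix, factorizable once
    for _ in range(iters):
        x = np.linalg.solve(L, Atb + rho * (z - u))   # quadratic subproblem
        z = soft_threshold(x + u, lam / rho)          # l1 subproblem
        u = u + x - z                                 # dual ascent step
    return z

# toy sparse regression problem
rng = np.random.default_rng(0)
A = rng.normal(size=(100, 10))
x_true = np.zeros(10)
x_true[:3] = [1.0, -2.0, 0.5]
b = A @ x_true + 0.01 * rng.normal(size=100)

x_hat = admm_lasso(A, b, lam=0.1)
print(np.round(x_hat, 2))
```

The same splitting pattern (local subproblems coupled through a consensus constraint and a dual variable) is what makes ADMM attractive when per-environment objectives must be reconciled with a shared model.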

2. BACKGROUND: OFFLINE INVARIANT RISK MINIMIZATION

We consider a multi-environment setting where, given a set of training environments E = {e_1, e_2, ..., e_m}, the goal is to find parameters θ that generalize well to unseen (test) environments. Each environment e has an associated training data set D_e and a corresponding risk

R^e(w • φ) := E_{(x,y)∼D_e} [ℓ((w • φ)(x), y)], (1)

where f_θ = w • φ is the composition of a feature extraction function φ and a classifier (or regression function) w. Empirical risk minimization (ERM) minimizes the average loss across all training examples, regardless of environment:

R_ERM(θ) := E_{(x,y)∼∪_{e∈E} D_e} [ℓ(f_θ(x), y)]. (2)

ERM has strong theoretical foundations in the case of i.i.d. data (Vapnik, 1992) but can fail dramatically when test environments differ significantly from training environments. To remove spurious features from the model, invariant risk minimization (IRM, Arjovsky et al. (2019)) instead aims to capture invariant representations φ such that the optimal classifier w given φ is the same across all training environments. This leads to the following bi-level optimization problem:

min_{φ∈H_φ, w∈H_w} Σ_{e∈E} R^e(w • φ)   s.t.   w ∈ arg min_{w'∈H_w} R^e(w' • φ), ∀e ∈ E, (3)

where H_φ and H_w are the hypothesis sets for, respectively, feature extractors and classifiers. Unfortunately, solving the IRM bi-level programming problem directly is difficult, since solving the outer problem requires solving multiple dependent minimization problems jointly. We can, however, relax IRM to IRMv1 by fixing a scalar classifier and learning a representation φ such that the classifier is "approximately locally optimal" (Arjovsky et al. (2019)):

min_{φ∈H_φ} Σ_{e∈E} R^e(φ) + λ ||∇_{w|w=1.0} R^e(w • φ)||^2, (4)

where w is a scalar classifier fixed at 1.0 and λ controls the strength of the penalty term on the gradients with respect to w. Alternatively, the recently proposed invariant risk minimization games (IRMG, Ahuja et al. (2020)) learns an ensemble of classifiers, with each environment controlling one component of the ensemble. Intuitively, the environments play a game where each environment's action is to decide its contribution to the ensemble, aiming to minimize its own risk. Specifically, IRMG requires each environment's classifier to be a best response:

w_e ∈ arg min_{w'∈H_w} R^e( (1/|E|) (w' + w_{-e}) • φ ), ∀e ∈ E, (5)

where w̄ = (1/|E|) Σ_{e∈E} w_e is the average classifier and w_{-e} = Σ_{e'∈E, e'≠e} w_{e'} is the complement classifier.

3. CONTINUAL IRM BY APPROXIMATE BAYESIAN INFERENCE

Both IRM and IRMG assume the availability of training data from all environments at the same time, which is impractical and unrealistic in numerous applications. A natural approach would be
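As a concrete illustration of the offline IRMv1 relaxation reviewed in Section 2, the following minimal numerical sketch (our illustrative code, not the authors' implementation) evaluates the per-environment risk plus gradient penalty for a linear featurizer with squared loss, on two synthetic environments in which a spurious feature flips its correlation with the target; the names `irmv1_objective` and `make_env` are assumptions of this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def irmv1_objective(phi, envs, lam=1.0):
    """IRMv1-style loss for a linear featurizer phi with squared loss.

    For data (X, y) of environment e, the risk of a scalar classifier w is
        R^e(w * phi) = mean((w * X @ phi - y)^2),
    so its gradient evaluated at w = 1.0 is
        d/dw R^e(w * phi)|_{w=1} = mean(2 * (X @ phi) * (X @ phi - y)).
    The objective sums risk + lam * (gradient)^2 over all environments.
    """
    total = 0.0
    for X, y in envs:
        f = X @ phi                          # features with w fixed at 1.0
        risk = np.mean((f - y) ** 2)         # R^e(phi)
        grad_w = np.mean(2.0 * f * (f - y))  # gradient penalty term
        total += risk + lam * grad_w ** 2
    return total

def make_env(spurious_sign, n=2000):
    """Toy environment: x1 is invariant (y ~ x1); the correlation of the
    spurious feature x2 with y flips sign across environments."""
    x1 = rng.normal(size=n)
    y = x1 + 0.1 * rng.normal(size=n)
    x2 = spurious_sign * y + 0.1 * rng.normal(size=n)
    return np.stack([x1, x2], axis=1), y

envs = [make_env(+1.0), make_env(-1.0)]

phi_invariant = np.array([1.0, 0.0])  # uses only the invariant feature
phi_spurious = np.array([0.0, 1.0])   # uses only the spurious feature

print(irmv1_objective(phi_invariant, envs, lam=10.0))
print(irmv1_objective(phi_spurious, envs, lam=10.0))
```

The spurious featurizer achieves low risk in one environment but pays both a large risk and a large gradient penalty in the other, so the invariant featurizer attains a much smaller IRMv1 objective; this is the offline behavior that the continual setting developed next must recover without simultaneous access to both environments.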

