ENFORCING PREDICTIVE INVARIANCE ACROSS STRUCTURED BIOMEDICAL DOMAINS

Abstract

Many biochemical applications, such as molecular property prediction, require models to generalize beyond their training domains (environments). Moreover, natural environments in these tasks are structured, defined by complex descriptors such as molecular scaffolds or protein families. Therefore, most environments are either never seen during training or contain only a single training example. To address these challenges, we propose a new regret minimization (RGM) algorithm and its extension for structured environments. RGM builds on invariant risk minimization (IRM) by recasting its simultaneous-optimality condition in terms of predictive regret, finding a representation that enables the predictor to compete against an oracle with hindsight access to held-out environments. The structured extension adaptively highlights variation due to complex environments via specialized domain perturbations. We evaluate our method on multiple applications, including molecular property prediction, protein homology detection, and protein stability prediction, and show that RGM significantly outperforms previous state-of-the-art baselines.

1. INTRODUCTION

In many biomedical applications, training data is necessarily limited or otherwise heterogeneous. It is therefore important to ensure that model predictions derived from such data generalize substantially beyond where the training samples lie. For instance, in molecular property prediction (Wu et al., 2018), models are often evaluated under a scaffold split, which introduces structural separation between the chemical spaces of training and test compounds. In protein homology detection (Rao et al., 2019), the split is driven by protein superfamily: entire evolutionary groups are held out from the training set, forcing models to generalize across larger evolutionary gaps. The key technical challenge is to estimate models that can generalize beyond their training data. The ability to generalize implies a notion of invariance to the differences between the available training data and where predictions are sought. A recently proposed approach known as invariant risk minimization (IRM) (Arjovsky et al., 2019) seeks predictors that are simultaneously optimal across different such scenarios (called environments). Indeed, one can apply IRM with environments corresponding to molecules sharing the same scaffold (Bemis & Murcko, 1996) or proteins from the same family (El-Gebali et al., 2019) (see Figure 1). However, this is challenging since, for example, scaffolds are structured objects and can often uniquely identify each example in the training set. It is not helpful to create single-example environments, as the model would then see any variation from one example to another as scaffold variation.

In this paper, we propose a regret minimization algorithm to handle both standard and structured environments. The basic idea is to simulate unseen environments by using part of the training set as held-out environments E_e.
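To make the single-example issue concrete, the following minimal sketch groups a training set into environments keyed by scaffold. The data and function names are purely illustrative (the scaffold strings are hypothetical SMILES, not produced by any real scaffold extractor):

```python
from collections import defaultdict

def group_by_environment(examples):
    """Group (scaffold, features, label) training examples into
    environments keyed by the scaffold descriptor."""
    envs = defaultdict(list)
    for scaffold, x, y in examples:
        envs[scaffold].append((x, y))
    return envs

# Hypothetical training set: most scaffolds appear exactly once.
data = [
    ("c1ccccc1", [0.1], 1.0),   # benzene scaffold (2 examples)
    ("c1ccccc1", [0.2], 0.0),
    ("c1ccncc1", [0.3], 1.0),   # pyridine scaffold (singleton)
    ("C1CCCCC1", [0.4], 0.0),   # cyclohexane scaffold (singleton)
]
envs = group_by_environment(data)
singletons = [s for s, ex in envs.items() if len(ex) == 1]
print(len(envs), len(singletons))  # 3 environments, 2 of them singletons
```

In realistic datasets the singleton fraction is far higher (e.g., 75% of scaffold environments in the toxicity task discussed below), which is why per-scaffold environments are uninformative for IRM-style training.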
We quantify generalization in terms of regret: the difference between the losses of two auxiliary predictors trained with and without examples in E_e. This imposes a stronger constraint on φ and avoids some undesired representations admitted by IRM. For structured environments such as molecular scaffolds, we simulate unseen environments by perturbing the representation φ. The perturbation is defined as the gradient of an auxiliary scaffold classifier with respect to φ. The difference between the original and perturbed representations highlights the scaffold variation to the model. The associated regret measures how well a predictor trained without perturbation generalizes to the perturbed examples. The goal is to characterize the scaffold variation without explicitly creating an environment for every possible scaffold.

Yang et al. (2019), Rao et al. (2019), and Hou et al. (2018) have demonstrated that state-of-the-art models exhibit a drop in performance when tested under scaffold or protein-family splits. Indeed, the scaffold split and its variants (Feinberg et al., 2018) are so commonly used in cheminformatics because they emulate the temporal evaluation adopted in the pharmaceutical industry. Therefore, the ability to generalize to new scaffold or protein-family environments is key to the practical use of these models. Moreover, input objects in these domains are typically structured (e.g., molecules are represented by graphs (Duvenaud et al., 2015; Dai et al., 2016; Gilmer et al., 2017)). This characteristic introduces unique challenges with respect to the environment definition for IRM-style algorithms. Various work (Krueger et al., 2020; Chang et al., 2020) has sought to extend IRM. We focus on the structured setting, where most environments uniquely specify X in the training set. As a result, E would act similarly to X; in the extreme case, the IRM principle reduces to Y ⊥ X | Z, which is not the desired invariance criterion.
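The regret quantity described above can be sketched in a few lines, assuming a fixed representation φ(x), squared loss, and closed-form ridge regressors in place of the paper's learned predictors. All names and data here are illustrative:

```python
import numpy as np

def ridge_fit(Z, y, lam=1e-3):
    """Closed-form ridge regression: w = (Z'Z + lam*I)^-1 Z'y."""
    d = Z.shape[1]
    return np.linalg.solve(Z.T @ Z + lam * np.eye(d), Z.T @ y)

def regret(Z_train, y_train, Z_held, y_held):
    """Loss gap on a held-out environment E_e between a predictor fit
    without E_e and an oracle fit with hindsight access to E_e."""
    w_no_access = ridge_fit(Z_train, y_train)
    Z_all = np.vstack([Z_train, Z_held])
    y_all = np.concatenate([y_train, y_held])
    w_oracle = ridge_fit(Z_all, y_all)
    held_mse = lambda w: np.mean((Z_held @ w - y_held) ** 2)
    return held_mse(w_no_access) - held_mse(w_oracle)

rng = np.random.default_rng(0)
Z_train = rng.normal(size=(50, 3))
y_train = Z_train @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)
Z_held = rng.normal(size=(20, 3)) + 1.0   # mean-shifted held-out environment
y_held = Z_held @ np.array([1.0, -2.0, 0.5])
print(regret(Z_train, y_train, Z_held, y_held))
```

A representation φ that drives this gap toward zero lets the predictor compete with the hindsight oracle, which is the constraint RGM imposes.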
We propose to address this issue by introducing domain perturbation to adaptively highlight the structured variation.

Domain generalization. These methods seek to learn models that generalize to new domains (Muandet et al., 2013; Ghifary et al., 2015; Motiian et al., 2017; Li et al., 2017; 2018b). Domain generalization methods can be roughly divided into three categories: domain adversarial training (Ganin et al., 2016; Tzeng et al., 2017; Long et al., 2018), meta-learning (Li et al., 2018a; Balaji et al., 2018; Li et al., 2019a;b; Dou et al., 2019), and domain augmentation (Shankar et al., 2018; Volpi et al., 2018). Our method resembles meta-learning based methods in that we create held-out environments to simulate domain shift during training. However, our objective seeks to reduce the regret between predictors trained with or without access to the held-out environments. Existing domain generalization benchmarks assume that each domain contains a sufficient amount of data; we focus on a different setting where most environments contain only a few (or even a single) training examples.
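The domain perturbation described earlier, the gradient of an auxiliary scaffold classifier with respect to the representation, can be sketched as follows. This is a simplified stand-in, assuming a linear scaffold classifier over a numeric representation z = φ(x); the weights and dimensions are hypothetical:

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

def domain_perturb(z, W, domain, eps=0.5):
    """Perturb a representation z along the gradient of an auxiliary
    domain (scaffold) classifier's cross-entropy loss w.r.t. z.
    W: (num_domains, dim) weights of a linear scaffold classifier."""
    p = softmax(W @ z)
    onehot = np.zeros(W.shape[0])
    onehot[domain] = 1.0
    grad = W.T @ (p - onehot)   # d(cross-entropy)/dz for a linear classifier
    return z + eps * grad

rng = np.random.default_rng(1)
W = rng.normal(size=(4, 8))     # 4 hypothetical scaffold environments
z = rng.normal(size=8)
z_pert = domain_perturb(z, W, domain=2)
print(np.linalg.norm(z_pert - z) > 0)  # True: the perturbation is non-trivial
```

The difference z_pert - z isolates the directions in representation space that the scaffold classifier uses, i.e., the scaffold variation the predictor should be robust to, without enumerating an environment per scaffold.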



Illustration of the SCOP hierarchy, modified from Hubbard et al. [39].

Sequence length statistics for the SCOPe datasets. We report the mean and standard deviation along with minimum and maximum sequence lengths.

Comparison of encoder architectures on the ASTRAL 2.06 test set. Encoders included LM inputs and were trained using SSA without contact prediction.

Comparison of encoder architectures on the ASTRAL 2.06 test set. Encoders included LM inputs and were trained using SSA without contact prediction. Left: Data generation process for molecule property prediction. Training and test environments are generated by controlling the scaffold variable. Middle: Scaffold is a subgraph of a molecular graph with its side chains removed. Right: In a toxicity prediction task(Wu et al., 2018), there are 1600 scaffold environments with 75% of them having a single example.Our methods are evaluated on real-world datasets such as molecule property prediction and protein classification. We compare our model against multiple baselines including IRM, MLDG(Li et al.,  2018a)  andCrossGrad (Shankar et al., 2018). On the QM9 dataset(Ramakrishnan et al., 2014), we outperform the best baseline by a wide margin across multiple properties (41.7 v.s 52.3 average MAE) under an extrapolation evaluation. On a protein stability dataset(Rocklin et al., 2017), we achieve new state-of-the-art results compared to Rao et al. (2019) (0.79 v.s. 0.73 spearman's ρ).Generalization challenges in biomedical applicationsThe challenges of generalization have been extensively documented in this area. For instance, Yang et al. (2019); Rao et al. (2019); Hou et al.

Prior work has sought generalization by enforcing an appropriate invariance constraint over learned representations. For instance, the domain adversarial network (DANN) (Ganin et al., 2016; Zhao et al., 2018) forces the latent representation Z = φ(X) to have the same distribution across different environments E (i.e., Z ⊥ E). However, this forces the predicted label distribution P(Y|Z) to be the same across all environments (Zhao et al., 2019). Long et al. (2018), Li et al. (2018c), and Combes et al. (2020) extend the invariance criterion by conditioning on the label in order to address the label-shift issue of DANN. Invariant risk minimization (IRM) (Arjovsky et al., 2019) seeks a different notion of invariance: instead of aligning the distributions of Z, IRM requires that the predictor f operating on Z = φ(X) be simultaneously optimal across different environments. The associated independence is Y ⊥ E | Z.
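IRM's simultaneous-optimality requirement is commonly implemented via the IRMv1 penalty of Arjovsky et al. (2019): the squared gradient of each environment's risk with respect to a scalar dummy multiplier w, evaluated at w = 1, applied to the predictor's output. A minimal numpy sketch under squared loss (the data here are illustrative, not from any benchmark):

```python
import numpy as np

def irmv1_penalty(f_out, y):
    """Squared gradient of the environment risk w.r.t. a scalar dummy
    multiplier w at w = 1, under squared loss:
        risk(w) = mean((w * f_out - y)^2)
    so d risk / dw |_{w=1} = mean(2 * f_out * (f_out - y))."""
    grad_w = np.mean(2.0 * f_out * (f_out - y))
    return grad_w ** 2

# A predictor already optimal for an environment incurs zero penalty there;
# a systematic mismatch yields a nonzero gradient, hence a positive penalty.
f = np.array([1.0, 2.0, 3.0])
y_match = f.copy()                 # predictor optimal: zero gradient
y_shift = f + 1.0                  # systematic shift: nonzero gradient
print(irmv1_penalty(f, y_match))   # 0.0
print(irmv1_penalty(f, y_shift))   # 16.0
```

Summing this penalty over training environments and adding it to the pooled risk pushes the predictor toward being simultaneously optimal in every environment, which is exactly the condition that RGM recasts in terms of predictive regret.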

