ENFORCING PREDICTIVE INVARIANCE ACROSS STRUCTURED BIOMEDICAL DOMAINS

Abstract

Many biochemical applications, such as molecular property prediction, require models to generalize beyond their training domains (environments). Moreover, the natural environments in these tasks are structured: they are defined by complex descriptors such as molecular scaffolds or protein families. As a result, most environments are either never seen during training or contain only a single training example. To address these challenges, we propose a new regret minimization (RGM) algorithm and its extension for structured environments. RGM builds on invariant risk minimization (IRM) by recasting the simultaneous optimality condition in terms of predictive regret, finding a representation that enables the predictor to compete against an oracle with hindsight access to held-out environments. The structured extension adaptively highlights variation due to complex environments via specialized domain perturbations. We evaluate our method on multiple applications (molecular property prediction, protein homology detection, and protein stability prediction) and show that RGM significantly outperforms previous state-of-the-art baselines.

1. INTRODUCTION

In many biomedical applications, training data is necessarily limited or otherwise heterogeneous. It is therefore important to ensure that model predictions derived from such data generalize substantially beyond where the training samples lie. For instance, in molecular property prediction (Wu et al., 2018), models are often evaluated under a scaffold split, which introduces structural separation between the chemical spaces of training and test compounds. In protein homology detection (Rao et al., 2019), the split is driven by protein superfamily: entire evolutionary groups are held out from the training set, forcing models to generalize across larger evolutionary gaps.

The key technical challenge is to estimate models that generalize beyond their training data. The ability to generalize implies a notion of invariance to the differences between the available training data and where predictions are sought. A recently proposed approach known as invariant risk minimization (IRM) (Arjovsky et al., 2019) seeks predictors that are simultaneously optimal across different such scenarios (called environments). Indeed, one can apply IRM with environments corresponding to molecules sharing the same scaffold (Bemis & Murcko, 1996) or proteins from the same family (El-Gebali et al., 2019) (see Figure 1). However, this is challenging because, for example, scaffolds are structured objects and can often uniquely identify each example in the training set. Creating single-example environments is not helpful, as the model would then see any variation from one example to another as scaffold variation.

In this paper, we propose a regret minimization algorithm to handle both standard and structured environments. The basic idea is to simulate unseen environments by using part of the training set as held-out environments E_e.
We quantify generalization in terms of regret: the difference between the losses of two auxiliary predictors trained with and without the examples in E_e. This imposes a stronger constraint on the representation φ and avoids some undesired representations admitted by IRM.

For structured environments such as molecular scaffolds, we simulate unseen environments by perturbing the representation φ. The perturbation is defined as the gradient of an auxiliary scaffold classifier with respect to φ. The difference between the original and perturbed representations highlights the scaffold variation to the model. The associated regret measures how well a predictor trained without perturbation generalizes to the perturbed examples. The goal is to characterize the scaffold variation without explicitly creating an environment for every possible scaffold.
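
As a rough illustration of the two ideas above, the following Python sketch computes a regret for a fixed representation and a gradient-based perturbation of a representation vector. It is a toy under stated assumptions, not the method's actual implementation: the auxiliary predictors are taken to be linear least-squares models, the scaffold classifier is taken to be a linear softmax model, the perturbation is taken to be the gradient of that classifier's cross-entropy loss, and all names (`regret`, `scaffold_perturbation`, `phi_invariant`, `phi_spurious`, the step size 0.1) are hypothetical.

```python
import numpy as np

def fit_least_squares(Z, y):
    # Auxiliary linear predictor fitted on top of a fixed representation Z.
    return np.linalg.lstsq(Z, y, rcond=None)[0]

def mse(w, Z, y):
    return float(np.mean((Z @ w - y) ** 2))

def regret(Z, y, env, held_out):
    # Loss on the held-out environment of a predictor trained WITHOUT it,
    # minus the loss of an oracle predictor trained ON it (assumed reading).
    mask = env == held_out
    w_without = fit_least_squares(Z[~mask], y[~mask])
    w_oracle = fit_least_squares(Z[mask], y[mask])
    return mse(w_without, Z[mask], y[mask]) - mse(w_oracle, Z[mask], y[mask])

def scaffold_perturbation(z, W, scaffold_id):
    # Gradient, with respect to z, of the cross-entropy loss of a linear
    # softmax classifier with weights W of shape (num_scaffolds, dim).
    logits = W @ z
    p = np.exp(logits - logits.max())
    p /= p.sum()
    p[scaffold_id] -= 1.0          # softmax probabilities minus one-hot label
    return W.T @ p

# Synthetic data: three environments; the spurious feature tracks the label
# in environments 0 and 1 but flips sign in the held-out environment 2.
rng = np.random.default_rng(0)
n = 200
env = np.repeat([0, 1, 2], n)
x_inv = rng.normal(size=3 * n)
y = x_inv + 0.1 * rng.normal(size=3 * n)
sign = np.where(env == 2, -1.0, 1.0)
x_spur = sign * y + 0.1 * rng.normal(size=3 * n)

phi_invariant = x_inv[:, None]                    # keeps the stable feature only
phi_spurious = np.stack([x_inv, x_spur], axis=1)  # also keeps the unstable one

regret_invariant = regret(phi_invariant, y, env, held_out=2)
regret_spurious = regret(phi_spurious, y, env, held_out=2)
print("regret (invariant representation):", regret_invariant)
print("regret (spurious representation): ", regret_spurious)

# Perturb one representation vector along the scaffold-classifier gradient.
W = rng.normal(size=(4, 2))        # hypothetical classifier over 4 scaffolds
z = phi_spurious[0]
delta = scaffold_perturbation(z, W, scaffold_id=1)
z_perturbed = z + 0.1 * delta
```

In this toy setting the representation that retains the sign-flipping feature incurs a much larger regret on the held-out environment than the one that discards it, which is the behavior a regret penalty is meant to discourage. The perturbed vector `z_perturbed` plays the role of a simulated example from an unseen scaffold.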

