CONFOUNDER IDENTIFICATION-FREE CAUSAL VISUAL FEATURE LEARNING

Abstract

Confounders in deep learning are, in general, detrimental to a model's generalization capability once they infiltrate feature representations. Therefore, learning causal features that are free of interference from confounders is important. Most previous causal-learning-based approaches employ the back-door criterion to mitigate the adverse effects of certain specific confounders, which requires the explicit identification of those confounders. However, in real scenarios, confounders are typically diverse and difficult to identify. In this paper, we propose a novel Confounder Identification-free Causal Visual Feature Learning (CICF) method, which obviates the need for identifying confounders. CICF models the interventions among different samples based on the front-door criterion, and then approximates the global-scope intervening effect from the instance-level interventions from the perspective of optimization. In this way, we aim to find a reliable optimization direction, which eliminates the confounding effects of confounders, to learn causal features. Furthermore, we uncover the relation between CICF and the popular meta-learning strategy MAML (Finn et al., 2017), and provide, for the first time, an interpretation of why MAML works from the theoretical perspective of causal learning. Thanks to its effective learning of causal features, CICF endows models with superior generalization capability. Extensive experiments on domain generalization benchmark datasets demonstrate the effectiveness of CICF, which achieves state-of-the-art performance.

1. INTRODUCTION

Deep learning excels at capturing correlations between inputs and labels in a data-driven manner, and has achieved remarkable successes on various tasks, such as image classification, object detection, and question answering (Liu et al., 2021; He et al., 2016; Redmon et al., 2016; He et al., 2017; Antol et al., 2015). Even so, in the field of statistics, correlation is not equivalent to causation (Pearl et al., 2016). For example, when tree branches usually appear together with birds in the training data, deep neural networks (DNNs) can easily mistake features of tree branches for features of birds. A close association between two variables does not imply that one of them causes the other. Capturing/modeling correlations instead of causation runs a high risk of allowing various confounders to infiltrate the learned feature representations. When affected by the intervening effects of confounders, a network may still make correct predictions when the testing and training data follow the same distribution, but fails when the testing data is out of distribution. This harms the generalization capability of learned feature representations. Thus, learning causal features, from which the interference of confounders is excluded, is important for achieving reliable results. As shown in Fig. 1, confounders C introduce a spurious (non-causal) connection X ← C → Y between samples X and their corresponding labels Y. A classical example sheds light on this: instantiate X, Y, and C as ice cream sales, violent crime, and hot weather, respectively. Seemingly, an increase in ice cream sales X is correlated with an increase in violent crime Y. However, hot weather is the common cause of both, which makes an increase in ice cream sales a misleading factor in analyzing violent crime.
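The ice cream/crime example can be reproduced numerically. Below is a minimal toy simulation (not from the paper; all numbers are illustrative): a confounder C (hot weather) drives both X (ice cream sales) and Y (violent crime), with no causal arrow from X to Y, yet X and Y appear strongly correlated unless C is held fixed.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Confounder C (hot weather) causes both X (sales) and Y (crime);
# there is NO causal effect of X on Y.
weather = rng.normal(size=n)                    # C
sales = 2.0 * weather + rng.normal(size=n)      # X <- C
crime = 1.5 * weather + rng.normal(size=n)      # Y <- C

# Marginally, X and Y look strongly correlated (spurious association).
marginal_corr = np.corrcoef(sales, crime)[0, 1]

# Conditioning on a narrow slice of the confounder removes the association.
hot = np.abs(weather - 1.0) < 0.05
conditional_corr = np.corrcoef(sales[hot], crime[hot])[0, 1]

print(f"corr(X, Y)           = {marginal_corr:.2f}")    # large, ~0.7
print(f"corr(X, Y | C fixed) = {conditional_corr:.2f}") # near zero
```

A network trained only on (X, Y) pairs from such data would happily exploit the spurious 0.7 correlation, which is exactly the failure mode described above.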
Analogically, in deep learning, once such misleading features/confounders are captured, the introduced biases may be mistakenly fitted by neural networks, to the detriment of the generalization capability of the learned features. In theory, we expect DNNs to model the causation between X and Y. Deviating from this expectation, the interventions of confounders C make the learned model implicitly condition on C. As a result, regular feature learning does not approach causal feature learning. To learn causal features, previous studies (Yue et al., 2020; Zhang et al., 2020; Wang et al., 2020b) adopt the back-door criterion (Pearl et al., 2016) to explicitly identify the confounders that should be adjusted for when modeling intervening effects. However, they can only exploit confounders that are accessible and can be estimated, leaving the others free to interfere with causation learning. Moreover, in many scenarios, confounders are unidentifiable or their distributions are hard to model (Pearl et al., 2016). Theoretically, the front-door criterion (Pearl et al., 2016) does not require identifying or explicitly modeling confounders. It introduces an intermediate variable Z and transfers the requirement of modeling the intervening effects of confounders C on X → Y to modeling the intervening effects of X on Z → Y. Without requiring explicit modeling of confounders, the front-door criterion is inherently suitable for a wider range of scenarios. However, how to exploit the front-door criterion for causal visual feature learning is still under-explored. In this paper, we design a Confounder Identification-free Causal visual Feature learning method (CICF). Particularly, CICF models the interventions among different samples based on the front-door criterion, and then approximates the global-scope intervening effect from the instance-level interventions from the perspective of optimization.
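The key property of the front-door criterion can be checked on a small discrete example. The sketch below (our own illustrative numbers, not from the paper) builds a distribution over the graph C → X, C → Y, X → Z, Z → Y, then recovers the interventional distribution P(y | do(x)) via the standard front-door adjustment P(y | do(x)) = Σ_z P(z|x) Σ_x' P(x') P(y|x', z), which uses only observational quantities and never touches C.

```python
import itertools
import numpy as np

# Illustrative conditional tables for the front-door graph
# C -> X, C -> Y, X -> Z, Z -> Y (all variables binary).
p_c = np.array([0.6, 0.4])                   # P(C)
p_x_c = np.array([[0.8, 0.2], [0.3, 0.7]])   # P(X | C), rows indexed by c
p_z_x = np.array([[0.9, 0.1], [0.2, 0.8]])   # P(Z | X), rows indexed by x
p_y_zc = np.zeros((2, 2, 2))                 # P(Y | Z, C), indexed [z, c, y]
p_y_zc[0] = [[0.7, 0.3], [0.4, 0.6]]
p_y_zc[1] = [[0.2, 0.8], [0.1, 0.9]]

# Observational joint P(c, x, z, y) from the graph's factorization.
joint = np.zeros((2, 2, 2, 2))
for c, x, z, y in itertools.product(range(2), repeat=4):
    joint[c, x, z, y] = p_c[c] * p_x_c[c, x] * p_z_x[x, z] * p_y_zc[z, c, y]

def do_truth(x, y):
    """Ground-truth P(y | do(x)), using the (normally inaccessible) confounder."""
    return sum(p_c[c] * p_z_x[x, z] * p_y_zc[z, c, y]
               for c in range(2) for z in range(2))

# Observational marginals only -- no access to C below this line.
p_x = joint.sum(axis=(0, 2, 3))              # P(X)
p_xz = joint.sum(axis=(0, 3))                # P(X, Z)
p_xzy = joint.sum(axis=0)                    # P(X, Z, Y)

def do_frontdoor(x, y):
    """Front-door adjustment: sum_z P(z|x) sum_x' P(x') P(y|x',z)."""
    total = 0.0
    for z in range(2):
        pz_given_x = p_xz[x, z] / p_x[x]
        for xp in range(2):
            py_given_xz = p_xzy[xp, z, y] / p_xz[xp, z]
            total += pz_given_x * p_x[xp] * py_given_xz
    return total

for x, y in itertools.product(range(2), repeat=2):
    assert abs(do_truth(x, y) - do_frontdoor(x, y)) < 1e-12
print("front-door adjustment matches the true interventional distribution")
```

This is exactly the advantage the paper exploits: the adjustment requires estimating effects of X on the mediator Z → Y rather than identifying C.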
In this way, we aim to find a reliable optimization direction, which eliminates the confounding effects of confounders, to learn causal features. There are two challenges we address in CICF: 1) how to model the intervening effects of other samples on a given sample during training; and 2) how to estimate the global-scope intervening effect across all samples in the training set to find a suitable optimization direction. As we know, during training each sample intervenes on the others through its effect on the network parameters via gradient updates. Inspired by this, we propose a gradient-based method to model the intervening effects on a sample from all samples, so as to learn causal visual features. However, it is intractable to incorporate such modeled global-scope intervening effects into network optimization, as this would require a traversal over the entire training set and is costly. To address this, we propose an efficient cluster-then-sample algorithm to approximate the global-scope intervening effects for feasible optimization. Moreover, we revisit the popular meta-learning method Model-Agnostic Meta-Learning (MAML) (Finn et al., 2017). Surprisingly, we find that CICF provides an interpretation of why MAML works well from the perspective of causal learning: MAML tends to learn causal features. We validate the effectiveness of CICF on the Domain Generalization (DG) (Wang et al., 2021; Zhou et al., 2021a) task and conduct extensive experiments on the PACS, Digits-DG, Office-Home, and VLCS datasets. Our method achieves state-of-the-art performance.
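The cluster-then-sample algorithm is only named in this excerpt, so the following is a hypothetical sketch of the general idea under our own assumptions (a toy logistic model, a minimal `kmeans` helper, and k = 8 clusters are all ours): rather than traversing the entire training set to estimate the global-scope effect, cluster the samples, draw one representative per cluster, and form a size-weighted average of the per-representative gradients as a cheap surrogate for the full-dataset gradient.

```python
import numpy as np

rng = np.random.default_rng(0)

def kmeans(feats, k, iters=20):
    """Minimal k-means; returns a cluster index for each row of feats."""
    centers = feats[rng.choice(len(feats), size=k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(feats[:, None] - centers[None], axis=-1)
        assign = dists.argmin(axis=1)
        for j in range(k):
            if np.any(assign == j):
                centers[j] = feats[assign == j].mean(axis=0)
    return assign

def grad_logistic(w, x, y):
    """Gradient of the logistic loss for one sample (binary y in {0, 1})."""
    p = 1.0 / (1.0 + np.exp(-x @ w))
    return (p - y) * x

# Toy training set: two populations of samples.
X = np.vstack([rng.normal(0, 1, (200, 5)), rng.normal(3, 1, (200, 5))])
Y = np.concatenate([np.zeros(200), np.ones(200)])
w = np.zeros(5)

# Full-dataset gradient: the expensive reference a real traversal would give.
full_grad = np.mean([grad_logistic(w, x, y) for x, y in zip(X, Y)], axis=0)

# Cluster-then-sample surrogate: one random representative per cluster,
# weighted by cluster size.
assign = kmeans(X, k=8)
surrogate = np.zeros(5)
for j in range(8):
    idx = np.flatnonzero(assign == j)
    if len(idx) == 0:
        continue
    i = rng.choice(idx)
    surrogate += (len(idx) / len(X)) * grad_logistic(w, X[i], Y[i])

# The surrogate should roughly align with the full gradient's direction.
cos = surrogate @ full_grad / (np.linalg.norm(surrogate) * np.linalg.norm(full_grad))
print(f"cosine(full gradient, surrogate) = {cos:.2f}")
```

The sketch costs one gradient per cluster instead of one per sample, which is the kind of trade-off that makes the global-scope estimate feasible inside a training loop.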

2. RELATED WORK

Causal Inference aims to pursue the causal effect of a particular phenomenon by removing the interventions of confounders (Pearl et al., 2016). Despite its success in economics (Rubin, 1986), statistics (Rubin, 1986; Imbens & Rubin, 2015), and social science (Murnane & Willett, 2010), big challenges arise when it meets machine learning, i.e., how to model the intervention



Figure 1: (a) Examples of some confounders, which may lead to learning biased features. (b) Back-door criterion in causal inference, where the confounders are accessible. (c) Front-door criterion in causal inference, where the confounders are inaccessible.

