MODALITY COMPLEMENTARITY: TOWARDS UNDERSTANDING MULTI-MODAL ROBUSTNESS

Abstract

Along with the success of multi-modal learning, the robustness of multi-modal learning is receiving attention due to real-world safety concerns. Multi-modal models are anticipated to be more robust due to the possible redundancy between modalities. However, some empirical results have offered contradictory conclusions. In this paper, we point out an essential factor that causes this discrepancy: the difference in the amount of modality-wise complementary information. We provide an information-theoretical analysis of how modality complementarity affects multi-modal robustness. Based on the analysis, we design a metric for quantifying how complementary the modalities in a dataset are and propose an effective pipeline to calculate our metric. Experiments on carefully designed synthetic data verify our theory. Further, we apply our metric to real-world multi-modal datasets and reveal their properties. To the best of our knowledge, we are the first to identify modality complementarity as an important factor affecting multi-modal robustness.

1. INTRODUCTION

Recently, deep neural networks have proved successful in various areas, such as image recognition (He et al., 2015; Krizhevsky et al., 2012), speech recognition (Chorowski et al., 2015), and neural machine translation (Wu et al., 2016). The revolution is also happening in multi-modal research, e.g., RGB-D semantic segmentation (Wang et al., 2016), audio-visual learning (Zhao et al., 2018), and visual question answering (Antol et al., 2015). Intuitively, multi-modal models are anticipated to be more robust due to the potential redundancy between modalities: when one modality is corrupted, others can compensate for the loss. This intuition is supported both by psychological studies of the human perception system (Sumby & Pollack, 1954) and by deep learning practice (Zhang et al., 2019b; Qian et al., 2021; Wang et al., 2020). However, some recent studies cast doubt on this belief. From a theoretical perspective, multi-modal models usually have a larger input dimension than uni-modal models, and increasing the input dimension significantly degrades model robustness (Ford et al., 2019; Simon-Gabriel et al., 2019). From an empirical view, some experiments suggest that multi-modal integration may be more vulnerable to attacks or corruptions than uni-modal models (Yu et al., 2020; Tian & Xu, 2021; Ma et al., 2022). What causes this contradiction in multi-modal robustness? We notice that the conclusions above are drawn under assorted multi-modal task settings ranging from action classification to question answering, which vary in the presence and type of modality interconnections (Liang et al., 2022). Therefore, a question arises naturally: which aspects of modality interconnection affect multi-modal robustness? We hypothesize that the complementarity of modalities plays an essential role. If the complementary part of each modality is negligible, the corruption of one modality would not severely damage the model performance.
Otherwise, the multi-modal model could perform even worse than a uni-modal model. For the visual question answering task, the two modalities are highly complementary: perceiving only the question or only the image cannot lead to an ideal answer (Agrawal et al., 2018). For the action classification task, RGB and optical flow are less complementary, since each of them alone can suggest a roughly correct answer (Feichtenhofer et al., 2016b). To validate this hypothesis, we first demonstrate the key role of modality complementarity in model robustness through theoretical analysis. Following previous work (Tsai et al., 2020; Sun et al., 2020; Sridharan & Kakade, 2008; Tosh et al., 2021), we use an information-theoretical framework for multi-modal learning and study how complementary information affects robustness under missing-modality and noisy-modality settings. Based on the analysis, we design a novel metric, along with a practical calculation pipeline built on the Mutual Information Neural Estimator (MINE) (Belghazi et al., 2018), to quantify the complementarity of modalities in multi-modal datasets. With this metric and pipeline in hand, we verify our theory and the effectiveness of the proposed metric on synthetic data and a carefully designed toy dataset, AAV-MNIST. The results are consistent with model robustness under missing-modality, noisy-modality, and adversarial-attack settings on the datasets we test. We then apply our metric to real-world multi-modal datasets to further investigate modality complementarity in different settings. To the best of our knowledge, we are the first to identify and prove the important role of modality complementarity in multi-modal robustness. Hence, for future research, we recommend that researchers consider modality complementarity as a control variable for a fairer comparison of multi-modal robustness.
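To make the notion of complementary information concrete, the toy computation below contrasts a maximally complementary pair of binary modalities (Y = X1 XOR X2, the question-answering-like case) with a fully redundant pair (X2 duplicates X1, the RGB/optical-flow-like case), measuring the conditional mutual information I(Y; X1 | X2) exactly from the joint distribution. This is an illustrative sketch in the spirit of the information-theoretic framework above, not the paper's actual metric or its MINE-based pipeline; all variable names are our own.

```python
from collections import defaultdict
from math import log2

def cond_mi(joint):
    """Exact I(Y; X1 | X2) in bits for a joint distribution {(x1, x2, y): p}."""
    p_x2 = defaultdict(float)     # p(x2)
    p_x1x2 = defaultdict(float)   # p(x1, x2)
    p_x2y = defaultdict(float)    # p(x2, y)
    for (x1, x2, y), p in joint.items():
        p_x2[x2] += p
        p_x1x2[(x1, x2)] += p
        p_x2y[(x2, y)] += p
    mi = 0.0
    for (x1, x2, y), p in joint.items():
        if p > 0:
            # p * log2( p(x1, y | x2) / (p(x1 | x2) * p(y | x2)) )
            mi += p * log2(p * p_x2[x2] / (p_x1x2[(x1, x2)] * p_x2y[(x2, y)]))
    return mi

# Highly complementary modalities: Y = X1 XOR X2, X1 and X2 uniform and independent.
xor = {(a, b, a ^ b): 0.25 for a in (0, 1) for b in (0, 1)}
# Fully redundant modalities: X2 duplicates X1 and Y = X1.
dup = {(a, a, a): 0.5 for a in (0, 1)}

print(cond_mi(xor))  # 1.0 bit: X1 remains essential even after seeing X2
print(cond_mi(dup))  # 0.0 bit: X1 adds nothing once X2 is observed
```

In the XOR case neither modality alone carries any information about the label (I(Y; X1) = 0), yet jointly they determine it, which is exactly the regime where corrupting one modality is most damaging.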
The main contributions are highlighted as follows:
• We point out the effect of modality complementarity on multi-modal model robustness through information-theoretical analysis.
• We propose a dataset-wise metric to quantitatively evaluate how complementary the modalities are in each multi-modal dataset, and design a pipeline for computing the metric on real-world datasets.
• We create a synthetic dataset and a toy dataset (AAV-MNIST) to test our metric and pipeline. These datasets cover various complementarity situations across modalities and are used to verify the effectiveness of our pipeline.
• We further reveal the modality complementarity and its relationship with model robustness in real-world multi-modal datasets, which could lead to a less biased comparison of multi-modal robustness.
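The introduction's claim, that losing a modality is catastrophic only when that modality carries complementary information, can be sanity-checked with a Bayes-optimal classifier on two toy joint distributions. The sketch below (our own illustrative construction, not an experiment from the paper) compares full-input accuracy against accuracy when modality X1 is dropped:

```python
from collections import defaultdict

def bayes_accuracy(joint, visible):
    """Accuracy of the Bayes-optimal classifier that sees only the inputs
    whose indices appear in `visible` (0 -> X1, 1 -> X2).
    `joint` maps (x1, x2, y) to its probability."""
    evidence = defaultdict(lambda: defaultdict(float))  # evidence -> {label: prob}
    for (x1, x2, y), p in joint.items():
        key = tuple((x1, x2)[i] for i in visible)
        evidence[key][y] += p
    # For each evidence value, the optimal classifier picks the most likely label.
    return sum(max(label_probs.values()) for label_probs in evidence.values())

# Highly complementary modalities: Y = X1 XOR X2.
xor = {(a, b, a ^ b): 0.25 for a in (0, 1) for b in (0, 1)}
# Fully redundant modalities: X2 duplicates X1 and Y = X1.
dup = {(a, a, a): 0.5 for a in (0, 1)}

print(bayes_accuracy(xor, (0, 1)), bayes_accuracy(xor, (1,)))  # 1.0 0.5: dropping X1 is catastrophic
print(bayes_accuracy(dup, (0, 1)), bayes_accuracy(dup, (1,)))  # 1.0 1.0: dropping X1 costs nothing
```

Even the best possible model falls to chance on the complementary dataset when one modality is missing, while the redundant dataset loses nothing, which is the dataset-level effect the proposed metric is designed to predict.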

2. RELATED WORK

Multi-modal learning. Various multi-modal learning tasks and models have been proposed in recent years (Baltrusaitis et al., 2017; Liang et al., 2021), such as multi-modal reasoning (Yi et al., 2019; Johnson et al., 2016), cross-modal retrieval (Gu et al., 2017; Radford et al., 2021), and cross-modal translation (Ramesh et al., 2021). Among these settings, we mainly focus on supervised multi-modal classification. The theoretical understanding of multi-modal learning is relatively under-explored, with (Huang et al., 2021) deriving generalization error bounds and (Sun et al., 2020) comparing multi-modal models with Bayesian posterior classifiers. A concept close to multi-modal learning is multi-view learning (Xu et al., 2013), which has long been studied both theoretically (Zhang et al., 2019a; Tosh et al., 2021) and empirically (Sindhwani et al., 2005; Ding et al., 2021; Amini et al., 2009; Tian et al., 2019). Earlier work (Kakade & Foster, 2007; Sridharan & Kakade, 2008) proposes the multi-view assumption: each modality suffices to predict the label. Recently, many multi-view analyses have adopted this assumption (Han et al., 2021; Tsai et al., 2020; Lin et al., 2021; Federici et al., 2020; Lin et al., 2022). However, as pointed out by (Huang et al., 2021; 2022), this assumption might not hold in the multi-modal learning setting.

Model robustness. Model robustness under missing data (Ramoni & Sebastiani, 2001), random corruption (Hendrycks & Dietterich, 2019), and adversarial attacks (Madry et al., 2017) has been a constant concern in view of real-world safety issues. For uni-modal models, several methods have been proposed to strengthen robustness (Papernot et al., 2015; Huang et al., 2015; Meng & Chen, 2017). For multi-modal models, some papers regard the use of multiple modalities as a way to improve robustness (Zhang et al., 2019b; Qian et al., 2021; Wang et al., 2020), while others improve multi-modal robustness by designing new network architectures and fusion methods (Kim & Ghosh, 2019a; Tsai et al., 2018; Yang et al., 2021) or training routines (Eitel et al., 2015; Ma et al., 2021). When dealing with known missing patterns, researchers explore additional approaches: imputing data from the available modalities or views (Tran et al., 2017; Lin et al., 2021), or training different models for different modality availability (Yuan et al., 2012). Our analysis

