ON UNI-MODAL FEATURE LEARNING IN SUPERVISED MULTI-MODAL LEARNING

Abstract

We abstract the features of multi-modal data into 1) uni-modal features, which can be learned from uni-modal training, and 2) paired features, which can only be learned from cross-modal interaction. Multi-modal joint training is expected to benefit from cross-modal interaction on the basis of ensuring uni-modal feature learning. However, recent late-fusion training approaches still suffer from insufficient learning of uni-modal features on each modality, and we prove that this phenomenon hurts the model's generalization ability. Given a multi-modal task, we propose to choose a targeted late-fusion learning method from Uni-Modal Ensemble (UME) and the proposed Uni-Modal Teacher (UMT), according to the distribution of uni-modal and paired features. We demonstrate that, under a simple guiding strategy, we can achieve comparable results to other complex late-fusion or intermediate-fusion methods on multi-modal datasets, including VGG-Sound, Kinetics-400, UCF101, and ModelNet40.

Figure 1: Overview of Modality Laziness. Although multi-modal joint training provides the opportunity for cross-modal interaction to learn paired features, the model easily saturates and ignores the uni-modal features that are hard to learn but also important to generalization. (Legend: uni-modal features vs. paired features, learned vs. unlearned features, for Modality #1 and Modality #2.)

1. INTRODUCTION

Multi-modal signals, e.g., vision, sound, and text, are ubiquitous in our daily life, allowing us to perceive the world through multiple sensory systems. Inspired by the crucial role that multi-modal interactions play in human perception and decision making (Smith & Gasser, 2005), substantial efforts have been made to build effective and reliable computational multi-modal systems in fields like multimedia computing (Wang et al., 2020; Xiao et al., 2020), representation learning (Arandjelovic & Zisserman, 2017; Radford et al., 2021) and robotics (Chen et al., 2020a). According to how the features of multi-modal data can be learned, we abstract them into two categories: (1) uni-modal features, which can be learned from uni-modal training, and (2) paired features, which can only be learned from cross-modal interaction. In this paper, we focus on multi-modal tasks where uni-modal priors are meaningful^1 (Kay et al., 2017; Chen et al., 2020b).

Ideally, we hope that multi-modal joint training can learn paired features through cross-modal interactions while ensuring that enough uni-modal features are learned. However, recent late-fusion methods still suffer from learning insufficient uni-modal representations of each modality (Peng et al., 2022). We term this phenomenon Modality Laziness and illustrate it in Figure 1. We theoretically characterize Modality Laziness and prove that it hurts the generalization ability of the model, especially when uni-modal features are dominant in the given task.

Besides the laziness problem, another shortcoming of recent late-fusion approaches is that they are complex to implement. For example, G-Blending (Wang et al., 2020) needs an extra split of data to estimate the overfitting-to-generalization ratio, re-weight the losses, and then re-train the model repeatedly. Peng et al. (2022) propose OGM-GE, which dynamically adjusts the gradients of different modalities during training. However, it requires tuning many hyper-parameters^2, including the start and end epochs of the gradient modulation, an "alpha" used to calculate the modulation coefficients, and whether adaptive Gaussian noise Enhancement (GE) is needed. More troublesome still, these hyper-parameters need to be re-tuned on each new dataset. Simpler and equally effective methods are therefore urgently needed.

We pay attention to the learning of uni-modal features and propose to choose a targeted late-fusion training method from Uni-Modal Ensemble (UME) and the proposed Uni-Modal Teacher (UMT), according to the distribution of uni-modal and paired features. If both uni-modal and paired features are essential, UMT is effective: it helps multi-modal models better learn uni-modal features via uni-modal distillation. If both modalities have strong uni-modal features and paired features are not important enough, UME is more appropriate: it combines the predictions of uni-modal models and completely avoids insufficient learning of uni-modal features. We also provide an empirical trick to decide which one to use.
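For concreteness, the UME option described above can be sketched as a simple combination of class posteriors from independently trained uni-modal models (a minimal PyTorch sketch; the function name `ume_predict` and the equal default weights are our illustrative assumptions, not part of the method specification):

```python
import torch
import torch.nn.functional as F

def ume_predict(logits_a: torch.Tensor,
                logits_b: torch.Tensor,
                w_a: float = 0.5,
                w_b: float = 0.5) -> torch.Tensor:
    """Uni-Modal Ensemble sketch: combine the predictions of two
    independently trained uni-modal models by averaging their class
    posteriors. Each model is trained on its own modality, so neither
    suffers from Modality Laziness. The weights w_a/w_b are assumed
    hyper-parameters for illustration."""
    p_a = F.softmax(logits_a, dim=-1)  # posterior from modality #1 model
    p_b = F.softmax(logits_b, dim=-1)  # posterior from modality #2 model
    return w_a * p_a + w_b * p_b       # ensembled class posterior
```

Because the two uni-modal models never interact during training, this variant cannot capture paired features; the text above motivates using it only when uni-modal features dominate the task.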

2. RELATED WORK

Multi-modal training approaches aim to train a multi-modal model using all modalities simultaneously (Liang et al., 2021), covering tasks such as audio-visual classification (Peng et al., 2022; Xiao et al., 2020; Panda et al., 2021), action recognition (Wang et al., 2020; Panda et al., 2021), visual question answering (Agrawal et al., 2018) and RGB-D segmentation (Park et al., 2017; Hu et al., 2019; Seichter et al., 2020). There are several fusion strategies, including early/middle fusion (Seichter et al., 2020; Nagrani et al., 2021; Wu et al., 2022) and late fusion (Wang et al., 2020; Peng et al., 2022; Fayek & Kumar, 2020). In this paper, we mainly consider late-fusion methods following Wang et al. (2020), which make it convenient and straightforward to evaluate the learning of uni-modal features. We demonstrate that simple late-fusion approaches can outperform approaches with more complex model architectures (Wu et al., 2022; Xiao et al., 2020).

Multi-modal learning theory. Research on multi-modal learning theory is still in its infancy. A line of work focuses on understanding multi-view tasks (Amini et al., 2009; Xu et al., 2013; Arora et al., 2016; Allen-Zhu & Li, 2020), and our assumption on the data structure partially stems from Allen-Zhu & Li (2020). Huang et al. (2021) explain why multi-modal learning is potentially better than uni-modal learning, and Huang et al. (2022) explain why failures exist in multi-modal learning. Our paper investigates the different types of features in multi-modal data and provides solutions for the weaknesses of multi-modal learning.

Knowledge distillation was introduced to compress the knowledge of an ensemble into a smaller and faster model while preserving competitive generalization power (Buciluǎ et al., 2006; Hinton et al., 2015; Tian et al., 2019; Gou et al., 2021; Allen-Zhu & Li, 2020). In this paper, we propose Uni-Modal Teacher, which leverages uni-modal distillation during joint training to help the learning of uni-modal features, without involving cross-modal knowledge distillation (Pham et al., 2019; Gupta et al., 2016; Tan & Bansal, 2020; Garcia et al., 2018; Luo et al., 2018).

^1 Uni-modal prior here means that we get predictions only according to one modality in multi-modal tasks.
^2 https://github.com/GeWu-Lab/OGM-GE_CVPR2022

3. ANALYSIS, LEARNING GUIDANCE AND THEORY

In this section, we show the drawbacks and advantages of joint training. On one hand, joint training results in insufficient learning of uni-modal features (Modality Laziness). On the other hand, it
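To make the uni-modal distillation idea behind UMT concrete, the joint-training objective can be sketched as a task loss on the fused prediction plus per-modality terms that pull each encoder's features toward those of a frozen uni-modal teacher (a minimal PyTorch sketch; the function name `umt_loss`, the MSE form of the distillation term, and the weight `lam` are our illustrative assumptions, not the paper's exact formulation):

```python
import torch
import torch.nn.functional as F

def umt_loss(fused_logits: torch.Tensor,
             labels: torch.Tensor,
             feat_a: torch.Tensor,
             feat_b: torch.Tensor,
             teacher_feat_a: torch.Tensor,
             teacher_feat_b: torch.Tensor,
             lam: float = 1.0) -> torch.Tensor:
    """Uni-Modal Teacher sketch: standard classification loss on the
    late-fused prediction, plus distillation terms that encourage each
    encoder of the joint model to match features produced by a frozen,
    pre-trained uni-modal teacher on the same modality. This counteracts
    Modality Laziness by forcing both encoders to keep learning
    uni-modal features during joint training."""
    task = F.cross_entropy(fused_logits, labels)
    # .detach() stops gradients from flowing into the frozen teachers.
    distill = (F.mse_loss(feat_a, teacher_feat_a.detach())
               + F.mse_loss(feat_b, teacher_feat_b.detach()))
    return task + lam * distill
```

At inference time the teachers are discarded, so the distillation adds no cost to deployment; only joint training is affected.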



Under this guidance, we achieve comparable results to other complex late-fusion or intermediate-fusion methods on multiple multi-modal datasets, including VGG-Sound (Chen et al., 2020b), Kinetics-400 (Kay et al., 2017), UCF101 (Soomro et al., 2012) and ModelNet40 (Wu et al., 2022).

