A MATHEMATICAL FRAMEWORK FOR CHARACTERIZING DEPENDENCY STRUCTURES IN MULTIMODAL

Abstract

Dependency structures between modalities have been exploited, both explicitly and implicitly, in multimodal learning to enhance classification performance, particularly when training samples are insufficient. Recent efforts have concentrated on designing suitable dependency structures and applying them in deep neural networks, but the interplay between the training sample size and these structures has not received enough attention. To address this issue, we propose a mathematical framework that characterizes conditional dependency structures in an analytic way. It provides an explicit description of the sample size required to learn various structures in a non-asymptotic regime, and it shows how task complexity and the fitness of the assumed conditional dependency structure affect the results. Furthermore, we develop an autonomously updated coefficient algorithm, auto-CODES, based on the theoretical framework, and conduct experiments on multimodal emotion recognition tasks using the MELD and IEMOCAP datasets. The experimental results validate our theory and demonstrate the effectiveness of the proposed algorithm.

1. INTRODUCTION

Multimodal learning has recently become an active research area in machine learning, aiming to jointly extract information and learn knowledge from different categories of data, such as images, audio, and text (Ngiam et al., 2011; Zadeh et al., 2019; Kiela et al., 2020). A critical issue in multimodal learning is to design efficient algorithms that extract features from the various modalities such that the label information is effectively captured for classification, especially when the number of training samples is insufficient to learn a large and complex multimodal structure. A substantial body of literature addresses this issue with different kinds of algorithms (Baltrušaitis et al., 2018); the main research stream focuses on extracting features from several modalities that are both relevant to the label and connected to one another (Gao et al., 2020; Ma et al., 2020; Summaira et al., 2021). The effectiveness of such algorithms may stem from the intuition that, in many real multimodal datasets, the labels are the common patterns shared across different modalities. For multimodal data with this property, designing modality features with higher correlations implicitly forces the algorithm to search for features that are more informative about the label, and hence often requires fewer training samples to achieve good performance. From a statistical learning perspective, such benefits can be interpreted as the modalities and the label following a conditional dependency structure in which the modalities are independent of each other once the label is given. Since such a structure lies in a relatively low-dimensional space, fewer samples are needed to learn a good representation (Varma et al., 2019).
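The conditional independence structure described above, P(x1, x2 | y) = P(x1 | y) P(x2 | y), can be illustrated with a small simulation. The sketch below (a toy model with illustrative parameters, not taken from this paper) generates two binary "modalities" that are each noisy evidence of the label and are sampled independently given the label, then checks the factorization empirically:

```python
import random

random.seed(0)

# Toy generative model: label y in {0, 1}; two binary modalities x1, x2
# that are conditionally independent given y. The 0.8/0.2 flip
# probabilities are illustrative assumptions.
def sample(n):
    data = []
    for _ in range(n):
        y = random.randint(0, 1)
        p = 0.8 if y == 1 else 0.2      # each modality is noisy evidence of y
        x1 = 1 if random.random() < p else 0
        x2 = 1 if random.random() < p else 0
        data.append((x1, x2, y))
    return data

data = sample(100_000)

def cond_prob(event, given_y):
    # Empirical P(event | Y = given_y)
    sub = [d for d in data if d[2] == given_y]
    return sum(1 for d in sub if event(d)) / len(sub)

# Check P(x1=1, x2=1 | y=1) against P(x1=1 | y=1) * P(x2=1 | y=1):
joint = cond_prob(lambda d: d[0] == 1 and d[1] == 1, 1)
prod = cond_prob(lambda d: d[0] == 1, 1) * cond_prob(lambda d: d[1] == 1, 1)
print(round(joint, 3), round(prod, 3))  # both close to 0.8 * 0.8 = 0.64
```

Because the joint conditional distribution factorizes into per-modality terms, it is described by far fewer parameters than an unrestricted joint distribution, which is the intuition behind the reduced sample requirement.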
Thus, several factors can affect classification performance, including (i) the number of labeled training samples, (ii) how well the assumed conditional dependency structure matches the true one, and (iii) the complexity of the discrimination task. Existing works exploit appropriate dependency structures to achieve good performance by designing effective networks, fusion approaches (Zadeh et al., 2017; Liu et al., 2018; Nagrani et al., 2021), and objective functions (Sohn et al., 2014; Sutter et al., 2020; Piergiovanni et al., 2020). However, most existing works focus on designing learning algorithms and architectures without a theoretical understanding of the role of training sample size in multimodal problems, which potentially limits performance, especially for complicated multimodal problems.

