A MATHEMATICAL FRAMEWORK FOR CHARACTERIZING DEPENDENCY STRUCTURES IN MULTIMODAL LEARNING

Abstract

Dependency structures between modalities have been utilized, both explicitly and implicitly, in multimodal learning to enhance classification performance, particularly when training samples are insufficient. Recent efforts have concentrated on designing suitable dependency structures and applying them in deep neural networks, but the interplay between the training sample size and the choice of structure has not received enough attention. To address this issue, we propose a mathematical framework that characterizes conditional dependency structures analytically. It provides an explicit description of the role of the sample size in learning various structures in a non-asymptotic regime, and it demonstrates how task complexity and a fitness evaluation of the conditional dependency structure affect the results. Furthermore, we develop an autonomously updated coefficient algorithm, auto-CODES, based on the theoretical framework, and conduct experiments on multimodal emotion recognition tasks using the MELD and IEMOCAP datasets. The experimental results validate our theory and show the effectiveness of the proposed algorithm.

1. INTRODUCTION

Multimodal learning has recently become an active research area in machine learning, aiming to jointly extract information and learn knowledge from different categories of data, such as images, audio, and text (Ngiam et al., 2011; Zadeh et al., 2019; Kiela et al., 2020). A critical issue in multimodal learning is to design efficient algorithms that extract features from the various modalities such that the label information can be effectively recovered for classification, especially when the number of training samples is insufficient to learn a large and complex multimodal structure. Many works have addressed this issue with different kinds of algorithms (Baltrušaitis et al., 2018); the main research stream focuses on extracting features from several modalities that are both relevant to the label and connected to one another (Gao et al., 2020; Ma et al., 2020; Summaira et al., 2021). The effectiveness of such algorithms stems from the intuition that, in many real multimodal datasets, the label captures the common patterns shared between different modalities. For multimodal data with this property, designing modality features with higher correlations implicitly forces the algorithm to search for features more informative about the label, and hence often requires fewer training samples to achieve good performance. From a statistical learning perspective, this benefit can be interpreted as the modalities and the label following a conditional dependency structure in which the modalities are independent of each other once the label is given. Since this structure lives in a relatively low-dimensional space, it demands fewer samples to learn a good representation (Varma et al., 2019).
Thus, several factors affect classification performance, including (i) the number of labeled training samples, (ii) how well the assumed conditional dependency structure fits the true one, and (iii) the complexity of the discrimination task. Prior works exploit appropriate dependency structures to achieve good performance by designing effective networks and fusion approaches (Zadeh et al., 2017; Liu et al., 2018; Nagrani et al., 2021) as well as objective functions (Sohn et al., 2014; Sutter et al., 2020; Piergiovanni et al., 2020). However, most existing works focus on designing learning algorithms and architectures without a theoretical understanding of the role of the training sample size in multimodal problems, which potentially limits performance, especially for complicated multimodal problems. In this paper, we consider a model that learns a linear combination of two types of estimators (Piergiovanni et al., 2020): one representing the general dependency structure and one representing the considered conditional dependency structure. We then propose a testing loss function to evaluate the performance of the trained model in a non-asymptotic regime; it measures the average performance of the linearly combined estimator over a given number of training samples and is parametrized by the linear combination coefficient of the two estimators. Minimizing the testing loss yields the optimal coefficient, which determines the optimal estimator and can lead to more informative classification features. The optimal coefficient can also be used to characterize the conditional dependency structure in the process. The detailed mathematical formulation and interpretations are presented in Section 2 and Section 3. In particular, we show that the explicit analytical solution of the optimal coefficient is inversely proportional to the number of labeled training samples and to the fitness measurement of the conditional dependency structure.
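As a concrete illustration of this setup, the following sketch combines two probability estimators with a coefficient and selects that coefficient by minimizing a held-out loss via grid search. All function names, the grid search, and the particular held-out loss are our own illustrative choices, not the paper's exact construction (the paper derives the optimal coefficient analytically):

```python
import numpy as np

def combine(p_general, p_conditional, lam):
    """Linear combination lam * (conditional-structure estimator)
    + (1 - lam) * (general estimator).
    Remains a valid distribution for lam in [0, 1]."""
    return lam * p_conditional + (1.0 - lam) * p_general

def best_coefficient(p_general, p_conditional, heldout_loss,
                     grid=np.linspace(0.0, 1.0, 101)):
    """Grid-search the combination coefficient that minimizes a
    user-supplied held-out loss (a stand-in for the testing loss)."""
    losses = [heldout_loss(combine(p_general, p_conditional, lam))
              for lam in grid]
    return grid[int(np.argmin(losses))]
```

For example, if the held-out loss is a cross-entropy against a validation distribution that matches the conditional-structure estimator, the search returns a coefficient of 1, i.e., full weight on the conditional structure.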
Moreover, the optimal coefficient is proportional to the complexity of the learning task, which we discuss in Section 2. For instance, when the number of samples is insufficient to learn a high-dimensional model, it is preferable to fit the low-dimensional conditional dependency structure, which requires fewer parameters; correspondingly, the coefficient on the estimator for the conditional structure will be large. The optimal coefficient therefore indicates the efficient model one should choose for predicting the label in a multimodal problem, with the number of training samples taken into account. Our approach thus provides guidance and theoretical understanding for designing efficient multimodal algorithms that exploit the sample size and different dependency structures. Finally, we extend our theoretical results and propose an algorithm with an autonomously updated coefficient on dependency structures (auto-CODES) by exploiting parametric models; it computes the weights on different dependency structures automatically as the features evolve in deep neural networks. Experiments on emotion recognition tasks with the MELD and IEMOCAP datasets validate our theoretical results and show the effectiveness of the algorithm. The main contributions of this paper can be summarized as follows:

• We propose a novel theoretical framework for multimodal analyses that characterizes the influence of the conditional dependency structure. To the best of our knowledge, this is the first work to give an explicit characterization of the number of training samples required by different dependency structures for multimodal learning in a non-asymptotic regime. The framework also quantifies the effect on the estimation of the task complexity and of the fitness of the conditional dependency structure, measured by the χ²-divergence.

• We extend the analyses from discrete data to continuous real-world data by exploiting parametric models. Furthermore, we propose an algorithm with an autonomously updated coefficient on different dependency structures (auto-CODES) based on the theoretical analyses.

• We evaluate the proposed auto-CODES algorithm on multimodal emotion recognition tasks with the widely used MELD and IEMOCAP datasets. The experimental results validate our theory and show the effectiveness of our algorithm.

Due to space limitations, the proofs of the theorems are presented in the supplemental material.
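For reference, the fitness measurement mentioned above is based on the χ²-divergence. For two distributions $P$ and $Q$ on a finite alphabet $\mathcal{X}$ with $Q(x) > 0$ for all $x$, the standard definition is

```latex
\chi^2(P \,\|\, Q) \;=\; \sum_{x \in \mathcal{X}} \frac{\bigl(P(x) - Q(x)\bigr)^2}{Q(x)},
```

which is zero exactly when $P = Q$, so larger values indicate a poorer fit of the assumed structure; its precise role in our bounds is developed in the later sections.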

2. PROBLEM FORMULATION AND ANALYSIS

In this section, we consider a multimodal scenario in which both modalities are discrete random variables; for clarity of illustration, we elaborate the framework in the two-modality case. Specifically, we focus on the linear combination of two types of estimators. By introducing a testing loss, we evaluate the performance of the proposed estimator. Finally, we determine the optimal combining coefficient by minimizing the testing loss and discuss the factors that affect it.

Notation: Let random variables $X_1$, $X_2$, and $Y$ denote the two modalities and the label, taking values in finite alphabets $\mathcal{X}_1$, $\mathcal{X}_2$, and $\mathcal{Y}$, respectively. Then $n$ sample tuples $\mathcal{D} \triangleq \{(x_1^{(i)}, x_2^{(i)}, y^{(i)})\}_{i=1}^n$ are generated in an independent, identically distributed (i.i.d.) manner from the true joint distribution $P_{X_1X_2Y}$, where $P_{X_1X_2Y}(x_1, x_2, y) > 0$ for all entries. Specifically, we consider two different estimators to approximate the joint distribution $P_{X_1X_2Y}$: (i) the empirical joint distribution $\hat{P}_{X_1X_2Y}$,
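For discrete modalities, the empirical joint distribution and a conditional-independence-structured estimator can both be computed directly from counts. The sketch below is illustrative; the factorized estimator $\hat{P}_Y \hat{P}_{X_1|Y} \hat{P}_{X_2|Y}$ is a natural candidate for the second, structure-based estimator, though the paper's exact construction is given in the formal development:

```python
import numpy as np

def empirical_joint(x1, x2, y, n1, n2, ny):
    """Empirical joint distribution P_hat(x1, x2, y) from sample counts."""
    p = np.zeros((n1, n2, ny))
    for a, b, c in zip(x1, x2, y):
        p[a, b, c] += 1.0
    return p / len(x1)

def conditional_independence_estimator(x1, x2, y, n1, n2, ny):
    """Factorized estimator P_hat(y) * P_hat(x1|y) * P_hat(x2|y), i.e.,
    the modalities are modeled as conditionally independent given the
    label. Assumes every label value appears at least once in the sample."""
    joint = empirical_joint(x1, x2, y, n1, n2, ny)
    p_y = joint.sum(axis=(0, 1))        # P_hat(y), shape (ny,)
    p_x1_y = joint.sum(axis=1) / p_y    # P_hat(x1|y), shape (n1, ny)
    p_x2_y = joint.sum(axis=0) / p_y    # P_hat(x2|y), shape (n2, ny)
    # Recombine the factors into a full (n1, n2, ny) distribution.
    return np.einsum('c,ac,bc->abc', p_y, p_x1_y, p_x2_y)
```

The factorized estimator has on the order of $|\mathcal{Y}|(|\mathcal{X}_1| + |\mathcal{X}_2|)$ free parameters instead of $|\mathcal{X}_1||\mathcal{X}_2||\mathcal{Y}|$, which is the sample-efficiency advantage discussed in the introduction.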

