A MATHEMATICAL FRAMEWORK FOR CHARACTERIZING DEPENDENCY STRUCTURES IN MULTIMODAL LEARNING

Abstract

Dependency structures between modalities have been utilized explicitly and implicitly in multimodal learning to enhance classification performance, particularly when the training samples are insufficient. Recent efforts have concentrated on developing suitable dependency structures and applying them in deep neural networks, but the interplay between the training sample size and the various structures has not received enough attention. To address this issue, we propose a mathematical framework that characterizes conditional dependency structures analytically. It provides an explicit description of the role of the sample size in learning various structures in a non-asymptotic regime. Additionally, it demonstrates how the task complexity and a fitness evaluation of the conditional dependency structure affect the results. Furthermore, based on the theoretical framework, we develop auto-CODES, an algorithm with an autonomously updated coefficient, and conduct experiments on multimodal emotion recognition tasks using the MELD and IEMOCAP datasets. The experimental results validate our theory and show the effectiveness of the proposed algorithm.

1. INTRODUCTION

Multimodal learning has recently become an active research area in machine learning. It aims to jointly extract information and learn knowledge from different categories of data, such as images, audio, and text (Ngiam et al., 2011; Zadeh et al., 2019; Kiela et al., 2020). A critical issue in multimodal learning is to design efficient algorithms that extract features from the various modalities so that the label information can be effectively recovered for classification, especially when the number of training samples is insufficient to learn a large and complex multimodal structure. A large body of literature addresses this issue with different kinds of algorithms (Baltrušaitis et al., 2018), and the main research stream focuses on extracting features from several modalities that are both relevant to the label and connected to one another (Gao et al., 2020; Ma et al., 2020; Summaira et al., 2021). The effectiveness of such algorithms may result from the intuition that, in many real multimodal datasets, the labels appear to be the common patterns shared between different modalities. For multimodal data with this property, designing modality features with higher correlations implicitly forces the algorithm to search for features that are more informative about the label, and hence often requires fewer training samples to achieve good performance. From the statistical learning perspective, this benefit can be interpreted as the modalities and the label following a conditional dependency structure in which the modalities are independent of each other once the label is given. Since this structure lies in a relatively low-dimensional space, it demands fewer samples to learn a good representation (Varma et al., 2019).

Thus, several factors can affect the classification performance, including (i) the number of labeled training samples, (ii) the fitness of the conditional dependency structure to represent the true one, and (iii) the complexity of the discrimination task. Existing works exploit appropriate dependency structures to achieve good performance by designing effective networks, fusion approaches (Zadeh et al., 2017; Liu et al., 2018; Nagrani et al., 2021), and objective functions (Sohn et al., 2014; Sutter et al., 2020; Piergiovanni et al., 2020). However, most of them focus on designing learning algorithms and architectures without a theoretical understanding of the role of the training sample size in multimodal problems, which potentially limits the performance, especially for complicated multimodal problems.

In this paper, we consider a model that learns a linear combination of two types of estimators (Piergiovanni et al., 2020), representing the general dependency structure and the considered conditional dependency structure. We then propose a testing loss function to evaluate the performance of the trained model in a non-asymptotic regime. It measures the average performance of the linearly combined estimator over a given number of training samples, and it is parametrized by the linear combination coefficient of the two types of estimators. Minimizing the testing loss gives the optimal coefficient, which determines the optimal estimator and can lead to more informative classification features. The optimal coefficient can also be used to characterize the conditional dependency structure. The detailed mathematical formulation and interpretations are presented in Section 2 and Section 3. In particular, we show that the explicit analytical solution of the optimal coefficient is inversely proportional to the number of labeled training samples and to the fitness measurement of the conditional dependency structure.
Also, it is proportional to the complexity of the learning task, which we discuss in Section 2. For instance, when the number of samples is insufficient to learn a high-dimensional model, it is preferable to fit the low-dimensional conditional dependency structure, which requires fewer parameters. In that case, the coefficient for the estimator corresponding to the conditional structure is large. Therefore, the optimal coefficient essentially indicates the efficient model that one should choose for predicting the label in the multimodal problem, with the number of training samples taken into account. Moreover, our approach provides guidance and theoretical understanding for designing efficient multimodal algorithms that utilize the sample size information and different dependency structures. Finally, we extend our theoretical results and propose an autonomous updated coefficient on dependency structures (auto-CODES) algorithm by exploiting parametric models. It computes the weights on different dependency structures automatically as the features evolve in deep neural networks. The experiments on emotion recognition tasks with the MELD and IEMOCAP datasets validate our theoretical results and show the effectiveness of the algorithm.

The main contributions of this paper can be summarized as follows:

• We propose a novel theoretical framework for multimodal analyses to characterize the influence of the conditional dependency structure. To the best of our knowledge, it is the first work to give an explicit characterization of the number of training samples with respect to different dependency structures for multimodal learning in a non-asymptotic regime. It also quantifies the contribution to the estimation of the task complexity and of the fitness of the conditional dependency structure, measured by the $\chi^2$-divergence.

• We extend the analyses from discrete data to continuous real-world data by exploiting parametric models. Furthermore, we propose an algorithm with an autonomous updated coefficient on different dependency structures (auto-CODES) based on the theoretical analyses.

• We evaluate the proposed algorithm auto-CODES on multimodal emotion recognition tasks with the widely used MELD and IEMOCAP datasets. The experimental results validate our theory and show the effectiveness of our algorithm.

Due to space limitations, the proofs of the theorems are presented in the supplementary material.

2. PROBLEM FORMULATION AND ANALYSIS

In this section, we consider a multimodal scenario where both modalities are discrete random variables. For ease of illustration, we elaborate the framework in the two-modality case. Specifically, we focus on the linear combination of two types of estimators. By introducing the testing loss, we evaluate the performance of the proposed estimator. Finally, we determine the optimal combining coefficient by minimizing the testing loss and illustrate the aspects that affect the optimal coefficient.

Notation: Let the random variables $X_1$, $X_2$ and $Y$ denote the two modalities and the label, over finite alphabets $\mathcal{X}_1$, $\mathcal{X}_2$ and $\mathcal{Y}$, respectively. The $n$ sample tuples $\mathcal{D} \triangleq \{(x_1^{(i)}, x_2^{(i)}, y^{(i)})\}_{i=1}^{n}$ are generated in an independent, identically distributed (i.i.d.) manner from the true joint distribution $P_{X_1X_2Y}$, where $P_{X_1X_2Y}(x_1, x_2, y) > 0$ for all entries. We consider two different estimators to approximate the joint distribution $P_{X_1X_2Y}$: (i) the empirical joint distribution $\hat{P}_{X_1X_2Y}$, and (ii) the empirical Markov-structured distribution $\hat{P}^{(M)}_{X_1X_2Y}$, which characterizes the conditional dependency structure $X_1 - Y - X_2$:

$$\hat{P}_{X_1X_2Y}(x_1, x_2, y) \triangleq \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}\{x_1^{(i)} = x_1\}\, \mathbb{1}\{x_2^{(i)} = x_2\}\, \mathbb{1}\{y^{(i)} = y\},$$
$$\hat{P}^{(M)}_{X_1X_2Y}(x_1, x_2, y) \triangleq \hat{P}_{X_1|Y}(x_1|y)\, \hat{P}_{X_2|Y}(x_2|y)\, \hat{P}_Y(y), \tag{1}$$

where $\mathbb{1}\{\cdot\}$ denotes the indicator function, $\hat{P}_Y$ denotes the marginal empirical distribution of the label $Y$, and $P^{(M)}_{X_1X_2Y}$ denotes $P_{X_1|Y} P_{X_2|Y} P_Y$. For simplicity, we consider the case where the label distribution has been learned well¹, i.e., $\hat{P}_Y(y) = P_Y(y)$ for $y \in \mathcal{Y}$.

To begin, we focus on a linearly combined estimator based on the two different structures,

$$\tilde{P}_{X_1X_2Y} \triangleq (1 - \alpha) \cdot \hat{P}_{X_1X_2Y} + \alpha \cdot \hat{P}^{(M)}_{X_1X_2Y}, \tag{2}$$

where the coefficient $\alpha \in [0, 1]$ is the parameter to be designed². Note that when $\alpha$ is zero, the proposed estimator in equation 2 reduces to the widely used unbiased estimator $\hat{P}_{X_1X_2Y}$.

When $\alpha$ goes to one, it becomes the estimator $\hat{P}^{(M)}_{X_1X_2Y}$ representing the conditional dependency structure, i.e., $X_1 - Y - X_2$. Thus, there exists an optimal combining coefficient that makes the estimator in equation 2 the most appropriate estimate for label prediction. In addition, the optimal coefficient can be used to characterize the dependency structure and provide theoretical insights for deriving the optimal estimation.
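The two estimators and their linear combination can be sketched in a few lines of plain Python. This is only an illustrative sketch: the function names are hypothetical, samples are assumed to be `(x1, x2, y)` tuples over small finite alphabets, and the label marginal is estimated empirically here rather than assumed known as in the text.

```python
from collections import Counter
from itertools import product


def empirical_joint(samples, xs1, xs2, ys):
    """Empirical joint distribution over every (x1, x2, y) tuple."""
    n = len(samples)
    counts = Counter(samples)
    return {t: counts[t] / n for t in product(xs1, xs2, ys)}


def markov_estimate(samples, xs1, xs2, ys):
    """Empirical P(x1|y) * P(x2|y) * P(y), i.e. the X1 - Y - X2 structure."""
    n = len(samples)
    n_y = Counter(y for _, _, y in samples)
    n_x1y = Counter((x1, y) for x1, _, y in samples)
    n_x2y = Counter((x2, y) for _, x2, y in samples)
    est = {}
    for x1, x2, y in product(xs1, xs2, ys):
        if n_y[y] == 0:
            # Label never observed: the empirical conditionals are undefined.
            est[(x1, x2, y)] = 0.0
        else:
            p_x1 = n_x1y[(x1, y)] / n_y[y]
            p_x2 = n_x2y[(x2, y)] / n_y[y]
            est[(x1, x2, y)] = p_x1 * p_x2 * (n_y[y] / n)
    return est


def combined_estimate(samples, xs1, xs2, ys, alpha):
    """Linear combination (1 - alpha) * joint + alpha * Markov, as in equation 2."""
    joint = empirical_joint(samples, xs1, xs2, ys)
    markov = markov_estimate(samples, xs1, xs2, ys)
    return {t: (1 - alpha) * joint[t] + alpha * markov[t] for t in joint}
```

With `alpha = 0` the sketch returns the empirical joint distribution, and with `alpha = 1` it returns the Markov-structured estimate, mirroring the two limiting cases described above.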

2.1. OPTIMAL COMBINATION COEFFICIENT

We define a testing loss function based on the referenced $\chi^2$-divergence to measure the performance of the estimator in equation 2, where the referenced $\chi^2$-divergence is defined as follows³.

Definition 1. For a discrete random variable $Z$ over a finite alphabet $\mathcal{Z}$, and its distributions $P_Z$ and $Q_Z$, with reference distribution $R_Z$, the referenced $\chi^2$-divergence between them is defined as

$$\chi^2_{R_Z}(P_Z, Q_Z) \triangleq \sum_{z \in \mathcal{Z}} \frac{(P_Z(z) - Q_Z(z))^2}{R_Z(z)}. \tag{3}$$

We denote $\chi^2(P_Z, Q_Z) \triangleq \chi^2_{P_Z}(P_Z, Q_Z)$, which corresponds to the Pearson $\chi^2$-divergence. Based on the referenced $\chi^2$-divergence, we define the testing loss as the average divergence between the estimator (2) and the true joint distribution under a fixed training sample size $n$.

Definition 2. For the estimator $\tilde{P}_{X_1X_2Y}$ with coefficient $\alpha$ and the corresponding true distribution $P_{X_1X_2Y}$, the testing loss and the optimal coefficient $\alpha^*$ are defined as

$$L_{\mathrm{test}}(\alpha) \triangleq \mathbb{E}\left[\chi^2(P_{X_1X_2Y}, \tilde{P}_{X_1X_2Y})\right], \qquad \alpha^* \triangleq \arg\min_{\alpha \in [0,1]} L_{\mathrm{test}}(\alpha), \tag{4}$$

where the expectation is taken over all $n$ i.i.d. samples generated from the true distribution.

We then have the following characterization of the proposed testing loss for the linearly combined estimator and of the optimal combining coefficient $\alpha^*$.

Theorem 3. The testing loss defined in equation 4 can be expressed as

$$L_{\mathrm{test}}(\alpha) = \left(\frac{1}{n} C + \frac{1}{n} V + \chi^2(P_{X_1X_2Y}, P^{(M)}_{X_1X_2Y})\right) \alpha^2 - \frac{2}{n} C \cdot \alpha + \frac{1}{n}\left(|\mathcal{X}_1||\mathcal{X}_2||\mathcal{Y}| - 1\right), \tag{5}$$

and the optimal coefficient $\alpha^*$ that minimizes the testing loss in equation 4 is given by

$$\alpha^* = \frac{\frac{1}{n} C}{\chi^2(P_{X_1X_2Y}, P^{(M)}_{X_1X_2Y}) + \frac{1}{n} C + \frac{1}{n} V}, \tag{6}$$

where

$$C \triangleq |\mathcal{Y}| \cdot \left[|\mathcal{X}_1||\mathcal{X}_2| - (|\mathcal{X}_1| + |\mathcal{X}_2|)\right] + 1 + a_n,$$
$$V \triangleq -6 \cdot \chi^2(P_{X_1X_2Y}, P^{(M)}_{X_1X_2Y}) + 2 \sum_{x_2, y} \chi^2(P_{X_1|X_2Y}, P_{X_1|Y}) + 2 \sum_{x_1, y} \chi^2(P_{X_2|X_1Y}, P_{X_2|Y}) + |\mathcal{Y}|\left(|\mathcal{X}_1| + |\mathcal{X}_2|\right) - 2 + b_n.$$

Here $a_n$ and $b_n$ are of the order $O(\frac{1}{n})$, which will go to constants when $n$ goes to infinity. The proof and the detailed analytical expressions of $a_n$ and $b_n$ are provided in the supplementary material.
By considering the conditional dependency structure and tuning the coefficient $\alpha$, the improvement can be calculated as the difference in the testing losses.

Corollary 4. The improvement from using the optimal coefficient $\alpha^*$ is given by

$$L_{\mathrm{test}}(0) - L_{\mathrm{test}}(\alpha^*) = \frac{1}{n} \cdot \frac{C^2}{C + V + n \cdot \chi^2(P_{X_1X_2Y}, P^{(M)}_{X_1X_2Y})}, \tag{7}$$

where the parameters $C$ and $V$ are defined in Theorem 3.

From equation 6, the optimal combining coefficient $\alpha^*$ is determined by three main factors: (i) the training sample size $n$; (ii) the fitness of the conditional dependency structure to describe the true one, measured by the $\chi^2$-divergence $\chi^2(P_{X_1X_2Y}, P^{(M)}_{X_1X_2Y})$ and the terms in the parameter $V$; and (iii) the task complexity $C$, characterized by the number of parameters needed to estimate the joint distribution. The last characterization comes from the fact that when the task is to learn all the entries of the true distribution, the number of parameters needed corresponds to the cardinality of the sample space. Many existing multimodal algorithms (Ma et al., 2020) focus on finding appropriate dependency structures to approximate the true one, while the number of training samples is not sufficiently addressed. In Theorem 3, we show that the combining coefficient is inversely proportional to the number of training samples and to the fitness measure of the conditional dependency structure with respect to the true one. Also, it is proportional to the task complexity measured by the number of model parameters. Two special cases help in understanding Theorem 3 and Corollary 4.

Case 1: When the true dependency structure is Markovian, i.e., $X_1 - Y - X_2$, the optimal coefficient becomes $1 - V (C + V)^{-1}$, which is nearly $1 - |\mathcal{X}_1|^{-1} - |\mathcal{X}_2|^{-1}$. The cardinalities $|\mathcal{X}_1|$ and $|\mathcal{X}_2|$ are usually large, so $\alpha^*$ is quite close to 1, which means that the model should be close to a Markov one⁴ and the improvement is relatively large.

Case 2: When the number of training samples is relatively small and insufficient to learn a complex model, the optimal coefficient $\alpha^*$ is close to 1, meaning that the model behaves as a "near Markov" one and benefits from considering the conditional dependency structure.

Such insights were not well captured in many multimodal algorithms, and our results provide the optimal characterization of the combining coefficient among different dependency structures, adjusted by the sample size and the fitness measure. Additionally, the established expression of $\alpha^*$ can be interpreted as the optimal bias-variance trade-off (Duda et al., 1973) of the low-dimensional structure in the estimation. Note that the bias-variance trade-off in the testing loss (4) is tuned by the coefficient $\alpha$ as

$$L_{\mathrm{test}}(\alpha) = \underbrace{\frac{1}{n}\left(C \alpha^2 + V \alpha^2 - 2 C \alpha + |\mathcal{X}_1||\mathcal{X}_2||\mathcal{Y}| - 1\right)}_{\text{variance term(s)}} + \underbrace{\alpha^2 \chi^2(P_{X_1X_2Y}, P^{(M)}_{X_1X_2Y})}_{\text{bias term}}. \tag{8}$$

The variance terms vanish as the training sample size increases, and the bias term characterizes the cost of utilizing the dependency structure. The coefficient $\alpha^*$ then achieves the optimal bias-variance trade-off when the testing loss is minimized.
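The two special cases and the bias-variance split can be checked numerically with a small sketch. The helper names are hypothetical, the $O(1/n)$ corrections $a_n$ and $b_n$ are dropped, and in the second printed line `C` and `V` are reused as fixed placeholder constants together with an arbitrary divergence value of 0.05, purely to illustrate the dependence on $n$.

```python
def markov_case_constants(card_x1, card_x2, card_y):
    """C and V of Theorem 3 when X1 - Y - X2 holds exactly.

    All chi-square terms vanish; a_n and b_n are dropped (sketch assumption).
    """
    C = card_y * (card_x1 * card_x2 - (card_x1 + card_x2)) + 1
    V = card_y * (card_x1 + card_x2) - 2
    return C, V


def alpha_star(n, C, V, chi2):
    """Optimal coefficient of Theorem 3 (not clipped)."""
    return (C / n) / (chi2 + C / n + V / n)


def loss_terms(alpha, n, C, V, chi2, cardinality):
    """Return the (variance, bias) parts of the testing loss."""
    variance = (C * alpha ** 2 + V * alpha ** 2 - 2 * C * alpha + cardinality - 1) / n
    bias = alpha ** 2 * chi2
    return variance, bias


def improvement(n, C, V, chi2):
    """Corollary 4: L_test(0) - L_test(alpha*)."""
    return C ** 2 / (n * (C + V + n * chi2))


C, V = markov_case_constants(10, 10, 3)
# Case 1: with chi2 = 0 the value is near 1 - 1/10 - 1/10 for any n.
print(alpha_star(1000, C, V, 0.0))
# Case 2: with a nonzero divergence, small n pushes alpha* toward 1.
print(alpha_star(10, C, V, 0.05), alpha_star(10000, C, V, 0.05))
```

In this sketch the variance part shrinks like $1/n$ while the bias part depends only on $\alpha$ and the divergence, which is why the preferred $\alpha$ moves toward 0 as $n$ grows.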

3. ALGORITHM WITH AUTONOMOUS UPDATED COEFFICIENT ON DEPENDENCY STRUCTURE

In this section, we build an algorithm with an autonomous updated coefficient on the dependency structures (auto-CODES) as a realization of our theoretical framework for multimodal learning. Specifically, we first extend our theory from the discrete data domain to the continuous one, so that it can be applied to practical datasets using representations in factorization form. Then, we give the optimal coefficient $\alpha$ expressed in terms of multimodal features, together with the objective loss function, which linearly combines two components corresponding to different dependency structures. Finally, we give the proposed auto-CODES algorithm and the discrimination rule based on maximum a posteriori (MAP) estimation. Throughout this section, we consider a dataset with $k$ modalities $X_1, \dots, X_k$, $n$ training samples $(x_1^{(i)}, \dots, x_k^{(i)}, y^{(i)})$, $i = 1, \dots, n$, and $m$ labels. The conditional dependency structure considered here is that all the modalities are independent of each other once the label is given.

3.1. MULTIMODAL REPRESENTATIONS IN FACTORIZATION FORM

To utilize the previous analytical framework, we introduce a parameterized representation for modeling the density function of the continuous data. The parameterized model consists of two parts: an early fusion model and an embedding layer, which reduces the classification parameters to a finite number. The early fusion model can be described as follows. First, the $k$ modalities go through different deep neural networks, which output $k$ features. Then, these features are concatenated and fully connected to a $d$-dimensional output layer to learn a joint representation $f$. The embedding layer is the topmost layer for linear classification, where the weights corresponding to label $y$ are given by $g(y) = [g_1(y), \dots, g_d(y)]^T$. For a specific task, the weights in the topmost layer, which have a finite number of parameters, i.e., $g(1), \dots, g(|\mathcal{Y}|)$, can be effectively trained with the joint representation $f$. Our framework considers an inference model $P^{(f,g)}_{Y|X_1,\dots,X_k}$, which is widely used in natural language processing (Levy & Goldberg, 2014) and image recognition (Xu & Huang, 2020), in the following factorization form⁵:

$$P^{(f,g)}_{Y|X_1,\dots,X_k}(y|x_1, \dots, x_k) \triangleq P_Y(y)\left(1 + \langle f(x_1, \dots, x_k), g(y) \rangle\right). \tag{9}$$

The optimal weights $g_0^*$ and $g_1^*$, which make the model $P^{(f,g_i^*)}_{Y|X_1 \dots X_k}$ fit the training samples, minimize the distance between the respective empirical distributions and the estimate $P_{X_1 \dots X_k} P^{(f,g_i^*)}_{Y|X_1 \dots X_k}$⁶. They are given by

$$g_0^* \triangleq \arg\min_{g_0} \chi^2_R\left(\hat{P}_{X_1 \dots X_k Y},\; P_{X_1 \dots X_k} P^{(f,g_0)}_{Y|X_1 \dots X_k}\right), \tag{10}$$
$$g_1^* \triangleq \arg\min_{g_1} \chi^2_R\left(\hat{P}^{(M)}_{X_1 \dots X_k Y},\; P_{X_1 \dots X_k} P^{(f,g_1)}_{Y|X_1 \dots X_k}\right), \tag{11}$$

where the fitness is measured by the $\chi^2$-divergence with the reference distribution $R \triangleq P_{X_1 \dots X_k} P_Y$. This allows us to apply the previous analyses and focus on the inference model $P^{(f,g_i^*)}_{Y|X_1 \dots X_k}$.

Analogous to the linearly combined estimator (2), we consider the linear combination of these inference models

$$Q^{(\alpha)}_{Y|X_1 \dots X_k} \triangleq (1 - \alpha)\, P^{(f,g_0^*)}_{Y|X_1 \dots X_k} + \alpha\, P^{(f,g_1^*)}_{Y|X_1 \dots X_k} = P^{(f,g^*)}_{Y|X_1 \dots X_k}, \tag{12}$$

with $g^* = (1 - \alpha) g_0^* + \alpha g_1^*$. Further, we define the testing loss and the corresponding optimal coefficient $\alpha^*$ as

$$L^{(f,g)}_{\mathrm{test}}(\alpha) \triangleq \mathbb{E}\left[\chi^2_R\left(P_{X_1 \dots X_k Y},\; P_{X_1 \dots X_k} Q^{(\alpha)}_{Y|X_1 \dots X_k}\right)\right], \qquad \alpha^* \triangleq \arg\min_{\alpha \in [0,1]} L^{(f,g)}_{\mathrm{test}}(\alpha). \tag{13}$$

We have the following characterization.

Theorem 5. For a given dataset, the optimal $\alpha^*$ for the testing loss in equation 13 is given by

$$\alpha^* = \frac{\frac{1}{n} C''}{\Gamma + \frac{1}{n} C'' + \frac{1}{n} V''},$$

where

$$\Gamma \triangleq \sum_{y \in \mathcal{Y}} \frac{1}{P_Y(y)} \sum_{x'_1, \dots, x'_k} \sum_{x''_1, \dots, x''_k} f^T(x'_1, \dots, x'_k)\, \Lambda_f^{-1}\, f(x''_1, \dots, x''_k) \left(P^{(M)}_{X_1 \dots X_k Y}(x'_1, \dots, x'_k, y) - P_{X_1 \dots X_k Y}(x''_1, \dots, x''_k, y)\right)^2,$$
$$\Lambda_f \triangleq \sum_{x_1, \dots, x_k} P_{X_1 \dots X_k}(x_1, \dots, x_k)\, f(x_1, \dots, x_k)\, f^T(x_1, \dots, x_k).$$

The terms $C''$ and $V''$, and the approach for calculating these terms from the training data, are given in the supplementary material. These terms can be represented as expectations of the features $f$ and $g$ and are approximated by the corresponding empirical means. For instance, $\Lambda_f$ can be computed from the data by

$$\Lambda_f \leftarrow \frac{1}{n} \sum_{i=1}^{n} f(x_1^{(i)}, \dots, x_k^{(i)})\, f^T(x_1^{(i)}, \dots, x_k^{(i)}).$$
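The empirical-mean approximation of $\Lambda_f$ and the closed form in Theorem 5 can be sketched as follows. The function names are hypothetical, feature vectors are assumed to be plain Python lists of equal length, and `gamma`, `c2` (standing for $C''$) and `v2` (standing for $V''$) are assumed to be precomputed, since their expressions are deferred to the supplementary material.

```python
def empirical_lambda(features):
    """Lambda_f <- (1/n) * sum_i f_i f_i^T for d-dimensional feature vectors."""
    n = len(features)
    d = len(features[0])
    lam = [[0.0] * d for _ in range(d)]
    for f in features:
        for a in range(d):
            for b in range(d):
                lam[a][b] += f[a] * f[b] / n
    return lam


def alpha_star_parametric(n, gamma, c2, v2):
    """Theorem 5 form: (C''/n) / (Gamma + C''/n + V''/n)."""
    return (c2 / n) / (gamma + c2 / n + v2 / n)
```

In a deep-learning setting the same second-moment matrix would normally be accumulated per mini-batch by the tensor library in use; the nested loops here are only meant to make the formula explicit.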

3.2. OUR PROPOSED ALGORITHM

First, based on the linear estimator in equation 12, the objective function can be chosen as the linear combination of two referenced $\chi^2$-distances measuring the gap between the learned distribution and the distributions corresponding to different dependency structures, i.e.,

$$(1 - \alpha)\, \chi^2_R\left(\hat{P}_{X_1 \dots X_k Y},\; \hat{P}_{X_1 \dots X_k} P^{(f,g)}_{Y|X_1 \dots X_k}\right) + \alpha\, \chi^2_R\left(\hat{P}^{(M)}_{X_1 \dots X_k Y},\; \hat{P}_{X_1 \dots X_k} P^{(f,g)}_{Y|X_1 \dots X_k}\right). \tag{14}$$

Based on the approach established in (Wang et al., 2019; Huang et al., 2017), the objective in equation 14 can then be transformed into the following loss function, which can be computed from the features $(f, g)$:

$$L^{(\alpha)}_{\mathrm{train}}(f, g) = (1 - \alpha)\, L_{\mathrm{dep}}(f, g) + \alpha\, L^{(M)}_{\mathrm{dep}}(f, g),$$
$$L_{\mathrm{dep}}(f, g) = \frac{1}{n-1} \sum_{i=1}^{n} f^T(x_1^{(i)}, \dots, x_k^{(i)})\, g(y^{(i)}) - \frac{1}{2} \mathrm{tr}\left(\mathrm{cov}(f)\, \mathrm{cov}(g)\right),$$
$$L^{(M)}_{\mathrm{dep}}(f, g) = \sum_{j=1}^{m} \hat{P}_Y(j) \left[\frac{1}{n_j - 1} \sum_{i=1}^{n_j} f^T(x_1^{(i,j)}, \dots, x_k^{(i,j)})\, g(j) - \frac{1}{2} \mathrm{tr}\left(\mathrm{cov}(f_j)\, \mathrm{cov}(g)\right)\right],$$

where $f(x_1^{(i)}, \dots, x_k^{(i)})$ is the feature output for the $i$-th sample $(x_1^{(i)}, \dots, x_k^{(i)}, y^{(i)})$, $g(j)$ is the embedding for label $j$, and

$$\mathrm{cov}(f) \leftarrow \frac{1}{n-1} \sum_{i=1}^{n} f(x_1^{(i)}, \dots, x_k^{(i)})\, f^T(x_1^{(i)}, \dots, x_k^{(i)}), \qquad \mathrm{cov}(g) \leftarrow \frac{1}{n-1} \sum_{i=1}^{n} g(y^{(i)})\, g^T(y^{(i)}),$$
$$\hat{P}_Y(j) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}\{y^{(i)} = j\}, \quad j = 1, \dots, m.$$

The loss $L^{(M)}_{\mathrm{dep}}(f, g)$ requires a permutation of the samples' modalities within the subset of samples sharing the same label. We denote the subset of training samples with label $j \in \{1, \dots, m\}$ as $\mathcal{D}_j = \{(x_1^{(i,j)}, \dots, x_k^{(i,j)})\}_{i=1}^{d_j}$, where $d_j$ is the number of samples whose label is $j$ in the overall dataset $\mathcal{D}$. Each $x_t^{(i,j)}$ is chosen from $\{x_t^{(i,j)}\}_{i=1}^{d_j}$, $t = 1, \dots, k$, and $n_j = \prod_{t=1}^{k} d_t$,

$$\mathrm{cov}(f_j) \leftarrow \frac{1}{n_j - 1} \sum_{t=1}^{n_j} f(x_1^{(t,j)}, \dots, x_k^{(t,j)})\, f^T(x_1^{(t,j)}, \dots, x_k^{(t,j)}).$$

Our algorithm can then be organized as an iteration of two main optimizations: (i) the optimization of $\alpha$ for given $(f, g)$ by minimizing the testing loss $L^{(f,g)}_{\mathrm{test}}(\alpha)$ in equation 13, using the expression in Theorem 5 for the computation; and (ii) the optimization of the features $(f, g)$ for given $\alpha$ to minimize the training loss derived from equation 14 with the deep neural network. We summarize it in Algorithm 1. With the output features $f^*$ and $g^*$ trained by the algorithm, the classification of a newly observed sample $(x_1, \dots, x_k)$ is given by the maximum a posteriori (MAP) decision rule

$$\tilde{y}(x_1, \dots, x_k) = \arg\max_{y \in \mathcal{Y}} P_{Y|X_1 \dots X_k}(y|x_1, \dots, x_k) = \arg\max_{y \in \mathcal{Y}} P_Y(y)\left(1 + \langle f^*(x_1, \dots, x_k), g^*(y) \rangle\right).$$
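To make the structure of the dependence term concrete, here is a minimal pure-Python sketch of $L_{\mathrm{dep}}$ and of the weighted combination. The helper names are hypothetical, features are assumed to be lists of equal-length vectors (one $f$ and one $g$ vector per sample), and the second-moment estimate follows the formula in the text literally, without mean subtraction. The within-label permutation needed for $L^{(M)}_{\mathrm{dep}}$ is not implemented here; its value is passed in.

```python
def second_moment(vectors):
    """(1/(n-1)) * sum_i v_i v_i^T, matching the cov(.) estimate in the text."""
    n, d = len(vectors), len(vectors[0])
    return [[sum(v[a] * v[b] for v in vectors) / (n - 1) for b in range(d)]
            for a in range(d)]


def trace_of_product(A, B):
    """tr(A B) for square matrices stored as nested lists."""
    d = len(A)
    return sum(A[a][b] * B[b][a] for a in range(d) for b in range(d))


def l_dep(f_feats, g_feats):
    """(1/(n-1)) * sum_i <f_i, g_i> - 0.5 * tr(cov(f) cov(g))."""
    n = len(f_feats)
    inner = sum(sum(fa * ga for fa, ga in zip(f, g))
                for f, g in zip(f_feats, g_feats)) / (n - 1)
    return inner - 0.5 * trace_of_product(second_moment(f_feats),
                                          second_moment(g_feats))


def l_train(alpha, l_dep_value, l_dep_markov_value):
    """(1 - alpha) * L_dep + alpha * L_dep^(M)."""
    return (1 - alpha) * l_dep_value + alpha * l_dep_markov_value
```

In practice these quantities would be computed on mini-batches with an automatic-differentiation library so that gradients can flow back to the networks producing $f$ and $g$.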
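The MAP decision rule above amounts to a weighted inner-product score per label. A minimal sketch, assuming the feature vector for the new sample has already been computed and that `g` and `prior` are dictionaries keyed by label (all names here are hypothetical):

```python
def map_decision(f_x, g, prior):
    """Return argmax over y of prior[y] * (1 + <f_x, g[y]>)."""
    def score(y):
        return prior[y] * (1.0 + sum(a * b for a, b in zip(f_x, g[y])))
    return max(prior, key=score)
```

The score can be negative for some labels, as the footnote on the factorization form points out, but the argmax is still well defined.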

Algorithm 1 An Auto-updated Coefficient on Dependency Structures (auto-CODES) Algorithm

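Input: multimodal data samples $\{(x_1^{(i)}, \dots, x_k^{(i)}, y^{(i)})\}_{i=1}^{n}$
Initialize $\alpha^* = 0$
repeat
    $(f^*, g^*) \leftarrow \arg\min_{f, g} L^{(\alpha^*)}_{\mathrm{train}}(f, g)$
    $\alpha^* \leftarrow \arg\min_{\alpha \in [0,1]} L^{(f^*, g^*)}_{\mathrm{test}}(\alpha)$
until $\alpha^*$ converges
$(f^*, g^*) \leftarrow \arg\min_{f, g} L^{(\alpha^*)}_{\mathrm{train}}(f, g)$
return $f^*$, $g^*$, $\alpha^*$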
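The alternating structure of Algorithm 1 can be sketched as a generic loop. This is only a skeleton under stated assumptions: `fit_features` and `solve_alpha` are hypothetical callables standing in for the network training step and for the Theorem 5 computation, and the stopping rule is one possible reading of convergence, namely a relative change in $\alpha$ below `tol`.

```python
def auto_codes(fit_features, solve_alpha, tol=0.1, max_iters=50):
    """Alternate between fitting (f, g) for fixed alpha and updating alpha.

    fit_features(alpha) -> (f, g)   # hypothetical training routine
    solve_alpha(f, g)   -> alpha    # hypothetical closed-form update
    """
    alpha = 0.0
    for _ in range(max_iters):
        f, g = fit_features(alpha)
        new_alpha = solve_alpha(f, g)
        # Relative-change stopping rule (an assumption of this sketch).
        converged = abs(new_alpha - alpha) <= tol * max(abs(alpha), 1e-12)
        alpha = new_alpha
        if converged:
            break
    # Final feature fit with the converged coefficient.
    f, g = fit_features(alpha)
    return f, g, alpha
```

Passing the two steps in as callables keeps the loop independent of any particular network or framework; the `max_iters` guard is an added safety measure that is not part of Algorithm 1.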

4. EXPERIMENTS

In this section, we verify our model and algorithm by answering the following research questions (RQ):

RQ1: Can auto-CODES discriminate well?
RQ2: Can auto-CODES automatically determine an appropriate coefficient $\alpha$?
RQ3: Does the optimal coefficient $\alpha$ scale inversely with the training sample size?

Experimental settings. In our experiments, we use two widely used multimodal emotion recognition datasets, MELD (Poria et al., 2018) and IEMOCAP (Busso et al., 2008). MELD contains 13K utterances from 1433 dialogues from the TV series Friends. Each utterance is annotated with one of three emotion labels: positive, neutral, and negative. We use the audio and textual modalities for our verification. IEMOCAP contains approximately 12 hours of audiovisual data with six emotion categories: anger, happiness, sadness, neutral, excitement, and frustration. We use the visual and audio modalities for our verification. In our settings, the sample size plays a crucial role. Thus, we randomly select subsets of both datasets at several sample-size levels as our training sets. To preserve the inner structure within dialogues, the random selection is performed at the dialogue level. The exact sample sizes are listed in Table 1. As for the model structure, we use DialogueRNN (Majumder et al., 2019) as our backbone network for extracting multimodal features and adopt a 2-layer multi-layer perceptron (MLP) with ReLU (Glorot et al., 2011) activation for extracting one-hot label features. We use accuracy and F1-score as our evaluation metrics.

4.1. EMOTION RECOGNITION RESULTS

To answer RQ1, we compare auto-CODES with the following methods: (i) CE: the cross-entropy loss that is widely used in machine learning classification tasks; (ii) MaskedNLL: a variant of the NLL loss that excludes the logit values of padded samples, as used in (Majumder et al., 2019); (iii) Focal loss (Mukhoti et al., 2020): a loss designed to address the class imbalance problem; and (iv) Soft-HGR (Wang et al., 2019): a loss that learns correlated representations across modalities without hard whitening constraints. All the experiments are conducted on various training sample sizes. The discrimination accuracies and F1-scores on the emotion recognition tasks are reported in Table 1 and Table 3. The results demonstrate that our auto-updated method achieves superior performance over existing methods in all settings with different sample sizes. For dialogue sizes from 10 to 40, auto-CODES outperforms the second-best method, Soft-HGR, by margins of 2.086% (size 20) to 2.63% (size 30) on the F1-score metric. For the dialogue sizes 40 and 60, auto-CODES achieves absolute improvements of 1.629% and 1.515% over Focal loss. To examine the impact of the coefficient $\alpha$, we also conduct experiments with a prefixed static $\alpha$ as a comparison with our auto-updated algorithm. Comparing the last two columns in Table 1, we observe that auto-CODES obtains improvements of 1.856% (size 60) to 4.456% (size 10) in terms of F1-score over static CODES. These results suggest that (1) when the training samples are insufficient, our proposed auto-CODES outperforms Focal loss, Soft-HGR, the cross-entropy loss, and the MaskedNLL loss, and (2) auto-updating $\alpha$ improves the results over static CODES by large margins.

4.2. DETERMINATION OF THE OPTIMAL COEFFICIENT

To answer RQ2, we conduct experiments to compare our auto-updated coefficient $\alpha$ with the coefficient determined conventionally by grid search and manual adjustment. Instead of relying only on empirical experience with no theoretical assurance, our theory determines the optimal coefficient $\alpha^*$ and updates the learned modality features automatically. During training, our algorithm and the grid search method use the same number of training epochs for each $\alpha$ update cycle at each training sample size. In the grid search method, we search 101 values of $\alpha$ ranging from 0 to 1 with a step length of 0.01. In our algorithm, we stop the iteration once the difference ratio of the updated $\alpha$ is smaller than 0.1. The results are shown in Table 2. They suggest that our method can locate an appropriate $\alpha$, and even a better one than the grid search method.

4.3. OPTIMAL COEFFICIENT WITH THE NUMBER OF TRAINING SAMPLES

To answer RQ3, we examine the optimal $\alpha$ determined by auto-CODES on MELD and IEMOCAP. According to our theory, $\alpha^*$ is roughly of the order $\frac{1}{n}$. For each training sample size, we conduct 10 repeated experiments using auto-CODES to locate the optimal coefficient and report their average as the final $\alpha^*$. The values of the product $n \cdot \alpha^*$ for the MELD dataset are shown in Table 2. The product is approximately constant, at around 3.8. According to the column "$n \cdot \alpha_{\mathrm{auto}}$" of Table 2, our theoretical outcome can not only determine a good $\alpha$ but also reveal its relation to the training sample size. The corresponding results for the IEMOCAP dataset are given in the supplementary material.

5. RELATED WORK

In this section, we summarize the related works from two aspects: correlation analysis and dependency structures in multimodal learning.

Correlation Analyses in Multimodal Learning. Correlation analysis methods can be fusion-based or objective-based. Fusion-based methods design approaches for fusing the representations of different modalities, such as the tensor fusion network for multimodal sentiment analysis (Zadeh et al., 2017) and factorized multimodal representations (Liu et al., 2018; Tsai et al., 2018). Objective-based methods capture modality interactions by using distinct statistical notions. For example, canonical correlation analysis (CCA) approaches such as (Karami & Schuurmans, 2021) are based on Pearson correlation. Jensen-Shannon divergence (Sutter et al., 2020), variation of information (Sohn et al., 2014), and HGR maximal correlation (Wang et al., 2019; Ma et al., 2020; Tong et al., 2021) have also shown that the statistical design of learning objectives can facilitate correlation extraction among modalities. The concept of sample efficiency has been discussed in the weighted algorithm TAWT (Chen et al., 2021). However, the explicit and accurate quantity of training samples has not been appropriately addressed in these analyses.

Learning Dependency Structures in Multimodal Learning. Learning features in the complex manifolds formed by different modalities of data is essential. One approach is to learn a discriminative model such as conditional random fields (CRF) (Lafferty et al., 2001). Another strategy involves learning a generative model using multimodal deep Boltzmann machines (DBMs) (Salakhutdinov & Hinton, 2009), or coupled, factorial, and multi-stream hidden Markov model (HMM) methods (Nefian et al., 2002; Ghahramani & Jordan, 1997; Gurban et al., 2008). However, the theoretical characterization of utilizing the dependency structures has not been sufficiently explored.

6. DISCUSSION

In this paper, we propose a new theoretical framework to analytically characterize the explicit and exact relation between the sample size and conditional dependency structures in multimodal learning in a non-asymptotic regime. Moreover, we propose a weighted training algorithm, auto-CODES, based on the theoretical framework. It iteratively updates the coefficient on different dependency structures based on the evolving modality features. The effectiveness of auto-CODES is further corroborated through multimodal emotion recognition experiments on the MELD and IEMOCAP datasets, with promising results.

Limitations and Future Work. There are two main limitations in our work. First, we proposed a tractable algorithm for one specific type of conditional dependency structure, and we leave a generalized tractable approach for other kinds of dependency structures for future work. Second, even though we have specified the precise form of the coefficient that can be computed from the features, it is still laborious to specify its role in the algorithm; therefore, we intend to improve the computing method for the coefficient in the future. We also intend to integrate our framework with pre-trained networks and run further experiments on various modalities across various datasets, including MOSEI.



¹ Note that in many real-world datasets, such as MNIST, the label is uniformly distributed in the training set, which makes this assumption reasonable.

² It can be shown that the estimator in equation 2 can be naturally derived from optimizing a linearly combined log-loss function; we refer to the supplementary material for a detailed discussion.

³ Conventionally, such performance is computed by the logarithmic loss. However, in our setting, it is ill-defined when some $(x_1, x_2, y)$ tuple is missing from the training samples. In that case, we have $\hat{P}_{X_1X_2Y}(x_1, x_2, y) = 0$ while $P_{X_1X_2Y}(x_1, x_2, y) > 0$, which makes the logarithmic loss infinite.

⁴ Due to the limited number of samples and the assumption on the distribution of the label $Y$, it will not be strictly 1.

⁵ Note that it can be negative in real applications. It can nevertheless be used to make discriminative decisions through the maximum a posteriori (MAP) rule.

⁶ Note that when the discriminative model $P^{(f,g)}_{Y|X_1 \dots X_k}$ is fixed, $P_{X_1 \dots X_k} P^{(f,g)}_{Y|X_1 \dots X_k}$ is the optimal approximation of the true distribution $P_{X_1 \dots X_k Y}$.



Table 1: Comparison with other objectives on the MELD dataset with different training sample sizes. All reported results are averaged over 10 repeated experiments.

Table 2: The optimal coefficients $\alpha^*$ derived by auto-CODES and by the grid search method for different training sample sizes on the MELD dataset.

Table 3: Comparison with other objectives on the IEMOCAP dataset with different training sample sizes. All reported results are averaged over 10 repeated experiments.

