REUSING PREPROCESSING DATA AS AUXILIARY SUPERVISION IN CONVERSATIONAL ANALYSIS

Abstract

Conversational analysis systems are trained on noisy human labels and often require heavy preprocessing during multi-modal feature extraction. Training a single task on noisy labels increases the risk of over-fitting; auxiliary tasks, however, can improve the performance of the primary task. This approach is known as Primary Multi-Task Learning (MTL). A challenge of MTL is selecting auxiliary tasks that are beneficial and avoid negative transfer. In this paper, we explore how the preprocessed data used for feature engineering can be re-used as auxiliary tasks in Primary MTL, thereby promoting a more productive use of data in the form of auxiliary supervision. Our main contributions are: (1) the identification of sixteen beneficial auxiliary tasks, (2) a method for distributing learning capacity between the primary and auxiliary tasks, and (3) a relative supervision hierarchy between the primary and auxiliary tasks. Extensive experiments on the IEMOCAP and SEMAINE datasets validate the improvements over single-task approaches, and suggest that the approach may generalize across multiple primary tasks.

1. INTRODUCTION

The sharp increase in the use of video-conferencing creates both a need and an opportunity to better understand these conversations (Kim et al., 2019a). In post-event applications, analyzing conversations can provide feedback to improve communication skills (Hoque et al., 2013; Naim et al., 2015). In real-time applications, such systems can be useful in legal trials, public speaking, e-health services, and more (Poria et al., 2019; Tanveer et al., 2015). Analyzing conversations requires both human expertise and substantial time, a burden that many multimodal conversational analysis systems aim to relieve through automation. However, building such systems typically requires a training set annotated by humans (Poria et al., 2019). The annotation process is costly, thereby limiting the amount of labeled data. Moreover, third-party annotations of emotions are often noisy. Noisy labels coupled with limited labeled data increase the chance of overfitting (James et al., 2013). From the perspective of feature engineering for analyzing video-conferences, analysts often employ pre-built libraries (Baltrušaitis et al., 2016; Vokaturi, 2019) to extract multimodal features as inputs to training. This preprocessing phase is often computationally heavy, yet the resulting features are used only as inputs. In this paper, we investigate how the preprocessed data can be re-used as auxiliary tasks in Primary Multi-Task Learning (MTL), thereby promoting a more productive use of data in the form of auxiliary supervised learning. Specifically, our main contributions are (1) the identification of beneficial auxiliary tasks, (2) a method for distributing learning capacity between the primary and auxiliary tasks, and (3) a relative supervision hierarchy between the primary and auxiliary tasks. We demonstrate the value of our approach by predicting emotions on two publicly available datasets, IEMOCAP (Busso et al., 2008) and SEMAINE (McKeown et al., 2011).
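The core idea of re-using preprocessing outputs as supervision can be illustrated with a minimal sketch. The feature names, values, and dictionary layout below are our illustrative assumptions, not the datasets' actual schemas:

```python
def build_examples(utterances):
    """Turn preprocessed utterances into (input, primary, auxiliary) triples.

    Each utterance dict holds the transcript text, the human-annotated
    emotion label, and features produced during preprocessing (e.g. facial
    action-unit intensities, prosody scores). Instead of discarding these
    features after feature extraction, we also keep them as auxiliary
    prediction targets.
    """
    examples = []
    for utt in utterances:
        aux_targets = {
            "au_intensity": utt["au_intensity"],    # from a face-analysis tool
            "prosody_anger": utt["prosody_anger"],  # from a voice-analysis tool
        }
        examples.append((utt["text"], utt["emotion"], aux_targets))
    return examples

# Hypothetical preprocessed utterances (values invented for illustration).
utts = [
    {"text": "I can't believe it", "emotion": "anger",
     "au_intensity": 0.8, "prosody_anger": 0.9},
    {"text": "that sounds nice", "emotion": "joy",
     "au_intensity": 0.1, "prosody_anger": 0.0},
]
examples = build_examples(utts)
```

The model would then be trained to predict the primary emotion label and the auxiliary targets jointly, so the computationally heavy preprocessing contributes supervision as well as input features.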

2. RELATED WORKS AND HYPOTHESES

Multitask learning has a long history in machine learning (Caruana, 1997). In this paper, we focus on Primary MTL, a less commonly discussed subfield within MTL (Mordan et al., 2018). Primary MTL is concerned with the performance of one (primary) task: the sole motivation for adding auxiliary tasks is to improve primary task performance. In recent years, Primary MTL has been gaining attention in computer vision (Yoo et al., 2018; Fariha, 2016; Yang et al., 2018; Mordan et al., 2018; Sadoughi & Busso, 2018), speech recognition (Krishna et al., 2018; Chen & Mak, 2015; Tao & Busso, 2020; Bell et al., 2016; Chen et al., 2014), and natural language processing (NLP) (Arora et al., 2019; Yousif et al., 2018; Zalmout & Habash, 2019; Yang et al., 2019; Du et al., 2017). The benefit of adding multiple tasks is that they provide inductive bias through multiple noisy supervision signals (Caruana, 1997; Lipton et al., 2015; Ghosn & Bengio, 1997). The drawback is that adding multiple tasks increases the risk of negative transfer (Torrey & Shavlik, 2010; Lee et al., 2016; 2018; Liu et al., 2019; Simm et al., 2014), which leads to many design considerations. Two such considerations are identifying (a) which tasks are beneficial and (b) how much of the model parameters to share between the primary and auxiliary tasks. In addition, because we are performing Primary MTL, we have a third consideration: (c) whether we should prioritize primary supervision by giving it a higher hierarchy than the auxiliary supervision. In contrast with previous MTL works, our approach (a) identifies sixteen beneficial auxiliary targets, (b) dedicates a primary-specific branch within the network, and (c) investigates the efficacy and generalization of prioritizing primary supervision across eight primary tasks. Since our input representation is fully text-based, we dive deeper into Primary MTL in the NLP community.
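Design choice (b), a primary-specific branch on top of a shared trunk, can be sketched as a toy forward pass. The layer sizes, initialization, and activation are placeholder assumptions, not the paper's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def layer(in_dim, out_dim):
    # Toy random weights; stands in for any learned linear layer.
    return rng.standard_normal((in_dim, out_dim)) * 0.1

# A shared trunk serves every task; a dedicated branch gives the primary
# task extra capacity that the auxiliary heads never read from.
W_shared = layer(16, 8)
W_primary_branch = layer(8, 8)               # primary-specific capacity
W_primary_head = layer(8, 4)                 # e.g. 4 emotion classes
aux_heads = [layer(8, 1) for _ in range(3)]  # e.g. 3 auxiliary regressions

def forward(x):
    h = np.tanh(x @ W_shared)                # shared representation
    primary = np.tanh(h @ W_primary_branch) @ W_primary_head
    aux = [h @ W_a for W_a in aux_heads]     # aux heads use only the trunk
    return primary, aux

x = rng.standard_normal((5, 16))             # batch of 5 inputs
primary_out, aux_outs = forward(x)
```

Because auxiliary losses backpropagate only through the shared trunk, the primary branch's parameters receive gradients exclusively from the primary supervision, which is one way to keep learning capacity dedicated to the primary task.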
Regarding model architecture designs for Primary MTL in NLP, Søgaard & Goldberg (2016) found that lower-level tasks, like part-of-speech tagging, are better kept at the lower layers, enabling higher-level tasks, like Combinatory Categorial Grammar tagging, to use these lower-level representations. In our approach, the model hierarchy is not based on task difficulty; more simply, we prioritize the primary task. Regarding the identification of auxiliary supervisors in NLP, existing works have included tagging the input text (Zalmout & Habash, 2019; Yang et al., 2019; Søgaard & Goldberg, 2016). Text classification with auxiliary supervisors has included research article classification (Du et al., 2017; Yousif et al., 2018) and tweet classification (Arora et al., 2019). There is a large body of work in multimodal sentiment analysis, but not in the use of multimodal auxiliary supervisors, as detailed in the next paragraph.

Multimodal analysis of conversations has been gaining attention in deep learning research, particularly for emotion recognition in conversations (Poria et al., 2019). Methods from the past three years intelligently fuse numeric vectors from the text, audio, and video modalities before feeding them to downstream layers. This approach is seen in MFN (Zadeh et al., 2018a), MARN (Zadeh et al., 2018b), CMN (Hazarika et al., 2018b), ICON (Hazarika et al., 2018a), DialogueRNN (Majumder et al., 2019), and M3ER (Mittal et al., 2020). Our approach differs in two ways. (1) Our audio and video information is encoded within the text before only the text is fed as input. Having only text as input has the benefits of interpretability and the ability to present the conversational analysis on paper (Kim et al., 2019b).
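A minimal sketch of how audio/video information might be encoded within text is shown below. The token vocabulary, thresholds, and feature names are our illustrative assumptions, not the authors' exact markup scheme:

```python
def mark_up(text, prosody_anger, au_smile):
    """Prepend illustrative feature tokens so a text-only model sees
    multimodal information as part of its input sequence.

    prosody_anger and au_smile are assumed to be scores in [0, 1]
    produced by the preprocessing tools; 0.5 is an arbitrary cutoff.
    """
    tokens = []
    if prosody_anger > 0.5:
        tokens.append("<angry-voice>")
    if au_smile > 0.5:
        tokens.append("<smiling>")
    return " ".join(tokens + [text])

line = mark_up("you never listen", prosody_anger=0.8, au_smile=0.1)
# line == "<angry-voice> you never listen"
```

A markup like this keeps the input a readable string, which is what makes the resulting analysis printable and inspectable by a human.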
This is similar to how the linguistics community performs manual conversational analysis using the Jefferson transcription system (Jefferson, 2004), where transcripts are marked up with symbols indicating how the speech was articulated. (2) Instead of using the audio and video information only as inputs to a Single-Task Learning (STL) model, the contribution of this paper is that we demonstrate how to use multimodal information both as input and as auxiliary supervision that provides inductive bias to help the primary task.

Hypothesis H1: The introduced set of auxiliary supervision features improves primary MTL. We introduce and motivate the full set of sixteen auxiliary supervisions, all based on existing literature. They are grouped into four families, each with four auxiliary targets: (1) facial action units, (2) prosody, (3) historical labels, and (4) future labels. (1) Facial action units, from the Facial Action Coding System, identify universal facial expressions of emotion (Ekman, 1997). In particular, AUs 05, 17, 20, and 25 have been shown to be useful in detecting depression (Yang et al., 2016a; Kim et al., 2019b) and rapport-building (Anonymous, 2021). (2) Prosody, the tone of voice (happiness, sadness, anger, and fear), can project warmth and attitudes (Hall et al., 2009) and has been used as input for emotion detection (Garcia-Garcia et al., 2017). (3 and 4) Using features at different historical time-points is common practice in statistical learning, especially in time-series modelling (Christ et al., 2018). Lastly, predicting future labels as auxiliary tasks can help learning (Caruana et al., 1996; Cooper et al., 2005; Trinh et al., 2018; Zhu et al., 2020; Shen et al., 2020). Inspired by their work, we propose using historical and future

