REUSING PREPROCESSING DATA AS AUXILIARY SUPERVISION IN CONVERSATIONAL ANALYSIS

Abstract

Conversational analysis systems are trained on noisy human labels and often require heavy preprocessing during multi-modal feature extraction. Using noisy labels in single-task learning increases the risk of overfitting. Auxiliary tasks, however, can improve the performance of the primary task, an approach known as Primary Multi-Task Learning (MTL). A challenge in MTL is selecting auxiliary tasks that are beneficial and avoid negative transfer. In this paper, we explore how the preprocessed data used for feature engineering can be re-used as auxiliary tasks in Primary MTL, thereby promoting a more productive use of data in the form of auxiliary supervised learning. Our main contributions are: (1) the identification of sixteen beneficial auxiliary tasks, (2) a method for distributing learning capacity between the primary and auxiliary tasks, and (3) a relative supervision hierarchy between the primary and auxiliary tasks. Extensive experiments on the IEMOCAP and SEMAINE datasets validate the improvements over single-task approaches and suggest that the approach may generalize across multiple primary tasks.

1. INTRODUCTION

The sharp increase in the use of video-conferencing creates both a need and an opportunity to better understand these conversations (Kim et al., 2019a). In post-event applications, analyzing conversations can provide feedback to improve communication skills (Hoque et al., 2013; Naim et al., 2015). In real-time applications, such systems can be useful in legal trials, public speaking, e-health services, and more (Poria et al., 2019; Tanveer et al., 2015). Analyzing conversations requires both human expertise and considerable time, which many multimodal conversational analysis systems attempt to address through automation. However, building such systems typically requires a training set annotated by humans (Poria et al., 2019). The annotation process is costly, which limits the amount of labeled data. Moreover, third-party annotations of emotions are often noisy. Noisy labels coupled with limited labeled data increase the chance of overfitting (James et al., 2013). For feature engineering in video-conference analysis, analysts often employ pre-built libraries (Baltrušaitis et al., 2016; Vokaturi, 2019) to extract multimodal features as inputs to training. This preprocessing phase is often computationally heavy, yet the resulting features are used only as inputs. In this paper, we investigate how this preprocessed data can be re-used as auxiliary tasks in Primary Multi-Task Learning (MTL), thereby promoting a more productive use of data in the form of auxiliary supervised learning. Specifically, our main contributions are (1) the identification of beneficial auxiliary tasks, (2) a method for distributing learning capacity between the primary and auxiliary tasks, and (3) a relative supervision hierarchy between the primary and auxiliary tasks. We demonstrate the value of our approach by predicting emotions on two publicly available datasets, IEMOCAP (Busso et al., 2008) and SEMAINE (McKeown et al., 2011).
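The idea of re-using preprocessed features as auxiliary supervision, while keeping most learning capacity on the primary task, can be sketched as a weighted combination of losses. The function name, the down-weighting scheme, and the example loss values below are illustrative assumptions, not the paper's actual formulation:

```python
# Illustrative sketch of a Primary MTL objective: a primary loss (e.g. emotion
# prediction) combined with down-weighted auxiliary losses, where each auxiliary
# task reconstructs a preprocessed feature that would otherwise only be an input.
# The name `primary_mtl_loss` and the single shared `aux_weight` are assumptions.

def primary_mtl_loss(primary_loss, aux_losses, aux_weight=0.1):
    """Combine the primary loss with down-weighted auxiliary losses.

    Down-weighting the auxiliary terms keeps most of the model's learning
    capacity on the primary task, so the auxiliary tasks mainly act as
    regularizers against overfitting to noisy primary labels.
    """
    return primary_loss + aux_weight * sum(aux_losses)

# Example: one primary emotion loss and three auxiliary reconstruction losses.
total = primary_mtl_loss(0.8, [0.5, 0.3, 0.2], aux_weight=0.1)
print(round(total, 6))
```

In practice the weighting could be per-task rather than shared; the key design choice sketched here is that auxiliary supervision is added at a lower weight than the primary objective.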

2. RELATED WORKS AND HYPOTHESES

Multitask learning has a long history in machine learning (Caruana, 1997). In this paper, we focus on Primary MTL, a less commonly discussed subfield within MTL (Mordan et al., 2018). Primary

