DEEP DEPENDENCY NETWORKS FOR ACTION CLASSIFICATION IN VIDEOS

Abstract

We propose a simple approach that combines the strengths of probabilistic graphical models and deep learning architectures for solving the multi-label action classification task in videos. At a high level, given a video clip, the goal of this task is to infer the set of activities, defined as verb-noun pairs, that are performed in the clip. First, we show that the performance of previous approaches that combine Markov Random Fields with neural networks can be modestly improved by leveraging more powerful methods such as iterative join graph propagation, ℓ1-regularization-based structure learning, and integer linear programming. We then propose a new modeling framework, called the deep dependency network, which attaches a dependency network (a model that is easy to train and learns more accurate dependencies, but is limited to Gibbs sampling for inference) to the output layer of a neural network. We show that, despite its simplicity, jointly training this new architecture yields significant improvements in performance over the baseline neural network. In particular, our experimental evaluation on three video datasets, Charades, Textually Annotated Cooking Scenes (TACoS), and Wetlab, shows that deep dependency networks are almost always superior to pure neural architectures that do not use dependency networks.

1. INTRODUCTION

We focus on the following multi-label action classification (MLAC) task: given a video partitioned into segments (or frames) and a pre-defined set of actions, label each segment (or frame) with the subset of actions performed in it. We consider each action to be a verb-noun pair such as "open bottle", "pour water", and "cut vegetable". MLAC is a special case of the standard multi-label classification (MLC) task, in which, given a pre-defined set of labels and a test example, the goal is to assign the test example a subset of those labels. MLC is notoriously difficult because in practice the labels are often correlated, and predicting them independently may lead to significant errors. Therefore, most advanced methods explicitly model the relationships or dependencies between the labels, using either probabilistic techniques (cf. Wang et al. (2008); Guo & Xue (2013); Tan et al. (2015); Di Mauro et al. (2016); Wang et al. (2014); Antonucci et al. (2013)) or non-probabilistic/neural methods (cf. Papagiannopoulou et al. (2015); Kong et al. (2013); Wang et al. (2021a); Nguyen et al. (2021); Wang et al.). Intuitively, because MLAC is a special case of MLC, methods for solving MLAC should model the relationships between the actions as well as between the actions and the features. To this end, the primary goal of this paper is to develop a simple, general-purpose scheme that models the relationships between actions using probabilistic representation and reasoning techniques and that works on top of neural feature extractors in order to improve their generalization performance. Our work is inspired by previous work on hybrid models that combine probabilistic graphical models (PGMs) with neural networks (NNs) under the assumption that their strengths are often complementary (cf. Johnson et al. (2016); Krishnan et al. (2015)). At a high level, in these hybrid models the NN performs feature extraction while the PGM models the relationships between the labels as well as between the features and the labels.

In previous work, these hybrid models have been used to solve a range of computer vision tasks such as image crowd counting (Han et al., 2017), visual relationship detection (Yu et al., 2022), epileptic seizure detection in multichannel EEG (Craley et al., 2019), face sketch synthesis (Zhang et al., 2020), and semantic image segmentation (Chen et al.,
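To make the dependency-network component concrete, the following is a minimal sketch of inference over a set of correlated binary action labels, assuming logistic conditional distributions for each label given the neural-network features and the remaining labels. All names, weight matrices, and dimensions here are illustrative assumptions, not the paper's implementation; the point is only to show the ordered Gibbs sampler that dependency networks are restricted to.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def gibbs_sample_labels(features, W, V, b, num_iters=200, burn_in=100, rng=None):
    """Estimate marginals P(y_i = 1 | features) in a toy dependency network.

    Each label i has a logistic conditional over the feature vector x and
    the *other* labels y_{-i}:
        P(y_i = 1 | y_{-i}, x) = sigmoid(W[i] @ x + sum_{j != i} V[i, j] * y_j + b[i])
    Inference repeatedly resamples each label from its conditional (Gibbs
    sampling) and averages the post-burn-in samples.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    k = b.shape[0]
    y = rng.integers(0, 2, size=k).astype(float)  # random initial labeling
    counts = np.zeros(k)
    for t in range(num_iters):
        for i in range(k):
            # Condition on the current values of all labels except y_i.
            logit = W[i] @ features + V[i] @ y - V[i, i] * y[i] + b[i]
            y[i] = float(rng.random() < sigmoid(logit))
        if t >= burn_in:
            counts += y
    return counts / (num_iters - burn_in)


# Toy usage with random (hypothetical) parameters: 3 labels, 5 features.
rng = np.random.default_rng(1)
x = rng.normal(size=5)
W = rng.normal(size=(3, 5))
V = rng.normal(size=(3, 3))
b = np.zeros(3)
marginals = gibbs_sample_labels(x, W, V, b, rng=rng)
```

In a deep dependency network, the features `x` would be produced by the neural backbone and the conditionals would be trained jointly with it; the sampler above illustrates why inference in this model family is straightforward but restricted to Gibbs sampling.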

