DEEP DEPENDENCY NETWORKS FOR ACTION CLASSIFICATION IN VIDEOS

Abstract

We propose a simple approach which combines the strengths of probabilistic graphical models and deep learning architectures for solving the multi-label action classification task in videos. At a high level, given a video clip, the goal in this task is to infer the set of activities, defined as verb-noun pairs, that are performed in the clip. First, we show that the performance of previous approaches that combine Markov random fields with neural networks can be modestly improved by leveraging more powerful methods such as iterative join graph propagation, ℓ1-regularization based structure learning, and integer linear programming. Then we propose a new modeling framework called the deep dependency network, which attaches a dependency network, a model that is easy to train and learns more accurate dependencies but is limited to Gibbs sampling for inference, to the output layer of a neural network. We show that despite its simplicity, jointly learning this new architecture yields significant improvements in performance over the baseline neural network. In particular, our experimental evaluation on three video datasets, Charades, Textually Annotated Cooking Scenes (TACoS), and Wetlab, shows that deep dependency networks are almost always superior to pure neural architectures that do not use dependency networks.

1. INTRODUCTION

We focus on the following multi-label action classification (MLAC) task: given a video partitioned into segments (or frames) and a pre-defined set of actions, label each segment with the subset of actions from the pre-defined set that are performed in the segment (or frame). We consider each action to be a verb-noun pair such as "open bottle", "pour water" and "cut vegetable". MLAC is a special case of the standard multi-label classification (MLC) task where, given a pre-defined set of labels and a test example, the goal is to assign the test example a subset of labels. MLC is notoriously difficult because in practice the labels are often correlated, and thus predicting them independently may lead to significant errors. Therefore, most advanced methods explicitly model the relationships or dependencies between the labels, either using probabilistic techniques (cf. Wang et al. (2008); Guo & Xue (2013); Tan et al. (2015); Di Mauro et al. (2016); Wang et al. (2014); Antonucci et al. (2013)) or non-probabilistic/neural methods (cf. Papagiannopoulou et al. (2015); Kong et al. (2013); Wang et al. (2021a); Nguyen et al. (2021); Wang et al. (2021b)). Intuitively, because MLAC is a special case of MLC, methods used for solving MLAC should model relationships between the actions as well as between the actions and features. To this end, the primary goal of this paper is to develop a simple, general-purpose scheme that models the relationships between actions using probabilistic representation and reasoning techniques and works on top of neural feature extractors in order to improve their generalization performance. Our work is inspired by previous work on hybrid models that combine probabilistic graphical models (PGMs) with neural networks (NNs) under the assumption that their strengths are often complementary (cf. Johnson et al. (2016); Krishnan et al. (2015)). At a high level, in these hybrid models NNs perform feature extraction while PGMs model the relationships between the labels as well as between features and labels.
In previous work, these hybrid models have been used for solving a range of computer vision tasks such as image crowd counting (Han et al., 2017), visual relationship detection (Yu et al., 2022), epileptic seizure detection in multichannel EEG (Craley et al., 2019), face sketch synthesis (Zhang et al., 2020), semantic image segmentation (Chen et al., 2018; Lin et al., 2016), animal pose tracking (Wu et al., 2020) and pose estimation (Chen & Yuille, 2014). In this paper, we seek to extend and adapt these previously proposed PGM+NN approaches for solving the multi-label action classification task in videos. Motivated by approaches that combine PGMs with NNs, as a starting point, we investigated using Markov random fields (MRFs), a type of undirected PGM, to capture the relationships between the labels and features computed using convolutional NNs (CNNs). Unlike previous work, which used these MRF+CNN or CRF+CNN hybrids with conventional inference schemes such as Gibbs sampling (GS) and mean-field inference, our goal was to evaluate whether advanced reasoning and learning approaches, specifically (1) iterative join graph propagation (IJGP) (Mateescu et al., 2010), a type of generalized belief propagation technique (Yedidia et al., 2000), (2) integer linear programming (ILP) based techniques for computing most probable explanations, and (3) logistic regression with ℓ1-regularization (Lee et al., 2006; Wainwright et al., 2006) for learning the structure of pairwise MRFs, improve the generalization performance of MRF+CNN hybrids. To measure generalization accuracy, we used several metrics, namely mean average precision (mAP), label ranking average precision (LRAP), subset accuracy (SA) and the Jaccard index (JI), and experimented on three video datasets: (1) Charades (Sigurdsson et al., 2016), (2) TACoS (Regneri et al., 2013) and (3) Wetlab (Naim et al., 2014). We found that, generally speaking, both IJGP and ILP are superior to the baseline CNN and Gibbs sampling in terms of JI and SA but are sometimes inferior to the CNN in terms of mAP and LRAP. We speculated that because MRF structure learners only allow pairwise relationships and impose sparsity or low-treewidth constraints for faster, more accurate inference, they often yield poor posterior probability estimates in high-dimensional settings. Since both mAP and LRAP require good posterior probability estimates, GS, IJGP, and ILP exhibit poor performance when these metrics are used for evaluation. To circumvent this issue, and in particular to derive good posterior estimates, we propose a new PGM+NN hybrid called deep dependency networks (DDNs).
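The four metrics above are standard and available in scikit-learn; the following sketch computes them on toy scores and labels (the values are illustrative, not from the paper's experiments, and the 0.5 decision threshold is our assumption):

```python
import numpy as np
from sklearn.metrics import (accuracy_score, average_precision_score,
                             jaccard_score,
                             label_ranking_average_precision_score)

# Toy ground truth and predicted scores: 4 segments, 3 candidate actions.
y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0],
                   [0, 0, 1]])
y_score = np.array([[0.9, 0.2, 0.8],
                    [0.1, 0.7, 0.3],
                    [0.6, 0.8, 0.2],
                    [0.3, 0.6, 0.9]])
y_pred = (y_score >= 0.5).astype(int)  # threshold posterior estimates

mAP  = average_precision_score(y_true, y_score, average='macro')
lrap = label_ranking_average_precision_score(y_true, y_score)
sa   = accuracy_score(y_true, y_pred)              # exact-match accuracy
ji   = jaccard_score(y_true, y_pred, average='samples')
```

Note that mAP and LRAP are computed from the raw scores (so they reward well-calibrated posteriors and rankings), while SA and JI are computed from the thresholded predictions, which matches the paper's observation that the two groups of metrics can disagree.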
At a high level, a dependency network (DN) (Heckerman et al., 2000) represents a joint probability distribution using a collection of conditional distributions, each defined over one variable given all other variables in the network. A DN is consistent if the conditional distributions come from a unique joint probability distribution; otherwise, it is called inconsistent. A consistent DN has the same representational power as an MRF, in that any MRF can be converted to a consistent DN and vice versa. A key advantage of DNs over MRFs is that they are easy to train, because each conditional distribution can be trained independently and modeled using classifiers such as logistic regression, decision trees, and multi-layer perceptrons. These classifiers can easily be defined over a large number of features, and as a result the conditional distributions they define are often more accurate than those inferred from a sparse or low-treewidth MRF learned from data. However, DNs admit very few inference schemes and in practice are more or less restricted to Gibbs sampling. The second disadvantage of DNs is that because the local distributions are learned independently, they can yield an inconsistent DN, so one has to be careful when performing inference over such DNs. Despite these disadvantages, DNs often learn better models, which typically translates into superior generalization performance (cf. Neville & Jensen (2003); Guo & Gu (2011)). In our proposed deep dependency network (DDN) architecture, a dependency network sits on top of a convolutional neural network (CNN). The CNN converts the input image or video segment to a set of features, and the dependency network uses these features to define a conditional distribution over each label (action) given the features and the other labels.
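As a sketch of how Gibbs sampling over a DN proceeds, the toy sampler below resamples each label in turn from its conditional distribution and averages post-burn-in samples to estimate marginals. The `cond_probs` interface stands in for per-label classifiers (e.g. logistic regression heads) and the toy conditional is purely illustrative, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def gibbs_sample_dn(cond_probs, n_labels, features, n_iters=500, burn_in=100):
    """Estimate label marginals of a dependency network by Gibbs sampling.

    cond_probs(i, features, y) -> P(y_i = 1 | features, y_{-i}).
    """
    y = rng.integers(0, 2, size=n_labels)      # random initial assignment
    counts = np.zeros(n_labels)
    for t in range(n_iters):
        for i in range(n_labels):              # resample each label in turn
            p = cond_probs(i, features, y)
            y[i] = int(rng.random() < p)
        if t >= burn_in:
            counts += y                        # accumulate samples
    return counts / (n_iters - burn_in)        # estimated P(y_i = 1 | features)

# Toy conditional: label i prefers to agree with label (i - 1).
def toy_cond(i, x, y):
    return 0.9 if y[(i - 1) % len(y)] == 1 else 0.1

marginals = gibbs_sample_dn(toy_cond, n_labels=3, features=None)
```

If the independently trained conditionals are inconsistent, this procedure still runs, but the sampled distribution depends on the update order, which is why the paper cautions that inference over inconsistent DNs requires care.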
We show that deep dependency networks are easy to train, either jointly or via a pipeline method in which the CNN is trained first, followed by the DN, by defining an appropriate loss function that minimizes the negative pseudo-log-likelihood of the data. We conjecture that because DDNs can be quite dense, they often learn a better representation of the data, and as a result they are likely to outperform MRFs learned from data in terms of posterior predictions. We evaluated DDNs using the four aforementioned metrics and three datasets. We observed that they are often superior to the baseline neural networks as well as to the MRF+CNN methods that use GS, IJGP and ILP on all four metrics. Specifically, they achieve the highest score on the mAP metric on all the datasets. We compared the pipeline model with the jointly learned model and found that the joint model is more accurate than the pipeline model. In summary, this paper makes the following contributions:
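A minimal sketch of the negative pseudo-log-likelihood objective, assuming logistic conditionals whose input is the CNN features concatenated with the other labels (this parameterization and the weight layout are our assumptions, not the paper's exact architecture):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def negative_pseudo_log_likelihood(W, b, feats, labels):
    """NPLL(y | x) = -sum_i log P(y_i | feats, y_{-i}).

    Each label's conditional is a hypothetical logistic head:
    P(y_i = 1 | feats, y_{-i}) = sigmoid(W[i] . [feats; y_{-i}] + b[i]).
    """
    n_labels = len(labels)
    npll = 0.0
    for i in range(n_labels):
        others = np.delete(labels, i)              # y_{-i}
        inp = np.concatenate([feats, others])
        p = sigmoid(W[i] @ inp + b[i])             # P(y_i = 1 | feats, y_{-i})
        npll -= labels[i] * np.log(p) + (1 - labels[i]) * np.log(1 - p)
    return npll
```

Because the objective decomposes into one term per label, the conditionals can be trained independently (the pipeline setting), or the gradient can be propagated back through `feats` into the CNN (the joint setting).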




