GAIN: ON THE GENERALIZATION OF INSTRUCTIONAL ACTION UNDERSTANDING

Abstract

Despite the great success achieved in instructional action understanding by deep learning and large-scale data, deploying trained models in unseen environments remains a great challenge, since it requires strong generalizability of models from in-distribution training data to out-of-distribution (OOD) data. In this paper, we introduce a benchmark, named GAIN, to analyze the GeneralizAbility of INstructional action understanding models. In GAIN, we reassemble steps of existing instructional video training datasets to construct OOD tasks and then collect the corresponding videos. We evaluate the generalizability of models trained on in-distribution datasets by their performance on OOD videos and observe a significant performance drop. We further propose a simple yet effective approach, which cuts off the excessive contextual dependency of action steps by performing causal inference, to provide a potential direction for enhancing OOD generalizability. In the experiments, we show that this simple approach improves several baselines on both instructional action segmentation and detection tasks. We expect the introduction of the GAIN dataset to promote future in-depth research on the generalization of instructional video understanding.

1. INTRODUCTION

Instructional videos play an essential role in helping learners acquire new tasks. The explosion of instructional video data on the Internet paves the way for learners to acquire knowledge and for the computer vision community to train models: for example, one can train an action segmentation model to understand a video through dense per-frame step prediction, or an action detection model to localize each step. While a number of datasets for instructional action understanding (IAU) have been proposed over the past years (Alayrac et al., 2016; Das et al., 2013b; Malmaud et al., 2015; Sener et al., 2015) and growing efforts have been devoted to learning IAU models (Zhukov et al., 2019; Huang et al., 2017), the limited generalizability of models remains a major obstacle to deployment in real-world environments. One may ask: "Suppose the model has learned how to inflate bicycle tires; does it know how to inflate car tires?" In fact, due to the potential environmental bias between the training dataset and application scenes, a well-trained model might not deploy well in an OOD environment (Ren et al., 2019), especially when instructional videos of interest to users are not covered by the finite training dataset. To encourage models to learn transferable knowledge, it is desirable to benchmark their generalizability. Though this OOD generalization problem (Barbu et al., 2019; Hendrycks et al., 2021; Hendrycks & Dietterich, 2019) has attracted much attention in the field of image recognition, e.g., ObjectNet (Barbu et al., 2019) and ImageNet-R (Hendrycks et al., 2020), it has barely been explored for the IAU task. A related problem is video domain generalization (VDG) (Yao et al., 2021) for conventional action recognition, which focuses on domain generalization when the scene or background of an action changes.
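To make the two task settings above concrete, a minimal sketch of how action segmentation is typically evaluated — dense per-frame step prediction scored by frame-wise accuracy — is given below. The function name, label encoding, and toy video are illustrative assumptions, not the paper's actual evaluation code.

```python
# Hypothetical sketch: frame-wise accuracy for action segmentation.
# Each frame carries one step label; names and data are illustrative only.

def framewise_accuracy(pred, gt):
    """Fraction of frames whose predicted step matches the ground truth."""
    assert len(pred) == len(gt), "prediction and ground truth must align"
    correct = sum(p == g for p, g in zip(pred, gt))
    return correct / len(gt)

# A toy 8-frame video: 0 = background, 1 = "pour water", 2 = "stir"
gt   = [0, 1, 1, 1, 2, 2, 2, 0]
pred = [0, 1, 1, 2, 2, 2, 2, 0]
print(framewise_accuracy(pred, gt))  # 0.875 (7 of 8 frames correct)
```

Action detection, by contrast, is usually scored by matching predicted step intervals against ground-truth intervals (e.g., via temporal IoU), rather than per frame.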
However, different from conventional actions, the key obstacle to the generalization of instructional actions is the distribution shift of action steps across task categories, which is caused by the collection bias of the datasets. In Fig. 3, we show that the steps under different task categories have different distributions. Given that action steps are the key research objects of IAU and exhibit distribution shift when task categories change, we propose a new evaluation strategy that benchmarks generalizability by reconstructing test task categories from the steps of training tasks and evaluating models on these new task categories. In the reconstruction, we require that training and testing task categories are different but step categories are consistent. As shown at the bottom of Fig. 1, we construct a new testing task "Make Jelly" from existing step categories, including the "prepare seasonings" step in the task "Make Lamb Kebab", the "put into molds" and "take out" steps in "Make Chocolate", and the "stir" and "cut into pieces" steps in "Make Pizza". This construction is non-trivial since existing IAU datasets cannot be directly used. To mitigate the negative effect of temporal context bias, we propose to reduce the over-dependency between steps.
For example, if the task "Inflate Bicycle Tires" is always observed together with bicycle pumps during training, this knowledge will be difficult to transfer to the OOD task "Inflate Car Tires", which involves other inflators. In our approach, we apply the back-door criterion to infer causal effects, and present a Monte Carlo-based method to approximate the distribution after intervention. The method is evaluated with various baselines on both action segmentation and detection tasks, and is shown to produce consistent improvements.
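The back-door adjustment behind this idea can be sketched with a toy Monte Carlo estimate. For a confounder Z (e.g., task context) satisfying the back-door criterion, P(Y | do(X)) = Σ_z P(z) P(Y | X, z): z is sampled from its marginal P(z) rather than from P(z | X), which severs the confounding path. The variable names and toy probabilities below are assumptions for illustration, not the paper's actual model.

```python
import random

random.seed(0)

# Toy back-door adjustment: step prediction Y depends on visual input X
# and a confounding task context Z. All probabilities are made up.
P_Z = {"kitchen": 0.7, "garage": 0.3}   # marginal distribution of confounder Z
P_Y_given_XZ = {                         # P(Y=1 | X=1, Z=z)
    "kitchen": 0.9,
    "garage": 0.2,
}

def p_y_do_x(num_samples=100_000):
    """Monte Carlo estimate of P(Y=1 | do(X=1)) = E_{z~P(z)}[P(Y=1 | X=1, z)]."""
    total = 0.0
    for _ in range(num_samples):
        # Key point: z is drawn from its marginal P(z), not from P(z | X=1).
        z = random.choices(list(P_Z), weights=list(P_Z.values()))[0]
        total += P_Y_given_XZ[z]
    return total / num_samples

# Exact value: 0.7 * 0.9 + 0.3 * 0.2 = 0.69
print(round(p_y_do_x(), 2))  # 0.69
```

In this toy setup, a model trained purely observationally would overweight the "kitchen" context whenever it co-occurs with X; averaging over the marginal of Z removes that contextual shortcut.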

Contributions.

(1) We propose a new evaluation strategy to benchmark the generalizability of IAU models by evaluating them on OOD tasks. (2) We build a real-world OOD instructional video dataset, GAIN, where the OOD tasks are constructed by reassembling the steps of in-distribution training videos.



Figure 1: Two examples of constructing new OOD instructional tasks by reassembling the steps of in-distribution videos in training datasets. For example, the OOD task "Make Jelly" consists of five steps: {prepare seasonings, stir, put into molds, take out, cut into pieces}, where the "prepare seasonings" step is in the task "Make Lamb Kebab", the "put into molds" and "take out" steps come from "Make Chocolate", and the "stir" and "cut into pieces" steps are in the task "Make Pizza". The steps in GAIN are consistent with those in the training set, with non-overlapping task categories. GAIN encourages models to transfer the knowledge learned from training data to OOD data.

First, for most IAU datasets (such as COIN (Tang et al., 2019)), the steps in different videos are not shared; therefore, we cannot construct testing data by splitting the dataset itself. Second, though CrossTask (Zhukov et al., 2019) also collects cross-task videos with partially shared steps, these shared parts are only a minority of the dataset (only 14% of steps are shared, i.e., 73 of 517 steps), and most videos contain steps that are not shared with others. Besides, because the related tasks are not annotated at a fine-grained level, they cannot be used for evaluation. Moreover, CrossTask was built to investigate whether sharing constituent components improves the performance of weakly supervised learning. This motivates us to collect and annotate a real-world IAU dataset, GAIN. It consists of 1,231 videos of 116 OOD tasks with 230 categories of steps, covering a wide range of daily activities. All videos in our GAIN dataset are employed for evaluation. These videos can be split into two groups, GAIN-C and GAIN-B, as counterparts of the COIN (Tang et al., 2019) and Breakfast (Kuehne et al., 2014) datasets, respectively. Furthermore, we propose a simple yet effective approach to enhance the generalizability of IAU models by cutting off excessive contextual dependency through causal inference. It is inspired by the observation that model generalizability is inevitably harmed by shortcuts through a biased context. Compared with previous methods that learn the temporal dependency among video steps with temporal networks, such as TCN (Lea et al., 2017), which applies a hierarchy of temporal convolutions, we instead reduce the excessive dependency between steps via intervention.
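The construction constraint described above — a test task category must be unseen in training, while every one of its steps must already appear in the training step vocabulary — can be sketched as a simple validity check. The task and step names below mirror the Fig. 1 example but are otherwise illustrative, not taken from the actual GAIN annotations.

```python
# Hypothetical sketch of the GAIN construction constraint:
# a candidate OOD task is valid iff its task category is unseen in training
# while all of its steps are covered by the training step vocabulary.

train_tasks = {
    "Make Lamb Kebab": ["prepare seasonings", "grill"],
    "Make Chocolate": ["put into molds", "take out"],
    "Make Pizza": ["stir", "cut into pieces"],
}

def is_valid_ood_task(name, steps, train_tasks):
    """Check the two GAIN conditions: novel task category, known step categories."""
    train_steps = {s for step_list in train_tasks.values() for s in step_list}
    unseen_category = name not in train_tasks
    steps_covered = all(s in train_steps for s in steps)
    return unseen_category and steps_covered

jelly = ["prepare seasonings", "stir", "put into molds",
         "take out", "cut into pieces"]
print(is_valid_ood_task("Make Jelly", jelly, train_tasks))            # True
print(is_valid_ood_task("Make Jelly", jelly + ["boil"], train_tasks))  # False: "boil" unseen
```

This check also makes clear why COIN and CrossTask fail as sources of such splits: in COIN almost no steps are shared across tasks, so `steps_covered` would nearly always be false for any candidate built from held-out tasks.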

