GAIN: ON THE GENERALIZATION OF INSTRUCTIONAL ACTION UNDERSTANDING

Abstract

Despite the great success achieved in instructional action understanding by deep learning and mountainous data, deploying trained models to the unseen environment still remains a great challenge, since it requires strong generalizability of models from in-distribution training data to out-of-distribution (OOD) data. In this paper, we introduce a benchmark, named GAIN, to analyze the Generaliz-Ability of INstructional action understanding models. In GAIN, we reassemble steps of existing instructional video training datasets to construct the OOD tasks and then collect the corresponding videos. We evaluate the generalizability of models trained on in-distribution datasets with the performance on OOD videos and observe a significant performance drop. We further propose a simple yet effective approach, which cuts off the excessive contextual dependency of action steps by performing causal inference, to provide a potential direction for enhancing the OOD generalizability. In the experiments, we show that this simple approach can improve several baselines on both instructional action segmentation and detection tasks. We expect the introduction of the GAIN dataset will promote future in-depth research on the generalization of instructional video understanding. The project page is https://jun-long-li.github.io/GAIN.

1. INTRODUCTION

Instructional videos play an essential role in helping learners acquire different tasks. The explosion of instructional video data on the Internet paves the way for learners to acquire knowledge and for the computer vision community to train models. For example, one can train an action segmentation model to understand a video through dense step prediction on each frame, or an action detection model to localize each step. While a number of datasets for instructional action understanding (IAU) have been proposed over the past years (Alayrac et al., 2016; Das et al., 2013b; Malmaud et al., 2015; Sener et al., 2015) and growing efforts have been devoted to learning IAU models (Zhukov et al., 2019; Huang et al., 2017), the limited generalizability of models remains a major obstacle to deployment in real-world environments. One may ask: "Suppose the model has learned how to inflate bicycle tires, does it know how to inflate car tires?" In fact, due to potential environmental bias between the training dataset and application scenes, a well-trained model might not deploy well in an OOD environment (Ren et al., 2019), especially when the instructional videos of interest to users are not covered by the finite training dataset. To encourage models to learn transferable knowledge, it is desirable to benchmark their generalizability. Though this OOD generalization problem (Barbu et al., 2019; Hendrycks et al., 2021; Hendrycks & Dietterich, 2019) attracts much attention in the field of image recognition, with benchmarks such as ObjectNet (Barbu et al., 2019) and ImageNet-R (Hendrycks et al., 2020), it has barely been explored for the IAU task. A related problem is video domain generalization (VDG) (Yao et al., 2021) for conventional action recognition, which focuses on domain generalization when changing the scene or background of the action.
However, different from conventional action, the key obstacle to the generalization of instructional action is the distribution shift of action steps under different task categories, which is caused by the collection bias of the datasets. In Fig. 3, we show that the steps under different task categories have different distributions.

Figure 1: Two examples of constructing new OOD instructional tasks by reassembling the steps of in-distribution videos in training datasets. For example, the OOD task "Make Jelly" consists of five steps: {prepare seasonings, stir, put into molds, take out, cut into pieces}, where the "prepare seasonings" step is in the task "Make Lamb Kebab", the "put into molds" and "take out" steps come from "Make Chocolate", and the "stir" and "cut into pieces" steps are in the task "Make Pizza". The steps in GAIN are consistent with those in the training set, with non-overlapping task categories. GAIN encourages models to transfer the knowledge learned from training data to OOD data.

Given the motivation that action steps are the key research objects of IAU and exhibit distribution shift when task categories change, we propose a new evaluation strategy to benchmark generalizability by reconstructing test task categories from the steps of training tasks and evaluating the models on these new task categories. In the reconstruction, we require that training and testing task categories are different but step categories are consistent. As shown at the bottom of Fig. 1, we try to find a new testing task "Make Jelly" with existing step categories, including the "prepare seasonings" step in the task "Make Lamb Kebab", the "put into molds" and "take out" steps in "Make Chocolate", and the "stir" and "cut into pieces" steps in "Make Pizza". This construction is non-trivial since existing IAU datasets cannot be directly used.
First, for most IAU datasets (such as COIN (Tang et al., 2019)), the steps in different videos are not shared; therefore, we cannot construct testing data by splitting the dataset itself. Second, though CrossTask (Zhukov et al., 2019) also collects cross-task videos with partial steps shared, these shared parts are only a minority in the dataset (only 14% of the steps are shared, i.e. 73 of a total of 517 steps) and most videos have steps that are not shared with others. Besides, because the related tasks do not have fine-grained annotations, they cannot be used for evaluation. Furthermore, CrossTask was built to investigate whether sharing constituent components improves the performance of weakly supervised learning. This motivates us to collect and annotate a real-world IAU dataset, GAIN. It consists of 1,231 videos of 116 OOD tasks with 230 categories of steps, covering a wide range of daily activities. All videos in our GAIN dataset are employed for evaluation. These videos can be split into two groups, GAIN-C and GAIN-B, as counterparts of the COIN (Tang et al., 2019) and Breakfast (Kuehne et al., 2014) datasets, respectively.

Furthermore, we propose a simple yet effective approach to enhance the generalizability of IAU models, which cuts off excessive contextual dependency by performing causal inference. It is inspired by the observation that model generalizability is inevitably influenced by shortcut learning from a biased context. Whereas previous methods learn the temporal dependency among video steps with temporal networks, such as TCN (Lea et al., 2017), which applies a hierarchy of temporal convolutions, we propose to reduce the over-dependency between steps to mitigate the negative effect of temporal context bias. For example, if the task "Inflate Bicycle Tires" is always observed together with "bicycle pumps" during the training process, this knowledge will be difficult to transfer to the OOD task "Inflate Car Tires", which uses other inflaters.
In our approach, we apply the Back-Door Criterion to infer causal effect, and present a Monte Carlo based method to approximate the distribution after "intervention". The method is evaluated with various baseline methods on both action segmentation and detection tasks, and is shown to produce consistent improvements.

Contributions.

(1) We propose a new evaluation strategy to benchmark the generalizability of IAU models by evaluating the models on the OOD tasks. (2) We build a real-world OOD instructional video dataset, GAIN, where the OOD tasks are constructed by reassembling the steps of training datasets. (3) We propose a simple yet effective approach, cutting off excessive contextual dependency by causal inference, which provides a potential direction to enhance generalizability.

2. THE GAIN DATASET

In this section, we introduce our GAIN dataset, a video-based dataset covering a large range of daily tasks reassembled via a specific framework, which collects tasks whose categories are different from the training tasks to benchmark the generalizability of IAU models. For convenience, we refer to tasks whose categories are the same as those in the training dataset as in-distribution tasks, and to tasks with different categories as OOD tasks. Note that, when we mention "in-distribution" and "OOD", the variables are steps, not tasks. That is, the steps are "in-distribution"/"OOD" under the same/different tasks. To the best of our knowledge, GAIN is the first dataset to evaluate the generalizability of IAU models on OOD steps. Fig. 2 shows the pipeline to construct the GAIN dataset. Below, we describe the details of our dataset, including how to benchmark the generalizability, how to collect the data and construct the dataset, and the basic dataset statistics.

2.1. PROBLEM DEFINITION

Generalizability is of critical importance for deploying models in a real-world environment, especially for IAU systems; e.g., we expect the model to know how to "Inflate Car Tires" after learning how to "Inflate Bicycle Tires". To this end, we propose to benchmark the generalizability of IAU models by building an OOD evaluation dataset, in which task categories are constructed by reassembling the steps of the training set. With this construction setting, the step categories are consistent while the step distribution changes.
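The construction setting above can be illustrated with a minimal Python sketch. The `(task, steps)` representation and the function name `is_valid_ood_split` are assumptions made for illustration and are not part of any released GAIN code. The sketch checks that task categories are disjoint and that every evaluation step also occurs in training, which is a subset relaxation of the step-consistency requirement.

```python
def is_valid_ood_split(train_videos, eval_videos):
    """Check the two construction constraints on hypothetical (task, steps) pairs.

    Returns True when no evaluation task appears in training and every
    evaluation step category also appears in the training set.
    """
    train_tasks = {task for task, _ in train_videos}
    eval_tasks = {task for task, _ in eval_videos}
    train_steps = {step for _, steps in train_videos for step in steps}
    eval_steps = {step for _, steps in eval_videos for step in steps}

    tasks_disjoint = train_tasks.isdisjoint(eval_tasks)
    steps_consistent = eval_steps <= train_steps  # subset check (assumption)
    return tasks_disjoint and steps_consistent


# Toy data following the "Make Jelly" example from Fig. 1.
train = [
    ("Make Lamb Kebab", ["prepare seasonings"]),
    ("Make Chocolate", ["put into molds", "take out"]),
    ("Make Pizza", ["stir", "cut into pieces"]),
]
ood_eval = [
    ("Make Jelly", ["prepare seasonings", "stir", "put into molds",
                    "take out", "cut into pieces"]),
]
overlapping_eval = [("Make Pizza", ["stir", "cut into pieces"])]

valid = is_valid_ood_split(train, ood_eval)
invalid = is_valid_ood_split(train, overlapping_eval)
```

Here `valid` is true because "Make Jelly" is a new task built only from seen steps, while `invalid` is false because "Make Pizza" already exists in the training tasks.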

The training dataset $X^T = \{X^T_i\}_{i=1}^{n_T}$ contains $n_T$ instructional videos $X^T$, where each video is composed of a set of steps, $X^T_i = S^T_i$. This step set can be formulated as $S^T_i = \{s^T_{i,j}\}_{j=1}^{n_s}$, where $n_s$ is the number of steps in a video. In the conventional experimental setting, both training and evaluation data are in-distribution, which is formulated as $X^T \overset{\text{i.i.d.}}{\sim} X_{Source}$ and $X^E \overset{\text{i.i.d.}}{\sim} X_{Source}$, where $X_{Source}$ denotes the data distribution and $X^E$ denotes the videos in the evaluation set. To benchmark the generalizability of models, we collect the videos of unseen tasks with seen steps, where videos in the evaluation set are OOD tasks that can be formulated as follows:

$$X^E_{OOD} \overset{\text{i.i.d.}}{\sim} X_{Target}, \quad X^T \overset{\text{i.i.d.}}{\sim} X_{Source} \quad \text{s.t.} \quad \Omega^T_S = \Omega^E_S,$$

where $\Omega^T_S$ and $\Omega^E_S$ respectively denote the set of all steps in the training and evaluation set, and $X^E_{OOD}$ denotes the collected OOD evaluation dataset. In Fig. 1, we show some collected videos of our GAIN dataset, where the collected videos follow different step distributions but share the same step space. Finally, we evaluate the models trained on $X^T$ with the OOD evaluation dataset $X^E_{OOD}$ to benchmark the OOD generalizability.

[Table 1: comparison of evaluation strategies. Surviving rows: (Sener & Yao, 2019) ✓ ✓ ✓ ✓ ✗; VDG (Yao et al., 2021) ✗ ✓ ✗ ✓ ✗; Ours ✓ ✓ ✗ ✓ ✓]

Here we distinguish our defined OOD generalizability evaluation from other evaluation strategies. We summarize the comparisons with different evaluation methods in Table 1. First, compared to conventional supervised and unsupervised methods, we focus on the OOD evaluation to benchmark the generalizability of models. Second, UDA (Busto et al., 2018; Zhang et al., 2019) aims to transfer knowledge from a source domain to some known target domain. It needs the domain index (e.g. the target data) during training to minimize the domain gap between the source and the known target. Compared with UDA, our OOD generalization further considers how to solve the problem without any domain indices.
Though zero-shot recognition (ZSR) (Sener & Yao, 2019; Wang et al., 2019) also focuses on the generalizability of models, it is too difficult to conduct zero-shot analysis directly for IAU models since ZSR requires the models to understand unseen action steps. This setting can be used for task-level actions (e.g. classification) given extra descriptions (Wang et al., 2019), but not for complex action understanding tasks such as action segmentation or action detection. Recently, many methods in the field of image classification have attempted to evaluate the generalizability of models by collecting or generating OOD data, e.g. ObjectNet (Barbu et al., 2019) and ImageNet-R (Hendrycks et al., 2021). However, how to evaluate the generalizability of models for the more complex IAU task has barely been explored. The most related work is VDG (Yao et al., 2021), which evaluates the domain generalization ability of action recognition models when changing the scene or background. Unlike VDG, our setting focuses on the distribution shift of action steps when task categories are changed in the target domain, which is more common in the field of IAU.

Figure 2: ... (right). Given an instructional video training set, we first separate the steps of these tasks and generate a large number of task candidates. Secondly, we filter out the unqualified ones according to three principles. Then, we search for YouTube videos related to the selected tasks and download the videos that have high relevance to the queries, explicit instructions, and rich diversity.

2.2. TASK SELECTION

To construct an evaluation dataset consisting of diverse and high-quality daily tasks, we choose the largest fine-grained annotated dataset, COIN (Tang et al., 2019), and the widely used instructional video dataset, Breakfast (Kuehne et al., 2014), as the training sets. How should we select the new tasks to benchmark the generalizability? We argue that the tasks in our GAIN dataset should satisfy three basic principles:

• Task Non-overlapping: The steps in our GAIN dataset are out-of-distribution, which requires the tasks in GAIN to be non-overlapping with those in the original training set. The model performance on these non-overlapping tasks can intuitively indicate the generalizability.
• Step Consistent: Although the step distributions are different under non-overlapping tasks, we require that the categories of steps in the testing videos are consistent with the training dataset. On the one hand, with totally different steps, IAU would be even more difficult, which deviates from our goal of benchmarking the generalizability. On the other hand, the steps in the training set are common in daily life, which is of critical importance for IAU.
• Category Diverse: The third principle encourages the annotators to discover more diverse data. In other words, we argue that the more task categories, the better. For example, a dataset (with 3 tasks) containing 2 videos of repairing a car, 2 videos of repairing a roof, and 1 video of repairing a television is more diverse than a dataset (with only 1 task) containing 5 videos of repairing a car. More diverse data yields more reliable benchmarking.

With the principles above, as shown in Fig. 2, we first generate a large number of task candidates (i.e. step combinations) and then filter out the unqualified ones. Specifically, we use the steps in the training dataset as anchor steps and generate 10 step combinations with the caption clues, where each step combination contains 2∼5 steps.
The captions in instructional videos often mention steps that are not in the current task but are closely related to other steps in the video, which helps to generate step combinations. In total, we generate more than 8,000 task candidates. Here we provide details on how we filter the candidates. Given a candidate, we first check whether it is logical. For example, the candidate "lifting jack, replace the tire, remove the jack" makes sense because it could form the task "Change Car Tire"; the candidate "lifting jack, replace the tire, add the seasoning" is not acceptable, since it would rarely happen sensibly in daily life. Then, according to Task Non-overlapping, we drop the logical candidates if they make up tasks that already exist in the training sets. Since all candidates are composed of steps taken directly from the training sets, they inherently satisfy the principle Step Consistent. In the first round of annotation, we ask 11 annotators to go through these candidates and annotate whether each candidate is reasonable and satisfies the above principles. We filter out the candidates annotated as unqualified by more than half of the annotators and finally select 147 candidates. In the second round, annotators are asked to name the new OOD task, refine the current step combinations, and collect the videos from the Internet. By filtering out the rare actions, we finally collect 1,231 videos of 116 tasks. In the last round, the annotators label the fine-grained temporal boundaries of each step in the videos.
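The first-round filtering rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: votes are represented as booleans (True meaning "qualified"), and the helper name `keep_candidate` and the toy candidates are hypothetical.

```python
def keep_candidate(votes):
    """Keep a candidate unless more than half of the annotators marked it unqualified.

    votes: list of booleans, one per annotator (True = qualified).
    """
    unqualified = sum(1 for vote in votes if not vote)
    return unqualified <= len(votes) / 2


# Toy example with 11 annotators, as in the first annotation round.
candidates = {
    "lifting jack, replace the tire, remove the jack": [True] * 9 + [False] * 2,
    "lifting jack, replace the tire, add the seasoning": [True] * 1 + [False] * 10,
    "borderline candidate": [True] * 6 + [False] * 5,
}
selected = [name for name, votes in candidates.items() if keep_candidate(votes)]
```

With 11 annotators, a candidate is dropped once 6 or more mark it as unqualified, so the "borderline candidate" with 5 negative votes is kept.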

[Figure 3: step frequency rates on COIN Train, COIN Test, and GAIN-C; horizontal axis: step, vertical axis: frequency rate.]

Given the selected tasks, we search for YouTube videos related to the task names. We use a query with exactly the task name, or the task name following a "how to" prefix, to locate instructional videos; e.g. for the task "Change Car Tire" we use "change car tire" or "how to change car tire". To improve quality and diversity, we adopt several criteria to select videos: high relevance to the queries, explicit instructions, and rich diversity. We prefer videos that are more relevant to the queries and contain explicit instructions with pictures, since visual models cannot learn from narrations alone. Besides, videos with explicit steps to complete tasks are also favorable, although there might be no vocal instructions. Furthermore, the explicit steps in a video do not need to exactly match those in its task; in other words, a permutation or a proper subset of the task's steps is acceptable. Moreover, if similar steps are witnessed in different videos of a task, like "add salt" and "add sugar", we regard them as the same step. Undefined steps are considered as background and are not further annotated. On the one hand, during the data collection stage, if a video contains a long stretch of undefined but important steps, the video is filtered out according to the principle of Step Consistent. On the other hand, videos with undefined but meaningless steps are acceptable, and these steps are considered as background. After collecting the videos, we use the annotation tool provided in (Tang et al., 2019) to label the corresponding step categories and segments.

2.4. STATISTICS

The final version of the GAIN dataset consists of 1,231 instructional videos related to 116 unseen tasks. Our GAIN dataset is a pure evaluation dataset to benchmark the generalizability of IAU models. Each task in GAIN contains 2∼24 videos with an average of 10 videos. We annotate 6,382 action segments in GAIN with an average of 5 steps in each video. The average length of videos is 2 minutes and 30 seconds, and the average length of steps is 12 seconds. In total, the GAIN dataset contains 51.2 hours of OOD videos for generalizability evaluation. GAIN can be divided into two splits as counterparts of the COIN (Tang et al., 2019) and Breakfast (Kuehne et al., 2014) datasets, and we name them GAIN-C and GAIN-B, respectively. COIN is a large-scale benchmark with 9,030 training videos and 2,797 testing videos of 180 tasks. As its counterpart, GAIN-C contains 1,000 videos of 100 unseen tasks with a length of 41.6 hours in total, where 5,238 segments are annotated. Fig. 3 shows the step distributions on COIN Train, COIN Test, and our GAIN-C, where the horizontal axis denotes the different steps and the vertical axis denotes the frequency rates of these steps. We can observe that the step distributions in the original COIN Train and Test sets are similar, but different from that of our GAIN-C. This supports our assumption that the step distributions differ under different task categories. Breakfast is composed of more than 1.9k cooking-related videos of 10 breakfast routines such as "Make Coffee" and "Cook Pancakes". Accordingly, the GAIN-B split includes 231 videos of 16 OOD tasks with an average length of 2 minutes and 30 seconds. These tasks consist of 20 fine-grained action categories. We provide more statistical data and analysis in the Appendix.
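The frequency rates plotted in Fig. 3 can be computed with a short Python sketch. The list-of-step-labels video format and the toy data are assumptions for illustration. The total variation distance used to summarize the gap between two distributions is also an assumption of this sketch; the paper compares the distributions visually and does not name a distance.

```python
from collections import Counter


def step_frequency_rates(videos):
    """Relative frequency of each step category over all annotated segments."""
    counts = Counter(step for steps in videos for step in steps)
    total = sum(counts.values())
    return {step: n / total for step, n in counts.items()}


def total_variation(p, q):
    """Total variation distance between two step distributions (illustrative choice)."""
    steps = set(p) | set(q)
    return 0.5 * sum(abs(p.get(s, 0.0) - q.get(s, 0.0)) for s in steps)


# Toy data: the same step categories, but different step distributions.
train_videos = [["stir", "add salt"], ["stir", "cut into pieces"]]
ood_videos = [["cut into pieces", "cut into pieces", "stir", "add salt"]]

train_rates = step_frequency_rates(train_videos)
ood_rates = step_frequency_rates(ood_videos)
shift = total_variation(train_rates, ood_rates)
```

In this toy case the step space is identical in both sets, yet `shift` is non-zero, which mirrors the situation described for COIN Train and GAIN-C.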

3. METHOD

In this section, we first construct a causal graph for the IAU problem. Then, we introduce our method which applies causal inference to mitigate the negative effect of confounding context bias.

3.1. CAUSAL GRAPH CONSTRUCTION

The widely researched action understanding tasks include action segmentation and detection, which both focus on the steps. Without loss of generality, we formulate the task as $P(Y|S) = f_\theta(S)$, where $S$ denotes a step in the video $X$, $Y$ is the prediction, and $f_\theta$ represents the model. Then, we formulate the action understanding framework in light of a causal graph $G = \{V, E\}$, where the nodes $V$ include the step $S$, the model prediction $Y$, and the context steps $Z$. Note that $X = S \cup Z$ and $S \cap Z = \emptyset$, i.e. the video $X$ can be divided into the query step $S$ and the context steps $Z$. The links $E$ indicate the dependence (computational but not strict causal direction (Liu et al., 2021)) between two variables. For example, $S \to Y$ in the causal graph indicates that variable $S$ is the cause of variable $Y$. We show the causal structure of the IAU problem in Fig. 4 (a) and explain it as follows:

• $Z \to Y \leftarrow S$ indicates that the model prediction depends on both the step $S$ and the context steps $Z$. For example, when recognizing the current step $S$, temporal models (e.g. LSTM (Hochreiter & Schmidhuber, 1997) and C3D (Tran et al., 2015)) always use temporal context clues for the current prediction, which leads to $Z \to Y$.
• $S \leftarrow Z \to Y$ denotes that the video context steps $Z$ simultaneously affect the steps and the model prediction. $Z \to Y$ has been explained above, and $S \leftarrow Z$ is intuitive due to the temporal dependency of video. Thus the context bias $Z$ is a confounder (Pearl, 2009), which misleads the model to focus on spurious correlations and reduces the generalizability of the model.

The causal graph describes the information flow during inference. When $S$ is being estimated, the other context steps are $Z$, and since $S$ is affected by $Z$ during inference, $Z$ points to $S$. We now show that the model prediction is misled by the spurious correlations of context bias when we only consider the likelihood $P(Y|S)$. As shown in Fig. 4 (a), we rewrite $P(Y|S)$ with the Bayes rule as:

$$P(Y|S) = \sum_z P(Y|S, Z = z)\, P(Z = z|S),$$

which shows that the likelihood $P(Y|S)$ is influenced by $P(Z = z|S)$. We use an example to show that $P(Z = z|S)$ is biased. In the video "Inflate bicycle tires", the current content $S$ is "installing the nozzle" and the context $Z$ is "using bicycle pump". The content $S$ and the context $Z$ are always observed together during training, and thus $P(Z = \text{using bicycle pump} \mid S = \text{installing the nozzle})$ is high. This leads the model to predict a higher probability $P(Y \mid S = \text{installing the nozzle})$ when observing $Z = \text{using bicycle pump}$, and vice versa. However, when we apply the model to analyze the OOD video "Inflate car tires", where the context $Z$ "using bicycle pump" is absent, the model may be confused and consequently give a wrong prediction. Additionally, although there is actually a bidirectional effect between $S$ and $Z$, we find that Bayes' theorem and Eq. 3 remain unchanged. In Fig. 4 (a), we use $S \leftarrow Z$ to highlight the bias caused by the confounder $S \leftarrow Z \to Y$, which is demonstrated by Fig. 3.

Motivated by causal inference methods (Pearl, 2009; Glymour et al., 2016), we propose to conduct an intervention to alleviate the negative effect of context bias. In causal inference, the intervention is represented as $do(\cdot)$. Once intervened, a variable has no incoming links anymore, and its previous incoming links in the causal graph are cut off. As shown in Fig. 4 (a), when we intervene on $S$ with the Back-Door Criterion in (Pearl, 2009), i.e. $do(S)$, the link between $S$ and $Z$ is cut off, and so is the dependency. We formulate the model prediction process under the intervention as:

$$P(Y|do(S)) = \sum_z P(Y|S, Z = z)\, P(Z = z),$$

where $Z = z$ is independent of $S$. Thus, after the intervention, when the model predicts the label $Y$ from $do(S)$, it fairly takes every $z$ into consideration.
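To make the difference between conditioning and intervening concrete, the following Python sketch evaluates both sums on a toy table. All probabilities are hypothetical numbers chosen for illustration, and the two context values ("bicycle pump", "car inflater") are a simplification of the tire example, not measurements from the paper.

```python
def observational(p_y_given_sz, p_z_given_s):
    """P(Y|S) = sum_z P(Y|S, Z=z) * P(Z=z|S)."""
    return sum(p_y_given_sz[z] * p_z_given_s[z] for z in p_y_given_sz)


def interventional(p_y_given_sz, p_z):
    """P(Y|do(S)) = sum_z P(Y|S, Z=z) * P(Z=z), the back-door adjustment."""
    return sum(p_y_given_sz[z] * p_z[z] for z in p_y_given_sz)


# Hypothetical numbers for S = "installing the nozzle".
# Probability of the correct prediction for each context z.
p_y_given_sz = {"bicycle pump": 0.9, "car inflater": 0.3}
# Biased context distribution seen with S in the training videos.
p_z_given_s = {"bicycle pump": 0.95, "car inflater": 0.05}
# Context prior that does not depend on S.
p_z = {"bicycle pump": 0.5, "car inflater": 0.5}

p_obs = observational(p_y_given_sz, p_z_given_s)  # about 0.87
p_do = interventional(p_y_given_sz, p_z)          # about 0.60
```

The observational value is dominated by the frequently co-occurring context, whereas the interventional value weights every context by its prior and therefore exposes how weak the prediction is when the familiar context is absent.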
For a more detailed introduction to and derivation of the Back-Door Criterion, please see Section 3.3 of (Glymour et al., 2016). However, the intervention is a great challenge, since the prediction under this intervention is subject to the prior $P(Z)$, which is difficult to compute numerically. Thus, we simulate the intervention. We replace the numeric process with a sampling process and approximate the prior $P(z)$ by the Monte Carlo method. As shown in Fig. 4 (b), we regard each step as an individual instance and put all the steps in a lottery box, i.e. a step pool. Statistically, all the steps in the training set form the population $Z = \{z_1, ..., z_n\}$ with $n$ categories, and we can then sample from this population. During the sampling process, the frequency of $z$ is not affected by $X$ anymore and is only related to the statistics of the training set. We approximate the prior with the relative frequency:

$$P(Z = z') = \frac{\sum_{z \in \Omega} I(z = z')}{\|\Omega\|},$$

where $\Omega$ denotes the sampling population, $\|\Omega\|$ is the sample size, and $I$ is an indicator function.

[Figure 4 (b) panel labels: Videos, Step Pool, Reassemble, "Intervened" Videos]

We use the sampled steps to assemble "intervened" videos to learn causations of $X$ on $Y$, instead of the spurious correlations due to the context bias $Z$. Fig. 4 (b) illustrates an example of this process, in which $s_1$ only occurs with $z_1$ and $z_2$ in the original videos, and consequently models tend to learn spurious correlations between them. After disassembling and reassembling, in the "intervened" videos, $s_1$ can be observed with other steps, like $z_3$ and $z_4$, and the occurrence of $z$ does not depend on $s_1$. Technically, our causal "intervention" can be regarded as a new kind of data augmentation, where we disassemble the steps and reassemble them as new video data.
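The disassemble-and-reassemble procedure can be sketched as follows. This is a minimal illustration, not the authors' implementation: steps are represented by their labels rather than by feature segments, and the helper names, the number of sampled context steps, and the random placement of the query step are all assumptions.

```python
import random
from collections import Counter


def build_step_pool(videos):
    """Disassemble videos (lists of step labels) into one pool of step instances."""
    return [step for video in videos for step in video]


def empirical_prior(pool):
    """Relative frequency of each step category in the pool, approximating P(Z = z')."""
    counts = Counter(pool)
    return {step: n / len(pool) for step, n in counts.items()}


def assemble_intervened_video(query_step, pool, n_context, rng):
    """Draw context steps uniformly from the pool, independently of the query step,
    and insert the query step at a random position (placement is an assumption)."""
    context = [rng.choice(pool) for _ in range(n_context)]
    position = rng.randrange(n_context + 1)
    return context[:position] + [query_step] + context[position:]


# Toy videos: s1 only ever co-occurs with z1 and z2 in the original data.
videos = [["s1", "z1", "z2"], ["z3", "z4"], ["z3", "z1"]]
pool = build_step_pool(videos)
prior = empirical_prior(pool)
rng = random.Random(0)
new_video = assemble_intervened_video("s1", pool, n_context=3, rng=rng)
```

Because each context step is drawn uniformly from the pool of instances, a category is selected with probability equal to its relative frequency, which matches the prior approximated above and does not depend on the query step.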

4. EXPERIMENTS

In this section, we provide performance comparisons between the in-distribution datasets and the out-of-distribution GAIN dataset, and assess the effectiveness of our causal approach on both action segmentation and action detection tasks. We conduct experiments on three datasets, where COIN (Tang et al., 2019) and Breakfast (Kuehne et al., 2014) are used for both training and testing, and our GAIN dataset is only used for evaluation. As mentioned in Section 2.4, COIN and Breakfast are widely used for IAU, so GAIN is split into two groups as their counterparts, named GAIN-C and GAIN-B, respectively. Please refer to Section 2.4 for more descriptions of these datasets. The implementation details, results, and analysis are described below. More experimental results and visualization examples can be found in the Appendix.

4.1. IMPLEMENTATION DETAILS

We conduct experiments with the following baseline methods: (1) LSTM (Hochreiter & Schmidhuber, 1997) is one of the earliest and most popular deep models for temporal modeling. (2) ED-TCN (Lea et al., 2017) applies a hierarchical encoder-decoder framework with temporal convolutions, pooling, and upsampling to learn temporal patterns. We use 5 convolution layers for both the encoder and the decoder, with convolutional filters of size 25. (3) TResNet (He et al., 2016) adds a residual stream to the encoder-decoder framework. We follow the network structure depicted in (Lei & Todorovic, 2018) and adopt the same experimental setting as for ED-TCN. (4) MS-TCN++ (Li et al., 2020) proposes a multi-stage architecture, which first generates initial predictions and then refines them several times. Our implementation is built upon the publicly provided codebase. For COIN/GAIN-C, we use a temporal video resolution of 10 fps and extract S3D (Miech et al., 2020) features with a model pretrained on HowTo100M (Miech et al., 2019) as the model input. For Breakfast/GAIN-B, we use I3D (Carreira & Zisserman, 2017) features (pretrained on Kinetics (Carreira & Zisserman, 2017)) sampled at 15 fps as the model input. For the original evaluation set, we follow the default setting and present results on split 1. For all experiments, we employ a 1 × 1 convolution layer to project the features into an embedding space of dimension 64. Then, we apply the different baseline methods to model the spatio-temporal clues. All settings are identical for the baseline methods and our methods.

4.2. ACTION SEGMENTATION

Setting: Action segmentation aims at assigning a step label to each video frame. This task is a key step toward understanding complex actions in instructional videos. We adopt frame-wise accuracy, which is the number of correctly predicted frames divided by the total number of video frames. For Breakfast/GAIN-B, we also adopt the edit distance and the F1 score (Lea et al., 2017) at an overlap threshold of 10% to further measure the quality of the model predictions. We observe an obvious performance drop of approximately 10.0 on average, although the steps of the two datasets are shared. Besides, Table 3 shows the results on Breakfast and GAIN-B. The performance gap is more pronounced, with the frame accuracy decreasing by approximately 60% on average. This indicates that current methods lack generalizability on out-of-distribution tasks. We observe that causal-based methods achieve consistent improvements in the OOD scenario. Besides, we find that LSTM performs poorly on the F1 score and edit score, while the other baseline methods work well. By qualitatively checking the predictions, we find that LSTM can only tell actions from the background but fails to classify the action categories correctly. This may be because the LSTM model is overly dependent on temporal relations, which weakens its representational capacity.

Quantitative Analysis: We compare our methods with the baseline methods to demonstrate their effectiveness. Table 2 (the Overall row) and Table 3 summarize the performance comparisons of our methods with all four baseline methods: LSTM (Hochreiter & Schmidhuber, 1997), ED-TCN (Lea et al., 2017), TResNet (He et al., 2016), and MS-TCN++ (Li et al., 2020). For all four baseline methods, our causal-based methods achieve significant and consistent improvements on GAIN and obtain comparable results on the original evaluation sets with the frame accuracy metric. For example, after blocking the causal link between S and Z, Causal MS-TCN++ outperforms the baseline by a relative margin of over 8.1% on GAIN-C. Moreover, on GAIN-B the causal inference methods outperform the baselines by a relative margin of over 69.3% on average.
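For reference, the frame-wise accuracy defined above and a relative improvement can be computed as in the following sketch. The function names and toy label sequences are illustrative, and reading the reported gains as `(improved - baseline) / baseline` is an assumption about how the relative margins are defined.

```python
def frame_accuracy(predicted, ground_truth):
    """Number of correctly predicted frames divided by the total number of frames."""
    if len(predicted) != len(ground_truth):
        raise ValueError("prediction and ground truth must have the same length")
    correct = sum(p == g for p, g in zip(predicted, ground_truth))
    return correct / len(ground_truth)


def relative_improvement(baseline, improved):
    """Relative gain of `improved` over `baseline` (assumed definition)."""
    return (improved - baseline) / baseline


# Toy frame-level labels for one short video ("bg" = background).
ground_truth = ["stir", "stir", "add salt", "bg"]
predicted = ["stir", "add salt", "add salt", "bg"]

accuracy = frame_accuracy(predicted, ground_truth)
gain = relative_improvement(baseline=0.5, improved=0.6)
```

In this toy example three of four frames are correct, and an accuracy rising from 0.5 to 0.6 corresponds to a relative improvement of about 20%.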

Domain Analysis:

Following the COIN dataset, we provide a more in-depth analysis with experimental results across different domains in Table 2 (the domains are described and shown in the Appendix). An obvious performance drop from COIN to GAIN-C occurs in the domains 'Electrical' and 'Science'. The reason is that steps in these domains often follow a fixed process, which introduces strong contextual dependency into models and results in poor performance on the OOD tasks. The causal inference approach alleviates these negative effects; for example, Causal LSTM outperforms the baseline by large relative margins of 9.4% and 10.4% in the domains 'Electrical' and 'Science', respectively. In domains like 'Housework', models obtain comparable results on COIN and GAIN-C, because the video collection of GAIN-C is independent of the collection for COIN. It is therefore possible that the videos we find are easier to segment than those in COIN. The "Overall" results show that the OOD test set is more challenging.

At the top, the original MS-TCN++ predicts 6 kinds of steps for the video "Wash cat", while Causal MS-TCN++ predicts the same causations of steps as the ground truth. This demonstrates that our model does not predict the spurious correlations caused by the context bias but focuses on the step itself. At the bottom, we show an example of the video "Scalded shrimp", where the causal model outperforms the baseline method with smoother predictions.

4.3. ACTION DETECTION

Setting: The goal of action detection is to detect a series of steps and output their temporal boundaries. It is also an important yet challenging task for IAU. We follow the evaluation protocol of (Lea et al., 2017; Singh et al., 2016) by reporting the widely used segment-wise metric, mean Average Precision with the midpoint hit criterion (mAP@mid). Specifically, under mAP@mid an output interval is a true positive if its temporal midpoint lies within the corresponding ground-truth action segment.

Results: Table 4 presents the experimental results on COIN, Breakfast, and their counterparts. In this task, we choose LSTM as the baseline. From the training set to GAIN, we observe a huge performance drop of more than 80%, which is related to the weak OOD generalizability of the baseline method. We also compare the causal methods with the baselines. Without any performance cost on the original evaluation set, the causal methods outperform the baselines by relative margins of over +33%/+55% on GAIN-C/GAIN-B.
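The midpoint hit criterion can be illustrated with a short sketch. The code below computes a per-class average precision in which a detection is a true positive when its temporal midpoint falls inside a ground-truth segment of the same class, and then averages over classes. The tuple layouts, the non-interpolated form of average precision, and the rule that each ground-truth segment can be hit at most once are assumptions of this sketch; the exact protocol of the cited works may differ. The ground-truth list is assumed to be non-empty.

```python
from typing import Optional, Sequence, Tuple

Detection = Tuple[str, float, float, float]  # (label, start, end, confidence)
GroundTruth = Tuple[str, float, float]       # (label, start, end)


def average_precision_mid(
    dets: Sequence[Detection], gts: Sequence[GroundTruth], label: str
) -> float:
    """Non-interpolated AP for one class under the midpoint hit criterion."""
    class_gts = [(s, e) for name, s, e in gts if name == label]
    if not class_gts:
        return 0.0
    # Rank detections of this class by decreasing confidence.
    class_dets = sorted((d for d in dets if d[0] == label), key=lambda d: d[3], reverse=True)
    used = [False] * len(class_gts)
    true_positives = 0
    precision_sum = 0.0
    for rank, (_, start, end, _) in enumerate(class_dets, start=1):
        mid = 0.5 * (start + end)
        # A hit requires the midpoint to lie inside an unused ground-truth segment.
        hit: Optional[int] = next(
            (i for i, (gs, ge) in enumerate(class_gts) if not used[i] and gs <= mid <= ge),
            None,
        )
        if hit is not None:
            used[hit] = True
            true_positives += 1
            precision_sum += true_positives / rank
    return precision_sum / len(class_gts)


def map_mid(dets: Sequence[Detection], gts: Sequence[GroundTruth]) -> float:
    """Mean of the per-class APs over all classes present in the ground truth."""
    labels = sorted({name for name, _, _ in gts})
    return sum(average_precision_mid(dets, gts, name) for name in labels) / len(labels)
```

Because only the midpoint is checked, this criterion tolerates imprecise boundaries, which suits instructional steps whose start and end are often ambiguous.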

5. CONCLUSION

In this paper, we have introduced a dataset, named GAIN, to benchmark the generalizability of IAU models. Our GAIN dataset contains 1,231 videos of 116 OOD tasks, which are collected by reassembling the in-distribution steps of the training set. Based on GAIN, we have proposed to evaluate the generalizability of models by their performance on the OOD tasks. We have also proposed a causal inference approach that cuts off the excessive contextual dependency to enhance generalizability. We evaluate the generalizability of some widely used methods on GAIN and demonstrate that causal inference is a potential direction for improving generalization. We will release this dataset to promote the real-world deployment of IAU models.

Limitation: By benchmarking the generalizability on GAIN, we offer a testbed and expect future work to develop models that can work in dynamic environments. However, the OOD data needs to be found, which is labor-intensive. Thus, the GAIN dataset cannot be exhaustive or cover every aspect. Besides, while causal inference shows its potential to improve generalizability, designing algorithms for generalization is still an open question. It is worth devoting effort to, and we leave it as future work.

the instructional video is not the scene but the distribution shift of action steps in changing tasks. The detailed comparisons are summarized in Table 1.

Causal Inference: Causal inference (Pearl, 2009; 2019), which investigates the causal effects of different variables, plays an important role in machine learning. Recently, causal inference has been successfully applied to diverse fields including computer vision (Lopez-Paz et al., 2017; Wang et al., 2020), natural language processing (Park et al., 2019), and reinforcement learning (Nair et al., 2019; Forney et al., 2017), due to its ability to remove confounding bias (Tang et al., 2020; Wang et al., 2020), build explainable machines (Wang & Vasconcelos, 2020; Goyal et al., 2019), promote fairness (Kusner et al., 2017; Chiappa, 2019), and recover missing data (Mohan & Pearl, 2018). In this paper, we apply causal inference to mitigate the negative effect of confounding context bias and thereby enhance the generalizability of IAU models.

B PREREQUISITE: CAUSAL MODEL

In this section, we provide some prerequisites on causal models that may help to better understand our causal approach. More details can be found in (Glymour et al., 2016). Our task for video understanding is to predict the step label based on the observation, i.e., P(Y|S). However, the context steps Z also affect the prediction. By the law of total probability, we can rewrite P(Y|S) as:

$$P(Y \mid S) = \sum_{z} P(Y \mid S, Z = z)\, P(Z = z \mid S),$$

which shows that the likelihood P(Y|S) is influenced by P(Z = z|S). However, P(Z = z|S) changes in the OOD setting.

We now use an example to show that P(Z = z|S) introduces the observation bias. In the video "Inflate bicycle tires", the current content S is "installing the nozzle" and the context Z is "using bicycle pump". The content S and the context Z are always observed together in the training process, and thus P(Z = using bicycle pump|S = installing the nozzle) is high. This leads the model to predict a higher probability P(Y|S = installing the nozzle) when observing Z = using bicycle pump, and vice versa. However, when we apply the model to the OOD video "Inflate car tires", where Z = "using bicycle pump" is absent, the model may be confused and consequently give a wrong prediction. Therefore, we aim at mitigating the influence of Z on S. Before that, we first introduce some definitions from causal inference; the proofs can be found in (Pearl, 2009; Glymour et al., 2016).

Definition 1 (Intervention) An intervention represents an external force that fixes a variable to a constant value (akin to random assignment of an experiment), and is denoted do(S = s), meaning that S is experimentally fixed to the value s. The do-operator erases all the arrows that come into S to prevent any information about S from flowing in the non-causal direction.



Figure 2: The pipeline to construct GAIN, which includes Task Selection (left) and Data Collection (right). Given an instructional video training set, we first separate the steps of these tasks and generate a large number of task candidates. Secondly, we filter out the unqualified ones according to three principles. Then, we search for YouTube videos related to the selected tasks and download the videos that have high relevance to the queries, explicit instructions, and rich diversity.

Figure 3: The step distributions of the training dataset, the original in-distribution test dataset, and our OOD test dataset on COIN. Under the same task categories, the step distributions of the original training and test datasets are similar. With the different task categories in GAIN-C, the step distribution changes considerably, which supports our assumption that steps are in-distribution/OOD under the same/different task categories.

Figure 4: (a) The causal inference illustration for instructional action understanding. (Top) presents the original causal graph of IAU and the likelihood P(Y|S). (Bottom) shows the causal graph and the causation P(Y|do(S)) after intervention. (b) Approximation with the Monte Carlo method. We first disassemble the videos, where s_1 only occurs with z_1 and z_2, and sample from the step pool. The prior P(Z) is approximated with the relative frequency, and the sampled steps constitute the "intervened" videos, where s_1 can be observed with z_3 or z_4.

Figure 5: Visualization examples of action segmentation results on GAIN-C.

Qualitative Analysis: We qualitatively analyze how our method contributes to the performance improvement. Fig. 5 visualizes two prediction examples on GAIN-C with MS-TCN++ and the corresponding causal method, where different colors denote different step categories. Causal MS-TCN++ achieves higher frame accuracy on both examples. At the top, the original MS-TCN++ predicts 6 kinds of steps for the video "Wash cat", while Causal MS-TCN++ predicts the same kinds of steps as the ground truth. This demonstrates that our model does not predict the spurious correlations caused by the context bias but focuses on the step itself. At the bottom, we show an example of the video "Scalded shrimp", where the causal model outperforms the baseline method with smoother predictions.

Figure 9: The duration statistics at the video level (left) and the step level (right) of GAIN.

Figure 10: The distribution of YouTube view counts across the tasks in GAIN-C.



Table 2: Evaluation on COIN/GAIN-C with baselines and finer results across domains. '• / •' denotes the performance on 'COIN / GAIN-C'. 'C' means that the causal-based method is applied.

Table 3: Results on Breakfast/GAIN-B for action segmentation.

Table 4: Evaluation on training sets and GAIN for action detection. '• / •' denotes the performance on 'Training set / GAIN'.

Tasks and the corresponding steps in GAIN-C.

ReverseParking: drive the car forward, drive the car backward, adjust front and back position
HaveAPicnic: clean up the ground, lay the cushion evenly, load the dish
RoastChicken: prepare seasonings and side dishes, remove the intestines and blood vessels, brush sauce or sprinkle seasoning, bake pizza
HighJump: begin to run up, begin to jump up, fall to the ground
RoastedChickenWings: soak them in water, add raw materials, mix the raw materials, bake pizza
InstallPaintingOnWall: drill in the wall, knock in the nails, paste and level the wallpaper
RoastSweetPotato: clean the pumpkin, fry or roast or braise, cut the bread, peel
InstallPointers: let the flat side of the new needle towards the jack and insert the new needle, screw on the screw
ScaldedShrimp: prepare and boil water, pour the noodles into the water and stir, remove the shrimp shell, pour the cooked noodles
IntravenousInjection: tie the tourniquet, disinfect, pull out the needle and press with cotton
ShotPut: pre-swing, throw the hammer out
MakeBananaMilk: peel, cut into strips and pieces, add milk, shake and juice, pour the orange juice into the cup, put strawberries and other fruits into the juicer
UnloadSpareTire: unscrew the screw, remove the tire
MakeBaozi: add ingredients into cone, knead together
UseBalance: put the sample to be measured on the balance, put the weight until the balance is balanced
MakeBread: knead the dough, run the toaster and adjust, take out the slice of bread
UseDryer: put the clothes neatly on a ironing table, use a hair dryer to blow hot wall, flip the clothes repeatly
MakeCake: pour the egg into the bowl, add raw materials, mix raw materials, put materials into mold, run the toaster and adjust, take out chocolate
UseNasalSpray: wipe nose, fill a nostril with saline and do the same to the other nostril, shake and stir, remove cap
MakeCoconutJuice: dig out the seeds with spoon, put the ingredients into the can, pour the orange juice into the cup, put strawberries and other fruits into the juicer, put yogurt, honey and other ingredients into the juicer, shake and juice
UseTeakettle: pour the tea into the vessel, heat the teapot and wash the cup, close up the cover
MakeGarlicBread: mix the raw materials, cut the bread, put the filler on the bread slice, put a slice of bread in, run the toaster and adjust, take out the slice of bread
UseWashingMachine: open the fuel tank cap, add some cleaner to clean and wet the lenses and take out the lenses, close up the cover
MakeHoneyLemon: cut ingredients, add some water to the tea, put the ingredients into the can, mix and pickle
Vaccinate: fill the injection head, disinfect the injecting place, inject to the muscular, pull out the needle and press
MakeInstantCoffee: add tea powder, brew tea and stir, add some ingredients in the coffee, mix the raw materials
WashCar: add detergent and make bubble, clean the surface, wipe off the cleaning agent
MakeJelly: pour the ingredients into the bowl, stir the mixture, take out chocolate, cut into strips and pieces, put materials into mold
WashCat: use the body wash, wash the body wash away

Tasks and the corresponding steps in GAIN-C.

and pieces, mix and pickle, clean up and soak, put the ingredients into the can, mix raw materials, add some water to the tea, shake and juice
WashClothes: add detergent and make bubble, clean toys and hamster cages, wrap the hair by hands
MakeMashedPotatoes: peel, cut potato into strips, soak them in water, add raw materials, mix raw materials, prepare and boil water
WashFace: wet and wash hands, apply cleansing milk to the face, wipe up the face
MakeMilkShake: add milk, shake and juice, pour the orange juice into the cup
WashThePot: add detergent and make bubble, flush and wash the interior, scrub the toilet interior
MakeMilkTea: put yogurt, honey and other ingredients into the juicer, mix the raw materials, pour in after mix it, add ice cubes
WaterBoilMeat: prepare and boil water, soak them in water, load the dish

ACKNOWLEDGEMENT

This work was supported in part by the National Key Research and Development Program of China under Grant 2017YFA0700802, in part by the National Natural Science Foundation of China under Grant 62125603, and in part by a grant from the Beijing Academy of Artificial Intelligence (BAAI).

Appendix A RELATED WORK

Instructional Action Understanding: With the explosion of video data on the Internet, learners can acquire knowledge from instructional videos to accomplish different tasks. Many instructional video datasets have been proposed for different goals, such as action detection datasets (Caba Heilbron et al., 2015; Idrees et al., 2017; Gu et al., 2018), video summarization datasets (De Avila et al., 2011; Gygli et al., 2014; Panda et al., 2017; Song et al., 2015), and video captioning datasets (Xu et al., 2016; Yu et al., 2018; Miech et al., 2019; Krishna et al., 2017). To analyze instructional videos, diverse research topics have been studied in recent years, including action segmentation (Richard et al., 2018a; b; Miech et al., 2020; 2019; Sun et al., 2019), procedure segmentation (Zhou et al., 2018b), step localization (Alayrac et al., 2018; Zhukov et al., 2019), action anticipation (Sener & Yao, 2019; Farha et al., 2018), dense video captioning (Das et al., 2013a), video grounding (Huang et al., 2017; 2018; Zhou et al., 2018a), and skill determination (Doughty et al., 2018; 2019). Despite the great progress in the in-distribution environment, deploying the trained models in real-world environments remains a major challenge.

Out-of-Distribution Generalization: How to generalize a trained model to OOD environments is a key challenge in machine learning (Geirhos et al., 2020). One widely used setting is zero-shot recognition (ZSR) (Xu et al., 2017; Brattoli et al., 2020; Wang et al., 2019) (cross-dataset evaluation), where the categories of samples in the testing set differ from those in the training set. For instructional videos, it is difficult to directly recognize unseen step categories. Another setting is Unsupervised Domain Adaptation (UDA) (Busto et al., 2018; Zhang et al., 2019), which trains the model with the data and annotations in the source domain together with target domain information (e.g., unannotated target data). Compared with UDA, OOD generalization further considers the setting without target domain information. VDG (Yao et al., 2021) is an OOD generalization problem that evaluates models on videos with a changed scene or background. However, this setting is more suitable for conventional actions. In contrast, the key domain change in

In other words, with this criterion we condition on Z such that we (1) block all spurious paths between S and Y; (2) do not disturb any directed paths from S to Y; and (3) create no new spurious paths.

With these definitions, we can show how the backdoor adjustment helps with our OOD task. When we intervene on S under the Back-Door Criterion, i.e., do(S), the link between S and Z is cut off, and so is the dependency. We formulate the model prediction process under the intervention:

$$P(Y \mid do(S)) = \sum_{z} P(Y \mid S, Z = z)\, P(Z = z),$$

where Z = z is independent of S. The only difference between Eq 6 and Eq 9 is that we change P(Z|S) to P(Z), which shows that Z is no longer affected by S. After the intervention, when the model predicts the label Y from do(S), it fairly takes every z into consideration. Thus, the backdoor adjustment can mitigate the negative effect of the biased confounder Z.
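A small numeric sketch may help to see what replacing P(Z|S) with P(Z) does. The code below evaluates the observational likelihood and the backdoor-adjusted prediction for the "installing the nozzle" example with two possible contexts. All probability values and the context name "using car pump" are hypothetical numbers and labels chosen for illustration; they are not taken from the paper or from any dataset.

```python
from typing import Dict


def marginalize(p_y_given_sz: Dict[str, float], weights: Dict[str, float]) -> float:
    """Compute sum_z P(Y | S, Z = z) * w(z) for a fixed step S and label Y."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9, "weights must form a distribution"
    return sum(p_y_given_sz[z] * weights[z] for z in p_y_given_sz)


# Hypothetical P(Y = "installing the nozzle" | S, Z = z): the model is far more
# confident when the familiar training context is present.
p_y_given_sz = {"using bicycle pump": 0.9, "using car pump": 0.4}

# Hypothetical P(Z = z | S): in training, S almost always co-occurs with one context.
p_z_given_s = {"using bicycle pump": 0.95, "using car pump": 0.05}

# Hypothetical prior P(Z = z), independent of S, as used after the intervention.
p_z = {"using bicycle pump": 0.5, "using car pump": 0.5}

observational = marginalize(p_y_given_sz, p_z_given_s)  # P(Y | S)
interventional = marginalize(p_y_given_sz, p_z)         # P(Y | do(S))

print(f"P(Y | S)     = {observational:.3f}")
print(f"P(Y | do(S)) = {interventional:.3f}")
```

Under these assumed numbers, the observational estimate is dominated by the single frequent context, whereas the adjusted estimate weights every context by its prior, which is the sense in which the intervention "fairly takes every z into consideration".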

C MORE DETAILS OF GAIN

C.1 A DETAILED EXAMPLE

To show how an OOD task in GAIN is constructed, Fig. 6 displays a detailed example, "Hang up clock", which can be reassembled from the steps of the training tasks "Hang up curtain" and "Install wood flooring". Specifically, the "Hang up curtain" task consists of three steps, {drill, install shelves, hang up}, and the "Install wood flooring" task is composed of {cut raw boards, fit on boards, knock in nails}. Our collected OOD task "Hang up clock" contains the "drill" and "hang up" steps of the "Hang up curtain" task and the "knock in nails" step of the "Install wood flooring" task.

C.2 TASKS & STEPS

In order to present more details of our GAIN dataset, we show all the selected tasks with their corresponding steps in Tables 7, 8, 9, and 10. We display these tables at the end of the Appendix because of the typesetting. The GAIN dataset consists of 1,231 instructional videos related to 116 unseen tasks. Each task in GAIN contains 2∼24 videos with an average of 10 videos. We annotate 6,382 action segments in GAIN with an average of 5 steps in each video.

C.3 SAMPLE DISTRIBUTIONS

Fig. 7 and Fig. 8 illustrate the sample distributions of GAIN-C and GAIN-B. Statistically, GAIN-C contains 1,000 videos of 100 unseen tasks with a total length of 41.6 hours, where 5,238 segments are annotated. The GAIN-B split includes 231 videos of 16 OOD tasks with an average length of 2 minutes and 30 seconds. These tasks consist of 20 fine-grained action categories.

C.4 DURATION STATISTICS

Fig. 9 illustrates the duration statistics at both the video level and the step level of our GAIN dataset, where the average length of videos is 2 minutes and 30 seconds, and the average length of steps is 12 seconds. In total, the GAIN dataset contains 51.2 hours of OOD videos for generalizability evaluation.
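The reported statistics can be cross-checked with simple arithmetic. The sketch below recomputes the per-task and per-video averages from the totals quoted above. The derived GAIN-B figures (segments and hours) are not stated in the text; they are obtained here under the assumption that the GAIN totals are exactly the sum of the GAIN-C and GAIN-B splits.

```python
# Totals quoted in the text.
total_videos, total_tasks = 1231, 116
gain_c_videos, gain_b_videos = 1000, 231
total_segments, gain_c_segments = 6382, 5238
total_hours, gain_c_hours = 51.2, 41.6

# Averages implied by the totals.
videos_per_task = total_videos / total_tasks            # reported as "an average of 10"
segments_per_video = total_segments / total_videos      # reported as "an average of 5"
avg_video_seconds = total_hours * 3600 / total_videos   # reported as 2 min 30 s

# Derived values, assuming GAIN = GAIN-C + GAIN-B (not stated explicitly).
gain_b_segments = total_segments - gain_c_segments
gain_b_hours = total_hours - gain_c_hours
gain_b_avg_seconds = gain_b_hours * 3600 / gain_b_videos

print(f"videos per task:     {videos_per_task:.1f}")
print(f"segments per video:  {segments_per_video:.1f}")
print(f"average video (s):   {avg_video_seconds:.0f}")
print(f"GAIN-B segments:     {gain_b_segments}")
print(f"GAIN-B average (s):  {gain_b_avg_seconds:.0f}")
```

Under that assumption, both the overall and the GAIN-B average video lengths come out close to 150 seconds, which agrees with the 2 minutes and 30 seconds reported for each.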

C.5 VIEWS ANALYSIS

In Fig. 10, we display the number of views on YouTube across the 100 tasks in GAIN-C, which demonstrates that, with the basic principles mentioned in Section 3.2, the unseen tasks statistically meet the needs of website viewers. We obtain the number of views from YouTube by using the Python module youtubesearchpython. We form a query with "how to" preceding the task name (e.g., how to paint the wall) to search for YouTube instructional videos related to the tasks. Then, for each task, we only extract the first 20 results and sum up their numbers of views to represent the popularity of this task.

With approximately 767.9M views, "Make popsicle" is the most-viewed task, and the last one, "Dissolve effervescent tablet", still obtains 391.8K views. The number of views per task is 44.5M on average, and the average number of views for the counted videos is 2.2M. The statistical results above show that the selected tasks in our GAIN dataset are all common daily tasks and satisfy the

We conduct experimental analysis on COIN/GAIN-C to investigate the effect of the learning rate hyperparameter with MS-TCN++ (Li et al., 2020). As shown in Table 5, with a learning rate of 5e-4 the model yields unfavorable results on both datasets, while an increased learning rate, i.e., 1e-3 or 2e-3, notably improves its performance on COIN (+2.3% and +2.5%) as well as on the out-of-distribution tasks (+7.5% and +7.1%).

We also conduct experiments with different relative sizes of reassembled videos on Breakfast/GAIN-B, and the results (frame accuracy) are shown in Table 6. For the number of steps, 1.0×, 0.5×, and 1.5× of Causal ED-TCN denote that we use 1, 0.5, and 1.5 times the number of steps of the original ED-TCN. The same applies to the number of videos. Our methods consistently outperform the baseline under the different settings in both the in-distribution and OOD scenarios. Besides, we find that the model achieves better overall performance with the same number of augmented steps and videos, so we adopt this setting for all experiments.
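Returning to the view-count aggregation of Section C.5, the popularity score of a task (the sum of views over the first 20 search results) can be sketched as follows. The search step itself is omitted because the text does not specify how youtubesearchpython was called; the sketch assumes that the view counts of the results are already available as strings such as "1,234,567 views", which is an assumed format, and both function names are hypothetical.

```python
import re
from typing import Iterable


def parse_views(text: str) -> int:
    """Parse a string such as '1,234,567 views' into an integer.

    Strings that do not match this assumed format (e.g. 'No views' or
    abbreviated counts like '1.2M views') are counted as 0.
    """
    match = re.match(r"\s*(\d[\d,]*)\s+views?\b", text)
    return int(match.group(1).replace(",", "")) if match else 0


def task_popularity(view_texts: Iterable[str], top_k: int = 20) -> int:
    """Sum the view counts of the first `top_k` search results for one task."""
    return sum(parse_views(text) for text in list(view_texts)[:top_k])
```

For a task such as "paint the wall", the query would be "how to paint the wall", and `task_popularity` would be applied to the view strings of its ranked results. Dividing the per-task average of 44.5M by the 20 counted videos gives roughly 2.2M views per video, consistent with the figure reported above.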

D.2 VISUALIZATION RESULTS

In Section 4.2, we have visualized the ground-truth annotations and the action segmentation results. Due to the limitation of space, we only visualized the results produced by MS-TCN++ and its causal version. In Fig. 12, we now illustrate the visualization for different methods, including LSTM, ED-TCN, TResNet, and MS-TCN++, to demonstrate the effectiveness of our approach. We show the results of the baseline methods and our corresponding causal ones on the task "Scalded shrimp" in GAIN-C. The consistent improvements for different baselines on out-of-distribution tasks indicate that our causal intervention promotes the generalizability of models.

Additionally, we show a failure case of our approach in Fig. 13. We analyze the cases in which the causal-based method performs worse than the baseline, such as the video "Make Mixed Juice" in the GAIN-C dataset. Taking a closer look, we find that there are some strong step dependencies, such as between "juice the fruit" and "pour the juice". In this situation, the context bias has a positive effect on the evaluation. We thus obtain the following insight: despite its better performance on average, our method encourages step independence, which has a negative effect on examples where strong step dependencies occur in both the training and testing data. This is reasonable, since these steps are in-distribution samples where the context bias has a positive effect. For example, "juice the fruit" and "pour the juice" are successive steps in both COIN and GAIN-C, so a prediction that exploits their co-occurrence is better.

Published as a conference paper at ICLR 2023

