GROUNDING PHYSICAL CONCEPTS OF OBJECTS AND EVENTS THROUGH DYNAMIC VISUAL REASONING

Abstract

We study the problem of dynamic visual reasoning on raw videos. This is a challenging problem; currently, state-of-the-art models often require dense supervision on physical object properties and events from simulation, which is impractical to obtain in real life. In this paper, we present the Dynamic Concept Learner (DCL), a unified framework that grounds physical objects and events from dynamic scenes and language. DCL first adopts a trajectory extractor to track each object over time and to represent it as a latent, object-centric feature vector. Building upon this object-centric representation, DCL learns to approximate the dynamic interaction among objects using graph networks. DCL further incorporates a semantic parser to parse questions into semantic programs and, finally, a program executor to run the programs to answer questions, leveraging the learned dynamics model. After training, DCL can detect and associate objects across frames, ground visual properties and physical events, understand the causal relationships between events, make future and counterfactual predictions, and leverage these extracted representations for answering queries. DCL achieves state-of-the-art performance on CLEVRER, a challenging causal video reasoning dataset, even without using ground-truth attributes and collision labels from simulations for training. We further test DCL on a newly proposed video-retrieval and event-localization dataset derived from CLEVRER, showing its strong generalization capacity.

1. INTRODUCTION

Visual reasoning in dynamic scenes involves both the understanding of compositional properties, relationships, and events of objects, and the inference and prediction of their temporal and causal structures. As depicted in Fig. 1, to answer the question "What will happen next?" based on the observed video frames, one needs to detect the object trajectories, predict their dynamics, analyze the temporal structures, and ground visual objects and events to get the answer "The blue sphere and the yellow object collide". Recently, various end-to-end neural network-based approaches have been proposed for joint understanding of video and language (Lei et al., 2018; Fan et al., 2019). While these methods have shown great success in learning to recognize visually complex concepts, such as human activities (Xu et al., 2017; Ye et al., 2017), they typically fail on benchmarks that require an understanding of the compositional and causal structures in the videos and text (Yi et al., 2020). Another line of research has focused on building modular neural networks that can represent the compositional structures in scenes and questions, such as object-centric scene structures and multi-hop reasoning (Andreas et al., 2016; Johnson et al., 2017b; Hudson & Manning, 2019). However, these methods are designed for static images and do not handle the temporal and causal structure in dynamic scenes well, leading to inferior performance on the video causal reasoning benchmark CLEVRER (Yi et al., 2020). To model the temporal and causal structures in dynamic scenes, Yi et al. (2020) proposed an oracle model that combines symbolic representations with video dynamics modeling and achieved state-of-the-art performance on CLEVRER. However, this model requires videos with dense annotations of visual attributes and physical events, which are impractical or extremely labor-intensive to obtain for real scenes.

Figure 1: Observed frames, predicted dynamics, and object/event grounding. Question: "What will happen next?" Answer: "The blue sphere and the yellow object collide."

We argue that such dense explicit video annotations are unnecessary for video reasoning, since the relevant concepts are naturally encoded in the question-answer pairs associated with the videos. For example, the question-answer pair and the video in Fig. 1 can implicitly inform a model what the concepts "sphere", "blue", "yellow", and "collide" really mean. However, a video may contain multiple fast-moving occluded objects and complex object interactions, and the questions and answers have diverse forms. It remains an open and challenging problem to simultaneously represent objects over time, train an accurate dynamics model from raw videos, and align objects with visual properties and events for accurate temporal and causal reasoning, using vision and language as the only supervision.

Our main idea is to factorize video perception and reasoning into several modules: object tracking, object and event concept grounding, and dynamics prediction. We first detect objects in the video and associate them into object tracks across frames. We can then ground various object and event concepts from language, train a dynamics model on top of the object tracks for future and counterfactual predictions, analyze relationships between events, and answer queries based on these extracted representations. All these modules can be trained jointly by watching videos and reading paired questions and answers. To achieve this goal, we introduce the Dynamic Concept Learner (DCL), a unified neural-symbolic framework for recognizing objects and events in videos and analyzing their temporal and causal structures, without explicit annotations of visual attributes or physical events such as collisions during training. To facilitate model training, we propose a multi-step training paradigm.
We first run an object detector on individual frames and associate objects across frames based on motion-based correspondence. Next, our model learns concepts of object properties, relationships, and events by reading paired questions and answers that describe or explain the events in the video. Then, we leverage the visual concepts acquired in the previous steps to refine the object association across frames. Finally, we train a dynamics prediction network (Li et al., 2019b) on the refined object trajectories and optimize it jointly with the other modules in this unified framework. Such a training paradigm ensures that all neural modules share the same latent space for representing concepts and can bootstrap each other's learning.

We evaluate DCL's performance on CLEVRER, a video reasoning benchmark that includes descriptive, explanatory, predictive, and counterfactual reasoning with a uniform language interface. DCL achieves state-of-the-art performance on all question categories and requires no scene supervision such as object properties or collision events. To further examine the grounding accuracy and transferability of the acquired concepts, we introduce two new benchmarks for video-text retrieval and spatio-temporal grounding and localization on the CLEVRER videos, namely CLEVRER-Retrieval and CLEVRER-Grounding. Without any further training, our model generalizes well to these benchmarks, surpassing the baselines by a noticeable margin.

Figure 2: Overview of DCL. An object trajectory detector detects the trajectories of all objects; the extracted objects are sent to a dynamics predictor to predict their dynamics and to a feature extractor to obtain latent representations for objects and events; finally, the parsed programs and latent representations are fed to the symbolic executor to answer the question and optimize concept learning.

2. RELATED WORK

Our work is also related to approaches that learn visual concepts through visual question answering (Mao et al., 2019). However, that line of work mainly focused on learning static concepts in images, while our DCL aims at learning dynamic concepts like moving and collision in videos and at making use of these concepts for temporal and causal reasoning. Later, visual reasoning was extended to more complex dynamic videos (

3. DYNAMIC CONCEPT LEARNER

In this section, we introduce a new video reasoning model, Dynamic Concept Learner (DCL), which learns to recognize video attributes, events, and dynamics and to analyze their temporal and causal structures, all through watching videos and answering corresponding questions. DCL contains five modules: 1) an object trajectory detector, 2) a video feature extractor, 3) a dynamics predictor, 4) a language program parser, and 5) a neural symbolic executor. As shown in Fig. 2, given an input video, the trajectory detector detects objects in each frame and associates them into trajectories; the feature extractor then represents them as latent feature vectors. After that, DCL quantizes the objects' static concepts (i.e., color, shape, and material) by matching the latent object features with the corresponding concept embeddings in the executor. As these static concepts are motion-independent, they can be adopted as an additional criterion to refine the object trajectories. Based on the refined trajectories, the dynamics predictor predicts the objects' movement and interactions in future and counterfactual scenes. The language parser parses the question and choices into functional programs, which are executed by the program executor on the latent representation space to get answers. The object and event concept embeddings and the object-centric representations share the same latent space; answering questions associated with videos can therefore directly optimize them through backpropagation, and the object trajectories and dynamics can in turn be refined by the static attributes DCL predicts. Our framework enjoys the advantages of both transparency and efficiency, since it enables step-by-step investigation of the whole reasoning process and requires no explicit annotations of visual attributes, events, or object masks.

3.1. MODEL DETAILS

Object Detection and Tracking. Given a video, the object trajectory detector detects object proposals in each frame and connects them into object trajectories $O = \{o_n\}_{n=1}^N$, where $o_n = \{b^n_t\}_{t=1}^T$ and $N$ is the number of objects in the video. $b^n_t = [x^n_t, y^n_t, w^n_t, h^n_t]$ is an object proposal at frame $t$ and $T$ is the number of frames, where $(x^n_t, y^n_t)$ denotes the normalized proposal center and $w^n_t$ and $h^n_t$ denote the normalized width and height, respectively. The object detector first uses a pre-trained region proposal network (Ren et al., 2015) to generate object proposals in all frames, which are then linked across consecutive frames to obtain all object trajectories. Let $\{b^i_t\}_{i=1}^N$ and $\{b^j_{t+1}\}_{j=1}^N$ be two sets of proposals in consecutive frames. Inspired by Gkioxari & Malik (2015), we define the connection score $s_l$ between $b^i_t$ and $b^j_{t+1}$ as

$s_l(b^i_t, b^j_{t+1}) = s_c(b^i_t) + s_c(b^j_{t+1}) + \lambda_1 \cdot \mathrm{IoU}(b^i_t, b^j_{t+1})$,   (1)

where $s_c(b^i_t)$ is the confidence score of the proposal $b^i_t$, IoU is the intersection over union, and $\lambda_1$ is a scalar. Gkioxari & Malik (2015) adopt a greedy algorithm to connect the proposals without global optimization. Instead, we assign boxes $\{b^j_{t+1}\}_{j=1}^N$ at frame $t+1$ to $\{b^i_t\}_{i=1}^N$ by a linear sum assignment.

Video Feature Extraction. Given an input video and its detected object trajectories, we extract three kinds of latent features for grounding object and event concepts: 1) the average visual feature $f^v \in \mathbb{R}^{N \times D_1}$ for static attribute prediction, 2) the temporal sequence feature $f^s \in \mathbb{R}^{N \times 4T}$ for dynamic attribute and unary event prediction, and 3) the interactive feature $f^c \in \mathbb{R}^{K \times N \times N \times D_2}$ for collision event prediction, where $D_1$ and $D_2$ denote the feature dimensions and $K$ is the number of sampled frames. We give more details on how to extract these features in Appendix B.

Grounding Object and Event Concepts.
Video reasoning requires a model to ground object and event concepts in videos. DCL achieves this by matching object and event representations with concept embeddings in the symbolic executor. Specifically, DCL calculates the confidence score that the $n$-th object is moving as $(\cos(s^{moving}, m^{da}(f^s_n)) - \delta)/\lambda$, where $f^s_n$ denotes the temporal sequence feature of the $n$-th object, $s^{moving}$ denotes the vector embedding of the concept moving, and $m^{da}$ denotes a linear transformation mapping $f^s_n$ into the dynamic concept representation space. $\delta$ and $\lambda$ are shifting and scaling scalars, and $\cos(\cdot)$ calculates the cosine similarity between two vectors. DCL grounds static attributes and the collision event similarly, matching average visual features and interactive features with their corresponding concept embeddings in the latent space. We give more details on concept and event quantization in Appendix E.

Trajectory Refinement. The connection score in Eq. 1 ensures the continuity of the detected object trajectories. However, it does not consider the objects' visual appearance; it may therefore fail to track the objects and connect inconsistent objects when different objects are close to each other and moving rapidly. To detect better object trajectories and to ensure the consistency of visual appearance along each track, we add a new term to Eq. 1 and re-define the connection score as

$s_l(\{b^i_m\}_{m=0}^t, b^j_{t+1}) = s_c(b^i_t) + s_c(b^j_{t+1}) + \lambda_1 \cdot \mathrm{IoU}(b^i_t, b^j_{t+1}) + \lambda_2 \cdot f_{appear}(\{b^i_m\}_{m=0}^t, b^j_{t+1})$,   (2)

where $f_{appear}(\{b^i_m\}_{m=0}^t, b^j_{t+1})$ measures the attribute similarity between the newly added proposal $b^j_{t+1}$ and all proposals $\{b^i_m\}_{m=0}^t$ in previous frames. We define $f_{appear}$ as

$f_{appear}(\{b^i_m\}_{m=0}^t, b^j_{t+1}) = \frac{1}{3 \times t} \sum_{attr} \sum_{m=0}^{t} f_{attr}(b^i_m, b^j_{t+1})$,

where $attr \in \{\text{color}, \text{material}, \text{shape}\}$, and $f_{attr}(b^i_m, b^j_{t+1})$ equals 1 when $b^i_m$ and $b^j_{t+1}$ have the same attribute and 0 otherwise. In Eq. 2, $f_{appear}$ ensures that the detected trajectories have consistent visual appearance and helps to distinguish the correct object when different objects are close to each other in the same frame. These static attributes, including color, material, and shape, are extracted without explicit annotation during training. Specifically, we quantize the attributes by choosing the concept whose embedding has the highest cosine similarity with the object feature. We iteratively connect proposals at frame $t+1$ to proposals at frame $t$ and obtain a set of object trajectories $O = \{o_n\}_{n=1}^N$, where $o_n = \{b^n_t\}_{t=1}^T$.

Dynamic Prediction. Given an input video and the refined object trajectories, we predict the locations and RGB patches of the objects in future or counterfactual scenes with a propagation network (Li et al., 2019b). We then generate the predicted scenes by pasting the RGB patches into the predicted locations. The generated scenes are fed to the feature extractor to extract the corresponding features. This design enables the question-answer pairs associated with the predicted scenes to optimize the concept embeddings and requires no explicit labels for collision prediction, leading to better optimization. This is different from Yi et al. (2020), which requires dense collision event labels to train a collision classifier. To predict the locations and RGB patches, the dynamics predictor maintains a directed graph $\langle V, D \rangle = \langle \{v_n\}_{n=1}^N, \{d_{n_1,n_2}\}_{n_1,n_2=1}^{N,N} \rangle$. The $n$-th vertex $v_n$ is represented by the concatenation of the tuples $\langle b^n_t, p^n_t \rangle$ over a small time window $w$, where $b^n_t = [x^n_t, y^n_t, w^n_t, h^n_t]$ is the $n$-th object's normalized coordinates and $p^n_t$ is a cropped RGB patch centered at $(x^n_t, y^n_t)$. The edge $d_{n_1,n_2}$ denotes the relation between the $n_1$-th and $n_2$-th objects and is represented by the normalized coordinate difference $b^{n_1}_t - b^{n_2}_t$.
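Returning to the tracking pipeline, the linking of Eq. 1 and the appearance-based refinement of Eq. 2 can be sketched as below. This is a simplified illustration, not the actual implementation: the appearance term here compares only the previous frame rather than the full track history, attribute quantization uses raw cosine similarity, and the function names and the use of SciPy's `linear_sum_assignment` are our own choices.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def quantize_attrs(feat, concept_embs):
    """Label each attribute with the concept whose embedding has the highest
    cosine similarity with the object feature (no attribute labels needed)."""
    labels = {}
    for attr, embs in concept_embs.items():   # e.g. {"color": {"red": vec, ...}}
        sims = {c: float(feat @ e) / (np.linalg.norm(feat) * np.linalg.norm(e) + 1e-8)
                for c, e in embs.items()}
        labels[attr] = max(sims, key=sims.get)
    return labels

def refined_link(boxes_t, conf_t, attrs_t, boxes_t1, conf_t1, attrs_t1,
                 lam1=1.0, lam2=1.0):
    """Eq. 2-style connection score: detector confidences + IoU + appearance
    consistency (fraction of matching color/material/shape attributes),
    solved globally as a linear sum assignment between consecutive frames."""
    S = np.zeros((len(boxes_t), len(boxes_t1)))
    for i in range(len(boxes_t)):
        for j in range(len(boxes_t1)):
            f_app = np.mean([attrs_t[i][a] == attrs_t1[j][a]
                             for a in ("color", "material", "shape")])
            S[i, j] = (conf_t[i] + conf_t1[j]
                       + lam1 * iou(boxes_t[i], boxes_t1[j]) + lam2 * f_app)
    rows, cols = linear_sum_assignment(-S)    # negate to maximize total score
    return {int(r): int(c) for r, c in zip(rows, cols)}
```

With `lam2 = 0` this reduces to the Eq. 1 score; the full method additionally averages the appearance term over all earlier frames of the track.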
The dynamics predictor performs multi-step message passing to simulate instantaneous propagation effects. During inference, the dynamics predictor predicts the locations and patches at frame $k+1$ using the features of the last $w$ observed frames in the original video. We obtain the predictions at frame $k+2$ by autoregressively feeding the predicted results at frame $k+1$ back as input to the predictor. To obtain the counterfactual scenes where the $n$-th object is removed, we remove the $n$-th vertex and its associated edges from the input and predict the counterfactual dynamics. Iterating this process, we obtain the predicted normalized coordinates $\{\hat{b}^n_k\}_{n=1,k=1}^{N,K}$ and RGB patches $\{\hat{p}^n_k\}_{n=1,k=1}^{N,K}$ at all $K$ predicted frames. We give more details on the dynamics predictor in Appendix C.

Language Program Parsing. The language program parser translates questions and choices into executable symbolic programs. Each executable program consists of a series of operations, such as selecting objects with certain properties, filtering events happening at a specific moment, and finding the causes of an event, eventually enabling transparent, step-by-step visual reasoning. Moreover, these operations are compositional and can be combined to represent questions of varying compositionality and complexity. We adopt a seq2seq model (Bahdanau et al., 2015) with an attention mechanism to translate word sequences into symbolic programs, treating questions and choices separately. We give the detailed implementation of the program parser in Appendix D.

Symbolic Execution. Given a parsed program, the symbolic executor runs it on the latent features extracted from the observed and predicted scenes to answer the question. The executor consists of a series of functional modules that realize the operators in the symbolic programs; the last operator's output is the answer to the question. Similar to Mao et al. (2019), we represent all object states, events, and operator results in a probabilistic manner during training. This makes the whole execution process differentiable w.r.t. the latent representations of the observed and predicted scenes, enabling the optimization of the feature extractor and the concept embeddings in the symbolic executor. We provide the implementation of all operators in Appendix E.
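To make the probabilistic execution concrete, here is a minimal sketch of three soft operators in the style of Mao et al. (2019). The operator names and the min/max semantics are illustrative assumptions; the actual operator set is specified in Appendix E.

```python
import numpy as np

def op_filter(obj_mask, concept_probs):
    """Soft 'filter' operator: keep objects that match a concept by taking
    an elementwise minimum of the current object mask and the per-object
    concept probabilities (both in [0, 1])."""
    return np.minimum(obj_mask, concept_probs)

def op_count(obj_mask):
    """Soft 'count' operator: the expected number of selected objects is the
    sum of per-object probabilities, so the result stays differentiable
    w.r.t. the object features and concept embeddings."""
    return float(np.sum(obj_mask))

def op_exist(obj_mask):
    """Soft 'exist' operator: probability that at least one object is
    selected, taken here as the max over the mask."""
    return float(np.max(obj_mask))
```

For example, starting from a mask of all ones and filtering by the "red" concept scores `[0.9, 0.2, 0.7]` gives a soft count of 1.8, which the MSE counting loss can then pull toward the ground-truth count.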

3.2. TRAINING AND INFERENCE

Training. We follow a multi-step training paradigm to optimize the model: 1) we first extract object trajectories with the scoring function in Eq. 1 and train the feature extractor and the concept embeddings in the symbolic executor with only descriptive and explanatory questions; 2) we quantize the static attributes of all objects with the feature extractor and concept embeddings learned in step 1) and refine the object trajectories with the scoring function in Eq. 2; 3) based on the refined trajectories, we train the dynamics predictor and predict dynamics for future and counterfactual scenes; 4) we train the full DCL with all question-answer pairs to obtain the final model. The program executor is fully differentiable w.r.t. the feature extractor and concept embeddings. We use a cross-entropy loss to supervise open-ended questions and a mean squared error loss to supervise counting questions. We provide the specific loss functions for each module in Appendix H.

Inference. During inference, given an input video and a question, we first detect the object trajectories and predict their motions and interactions in future and counterfactual scenes. We then extract object and event features for both the observed and predicted scenes with the feature extractor. We parse the questions and choices into executable symbolic programs. Finally, we execute the programs on the latent feature space to obtain the answer.
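The per-question objective described above (cross-entropy for open-ended questions, plus mean squared error for counting questions) can be sketched as follows; the function name and the way the two terms are combined into one value are our own simplifications, and the exact losses are given in Appendix H.

```python
import numpy as np

def qa_loss(answer_logits, target_idx, count_pred=None, count_target=None):
    """Sketch of the per-question training loss: cross-entropy over candidate
    answers for open-ended questions, plus an MSE term when the question is a
    counting question and a soft count prediction is available."""
    # softmax with max-subtraction for numerical stability
    p = np.exp(answer_logits - answer_logits.max())
    p /= p.sum()
    loss = -np.log(p[target_idx] + 1e-12)     # cross-entropy term
    if count_pred is not None:
        loss += (count_pred - count_target) ** 2  # MSE term for counting
    return float(loss)
```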

4. EXPERIMENTS

To show the proposed DCL's advantages, we conduct extensive experiments on the video reasoning benchmark CLEVRER. Other existing video datasets mainly ask questions about the complex visual context (

To provide a more extensive analysis, we introduce DCL-Oracle by adding object attribute and collision supervision to DCL's training. We summarize the models' requirements for visual labels and language programs in the second and third columns of Table 1. From the results in Table 1, we make the following observations. Although HCRN achieves state-of-the-art performance on human-centric action datasets (Jang et al., 2017; Xu et al., 2017; 2016), it performs only slightly better than Memory and much worse than NS-DR on CLEVRER. We believe the reason is that HCRN mainly focuses on motion modeling across frames, while CLEVRER requires models to perform dynamic visual reasoning on videos and to analyze their temporal and causal structures. NS-DR performs best among all baseline models, showing the power of combining symbolic representations with dynamics modeling. Our model achieves state-of-the-art question-answering performance on all question types even without visual attribute and event labels from simulation during training, showing its effectiveness and label efficiency. Compared with NS-DR, our model achieves more significant gains on predictive and counterfactual questions than on descriptive questions, demonstrating DCL's effectiveness in modeling temporal and causal structures. Unlike NS-DR, which directly predicts collision event labels with its dynamics model, DCL quantizes concepts and executes symbolic programs in an end-to-end manner, leading to better predictions for dynamic concepts. DCL-Oracle shows the upper-bound performance of the proposed model for grounding physical object and event concepts through question answering.

4.3. EVALUATION OF OBJECT AND EVENT CONCEPT GROUNDING IN VIDEOS

Previous methods like MAC (V) and TbD-net (V) do not learn explicit concepts during training, and NS-DR requires intrinsic attribute and event labels as input. In contrast, DCL can directly quantize video concepts, including static visual attributes (i.e., color, shape, and material), dynamic attributes (i.e., moving and stationary), and events (i.e., in, out, and collision). Specifically, DCL quantizes concepts by mapping the latent object features into the concept space through a linear transformation and computing their cosine similarities with the concept embeddings in the neural-symbolic executor. We predict the static attributes of each object by averaging the visual object features over the sampled frames. We regard an object as moving if it moves at any frame, and as stationary otherwise. We consider a collision to have happened between a pair of objects if they collide at any frame of the video. We obtain the ground-truth labels from the provided video annotations and report accuracy on the validation set in Table 2. We observe that DCL learns to recognize different kinds of concepts without explicit concept labels during training, showing its effectiveness in learning object and event concepts through natural question answering. We also find that DCL recognizes static attributes and events better than dynamic attributes. We further find that DCL may misclassify objects as "stationary" if they are missing from most frames and only move slowly in a few. We suspect the reason is that the dynamic attributes are learned only through question answering, and question-answer pairs for such slow-moving objects rarely appear in the training set.
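The any-frame aggregation rule used in this evaluation can be written compactly; the 0.5 threshold and the function name are assumptions for illustration.

```python
import numpy as np

def video_level_labels(frame_moving_probs, pair_collision_probs, thresh=0.5):
    """Aggregate per-frame groundings into video-level predictions: an
    object is 'moving' if its moving score exceeds the threshold at ANY
    frame (max over the frame axis), otherwise 'stationary'; a pair of
    objects collides if their collision score exceeds the threshold at
    any frame."""
    moving = np.max(frame_moving_probs, axis=-1) > thresh    # shape (N,)
    collide = np.max(pair_collision_probs, axis=-1) > thresh  # shape (N, N)
    return moving, collide
```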

4.4. GENERALIZATION

We further apply DCL to two new tasks: CLEVRER-Grounding, spatio-temporal localization of objects or events in a video, and CLEVRER-Retrieval, finding semantically related videos for query expressions and vice versa. We first build datasets for video grounding and video-text retrieval by synthesizing language expressions for videos in CLEVRER. We generate the expressions by filling visual contents from the video annotations into a set of pre-defined templates. For example, given the text template "The <static attribute> that is <dynamic attribute> <time identifier>", we can fill it and generate "The metal cube that is moving when the video ends.". Fig. 3 shows examples from the generated datasets (e.g., the grounding query "The collision that happens after the blue sphere exits the scene." and the retrieval query "A video that contains a gray metal cube that enters the scene."), and we provide more statistics and examples in Appendix G. We transform the grounding and retrieval expressions into executable programs by training new language parsers on the expressions of the synthetic training set. To provide more extensive comparisons, we adopt the representative video grounding/retrieval model WSSTG (Chen et al., 2019) as a baseline. We provide more details of the baseline implementation in Appendix A.

CLEVRER-Grounding. CLEVRER-Grounding contains object grounding and event grounding. For video object grounding, we localize each described object's whole trajectory and compute the mean intersection over union (IoU) with the ground-truth trajectory. For event grounding, including collision, in, and out, we temporally localize the frame at which the event happens and calculate the frame difference from the ground truth. For the collision event, we also spatially localize the collided objects' union box and compute its IoU with the ground truth.
We do not perform spatial localization for in and out events since the target object is usually too small to localize at the frame it enters or leaves the scene. Table 3 lists the results. From the table, we find that our proposed DCL transfers well to the new CLEVRER-Grounding task, achieving high accuracy for spatial localization and low frame differences for temporal localization. In contrast, the traditional video grounding method WSSTG performs much worse, since it mainly aligns simple visual concepts between text and images and has difficulty modeling temporal structures and understanding complex logic.

CLEVRER-Retrieval. For CLEVRER-Retrieval, an expression-video pair is considered positive if the video contains the objects and events described by the expression, and negative otherwise. Given a video, we define its matching similarity with the query expression as the maximal similarity between the query expression and all object or event proposals. Additionally, we introduce a recent state-of-the-art video-text retrieval model, HGR (Chen et al., 2020), for performance comparison; it decomposes video-text matching into global-to-local levels and performs cross-modal matching with attention-based graph reasoning. We densely compare every possible expression-video pair and use mean average precision (mAP) as the retrieval metric. We report the retrieval mAP in Table 4. Compared with CLEVRER-Grounding, CLEVRER-Retrieval is more challenging since it contains many more distracting objects, events, and expressions. WSSTG performs worse in the retrieval setting because it does not model temporal structures or the underlying logic. HGR achieves better performance than WSSTG since it models events, actions, and entities hierarchically.
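For concreteness, the video-query similarity (maximum over proposals) and the retrieval mAP metric described above can be sketched as follows. This is an illustrative reimplementation under our own naming, not the exact evaluation code.

```python
import numpy as np

def video_query_similarity(proposal_sims):
    """Matching similarity between a video and a query expression: the
    maximum similarity over all object/event proposals in the video."""
    return float(np.max(proposal_sims))

def retrieval_map(sim_matrix, positives):
    """Mean average precision for expression-to-video retrieval: for each
    query (row), rank videos by similarity and average the precision at
    each positive hit. `positives` is a boolean matrix of the same shape
    as `sim_matrix`."""
    aps = []
    for sims, pos in zip(sim_matrix, positives):
        order = np.argsort(-sims)          # descending similarity
        hits, precs = 0, []
        for rank, idx in enumerate(order, start=1):
            if pos[idx]:
                hits += 1
                precs.append(hits / rank)  # precision at this hit
        aps.append(np.mean(precs) if precs else 0.0)
    return float(np.mean(aps))
```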
However, HGR still performs worse than DCL since it does not explicitly model the temporal structures and the complex logic behind the video-text pairs in CLEVRER-Retrieval. DCL, on the other hand, is much more robust since it can explicitly ground object and event concepts, analyze their relations, and perform step-by-step visual reasoning.

Figure 4: Typical videos and question-answer pairs of the block tower dataset. Stroboscopic imaging is applied for motion visualization. Falling block tower: Q1: How many objects are falling? A1: 2. Q2: Are there any falling red objects? A2: No. Q3: Are there any falling blue objects? A3: Yes. Stable block tower: Q1: What is the color of the block that is at the bottom? A1: Blue. Q2: Are there any falling yellow objects? A2: No.

4.5. EXTENSION TO REAL VIDEOS AND THE NEW CONCEPT

We further conduct experiments on a real block tower video dataset (Lerer et al., 2016) to learn the new physical concept "falling". The block tower dataset has 493 videos, each containing a stable or falling block tower. Since the original dataset aims to study physical intuition and does not contain question-answer pairs, we manually synthesize question-answer pairs in a similar way to CLEVRER (Yi et al., 2020). We show examples of the new dataset in Fig. 4. We train models on 393 randomly selected videos and their associated question-answer pairs and evaluate on the remaining 100 videos. As in the CLEVRER setting, we use the average visual feature from a ResNet-34 for static attribute prediction and the temporal sequence feature for predicting the new dynamic concept "falling". Additionally, we train a visual reasoning baseline, MAC (V) (Hudson & Manning, 2018), for performance comparison. Table 5 lists the results. Our model achieves better question-answering performance on the block tower dataset, especially on counting questions like "How many objects are falling?". We believe the reason is that counting questions require a model to estimate the state of each object. MAC (V) simply adopts an MLP classifier to predict each answer's probability and does not model object states. In contrast, DCL answers counting questions by accumulating per-object probabilities and is more transparent and accurate. We also show the accuracy of color and "falling" concept prediction on the validation set in Table 6. DCL naturally learns to ground the new dynamic concept "falling" in real videos through question answering, demonstrating its effectiveness and strong generalization capacity.

5. DISCUSSION AND FUTURE WORK

We present a unified neural-symbolic framework, named Dynamic Concept Learner (DCL), for temporal and causal reasoning in videos. DCL, trained by watching videos and reading question-answer pairs, is able to track objects across frames, ground physical object and event concepts, understand causal relationships, make future and counterfactual predictions, and combine all these abilities to perform temporal and causal reasoning. DCL achieves state-of-the-art performance on the video reasoning benchmark CLEVRER. Based on the learned object and event concepts, DCL generalizes well to spatial-temporal object and event grounding and to video-text retrieval. We also extend DCL to real videos to learn new physical concepts. DCL suggests several future research directions. First, dynamics models with stronger long-term prediction capability require further exploration to handle some counterfactual questions. Second, it will be interesting to extend DCL to more general videos to build a stronger model for learning both physical concepts and human-centric action concepts.

A IMPLEMENTATION DETAILS

DCL Implementation Details. Since it is extremely computation-intensive to predict the object states and events at every frame, we evenly sample 32 frames for each video. All models are trained using Adam (Kingma & Ba, 2014) for 20 epochs with a learning rate of 10^-4. We adopt a two-stage training strategy for the dynamics predictor and set the time window size w, the propagation step L, and the dimension of the hidden states to 3, 2, and 512, respectively. Following the sample rate of the observed frames, we sample a frame for prediction every 4 frames. We first train the dynamics model with only location prediction and then train it with both location and RGB patch prediction. Empirically, we find this training strategy provides more stable predictions.
We train the language parser with the same training strategy as Yi et al. (2018) for a fair comparison.

Baseline Implementation. We implement the baselines HCRN (Le et al., 2020), HGR (Chen et al., 2020), and WSSTG (Chen et al., 2019) carefully based on the public source code. WSSTG first generates a set of object or event candidates and matches them with the query sentence. We choose the proposal candidate with the highest similarity as the grounding result. For object grounding, we use the same tube trajectory candidates as for DCL. For grounding the event concepts in and out, we treat each object at each sampled frame as a potential candidate. For grounding the event concept collision, we treat the union regions of all object pairs as candidates. For CLEVRER-Retrieval, we treat the highest proposal-candidate similarity as the similarity score between the video and the query sentence. We train WSSTG on a synthetic training set generated from the videos of the CLEVRER-VQA training set, using a fully supervised triplet loss to optimize the model.

B FEATURE EXTRACTION

We evenly sample K frames for each video and use a ResNet-34 (He et al., 2016) to extract visual features. For the n-th object in the video, we define its average visual feature as $f^v_n = \frac{1}{K}\sum_{k=1}^{K} f^n_k$, where $f^n_k$ is the concatenation of the regional feature and the global context feature at the k-th frame. We define its temporal sequence feature $f^s_n$ as the concatenation of $[x^n_t, y^n_t, w^n_t, h^n_t]$ over all T frames, where $(x^n_t, y^n_t)$ denotes the normalized object center coordinates and $w^n_t$ and $h^n_t$ denote the normalized width and height, respectively. For the collision feature between the $n_1$-th and $n_2$-th objects at the k-th frame, we define $f^c_{n_1,n_2,k} = f^u_{n_1,n_2,k} \| f^{loc}_{n_1,n_2,k}$, where $f^u_{n_1,n_2,k}$ is the ResNet feature of the union region of the two objects at the k-th frame and $f^{loc}_{n_1,n_2,k}$ is a spatial embedding of the correlations between their bounding-box trajectories. We define $f^{loc}_{n_1,n_2,k} = \mathrm{IoU}(s_{n_1}, s_{n_2}) \| (s_{n_1} - s_{n_2}) \| (s_{n_1} \times s_{n_2})$, the concatenation of the intersection over union (IoU), difference ($-$), and element-wise product ($\times$) of the normalized trajectory coordinates of the two objects centered at the k-th frame. We pad $f^u_{n_1,n_2,k}$ with a zero vector if either object does not appear at the k-th frame.
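The spatial embedding $f^{loc}$, concatenating the IoU, difference, and element-wise product of two normalized boxes, can be sketched for a single frame as follows. This is a minimal illustration, not the paper's implementation: boxes are assumed to be (cx, cy, w, h) tuples and the function names are hypothetical.

```python
def box_iou(a, b):
    # boxes as (cx, cy, w, h), normalized to [0, 1]
    ax1, ay1, ax2, ay2 = a[0] - a[2] / 2, a[1] - a[3] / 2, a[0] + a[2] / 2, a[1] + a[3] / 2
    bx1, by1, bx2, by2 = b[0] - b[2] / 2, b[1] - b[3] / 2, b[0] + b[2] / 2, b[1] + b[3] / 2
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))   # intersection width
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))   # intersection height
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def spatial_embedding(s1, s2):
    # f_loc = IoU || (s1 - s2) || (s1 * s2)
    iou = [box_iou(s1, s2)]
    diff = [x - y for x, y in zip(s1, s2)]
    prod = [x * y for x, y in zip(s1, s2)]
    return iou + diff + prod
```

For two 4-dimensional boxes the embedding has length 1 + 4 + 4 = 9; in the paper the same construction is applied over a window of trajectory coordinates centered at frame k.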

C DYNAMIC PREDICTOR

To predict locations and RGB patches, the dynamics predictor maintains a directed graph $\langle V, D \rangle = \langle \{v_n\}_{n=1}^{N}, \{d_{n_1,n_2}\}_{n_1,n_2=1}^{N,N} \rangle$. The n-th vertex $v_n$ is represented by the concatenation of its normalized coordinates $b^n_t = [x^n_t, y^n_t, w^n_t, h^n_t]$ and its RGB patch $p^n_t$. The edge $d_{n_1,n_2}$ is represented by the normalized coordinate difference $b^{n_1}_t - b^{n_2}_t$. To capture object dynamics, we concatenate the features over a small history window. To predict the dynamics at frame $k{+}1$, we first encode the vertices and edges: $e^o_{n,k} = f^{enc}_O(\|_{t=k-w}^{k}(b^n_t \| p^n_t))$ and $e^r_{n_1,n_2,k} = f^{enc}_R(\|_{t=k-w}^{k}(b^{n_1}_t - b^{n_2}_t))$, where $\|$ indicates concatenation, $w$ is the history window size (set to 3), and $f^{enc}_O$ and $f^{enc}_R$ are CNN-based encoders for objects and relations. We then update the object influences $\{h^l_{n,k}\}_{n=1}^{N}$ and relation influences $\{e^l_{n_1,n_2,k}\}_{n_1,n_2=1}^{N,N}$ through $L$ propagation steps. Specifically, $e^l_{n_1,n_2,k} = f_R(e^r_{n_1,n_2,k}, h^{l-1}_{n_1,k}, h^{l-1}_{n_2,k})$ and $h^l_{n,k} = f_O(e^o_{n,k}, \sum_{n_1,n_2} e^l_{n_1,n_2,k}, h^{l-1}_{n,k})$, where $l \in [1, L]$ denotes the l-th step and $f_O$ and $f_R$ denote the object and relation propagators, respectively. We initialize $h^0_{n,k} = 0$. We finally predict the states at frame $k{+}1$ as $\hat{b}^n_{k+1} = f^{pred}_{O_1}(e^o_{n,k}, h^L_{n,k})$ and $\hat{p}^n_{k+1} = f^{pred}_{O_2}(e^o_{n,k}, h^L_{n,k})$, where $f^{pred}_{O_1}$ and $f^{pred}_{O_2}$ predict the normalized object coordinates and RGB patches at the next frame. We optimize the dynamics predictor by minimizing the $L_2$ distance between the predictions $\hat{b}^n_{k+1}, \hat{p}^n_{k+1}$ and the real future locations $b^n_{k+1}$ and extracted patches $p^n_{k+1}$. During inference, the dynamics predictor predicts the locations and patches at frame $k{+}1$ from the features of the last $w$ observed frames of the original video; we obtain predictions at frame $k{+}2$ by feeding the predicted results at frame $k{+}1$ back to the encoders.
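The L-step message passing can be illustrated with a toy sketch. This replaces the learned propagators $f_O$ and $f_R$ with simple sum-and-ReLU functions (the real model uses learned CNN/MLP encoders); it only shows the propagation structure, not the paper's implementation.

```python
def relu(v):
    return [max(0.0, x) for x in v]

def propagate(e_obj, e_rel, L=2):
    """Toy graph propagation.

    e_obj: {n: vector}      -- object encodings e^o_n
    e_rel: {(n1, n2): vec}  -- relation encodings e^r_{n1,n2}
    All vectors share the same dimension D.
    """
    D = len(next(iter(e_obj.values())))
    h = {n: [0.0] * D for n in e_obj}  # h^0_n = 0
    for _ in range(L):
        # relation influences: e^l_{n1,n2} = f_R(e^r, h_{n1}, h_{n2})
        e_l = {(a, b): relu([r + x + y for r, x, y in zip(vec, h[a], h[b])])
               for (a, b), vec in e_rel.items()}
        # object influences: h^l_n = f_O(e^o_n, sum of incoming edges, h^{l-1}_n)
        new_h = {}
        for n in e_obj:
            agg = [0.0] * D
            for (a, b), vec in e_l.items():
                if b == n:
                    agg = [s + v for s, v in zip(agg, vec)]
            new_h[n] = relu([o + g + p for o, g, p in zip(e_obj[n], agg, h[n])])
        h = new_h
    return h
```

With two objects and a symmetric edge, two propagation steps mix each object's encoding with its neighbor's influence, mirroring the $e^l$ / $h^l$ updates above.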
To obtain the counterfactual scene in which the n-th object is removed, we use the first $w$ frames of the original video as the starting point and remove the n-th vertex and its associated edges from the input to predict the counterfactual dynamics. Iterating this process, we obtain the predicted normalized coordinates $\{\hat{b}^n_{k'}\}_{n=1,k'=1}^{N,K'}$ and RGB patches $\{\hat{p}^n_{k'}\}_{n=1,k'=1}^{N,K'}$ at all $K'$ predicted frames.

D PROGRAM PARSER

The program parser encodes each word $w_i$ of the question with a bi-directional LSTM: $\overrightarrow{e}_i, \overrightarrow{h}_i = \overrightarrow{\mathrm{LSTM}}(f^{enc}_w(w_i), \overrightarrow{h}_{i-1})$ and $\overleftarrow{e}_i, \overleftarrow{h}_i = \overleftarrow{\mathrm{LSTM}}(f^{enc}_w(w_i), \overleftarrow{h}_{i+1})$, where $I$ is the number of words and $f^{enc}_w$ is an encoder for word embeddings. To decode the encoded vectors $\{e_i\}_{i=1}^{I}$ into symbolic programs $\{p_j\}_{j=1}^{J}$, we compute $q_j = \mathrm{LSTM}(f^{dec}_c(p_{j-1}))$, $\alpha_{i,j} = \frac{\exp(q_j^{T} e_i)}{\sum_i \exp(q_j^{T} e_i)}$, and $p_j \sim \mathrm{softmax}(W \cdot (q_j \| \sum_i \alpha_{i,j} e_i))$, where $e_i = \overrightarrow{e}_i \| \overleftarrow{e}_i$ and $J$ is the number of programs. The dimensions of the word embeddings and of all hidden states are set to 300 and 256, respectively.
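The attention step $\alpha_{i,j} = \mathrm{softmax}_i(q_j^T e_i)$ and the attended context $\sum_i \alpha_{i,j} e_i$ can be sketched in pure Python; `attention` is an illustrative helper operating on plain lists, not the paper's LSTM decoder.

```python
import math

def attention(q, enc):
    """q: decoder query q_j; enc: list of encoder vectors e_i.

    Returns the attention weights alpha_{i,j} and the attended context.
    """
    scores = [sum(a * b for a, b in zip(q, e)) for e in enc]  # q^T e_i
    m = max(scores)                                           # for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    alpha = [x / z for x in exps]
    context = [sum(a * e[d] for a, e in zip(alpha, enc)) for d in range(len(q))]
    return alpha, context
```

In the parser, the context is concatenated with $q_j$ and passed through a linear layer $W$ followed by a softmax over the program vocabulary.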

E CLEVRER OPERATIONS AND PROGRAM EXECUTION

We list all the available data types and operations for CLEVRER VQA (Yi et al., 2020) in Table 8 and Table 7. In this section, we first introduce how we represent the objects, events, and moments in the video. Then, we describe how we quantize the static and dynamic concepts and perform temporal and causal reasoning. Finally, we summarize the detailed implementation of all operations in Table 9.

Representation for Objects, Events and Time. We consider a video with N objects and T frames, and we sample K frames for collision prediction. The objects in Table 8 are represented by a vector objects of length N, where $objects_n \in [0, 1]$ is the probability that the n-th object is being referred to. Similarly, we use a vector $events^{in}$ of length N to represent the probability of each object coming into the visible scene. We additionally store frame indexes $t^{in}$ for the event in, where $t^{in}_n$ indicates the moment when the n-th object first appears in the visible scene. We represent the event out in the same way. For the event collision, we use a matrix $events^{col} \in R^{N \times N \times K}$, where $events^{col}_{n_1,n_2,k}$ is the probability that the $n_1$-th and $n_2$-th objects collide at the k-th frame. Since CLEVRER requires temporal relations between events, we also maintain a time mask $M \in R^{T}$ marking valid time steps, where $M_t = 1$ indicates that the t-th frame is valid at the current step and $M_t = 0$ that it is invalid. In CLEVRER, Unique also involves the transformation from objects (an object set) to object (a single object); we achieve this by selecting the object with the largest probability, and transform events to event in the same way.

Empirically, we find that static attributes are more robust, and that linear sum assignment together with static attributes helps distinguish close object proposals and make the correct image-proposal assignments.
Similar to standard object-tracking algorithms (Bewley et al., 2016; Wojke et al., 2017), we also find that adding a Kalman filter further improves trajectory quality slightly.
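The valid-time-mask update and order-based event selection described above can be sketched as follows; `filter_after` and `filter_order` are hypothetical helper names, and the threshold η defaults to 0 as in the text.

```python
def filter_after(t_event, num_frames):
    """Valid-time mask M with M_t = 1 iff t > t_event."""
    return [1 if t > t_event else 0 for t in range(num_frames)]

def filter_order(event_scores, event_times, order, eta=0.0):
    """Keep events with score > eta, sort by time, pick the requested order.

    event_scores[n] -- confidence that event n happened (e.g. events_in[n])
    event_times[n]  -- frame at which event n happens (e.g. t_in[n])
    order           -- "first", "second", or "last"
    Returns the index of the selected event.
    """
    valid = [(t, n) for n, (s, t) in enumerate(zip(event_scores, event_times))
             if s > eta]
    valid.sort()  # chronological order
    if order == "last":
        return valid[-1][1]
    idx = {"first": 0, "second": 1}[order]
    return valid[idx][1]
```

In DCL the mask is further broadcast over the 4T-dimensional sequence features before re-scoring events; this sketch only shows the selection logic.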

G STATISTICS FOR CLEVRER-GROUNDING AND CLEVRER-RETRIEVAL

We simply use the videos from the original CLEVRER training set as the training videos for CLEVRER-Grounding and CLEVRER-Retrieval and evaluate performance on the validation set. CLEVRER-Grounding provides grounding expressions for each training video, and CLEVRER-Retrieval contains 7.4 expressions per video in the training set. We evaluate CLEVRER-Grounding on all 5,000 videos from the original CLEVRER validation set. For CLEVRER-Retrieval, we additionally generate 1,129 unique expressions from the validation set as queries and treat the first 1,000 videos from the CLEVRER validation set as the gallery. We provide more examples of the CLEVRER-Grounding and CLEVRER-Retrieval datasets in Fig. 5, Fig. 6 and Fig. 7. As these examples show, the newly proposed datasets contain delicate, compositional expressions for objects and physical events, and can evaluate a model's ability to perform compositional temporal and causal reasoning.

H TRAINING OBJECTIVES

In this section, we provide the explicit training objectives for each module. We optimize the feature extractor and the concept embeddings in the executor through question answering. We treat each option of a multiple-choice question as an independent boolean question during training, and we use different loss functions for different question types: cross-entropy loss for open-ended questions and mean squared error for counting questions. For open-ended questions, $L_{QA,open} = -\sum_{c=1}^{C} 1\{y^a = c\} \log(p_c)$, where $C$ is the size of the pre-defined answer set, $p_c$ is the probability of the c-th answer, and $y^a$ is the ground-truth answer label. For counting questions, $L_{QA,count} = (y^a - z)^2$, where $z$ is the predicted number and $y^a$ is the ground-truth number label. We train the program parser with program labels using cross-entropy loss, $L_{program} = -\sum_{j=1}^{J} 1\{y^p = j\} \log(p_j)$, where $J$ is the size of the pre-defined program set, $p_j$ is the probability of the j-th program, and $y^p$ is the ground-truth program label. We optimize the dynamics predictor with mean squared error: $L_{dynamic} = \sum_{n=1}^{N} \|b^n - \hat{b}^n\|_2^2 + \sum_{n=1}^{N}\sum_{i_1=1}^{N_p}\sum_{i_2=1}^{N_p} (p^n_{i_1,i_2} - \hat{p}^n_{i_1,i_2})^2$, where $b^n$ is the coordinate vector of the n-th object, $p^n_{i_1,i_2}$ is the pixel value of the n-th object's cropped patch at $(i_1, i_2)$, and $N_p$ is the crop size; $\hat{b}^n$ and $\hat{p}^n_{i_1,i_2}$ are the dynamics predictor's predictions for $b^n$ and $p^n_{i_1,i_2}$.

Table 9: Neural operations in DCL. $events^{col}$ denotes the collision events happening at unseen future frames; $objs^{exp} \in R^{N \times N \times K}$ with $objs^{exp}_{n_1,n_2,k} = \max(objs_{n_1}, objs_{n_2})$; $events^{col}_{n,k}$ denotes all collision events involving the n-th object at the k-th frame.
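The three scalar losses can be sketched directly; these are illustrative helpers operating on plain probability lists, not the training code:

```python
import math

def qa_open_loss(probs, y):
    """Cross-entropy for open-ended QA: -log p_{y}."""
    return -math.log(probs[y])

def qa_count_loss(z, y):
    """Squared error for counting questions: (y - z)^2."""
    return (y - z) ** 2

def program_loss(probs, y_p):
    """Cross-entropy for the program parser: -log p_{y_p}."""
    return -math.log(probs[y_p])
```

The dynamics loss is the analogous sum of squared errors over predicted coordinates and patch pixels.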



Figure 1: The process to handle visual reasoning in dynamic scenes. The trajectories of the target blue and yellow spheres are marked by the sequences of bounding boxes. Object attributes of the blue sphere and yellow sphere and the collision event are marked by blue, yellow and purple colors. Stroboscopic imaging is applied for motion visualization.

Figure 2: DCL's architecture for counterfactual questions during inference. Given an input video and its corresponding question and choice, we first use a program parser to parse the question and the choice into executable programs. We adopt an object trajectory detector to detect trajectories of all objects. Then, the extracted objects are sent to a dynamics predictor to predict their dynamics. Next, the extracted objects are sent to the feature extractor to extract latent representations for objects and events. Finally, we feed the parsed programs and latent representations to the symbolic executor to answer the question and optimize concept learning.

Figure 3: Examples of the CLEVRER-Grounding and CLEVRER-Retrieval datasets. The target regions are marked by purple boxes, and stroboscopic imaging is applied for visualization purposes. In CLEVRER-Retrieval, we mark randomly selected positive and negative gallery videos with green and red borders, respectively.


Table: Question-answering accuracy on CLEVRER. The first and second parts of the table show models trained without and with visual attribute and event labels, respectively. Best performance is highlighted in boldface. DCL and DCL-Oracle denote our models trained without and with labels of visual attributes and events, respectively.

Prior benchmarks mainly focus on video question answering (Tapaswi et al., 2016; Lei et al., 2018) or study dynamics and reasoning without question answering (Girdhar & Ramanan, 2020; Baradel et al., 2020). Thus, they are unsuitable for evaluating video causal reasoning and for learning object and event concepts through question answering. We first show DCL's strong performance on video causal reasoning. Then, we show DCL's ability in concept learning, predicting object visual attributes and the events happening in videos. We show DCL's generalization capacity on new applications, including CLEVRER-Grounding and CLEVRER-Retrieval. We finally extend DCL to a real block-tower video dataset (Lerer et al., 2016).

4.1 IMPLEMENTATION DETAILS

Following the experimental setting in Yi et al. (2020), we train the language program parser with 1,000 programs for all question types. We train all our models without attribute and event labels. Our models for video question answering are trained on the training set, tuned on the validation set, and evaluated on the test set. To show DCL's generalization capacity, we build the CLEVRER-Grounding and CLEVRER-Retrieval datasets from the original CLEVRER videos and their associated annotations. We provide more implementation details in Appendix A.

Table: Evaluation of video concept learning on the validation set.

Evaluation of video grounding. For spatial grounding, we consider it to be accurate if the IoU between the detected trajectory and the ground-truth trajectory is greater than 0.5.

Evaluation of CLEVRER-Retrieval. Mean average precision (mAP) is adopted as the metric.

QA results on the block tower dataset.

Evaluation of concept learning on the block tower dataset. Our DCL can learn to quantize the new concept "falling" on real videos through QA.

The evaluation of different methods for object trajectory generation.


Acknowledgement This work is in part supported by ONR MURI N00014-16-1-2007, the Center for Brain, Minds, and Machines (CBMM, funded by NSF STC award CCF-1231216), the Samsung Global Research Outreach (GRO) Program, Autodesk, and IBM Research.


Object and Event Concept Quantization. We first introduce how DCL quantizes different concepts, using the static object concept Cube as an example. Let $f^v_n$ denote the latent visual feature of the n-th object in the video, and let SA denote the set of all static attributes. The concept Cube is represented by a semantic vector $s^{cube}$ and an indication vector $i^{cube}$; $i^{cube}$ has length |SA| and is L1-normalized, indicating that the concept Cube belongs to the static attribute Shape. We compute the confidence score that an object is a Cube by $conf^{cube}_n = \sum_{sa \in SA} i^{cube}_{sa} \cdot \sigma\big((\cos(s^{cube}, m_{sa}(f^v_n)) - \delta) / \lambda\big)$, where $\delta$ and $\lambda$ denote the shifting and scaling scalars, set to 0.15 and 0.2, respectively; $\cos(\cdot,\cdot)$ calculates the cosine similarity between two vectors, and $m_{sa}$ denotes a linear transformation mapping object features into the concept representation space. Applying this concept filter to all objects yields a vector of length N, denoted ObjFilter(cube).

We perform similar quantization for temporal dynamic concepts. For the events in and out, we replace $f^v_n$ with the temporal sequence feature $f^s_n \in R^{4T}$ to get $events^{in}_n$. For the event collision, we replace $f^v_n$ with $f^c_{n_1,n_2,k}$ to predict the confidence that the $n_1$-th and $n_2$-th objects collide at the k-th frame, giving $events^{col}_{n_1,n_2,k}$. For the moment-specific dynamic concepts moving and stationary, we adopt the frame-specific feature $f^s_{n,t^*} \in R^{4T}$ for concept prediction and denote the filter result over all objects as ObjFilter(moving, $t^*$). Specifically, we generate the sequence feature $f^s_{n,t^*}$ at the $t^*$-th frame by concatenating $[x^n_t, y^n_t, w^n_t, h^n_t]$ only from frame $t^* - \tau$ to $t^* + \tau$ and padding the other dimensions with 0.

Temporal and Causal Reasoning. A unique feature of CLEVRER is that it requires a model to reason over the temporal and causal structure of the video to get the answer. We handle Filter before and Filter after by updating the valid time mask M. For example, to filter events happening after a target event, we first get the frame $t^*$ at which the target event happens and update the valid time mask M by setting $M_t = 1$ if $t > t^*$ and $M_t = 0$ otherwise. We then ignore events happening at invalid frames and update the temporal sequence features to $f^s_n = f^s_n \circ M^{exp}$, where $\circ$ denotes component-wise multiplication and $M^{exp}$ is the mask expanded to the dimensions of $f^s_n$. For Filter order of an event type, we first find all valid events whose score $events^{type}$ exceeds a threshold $\eta$, which is simply set to 0, with type $\in$ {in, out, collision}; we then sort the remaining events by $t^{type}$ to find the target event. For Filter ancestor of a collision event, we first find the valid events with $events^{type} > \eta$ and then return all valid events that lie in the causal graph of the given collision event. We summarize the implementation of all operations in Table 9.

F TRAJECTORY PERFORMANCE EVALUATION

In this section, we compare different methods for generating object trajectory proposals. Greedy+IoU denotes the method of Gkioxari & Malik (2015), which adopts a greedy Viterbi algorithm to link image proposals across consecutive frames based on their IoUs. Greedy+IoU+Attr. denotes the greedy algorithm applied to both IoUs and predicted static attributes. LSM+IoU denotes linear sum assignment over image proposals based on IoUs, and LSM+IoU+Attr. adds predicted static attributes. LSM+IoU+Attr.+KF applies additional Kalman filtering (Kalman, 1960; Bewley et al., 2016; Wojke et al., 2017) on top of LSM+IoU+Attr. We evaluate each method by computing the IoU between the generated trajectory proposals and the ground-truth trajectories; we consider a trajectory proposal "correct" if this IoU is larger than a threshold.
Specifically, two metrics are used for evaluation: $precision = N_{correct} / N_{p}$ and $recall = N_{correct} / N_{gt}$, where $N_{correct}$, $N_{p}$, and $N_{gt}$ denote the number of correct proposals, the number of proposals, and the number of ground-truth objects, respectively.
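The two metrics can be computed as follows, assuming each proposal has already been matched to its best-overlapping ground-truth trajectory (an assumption of this sketch; the matching itself is done by the greedy or linear-sum-assignment methods above):

```python
def precision_recall(best_ious, n_gt, thresh=0.5):
    """best_ious[i]: best IoU of proposal i with any ground-truth trajectory.

    A proposal is "correct" if its best IoU exceeds the threshold.
    Returns (precision, recall) = (N_correct / N_p, N_correct / N_gt).
    """
    n_correct = sum(1 for iou in best_ious if iou > thresh)
    n_p = len(best_ious)
    return n_correct / n_p, n_correct / n_gt
```

For example, three proposals with best IoUs [0.9, 0.4, 0.7] against four ground-truth objects give precision 2/3 and recall 1/2.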

Filter in: (events, objects) → events — Select incoming events of the input objects
Filter out: (events, objects) → events — Select exiting events of the input objects
Filter collision: (events, objects) → events — Select all collisions that involve any of the input objects
Get col partner: (event, object) → object — Return the collision partner of the input object
Filter before: (events, event) → events — Select all events before the target event
Filter after: (events, event) → events — Select all events after the target event
Filter order: (events, order) → event — Select the event at the specified temporal order
Filter ancestor: (event, events) → events — Select all ancestors of the input event in the causal graph
Get frame: (event) → frame — Return the frame of the input event in the video

Query attribute: (object) → concept — Returns the queried attribute of the input object
Count: (objects) → int, (events) → int — Returns the number of input objects/events
Exist: (objects) → bool — Returns "yes" if the input object set is not empty
Belong to: (event, events) → bool — Returns "yes" if the input event belongs to the input event set
Negate: (bool) → bool — Returns the negation of the input boolean

Unique: (events) → event, (objects) → object — Returns the only event/object in the input list
order — The chronological order of an event, e.g. "First", "Second" and "Last".

static concept — Object-level static concepts like "Red", "Sphere" and "Metal".
dynamic concept — Object-level dynamic concepts like "Moving" and "Stationary".
attribute — Static attributes, including "Color", "Shape" and "Material".
frame — The frame number of an event.
int — A single integer like "0" and "1".
bool — A single boolean value, "True" or "False".

Object Filter Modules

Filter static concept: min(objs, ObjFilter(sa)) — (objs: objects, sa: concept) → objects
Filter dynamic concept: min(objs, ObjFilter(da, t)) — (objs: objects, da: concept, t: frame) → objects

Event Filter Modules

Filter in: min(objs, events^in) — (events^in: events, objs: objects) → events
Filter out: min(objs, events^out) — (events^out: events, objs: objects) → events
Filter collision: min(objs^exp, events^col) — (events^col: events, objs: objects) → events
Get col partner: max over k ∈ [1, K] of events^col_{n,k} — (events^col: events, obj_n: object) → objects
Filter before: events^in_n = −1 if t^in_n > t^{event1} — (events^in: events, event1: event) → events
Filter after: events^in_n = −1 if t^in_n < t^{event1} — (events^in: events, event1: event) → events
Filter order: events^in_n > 0 if order^in_n = or — (events^in: events, or: order) → event
Filter ancestor: {events1_n > 0 and events1_n in the causal graph of event1} — (event1: event, events1: events) → events

Example grounding queries (from Figs. 5-7): "The collision that happens before the gray object enters the scene."; "The green object enters the scene before the rubber sphere enters the scene"; "The cube exits the scene after the sphere enters the scene"; "The metal cylinder that is stationary when the sphere enters the scene". Example retrieval queries include "A video that contains an object that collides with the brown metal cube." and "A video that contains a collision that happens after/before the yellow metal cube enters the scene."
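The Filter collision entry above, min(objs^exp, events^col) with objs^exp_{n1,n2,k} = max(objs_{n1}, objs_{n2}), can be sketched with nested lists; this is an illustrative sketch of the soft min/max masking, not the paper's implementation:

```python
def expand_objects(objs):
    # objs_exp[n1][n2] = max(objs[n1], objs[n2])
    return [[max(a, b) for b in objs] for a in objs]

def filter_collision(objs, events_col):
    """Mask collision scores events_col[n1][n2][k] by the expanded object scores.

    objs[n]: probability that object n is referred to.
    Returns the element-wise min of objs_exp and events_col.
    """
    exp = expand_objects(objs)
    n = len(objs)
    return [[[min(exp[i][j], s) for s in events_col[i][j]]
             for j in range(n)] for i in range(n)]
```

Taking the max over the object pair keeps a collision alive if either participant is selected, and the min then suppresses collisions whose scores exceed the pair's selection confidence.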

