ACTION CONCEPT GROUNDING NETWORK FOR SEMANTICALLY-CONSISTENT VIDEO GENERATION

Anonymous authors
Paper under double-blind review

Abstract

Recent works in self-supervised video prediction have mainly focused on passive forecasting and low-level action-conditional prediction, which sidestep the problem of semantic learning. We introduce the task of semantic action-conditional video prediction, which can be regarded as an inverse problem of action recognition. The challenge of this new task primarily lies in how to effectively inform the model of semantic action information. To bridge vision and language, we utilize the idea of capsules and propose a novel video prediction model, the Action Concept Grounding Network (ACGN). Our method is evaluated on two newly designed synthetic datasets, CLEVR-Building-Blocks and Sapien-Kitchen, and experiments show that, given different action labels, ACGN correctly conditions on instructions and generates the corresponding future frames without the need for bounding boxes. We further demonstrate that our trained model can make out-of-distribution predictions for concurrent actions, be quickly adapted to new object categories, and exploit its learnt features for object detection. Additional visualizations can be found at https://iclr-acgn.github.io/ACGN/.

1. INTRODUCTION

Recently, video prediction and generation have drawn much attention due to their ability to capture meaningful representations learned through self-supervision (Wang et al. (2018b); Yu et al. (2019)). Although modern video prediction methods have made significant progress in improving predictive accuracy, most of their applications are limited to passive forecasting (Villegas et al. (2017); Wang et al. (2018a); Byeon et al. (2018)), meaning models can only passively observe a short period of dynamics and accordingly make a short-term extrapolation. Such settings neglect the fact that the observer can also become an active participant in the environment. To model interactions between agent and environment, several low-level action-conditional video prediction models have been proposed in the community (Oh et al. (2015); Mathieu et al. (2015); Babaeizadeh et al. (2017); Ebert et al. (2017)). In this paper, we go one step further by introducing the task of semantic action-conditional video prediction. Instead of using low-level single-entity actions, such as action vectors of robot arms, as done in prior works (Finn et al. (2016); Kurutach et al. (2018)), our task provides semantic descriptions of actions, e.g. "Open the door", and asks the model to imagine "what if I open the door" in the form of future frames. This task requires the model to recognize object identities, assign correct affordances to objects and envision the long-term outcome by planning a reasonable trajectory toward the target, which resembles how humans might imagine conditional futures. The ability to predict correct and semantically consistent future perceptual information is indicative of conceptual grounding of actions, in a manner similar to object grounding in image-based detection and generation tasks. The challenge of semantic action-conditional video prediction primarily lies in how to effectively inform the model of semantic action information.
Existing low-level counterparts usually achieve this by employing a naive concatenation (Finn et al. (2016); Babaeizadeh et al. (2017)) with the action vector at each timestep. While this implementation enables the model to move the desired objects, it fails to produce consistent long-term predictions in multi-entity settings due to its limited inductive bias. To distinguish instances in the image, other related works rely heavily on pre-trained object detectors or ground-truth bounding boxes (Anonymous (2021); Ji et al. (2020)). However, we argue that utilizing a pre-trained detector actually simplifies the task, since such a detector already solves the major difficulty by mapping high-dimensional inputs to low-dimensional groundings. Furthermore, bounding boxes cannot effectively describe complex visual changes, including rotations and occlusions. Thus, a more flexible way of representing objects and actions is required. We present a new video prediction model, ACGN, short for Action Concept Grounding Network. ACGN leverages the idea of attention-based capsule networks (Zhang et al. (2018a)) to bridge semantic actions and video frame generation. The compositional nature of actions can be efficiently represented by the structure of a capsule network, in which each group of capsules encodes the spatial representation of a specific entity or action. The contributions of this work are summarized as follows:

1. We introduce a new task, semantic action-conditional video prediction, as illustrated in Fig. 1, which can be viewed as an inverse problem of action recognition, and create two new video datasets, CLEVR-Building-Blocks and Sapien-Kitchen, for evaluation.
2. We propose a novel video prediction model, the Action Concept Grounding Network, in which routing between capsules is directly controlled by action labels. ACGN can successfully depict long-term counterfactual evolution without the need for bounding boxes.
3. We demonstrate that ACGN is capable of out-of-distribution generation for concurrent actions. We further show that our trained model can be fine-tuned to new categories of objects with a very small number of samples and can exploit its learnt features for detection.

Figure 1: Semantic action-conditional video prediction. An agent is asked to predict what will happen in the form of future frames if it takes a series of semantic actions after observing the scene. Each column depicts alternative futures conditioned on the first outcome in its previous column.

2. RELATED WORK

Passive video prediction: ConvLSTM (Shi et al. (2015)) was the first deep learning model that employed a hybrid of convolutional and recurrent units for video prediction, enabling it to learn spatial and temporal relationships simultaneously; it was soon followed by studies of a similar problem (Kalchbrenner et al. (2017); Mathieu et al. (2015)). Following ConvLSTM, PredRNN (Wang et al. (2017)) designed a novel spatiotemporal LSTM, which allowed memory to flow both vertically and horizontally. The same group further improved predictive results by rearranging spatial and temporal memory in a cascaded mechanism in PredRNN++ (Wang et al. (2018a)) and by introducing 3D convolutions in E3D-LSTM (Wang et al. (2018b)). The latest SOTA, CrevNet (Yu et al. (2019)), utilized an invertible architecture to significantly reduce memory and computation consumption while preserving all information from the input. All of the above models require multiple frames to warm up and can only make relatively short-term predictions, since real-world videos are volatile: models usually lack sufficient information to predict the long-term future due to partial observation, egomotion and randomness. This setting also prevents models from interacting with the environment.

Action-conditional video prediction: The low-level action-conditional video prediction task, on the other hand, provides an action vector at each timestep as additional input to guide the prediction (Oh et al. (2015); Chiappa et al. (2017)). CDNA (Finn et al. (2016)) is representative of such action-conditional video prediction models. In CDNA, the states and action vectors of the robotic manipulator are first spatially tiled and then integrated into the model through concatenation. SVG (Denton & Fergus (2018)) was initially proposed for stochastic video generation but was later extended to an action-conditional version (Babaeizadeh et al. (2017); Villegas et al. (2019); Chiappa et al. (2017)). It is worth noting that SVG also used concatenation to incorporate action information. Such implementations are prevalent in low-level action-conditional video prediction because the action vector only encodes the spatial information of a single entity, usually a robotic manipulator (Finn et al. (2016)) or a human being. A common failure case for such models is the presence of multiple affordable entities (Kim et al. (2019)), a scenario that our task definition and datasets focus on.
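The tiling-and-concatenation scheme described above (used by CDNA-style models and action-conditional SVG) can be sketched as follows. This is a minimal illustration: the tensor shapes and the toy action vector are our own assumptions, not values from either paper.

```python
import numpy as np

def tile_and_concat(features, action):
    """Naive low-level action conditioning: spatially tile the per-timestep
    action vector and concatenate it with a convolutional feature map
    along the channel axis."""
    # features: (C, H, W) feature map; action: (A,) action vector
    C, H, W = features.shape
    tiled = np.broadcast_to(action[:, None, None], (action.shape[0], H, W))
    return np.concatenate([features, tiled], axis=0)  # (C + A, H, W)

# Toy example: 8-channel feature map, 4-dim action (e.g. a gripper pose delta).
feats = np.random.randn(8, 16, 16)
act = np.array([0.5, -1.0, 0.2, 0.0])
out = tile_and_concat(feats, act)
print(out.shape)  # (12, 16, 16)
```

Because every spatial location receives the same tiled action values, the model carries no inductive bias about which entity the action refers to, which is one reason this scheme struggles in multi-entity scenes.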


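As a toy illustration of how a semantic action label can control routing between capsule groups (contribution 2 above), the sketch below hard-selects one capsule group per action label. This is a deliberately simplified stand-in for ACGN's attention-based routing, not the paper's architecture; the function name, shapes and one-group-per-action assumption are all ours.

```python
import numpy as np

def action_controlled_routing(capsule_poses, action_id, num_actions):
    """Illustrative sketch: the semantic action label, rather than learned
    agreement, decides which capsule group's output is routed forward.
    Here each action concept owns exactly one capsule group."""
    # capsule_poses: (num_actions, D), one pose vector per action concept
    mask = np.eye(num_actions)[action_id]      # one-hot over capsule groups
    routed = mask[:, None] * capsule_poses     # zero out non-selected groups
    return routed.sum(axis=0)                  # pose passed to the decoder

# 3 action concepts, pose dimension D = 4
poses = np.arange(12, dtype=float).reshape(3, 4)
print(action_controlled_routing(poses, action_id=1, num_actions=3))
# selects the second group's pose: [4. 5. 6. 7.]
```

In ACGN the selection is soft (attention-based) and capsules encode learned spatial representations, but the key idea the sketch captures is that the action label, not the image alone, gates which concept drives generation.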