ACTION CONCEPT GROUNDING NETWORK FOR SEMANTICALLY-CONSISTENT VIDEO GENERATION

Anonymous authors
Paper under double-blind review

Abstract

Recent works in self-supervised video prediction have mainly focused on passive forecasting and low-level action-conditional prediction, which sidesteps the problem of semantic learning. We introduce the task of semantic action-conditional video prediction, which can be regarded as an inverse problem of action recognition. The challenge of this new task primarily lies in how to effectively inform the model of semantic action information. To bridge vision and language, we utilize the idea of capsules and propose a novel video prediction model, the Action Concept Grounding Network (ACGN). Our method is evaluated on two newly designed synthetic datasets, CLEVR-Building-Blocks and Sapien-Kitchen, and experiments show that given different action labels, our ACGN can correctly condition on instructions and generate corresponding future frames without the need for bounding boxes. We further demonstrate that our trained model can make out-of-distribution predictions for concurrent actions, be quickly adapted to new object categories, and exploit its learned features for object detection. Additional visualizations can be found at https://iclr-acgn.github.io/ACGN/.

1. INTRODUCTION

Recently, video prediction and generation have drawn a lot of attention due to their ability to capture meaningful representations learned through self-supervision (Wang et al. (2018b); Yu et al. (2019)). Although modern video prediction methods have made significant progress in improving predictive accuracy, most of their applications are limited to scenarios of passive forecasting (Villegas et al. (2017); Wang et al. (2018a); Byeon et al. (2018)), meaning models can only passively observe a short period of dynamics and accordingly make a short-term extrapolation. Such settings neglect the fact that the observer can also become an active participant in the environment. To model interactions between agent and environment, several low-level action-conditional video prediction models have been proposed in the community (Oh et al. (2015); Mathieu et al. (2015); Babaeizadeh et al. (2017); Ebert et al. (2017)). In this paper, we go one step further by introducing the task of semantic action-conditional video prediction. Instead of using low-level single-entity actions, such as action vectors of robot arms as done in prior works (Finn et al. (2016); Kurutach et al. (2018)), our task provides semantic descriptions of actions, e.g. "Open the door", and asks the model to imagine "What if I open the door" in the form of future frames. This task requires the model to recognize object identities, assign correct affordances to objects, and envision the long-term expectation by planning a reasonable trajectory toward the target, which resembles how humans might imagine conditional futures. The ability to predict correct and semantically consistent future perceptual information is indicative of conceptual grounding of actions, in a manner similar to object grounding in image-based detection and generation tasks.

The challenge of action-conditional video prediction primarily lies in how to inform the model of semantic action information. Existing low-level counterparts usually achieve this by naively concatenating the action vector of each timestep with the model's features (Finn et al. (2016); Babaeizadeh et al. (2017)). While this implementation enables the model to move the desired objects, it fails to produce consistent long-term predictions in multi-entity settings due to its limited inductive bias.

To distinguish instances in the image, other related works rely heavily on pre-trained object detectors or ground-truth bounding boxes (Anonymous (2021); Ji et al. (2020)). However, we argue that utilizing a pre-trained detector actually simplifies the task, since such a detector already solves the major difficulty by mapping high-dimensional inputs to low-dimensional groundings. Furthermore, bounding boxes cannot


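As background, the naive concatenation conditioning discussed in the introduction can be sketched in a few lines. The snippet below is a minimal NumPy illustration, not the paper's method: the function name, shapes, and the one-hot action encoding are all hypothetical, chosen only to show how a per-timestep action vector is broadcast over the spatial grid and concatenated with a convolutional feature map along the channel axis.

```python
import numpy as np

def action_conditioned_features(features, action, num_actions):
    """Naively condition a feature map on a discrete action (illustrative only).

    features: (C, H, W) feature map for one timestep
    action:   integer action id for this timestep
    """
    C, H, W = features.shape
    one_hot = np.zeros(num_actions, dtype=features.dtype)
    one_hot[action] = 1.0
    # Tile the action vector to (num_actions, H, W) so every spatial
    # location receives an identical copy of the action encoding.
    action_plane = np.broadcast_to(one_hot[:, None, None], (num_actions, H, W))
    # Concatenate along the channel axis; downstream convolutions then
    # see the action as extra input channels.
    return np.concatenate([features, action_plane], axis=0)

feats = np.random.randn(8, 4, 4).astype(np.float32)
out = action_conditioned_features(feats, action=2, num_actions=5)
print(out.shape)  # (13, 4, 4)
```

Because the same action encoding is attached at every spatial location, this scheme carries no notion of which entity the action refers to, which is the limited inductive bias that causes inconsistent long-term predictions in multi-entity scenes.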