COMPOSITIONAL VIDEO SYNTHESIS WITH ACTION GRAPHS

Abstract

Paper under double-blind review

Figure 1: We focus on video synthesis from actions and propose a new task, Action Graph to Video. To represent the input actions we use a graph structure called an Action Graph; given this graph together with the first frame and the first scene layout, our goal is to synthesize a video that matches the input actions. A (partial) example is shown above. Our model outperforms various baselines and can generalize to previously unseen compositions of actions.
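To make the Action Graph representation concrete, the following is a minimal sketch of one plausible encoding, assuming (as the figure suggests, though the details are not specified here) that nodes are scene entities and edges are actions between them annotated with a time interval. The class and field names (`ActionEdge`, `ActionGraph`, `start`, `end`, `active_at`) are illustrative, not the paper's API.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ActionEdge:
    # A timed action connecting a subject entity to a target entity.
    subject: str
    action: str
    target: str
    start: int  # first frame of the action (hypothetical timing convention)
    end: int    # last frame of the action


@dataclass
class ActionGraph:
    # Nodes are scene entities; edges are timed actions between them.
    nodes: List[str]
    edges: List[ActionEdge] = field(default_factory=list)

    def add_action(self, subject: str, action: str, target: str,
                   start: int, end: int) -> None:
        assert subject in self.nodes and target in self.nodes
        self.edges.append(ActionEdge(subject, action, target, start, end))

    def active_at(self, t: int) -> List[ActionEdge]:
        # Actions whose interval covers frame t -- the kind of query a
        # per-frame synthesis model would need when conditioning on actions.
        return [e for e in self.edges if e.start <= t <= e.end]
```

For the ball-passing example from the caption, one could build `ActionGraph(nodes=["thrower", "ball", "receiver"])`, add overlapping `throw` and `catch` edges, and query `active_at(t)` to see which actions must be coordinated at each frame.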

1. INTRODUCTION

Learning to generate visual content is a fundamental task in computer vision, with numerous applications ranging from sim-to-real training of autonomous agents to creating visuals for games and movies. While the quality of still-image generation has leaped forward recently (Karras et al., 2020; Brock et al., 2019), generating videos remains much harder.

Generating actions and interactions is perhaps the most challenging aspect of conditional video generation. Actions create long-range spatio-temporal dependencies between people and the objects they interact with. For example, when a player passes a ball, the entire movement sequence of all entities (thrower, ball, receiver) must be coordinated and carefully timed. The current paper focuses on this difficult problem, the task of generating coordinated and timed actions, as an important step towards generating videos of complex scenes.

Current approaches for conditional video generation are not well suited to conditioning the generation on actions. First, future video prediction (Ye et al., 2019; Watters et al., 2017) generates future frames based on an initial input frame, but a first frame alone cannot determine which coordinated actions should follow. Second, in video-to-video translation, the goal is to translate a sequence of semantic masks into an output video.

