LEARNING TO REPRESENT AND PREDICT IMAGE SEQUENCES VIA POLAR STRAIGHTENING

Abstract

Observer motion and continuous deformations of objects and textures imbue natural videos with distinct temporal structures, enabling partial prediction of future frames from past ones. Conventional methods first estimate local motion, or optic flow, and then use it to predict future frames by warping and copying content. Here, we explore a more direct methodology, in which frames are mapped into a learned representation space where the structure of temporal evolution is more readily accessible. Motivated by the geometry of the Fourier shift theorem and its group-theoretic generalization, we formulate a simple architecture that represents video frames in learned polar coordinates to facilitate prediction. Specifically, we construct networks in which pairs of convolutional channel coefficients are interpreted as complex-valued, and are expected to evolve with slowly varying amplitudes and linearly advancing phases. We train these models on next-frame prediction, and compare their performance with that of conventional methods based on optic flow and of other learned predictive networks, evaluated on natural videos from two datasets. We find that the polar predictor achieves high prediction performance while remaining interpretable and fast, thereby demonstrating the potential of a flow-free video processing methodology that is trained end-to-end to predict natural video content.

1. INTRODUCTION

One way to frame the fundamental problem of vision is that of representing the signal in a form that is more useful for performing visual tasks, be they estimation, recognition, or guiding motor actions. Perhaps the most general "task" is temporal prediction, which has been proposed as a fundamental goal for unsupervised learning of visual representations (Földiák, 1991). But previous research along these lines has generally focused on estimating temporal transformations rather than using them to predict: extracting slow features (Wiskott & Sejnowski, 2002), or finding dictionaries and sparse codes with slowly varying amplitudes and phases (Cadieu & Olshausen, 2012). In video processing and computer vision, a common strategy for temporal prediction is to first estimate local translational motion, and then to copy and paste content to predict the next frame. Such motion compensation is an important component in making compression schemes like MPEG successful (Wiegand et al., 2003). These video coding standards, the fruit of major engineering efforts, make digital video communication feasible and are widely used. But motion estimation is a difficult nonlinear problem, and existing methods fail in regions where the temporal evolution is not translational: for example, expanding or rotating motion, discontinuous motion at occlusion boundaries, or mixtures of motion arising from semi-transparent surfaces (e.g., viewing the world through a dirty pane of glass). In compression schemes, these failures of motion estimation lead to prediction errors, which must then be repaired by sending additional corrective bits. Human perception does not seem to suffer from such failures: at least, our subjective sense is that we can anticipate the time-evolution of visual input even in the vicinity of these commonly occurring non-translational changes.
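The phase-based alternative motivating this work can be illustrated with a toy example. Under the Fourier shift theorem, a circular translation leaves the amplitude of each DFT coefficient unchanged and advances its phase linearly in time, so the next frame can be predicted by extrapolating phases, with no explicit motion estimate. The following sketch (a 1-D illustration with NumPy, not the paper's learned architecture) demonstrates this:

```python
import numpy as np

# Toy 1-D illustration of prediction via the Fourier shift theorem:
# a circular shift multiplies each DFT coefficient by a frequency-
# dependent phase factor, so amplitudes are constant over time and
# phases advance linearly. Extrapolating the phase one step further
# predicts the next "frame" without estimating motion explicitly.

rng = np.random.default_rng(0)
x0 = rng.standard_normal(64)     # initial frame (random 1-D signal)
x1 = np.roll(x0, 3)              # next frame: translated by 3 samples
x2_true = np.roll(x0, 6)         # ground truth for the frame after that

X0, X1 = np.fft.fft(x0), np.fft.fft(x1)

# Per-frequency phase advance observed between the two frames.
dphi = np.angle(X1) - np.angle(X0)

# Predict: keep the amplitude, advance each phase by the same increment.
X2 = np.abs(X1) * np.exp(1j * (np.angle(X1) + dphi))
x2_pred = np.real(np.fft.ifft(X2))

print(np.max(np.abs(x2_pred - x2_true)))  # ~0 (machine precision)
```

This prediction is exact only for global circular translation; the paper's premise is that a learned, convolutional analogue of this polar (amplitude/phase) parameterization can extend the same linear-extrapolation principle to the richer deformations found in natural video.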
In fact, those changes are often the most informative ones, as they reveal object boundaries and provide ordinal depth and other information about the visual scene. This may imply that humans use a different strategy, perhaps bypassing the estimation of motion altogether, to represent and predict evolving visual input. Toward this end, and inspired by recent hypotheses that primate visual representations support prediction by "straightening" the temporal

