LEARNING TO REPRESENT AND PREDICT IMAGE SEQUENCES VIA POLAR STRAIGHTENING

Abstract

Observer motion and continuous deformations of objects and textures imbue natural videos with distinct temporal structures, enabling partial prediction of future frames from past ones. Conventional methods first estimate local motion, or optic flow, and then use it to predict future frames by warping and copying content. Here, we explore a more direct methodology, in which frames are mapped into a learned representation space where the structure of temporal evolution is more readily accessible. Motivated by the geometry of the Fourier shift theorem and its group-theoretic generalization, we formulate a simple architecture that represents video frames in learned polar coordinates to facilitate prediction. Specifically, we construct networks in which pairs of convolutional channel coefficients are interpreted as complex-valued, and are expected to evolve with slowly varying amplitudes and linearly advancing phases. We train these models on next-frame prediction and compare their performance, on natural videos from two datasets, with that of conventional optic-flow methods and of other learned predictive networks. We find that the polar predictor achieves high prediction performance while remaining interpretable and fast, thereby demonstrating the potential of a flow-free video processing methodology that is trained end-to-end to predict natural video content.

1. INTRODUCTION

One way to frame the fundamental problem of vision is that of representing the signal in a form that is more useful for performing visual tasks, be they estimation, recognition, or guiding motor actions. Perhaps the most general "task" is that of temporal prediction, which has been proposed as a fundamental goal for unsupervised learning of visual representations (Földiák, 1991). But previous research along these lines has generally focused on estimating temporal transformations rather than using them to predict: extracting slow features (Wiskott & Sejnowski, 2002), or finding dictionaries and sparse codes with slowly varying amplitudes and phases (Cadieu & Olshausen, 2012). In video processing and computer vision, a common strategy for temporal prediction is to first estimate local translational motion, and then to copy and paste content accordingly to predict the next frame. Such motion compensation is an important component in making compression schemes like MPEG successful (Wiegand et al., 2003). These video coding standards are the fruit of major engineering efforts; they make digital video communication feasible and are widely used. But motion estimation is a difficult nonlinear problem, and existing methods fail in regions where temporal evolution is not translational: for example, expanding or rotating motion, discontinuous motion at occlusion boundaries, or mixtures of motion arising from semi-transparent surfaces (e.g., viewing the world through a dirty pane of glass). In compression schemes, these failures of motion estimation lead to prediction errors, which must then be fixed by sending additional corrective bits. Human perception does not seem to suffer from such failures: at least, our subjective sense is that we can anticipate the time-evolution of visual input even in the vicinity of these commonly occurring non-translational changes.
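The motion-compensation strategy described above can be sketched in a few lines. The following toy implementation (an illustrative sketch, not the MPEG algorithm; the function name, exhaustive block search, and constant-velocity extrapolation are our own simplifications) matches each block of the current frame against the previous frame, then assumes the same displacement continues for one more frame:

```python
import numpy as np

def motion_compensated_predict(prev, curr, block=8, search=4):
    """Toy block-matching motion compensation: estimate each block's
    displacement between `prev` and `curr`, then predict the next frame by
    applying the same displacement once more (constant-velocity assumption)."""
    H, W = curr.shape
    pred = np.zeros_like(curr)
    for i in range(0, H - block + 1, block):
        for j in range(0, W - block + 1, block):
            target = curr[i:i+block, j:j+block]
            best_err, best = np.inf, (0, 0)
            # exhaustive search over candidate displacements in prev
            for di in range(-search, search + 1):
                for dj in range(-search, search + 1):
                    ii, jj = i + di, j + dj
                    if 0 <= ii <= H - block and 0 <= jj <= W - block:
                        err = np.sum((prev[ii:ii+block, jj:jj+block] - target) ** 2)
                        if err < best_err:
                            best_err, best = err, (di, dj)
            # content moved from (i+di, j+dj) to (i, j); assume it moves again
            di, dj = best
            pred[i:i+block, j:j+block] = curr[i+di:i+di+block, j+dj:j+dj+block]
    return pred
```

Note that this sketch already exhibits the failure modes discussed above: blocks containing rotation, expansion, or occlusion boundaries have no single translational displacement, and the copied content will be wrong there.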
In fact, those changes are often the most informative ones, as they reveal object boundaries and provide ordinal depth and other information about the visual scene. This may imply that humans use a different strategy, perhaps bypassing motion estimation altogether, to represent and predict evolving visual input. Toward this end, and inspired by recent hypotheses that primate visual representations support prediction by "straightening" the temporal trajectories of naturally occurring input (Hénaff et al., 2019), we formulate an objective for learning an image representation that facilitates prediction by linearizing the temporal trajectories of frames of natural video. This separation of the instantaneous representation and the temporal prediction is best motivated by considering the behavior of rigidly translating video content when viewed in the frequency domain. First, in section 1.1, we review how translation corresponds to steady phase advancement in the frequency domain, and then, in section 1.2, we explain how this relationship reduces prediction of rigidly translating content to angular extrapolation. We place this observation in the general context of group representation theory in section 1.3. Next, in section 2, we describe how to fit parameterized mappings of individual video frames into complex coefficients that can be temporally predicted by phase advancement. These predicted representations are then used to synthesize an estimated frame, and the entire system is trained end-to-end to minimize next-frame prediction errors. In section 3, we report training results for several such systems and show that they yield systematic improvements in predictive performance over conventional motion compensation methods and direct predictive neural networks. Finally, in section 4, we relate our approach to existing work, and in section 5 we discuss its significance and implications.

1.1. BASE CASE: THE FOURIER SHIFT THEOREM

Our approach is motivated by the well-known behavior of Fourier representations with respect to signal translation. Specifically, the complex exponentials that make up the Fourier basis are the eigenfunctions of the translation operator, and translation of inputs produces systematic phase advances of frequency coefficients. Let $x \in \mathbb{R}^N$ be a discrete signal indexed by spatial location $n \in [0, N-1]$, and let $\tilde{x} \in \mathbb{C}^N$ be its Fourier transform indexed by $k \in [0, N-1]$. We write $x_v(n) = x(n-v)$, the translation of $x$ by $v$ modulo $N$ (i.e., circular shift with period $N$). Defining $\omega = e^{-i2\pi/N}$, the primitive $N$-th root of unity, we can express the Fourier shift theorem as:
$$\tilde{x}_v(k) = \sum_{n=0}^{N-1} x(n-v)\,\omega^{kn} = \sum_{m=-v}^{N-1-v} x(m)\,\omega^{km}\,\omega^{kv} = \omega^{kv} \sum_{n=0}^{N-1} x(n)\,\omega^{kn} = \omega^{kv}\,\tilde{x}(k).$$
This relationship may be depicted in a compact diagram:
$$\begin{array}{ccc}
x(n) & \xrightarrow{\ \text{shift}\ } & x(n-v) \\[2pt]
\mathcal{F} \big\downarrow & & \big\uparrow \mathcal{F}^{-1} \\[2pt]
\tilde{x}(k) & \xrightarrow{\ \text{advance phase}\ } & \omega^{kv}\,\tilde{x}(k)
\end{array} \tag{1}$$
where $\mathcal{F}$ indicates the Fourier transform. In the context of our goals, the diagram illustrates the point that transforming to the Fourier domain renders translation a "simpler" operation: a phase advance is a rotation in the two-dimensional (complex) plane.
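The shift theorem can be verified numerically. A minimal NumPy sketch, using the DFT sign convention in which a circular shift by v multiplies the k-th coefficient by exp(-i2πkv/N):

```python
import numpy as np

# Verify the Fourier shift theorem: circularly shifting a signal by v
# multiplies its k-th Fourier coefficient by omega**(k*v), where
# omega = exp(-2i*pi/N) is the primitive N-th root of unity.
N, v = 16, 3
rng = np.random.default_rng(0)
x = rng.standard_normal(N)

x_v = np.roll(x, v)                      # x_v(n) = x(n - v), modulo N
omega = np.exp(-2j * np.pi / N)
k = np.arange(N)

lhs = np.fft.fft(x_v)                    # transform of the shifted signal
rhs = omega ** (k * v) * np.fft.fft(x)   # phase-advanced transform of the original
assert np.allclose(lhs, rhs)
```

Note that `np.fft.fft` uses exactly this sign convention, so the phase "advance" here is a clockwise rotation of each coefficient in the complex plane, at a rate proportional to its frequency k.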

1.2. PREDICTION VIA ANGULAR EXTRAPOLATION

Now consider observations of a signal that translates at a constant velocity over time, $x(n, t) = y(n - vt)$. Although the temporal evolution is easy to describe, the trajectory of the signal through signal space is quite complicated, rendering prediction difficult. As an example, Figure 1 shows a signal consisting of a sum of two sinusoidal components. Transforming the signal to the Fourier domain simplifies the description: the translational motion now corresponds to circular motion of the two (complex-valued) Fourier coefficients associated with the constituent sinusoids. The motion is further simplified by a polar coordinate transform that extracts the phase and amplitude of each Fourier coefficient. Specifically, the motion now follows a straight trajectory, with both phases advancing linearly (but at different rates) and both amplitudes constant. Note that this is a geometric property that holds for any rigidly translating signal, and it offers a simple means of predicting content over time. Indeed, we can apply the shift property to $x(n, t+1) = x_v(n, t)$ and observe that prediction
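Angular extrapolation from two observed frames can be sketched directly in the Fourier domain. The following is a minimal illustration of the geometric idea, not the learned model introduced later; the function name and the normalization guard are our own. Each coefficient keeps its amplitude, and its phase is advanced by the same rotation observed between the two previous frames:

```python
import numpy as np

def polar_extrapolate(frame_prev, frame_curr):
    """Predict the next frame by straight-line (angular) extrapolation of
    each Fourier coefficient: amplitudes held constant, phases advanced by
    the per-coefficient rotation observed between the two given frames."""
    F_prev, F_curr = np.fft.fft(frame_prev), np.fft.fft(frame_curr)
    # unit-modulus rotation carrying each coefficient from frame t-1 to t
    rot = F_curr * np.conj(F_prev)
    rot /= np.abs(rot) + 1e-12     # normalize; guard near-zero coefficients
    F_next = F_curr * rot          # amplitude kept, phase advanced once more
    return np.real(np.fft.ifft(F_next))
```

For any rigidly translating signal this recovers the next frame exactly (up to the numerical guard), since every coefficient rotates at the constant rate $\omega^{kv}$; for natural video, where amplitudes also vary, it serves only as the motivating special case.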

