UNSUPERVISED OBJECT KEYPOINT LEARNING USING LOCAL SPATIAL PREDICTABILITY

Abstract

We propose PermaKey, a novel approach to representation learning based on object keypoints. It leverages the predictability of local image regions from their spatial neighborhoods to identify salient regions that correspond to object parts, which are then converted to keypoints. Unlike prior approaches, it utilizes predictability, an intrinsic property of objects, to discover keypoints. This ensures that keypoints are not overly biased towards characteristics that are not unique to objects, such as movement, shape, or colour. We demonstrate the efficacy of PermaKey on Atari, where it learns keypoints corresponding to the most salient object parts and is robust to certain visual distractors. Further, on downstream RL tasks in the Atari domain, we demonstrate that agents equipped with our keypoints outperform those using competing alternatives, even in challenging environments with moving backgrounds or distractor objects.

1. INTRODUCTION

An intelligent agent situated in the visual world critically depends on a suitable representation of its incoming sensory information. For example, a representation that captures only information about relevant aspects of the world makes it easier to learn downstream tasks efficiently (Barlow, 1989; Bengio et al., 2013). Similarly, when abstract concepts such as objects are explicitly distinguished at a representational level, it is easier to generalize (systematically) to novel scenes that are composed of these same abstract building blocks (Lake et al., 2017; van Steenkiste et al., 2019; Greff et al., 2020). In recent work, several methods have been proposed to learn unsupervised representations of images that aim to facilitate agents in this way (Veerapaneni et al., 2019; Janner et al., 2019). Of particular interest are methods based on learned object keypoints, which correspond to highly informative (salient) regions in the image as indicated by the presence of object parts (Zhang et al., 2018; Jakab et al., 2018; Kulkarni et al., 2019; Minderer et al., 2019). Many real-world tasks primarily revolve around (physical) interactions between objects and agents. Therefore, it is expected that a representation based on a set of task-agnostic object keypoints can be re-purposed to facilitate downstream learning (and generalization) on many different tasks (Lake et al., 2017).

One of the main challenges for learning representations based on object keypoints is to discover salient regions belonging to objects in an image without supervision. Recent methods take an information-bottleneck approach, where a neural network is trained to allocate a fixed number of keypoints (and learn corresponding representations) in a way that helps make predictions about an image that has undergone some transformation (Jakab et al., 2018; Minderer et al., 2019; Kulkarni et al., 2019).
However, keypoints that are discovered in this way strongly depend on the specific transformation that is considered and therefore lack generality. For example, as we will confirm in our experiments, the recent Transporter (Kulkarni et al., 2019) learns to prioritize image regions that change over time, even when they are otherwise uninformative. Indeed, relying on extrinsic object properties (i.e. properties that are not unique to objects) makes a method highly susceptible to distractors, as we will demonstrate.

In this work, we propose PermaKey, a novel representation learning approach based on object keypoints that does not overly bias keypoints in this way. The key idea underlying our approach is to view objects as local regions in the image that have high internal predictive structure (self-information). We argue that local predictability is an intrinsic property of an object and therefore captures objectness in images more reliably (Alexe et al., 2010). This allows us to formulate a local spatial prediction problem to infer which image regions contain object parts. We perform this prediction task in the learned feature space of a convolutional neural network (CNN) to assess predictability based on a rich collection of learned low-level features. Using PointNet (Jakab et al., 2018), we can then convert these predictability maps into highly informative object keypoints.

We extensively evaluate our approach on a number of Atari environments and compare to Transporter (Kulkarni et al., 2019). We demonstrate that our method discovers keypoints focusing on image regions that are unpredictable, which often correspond to salient object parts. By leveraging local predictability to learn about objects, our method profits from a simpler yet more generalizable definition of an object.
Indeed, we demonstrate that it learns keypoints that do not solely focus on temporal motion (or any other extrinsic object property) and is more robust to uninformative (but predictable) distractors in the environment, such as a moving background. On Atari games, agents equipped with our keypoints outperform those using Transporter keypoints. Our method performs well even in challenging environments such as Battlezone, which involves shifting viewpoints, where Transporter's explicit motion bias fails to capture any task-relevant objects. As a final contribution, we investigate the use of graph neural networks (Battaglia et al., 2018) for processing keypoints, which potentially better accommodates their discrete nature when reasoning about their interactions, and provide an ablation study.

2. METHOD

To learn representations based on task-agnostic object keypoints, we require a suitable definition of an object that can be applied in an unsupervised manner. At a high level, we define objects as abstract patterns in the visual input that can serve as modular building blocks (i.e. they are self-contained and reusable independent of context) for solving a particular task, in the sense that they can be separately intervened upon or reasoned with (Greff et al., 2020). This lets us treat objects as local regions in input space that have high internal predictive structure, based on the statistical co-occurrence of features such as color, shape, etc. across a large number of samples. Hence, our focus is on their local predictability, which can be viewed as an "intrinsic" object property according to this definition. For example, Bruce & Tsotsos (2005) have previously shown that local regions with high self-information typically correspond to salient objects. More generally, self-information approximated via a set of cues involving center-surround feature differences has been used to quantify objectness (Alexe et al., 2010).

In this paper we introduce Prediction ERror MAp based KEYpoints (PermaKey), which leverages this definition to learn about object keypoints and corresponding representations. The main component of PermaKey is a local spatial prediction network (LSPN), which is trained to solve a local spatial prediction problem in feature space (light-blue trapezoid in Figure 1). It involves predicting the value of a feature from its surrounding neighbours, which can only be done accurately when they belong to the same object. Hence, the error map (predictability map) that is obtained by evaluating the LSPN at different locations carves the feature space up into regions that have high internal predictive structure (see rows 4 & 5 in Figure 2(a)). In what follows, we delve into each of the three modules that constitute PermaKey.
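To make the local spatial prediction task concrete, the following is a minimal numpy sketch, not the paper's implementation: where the paper trains a neural LSPN on CNN features, a shared least-squares linear predictor stands in, and the feature map, window size `k`, and edge padding are illustrative assumptions. Each location's feature vector is predicted from its surrounding neighbours, and the squared residual forms the predictability (error) map.

```python
import numpy as np

def error_map(features, k=3):
    """Compute a predictability (error) map for a feature map.

    features: array of shape (H, W, C), e.g. from some CNN layer.
    A single linear predictor (a stand-in for the learned LSPN) estimates
    each centre feature from its k*k - 1 neighbours; the mean squared
    residual at every location forms the map.
    """
    H, W, C = features.shape
    pad = k // 2
    padded = np.pad(features, ((pad, pad), (pad, pad), (0, 0)), mode="edge")

    # Gather neighbourhoods (excluding the centre) and centre targets.
    X, y = [], []
    for i in range(H):
        for j in range(W):
            patch = padded[i:i + k, j:j + k, :].reshape(k * k, C)
            neigh = np.delete(patch, k * k // 2, axis=0).reshape(-1)
            X.append(neigh)
            y.append(features[i, j, :])
    X = np.asarray(X)          # (H*W, (k*k - 1)*C)
    y = np.asarray(y)          # (H*W, C)

    # Fit the shared predictor and evaluate it everywhere.
    W_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
    residual = y - X @ W_ls

    # High values = poorly predictable regions, i.e. keypoint candidates.
    return (residual ** 2).mean(axis=1).reshape(H, W)
```

Regions whose features co-vary smoothly (background) yield low error, while boundaries and unpredictable object parts stand out, which is the signal PointNet then converts into keypoints.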



Figure 1: PermaKey consists of three modules (encapsulated by the dotted lines): learning a suitable spatial feature embedding (1), solving a local spatial prediction task (2), and converting error maps to keypoints (3). The objective functions used to learn each of the three modules are shown within the dotted blocks.
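The third module converts error maps to keypoint coordinates via PointNet (Jakab et al., 2018). A common ingredient of such keypoint bottlenecks is a spatial softmax that turns each heatmap channel into a distribution and reads off its expected coordinate; the sketch below shows only that step, under the assumption (not stated verbatim in this section) that PointNet's heads produce one heatmap per keypoint.

```python
import numpy as np

def softmax_keypoints(heatmaps, temperature=1.0):
    """Convert per-keypoint heatmaps (K, H, W) to (x, y) coords in [-1, 1].

    A spatial softmax normalizes each map into a probability distribution;
    the keypoint is the expected coordinate under that distribution.
    """
    K, H, W = heatmaps.shape
    flat = heatmaps.reshape(K, -1) / temperature
    flat = flat - flat.max(axis=1, keepdims=True)   # numerical stability
    probs = np.exp(flat) / np.exp(flat).sum(axis=1, keepdims=True)
    probs = probs.reshape(K, H, W)

    ys = np.linspace(-1.0, 1.0, H)
    xs = np.linspace(-1.0, 1.0, W)
    y = (probs.sum(axis=2) * ys).sum(axis=1)   # expectation over rows
    x = (probs.sum(axis=1) * xs).sum(axis=1)   # expectation over columns
    return np.stack([x, y], axis=1)            # (K, 2)
```

Because the expectation is differentiable, gradients from a downstream reconstruction or RL loss can flow back through the keypoint locations, which is what allows the whole pipeline to be trained end to end.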

