UNSUPERVISED OBJECT KEYPOINT LEARNING USING LOCAL SPATIAL PREDICTABILITY

Abstract

We propose PermaKey, a novel approach to representation learning based on object keypoints. It leverages the predictability of local image regions from their spatial neighborhoods to identify salient regions that correspond to object parts, which are then converted to keypoints. Unlike prior approaches, it utilizes predictability, an intrinsic property of objects, to discover object keypoints. This ensures that keypoints are not overly biased toward characteristics that are not unique to objects, such as movement, shape, or colour. We demonstrate the efficacy of PermaKey on Atari, where it learns keypoints corresponding to the most salient object parts and is robust to certain visual distractors. Further, on downstream RL tasks in the Atari domain, we demonstrate how agents equipped with our keypoints outperform those using competing alternatives, even in challenging environments with moving backgrounds or distractor objects.

1. INTRODUCTION

An intelligent agent situated in the visual world critically depends on a suitable representation of its incoming sensory information. For example, a representation that captures only information about relevant aspects of the world makes it easier to learn downstream tasks efficiently (Barlow, 1989; Bengio et al., 2013). Similarly, when abstract concepts such as objects are explicitly distinguished at a representational level, it is easier to generalize (systematically) to novel scenes composed of these same abstract building blocks (Lake et al., 2017; van Steenkiste et al., 2019; Greff et al., 2020). In recent work, several methods have been proposed to learn unsupervised representations of images that aim to facilitate agents in this way (Veerapaneni et al., 2019; Janner et al., 2019). Of particular interest are methods based on learned object keypoints, which correspond to highly informative (salient) regions of the image as indicated by the presence of object parts (Zhang et al., 2018; Jakab et al., 2018; Kulkarni et al., 2019; Minderer et al., 2019). Many real-world tasks primarily revolve around (physical) interactions between objects and agents. Therefore, it is expected that a representation based on a set of task-agnostic object keypoints can be re-purposed to facilitate downstream learning (and generalization) on many different tasks (Lake et al., 2017).

One of the main challenges in learning representations based on object keypoints is to discover salient regions belonging to objects in an image without supervision. Recent methods take an information bottleneck approach, where a neural network is trained to allocate a fixed number of keypoints (and learn corresponding representations) in a way that helps make predictions about an image that has undergone some transformation (Jakab et al., 2018; Minderer et al., 2019; Kulkarni et al., 2019).
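To make the bottleneck idea concrete: in these methods, the predicted keypoint coordinates are typically rendered as fixed-width Gaussian heatmaps, and only these maps (plus local features) are passed to the decoder that reconstructs the transformed image, so the network is forced to place keypoints on regions useful for that prediction. The following is a minimal NumPy sketch of this heatmap-rendering step only (an illustration of the general recipe, not the actual architecture of any of the cited methods); the function name and parameters are our own:

```python
import numpy as np

def gaussian_heatmaps(keypoints, height, width, sigma=2.0):
    """Render each (row, col) keypoint as a 2D Gaussian heatmap.

    Stacked maps of this form act as the spatial bottleneck: the
    decoder sees only where the keypoints are, not the full image.
    """
    rows = np.arange(height)[:, None]  # (H, 1) row coordinates
    cols = np.arange(width)[None, :]   # (1, W) column coordinates
    maps = []
    for (r, c) in keypoints:
        # Squared distance of every pixel to the keypoint, broadcast to (H, W)
        d2 = (rows - r) ** 2 + (cols - c) ** 2
        maps.append(np.exp(-d2 / (2.0 * sigma ** 2)))
    return np.stack(maps)  # (K, H, W), one map per keypoint

# Example: two keypoints on a 16x16 grid
hm = gaussian_heatmaps([(4, 4), (10, 12)], 16, 16)
```

Each map peaks (with value 1) at its keypoint location and decays smoothly around it, so the representation carries positions while discarding appearance detail elsewhere.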
However, keypoints discovered in this way strongly depend on the specific transformation that is considered and therefore lack generality. For example, as we will confirm in our experiments, the recent Transporter (Kulkarni et al., 2019) learns to prioritize image regions that change over time, even when they are otherwise uninformative. Indeed, relying on such extrinsic object properties (i.e. properties that are not unique to objects) makes a method highly susceptible to distractors, as we will demonstrate. In this work, we propose PermaKey, a novel representation learning approach based on object keypoints that does not overly bias keypoints in this way. The key idea underlying our approach is to view objects as local regions in the image that have high internal predictive structure (self-information). We argue that local predictability is an intrinsic property of an object and therefore captures objectness in images more reliably (Alexe et al., 2010). This allows us to formulate a local spatial prediction

