AN INFORMATION-THEORETIC APPROACH TO UNSUPERVISED KEYPOINT REPRESENTATION LEARNING

Abstract

Extracting informative representations from videos is fundamental for effectively learning various downstream tasks. Inspired by classical works on saliency, we present a novel information-theoretic approach to discover meaningful representations from videos in an unsupervised fashion. We argue that the local entropy of pixel neighborhoods, and its evolution in a video stream, is a valuable intrinsic supervisory signal for learning to attend to salient features. We thus abstract visual features into a concise representation of keypoints that act as dynamic information transporters. Thanks to two original information-theoretic losses, we discover spatio-temporally consistent keypoint representations in an unsupervised fashion: the first loss maximizes the information covered by the keypoints in a frame; the second optimizes transportation over time, imposing consistency of the information flow. We compare our keypoint-based representation to state-of-the-art baselines on different downstream tasks, such as learning object dynamics. To evaluate the expressivity and consistency of the keypoints, we propose a new set of metrics. Our empirical results showcase the superior performance of our information-driven keypoints, which resolve challenges such as attending to both static and dynamic objects, and to objects that abruptly enter or leave the scene.
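The notion of local entropy used as a supervisory signal above can be illustrated with a short sketch. The snippet below computes the Shannon entropy of intensity histograms over non-overlapping pixel patches of a grayscale frame; the patch size and bin count are hypothetical choices for illustration, not the paper's actual formulation. High-entropy patches mark informative regions that keypoints should attend to.

```python
import numpy as np

def local_entropy(gray, patch=8, bins=16):
    """Shannon entropy of intensity histograms over non-overlapping patches.

    Illustrative sketch only: `patch` and `bins` are hypothetical choices,
    and `gray` is assumed to hold intensities in [0, 1].
    """
    h, w = gray.shape
    H = np.zeros((h // patch, w // patch))
    for i in range(h // patch):
        for j in range(w // patch):
            block = gray[i * patch:(i + 1) * patch, j * patch:(j + 1) * patch]
            hist, _ = np.histogram(block, bins=bins, range=(0.0, 1.0))
            p = hist / hist.sum()          # empirical intensity distribution
            p = p[p > 0]                   # drop empty bins (0 * log 0 = 0)
            H[i, j] = -(p * np.log2(p)).sum()
    return H

# A uniform patch carries no information; a textured patch carries a lot.
flat = np.zeros((8, 8))
noisy = np.random.default_rng(0).random((8, 8))
frame = np.block([[flat, noisy]])          # one flat and one noisy patch
H = local_entropy(frame)
assert H[0, 0] == 0.0 and H[0, 1] > 2.0
```

In this toy example, the flat patch has zero entropy while the noisy patch approaches the maximum of log2(16) = 4 bits, so an entropy-driven attention signal would concentrate on the textured region.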

1. INTRODUCTION

Humans are remarkable for their ability to form representations of essential visual entities and to store information for effectively learning downstream tasks from experience (Cooper, 1990; Radulescu et al., 2021). Research evidence shows that the human visual system processes visual information in two stages: first, it extracts sparse features of salient objects (Bruce & Tsotsos, 2005); second, it discovers the interrelations of local features, grouping them to find correspondences (Marr, 2010; Kadir & Brady, 2001). For videos with dynamic entities, humans focus not only on dynamic objects but also on the structure of the background scene when it plays a key role in the information flow (Riche et al., 2012; Borji et al., 2012). Ideally, we want a learning algorithm to extract similar abstractions that are useful for various downstream tasks. Notable research works in Computer Vision (CV) and Machine Learning (ML) have proposed different feature representations from pixels for challenging downstream tasks (Szeliski, 2010; Harris et al., 1988; Lowe, 2004; Rublee et al., 2011; Rosten & Drummond, 2006; Mur-Artal et al., 2015). Recent efforts focus on deep learning representations of Points of Interest (PoI) for tasks like localization and pose estimation (DeTone et al., 2018; Florence et al., 2018; Sarlin et al., 2020; Ono et al., 2018; Sarlin et al., 2019; Dusmanu et al., 2019). Keypoints stand out as PoI with a semantic interpretation (Jiang et al., 2009; Alexe et al., 2010), e.g., representing objects (Xiongwei et al., 2020) or the joints of a human pose (Kreiss et al., 2019), and can represent structure useful for learning control (Xiong et al., 2021). Many keypoint detection methods are trained in a supervised fashion, relying on annotations (Cao et al., 2017).
Unsupervised and self-supervised learning methods can compensate for the need for expensive human annotations (Wang et al., 2019; 2020; Minaee et al., 2021; Kim et al., 2019; Yang et al., 2020; Gopalakrishnan et al., 2020; Chen et al., 2019). Current state-of-the-art methods for unsupervised keypoint discovery focus mainly on the dynamic entities in a video (Kulkarni et al., 2019; Minderer et al., 2019), failing to effectively represent both the static and dynamic entities of the scene. Namely, these methods are trained to

Project website: https://sites.google.com/view/mint-iclr

