AN INFORMATION-THEORETIC APPROACH TO UNSUPERVISED KEYPOINT REPRESENTATION LEARNING

Abstract

Extracting informative representations from videos is fundamental for effectively learning various downstream tasks. Inspired by classical works on saliency, we present a novel information-theoretic approach to discover meaningful representations from videos in an unsupervised fashion. We argue that the local entropy of pixel neighborhoods, and its evolution in a video stream, is a valuable intrinsic supervisory signal for learning to attend to salient features. We thus abstract visual features into a concise representation of keypoints that act as dynamic information transporters. Thanks to two original information-theoretic losses, we discover spatio-temporally consistent keypoint representations without supervision. The first loss maximizes the information covered by the keypoints in a frame; the second optimizes transportation over time, imposing consistency of the information flow. We compare our keypoint-based representation to state-of-the-art baselines on different downstream tasks, such as learning object dynamics. To evaluate the expressivity and consistency of the keypoints, we propose a new set of metrics. Our empirical results showcase the superior performance of our information-driven keypoints, which resolve challenges like attending to both static and dynamic objects, and to objects that abruptly enter and leave the scene.

1. INTRODUCTION

Humans are remarkable for their ability to form representations of essential visual entities and to store information for effectively learning downstream tasks from experience (Cooper, 1990; Radulescu et al., 2021). Research evidence shows that the human visual system processes visual information in two stages: first, it extracts sparse features of salient objects (Bruce & Tsotsos, 2005); second, it discovers the interrelations of local features, grouping them to find correspondences (Marr, 2010; Kadir & Brady, 2001). For videos with dynamic entities, humans focus not only on dynamic objects but also on the structure of the background scene when it plays a key role in the information flow (Riche et al., 2012; Borji et al., 2012). Ideally, we want a learning algorithm to extract similar abstractions that are useful for various downstream tasks. Notable research in Computer Vision (CV) and Machine Learning (ML) has proposed different feature representations from pixels for challenging downstream tasks (Szeliski, 2010; Harris et al., 1988; Lowe, 2004; Rublee et al., 2011; Rosten & Drummond, 2006; Mur-Artal et al., 2015). Recent efforts focus on deep learning representations of Points of Interest (PoI) for tasks like localization and pose estimation (DeTone et al., 2018; Florence et al., 2018; Sarlin et al., 2020; Ono et al., 2018; Sarlin et al., 2019; Dusmanu et al., 2019). Keypoints stand out as PoI with semantic interpretation (Jiang et al., 2009; Alexe et al., 2010), e.g., representing objects (Xiongwei et al., 2020) or the joints of a human pose (Kreiss et al., 2019), and can represent structure useful for learning control (Xiong et al., 2021). Many keypoint detection methods are trained in a supervised fashion, relying on annotations (Cao et al., 2017).
Unsupervised and self-supervised learning methods can compensate for the need for expensive human annotations (Wang et al., 2019; 2020; Minaee et al., 2021; Kim et al., 2019; Yang et al., 2020; Gopalakrishnan et al., 2020; Chen et al., 2019). Current state-of-the-art methods for unsupervised keypoint discovery mainly focus on the dynamic entities in a video (Kulkarni et al., 2019; Minderer et al., 2019) and do not effectively represent both the static and the dynamic entities of a scene. Namely, these methods are trained to reconstruct differences between frames and cannot easily disambiguate occlusions or consistently represent objects that randomly appear and then disappear in a video stream.

This work introduces Maximum Information keypoiNTs (MINT), an information-theoretic treatment of keypoint-based representation learning that considers keypoints as the "transporters" of prominent information in a frame and, subsequently, through a video stream. Our proposed method relies on local entropy computed in neighborhoods (patches) around candidate keypoints. We argue that image entropy, and its changes over time, provides a strong inductive bias for training keypoints to represent salient objects, as early works in saliency detection pointed out (Kadir & Brady, 2001; Bruce & Tsotsos, 2005). To compute the entropy, we introduce a novel, efficient entropy layer that operates locally on image patches. MINT maximizes both the image entropy covered by the keypoints within a frame and the conditional entropy covered across frames. To do so, MINT relies on an original formulation of unsupervised keypoint representation learning, with loss functions that maximize the image entropy represented by the keypoints and the information they transport across frames, imposing spatio-temporal consistency of the represented entities.
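To make the local-entropy signal concrete, the following sketch computes per-pixel Shannon entropy from an intensity histogram over a small neighborhood of a quantized grayscale image. This is a minimal, unvectorized illustration under our own assumptions (function name, patch radius, and bin count are illustrative choices); it is not the paper's efficient entropy layer:

```python
import numpy as np

def local_entropy(img, radius=3, bins=16):
    """Per-pixel Shannon entropy (in bits) of the surrounding patch.

    img: 2-D uint8 grayscale image.
    radius: patch half-size, i.e., patches are (2*radius+1) x (2*radius+1).
    bins: number of intensity levels used for the histogram.
    """
    h, w = img.shape
    # Quantize intensities into `bins` levels to keep histograms small.
    q = (img.astype(np.float64) / 256.0 * bins).astype(np.int64)
    ent = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            # Clip the patch at the image borders.
            y0, y1 = max(0, i - radius), min(h, i + radius + 1)
            x0, x1 = max(0, j - radius), min(w, j + radius + 1)
            counts = np.bincount(q[y0:y1, x0:x1].ravel(), minlength=bins)
            p = counts / counts.sum()
            p = p[p > 0]  # 0 * log(0) := 0
            ent[i, j] = -np.sum(p * np.log2(p))
    return ent
```

A uniform patch yields zero entropy, while textured or cluttered patches yield high entropy, which is exactly the property that makes this map usable as a saliency-like supervisory signal.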
We provide qualitative and quantitative empirical results on four different video datasets that allow us to unveil the representation power of MINT against strong baselines of unsupervised keypoint discovery. Unsupervised keypoint representation learning is challenging to benchmark due to the absence of designated metrics and datasets. We, therefore, provide a new set of metrics with a downstream task in the domain of multiple object detection and tracking, based on CLEVRER (Yi et al., 2019). Moreover, we provide results on two challenging datasets (Sharma et al., 2018; Memmesheimer et al., 2019) that contain interesting dynamic scenes of varying difficulty (close-up frames with dynamic interactions vs. high-resolution wide frames with clutter). We show that MINT economizes the use of keypoints, deactivating superfluous ones when the information is already well covered and dynamically activating them to represent new entities that temporarily enter the scene. Finally, to demonstrate the suitability of MINT as a representation for control, we devise an imitation learning downstream task based on the toy environments of the MAGICAL benchmark (Toyer et al., 2020). To summarize, our contributions are: (1) an original information-theoretic approach to unsupervised keypoint representation learning that uses local image entropy as a training inductive bias, seeking to maximize the represented information in videos; (2) an entropy layer to compute local image entropy in patches; (3) an unsupervised way of learning to represent a variable number of entities in video streams by activating/deactivating keypoints to cover the necessary information; (4) a new set of evaluation metrics in a simple and intuitive downstream task for benchmarking the performance of unsupervised keypoint discovery methods.

2. MAXIMUM INFORMATION KEYPOINTS

We propose an unsupervised method for keypoint discovery in videos based on information-theoretic principles. Keypoints should adequately represent the scene and the dynamic changes in it. Starting from the assumption that a keypoint represents a patch of information on the image, we leverage local image entropy to measure the representation power of keypoints in terms of the amount of information they transmit. Consequently, we argue that keypoints should cover areas of the image that are rich in information, while the number of keypoints should dynamically adapt to represent new information. Finally, keypoints should consistently represent the same information pattern spatio-temporally in a video. With this motivation, we propose maximizing the information covered by the keypoint representation in a video through two novel losses based on information-theoretic measures: (1) an information maximization loss that encourages the keypoints to cover areas with high entropy in a single frame; (2) an information transportation loss that enables the keypoints to represent the same entity over subsequent frames. In the following, we present these losses and the theoretical analyses supporting their design.
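As a hypothetical illustration of objective (1), the sketch below scores a set of keypoints against a precomputed local-entropy map: each keypoint casts a soft Gaussian "coverage" mask, and the loss rewards covering a large fraction of the image's total entropy. The function name, the Gaussian masks, and the exact functional form are our own illustrative assumptions, not the paper's loss:

```python
import numpy as np

def info_max_loss(kpts, entropy_map, sigma=4.0):
    """Negative fraction of total image entropy covered by the keypoints.

    kpts: iterable of (y, x) pixel coordinates.
    entropy_map: (H, W) array of per-pixel local entropy.
    sigma: spatial extent of each keypoint's soft coverage mask.
    """
    H, W = entropy_map.shape
    ys = np.arange(H, dtype=float)[:, None]
    xs = np.arange(W, dtype=float)[None, :]
    coverage = np.zeros((H, W))
    for ky, kx in kpts:
        g = np.exp(-((ys - ky) ** 2 + (xs - kx) ** 2) / (2 * sigma ** 2))
        coverage = np.maximum(coverage, g)  # soft union of keypoint patches
    covered = (coverage * entropy_map).sum() / (entropy_map.sum() + 1e-8)
    return -covered  # minimizing maximizes the covered entropy fraction
```

Note how the soft union makes overlapping keypoints redundant rather than additive, so under this toy objective keypoints are driven to spread out over distinct high-entropy regions rather than to pile onto one.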

2.1. PIXEL INFORMATION & LOCAL IMAGE ENTROPY

Our information-theoretic approach to unsupervised keypoint discovery requires quantifying the amount of information carried by each pixel location in a single frame. We measure the information of a pixel via Shannon's entropy (Shannon, 2001), based on the probability of each pixel value. Images can be considered lattices, with pixels being the random variables (Li, 2009). We compute the



Project website: https://sites.google.com/view/mint-iclr

