LEARNING GEOMETRIC REPRESENTATIONS OF INTERACTIVE OBJECTS

Abstract

We address the problem of learning geometric representations from observations perceived by an agent operating within an environment and interacting with an external object. To this end, we propose a representation learning framework that extracts the state of both the agent and the object from unstructured observations of arbitrary nature (e.g., images). Supervision comes from the performed actions alone, while the dynamics of the object are assumed to be unknown. We provide a theoretical foundation and formally prove that an ideal learner is guaranteed to infer an isometric representation, disentangling the agent from the object. Finally, we empirically investigate our framework on a variety of scenarios. Results show that our model reliably infers the correct representation and outperforms vision-based approaches such as a state-of-the-art keypoint extractor.

1. INTRODUCTION

A fundamental aspect of intelligent behavior on the part of an agent is building rich and structured representations of the surrounding world (Ha & Schmidhuber (2018)). Through structure, in fact, a representation can potentially achieve semantic understanding, efficient reasoning and generalization (Lake et al. (2017)). However, in a realistic scenario an agent perceives unstructured and high-dimensional observations of the world (e.g., images). The ultimate goal of inferring a representation thus consists of extracting structure from such observed data (Bengio et al. (2013)). This is challenging and in some instances requires supervision or biases. For example, it is known that disentangling factors in data is mathematically impossible in a completely unsupervised way (Locatello et al. (2019)). In order to extract structure, it is thus necessary to design methods and paradigms relying on additional information and specific assumptions.

In the context of an agent interacting with the world, a fruitful source of information is provided by the actions performed and collected together with the observations. Based on that, several recent works have explored the role of actions in representation learning and proposed methods to extract structure from interaction (Kipf et al. (2019); Mondal et al. (2020); Park et al. (2022)). The common principle underlying this line of research is encouraging the representation to replicate the effect of actions in a structured space -- a property referred to as equivariance¹. In particular, it has been shown in Marchetti et al. (2022) that equivariance makes it possible to extract the internal state of the agent (i.e., its pose in space), resulting in a lossless and geometric representation. The question of how to represent features of the world which are extrinsic to the agent (e.g., objects) has been left open. Such features are dynamic since they change as a consequence of interaction.
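The equivariance principle can be made concrete with a minimal numerical sketch. Everything below is a hypothetical illustration, not the actual model from this line of work: the encoder `phi` is a stand-in linear map (in practice it would be a neural network), and we assume actions act on the latent space by translation.

```python
import numpy as np

# Minimal sketch of the equivariance principle (hypothetical stand-in):
# an encoder phi should map observations so that performing action a in
# the world corresponds to applying it in latent space, i.e.
#   phi(obs_next) ≈ phi(obs_t) + a.

rng = np.random.default_rng(0)
W = rng.normal(size=(2, 8))  # stand-in linear "encoder"; a neural network in practice

def phi(obs):
    """Encode a raw 8-dim observation into a 2-D latent state."""
    return W @ obs

def equivariance_loss(obs_t, obs_next, action):
    """Penalize deviation of the observed transition from a latent translation by `action`."""
    return float(np.sum((phi(obs_next) - (phi(obs_t) + action)) ** 2))
```

Training a representation then amounts to minimizing this loss over collected (observation, action, next observation) triplets; a transition consistent with the action incurs near-zero loss, while an inconsistent one is penalized.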
They are thus challenging to capture in the representation, yet they are essential for understanding and reasoning on the part of the agent. In this work we consider the problem of learning a representation of an agent together with an external object the agent interacts with (see Figure 1). We focus on a scenario where the object displaces only when it comes in contact with the agent, which is realistic and practical. This makes it possible to design a representation learner that attracts the representation of the object to that of the agent when interaction happens, and keeps it invariant otherwise. Crucially, we make no assumption on the complexity of the interaction: the object is allowed to displace arbitrarily and its dynamics are unknown. All the



¹ Alternative terminologies from the literature are World Model (Kipf et al. (2019)) and Markov Decision Process Homomorphism (van der Pol et al. (2020)).




