LEARNING GEOMETRIC REPRESENTATIONS OF INTERACTIVE OBJECTS

Abstract

We address the problem of learning geometric representations from observations perceived by an agent operating within an environment and interacting with an external object. To this end, we propose a representation learning framework that extracts the state of both the agent and the object from unstructured observations of arbitrary nature (e.g., images). Supervision comes from the performed actions alone, while the dynamics of the object are assumed to be unknown. We provide a theoretical foundation and formally prove that an ideal learner is guaranteed to infer an isometric representation, disentangling the agent from the object. Finally, we empirically investigate our framework in a variety of scenarios. Results show that our model reliably infers the correct representation and outperforms vision-based approaches such as a state-of-the-art keypoint extractor.

1. INTRODUCTION

A fundamental aspect of intelligent behavior on the part of an agent is building rich and structured representations of the surrounding world (Ha & Schmidhuber (2018)). Through structure, in fact, a representation can potentially achieve semantic understanding, efficient reasoning and generalization (Lake et al. (2017)). However, in a realistic scenario an agent perceives unstructured and high-dimensional observations of the world (e.g., images). The ultimate goal of inferring a representation thus consists of extracting structure from such observed data (Bengio et al. (2013)). This is challenging and in some instances requires supervision or biases. For example, it is known that disentangling factors in data is mathematically impossible in a completely unsupervised way (Locatello et al. (2019)). In order to extract structure, it is thus necessary to design methods and paradigms relying on additional information and specific assumptions. In the context of an agent interacting with the world, a fruitful source of information is provided by the actions performed and collected together with the observations. Based on that, several recent works have explored the role of actions in representation learning and proposed methods to extract structure from interaction (Kipf et al. (2019); Marchetti et al. (2022)). The common principle underlying this line of research is encouraging the representation to replicate the effect of actions in a structured space, a property referred to as equivariance¹. In particular, it has been shown in Marchetti et al. (2022) that equivariance makes it possible to extract the internal state of the agent (i.e., its pose in space), resulting in a lossless and geometric representation. The question of how to represent features of the world which are extrinsic to the agent (e.g., objects) has been left open. Such features are dynamic since they change as a consequence of interaction.
They are thus challenging to capture in the representation, but are essential for understanding and reasoning on the part of the agent. In this work we consider the problem of learning a representation of an agent together with an external object the agent interacts with (see Figure 1). We focus on a scenario where the object displaces only when it comes in contact with the agent, which is realistic and practical. This makes it possible to design a representation learner that attracts the representation of the object to the one of the agent when interaction happens, and keeps it invariant otherwise. Crucially, we make no assumption on the complexity of the interaction: the object is allowed to displace arbitrarily and its dynamics are unknown. All the losses optimized by our learner rely on supervision from interaction alone, i.e., on observations and performed actions. This makes the framework general and in principle applicable to observations of arbitrary nature. We moreover provide a formalization of the problem and theoretical grounding for the method. Our core theoretical result guarantees that the representation inferred by an ideal learner recovers both the ground-truth states up to a translation. This implies that the representation is isometric (i.e., it fully extracts the geometric structure of states) and disentangles the agent from the object. As a consequence, the representation preserves the geometry of the state space underlying the observations.

Figure 1: Our framework enables learning a representation φ recovering the geometric and disentangled state of both an agent (z_int, white) and an interactable object (z_ext, brown) from unstructured observations o (e.g., images). The only form of supervision comes from the actions a, b performed by the agent, while the transition of the object (question mark) in case of interaction is unknown. In case of no interaction, the object stays invariant.
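To make the two supervision signals concrete, the following minimal per-transition sketch writes them as squared-error terms. The function and variable names are our own, and the choice of which latent states the interaction term ties together is one plausible reading of the attraction principle, not the paper's exact formulation:

```python
import numpy as np

def losses(z_int, z_int_next, z_ext, z_ext_next, action, interacted):
    """Illustrative per-transition losses (our own sketch, not the paper's code).

    z_int / z_ext: latent states of the agent / object before the transition,
    *_next: the same after the transition; action: displacement performed.
    """
    # Equivariance: the agent's latent state should move by the action itself.
    equivariance = np.sum((z_int_next - (z_int + action)) ** 2)
    if interacted:
        # On contact, attract the object's representation to the agent's
        # (here: the agent's post-action latent, i.e., the contact point).
        external = np.sum((z_ext - z_int_next) ** 2)
    else:
        # Without contact, the object's representation should stay invariant.
        external = np.sum((z_ext_next - z_ext) ** 2)
    return equivariance + external
```

A transition that satisfies both properties incurs zero loss; any violation of equivariance, attraction, or invariance is penalized quadratically.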
The preservation of geometry makes the representation lossless, interpretable and disentangled. We empirically show that our representations not only outperform a state-of-the-art keypoint extractor in quality of structure, but can also be leveraged by a downstream learner to solve control tasks efficiently. In summary, our contributions include:

• A representation learning framework extracting geometry from observations in the context of an agent interacting with an external object.

• A theoretical result guaranteeing that the above learning framework, when implemented by an ideal learner, infers an isometric representation for data of arbitrary nature.

• An empirical investigation of the framework on a variety of environments, with comparisons to computer vision approaches (i.e., keypoint extraction) and applications to a control task.

We provide Python code implementing our framework together with all the experiments as part of the supplementary material.
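For intuition on what "recovering the ground-truth states up to a translation" means in practice, such a representation can be checked by comparing latent and state difference vectors. The helper below is our own diagnostic, not a procedure from the paper:

```python
import numpy as np

def recovers_up_to_translation(states, latents, tol=1e-6):
    """True iff latents == states + c for some fixed offset c (our own
    diagnostic for 'recovered up to a translation')."""
    # A map differing from the identity by a translation preserves
    # every pairwise difference vector, hence all distances (isometry).
    return np.allclose(latents - latents[0], states - states[0], atol=tol)

s = np.random.default_rng(1).normal(size=(10, 2))
assert recovers_up_to_translation(s, s + np.array([3.0, -1.0]))  # translated copy
assert not recovers_up_to_translation(s, 2.0 * s)                # scaling breaks it
```

Because translations preserve all pairwise distances, passing this check implies the representation is isometric in the sense used above.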

2. RELATED WORK

Equivariant Representation Learning. Several recent works have explored the idea of incorporating interactions into representation learning. The common principle is to infer a representation which is equivariant, i.e., such that transitions in observations are replicated as transitions in the latent space. One option is to learn the latent transition end-to-end together with the representation (Kipf et al. (2019); van der Pol et al. (2020); Watter et al. (2015)). This approach is however non-interpretable, and the so-obtained representations are not guaranteed to extract any structure. Alternatively, the latent transition can be designed a priori. Linear and affine latent transitions have been considered in Guo et al. (2019), Mondal et al. (2020) and Park et al. (2022), while transitions defined by (the multiplication of) a Lie group have been discussed in Marchetti et al. (2022) and Mondal et al. (2022). As shown in Marchetti et al. (2022), for static scenarios (i.e., with no interactive external objects) the so-obtained representations are structured and completely recover the geometry of the underlying state of the agent. Our framework adheres to this line of research by modelling the latent transitions via the additive Lie group R^n. We however further extend the representation to

¹ Alternative terminologies from the literature are World Model (Kipf et al. (2019)) and Markov Decision Process Homomorphism (van der Pol et al. (2020)).
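As a concrete illustration of an additive latent transition, the following self-contained toy example (our own construction, not the paper's implementation) fits a linear encoder on synthetic observations so that performing an action a translates the latent state by a:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (ours, for illustration): a 2-D state s is observed through an
# unknown linear mixing M, i.e. o = M s, and acting with a yields s' = s + a.
M = rng.normal(size=(4, 2))
s = rng.normal(size=(256, 2))
a = rng.normal(size=(256, 2))
o, o_next = s @ M.T, (s + a) @ M.T

# Learn a linear encoder W by gradient descent on the equivariance loss
# mean ||W o' - (W o + a)||^2, so latent transitions are additive in R^n.
W = rng.normal(size=(2, 4)) * 0.1
for _ in range(5000):
    pred = o_next @ W.T - (o @ W.T + a)          # equivariance residual
    grad = 2.0 * pred.T @ (o_next - o) / len(o)  # gradient of the loss w.r.t. W
    W -= 0.02 * grad

residual = np.abs(o_next @ W.T - (o @ W.T + a)).max()
print(residual)  # small: the encoder replicates actions additively in latent space
```

The encoder is trained from observations and actions alone, without access to the mixing M, mirroring the supervision regime discussed above; a nonlinear encoder would replace W in the general case.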

