KEYCLD: LEARNING CONSTRAINED LAGRANGIAN DYNAMICS IN KEYPOINT COORDINATES FROM IMAGES

Abstract

We present KeyCLD, a framework to learn Lagrangian dynamics from images. Learned keypoint representations derived from images are directly used as the positional state vector for jointly learning constrained Lagrangian dynamics. KeyCLD is trained unsupervised, end-to-end, on sequences of images. Our method explicitly models the mass matrix, potential energy and input matrix, thus allowing energy-based control. We demonstrate learning of Lagrangian dynamics from images on the dm_control pendulum, cartpole and acrobot environments, whether they are unactuated, underactuated or fully actuated. Trained models are able to produce long-term video predictions, showing that the dynamics are accurately learned. Our method strongly outperforms recent works on learning Lagrangian or Hamiltonian dynamics from images. The benefits of including a Lagrangian prior and prior knowledge of a constraint function are further investigated and empirically evaluated.

1. INTRODUCTION AND RELATED WORK

Learning dynamical models from data is a crucial aspect of striving towards intelligent agents interacting with the physical world. Understanding the dynamics and being able to predict future states is paramount for controlling autonomous systems or robots interacting with their environment. For many dynamical systems, the equations of motion can be derived from scalar functions such as the Lagrangian or Hamiltonian. This strong physics prior enables more data-efficient learning and preserves energy-conserving properties. Greydanus et al. (2019) introduced Hamiltonian neural networks. By using Hamiltonian mechanics as an inductive bias, the model respects exact energy conservation laws.

Learning Lagrangian dynamics from images It is often not possible to observe the full state of a system directly. Cameras provide a rich information source, containing the full state when properly positioned. However, the difficulty lies in interpreting the images and extracting the underlying state. As was recently argued by Lutter & Peters (2021), learning Lagrangian or Hamiltonian dynamics from realistic renderings remains an open challenge. The majority of related work (Greydanus et al., 2019; Toth et al., 2020; Saemundsson et al., 2020; Allen-Blanchette et al., 2020; Botev et al., 2021) uses a variational auto-encoder (VAE) framework to represent the state in a latent space embedding. The dynamics model is expressed in this latent space. Zhong & Leonard (2020) use interpretable coordinates, but need full knowledge of the kinematic chain, and the images must be segmented per object. Table 1 provides an overview of closely related work in the literature.

Table 1: An overview of closely related Lagrangian or Hamiltonian models. Lag-caVAE (Zhong & Leonard, 2020) is capable of modelling external forces and learning from images, but individual moving bodies need to be segmented in the images, on a black background.
It additionally needs full knowledge of the kinematic chain, which is more prior information than the constraint function necessary for our method (see Section 2). HGN (Toth et al., 2020) needs no prior knowledge of the kinematic chain, but is unable to model external forces. CHNN (Finzi et al., 2020) expresses Lagrangian or Hamiltonian dynamics in Cartesian coordinates, but cannot be learned from images. Our method, KeyCLD, is capable of learning Lagrangian dynamics with external forces, from unsegmented images with shadows, reflections and backgrounds.

Contributions

(1) We introduce KeyCLD, a framework to learn constrained Lagrangian dynamics from images. We are the first to use learned keypoint representations from images to learn Lagrangian dynamics. We show that keypoint representations derived from images can directly be used as the positional state vector for learning constrained Lagrangian dynamics, expressed in Cartesian coordinates. (2) We show how to control constrained Lagrangian dynamics in Cartesian coordinates with energy shaping, where the state is estimated from images. (3) We adapt the pendulum, cartpole and acrobot



Figure 1: KeyCLD learns Lagrangian dynamics from images. (a) An observation of a dynamical system is processed by a learned keypoint estimator model. (b) The model represents the positions of the keypoints with a set of spatial probability heatmaps. (c) Cartesian coordinates are extracted using spatial softmax and used as positional state vector to learn Lagrangian dynamics. (d) The information in the keypoint coordinates bottleneck suffices for a learned renderer model to reconstruct the original observation, including background, reflections and shadows. The keypoint estimator model, Lagrangian dynamics models and renderer model are jointly learned unsupervised on sequences of images.
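The coordinate-extraction step in (c) can be sketched as follows. This is an illustrative implementation of spatial softmax, not KeyCLD's actual code; the heatmap size and peak value are made up for the example.

```python
import numpy as np

def spatial_softmax(heatmap):
    """Return the expected (x, y) coordinate of a 2D score heatmap.

    A softmax over all pixels turns raw scores into a probability
    distribution; the keypoint coordinate is the probability-weighted
    mean pixel position, with both axes normalised to [-1, 1].
    """
    h, w = heatmap.shape
    probs = np.exp(heatmap - heatmap.max())
    probs /= probs.sum()
    xs = np.linspace(-1.0, 1.0, w)          # column centres in [-1, 1]
    ys = np.linspace(-1.0, 1.0, h)          # row centres in [-1, 1]
    x = (probs.sum(axis=0) * xs).sum()      # marginal over rows -> x
    y = (probs.sum(axis=1) * ys).sum()      # marginal over columns -> y
    return x, y

# A sharp peak at (row 2, col 6) of an 8x8 map; the large score makes
# the softmax nearly deterministic, so the coordinate sits at the peak.
hm = np.zeros((8, 8))
hm[2, 6] = 50.0
x, y = spatial_softmax(hm)  # x close to 6/(8-1) mapped to [-1,1], i.e. 5/7
```

Because the expectation is differentiable, gradients flow from the dynamics and reconstruction losses back into the keypoint estimator, which is what lets the whole pipeline train end-to-end.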

Lutter et al. (2018; 2019) pioneered the use of Lagrangian mechanics as a physics prior for learning dynamical models from data. Cranmer et al. (2020) expanded this idea to a more general setting: by modelling the Lagrangian itself with a neural network instead of explicitly modelling mechanical kinetic energy, they can model physical systems beyond classical mechanics. Zhong et al. (2020) included external input forces and energy dissipation, and introduced energy-based control by leveraging the learned energy models. Finzi et al. (2020) introduced learning of Lagrangian or Hamiltonian dynamics in Cartesian coordinates, with explicit constraints. This enables more data-efficient models, at the cost of providing extra knowledge about the system in the form of a constraint function.
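As a reference for what explicit constraints mean here: with Cartesian positions $x$, mass matrix $M(x)$, potential energy $V(x)$, generalised force $F$ and a holonomic constraint function $\Phi(x) = 0$, constrained Lagrangian dynamics are commonly written as a differential-algebraic system. The notation below is the standard textbook form, not necessarily the exact formulation used by Finzi et al. (2020):

```latex
% Euler-Lagrange equations with holonomic constraints \Phi(x) = 0,
% enforced through Lagrange multipliers \lambda:
\begin{aligned}
  M(x)\,\ddot{x} &= F - \nabla_x V(x) - D\Phi(x)^{\top}\lambda, \\
  D\Phi(x)\,\ddot{x} &= -\tfrac{\mathrm{d}}{\mathrm{d}t}\bigl(D\Phi(x)\bigr)\,\dot{x},
\end{aligned}
```

where the second line is $\Phi(x) = 0$ differentiated twice in time. Both lines together form a linear system in $(\ddot{x}, \lambda)$ that can be solved at every integration step, which is why supplying $\Phi$ trades extra prior knowledge for simpler, more data-efficient energy models in Cartesian coordinates.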

Keypoints Instead of using VAE-inspired latent embeddings, our method leverages fully convolutional keypoint estimator models to observe the state from images. Because the model is fully convolutional, it is also translation equivariant, which leads to higher data efficiency. Objects can be represented with one or more keypoints, fully capturing their position and orientation. Zhou et al. (2019) used keypoints for object detection with great success. Keypoint detectors are commonly used for human pose estimation (Zheng et al., 2020). More closely related to this work, keypoints can be learned for control and robotic manipulation (Chen et al., 2021; Vecerik et al., 2021). Minderer et al. (2019) learn unsupervised keypoints from videos to represent objects and dynamics. Jaques et al. (2021) leverage keypoints for system identification and dynamic modelling. Jakab et al. (2018) learn a keypoint representation unsupervised by using it as an information bottleneck for reconstructing images. The keypoints represent semantic landmarks in the images and generalise well to unseen data. This work is the main inspiration for the use of keypoints in our method.
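The translation-equivariance property underlying this data-efficiency argument can be checked with a toy 1-D convolution; the filter and signal below are illustrative, not part of KeyCLD. Shifting the input shifts the output by the same amount, so a feature learned at one image position transfers to every other position for free.

```python
import numpy as np

def conv_valid(signal, kernel):
    """Plain 1-D 'valid' cross-correlation, as used in convolutional layers."""
    n = len(signal) - len(kernel) + 1
    return np.array([np.dot(signal[i:i + len(kernel)], kernel) for i in range(n)])

kernel = np.array([1.0, -2.0, 1.0])   # an arbitrary learned filter
signal = np.zeros(16)
signal[5] = 1.0                       # a 'feature' at position 5
shifted = np.roll(signal, 3)          # the same feature moved to position 8

out = conv_valid(signal, kernel)
out_shifted = conv_valid(shifted, kernel)

# Equivariance: convolving the shifted input equals shifting the output.
equivariant = np.allclose(np.roll(out, 3), out_shifted)
```

The same reasoning carries over to 2-D convolutions over images, which is why a fully convolutional keypoint estimator does not have to relearn an object's appearance at every location it may appear.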

