KEYCLD: LEARNING CONSTRAINED LAGRANGIAN DYNAMICS IN KEYPOINT COORDINATES FROM IMAGES

Abstract

We present KeyCLD, a framework to learn Lagrangian dynamics from images. Learned keypoint representations derived from images are directly used as positional state vector for jointly learning constrained Lagrangian dynamics. KeyCLD is trained unsupervised end-to-end on sequences of images. Our method explicitly models the mass matrix, potential energy and input matrix, thus allowing energy-based control. We demonstrate learning of Lagrangian dynamics from images on the dm control pendulum, cartpole and acrobot environments, whether they are unactuated, underactuated or fully actuated. Trained models are able to produce long-term video predictions, showing that the dynamics are accurately learned. Our method strongly outperforms recent works on learning Lagrangian or Hamiltonian dynamics from images. The benefits of including a Lagrangian prior and prior knowledge of a constraint function are further investigated and empirically evaluated.

1. INTRODUCTION AND RELATED WORK

Learning dynamical models from data is a crucial aspect of building intelligent agents that interact with the physical world. Understanding the dynamics and being able to predict future states is paramount for controlling autonomous systems or robots interacting with their environment. For many dynamical systems, the equations of motion can be derived from scalar functions such as the Lagrangian or Hamiltonian. This strong physics prior enables more data-efficient learning and yields energy-conserving properties. Greydanus et al. (2019) introduced Hamiltonian neural networks. By using Hamiltonian mechanics as inductive bias, the model respects exact energy conservation laws. Lutter et al. (2018; 2019) pioneered the use of Lagrangian mechanics as a physics prior for learning dynamical models from data. Cranmer et al. (2020) expanded this idea to a more general setting: by modelling the Lagrangian itself with a neural network instead of explicitly modelling mechanical kinetic energy, they can model physical systems beyond classical mechanics. Zhong et al. (2020) included external input forces and energy dissipation, and introduced energy-based control by leveraging the learned energy models. Finzi et al. (2020) introduced learning of Lagrangian or Hamiltonian dynamics in Cartesian coordinates, with explicit constraints. This enables more data-efficient models, at the cost of providing extra knowledge about the system in the form of a constraint function.

Learning Lagrangian dynamics from images
It is often not possible to observe the full state of a system directly. Cameras provide a rich information source, containing the full state when properly positioned. However, the difficulty lies in interpreting the images and extracting the underlying state. As was recently argued by Lutter & Peters (2021), learning Lagrangian or Hamiltonian dynamics from realistic renderings remains an open challenge.
The majority of related work (Greydanus et al., 2019; Toth et al., 2020; Saemundsson et al., 2020; Allen-Blanchette et al., 2020; Botev et al., 2021) uses a variational auto-encoder (VAE) framework to represent the state in a latent space embedding. The dynamics model is expressed in this latent space. Zhong & Leonard (2020) use interpretable coordinates, but need full knowledge of the kinematic chain, and the images must be segmented per object. Table 1 provides an overview of closely related work in the literature.

Table 1: An overview of closely related Lagrangian or Hamiltonian models. Lag-caVAE (Zhong & Leonard, 2020) is capable of modelling external forces and learning from images, but individual moving bodies need to be segmented in the images, on a black background. It additionally needs full knowledge of the kinematic chain, which is more prior information than the constraint function necessary for our method (see Section 2). HGN (Toth et al., 2020) needs no prior knowledge of the kinematic chain, but is unable to model external forces. CHNN (Finzi et al., 2020) expresses Lagrangian or Hamiltonian dynamics in Cartesian coordinates, but cannot be learned from images. Our method, KeyCLD, is capable of learning Lagrangian dynamics with external forces, from unsegmented images with shadows, reflections and backgrounds.

                                 HGN   Lag-caVAE   CHNN   KeyCLD
External forces (control)               ✓                  ✓
Interpretable coordinates               ✓           ✓      ✓
Cartesian coordinates                               ✓      ✓
Learns from images                ✓     ✓                  ✓
Learns from unsegmented images    ✓                        ✓
Needs kinematic chain prior             ✓
Needs constraint prior                              ✓      ✓

Keypoints
Instead of using VAE-inspired latent embeddings, our method leverages fully convolutional keypoint estimator models to observe the state from images. Because the model is fully convolutional, it is translation equivariant, which leads to higher data efficiency. Objects can be represented with one or more keypoints, fully capturing position and orientation. Zhou et al. (2019) used keypoints for object detection, with great success. Keypoint detectors are commonly used for human pose estimation (Zheng et al., 2020). More closely related to this work, keypoints can be learned for control and robotic manipulation (Chen et al., 2021; Vecerik et al., 2021). Minderer et al. (2019) learn unsupervised keypoints from videos to represent objects and dynamics. Jaques et al. (2021) leverage keypoints for system identification and dynamic modelling. Jakab et al. (2018) learn a keypoint representation unsupervised by using it as an information bottleneck for reconstructing images. The keypoints represent semantic landmarks in the images and generalise well to unseen data. It is the main inspiration for the use of keypoints in our work.

Contributions

(1) We introduce KeyCLD, a framework to learn constrained Lagrangian dynamics from images. We are the first to use learned keypoint representations from images to learn Lagrangian dynamics. We show that keypoint representations derived from images can directly be used as positional state vector for learning constrained Lagrangian dynamics, expressed in Cartesian coordinates. (2) We show how to control constrained Lagrangian dynamics in Cartesian coordinates with energy shaping, where the state is estimated from images. (3) We adapt the pendulum, cartpole and acrobot systems from dm control (Tunyasuvunakool et al., 2020) as benchmarks for learning Lagrangian or Hamiltonian dynamics from images. (4) We show that KeyCLD can be learned on these systems, whether they are unactuated, underactuated or fully actuated. We compare quantitatively with Lag-caVAE, Lag-VAE (Zhong & Leonard, 2020) and HGN (Toth et al., 2020), and investigate the benefit of the Lagrangian prior and the constraint function. KeyCLD performs best on all benchmarks.

2. CONSTRAINED LAGRANGIAN DYNAMICS

Lagrangian Dynamics
For a dynamical system with m degrees of freedom, a set of independent generalized coordinates q ∈ R^m represents all possible kinematic configurations of the system. The time derivatives q̇ ∈ R^m are the velocities of the system. If the system is fully deterministic, its dynamics are described by the equations of motion, a set of second-order ordinary differential equations (ODE):

    q̈ = f(q(t), q̇(t), t, u(t))    (1)

where u(t) are the external forces acting on the system. From a known initial value (q, q̇), we can integrate f through time to predict future states of the system. It is possible to model f with a neural network, and train the parameters with backpropagation through an ODE solver (Chen et al., 2018). However, by expressing the dynamics with a Lagrangian we introduce a strong physics prior (Lutter et al., 2019):

    L(q, q̇) = T(q, q̇) − V(q)    (2)

T is the kinetic energy and V is the potential energy of the system. For any mechanical system the kinetic energy is defined as:

    T(q, q̇) = ½ q̇⊤ M(q) q̇    (3)

where M(q) ∈ R^{m×m} is the positive semi-definite mass matrix. Ensuring that M(q) is positive semi-definite can be done by expressing M(q) = L(q)L(q)⊤, where L(q) is a lower triangular matrix. It is now possible to describe the dynamics with two neural networks, one for the mass matrix and one for the potential energy. Since both are functions of q only, not of q̇, and expressing the mass matrix and potential energy is more straightforward than expressing the equations of motion, it is generally much simpler to learn dynamics with this framework. In other words, adding more physics priors in the form of Lagrangian mechanics makes learning the dynamics more robust and data-efficient (Lutter et al., 2018; 2019; Cranmer et al., 2020; Lutter & Peters, 2021).
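As a concrete illustration, the parameterization M(q) = L(q)L(q)⊤ can be implemented by letting a network output the m(m+1)/2 entries of a lower triangular L. The following is a minimal sketch in JAX, not the paper's released code; the function names are ours:

```python
import jax.numpy as jnp

def mass_matrix(l_params, m=2):
    """Assemble M = L L^T from m*(m+1)/2 free parameters (e.g. the
    output of a neural network) filling a lower triangular L, so that
    M is positive semi-definite by construction."""
    L = jnp.zeros((m, m)).at[jnp.tril_indices(m)].set(l_params)
    return L @ L.T

def kinetic_energy(q_t, M):
    """T = 1/2 q_t^T M q_t, the mechanical kinetic energy."""
    return 0.5 * q_t @ M @ q_t
```

In the Cartesian formulation of Section 2 the same construction applies, except that the entries of L are optimized directly instead of being produced by a network.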
The Euler-Lagrange equations (4) allow transforming the Lagrangian into the equations of motion by solving for q̈:

    d/dt ∇q̇ L − ∇q L = ∇q W    (4)
    ∇q W = g(q) u    (5)

where W is the external work done on the system, e.g. forces applied for control. The input matrix g ∈ R^{m×l} allows introducing external forces u ∈ R^l for modelling any control-affine system. If the external forces and torques are aligned with the degrees of freedom q, g can be a diagonal matrix or even an identity matrix. More generally, if no prior knowledge is present about the relationship between u and the generalized coordinates q, the input matrix g(q) : R^m → R^{m×l} is a function of q and can be modelled with a third neural network (Zhong et al., 2020). If the system is fully actuated l = m; if it is underactuated l < m.

Finzi et al. (2020) showed that expressing Lagrangian mechanics in Cartesian coordinates x ∈ R^k instead of independent generalized coordinates q ∈ R^m has several advantages. The mass matrix no longer changes as a function of the state, and is thus static. This means that a neural network is no longer required to model the mass matrix; the values in the matrix itself are optimized directly. The potential energy V(x) and input matrix g(x) are now functions of x. Expressing the potential energy in Cartesian coordinates can often be simpler than in generalized coordinates, e.g. for gravity it is simply a linear function.

Cartesian coordinates

To use the Euler-Lagrange equations without constraint forces, it is required that the system is expressed in independent generalized coordinates, meaning that all possible values of q correspond to possible states of the system. Since we are now expressing the system in Cartesian coordinates, this requirement no longer holds. We additionally need a set of n holonomic constraint functions Φ(x) : R^k → R^n, where n is the number of constraints, so that the number of degrees of freedom is correct: m = k − n. Deriving the equations of motion including the holonomic constraints yields (see Appendix A.1 for the full derivation and details):

    f = −∇x V + g u
    ẍ = M⁻¹ f − M⁻¹ DΦ⊤ (DΦ M⁻¹ DΦ⊤)⁻¹ (DΦ M⁻¹ f + ⟨D²Φ, ẋ⟩ ẋ)    (6)

with D the Jacobian operator. Since time derivatives of functions modelled with neural networks are no longer present, equation (6) can be implemented in an autograd library which handles the calculation of gradients and Jacobians automatically. See Appendix A.2 for details and the implementation of equation (6) in JAX (Bradbury et al., 2018). Note that in equation (6) only the Jacobian of Φ(x) is present. This means that there is no need to learn explicit constants in Φ(x), such as lengths or distances between points. Rather, constant distances and lengths through time are enforced by DΦ(x) ẋ = 0.

Figure 2: Example of a constraint function Φ(x) to express the cartpole system in Cartesian coordinates. The cartpole system has 2 degrees of freedom, but is expressed in x ∈ R^4. Valid configurations of the system in R^4 are constrained to a manifold defined by 0 = Φ(x). The first constraint only allows horizontal movement of x_1, and the second constraint enforces a constant distance between x_1 and x_2. Although unknown constants l_1 and l_2 are present in Φ(x), their values are irrelevant, since only the Jacobian of Φ(x) is used in our framework (see equation (6)). See Appendix A.4 for more examples of constraint functions.
We use this property to our advantage since it simplifies the learning process. See Fig. 2 for an example. The constraint function Φ(x) adds extra prior information to our model. Alternatively, we could use a mapping function x = F(q). This leads directly to an expression of the Lagrangian in Cartesian coordinates, using ẋ = DF(q) q̇:

    L(q, q̇) = ½ q̇⊤ DF(q)⊤ M DF(q) q̇ − V(F(q))    (7)

from which the equations of motion can be derived using the Euler-Lagrange equations, similar to equation (6). In terms of explicit knowledge about the system, the mapping x = F(q) is equivalent to the kinematic chain as required for the method of Zhong & Leonard (2020). Using the constraint function is, however, more general. Some systems, such as systems with closed-loop kinematics, cannot be expressed in generalized coordinates q, and thus have no mapping function (Betsch, 2005). We therefore argue that adopting the constraint function Φ(x) is more general and requires less explicit knowledge injected into the model.
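To make the cartpole constraint function of Fig. 2 concrete, here is a minimal JAX sketch (our illustrative code, not the released implementation; the constants l1 and l2 are placeholders whose values never influence the learned dynamics, since only DΦ enters equation (6)):

```python
import jax
import jax.numpy as jnp

def phi(x, l1=0.0, l2=1.0):
    """Holonomic constraints for the cartpole, x = [x1, y1, x2, y2].
    l1 and l2 are arbitrary placeholder constants."""
    x1, x2 = x[:2], x[2:]
    return jnp.array([
        x1[1] - l1,                       # cart restricted to a horizontal line
        jnp.sum((x2 - x1) ** 2) - l2**2,  # constant cart-pole distance (squared)
    ])

# The Jacobian D_phi required by equation (6) follows from autodiff:
D_phi = jax.jacobian(phi)
```

The squared-distance form keeps Φ smooth everywhere, which is convenient when taking the second derivatives ⟨D²Φ, ẋ⟩ ẋ appearing in equation (6).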

Relationship between Lagrangian and Hamiltonian

Both Lagrangian and Hamiltonian mechanics ultimately express the dynamics in terms of kinetic and potential energy. The Hamiltonian expresses the total energy of the system H(q, p) = T (q, p) + V (q) (Greydanus et al., 2019; Toth et al., 2020) . It is expressed in the position and the generalized momenta (q, p), instead of generalized velocities. Using the Legendre transformation it is possible to transform L into H or back. We focus in our work on Lagrangian mechanics because it is more general (Cranmer et al., 2020) and observing the momenta p is impossible from images. See also Botev et al. (2021) for a short discussion on the differences.
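For reference, the Legendre transformation connecting the two formulations is the standard identity (not specific to our method):

```latex
p = \nabla_{\dot q} L(q, \dot q), \qquad
H(q, p) = p^\top \dot q - L(q, \dot q).
```

For a mechanical kinetic energy T = ½ q̇⊤ M(q) q̇ this gives p = M(q) q̇, which makes explicit why the momenta cannot be read off from images: they require the unknown mass matrix in addition to the velocities.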

3. KEYPOINTS AS STATE REPRESENTATIONS

We introduce the use of keypoints to learn Lagrangian dynamics from images. KeyCLD is trained unsupervised on sequences of n images {z^i}, i ∈ {1, ..., n}, and a constant input vector u. See Fig. 3 for a schematic overview.

Figure 3: Schematic overview of training KeyCLD. A sequence of n images {z^i}, i ∈ {1, ..., n}, is processed by the keypoint estimator model, returning heatmaps {s^i} representing spatial probabilities of the keypoints. s^i consists of m heatmaps s^i_k, one for every keypoint x^i_k, k ∈ {1, ..., m}. Spatial softmax is used to extract the Cartesian coordinates of the keypoints, and all keypoints are concatenated in the state vector x^i. x^i is transformed back to a spatial representation s′^i using Gaussian blobs. This prior is encouraged on the keypoint estimator model by a binary cross-entropy loss L_e between s^i and s′^i. The renderer model reconstructs images z′^i based on s′^i, with reconstruction loss L_r. The dynamics loss L_d is calculated on the sequence of state vectors x^i. The keypoint estimator model, renderer model and the dynamics models (mass matrix, potential energy and input matrix) are jointly trained with a weighted sum of the losses L = L_r + L_e + λL_d.

All images z^i in the sequence are processed by the keypoint estimator model, each returning a set of heatmaps s^i representing the spatial probabilities of keypoint positions. s^i consists of m heatmaps s^i_k, one for every keypoint x^i_k, k ∈ {1, ..., m}. The keypoint estimator model is a fully convolutional neural network, maintaining a spatial representation from input to output (see Fig. 4 for the detailed architecture). This contrasts with a model ending in fully connected layers regressing to the coordinates directly, where the spatial representation is lost (Zhong & Leonard, 2020). Because a fully convolutional model is equivariant to translation, it can better generalize to unseen states that are translations of seen states.
Another advantage is the possibility of augmenting z with e.g. random transformations of the D4 dihedral group to increase robustness and data efficiency. Because s can be transformed back with the inverse transformation, this augmentation is confined to the keypoint estimator model and has no effect on the dynamics. To distill keypoint coordinates from the heatmaps, we define a Cartesian coordinate system in the image (see for example Fig. 1). Based on this definition, every pixel p corresponds to a point x_p in Cartesian space. The choice of the Cartesian coordinate system is arbitrary, but it is equal to the space of the dynamics ẍ(ẋ, x, t, u) and the constraint function Φ(x) (see Section 2). We use spatial softmax over all pixels p ∈ P to distill the coordinates of keypoint x_k from its probability heatmap:

    x_k = Σ_{p∈P} x_p e^{s_k(p)} / Σ_{p∈P} e^{s_k(p)}    (8)

Spatial softmax is differentiable, and the loss will backpropagate through the whole heatmap since x_k depends on all the pixels. The Cartesian coordinates x_k of the different keypoints are concatenated in the vector x, which serves as the positional state vector of the system. This compelling connection between image keypoints and Cartesian coordinates forms the basis of this work. The keypoint estimator model serves directly as state estimator to learn Lagrangian dynamics from images. Similar to Jakab et al. (2018), x acts as an information bottleneck, through which only the Cartesian coordinates of the keypoints flow to reconstruct the image with the renderer model. First, all x_k are transformed back to spatial representations s′_k using Gaussian blobs, parameterized by a hyperparameter σ:

    s′_k(p) = exp(−∥x_p − x_k∥² / (2σ²))    (9)

A binary cross-entropy loss L_e is formulated over s and s′ to encourage this Gaussian prior. The renderer model can more easily interpret the state in this spatial representation, as it lies closer to its semantic meaning of keypoints as semantic landmarks in the reconstructed image.
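Equations (8) and (9) can be sketched in a few lines of JAX (illustrative code with assumed array shapes, not the paper's implementation):

```python
import jax
import jax.numpy as jnp

def spatial_softmax(heatmap, coords):
    """Equation (8): soft-argmax of one keypoint heatmap.
    heatmap: (H, W) unnormalized scores s_k(p).
    coords:  (H, W, 2) Cartesian coordinates x_p of every pixel p."""
    w = jax.nn.softmax(heatmap.reshape(-1))  # softmax over all pixels
    return w @ coords.reshape(-1, 2)         # expected keypoint coordinate x_k

def gaussian_blob(x_k, coords, sigma=0.1):
    """Equation (9): render a keypoint back to a spatial heatmap s'_k."""
    d2 = jnp.sum((coords - x_k) ** 2, axis=-1)
    return jnp.exp(-d2 / (2 * sigma**2))
```

Because the soft-argmax is a weighted average over all pixels, gradients of the dynamics loss reach every pixel of the heatmap, not just the maximum.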
The renderer model learns a constant feature tensor (inspired by Nguyen-Phuoc et al. (2019)) that is concatenated with the input s′. The feature tensor provides the renderer with positional information, since the fully convolutional model itself is translation equivariant. See Fig. 4 for the detailed architecture. Finally, a reconstruction loss is formulated over the reconstructed images {z′^i} and original images {z^i}:

    L_r = Σ_{i=1}^{n} ∥z′^i − z^i∥²    (10)

Dynamics loss function
The sequence {x^i}, corresponding to the sequence of given images {z^i}, and the constant input u are used to calculate the dynamics loss. A fundamental aspect of learning dynamics from images is that velocities cannot be directly observed. A single image only captures the position of a system, and contains no information about its velocities. Other work uses sequences of images as input to a model (Toth et al., 2020) or a specific velocity estimator model trained to estimate velocities from a sequence of positions (Jaques et al., 2019). Zhong & Leonard (2020) demonstrate that for estimating velocities, finite differencing simply works better. We use a central first-order finite difference estimation, and project the estimated velocity on the constraints, so that the constraints are not violated:

    ẋ^i = (I − DΦ(x^i)⁺ DΦ(x^i)) (x^{i+1} − x^{i−1}) / (2h),  i ∈ {2, ..., n−1}    (11)

where (•)⁺ signifies the Moore-Penrose pseudo-inverse and h the timestep. We can now integrate future timesteps x̂ starting from initial values (x^i, ẋ^i) using an ODE solver. The equations of motion (6) are solved starting from all initial values in parallel, for ν timesteps. This maximizes the learning signal obtained to learn the dynamics and leads to overlapping subsequences of length ν:

    {x̂^{i+1}, ..., x̂^{i+ν}},  i ∈ {2, ..., n−ν}    (12)

Thus, x̂^{i+j} is obtained by integrating j timesteps forward in time, starting from initial value x^i, which was derived by the keypoint estimator model. All x̂^{i+j} in all subsequences are compared with their corresponding keypoint states x^{i+j} in an L2 loss:

    L_d = Σ_{i=2}^{n−ν} Σ_{j=1}^{ν} ∥x^{i+j} − x̂^{i+j}∥²    (13)

Total loss
The total loss is the weighted sum of L_r, L_e and L_d, with a weighing hyperparameter λ: L = L_r + L_e + λL_d. To conclude, the keypoint estimator model, renderer model and dynamics models (mass matrix, potential energy and input matrix) are jointly trained end-to-end on sequences of images {z^i} and constant inputs u with stochastic gradient descent.

Rigid bodies as rigid sets of point masses
By interpreting a set of keypoints as a set of point masses, we can represent any rigid body and its corresponding kinetic and potential energy. Additional constraints are added for the pairwise distances between keypoints representing a single rigid body (Finzi et al., 2020). For 3D systems, at least four keypoints are required to represent any rigid body (Laus & Selig, 2020). We focus in our work on 2D systems in a plane parallel to the camera plane. 2D rigid bodies can be expressed with a set of 2 point masses, which can be further reduced depending on the constraints and connections between bodies. See Appendix A.3 for more details and proof.
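The constraint-projected velocity estimate of equation (11) can be sketched as follows (a minimal JAX sketch with our own function name; DPhi_x is the constraint Jacobian evaluated at x^i):

```python
import jax.numpy as jnp

def estimate_velocity(x_prev, x_next, DPhi_x, h):
    """Equation (11): central finite difference projected onto the
    constraint manifold, so that DPhi(x) @ x_t = 0 holds exactly."""
    v = (x_next - x_prev) / (2 * h)
    P = jnp.eye(v.shape[0]) - jnp.linalg.pinv(DPhi_x) @ DPhi_x
    return P @ v
```

The projector P removes the component of the finite-difference velocity that would violate DΦ(x) ẋ = 0.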

4. EXPERIMENTS

We adapted the pendulum, cartpole and acrobot environments from dm control (Tunyasuvunakool et al., 2020; Todorov et al., 2012) for our experiments. See Appendix A.4 for details about the environments, their constraint functions and the data generation procedure. The exact same model architectures, hyperparameters and control parameters were used across the environments. This further demonstrates the generality and robustness of our method. See Appendix A.5 for more details.

Since KeyCLD is trained directly on image observations, quantitative metrics can only be expressed in the image domain. The mean squared error (MSE) in the image domain is not a good metric of long-term prediction accuracy (Minderer et al., 2019; Zhong & Leonard, 2020). A model that trivially learns to predict a static image, e.g. the average of the dataset, learns no dynamics at all, yet could report a lower MSE than a model that did learn the dynamics but started drifting from the ground truth after some time. Therefore, we use the valid prediction time (VPT) score (Botev et al., 2021; Jin et al., 2020), which measures how long the predicted images stay close to the ground truth images of a sequence:

    VPT = argmin_i [MSE(z′^i, z^i) > ϵ]

where z^i are the ground truth images, z′^i are the predicted images and ϵ is the error threshold. ϵ is determined separately for the different environments because it depends on the relative size in pixels of moving parts. We define it as the MSE of the averaged image of the respective validation dataset; thus it is the lower bound for a model that would simply predict a static image.

We present evaluations with the following ablations and baselines:
KeyCLD: The full framework as described in Sections 2 and 3.
KeyLD: The constraint function is omitted.
KeyODE2: A second order neural ODE modelling the acceleration is used instead of the Lagrangian prior. The keypoint estimator and renderer model are identical to KeyCLD.
Lag-caVAE: The model presented by Zhong & Leonard (2020). We adapted the model to the higher resolution and color images.
Lag-VAE: The model presented by Zhong & Leonard (2020). We adapted the model to the higher resolution and color images.
HGN: The Hamiltonian Generative Network presented by Toth et al. (2020).

Table 2: Valid prediction time (higher is better) in number of predicted frames (mean ± std) for the different models, evaluated on the 50 sequences in the validation set. Lag-caVAE and Lag-VAE are only reported on the pendulum environment, since they are unable to model more than one moving body without segmented images. HGN is only reported on non-actuated systems, since it is incapable of modelling external forces and torques. KeyCLD achieves the best results on all benchmarks.

Environment  # actuators   KeyCLD        KeyLD         KeyODE2      Lag-caVAE   Lag-VAE       HGN
Pendulum     0 (Fig. 5)    43.1 ± 9.7    16.4 ± 11.3   19.1 ± 6.2   0.0 ± 0.0   10.8 ± 13.8   0.2 ± 1.4
             1 (Fig. 9)    39.3 ± 9.8    14.9 ± 7.9    12.0 ± 4.1   0.0 ± 0.1   8.0 ± 10.2    -
Cartpole     0 (Fig. 10)   39.9 ± 7.4    29.8 ± 11.2   29.5 ± 9.5   -           -             0.0 ± 0.0
             1 (Fig. 11)   38.4 ± 8.7    28.0 ± 9.7    24.4 ± 7.9   -           -             -
             2 (Fig. 12)   30.2 ± 10.7   23.9 ± 9.6    17.7 ± 8.2   -           -             -
Acrobot      0 (Fig. 13)   47.0 ± 6.0    40.0 ± 7.9    34.3 ± 9.5   -           -             2.2 ± 6.9
             1 (Fig. 14)

Figure 5: 50 frames are predicted based on the first three frames of the ground truth sequence to estimate the velocity. Every third frame of every sequence is shown. KeyCLD is capable of making accurate long-term predictions with minimal drift of the dynamics. Without the constraint function, KeyLD is not capable of making long-term predictions. Similarly, KeyODE2 is unable to make long-term predictions. Lag-caVAE is fundamentally incapable of modelling data with background information, since the reconstructed images are explicitly rotated. Lag-VAE does not succeed in modelling moving parts in the data, and simply learns to predict static images. HGN also does not capture the dynamics and only learns the background.

See Table 2 for an overview of results, and Fig. 5 for qualitative results on the unactuated pendulum. KeyCLD achieves the best results on all benchmarks. Lag-caVAE is unable to model data with background (see also Fig. 5). Despite our best efforts at implementation and training, Lag-VAE and HGN perform very poorly. They are not capable of handling the relatively more challenging visual structure of the dm control environments. Removing the constraint function (KeyLD) has a detrimental effect on the ability to make long-term predictions. Results are comparable to removing the Lagrangian prior altogether (KeyODE2). This suggests that modelling dynamics in Cartesian coordinates coupled with keypoint representations is in itself a very strong prior, consistent with recent findings by Gruver et al. (2021).
However, the Lagrangian formulation allows leveraging a constraint function, whereas a general neural ODE model cannot make use of one. Thus, if a constraint function is available, the Lagrangian prior becomes much more powerful. See Appendix A.7 for more qualitative results and insights.

Interpretable energy models and control
A major argument in favor of expressing dynamics in terms of a mass matrix and potential energy is the straightforward control design via passivity-based control and energy shaping. See Appendix A.6 for details and the derivation of an energy shaping controller, and successful swing-up results on the pendulum, cartpole and acrobot systems.
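For completeness, the VPT score used in this section can be computed as follows (a minimal sketch, assuming image sequences of shape (frames, height, width, channels)):

```python
import numpy as np

def valid_prediction_time(z_pred, z_true, eps):
    """Index of the first predicted frame whose MSE w.r.t. the ground
    truth exceeds eps; if no frame exceeds it, the whole sequence counts."""
    mse = np.mean((z_pred - z_true) ** 2, axis=(1, 2, 3))  # per-frame MSE
    over = mse > eps
    return int(np.argmax(over)) if over.any() else len(z_pred)
```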

5. CONCLUSION AND FUTURE WORK

We introduce the use of keypoints to learn Lagrangian dynamics from images. Learned keypoint representations derived from images are directly used as positional state vector for learning constrained Lagrangian dynamics. The pendulum, cartpole and acrobot systems of dm control are adapted as benchmarks. Previous works on learning Lagrangian or Hamiltonian dynamics from images were benchmarked on very simple renderings of flat sprites on blank backgrounds (Botev et al., 2021; Zhong & Leonard, 2020; Toth et al., 2020), whereas dm control is rendered with lighting effects, shadows, reflections and backgrounds. We believe that working towards more realistic datasets is crucial for applying Lagrangian models in the real world. The challenge of learning Lagrangian dynamics from more complex images should not be underestimated. Despite our best efforts at implementing and training Lag-caVAE, Lag-VAE (Zhong & Leonard, 2020) and HGN (Toth et al., 2020), they perform very poorly on our dm control benchmark. KeyCLD is capable of making long-term predictions and learning accurate energy models, suitable for energy shaping control. When no constraint prior is available, results are comparable to a general second order neural ODE. This signifies the benefit of using keypoint representations coupled with Cartesian coordinates to model the dynamics.

Our work focuses on 2D systems, where the plane of the system is parallel to the camera plane. Extension to 3D, e.g. setups with multiple cameras, is an interesting future direction. Secondly, modelling contacts by using inequality constraints could be a useful addition. Thirdly, our work focuses on energy-conserving systems; modelling energy dissipation is necessary for real-world applications. Several recent papers have proposed methods to incorporate energy dissipation in Lagrangian dynamics models (Zhong et al., 2020; Greydanus & Sosanya, 2022). However, Gruver et al. (2021) argue that modelling the acceleration directly with a second order differential equation, expressing the system in Cartesian coordinates, is a better approach. Further research into both approaches would clarify the benefit of Lagrangian and Hamiltonian priors in real-world applications. Lastly, a major argument for using the Lagrangian prior is the availability of the mass matrix, potential energy and input matrix. This allows explainability and powerful control designs, such as energy shaping. Model-based approaches for underactuated systems often rely on stochastic optimal control, which can fail when a long prediction horizon is required. With our method, feedback controllers based on the mass matrix, potential energy and input matrix are possible, which are more robust and do not require complicated optimal control.

6. BROADER IMPACT

A tenacious divide exists between control engineering researchers and computer science researchers working on control. Where the former would use known equations of motion for a specific class of systems and investigate system identification, the latter would strive for the most general method with no prior knowledge. We believe this is a spectrum worth exploring, and as such use strong physics priors such as Lagrangian mechanics, while still modelling e.g. the input matrix and the potential energy with arbitrary neural networks. The broad field of model-based reinforcement learning could benefit from decades of theory and practice in classic control theory and system identification. We hope this paper can help bridge both worlds. Using images as input is, in a broad sense, very powerful. Since camera sensors are consistently becoming cheaper and more powerful due to advancements in technology and scaling opportunities, we can leverage these rich information sources for a deeper understanding of the world our intelligent agents are acting in. Image sensors can replace and enhance multiple other sensor modalities, at a lower cost. This work demonstrates the ability to efficiently model and control dynamical systems that are captured by cameras, with no supervision and minimal prior knowledge. We want to stress that we have shown it is possible to learn both the Lagrangian dynamics and the state estimator model from images in one end-to-end process. The complex interplay between the two often makes them the most labour-intensive parts of system identification. We believe this is a gateway step towards reliable end-to-end learned control from pixels, especially since the availability of mass matrix, potential energy and input matrix models allows powerful control designs.

REPRODUCIBILITY STATEMENT

Please see the attached codebase to reproduce all experiments reported in this paper. The README.md file contains detailed installation instructions and scripts for every experiment.

A APPENDIX

A.1 DERIVATION OF CONSTRAINED EULER-LAGRANGE EQUATIONS

The Lagrangian of a mechanical system described in Cartesian coordinates x ∈ R^k is:

    L(x, ẋ) = ½ ẋ⊤ M ẋ − V(x)

with M a static mass matrix, not depending on x, and V(x) the potential energy. If the system has m degrees of freedom, additionally n holonomic constraints are necessary such that m = k − n. These are described by a constraint function Φ(x) : R^k → R^n. Including the input matrix g(x) ∈ R^{k×l} and external inputs u(t) ∈ R^l, the constrained Euler-Lagrange equations are expressed with a vector λ(t) ∈ R^n containing Lagrange multipliers for the constraints (Finzi et al., 2020; Lanczos, 2020):

    d/dt ∇ẋ L(x, ẋ) − ∇x L(x, ẋ) = g(x) u(t) + DΦ(x)⊤ λ(t)

Because the mass matrix is static, this simplifies to:

    M ẍ + ∇x V(x) = g(x) u(t) + DΦ(x)⊤ λ(t)
    ẍ = M⁻¹ f + M⁻¹ DΦ(x)⊤ λ(t),  f = −∇x V(x) + g(x) u(t)    (18)

Differentiating the constraint conditions twice with respect to time yields:

    0 ≡ Φ(x)
    0 = d/dt Φ(x) = DΦ(x) ẋ
    0 = (d/dt DΦ(x)) ẋ + DΦ(x) ẍ    (19)

The Lagrange multipliers λ(t) are solved by substituting ẍ from equation (18) into equation (19):

    −(d/dt DΦ(x)) ẋ = DΦ(x) M⁻¹ f + DΦ(x) M⁻¹ DΦ(x)⊤ λ(t)
    λ(t) = −(DΦ(x) M⁻¹ DΦ(x)⊤)⁻¹ (DΦ(x) M⁻¹ f + (d/dt DΦ(x)) ẋ)

We use the chain rule a second time to get rid of the time derivative of DΦ(x):

    (d/dt DΦ(x)) ẋ = ⟨D²Φ, ẋ⟩ ẋ

Substituting λ(t) into (18), we arrive at equation (6).

A.2 IMPLEMENTATION OF CONSTRAINED EULER-LAGRANGE EQUATIONS IN JAX

It could seem a daunting task to implement the derivation of the constrained Euler-Lagrange equations (6) in an autograd library. Therefore, we provide an implementation in JAX (Bradbury et al., 2018). For more context, please see the full code base in the supplementary materials (keycld/models.py).

    # Lagrange multipliers:
    l = jnp.linalg.pinv(Dphi @ m_inv @ Dphi.T) @ (Dphi @ m_inv @ f + DDphi @ x_t @ x_t)
    # Acceleration from the constrained Euler-Lagrange equations:
    x_tt = m_inv @ (f - Dphi.T @ l)
    return x_tt
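For illustration, the fragment can be embedded in a complete function as follows. This is a self-contained sketch with our own function and variable names (not necessarily those of keycld/models.py), using a point mass on a unit circle as a pendulum-like example constraint:

```python
import jax
import jax.numpy as jnp

def acceleration(phi, m_inv, f, x, x_t):
    # Jacobian of the constraint function, DPhi(x)
    Dphi = jax.jacobian(phi)(x)
    # Second derivative of the constraint, used for <D^2 Phi, x_t> x_t
    DDphi = jax.jacobian(jax.jacobian(phi))(x)
    # Lagrange multipliers
    l = jnp.linalg.pinv(Dphi @ m_inv @ Dphi.T) @ (Dphi @ m_inv @ f + DDphi @ x_t @ x_t)
    # Constrained acceleration
    return m_inv @ (f - Dphi.T @ l)

# Example: unit point mass constrained to the unit circle, hanging at rest
phi = lambda x: jnp.array([x[0] ** 2 + x[1] ** 2 - 1.0])
m_inv = jnp.eye(2)
f = jnp.array([0.0, -9.81])   # gravity only, no inputs
x = jnp.array([0.0, -1.0])    # hanging straight down
x_t = jnp.zeros(2)
x_tt = acceleration(phi, m_inv, f, x, x_t)
```

At the stable equilibrium the constraint force exactly cancels gravity, so the computed acceleration is zero, as expected.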

A.3 RIGID BODIES AS SETS OF POINT MASSES

The position of a rigid body in 2D is fully described by the position of its center of mass x_c and its orientation θ. Potential energy only depends on the position, thus if we want to describe the potential energy with an equivalent rigid set of point masses, two points are sufficient to fully determine x_c and θ. For the kinetic energy, we provide the following theorem and proof:

Theorem 1. For any 2D rigid body, described by its center of mass x_c, mass m and rotational inertia I, there exists an equivalent rigid set of two point masses x₁ and x₂ with masses m₁ and m₂.

Proof. To find conditions such that the kinetic energy expressed in two point masses is equal to that of the rigid body representation, we start by expressing general 3D movement:

x_i = x_c + x_{i/c},   i ∈ {1, 2}

where the vector x_c contains the coordinates of the center of mass and the vector x_{i/c} is the position of the point mass relative to the center of mass. Since this relative position x_{i/c} has fixed length, only a rotation is possible and hence the velocity is:

ẋ_i = ẋ_c + ω × x_{i/c},   i ∈ {1, 2}

where ω is the rotational velocity of the body. Substituting this in the kinetic energy of the point masses, we get:

T = ½ Σᵢ₌₁² mᵢ ‖ẋ_c + ω × x_{i/c}‖²
  = ½ Σᵢ₌₁² mᵢ (‖ẋ_c‖² + ‖ω × x_{i/c}‖² + 2 x_{i/c} · (ẋ_c × ω))   (24)

where we expanded the square and used the circular shift property of the triple product on the last term. For movement in the 2D plane (i.e. ω = e⃗_z ω_z and x_i = e⃗_x x_{i,x} + e⃗_y x_{i,y}), this becomes:

T = ½ Σᵢ₌₁² mᵢ (‖ẋ_c‖² + ‖x_{i/c}‖² ω_z² + 2 x_{i/c} · (ẋ_c × ω))
  = ½ (m₁ + m₂) ‖ẋ_c‖² + ½ (m₁‖x_{1/c}‖² + m₂‖x_{2/c}‖²) ω_z² + (m₁ x_{1/c} + m₂ x_{2/c}) · (ẋ_c × ω)   (25)

Matching the kinetic energy of the two point masses (equation (25)) with that of the rigid body representation (left hand side of Figure 6), we get the following conditions:

m = m₁ + m₂
I = m₁‖x_{1/c}‖² + m₂‖x_{2/c}‖²
0 = m₁ x_{1/c} + m₂ x_{2/c}   (26)

Since the last equation is a vector equation, this gives us four equations in six unknowns (m₁, m₂, x_{1,x}, x_{1,y}, x_{2,x}, x_{2,y}), which leaves us the freedom to choose two. It follows from the third condition of (26) that the points x₁, x₂ and x_c must be collinear. To conclude, we can freely choose the positions of the point masses (as long as x_c is on the line between them), and will be able to model the rigid body as a set of two point masses. In practice, KeyCLD will freely choose the keypoint positions to be able to model the dynamics. Depending on the constraints in the system, it is possible to further reduce the number of necessary keypoints; see Appendix A.4 for examples.

The interpretation of rigid bodies as sets of point masses allows expressing the kinetic energy of the system as the sum of the kinetic energies of the point masses. The mass matrix for a 2D system is thus a diagonal matrix in which the mass m_k of every keypoint x_k appears twice, once per coordinate:

T(ẋ) = ½ ẋᵀ M ẋ,   M = diag(m₁, m₁, m₂, m₂, …, mₙ, mₙ)

To enforce positive values, the masses are parameterized by their square roots and squared.

We adapted the pendulum, cartpole and acrobot environments from dm control (Tunyasuvunakool et al., 2020), implemented in MuJoCo (Todorov et al., 2012). Both are released under the Apache-2.0 license.
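As a minimal sketch (the helper name is ours, not from the KeyCLD code base), the diagonal mass matrix with the square-root parameterization can be built as:

```python
import jax.numpy as jnp

def mass_matrix(sqrt_masses):
    # Square the square-root parameterization to enforce positive masses,
    # then repeat each mass twice, once per 2D coordinate of a keypoint.
    masses = sqrt_masses ** 2
    return jnp.diag(jnp.repeat(masses, 2))

M = mass_matrix(jnp.array([2.0, 3.0]))  # two keypoints with masses 4 and 9
```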
The following changes were made to the environments to adapt them to our use case:

Pendulum: The camera was repositioned so that it lies in a plane parallel to the system. Friction was removed. The torque limits of the motor are increased.

Cartpole: The camera was moved further away from the system to enable a wider view; the two rails are made longer and the floor lowered so that they are not cut off by the wider view. All friction is removed. The pole is made twice as thick and the color of the cart is changed. Torque limits are increased and actuation is added to the cart to make full actuation possible.

Acrobot: The camera and system are moved slightly upwards. The two poles are made twice as thick, and one is changed in color. Torque limits are increased and actuation is added to the upper joint to make full actuation possible.

Data generation: For every environment, 500 runs of 50 timesteps are generated with a 10% validation split. The initial state of every sequence is a random position with a small random velocity. The control inputs u are constant throughout a sequence and chosen uniformly at random between the force and torque limits of the input. We set u = 0 for 20% of the sequences. We found this helps the model learn the dynamics better, discouraging confusion of the energy models with external actions.

The constraint function for each of the environments is given in Fig. 7. As explained in Appendix A.3, every rigid body needs to be represented by two keypoints, but due to the constraints it is possible to omit certain keypoints because they do not move or coincide with other keypoints. As experimentally validated, we can thus model all three systems with a lower number of keypoints, where the number of keypoints equals the number of bodies.

Pendulum: One keypoint is used to model the pendulum. The second keypoint of this rigid body can be omitted because it can be assumed to be at the origin. Due to the constraint function, this point contributes no kinetic energy since it does not move. Since the other keypoint's position and mass are freely chosen, any pendulum can be modelled. The constraint function expresses that the distance l₁ from the origin to x₁ is fixed. The value of l₁ in the implementation is irrelevant because it vanishes when taking the Jacobian.

Cartpole: Two keypoints are used to model the cartpole. The constraint function expresses that x₁ does not move in the vertical direction and that the distance l₁ between x₁ and x₂ is constant. Again, the values of l₁ and l₂ in the implementation are irrelevant.

Acrobot: Two keypoints are used to model the acrobot. The constraint function expresses that the lengths l₁ and l₂ are constant through time. Again, the values are irrelevant in the implementation.
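To illustrate, such constraint functions can be sketched as below (our own naming, not the exact keycld implementation; the fixed lengths are hard-coded to 1 since their values are irrelevant):

```python
import jax.numpy as jnp

def phi_pendulum(x):
    # x = x1; the distance from the origin to x1 is fixed
    return jnp.array([x[0] ** 2 + x[1] ** 2 - 1.0])

def phi_cartpole(x):
    # x = [x1, x2]; x1 stays on the horizontal rail,
    # and the distance between x1 and x2 is fixed
    x1, x2 = x[:2], x[2:]
    return jnp.array([x1[1], jnp.sum((x2 - x1) ** 2) - 1.0])

def phi_acrobot(x):
    # x = [x1, x2]; both link lengths are fixed
    x1, x2 = x[:2], x[2:]
    return jnp.array([jnp.sum(x1 ** 2) - 1.0, jnp.sum((x2 - x1) ** 2) - 1.0])
```

Each function returns zero for any configuration that satisfies the constraints, and its Jacobian DΦ(x) is independent of the chosen lengths up to scale.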

A.5 TRAINING HYPERPARAMETERS AND DETAILS

All models were trained on one NVIDIA RTX 2080 Ti GPU.

KeyCLD, KeyLD and KeyODE2

We use the Adam optimizer (Kingma & Ba, 2015), implemented in Optax (Hessel et al., 2020), with a learning rate of 3×10⁻⁴. We use the exact same hyperparameters for all environments and did not tune them individually. The dynamics loss weight is λ = 1, and σ = 0.1 for the Gaussian blobs in s′. The hidden layers in the keypoint estimator and renderer models have 32 features in the first block; this increases to 64 and 128 respectively after every maxpool operation. All convolutions have kernel size 3×3, and maxpool operations downscale by a factor of 2 with a kernel size of 2×2. The potential energy is modelled with an MLP with two hidden layers of 32 neurons and celu activation functions (Barron, 2017). The weights are initialized from a normal distribution with standard deviation 0.01. Likewise, the input matrix is modelled with an MLP similar to the potential energy model; the outputs of this MLP are reshaped to the required shape of the input matrix. The KeyODE2 dynamics model is an MLP with three hidden layers of 64 neurons each. We chose a higher number of layers and neurons to allow this model more expressivity compared to the potential energy and input matrix models of KeyCLD.

Lag-caVAE, Lag-VAE and HGN: For the Lag-caVAE and Lag-VAE baselines, the official public codebase was used (Zhong & Leonard, 2020). We adapted the implementation to work with the higher input resolution of 64 by 64 (instead of 32 by 32) and 3 color channels (instead of 1). For the HGN baseline, we used the implementation that was also released by Zhong & Leonard (2020). The architecture was adapted to the higher input resolution of 64 by 64 (instead of 32 by 32) by adding an extra upscale layer in the decoder, and a maxpool layer and one extra convolutional layer in the encoder.
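The potential energy network described above can be sketched in plain JAX as follows (our own helper names; the actual model lives in keycld/models.py):

```python
import jax
import jax.numpy as jnp

def init_mlp(key, sizes, scale=0.01):
    # Weights drawn from a normal distribution with standard deviation 0.01
    params = []
    for m, n in zip(sizes[:-1], sizes[1:]):
        key, sub = jax.random.split(key)
        params.append((scale * jax.random.normal(sub, (m, n)), jnp.zeros(n)))
    return params

def mlp(params, x):
    # celu activations on the hidden layers, linear output
    for w, b in params[:-1]:
        x = jax.nn.celu(x @ w + b)
    w, b = params[-1]
    return x @ w + b

# Potential energy for e.g. two 2D keypoints: a map from R^4 to a scalar
params = init_mlp(jax.random.PRNGKey(0), [4, 32, 32, 1])
v = mlp(params, jnp.ones(4))
```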

A.6 ENERGY SHAPING CONTROL

A major argument in favor of expressing dynamics in terms of a mass matrix and potential energy is the straightforward control design via passivity-based control and energy shaping (Ortega et al., 2001). Recent works of Zhong et al. (2020); Zhong & Leonard (2020) use energy shaping in generalized coordinates. In Cartesian coordinates, energy shaping can still be used. This is easily seen from the fact that for the holonomic constraints Φ(x) ≡ 0, we have the derivative DΦ(x)ẋ = 0, which means that the constraint forces in equation (6) are perpendicular to the path and hence do no work nor influence the energy (Lanczos, 2020). Energy shaping control makes sure that the controlled system behaves according to a potential energy V_d(x) instead of V(x):

u = (gᵀg)⁻¹ gᵀ (∇ₓV − ∇ₓV_d) − y_passive   (28)

where y_passive can be any passive output, the easiest choice being y_passive = k_d gᵀẋ, where k_d is a tuneable control parameter. The proposed potential energy V_d should be such that:

x* = argmin V_d(x)
0 = g⊥ (∇ₓV − ∇ₓV_d)   (29)

where g⊥ is the left-annihilator of g, meaning that g⊥g = 0. For fully actuated systems, the second condition of equation (29) is always met and the easiest choice is:

V_d(x) = (x − x*)ᵀ k_p (x − x*)   (30)

where k_p is a tuneable control parameter. The desired equilibrium position x* is obtained by putting a picture of the desired position of the system into the keypoint estimator model. Finally, the passivity-based controller that is used is:

u = (gᵀg)⁻¹ gᵀ [∇ₓV − k_p(x − x*)] − k_d gᵀẋ   (31)

Changing the behavior of the kinetic energy is also possible (Gomez-Estern et al., 2001), but is left for future work. Many model-based reinforcement learning algorithms require learning a full neural network as controller.
In this work, by contrast, knowledge of the potential energy means we only need to tune the two parameters k_p and k_d.

A.7 ADDITIONAL QUALITATIVE RESULTS

Here we present additional qualitative results. Please refer to the supplementary materials for movies.

Future frame predictions: We generate predictions of 50 frames, given the first 3 frames of the ground truth sequence to estimate the initial velocity. Please compare the qualitative results for the unactuated and actuated pendulum environment (Fig. 5, 9), the unactuated, underactuated and fully actuated cartpole environment (Fig. 10, 11, 12) and the unactuated, underactuated and fully actuated acrobot environment (Fig. 13, 14, 15). Every third frame of each sequence is shown. See also the supplementary materials for movies of all sequences in the validation set. For every environment, very long predictions of 500 frames are included, with visualizations of the keypoint representations and predictions.

Learned potential energy models: Since the potential energy V is explicitly modelled, we can plot its values throughout sequences of the state space. A sequence of images is processed by the learned keypoint estimator model, and the states are then used to calculate the potential energy with the learned potential energy model. Absolute values of the potential energy are irrelevant, since the potential is relative, but we gain insight by moving parts of the system separately. See Figure 16 for results for the pendulum, Figures 17 and 18 for the cartpole, and Figures 19 and 20 for the acrobot.
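The passivity-based controller of equation (31) amounts to a few lines of JAX. The sketch below uses our own names and assumes a constant input matrix g for simplicity, whereas KeyCLD learns a state-dependent g(x):

```python
import jax
import jax.numpy as jnp

def energy_shaping_control(V, g, x, x_dot, x_star, k_p=5.0, k_d=2.0):
    # u = (g^T g)^{-1} g^T [grad V - k_p (x - x*)] - k_d g^T x_dot
    grad_V = jax.grad(V)(x)
    u = jnp.linalg.solve(g.T @ g, g.T @ (grad_V - k_p * (x - x_star)))
    return u - k_d * (g.T @ x_dot)

# Illustration with a quadratic potential and full actuation
V = lambda x: 0.5 * jnp.sum(x ** 2)
g = jnp.eye(2)
u = energy_shaping_control(V, g, jnp.array([1.0, 0.0]), jnp.zeros(2),
                           jnp.zeros(2), k_p=1.0)
```

In this toy case the natural potential already has its minimum at x* with the same curvature as V_d, so the shaping term cancels exactly and the controller outputs zero effort.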



¹ In other words, the centrifugal and Coriolis forces are zero because Ṁ = 0 and ∇ₓM = 0.
² Neglecting side-effects such as motion blur, which are not very useful for this purpose.



Figure 1: KeyCLD learns Lagrangian dynamics from images. (a) An observation of a dynamical system is processed by a learned keypoint estimator model. (b) The model represents the positions of the keypoints with a set of spatial probability heatmaps. (c) Cartesian coordinates are extracted using spatial softmax and used as positional state vector to learn Lagrangian dynamics. (d) The information in the keypoint coordinates bottleneck suffices for a learned renderer model to reconstruct the original observation, including background, reflections and shadows. The keypoint estimator model, Lagrangian dynamics models and renderer model are jointly learned unsupervised on sequences of images.

Figure 4: Visualization of the keypoint estimator (top) and renderer (bottom) model architectures. The keypoint estimator model and renderer model have similar architectures, utilizing down- and upsampling and skip connections which help increase the receptive field as in Gu et al. (2019); Newell et al. (2016). The renderer model learns a constant feature tensor that is concatenated with the input s′. The feature tensor provides positional information, since the fully-convolutional model is translation equivariant.

Figure 5: Future frame predictions of the unactuated pendulum. These correspond to the first row in Table 2. 50 frames are predicted based on the first three frames of the ground truth sequence to estimate the velocity. Every third frame of every sequence is shown. KeyCLD is capable of making accurate long-term predictions with minimal drift of the dynamics. Without a constraint function, KeyLD is not capable of making long-term predictions. Similarly, KeyODE2 is unable to make long-term predictions. Lag-caVAE is fundamentally incapable of modelling data with background information, since the reconstructed images are explicitly rotated. Lag-VAE does not succeed in modelling moving parts in the data and simply learns to predict static images. HGN also does not capture the dynamics and only learns the background.

Figure 6: Any 2D rigid body with mass m and rotational inertia I is equivalent to a set of two point masses x₁ and x₂ with masses m₁ and m₂. The kinetic energy of the rigid body, expressed as a translational part and a rotational part, is equal to the sum of the kinetic energies of the point masses.

A.4 DETAILS ABOUT THE DM CONTROL ENVIRONMENTS AND DATA GENERATION

Figure 7: From left to right the pendulum, cartpole and acrobot dm control environments. The respective constraint functions are given below each schematic.

Figure 8: KeyCLD allows using energy shaping control because the learned potential energy model is available. Based on a swing-up target image z * , the target state x * is determined by the keypoint detector model. The sequences show that all three systems can achieve the target state. The control parameters k p = 5.0 and k d = 2.0 are the same for all systems, demonstrating the generality of the control method.



