CONTEXTUAL SUBSPACE APPROXIMATION WITH NEURAL HOUSEHOLDER TRANSFORMS

Abstract

Choosing an appropriate action representation is an integral part of solving robotic manipulation problems. Published approaches include latent action models, which compress the control space into a low-dimensional manifold. These models are trained as conditional autoencoders, where the current observation and a low-dimensional action are passed through a neural network decoder to compute high-dimensional actuation commands. Such models can have a large number of parameters and can be difficult to interpret from a user perspective. In this work, we propose that similar performance gains in robotics tasks can be achieved by restructuring the neural network to map observations to a basis for a context-dependent linear actuation subspace. This results in an action interface in which a user's actions determine a linear combination of a state-conditioned actuation basis. We introduce the Neural Householder Transform (NHT) as a method for computing this basis. Our results suggest that reinforcement learning agents trained with NHT in kinematic manipulation and locomotion environments are more robust to hyperparameter choice and achieve higher final success rates than agents trained with alternative action representations. NHT agents outperformed agents trained with joint velocity/torque actions, agents trained with an SVD actuation basis, and agents trained with a LASER action interface in the WAMWipe, WAMGrasp, and HalfCheetah environments.

1. INTRODUCTION

In real-world applications of reinforcement learning, it is imperative to choose appropriate representations when defining the Markov decision process. Poor design decisions can have adverse effects in domains like robotics, where safety (Tosatto et al., 2021) and sample efficiency (Li et al., 2021) are desirable properties. Typically these properties can be captured by the choice of action space. Robot action types distinct from basic joint motor control, such as Cartesian control or impedance control, have been shown to influence the efficiency of robotic learning, depending on the task (Martín-Martín et al., 2019). Researchers have typically focused on learning action representations that can capture a variety of robotic motions, an interest that has led to several different action representation frameworks. One framework is motor primitives, in which entire trajectories are encoded as the action (Paraschos et al., 2013; Schaal, 2006). Motor primitives have seen much success in robotics, leading to impressive real-world experimental results by constraining the action space (Tosatto et al., 2021; Kober & Peters, 2009). Another framework is latent actions, in which per-time-step actions are compressed into a latent subspace. These models are typically conditional autoencoders trained to predict the high-dimensional actions given the state and latent action. Such methods have been used successfully both in learning systems (Allshire et al., 2021; Zhou et al., 2020; van der Pol et al., 2020) and in human-in-the-loop settings (Losey et al., 2021; 2020; Karamcheti et al., 2021; Jun Jeon et al., 2020). It remains unclear whether robotics tasks require deep, complex action models, and there is little work comparing latent action models across tasks of varying complexity.
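As a minimal illustration of the latent actions framework described above (not the architecture of any cited work), a decoder can be sketched as a small network that maps a (state, latent action) pair to a high-dimensional actuation command. All dimensions, weights, and the `decode` name below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 10-D observation, 2-D latent action, 7-D joint command.
OBS_DIM, LATENT_DIM, ACT_DIM, HIDDEN = 10, 2, 7, 32

# Randomly initialized weights stand in for a trained decoder.
W1 = rng.standard_normal((HIDDEN, OBS_DIM + LATENT_DIM)) * 0.1
W2 = rng.standard_normal((ACT_DIM, HIDDEN)) * 0.1

def decode(obs, z):
    """Map a (observation, latent action) pair to a high-dimensional command."""
    h = np.tanh(W1 @ np.concatenate([obs, z]))
    return W2 @ h

u = decode(rng.standard_normal(OBS_DIM), rng.standard_normal(LATENT_DIM))
```

Because the decoder is nonlinear in both inputs, the mapping from latent action to command is state-dependent and generally not linear, which is the property the contextual subspace approach replaces with a state-conditioned linear map.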
For example, hand poses, a complex high-dimensional action space, can have up to 80% of configurations explained by two principal components (Santello et al., 1998). This result has been exploited to develop low-dimensional linear control algorithms, but these assume all actions lie in a single global linear subspace (Matrone et al., 2012; Odest & Jenkins, 2007; Artemiadis & Kyriakopoulos, 2010; Liang et al., 2022). In this work we propose an approach in which a neural network produces a state-dependent basis for a linear actuation subspace. We refer to this as contextual subspace approximation. Actuation commands (e.g. joint velocities) are locally linear with respect to low-dimensional inputs, but globally non-linear because the actuation subspace changes as a function of context. The motivation for contextual subspace approximation and the corresponding solutions can be summarized as follows: 1) contextual subspace approximation requires less data, because a k-dimensional subspace is completely determined by just k linearly independent samples; 2) from the agent's perspective, action maps change the transition dynamics of the environment, and using simpler functions results in simpler dynamics; 3) models for contextual subspace approximation can be notably smaller, since they do away with the encoder from the latent actions framework. The model proposed here uses Householder transformations to obtain an orthonormal basis for the desired actuation subspace. Householder transformations are often used in QR factorization to efficiently compute least-squares solutions to over-determined systems of linear equations. This property has been exploited in several settings to define learnable orthonormal matrices in applications of QR factorization for machine learning (Nirwan & Bertschinger, 2019; Dass & Mahapatra, 2021; van den Berg et al., 2018).
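To illustrate how Householder reflections produce an orthonormal basis, the sketch below implements the standard Householder QR step (this is textbook QR, not the NHT model itself; `householder_basis` and the dimensions are our own illustrative choices):

```python
import numpy as np

def householder_basis(V):
    """Orthonormal basis (n x k) for span(V), V an n x k full-rank matrix.

    Each column of V triggers one reflection H = I - 2 v v^T / (v^T v);
    accumulating the reflections yields Q with A = Q R, and the first k
    columns of Q span the column space of V.
    """
    n, k = V.shape
    R = V.astype(float).copy()
    Q = np.eye(n)
    for j in range(k):
        x = R[j:, j]
        v = x.copy()
        v[0] += np.copysign(np.linalg.norm(x), x[0])  # stable sign choice
        v /= np.linalg.norm(v)
        R[j:, j:] -= 2.0 * np.outer(v, v @ R[j:, j:])  # apply H to R
        Q[:, j:] -= 2.0 * np.outer(Q[:, j:] @ v, v)    # accumulate Q = H1 H2 ...
    return Q[:, :k]

rng = np.random.default_rng(1)
Q = householder_basis(rng.standard_normal((7, 2)))  # 7-D actuation, 2-D input
u = Q @ np.array([0.5, -0.3])                       # locally linear action map
```

The final line shows the resulting action interface: a low-dimensional input is mapped to a high-dimensional command as a linear combination of the basis columns.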
Additional work has studied applications of Householder reflections including normalizing flows (Tomczak & Welling, 2016; Mathiasen et al., 2020), network activation functions (Singla et al., 2021), and decompositions of recurrent and dense layers in neural networks (Mhammedi et al., 2017; Zhang et al., 2018; Obukhov et al., 2021). To the best of our knowledge, our work is the first to study Householder matrices in the context of latent action models. We identify our contributions as the following:
• We propose contextual subspace approximation as a novel alternative to end-to-end nonlinear latent action models for robotic control.
• We prove that the Neural Householder Transform is smooth with respect to changes in context, and can output bases for the optimal actuation subspace associated with each context.
• Our experiments empirically suggest that in two simulated kinematic manipulation tasks and one locomotion task, reinforcement learning agents trained with Neural Householder Transforms learn more efficiently than agents trained with 7-DoF, SVD, or LASER action interfaces.

2. BACKGROUND AND PRELIMINARIES

In this section, we formalize our framework for learning action representations and outline the background needed to contextualize our work, including deep latent action models and their combination with Markov decision processes. We compare linear, locally linear, and nonlinear action mapping approaches through experiments on reinforcement learning problems.

2.1. PROBLEM STATEMENT

We assume that the data we wish to model was observed in some context, and that the resulting dataset is a collection of context-datapoint pairs (datapoints and contexts are both represented by vectors). We formulate the problem of contextual subspace approximation by supposing that, for every context c, there exists an associated subspace that best approximates the data observed in the neighborhood of c. We use x = (c, u) to denote a tuple consisting of a datapoint u and the context c in which it was observed. For convenience, we define the following functions to extract the context and data from a tuple x, respectively: C(x) = c; U(x) = u. In addition, we denote the neighborhood of a context point as N(c) = {c′ : ∥c - c′∥ < δ} for some δ > 0.

Definition 2.1 (Optimal Contextual Subspace). We define W*(c), the optimal k-dimensional subspace associated with context c, as the k-dimensional subspace that minimizes the expected projection error of data observed in the neighborhood of c:

    W*(c) = argmin_{W : dim(W) = k}  E[ ∥U(x) - proj_W(U(x))∥² | C(x) ∈ N(c) ],

where proj_W denotes orthogonal projection onto the subspace W.
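Definition 2.1 can be illustrated empirically: for a finite dataset, the k-dimensional subspace minimizing the mean squared projection error of the datapoints observed near c is spanned by their top-k right singular vectors (by the Eckart-Young theorem). The sketch below uses hypothetical data and our own function name:

```python
import numpy as np

def optimal_subspace(contexts, data, c, delta, k):
    """Empirical W*(c): orthonormal basis (columns) of the best k-dim
    subspace for datapoints whose context lies in the delta-ball N(c)."""
    mask = np.linalg.norm(contexts - c, axis=1) < delta
    U_nbhd = data[mask]                       # datapoints u with C(x) in N(c)
    # Top-k right singular vectors minimize mean squared projection error.
    _, _, Vt = np.linalg.svd(U_nbhd, full_matrices=False)
    return Vt[:k].T

rng = np.random.default_rng(2)
contexts = rng.standard_normal((500, 3))      # context vectors c
data = rng.standard_normal((500, 7))          # datapoints u (e.g. actuations)
B = optimal_subspace(contexts, data, contexts[0], delta=2.0, k=2)
resid = data - (data @ B) @ B.T               # error of projecting onto span(B)
```

The role of NHT, described in later sections, is to produce such a basis directly from the context with a neural network, rather than by storing data and re-solving an SVD for every query.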