CONTEXTUAL SUBSPACE APPROXIMATION WITH NEURAL HOUSEHOLDER TRANSFORMS

Abstract

Choosing an appropriate action representation is an integral part of solving robotic manipulation problems. Published approaches include latent action models, which compress the control space into a low-dimensional manifold. These involve training a conditional autoencoder, where the current observation and a low-dimensional action are passed through a neural network decoder to compute high-dimensional actuation commands. Such models can have a large number of parameters and can be difficult to interpret from a user perspective. In this work, we propose that similar performance gains in robotics tasks can be achieved by restructuring the neural network to map observations to a basis for a context-dependent linear actuation subspace. This results in an action interface wherein a user's actions determine a linear combination of a state-conditioned actuation basis. We introduce the Neural Householder Transform (NHT) as a method for computing this basis. Our results suggest that reinforcement learning agents trained with NHT in kinematic manipulation and locomotion environments are more robust to hyperparameter choice and achieve higher final success rates compared to agents trained with alternative action representations. NHT agents outperformed agents trained with joint velocity/torque actions, agents trained with an SVD actuation basis, and agents trained with a LASER action interface in the WAMWipe, WAMGrasp, and HalfCheetah environments.

1. INTRODUCTION

In real-world applications of reinforcement learning, it is imperative to choose appropriate representations when defining the Markov decision process. Poor design decisions can have adverse effects in domains like robotics, where safety (Tosatto et al., 2021) and sample efficiency (Li et al., 2021) are desirable properties. Typically, these properties can be captured by the choice of action space. Choices of robot action types distinct from basic joint motor control, such as Cartesian control or impedance control, have been shown to influence the efficiency of robotic learning, depending on the task (Martín-Martín et al., 2019). Researchers have typically focused on learning action representations that can capture a variety of robotic motions. This interest has led to the development of several different action representation frameworks. One framework includes motor primitives, in which entire trajectories are encoded as the action (Paraschos et al., 2013; Schaal, 2006). Motor primitives have seen much success in robotics, leading to impressive real-world experimental results by constraining the action space (Tosatto et al., 2021; Kober & Peters, 2009). Another framework is the latent actions framework, in which actions-per-time-step are compressed into a latent subspace. Typically these are conditional autoencoders trained to predict the high-dimensional actions given the state and latent action. These methods have been used successfully in both learning systems (Allshire et al., 2021; Zhou et al., 2020; van der Pol et al., 2020) and human-in-the-loop settings (Losey et al., 2021; 2020; Karamcheti et al., 2021; Jun Jeon et al., 2020). It remains unclear whether robotics tasks must have deep, complex action models. There is little work comparing latent action models across tasks of varying complexity.
For example, hand poses (a complex, high-dimensional action space) can have up to 80% of configurations explained by two principal components (Santello et al., 1998). This result has been exploited to develop low-dimensional linear control algorithms, but they assume all actions exist in a global linear subspace (Matrone et al., 2012; Odest & Jenkins, 2007; Artemiadis & Kyriakopoulos, 2010; Liang et al., 2022). In this work we propose an approach in which we use a neural network to produce a state-dependent basis for a linear actuation subspace. We refer to this as contextual subspace approximation. Actuation commands (e.g. joint velocities) are locally linear with respect to low-dimensional inputs, but globally nonlinear, as the actuation subspace changes as a function of context. The motivation for contextual subspace approximation and the corresponding solutions can be summarized as follows: 1) Contextual subspace approximation requires less data because a k-dimensional subspace is completely determined by just k linearly independent samples. 2) From the agent's perspective, action maps change the transition dynamics of the environment, and using simpler functions results in simpler dynamics. 3) Models for contextual subspace approximation can be notably smaller by doing away with the encoder from the latent actions framework. The model proposed here uses Householder transformations to obtain an orthonormal basis for the desired actuation subspace. Householder transformations are often used in QR factorization to efficiently compute least-squares solutions to over-determined systems of linear equations. They have also been exploited in several settings to define learnable orthonormal matrices in applications of QR factorization for machine learning (Nirwan & Bertschinger, 2019; Dass & Mahapatra, 2021; van den Berg et al., 2018).
Additional work has studied applications of Householder reflections that include normalizing flows (Tomczak & Welling, 2016; Mathiasen et al., 2020), network activation functions (Singla et al., 2021), and decomposition of recurrent and dense layers in neural networks (Mhammedi et al., 2017; Zhang et al., 2018; Obukhov et al., 2021). To the best of our knowledge, our work is the first to study Householder matrices in the context of latent action models. We identify our contributions as the following:
• We propose contextual subspace approximation as a novel alternative to end-to-end nonlinear latent action models for robotic control.
• We prove that the Neural Householder Transform is smooth with respect to changes in context, and can output bases for the optimal actuation subspace associated with each context.
• Our experiments empirically suggest that in two simulated kinematic manipulation tasks and one locomotion task, reinforcement learning agents trained with Neural Householder Transforms learn more efficiently than agents trained to act with 7dof, SVD, or LASER action interfaces.

2. BACKGROUND AND PRELIMINARIES

In this section, we formalize our framework for learning action representations. We outline relevant background knowledge to contextualize our work, including deep latent action models, and their combination with Markov decision processes. We compare linear, locally-linear, and nonlinear action mapping approaches by conducting experiments on reinforcement learning problems.

2.1. PROBLEM STATEMENT

We assume that the data we wish to model was observed in some context, and that the resulting dataset is a collection of context-datapoint pairs (datapoints and contexts are both represented by vectors). We formulate the problem of contextual subspace approximation by supposing that, for every context $c$, there exists an associated subspace that best approximates the data observed in the neighborhood of $c$. We use $x = (c, u)$ to denote a tuple consisting of a datapoint $u$ and the context $c$ in which it was observed. For convenience, we define the following functions to extract the context and data from a tuple $x$, respectively: $C(x) = c$; $U(x) = u$. In addition, we denote the neighborhood of a context point as $N(c) = \{c' : \|c - c'\| < \delta\}$ for some $\delta > 0$.

Definition 2.1 (Optimal Contextual Subspace). We define $W^*(c)$, the optimal k-dimensional subspace associated with context $c$, as the k-dimensional subspace that minimizes the expected projection error of data observed in the neighborhood of $c$:

$$W^*(c) \doteq \arg\min_{W} \; \mathbb{E}_{x \mid C(x) \in N(c)} \left[ \left\| U(x) - \mathrm{proj}_W(U(x)) \right\|_2^2 \right] \tag{1}$$

where $W$ is a k-dimensional linear subspace of $\mathbb{R}^n$, and $\mathrm{proj}_W(U(x))$ is the orthogonal projection of the data $u$ onto $W$. Our goal is to approximate a function $\hat{Q}^*(c)$ that maps context vectors to an orthonormal basis for the associated optimal contextual subspace:

$$\hat{Q}^* : c \mapsto \hat{Q} \quad \text{such that} \quad \mathrm{col}(\hat{Q}) = W^*(c) \tag{2}$$

where $\hat{Q} \in \mathbb{R}^{n \times k}$ is an $n \times k$ matrix of real numbers, and $\mathrm{col}(\hat{Q})$ is the column space of $\hat{Q}$. We assume access to a dataset of datapoint-context pairs.
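To make Definition 2.1 concrete, the following NumPy sketch estimates a basis for the optimal subspace at a single context from samples. It relies on the standard fact that the top-k right singular vectors of a data matrix minimize the mean squared projection error over all k-dimensional linear subspaces. The function name `local_subspace_basis`, the toy data, and the values chosen for the neighborhood radius and k are assumptions made for this illustration; this is not the NHT model itself.

```python
import numpy as np

def local_subspace_basis(contexts, data, c, delta, k):
    """Orthonormal n x k basis for the subspace that best fits data observed near c."""
    # Keep only datapoints whose context lies in the neighborhood N(c).
    near = np.linalg.norm(contexts - c, axis=1) < delta
    U = data[near]                                   # shape (m, n)
    # Top-k right singular vectors minimize the mean squared projection error.
    _, _, vt = np.linalg.svd(U, full_matrices=False)
    return vt[:k].T

rng = np.random.default_rng(0)
contexts = rng.uniform(-1.0, 1.0, size=(500, 3))
# Toy data: every datapoint lies exactly in one 2-D subspace of R^5.
true_basis = np.linalg.qr(rng.normal(size=(5, 2)))[0]
data = rng.normal(size=(500, 2)) @ true_basis.T

Q_hat = local_subspace_basis(contexts, data, c=np.zeros(3), delta=0.8, k=2)
residual = data - data @ Q_hat @ Q_hat.T             # projection error per datapoint
```

Because the toy data lie in a single global subspace, the recovered basis reconstructs every datapoint; with context-dependent data the basis would have to be recomputed for each neighborhood, which is the map that NHT is trained to approximate.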

2.2. MARKOV DECISION PROCESSES

A Markov Decision Process (MDP) is defined by the tuple $\langle S, A, T, p(s_0), r(s, a, s') \rangle$, where $S$ is the state space and $A$ is the action space. The transition probability operator $T(s, a, s') : S \times A \times S \to [0, 1]$ denotes the probability of transitioning to state $s' \in S$ when taking an action $a \in A$ from a state $s \in S$. $p(s_0)$ is the initial state distribution, and $r(s, a, s')$ defines the reward function. In this framework, the optimal policy search problem involves finding some $\pi^*$ that maximizes the expected discounted return: $\pi^*(s) = \arg\max_\pi V^\pi(s)$, where $V^\pi(s) = \mathbb{E}_\pi\left[\sum_{i=t}^{T} \gamma^{i-t} r(s_i, a_i, s_{i+1}) \mid s_t = s\right]$ and $\gamma$ is the discount factor. Often in real-world problems, reinforcement learning agents must approximate $\pi^*(s)$, as the state representation is intractably large. As we do not assume access to the underlying state $s$, we will deal with observations, which serve as our context $c$. We are interested in representing low-dimensional contextual subspaces that approximate the high-dimensional actuations $u$ of robotic agents. This paper studies learning action interfaces that map actions $a \in \mathbb{R}^k$ to raw actuation commands $u \in \mathbb{R}^n$. Throughout this work, the action space $A$ will be $\mathbb{R}^k$, where $k$ is smaller than the dimensionality of the raw actuation space (e.g. number of joints) of the robotic agent. In the MDP framework, we can interpret action interfaces as being absorbed in the transition dynamics, $T(s, a, s')$.
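The idea that an action interface is absorbed into the transition dynamics can be sketched as an environment wrapper. In the hedged example below, both classes are hypothetical: `ToyActuationEnv` is a stand-in environment whose state simply integrates actuation commands, and `ActionInterfaceEnv` exposes a k-dimensional action space on top of it. From the agent's point of view, only the wrapped `step` function exists.

```python
import numpy as np

class ToyActuationEnv:
    """Hypothetical environment: the state integrates n-dimensional actuations."""
    def __init__(self, n=7):
        self.state = np.zeros(n)

    def observe(self):
        return self.state.copy()

    def step(self, u):
        self.state = self.state + 0.1 * u
        return self.observe()

class ActionInterfaceEnv:
    """Wraps an environment so that the agent acts in R^k instead of R^n."""
    def __init__(self, env, basis_fn):
        self.env = env
        self.basis_fn = basis_fn            # maps a context c to an n x k matrix

    def step(self, a):
        c = self.env.observe()
        u = self.basis_fn(c) @ a            # low-dimensional action -> actuation
        return self.env.step(u)

fixed_basis = np.eye(7)[:, :2]              # placeholder: a constant 7 x 2 basis
env = ActionInterfaceEnv(ToyActuationEnv(), lambda c: fixed_basis)
next_obs = env.step(np.array([1.0, -2.0]))
```

Replacing the constant `basis_fn` with a context-dependent one changes the dynamics the agent experiences without changing the underlying environment.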

2.3. LATENT POLICY FRAMEWORK

The latent actions framework assumes that the actuation commands produced by the optimal policy $\pi^*$ exist on some lower-dimensional manifold. In latent action models, latent actions $z \in \mathbb{R}^k$ are mapped to this manifold. These models have typically been studied in settings where there exists a dataset of transition tuples $(c, u, c', r)$. Here $c'$ is the context observed after the agent performs actuation $u$ in context $c$, and $r$ is the corresponding reward. We follow this paradigm of learning from offline demonstrations, and leave the study of learning latent action models online as future work, noting that some researchers have previously studied this setting (Allshire et al., 2021). Broadly, the models previously studied are conditional autoencoders (CAEs). These models include a neural encoder $f_\theta(c, u) = z$ which predicts the latent action. If the model is a variational CAE, then $f_\theta(c, u) = (\mu, \sigma)$, and $z$ is sampled from the Gaussian parameterized by $\mu, \sigma$ using the reparameterization trick (Kingma & Welling, 2014). These latent actions are then decoded with a decoder $g_\theta(z, c) = u$, where $c$ is assumed to contextualize how the latent action $z$ should map to the higher-dimensional space. In some works, there is also a latent transition model $T_\theta(z, c) = c'$, which is trained to encourage the latent space to be predictive of transitions (Allshire et al., 2021; van der Pol et al., 2020). The most general loss function incorporating the above models is the following:

$$\arg\min_\theta \; L_{recon}(c, u, g_\theta, f_\theta) + \beta L_{reg}(c, u, f_\theta) + \alpha L_{dyn}(z, c, c', T_\theta). \tag{3}$$

The first term, $L_{recon}$, is responsible for enforcing that the actuations reconstructed from latent actions approximate the demonstrated actuations. The second term, $L_{reg}$, incorporates all the terms that enforce additional requirements on the latent space.
Typical choices are compression terms that pack the latent codes into some desired distribution; these can include the Kullback-Leibler divergence, maximum mean discrepancy, or simply the L2-norm of $z$. The third term, $L_{dyn}$, is used to encourage the latent actions to be predictive of transitions. The LASER algorithm is a representative example of this framework (Allshire et al., 2021). LASER trains a latent dynamics model in conjunction with a variational autoencoder.
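The structure of the general loss in Equation (3) can be illustrated with a small NumPy sketch. Everything below is a simplification chosen for illustration: the function `latent_action_loss` uses squared-error reconstruction and dynamics terms and the squared L2 norm of z as the regularizer (one of the options listed above), and the encoder, decoder, and transition model are fixed toy functions rather than trained networks. It should not be read as the LASER implementation.

```python
import numpy as np

def latent_action_loss(c, u, c_next, encode, decode, predict, beta, alpha):
    """Reconstruction + beta * regularization + alpha * dynamics terms."""
    z = encode(c, u)                                          # latent actions, (N, k)
    recon = np.mean(np.sum((decode(z, c) - u) ** 2, axis=1))
    reg = np.mean(np.sum(z ** 2, axis=1))                     # squared L2 norm of z
    dyn = np.mean(np.sum((predict(z, c) - c_next) ** 2, axis=1))
    return recon + beta * reg + alpha * dyn

# Toy stand-ins for the encoder f, decoder g, and latent transition model T.
rng = np.random.default_rng(0)
N, n, k = 32, 7, 2
A = np.linalg.qr(rng.normal(size=(n, k)))[0]    # fixed matrix with orthonormal columns
encode = lambda c, u: u @ A
decode = lambda z, c: z @ A.T
predict = lambda z, c: c                        # predicts that the context does not change

c = rng.normal(size=(N, 4))
u = rng.normal(size=(N, k)) @ A.T               # actuations that lie inside span(A)
loss = latent_action_loss(c, u, c, encode, decode, predict, beta=0.1, alpha=1.0)
```

In this toy setup the reconstruction and dynamics terms vanish by construction, so the loss reduces to the weighted regularizer, which makes the role of β easy to see.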

3. CONTEXTUAL SUBSPACE APPROXIMATION

In this section, we describe our proposed alternative to the conditional autoencoder paradigm of latent action models. The goal is to compute a useful map from low-dimensional, task-relevant actions $a \in \mathbb{R}^k$ to high-dimensional actuation commands $u \in \mathbb{R}^n$ (e.g. motor torques in a robotic manipulator), where $n > k$. Our approach centers on optimizing a parameterized approximation of $\hat{Q}^*$ (see Equation 2). First, let us consider using a linear map from the latent space to the actuation space. Instead of a nonlinear function $g_\theta : (a, c) \mapsto u$ that jointly maps context vectors and actions to the actuation space, we work with a nonlinear function $\hat{Q}_\theta : c \mapsto \hat{Q}$ that maps context vectors to a matrix. The matrix $\hat{Q} : a \mapsto u$ itself is a linear map from low-dimensional actions to high-dimensional actuation commands. As $\hat{Q}$ serves a similar purpose to $g_\theta$ (both map actions to actuation commands), we could consider $\hat{Q}$ to be a linear decoder in the latent action framework. Then optimization of the standard reconstruction loss is formulated as follows:

$$\theta^* = \arg\min_\theta \; \mathbb{E}_\pi \left\| u - \hat{Q} a \right\|_2^2 \tag{4}$$

where $\hat{Q} = \hat{Q}_\theta(c)$ is a function of the context $c$, and $\theta^*$ represents the optimal parameter vector for $\hat{Q}_\theta$. The problem now becomes how to select the action $a \in \mathbb{R}^k$ to use in this optimization. One approach is to follow the conditional autoencoder paradigm and predict $a$ with an encoder neural network. We opt instead to compute the optimal action $a^*$, which we define as the action that minimizes the objective in Equation (4) for fixed $u$ and $\hat{Q}$ when $\theta$ is held constant:

$$a^* = \arg\min_a \left\| u - \hat{Q} a \right\|_2^2 \tag{5}$$

Finding $a^*$ is a least squares problem, and the solution can be computed with the Moore-Penrose left pseudoinverse $\hat{Q}^+ = (\hat{Q}^\top \hat{Q})^{-1} \hat{Q}^\top$. The solution to (5) is given by $a^* = \hat{Q}^+ u$. Now our optimization problem becomes:

$$\theta^* = \arg\min_\theta \; \mathbb{E}_\pi \left\| u - \hat{Q} \hat{Q}^+ u \right\|_2^2 \tag{6}$$

Note that the matrix $\hat{Q} \hat{Q}^+$ is an orthogonal projector onto $\mathrm{span}(\hat{Q})$. Therefore, when we calculate $\hat{u} = \hat{Q} \hat{Q}^+ u$, we are performing an orthogonal projection of $u$ onto $\mathrm{span}(\hat{Q})$. That is, $\hat{u} = \mathrm{proj}_{\mathrm{span}(\hat{Q})}(u)$. Now it is clear that the solution to Equation (6) is the best approximation attainable by $\hat{Q}_\theta$ to the optimal actuation subspace $W^*(c)$ defined in Equation (1).
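The least squares step in Equations (5) and (6) can be checked numerically. The sketch below uses a random full-rank matrix in place of the network output (an assumption made purely for illustration) and confirms that the left pseudoinverse yields the least squares action and that the resulting reconstruction is an orthogonal projection of the actuation.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 7, 2
Q_hat = rng.normal(size=(n, k))                      # arbitrary full-rank n x k matrix
u = rng.normal(size=n)                               # actuation to approximate

Q_pinv = np.linalg.inv(Q_hat.T @ Q_hat) @ Q_hat.T    # left pseudoinverse
a_star = Q_pinv @ u                                  # optimal low-dimensional action
u_hat = Q_hat @ a_star                               # reconstruction of u

P = Q_hat @ Q_pinv                                   # projector onto span(Q_hat)
residual = u - u_hat                                 # orthogonal to span(Q_hat)
```

Here `P` is symmetric and idempotent, and `residual` is orthogonal to every column of `Q_hat`, which are the defining properties of an orthogonal projection.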

3.1. NEURAL HOUSEHOLDER TRANSFORM

It can be desirable for the matrix $\hat{Q}$ produced by $\hat{Q}_\theta$ to have orthonormal columns. One reason is that $\hat{Q}^+$ can then be trivially computed as $\hat{Q}^+ = \hat{Q}^\top$, which is computationally cheaper. Our experimental results also indicate that learning a $\hat{Q}_\theta$ that produces $\hat{Q}$ with orthonormal columns tends to be more robust to hyperparameter choices (see Appendix). For these reasons we compute $\hat{Q}$ by first computing an $n \times n$ orthogonal matrix $Q$, and then extracting the first $k$ columns:

$$\hat{Q} = Q \begin{bmatrix} I_k \\ 0 \end{bmatrix} \tag{7}$$

We can obtain $n \times n$ orthogonal matrices by computing Householder transformations. However, in order to span an arbitrary k-dimensional subspace, we need to chain together $k$ reflections:

$$Q = H(v_1) H(v_2) \cdots H(v_k) \tag{8}$$

where $H : \mathbb{R}^n \to \mathbb{R}^{n \times n}$ computes the Householder matrix that reflects about the hyperplane orthogonal to $v_i$:

$$H(v_i) = I - 2 v_i v_i^\top, \quad i \in \{1, \ldots, k\} \tag{9}$$

where each $v_i$ has unit norm. Next, we describe how NHT uses a neural network to compute these unit vectors $v_i$.
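The construction in Equations (7) to (9) is short enough to write out directly. The following sketch uses random vectors in place of network outputs (an assumption for illustration) and the hypothetical helper names `householder` and `basis_from_reflections`; it confirms that the first k columns of a product of Householder reflections are orthonormal, so that the pseudoinverse is simply the transpose.

```python
import numpy as np

def householder(v):
    """Householder matrix reflecting about the hyperplane orthogonal to v."""
    v = v / np.linalg.norm(v)                # ensure unit norm
    return np.eye(v.size) - 2.0 * np.outer(v, v)

def basis_from_reflections(vs, k):
    """First k columns of H(v_1) H(v_2) ... H(v_k)."""
    Q = np.eye(vs[0].size)
    for v in vs:
        Q = Q @ householder(v)
    return Q[:, :k]

rng = np.random.default_rng(1)
n, k = 7, 3
vs = [rng.normal(size=n) for _ in range(k)]
Q_hat = basis_from_reflections(vs, k)
```

Orthonormality holds for any choice of the vectors, which is why the model can be trained without an explicit orthogonality constraint or penalty.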

3.1.1. EXPONENTIAL MAP ON UNIT SPHERE

We would like to leverage neural networks to learn a map from contexts $c$ to the $v_i$ needed to compute $Q$. We can readily obtain unit $v_i$ from the output of a typical neural network $h_\theta$ by simple normalization: $v_i = h_\theta(c)/\|h_\theta(c)\|$. Unfortunately, this approach can result in unstable approximations: as the norm of $h_\theta(c)$ shrinks, arbitrarily small perturbations to the context can cause disproportionate changes in $v_i$. As a more stable alternative, we make use of the exponential map from Riemannian geometry (Absil et al., 2008), which maps points in the tangent space of a manifold to the manifold itself (in our case, the sphere). We seek unit vectors in $\mathbb{R}^n$, which lie on the $(n-1)$-sphere. We can therefore make use of the exponential map on $S^{n-1}$ at $e_1$ (the first standard basis vector, $e_1 = [1, 0, \ldots, 0]^\top$). The mapping

$$\mathrm{Exp}_{e_1} : \xi_i \mapsto v_i \tag{10}$$

maps tangent vectors $\xi_i \in \mathbb{R}^{n-1}$ to unit vectors $v_i \in \mathbb{R}^n$. We require $k$ tangent vectors $\xi_i$ that will map to the $v_i$ vectors used to compute $Q$. We therefore configure the neural network $h_\theta$ to output a vector $\bar{\xi} \in \mathbb{R}^{k(n-1)}$. We treat the output of $h_\theta$ as $k$ stacked tangent vectors:

$$h_\theta : c \mapsto \bar{\xi}; \quad \bar{\xi} = \left[ \xi_1^\top, \xi_2^\top, \ldots, \xi_k^\top \right]^\top \tag{11}$$

We then use the exponential map on each tangent vector, resulting in a vector $\bar{v} \in \mathbb{R}^{nk}$ of stacked unit n-vectors:

$$\begin{bmatrix} \xi_1 \\ \xi_2 \\ \vdots \\ \xi_k \end{bmatrix} \xrightarrow{\;\mathrm{Exp}\;} \begin{bmatrix} v_1 \\ v_2 \\ \vdots \\ v_k \end{bmatrix} \tag{12}$$

where $\bar{v} = \left[ v_1^\top, v_2^\top, \ldots, v_k^\top \right]^\top$. Each $v_i$ is then used to compute a Householder matrix (Eq. 9), and these matrices are composed to obtain $Q(\bar{v})$ (Eq. 8). Overall, NHT ($\hat{Q}_\theta : c \mapsto \hat{Q}$) can be understood as the composition of each of these computations:

$$\underbrace{c \xrightarrow{\;h_\theta\;} \bar{\xi} \xrightarrow{\;\mathrm{Exp}\;} \bar{v} \xrightarrow{\;Q\;} Q(\bar{v}) \to \hat{Q}(\bar{v})}_{\text{NHT}} \tag{13}$$

where $c$ is a context vector of arbitrary dimension, $Q(\bar{v})$ is an $n \times n$ orthogonal matrix, and $\hat{Q}(\bar{v})$ is the matrix formed by the first $k$ columns of $Q(\bar{v})$.
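The full pipeline from context to basis can be sketched in a few lines. The example below uses the standard closed form of the exponential map on the unit sphere at e_1, namely cos(‖ξ‖) e_1 + sin(‖ξ‖) [0; ξ]/‖ξ‖, where the tangent space at e_1 is identified with R^(n-1) by dropping the first coordinate (our assumption about the intended parameterization). The network `h_theta` is replaced by a hypothetical one-hidden-layer network with random, untrained weights, so the function `nht_forward` illustrates only the shape of the computation, not a trained model.

```python
import numpy as np

def exp_map_e1(xi):
    """Exponential map on the unit sphere at e_1; xi has n - 1 entries."""
    r = np.linalg.norm(xi)
    # np.sinc(x) = sin(pi x) / (pi x), so the factor below equals sin(r) / r
    # and remains well defined at r = 0.
    return np.concatenate(([np.cos(r)], np.sinc(r / np.pi) * xi))

def nht_forward(c, weights, n, k):
    """Hypothetical NHT forward pass: context -> n x k orthonormal basis."""
    W1, W2 = weights
    xi = W2 @ np.tanh(W1 @ c)                         # k stacked tangent vectors
    Q = np.eye(n)
    for xi_i in xi.reshape(k, n - 1):
        v = exp_map_e1(xi_i)                          # unit vector in R^n
        Q = Q @ (np.eye(n) - 2.0 * np.outer(v, v))    # chain Householder reflections
    return Q[:, :k]

rng = np.random.default_rng(0)
n, k, ctx_dim, hidden = 7, 2, 10, 16
weights = (rng.normal(size=(hidden, ctx_dim)),
           rng.normal(size=(k * (n - 1), hidden)))
Q_hat = nht_forward(rng.normal(size=ctx_dim), weights, n, k)
```

Unlike normalization, the exponential map is defined at the zero vector (it returns e_1 there), which is the source of the improved stability described above.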

3.2. EXISTENCE

If we hope to use NHT to approximate arbitrary subspaces, it is important to ensure that for every k-dimensional subspace $W$ of $\mathbb{R}^n$, there exists a vector $\bar{v} \in \mathbb{R}^{nk}$ such that $W = \mathrm{span}(\hat{Q}(\bar{v}))$.

Remark. Let $W \subseteq \mathbb{R}^n$ be an arbitrary k-dimensional subspace. There is a sequence of $k$ Householder reflectors $Q = H_1 H_2 \cdots H_k$ such that the first $k$ columns of $Q$ are an orthonormal basis of $W$.

Proof. Let $M$ be an $n \times k$ matrix whose column space is $W$. By Algorithm 10.1 of Trefethen & Bau (1997), we can construct a QR decomposition $M = QR$, where $Q$ is the product of exactly $k$ Householder reflections. It is a basic property of QR decompositions that, for $M$ of full column rank, the first $k$ columns of $Q$ are an orthonormal basis for the column space of $M$, which is $W$.

Thus, given the existence of an optimal contextual subspace $W^*(c)$, we can be sure that there exists some $\bar{v}^*$ such that the columns of $\hat{Q}(\bar{v}^*)$ span $W^*(c)$. It is left to the neural network $h_\theta$ to approximate a set of tangent vectors $\bar{\xi}$ that map to $\bar{v}^*$, given $c$.
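The existence argument is constructive, and can be sketched with a plain Householder QR loop in the spirit of the algorithm cited above. The function `householder_vectors` below is our own illustrative implementation (not code from the paper); given a full-rank n x k matrix, it returns k unit vectors whose reflections triangularize the matrix, so that the first k columns of their product span the same column space.

```python
import numpy as np

def householder_vectors(M):
    """Unit vectors v_1..v_k such that H(v_k)...H(v_1) M is upper triangular."""
    R = M.astype(float).copy()
    n, k = R.shape
    vs = []
    for j in range(k):
        x = R[j:, j]
        v = np.zeros(n)
        v[j:] = x
        # Reflect x onto a multiple of the j-th axis; the sign avoids cancellation.
        v[j] += np.copysign(np.linalg.norm(x), x[0])
        v /= np.linalg.norm(v)
        R -= 2.0 * np.outer(v, v @ R)        # apply H(v) to the working matrix
        vs.append(v)
    return vs

rng = np.random.default_rng(0)
n, k = 6, 2
M = rng.normal(size=(n, k))                  # columns span the target subspace W
Q = np.eye(n)
for v in householder_vectors(M):
    Q = Q @ (np.eye(n) - 2.0 * np.outer(v, v))
Q_hat = Q[:, :k]
```

Projecting `M` onto the span of `Q_hat` leaves it unchanged, which confirms that the two column spaces coincide for this example.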

3.3. SMOOTHNESS OF Q θ

We conjecture that low-dimensional action interfaces that change abruptly from state to state may degrade learning in RL agents. Thus we are interested in whether or not $\hat{Q} = \hat{Q}_\theta(c)$ changes smoothly with respect to changes in the context. Concretely, we would like to limit the change in the high-dimensional actuation $\hat{u}$ corresponding to an identical low-dimensional action $a$, given that the change in the context is small. Let $\hat{Q}_1 = \hat{Q}_\theta(c_1)$ and $\hat{Q}_2 = \hat{Q}_\theta(c_2)$ for two nearby contexts, $c_1$ and $c_2$. Suppose an agent takes the same low-dimensional action $a$ in both contexts. Denote the corresponding actuation commands as $\hat{u}_1 = \hat{Q}_1 a$ and $\hat{u}_2 = \hat{Q}_2 a$, respectively. We would like to limit the magnitude of the change in $\hat{u}$ (i.e. $\|\hat{u}_1 - \hat{u}_2\|$) with respect to changes in context. That is, we would like to find a constant $L_u$ such that:

$$\|\hat{u}_1 - \hat{u}_2\| \le L_u \|c_1 - c_2\| \tag{14}$$

where $\|\cdot\|$ refers to the vector 2-norm. We begin by assuming that the agent is limited to low-dimensional actions with norm less than or equal to $M$. Then we have:

\begin{align}
\|\hat{u}_1 - \hat{u}_2\| &= \|\hat{Q}_1 a - \hat{Q}_2 a\| \tag{15} \\
&= \|(\hat{Q}_1 - \hat{Q}_2) a\| \tag{16} \\
&\le \|\hat{Q}_1 - \hat{Q}_2\| \cdot \|a\| \tag{17} \\
&\le M \|\hat{Q}_1 - \hat{Q}_2\| \tag{18}
\end{align}

where the norm in Eq. 18 is the matrix norm induced by the vector 2-norm. We now seek to bound this norm by finding a scalar constant $L$ such that

$$\|\hat{Q}_\theta(c_1) - \hat{Q}_\theta(c_2)\| \le L \|c_1 - c_2\| \tag{19}$$

If such an $L$ exists, it is called a Lipschitz constant, and $\hat{Q}_\theta$ is said to be L-Lipschitz. There is a well-understood procedure for training Lipschitz continuous neural networks (Gouk et al., 2018). Using this Lipschitz regularization, we can choose a constant $L_h$ such that

$$\|\bar{\xi}_1 - \bar{\xi}_2\| \le L_h \|c_1 - c_2\| \tag{20}$$

where $\bar{\xi}_1 = h_\theta(c_1)$ and $\bar{\xi}_2 = h_\theta(c_2)$. The exponential map on the sphere has a Lipschitz constant of 1, so we have the same result for the Lipschitz continuity of $\bar{v}$ with respect to changes in context. All that remains is to show that $Q(\bar{v})$ is Lipschitz continuous.

Theorem 1. Let $\bar{v}_1, \bar{v}_2 \in \mathbb{R}^{nk}$ be constructed from $k$ stacked unit n-vectors, and $Q(\bar{v})$ be the product of the corresponding Householder reflections (as defined in Eq. 8, 9). Then,

$$\|Q(\bar{v}_1) - Q(\bar{v}_2)\| \le L_Q \|\bar{v}_1 - \bar{v}_2\| \tag{21}$$

where $L_Q = 2\sqrt{k}$.

Because $\hat{Q}$ consists of the first $k$ columns of $Q$, we have $\|\hat{Q}_1 - \hat{Q}_2\| \le \|Q(\bar{v}_1) - Q(\bar{v}_2)\|$. Thus, the low-dimensional action $a$ is guaranteed to result in similar actuations in nearby contexts:

$$\|\hat{u}_1 - \hat{u}_2\| \le 2 L_h M \sqrt{k} \cdot \|c_1 - c_2\| \tag{22}$$

where $\hat{u}_1 = \hat{Q}_1 a$ and $\hat{u}_2 = \hat{Q}_2 a$, respectively.
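The bound in Theorem 1 can be spot-checked numerically. The sketch below samples random stacks of unit vectors together with small perturbations of them, and records the largest observed ratio between the change in the product of reflections (spectral norm) and the change in the stacked vectors (Euclidean norm). The sampling scheme, the perturbation scale, and the helper name `reflection_product` are assumptions made for this check; it illustrates the theorem but is not a substitute for the proof.

```python
import numpy as np

def reflection_product(V):
    """Product H(v_1) ... H(v_k) for the rows v_i of V (each of unit norm)."""
    n = V.shape[1]
    Q = np.eye(n)
    for v in V:
        Q = Q @ (np.eye(n) - 2.0 * np.outer(v, v))
    return Q

rng = np.random.default_rng(0)
n, k = 7, 3
worst_ratio = 0.0
for _ in range(200):
    V1 = rng.normal(size=(k, n))
    V1 /= np.linalg.norm(V1, axis=1, keepdims=True)
    V2 = V1 + 1e-3 * rng.normal(size=(k, n))         # nearby stack of vectors
    V2 /= np.linalg.norm(V2, axis=1, keepdims=True)
    change_in_Q = np.linalg.norm(reflection_product(V1) - reflection_product(V2), ord=2)
    change_in_v = np.linalg.norm(V1 - V2)            # norm of the stacked difference
    worst_ratio = max(worst_ratio, change_in_Q / change_in_v)

bound = 2.0 * np.sqrt(k)
```

In every sampled pair the ratio stays below `bound`, consistent with the stated Lipschitz constant.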

4. EXPERIMENTAL SETUP

Our experimental results focus on validating the efficacy of Neural Householder Transforms for learning kinematic tasks within a custom MuJoCo simulation of a Barrett WAM robotic manipulator with seven degrees of freedom (see Figure 2). We model each task as an MDP, and report results in two environments: WAMWipe and WAMGrasp. Learning involves first training an NHT model on an offline dataset of demonstrations, and then fixing the parameters of NHT and using Deep Deterministic Policy Gradient (DDPG) (Lillicrap et al., 2016) to learn a policy online. We compare DDPG agents trained with an NHT action interface against agents trained with a state-of-the-art latent action model (Allshire et al., 2021), agents trained with an actuation basis computed by SVD, and agents trained in the raw actuation space of the task (7dof joint velocity control). In our experiments we used a publicly available implementation of DDPG (Andrychowicz et al., 2017).

4.1. WAM ENVIRONMENTS

WAMWipe and WAMGrasp were designed to study the effects of using NHT to augment reinforcement learning in kinematic manipulation tasks with a binary reward function. These tasks can be classified according to how their goals are defined, and constraints on the configuration of the manipulator during execution of the task. Table 1 in the appendix enumerates the goal type and constraints present in WAMWipe and WAMGrasp. Section A.1 includes detailed descriptions of these environments.

4.2. ACTION INTERFACE BASELINES

In addition to training NHT from a dataset of demonstrations, we trained LASER (Allshire et al., 2021) from the same dataset, and computed the singular value decomposition (SVD) of the dataset of joint velocities executed during the demonstrations. In our experiments, the state-conditioned actuation basis computed by NHT, the static basis computed by SVD, and the nonlinear decoder of LASER all serve as different choices of interface between DDPG and the raw actuation commands that determine the next state of the environment. In our WAMWipe experiments, NHT, LASER, and SVD all exposed a two-dimensional action interface to DDPG, while in WAMGrasp they all exposed three-dimensional interfaces. The actuation basis provided by SVD, with $k \in \{2, 3\}$, consisted of the $k$ singular vectors in $\mathbb{R}^7$ corresponding to the $k$ largest singular values. Demonstrations were collected by recording observation-actuation $(c, u)$ pairs from PD controllers that were hand-engineered for each environment. For WAMWipe, the dataset consisted of 20,000 transitions, where a single demonstration consisted of roughly 250 transitions on average. In WAMGrasp, the dataset consisted of 100,000 transitions, where a single demonstration consisted of roughly 100 transitions on average. In both environments, the observation $c$ on which the outputs of NHT and LASER are conditioned was the concatenation of joint angles and Cartesian coordinates of the end-effector. LASER is regularized by the KL and dynamics terms in its loss function (see Equation 3), while we regularize NHT by enforcing Lipschitz continuity with Lipschitz constant $L$ at each layer during training (Gouk et al., 2018). The Adam optimizer is used for both NHT and LASER, with learning rate $\alpha_{map}$ and otherwise default parameters. Likewise, Adam is used as the optimizer in our chosen implementation of DDPG, with learning rates $\alpha_{actor}$ and $\alpha_{critic}$ for the policy and value function, respectively.
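For comparison with the state-conditioned basis of NHT, the static SVD interface can be sketched as follows. The demonstration array here is synthetic, the function name `svd_actuation_basis` is hypothetical, and the data are not mean-centered before the decomposition; whether the original experiments centered the data is not stated, so that choice is an assumption.

```python
import numpy as np

def svd_actuation_basis(actuations, k):
    """Static n x k basis from an (N, n) array of demonstrated actuations."""
    _, _, vt = np.linalg.svd(actuations, full_matrices=False)
    return vt[:k].T                  # singular values are returned in descending order

rng = np.random.default_rng(0)
scales = np.array([5.0, 3.0, 1.0, 0.5, 0.2, 0.1, 0.05])
demo_u = rng.normal(size=(1000, 7)) * scales    # synthetic joint velocity data
B = svd_actuation_basis(demo_u, k=2)            # the same basis in every context
a = np.array([0.5, -1.0])                       # 2-D action chosen by the agent
u = B @ a                                       # 7-D joint velocity command
```

The key contrast with NHT is that `B` is computed once and never depends on the observation.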
We would like to estimate the performance of the best policy that could be learned by DDPG in a finite amount of time for agents trained with (1) an NHT action interface, (2) a LASER (Allshire et al., 2021) action interface, (3) an actuation basis computed by SVD, and (4) seven degree-of-freedom joint velocity actions. In addition, we would like to study the sensitivity of agent learning dynamics to different hyperparameter configurations for each action interface. As such, we performed a random search over map (i.e. NHT, LASER) and DDPG hyperparameters. We jointly sampled 128 configurations each for NHT + DDPG, LASER + DDPG, DDPG with SVD, and DDPG with joint velocity actions. For the latter two conditions the only hyperparameters of interest are those of DDPG itself. For each configuration, we first trained the mapping function (if applicable), and then trained the DDPG agent, over five runs with different random seeds. We chose to sample hyperparameters randomly because previous work suggests that random search finds good hyperparameters more efficiently than grid search (Bergstra & Bengio, 2012). The ranges and method of sampling used for each hyperparameter are listed in Table 2 of the appendix.
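The sampling procedure described above can be sketched with the standard library alone. The hyperparameter names follow the symbols used in the text, but every range below is a placeholder chosen for illustration; the actual ranges and sampling methods are those listed in Table 2 of the appendix, which this sketch does not reproduce.

```python
import random

def log_uniform(rng, low, high):
    """Sample so that the logarithm of the value is uniform between the bounds."""
    return low * (high / low) ** rng.random()

def sample_config(rng):
    # All ranges are hypothetical placeholders, not the values used in the paper.
    return {
        "alpha_map": log_uniform(rng, 1e-5, 1e-2),     # NHT / LASER learning rate
        "alpha_actor": log_uniform(rng, 1e-5, 1e-2),   # DDPG policy learning rate
        "alpha_critic": log_uniform(rng, 1e-5, 1e-2),  # DDPG value learning rate
        "lipschitz_L": rng.uniform(1.0, 10.0),         # per-layer Lipschitz constant
    }

rng = random.Random(0)
configs = [sample_config(rng) for _ in range(128)]     # 128 sampled configurations
seeds = range(5)                                       # five runs per configuration
runs = [(cfg, seed) for cfg in configs for seed in seeds]
```

Each entry of `runs` corresponds to one training run: the map is trained first (if applicable), then the agent.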

4.4. HALFCHEETAH

While our main interest for the application of NHT lies in constrained/safe robotic manipulation, there is value in validating the utility of NHT on more standard reinforcement learning environments. In addition, it is important to show that the action interface learned by NHT is useful for agents trained with various RL algorithms, not only for DDPG agents. We therefore performed a hyperparameter search experiment with a standard implementation (Dhariwal et al., 2017) of PPO (Schulman et al., 2017) on the HalfCheetah-v4 environment from OpenAI Gym (Brockman et al., 2016). This environment has a 17-dimensional observation space that includes angular positions and velocities, and a 6-dimensional torque actuation space (compared to the joint velocity actuation space in WAMGrasp and WAMWipe). We compared NHT agents to agents that learned in the standard 6dof actuation space of HalfCheetah, and to agents with LASER (Allshire et al., 2021) and SVD action interfaces. NHT, SVD, and LASER all learned 2-dimensional action interfaces. This experiment precisely mirrored the hyperparameter search experiments reported in section 4.3, except that the demonstrations used to train NHT, SVD, and LASER were collected from the best-performing policy learned by the standard 6dof agent. A total of just 1,000 transitions were recorded from this expert policy. The return on this demonstration episode was over 6,000. The hyperparameter ranges and sampling methods for this experiment are summarized in Table 3 (see Appendix).

5. EXPERIMENTAL RESULTS

Figure 4 summarizes the results of the random hyperparameter search in the WAMWipe and WAMGrasp environments. The violin plots represent the distribution of final success rates (success rate after 100 epochs of training) across every randomly sampled hyperparameter configuration. The learning curves in Figure 3 plot the mean success rate during training for the best-performing agent in each condition, averaged over five runs. DDPG with an NHT action interface produced the best-performing agents after hyperparameter optimization (higher success rates in fewer epochs) in both WAMWipe and WAMGrasp. In addition, the distributions of final success rates across hyperparameter configurations suggest that agents trained with NHT are more robust to hyperparameter choices compared to the baselines. Although in some runs the 7dof agent managed to reach a success rate of 100% in WAMGrasp, the variance of final success rates amongst 7dof agents is much larger than the variance of success rates for NHT agents. In general there was not a strong correlation between any one hyperparameter and the final performance of the agents (coefficient of determination < 0.1). The learning curves of the agents with the best average final performance, and the distribution of final agent performances for each method in HalfCheetah-v4, are shown in Figure 5. Interestingly, we found that the constant (i.e. not state-dependent) action interface of SVD was sufficient to learn more efficiently than the standard 6dof agent while still achieving the same asymptotic performance. This suggests that all of the instantaneous actuations used by an expert (> 6,000 return) HalfCheetah agent lie close to a fixed 2-dimensional linear subspace! There appears to be some benefit to the agent learning in an adaptive actuation subspace with NHT, although the performance gains are small in this environment. The agents learning with NHT again tended to be more robust to different hyperparameter configurations.

6. CONCLUSION

We proposed contextual subspace approximation as a novel alternative to deep latent action models for robotic control. We derived the Neural Householder Transform model as an approach to contextual subspace approximation, and showed that it is smooth with respect to changes in context. In a large hyperparameter search experiment, we found that reinforcement learning agents trained with NHT outperformed agents trained to act in (1) the original actuation space, (2

A.1.1 WAMWIPE

In WAMWipe the goal is to control the manipulator such that the flat face of the last link remains flush against a table while sliding to a randomly sampled goal position. The reward is -1 at every step unless the end-effector is within a small distance of the goal position, in which case the reward is 0. Episode failure occurs if the end-effector (1) pushes into the table, (2) lifts off of the table, or (3) tilts such that it is no longer flush with the table. Let $p$ denote the unit vector orthogonal to the face of the end-effector, pictured as a purple arrow in Figure A.1. Constraint (3), the orientation constraint, was considered violated when the angle between $p$ and the vector orthogonal to the surface of the table (not pictured) was greater than $\pi/16$ radians. The agent observation in our experiments was a concatenated vector of joint angles, Cartesian coordinates of the end-effector, Cartesian coordinates of the goal position, and the unit vector orthogonal to the face of the end-effector. The actions in our WAMWipe experiments were either 7dof joint velocity commands, or 2-dimensional actions input to an NHT, SVD, or LASER action interface.

A.1.2 WAMGRASP

In WAMGrasp the goal is to reach a randomly sampled grasp-point while simultaneously achieving a goal orientation that is determined by the grasp-point. Let $p^*$ denote the unit vector pointing from the grasp-point (small sphere in We consider the orientation satisfactory if the angle $\theta$ between $p^*$ and the vector orthogonal to the face of the manipulator, $p$, is less than $\pi/16$ radians. The reward in WAMGrasp is -1 at every step unless the end-effector is within a small distance of the grasp-point with a satisfactory orientation. Episode failure occurs if the end-effector collides with either the object being grasped (large red sphere in Figure A.1) or the table. In each episode the grasp-point is randomly sampled from the surface of a sphere with the same center as, but a larger radius than, the large red sphere in Figure A.1. The agent observation was a concatenated vector of joint angles, Cartesian coordinates of the end-effector, and Cartesian coordinates of the grasp-point. The actions in our WAMGrasp experiments were either 7dof joint velocity commands, or 3-dimensional actions input to an NHT, SVD, or LASER action interface.
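The orientation tests and the sparse reward described for these environments can be sketched as below. The π/16 threshold comes from the text; the function names, the distance tolerance `tol`, and the convention that both vectors are oriented so that perfect alignment gives a zero angle are assumptions made for illustration.

```python
import numpy as np

MAX_TILT = np.pi / 16                        # orientation threshold from the text

def angle_between(p, q):
    """Angle in radians between two nonzero vectors."""
    cos = np.dot(p, q) / (np.linalg.norm(p) * np.linalg.norm(q))
    return np.arccos(np.clip(cos, -1.0, 1.0))

def wipe_orientation_violated(p, table_normal):
    """WAMWipe constraint (3): the face has tilted too far from the table normal."""
    return angle_between(p, table_normal) > MAX_TILT

def grasp_orientation_ok(p, p_star):
    """WAMGrasp: the face normal is close enough to the goal direction."""
    return angle_between(p, p_star) < MAX_TILT

def sparse_reward(ee_pos, goal_pos, tol=0.02):
    """0 near the goal and -1 otherwise; tol is a hypothetical tolerance."""
    return 0.0 if np.linalg.norm(ee_pos - goal_pos) < tol else -1.0

up = np.array([0.0, 0.0, 1.0])
slightly_tilted = np.array([0.0, np.sin(np.pi / 32), np.cos(np.pi / 32)])
badly_tilted = np.array([0.0, np.sin(np.pi / 8), np.cos(np.pi / 8)])
```

Clipping the cosine before `arccos` guards against values marginally outside [-1, 1] caused by rounding.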

A.2 HYPERPARAMETER SEARCH DETAILS

The hyperparameter search experiment described in section 4.3 of the main paper was designed to estimate the performance of the best policy that could be learned by DDPG in a finite amount of time for agents trained with (1) an NHT action interface, (2) a LASER (Allshire et al., 2021) action interface, (3) an actuation basis computed by SVD, and (4) seven degree-of-freedom joint velocity actions. The hyperparameter search also enabled us to study the sensitivity of agent learning dynamics to different hyperparameter configurations for each action interface. The results serve as empirical evidence with which to answer questions such as: "Could changing the neural architecture of NHT

A.3 SMOOTHNESS OF Q(v)

In this section we prove the Lipschitz continuity of $Q(\bar{v})$, as stated in Theorem 1.

Theorem 1. Let $\bar{v}_1, \bar{v}_2 \in \mathbb{R}^{nk}$ be constructed from $k$ stacked unit n-vectors, and $Q(\bar{v})$ be the product of the corresponding Householder reflections (as defined in Eq. 8, 9). Then,

$$\|Q(\bar{v}_1) - Q(\bar{v}_2)\| \le L_Q \|\bar{v}_1 - \bar{v}_2\| \tag{21}$$

where $L_Q = 2\sqrt{k}$.

We write $H_i$ as shorthand for $H(v_i) = I - 2 v_i v_i^\top$. We write $\bar{v} \in \mathbb{R}^{nk}$ to denote the concatenated column vector of $v_i \in \mathbb{R}^n$: $\bar{v} = [v_1^\top, v_2^\top, \ldots, v_k^\top]^\top$. We denote the map from $\bar{v}$ to the corresponding product of reflections as $Q : \bar{v} \mapsto Q(\bar{v})$, where

$$Q(\bar{v}) = H(v_1) H(v_2) \cdots H(v_k) \tag{23}$$

We likewise write $\bar{\delta} \in \mathbb{R}^{nk}$ to denote the concatenated vector of perturbations to each $v_i$:

$$\bar{\delta} = \begin{bmatrix} \delta'_1 \\ \delta'_2 \\ \vdots \\ \delta'_k \end{bmatrix} = \begin{bmatrix} c_1 \delta_1 \\ c_2 \delta_2 \\ \vdots \\ c_k \delta_k \end{bmatrix} \tag{24}$$

where $\|\bar{\delta}\| = 1$, with scalars $c_i \in \mathbb{R}$ scaling the unit-norm vectors $\delta_i$ that represent the direction of change for each $v_i$. We consider the directional derivative of $Q(\bar{v})$ in the direction of $\bar{\delta}$:

$$\nabla_{\bar{\delta}} Q(\bar{v}) \doteq \lim_{\epsilon \to 0} \frac{Q(\bar{v} + \epsilon \bar{\delta}) - Q(\bar{v})}{\epsilon} \tag{25}$$

where $\|\bar{\delta}\| = 1$. The existence of a positive constant $L_Q$ that bounds $\|\nabla_{\bar{\delta}} Q(\bar{v})\|$ implies Lipschitz continuity of $Q(\bar{v})$:

$$\|Q(\bar{v}_1) - Q(\bar{v}_2)\| \le L_Q \|\bar{v}_1 - \bar{v}_2\| \tag{26}$$

for all $\bar{v}_1, \bar{v}_2$ constructed with $k$ stacked unit n-vectors. We explicitly compute such an $L_Q$ below. As a first step, we show in section A.3.1 that $\|\nabla_\delta H(v)\| = 2$. We will then use this result to compute an upper bound on $L_Q$ in section A.3.2.

A.3.1 LIPSCHITZ CONTINUITY OF H(v)

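The directional derivative of $H(v)$ in the direction of $\delta$ is defined as:

$$\nabla_\delta H(v) \doteq \lim_{\epsilon \to 0} \frac{H(v + \epsilon\delta) - H(v)}{\epsilon} \tag{27}$$

where $\delta \in \mathbb{R}^n$. Recall that $v$ lies on the $(n-1)$-sphere, and thus any instantaneous change to $v$ must occur in a direction tangent to the sphere at $v$; that is, $\delta \perp v$. Furthermore, without loss of generality we let $\|\delta\| = 1$. Thus, $\delta$ is a unit vector in the direction of the perturbation of $v$. We first simplify the first term in the numerator:

\begin{align}
H(v + \epsilon\delta) &= I - 2(v + \epsilon\delta)(v + \epsilon\delta)^\top \tag{28} \\
&= I - 2(vv^\top + \epsilon\delta v^\top + \epsilon v\delta^\top + \epsilon^2 \delta\delta^\top) \tag{29} \\
&= I - 2vv^\top - 2(\epsilon\delta v^\top + \epsilon v\delta^\top + \epsilon^2 \delta\delta^\top) \tag{30} \\
&= H(v) - 2(\epsilon\delta v^\top + \epsilon v\delta^\top + \epsilon^2 \delta\delta^\top) \tag{31}
\end{align}

Substituting the result into the definition of $\nabla_\delta H(v)$, we have:

\begin{align}
\nabla_\delta H(v) &= \lim_{\epsilon \to 0} \frac{-2(\epsilon\delta v^\top + \epsilon v\delta^\top + \epsilon^2 \delta\delta^\top)}{\epsilon} \tag{32} \\
&= \lim_{\epsilon \to 0} -2(\delta v^\top + v\delta^\top + \epsilon\delta\delta^\top) \tag{33} \\
&= -2(\delta v^\top + v\delta^\top) \tag{34}
\end{align}

Now we compute $\|\nabla_\delta H(v)\|$. Note that the symmetry of the sphere guarantees that $\|\nabla_\delta H(v)\|$ is invariant with respect to both $\delta$ and $v$.

$$\|\nabla_\delta H(v)\| \doteq \max_{x \neq 0} \frac{\|\nabla_\delta H(v)\, x\|}{\|x\|} \tag{35}$$

The numerator is maximized when $x$ is in the plane spanned by $v$ and $\delta$. Given this is the case, we can write $x$ as a linear combination of $v$ and $\delta$. Let $x = \alpha v + \beta\delta$ for some $\alpha, \beta \in \mathbb{R}$. We then have the following:

\begin{align}
\|\nabla_\delta H(v)\| &= \max_{x \neq 0} \frac{\|\nabla_\delta H(v)\, x\|}{\|x\|} \tag{37} \\
&= \max_{x \neq 0} \frac{\|-2(\delta v^\top + v\delta^\top)\, x\|}{\|x\|} \tag{38} \\
&= 2\, \frac{\|(\delta v^\top + v\delta^\top)(\alpha v + \beta\delta)\|}{\|x\|} \tag{39} \\
&= 2\, \frac{\|\alpha\delta v^\top v + \alpha v\delta^\top v + \beta\delta v^\top\delta + \beta v\delta^\top\delta\|}{\|x\|} \tag{40} \\
&= 2\, \frac{\|\alpha\delta v^\top v + \beta v\delta^\top\delta\|}{\|x\|} \tag{41} \\
&= 2\, \frac{\|\alpha\delta + \beta v\|}{\|x\|} \tag{42}
\end{align}

where equation 41 follows from 40 by the fact that $\delta \perp v$. Equation 42 follows from the fact that both $\delta$ and $v$ have unit norm. Now, recall that $x = \alpha v + \beta\delta$. The numerator in 42 represents a simple change of basis for $x$. Since $\delta$ and $v$ are orthonormal, this change of basis preserves the norm of $x$. Hence $\|\alpha\delta + \beta v\| = \|x\|$, and we have:

$$\|\nabla_\delta H(v)\| = 2$$

This implies $H(v)$ is Lipschitz continuous with Lipschitz constant 2.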
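The result of this subsection can be checked numerically. The sketch below is an illustration rather than part of the proof: it builds a random unit vector $v$ and a unit tangent direction $\delta$, compares the closed-form derivative $-2(\delta v^\top + v\delta^\top)$ against a finite difference, and evaluates its spectral norm. The dimension `n = 7`, the random seed, and all function names are arbitrary choices made for this example.

```python
import numpy as np

def householder(v):
    """H(v) = I - 2 v v^T for a unit vector v."""
    return np.eye(v.size) - 2.0 * np.outer(v, v)

def grad_householder(v, delta):
    """Closed-form directional derivative -2 (delta v^T + v delta^T)."""
    return -2.0 * (np.outer(delta, v) + np.outer(v, delta))

rng = np.random.default_rng(0)
n = 7
v = rng.normal(size=n)
v /= np.linalg.norm(v)

# Tangent direction: remove the component along v, then normalize.
delta = rng.normal(size=n)
delta -= (delta @ v) * v
delta /= np.linalg.norm(delta)

analytic = grad_householder(v, delta)
eps = 1e-6
finite_diff = (householder(v + eps * delta) - householder(v)) / eps

# ord=2 gives the largest singular value, i.e. the induced 2-norm.
spectral_norm = np.linalg.norm(analytic, 2)
```

Under these assumptions `spectral_norm` should agree with the value 2 derived above up to floating-point error, and `finite_diff` should match `analytic` up to a term of order `eps`.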

A.3.2 LIPSCHITZ CONTINUITY OF Q

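We now consider the directional derivative of $Q(\bar{v})$ in the direction of $\bar{\delta}$:

$$\nabla_{\bar{\delta}} Q(\bar{v}) \doteq \lim_{\epsilon \to 0} \frac{Q(\bar{v} + \epsilon\bar{\delta}) - Q(\bar{v})}{\epsilon}$$

where $\|\bar{\delta}\| = 1$. Recall:

$$\bar{\delta} = \begin{bmatrix} \delta'_1 \\ \delta'_2 \\ \vdots \\ \delta'_k \end{bmatrix} = \begin{bmatrix} c_1 \delta_1 \\ c_2 \delta_2 \\ \vdots \\ c_k \delta_k \end{bmatrix}$$

Theorem 1. Let $\bar{v}_1, \bar{v}_2 \in \mathbb{R}^{nk}$ be constructed from $k$ stacked unit $n$-vectors, and let $Q(\bar{v})$ be the product of the corresponding Householder reflections (as defined in Eq. 8, 9). Then,

$$\|Q(\bar{v}_1) - Q(\bar{v}_2)\| \le L_Q \|\bar{v}_1 - \bar{v}_2\|$$

where $L_Q = 2\sqrt{k}$.

Proof. We begin by expanding the numerator of $\nabla_{\bar{\delta}} Q$:

$$\nabla_{\bar{\delta}} Q(\bar{v}) = \lim_{\epsilon \to 0} \frac{H(v_1 + \epsilon\delta'_1) H(v_2 + \epsilon\delta'_2) \cdots H(v_k + \epsilon\delta'_k) - Q(\bar{v})}{\epsilon}$$

We now consider the first term in the numerator. In the following we write $H_i$ as shorthand for $H(v_i)$, and $\nabla H_i$ as shorthand for $\nabla_{\delta_i} H(v_i)$, the derivative of $H(v_i)$ as defined in equation (27).

\begin{align}
& H(v_1 + \epsilon\delta'_1) H(v_2 + \epsilon\delta'_2) \cdots H(v_k + \epsilon\delta'_k) \tag{47} \\
&= (H_1 + \epsilon c_1 \nabla H_1)(H_2 + \epsilon c_2 \nabla H_2) \cdots (H_k + \epsilon c_k \nabla H_k) + O(\epsilon^2) \tag{48} \\
&= H_1 H_2 \cdots H_k + \epsilon c_1 (\nabla H_1) H_2 \cdots H_k + \epsilon c_2 H_1 (\nabla H_2) H_3 \cdots H_k \tag{49} \\
&\quad + \ldots + \epsilon c_k H_1 H_2 \cdots H_{k-1} (\nabla H_k) + O(\epsilon^2) \tag{50} \\
&= Q(\bar{v}) + \epsilon c_1 (\nabla H_1)\Big(\prod_{i=2}^{k} H_i\Big) + \epsilon c_2 H_1 (\nabla H_2)\Big(\prod_{i=3}^{k} H_i\Big) + \cdots + \epsilon c_k \Big(\prod_{i=1}^{k-1} H_i\Big)(\nabla H_k) + O(\epsilon^2) \tag{51} \\
&= Q(\bar{v}) + \epsilon \sum_{j=1}^{k} c_j \Big[\Big(\prod_{i=1}^{j-1} H_i\Big)(\nabla H_j)\Big(\prod_{l=j+1}^{k} H_l\Big)\Big] + O(\epsilon^2) \tag{52}
\end{align}

and substitute the result into the definition of $\nabla_{\bar{\delta}} Q(\bar{v})$:

\begin{align}
\nabla_{\bar{\delta}} Q(\bar{v}) &= \lim_{\epsilon \to 0} \frac{Q(\bar{v}) + \epsilon \sum_{j=1}^{k} c_j \Big(\prod_{i=1}^{j-1} H_i\Big)(\nabla H_j)\Big(\prod_{l=j+1}^{k} H_l\Big) + O(\epsilon^2) - Q(\bar{v})}{\epsilon} \tag{53} \\
&= \sum_{j=1}^{k} c_j \Big[\Big(\prod_{i=1}^{j-1} H_i\Big)(\nabla H_j)\Big(\prod_{l=j+1}^{k} H_l\Big)\Big]
\end{align}

Taking the norm and applying the triangle inequality, we obtain:

\begin{align}
\|\nabla_{\bar{\delta}} Q(\bar{v})\| &= \Big\| \sum_{j=1}^{k} c_j \Big(\prod_{i=1}^{j-1} H_i\Big)(\nabla H_j)\Big(\prod_{l=j+1}^{k} H_l\Big) \Big\| \notag \\
&\le \sum_{j=1}^{k} |c_j| \, \Big\| \Big(\prod_{i=1}^{j-1} H_i\Big)(\nabla H_j)\Big(\prod_{l=j+1}^{k} H_l\Big) \Big\| \notag \\
&= \sum_{j=1}^{k} |c_j| \, \|\nabla H_j\| \tag{57} \\
&\le \sqrt{\sum_{j=1}^{k} c_j^2} \, \sqrt{\sum_{j=1}^{k} \|\nabla H_j\|^2} \tag{58} \\
&= \sqrt{\sum_{j=1}^{k} (2)^2} \tag{60} \\
&= 2\sqrt{k} \notag
\end{align}

where equation (57) is thanks to the fact that each of the $H_i$ in the preceding expression is orthogonal, and equation (58) follows by the Cauchy-Schwarz inequality. The last two steps use $\sum_{j} c_j^2 = \|\bar{\delta}\|^2 = 1$ (each $\delta_j$ has unit norm) and $\|\nabla H_j\| = 2$ from section A.3.1. Hence, the norm of the directional derivative of $Q(\bar{v})$ is bounded by $2\sqrt{k}$; that is:

$$\|\nabla_{\bar{\delta}} Q(\bar{v})\| \le 2\sqrt{k} \tag{62}$$

which implies $\|Q(\bar{v}_1) - Q(\bar{v}_2)\| \le L_Q \|\bar{v}_1 - \bar{v}_2\|$ with Lipschitz constant $L_Q = 2\sqrt{k}$.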
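As an informal sanity check of Theorem 1 (not a substitute for the proof), the sketch below samples random pairs of stacked unit vectors and records the largest observed ratio $\|Q(\bar{v}_1) - Q(\bar{v}_2)\| / \|\bar{v}_1 - \bar{v}_2\|$, using the spectral norm for the matrix difference. The sizes `n = 7`, `k = 3`, the sample count, and the helper names are assumptions chosen for illustration.

```python
import numpy as np

def householder_product(vs):
    """Q = H(v_1) H(v_2) ... H(v_k) for the rows of vs (each a unit n-vector)."""
    n = vs.shape[1]
    Q = np.eye(n)
    for v in vs:
        Q = Q @ (np.eye(n) - 2.0 * np.outer(v, v))
    return Q

def random_unit_rows(rng, k, n):
    """Sample k random unit vectors in R^n, stacked as rows."""
    vs = rng.normal(size=(k, n))
    return vs / np.linalg.norm(vs, axis=1, keepdims=True)

rng = np.random.default_rng(0)
n, k = 7, 3
L_Q = 2.0 * np.sqrt(k)

worst_ratio = 0.0
for _ in range(1000):
    a = random_unit_rows(rng, k, n)
    b = random_unit_rows(rng, k, n)
    lhs = np.linalg.norm(householder_product(a) - householder_product(b), 2)
    rhs = np.linalg.norm(a - b)  # Euclidean norm of the stacked difference
    worst_ratio = max(worst_ratio, lhs / rhs)
```

Random sampling only probes the bound and will typically not come close to the worst case, so `worst_ratio` staying below `L_Q` is consistent with the theorem but says nothing about tightness.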
We compared NHT to the Jacobian pseudoinverse as an action interface for a DDPG agent in WAMGrasp and WAMWipe in a hyperparameter search experiment with the same methodology described in section 4.3 of the main text (128 hyperparameter configurations, 5 seeds for each configuration). As already noted, the dimensionality of the agent's action space was 6 when using the Jacobian pseudoinverse interface. NHT was used to learn a 2-dimensional action interface for WAMWipe, and a 3-dimensional action interface for WAMGrasp. The hyperparameter search results are plotted in figure A.3. The NHT hyperparameter search results reported in this figure are the same as those reported in section 4.3 of the main text. The variation in performance for the Jacobian pseudoinverse agents is entirely due to different DDPG agent configurations (the Jacobian has no hyperparameters). As expected, the agent with the Jacobian pseudoinverse action interface performed poorly in WAMWipe; like the 7-DoF joint velocity agent, the Jacobian pseudoinverse agent was able to freely jam the end-effector of the robot into the table, or lift the end-effector from the table, resulting in an automatic failure for its training episodes. Without a learned action interface, the exploratory behavior inherent in reinforcement learning resulted in destructive behavior that made learning in the highly constrained environment of WAMWipe difficult. In WAMGrasp, the Jacobian pseudoinverse agents were sometimes able to learn to achieve a 100% success rate. However, the violin plots in figure A.3 indicate that limiting the joint velocity commands to useful subspaces learned by NHT has some benefit over allowing free exploration with the Jacobian pseudoinverse interface. Some of the poorer hyperparameter configurations resulted in close to a 0% success rate when interacting with WAMGrasp through the Jacobian pseudoinverse interface.
NHT tended to concentrate agent performance, over all hyperparameter configurations, toward a success rate of 75% to 100%.



In the context of learning systems, exponential maps have been previously studied in the literature on normalizing flows (Rezende et al., 2020).

For the sphere, the exponential map at $e_1$ is computed as $v_i = e_1 \cos(\|\xi_i\|) + \frac{1}{\|\xi_i\|} \begin{bmatrix} 0 \\ \xi_i \end{bmatrix} \sin(\|\xi_i\|)$.

Although we used the implementation of DDPG introduced in the HER paper, we did not use HER in any of our experiments.
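The exponential map at $e_1$ quoted in the footnote above can be sketched in a few lines of NumPy. The function name and the small-norm guard are additions for this example (the formula as written is undefined at $\xi_i = 0$, where the limit is $e_1$); how $\xi_i$ is produced by the network is outside the scope of the sketch.

```python
import numpy as np

def sphere_exp_map_at_e1(xi):
    """Map a tangent vector xi in R^(n-1) at e_1 to a unit vector in R^n."""
    xi = np.asarray(xi, dtype=float)
    n = xi.size + 1
    e1 = np.zeros(n)
    e1[0] = 1.0
    norm = np.linalg.norm(xi)
    if norm < 1e-12:
        # The limit as xi -> 0 is e_1 itself; this guard is our addition.
        return e1
    # Pad xi with a leading zero so it lies in the tangent space at e_1.
    padded = np.concatenate(([0.0], xi))
    return e1 * np.cos(norm) + padded * (np.sin(norm) / norm)
```

Because $e_1$ and the padded vector are orthogonal, the output has unit norm for any input, which is what makes it usable as the vector defining a Householder reflection.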



Figure 1: Training procedure for NHT. $Q_\theta$ uses a neural network and Householder transformations to map a context vector to an $n \times k$ matrix $Q$ with orthonormal columns. The data $u$ associated with context $c$ is projected onto the column space of $Q$.

Figure 3: Learning curves corresponding to the configurations with the best average final success rate, over all hyperparameter configurations, for each method. Each curve shows the mean success rate over five runs of the best configuration, with the shaded regions indicating the standard error.

Proof. Please see section A.3 in the appendix.

Using the fact that the Lipschitz constant of a composition of Lipschitz continuous functions is upper bounded by the product of the constituent Lipschitz constants (Gouk et al., 2018), we combine the results of equation 20 and Theorem 1 to obtain a Lipschitz constant for $Q$: NHT is Lipschitz continuous with $L = 2 L_h \sqrt{k}$.

Figure 2: Simulated kinematic manipulation environments with distinct goal types and constraints.

Figure 4: Violin plots of final success rates across 128 randomly sampled hyperparameter configurations (5 runs each).

Figure 5: Results of hyperparameter search for action mapping methods in HalfCheetah-v4. Left: Learning curves corresponding to the configurations with the best average final success rate. Right: Violin plots of final success rates across 128 randomly sampled hyperparameter configurations (5 runs each).

Figure A.1: The 3-vector p pictured here was used to determine whether the orientation constraint/goal-condition was satisfied in WAMWipe/WAMGrasp, respectively.

Figure A.1) to the object being grasped (large sphere in Figure A.1).

Figure A.3: Violin plots of final success rates across 128 randomly sampled hyperparameter configurations (5 runs each) for NHT vs a Jacobian pseudoinverse (Jacobian pinv) action mapping baseline.

Properties of reinforcement learning environments in simulation experiments.


cause a significant drop in the final success rate of a policy learned by DDPG?" Answering such questions is non-trivial since there may be complex interactions between map (i.e. NHT, LASER) hyperparameters, DDPG hyperparameters, and final agent performance. It is unknown whether NHT hyperparameters tuned for an agent with arbitrary configuration A will be the best NHT hyperparameters for an agent with a different configuration B. For example, it is conceivable that a DDPG agent with hyperparameter configuration A may perform best with NHT configuration C, while DDPG with configuration B performs best with NHT configuration D. Thus a meaningful search should jointly vary the hyperparameters of the mapping models and the DDPG agent.

We jointly sampled 128 configurations each for NHT + DDPG, LASER + DDPG, DDPG with SVD, and DDPG with joint velocity actions. For each configuration, we first trained the mapping function (if applicable), and then trained the DDPG agent, over five runs with different random seeds. The range of values and sampling method used for each hyperparameter are listed in table 2.

For both WAMWipe and WAMGrasp, each agent was trained for one million environment steps, using three workers to generate experience. This resulted in 100 training epochs of 10,000 steps each.
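The bookkeeping of the joint search described above can be sketched as follows. Only the counts (128 configurations, 5 seeds, one million steps, 10,000 steps per epoch) come from the text; the hyperparameter names and ranges in `sample_joint_config` are hypothetical placeholders and do not reproduce table 2.

```python
import random

N_CONFIGS = 128
N_SEEDS = 5
TOTAL_ENV_STEPS = 1_000_000
STEPS_PER_EPOCH = 10_000

def sample_joint_config(rng):
    """Draw map and DDPG hyperparameters together, in one joint sample.

    All names and ranges here are illustrative placeholders, not table 2.
    """
    return {
        "map": {
            "hidden_units": rng.choice([64, 128, 256]),
            "lr": 10 ** rng.uniform(-4, -2),
        },
        "ddpg": {
            "actor_lr": 10 ** rng.uniform(-4, -2),
            "batch_size": rng.choice([128, 256, 512]),
        },
    }

rng = random.Random(0)
configs = [sample_joint_config(rng) for _ in range(N_CONFIGS)]

# Every configuration is trained once per seed.
runs = [(i, seed) for i in range(N_CONFIGS) for seed in range(N_SEEDS)]
n_epochs = TOTAL_ENV_STEPS // STEPS_PER_EPOCH
```

With these counts each method accounts for 128 × 5 = 640 runs per environment, and each run spans 100 epochs.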

A.4 ABLATION OF ORTHONORMAL CONSTRAINT

What is the benefit of enforcing orthonormal actuation bases in NHT? Besides the fact that the pseudoinverse of $Q$ can be computed trivially as the transpose during training, we wanted to find out if there was any empirical benefit. To answer this question we performed an experiment in which we trained a neural network to produce an arbitrary state-conditioned matrix as an actuation basis for WAMWipe and WAMGrasp. Unlike NHT, this baseline is not constrained to output a matrix with orthonormal columns. We will refer to the baseline as the state-conditioned linear map (SCL) model. The SCL baseline is a neural network $h_\theta : c \mapsto B \in \mathbb{R}^{n \times k}$ that maps context vectors to an $n \times k$ matrix $B$. In our WAMWipe experiment $n = 7$ and $k = 2$, while for WAMGrasp $n = 7$ and $k = 3$. We found that 64 out of 640 (10%) of the runs for SCL in WAMWipe failed due to numerical instability. In these cases the matrices output by the unconstrained neural network had large norms, resulting in very large joint velocity actuations that caused the MuJoCo simulations to fail. Interestingly, we did not observe the same numerical stability issues in the SCL models that were trained for WAMGrasp. Note that, in contrast, for NHT numerical stability is not an empirical issue. The 2-norm of the matrix produced by NHT is guaranteed to be equal to one.

In WAMWipe, the best hyperparameter configurations of SCL resulted in actuation interfaces that were suitable for the DDPG agent to achieve a 100% success rate. However, in both WAMWipe and WAMGrasp, the distributions of final agent performance in figure A.2 indicate that NHT was more robust than SCL with respect to variation in hyperparameter configurations. This suggests that NHT may be less sensitive to different choices of hyperparameters, making it easier to tune in practice.
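The two properties of an orthonormal basis mentioned above (unit 2-norm, and pseudoinverse equal to the transpose) can be illustrated numerically. In this sketch the random unit vectors stand in for a trained network's output, taking the first `k` columns of the Householder product is an assumption about how the basis is extracted, and the factor of 100 applied to `B` is an arbitrary way to mimic an unconstrained map with a large norm.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 7, 2  # WAMWipe dimensions from the text

# Stand-in for the network output: k random unit vectors (not a trained model).
vs = rng.normal(size=(k, n))
vs /= np.linalg.norm(vs, axis=1, keepdims=True)

# Product of Householder reflections, which is an orthogonal n x n matrix.
H_prod = np.eye(n)
for v in vs:
    H_prod = H_prod @ (np.eye(n) - 2.0 * np.outer(v, v))
Q = H_prod[:, :k]  # assumption: keep the first k columns as the basis

# Unconstrained stand-in for SCL: nothing bounds the scale of B.
B = 100.0 * rng.normal(size=(n, k))

a = rng.uniform(-1.0, 1.0, size=k)  # low-dimensional agent action
q_norm = np.linalg.norm(Q, 2)       # induced 2-norm of the orthonormal basis
b_norm = np.linalg.norm(B, 2)       # induced 2-norm of the unconstrained map
```

Since `Q` has orthonormal columns, the joint velocity command `Q @ a` has the same Euclidean norm as `a`, whereas `B @ a` can be amplified by up to `b_norm`.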

A.5 COMPARISON TO JACOBIAN PSEUDOINVERSE INTERFACE

Here we compare NHT to an additional choice of action interface that, unlike the baselines discussed in the main text, is not learned from demonstrations. The Jacobian of the robotic manipulator describes the relationship between the joint velocities and the Cartesian and angular velocity of the end-effector. The pseudoinverse of the Jacobian can be used to define a six-dimensional action interface for an RL agent: 3 dimensions in the agent's action space correspond to Cartesian velocity, and the remaining 3 correspond to angular velocity.
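A minimal sketch of such an interface is given below. The Jacobian here is a random 6 × 7 placeholder rather than the kinematics of a real WAM arm, and the function name and the example twist are chosen only for illustration.

```python
import numpy as np

def jacobian_pinv_action(J, twist):
    """Map a 6-D end-effector twist [v; omega] to joint velocities via J^+."""
    return np.linalg.pinv(J) @ twist

rng = np.random.default_rng(0)
J = rng.normal(size=(6, 7))  # placeholder Jacobian, not a real WAM model

# First three entries: Cartesian velocity; last three: angular velocity.
twist = np.array([0.1, 0.0, 0.0, 0.0, 0.0, 0.2])
q_dot = jacobian_pinv_action(J, twist)
```

When the Jacobian has full row rank, the resulting joint velocities reproduce the commanded twist exactly, and among all such solutions the pseudoinverse selects the one with the smallest Euclidean norm.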

