CONTEXTUAL SUBSPACE APPROXIMATION WITH NEURAL HOUSEHOLDER TRANSFORMS

Abstract

Choosing an appropriate action representation is an integral part of solving robotic manipulation problems. Published approaches include latent action models, which compress the control space into a low-dimensional manifold. These involve training a conditional autoencoder, where the current observation and a low-dimensional action are passed through a neural network decoder to compute high-dimensional actuation commands. Such models can have a large number of parameters and can be difficult to interpret from a user perspective. In this work, we propose that similar performance gains in robotics tasks can be achieved by restructuring the neural network to map observations to a basis for a context-dependent linear actuation subspace. This results in an action interface wherein a user's actions determine a linear combination of a state-conditioned actuation basis. We introduce the Neural Householder Transform (NHT) as a method for computing this basis. Our results suggest that reinforcement learning agents trained with NHT in kinematic manipulation and locomotion environments are more robust to hyperparameter choice and achieve higher final success rates compared to agents trained with alternative action representations. NHT agents outperformed agents trained with joint velocity/torque actions, agents trained with an SVD actuation basis, and agents trained with a LASER action interface in the WAMWipe, WAMGrasp, and HalfCheetah environments.
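As a rough illustration of the interface described above (a sketch, not the paper's exact architecture or training procedure), the columns of a Householder reflection built from a single unit vector form an orthonormal basis; taking the first k columns yields a k-dimensional actuation subspace, and a low-dimensional action selects a linear combination of those columns. Here the vector v stands in for a neural network output conditioned on the observation, and the dimensions (a 7-DoF arm, 2-dimensional actions) are illustrative assumptions:

```python
import numpy as np

def householder_basis(v, k):
    """Columns of the Householder reflection H = I - 2 v v^T / ||v||^2
    are orthonormal; the first k columns span a k-dim actuation subspace."""
    v = v / np.linalg.norm(v)
    H = np.eye(len(v)) - 2.0 * np.outer(v, v)
    return H[:, :k]

rng = np.random.default_rng(0)
v = rng.standard_normal(7)   # stand-in for a network output conditioned on the observation
B = householder_basis(v, 2)  # state-conditioned actuation basis, shape (7, 2)
a = np.array([0.5, -0.3])    # low-dimensional action from the user or RL agent
u = B @ a                    # high-dimensional actuation command (e.g., joint velocities)
```

Because B has orthonormal columns, distinct low-dimensional actions map to distinct, non-degenerate actuation commands, which is one motivation for constructing the basis from a reflection rather than an arbitrary learned matrix.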

1. INTRODUCTION

In real-world applications of reinforcement learning, it is imperative to choose appropriate representations when defining the Markov decision process. Poor design decisions can have adverse effects in domains like robotics, where safety (Tosatto et al., 2021) and sample efficiency (Li et al., 2021) are desirable properties. These properties are often shaped by the choice of action space. Robot action types distinct from basic joint motor control, such as Cartesian control or impedance control, have been shown to influence the efficiency of robotic learning, depending on the task (Martín-Martín et al., 2019). Researchers have typically focused on learning action representations that can capture a variety of robotic motions, and this interest has led to several different action representation frameworks. One framework is motor primitives, in which entire trajectories are encoded as the action (Paraschos et al., 2013; Schaal, 2006). Motor primitives have seen much success in robotics, leading to impressive real-world experimental results by constraining the action space (Tosatto et al., 2021; Kober & Peters, 2009). Another framework is latent actions, in which per-time-step actions are compressed into a latent subspace. Typically these models are conditional autoencoders trained to predict the high-dimensional actions given the state and latent action. These methods have been used successfully in both learning systems (Allshire et al., 2021; Zhou et al., 2020; van der Pol et al., 2020) and human-in-the-loop settings (Losey et al., 2021; 2020; Karamcheti et al., 2021; Jun Jeon et al., 2020). It remains unclear whether robotics tasks require deep, complex action models, and there is little work comparing latent action models across tasks of varying complexity.
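A minimal sketch of the conditional-decoder mapping in the latent actions framework may clarify the interface: the decoder consumes the current observation together with a low-dimensional latent action and outputs a high-dimensional actuation command. The network shape, dimensions, and random weights below are placeholder assumptions; in an actual latent action model the weights are learned by training an encoder-decoder pair on demonstration data:

```python
import numpy as np

rng = np.random.default_rng(0)
obs_dim, latent_dim, act_dim, hidden = 10, 2, 7, 32  # illustrative sizes

# placeholder weights; a real latent action model learns these by
# training the (encoder, decoder) pair to reconstruct demonstrated actions
params = {
    "W1": rng.standard_normal((hidden, obs_dim + latent_dim)) * 0.1,
    "b1": np.zeros(hidden),
    "W2": rng.standard_normal((act_dim, hidden)) * 0.1,
    "b2": np.zeros(act_dim),
}

def decode(state, latent_action, params):
    """Conditional decoder: (observation, low-dim latent action) -> high-dim action."""
    x = np.concatenate([state, latent_action])      # condition on the observation
    h = np.tanh(params["W1"] @ x + params["b1"])    # hidden layer
    return params["W2"] @ h + params["b2"]          # high-dimensional actuation command

u = decode(rng.standard_normal(obs_dim), np.array([0.3, -0.1]), params)
```

Note the contrast with the basis approach proposed in this paper: here the latent action passes through the full nonlinear network, so the map from latent action to actuation command is generally nonlinear and harder to interpret.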
For example, hand poses, a complex, high-dimensional action space, can have up to 80% of configurations explained by two principal components (Santello et al., 1998). This result has been exploited to develop low-dimensional linear control algorithms, but these assume all actions exist in a single global linear subspace.
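The global-linear-subspace assumption can be sketched concretely. Below, synthetic data (an illustrative stand-in for hand-pose measurements, not the Santello et al. dataset) varies mostly along two hidden directions; the top two principal components, recovered via SVD, then give a single fixed, state-independent basis for control:

```python
import numpy as np

# synthetic stand-in for pose data: 20-dim samples that mostly vary
# along two hidden directions, plus a small amount of noise
rng = np.random.default_rng(1)
latent = rng.standard_normal((500, 2))
mixing = rng.standard_normal((2, 20))
poses = latent @ mixing + 0.05 * rng.standard_normal((500, 20))

# top-2 principal components of the centered data via SVD
centered = poses - poses.mean(axis=0)
_, s, Vt = np.linalg.svd(centered, full_matrices=False)
explained = (s[:2] ** 2).sum() / (s ** 2).sum()  # variance captured by 2 PCs
basis = Vt[:2].T                                 # (20, 2) global, state-independent basis
```

The key limitation motivating this paper: the basis above is the same for every state, whereas a context-dependent method recomputes the subspace from the current observation.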

