MY BODY IS A CAGE: THE ROLE OF MORPHOLOGY IN GRAPH-BASED INCOMPATIBLE CONTROL

Abstract

Multitask Reinforcement Learning is a promising way to obtain models with better performance, generalisation, data efficiency, and robustness. Most existing work is limited to compatible settings, where the state and action space dimensions are the same across tasks. Graph Neural Networks (GNNs) are one way to address incompatible environments, because they can process graphs of arbitrary size. They also allow practitioners to inject biases encoded in the structure of the input graph. Existing work in graph-based continuous control uses the physical morphology of the agent to construct the input graph, i.e., encoding limb features as node labels and connecting two nodes with an edge if their corresponding limbs are physically connected. In this work, we present a series of ablations on existing methods showing that morphological information encoded in the graph does not improve their performance. Motivated by the hypothesis that any benefits GNNs extract from the graph structure are outweighed by difficulties they create for message passing, we also propose AMORPHEUS, a transformer-based approach. Further results show that, while AMORPHEUS ignores the morphological information that GNNs encode, it nonetheless substantially outperforms GNN-based methods that use the morphological information to define the message-passing scheme.

1. INTRODUCTION

Multitask Reinforcement Learning (MTRL) (Vithayathil Varghese & Mahmoud, 2020) leverages commonalities between multiple tasks to obtain policies with better returns, generalisation, data efficiency, or robustness. Most MTRL work assumes compatible state-action spaces, where the dimensionality of the states and actions is the same across tasks. However, many practically important domains, such as robotics, combinatorial optimization, and object-oriented environments, have incompatible state-action spaces and cannot be solved by common MTRL approaches. Incompatible environments are avoided largely because they are inconvenient for function approximation: conventional architectures expect fixed-size inputs and outputs. One way to overcome this limitation is to use Graph Neural Networks (GNNs) (Gori et al., 2005; Scarselli et al., 2005; Battaglia et al., 2018). A key feature of GNNs is that they can process graphs of arbitrary size and thus, in principle, allow MTRL in incompatible environments. However, GNNs also have a second key feature: they allow models to condition on structural information about how state features are related, e.g., how a robot's limbs are connected. In effect, this enables practitioners to incorporate additional domain knowledge by describing states as labelled graphs. Here, a graph is a collection of labelled nodes, indicating the features of corresponding objects, and edges, indicating the relations between them. In many cases, e.g., with the robot mentioned above, such domain knowledge is readily available. This results in a structural inductive bias that restricts the model's computation graph, determining how errors backpropagate through the network. GNNs have been applied to MTRL in continuous control environments, a staple benchmark of modern Reinforcement Learning (RL), by leveraging both of the key features mentioned above (Wang et al., 2018; Huang et al., 2020).
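To make the structural bias concrete, the following sketch shows one message-passing step over a morphology graph. This is an illustrative simplification, not the exact architecture of Wang et al. (2018) or Huang et al. (2020); the function and weight names are ours.

```python
import numpy as np

def message_passing_step(node_feats, edges, W_msg, W_upd):
    """One simplified GNN message-passing step over a morphology graph.

    node_feats: (num_limbs, d) array of per-limb features.
    edges: list of (i, j) pairs for physically connected limbs.
    W_msg, W_upd: (d, d) weight matrices (hypothetical parameters).
    """
    num_limbs, d = node_feats.shape
    messages = np.zeros((num_limbs, d))
    for i, j in edges:
        # Messages flow both ways along each physical joint.
        messages[j] += node_feats[i] @ W_msg
        messages[i] += node_feats[j] @ W_msg
    # Update each node using the sum of its neighbours' messages.
    return np.tanh(node_feats @ W_upd + messages)

# A 4-limb chain (e.g., torso-thigh-shin-foot): information from the
# foot reaches the torso only after three such steps (multi-hop).
edges = [(0, 1), (1, 2), (2, 3)]
rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 8))
W_msg = rng.normal(size=(8, 8)) * 0.1
W_upd = rng.normal(size=(8, 8)) * 0.1
out = message_passing_step(feats, edges, W_msg, W_upd)
```

Note how the edge list, derived from the physical morphology, fixes the model's computation graph: a node only receives information from its physical neighbours in each step.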
In these two works, the labelled graphs are based on the agent's physical morphology, with nodes labelled with the observable features of their corresponding limbs, e.g., coordinates, angular velocities and limb type. If two limbs are physically connected, there is an edge between their corresponding nodes. However, the assumption that it is beneficial to restrict the model's computation graph in this way has, to our knowledge, not been validated. To investigate this issue, we conduct a series of ablations on existing GNN-based continuous control methods. The results show that removing morphological information does not harm the performance of these models. In addition, we propose AMORPHEUS, a new continuous control MTRL method based on transformers (Vaswani et al., 2017) instead of GNNs that use morphological information to define the message-passing scheme. AMORPHEUS is motivated by the hypothesis that any benefit GNNs can extract from the morphological domain knowledge encoded in the graph is outweighed by the difficulty that the graph creates for message passing. In a sparsely connected graph, crucial state information must be communicated across multiple hops, which we hypothesise is difficult to learn in practice. AMORPHEUS uses transformers instead, which can be thought of as fully connected GNNs with attentional aggregation (Battaglia et al., 2018). Hence, AMORPHEUS ignores the morphological domain knowledge but in exchange obviates the need to learn multi-hop communication. Similarly, in Natural Language Processing, transformers were shown to perform better without an explicit structural bias and even to learn such structures from data (Vig & Belinkov, 2019; Goldberg, 2019; Tenney et al., 2019; Peters et al., 2018).
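The view of a transformer as a fully connected GNN with attentional aggregation can be sketched as a single self-attention step over limb nodes. Again, this is a minimal single-head illustration with names of our choosing, not the AMORPHEUS implementation.

```python
import numpy as np

def attention_step(node_feats, W_q, W_k, W_v):
    """One self-attention step over limb nodes.

    Every node attends to every other node, so information travels
    between any pair of limbs in a single step -- no multi-hop
    message passing along the morphology graph is required.
    """
    Q, K, V = node_feats @ W_q, node_feats @ W_k, node_feats @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[1])         # (num_limbs, num_limbs)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)  # row-wise softmax
    return weights @ V, weights

rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 8))    # 4 limbs, 8 features each
W_q, W_k, W_v = (rng.normal(size=(8, 8)) * 0.1 for _ in range(3))
out, attn = attention_step(feats, W_q, W_k, W_v)
```

The attention matrix plays the role of a dense, input-dependent edge set: rather than being fixed by the morphology, the "edges" are learned weights that the model can adapt per state.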
Our results on incompatible MTRL continuous control benchmarks (Huang et al., 2020; Wang et al., 2018) strongly support our hypothesis: AMORPHEUS substantially outperforms GNN-based alternatives with fixed message-passing schemes in terms of sample efficiency and final performance. In addition, AMORPHEUS exhibits nontrivial behaviour such as cyclic attention patterns coordinated with gaits.

2. BACKGROUND

We now describe the necessary background for the rest of the paper.

2.1. REINFORCEMENT LEARNING

A Markov Decision Process (MDP) is a tuple ⟨S, A, R, T, ρ_0⟩. The first two elements define the set of states S and the set of actions A. The next element defines the reward function R(s, a, s′) with s, s′ ∈ S and a ∈ A. T(s′|s, a) is the probability distribution over next states s′ ∈ S after taking action a in state s. The last element of the tuple, ρ_0, is the distribution over initial states. Task and environment are synonyms for MDPs in this work. A policy π(a|s) is a mapping from states to distributions over actions. The goal of an RL agent is to find a policy that maximises the expected discounted cumulative return J = E[∑_{t=0}^{∞} γ^t r_t], where γ ∈ [0, 1) is a discount factor, t is the discrete environment step and r_t is the reward at step t. In the MTRL setting, the agent aims to maximise the average performance across N tasks: (1/N) ∑_{i=1}^{N} J_i. We use MTRL return to denote this average performance across the tasks. In this paper, we assume that states and actions are multivariate, but their dimensionality remains constant within one MDP: s ∈ R^k for all s ∈ S, and a ∈ R^l for all a ∈ A. We use dim(S) = k and dim(A) = l to denote these dimensionalities, which can differ amongst MDPs. We consider two tasks MDP_1 and MDP_2 incompatible if the dimensionalities of their state or action spaces disagree, i.e., dim(S_1) ≠ dim(S_2) or dim(A_1) ≠ dim(A_2).
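The two objectives above can be illustrated with a short sketch (function names are ours): the discounted return for one sampled episode, and the MTRL objective as the average of per-task returns.

```python
def discounted_return(rewards, gamma):
    """Monte Carlo estimate of J = sum_t gamma^t * r_t for one episode."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

def mtrl_return(task_returns):
    """MTRL objective: (1/N) * sum_i J_i over N tasks."""
    return sum(task_returns) / len(task_returns)
```

For example, discounted_return([1.0, 1.0, 1.0], gamma=0.5) evaluates to 1 + 0.5 + 0.25 = 1.75, and mtrl_return([2.0, 4.0]) gives 3.0.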

