END-TO-END INVARIANCE LEARNING WITH RELATIONAL INDUCTIVE BIASES IN MULTI-OBJECT ROBOTIC MANIPULATION

Abstract

Although reinforcement learning has seen remarkable progress in recent years, solving robust dexterous object-manipulation tasks in multi-object settings remains a challenge. In this paper, we focus on models that can learn manipulation tasks in fixed multi-object settings and extrapolate this skill zero-shot, without any drop in performance, when the number of objects changes. We consider the generic task of moving a single cube out of a set to a goal position. We find that previous approaches, which primarily leverage attention- and graph neural network-based architectures, do not exhibit this invariance when the number of input objects changes, while scaling as K^2. We analyse the effects of different relational inductive biases on generalization and then propose an efficient plug-and-play module that overcomes these limitations. Besides exceeding their performance in the training environment, we show that our approach, which scales linearly in K, allows agents to extrapolate and generalize zero-shot to any new number of objects.

1. INTRODUCTION

Deep reinforcement learning (RL) has witnessed remarkable progress in recent years, particularly in domains such as video games and other synthetic toy settings (Mnih et al., 2015; Silver et al., 2016; Vinyals et al., 2019). On the other hand, applying deep RL to real-world grounded robotic setups, such as learning seemingly simple dexterous manipulation tasks in multi-object settings, is still confronted with fundamental limitations that are the focus of many recent works (Duan et al., 2017; Janner et al., 2018; Deisenroth et al., 2011; Kroemer et al., 2018; Andrychowicz et al., 2020; Rajeswaran et al., 2017; Lee et al., 2021; Funk et al., 2021). The RL problem in robotic setups is much more challenging (Dulac-Arnold et al., 2019). Compared to discrete toy environments, state and action spaces are continuous, and solving tasks typically requires long time horizons, where the agent needs to apply long sequences of precise low-level control actions. Accordingly, exploration under easy-to-define sparse rewards becomes feasible only with enormous amounts of data. This is usually impossible in the real world and remains similarly hard in computationally demanding realistic physics simulators. To alleviate this, manually designing task-specific dense reward functions is usually required, but this can often lead to undesirable or very narrow solutions. Numerous approaches exist to alleviate this further, e.g. imitation learning from expert demonstrations (Abbeel & Ng, 2004), curricula (Narvekar et al., 2020), or model-based learning (Kaelbling et al., 1996). Another promising path is to constrain the possible solution space of a learning agent by encoding suitable inductive biases in its architecture (Geman et al., 1992). Choosing inductive biases that leverage the underlying problem structure can help to learn solutions that facilitate desired generalization capabilities (Mitchell, 1980; Baxter, 2000; Hessel et al., 2019).
In robotics, multi-object manipulation tasks naturally suit a compositional description of their current state in terms of symbol-like entities (such as physical objects, robot parts, etc.). These representations can be obtained directly in simulator settings and can ultimately, one hopes, be inferred robustly from learned object perception modules (Greff et al., 2020a; Kipf et al., 2021; Locatello et al., 2020). While such a compositional understanding is in principle considered crucial for any systematic generalization ability (Greff et al., 2020b; Spelke, 1990; Battaglia et al., 2018; Garnelo et al., 2016), it remains an open question how to design an agent that can process this type of input data to leverage this promise.
Consider transporting a cube to a specified target location and assume the agent has mastered solving this task with three available cubes (see Figure 1 for this problem setting, which forms the basis of our work). We refer to cubes other than the one to be transported as distractors. If we now introduced two additional cubes, the solution to the task would remain the same, but the input distribution would change. So why not just train with many more cubes and use a sufficiently large input state? First, the learning problem requires exponentially more data when solved with more available cubes (see Figure 2, left plot, red line), making it infeasible already for only half a dozen cubes. Second, we do not want to put any constraint on the possible number of cubes at test time but might simply not have this flexibility at train time. In such a task, the agent must be able to learn end-to-end a policy that is invariant to the number of distractors, while never observing such an input distribution shift at training time.
Our aim, therefore, is to learn challenging manipulation tasks in fixed multi-object settings and to extrapolate this skill zero-shot, without any drop in performance, when the number of cubes changes. Achieving this objective is the primary goal of this work. An essential prerequisite is to endow agents with the ability to process a variable number of input objects by choosing a suitable inductive bias. One popular model class for these object-encoding modules is graph neural networks (GNNs), and a vast line of previous approaches builds upon the subclass of attentional GNNs (Zambaldi et al., 2018; Li et al., 2019; Wilson & Hermans, 2020; Zadaianchuk et al., 2020). Another possible model class, although largely ignored for robotic manipulation tasks so far, are relation networks, which process input sets by computing relations for every object-object pair (Santoro et al., 2017). As is the case for attentional GNNs, relation networks scale quadratically in the number of input objects (see Figure 2, right plot), which can become computationally impractical for training and inference with many objects. In this work, we are therefore primarily interested in learning invariance with respect to the number of manipulable objects in a robotic agent's environment, not invariance with respect to object properties such as shape or weight, which is an orthogonal robustness objective in machine learning. Without properly accounting for such modularity, task complexities can otherwise grow exponentially.
Main contributions. As a first main contribution, we demonstrate that popular attention-based GNN approaches do not achieve the desired invariance: they fail to extrapolate any learned behavior to a changing number of objects. As a solution, we then make the case for building agents upon relational reasoning inductive biases.
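To make the quadratic cost of pairwise relational reasoning concrete, the following is a minimal numpy sketch of a relation-network-style set encoder in the spirit of Santoro et al. (2017). It is an illustrative simplification, not the paper's architecture: the relation function `g` is a single-hidden-layer MLP with placeholder weights, and all function and parameter names are hypothetical.

```python
import numpy as np

def relu_mlp(x, w, b):
    # Placeholder relation function g: one linear layer + ReLU.
    return np.maximum(x @ w + b, 0.0)

def relation_network_encoding(objects, g_params):
    """Pairwise relational encoding of an object set.

    objects:  (K, d) array, one feature row per object.
    Applies g to the concatenation [o_i, o_j] for all K^2 ordered
    pairs and sums the results, so cost grows quadratically in K.
    Summation makes the encoding permutation-invariant and lets the
    same module process any number of objects.
    """
    K, d = objects.shape
    w, b = g_params
    total = np.zeros(b.shape[-1])
    for i in range(K):
        for j in range(K):
            pair = np.concatenate([objects[i], objects[j]])  # (2d,)
            total += relu_mlp(pair, w, b)
    return total

# Usage: the identical module handles K=3 or K=5 objects unchanged.
rng = np.random.default_rng(0)
d, h = 4, 8
g_params = (rng.normal(size=(2 * d, h)), np.zeros(h))
enc3 = relation_network_encoding(rng.normal(size=(3, d)), g_params)
enc5 = relation_network_encoding(rng.normal(size=(5, d)), g_params)
```

The output shape is independent of K, which is what allows a downstream policy head to stay fixed while the number of input objects varies; the price is the K^2 inner loop.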
In addition to a more traditional implementation of a relation network, we introduce a linearized relation network module (LRN), which improves the computational complexity in the number of objects from quadratic to linear. Finally, we show that agents based on this module can extrapolate, and therefore generalize zero-shot without any drop in performance, when the number of objects changes. We present supporting evidence for this currently under-explored type of generalization, and for the requirements it imposes on multi-object manipulation, across two challenging robotics tasks.
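The section above does not spell out the LRN internals, so the following numpy sketch only illustrates one standard way to bring the cost of relational encoding from K^2 down to K: relate each object once to a shared pooled summary of the set (Deep-Sets-style mean pooling) instead of to all K partners. Treat it as a hypothetical stand-in for the linear-scaling idea, not as the paper's exact LRN module; all names are illustrative.

```python
import numpy as np

def relu_mlp(x, w, b):
    # Placeholder relation function g: one linear layer + ReLU.
    return np.maximum(x @ w + b, 0.0)

def linearized_relation_encoding(objects, g_params):
    """Linear-cost alternative to full pairwise relations.

    objects: (K, d) array, one feature row per object.
    Instead of K^2 pairs, each object is related once to a pooled
    context vector (here, the mean over all objects), giving only
    K applications of g. The sum over objects keeps the encoding
    permutation-invariant and independent of K in output shape.
    """
    K, d = objects.shape
    w, b = g_params
    context = objects.mean(axis=0)  # (d,) shared set summary
    total = np.zeros(b.shape[-1])
    for i in range(K):
        total += relu_mlp(np.concatenate([objects[i], context]), w, b)
    return total
```

With this structure, doubling the number of objects doubles the work instead of quadrupling it, which is the scaling behavior the LRN is designed to achieve.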




Figure 1: We consider the following problem setting: there are 2 identical cubes with unique identifiers in the arena. During an episode, a random cube (here green) is selected to be transported to a random, episode-specific goal location. The remaining cube (here dark red) acts as an unused distractor during that episode but could be the cube to be transported in the next episode. Can we train RL agents that learn such manipulation tasks within a fixed multi-object setting (training with only one distractor) and extrapolate this skill zero-shot when the number of distractors changes?

