SPATIAL GENERALIZATION OF VISUAL IMITATION LEARNING WITH POSITION-INVARIANT REGULARIZATION

Abstract

How visual imitation learning models can generalize to novel, unseen visual observations is a highly challenging problem, and such generalization ability is crucial for their real-world applications. Since this generalization problem has many different aspects, we focus on one case called spatial generalization, which refers to generalization to unseen setups of object (entity) locations in a task, such as a novel arrangement of object locations in a robotic manipulation problem. In this case, previous works observe that visual imitation learning models overfit to absolute information (e.g., coordinates) rather than the relational information between objects, which is more important for decision making. As a result, the models perform poorly under novel object location setups. Nevertheless, it remains unclear how to solve this problem effectively. Our insight is to explicitly remove absolute information from the features learned by imitation learning models so that the models must rely on robust, relational information to make decisions. To this end, we propose a novel position-invariant regularizer for generalization, which penalizes the imitation learning model when its features contain absolute positional information about objects. We carry out experiments on the MAGICAL and ProcGen benchmarks, as well as a real-world robot manipulation problem, and find that our regularizer effectively boosts the spatial generalization performance of imitation learning models. Through both qualitative and quantitative analysis, we verify that our method does learn robust relational representations.
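The abstract's key idea, penalizing features that still encode absolute object coordinates, can be illustrated with a small probe-based sketch. The function below is a hypothetical NumPy illustration, not the paper's actual regularizer: it fits a closed-form linear probe that maps learned features to absolute object coordinates and returns the fraction of positional variance the probe recovers. A training loop could add such a score to the imitation loss so that position-decodable features are discouraged; all names here are assumptions for illustration.

```python
import numpy as np

def position_probe_penalty(features, positions):
    """Hypothetical position-invariance penalty (illustrative sketch).

    Fits a least-squares linear probe predicting absolute object
    coordinates from the learned features. Returns a value in [0, 1]:
    close to 1 means the features still carry absolute positional
    information (to be penalized), close to 0 means they are
    position-invariant.
    """
    # Center both matrices so the probe cannot rely on a bias term.
    F = features - features.mean(axis=0)
    P = positions - positions.mean(axis=0)
    # Closed-form least-squares probe W such that F @ W approximates P.
    W, *_ = np.linalg.lstsq(F, P, rcond=None)
    residual = P - F @ W
    # Fraction of positional variance the probe recovers.
    return 1.0 - residual.var() / P.var()
```

For example, features that literally contain the coordinates yield a penalty near 1, while features statistically independent of the coordinates yield a penalty near 0, matching the intuition that the regularizer should fire only when absolute position leaks into the representation.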

1. INTRODUCTION

Imitation learning is a class of algorithms that enable robots to acquire behaviors from human demonstrations (Hussein et al., 2017). Recent advances in deep learning have boosted the development of visual imitation learning and supported applications such as autonomous driving, robotic manipulation, and human-robot interaction (Hussein et al., 2017). In spite of this success, visual imitation learning methods still face many practical challenges. One major challenge is their ability to generalize to novel, unseen visual observations, which is common when trained models are deployed (Toyer et al., 2020; Park et al., 2021). In the literature, this generalization problem is also known as the robustness problem, and it covers many different aspects. For example, we can identify two basic generalization capabilities: observational generalization and spatial generalization (Figure 1). Observational generalization refers to generalization to novel visual textures; changes in background color, object texture, or ambient light in a robotic manipulation task are examples. Such visual changes do not affect the physical structure of the scene (e.g., the positions of objects and targets) and only require the robot to reason about semantic meanings correctly. In contrast, spatial generalization refers to generalization to unseen setups of object (entity) locations in a task, which instead requires physical common sense about space and objects. Consider the task of letting a warehouse robot move a box to some target region: if the initial position of the box is set to a place not covered by the demonstration dataset, then an imitation learning method must perform spatial generalization in order to succeed. In reality, the generalization challenge usually emerges as a

