SPATIAL GENERALIZATION OF VISUAL IMITATION LEARNING WITH POSITION-INVARIANT REGULARIZATION

Abstract

How visual imitation learning models can generalize to novel, unseen visual observations is a highly challenging problem. Such a generalization ability is crucial for their real-world applications. Since this generalization problem has many different aspects, we focus on one case called spatial generalization, which refers to generalization to unseen setups of object (entity) locations in a task, such as a novel arrangement of object locations in robotic manipulation. In this case, previous works observe that visual imitation learning models overfit to absolute information (e.g., coordinates) rather than the relational information between objects, which is more important for decision making. As a result, the models perform poorly under novel object location setups. Nevertheless, it so far remains unclear how this problem can be solved effectively. Our insight is to explicitly remove absolute information from the features learned by imitation learning models so that the models can use robust, relational information to make decisions. To this end, we propose a novel, position-invariant regularizer for generalization, which penalizes the imitation learning model when its features contain absolute, positional information about objects. We carry out experiments on the MAGICAL and ProcGen benchmarks, as well as a real-world robot manipulation problem, and find that our regularizer can effectively boost the spatial generalization performance of imitation learning models. Through both qualitative and quantitative analysis, we verify that our method does learn robust relational representations.

1. INTRODUCTION

Imitation learning is a class of algorithms that enables robots to acquire behaviors from human demonstrations (Hussein et al., 2017). Recent advances in deep learning have boosted the development of visual imitation learning and supported applications such as autonomous driving, robotic manipulation, and human-robot interaction (Hussein et al., 2017). In spite of this success, visual imitation learning methods still face many practical challenges. One major challenge is their ability to generalize to novel, unseen visual observations, a situation that is very common when the trained models are deployed (Toyer et al., 2020; Park et al., 2021). In the literature, this generalization problem is also known as the robustness problem. It covers many different aspects; for example, we can identify two basic generalization capabilities: observational generalization and spatial generalization (Figure 1). Observational generalization refers to generalization to novel visual textures. Changes in background color, object texture, or ambient light in a robotic manipulation task are examples of observational generalization. Such visual changes do not affect the physical structure (e.g., the positions of objects and targets) and only require the robot to reason correctly about semantic meanings. In contrast, spatial generalization refers to generalization to unseen setups of object (entity) locations in a task, which instead requires physical common sense about space and objects. Consider the task of letting a warehouse robot move a box to a target region. If the initial position of the box is set to a place not covered by the demonstration dataset, then the imitation learning method must perform spatial generalization in order to succeed. In reality, the generalization challenge usually emerges as a combination of different generalization capabilities. In this paper, we focus on the study of spatial generalization.
For better spatial generalization, visual imitation learning models should be able to acquire knowledge about objects and their spatial relations with proper inductive biases. Prior work finds that vanilla deep visual imitation learning models strongly overfit to the absolute positions of objects (Toyer et al., 2020), which suggests that they do not extract relational information between objects to make decisions as humans do (Doumas et al., 2022). Since representation learning methods can usually produce good semantic representations (features) that aid generalization, Chen et al. (2021) investigate the use of self-supervised representation learning in visual imitation learning. However, they find that these general-purpose representations fail to effectively improve the generalization performance of vanilla visual imitation learning models. Aside from these works, we also note that some works propose variants of vision transformers (Dosovitskiy et al., 2021) to improve spatial generalization (Yuan et al., 2021), though these methods are not designed for imitation learning. Moreover, they make additional assumptions, such as the availability of object information. So far, it remains unclear how to ensure spatial generalization in visual imitation learning. Based on these observations, our main insight is to explicitly remove absolute, positional information from the features learned by visual imitation learning models. Note that this does not mean that the decision-making process is independent of absolute information. Rather, we expect the model to extract relational information (e.g., distance, direction) from the absolute information to make robust decisions. To this end, we propose a novel position-invariant regularizer called POINT, which penalizes the imitation learning model when the learned features correlate strongly with absolute, positional information.
As a result, the imitation learning model has to discover more robust relational features. To validate our idea, we test the proposed regularizer on the MAGICAL (Toyer et al., 2020) and ProcGen (Cobbe et al., 2020) benchmarks, as well as a real-world robot manipulation problem. We find that our method can effectively improve spatial generalization performance. Furthermore, through qualitative and quantitative analysis, we find that imitation learning models can indeed learn relational features with our proposed regularizer. To summarize, our contributions in this paper are as follows.

• We define the spatial generalization problem of visual imitation learning models and propose a novel position-invariant regularizer called POINT to tackle it.

• We test our method on the MAGICAL and ProcGen benchmarks, as well as a real-world robot manipulation problem, and find that our proposed regularizer effectively improves the spatial generalization performance of previous imitation learning models.

• Through qualitative and quantitative studies, we verify that our proposed regularizer does make visual imitation learning models extract relational information.
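This excerpt introduces POINT only at a high level. As a rough illustration of the general idea, the sketch below penalizes the linear cross-correlation between a batch of learned features and ground-truth object coordinates; the penalty is large when features encode absolute positions and near zero when they are position-invariant. All names here are hypothetical, and we assume object coordinates are available during training; the actual POINT regularizer may be formulated differently.

```python
import numpy as np

def position_invariance_penalty(features, positions):
    """Grows when features linearly correlate with absolute positions.

    features:  (N, d) batch of learned feature vectors.
    positions: (N, k) ground-truth object coordinates for the batch.
    Returns the squared Frobenius norm of the feature-position
    cross-correlation matrix (a scale-free dependence measure).
    """
    # Center each dimension.
    Z = features - features.mean(axis=0, keepdims=True)
    P = positions - positions.mean(axis=0, keepdims=True)
    # Normalize to unit variance so the penalty is scale-invariant.
    Z = Z / (Z.std(axis=0, keepdims=True) + 1e-8)
    P = P / (P.std(axis=0, keepdims=True) + 1e-8)
    # (d, k) cross-correlation matrix over the batch.
    corr = Z.T @ P / len(Z)
    return float(np.sum(corr ** 2))
```

In training, such a penalty would be added to the imitation objective (e.g., total_loss = bc_loss + lambda * penalty), so that gradients push the encoder away from features that expose absolute coordinates while leaving relational information (distances, directions) available.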



Figure 1: Left and Middle: Two kinds of visual generalization. The examples are based on the MAGICAL benchmark provided by Toyer et al. (2020), in which a robot is required to relocate a box to a target region. The left figure shows an example of observational generalization, in which the only change during the testing phase is the visual texture of objects. The middle figure shows an example of spatial generalization, in which the setup of object locations in the testing phase is unseen during training. Right: To achieve spatial generalization, we suggest that absolute information should be removed from the feature while relational information should be kept. We propose a novel, position-invariant regularizer for this purpose.

