LOSSLESS ADAPTATION OF PRETRAINED VISION MODELS FOR ROBOTIC MANIPULATION

Abstract

Recent works have shown that large models pretrained on common visual learning tasks can provide useful representations for a wide range of specialized perception problems, as well as a variety of robotic manipulation tasks. While prior work on robotic manipulation has predominantly used frozen pretrained features, we demonstrate that in robotics this approach can fail to reach optimal performance, and that fine-tuning of the full model can lead to significantly better results. Unfortunately, fine-tuning disrupts the pretrained visual representation and causes representational drift towards the fine-tuned task, thus leading to a loss of the versatility of the original model. We introduce lossless adaptation to address this shortcoming of classical fine-tuning. We demonstrate that appropriate placement of our parameter-efficient adapters can significantly reduce the performance gap between frozen pretrained representations and full end-to-end fine-tuning without changes to the original representation, thus preserving the original capabilities of the pretrained model. We perform a comprehensive investigation across three major model architectures (ViTs, NFNets, and ResNets), with supervised (ImageNet-1K classification) and self-supervised pretrained weights (CLIP, BYOL, Visual MAE), in 3 task domains and 35 individual tasks, and demonstrate that our claims are strongly validated in various settings.

1. INTRODUCTION

Pretrained general-purpose vision models, often also referred to as vision foundation models (Yuan et al., 2021), have developed a growing set of perceptual capabilities in recent years. Large-scale vision-language models such as CLIP (Radford et al., 2021) and ALIGN (Jia et al., 2021) are examples of these highly capable general-purpose vision models, which have enabled many applications in image generation/editing (Ramesh et al., 2022; Saharia et al.) and image-based dialog (Alayrac et al., 2022). Existing self-supervised pretrained visual models, such as SimCLR (Chen et al., 2020), BYOL (Grill et al., 2020), or Visual MAE (He et al., 2022), have also been shown to provide strong initializations for a wide range of visual downstream tasks. How can we unlock the power of these models for increasingly novel and challenging control applications? One solution is to add an output head for each control task and fine-tune the entire architecture. However, fine-tuning degrades performance on the original task(s) the model was trained for, and therefore requires maintaining copies of the model for all tasks we wish to concurrently support. This strategy quickly becomes infeasible as we move towards more general and multi-task agents. For instance, embodied agents acting in the real world will end up solving thousands of downstream manipulation tasks. Given the limited hardware capabilities of robots, keeping separate copies of increasingly large models (e.g. billions of parameters) for a growing set of tasks is unscalable. This is further exacerbated for robot manipulation, wherein hardware and tool differences can result in different task configurations which may require different representations. In this paper our goal is to achieve lossless adaptation, which we define as adapting the original pretrained model for a new task or series of tasks while maintaining the original capabilities of the model.
To solve the lossless adaptation problem we inject additional parameters, i.e. adapters, at several specific locations throughout the pretrained architecture. We use adapters similar to those in previous non-control settings (Rebuffi et al., 2017; Houlsby et al., 2019), but carefully insert them at different network locations to improve off-domain representations for control. We demonstrate that, at a small cost in additional parameters (≈ 1% of the original model size) scattered throughout the model, we can bridge the performance gap between frozen pretrained features and full end-to-end fine-tuning while maintaining all the original capabilities of the pretrained model (the original model definition can co-exist with the new task head, reusing the vast majority of parameters). We show that as manipulation tasks get harder, through complex multi-object interaction and an increased level of variation in the randomized initial configurations, frozen pretrained visual features cannot cope with the increased complexity but our parameter-efficient adapters can. Overall, our contributions include:

• We show that frozen pretrained representations are insufficient to reach optimal manipulation task performance, especially for complex manipulation tasks.
• We propose the use of adapters, with strong evidence that they can largely close the performance gap between frozen pretrained representations and full end-to-end fine-tuning while adding only a small number of adapter parameters (≈ 1% of the original model).
• We comprehensively evaluate our approach across 3 manipulation suites (35 individual tasks) and 3 major model architectures (ViTs, NFNets, and ResNets), with supervised (ImageNet classification) and self-supervised pretraining (CLIP, BYOL, Visual MAE).
• We demonstrate experimentally that adapters also help in sim2real transfer, enabling fixed large pretrained visual models to be used directly for real-world manipulation tasks.
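To make the adapter idea concrete, the following is a minimal sketch of a residual bottleneck adapter in the style of Rebuffi et al. (2017) and Houlsby et al. (2019). The layer sizes, activation, and initialization here are illustrative assumptions, not the paper's exact configuration. Zero-initializing the up-projection makes the adapter an identity map at the start of training, so inserting it between pretrained layers leaves the original representation untouched until the adapter weights are learned:

```python
import numpy as np

def make_adapter(dim, bottleneck, seed=0):
    """Create parameters for a residual bottleneck adapter.

    dim: feature dimension of the pretrained layer it is inserted after.
    bottleneck: small hidden width (bottleneck << dim keeps the adapter tiny).
    """
    rng = np.random.default_rng(seed)
    W_down = rng.normal(0.0, 0.02, (dim, bottleneck))  # down-projection
    b_down = np.zeros(bottleneck)
    W_up = np.zeros((bottleneck, dim))  # zero-init: adapter starts as identity
    b_up = np.zeros(dim)
    return W_down, b_down, W_up, b_up

def adapter_forward(x, params):
    """Apply the adapter: x + up(relu(down(x))), a residual bottleneck MLP."""
    W_down, b_down, W_up, b_up = params
    h = np.maximum(x @ W_down + b_down, 0.0)  # ReLU bottleneck
    return x + h @ W_up + b_up  # residual connection preserves the input path

def adapter_param_count(dim, bottleneck):
    """Number of parameters one adapter adds."""
    return 2 * dim * bottleneck + bottleneck + dim
```

For a 512-dimensional feature map and a bottleneck of 16, one adapter adds roughly 17K parameters, which is why scattering a handful of them through a model with hundreds of millions of parameters stays around the ≈ 1% overhead quoted above.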

2. RELATED WORKS

The general problem of representation learning for control can be broadly divided into two distinct categories: works that use in-domain task data to learn task-relevant representations for the underlying task, and more recent works that use out-of-domain visual data to learn generally useful representations for control.

In-Domain Representation Learning for Control: Within the first broad category, the majority of works learn representations that reflect certain invariances presumed to be relevant for the downstream task(s) of interest. Prior works show that useful representations can be learned via data augmentations (Kostrikov et al., 2020; Laskin et al., 2020b), temporal contrastive learning (Laskin et al., 2020a), information bottlenecks (Pacelli & Majumdar, 2020), goal relabeling (Zhou et al., 2021), or via real-world priors (Jonschkowski et al., 2017; Jonschkowski & Brock, 2014).

Out-of-Domain Representation Learning for Control: An alternative set of works has emphasized the use of in-the-wild visual datasets for representation learning (Parisi et al., 2022; Shridhar et al., 2022; Khandelwal et al., 2022). These works have shown that features learned by large visual models pretrained on common visual learning tasks such as image classification, image inpainting, and contrastive learning can be surprisingly effective for downstream control tasks. Many of these works utilize the large-scale pretrained CLIP model (Gadre et al., 2022; Khandelwal et al., 2022; Shridhar

