FEW-SHOT INCREMENTAL LEARNING USING HYPERTRANSFORMERS

Abstract

Incremental few-shot learning methods make it possible to learn without forgetting from multiple few-shot tasks arriving sequentially. In this work we approach this problem using the recently published HyperTransformer (HT): a hypernetwork that generates task-specific CNN weights directly from the support set. We propose to re-use these generated weights as an input to the HT for the next task of the continual-learning sequence. Thus, the HT uses the weights themselves as the representation of the previously learned tasks. This approach is different from most continual learning algorithms, which typically rely on replay buffers, weight regularization, or task-dependent architectural changes. Instead, we show that the HT works akin to a recurrent model, relying on the weights from the previous task and a support set from the new task. We demonstrate that a single HT equipped with a prototypical loss is capable of learning and retaining knowledge about past tasks for two continual learning scenarios: task-incremental learning and class-incremental learning.

1. INTRODUCTION

Incremental few-shot learning combines the challenges of few-shot learning and continual learning: it seeks a way to learn from very limited demonstrations presented continually to the learner. This combination is desirable since it represents a more genuine model of how biological systems, including humans, acquire new knowledge: we often do not need a large amount of information to learn a novel concept, and after learning about it we retain that knowledge for a long time. In addition, achieving this would dramatically simplify important practical applications, such as robots continually adapting to a novel environment layout from an incoming stream of demonstrations. Another example is privacy-preserving learning, where users run the model sequentially on their private data, sharing only the weights that continually absorb the data without ever exposing it. We focus on a recently published few-shot learning method called HYPERTRANSFORMER (HT; Zhmoginov et al., 2022), which trains a large hypernetwork (Ha et al., 2016) by extracting knowledge from a set of training few-shot learning tasks. The HT is then able to directly generate the weights of a much smaller Convolutional Neural Network (CNN) model focused on solving a particular task from just a few examples provided in the support set. It works by decoupling the task-domain knowledge (represented by a transformer; Vaswani et al., 2017) from the learner itself (a CNN), which only needs to know about the specific task being solved. In this paper, we propose an INCREMENTAL HYPERTRANSFORMER (IHT) aimed at exploring the capability of the HT to update the CNN weights with information about new tasks, while retaining the knowledge about previously learned tasks. In other words, given the weights θ_{t-1} generated after seeing the previous tasks {τ}_{τ=0}^{t-1} and a new task t, the IHT generates weights θ_t that are suited for all the tasks {τ}_{τ=0}^{t}.
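The recurrent weight-generation loop described above can be sketched as follows. Note that `hypertransformer` here is a hypothetical stand-in (a simple numeric mixing of its inputs) for the actual HT model, which is a transformer attending over support-set embeddings and weight slices; the sketch is only meant to show the data flow, with θ_{t-1} fed back as input at step t:

```python
import numpy as np

def hypertransformer(prev_weights, support_images, support_labels):
    # Hypothetical stand-in for the HT forward pass. The real model is a
    # transformer that generates CNN weights from the support set and the
    # previous weight slices; here we only mimic the input/output signature.
    task_signal = support_images.mean(axis=0).mean() + support_labels.mean()
    return 0.9 * prev_weights + 0.1 * task_signal

def incremental_ht(tasks, weight_shape=(8, 3, 3, 3)):
    """Recurrent loop: weights generated for task t-1 feed the HT at task t."""
    theta = np.zeros(weight_shape)  # the HT input starts from an empty weight slice
    history = []
    for support_images, support_labels in tasks:
        theta = hypertransformer(theta, support_images, support_labels)
        history.append(theta)  # theta_t is meant to solve all tasks 0..t
    return history
```

Each entry of `history` corresponds to a θ_t that could be loaded into the small CNN, so inference can be preempted after any task in the sequence.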
In order for the IHT to absorb a continual stream of tasks, we modified the loss function from the cross-entropy used in the HT to a more flexible prototypical loss (Snell et al., 2017). As the tasks come along, we maintain and update a set of prototypes in the embedding space, one for each class of every given task. The prototypes are then used to predict the class and task attributes for a given input sample. The IHT works in two different continual learning scenarios: task-incremental learning (predict the class attribute using the task information) and class-incremental learning (predict both class and task attributes). Moreover, we show empirically that a model trained with the class-incremental learning objective is also suited for task-incremental learning, with performance comparable to models specifically trained with a task-incremental objective. We demonstrate that models learned by the IHT do not suffer from catastrophic forgetting. Moreover, in some smaller models we even see cases of positive backward transfer, where the performance on a given task actually improves for subsequently generated weights. Since the IHT is trained to optimize all the generated weights {θ_τ}_{τ=0}^{T} together, the model can be preempted at any point τ ≤ T during inference with weights θ_τ suited for any task p with 0 ≤ p ≤ τ. Moreover, the performance of the model improves for all the generated weights when the IHT is trained on more tasks T. We also designed the IHT to work as a recurrent system whose parameters are independent of the step index; it can therefore be applied beyond the number of tasks T it was trained for.
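The prototype bookkeeping described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the embedding network is elided (we operate directly on embedding vectors), prototypes are keyed by a hypothetical (task, class) pair, and plain Euclidean distance is used:

```python
import numpy as np

def update_prototypes(prototypes, embeddings, labels, task_id):
    """Maintain one prototype per (task, class): the mean embedding of that
    class's support samples. A new task simply adds new prototype entries,
    so no past-task data needs to be stored."""
    for c in np.unique(labels):
        prototypes[(task_id, int(c))] = embeddings[labels == c].mean(axis=0)
    return prototypes

def predict(prototypes, query_embedding):
    """Class-incremental prediction: the nearest prototype over ALL tasks
    yields both the task and the class attribute of the query."""
    keys = list(prototypes.keys())
    dists = [np.linalg.norm(query_embedding - prototypes[k]) for k in keys]
    return keys[int(np.argmin(dists))]  # (task_id, class_id)
```

For the task-incremental scenario, the search in `predict` would be restricted to prototypes whose key matches the given task.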

2. RELATED WORK

Few-shot learning. Many few-shot learning methods fall into one of two categories: metric-based learning and optimization-based learning. First, metric-based methods (Vinyals et al., 2016; Snell et al., 2017; Sung et al., 2018; Oreshkin et al., 2018) train a fixed embedding network that works universally for any task. The prediction is then based on the distances between the labeled and query samples in that embedding space. These methods are not specifically tailored for the continual learning problem, since they treat every task independently and have no memory of past tasks. In contrast to these methods, our proposed IHT can be seen as an "adaptive" metric-based learner, where the weights θ_t change to adapt better to task t while retaining the knowledge of past tasks. Second, optimization-based methods (Finn et al., 2017; Nichol & Schulman, 2018; Antoniou et al., 2019; Rusu et al., 2019), built on variations of the seminal MAML approach, learn a fixed initial embedding that is later adapted to a given task using a few gradient-based steps. On their own, these methods cannot learn continually, since naively adapting to a new task results in catastrophic forgetting of the previously learned information.

Continual learning. We believe that compared to the related work (see Biesialska et al., 2020 for an overview), our approach requires the least conceptual overhead, since it does not add constraints to the method beyond the weights generated from the previous task. In particular, we do not inject replay data from past tasks (Lopez-Paz & Ranzato, 2017; Riemer et al., 2018; Rolnick et al., 2019; Wang et al., 2021a), do not explicitly regularize the weights (Kirkpatrick et al., 2017; Zenke et al., 2017), and do not introduce complex graph structures (Tao et al., 2020; Zhang et al., 2021), data routing, or any other architectural changes to the inference model (Rusu et al., 2016).
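The inner-loop adaptation used by the optimization-based methods discussed above can be sketched as follows. This is a deliberately simplified stand-in, not MAML itself: we use a linear model with a squared loss instead of a neural network, and show only the per-task adaptation step; the point is that the same parameters, adapted task after task, drift toward the latest task:

```python
import numpy as np

def adapt(theta0, X, y, lr=0.1, steps=5):
    """MAML-style inner loop: a few gradient steps on a squared loss,
    starting from the shared initialization theta0 (linear-model sketch)."""
    theta = theta0.copy()
    for _ in range(steps):
        grad = 2 * X.T @ (X @ theta - y) / len(y)  # gradient of mean squared error
        theta -= lr * grad
    return theta

# Naively reusing the adapted theta as the starting point for the next task
# (theta0 = adapt(theta0, X_new, y_new)) overwrites the solution for earlier
# tasks, which is the catastrophic-forgetting failure mode noted above.
```

A usage example: starting from a zero initialization, a handful of gradient steps on one task's data reduces that task's loss, but nothing in the update preserves performance on any previous task.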
Instead, we reuse the same principle that made HYPERTRANSFORMER work in the first place.



Figure 1: The information flow of the IHT. In the original HT, each input weight embedding is initialized with an empty weight slice. Our proposal is to pass weight-slice information from previously learned tasks as an input to the new iteration of the HT.

