WANDERING WITHIN A WORLD: ONLINE CONTEXTUALIZED FEW-SHOT LEARNING

Abstract

We aim to bridge the gap between typical human and machine-learning environments by extending the standard framework of few-shot learning to an online, continual setting. In this setting, episodes do not have separate training and testing phases; instead, models are evaluated online while learning novel classes. As in the real world, where spatiotemporal context helps us retrieve skills learned in the past, our online few-shot learning setting also features an underlying context that changes over time. Object classes are correlated within a context, and inferring the correct context can lead to better performance. Building upon this setting, we propose a new few-shot learning dataset based on large-scale indoor imagery that mimics the visual experience of an agent wandering within a world. Furthermore, we convert popular few-shot learning approaches into online versions, and we propose a new contextual prototypical memory model that can make use of spatiotemporal contextual information from the recent past.

1. INTRODUCTION

In machine learning, many paradigms exist for training and evaluating models: standard train-then-evaluate, few-shot learning, incremental learning, continual learning, and so forth. None of these paradigms well approximates the naturalistic conditions that humans and artificial agents encounter as they wander within a physical environment. Consider, for example, learning and remembering people's names in the course of daily life. We tend to see people in a given environment (work, home, gym, etc.). We tend to revisit those environments repeatedly, with different environment base rates, nonuniform environment transition probabilities, and nonuniform base rates of encountering a given person in a given environment. We need to recognize when we do not know a person, and we need to learn to recognize them the next time we encounter them. We are not always provided with a name, but we can learn in a semi-supervised manner. And every training trial is itself an evaluation trial, as we repeatedly use existing knowledge and acquire new knowledge. In this article, we propose a novel paradigm, online contextualized few-shot learning, that approximates these naturalistic conditions, and we develop deep-learning architectures well suited for this paradigm.

In traditional few-shot learning (FSL) (Lake et al., 2015; Vinyals et al., 2016), training is episodic. Within an isolated episode, a set of new classes is introduced with a limited number of labeled examples per class (the support set), followed by evaluation on an unlabeled query set. While this setup has inspired the development of a multitude of meta-learning algorithms that can be trained to rapidly learn novel classes from a few labeled examples, these algorithms focus solely on the few classes introduced in the current episode; the classes learned are not carried over to future episodes.
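The episodic structure of traditional FSL can be sketched as a simple sampling procedure. The function name, the `dataset` layout (a mapping from class to examples), and the parameter names below are illustrative assumptions, not the paper's actual interface:

```python
import random

def make_episode(dataset, n_way=5, k_shot=1, q_queries=5):
    """Sample one standard few-shot episode: `n_way` classes, with
    `k_shot` labeled support examples and `q_queries` query examples
    per class. `dataset` maps each class label to a list of examples.
    """
    classes = random.sample(sorted(dataset), n_way)
    support, query = [], []
    for c in classes:
        examples = random.sample(dataset[c], k_shot + q_queries)
        support += [(x, c) for x in examples[:k_shot]]   # labeled support set
        query += [(x, c) for x in examples[k_shot:]]     # evaluated query set
    return support, query
```

Each episode is self-contained: the model is scored only on this episode's query set, and the sampled classes are discarded afterward, which is exactly the limitation the online setting below removes.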
Although incremental learning and continual learning methods (Rebuffi et al., 2017; Hou et al., 2019) address the case where classes are carried over, the episodic construction of these frameworks seems artificial: in our daily lives, we do not learn new objects by grouping them with five other new objects, processing them together, and then moving on. To break the rigid, artificial structure of continual and few-shot learning, we propose a new continual few-shot learning setting where environments are revisited and the total number of novel object classes increases over time. Crucially, model evaluation happens on each trial, very much like the setup in online learning. When encountering a new class, the learning algorithm is expected to indicate that the class is "new," and it is then expected to recognize subsequent instances of the class once a label has been provided.

When learning continually in such a dynamic environment, contextual information can guide learning and remembering. Any structured sequence provides temporal context: the instances encountered recently are predictive of instances to be encountered next. In natural environments, spatial context, i.e., information in the current input weakly correlated with the occurrence of a particular class, can be beneficial for retrieval as well. For example, we tend to see our boss in an office setting, not in a bedroom setting. Human memory retrieval benefits from both spatial and temporal context (Howard, 2017; Kahana, 2012). In our online few-shot learning setting, we provide spatial context in the presentation of each instance and temporal structure to sequences, enabling an agent to learn from both spatial and temporal context. Besides developing and experimenting on a toy benchmark using handwritten characters (Lake et al., 2015), we also propose a new large-scale benchmark for online contextualized few-shot learning derived from indoor panoramic imagery (Chang et al., 2017).
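The online evaluation protocol can be sketched as a single loop in which every labeled step is scored before its label is revealed. The `model.predict`/`model.update` interface, the `new_label` sentinel, and the choice to score only labeled steps are assumptions for illustration, not the paper's exact evaluation code:

```python
def evaluate_online(model, sequence, new_label=-1):
    """Run one online few-shot sequence. `sequence` yields (x, y) pairs,
    where y is None for unlabeled (semi-supervised) inputs. At each step
    the model first predicts a known class label, or `new_label` for a
    class it has not seen labeled before; only then is the true label
    revealed as a learning signal. For simplicity, only labeled steps
    are scored here.
    """
    correct, total = 0, 0
    seen = set()  # classes whose labels have been revealed so far
    for x, y in sequence:
        pred = model.predict(x)          # prediction happens before the label is revealed
        if y is not None:
            target = y if y in seen else new_label
            correct += int(pred == target)
            total += 1
            seen.add(y)
            model.update(x, y)           # learning signal only on labeled steps
    return correct / max(total, 1)
```

Note that there is no separate test phase: the same loop that teaches the model also measures it, which is the defining property of the proposed setting.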
In the toy benchmark, temporal context can be defined by the co-occurrence of character classes. In the indoor environment, the context, both temporal and spatial, arises naturally as the agent wanders between different rooms. We propose a model that can exploit contextual information, called contextual prototypical memory (CPM), which incorporates an RNN to encode contextual information and a separate prototype memory to remember previously learned classes (see Figure 4). This model obtains significant gains in few-shot classification performance compared to models that do not retain a memory of the recent past. We compare to classic few-shot algorithms extended to an online setting, and CPM consistently achieves the best performance.

The main contributions of this paper are as follows. First, we define an online contextualized few-shot learning (OC-FSL) setting to mimic naturalistic human learning. Second, we build three datasets: 1) RoamingOmniglot is based on handwritten characters from Omniglot (Lake et al., 2015); 2) RoamingImageNet is based on images from ImageNet (Russakovsky et al., 2015); and 3) RoamingRooms is our new few-shot learning dataset based on indoor imagery (Chang et al., 2017), which resembles the visual experience of a wandering agent. Third, we benchmark classic FSL methods and explore our CPM model, which combines the strengths of RNNs for modeling temporal context and Prototypical Networks (Snell et al., 2017) for memory consolidation and rapid learning.
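A minimal sketch of the prototype-memory component can clarify the idea, omitting the contextual RNN encoder entirely. The distance threshold for declaring "new" and the running-average prototype update are illustrative choices, not the exact CPM formulation:

```python
import numpy as np

class PrototypeMemory:
    """Sketch of a prototype memory in the spirit of Prototypical
    Networks: one running-average prototype per class. An embedding
    whose nearest prototype is farther than `threshold` is declared
    "new" (returned as `new_label`).
    """
    def __init__(self, threshold=1.0):
        self.protos = {}          # label -> (mean embedding, count)
        self.threshold = threshold

    def predict(self, z, new_label=-1):
        if not self.protos:
            return new_label      # empty memory: everything is new
        dists = {c: np.linalg.norm(z - p) for c, (p, _) in self.protos.items()}
        c_star = min(dists, key=dists.get)
        return c_star if dists[c_star] <= self.threshold else new_label

    def update(self, z, y):
        if y in self.protos:      # incremental running-average update
            p, n = self.protos[y]
            self.protos[y] = ((p * n + z) / (n + 1), n + 1)
        else:
            self.protos[y] = (np.asarray(z, dtype=float), 1)
```

In the full model, `z` would be the output of a context-conditioned encoder rather than a raw feature vector, so that the same object can be represented differently depending on the inferred environment.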



Our code and dataset are released at: https://github.com/renmengye/oc-fewshot-public



Figure 1: Online contextualized few-shot learning. A) Our setup is similar to online learning, where there is no separate testing phase; model training and evaluation happen at the same time. The input at each time step is an (image, class-label) pair. The number of classes grows incrementally, and the agent is expected to answer "new" for items that have not yet been assigned labels. Sequences can be semi-supervised; here the label is not revealed for every input item (labeled/unlabeled shown by red solid/grey dotted boxes). The agent is evaluated on the correctness of all answers. The model obtains learning signals only on labeled instances, and is correct if it predicts the label of a previously seen class, or "new" for new ones. B) The overall sequence switches between different learning environments. While the environment ID is hidden from the agent, inferring the current environment can help solve the task.

