UNSUPERVISED PROGRESSIVE LEARNING AND THE STAM ARCHITECTURE

Abstract

We first pose the Unsupervised Progressive Learning (UPL) problem: an online representation learning problem in which the learner observes a non-stationary and unlabeled data stream, learning a growing number of features that persist over time even though the data are not stored or replayed. To solve the UPL problem we propose the Self-Taught Associative Memory (STAM) architecture. Layered hierarchies of STAM modules learn through a combination of online clustering, novelty detection, forgetting of outliers, and storage of prototypical features rather than specific examples. We evaluate STAM representations using clustering and classification tasks. While no existing learning scenario is directly comparable to UPL, we compare the STAM architecture with two recent continual learning models, Memory Aware Synapses (MAS) and Gradient Episodic Memory (GEM), after adapting them to the UPL setting.

1. INTRODUCTION

The Continual Learning (CL) problem is predominantly addressed in the supervised context, with the goal of learning a sequence of tasks without "catastrophic forgetting" (Goodfellow et al., 2013; Parisi et al., 2019; van de Ven & Tolias, 2019). There are several CL variations, but a common formulation is that the learner observes a set of examples {(x_i, t_i, y_i)}, where x_i is a feature vector, t_i is a task identifier, and y_i is the target vector associated with (x_i, t_i) (Chaudhry et al., 2019a;b; Lopez-Paz & Ranzato, 2017). Other CL variations replace task identifiers with task boundaries that are either given (Hsu et al., 2018) or inferred (Zeno et al., 2018). Typically, CL requires that the learner either stores and replays some previously seen examples (Aljundi et al., 2019a;b; Gepperth & Karaoguz, 2017; Hayes et al., 2019; Kemker et al., 2018; Rebuffi et al., 2017) or generates examples of earlier learned tasks (Kemker & Kanan, 2018; Liu et al., 2020; Shin et al., 2017). The Unsupervised Feature (or Representation) Learning (FL) problem, on the other hand, is unsupervised but mostly studied in the offline context: given a set of examples {x_i}, the goal is to learn a feature vector h_i = f(x_i) of a given dimensionality that, ideally, makes it easier to identify the explanatory factors of variation behind the data (Bengio et al., 2013), leading to better performance in tasks such as clustering or classification. FL methods differ in the prior P(h) and the loss function. Autoencoders, for instance, aim to learn features of a lower dimensionality than the input that enable a sufficiently good reconstruction (Bengio, 2014; Kingma & Welling, 2013; Tschannen et al., 2018; Zhou et al., 2012). A similar approach is taken by self-supervised methods, which learn representations by optimizing an auxiliary task (Berthelot et al., 2019; Doersch et al., 2015; Gidaris et al., 2018; Kuo et al., 2019; Oord et al., 2018; Sohn et al., 2020).
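As a concrete illustration of the FL objective h_i = f(x_i), the following minimal sketch trains a linear autoencoder by gradient descent on reconstruction error. All specifics here (toy Gaussian data, dimensions, learning rate) are illustrative assumptions, not choices made by any of the cited methods.

```python
import numpy as np

# Linear autoencoder sketch: learn a bottleneck representation
# (feature dim 3 < input dim 8) by minimizing mean squared
# reconstruction error with plain gradient descent.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))               # toy examples {x_i}
d_in, d_h = 8, 3                            # input dim, feature dim
W_enc = 0.1 * rng.normal(size=(d_in, d_h))  # encoder f
W_dec = 0.1 * rng.normal(size=(d_h, d_in))  # decoder g

lr = 0.1
for _ in range(500):
    H = X @ W_enc                           # features h_i = f(x_i)
    err = H @ W_dec - X                     # reconstruction residual
    # Gradients of L = 0.5 * mean ||err||^2 w.r.t. each weight matrix.
    g_dec = (H.T @ err) / len(X)
    g_enc = (X.T @ (err @ W_dec.T)) / len(X)
    W_dec -= lr * g_dec
    W_enc -= lr * g_enc

loss = float(np.mean((X @ W_enc @ W_dec - X) ** 2))
baseline = float(np.mean(X ** 2))           # loss of the all-zero reconstruction
```

Because the bottleneck has lower dimensionality than the input, the encoder is forced to keep only directions that explain the most reconstruction error, which is the sense in which FL methods seek the "explanatory factors of variation."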
In this work, we focus on a new and pragmatic problem that adopts some elements of CL and FL but differs from both; we refer to it as single-pass unsupervised progressive learning, or UPL for short. UPL can be described as follows:
1. The data is observed as a non-IID stream (e.g., different portions of the stream may follow different distributions, and there may be strong temporal correlations between successive examples).
2. The features should be learned exclusively from unlabeled data.
3. Each example is "seen" only once, and the unlabeled data are not stored for iterative processing.
4. The number of learned features may need to increase over time, in response to new tasks and/or changes in the data distribution.
5. To avoid catastrophic forgetting, previously learned features need to persist over time, even when the corresponding data are no longer observed in the stream.

