UNSUPERVISED PROGRESSIVE LEARNING AND THE STAM ARCHITECTURE

Abstract

We first pose the Unsupervised Progressive Learning (UPL) problem: an online representation learning problem in which the learner observes a non-stationary and unlabeled data stream and learns a growing number of features that persist over time, even though the data is not stored or replayed. To solve the UPL problem, we propose the Self-Taught Associative Memory (STAM) architecture. Layered hierarchies of STAM modules learn through a combination of online clustering, novelty detection, forgetting of outliers, and storage of prototypical features rather than specific examples. We evaluate STAM representations using clustering and classification tasks. While no existing learning scenario is directly comparable to UPL, we compare the STAM architecture with two recent continual learning models, Memory Aware Synapses (MAS) and Gradient Episodic Memory (GEM), after adapting them to the UPL setting.

1. INTRODUCTION

The Continual Learning (CL) problem is predominantly addressed in the supervised context, with the goal being to learn a sequence of tasks without "catastrophic forgetting" (Goodfellow et al., 2013; Parisi et al., 2019; van de Ven & Tolias, 2019). There are several CL variations, but a common formulation is that the learner observes a set of examples {(x_i, t_i, y_i)}, where x_i is a feature vector, t_i is a task identifier, and y_i is the target vector associated with (x_i, t_i) (Chaudhry et al., 2019a;b; Lopez-Paz & Ranzato, 2017). Other CL variations replace task identifiers with task boundaries that are either given (Hsu et al., 2018) or inferred (Zeno et al., 2018). Typically, CL requires that the learner either stores and replays some previously seen examples (Aljundi et al., 2019a;b; Gepperth & Karaoguz, 2017; Hayes et al., 2019; Kemker et al., 2018; Rebuffi et al., 2017) or generates examples of earlier learned tasks (Kemker & Kanan, 2018; Liu et al., 2020; Shin et al., 2017). The Unsupervised Feature (or Representation) Learning (FL) problem, on the other hand, is unsupervised but mostly studied in the offline context: given a set of examples {x_i}, the goal is to learn a feature vector h_i = f(x_i) of a given dimensionality that, ideally, makes it easier to identify the explanatory factors of variation behind the data (Bengio et al., 2013), leading to better performance in tasks such as clustering or classification. FL methods differ in the prior P(h) and the loss function. Autoencoders, for instance, aim to learn features of a lower dimensionality than the input that enable a sufficiently good reconstruction (Bengio, 2014; Kingma & Welling, 2013; Tschannen et al., 2018; Zhou et al., 2012). A similar approach is taken by self-supervised methods, which learn representations by optimizing an auxiliary task (Berthelot et al., 2019; Doersch et al., 2015; Gidaris et al., 2018; Kuo et al., 2019; Oord et al., 2018; Sohn et al., 2020).
In this work, we focus on a new and pragmatic problem that adopts some elements of CL and FL but is also different from both; we refer to this problem as single-pass unsupervised progressive learning, or UPL for short. UPL can be described through the following requirements:

1. The data is observed as a non-IID stream (e.g., different portions of the stream may follow different distributions, and there may be strong temporal correlations between successive examples).
2. Features must be learned exclusively from unlabeled data.
3. Each example is "seen" only once, and the unlabeled data is not stored for iterative processing.
4. The number of learned features may need to increase over time, in response to new tasks and/or changes in the data distribution.
5. To avoid catastrophic forgetting, previously learned features must persist over time, even when the corresponding data is no longer observed in the stream.

The UPL problem is encountered in important AI applications, such as a robot learning new visual features as it explores a time-varying environment. Additionally, we argue that UPL is closer to how animals learn, at least in the case of perceptual learning (Goldstone, 1998). We believe that in order to mimic that, ML methods should be able to learn in a streaming manner and in the absence of supervision. Animals do not "save off" labeled examples to train on in parallel with unlabeled data, they do not know how many "classes" exist in their environment, and they do not have to periodically replay all of their past experiences to avoid forgetting them. To address the UPL problem, we describe an architecture referred to as STAM ("Self-Taught Associative Memory"). STAM learns features through online clustering at a hierarchy of increasing receptive field sizes. We choose online clustering, instead of more complex learning models, because it can be performed in a single pass over the data stream.
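The single-pass, non-IID constraints above can be made concrete with a small sketch. The code below is illustrative only (the stream generator, the `StreamingLearner` class, and its running-mean statistic are hypothetical names, not part of the STAM architecture): the learner sees each unlabeled example exactly once, in phase-ordered (non-IID) order, and retains only summary state, never the raw data.

```python
import numpy as np

def make_nonstationary_stream(phases, examples_per_phase, rng):
    """Yield unlabeled examples phase by phase: each phase has its own distribution."""
    for mean in phases:
        for _ in range(examples_per_phase):
            yield rng.normal(loc=mean, scale=1.0, size=4)  # one 4-dim example

class StreamingLearner:
    """Placeholder online learner: keeps only a running mean, never raw examples."""
    def __init__(self):
        self.count = 0
        self.mean = None

    def observe(self, x):
        # Single-pass update: x is discarded after this call.
        self.count += 1
        if self.mean is None:
            self.mean = x.copy()
        else:
            self.mean += (x - self.mean) / self.count

rng = np.random.default_rng(0)
learner = StreamingLearner()
for x in make_nonstationary_stream(phases=[0.0, 5.0], examples_per_phase=100, rng=rng):
    learner.observe(x)
```

The two-phase stream mimics a distribution shift mid-stream; a UPL learner must adapt to the second phase without revisiting (or forgetting) the first.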
Further, despite its simplicity, clustering can generate representations that enable better classification performance than more complex FL methods, such as sparse coding or some deep learning methods (Coates et al., 2011; Coates & Ng, 2012). STAM allows the number of clusters to increase over time, driven by a novelty detection mechanism. Additionally, STAM includes a brain-inspired dual-memory hierarchy (short-term versus long-term) that preserves previously learned features that have been seen multiple times in the data stream (to avoid catastrophic forgetting), while forgetting outliers. To the best of our knowledge, the UPL problem has not been addressed before. The closest prior work is CURL ("Continual Unsupervised Representation Learning") (Rao et al., 2019). CURL, however, does not consider the single-pass, online learning requirement. We further discuss the differences with CURL in Section 6.

2. STAM ARCHITECTURE

In the following, we describe the STAM architecture as a sequence of its major components: a hierarchy of increasing receptive fields, online clustering (centroid learning), novelty detection, and a dual-memory hierarchy that stores prototypical features rather than specific examples. The notation is summarized for convenience in the Supplementary Material (SM) (section SM-A).

I. Hierarchy of increasing receptive fields: An input vector x_t ∈ R^n (an image in all subsequent examples) is analyzed through a hierarchy of Λ layers. Instead of neurons or hidden-layer units, each layer consists of STAM units; in its simplest form, a STAM unit functions as an online clustering module. Each STAM unit processes one ρ_l × ρ_l patch (e.g., an 8 × 8 subvector) of the input at the l'th layer. The patches are overlapping, with a small stride (set to one pixel in our experiments) to accomplish translation invariance (similar to CNNs). The patch dimension ρ_l increases in higher layers; the idea is that the first layer learns the smallest and most elementary features, while the top layer learns the largest and most complex features.

II. Centroid Learning: Every patch of each layer is clustered, in an online manner, to a set of centroids. These time-varying centroids form the features that the STAM architecture gradually learns at that layer. All STAM units of layer l share the same set of centroids C_l(t) at time t, again for translation invariance. Given the m'th input patch x_{l,m} at layer l, the nearest centroid of C_l selected for x_{l,m} is

    c_{l,j} = argmin_{c ∈ C_l} d(x_{l,m}, c),

where d(x_{l,m}, c) is the Euclidean distance between the patch x_{l,m} and centroid c. The selected centroid is updated based on a learning rate parameter α, as follows:

    c_{l,j} ← α x_{l,m} + (1 − α) c_{l,j},   0 < α < 1.

A higher α value makes the learning process faster but less predictable.
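The assignment and update rules above can be sketched in a few lines. This is a minimal illustration of the stated equations, not the authors' implementation: for a patch x at layer l, the nearest centroid under Euclidean distance is selected and pulled toward x with learning rate α.

```python
import numpy as np

def update_nearest_centroid(centroids, x, alpha=0.1):
    """centroids: (K, D) array of layer-l centroids; x: (D,) flattened patch.

    Selects c_{l,j} = argmin_c d(x, c) and applies the EMA-style update
    c_{l,j} <- alpha * x + (1 - alpha) * c_{l,j}. Returns the updated index j.
    """
    dists = np.linalg.norm(centroids - x, axis=1)  # d(x, c) for every c in C_l
    j = int(np.argmin(dists))                      # nearest centroid
    centroids[j] = alpha * x + (1 - alpha) * centroids[j]
    return j

# Usage: with alpha = 0.5, the nearest centroid moves halfway toward the patch.
centroids = np.array([[0.0, 0.0], [10.0, 10.0]])
j = update_nearest_centroid(centroids, np.array([1.0, 1.0]), alpha=0.5)
```

Note that only the selected centroid moves; all other centroids in C_l are untouched by this patch.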
A centroid is updated by at most one patch per input, and the update is not performed if the patch is considered "novel" (defined in the next paragraph). We do not use a decreasing value of α because the goal is to keep learning in a non-stationary environment rather than to converge to stable centroids.

III. Novelty detection: When an input patch x_{l,m} at layer l is significantly different from all centroids at that layer (i.e., its distance to the nearest centroid is a statistical outlier), a new centroid is created
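A hedged sketch of such an outlier-driven rule follows. The specific threshold form here (nearest-centroid distance exceeding the running mean plus `num_std` standard deviations of past distances) is an illustrative assumption, not the paper's exact criterion: a novel patch seeds a new centroid instead of updating an existing one.

```python
import numpy as np

def observe_patch(centroids, dist_history, x, alpha=0.1, num_std=3.0, warmup=10):
    """centroids: list of (D,) arrays; dist_history: list of past nearest distances.

    If x is an outlier w.r.t. the distribution of past nearest-centroid distances
    (assumed mean + num_std * std rule, after a warmup period), append x as a new
    centroid; otherwise update the nearest centroid as usual. Returns novelty flag.
    """
    dists = np.linalg.norm(np.asarray(centroids) - x, axis=1)
    d_min, j = float(dists.min()), int(dists.argmin())
    is_novel = (len(dist_history) >= warmup and
                d_min > np.mean(dist_history) + num_std * np.std(dist_history))
    if is_novel:
        centroids.append(x.copy())      # new centroid: the novel patch itself
    else:
        centroids[j] = alpha * x + (1 - alpha) * centroids[j]
        dist_history.append(d_min)      # track typical matching distances
    return is_novel

# Usage: in-distribution patches refine the existing centroid; a far-away
# patch triggers novelty and grows the centroid set.
rng = np.random.default_rng(1)
centroids, history = [np.zeros(2)], []
for _ in range(50):
    observe_patch(centroids, history, rng.normal(0.0, 0.1, size=2))
novel = observe_patch(centroids, history, np.array([5.0, 5.0]))
```

Because novel patches do not update existing centroids (and are excluded from the distance statistics), established features are not dragged toward outliers.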



(Notes: We drop the time index t from this point on, but it is still implied that the centroids are learned dynamically over time. We have also experimented with the L1 metric, with only minimal differences; different distance metrics may be more appropriate for other types of data.)

