CONTINUAL LEARNING USING HASH-ROUTED CONVOLUTIONAL NEURAL NETWORKS

Abstract

Continual learning could shift the machine learning paradigm from data-centric to model-centric. A continual learning model needs to scale efficiently to handle semantically different datasets while avoiding unnecessary growth. We introduce hash-routed convolutional neural networks: a group of convolutional units through which data flows dynamically. Feature maps are compared using feature hashing, and similar data is routed to the same units. A hash-routed network provides excellent plasticity thanks to its routed nature, while generating stable features through the use of orthogonal feature hashing. Each unit evolves separately, and new units can be added (to be used only when necessary). Hash-routed networks achieve excellent performance across a variety of typical continual learning benchmarks without storing raw data, and train using only gradient descent. Besides providing a continual learning framework for supervised tasks with encouraging results, our model can also be used for unsupervised or reinforcement learning.

1. INTRODUCTION

When faced with a new modeling challenge, a data scientist will typically train a model from a class of models based on his or her expert knowledge and retain the best-performing one. The trained model is often useless when faced with different data: retraining it on new data will result in poor performance when trying to reuse the model on the original data. This is what is known as catastrophic forgetting (McCloskey & Cohen, 1989). Although transfer learning avoids retraining networks from scratch, keeping the knowledge acquired by a trained model and using it to learn new tasks is not straightforward. The real knowledge remains with the human expert.

Model training is usually a data-centric task. Continual learning (Thrun, 1995) makes model training a model-centric task by maintaining the knowledge acquired in previous learning tasks. Recent work in continual (or lifelong) learning has focused on supervised classification tasks, and most of the developed algorithms do not generate stable features that could be reused in unsupervised learning tasks, as a more generic algorithm such as the one we present does. Models should also be able to adapt and scale reasonably to accommodate different learning tasks without using an exponential amount of resources, and preferably with little data scientist intervention.

To tackle this challenge, we introduce hash-routed networks (HRN). An HRN is composed of multiple independent processing units. Unlike in typical convolutional neural networks (CNN), the data flow between these units is determined dynamically, by measuring similarity between hashed feature maps. The generated feature maps are stable. Scalability is ensured through unit evolution and by increasing the number of available units, while avoiding exponential memory use. This new type of network maintains stable performance across a variety of tasks (including semantically different tasks). We describe expansion, update and regularization algorithms for continual learning.
We validate our approach using multiple publicly available datasets, by comparing supervised classification performance. Benchmarks include Pairwise-MNIST, MNIST/Fashion-MNIST (Xiao et al., 2017) and SVHN/incremental-Cifar100 (Netzer et al., 2011; Krizhevsky et al., 2009). Relevant background is introduced in section 2. Section 3 details the hash-routing algorithm and discusses its key attributes. Section 4 compares our work with other continual learning and dynamic network studies. A large set of experiments is carried out in section 5.

2. BACKGROUND

Feature hashing, also known as the hashing trick (Weinberger et al., 2009), is a dimension reduction transformation with two key properties for our work: inner product conservation and quasi-orthogonality. A feature hashing function $\phi : \mathbb{R}^N \to \mathbb{R}^s$ can be built using two uniform hash functions $h : \mathbb{N} \to \{1, 2, ..., s\}$ and $\xi : \mathbb{N} \to \{-1, 1\}$, as such:

$$\phi_i(x) = \sum_{\substack{j \in [\![1, N]\!] \\ h(j) = i}} \xi(j)\, x_j$$

where $\phi_i$ denotes the $i$-th component of $\phi$. The inner product is preserved, as $\mathbb{E}[\phi(a)^T \phi(b)] = a^T b$: $\phi$ provides an unbiased estimator of the inner product. It can also be shown that if $\|a\|_2 = \|b\|_2 = 1$, then $\sigma_{a,b} = O(\frac{1}{s})$. Two different hash functions $\phi$ and $\phi'$ (i.e. $h \neq h'$ or $\xi \neq \xi'$) are orthogonal; in other words, $\forall (v, w) \in \mathrm{Im}(\phi) \times \mathrm{Im}(\phi')$, $\mathbb{E}[v^T w] \approx 0$. Furthermore, Weinberger et al. (2009) detail the inner product bounds, given $v \in \mathrm{Im}(\phi')$ and $x \in \mathbb{R}^N$:

$$\Pr\left(|v^T \phi(x)| > \epsilon\right) \leq 2 \exp\left(-\frac{\epsilon^2 / 2}{s^{-1} \|v\|_2^2 \|x\|_2^2 + \epsilon \|v\|_\infty \|x\|_\infty / 3}\right) \quad (1)$$

Eq. 1 shows that approximate orthogonality is better when $\phi$ handles bounded vectors. Data-independent bounds can be obtained by setting $\|x\|_\infty = 1$ and replacing $v$ by $\frac{v}{\|v\|_2}$, which leads to $\|x\|_2^2 \leq N$ and $\|v\|_\infty \leq 1$, hence:

$$\Pr\left(|v^T \phi(x)| > \epsilon\right) \leq 2 \exp\left(-\frac{\epsilon^2 / 2}{s^{-1} \|x\|_2^2 + \epsilon \|v\|_\infty / 3}\right) \leq 2 \exp\left(-\frac{\epsilon^2 / 2}{N / s + \epsilon / 3}\right)$$

Better approximate orthogonality significantly reduces correlation when summing feature vectors generated by different hashing functions, as is done in hash-routed networks.
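Both properties are easy to observe numerically. The following is a minimal NumPy sketch of the hashing trick (function names are ours; $h$ and $\xi$ are materialised as random lookup tables for simplicity):

```python
import numpy as np

def make_feature_hash(N, s, seed):
    """Build phi: R^N -> R^s from two uniform hash functions
    h: [1..N] -> [1..s] and xi: [1..N] -> {-1, +1}."""
    rng = np.random.default_rng(seed)
    h = rng.integers(0, s, size=N)        # bucket index h(j) for each coordinate j
    xi = rng.choice([-1.0, 1.0], size=N)  # sign xi(j) for each coordinate j

    def phi(x):
        # phi_i(x) = sum over j with h(j) = i of xi(j) * x_j
        out = np.zeros(s)
        np.add.at(out, h, xi * x)  # unbuffered scatter-add into buckets
        return out

    return phi

rng = np.random.default_rng(1)
N, s = 4096, 256
a, b = rng.standard_normal(N), rng.standard_normal(N)

# Inner product preservation: E[phi(a)^T phi(b)] = a^T b, so averaging the
# hashed inner product over many independent hash functions approaches a^T b.
estimates = []
for k in range(200):
    phi = make_feature_hash(N, s, seed=k)
    estimates.append(phi(a) @ phi(b))
print(np.mean(estimates), a @ b)  # sample mean concentrates around a^T b

# Quasi-orthogonality: the same vector hashed with two *different* functions
# yields nearly orthogonal outputs (small cosine similarity).
phi1 = make_feature_hash(N, s, seed=1000)
phi2 = make_feature_hash(N, s, seed=2000)
u, w = phi1(a), phi2(a)
print(abs(u @ w) / (np.linalg.norm(u) * np.linalg.norm(w)))
```

The scatter-add form makes the estimator's unbiasedness apparent: each bucket sums independently signed coordinates, so cross terms cancel in expectation.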

3. HASH-ROUTED NETWORKS

3.1. STRUCTURE

A hash-routed network maps input data to a feature vector of size s that is stable across successive learning tasks. An HRN exploits inner product preservation to ensure that similarity between generated feature vectors reflects the similarity between input samples. Quasi-orthogonality of the different feature hashing functions is used to reduce correlation between the output's components, since the output is the sum of individual hashed feature vectors.

An HRN H is composed of M units {U_1, ..., U_M}. Each unit U_k is composed of:

• A series of convolution operations f_k, characterized by a number of input channels and a number of output channels, resulting in a vector of trainable parameters w_k. Note that f_k can also include pooling operations.

• An orthonormal projection basis B_k, containing a maximum of m non-zero orthogonal vectors of size s. Each basis is filled with zero vectors at first; these will be replaced by non-zero vectors during training.

• A feature hashing function φ_k that maps a feature vector of any size to a vector of size s.

The network also has an independent feature hashing function φ_0. All the feature hashing functions are different but generate feature vectors of size s.
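The unit structure above can be sketched in a few lines of NumPy (an illustrative sketch only: class and attribute names are ours, the convolution op is left as a placeholder callable, and the fixed MAX_DIM cap on flattened feature-map size is an implementation convenience not in the paper):

```python
import numpy as np

MAX_DIM = 1 << 16  # cap on flattened feature-map size (implementation convenience)

class HashUnit:
    """One HRN unit U_k: convolution ops f_k, projection basis B_k, hash phi_k."""

    def __init__(self, conv_op, s, m, seed):
        self.f = conv_op                                 # f_k (may include pooling)
        self.B = np.zeros((m, s))                        # B_k: m zero rows at first,
                                                         # replaced by orthogonal
                                                         # vectors during training
        rng = np.random.default_rng(seed)
        self.h = rng.integers(0, s, size=MAX_DIM)        # bucket index per coordinate
        self.xi = rng.choice([-1.0, 1.0], size=MAX_DIM)  # sign per coordinate
        self.s = s

    def phi(self, feat):
        """phi_k: hash a feature map of any (flattened) size down to R^s."""
        x = np.asarray(feat).ravel()
        out = np.zeros(self.s)
        np.add.at(out, self.h[: x.size], self.xi[: x.size] * x)
        return out

class HRN:
    """An HRN: M independent units plus a standalone hashing function phi_0.
    All hashing functions differ (different seeds) but share the output size s."""

    def __init__(self, M, s, m, seed=0):
        self.units = [HashUnit(conv_op=lambda t: t, s=s, m=m, seed=seed + 1 + k)
                      for k in range(M)]
        self.phi0 = HashUnit(conv_op=None, s=s, m=m, seed=seed).phi

# usage: feature maps of different shapes all hash to vectors of size s
net = HRN(M=4, s=128, m=8)
fmap = np.random.default_rng(5).standard_normal((16, 8, 8))
v = net.units[0].phi(fmap)
print(v.shape)  # (128,)
```

Seeding each unit's hash tables independently gives the "all different, same output size" property the text requires.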

3.2.1. HASH-ROUTING ALGORITHM

H maps an input sample x to a feature vector H(x) of size s. In a vanilla CNN, x would go through a fixed series of deterministic convolutional layers to generate feature maps of growing size. In an HRN, the sequence of units that x traverses is instead determined dynamically: intermediate feature maps are hashed, and the data is routed to the units that have processed similar data.
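A single routing decision can be sketched as follows. This is a speculative illustration: scoring each unit by the energy of the hashed feature map's projection onto its basis B_k is our assumption for the sake of the example, not the paper's stated selection rule.

```python
import numpy as np

class Unit:
    """Minimal stand-in for an HRN unit: only its projection basis B (m x s)."""
    def __init__(self, B):
        self.B = B  # orthonormal rows; zero rows are still-unused slots

def route_step(v, units):
    """Route a hashed feature map v (in R^s) to the most similar unit.
    Scoring by projection energy ||B_k v||^2 is an illustrative assumption."""
    scores = np.array([np.sum((u.B @ v) ** 2) for u in units])
    return int(np.argmax(scores)), scores

# toy example: v lies in unit 1's span, so it is routed there
s = 8
v = np.zeros(s); v[0] = 1.0
u0 = Unit(np.eye(2, s, k=4))  # basis spans coordinates 4-5
u1 = Unit(np.eye(2, s))       # basis spans coordinates 0-1
k, scores = route_step(v, [u0, u1])
print(k)  # 1
```

Because the bases live in the common hashed space R^s, this comparison works regardless of the shape of the feature map that produced v.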




