KANERVA++: EXTENDING THE KANERVA MACHINE WITH DIFFERENTIABLE, LOCALLY BLOCK ALLOCATED LATENT MEMORY

Abstract

Episodic and semantic memory are critical components of the human memory model. The theory of complementary learning systems (McClelland et al., 1995) suggests that the compressed representation produced by a serial event (episodic memory) is later restructured to build a more generalized form of reusable knowledge (semantic memory). In this work we develop a new principled Bayesian memory allocation scheme that bridges the gap between episodic and semantic memory via a hierarchical latent variable model. We take inspiration from traditional heap allocation and extend the idea of locally contiguous memory to the Kanerva Machine, enabling a novel differentiable block allocated latent memory. In contrast to the Kanerva Machine, we simplify the process of memory writing by treating it as a fully feed-forward deterministic process, relying on the stochasticity of the read key distribution to disperse information within the memory. We demonstrate that this allocation scheme improves performance in memory conditional image generation, resulting in new state-of-the-art conditional likelihood values on binarized MNIST (≤ 41.58 nats/image) and binarized Omniglot (≤ 66.24 nats/image), as well as competitive performance on CIFAR10, DMLab Mazes, Celeb-A and ImageNet 32×32.

1. INTRODUCTION

Memory is a central tenet in the model of human intelligence and is crucial to long-term reasoning and planning. Of particular interest is the theory of complementary learning systems (McClelland et al., 1995), which proposes that the brain employs two complementary systems to support the acquisition of complex behaviours: a hippocampal fast-learning system that records events as episodic memory, and a neocortical slow-learning system that learns statistics across events as semantic memory. While the functional dichotomy of the complementary systems is well established (McClelland et al., 1995; Kumaran et al., 2016), it remains unclear whether they are bounded by different computational principles. In this work we introduce a model that bridges this gap by showing that the same statistical learning principle can be applied to the fast-learning system through the construction of a hierarchical Bayesian memory. While recent work has shown that memory augmented neural networks can drastically improve the performance of generative models (Wu et al., 2018a;b), language models (Weston et al., 2015), meta-learning (Santoro et al., 2016), long-term planning (Graves et al., 2014; 2016) and sample efficiency in reinforcement learning (Zhu et al., 2019), no model has been proposed to exploit the inherent multi-dimensionality of biological memory (Reimann et al., 2017). Inspired by the traditional (computer-science) memory model of heap allocation (Figure 1, Left), we propose a novel differentiable memory allocation scheme called Kanerva++ (K++) that learns to compress an episode of samples, referred to by the set of pointers {p1, ..., p4} in Figure 1, into a latent multi-dimensional memory (Figure 1, Right).
The K++ model infers a key distribution as a proxy for the pointers (Marlow et al., 2008) and embeds similar samples into an overlapping latent representation space, enabling it to compress input distributions more efficiently. In this work, we focus on applying this novel memory allocation scheme to latent variable generative models, where we improve the memory model of the Kanerva Machine (Wu et al., 2018a;b).
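The core read operation can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual K++ implementation: the names (`block_read`, `key_mu`, `key_sigma`) are hypothetical, and the hard integer indexing shown here stands in for the differentiable (e.g., spatial-transformer-style) addressing a trainable model would require. A reparameterized sample of the stochastic read key selects the corner of a locally contiguous block within a multi-dimensional latent memory.

```python
import numpy as np

rng = np.random.default_rng(0)

def block_read(memory, key_mu, key_sigma, block_h, block_w):
    # memory: (H, W, C) multi-dimensional latent memory.
    # key_mu / key_sigma: mean and scale of the stochastic read key;
    # a reparameterized sample of the key is mapped to the top-left
    # corner of a contiguous (block_h, block_w) sub-region.
    H, W, _ = memory.shape
    key = key_mu + key_sigma * rng.standard_normal(2)  # z = mu + sigma * eps
    frac = 1.0 / (1.0 + np.exp(-key))                  # squash to (0, 1)
    top = int(frac[0] * (H - block_h))
    left = int(frac[1] * (W - block_w))
    # Locally contiguous read: one block, not scattered slots.
    return memory[top:top + block_h, left:left + block_w]
```

Because nearby keys select overlapping blocks, similar samples naturally share memory sub-regions, which is the intuition behind the overlapping latent representation described above.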

2. RELATED WORK

Variational Autoencoders: Variational autoencoders (VAEs) (Kingma & Welling, 2014) are a fundamental part of the modern machine learning toolbox and have wide-ranging applications spanning generative modeling (Kingma & Welling, 2014; Kingma et al., 2016; Burda et al., 2016), graph learning (Kipf & Welling, 2016), medical applications (Sedai et al., 2017; Zhao et al., 2019) and video analysis (Fan et al., 2020). As latent variable models, VAEs infer an approximate posterior over a latent representation Z and can be used in downstream tasks such as control in reinforcement learning (Nair et al., 2018; Pritzel et al., 2017). VAEs maximize an evidence lower bound (ELBO), L(X, Z), of the log-marginal likelihood: ln p(X) ≥ L(X, Z) = E_{q_φ(Z|X)}[ln p_θ(X|Z)] − D_KL(q_φ(Z|X) || p_θ(Z)). The variational approximation, q_φ(Z|X), is typically called the encoder, while p_θ(X|Z) comes from the decoder. Methods that aim to improve these latent variable generative models typically fall into two paradigms: learning more informative priors or leveraging novel decoders. While improved decoder models such as PixelVAE (Gulrajani et al., 2017) and PixelVAE++ (Sadeghi et al., 2019) drastically improve the performance of p_θ(X|Z), they suffer from a phenomenon called posterior collapse (Lucas et al., 2019), where the decoder can become almost independent of the posterior sample yet still reconstruct the original sample by relying on its auto-regressive property (Goyal et al., 2017a). In contrast, VampPrior (Tomczak & Welling, 2018), Associative Compression Networks (ACN) (Graves et al., 2018), VAE-nCRP (Goyal et al., 2017b) and VLAE (Chen et al., 2017) tighten the variational bound by learning more informed priors. VLAE, for example, uses a powerful auto-regressive prior; VAE-nCRP learns a non-parametric Chinese restaurant process prior; and VampPrior learns a Gaussian mixture prior representing prototypical virtual samples.
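The ELBO above can be estimated per sample as the Bernoulli decoder log-likelihood minus the KL of a diagonal Gaussian posterior against a standard-normal prior. A minimal single-sample sketch, assuming binarized inputs and a standard-normal prior (the function names here are illustrative, not from any specific codebase):

```python
import numpy as np

def gaussian_kl(mu, logvar):
    # D_KL(q_phi(z|x) || N(0, I)) in closed form for a diagonal Gaussian, in nats.
    return 0.5 * np.sum(np.exp(logvar) + mu ** 2 - 1.0 - logvar, axis=-1)

def bernoulli_log_likelihood(x, logits):
    # ln p_theta(x|z) for a Bernoulli decoder over binarized pixels,
    # computed stably from logits via log-sigmoid identities.
    log_p = -np.logaddexp(0.0, -logits)   # log sigmoid(logits)
    log_1mp = -np.logaddexp(0.0, logits)  # log (1 - sigmoid(logits))
    return np.sum(x * log_p + (1.0 - x) * log_1mp, axis=-1)

def elbo(x, mu, logvar, logits):
    # L(X, Z) = E_q[ln p(x|z)] - D_KL(q(z|x) || p(z)),
    # one-sample Monte Carlo estimate of the expectation term.
    return bernoulli_log_likelihood(x, logits) - gaussian_kl(mu, logvar)
```

The negated ELBO, reported in nats/image, is the quantity behind the MNIST and Omniglot numbers quoted in the abstract.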
On the other hand, ACN takes a two-stage approach: it clusters real samples in the space of the posterior, then uses these related samples as inputs to a learned prior, providing an information theoretic alternative to improved code transmission. Our work falls into this latter paradigm: we parameterize a learned prior by reading from a common memory, built through a transformation of an episode of input samples.

Memory Models: Inspired by the associative nature of biological memory, the Hopfield network (Hopfield, 1982) introduced the notion of content-addressable memory, defined by a set of binary neurons coupled with a Hamiltonian and a dynamical update rule. Iterating the update rule minimizes the Hamiltonian, resulting in patterns being stored at different configurations (Hopfield, 1982; Krotov & Hopfield, 2016). Writing in a Hopfield network thus corresponds to finding weight configurations such that stored patterns become attractors via Hebbian rules (Hebb, 1949). This concept of memory was extended to a distributed, continuous setting by Kanerva (1988) and to a complex-valued, holographic convolutional binding mechanism by Plate (1995). The central difference between associative memory models (Hopfield, 1982; Kanerva, 1988) and holographic memory (Plate, 1995) is that the latter decouples the size of the memory from the input word size. Most recent work with memory augmented neural networks treats memory in a slot-based manner (closer to the associative memory paradigm), where each column of a memory matrix, M, represents a single slot. Reading a memory trace, z, entails using a vector of addressing weights, r, to attend to the appropriate columns of M: z = r^T M. This paradigm of memory includes models such as the Neural Turing Machine (NTM) (Graves et al., 2014), Differentiable Neural Computer (DNC)
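The slot-based read z = r^T M can be sketched in a few lines; this is a generic illustration of the paradigm (softmax addressing over slots), not the specific NTM or DNC addressing mechanism, and the function names are hypothetical:

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a vector of addressing logits.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def slot_read(M, scores):
    # M: (num_slots, word_dim) memory matrix, one slot per row here.
    # scores: (num_slots,) unnormalized addressing logits.
    r = softmax(scores)  # addressing weights r, sum to 1
    return r @ M         # z = r^T M, a convex combination of slots
```

Note the contrast with the block read used by K++: here every slot can contribute to z regardless of position, whereas a block allocated read constrains a trace to a locally contiguous sub-region of memory.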



Figure 1: Example final state of a traditional heap allocator (Marlow et al., 2008) (Left) vs. K++ (Right); each final state is created by the sequence of operations listed on the left. K++ uses a key distribution to stochastically point to a memory sub-region, while Marlow et al. (2008) uses a direct pointer. Traditional heap allocated memory affords O(1) free / malloc computational complexity and serves as inspiration for K++, which uses differentiable neural proxies.

