HOPFIELD NETWORKS IS ALL YOU NEED

Abstract

We introduce a modern Hopfield network with continuous states and a corresponding update rule. The new Hopfield network can store exponentially (with the dimension of the associative space) many patterns, retrieves the pattern with one update, and has exponentially small retrieval errors. It has three types of energy minima (fixed points of the update): (1) global fixed points averaging over all patterns, (2) metastable states averaging over a subset of patterns, and (3) fixed points which store a single pattern. The new update rule is equivalent to the attention mechanism used in transformers. This equivalence enables a characterization of the heads of transformer models. In the first layers, these heads preferably perform global averaging, while in higher layers they perform partial averaging via metastable states. The new modern Hopfield network can be integrated into deep learning architectures as layers to allow the storage of and access to raw input data, intermediate results, or learned prototypes. These Hopfield layers enable new ways of deep learning, beyond fully-connected, convolutional, or recurrent networks, and provide pooling, memory, association, and attention mechanisms. We demonstrate the broad applicability of the Hopfield layers across various domains. Hopfield layers improved state-of-the-art on three out of four considered multiple instance learning problems as well as on immune repertoire classification with several hundreds of thousands of instances. On the UCI benchmark collections of small classification tasks, where deep learning methods typically struggle, Hopfield layers yielded a new state-of-the-art when compared to different machine learning methods. Finally, Hopfield layers achieved state-of-the-art on two drug design datasets. The implementation is available at: https://github.com/ml-jku/hopfield-layers

1. INTRODUCTION

The deep learning community has been looking for alternatives to recurrent neural networks (RNNs) for storing information. For example, linear memory networks use a linear autoencoder for sequences as a memory (Carta et al., 2020). Additional memories for RNNs like holographic reduced representations (Danihelka et al., 2016), tensor product representations (Schlag & Schmidhuber, 2018; Schlag et al., 2019), and classical associative memories (extended to fast weight approaches) (Schmidhuber, 1992; Ba et al., 2016a;b; Zhang & Zhou, 2017; Schlag et al., 2021) have been suggested. Most approaches to new memories are based on attention. The neural Turing machine (NTM) is equipped with an external memory and an attention process (Graves et al., 2014). Memory networks (Weston et al., 2014) use arg max attention: a query and the patterns are first mapped into a common space, and then the pattern with the largest dot product to the query is retrieved. End-to-end memory networks (EMN) make this attention scheme differentiable by replacing the arg max with a softmax (Sukhbaatar et al., 2015a;b). EMN with dot products became very popular and implement a key-value attention (Daniluk et al., 2017) as used in self-attention. An enhancement of EMN is the transformer (Vaswani et al., 2017a;b) and its extensions (Dehghani et al., 2018). The transformer has had a great impact on the natural language processing (NLP) community, in particular via the BERT models (Devlin et al., 2018; 2019).

Contribution of this work: (i) introducing novel deep learning layers that are equipped with a memory via modern Hopfield networks, and (ii) introducing a novel energy function and a novel update rule for continuous modern Hopfield networks that are differentiable and typically retrieve patterns after one update. Differentiability is required for gradient descent parameter updates, and retrieval with one update is compatible with activating the layers of deep networks.
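The step from memory networks to end-to-end memory networks described above amounts to replacing a hard arg max retrieval with a differentiable softmax-weighted average. A minimal sketch of both schemes (all variable names and the toy data are illustrative, not from the paper):

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Toy memory: 5 stored patterns of dimension 8, and a query that is a
# noisy version of one stored pattern.
rng = np.random.default_rng(0)
patterns = rng.standard_normal((5, 8))
query = patterns[2] + 0.1 * rng.standard_normal(8)

# Memory networks: arg max attention -- retrieve the single pattern with
# the largest dot product to the query (not differentiable in the index).
hard = patterns[np.argmax(patterns @ query)]

# End-to-end memory networks: replace arg max with softmax -- retrieval
# becomes a differentiable convex combination of the stored patterns.
weights = softmax(patterns @ query)
soft = weights @ patterns
```

The softmax weights sum to one, so the soft retrieval stays inside the convex hull of the stored patterns, which is what makes the scheme trainable by gradient descent.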
We suggest using modern Hopfield networks to store information or learned prototypes in different layers of neural networks. Binary Hopfield networks were introduced as associative memories that can store and retrieve patterns (Hopfield, 1982). A query pattern can retrieve the pattern to which it is most similar or an average over similar patterns. Hopfield networks may seem to be an ancient technique, however, new energy functions have improved their properties. The stability of spurious states or metastable states was considerably reduced (Barra et al., 2018). The largest and most impactful successes are reported on increasing the storage capacity of Hopfield networks. In a d-dimensional space, the standard Hopfield model can store d uncorrelated patterns without errors but only Cd/log(d) random patterns with C < 1/2 for a fixed stable pattern or C < 1/4 if all patterns are stable (McEliece et al., 1987). The same bound holds for nonlinear learning rules (Mazza, 1997). Using tricks of the trade and allowing small retrieval errors, the storage capacity is about 0.138d (Crisanti et al., 1986; Hertz et al., 1991; Torres et al., 2002). If the learning rule is not related to the Hebb rule, then up to d patterns can be stored (Abu-Mostafa & St. Jacques, 1985). For Hopfield networks with non-zero diagonal matrices, the storage can be increased to Cd log(d) (Folli et al., 2017). In contrast to the storage capacity, the number of energy minima (spurious states, stable states) of Hopfield networks is exponential in d (Tanaka & Edwards, 1980; Bruck & Roychowdhury, 1990; Wainrib & Touboul, 2013). The standard binary Hopfield network has an energy function that can be expressed as the sum of interaction functions F with F(x) = x^2. Modern Hopfield networks, also called "dense associative memory" (DAM) models, use an energy function with interaction functions of the form F(x) = x^n and, thereby, achieve a storage capacity proportional to d^{n-1} (Krotov & Hopfield, 2016; 2018).
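The interaction-function view above can be made concrete: for stored patterns x_mu and state sigma, the energy is E(sigma) = -sum_mu F(x_mu . sigma), and only the choice of F distinguishes the standard network (F(x) = x^2) from DAM models (F(x) = x^n) and the exponential variant. A small illustrative sketch (the helper function and toy data are our own, not from the paper):

```python
import numpy as np

def hopfield_energy(patterns, state, F):
    """Energy E(state) = -sum_mu F(x_mu . state) for interaction function F."""
    return -sum(F(p @ state) for p in patterns)

d = 16
rng = np.random.default_rng(1)
patterns = rng.choice([-1, 1], size=(3, d))  # 3 stored binary patterns
state = patterns[0].copy()                   # probe with a stored pattern

e_standard = hopfield_energy(patterns, state, lambda x: x**2)  # F(x) = x^2
e_dam = hopfield_energy(patterns, state, lambda x: x**4)       # F(x) = x^4 (DAM, n = 4)
e_exp = hopfield_energy(patterns, state, np.exp)               # F(x) = exp(x)
```

Because the probe equals a stored pattern, its self-term contributes -F(d), and steeper F makes that term dominate the energy landscape more strongly, which is the intuition behind the higher storage capacities of DAM and exponential models.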
The energy function of modern Hopfield networks makes them robust against adversarial attacks (Krotov & Hopfield, 2018). Modern binary Hopfield networks with energy functions based on interaction functions of the form F(x) = exp(x) even lead to a storage capacity of 2^{d/2}, where all stored binary patterns are fixed points but the radius of attraction vanishes (Demircigil et al., 2017). However, in order to integrate Hopfield networks into deep learning architectures, it is necessary to make them differentiable, that is, we require continuous Hopfield networks (Hopfield, 1984; Koiran, 1994). Therefore, we generalize the energy function of Demircigil et al. (2017), which builds on exponential interaction functions, to continuous patterns and states and obtain a new modern Hopfield network. We also propose a new update rule which ensures global convergence to stationary points of the energy (local minima or saddle points). We prove that our new modern Hopfield network typically retrieves patterns in one update step (ε-close to the fixed point) with an exponentially low error and has a storage capacity proportional to c^{(d-1)/4} (reasonable settings for c = 1.37 and c = 3.15 are given in Theorem 3). The retrieval of patterns with one update is important for integrating Hopfield networks into deep learning architectures, where layers are activated only once. Surprisingly, our new update rule is also the key-value attention as used in transformer and BERT models (see Fig. 1). Our modern Hopfield networks can be integrated as a new layer in deep learning architectures for pooling, memory, prototype learning, and attention. We test these new layers on different benchmark datasets and tasks like immune repertoire classification.

Figure 1: We generalize the energy of binary modern Hopfield networks to continuous states while keeping fast convergence and storage capacity properties. We also propose a new update rule that minimizes the energy. The new update rule is the attention mechanism of the transformer. Formulae are modified to express the softmax as a row vector. The "="-sign means "keeps the properties".

