ON THE MAPPING BETWEEN HOPFIELD NETWORKS AND RESTRICTED BOLTZMANN MACHINES

Abstract

Hopfield networks (HNs) and Restricted Boltzmann Machines (RBMs) are two important models at the interface of statistical physics, machine learning, and neuroscience. Recently, there has been interest in the relationship between HNs and RBMs, due to their similarity under the statistical mechanics formalism. An exact mapping between HNs and RBMs has been previously noted for the special case of orthogonal ("uncorrelated") encoded patterns. We present here an exact mapping in the case of correlated pattern HNs, which are more broadly applicable to existing datasets. Specifically, we show that any HN with N binary variables and p < N arbitrary binary patterns can be transformed into an RBM with N binary visible variables and p gaussian hidden variables. We outline the conditions under which the reverse mapping exists, and conduct experiments on the MNIST dataset which suggest the mapping provides a useful initialization to the RBM weights. We discuss extensions, the potential importance of this correspondence for the training of RBMs, and for understanding the performance of deep architectures which utilize RBMs.

1. INTRODUCTION

Hopfield networks (HNs) (Hopfield, 1982; Amit, 1989) are a classical neural network architecture that can store prescribed patterns as fixed-point attractors of a dynamical system. In their standard formulation with binary valued units, HNs can be regarded as spin glasses with pairwise interactions $J_{ij}$ that are fully determined by the patterns to be encoded. HNs have been extensively studied in the statistical mechanics literature (e.g. Kanter & Sompolinsky (1987); Amit et al. (1985)), where they can be seen as an interpolation between the ferromagnetic Ising model ($p = 1$ pattern) and the Sherrington-Kirkpatrick spin glass model (many random patterns) (Kirkpatrick & Sherrington, 1978; Barra & Guerra, 2008). By encoding patterns as dynamical attractors which are robust to perturbations, HNs provide an elegant solution to pattern recognition and classification tasks. They are considered the prototypical attractor neural network, and are the historical precursor to modern recurrent neural networks. Concurrently, spin glasses have been used extensively in the historical machine learning literature, where they comprise a sub-class of "Boltzmann machines" (BMs) (Ackley et al., 1985). Given a collection of data samples drawn from a data distribution, one is generally interested in "training" a BM by tuning its weights $J_{ij}$ such that its equilibrium distribution can reproduce the data distribution as closely as possible (Hinton, 2012). The resulting optimization problem is dramatically simplified when the network has a two-layer structure where each layer has no self-interactions, so that there are only inter-layer connections (Hinton, 2012) (see Fig. 1). This architecture is known as a Restricted Boltzmann Machine (RBM), and the two layers are sometimes called the visible layer and the hidden layer.
The visible layer characteristics (dimension, type of units) are determined by the training data, whereas the hidden layer can have binary or continuous units and its dimension is chosen somewhat arbitrarily. In addition to generative modelling, RBMs and their multi-layer extensions have been used for a variety of learning tasks, such as classification, feature extraction, and dimension reduction (e.g. Salakhutdinov et al. (2007); Hinton & Salakhutdinov (2006)). There has been extensive interest in the relationship between HNs and RBMs, as both are built on the Ising model formalism and fulfill similar roles, with the aim of better understanding RBM behaviour and potentially improving performance. Various results in this area have been recently reviewed (Marullo & Agliari, 2021). In particular, an exact mapping between HNs and RBMs has been previously noted for the special case of uncorrelated (orthogonal) patterns (Barra et al., 2012). Several related models have since been studied (Agliari et al., 2013; Mézard, 2017), which partially relax the uncorrelated pattern constraint. However, the patterns observed in most real datasets exhibit significant correlations, precluding the use of these approaches. In this paper, we demonstrate an exact correspondence between HNs and RBMs in the case of correlated pattern HNs. Specifically, we show that any HN with $N$ binary units and $p < N$ arbitrary (i.e. non-orthogonal) binary patterns encoded via the projection rule (Kanter & Sompolinsky, 1987; Personnaz et al., 1986) can be transformed into an RBM with $N$ binary visible and $p$ gaussian hidden variables. We then characterize when the reverse map from RBMs to HNs can be made. We consider a practical example using the mapping, and discuss the potential importance of this correspondence for the training and interpretability of RBMs.

2. RESULTS

We first introduce the classical solution to the problem of encoding $N$-dimensional binary $\{-1, +1\}$ vectors $\{\xi^\mu\}_{\mu=1}^p$, termed "patterns", as global minima of a pairwise spin glass $H(s) = -\frac{1}{2} s^T J s$. This is often framed as a pattern retrieval problem, where the goal is to specify or learn $J_{ij}$ such that an energy-decreasing update rule for $H(s)$ converges to the patterns (i.e. they are stable fixed points). Consider the $N \times p$ matrix $\xi$ with the $p$ patterns as its columns. Then the classical prescription known as the projection rule (or pseudo-inverse rule) (Kanter & Sompolinsky, 1987; Personnaz et al., 1986), $J = \xi(\xi^T \xi)^{-1} \xi^T$, guarantees that the $p$ patterns will be global minima of $H(s)$. The resulting spin model is commonly called a (projection) Hopfield network, and has the Hamiltonian
$$H(s) = -\frac{1}{2} s^T \xi (\xi^T \xi)^{-1} \xi^T s. \quad (1)$$
Note that the invertibility of $\xi^T \xi$ is guaranteed as long as the patterns are linearly independent (we therefore require $p \le N$). Also note that in the special (rare) case of orthogonal patterns $\xi^\mu \cdot \xi^\nu = N \delta_{\mu\nu}$ (also called "uncorrelated"), studied in the previous work (Barra et al., 2012), one has $\xi^T \xi = N I$ and so the pseudo-inverse interactions reduce to the well-known Hebbian form $J = \frac{1}{N} \xi \xi^T$ (the properties of which are studied extensively in Amit et al. (1985)). Additional details on the projection HN Eq. (1) are provided in Appendix A. To make progress in analyzing Eq. (1), we first consider a transformation of $\xi$ which eliminates the inverse factor.
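The projection rule is easy to verify numerically. The following minimal sketch (illustrative sizes, not taken from the paper) builds $J$ for random, generally correlated patterns and checks that each pattern is a fixed point of the sign-update dynamics $s \leftarrow \mathrm{sgn}(Js)$:

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 50, 5
xi = rng.choice([-1.0, 1.0], size=(N, p))   # p binary patterns as columns (correlated in general)

# projection (pseudo-inverse) rule: J = xi (xi^T xi)^{-1} xi^T
J = xi @ np.linalg.inv(xi.T @ xi) @ xi.T

# each pattern satisfies J xi_mu = xi_mu, so it is a fixed point of s <- sgn(J s)
for mu in range(p):
    s = xi[:, mu]
    assert np.array_equal(np.sign(J @ s), s)
```

Because $J$ is the orthogonal projector onto the span of the patterns, $J\xi^\mu = \xi^\mu$ exactly, which is why the fixed-point check holds without any orthogonality assumption on the patterns.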

2.1. MAPPING A HOPFIELD NETWORK TO A RESTRICTED BOLTZMANN MACHINE

In order to obtain a more useful representation of the quadratic form Eq. (1) (for our purposes), we utilize the QR decomposition (Schott & Stewart, 1999) of $\xi$ to "orthogonalize" the patterns,
$$\xi = QR, \quad (2)$$
with $Q \in \mathbb{R}^{N \times p}$, $R \in \mathbb{R}^{p \times p}$. The columns of $Q$ are the orthogonalized patterns, and form an orthonormal basis (of non-binary vectors) for the $p$-dimensional subspace spanned by the binary patterns. $R$ is upper triangular, and if its diagonals are held positive then $Q$ and $R$ are both unique (Schott & Stewart, 1999). Note both the order and sign of the columns of $\xi$ are irrelevant for HN pattern recall, so there are $n = 2^p \cdot p!$ possible $Q, R$ pairs. Fixing a pattern ordering, we can use the orthogonality of $Q$ to re-write the interaction matrix as
$$J = \xi(\xi^T \xi)^{-1} \xi^T = QR(R^T R)^{-1} R^T Q^T = QQ^T \quad (3)$$
(the last equality follows from $(R^T R)^{-1} = R^{-1}(R^T)^{-1}$). Eq. (3) resembles the simple Hebbian rule but with non-binary orthogonal patterns. Defining $q \equiv Q^T s$ in analogy to the classical pattern overlap parameter $m \equiv \frac{1}{N} \xi^T s$ (Amit et al., 1985), we have
$$H(s) = -\frac{1}{2} s^T QQ^T s = -\frac{1}{2} q(s) \cdot q(s). \quad (4)$$
Using a Gaussian integral as in Amit et al. (1985); Barra et al. (2012); Mézard (2017) to transform (exactly) the partition function $Z \equiv \sum_{\{s\}} e^{-\beta H(s)}$ of Eq. (1), we get
$$Z = \sum_{\{s\}} e^{\frac{1}{2}(\beta q)^T (\beta^{-1} I)(\beta q)} = \sum_{\{s\}} \int e^{-\frac{\beta}{2} \sum_\mu \lambda_\mu^2 + \beta \sum_\mu \lambda_\mu \sum_i Q_{i\mu} s_i} \prod_\mu \frac{d\lambda_\mu}{\sqrt{2\pi/\beta}}. \quad (5)$$
The second line can be seen as the partition function of an expanded Hamiltonian for the $N$ (binary) original variables $\{s_i\}$ and the $p$ (continuous) auxiliary variables $\{\lambda_\mu\}$, i.e.
$$H_{RBM}(\{s_i\}, \{\lambda_\mu\}) = \frac{1}{2} \sum_\mu \lambda_\mu^2 - \sum_\mu \sum_i Q_{i\mu} s_i \lambda_\mu. \quad (6)$$
Note that this is the Hamiltonian of a binary-continuous RBM with inter-layer weights $Q_{i\mu}$. The original HN is therefore equivalent to an RBM described by Eq. (6) (depicted in Fig. 1). As mentioned above, there are many RBMs which correspond to the same HN due to the combinatorics of choosing $Q$.
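The orthogonalization step and the identity $J = QQ^T$ can be checked directly. A minimal sketch with illustrative sizes (random patterns standing in for a real pattern matrix):

```python
import numpy as np

rng = np.random.default_rng(1)
N, p = 50, 5
xi = rng.choice([-1.0, 1.0], size=(N, p))   # binary pattern matrix, columns are patterns

Q, R = np.linalg.qr(xi)                     # reduced QR: xi = Q R, Q is N x p, R is p x p upper triangular
J_proj = xi @ np.linalg.inv(xi.T @ xi) @ xi.T

assert np.allclose(Q.T @ Q, np.eye(p))      # columns of Q are orthonormal
assert np.allclose(J_proj, Q @ Q.T)         # projection rule interactions equal Q Q^T (Eq. 3)
```

The columns of `Q` are the "orthogonalized patterns" that serve as the RBM inter-layer weights in Eq. (6).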
In fact, instead of the QR factorization one can use any decomposition which satisfies $J = UU^T$, with orthogonal $U \in \mathbb{R}^{N \times p}$ (see Appendix B), in which case $U$ acts as the RBM weights. Also note that the inclusion of an applied field term $\sum_i b_i s_i$ in Eq. (1) trivially carries through the procedure, i.e.
$$H_{RBM}(\{s_i\}, \{\lambda_\mu\}) = \frac{1}{2} \sum_\mu \lambda_\mu^2 - \sum_i b_i s_i - \sum_\mu \sum_i Q_{i\mu} s_i \lambda_\mu.$$

Figure 1: Correspondence between Hopfield networks (HNs) with correlated patterns and binary-gaussian restricted Boltzmann machines (RBMs). The HN has $N$ binary units and pairwise interactions $J$ defined by $p < N$ (possibly correlated) patterns $\{\xi^\mu\}_{\mu=1}^p$. The patterns are encoded as minima of Eq. (1) through the projection rule $J = \xi(\xi^T \xi)^{-1} \xi^T$, where the $\xi^\mu$ form the columns of $\xi$. We orthogonalize the patterns through a QR decomposition $\xi = QR$. The HN is equivalent to an RBM with $N$ binary visible units and $p$ gaussian hidden units with inter-layer weights defined as the orthogonalized patterns $Q_{i\mu}$, and Hamiltonian Eq. (6). See (Agliari et al., 2017) for the analogous mapping in the uncorrelated case.

Instead of working with the joint form Eq. (6), one could take a different direction from Eq. (5) and sum out the original variables $\{s_i\}$, i.e.
$$Z = \int e^{-\frac{\beta}{2} \sum_\mu \lambda_\mu^2} \, 2^N \prod_i \cosh\Big(\beta \sum_\mu Q_{i\mu} \lambda_\mu\Big) \prod_\mu \frac{d\lambda_\mu}{\sqrt{2\pi/\beta}}. \quad (7)$$
This continuous, $p$-dimensional representation is useful for numerical estimation of $Z$ (Section 3.1). We may write Eq. (7) as $Z \propto \int e^{-\beta F_0(\lambda)} \prod_\mu d\lambda_\mu$, where
$$F_0(\{\lambda_\mu\}) = \frac{1}{2} \sum_\mu \lambda_\mu^2 - \frac{1}{\beta} \sum_i \ln \cosh\Big(\beta \sum_\mu Q_{i\mu} \lambda_\mu\Big). \quad (8)$$
Eq. (8) is an approximate Lyapunov function for the mean dynamics of $\{\lambda_\mu\}$; $\nabla_\lambda F_0$ describes the effective behaviour of the stochastic dynamics of the $N$ binary variables $\{s_i\}$ at temperature $\beta^{-1}$.
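The effective free energy of Eq. (8) is cheap to evaluate numerically. A minimal sketch (illustrative sizes and an arbitrary $\beta$; `Q` here comes from random patterns, not a trained model), using a numerically stable form of $\ln\cosh$:

```python
import numpy as np

rng = np.random.default_rng(2)
N, p, beta = 50, 5, 2.0
xi = rng.choice([-1.0, 1.0], size=(N, p))
Q, _ = np.linalg.qr(xi)                     # orthogonalized patterns (RBM weights)

def F0(lam):
    """F0(lambda) of Eq. (8): 0.5*sum(lam^2) - (1/beta) * sum_i ln cosh(beta * (Q lam)_i)."""
    field = Q @ lam
    # ln cosh(x) = logaddexp(x, -x) - ln 2, which avoids overflow for large |x|
    log_cosh = np.logaddexp(beta * field, -beta * field) - np.log(2.0)
    return 0.5 * lam @ lam - log_cosh.sum() / beta

assert np.isclose(F0(np.zeros(p)), 0.0)     # both terms vanish at lambda = 0
```

This is the function one would hand to a quadrature or annealed-importance-sampling routine to estimate $Z$ in the continuous $p$-dimensional representation.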

2.2. COMMENTS ON THE REVERSE MAPPING

With the mapping from HNs (with correlated patterns) to RBMs established, we now consider the reverse direction. Consider a binary-continuous RBM with inter-layer weights $W_{i\mu}$ which couple a visible layer of $N$ binary variables $\{s_i\}$ to a hidden layer of $p$ continuous variables $\{\lambda_\mu\}$,
$$H(s, \lambda) = \frac{1}{2} \sum_\mu \lambda_\mu^2 - \sum_i b_i s_i - \sum_\mu \sum_i W_{i\mu} s_i \lambda_\mu. \quad (9)$$
Here we use $W$ instead of $Q$ for the RBM weights to emphasize that the RBM is not necessarily an HN. First, following Mehta et al. (2019), we transform the RBM to a BM with binary states by integrating out the hidden variables. The corresponding Hamiltonian for the visible units alone is (see Appendix D.1 for details)
$$H(s) = -\sum_i b_i s_i - \frac{1}{2} \sum_i \sum_j \sum_\mu W_{i\mu} W_{j\mu} s_i s_j, \quad (10)$$
a pairwise Ising model with a particular coupling structure $J_{ij} = \sum_\mu W_{i\mu} W_{j\mu}$, which in matrix form is
$$J = \sum_\mu w_\mu w_\mu^T = WW^T, \quad (11)$$
where $\{w_\mu\}$ are the $p$ columns of $W$. In general, the Ising model Eq. (10) produced by integrating out the hidden variables need not have Hopfield structure (discussed below). However, it automatically does (as noted in Barra et al. (2012)) in the very special case where $W_{i\mu} \in \{-1, +1\}$. In that case, the binary patterns are simply $\{w_\mu\}$, so that Eq. (11) represents a Hopfield network with the Hebbian prescription. This situation is likely rare and may only arise as a by-product of constrained training; for a generically trained RBM the weights will not be binary. It is therefore interesting to clarify when and how real-valued RBM interactions $W$ can be associated with HNs.

Approximate binary representation of $W$: In Section 2.1, we orthogonalized the binary matrix $\xi$ via the QR decomposition $\xi = QR$, where $Q$ is an orthogonal (but non-binary) matrix, which allowed us to map a projection HN (defined by its patterns $\xi$, Eq. (1)) to an RBM (defined by its inter-layer weights $Q$, Eq. (6)). Here we consider the reverse map. Given a trained RBM with weights $W \in \mathbb{R}^{N \times p}$, we look for an invertible transformation $X \in \mathbb{R}^{p \times p}$ which binarizes $W$.
We make the mild assumption that $W$ is rank $p$. If we find such an $X$, then $B = WX$ will be the Hopfield pattern matrix (analogous to $\xi$), with $B_{i\mu} \in \{-1, +1\}$. This is a non-trivial problem, and an exact solution is not guaranteed. As a first step to study the problem, we relax it to that of finding a matrix $X \in GL_p(\mathbb{R})$ (i.e. invertible, $p \times p$, real) which minimizes the binarization error
$$\arg\min_{X \in GL_p(\mathbb{R})} \|WX - \mathrm{sgn}(WX)\|_F. \quad (12)$$
We denote the approximately binary transformation of $W$ via a particular solution $X$ by
$$B_p = WX. \quad (13)$$
We also define the associated error matrix $E \equiv B_p - \mathrm{sgn}(B_p)$. We stress that $B_p$ is non-binary and approximates $B \equiv \mathrm{sgn}(B_p)$, the columns of which will be HN patterns under certain conditions on $E$. We provide an initial characterization and example in Appendix D.
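As a sanity check on this relaxation, consider the one case where the reverse map is exact by construction: if the RBM weights happen to equal the $Q$ factor of some binary pattern matrix $\xi = QR$, then $X = R$ is an invertible binarizer with zero error. A minimal sketch (illustrative sizes; `W` is constructed, not trained):

```python
import numpy as np

rng = np.random.default_rng(3)
N, p = 50, 5
xi = rng.choice([-1.0, 1.0], size=(N, p))   # hidden "ground truth" binary patterns
Q, R = np.linalg.qr(xi)

W = Q          # stand-in for trained RBM weights that exactly match a projection HN
X = R          # candidate invertible transformation
B_p = W @ X    # = xi up to floating-point error, so the binarization error vanishes

assert np.allclose(B_p, np.sign(B_p))       # E = B_p - sgn(B_p) is (numerically) zero
assert np.allclose(np.abs(B_p), 1.0)        # entries are +/-1
```

For a generic trained $W$ no such exact $X$ exists, which is what motivates the relaxed objective Eq. (12).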

3. EXPERIMENTS ON MNIST DATASET

Next we investigate whether the Hopfield-RBM correspondence can provide an advantage for training binary-gaussian RBMs. We consider the popular MNIST dataset of handwritten digits (LeCun et al., 1998), which consists of $28 \times 28$ images with greyscale pixel values from 0 to 255. We treat the sample images as $N \equiv 784$ dimensional binary vectors in $\{-1, +1\}^N$ by setting all non-zero pixel values to $+1$ (and zero values to $-1$). The dataset includes $M \equiv 60{,}000$ training images and $10{,}000$ testing images, as well as their class labels $\mu \in \{0, \ldots, 9\}$.
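The binarization step described above is a one-liner. A minimal sketch with a tiny illustrative array in place of an MNIST image:

```python
import numpy as np

# toy stand-in for a greyscale image (values 0..255)
img = np.array([[0, 12, 255],
                [0,  0,   7]])

# map non-zero pixels to +1 and zero pixels to -1, then flatten to an N-dimensional spin vector
s = np.where(img > 0, 1, -1).astype(np.int8).ravel()

assert set(s.tolist()) <= {-1, 1}
```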

3.1. GENERATIVE OBJECTIVE

The primary task for generative models such as RBMs is to reproduce a data distribution. Given a data distribution $p_{data}$, the generative objective is to train a model (here an RBM defined by its parameters $\theta$) such that the model distribution $p_\theta$ is as close to $p_{data}$ as possible. This is often quantified by the Kullback-Leibler (KL) divergence
$$D_{KL}(p_{data} \,\|\, p_\theta) = \sum_s p_{data}(s) \ln \frac{p_{data}(s)}{p_\theta(s)}.$$
One generally does not have access to the actual data distribution; instead there is usually a representative training set $S = \{s_a\}_{a=1}^M$ sampled from it. As the entropy of the data distribution is constant with respect to $\theta$, the generative objective is equivalent to maximizing the log-likelihood
$$\mathcal{L}(\theta) = \frac{1}{M} \sum_a \ln p_\theta(s_a).$$
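The equivalence between the two objectives follows from the decomposition $D_{KL}(p_{data}\|p_\theta) = -\mathbb{E}_{p_{data}}[\ln p_\theta] - H(p_{data})$, where the entropy term is independent of $\theta$. A tiny numerical check on an arbitrary three-state example:

```python
import numpy as np

p_data = np.array([0.5, 0.3, 0.2])    # illustrative data distribution
p_theta = np.array([0.4, 0.4, 0.2])   # illustrative model distribution

kl = np.sum(p_data * np.log(p_data / p_theta))
avg_ll = np.sum(p_data * np.log(p_theta))     # expected log-likelihood under p_data
entropy = -np.sum(p_data * np.log(p_data))

# D_KL = -E[ln p_theta] - H(p_data): minimizing KL over theta = maximizing the log-likelihood
assert np.isclose(kl, -avg_ll - entropy)
```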

3.1.1. HOPFIELD RBM SPECIFICATION

With labelled classes of training data, specification of an RBM via a one-shot Hopfield rule ("Hopfield RBM") is straightforward. In the simplest approach, we define $p = 10$ representative patterns via the (binarized) class means
$$\xi^\mu \equiv \mathrm{sgn}\Big(\frac{1}{|S_\mu|} \sum_{s \in S_\mu} s\Big), \quad (14)$$
where $\mu \in \{0, \ldots, 9\}$ and $S_\mu$ is the set of sample images for class $\mu$. These patterns comprise the columns of the $N \times p$ pattern matrix $\xi$, which is then orthogonalized as in Eq. (2) to obtain the RBM weights $W$ which couple the $N$ binary visible units to the $p$ gaussian hidden units. We also consider refining this approach by considering sub-classes within each class, representing, for example, the different ways one might draw a "7". As a proof of principle, we split each digit class into $k$ sub-patterns using hierarchical clustering. We found good results with agglomerative clustering using Ward linkage and Euclidean distance (see Murtagh & Contreras (2012) for an overview of this and related methods). In this way, we can define a hierarchy of Hopfield RBMs. At one end, $k = 1$, we have our simplest RBM, which has $p = 10$ hidden units and encodes 10 patterns (using Eq. (14)), one for each digit class. At the other end, $10k/N \to 1$, we can specify increasingly refined RBMs that encode $k$ sub-patterns for each of the 10 digit classes, for a total of $p = 10k$ patterns and hidden units. This approach has an additional cost of identifying the sub-classes, but is still typically faster than training the RBM weights directly (discussed below). The generative performance as a function of $k$ and $\beta$ is shown in Fig. 2, and increases monotonically with $k$ in the range plotted. If $\beta$ is too high (very low temperature), the free energy basins will be very deep directly at the patterns, and so the model distribution will not capture the diversity of images from the data.
If $\beta$ is too low (high temperature), there is a "melting transition" where the original pattern basins disappear entirely, and the data will therefore be poorly modelled. Taking $\alpha = p/N \sim 0.1$ (roughly $k = 8$), Fig. 1 of Kanter & Sompolinsky (1987) predicts $\beta_m \approx 1.5$ for the theoretical melting transition of the pattern basins. Interestingly, this is quite close to our observed peak near $\beta = 2$. Note also that as $k$ is increased, the generative performance is sustained at lower temperatures. Conventional RBM training, discussed below, requires significantly more computation.
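The one-shot specification is only a few matrix operations once the patterns are chosen. A minimal sketch for the simplest $k = 1$ case (class means only; toy random data in place of MNIST, so names and sizes are illustrative — the sub-pattern variant would cluster each class first, e.g. with agglomerative clustering, before taking per-cluster means):

```python
import numpy as np

def hopfield_rbm_weights(samples, labels, classes):
    """One-shot Hopfield RBM weights: binarized class means (Eq. 14), then QR (Eq. 2)."""
    patterns = []
    for c in classes:
        mean = samples[labels == c].mean(axis=0)
        patterns.append(np.where(mean >= 0, 1.0, -1.0))  # sgn of the class mean
    xi = np.stack(patterns, axis=1)                      # N x p pattern matrix
    Q, _ = np.linalg.qr(xi)                              # orthogonalized patterns = RBM weights
    return Q

rng = np.random.default_rng(4)
samples = rng.choice([-1.0, 1.0], size=(60, 16))         # 60 toy "images" of dimension 16
labels = np.repeat(np.arange(3), 20)                     # 3 toy classes
Q = hopfield_rbm_weights(samples, labels, classes=range(3))
assert np.allclose(Q.T @ Q, np.eye(3))
```

No gradient steps are involved, which is why this specification is so much cheaper than the conventional training described next.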

3.1.2. CONVENTIONAL RBM TRAINING

RBM training is performed through gradient ascent on the log-likelihood of the data, $\mathcal{L}(\theta) = \frac{1}{M} \sum_a \ln p_\theta(s_a)$ (equivalent here to minimizing the KL divergence, as mentioned above). We focus here on the weights $W$ in order to compare to the Hopfield RBM weights, and so we neglect the biases on both layers. As is common (Hinton, 2012), we approximate the total gradient by splitting the training dataset into "mini-batches", denoted $B$. The resulting gradient ascent rule for the weights is (see Appendix E)
$$W_{i\mu}^{t+1} = W_{i\mu}^t + \eta \Big( \Big\langle \sum_j W_{j\mu} s_i^a s_j^a \Big\rangle_{a \in B} - \langle s_i \lambda_\mu \rangle_{model} \Big), \quad (15)$$
where $\langle s_i \lambda_\mu \rangle_{model} \equiv Z^{-1} \sum_{\{s\}} \int s_i \lambda_\mu e^{-\beta H(s, \lambda)} \prod_\mu d\lambda_\mu$ is an average over the model distribution. The first bracketed term of Eq. (15) is simple to calculate at each iteration of the weights. The second term, however, is intractable, as it requires one to calculate the partition function $Z$. We instead approximate it using contrastive divergence (CD-K) (Carreira-Perpinan & Hinton, 2005; Hinton, 2012); see Appendix E for details. Each full step of RBM weight updates involves $O(KBNp)$ operations (Melchior et al., 2017). Training generally involves many mini-batch iterations, such that the entire dataset is iterated over (one epoch) many times. In our experiments we train for 50 epochs with mini-batches of size 100 ($3 \cdot 10^5$ weight updates), so the overall training time can be extensive compared to the one-shot Hopfield approach presented above. For further details on RBM training see e.g. Hinton (2012); Melchior et al. (2017). In Fig. 3, we give an example of the Hopfield RBM weights (for $k = 1$), as well as how they evolve during conventional RBM training. Note that Fig. 3(a), (b) appear qualitatively similar, suggesting that the proposed initialization $Q$ from Eqs. (2), (14) may be near a local optimum of the objective. Despite being a common choice, the random initialization trains surprisingly slowly, taking roughly 40 epochs in Fig. 4(a), and in Fig. 4(b) we had to increase the basal learning rate $\eta_0 = 10^{-4}$ by a factor of 5 for the first 25 epochs due to slow training. The non-random initializations, by comparison, arrive at the same maximum value much sooner. The relatively small change over training for the Hopfield-initialized weights supports the idea that they may be near a local optimum of the objective, and that conventional training may simply be mildly tuning them (Fig. 3). That the HN initialization performs well at 0 epochs suggests that the $p$ Hopfield patterns concisely summarize the dataset. This is intuitive, as the projection rule encodes the patterns (and nearby states) as high-probability basins in the free energy landscape of Eq. (1). As the data itself is clustered near the patterns, these basins should model the true data distribution well. Overall, our results suggest that the HN correspondence provides a useful initialization for generative modelling with binary-gaussian RBMs, displaying excellent performance with minimal training.
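A single CD-1 weight update for a binary-gaussian RBM can be sketched as follows. This is a minimal illustration of Eq. (15) under simplifying assumptions (it sets $\beta = 1$, zero biases, unit-variance gaussian hiddens, and toy sizes; it is not the paper's CD-20 implementation):

```python
import numpy as np

rng = np.random.default_rng(5)
N, p, batch_size, eta = 16, 4, 8, 1e-2
W = 0.01 * rng.standard_normal((N, p))
batch = rng.choice([-1.0, 1.0], size=(batch_size, N))          # visible mini-batch

# positive phase: gaussian hiddens have conditional mean <lambda | s> = W^T s
lam_data = batch @ W                                           # (batch_size, p)

# negative phase (one Gibbs step): sample hiddens, then visibles, then hidden means
lam_sample = lam_data + rng.standard_normal(lam_data.shape)    # gaussian hidden samples
field = lam_sample @ W.T                                       # input to each visible unit
prob_plus = 1.0 / (1.0 + np.exp(-2.0 * field))                 # P(s_i = +1 | lambda)
s_model = np.where(rng.random(field.shape) < prob_plus, 1.0, -1.0)
lam_model = s_model @ W

# Eq. (15): data term <sum_j W_jmu s_i s_j> minus model term, batch-averaged
grad = (batch.T @ lam_data - s_model.T @ lam_model) / batch_size
W = W + eta * grad
```

CD-K simply repeats the Gibbs step K times before computing the model term; the data term uses `batch.T @ lam_data`, which equals the first bracketed average in Eq. (15) since `lam_data[a] = W.T s_a`.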

3.2. CLASSIFICATION OBJECTIVE

As with the generative objective, we find that the Hopfield initialization provides an advantage for classification tasks. Here we consider the closely related MNIST classification problem. The goal is to train a model on the MNIST training dataset which accurately predicts the class of presented images. The key statistic is the number of misclassified images on the MNIST testing dataset. We found relatively poor classification results with the single (large) RBM architecture from the preceding Section 3.1. Instead, we use a minimal product-of-experts (PoE) architecture as described in Hinton (2002): the input data is first passed to 10 RBMs, one for each class $\mu$. This "layer of RBMs" functions as a pre-processing layer which maps the high dimensional sample $s$ to a feature vector $f(s) \in \mathbb{R}^{10}$. This feature vector is then passed to a logistic regression layer in order to predict the class of $s$. The RBM layer and the classification layer are trained separately. The first step is to train the RBM layer to produce useful features for classification. As in Hinton (2002), each small RBM is trained to model the distribution of samples from a specific digit class $\mu$. We use CD-20 generative training as in Section 3.1, with the caveat that each expert is trained solely on examples from its respective class. Each RBM connects $N$ binary visible units to $k$ gaussian hidden units, and becomes an "expert" at generating samples from one class. To focus on the effect of inter-layer weight initialization, we set the layer biases to 0. After generative training, each expert should assign relatively high probability $p_\theta^{(\mu)}(s_a)$ to sample digits $s_a$ of the corresponding class $\mu$, and lower probability to digits from other classes. This idea is used to define 10 features, one from each expert, based on the log-probability of a given sample under each expert, $\ln p_\theta^{(\mu)}(s_a) = -\beta H^{(\mu)}(s_a) - \ln Z^{(\mu)}$.
Note that $\beta$ and $\ln Z^{(\mu)}$ are constants with respect to the data and thus irrelevant for classification. For a binary-gaussian RBM, $H^{(\mu)}(s_a)$ has the simple form Eq. (10), so the features we use are
$$f^{(\mu)}(s) = \sum_i \sum_j \sum_\nu W_{i\nu}^{(\mu)} W_{j\nu}^{(\mu)} s_i s_j = \|s^T W^{(\mu)}\|^2. \quad (16)$$
With the feature map defined, we then train a standard logistic regression classifier (using scikit-learn (Pedregosa et al., 2011)) on these features. In Fig. 5, we report the classification error on the MNIST testing set of 10,000 images (held out during both generative and classification training). Note that the size $p = 10$ of the feature vector is independent of the hidden dimension $k$ of each RBM, so the classifier is very efficient. Despite this relatively simple approach, the PoE initialized using the orthogonalized Hopfield patterns ("Hopfield PoE") performs fairly well (Fig. 5, blue curves), especially as the number of sub-patterns is increased. We found that generative training beyond 50 epochs did not significantly improve performance for the projection HN or PCA (in Fig. E.1, we train to 100 epochs and also display the aforementioned "Hebbian" initial condition, which performs much worse for classification). Intuitively, increasing the number of hidden units $k$ increases classification performance independent of weight initialization (with sufficient training).
For fixed $k$, the Hopfield initialization provides a significant benefit to classification performance compared to the randomly initialized weights (purple curves). For few sub-patterns (circles, $k = 10$, and squares, $k = 20$), the Hopfield-initialized models perform best without additional training and up to 1 epoch, after which PCA (pink) performs better. When each RBM has $k = 100$ hidden features (triangles), the Hopfield and PCA PoE reach 3.0% error, whereas the randomly initialized PoE reaches 4.5%. However, the Hopfield PoE performs much better than PCA with minimal training, and maintains its advantage until 10 epochs, after which they are similar. Interestingly, both the Hopfield and PCA initialized PoE with just $k = 10$ encoded patterns perform better than or equal to the $k = 100$ randomly initialized PoE at each epoch, despite having an order of magnitude fewer trainable parameters. Finally, without any generative training (0 epochs), the $k = 100$ Hopfield PoE performs slightly better (4.4%) than the $k = 100$ randomly initialized PoE with 50 epochs of training.
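The feature map of Eq. (16) is a single matrix product per expert. A minimal sketch with random stand-in expert weights (illustrative sizes; real experts would be the trained per-digit RBMs):

```python
import numpy as np

rng = np.random.default_rng(6)
N, k, n_experts = 16, 4, 10
experts = [rng.standard_normal((N, k)) for _ in range(n_experts)]   # W^(mu) for each expert
s = rng.choice([-1.0, 1.0], size=N)                                 # one binarized sample

# f^(mu)(s) = || s^T W^(mu) ||^2, the data-dependent part of each expert's log-probability
features = np.array([np.sum((s @ W) ** 2) for W in experts])

assert features.shape == (n_experts,)
```

The resulting 10-dimensional vector is what the logistic regression layer consumes, regardless of each expert's hidden dimension $k$.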

4. DISCUSSION

We have presented an explicit, exact mapping from projection rule Hopfield networks to Restricted Boltzmann Machines with binary visible units and gaussian hidden units. This generalizes previous results which considered uncorrelated patterns (Barra et al., 2012) or special cases of correlated patterns (Agliari et al., 2013; Mézard, 2017). We provide an initial characterization of the reverse map from RBMs to HNs, along with a matrix-factorization approach to construct approximate associated HNs when the exact reverse map is not possible. Importantly, our HN to RBM mapping applies to the correlated patterns found in real-world datasets. As a result, we are able to conduct experiments (Section 3) on the MNIST dataset which suggest the mapping provides several advantages. The conversion of an HN to an equivalent RBM has practical utility: it trades simplicity of presentation for faster processing. The weight matrix of the RBM is potentially much smaller than that of the HN ($Np$ elements instead of $N(N-1)/2$). More importantly, proper sampling of stochastic trajectories in HNs requires asynchronous updates of the units, whereas RBM dynamics can be simulated in a parallelizable, layer-wise fashion. We also utilized the mapping to efficiently estimate the partition function of the Hopfield network (Fig. 2) by summing out the spins after representing it as an RBM. This mapping also has another practical utility: when used as an RBM weight initialization, the HN correspondence enables efficient training of generative models (Section 3.1, Fig. 4). RBMs initialized with random weights and trained for a moderate amount of time perform worse than RBMs initialized to orthogonalized Hopfield patterns and not trained at all. Further, with mild training of just a few epochs, Hopfield RBMs outperform conventionally initialized RBMs trained several times longer.
The mapping-based initialization also shows advantages over alternative non-random initializations (PCA and the "Hebbian" Hopfield mapping) during early training. By leveraging this advantage for generative tasks, we show that the correspondence can also be used to improve classification performance (Section 3.2, Fig. 5, Appendix E.3). Overall, the RBM initialization revealed by the mapping allows for smaller models which perform better despite shorter training time (for instance, using fewer hidden units to achieve similar classification performance). Reducing the size and training time of models is critical, as more realistic datasets (e.g. gene expression data from single-cell RNA sequencing) may require orders of magnitude more visible units. For generative modelling of such high-dimensional data, our proposed weight initialization based on orthogonalized Hopfield patterns could be of practical use. Our theory and experiments are a proof of principle; if they can be extended to the large family of deep architectures built upon RBMs, such as deep belief networks (Hinton et al., 2006) and deep Boltzmann machines (Salakhutdinov & Hinton, 2009), it would be of great benefit. This will be explored in future work. More broadly, exposing the relationship between RBMs and their representative HNs helps to address the infamous interpretability problem of machine learning, which criticizes trained models as "black boxes". HNs are relatively transparent models, where the role of the patterns as robust dynamical attractors is theoretically well understood. We believe this correspondence, along with future work to further characterize the reverse map, will be especially fruitful for explaining the performance of deep architectures constructed from RBMs.

Fionn Murtagh and Pedro Contreras. Algorithms for hierarchical clustering: an overview. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 2(1):86-97, 2012. doi: 10.1002/widm.53.

A HOPFIELD NETWORK DETAILS

Consider $p < N$ $N$-dimensional binary patterns $\{\xi^\mu\}_{\mu=1}^p$ that are to be "stored". From them, construct the $N \times p$ matrix $\xi$ whose columns are the $p$ patterns. If they are mutually orthogonal (e.g. randomly sampled patterns in the large $N \to \infty$ limit), then choosing interactions according to the Hebbian rule, $J_{Hebb} = \frac{1}{N} \xi \xi^T$, guarantees that they will all be stable minima of $H(s) = -\frac{1}{2} s^T J s$, provided $\alpha \equiv p/N < \alpha_c$, where $\alpha_c \approx 0.14$ (Amit et al., 1985). If they are not mutually orthogonal (referred to as correlated), then using the "projection rule" $J_{Proj} = \xi(\xi^T \xi)^{-1} \xi^T$ guarantees that they will all be stable minima of $H(s)$, provided $p < N$ (Kanter & Sompolinsky, 1987; Personnaz et al., 1986). Note $J_{Proj} \to J_{Hebb}$ in the limit of orthogonal patterns. In the main text, we use $J$ as shorthand for $J_{Proj}$. We provide some relevant notation from Kanter & Sompolinsky (1987). Define the overlap of a state $s$ with the $p$ patterns as $m(s) \equiv \frac{1}{N} \xi^T s$, and define the projection of a state $s$ onto the $p$ patterns as $a(s) \equiv (\xi^T \xi)^{-1} \xi^T s = N A^{-1} m(s)$. Note $A \equiv \xi^T \xi$ is the overlap matrix, and $m_\mu, a_\mu \in [-1, 1]$. We can re-write the projection rule Hamiltonian Eq. (1) as
$$H(s) = -\frac{N}{2} m(s) \cdot a(s). \quad (A.1)$$
For simplicity, we include the self-interactions rather than keeping track of their omission; the results are the same as $N \to \infty$. From Eq. (A.1), several quadratic forms can be written depending on which variables one wants to work with:
I. $H(s) = -\frac{N^2}{2} m^T (\xi^T \xi)^{-1} m$
II. $H(s) = -\frac{1}{2} a^T (\xi^T \xi) a$
These are the starting points for the alternative Boltzmann Machines (i.e. not RBMs) presented in Appendix C.

B MAPPINGS

We used the QR factorization for the HN to RBM mapping in the main text. However, one can use any decomposition which satisfies
$$J_{Proj} = UU^T \quad (B.1)$$
such that $U \in \mathbb{R}^{N \times p}$ is orthogonal (orthogonal for tall matrices means $U^T U = I$). In that case, $U$ becomes the RBM weights. We provide two simple alternatives below, and show they are all part of the same family of orthogonal decompositions.

"Square root" decomposition: Define the matrix $K \equiv \xi(\xi^T \xi)^{-1/2}$. Note that $K$ is orthogonal, and that $J_{Proj} = KK^T$.

Singular value decomposition: More generally, consider the SVD of the pattern matrix $\xi$:
$$\xi = U \Sigma V^T \quad (B.2)$$
where $U \in \mathbb{R}^{N \times p}$, $V \in \mathbb{R}^{p \times p}$ store the left and right singular vectors (respectively) of $\xi$ as orthogonal columns, and $\Sigma \in \mathbb{R}^{p \times p}$ is diagonal and contains the singular values of $\xi$. Note in the limit of orthogonal patterns, we have $\Sigma = \sqrt{N} I$. This decomposition gives several relations for quantities of interest:
$$A \equiv \xi^T \xi = V \Sigma^2 V^T$$
$$J_{Hebb} \equiv \frac{1}{N} \xi \xi^T = \frac{1}{N} U \Sigma^2 U^T$$
$$J_{Proj} \equiv \xi(\xi^T \xi)^{-1} \xi^T = UU^T. \quad (B.3)$$
The last line is simply the diagonalization of $J_{Proj}$, and shows that our RBM mapping is preserved if we swap the $Q$ from QR with the $U$ from SVD. However, since $J_{Proj}$ has $p$ degenerate eigenvalues $\sigma^2 = 1$, $U$ is not unique: any orthogonal basis for the 1-eigenspace can be chosen. Thus $\tilde{U} = UO$ with $O$ orthogonal is also valid, and the QR decomposition and "square root" decomposition correspond to particular choices of $O$.
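Both alternative decompositions can be verified numerically. A minimal sketch (illustrative sizes; the inverse square root is computed via an eigendecomposition of $\xi^T\xi$):

```python
import numpy as np

rng = np.random.default_rng(7)
N, p = 50, 5
xi = rng.choice([-1.0, 1.0], size=(N, p))
J_proj = xi @ np.linalg.inv(xi.T @ xi) @ xi.T

# SVD: xi = U Sigma V^T with orthogonal U (N x p), so J_proj = U U^T
U, sigma, Vt = np.linalg.svd(xi, full_matrices=False)
assert np.allclose(J_proj, U @ U.T)

# "square root" decomposition: K = xi (xi^T xi)^{-1/2}, via eigendecomposition of A = xi^T xi
evals, evecs = np.linalg.eigh(xi.T @ xi)
A_inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
K = xi @ A_inv_sqrt
assert np.allclose(J_proj, K @ K.T)     # same projector
assert np.allclose(K.T @ K, np.eye(p))  # K is orthogonal (tall-matrix sense)
```

Both `U` and `K` (and `Q` from QR) are orthogonal bases for the same column space, differing only by a $p \times p$ orthogonal rotation $O$.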

C HN TO BM MAPPINGS USING ALTERNATIVE REPRESENTATIONS OF THE HOPFIELD HAMILTONIAN

In addition to the orthogonalized representation in the main text, there are two natural representations to consider based on the pattern overlaps and projections introduced in Appendix A. These lead to generalized Boltzmann Machines (BMs) consisting of N original binary spins and p continuous variables. These representations are not RBMs as the continuous variables interact with each other. We present them for completeness. 



Figure 2: Hopfield RBM generative performance as a function of β for varying numbers of encoded sub-patterns k per digit. The number of hidden units in each RBM is p = 10k, corresponding to the total number of encoded patterns. ln Z is computed using annealed importance sampling (AIS) (Neal, 2001) on the continuous representation of Z, Eq. (7), with 500 chains for 1000 steps (see Appendix E). Each curve displays the mean of three runs.

Figure 3: Binary-gaussian RBM weights for p = 10 hidden units prior to and during generative training. (a) Initial values of the columns of W (specified as the orthogonalized Hopfield patterns via Eqs. (2), (14)). (b) Same columns of W after 50 epochs of CD-20 training (see Fig. 4(a)).

Figure 4: Generative performance of binary-gaussian RBMs trained with (a) p = 10 and (b) p = 50 hidden units. Curves are colored according to the choice of weight initialization (see legend in (b), and further detail in the preceding text). Each curve shows the mean and standard deviation over 5 runs. The inset in (a) details the first two epochs. We compute ln p_θ as in Fig. 2, but with 100 AIS chains. The learning rate is η_0 = 10^{-4}, except for the first 25 epochs of the randomly initialized weights in (b), where we used η = 5η_0 due to slow training. The mini-batch size is B = 100 for all curves in (b) and the purple curve in (a), and B = 1000 otherwise. (c), (d) Samples from two RBMs from (b) (projection HN and random) after 15 epochs, generated by initializing the visible state to an example image from the desired class and performing 20 RBM updates with β = 2. Training parameters: β = 2, and CD-20.

Figure 5: Product-of-experts classification performance for the various weight initializations. For each digit model (expert), we perform CD-20 training according to Eq. (15) (as in Fig. 4) for a fixed number of epochs. A given sample image is mapped to the 10-dimensional feature vector with elements Eq. (16). These features are used to train a logistic regression classifier, and the average MNIST test set errors are reported. The initializations considered for each expert's weights W

"Overlap" Boltzmann Machine: Writing H(s) = -(N²/2) m^T (ξ^T ξ)^{-1} m, we have

H({s_i}, {λ_µ}) = (1/2N) Σ_{µ,ν} (ξ^T ξ)_{µν} λ_µ λ_ν - (1/√N) Σ_{µ,i} ξ_{iµ} s_i λ_µ. (C.3)

This is the analog of Eq. (6) in the main text for the "overlap" representation. Note we can also sum out the binary variables in Eq. (C.2), which allows for an analogous expression to Eq. (8),

F_0({λ_µ}) = (1/2N) Σ_{µ,ν} (ξ^T ξ)_{µν} λ_µ λ_ν - (1/β) Σ_i ln cosh(β/√N Σ_µ ξ_{iµ} λ_µ). (C.4)

To avoid the difficulty of choosing a good initial condition X_0 (see Appendix D.3), we consider the case of Hopfield-initialized training. At epoch zero we have in fact the exact solution X* = R (from the QR decomposition, Eq. (2)), which recovers the encoded binary patterns ξ. We may use X_0 = R as an informed initial condition for Eq. (D.7), to approximately binarize the weights at later epochs and monitor how the learned patterns change. In Fig. D.1, we give an example of this approximate reverse mapping for a Hopfield-initialized RBM following generative training (Fig. 4). Fig. D.1(a) shows the p = 10 encoded binary patterns, denoted below by ξ_0, and Fig. D.1(b) shows the approximate reverse mapping applied to the RBM weights at epoch 10. We denote these nearly binary patterns by ξ_10. Interestingly, some of the non-binary regions in Fig. D.1(b) coincide with features that distinguish the respective pattern, for example the strongly "off" area in the top-right of the "six" pattern.
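The "overlap" representation can be sanity-checked by brute force on a small system: summing the Boltzmann weight of the BM Hamiltonian Eq. (C.3) over all 2^N spin states should reproduce 2^N e^{-βF_0(λ)} with F_0 as in Eq. (C.4). A sketch (toy sizes, NumPy assumed):

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(2)
N, p, beta = 8, 2, 1.0
xi = rng.choice([-1.0, 1.0], size=(N, p))
XtX = xi.T @ xi
lam = rng.normal(size=p)  # arbitrary configuration of the continuous units

def H_bm(s, lam):
    # Eq. (C.3): (1/2N) lam^T (xi^T xi) lam - (1/sqrt(N)) sum_{mu,i} xi_{i mu} s_i lam_mu
    return lam @ XtX @ lam / (2 * N) - s @ (xi @ lam) / np.sqrt(N)

# Brute-force sum over all 2^N binary spin states
Z_s = sum(np.exp(-beta * H_bm(np.array(s), lam))
          for s in product([-1.0, 1.0], repeat=N))

# Eq. (C.4): the effective free energy after summing out the spins
F0 = lam @ XtX @ lam / (2 * N) \
     - np.sum(np.log(np.cosh(beta * xi @ lam / np.sqrt(N)))) / beta

assert np.isclose(Z_s, 2 ** N * np.exp(-beta * F0))
```
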

Figure D.1: Reverse mapping example. (a) The p = 10 encoded binary patterns used to initialize the RBM in Fig. 4(a). (b) The approximate reverse mapping applied to the RBM weights after 10 epochs of CD-k training. Parameters: α = 200 and γ = 0.05.


ACKNOWLEDGMENTS

The authors thank Duncan Kirby and Jeremy Rothschild for helpful comments and discussions. This work is supported by the Natural Sciences and Engineering Research Council of Canada (NSERC) through Discovery Grant RGPIN 402591 to A.Z. and a CGS-D Graduate Fellowship to M.S.


We can further decouple the interactions between the λ_µ variables by a second gaussian transformation of Eq. (C.2), introducing auxiliary variables τ_ν. The resulting Eq. (C.5) describes a three-layer RBM with complex interactions between the λ_µ and τ_ν variables, a representation which could be useful in some contexts.

"Projection" Boltzmann Machine: Proceeding as above but for H(s) = -(1/2) a^T (ξ^T ξ) a, an analogous gaussian decoupling yields the corresponding BM Hamiltonian, along with an expression analogous to Eq. (8).

D REVERSE MAPPING

D.1 INTEGRATING OUT THE HIDDEN VARIABLES

The explanation from Mehta et al. (2019) for integrating out the hidden variables of an RBM is presented here for completeness. For a given binary-gaussian RBM defined by H_RBM(s, λ) (as in Eq. (9)), we have p(s) = Z^{-1} ∫ e^{-βH_RBM(s,λ)} dλ. Consider also that p(s) = Z^{-1} e^{-βH(s)} for some unknown H(s). Equating these expressions gives

-βH(s) = ln ∫ e^{-βH_RBM(s,λ)} dλ. (D.1)

Decompose the argument of ln(·) in Eq. (D.1) by defining q_µ(λ_µ), a gaussian with zero mean and variance β^{-1}. Writing t_µ = β Σ_i W_{iµ} s_i, one observes that the second term (up to a constant) is a sum of cumulant generating functions, i.e.

K_µ(t_µ) ≡ ln ⟨e^{t_µ λ_µ}⟩_{q_µ} = ln ∫ q_µ(λ_µ) e^{t_µ λ_µ} dλ_µ. (D.2)

These can be written as a cumulant expansion, K_µ(t_µ) = Σ_{n≥1} κ_µ^{(n)} t_µ^n / n!, where κ_µ^{(n)} ≡ ∂^n K_µ / ∂t_µ^n |_{t_µ=0} is the n-th cumulant of q_µ. However, since q_µ(λ_µ) is gaussian, only the second cumulant remains, leaving K_µ(t_µ) = t_µ² / (2β). Putting this all together, we have

H(s) = -(1/2) Σ_{i,j} [Σ_µ W_{iµ} W_{jµ}] s_i s_j = -(1/2) s^T W W^T s, (D.3)

up to a constant. Note that in general, q_µ(λ_µ) need not be gaussian, in which case the resultant Hamiltonian H(s) can have higher order interactions (expressed via the cumulant expansion).
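The gaussian integration-out can be verified numerically on a toy system by comparing a grid integration of ∫ e^{-βH_RBM(s,λ)} dλ against (2π/β)^{p/2} e^{(β/2) s^T W W^T s}. A sketch (p = 2 hidden units, NumPy assumed):

```python
import numpy as np

rng = np.random.default_rng(3)
N, p, beta = 6, 2, 1.0
W = 0.3 * rng.normal(size=(N, p))
s = rng.choice([-1.0, 1.0], size=N)
t = s @ W  # conditional means of the two hidden units

# Numeric marginalization over the p = 2 gaussian hidden units on a grid
grid = np.linspace(-8.0, 8.0, 401)
h = grid[1] - grid[0]
l1, l2 = np.meshgrid(grid, grid, indexing="ij")
H_rbm = 0.5 * (l1 ** 2 + l2 ** 2) - t[0] * l1 - t[1] * l2
num = np.exp(-beta * H_rbm).sum() * h * h

# Analytic result: (2 pi / beta)^{p/2} exp(-beta H(s)),
# with the effective Hamiltonian H(s) = -(1/2) s^T W W^T s
analytic = (2 * np.pi / beta) ** (p / 2) * np.exp(0.5 * beta * s @ W @ W.T @ s)
assert np.isclose(num, analytic, rtol=1e-4)
```
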

D.2 APPROXIMATE REVERSE MAPPING

Suppose one has a trial solution B_p = W X to the approximate binarization problem Eq. (12), with error matrix E ≡ B_p - sgn(B_p). We consider two cases, depending on whether W is orthogonal.

Case 1: If W is orthogonal, then applying Eq. (13) and using I = W^T W (so that B_p^T B_p = X^T W^T W X = X^T X), we get

J ≡ B_p (B_p^T B_p)^{-1} B_p^T = W X (X^T X)^{-1} X^T W^T = W W^T. (D.4)

Thus, the interaction matrix between the visible units is the familiar projection rule used to store the approximately binary patterns B_p. "Storage" means the patterns are stable fixed points of the deterministic update rule s_{t+1} ≡ sgn(J s_t). We cannot initialize the network to a non-binary state; therefore, the columns of B ≡ sgn(B_p) are our candidate patterns. To test if they are fixed points, consider

J B = J (B_p - E) = B_p - J E,

using J B_p = B_p. We need the error E to be such that J E does not alter the sign of B_p. A sufficient condition is a small error, |(J E)_{iµ}| < |(B_p)_{iµ}|. When this holds for each element, we have sgn(J B) = sgn(B_p) = B, so that the candidate patterns B are fixed points. It remains to show that they are also stable (i.e. minima).

Case 2: If W is not orthogonal but its singular values remain close to one, then Löwdin orthogonalization (also known as symmetric orthogonalization) (Löwdin, 1970) provides a way to preserve the HN mapping from Case 1 above. Consider the SVD of W: W = U Σ V^T. The closest matrix to W (in terms of Frobenius norm) with orthogonal columns is L = U V^T, and the approximation W ≈ L is called the Löwdin orthogonalization of W. Note the approximation becomes exact when all the singular values are one. We then write W W^T ≈ U U^T, and the orthogonal-W approach of Case 1 can then be applied. On the other hand, W may be strongly non-orthogonal (singular values far from one). If it is still full rank, then its pseudo-inverse W^+ ≡ (W^T W)^{-1} W^T exists. Repeating the steps from the orthogonal case, we arrive at the corresponding result, Eq. (D.5), which is analogous to the projection rule but with a "correction factor" C. However, it is not immediately clear how C affects pattern storage.
Given the resemblance between J Proj and Eq. (D.4) (relative to Eq. (D.5)), we expect that RBMs trained with an orthogonality constraint on the weights may be more readily mapped to HNs.
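Löwdin orthogonalization is simple to sketch numerically. In this toy example (NumPy assumed), the weights are constructed with singular values near one, and L = U V^T is checked to have orthogonal columns and to approximately reproduce W W^T:

```python
import numpy as np

rng = np.random.default_rng(4)
N, p = 50, 5

# Weights with singular values near (but not exactly) one
U0, _, Vt0 = np.linalg.svd(rng.normal(size=(N, p)), full_matrices=False)
W = U0 @ np.diag(1.0 + 0.05 * rng.normal(size=p)) @ Vt0

# Loewdin (symmetric) orthogonalization: L = U V^T from the SVD of W
U, S, Vt = np.linalg.svd(W, full_matrices=False)
L = U @ Vt

assert np.allclose(L.T @ L, np.eye(p))  # orthogonal columns
# W W^T is close to L L^T = U U^T when the singular values are near one
rel = np.linalg.norm(W @ W.T - L @ L.T) / np.linalg.norm(W @ W.T)
assert rel < 0.5
```
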

D.3 EXAMPLE OF THE APPROXIMATE REVERSE MAPPING

In the main text we introduced the approximate binarization problem Eq. (12), the solutions of which provide approximately binary patterns through Eq. (13). To numerically solve Eq. (12) and obtain a candidate solution X*, we perform gradient descent on a differentiable variant. Specifically, we replace sgn(u) with tanh(αu) for large α. Define E = W X - tanh(αW X) as in the main text. Then the derivative of the "softened" Eq. (12) with respect to X is

∂/∂X ||E||²_F = 2 W^T [(1 - α(1 - tanh²(αW X))) ∘ E], (D.6)

where ∘ denotes the elementwise product. Given an initial condition X_0, we apply the update rule

X_{t+1} = X_t - γ ∂/∂X ||E||²_F |_{X_t} (D.7)

until convergence to a local minimum X*. In the absence of prior information, we consider randomly initialized X_0. Our preliminary experiments using Eq. (D.7) to binarize arbitrary RBM weights W have generally led to high binarization error. This is due in part to the difficulty of choosing a good initial condition X_0, which will be explored in future work.

Published as a conference paper at ICLR 2021

The results in Fig. D.2 suggest that p = 10 patterns may be too crude for the associative memory network to be used as an accurate MNIST classifier (as compared to e.g. Fig. 5). The HN constructed from ξ_10 performs about 3% worse than the HN constructed from ξ_0, although this is a less sophisticated classifier. There may therefore be a cost, in terms of associative memory performance, to the generative functionality of such models (Fig. 4). Our results from Appendix D.2 indicate that incorporating an orthogonality constraint on the weights during training may help preserve associative memory functionality. This will be explored in future work.
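The softened gradient descent described above can be sketched as follows (NumPy assumed). Here the toy weights are constructed from the QR decomposition of random binary patterns so that an exactly binarizable solution exists, mirroring the Hopfield-initialized case with informed initial condition X_0 = R:

```python
import numpy as np

rng = np.random.default_rng(5)
N, p = 50, 5
alpha, gamma = 200.0, 0.05  # as in the caption of Fig. D.1

# Toy weights that are exactly binarizable: W R = xi with xi binary
xi = rng.choice([-1.0, 1.0], size=(N, p))
W, R = np.linalg.qr(xi)

def grad(X):
    # gradient of ||W X - tanh(alpha W X)||_F^2 with respect to X
    Y = W @ X
    T = np.tanh(alpha * Y)
    return 2.0 * W.T @ ((Y - T) * (1.0 - alpha * (1.0 - T ** 2)))

X = R + 0.01 * rng.normal(size=(p, p))  # informed initial condition X0 ~ R
for _ in range(200):
    X = X - gamma * grad(X)

# The recovered patterns W X are nearly binary and match xi
B = W @ X
err = np.linalg.norm(B - np.sign(B)) / np.sqrt(N * p)
assert err < 1e-3
assert np.array_equal(np.sign(B), xi)
```
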

E RBM TRAINING

Consider a general binary-gaussian RBM with N visible and p hidden units. The energy function is

H(s, λ) = (1/2) Σ_µ λ_µ² - Σ_{i,µ} W_{iµ} s_i λ_µ.

First, we note the Gibbs-block update distributions for sampling one layer of a binary-gaussian RBM given the other (see e.g. Melchior et al. (2017)):

Hidden units: λ_µ | s ~ N(Σ_i W_{iµ} s_i, β^{-1}),
Visible units: p(s_i = +1 | λ) = [1 + e^{-2β Σ_µ W_{iµ} λ_µ}]^{-1}.

For completeness, we re-derive the binary-gaussian RBM weight updates, along the lines of Melchior et al. (2017). We want to maximize L ≡ (1/M) Σ_a ln p_θ(s_a). The contribution from a single datapoint s_a has the form ln p_θ(s_a) = ln(C^{-1} ∫ e^{-βH(s_a,λ)} dλ) - ln Z, with C ≡ (2π/β)^{p/2}. We are focused on the interlayer weights, with ∂H(s,λ)/∂W_{iµ} = -s_i λ_µ, so

∂ ln p_θ(s_a)/∂W_{iµ} = β [ (∫ s_i^a λ_µ e^{-βH(s_a,λ)} Π_ν dλ_ν) / (∫ e^{-βH(s_a,λ)} Π_ν dλ_ν) - Z^{-1} Σ_{{s}} ∫ s_i λ_µ e^{-βH(s,λ)} Π_ν dλ_ν ].

The first term is straightforward to compute: since λ_µ | s_a has mean Σ_j W_{jµ} s_j^a, it equals s_i^a Σ_j W_{jµ} s_j^a. The second term is intractable and needs to be approximated. We use contrastive divergence (Carreira-Perpinan & Hinton, 2005; Hinton, 2012), estimating it by s̃_i λ̃_µ, where CD-k denotes k steps of the Gibbs-block updates introduced above, from which (s̃, λ̃) comprise the final state. We evaluate both terms over mini-batches of the training data to arrive at the weight update rule Eq. (15).

The first (data-dependent) term of ln p_θ(s_a) can be computed deterministically. ln Z, on the other hand, needs to be estimated. For this we perform Annealed Importance Sampling (AIS) (Neal, 2001) on its continuous representation, Eq. (7). For AIS, one needs to specify the sequence of intermediate distributions, Eq. (E.7), as well as an initial proposal distribution, which we fix to the unit gaussian N(0, I).
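The Gibbs-block updates and a CD-k weight update can be sketched as follows (toy sizes, NumPy assumed; function names are illustrative, not from the paper's code):

```python
import numpy as np

rng = np.random.default_rng(6)
N, p, beta = 20, 4, 1.0
W = 0.1 * rng.normal(size=(N, p))

def sample_hidden(s):
    # lambda_mu | s ~ N(sum_i W_{i mu} s_i, 1/beta)
    return s @ W + rng.normal(size=p) / np.sqrt(beta)

def sample_visible(lam):
    # p(s_i = +1 | lambda) = sigmoid(2 beta sum_mu W_{i mu} lambda_mu)
    prob = 1.0 / (1.0 + np.exp(-2.0 * beta * (W @ lam)))
    return np.where(rng.random(N) < prob, 1.0, -1.0)

def cd_k_update(batch, k=20, eta=1e-4):
    # positive phase: exact conditional mean <lambda_mu | s> = (W^T s)_mu;
    # negative phase: state after k Gibbs-block updates started from the data
    dW = np.zeros_like(W)
    for s0 in batch:
        pos = np.outer(s0, s0 @ W)
        s = s0
        for _ in range(k):
            s = sample_visible(sample_hidden(s))
        neg = np.outer(s, s @ W)
        dW += (pos - neg) / len(batch)
    return W + eta * beta * dW

batch = rng.choice([-1.0, 1.0], size=(10, N))
W = cd_k_update(batch)
assert W.shape == (N, p) and np.all(np.isfinite(W))
```
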

E.3 CLASSIFICATION PERFORMANCE

Here we provide extended data (Fig. E.1) on the classification performance shown in Fig. 5. As in the main text, color denotes the initial condition (introduced in Section 3.1) and shape denotes the number of sub-patterns. Here we train each model for 100 epochs, which allows most of the curves to converge. We also include the "Hebbian" initialization (green curves). Notably, the Hebbian initialization performs quite poorly on the classification task (as compared to the direct generative objective, Fig. 4). In particular, for the 100 sub-patterns case, where the projection HN performs best, the Hebbian curve trains very slowly (still not converged after 100 epochs) and lags behind even the 10 sub-pattern Hebbian curve for most of the training. This emphasizes the benefits of the projection rule HN over the Hebbian HN when the data is composed of correlated patterns, which applies to most real-world datasets.

