QUANTUM DEFORMED NEURAL NETWORKS

Abstract

We develop a new quantum neural network layer designed to run efficiently on a quantum computer, but that can be simulated on a classical computer when restricted in the way it entangles input states. We first ask how a classical neural network architecture, whether fully connected or convolutional, can be executed on a quantum computer using quantum phase estimation. We then deform the classical layer into a quantum design which entangles activations and weights into quantum superpositions. While the full model would need the exponential speedups delivered by a quantum computer, a restricted class of designs represents interesting new classical network layers that still use quantum features. We show that these quantum deformed neural networks can be trained and executed on normal data such as images, and even classically deliver modest improvements over standard architectures.

1. INTRODUCTION

Quantum mechanics (QM) is the most accurate description for physical phenomena at very small scales, such as the behavior of molecules, atoms and subatomic particles. QM has a huge impact on our everyday lives through technologies such as lasers, transistors (and thus microchips), superconductors and MRI. A recent view of QM has formulated it as a (Bayesian) statistical methodology that only describes our subjective view of the (quantum) world, and how we update that view in light of evidence, i.e. measurements ('t Hooft, 2016; Fuchs & Schack, 2013). This is in perfect analogy to the classical Bayesian view, a statistical paradigm extensively used in artificial intelligence where we maintain probabilities to represent our beliefs about events in the world.

The philosophy of this paper is to turn this argument on its head. If we can view QM as just another consistent statistical theory that happens to describe nature at small scales, then we can also use this theory to describe classical signals by endowing them with a Hilbert space structure. In some sense, the 'only' difference with Bayesian statistics is that the positive probabilities are replaced with complex 'amplitudes'. This however has the dramatic effect that, unlike in classical statistics, interference between events now becomes a possibility. In this paper we show that this point of view uncovers new architectures and potential speedups for running neural networks on quantum computers.

We restrict our attention here to binary neural networks. We will introduce a new class of quantum neural networks and interpret them as generalizations of probabilistic binary neural networks, discussing potential speedups from running the models on a quantum computer. Then we will devise classically efficient algorithms to train the networks for a restricted set of quantum circuits.
We present results of classical simulations of the quantum neural networks on real world data sizes and related gains in accuracy due to the quantum deformations. Contrary to almost all other works on quantum deep learning, our quantum neural networks can be simulated for practical classical problems, such as images or sound. The quantum nature of our models is there to increase the flexibility of the model-class and add new operators to the toolbox of the deep learning researcher, some of which may only reach their full potential when quantum computing becomes ubiquitous.

1.1. RELATED WORK

In Farhi & Neven (2018) variational quantum circuits that can be learnt via stochastic gradient descent were introduced. Their performance could be studied only on small input tasks such as classifying 4 × 4 images, due to the exponential memory requirement to simulate those circuits. Other works on variational quantum circuits for neural networks are Verdon et al. (2018) ; Beer et al. (2019) . Their focus is similarly on the implementation on near term quantum devices and these models cannot be efficiently run on a classical computer. Exceptions are models which use tensor network simulations (Cong et al., 2019; Huggins et al., 2019) where the model can be scaled to 8 × 8 image data with 2 classes, at the price of constraining the geometry of the quantum circuit (Huggins et al., 2019) . The quantum deformed neural networks introduced in this paper are instead a class of variational quantum circuits that can be scaled to the size of data that are used in traditional neural networks as we demonstrate in section 4.2. Another line of work directly uses tensor networks as full precision machine learning models that can be scaled to the size of real data (Miles Stoudenmire & Schwab, 2016; Liu et al., 2017; Levine et al., 2017; Levine et al., 2019) . However the constraints on the network geometry to allow for efficient contractions limit the expressivity and performance of the models. See however Cheng et al. (2020) for recent promising developments. Further, the tensor networks studied in these works are not unitary maps and do not directly relate to implementations on quantum computers. A large body of work in quantum machine learning focuses on using quantum computing to provide speedups to classical machine learning tasks (Biamonte et al., 2017; Ciliberto et al., 2018; Wiebe et al., 2014) , culminating in the discovery of quantum inspired speedups in classical algorithms (Tang, 2019) . 
In particular, (Allcock et al., 2018; Cao et al., 2017; Schuld et al., 2015; Kerenidis et al., 2019) discuss quantum simulations of classical neural networks with the goal of improving the efficiency of classical models on a quantum computer. Our models differ from these works in two ways: i) we use quantum wave-functions to model weight uncertainty, in a way that is reminiscent of Bayesian models; ii) we design our network layers in a way that may only reach its full potential on a quantum computer due to exponential speedups, but at the same time can, for a restricted class of layer designs, be simulated on a classical computer and provide inspiration for new neural architectures. Finally, quantum methods for accelerating Bayesian inference have been discussed in Zhao et al. (2019b; a) but only for Gaussian processes while in this work we shall discuss relations to Bayesian neural networks.

2. GENERALIZED PROBABILISTIC BINARY NEURAL NETWORKS

Binary neural networks are neural networks where both weights and activations are binary. Let $\mathbb{B} = \{0, 1\}$. A fully connected binary neural network layer maps the $N_\ell$ activations $h^{(\ell)}$ at level $\ell$ to the $N_{\ell+1}$ activations $h^{(\ell+1)}$ at level $\ell+1$ using weights $W^{(\ell)} \in \mathbb{B}^{N_\ell N_{\ell+1}}$:

$$h^{(\ell+1)}_j = f(W^{(\ell)}, h^{(\ell)})_j = \tau\Big(\frac{1}{N_\ell+1}\sum_{i=1}^{N_\ell} W^{(\ell)}_{j,i} h^{(\ell)}_i\Big), \qquad \tau(x) = \begin{cases} 0 & x < \tfrac{1}{2} \\ 1 & x \ge \tfrac{1}{2}\end{cases}. \quad (1)$$

We divide by $N_\ell + 1$ since the sum can take the $N_\ell + 1$ values $\{0, \dots, N_\ell\}$. We do not explicitly consider biases, which can be introduced by fixing some activations to 1. In a classification model $h^{(0)} = x$ is the input and the last activation function is typically replaced by a softmax which produces output probabilities $p(y|x, W)$, where $W$ denotes the collection of weights of the network. Given $M$ input/output pairs $X = (x_1, \dots, x_M)$, $Y = (y_1, \dots, y_M)$, a frequentist approach would determine the binary weights so that the likelihood $p(Y|X, W) = \prod_{i=1}^M p(y_i|x_i, W)$ is maximized. Here we consider discrete or quantized weights and take the approach of variational optimization (Staines & Barber, 2012), which introduces a weight distribution $q_\theta(W)$ to devise a surrogate differentiable objective. For an objective $O(W)$, one has the bound $\max_{W \in \mathbb{B}^N} O(W) \ge \mathbb{E}_{q_\theta(W)}[O(W)]$, and the parameters of $q_\theta(W)$ are adjusted to maximize the lower bound. In our case we consider the objective:

$$\max_{W \in \mathbb{B}^N} \log p(Y|X, W) \ge \mathcal{L} := \mathbb{E}_{q_\theta(W)}[\log p(Y|X, W)] = \sum_{i=1}^M \mathbb{E}_{q_\theta(W)}[\log p(y_i|x_i, W)]. \quad (2)$$

While the optimal solution to equation 2 is a Dirac measure, one can add a regularization term $R(\theta)$ to keep $q$ soft. In appendix A we review the connection with Bayesian deep learning, where $q_\theta(W)$ is the approximate posterior, $R(\theta)$ is the KL divergence between $q_\theta(W)$ and the prior over weights, and the objective is derived by maximizing the evidence lower bound.
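The variational optimization bound can be checked numerically at toy size. The sketch below (plain NumPy; the quadratic objective and the Bernoulli parameters are hypothetical, not the paper's training setup) enumerates all binary weight vectors and verifies $\max_W O(W) \ge \mathbb{E}_{q_\theta}[O(W)]$ for a factorized $q_\theta$:

```python
import numpy as np
from itertools import product

# Toy check of the variational optimization bound
# max_W O(W) >= E_{q_theta}[O(W)] for a factorized Bernoulli q_theta
# over 3 binary weights. Objective and parameters are illustrative only.
rng = np.random.default_rng(0)

def objective(w):
    # hypothetical objective: negative squared distance to a target pattern
    target = np.array([1, 0, 1])
    return -float(np.sum((w - target) ** 2))

theta = rng.uniform(0.1, 0.9, size=3)        # theta_i = q(W_i = 1)

ws = [np.array(w) for w in product([0, 1], repeat=3)]
probs = [float(np.prod(np.where(w == 1, theta, 1 - theta))) for w in ws]
bound = sum(p * objective(w) for p, w in zip(probs, ws))
best = max(objective(w) for w in ws)

assert abs(sum(probs) - 1.0) < 1e-9          # q_theta is a distribution
assert best >= bound                         # the surrogate lower-bounds the max
```

In practice the expectation is of course not enumerable, and $\theta$ is adjusted by stochastic gradient methods.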
In both the variational Bayes and variational optimization frameworks for binary networks, we have a variational distribution $q(W)$ and probabilistic layers where activations are random variables. We consider an approximate posterior factorized over the layers: $q(W) = \prod_{\ell=1}^L q^{(\ell)}(W^{(\ell)})$. If $h^{(\ell)} \sim p^{(\ell)}$, equation 1 leads to the following recursive definition of distributions:

$$p^{(\ell+1)}(h^{(\ell+1)}) = \sum_{h \in \mathbb{B}^{N_\ell}} \sum_{W \in \mathbb{B}^{N_\ell N_{\ell+1}}} \delta\big(h^{(\ell+1)} - f(W^{(\ell)}, h^{(\ell)})\big)\, p^{(\ell)}(h^{(\ell)})\, q^{(\ell)}(W^{(\ell)}). \quad (3)$$

We use the shorthand $p^{(\ell)}(h^{(\ell)})$ for $p^{(\ell)}(h^{(\ell)}|x)$; the $x$ dependence is understood. The average appearing in equation 2 can be written as an average over the network output distribution:

$$\mathbb{E}_{q_\theta(W)}[\log p(y_i|x_i, W)] = -\mathbb{E}_{p^{(L)}(h^{(L)})}\big[g_i(y_i, h^{(L)})\big], \quad (4)$$

where the function $g_i$ is typically MSE for regression and cross-entropy for classification. In previous works (Shayer et al., 2017; Peters & Welling, 2018), the approximate posterior was taken to be factorized, $q(W^{(\ell)}) = \prod_{ij} q_{i,j}(W^{(\ell)}_{i,j})$, which results in a factorized activation distribution as well: $p^{(\ell)}(h^{(\ell)}) = \prod_i p^{(\ell)}_i(h^{(\ell)}_i)$. (Shayer et al., 2017; Peters & Welling, 2018) used the local reparameterization trick (Kingma et al., 2015) to sample activations at each layer. The quantum neural network we introduce below will naturally give a way to sample efficiently from complex distributions, and in view of that we here generalize the setting: we act with a stochastic matrix $S_\phi(h', W'|h, W)$ which depends on parameters $\phi$ and correlates the weights and the input activations to a layer as follows:

$$\pi_{\phi,\theta}(h', W') = \sum_{h \in \mathbb{B}^{N}} \sum_{W \in \mathbb{B}^{NM}} S_\phi(h', W'|h, W)\, p(h)\, q_\theta(W). \quad (5)$$

To avoid redundancy, we still take $q_\theta(W)$ to be factorized and let $S$ create correlations among the weights as well. The choice of $S$ will be related to the choice of a unitary matrix $D$ in the quantum circuit of the quantum neural network.
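The action of the stochastic matrix $S_\phi$ can be illustrated numerically on the smallest possible joint space, one activation bit and one weight bit (four joint states). In the sketch below (NumPy; the specific probabilities and the random $S$ are illustrative stand-ins, not a learned deformation), a generic column-stochastic $S$ turns a factorized $p(h)q(W)$ into a correlated $\pi$:

```python
import numpy as np

# Toy instance of pi(h', W') = sum_{h,W} S(h',W'|h,W) p(h) q(W):
# one activation bit and one weight bit, so the joint space has 4 states.
rng = np.random.default_rng(1)

p_h = np.array([0.7, 0.3])                   # p(h = 0), p(h = 1)
q_w = np.array([0.4, 0.6])                   # q(W = 0), q(W = 1)
base = np.outer(p_h, q_w).ravel()            # factorized p(h) q(W)

S = rng.random((4, 4))
S /= S.sum(axis=0, keepdims=True)            # each column S(.|h,W) is a distribution
pi = S @ base                                # correlated joint distribution

joint = pi.reshape(2, 2)
mh, mw = joint.sum(axis=1), joint.sum(axis=0)
assert abs(pi.sum() - 1.0) < 1e-9            # pi is still normalized
assert (pi >= 0).all()
# generically, pi no longer factorizes into its marginals
assert not np.allclose(joint, np.outer(mh, mw))
```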
A layer is now made of two operations, $S_\phi$ and the layer map $f$, resulting in the following output distribution:

$$p^{(\ell+1)}(h^{(\ell+1)}) = \sum_{h \in \mathbb{B}^{N_\ell}} \sum_{W \in \mathbb{B}^{N_\ell N_{\ell+1}}} \delta\big(h^{(\ell+1)} - f(W^{(\ell)}, h^{(\ell)})\big)\, \pi^{(\ell)}_{\phi,\theta}(h^{(\ell)}, W^{(\ell)}), \quad (6)$$

which allows one to compute the network output recursively. Both the parameters $\phi$ and $\theta$ will be learned by solving the following optimization problem:

$$\min_{\theta,\phi}\; R(\theta) + R'(\phi) - \mathcal{L}, \quad (7)$$

where $R(\theta)$, $R'(\phi)$ are regularization terms for the parameters $\theta$, $\phi$. We call this model a generalized probabilistic binary neural network, with deformation parameters $\phi$ chosen such that $\phi = 0$ gives back the standard probabilistic binary neural network. To study this model on a classical computer we need to choose an $S$ which leads to an efficient sampling algorithm for $\pi_{\phi,\theta}$. In general one could use Markov chain Monte Carlo, but there exist situations for which the mixing time of the chain grows exponentially in the size of the problem (Levin & Peres, 2017). In the next section we will show how quantum mechanics can enlarge the set of probabilistic binary neural networks that can be efficiently executed, and in the subsequent sections we will show experimental results for a restricted class of correlated distributions inspired by quantum circuits that can be simulated classically.

3. QUANTUM IMPLEMENTATION

Quantum computers can sample from certain correlated distributions more efficiently than classical computers (Aaronson & Chen, 2016; Arute et al., 2019). In this section, we devise a quantum circuit that implements the generalized probabilistic binary neural networks introduced above, encoding $\pi_{\theta,\phi}$ in a quantum circuit. This leads to an exponential speedup for running this model on a quantum computer, opening up the study of more complex probabilistic neural networks. A quantum implementation of a binary perceptron was introduced in Schuld et al. (2015) as an application of the quantum phase estimation algorithm (Nielsen & Chuang, 2000). However, no quantum advantage of the quantum simulation was shown. Here we extend that result in several ways: i) we modify the algorithm to represent the generalized probabilistic layer introduced above, showing the quantum advantage present in our setting; ii) we consider the case of multilayer perceptrons as well as convolutional networks.

3.1. INTRODUCTION TO QUANTUM NOTATION AND QUANTUM PHASE ESTIMATION

As a preliminary step, we introduce notation for quantum mechanics; we refer the reader to Appendix B for a more thorough review. The state of a qubit is a normalized vector $|\psi\rangle \in \mathbb{C}^2$. $N$ qubits are described by unit vectors in $(\mathbb{C}^2)^{\otimes N} \cong \mathbb{C}^{2^N}$, which is spanned by the $N$-bit strings $|b_1, \dots, b_N\rangle \equiv |b_1\rangle \otimes \cdots \otimes |b_N\rangle$, $b_i \in \mathbb{B}$. Quantum circuits are unitary matrices on this space. The probability of a measurement with outcome $\phi_i$ is given by the matrix element of the projector $|\phi_i\rangle\langle\phi_i|$ in a state $|\psi\rangle$, namely $p_i = \langle\psi|\phi_i\rangle\langle\phi_i|\psi\rangle = |\langle\phi_i|\psi\rangle|^2$, a formula known as Born's rule.

Next, we describe quantum phase estimation (QPE), a quantum algorithm that estimates the eigenphases of a unitary $U$. Denote the eigenvalues and eigenvectors of $U$ by $\exp\big(\frac{2\pi i}{2^t}\varphi_\alpha\big)$ and $|v_\alpha\rangle$, and assume that the $\varphi_\alpha$'s can be represented with a finite number $t$ of bits: $\varphi_\alpha = 2^{t-1}\varphi^1_\alpha + \cdots + 2^0\varphi^t_\alpha$. (This is the case of relevance for a binary network.) Then introduce $t$ ancilla qubits in state $|0\rangle^{\otimes t}$. Given an input state $|\psi\rangle$, QPE is the following unitary operation:

$$|0\rangle^{\otimes t} \otimes |\psi\rangle \;\xrightarrow{\text{QPE}}\; \sum_\alpha \langle v_\alpha|\psi\rangle\, |\varphi_\alpha\rangle \otimes |v_\alpha\rangle. \quad (8)$$

Appendix B.1 reviews the details of the quantum circuit implementing this map, whose complexity is linear in $t$. Now, using the notation $\tau$ for the threshold non-linearity introduced in equation 1, and recalling the expansion $2^{-t}\varphi = 2^{-1}\varphi^1 + \cdots + 2^{-t}\varphi^t$, we note that if the first bit $\varphi^1 = 0$ then $2^{-t}\varphi < \frac{1}{2}$ and $\tau(2^{-t}\varphi) = 0$, while if $\varphi^1 = 1$ then $2^{-t}\varphi \ge \frac{1}{2}$ and $\tau(2^{-t}\varphi) = 1$. In other words, $\delta_{\varphi^1, b} = \delta_{\tau(2^{-t}\varphi), b}$, and the probability $p(b)$ that after the QPE the first ancilla bit is $b$ is given by:

$$p(b) = \Big(\sum_\alpha \overline{\langle v_\alpha|\psi\rangle}\, \langle\varphi_\alpha| \otimes \langle v_\alpha|\Big) \big(|b\rangle\langle b| \otimes 1\big) \Big(\sum_\beta \langle v_\beta|\psi\rangle\, |\varphi_\beta\rangle \otimes |v_\beta\rangle\Big) = \sum_\alpha |\langle v_\alpha|\psi\rangle|^2\, \delta_{\tau(2^{-t}\varphi_\alpha),\, b}, \quad (9)$$

where $|b\rangle\langle b| \otimes 1$ is an operator that projects the first ancilla bit onto the state $|b\rangle$ and leaves the other qubits untouched.
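Equation 9 can be evaluated classically once the eigendecomposition of $U$ is known. The sketch below (NumPy; a random real orthonormal eigenbasis and random integer eigenphases on $t = 3$ bits, all illustrative) computes the distribution of the readout bit:

```python
import numpy as np

# Classical evaluation of eq. (9): after QPE, the first ancilla bit b is
# measured with probability sum_alpha |<v_alpha|psi>|^2 delta(tau(2^-t phi_alpha), b).
rng = np.random.default_rng(2)

t = 3
dim = 2 ** t
phis = rng.integers(0, dim, size=dim)             # eigenphases phi_alpha
V = np.linalg.qr(rng.normal(size=(dim, dim)))[0]  # columns: eigenvectors v_alpha
psi = rng.normal(size=dim)
psi /= np.linalg.norm(psi)                        # normalized input state

amps2 = (V.T @ psi) ** 2                  # |<v_alpha|psi>|^2 (real case)
msb = (phis >= dim // 2).astype(float)    # tau(2^-t phi) = most significant bit
p1 = float(amps2 @ msb)                   # probability the readout bit is 1
p0 = float(amps2 @ (1.0 - msb))

assert abs(p0 + p1 - 1.0) < 1e-9          # Born's rule yields a distribution
assert 0.0 <= p1 <= 1.0
```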

3.2. DEFINITION AND ADVANTAGES OF QUANTUM DEFORMED NEURAL NETWORKS

Armed with this background, we can now apply quantum phase estimation to compute the output of the probabilistic layer of equation 6. Let $N$ be the number of input neurons and $M$ that of output neurons. We introduce qubits to represent input and weight bits:

$$|h, W\rangle \in V_h \otimes V_W, \qquad V_h = \bigotimes_{i=1}^N (\mathbb{C}^2)_i, \qquad V_W = \bigotimes_{i=1}^N \bigotimes_{j=1}^M (\mathbb{C}^2)_{ij}. \quad (10)$$

Then we introduce a Hamiltonian $H_j$ acting non-trivially only on the $N$ input activations and the $N$ weights of the $j$-th row:

$$H_j = \sum_{i=1}^N B^W_{ji} B^h_i, \quad (11)$$

where $B^h_i$ ($B^W_{ji}$) is the matrix $B = |1\rangle\langle 1|$ acting on the $i$-th activation ($ji$-th weight) qubit. Note that $H_j$ singles out the terms of the state $|h, W\rangle$ where both $h_i = 1$ and $W_{ji} = 1$ and adds them up, i.e. the eigenvalues of $H_j$ are the preactivations of equation 1:

$$H_j |h, W\rangle = \varphi(h, W_{j,:})\, |h, W\rangle, \qquad \varphi(h, W_{j,:}) = \sum_{i=1}^N W_{ji} h_i. \quad (12)$$

Now define the unitary operators:

$$U_j = D\, e^{\frac{2\pi i}{N+1} H_j}\, D^{-1}, \quad (13)$$

where $D$ is another generic unitary; as we shall see shortly, its eigenvectors will be related to the entries of the classical stochastic matrix $S$ of section 2.

[Figure 1: (a) Quantum circuit implementing a quantum deformed layer. The thin vertical line indicates that the gate acts as identity on the wires crossed by the line. (b) Quantum deformed multilayer perceptron with 2 hidden quantum neurons and 1 output quantum neuron. $|x\rangle$ is an encoding of the input signal, $y$ is the prediction. The superscript $\ell$ in $U^\ell_j$ and $W^\ell_{j,:}$ refers to the layer. We split the blocks of $t$ ancilla qubits into a readout qubit that encodes the layer output amplitude and the rest. (c) Modification of a layer for classical simulations.]
Since $U_j U_{j'} = D e^{\frac{2\pi i}{N+1}(H_j + H_{j'})} D^{-1} = U_{j'} U_j$, we can diagonalize all the $U_j$'s simultaneously, and since they are conjugate to $e^{\frac{2\pi i}{N+1} H_j}$ they have the same eigenvalues. Introducing the eigenbasis $|h, W\rangle_D = D |h, W\rangle$, we have:

$$U_j |h, W\rangle_D = e^{\frac{2\pi i}{N+1}\varphi(h, W_{j,:})}\, |h, W\rangle_D. \quad (14)$$

Note that $\varphi \in \{0, \dots, N\}$, so we can represent it with exactly $t$ bits, taking $N = 2^t - 1$. Then we add $M$ ancilla registers, each of $t$ qubits, and sequentially perform $M$ quantum phase estimations, one for each $U_j$, as depicted in figure 1(a). We choose the following input state:

$$|\psi\rangle = |\psi_h\rangle \otimes \bigotimes_{j=1}^M |\psi_{W_{j,:}}\rangle, \qquad |\psi_{W_{j,:}}\rangle = \bigotimes_{i=1}^N \Big(\sqrt{q_{ji}(W_{ji} = 0)}\, |0\rangle + \sqrt{q_{ji}(W_{ji} = 1)}\, |1\rangle\Big), \quad (15)$$

where we have chosen the weight input state according to the factorized variational distribution $q_{ji}$ introduced in section 2. In fact, this state corresponds to the following probability distribution via Born's rule:

$$p(h, W) = |\langle h, W|\psi\rangle|^2 = p(h) \prod_{j=1}^M \prod_{i=1}^N q_{ji}(W_{ji}), \qquad p(h) = |\langle h|\psi_h\rangle|^2. \quad (16)$$

The state $|\psi_h\rangle$ is discussed below. Now we show that a non-trivial choice of $D$ leads to an effective correlated distribution. The $j$-th QPE in figure 1(a) corresponds to equation 8, where we identify $|v_\alpha\rangle \equiv |h, W\rangle_D$, $|\varphi_\alpha\rangle \equiv |\varphi(h, W_{j,:})\rangle$, and we make use of the $j$-th block of $t$ ancillas. After $M$ steps we compute the outcome probability of a measurement of the first qubit in each of the $M$ ancilla registers. We can extend equation 9 to the situation of measuring multiple qubits; recalling that the first bit of an integer is its most significant bit, determining whether $2^{-t}\varphi(h, W_{j,:}) = (N+1)^{-1}\varphi(h, W_{j,:})$ is greater or smaller than $1/2$, the probability of outcome $h' = (h'_1, \dots, h'_M)$ is:

$$p(h') = \sum_{h \in \mathbb{B}^N} \sum_{W \in \mathbb{B}^{NM}} \delta_{h', f(W, h)}\, |\langle\psi|h, W\rangle_D|^2, \quad (17)$$

where $f$ is the layer function introduced in equation 1. We refer to appendix C for a detailed derivation.
Equation 17 is the generalized probabilistic binary layer introduced in equation 6, where $D$ corresponds to a non-trivial $S$ and to a correlated distribution when $D$ entangles the qubits:

$$\pi(h, W) = |\langle\psi| D |h, W\rangle|^2. \quad (18)$$

The variational parameters $\phi$ of $S$ are now parameters of the quantum circuit $D$. Sampling from $\pi$ can be done by repeated measurements of the first $M$ ancilla qubits of this quantum circuit. On quantum hardware, $e^{\frac{2\pi i}{N+1} H_j}$ can be implemented efficiently since it is a product of diagonal two-qubit quantum gates. We shall consider unitaries $D$ which have efficient quantum circuit approximations. Then computing the quantum deformed layer output on a quantum computer takes time $O(tMu(N))$, where $u(N)$ is the time it takes to compute the action of $U_j$ on an input state. There exist $D$ such that sampling from equation 18 is exponentially harder classically than quantum mechanically, a statement forming the basis for quantum supremacy experiments on noisy, intermediate-scale quantum computers (Aaronson & Chen, 2016; Arute et al., 2019). Examples are random circuits with two-dimensional entanglement patterns, which from a machine learning point of view can be natural when considering image data. Other examples are $D$ implementing time evolution operators of physical systems, whose simulation is exponentially hard classically, resulting in hardness of sampling from the time-evolved wave function. Quantum supremacy experiments give indications of which architectures can benefit from quantum speedups, but we remark that the proposed quantum architecture, which relies on quantum phase estimation, is designed for error-corrected quantum computers. Even better, on quantum hardware we can avoid sampling intermediate activations altogether: at the first layer, the input can be prepared by encoding the input bits in the state $|x\rangle$; for the next layers, we simply use the output state as the input to the next layer.
One thus obtains the quantum network of figure 1(b); the algorithm for a layer is summarized in Procedure 1. Note that all the qubits associated to the intermediate activations are entangled. Therefore the input state $|\psi_h\rangle$ would have to be replaced by a state in $V_h$ together with all the other qubits, where the gates at the next layer act only on $V_h$ in the manner described in this section. (An equivalent and more economical mathematical description is to use the reduced density matrix $\rho_h$ as the input state.) We envision two other possible procedures for what happens after the first layer: i) we sample from equation 17 and initialize $|\psi_h\rangle$ to the sampled bit string, in analogy to the classical quantization of activations; ii) we sample many times to reconstruct the classical distribution and encode it in $|\psi_h\rangle$. In our classical simulations below we will be able to calculate the probabilities exactly and can avoid sampling. Finally, we remark that at present it is not clear whether the computational speedup exhibited by our architecture translates into a learning advantage. This is an outstanding question whose full answer will require an empirical evaluation with a quantum computer. Next, we will try to get as close as possible to answering this question by studying a quantum model that we can simulate classically.
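For intuition, the measurement distribution of equations 17-18 can be computed by dense linear algebra at toy sizes. The sketch below uses $N = 2$ inputs, $M = 1$ output (a 4-qubit register of dimension 16) and a random orthogonal matrix as a stand-in for the deformation circuit $D$; all probability values are illustrative:

```python
import numpy as np
from itertools import product

# Dense toy evaluation of pi(h, W) = |<psi| D |h, W>|^2 and of the layer
# output probability of eq. (17) for N = 2 inputs, M = 1 output.
rng = np.random.default_rng(3)

def qubit(p1):
    # amplitude encoding of a bit with p(bit = 1) = p1
    return np.array([np.sqrt(1.0 - p1), np.sqrt(p1)])

p_h = [0.8, 0.3]                         # input activation probabilities
q_w = [0.6, 0.5]                         # weight probabilities q_1i(W_1i = 1)
psi = qubit(p_h[0])
for p in [p_h[1], q_w[0], q_w[1]]:
    psi = np.kron(psi, qubit(p))         # factorized |psi_h> x |psi_W>

D = np.linalg.qr(rng.normal(size=(16, 16)))[0]   # stand-in deformation
pi = (D.T @ psi) ** 2                    # pi(h, W), real orthogonal case

N = 2
p_out1 = 0.0
for pr, (h1, h2, w1, w2) in zip(pi, product([0, 1], repeat=4)):
    if (w1 * h1 + w2 * h2) / (N + 1) >= 0.5:   # layer function f of eq. (1)
        p_out1 += pr

assert abs(pi.sum() - 1.0) < 1e-9        # D unitary => pi is a distribution
assert 0.0 <= p_out1 <= 1.0
```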

3.3. MODIFICATIONS FOR CLASSICAL SIMULATIONS

In this paper we provide classical simulations of the quantum neural networks introduced above for a restricted class of designs. We do this for two reasons: first, to convince the reader that the quantum layers hold promise (even though we cannot simulate the proposed architecture in its full glory due to the lack of access to a quantum computer), and second, to show that these ideas can be interesting as new designs even "classically" (by which we mean architectures that can be executed on a classical computer). To parallelize the computations for different output neurons, we modify the setup just explained as depicted in figure 1(c). We clone the input activation register $M$ times, an operation that quantum mechanically is only approximate (Nielsen & Chuang, 2000) but classically exact. Then we associate the $j$-th copy with the $j$-th row of the weight matrix, thus forming pairs for each $j = 1, \dots, M$:

$$|h, W_{j,:}\rangle \in V_h \otimes V_{W,j}, \qquad V_{W,j} = \bigotimes_{i=1}^N (\mathbb{C}^2)_{ji}. \quad (19)$$

Fixing $j$, we introduce the unitary $e^{\frac{2\pi i}{N+1} H_j}$, diagonal in the basis $|h, W_{j,:}\rangle$ as in equation 11, and define the new unitary:

$$\tilde{U}_j = D_j\, e^{\frac{2\pi i}{N+1} H_j}\, D_j^{-1}, \quad (20)$$

where with respect to equation 13 we now let $D_j$ depend on $j$. We denote the eigenvectors of $\tilde{U}_j$ by $|h, W_{j,:}\rangle_{D_j} = D_j |h, W_{j,:}\rangle$, with eigenphase $\varphi(h, W_{j,:})$ introduced in equation 12. Supposing that we know $p(h) = \prod_i p_i(h_i)$, we apply the quantum phase estimation to $\tilde{U}_j$ with input:

$$|\psi_j\rangle = |\psi_h\rangle \otimes |\psi_{W_{j,:}}\rangle, \qquad |\psi_h\rangle = \bigotimes_{i=1}^N \Big(\sqrt{p_i(h_i = 0)}\, |0\rangle + \sqrt{p_i(h_i = 1)}\, |1\rangle\Big), \quad (21)$$

where $|\psi_{W_{j,:}}\rangle$ is defined in equation 15. Going through calculations similar to those done above shows that measurements of the first qubit are governed by the probability distribution of equation 6, factorized over output channels since the procedure does not couple them:

$$\pi(h, W) = \prod_{j=1}^M |\langle\psi_j| D_j |h, W_{j,:}\rangle|^2. \quad (22)$$

So far, we have focused on fully connected layers.
We can extend the derivation of this section to the convolutional case by applying the quantum phase estimation on image patches of size equal to the kernel size, as explained in appendix D.
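The reduction of a convolution to patch-wise dense layers is the standard im2col construction: each $k \times k$ patch is treated as one input of size $N = k^2$. A minimal sketch (the binary test image and kernel size are illustrative; stride 1, no padding):

```python
import numpy as np

# im2col: collect all k x k patches of a 2D image (stride 1, no padding),
# so a dense (here: quantum-deformed) layer can act on each patch.
def extract_patches(img, k):
    H, W = img.shape
    rows = [img[r:r + k, c:c + k].ravel()
            for r in range(H - k + 1)
            for c in range(W - k + 1)]
    return np.array(rows)                # shape (num_patches, k*k)

img = (np.arange(16).reshape(4, 4) % 2).astype(np.int64)  # toy binary image
patches = extract_patches(img, 3)

assert patches.shape == (4, 9)           # four 3x3 patches in a 4x4 image
assert np.array_equal(patches[0], img[:3, :3].ravel())
```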

4. CLASSICAL SIMULATIONS FOR LOW ENTANGLEMENT

4.1. THEORY

It has been remarked in (Shayer et al., 2017; Peters & Welling, 2018) that when the weight and activation distributions at a given layer are factorized, $p(h) = \prod_i p_i(h_i)$ and $q(W) = \prod_{ij} q_{ij}(W_{ij})$, the output distribution in equation 3 can be efficiently approximated using the central limit theorem (CLT). The argument goes as follows: for each $j$, the preactivation $\varphi(h, W_{j,:}) = \sum_{i=1}^N W_{j,i} h_i$ is a sum of independent binary random variables $W_{j,i} h_i$ with mean and variance:

$$\mu_{ji} = \mathbb{E}_{w \sim q_{ji}}(w)\, \mathbb{E}_{h \sim p_i}(h), \qquad \sigma^2_{ji} = \mathbb{E}_{w \sim q_{ji}}(w^2)\, \mathbb{E}_{h \sim p_i}(h^2) - \mu^2_{ji} = \mu_{ji}(1 - \mu_{ji}),$$

where we used $b^2 = b$ for a variable $b \in \{0, 1\}$. The CLT implies that for large $N$ we can approximate $\varphi(h, W_{j,:})$ with a normal distribution with mean $\mu_j = \sum_i \mu_{ji}$ and variance $\sigma^2_j = \sum_i \sigma^2_{ji}$. The distribution of the activation after the non-linearity of equation 1 can thus be computed as:

$$p\Big(\tau\big(\tfrac{1}{N+1}\varphi(h, W_{j,:})\big) = 1\Big) = p\big(2\varphi(h, W_{j,:}) - N > 0\big) = \Phi\Big(\frac{2\mu_j - N}{2\sigma_j}\Big), \quad (23)$$

$\Phi$ being the CDF of the standard normal distribution. Below we fix $j$ and omit it for notational clarity. As reviewed in appendix B, commuting observables in quantum mechanics behave like classical random variables. The observable of interest for us, $DHD^{-1}$ of equation 20, is a sum of commuting terms $K_i \equiv D B^W_i B^h_i D^{-1}$, and if their joint probability distribution is such that these random variables are weakly correlated, i.e.

$$\langle\psi| K_i K_{i'} |\psi\rangle - \langle\psi| K_i |\psi\rangle\, \langle\psi| K_{i'} |\psi\rangle \to 0 \quad \text{as } |i - i'| \to \infty, \quad (24)$$

then the CLT for weakly correlated random variables applies, stating that measurements of $DHD^{-1}$ in the state $|\psi\rangle$ are governed by a Gaussian distribution $\mathcal{N}(\mu, \sigma^2)$ with

$$\mu = \langle\psi| D H D^{-1} |\psi\rangle, \qquad \sigma^2 = \langle\psi| D H^2 D^{-1} |\psi\rangle - \mu^2. \quad (25)$$

Finally, we can plug these values into equation 23 to get the layer output probability. We have thus cast the problem of simulating the quantum neural network into the problem of computing the expectation values in equation 25.
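For the undeformed case $D = 1$, the recipe above reduces to the CLT approximation of (Shayer et al., 2017; Peters & Welling, 2018), which can be written down and checked in a few lines. In the sketch below (NumPy; `clt_layer` is our naming, and the toy sizes are illustrative), the approximation is compared against exact enumeration, which is possible here because with $p_i = q_{ji} = 1/2$ each $W_{ji} h_i$ is Bernoulli$(1/4)$ and the preactivation is Binomial$(N, 1/4)$:

```python
import numpy as np
from math import erf, sqrt, comb

def Phi(x):
    # CDF of the standard normal distribution
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def clt_layer(p, q):
    """CLT approximation of the factorized binary layer (eq. 23, D = 1).
    p: (N,) activation probabilities p_i(h_i = 1);
    q: (M, N) weight probabilities q_ji(W_ji = 1)."""
    mu_ji = q * p                        # E[W_ji h_i] for independent binary vars
    var_ji = mu_ji * (1.0 - mu_ji)       # b^2 = b for binary b
    mu = mu_ji.sum(axis=1)
    sigma = np.sqrt(var_ji.sum(axis=1))
    N = p.shape[0]
    return np.array([Phi((2.0 * m - N) / (2.0 * s)) for m, s in zip(mu, sigma)])

# sanity check: exact output probability via the binomial distribution,
# the layer fires iff 2 phi - N > 0, i.e. phi >= N//2 + 1 for integer phi
N = 8
out = clt_layer(np.full(N, 0.5), np.full((1, N), 0.5))
exact = sum(comb(N, k) * 0.25 ** k * 0.75 ** (N - k)
            for k in range(N // 2 + 1, N + 1))
assert abs(out[0] - exact) < 0.1         # rough agreement already at N = 8
```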
In physical terms, these are related to correlation functions of $H$ and $H^2$ after evolving the state $|\psi\rangle$ with the operator $D$. These can be computed efficiently classically for one-dimensional and lowly entangled quantum circuits $D$ (Vidal, 2003). In view of that, here we consider a 1d arrangement of activation and weight qubits, labeled by $i = 0, \dots, 2N-1$, where the even qubits are associated with activations and the odd ones with weights. We then choose:

$$D = \Big(\prod_{i=0}^{N-1} Q_{2i,2i+1}\Big) \Big(\prod_{i=0}^{N-2} P_{2i+1,2i+2}\Big),$$

where $Q_{2i,2i+1}$ acts non-trivially on qubits $2i$, $2i+1$, i.e. on the $i$-th activation and $i$-th weight qubits, while $P_{2i+1,2i+2}$ acts on the $i$-th weight and $(i+1)$-th activation qubits. We depict this quantum circuit in figure 2(a). As explained in detail in appendix E, the computation of $\mu$ involves the matrix element of $K_i$ in the product state $|\psi\rangle$, while $\sigma^2$ involves that of $K_i K_{i+1}$. Due to the structure of $D$, these operators act locally on 4 and 6 sites respectively, as depicted in figure 2(b)-(c). This implies that the computation of equation 25, and hence of the full layer, can be done in $O(N)$ time and is easily parallelized. Appendix E contains more details on the complexity, while Procedure 2 describes the algorithm for the classical simulation discussed here.

Procedure 1: Quantum deformed layer. $\text{QPE}_j(U, I)$ is quantum phase estimation for a unitary $U$ acting on the set $I$ of activation qubits and the $j$-th weight/ancilla qubits. $H_j$ is given in equation 11.
Input: $\{q_{ij}\}_{i=0,\dots,N-1}^{j=0,\dots,M-1}$, $|\psi\rangle$, $I$, $D$, $t$
Output: $|\psi\rangle$
for $j = 0$ to $M-1$ do
  $|\psi_{W_{j,:}}\rangle \leftarrow \bigotimes_{i=1}^N \big(\sqrt{q_{ji}}\,|0\rangle + \sqrt{1 - q_{ji}}\,|1\rangle\big)$
  $|\psi\rangle \leftarrow |0\rangle^{\otimes t} \otimes |\psi\rangle \otimes |\psi_{W_{j,:}}\rangle$
  $U \leftarrow D e^{\frac{2\pi i}{N+1} H_j} D^{-1}$  {this requires approximating the unitary with quantum gates}
  $|\psi\rangle \leftarrow \text{QPE}_j(U, I)\, |\psi\rangle$
end for

Procedure 2: Classical simulation of a quantum deformed layer with $N$ ($M$) inputs (outputs).
Input: $\{q_{ij}\}_{i=0,\dots,N-1}^{j=0,\dots,M-1}$, $\{p_i\}_{i=0,\dots,N-1}$, $P = \{P^j_{2i-1,2i}\}_{i=1,\dots,N-1}^{j=0,\dots,M-1}$, $Q = \{Q^j_{2i,2i+1}\}_{i=0,\dots,N-1}^{j=0,\dots,M-1}$
Output: $\{p'_j\}_{j=0,\dots,M-1}$
for $j = 0$ to $M-1$ do
  for $i = 0$ to $N-1$ do
    $\psi_{2i} \leftarrow (\sqrt{p_i}, \sqrt{1 - p_i})$
    $\psi_{2i+1} \leftarrow (\sqrt{q_{ij}}, \sqrt{1 - q_{ij}})$
  end for
  for $i = 0$ to $N-1$ do
    $\mu_i \leftarrow \text{computeMu}(i, \psi, P, Q)$  {this implements equation 45 of appendix E}
    $\gamma_{i,i+1} \leftarrow \text{computeGamma}(i, \psi, P, Q)$  {this implements equation 49 of appendix E}
  end for
  $\mu \leftarrow \sum_{i=0}^{N-1} \mu_i$
  $\sigma^2 \leftarrow 2\sum_{i=0}^{N-2} (\gamma_{i,i+1} - \mu_i \mu_{i+1}) + \sum_{i=0}^{N-1} (\mu_i - \mu_i^2)$
  $p'_j \leftarrow \Phi\big(\frac{2\mu - N}{2\sqrt{\sigma^2}}\big)$
end for

4.2. EXPERIMENTS

We present experiments for the model of the previous section. At each layer, $q_{ij}$ and $D_j$ are learnable. They are optimized to minimize the loss of equation 7, where following (Peters & Welling, 2018; Shayer et al., 2017) we take $R = \beta \sum_{\ell,i,j} q^{(\ell)}_{ij}(1 - q^{(\ell)}_{ij})$, and $R'$ is the $L_2$ regularization loss of the parameters of $D_j$. $\mathcal{L}$ coincides with equation 2. We implemented and trained several architectures with different deformations: $P^j_{i,i+1} = Q^j_{i,i+1} = 1$ (baseline (Peters & Welling, 2018)); [PQ]: $P^j_{i,i+1}$, $Q^j_{i,i+1}$ generic; [Q]: $P^j_{i,i+1} = 1$, $Q^j_{i,i+1}$ generic. Table 1 contains results for two standard image datasets, MNIST and Fashion MNIST. Details of the experiments are in appendix F. The classical baseline is based on (Peters & Welling, 2018), but we use fewer layers to make the simulation of the deformation cheaper, and use no batch norm and no max pooling. The general deformation ([PQ]) performs best in all cases.
In the simplest case of a single dense layer (A), the gain in test accuracy is +3.2% for MNIST and +2.6% for Fashion MNIST. For convnets, we could only simulate a single deformed layer due to computational constraints, and the gain is around or below 1%. We expect that deforming all layers would give a greater boost, as the improvements diminish when the ratio of deformation parameters to classical parameters ($q_{ij}$) decreases. The increase in accuracy comes at the expense of more parameters. In appendix F we present additional results showing that quantum models can still deliver modest accuracy improvements w.r.t. convolutional networks with the same number of parameters.
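As a consistency check of the strategy of section 4.1, equation 25 can be verified by dense linear algebra at toy sizes: for an orthogonal $D$ built from local $P$, $Q$ gates, the moments $\langle\psi|DHD^{-1}|\psi\rangle$ and $\langle\psi|DH^2D^{-1}|\psi\rangle$ agree exactly with the moments of the measurement distribution $\pi(h, W) = |\langle\psi|D|h, W\rangle|^2$. The sketch below ($N = 3$, random real 2-qubit gates, a random state; dense $64 \times 64$ matrices stand in for the efficient $O(N)$ contraction of appendix E) checks this identity:

```python
import numpy as np
from math import erf, sqrt

# Dense toy check of eq. (25) for the layered circuit D = (prod Q)(prod P):
# N = 3 activation/weight pairs -> 6 qubits, qubit order h0 W0 h1 W1 h2 W2.
rng = np.random.default_rng(5)

def rand_gate():
    return np.linalg.qr(rng.normal(size=(4, 4)))[0]  # random orthogonal 2-qubit gate

N = 3
I2 = np.eye(2)
Q_full = np.kron(np.kron(rand_gate(), rand_gate()), rand_gate())      # Q on (0,1),(2,3),(4,5)
P_full = np.kron(np.kron(I2, np.kron(rand_gate(), rand_gate())), I2)  # P on (1,2),(3,4)
D = Q_full @ P_full

B = np.diag([0.0, 1.0])                  # |1><1|
def op_on(site):
    # embed B at one site of the 2N-qubit register
    out = np.array([[1.0]])
    for s in range(2 * N):
        out = np.kron(out, B if s == site else I2)
    return out

H = sum(op_on(2 * i) @ op_on(2 * i + 1) for i in range(N))  # eigenvalues = preactivations

psi = rng.normal(size=64)
psi /= np.linalg.norm(psi)

A = D @ H @ D.T                          # D H D^{-1} for real orthogonal D
mu = psi @ A @ psi
sigma2 = psi @ A @ A @ psi - mu ** 2

pi = (D.T @ psi) ** 2                    # measurement distribution in the D-basis
phis = np.diag(H)                        # preactivation values phi(h, W)
assert np.isclose(mu, pi @ phis)                           # first moment matches
assert np.isclose(sigma2, pi @ phis ** 2 - (pi @ phis) ** 2)  # second moment matches
assert sigma2 > 0

# Gaussian output probability, as in eq. (23)
p_out = 0.5 * (1.0 + erf(((2 * mu - N) / (2 * sqrt(sigma2))) / sqrt(2.0)))
assert 0.0 <= p_out <= 1.0
```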

5. CONCLUSIONS

In this work we made the following main contributions: 1) we introduced quantum deformed neural networks and identified potential speedups from running these models on a quantum computer; 2) we devised classically efficient algorithms to train the networks for low entanglement designs of the quantum circuits; 3) for the first time in the literature, we simulated quantum neural networks at real world data sizes, obtaining good accuracy and showing modest gains due to the quantum deformations. Running these models on a quantum computer will allow one to explore efficiently more general deformations, in particular those that cannot be approximated by the central limit theorem when the Hamiltonians are sums of non-commuting operators. Another interesting future direction is to incorporate batch normalization and pooling layers in quantum neural networks. An outstanding question in quantum machine learning is to find quantum advantages for classical machine learning tasks. The class of known problems for which a quantum learner can have a provably exponential advantage over a classical learner is small at the moment (Liu et al., 2020), and some problems that are classically hard to compute can be predicted easily with classical machine learning (Huang et al., 2020). The approach presented here is the next step in a series of papers that benchmark quantum neural networks empirically, e.g. Farhi & Neven (2018); Huggins et al. (2019); Grant et al. (2019; 2018); Bausch (2020). We are the first to show that, as one approaches the limit of entangling circuits, the quantum inspired architecture improves relative to the classical one at real world data sizes.



Figure 2: (a) The entangling circuit $D$ for $N = 3$. (b) $K_i$, entering the computation of $\mu$. (c) $K_i K_{i+1}$, entering the computation of $\sigma^2$. Indices on $P$, $Q$, $B$ are omitted for clarity. Time flows downwards.

Table 1: Test accuracies for MNIST and Fashion MNIST. With the notation cKsS-C to indicate a conv2d layer with C filters of size [K, K] and stride S, and dN for a dense layer with N output

