TOWARD TRAINABILITY OF QUANTUM NEURAL NETWORKS

Abstract

Quantum Neural Networks (QNNs) have recently been proposed as generalizations of classical neural networks to achieve quantum speed-ups. Despite their potential to outperform classical models, serious bottlenecks exist for training QNNs; namely, QNNs with random structures have poor trainability due to a gradient that vanishes at a rate exponential in the number of input qubits. The vanishing gradient could seriously influence the applications of large-size QNNs. In this work, we provide a first viable solution with theoretical guarantees. Specifically, we prove that QNNs with tree tensor and step controlled architectures have gradients that vanish at most polynomially with the qubit number. Moreover, our result holds irrespective of which encoding methods are employed. We numerically demonstrate QNNs with tree tensor and step controlled structures on a binary classification task. Simulations show faster convergence rates and better accuracy compared to QNNs with random structures.

1. INTRODUCTION

Neural Networks (Hecht-Nielsen, 1992) using gradient-based optimization have dramatically advanced research in discriminative models, generative models, and reinforcement learning. To efficiently utilize the parameters and practically improve the trainability, neural networks with specific architectures (LeCun et al., 2015) have been introduced for different tasks, including convolutional neural networks (Krizhevsky et al., 2012) for image tasks, recurrent neural networks (Zaremba et al., 2014) for time series analysis, and graph neural networks (Scarselli et al., 2008) for tasks related to graph-structured data. Recently, neural architecture search (Elsken et al., 2019) has been proposed to improve the performance of networks by optimizing their structures. Despite the success in many fields, the development of neural network algorithms can be limited by the large computational resources required for model training. In recent years, quantum computing has emerged as one solution to this problem and has evolved into a new interdisciplinary field known as quantum machine learning (QML) (Biamonte et al., 2017; Havlíček et al., 2019). Specifically, variational quantum circuits (Benedetti et al., 2019) have been explored as efficient protocols for quantum chemistry (Kandala et al., 2017) and combinatorial optimization (Zhou et al., 2018). Compared to classical circuit models, quantum circuits have shown greater expressive power (Du et al., 2020a) and demonstrated quantum advantage in the low-depth case (Bravyi et al., 2018). Due to their robustness against noise, variational quantum circuits have attracted significant interest in the hope of achieving quantum supremacy on near-term quantum computers (Arute et al., 2019). Quantum Neural Networks (QNNs) (Farhi & Neven, 2018; Schuld et al., 2020; Beer et al., 2020) are a special kind of quantum-classical hybrid algorithm that runs on trainable quantum circuits.
Recently, small-scale QNNs have been implemented on real quantum computers (Havlíček et al., 2019) for supervised learning tasks. The training of QNNs aims to minimize an objective function f with respect to parameters θ. Inspired by classical optimization of neural networks, a natural strategy to train QNNs is to exploit the gradient of the loss function (Crooks, 2019). However, recent work (McClean et al., 2018) shows that n-qubit quantum circuits with random structures and large depth L = O(poly(n)) tend to be approximately unitary 2-designs (Harrow & Low, 2009), and the partial derivative vanishes to zero exponentially in n. This vanishing gradient problem is usually referred to as Barren Plateaus (McClean et al., 2018), and it affects the trainability of QNNs in two ways. First, simply using a gradient-based method like Stochastic Gradient Descent (SGD) to train the QNN takes a large number of iterations. Second, estimating the derivatives requires an extremely large number of samples from the quantum output to guarantee a relatively accurate update direction (Chen et al., 2018). To avoid the Barren Plateaus phenomenon, we explore QNNs with special structures. In this work, we introduce QNNs with special architectures, including the tree tensor (TT) structure (Huggins et al., 2019), referred to as TT-QNNs, and the step controlled structure, referred to as SC-QNNs. We prove that for TT-QNNs and SC-QNNs, the expectation of the gradient norm of the objective function is lower bounded. Theorem 1.1.
(Informal) Consider the n-qubit TT-QNN and the n-qubit SC-QNN defined in Figures 1-2 and the corresponding objective functions f_TT and f_SC defined in Eqs. (3-4). Then we have:

(1 + log n)/(2n) · α(ρ_in) ≤ E_θ ‖∇_θ f_TT‖² ≤ 2n − 1,
(1 + n_c)/2^{1+n_c} · α(ρ_in) ≤ E_θ ‖∇_θ f_SC‖² ≤ 2n − 1,

where n_c is the number of CNOT operations that directly link to the first qubit channel in the SC-QNN, the expectation is taken over all parameters in θ with uniform distributions in [0, 2π], and α(ρ_in) ≥ 0 is a constant that depends only on the input state ρ_in ∈ C^{2^n × 2^n}. Moreover, by preparing ρ_in using the L-layer encoding circuit in Figure 4, the expectation of α(ρ_in) can be further lower bounded as E α(ρ_in) ≥ 2^{−2L}.

Compared to random QNNs with 2^{−O(poly(n))} derivatives, the gradient norm of TT-QNNs and SC-QNNs is lower bounded by Ω(1/n) or Ω(2^{−n_c}), which could lead to better trainability. Our contributions are summarized as follows:

• We prove Ω(1/n) and Ω(2^{−n_c}) lower bounds on the expectation of the gradient norm of TT-QNNs and SC-QNNs, respectively, which guarantee the trainability on related optimization problems. Our theorem does not require the unitary 2-design assumption made in existing works and is thus more realistic for near-term quantum computers.

• We prove that by employing the encoding circuit in Figure 4 to prepare ρ_in, the expectation of the term α(ρ_in) is lower bounded by the constant 2^{−2L}. Thus, we further lower bound the expectation of the gradient norm by a term independent of the input state.

• We simulate the performance of TT-QNNs, SC-QNNs, and random structure QNNs on a binary classification task. All results verify the proposed theorems. Both TT-QNNs and SC-QNNs show better trainability and accuracy than random QNNs.

Our proof strategy could be adopted for analyzing QNNs with other architectures as future work.
With the proven assurance on the trainability of TT-QNNs and SC-QNNs, we eliminate one bottleneck blocking the application of large-size Quantum Neural Networks. The rest of this paper is organized as follows. We present the preliminaries, including definitions, basic quantum computing background, and related works, in Section 2. QNNs with special structures and the corresponding results are presented in Section 3. We implement the binary classification using QNNs, with results shown in Section 4. We conclude in Section 5.

2. PRELIMINARY

2.1 NOTATIONS AND BASIC QUANTUM COMPUTING

We use [N] to denote the set {1, 2, ..., N}. The notation ‖·‖ denotes the ℓ_2 norm for vectors. We denote a_j as the j-th component of the vector a. The tensor product operation is denoted as "⊗". The conjugate transpose of a matrix A is denoted as A†. The trace of a matrix A is denoted as Tr[A]. We denote ∇_θ f as the gradient of the function f with respect to the vector θ. We employ the notations O and Õ to describe the standard complexity and the complexity ignoring minor terms, respectively.

Now we introduce basic quantum computing. The pure state of a qubit can be written as |φ⟩ = a|0⟩ + b|1⟩, where a, b ∈ C satisfy |a|² + |b|² = 1, and |0⟩ = (1, 0)^T, |1⟩ = (0, 1)^T. The n-qubit space is formed by the tensor product of n single-qubit spaces. For a vector x ∈ R^{2^n}, the amplitude encoded state |x⟩ is defined as |x⟩ = (1/‖x‖) Σ_{j=1}^{2^n} x_j |j⟩. The density matrix of a pure state is defined as ρ = |x⟩⟨x|, in which ⟨x| = (|x⟩)†. A single-qubit operation on the state behaves like a matrix-vector multiplication and is referred to as a gate in the quantum circuit language. Frequently used single-qubit operations include R_X(θ) = e^{−iθX}, R_Y(θ) = e^{−iθY}, and R_Z(θ) = e^{−iθZ}, where

X = [[0, 1], [1, 0]],  Y = [[0, −i], [i, 0]],  Z = [[1, 0], [0, −1]].

The Pauli matrices {I, X, Y, Z} will be referred to as {σ_0, σ_1, σ_2, σ_3} for convenience. Moreover, two-qubit operations, the CNOT gate and the CZ gate, are employed for generating quantum entanglement:

CNOT = |0⟩⟨0| ⊗ σ_0 + |1⟩⟨1| ⊗ σ_1,  CZ = |0⟩⟨0| ⊗ σ_0 + |1⟩⟨1| ⊗ σ_3.

We can obtain information from a quantum system by performing measurements; for example, measuring the state |φ⟩ = a|0⟩ + b|1⟩ generates 0 and 1 with probability p(0) = |a|² and p(1) = |b|², respectively.
Such a measurement operation can be mathematically described as computing the average of the observable O = σ_3 under the state |φ⟩:

⟨σ_3⟩_{|φ⟩} ≡ ⟨φ|σ_3|φ⟩ ≡ Tr[|φ⟩⟨φ| · σ_3] = |a|² − |b|² = p(0) − p(1) = 2p(0) − 1.

The average of a unitary observable under an arbitrary state is bounded in [−1, 1].
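The expectation identity above is easy to check directly. A minimal numpy sketch (names like `expval_z` are ours, not from the paper): the average of σ_3 under |φ⟩ = a|0⟩ + b|1⟩ equals 2p(0) − 1.

```python
import numpy as np

Z = np.array([[1, 0], [0, -1]], dtype=complex)   # the observable σ_3

def expval_z(a, b):
    # <φ| σ_3 |φ> for the single-qubit state |φ> = a|0> + b|1>
    phi = np.array([a, b], dtype=complex)
    return float(np.real(phi.conj() @ Z @ phi))

a, b = 0.6, 0.8                                  # |a|^2 + |b|^2 = 1
p0 = abs(a) ** 2
assert np.isclose(expval_z(a, b), 2 * p0 - 1)    # |a|^2 - |b|^2 = 2p(0) - 1
```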

2.2. RELATED WORKS

The barren plateaus phenomenon in QNNs was first noticed by McClean et al. (2018). They prove that for n-qubit random quantum circuits with depth L = O(poly(n)), the expectation of the derivative of the objective function is zero, and the variance of the derivative vanishes to zero at a rate exponential in the number of qubits n. Later, Cerezo et al. (2020) proved that for L-depth quantum circuits consisting of 2-design gates, the gradient with local observables vanishes at the rate O(2^{−O(L)}). The result implies that in the low-depth case L = O(log n), the vanishing rate could be O(1/poly(n)), which is better than the previous exponential results. Recently, some techniques have been proposed to address the barren plateaus problem, including a special initialization strategy (Grant et al., 2019) and a layerwise training method (Skolik et al., 2020). We remark that these techniques rely on the assumption of low-depth quantum circuits. Specifically, Grant et al. (2019) initialize parameters such that the initial quantum circuit is equivalent to an identity matrix (L = 0). Skolik et al. (2020) train parameters in subsets in each layer, so that a low-depth circuit is optimized during the training of each subset of parameters. Since random quantum circuits tend to be approximately unitary 2-designs¹ as the circuit depth increases (Harrow & Low, 2009), and 2-design circuits lead to exponentially vanishing gradients (McClean et al., 2018), a natural idea is to consider circuits with special structures. On the other hand, tensor networks with hierarchical structures have been shown to have an inherent relationship with classical neural networks (Liu et al., 2019; Hayashi et al., 2019). Recently, quantum classifiers using hierarchical structure QNNs have been explored (Grant et al., 2018), including the tree tensor network and the multi-scale entanglement renormalization ansatz.
Besides, QNNs with dissipative layers have shown the ability to avoid barren plateaus (Beer et al., 2020). However, theoretical analysis of the trainability of QNNs with specific layer structures has been little explored (Sharma et al., 2020). Moreover, the 2-design assumption in existing theoretical analyses (McClean et al., 2018; Cerezo et al., 2020; Sharma et al., 2020) is hard to implement exactly on near-term quantum devices.

3. QUANTUM NEURAL NETWORKS

In this section, we discuss quantum neural networks in detail. Specifically, the optimization of QNNs is presented in Section 3.1. We analyze QNNs with special structures in Section 3.2. We introduce an approximate quantum input model in Section 3.3, which helps in deriving further theoretical bounds.

3.1. THE OPTIMIZATION OF QUANTUM NEURAL NETWORKS

In this subsection, we introduce the gradient-based strategy for optimizing QNNs. Like the weight matrix in classical neural networks, a QNN involves a parameterized quantum circuit that mathematically equals a parameterized unitary matrix V(θ). The training of QNNs aims to optimize the function f defined as:

f(θ; ρ_in) = 1/2 + 1/2 Tr[O · V(θ) · ρ_in · V(θ)†] = 1/2 + 1/2 ⟨O⟩_{V(θ), ρ_in},  (1)

where O denotes the quantum observable and ρ_in denotes the density matrix of the input quantum state. Generally, we could deploy the parameters θ in a tunable quantum circuit arbitrarily. A practical tactic is to encode parameters {θ_j} as the phases of single-qubit gates {e^{−iθ_j σ_k}, k ∈ {1, 2, 3}} while employing two-qubit gates {CNOT, CZ} among them to generate quantum entanglement. This strategy has been frequently used in existing quantum circuit algorithms (Schuld et al., 2020; Benedetti et al., 2019; Du et al., 2020b) since it suits noisy near-term quantum computers. Under the single-qubit phase encoding case, the partial derivative of the function f can be calculated using the parameter shifting rule (Crooks, 2019):

∂f/∂θ_j = 1/2 ⟨O⟩_{V(θ_+), ρ_in} − 1/2 ⟨O⟩_{V(θ_−), ρ_in} = f(θ_+; ρ_in) − f(θ_−; ρ_in),  (2)

where θ_+ and θ_− differ from θ only at the j-th parameter: θ_j → θ_j ± π/4. Thus, the gradient of f can be obtained by estimating quantum observables, which allows employing quantum computers for fast optimization using stochastic gradient descent.
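The π/4-shift rule above can be sanity-checked on a tiny circuit. A minimal numpy sketch (a hypothetical two-qubit circuit V = CNOT · (R_Y ⊗ R_Y), not one from the paper): the shifted difference f(θ_+) − f(θ_−) matches a finite-difference derivative.

```python
import numpy as np

I2 = np.eye(2)
Z = np.diag([1.0, -1.0])
CNOT = np.array([[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]], dtype=float)

def ry(t):
    # e^{-i t σ_2} is a real rotation matrix
    return np.array([[np.cos(t), -np.sin(t)], [np.sin(t), np.cos(t)]])

def f(theta):
    # f(θ) = 1/2 + 1/2 <00| V(θ)† (σ_3 ⊗ I) V(θ) |00>, with V = CNOT (RY ⊗ RY)
    V = CNOT @ np.kron(ry(theta[0]), ry(theta[1]))
    psi = V[:, 0]                                    # V|00> = first column of V
    return 0.5 + 0.5 * (psi @ np.kron(Z, I2) @ psi)

theta = np.array([0.3, 1.1])
shift = np.array([np.pi / 4, 0.0])
grad_ps = f(theta + shift) - f(theta - shift)        # parameter-shift rule (2)
eps = 1e-6
grad_fd = (f(theta + [eps, 0.0]) - f(theta - [eps, 0.0])) / (2 * eps)
assert np.isclose(grad_ps, grad_fd, atol=1e-5)
```

The rule is exact here because every parameter enters through e^{−iθσ_2}, whose generator has eigenvalues ±1, so f is a degree-one trigonometric polynomial in 2θ_j.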

3.2. QUANTUM NEURAL NETWORKS WITH SPECIAL STRUCTURES

In this subsection, we introduce quantum neural networks with tree tensor (TT) (Grant et al., 2018) and step controlled (SC) architectures. We prove that the expectation of the squared gradient 2-norm for the TT-QNN and the SC-QNN is lower bounded by Ω(1/n) and Ω(2^{−n_c}), respectively, where n_c is a parameter in the SC-QNN that is independent of the qubit number n. Moreover, the corresponding theoretical analysis does not rely on 2-design assumptions for quantum circuits.

Now we discuss the proposed quantum neural networks in detail. We consider the n-qubit QNN constructed from the single-qubit gate W_j^{(k)} = e^{−iθ_j^{(k)} σ_2} and the CNOT gate σ_1 ⊗ |1⟩⟨1| + σ_0 ⊗ |0⟩⟨0|, where θ_j^{(k)} denotes the k-th parameter in the j-th layer. We only employ R_Y rotations for single-qubit gates, since real-world data lie in the real space, while applying R_X and R_Z rotations would introduce imaginary terms to the quantum state. We demonstrate the TT-QNN in Figure 1 for the n = 4 case, which employs CNOT gates in a binary tree form to achieve quantum entanglement. The circuit of the SC-QNN can be divided into two parts: in the first part, CNOT operations are performed between adjacent qubit channels; in the second part, CNOT operations are performed between different qubit channels and the first qubit channel. An illustration of the SC-QNN is shown in Figure 2 for the n = 4 and n_c = 2 case, where n_c denotes the number of CNOT operations that directly link to the first qubit channel. The number of parameters in both the TT-QNN and the SC-QNN is 2n − 1. We consider the corresponding objective functions defined as

f_TT(θ) = 1/2 + 1/2 Tr[(σ_3 ⊗ I^{⊗(n−1)}) V_TT(θ) ρ_in V_TT(θ)†],  (3)
f_SC(θ) = 1/2 + 1/2 Tr[(σ_3 ⊗ I^{⊗(n−1)}) V_SC(θ) ρ_in V_SC(θ)†],  (4)

where V_TT(θ) and V_SC(θ) denote the parameterized quantum circuit operations of the TT-QNN and the SC-QNN, respectively. We employ the observable σ_3 ⊗ I^{⊗(n−1)} in Eqs.
(3-4) such that the objective functions can be easily estimated by measuring the first qubit of the corresponding quantum circuits.

The main results of this section are stated in Theorem 3.1, in which we prove Ω(1/n) and Ω(2^{−n_c}) lower bounds on the expectation of the squared gradient norm for the TT-QNN and the SC-QNN, respectively. By setting n_c = O(log n), we obtain a linear inverse bound for the SC-QNN as well. We provide the proof of Theorem 3.1 in Appendix D and Appendix G.

Figure 1: Quantum Neural Network with the Tree Tensor structure (n = 4).

Figure 2: Quantum Neural Network with the Step Controlled structure (n = 4, n_c = 2).

Theorem 3.1. Consider the n-qubit TT-QNN and the n-qubit SC-QNN defined in Figures 1-2 and the corresponding objective functions f_TT and f_SC defined in Eqs. (3-4). Then we have:

(1 + log n)/(2n) · α(ρ_in) ≤ E_θ ‖∇_θ f_TT‖² ≤ 2n − 1,  (5)
(1 + n_c)/2^{1+n_c} · α(ρ_in) ≤ E_θ ‖∇_θ f_SC‖² ≤ 2n − 1,  (6)

where n_c is the number of CNOT operations that directly link to the first qubit channel in the SC-QNN, the expectation is taken over all parameters in θ with uniform distributions in [0, 2π], ρ_in ∈ C^{2^n × 2^n} denotes the input state, α(ρ_in) = Tr[σ_{(1,0,...,0)} · ρ_in]² + Tr[σ_{(3,0,...,0)} · ρ_in]², and σ_{(i_1, i_2, ..., i_n)} ≡ σ_{i_1} ⊗ σ_{i_2} ⊗ ... ⊗ σ_{i_n}.

From a geometric view, the value E_θ ‖∇_θ f‖² characterizes the global steepness of the function surface in the parameter space. Optimizing the objective function f using gradient-based methods could be hard if the norm of the gradient vanishes to zero. Thus, the lower bounds in Eqs. (5-6) provide a theoretical guarantee on optimizing the corresponding functions, which then ensures the trainability of QNNs on related machine learning tasks. From the technical view, we provide a new theoretical framework in proving Eqs. (5-6).
Different from existing works (McClean et al., 2018; Grant et al., 2019; Cerezo et al., 2020) that define the expectation as the average over a finite unitary 2-design group, we consider the uniform distribution in which each parameter in θ varies continuously in [0, 2π]. Our assumption suits quantum circuits that encode the parameters in the phases of single-qubit rotations. Moreover, the result in Eq. (6) gives the first proven guarantee on the trainability of a QNN with linear depth. Our framework could be extensively employed for analyzing QNNs with other structures as future work.
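The bound of Theorem 3.1 can be probed numerically for n = 4. A sketch under stated assumptions: we read the TT-QNN wiring from Figure 1 (R_Y on every qubit; CNOTs with qubit 2 controlling qubit 1 and qubit 4 controlling qubit 3; R_Y on qubits 1 and 3; a CNOT with qubit 3 controlling qubit 1; a final R_Y on qubit 1 — our reading of the figure, with 0-based indices below), take ρ_in = |0⟩⟨0|^{⊗4} so that α(ρ_in) = 1, and Monte-Carlo-estimate E_θ ‖∇_θ f_TT‖², which should land between (1 + log 4)/8 and 2n − 1 = 7.

```python
import numpy as np

I2 = np.eye(2)
Z = np.diag([1.0, -1.0])

def ry(t):
    return np.array([[np.cos(t), -np.sin(t)], [np.sin(t), np.cos(t)]])

def op(gate, qubit, n=4):
    # embed a single-qubit gate on `qubit` (0-indexed) into n qubits
    out = np.array([[1.0]])
    for q in range(n):
        out = np.kron(out, gate if q == qubit else I2)
    return out

def cnot(c, t, n=4):
    # CNOT with control c and target t, embedded into n qubits
    P0, P1 = np.diag([1.0, 0.0]), np.diag([0.0, 1.0])
    X = np.array([[0.0, 1.0], [1.0, 0.0]])
    t0, t1 = np.array([[1.0]]), np.array([[1.0]])
    for q in range(n):
        t0 = np.kron(t0, P0 if q == c else I2)
        t1 = np.kron(t1, P1 if q == c else (X if q == t else I2))
    return t0 + t1

def f_tt(theta, psi0):
    # our reading of Figure 1: tree-structured CNOTs, 2n - 1 = 7 RY gates
    U = op(ry(theta[0]), 0) @ op(ry(theta[1]), 1) @ op(ry(theta[2]), 2) @ op(ry(theta[3]), 3)
    U = cnot(1, 0) @ cnot(3, 2) @ U
    U = op(ry(theta[4]), 0) @ op(ry(theta[5]), 2) @ U
    U = cnot(2, 0) @ U
    U = op(ry(theta[6]), 0) @ U
    psi = U @ psi0
    return 0.5 + 0.5 * (psi @ op(Z, 0) @ psi)

def grad(theta, psi0):
    g = np.zeros(7)
    for j in range(7):
        e = np.zeros(7); e[j] = np.pi / 4
        g[j] = f_tt(theta + e, psi0) - f_tt(theta - e, psi0)   # parameter shift
    return g

rng = np.random.default_rng(0)
psi0 = np.zeros(16); psi0[0] = 1.0          # ρ_in = |0><0|^{⊗4}, so α(ρ_in) = 1
est = np.mean([np.sum(grad(rng.uniform(0, 2 * np.pi, 7), psi0) ** 2)
               for _ in range(200)])
assert 0.0 < est < 7.0                      # inside the Theorem 3.1 window
```

The assertion only checks the coarse window 0 < est < 2n − 1; the Monte-Carlo mean should also sit above the theorem's lower bound of 0.375, up to sampling error.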

3.3. PREPARING THE QUANTUM INPUT MODEL: A VARIATIONAL CIRCUIT APPROACH

State preparation is an essential part of most quantum algorithms, encoding classical information into quantum states. Specifically, we adopt the amplitude encoding |x⟩ = (1/‖x‖) Σ_{j=1}^{2^n} x_j |j⟩.

Algorithm 1 Quantum Input Model
Require: The input vector x_in ∈ R^{2^n}, the number of alternating layers L, the iteration time T, and the learning rates {η(t)}_{t=0}^{T−1}.
Ensure: A parameter vector β* which tunes the approximate encoding circuit U(β*).
1: Initialize {β_j^{(k)}}_{j,k=1}^{n, 2L+1} randomly in [0, 2π]. Denote the parameter vector as β^{(0)}.
2: for t ∈ {0, 1, ..., T−1} do
3:   Run the circuit in Figure 3 classically to calculate the gradient ∇_β f_input |_{β=β^{(t)}} using the parameter shifting rule (2), where the function f_input is defined in (7).
4:   Update the parameter β^{(t+1)} = β^{(t)} − η(t) · ∇_β f_input |_{β=β^{(t)}}.
5: end for
6: Output the trained parameter β*.

Figure 3: The alternating layered circuit W(β).

Figure 4: The encoding circuit U(β*) = W(β*)† · X^{⊗n}.

In this subsection, we introduce a quantum input model for approximately encoding an arbitrary vector x_in ∈ R^{2^n} in the amplitudes of the quantum state |x_in⟩. The main idea is to classically train an alternating layered circuit, as summarized in Algorithm 1 and Figures 3-4. Now we explain the details of the input model. First, we randomly initialize the parameter β^{(0)} in the circuit of Figure 3. Then, we train the parameters to minimize the objective function defined in (7) through gradient descent:

f_input(β) = (1/n) Σ_{i=1}^{n} ⟨O_i⟩_{W(β)|x_in⟩} = (1/n) Σ_{i=1}^{n} (1/‖x_in‖²) Tr[O_i · W(β) · x_in x_in^T · W(β)†],  (7)

where O_i = σ_0^{⊗(i−1)} ⊗ σ_3 ⊗ σ_0^{⊗(n−i)}, ∀i ∈ [n], and W(β) denotes the tunable alternating layered circuit in Figure 3. Note that although the framework is given in the quantum circuit language, we actually calculate and update the gradient on classical computers by treating each quantum gate operation as a matrix multiplication. The output of Algorithm 1 is the trained parameter vector β*, which tunes the unitary W(β*). The corresponding encoding circuit can then be implemented as U(β*) = W(β*)† · X^{⊗n}, which is low-depth and appropriate for near-term quantum computers. The structures of the circuits W and U are illustrated in Figures 3 and 4, respectively. Suppose we could minimize the objective function (7) to −1. Then ⟨O_i⟩ = −1, ∀i ∈ [n], which means the final output state in Figure 3 equals W(β*)|x_in⟩ = |1⟩^{⊗n}. Thus the state |x_in⟩ could be prepared exactly by applying the circuit U(β*) = W(β*)† X^{⊗n} to the state |0⟩^{⊗n}.
However, we cannot always optimize the loss in Eq. (7) to −1, which means the framework can only prepare the amplitude encoding state approximately.
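The encoding identity behind Algorithm 1 can be verified in isolation. A minimal numpy sketch (n = 2): if some circuit W maps |x_in⟩ to |1...1⟩, then U = W† X^{⊗n} prepares |x_in⟩ from |0...0⟩. Here a Householder reflection is a hypothetical stand-in for the trained alternating-layer circuit W(β*), chosen only because it visibly maps x_in to |11⟩.

```python
import numpy as np

X = np.array([[0.0, 1.0], [1.0, 0.0]])
x_in = np.array([0.5, 0.5, 0.5, 0.5])          # normalized amplitude vector
target = np.zeros(4); target[3] = 1.0          # |11>

# Householder reflection W with W x_in = |11> (stand-in for W(β*))
v = x_in - target
W = np.eye(4) - 2 * np.outer(v, v) / (v @ v)
assert np.allclose(W @ x_in, target)

U = W.T @ np.kron(X, X)                        # U = W† X^{⊗2}
e0 = np.zeros(4); e0[0] = 1.0                  # |00>
assert np.allclose(U @ e0, x_in)               # U|00> = |x_in>
```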

Algorithm 2 QNNs for Binary Classification Training

Require: Quantum input states {ρ_i^train}_{i=1}^{S} for the dataset {(x_i^train, y_i)}_{i=1}^{S}, the quantum observable O, the parameterized quantum circuit V(θ), the iteration time T, the batch size s, and learning rates {η_θ(t)}_{t=0}^{T−1} and {η_b(t)}_{t=0}^{T−1}.
Ensure: The trained parameters θ* and b*.
1: Initialize each parameter in θ^{(0)} randomly in [0, 2π] and initialize b^{(0)} = 0.
2: for t ∈ {0, 1, ..., T−1} do
3:   Randomly sample an index subset I_t ⊂ [S] with size s. Calculate the gradients ∇_θ ℓ_{I_t}(θ, b)|_{θ,b=θ^{(t)},b^{(t)}} and ∇_b ℓ_{I_t}(θ, b)|_{θ,b=θ^{(t)},b^{(t)}} using the chain rule and the parameter shifting rule (2), where the function ℓ_{I_t}(θ, b) is defined in (8).
4:   Update the parameters θ^{(t+1)} = θ^{(t)} − η_θ(t) · ∇_θ ℓ_{I_t}(θ, b)|_{θ,b=θ^{(t)},b^{(t)}} and b^{(t+1)} = b^{(t)} − η_b(t) · ∇_b ℓ_{I_t}(θ, b)|_{θ,b=θ^{(t)},b^{(t)}}.
5: end for
6: Output the trained parameters θ* = θ^{(T)} and b* = b^{(T)}.

An interesting result is that by employing the encoding circuit in Figure 4 to construct the input state in Section 3.2 as ρ_in = U(β)(|0⟩⟨0|)^{⊗n} U(β)†, we can bound the expectation of α(ρ_in) defined in Theorem 3.1 by a constant that relies only on the number of layers L (Theorem 3.2).

Theorem 3.2. Suppose the state ρ_in is prepared by the L-layer encoding circuit in Figure 4. Then we have

E_β α(ρ_in) ≥ 2^{−2L},

where β denotes all variational parameters in the encoding circuit, and the expectation is taken over all parameters in β with uniform distributions in [0, 2π].

We provide the proof of Theorem 3.2 and more details about the input model in Appendix E. Theorem 3.2 can be employed with Theorem 3.1 to derive Theorem 1.1, in which the lower bound on the expectation of the gradient norm is independent of the input state.
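The state-dependent constant α(ρ_in) of Theorem 3.1 is cheap to compute directly. A minimal numpy sketch (n = 2, our own helper `alpha`): for both ρ_in = |00⟩⟨00| and ρ_in = |+⟩|+⟩ the constant evaluates to 1, since exactly one of Tr[σ_{(1,0)} ρ] and Tr[σ_{(3,0)} ρ] equals 1 in each case.

```python
import numpy as np

I2 = np.eye(2)
X = np.array([[0.0, 1.0], [1.0, 0.0]])
Z = np.diag([1.0, -1.0])

def alpha(rho):
    # α(ρ) = Tr[σ_(1,0) ρ]^2 + Tr[σ_(3,0) ρ]^2 for n = 2
    return np.real(np.trace(np.kron(X, I2) @ rho)) ** 2 \
         + np.real(np.trace(np.kron(Z, I2) @ rho)) ** 2

e00 = np.zeros(4); e00[0] = 1.0
rho0 = np.outer(e00, e00)                    # ρ_in = |00><00|
assert np.isclose(alpha(rho0), 1.0)          # Tr[Xρ] = 0, Tr[Zρ] = 1

plus = np.full(4, 0.5)                       # |+>|+>
rho_plus = np.outer(plus, plus)
assert np.isclose(alpha(rho_plus), 1.0)      # Tr[Xρ] = 1, Tr[Zρ] = 0
```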

4.1. QNNS: TRAINING AND PREDICTION

In this section, we show how to train QNNs for binary classification on quantum computers. First, for the training and test data denoted as {(x_i^train, y_i)}_{i=1}^{S} and {(x_j^test, y_j)}_{j=1}^{Q}, where y_i ∈ {0, 1} denotes the label, we prepare the corresponding quantum input states {ρ_i^train}_{i=1}^{S} and {ρ_j^test}_{j=1}^{Q} using the encoding circuit presented in Section 3.3. Then, we employ Algorithm 2 to train the parameters θ and the bias b via the stochastic gradient descent method for the given parameterized circuit V and the quantum observable O. The parameter update in each iteration is presented in Steps 3-4, which aims to minimize the loss defined in (8) for each input batch I_t, ∀t ∈ [T]:

ℓ_{I_t}(θ, b) = (1/|I_t|) Σ_{i∈I_t} (f(θ; ρ_i^train) − y_i + b)².  (8)

Based on the chain rule and Eq. (2), the derivatives ∂ℓ_{I_t}/∂θ_j and ∂ℓ_{I_t}/∂b can be decomposed into products of objective functions f with different variables, which can be estimated efficiently by counting quantum outputs. In practice, we calculate the value of the objective function by measuring the output state of the QNN several times and averaging the quantum outputs. After the training iterations we obtain the trained parameters θ* and b*. Denote the trained quantum circuit as V* = V(θ*). We test an input state ρ_test by calculating the objective function f(ρ_test) = 1/2 + 1/2 Tr[O · V* ρ_test (V*)†]. We classify the input ρ_test into class 0 if f(ρ_test) + b* < 1/2, or class 1 if f(ρ_test) + b* > 1/2. The time complexity of the QNN training and test can be easily derived by counting the resources for estimating all quantum observables. Denote the number of gates and parameters in the quantum circuit V(θ) as n_gate and n_para, respectively. Denote the number of measurements for estimating each quantum observable as n_train and n_test for the training and test stages, respectively.
Then, the time complexity to train QNNs is O(n_gate n_para n_train T), and the time complexity of the test using QNNs can be counted similarly.
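The batch loss (8), its parameter-shift gradients, and the thresholded decision rule above can be sketched end-to-end. A minimal numpy sketch with a hypothetical one-parameter, one-qubit stand-in for f (data loaded by R_Y(x), trained gate R_Y(θ), observable σ_3, so f = 1/2 + 1/2 cos(2(θ + x)) = cos²(θ + x)); names and the toy data are ours, not the paper's.

```python
import numpy as np

def f(theta, x):
    # stand-in for f(θ; ρ_in) on one qubit: f = cos^2(θ + x)
    return np.cos(theta + x) ** 2

def batch_loss(theta, b, xs, ys):            # Eq. (8) for one batch
    return np.mean([(f(theta, x) - y + b) ** 2 for x, y in zip(xs, ys)])

def predict(theta, b, x):                    # thresholded decision rule
    return 0 if f(theta, x) + b < 0.5 else 1

xs = np.array([0.0, 0.1, 1.4, 1.5])          # toy single-feature inputs
ys = np.array([1, 1, 0, 0])
theta, b, s = 0.3, 0.0, np.pi / 4
for _ in range(100):                         # full-batch gradient descent, lr 0.5
    res = [f(theta, x) - y + b for x, y in zip(xs, ys)]
    df = [f(theta + s, x) - f(theta - s, x) for x in xs]   # parameter shift
    theta -= 0.5 * np.mean([2 * r * d for r, d in zip(res, df)])
    b -= 0.5 * np.mean([2 * r for r in res])
assert all(predict(theta, b, x) == y for x, y in zip(xs, ys))
```

The chain rule appears in the update: each ∂ℓ/∂θ term is 2(f − y + b) times a shifted difference of f, exactly as Step 3 of Algorithm 2 prescribes.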

4.2. NUMERICAL SIMULATIONS

To analyze the practical performance of QNNs on binary classification tasks, we simulate the training and test of QNNs on the MNIST handwritten digit dataset. The 28 × 28 images are resampled into 16 × 16, 32 × 32, and 64 × 64 to fit QNNs with qubit number n ∈ {8, 10, 12}. We set the parameter in SC-QNNs as n_c = 4 for all qubit number settings. Note that, based on the tree structure, the qubit number of the original TT-QNNs is limited to powers of 2. To analyze the behavior of TT-QNNs for general qubit numbers, we modify TT-QNNs into Deformed Tree Tensor (DTT) QNNs. The gradient norm of DTT-QNNs is lower bounded by Ω(1/n), which has a similar form to that of TT-QNNs. We provide more details of DTT-QNNs in Appendix F and denote DTT-QNNs as TT-QNNs in the simulation parts (Section 4.2 and Appendix A) for convenience. We construct the encoding circuits in Section 3.3 with the number of alternating layers L = 1 for 400 training samples and 400 test samples in each class. The TT-QNN and the SC-QNN are compared to a QNN with a random structure. To make a fair comparison, we set the numbers of RY and CNOT gates in the random QNN to be the same as in the TT-QNN and the SC-QNN. The objective function of the random QNN is defined as the average of the expectations of the observable σ_3 over all qubits in the circuit. The number of training iterations is 100, the batch size is 20, and a decayed learning rate is adopted as {1.00, 0.75, 0.50, 0.25}. We set n_train = 200 and n_test = 1000 as the numbers of measurements for estimating quantum observables during the training and test stages, respectively. All experiments are simulated through the PennyLane Python package (Bergholm et al., 2020). First, we explain our results for QNNs with different qubit numbers in Figure 5. We train TT-QNNs, SC-QNNs, and Random-QNNs with the stochastic gradient descent method described in Algorithm 2 for images in the class pair (0, 2) and qubit numbers n ∈ {8, 10, 12}.
The total loss is defined as the average of the single-input losses. Second, we explain our results for QNNs on binary classification with different class pairs. We conduct the binary classification with the same hyperparameters mentioned before for 10-qubit QNNs, and the test accuracy and F1-scores for all class pairs {i, j} ⊂ {0, 1, 2, 3, 4} are provided in Table 1. The F1-0 denotes the F1-score when the former class is treated as positive, and the F1-1 denotes the F1-score for the other case. As shown in Table 1, TT-QNNs and SC-QNNs have higher test accuracy and F1-scores than Random-QNNs for all class pairs. Specifically, the test accuracy of TT-QNNs and SC-QNNs exceeds that of Random-QNNs by more than 10% for all class pairs except (0, 1), which is relatively easy to classify. In conclusion, both TT-QNNs and SC-QNNs show better trainability and accuracy on binary classification tasks compared with the random structure QNN, and all theorems are verified by the experiments. We provide more experimental details and results about the input model and other classification tasks in Appendix A.

5. CONCLUSIONS

In this work, we analyze the vanishing gradient problem in quantum neural networks. We prove that the gradient norms of n-qubit quantum neural networks with the tree tensor structure and the step controlled structure are lower bounded by Ω(1/n) and Ω(2^{−n_c}), respectively. The bounds guarantee the trainability of TT-QNNs and SC-QNNs on related machine learning tasks. Our theoretical framework requires fewer assumptions than previous works and meets the constraints on quantum neural networks for near-term quantum computers. Compared with the random structure QNN, which is known to suffer from the barren plateaus problem, both TT-QNNs and SC-QNNs show better trainability and accuracy on the binary classification task. We hope this paper can inspire future works on the trainability of QNNs with different architectures and other quantum machine learning algorithms.

The Appendix of this paper is organized as follows. We briefly introduce the notion of unitary 2-designs in Appendix B. Some useful technical lemmas are provided and proved in Appendix C. We prove Theorem 3.1 in Appendix D and Appendix G. Appendix C, Appendix D, and Appendix G form the theoretical framework for analyzing the gradient norm of QNNs. We provide the proof of Theorem 3.2 and more details about the proposed encoding model in Appendix E. All experimental details are provided in Appendix A.

A NUMERICAL SIMULATIONS

In this section, we provide more experimental details about the input model and other binary classification tasks. For a better understanding of the input model, we provide the visualization of the encoding circuit in Figure 7. We notice that the encoding circuit captures only a few features from the input data (except for the image of digit 1, which shows good results). Despite this, we obtain relatively good results on binary classification tasks that employ the mentioned encoding circuit. Apart from the binary classification tasks using QNNs equipped with the encoding circuit provided in Section 3.3, we perform some experiments in which the encoding circuit is replaced by exact amplitude encoding, which is commonly used in existing quantum machine learning algorithms. The training and test accuracy for other class pairs are summarized in Table 2. We notice that, compared with QNNs using the encoding circuit, QNNs with the exact encoding tend to have better performance.

We summarize the results of training and test accuracy, along with the F1-scores, for QNNs on the classification between different label pairs in Table 4 and Table 5, for qubit numbers 8 and 10, respectively. For all label pairs, TT-QNNs and SC-QNNs show higher training accuracy, test accuracy, and F1-scores than Random-QNNs. Moreover, most of the test accuracies of Random-QNNs drop for the same class pair when the qubit number is increased from 8 to 10, which suggests the trainability of Random-QNNs gets worse as the qubit number increases.

In this section, we simulate variants of TT-QNNs and SC-QNNs in which single-qubit gate operations are extended from {RY} to {RX, RY, RZ}. Results on the binary classification between MNIST images (0, 2) using 8-qubit QNNs are provided in Figure 9. The training loss converges to around 0.23 and 0.175 for the TT-QNN and the SC-QNN, respectively, and the test error converges to around 0.3 for both the TT-QNN and the SC-QNN.
We remark that based on results in Figures 5(a 

B NOTES ABOUT THE UNITARY 2-DESIGN

In this section, we introduce the notion of the unitary 2-design. Consider a finite gate set S = {G_i}_{i=1}^{|S|} in the d-dimensional Hilbert space. We denote U(d) as the unitary group of dimension d. We denote P_{t,t}(G) as a polynomial function of degree at most t in the matrix elements of G and at most t in the matrix elements of G†. Then we say the set S is a unitary t-design if and only if, for every function P_{t,t}(·), Eq. (9) holds:

(1/|S|) Σ_{G∈S} P_{t,t}(G) = ∫_{U(d)} dμ(G) P_{t,t}(G),  (9)

where dμ(·) denotes the Haar measure. The Haar measure dμ(·) is defined such that for any function f and any matrix K ∈ U(d),

∫_{U(d)} dμ(G) f(G) = ∫_{U(d)} dμ(G) f(KG) = ∫_{U(d)} dμ(G) f(GK).

The right side of (9) can be viewed as the average, or the expectation, of the function P_{t,t}(G) over the Haar measure. We remark that parameterized RY = e^{−iθσ_2} gates alone cannot form a universal gate set even in the single-qubit space U(2); thus quantum circuits employing only parameterized RY gates cannot form a 2-design. This is only a brief introduction to the unitary 2-design, and we refer readers to Puchała & Miszczak (2011) and Cerezo et al. (2020) for more details.
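The t-design definition can be made concrete with a textbook example not taken from this paper: the Pauli set {I, X, Y, Z} is a unitary 1-design (though not a 2-design), i.e. averaging G A G† over the set reproduces the Haar average Tr[A]/2 · I for every 2 × 2 matrix A. A minimal numpy check:

```python
import numpy as np

paulis = [np.eye(2),
          np.array([[0, 1], [1, 0]]),
          np.array([[0, -1j], [1j, 0]]),
          np.array([[1, 0], [0, -1]])]

rng = np.random.default_rng(1)
A = rng.normal(size=(2, 2)) + 1j * rng.normal(size=(2, 2))

# finite average of G A G† over the set, as on the left side of (9)
avg = sum(G @ A @ G.conj().T for G in paulis) / 4
# Haar average of G A G† over U(2): the completely depolarized Tr[A]/2 · I
assert np.allclose(avg, np.trace(A) / 2 * np.eye(2))
```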

C TECHNICAL LEMMAS

In this section we provide some technical lemmas.

Lemma C.1. Let $\mathrm{CNOT} = \sigma_0 \otimes |0\rangle\langle 0| + \sigma_1 \otimes |1\rangle\langle 1|$. Then
$$\begin{aligned}
\mathrm{CNOT}(\sigma_j \otimes \sigma_k)\mathrm{CNOT}^\dagger =\ &(\delta_{j0}+\delta_{j1})(\delta_{k0}+\delta_{k3})\,\sigma_j \otimes \sigma_k + (\delta_{j0}+\delta_{j1})(\delta_{k1}+\delta_{k2})\,\sigma_j\sigma_1 \otimes \sigma_k\\
+\ &(\delta_{j2}+\delta_{j3})(\delta_{k0}+\delta_{k3})\,\sigma_j \otimes \sigma_k\sigma_3 - (\delta_{j2}+\delta_{j3})(\delta_{k1}+\delta_{k2})\,\sigma_j\sigma_1 \otimes \sigma_k\sigma_3.
\end{aligned}$$
Further, for the case $\sigma_k = \sigma_0$,
$$\mathrm{CNOT}(\sigma_j \otimes \sigma_0)\mathrm{CNOT}^\dagger = (\delta_{j0}+\delta_{j1})\,\sigma_j \otimes \sigma_0 + (\delta_{j2}+\delta_{j3})\,\sigma_j \otimes \sigma_3.$$

Proof.

$$\begin{aligned}
\mathrm{CNOT}(\sigma_j\otimes\sigma_k)\mathrm{CNOT}^\dagger
&= \big(\sigma_0\otimes|0\rangle\langle0| + \sigma_1\otimes|1\rangle\langle1|\big)\,(\sigma_j\otimes\sigma_k)\,\big(\sigma_0\otimes|0\rangle\langle0| + \sigma_1\otimes|1\rangle\langle1|\big)\\
&= \Big(\sigma_0\otimes\tfrac{\sigma_0+\sigma_3}{2} + \sigma_1\otimes\tfrac{\sigma_0-\sigma_3}{2}\Big)(\sigma_j\otimes\sigma_k)\Big(\sigma_0\otimes\tfrac{\sigma_0+\sigma_3}{2} + \sigma_1\otimes\tfrac{\sigma_0-\sigma_3}{2}\Big)\\
&= \tfrac14\big(\sigma_j\otimes\sigma_k + \sigma_1\sigma_j\sigma_1\otimes\sigma_k + \sigma_j\otimes\sigma_3\sigma_k\sigma_3 + \sigma_1\sigma_j\sigma_1\otimes\sigma_3\sigma_k\sigma_3\big)\\
&\quad + \tfrac14\big(\sigma_j\sigma_1\otimes\sigma_k + \sigma_1\sigma_j\otimes\sigma_k - \sigma_j\sigma_1\otimes\sigma_3\sigma_k\sigma_3 - \sigma_1\sigma_j\otimes\sigma_3\sigma_k\sigma_3\big)\\
&\quad + \tfrac14\big(\sigma_j\otimes\sigma_k\sigma_3 + \sigma_j\otimes\sigma_3\sigma_k - \sigma_1\sigma_j\sigma_1\otimes\sigma_k\sigma_3 - \sigma_1\sigma_j\sigma_1\otimes\sigma_3\sigma_k\big)\\
&\quad + \tfrac14\big(\sigma_j\sigma_1\otimes\sigma_3\sigma_k - \sigma_j\sigma_1\otimes\sigma_k\sigma_3 + \sigma_1\sigma_j\otimes\sigma_k\sigma_3 - \sigma_1\sigma_j\otimes\sigma_3\sigma_k\big)\\
&= (\delta_{j0}+\delta_{j1})(\delta_{k0}+\delta_{k3})\,\sigma_j\otimes\sigma_k + (\delta_{j0}+\delta_{j1})(\delta_{k1}+\delta_{k2})\,\sigma_j\sigma_1\otimes\sigma_k\\
&\quad + (\delta_{j2}+\delta_{j3})(\delta_{k0}+\delta_{k3})\,\sigma_j\otimes\sigma_k\sigma_3 - (\delta_{j2}+\delta_{j3})(\delta_{k1}+\delta_{k2})\,\sigma_j\sigma_1\otimes\sigma_k\sigma_3.
\end{aligned}$$
For the case $\sigma_k = \sigma_0$, we have
$$\mathrm{CNOT}(\sigma_j\otimes\sigma_0)\mathrm{CNOT}^\dagger = (\delta_{j0}+\delta_{j1})\,\sigma_j\otimes\sigma_0 + (\delta_{j2}+\delta_{j3})\,\sigma_j\otimes\sigma_3. \qquad\square$$

Lemma C.2. Let $\mathrm{CZ} = \sigma_0\otimes|0\rangle\langle0| + \sigma_3\otimes|1\rangle\langle1|$. Then
$$\begin{aligned}
\mathrm{CZ}(\sigma_j\otimes\sigma_k)\mathrm{CZ}^\dagger =\ &(\delta_{j0}+\delta_{j3})(\delta_{k0}+\delta_{k3})\,\sigma_j\otimes\sigma_k + (\delta_{j0}+\delta_{j3})(\delta_{k1}+\delta_{k2})\,\sigma_j\sigma_3\otimes\sigma_k\\
+\ &(\delta_{j1}+\delta_{j2})(\delta_{k0}+\delta_{k3})\,\sigma_j\otimes\sigma_k\sigma_3 - (\delta_{j1}+\delta_{j2})(\delta_{k1}+\delta_{k2})\,\sigma_j\sigma_3\otimes\sigma_k\sigma_3.
\end{aligned}$$
Further, for the case $\sigma_k = \sigma_0$,
$$\mathrm{CZ}(\sigma_j\otimes\sigma_0)\mathrm{CZ}^\dagger = (\delta_{j0}+\delta_{j3})\,\sigma_j\otimes\sigma_0 + (\delta_{j1}+\delta_{j2})\,\sigma_j\otimes\sigma_3.$$

Proof. The proof is identical to that of Lemma C.1 after replacing $\sigma_1$ on the first factor by $\sigma_3$:
$$\begin{aligned}
\mathrm{CZ}(\sigma_j\otimes\sigma_k)\mathrm{CZ}^\dagger
&= \Big(\sigma_0\otimes\tfrac{\sigma_0+\sigma_3}{2} + \sigma_3\otimes\tfrac{\sigma_0-\sigma_3}{2}\Big)(\sigma_j\otimes\sigma_k)\Big(\sigma_0\otimes\tfrac{\sigma_0+\sigma_3}{2} + \sigma_3\otimes\tfrac{\sigma_0-\sigma_3}{2}\Big)\\
&= \tfrac14\big(\sigma_j\otimes\sigma_k + \sigma_3\sigma_j\sigma_3\otimes\sigma_k + \sigma_j\otimes\sigma_3\sigma_k\sigma_3 + \sigma_3\sigma_j\sigma_3\otimes\sigma_3\sigma_k\sigma_3\big)\\
&\quad + \tfrac14\big(\sigma_j\sigma_3\otimes\sigma_k + \sigma_3\sigma_j\otimes\sigma_k - \sigma_j\sigma_3\otimes\sigma_3\sigma_k\sigma_3 - \sigma_3\sigma_j\otimes\sigma_3\sigma_k\sigma_3\big)\\
&\quad + \tfrac14\big(\sigma_j\otimes\sigma_k\sigma_3 + \sigma_j\otimes\sigma_3\sigma_k - \sigma_3\sigma_j\sigma_3\otimes\sigma_k\sigma_3 - \sigma_3\sigma_j\sigma_3\otimes\sigma_3\sigma_k\big)\\
&\quad + \tfrac14\big(\sigma_j\sigma_3\otimes\sigma_3\sigma_k - \sigma_j\sigma_3\otimes\sigma_k\sigma_3 + \sigma_3\sigma_j\otimes\sigma_k\sigma_3 - \sigma_3\sigma_j\otimes\sigma_3\sigma_k\big)\\
&= (\delta_{j0}+\delta_{j3})(\delta_{k0}+\delta_{k3})\,\sigma_j\otimes\sigma_k + (\delta_{j0}+\delta_{j3})(\delta_{k1}+\delta_{k2})\,\sigma_j\sigma_3\otimes\sigma_k\\
&\quad + (\delta_{j1}+\delta_{j2})(\delta_{k0}+\delta_{k3})\,\sigma_j\otimes\sigma_k\sigma_3 - (\delta_{j1}+\delta_{j2})(\delta_{k1}+\delta_{k2})\,\sigma_j\sigma_3\otimes\sigma_k\sigma_3.
\end{aligned}$$
For the case $\sigma_k = \sigma_0$, we have
$$\mathrm{CZ}(\sigma_j\otimes\sigma_0)\mathrm{CZ}^\dagger = (\delta_{j0}+\delta_{j3})\,\sigma_j\otimes\sigma_0 + (\delta_{j1}+\delta_{j2})\,\sigma_j\otimes\sigma_3. \qquad\square$$

Lemma C.3.
Let $\theta$ be a variable with uniform distribution in $[0,2\pi]$. Let $A, C: \mathcal{H}_2 \to \mathcal{H}_2$ be arbitrary linear operators and let $B = D = \sigma_j$ be an arbitrary Pauli matrix, $j \in \{0,1,2,3\}$. Then
$$\mathbb{E}_\theta\, \mathrm{Tr}[WAW^\dagger B]\,\mathrm{Tr}[WCW^\dagger D] = \frac{1}{2\pi}\int_0^{2\pi} \mathrm{Tr}[WAW^\dagger B]\,\mathrm{Tr}[WCW^\dagger D]\, d\theta \qquad (10)$$
$$= \Big(\frac12 + \frac{\delta_{j0}+\delta_{jk}}{2}\Big)\mathrm{Tr}[AB]\,\mathrm{Tr}[CD] + \Big(-\frac12 + \frac{\delta_{j0}+\delta_{jk}}{2}\Big)\mathrm{Tr}[AB\sigma_k]\,\mathrm{Tr}[CD\sigma_k], \qquad (11)$$
where $W = e^{-i\theta\sigma_k}$ and $k \in \{1,2,3\}$.

Proof. First we simply substitute $W = e^{-i\theta\sigma_k} = I\cos\theta - i\sigma_k\sin\theta$:
$$\begin{aligned}
&\frac{1}{2\pi}\int_0^{2\pi} d\theta\, \mathrm{Tr}[WAW^\dagger B]\,\mathrm{Tr}[WCW^\dagger D]\\
&= \frac{1}{2\pi}\int_0^{2\pi} d\theta\, \mathrm{Tr}\big[(I\cos\theta - i\sigma_k\sin\theta)A(I\cos\theta + i\sigma_k\sin\theta)B\big]\cdot \mathrm{Tr}\big[(I\cos\theta - i\sigma_k\sin\theta)C(I\cos\theta + i\sigma_k\sin\theta)D\big]\\
&= \frac{1}{2\pi}\int_0^{2\pi} d\theta\, \big(\cos^2\theta\,\mathrm{Tr}[AB] - i\sin\theta\cos\theta\,\mathrm{Tr}[\sigma_kAB] + i\sin\theta\cos\theta\,\mathrm{Tr}[A\sigma_kB] + \sin^2\theta\,\mathrm{Tr}[\sigma_kA\sigma_kB]\big)\\
&\qquad\qquad \times \big(\cos^2\theta\,\mathrm{Tr}[CD] - i\sin\theta\cos\theta\,\mathrm{Tr}[\sigma_kCD] + i\sin\theta\cos\theta\,\mathrm{Tr}[C\sigma_kD] + \sin^2\theta\,\mathrm{Tr}[\sigma_kC\sigma_kD]\big).
\end{aligned}$$
We remark that
$$\frac{1}{2\pi}\int_0^{2\pi}\cos^4\theta\, d\theta = \frac38, \qquad \frac{1}{2\pi}\int_0^{2\pi}\sin^4\theta\, d\theta = \frac38, \qquad \frac{1}{2\pi}\int_0^{2\pi}\cos^2\theta\sin^2\theta\, d\theta = \frac18, \qquad (12\text{--}14)$$
$$\frac{1}{2\pi}\int_0^{2\pi}\cos^3\theta\sin\theta\, d\theta = 0, \qquad \frac{1}{2\pi}\int_0^{2\pi}\cos\theta\sin^3\theta\, d\theta = 0. \qquad (15\text{--}16)$$

Then
$$\begin{aligned}
\text{the integral} &= \frac38\mathrm{Tr}[AB]\mathrm{Tr}[CD] + \frac38\mathrm{Tr}[\sigma_kA\sigma_kB]\mathrm{Tr}[\sigma_kC\sigma_kD] + \frac18\mathrm{Tr}[AB]\mathrm{Tr}[\sigma_kC\sigma_kD] + \frac18\mathrm{Tr}[\sigma_kA\sigma_kB]\mathrm{Tr}[CD]\\
&\quad - \frac18\mathrm{Tr}[\sigma_kAB]\mathrm{Tr}[\sigma_kCD] - \frac18\mathrm{Tr}[A\sigma_kB]\mathrm{Tr}[C\sigma_kD] + \frac18\mathrm{Tr}[\sigma_kAB]\mathrm{Tr}[C\sigma_kD] + \frac18\mathrm{Tr}[A\sigma_kB]\mathrm{Tr}[\sigma_kCD]\\
&= \mathrm{Tr}[AB]\mathrm{Tr}[CD]\Big(\frac12 + \frac{\delta_{j0}+\delta_{jk}}{2}\Big) + \mathrm{Tr}[AB\sigma_k]\mathrm{Tr}[CD\sigma_k]\Big(-\frac12 + \frac{\delta_{j0}+\delta_{jk}}{2}\Big).
\end{aligned}$$
The last equality follows by noticing that for $B = \sigma_j$,
$$\mathrm{Tr}[\sigma_kA\sigma_kB] = \mathrm{Tr}[A\sigma_k\sigma_j\sigma_k] = \big[2(\delta_{j0}+\delta_{jk})-1\big]\mathrm{Tr}[A\sigma_j] = \big[2(\delta_{j0}+\delta_{jk})-1\big]\mathrm{Tr}[AB],$$
$$\mathrm{Tr}[\sigma_kAB] - \mathrm{Tr}[A\sigma_kB] = \mathrm{Tr}[A\sigma_j\sigma_k] - \mathrm{Tr}[A\sigma_k\sigma_j] = 2(1-\delta_{j0}-\delta_{jk})\,\mathrm{Tr}[A\sigma_j\sigma_k] = 2(1-\delta_{j0}-\delta_{jk})\,\mathrm{Tr}[AB\sigma_k],$$
and the analogous identities hold for $D = \sigma_j$ with $C$ in place of $A$. $\square$

Lemma C.4. Let $\theta$ be a variable with uniform distribution in $[0,2\pi]$. Let $A, C: \mathcal{H}_2 \to \mathcal{H}_2$ be arbitrary linear operators and let $B = D = \sigma_j$ be an arbitrary Pauli matrix, $j \in \{0,1,2,3\}$. Then
$$\mathbb{E}_\theta\, \mathrm{Tr}[GAW^\dagger B]\,\mathrm{Tr}[GCW^\dagger D] = \frac{1}{2\pi}\int_0^{2\pi} \mathrm{Tr}[GAW^\dagger B]\,\mathrm{Tr}[GCW^\dagger D]\, d\theta$$
$$= \Big(\frac12 - \frac{\delta_{j0}+\delta_{jk}}{2}\Big)\mathrm{Tr}[AB]\,\mathrm{Tr}[CD] + \Big(-\frac12 - \frac{\delta_{j0}+\delta_{jk}}{2}\Big)\mathrm{Tr}[AB\sigma_k]\,\mathrm{Tr}[CD\sigma_k],$$
where $W = e^{-i\theta\sigma_k}$, $G = \partial W/\partial\theta$, and $k \in \{1,2,3\}$.

Proof. First we simply substitute $W = e^{-i\theta\sigma_k} = I\cos\theta - i\sigma_k\sin\theta$ and $G = \partial W/\partial\theta = -I\sin\theta - i\sigma_k\cos\theta$:
$$\begin{aligned}
&\frac{1}{2\pi}\int_0^{2\pi} d\theta\, \mathrm{Tr}[GAW^\dagger B]\,\mathrm{Tr}[GCW^\dagger D]\\
&= \frac{1}{2\pi}\int_0^{2\pi} d\theta\, \big(-\sin\theta\cos\theta\,\mathrm{Tr}[AB] - i\cos^2\theta\,\mathrm{Tr}[\sigma_kAB] - i\sin^2\theta\,\mathrm{Tr}[A\sigma_kB] + \cos\theta\sin\theta\,\mathrm{Tr}[\sigma_kA\sigma_kB]\big)\\
&\qquad\qquad \times \big(-\sin\theta\cos\theta\,\mathrm{Tr}[CD] - i\cos^2\theta\,\mathrm{Tr}[\sigma_kCD] - i\sin^2\theta\,\mathrm{Tr}[C\sigma_kD] + \cos\theta\sin\theta\,\mathrm{Tr}[\sigma_kC\sigma_kD]\big).
\end{aligned}$$
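Lemmas C.1 and C.3 can be verified numerically. The sketch below (an illustrative check, not part of the paper) tests the CNOT conjugation identity for all Pauli index pairs $(j,k)$, and checks the averaged-trace formula of Lemma C.3 for random operators $A, C$ by averaging over an equally spaced grid of $\theta$ (exact for this trigonometric integrand):

```python
import numpy as np

# Pauli matrices sigma_0 .. sigma_3
s = [np.eye(2, dtype=complex),
     np.array([[0, 1], [1, 0]], dtype=complex),
     np.array([[0, -1j], [1j, 0]], dtype=complex),
     np.array([[1, 0], [0, -1]], dtype=complex)]

d = lambda a, b: float(a == b)  # Kronecker delta

# --- Lemma C.1: CNOT = sigma_0 (x) |0><0| + sigma_1 (x) |1><1| ---
CNOT = np.kron(s[0], np.diag([1.0, 0.0])) + np.kron(s[1], np.diag([0.0, 1.0]))
for j in range(4):
    for k in range(4):
        lhs = CNOT @ np.kron(s[j], s[k]) @ CNOT.conj().T
        rhs = ((d(j,0)+d(j,1)) * (d(k,0)+d(k,3)) * np.kron(s[j], s[k])
             + (d(j,0)+d(j,1)) * (d(k,1)+d(k,2)) * np.kron(s[j] @ s[1], s[k])
             + (d(j,2)+d(j,3)) * (d(k,0)+d(k,3)) * np.kron(s[j], s[k] @ s[3])
             - (d(j,2)+d(j,3)) * (d(k,1)+d(k,2)) * np.kron(s[j] @ s[1], s[k] @ s[3]))
        assert np.allclose(lhs, rhs)

# --- Lemma C.3: average of Tr[W A W^dag B] Tr[W C W^dag D] over theta ---
rng = np.random.default_rng(1)
A = rng.standard_normal((2, 2)) + 1j * rng.standard_normal((2, 2))
C = rng.standard_normal((2, 2)) + 1j * rng.standard_normal((2, 2))
thetas = np.linspace(0, 2 * np.pi, 512, endpoint=False)
for k in (1, 2, 3):
    for j in range(4):
        B = D = s[j]
        vals = []
        for t in thetas:
            W = np.cos(t) * s[0] - 1j * np.sin(t) * s[k]
            vals.append(np.trace(W @ A @ W.conj().T @ B) * np.trace(W @ C @ W.conj().T @ D))
        lhs = np.mean(vals)
        dd = d(j, 0) + d(j, k)
        rhs = ((0.5 + dd / 2) * np.trace(A @ B) * np.trace(C @ D)
             + (-0.5 + dd / 2) * np.trace(A @ B @ s[k]) * np.trace(C @ D @ s[k]))
        assert np.allclose(lhs, rhs)
print("Lemmas C.1 and C.3 verified numerically")
```

Both loops pass for every index combination, which also confirms the sign conventions used in the statements above.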
The integral above can be simplified using Eqs. (12)-(16):
$$\begin{aligned}
\text{the integral} &= \frac18\mathrm{Tr}[AB]\mathrm{Tr}[CD] + \frac18\mathrm{Tr}[\sigma_kA\sigma_kB]\mathrm{Tr}[\sigma_kC\sigma_kD] - \frac18\mathrm{Tr}[AB]\mathrm{Tr}[\sigma_kC\sigma_kD] - \frac18\mathrm{Tr}[\sigma_kA\sigma_kB]\mathrm{Tr}[CD]\\
&\quad - \frac38\mathrm{Tr}[\sigma_kAB]\mathrm{Tr}[\sigma_kCD] - \frac38\mathrm{Tr}[A\sigma_kB]\mathrm{Tr}[C\sigma_kD] - \frac18\mathrm{Tr}[\sigma_kAB]\mathrm{Tr}[C\sigma_kD] - \frac18\mathrm{Tr}[A\sigma_kB]\mathrm{Tr}[\sigma_kCD]\\
&= \mathrm{Tr}[AB]\mathrm{Tr}[CD]\Big(\frac12 - \frac{\delta_{j0}+\delta_{jk}}{2}\Big) + \mathrm{Tr}[AB\sigma_k]\mathrm{Tr}[CD\sigma_k]\Big(-\frac12 - \frac{\delta_{j0}+\delta_{jk}}{2}\Big).
\end{aligned}$$
The last equality follows by noticing that for $B = \sigma_j$,
$$\mathrm{Tr}[\sigma_kA\sigma_kB] = \mathrm{Tr}[A\sigma_k\sigma_j\sigma_k] = \big[2(\delta_{j0}+\delta_{jk})-1\big]\mathrm{Tr}[AB], \qquad \mathrm{Tr}[\sigma_kAB] + \mathrm{Tr}[A\sigma_kB] = \mathrm{Tr}[A\sigma_j\sigma_k] + \mathrm{Tr}[A\sigma_k\sigma_j] = 2(\delta_{j0}+\delta_{jk})\,\mathrm{Tr}[AB\sigma_k],$$
and the analogous identities hold for $D = \sigma_j$ with $C$ in place of $A$. $\square$

D THE PROOF OF THEOREM 3.1: THE TT PART

Now we begin the proof of Theorem 3.1.

Proof. Firstly we remark that by Lemma D.1, each partial derivative is calculated as
$$\frac{\partial f_{TT}}{\partial\theta_j^{(k)}} = \frac12\Big(\mathrm{Tr}\big[O\,V_{TT}(\theta^+)\rho_{in}V_{TT}(\theta^+)^\dagger\big] - \mathrm{Tr}\big[O\,V_{TT}(\theta^-)\rho_{in}V_{TT}(\theta^-)^\dagger\big]\Big);$$
since the expectation of the quantum observable is bounded in $[-1,1]$, the square of each partial derivative is easily bounded as $\big(\partial f_{TT}/\partial\theta_j^{(k)}\big)^2 \le 1$. By summing over the $2n-1$ parameters, we obtain
$$\|\nabla_\theta f_{TT}\|^2 = \sum_{j,k}\Big(\frac{\partial f_{TT}}{\partial\theta_j^{(k)}}\Big)^2 \le 2n-1.$$
On the other side, the lower bound can be derived as follows:
$$\mathbb{E}_\theta\|\nabla_\theta f_{TT}\|^2 \ge \sum_{j=1}^{1+\log n}\mathbb{E}_\theta\Big(\frac{\partial f_{TT}}{\partial\theta_j^{(1)}}\Big)^2 = \sum_{j=1}^{1+\log n} 4\,\mathbb{E}_\theta\Big(f_{TT}-\frac12\Big)^2 \qquad(18)$$
$$\ge \frac{1+\log n}{2n}\Big(\mathrm{Tr}\big[\sigma_{(1,0,\cdots,0)}\rho_{in}\big]^2 + \mathrm{Tr}\big[\sigma_{(3,0,\cdots,0)}\rho_{in}\big]^2\Big), \qquad(19)$$
where Eq. (18) is derived using Lemma D.2 and Eq. (19) using Lemma D.3. $\square$

Now we provide the details and proofs of Lemmas D.1, D.2, and D.3.

Lemma D.1.
Consider the objective function of the QNN defined as
$$f(\theta) = \frac{1+\langle O\rangle}{2} = \frac{1+\mathrm{Tr}[O\,V(\theta)\rho_{in}V(\theta)^\dagger]}{2},$$
where $\theta$ encodes all parameters, each of which enters the circuit as $e^{-i\theta_j\sigma_k}$, $k \in \{1,2,3\}$, $\rho_{in}$ denotes the input state, and $O$ is an arbitrary quantum observable. Then the partial derivative of the function with respect to the parameter $\theta_j$ can be calculated as
$$\frac{\partial f}{\partial\theta_j} = \frac12\Big(\mathrm{Tr}\big[O\,V(\theta^+)\rho_{in}V(\theta^+)^\dagger\big] - \mathrm{Tr}\big[O\,V(\theta^-)\rho_{in}V(\theta^-)^\dagger\big]\Big),$$
where $\theta^+ \equiv \theta + \frac{\pi}{4}e_j$ and $\theta^- \equiv \theta - \frac{\pi}{4}e_j$.

Proof. First we assume that the circuit $V(\theta)$ consists of $p$ parameters and can be written as the sequence $V(\theta) = V_p(\theta_p)\,V_{p-1}(\theta_{p-1})\cdots V_1(\theta_1)$, such that each block $V_j$ contains only one parameter. Consider the observable defined as $O' = V_{j+1}^\dagger\cdots V_p^\dagger\, O\, V_p\cdots V_{j+1}$ and the input state defined as $\rho_{in}' = V_{j-1}\cdots V_1\,\rho_{in}\,V_1^\dagger\cdots V_{j-1}^\dagger$. The parameter-shift rule (Crooks, 2019) provides a gradient calculation for the single-parameter case: for $f_j(\theta_j) = \mathrm{Tr}[O'\,U(\theta_j)\rho_{in}'U(\theta_j)^\dagger]$, the gradient is
$$\frac{df_j}{d\theta_j} = f_j\Big(\theta_j+\frac{\pi}{4}\Big) - f_j\Big(\theta_j-\frac{\pi}{4}\Big).$$
Thus, by inserting the forms of $O'$ and $\rho_{in}'$ and noticing $f = \frac12 + \frac12 f_j$, we obtain
$$\frac{\partial f}{\partial\theta_j} = \frac12\frac{df_j}{d\theta_j} = \frac12\Big(\mathrm{Tr}\big[O\,V(\theta^+)\rho_{in}V(\theta^+)^\dagger\big] - \mathrm{Tr}\big[O\,V(\theta^-)\rho_{in}V(\theta^-)^\dagger\big]\Big). \qquad\square$$

Lemma D.2. For the objective function $f_{TT}$ defined in Eq. (3), the following formula holds for every $j \in \{1,2,\cdots,1+\log n\}$:
$$\mathbb{E}_\theta\Big(\frac{\partial f_{TT}}{\partial\theta_j^{(1)}}\Big)^2 = 4\,\mathbb{E}_\theta\Big(f_{TT}-\frac12\Big)^2, \qquad(20)$$
where the expectation is taken over all parameters in $\theta$ with uniform distribution in $[0,2\pi]$.

Proof. First we rewrite the formulation of $f_{TT}$ in detail:
$$f_{TT} = \frac12 + \frac12\mathrm{Tr}\big[\sigma_{(3,0,\cdots,0)}\,V_{m+1}CX_m\cdots CX_1V_1\,\rho_{in}\,V_1^\dagger CX_1^\dagger\cdots CX_m^\dagger V_{m+1}^\dagger\big],$$
where $m = \log n$ and we denote $\sigma_{(i_1,i_2,\cdots,i_n)} \equiv \sigma_{i_1}\otimes\sigma_{i_2}\otimes\cdots\otimes\sigma_{i_n}$. The operation $V_\ell$ consists of $n\cdot 2^{1-\ell}$ single-qubit rotations $W_\ell^{(j)} = e^{-i\sigma_2\theta_\ell^{(j)}}$ acting on the $\big((j-1)\cdot 2^{\ell-1}+1\big)$-th qubit, for $j = 1,2,\cdots,n\cdot 2^{1-\ell}$.
The operation $CX_\ell$ consists of $n\cdot 2^{-\ell}$ CNOT gates, each acting on the $\big((j-1)\cdot 2^{\ell}+1\big)$-th and $\big((j-\frac12)\cdot 2^{\ell}+1\big)$-th qubits, for $j = 1,2,\cdots,n\cdot 2^{-\ell}$. Now we focus on the partial derivative of the function $f_{TT}$ with respect to the parameter $\theta_j^{(1)}$. We have
$$\begin{aligned}
\frac{\partial f_{TT}}{\partial\theta_j^{(1)}} &= \frac12\mathrm{Tr}\Big[\sigma_{(3,0,\cdots,0)}\,V_{m+1}CX_m\cdots\frac{\partial V_j}{\partial\theta_j^{(1)}}\cdots CX_1V_1\,\rho_{in}\,V_1^\dagger CX_1^\dagger\cdots V_j^\dagger\cdots CX_m^\dagger V_{m+1}^\dagger\Big] \qquad&(22)\\
&\quad + \frac12\mathrm{Tr}\Big[\sigma_{(3,0,\cdots,0)}\,V_{m+1}CX_m\cdots V_j\cdots CX_1V_1\,\rho_{in}\,V_1^\dagger CX_1^\dagger\cdots\frac{\partial V_j^\dagger}{\partial\theta_j^{(1)}}\cdots CX_m^\dagger V_{m+1}^\dagger\Big] \qquad&(23)\\
&= \mathrm{Tr}\Big[\sigma_{(3,0,\cdots,0)}\,V_{m+1}CX_m\cdots\frac{\partial V_j}{\partial\theta_j^{(1)}}\cdots CX_1V_1\,\rho_{in}\,V_1^\dagger CX_1^\dagger\cdots V_j^\dagger\cdots CX_m^\dagger V_{m+1}^\dagger\Big]. \qquad&(24)
\end{aligned}$$
Eq. (24) holds because all factors in (22) and (23) except $\rho_{in}$ are real matrices, and $\rho_{in} = \rho_{in}^\dagger$. The key idea for deriving $\mathbb{E}_\theta(\partial f_{TT}/\partial\theta_j^{(1)})^2 = 4\,\mathbb{E}_\theta(f_{TT}-\frac12)^2$ is that for the cases $B = D = \sigma_j \in \{\sigma_1,\sigma_3\}$,
$$\mathbb{E}_\theta\,\mathrm{Tr}[WAW^\dagger B]\,\mathrm{Tr}[WCW^\dagger D] = \mathbb{E}_\theta\,\mathrm{Tr}[GAW^\dagger B]\,\mathrm{Tr}[GCW^\dagger D] = \frac12\mathrm{Tr}[A\sigma_1]\mathrm{Tr}[C\sigma_1] + \frac12\mathrm{Tr}[A\sigma_3]\mathrm{Tr}[C\sigma_3].$$
Now we write the analysis in detail:
$$\mathbb{E}_\theta\Big[\Big(\frac{\partial f_{TT}}{\partial\theta_j^{(1)}}\Big)^2 - 4\Big(f_{TT}-\frac12\Big)^2\Big] \qquad(25)$$
$$= \mathbb{E}_{\theta_1}\cdots\mathbb{E}_{\theta_m}\mathbb{E}_{\theta_{m+1}^{(1)}}\mathrm{Tr}\big[\sigma_{(3,0,\cdots,0)}\,V_{m+1}CX_m\,A_m\,CX_m^\dagger V_{m+1}^\dagger\big]^2 \qquad(26)$$
$$\quad - \mathbb{E}_{\theta_1}\cdots\mathbb{E}_{\theta_m}\mathbb{E}_{\theta_{m+1}^{(1)}}\mathrm{Tr}\big[\sigma_{(3,0,\cdots,0)}\,V_{m+1}CX_m\,B_m\,CX_m^\dagger V_{m+1}^\dagger\big]^2 \qquad(27)$$
$$= \mathbb{E}_{\theta_1}\cdots\mathbb{E}_{\theta_m}\Big(\frac12\mathrm{Tr}\big[\sigma_{(3,0,\cdots,0,3,0,\cdots,0)}\,A_m\big]^2 + \frac12\mathrm{Tr}\big[\sigma_{(1,0,\cdots,0,0,0,\cdots,0)}\,A_m\big]^2\Big) \qquad(28)$$
$$\quad - \mathbb{E}_{\theta_1}\cdots\mathbb{E}_{\theta_m}\Big(\frac12\mathrm{Tr}\big[\sigma_{(3,0,\cdots,0,3,0,\cdots,0)}\,B_m\big]^2 + \frac12\mathrm{Tr}\big[\sigma_{(1,0,\cdots,0,0,0,\cdots,0)}\,B_m\big]^2\Big), \qquad(29)$$
where
$$A_m = V_mCX_{m-1}\cdots\frac{\partial V_j}{\partial\theta_j^{(1)}}\cdots CX_1V_1\,\rho_{in}\,V_1^\dagger CX_1^\dagger\cdots V_j^\dagger\cdots CX_{m-1}^\dagger V_m^\dagger,$$
$$B_m = V_mCX_{m-1}\cdots V_j\cdots CX_1V_1\,\rho_{in}\,V_1^\dagger CX_1^\dagger\cdots V_j^\dagger\cdots CX_{m-1}^\dagger V_m^\dagger,$$
Eqs. (28)-(29) are derived using the collapsed form of Lemma C.1,
$$\mathrm{CNOT}(\sigma_1\otimes\sigma_0)\mathrm{CNOT}^\dagger = \sigma_1\otimes\sigma_0, \qquad \mathrm{CNOT}(\sigma_3\otimes\sigma_0)\mathrm{CNOT}^\dagger = \sigma_3\otimes\sigma_3,$$
and $\theta_\ell$ denotes the vector consisting of all parameters in the $\ell$-th layer. The integration can likewise be performed for the parameters $\{\theta_m,\theta_{m-1},\cdots,\theta_{j+1}\}$.
It is not hard to see that after integrating the parameters $\theta_{j+1}$, the terms $\mathrm{Tr}[\sigma_iA_j]^2$ and $\mathrm{Tr}[\sigma_iB_j]^2$ have opposite coefficients. Besides, the first index of each Pauli tensor product $\sigma_{(i_1,i_2,\cdots,i_n)}$ can only be $i_1 \in \{1,3\}$ because of Lemma C.3. So we can write
$$\mathbb{E}_\theta\Big[\Big(\frac{\partial f_{TT}}{\partial\theta_j^{(1)}}\Big)^2 - 4\Big(f_{TT}-\frac12\Big)^2\Big] = \mathbb{E}_{\theta_1}\cdots\mathbb{E}_{\theta_j}\Big[\sum_{i_1\in\{1,3\}}\sum_{i_2=0}^3\cdots\sum_{i_n=0}^3 a_i\,\mathrm{Tr}[\sigma_iA_j]^2 - a_i\,\mathrm{Tr}[\sigma_iB_j]^2\Big], \qquad(30)$$
where
$$A_j = \frac{\partial V_j}{\partial\theta_j^{(1)}}CX_{j-1}\cdots CX_1V_1\,\rho_{in}\,V_1^\dagger CX_1^\dagger\cdots CX_{j-1}^\dagger V_j^\dagger = \big(G_j^{(1)}\otimes I^{\otimes(n-1)}\big)\,A_j^{/\theta_j^{(1)}}\,\big(W_j^{(1)\dagger}\otimes I^{\otimes(n-1)}\big),$$
$$B_j = V_jCX_{j-1}\cdots CX_1V_1\,\rho_{in}\,V_1^\dagger CX_1^\dagger\cdots CX_{j-1}^\dagger V_j^\dagger = \big(W_j^{(1)}\otimes I^{\otimes(n-1)}\big)\,A_j^{/\theta_j^{(1)}}\,\big(W_j^{(1)\dagger}\otimes I^{\otimes(n-1)}\big),$$
$a_i$ is the coefficient of the term $\mathrm{Tr}[\sigma_iA_j]^2$, $G_j^{(1)} = \partial W_j^{(1)}/\partial\theta_j^{(1)}$, and $A_j^{/\theta_j^{(1)}}$ denotes the part of $A_j$ that does not depend on $\theta_j^{(1)}$. By Lemma C.3 and Lemma C.4, we have
$$\mathbb{E}_{\theta_j^{(1)}}\big(\mathrm{Tr}[\sigma_iA_j]^2 - \mathrm{Tr}[\sigma_iB_j]^2\big) = 0,$$
since for the case $i_1 \in \{1,3\}$ the term $\frac{\delta_{i_10}+\delta_{i_12}}{2} = 0$, which means Lemma C.3 and Lemma C.4 take the same form. Then we derive Eq. (20). $\square$

Lemma D.3. For the loss function $f_{TT}$ defined in (3), we have
$$\mathbb{E}_\theta\Big(f_{TT}-\frac12\Big)^2 \ge \frac{\mathrm{Tr}\big[\sigma_{(1,0,\cdots,0)}\rho_{in}\big]^2 + \mathrm{Tr}\big[\sigma_{(3,0,\cdots,0)}\rho_{in}\big]^2}{8n},$$
where we denote $\sigma_{(i_1,i_2,\cdots,i_n)} \equiv \sigma_{i_1}\otimes\sigma_{i_2}\otimes\cdots\otimes\sigma_{i_n}$, and the expectation is taken over all parameters in $\theta$ with uniform distributions in $[0,2\pi]$.

Proof. First we expand the function $f_{TT}$ in detail:
$$f_{TT} = \frac12 + \frac12\mathrm{Tr}\big[\sigma_{(3,0,\cdots,0)}\,V_{m+1}CX_m\cdots CX_1V_1\,\rho_{in}\,V_1^\dagger CX_1^\dagger\cdots CX_m^\dagger V_{m+1}^\dagger\big],$$
where $m = \log n$.
Now we consider the expectation of $(f_{TT}-\frac12)^2$ under the uniform distribution of $\theta_{m+1}^{(1)}$:
$$\mathbb{E}_{\theta_{m+1}^{(1)}}\Big(f_{TT}-\frac12\Big)^2 = \frac14\,\mathbb{E}_{\theta_{m+1}^{(1)}}\mathrm{Tr}\big[\sigma_{(3,0,\cdots,0)}\,V_{m+1}CX_m\cdots CX_1V_1\,\rho_{in}\,V_1^\dagger CX_1^\dagger\cdots CX_m^\dagger V_{m+1}^\dagger\big]^2 \qquad(34)$$
$$= \frac18\mathrm{Tr}\big[\sigma_{(3,0,\cdots,0)}\,A\big]^2 + \frac18\mathrm{Tr}\big[\sigma_{(1,0,\cdots,0)}\,A\big]^2 \qquad(35)$$
$$= \frac18\mathrm{Tr}\big[\sigma_{(3,0,\cdots,0,3,0,\cdots,0)}\,A'\big]^2 + \frac18\mathrm{Tr}\big[\sigma_{(1,0,\cdots,0,0,\cdots,0)}\,A'\big]^2 \qquad(36)$$
$$\ge \frac18\mathrm{Tr}\big[\sigma_{(1,0,\cdots,0,0,\cdots,0)}\,A'\big]^2, \qquad(37)$$
where
$$A = CX_mV_m\cdots CX_1V_1\,\rho_{in}\,V_1^\dagger CX_1^\dagger\cdots V_m^\dagger CX_m^\dagger, \qquad A' = V_m\cdots CX_1V_1\,\rho_{in}\,V_1^\dagger CX_1^\dagger\cdots V_m^\dagger,$$
Eq. (35) is derived using Lemma C.3, and Eq. (36) using the collapsed form of Lemma C.1: $\mathrm{CNOT}(\sigma_1\otimes\sigma_0)\mathrm{CNOT}^\dagger = \sigma_1\otimes\sigma_0$, $\mathrm{CNOT}(\sigma_3\otimes\sigma_0)\mathrm{CNOT}^\dagger = \sigma_3\otimes\sigma_3$. We remark that during the integration of the parameters $\{\theta_\ell^{(j)}\}$ in each layer $\ell \in \{1,2,\cdots,m+1\}$, the coefficient of the term $\{\mathrm{Tr}[\sigma_{(1,0,\cdots,0)}A]\}^2$ only gains a factor $1/2$ for the case $j=1$, and remains unchanged for the cases $j>1$ (see Lemma C.3 for detail). Since the form $\{\mathrm{Tr}[\sigma_{(1,0,\cdots,0)}A]\}^2$ stays the same when merging the operation $CX_\ell$ with $\sigma_{(1,0,\cdots,0)}$, for $\ell \in \{1,2,\cdots,m\}$, we can generate the following equation:
$$\mathbb{E}_{\theta_2}\cdots\mathbb{E}_{\theta_{m+1}}\Big(f_{TT}-\frac12\Big)^2 \ge \Big(\frac12\Big)^{m+2}\mathrm{Tr}\big[\sigma_{(1,0,\cdots,0)}\,V_1\rho_{in}V_1^\dagger\big]^2, \qquad(40)$$
where $\theta_\ell$ denotes the vector consisting of all parameters in the $\ell$-th layer. Finally, using Lemma C.3 to integrate the parameters $\{\theta_1^{(j)}\}_{j=1}^n$ in (40), we obtain
$$\mathbb{E}_\theta\Big(f_{TT}-\frac12\Big)^2 = \mathbb{E}_{\theta_1}\cdots\mathbb{E}_{\theta_{m+1}}\Big(f_{TT}-\frac12\Big)^2 \qquad(41)$$
$$\ge \frac{\mathrm{Tr}\big[\sigma_{(1,0,\cdots,0)}\rho_{in}\big]^2 + \mathrm{Tr}\big[\sigma_{(3,0,\cdots,0)}\rho_{in}\big]^2}{2^{m+3}} \qquad(42)$$
$$= \frac{\mathrm{Tr}\big[\sigma_{(1,0,\cdots,0)}\rho_{in}\big]^2 + \mathrm{Tr}\big[\sigma_{(3,0,\cdots,0)}\rho_{in}\big]^2}{8n}. \qquad(43)\ \square$$

Each single-qubit rotation of the encoding model is $W_j^{(k)} = e^{-i\sigma_2\beta_j^{(k)}}$, and each single-qubit rotation layer can be written as $V_j = V_j(\beta_j) = W_j^{(1)}\otimes W_j^{(2)}\otimes\cdots\otimes W_j^{(n)}$. Finally, we can mathematically define the encoding model:
$$U(\rho_{in}) = V_{2L+1}U_LU_{L-1}\cdots U_1X^{\otimes n},$$
where $U_j = CZ_1V_{2j}CZ_2V_{2j-1}$ is the $j$-th alternating layer. By employing the encoding model illustrated in Figure 10 for the state preparation, we find that the expectation of the term $\alpha(\rho_{in})$ defined in Theorem 1.1 has a lower bound independent of the qubit number.

Theorem E.1. Suppose the input state $\rho_{in}$ is prepared by the $L$-layer encoding model illustrated in Figure 10.
Then
$$\mathbb{E}_\beta\,\alpha(\rho_{in}) = \mathbb{E}_\beta\Big(\mathrm{Tr}\big[\sigma_{(1,0,\cdots,0)}\rho_{in}\big]^2 + \mathrm{Tr}\big[\sigma_{(3,0,\cdots,0)}\rho_{in}\big]^2\Big) \ge 2^{-2L},$$
where $\beta$ denotes all variational parameters in the encoding circuit, and the expectation is taken over all parameters in $\beta$ with uniform distribution in $[0,2\pi]$.

Proof. Define $\rho_j = U_jU_{j-1}\cdots U_1X^{\otimes n}\,|0\rangle^{\otimes n}\langle0|^{\otimes n}\,X^{\otimes n}U_1^\dagger\cdots U_{j-1}^\dagger U_j^\dagger$ for $j \in \{0,1,\cdots,L\}$. We have
$$\mathbb{E}_\beta\Big(\mathrm{Tr}\big[\sigma_{(1,0,\cdots,0)}\rho_{in}\big]^2 + \mathrm{Tr}\big[\sigma_{(3,0,\cdots,0)}\rho_{in}\big]^2\Big) \qquad(44)$$
$$= \mathbb{E}_{\beta_1}\cdots\mathbb{E}_{\beta_{2L+1}}\Big(\mathrm{Tr}\big[\sigma_{(1,0,\cdots,0)}\,V_{2L+1}\rho_LV_{2L+1}^\dagger\big]^2 + \mathrm{Tr}\big[\sigma_{(3,0,\cdots,0)}\,V_{2L+1}\rho_LV_{2L+1}^\dagger\big]^2\Big) \qquad(45)$$
$$= \mathbb{E}_{\beta_1}\cdots\mathbb{E}_{\beta_{2L}}\Big(\mathrm{Tr}\big[\sigma_{(1,0,\cdots,0)}\rho_L\big]^2 + \mathrm{Tr}\big[\sigma_{(3,0,\cdots,0)}\rho_L\big]^2\Big) \qquad(46)$$
$$\ge 2^{-2L}\Big(\mathrm{Tr}\big[\sigma_{(1,0,\cdots,0)}\rho_0\big]^2 + \mathrm{Tr}\big[\sigma_{(3,0,\cdots,0)}\rho_0\big]^2\Big) \qquad(47)$$
$$= 2^{-2L}\cdot\big(0^2 + (-1)^2\big) = 2^{-2L}, \qquad(48)$$
where Eq. (45) is derived from the definition of $\rho_L$, Eq. (46) using Lemma C.3, and Eq. (47) by noticing that for each $j \in \{0,1,\cdots,L-1\}$ the following equations hold:
$$\mathbb{E}_{\beta_1}\cdots\mathbb{E}_{\beta_{2j+2}}\Big(\mathrm{Tr}\big[\sigma_{(1,0,\cdots,0)}\rho_{j+1}\big]^2 + \mathrm{Tr}\big[\sigma_{(3,0,\cdots,0)}\rho_{j+1}\big]^2\Big) \qquad(49)$$
$$= \mathbb{E}_{\beta_1}\cdots\mathbb{E}_{\beta_{2j+2}}\Big(\mathrm{Tr}\big[\sigma_{(1,0,\cdots,0)}\,U_{j+1}\rho_jU_{j+1}^\dagger\big]^2 + \mathrm{Tr}\big[\sigma_{(3,0,\cdots,0)}\,U_{j+1}\rho_jU_{j+1}^\dagger\big]^2\Big) \qquad(50)$$
$$= \mathbb{E}_{\beta_1}\cdots\mathbb{E}_{\beta_{2j+2}}\mathrm{Tr}\big[\sigma_{(1,0,\cdots,0)}\,CZ_1V_{2j+2}CZ_2V_{2j+1}\rho_jV_{2j+1}^\dagger CZ_2V_{2j+2}^\dagger CZ_1\big]^2 \qquad(51)$$
$$\quad + \mathbb{E}_{\beta_1}\cdots\mathbb{E}_{\beta_{2j+2}}\mathrm{Tr}\big[\sigma_{(3,0,\cdots,0)}\,CZ_1V_{2j+2}CZ_2V_{2j+1}\rho_jV_{2j+1}^\dagger CZ_2V_{2j+2}^\dagger CZ_1\big]^2 \qquad(52)$$
$$= \mathbb{E}_{\beta_1}\cdots\mathbb{E}_{\beta_{2j+2}}\mathrm{Tr}\big[\sigma_{(1,3,0,\cdots,0)}\,V_{2j+2}CZ_2V_{2j+1}\rho_jV_{2j+1}^\dagger CZ_2V_{2j+2}^\dagger\big]^2 \qquad(53)$$
$$\quad + \mathbb{E}_{\beta_1}\cdots\mathbb{E}_{\beta_{2j+2}}\mathrm{Tr}\big[\sigma_{(3,0,\cdots,0)}\,V_{2j+2}CZ_2V_{2j+1}\rho_jV_{2j+1}^\dagger CZ_2V_{2j+2}^\dagger\big]^2 \qquad(54)$$
$$\ge \mathbb{E}_{\beta_1}\cdots\mathbb{E}_{\beta_{2j+2}}\mathrm{Tr}\big[\sigma_{(3,0,\cdots,0)}\,V_{2j+2}CZ_2V_{2j+1}\rho_jV_{2j+1}^\dagger CZ_2V_{2j+2}^\dagger\big]^2 \qquad(55)$$
$$= \mathbb{E}_{\beta_1}\cdots\mathbb{E}_{\beta_{2j+1}}\Big(\frac12\mathrm{Tr}\big[\sigma_{(1,0,\cdots,0)}\,CZ_2V_{2j+1}\rho_jV_{2j+1}^\dagger CZ_2\big]^2 + \frac12\mathrm{Tr}\big[\sigma_{(3,0,\cdots,0)}\,CZ_2V_{2j+1}\rho_jV_{2j+1}^\dagger CZ_2\big]^2\Big) \qquad(56)$$
$$= \mathbb{E}_{\beta_1}\cdots\mathbb{E}_{\beta_{2j+1}}\Big(\frac12\mathrm{Tr}\big[\sigma_{(1,0,\cdots,0,3)}\,V_{2j+1}\rho_jV_{2j+1}^\dagger\big]^2 + \frac12\mathrm{Tr}\big[\sigma_{(3,0,\cdots,0)}\,V_{2j+1}\rho_jV_{2j+1}^\dagger\big]^2\Big) \qquad(57)$$
$$\ge \mathbb{E}_{\beta_1}\cdots\mathbb{E}_{\beta_{2j+1}}\frac12\mathrm{Tr}\big[\sigma_{(3,0,\cdots,0)}\,V_{2j+1}\rho_jV_{2j+1}^\dagger\big]^2 \qquad(58)$$
$$= \mathbb{E}_{\beta_1}\cdots\mathbb{E}_{\beta_{2j}}\frac14\Big(\mathrm{Tr}\big[\sigma_{(1,0,\cdots,0)}\rho_j\big]^2 + \mathrm{Tr}\big[\sigma_{(3,0,\cdots,0)}\rho_j\big]^2\Big), \qquad(59)$$
where Eq. (50) is derived from the definition of $\rho_{j+1}$, Eqs. (51)-(52) from the definition of $U_{j+1}$, Eqs. (53)-(54) and Eq. (57) using Lemma C.2 (note that $CZ$ is Hermitian), and Eqs. (56) and (59) using Lemma C.3. $\square$
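The chain above can be sanity-checked with a small Monte Carlo simulation. The sketch below is only illustrative: the CZ-layer qubit pairings are an assumption standing in for Figure 10 ($CZ_2$ on pairs $(1,2),(3,4)$ and $CZ_1$ on pair $(2,3)$ for $n=4$), with uniform RY angles as in Theorem E.1; the exact sample mean depends on this assumed layout, but it should stay above the theoretical bound $2^{-2L}$.

```python
import numpy as np

n, L = 4, 1
rng = np.random.default_rng(7)
I2 = np.eye(2)

def ry(b):  # e^{-i b sigma_2} is a real rotation matrix
    return np.array([[np.cos(b), -np.sin(b)], [np.sin(b), np.cos(b)]])

def kron_all(mats):
    out = np.array([[1.0]])
    for m in mats:
        out = np.kron(out, m)
    return out

def cz(a, b):  # diagonal CZ on (0-indexed) qubits a and b
    diag = np.array([-1.0 if ((i >> (n-1-a)) & 1) and ((i >> (n-1-b)) & 1) else 1.0
                     for i in range(2**n)])
    return np.diag(diag)

CZ2 = cz(0, 1) @ cz(2, 3)   # assumed layout for Figure 10
CZ1 = cz(1, 2)
SX1 = kron_all([np.array([[0., 1.], [1., 0.]])] + [I2] * (n - 1))  # sigma_1 on qubit 1
SZ1 = kron_all([np.diag([1., -1.])] + [I2] * (n - 1))              # sigma_3 on qubit 1

def alpha(psi):  # alpha(rho) for a pure state psi
    return (psi @ SX1 @ psi)**2 + (psi @ SZ1 @ psi)**2

psi0 = np.zeros(2**n)
psi0[-1] = 1.0  # X^{(x)n}|0...0> = |1...1>

samples = []
for _ in range(1500):
    psi = psi0
    for _layer in range(L):  # U_j = CZ_1 V_{2j} CZ_2 V_{2j-1}
        psi = kron_all([ry(b) for b in rng.uniform(0, 2*np.pi, n)]) @ psi
        psi = CZ2 @ psi
        psi = kron_all([ry(b) for b in rng.uniform(0, 2*np.pi, n)]) @ psi
        psi = CZ1 @ psi
    psi = kron_all([ry(b) for b in rng.uniform(0, 2*np.pi, n)]) @ psi  # final V_{2L+1}
    samples.append(alpha(psi))

print(alpha(psi0))       # 1.0 for the initial state |1...1>
print(np.mean(samples))  # stays above the 2^{-2L} = 0.25 bound for this layout
```

For this assumed layout the estimated mean of $\alpha(\rho_{in})$ sits comfortably above $2^{-2L}$, consistent with the qubit-number-independent lower bound of Theorem E.1.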

F THE DEFORMED TREE TENSOR QNN

Similar to the TT-QNN case, the objective function for the Deformed Tree Tensor (DTT) QNN is given in the form
$$f_{DTT}(\theta) = \frac12 + \frac12\mathrm{Tr}\big[\sigma_3\otimes I^{\otimes(n-1)}\,V_{DTT}(\theta)\rho_{in}V_{DTT}(\theta)^\dagger\big], \qquad(60)$$
where $V_{DTT}$ denotes the circuit operation of the DTT-QNN, illustrated in Figure 11. The lower bound result for the gradient norm of DTT-QNNs is provided in Theorem F.1.

Theorem F.1. Consider the $n$-qubit DTT-QNN defined in Figure 11 and the corresponding objective function $f_{DTT}$ defined in (60). Then we have
$$\frac{1+\lfloor\log n\rfloor}{4n}\,\alpha(\rho_{in}) \le \mathbb{E}_\theta\|\nabla_\theta f_{DTT}\|^2 \le 2n-1,$$
where the expectation is taken over all parameters in $\theta$ with uniform distributions in $[0,2\pi]$, $\rho_{in}\in\mathbb{C}^{2^n\times 2^n}$ denotes the input state, $\alpha(\rho_{in}) = \mathrm{Tr}\big[\sigma_{(1,0,\cdots,0)}\rho_{in}\big]^2 + \mathrm{Tr}\big[\sigma_{(3,0,\cdots,0)}\rho_{in}\big]^2$, and $\sigma_{(i_1,i_2,\cdots,i_n)}\equiv\sigma_{i_1}\otimes\sigma_{i_2}\otimes\cdots\otimes\sigma_{i_n}$.

Proof. Firstly we remark that by Lemma D.1, each partial derivative is calculated as
$$\frac{\partial f_{DTT}}{\partial\theta_j^{(k)}} = \frac12\Big(\mathrm{Tr}\big[O\,V_{DTT}(\theta^+)\rho_{in}V_{DTT}(\theta^+)^\dagger\big] - \mathrm{Tr}\big[O\,V_{DTT}(\theta^-)\rho_{in}V_{DTT}(\theta^-)^\dagger\big]\Big);$$
since the expectation of the quantum observable is bounded in $[-1,1]$, the square of each partial derivative is easily bounded as $\big(\partial f_{DTT}/\partial\theta_j^{(k)}\big)^2 \le 1$, which yields the upper bound. On the other side, the lower bound can be derived as follows:
$$\mathbb{E}_\theta\|\nabla_\theta f_{DTT}\|^2 \ge \sum_{j=1}^{1+\lfloor\log n\rfloor}\mathbb{E}_\theta\Big(\frac{\partial f_{DTT}}{\partial\theta_j^{(1)}}\Big)^2 \qquad(62)$$
$$= \sum_{j=1}^{1+\lfloor\log n\rfloor} 4\,\mathbb{E}_\theta\Big(f_{DTT}-\frac12\Big)^2 \qquad(63)$$
$$\ge \frac{1+\lfloor\log n\rfloor}{4n}\Big(\mathrm{Tr}\big[\sigma_{(1,0,\cdots,0)}\rho_{in}\big]^2 + \mathrm{Tr}\big[\sigma_{(3,0,\cdots,0)}\rho_{in}\big]^2\Big), \qquad(64)$$
where Eq. (63) is derived using Lemma F.1 and Eq. (64) using Lemma F.2. $\square$

Lemma F.1. For the objective function $f_{DTT}$ defined in Eq.
(60), the following formula holds for every $j \in \{1,2,\cdots,1+\lfloor\log n\rfloor\}$:
$$\mathbb{E}_\theta\Big(\frac{\partial f_{DTT}}{\partial\theta_j^{(1)}}\Big)^2 = 4\,\mathbb{E}_\theta\Big(f_{DTT}-\frac12\Big)^2, \qquad(65)$$
where the expectation is taken over all parameters in $\theta$ with uniform distribution in $[0,2\pi]$. In the proof (analogous to Lemma D.2) one sets
$$B_m = V_mCX_{m-1}\cdots V_j\cdots CX_1V_1\,\rho_{in}\,V_1^\dagger CX_1^\dagger\cdots V_j^\dagger\cdots CX_{m-1}^\dagger V_m^\dagger.$$
Eqs. (73) and (74) are derived using the collapsed form of Lemma C.1,
$$\mathrm{CNOT}(\sigma_1\otimes\sigma_0)\mathrm{CNOT}^\dagger = \sigma_1\otimes\sigma_0, \qquad \mathrm{CNOT}(\sigma_3\otimes\sigma_0)\mathrm{CNOT}^\dagger = \sigma_3\otimes\sigma_3,$$
and $\theta_\ell$ denotes the vector consisting of all parameters in the $\ell$-th layer. The integration can be performed for the parameters $\{\theta_m,\theta_{m-1},\cdots,\theta_{j+1}\}$. It is not hard to see that after integrating the parameters $\theta_{j+1}$, the terms $\mathrm{Tr}[\sigma_iA_j]^2$ and $\mathrm{Tr}[\sigma_iB_j]^2$ have opposite coefficients. Besides, the first index of each Pauli tensor product $\sigma_{(i_1,i_2,\cdots,i_n)}$ can only be $i_1\in\{1,3\}$ because of Lemma C.3. So we can write
$$\mathbb{E}_\theta\Big[\Big(\frac{\partial f_{DTT}}{\partial\theta_j^{(1)}}\Big)^2 - 4\Big(f_{DTT}-\frac12\Big)^2\Big] = \mathbb{E}_{\theta_1}\cdots\mathbb{E}_{\theta_j}\Big[\sum_{i_1\in\{1,3\}}\sum_{i_2=0}^3\cdots\sum_{i_n=0}^3 a_i\,\mathrm{Tr}[\sigma_iA_j]^2 - a_i\,\mathrm{Tr}[\sigma_iB_j]^2\Big], \qquad(75\text{--}76)$$
where
$$A_j = \frac{\partial V_j}{\partial\theta_j^{(1)}}CX_{j-1}\cdots CX_1V_1\,\rho_{in}\,V_1^\dagger CX_1^\dagger\cdots CX_{j-1}^\dagger V_j^\dagger = \big(G_j^{(1)}\otimes I^{\otimes(n-1)}\big)\,A_j^{/\theta_j^{(1)}}\,\big(W_j^{(1)\dagger}\otimes I^{\otimes(n-1)}\big),$$
$$B_j = V_jCX_{j-1}\cdots CX_1V_1\,\rho_{in}\,V_1^\dagger CX_1^\dagger\cdots CX_{j-1}^\dagger V_j^\dagger = \big(W_j^{(1)}\otimes I^{\otimes(n-1)}\big)\,A_j^{/\theta_j^{(1)}}\,\big(W_j^{(1)\dagger}\otimes I^{\otimes(n-1)}\big),$$
$a_i$ is the coefficient of the term $\mathrm{Tr}[\sigma_iA_j]^2$, and $G_j^{(1)} = \partial W_j^{(1)}/\partial\theta_j^{(1)}$.

Lemma F.2. For the loss function $f_{DTT}$ defined in (60), we have
$$\mathbb{E}_\theta\Big(f_{DTT}-\frac12\Big)^2 \ge \frac{\mathrm{Tr}\big[\sigma_{(1,0,\cdots,0)}\rho_{in}\big]^2 + \mathrm{Tr}\big[\sigma_{(3,0,\cdots,0)}\rho_{in}\big]^2}{16n}, \qquad(77)$$
where we denote $\sigma_{(i_1,i_2,\cdots,i_n)}\equiv\sigma_{i_1}\otimes\sigma_{i_2}\otimes\cdots\otimes\sigma_{i_n}$, and the expectation is taken over all parameters in $\theta$ with uniform distributions in $[0,2\pi]$.

Proof. First we expand the function $f_{DTT}$ in detail:
$$f_{DTT} = \frac12 + \frac12\mathrm{Tr}\big[\sigma_{(3,0,\cdots,0)}\,V_{m+1}CX_m\cdots CX_1V_1\,\rho_{in}\,V_1^\dagger CX_1^\dagger\cdots CX_m^\dagger V_{m+1}^\dagger\big],$$
where $m = \lfloor\log n\rfloor$.
Now we consider the expectation of $(f_{DTT}-\frac12)^2$ under the uniform distribution of $\theta_{m+1}^{(1)}$:
$$\mathbb{E}_{\theta_{m+1}^{(1)}}\Big(f_{DTT}-\frac12\Big)^2 = \frac14\,\mathbb{E}_{\theta_{m+1}^{(1)}}\mathrm{Tr}\big[\sigma_{(3,0,\cdots,0)}\,V_{m+1}CX_m\cdots CX_1V_1\,\rho_{in}\,V_1^\dagger CX_1^\dagger\cdots CX_m^\dagger V_{m+1}^\dagger\big]^2,$$
and the remaining steps follow the tree tensor case.

Lemma G.2. For the loss function $f_{SC}$ defined in (4), we have
$$\mathbb{E}_\theta\Big(f_{SC}-\frac12\Big)^2 \ge \frac{\mathrm{Tr}\big[\sigma_{(1,0,\cdots,0)}\rho_{in}\big]^2 + \mathrm{Tr}\big[\sigma_{(3,0,\cdots,0)}\rho_{in}\big]^2}{2^{3+n_c}},$$
where we denote $\sigma_{(i_1,i_2,\cdots,i_n)}\equiv\sigma_{i_1}\otimes\sigma_{i_2}\otimes\cdots\otimes\sigma_{i_n}$, and the expectation is taken over all parameters in $\theta$ with uniform distributions in $[0,2\pi]$.

Proof. We expand the function $f_{SC}$ in detail:
$$f_{SC} = \frac12 + \frac12\mathrm{Tr}\big[\sigma_{(3,0,\cdots,0)}\,V_nCX_{n-1}\cdots CX_1V_1\,\rho_{in}\,V_1^\dagger CX_1^\dagger\cdots CX_{n-1}^\dagger V_n^\dagger\big] = \frac12 + \frac12\mathrm{Tr}\big[\sigma_{(3,0,\cdots,0)}\,\rho^{(n)}\big],$$
where $\rho^{(j)} = V_jCX_{j-1}\cdots CX_1V_1\,\rho_{in}\,V_1^\dagger CX_1^\dagger\cdots CX_{j-1}^\dagger V_j^\dagger$ for all $j\in[n]$. Now we focus on the expectation of $(f_{SC}-\frac12)^2$:
$$\mathbb{E}_\theta\Big(f_{SC}-\frac12\Big)^2 = \mathbb{E}_{\theta_1}\mathbb{E}_{\theta_2}\cdots\mathbb{E}_{\theta_n}\frac14\mathrm{Tr}\big[\sigma_{(3,0,\cdots,0)}\rho^{(n)}\big]^2 \qquad(107)$$
$$\ge \mathbb{E}_{\theta_1}\mathbb{E}_{\theta_2}\cdots\mathbb{E}_{\theta_{n-1}}\frac18\mathrm{Tr}\big[\sigma_{(1,0,\cdots,0)}\rho^{(n-1)}\big]^2 \qquad(108)$$
$$\ge \mathbb{E}_{\theta_1}\mathbb{E}_{\theta_2}\cdots\mathbb{E}_{\theta_{n-n_c}}\frac{1}{2^{2+n_c}}\mathrm{Tr}\big[\sigma_{(1,0,\cdots,0)}\rho^{(n-n_c)}\big]^2 \qquad(109)$$
$$= \mathbb{E}_{\theta_1}\mathbb{E}_{\theta_2}\cdots\mathbb{E}_{\theta_{n-n_c-1}}\frac{1}{2^{2+n_c}}\mathrm{Tr}\big[\sigma_{(1,0,\cdots,0)}\rho^{(n-n_c-1)}\big]^2 \qquad(110)$$
$$= \mathbb{E}_{\theta_1}\frac{1}{2^{2+n_c}}\mathrm{Tr}\big[\sigma_{(1,0,\cdots,0)}\rho^{(1)}\big]^2 \qquad(111)$$
$$= \frac{1}{2^{3+n_c}}\Big(\mathrm{Tr}\big[\sigma_{(1,0,\cdots,0)}\rho_{in}\big]^2 + \mathrm{Tr}\big[\sigma_{(3,0,\cdots,0)}\rho_{in}\big]^2\Big), \qquad(112)$$
where Eq. (112) is derived from Lemma C.3. We derive Eqs. (108)-(109) by noticing that the following equations hold for $n-n_c+1 \le j \le n$ and $i\in\{1,3\}$:
$$\mathbb{E}_{\theta_j}\mathrm{Tr}\big[\sigma_{(i,0,\cdots,0)}\rho^{(j)}\big]^2 = \mathbb{E}_{\theta_j}\mathrm{Tr}\big[\sigma_{(i,0,\cdots,0)}\,V_jCX_{j-1}\rho^{(j-1)}CX_{j-1}^\dagger V_j^\dagger\big]^2 \qquad(113)$$
$$= \frac12\mathrm{Tr}\big[\sigma_{(1,0,\cdots,0)}\,CX_{j-1}\rho^{(j-1)}CX_{j-1}^\dagger\big]^2 + \cdots \qquad(114)$$



We refer the readers to Appendix B for a short discussion of the unitary 2-design.

The amplitude encoding $|x\rangle = \sum_{i=1}^{2^n} (x_i/\|x\|)\,|i\rangle$ allows storing a $2^n$-dimensional vector in $n$ qubits. Due to its dense encoding nature and its similarity to the original vector, amplitude encoding is preferred for state preparation by many QML algorithms (Harrow et al., 2009; Rebentrost et al., 2014; Kerenidis et al., 2019). Despite its wide application in quantum algorithms, efficient amplitude encoding remains little explored. Existing work (Park et al., 2019) can prepare the amplitude-encoded state in time $O(2^n)$ using a quantum circuit with $O(2^n)$ depth, which is prohibitive for large-size data on near-term quantum computers. In fact, arbitrary quantum amplitude encoding with polynomial gate complexity remains an open problem.

We ignore the global phase on the quantum state. Based on the assumptions of this paper, all quantum states that encode input data lie in the real space, which limits the global phase to $1$ or $-1$. For the former case, nothing needs to be done; for the latter case, the global phase can be introduced by adding a single-qubit gate $e^{-i\pi\sigma_0} = -\sigma_0 = -I$ on any one of the qubit channels.
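The classical side of amplitude encoding, computing the target statevector from an input vector, can be sketched as follows (an illustrative helper, not the paper's code; as noted above, actually preparing this state on hardware generally requires exponential-depth circuits):

```python
import numpy as np

def amplitude_encode(x):
    """Classically compute the amplitude-encoding state |x> = sum_i (x_i/||x||)|i>.

    Pads x with zeros up to the next power of two, so the 2^n amplitudes
    fit on n qubits. This only produces the target statevector; it does
    not construct a preparation circuit.
    """
    x = np.asarray(x, dtype=float)
    n = max(1, int(np.ceil(np.log2(len(x)))))  # qubits needed
    padded = np.zeros(2**n)
    padded[:len(x)] = x
    return padded / np.linalg.norm(padded), n

state, n_qubits = amplitude_encode([3.0, 4.0])
print(n_qubits, state)  # 1 [0.6 0.8]
```

The returned vector always has unit norm, matching the normalization $x_i/\|x\|$ in the definition above.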



Figure 3: The parameterized alternating layered circuit W (β) (n = 8, L = 1) for training the corresponding encoding circuit of the input x in .

Figure 4: The encoding circuit U (β * ), where W (β * ) is the trained parameterized circuit in Figure 3.

Figure 5: Simulations on the MNIST binary classification between (0, 2). The training loss and the test error during the training iteration are illustrated in Figures 5(a), 5(e) for the n=8 case, Figures 5(b), 5(f) for the n=10 case, and Figures 5(c), 5(g) for the n=12 case. The gradient norm of objective functions and the term α(ρ in ) during the training are shown in Figures 5(d) and 5(h), respectively for the n=8 case.

The test error of the TT-QNN, the SC-QNN, and the Random-QNN converges to around 0.2 for the n=8 case. As the qubit number increases, the converged test error of both TT-QNNs and SC-QNNs remains lower than 0.2, while that of Random-QNNs increases to 0.26 and 0.50 for the n=10 and n=12 cases, respectively. The training loss of both TT-QNNs and SC-QNNs converges to around 0.15 for all qubit-number settings, while that of Random-QNNs remains higher than 0.22. Both the training loss and the test error results show that TT-QNNs and SC-QNNs have better trainability and accuracy on the binary classification compared with Random-QNNs. We record the l2-norm of the gradient during the training for the n=8 case in Figure 5(d). The gradient norm for the TT-QNN and the SC-QNN is mostly distributed in [0.4, 1.4], which is significantly larger than the gradient norm for the Random-QNN, mostly distributed in [0.1, 0.2]. As shown in Figure 5(d), the gradient norm verifies the lower bounds in Theorem 3.1. Moreover, we calculate the term α(ρ_in) defined in Theorem 3.1 and show the result in Figure 5(h). The average of α(ρ_in) is around 0.6, which is lower bounded by the theoretical result 1/4 from Theorem 3.2 (L = 1).

Figure 6: The training loss of the input model for MNIST images in class {0, 1, 2, 3}.

Figure 7: The visualization of the encoding circuit for MNIST images in class {0, 1, 2, 3, 4}.

In this section, we discuss the training of the input model (Algorithm 1) in detail. We construct the encoding circuits in Section 3.3 for each training or test datum with the number of alternating layers L = 1. The number of training iterations is 100. We adopt a decayed learning rate in {0.100, 0.075, 0.050, 0.025}. We illustrate the loss function defined in Eq. (7) during the training of Algorithm 1 for labels in {0, 1, 2, 3} for the n=8 case in Figure 6, in which we show the training of the input model for one image per sub-figure. All shown loss functions converge to around -0.6 after 60 iterations.

Figure 8 demonstrates the simulation on the MNIST classification between images (0, 2), showing the convergence of the training loss (Figure 8(a)) and the test error (Figure 8(b)). The norm of the gradient is recorded in Figure 8(c), in which both the TT-QNN and the SC-QNN show a larger gradient norm than the Random-QNN. Thus, the trainability of TT-QNNs and SC-QNNs is retained when replacing the encoding circuit with the exact amplitude encoding.

Figure 8: The training of the binary classification task on MNIST images between classes (0, 2) using 8-qubit QNNs with the exact amplitude encoding. The training loss and the test error during the training iterations are illustrated in Figures 8(a) and 8(b), respectively. The gradient norm of the objective functions during the training is shown in Figure 8(c).

Based on the results in Figures 5(a) and 5(e), the training loss of the original TT-QNN and SC-QNN converges to around 0.15, and the test error converges to around 0.20 and 0.15, respectively. Thus, both the TT-QNN and the SC-QNN show worse performance than the original QNNs when employing the extended gate set {RX, RY, RZ}.

Figure 9: The training of the binary classification task on MNIST images between classes (0, 2) using 8-qubit QNNs with the extended gate set {RX, RY, RZ}. The training loss and the test error during the training iterations are illustrated in Figures 9(a) and 9(b), respectively.




Figure 10: An example of the variational encoding model (L = 2 case).

Figure 11: Quantum Neural Network with the Deformed Tree Tensor structure (qubit number = 12).

of $A_j$ and $B_j$. By Lemma C.3 and Lemma C.4, we have
$$\mathbb{E}_{\theta_j^{(1)}}\big(\mathrm{Tr}[\sigma_iA_j]^2 - \mathrm{Tr}[\sigma_iB_j]^2\big) = 0,$$
since for the case $i_1\in\{1,3\}$ the term $\frac{\delta_{i_10}+\delta_{i_12}}{2} = 0$, which means Lemma C.3 and Lemma C.4 have the same formulation. Then we derive Eq. (65).

$$\frac18\mathrm{Tr}\big[\sigma_{(3,0,\cdots,0)}A\big]^2 + \frac18\mathrm{Tr}\big[\sigma_{(1,0,\cdots,0)}A\big]^2, \qquad(80)$$
and $\pm a_i$ are the coefficients of the terms $\mathrm{Tr}[\sigma_iA_j]^2$ and $\mathrm{Tr}[\sigma_iB_j]^2$, respectively; $G$ and $W$ act on the rest part of $A_j$ and $B_j$. By Lemma C.3 and Lemma C.4, we have
$$\mathbb{E}_{\theta_j^{(1)}}\big(\mathrm{Tr}[\sigma_iA_j]^2 - \mathrm{Tr}[\sigma_iB_j]^2\big) = 0,$$
since for the case $i_1\in\{1,3\}$ the term $\frac{\delta_{i_10}+\delta_{i_12}}{2} = 0$ in Lemma C.3 and Lemma C.4, which means both lemmas have the same formulation. Then we derive Eq. (93).

$$= \frac12\mathrm{Tr}\big[\sigma_{(3,0,\cdots,0)}\,CX_{j-1}\rho^{(j-1)}CX_{j-1}^\dagger\big]^2 + \frac12\mathrm{Tr}\big[\sigma_{(1,0,\cdots,0)}\,CX_{j-1}\rho^{(j-1)}CX_{j-1}^\dagger\big]^2 \ge \frac12\mathrm{Tr}\big[\sigma_{(1,0,\cdots,0)}\,\rho^{(j-1)}\big]^2. \qquad(117)$$

The test accuracy and F1-scores for different class pairs (qubit number = 10).

The training and test accuracy of TT-QNNs and SC-QNNs for different encoding strategies and different class pairs.

The classification accuracy for different class pairs (n=8).

The classification accuracy for different class pairs (n=10).

Table 6 shows the difference in the training and test accuracy. In conclusion, employing the gate set {RX, RY, RZ} can worsen the performance of QNNs on real-world problems, which may be due to the fact that real-world data lie in the real space, while the operations {RX, RZ} introduce imaginary components into the state.
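The last point can be illustrated directly: acting on a real statevector, an RY rotation keeps all amplitudes real, while an RX rotation introduces imaginary components. The single-qubit check below is only illustrative, not one of the paper's circuits:

```python
import numpy as np

theta = 0.7
sx = np.array([[0, 1], [1, 0]], dtype=complex)
sy = np.array([[0, -1j], [1j, 0]], dtype=complex)

# RY(theta) = exp(-i*theta*sigma_y) is a real rotation matrix;
# RX(theta) = exp(-i*theta*sigma_x) has imaginary off-diagonal entries.
RY = np.cos(theta) * np.eye(2) - 1j * np.sin(theta) * sy
RX = np.cos(theta) * np.eye(2) - 1j * np.sin(theta) * sx

psi = np.array([1.0, 0.0], dtype=complex)  # real input state |0>
print(np.max(np.abs((RY @ psi).imag)))  # 0.0: the state stays real
print(np.max(np.abs((RX @ psi).imag)))  # > 0: an imaginary part appears
```

Since real-valued input states match the real-space data assumption of the paper, RY-only circuits preserve that structure, whereas RX (and similarly RZ) rotations leave it.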

E THE QUANTUM INPUT MODEL

For the convenience of the analysis, we consider an encoding model with $L$ alternating layers. The model begins with the initial state $|0\rangle^{\otimes n}$, where $n$ is the number of qubits. We then apply an X gate to each qubit, which transforms the state into $|1\rangle^{\otimes n}$. Next we apply $L$ alternating layer operations, each of which contains four parts: a single-qubit rotation layer denoted $V_{2i-1}$, a CZ gate layer denoted $CZ_2$, again a single-qubit rotation layer denoted $V_{2i}$, and a CZ gate layer denoted $CZ_1$.

Proof (of Lemma F.1). The proof has a very similar formulation compared to the original tree tensor case. First we rewrite the formulation of $f_{DTT}$ in detail, where $m = \lfloor\log n\rfloor$ and we use the same notation as in Appendix D; the operation $CX_\ell$ consists of CNOT gates, each acting on the corresponding pair of qubits in Figure 11. Now we focus on the partial derivative of the function $f_{DTT}$ with respect to the parameter $\theta_j^{(1)}$. Eq. (69) holds because all factors in (67) and (68) except $\rho_{in}$ are real matrices, and $\rho_{in} = \rho_{in}^\dagger$. Similar to the tree tensor case, the key idea for deriving $\mathbb{E}_\theta(\partial f_{DTT}/\partial\theta_j^{(1)})^2 = 4\,\mathbb{E}_\theta(f_{DTT}-\frac12)^2$ is that for $i_1\in\{1,3\}$ the term $\frac{\delta_{i_10}+\delta_{i_12}}{2} = 0$, which means Lemma C.3 and Lemma C.4 collapse to the same formulation. Now we write the analysis in detail: Eq. (80) is derived using Lemma C.3, and Eq. (81) using the collapsed form of Lemma C.1. We remark that during the integration of the parameters in each layer, the coefficient of $\{\mathrm{Tr}[\sigma_{(1,0,\cdots,0)}A]\}^2$ only gains a factor $1/2$ for the first-qubit parameter and remains unchanged otherwise (check Lemma C.3 for detail). Since the formulation $\{\mathrm{Tr}[\sigma_{(1,0,\cdots,0)}A]\}^2$ stays the same when merging the operation $CX_\ell$ with $\sigma_{(1,0,\cdots,0)}$, we can generate the corresponding chain of inequalities, where $\theta_\ell$ denotes the vector consisting of all parameters in the $\ell$-th layer. Finally, using Lemma C.3 we integrate the remaining parameters and obtain Eq. (85). $\square$

G THE PROOF OF THEOREM 3.1: THE SC PART

Theorem G.1. Consider the $n$-qubit SC-QNN defined in Figure 2 and the corresponding objective function $f_{SC}$ defined in (4). Then the gradient of $f_{SC}$ satisfies the bounds stated in Theorem 3.1, where $n_c$ is the number of controlled CNOT operations that directly link to the first qubit channel, the expectation is taken over all parameters in $\theta$ with uniform distributions in $[0,2\pi]$, and $\alpha(\rho_{in})$, $\sigma_{(i_1,i_2,\cdots,i_n)}$ are as defined before.

Proof.
Firstly we remark that by Lemma D.1, each partial derivative is calculated as in the TT case; since the expectation of the quantum observable is bounded in $[-1,1]$, the square of each partial derivative is easily bounded by $1$. By summing over the $2n-1$ parameters, we obtain the upper bound $\mathbb{E}_\theta\|\nabla_\theta f_{SC}\|^2 \le 2n-1$. On the other side, the lower bound is derived as follows, where Eq. (91) is derived using Lemma G.1 and Eq. (92) using Lemma G.2. $\square$

Lemma G.1. For the objective function $f_{SC}$ defined in Eq. (4), the following formula holds for every $j$ such that $\theta_j^{(1)}$ tunes the single-qubit gate on the first qubit channel:
$$\mathbb{E}_\theta\Big(\frac{\partial f_{SC}}{\partial\theta_j^{(1)}}\Big)^2 = 4\,\mathbb{E}_\theta\Big(f_{SC}-\frac12\Big)^2,$$
where the expectation is taken over all parameters in $\theta$ with uniform distribution in $[0,2\pi]$.

Proof. First we write the formulation of $f_{SC}$ in detail, with notation as before.

Each $CX_\ell$ is a CNOT operation acting on the corresponding qubit pair of the SC architecture in Figure 2, and each $V_\ell$ is the corresponding single-qubit RY rotation layer. Now we focus on the partial derivative of the function $f_{SC}$ with respect to the parameter $\theta_j^{(1)}$.

Under review as a conference paper at ICLR 2021

Eq. (97) holds because all factors in (95) and (96) except $\rho_{in}$ are real matrices, and $\rho_{in} = \rho_{in}^\dagger$. The key idea for deriving $\mathbb{E}_\theta(\partial f_{SC}/\partial\theta_j^{(1)})^2 = 4\,\mathbb{E}_\theta(f_{SC}-\frac12)^2$ is again that for $i_1\in\{1,3\}$ the term $\frac{\delta_{i_10}+\delta_{i_12}}{2} = 0$, which means both lemmas collapse to the same formulation. Now we write the analysis in detail. Eqs. (101) and (102) are derived using Lemma C.3 and the collapsed form of Lemma C.1,
$$\mathrm{CNOT}(\sigma_1\otimes\sigma_0)\mathrm{CNOT}^\dagger = \sigma_1\otimes\sigma_0, \qquad \mathrm{CNOT}(\sigma_3\otimes\sigma_0)\mathrm{CNOT}^\dagger = \sigma_3\otimes\sigma_3,$$
and $\theta_\ell$ denotes the vector consisting of all parameters in the $\ell$-th layer. The integration (99)-(102) can be performed similarly for the parameters $\{\theta_{n-1},\theta_{n-2},\cdots,\theta_{j+1}\}$. It is not hard to see that after integrating the parameters $\theta_{j+1}$, the terms $\mathrm{Tr}[\sigma_iA_j]^2$ and $\mathrm{Tr}[\sigma_iB_j]^2$ have opposite coefficients. Besides, the first index of each Pauli tensor product $\sigma_{(i_1,i_2,\cdots,i_n)}$ can only be $i_1\in\{1,3\}$ because of Lemma C.3. So we can write the analogue of Eq. (30), where
$$A_j = \big(G_j^{(1)}\otimes I^{\otimes(n-1)}\big)\,A_j^{/\theta_j^{(1)}}\,\big(W_j^{(1)\dagger}\otimes I^{\otimes(n-1)}\big), \qquad B_j = \big(W_j^{(1)}\otimes I^{\otimes(n-1)}\big)\,A_j^{/\theta_j^{(1)}}\,\big(W_j^{(1)\dagger}\otimes I^{\otimes(n-1)}\big).$$
Eq. (113) is derived from the definition of $\rho^{(j)}$ in Eq. (106), Eqs. (114)-(115) from Lemma C.3, and Eq. (117) from Lemma C.1. We derive Eqs. (110)-(111) by noticing that the corresponding equations hold for $2 \le j \le n-n_c$, where Eq. (118) is based on the definition of $\rho^{(j)}$ in Eq. (106), Eq. (119) on Lemma C.3, and Eq. (120) on Lemma C.1.
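Lemma D.1's parameter-shift rule, used at the start of each proof above, can be sanity-checked numerically on a one-qubit instance (the circuit and observable below are illustrative choices, not the paper's experiments): for $f(\theta) = \frac12 + \frac12\mathrm{Tr}\big[\sigma_3\,e^{-i\theta\sigma_2}|0\rangle\langle0|e^{i\theta\sigma_2}\big]$, the $\pm\pi/4$-shifted difference reproduces the analytic derivative exactly.

```python
import numpy as np

sy = np.array([[0, -1j], [1j, 0]])
sz = np.diag([1.0, -1.0]).astype(complex)
rho0 = np.diag([1.0, 0.0]).astype(complex)  # |0><0|

def expval(theta):
    """<sigma_3> after the rotation e^{-i*theta*sigma_2} on |0>; equals cos(2*theta)."""
    W = np.cos(theta) * np.eye(2) - 1j * np.sin(theta) * sy
    return np.trace(sz @ W @ rho0 @ W.conj().T).real

def f(theta):
    return 0.5 + 0.5 * expval(theta)

def grad_parameter_shift(theta):
    # Lemma D.1: df/dtheta = (1/2) * (<O>(theta + pi/4) - <O>(theta - pi/4))
    return 0.5 * (expval(theta + np.pi / 4) - expval(theta - np.pi / 4))

for theta in np.linspace(0, 2 * np.pi, 9):
    exact = -np.sin(2 * theta)  # analytic derivative of f(theta) = (1 + cos(2*theta))/2
    assert abs(grad_parameter_shift(theta) - exact) < 1e-12
print("parameter-shift rule matches the analytic gradient")
```

Unlike a finite-difference approximation, the shifted difference is exact for gates generated by a single Pauli matrix, which is why the check passes to machine precision.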

