Symmetric Pruning in Quantum Neural Networks

Abstract

Many fundamental properties of a quantum system are captured by its Hamiltonian and ground state. Despite its significance, ground state preparation (GSP) is classically intractable for most large-scale Hamiltonians. Quantum neural networks (QNNs), which exert the power of modern quantum machines, have emerged as a leading protocol to conquer this issue. As such, enhancing the performance of QNNs becomes the core task in GSP. Empirical evidence shows that QNNs with handcrafted symmetric ansätze generally experience better trainability than those with asymmetric ansätze, while theoretical explanations remain vague. To fill this knowledge gap, here we propose the effective quantum neural tangent kernel (EQNTK) and connect this concept with over-parameterization theory to quantify the convergence of QNNs towards the global optima. We uncover that the advantage of symmetric ansätze is attributed to their large EQNTK value with a low effective dimension, which requires only few parameters and a shallow quantum circuit to reach the over-parameterization regime permitting a benign loss landscape and fast convergence. Guided by EQNTK, we further devise a symmetric pruning (SP) scheme to automatically tailor a symmetric ansatz from an over-parameterized and asymmetric one, greatly improving the performance of QNNs when explicit symmetry information of the Hamiltonian is unavailable. Extensive numerical simulations validate the analytical results of EQNTK and the effectiveness of SP.

1. Introduction

The laws of quantum mechanics dictate that any quantum system can be described by a Hamiltonian, and many important physical properties are reflected by its ground state. For this reason, the ground state preparation (GSP) of Hamiltonians is the key to understanding and fabricating novel quantum matter. Due to the intrinsic hardness of GSP (Poulin & Wocjan, 2009; Carleo et al., 2019), the required computational resources of classical methods are unaffordable when the size of the Hamiltonian becomes large. Quantum computers, whose operations can harness the strength of quantum mechanics, promise to tackle this problem with potential computational merits. In the noisy intermediate-scale quantum (NISQ) era (Preskill, 2018), quantum neural networks (QNNs) (Farhi & Neven, 2018; Cong et al., 2019; Cerezo et al., 2021a) are leading candidates toward this goal. The building blocks of QNNs, analogous to deep neural networks, consist of variational ansätze (also called parameterized quantum circuits) and classical optimizers. To enhance the power of QNNs in GSP, great efforts have been made to design advanced ansätze with varied circuit structures (Peruzzo et al., 2014; Wecker et al., 2015; Kandala et al., 2017). Despite these achievements, recent progress has shown that QNNs may suffer from severe trainability issues when the circuit depth of the ansatz is either shallow or deep. Namely, for deep ansätze, the magnitude of the gradients exponentially decays with increasing system size (McClean et al., 2018; Cerezo et al., 2021b). This phenomenon, dubbed the barren plateau, hints at the difficulty of optimizing deep QNNs, where an exponential runtime is required for convergence.
The wisdom to alleviate barren plateaus is exploiting shallow ansätze to accomplish learning tasks (Grant et al., 2019; Skolik et al., 2021; Zhang et al., 2020; Pesah et al., 2021), while the price to pay is incurring another serious trainability issue: poor convergence (Boyd & Vandenberghe, 2004; Du et al., 2021). The trainable parameters may get stuck in sub-optimal local minima or saddle points with high probability because of the unfavorable loss landscape (Anschuetz, 2021; Anschuetz & Kiani, 2022). Orthogonal to these negative results, several studies pointed out that when the depth of ansätze becomes overwhelmingly deep and surpasses a critical point, over-parameterized QNNs embrace a benign landscape and permit fast convergence towards good local minima (Kiani et al., 2020; Wiersema et al., 2020; Larocca et al., 2021b). Nevertheless, the criterion for reaching such a critical point is stringent, i.e., the number of parameterized gates or the circuit depth scales exponentially with the problem size, which hinders the application of over-parameterized QNNs in practice. Empirical evidence sheds new light on exploiting over-parameterized QNNs to tackle GSP: QNNs with symmetric ansätze only demand a number of trainable parameters and a circuit depth that scale polynomially with the problem size to reach the over-parameterized region and achieve a fast convergence rate (Herasymenko & O'Brien, 2021; Gard et al., 2020; Zheng et al., 2021; 2022; Shaydulin & Wild, 2021; Mernyei et al., 2022; Marvian, 2022; Meyer et al., 2022; Larocca et al., 2022; Sauvage et al., 2022). A common feature of these symmetric ansätze is capitalizing on the symmetry properties underlying the problem Hamiltonian to shrink the solution space and facilitate seeking near-optimal solutions. Unfortunately, current symmetric ansätze are inapplicable to a broad class of Hamiltonians whose symmetry is implicit, since their constructions rely on explicit knowledge of the symmetry of the Hamiltonian.
Besides, it is unknown whether symmetry contributes to lowering the critical point for reaching the over-parameterization regime. Here we fill the above knowledge gap from both theoretical and practical aspects. Concretely, we develop a novel notion, the effective quantum neural tangent kernel (EQNTK), to capture the training dynamics of various ansätze via their effective dimension. In doing so, we show that, compared with asymmetric ansätze, symmetric ansätze possess dramatically lower effective dimensions, so the number of parameters and the circuit depth required to reach over-parameterization may scale polynomially with the problem size (see Fig. 1 for an intuition). By leveraging EQNTK, we next prove that when the over-parameterization condition is satisfied, the trainable parameters of QNNs with symmetric ansätze converge exponentially fast to the global optima as the number of iterations increases. Taken together, our analysis recognizes over-parameterized QNNs with symmetric ansätze as a possible solution toward large-scale GSP tasks. Inspired by EQNTK and pruning techniques in deep neural networks (Han et al., 2015; Blalock et al., 2020; Frankle et al., 2020; Wang et al., 2022), we further devise a symmetric pruning (SP) scheme to automatically tailor a symmetric ansatz from an over-parameterized and asymmetric one with enhanced trainability and applicability. Conceptually, SP continuously eliminates redundant quantum gates from the given asymmetric ansatz and correlates parameters to assign different types of symmetries to the slimmed ansatz. In this way, SP generates a modest-depth symmetric ansatz with a fast convergence guarantee and thus improves hardware efficiency. Extensive simulations on many-body physics and combinatorial problems validate the theory of EQNTK and the efficacy of SP.
These results deepen our understanding of how to merge symmetry with over-parameterization theory and indicate the significance of designing symmetric ansätze.

Contributions. We summarize our main contributions as follows:

1. We propose the notion of EQNTK to quantify the training dynamics of QNNs with symmetric ansätze, which reconciles the QNTK theory with the symmetry of the problem Hamiltonian (see Sec. 2.2). As shown in Fig. 1, since the training dynamics of symmetric and asymmetric ansätze are evidently disparate, our results provide a deeper understanding of QNNs with symmetric ansätze, especially for unraveling how the structure information affects the convergence rate.

2. Building on EQNTK, we prove that over-parameterized QNNs with symmetric ansätze exponentially converge to the global optima, where the critical point to reach the over-parameterization regime is controlled by d_eff instead of 2^n (see Table 1 for a comparison with prior results).

      | Larocca et al. (2021b)       | Liu et al. (2022b)     | Ours
  C   | Ω(exp(n)) / O(poly(d_eff))   | O(exp(n))              | O(poly(d_eff))
  T   | O(log(d_eff) log(1/ϵ))       | O(4^n log(1/ϵ)/(LK))   | O(d_eff² log(1/ϵ)/(LK))

Table 1: A comparison of the convergence rate for over-parameterized QNNs. The labels 'C' and 'T' refer to the critical point and the ϵ-convergence rate, respectively. A blank entry denotes that the corresponding paper did not study that regime. Note that the results in Ref. (Larocca et al., 2021b) do not exhibit how the problem Hamiltonian affects d_eff.

3. Our last contribution is devising SP, an automatic scheme to identify the implicit symmetries of the problem Hamiltonian and utilize them to design a symmetric ansatz (see Section 3). An attractive feature of SP is its flexibility: any heuristic that can capture certain symmetries of the problem Hamiltonian can be seamlessly embedded into SP to further boost its performance.

2. Effective QNTK allows an improved convergence of QNNs

Here we establish foundations for why symmetric ansätze enhance the trainability of QNNs in ground state preparation (GSP) tasks. To do so, we propose a novel concept, the effective quantum neural tangent kernel (EQNTK), to reconcile the QNTK theory with the symmetry of the problem Hamiltonian.
Owing to EQNTK, we uncover that the advantage of symmetric ansätze originates from their ability to dramatically decrease the over-parameterization threshold. To elucidate this, we first introduce the necessary background in Sec. 2.1 and then present our main theoretical results in Sec. 2.2.

2.1. Problem setup

Ground state preparation. Given an n-qubit problem Hamiltonian H with ground state |ψ*⟩, QNNs accomplish GSP by minimizing the loss function

L(θ) = (1/2) (⟨ψ_0| U(θ)† H U(θ) |ψ_0⟩ − E_0)²,   (1)

where E_0 = ⟨ψ*| H |ψ*⟩ refers to the ground-state energy of H, |ψ_0⟩ is a fixed input state, and U(θ) is a variational ansatz. The optimization follows an iterative manner: the classical optimizer continuously leverages the output of the quantum circuit to update θ via the rule θ^(t+1) = θ^(t) − η ∂L(θ^(t))/∂θ, where η refers to the learning rate. See Appendix A for details.

Remark.

We adopt E_0 to facilitate the convergence analysis; our results cover general loss functions where E_0 is replaced by C ∈ ℝ with C ≤ E_0. See Appendix B for details.

Constructions of (symmetric) ansätze. The power of QNNs depends on the employed ansatz U(θ). A general form of U(θ), covering many typical ansätze such as the Hamiltonian variational ansatz (HVA) and the hardware-efficient ansatz (HEA) (Bharti et al., 2022; Qian et al., 2021), yields

U(θ) = ∏_{ℓ=1}^{L} U_ℓ(θ_ℓ),  U_ℓ(θ_ℓ) = ∏_{k=1}^{K} e^{−i G_k θ_ℓk},   (2)

where L refers to the layer number, θ = (θ_1, …, θ_L) ∈ Θ ⊆ ℝ^{LK} are the trainable parameters living in the parameter space Θ, θ_ℓ = (θ_ℓ1, …, θ_ℓK) are the trainable parameters at the ℓ-th layer, and A = {G_1, …, G_K} is a set of Hermitian traceless operators called an ansatz design. Given Θ and A, the set of ansätze forms a subgroup of SU(2^n) with U_A = ∪_{L=0}^{∞} {U(θ) : θ ∈ Θ}. The difference between ansätze originates from the varied Θ and A. Given a Hermitian matrix Σ, the ansatz U(θ) is said to be symmetric with respect to Σ if each element in U_A commutes with Σ. Mathematically, denote the spectral decomposition Σ = ∑_{j=1}^{p} λ_j ∑_{k=1}^{s_j} |v_{jk}⟩⟨v_{jk}|, where λ_j is the j-th eigenvalue with λ_i ≠ λ_j for i ≠ j, |v_{jk}⟩ is the corresponding eigenvector, and ∑_{j=1}^{p} s_j = 2^n. The explicit form of Σ leads to a direct-sum decomposition H = ⊕_{j=1}^{p} V_j of the quantum state space, where V_j is the invariant subspace spanned by the eigenvectors {|v_{j1}⟩, …, |v_{js_j}⟩}.
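To make Eqn. (2) concrete, the following numpy sketch builds U(θ) from an ansatz design A and checks the symmetry condition [U(θ), Σ] = 0. The two-qubit generators and the choice Σ = SWAP are illustrative examples, not the paper's construction:

```python
import numpy as np

# Pauli matrices used as generators (the ansatz design A).
I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)

def exp_herm(G, theta):
    """e^{-i G theta} for a Hermitian G, via eigendecomposition."""
    lam, V = np.linalg.eigh(G)
    return (V * np.exp(-1j * lam * theta)) @ V.conj().T

def ansatz(thetas, A):
    """U(theta) = prod_l prod_k e^{-i G_k theta_lk}, as in Eqn. (2)."""
    U = np.eye(A[0].shape[0], dtype=complex)
    for theta_l in thetas:                # layers l = 1..L
        for G, t in zip(A, theta_l):      # gates k = 1..K within a layer
            U = exp_herm(G, t) @ U
    return U

# Toy two-qubit design: both generators commute with Sigma = SWAP,
# so every U(theta) in U_A is symmetric with respect to SWAP.
A = [np.kron(Z, Z), np.kron(X, I2) + np.kron(I2, X)]
SWAP = np.array([[1, 0, 0, 0],
                 [0, 0, 1, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1]], dtype=complex)

rng = np.random.default_rng(0)
U = ansatz(rng.uniform(-np.pi, np.pi, size=(3, 2)), A)  # L = 3, K = 2
assert np.allclose(U @ U.conj().T, np.eye(4))           # U is unitary
assert np.allclose(U @ SWAP, SWAP @ U)                  # [U, Sigma] = 0
```

Since each factor e^{−iG_kθ} commutes with Σ whenever G_k does, the product U(θ) inherits the symmetry, which is exactly the mechanism the definition above formalizes.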

Convergence of QNNs.

A crucial metric to assess the performance of different QNNs is the ϵ-convergence rate towards the global minimum L(θ*) with θ* = arg min_{θ∈Θ} L(θ).

Definition 1 (ϵ-convergence). A QNN instance (|ψ_0⟩, U(θ), H) achieves ϵ-convergence if the trained parameters θ^(T) after T iterations satisfy L(θ^(T)) ≤ ϵ with ϵ ∈ ℝ.

This quantity measures the distance between the estimated and the optimal loss values, which can be derived via the quantum neural tangent kernel (QNTK)

Q^(t) = ∇ε_t^⊤ ∇ε_t,   (3)

where ε_t = ⟨ψ_0| U(θ^(t))† H U(θ^(t)) |ψ_0⟩ − E_0 denotes the residual training error and ∇ε_t is the gradient of ε_t with respect to θ. The following theorem describes the ϵ-convergence of an over-parameterized QNN with an arbitrary ansatz.

Theorem 1 (Liu et al. (2022b)). Following the notations in Eqns. (1)-(3), when U(θ) matches the Haar distribution up to the fourth moment, the number of parameters satisfies LK ≫ 1, and the learning rate obeys η ≪ 1, the training dynamics of a QNN instance (|ψ_0⟩, U(θ), H) yield

ε_t ≈ (1 − η Q̄)^t ε_0 ≈ e^{−γt} ε_0,   (4)

where γ = η Q̄ is the indicator of the decay rate and Q̄ = O(LK Tr(H²)/4^n) refers to the expectation of Q on Haar average.

Theorem 1 indicates that the critical point to reach the over-parameterization region is |θ| ∼ O(4^n/(η Tr(H²))). In this setting, the exponent in Eqn. (4) meets γ ∼ O(1) and promises an exponential convergence. Besides, Eqn. (4) hints that the convergence rate of QNNs is continuously enhanced by increasing the value of the QNTK, which can be achieved by growing the number of parameters or decreasing the system size.
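The QNTK of Eqn. (3) and the linearized dynamics of Eqn. (4) can be probed numerically on a toy instance. The sketch below uses a hypothetical single-qubit problem and finite-difference gradients (a stand-in for the parameter-shift rule), forms Q = ∇ε^⊤∇ε, and checks that one gradient step shrinks the residual by roughly (1 − ηQ):

```python
import numpy as np

X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)

def exp_herm(G, t):
    lam, V = np.linalg.eigh(G)
    return (V * np.exp(-1j * lam * t)) @ V.conj().T

H = Z                                   # toy problem Hamiltonian, E_0 = -1
E0 = -1.0
psi0 = np.array([1.0, 0.0], dtype=complex)
A = [X, Z]                              # ansatz design, K = 2

def residual(thetas):
    """eps = <psi_0| U(theta)^dag H U(theta) |psi_0> - E_0."""
    U = np.eye(2, dtype=complex)
    for theta_l in thetas.reshape(-1, len(A)):
        for G, t in zip(A, theta_l):
            U = exp_herm(G, t) @ U
    psi = U @ psi0
    return float(np.real(psi.conj() @ H @ psi)) - E0

def grad_residual(thetas, h=1e-6):
    """Central-difference gradient of the residual."""
    g = np.zeros_like(thetas)
    for i in range(len(thetas)):
        e = np.zeros_like(thetas)
        e[i] = h
        g[i] = (residual(thetas + e) - residual(thetas - e)) / (2 * h)
    return g

rng = np.random.default_rng(1)
theta = rng.uniform(-np.pi, np.pi, size=4)   # L = 2 layers, K = 2
eps = residual(theta)
grad = grad_residual(theta)
Q = float(grad @ grad)                       # QNTK, Eqn. (3)

# One gradient step on L = eps^2 / 2 shrinks the residual by ~(1 - eta*Q),
# the single-step version of Eqn. (4); exact only to first order in eta.
eta = 0.001
eps_next = residual(theta - eta * eps * grad)
assert abs(eps_next - (1 - eta * Q) * eps) < 1e-2
```

Iterating this step reproduces the exponential decay ε_t ≈ e^{−ηQ̄t} ε_0 as long as the kernel stays approximately constant along the trajectory.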

2.2. Effective Quantum Neural Tangent Kernel

The exponential scaling with the number of qubits n in Theorem 1 renders over-parameterized QNNs impractical for large systems. Moreover, the corresponding convergence rate is independent of the refined structure information of ansätze. This contradicts the empirical evidence that symmetric ansätze outperform asymmetric ansätze with faster convergence in training QNNs. New theories are thus in high demand to address these issues.

Optimization path

Figure 2: Training dynamics of QNNs with symmetric ansätze. The left and right panels illustrate the dynamics of variational states corresponding to the asymmetric and symmetric ansätze, respectively. The shadowed region denotes the solution space, which is the whole Hilbert space for an asymmetric ansatz and the restricted invariant subspace for a symmetric ansatz.

Here we propose a novel concept, the effective QNTK (EQNTK), to resolve the above dilemma and exhibit how symmetry improves the trainability of QNNs. As shown in Fig. 2, given a QNN (|ψ_0⟩, U(θ), H) whose input state |ψ_0⟩ and ground state |ψ*⟩ live in the same subspace V* ⊂ ℂ^{2^n} and V* is invariant under the employed symmetric ansatz U(θ), the training dynamics of |ψ(θ)⟩ = U(θ)|ψ_0⟩ can be exactly captured by V*, a much smaller space than the whole state space. Suppose that the state space H under the symmetric ansatz design A can be decomposed into H = ⊕_{j=1}^{p} V_j and there exists an index j* such that both |ψ_0⟩ and |ψ*⟩ live in V* := V_{j*}; we refer to d_eff := dim(V*) as the effective dimension. As a result, for symmetric ansätze, Q^(t) and Q̄ in Theorem 1 should be controlled by d_eff instead of 2^n. This integration of the effective dimension transforms the QNTK in Eqn. (3) into the EQNTK, which reduces the threshold to reach over-parameterization and accelerates the convergence (see Fig. 1). The following theorem establishes the convergence theory of QNNs with symmetric ansätze under the EQNTK, whose proof is deferred to Appendix C.

Theorem 2. Consider the QNN instance (|ψ_0⟩, U(θ), H) with effective dimension d_eff. Following the notations in Eqns. (1)-(3), when the distribution of U(θ) constrained to the invariant subspace with projection Π = P P† matches the Haar distribution up to the fourth moment, the number of parameters satisfies LK ≫ 1, the learning rate obeys η ≪ 1, and denoting H* = P H P†, the training dynamics of the QNN instance (|ψ_0⟩, U(θ), H) yield

ε_t ≈ (1 − η Q̄_S)^t ε_0 ≈ e^{−γt} ε_0,   (5)

where Q̄_S = O(LK Tr((H*)²)/d_eff²) refers to the expectation of the EQNTK Q_S on Haar average.
The above results indicate that when the number of trainable parameters scales as LK ∼ O(d_eff²/(η Tr((H*)²))), the adopted symmetric ansatz reaches the over-parameterization regime. Compared with the QNTK, the reduction of the parameter count by a factor of order (2^n/d_eff)² not only ensures the practical utility of over-parameterized QNNs, but also explains the empirical observation that symmetric ansätze require fewer parameters to reach the critical point than asymmetric ansätze. More importantly, unlike prior results arguing that trainability can always be improved by over-parameterization, our bound suggests that adding more parameters beyond the critical point may degrade the convergence, since the underlying symmetry may be broken and the effective dimension d_eff could become large.
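The effective dimension is straightforward to extract numerically once a symmetry Σ is known: diagonalize Σ, pick the eigenspace V* containing |ψ_0⟩, and restrict H to it. A small sketch with the illustrative total-magnetization symmetry Σ = ∑_j Z_j on three qubits:

```python
import numpy as np

I2 = np.eye(2, dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)

def op_on(n, op, j):
    """Single-qubit operator op acting on wire j of an n-qubit register."""
    out = np.array([[1.0 + 0j]])
    for k in range(n):
        out = np.kron(out, op if k == j else I2)
    return out

n = 3
# Symmetry Sigma = sum_j Z_j (total magnetization); its eigenspaces V_j are
# the fixed-Hamming-weight sectors of the 2^n-dimensional state space.
Sigma = sum(op_on(n, Z, j) for j in range(n))
lam, V = np.linalg.eigh(Sigma)

psi0 = np.zeros(2 ** n, dtype=complex)
psi0[3] = 1.0                    # |011>, a Hamming-weight-2 basis state

# Select the eigenspace V* containing |psi_0>; d_eff = dim(V*).
overlaps = np.abs(V.conj().T @ psi0) ** 2
sector = lam[np.argmax(overlaps)]
P = V[:, np.isclose(lam, sector)]        # columns span V*; Pi = P P^dag
d_eff = P.shape[1]
assert d_eff == 3                        # C(3, 2) weight-2 basis states

# Effective Hamiltonian restricted to V* (d_eff x d_eff instead of 2^n).
H = -sum(op_on(n, Z, j) @ op_on(n, Z, (j + 1) % n) for j in range(n))
H_star = P.conj().T @ H @ P
assert H_star.shape == (d_eff, d_eff)
```

Here P stacks the eigenvectors of the selected sector as columns, so the restricted Hamiltonian is the d_eff × d_eff matrix appearing in the Q̄_S bound (up to the orientation convention for P).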

Remark. (i)

The derived EQNTK can be used to diagnose the barren plateaus of QNNs. In particular, the quantity Q̄_S/(LK) amounts to the variance of the gradient, whose average is zero under the 4-design assumption. In other words, when the number of parameters LK is fixed, a large EQNTK value is preferred to avoid barren plateaus. (ii) The effective dimension can be quantified by metrics other than the dimension of the invariant subspace. An alternative is the dynamical Lie algebra (DLA) (Larocca et al., 2021b), which measures the controllability of the quantum system. The following proposition shows the convergence of QNNs with symmetric ansätze under this measure, whose proof is given in Appendix D.

Proposition 1 (Informal). Following the notations in Theorem 2, denote the dynamical Lie algebra of the QNN instance (|ψ_0⟩, U(θ), H) by g, and assume that the number of parameters satisfies LK ≫ 1 and the learning rate obeys η ≪ 1. If there exists an invariant subspace V_g of dimension d_g under the DLA g that includes the input state |ψ_0⟩ and the ground state |ψ*⟩, the DLA-based EQNTK Q_D corresponding to the ansatz U(θ) leads to the training dynamics

ε_t ≈ (1 − η Q̄_D)^t ε_0 ≈ e^{−γt} ε_0,   (6)

where Q̄_D = O(LK Tr((H*)²)/d_g²) refers to the expectation of Q_D on Haar average.
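The DLA dimension can be estimated numerically by closing the real span of {iG_k} under nested commutators with the generators (nested brackets of generators span the generated algebra). The generator sets below are illustrative toy choices:

```python
import numpy as np

X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)

def dla_dimension(generators):
    """Dimension of the dynamical Lie algebra generated by {i G_k}, computed
    by Gram-Schmidt on flattened matrices to detect linear independence."""
    basis = []          # orthonormal flattened matrices spanning the DLA

    def add(M):
        v = M.flatten()
        for b in basis:
            v = v - (b.conj() @ v) * b
        nrm = np.linalg.norm(v)
        if nrm > 1e-8:
            basis.append(v / nrm)
            return True
        return False

    gens = [1j * G for G in generators]
    frontier = [G for G in gens if add(G)]
    while frontier:     # keep bracketing new elements with the generators
        frontier = [C for A in gens for B in frontier
                    for C in [A @ B - B @ A] if add(C)]
    return len(basis)

# Single qubit: {X, Z} generates all of su(2), so the DLA dimension is 3.
assert dla_dimension([X, Z]) == 3

# Two-qubit TFIM-style generators {Z1 Z2, X1 + X2} close on a small algebra.
d_g = dla_dimension([np.kron(Z, Z),
                     np.kron(X, np.eye(2)) + np.kron(np.eye(2), X)])
print(d_g)
```

For anti-Hermitian matrices, complex-linear dependence coincides with real-linear dependence, so the complex Gram-Schmidt above counts the real Lie-algebra dimension correctly.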

3. Symmetric pruning with EQNTK

Algorithm 1: Symmetric pruning (SP)
Input: Problem Hamiltonian H = (∑_{j=1}^{q} α_j H_j) ⊗ I^⊗m, the ansatz design A and the parameter space Θ in Eqn. (2).
Step 1. Initialize an over-parameterized and asymmetric ansatz via A and Θ.
Step 2. Symmetry identification:
  2-1. Remove the gates on the wires corresponding to the redundant part of H in A, i.e., I^⊗m.
  2-2. Remove gates such that the pruned ansatz design becomes A_pr = {H_1, …, H_q}.
  2-3. Assign the spatial symmetry of A_pr by correlating some parameters and obtain Θ_pr ⊆ Θ.
Output: Pruned ansatz design A_pr and parameter space Θ_pr.

Beyond analyzing the convergence rate, another ad-hoc topic in GSP is designing advanced ansätze to improve the trainability of QNNs. Although over-parameterization and contemporary symmetric ansätze partially address this problem, both have evident caveats: the former may request exponentially many parameters to satisfy the over-parameterization condition, while the latter requires explicit knowledge of the symmetries of the problem Hamiltonian. To compensate for these deficiencies, here we devise symmetric pruning (SP), an automatic scheme to design symmetric ansätze with enhanced trainability. Conceptually, SP distills a symmetric over-parameterized ansatz from an asymmetric over-parameterized one. Supported by the EQNTK theory, the extracted ansatz is resource-friendly in implementation since it holds a small effective dimension and only needs a few trainable parameters to reach over-parameterization. The pseudocode of SP is summarized in Alg. 1 and its schematic illustration is shown in Fig. 3. Suppose the problem Hamiltonian is H = H_M ⊗ I^⊗m with H_M = ∑_{j=1}^{q} α_j H_j, where α_j is a real coefficient and H_j is a tensor product of Pauli matrices on n qubits. SP builds a symmetric ansatz for H with two primary steps, i.e., initialization and symmetry identification.
The initialization step chooses an initial over-parameterized QNN by setting down the ansatz design A and the parameter space Θ. Note that A should contain all Pauli terms in H and Θ should ensure an ϵ-convergence of QNNs; a possible choice is adopting a sufficiently deep hardware-efficient ansatz. Next, the symmetry identification step iteratively discovers the system symmetry, structure symmetry, and spatial symmetry, which is completed by three sub-steps. Step 2-1 symmetrically prunes the qubit wires: all gates acting on the wires associated with the redundant part of H, i.e., the identity term I^⊗m, are removed. Step 2-2 symmetrically prunes the structure. This step drops the parameterized single-qubit and two-qubit gates so that the pruned ansatz design A_pr can be block-diagonalized under the projection onto the eigenspaces of H_M = ∑_{j=1}^{q} α_j H_j. A possible solution is setting A_pr = {H_1, …, H_q}, so the pruned ansatz U_pr(θ) takes the form of Eqn. (2) with U_ℓ(θ_ℓ) = ∏_{k=1}^{q} e^{−i H_k θ_ℓk}. Step 2-3 correlates symmetric parameters to unveil the spatial symmetry of H, which is accomplished by a heuristic related to identifying the graph automorphism group (Stoichev, 2019). See Appendix E for more details.
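On the level of Pauli-string bookkeeping, Steps 2-1 to 2-3 admit a compact sketch. The Hamiltonian, the wire-pruning rule, and the parameter-sharing heuristic below are toy illustrations; in particular, the permutation-signature heuristic in Step 2-3 is a simplified stand-in for a proper graph-automorphism search:

```python
# Pauli strings over n + m = 4 qubits; H = sum_j alpha_j * string_j with the
# last two wires purely identity (the I^{(x)m} part).
H_terms = {"ZZII": -1.0, "XIII": -1.0, "IXII": -1.0}   # toy TFIM(2) x I x I

# Step 2-1: prune wires on which every term acts trivially.
n_tot = len(next(iter(H_terms)))
active = [q for q in range(n_tot) if any(s[q] != "I" for s in H_terms)]
pruned = {"".join(s[q] for q in active): a for s, a in H_terms.items()}
assert sorted(pruned) == ["IX", "XI", "ZZ"]

# Step 2-2: the pruned ansatz design keeps exactly the Pauli terms of H.
A_pr = sorted(pruned)

# Step 2-3: correlate parameters of terms that a spatial symmetry relates.
# Toy heuristic: share one parameter among terms with equal coefficients and
# the same Pauli content up to a qubit permutation.
def signature(s):
    return ("".join(sorted(s)), pruned[s])

groups = {}
for s in A_pr:
    groups.setdefault(signature(s), []).append(s)
param_ids = {s: i for i, (_, terms) in enumerate(sorted(groups.items()))
             for s in terms}
assert param_ids["XI"] == param_ids["IX"]   # the two X terms share a parameter
assert param_ids["ZZ"] != param_ids["XI"]
```

The output assigns one trainable parameter per symmetry class rather than per gate, which is exactly how Θ_pr shrinks relative to Θ.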

Remark. (i)

We emphasize that although both SP and the pruning techniques used in deep neural networks aim to remove redundant parameters and (quantum) neurons, they are fundamentally different. Classical pruning methods generally leverage the magnitude of the weights or the gradient information to recognize such redundancy, which is infeasible in QNNs (refer to Appendix F for elaborations). (ii) SP is a flexible framework. Besides the three symmetric properties in Alg. 1, SP can effectively integrate other symmetry-identification methods in its second step.

4. Experiments

We carry out numerical simulations to explore the theoretical properties of EQNTK and validate the effectiveness of the SP scheme in GSP.

Problem Hamiltonians. 1) Transverse-field Ising model (TFIM). The Hamiltonian of the TFIM is

H_TFIM = −∑_{j=1}^{n−1} σ^z_j σ^z_{j+1} − h ∑_{j=1}^{n} σ^x_j,

where σ^μ_j denotes the μ-Pauli matrix (with μ = x, z) acting on the j-th qubit, and h is the strength of the transverse field. For simplicity, we set h = 1 in the following simulations. 2) Maximum cut. The maximum cut (MaxCut) problem aims to partition the set of nodes V of a graph G = (V, E) into two parts such that the number of edges spanning the two parts is maximized. The MaxCut problem can be recast as GSP. Namely, the objective of an n-node graph is encoded by an n-qubit Hamiltonian H_MC = (1/2) ∑_{(u,v)∈E} (I − σ^z_u σ^z_v) and the optimal solution corresponds to the ground state of H_MC as formulated in Eqn. (1). Here we focus on Erdős–Rényi graphs, which are generated by randomly connecting any pair of nodes among n nodes with probability p = 0.6. To verify the effectiveness of SP, the above problem Hamiltonians are modified into (n+m)-qubit Hamiltonians H = H_M ⊗ I^⊗m (H_M ∈ {H_TFIM, H_MC}) with n = 6 and m = 2.

Initialization of QNNs. The hardware-efficient ansatz (HEA) with the form of Eqn. (2) is used as the initial ansatz, which is over-parameterized and asymmetric. The layer number is set as L ∈ {4, 6, 8, …, 28} and L ∈ {4, 6, 10, …, 36} for TFIM and MaxCut, respectively. For each problem Hamiltonian, the input state is set as |ψ_0⟩ = |0⟩. The parameters θ are sampled from the uniform distribution on [−π, π]. The variational ansatz is trained by the Adam optimizer, where the learning rate is 0.001 and the remaining hyper-parameters follow the default settings. The training of QNNs stops when the loss value is less than 10^−8 or when the change in the loss function is less than 10^−8 three times in a row. The maximum number of iterations is set as T = 10000. The ϵ value in Definition 1 is set as 10^−5 for both TFIM and MaxCut.
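For reference, both problem Hamiltonians can be assembled as dense matrices at toy sizes (illustrative n, not the paper's n = 6, m = 2 setting):

```python
import numpy as np

I2 = np.eye(2)
X = np.array([[0.0, 1.0], [1.0, 0.0]])
Z = np.array([[1.0, 0.0], [0.0, -1.0]])

def op_on(n, op, j):
    out = np.array([[1.0]])
    for k in range(n):
        out = np.kron(out, op if k == j else I2)
    return out

def tfim(n, h=1.0):
    """H_TFIM = -sum_j sigma^z_j sigma^z_{j+1} - h sum_j sigma^x_j (open chain)."""
    H = sum(-op_on(n, Z, j) @ op_on(n, Z, j + 1) for j in range(n - 1))
    return H + sum(-h * op_on(n, X, j) for j in range(n))

def maxcut(n, edges):
    """H_MC = 1/2 sum_{(u,v) in E} (I - sigma^z_u sigma^z_v); a bitstring's
    cut value is its H_MC eigenvalue, so the best cut attains the maximum."""
    I = np.eye(2 ** n)
    return 0.5 * sum(I - op_on(n, Z, u) @ op_on(n, Z, v) for u, v in edges)

# Triangle graph: every nontrivial bipartition cuts exactly 2 of 3 edges.
H_mc = maxcut(3, [(0, 1), (1, 2), (0, 2)])
assert np.isclose(np.linalg.eigvalsh(H_mc)[-1], 2.0)

E0 = np.linalg.eigvalsh(tfim(4))[0]   # exact ground-state energy, n = 4
assert E0 < -4.0
```

Exact diagonalization of these dense matrices provides the reference E_0 used in the loss of Eqn. (1) at the small system sizes considered here.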
Each setting is repeated 5 times to collect statistical results.

Evaluation metrics. We utilize three metrics to assess the convergence of QNNs: (1) the loss value L(θ^(T)) at the convergence stage; (2) the number of iteration steps T(ϵ) ≤ T required to achieve ϵ-convergence; (3) the minimum number of parameterized gates required to achieve ϵ-convergence, which can also be interpreted as the threshold of the over-parameterization regime. Additionally, we record the norm of the gradient at initialization to verify the effectiveness of EQNTK in predicting the convergence.

Critical point of QNNs. Fig. 4(c) and Fig. 5(c) illustrate that when the number of parameters surpasses a threshold, QNNs experience a computational phase transition where the loss value L(θ^(T)) at the convergence stage sharply drops by an order of magnitude. Moreover, the minimum number of parameters required to reach the over-parameterization regime, highlighted by the 'critical point', is dramatically reduced by SP. Specifically, the number of parameters of naive QNNs at the critical point in TFIM (or MaxCut), labeled by 'SP0', scales exponentially with the system size. By contrast, it is gradually reduced from 800 (or 1000) to 300 (or 300) after SP1, then to 120 (or 150) after SP2, and finally to 50 (or 100) after SP3, which scales polynomially with the system size and is resource-friendly for modern quantum devices (see Appendix G for a hardware-efficiency analysis).

EQNTK at initialization.

Convergence of QNNs. Fig. 4(d) and Fig. 5(d) show that SP dramatically improves the convergence of QNNs. In the common over-parameterization regime of 'SP1'-'SP3' (i.e., LK ≥ 300), the total number of iterations required to achieve ϵ-convergence can be reduced by up to 6000 steps for TFIM and 5000 steps for MaxCut. For the same ansatz design, increasing the number of parameters linearly improves the convergence, which echoes Theorem 2.

EQNTK and trainability of QNNs. Fig. 4 and Fig. 5 indicate the relation between the EQNTK value and the trainability of QNNs. That is, the number of iterations to converge and the number of parameters at the critical point decrease as the EQNTK value increases. In MaxCut, when the number of parameters LK ≈ 450 reaches the over-parameterization regime in the cases of 'SP1'-'SP3', the corresponding EQNTK values yield Q̄_1 : Q̄_2 : Q̄_3 ≈ 1 : 2 : 4 and the iteration steps T(ϵ) follow T_1 : T_2 : T_3 ≈ 4 : 2 : 1 (T_j and Q̄_j refer to T(ϵ) and Q̄_S in 'SPj' with j ∈ [3]). A similar phenomenon is observed in the TFIM task. These results accord with Theorem 2 in that the EQNTK can guide the trainability of QNNs.

5. Related work

Prior literature related to our work can be cast into two categories: trainability theories of QNNs and the design of symmetric ansätze.

Trainability of QNNs. Current progress has revealed that the causes of barren plateaus include the high entanglement of QNNs (Marrero et al., 2021), the use of global measurements (Cerezo et al., 2021b), and the presence of noise (Wang et al., 2021). A popular way to mitigate barren plateaus is adopting ansätze with special structures, such as those respecting symmetries.

Ansätze with symmetric properties. Previous studies focus on unearthing the inherent symmetry behind the problem Hamiltonian to design problem-specific ansätze. The mainstream approaches include arranging the layout of ansätze (Liu et al., 2019; Seki et al., 2020; Gard et al., 2020; Zheng et al., 2021; 2022), correlating trainable parameters (Shaydulin et al., 2021; Shaydulin & Wild, 2021; Sauvage et al., 2022), and utilizing results from geometric deep learning (Shaydulin et al., 2021; Shaydulin & Wild, 2021; Sauvage et al., 2022; Meyer et al., 2022), where the symmetry comes from the training data. Compared with asymmetric ansätze, these ansätze enable better trainability in GSP. However, none of the previous proposals can identify the implicit symmetry of the problem Hamiltonian. Moreover, although there is numerical evidence that symmetric ansätze can accelerate convergence, theoretical analysis is still rare. EQNTK readily remedies these issues: it provides an efficient measure to compare the trainability of various ansätze and enables an automatic method (symmetric pruning) to design symmetric ansätze with fast convergence.

6. Conclusions

In this study, we investigate the training performance of QNNs for the GSP problem by developing a novel tool, EQNTK, which is capable of capturing the training dynamics of various ansätze via their effective dimension. We prove that a symmetric ansatz design with a small effective dimension enables improved trainability of QNNs, including alleviating barren plateaus and reducing the number of parameters and the circuit depth required to reach the over-parameterization regime. Besides, we propose a novel symmetric pruning algorithm to automatically extract a symmetric ansatz from an over-parameterized and asymmetric one. Empirical results confirm the effectiveness of SP. A future research direction is extending the results of EQNTK from GSP to the regime of machine learning and exploring whether over-parameterized QNNs can simultaneously attain good trainability and generalization (Abbas et al., 2021; Du et al., 2022b; Caro et al., 2022).

A Optimization of QNNs in GSP

In this section, we separately elaborate on the elementary notions of quantum computing, the preliminaries of Hamiltonians and ground state preparation (GSP), and the optimization strategy of QNNs in the task of GSP.

Basics of quantum computation. The elementary unit of quantum computation is the qubit (quantum bit), which is the quantum-mechanical analogue of a classical bit. A qubit is a two-level quantum-mechanical system described by a unit vector in the Hilbert space ℂ². In Dirac notation, a qubit state is defined as |ϕ⟩ = c_0|0⟩ + c_1|1⟩ ∈ ℂ², where |0⟩ = [1, 0]^⊤ and |1⟩ = [0, 1]^⊤ specify two unit bases and the coefficients c_0, c_1 ∈ ℂ satisfy |c_0|² + |c_1|² = 1. Similarly, the quantum state of n qubits is defined as a unit vector in ℂ^{2^n}, i.e., |ψ⟩ = ∑_{j=1}^{2^n} c_j |e_j⟩, where |e_j⟩ ∈ ℝ^{2^n} is the computational basis state whose j-th entry is 1 and other entries are 0, and the coefficients satisfy ∑_{j=1}^{2^n} |c_j|² = 1.

A quantum gate is a unitary operator that evolves a quantum state ρ into another quantum state ρ′. Namely, an n-qubit gate U ∈ U(2^n) obeys U U† = U† U = I_{2^n}, where U(2^n) refers to the unitary group in dimension 2^n. Typical single-qubit quantum gates include the Pauli gates, which can be written as the Pauli matrices

X = [[0, 1], [1, 0]],  Y = [[0, −i], [i, 0]],  Z = [[1, 0], [0, −1]].   (7)

More general quantum gates are the corresponding rotation gates R_X(θ) = e^{−iθX/2}, R_Y(θ) = e^{−iθY/2}, and R_Z(θ) = e^{−iθZ/2} with a tunable parameter θ, whose matrix forms are

R_X(θ) = [[cos(θ/2), −i sin(θ/2)], [−i sin(θ/2), cos(θ/2)]],
R_Y(θ) = [[cos(θ/2), −sin(θ/2)], [sin(θ/2), cos(θ/2)]],
R_Z(θ) = [[e^{−iθ/2}, 0], [0, e^{iθ/2}]].   (8)

They are equivalent to rotating by a tunable angle θ around the x, y, and z axes of the Bloch sphere, and recover the Pauli gates X, Y, and Z (up to a global phase) when θ = π. Moreover, a multi-qubit gate can be either an individual gate (e.g., the CNOT gate) or a tensor product of multiple single-qubit gates.
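The gates of Eqns. (7)-(8) are easy to reproduce with numpy: since P² = I for every Pauli matrix P, the matrix exponential collapses to a cosine/sine form, and R_P(π) = −iP confirms the global-phase remark:

```python
import numpy as np

# Pauli gates, Eqn. (7).
X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]])
Z = np.array([[1, 0], [0, -1]], dtype=complex)

def rot(P, theta):
    """R_P(theta) = e^{-i theta P / 2} = cos(theta/2) I - i sin(theta/2) P,
    valid because P^2 = I for every Pauli matrix P (Eqn. (8))."""
    return np.cos(theta / 2) * np.eye(2) - 1j * np.sin(theta / 2) * P

for P in (X, Y, Z):
    U = rot(P, 0.7)
    assert np.allclose(U @ U.conj().T, np.eye(2))   # rotations are unitary
    assert np.allclose(rot(P, np.pi), -1j * P)      # R_P(pi) = -i P

# At theta = pi the rotation gates recover the Pauli gates up to the
# unobservable global phase -i.
```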
The quantum measurement refers to the procedure of extracting classical information from the quantum state. It is mathematically specified by a Hermitian matrix H called the observable. Applying the observable H to the quantum state |ψ⟩ yields a random variable whose expectation value is ⟨ψ| H |ψ⟩.
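The statement above can be simulated directly via the Born rule: sample eigenvalues of H with probabilities given by the squared overlaps with its eigenvectors, and the sample mean approaches ⟨ψ|H|ψ⟩. A sketch with an illustrative single-qubit observable:

```python
import numpy as np

# Observable H = Z with eigenvalues +1 and -1.
Z = np.array([[1.0, 0.0], [0.0, -1.0]])
psi = np.array([np.cos(0.3), np.sin(0.3)])    # a real single-qubit state

exact = float(psi @ Z @ psi)                  # <psi|H|psi> = cos(0.6)

# Measurement: outcomes are eigenvalues of H, drawn with the Born-rule
# probabilities |<v_j|psi>|^2 over the eigenvectors v_j.
lam, V = np.linalg.eigh(Z)
probs = np.abs(V.T @ psi) ** 2
rng = np.random.default_rng(0)
samples = rng.choice(lam, size=200_000, p=probs)
assert abs(samples.mean() - exact) < 0.01     # sample mean -> expectation
```

On real hardware the expectation value in the loss of Eqn. (1) is estimated in exactly this way, from a finite number of measurement shots.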

Hamiltonian and GSP.

In quantum computation, a Hamiltonian is a Hermitian matrix used to characterize the evolution of a quantum system or employed as an observable to extract classical information from the quantum system. Specifically, under the Schrödinger equation, a quantum gate has the mathematical form U = e^{−itH}, where H is a Hermitian matrix called the Hamiltonian of the quantum system and t refers to the evolution time. Typical single-qubit Hamiltonians include the Pauli matrices defined in Eqn. (7). Accordingly, the evolution time t plays the role of the tunable parameter θ in Eqn. (8). Any single-qubit Hamiltonian can be decomposed as a linear combination of Pauli matrices, i.e., H = a_1 I + a_2 X + a_3 Y + a_4 Z with a_j ∈ ℝ. In the same way, a multi-qubit Hamiltonian can be written as H = ∑_{j=1}^{4^n} a_j P_j, where P_j ∈ {I, X, Y, Z}^⊗n is a tensor product of Pauli matrices. In quantum chemistry and quantum many-body physics, the Hermitian matrix describing the quantum system to be solved is called the problem Hamiltonian. When taking the problem Hamiltonian as the observable, the quantum state |ψ*⟩ is said to be the ground state of the problem Hamiltonian H if the expectation value ⟨ψ*|H|ψ*⟩ attains the minimum eigenvalue of H, which is called the ground-state energy. GSP refers to preparing the ground state of the problem Hamiltonian. A popular protocol for GSP is to employ a parameterized unitary U(θ) to prepare a variational quantum state |ψ(θ)⟩ = U(θ)|ψ_0⟩ with a fixed input state |ψ_0⟩, and then optimize the parameters θ by minimizing a predefined loss function such as Eqn. (1).

Optimization of QNNs. The optimization of the loss function L(θ) in Eqn. (1) can be completed by gradient-based methods. A plethora of optimizers has been designed to estimate the optimal parameters θ* = arg min_θ L(θ). Here we introduce the implementation of a first-order gradient-based optimizer for self-consistency. Refer to Cerezo et al.
(2021a) for a comprehensive review. Based on Eqn. (2), the trainable parameters of QNNs are denoted by θ = (θ_1^⊤, ···, θ_L^⊤)^⊤ with θ_ℓ = (θ_{ℓ1}, ···, θ_{ℓK})^⊤, the loss is L(θ) = (1/2)(⟨ψ_0| U(θ)† H U(θ) |ψ_0⟩ − E_0)^2, and the corresponding update rule at the t-th iteration, ∀t ∈ [T], is

θ^(t+1) = θ^(t) − η ∂L(θ^(t))/∂θ = θ^(t) − η (⟨ψ_0| U(θ^(t))† H U(θ^(t)) |ψ_0⟩ − E_0) ∂(⟨ψ_0| U(θ^(t))† H U(θ^(t)) |ψ_0⟩ − E_0)/∂θ,

where η refers to the learning rate. The derivative in the last equality can be calculated via the parameter shift rule (Mitarai et al., 2018). Mathematically, the derivative with respect to the parameter θ_ℓk for ∀ℓ ∈ [L] and ∀k ∈ [K] is

∂(⟨ψ_0| U(θ)† H U(θ) |ψ_0⟩ − E_0)/∂θ_ℓk = (1/(2 sin α)) ((⟨ψ_0| U(θ_+)† H U(θ_+) |ψ_0⟩ − E_0) − (⟨ψ_0| U(θ_−)† H U(θ_−) |ψ_0⟩ − E_0)),

where θ_+ = θ + α e_ℓk, θ_− = θ − α e_ℓk, e_ℓk is the unit vector along the θ_ℓk axis, and α can be any real number except a multiple of π, at which the denominator vanishes.
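As an illustration of the parameter shift rule, consider a single-qubit toy circuit U(θ) = e^{−iθX} applied to |0⟩ with observable H = Z (our own example, not code from the paper). Since the generator is a Pauli matrix, with this gate convention the exact rule reads f′(θ) = (f(θ+α) − f(θ−α))/(2 sin α cos α); choosing α = π/4 makes the prefactor 1. A minimal numpy sketch:

```python
import numpy as np

# Single-qubit toy model: U(theta) = exp(-i*theta*X) applied to |0>, observable Z.
X = np.array([[0.0, 1.0], [1.0, 0.0]], dtype=complex)
Z = np.array([[1.0, 0.0], [0.0, -1.0]], dtype=complex)
psi0 = np.array([1.0, 0.0], dtype=complex)

def gate(theta):
    # exp(-i*theta*X) = cos(theta) I - i sin(theta) X, since X^2 = I.
    return np.cos(theta) * np.eye(2) - 1j * np.sin(theta) * X

def f(theta):
    """Expectation <psi0| U(theta)^dag Z U(theta) |psi0> (equals cos(2*theta) here)."""
    psi = gate(theta) @ psi0
    return np.real(psi.conj() @ Z @ psi)

def parameter_shift(theta, alpha=np.pi / 4):
    # Exact derivative from two shifted circuit evaluations; the denominator
    # 2 sin(a) cos(a) = sin(2a) equals 1 at alpha = pi/4.
    return (f(theta + alpha) - f(theta - alpha)) / (2 * np.sin(alpha) * np.cos(alpha))

theta = 0.37
grad_shift = parameter_shift(theta)
grad_fd = (f(theta + 1e-6) - f(theta - 1e-6)) / 2e-6  # finite-difference check
```

Unlike finite differences, the two evaluations here sit at macroscopically shifted parameters, which is what makes the rule robust to shot noise on real hardware.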

B Equivalent training dynamics under the mean square error loss

In the main text, to facilitate the convergence analysis, the loss function L(θ) in Eqn. (1) adopts the term E_0, the ground state energy of the problem Hamiltonian. Here we elucidate how to extend our results to a more general loss function in which E_0 is replaced by any C ∈ R with C ≤ E_0. More specifically, the general mean square error loss function is defined as

L(θ, C) = (1/2)(⟨ψ_0| U(θ)† H U(θ) |ψ_0⟩ − C)^2 ≡ (1/2) ε(θ, C)^2, (9)

where ε(θ, C) = ⟨ψ_0| U(θ)† H U(θ) |ψ_0⟩ − C refers to the training error associated with C. For clarity, we denote the training error at the t-th iteration by ε_t(θ^(t), C). Given two loss functions L(θ, C) and L(θ, C′) with C, C′ ≤ E_0, their convergence behaviors, or training dynamics, are said to be equivalent if the variational quantum state |ψ(θ)⟩ converges to the same quantum state at the same convergence rate; more concretely, for the same initial state |ψ^(0)⟩, the evolved state |ψ^(t)⟩ at each iteration t, ∀t ∈ [T], is the same. The following lemma indicates the equivalent training dynamics of QNNs under the loss functions L(θ, C) and L(θ, C′) with C, C′ ≤ E_0.

Lemma 1. Under the framework of the quantum neural tangent kernel in Eqn. (3), given any loss function L(θ, C) in Eqn. (9) with C ≤ E_0, the QNN obeys the same convergence rate as with L(θ, E_0), and the optimized variational quantum state |ψ(θ)⟩ converges to the ground state of the problem Hamiltonian H.

Proof of Lemma 1. In the same manner as Eqn. (3), the QNTK of the loss function in Eqn. (9) can be denoted by Q_C = ∇ε(θ, C)^⊤ ∇ε(θ, C). According to the explicit form of ε(θ, C) in Eqn. (9), the QNTKs Q_C and Q_{C′} coincide for any given constants C and C′, since ∇ε(θ, C) = ∇ε(θ, C′). For this reason, in the following we use Q to refer to the QNTK for any constant C. Recall the result of Theorem 1: in the case of C = E_0, the training error ε(θ, E_0) decays as ε_t(θ^(t), E_0) ≈ e^{−ηQt} ε_0(θ^(0), E_0).
Moreover, since ε_t(θ^(t), E_0) = ε_t(θ^(t), C) + (C − E_0) for ∀t ∈ [T], the training error of QNNs under the loss L(θ, C) obeys

ε_t(θ^(t), C) + (C − E_0) ≈ e^{−ηQt} (ε_0(θ^(0), C) + (C − E_0)).

Since the right-hand side tends to zero for sufficiently large t, we have ε_t(θ^(t), C) + (C − E_0) ≈ 0. In other words, ε(θ, C) converges to its minimal value E_0 − C at the decay rate ηQ. Supported by the variational principle, the optimized variational quantum state U(θ^(T))|ψ_0⟩ at the converged stage is exactly the ground state, whose estimated energy is E_0.
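Lemma 1's claim, that shifting the baseline C below E_0 does not change the state the optimizer reaches, can be checked on a single-parameter toy model (our own illustration; in this toy the QNTK is not constant, unlike the over-parameterized regime, so only the fixed point rather than the exact rate carries over):

```python
import numpy as np

def energy(theta):
    # <0| e^{i*theta*X} Z e^{-i*theta*X} |0> = cos(2*theta); ground energy E0 = -1.
    return np.cos(2 * theta)

def d_energy(theta):
    return -2.0 * np.sin(2 * theta)

def train(C, theta0=0.4, eta=0.05, steps=4000):
    """Gradient descent on L(theta, C) = 0.5 * (energy(theta) - C)^2."""
    theta = theta0
    for _ in range(steps):
        theta -= eta * (energy(theta) - C) * d_energy(theta)
    return theta

# Both the exact baseline C = E0 and a shifted baseline C < E0 drive the
# state to the ground state (energy -> -1), as Lemma 1 predicts.
E_exact = energy(train(C=-1.0))
E_shift = energy(train(C=-3.0))
```

Note that with C < E_0 the residual error stays bounded away from zero near convergence, which keeps the effective step size finite; in this toy that even accelerates the final approach.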

C Proof of Theorem 2

The proof of Theorem 2 employs the following two lemmas, whose proofs are given in the subsequent two subsections.

Lemma 2 (Adapted from You et al. (2022), Lemma D.1). Following Definition 2, let U_A be a matrix subgroup of SU(d) in which each element commutes with a Hermitian matrix Σ. The corresponding direct-sum decomposition is denoted by V = ⊕_{j=1}^p V_j with projections Π_j. Let V* be the subspace of interest, which includes the input state |ψ_0⟩ and the ground state |ψ*⟩, and denote by Π* = P P† the projection onto V*. Then for any Hermitian matrix W and any unitary U in the group U_A, we have

Π* U W U† Π* = (Π* U Π*)(Π* W Π*)(Π* U† Π*).

Lemma 3. Following the notation of Lemma 2, denote

U_{−,ℓk} ≡ (∏_{ℓ′=1}^{ℓ−1} U_{ℓ′}(θ_{ℓ′})) ∏_{k′=1}^{k−1} e^{−iθ_{ℓk′} G_{k′}},  U_{+,ℓk} ≡ (∏_{k′=k}^{K} e^{−iθ_{ℓk′} G_{k′}}) ∏_{ℓ′=ℓ+1}^{L} U_{ℓ′}(θ_{ℓ′}).

Then the EQNTK takes the form

Q_S = −∑_{ℓ=1}^L ∑_{k=1}^K ⟨ψ*_0| (U*_{+,ℓk})† [G*_k, (U*_{−,ℓk})† H* U*_{−,ℓk}] U*_{+,ℓk} |ψ*_0⟩^2,

where |ψ*_0⟩ = P† |ψ_0⟩ and A* = P† A P for A ∈ {U_{+,ℓk}, G_k, U_{−,ℓk}, H}.

Proof of Theorem 2. Following the gradient descent optimizer in Eqn. (1) with learning rate η ≤ 1, the change of the training error of the QNN can be expressed as

đε = ∑_{ℓ,k} (∂ε/∂θ_ℓk) đθ_ℓk = −η ∑_{ℓ,k} (∂ε/∂θ_ℓk)(∂ε/∂θ_ℓk) ε = −η Q_S ε,

where đε = ε_{t+1} − ε_t and đθ_ℓk = θ^{(t+1)}_{ℓk} − θ^{(t)}_{ℓk}; the second equality comes from the update rule with đθ_ℓk = −ηε ∂ε/∂θ_ℓk, and the third equality uses the definition of the QNTK in Eqn. (3). Following the results in Liu et al. (2022c, Theorem 1), when the EQNTK value Q_S^(t) is a constant, the training error decays as

ε_t ≈ (1 − ηQ_S^(t))^t ε_0 ≈ e^{−ηQ_S^(t) t} ε_0, (15)

which guarantees an exponential convergence towards the global optima. To this end, the proof of Theorem 2 amounts to proving that when the number of parameters satisfies |θ| = LK ≫ 1, the EQNTK can be regarded as a constant.
This can be achieved by deriving an analytical solution of Q_S^(t) on average, as well as the fluctuations around this average, for all iterations. Following the above explanations, we next analyze the average of Q_S^(t). When no confusion arises, the superscript (t) of Q_S^(t) and U(θ^(t)) is dropped in the subsequent analysis. By leveraging Lemmas 2 and 3, the Haar average of the EQNTK yields

Q̄_S = −∑_{ℓ=1}^L ∑_{k=1}^K ∫ dU*_{+,ℓk} dU*_{−,ℓk} ⟨ψ*_0| (U*_{+,ℓk})† [G*_k, (U*_{−,ℓk})† H* U*_{−,ℓk}] U*_{+,ℓk} |ψ*_0⟩^2
= −∑_{ℓ=1}^L ∑_{k=1}^K ∫ dU*_{+,ℓk} dU*_{−,ℓk} Tr(ρ*_0 (U*_{+,ℓk})† M_{−,ℓk} U*_{+,ℓk} ρ*_0 (U*_{+,ℓk})† M_{−,ℓk} U*_{+,ℓk})
= −∑_{ℓ=1}^L ∑_{k=1}^K ∫ dU*_{−,ℓk} [ (Tr^2(M_{−,ℓk}) Tr((ρ*_0)^2) + Tr((M_{−,ℓk})^2) Tr^2(ρ*_0)) / (d_eff^2 − 1) + (Tr^2(M_{−,ℓk}) Tr^2(ρ*_0) + Tr((M_{−,ℓk})^2) Tr((ρ*_0)^2)) / (d_eff − d_eff^3) ]
= −∑_{ℓ=1}^L ∑_{k=1}^K ∫ dU*_{−,ℓk} Tr((M_{−,ℓk})^2) / (d_eff^2 + d_eff)
= −∑_{ℓ=1}^L ∑_{k=1}^K ∫ dU*_{−,ℓk} [ 2 Tr((G*_k (U*_{−,ℓk})† H* U*_{−,ℓk})^2) − 2 Tr((G*_k)^2 ((U*_{−,ℓk})† H* U*_{−,ℓk})^2) ] / (d_eff^2 + d_eff)
= (2/(d_eff^2 + d_eff)) · ((d_eff Tr((H*)^2) − Tr^2(H*)) / (d_eff^2 − 1)) · ∑_{ℓ=1}^L ∑_{k=1}^K Tr((G*_k)^2)
≈ LK Tr((H*)^2) / d_eff^2,

where the second equality employs the assumption that U*_{−,ℓk} and U*_{+,ℓk} match the Haar distribution on the group SU(d_eff) up to the second moment, ρ*_0 = |ψ*_0⟩⟨ψ*_0| is the projection of ρ_0 = |ψ_0⟩⟨ψ_0| onto the subspace V*, and M_{−,ℓk} = [G*_k, (U*_{−,ℓk})† H* U*_{−,ℓk}]; the third equality exploits the RTNI package (Fukuda et al., 2019) to calculate the integration with respect to the Haar measure; the fourth equality uses the fact Tr((ρ*_0)^2) = Tr(ρ*_0) = 1 with ρ*_0 being a pure state (together with Tr(M_{−,ℓk}) = 0 for a commutator); the fifth equality utilizes Tr([A, B]^2) = 2 Tr(ABAB) − 2 Tr(A^2 B^2); the second-to-last equality uses the RTNI package again; and the last equality utilizes Tr((G*_k)^2) = Tr(G_k Π* G_k Π*) ≤ ∑_{j=1}^p Tr(G_k Π_j G_k Π_j) ≤ Tr(G_k^2) with Tr(G_k^2) = Tr(I) = 2^n, which gives Tr((G*_k)^2) ≈ (d_eff/2^n) · 2^n = d_eff.
The fluctuation of the EQNTK can be expressed as ∆Q_S^2 = E(Q_S^2) − Q̄_S^2, i.e.,

∆Q_S^2 = 2 ∑_{ℓ_1,k_1 < ℓ_2,k_2} ∫ dU*_{+,ℓ_1k_1} dU*_{+,ℓ_2k_2} dU*_{−,ℓ_1k_1} dU*_{−,ℓ_2k_2} Tr(ρ*_0 (U*_{+,ℓ_1k_1})† M_{−,ℓ_1k_1} U*_{+,ℓ_1k_1} ρ*_0 (U*_{+,ℓ_1k_1})† M_{−,ℓ_1k_1} U*_{+,ℓ_1k_1}) × Tr(ρ*_0 (U*_{+,ℓ_2k_2})† M_{−,ℓ_2k_2} U*_{+,ℓ_2k_2} ρ*_0 (U*_{+,ℓ_2k_2})† M_{−,ℓ_2k_2} U*_{+,ℓ_2k_2})
+ ∑_{ℓ,k} ∫ dU*_{+,ℓk} dU*_{−,ℓk} Tr(ρ*_0 (U*_{+,ℓk})† M_{−,ℓk} U*_{+,ℓk} ρ*_0 (U*_{+,ℓk})† M_{−,ℓk} U*_{+,ℓk})^2 − Q̄_S^2
= (LK/d_eff^4)(8 Tr^2((H*)^2) + 12 Tr((H*)^4)) + (LK/d_eff^5)(16 Tr((H*)^2) Tr^2(H*) + 48 Tr((H*)^3) Tr(H*) + 40 Tr^2((H*)^2)) + ···
≈ (LK/d_eff^4)(8 Tr^2((H*)^2) + 12 Tr((H*)^4)),

where the sum ℓ_1, k_1 < ℓ_2, k_2 runs over index pairs with ℓ_1 K + k_1 < ℓ_2 K + k_2. Taken together, when the number of parameters LK ≫ 1, so that ∆Q_S/Q̄_S ≈ 1/√(LK) ≪ 1, the EQNTK can be viewed as a constant and Eqn. (15) is satisfied.

C.1 Proof of Lemma 2

Proof of Lemma 2. The two conditions in the lemma, i.e., (i) any unitary U in U_A commutes with the Hermitian matrix Σ and (ii) Σ leads to the decomposition V = ⊕_{j=1}^p V_j with projections Π_j, imply that U is block-diagonal with respect to the projections {Π_j}_{j=1}^p. In other words, we have Π_{j′} U Π_j = 0 for j ≠ j′ and ∀U ∈ U_A. This observation gives the following result:

Π* U W U† Π* = Π* U (∑_{j=1}^p Π_j) W (∑_{j′=1}^p Π_{j′}) U† Π* = ∑_{j,j′∈[p]} (Π* U Π_j) W (Π_{j′} U† Π*) = Π* U Π* W Π* U† Π* = (Π* U Π*)(Π* W Π*)(Π* U† Π*),

where the first equality employs the fact I = ∑_{j=1}^p Π_j, the third equality uses the block-diagonal structure of U (only the terms with Π_j = Π_{j′} = Π* survive), and the last equality uses the property of projectors Π_j^2 = Π_j.

C.2 Proof of Lemma 3

Proof of Lemma 3. The explicit form of the training error ε(θ) = ⟨ψ_0| U(θ)† H U(θ) |ψ_0⟩ − E_0 leads to the explicit form of the QNTK, i.e.,

Q = (∇ε(θ))^⊤ ∇ε(θ) = −∑_{ℓ=1}^L ∑_{k=1}^K ⟨ψ_0| U†_{+,ℓk} [G_k, U†_{−,ℓk} H U_{−,ℓk}] U_{+,ℓk} |ψ_0⟩^2.

Similarly, for the symmetric ansatz U(θ) with the projection Π* = P P†, the EQNTK yields

Q_S = −∑_{ℓ=1}^L ∑_{k=1}^K Tr(|ψ_0⟩⟨ψ_0| U†_{+,ℓk} [G_k, U†_{−,ℓk} H U_{−,ℓk}] U_{+,ℓk})^2
= −∑_{ℓ=1}^L ∑_{k=1}^K Tr(Π* ρ_0 Π* U†_{+,ℓk} [G_k, U†_{−,ℓk} H U_{−,ℓk}] U_{+,ℓk})^2
= −∑_{ℓ=1}^L ∑_{k=1}^K Tr(Π* ρ_0 Π* U†_{+,ℓk} [G_k, U†_{−,ℓk} H U_{−,ℓk}] U_{+,ℓk} Π*)^2
= −∑_{ℓ=1}^L ∑_{k=1}^K Tr(Π* ρ_0 Π* U†_{+,ℓk} Π* [G_k, U†_{−,ℓk} H U_{−,ℓk}] Π* U_{+,ℓk} Π*)^2
= −∑_{ℓ=1}^L ∑_{k=1}^K Tr(Π* ρ_0 Π* U†_{+,ℓk} Π* [Π* G_k Π*, Π* U†_{−,ℓk} Π* H Π* U_{−,ℓk} Π*] Π* U_{+,ℓk} Π*)^2
= −∑_{ℓ=1}^L ∑_{k=1}^K Tr(P† ρ_0 P P† U†_{+,ℓk} P [P† G_k P, P† U†_{−,ℓk} P P† H P P† U_{−,ℓk} P] P† U_{+,ℓk} P)^2
= −∑_{ℓ=1}^L ∑_{k=1}^K Tr(ρ*_0 (U*_{+,ℓk})† [G*_k, (U*_{−,ℓk})† H* U*_{−,ℓk}] U*_{+,ℓk})^2
= −∑_{ℓ=1}^L ∑_{k=1}^K ⟨ψ*_0| (U*_{+,ℓk})† [G*_k, (U*_{−,ℓk})† H* U*_{−,ℓk}] U*_{+,ℓk} |ψ*_0⟩^2,

where the second equality utilizes the assumption that |ψ_0⟩ lies in V*; the third, fourth, and fifth equalities employ the projector property (Π*)^2 = Π* together with Lemma 2; and the final equalities follow from the definitions |ψ*_0⟩ = P† |ψ_0⟩ and A* = P† A P with A ∈ {U_{−,ℓk}, U_{+,ℓk}, G_k, H, ρ_0}.

D Proof of Proposition 1

Before elaborating on the proof of Proposition 1, we first briefly review the definition of the dynamical Lie algebra (DLA).

Definition 3 (Definition 3, Larocca et al. (2021a)). Given an ansatz design A, the dynamical Lie algebra (DLA) g is generated by the repeated nested commutators of the operators in A. That is,

g = span⟨iG_1, ···, iG_K⟩_Lie,

where span⟨S⟩_Lie denotes the Lie closure, i.e., the set obtained by repeatedly taking commutators of the elements in S.

The proof of Proposition 1 employs the following lemma.

Lemma 4. Consider a QNN instance (|ψ_0⟩, U(θ), H) whose DLA is g. If there exists an invariant subspace V_g of dimension d_g under g that includes the input state |ψ_0⟩ and the ground state |ψ*⟩, then the effective dimension of this ansatz design A satisfies d_eff = d_g.

Proof of Lemma 4. We first demonstrate the equivalence between the group U_A = ∪_{L=0}^∞ {U(θ) : θ ∈ R^{LK}} and the group generated by the elements of g, i.e.,

U_A = {e^V : V ∈ g}. (23)

U_A ⇒ {e^V : V ∈ g}. To facilitate understanding, we consider a single-layer unitary U(θ) with L = 1 and the ansatz design A = {G_1, G_2}. From the Baker-Campbell-Hausdorff formula, we have U(θ) = e^{iθ_1 G_1} e^{iθ_2 G_2} = e^{J_1(θ)}, where

J_1(θ) = i(θ_1 G_1 + θ_2 G_2 + (iθ_1θ_2/2)[G_1, G_2] − (θ_1^2 θ_2/12)[G_1, [G_1, G_2]] + ···). (25)

Eqn. (25) implies that merging e^{iθ_1 G_1} and e^{iθ_2 G_2} into a single exponential yields an evolution generated by an operator J_1(θ) depending on both θ_1 and θ_2, which contains nested commutators of G_1 and G_2. Therefore, we have J_1(θ) ∈ g and U(θ) ∈ {e^V : V ∈ g}. For the case of multiple layers, i.e., U(θ) = ∏_{ℓ=1}^L e^{iθ_{ℓ1} G_1} e^{iθ_{ℓ2} G_2}, we have U(θ) = e^{J_L(θ)} ∈ {e^V : V ∈ g} by recursively applying the Baker-Campbell-Hausdorff formula to reformulate U(θ) through J_L(θ) ∈ g.

U_A ⇐ {e^V : V ∈ g}. Since each element of g is a linear combination of the nested commutators in Eqn.
(25), there always exists θ ∈ R^{2L} for any V ∈ g such that J_L(θ) = V, and thus e^{J_L(θ)} = U(θ) ∈ ∪_{L=0}^∞ {U(θ) : θ ∈ R^{LK}}. Taken together, we obtain Eqn. (23) in the case of K = 2. The results for an ansatz design A with more than two elements can be derived in the same manner; more details can be found in Section IV of Larocca et al. (2021b). The equivalence of U_A and {e^V : V ∈ g} indicates that for any G ∈ g and U ∈ U_A, G and U commute with the same Hermitian matrix Σ, since U can be expressed as e^G and hence shares the same eigenspaces with G. This implies that the invariant subspace induced by U_A is the same as the one induced by g, and thus d_eff = d_g.

Proof of Proposition 1. Following Lemma 4, the DLA-based EQNTK is the same as the EQNTK discussed in Theorem 2, because the corresponding U(θ) induces the same invariant subspace. Hence, the results of Theorem 2 apply to the DLA-based EQNTK upon replacing the effective dimension d_eff with d_g.
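As a sanity check, the Lie closure of Definition 3 can be computed numerically for small systems by repeatedly taking commutators and tracking the dimension of the resulting span. The following is a minimal numpy sketch of ours (an illustration, not the paper's implementation); for a single qubit with A = {X, Z} the closure is the full su(2) of dimension 3:

```python
import itertools
import numpy as np

def lie_closure_dim(generators, tol=1e-9, max_rounds=20):
    """Dimension of the DLA g = span<iG_1, ..., iG_K>_Lie, obtained by
    repeatedly adding commutators until the span stops growing."""
    basis = []   # orthonormal vectorized basis of the current span
    elems = []   # matrix representatives of the independent elements
    def try_add(M):
        v = M.flatten().astype(complex)
        for b in basis:
            v = v - (b.conj() @ v) * b   # Gram-Schmidt against current span
        norm = np.linalg.norm(v)
        if norm > tol:
            basis.append(v / norm)
            elems.append(M)
            return True
        return False
    for G in generators:
        try_add(1j * np.asarray(G, dtype=complex))
    for _ in range(max_rounds):
        grew = False
        for A, B in itertools.combinations(list(elems), 2):
            if try_add(A @ B - B @ A):   # nested commutators
                grew = True
        if not grew:
            break
    return len(basis)

X = np.array([[0, 1], [1, 0]])
Z = np.array([[1, 0], [0, -1]])
dim_g = lie_closure_dim([X, Z])  # [iX, iZ] produces iY, so dim = 3
```

Since the span dimension is bounded by the dimension of the full matrix algebra, the loop always terminates; this brute-force approach is only feasible for very few qubits.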

E Implementation details of the symmetric pruning algorithm

In this section, we elucidate Steps 2-1, 2-2, and 2-3 of the proposed SP in Alg. 1. Recall that the considered problem Hamiltonian is expressed as H = H̄ ⊗ I^⊗m with the effective Hamiltonian H̄ = ∑_{j=1}^q α_j H_j, where α_j is a real coefficient and H_j is a tensor product of Pauli matrices on n qubits. A symmetry S of a Hamiltonian H is a unitary operator leaving H invariant, i.e.,

S H S† = H. (26)

All of these symmetries form a symmetry group S: for any two symmetries S_1, S_2 ∈ S, their compositions S_1 ∘ S_2 and S_2 ∘ S_1 and their inverses S_1^{−1} and S_2^{−1} are also symmetries in S. In SP, these symmetries are classified into three categories, namely, the system symmetry (Step 2-1), the structure symmetry (Step 2-2), and the spatial symmetry (Step 2-3). Suppose the initialized asymmetric ansatz is U(θ); SP adopts the following methods to tailor this ansatz to obey the above symmetries.

System symmetry. System symmetry considers the symmetry on qubit wires. Specifically, since the problem Hamiltonian takes the form H = H̄ ⊗ I^⊗m and thus acts trivially on the last m qubits, the symmetry condition in Eqn. (26) holds for any unitary of the form S_sys = I^⊗n ⊗ U, where U is an arbitrary unitary in SU(2^m). All such unitaries are called system symmetries and form a subgroup of the symmetry group S, i.e., S_sys = {S_sys = I^⊗n ⊗ U : U ∈ SU(2^m)}. The system symmetry of a unitary V can be recognized by checking S_sys V S†_sys = V. In this regard, SP assigns the system symmetry to U(θ) by removing the redundant parameterized gates and two-qubit gates associated with the last m qubit wires. In doing so, the pruned ansatz has the form U_Pr(θ) = U_1(θ) ⊗ I^⊗m, which yields S_sys (U_1(θ) ⊗ I^⊗m) S†_sys = U_1(θ) ⊗ I^⊗m, where U_1(θ) is the unitary extracted from U(θ) (the gates applied on the first n qubit wires).

Structure symmetry. The structure symmetry S_str refers to a symmetry of the effective Hamiltonian H̄, which satisfies S_str H̄ S†_str = H̄. Moreover, an ansatz V(θ) is said to be structure-symmetric to the problem Hamiltonian if there exist a non-trivial symmetry S_str (i.e., not the identity operation) and θ ∈ Θ\{0} such that S_str V(θ) S†_str = V(θ). A feasible way to construct a structure-symmetric ansatz is to restrict the corresponding ansatz design to contain only the Pauli terms of H̄.
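Restricting the ansatz design to the Pauli terms of the effective Hamiltonian requires reading those terms off the Hamiltonian. When the Hamiltonian is given as a matrix, its Pauli terms can be recovered from the orthogonality of Pauli strings, a_P = Tr(P H)/2^n. A minimal numpy sketch of ours (illustrative, not the paper's implementation), shown on the n = 2 TFIM Hamiltonian:

```python
import itertools
from functools import reduce
import numpy as np

PAULI = {
    "I": np.eye(2, dtype=complex),
    "X": np.array([[0, 1], [1, 0]], dtype=complex),
    "Y": np.array([[0, -1j], [1j, 0]], dtype=complex),
    "Z": np.array([[1, 0], [0, -1]], dtype=complex),
}

def pauli_terms(H, tol=1e-9):
    """Nonzero Pauli terms of H: coefficients a_P = Tr(P H) / 2^n."""
    n = int(np.log2(H.shape[0]))
    terms = {}
    for chars in itertools.product("IXYZ", repeat=n):
        label = "".join(chars)
        P = reduce(np.kron, [PAULI[c] for c in label])
        a = np.trace(P @ H) / 2**n
        if abs(a) > tol:
            terms[label] = a.real  # real for Hermitian H
    return terms

# Example: n = 2 TFIM Hamiltonian H = -Z1 Z2 - X1 - X2.
H2 = (-np.kron(PAULI["Z"], PAULI["Z"])
      - np.kron(PAULI["X"], PAULI["I"])
      - np.kron(PAULI["I"], PAULI["X"]))
terms = pauli_terms(H2)
```

The surviving labels directly give the generator set {H_1, ..., H_q} from which the structure-symmetric (HVA-style) ansatz design can be assembled; the 4^n loop restricts this brute-force variant to small n.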
Given the pruned ansatz U_Pr returned by Step 2-1, SP (Step 2-2) assigns the structure symmetry to it by removing specific single-qubit gates and two-qubit gates so that the pruned ansatz design follows A = {H_1, ···, H_q}. The tailored ansatz returned by Step 2-2 coincides with HVA, i.e., U_1(θ) transforms to a new ansatz whose ℓ-th layer is expressed as ∏_{j=1}^q e^{−iθ_{ℓj} H_j} for ∀ℓ ∈ [L].

Spatial symmetry. Spatial symmetry is a discrete symmetry capturing the permutation invariance of the sites of the problem Hamiltonian, and it relates tightly to the problem of graph automorphism, which has been widely studied in graph theory. For this reason, here we introduce the spatial symmetry from the graphical perspective and elucidate the implementation of Step 2-3 in Alg. 1. The key in this step is leveraging algorithms developed in graph theory to automatically identify the spatial symmetry of problem Hamiltonians. From the graphical view, an n-qubit Hamiltonian H corresponds to a graph G = (V, E) with n vertices, where the j-th node v_j ∈ V represents the j-th site (qubit) of H and the edge E_{i,j} ∈ E characterizes the interaction strength between the i-th and j-th sites (qubits). This graph can further be described by an adjacency matrix D. Recall that a spatial symmetry π of a Hamiltonian H is a permutation over the sites leaving H invariant, i.e., πHπ^{−1} = H (or equivalently [π, H] = 0). In other words, the spatial symmetry π preserves the topology of G: for any (u, v) ∈ E, we have (π(u), π(v)) ∈ E, and πDπ^{−1} = D. In GSP, the action of π on an n-qubit state, |ψ⟩ → π|ψ⟩, means permuting the indices of the qubits. For instance, a permutation π with π(1) = 3, π(2) = 1, π(3) = 2 acting on the state |ψ_1⟩|ψ_2⟩|ψ_3⟩ yields π(|ψ_1⟩|ψ_2⟩|ψ_3⟩) = |ψ_3⟩|ψ_1⟩|ψ_2⟩. All these permutations form the discrete symmetric group S_n with cardinality n!.
In particular, the spatial symmetries of the Hamiltonian form the automorphism group of its corresponding graph, defined as Aut(H) = {π_a ∈ S_n | π_a H π_a^{−1} = H}, or equivalently Aut(H) = {π_a ∈ S_n | π_a D π_a^{−1} = D}. The qubits (or qubit-pairs) in the ansatz corresponding to the nodes (or edges) that can be swapped are called equivalent qubits (or qubit-pairs). More precisely, for any node u ∈ V (or edge (u, v) ∈ E), if there exists π ∈ Aut(H) such that π(u) = x (or π(u, v) = (x, y)), then the qubits (or qubit-pairs) corresponding to u and x (or (u, v) and (x, y)) are called equivalent qubits (or qubit-pairs). Given the ansatz returned by Step 2-2, Step 2-3 assigns the spatial symmetry to it by correlating the single-qubit parameterized gates on equivalent qubits and the two-qubit parameterized gates on equivalent qubit-pairs.
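For small Hamiltonian graphs, Aut(H) can even be brute-forced directly from the adjacency matrix D by testing πDπ^{−1} = D over all n! permutations (tools such as nauty are needed only for larger instances). A minimal sketch of ours on the open TFIM chain:

```python
import itertools
import numpy as np

def automorphisms(D):
    """All permutations pi with pi D pi^{-1} = D, by brute force (small graphs only)."""
    n = D.shape[0]
    auts = []
    for perm in itertools.permutations(range(n)):
        P = np.eye(n, dtype=int)[list(perm)]   # permutation matrix of perm
        if np.array_equal(P @ D @ P.T, D):
            auts.append(perm)
    return auts

# Open 1D chain on 6 sites (the TFIM interaction graph).
n = 6
D = np.zeros((n, n), dtype=int)
for j in range(n - 1):
    D[j, j + 1] = D[j + 1, j] = 1

auts = automorphisms(D)
# The only non-trivial automorphism is the reflection j -> n - 1 - j
# (pi(j) = n + 1 - j in the paper's 1-based indexing).
```

The orbits of the nodes (and edges) under the returned permutations are exactly the equivalent qubits and qubit-pairs whose gate parameters Step 2-3 ties together.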

F The limitations of applying classical pruning methods to QNNs

Although both SP in Alg. 1 and classical pruning techniques distill a smaller network (or ansatz) from an over-parameterized one from the viewpoint of algorithmic implementation, the latter cannot be directly employed to enhance the power of QNNs. Recall that a common feature of classical pruning methods is scoring each parameter or network element and then removing those with low scores. Such scores generally correspond to the magnitude of parameters (Frankle & Carbin, 2018), the gradient of parameters (Lee et al., 2018; Wang et al., 2020), or the Hessian matrix (LeCun et al., 1989; Hassibi & Stork, 1992), evaluated at initialization or at termination. Unfortunately, Cerezo & Coles (2021) proved that the gradient information in QNNs with random deep ansätze vanishes exponentially with the number of qubits. In other words, the gradient fails to provide any useful information to guide pruning. Meanwhile, the output of QNNs can be regarded as a periodic function of the parameters (Schuld et al., 2021), which forbids employing the parameters' magnitude as the metric to guide pruning. Therefore, it is inappropriate to straightforwardly apply classical pruning methods to QNNs, where the extracted ansatz may not promise enhanced trainability. In contrast with classical pruning methods, our proposal does not require any gradient information to construct the symmetric ansätze. Instead, it removes redundant gates and shrinks the solution space according to the information of the problem Hamiltonian.

G More numerical simulation details

In this section, we provide the simulation details omitted in the main text.

Hardware efficient ansatz.

As shown in the left panel of Fig. 6, HEA yields the layer-stacking structure following Eqn. (2), where each layer consists of multiple single-qubit Pauli rotation gates and fixed two-qubit CNOT gates. In our numerical simulations, the ℓ-th layer of the employed HEA takes the form

U_ℓ(θ) = U^(1)_ent (⊗_{j=1}^n R_X(θ^ℓ_{j,1}) R_Y(θ^ℓ_{j,2}) R_Z(θ^ℓ_{j,3})) × U^(2)_ent (⊗_{j=1}^n R_X(θ^ℓ_{j,4}) R_Y(θ^ℓ_{j,5}) R_Z(θ^ℓ_{j,6})),

where R_µ(θ^ℓ_{j,k}) = e^{−iθ^ℓ_{j,k} µ} with µ ∈ {X, Y, Z} denotes a parameterized single-qubit gate, and U^(1)_ent = ⊗_{j=1}^{⌊n/2⌋} CNOT_{2j−1,2j} and U^(2)_ent = ⊗_{j=1}^{⌊(n−1)/2⌋} CNOT_{2j,2j+1} refer to the entangling layers, with ⌊a⌋ being the greatest integer no larger than a.

Transverse-field Ising model.

A central problem in quantum many-body physics is predicting the properties of quantum systems from the first principles of quantum mechanics. The transverse-field Ising model (TFIM) has been employed to explore many interesting quantum systems. In our numerical simulation, we employ an n-qubit Hamiltonian of the 1D TFIM with an open boundary condition, i.e.,

H_TFIM = −∑_{j=1}^{n−1} σ^z_j σ^z_{j+1} − ∑_{j=1}^n σ^x_j,

where σ^µ_j denotes the µ-Pauli matrix (with µ = x, z) acting on the j-th qubit. The effective dimension of HVA under this Hamiltonian is given by d_eff = n² (Larocca et al., 2021a). The Hamiltonian is graphically depicted in Fig. 7(a).
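For small n, H_TFIM can be built explicitly as a matrix and its ground energy E_0 obtained by exact diagonalization, which is how the target energy in the loss can be validated in simulation. A minimal numpy sketch of ours (an illustration, not the paper's code):

```python
from functools import reduce
import numpy as np

I2 = np.eye(2)
X = np.array([[0.0, 1.0], [1.0, 0.0]])
Z = np.array([[1.0, 0.0], [0.0, -1.0]])

def tfim_hamiltonian(n):
    """H = -sum_j Z_j Z_{j+1} - sum_j X_j on n qubits, open boundary."""
    dim = 2**n
    H = np.zeros((dim, dim))
    for j in range(n - 1):            # ZZ couplings
        ops = [I2] * n
        ops[j] = ops[j + 1] = Z
        H -= reduce(np.kron, ops)
    for j in range(n):                # transverse field
        ops = [I2] * n
        ops[j] = X
        H -= reduce(np.kron, ops)
    return H

H = tfim_hamiltonian(4)
E0 = np.linalg.eigvalsh(H)[0]         # exact ground energy for validation
```

The 2^n × 2^n matrices limit this construction to roughly a dozen qubits, which is sufficient for cross-checking the residual training error ε = ⟨H⟩ − E_0 in the experiments.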

MaxCut.

Although many important problems in statistical physics and operations research (Wheeler, 2004) can be formulated as MaxCut, finding the optimal solution of MaxCut has been proven to be NP-hard (Karp, 1972), and quantum computers are expected to attain better approximate solutions than classical computers (Farhi et al., 2014; Zhou et al., 2022). In this work, we consider the MaxCut problem on Erdos-Renyi graphs, whose topology is less structured. An Erdos-Renyi (ER) graph on the vertex set V is a random graph in which each pair of nodes (u, v) is connected independently with probability p. Fig. 7(b) shows the instance of the ER graph used in the numerical simulation, with p = 0.6.
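A standard encoding maps MaxCut on a graph G = (V, E) to the diagonal problem Hamiltonian H = −∑_{(u,v)∈E} (I − Z_u Z_v)/2, whose ground energy equals minus the maximum cut value. A small numpy sketch of ours on a triangle graph (our own illustration, not the paper's code):

```python
from functools import reduce
import numpy as np

I2 = np.eye(2)
Z = np.diag([1.0, -1.0])

def maxcut_hamiltonian(n, edges):
    """Diagonal problem Hamiltonian H = -sum_{(u,v)} (I - Z_u Z_v)/2.
    Its ground energy is minus the maximum cut value of the graph."""
    dim = 2**n
    H = np.zeros((dim, dim))
    for u, v in edges:
        ops = [I2] * n
        ops[u] = ops[v] = Z
        H -= (np.eye(dim) - reduce(np.kron, ops)) / 2.0
    return H

# Triangle graph: any bipartition cuts at most 2 of the 3 edges.
H = maxcut_hamiltonian(3, [(0, 1), (1, 2), (0, 2)])
max_cut = -np.min(np.diag(H))
```

Because H is diagonal in the computational basis, each basis state corresponds to a bipartition and its diagonal entry to minus the cut size, so the ground state directly encodes an optimal cut.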

Evolution of ansätze.

Here we present the evolution of the ansatz structure for the transverse-field Ising model during symmetric pruning in Fig. 6, which serves as an example for better understanding the learning dynamics of our proposal. Specifically, we adopt the hardware efficient ansatz as the initial over-parameterized ansatz, as shown on the left side of Fig. 6. The gates on the last two wires, corresponding to I^⊗2, are first removed through Step 2-1 (referred to as 'SP1') in Alg. 1 to ensure the system symmetry. Subsequently, in Step 2-2, SP employs the symmetry information of the problem Hamiltonian H_TFIM to remove the parameterized single-qubit gates and two-qubit gates on the first six wires, such that the pruned ansatz design is A_pr = {σ^x_1, ···, σ^x_6, σ^z_1σ^z_2, ···, σ^z_5σ^z_6}. Finally, in Step 2-3, the spatial symmetry pruning correlates the parameterized gates on the equivalent qubits and qubit-pairs through the automorphism returned by the package nauty. In the case of TFIM, nauty returns a non-trivial automorphism π(j) = n + 1 − j that reflects the qubit chain. This operation reduces the number of free parameters from 11 for the ansatz pruned by 'SP2' to 6 for the ansatz returned by 'SP3'.

Hardware efficiency analysis.

We use two standard metrics to quantify the hardware efficiency: (1) the number of measurements (shots) M required to complete one optimization step; (2) the circuit depth l(ϵ) required to reach the over-parameterization criterion characterized by ϵ-convergence with ϵ = 10^{−5}. Namely, the first metric concerns the runtime cost of training QNNs, and the second metric evaluates the quantum resources required to construct the quantum circuit.
To facilitate the comparison of various ansätze under the first metric, the number of shots for a single parameter update is set to s, so that the total number of measurements taken to complete one optimization step scales linearly with the number of parameters, i.e., M = LKs. As depicted in Figure 8, in the task of TFIM, compared with the initial over-parameterized asymmetric ansatz (labeled 'SP0'), the required number of measurements M for the pruned ansatz is dramatically reduced by 14 times; meanwhile, the required circuit depth is reduced by 3 times. In the task of MaxCut, the hardware efficiency improvement is problem-dependent. In particular, the required circuit depth is reduced by about 1.8 times and 2 times for the Erdos-Renyi graph and the 3-regular graph, respectively. For the Erdos-Renyi graph, compared to the ansatz returned by 'SP1', the required circuit depth of the ansatz returned by 'SP2' slightly increases from 100 to 130. This increase originates from the fact that an Erdos-Renyi graph with a large number of edges requires a relatively deep circuit to construct the symmetric ansatz after SP1. Although the circuit depth subtly increases, an evident benefit is a dramatic reduction of the number of measurements M: the required M for the pruned ansatz is reduced by 10 times compared with the initial ansatz. The achieved numerical results confirm that the symmetric ansatz output by our proposal can effectively improve hardware efficiency. As such, it can simultaneously reduce the quantum resources required to reach the over-parameterization regime, enable an efficient implementation on NISQ devices, and, more importantly, improve the convergence rate so as to reduce the number of accesses to the quantum devices.

Training dynamics analysis of symmetric ansatz.

Here we numerically exhibit that the EQNTK has the ability to capture the training dynamics of QNNs with symmetric ansätze.
The hyperparameter settings are as follows. In the task of TFIM, the number of qubits of the Hamiltonian is set to n = 6. We employ the QNN with the symmetric ansatz processed by SP with L = 80 layers to optimize the loss function defined in Eqn. (1). The learning rate η and the maximum number of iterations T are set to 10^{−4} and 1000, respectively. Figure 9 plots the theoretically predicted residual training error according to Theorem 2, the practical residual training error ε over 30 independent random initializations, and their average versus the gradient descent optimization steps. The numerical results show that the residual error ε decays exponentially, which echoes the training dynamics derived in Theorem 2.



Figure 1: The critical point of the over-parameterized regime. When the number of parameters exceeds the critical point (the red circle), the training error exponentially converges to a nearly global minimum. Symmetric ansätze (the blue curve) require fewer parameters than asymmetric ansätze to reach the critical point.

Suppose there exists j* ∈ [p] such that V_{j*} = V* includes the input state |ψ_0⟩, the ground state |ψ*⟩, and all possible variational states {|ψ(θ)⟩ | θ ∈ Θ}. Then the dynamics of |ψ(θ)⟩ can be derived within this subspace of dimension d_eff = |V*|, dubbed the effective dimension of the QNN. Definition 2 (Effective dimension). Consider a QNN instance (|ψ_0⟩, U(θ), H) with symmetric ansatz design A. Suppose V* is the invariant subspace covering |ψ_0⟩, {|ψ(θ)⟩ | θ ∈ Θ}, and |ψ*⟩. Then the effective dimension of this QNN is d_eff = |V*|. The projection onto this subspace is defined as Π = P P†, where the columns of P ∈ C^{d×d_eff} form an arbitrary orthonormal basis of V*.

Figure 3: Schematic of symmetric pruning. The proposed symmetric pruning (SP) distills the symmetric ansatz from a given asymmetric ansatz, completed by removing the redundant gates (highlighted by the red dashed boxes) and correlating the parameters of the gates respecting the spatial symmetry (highlighted by the solid boxes with the same color).

Figure 4: Results for the TFIM model under symmetric pruning. Panels (a)-(c) plot the EQNTK value Q_S at initialization, the loss value after convergence L(θ^(T)), and the number of iterations to achieve ϵ-convergence T(ϵ), versus the number of parameters LK, respectively. The labels 'SP0'-'SP3' refer to the initial ansatz and the pruned ansatz after system symmetric pruning, structure symmetric pruning, and spatial symmetric pruning, respectively.

QNNs.McClean et al. (2018) first discovered the barren plateau of QNNs. Since then, a line of research is uncovering intrinsic reasons leading to this phenomenon.

Figure 5: Results for MaxCut under SP. The notations are identical to those in Fig. 4.

∑_j |c_j|^2 = 1 with c_j ∈ C. Besides Dirac notation, the density matrix can be used to describe more general qubit states. For example, the density matrix of the state |ψ⟩ is ρ = |ψ⟩⟨ψ| ∈ C^{2^n×2^n}, where ⟨ψ| = |ψ⟩† refers to the complex conjugate transpose of |ψ⟩. For a set of qubit states {p_j, |ψ_j⟩}_{j=1}^m with p_j > 0, ∑_{j=1}^m p_j = 1, and |ψ_j⟩ ∈ C^{2^n} for j ∈ [m], the density matrix is ρ = ∑_{j=1}^m p_j ρ_j with ρ_j = |ψ_j⟩⟨ψ_j| and Tr(ρ) = 1.

where the subscript 'ℓk' refers to the k-th parameter of the ℓ-th layer U ℓ for ∀k ∈ [K] and ∀ℓ ∈ [L]. Recall the loss function takes the form

the derivation of the second equality mainly follows the results of Liu et al. (2022b, Appendix C), and the approximation comes from truncating the expansion and preserving the leading-order term.

Figure 6: Evolution of the ansatz structure during symmetric pruning. From left to right: the initial hardware efficient ansatz and the ansatz structure at different stages of symmetric pruning, where 'SP1', 'SP2', and 'SP3' refer to sub-steps 2-1, 2-2, and 2-3 in Alg. 1, respectively, and L′ < L. The symbol 'RX' ('RY', 'RZ') refers to a single-qubit rotation around the x- (y-, z-) axis, and I refers to the identity gate. The rotation gates with the same color in the pruned ansatz are correlated by one individual parameter per layer.

Figure 7: Graph representation of problem Hamiltonian. The left panel and the right panel depict the graph representation of the TFIM model and Erdos-Renyi graph with p = 0.6, respectively.

Figure 8: The quantum resources required for achieving ϵ-convergence. The left, middle, and right panels depict the number of measurements required to complete one optimization step and the circuit depth required to achieve ϵ-convergence in the tasks of TFIM, MaxCut on the Erdos-Renyi graph, and MaxCut on the 3-regular graph, respectively. In particular, the label '×s' indicates that the total number of shots M is the product of the value displayed in the histogram and the number of shots s for updating a single parameter. The labels 'SP0'-'SP3' refer to the initial ansatz and the pruned ansatz after system symmetric pruning, structure symmetric pruning, and spatial symmetric pruning, respectively.

2. Our key technical contribution is a tighter convergence bound for QNNs with various symmetric ansätze (see Theorem 2 and Lemma 1). In particular, our bound yields γ = O(poly(LK, d_eff^{−1})), where LK is the number of parameters and d_eff is the effective dimension. The comparison with prior results is summarized in Table 1. Our results not only greatly reduce the threshold for reaching over-parameterization but also promise an improved convergence rate. These two conclusions are indispensable for applying over-parameterized QNNs to solve practical problems.

Given an n-qubit Hamiltonian H ∈ C^{2^n×2^n}, GSP aims to find the eigenvector |ψ*⟩ ∈ C^{2^n} (i.e., the ground state) of H corresponding to its minimum eigenvalue. For any n-qubit state |ψ⟩, the variational principle ensures ⟨ψ| H |ψ⟩ ≥ ⟨ψ*| H |ψ*⟩.

Two typical problem Hamiltonians, drawn from many-body physics and combinatorial optimization, are considered. The omitted details are deferred to Appendix G.
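As a concrete illustration of the many-body case, the transverse-field Ising model (TFIM) on a ring of n qubits can be built explicitly for small n. A minimal sketch assuming the standard form H = −Σᵢ ZᵢZᵢ₊₁ − h Σᵢ Xᵢ with periodic boundary conditions (the exact conventions used in the paper are in Appendix G):

```python
import numpy as np
from functools import reduce

X = np.array([[0.0, 1.0], [1.0, 0.0]])
Z = np.array([[1.0, 0.0], [0.0, -1.0]])
I2 = np.eye(2)

def kron_at(op, site, n):
    """Place `op` on qubit `site` of an n-qubit register (identity elsewhere)."""
    return reduce(np.kron, [op if q == site else I2 for q in range(n)])

def tfim_hamiltonian(n, h=1.0):
    """H = -sum_i Z_i Z_{i+1} - h * sum_i X_i on a ring of n qubits."""
    H = np.zeros((2**n, 2**n))
    for i in range(n):
        H -= kron_at(Z, i, n) @ kron_at(Z, (i + 1) % n, n)  # ZZ coupling on the ring
        H -= h * kron_at(X, i, n)                            # transverse field
    return H

H = tfim_hamiltonian(4)
E0 = np.linalg.eigvalsh(H)[0]  # ground-state energy by exact diagonalization
```

The ring topology of the ZZ couplings is exactly the cycle graph depicted in the left panel of Figure 7, which is what the spatial symmetric pruning step exploits.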

Published as a conference paper at ICLR 2023

The flexibility of SP. For Hamiltonians with a complicated topological structure, the automorphism group of the corresponding graph is hard to compute manually. In this work, we employ nauty (McKay et al., 1981) to automatically recognize the automorphism group of the graph corresponding to the Hamiltonian. Besides nauty, there are many heuristic algorithms for computing automorphism groups, including Traces (McKay & Piperno, 2014), saucy (Darga et al., 2004), Bliss (Junttila & Kaski, 2007), and conauto (López-Presa et al., 2014). All of them can easily be integrated into SP. Moreover, these heuristic algorithms can handle most graphs with up to tens of thousands of nodes in less than a second (McKay & Piperno, 2014).
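For small instances, the automorphism group can also be enumerated without specialized tools. A minimal sketch using networkx's GraphMatcher (an illustrative substitute for nauty, chosen here for portability; automorphisms are exactly the isomorphisms of a graph onto itself), applied to the 4-node ring from the TFIM example:

```python
import networkx as nx
from networkx.algorithms.isomorphism import GraphMatcher

# 4-qubit ring, matching the TFIM coupling graph
G = nx.cycle_graph(4)

# Automorphisms = isomorphisms of G onto itself; each is a node permutation
autos = list(GraphMatcher(G, G).isomorphisms_iter())

# The cycle C_n has dihedral symmetry of order 2n; here 2 * 4 = 8 permutations
print(len(autos))  # -> 8
```

Each permutation returned this way induces a candidate parameter-sharing pattern: qubits in the same orbit of the automorphism group can share rotation parameters, which is the correlation that spatial symmetric pruning imposes. Brute-force enumeration scales poorly, which is why nauty-style tools are used for large graphs.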

7. Acknowledgements

Yong Luo was supported by the Special Fund of Hubei Luojia Laboratory under Grant 220100014 and the National Natural Science Foundation of China under Grant 62276195. Tongliang Liu was partially supported by Australian Research Council Projects IC-190100031, LP-220100527, DP-220102121, and FT-220100318. 

