QUARK: A GRADIENT-FREE QUANTUM LEARNING FRAMEWORK FOR CLASSIFICATION TASKS

Abstract

As more practical and scalable quantum computers emerge, much attention has been focused on realizing quantum supremacy in machine learning. Existing quantum ML methods either (1) embed a classical model into a target Hamiltonian to enable quantum optimization or (2) represent a quantum model using variational quantum circuits and apply classical gradient-based optimization. The former leverages the power of quantum optimization but only supports simple ML models, while the latter provides flexibility in model design but relies on gradient calculation, resulting in barren plateaus (i.e., vanishing gradients) and frequent classical-quantum interactions. To address the limitations of existing quantum ML methods, we introduce Quark, a gradient-free quantum learning framework that optimizes quantum ML models using quantum optimization. Quark does not rely on gradient computation and therefore avoids barren plateaus and frequent classical-quantum interactions. In addition, Quark supports more general ML models than prior quantum ML methods and achieves a dataset-size-independent optimization complexity. Theoretically, we prove that Quark can outperform classical gradient-based methods by reducing model query complexity for highly non-convex problems; empirically, evaluations on the Edge Detection and Tiny-MNIST tasks show that Quark can support complex ML models and significantly reduce the number of measurements needed for discovering near-optimal weights for these tasks.

1. INTRODUCTION

Quantum computing provides a new computational paradigm to achieve exponential speedups over classical counterparts for various tasks, such as cryptography (Shor, 1994), scientific simulation (Tazhigulov et al., 2022), and data analytics (Arute et al., 2019). A key advantage of quantum computing is its ability to entangle multiple quantum bits, called qubits, allowing $n$ qubits to encode a $2^n$-dimensional vector, while encoding this vector in classical computing requires $2^n$ bits. Inspired by this potential, recent work (Jaderberg et al., 2022; Macaluso et al., 2020b; Torta et al., 2021; Kapoor et al., 2016; Bauer et al., 2020; Farhi & Neven, 2018a; Schuld et al., 2014; Cong et al., 2019b) has focused on realizing quantum speedups over classical algorithms in the field of supervised learning.

Existing quantum ML work can be divided into two categories: classical model with quantum optimization (CMQO) and quantum model with classical optimization (QMCO). First, CMQO methods embed a classical ML model jointly with the optimization problem into a target Hamiltonian and optimize the model using quantum adiabatic evolution (QAE) (Finnila et al., 1994) or the quantum approximate optimization algorithm (QAOA) (Farhi et al., 2014; Torta et al., 2021). As the transition between a classical model and the target Hamiltonian only applies to low-order polynomial activations (see Figure 2), CMQO methods do not support ML models with non-linear activations that cannot be represented as low-order polynomials (e.g., ReLU). Second, QMCO methods optimize variational quantum models¹ by iteratively performing gradient descent using classical optimizers. QMCO methods are fundamentally limited by barren plateaus (i.e., vanishing gradients (McClean et al., 2018)) and the high cost of frequent quantum-classical interactions.
To address the limitations of existing quantum ML methods, we introduce Quark, a gradient-free quantum learning framework for classification tasks that optimizes quantum models with quantum optimization (QMQO). Figure 1 shows an overview of Quark. A key idea behind Quark is entangling the weight² of an ML model (i.e., the $R_W$ register in Figure 1) and the encoded dataset (i.e., the $R_D$ register in Figure 1) in a quantum state, where model weights that achieve optimal classification accuracy on the training dataset can be observed with the highest probabilities in a measurement. Therefore, users can obtain highly accurate model weights by directly measuring the updated $R_W$ weight register. To maximize the probability of observing optimal weights, we introduce two key techniques.

Figure 1: An overview of the Quark optimization framework. Each horizontal line indicates a qubit, and each box on these lines represents one or multiple quantum gates applied on these qubits. Quark's optimization pipeline includes three stages: (1) $\Psi_0$ preparation, which initializes $R_W$, $R_D$, $R_O$ and performs the model's forward processing $U_M$, (2) amplitude amplification using a Grover-based algorithm, and (3) weights measurement. Quark uses K-parallel datasets (KPD) to maximize the probability of observing highly accurate weights in each measurement. Darker bars in the probability plots denote weights with higher accuracies.

Amplitude amplification. Quark uses a Grover-based mechanism to iteratively update the probability distribution of weights based on their training accuracies. As a result, the probability of observing weights with higher accuracy increases after each Grover iteration, as shown in Figure 1.

K-parallel datasets (KPD).

Applying amplitude amplification on one dataset results in a linear amplification scenario where the measurement probability of each weight is proportional to its training accuracy $J(w_i)$. We further introduce K-parallel datasets, a technique to enable exponential amplification. Specifically, by entangling $k$ identical training datasets with model weights in parallel (using $k \times R_D$, as shown in Figure 1), the probability of observing weight $w_i$ in a measurement is proportional to $J(w_i)^k$. Therefore, as $k$ increases, the optimized probability distribution of weights gradually concentrates on the optimal weights.

Compared with CMQO methods, Quark provides more flexibility in model design by composing models directly on quantum circuits and therefore supports a broader range of ML models. Compared with QMCO methods, Quark does not require gradient calculation and therefore does not suffer from barren plateaus. Quark avoids frequent classical-quantum interactions by realizing both model design and optimization fully on quantum hardware. Besides, by using basis encoding for the training dataset, Quark supports non-linear operations (e.g., ReLU) in its model architecture, and the optimization complexity is independent of the training dataset size.

Theoretically, we compare model query complexity³ between Quark and gradient-based methods on a balanced C-way classification task, and prove that Quark can outperform gradient-based methods by reducing model query complexity for highly non-convex problems. In addition, we prove that using K-parallel datasets can further reduce model query complexity under certain circumstances.

Simulations on two tasks (i.e., Edge Detection and Tiny-MNIST) show that Quark supports complex ML models, which can include quantum convolution, pooling, fully connected, and ReLU layers. In addition, Quark can significantly reduce the number of measurements needed for discovering a near-optimal weight by applying amplitude amplification and KPD.
Contributions. This paper makes the following contributions:
• We propose Quark, a gradient-free quantum learning framework for classification tasks that optimizes quantum ML models with quantum optimization. Quark avoids barren plateaus and frequent classical-quantum interactions, supports more general models than prior quantum ML frameworks, and achieves a dataset-size-independent optimization complexity.
• Theoretically, we prove that Quark can outperform gradient-based methods by reducing model query complexity for highly non-convex problems and that using KPD can further reduce model query complexity.
• Empirically, we show that Quark can support complex ML models and significantly reduce the number of measurements needed for discovering a near-optimal weight for the Edge Detection and Tiny-MNIST tasks.

2. RELATED WORK

Figure 2 compares Quark with existing quantum ML approaches.

2.1. CLASSICAL MODEL WITH QUANTUM OPTIMIZATION

Existing CMQO methods aim at solving classical ML problems with quantum optimization techniques by leveraging the advantage of quantum parallelism (Nielsen & Chuang, 2002). Based on the well-established algorithmic foundation in quantum annealing (Finnila et al., 1994; Kadowaki & Nishimori, 1998; Brooke et al., 1999; Santoro et al., 2002; Santoro & Tosatti, 2006) and adiabatic quantum computing (Farhi et al., 2001; Albash & Lidar, 2018), prior work (Denil & De Freitas, 2011; Dumoulin et al., 2014; Adachi & Henderson, 2015) attempts to realize a quantum restricted Boltzmann machine (RBM) by formulating the RBM as an Ising model (Cipra, 1987). Inspired by the quantum approximate optimization algorithm (QAOA) (Farhi et al., 2014), Torta et al. (2021) embed a single binary perceptron layer into a target Hamiltonian to search for optimal weights. However, CMQO methods are limited by the locality restriction of the target Hamiltonian and can only embed models with low-order polynomial activations (e.g., square). This limitation prevents CMQO methods from supporting practical deep learning architectures, which generally contain non-polynomial activations such as ReLU and sigmoid. Similar to Quark, Kapoor et al. (2016) also use Grover's algorithm to find a hyperplane that can perfectly separate the training dataset. However, this method only applies to an idealized setup where a hyperplane with perfect classification exists in its search space. Besides, the method cannot adapt to generic model architectures other than single-layer perceptrons.

2.2. QUANTUM MODEL WITH CLASSICAL OPTIMIZATION

Motivated by the recent advances in variational quantum algorithms (VQAs) (Cerezo et al., 2021), QMCO methods use variational quantum circuits (VQC) (Benedetti et al., 2019) to represent the trainable parameters of an ML model. Havlíček et al. (2019) and Schuld & Killoran (2019) use VQC as a variational feature map to reproduce linear support vector machines (SVM) and kernel methods on quantum circuits, which can outperform classical counterparts under certain circumstances (Liu et al., 2021). Besides conventional ML methods, recent work has also explored the feasibility of classical neural networks on quantum circuits (Massoli et al., 2022). Farhi & Neven (2018b) 2021) uses a Grover algorithm as part of their method, they still require VQC as their model building blocks that require gradients update. Besides, due to amplitude-based data encoding, VQC-based methods in general suffer from a linear dependency on dataset size in terms of model query complexity during training. Another drawback of amplitude encoding is that, due to the unitary constraint of quantum transformations, non-linear operations are hard to implement for VQC-based methods. In contrast, our method uses basis encoding, which concerns only qubit state transformations rather than amplitude transformations and thereby enables lower model query complexity and more general non-linear transformations.

3.1. NOTATIONS

Let $D = \{(x_i, y_i)\}_{i \in N}$ denote a training dataset, where $x_i \in \{0, 1\}^{d_x}$ is the binarized feature vector associated with the $i$-th sample, and $y_i \in \{0, 1\}^{d_y}$ is its label. Let $\hat{y} = f(w, x) : \{0, 1\}^{d_w} \times \{0, 1\}^{d_x} \to \{0, 1\}^{d_y}$ denote our model parameterized by $w \in \{0, 1\}^{d_w}$. Given an objective $l(\hat{y}, y)$, our goal is to find a near-optimal $w^*$ that minimizes/maximizes the overall objective $J(w) = \frac{1}{N} \sum_{(x_i, y_i) \in D} l(f(w, x_i), y_i)$. We focus on classification tasks and use the objective $l(\hat{y}, y) = \mathbb{1}(\hat{y} = y)$. We use $|\cdot|$ to denote the cardinality of a set and the absolute value of a scalar, and use $\|\cdot\|_2$ and $\|\cdot\|_\infty$ to denote the $L_2$-norm and infinity norm of a vector. Finally, we use $\otimes$ to denote the tensor product, and use $\neg$, $\oplus$, $\wedge$, and $\vee$ to denote NEGATE, XOR, AND, and OR in logical expressions, respectively.
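As a concrete reading of the notation above, the following minimal Python sketch evaluates $J(w)$ with the indicator objective on binary tuples. The names `objective`, `xor_model`, and `toy_dataset` are illustrative assumptions and are not taken from the paper.

```python
from typing import Callable, Sequence, Tuple

Bits = Tuple[int, ...]

def objective(f: Callable[[Bits, Bits], Bits],
              w: Bits,
              dataset: Sequence[Tuple[Bits, Bits]]) -> float:
    """J(w): fraction of samples (x, y) for which f(w, x) equals y."""
    hits = sum(1 for x, y in dataset if f(w, x) == y)
    return hits / len(dataset)

# Hypothetical one-bit model f(w, x) = w XOR x (assumption for illustration).
def xor_model(w: Bits, x: Bits) -> Bits:
    return (w[0] ^ x[0],)

# Two samples with label y = x.
toy_dataset = [((0,), (0,)), ((1,), (1,))]
j_w0 = objective(xor_model, (0,), toy_dataset)
j_w1 = objective(xor_model, (1,), toy_dataset)
```

With this dataset the weight $w = 0$ classifies every sample correctly and $w = 1$ classifies none, so the two objective values sit at the two extremes of the accuracy range.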

3.2. QUANTUM BASICS

A bit in the quantum regime, called a qubit, is represented by a superposition of $|0\rangle$ and $|1\rangle$, which is formally defined as $|z\rangle = \alpha|0\rangle + e^{i\phi}\beta|1\rangle$, where $\alpha$ and $\beta$ are the amplitudes, and $e^{i\phi}$ is the relative phase. Furthermore, normalization enforces $\langle z|z\rangle = 1$, where $\langle z|$ is the conjugate transpose of $|z\rangle$. In an $n$-qubit system, a quantum state is represented as a superposition of $2^n$ basis states. For computational simplicity, we use the basis states spanned by $\{|0\rangle, |1\rangle\}^n$ and denote the uniform superposition of an $n$-qubit state as $|\Psi\rangle = \sum_{i=0}^{2^n - 1} \frac{1}{\sqrt{2^n}} |i\rangle$, where $|i\rangle$ is the corresponding computational basis state. We thus use $|w_i\rangle \in \{|0\rangle, |1\rangle\}^{d_w}$ to denote a weight basis state, and $|x_j, y_j\rangle$ for the entangled data basis state, where $|x_j\rangle \in \{|0\rangle, |1\rangle\}^{d_x}$ is the feature and $|y_j\rangle \in \{|0\rangle, |1\rangle\}^{d_y}$ is the label. As a comparison, in the classical computing regime, each operation can only be applied to one state.
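The uniform superposition above can be checked numerically. The sketch below is a classical state-vector illustration using NumPy, not part of the paper; it builds the state by applying a Hadamard gate to every qubit of $|0\rangle^{\otimes n}$ (a standard construction) and compares it with the closed form. The function names are assumptions.

```python
import numpy as np

def uniform_superposition(n_qubits: int) -> np.ndarray:
    """Closed form: every one of the 2^n amplitudes equals 1/sqrt(2^n)."""
    dim = 2 ** n_qubits
    return np.full(dim, 1.0 / np.sqrt(dim))

def hadamard_on_all(n_qubits: int) -> np.ndarray:
    """Tensor product of n single-qubit Hadamard matrices."""
    h = np.array([[1.0, 1.0], [1.0, -1.0]]) / np.sqrt(2.0)
    op = np.array([[1.0]])
    for _ in range(n_qubits):
        op = np.kron(op, h)
    return op

n = 3
zero_state = np.zeros(2 ** n)
zero_state[0] = 1.0                      # the basis state |000>
psi = hadamard_on_all(n) @ zero_state    # uniform superposition over 8 basis states
norm = float(psi @ psi)                  # <psi|psi>, should be 1
```

The dense matrix grows as $4^n$, so this check is only practical for a handful of qubits.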

4. METHODOLOGY

To realize quantum supremacy, we circumvent the overhead induced by frequent classical-quantum interactions and the gradient calculation step in VQC-based methods, which may result in barren plateaus. To this end, we designed Quark in a QMQO fashion, which (1) achieves gradient-free optimization over the entire weight distribution, (2) uses basis encoding to achieve a potential speedup as the sample size scales up, and (3) enables a flexible quantum model design that can easily incorporate non-linearities. Figure 1 shows an overview of the quantum circuit design in Quark.

As we are using basis encoding over the entire data distribution, we initialize the data register $R_D$ as $|D\rangle = \sum_{(x_i, y_i) \in D} \frac{1}{\sqrt{|D|}} |x_i, y_i\rangle$. To initialize a model's weights register $R_W$, we construct a uniform state by applying the Hadamard gate on each weight qubit: $|W\rangle = \sum_{i=0}^{2^{d_w} - 1} \frac{1}{\sqrt{2^{d_w}}} |w_i\rangle$. In addition, Quark includes an auxiliary register $R_O$ for storing intermediate results and the model's output, which is initialized as $|O\rangle = |0\rangle^{\otimes d_o}$. Therefore, the initial state is represented as $|W\rangle \otimes |D\rangle \otimes |O\rangle$. By encoding the quantum model architecture as a unitary matrix $U_M$, the model's forward processing is defined as:

$U_M(|W\rangle \otimes |D\rangle \otimes |O\rangle) = \sum_i \frac{1}{2^{d_w/2}} |w_i\rangle \sum_j \frac{1}{\sqrt{|D|}} |x_j, y_j\rangle |f(w_i, x_j)\rangle$

where $f(w_i, x_j)$ is the model's output for weight $w_i$ and sample $x_j$.⁴ Given the above expression for the model forward processing, the key insight for Quark is to update the probability of observing weight $w$ in a measurement based on its training objective $J(w)$. This is achieved through Grover's algorithm by defining the solution state space $S = \{|w_i\rangle |x_j, y_j\rangle |f(w_i, x_j)\rangle \mid y_j = f(w_i, x_j)\}$.
Appendix A.1 includes an introduction to the original Grover's algorithm. After amplitude amplification, we observe a model weight by measuring the weight register $R_W$.

To further illustrate our insight, Figure 3 shows a toy example, where the underlying oracle function that generates the training data is $g(x) = 0 \oplus x$. Using one qubit for $x$, the training dataset is constructed as $\{(x_0 = 0, y_0 = 0), (x_1 = 1, y_1 = 1)\}$. By defining the learner model as $o = f(x, w) = w \oplus x$, where $w$ is the trainable parameter, Quark uses four qubits (i.e., $|w\rangle$, $|x\rangle$, $|y\rangle$, and $|o\rangle$). For initial state preparation, we simply run the model forward circuit with a uniform weight initialization, which results in a uniform amplitude state, as shown on the left of Figure 3. As no optimization happens at this stage, the probabilities of observing $|w\rangle = |0\rangle$ and $|w\rangle = |1\rangle$ in a measurement are equal. After amplitude amplification, Quark reaches an optimized amplitude state (shown on the right of Figure 3), where the probability of observing the optimal weight $|w\rangle = |0\rangle$ in a measurement is much higher.

For the rest of this section, Section 4.1 introduces Quark's on-circuit model design philosophy, Section 4.2 describes the Grover-based method for amplitude amplification, and Section 4.3 introduces K-Parallel Datasets (KPD), a technique that enables exponential amplification to further improve the optimization algorithm.
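The state produced by the model forward step in the toy example of Figure 3 can be enumerated classically. The following sketch lists the basis states $|w\rangle|x, y\rangle|f(w, x)\rangle$ with their uniform amplitudes and extracts the solution set $S$; it is an illustration only, and `forward_states` and `solution_set` are hypothetical helper names.

```python
from itertools import product
from math import sqrt

def forward_states(dataset, f, n_weight_bits=1):
    """Enumerate basis states |w>|x,y>|f(w,x)> with their uniform amplitudes."""
    weights = list(product([0, 1], repeat=n_weight_bits))
    amp = 1.0 / sqrt(len(weights) * len(dataset))
    return [((w, x, y, f(w, x)), amp) for w in weights for x, y in dataset]

def solution_set(states):
    """Keep the states whose output register equals the label register."""
    return [basis for basis, _ in states if basis[2] == basis[3]]

toy_dataset = [(0, 0), (1, 1)]          # samples (x, y) generated by g(x) = 0 XOR x
toy_model = lambda w, x: w[0] ^ x       # learner o = w XOR x
states = forward_states(toy_dataset, toy_model)
solutions = solution_set(states)
```

All solution states carry $w = 0$, which is why amplifying $S$ shifts probability mass toward the optimal weight.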

4.1. MODEL DESIGN

The model circuit $U_M$ can be designed through arbitrary combinations of basic gates (CNOT, rotation gates, Hadamard gates, etc.), which entangle $|W\rangle$ and $|D\rangle$ to yield a trainable model. Leveraging the flexibility of basis encoding, Quark is capable of realizing modules used in classical deep models.

4.2. AMPLITUDE AMPLIFICATION

To increase the probability of measuring the weights with the highest classification accuracy, we use Grover's algorithm to amplify the amplitude of every state $|w_i\rangle |x_j, y_j\rangle |f(w_i, x_j)\rangle \in S$, so that the measurement probability of the optimal weights is amplified the most. In this section, we formally define the unitary operators needed for Grover's update, followed by the complexity analysis of the optimization algorithm.

The conventional Grover's algorithm needs two reflection unitaries, namely $U_{s^\top}$ and $U_{\Psi_0}$, where $U_{s^\top}$ is the reflection about the non-solution sub-space and $U_{\Psi_0}$ is the reflection about the initial state. In our case, by defining the solution state $|s\rangle = \sum_{|w_i\rangle |x_j, y_j\rangle |f(w_i, x_j)\rangle \in S} \frac{1}{\sqrt{|S|}} |w_i\rangle |x_j, y_j\rangle |f(w_i, x_j)\rangle$ and the initial state $|\Psi_0\rangle = \sum_{i,j} \frac{1}{\sqrt{2^{d_w} |D|}} |w_i\rangle |x_j, y_j\rangle |f(w_i, x_j)\rangle$, the Grover operator can be built as $U_{comb} = U_{\Psi_0} U_{s^\top}$, where $U_{s^\top} = I - 2|s\rangle\langle s|$ and $U_{\Psi_0} = 2|\Psi_0\rangle\langle\Psi_0| - I$. With the Grover operator defined, we can analyze the algorithm in terms of model query complexity.

Theorem 1 By defining $\alpha = \frac{\int_{w_i \in W_\epsilon} J(w_i)\,d\delta}{\int_{w_i \in W} J(w_i)\,d\delta}$ as the probability to measure an ϵ-optimal solution after the Grover update, the model query complexity for getting an ϵ-optimal solution $w_i \in W_\epsilon$ on a balanced C-way classification task is $O(\frac{\sqrt{C}}{\alpha})$. $J(w_i)$ is the objective value (accuracy) for weights $w_i$, $W$ is the complete weight space, and $W_\epsilon$ is the ϵ-optimal weight subspace.

The proof of Theorem 1 is included in Appendix A.3.1. It follows from the analysis that we need $O(\sqrt{C})$ Grover iterations per measurement and $O\left(\frac{\int_{w_i \in W} J(w_i)\,d\delta}{\int_{w_i \in W_\epsilon} J(w_i)\,d\delta}\right)$ measurements in expectation for sampling an ϵ-optimal weight. Notice that the complexity of our method does not depend on the sample size $N$. As we are sampling solutions from an optimized distribution for a general non-convex problem, our method does depend on the $O\left(\frac{\int_{w_i \in W} J(w_i)\,d\delta}{\int_{w_i \in W_\epsilon} J(w_i)\,d\delta}\right)$ term.
However, to achieve an ϵ-optimal solution on a general non-convex problem, VQC-based methods with gradient-based optimizers need, in the worst case, $O\left(\frac{\int_{w_i \in W} d\delta}{\int_{w_i \in W_\epsilon} d\delta}\right)$ iterations to sample initial points that lie within convex regions containing ϵ-optimal solutions. With a per-iteration gradient evaluation cost on the order of $O(N)$, this gives an overall complexity of $O\left(\frac{\int_{w_i \in W} d\delta}{\int_{w_i \in W_\epsilon} d\delta} N\right)$ for VQC-based methods. Since $\frac{\int_{w_i \in W} J(w_i)\,d\delta}{\int_{w_i \in W_\epsilon} J(w_i)\,d\delta} < \frac{\int_{w_i \in W} d\delta}{\int_{w_i \in W_\epsilon} d\delta}$ and $\sqrt{C} \ll N$ in practice, our method provides a speed-up for finding an ϵ-optimal solution in the general non-convex setup. In addition, as Quark does not require gradient calculation, it does not suffer from the notorious problem of barren plateaus. For the more idealized case of convex problems, Garg et al. (2020) have proven that no quantum speed-up can be obtained, and our method is no exception.
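The Grover update of this section can be illustrated with a small classical state-vector simulation. The sketch below is an assumption-laden toy, not the authors' implementation: states are indexed by (weight, sample) because the output register is fully determined by them, and the hypothetical matrix `correct[w, j]` records whether weight `w` classifies sample `j` correctly. The accuracies are chosen so that the mean accuracy is 1/4, the value expected for a balanced 4-way task.

```python
import numpy as np

def amplified_weight_distribution(correct: np.ndarray, iterations: int) -> np.ndarray:
    """Apply U_comb = U_Psi0 U_s to the uniform state and return p(w)."""
    n_w, n_d = correct.shape
    dim = n_w * n_d
    psi0 = np.full(dim, 1.0 / np.sqrt(dim))          # |Psi_0>, uniform over (w, j)
    marked = correct.reshape(-1).astype(float)
    s = marked / np.sqrt(marked.sum())               # |s>, uniform over solution states
    psi = psi0.copy()
    for _ in range(iterations):
        psi = psi - 2.0 * s * (s @ psi)              # U_s = I - 2|s><s|
        psi = 2.0 * psi0 * (psi0 @ psi) - psi        # U_Psi0 = 2|Psi_0><Psi_0| - I
    return (psi ** 2).reshape(n_w, n_d).sum(axis=1)  # marginal over the weight register

# Four candidate weights, eight samples; accuracies 0.5, 0.25, 0.125, 0.125.
correct = np.zeros((4, 8), dtype=bool)
correct[0, :4] = True
correct[1, :2] = True
correct[2, :1] = True
correct[3, :1] = True
accuracies = correct.mean(axis=1)
before = amplified_weight_distribution(correct, 0)
after = amplified_weight_distribution(correct, 1)
```

In this example the solution probability is exactly 1/4, so a single Grover iteration rotates the state fully onto $|s\rangle$ and the measured weight distribution becomes proportional to the accuracies, in line with the linear amplification described for a single dataset.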

4.3. K-PARALLEL DATASETS (KPD)

A naive implementation can only achieve $p(w_i) \propto J(w_i)$, which requires more measurements for sampling an ϵ-optimal solution. To this end, we further extend our method to achieve $p(w_i) \propto J(w_i)^k$ through K-Parallel Datasets (KPD), as shown in Figure 1 when $k > 1$. The intuition is similar to simulated annealing: by updating the weight distribution proportionally to $J(w_i)^k$, the probability mass gradually converges to the globally optimal solutions. We do so by concatenating the same dataset $K$ times to increase the number of solution states:

$U_M \left( |W\rangle \bigotimes_{k=0}^{K-1} (|D_k\rangle \otimes |O_k\rangle) \right) = \sum_i \frac{1}{2^{d_w/2}} |w_i\rangle \bigotimes_{k=0}^{K-1} \left( \sum_j \frac{1}{\sqrt{|D_k|}} |x_j, y_j\rangle |f(w_i, x_j)\rangle \right)$

The solution set is now defined as:

$S_K : \left\{ |w_i\rangle \bigotimes_{k=0}^{K-1} |x_{j_k}, y_{j_k}\rangle |f(w_i, x_{j_k})\rangle \;\middle|\; \bigwedge_{k=0}^{K-1} (y_{j_k} = f(w_i, x_{j_k})) \right\}$

We then update $|s\rangle$, $|\Psi_0\rangle$, and $U_{comb}$ accordingly, as in Appendix A.2. Now the ratio of solution states between different weights can grow exponentially in terms of $K$.

Theorem 2 By defining $\beta = \frac{\int_{w_i \in W_\epsilon} d\delta}{\int_{w_i \in W/W_\epsilon} d\delta}$ as the volume ratio between the ϵ-optimal weight subspace $|W_\epsilon|$ and the non-ϵ-optimal weight subspace $|W - W_\epsilon|$, the model query complexity for getting an ϵ-optimal solution $w_i \in W_\epsilon$ on a balanced C-way classification task with k-parallel datasets is $O\left((1 + \beta^{k-1} (\frac{1}{\alpha} - 1)^k)\, k\, C^{\frac{k}{2}}\right)$.

The proof of Theorem 2 is included in Appendix A.3.2. Here, the trade-off is that the number of Grover iterations needed per measurement grows to $O(C^{\frac{k}{2}})$. As we also need to apply the model forward for each dataset, the KPD Grover complexity in terms of model query complexity is now $O(k C^{\frac{k}{2}})$. On the other hand, as we are converging to the global solution, the number of measurements needed can be reduced from $O(\frac{1}{\alpha})$ to $O(1 + \beta^{k-1} (\frac{1}{\alpha} - 1)^k)$ since $\beta < \alpha$. Thus the optimal value of $k$ depends on the values of $\alpha$, $\beta$, and $C$.

Theorem 3 Given $\alpha$, $\beta$, $C$, the optimal value of $k$ is $k^* = \lfloor \log_{\frac{\alpha}{\beta}} \frac{1}{\alpha} \rfloor + 1$ if $\frac{\alpha}{\beta} \ge m^{\frac{1}{m-1}} \sqrt{C}$, where $m = \lfloor \log_{\frac{\alpha}{\beta}} \frac{1}{\alpha} \rfloor + 1$; otherwise $k^* = 1$.

Thus for cases where $m^{\frac{1}{m-1}} \sqrt{C} \le \frac{\alpha}{\beta} \ll \frac{1}{\alpha}$, the optimal $k$ is greater than 1, which shows the effectiveness of KPD under certain circumstances. We include the proof of Theorem 3 in Appendix A.3.3.
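The effect of KPD and the trade-off in Theorem 2 can be explored numerically. The sketch below is illustrative only: `kpd_distribution` normalizes $J(w_i)^k$ over a hypothetical list of accuracies, and `theorem2_queries` evaluates the expression inside the $O(\cdot)$ of Theorem 2 with all hidden constants dropped, which is an assumption. The values of $\alpha$, $\beta$, and $C$ are made up for the example.

```python
import numpy as np

def kpd_distribution(accuracies, k):
    """Target measurement distribution p(w_i) proportional to J(w_i)^k."""
    powered = np.asarray(accuracies, dtype=float) ** k
    return powered / powered.sum()

def theorem2_queries(alpha, beta, num_classes, k):
    """Expression inside the O(.) of Theorem 2 (constants ignored)."""
    measurements = 1.0 + beta ** (k - 1) * (1.0 / alpha - 1.0) ** k
    return measurements * k * num_classes ** (k / 2.0)

# Hypothetical accuracies of three candidate weights.
acc = [0.9, 0.6, 0.3]
p1 = kpd_distribution(acc, 1)   # linear amplification
p4 = kpd_distribution(acc, 4)   # 4 parallel datasets

# Hypothetical problem parameters with beta < alpha.
costs = {k: theorem2_queries(alpha=0.01, beta=1e-4, num_classes=4, k=k)
         for k in range(1, 6)}
best_k = min(costs, key=costs.get)
```

For these made-up parameters the expression is smallest at $k = 2$, which agrees with the value Theorem 3 gives for $\alpha/\beta = 100$ and $C = 4$; with other parameters the minimum may sit at $k = 1$.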

5. EXPERIMENTS

In this section, we empirically verify the effectiveness of Quark on two tasks, namely Edge Detection and Tiny-MNIST. Algorithm 1 formally states the pipeline of Quark.⁵ Instead of using the approximate iteration number from the theoretical derivations for Grover's update, we use a more precise yet efficient estimate in practice to achieve a more accurate amplitude amplification effect. We include the details of the Grover iteration number estimation in Appendix A.5.1. Notice that most of the simulation results are obtained numerically since we have no access to large-scale quantum devices that fit our setup. However, we do use Qiskit Aer (Anis et al., 2021) to verify the reproducibility of our numerical simulation results on the models where this is applicable; results are included in Appendix A.6.

Algorithm 1:
    // get Grover iteration g, Grover operator U_comb and initial state |Ψ_0⟩
    for i = 0; i < m; ++i do
        for j = 0; j < g; ++j do
            |Ψ_{j+1}⟩ = U_comb |Ψ_j⟩ ;    // Grover update
        w_i = Measure(|Ψ_g⟩) ;    // Weights measurement on R_W
        B.add(w_i);
    return arg max_{w ∈ B} (Evaluate(w))
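The measurement loop of Algorithm 1 can be mimicked classically once an amplified weight distribution is available. The sketch below is a stand-in under stated assumptions: measuring $R_W$ is replaced by sampling from a given probability vector with NumPy, `evaluate` is a user-supplied scoring function, and the Preprocessing() step of the paper is not reproduced. The name `quark_pipeline` is hypothetical.

```python
import numpy as np

def quark_pipeline(weight_probs, evaluate, shots, seed=0):
    """Sample `shots` weights from the amplified distribution, return the best one."""
    rng = np.random.default_rng(seed)
    probs = np.asarray(weight_probs, dtype=float)
    # Stand-in for repeatedly preparing |Psi_g> and measuring the weight register.
    measured = rng.choice(len(probs), size=shots, p=probs)
    candidates = {int(w) for w in measured}      # the set B in Algorithm 1
    return max(candidates, key=evaluate)         # arg max over B of Evaluate(w)

# Hypothetical training accuracies; the amplified distribution is proportional to them.
train_acc = [0.5, 0.25, 0.125, 0.125]
amplified = np.array(train_acc) / sum(train_acc)
best = quark_pipeline(amplified, lambda w: train_acc[w], shots=50)
```

Because the returned weight is the best among those measured, the quality of the result depends on how much probability mass the amplified distribution places on good weights, which is what amplitude amplification and KPD aim to increase.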

5.1. EDGE DETECTION

For Edge Detection, our goal is to identify whether a 3 × 3 binary matrix has (1) both vertical and horizontal lines, (2) vertical lines only, (3) horizontal lines only, or (4) no lines, where a line is defined by three consecutive '1's in a row or column. Thus the task is a 4-way classification task with 512 instances. We split the 512 samples into a training set with 400 randomly selected instances and a test set with the rest. We use a Quantum Convolutional Model that consists of several 1 × 3 convolution kernels with MaxPooling modules, as described in Figure 4, for this task. Results are shown in Figure 5 (normalized accuracy: $\hat{J}(w_i) = \frac{J(w_i)}{\sum_i J(w_i)}$), from which we can clearly see a linear relationship between model accuracy and the optimized weight distribution using 1-PD. For 4-PD, we can observe that the weight distribution further concentrates on the globally optimal solutions. Indeed, as we concatenate more training datasets, the probability mass gradually converges to the optimal solutions, thus significantly reducing the number of measurements we need. We include the results for 2-PD and 3-PD in Appendix A.5.3.
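The Edge Detection dataset described above is small enough to enumerate. The following sketch generates all 512 binary 3 × 3 matrices and labels them with the four classes; the row-major bit layout and the integer class indices are assumptions made for illustration, since the paper does not specify them here.

```python
from itertools import product
from collections import Counter

def edge_label(bits):
    """Label a 3x3 binary matrix given as 9 bits in row-major order.

    0: both vertical and horizontal lines, 1: vertical only,
    2: horizontal only, 3: no lines.
    """
    rows = [bits[3 * r: 3 * r + 3] for r in range(3)]
    cols = [bits[c::3] for c in range(3)]
    horizontal = any(all(r) for r in rows)   # some row is 1,1,1
    vertical = any(all(c) for c in cols)     # some column is 1,1,1
    if horizontal and vertical:
        return 0
    if vertical:
        return 1
    if horizontal:
        return 2
    return 3

dataset = [(bits, edge_label(bits)) for bits in product([0, 1], repeat=9)]
class_counts = Counter(label for _, label in dataset)
```

A random 400/112 split of `dataset` would then reproduce the training and test set sizes stated in the text.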

5.2. TINY-MNIST

For Tiny-MNIST, we down-sample the original 28 × 28 images from MNIST to 3 × 3 and remove duplicated samples. Due to device limitations, we only consider classes 1, 2, and 7, which makes this task a 3-way classification task. We use Weighted Mask modules (details are included in Appendix A.4) with MaxPooling modules to compose our model; the best model in the search space achieves 86.08% training accuracy and 82.61% test accuracy, whereas a well-optimized classical single-layer perceptron achieves 85% training accuracy and 80% test accuracy. Similarly, as shown in Figure 6 (normalized accuracy: $\hat{J}(w_i) = \frac{J(w_i)}{\sum_i J(w_i)}$), a linear relationship between model accuracy and the optimized weight distribution can be observed using 1-PD, while 4-PD further increases the probability of sampling optimal solutions. We also include the results of 2-PD and 3-PD on Tiny-MNIST in Appendix A.5.4.

Besides the distribution evolution, we also statistically demonstrate in Figure 7 the relationship between the test accuracy of the measured top-1 model and the measurement budget, using uniform random sampling, 1-PD, and 4-PD, respectively. To obtain an equally well-performing model, 4-PD only requires ∼30 shots on the Edge Detection task for a mean test accuracy > 98%, while uniform random sampling needs ∼900 (∼30× more) shots with much higher variance. A similar trend can be observed on Tiny-MNIST, where 4-PD only requires ∼1000 shots for a mean test accuracy > 76.5%, while uniform random sampling needs ∼20000 (∼20× more) shots with higher variance. The corresponding training-accuracy results are included in Appendix A.5.5.

6. CONCLUSION

In this paper, we propose Quark, a new quantum learning framework that does not involve gradient calculation and operates in a fully quantum fashion. Acknowledging the notorious problem of barren plateaus in VQC-based methods, Quark sheds some light on circumventing this phenomenon through a gradient-free optimization pipeline. Quark also enables a more general module design due to basis encoding, so that non-linear operations can be easily implemented. Theoretically, we present evidence in terms of model query complexity to demonstrate the trade-offs between VQC-based methods and Quark. Empirically, we have verified the effectiveness of Quark through numerical simulations on Edge Detection and Tiny-MNIST.

A APPENDIX

A.1 GROVER'S ALGORITHM

A well-known algorithm for amplitude amplification is Grover's algorithm. The original Grover's algorithm searches for specific states, called solution states, that satisfy some property. Instead of enumerating all possible states to find the solution states that lie in the solution set $S$, Grover's algorithm amplifies the solution states' amplitudes using two reflection unitary matrices $U_{s^\top}$ and $U_{\Psi_0}$. Let $|\Psi_0\rangle$ denote the initial state of all qubits and let $|s\rangle = \sum_{s_i \in S} \frac{1}{\sqrt{|S|}} |s_i\rangle$ represent the uniform superposition of all solution states. $U_{s^\top}$ and $U_{\Psi_0}$ are constructed as follows:

$U_{s^\top} = I - 2|s\rangle\langle s|$ (2)
$U_{\Psi_0} = 2|\Psi_0\rangle\langle\Psi_0| - I$ (3)

The combination of the two reflection unitary matrices, $U_{comb} = U_{\Psi_0} U_{s^\top}$, is equivalent to a rotation of $2\theta$ on the plane spanned by $|s^\top\rangle$ and $|\Psi_0\rangle$, where $\theta$ is the angle between $|\Psi_0\rangle$ and $|s^\top\rangle$. Therefore, applying the combination of the two reflection matrices $k$ times gives:

$|\Psi_k\rangle = \prod_{i=1}^{k} U_{comb} |\Psi_0\rangle$ (4)
$= \cos((2k + 1)\theta)|s^\top\rangle + \sin((2k + 1)\theta)|s\rangle$ (5)

As in most practical setups solutions exist and are sparse, we have $0 < \theta \ll \frac{\pi}{3}$. To maximize the amplitude of $|s\rangle$, $k$ should be on the order of $O(\frac{1}{\theta})$.
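The rotation picture above gives a closed form for the success probability after $k$ Grover iterations, $\sin^2((2k+1)\theta)$ with $\sin\theta = \sqrt{p}$ for an initial solution probability $p$. The sketch below evaluates it; the function names are illustrative, and the iteration rule used here (rounding $\frac{\pi}{4\theta} - \frac{1}{2}$) is the standard textbook choice rather than the estimate of Appendix A.5.1.

```python
import math

def grover_angle(p_solution):
    """theta such that sin(theta)^2 equals the initial solution probability."""
    return math.asin(math.sqrt(p_solution))

def success_probability(p_solution, k):
    """Probability of measuring a solution state after k Grover iterations."""
    theta = grover_angle(p_solution)
    return math.sin((2 * k + 1) * theta) ** 2

def best_iteration_count(p_solution):
    """Integer k that brings (2k + 1) * theta closest to pi / 2."""
    theta = grover_angle(p_solution)
    return max(0, round(math.pi / (4 * theta) - 0.5))

k_quarter = best_iteration_count(0.25)           # balanced 4-way case, p = 1/C
p_quarter = success_probability(0.25, k_quarter)
k_sparse = best_iteration_count(1 / 64)
p_sparse = success_probability(1 / 64, k_sparse)
```

For small $p$ the angle satisfies $\theta \approx \sqrt{p}$, so the iteration count grows like $\frac{1}{\sqrt{p}}$, matching the $O(\frac{1}{\theta})$ statement above.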

A.2 DEFINITION

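The $|s\rangle$ and $|\Psi_0\rangle$ in KPD are updated as:

$|s\rangle = \sum_{|w_i\rangle \bigotimes_{k=0}^{K-1} |x_{j_k}, y_{j_k}\rangle |f(w_i, x_{j_k})\rangle \in S_K} \frac{1}{\sqrt{|S_K|}} |w_i\rangle \bigotimes_{k=0}^{K-1} |x_{j_k}, y_{j_k}\rangle |f(w_i, x_{j_k})\rangle$

and

$|\Psi_0\rangle = \sum_{i, j_0, \cdots, j_{K-1}} \frac{1}{\sqrt{2^{d_w} |D|^K}} |w_i\rangle \bigotimes_{k=0}^{K-1} |x_{j_k}, y_{j_k}\rangle |f(w_i, x_{j_k})\rangle$.

Again, the Grover operator can be built as $U_{comb} = U_{\Psi_0} U_{s^\top}$, where $U_{s^\top} = I - 2|s\rangle\langle s|$ and $U_{\Psi_0} = 2|\Psi_0\rangle\langle\Psi_0| - I$.

A.3 PROOF

A.3.1 THEOREM 1

Given that the probability of sampling a solution state is $p$, the number of Grover iterations needed to achieve the maximum amplitude amplification effect scales as $\sqrt{\frac{1}{p}}$. In our case, the probability of sampling a solution state is given by $\frac{\int_{w_i \in W} J(w_i)\,d\delta}{\int_{w_i \in W} d\delta}$. Supposing we are doing a balanced C-way classification task, the number of Grover iterations we need is:

$\sqrt{\frac{\int_{w_i \in W} d\delta}{\int_{w_i \in W} J(w_i)\,d\delta}} = \sqrt{\frac{1}{E_{w_i \sim W}[J(w_i)]}}$ (6)
$= \sqrt{\frac{1}{\frac{1}{C}}}$ (7)
$= \sqrt{C}$ (8)

The second equality holds because, for a balanced C-way classification task, we should expect the accuracy of a random model to be the same as that of a random guess, $\frac{1}{C}$.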

As we have defined $\alpha = \frac{\int_{w_i \in W_\epsilon} J(w_i)\,d\delta}{\int_{w_i \in W} J(w_i)\,d\delta}$ as the probability to measure an ϵ-optimal solution, it takes $O(\frac{1}{\alpha})$ trials to sample an ϵ-optimal solution by measurements, which gives us an overall complexity of $O(\frac{\sqrt{C}}{\alpha})$.

A.3.2 THEOREM 2

Given that we are doing a balanced C-way classification task with k parallel datasets, the probability of sampling a solution state is now $\frac{\int_{w_i \in W} J(w_i)^k\,d\delta}{\int_{w_i \in W} 1^k\,d\delta} = E_{w_i \sim W}[J(w_i)^k]$. Since the objective function we defined is non-negative, $J(w_i) \ge 0$, $J(w_i)^k$ is a convex function of $J(w_i)$. By Jensen's inequality, $E[J(w_i)^k] \ge E[J(w_i)]^k$, so we have:

$\sqrt{\frac{\int_{w_i \in W} 1^k\,d\delta}{\int_{w_i \in W} J(w_i)^k\,d\delta}} = \sqrt{\frac{1}{E_{w_i \sim W}[J(w_i)^k]}}$ (9)
$\le \sqrt{\frac{1}{E_{w_i \sim W}[J(w_i)]^k}}$ (10)
$= \sqrt{C^k}$ (11)

Thus the number of Grover iterations needed is upper bounded by $C^{\frac{k}{2}}$. For measurement, the expected number of trials we need is $\frac{\int_{w_i \in W} J(w_i)^k\,d\delta}{\int_{w_i \in W_\epsilon} J(w_i)^k\,d\delta}$, which can be approximated as:

$\frac{\int_{w_i \in W} J(w_i)^k\,d\delta}{\int_{w_i \in W_\epsilon} J(w_i)^k\,d\delta} = \frac{\int_{w_i \in W_\epsilon} J(w_i)^k\,d\delta + \int_{w_i \in W/W_\epsilon} J(w_i)^k\,d\delta}{\int_{w_i \in W_\epsilon} J(w_i)^k\,d\delta}$ (12)
$= 1 + \frac{\int_{w_i \in W/W_\epsilon} J(w_i)^k\,d\delta}{\int_{w_i \in W_\epsilon} J(w_i)^k\,d\delta}$ (13)
$= 1 + \frac{\int_{w_i \in W/W_\epsilon} J(w_i)^k\,d\delta}{\int_{w_i \in W_\epsilon} J(w_i)^k\,d\delta} \cdot \frac{\int_{w_i \in W_\epsilon} 1^k\,d\delta}{\int_{w_i \in W/W_\epsilon} 1^k\,d\delta} \cdot \frac{\int_{w_i \in W/W_\epsilon} 1^k\,d\delta}{\int_{w_i \in W_\epsilon} 1^k\,d\delta}$ (14)
$= 1 + \frac{\int_{w_i \in W/W_\epsilon} J(w_i)^k\,d\delta \,/\, \int_{w_i \in W/W_\epsilon} 1^k\,d\delta}{\int_{w_i \in W_\epsilon} J(w_i)^k\,d\delta \,/\, \int_{w_i \in W_\epsilon} 1^k\,d\delta} \cdot \frac{1}{\beta}$ (15)
$= 1 + \frac{E_{w_i \in W/W_\epsilon}[J(w_i)^k]}{E_{w_i \in W_\epsilon}[J(w_i)^k]} \cdot \frac{1}{\beta}$ (16)
$\approx 1 + \frac{E_{w_i \in W/W_\epsilon}[J(w_i)]^k}{E_{w_i \in W_\epsilon}[J(w_i)]^k} \cdot \frac{1}{\beta}$ (17)
$= 1 + \left(\frac{E_{w_i \in W/W_\epsilon}[J(w_i)]}{E_{w_i \in W_\epsilon}[J(w_i)]}\right)^k \frac{1}{\beta}$ (18)
$= 1 + \left(\frac{\int_{w_i \in W/W_\epsilon} J(w_i)\,d\delta \,/\, \int_{w_i \in W/W_\epsilon} 1\,d\delta}{\int_{w_i \in W_\epsilon} J(w_i)\,d\delta \,/\, \int_{w_i \in W_\epsilon} 1\,d\delta}\right)^k \frac{1}{\beta}$ (19)
$= 1 + \left(\frac{\int_{w_i \in W_\epsilon} 1\,d\delta}{\int_{w_i \in W/W_\epsilon} 1\,d\delta} \cdot \frac{\int_{w_i \in W/W_\epsilon} J(w_i)\,d\delta}{\int_{w_i \in W_\epsilon} J(w_i)\,d\delta}\right)^k \frac{1}{\beta}$ (20)
$= 1 + \left(\beta\, \frac{\int_{w_i \in W} J(w_i)\,d\delta - \int_{w_i \in W_\epsilon} J(w_i)\,d\delta}{\int_{w_i \in W_\epsilon} J(w_i)\,d\delta}\right)^k \frac{1}{\beta}$ (21)
$= 1 + \left(\beta \left(\frac{1}{\alpha} - 1\right)\right)^k \frac{1}{\beta}$ (22)
$= 1 + \beta^{k-1} \left(\frac{1}{\alpha} - 1\right)^k$ (23)

The approximation step (marked ≈) assumes $E_{w_i \in W/W_\epsilon}[J(w_i)] > \mathrm{Var}_{w_i \in W/W_\epsilon}[J(w_i)]^{\frac{1}{2}}$, which is usually the case in practice since $J(w_i)$ itself lies within $[0, 1 - O(\epsilon)]$ in the non-ϵ-optimal region and $J(w_i)$ can be assumed to be near-uniformly distributed within $W/W_\epsilon$ across the range $[0, 1 - O(\epsilon)]$, which is a general assumption. Thus $\frac{E_{w_i \in W/W_\epsilon}[J(w_i)^k]}{E_{w_i \in W_\epsilon}[J(w_i)^k]}$ can be dominated by $\frac{E_{w_i \in W/W_\epsilon}[J(w_i)]^k}{E_{w_i \in W_\epsilon}[J(w_i)^k]} \le \frac{E_{w_i \in W/W_\epsilon}[J(w_i)]^k}{E_{w_i \in W_\epsilon}[J(w_i)]^k}$.

A.3.3 THEOREM 3

As the trade-off is between the measurements needed and the Grover iterations per measurement, we can list their rates of change to make the comparison. For k-parallel datasets, the number of Grover iterations is increased by a factor of $k C^{\frac{k-1}{2}} = \frac{k C^{\frac{k}{2}}}{C^{\frac{1}{2}}}$. For the measurements needed, the exact ratio should be $\frac{\frac{1}{\alpha}}{1 + \beta^{k-1} (\frac{1}{\alpha} - 1)^k}$; however, we can further approximate this ratio as $\frac{\frac{1}{\alpha}}{(\frac{\beta}{\alpha})^{k-1} \frac{1}{\alpha}} = (\frac{\alpha}{\beta})^{k-1}$ given $\beta^{k-1} (\frac{1}{\alpha} - 1)^k > 1$, which is equivalent to k ≤ ⌊log ...

Figure 8: Representing a fully connected layer and ReLU activation as quantum circuits in the Quark optimization framework. (a) Fully connected layer: for a 2 × 3 FC layer with weights $W_{ij}$ and input $X = [X_1, X_2, X_3]$, the output is given by $O_i = (W_{i1} \wedge X_1) + (W_{i2} \wedge X_2) + (W_{i3} \wedge X_3)$. (b) ReLU: for an input $X_i = [S, X_1, X_2]$ where $S$ is the sign qubit, the ReLU output is given by $X' = (X \oplus X) \vee (\neg S \wedge X)$.

In addition to the convolution and max-pooling layers shown in Figure 4a and Figure 4b, Quark can also incorporate other commonly used tensor algebra operators. Figure 8 demonstrates how to represent a fully connected layer and a ReLU layer as quantum circuits in Quark. Due to the limited number of qubits, our parameterization can be viewed as a binary model over a bounded weight space $W$.

In our experiments, we make a small modification to the Quantum Learning modules introduced before. For Edge Detection task, O 0 = ( 2 i=0 ( 2 j=0 W 0,j ⊕ X i,j )) ⊕ W 0,3 O 1 = ( 2 i=0 ( 2 j=0 W 1,j ⊕ X j,i )) ⊕ W 1,3 We use $|O_0 O_1\rangle = |00\rangle, |01\rangle, |10\rangle, |11\rangle$ to express the 4 different predictions respectively.
For Tiny MNIST task,  O k = ( 2 i=0 ( 2 j=0 W k,3i+j X i,j )) ⊕ W k,9 k = 0, 1
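As a classical illustration of the Edge Detection outputs $O_0$ and $O_1$, the following sketch evaluates the two output bits on ordinary Python integers. It assumes that the outer operator is an OR over rows (or columns) and the inner operator is an AND, mirroring the MaxPool and Conv modules of Figure 4; the function name `edge_outputs` and the nested-list layout of `W` and `X` are hypothetical choices made only for this example.

```python
def edge_outputs(W, X):
    """Classically evaluate the two Edge Detection output bits.

    W: 2x4 list of weight bits (W[k][3] acts as an output-flip bit).
    X: 3x3 list of pixel bits.
    Assumption: outer operator is OR, inner operator is AND.
    """
    # O_0 scans rows: X[i][j] with i fixed per inner AND.
    o0 = int(any(all(W[0][j] ^ X[i][j] for j in range(3)) for i in range(3)))
    # O_1 scans columns: X[j][i] with i fixed per inner AND.
    o1 = int(any(all(W[1][j] ^ X[j][i] for j in range(3)) for i in range(3)))
    return o0 ^ W[0][3], o1 ^ W[1][3]


zero_weights = [[0, 0, 0, 0], [0, 0, 0, 0]]
horizontal = [[0, 0, 0], [1, 1, 1], [0, 0, 0]]
vertical = [[1, 0, 0], [1, 0, 0], [1, 0, 0]]
print(edge_outputs(zero_weights, horizontal))
print(edge_outputs(zero_weights, vertical))
```

With all-zero weights, the first output fires only when some row is entirely ones and the second only when some column is entirely ones, so the pair of bits separates the four prediction cases.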



They are also known as variational quantum circuit (VQC)-based models in the quantum literature.
Throughout the paper, we use the term weight to refer to the set of all trainable parameters of a model.
Number of times the model forward pass is called.
We omit the model's intermediate results |O⟩ for simplicity.
Preprocessing() and Evaluate() are included in Appendix A.5.1.



Figure 2: Comparison between CMQO, QMCO, and Quark (QMQO).

Figure 3: Toy illustration of Quark's insight, where the dataset {(x_0 = 0, y_0 = 0), (x_1 = 1, y_1 = 1)} is constructed by the oracle function g(x) = 0 ⊕ x. We define our learner model as o = f(x, w) = w ⊕ x. The colored qubit strings stand for the states activated by the model forward pass (cyan for |w⟩ = 0 and yellow for |w⟩ = 1). Violet bars show the measurement probability of solution states; gray bars show the measurement probability of non-solution states.

Figure 4: Logic-gate illustration of the quantum module design. (a) Convolution: for a 1 × 3 Conv kernel with weights $W = [W_1, W_2, W_3]$ and input $X_i = [X_{i1}, X_{i2}, X_{i3}]$, the output is given by $O_i = (\neg(W_1 \oplus X_{i1})) \wedge (\neg(W_2 \oplus X_{i2})) \wedge (\neg(W_3 \oplus X_{i3}))$. (b) MaxPooling: for an input $X_i = [X_{i1}, X_{i2}, X_{i3}]$, the max pooling output is given by $O_i = X_{i1} \vee X_{i2} \vee X_{i3}$.

Figure 4 illustrates two Quark modules (Conv, MaxPool) used in our experiments. Besides Convolution and MaxPooling, Quark can support more diverse operations given enough qubits. We include a demonstration of the Fully Connected and ReLU modules in Appendix A.4. During training, we encode the entire data distribution through basis encoding so that the model forward pass is applied to all data samples simultaneously. During inference, we encode each sample as a single state $|x\rangle$. The prediction is obtained by measuring the output register after the model forward step $U_M |w^*\rangle|x, 0\rangle|0\rangle = |w^*\rangle|x, 0\rangle|f(w^*, x)\rangle$, where $w^*$ is the measured optimal weight.
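The two modules of Figure 4 can be checked classically on plain bits. The sketch below is only a truth-table illustration of the stated logic formulas, not a quantum circuit; the function names `conv` and `max_pool` are hypothetical.

```python
def conv(W, X_i):
    """O_i = AND_j NOT(W_j XOR X_ij): 1 only when the patch equals the kernel."""
    return int(all(not (w ^ x) for w, x in zip(W, X_i)))


def max_pool(X_i):
    """O_i = X_i1 OR X_i2 OR X_i3: 1 when any input bit is set."""
    return int(any(X_i))


kernel = [1, 0, 1]
patches = [[1, 0, 1], [1, 1, 1], [0, 0, 0]]
conv_out = [conv(kernel, p) for p in patches]
print(conv_out, max_pool(conv_out))
```

Because each factor $\neg(W_j \oplus X_{ij})$ is an XNOR, the convolution output acts as an exact pattern match between the kernel and the input patch, and max pooling reports whether any patch matched.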

Figure 5: Amplitude amplification + KPD on Edge Detection. Normalized accuracy: $\hat{J}(w_i) = \frac{J(w_i)}{\sum_i J(w_i)}$.

Quark's optimization pipeline.
Input: Data Oracle: $U_D$; Model Oracle: $U_M$; Objective Oracle: $U_L$; Number of parallel datasets: $k$; Measurement budget: $m$; Weights buffer: $B$
Output: Optimized Model Weight: $w^*$
$g, U_{comb}, |\Psi_0\rangle$ = Preprocessing($U_D, U_M, U_L, k$);

Figure 6: Amplitude amplification + KPD on Tiny-MNIST. Normalized accuracy: $\hat{J}(w_i) = \frac{J(w_i)}{\sum_i J(w_i)}$.

Figure 7: The mean ± std of the test accuracy of the best discovered model under different measurement budgets (in shots). URS shows the result when the weights are sampled uniformly at random.

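Geometrically, $U_{s_\top}$ is the reflection operator over $|s_\top\rangle = \sum_{s_i \notin S} \frac{1}{\sqrt{|\{|0\rangle, |1\rangle\}^n \setminus S|}} |s_i\rangle$, which is a state orthogonal to the solution space. Similarly, $U_{\Psi_0}$ is the reflection operator over $|\Psi_0\rangle$. Given $\theta = \arcsin(\langle \Psi_0 | s \rangle)$, $|\Psi_0\rangle$ can be expressed as: $|\Psi_0\rangle = \cos\theta\,|s_\top\rangle + \sin\theta\,|s\rangle$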
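The two reflections can be traced numerically in the two-dimensional plane spanned by $|s_\top\rangle$ and $|s\rangle$. The sketch below applies the reflection over $|s_\top\rangle$ followed by the reflection over $|\Psi_0\rangle$ and compares the result with the standard Grover closed form, in which $g$ iterations rotate the state to angle $(2g+1)\theta$. The function name `grover_iterate` and the coordinate representation are assumptions made for illustration.

```python
import math


def grover_iterate(theta, g):
    """Apply g Grover iterations to |Psi_0> in the (|s_top>, |s>) basis.

    Returns the pair (amplitude on |s_top>, amplitude on |s>).
    """
    psi0 = (math.cos(theta), math.sin(theta))
    a, b = psi0
    for _ in range(g):
        # Reflection over |s_top>: flip the sign of the solution component.
        b = -b
        # Reflection over |Psi_0>: v -> 2 <Psi_0|v> Psi_0 - v.
        overlap = psi0[0] * a + psi0[1] * b
        a, b = 2 * overlap * psi0[0] - a, 2 * overlap * psi0[1] - b
    return a, b


theta, g = 0.2, 3
a, b = grover_iterate(theta, g)
# Closed form: each iteration rotates the state by 2 * theta.
print(b, math.sin((2 * g + 1) * theta))
```

The probability of measuring a solution state after $g$ iterations is the square of the second amplitude, which is why the iteration count is chosen so that $(2g+1)\theta$ lands close to $\frac{\pi}{2}$.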

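Thus the optimal $k$ should satisfy $\frac{\alpha}{\beta} \ge k^{\frac{1}{k-1}} C^{\frac{1}{2}}$ as well as $k \le \lfloor \log_{\frac{\alpha}{\beta}} \frac{1}{\alpha} \rfloor + 1$. Since $k^{\frac{1}{k-1}} C^{\frac{1}{2}}$ is monotonically decreasing with respect to $k$ for $k \ge 2$, as long as $\frac{\alpha}{\beta} \ge m^{\frac{1}{m-1}} C^{\frac{1}{2}}$ for $m = \lfloor \log_{\frac{\alpha}{\beta}} \frac{1}{\alpha} \rfloor + 1$, the optimal choice is $k = m = \lfloor \log_{\frac{\alpha}{\beta}} \frac{1}{\alpha} \rfloor + 1$; otherwise $k = 1$.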
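The selection rule for $k$ can be written as a short numeric sketch. The code below uses the expected measurement count $1 + \beta^{k-1}(\frac{1}{\alpha} - 1)^k$ and the Grover bound $C^{\frac{k}{2}}$ from Appendix A.3.2 together with the rule stated above. The function names are hypothetical, and returning $k = 1$ when $\frac{\alpha}{\beta} \le 1$ is an assumption added here so that the logarithm is well defined.

```python
import math


def expected_measurements(alpha, beta, k):
    """Approximate expected measurements: 1 + beta^(k-1) * (1/alpha - 1)^k."""
    return 1 + beta ** (k - 1) * (1 / alpha - 1) ** k


def grover_iteration_bound(num_classes, k):
    """Upper bound C^(k/2) on Grover iterations for k parallel datasets."""
    return num_classes ** (k / 2)


def optimal_k(alpha, beta, num_classes):
    """Pick k = m = floor(log_{alpha/beta}(1/alpha)) + 1 when the condition holds."""
    ratio = alpha / beta
    if ratio <= 1:
        return 1  # assumption: the rule is only meaningful for alpha > beta
    m = math.floor(math.log(1 / alpha, ratio)) + 1
    if m >= 2 and ratio >= m ** (1 / (m - 1)) * math.sqrt(num_classes):
        return m
    return 1


# Illustrative values only; they are not taken from the experiments.
alpha, beta = 0.01, 0.002
for classes in (2, 10):
    k = optimal_k(alpha, beta, classes)
    print(classes, k, expected_measurements(alpha, beta, k),
          grover_iteration_bound(classes, k))
```

For $k = 1$ the measurement formula reduces to $\frac{1}{\alpha}$, which matches the single-dataset trial count.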

Figure 9: QCA is the quantum circuit for calculating the average accuracy over different parameters, which is used to compute the number of Grover iterations as shown in Algorithm 2. QCB is the quantum circuit for computing the accuracy of a specific weight, which is used in Algorithm 3.

Quantum dissipative neural network (QDNN) (Beer et al., 2020) and quantum convolutional neural network (QCNN) (Cong et al., 2019a), on the other hand, move a step forward towards more complicated neural architectures. QDNN enlarges its model space by applying unitary operators on both the input and output qubits, while QCNN uses a measurement-controlled operation to enable non-linear operations. However, McClean et al. (2018) show that the barren plateau phenomenon commonly exists in VQC-based methods, where gradi-


Algorithm 2: Get Grover Iterations Number
Input: Data Encoder Oracle: $U_D$; Model Forward Oracle: $U_M$; Objective Function Oracle: $U_L$; Number of parallel datasets: $k$; Shots to estimate accuracy: $s$
Output: Grover Iterations Number: $g$
Initialize Obj = 0;
QC = QCA($U_D, U_M, U_L, k$); // Construct the quantum circuit shown in Figure 9(a)
for i = 0; i < s; ++i do
], m ∈ N ;
return g;

We use Preprocessing() to:
• Calculate the number of Grover iterations we need (the procedure is illustrated in Figure 9(a) and Algorithm 2).
• Construct the Grover operator $U_{comb}$.
• Uniformly initialize $R_W$ with Hadamard gates.
• Initialize $k \times R_D$ with $U_D$s to encode $k$ identical training datasets in parallel.

In Algorithm 2, we may find that $\theta$ is close to $\frac{\pi}{4}$, making G extremely large or non-existent. However, we can introduce extra samples into the dataset to resolve the issue. These auxiliary samples are identified by an additional qubit and are automatically classified into the non-solution space.

In function Evaluate(), we evaluate the objective value of a specific weight $w$. The procedure is illustrated in Figure 9(b) and Algorithm 3.
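A minimal sketch of how an iteration count could be derived from an estimated solution probability $p$, and of why auxiliary non-solution samples help when $\theta$ is near $\frac{\pi}{4}$. It assumes the standard Grover relations $\sin^2\theta = p$ and a success probability of $\sin^2((2g+1)\theta)$; the exact rule used in Algorithm 2 may differ. The function names, and the assumption that one extra qubit halves the solution fraction, are hypothetical.

```python
import math


def grover_iterations(p):
    """Integer g that brings (2g + 1) * theta closest to pi / 2, with sin^2(theta) = p."""
    theta = math.asin(math.sqrt(p))
    return max(0, round(math.pi / (4 * theta) - 0.5))


def success_probability(p, g):
    """Probability of measuring a solution state after g Grover iterations."""
    theta = math.asin(math.sqrt(p))
    return math.sin((2 * g + 1) * theta) ** 2


def pad_with_auxiliary(p, factor=2):
    """Assumption: one extra qubit marks auxiliary non-solution states, dividing p by `factor`."""
    return p / factor


p = 0.5  # theta = pi / 4: no integer g improves on the initial success probability
print(success_probability(p, grover_iterations(p)))
p_padded = pad_with_auxiliary(p)  # theta = pi / 6
print(grover_iterations(p_padded), success_probability(p_padded, grover_iterations(p_padded)))
```

At $p = \frac{1}{2}$ the state is rotated by $\frac{\pi}{2}$ per iteration, so the success probability stays at one half; after padding, a single iteration reaches the solution space almost exactly.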

A.5.2 TINY-MNIST EXPERIMENTAL SETUP

To construct the Tiny-MNIST dataset, we select images with labels 1, 2, and 7 from the original MNIST training set and downsample them to 3×3 images with binarization applied to form $D_1$. Since samples with the same representation in $D_1$ cannot have different labels, we use majority voting to decide the labels of duplicate samples within $D_1$ to form our final dataset. We apply the same procedure to both the training and testing datasets. This gives us a Tiny-MNIST dataset with 79 instances for training and 46 instances for testing. We use Tiny-MNIST for evaluating both classical methods and Quark. The settings of classical methods are shown in 
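The majority-voting step for duplicate samples can be sketched as follows. The code assumes the images are already downsampled and binarized into flat 9-bit tuples; the function name `majority_vote_dedup`, the data layout, and the tie-breaking behavior (the label seen first wins) are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def majority_vote_dedup(samples):
    """Collapse duplicate binarized images to one entry labeled by majority vote.

    samples: iterable of (image_bits, label) pairs, image_bits being a 9-bit sequence.
    Returns a dict mapping each distinct image (as a tuple) to its voted label.
    """
    votes = defaultdict(Counter)
    for image_bits, label in samples:
        votes[tuple(image_bits)][label] += 1
    # most_common(1) returns the label with the highest count for each image.
    return {image: counts.most_common(1)[0][0] for image, counts in votes.items()}


# Toy data, not taken from MNIST.
toy = [
    ((0, 1, 0, 0, 1, 0, 0, 1, 0), 1),
    ((0, 1, 0, 0, 1, 0, 0, 1, 0), 7),
    ((0, 1, 0, 0, 1, 0, 0, 1, 0), 1),
    ((1, 1, 1, 0, 0, 1, 0, 0, 1), 7),
]
print(majority_vote_dedup(toy))
```

Deduplicating in this way guarantees that each distinct 3×3 pattern appears once with a single label, which a binary basis encoding of the dataset requires.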

[Plot axis labels: Measuring Probability; normalized accuracy]

A.6 QUANTUM SIMULATION

To further verify our framework, we use Qiskit Aer (Anis et al., 2021) to simulate solving a simplified Edge Detection task with Quark. The goal is to identify whether a 3×3 binary matrix contains a horizontal line. This is a binary classification task with 512 instances, of which we randomly select 400 as the training set and use the rest as the test set. The simulation results are shown in Table 2. Each data point is obtained from 20 runs. The corresponding numerical simulation results are shown in Table 3.

