AN ANALYTIC FRAMEWORK FOR ROBUST TRAINING OF DIFFERENTIABLE HYPOTHESES

Anonymous

Abstract

The reliability of a learning model is key to the successful deployment of machine learning in various industries. Creating a robust model, particularly one unaffected by adversarial attacks, requires a comprehensive understanding of the adversarial examples phenomenon. However, the phenomenon is difficult to describe due to the complicated nature of the problems in machine learning. Consequently, many studies investigate the phenomenon by proposing a simplified model of how adversarial examples occur and validate the model by predicting some aspect of the phenomenon. While these studies cover many different characteristics of the adversarial examples, they have not reached a holistic geometric and analytic model of the phenomenon. Furthermore, the phenomenon has been observed in many applications of machine learning, and its effects seem to be independent of the choice of the hypothesis class. In this paper, we propose a formalization of robustness in learning-theoretic terms and give a geometrical description of the phenomenon in analytic classifiers. We then utilize the proposal to devise a robust classification learning rule for differentiable hypothesis classes and showcase our proposal on synthetic and real-world data.

1. INTRODUCTION

State-of-the-art machine learning models have been shown to suffer from the phenomenon of adversarial examples, where a trained model is fooled into returning an undesirable output on particular inputs that an adversary carefully crafts. While there is no consensus on the reasons behind the emergence of these examples, many facets of the phenomenon have been revealed. Szegedy et al. (2014) show that adversarial perturbations are not random and that they generalize to other models. Goodfellow et al. (2015) indicate that linear approximations of the model around a test sample are an effective surrogate for the model in the generation of adversarial examples. Tanay & Griffin (2016) show that adversarial examples appear in linear classifiers when the decision boundary is tilted towards the manifold of natural samples. Ilyas et al. (2019) reveal that the distribution of training samples and the robustness of the trained model are related. Demontis et al. (2019) propose three metrics for measuring transferability between a target and a surrogate model based on the similarity of the loss landscapes and the derivatives of the models. Li et al. (2020) infer that the cause of the phenomenon is probably geometrical and that statistical defects amplify its effects. Barati et al. (2021) present an example showing that pointwise convergence of the trained model to the optimal hypothesis is enough for the phenomenon to emerge. Shamir et al. (2021) explore the interaction of the decision boundary and the manifold of the samples in non-linear hypotheses. However, there are some issues with the current proposals in the literature. The geometrical and computational descriptions of the phenomenon do not always agree in their predictions (Moosavi-Dezfooli et al., 2019; Akhtar et al., 2021). Even though geometrical perspectives have the advantage of being applicable to all hypothesis classes, they are not verifiable without a computational description.
On the other hand, computational approaches are mostly coupled with a particular construction of a hypothesis and in turn need a geometric description to be applicable to different scenarios. The current defence methods do not appear to be very effective (Machado et al., 2021), and a need for novel ideas is felt (Bai et al., 2021). Our aim in this paper is to devise a framework for the analysis of the adversarial examples phenomenon that
1. is decoupled from the underlying representation of the hypothesis,
2. provides the means for visualization of the phenomenon, and
3. models the known characteristics of the phenomenon as much as possible.
To this end, we first describe a necessary condition for robustness that we derive from the first principles of learning theory and lay out the groundwork for the analysis of the phenomenon from the perspective of learning rules in section 2. Next, we extend the framework with the necessary definitions so that it is open to geometrical interpretations in section 3. Finally, we put the proposed framework to use and verify its predictions on small-scale synthetic and real-world problems. We provide a summary of the framework in appendix A for the interested reader.

2. PRELIMINARIES

In learning theory we are interested in studying how machines learn (Shalev-Shwartz & Ben-David, 2014). We base our analysis on the framework of probably approximately correct (PAC) learning. The basic objects of study in learning theory are hypothesis classes and learning rules. We are interested in determining a necessary condition for a learning rule A to be robust with respect to a nonuniform learnable hypothesis class H. Throughout the paper we assume that the training samples are labeled by the true labeling function and that samples come from a compactly supported distribution.

Definition 2.1 (perplexity). Consider a universally consistent learning rule A and training sets S′, S ⊂ X. The perplexity of h = A(S) with respect to h′ = A(S′) is

∥h′ − h∥∞ = sup_{x∈X} |h′(x) − h(x)|. (1)

We expect the perplexity of the output h ∈ H of a learning rule to decrease as we add more natural samples to the training set. In contrast, adding an adversarial sample would be perplexing for h. If we find some sample x ∈ X that does not perplex h but is not correctly labeled by h, then we assume that H is agnostic to the pattern of x. In other words, if adversarial training cannot improve the robustness of h, then h is as robust as it gets for hypotheses in H.

Definition 2.2 (robust learning rule). Consider a learning rule A and a sequence {d_n}_{n=1}^∞ of random variables that almost surely converge to a random variable D that is uniformly distributed on X. A is robust if for every ϵ > 0 there exists a natural number N such that for all m, n ≥ N we have that

∥A(d_m) − A(d_n)∥∞ ≤ ϵ. (2)

This definition of a robust learning rule could be interpreted as the characteristic that once A has observed enough natural samples, adding more samples is inconsequential to the output of A, independently of the generative process of the samples. The definition is equivalent to Cauchy's criterion for uniform convergence of {A(d_n)}_{n=1}^∞ to the optimal hypothesis A(D).
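The supremum in equation (1) can be approximated numerically by maximizing over a dense grid of X. A minimal sketch; the two hypotheses below are illustrative closed-form stand-ins rather than trained models:

```python
import numpy as np

def perplexity(h, h_prime, grid):
    """Approximate the perplexity ||h' - h||_inf by a maximum over a dense grid of X."""
    return np.max(np.abs(h_prime(grid) - h(grid)))

# Illustrative hypotheses: h' deviates from h by 0.1*sin(x), so the
# perplexity on X = [0, pi] is exactly 0.1, attained at x = pi/2.
h = lambda x: x ** 2
h_prime = lambda x: x ** 2 + 0.1 * np.sin(x)

grid = np.linspace(0.0, np.pi, 1001)
print(perplexity(h, h_prime, grid))  # ≈ 0.1
```

In practice the grid would be replaced by whatever discretization of X is convenient; the estimate is a lower bound on the true supremum.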
The largest hypothesis class that we consider here is L²(X). The main characteristic of functions in L²(X) is that they are square-integrable. Formally, for a function f ∈ L²(X),

∥f∥_{L²(X)} = (∫_X |f(x)|² dV(x))^{1/2} < ∞. (3)

Theorem 2.3. L²(X) is nonuniform learnable.

We assume that a hypothesis h ∈ H has a series or an integral representation,

h(x) = Σ_{i=1}^∞ a_i φ_i(x), (4)
h(x) = ∫_Ω ν(ω) σ(x; ω) dω.

The series representation is customary in machine learning and is adequate when we have a discrete set of features {φ_i}_{i=0}^∞, e.g. polynomials. We argue that an integral representation is more adequate for analysis when there is a continuum of features to choose from, e.g. a neuron in an artificial neural network (ANN). Informally, the integral representation abstracts away the complexity of finding good features by incorporating every possible feature.

Definition 2.4 (SVC learning rule). Consider a training set S = {(x_n ∈ X, t(x_n))}_{n=1}^N and a hypothesis h ∈ L²(X). The support vector classifier (SVC) learner solves the following program,

arg min_a ½∥a∥² subject to t_n h(x_n) ≥ 1, (6)
arg min_ν ½∥ν∥²_{L²(Ω)} subject to t_n h(x_n) ≥ 1. (7)

Theorem 2.5. The SVC learning rule is not robust with respect to L²(X), and the optimal solution is

a_i = Σ_{n=1}^N λ_n t_n φ_i(x_n), (8)
ν(ω) = Σ_{n=1}^N λ_n t_n σ(x_n; ω),

in which {λ_n}_{n=1}^N are the Lagrange multipliers.

Proposition 2.6. The SVC learning rule returns the labeling function t as its output in the infinite sample limit unless H is agnostic to t.

Based on proposition 2.6, one may suggest that SVC learners with respect to L²(X) are weak against black box attacks as well. We formalize the conditions for the transfer of adversarial examples as follows.

Definition 2.7 (transfer of adversarial examples).
The adversarial examples of a surrogate ĥ ∈ Ĥ would probably approximately transfer to a target h ∈ H if some ϵ, ϵ′, δ ≥ 0 exist for which, with probability 1 − δ, we have that

∥ĥ − h∥∞ ≤ ϵ, ∥∂ĥ/∂x_i − ∂h/∂x_i∥∞ ≤ ϵ′.

The idea behind this definition of transfer is that not only do ĥ and h probably approximate each other, they also vary in a similar manner in the neighborhood of a test sample.

Definition 2.8 (normal hypothesis class). A hypothesis class H is normal if for every ϵ > 0 and every sequence {d_n}_{n=1}^∞ with d_n → D, a number N ∈ ℕ exists where for all n ≥ N we have that ∥A(d_n) − A(D)∥∞ ≤ ϵ on a dense subset of X.

Theorem 2.9. If H and the class of its derivatives ∂H are normal hypothesis classes, then adversarial examples of h ∈ H transfer between different representations of h.

Theorem 2.10. L²(X) is not a normal hypothesis class.

Even though theorem 2.3 shows that it is possible to find a robust learning rule for L²(X), we argue that this hypothesis class is not appropriate for the analysis of the adversarial examples phenomenon in ANNs. Since L²(X) is not a normal hypothesis class, its hypotheses are not clustered together, and it does not model the transferability of adversarial examples between ANNs with different activation functions. In this section we focused on describing the phenomenon and the conditions for robustness and transfer in learning theory. Next, we extend the framework with the tools and definitions needed for a geometrical treatment of the phenomenon.

3. THE SPACE OF HOLOMORPHIC HYPOTHESES

We need to come up with a suitable frame and axes if we want to be able to visualize the phenomenon. In this paper, we are only concerned with visualizing the phenomenon in binary classification tasks. We are interested in visualizing the submanifold of the natural samples S and in studying its interactions with the submanifold of the decision boundary C.

Definition 3.1 (complex-valued classifier). A complex-valued classifier h : X → ℂ is a function h(x) = u(x) + iv(x) in which the real part ℜ[h(x)] = u(x) = 0 encodes the geometrical position of the decision boundary and the imaginary part ℑ[h(x)] = v(x) = 0 regresses through the geometrical position of the training samples.

The complex plane ℂ provides us with a frame to visualize S with respect to C. The decision boundary is represented by the imaginary axis, and h assigns a position in the complex plane to each x ∈ X. Thus, the image of every path γ : [0, 1] → X is a curve h(γ) in ℂ. If γ crosses γ′ in X, h(γ) crosses h(γ′) in ℂ as well. However, in general the geometrical interactions between h(γ) and h(γ′) do not translate back to γ and γ′. In particular, distances and angles measured in ℂ may not reflect the true distances and angles in X. In order to overcome this obstacle, we turn to a special subset of L²(X).

Definition 3.2 (the Bergman space). Consider a compact and simply connected domain set X ⊂ ℂ^d. The Bergman space A²(X) ⊂ L²(X) is a reproducing kernel Hilbert space defined as

A²(X) = {f ∈ O(X) | (∫_X |f(z)|² dV(z))^{1/2} < ∞}.

O(X) is the space of holomorphic functions on X. There are different equivalent ways to characterize holomorphic functions. The holomorphic functions are the solutions to the homogeneous Cauchy-Riemann equations, or their counterpart in higher dimensions, the ∂̄ (del-bar) equations. Equivalently, the holomorphic functions are the functions that are complex differentiable in each dimension of x ∈ X.
A third characterization of O(X) is that these functions are complex analytic and have a power series representation. O(X) is also special because, unlike the space of real analytic functions, it is closed under uniform convergence. In one complex dimension, holomorphic functions are also known as conformal maps. Conformal maps are those maps that preserve the angle between paths in X and their images in ℂ. In other words, if γ and γ′ cross each other at a right angle, h(γ) and h(γ′) cross each other at a right angle as well. However, it could be shown that holomorphic functions can only be defined over the field of complex numbers. Nevertheless, we argue that the simplicity of analysis in A²(X) and its unique set of properties justify the switch to complex numbers. We emphasize that encoding data using complex numbers could be as simple as 1 + ix or e^{ix}. We replace x with z to symbolize the transition from the real to the complex number system. A domain X that allows for the definition of holomorphic functions is called a domain of holomorphy. Fortunately, most of the domains that we would normally face in applications are domains of holomorphy. A notable subset of the domains of holomorphy are the convex domains. We have provided a short summary of the topic in appendix E, with some examples of how to encode real data with complex numbers. Krantz (2001) is our main reference for the definitions and notation in the function theory of several complex variables.

Definition 3.3 (the Bergman kernel). The Bergman kernel K_X(z, ζ) of a compact and simply connected domain X is the unique function with the reproducing property

f(z) = ∫_X f(ζ) K_X(z, ζ) dV(ζ), ∀f ∈ A²(X).
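On the unit disk the Bergman kernel has the well-known closed form K_D(z, ζ) = 1/(π(1 − z ζ̄)²), so the reproducing property of Definition 3.3 can be checked by quadrature. A sketch in polar coordinates; the grid resolutions and the test function are arbitrary choices:

```python
import numpy as np

def bergman_kernel_disk(z, zeta):
    """Closed-form Bergman kernel of the unit disk D."""
    return 1.0 / (np.pi * (1.0 - z * np.conj(zeta)) ** 2)

def bergman_reproduce(f, z, nr=800, nt=256):
    """Evaluate int_D f(zeta) K_D(z, zeta) dV(zeta) on a polar grid.

    Midpoint rule in r, rectangle rule in theta (exact for periodic integrands).
    """
    r = (np.arange(nr) + 0.5) / nr
    theta = np.arange(nt) * 2.0 * np.pi / nt
    R, Th = np.meshgrid(r, theta, indexing="ij")
    zeta = R * np.exp(1j * Th)
    vals = f(zeta) * bergman_kernel_disk(z, zeta) * R  # R is the Jacobian r dr dtheta
    return vals.sum() * (1.0 / nr) * (2.0 * np.pi / nt)

z0 = 0.3 + 0.2j
f = lambda w: w ** 2             # f is holomorphic, so it should be reproduced
print(bergman_reproduce(f, z0))  # ≈ z0**2 = 0.05 + 0.12j
```

The same quadrature applied to a non-holomorphic f would instead return its Bergman projection, which is the mechanism behind Definition 3.4 below.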
The reproducing property of the Bergman kernel of X, in conjunction with the fact that the optimal Bayes classifier achieves the minimum of the 0-1 loss function, provides the means to define the infinite sample limit of any learning rule on A²(X) that minimizes the complex 0-1 loss function,

loss_ℂ(t, z, h) = loss_{0-1}(t, z, ℜ[h]) + loss_{MSE}(0, z, ℑ[h]),

independently of the details of the implementation or the training process.

Definition 3.4 (holomorphic optimal Bayes classifier). The holomorphic optimal Bayes classifier is the orthogonal projection of the optimal Bayes classifier f_D : X → ℝ onto A²(X),

o_D(z) = ∫_X f_D(ζ) K_X(z, ζ) dV(ζ).

Theorem 3.5. A²(X) and ∂A²(X) are normal hypothesis classes.

Theorem 3.5 predicts that adversarial examples of a nonrobust hypothesis h ∈ A²(X) transfer between different representations of h. The geometrical properties of holomorphic functions enable us to infer the geometrical relation between S and C by studying their images in ℂ. This property of the holomorphic hypotheses will prove to be key to finding a robust learning rule for A²(X). Next, we extend the SVC learning rule to accommodate complex-valued hypotheses.

Definition 3.6 (complex SVC learning rule). The complex SVC learner solves the following program,

arg min_{h, ξ} ½∥h∥²_{A²(Ω)} + C Σ_{n=1}^N ξ_n
subject to ξ_n ≥ 0, t_n ℜ[h(z_n)] ≥ 1 − ξ_n, ξ_n ≥ ℑ[h(z_n)], ℑ[h(z_n)] ≥ −ξ_n, (17)

in which {ξ_n}_{n=1}^N are slack variables introduced to allow for soft margins. We now have the required tools and definitions to analyze the phenomenon from a geometrical perspective. In the next section, we make use of the proposed framework to analyze the phenomenon in A²(X) and to find a robust classification learning rule for A²(X).

4. ROBUST CLASSIFICATION FOR HOLOMORPHIC HYPOTHESES

In this section, we will develop a robust learning rule with respect to A²(X). To do so, we first examine a toy problem to get an intuition for how adversarial examples occur in holomorphic hypotheses, and then continue to introduce the proposed robust learning rule. To start, we try to classify the unit disk D into two distinct halves, in which

t(z) = sign(ℜ[z]), z ∈ D,

is the labeling function. We choose our training samples to be the set

S_n = {(z, t(z)) | z = e^{i(k/n)2π}, k = 0, …, n − 1}.

We use the orthonormal polynomial basis of the unit disk as features,

φ_k(z) = √((k + 1)/π) z^k.

We visualize a holomorphic function in three ways. First, we use a domain coloring technique and graph the hypotheses on D. The hue of a color represents the angle, and the saturation represents the magnitude of a complex number. We also plot the contours of the real and imaginary parts of the hypotheses in the same graph using white and black lines, respectively. Second, we graph the real and the imaginary parts of the hypotheses on the unit circle T = ∂D. T is the set from which all the training points of S_n are sampled. In other words, S = T. Third, we graph the image of T in the range space of the hypotheses. We have visualized the analogue of the holomorphic optimal Bayes classifier o_D in the left column of figure 1. However, we have made use of the Szegő kernel to project the labeling function t instead of the Bergman projection of t. The middle column of figure 1 depicts the output of applying the complex-valued SVC learning rule to S_30 using the orthonormal basis of A²(D) as features. As demonstrated in figure 1, the solution to program 17 is not robust with respect to A²(D). Nevertheless, due to the normality of the holomorphic functions, we can see that the learned hypothesis resembles o_D, even though they are not learned through the same learning rule, or make use of the same hypothesis class for that matter.
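The toy setup above is easy to reproduce. A sketch that builds S_30 and numerically confirms the orthonormality of the disk features φ_k(z) = √((k+1)/π) z^k over D; the quadrature grid sizes and the number of features checked are arbitrary choices:

```python
import numpy as np

# Training set S_n: n-th roots of unity on T, labeled by t(z) = sign(Re[z]).
n = 30
z_train = np.exp(1j * 2.0 * np.pi * np.arange(n) / n)
t_train = np.sign(np.real(z_train))

def phi(k, z):
    """Orthonormal polynomial basis of the Bergman space of the unit disk."""
    return np.sqrt((k + 1) / np.pi) * z ** k

# Check <phi_j, phi_k> = delta_jk over D with a polar-grid quadrature
# (midpoint rule in r, rectangle rule in theta).
nr, nt = 800, 256
r = (np.arange(nr) + 0.5) / nr
theta = np.arange(nt) * 2.0 * np.pi / nt
R, Th = np.meshgrid(r, theta, indexing="ij")
zeta = R * np.exp(1j * Th)
w = R * (1.0 / nr) * (2.0 * np.pi / nt)  # quadrature weights, r dr dtheta

m = 5
G = np.array([[np.sum(phi(j, zeta) * np.conj(phi(k, zeta)) * w)
               for k in range(m)] for j in range(m)])
print(np.allclose(G, np.eye(m), atol=1e-3))  # True
```

The design matrix for the complex SVC learner would then be Φ_{nk} = φ_k(z_train[n]); the actual optimization of program 17 is left out here.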
Looking at figure 1, we can see that o_D is a transcendental function and has two logarithmic branch points at i and −i. With this observation in mind, it is no wonder that approximating o_D is troublesome for our learning rule. Furthermore, we see that the essential singular region of the nonrobust hypothesis has advanced inside of D.

[Figure 1: Visualizations of o_D and the optimal hypotheses of the hinge loss function for S_30, in three columns (o_D, orthonormal, harmonic): (a) domain coloring on D; (b) graph of the complex-valued hypotheses on the unit circle T; (c) image of S = T under the complex-valued hypotheses. We can see that even though the hypotheses optimize a different loss function than the 0-1 loss, they show certain similarities with o_D.]

This advancement has caused the learner's output to be nonrobust in a neighborhood of T. This observation is common to nonrobust learning rules with respect to A²(X).

Theorem 4.1. Let h ∈ A²(X) be the output of a nonrobust learning rule; then h is robust on a dense open subset of X.

Theorem 4.1 shows that not only does our choice of A²(X) as a replacement for L²(X) model the transferability of adversarial examples, it also correctly predicts the apparent paradox in the existence of adversarial examples in ANNs, where the nonrobust hypothesis appears to be robust almost everywhere. Figure 1c graphs the image of T in the range space of the hypotheses, h(T). Due to the singularities of the holomorphic projection of t, o_D(T) passes through the point at infinity. Consequently, o_D(T) consists of two parallel lines; the trapezoidal shape in the figure is caused by the fact that in practice we cannot reach the point at infinity. Looking at the nonrobust hypothesis in figure 1c, we can see that the image of T passes through the decision boundary quite a few times.
The figure shows that h(T) of the nonrobust hypothesis is longer than necessary. In other words, it is possible to find a holomorphic hypothesis that achieves the same loss on the training set while h(T) does not pass through the decision boundary so many times. We can repeat the same argument for any other curve inside D. Thus, we argue that if the output h of A with respect to A²(X) minimizes the area covered by the image of X under h, then it is robust.

Theorem 4.2. A learning rule A with respect to A²(X) that minimizes the Dirichlet energy of its output,

E[h] = ∫_X ∥∇h(z)∥² dV(z),

is robust.

Theorem 4.2 could easily be generalized to any differentiable hypothesis class. The reason is that the process of measuring the length of the image of a path is the same for all differentiable hypotheses, whether they are complex-valued or not.

Definition 4.3 (robust (complex) SVC learning rule). The robust (complex) SVC learning rule is the same as the (complex) SVC learning rule, but minimizes E[h] instead of ∥h∥²_{H(X)}.

Definition 4.4 (harmonic features). A set of non-constant features {φ_i}_{i=1}^∞ is harmonic if

∫_X ∇φ_j(z) • ∇φ_k(z) dV(z) = 1 if j = k, and 0 if j ≠ k.

Given the tuning matrix

Σ_jk = ∫_X ∇φ_j(z) • ∇φ_k(z) dV(z),

we can transform any set of features to its harmonic counterpart,

φ* = Σ^{−1/2} φ.

We emphasize that Σ is positive definite by definition. Thus, a unique square root of Σ exists and it is invertible.

Theorem 4.5. ℓ₂ regularization of a harmonic hypothesis minimizes its Dirichlet energy.

We have repeated the toy experiment using the harmonic basis functions of A²(X) and reported the results in the right column of figure 1. We can see that the harmonic hypothesis is robust, as expected. Thus, we have succeeded in finding a robust learning rule for A²(X). The interested reader can find more examples and experiments in the supplementary material of the paper.
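The tuning-matrix construction of Definition 4.4 is straightforward to carry out numerically. A hedged sketch on a simpler stand-in setting than the paper's disk features: X = [0, 1] with real monomial features φ_k(x) = x^k, for which Σ_jk = ∫₀¹ jk x^{j+k−2} dx = jk/(j+k−1) in closed form:

```python
import numpy as np

m = 6  # features phi_1 .. phi_m, all non-constant
j = np.arange(1, m + 1)

# Tuning matrix of gradient inner products: Sigma_jk = jk/(j+k-1).
Sigma = (j[:, None] * j[None, :]) / (j[:, None] + j[None, :] - 1.0)

# Inverse matrix square root via the eigendecomposition of the SPD matrix.
w_eig, V = np.linalg.eigh(Sigma)
Tr = V @ np.diag(w_eig ** -0.5) @ V.T  # Tr = Sigma^{-1/2}

# The transformed features phi* = Tr @ phi have orthonormal gradients:
print(np.allclose(Tr @ Sigma @ Tr.T, np.eye(m), atol=1e-6))  # True
```

Since Σ is positive definite, the eigenvalues are positive and the inverse square root is well defined, mirroring the remark after Definition 4.4.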

5. ROBUST TRAINING OF ANNS

In this section, we will apply the results of section 4 to ANNs in a limited manner to demonstrate the feasibility and applicability of our approach. First, we put the proposed framework to use and describe how adversarial examples occur in ANNs. As stated in section 2, we represent neural networks using an integral representation. If we assume that ν ∈ A²(Ω), then we can repeat the process described in section 4 by replacing ν with its power series representation and follow a similar path. We already know that the orthonormal basis for A²(Ω) would not be robust on the boundary of Ω. In other words, the learning rule would fail to find the robust coefficients of neurons on the boundary of Ω. When we use the harmonic bases of A²(Ω), on the other hand, the region of convergence of ν(ω) covers ∂Ω and the output of the learning rule is robust.

[Figure 2: Graphs of the real and imaginary parts, ℜ[h] and ℑ[h], of the orthonormal and harmonic hypotheses h(x) on x ∈ [−1, 1].]

We have trained two neural networks which use the first 30 orthonormal and harmonic bases to represent ν. Both networks use the Heaviside activation function H(ℜ[ω]x + ℑ[ω]) = 1_{ℜ[ω]x+ℑ[ω]>0}. The interested reader can find the details of the experiment in appendix D. We have reported the results of the experiment in figure 2. The figure suggests that the harmonic ANN is robust. To showcase the applicability of the framework in a real-world scenario, we use an MLP classifier with ReLU activation as a surrogate and attack two polynomial hypotheses that are trained on a subset of the UCI digits dataset (Dua & Graff, 2017). We have reported the results in figure 3. The results show that the proposed robust learning rule is effective in mitigating the effects of the phenomenon, and they support the proposition that ANNs and polynomials are different representations of the same normal hypothesis class.

Theorem 5.1.
The output of the robust SVC learning rule with respect to a hypothesis h represented in integral form satisfies the Poisson partial differential equation (PDE)

−∆ν(ω) = Σ_{n=1}^N λ_n t_n σ(x_n; ω),

in which ∆ is the Laplace operator. Theorem 5.1 shows that training a robust ANN is the same as solving a PDE. This observation suggests that robust classifiers are a normal family of functions. The fundamental solution Φ of the Laplace operator could be used to find ν as a function of {λ_n}_{n=1}^N,

ν(ω) = Σ_{n=1}^N λ_n t_n ∫_Ω Φ(ω − w) σ(x_n; w) dV(w).

Definition 5.2 (harmonic activation function). A family of activation functions S is harmonic if they satisfy the Helmholtz equation with the natural Dirichlet or Neumann boundary condition,

−∆_ω s(x; ω) = s(x; ω), s ∈ S(X).

Theorem 5.3. ℓ₂ regularization of ν(ω) of a harmonic ANN is the same as minimizing its Dirichlet energy.

Theorem 5.3 is the final result of this paper. We have managed to find a robust learning rule for a generalized notion of ANNs. Solving PDEs is an involved and complicated process, and we will not attempt to solve the derived PDEs in this paper. Nevertheless, both PDEs are very well known and well studied, and various methods and techniques are available in the literature that focus on solving them.

6. CONCLUSION

In this paper we introduced a general framework for training robust ANN classifiers. We proposed a formal definition of robustness and transfer in learning-theoretic terms, and described how adversarial examples might emerge from the pointwise convergence of the trained hypothesis to the infinite sample limit of the learning rule. Since our proposal assumes that the convergence is pointwise, we need a separate explanation for the transfer of adversarial examples between hypotheses under pointwise convergence. To this end, we propose the normal hypothesis classes, and define these classes by adapting the definition of a normal family of functions. We also introduce integral representations as an abstraction of ANNs that is easier to analyze in terms of convergence. Next, we showed that under the proposed definitions, learning rules for ANNs that converge to a hypothesis in L²(X) do not explain the observation of transfer between different architectures of ANNs. As an alternative, we propose that transfer in ANNs is better modeled by functions in O(X), the set of holomorphic functions on X. We provide the necessary definitions to use O(X) as a hypothesis class, and take the first steps towards enabling the use of the powerful tools of complex analysis in the study of the phenomenon. Through holomorphicity, our framework provides a geometrical interpretation of the adversarial examples phenomenon. We conclude that a binary classifier with minimal Dirichlet energy is robust. In other words, replacing the ℓ₂ regularization term in the loss function with the Dirichlet energy of the hypothesis should result in a robust classifier. Minimizing the Dirichlet energy might not be tractable in gradient descent, as we need to compute the derivative of the Dirichlet energy to do so. To circumvent this problem, we introduce the harmonic features and activation functions.
In summary, we first construct features or activation functions that satisfy a condition, and then show that ℓ₂ regularization of the resulting hypothesis classes is the same as minimizing their Dirichlet energy. Consequently, minimizing the Dirichlet energy for these hypotheses could be as efficient as ℓ₂ regularization of their parameters. To the best of our knowledge, we have provided the first method that makes use of the calculus of variations and differential equations to tackle the challenge of robust training of ANNs. Nevertheless, the analysis proved that training robust ANN classifiers is not trivial, and we need to either implicitly or explicitly solve an intermediate PDE. There are multiple methods of doing so available in the relevant literature. As a result, we have left a large-scale implementation of the proposed method to future work, when an in-depth understanding of the appropriate parameter domain sets and harmonic activation functions in ANNs is achieved.

A SUMMARY OF THE FRAMEWORK

Applying the proposal of Barati et al. (2021) to ANNs does not end up with a unique set of optimal training points as it did with polynomials. In our framework, theorem 4.2 provides an alternative to an optimal sequence of training points. In this case, h_n : X → ℂ maps X to a subset of ℂ with minimal area. The area of h_n(X) is decided by d_n, and it will be less than or equal to the area of h(X). We exploit this fact to guarantee uniform convergence of {h_n}_{n=1}^∞ for binary classifiers. It is conceivable that a similar approach could be taken for other applications of machine learning as well. In conclusion, the proposed framework is implemented in the following steps:
1. Check if A satisfies definition 2.2. If it does, A is already robust. For example, Barati et al. (2021) used Bernstein polynomials to uniformly approximate the optimal hypothesis.
2. Replace the regularization term of A with the Dirichlet energy of the hypothesis, E[h_n].
3. Check if the updated A satisfies definition 2.2. To do so, consider an arbitrary path γ : ℝ → X and prove that h is robust on γ when the length of h(γ) is minimal.
4. If computing E[h_n] is intractable, consider finding a set of suitable harmonic features.
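The second step above amounts to adding an estimate of E[h_n] to the training loss. A minimal sketch of the estimator itself, using central finite differences on a grid over X = [0, 1]; the hypotheses passed in are illustrative closed-form functions, not trained models:

```python
import numpy as np

def dirichlet_energy(h, a=0.0, b=1.0, n=10001):
    """Estimate E[h] = int_X |h'(x)|^2 dV(x) with finite differences on a grid."""
    x = np.linspace(a, b, n)
    dh = np.gradient(h(x), x)  # second-order accurate numerical derivative
    return np.sum(np.abs(dh) ** 2) * (b - a) / n

# For h(x) = 3x the gradient is constant, so E[h] = 9 on [0, 1].
print(dirichlet_energy(lambda x: 3.0 * x))  # ≈ 9.0
```

In an actual training loop this quantity (or Theorem 4.5's ℓ₂ surrogate over harmonic features) would be added to the data loss as the regularization term; for a linear hypothesis the two coincide up to the volume of X.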

B COMPARISON WITH THE LITERATURE

The main point of distinction between the proposed framework and the competing proposals is in the use of complex analysis to simplify the analysis. Even though analytic properties of the hypotheses have been exploited before in the literature to derive various methods of defence and attack, e.g. Wen (2022), the use of holomorphic functions and complex analysis in machine learning has been sparse (Barkatou & Jaroschek, 2018; Heilman et al., 2015; Sarma et al., 2019). We believe the reason to be that applying complex analysis to machine learning is an interdisciplinary effort by nature. Goodfellow et al. (2015) were the first to show that nonrobust ANNs are weak to analytic attacks, and attributed the phenomenon to something that they called linearity. Tanay & Griffin (2016) refuted this claim by showing that linear classifiers could be robust. We answer the apparent paradox between the linear and nonlinear nature of the phenomenon by proposing that both positions are in some sense correct, and that the phenomenon is better described by the analytic properties of the robust hypothesis. Hein & Andriushchenko (2017) proposed a certificate of robustness for differentiable classifiers. The same certificate could be used for analytic functions. The certificate might be improved by considering the Taylor series expansion of the classifier around a test sample; the Abel-Ruffini theorem might result in some complications in this process. From a geometric point of view, our proposal is very well aligned with the dimpled manifold model of Shamir et al. (2021). Regarding the boundary tilting perspective of Tanay & Griffin (2016), it seems that the nonrobust classifier in figure 1c has managed to cross the decision boundary at right angles, so there seem to be exceptions to the boundary tilting perspective in nonlinear cases. Paknezhad et al. (2021) hypothesize that the robust classifier maximizes the margin.
We did not make an attempt to analyze the learning rule from the perspective of maximum margin classification. Nonetheless, the peculiar shape of the image of T under the robust hypothesis in figure 1c suggests that the proposed robust SVC learning rule also maximizes the margin in some sense. Our model for the transfer of adversarial examples is mostly similar to the proposal of Goodfellow et al. (2015), in which the reason for the transfer of adversarial examples is deemed to be that the models converge to the optimal linear hypothesis. Papernot et al. (2016) show that existing machine learning approaches are in general vulnerable to systematic black-box attacks regardless of their structure, demonstrating that transferable adversarial examples are common in machine learning models. Here, we have defined the holomorphic optimal Bayes classifier and given a formal definition of how transfer occurs. Ilyas et al. (2019) move the blame to hidden patterns in the input, which our proposal does not align with. Inkawhich et al. (2019) relate the transfer of the adversarial examples to the learned features as well. There are multiple instances of the use of gradient information throughout the literature (Ros & Doshi-Velez, 2018; Paknezhad et al., 2021). The main contrast with our proposal is that we manage to relate gradient regularization to the geometry of h(X) and further reveal how gradient regularization is related to the Dirichlet energy of the hypothesis.

C PROOF OF THE THEOREMS

C.1 THEOREM 2.3

Proof. $L^2(\mathcal{X})$ is nonuniform learnable if and only if it is a countable union of PAC learnable hypothesis classes $\mathcal{H}_\alpha(\mathcal{X})$. Since $L^2(\mathcal{X})$ is a Hilbert space, it has an orthonormal basis $\{\varphi_\alpha\}_{\alpha=0}^{\infty}$. Then, we can choose
\[
\mathcal{H}_\alpha(\mathcal{X}) = \Big\{ \sum_{k=0}^{\alpha} a_k \varphi_k(x) \,\Big|\, a_k \in \mathbb{R} \Big\},
\]
and the theorem follows.

C.2 THEOREM 2.5

Proof. The Lagrangian of program 7 is
\[
L = \frac{1}{2}\int_\Omega |\nu(\omega)|^2 \, dV(\omega) + \sum_{n=1}^{N} \lambda_n \Big( 1 - t_n \int_\Omega \nu(\omega)\,\sigma(x_n;\omega) \, dV(\omega) \Big).
\]
We can rearrange the Lagrangian into
\[
L = \int_\Omega \Big( \frac{1}{2}|\nu(\omega)|^2 - \sum_{n=1}^{N} \lambda_n t_n \nu(\omega)\,\sigma(x_n;\omega) \Big)\, dV(\omega) + \sum_{n=1}^{N} \lambda_n = \int_\Omega \mathcal{L}(\omega,\nu,\nabla\nu) \, dV(\omega) + \sum_{n=1}^{N} \lambda_n.
\]
By the Euler-Lagrange equations for a function of several variables, the optimal $\nu$ must satisfy the following PDE
\[
\frac{\delta \mathcal{L}}{\delta \nu} - \sum_{j=1}^{d} \frac{\partial}{\partial x_j} \frac{\delta \mathcal{L}}{\delta \nu_j} = 0,
\]
in which $\nu_j = \partial\nu/\partial x_j$, and $\delta/\delta\nu$ is the usual differentiation with the exception that it treats $\nu$ like a symbolic variable. Thus,
\[
\nu(\omega) = \sum_{n=1}^{N} \lambda_n t_n \sigma(x_n;\omega).
\]
Similarly, the Lagrangian of program 6 is
\[
L = \frac{1}{2}\sum_{i=0}^{\infty} a_i^2 + \sum_{n=1}^{N} \lambda_n \Big( 1 - t_n \sum_{i=0}^{\infty} a_i \varphi_i(x_n) \Big).
\]
The optimal $a_i$ is achieved when the derivative of the Lagrangian with respect to $a_i$ vanishes. Thus,
\[
a_i = \sum_{n=1}^{N} \lambda_n t_n \varphi_i(x_n).
\]
To see that the learning rule is not robust in general, consider Dirac's delta, which can be constructed by taking the limit of a normal probability density function with vanishing variance,
\[
\delta(x;\omega) = \lim_{\sigma\to 0} \mathcal{N}(x;\omega,\sigma) = \begin{cases} \infty & x = \omega \\ 0 & \text{otherwise} \end{cases}.
\]
We can see that the output of the SVC learning rule,
\[
h(x) = \int_\Omega \nu(\omega)\,\delta(x;\omega) \, dV(\omega) = \nu(x) = \sum_{n=1}^{N} \lambda_n t_n \delta(x_n; x),
\]
is the same as the memoising learning rule, which is the textbook example of a consistent learner that is not nonuniform. Thus, the sample complexity of the SVC learning rule depends on the distribution of training samples. Our definition of robustness is the same as Cauchy's criterion for uniform convergence of a sequence of functions. Since the sample complexity of the SVC learning rule depends on the distribution of training samples, $\{A(d^n)\}_{n=1}^{\infty}$ cannot be uniformly converging to its limit in general.
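The degenerate case above can be sketched numerically: a minimal Python sketch, assuming a Gaussian with a small bandwidth as a stand-in for Dirac's delta, and placeholder values for the training points, labels, and dual weights. It shows that the resulting hypothesis is a spike train that memoises the training set and vanishes everywhere else.

```python
import math

# Hypothetical training set: points x_n with labels t_n and dual weights lam_n.
xs = [-0.8, -0.3, 0.4, 0.9]
ts = [-1.0, -1.0, 1.0, 1.0]
lams = [1.0, 1.0, 1.0, 1.0]

def gaussian(x, mu, s):
    """Normal pdf N(x; mu, s); as s -> 0 it approaches Dirac's delta."""
    return math.exp(-0.5 * ((x - mu) / s) ** 2) / (s * math.sqrt(2 * math.pi))

def h(x, s=1e-2):
    """SVC output with delta-like features: h(x) = sum_n lam_n t_n delta(x_n; x)."""
    return sum(l * t * gaussian(x, xn, s) for l, t, xn in zip(lams, ts, xs))

# The hypothesis is essentially zero away from training points (memoisation):
print(h(0.4))   # a large positive spike at a positively labeled training sample
print(h(0.05))  # essentially zero between samples
```

Adding more samples only adds more spikes, so the sequence of hypotheses never settles down uniformly, in line with the conclusion of the proof.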

C.3 PROPOSITION 2.6

Proof. Consider the integral representation
\[
h(x) = \int_\Omega \nu(\omega)\,\delta(x;\omega)\, dV(\omega) = \nu(x).
\]
According to theorem 2.5, in the infinite sample limit we have
\[
h(x) = \int_{\mathcal{X}} \lambda(\zeta)\, t(\zeta)\, \delta(x;\zeta) \, dV(\zeta) = \lambda(x)\, t(x).
\]
Since the norm of $h$ should be minimal and $h$ should satisfy $t(x)h(x) \geq 1$, we can deduce that $\lambda(x) = 1$ for all $x$. We can ignore the samples on the decision boundary since they form a degenerate subset of $\mathcal{X}$. Consequently, if $t \in \mathcal{H}$, then $t$ remains feasible and dominates all other $h \in \mathcal{H}$ in terms of the training loss and the regularization score.

C.4 THEOREM 2.9

Proof. Since $\mathcal{H}$ is normal, we know that for large and diverse enough training sets we have $\| h - \hat h \|_\infty \leq \epsilon$. Moreover, since $\partial\mathcal{H}$ is normal as well, we have $\| \partial\hat h/\partial x_i - \partial h/\partial x_i \|_\infty \leq \epsilon'$. Given both conditions, we can infer that a gradient-based attack follows a similar path for both $h$ and $\hat h$, and the label assigned to the generated adversarial example is also probably approximately equal.
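The argument of theorem 2.9 can be illustrated with a toy computation. The two quadratic classifiers below are hypothetical stand-ins for $h$ and $\hat h$ that are close in value and in gradient; a single $\ell_2$-normalised gradient step then moves a test point in nearly the same direction for both, which is the transfer mechanism described above.

```python
import math

# Two hypothetical 2-D classifiers, close in value and in gradient
# (stand-ins for h and h-hat in theorem 2.9).
def h(x, y):
    return x**2 - y - 0.5

def h_hat(x, y):
    return 1.01 * x**2 - 0.99 * y - 0.5

def grad(f, x, y, eps=1e-6):
    """Gradient of f by central finite differences."""
    gx = (f(x + eps, y) - f(x - eps, y)) / (2 * eps)
    gy = (f(x, y + eps) - f(x, y - eps)) / (2 * eps)
    return gx, gy

def attack_step(f, x, y, step=0.1):
    """One l2-normalised gradient step that pushes f's output down."""
    gx, gy = grad(f, x, y)
    norm = math.hypot(gx, gy)
    return x - step * gx / norm, y - step * gy / norm

p = attack_step(h, 1.0, 0.2)
q = attack_step(h_hat, 1.0, 0.2)
# The two attack paths nearly coincide, so an adversarial example crafted
# against h_hat transfers to h:
print(math.hypot(p[0] - q[0], p[1] - q[1]))
```

The distance between the two attacked points is two orders of magnitude smaller than the step size, so an example crafted against the surrogate lands in essentially the same place as one crafted against the target.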

C.5 THEOREM 2.10

Proof. It is enough to find a learning rule $A$ with respect to $L^2(\mathcal{X})$ that does not satisfy the condition for being normal. The SVC learning rule for the integral representation $h(x) = \int_\Omega \nu(\omega)\,\delta(x;\omega)\, dV(\omega)$ is such a learning rule. To see why, imagine a set of training points $S \sim d^n$. Then $h = A(S)$ is a point mass function. Consequently, we cannot find a dense set $K \subseteq \mathcal{X}$ for which $\| A(d^n) - A(\mathcal{D}) \|_\infty \leq \epsilon$. Thus, $L^2(\mathcal{X})$ is not a normal hypothesis class.

... that $h(S)$ and the imaginary axis, which represents the decision boundary, have the minimum number of intersection points allowed by the training set $S$.

According to definition 2.2, to show that $A$ is robust, we have to show that for any $\epsilon \geq 0$ a nonnegative integer $N$ exists for which the outputs of $A$ for any two training sets $S, S' \sim \mathcal{D}$ larger than $N$ satisfy $\| A(S) - A(S') \|_\infty \leq \epsilon$. Without loss of generality, assume that $A(\mathcal{D})$ partitions $\mathcal{X}$ into $M$ regions. Thus, as long as $S$ is diverse enough that we have at least one sample from each region, the number of times that the image of any curve $A(d^n)(\gamma)$ intersects the decision boundary does not change. Consequently, adding more samples to the training set only results in a more accurate estimation of the position of the decision boundary. In other words, after adding the required $M$ support vectors to $S_n$, the position of the decision boundary does not change drastically. As a result, for big enough training sets, equation 54 is satisfied. Keep in mind that the same cannot be said about a learning rule that does not minimize $E[h]$, since there is no guarantee that $A(d^n)(\gamma)$ and the decision boundary do not intersect more than necessary.

C.9 THEOREM 4.5

Proof. $h$ has a representation
\[
h(z) = \sum_{\alpha \geq 0} a_\alpha \varphi_\alpha(z) = a^H \varphi(z).
\]
We can prove the theorem by simply calculating $E[h]$,
\[
\int_\Omega \| \nabla h(z) \|^2 \, dV(z) = \int_\Omega a^H J(z) J(z)^H a \, dV(z) = a^H \Big( \int_\Omega J(z) J(z)^H \, dV(z) \Big) a = a^H \Sigma a = a^H a,
\]
in which $J(z)$ is the Jacobian of the feature vector $\varphi(z)$.

C.10 THEOREM 5.1

Proof. The Lagrangian for the real-valued robust SVC learning rule is
\[
L = \frac{1}{2} \int_\Omega \| \nabla\nu(\omega) \|^2 \, dV(\omega) + \sum_{n=1}^{N} \lambda_n \Big( 1 - t_n \int_\Omega \nu(\omega)\,\sigma(x_n;\omega)\, dV(\omega) \Big)
= \int_\Omega \Big( \frac{1}{2} \sum_{j=1}^{d} \nu_j(\omega)^2 - \sum_{n=1}^{N} \lambda_n t_n \nu(\omega)\,\sigma(x_n;\omega) \Big)\, dV(\omega) + \sum_{n=1}^{N} \lambda_n
= \int_\Omega \mathcal{L}(\omega,\nu,\nabla\nu)\, dV(\omega) + \sum_{n=1}^{N} \lambda_n.
\]
By the Euler-Lagrange equations for a function of several variables, $\nu$ satisfies the following equation
\[
\frac{\delta\mathcal{L}}{\delta\nu} - \sum_{j=1}^{d} \frac{\partial}{\partial\omega_j} \frac{\delta\mathcal{L}}{\delta\nu_j} = 0.
\]
Consequently, $\nu$ must satisfy the following PDE
\[
-\Delta\nu(\omega) = \sum_{n=1}^{N} \lambda_n t_n \sigma(x_n;\omega).
\]

C.11 THEOREM 5.3

Proof. We know that
\[
\nu(\omega) = \sum_{n=1}^{N} a_n s_n(\omega), \qquad -\Delta s_n(\omega) = s_n(\omega).
\]
Computing $E[\nu]$, we have
\[
E[\nu] = \int_\Omega \nabla\nu(\omega)\cdot\nabla\nu(\omega)\, dV(\omega) = \sum_{n=1}^{N}\sum_{m=1}^{N} a_n a_m \int_\Omega \nabla s_n(\omega)\cdot\nabla s_m(\omega)\, dV(\omega).
\]
Since $s$ is twice differentiable, we may apply the divergence theorem to get
\[
\int_\Omega \nabla\cdot\big( s_n(\omega)\,\nabla s_m(\omega) \big)\, dV(\omega) = \int_\Omega \big( \nabla s_n(\omega)\cdot\nabla s_m(\omega) - s_n(\omega)\, s_m(\omega) \big)\, dV(\omega) = \oint_{\partial\Omega} s_n(\omega)\,(\nabla s_m \cdot n)\, dS(\omega),
\]
where $n$ is the outward-pointing normal of the boundary of $\Omega$. When either $s_m$ satisfies the natural Neumann or $s_n$ satisfies the natural Dirichlet boundary conditions, the right hand side of equation 68 must vanish. Consequently,
\[
\int_\Omega \nabla s_n(\omega)\cdot\nabla s_m(\omega)\, dV(\omega) = \int_\Omega s_n(\omega)\, s_m(\omega)\, dV(\omega).
\]
By replacing this expression in $E[\nu]$ we have
\[
E[\nu] = \|\nu\|^2_{L^2(\mathcal{X})} = \sum_{n=1}^{N}\sum_{m=1}^{N} a_n a_m \int_\Omega s_n(\omega)\, s_m(\omega)\, dV(\omega).
\]

D DETAILS OF THE EXPERIMENTS

D.1 SECTION 4

First, we find $o_\mathcal{D}$ by projecting $t(z)$ onto $H^2(\mathbb{T}) \subset L^2(\mathbb{T})$,
\[
o_\mathcal{D}(z) = \int_{\mathbb{T}} t(\zeta)\, S_{\mathbb{T}}(z,\zeta)\, d\zeta.
\]
$H^2(\mathbb{T})$ is the Hardy space on $\mathbb{T}$ and its reproducing kernel is the Szegő kernel $S_{\mathbb{T}}(z,\zeta)$. The theories of the Bergman and the Szegő kernels are identical as far as we are concerned.
Consequently,
\[
o_\mathcal{D}(z) = \int_{-\pi}^{\pi} t(e^{i\theta})\, \frac{1}{2\pi(1 - z e^{-i\theta})}\, d\theta = \frac{1}{2\pi} \int_{-\pi}^{\pi} \frac{\operatorname{sign}(\cos\theta)}{1 - z e^{-i\theta}}\, d\theta = \frac{1}{2\pi} \Big( \int_{-\pi/2}^{\pi/2} \frac{d\theta}{1 - z e^{-i\theta}} - \int_{\pi/2}^{3\pi/2} \frac{d\theta}{1 - z e^{-i\theta}} \Big) = \frac{i}{\pi} \big( \log(-z - i) - \log(-z + i) \big).
\]
The projection has two logarithmic branch points, and our computer algebra system chooses the branch cuts to be parallel to the real axis. We rotate them by 90 degrees in opposite directions so that the branch cuts are perpendicular to $\mathbb{T}$ and fall outside of $\mathbb{D}$. To rotate the branch cuts by an angle $\phi$, we multiply the expression inside the logarithms by $e^{i\phi}$. Thus, the projection of $t(z)$ onto $H^2(\mathbb{T})$ is
\[
o_\mathcal{D}(z) = \frac{i}{\pi} \big( \log(i(-z - i)) - \log(-i(-z + i)) \big) = \frac{i}{\pi} \big( \log(1 - iz) - \log(1 + iz) \big).
\]
Next, we find a non-robust hypothesis in $A^2(\mathbb{D})$. As we have described in the paper, the orthonormal bases for $A^2(\mathbb{D})$ are
\[
\varphi_k(z) = \frac{z^k}{\sqrt{\gamma_k}}, \qquad \gamma_k = \int_{\mathbb{D}} |z|^{2k}\, dV(z) = \frac{\pi}{k+1}.
\]
We choose a series representation for our hypothesis,
\[
h(z) = b + \sum_{k=1}^{K} a_k \varphi_k(z),
\]
then use a convex optimization library to solve program 17 for $h$ with $S_{30}$ as the training set. Finally, we train a robust hypothesis with the help of theorem 4.5. To do so, we need to find the harmonic bases for $A^2(\mathbb{D})$. The elements of the tuning matrix $\Sigma_{jk}$ of the polynomial bases are
\[
\Sigma_{jk} = \int_{\mathbb{D}} \frac{d}{dz} z^j \, \overline{\frac{d}{dz} z^k}\, dV(z) = jk \int_{\mathbb{D}} z^{j-1} \bar z^{\,k-1}\, dV(z) = \begin{cases} 0 & j \neq k \\ k\pi & j = k \end{cases}.
\]
Consequently, the harmonic bases of $A^2(\mathbb{D})$ are
\[
\varphi^*_k(z) = \frac{z^k}{\sqrt{k\pi}}.
\]
The reader might find it interesting that the corresponding kernel for the harmonic bases of $A^2(\mathbb{D})$ is the polylogarithm of order 1,
\[
K^*_\mathbb{D}(z,\zeta) = \sum_{k=1}^{\infty} \varphi^*_k(z)\, \overline{\varphi^*_k(\zeta)} = \frac{1}{\pi} \sum_{k=1}^{\infty} \frac{(z\bar\zeta)^k}{k} = -\frac{1}{\pi}\log(1 - z\bar\zeta).
\]
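The diagonal form of the tuning matrix $\Sigma_{jk}$ can be verified by numerical quadrature on $\mathbb{D}$; the grid resolution below is an arbitrary choice.

```python
import cmath
import math

def tuning_entry(j, k, nr=200, nt=200):
    """Sigma_jk = integral over the unit disk of (d/dz z^j) * conj(d/dz z^k) dV,
    computed with a polar midpoint rule."""
    total = 0j
    for a in range(nr):
        r = (a + 0.5) / nr
        for b in range(nt):
            theta = 2 * math.pi * (b + 0.5) / nt
            z = r * cmath.exp(1j * theta)
            dj = j * z ** (j - 1)
            dk = k * z ** (k - 1)
            total += dj * dk.conjugate() * r  # r dr dtheta area element
    return total * (1.0 / nr) * (2 * math.pi / nt)

# Off-diagonal entries vanish, and Sigma_kk = k * pi as computed in the text:
print(abs(tuning_entry(1, 2)))                   # practically zero
print(tuning_entry(2, 2).real, 2 * math.pi)      # both approximately 2*pi
```

The off-diagonal entries vanish exactly on the midpoint grid because the angular factor $e^{i(j-k)\theta}$ averages to zero over a full period.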

D.2 SECTION 5

We train a single-layer perceptron in the first part of section 5. The training samples are a set of 8 equispaced points in $[-1, 1]$, labeled by their sign. We choose an integral representation for the network,
\[
h(x) = \int_{\mathbb{D}} \nu(\omega)\, H\big( \Re[\omega]x + \Im[\omega] \big)\, dV(\omega), \qquad \nu \in A^2(\mathbb{D}),
\]
in which $H(x) = 1_{x>0}$ is the Heaviside step function. We need to choose a representation for $\nu$ to compute $h$. In this experiment we want to compare the performance of the orthonormal and harmonic bases of $A^2(\mathbb{D})$. Replacing $\nu$ with the corresponding representations, we have
\[
h(x) = \sum_{k=0}^{\infty} a_k \int_{\mathbb{D}} \frac{\omega^k}{\sqrt{\gamma_k}}\, H\big( \Re[\omega]x + \Im[\omega] \big)\, dV(\omega) = \sum_{k=0}^{\infty} a_k \frac{\psi_k(x)}{\sqrt{\gamma_k}}.
\]
Thus, the orthonormal and harmonic $h$ have representations
\[
h(x) = \sum_{k=0}^{\infty} a_k \sqrt{\frac{k+1}{\pi}}\, \psi_k(x), \qquad h^*(x) = \sum_{k=0}^{\infty} a_k \sqrt{\frac{1}{k\pi}}\, \psi_k(x).
\]
Next, we compute $\psi_k$,
\[
\psi_k(x) = \int_{\mathbb{D}} \omega^k H\big( \Re[\omega(x - i)] \big)\, dV(\omega) = \int_{\Re[\omega(x-i)]>0} \omega^k\, dV(\omega) = \int_{-\tan^{-1}(x)}^{-\tan^{-1}(x)+\pi} \int_0^1 r^k e^{ik\theta}\, r\, dr\, d\theta = \frac{1}{k+2} \int_{-\tan^{-1}(x)}^{-\tan^{-1}(x)+\pi} e^{ik\theta}\, d\theta = \frac{-i\, e^{-ik\tan^{-1}(x)}}{k(k+2)} \big( (-1)^k - 1 \big) = \begin{cases} 0 & k \text{ even} \\ \dfrac{2i\, e^{-ik\tan^{-1}(x)}}{k(k+2)} & k \text{ odd} \end{cases}.
\]
Finally, we train $h$ and $h^*$ using the complex SVC learning rule. If the norm of $\psi_k$ is too small and training becomes numerically unstable, multiplying all $\psi_k$ by a constant does not change the results. The second experiment compares the robustness of an MLP with the robustness of harmonic and orthonormal Chebyshev bases. The MLP size is $64 \times 40 \times 30 \times 20 \times 10$ and all the neurons are ReLU activated. The MLP minimizes the cross entropy loss and is trained by the ADAM optimizer. We use the Chebyshev bases $T_\alpha$ as polynomial features. However, it would be computationally intractable to include all possible polynomial bases in the hypothesis. To see why, consider the count of all polynomial bases in which each $x_i$ has a degree of at most one. This set has the same cardinality as the set of all possible subsets of $\{x_i\}_{i=1}^{64}$, i.e. the power set of $\{x_i\}_{i=1}^{64}$.
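The closed form of $\psi_k$ derived above can be cross-checked by numerical quadrature over $\mathbb{D}$; the test point $x = 0.5$ and the grid resolution are arbitrary choices.

```python
import cmath
import math

def psi_quad(k, x, nr=100, nt=1000):
    """psi_k(x) = integral over the unit disk of w^k * H(Re[w(x - i)]) dV(w),
    computed with a polar midpoint rule."""
    total = 0j
    for a in range(nr):
        r = (a + 0.5) / nr
        for b in range(nt):
            theta = 2 * math.pi * (b + 0.5) / nt
            w = r * cmath.exp(1j * theta)
            if (w * (x - 1j)).real > 0:  # Heaviside indicator
                total += w ** k * r
    return total * (1.0 / nr) * (2 * math.pi / nt)

def psi_closed(k, x):
    """Closed form: 0 for even k, 2i e^{-ik arctan x} / (k(k+2)) for odd k."""
    if k % 2 == 0:
        return 0j
    return 2j * cmath.exp(-1j * k * math.atan(x)) / (k * (k + 2))

print(abs(psi_quad(1, 0.5) - psi_closed(1, 0.5)))  # small quadrature error
print(abs(psi_quad(2, 0.5)))                        # vanishes for even k
```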
Consequently, we have to decide which polynomial bases are added to the hypothesis. To do so, we enumerate all the possible walks of up to $D$ steps on an $8 \times 8$ lattice, and then map each walk to a polynomial. A node in the lattice represents a pixel, and the structure of the lattice represents the structure of a 2D image. A walk with $D$ steps is mapped to a degree-$D$ polynomial. For example, the walk
\[
x_{00} \to x_{01} \to x_{11} \to x_{10} \to x_{00}
\]
is mapped to $T_2(x_{00})\, T_1(x_{01})\, T_1(x_{11})\, T_1(x_{10})$. For computing the integrals we use the identities
\[
U_{n-1}(x) = \begin{cases} 2 \sum_{j<n,\ j\ \text{even}} T_j(x) - 1 & n \text{ odd} \\[4pt] 2 \sum_{j<n,\ j\ \text{odd}} T_j(x) & n \text{ even} \end{cases},
\qquad
\int_{-1}^{1} T_n(x)\, T_m(x)\, \frac{dx}{\sqrt{1-x^2}} = \begin{cases} 0 & n \neq m \\ \pi & n = m = 0 \\ \frac{\pi}{2} & n = m \neq 0 \end{cases}.
\]
Finally, we compute the harmonic Chebyshev bases and train the polynomial hypotheses using the SVC learning rule. To test the robustness of the trained polynomials, we evaluate their performance on adversarial examples that are generated by attacking the MLP with a single-step $\ell_2$ normalized gradient-based attack.
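The walk-to-polynomial mapping can be sketched as follows. The 4-neighbourhood with self-loops follows the description in this appendix; the breadth-first enumeration and the multi-index encoding of a feature are our own implementation choices.

```python
from collections import Counter
from itertools import product

SIDE = 8  # 8x8 lattice of pixels

def neighbours(i, j):
    """4-neighbourhood plus a self-loop, so pure monomials T_2, T_3, ... appear too."""
    cand = [(i, j), (i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)]
    return [(a, b) for a, b in cand if 0 <= a < SIDE and 0 <= b < SIDE]

def walks_from(start, steps):
    """All walks of exactly `steps` steps starting at `start`."""
    paths = [[start]]
    for _ in range(steps):
        paths = [p + [n] for p in paths for n in neighbours(*p[-1])]
    return paths

def walk_to_multiindex(path):
    """Map a walk to the degrees of the Chebyshev features it touches:
    visiting pixel p a total of d times contributes the factor T_d(x_p)."""
    return frozenset(Counter(path).items())

# Example from the text: x00 -> x01 -> x11 -> x10 -> x00 gives
# T2(x00) T1(x01) T1(x11) T1(x10):
walk = [(0, 0), (0, 1), (1, 1), (1, 0), (0, 0)]
print(sorted(walk_to_multiindex(walk)))

# Distinct features produced by all 2-step walks:
feats = {walk_to_multiindex(p) for s in product(range(SIDE), repeat=2)
         for p in walks_from(s, 2)}
print(len(feats))
```

Deduplicating by multi-index keeps the feature set small while preserving the locality structure that the lattice encodes.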

E DOMAINS OF HOLOMORPHY

In this section, we provide a brief summary of domains of holomorphy and describe how such domains can be constructed for common applications of machine learning. In complex analysis, the main subject of study is holomorphic functions of a scalar $z \in \mathbb{C}$. Domains of holomorphic functions in this setting are simple objects described by the Riemann mapping theorem and its generalizations, which state that if $\mathcal{X}$ is a simply connected domain in $\mathbb{C}$ and is not $\mathbb{C}$ itself, then a biholomorphic map between $\mathcal{X}$ and the unit disk $\mathbb{D}$ exists. One way to show that some domain is a domain of holomorphy is to find a biholomorphic map between that domain and another domain of holomorphy. A biholomorphic map is a map that is holomorphic and has a holomorphic inverse. As we have shown in section 4, the unit disk is a domain of holomorphy. Thus, all simply connected domains in $\mathbb{C}$ are domains of holomorphy. However, an analogue of the Riemann mapping theorem does not exist for several complex variables. Domains of holomorphy have a geometrical property known as pseudoconvexity. The formal description of pseudoconvexity is very technical, and we believe it is not useful to the audience. Instead, we present the reader with some domains of holomorphy, and then describe a method for constructing new domains of holomorphy from those building blocks. We already know that the unit disk $\mathbb{D} \subset \mathbb{C}$ is a domain of holomorphy, and how we can use biholomorphic mappings to find new ones. In higher dimensions, we can construct a domain of holomorphy using a Cartesian product of other domains of holomorphy. As a special case, the Cartesian product of $d$ disks is called a polydisk $\mathbb{D}^d(c, r)$ centered on $c$ with radius $r$, defined as
\[
\mathbb{D}^d(c, r) = \{ z \in \mathbb{C}^d \mid |z_j - c_j| \leq r_j,\ j = 1, \dots, d \}.
\]
The boundary of the polydisk $\mathbb{D}^d(0, 1)$ can be used to encode $[0, 1]^d$. To do so, map each real dimension to a complex variable with $z_j = e^{i\pi x_j}$.
The complex exponential is a periodic function that maps the real line to the unit circle. This procedure can be used to complexify datasets like MNIST. It is possible to find domains of holomorphy that cannot be constructed as a Cartesian product of lower dimensional domains. One such domain is the ball $B(c, r)$ centered on $c$ with radius $r$,
\[
B(c, r) = \{ z \in \mathbb{C}^d \mid \| z - c \| \leq r \}.
\]
The boundary of the ball can be used to encode correlated dimensions that have a constant magnitude, such as one-hot encoded categorical data. The real data are mapped to complex variables as in the case of polydisks, with the extra step of normalizing $z \in \mathbb{C}^d$ to have magnitude $r$. It goes without saying that a Cartesian product of balls and polydisks is also a domain of holomorphy.
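The two encodings above can be sketched in a few lines; the 4-pixel patch below is a hypothetical stand-in for a row of an image dataset.

```python
import cmath

def complexify(x):
    """Encode a point of [0, 1]^d on the distinguished boundary of the unit
    polydisk via z_j = exp(i * pi * x_j)."""
    return [cmath.exp(1j * cmath.pi * t) for t in x]

def normalize_to_ball(z, r=1.0):
    """Rescale z in C^d onto the sphere of radius r (for constant-magnitude,
    e.g. one-hot style, features)."""
    norm = sum(abs(zj) ** 2 for zj in z) ** 0.5
    return [r * zj / norm for zj in z]

# A hypothetical 4-pixel grayscale patch with intensities in [0, 1]:
patch = [0.0, 0.25, 0.5, 1.0]
z = complexify(patch)
print([abs(zj) for zj in z])  # every coordinate lies on the unit circle
w = normalize_to_ball(z, 2.0)
print(sum(abs(c) ** 2 for c in w) ** 0.5)  # the vector has magnitude 2
```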

F ARE COMPLEX NUMBERS NECESSARY?

One might imagine that we could have described the same framework without ever mentioning the complex numbers. In this section, we discuss how complex analysis helps us in our analysis, and why we think that complex analysis has much more to offer. First, we know from various papers in the literature on the adversarial examples phenomenon that the issue occurs in most applications of machine learning. Thus, if we want to describe the phenomenon, we have to analyse it in a context that is free from the choice of the hypothesis class. In our opinion, an analysis based on learning theory has the best chance of fulfilling this requirement. However, if we want to apply learning theory, we need to describe the phenomenon in learning-theoretic terms as well. Consequently, we need to come up with definitions that conform to how learning theory defines its objects of study: the language of PAC learnability and uniform convergence. The first obstacle that we face is that real differentiable functions are not closed under uniform convergence. In other words, it is possible to find a sequence of real differentiable functions $\{f_n\}_{n=1}^{\infty}$ that converges uniformly to $f$, and yet $f$ is nowhere differentiable. Consequently, when we are dealing with the output of a learning rule $A$, we cannot assume that the output is well-behaved in its derivatives. We recommend that the reader take a look at the Weierstrass function to get a picture of how ill-behaved the derivatives can become. This is a big issue for our analysis, given that most of the definitions in the literature around adversarial examples require differentiation in one way or another. To guarantee that the pointwise limit of a sequence of differentiable functions is differentiable, the sequence needs to converge pointwise in its values and uniformly in its derivatives.
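The closure failure described above can be made concrete with the partial sums of a Weierstrass-type series: the values converge uniformly (the tail is dominated by a geometric series), yet the derivatives of the partial sums blow up, so the limit inherits no control over its derivative.

```python
import math

def weier_partial(x, n, a=0.5, b=7):
    """n-th partial sum of the Weierstrass series sum_k a^k cos(b^k pi x).
    Each partial sum is smooth; the limit is nowhere differentiable."""
    return sum(a ** k * math.cos(b ** k * math.pi * x) for k in range(n))

def deriv(f, x, h=1e-7):
    """Central finite-difference derivative."""
    return (f(x + h) - f(x - h)) / (2 * h)

x = 0.3
# Values are uniformly Cauchy: the tail is bounded by a^10 / (1 - a).
print(abs(weier_partial(x, 20) - weier_partial(x, 10)))
# Derivatives of the partial sums grow without bound as n increases:
print(abs(deriv(lambda t: weier_partial(t, 4), x)))
print(abs(deriv(lambda t: weier_partial(t, 8), x)))
```

With $a = 1/2$ and $b = 7$ we have $ab > 1$, which is the regime in which the derivative terms $(ab)^k$ diverge while the value terms $a^k$ still sum geometrically.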
According to PAC learnability, to achieve uniform convergence in derivatives, we need to train the derivatives of the hypothesis. But how would we generate a training set for the gradient of the label "dog" with respect to the pixels of an image? We cannot ask a human to generate the derivatives! On its face, this seems an impossible feat, and it seems that the adversarial examples phenomenon is out of reach of learning theory. This is where complex numbers show their true potential. It is known that sequences of complex differentiable (holomorphic) functions are closed under uniform convergence; proposition C.1 is a testament to this fact. In other words, if $\{f_n\}_{n=1}^{\infty}$ is a sequence of holomorphic functions and the sequence converges compactly to a function $f$, then $f$ is holomorphic. As we have stated in the main article, only complex-valued functions of a complex variable can be holomorphic. As a result, moving from the complex number system to the real number system would be a huge step backward for some of the proofs in this paper. Nevertheless, while it would be more involved, it is probably possible to find similar theorems for real-valued hypotheses with the help of distributional derivatives. Apart from the ease of analysis of sequences of holomorphic functions, these functions are also unique in their geometrical properties. In the main article, we freely talk about angles and lengths in the domain and the range space of a complex-valued hypothesis. We would not be able to do so if it were not for the holomorphicity of the hypothesis. The boundary tilting perspective of Tanay & Griffin (2016) is a good example of the steps that need to be taken to set up a framework for a rigorous geometrical study of the phenomenon. As demonstrated by Tanay & Griffin (2016), coming up with alternative formal definitions for the real and the imaginary axes in our proposal is not a walk in the park.
Moreover, even when Tanay & Griffin (2016) managed to do so, it proved too difficult to apply the geometrical intuition to a nonlinear hypothesis. In contrast, the geometrical interpretability of holomorphic functions enables us to circumvent these problems and to translate our geometrical intuition into formal mathematical expressions with ease. In conclusion, we argue that while it is possible to recreate our framework without ever mentioning the complex number system, doing so would require even more exotic mathematical objects and a much more involved discussion.



Figure 1: Graph of the complex-valued hypotheses on the unit disk D.

Figure 2: The output of the robust (right) and nonrobust (left) SVC learning rules for an integral representation with ν ∈ A 2 (D) and σ = H.

Figure 3: The result of performing a one-step ℓ2 normalized gradient-based black-box attack on harmonic and orthonormal Chebyshev polynomials for the UCI digits dataset. The baseline is a 64 × 40 × 30 × 20 × 10 fully-connected MLP that is trained on the same dataset.

A ROBUST TRAINING IN A NUTSHELL

The main text is concerned with the theoretical aspects of the proposed framework. While the theoretical discussion is necessary, some readers might only be interested in the implementation of the framework. Here, we describe the framework step by step from a practical perspective. The goal in our framework is to show that a learning rule $A$ satisfies the condition in definition 2.2. In other words, we want to guarantee that $h_n = A(d^n)$ converges uniformly to $h = A(\mathcal{D})$. For example, Barati et al. (2021) find a robust learning rule for polynomials by choosing a particular sequence of training points (the Chebyshev grid) that guarantees the uniform convergence of the sequence of hypotheses. However, finding optimal training sets is not a well-defined process in practice. Ilyas et al. (2019) need a robust classifier in the first place to generate an optimal training set.

C.6 THEOREM 3.5

Proof. First, assume that $A$ is a nonuniform learner with respect to $A^2(\mathcal{X})$. By the definition of nonuniform learnability, for any $\epsilon, \delta > 0$ a natural number $N$ exists such that for any training set $S$ larger than $N$, with probability $1 - \delta$ we have that ... Then every ... Thus, we can deduce that ... Equation 49 shows that $A^2(\mathcal{X})$ and $\partial A^2(\mathcal{X})$ are normal hypothesis classes with respect to nonuniform learners.

Proposition C.2 (Krantz (2010)). Let $\{f_n\}_{n=1}^{\infty}$ be a sequence of holomorphic functions on a domain $\mathcal{X} \subset \mathbb{C}^d$. Assume that the sequence converges pointwise to a limit function $f$ on $\mathcal{X}$. Then $f$ is holomorphic on a dense open subset of $\mathcal{X}$, and the convergence is uniform on compact subsets of the dense open set.

Proposition C.2 shows that when $A$ is a consistent learner for $A^2(\mathcal{X})$ and $\{A(d^n)\}_{n=1}^{\infty}$ converges pointwise to $A(\mathcal{D})$, then the convergence is uniform on some dense open $K \subset \mathcal{X}$ and we have that ... on $K$. As a result, $A^2(\mathcal{X})$ and $\partial A^2(\mathcal{X})$ are normal hypothesis classes with respect to universally consistent learners.

C.7 THEOREM 4.1

Proof. Consider a smooth path $\gamma : [0, 1] \to \mathcal{X}$. The length of the image of $\gamma$ under $h \in A^2(\mathcal{X})$ is ... Hence, when $A$ has minimized $E[h]$, it has minimized $\| h(\gamma) \|$ for all $\gamma$. Consequently, the image of all of the paths that start from a training sample $z$ stays as close to $f(z)$ as possible. It follows ...

We add a loop to every node in the lattice so that monomials are included in the set of features as well. Next, we have to compute the elements of the tuning matrix $\Sigma_{\alpha\beta}$, whose entries involve products of the form $\alpha_j \beta_j U_{\alpha_j - 1}(x_j)\, U_{\beta_j - 1}(x_j)$. We need the following formulas for computing the integrals, ...

