TRANSFORMERS ARE DEEP INFINITE-DIMENSIONAL NON-MERCER BINARY KERNEL MACHINES

Abstract

Despite their ubiquity in core AI fields like natural language processing, the mechanics of deep attention-based neural networks like the "Transformer" model are not fully understood. In this article, we present a new perspective on how Transformers work. In particular, we show that the "dot-product attention" at the core of the Transformer's operation can be characterized as a kernel learning method on a pair of Banach spaces, and that the Transformer's kernel has an infinite feature dimension. Along the way, we generalize the standard kernel learning problem to what we term a "binary" kernel learning problem, where data come from two input domains and a response is defined for every cross-domain pair. We prove a new representer theorem for these binary kernel machines with non-Mercer (indefinite, asymmetric) kernels (implying that the functions learned are elements of reproducing kernel Banach spaces rather than Hilbert spaces), and we also prove a new universal approximation theorem showing that the Transformer calculation can learn any binary non-Mercer reproducing kernel Banach space pair. We experiment with new kernels in Transformers and obtain results suggesting that the infinite dimensionality of the standard Transformer kernel is partially responsible for its performance. This paper's results provide a new theoretical understanding of a very important but poorly understood model in modern machine learning.

1. INTRODUCTION

Since its proposal by Bahdanau et al. (2015), so-called neural attention has become the backbone of many state-of-the-art deep learning models. This is particularly true in natural language processing (NLP), where the Transformer model of Vaswani et al. (2017) has become ubiquitous. This ubiquity is such that much of the last few years' NLP breakthroughs have been due to developing new training regimes for Transformers (Radford et al., 2018; Devlin et al., 2019; Yang et al., 2019; Liu et al., 2019; Wang et al., 2019a; Joshi et al., 2020; Lan et al., 2020; Brown et al., 2020, etc.). As with most modern deep neural networks, however, theoretical understanding of the Transformer has lagged behind the rate of Transformer-based performance improvements on AI tasks like NLP. Recently, several authors have noted relationships between Transformer operations and other, better-understood topics in deep learning theory, such as the similarities between attention and convolution (Ramachandran et al., 2019; Cordonnier et al., 2020) and the design of the residual blocks in multi-layer Transformers (e.g., Lu et al. (2019)). On the latter point, see also the reordering of the main learned (fully-connected or attentional) operation, elementwise nonlinearity, and normalization in the original Transformer authors' official reference codebase (Vaswani et al., 2018) and in some more recent studies of deeper Transformers (Wang et al., 2019b), which matches the "pre-norm" ordering (normalize, learned operation, nonlinearity, add residual) of modern ("v2") Resnets (He et al., 2016). In this paper, we propose a new lens for understanding the central component of the Transformer, its "dot-product attention" operation. In particular, we show that dot-product attention can be characterized as a particular class of kernel method (Schölkopf & Smola, 2002).
More specifically, it is a so-called indefinite and asymmetric kernel method; these terms refer to two separate generalizations of the classic class of kernels that drop the classic assumptions of positive (semi-) definiteness and symmetry, respectively (Ong et al., 2004; Balcan et al., 2008; Zhang et al., 2009; Wu et al., 2010; Loosli et al., 2016; Oglic & Gärtner, 2018; 2019, etc.). We in fact show in Theorem 2 below that dot-product attention can learn any asymmetric indefinite kernel. This insight has several interesting implications. Most immediately, it provides some theoretical justification for one of the more mysterious components of the Transformer model. It also potentially opens the door to applying decades of classic kernel method theory to understanding one of today's most important neural network models, perhaps similarly to how tools from digital signal processing are widely used to study convolutional neural networks. We make a first step on this last point in this paper, proposing a generalization of prior kernel methods that we call "binary" kernel machines, which learn to predict distinct values for pairs of elements across two input sets, as an attention model does. The remainder of this paper is organized as follows. Section 2 reviews the mathematical background of both Transformers and classic kernel methods. Section 3 presents the definition of kernel machines on reproducing kernel Banach spaces (RKBS's) that we use to characterize Transformers; in particular, we note that the Transformer can be described as having an infinite-dimensional feature space. Section 4 begins our theoretical results, explicitly describing the Transformer in terms of reproducing kernels, including explicit formulations of the Transformer's kernel feature maps and its relation to prior kernels.
Section 5 discusses Transformers as kernel learners, including a new representer theorem and a characterization of stochastic-gradient-descent-trained attention networks as approximate kernel learners. In Section 6, we present empirical evidence that the infinite-dimensional character of the Transformer kernel may be somewhat responsible for the model's effectiveness. Section 7 concludes and summarizes our work.

2. BACKGROUND

2.1. TRANSFORMER NEURAL NETWORK MODELS

The Transformer model (Vaswani et al., 2017) has become ubiquitous in many core AI applications like natural language processing. Here, we review its core components. Say we have two ordered sets of vectors, a set of "source" elements {s_1, s_2, ..., s_S}, s_j ∈ R^{d_s}, and a set of "target" elements {t_1, t_2, ..., t_T}, t_i ∈ R^{d_t}. In its most general form, the neural-network "attention" operation that forms the backbone of the Transformer model is to compute, for each t_i, a t_i-specific embedding of the source sequence {s_j}_{j=1}^S. The particular function used in the Transformer is the so-called "scaled dot-product" attention, which takes the form

a_{ij} = (W_Q t_i)^⊤ (W_K s_j) / √d,    α_{ij} = exp(a_{ij}) / Σ_{j'=1}^S exp(a_{ij'}),    t'_i = Σ_{j=1}^S α_{ij} W_V s_j,    (1)

where W_V, W_K ∈ R^{d×d_s} and W_Q ∈ R^{d×d_t} are learnable weight matrices, usually called the "value," "key," and "query" weight matrices, respectively. Usually, multiple so-called "attention heads" with independent parameter matrices implement several parallel computations of (1), with the Cartesian product (vector concatenation) of several d-dimensional head outputs forming the final output t'_i. The unnormalized a_{ij}'s are usually called attention scores or attention logits, and the normalized α_{ij}'s are called attention weights. In this paper, we restrict our focus to the dot-product formulation of attention shown in (1). Several other alternative forms of attention that perform roughly the same function (i.e., mapping from R^{d_s} × R^{d_t} to R) have been proposed (Bahdanau et al., 2015; Luong et al., 2015; Veličković et al., 2018; Battaglia et al., 2018, etc.), but the dot-product formulation of the Transformer is by far the most popular.
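As a concrete illustration, a single attention head implementing (1) can be sketched in a few lines of NumPy. The shapes and weights below are arbitrary stand-ins for exposition, not values from any trained model:

```python
import numpy as np

def dot_product_attention(T, S, W_Q, W_K, W_V):
    """Single-head scaled dot-product attention, Eq. (1).

    T: (n_t, d_t) target vectors t_i; S: (n_s, d_s) source vectors s_j.
    W_Q: (d, d_t); W_K, W_V: (d, d_s). Returns the embeddings t'_i, shape (n_t, d).
    """
    d = W_Q.shape[0]
    Q = T @ W_Q.T                            # queries W_Q t_i, shape (n_t, d)
    K = S @ W_K.T                            # keys    W_K s_j, shape (n_s, d)
    V = S @ W_V.T                            # values  W_V s_j, shape (n_s, d)
    A = Q @ K.T / np.sqrt(d)                 # attention logits a_ij
    A = A - A.max(axis=1, keepdims=True)     # shift for numerical stability
    alpha = np.exp(A) / np.exp(A).sum(axis=1, keepdims=True)  # softmax weights
    return alpha @ V                         # t'_i = sum_j alpha_ij (W_V s_j)

rng = np.random.default_rng(0)
T_out = dot_product_attention(rng.normal(size=(4, 6)), rng.normal(size=(5, 3)),
                              rng.normal(size=(2, 6)), rng.normal(size=(2, 3)),
                              rng.normal(size=(2, 3)))
print(T_out.shape)  # (4, 2)
```

The max-shift before the exponential does not change α_{ij} (it cancels in the ratio) and is standard practice for avoiding overflow.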

2.2. KERNEL METHODS AND GENERALIZATIONS OF KERNELS

Kernel methods (Schölkopf & Smola, 2002; Steinwart & Christmann, 2008, etc.) are a classic and powerful class of machine learning methods. The key component of kernel methods is the namesake kernel function, which allows the efficient mapping of input data from a low-dimensional data domain, where linear solutions to problems like classification or regression may not be possible, to a high- or infinite-dimensional embedding domain, where linear solutions can be found. Given two nonempty sets X and Y, a kernel function κ is a continuous function κ : X × Y → R. In the next few sections, we review the classic symmetric and positive (semi-) definite, or Mercer, kernels, then discuss more general forms.

2.2.1. SYMMETRIC AND POSITIVE SEMIDEFINITE (MERCER) KERNELS

If X = Y, and for all x_i, x_j ∈ X = Y a particular kernel κ has the properties

symmetry: κ(x_i, x_j) = κ(x_j, x_i),   (2a)
positive (semi-) definiteness: c^⊤ K c ≥ 0 for all c ∈ R^n, all {x_1, ..., x_n} ⊆ X, n ∈ N,   (2b)

where K in (2b) is the Gram matrix, defined as K_{ij} = κ(x_i, x_j), then κ is said to be a Mercer kernel. For Mercer kernels, it is well known that, among other facts: (i) we can define a Hilbert space of functions on X, denoted H_κ (called the reproducing kernel Hilbert space, or RKHS, associated with the reproducing kernel κ); (ii) H_κ has for each x a unique (continuous) element δ_x called a point evaluation functional, with the property f(x) = δ_x(f) for all f ∈ H_κ; (iii) κ has the so-called reproducing property, ⟨f, κ(x, ·)⟩_{H_κ} = f(x) for all f ∈ H_κ, where ⟨·,·⟩_{H_κ} is the inner product on H_κ; and (iv) we can define a "feature map" Φ : X → F_H, where F_H is another Hilbert space sometimes called the feature space, such that κ(x, y) = ⟨Φ(x), Φ(y)⟩_{F_H} (where ⟨·,·⟩_{F_H} is the inner product associated with F_H). This last point gives rise to the kernel trick for RKHS's. From a machine learning and optimization perspective, kernels that are symmetric and positive (semi-) definite (PSD) are desirable because those properties guarantee that empirical-risk-minimization kernel learning problems like support vector machines (SVMs), Gaussian processes, etc., are convex. Convexity gives appealing guarantees for the tractability of a learning problem and optimality of solutions.
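The two Mercer properties (2) are easy to check numerically on a finite sample: build the Gram matrix and inspect its symmetry and spectrum. A minimal sketch, using the RBF kernel as an arbitrary choice of Mercer kernel:

```python
import numpy as np

def rbf(x, y, gamma=0.5):
    # A classic Mercer kernel: symmetric and positive semidefinite.
    return np.exp(-gamma * np.sum((x - y) ** 2))

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))
K = np.array([[rbf(xi, xj) for xj in X] for xi in X])  # Gram matrix K_ij

print(np.allclose(K, K.T))          # symmetry (2a): True
eigs = np.linalg.eigvalsh(K)
print(eigs.min() >= -1e-8)          # PSD (2b), up to round-off: True
```

Checking the eigenvalues of the Gram matrix is equivalent to checking c^⊤ K c ≥ 0 for all c, since K is symmetric.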

2.2.2. LEARNING WITH NON-MERCER KERNELS

Learning methods with non-Mercer kernels, or kernels that relax the assumptions (2), have been studied for some time. One line of work (Lin & Lin, 2003; Ong et al., 2004; Chen & Ye, 2008; Luss & D'aspremont, 2008; Alabdulmohsin et al., 2015; Loosli et al., 2016; Oglic & Gärtner, 2018; 2019, etc.) has focused on learning with symmetric but indefinite kernels, i.e., kernels that do not satisfy (2b). Indefinite kernels have been identified as reproducing kernels for so-called reproducing kernel Kreȋn spaces (RKKS's) since Schwartz (1964) and Alpay (1991) . Replacing a Mercer kernel in a learning problem like an SVM with an indefinite kernel makes the optimization problem nonconvex in general (as the kernel Gram matrix K is no longer always PSD). Some early work in learning with indefinite kernels tried to ameliorate this problem by modifying the spectrum of the Gram matrix such that it again becomes PSD (e.g., Graepel et al., 1998; Roth et al., 2003; Wu et al., 2005) . More recently, Loosli et al. (2016) ; Oglic & Gärtner (2018) , among others, have proposed optimization procedures to learn in the RKKS directly. They report better performance on some learning problems when using indefinite kernels than either popular Mercer kernels or spectrally-modified indefinite kernels, suggesting that sacrificing convexity can empirically give a performance boost. This conclusion is of course reminiscent of the concurrent experience of deep neural networks, which are hard to optimize due to their high degree of non-convexity, yet give superior performance to many other methods. Another line of work has explored the application of kernel methods to learning in more general Banach spaces, i.e., reproducing kernel Banach spaces (RKBS's) (Zhang et al., 2009) . 
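A minimal numerical illustration of indefiniteness (the kernel choice here is ours, for exposition): the negative-distance kernel κ(x, y) = −‖x − y‖_2 is symmetric but not PSD. Its Gram matrix has zeros on the diagonal, hence zero trace, so whenever the matrix is nonzero it must have at least one negative eigenvalue, violating (2b):

```python
import numpy as np

# Symmetric but indefinite kernel: the Gram matrix has zero trace
# (kappa(x, x) = 0), so its eigenvalues cannot all be nonnegative.
def neg_dist(x, y):
    return -np.linalg.norm(x - y)

rng = np.random.default_rng(2)
X = rng.normal(size=(10, 3))
K = np.array([[neg_dist(xi, xj) for xj in X] for xi in X])

eigs = np.linalg.eigvalsh(K)
print(eigs.min() < 0)   # True: (2b) fails, so this kernel is non-Mercer
```

Plugging such a Gram matrix into an SVM dual makes the quadratic form indefinite, which is exactly the loss of convexity discussed above.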
Various constructions to serve as the reproducing kernel for a Banach space (replacing the inner product of an RKHS) have been proposed, including semi-inner products (Zhang et al., 2009), positive-definite bilinear forms via a Fourier transform construction (Fasshauer et al., 2015), and others (Song et al., 2013; Georgiev et al., 2014, etc.).

3. KERNEL MACHINES ON REPRODUCING KERNEL BANACH SPACES

In this work, we consider RKBS's whose kernels may be neither symmetric nor PSD. A definition of these spaces is presented next.

Definition 1 (Pair of RKBS's). Let X and Y be nonempty sets, let B_X and B_Y be Banach spaces of real-valued functions on X and Y, respectively, and let ⟨·,·⟩_{B_X×B_Y} : B_X × B_Y → R be a nondegenerate bilinear mapping. Suppose κ : X × Y → R is a function such that

κ(x, ·) ∈ B_Y for all x ∈ X;   (3a)
⟨f, κ(x, ·)⟩_{B_X×B_Y} = f(x) for all x ∈ X, f ∈ B_X;   (3b)
κ(·, y) ∈ B_X for all y ∈ Y; and   (3c)
⟨κ(·, y), g⟩_{B_X×B_Y} = g(y) for all y ∈ Y, g ∈ B_Y.   (3d)

Then B_X and B_Y are a pair of reproducing kernel Banach spaces (RKBS's) on X and Y, respectively, and κ is their reproducing kernel. Line (3a) (resp. (3c)) says that, if we take κ, a function of two variables x ∈ X and y ∈ Y, and fix x (resp. y), then we get a function of one variable, and this function must be an element of B_Y (resp. B_X). Lines (3b) and (3d) are the reproducing properties of κ. For our purposes, it will be useful to extend this definition to include a "feature map" characterization similar to the one used in some explanations of RKHS's (Schölkopf & Smola, 2002, Chapter 2).

Definition 2 (Feature maps for RKBS's (Lin et al., 2019, Theorem 2.1; Georgiev et al., 2014)). For a pair of RKBS's as defined in Definition 1, suppose that there exist mappings Φ_X : X → F_X and Φ_Y : Y → F_Y, where F_X and F_Y are Banach spaces we will call the feature spaces, and a nondegenerate bilinear mapping ⟨·,·⟩_{F_X×F_Y} : F_X × F_Y → R such that

κ(x, y) = ⟨Φ_X(x), Φ_Y(y)⟩_{F_X×F_Y} for all x ∈ X, y ∈ Y.   (4)

In this case, the spaces B_X and B_Y can be defined as (Xu & Ye, 2019; Lin et al., 2019)

B_X = { f_v : X → R : f_v(x) ≔ ⟨Φ_X(x), v⟩_{F_X×F_Y}; v ∈ F_Y, x ∈ X },   (5a)
B_Y = { g_u : Y → R : g_u(y) ≔ ⟨u, Φ_Y(y)⟩_{F_X×F_Y}; u ∈ F_X, y ∈ Y }.   (5b)

Remark 1.
We briefly discuss how to understand the spaces given by (5). Consider (5a), for example. It is a space of real-valued functions of one variable x, where each function is also parameterized by an element v. Picking a v ∈ F_Y in (5a) defines a manifold of functions in B_X; this manifold of functions with fixed v varies with the feature map Φ_X. Evaluating a function f_v in this manifold at a point x is defined by taking the bilinear product of Φ_X(x) and the chosen v. This also means that we can combine (4) and (5) to say

κ(x, y) = ⟨Φ_X(x), Φ_Y(y)⟩_{F_X×F_Y} = ⟨f_{Φ_Y(y)}, g_{Φ_X(x)}⟩_{B_X×B_Y} for all x ∈ X, y ∈ Y,   (6)

where the subscripts follow (5): the element of B_X is parameterized by a point in F_Y, and vice versa.

Remark 2. If Φ_X(x) and Φ_Y(y) can be represented as countable sets of real-valued measurable functions, {φ_{X,ℓ}(x)}_{ℓ∈N} and {φ_{Y,ℓ}(y)}_{ℓ∈N}, with φ_{X,ℓ} : X → R and φ_{Y,ℓ} : Y → R (i.e., F_X, F_Y ⊆ ∏_{ℓ∈N} R), and if ⟨u, v⟩_{F_X×F_Y} = Σ_{ℓ∈N} u_ℓ v_ℓ for u ∈ F_X, v ∈ F_Y, then the "feature map" construction, whose notation we borrow from Lin et al. (2019), corresponds to the "generalized Mercer kernels" of Xu & Ye (2019).

4. DOT-PRODUCT ATTENTION AS AN RKBS KERNEL

We now formally state the formulation of dot-product attention as an RKBS learner. As with RKHS's, for a given kernel and its associated RKBS pair, the feature maps (and also the bilinear mapping) are not unique. In the following, we present a feature map based on classic characterizations of other kernels such as RBF kernels (e.g., Steinwart et al. (2006)).

Proposition 1. The (scaled) dot-product attention calculation of (1) is a reproducing kernel for an RKBS pair in the sense of Definitions 1 and 2, with: the input sets X and Y being the vector spaces from which the target elements {t_i}_{i=1}^T, t_i ∈ R^{d_t}, and source elements {s_j}_{j=1}^S, s_j ∈ R^{d_s}, are drawn, respectively; the feature maps, indexed over degrees n and multi-indices (p_1, ..., p_d) with p_1 + p_2 + ⋯ + p_d = n,

Φ_X(t) = ( (d^{-n/4} / √(n!)) (n! / (p_1! p_2! ⋯ p_d!))^{1/2} ∏_{ℓ=1}^d (q_ℓ)^{p_ℓ} )_{n=0,1,...; p_1+⋯+p_d=n},   (7a)
Φ_Y(s) = ( (d^{-n/4} / √(n!)) (n! / (p_1! p_2! ⋯ p_d!))^{1/2} ∏_{ℓ=1}^d (k_ℓ)^{p_ℓ} )_{n=0,1,...; p_1+⋯+p_d=n},   (7b)

where q_ℓ is the ℓth element of q = W_Q t, k_ℓ is the ℓth element of k = W_K s, and W_Q ∈ R^{d×d_t}, W_K ∈ R^{d×d_s}, with d ≤ d_s, d_t and rank(W_Q) = rank(W_K) = d; the bilinear mapping ⟨Φ_X(t), Φ_Y(s)⟩_{F_X×F_Y} = Φ_X(t) · Φ_Y(s); and the Banach spaces

B_X = { f_k(t) = exp((W_Q t)^⊤ k / √d) ; k ∈ F_Y, t ∈ X },   (8a)
B_Y = { g_q(s) = exp(q^⊤ (W_K s) / √d) ; q ∈ F_X, s ∈ Y },   (8b)

with the "exponentiated query-key kernel"

κ(t, s) = ⟨Φ_X(t), Φ_Y(s)⟩_{F_X×F_Y} = ⟨f_{Φ_Y(s)}, g_{Φ_X(t)}⟩_{B_X×B_Y} = exp((W_Q t)^⊤ (W_K s) / √d)   (9)

the associated reproducing kernel. The proof of Proposition 1 is straightforward: one verifies (9) by multiplying the two infinite series in (7), then using the multinomial theorem and the Taylor expansion of the exponential. In the above, and when referring to Transformer-type models in particular rather than RKBS's in general, we use t, s, q, and k for x, y, u, and v, respectively, to draw the connection between the elements of the RKBS's and the widely used terms "target," "source," "query," and "key."
The rank requirements on W_Q and W_K mean that span({Φ_X(t) : t ∈ X}) = F_X and span({Φ_Y(s) : s ∈ Y}) = F_Y. This in turn means that the bilinear mapping is nondegenerate.

Remark 3. Now that we have an example of a pair of RKBS's, we can make more concrete some of the discussion from Remark 1. Examining (8a), for example, we see that when we select a k ∈ F_Y, we define a manifold of functions in B_X where k is fixed but W_Q can vary. Similarly, selecting a q ∈ F_X defines a manifold in B_Y. Selecting an element from both F_X and F_Y locks us into one element each from B_X and B_Y, which leads to the equality in (6).

Remark 4. Examining (8)-(9), we can see that the element drawn from F_Y that parameterizes the element of B_X, as shown in (8a), is a function of Φ_Y (and vice versa for (8b)). This reveals the exact mechanism by which the Transformer-type attention computation generalizes the RKBS's considered by Fasshauer et al. (2015), Lin et al. (2019), Xu & Ye (2019), etc., for applications like SVMs, where one of these function spaces is considered fixed.

Remark 5. Since the feature maps define the Banach spaces (5), the fact that the parameters W_Q and W_K are learned implies that Transformers learn parametric representations of the RKBS's themselves. This is in contrast to classic kernel methods, where the kernel (and thus the reproducing space) is usually fixed. In fact, in Theorem 2 below, we show that (a variant of) the Transformer architecture can approximate any RKBS mapping.

Remark 6. The symmetric version of the exponentiated dot-product kernel is known to be a reproducing kernel for the so-called Bargmann space (Bargmann, 1961), which arises in quantum mechanics.

Remark 7. Notable in Proposition 1 is that we define the kernel of dot-product attention as including the exponential part of the softmax operation.
The output of the kernel is therefore not the attention score a_{ij} but rather the unnormalized attention weight ᾱ_{ij} = exp(a_{ij}), from which the attention weights of (1) are recovered as α_{ij} = ᾱ_{ij} / Σ_{j'=1}^S ᾱ_{ij'}. Considering the exponential as a part of the kernel operation reveals that the feature spaces for the Transformer are in fact infinite-dimensional, in the same sense that the RBF kernel is said to have an infinite-dimensional feature space. In Section 6, we find empirical evidence that this infinite dimensionality may be partially responsible for the Transformer's effectiveness.

5. TRANSFORMERS AS KERNEL LEARNERS

5.1. THE BINARY RKBS LEARNING PROBLEM AND ITS REPRESENTER THEOREM

Most kernel learning problems take the form of empirical risk minimization problems. For example, given a finite dataset (x_1, z_1), ..., (x_n, z_n), x_i ∈ X, z_i ∈ R, and the goal of learning a function f : X → R in an RKHS H_κ, the learning problem might be written as

f* = arg min_{f ∈ H_κ} (1/n) Σ_{i=1}^n L(x_i, z_i, f(x_i)) + λ R(‖f‖_{H_κ}),   (10)

where L : X × R × R → R is a convex loss function, R : [0, ∞) → R is a strictly increasing regularization function, and λ is a scaling constant. Recent references that consider learning in RKBS's (Georgiev et al., 2014; Fasshauer et al., 2015; Lin et al., 2019; Xu & Ye, 2019) consider problems similar to (10), but with the RKHS H_κ replaced by an RKBS. The kernel learning problem for attention, however, differs from (10) in that, as we discussed in the previous section, we need to predict a response z_{ij} (i.e., the attention logit) for every pair (t_i, s_j). This motivates a generalization of the classic class of kernel learning problems that operates on pairs of input spaces, which we discuss now.

Definition 3 (Binary kernel learning problem: regularized empirical risk minimization). Let X and Y be nonempty sets, and B_X and B_Y RKBS's on X and Y, respectively. Let ⟨·,·⟩_{B_X×B_Y} : B_X × B_Y → R be a bilinear mapping on the two RKBS's, and let Φ_X : X → F_X and Φ_Y : Y → F_Y be fixed feature mappings with the property that ⟨Φ_X(x), Φ_Y(y)⟩_{F_X×F_Y} = ⟨f_{Φ_Y(y)}, g_{Φ_X(x)}⟩_{B_X×B_Y}. Say {x_1, ..., x_{n_x}}, x_i ∈ X; {y_1, ..., y_{n_y}}, y_j ∈ Y; and {z_{ij}}_{i=1,...,n_x; j=1,...,n_y}, z_{ij} ∈ R, form a finite dataset in which a response z_{ij} is defined for every (i, j) pair of an x_i and a y_j. Let L : X × Y × R × R → R be a loss function that is convex for fixed (x_i, y_j, z_{ij}), and let R_X : [0, ∞) → R and R_Y : [0, ∞) → R be convex, strictly increasing regularization functions.
A binary empirical risk minimization kernel learning problem for learning on a pair of RKBS's takes the form

(f*, g*) = arg min_{f ∈ B_X, g ∈ B_Y} (1/(n_x n_y)) Σ_{i,j} L(x_i, y_j, z_{ij}, ⟨f_{Φ_Y(y_j)}, g_{Φ_X(x_i)}⟩_{B_X×B_Y}) + λ_X R_X(‖f‖_{B_X}) + λ_Y R_Y(‖g‖_{B_Y}),   (11)

where λ_X and λ_Y are again scaling constants.

Remark 8. The idea of a binary kernel problem that operates over pairs of two sets is not wholly new: there is prior work in, e.g., the collaborative filtering literature (Abernethy et al., 2009).

Virtually all classic kernel learning methods find solutions whose forms are specified by so-called representer theorems. Representer theorems state that the solution to a regularized empirical risk minimization problem over a reproducing kernel space can be expressed as a linear combination of evaluations of the reproducing kernel against the dataset. Classic solutions to kernel learning problems thus reduce to finding the coefficients of this linear combination. Representer theorems exist in the literature for RKHS's (Kimeldorf & Wahba, 1971; Schölkopf et al., 2001; Argyriou et al., 2009), RKKS's (Ong et al., 2004; Oglic & Gärtner, 2018), and RKBS's (Zhang et al., 2009; Zhang & Zhang, 2012; Song et al., 2013; Fasshauer et al., 2015; Xu & Ye, 2019; Lin et al., 2019). We now state a representer theorem for the binary problem (11).

Theorem 1 (Representer theorem for binary RKBS learning). Suppose the binary kernel learning problem (11) admits a solution pair (f*, g*). Then

ι(f*) = Σ_{i=1}^{n_x} ξ_i κ(x_i, ·),    ι(g*) = Σ_{j=1}^{n_y} ζ_j κ(·, y_j),   (12)

where ι(f) (resp. ι(g)) denotes the Gâteaux derivative of the norm of f (resp. g), with the convention that ι(0) ≔ 0, and where ξ_i, ζ_j ∈ R.

Proof. See Appendix B.

5.2. TRANSFORMERS AS APPROXIMATE KERNEL LEARNERS: A UNIVERSAL APPROXIMATION THEOREM

The downside of finding solutions to kernel learning problems like (10) or (11) of the form (12) suggested by representer theorems is that they scale poorly to large datasets. It is well known that for an RKHS learning problem, finding the scalar coefficients by which to multiply the kernel evaluations takes time cubic in the size of the dataset, and querying the model takes linear time. The most popular class of approximation techniques is based on the so-called Nyström method, which constructs a low-rank approximation of the kernel Gram matrix and solves the problem generated by this approximation (Williams & Seeger, 2001). A recent line of work (Gisbrecht & Schleif, 2015; Schleif & Tino, 2017; Oglic & Gärtner, 2019) has extended the Nyström method to RKKS learning. In this section, we characterize the Transformer learning problem as a new class of approximate kernel method, a "distillation" approach, one might call it. We formally state this idea now.

Proposition 2 (Parametric approximate solutions of binary kernel learning problems). Consider the setup of a binary kernel learning problem from Definition 3. We want to find approximations to the solution pair (f*, g*). In particular, we will say we want an approximation κ̂ : X × Y → R such that

κ̂(x, y) ≈ ⟨f*_{Φ_Y(y)}, g*_{Φ_X(x)}⟩_{B_X×B_Y} for all x ∈ X and y ∈ Y.   (13)

Comparing (13) to (6) suggests a solution: learn a function κ̂ that approximates κ. In particular, (6) suggests learning explicit approximations of the feature maps, i.e., κ̂(x, y) ≈ ⟨Φ̂_X(x), Φ̂_Y(y)⟩_{F_X×F_Y}. In fact, it turns out that the Transformer query-key mapping (1) does exactly this. That is, while the Transformer kernel calculation outlined in Proposition 1 is finite-dimensional, it can in fact approximate the potentially infinite-dimensional optimal solution (f*, g*) characterized in Theorem 1. This fact is proved next.

Theorem 2. Let X ⊂ R^{d_t} and Y ⊂ R^{d_s} be compact; t ∈ X, s ∈ Y; and let q_ℓ : X → R and k_ℓ : Y → R for ℓ = 1, ..., d be two-layer neural networks with m hidden units. Then, for any continuous function F : X × Y → R and ε > 0, there are integers m, d > 0 such that

| F(t, s) − Σ_{ℓ=1}^d q_ℓ(t) k_ℓ(s) | < ε for all t ∈ X, s ∈ Y.   (14)

Proof. See Appendix C.

We now outline how Theorem 2 relates to Transformers. If we concatenate the outputs of the two-layer neural networks {q_ℓ}_{ℓ=1}^d and {k_ℓ}_{ℓ=1}^d into d-dimensional vector-valued maps q : R^{d_t} → R^d and k : R^{d_s} → R^d, then the dot product q(t)^⊤ k(s) denoted by the sum in (14) can approximate any real-valued continuous function on X × Y. Minus the usual caveats in applications of universal approximation theorems (i.e., in practice the output elements share hidden units rather than having independent ones), this dot product is exactly the computation of the attention logits a_{ij}; i.e., F(t, s) ≈ log κ(t, s) for the F in (14) and the κ in (9), up to the scaling constant √d. Since the exponential mapping between the attention logits and the exponentiated query-key kernel used in Transformers is one-to-one, if we take F(t, s) = log⟨f*_{Φ_Y(s)}, g*_{Φ_X(t)}⟩_{B_X×B_Y}, then we can use a Transformer's dot-product attention to approximate the optimal solution to any binary RKBS learning problem arbitrarily well. The core idea of an attention-based deep neural network is then to learn parametric representations of q and k via stochastic gradient descent. Unlike traditional representer-theorem-based learned functions, the training time of attention-based kernel machines like deep Transformers (generally, though with no guarantees) scales sub-cubically with dataset size, and evaluation time stays constant regardless of dataset size.

In Section 6, we study modifications of the Transformer with several kernels used in classic kernel machines. We train on two standard machine translation datasets and two standard sentiment classification tasks.
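The separable structure Σ_ℓ q_ℓ(t) k_ℓ(s) in (14) can be illustrated without training any networks: sampling F on a grid and truncating its SVD yields a rank-d "query-key" factorization whose error shrinks as d grows. This is only a surrogate for the construction in Theorem 2, with SVD factors standing in for the two-layer networks q_ℓ, k_ℓ:

```python
import numpy as np

# F(t, s) sampled on a grid; the example function is an arbitrary smooth choice.
t = np.linspace(0, 1, 200)
s = np.linspace(0, 1, 200)
F = np.exp(np.sin(2 * np.pi * t)[:, None] * np.cos(2 * np.pi * s)[None, :])

U, sig, Vt = np.linalg.svd(F)

def sep_approx(d):
    # Rank-d separable approximation: q_l(t_i) = U[i, l] * sig[l], k_l(s_j) = Vt[l, j].
    return (U[:, :d] * sig[:d]) @ Vt[:d, :]

errs = [np.abs(F - sep_approx(d)).max() for d in (1, 2, 4, 8)]
print(errs[0] > errs[-1])  # True: error shrinks as the "head dimension" d grows
```

In a Transformer, the factors are of course produced by the learned query and key networks rather than by factorizing a precomputed grid, but the rank-vs-accuracy tradeoff in d is the same phenomenon.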

6. IS THE EXPONENTIATED DOT PRODUCT ESSENTIAL TO TRANSFORMERS?

For machine translation, IWSLT14 DE-EN is a relatively small dataset, while WMT14 EN-FR is a considerably larger one. For sentiment classification, we consider SST-2 and SST-5. We retain the standard asymmetric query and key feature mappings, i.e., q = W_Q t and k = W_K s, and only modify the kernel κ : R^d × R^d → R_{≥0}. In the below, τ > 0 and γ ∈ R are per-head learned scalars. Our kernels of interest are:

1. the (scaled) exponentiated dot product (EDP), κ(q, k) = exp(q^⊤ k / √d), i.e., the standard Transformer kernel;
2. the radial basis function (RBF) kernel, κ(q, k) = exp(−(τ/√d) ‖q − k‖_2^2), where ‖·‖_2 is the standard 2-norm. It is well known that the RBF kernel is a normalized version of the exponentiated dot product, with the normalization making it translation-invariant;
3. the vanilla L2 distance, κ(q, k) = (τ/√d) ‖q − k‖_2;
4. an exponentiated version of the intersection kernel, κ(q, k) = exp(Σ_{ℓ=1}^d min(q_ℓ, k_ℓ)). The symmetric version of the intersection kernel was popular in kernel machines for computer vision applications (Barla et al., 2003; Grauman & Darrell, 2005; Maji et al., 2008, etc.), and is usually characterized as having an associated RKHS that is a subspace of the function space L^2 (i.e., it is infinite-dimensional in the sense of having a feature space of continuous functions, as opposed to the infinite series of the EDP and RBF kernels);
5. a quadratic polynomial kernel, κ(q, k) = ((1/√d) q^⊤ k + γ)^2.

Full implementation details are provided in Appendix D. Results for machine translation are presented in Table 1. Several results stand out. First, the exponentiated dot product, RBF, and exponentiated intersection kernels, which are said to have infinite-dimensional feature spaces, indeed do perform better than kernels with lower-dimensional feature maps such as the quadratic kernel.
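For concreteness, the five kernels above, together with the generalized attention computation that normalizes kernel values in place of the softmax, can be sketched as follows (τ and γ are fixed to illustrative values here rather than learned per head, and the input vectors are arbitrary):

```python
import numpy as np

# Per-head kernels from Section 6, as functions of projected queries/keys.
def edp(q, k, d):                  return np.exp(q @ k / np.sqrt(d))
def rbf(q, k, d, tau=1.0):         return np.exp(-tau / np.sqrt(d) * np.sum((q - k) ** 2))
def l2(q, k, d, tau=1.0):          return tau / np.sqrt(d) * np.linalg.norm(q - k)
def intersection(q, k, d):         return np.exp(np.minimum(q, k).sum())
def quadratic(q, k, d, gamma=0.5): return (q @ k / np.sqrt(d) + gamma) ** 2

def kernel_attention(kappa, Q, K_, V):
    """Attention with the softmax generalized to kernel-weight normalization."""
    d = Q.shape[1]
    W = np.array([[kappa(q, k, d) for k in K_] for q in Q])  # kernel values
    W = W / W.sum(axis=1, keepdims=True)                     # normalize like alpha_ij
    return W @ V

rng = np.random.default_rng(5)
Q, K_, V = rng.normal(size=(4, 8)), rng.normal(size=(6, 8)), rng.normal(size=(6, 8))
outs = {f.__name__: kernel_attention(f, Q, K_, V)
        for f in (edp, rbf, l2, intersection, quadratic)}
print(sorted(outs))  # ['edp', 'intersection', 'l2', 'quadratic', 'rbf']
```

With the EDP kernel, normalizing the kernel values is exactly the softmax over the attention logits, so `kernel_attention(edp, ...)` reduces to the standard dot-product attention of (1).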
In fact, the RBF and EDP kernels perform about the same, suggesting that a deep Transformer may not need the translation invariance that makes the RBF kernel preferred to the EDP in classic kernel machines. Intriguingly, the (unorthodox) exponentiated intersection kernel performs about the same as the EDP and RBF kernels on IWSLT14 DE-EN, but slightly worse on WMT14 EN-FR. As mentioned, the EDP and RBF kernels have feature spaces of infinite series, while the intersection kernel corresponds to a feature space of continuous functions. On both datasets, the quadratic kernel performs slightly worse than the best infinite-dimensional kernel, while the L2 distance performs significantly worse. Results for sentiment classification appear in Table 2. Unlike in the machine translation experiments, the infinite-dimensional kernels do not appear strictly superior to the finite-dimensional ones on this task. In fact, the apparent loser here is the exponentiated intersection kernel, while the L2 distance, which performed the worst on machine translation, is within a standard deviation of the top-performing kernel. Notably, however, the variance of test accuracies on sentiment classification means that it is impossible to select a statistically significant "best" kernel on this task. It is possible that the small inter-kernel variation relates to the relative simplicity of this problem (and the relative smallness of the dataset) vs. machine translation: perhaps an infinite-dimensional feature space is not needed to obtain Transformer-level performance on this learning problem. It is worth noting that the exponentiated dot-product kernel (again, the standard Transformer kernel) is a consistent high performer. This may be experimental evidence for the practical usefulness of the universal approximation property it enjoys (cf. Theorem 2).
The relatively small yet statistically significant performance differences between kernels are reminiscent of the same phenomenon with activation functions (ReLU, ELU, etc.) for neural nets. Moreover, the wide inter-kernel differences in performance for machine translation, compared against the much smaller performance differences on the SST sentiment analysis tasks, demonstrate an opportunity for future study of this apparent task- and dataset-dependency. As a whole, these results suggest that kernel choice may be an additional design parameter for Transformer networks.

7. CONCLUSION

In this paper, we drew connections between classic kernel methods and state-of-the-art Transformer networks. Beyond the theoretical interest in developing new RKBS representer theorems and other kernel theory, we gained new insight into what may make Transformers work. Our experimental results suggest that the infinite dimensionality of the Transformer kernel makes it a good choice in applications, similar to how the RBF kernel is the standard choice for, e.g., SVMs. Our work also reveals new avenues for Transformer research. For example, our experimental results suggest that the choice of Transformer kernel is a design decision similar to the choice of activation function in neural net design. Among the new open research questions are (1) whether the exponentiated dot product should always be preferred, or whether different kernels are better for different tasks (cf. how GELUs have recently become very popular as replacements for ReLUs in Transformers), (2) any relation between vector-valued kernels used for structured prediction (Álvarez et al., 2012) and, e.g., multiple attention heads, and (3) the extension of Transformer-type deep kernel learners to non-Euclidean data (using, e.g., graph kernels or kernels on manifolds).

A DEEP NEURAL NETWORKS LEAD TO BANACH SPACE ANALYSIS

Examining the kernel learning problem (11), it may not be immediately clear why the reproducing spaces on X and Y need be Banach spaces rather than Hilbert spaces. Suppose, for example, that we have two RKHS's H_X and H_Y on X and Y, respectively. Then we can take their tensor product H_X ⊗ H_Y as an RKHS on X × Y, with associated reproducing kernel κ_{X×Y}(x_1 ⊗ y_1, x_2 ⊗ y_2) = κ_X(x_1, x_2) κ_Y(y_1, y_2), where κ_X and κ_Y are the reproducing kernels of H_X and H_Y, respectively, and x_1 ⊗ y_1, x_2 ⊗ y_2 ∈ X ⊗ Y. The solutions to a regularized kernel problem like (11) would then be drawn from H_X and H_Y. This setup is similar to those studied in, e.g., Abernethy et al. (2009) and He et al. (2017). In a shallow kernel learner like an SVM, the function in the RKHS can be characterized via its norm. Representer theorems allow the norm of the function in the Hilbert space to be calculated from the scalar coefficients that make up the solution. On the other hand, for a Transformer layer in a multilayer neural network, regularization is usually not done via a norm penalty as shown in (11). In most applications, regularization is done via dropout on the attention weights α_{ij}, as well as via the implicit regularization obtained from subsampling the dataset during iterations of stochastic gradient descent. While dropout has been characterized as a form of weight decay (i.e., a variant of p-norm penalization) for linear models (Baldi & Sadowski, 2013; Wager et al., 2013, etc.), recent work has shown that dropout induces a more complex regularization effect in deep networks (Helmbold & Long, 2017; Arora et al., 2020, etc.). Thus, it is difficult to characterize the norm of the vector spaces we are traversing when solving the general problem (11) in the context of a deep network. This can lead to ambiguity as to whether the norm being regularized as we traverse the solution space is a Hilbert space norm.
If f and g are infinite series or L^p functions, for example, their resident space is a Hilbert space only if the associated norm is the ℓ^2 or L^2 norm. This motivates the generalization of kernel learning theorems to the general Banach space setting in the context of deep neural networks.
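The tensor-product construction above can be checked numerically in the Mercer (Hilbert) case: by the Schur product theorem, the elementwise product of two PSD Gram matrices is again PSD, so κ_X · κ_Y is a valid reproducing kernel on the product domain. A minimal sketch with illustrative kernel choices:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 3))   # six sample points from the domain X
Y = rng.normal(size=(6, 3))   # six sample points from the domain Y

# Two Mercer kernels (illustrative choices): linear on X, RBF on Y.
K_X = X @ X.T
D = ((Y[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
K_Y = np.exp(-D / 2)

# Product kernel on paired points (x_i, y_i): the Hadamard (elementwise) product
# of the two Gram matrices, matching kappa_{XxY} = kappa_X * kappa_Y above.
K_XY = K_X * K_Y

# Schur product theorem: the Hadamard product of PSD matrices is PSD.
assert np.linalg.eigvalsh(K_XY).min() > -1e-9
```

The Banach-space setting of the paper is needed precisely because, outside this symmetric PSD case, no such Gram-matrix positivity is available.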

B PROOF OF THEOREM 1

B.1 PRELIMINARIES

To prove this theorem, we first need some results and definitions regarding various properties of Banach spaces (Megginson, 1998). These preliminaries draw from Xu & Ye (2019) and Lin et al. (2019). Two metric spaces (M, d_M) and (N, d_N) are said to be isometrically isomorphic if there exists a bijective mapping T : M → N, called an isometric isomorphism, such that d_N(T(m_1), T(m_2)) = d_M(m_1, m_2) for all m_1, m_2 ∈ M (Megginson, 1998, Definition 1.4.13). The dual space of a vector space V over a field F, which we will denote V*, is the space of all continuous linear functionals on V, i.e.,

V* = {g : V → F, g linear and continuous}. (B.1)

A normed vector space V is reflexive if it is isometrically isomorphic to V**, the dual space of its dual space (a.k.a. its double dual). For a normed vector space V, the dual bilinear product, which we will denote ⟨•, •⟩_V (i.e., with only one subscript, as opposed to, e.g., ⟨•, •⟩_{B_X×B_Y}), is defined on V and V* as ⟨f, g⟩_V := g(f) for f ∈ V, g ∈ V*. Given a normed vector space V and its dual space V*, let U ⊆ V and W ⊆ V*. The annihilator of U in V* and the annihilator of W in V, denoted U^⊥ and ⊥W respectively, are (Megginson, 1998, Definition 1.10.14)

U^⊥ = {g ∈ V* : ⟨f, g⟩_V = 0 ∀f ∈ U},    ⊥W = {f ∈ V : ⟨f, g⟩_V = 0 ∀g ∈ W}.

B.2 MINIMUM-NORM INTERPOLATION (OPTIMAL RECOVERY)

Following Fasshauer et al. (2015); Xu & Ye (2019); Lin et al. (2019), we first prove a representer theorem for a simpler problem -- that of perfect interpolation while minimizing the norm of the solution -- before proceeding to the representer theorem for the empirical risk minimization problem (11) in the next section.

Definition 4 (Minimum-norm interpolation in a pair of RKBS's). Let X and Y be nonempty sets, and B_X and B_Y RKBS's on X and Y, respectively. Let ⟨•, •⟩_{B_X×B_Y} : B_X × B_Y → R be a bilinear mapping on the two RKBS's, with feature maps Φ_X : X → F_X and Φ_Y : Y → F_Y. Let κ(x, y) = ⟨Φ_X(x), Φ_Y(y)⟩_F = ⟨f_{Φ_Y(y)}, g_{Φ_X(x)}⟩_{B_X×B_Y} be a reproducing kernel of X and Y satisfying Definitions 1 and 2. Say {x_1, ..., x_{n_x}}, x_i ∈ X, {y_1, ..., y_{n_y}}, y_j ∈ Y, and {z_ij}_{i=1,...,n_x; j=1,...,n_y}, z_ij ∈ R, is a finite dataset where a response z_ij is defined for every (i, j) pair of an x_i and a y_j. The minimum-norm interpolation problem is

(f*, g*) = arg min_{f∈B_X, g∈B_Y} ‖f‖_{B_X} + ‖g‖_{B_Y} such that (f, g) ∈ N_{X,Y,Z} (B.4)

where

N_{X,Y,Z} = {(f, g) ∈ B_X ⊕ B_Y s.t. ⟨f_{Φ_Y(y_j)}, g_{Φ_X(x_i)}⟩_{B_X×B_Y} = z_ij ∀ i, j}. (B.5)

To discuss the solution of (B.4), we first need to establish the condition for the existence of a solution. The following is a generalization of a result from Section 2.6 of Xu & Ye (2019).

Lemma 2. If the set {κ(x_i, •)}_{i=1}^{n_x} is linearly independent in B_Y, the set {κ(•, y_j)}_{j=1}^{n_y} is linearly independent in B_X, and the bilinear mapping ⟨•, •⟩_{B_X×B_Y} : B_X × B_Y → R is nondegenerate, then N_{X,Y,Z} (B.5) is nonempty.

Proof. From the definition of κ (3) and the bilinearity of ⟨•, •⟩_{B_X×B_Y}, we can write that

⟨f, Σ_{i=1}^{n_x} c_i κ(x_i, •)⟩_{B_X×B_Y} = Σ_{i=1}^{n_x} c_i ⟨f, κ(x_i, •)⟩_{B_X×B_Y} = Σ_{i=1}^{n_x} c_i f(x_i) for all f ∈ B_X, c_i ∈ R,

and that

⟨Σ_{j=1}^{n_y} c_j κ(•, y_j), g⟩_{B_X×B_Y} = Σ_{j=1}^{n_y} c_j ⟨κ(•, y_j), g⟩_{B_X×B_Y} = Σ_{j=1}^{n_y} c_j g(y_j) for all g ∈ B_Y, c_j ∈ R.
This means that Σ_{i=1}^{n_x} c_i κ(x_i, •) = 0 if and only if Σ_{i=1}^{n_x} c_i f(x_i) = 0 for all f ∈ B_X, and Σ_{j=1}^{n_y} c_j κ(•, y_j) = 0 if and only if Σ_{j=1}^{n_y} c_j g(y_j) = 0 for all g ∈ B_Y. This shows that linear independence of {κ(x_i, •)}_{i=1}^{n_x} and {κ(•, y_j)}_{j=1}^{n_y} implies linear independence of {f(x_i)}_{i=1}^{n_x} and {g(y_j)}_{j=1}^{n_y}, respectively. Then, considering the nondegeneracy of the bilinear mapping ⟨•, •⟩_{B_X×B_Y}, we can say that

⟨Σ_{j=1}^{n_y} c_j κ(•, y_j), Σ_{i=1}^{n_x} c_i κ(x_i, •)⟩_{B_X×B_Y} = 0

if and only if

(Σ_{i=1}^{n_x} c_i f(x_i) = 0 for all f ∈ B_X) or (Σ_{j=1}^{n_y} c_j g(y_j) = 0 for all g ∈ B_Y).

From this, we can see that linear independence of {κ(x_i, •)}_{i=1}^{n_x} and {κ(•, y_j)}_{j=1}^{n_y} and the nondegeneracy of ⟨•, •⟩_{B_X×B_Y} ensure the existence of at least one (f*, g*) pair in N_{X,Y,Z}.

Now we can prove a lemma characterizing the solution to (B.4).

Lemma 3. Consider the minimum-norm interpolation problem from Definition 4. Assume that B_X and B_Y are smooth, strictly convex, and reflexive,foot_1 and that {κ(x_i, •)}_{i=1}^{n_x} and {κ(•, y_j)}_{j=1}^{n_y} are linearly independent. Then, (B.4) has a unique solution pair (f*, g*), with the property that

ι(f*) = Σ_{i=1}^{n_x} ξ_i κ(x_i, •),    ι(g*) = Σ_{j=1}^{n_y} ζ_j κ(•, y_j),

where ι(•) is the regularized Gâteaux derivative as defined in (B.2), and where ξ_i, ζ_j ∈ R.

Proof. The existence of a solution pair is given by the linear independence of {κ(x_i, •)}_{i=1}^{n_x} and {κ(•, y_j)}_{j=1}^{n_y} and Lemma 2. Uniqueness of the solution will be shown by showing that N_{X,Y,Z} is closed and convex and a subset of a strictly convex and reflexive space, which ensures it is a Chebyshev set. Since B_X and B_Y are strictly convex and reflexive, their direct sum B_X ⊕ B_Y is strictly convex and reflexive, as we noted in Section B.1. Now we analyze N_{X,Y,Z}. We first show convexity. Pick any (f, g), (f', g') ∈ N_{X,Y,Z} and t ∈ (0, 1).
Then note that for any (x_i, y_j, z_ij),

⟨tf_{Φ_Y(y_j)}, tg_{Φ_X(x_i)}⟩_{B_X×B_Y} + ⟨(1-t)f'_{Φ_Y(y_j)}, (1-t)g'_{Φ_X(x_i)}⟩_{B_X×B_Y} = t⟨f_{Φ_Y(y_j)}, g_{Φ_X(x_i)}⟩_{B_X×B_Y} + (1-t)⟨f'_{Φ_Y(y_j)}, g'_{Φ_X(x_i)}⟩_{B_X×B_Y} = t z_ij + (1-t) z_ij = z_ij,

thus showing that N_{X,Y,Z} is convex. Closedness may be shown by the strict convexity of its superset B_X ⊕ B_Y and the continuity of ⟨•, •⟩_{B_X×B_Y}. Thus, the closed and convex N_{X,Y,Z} ⊆ B_X ⊕ B_Y is a Chebyshev set, implying a unique (f*, g*) ∈ N_{X,Y,Z} with

‖(f*, g*)‖_{B_X⊕B_Y} = min_{(f,g)∈N_{X,Y,Z}} ‖f‖_{B_X} + ‖g‖_{B_Y}.

Now we characterize this solution (f*, g*). Similar to proofs of the classic RKHS representer theorem (Schölkopf et al., 2001) and those of earlier RKBS representer theorems (Xu & Ye, 2019; Lin et al., 2019, etc.), we approach this via orthogonal decomposition. Consider the following set of function pairs (f, g) that map all data pairs (x_i, y_j) to 0:

N_{X,Y,0} = {(f, g) ∈ B_X ⊕ B_Y : ⟨f_{Φ_Y(y_j)}, g_{Φ_X(x_i)}⟩_{B_X×B_Y} = 0; i = 1, ..., n_x; j = 1, ..., n_y}.
We can see that N_{X,Y,0} is closed under addition and scalar multiplication, making it a subspace of B_X ⊕ B_Y. Taking our optimal (f*, g*), we can see that

‖(f*, g*) + (f_0, g_0)‖_{B_X⊕B_Y} ≥ ‖(f*, g*)‖_{B_X⊕B_Y} for any (f_0, g_0) ∈ N_{X,Y,0}, (B.7)

thus showing that (f*, g*) is orthogonal to the subspace N_{X,Y,0}. Consider the left and right preimages of N_{X,Y,0} under ⟨•, •⟩_{B_X×B_Y}:

⟨•, g⟩^{-1}_{B_X×B_Y}[N_{X,Y,0}] = {f ∈ B_X : ⟨f_{Φ_Y(y_j)}, g_{Φ_X(x_i)}⟩_{B_X×B_Y} = 0; i = 1, ..., n_x; j = 1, ..., n_y}, g ∈ B_Y,
⟨f, •⟩^{-1}_{B_X×B_Y}[N_{X,Y,0}] = {g ∈ B_Y : ⟨f_{Φ_Y(y_j)}, g_{Φ_X(x_i)}⟩_{B_X×B_Y} = 0; i = 1, ..., n_x; j = 1, ..., n_y}, f ∈ B_X.

Since ⟨•, g⟩^{-1}_{B_X×B_Y}[N_{X,Y,0}] ⊆ B_X and ⟨f, •⟩^{-1}_{B_X×B_Y}[N_{X,Y,0}] ⊆ B_Y, we can consider them as normed vector spaces with norms ‖•‖_{B_X} and ‖•‖_{B_Y}, respectively. From (B.7) and the definition of the direct sum norm (B.3),

‖f* + f_0‖_{B_X} ≥ ‖f*‖_{B_X} for all f_0 ∈ ⟨•, g⟩^{-1}_{B_X×B_Y}[N_{X,Y,0}], for arbitrary g ∈ B_Y, (B.8a)
‖g* + g_0‖_{B_Y} ≥ ‖g*‖_{B_Y} for all g_0 ∈ ⟨f, •⟩^{-1}_{B_X×B_Y}[N_{X,Y,0}], for arbitrary f ∈ B_X. (B.8b)

We can then use (B.8) and Lemma 1 to say

⟨f, ι(f*)⟩_{B_X} = 0 for all f ∈ ⟨•, g⟩^{-1}_{B_X×B_Y}[N_{X,Y,0}], for arbitrary g ∈ B_Y,
⟨g, ι(g*)⟩_{B_Y} = 0 for all g ∈ ⟨f, •⟩^{-1}_{B_X×B_Y}[N_{X,Y,0}], for arbitrary f ∈ B_X,

which means

ι(f*) ∈ (⟨•, g⟩^{-1}_{B_X×B_Y}[N_{X,Y,0}])^⊥ for all g ∈ B_Y, (B.9a)
ι(g*) ∈ (⟨f, •⟩^{-1}_{B_X×B_Y}[N_{X,Y,0}])^⊥ for all f ∈ B_X. (B.9b)

From (B.9a) and (3a)-(3b),

f* ∈ ∩_{g∈B_Y} ⟨•, g⟩^{-1}_{B_X×B_Y}[N_{X,Y,0}]
= {f ∈ B_X : ⟨f_{Φ_Y(y_j)}, g_{Φ_X(x_i)}⟩_{B_X×B_Y} = 0; g ∈ B_Y; i = 1, ..., n_x; j = 1, ..., n_y}
= {f ∈ B_X : ⟨f_{Φ_Y(y_j)}, h⟩_{B_X×B_Y} = 0, h ∈ span{κ(x_i, •); i = 1, ..., n_x}; j = 1, ..., n_y}
= ⊥span{κ(x_i, •); i = 1, ..., n_x}. (B.10)

And from (B.9b) and (3c)-(3d),

g* ∈ ∩_{f∈B_X} ⟨f, •⟩^{-1}_{B_X×B_Y}[N_{X,Y,0}]
= {g ∈ B_Y : ⟨f_{Φ_Y(y_j)}, g_{Φ_X(x_i)}⟩_{B_X×B_Y} = 0; f ∈ B_X; i = 1, ..., n_x; j = 1, ...
, n_y}
= {g ∈ B_Y : ⟨h', g_{Φ_X(x_i)}⟩_{B_X×B_Y} = 0, h' ∈ span{κ(•, y_j); j = 1, ..., n_y}; i = 1, ..., n_x}
= ⊥span{κ(•, y_j); j = 1, ..., n_y}. (B.11)

Combining (B.9a) and (B.10), ι(f*) ∈ (⊥span{κ(x_i, •); i = 1, ..., n_x})^⊥ = span{κ(x_i, •); i = 1, ..., n_x}, and similarly, combining (B.9b) and (B.11), ι(g*) ∈ span{κ(•, y_j); j = 1, ..., n_y}, completing the proof.

inf_{v∈S} f(v) has at least one solution. Xu & Ye (2019) and Lin et al. (2019) also reference Ekeland & Turnbull (1983) as a source for Lemma 4. We now restate Theorem 1 with the conditions on B_X and B_Y filled in.

Theorem 1, Revisited. Suppose we have a kernel learning problem of the form in (11). Let κ : X × Y → R, κ(x, y) = ⟨Φ_X(x), Φ_Y(y)⟩_{F_X×F_Y} = ⟨f_{Φ_Y(y)}, g_{Φ_X(x)}⟩_{B_X×B_Y}, be a reproducing kernel satisfying Definitions 1 and 2. Assume that {κ(x_i, •)}_{i=1}^{n_x} is linearly independent in B_Y and that {κ(•, y_j)}_{j=1}^{n_y} is linearly independent in B_X. Assume also that B_X and B_Y are reflexive, strictly convex, and smooth. Then, the regularized empirical risk minimization problem (11) has a unique solution pair (f*, g*), with the property that

ι(f*) = Σ_{i=1}^{n_x} ξ_i κ(x_i, •),    ι(g*) = Σ_{j=1}^{n_y} ζ_j κ(•, y_j),

where ξ_i, ζ_j ∈ R.

Proof. As before, we begin by proving existence and uniqueness of the solution pair (f*, g*). We first prove uniqueness using some basic facts about convexity. Assume that there exist two distinct minimizers (f*_1, g*_1), (f*_2, g*_2) ∈ B_X ⊕ B_Y. Define (f*_3, g*_3) = ½[(f*_1, g*_1) + (f*_2, g*_2)]. Then, since B_X and B_Y are strictly convex, we have

‖f*_3‖_{B_X} = ½‖f*_1 + f*_2‖_{B_X} < ½‖f*_1‖_{B_X} + ½‖f*_2‖_{B_X}
‖g*_3‖_{B_Y} = ½‖g*_1 + g*_2‖_{B_Y} < ½‖g*_1‖_{B_Y} + ½‖g*_2‖_{B_Y}

and since R_X and R_Y are convex and strictly increasing,

R_X(‖f*_3‖_{B_X}) = R_X(½‖f*_1 + f*_2‖_{B_X}) < R_X(½‖f*_1‖_{B_X} + ½‖f*_2‖_{B_X}) ≤ ½R_X(‖f*_1‖_{B_X}) + ½R_X(‖f*_2‖_{B_X})

and

R_Y(‖g*_3‖_{B_Y}) = R_Y(½‖g*_1 + g*_2‖_{B_Y}) < R_Y(½‖g*_1‖_{B_Y} + ½‖g*_2‖_{B_Y}) ≤ ½R_Y(‖g*_1‖_{B_Y}) + ½R_Y(‖g*_2‖_{B_Y}).
Consider the regularized empirical risk minimization cost function (11),

T(f, g) = L(f, g) + λ_X R_X(‖f‖_{B_X}) + λ_Y R_Y(‖g‖_{B_Y}),

where we use the shorthand

L(f, g) = (1 / (n_x n_y)) Σ_{i,j} L(x_i, y_j, z_ij, ⟨f_{Φ_Y(y_j)}, g_{Φ_X(x_i)}⟩_{B_X×B_Y}).

We have that R_X(‖•‖_{B_X}) and R_Y(‖•‖_{B_Y}) are both convex via identities about composition of convex functions. The function L(f, g) is also convex since all the functions in the summand are convex in f and g. Then, since we have assumed that T(f*_1, g*_1) = T(f*_2, g*_2), by plugging in some of the above inequalities we can write

T(f*_3, g*_3) = T(½[(f*_1, g*_1) + (f*_2, g*_2)])
= L(½[(f*_1, g*_1) + (f*_2, g*_2)]) + λ_X R_X(½‖f*_1 + f*_2‖_{B_X}) + λ_Y R_Y(½‖g*_1 + g*_2‖_{B_Y})
< ½L(f*_1, g*_1) + ½L(f*_2, g*_2) + λ_X[½R_X(‖f*_1‖_{B_X}) + ½R_X(‖f*_2‖_{B_X})] + λ_Y[½R_Y(‖g*_1‖_{B_Y}) + ½R_Y(‖g*_2‖_{B_Y})]
= ½T(f*_1, g*_1) + ½T(f*_2, g*_2) = T(f*_1, g*_1),

contradicting that (f*_1, g*_1) is a minimizer, and thus showing uniqueness of the solution. We now prove existence via Lemma 4. We already know that T(•) is convex. From the bilinearity of ⟨•, •⟩_{B_X×B_Y} and the convexity of L, L is continuous in f and g. Since the regularization functions R_X and R_Y are convex and strictly increasing, it follows that the functions R_X(‖f‖_{B_X}) and R_Y(‖g‖_{B_Y}) are continuous in f and g, respectively. Thus, T(f, g) is continuous. Consider the set E = {(f, g) ∈ B_X ⊕ B_Y : T(f, g) ≤ T(0, 0)}. The set E is nonempty (it contains at least (0, 0)), and we can see that

‖(f, g)‖_{B_X⊕B_Y} = ‖f‖_{B_X} + ‖g‖_{B_Y} ≤ R_X^{-1}(T(f, 0)) + R_Y^{-1}(T(0, g)),

showing that E is bounded. So, by Lemma 4, we are guaranteed the existence of an optimal solution (f*, g*). Pick any (f, g) ∈ B_X × B_Y and consider the set

D_{f,g} = {(x_i, y_j, ⟨f_{Φ_Y(y_j)}, g_{Φ_X(x_i)}⟩_{B_X×B_Y}) : i = 1, ..., n_x; j = 1, ..., n_y},

i.e., the set of pairs of points (x_i, y_j) along with the value that the function pair (f, g) maps to via the bilinear form at the pair of points (x_i, y_j). From Lemma 3, there exists an element (f', g') ∈ B_X × B_Y such that (f', g') interpolates D_{f,g} perfectly, i.e.,

⟨f'_{Φ_Y(y_j)}, g'_{Φ_X(x_i)}⟩_{B_X×B_Y} = ⟨f_{Φ_Y(y_j)}, g_{Φ_X(x_i)}⟩_{B_X×B_Y}; i = 1, ..., n_x; j = 1, ..., n_y,

whose Gâteaux derivatives of norms satisfy

ι(f') ∈ span{κ(x_i, •); i = 1, ..., n_x},    ι(g') ∈ span{κ(•, y_j); j = 1, ..., n_y}.

Further, this element (f', g') obtains the minimum-norm interpolation of D_{f,g}, i.e., ‖(f', g')‖_{B_X⊕B_Y} ≤ ‖(f, g)‖_{B_X⊕B_Y}. This last fact implies T(f', g') ≤ T(f, g). Therefore, the unique optimal solution (f*, g*) also satisfies

ι(f*) ∈ span{κ(x_i, •); i = 1, ..., n_x},    ι(g*) ∈ span{κ(•, y_j); j = 1, ..., n_y},

which implies that suitable parameters ξ_1, ..., ξ_{n_x} ∈ R and ζ_1, ..., ζ_{n_y} ∈ R exist such that

ι(f*) = Σ_{i=1}^{n_x} ξ_i κ(x_i, •),    ι(g*) = Σ_{j=1}^{n_y} ζ_j κ(•, y_j),

proving the claim.
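Although ι has no simple closed form in general, for ℓ_p norms the (unregularized) Gâteaux derivative of the norm is explicit and can be sanity-checked against a finite difference. A minimal numerical sketch (our own illustrative helper names; p = 2 recovers the familiar Hilbert-space case ⟨w, v/‖v‖⟩):

```python
import numpy as np

def gateaux_norm_p(v, w, p):
    # Closed-form Gateaux derivative of the l_p norm at v != 0 in direction w:
    # sum_i w_i * sign(v_i) |v_i|^{p-1} / ||v||_p^{p-1}.
    return np.sum(w * np.sign(v) * np.abs(v) ** (p - 1)) / np.linalg.norm(v, p) ** (p - 1)

def finite_diff(v, w, p, t=1e-6):
    # Central-difference approximation of lim_{t->0} (||v+tw||_p - ||v||_p)/t.
    return (np.linalg.norm(v + t * w, p) - np.linalg.norm(v - t * w, p)) / (2 * t)

rng = np.random.default_rng(2)
v, w = rng.normal(size=5), rng.normal(size=5)
for p in (1.5, 2.0, 3.0):
    assert abs(gateaux_norm_p(v, w, p) - finite_diff(v, w, p)) < 1e-5
```

For p = 2 the derivative is linear in v (the Hilbert case); for p ≠ 2 it is a nonlinear map of v, which is one concrete way to see the extra nonlinearity that RKBS learning introduces over RKHS learning.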

C PROOF OF THEOREM 2

First, we state the following well-known lemma.

Lemma 5. For any two compact Hausdorff spaces X and Y, continuous function κ : X × Y → R, and ε > 0, there exists an integer d > 0 and continuous functions φ_ℓ : X → R, ψ_ℓ : Y → R, ℓ = 1, ..., d, such that

|κ(x, y) - Σ_{ℓ=1}^d φ_ℓ(x)ψ_ℓ(y)| < ε    ∀ x ∈ X, y ∈ Y.

Proof. The product space of two compact spaces X and Y, X × Y, is of course compact by Tychonoff's theorem. Consider the algebra

A = {f : f(x, y) = Σ_{ℓ=1}^d φ_ℓ(x)ψ_ℓ(y), x ∈ X, y ∈ Y}.

It is easy to show (i) that A is an algebra, (ii) that A is a subalgebra of the real-valued continuous functions on X × Y, and (iii) that A separates points. Then, combining the aforementioned facts, by the Stone-Weierstrass theorem A is dense in the set of real-valued continuous functions on X × Y.

Remark 9. In addition to helping us prove Theorem 2 below, Lemma 5 also serves as somewhat of an analog to Mercer's theorem for the more general case of asymmetric, non-PSD kernels. It is however weaker than Mercer's theorem in that the non-PSD nature of κ means that the functions in the sum cannot be considered as eigenfunctions (with φ_ℓ = ψ_ℓ) with associated nonnegative eigenvalues.

Now we proceed to the proof of Theorem 2.

Proof. To keep the equations from becoming too cluttered, below we use q(t), k(s), φ(t), ψ(s) ∈ R^d as the vector concatenations of the scalar functions {q_ℓ(t)}, {k_ℓ(s)}, {φ_ℓ(t)}, and {ψ_ℓ(s)}, ℓ = 1, ..., d, respectively. All sup norms are with respect to X × Y. Our proof proceeds similarly to the proof of Theorem 5.1 of Okuno et al. (2018). We generalize their theorem and proof to non-Mercer kernels and simplify some intermediate steps. First, by applying Lemma 5, we can write that for any ε_1 > 0, there is a d such that

‖κ - φ^T ψ‖_sup < ε_1. (C.1)

Now we consider the approximation of φ_ℓ and ψ_ℓ by q_ℓ and k_ℓ, respectively. By the universal approximation theorem of multilayer neural networks (Cybenko, 1989; Hornik et al., 1989; Funahashi, 1989; Attali & Pagès, 1997, etc.),
we know that for any functions φ_ℓ : X → R, ψ_ℓ : Y → R and scalar ε_2 > 0, there is an integer m > 0 such that if q_ℓ : X → R and k_ℓ : Y → R are two-layer neural networks with m hidden units, then

‖φ_ℓ - q_ℓ‖_sup < ε_2 and ‖ψ_ℓ - k_ℓ‖_sup < ε_2 (C.2)

for all ℓ. Now, beginning from (14), we can write

‖κ - q^T k‖_sup ≤ ‖κ - φ^T ψ‖_sup + ‖φ^T ψ - q^T k‖_sup (C.3)

by the triangle inequality. Examining the second term of the RHS of (C.3),

‖φ^T ψ - q^T k‖_sup = ‖φ^T (ψ - k) + (φ - q)^T k‖_sup ≤ ‖φ^T (ψ - k)‖_sup + ‖(φ - q)^T k‖_sup ≤ ‖φ‖_sup ‖ψ - k‖_sup + ‖φ - q‖_sup ‖k‖_sup (C.4)

where the first inequality uses the triangle inequality and the second uses the Cauchy-Schwarz inequality. Finally, we can combine (C.1)-(C.4) to write

‖κ - q^T k‖_sup ≤ ‖κ - φ^T ψ‖_sup + ‖φ‖_sup ‖ψ - k‖_sup + ‖φ - q‖_sup ‖k‖_sup < ε_1 + √d ε_2 (‖φ‖_sup + ‖ψ‖_sup + √d ε_2).

Picking ε_1 and ε_2 appropriately, e.g. ε_1 = ε/2 and ε_2 ≤ (√((‖φ‖_sup + ‖ψ‖_sup)^2 + 2ε) - (‖φ‖_sup + ‖ψ‖_sup)) / (2√d), completes the proof.
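Lemma 5 can be illustrated numerically: discretizing an asymmetric, non-PSD kernel on a grid and truncating its SVD yields exactly a finite sum of separable terms Σ_ℓ φ_ℓ(x)ψ_ℓ(y), with sup-norm error shrinking as d grows (and, unlike the Mercer case, φ_ℓ ≠ ψ_ℓ since U ≠ V). A minimal sketch with an illustrative kernel of our choosing:

```python
import numpy as np

# An illustrative asymmetric, non-PSD kernel on [0,1] x [0,1]:
# kappa(x, y) != kappa(y, x), and its kernel matrix is not symmetric.
kappa = lambda x, y: np.exp(x * y) * (1.0 + x - y)

n = 200
xs = np.linspace(0.0, 1.0, n)
ys = np.linspace(0.0, 1.0, n)
Kmat = kappa(xs[:, None], ys[None, :])  # n x n discretized kernel

# A truncated SVD gives a rank-d separable expansion: the columns of U play
# the role of phi_l and the rows of Vt the role of psi_l (up to scaling).
U, s, Vt = np.linalg.svd(Kmat)
errs = []
for d in (1, 2, 4, 8):
    approx = (U[:, :d] * s[:d]) @ Vt[:d]
    errs.append(np.abs(Kmat - approx).max())  # discrete sup-norm error

assert all(a > b for a, b in zip(errs, errs[1:]))  # error shrinks with d
assert errs[-1] < 1e-3
```

The rapid decay of the error with d is a discrete analog of the density statement in Lemma 5; for a genuinely infinite-dimensional kernel, no finite d drives the error exactly to zero.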

D EXPERIMENT IMPLEMENTATION DETAILS

Datasets The 2014 International Workshop on Spoken Language Translation (IWSLT14) machine translation dataset is a dataset of transcriptions of TED talks (and translations of those transcriptions). We use the popular German to English subset of the dataset. We use the 2014 "dev" set as our test set, and a train/validation split suggested in demo code for fairseq (Ott et al., 2019) , where every 23rd line in the IWSLT14 training data is held out as a validation set. The 2014 ACL Workshop on Statistical Machine Translation (WMT14) dataset is a collection of European Union Parliamentary proceedings, news stories, and web text with multiple translations. We use newstest2013 as our validation set and newstest2014 as our test set. The Stanford Sentiment Treebank (SST) (Socher et al., 2013) is a sentiment analysis dataset with sentences taken from movie reviews. We use two standard subtasks: binary classification (SST-2) and fine-grained classification (SST-5). SST-2 is a subset of SST-5 with neutral-labeled sentences removed. We use the standard training/validation/testing splits, which gives splits of 6920/872/1821 on SST-2 and 8544/1101/2210 on SST-5. Data Preprocessing On both translation datasets, we use sentencepiecefoot_2 to tokenize and train a byte-pair encoding (Sennrich et al., 2016) on the training set. We use a shared BPE vocabulary across the target and source languages. Our resulting BPE vocabulary size is 8000 for IWSLT14 DE-EN and 32000 for WMT14 EN-FR. For SST, we train a sentencepiece BPE for each subtask separately, obtaining BPE vocabularies of size 7465 for SST-2 and 7609 for SST-5. Models Our models are written in Pytorch (Paszke et al., 2019) . We make use of the Fairseq (Ott et al., 2019) library for training and evaluation. In machine translation, we use 6 Transformer layers in both the encoder and decoder. 
Both Transformer sublayers (attention and the two fully-connected layers) have a residual connection with the "pre-norm" (Wang et al., 2019b) ordering of Layer normalization -> Attention or FC -> ReLU -> Add residual. We use an embedding dimension of 512 for the learned token embeddings. For IWSLT14, the attention sublayers use 4 heads with a per-head dimension d of 128 and the fully-connected sublayers have a hidden dimension of 1024. For WMT14, following Vaswani et al. (2017) 's "base" model, the attention layers have 8 heads with a per-head dimension d of 64 and the fully-connected sublayers have a hidden dimension of 2048. For SST, we use a very small, encoder-only, Transformer variant, with only two Transformer layers. The token embedding dimension is 64, each Transformer self-attention sublayer has 4 heads with per-head dimension d of 16, and the fully-connected sublayers have a hidden dimension of 128. To produce a sentence classification, the output of the second Transformer layer is average-pooled over



Footnotes:

Often, the source and target sets are taken to be the same, s_i = t_i ∀i. This instance of attention is called self-attention.

foot_1: As Fasshauer et al. (2015) note, any Hilbert space is strictly convex and smooth, so it seems reasonable to assume that an RKBS is also strictly convex and smooth.

foot_2: https://github.com/google/sentencepiece



y_j); j = 1, ..., n_y}; i = 1, ..., n_x} = ⊥span{κ(•, y_j); j = 1, ..., n_y}. (B.11) Combining (B.9a) and (B.10), we get

and tensor kernel method (Tao et al., 2005; Kotsia & Patras, 2011; He et al., 2017) literatures. Our problem and results are new in the generalization to Banach rather than Hilbert spaces: as prior work in the RKBS literature (Micchelli et al., 2004; Zhang & Zhang, 2012; Xu & Ye, 2019, etc.) notes, RKBS learning problems are distinct from RKHS ones in their additional nonlinearity and/or nonconvexity. An extension of binary learning problems to Banach spaces is thus motivated by the Transformer setting, where a kernel method sits in the context of a nonlinear and nonconvex deep neural network, rather than in a shallow learner like an SVM or matrix completion. For more discussion, see Appendix A.

where datapoints come from only one of the sets on which the reproducing kernel is defined (i.e., only X but not Y), which means the solution sought is an element of only one of the Banach spaces (e.g., f : X → R, f ∈ B_X). Here, we state and prove a theorem for the more-relevant-to-Transformers binary case presented in Definition 3.

Theorem 1. Suppose we have a kernel learning problem of the form in (11). Let κ : X × Y → R be the reproducing kernel of the pair of RKBS's B_X and B_Y satisfying Definitions 1 and 2. Then, given some conditions on B_X and B_Y (see Appendix B), the regularized empirical risk minimization problem (11) has a unique solution pair (f*, g*).

Test BLEU scores for Transformers with various kernels on machine translation (case-sensitive sacreBLEU). Values are mean ± std. dev. over 5 training runs with different random seeds.

Test accuracies for Transformers with various kernels on sentiment classification. Values are mean ± std. dev. over 5 training runs with different random seeds.

Before beginning the proof, we state the following lemma regarding the existence of solutions of convex optimization problems on Banach spaces: Lemma 4 (Ekeland & Témam, 1999, Chapter II, Proposition 1.2). Let B be a reflexive Banach space and S a closed, convex, and bounded (with respect to • B ) subset of B. Let f : S → R ∪ {+∞} be a convex function with a closed epigraph (i.e., it satisfies the condition that ∀c ∈ R ∪ {+∞}, the set {v ∈ S : f (v) ≤ c} is closed). Then, the optimization problem


A normed vector space V is called strictly convex if ‖tv_1 + (1 - t)v_2‖_V < 1 whenever v_1 ≠ v_2, ‖v_1‖_V = ‖v_2‖_V = 1, and 0 < t < 1, where v_1, v_2 ∈ V and ‖•‖_V denotes the norm of V (Megginson, 1998, Definition 5.1.1; citing Clarkson, 1936 and Akhiezer & Krein, 1962). A nonempty subset A of a metric space (M, d_M) is called a Chebyshev set if, for every element m ∈ M, there is exactly one element c ∈ A such that d_M(m, c) = d_M(m, A) (Megginson, 1998, Definition 5.1.17) (where recall the distance between a point m and a set A in a metric space is equal to inf_{c∈A} d_M(m, c)). If a normed vector space V is reflexive and strictly convex, then every nonempty closed convex subset of V is a Chebyshev set (Megginson, 1998, Corollary 5.1.19; citing Day, 1941). For a normed vector space V and v, w ∈ V, the Gâteaux derivative of the norm (Megginson, 1998, Definition 5.4.15) at v in the direction of w is defined as

lim_{t→0} (‖v + tw‖_V - ‖v‖_V) / t.

If the Gâteaux derivative of the norm at v in the direction of w exists for all w ∈ V, then ‖•‖_V is said to be Gâteaux differentiable at v. A normed vector space V is called Gâteaux differentiable or smooth if its norm is Gâteaux differentiable at all v ∈ V (Megginson, 1998, Corollary 5.4.18). The smoothness of a normed vector space V implies that, if we define a "norm operator" ρ on V, ρ(v) := ‖v‖_V, then for each v ∈ V \ {0}, there exists a continuous linear functional d_G ρ(v) on V such that ⟨w, d_G ρ(v)⟩_V equals the Gâteaux derivative of the norm at v in the direction of w for all w ∈ V (Xu & Ye, 2019, p. 24). Since the Gâteaux derivative of the norm is undefined at 0, following Xu & Ye (2019, Equation 2.16); Lin et al. (2019, p.
20); etc., we define a regularized Gâteaux derivative ι of the norm operator on V (set to 0 at v = 0). (B.2)

Given two vector spaces V and W defined over a field F, the direct sum, denoted V ⊕ W, is the vector space with elements (v, w) ∈ V ⊕ W for v ∈ V, w ∈ W, with the additional structure (v_1, w_1) + (v_2, w_2) = (v_1 + v_2, w_1 + w_2) and c(v, w) = (cv, cw) for c ∈ F. If V and W are normed vector spaces with norms ‖•‖_V and ‖•‖_W, respectively, then we will say ‖(v, w)‖_{V⊕W} = ‖v‖_V + ‖w‖_W. (B.3) Megginson (1998, Definition 1.8.1) calls (B.3) a "1-norm" direct sum norm, but notes that other norm-equivalent direct sum norms such as a 2-norm and infinity-norm are possible. Some other useful facts about direct sums are:

• if V and W are both strictly convex, then V ⊕ W is strictly convex (Megginson, 1998, Theorem 5.1.23);
• if V and W are both reflexive, then V ⊕ W is reflexive (Megginson, 1998, Corollary 1.11.20);
• and if V and W are both smooth, then V ⊕ W is smooth (Megginson, 1998, Theorem 5.4.22).

An element v of a normed vector space V is said to be orthogonal (or Birkhoff-James orthogonal) to another element w ∈ V if ‖v + tw‖_V ≥ ‖v‖_V for all t ∈ R (Birkhoff, 1935; James, 1947). Now we can state a lemma regarding orthogonality in RKBS's.

Lemma 1 (Xu & Ye, 2019, Lemma 2.21). If the RKBS B is smooth, then f ∈ B is orthogonal to g ∈ B if and only if ⟨g, ι(f)⟩_B = 0, where ⟨•, •⟩_B means the dual bilinear product as given in (B.1) and ι is the regularized Gâteaux derivative from (B.2). Also, an f ∈ B \ {0} is orthogonal to a subspace N ⊆ B if and only if ⟨h, ι(f)⟩_B = 0 for all h ∈ N.

the non-padding tokens, then passed to a classification head. This classification head is a two-layer neural network with hidden dimension 64 and output dimension equal to the number of classes; this output vector becomes the class logits.

Training We train with the Adam optimizer (Kingma & Ba, 2015). Following Vaswani et al. (2017), for machine translation we set the Adam parameters β_1 = 0.9, β_2 = 0.98. On IWSLT14 DE-EN, we schedule the learning rate to begin at 0.001 and then multiply by a factor of 0.1 when the validation BLEU does not increase for 3 epochs.
For WMT14 EN-FR, we decay proportionally to the inverse square root of the update step using Fairseq's implementation. For both datasets, we also use a linear warmup on the learning rate from 1e-7 to 0.001 over the first 4000 update steps. On IWSLT14 DE-EN, we end training when the BLEU score does not improve for 7 epochs on the validation set. On WMT14 EN-FR, we end training after 100k gradient updates (inclusive of the warmup stage), which gives us a final learning rate of 0.0002. We train on the cross-entropy loss and employ label smoothing of 0.1. We use minibatches with a maximum of about 10k source tokens on IWSLT14 DE-EN and 25k on WMT14 EN-FR. Also on WMT14, we ignore sentences with more than 1024 tokens. For both SST subtasks, we also use a linear warmup from 1e-7 over 4000 warmup steps, but use an initial post-warmup learning rate of 0.0001. Similar to IWSLT14, we decay the learning rate by multiplying by 0.1 when the validation accuracy does not increase for 3 epochs, and end training when the validation accuracy does not improve for 8 epochs.

Evaluation for machine translation Following Vaswani et al. (2017), we use beam-search decoding, with a beam length of 4 and a length penalty of 0.6, to generate sentences for evaluation. We use sacrebleu (Post, 2018) to generate BLEU scores. We report whole-word case-sensitive BLEU.
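The WMT14 schedule (linear warmup to 0.001 over 4000 steps, then inverse-square-root decay) can be sketched in a few lines. This is our own minimal reimplementation, not Fairseq's code, and Fairseq's inverse_sqrt scheduler may differ in small details; note that it reproduces the final learning rate of 0.0002 at 100k updates quoted above.

```python
import math

def lr_at(step, peak=1e-3, warmup=4000, floor=1e-7):
    # Linear warmup from `floor` to `peak` over `warmup` steps,
    # then inverse-square-root decay anchored so lr(warmup) == peak.
    if step < warmup:
        return floor + (peak - floor) * step / warmup
    return peak * math.sqrt(warmup / step)

assert lr_at(0) == 1e-7                      # warmup starts at the floor
assert abs(lr_at(4000) - 1e-3) < 1e-12       # end of warmup hits the peak
assert abs(lr_at(100_000) - 2e-4) < 1e-5     # ~0.0002 after 100k updates
```

The 0.1-factor plateau decay used for IWSLT14 and SST is driven by validation metrics rather than the step count, so it is not a pure function of `step` and is omitted here.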

