ON ACCELERATED PERCEPTRONS AND BEYOND

Abstract

The classical Perceptron algorithm of Rosenblatt can be used to find a linear threshold function that correctly classifies n linearly separable data points, assuming the classes are separated by some margin γ > 0. A foundational result is that the Perceptron converges after O(1/γ²) iterations. Several recent works have improved this rate by a quadratic factor, to O(√(log n)/γ), with more sophisticated algorithms. In this paper, we unify these existing results under one framework by showing that they can all be described through the lens of solving min-max problems using modern acceleration techniques, mainly through optimistic online learning. We then show that the proposed framework also leads to improved results for a series of problems beyond the standard Perceptron setting. Specifically, a) for the margin maximization problem, we improve the state-of-the-art result from O(log t/t²) to O(1/t²), where t is the number of iterations; b) we provide the first result on identifying the implicit bias property of the classical Nesterov's accelerated gradient descent (NAG) algorithm, and show that NAG can maximize the margin at an O(1/t²) rate; c) for the classical p-norm Perceptron problem, we provide an algorithm with an O(√((p-1) log n)/γ) convergence rate, whereas existing algorithms suffer an O((p-1)/γ²) convergence rate.

1. INTRODUCTION

In this paper, we revisit the problem of learning a linear classifier, one of the most important and fundamental tasks of machine learning (Bishop, 2007). In this problem, we are given a set S of n training examples, and the goal is to find a linear classifier that correctly separates S as fast as possible. The most well-known algorithm is the Perceptron (Rosenblatt, 1958), which converges to a perfect (mistake-free) classifier after O(1/γ²) iterations, provided the data is linearly separable with some margin γ > 0 (Novikoff, 1962). Over subsequent decades, many variants of the Perceptron have been developed (Aizerman, 1964; Littlestone, 1988; Wendemuth, 1995; Freund & Schapire, 1999; Cesa-Bianchi et al., 2005, to name a few). However, somewhat surprisingly, there has been little progress in substantially improving the fundamental Perceptron iteration bound of Novikoff (1962). It is only recently that a number of researchers have discovered accelerated variants of the Perceptron with a faster O(√(log n)/γ) iteration complexity, albeit with a higher per-iteration cost. These works model the problem in different ways, e.g., as a non-smooth optimization problem or an empirical risk minimization task, and they establish faster rates using sophisticated optimization tools. Soheili & Pena (2012) put forward the smooth Perceptron, framing the objective as a non-smooth strongly-concave maximization and then applying Nesterov's excessive gap technique (NEG, Nesterov, 2005). Yu et al. (2014) proposed the accelerated Perceptron by furnishing a convex-concave objective that can be solved via the mirror-prox method (Nemirovski, 2004). Ji et al. (2021) put forward a third interpretation, obtaining the accelerated rate by minimizing the empirical risk under the exponential loss with a momentum-based normalized gradient descent algorithm.
Following this line of research, in this paper, we present a unified analysis framework that reveals the exact relationship among these methods that share the same order of convergence rate. Moreover, we show that the proposed framework also leads to improved results for various problems beyond the standard Perceptron setting. Specifically, we consider a general zero-sum game that involves two players (Abernethy et al., 2018): a main player that chooses the classifier, and an auxiliary player that picks a distribution over the data. The two players compete with each other by performing no-regret online learning algorithms (Hazan, 2016; Orabona, 2019), and the goal is to find the equilibrium of some convex-concave function. We show that, under this dynamic, all of the existing accelerated Perceptrons can find their equivalent forms. In particular, these Perceptrons can be described as a dynamic where two players solve the game by performing optimistic online learning strategies (Rakhlin & Sridharan, 2013), one of the most important classes of algorithms in online learning. Note that implementing online learning algorithms (even optimistic strategies) to solve zero-sum games has already been extensively explored (e.g., Rakhlin & Sridharan, 2013; Daskalakis et al., 2018; Wang & Abernethy, 2018; Daskalakis & Panageas, 2019). However, we emphasize that our main novelty lies in showing that all of the existing accelerated Perceptrons, developed with advanced algorithms from different areas, can be perfectly described under this unified framework. It greatly simplifies the analysis of accelerated Perceptrons, as their convergence rates can now be easily obtained by plugging in off-the-shelf regret bounds of optimistic online learning algorithms. Moreover, the unified framework reveals a close connection between the smooth Perceptron and the accelerated Perceptron of Ji et al. (2021): Theorem 1 (informal).
Smooth Perceptron and the accelerated Perceptron of Ji et al. (2021) can be described as a dynamic where the two players employ the optimistic-follow-the-regularized-leader (OFTRL) algorithm. The main difference is that the smooth Perceptron outputs the weighted average of the main player's historical decisions, while the accelerated Perceptron of Ji et al. (2021) outputs the weighted sum.

Beyond providing a deeper understanding of accelerated Perceptrons, our framework also provides improved new results for several other important areas:

• Implicit bias analysis. The seminal work of Soudry et al. (2018) shows that, for linearly separable data, minimizing the empirical risk with vanilla gradient descent (GD) gives a classifier which not only has zero training error (and thus can be used for linear separation), but also maximizes the margin. This phenomenon characterizes the implicit bias of GD, as it implicitly prefers the (ℓ₂-)maximal margin classifier among all classifiers with a positive margin, and analyzing the implicit bias has become an important tool for understanding why classical optimization methods generalize well for supervised machine learning problems. Soudry et al. (2018) show that GD can maximize the margin at an O(1/log t) rate, which was later improved to O(1/√t) (Nacson et al., 2019) and then O(1/t) (Ji & Telgarsky, 2021). The state-of-the-art algorithm is proposed by Ji et al. (2021), who show that their momentum-based GD has an O(log t/t²) margin-maximization rate. In this paper, we make two contributions in this direction: 1. We show that, under our analysis framework, we can easily improve the margin-maximization rate of the algorithm of Ji et al. (2021) from O(log t/t²) to O(1/t²).

2. RELATED WORK

This section briefly reviews the related work on Perceptron algorithms, implicit-bias analysis, and game theory. The background knowledge on (optimistic) online learning is presented in the Preliminaries (Section 3).

Accelerated Perceptrons and p-norm Perceptron

The study of Perceptron algorithms has an extensive history dating back to the mid-twentieth century (Rosenblatt, 1958; Novikoff, 1962). However, it is only recently that progress has been made on improving the fundamental Ω(1/γ²) iteration bound of the vanilla Perceptron in the standard setting. Specifically, the smooth Perceptron proposed by Soheili & Pena (2012) achieves an O(√(log n)/γ) rate by maximizing an auxiliary non-smooth strongly-concave function with NEG (Nesterov, 2005). The same accelerated rate was later obtained by two other works (Yu et al., 2014; Ji et al., 2021). The former considers a bi-linear saddle-point problem and employs the mirror-prox method (Nemirovski, 2004). The latter applies momentum-based GD to minimize the empirical risk with the exponential loss. In this paper, we show that these algorithms can be analyzed in a unified way under our framework. Apart from the above accelerated Perceptrons, there exists another class of algorithms which enjoy an O(poly(n) log(1/γ)) convergence rate (Dunagan & Vempala, 2004; Peña & Soheili, 2016; Dadush et al., 2020). These methods typically call (accelerated) Perceptrons as a subroutine, and then apply a re-scaling technique to adjust the decision periodically. Although the dependence on γ becomes better, the polynomial dependence on n makes these methods computationally inefficient for large-scale data sets. In this paper, we focus on accelerated Perceptrons with the O(√(log n)/γ) rate, and leave explaining the re-scaling-type algorithms as future work. The p-norm Perceptron problem (Gentile & Littlestone, 1999) is a natural extension of the classical Perceptron setting, which assumes the p-norm of the feature vectors is bounded, where 2 ≤ p < ∞. Gentile (2001) shows that a mirror-descent-style update guarantees an O((p-1)/γ²) convergence rate. By contrast, our proposed algorithm achieves a tighter O(√((p-1) log n)/γ) convergence rate.
Implicit-bias analysis In many real-world applications, directly minimizing the empirical risk (without any regularization) by first-order methods can provide a model which not only enjoys low training error, but also generalizes well (Soudry et al., 2018). This is usually attributed to the implicit bias introduced by the optimization methods (Soudry et al., 2018). Explaining this phenomenon is a crucial step towards understanding the generalization ability of commonly-used optimization methods. For linearly separable data, Soudry et al. (2018) prove that, when minimizing the empirical risk with the exponential loss, vanilla GD can maximize the margin at an O(1/log t) rate. This result implies the implicit bias of GD towards the ℓ₂-maximal margin classifier. Later, Nacson et al. (2019) show that GD with a function-value-dependent decreasing step size enjoys an O(1/√t) margin-maximization rate. Ji & Telgarsky (2021) improve this result to O(1/t) by employing a faster-decreasing step size. Ji et al. (2021) design a momentum-based GD that maximizes the margin at an O(log t/t²) rate. However, it remained unclear whether this rate could be further improved and how to analyze the margin-maximization ability of other classical optimization methods such as NAG. In this paper, we provide positive answers to both questions. Finally, we note that Ramdas & Pena (2016) prove that a variant of the Perceptron algorithm named the normalized Perceptron can also converge to a maximum-margin classifier at an O(1/√t) rate.

Games and no-regret dynamics Our framework is motivated by the line of research that links optimization methods for convex optimization to equilibrium computation with no-regret dynamics. The seminal work of Rakhlin & Sridharan (2013) recovers the classical mirror-prox method (Nemirovski, 2004) with optimistic online mirror descent.
Abernethy & Wang (2017) show that the well-known Frank-Wolfe algorithm can be seen as applying (optimistic) online algorithms to compute the equilibrium of a special zero-sum game called the Fenchel game, which is constructed via the Fenchel duality of the objective function. Later, researchers demonstrated that other classical optimization methods for smooth optimization, such as Heavy-ball and NAG, can also be described similarly (Abernethy et al., 2018; Wang & Abernethy, 2018; Wang et al., 2021). We highlight the differences between this work and the previous ones: 1) Our analysis does not involve the Fenchel game or Fenchel duality; instead, we work directly on the (regularized) min-max game designed for linear classification. 2) Most of the previous works mainly focus on understanding optimization algorithms for smooth optimization, and it was unclear how to understand algorithms such as NEG under the game framework. 3) Although both Abernethy et al. (2018) and our work analyze NAG, the goals are significantly different: Abernethy et al. (2018) focus on the optimization problem itself, while we consider how minimizing the empirical risk affects the margin. 4) To our knowledge, the link between implicit-bias problems and no-regret dynamics is also new.

3. PRELIMINARIES

Notation. We use lower-case boldface letters x, y to denote vectors, lower-case letters a, b to denote scalars, and upper-case letters A, B to denote matrices. For a vector x ∈ R^d, we use x_i to denote the i-th component of x. For a matrix A ∈ R^{n×d}, let A_{(i,:)} be its i-th row, A_{(:,j)} its j-th column, and A_{(i,j)} the i-th element of the j-th column. We use ‖·‖ to denote a general norm, ‖·‖_* its dual norm, and ‖·‖_p the ℓ_p-norm. For a positive integer n, we denote the set {1, . . . , n} as [n], and the n-dimensional simplex as ∆_n. Let E : ∆_n → R be the negative entropy function, defined as E(p) = Σ_{i=1}^n p_i log p_i, ∀p ∈ ∆_n. For a strongly convex function R : W → R, define the Bregman divergence between any two points w, w' ∈ W as D_R(w, w') = R(w) − R(w') − ⟨w − w', ∇R(w')⟩.

Online convex optimization (OCO) Here we review a general weighted OCO framework, proposed by Abernethy et al. (2018). In each round t = 1, . . . , T of this paradigm, a learner first chooses a decision z_t from a convex set Z ⊆ R^d, then observes a loss function f_t(·) : Z → R as well as a weight α_t > 0, and finally updates the decision. The performance of the learner is measured by the weighted regret, defined as R_T = Σ_{t=1}^T α_t f_t(z_t) − min_{z∈Z} Σ_{t=1}^T α_t f_t(z). Perhaps the most natural method for OCO is follow-the-leader (FTL), which simply picks the empirically best decision at each round: z_t = argmin_{z∈Z} Σ_{i=1}^{t−1} α_i f_i(z). However, FTL is unstable, and one can easily find counter-examples where FTL suffers linear regret (Shalev-Shwartz, 2011). A classical way to address this limitation is to add a regularizer to the objective: z_t = argmin_{z∈Z} η Σ_{i=1}^{t−1} α_i f_i(z) + D_R(z, z_0), where η > 0 is the step size and z_0 is the initial decision. This method is called follow-the-regularized-leader (FTRL, Hazan, 2016), and it achieves a sub-linear regret bound with an appropriately chosen η.
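To make the weighted FTRL update concrete, here is a minimal sketch over the simplex with the negative-entropy regularizer, specializing to linear losses f_t(p) = ⟨g_t, p⟩ (an assumption we make for illustration; the closed-form softmax below follows from that specialization, and the function name and step size are ours):

```python
import numpy as np

def ftrl_simplex(loss_grads, alphas, eta=0.5):
    """Weighted FTRL over the simplex with the negative-entropy regularizer
    E(p) = sum_i p_i log p_i. For linear losses f_t(p) = <g_t, p>, the update
        p_t = argmin_p  eta * sum_{s<t} alpha_s <g_s, p> + D_E(p, 1/n)
    has the closed form of a softmax over the negated weighted gradient sum."""
    n = len(loss_grads[0])
    cum = np.zeros(n)          # running sum of alpha_s * g_s
    plays = []
    for g, a in zip(loss_grads, alphas):
        z = -eta * cum
        z -= z.max()           # numerically stable softmax
        p = np.exp(z) / np.exp(z).sum()
        plays.append(p)
        cum += a * g           # this round's weighted loss is revealed last
    return plays
```

The first play is uniform (no losses observed yet), and later plays shift mass away from coordinates that accumulated large weighted losses.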
Moreover, tighter bounds are possible in favorable cases with more advanced techniques. In this paper, we consider FTRL equipped with optimistic strategies (i.e., optimistic FTRL, Rakhlin & Sridharan, 2013; Orabona, 2019), given by

OFTRL[R, z_0, ψ_t, η, Z] : z_t = argmin_{z∈Z} η(Σ_{i=1}^{t−1} α_i f_i(z) + α_t ψ_t(z)) + D_R(z, z_0),

where an additional function ψ_t is added, which is an approximation of the next loss f_t. A tighter regret bound can be achieved when ψ_t is close enough to f_t. Next, we introduce two special cases of OFTRL:

OFTL[ψ_t, Z] : z_t = argmin_{z∈Z} Σ_{j=1}^{t−1} α_j f_j(z) + α_t ψ_t(z),
FTRL+[R, z_0, η, Z] : z_t = argmin_{z∈Z} η · Σ_{j=1}^{t} α_j f_j(z) + D_R(z, z_0).

The first is optimistic FTL, where the regularizer is set to zero. The second is FTRL+, which uses f_t itself as the optimistic function ψ_t. Finally, we note that, apart from FTL-type algorithms, there also exist optimistic methods developed based on mirror descent, such as optimistic online mirror descent (OMD, Rakhlin & Sridharan, 2013). Due to the page limit, the details of OMD and the regret bounds of these OCO algorithms are postponed to Appendix A.

No-regret dynamics for zero-sum games Finally, we introduce the framework for using no-regret online algorithms to solve a zero-sum game. Consider the following general two-player game:

max_{w∈W} min_{p∈Q} g(w, p),    (1)

where g(w, p) is a concave-convex function, and W and Q are convex sets. The no-regret framework for solving (1) is presented in Protocol 1. In each round t of this procedure, the w-player first picks a decision w_t ∈ W. Then, the p-player observes its loss ℓ_t(·) = g(w_t, ·) as well as a weight α_t. After that, the p-player picks a decision p_t and passes it to the w-player. As a consequence, the w-player observes its loss h_t(·) = −g(·, p_t) and weight α_t. Note that both ℓ_t and h_t are convex. Denote the weighted regret bounds of the two players by R_w and R_p, respectively.
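For the same linear-loss specialization over the simplex, a single OFTRL step with the entropy regularizer differs from plain FTRL only through the hinted term α_t ψ_t. A sketch (the function name and the linear-loss assumption are ours):

```python
import numpy as np

def oftrl_simplex_step(past_grads, past_alphas, hint, alpha_t, eta=0.25):
    """One OFTRL step over the simplex with the entropy regularizer,
    assuming linear losses f_s(p) = <g_s, p>:
        p_t = argmin_p eta*(sum_{s<t} a_s <g_s,p> + a_t <hint,p>) + D_E(p, 1/n).
    Passing hint = g_t (the true current gradient) recovers FTRL+;
    passing hint = 0 recovers plain FTRL."""
    cum = alpha_t * hint
    for g, a in zip(past_grads, past_alphas):
        cum = cum + a * g
    z = -eta * cum
    z = z - z.max()            # numerically stable softmax
    e = np.exp(z)
    return e / e.sum()
```

With an empty history and a zero hint, the step returns the uniform distribution; a nonzero hint tilts the play away from the coordinates the hint predicts will be expensive.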
Then, we have the following classical conclusion. The proof is shown in Appendix B.

Theorem 2. Define m(w) = min_{p∈Q} g(w, p), and let w̄_T be the weighted average of {w_t}_{t=1}^T. Then, ∀w ∈ W, m(w) − m(w̄_T) ≤ (Σ_{t=1}^T α_t)^{−1} (R_p + R_w).

Protocol 1 No-regret dynamics with weighted OCO
1: Input: {α_t}_{t=1}^T, w_0, p_0.
2: Input: OL_w, OL_p. // The online algorithms for choosing w and p.
3: for t = 1, . . . , T do
4:   w_t ← OL_w;
5:   OL_p ← α_t, ℓ_t(·); // Define ℓ_t(·) = g(w_t, ·)
6:   p_t ← OL_p;
7:   OL_w ← α_t, h_t(·); // Define h_t(·) = −g(·, p_t)
8: end for
9: Output: w̄_T = (1/Σ_{t=1}^T α_t) Σ_{t=1}^T α_t w_t.
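The loop of Protocol 1 can be sketched directly in code. The two toy players below (a best-responding w-player and a softmax p-player for the regularized game g(w, p) = pᵀAw − ½‖w‖² used later in Section 4) are illustrative choices of OL_w and OL_p, not the paper's exact tunings:

```python
import numpy as np

def no_regret_dynamics(A, ol_w, ol_p, alphas):
    """The loop of Protocol 1 on a data matrix A (n x d). The learner
    interface is illustrative: each learner is a callable receiving the
    opponent's history plus the weights, returning its next play."""
    ws, ps = [], []
    for t in range(1, len(alphas) + 1):
        ws.append(ol_w(A, ps, alphas[:t]))   # line 4: w-player moves first
        ps.append(ol_p(A, ws, alphas[:t]))   # lines 5-6: p-player answers
    al = np.asarray(alphas, dtype=float)
    return (al[:, None] * np.asarray(ws)).sum(axis=0) / al.sum()  # line 9

# Illustrative players for g(w, p) = p^T A w - 0.5*||w||^2. The w-player
# best-responds to the weighted average of past p's (FTL; the losses h_j are
# strongly convex in w, so this is already a low-regret strategy), and the
# p-player plays an FTRL+-style softmax over cumulative weighted margins.
def ol_w(A, ps, alphas):
    if not ps:                               # no history yet: play 0
        return np.zeros(A.shape[1])
    s = sum(a * p for a, p in zip(alphas, ps))
    return A.T @ s / sum(alphas[:len(ps)])

def ol_p(A, ws, alphas):
    cum = sum(a * (A @ w) for a, w in zip(alphas, ws))
    z = -0.25 * cum
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()
```

Running this dynamic on a small linearly separable data set with α_t = t returns a weighted-average classifier whose margin is positive, illustrating Theorem 2.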

4. UNDERSTANDING EXISTING ACCELERATED PERCEPTRONS

In this section, we first introduce our main topic, i.e., the binary linear classification problem, and then present the three accelerated Perceptrons and their equivalent forms under Protocol 1. For clarity, we use v_t to denote the classifier updates in the original algorithms, and w_t the corresponding updates in the equivalent forms under Protocol 1.

4.1. BINARY LINEAR CLASSIFICATION AND PERCEPTRON

We focus on the binary linear classification problem, which dates back to the pioneering work of Rosenblatt (1958). Let S = {x⁽ⁱ⁾, y⁽ⁱ⁾}_{i=1}^n be a linearly separable set of n training examples, where x⁽ⁱ⁾ ∈ R^d is the feature vector of the i-th example, and y⁽ⁱ⁾ ∈ {−1, 1} is the corresponding label. The goal is to efficiently find a linear classifier w ∈ R^d that correctly separates all data points. More formally, let A ∈ R^{n×d} be the matrix whose i-th row is A_{(i,:)} = y⁽ⁱ⁾(x⁽ⁱ⁾)ᵀ. Then we would like to find a w ∈ R^d such that

min_{i∈[n]} A_{(i,:)} w > 0.    (2)

This goal can be reformulated as a min-max optimization problem:

max_{w∈R^d} min_{i∈[n]} A_{(i,:)} w = max_{w∈R^d} min_{p∈∆_n} pᵀAw,    (3)

where, on the RHS of (3), min_{i∈[n]} A_{(i,:)} w is rewritten as min_{p∈∆_n} pᵀAw. The two expressions are equivalent since the optimal distribution p ∈ ∆_n always puts all of its weight on one training example (i.e., one row of A). For any w ∈ R^d, let γ(w) = min_{p∈∆_n} pᵀAw be the margin of w. We introduce the following standard assumption.

Assumption 1. We assume that the feature vectors are bounded, i.e., ‖x⁽ⁱ⁾‖₂ ≤ 1 for all i ∈ [n], and that there exists a w* ∈ R^d such that ‖w*‖₂ = 1 and γ(w*) = γ > 0.
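A small numerical sanity check of the vertex argument above: for any fixed w, the linear function p ↦ pᵀAw over the simplex is minimized at a vertex, i.e., on the single worst-classified example (the random data below is, of course, illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))                # feature vectors x^(i)
y = np.array([1., -1., 1., 1., -1.])       # labels y^(i)
A = y[:, None] * X                         # i-th row of A is y^(i) x^(i)
w = rng.normal(size=3)                     # an arbitrary classifier

margin = (A @ w).min()                     # min_i A_(i,:) w
p_unif = np.full(5, 1 / 5)                 # an arbitrary p in the simplex
p_star = np.eye(5)[np.argmin(A @ w)]       # vertex on the worst example
```

Every distribution p satisfies pᵀAw ≥ margin, with equality attained at the vertex p_star, which is exactly why the two sides of (3) agree.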

4.2. SMOOTH PERCEPTRON

To solve (3), the vanilla Perceptron repeatedly moves the classifier toward examples on which it performs badly. However, this greedy policy only yields a sub-optimal convergence rate. To address this problem, Soheili & Pena (2012) propose the smooth Perceptron, whose pseudo-code is summarized in the first box of Algorithm 1. The key idea is to find a classifier v ∈ R^d that maximizes the following ℓ₂-regularized function:

ψ(v) = −½‖v‖₂² + min_{q∈∆_n} qᵀAv,    (4)

which is non-smooth and strongly concave with respect to v. Under Assumption 1, Soheili & Pena (2012) show that the maximum value of ψ is max_{v∈R^d} ψ(v) = γ²/2, and that a classifier v ∈ R^d satisfies (2) when ψ(v) ≥ 0. In order to solve (4), Soheili & Pena (2012) apply the classical Nesterov's excessive gap technique (Nesterov, 2005) for strongly concave functions. This algorithm introduces a smoothed approximation of (4), parameterized by some constant μ > 0 and defined as

ψ_μ(v) = −½‖v‖₂² + min_{q∈∆_n} (qᵀAv + μ D_E(q, 1_n/n)),    (5)

which approximates (4) within an additive gap: ∀v ∈ R^d, μ > 0, ψ_μ(v) ≤ ψ(v) + μ log n. Then, the algorithm performs a sophisticated update rule such that the excessive gap condition holds for all t ≥ 1: γ²/2 ≤ ‖Aᵀq_t‖₂²/2 ≤ ψ_{μ_t}(v_t). We refer to Soheili & Pena (2012) for more details.

Algorithm 1 Smooth Perceptron (Soheili & Pena, 2012)
Initialization: θ_0 = 2/3, μ_0 = 4, v_0 = Aᵀ1_n/n.
Initialization: q_0 = q_{μ_0}(v_0), where q_μ(v) = argmin_{q∈∆_n} qᵀAv + μ D_E(q, 1_n/n).
for t = 1, . . . , T − 1 do
  v_t = (1 − θ_{t−1})(v_{t−1} + θ_{t−1} Aᵀq_{t−1}) + θ_{t−1}² Aᵀq_{μ_{t−1}}(v_{t−1})
  μ_t = (1 − θ_{t−1}) μ_{t−1}
  q_t = (1 − θ_{t−1}) q_{t−1} + θ_{t−1} q_{μ_t}(v_t)
  θ_t = 2/(t+3)
end for
Output: v_{T−1}

OL_w = OFTL[h_{t−1}(·), R^d] ⇔ w_t = argmin_{w∈R^d} Σ_{j=1}^{t−1} α_j h_j(w) + α_t h_{t−1}(w)
OL_p = FTRL+[E(·), 1_n/n, 1/4, ∆_n] ⇔ p_t = argmin_{p∈∆_n} ¼ Σ_{s=1}^{t} α_s ℓ_s(p) + D_E(p, 1_n/n)
Output: w̄_T
Soheili & Pena (2012) show that the smooth Perceptron outputs a positive-margin classifier after Ω(√(log n)/γ) iterations. However, the analysis is quite involved, and heavily relies on the complicated relationship among ψ(v_t), ψ_μ(v_t), and the excessive gap condition. In the following, we provide a no-regret explanation of this algorithm and then show that the convergence rate can be easily obtained under our framework. Specifically, we define the objective function in Protocol 1 as

g(w, p) = pᵀAw − ½‖w‖₂²,    (6)

and provide the equivalent expression in the second box of Algorithm 1. More specifically, we have the following proposition; the proof is given in Appendix C.1.

Proposition 1. Let α_t = t. Then the two interpretations of the smooth Perceptron in Algorithm 1 are the same, in the sense that v_{T−1} = w̄_T and q_{T−1} = (Σ_{t=1}^T α_t p_t)/(Σ_{t=1}^T α_t).

Proposition 1 shows that, under Protocol 1, the smooth Perceptron can be seen as optimizing (6) by implementing two online learning algorithms: in each round t, the w-player applies OFTL with ψ_t = h_{t−1}, a decision which the p-player subsequently observes and responds to with FTRL+. Moreover, we obtain theoretical guarantees for the smooth Perceptron based on Theorem 2, matching the convergence rate provided by Soheili & Pena (2012).

Theorem 3. Let α_t = t. Under Protocol 1, the regret of the two players of Algorithm 1 is bounded by R_w ≤ 2 Σ_{t=1}^T ‖p_t − p_{t−1}‖₁² and R_p ≤ 4 log n − 2 Σ_{t=1}^T ‖p_t − p_{t−1}‖₁². Moreover, w̄_T has a non-negative margin when T = Ω(√(log n)/γ).
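The equivalent Protocol-1 form (second box of Algorithm 1) is easy to implement, because both players admit closed-form updates: the quadratic losses h_j give the w-player a weighted-average solution, and the entropy regularizer gives the p-player a softmax. A sketch with α_t = t and η = 1/4; the initialization p_0 = uniform is our assumption for illustration:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def smooth_perceptron_game(A, T):
    """Smooth Perceptron in its equivalent Protocol-1 form, alpha_t = t.
    Closed forms (derived from h_j(w) = -p_j^T A w + 0.5||w||^2 and the
    entropy regularizer):
      w-player (OFTL with psi_t = h_{t-1}):
        w_t = A^T (sum_{j<t} a_j p_j + a_t p_{t-1}) / (sum_{j<t} a_j + a_t)
      p-player (FTRL+ with eta = 1/4):
        p_t = softmax(-(1/4) sum_{s<=t} a_s A w_s)."""
    n, d = A.shape
    p_prev = np.full(n, 1 / n)    # assumed initialization p_0
    cum_p = np.zeros(n)           # sum_{j} a_j p_j
    cum_aw = np.zeros(n)          # sum_{s} a_s A w_s
    sum_a = 0.0
    w_bar = np.zeros(d)
    for t in range(1, T + 1):
        a_t = float(t)
        w_t = A.T @ (cum_p + a_t * p_prev) / (sum_a + a_t)
        p_t = softmax(-0.25 * (cum_aw + a_t * (A @ w_t)))
        cum_p += a_t * p_t
        cum_aw += a_t * (A @ w_t)
        sum_a += a_t
        w_bar += a_t * w_t
        p_prev = p_t
    return w_bar / sum_a          # weighted average, as in Protocol 1
```

On a small separable data set, the weighted-average output separates the data once T clears the √(log n)/γ threshold of Theorem 3.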

4.3. ACCELERATED PERCEPTRON VIA ERM

Since binary linear classification is a (perhaps the most) fundamental supervised machine learning problem, it can naturally be solved via empirical risk minimization (ERM). This idea is adopted by Ji et al. (2021), who consider the following ERM problem:

min_{v∈R^d} R(v) = (1/n) Σ_{i=1}^n ℓ(−y⁽ⁱ⁾ vᵀx⁽ⁱ⁾),    (7)

where ℓ(z) = exp(z) is the exponential loss. To minimize (7), Ji et al. (2021) propose a normalized momentum-based gradient descent (NMGD) algorithm, and show that NMGD converges to a classifier with a positive margin after O(√(log n)/γ) iterations. Based on the property of the normalized gradient, they also provide a primal-dual form of the algorithm (presented in the first box of Algorithm 2), which can be considered as applying Nesterov's accelerated gradient descent in the dual space (Ji et al., 2021).

Algorithm 2 Accelerated Perceptron of Ji et al. (2021)
Input: q_0 = 1_n/n, v_0 = 0, g_0 = 0.
for t = 1, . . . , T do
  Set θ_{t−1} = t/(2(t+1)), β_t = t/(t+1)
  v_t = v_{t−1} − θ_{t−1}(g_{t−1} − Aᵀq_{t−1})
  for i = 1, . . . , n do
    q_{t,i} = exp(−y⁽ⁱ⁾ v_tᵀx⁽ⁱ⁾) / Σ_{j=1}^n exp(−y⁽ʲ⁾ v_tᵀx⁽ʲ⁾)
  end for
  g_t = β_t (g_{t−1} − Aᵀq_t)
end for
Output: v_T

OL_w = OFTL[h_{t−1}(·), R^d] ⇔ w_t = argmin_{w∈R^d} Σ_{j=1}^{t−1} α_j h_j(w) + α_t h_{t−1}(w)
OL_p = FTRL+[E(·), 1_n/n, 1/4, ∆_n] ⇔ p_t = argmin_{p∈∆_n} ¼ Σ_{s=1}^{t} α_s ℓ_s(p) + D_E(p, 1_n/n)
Output: w̄_T

For the accelerated Perceptron of Ji et al. (2021), we again set g(w, p) = pᵀAw − ½‖w‖₂² and provide its equivalent form in the second box of Algorithm 2. Specifically, we have the following proposition.

Proposition 2. Let α_t = t. Then the two interpretations of the accelerated Perceptron in Algorithm 2 are the same, in the sense that q_T = p_T and v_T = ¼ Σ_{t=1}^T α_t w_t = (¼ Σ_{t=1}^T α_t) · w̄_T.

Remark Note that in Ji et al. (2021), the parameter θ_{t−1} is set to 1. We observe that this causes a mismatch between the weights and the w_t's in the output (when θ_{t−1} = 1, v_T becomes Σ_{t=1}^T α_{t+1} w_t instead of Σ_{t=1}^T α_t w_t).
Our no-regret analysis suggests that θ_{t−1} should instead be set to t/(2(t+1)), and later we will show that this choice helps eliminate the log t factor in the convergence rate. The proposition above reveals that the smooth Perceptron and the accelerated Perceptron of Ji et al. (2021) are closely related. The main difference is that the accelerated Perceptron of Ji et al. (2021) outputs the weighted sum of the w_t's, instead of the weighted average. The rationale behind this phenomenon is that, instead of the margin, Ji et al. (2021) use the normalized margin to measure the performance of the algorithm, defined as γ̄(v) = min_{p∈∆_n} pᵀAv/‖v‖₂. They not only prove γ̄(v_T) > 0 (which directly implies γ(v_T) > 0), but also show γ̄(v_T) = Ω(γ − log T/(γT²)). This guarantee is more powerful than the previous results, as it implies that the normalized margin can be maximized at an O(log t/t²) rate. As t approaches ∞, the direction of v_T converges to that of the maximal margin classifier. We show that a better margin-maximization rate can be directly obtained under our framework with the parameter setting in Algorithm 2.

Algorithm 3 NAG
Input: v_0 = 0, s_0 = 0.
for t = 1, . . . , T do
  u_t = s_{t−1} + v_{t−1}/(2(t−1))   // the second term is 0 at t = 1, since v_0 = 0
  v_t = v_{t−1} − η_t ∇R(u_t)
  s_t = s_{t−1} + v_t/(2(t+1))
end for
Output: s_T

OL_p = OFTRL[E(·), 1_n/n, 1/4, ℓ_{t−1}(·), ∆_n] ⇔ p_t = argmin_{p∈∆_n} ¼(Σ_{s=1}^{t−1} α_s ℓ_s(p) + α_t ℓ_{t−1}(p)) + D_E(p, 1_n/n)
OL_w = FTRL+[0, 0, 1, R^d] ⇔ w_t = argmin_{w∈R^d} Σ_{j=1}^{t} α_j h_j(w)
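For reference, the primal-dual form in the first box of Algorithm 2 can be sketched directly from its pseudo-code, with θ_{t−1} = t/(2(t+1)) and β_t = t/(t+1) as suggested above (the data-matrix convention and the stabilized softmax are our implementation choices):

```python
import numpy as np

def accelerated_perceptron_ji(A, T):
    """Primal-dual form of the accelerated Perceptron of Ji et al. (2021),
    transcribed from the first box of Algorithm 2. Rows of A are y^(i) x^(i),
    so the exponents -y^(i) v^T x^(i) stack into the vector -A v."""
    n, d = A.shape
    q = np.full(n, 1 / n)
    v = np.zeros(d)
    g = np.zeros(d)
    for t in range(1, T + 1):
        theta = t / (2 * (t + 1))
        beta = t / (t + 1)
        v = v - theta * (g - A.T @ q)
        z = -A @ v
        z = z - z.max()             # numerically stable softmax
        q = np.exp(z) / np.exp(z).sum()
        g = beta * (g - A.T @ q)
    return v
```

On a small separable data set, the output v_T attains a positive margin once T is comfortably above the √(log n)/γ threshold.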

5. BEYOND PERCEPTRONS

In this section, we show that our framework benefits a wide range of other problems.

5.1. MARGIN MAXIMIZATION BY NESTEROV'S ACCELERATED GRADIENT DESCENT

One major task of implicit-bias studies is to explain why commonly-used first-order optimization methods generalize well. For linearly separable data, previous work proves that GD and momentum-based GD prefer the ℓ₂-maximal margin classifier by showing that they can maximize the margin. In this part, we show that the well-known Nesterov's accelerated gradient descent (NAG), with appropriately chosen parameters, can maximize the margin at an O(1/t²) rate. Specifically, following previous work, we maximize the margin by solving the ERM problem in (7), applying NAG to minimize the objective function. The details of the algorithm are summarized in the first box of Algorithm 3, and its equivalent form under Protocol 1 is presented in the second box of Algorithm 3. For the NAG algorithm, we set the objective function as g(w, p) = pᵀAw − ½‖w‖₂², and have the following conclusion.

Theorem 6. Let η_t = t/R(u_t) and α_t = t. Then, the two expressions in Algorithm 3 are equivalent, in the sense that s_T = w̄_T. Moreover, min_{p∈∆_n} pᵀA s_T/‖s_T‖₂ ≥ γ − (8 log n + 2)/(T(T+1)γ).
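Algorithm 3 is straightforward to implement. Note that with η_t = t/R(u_t) and ∇R(u) = −(1/n)Aᵀexp(−Au), the step −η_t∇R(u_t) simplifies to t·Aᵀsoftmax(−Au_t); the sketch below uses that numerically stable rewriting (an algebraic identity, not a change of the algorithm):

```python
import numpy as np

def nag_margin(A, T):
    """Algorithm 3: NAG on the exponential-loss ERM (7), with the step size
    eta_t = t / R(u_t) of Theorem 6. Rows of A are y^(i) x^(i). At t = 1 the
    momentum term v_0 / (2(t-1)) vanishes since v_0 = 0."""
    n, d = A.shape
    v = np.zeros(d)
    s = np.zeros(d)
    for t in range(1, T + 1):
        u = s if t == 1 else s + v / (2 * (t - 1))
        z = -A @ u
        z = z - z.max()
        q = np.exp(z) / np.exp(z).sum()   # softmax(-A u_t)
        v = v + t * (A.T @ q)             # = v_{t-1} - eta_t * grad R(u_t)
        s = s + v / (2 * (t + 1))
    return s
```

On a small separable data set, s_T attains a positive (indeed near-maximal) margin, consistent with the O(1/T²) guarantee of Theorem 6.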



Theorem 4. Let α_t = t. Under Protocol 1, the regret of the two players of Algorithm 2 is bounded by R_w ≤ 2 Σ_{t=1}^T ‖p_t − p_{t−1}‖₁² and R_p ≤ 4 log n − 2 Σ_{t=1}^T ‖p_t − p_{t−1}‖₁². Moreover, w̄_T has a non-negative margin when T = Ω(√(log n)/γ), and the normalized margin satisfies min_{p∈∆_n} pᵀAv_T/‖v_T‖₂ = Ω(γ − 8 log n/(γT(T+1))).

4.4. PERCEPTRON OF YU ET AL. (2014)

Finally, Yu et al. (2014) consider applying the mirror-prox algorithm to solve the max-min optimization problem. As shown by Rakhlin & Sridharan (2013), mirror-prox can be recovered by their optimistic online mirror descent (OMD) with an appropriately chosen optimistic term. Here, we show that, by simply manipulating the notation, the algorithm of Yu et al. (2014) can be recovered as two players applying OMD to solve (8). Due to the page limit, the algorithmic details and their analysis are postponed to Appendix C.5; we summarize them as follows (formalized in Theorems 8 and 9).

Theorem 5 (informal). Let g(w, p) = pᵀAw. The Perceptron of Yu et al. (2014) can be described under Protocol 1, where both players apply OMD endowed with appropriate parameters. Moreover, under Assumption 1, w̄_T has a non-negative margin when T = Ω(√(log n)/γ).

ACKNOWLEDGMENTS

We gratefully thank the AI4OPT Institute for funding, as part of NSF Award 2112533. We also acknowledge the NSF for their support through Award IIS-1910077, as well as an ARC-ACO fellowship provided by Georgia Tech.


Algorithm 4 Accelerated algorithm for the p-norm Perceptron
OL_w = OFTRL ⇔ w_t = argmin_{w∈R^d} η_w(Σ_{j=1}^{t−1} α_j h_j(w) + α_t h_{t−1}(w)) + D_{(1/(2(q−1)))‖·‖_q²}(w, 0)

Remark Theorem 6 indicates that NAG can also maximize the margin at an O(1/T²) rate. Note that, in Algorithm 3, the p-player plays first and the w-player second, while in Algorithm 2 it is the other way around. This difference makes sense, as Algorithm 2 can be considered as applying Nesterov's acceleration in the dual space (Ji et al., 2021), while Algorithm 3 uses Nesterov's acceleration in the primal space.

5.2. ACCELERATED p-NORM PERCEPTRON

In this section, we focus on the p-norm Perceptron problem, introduced by Gentile (2000). Compared to the classical Perceptron problem, the p-norm Perceptron adopts the following assumption, which is more general than Assumption 1.

Assumption 2. For the p-norm Perceptron problem, we assume: ∀i ∈ [n], ‖x⁽ⁱ⁾‖_p ≤ 1, where p ∈ [2, ∞). Moreover, assume there exists a w* ∈ R^d such that ‖w*‖_q ≤ 1 and min_{p∈∆_n} pᵀAw* ≥ γ. Here, ‖·‖_q is the dual norm of ‖·‖_p; i.e., 1/p + 1/q = 1 and q ∈ (1, 2].

Under Assumption 2, Gentile (2000) proposes a mirror-descent-style algorithm that achieves an O((p−1)/γ²) convergence rate. In the following, we provide a new algorithm with a better rate. Specifically, under Protocol 1, we define the objective function as g(w, p) = pᵀAw, and introduce Algorithm 4, wherein the w-player uses OFTRL with the regularizer (1/(2(q−1)))‖·‖_q², which is 1-strongly convex w.r.t. the q-norm (Orabona, 2019), while the p-player employs the FTRL+ algorithm. We have the following result.

Theorem 7. Let α_t = 1, η_p = 1/η_w, and η_w = 1/√(2(q−1) log n). Then the output w̄_T of Algorithm 4 has a non-negative margin when T = Ω(√(2(p−1) log n)/γ).
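For comparison, the mirror-descent-style baseline of Gentile can be sketched as a mistake-driven loop through the link function ∇(½‖·‖_p²), which maps the dual accumulator θ to a weight vector w with ‖w‖_q = ‖θ‖_p (details such as the stopping rule are our choices for illustration):

```python
import numpy as np

def pnorm_link(theta, p):
    """Gradient of 0.5 * ||theta||_p^2. For p = 2 this is the identity, so
    the loop below then reduces to the vanilla Perceptron."""
    norm = np.linalg.norm(theta, ord=p)
    if norm == 0:
        return np.zeros_like(theta)
    return np.sign(theta) * np.abs(theta) ** (p - 1) / norm ** (p - 2)

def pnorm_perceptron(A, p, max_updates=10_000):
    """Mistake-driven p-norm Perceptron in the style of Gentile: on a
    violated example, add its row to the dual accumulator theta and re-map
    through the link function. A baseline sketch of the O((p-1)/gamma^2)
    method this section improves on."""
    theta = np.zeros(A.shape[1])
    for _ in range(max_updates):
        w = pnorm_link(theta, p)
        margins = A @ w
        i = int(np.argmin(margins))
        if margins[i] > 0:
            return w                 # all examples strictly separated
        theta = theta + A[i]         # update on the worst violated example
    return pnorm_link(theta, p)
```

The classical mistake bound guarantees the loop terminates on separable data; Algorithm 4 above replaces this greedy scheme with OFTRL/FTRL+ play to reach the faster √((p−1) log n)/γ rate.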

6. CONCLUSION

In this paper, we provide a unified analysis for the existing accelerated Perceptrons, and obtain improved results for a series of problems. In the future, we will explore how to extend our framework to other closely related areas, such as semi-definite programming (Garber & Hazan, 2011) and generalized margin maximization (Sun et al., 2022). Another interesting direction is to investigate whether other, more advanced online learning algorithms (Orabona & Pál, 2016; Cutkosky & Orabona, 2018; Jun et al., 2017; Zhang et al., 2018; Wang et al., 2020; Zhao et al., 2020) are useful for the Perceptron and for implicit-bias analysis under our framework.

