ON ACCELERATED PERCEPTRONS AND BEYOND

Abstract

The classical Perceptron algorithm of Rosenblatt can be used to find a linear threshold function that correctly classifies n linearly separable data points, assuming the classes are separated by some margin γ > 0. A foundational result is that the Perceptron converges after O(1/γ²) iterations. Several recent works have improved this rate by a quadratic factor, to O(√(log n)/γ), with more sophisticated algorithms. In this paper, we unify these existing results under one framework by showing that they can all be described through the lens of solving min-max problems with modern acceleration techniques, mainly through optimistic online learning. We then show that the proposed framework also leads to improved results for a series of problems beyond the standard Perceptron setting. Specifically, a) for the margin maximization problem, we improve the state-of-the-art rate from O(log t/t²) to O(1/t²), where t is the number of iterations; b) we provide the first result on the implicit bias of the classical Nesterov's accelerated gradient descent (NAG) algorithm, showing that NAG can maximize the margin at an O(1/t²) rate; c) for the classical p-norm Perceptron problem, we provide an algorithm with an O(√((p − 1) log n)/γ) convergence rate, whereas existing algorithms suffer an O((p − 1)/γ²) rate.

1. INTRODUCTION

In this paper, we revisit the problem of learning a linear classifier, one of the most important and fundamental tasks in machine learning (Bishop, 2007). In this problem, we are given a set S of n training examples, and the goal is to find a linear classifier that correctly separates S as fast as possible. The most well-known algorithm is the Perceptron (Rosenblatt, 1958), which converges to a perfect (mistake-free) classifier after O(1/γ²) iterations, provided the data is linearly separable with some margin γ > 0 (Novikoff, 1962). Over subsequent decades, many variants of the Perceptron have been developed (Aizerman, 1964; Littlestone, 1988; Wendemuth, 1995; Freund & Schapire, 1999; Cesa-Bianchi et al., 2005, to name a few). However, somewhat surprisingly, there has been little progress in substantially improving the fundamental iteration bound of Novikoff (1962). Only recently have a number of researchers discovered accelerated variants of the Perceptron with a faster O(√(log n)/γ) iteration complexity, albeit with a higher per-iteration cost. These works model the problem in different ways, e.g., as a non-smooth optimization problem or an empirical risk minimization task, and they establish faster rates using sophisticated optimization tools. Soheili & Pena (2012) put forward the smooth Perceptron, framing the objective as a non-smooth strongly-concave maximization and then applying Nesterov's excessive gap technique (NEG, Nesterov, 2005). Yu et al. (2014) proposed the accelerated Perceptron by furnishing a convex-concave objective that can be solved via the mirror-prox method (Nemirovski, 2004). Ji et al. (2021) put forward a third interpretation, obtaining the accelerated rate by minimizing the empirical risk under the exponential loss with a momentum-based normalized gradient descent algorithm.
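As a concrete reference point, the classical algorithm admits a few-line implementation. The sketch below is a minimal plain-Python version (our own illustrative code, with hypothetical names): it makes one additive update per mistake, and Novikoff's theorem guarantees at most (R/γ)² total mistakes when every example has norm at most R.

```python
def perceptron(X, y, max_passes=1000):
    """Classical Perceptron (Rosenblatt, 1958) -- minimal sketch.

    X: list of feature vectors (lists of floats); y: labels in {-1, +1}.
    Cycles over the data, updating w on each mistake, until a pass
    makes no mistakes (guaranteed for linearly separable data).
    """
    d = len(X[0])
    w = [0.0] * d
    for _ in range(max_passes):
        mistake = False
        for xi, yi in zip(X, y):
            if yi * sum(wj * xj for wj, xj in zip(w, xi)) <= 0:
                # mistake: move w toward the misclassified example
                w = [wj + yi * xj for wj, xj in zip(w, xi)]
                mistake = True
        if not mistake:
            return w  # perfect (mistake-free) separator found
    return w
```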
Following this line of research, in this paper we present a unified analysis framework that reveals the exact relationship among these methods, which share the same order of convergence rate. Moreover, we show that the proposed framework also leads to improved results for various problems beyond the standard Perceptron setting. Specifically, we consider a general zero-sum game that involves two players (Abernethy et al., 2018): a main player that chooses the classifier, and an auxiliary player that picks a distribution over the data. The two players compete with each other by running no-regret online learning algorithms (Hazan, 2016; Orabona, 2019), and the goal is to find the equilibrium of some convex-concave function. We show that, under this dynamic, all of the existing accelerated Perceptrons find equivalent forms. In particular, these Perceptrons can be described as a dynamic in which the two players solve the game by performing optimistic online learning strategies (Rakhlin & Sridharan, 2013), one of the most important classes of algorithms in online learning. Note that implementing online learning algorithms (even optimistic strategies) to solve zero-sum games has already been extensively explored (e.g., Rakhlin & Sridharan, 2013; Daskalakis et al., 2018; Wang & Abernethy, 2018; Daskalakis & Panageas, 2019). However, we emphasize that our main novelty lies in showing that all of the existing accelerated Perceptrons, developed with advanced algorithms from different areas, can be perfectly described under this unified framework. This greatly simplifies the analysis of accelerated Perceptrons, as their convergence rates can now be obtained simply by plugging in off-the-shelf regret bounds of optimistic online learning algorithms. Moreover, the unified framework reveals a close connection between the smooth Perceptron and the accelerated Perceptron of Ji et al. (2021):

Theorem 1 (informal).
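To make the two-player dynamic concrete, here is a minimal sketch (our own illustrative code, not the exact algorithm of any of the papers above) for the bilinear game min_p max_{||w|| ≤ 1} Σ_i p_i · y_i · ⟨x_i, w⟩: the auxiliary player runs optimistic multiplicative weights over the n examples, reusing the last loss vector as its prediction of the next one; the main player best-responds over the unit ℓ2-ball; and the output is the average of the main player's iterates.

```python
import math

def game_dynamic(X, y, T=100, eta=0.5):
    """Illustrative no-regret dynamic for the Perceptron zero-sum game.

    Auxiliary player: optimistic multiplicative weights over examples,
    concentrating mass on hard (small-margin) points.
    Main player: best response w_t = m_t / ||m_t||, where m_t is the
    p_t-weighted mean of the signed examples y_i * x_i.
    Returns the average of the main player's iterates.
    """
    n, d = len(X), len(X[0])
    signed = [[yi * xj for xj in xi] for xi, yi in zip(X, y)]
    cum = [0.0] * n    # cumulative losses seen by the auxiliary player
    prev = [0.0] * n   # last loss vector, used as the optimistic hint
    w_avg = [0.0] * d
    for _ in range(T):
        hinted = [cum[i] + prev[i] for i in range(n)]
        mx = max(-eta * h for h in hinted)         # for numerical stability
        wts = [math.exp(-eta * h - mx) for h in hinted]
        z = sum(wts)
        p = [wi / z for wi in wts]
        # main player's best response over the unit l2-ball
        m = [sum(p[i] * signed[i][j] for i in range(n)) for j in range(d)]
        nrm = math.sqrt(sum(mj * mj for mj in m)) or 1.0
        w = [mj / nrm for mj in m]
        # the per-example margins are the losses fed back to the auxiliary player
        prev = [sum(sj * wj for sj, wj in zip(signed[i], w)) for i in range(n)]
        cum = [cum[i] + prev[i] for i in range(n)]
        w_avg = [aj + wj / T for aj, wj in zip(w_avg, w)]
    return w_avg
```

On separable data the averaged iterate approaches an approximate equilibrium, and in particular separates the data; the accelerated methods discussed above replace this vanilla dynamic with optimistic strategies on both sides and a careful weighting of iterates.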
The smooth Perceptron and the accelerated Perceptron of Ji et al. (2021) can be described as a dynamic in which the two players employ the optimistic follow-the-regularized-leader (OFTRL) algorithm. The main difference is that the smooth Perceptron outputs the weighted average of the main player's historical decisions, while the accelerated Perceptron of Ji et al. (2021) outputs the weighted sum.

Beyond providing a deeper understanding of accelerated Perceptrons, our framework also yields improved new results in several other important areas:

• Implicit bias analysis. The seminal work of Soudry et al. (2018) shows that, for linearly separable data, minimizing the empirical risk with vanilla gradient descent (GD) gives a classifier which not only has zero training error (and thus can be used for linear separation), but also maximizes the margin. This phenomenon characterizes the implicit bias of GD, as it implicitly prefers the (ℓ2-)maximal margin classifier among all classifiers with a positive margin, and analyzing the implicit bias has become an important tool for understanding why classical optimization methods generalize well in supervised machine learning.
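The implicit-bias phenomenon of GD is easy to reproduce numerically. The sketch below (our own minimal example, not Soudry et al.'s exact setup) runs plain gradient descent on the empirical exponential loss of a linear model and reports the margin of the normalized iterate w_t/||w_t||, which is positive after a few steps and slowly approaches the maximal ℓ2-margin.

```python
import math

def gd_exp_loss(X, y, steps=2000, lr=0.1):
    """Plain GD on the empirical exponential loss sum_i exp(-y_i <x_i, w>).

    Returns the normalized iterate and its (normalized) minimum margin,
    illustrating the implicit bias of GD toward the max-margin direction.
    """
    d = len(X[0])
    w = [0.0] * d
    for _ in range(steps):
        grad = [0.0] * d
        for xi, yi in zip(X, y):
            m = yi * sum(wj * xj for wj, xj in zip(w, xi))
            c = math.exp(-m)  # derivative weight of the exponential loss
            for j in range(d):
                grad[j] -= c * yi * xi[j]
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    nrm = math.sqrt(sum(wj * wj for wj in w))
    margin = min(yi * sum(wj * xj for wj, xj in zip(w, xi)) / nrm
                 for xi, yi in zip(X, y))
    return [wj / nrm for wj in w], margin
```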

2. RELATED WORK

This section briefly reviews the related work on Perceptron algorithms, implicit-bias analysis, and game theory. The background knowledge on (optimistic) online learning is presented in the Preliminaries (Section 3).



The state-of-the-art algorithm is proposed by Ji et al. (2021), who show that their momentum-based GD has an O(log t/t²) margin-maximization rate. In this paper, we make two contributions in this direction:

1. We show that, under our analysis framework, we can easily improve the margin-maximization rate of the algorithm of Ji et al. (2021) from O(log t/t²) to O(1/t²);

2. Although previous work has analyzed the implicit bias of GD and momentum-based GD, it remains unclear how the classical Nesterov's accelerated gradient descent (NAG, Nesterov, 1988) affects the implicit bias. In this paper, through our framework, we show that NAG with appropriately chosen parameters also enjoys an O(1/t²) margin-maximization rate. To our knowledge, this is the first time the implicit bias property of NAG has been proved.

• p-norm Perceptron. Traditional work on Perceptrons typically assumes the feature vectors lie in an ℓ2-ball. A more general setting is considered by Gentile (2000), who assumes the feature vectors lie inside an ℓp-ball, with p ∈ [2, ∞). Their proposed algorithm requires O(p/γ²) iterations to find a zero-error classifier. In this paper, we develop a new Perceptron algorithm for this problem under our framework, based on optimistic strategies, and show that it enjoys an accelerated O(√(p log n)/γ) rate.
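For intuition, a mistake-driven p-norm update in the spirit of Gentile (2000) can be sketched as follows. This is our own simplified variant, not the accelerated algorithm, and link-function conventions (which exponent plays which role) vary across papers: here the algorithm accumulates mistakes in a dual vector θ and predicts with the gradient of (1/2)||θ||_q², where q = p/(p − 1) is the conjugate exponent.

```python
import math

def pnorm_perceptron(X, y, p=3.0, max_passes=100):
    """Sketch of a mirror-descent-style p-norm Perceptron.

    theta is updated additively on mistakes (as in the classical
    Perceptron); predictions use the link function
        w_j = sign(theta_j) * |theta_j|**(q-1) / ||theta||_q**(q-2),
    i.e., the gradient of (1/2)||theta||_q^2 with 1/p + 1/q = 1.
    """
    q = p / (p - 1.0)
    d = len(X[0])
    theta = [0.0] * d

    def link(th):
        norm_q = sum(abs(v) ** q for v in th) ** (1.0 / q)
        if norm_q == 0.0:
            return [0.0] * d
        return [math.copysign(abs(v) ** (q - 1), v) / norm_q ** (q - 2)
                for v in th]

    for _ in range(max_passes):
        mistakes = 0
        for xi, yi in zip(X, y):
            w = link(theta)
            if yi * sum(wj * xj for wj, xj in zip(w, xi)) <= 0:
                theta = [tj + yi * xj for tj, xj in zip(theta, xi)]
                mistakes += 1
        if mistakes == 0:
            return link(theta)  # mistake-free pass: done
    return link(theta)
```

For p = 2 the link function is the identity and the classical Perceptron is recovered; our accelerated variant instead couples this geometry with optimistic strategies for both players.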

