ON ACCELERATED PERCEPTRONS AND BEYOND

Abstract

The classical Perceptron algorithm of Rosenblatt can be used to find a linear threshold function that correctly classifies n linearly separable data points, assuming the classes are separated by some margin γ > 0. A foundational result is that Perceptron converges after O(1/γ²) iterations. Several recent works have improved this rate by a quadratic factor, to O(√log n/γ), with more sophisticated algorithms. In this paper, we unify these existing results under one framework by showing that they can all be described through the lens of solving min-max problems using modern acceleration techniques, mainly through optimistic online learning. We then show that the proposed framework also leads to improved results for a series of problems beyond the standard Perceptron setting. Specifically, a) for the margin maximization problem, we improve the state-of-the-art result from O(log t/t²) to O(1/t²), where t is the number of iterations; b) we provide the first result on identifying the implicit bias property of the classical Nesterov's accelerated gradient descent (NAG) algorithm, and show NAG can maximize the margin with an O(1/t²) rate; c) for the classical p-norm Perceptron problem, we provide an algorithm with an O(√((p − 1) log n)/γ) convergence rate, while existing algorithms suffer an O((p − 1)/γ²) convergence rate.

1. INTRODUCTION

In this paper, we revisit the problem of learning a linear classifier, one of the most important and fundamental tasks in machine learning (Bishop, 2007). In this problem, we are given a set S of n training examples, and the goal is to find a linear classifier that correctly separates S as fast as possible. The most well-known algorithm is Perceptron (Rosenblatt, 1958), which converges to a perfect (mistake-free) classifier after O(1/γ²) iterations, provided the data is linearly separable with some margin γ > 0 (Novikoff, 1962). Over subsequent decades, many variants of Perceptron have been developed (Aizerman, 1964; Littlestone, 1988; Wendemuth, 1995; Freund & Schapire, 1999; Cesa-Bianchi et al., 2005, to name a few). However, somewhat surprisingly, there has been little progress in substantially improving the fundamental Perceptron iteration bound presented by Novikoff (1962). It is only recently that a number of researchers have discovered accelerated variants of Perceptron with a faster O(√log n/γ) iteration complexity, albeit with a higher per-iteration cost. These works model the problem in different ways, e.g., as a non-smooth optimization problem or an empirical risk minimization task, and they have established faster rates using sophisticated optimization tools. Soheili & Pena (2012) put forward the smooth Perceptron, framing the objective as a non-smooth strongly-concave maximization and then applying Nesterov's excessive gap technique (NEG, Nesterov, 2005). Yu et al. (2014) proposed the accelerated Perceptron by furnishing a convex-concave objective that can be solved via the mirror-prox method (Nemirovski, 2004). Ji et al. (2021) put forward a third interpretation, obtaining the accelerated rate by minimizing the empirical risk under the exponential loss with a momentum-based normalized gradient descent algorithm.
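To make the baseline concrete, the classical Perceptron update discussed above can be sketched as follows. This is a minimal illustrative NumPy implementation (not code from any of the cited works); the function name and the `max_iters` safeguard are our own. Under separability with margin γ and data radius R = max_i ||x_i||, Novikoff's analysis bounds the total number of updates by (R/γ)².

```python
import numpy as np

def perceptron(X, y, max_iters=1000):
    """Classical Perceptron (Rosenblatt, 1958).

    X: (n, d) array of examples; y: (n,) labels in {-1, +1}.
    Repeatedly sweeps the data, applying the additive update
    w <- w + y_i x_i on each mistake, until a pass is mistake-free.
    """
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(max_iters):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (w @ xi) <= 0:  # misclassified (or on the boundary)
                w += yi * xi        # additive Perceptron update
                mistakes += 1
        if mistakes == 0:           # data perfectly separated: stop
            break
    return w
```

The accelerated variants surveyed above replace this simple additive update with smoothed or momentum-based updates, which is what buys the quadratic improvement in the dependence on γ at a higher per-iteration cost.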
Following this line of research, in this paper we present a unified analysis framework that reveals the precise relationship among these methods, all of which share the same order of convergence rate. Moreover, we show that the proposed framework also leads to improved results for various problems beyond the standard Perceptron setting. Specifically, we consider a general zero-sum game that involves

