APPROXIMATE VANISHING IDEAL COMPUTATIONS AT SCALE

Abstract

The vanishing ideal of a set of points X = {x_1, . . . , x_m} ⊆ R^n is the set of polynomials that evaluate to 0 over all points x ∈ X and admits an efficient representation by a finite subset of generators. In practice, to accommodate noise in the data, algorithms that construct generators of the approximate vanishing ideal are widely studied, but their computational complexities remain expensive. In this paper, we scale up the oracle approximate vanishing ideal algorithm (OAVI), the only generator-constructing algorithm with known learning guarantees. We prove that the computational complexity of OAVI is not superlinear, as previously claimed, but linear in the number of samples m. In addition, we propose two modifications that accelerate OAVI's training time: first, our analysis reveals that replacing the pairwise conditional gradients algorithm, one of the solvers used in OAVI, with the faster blended pairwise conditional gradients algorithm leads to an exponential speed-up in the number of features n. Second, using a new inverse Hessian boosting approach, intermediate convex optimization problems can be solved almost instantly, improving OAVI's training time by multiple orders of magnitude in a variety of numerical experiments.

1. INTRODUCTION

High-quality features are essential for the success of machine-learning algorithms (Guyon & Elisseeff, 2003), and as a consequence, feature transformation and selection algorithms are an important area of research (Kusiak, 2001; Van Der Maaten et al., 2009; Abdi & Williams, 2010; Paul et al., 2021; Manikandan & Abirami, 2021; Carderera et al., 2021). A recently popularized technique for extracting nonlinear features from data is the concept of the vanishing ideal (Heldt et al., 2009; Livni et al., 2013), which lies at the intersection of machine learning and computer algebra. Unlike conventional machine learning, which relies on a manifold assumption, vanishing ideal computations are based on an algebraic set[1] assumption, for which powerful theoretical guarantees are known (Vidal et al., 2005; Livni et al., 2013; Globerson et al., 2017). The core concept of vanishing ideal computations is that any data set X = {x_1, . . . , x_m} ⊆ R^n can be described by its vanishing ideal, I_X = {g ∈ P | g(x) = 0 for all x ∈ X}, where P is the polynomial ring over R in n variables. Although I_X contains infinitely many polynomials, there exists a finite set of generators of I_X, g_1, . . . , g_k ∈ I_X with k ∈ N, such that any polynomial h ∈ I_X can be written as h = sum_{i=1}^k g_i h_i, where h_i ∈ P for all i ∈ {1, . . . , k} (Cox et al., 2013). Thus, the generators share any sample x ∈ X as a common root, capture the nonlinear structure of the data, and succinctly represent the vanishing ideal I_X. Due to noise in empirical data, we are interested in constructing generators of the approximate vanishing ideal, the ideal generated by the set of polynomials that approximately evaluate to 0 for all x ∈ X and whose leading term coefficient is 1; see Definition 2.2.
For classification tasks, constructed generators can, for example, be used to transform the features of the data set X ⊆ R^n such that the data becomes linearly separable (Livni et al., 2013), and training a linear kernel support vector machine (SVM) (Suykens & Vandewalle, 1999) on the feature-transformed data results in excellent classification accuracy. Various algorithms for the construction of generators of the approximate vanishing ideal exist (Heldt et al., 2009; Fassino, 2010; Limbeck, 2013; Livni et al., 2013; Iraji & Chitsaz, 2017; Kera & Hasegawa, 2020; Kera, 2022), but among them, the oracle approximate vanishing ideal algorithm (OAVI) (Wirth & Pokutta, 2022) is the only one capable of constructing sparse generators and admitting learning guarantees. More specifically, CGAVI (OAVI with Frank-Wolfe algorithms (Frank & Wolfe, 1956), a.k.a. conditional gradients algorithms (CG) (Levitin & Polyak, 1966), as a solver) exploits the sparsity-inducing properties of CG to construct sparse generators and, thus, a robust and interpretable corresponding feature transformation. Furthermore, generators constructed with CGAVI vanish on out-of-sample data, and the combined approach of transforming features with CGAVI for a subsequently applied linear kernel SVM inherits the margin bound of the SVM (Wirth & Pokutta, 2022). Despite OAVI's various appealing properties, the computational complexities of vanishing ideal algorithms for the construction of generators of the approximate vanishing ideal are superlinear in the number of samples m. With training times that increase at least cubically with m, vanishing ideal algorithms have yet to be applied to large-scale machine-learning problems.
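The feature-transformation idea can be sketched in a few lines of numpy. In this toy example (our own construction, standing in for the generator-based transformation described above), two concentric circles are not linearly separable in R^2, but mapping each point x to |g(x)| for a generator g of the inner class makes a linear decision rule, here a plain threshold in place of a linear SVM, separate them perfectly:

```python
import numpy as np

# Two classes that are not linearly separable in R^2:
# class 0 on the unit circle, class 1 on a circle of radius 2.
rng = np.random.default_rng(1)
theta = rng.uniform(0, 2 * np.pi, size=(2, 200))
X0 = np.stack([np.cos(theta[0]), np.sin(theta[0])], axis=1)
X1 = 2.0 * np.stack([np.cos(theta[1]), np.sin(theta[1])], axis=1)

# Hypothetical generator constructed from class 0:
# g vanishes on X0 but not on X1.
def g(x):
    return x[:, 0] ** 2 + x[:, 1] ** 2 - 1.0

# Transformed feature: |g(x)|. A linear classifier (here a simple
# threshold) separates the classes in this 1-D feature space.
f0, f1 = np.abs(g(X0)), np.abs(g(X1))
threshold = 1.0
acc = (np.mean(f0 < threshold) + np.mean(f1 >= threshold)) / 2
print(acc)
```

In practice, one evaluates all constructed generators on each sample to obtain the transformed feature vector and trains a linear kernel SVM on top; the threshold here merely illustrates why linear separability can emerge.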

1.1. CONTRIBUTIONS

In this paper, we improve and study the scalability of OAVI.

Linear computational complexity in m. Up until now, the analysis of the computational complexities of approximate vanishing ideal algorithms assumed that generators need to vanish exactly, which gave an overly pessimistic estimate of the computational cost. For OAVI, we exploit the fact that generators only have to vanish approximately and prove that the computational complexity of OAVI is not superlinear but linear in the number of samples m and polynomial in the number of features n.

Solver improvements. OAVI repeatedly calls a solver of quadratic convex optimization problems to construct generators. By replacing the pairwise conditional gradients algorithm (PCG) (Lacoste-Julien & Jaggi, 2015) with the faster blended pairwise conditional gradients algorithm (BPCG) (Tsuji et al., 2022), we improve the dependence of the time complexity of OAVI on the number of features n by an exponential factor.

Inverse Hessian boosting (IHB). OAVI solves a series of quadratic convex optimization problems that differ only slightly, which allows us to efficiently maintain and update the inverses of the corresponding Hessians. Inverse Hessian boosting (IHB) then refers to the procedure of passing a starting vector, computed with inverse Hessian information and close to the optimal solution, to the convex solver used in OAVI. Empirically, IHB speeds up the training time of OAVI by multiple orders of magnitude.

Large-scale numerical experiments. We perform numerical experiments on data sets of up to two million samples, highlighting that OAVI is an excellent large-scale feature transformation method.
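The inverse-maintenance idea behind IHB can be illustrated as follows. In this sketch (our own simplification, with assumed setup: the Hessian of each quadratic subproblem has the form A_k = M_k^T M_k for an evaluation matrix M_k that grows by one column per iteration), appending a column updates the inverse via block-matrix inversion with a Schur complement in O(k^2) time, instead of recomputing it from scratch in O(k^3):

```python
import numpy as np

def update_inverse(A_inv, M, c):
    """Inverse of [[A, b], [b^T, d]] from A_inv, where A = M^T M,
    b = M^T c, and d = c^T c, via the Schur complement."""
    b = M.T @ c
    d = float(c @ c)
    u = A_inv @ b
    s = d - b @ u  # Schur complement; positive if [M, c] has full column rank
    top_left = A_inv + np.outer(u, u) / s
    top_right = -u[:, None] / s
    return np.block([[top_left, top_right],
                     [top_right.T, np.array([[1.0 / s]])]])

# Compare the O(k^2) update against a fresh O(k^3) inverse.
rng = np.random.default_rng(2)
M = rng.normal(size=(50, 5))       # current evaluation matrix
c = rng.normal(size=50)            # newly appended column
A_inv = np.linalg.inv(M.T @ M)     # inverse Hessian before the update
fast = update_inverse(A_inv, M, c)
exact = np.linalg.inv(np.hstack([M, c[:, None]]).T @ np.hstack([M, c[:, None]]))
print(np.allclose(fast, exact))
```

The maintained inverse then yields a near-optimal starting vector (e.g., the unconstrained least-squares solution) to hand to the convex solver, which is the "boosting" step of IHB.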

1.2. RELATED WORK

The Buchberger-Möller algorithm was the first method for constructing generators of the vanishing ideal (Möller & Buchberger, 1982). Its high susceptibility to noise was addressed by Heldt et al. (2009) with the approximate vanishing ideal algorithm (AVI); see also Fassino (2010); Limbeck (2013). The latter introduced two algorithms that construct generators term by term instead of degree-wise as AVI does: the approximate Buchberger-Möller algorithm (ABM) and the border bases approximate Buchberger-Möller algorithm. The aforementioned algorithms are monomial-aware, that is, they require an explicit ordering of terms and construct generators as linear combinations of monomials. However, monomial-awareness is an unattractive property: changing the order of the features changes the outputs of the algorithms. Monomial-agnostic approaches such as vanishing component analysis (VCA) (Livni et al., 2013) do not suffer from this shortcoming, as they construct generators as linear combinations of polynomials. VCA has found success in hand posture recognition, solution selection using genetic programming, principal variety analysis for nonlinear data modeling, and independent signal estimation for blind source separation (Zhao & Song, 2014; Kera & Iba, 2016; Iraji & Chitsaz, 2017; Wang & Ohtsuki, 2018). The disadvantage of foregoing the term ordering is that VCA sometimes constructs multiple orders of magnitude more generators



[1] A set X ⊆ R^n is algebraic if it is the set of common roots of a finite set of polynomials.

