APPROXIMATE VANISHING IDEAL COMPUTATIONS AT SCALE

Abstract

The vanishing ideal of a set of points $X = \{\mathbf{x}_1, \ldots, \mathbf{x}_m\} \subseteq \mathbb{R}^n$ is the set of polynomials that evaluate to 0 over all points $\mathbf{x} \in X$; it admits an efficient representation by a finite set of generators. In practice, to accommodate noise in the data, algorithms that construct generators of the approximate vanishing ideal are widely studied, but they remain computationally expensive. In this paper, we scale up the oracle approximate vanishing ideal algorithm (OAVI), the only generator-constructing algorithm with known learning guarantees. We prove that the computational complexity of OAVI is not superlinear, as previously claimed, but linear in the number of samples $m$. In addition, we propose two modifications that accelerate OAVI's training time. First, our analysis reveals that replacing the pairwise conditional gradients algorithm, one of the solvers used in OAVI, with the faster blended pairwise conditional gradients algorithm leads to an exponential speed-up in the number of features $n$. Second, using a new inverse Hessian boosting approach, intermediate convex optimization problems can be solved almost instantly, improving OAVI's training time by multiple orders of magnitude in a variety of numerical experiments.

1. INTRODUCTION

High-quality features are essential for the success of machine-learning algorithms (Guyon & Elisseeff, 2003) and, as a consequence, feature transformation and selection algorithms are an important area of research (Kusiak, 2001; Van Der Maaten et al., 2009; Abdi & Williams, 2010; Paul et al., 2021; Manikandan & Abirami, 2021; Carderera et al., 2021). A recently popularized technique for extracting nonlinear features from data is the concept of the vanishing ideal (Heldt et al., 2009; Livni et al., 2013), which lies at the intersection of machine learning and computer algebra. Unlike conventional machine learning, which relies on a manifold assumption, vanishing ideal computations are based on an algebraic set assumption, for which powerful theoretical guarantees are known (Vidal et al., 2005; Livni et al., 2013; Globerson et al., 2017). The core concept of vanishing ideal computations is that any data set $X = \{\mathbf{x}_1, \ldots, \mathbf{x}_m\} \subseteq \mathbb{R}^n$ can be described by its vanishing ideal, $I_X = \{g \in \mathcal{P} \mid g(\mathbf{x}) = 0 \text{ for all } \mathbf{x} \in X\}$, where $\mathcal{P}$ is the polynomial ring over $\mathbb{R}$ in $n$ variables. Despite $I_X$ containing infinitely many polynomials, there exists a finite number of generators of $I_X$, $g_1, \ldots, g_k \in I_X$ with $k \in \mathbb{N}$, such that any polynomial $h \in I_X$ can be written as $h = \sum_{i=1}^{k} g_i h_i$, where $h_i \in \mathcal{P}$ for all $i \in \{1, \ldots, k\}$ (Cox et al., 2013). Thus, the generators share any sample $\mathbf{x} \in X$ as a common root, capture the nonlinear structure of the data, and succinctly represent the vanishing ideal $I_X$. Due to noise in empirical data, we are interested in constructing generators of the approximate vanishing ideal, the ideal generated by the set of polynomials that approximately evaluate to 0 for all $\mathbf{x} \in X$ and whose leading term coefficient is 1, see Definition 2.2.
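To make the definition of the vanishing ideal concrete, the following minimal sketch checks numerically that a polynomial and a polynomial multiple of it both vanish on a toy data set. The data set (eight points on the unit circle) and the names `g`, `h`, `g_values`, and `h_values` are illustrative assumptions, not taken from the paper.

```python
import numpy as np

# Hypothetical toy data: m = 8 points on the unit circle in R^2.
angles = np.linspace(0.0, 2.0 * np.pi, num=8, endpoint=False)
X = np.column_stack([np.cos(angles), np.sin(angles)])


def g(x):
    """g(x) = x_1^2 + x_2^2 - 1, which evaluates to 0 on every point of X."""
    return x[0] ** 2 + x[1] ** 2 - 1.0


def h(x):
    """h = g * h_1 with h_1(x) = x_1 + 3, so h lies in the ideal generated by g."""
    return g(x) * (x[0] + 3.0)


# Evaluation vectors g(X) and h(X) over the data set.
g_values = np.array([g(x) for x in X])
h_values = np.array([h(x) for x in X])
```

Both evaluation vectors are zero up to floating-point error, so both polynomials belong to the vanishing ideal of this toy data set.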
For classification tasks, constructed generators can, for example, be used to transform the features of the data set X ⊆ R n such that the data becomes linearly separable (Livni et al., 2013) and training a linear kernel support vector machine (SVM) (Suykens & Vandewalle, 1999) on the feature-transformed data results in excellent classification accuracy. Various algorithms for the construction of generators of the approximate vanishing ideal exist (Heldt et al., 2009; Fassino, 2010; Limbeck, 2013; Livni et al., 2013; Iraji & Chitsaz, 2017; Kera & Hasegawa, 2020; Kera, 2022) , but among them, the oracle approximate vanishing ideal algorithm (OAVI) (Wirth & Pokutta, 2022) is the only one capable of constructing sparse generators and admitting learning guarantees. More specifically, CGAVI (OAVI with Frank-Wolfe algorithms (Frank & Wolfe, 1956) , a.k.a. conditional gradients algorithms (CG) (Levitin & Polyak, 1966) as a solver) exploits the sparsity-inducing properties of CG to construct sparse generators and, thus, a robust and interpretable corresponding feature transformation. Furthermore, generators constructed with CGAVI vanish on out-sample data and the combined approach of transforming features with CGAVI for a subsequently applied linear kernel SVM inherits the margin bound of the SVM (Wirth & Pokutta, 2022) . Despite OAVI's various appealing properties, the computational complexities of vanishing ideal algorithms for the construction of generators of the approximate vanishing ideal are superlinear in the number of samples m. With training times that increase at least cubically with m, vanishing ideal algorithms have yet to be applied to large-scale machine-learning problems.

1.1. CONTRIBUTIONS

In this paper, we improve and study the scalability of OAVI.

Linear computational complexity in m. Up until now, the analysis of computational complexities of approximate vanishing ideal algorithms assumed that generators need to vanish exactly, which gave an overly pessimistic estimation of the computational cost. For OAVI, we exploit that generators only have to vanish approximately and prove that the computational complexity of OAVI is not superlinear but linear in the number of samples m and polynomial in the number of features n.

Solver improvements. OAVI repeatedly calls a solver of quadratic convex optimization problems to construct generators. By replacing the pairwise conditional gradients algorithm (PCG) (Lacoste-Julien & Jaggi, 2015) with the faster blended pairwise conditional gradients algorithm (BPCG) (Tsuji et al., 2022), we improve the dependence of the time complexity of OAVI on the number of features n in the data set by an exponential factor.

Inverse Hessian boosting (IHB). OAVI solves a series of quadratic convex optimization problems that differ only slightly, and we can efficiently maintain and update the inverse of the corresponding Hessians. Inverse Hessian boosting (IHB) then refers to the procedure of passing a starting vector, computed with inverse Hessian information and close to the optimal solution, to the convex solver used in OAVI. Empirically, IHB speeds up the training time of OAVI by multiple orders of magnitude.

Large-scale numerical experiments. We perform numerical experiments on data sets of up to two million samples, highlighting that OAVI is an excellent large-scale feature transformation method.

1.2. RELATED WORK

The Buchberger-Möller algorithm was the first method for constructing generators of the vanishing ideal (Möller & Buchberger, 1982). Its high susceptibility to noise was addressed by Heldt et al. (2009) with the approximate vanishing ideal algorithm (AVI), see also Fassino (2010); Limbeck (2013). The latter introduced two algorithms that, unlike the degree-wise AVI, construct generators term by term: the approximate Buchberger-Möller algorithm (ABM) and the border bases approximate Buchberger-Möller algorithm. The aforementioned algorithms are monomial-aware, that is, they require an explicit ordering of terms and construct generators as linear combinations of monomials. However, monomial-awareness is an unattractive property: Changing the order of the features changes the outputs of the algorithms. Monomial-agnostic approaches such as vanishing component analysis (VCA) (Livni et al., 2013) do not suffer from this shortcoming, as they construct generators as linear combinations of polynomials. VCA found success in hand posture recognition, solution selection using genetic programming, principal variety analysis for nonlinear data modeling, and independent signal estimation for blind source separation (Zhao & Song, 2014; Kera & Iba, 2016; Iraji & Chitsaz, 2017; Wang & Ohtsuki, 2018). The disadvantage of foregoing the term ordering is that VCA sometimes constructs multiple orders of magnitude more generators than monomial-aware algorithms (Wirth & Pokutta, 2022). Furthermore, VCA is susceptible to the spurious vanishing problem: Polynomials that do not capture the nonlinear structure of the data but whose coefficient vector entries are small become generators, and, conversely, polynomials that capture the data well but whose coefficient vector entries are large get treated as non-vanishing. The problem was partially addressed by Kera & Hasegawa (2019; 2020; 2021).

2. PRELIMINARIES

Throughout, let $\ell, k, m, n \in \mathbb{N}$. We denote vectors in bold and let $\mathbf{0} \in \mathbb{R}^n$ denote the 0-vector. Sets of polynomials are denoted by capital calligraphic letters. We denote the set of terms (or monomials) and the polynomial ring over $\mathbb{R}$ in $n$ variables by $\mathcal{T}$ and $\mathcal{P}$, respectively. For $\tau \geq 0$, a polynomial $g = \sum_{i=1}^{k} c_i t_i \in \mathcal{P}$ with $\mathbf{c} = (c_1, \ldots, c_k)^\intercal$ is said to be $\tau$-bounded in the $\ell_1$-norm if the $\ell_1$-norm of its coefficient vector is bounded by $\tau$, that is, if $\|g\|_1 := \|\mathbf{c}\|_1 \leq \tau$. Given a polynomial $g \in \mathcal{P}$, let $\deg(g)$ denote its degree. The sets of polynomials in $n$ variables of and up to degree $d \in \mathbb{N}$ are denoted by $\mathcal{P}_d$ and $\mathcal{P}_{\leq d}$, respectively. Similarly, for a set of polynomials $\mathcal{G} \subseteq \mathcal{P}$, let $\mathcal{G}_d = \mathcal{G} \cap \mathcal{P}_d$ and $\mathcal{G}_{\leq d} = \mathcal{G} \cap \mathcal{P}_{\leq d}$. We often assume that $X = \{\mathbf{x}_1, \ldots, \mathbf{x}_m\} \subseteq [0, 1]^n$, a form that can be realized, for example, via min-max feature scaling. Given a polynomial $g \in \mathcal{P}$ and a set of polynomials $\mathcal{G} = \{g_1, \ldots, g_k\} \subseteq \mathcal{P}$, define the evaluation vector of $g$ and the evaluation matrix of $\mathcal{G}$ over $X$ as $g(X) = (g(\mathbf{x}_1), \ldots, g(\mathbf{x}_m))^\intercal \in \mathbb{R}^m$ and $\mathcal{G}(X) = (g_1(X), \ldots, g_k(X)) \in \mathbb{R}^{m \times k}$, respectively. Further, define the mean squared error of $g$ over $X$ as

$$\operatorname{mse}(g, X) = \frac{1}{|X|} \|g(X)\|_2^2 = \frac{1}{m} \|g(X)\|_2^2.$$

OAVI sequentially processes terms according to a so-called term ordering, as is necessary for any monomial-aware algorithm. For ease of presentation, we restrict our analysis to the degree-lexicographical ordering of terms (DegLex) (Cox et al., 2013), denoted by $<_\sigma$. For example, given the terms $t, u, v \in \mathcal{T}_1$, DegLex works as follows:

$$\mathbb{1} <_\sigma t <_\sigma u <_\sigma v <_\sigma t^2 <_\sigma t \cdot u <_\sigma t \cdot v <_\sigma u^2 <_\sigma u \cdot v <_\sigma v^2 <_\sigma t^3 <_\sigma \ldots,$$

where $\mathbb{1}$ denotes the constant-1 term. Given a set of terms $\mathcal{O} = \{t_1, \ldots, t_k\}_\sigma \subseteq \mathcal{T}$, the subscript $\sigma$ indicates that $t_1 <_\sigma \ldots <_\sigma t_k$.

Definition 2.1 (Leading term (coefficient)). Let $g = \sum_{i=1}^{k} c_i t_i \in \mathcal{P}$ with $c_i \in \mathbb{R}$ and $t_i \in \mathcal{T}$ for all $i \in \{1, \ldots, k\}$ and let $j \in \{1, \ldots, k\}$ such that $t_j >_\sigma t_i$ for all $i \in \{1, \ldots, k\} \setminus \{j\}$. Then, $t_j$ and $c_j$ are called leading term and leading term coefficient of $g$, denoted by $\operatorname{lt}(g) = t_j$ and $\operatorname{ltc}(g) = c_j$, respectively.

We define approximately vanishing polynomials via the mean squared error as follows.

Definition 2.2 (Approximately vanishing polynomial). Let $X = \{\mathbf{x}_1, \ldots, \mathbf{x}_m\} \subseteq \mathbb{R}^n$, $\psi \geq 0$, and $\tau \geq 2$. A polynomial $g \in \mathcal{P}$ is $\psi$-approximately vanishing (over $X$) if $\operatorname{mse}(g, X) \leq \psi$. If also $\operatorname{ltc}(g) = 1$ and $\|g\|_1 \leq \tau$, then $g$ is called $(\psi, 1, \tau)$-approximately vanishing (over $X$).

In the definition above, we fix the leading term coefficient of polynomials to address the spurious vanishing problem, and the requirement that polynomials are $\tau$-bounded in the $\ell_1$-norm is necessary for the learning guarantees of OAVI to hold.

Definition 2.3 (Approximate vanishing ideal). Let $X = \{\mathbf{x}_1, \ldots, \mathbf{x}_m\} \subseteq \mathbb{R}^n$, $\psi \geq 0$, and $\tau \geq 2$. The $(\psi, \tau)$-approximate vanishing ideal (over $X$), $I_X^{\psi,\tau}$, is the ideal generated by all $(\psi, 1, \tau)$-approximately vanishing polynomials over $X$.

For $\psi = 0$ and $\tau = \infty$, it holds that $I_X^{0,\infty} = I_X$, that is, the approximate vanishing ideal becomes the vanishing ideal. Finally, we introduce the generator-construction problem addressed by OAVI.

Problem 2.4 (Setting). Let $X = \{\mathbf{x}_1, \ldots, \mathbf{x}_m\} \subseteq \mathbb{R}^n$, $\psi \geq 0$, and $\tau \geq 2$. Construct a set of $(\psi, 1, \tau)$-approximately vanishing generators of $I_X^{\psi,\tau}$.

Recall that for $t, u \in \mathcal{T}$, $t$ divides (or is a divisor of) $u$, denoted by $t \mid u$, if there exists $v \in \mathcal{T}$ such that $t \cdot v = u$.

Definition 2.5 (Border). Let $\mathcal{O} \subseteq \mathcal{T}$. The degree-$d$ border of $\mathcal{O}$ is

$$\partial_d \mathcal{O} = \{u \in \mathcal{T}_d : t \in \mathcal{O}_{\leq d-1} \text{ for all } t \in \mathcal{T}_{\leq d-1} \text{ such that } t \mid u\}.$$

Algorithm 1: Oracle approximate vanishing ideal algorithm (OAVI)
Input : A data set $X = \{\mathbf{x}_1, \ldots, \mathbf{x}_m\} \subseteq \mathbb{R}^n$ and parameters $\psi \geq 0$ and $\tau \geq 2$.
Output: A set of polynomials $\mathcal{G} \subseteq \mathcal{P}$ and a set of monomials $\mathcal{O} \subseteq \mathcal{T}$.
 1  $d \leftarrow 1$, $\mathcal{O} = \{t_1\}_\sigma \leftarrow \{\mathbb{1}\}_\sigma$, $\mathcal{G} \leftarrow \emptyset$
 2  while $\partial_d \mathcal{O} = \{u_1, \ldots, u_k\}_\sigma \neq \emptyset$ do
 3      for $i = 1, \ldots, k$ do
 4          $\ell \leftarrow |\mathcal{O}| \in \mathbb{N}$, $A \leftarrow \mathcal{O}(X) \in \mathbb{R}^{m \times \ell}$, $\mathbf{b} \leftarrow u_i(X) \in \mathbb{R}^m$, $P = \{\mathbf{y} \in \mathbb{R}^\ell \mid \|\mathbf{y}\|_1 \leq \tau - 1\}$
 5          $\mathbf{c} \in \operatorname{argmin}_{\mathbf{y} \in P} \frac{1}{m} \|A\mathbf{y} + \mathbf{b}\|_2^2$    ▷ call to convex optimization oracle
 6          $g \leftarrow \sum_{j=1}^{\ell} c_j t_j + u_i$
 7          if $\operatorname{mse}(g, X) \leq \psi$ then    ▷ check whether g vanishes
 8              $\mathcal{G} \leftarrow \mathcal{G} \cup \{g\}$
 9          else
10              $\mathcal{O} = \{t_1, \ldots, t_{\ell+1}\}_\sigma \leftarrow (\mathcal{O} \cup \{u_i\})_\sigma$
11          end
12      end
13      $d \leftarrow d + 1$
14  end

3. ORACLE APPROXIMATE VANISHING IDEAL ALGORITHM (OAVI)

In this section, we recall the oracle approximate vanishing ideal algorithm (OAVI) (Wirth & Pokutta, 2022) in Algorithm 1, a method for solving Problem 2.4.
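The DegLex ordering and the border from Section 2 can be sketched in a few lines of Python. The representation of a term as a tuple of exponents and the helper names `deglex_key`, `proper_divisors`, and `border` are assumptions made for this illustration; they are not part of the paper.

```python
from itertools import product


def deglex_key(term):
    """Sort key for DegLex: lower degree first, then t < u < v as in the example."""
    return (sum(term), tuple(-a for a in term))


def proper_divisors(term):
    """All terms of strictly smaller degree that divide `term` (exponent tuples)."""
    return [t for t in product(*(range(a + 1) for a in term)) if t != term]


def border(O, d):
    """Degree-d border: degree-d terms all of whose proper divisors lie in O."""
    O = set(O)
    n = len(next(iter(O)))
    candidates = set()
    for t in O:
        if sum(t) == d - 1:
            # Multiply each degree-(d-1) term in O by one variable.
            for j in range(n):
                candidates.add(t[:j] + (t[j] + 1,) + t[j + 1:])
    inside = [u for u in candidates if all(t in O for t in proper_divisors(u))]
    return sorted(inside, key=deglex_key)


# Terms in three variables t, u, v as exponent tuples.
one, t, u, v = (0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1)
degree_two = sorted(
    [(0, 0, 2), (0, 1, 1), (1, 1, 0), (2, 0, 0), (0, 2, 0), (1, 0, 1)],
    key=deglex_key,
)
first_border = border([one], 1)
second_border = border([one, t, u], 2)
```

Here `degree_two` reproduces the order $t^2, t \cdot u, t \cdot v, u^2, u \cdot v, v^2$ from the example, and `second_border` omits every degree-2 term divisible by `v`, because `v` is not in the set passed to `border`.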

3.1. ALGORITHM OVERVIEW

OAVI takes as input a data set $X = \{\mathbf{x}_1, \ldots, \mathbf{x}_m\} \subseteq \mathbb{R}^n$, a vanishing parameter $\psi \geq 0$, and a tolerance $\tau \geq 2$ such that the constructed generators are $\tau$-bounded in the $\ell_1$-norm. From a high-level perspective, OAVI constructs a finite set of $(\psi, 1, \tau)$-approximately vanishing generators of the $(\psi, \tau)$-approximate vanishing ideal $I_X^{\psi,\tau}$ by solving a series of constrained convex optimization problems. OAVI tracks the set of terms $\mathcal{O} \subseteq \mathcal{T}$ such that there does not exist a $(\psi, 1, \tau)$-approximately vanishing generator of $I_X^{\psi,\tau}$ with terms only in $\mathcal{O}$, and the set of generators $\mathcal{G} \subseteq \mathcal{P}$ of $I_X^{\psi,\tau}$. For every degree $d \geq 1$, OAVI computes the border $\partial_d \mathcal{O}$ in Line 2. Then, in Lines 3-12, for every term $u \in \partial_d \mathcal{O}$, OAVI determines whether there exists a $(\psi, 1, \tau)$-approximately vanishing generator $g$ of $I_X^{\psi,\tau}$ with $\operatorname{lt}(g) = u$ and other terms only in $\mathcal{O}$ via oracle access to a solver of the constrained convex optimization problem in Line 5. If such a $g$ exists, it gets appended to $\mathcal{G}$ in Line 8. Otherwise, the term $u$ gets appended to $\mathcal{O}$ in Line 10. OAVI terminates when a degree $d \in \mathbb{N}$ is reached such that $\partial_d \mathcal{O} = \emptyset$. For ease of exposition, unless noted otherwise, we ignore that the optimization problem in Line 5 of OAVI is addressed only up to a certain accuracy $\epsilon > 0$. Throughout, we denote the output of OAVI by $\mathcal{G}$ and $\mathcal{O}$, that is, $(\mathcal{G}, \mathcal{O}) = \mathrm{OAVI}(X, \psi, \tau)$. We replace the first letter in OAVI with the abbreviation corresponding to the solver used to solve the optimization problem in Line 5. For example, OAVI with solver CG is referred to as CGAVI.
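The control flow described above can be sketched as follows for the special case $\tau = \infty$, where the oracle call reduces to an unconstrained least-squares problem. This is a simplified illustration and not the authors' implementation: the oracle is swapped for `numpy.linalg.lstsq`, terms are exponent tuples, and the names `oavi_unconstrained`, `border`, `evaluate`, and the safeguard `max_degree` are assumptions.

```python
import numpy as np
from itertools import product


def deglex_key(term):
    """DegLex sort key for exponent tuples."""
    return (sum(term), tuple(-a for a in term))


def border(O, d):
    """Degree-d terms all of whose proper divisors lie in O."""
    O_set, n = set(O), len(O[0])
    candidates = {t[:j] + (t[j] + 1,) + t[j + 1:]
                  for t in O if sum(t) == d - 1 for j in range(n)}

    def divisors_in_O(u):
        return all(t in O_set
                   for t in product(*(range(a + 1) for a in u)) if t != u)

    return sorted((u for u in candidates if divisors_in_O(u)), key=deglex_key)


def evaluate(term, X):
    """Evaluation vector of a monomial over all rows of X."""
    return np.prod(X ** np.array(term), axis=1)


def oavi_unconstrained(X, psi, max_degree=10):
    """OAVI control flow with tau = infinity and a least-squares oracle."""
    n = X.shape[1]
    O = [(0,) * n]          # starts with the constant-1 term
    G = []                  # entries: (leading term, non-leading terms, coefficients)
    d = 1
    while d <= max_degree:
        terms = border(O, d)
        if not terms:
            break
        for u in terms:
            A = np.column_stack([evaluate(t, X) for t in O])
            b = evaluate(u, X)
            # Minimize (1/m) * ||A y + b||^2 over y, i.e. solve A y ~ -b.
            c, *_ = np.linalg.lstsq(A, -b, rcond=None)
            if np.mean((A @ c + b) ** 2) <= psi:
                G.append((u, list(O), c))
            else:
                O.append(u)
        d += 1
    return G, O


# Hypothetical toy data: eight points on the unit circle.
angles = np.linspace(0.0, 2.0 * np.pi, num=8, endpoint=False)
X = np.column_stack([np.cos(angles), np.sin(angles)])
G, O = oavi_unconstrained(X, psi=1e-10)
```

On this toy data, the first generator found has leading term $x_2^2$ and corresponds to $x_1^2 + x_2^2 - 1$, the equation of the circle.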

3.2. PREPROCESSING WITH OAVI IN A CLASSIFICATION PIPELINE

OAVI can be used as a feature transformation method for a subsequently applied linear kernel support vector machine (SVM). Let $\mathcal{X} \subseteq \mathbb{R}^n$ and $\mathcal{Y} = \{1, \ldots, k\}$ be an input and output space, respectively. We are given a training sample $S = \{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_m, y_m)\} \in (\mathcal{X} \times \mathcal{Y})^m$ drawn i.i.d. from some unknown distribution $\mathcal{D}$. Our task is to determine a hypothesis $h \colon \mathcal{X} \to \mathcal{Y}$ with small generalization error $\mathbb{P}_{(\mathbf{x}, y) \sim \mathcal{D}}[h(\mathbf{x}) \neq y]$. For each class $i \in \{1, \ldots, k\}$, let $X^i = \{\mathbf{x}_j \in X \mid y_j = i\} \subseteq X = \{\mathbf{x}_1, \ldots, \mathbf{x}_m\}$ denote the subset of samples corresponding to class $i$ and construct a set of $(\psi, 1, \tau)$-approximately vanishing generators $\mathcal{G}^i$ of the $(\psi, \tau)$-approximate vanishing ideal $I_{X^i}^{\psi,\tau}$ via OAVI, that is, $(\mathcal{G}^i, \mathcal{O}^i) = \mathrm{OAVI}(X^i, \psi, \tau)$. We combine the sets of generators corresponding to the different classes to obtain $\mathcal{G} = \{g_1, \ldots, g_{|\mathcal{G}|}\} = \bigcup_{i=1}^{k} \mathcal{G}^i$, which encapsulates the feature transformation we are about to apply to data set $X$. As proposed in Livni et al. (2013), we transform the training sample $X$ via the feature transformation

$$\mathbf{x}_j \mapsto \tilde{\mathbf{x}}_j = \left(|g_1(\mathbf{x}_j)|, \ldots, |g_{|\mathcal{G}|}(\mathbf{x}_j)|\right)^\intercal \in \mathbb{R}^{|\mathcal{G}|} \tag{FT}$$

for $j = 1, \ldots, m$. The motivation behind this feature transformation is that a polynomial $g \in \mathcal{G}^i$ vanishes approximately over all $\mathbf{x} \in X^i$ and (hopefully) attains values that are far from zero over points $\mathbf{x} \in X \setminus X^i$. We then train a linear kernel SVM on the feature-transformed data $\tilde{S} = \{(\tilde{\mathbf{x}}_1, y_1), \ldots, (\tilde{\mathbf{x}}_m, y_m)\}$ with $\ell_1$-regularization to promote sparsity. If $\psi = 0$, $\tau = \infty$, and the underlying classes of $S$ belong to disjoint algebraic sets, then the different classes become linearly separable in the feature space corresponding to transformation (FT) and perfect classification accuracy is achieved with the linear kernel SVM (Livni et al., 2013).
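As a small illustration of transformation (FT), the sketch below applies hand-written generators to two toy classes lying on concentric circles. The generators are written by hand for this example rather than produced by OAVI, and the name `feature_transform` is an assumption; the subsequent SVM training step is omitted.

```python
import numpy as np


def feature_transform(X, generators):
    """Map each sample x_j to (|g_1(x_j)|, ..., |g_|G|(x_j)|), as in (FT)."""
    return np.abs(np.column_stack([g(X) for g in generators]))


# Hypothetical toy classes: class 1 on the circle of radius 1, class 2 on radius 2.
angles = np.linspace(0.0, 2.0 * np.pi, num=6, endpoint=False)
circle = np.column_stack([np.cos(angles), np.sin(angles)])
X1, X2 = circle, 2.0 * circle

# One hand-written vanishing polynomial per class (not OAVI output).
generators = [
    lambda X: X[:, 0] ** 2 + X[:, 1] ** 2 - 1.0,  # vanishes on class 1
    lambda X: X[:, 0] ** 2 + X[:, 1] ** 2 - 4.0,  # vanishes on class 2
]

Z1 = feature_transform(X1, generators)
Z2 = feature_transform(X2, generators)
```

Every sample of class 1 is mapped to $(0, 3)$ and every sample of class 2 to $(3, 0)$, so the two classes are linearly separable in the transformed feature space.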

3.3. SOLVING THE OPTIMIZATION PROBLEM IN LINE 5 OF OAVI

Setting τ = ∞, the optimization problem in Line 5 becomes unconstrained and can, for example, be addressed with accelerated gradient descent (AGD) (Nesterov, 1983) . If, however, τ < ∞, OAVI satisfies two generalization bounds. Indeed, assuming that the data is of the form X ⊆ [0, 1] n , the generators in G are guaranteed to also vanish on out-sample data. Furthermore, the combined approach of using generators obtained via OAVI to preprocess the data for a subsequently applied linear kernel SVM inherits the margin bound of the SVM (Wirth & Pokutta, 2022) .

4. OAVI AT SCALE

Here, we address the main shortcoming of vanishing ideal algorithms: time, space, and evaluation complexities that depend superlinearly on $m$, where $m$ is the number of samples in the data set $X = \{\mathbf{x}_1, \ldots, \mathbf{x}_m\} \subseteq \mathbb{R}^n$. Recall the computational complexity of OAVI for $\tau = \infty$.

Theorem 4.1 (Complexity (Wirth & Pokutta, 2022)). Let $X = \{\mathbf{x}_1, \ldots, \mathbf{x}_m\} \subseteq \mathbb{R}^n$, $\psi \geq 0$, $\tau = \infty$, […] set $Z = \{\mathbf{z}_1, \ldots, \mathbf{z}_q\} \subseteq \mathbb{R}^n$ can be computed in time $O((|\mathcal{G}| + |\mathcal{O}|)^2 q)$.

Under mild assumptions, Wirth & Pokutta (2022) proved that $O(|\mathcal{G}| + |\mathcal{O}|) = O(mn)$, implying that OAVI's computational complexity is superlinear in $m$. In this work, we improve OAVI's computational complexity both with a tighter theoretical analysis and algorithmic modifications.

4.1. NUMBER-OF-SAMPLES-AGNOSTIC BOUND ON |G| + |O|

For $\psi > 0$, OAVI, ABM, AVI, and VCA construct approximately vanishing generators of the $(\psi, \tau)$-approximate vanishing ideal. Despite being designed for the approximate setting ($\psi > 0$), so far, the analysis of these algorithms was only conducted for the exact setting ($\psi = 0$). Below, we exploit that $\psi > 0$ in OAVI and obtain the first number-of-samples-agnostic bound on the size of a […]

The results in Figure 1 (see Appendix F.1 for details) show that replacing PCG with BPCG in OAVI often speeds up the training time of OAVI. For the data set skin in Figure 1c (see Table 1 for an overview of the data sets), it is not fully understood why BPCGAVI is slower than PCGAVI.

4.3. INVERSE HESSIAN BOOSTING (IHB)

We introduce inverse Hessian boosting (IHB) to speed up the training time of OAVI by multiple orders of magnitude by exploiting the structure of the optimization problems solved in OAVI. For ease of exposition, assume for now that $\tau = \infty$, in which case we would use AGD to solve the problem in Line 5 of OAVI. Letting $\ell = |\mathcal{O}|$, $A = \mathcal{O}(X) \in \mathbb{R}^{m \times \ell}$, $\mathbf{b} = u_i(X) \in \mathbb{R}^m$, and $f(\mathbf{y}) = \frac{1}{m} \|A\mathbf{y} + \mathbf{b}\|_2^2$, the gradient and Hessian of $f$ at $\mathbf{y} \in \mathbb{R}^\ell$ are

$$\nabla f(\mathbf{y}) = \frac{2}{m} A^\intercal (A\mathbf{y} + \mathbf{b}) \in \mathbb{R}^\ell \quad \text{and} \quad \nabla^2 f(\mathbf{y}) = \frac{2}{m} A^\intercal A \in \mathbb{R}^{\ell \times \ell},$$

respectively. By construction, the columns of $A = \mathcal{O}(X)$ are linearly independent. Hence, $A^\intercal A \in \mathbb{R}^{\ell \times \ell}$ is positive definite and invertible, $f$ is strongly convex, and the optimal solution $\mathbf{c}$ to the optimization problem in Line 5 of OAVI is unique. Further, for $\mathbf{y} \in \mathbb{R}^\ell$, we have $\nabla f(\mathbf{y}) = \mathbf{0}$ if and only if $\mathbf{y} = \mathbf{c}$. Instead of using AGD to construct an $\epsilon$-accurate solution to the optimization problem in Line 5 of OAVI, we could solve $\nabla f(\mathbf{c}) = \mathbf{0}$ directly and compute the optimal solution $\mathbf{c} = -(A^\intercal A)^{-1} A^\intercal \mathbf{b} \in \mathbb{R}^\ell$. Since matrix inversions are numerically unstable, relying on them directly would make OAVI less robust, and approximately vanishing polynomials might not be correctly detected. Instead, we capitalize on the fact that the number of iterations of AGD to reach an $\epsilon$-minimizer depends on the Euclidean distance between the starting vector and the optimal solution. IHB refers to the procedure of passing $\mathbf{y}_0 = -(A^\intercal A)^{-1} A^\intercal \mathbf{b} \in \mathbb{R}^\ell$ to AGD as a starting vector. Then, AGD often reaches $\epsilon$-accuracy in one iteration. In case $(A^\intercal A)^{-1}$ is not computed correctly due to floating-point errors, AGD still guarantees an $\epsilon$-accurate solution to the optimization problem. Thus, IHB can also be thought of as performing one iteration of Newton's method starting with iterate $\mathbf{0}$, see, for example, Galántai (2000), and passing the resulting vector as a starting iterate to AGD. We stress that the dimension of $A^\intercal A \in \mathbb{R}^{\ell \times \ell}$ is number-of-samples-agnostic, see Theorem 4.2.
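The effect of the IHB starting vector can be illustrated with a small numerical sketch. The starting vector is obtained by solving $\nabla f(\mathbf{y}) = \mathbf{0}$ for the gradient stated above. Two simplifications are assumed here: plain gradient descent stands in for AGD, and random matrices stand in for $A = \mathcal{O}(X)$ and $\mathbf{b} = u_i(X)$. All names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
m, ell = 200, 5
A = rng.random((m, ell))   # stand-in for O(X); columns are linearly independent
b = rng.random(m)          # stand-in for u_i(X)


def grad_f(y):
    """Gradient of f(y) = (1/m) * ||A y + b||_2^2."""
    return 2.0 / m * A.T @ (A @ y + b)


# Smoothness constant of f: largest eigenvalue of the Hessian (2/m) A^T A.
L = 2.0 / m * np.linalg.eigvalsh(A.T @ A).max()


def gradient_descent(y, eps=1e-8, max_iter=10_000):
    """Plain gradient descent (a stand-in for AGD); returns (iterate, iterations used)."""
    for iteration in range(max_iter):
        g = grad_f(y)
        if np.linalg.norm(g) <= eps:
            return y, iteration
        y = y - g / L
    return y, max_iter


# IHB starting vector: one Newton step from 0, i.e. the root of grad_f.
# A linear solve is used instead of forming the inverse explicitly.
y0 = -np.linalg.solve(A.T @ A, A.T @ b)

_, cold_iterations = gradient_descent(np.zeros(ell))
_, warm_iterations = gradient_descent(y0)
```

With the warm start, the stopping criterion is already met at the starting vector, whereas the cold start from the zero vector needs many iterations.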
IHB also works for $\tau < \infty$, in which case we use CG variants to address the optimization problem in Line 5 of OAVI that takes the form $\mathbf{c} \in \operatorname{argmin}_{\mathbf{y} \in P} f(\mathbf{y})$, where $P = \{\mathbf{y} \in \mathbb{R}^\ell \mid \|\mathbf{y}\|_1 \leq \tau - 1\}$. In the problematic case

$$\|\mathbf{y}_0\|_1 = \|(A^\intercal A)^{-1} A^\intercal \mathbf{b}\|_1 > \tau - 1, \tag{INF}$$

the polynomial $g = \sum_{j=1}^{\ell} c_j t_j + u_i$ constructed in Line 6 of OAVI might not vanish approximately even though there exists $h \in \mathcal{P}$ with $\operatorname{lt}(h) = u_i$, $\operatorname{ltc}(h) = 1$, non-leading terms only in $\mathcal{O}$, and $\|h\|_1 > \tau$ that vanishes exactly over $X$. Thus, $\operatorname{mse}(h, X) = 0 \leq \psi < \operatorname{mse}(g, X)$ and OAVI does not append $g$ to $\mathcal{G}$ and instead updates $\mathcal{O} \leftarrow (\mathcal{O} \cup \{u_i\})_\sigma$ and $A \leftarrow (A, \mathbf{b}) \in \mathbb{R}^{m \times (\ell+1)}$. Since $\operatorname{mse}(h, X) = 0$, we just appended a linearly dependent column to $A$, making $A$ rank-deficient, $(A^\intercal A)^{-1}$ ill-defined, and IHB no longer applicable. To address this problem, we could select $\tau$ adaptively, but this would invalidate the learning guarantees of OAVI that rely on $\tau$ being a constant. In practice, we fix $\tau \geq 2$ and stop using IHB as soon as (INF) holds for the first time, preserving the generalization bounds of OAVI. Through careful updates of $(A^\intercal A)^{-1}$, the discussion in Appendix C.1 implies the following complexity result for CGAVI-IHB (CGAVI with IHB). […]

In Appendix C.2, we introduce weak inverse Hessian boosting (WIHB), a variant of IHB that speeds up CGAVI variants while preserving sparsity-inducing properties. In Figure 2 (see Appendix F.1 for details), we observe that CGAVI-IHB is faster than BPCGAVI-WIHB, which is faster than BPCGAVI. In Figure 3 (see Appendix F.2 for details), we compare the training times of OAVI-IHB to ABM and VCA. In Figure 3d, we observe that OAVI-IHB's training time scales better than ABM's and VCA's.
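The check for the problematic case (INF) can be sketched as below. The helper name `ihb_start` and the toy matrices are assumptions for illustration; since (INF) only involves the $\ell_1$-norm of the starting vector, its sign does not affect the check.

```python
import numpy as np


def ihb_start(A, b, tau):
    """Return the IHB starting vector and whether (INF) holds for it."""
    y0 = -np.linalg.solve(A.T @ A, A.T @ b)
    inf_holds = bool(np.linalg.norm(y0, ord=1) > tau - 1.0)
    return y0, inf_holds


# Hypothetical toy problem: the columns of A are the constant-1 term and one feature.
A = np.array([[1.0, 0.0], [1.0, 0.5], [1.0, 1.0]])
b = np.array([-3.0, -3.0, -3.0])

y0, inf_small_tau = ihb_start(A, b, tau=2.0)
_, inf_large_tau = ihb_start(A, b, tau=1000.0)
```

In this toy problem the starting vector is $(3, 0)^\intercal$, so (INF) holds for $\tau = 2$ but not for $\tau = 1000$; in the first case, following the text, IHB would no longer be used from that point on.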

5. NUMERICAL EXPERIMENTS

Unless noted otherwise, the setup for the numerical experiments applies to all experiments in the paper. Experiments are implemented in PYTHON and performed on an Nvidia GeForce RTX 3080 GPU with 10 GB RAM and an Intel Core i7 11700K 8x CPU at 3.60 GHz with 64 GB RAM. Our code is publicly available on GitHub. We implement OAVI as in Algorithm 1 with convex solvers CG, PCG, BPCG, and AGD and refer to the resulting algorithms as CGAVI, PCGAVI, BPCGAVI, and AGDAVI, respectively. Solvers are run for up to 10,000 iterations. For the CG variants, we set $\tau = 1{,}000$. The CG variants are run up to accuracy $\epsilon = 0.01 \cdot \psi$ and terminated early when less than $0.0001 \cdot \psi$ progress is made in the difference between function values, when the coefficient vector of a generator is constructed, or if we have a guarantee that no coefficient vector of a generator can be constructed. AGD is terminated early if less than $0.0001 \cdot \psi$ progress is made in the difference between function values for 20 iterations in a row or the coefficient vector of a generator is constructed. OAVI implemented with WIHB or IHB is referred to as OAVI-WIHB or OAVI-IHB, respectively. We implement ABM as in Limbeck (2013), but instead of applying the singular value decomposition (SVD) to the matrix corresponding to $A = \mathcal{O}(X)$ in OAVI, we apply the SVD to $A^\intercal A$ in case this leads to a faster training time, and we employ the border as in Definition 2.5. We implement VCA as in Livni et al. (2013), but instead of applying the SVD to the matrix corresponding to $A = \mathcal{O}(X)$ in OAVI, we apply the SVD to $A^\intercal A$ in case this leads to a faster training time. We implement a polynomial kernel SVM with a one-versus-rest approach using the SCIKIT-LEARN software package (Pedregosa et al., 2011) and run the polynomial kernel SVM with $\ell_2$-regularization up to tolerance $10^{-3}$ or for up to 10,000 iterations.
We preprocess with OAVI, ABM, and VCA for a linear kernel SVM as in Section 3.2 and refer to the combined approaches as OAVI*, ABM*, and VCA*, respectively. The linear kernel SVM is implemented using the SCIKIT-LEARN software package and run with $\ell_1$-penalized squared hinge loss up to tolerance $10^{-4}$ or for up to 10,000 iterations. For OAVI*, ABM*, and VCA*, the hyperparameters are the vanishing tolerance $\psi$ and the $\ell_1$-regularization coefficient of the linear kernel SVM. For the polynomial kernel SVM, the hyperparameters are the degree and the $\ell_2$-regularization coefficient. Tables 1 and 3 contain overviews of the data sets and hyperparameter values, respectively.

5.1. EXPERIMENT: PERFORMANCE

We compare the performance of CGAVI-IHB*, BPCGAVI-WIHB*, AGDAVI-IHB*, ABM*, VCA*, and polynomial kernel SVM on the data sets credit, htru, skin, and spam.

Setup. We tune the hyperparameters on the training data using threefold cross-validation. We retrain on the entire training data set using the best combination of hyperparameters and evaluate the classification error on the test set and the hyperparameter optimization time. For the generator-constructing methods, we also compare $|\mathcal{G}| + |\mathcal{O}|$, where $|\mathcal{G}| = \sum_i |\mathcal{G}^i|$, $|\mathcal{O}| = \sum_i |\mathcal{O}^i|$, and $(\mathcal{G}^i, \mathcal{O}^i)$ is the output of applying a generator-constructing algorithm to samples belonging to class $i$, as well as the sparsity of the feature transformation $\mathcal{G} = \bigcup_i \mathcal{G}^i$, which is defined as

$$\operatorname{spar}(\mathcal{G}) = \frac{\sum_{g \in \mathcal{G}} g_z}{\sum_{g \in \mathcal{G}} g_e} \in [0, 1],$$

where for a polynomial $g = \sum_{j=1}^{k} c_j t_j + t$ with $\operatorname{lt}(g) = t$, $g_e = k$ and $g_z = |\{j \in \{1, \ldots, k\} \mid c_j = 0\}|$, that is, $g_e$ and $g_z$ are the number of non-leading and the number of zero coefficient vector entries of $g$, respectively. Results are averaged over ten random 60%/40% train/test partitions.

Table 2 (excerpt). Hyperparameter optimization time, one column per data set.

    Time
    CGAVI-IHB*       1.3 × 10^2    2.3 × 10^1    1.0 × 10^2    8.3 × 10^1
    AGDAVI-IHB*      1.9 × 10^2    2.8 × 10^1    1.1 × 10^2    3.1 × 10^2
    BPCGAVI-WIHB*    3.7 × 10^3    8.0 × 10^2    5.6 × 10^2    4.2 × 10^2
    ABM*             1.2 × 10^2    2.4 × 10^1    6. × 10^1     1.7 × 10^2
    VCA*             2.4 × 10^1    6.2 × 10^0    1.4 × 10^1    6.5 × 10^1
    SVM              8.9 × 10^1    4.1 × 10^0    7.1 × […]

Results. The results are presented in Table 2. OAVI* admits excellent test-set classification accuracy. BPCGAVI-WIHB*, in particular, admits the best test-set classification error on all data sets but one. Hyperparameter tuning for OAVI* is often slightly slower than for ABM* and VCA*. Since BPCGAVI-WIHB* does not employ IHB, the approach is always slower than CGAVI-IHB* and AGDAVI-IHB*. For all data sets but credit and skin, the hyperparameter optimization time for the SVM is shorter than for the other approaches.
On skin, since the training time of the polynomial kernel SVM is superlinear in m, the hyperparameter training for the polynomial kernel SVM is slower than for the other approaches. For data sets with few features n, the magnitude of |G| + |O| is often the smallest for VCA * . However, for spam, a data set with n = 57, as already pointed out by Kera & Hasegawa (2019) as the spurious vanishing problem, we observe VCA's tendency to create unnecessary generators. Finally, only BPCGAVI-WIHB * constructs sparse feature transformations, potentially explaining the excellent test-set classification accuracy of the approach. In conclusion, the numerical results justify considering OAVI as the generator-constructing algorithm of choice, as it is the only approximate vanishing ideal algorithm with known learning guarantees and, in practice, performs better than or similar to related approaches.
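The sparsity measure $\operatorname{spar}(\mathcal{G})$ from the setup above is a simple ratio of counts and can be sketched as follows. Representing each generator by the list of its non-leading coefficients, and the name `sparsity`, are assumptions made for this illustration; the coefficient values are made up.

```python
def sparsity(coefficient_vectors):
    """spar(G): share of zero entries among all non-leading coefficients.

    Each element of `coefficient_vectors` holds the non-leading coefficients
    (c_1, ..., c_k) of one generator; the leading coefficient 1 is excluded.
    """
    zeros = sum(sum(1 for c in coeffs if c == 0) for coeffs in coefficient_vectors)
    entries = sum(len(coeffs) for coeffs in coefficient_vectors)
    return zeros / entries


# Two hypothetical generators with 5 and 3 non-leading coefficients.
example = [[-1.0, 0.0, 0.0, 1.0, 0.0], [0.5, 0.0, 0.0]]
example_sparsity = sparsity(example)
```

In this example, 5 of the 8 non-leading coefficients are zero, so the sparsity is 0.625.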

A MISSING PROOFS

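Proof of Theorem 4.2. Let $X = \{\mathbf{x}_1, \ldots, \mathbf{x}_m\} \subseteq [0, 1]^n$ and let $t_1, \ldots, t_n \in \mathcal{T}_1$ be the degree-1 monomials. Suppose that during OAVI's execution, for some degree $d \in \mathbb{N}$, OAVI checks whether the term $u =$ […] where $\mathbb{1}$ denotes the constant-1 monomial, that is, $\mathbb{1}(\mathbf{x}) = 1$ for all $\mathbf{x} \in \mathbb{R}^n$. Note that $\|h\|_1 \leq$ […] Note that for all $\mathbf{x} \in X$ and $t \in \mathcal{T}_1$, it holds that $t(\mathbf{x}) \in [0, 1]$, and, thus, $|t(\mathbf{x}) - \frac{1}{2}| \leq \frac{1}{2}$. We obtain

$$\operatorname{mse}(g, X) \leq \operatorname{mse}(h, X) = \frac{1}{m} \|h(X)\|_2^2 = \frac{1}{m} \sum_{\mathbf{x} \in X} \left( \prod_{j=1}^{n} \left( t_j(\mathbf{x}) - \frac{1}{2} \mathbb{1}(\mathbf{x}) \right)^{\alpha_j} \right)^2 \leq \max_{\mathbf{x} \in X} \prod_{j=1}^{n} \left| t_j(\mathbf{x}) - \frac{1}{2} \right|^{2\alpha_j} \leq 4^{-d}.$$

Since $\operatorname{mse}(g, X) \leq 4^{-d} \leq \psi$ is satisfied for $d \geq D := \left\lceil \frac{-\log(\psi)}{\log(4)} \right\rceil$, OAVI terminates after reaching degree $D$. Thus, at the end of OAVI's execution, $|\mathcal{G}| + |\mathcal{O}| \leq \binom{D+n}{D} \leq (D + n)^D$.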
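The bound at the end of the proof depends only on $\psi$ and $n$ and can be evaluated directly. The sketch below assumes that $D$ is the ceiling of $-\log(\psi)/\log(4)$; the function names and the example values of $\psi$ and $n$ are illustrative assumptions.

```python
import math


def degree_bound(psi):
    """D: smallest integer d with 4**(-d) <= psi, assuming 0 < psi < 1."""
    return math.ceil(-math.log(psi) / math.log(4))


def size_bound(psi, n):
    """Upper bound binom(D + n, D) on |G| + |O|; it does not depend on m."""
    D = degree_bound(psi)
    return math.comb(D + n, D)


# Hypothetical parameters: vanishing tolerance 0.005 and n = 3 features.
D = degree_bound(0.005)
bound = size_bound(0.005, 3)
loose_bound = (D + 3) ** D
```

For these values, $D = 4$, the binomial bound is 35, and the looser bound $(D + n)^D$ is 2401, regardless of the number of samples.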

B BLENDED PAIRWISE CONDITIONAL GRADIENTS ALGORITHM (BPCG)

The blended pairwise conditional gradients algorithm (BPCG) (Tsuji et al., 2022) is presented in Algorithm 2.
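For orientation, the following is a minimal sketch of BPCG applied to the quadratic objective from Line 5 of OAVI over an $\ell_1$-ball. It is not the authors' implementation. The vertex selection and the test between pairwise and Frank-Wolfe steps follow the standard description of BPCG by Tsuji et al. (2022) as we understand it; the closed-form line search, the function names, the tolerance `eps`, and the random test problem are assumptions of this sketch.

```python
import numpy as np


def vertex(index, sign, radius, ell):
    """Vertex sign * radius * e_index of the l1-ball of the given radius."""
    v = np.zeros(ell)
    v[index] = sign * radius
    return v


def bpcg_l1(A, b, radius, eps=1e-6, max_iter=10_000):
    """Minimize f(y) = (1/m) * ||A y + b||_2^2 over the l1-ball with BPCG."""
    m, ell = A.shape
    weights = {(0, 1.0): 1.0}            # active set with convex weights
    y = vertex(0, 1.0, radius, ell)      # starting vertex
    for _ in range(max_iter):
        grad = 2.0 / m * A.T @ (A @ y + b)
        # Global Frank-Wolfe vertex w minimizes <grad, v> over all vertices.
        i = int(np.argmax(np.abs(grad)))
        w = (i, -1.0 if grad[i] > 0 else 1.0)
        w_vec = vertex(*w, radius, ell)
        fw_gap = grad @ (y - w_vec)
        if fw_gap <= eps:
            break
        # Away vertex a and local Frank-Wolfe vertex q within the active set.
        scores = {v: v[1] * radius * grad[v[0]] for v in weights}
        a = max(scores, key=scores.get)
        q = min(scores, key=scores.get)
        pairwise = scores[a] - scores[q] >= fw_gap
        if pairwise:
            d = vertex(*q, radius, ell) - vertex(*a, radius, ell)
            gamma_max = weights[a]
        else:
            d = w_vec - y
            gamma_max = 1.0
        # Exact line search for the quadratic, clipped to [0, gamma_max].
        curvature = 2.0 / m * np.sum((A @ d) ** 2)
        if curvature == 0.0:
            gamma = gamma_max
        else:
            gamma = min(gamma_max, max(0.0, -(grad @ d) / curvature))
        if pairwise:
            weights[a] -= gamma          # shift weight from a to q
            weights[q] += gamma
        else:
            weights = {v: (1.0 - gamma) * lam for v, lam in weights.items()}
            weights[w] = weights.get(w, 0.0) + gamma
        weights = {v: lam for v, lam in weights.items() if lam > 1e-15}
        y = y + gamma * d
    return y, weights


# Hypothetical test problem whose minimizer lies inside the l1-ball of radius 1.
rng = np.random.default_rng(0)
A = rng.standard_normal((60, 4))
y_star = np.array([0.3, -0.2, 0.0, 0.1])
b = -A @ y_star
y, weights = bpcg_l1(A, b, radius=1.0)
```

The returned iterate stays inside the ball as a convex combination of the vertices in the active set. The stopping criterion is the Frank-Wolfe gap, the same quantity the paper mentions as an early termination criterion for CG variants.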

C ADDITIONAL INFORMATION ON IHB

We discuss the computational complexity of IHB in Appendix C.1 and WIHB in Appendix C.2.

C.1 COMPUTATIONAL COMPLEXITY OF IHB

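The main cost of IHB is the inversion of the matrix $A^\intercal A \in \mathbb{R}^{\ell \times \ell}$, which generally requires $O(\ell^3)$ elementary operations. Since OAVI solves a series of quadratic convex optimization problems that differ from each other only slightly, we can maintain and update $(A^\intercal A)^{-1}$ using $O(\ell^2)$ instead of $O(\ell^3)$ elementary operations.

Theorem C.1 (IHB update cost). Let $A \in \mathbb{R}^{m \times \ell}$, $A^\intercal A \in \mathbb{R}^{\ell \times \ell}$, $(A^\intercal A)^{-1}$, and $\mathbf{b} \in \mathbb{R}^m$ be given. In case $\|\mathbf{b}\|_2 > 0$ and $\mathbf{b}^\intercal A (A^\intercal A)^{-1} A^\intercal \mathbf{b} \neq \|\mathbf{b}\|_2^2$,

$$\tilde{A} = (A, \mathbf{b}) \in \mathbb{R}^{m \times (\ell+1)}, \quad \tilde{A}^\intercal \tilde{A} \in \mathbb{R}^{(\ell+1) \times (\ell+1)}, \quad (\tilde{A}^\intercal \tilde{A})^{-1} \in \mathbb{R}^{(\ell+1) \times (\ell+1)} \tag{C.1}$$

can be constructed in $O(\ell m + \ell^2)$ elementary operations.

Algorithm 2: Blended pairwise conditional gradients algorithm (BPCG)
Input : A smooth and convex function $f$, a starting vertex $\mathbf{y}_0 \in \operatorname{vert}(P)$.
Output: A point $\mathbf{y}_T \in P$.
    $S^{(0)} \leftarrow \{\mathbf{y}_0\}$
    $\lambda^{(0)}_{\mathbf{y}_0} \leftarrow 1$
    $\lambda^{(0)}_{\mathbf{v}} \leftarrow 0$ for $\mathbf{v} \in \operatorname{vert}(P) \setminus \{\mathbf{y}_0\}$
    for $t = 0, \ldots,$ […]
            $\mathbf{d}_t \leftarrow \mathbf{q}_t - \mathbf{a}_t$
            $\gamma_t \in \operatorname{argmin}_{\gamma \in [0, \lambda^{(t)}_{\mathbf{a}_t}]} f(\mathbf{y}_t + \gamma \mathbf{d}_t)$
            $\lambda^{(t+1)}_{\mathbf{v}} \leftarrow \lambda^{(t)}_{\mathbf{v}}$ for $\mathbf{v} \in \operatorname{vert}(P) \setminus \{\mathbf{a}_t, \mathbf{q}_t\}$
            $\lambda^{(t+1)}_{\mathbf{a}_t} \leftarrow \lambda^{(t)}_{\mathbf{a}_t} - \gamma_t$
            $\lambda^{(t+1)}_{\mathbf{q}_t} \leftarrow \lambda^{(t)}_{\mathbf{q}_t} + \gamma_t$
        else
            $\mathbf{d}_t \leftarrow \mathbf{w}_t - \mathbf{y}_t$
            $\gamma_t \in \operatorname{argmin}_{\gamma \in [0, 1]} f(\mathbf{y}_t + \gamma \mathbf{d}_t)$
            $\lambda^{(t+1)}_{\mathbf{v}} \leftarrow (1 - \gamma_t) \lambda^{(t)}_{\mathbf{v}}$ for $\mathbf{v} \in \operatorname{vert}(P) \setminus \{\mathbf{w}_t\}$
            $\lambda^{(t+1)}_{\mathbf{w}_t} \leftarrow (1 - \gamma_t) \lambda^{(t)}_{\mathbf{w}_t} + \gamma_t$
        end
        $S^{(t+1)} \leftarrow \{\mathbf{v} \in \operatorname{vert}(P) \mid \lambda^{(t+1)}_{\mathbf{v}} > 0\}$
        $\mathbf{y}_{t+1} \leftarrow \mathbf{y}_t + \gamma_t \mathbf{d}_t$
    end

Proof. Let $B = A^\intercal A \in \mathbb{R}^{\ell \times \ell}$, $\tilde{B} = \tilde{A}^\intercal \tilde{A} \in \mathbb{R}^{(\ell+1) \times (\ell+1)}$, $N = B^{-1} \in \mathbb{R}^{\ell \times \ell}$, and $\tilde{N} = \tilde{B}^{-1} = (\tilde{A}^\intercal \tilde{A})^{-1} \in \mathbb{R}^{(\ell+1) \times (\ell+1)}$. In $O(m\ell)$ elementary operations, we can construct $A^\intercal \mathbf{b} \in \mathbb{R}^\ell$, $\mathbf{b}^\intercal \mathbf{b} = \|\mathbf{b}\|_2^2 \in \mathbb{R}$, and $\tilde{A} = (A, \mathbf{b}) \in \mathbb{R}^{m \times (\ell+1)}$, and in additional $O(\ell^2)$ elementary operations, we can construct

$$\tilde{B} = \tilde{A}^\intercal \tilde{A} = \begin{pmatrix} B & A^\intercal \mathbf{b} \\ \mathbf{b}^\intercal A & \|\mathbf{b}\|_2^2 \end{pmatrix} \in \mathbb{R}^{(\ell+1) \times (\ell+1)}.$$

We can then compute $\tilde{N} = \tilde{B}^{-1} = (\tilde{A}^\intercal \tilde{A})^{-1} \in \mathbb{R}^{(\ell+1) \times (\ell+1)}$ in additional $O(\ell^2)$ elementary operations. We write

$$\tilde{N} = \begin{pmatrix} \tilde{N}_1 & \tilde{\mathbf{n}}_2 \\ \tilde{\mathbf{n}}_2^\intercal & \tilde{n}_3 \end{pmatrix} \in \mathbb{R}^{(\ell+1) \times (\ell+1)},$$

where $\tilde{N}_1 \in \mathbb{R}^{\ell \times \ell}$, $\tilde{\mathbf{n}}_2 \in \mathbb{R}^\ell$, and $\tilde{n}_3 \in \mathbb{R}$. Then, it has to hold that

$$\tilde{B} \tilde{N} = \begin{pmatrix} B & A^\intercal \mathbf{b} \\ \mathbf{b}^\intercal A & \|\mathbf{b}\|_2^2 \end{pmatrix} \begin{pmatrix} \tilde{N}_1 & \tilde{\mathbf{n}}_2 \\ \tilde{\mathbf{n}}_2^\intercal & \tilde{n}_3 \end{pmatrix} = I \in \mathbb{R}^{(\ell+1) \times (\ell+1)},$$

where $I$ is the identity matrix. Note that $\mathbf{b}^\intercal A \tilde{\mathbf{n}}_2 + \|\mathbf{b}\|_2^2 \tilde{n}_3 = 1$.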
Thus,

$$\tilde{n}_3 = \frac{1 - \mathbf{b}^\intercal A \tilde{\mathbf{n}}_2}{\|\mathbf{b}\|_2^2} \in \mathbb{R}, \tag{C.2}$$

which is well-defined due to the assumption $\|\mathbf{b}\|_2 > 0$. Since $\mathbf{b}^\intercal A$ is already computed, once $\tilde{\mathbf{n}}_2$ is computed, the computation of $\tilde{n}_3$ requires only additional $O(\ell)$ elementary operations. Similarly, we have that

$$B \tilde{\mathbf{n}}_2 + A^\intercal \mathbf{b} \tilde{n}_3 = \mathbf{0} \in \mathbb{R}^\ell. \tag{C.3}$$

[…]

C.2 WEAK INVERSE HESSIAN BOOSTING (WIHB)

So far, we have introduced IHB to drastically speed up training of AGDAVI and CGAVI. A drawback of using CGAVI-IHB is that the initialization of CG variants with a non-sparse initial vector such as $\mathbf{y}_0 = -(A^\intercal A)^{-1} A^\intercal \mathbf{b}$ leads to the construction of generally non-sparse generators. In this section, we explain how to combine the speed-up of IHB with the sparsity-inducing properties of CG variants, referring to the resulting technique as weak inverse Hessian boosting (WIHB). Specifically, we present WIHB with BPCGAVI (OAVI with solver BPCG), referred to as BPCGAVI-WIHB in Algorithm 3. The high-level idea of BPCGAVI-WIHB is to use IHB and vanilla CG to quickly check whether a $(\psi, 1, \tau)$-approximately vanishing generator exists. If it does, we then use BPCG to try and construct a $(\psi, 1, \tau)$-approximately vanishing generator that is also sparse.

Algorithm 3: BPCGAVI-WIHB
    […]
10              $h \leftarrow \sum_{j=1}^{\ell} d_j t_j + u_i$    ▷ a sparse polynomial
11              if $\operatorname{mse}(h, X) \leq \psi$ then    ▷ check whether h vanishes
12                  $\mathcal{G} \leftarrow \mathcal{G} \cup \{h\}$
13              else
14                  $\mathcal{G} \leftarrow \mathcal{G} \cup \{g\}$
15              end
16          else
17              $\mathcal{O} = \{t_1, \ldots, t_{\ell+1}\}_\sigma \leftarrow (\mathcal{O} \cup \{u_i\})_\sigma$
18          end
19      end
20      $d \leftarrow d + 1$
21  end

We proceed by giving a detailed overview of BPCGAVI-WIHB. In Line 5 of BPCGAVI-WIHB, we construct $\mathbf{y}_0$. Then, in Line 6, we solve $\mathbf{c} \in \operatorname{argmin}_{\mathbf{y} \in P} \frac{1}{m} \|A\mathbf{y} + \mathbf{b}\|_2^2$ up to tolerance $\epsilon$ using CG with starting vector $\mathbf{y}_0 = -(A^\intercal A)^{-1} A^\intercal \mathbf{b}$. Since $\mathbf{y}_0$ is a well-educated guess for the solution to the constrained optimization problem $\operatorname{argmin}_{\mathbf{y} \in P} \frac{1}{m} \|A\mathbf{y} + \mathbf{b}\|_2^2$ in Line 6, CG often runs for only very few iterations. The drawback of using the non-sparse $\mathbf{y}_0$ as a starting vector is that $\mathbf{c}$ constructed in Line 6 and $g$ constructed in Line 7 are generally non-sparse.
We alleviate the issue of non-sparsity of g in Lines 8-18. In Line 8, we first check whether g is a (ψ, 1, τ)-approximately vanishing generator of X. If g does not vanish approximately, we know that there does not exist an approximately vanishing generator with leading term $u_i$ and we append $u_i$ to O in Line 17. If, however, mse(g, X) ≤ ψ, we solve the constrained convex optimization problem in Line 6 again in Line 9 up to tolerance ϵ using BPCG and a vertex of the ℓ1-ball P as starting vector. This has two consequences:

1. The vector d constructed in Line 9 tends to be sparse, as corroborated by the results in Table 2.
2. The execution of Line 9 tends to take longer than the execution of Line 6 since BPCG's starting vector is not necessarily close in Euclidean distance to the optimal solution.

Then, in Line 10, we construct the polynomial h, which tends to be sparse. If h is a (ψ, 1, τ)-approximately vanishing generator, we append the sparse h to G in Line 12. If it happens that mse(g, X) ≤ ψ < mse(h, X), we append the non-sparse g to G in Line 14.

Following the discussion of Section 4.3, should (INF) ever hold, that is, $\|y_0\|_1 > \tau - 1$, we can no longer proceed with BPCGAVI-WIHB as we can no longer guarantee that the inverse of $A^\intercal A$ exists. In that case, we proceed with vanilla BPCGAVI for all terms remaining in the current and upcoming borders.

The discussion above illustrates that BPCGAVI-WIHB solves $\operatorname{argmin}_{y \in P} \frac{1}{m} \|Ay + b\|_2^2$ with BPCG $|G|$ times as opposed to BPCGAVI, which solves $\operatorname{argmin}_{y \in P} \frac{1}{m} \|Ay + b\|_2^2$ with BPCG $|G| + |O| - 1$ times. Thus, BPCGAVI-WIHB combines the sparsity-inducing properties of BPCG with the speed-up of IHB without any of IHB's drawbacks. The speed-up of BPCGAVI-WIHB compared to BPCGAVI is evident from the numerical experiments in Figure 2, and the results of Table 2 indicate that the sparsity-inducing properties of BPCG are successfully exploited.
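Both IHB and WIHB rely on the column-wise inverse update of Theorem C.1. The following is a minimal NumPy sketch of that update, not the authors' implementation: the function name `ihb_append_column` and the tolerance `rel_tol` are illustrative assumptions. It uses the closed forms that follow from the block equation $\tilde{B}\tilde{N} = I$ and the Sherman-Morrison formula, namely $\tilde{n}_3 = 1/s$, $\tilde{n}_2 = -N A^\intercal b / s$, and $\tilde{N}_1 = N + N A^\intercal b \, b^\intercal A N / s$ with $s = \|b\|_2^2 - b^\intercal A N A^\intercal b$.

```python
import numpy as np


def ihb_append_column(A, N, b, rel_tol=1e-12):
    """Append column b to A and update N = (A^T A)^{-1} accordingly.

    Returns (A_new, N_new) with A_new = (A, b) and
    N_new = (A_new^T A_new)^{-1}, using O(l*m + l^2) operations.
    `rel_tol` is an illustrative threshold for the two assumptions
    of Theorem C.1 (it is not taken from the paper).
    """
    Atb = A.T @ b                    # A^T b, costs O(l * m)
    bb = float(b @ b)                # ||b||_2^2, costs O(m)
    NAtb = N @ Atb                   # N A^T b, costs O(l^2)
    s = bb - float(Atb @ NAtb)       # ||b||^2 - b^T A (A^T A)^{-1} A^T b

    # The assumptions of Theorem C.1: ||b|| > 0 and s != 0.
    if bb <= 0.0 or s <= rel_tol * bb:
        raise ValueError("assumptions of the update are violated")

    n3 = 1.0 / s                             # bottom-right entry
    n2 = -NAtb / s                           # off-diagonal block
    N1 = N + np.outer(NAtb, NAtb) / s        # top-left block, O(l^2)

    N_new = np.block([[N1, n2[:, None]],
                      [n2[None, :], np.array([[n3]])]])
    A_new = np.column_stack([A, b])
    return A_new, N_new
```

On a well-conditioned random instance, `N_new` should agree with `np.linalg.inv(A_new.T @ A_new)` up to floating-point error. The sketch does not address the numerical instability of repeated inverse updates.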
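D PYRAMIDAL WIDTH OF THE ℓ1-BALL

Pena & Rodriguez (2019) showed that the pyramidal width ω of the ℓ1-ball of radius τ, that is, $P = \{x \in \mathbb{R}^{\ell} \mid \|x\|_1 \leq \tau\} \subseteq \mathbb{R}^{\ell}$, is given by

$$\omega = \min_{F \in \operatorname{faces}(P),\ \emptyset \subsetneq F \subsetneq P} \operatorname{dist}(F, \operatorname{conv}(\operatorname{vert}(P) \setminus F)),$$

where, for a polytope $P \subseteq \mathbb{R}^{\ell}$, $\operatorname{faces}(P)$ denotes the set of faces of $P$ and $\operatorname{vert}(P)$ denotes the set of vertices of $P$, for a set $F \subseteq \mathbb{R}^{\ell}$, $\operatorname{conv}(F)$ is the convex hull of $F$, and for two disjoint sets $F, G \subseteq \mathbb{R}^{\ell}$, $\operatorname{dist}(F, G)$ is the Euclidean distance between $F$ and $G$.

Lemma D.1 (Pyramidal width of the ℓ1-ball). The pyramidal width of the ℓ1-ball of radius τ, that is, $P = \{x \in \mathbb{R}^{\ell} \mid \|x\|_1 \leq \tau\} \subseteq \mathbb{R}^{\ell}$, is lower bounded by $\omega \geq \frac{\tau}{\sqrt{\ell - 1}}$.

Proof. Let $e^{(i)} \in \mathbb{R}^{\ell}$ denote the $i$th unit vector. Note that a non-trivial face of $P = \operatorname{conv}\{\pm\tau e^{(1)}, \dots, \pm\tau e^{(\ell)}\} \subseteq \mathbb{R}^{\ell}$ that is not $P$ itself cannot contain both $\tau e^{(i)}$ and $-\tau e^{(i)}$ for any $i \in \{1, \dots, \ell\}$. Thus, due to symmetry, we can assume that any non-trivial face of $P$ that is not $P$ itself is of the form $F = \operatorname{conv}(\{\tau e^{(1)}, \dots, \tau e^{(k)}\})$ for some $k \in \{1, \dots, \ell - 1\}$. Then, $G := \operatorname{conv}(\operatorname{vert}(P) \setminus F) = \operatorname{conv}(\{-\tau e^{(1)}, \dots, -\tau e^{(k)}, \pm\tau e^{(k+1)}, \dots, \pm\tau e^{(\ell)}\})$. We have that

$$\begin{aligned}
\operatorname{dist}(F, G)^2 &= \min_{u \in F,\, v \in G} \|u - v\|_2^2 \\
&= \min_{u \in F,\, v \in G} \sum_{i=1}^{k} (u_i - v_i)^2 + \sum_{j=k+1}^{\ell} v_j^2 \\
&\geq \min_{u \in F,\, v \in G} \sum_{i=1}^{k} (u_i - v_i)^2 && \text{▷ since } v_j^2 \geq 0 \text{ for all } j \in \{k+1, \dots, \ell\} \\
&\geq \min_{u \in F} \sum_{i=1}^{k} u_i^2 && \text{▷ since } v_i \leq 0 \leq u_i \text{ for all } i \in \{1, \dots, k\} \\
&= \frac{\tau^2}{k}.
\end{aligned}$$

Since $\frac{\tau^2}{k}$ is minimized for $k \in \{1, \dots, \ell - 1\}$ as large as possible, it follows that $\omega \geq \frac{\tau}{\sqrt{\ell - 1}}$.

Results. The results are presented in Figure 3. For small data sets, we observe that ABM and VCA are faster than OAVI. However, when the number of samples in the data set is large, as in the synthetic data set, OAVI can be trained faster than ABM and VCA. AGDAVI-IHB is slower than CGAVI-IHB because AGD cannot use the Frank-Wolfe gap as an early termination criterion to quickly decide that a solution close enough to the optimum is reached.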



A set X ⊆ R n is algebraic if it is the set of common roots of a finite set of polynomials.



We combine the sets of generators corresponding to the different classes to obtain $G = \{g_1, \dots, g_{|G|}\} = \bigcup_{i=1}^{k} G_i$, which encapsulates the feature

Figure 1: Training time comparisons with fixed ψ = 0.005, averaged over ten random runs with shaded standard deviations. On all data sets except skin, BPCGAVI is faster than PCGAVI.

and $(G, O) = \text{OAVI}(X, \psi, \tau)$. Let $T_{\text{ORACLE}}$ and $S_{\text{ORACLE}}$ be the time and space complexities required to solve the convex optimization problem in Line 5 of OAVI, respectively. In the real number model, the time and space complexities of OAVI are $O((|G| + |O|)^2 + (|G| + |O|) T_{\text{ORACLE}})$ and $O((|G| + |O|) m + S_{\text{ORACLE}})$, respectively. The evaluation vectors of all polynomials in G over a

Figure 2: Training time comparisons with fixed ψ = 0.005, averaged over ten random runs with shaded standard deviations. CGAVI-IHB is faster than BPCGAVI-WIHB, which is faster than BPCGAVI.

Corollary 4.3 (Complexity). Let $X = \{x_1, \dots, x_m\} \subseteq \mathbb{R}^n$, $\psi > 0$, $\tau \geq 2$, and $(G, O) = \text{BPCGAVI}(X, \psi, \tau)$. In the real number model, the time and space complexities of BPCGAVI are $O((|G| + |O|)^2 m + (|G| + |O|)^4)$ and $O((|G| + |O|) m + (|G| + |O|)^2)$, respectively.

Figure 3: Training time comparisons, averaged over ten random runs with shaded standard deviations. For small data sets, ABM and VCA are faster than OAVI, but for synthetic, the training times of ABM and VCA scale worse than OAVI's. See Appendix F.2 for details.

and $f(y) = \frac{1}{m} \|Ay + b\|_2^2$, the optimization problem in Line 5 of OAVI takes the form $c \in \operatorname{argmin}_{y \in \mathbb{R}^{\ell}} f(y)$. Then, the gradient and Hessian of $f$ at $y$

$\partial_d O$, where $\sum_{j=1}^{n} \alpha_j = d$ and $\alpha_j \in \mathbb{N}$ for all $j \in \{1, \dots, n\}$, is the leading term of a (ψ, 1, τ)-approximately vanishing generator with non-leading terms only in O.

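$\omega = \min_{F \in \operatorname{faces}(P),\ \emptyset \subsetneq F \subsetneq P} \operatorname{dist}(F, \operatorname{conv}(\operatorname{vert}(P) \setminus F))$, where, for a polytope $P \subseteq \mathbb{R}^{\ell}$, $\operatorname{faces}(P)$ denotes the set of faces of $P$ and $\operatorname{vert}(P)$ denotes the set of vertices of $P$, for a set $F \subseteq \mathbb{R}^{\ell}$, $\operatorname{conv}(F)$ is the convex hull of $F$, and for two disjoint sets $F, G \subseteq \mathbb{R}^{\ell}$, $\operatorname{dist}(F, G)$ is the Euclidean distance between $F$ and $G$.

Lemma D.1 (Pyramidal width of the ℓ1-ball). The pyramidal width of the ℓ1-ball of radius τ, that is, $P = \{x \in \mathbb{R}^{\ell} \mid \|x\|_1 \leq \tau\} \subseteq \mathbb{R}^{\ell}$, is lower bounded by $\omega \geq \frac{\tau}{\sqrt{\ell - 1}}$.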

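$\operatorname{dist}(F, G)^2 = \min_{u \in F,\, v \in G} \|u - v\|_2^2 \geq \min_{u \in F,\, v \in G} \sum_{i=1}^{k} (u_i - v_i)^2$ (since $v_j^2 \geq 0$ for all $j \in \{k+1, \dots, \ell\}$) $\geq \min_{u \in F} \sum_{i=1}^{k} u_i^2$ (since $v_i \leq 0 \leq u_i$ for all $i \in \{1, \dots, k\}$) $= \frac{\tau^2}{k}$.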
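The value $\tau^2/k$ at the end of this chain is the minimum of $\sum_{i=1}^{k} u_i^2$ over the face $F = \operatorname{conv}(\{\tau e^{(1)}, \dots, \tau e^{(k)}\})$, on which $u_i \geq 0$ and $\sum_{i=1}^{k} u_i = \tau$. A short check of this step via the Cauchy-Schwarz inequality, written in the notation of the proof:

```latex
\tau^2
  = \Big( \sum_{i=1}^{k} u_i \cdot 1 \Big)^2
  \leq \Big( \sum_{i=1}^{k} u_i^2 \Big) \Big( \sum_{i=1}^{k} 1^2 \Big)
  = k \sum_{i=1}^{k} u_i^2
\quad \Longrightarrow \quad
\sum_{i=1}^{k} u_i^2 \geq \frac{\tau^2}{k}
\quad \text{for all } u \in F,
```

with equality for $u_i = \tau / k$ for all $i \in \{1, \dots, k\}$, which lies in $F$. Hence the minimum is attained and equals $\tau^2/k$.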

u, denoted by t | u, if there exists v ∈ T such that t • v = u. If t does not divide u, we write t ∤ u. OAVI constructs generators of the approximate

Overview of data sets. All data sets are binary classification data sets and are retrieved from the UCI Machine Learning Repository (Dua & Graff, 2017) and additional references are provided.

Corollary 4.4 (Complexity). Let $X = \{x_1, \dots, x_m\} \subseteq \mathbb{R}^n$, $\psi \geq 0$, $\tau \geq 2$ large enough such that (INF) never holds, and $(G, O) = \text{CGAVI-IHB}(X, \psi, \tau)$. Assuming CG terminates after a constant number of iterations due to IHB, the time and space complexities of CGAVI-IHB are $O((|G| + |O|)^2 m + (|G| + |O|)^3)$ and $O((|G| + |O|) m + (|G| + |O|)^2)$, respectively.

Numerical results averaged over ten random 60%/40% train/test partitions with best results in bold. For approaches other than BPCGAVI-WIHB*, spar(G) < 0.01 and we omit the results.


, guaranteeing that $h$ can be constructed in Lines 5-6 of OAVI. Since $u \in \partial_d O$, for any term $t \in T$ such that $t \mid u$ and $t \neq u$, it holds that $t \in O$ and thus, $h$ is a polynomial with $\operatorname{ltc}(h) = 1$, $\operatorname{lt}(h) = u$, and other terms only in $O$. Under the assumption that the convex optimization oracle in OAVI is accurate, $\operatorname{mse}(h, X) \geq \operatorname{mse}(g, X)$, where $g$ is the polynomial constructed during Lines 5 and 6 of OAVI. Hence, proving that $h$ vanishes approximately implies that $g$ vanishes approximately.

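Algorithm 2 (BPCG), loop header and vertex selection:
  for $t = 0, \dots, T-1$ do
    $a_t \in \operatorname{argmax}_{v \in S^{(t)}} \langle \nabla f(y_t), v \rangle$    ▷ away vertex
    $q_t \in \operatorname{argmin}_{v \in S^{(t)}} \langle \nabla f(y_t), v \rangle$    ▷ local FW vertex
    $w_t \in \operatorname{argmin}_{v \in \operatorname{vert}(P)} \langle \nabla f(y_t), v \rangle$    ▷ FW vertex
    if $\langle \nabla f(y_t), w_t - y_t \rangle \geq \langle \nabla f(y_t), q_t - a_t \rangle$ then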
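The steps of Algorithm 2 can be sketched in Python for the problem class used throughout, $f(y) = \frac{1}{m}\|Ay + b\|_2^2$ over an ℓ1-ball. This is an illustrative sketch under stated assumptions, not the authors' code: the function name `bpcg_l1_least_squares`, the encoding of vertices as `(index, sign)` keys, the stopping tolerance `tol`, and the closed-form line search (valid because $f$ is quadratic) are all choices made here.

```python
import numpy as np


def bpcg_l1_least_squares(A, b, radius, num_iters=1000, tol=1e-10):
    """BPCG sketch for min (1/m)||A y + b||_2^2 over the l1-ball.

    Vertices are +-radius * e_i, stored as keys (i, sign) in `weights`,
    which holds the convex-combination weights lambda of the active set.
    """
    m, ell = A.shape

    def vertex(key):
        v = np.zeros(ell)
        v[key[0]] = key[1] * radius
        return v

    weights = {(0, 1.0): 1.0}        # all mass on y_0 = radius * e_0
    y = vertex((0, 1.0))
    for _ in range(num_iters):
        grad = (2.0 / m) * (A.T @ (A @ y + b))

        def score(key):              # <grad, v> for the vertex encoded by key
            return radius * key[1] * grad[key[0]]

        a_key = max(weights, key=score)              # away vertex
        q_key = min(weights, key=score)              # local FW vertex
        i = int(np.argmax(np.abs(grad)))
        w_key = (i, -1.0 if grad[i] > 0 else 1.0)    # FW vertex
        fw_gap = float(grad @ y) - score(w_key)      # <grad, y - w>
        if fw_gap <= tol:
            break

        # Pairwise step if <grad, w - y> >= <grad, q - a>, else FW step.
        pairwise = -fw_gap >= score(q_key) - score(a_key)
        if pairwise:
            d, gamma_max = vertex(q_key) - vertex(a_key), weights[a_key]
        else:
            d, gamma_max = vertex(w_key) - y, 1.0

        # Exact line search on [0, gamma_max] for the quadratic objective.
        Ad = A @ d
        curvature = (2.0 / m) * float(Ad @ Ad)
        slope = float(grad @ d)
        if curvature > 0.0:
            gamma = min(gamma_max, max(0.0, -slope / curvature))
        else:
            gamma = gamma_max if slope < 0.0 else 0.0

        if pairwise:
            weights[a_key] -= gamma
            weights[q_key] += gamma
        else:
            weights = {k: (1.0 - gamma) * lam for k, lam in weights.items()}
            weights[w_key] = weights.get(w_key, 0.0) + gamma
        weights = {k: lam for k, lam in weights.items() if lam > 0.0}
        y = y + gamma * d
    return y, weights
```

The returned `weights` dictionary plays the role of $S^{(T)}$ together with the weights $\lambda^{(T)}$; its size bounds the number of non-zero entries of the returned point, which is the source of the sparsity discussed for BPCG.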

ACKNOWLEDGEMENTS

We would like to thank Gabor Braun for providing us with the main arguments for the proof of the lower bound on the pyramidal width of the ℓ1-ball. This research was partially funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany's Excellence Strategy - The Berlin Mathematics Research Center MATH+ (EXC-2046/1, project ID 390685689, BMS Stipend) and JST, ACT-X Grant Number JPMJAX200F, Japan.

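Plugging (C.2) into (C.3), we obtain

$$\Big( B - \frac{A^\intercal b \, b^\intercal A}{\|b\|_2^2} \Big) \tilde{n}_2 = -\frac{A^\intercal b}{\|b\|_2^2}.$$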

The existence of $\big( B - \frac{A^\intercal b \, b^\intercal A}{\|b\|_2^2} \big)^{-1}$ follows from the Sherman-Morrison formula (Sherman & Morrison, 1950; Bartlett, 1951) and the assumption that $b^\intercal A (A^\intercal A)^{-1} A^\intercal b \neq \|b\|_2^2$. Then, again using the Sherman-Morrison formula,

$$\tilde{n}_2 = -\Big( B - \frac{A^\intercal b \, b^\intercal A}{\|b\|_2^2} \Big)^{-1} \frac{A^\intercal b}{\|b\|_2^2} = -\Big( N + \frac{N A^\intercal b \, b^\intercal A N}{\|b\|_2^2 - b^\intercal A N A^\intercal b} \Big) \frac{A^\intercal b}{\|b\|_2^2},$$

which can be computed using additional $O(\ell^2)$ elementary operations. Finally, we construct $\tilde{N}_1$, which is determined by

$$B \tilde{N}_1 + A^\intercal b \, \tilde{n}_2^\intercal = I \in \mathbb{R}^{\ell \times \ell}, \quad \text{that is,} \quad \tilde{N}_1 = N - N A^\intercal b \, \tilde{n}_2^\intercal.$$

In summary, we require $O(\ell m + \ell^2)$ elementary operations. Note that even if we do not compute $(\tilde{A}^\intercal \tilde{A})^{-1}$, we still require $O(\ell m + \ell^2)$ elementary operations.

In the remark below, we discuss the literature related to IHB.

Remark C.2 (Work related to IHB). The proof of Theorem C.1 is similar to the proof that the inverse of the Hessian can be updated efficiently in the online Newton algorithm (Hazan et al., 2007). Both proofs rely on the Sherman-Morrison formula (Sherman & Morrison, 1950; Bartlett, 1951). However, in our setting, the updates occur column-wise instead of row-wise. Updating $(A^\intercal A)^{-1}$ column-wise for generator-constructing algorithms was already discussed in Limbeck (2013) using QR decompositions without addressing the numerical instability of inverse matrix updates.

Next, we prove that Theorem C.1 implies the improved computational complexity of CGAVI-IHB in Corollary 4.4.

Proof of Corollary 4.4. Suppose that CGAVI-IHB is currently executing Lines 5-11 of Algorithm 1 for a particular term $u_i$. Then, the optimization problem in Line 5 of Algorithm 1 takes the form $c \in \operatorname{argmin}_{y \in P} f(y)$.

We first prove that the violation of either of the two assumptions of Theorem C.1 implies the existence of a (ψ, 1, τ)-approximately vanishing generator g. We prove this claim by treating the two assumptions separately:

1. If $\|b\|_2 = 0$, it holds that $\operatorname{mse}(u_i, X) = 0$. By the assumption that τ ≥ 2 is large enough to guarantee that (INF) never holds, CGAVI-IHB constructs a (ψ, 1, τ)-approximately vanishing generator in Line 6.

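2. If $b^\intercal A (A^\intercal A)^{-1} A^\intercal b = \|b\|_2^2$, the residual of the unconstrained least-squares problem is zero, that is, $\min_{y \in \mathbb{R}^{\ell}} \|Ay + b\|_2^2 = \|b\|_2^2 - b^\intercal A (A^\intercal A)^{-1} A^\intercal b = 0$.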

Thus, by the assumption that τ ≥ 2 is large enough to guarantee that (INF) never holds, CGAVI-IHB constructs a (ψ, 1, τ)-approximately vanishing generator in Line 6.

Since an update of $(A^\intercal A)^{-1}$ is necessary only if Line 10 is executed, that is, when there does not exist a (ψ, 1, τ)-approximately vanishing generator g, we never have to update $(A^\intercal A)^{-1}$ when the assumptions of Theorem C.1 are violated. Thus, by Theorem C.1, we can always update $A$, $A^\intercal A$, and $(A^\intercal A)^{-1}$ in $O(\ell m + \ell^2)$ elementary operations.

Note that the time and space complexities in Corollary 4.4 also hold for AGDAVI-IHB with τ = ∞.

Algorithm 3: Blended pairwise conditional gradients approximate vanishing ideal algorithm with weak inverse Hessian boosting (BPCGAVI-WIHB)
Input: A data set $X = \{x_1, \dots, x_m\} \subseteq \mathbb{R}^n$ and parameters $\psi \geq \epsilon \geq 0$ and $\tau \geq 2$.
Output: A set of polynomials $G \subseteq P$ and a set of monomials $O \subseteq T$.

The synthetic data set consists of three features and contains two different classes. Samples of the first class are generated such that they satisfy one equation, and samples of the second class are generated such that they satisfy another. The samples are perturbed with additive Gaussian noise with mean 0 and standard deviation 0.05.

F ADDITIONAL NUMERICAL EXPERIMENTS

In this section, we provide details for the numerical experiments conducted in Figures 1, 2, and 3.

F.1 EXPERIMENT: SPEEDING UP CGAVI

We compare the training times of PCGAVI, BPCGAVI, BPCGAVI-WIHB, and CGAVI-IHB on the data sets bank, htru, skin, and synthetic.

Setup. For a single run, we randomly split the data set into subsets of varying sizes. Then, for fixed ψ = 0.005, we run the generator-constructing algorithms on subsets of the full data set of varying sizes and plot the training times, which are the times required to run PCGAVI, BPCGAVI, BPCGAVI-WIHB, and CGAVI-IHB once for each class. The results are averaged over ten random runs and standard deviations are shaded in Figures 1 and 2.

Results. The results are presented in Figures 1 and 2. In Figure 1, we observe that the training times for BPCGAVI are often shorter than for PCGAVI, except for the skin data set. In Figure 2, we observe that IHB not only leads to theoretically better training times but also speeds up training in practice: CGAVI-IHB is always faster than BPCGAVI. Furthermore, WIHB is indeed the best of both worlds: BPCGAVI-WIHB is always faster than BPCGAVI but preserves the sparsity-inducing properties of BPCG. The latter can also be seen in Table 2, where BPCGAVI-WIHB* is the only algorithmic approach that constructs sparse generators.
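The measurement protocol in the setup above (random subsets of varying sizes, repeated runs, mean and standard deviation of the training time) can be sketched as a small timing harness. This is an assumption-laden illustration rather than the experimental code: `time_training` and its parameters are hypothetical names, `fit` stands in for any generator-constructing algorithm, and the per-class loop of the actual experiments is omitted for brevity.

```python
import time

import numpy as np


def time_training(fit, X, subset_sizes, num_runs=10, seed=0):
    """Time `fit` on random subsets of X of the given sizes.

    Returns the mean and standard deviation of the wall-clock time
    per subset size, taken over `num_runs` random runs.
    """
    rng = np.random.default_rng(seed)
    times = np.zeros((num_runs, len(subset_sizes)))
    for run in range(num_runs):
        perm = rng.permutation(len(X))       # fresh random split per run
        for j, size in enumerate(subset_sizes):
            subset = X[perm[:size]]
            start = time.perf_counter()
            fit(subset)                      # hypothetical training call
            times[run, j] = time.perf_counter() - start
    return times.mean(axis=0), times.std(axis=0)
```

Plotting the returned means against `subset_sizes`, with the standard deviations as a shaded band, gives curves of the kind shown in Figures 1 and 2.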

F.2 EXPERIMENT: SCALABILITY COMPARISON

We compare the training times of CGAVI-IHB, AGDAVI-IHB, ABM, and VCA on the data sets bank, htru, skin, and synthetic.

Setup. For a single run, on at most 10,000 samples of the data set, we tune the hyperparameters of the generator-constructing algorithms OAVI, ABM, and VCA and a subsequently applied linear kernel SVM using threefold cross-validation. Then, using the determined hyperparameters, we run only the generator-constructing algorithms on subsets of the full data set of varying sizes and plot the training times, which are the times required to run CGAVI-IHB, AGDAVI-IHB, ABM, and VCA once for each class. The results are averaged over ten random runs and standard deviations are shaded in Figure 3.

