PROGRESSIVE VORONOI DIAGRAM SUBDIVISION ENABLES ACCURATE DATA-FREE CLASS-INCREMENTAL LEARNING

Abstract

Data-free Class-incremental Learning (CIL) is a challenging problem because rehearsing data from previous phases is strictly prohibited, causing catastrophic forgetting of Deep Neural Networks (DNNs). In this paper, we present iVoro, a novel framework derived from computational geometry. We find that the Voronoi Diagram (VD), a classical model for space subdivision, is especially powerful for solving the CIL problem, because VD itself can be constructed favorably in an incremental manner: the newly added sites (classes) affect only the proximate classes, making the non-contiguous classes hardly forgettable. Furthermore, we bridge DNN and VD using Power Diagram Reduction, and show that the VD structure can be progressively refined along the phases using a divide-and-conquer algorithm. Moreover, our VD construction is not restricted to the deep feature space, but is also applicable to multiple intermediate feature spaces, promoting VD to a multilayer VD that efficiently captures multi-grained features from the DNN. Importantly, iVoro is also capable of uncertainty-aware test-time Voronoi cell assignment, exhibiting high correlations between geometric uncertainty and predictive accuracy (up to ∼0.9). Putting everything together, iVoro achieves up to 25.26%, 37.09%, and 33.21% improvements on CIFAR-100, TinyImageNet, and ImageNet-Subset, respectively, compared to the state-of-the-art non-exemplar CIL approaches. In conclusion, iVoro enables highly accurate, privacy-preserving, and geometrically interpretable CIL that is particularly useful when cross-phase data sharing is forbidden, e.g. in medical applications.

1. We explore, for the first time, the idea of using prototypical networks (Snell et al., 2017) for CIL, which is equivalent to constructing a VD in the (fixed) feature space (denoted as iVoro).

2. We show that the within-phase boundaries of VD can be progressively refined using a divide-and-conquer algorithm (iVoro-D).

3.
When it comes to test-time Voronoi cell assignment, we devise two protocols, augmentation consensus (iVoro-AC) and augmentation integration (iVoro-AI), for the post-processing of Self-supervised Learning (SSL)-based label augmentation, with quantitative uncertainty awareness.

4. Finally, we introduce multilayer features to build a multilayer VD, which consistently enhances the performance (iVoro-L).

iVoro. We begin with the simplest scenario in which the feature extractor is frozen after the first phase, and the prototypes ({c}) are used to construct the VD.

iVoro-D. iVoro treats all prototypes equally regardless of the phase at which they appear, and determines all Voronoi boundaries by bisecting prototypes. However, without considering the data distribution, the bisector of two prototypes is not optimal, especially within a given phase. We establish an explicit connection between DNN and VD using Voronoi Diagram Reduction (Ma et al., 2022a) and show that the within-phase decision boundaries (induced by {c}) can be refined by the DNN (i.e. linear probing) and aggregated into the global VD by a divide-and-conquer (D&C) algorithm (iVoro-D).

iVoro-AC/AI. Geometrically, SSL-based label augmentation (Lee et al., 2020) duplicates one Voronoi cell into multiple (possibly disjoint) Voronoi cells (see Fig. 2 (D)), and this causes ambiguity when assigning a query example to a cell, suggesting that uncertainty quantification cannot be neglected at test time. Here we propose two protocols to resolve this ambiguity, namely, augmentation consensus (iVoro-AC) and augmentation integration (iVoro-AI). We also show that the entropy-based geometric variance (Ding & Xu, 2020) is a good indicator of the uncertainty of this assignment, with high Pearson correlation coefficients up to ∼0.9.

iVoro-L. Until now, only deep features from the last layer are used for VD construction. However, the intermediate features could also be informative in aiding the VD construction.
Cluster-induced Voronoi Diagram (CIVD) (Chen et al., 2013; 2017; Huang & Xu, 2020; Huang et al., 2021), which allows for multiple centers per Voronoi cell, has recently achieved remarkable success in metric-based FSL by incorporating heterogeneous features into the VD (Ma et al., 2022a). In fact, for a deep neural network, the features induced by every layer can all be used to construct a VD. Finally, we thus explore the idea of building a multilayer VD using features elicited from multiple blocks of the DNN.

Broader impact. The fully-fledged iVoro achieves up to 25.26%, 37.09%, and 33.21% improvements on CIFAR-100, TinyImageNet, and ImageNet-Subset, respectively, compared with the state-of-the-art non-exemplar CIL approaches. Based on the frozen model trained at the first phase, iVoro and all its variants incur no additional training burden, and at the same time preserve the privacy of data from previous phases. It is worth noting that, although iVoro focuses on exemplar-free CIL, it outperforms even all the exemplar-based CIL methods. We believe iVoro could be further boosted when a small number of exemplars are allowed, which we leave for future work.

2. METHODOLOGY

In CIL, the data comes as a stream and a single model is trained on the current data locally without revisiting previous data, but should ideally be able to discriminate between all classes it has seen so far. In our geometric framework, starting from iVoro, the simplest prototype-induced VD model, we gradually add four components: (I) parameterized normalization (iVoro-N), (II) divide-and-conquer (iVoro-D), (III) augmentation consensus/integration (iVoro-AC/AI), and (IV) multilayer VD (iVoro-L).

1. INTRODUCTION

In many real-world applications such as medical imaging-based diagnosis, the learning system is usually required to be expandable to new classes, for example, from common to rare inherited retinal diseases (IRDs) (Miere et al., 2020), or from coarse to fine chest radiographic findings (Syeda-Mahmood et al., 2020), and, importantly, without losing the knowledge already learned. This motivates the concept of incremental learning (IL) (Hou et al., 2019; Wu et al., 2019; Zhu et al., 2021; Liu et al., 2021b), also known as continual learning (Parisi et al., 2019; Delange et al., 2021; Chaudhry et al., 2019), which has drawn growing interest in recent years. Although Deep Neural Networks (DNNs) have become the de facto method of choice due to their extraordinary ability to learn from complex data, they still suffer from severe catastrophic forgetting (McCloskey & Cohen, 1989; Goodfellow et al., 2014; Kemker et al., 2018) when adapting to new tasks that contain only unseen training samples from novel classes. To mitigate this issue, Rebuffi et al. (2017) proposed the paradigm of memory-based class-incremental learning (CIL) (Belouadah & Popescu, 2019; Zhao et al., 2020; Hou et al., 2019; Castro et al., 2018; Wu et al., 2019; Liu et al., 2021a; 2020a; 2021b), in which a small portion of samples (e.g., 20 exemplars per class) is stored for use in the subsequent phases. However, storing and sharing data, e.g. medical images, may not be feasible due to privacy considerations. Another line of methods memorizes (part of) the network and increases the model capacity for new classes (Rusu et al., 2016; Li et al., 2019; Wang et al., 2017; Yoon et al., 2017), which may incur unbounded memory consumption for long task sequences. Hence, in this paper, we focus on the challenging data-free CIL problem under the strictest memory and privacy constraints: no stored exemplars and fixed model capacity.
Despite extensive research in recent years (see Appendix A for a literature review), three challenges still pose an obstacle to successful CIL. (I) During the course of isolated training on new data, the feature distributions of the old classes usually change dramatically (see Fig. 2 (A) for an illustration). Knowledge Distillation (KD) (Hinton et al., 2015) has become routine in many CIL methods (Li & Hoiem, 2017; Schwarz et al., 2018; Castro et al., 2018; Hou et al., 2019; Dhar et al., 2019; Douillard et al., 2020; Zhu et al., 2021) to partially maintain the spatial distribution of the old classes. The KD loss, however, is typically applied to the whole network, and a strong KD loss may degrade the network's ability to adapt to novel classes. (II) Without full access to the old data, the decision boundaries cannot be learned precisely, making it harder to discriminate between old and new classes. Taking inspiration from metric-based Few-shot Learning (FSL) (Snell et al., 2017), PASS (Zhu et al., 2021) memorizes a set of prototypes (feature centroids) and generates features augmented by Gaussian noise for joint training in new phases. However, feature centroids might be suboptimal for representing a whole class, which is not necessarily normally distributed (Fig. 2 (B)). (III) Since the old classes and the new classes are learned in a disjoint manner, their distributions are likely to overlap, which becomes even more severe in our exemplar-free setting as the old data is totally absent. To circumvent this issue, Task-incremental learning (TIL) (Shin et al., 2017; Kirkpatrick et al., 2017; Zenke et al., 2017; Wu et al., 2018; Lopez-Paz & Ranzato, 2017; Buzzega et al., 2020; Cha et al., 2021; Pham et al., 2021; Fernando et al., 2017) assumes that the phase within which a class was learned is known, which is generally unrealistic in practice. CIL is not grounded on this assumption.
In this paper, we tackle the CIL problem from a geometric point of view. The Voronoi Diagram (VD) is a classical model for space subdivision and is the underlying geometric structure of the 1-nearest-neighbor classifier (Lee, 1982). We find that VD bears a close analogy to incremental learning, because VD itself can be constructed favorably in an incremental manner: the newly added sites (classes) will roughly change only the cells of the neighboring classes, leaving the non-contiguous classes untouched and thus hardly forgettable (see Figure 1). Based on this intuition, we present a holistic geometric framework based on VD that significantly surmounts all the listed obstacles. The contributions can be summarized as follows.

Specifically, let $\mathcal{D} = \{D_t\}_{t=1}^{T}$ be the data stream in which $D_t = \{(x_{t,i}, y_{t,i})\}_{i=1}^{N_t}$ is the dataset at time step $t$, with data $x_{t,i} \in \mathbb{D}$ and label $y_{t,i} \in C_t$. $\mathbb{D}$ is an arbitrary domain, e.g., natural images, and $C_t$ is the set of classes at phase $t$. The dataset $D_t$ contains $N_{t,k}$, $k \in \{1, ..., K_t\}$, samples for each of the $K_t$ classes (i.e. $N_t = \sum_{k=1}^{K_t} N_{t,k}$). Notice that $C_i, C_j$ for two arbitrary phases $i, j$ are disjoint, i.e. $C_i \cap C_j = \emptyset, \forall i, j: i \neq j$. The unified model consists of a feature extractor $\phi$ and a classification head $\theta$. The feature extractor is a deep neural network $z = \phi(x)$, $z \in \mathbb{R}^n$, that maps from the image domain $\mathbb{D}$ to the feature domain $\mathbb{R}^n$, and is (traditionally) trained continuously at each phase $t$. In this section, $T$, $t$, and $\tau$ denote the total number of phases, the current phase, and a historical phase, respectively (i.e. $t \in \{1, ..., T\}$, $\tau \in \{1, ..., t\}$).
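As a concrete (toy) illustration of the protocol above, the sketch below builds a stream of phase datasets with mutually disjoint class sets. The helper name `make_stream` and the even class split are illustrative only; the benchmarks in Sec. 3 give the first phase half of the classes.

```python
def make_stream(num_classes, num_phases, samples_per_class=5):
    """Split `num_classes` labels into disjoint per-phase class sets C_t,
    and attach toy (sample-id, label) pairs as the phase dataset D_t."""
    classes = list(range(num_classes))
    per_phase = num_classes // num_phases
    stream = []
    for t in range(num_phases):
        C_t = classes[t * per_phase:(t + 1) * per_phase]
        # D_t pairs each (toy) sample index with its label y in C_t.
        D_t = [((k, i), k) for k in C_t for i in range(samples_per_class)]
        stream.append({"classes": set(C_t), "data": D_t})
    return stream

stream = make_stream(num_classes=100, num_phases=10)
# Disjointness: C_i ∩ C_j = ∅ for all i ≠ j.
assert all(stream[i]["classes"].isdisjoint(stream[j]["classes"])
           for i in range(10) for j in range(10) if i != j)
```

The CIL constraint is that a model at phase t may touch only `stream[t]["data"]`, never earlier entries.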

2.2. CONSTRUCTING VORONOI DIAGRAMS: A FEATURE EXTRACTOR IS ALL YOU NEED

In many CIL methods, the feature extractor $\phi$ and classification head $\theta$ are jointly and continuously optimized during every phase $t$, guided by carefully designed losses (Zhu et al., 2021). As a starting point, in this section, we freeze the feature extractor $\phi$ after the first phase and use a Voronoi Diagram (i.e., a 1-nearest-neighbor classifier) as $\theta$, yielding an extremely simple baseline method (denoted as iVoro), upon which we will then gradually add the components introduced in Sec. 1. First, we introduce the Power Diagram (PD), a generalized version of VD:

Definition 2.1 (Power Diagram (Aurenhammer, 1987) and Voronoi Diagram). Let $\Omega = \{\omega_1, ..., \omega_K\}$ be a partition of the space $\mathbb{R}^n$, and $C = \{c_1, ..., c_K\}$ be a set of centers (also called sites) such that $\cup_{r=1}^{K} \omega_r = \mathbb{R}^n$, $\cap_{r=1}^{K} \omega_r = \emptyset$. In addition, each center is associated with a weight $\nu_r \in \{\nu_1, ..., \nu_K\} \subseteq \mathbb{R}^+$. Then, the set of pairs $\{(\omega_1, c_1, \nu_1), ..., (\omega_K, c_K, \nu_K)\}$ is a Power Diagram (PD), where each cell is obtained via $\omega_r = \{z \in \mathbb{R}^n : r(z) = r\}$, $r \in \{1, ..., K\}$, with $r(z) = \arg\min_{k \in \{1,...,K\}} d(z, c_k)^2 - \nu_k$. If the weights are equal for all $k$, i.e. $\nu_k = \nu_{k'}$, $\forall k, k' \in \{1, ..., K\}$, then the PD collapses to a Voronoi Diagram (VD).

Prototypes. As a baseline model, the class centers for iVoro are simply chosen to be the prototypes (the feature mean of each class): $c_{\tau,k} = \frac{1}{N_{\tau,k}} \sum_{i \in \{1,...,N_{\tau,k}\},\, y=k} \phi(x_{\tau,i})$, $\nu_{\tau,k} = 0$, $\tau \in \{1, ..., t\}$, $k \in \{1, ..., K_\tau\}$. We name these centers prototypical centers. Note that this set of centers $\{c_{\tau,k}\}$ carries prototypes for all classes, old and new, up to time $t$. At test time, a query sample $x$ is assigned to the nearest class $\hat{y} = C_{\tau',k'}$ s.t. $d(z, c_{\tau',k'}) = \min_{\tau,k} d(z, c_{\tau,k})$, in which $d(z, c_{\tau,k}) = \|z - c_{\tau,k}\|_2^2$.

Parameterized Feature Transformation. Although PASS (Zhu et al., 2021) uses Gaussian noise to augment the data, the actual features are not necessarily normally distributed.
To encourage the normality of the feature distribution, we here adopt the compositional feature transformation commonly used in FSL (Ma et al., 2022a): (1) $L_2$ normalization projects the feature onto the unit sphere: $f(z) = \frac{z}{\|z\|_2}$; (2) a linear transformation performs scaling and shifting: $g_{w,\eta}(z) = wz + \eta$; and (3) Tukey's ladder of powers transformation further improves the Gaussianity: $h_\lambda(z) = z^\lambda$ if $\lambda \neq 0$, and $h_\lambda(z) = \log(z)$ if $\lambda = 0$. Finally, the feature transformation is the composition of the three: $(h_\lambda \circ g_{w,\eta} \circ f)(z)$, parameterized by $w, \eta, \lambda$. If all features (for both the training and testing sets) go through this normalization function, then iVoro becomes iVoro-N.
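A minimal numpy sketch of the iVoro-N transformation and the nearest-prototype rule of Sec. 2.2 is given below. The hyperparameter values ($w$, $\eta$, $\lambda$) are illustrative placeholders, and Tukey's transformation is assumed to act on positive feature entries:

```python
import numpy as np

def transform(z, w=1.0, eta=0.0, lam=0.5):
    """Compositional transformation (h_lam ∘ g_{w,eta} ∘ f)(z).
    Assumes positive entries after g when lam is fractional."""
    z = z / np.linalg.norm(z)              # f: L2-normalize onto the unit sphere
    z = w * z + eta                        # g: scale and shift
    z = np.log(z) if lam == 0 else np.power(z, lam)  # h: Tukey's ladder of powers
    return z

def nearest_prototype(z, prototypes):
    """iVoro cell assignment: squared-Euclidean nearest prototypical center."""
    d = ((prototypes - z) ** 2).sum(axis=1)   # d(z, c_k) = ||z - c_k||_2^2
    return int(np.argmin(d))
```

For iVoro-N, the same `transform` would be applied to both the stored prototypes and the query feature before the nearest-center rule.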

2.3. DIVIDE AND CONQUER: PROGRESSIVE VORONOI DIAGRAMS FOR CIL

As mentioned earlier, iVoro (and iVoro-N) treats all classes equally and separates them all by bisectors, regardless of the phase at which they appear. However, for two classes $C_{\tau,k_1}, C_{\tau,k_2}$ appearing in the same phase $\tau$, we can in fact draw a better boundary by training a linear probing model parameterized by $W, b$ in the fixed feature space. After the training, locating the new Voronoi centers requires an explicit relationship between the probing model and VD. More formally, at phase $t$ a linear classifier with cross-entropy loss is optimized on the local data $D_t$: $\mathcal{L}(W_t, b_t) = \sum_{(x,y) \in D_t} -\log p(y|\phi(x); W_t, b_t) = \sum_{(x,y) \in D_t} -\log \frac{\exp(W_{t,y}^T \phi(x) + b_{t,y})}{\sum_k \exp(W_{t,k}^T \phi(x) + b_{t,k})}$, in which $W_{t,k}, b_{t,k}$ are the linear weight and bias for class $C_{t,k}$. As a parameterized model, this linear probing can ideally improve the discrimination within $C_t$. However, it is still non-trivial to merge all $\{W_{\tau,k}, b_{\tau,k}\}_{\tau=1}^{t}$, since the task identity is not assumed to be known as in TIL. To solve this, we draw geometric insight from (Ma et al., 2022a), which directly connects the linear probing model and VD by the following theorem:

Theorem 2.1 (Voronoi Diagram Reduction (Ma et al., 2022a)). The linear classifier parameterized by $W, b$ partitions the input space $\mathbb{R}^n$ into a Voronoi Diagram with centers $\{\hat{c}_1, ..., \hat{c}_K\}$ given by $\hat{c}_k = \frac{1}{2} W_k$ if $b_k = -\frac{1}{4} \|W_k\|_2^2$, $k = 1, ..., K$.

For completeness, we also include the proof in Appendix E. During linear probing, if Thm. 2.1 is satisfied, then it is guaranteed that the resulting centers (referred to as probing-induced centers) $\{\hat{c}_{t,k}\}_{k=1}^{K_t}$ will also induce a VD (locally in phase $t$). Now, given that we have two sets of centers $\{c_{\tau,k}\}$ and $\{\hat{c}_{\tau,k}\}$, with the latter being better locally but not transferable across phases, we devise a divide-and-conquer (D&C) algorithm that progressively constructs the decision boundaries from the two sets of centers, boosting iVoro to iVoro-D. Divide.
Fortunately, the total classes $\{C_\tau\}_{\tau=1}^{t}$ have already been split into $t$ disjoint cliques.

Conquer. Within each clique (i.e. phase) $\tau$, the boundary for any two classes $C_{\tau,k_1}, C_{\tau,k_2}$ is the bisector separating the probing-induced centers $\hat{c}_{\tau,k_1}, \hat{c}_{\tau,k_2}$, denoted as $\Gamma_{\tau,k_1,\tau,k_2} = \{z' \in \mathbb{R}^n \mid v^T z' - q = 0\}$, where $v = \frac{\hat{c}_{\tau,k_1} - \hat{c}_{\tau,k_2}}{\|\hat{c}_{\tau,k_1} - \hat{c}_{\tau,k_2}\|_2}$ and $q = \frac{\|\hat{c}_{\tau,k_1}\|_2^2 - \|\hat{c}_{\tau,k_2}\|_2^2}{2\|\hat{c}_{\tau,k_1} - \hat{c}_{\tau,k_2}\|_2}$. When merging cliques $\tau_1, \tau_2$, we instead resort to the prototypical centers for the space partition: for any $c_{\tau_1,k}$ in clique $\tau_1$ and any $c_{\tau_2,k'}$ in clique $\tau_2$, their bisector is $\Gamma_{\tau_1,k,\tau_2,k'} = \{z' \in \mathbb{R}^n \mid v^T z' - q = 0\}$, where $v = \frac{c_{\tau_1,k} - c_{\tau_2,k'}}{\|c_{\tau_1,k} - c_{\tau_2,k'}\|_2}$ and $q = \frac{\|c_{\tau_1,k}\|_2^2 - \|c_{\tau_2,k'}\|_2^2}{2\|c_{\tau_1,k} - c_{\tau_2,k'}\|_2}$. In this way, the overall space partition benefits from both the locally probing-induced VD and the globally prototype-based VD.

Querying the VD. At test time, one can find the assigned Voronoi cell for a query example $x$ by eliminating one class in each round according to $\mathrm{sign}(v^T z' - q)$, starting from a randomly selected boundary, so the time complexity is $O(\sum_{\tau=1}^{t} K_\tau)$.
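Thm. 2.1 and the elimination-based query can be sketched as follows. This is a toy numpy illustration (not the authors' implementation): with the biases constrained as in the theorem, the linear classifier's argmax coincides with nearest-center assignment under the induced VD.

```python
import numpy as np

def reduce_to_centers(W):
    """Thm. 2.1: with b_k = -||W_k||_2^2 / 4, the linear classifier's decision
    regions form a Voronoi Diagram with centers c_k = W_k / 2."""
    return W / 2.0

def bisector(c1, c2):
    """Bisector {z : v^T z - q = 0} of two centers, as in the Conquer step."""
    diff = c1 - c2
    n = np.linalg.norm(diff)
    return diff / n, (c1 @ c1 - c2 @ c2) / (2 * n)

def assign(z, centers):
    """Query the VD by eliminating one center per round via sign(v^T z - q)."""
    alive = 0
    for k in range(1, len(centers)):
        v, q = bisector(centers[alive], centers[k])
        if v @ z - q < 0:      # z lies on c_k's side: eliminate the current winner
            alive = k
    return alive

# Sanity check: linear scores with constrained biases agree with the VD.
rng = np.random.default_rng(1)
W = rng.normal(size=(5, 4))                 # 5 classes, 4-dim features
b = -0.25 * (W ** 2).sum(axis=1)            # b_k = -||W_k||_2^2 / 4
z = rng.normal(size=4)
assert assign(z, reduce_to_centers(W)) == int(np.argmax(W @ z + b))
```

The check follows from expanding $\|z - W_k/2\|_2^2 = \|z\|_2^2 - W_k^T z + \frac{1}{4}\|W_k\|_2^2$: minimizing the distance is the same as maximizing $W_k^T z - \frac{1}{4}\|W_k\|_2^2$.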

2.4. AUGMENTATION INTEGRATION: UNCERTAINTY-AWARE TEST-TIME VORONOI CELL ASSIGNMENT

Self-supervised Label Augmentation. To enhance the discriminative power of the CIL method, SSL-based label augmentation (Lee et al., 2020) has been used to expand the original $K_t$ classes to $4K_t$ by rotating the original image $x$. Specifically, for an image $x$, the rotated image $x^{(\alpha)} = \mathrm{rotate}(x, \frac{\pi}{2}\alpha)$, $\alpha \in \{0, 1, 2, 3\}$, will be assigned to one of the expanded classes $\hat{y} = k^{(\alpha)}$, $k \in \{1, ..., K\}$, $K = \sum_{\tau=1}^{t} K_\tau$. At training time, the model is trained on the expanded dataset; at testing time, however, each of the duplicated images $\{x^{(\alpha)}\}_{\alpha \in \{0,1,2,3\}}$ could possibly be assigned to each of the expanded classes $\{k^{(\alpha')}\}_{\alpha' \in \{0,1,2,3\}}$, so this ambiguity has to be resolved, which has not been considered in previous CIL methods.

Augmentation Consensus. Let $d^{(\alpha,\alpha')} \in \mathbb{R}^K$ be a vector, each component of which denotes the distance from $\phi(x^{(\alpha)})$ to a class that $\phi$ has learned, i.e. $d^{(\alpha,\alpha')}_k = d(\phi(x^{(\alpha)}), c_{k^{(\alpha')}}) = \|\phi(x^{(\alpha)}) - c_{k^{(\alpha')}}\|_2^2$, $k \in \{1, ..., K\}$, $\alpha, \alpha' \in \{0, 1, 2, 3\}$. Then we want to find a consensus $k$ with the maximum occurrence among the $4 \times 4$ predictions $\{\arg\min_k d^{(\alpha,\alpha')}_k\}_{\alpha,\alpha' \in \{0,1,2,3\}}$. Using augmentation consensus at test time, iVoro is retrofitted to iVoro-AC.

Augmentation Integration. Using the consensus of the augmented samples should be more robust than the individual prediction $\arg\min_k d^{(0,0)}_k$ itself, but it does not consider the accumulated distance; alternatively, we propose to integrate over all predictions from the augmented samples: $k = \arg\min_k \sum_\alpha \sum_{\alpha'} d^{(\alpha,\alpha')}_k$. If augmentation integration is applied, then iVoro becomes iVoro-AI.

Uncertainty Quantification. Since in iVoro-AC and iVoro-AI the augmented samples collaboratively contribute to the final prediction, quantitative uncertainty becomes non-negligible; this is because, for some rotation-invariant classes, e.g. balls, the rotation operation makes less sense.
Hence, when assigning a query sample $x$ to the augmented $4\times$ Voronoi cells, an uncertainty quantification method is needed. Truth Discovery Ensemble (TDE) (Ma et al., 2021) is the state-of-the-art uncertainty calibration method for DNNs, which finds the consensus among ensemble members by minimizing the entropy-based geometric variance (HV). Here, we only borrow HV as an indicator of the uncertainty of the $4 \times 4$ predictions, and refer the readers to Ma et al. (2021) for more details about TDE. Given the mean vector of the augmented predictions $d^* = \frac{1}{16} \sum_{\alpha,\alpha'} d^{(\alpha,\alpha')} \in \mathbb{R}^K$, let $V$ denote the total squared distance to $d^*$ (i.e., $V = \sum_\alpha \sum_{\alpha'} \|d^* - d^{(\alpha,\alpha')}\|^2$) and $q^{(\alpha,\alpha')}$ denote the contribution of each $d^{(\alpha,\alpha')}$ to $V$ (i.e., $q^{(\alpha,\alpha')} = \|d^* - d^{(\alpha,\alpha')}\|^2 / V$). Then the entropy induced by $\{q^{(\alpha,\alpha')}\}$ is $H = -\sum_\alpha \sum_{\alpha'} q^{(\alpha,\alpha')} \log q^{(\alpha,\alpha')} = \frac{1}{V} \sum_\alpha \sum_{\alpha'} \|d^* - d^{(\alpha,\alpha')}\|^2 \log(V / \|d^* - d^{(\alpha,\alpha')}\|^2)$. Based on these, we can define HV as follows:

Definition 2.2 (Entropy-based Geometric Variance (Ding & Xu, 2020)). Given the point set $\{d^{(\alpha,\alpha')}\} \subseteq \mathbb{R}^K$ and a point $d^*$, the entropy-based geometric variance (HV) is $H \times V$, where $H$ and $V$ are defined as shown above.

For every query example $x$, we calculate $\mathrm{HV}(x)$ based on its $\{d^{(\alpha,\alpha')}\}_{\alpha,\alpha' \in \{0,1,2,3\}}$. Later we will show how HV can favorably indicate the uncertainty of the augmented prediction, and tell us when augmentation integration is useful.
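The two test-time protocols and the HV indicator can be sketched over a 4 x 4 x K tensor of distances (a simplified numpy illustration, not the paper's code; the small `eps` guards against log(0) and is our addition):

```python
import numpy as np

def consensus_and_integration(d):
    """d[α, α', k]: distance from the rotated query x^(α) to expanded center k^(α').
    Returns the iVoro-AC (majority vote) and iVoro-AI (accumulated distance)
    predictions over the 4 x 4 grid."""
    votes = d.argmin(axis=-1).ravel()            # the 16 individual argmin predictions
    ac = int(np.bincount(votes).argmax())        # consensus: most frequent class
    ai = int(d.sum(axis=(0, 1)).argmin())        # integration: smallest total distance
    return ac, ai

def hv(d, eps=1e-12):
    """Entropy-based geometric variance HV = H x V (Def. 2.2) of the 16 vectors."""
    d = d.reshape(16, -1)
    d_star = d.mean(axis=0)                      # mean prediction vector d*
    sq = ((d - d_star) ** 2).sum(axis=1)         # ||d* - d^(α,α')||^2 per member
    V = sq.sum()                                 # total squared distance to d*
    q = sq / (V + eps)                           # per-member contribution to V
    H = -(q * np.log(q + eps)).sum()             # entropy of the contributions
    return H * V
```

A large HV means the 16 augmented predictions disagree strongly, which is exactly the regime in which augmentation integration is expected to help (Sec. 3.4).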

2.5. MULTILAYER VORONOI DIAGRAMS

Figure 4: Schematic illustration of iVoro-L.

Until now, our VD construction is restricted to the deep feature space, i.e., $x \to \phi(x) \in \mathbb{R}^n$. However, the intermediate layers also contain information supplementary to the final layer that can be useful for our VD construction, and this requires the integration of multiple VDs. Recently, Cluster-induced Voronoi Diagram (CIVD) (Chen et al., 2017; Huang et al., 2021) and Cluster-to-cluster Voronoi Diagram (CCVD) (Ma et al., 2022a), two advanced VD structures, have shown a remarkable ability to integrate multiple sets of centers for VD construction and achieve state-of-the-art performance in metric-based FSL. In this paper, we utilize the concept of CCVD for the integration of multiple VDs induced by multiple layers. We refer the readers to Ma et al. (2022a) for more details about CIVD/CCVD.

Definition 2.3 (Cluster-to-cluster Voronoi Diagram). Let $\Omega = \{\omega_1, ..., \omega_K\}$ be a partition of the space $\mathbb{R}^n$, and $\mathcal{C} = \{C_1, ..., C_K\}$ be a set of totally ordered sets with the same cardinality $L$ (i.e. $|C_1| = |C_2| = ... = |C_K| = L$). The set of pairs $\{(\omega_1, C_1), ..., (\omega_K, C_K)\}$ is a Cluster-to-cluster Voronoi Diagram (CCVD) with respect to an influence function $F(C_k, C(z))$, and each cell is obtained via $\omega_r = \{z \in \mathbb{R}^n : r(z) = r\}$, $r \in \{1, ..., K\}$, with $r(z) = \arg\max_{k \in \{1,...,K\}} F(C_k, C(z))$, where $C(z)$ is the cluster (also a totally ordered set with cardinality $L$) that the query point $z$ belongs to, meaning that all points in this cluster (the query cluster) will be assigned to the same cell. The influence function is defined upon two totally ordered sets $C_k = \{c_k^{(i)}\}_{i=1}^{L}$ and $C(z) = \{z^{(i)}\}_{i=1}^{L}$: $F(C_k, C(z)) = -\mathrm{sign}(\gamma) \sum_{i=1}^{L} d(c_k^{(i)}, z^{(i)})^\gamma$.

As CCVD is a flexible framework and can be applied to iVoro-D/AC/AI, here, as an example, we show how CCVD can be used to boost iVoro.
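The influence function of Definition 2.3 is easy to sketch; the snippet below is illustrative (the choice $\gamma = -1$ and the helper names are our assumptions, not from the paper), treating each cluster as an $L \times n$ array with one center or query feature per layer:

```python
import numpy as np

def influence(C_k, C_z, gamma=-1.0):
    """CCVD influence F(C_k, C(z)) = -sign(γ) Σ_i d(c_k^(i), z^(i))^γ over the
    L layer-wise pairs; with γ < 0, closer layer-wise pairs dominate the score."""
    d = ((C_k - C_z) ** 2).sum(axis=1)           # squared distance per layer i
    return float(-np.sign(gamma) * (d ** gamma).sum())

def ccvd_assign(query_cluster, class_clusters, gamma=-1.0):
    """Assign the query cluster to the cell with the largest influence."""
    scores = [influence(C_k, query_cluster, gamma) for C_k in class_clusters]
    return int(np.argmax(scores))
```

Taking all weights from a single layer ($L = 1$, $\gamma = -1$) recovers plain nearest-center assignment, which is why CCVD strictly generalizes the iVoro cell rule.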
In iVoro, the VD is induced by $\{c_{\tau,k}\}_{\tau \in \{1,...,t\}, k \in \{1,...,K_\tau\}}$, the feature means from the last layer $\phi$. Now, we arbitrarily extract $L$ layers $\{\phi^{(l)}\}_{l=1}^{L}$ and generate the $K$ totally ordered clusters $\{\{c_{\tau,k}^{(l)}\}_{l=1}^{L}\}_{\tau \in \{1,...,t\}, k \in \{1,...,K_\tau\}}$ to construct the CCVD, and generate the query cluster $\{\phi^{(l)}(x)\}_{l=1}^{L}$ for the query example $x$ for Voronoi cell assignment. See Appendix C for a summary of the notations and acronyms.

3.1 Datasets, Benchmarks, and Implementation Details. Three standard datasets for CIL, CIFAR-100 (Krizhevsky et al., 2009), TinyImageNet (Le & Yang, 2015), and ImageNet-Subset (Deng et al., 2009a), are used for method evaluation. We follow the popular benchmarking protocol in exemplar-free CIL used by (Liu et al., 2021b; Zhu et al., 2021; Douillard et al., 2020; Hou et al., 2019), in which the initial phase contains half of the classes while the subsequent phases each contain 1/5, 1/10, or 1/20 of the remaining classes. We mainly compare our method to non-exemplar methods including EWC (Kirkpatrick et al., 2017), LwF (Li & Hoiem, 2017), LwF-MC (Li & Hoiem, 2017), LwM (Dhar et al., 2019), and MUC (Liu et al., 2020b), but we also compare with several recent exemplar-based methods, iCaRL (Rebuffi et al., 2017), EEIL (Castro et al., 2018), UCIR (Hou et al., 2019), and RMM (Liu et al., 2021b), for reference. A ResNet-18 (He et al., 2016) model is used for all experiments. We follow PASS (Zhu et al., 2021) to train the feature extractor on the first-phase data but freeze it afterwards for all subsequent phases. All classes are expanded via rotating the original image by 90°, 180°, and 270°. See Appendix F for more details about the implementations of all the 12 ablation methods in Tab. 2.
3.2 iVoro: Simple VD is a Strong Baseline. Surprisingly, by only using prototypes for VD construction, our baseline method iVoro achieves competitive performance for short phase sequences and much better results for long ones, compared to the state-of-the-art non-exemplar CIL method. For example, the difference in accuracy in comparison to PASS is 0.29%/6.91%/3.63% for 5/10/20-phase CIFAR-100, -3.58%/-1.09%/5.43% for 5/10/20-phase TinyImageNet, and 4.76% for ImageNet-Subset. We suspect that this is because the features generated by the frozen feature extractor are satisfactorily separable by linear bisectors (Fig. 2). As we can see, the features for other methods all change dramatically during the phases, but those for iVoro are fixed, making incremental VD construction possible. Moreover, the accuracy of the last phase usually drops significantly with longer task sequences (e.g. 20 phases vs. 5 phases), but iVoro is highly robust at the last phase, because the final VDs are the same no matter how many phases it goes through. These results show that iVoro works favorably with long phase sequences. When parameterized normalization is applied, iVoro-N further consistently improves upon iVoro by up to 2.40% (10-phase ImageNet-Subset) (see Tab. 2), by encouraging the compactness of the feature distribution. See Appendix H for a detailed analysis of iVoro-N.

3.3 Normalization (iVoro-N) and D&C (iVoro-D): Synergistic Effects. Our very baseline method, iVoro, ignores the phase at which a class was learned, and computes prototypes indifferently to construct the VD (i.e. a 1-nearest-neighbor model). To determine the decision boundaries more subtly, iVoro-D focuses on the refinement of the within-phase boundaries. These two components, iVoro-N/D, can individually improve iVoro, but also have collective impacts. For example, as shown in Tab. 2 and Fig.
6, iVoro-ND > iVoro-N/iVoro-D > iVoro, corroborating that every single contribution is useful and necessary. More specifically, across the three datasets, iVoro-D makes a larger contribution on TinyImageNet (1.75%-3.44%) than on CIFAR-100 (0.48%-0.96%) or ImageNet-Subset (0.72%). This can be explained by the fact that there are 100 classes in the first phase of the TinyImageNet dataset, while only 50 in the other two, making TinyImageNet a harder dataset if only vanilla prototypes are used to construct the VD.

3.4 Why and When Will Augmentation Integration (iVoro-AC/AI) Help? When augmentation consensus (iVoro-AC) or integration (iVoro-AI) is applied, the improvement is significant. For example, iVoro-AC obtains 13.76%, 17.00%, and 16.50% improvements upon iVoro on CIFAR-100, TinyImageNet, and ImageNet-Subset, respectively. iVoro-AI by itself is worse than iVoro-AC, but when combined with normalization and D&C, it further elevates the accuracy by a large margin, e.g. as high as 68.70% (iVoro-NDAI) on 20-phase TinyImageNet and 78.64% on 10-phase ImageNet-Subset. To investigate the reason for this prominent improvement, we calculate the entropy-based geometric variance at the class level and plot it as a function of the ∆accuracy (i.e. the improvement in accuracy after augmentation integration is used), as shown in Fig. 8 and Fig. G.2. Interestingly, there is a clear correlation between HV and ∆accuracy, and this is more notable on ImageNet-Subset (Pearson's R ∼0.9), probably because of its high resolution (224 × 224). This tendency suggests that the higher the variance within the assignments from augmented images to expanded classes, the larger the improvement after using augmentation integration. See Appendix I for the uncertainty analysis, and Appendix N/Appendix O for class-level/sample-level analysis.

3.5. How Good Should the Feature Extractor Be?

As iVoro is heavily dependent on the feature extractor, which cannot be evolved in any way along the learning process, one may wonder if our method still works with a poorly trained feature extractor. To verify this, we gradually decrease the number of classes used to train the feature extractor ϕ. As shown in Appendix J, compared with PASS, the best version of iVoro still achieves 17.75%, 13.59%, 10.89%, and 1.60% improvements with 40, 30, 20, and 10 initial classes, respectively. This means that, even without a strong feature extractor, our method can still reach acceptable performance above that of the state-of-the-art method.

3.7. Comparison with Joint Training

In CIL, the classes are sequentially learned phase by phase, whereas joint training simultaneously learns all the classes in a single phase, providing an upper bound for our CIL experiments. In Tab. 3, iVoro (and its variants) is applied to joint training, and is compared with ResNet on both the original and expanded label sets. Although there is still a substantial gap between iVoro (best) and the upper bound, catastrophic forgetting is considerably overcome (see Appendix M). In addition, and surprisingly, iVoro-AC/AI can also promote the performance of joint training, e.g. by +6.31%, +17.54%, and +13.38% for CIFAR-100, TinyImageNet, and ImageNet-Subset, respectively, suggesting that our augmentation integration method is also beneficial to general training settings where self-supervised label augmentation is involved.

4. CONCLUSION

In this paper, we use a progressive Voronoi Diagram to model the class-incremental learning problem, and propose a number of new techniques that handle various aspects of the VD construction process and gradually and greatly improve the CIL performance. iVoro is thus shown to be a flexible, scalable, and robust framework that strictly maintains the privacy of previous data. Our code is available at https://machunwei.github.io/ivoro/.

Incremental Learning (Rebuffi et al., 2017; Hou et al., 2019; Wu et al., 2019; Zhu et al., 2021; Liu et al., 2021b) requires continuously updating a model on a sequence of new tasks without forgetting the old knowledge, and is also referred to as continual learning (Parisi et al., 2019; Delange et al., 2021; Chaudhry et al., 2019). The main challenge of incremental learning is catastrophic forgetting (McCloskey & Cohen, 1989; French, 1999; Goodfellow et al., 2014; Kemker et al., 2018), where a deep neural network is prone to performance deterioration on the previously learned tasks as the model parameters overfit to the current data, upsetting the stability-plasticity trade-off.

A.1.1 CAUSATION OF CATASTROPHIC FORGETTING

Generally speaking, in deep neural networks, catastrophic forgetting (McCloskey & Cohen, 1989; Goodfellow et al., 2014; Kemker et al., 2018) comes from two sources: the feature distribution shift of the old classes in the feature embedding space, and the confusion and imbalance of the classifier's decision boundary when learning a new task. The former is caused by the excessive plasticity and parameter changes of the deep model's feature extractor during finetuning on unseen data/classes, which deteriorates feature extraction and prediction on previous classes; the latter is due to the overfitting and bias of the classifier on the current task, as well as the overlap between the representations of new and old classes in the feature space.

A.1.2 INCREMENTAL LEARNING SCENARIOS

Three common incremental learning scenarios are widely explored in recent papers (Van de Ven & Tolias, 2019). Task-incremental learning (TIL) (Ostapenko et al., 2019; Shin et al., 2017; Kirkpatrick et al., 2017; Zenke et al., 2017; Wu et al., 2018; Lopez-Paz & Ranzato, 2017; Chaudhry et al., 2019; Buzzega et al., 2020; Cha et al., 2021; Pham et al., 2021; Fernando et al., 2017) incrementally learns a sequence of tasks in multiple phases, where each task contains unseen data from a new set of classes. To mitigate catastrophic forgetting, TIL assumes a simple setting where the task identity is known at inference time. Methods under this scenario keep learning new task-independent classifiers or grow the model capacity by attaching additional modules (e.g. kernels, layers, or branches), each corresponding to a specific task or a subset of classes. Since the task ID is available during inference, the model can directly select the proper classifier or module without inferring the task identity, which effectively resolves the boundary confusion and classifier bias between old and new tasks, and often achieves satisfactory performance. However, knowing the task identity at test time is normally unrealistic in real-world situations, which restricts practical usage. Moreover, growing the model capacity for new tasks may incur unbounded memory consumption for very long task sequences. Unlike TIL, which is constrained by the availability of the task identity, class-incremental learning (CIL) (Liu et al., 2020b; Belouadah & Popescu, 2019; Chaudhry et al., 2018a; Zhu et al., 2021; Douillard et al., 2020; Rebuffi et al., 2017; Hou et al., 2019; Liu et al., 2020a; 2021a; b) updates a unified classifier for all classes learned so far, while the task identity is no longer required during inference.
To compensate for the missing task identity and alleviate forgetting, a branch of works (Rebuffi et al., 2017; Hou et al., 2019; Liu et al., 2021b; Douillard et al., 2020; Castro et al., 2018; Wu et al., 2019) follows a memory-based setting, in which a limited number of samples from old classes (e.g., 20 exemplars per class) are stored in a memory buffer and later replayed to jointly train the model with the current data (normally combined with knowledge distillation), in order to constrain the feature distribution shifting of the old classes and the decision-boundary bias of the classifier. However, their performance deteriorates with smaller buffer sizes, and ultimately, storing and sharing previous data, e.g. medical images, may not be feasible once memory limits and privacy issues are taken into consideration. Given this potential memory issue, another line of works (Kirkpatrick et al., 2017; Zenke et al., 2017; Li & Hoiem, 2017; Dhar et al., 2019; Zhu et al., 2021) explores CIL in a much more challenging setting without memory rehearsal, mainly based on regularization and knowledge distillation techniques, which is known as exemplar-free CIL. In this paper, we follow this CIL setting. Domain-incremental learning (DIL) (Rostami, 2021; Tang et al., 2021; Volpi et al., 2021), different from the two aforementioned scenarios, incrementally learns new domains of the same classes in each phase. Domain adaptation techniques, e.g. meta learning, data shifting, and domain randomization, are employed in DIL to increase the model's robustness and generalizability across various domain distributions. Since this scenario is not closely related to this paper, we omit a detailed discussion.

A.1.3 CATEGORIES OF INCREMENTAL LEARNING METHODS

There are three categories of existing IL methods for overcoming catastrophic forgetting (Delange et al., 2021). Regularization-based methods constrain the plasticity of the model to preserve old knowledge. This can be done by directly penalizing changes to parameters that are important for previous tasks (Aljundi et al., 2018; Chaudhry et al., 2018a; Kirkpatrick et al., 2017; Zenke et al., 2017; Kumar et al., 2021) or by regularizing the gradients when training on unseen data (Lopez-Paz & Ranzato, 2017; Chaudhry et al., 2018b). Knowledge distillation is another regularization solution, widely used in various IL methods to implicitly consolidate previous knowledge by introducing regularization loss terms on model representations, including output logits or probabilities (Li & Hoiem, 2017; Schwarz et al., 2018; Rebuffi et al., 2017; Castro et al., 2018) and intermediate features (Hou et al., 2019; Dhar et al., 2019; Douillard et al., 2020; Zhu et al., 2021). Some other works focus on correcting the classifier's bias toward new classes (Belouadah & Popescu, 2019; Wu et al., 2019; Belouadah & Popescu, 2020; Zhao et al., 2020). Rehearsal-based methods either store and replay a limited number of exemplars from old classes, as raw images (Rebuffi et al., 2017; Hou et al., 2019; Liu et al., 2021b; Chaudhry et al., 2019; Buzzega et al., 2020) or embedded features (Hayes et al., 2020; Iscen et al., 2020), to jointly train the model in the incremental phases, or alternatively generate exemplars of previous classes (Ostapenko et al., 2019; Shin et al., 2017; Wu et al., 2018; Kemker & Kanan, 2017). The former relies on a memory buffer for all learned classes, where the performance is constrained by the buffer size, and it is impracticable when data privacy is required and storing data is prohibited. The latter requires continuously training a deep generative model, which is itself prone to catastrophic forgetting, so the quality of the generated exemplars is not reliable.
Architecture-based methods aim at dynamically adapting task-specific sub-network architectures, which requires the task identity to select the proper sub-network. Some works directly expand the network by adding new layers or branches (Rusu et al., 2016; Li et al., 2019; Wang et al., 2017; Yoon et al., 2017), which is limited in practice due to unbounded growth of model parameters. Others freeze parts of the network with masks for old tasks (Golkar et al., 2019; Hung et al., 2019; Mallya & Lazebnik, 2018; Serra et al., 2018), but suffer from running out of model capacity for new knowledge. Architecture-based methods are usually combined with a memory buffer and distillation, and can achieve good results. Our work focuses on the most challenging but also most practical non-exemplar class-incremental learning problem, a general real-world scenario where no old data can be stored due to memory limits or data privacy, the task identity is unavailable during inference, and the model capacity is fixed at the same time.

A.2 COMPUTATIONAL GEOMETRY FOR DEEP LEARNING

Computational geometry is an emerging perspective for studying various aspects of deep learning. The geometric structure of deep neural networks was first hinted at by (Raghu et al., 2017), which reveals that piecewise linear activations subdivide the input space into convex polytopes. Afterward, (Balestriero et al., 2019) pointed out that the exact structure is a Power Diagram (PD, a generalized form of Voronoi Diagram) (Aurenhammer, 1987), which was subsequently used to explain recurrent neural networks (Wang et al., 2018) and generative models (Balestriero et al., 2020). The Power Diagram (or Voronoi Diagram) subdivision, however, is not necessarily the optimal model for describing the partitioning of deep/intermediate feature spaces.
More recently, several works in computational geometry (Chen et al., 2013; 2017; Huang et al., 2021) use an influence function F(C, z) to measure the joint influence of all objects in C on a query z to build a Cluster-induced Voronoi Diagram (CIVD), providing an advanced reform of the classical Voronoi Diagram. Observing that the Prototypical Network (Snell et al., 2017), a widely adopted metric-based few-shot learning (FSL) method, is essentially a Voronoi Diagram in the feature space, DeepVoro (Ma et al., 2022b) first unifies various kinds of FSL methods and then constructs a CIVD by incorporating heterogeneous features, achieving state-of-the-art performance in FSL. Besides FSL, Voronoi Diagram subdivision has also been used for deep learning uncertainty calibration (Ma et al., 2021), adversarial robustness (Sitawarin et al., 2021), topological data analysis (Polianskii & Pokorny, 2019; 2020; Poklukar et al., 2022), and medical applications (Ma et al., 2018; 2019). In this paper, distinct from the three aforementioned lines of research (i. In iVoro, the feature extractor from the first phase of (B) is frozen and used without fine-tuning for all subsequent phases. The feature means are calculated as prototypes and no feature transformation is used. Note that only the features from the original images without rotation are used in iVoro. (D) The only difference from (C) is that all the expanded classes are also considered as independent cells, allowing for further integration. Result Analysis. For (A) fine-tuning and (B) PASS, the model's accuracy on the data of individual phases is also shown in shadow. Fine-tuning is able to achieve near-perfect prediction for the classes in the current phase locally, but fails to maintain satisfactory performance on any historical class (accuracy ∼0%).
PASS, on the other hand, deteriorates only slightly on the classes from the first phase, owing to the high weights on the KD loss and the prototype loss, but it also becomes almost incapable of learning new classes (accuracy ∼0%). iVoro, i.e. the simplest 1-nearest-neighbor model, surprisingly obtains superior accuracy (64.84%, 24.28% higher than PASS) by only using a fixed feature extractor trained on only 4 classes (16 expanded classes). iVoro-AC achieves a result (61.77%) comparable to iVoro, but the 2D embedding makes it harder to demonstrate the efficacy of our proposed method. From this 2D illustration, it is obvious that the much higher performance of iVoro/iVoro-AC is achieved through a much better space partitioning.

C NOTATIONS AND ACRONYMS

In this section, we list all the notations used in the Methodology in Tab. C.1, the notations and acronyms for various geometric structures used in the paper in Tab. C.2, and all ablation methods in Tab. C.3.

Tab. C.1 (excerpt):
  v^T z' - q = 0 : boundary hyperplane between two Voronoi cells
  α : index of the four rotations, α ∈ {0, 1, 2, 3}
  d^(α,α') ∈ R^K : the collection of distances from ϕ(x^(α')) to the K classes with rotation index α
  HV : entropy-based geometric variance

Tab. C.3 (columns: vanilla VD; feature normalization ⋆; progressive VD ♠; augmentation consensus ♣; augmentation integration ♦; multilayer VD ▼):
  iVoro        ✔ ✘ ✘ ✘ ✘ ✘
  iVoro-N      ✔ ✔ ✘ ✘ ✘ ✘
  iVoro-D      ✔ ✘ ✔ ✘ ✘ ✘
  iVoro-ND     ✔ ✔ ✔ ✘ ✘ ✘
  iVoro-AC     ✔ ✘ ✘ ✔ ✘ ✘
  iVoro-AI     ✔ ✘ ✘ ✘ ✔ ✘
  iVoro-NAC    ✔ ✔ ✘ ✔ ✘ ✘
  iVoro-NAI    ✔ ✔ ✘ ✘ ✔ ✘
  iVoro-NDAC   ✔ ✔ ✔ ✔ ✘ ✘
  iVoro-NDAI   ✔ ✔ ✔ ✘ ✔ ✘
  iVoro-NACL   ✔ ✔ ✘ ✔ ✘ ✔
  iVoro-NAIL   ✔ ✔ ✘ ✘ ✔ ✔
  iVoro-NDACL  ✔ ✔ ✔ ✔ ✘ ✔
  iVoro-NDAIL  ✔ ✔ ✔ ✘ ✔ ✔

D DATASET DETAILS

Here we give the detailed statistics of the three datasets used in the paper. Augmentation consensus (iVoro-AC) and augmentation integration (iVoro-AI) work more favorably on images with higher resolution, e.g. ImageNet-Subset (see Fig. 8), as the rotation operation makes less sense if the image is too blurry.

c̄_k = (1/2) W_k if b_k = -(1/4)||W_k||_2^2, k = 1, ..., K.

Proof. We first articulate Lemma E.1 and find the exact relationship between the hyperplane Π_k(z) and the center of its associated cell in R^n. By Definition 2.1, the cell for a point z ∈ R^n is found by comparing d(z, c_k)^2 - ν_k for different k, so we define the power function p(z, S) expressing this value,

    p(z, S) = (z - u)^2 - r^2,    (2)

in which S ⊆ R^n is a sphere with center u and radius r. In fact, the weight ν associated with a center in Definition 2.1 can be interpreted as the square of the radius, r^2. Next, let U denote the paraboloid y = z^2, and let Π(S) be the transform that maps a sphere S with center u and radius r into the hyperplane

    Π(S): y = 2z · u - u · u + r^2.    (3)

It can be proved that Π is a bijective mapping between arbitrary spheres in R^n and nonvertical hyperplanes in R^{n+1} that intersect U (Aurenhammer, 1987). Further, let z' denote the vertical projection of z onto U and z'' its vertical projection onto Π(S); then the power function can be written as

    p(z, S) = d(z, z') - d(z, z''),    (4)

which implies the following relationship between a sphere in R^n and its associated hyperplane in R^{n+1} (Lemma 4 in (Aurenhammer, 1987)): let S_1 and S_2 be non-co-centric spheres in R^n; then the bisector of their Power cells is the vertical projection of Π(S_1) ∩ Π(S_2) onto R^n. Now we have a direct relationship between the sphere S and the hyperplane Π(S), and comparing equation (3) with the hyperplanes used in logistic regression, {Π_k(z): W_k^T z + b_k}_{k=1}^K, gives us

    u = (1/2) W_k,  r^2 = b_k + (1/4)||W_k||_2^2.
Although there is no guarantee that b_k + (1/4)||W_k||_2^2 is always positive for an arbitrary logistic regression model, we can impose a constraint on r^2 to keep it at zero during the optimization, which implies

    b_k = -(1/4)||W_k||_2^2.

In this way, the radii of all K spheres become identical (all zero). After the optimization of the logistic regression model, the centers {(1/2)W_k}_{k=1}^K are used as probing-induced Voronoi centers.
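As a minimal sketch (our illustration in NumPy, not the authors' released code), a logistic-regression head with its bias tied to b_k = -(1/4)||W_k||_2^2 predicts exactly like a nearest-center rule over the probing-induced Voronoi centers (1/2)W_k:

```python
import numpy as np

def logits_vd_constrained(Z, W):
    # Logits of a logistic-regression head whose biases are tied to
    # b_k = -||W_k||^2 / 4 (Thm. 2.1): every power-diagram radius
    # r_k^2 = b_k + ||W_k||^2 / 4 collapses to zero, so the induced
    # partition is a plain Voronoi diagram with centers c_k = W_k / 2.
    # Z: (N, n) features; W: (K, n) weights; biases are not free parameters.
    b = -0.25 * np.sum(W * W, axis=1)   # (K,)
    return Z @ W.T + b                  # (N, K)

def probing_centers(W):
    # Probing-induced Voronoi centers after training: c_k = W_k / 2.
    return 0.5 * W
```

Since W_k · z - (1/4)||W_k||^2 = ||z||^2 - ||z - W_k/2||^2 and ||z||^2 does not depend on k, taking the argmax of these logits is identical to picking the nearest center (1/2)W_k.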

F IMPLEMENTATION DETAILS AND RESULT ANALYSIS OF COMPREHENSIVE ABLATION STUDIES

iVoro. We generally follow the protocol of PASS (Zhu et al., 2021) to train the feature extractor ϕ, but only on the data from the first phase, i.e. 50 (for 5/10 phases) or 40 (for 20 phases) classes of CIFAR-100, 100 classes of TinyImageNet, and 50 classes of ImageNet-Subset. We also reproduce the results of PASS using the same hyper-parameters, e.g. the weight of the knowledge distillation loss set to 10 and the weight of the prototype augmentation loss set to 10. For iVoro (and all its subsequent variants), the trained model is frozen after the first phase and throughout all remaining phases. At phase t, the prototypical centers {c} are computed for both the current and the historical phases τ ∈ {1, ..., t}, and are used to construct the Voronoi Diagram. The simplest iVoro method (i.e. the vanilla Voronoi Diagram, or 1-nearest-neighbor) can already achieve comparable or even better results than the state-of-the-art non-exemplar CIL methods. For example, the difference in accuracy compared to PASS is 0.29%/6.91%/3.63% for 5/10/20-phase CIFAR-100, -3.58%/-1.09%/5.43% for 5/10/20-phase TinyImageNet, and 4.76% for 10-phase ImageNet-Subset. Notably, there is always a significant elevation of accuracy on long-phase data, suggesting that continuous fine-tuning of the model, even with improved loss functions, tends to forget severely on earlier data. With a fixed feature extractor, iVoro shows an improved ability to overcome catastrophic forgetting. On the other hand, on short-phase data, iVoro is similar to or worse than the state-of-the-art method, probably because the prototypical centers are computed without considering the data distribution (i.e. simply as the mean of the features). iVoro-N. To inspect the effectiveness of the parameterized feature transformation, we apply L2 normalization with/without Tukey's ladder of powers transformation (λ varying from 0.3 to 0.9), and compare with iVoro.
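The frozen-extractor construction described above can be sketched as follows (a hypothetical minimal NumPy implementation of vanilla iVoro; `phi` stands in for the frozen feature extractor ϕ):

```python
import numpy as np

class VanillaIVoro:
    # Minimal sketch of vanilla iVoro: per-class feature means serve as
    # Voronoi sites, and prediction is 1-nearest-neighbor among all sites
    # accumulated so far. Adding a phase only inserts new sites; the cells
    # of distant old classes are left untouched.
    def __init__(self, phi):
        self.phi = phi          # frozen after the first phase
        self.centers, self.labels = [], []

    def add_phase(self, X, y):
        Z = self.phi(X)
        for k in np.unique(y):
            self.centers.append(Z[y == k].mean(axis=0))  # prototype c_{tau,k}
            self.labels.append(int(k))

    def predict(self, X):
        Z = self.phi(X)
        C = np.stack(self.centers)
        d = ((Z[:, None, :] - C[None, :, :]) ** 2).sum(axis=-1)
        return np.asarray(self.labels)[d.argmin(axis=1)]
```

Since the extractor and the existing prototypes never change, the prediction on an old class can only be affected by new sites that land nearby, which mirrors the incremental-VD intuition of the paper.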
Generally, the improvement from feature normalization is more prominent on more complex datasets (e.g. TinyImageNet and ImageNet-Subset) or on simpler datasets with longer phases (e.g. CIFAR-100 with 20 phases), with gains ranging from 1.65% to 2.40% over iVoro. The detailed analysis is presented in Sec. H. iVoro-D/iVoro-ND. The detailed algorithm of iVoro-D is presented in Alg. 3. Specifically, for each phase τ ∈ {1, ..., t}, the local dataset D_τ is used to train a logistic regression model (restricted by Thm. 2.1) with weight decay β = 0.0001 and initial learning rate 0.001. The result is also shown in Tab. 2. Aided by the D&C algorithm and local logistic regression, iVoro-D is consistently better than iVoro, e.g. 0.48%∼0.96% higher on CIFAR-100, 1.75%∼3.44% higher on TinyImageNet, and 0.72% higher on ImageNet-Subset. When further combined with feature normalization, iVoro-ND achieves even higher accuracy: 54.72% on 20-phase CIFAR-100, 42.10% on 20-phase TinyImageNet, and 58.52% on ImageNet-Subset. For comparison, PASS reaches 48.75% on 20-phase CIFAR-100, 32.86% on 20-phase TinyImageNet, and 50.63% on ImageNet-Subset. Therefore, without incorporating more sophisticated techniques such as iVoro-R/iVoro-AC/iVoro-L, the iVoro-ND method can already surpass the previous state-of-the-art method by a large margin, e.g. 5.97%/9.24%/7.89%. iVoro-AC/iVoro-AI/iVoro-NAC/iVoro-NAI. While the previous variants of iVoro only consider the prediction on the original image/class, here we show that the prediction can be substantially improved by the augmentation consensus (iVoro-AC) and augmentation integration (iVoro-AI) proposed in this paper. Specifically, iVoro-AC improves upon iVoro by 13.76%∼14.58% on CIFAR-100, 16.99%∼17.00% on TinyImageNet, and 16.50% on ImageNet-Subset. iVoro-AI itself generally works worse than iVoro-AC, e.g.
improves up to 6.73% on CIFAR-100, up to 9.82% on TinyImageNet, and 5.26% on ImageNet-Subset, but if combined with feature normalization, iVoro-NAI performs much better than iVoro-NAC on TinyImageNet (up to 65.83%) and ImageNet-Subset (77.12%). When augmentation consensus/integration is used, the previous variants iVoro-N and iVoro-D can all be promoted. Generally, adding an additional component brings in further performance gain, as shown in detail in Tab. 2. iVoro-NACL/iVoro-NAIL/iVoro-NDAIL. We further validate the multilayer VD for multiple feature spaces. As a proof of concept, we only extract the features from the third block to build an additional VD, and conduct the integration using CCVD (Def. 2.3) with γ set to 1. Compared with iVoro-NAC, the multilayer VD (iVoro-NACL) further improves both the average accuracy and the last accuracy on CIFAR-100 under the 5/10-phase settings (2.36%/2.20% better in the last phase, respectively), which also realizes the highest performance among all ablation settings under these settings. The final performance on ImageNet-Subset is also improved, by 2.52%. Meanwhile, the multilayer VD does not make an obvious difference on TinyImageNet. When adding the multilayer VD to iVoro-NAI, labelled as iVoro-NAIL, a significant performance gain is observed in all experiments on CIFAR-100 (9.1%/9.1%/15.76% average accuracy increments for 5/10/20 phases) and a large improvement is also achieved on ImageNet-Subset (6.06% final accuracy growth). In contrast, only a limited gain is observed on TinyImageNet. These ablation results suggest that the multilayer VD with augmentation integration works extraordinarily well for a small class set (100 classes), but is not very effective when the class number increases. Robustness Analysis.
We run PASS 5 times on CIFAR-100 under the 10-phase setting; the last accuracies (%) are 48.25, 49.03, 53.03, 53.95, and 54.75, with mean and standard deviation (std) of 51.80%±2.65%. We also run PASS 5 times on ImageNet-Subset with 10 phases; the last accuracies (%) are 49.85, 50.63, 51.03, 51.88, and 52.52, with mean±std of 51.18%±0.94%. Thus, even though PASS achieves relatively good accuracy on average, its training is not very stable on CIFAR-100, where the gap between the highest and lowest accuracy is as large as 6.5%; the performance on ImageNet-Subset is slightly more robust, but still ranges from 49.85% to 52.52%. On the contrary, our iVoro method is naturally robust across datasets, with no fluctuation in performance, owing to the frozen feature extractor and the unbiased VD-based classifier.
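The rotation-based aggregation used by iVoro-AI (illustrated in Fig. G.4 with the "teddy" example) can be sketched as follows; we simplify the integration rule to a plain sum over the 4 × 4 rotation/expansion distances, with the spread of the 16 values serving as a rough confidence proxy, so this is an illustration rather than the paper's exact rule:

```python
import numpy as np

def integrate_rotations(d):
    # d[a_query, a_label, k]: distance from the query rotated by a_query
    # to the prototype of base class k expanded with rotation index a_label
    # (4 x 4 = 16 combinations per base class). The class score pools all
    # 16 distances; their variance reflects how consistent the prediction is.
    scores = d.sum(axis=(0, 1))                      # (K,) accumulated distance
    spread = d.reshape(-1, d.shape[-1]).var(axis=0)  # (K,) confidence proxy
    return int(scores.argmin()), spread
```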

H DETAILED ANALYSIS ON PARAMETERIZED FEATURE NORMALIZATION

In this section we give a detailed comparison between iVoro (no normalization) and iVoro-N (parameterized feature normalization). On both CIFAR-100 and TinyImageNet with different numbers of phases, we show the distribution of accuracy (across all phases) for iVoro, iVoro-N with only L2 normalization, and iVoro-N with both L2 normalization and Tukey's ladder of powers transformation (λ varying from 0.3 to 0.9). As shown in Fig. H.1, feature normalization benefits both datasets, but the efficacy is more prominent on the more complex dataset, e.g. TinyImageNet. Overall, iVoro-N improves the accuracy in the last phase by up to 1.65% on CIFAR-100, 2.21% on TinyImageNet, and 2.40% on ImageNet-Subset, respectively, compared to iVoro.
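For reference, the transformation evaluated here, Tukey's ladder of powers followed by L2 normalization, can be sketched as below (our minimal version; the non-negativity clamp is an added safeguard for ReLU features, not stated in the paper):

```python
import numpy as np

def transform_features(z, lam=0.5, eps=1e-12):
    # Tukey's ladder of powers (z -> z^lambda, with lambda swept over
    # [0.3, 0.9] in the paper) followed by L2 normalization, mapping each
    # feature vector onto the unit sphere before prototype computation.
    z = np.power(np.maximum(z, 0.0), lam)
    return z / (np.linalg.norm(z, axis=-1, keepdims=True) + eps)
```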

J ANALYSIS ON THE FEATURE EXTRACTOR

In order to examine the effect of the feature extractor on the final result, we gradually decrease the number of classes in the first phase from 50 to 40, 30, 20, and 10, and still include 5 classes in each subsequent phase. When compared with PASS, the best version of iVoro still obtains 17.75%, 13.59%, 10.89%, and 1.60% improvements with 40, 30, 20, and 10 initial classes, respectively. As expected, there is always a substantial decrease in accuracy if only the features from the 3rd block are used. For example, iVoro drops from 38.27% to 18.71%, from 56.05% to 43.22%, and from 55.40% to 35.10% on TinyImageNet, CIFAR-100, and ImageNet-Subset, respectively. However, when augmentation integration is used, the accuracy of iVoro3-NAI becomes 42.03% (TinyImageNet), 66.13% (CIFAR-100), and 64.32% (ImageNet-Subset), much higher than both iVoro and PASS. Moreover, when integrating the features from ϕ and ϕ^(3) using CCVD, iVoro-NDAIL achieves the best performance among all variants of iVoro, 72.34% on TinyImageNet and 83.84% on ImageNet-Subset.

Algorithm 3: iVoro-D Algorithm. The time complexity of establishing the VD is O((Σ_{τ=1}^{t} K_τ)^2); the time complexity of querying the VD is O(Σ_{τ=1}^{t} K_τ).
Data: training datasets until phase t: D_τ = {(x_{τ,i}, y_{τ,i})}_{i=1}^{N_τ}, x_{τ,i} ∈ D, y_{τ,i} ∈ C_τ, τ ∈ {1, ..., t}; query example x
Result: prediction ŷ
 1  for τ ∈ {1, ..., t} do
 2      W_τ, b_τ ← Algorithm 1(D_τ)
 3      c̄_{τ,k} ← (1/2) W_{τ,k}                 ◁ probing-induced centers
 4      c_{τ,k} ← Algorithm 2(D_τ)               ◁ prototypical centers
 5      for C_{τ,k1}, C_{τ,k2} ∈ C_τ do
 6          v ← (c̄_{τ,k1} - c̄_{τ,k2}) / ||c̄_{τ,k1} - c̄_{τ,k2}||_2
 7          q ← (||c̄_{τ,k1}||_2^2 - ||c̄_{τ,k2}||_2^2) / (2 ||c̄_{τ,k1} - c̄_{τ,k2}||_2)   ◁ within-clique boundaries
 8      end
 9  end
10  for C_{τ1,k} ∈ C_{τ1}, C_{τ2,k'} ∈ C_{τ2} do
11      v ← (c_{τ1,k} - c_{τ2,k'}) / ||c_{τ1,k} - c_{τ2,k'}||_2
12      q ← (||c_{τ1,k}||_2^2 - ||c_{τ2,k'}||_2^2) / (2 ||c_{τ1,k} - c_{τ2,k'}||_2)        ◁ cross-clique boundaries
13  end
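The bisector computation in lines 6-7 (and 11-12) of Algorithm 3 can be made concrete with a small sketch (our illustration): v and q define the hyperplane of points equidistant from the two centers.

```python
import numpy as np

def bisector(c1, c2):
    # Hyperplane {z : v^T z - q = 0} separating the Voronoi cells of
    # centers c1 and c2, matching lines 6-7 of Algorithm 3:
    #   v = (c1 - c2) / ||c1 - c2||
    #   q = (||c1||^2 - ||c2||^2) / (2 ||c1 - c2||)
    diff = c1 - c2
    norm = np.linalg.norm(diff)
    v = diff / norm
    q = (np.dot(c1, c1) - np.dot(c2, c2)) / (2.0 * norm)
    return v, q

def closer_to_c1(z, v, q):
    # v^T z - q > 0 exactly when z is strictly closer to c1 than to c2,
    # since ||z - c1||^2 < ||z - c2||^2 rearranges to (c1 - c2).z > (||c1||^2 - ||c2||^2)/2.
    return float(np.dot(v, z) - q) > 0.0
```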

M FORGETTING ANALYSIS

Following PASS (Zhu et al., 2021), we also quantitatively measure the degree of catastrophic forgetting. Specifically, at phase t, the drop from the maximum accuracy to the current accuracy on the data from each phase τ ∈ {1, ..., t} is averaged as the average forgetting, as shown in Fig. M.1 and Fig. M.2 for CIFAR-100 and TinyImageNet, respectively. For PASS, the average forgetting keeps growing along the phases. However, for iVoro/iVoro-N/iVoro-NAC/iVoro-NAI, catastrophic forgetting is significantly alleviated. For example, on CIFAR-100 (5 phases), forgetting in the last phase decreases from 20.28 to 8.17/8.95/5.92 for iVoro/iVoro-N/iVoro-NAC. For a local dataset D_τ = {(x_{τ,i}, y_{τ,i})}_{i=1}^{N_τ}, x_{τ,i} ∈ D, y_{τ,i} ∈ C_τ, and a set of fixed prototypes {c_{τ,k}}, τ ∈ {1, ..., t}, k ∈ {1, ..., K_τ}, the prediction ŷ = arg min_{k ∈ {1,...,K_τ}} ||ϕ(x) - c_{τ,k}||_2^2 is less likely to change than with a continuously updated model such as PASS, implying that, in line with our intuition (Sec. 1), the VD structure can naturally and successfully combat catastrophic forgetting.
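The forgetting measure described above can be computed from a per-phase accuracy matrix; a small sketch (ours, averaging over the tasks seen strictly before the current phase) reads:

```python
import numpy as np

def average_forgetting(acc):
    # acc[t, tau]: accuracy on the phase-tau test set after training
    # phase t (defined for tau <= t; NaN elsewhere). Forgetting of task
    # tau at phase t is the drop from its best earlier accuracy; the
    # average is taken over all previously seen tasks.
    T = acc.shape[0]
    f = np.zeros(T)
    for t in range(1, T):
        best_so_far = np.nanmax(acc[:t, :t], axis=0)  # best accuracy per old task
        f[t] = float(np.mean(best_so_far - acc[t, :t]))
    return f
```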
(Per-phase average forgetting values for CIFAR-100 (20 phases) and TinyImageNet (20 phases); see Fig. M.1 and Fig. M.2.)

O SAMPLE-LEVEL ANALYSIS

In Sec. N, we analyzed the class-level improvement and demonstrated that most of the time iVoro-NAI surpasses the others by virtue of the postprocessing of the predictions from SSL-based label augmentation. In this section, we go into more detail at the sample level, aiming to reveal why a misclassification can be corrected once our method is applied. To do so, we arbitrarily inspect two examples from the first class (i.e. "umbrella") and show them in Fig. O.1 and Fig. O.2. Interestingly, the original image is predicted with the same label ("ice_lolly") by iVoro. However, for the rotated images, the mis-predicted labels vary: "abacus", "bannister", "limousine", "vestment", "sewing_machine", "plunger", and "dumbbell". Importantly, though, the distance to the ground-truth label "umbrella" is always relatively small. Hence, when integrated altogether, the ground-truth label "umbrella" attains the nearest distance (see the upper-right figure). Similarly, in Fig. O.2, the sample is misclassified as "beach_wagon" by PASS, and is still misclassified by iVoro as "volleyball". However, when augmentation integration (iVoro-AI) is applied, the accumulated distances clearly point to the correct label "umbrella". In conclusion, these sample-level analyses show that iVoro-AC/AI elevates performance by substantially enhancing the robustness of the (augmented) prediction: the rotated image set and the expanded label set collectively contribute to the strong double-digit improvement.



Figure 1: Schematic illustrations of Voronoi Diagram (VD) for base sites (A), and when a new site (B) or a clique of new sites (C) is added to the system.

Figure 2: Visualization of Voronoi Diagrams induced by (A) incremental fine-tuning, (B) PASS (Zhu et al., 2021), (C) iVoro, and (D) iVoro-AC on the MNIST dataset in R^2 (best viewed in color). The dataset was split into 4, 3, and 3 disjoint classes. (See Appendix B for details.)

Figure 3: Schematic illustrations of iVoro (left) and iVoro-D (right). iVoro-D uses probing-induced boundaries within a phase.

Fig. G.3 for an illustrative comparison of iVoro and iVoro-D.

Figure 5: Top-1 classification accuracy on CIFAR-100 during 5/10/20 phases of CIL.

Figure 6: Illustration of the performance of the ablation methods shown in Table 2.

Figure 7: Top-1 classification accuracy on ImageNet-Subset (10 phases).

Figure 8: Entropy-based geometric variance (HV) in class level as a function of ∆accuracy (ImageNet-Subset).

e. regularization-based, rehearsal-based, and architecture-based methods), we propose the geometry-based CIL method iVoro (and its variants), inspired by Voronoi Diagram subdivision.

B DEMONSTRATIVE ILLUSTRATION ON MNIST DATASET IN 2D SPACE

In Figure 2, MNIST (LeCun, 1998), a small and simple dataset, is used to illustrate four methods, fine-tuning, PASS (Zhu et al., 2021), iVoro, and iVoro-AC, because of the convenience of embedding the examples into R^2. The total 10 classes are split into a sequence of 4, 3, and 3 classes. A ResNet-18 model is used as the feature extractor for all four methods. (A) In fine-tuning, the model is first trained on the 4 classes of the first phase, and then fine-tuned only on the subsequent 3 and 3 classes in phase 2 and phase 3. (B) In PASS, SSL-based label augmentation is applied in all three phases and expands the classes to 16, 9, and 9 classes. The default hyper-parameters are used to train PASS (i.e. the weight for knowledge distillation is 10 and the weight for prototype augmentation is 10). To ensure that the final subdivision of the space is a Voronoi Diagram, Thm. 2.1 (i.e. the Voronoi Diagram reduction in Algorithm 1) is applied during the training of fine-tuning and PASS. (C)

Figure F.1: Top-1 classification accuracy on ImageNet-Subset with all 12 ablation methods during 10 phases of CIL.

Figure F.2: Top-1 classification accuracy on CIFAR-100 with all 12 ablation methods during 5/10/20 phases of CIL.

Figure G.1: Top-1 classification accuracy on TinyImageNet during 5/10/20 phases of CIL.

Figure G.3: Illustration of iVoro (left) and iVoro-D (right). iVoro treats all prototypes indifferently regardless of the phase, whereas iVoro-D refines the within-phase boundaries through a divide-and-conquer algorithm.

Figure G.4: Illustration of augmentation integration using a sample image from the "teddy" class. Each of the rotated images can possibly be assigned to each of the expanded labels (i.e. teddy(1) , teddy(2) , teddy(3) , and teddy(4) ). The final score of "teddy" will be the aggregation of all the 16 predictions. The variance of the 16 predictions reflects the confidence of this prediction.

Figure H.1: Effect of parameterized feature transformation on CIFAR-100 and TinyImageNet.

Figure I.1: The distributions of Entropy-based geometric variance for each class in each phase on 5-phase CIFAR-100 (phase 0 to phase 1).

Figure I.2: The distributions of Entropy-based geometric variance for each class in each phase on 5-phase CIFAR-100 (phase 2 to phase 3).

Figure J.1: Top-1 classification accuracy on CIFAR-100 during 12/14 phases of CIL, in which the number of classes in the first phase is decreased to 40/30, respectively.

Figure K.1: Comparison between iVoros constructed from the final block (iVoro) and the 3 rd block (iVoro3) w.r.t. the top-1 classification accuracy on ImageNet-Subset during 10 phases of CIL.

Figure K.3: Comparison between iVoros constructed from the final block (iVoro) and the 3rd block (iVoro3) w.r.t. the top-1 classification accuracy on TinyImageNet during 5/10/20 phases of CIL.

Figure M.1: Results of average forgetting on CIFAR-100.

Figure M.2: Results of average forgetting on TinyImageNet.

Figure N.2: Fig. N.1 continued.

Fig. O.1 and Fig. O.2, respectively. In Fig. O.1, "umbrella" is within the top-3 labels predicted by PASS, which is, however, overwhelmed by "ice_lolly". From row 2 to row 9, label sets I/II/III/IV represent the expanded labels (I for the original image, and II/III/IV for the 90/180/270-degree rotations, see Fig. G.4), and the distances from the query image (original or rotated) to the 100 classes (original or expanded) are shown.

Figure O.1: Sample-level analysis of one example from the "umbrella" class. The y axis denotes the logit over 100 classes for PASS, and the distance to 100 prototypes for iVoro. The green line indicates the ground-truth label, while the red line denotes the predicted label.

Figure O.2: Sample-level analysis of one example from the "umbrella" class. The y axis denotes the logit over 100 classes for PASS, and the distance to 100 prototypes for iVoro. The green line indicates the ground-truth label, while the red line denotes the predicted label.

Comparison between the fully-fledged iVoro with state-of-the-art non-exemplar (marked by ✘) and exemplar-based (marked by ✔) CIL methods in terms of the accuracy (in %) in the last phase and the average accuracy (Avg., in %) across all phases. imp.↑ indicates the relative improvement upon the next best non-exemplar CIL method. ‡The best version of iVoro is shown here. See Tab. 2 for different versions of iVoro. Note that RMM uses a 100-class subset of ImageNet that is different from others (shown in blue).

Ablation experiments testing different combinations of parameterized feature normalization (⋆), progressive VD (♠), augmentation consensus (♣) or integration (♦), and multilayer VD (▼). See Table G.1 for the complete table.



Chunwei Ma, Ziyun Huang, Jiayi Xian, Mingchen Gao, and Jinhui Xu. Improving uncertainty calibration of deep neural networks via truth discovery and geometric optimization. In Uncertainty in Artificial Intelligence, pp. 75-85. PMLR, 2021.
Chunwei Ma, Ziyun Huang, Mingchen Gao, and Jinhui Xu. Few-shot learning via Dirichlet tessellation ensemble. In International Conference on Learning Representations, 2022a. URL https://openreview.net/forum?id=6kCiVaoQdx9.
Chunwei Ma, Ziyun Huang, Mingchen Gao, and Jinhui Xu. Few-shot learning via Dirichlet tessellation ensemble. In International Conference on Learning Representations, 2022b. URL https://openreview.net/forum?id=6kCiVaoQdx9.
Arun Mallya and Svetlana Lazebnik. PackNet: Adding multiple tasks to a single network by iterative pruning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7765-7773, 2018.

Table C.1: Complete list of all notations used in Methodology (Sec. 2).
  (x_{t,i}, y_{t,i}) : data (image) and label in phase t, i ∈ {1, ..., N_t}, t ∈ {1, ..., T}
  C_t : the set of classes in phase t
  C_{t,k} : the k-th class in phase t
  N_{t,k} : number of examples in class k at phase t
  N_t : number of all examples in phase t, i.e. N_t = Σ_k N_{t,k}
  ϕ : feature extractor
  ϕ^(l) : the same feature extractor, but only outputs the features from the l-th layer
  z : feature of x, i.e. z = ϕ(x), z ∈ R^n
  θ : classification head, either a Voronoi Diagram or logistic regression
  c_{τ,k} : prototypical Voronoi center for phase τ and class C_{τ,k}
  c̄_{τ,k} : linear probing-induced Voronoi center for phase τ and class C_{τ,k}

Table C.2: Notations and acronyms for VD, PD, and CCVD, the three geometric structures used in the paper.

Table C.3: Complete list of all variants of iVoro.

Table D.1: Summarization of the datasets used in the paper.

In this section we provide the proof of Theorem 2.1.

Lemma E.1. The vertical projection from the lower envelope of the hyperplanes {Π_k(z): W_k^T z + b_k}_{k=1}^K onto the input space R^n defines the cells of a PD.




1: Average Forgetting (↓): Comparison between iVoro and state-of-the-art CIL methods

ACKNOWLEDGMENTS

This research was supported in part by NSF through grant IIS-1910492. 



L COMPLETE LIST OF ALGORITHMS

Here we provide the algorithm for Voronoi Diagram reduction (Ma et al., 2022a).

