LEARNING MIXTURE MODELS WITH SIMULTANEOUS DATA PARTITIONING AND PARAMETER ESTIMATION Anonymous

Abstract

We study a new framework of learning mixture models via data partitioning called PRESTO, wherein we optimize a joint objective function on the model parameters and the partitioning, with each model tailored to perform well on its specific partition. In contrast to prior work, we do not assume any generative model for the data. We connect our framework to a number of past works in data partitioning, mixture models, and clustering, and show that PRESTO generalizes several loss functions including the k-means objective, the Bregman clustering objective, the Gaussian mixture model objective, mixtures of support vector machines, and mixtures of linear regression. We convert our training problem to a joint parameter estimation and subset selection problem, subject to a matroid span constraint. This allows us to reduce our problem to a constrained set function minimization problem, where the underlying objective is monotone and approximately submodular. We then propose a new joint discrete-continuous optimization algorithm which achieves a bounded approximation guarantee for our problem. We show that PRESTO outperforms several alternative methods. Finally, we study PRESTO in the context of resource-efficient deep learning, where we train smaller resource-constrained models on each partition and show that it outperforms existing data partitioning and model pruning/knowledge distillation approaches, which, in contrast to PRESTO, require large initial (teacher) models.

1. INTRODUCTION

In the problem space of learning mixture models, our goal is to fit a given set of models implicitly to different clusters of the dataset. Mixture models are ubiquitous approaches for prediction tasks on heterogeneous data (Dasgupta, 1999; Achlioptas & McSherry, 2005; Kalai et al., 2010; Belkin & Sinha, 2010a; Pace & Barry, 1997; Belkin & Sinha, 2010b; Sanjeev & Kannan, 2001; Hopkins & Li, 2018; Fu & Robles-Kelly, 2008), and find use in a plethora of applications, e.g., finance, genomics (Dias et al., 2009; Liesenfeld, 2001; Pan et al., 2003), etc. Existing literature on mixture models predominantly focuses on the design of estimation algorithms and the analysis of sample complexity for these problems (Faria & Soromenho, 2010; Städler et al., 2010; Kwon et al., 2019; Yi et al., 2014), and analyzes them theoretically for specific and simple models such as Gaussians, linear regression, and SVMs. Additionally, erstwhile approaches operate in realizable settings: they assume specific generative models for the cluster membership of the instances. Such an assumption can be restrictive, especially when the choice of the underlying generative model differs significantly from the hidden data generative mechanism. Very recently, Pal et al. (2022) consider a linear regression problem in a non-realizable setting, where they do not assume any underlying generative model for the data. However, their algorithm and analysis are tailored towards the linear regression task.

1.1. PRESENT WORK

Responding to the above limitations, we design PRESTO, a novel data partitioning based framework for learning mixture models. In contrast to prior work, PRESTO is designed for generic deep learning problems including classification using nonlinear architectures, rather than only linear models (linear regression or SVMs). Moreover, we do not assume any generative model for the data. We summarize our contributions as follows. Novel framework for training mixture models. At the outset, we aim to simultaneously partition the instances into different subsets and build a mixture of models across these subsets. Here, each model is tailored to perform well on a specific portion of the instance space. Formally, given a set of instances and the architectures of K models, we partition the instances into K disjoint subsets and train a family of K component models on these subsets, wherein each model is assigned to one subset, implicitly by the algorithm. Then, we seek to minimize the sum of losses yielded by the models on the respective subsets, jointly with respect to the model parameters and the candidate partitions of the underlying instance space. Note that our proposed optimization method aims to attach each instance to one of the K models on which it incurs the least possible error. Such an approach requires that the loss function guide the choice of the model for an instance, and since the label (and hence the loss) is unavailable for an unseen instance, it cannot be used at inference time. We build an additional classifier to tackle this problem; given an instance, the classifier takes the confidence from each of the K models as input and predicts the model to be assigned to it. Design of approximation algorithm. Our training problem involves both continuous and combinatorial optimization variables. Due to the underlying combinatorial structure, the problem is NP-hard even when all the models are convex.
To solve this problem, we first reduce our training problem to a parameter estimation problem in conjunction with a subset selection task, subject to a matroid span constraint (Iyer et al., 2014). Then, we further transform it into a constrained set function minimization problem and show that the underlying objective is a monotone and α-submodular function (El Halabi & Jegelka, 2020; Gatmiry & Gomez-Rodriguez, 2018) and has a bounded curvature. Finally, we design PRESTO, an approximation algorithm that solves our training problem, by building upon the majorization-minimization algorithms proposed in (Iyer & Bilmes, 2015; Iyer et al., 2013a; Durga et al., 2021). We provide the approximation bounds of PRESTO, even when the learning algorithm provides an imperfect estimate of the trained model. Moreover, PRESTO can be used to minimize any α-submodular function subject to a matroid span constraint, and is therefore of independent interest. Application to resource-constrained settings. With the advent of deep learning, the complexity of machine learning (ML) models has grown rapidly in the last few years (Liu et al., 2020; Arora et al., 2018; Dar et al., 2021; Bubeck & Sellke, 2021; Devlin et al., 2018; Liu et al., 2019; Brown et al., 2020). The functioning of these models is strongly contingent on the availability of high performance computing infrastructure, e.g., GPUs, large RAM, multicore processors, etc. The key rationale behind the use of an expensive neural model is to capture the complex nonlinear relationship between the features and the labels across the entire dataset. Our data partitioning framework provides a new paradigm to achieve the same goal, while enabling multiple lightweight models to run on a low resource device.
Specifically, we partition a dataset into smaller subslices and train a small model on each subslice; since each subslice is intuitively a smaller and simpler data subset, we can train a much simpler model on it, thereby significantly reducing the memory requirement. In contrast to approaches such as pruning (Wang et al., 2020a; Lee et al., 2019; Lin et al., 2020; Wang et al., 2020b; Jiang et al., 2019; Li et al., 2020; Lin et al., 2017) and knowledge distillation (Hinton et al., 2015; Son et al., 2021), we do not need teacher models (high compute models), with the additional benefit that we can also train these models on resource constrained devices. Empirical evaluations. Our experiments reveal several insights, summarized as follows. (1) PRESTO yields a significant accuracy boost over several baselines. (2) PRESTO is able to trade off accuracy and memory consumed during training more effectively than several competitors, e.g., pruning and knowledge distillation approaches. With the benefit of significantly lower memory usage, the performance of our framework is comparable to existing pruning and knowledge distillation approaches and much better than existing partitioning approaches and mixture models.

1.2. RELATED WORK

Mixture Models and Clustering. Mixture Models (Dempster et al., 1977; Jordan & Jacobs, 1994) and k-means Clustering (MacQueen, 1967; Lloyd, 1982) are two classical ML approaches, and have seen significant research investment over the years. Furthermore, the two problems are closely connected, as are the algorithms for both, i.e., the k-means algorithm and the Expectation Maximization (EM) algorithm for mixture models; the EM algorithm is often called soft clustering, wherein one assigns, to each instance, a probability of belonging to each cluster. Mixture models have been studied for a number of problems including Gaussian Mixture Models (Xu & Jordan, 1996), Mixtures of SVMs (Collobert et al., 2001; Fu & Robles-Kelly, 2008), and linear regression (Faria & Soromenho, 2010; Städler et al., 2010; Kwon et al., 2019; Pal et al., 2022). As we will show in this work, our loss based data partitioning objective generalizes the objectives of several tasks, including k-means clustering (Lloyd, 1982), clustering with Bregman divergences (Banerjee et al., 2005), and mixture models with SVMs and linear regression (Collobert et al., 2001; Pal et al., 2022). Following the initial analysis of EM for Gaussian models (Jordan & Jacobs, 1994; Xu & Jordan, 1996), mixture models have also been studied for SVMs (Collobert et al., 2001), wherein the authors extend the mixture-model formulation to use the SVM loss instead of the Gaussian distribution. A lot of recent work has studied mixtures of linear regression models. Most prior work involving mixture models pertaining to regression is in the realizable setting, with the exception of (Pal et al., 2022). Mixture model papers typically use the expectation-maximization (EM) algorithm for parameter estimation. Balakrishnan et al. (2017) study the EM algorithm when it is initialized with estimates close to the ground truth, concluding that it is able to give reasonable fits to the data.
When the sample size is finite, Balakrishnan et al. (2017) show convergence to within an ℓ2 ball around the true parameters. Yi et al. (2014) provide yet another initialization procedure, which uses eigenspace analysis and works for mixtures of two linear regressions. These works assume that the underlying Gaussian distributions have the same shared covariance. Li & Liang (2018) lift this assumption, proving near-optimal sample complexity. Pal et al. (2022) approach the problem of linear regression under the non-realizable setting by defining prediction and loss in the context of mixtures and formulating a novel version of the alternating maximization (AM) algorithm. Resource-constrained learning. In the pursuit of better performance, most state-of-the-art deep learning models are heavily over-parameterized. This makes their deployment on resource constrained devices nearly impossible. To mitigate these problems, several handcrafted architectures such as SqueezeNets (Iandola et al., 2016; Gholami et al., 2018), MobileNets (Howard et al., 2017; Sandler et al., 2018) and ShuffleNets (Zhang et al., 2018; Ma et al., 2018) were designed to work on mobile devices. Recently, EfficientNet (Tan & Le, 2019) was proposed, which employs neural architecture search. However, these models are designed to work on the entire training set, and leave a heavy memory footprint. Simultaneous model training and subset selection. Data subset selection approaches are predominantly static, and most often do not take into account the model's current state (Wei et al., 2014a; b; Kirchhoff & Bilmes, 2014; Kaushal et al., 2019; Liu et al., 2015; Bairi et al., 2015; Lucic et al., 2017; Campbell & Broderick, 2018; Boutsidis et al., 2013). Some recent approaches attempt to solve the problem of subset selection by performing joint training and subset selection (Mirzasoleiman et al., 2020a; b; Killamsetty et al., 2021b; a; Durga et al., 2021).
On the other hand, some other approaches (Mirzasoleiman et al., 2020a; Killamsetty et al., 2021a) select subsets that approximate the full gradient. Further, some of these approaches (Killamsetty et al., 2021b; a) demonstrate improvement in the robustness of the trained model by selecting subsets using an auxiliary or validation set. Durga et al. (2021) train the model in such a way that the validation set error is controlled. All the aforementioned techniques determine, for purposes of training, a single subset of the training set. In this paper we propose PRESTO as a method to simultaneously select multiple subsets and also learn a mixture of models (with a model corresponding to each subset), in order to improve the robustness of the trained model, even in resource-constrained settings.

2. PROBLEM FORMULATION

In this section, we first present the notations and then formally state our problem. Finally, we instantiate our problem in different clustering and mixture modeling scenarios.

2.1. NOTATIONS

We have N training instances {(x_i, y_i) | 1 ≤ i ≤ N}, where x_i ∈ X is the feature vector and y_i ∈ Y is the label of the i-th instance. In our work, we set X = R^d and treat Y as discrete. Given K, we denote h_{θ_1}, ..., h_{θ_K} as the component models that are going to be used for the K subsets resulting from a partition of X. Here θ_k is the trainable parameter vector of h_{θ_k}. These parameters are not shared across different models, i.e., there is no overlap between θ_k and θ_{k'} for k ≠ k'. In fact, the models h_{θ_k} and h_{θ_{k'}} can even have different architectures. Moreover, the representation of the output of h_{θ_k} can vary across different settings. For example, given x ∈ X, sign(h_{θ_k}(x)) is a predictor of y for support vector machines with Y = {±1}, whereas for multiclass classification, h_{θ_k}(x) provides a distribution over Y. To this end, we use ℓ(h_{θ_k}(x), y) to denote the underlying loss function for any instance (x, y). We define [A] = {1, ..., A} for an integer A.

2.2. PROBLEM STATEMENT

High level objective. Our broad goal is to fit a mixture of models on a given dataset, without making any assumption concerning the generative process, the instances or the features. Given a family of model architectures, our aim is to learn to partition the instance space X into a set of subsets, determine the appropriate model architecture to be assigned to each subset, and subsequently train the appropriate models on the respective subsets. Problem statement. We are given the training instances D, the number of subsets K resulting from partitioning D, and a set of model architectures h_{θ_1}, ..., h_{θ_K}. Our goal, then, is to partition the training set D into K subsets S_1, ..., S_K with S_k ∩ S_{k'} = ∅ and ∪_k S_k = D, so that when the model h_{θ_k} is trained on S_k for k ∈ [K], the total loss is minimized. To this end, we define the following regularized loss:

F({(S_k, θ_k) | k ∈ [K]}) = Σ_{k ∈ [K]} Σ_{i ∈ S_k} [ℓ(h_{θ_k}(x_i), y_i) + λ||θ_k||^2].   (1)

Then, we seek to solve the following constrained optimization problem:

minimize_{θ_1,...,θ_K, S_1,...,S_K} F({(S_k, θ_k) | k ∈ [K]})
subject to S_k ∩ S_{k'} = ∅ ∀ k ≠ k' ∈ [K], and ∪_{k ∈ [K]} S_k = D.   (2)

Here, λ is the coefficient of the L2 regularizer in Eq. (1). The constraint S_k ∩ S_{k'} = ∅ ensures that each example i ∈ D belongs to exactly one subset, and the constraint ∪_{k ∈ [K]} S_k = D entails that the subsets {S_k | k ∈ [K]} cover the entire dataset. The above optimization problem is a joint model parameter estimation and partitioning problem. If we fix the partition {S_1, S_2, ..., S_K}, then the optimal parameter vector θ_k depends only on S_k, the subset assigned to it. To this end, we denote the optimal value of θ_k for the above partition as θ*_k(S_k) and transform the optimization (2) into the following equivalent problem:

minimize_{S_1, S_2, ..., S_K} F({(S_k, θ*_k(S_k)) | k ∈ [K]})
subject to S_k ∩ S_{k'} = ∅ ∀ k ≠ k' ∈ [K], and ∪_{k ∈ [K]} S_k = D.   (3)

Note that computing θ*_k(S_k) has polynomial time complexity for convex loss functions. However, even for such functions, minimizing F as defined above is NP-hard. Test time prediction. Since computation of the optimal partitioning requires us to solve the optimization (3), it cannot be used to assign an instance x to a model h_{θ_k} at test time. To address this, we train an additional multiclass classifier π_φ : X → [K] on the pairs {(x_i, k) | i ∈ S_k}, so that, during test time, it can assign an unseen instance x to a model component h_{θ_k} using k = π_φ(x).
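As an illustration, the regularized objective in Eq. (1) can be evaluated directly once a partition and the component parameters are fixed. The sketch below uses linear component models with a squared loss purely for concreteness; the paper's components are neural networks and the loss ℓ is generic.

```python
import numpy as np

def regularized_partition_loss(X, y, partition, thetas, lam=0.1):
    """Evaluate F({(S_k, theta_k)}) from Eq. (1): the per-instance regularized
    loss of each component h_{theta_k}, summed over its assigned subset S_k.
    Illustrative choices: h_{theta_k}(x) = theta_k @ x and squared loss."""
    total = 0.0
    for k, S_k in enumerate(partition):          # partition: list of index lists
        theta = thetas[k]
        for i in S_k:
            pred = theta @ X[i]                  # h_{theta_k}(x_i)
            # ell(h_{theta_k}(x_i), y_i) + lambda * ||theta_k||^2
            total += (pred - y[i]) ** 2 + lam * float(theta @ theta)
    return total
```

Given a fixed partition, minimizing this objective decouples into K independent training problems, one per subset, which is exactly the structure exploited by the reformulation in Eq. (3).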

2.3. INSTANTIATIONS IN CLUSTERING AND MIXTURE MODELING

In this section, we show how the formulations listed in the previous section (e.g., Eq. (1)) have appeared in several applications ranging from clustering to mixture modeling. K-Means Clustering: Since k-means clustering (Lloyd, 1982; MacQueen, 1967) is unsupervised, we do not have access to labels y_i. It then turns out that θ_k = μ_k, the cluster means, and ℓ(h_{θ_k}(x_i)) = ||x_i − μ_k||^2. Bregman Clustering: K-means clustering is a special case of the more general clustering with Bregman divergences (Banerjee et al., 2005). Again, in this case, the parameters are θ_k = μ_k, the cluster means, and ℓ(h_{θ_k}(x_i)) = B_φ(x_i, μ_k).

Mixture of SVMs:

The mixture of Support Vector Machines (Fu & Robles-Kelly, 2008) is a special case of Eq. (1) with ℓ(h_{θ_k}(x_i), y_i) = max(0, 1 − y_i h_{θ_k}(x_i)). Mixture of Linear Regression: The mixture of linear regression models (Pal et al., 2022) is a special case of Eq. (1) with ℓ(h_{θ_k}(x_i), y_i) = ||h_{θ_k}(x_i) − y_i||^2.
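The instantiations above differ only in the choice of the per-instance loss ℓ plugged into Eq. (1). A minimal sketch of these losses (helper names are ours, for illustration):

```python
import numpy as np

def kmeans_loss(x, mu):
    """k-means (unsupervised): theta_k = mu_k, ell = ||x_i - mu_k||^2."""
    x, mu = np.asarray(x, float), np.asarray(mu, float)
    return float(np.sum((x - mu) ** 2))

def bregman_loss(phi, grad_phi, x, mu):
    """Bregman clustering: ell = B_phi(x, mu)
    = phi(x) - phi(mu) - <grad phi(mu), x - mu>."""
    x, mu = np.asarray(x, float), np.asarray(mu, float)
    return float(phi(x) - phi(mu) - grad_phi(mu) @ (x - mu))

def hinge_loss(score, y):
    """Mixture of SVMs: ell = max(0, 1 - y * h_{theta_k}(x)), y in {-1, +1}."""
    return max(0.0, 1.0 - score * y)

def squared_loss(pred, y):
    """Mixture of linear regressions: ell = ||h_{theta_k}(x) - y||^2."""
    return float(np.sum((np.asarray(pred, float) - np.asarray(y, float)) ** 2))
```

With φ(v) = ||v||^2, the Bregman divergence reduces to the squared Euclidean distance, recovering the k-means loss as a special case, exactly as stated above.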

2.4. APPLICATION IN THE RESOURCE-CONSTRAINED SETUP

In general, the relationship between an instance x and the label y can be arbitrarily nonlinear. A complex deep neural network captures such a relationship by looking at the entire dataset. However, such networks require high performance computing infrastructure for training and inference. The training problem (3) can be used to build K lightweight models h_{θ_1}, ..., h_{θ_K}, each of which is localized to a specific regime of the instance space X. Thus, a model is required to capture the nonlinearity only within the region assigned to it and not the entire set X. As a result, it can be lightweight and used with limited resources. Consider a large model whose number of parameters equals the total number of parameters collectively across the K models, i.e., Σ_{k=1}^{K} dim(θ_k). Such a model has to be loaded entirely into GPU RAM for training or inference, and thus requires a larger GPU RAM. In contrast, our approach requires us to load, at a time, only one model component h_{θ_k} and the corresponding subset S_k during both training and test, instead of having to load all K model components and the entire dataset. While this can increase both training and inference time, it can reduce memory consumption to roughly 1/K of that of a large model with similar expressiveness.
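The peak-memory argument above amounts to simple arithmetic over parameter counts. A back-of-the-envelope sketch, with hypothetical component sizes dim(θ_k):

```python
def peak_params(model_sizes, monolithic=False):
    """Peak number of model parameters resident in memory.

    A monolithic model with sum_k dim(theta_k) parameters must be loaded
    whole; loading one component h_{theta_k} at a time, as in our approach,
    makes the peak only max_k dim(theta_k)."""
    return sum(model_sizes) if monolithic else max(model_sizes)

sizes = [20_000, 20_000, 20_000, 20_000]   # hypothetical dim(theta_k), K = 4
big = peak_params(sizes, monolithic=True)  # sum_k dim(theta_k) = 80_000
ours = peak_params(sizes)                  # 20_000, i.e., 1/K of the monolith
```

With equally sized components the ratio is exactly 1/K; with unequal components the peak is governed by the largest component.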

3. PRESTO: PROPOSED FRAMEWORK TO SOLVE THE TRAINING PROBLEM (3)

In this section, we first show that the optimization problem (3) is equivalent to minimizing a monotone set function subject to a matroid span constraint (Iyer et al., 2014; Schrijver et al., 2003). Subsequently, we show that this set function is α-submodular and admits a bounded curvature. Finally, we use these results to design PRESTO, an approximation algorithm to solve the underlying constrained set function optimization problem. We next present these analyses, beginning with the necessary definitions of monotonicity, α-submodularity and matroid-related properties. Definition 1. (1) Monotonicity, α-submodularity and generalized curvature: Given a ground set V, let G : 2^V → R be a set function whose marginal gain is denoted by G(e | S) = G(S ∪ {e}) − G(S). The function G is monotone non-decreasing if G(e | S) ≥ 0 for all S ⊂ V and e ∈ V\S. G is α-submodular with α ∈ (0, 1] if G(e | S) ≥ α G(e | T) for all S ⊆ T and e ∈ V\T (Hashemi et al., 2019; El Halabi & Jegelka, 2020). The generalized curvature of G(S) is defined as κ_G(S) = 1 − min_{e ∈ S} G(e | S\{e}) / G(e | ∅) (Iyer et al., 2013b; Zhang & Vorobeychik, 2016). (2) Base, rank and span of a matroid: Consider a matroid M = (V, I), where I is the set of independent sets (refer to Appendix A.1 for more details). A base of M is a maximal independent set. The rank function r_M : 2^V → N is defined as r_M(S) = max_{I ∈ I : I ⊆ S} |I|. A set S is a spanning set if it is a superset of a base or, equivalently, r_M(S) = r_M(V) (Schrijver et al., 2003; Iyer et al., 2014; Edmonds, 2003).
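To make Definition 1 concrete, the following toy sketch evaluates marginal gains and the generalized curvature for an explicitly given set function. The function G below is illustrative and unrelated to the training loss of Eq. (4).

```python
import math

def marginal_gain(G, S, e):
    """G(e | S) = G(S ∪ {e}) - G(S)."""
    return G(S | {e}) - G(S)

def generalized_curvature(G, S):
    """kappa_G(S) = 1 - min_{e in S} G(e | S \\ {e}) / G(e | ∅)."""
    return 1.0 - min(
        marginal_gain(G, S - {e}, e) / marginal_gain(G, set(), e) for e in S
    )

# Toy monotone set function G(S) = sqrt(|S|); it is submodular, so alpha = 1.
G = lambda S: math.sqrt(len(S))
S = {1, 2, 3, 4}
kappa = generalized_curvature(G, S)  # = 1 - (sqrt(4) - sqrt(3)) = sqrt(3) - 1
```

For this G every marginal gain is positive (monotonicity) and gains shrink as the set grows, which is exactly the diminishing-returns behavior that α-submodularity relaxes by the factor α.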

3.1. REPRESENTATION OF (3) AS A MATROID SPAN CONSTRAINED SUBSET SELECTION TASK

Transforming partitions into 2D configurations. Given the training instances D and the size of the partition K, we first define the ground set V in the space of Cartesian products of D and [K], i.e., V = D × [K]. Thus V consists of all pairs {(i, k) | i ∈ D, k ∈ [K]}, which enumerates all possible assignments between the instances and model components. Moreover, we define V_{i⋆} = {(i, k) | k ∈ [K]} and V_{⋆k} = {(i, k) | i ∈ D}. Here, V_{i⋆} = {i} × [K] enumerates all possible assignments of the i-th instance, and V_{⋆k} = D × {k} enumerates all possible configurations in which an instance is assigned to S_k. Reformulating optimization (3) as constrained subset selection. Having defined the ground set V and the partitions in the 2D configuration space as above, we define Ŝ = {(i, k) | i ∈ S_k, k ∈ [K]}, which encodes the set of assignments in the space V induced by the underlying partition. Then, the set Ŝ_k = {(i, k) | i ∈ S_k} specifies the set of instances attached to subset k. It can be observed that Ŝ_k = Ŝ ∩ V_{⋆k}. Since every instance is assigned to exactly one subset S_k ∈ {S_1, ..., S_K}, we have |Ŝ ∩ V_{i⋆}| = 1. To this end, we introduce the following set function, which is the sum of the fitted loss functions defined in Eq. (1), trained over the individual subsets:

G(S) = Σ_{k ∈ [K]} Σ_{(i,·) ∈ S ∩ V_{⋆k}} [ℓ(h_{θ*_k(S ∩ V_{⋆k})}(x_i), y_i) + λ||θ*_k(S ∩ V_{⋆k})||^2].   (4)

We rewrite our optimization problem (3) as follows:

minimize_{S ⊆ V} G(S) subject to |S ∩ V_{i⋆}| = 1 ∀ i ∈ D.   (5)

Theoretical characterization of the objective and constraints in Eq. (5). Here, we show that our objective admits monotonicity, α-submodularity and bounded curvature in a wide variety of scenarios (proven in Appendix A.2). Theorem 2. Given the set function G(S) defined in Eq. (4) and the individual regularized loss functions ℓ introduced in Eq. (1), we define: ε_min = max_{k,i} Eigen_min[∇²_{θ_k} ℓ(h_{θ_k}(x_i), y_i)], ℓ_min = min_{i ∈ D} min_{θ_k} [ℓ(h_{θ_k}(x_i), y_i) + λ||θ_k||²] and ℓ̄_min = max_{i ∈ D} min_{θ_k} [ℓ(h_{θ_k}(x_i), y_i) + λ||θ_k||²]. Then, we have the following results: (1) Monotonicity: The function G(S) defined in Eq. (4) is monotone non-decreasing in S. (2) α-submodularity: If ℓ(h_{θ_k}(x), y) is L-Lipschitz for all k ∈ {1, .., K} and the regularizing coefficient λ satisfies λ > −ε_min, the function G(S) is α-submodular with

α ≥ α_G = ℓ_min / [ ℓ̄_min + 2L²/(λ + 0.5 ε_min) + λL²/(λ + 0.5 ε_min)² ].   (6)

(3) Generalized curvature: The generalized curvature κ_G(S) for any set S is bounded as κ_G(S) ≤ κ*_G = 1 − ℓ_min / ℓ̄_min. Solving the optimization (5) is difficult due to the equality constraint. However, as Theorem 2 (1) suggests, G(S) is monotone in S. Hence, even if we relax the equality constraints |S ∩ V_{i⋆}| = 1 to the inequality constraints |S ∩ V_{i⋆}| ≥ 1, equality is achieved at the optimal solution S*. Thus, the optimization (5) becomes

minimize_{S ⊆ V} G(S) subject to |S ∩ V_{i⋆}| ≥ 1 ∀ i ∈ D.   (7)

As we formally state in the following proposition, the above constraint can be seen as a matroid span constraint for a partition matroid. This allows us to design an approximation algorithm to solve the problem. Proposition 3. Suppose that a set S satisfies |S ∩ V_{i⋆}| ≥ 1 for all i ∈ D. Then S is a spanning set of the partition matroid M = (V, I) where I = {I | |I ∩ V_{i⋆}| ≤ 1 ∀ i ∈ D}. Moreover, if S satisfies |S ∩ V_{i⋆}| = 1 for all i ∈ D, then S is a base of M. The above proposition implies that the optimal solution S* of the optimization (7) is a base of the partition matroid M = (V, I) with I = {I | |I ∩ V_{i⋆}| ≤ 1 ∀ i ∈ D}.
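The ground set construction V = D × [K] and the span/base conditions of Proposition 3 can be checked mechanically. A small sketch for a toy instance (helper names are ours, for illustration):

```python
from itertools import product

def ground_set(n_instances, K):
    """V = D x [K]: all candidate (instance, component) assignment pairs."""
    return set(product(range(n_instances), range(K)))

def is_spanning(S, n_instances, K):
    """S spans the partition matroid M = (V, I), I = {I : |I ∩ V_i*| <= 1},
    iff it touches every row V_i* = {i} x [K] at least once."""
    return all(any((i, k) in S for k in range(K)) for i in range(n_instances))

def is_base(S, n_instances, K):
    """A base of M assigns each instance exactly once: |S ∩ V_i*| = 1."""
    return all(sum((i, k) in S for k in range(K)) == 1
               for i in range(n_instances))
```

A base of this matroid is precisely an encoding of a valid partition of D into the K subsets, which is why the relaxed constraint |S ∩ V_{i⋆}| ≥ 1 is a matroid span constraint.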

3.2. PRESTO: AN APPROXIMATION ALGORITHM TO SOLVE THE OPTIMIZATION PROBLEM (7)

In this section, we present our approximation algorithm PRESTO for solving the optimization problem (5); it builds upon the algorithm proposed by Durga et al. (2021). Their algorithm aims to approximately minimize an α-submodular function, whereas we extend their algorithm to minimize an α-submodular function subject to a matroid span constraint and derive an approximation guarantee for it. Specifically, we employ the majorization-minimization approach (Iyer et al., 2013a; b) to solve the optimization problem (5). Toward that goal, we first develop the necessary ingredients as follows. Computation of the modular upper bound. First, we present a modular upper bound for G(S) (Iyer et al., 2013a). Given a fixed set Ŝ and the set function G, which is α-submodular and monotone, the modular function m^G_{Ŝ}[S] is defined as follows:

m^G_{Ŝ}[S] = G(Ŝ) − Σ_{(i,k) ∈ Ŝ} α_G G((i,k) | Ŝ\{(i,k)}) + Σ_{(i,k) ∈ Ŝ ∩ S} α_G G((i,k) | Ŝ\{(i,k)}) + Σ_{(i,k) ∈ S\Ŝ} G((i,k) | ∅) / α_G.   (8)

Algorithm 1 PRESTO
1: Output: {θ_k | k ∈ [K]}, the partitioning of D: D = ∪_{k ∈ [K]} S_k
2: S_1 ← ∅, .., S_K ← ∅
3: for k ∈ [K] do
4:   θ_k ← InitParams()
5:   for all i ∈ D do
6:     G({(i, k)}) ← Train({i}; θ_k)
7:   end for
8: end for
9: for r ∈ [Iterations] do
10:   for k ∈ [K] do
11:     Ŝ_k ← {(i, k) | i ∈ S_k}
12:     θ_k, G(Ŝ_k) ← Train(Ŝ_k; θ_k)
13:     for i ∈ S_k do
14:       G(Ŝ_k\{(i, k)}) ← Train(Ŝ_k\{i}; θ_k)
15:       M[i][k] ← α_G [G(Ŝ_k) − G(Ŝ_k\{(i, k)})]
16:     end for
17:     for all i ∉ S_k, set M[i][k] ← G({(i, k)}) / α_G
18:   end for
19:   (i*, k*) ← argmin_{i,k} M[i][k]
20:   S_{k*} ← S_{k*} ∪ {i*}
21:   for all k ≠ k*: S_k ← S_k\{i*}
22:   Ŝ ← {(i, k) | i ∈ S_k, k ∈ [K]}
23: end for
24: Return θ_1, ..., θ_K, Ŝ

If G is α-submodular and monotone, it holds that G(S) ≤ m^G_{Ŝ}[S] for all S ⊆ V (Durga et al., 2021). Note that when G is submodular, i.e., when α = 1, the expression m^G_{Ŝ}[S] recovers the existing modular upper bounds for submodular functions (Nemhauser et al., 1978; Iyer et al., 2013a; Iyer & Bilmes, 2012). Outline of PRESTO (Alg. 1).
The goal of Algorithm 1 is to iteratively minimize m^G_{Ŝ}[S] with respect to S in a majorization-minimization manner (Iyer et al., 2013a; b; Durga et al., 2021). Having computed the S that minimizes m^G_{Ŝ}[S] in iteration r − 1, we set Ŝ = S for iteration r and minimize m^G_{Ŝ}[S] with respect to S. We start with Ŝ = ∅ (the empty partition) in line 2. Since we minimize m^G_{Ŝ}[S] with respect to S, the key components for calculation are the last two terms of the RHS in Eq. (8), i.e., G({(i, k)})/α_G and G((i, k) | Ŝ\{(i, k)}). We pre-compute the former in lines 3-8 for all pairs (i, k). The complexity of this operation is O(|D|K). We compute G((i, k) | Ŝ\{(i, k)}) by finding the corresponding partitions {Ŝ_1, ..., Ŝ_K} and, for each k, computing G(Ŝ_k) and G(Ŝ_k\{(i, k)}) for (i, k) ∈ Ŝ_k (lines 13-16). Finally, once we compute the modular upper bound m^G_{Ŝ}[S], we minimize it subject to the matroid span constraint, which yields a partition. Since m^G_{Ŝ}[S] is a modular function, we are guaranteed to find a set which is a base of the partition matroid. Finally, note that to evaluate G, we need to train the model on the specific datapoints and partition, as encapsulated in the Train() routine in Algorithm 1. Here, Train(Ŝ_k; θ_k) trains the model for partition k on the subset Ŝ_k for a few iterations and returns the estimated parameters θ_k and the objective G(Ŝ_k). Approximation guarantee. The following theorem shows that if G is α-submodular with α ≥ α_G and has curvature bounded by κ*_G, Algorithm 1 enjoys an approximation guarantee (proven in Appendix A.3). Theorem 4. Given the set function G(S) defined in Eq. (4), let OPT be an optimal solution of the optimization problem (5) (or equivalently (7)). If Algorithm 1 returns Ŝ as the solution, then G(Ŝ) ≤ G(OPT) / [(1 − κ*_G) α_G²], where κ*_G and α_G are computed using Theorem 2.
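Abstracting away the Train() calls, one round of the majorization-minimization loop in Algorithm 1 amounts to building the modular cost table M[i][k] from Eq. (8) and minimizing it over bases of the partition matroid. The sketch below is a simplified variant that (i) queries an exact oracle for G instead of training models, and (ii) reassigns every instance to its cheapest component in one sweep rather than performing the single argmin move of lines 19-21:

```python
def mm_round(G, partition, K, alpha):
    """One simplified majorization-minimization round of Algorithm 1.

    `G` maps a set of (i, k) pairs to a loss value; in the paper this
    evaluation is delegated to the Train() routine, here it is an exact
    oracle. Instances are assumed to be labeled 0..n-1."""
    n = sum(len(S_k) for S_k in partition)
    S_hat = {(i, k) for k, S_k in enumerate(partition) for i in S_k}
    M = [[0.0] * K for _ in range(n)]
    for i in range(n):
        for k in range(K):
            if (i, k) in S_hat:
                # alpha_G * G((i, k) | S_hat \ {(i, k)})   [kept-element term]
                M[i][k] = alpha * (G(S_hat) - G(S_hat - {(i, k)}))
            else:
                # G({(i, k)}) / alpha_G                    [new-element term]
                M[i][k] = G({(i, k)}) / alpha
    # Minimizing the modular bound over bases of the partition matroid:
    # each instance picks exactly one (its cheapest) component.
    new_partition = [[] for _ in range(K)]
    for i in range(n):
        k_star = min(range(K), key=lambda k: M[i][k])
        new_partition[k_star].append(i)
    return new_partition
```

For a modular G the table M reproduces the element costs exactly, so a single round already recovers the optimal base; for the α-submodular training objective, repeated rounds descend on successive modular upper bounds.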

4. EXPERIMENTS WITH REAL DATA

In this section, we provide a comprehensive evaluation of our method on four real world classification datasets and show that our method is able to outperform several unsupervised partitioning methods and mixture models. Next, we evaluate PRESTO in the context of resource constrained deep learning methods, where we show that PRESTO is able to provide comparable accuracy with respect to state-of-the-art models for network pruning and knowledge distillation. Appendix F contains additional experimental results.

4.1. EXPERIMENTAL SETUP

Datasets. We experiment with four real world classification datasets, viz., CIFAR10 (Krizhevsky, 2009), PathMNIST (PMNIST) (Yang et al., 2021), DermaMNIST (DMNIST) (Yang et al., 2021) and SVHN (Netzer et al., 2011). Appendix D contains additional details about the datasets. Table 1 (caption): Comparison against three unsupervised partitioning methods (Equal-Kmeans (Bennett et al., 2000), Kmeans++ (Arthur & Vassilvitskii, 2006), Agglomerative (Müllner, 2011)), Mixture of Experts (MoE) (Shazeer et al., 2017) and three mixture models (GMM (Bishop & Nasrabadi, 2006), BGMM (Attias, 1999) and Learn-MLR (Pal et al., 2022)) for all datasets. In all cases, we used exactly the same model architecture for each component, which is a ResNet18 network with reduced capacity (described in Section 4.1). The best and second best results are highlighted in green and yellow respectively. Implementation details. For all models, we extract features x from the fifth layer of a pre-trained ResNet18 (He et al., 2016). We design h_{θ_k} using a neural architecture similar to ResNet18 with reduced capacity; specifically, from the sixth layer until the eighth layer, we reduce the number of convolutional filters to 4 instead of 128, 8 instead of 256 and 16 instead of 512, respectively. In each case, we set the number of model components K using cross validation. We found K = 4 for all datasets except DMNIST, and K = 3 for DMNIST. We use a fully connected single layer neural network to model the additional classifier π_φ that is used to decide the model component to be assigned to an instance during inference. The experimental setup is discussed in further detail in Appendix D.

4.2. RESULTS

Comparison with unsupervised partitioning methods and mixture models. We first compare our method against several unsupervised partitioning methods and mixture models. In the case of unsupervised partitioning, we use clustering methods to slice the data into K partitions before the start of training and then use these data partitions to train the K models. Specifically, we consider three unsupervised partitioning methods, viz., (1) Equal-Kmeans (Bennett et al., 2000), a K-means clustering method that ensures an equal distribution of instances across different clusters; (2) Kmeans++ (Arthur & Vassilvitskii, 2006), a variant of the K-means method that uses a smarter initialization of cluster centers; (3) Agglomerative clustering (Müllner, 2011), where the clusters are built hierarchically; (4) Mixture of Experts (MoE) (Shazeer et al., 2017), a Sparsely-Gated Mixture-of-Experts layer, which selects a sparse combination from a set of expert models using a trainable gating mechanism to process each input; and three mixture models, viz., (5) Gaussian mixture models (GMM) (Bishop & Nasrabadi, 2006), (6) Bayesian Gaussian mixture model (BGMM) (Attias, 1999) and (7) Learn-MLR (Pal et al., 2022), which presents a version of the AM algorithm for learning mixtures of linear regressions; we adapt this algorithm for classification as a baseline for PRESTO. Across all baselines, we employed exactly the same set of model architectures. More implementation details of these methods are given in Appendix D. In Table 1, we summarize the results in terms of classification accuracy P(ŷ = y) and Macro-F1 score. The Macro-F1 score is the unweighted mean of the per-class F1 scores, where each F1 score is the harmonic mean of precision and recall. We make the following observations. (1) PRESTO demonstrates better predictive accuracy than all the baselines. (2) For three out of four datasets, the unsupervised partitioning methods outperform the mixture models. Note that there is no consistent winner among the partitioning methods.
Thus, the best choice of the underlying partitioning algorithm changes across different datasets. However, since the optimal clusters obtained by our method are guided by supervised loss minimization, PRESTO performs well across all datasets.

Application on resource constrained learning. Given K, the number of partitions, PRESTO works with K lightweight models, which can be trained on different partitions of a dataset, instead of training one very large model on the entire dataset. To evaluate the efficacy of PRESTO in a resource constrained learning setup, we compare it against two model pruning methods, viz., (1) GraSP (Wang et al., 2020a) and (2) SNIP (Lee et al., 2019), and two methods for knowledge distillation, viz., (3) KD (Hinton et al., 2015) and (4) DGKD (Son et al., 2021). Pruning methods start with a large reference model and remove the weights that are less important for training, whereas knowledge distillation methods start with a high capacity model (teacher model) and then design a lightweight model (student) that can mimic the output of the large model with fewer parameters. For a fair comparison, we maintain a roughly similar (within 5% tolerance) number of parameters between each of the final lightweight models obtained by the pruning methods, the student models given by the knowledge distillation methods, and the total number of parameters used in our approach, Σ_{k=1}^{K} dim(θ_k). First, we set K of PRESTO the same as in Table 1. This gives the number of parameters of PRESTO as 82690 for CIFAR10, 54993 for PMNIST, 41047 for DMNIST and 55138 for SVHN. In Table 2, we summarize the results for this setup. We observe that PRESTO consumes significantly less GPU memory than any other method across all datasets, while it also yields the best accuracy for DMNIST. For DMNIST, it consumes an impressive 77% less GPU memory than the second most efficient method, viz., DGKD. The existing baselines use high capacity models, as reference models in pruning and teacher models in knowledge distillation, resulting in high training GPU memory consumption.
Next, we vary the number of parameters p of PRESTO (through K) as well as of the baselines, with p ∈ [62020, 124030] for CIFAR10, p ∈ [41247, 82485] for PMNIST, p ∈ [41047, 82087] for DMNIST and p ∈ [41356, 82702] for SVHN, and probe the variation of accuracy vs. maximum GPU memory used during both training and test. In Figure 3, we summarize the results for PRESTO, KD and DGKD, the two most resource efficient baselines from Table 2. We make the following observations. (1) PRESTO consumes significantly lower memory across diverse model sizes during both training and inference; and (2) in most cases, as we change the model size, the accuracies obtained by the baselines vary widely, whereas our model often shows only insignificant changes.

5. CONCLUSION

We present PRESTO, a novel framework for learning mixture models via data partitioning. Individual models specialising on a data partition present a good alternative to learning one complex model and help achieve better generalisation. We present a joint discrete-continuous optimization algorithm that forms the partitions with good approximation guarantees. Through our experiments, we demonstrate that PRESTO achieves the best performance across different datasets when compared with several mixture models and unsupervised partitioning methods. We also show that PRESTO achieves the best accuracy vs. memory utilisation trade-off when compared with knowledge distillation and pruning methods. Our work opens several areas for research, such as handling datasets with larger class imbalance, outlier detection, handling out-of-distribution data, etc.

6. REPRODUCIBILITY STATEMENT

We provide a zip file containing all the source code for PRESTO as well as the various baselines we use, as part of the supplementary material. The implementation details and machine configuration for our experiments are given in Section D of the Appendix. All our datasets are public and can be obtained easily. Users can download the datasets and use the provided code to reproduce the results presented in this paper.

Next, we note that g(θ*_k(T), T ∪ {j}) − g(θ*_k(T), T) = g(θ*_k(T), {j}), which gives us:

g(θ*_k(T ∪ {j}), T ∪ {j}) − g(θ*_k(T), T) ≤ g(θ*_k(T), {j}).

This leads us to bound the ratio in Eq. (12) as follows:

[g(θ*_k(S ∪ {j}), S ∪ {j}) − g(θ*_k(S), S)] / [g(θ*_k(T ∪ {j}), T ∪ {j}) − g(θ*_k(T), T)]   (17)
  ≥ g(θ*_k(S ∪ {j}), {j}) / g(θ*_k(T), {j})
  ≥ g(θ*_k(S ∪ {j}), {j}) / [g(θ*_k(S ∪ {j}), {j}) + g(θ*_k(T), {j}) − g(θ*_k(S ∪ {j}), {j})].   (18)

To bound the denominator, we note that:

g(θ*_k(T), {j}) − g(θ*_k(S ∪ {j}), {j}) ≤ ℓ(h_{θ*_k(T)}(x_j), y_j) − ℓ(h_{θ*_k(S ∪ {j})}(x_j), y_j) + λ∥θ*_k(T)∥².

Now, we note that

0 ≥ g(θ*_k(T), T) − g(0, T) = λ∥θ*_k(T)∥² + ∇_θ ℓ(h_θ(x_i), y_i)⊤|_{θ=0} θ*_k(T) + (1/2) θ*_k(T)⊤ [∇²_θ ℓ(h_θ(x_i), y_i)]|_{θ ∈ (0, θ*_k(T))} θ*_k(T)
  ⟹ (1/2) ∥θ*_k(T)∥² ε_min + λ∥θ_k∥² − L∥θ_k∥ < 0.   (19)

Thus, we have ∥θ_k∥ ≤ L / (λ + ε_min/2). Putting this in Eq. (19), we have:

g(θ*_k(T), {j}) − g(θ*_k(S ∪ {j}), {j}) ≤ L∥θ*_k(T) − θ*_k(S ∪ {j})∥ + λ∥θ*_k(T)∥²   (20)
  ≤ 2L² / (λ + ε_min/2) + λL² / (λ + ε_min/2)².   (21)

Thus, replacing the above quantity in Eq. (18), we have:

α ≥ α* = ℓ_min / [ℓ_min + 2L²/(λ + 0.5 ε_min) + λL²/(λ + 0.5 ε_min)²].   (22)

(3) (Proof of the bound on the generalized curvature) From the definition of curvature,

1 − κ_g(S) = min_{a ∈ V} [g(θ*_k(S), S) − g(θ*_k(S \ {a}), S \ {a})] / g(θ*_k({a}), {a}).

From Eq. (11),

g(θ*_k(S), S) − g(θ*_k(S \ {a}), S \ {a}) ≥ ℓ_min,   (24)
g(θ*_k({a}), {a}) ≤ max_a λ∥θ_k∥² + ℓ(f_{θ_k}(x_a), y_a) = ℓ_max.   (25)

Thus, κ_g(S) ≤ 1 − ℓ_min / ℓ_max.   (26)

If we add an element a = (j, t) to S, only g(θ_t, S) will change among the component models. Thus,

κ_G(S) = 1 − min_{a ∈ V} [G(S) − G(S \ {a})] / G(a)   (27)
  = 1 − min_{a ∈ V} [g(θ*_t(S), S) − g(θ*_t(S \ {a}), S \ {a})] / g(θ*_t({a}), {a})
  ≤ 1 − ℓ_min / ℓ_max.

A.3 APPROXIMATION GUARANTEES

We next prove the approximation bound for Theorem 4.

Theorem (4). If the function G is α_G-submodular and has a curvature κ_G, Algorithm 1 obtains an approximation guarantee of |D| / [α_G (1 + (|D| − 1)(1 − κ_G) α_G)] ≤ 1 / [α_G² (1 − κ_G)], assuming there exists a perfect training oracle in Lines (6, 12, 14).

Proof. From the definition of α-submodularity, note that α_G G(S) ≤ Σ_{i ∈ S} G(i). Next, we can obtain the following inequality for any k ∈ S using weak submodularity:

G(S) − G(k) ≥ α_G Σ_{j ∈ S \ k} G(j | S \ j).

We can add this up over all k ∈ S and obtain:

|S| G(S) − Σ_{k ∈ S} G(k) ≥ α_G Σ_{k ∈ S} Σ_{j ∈ S \ k} G(j | S \ j) ≥ α_G (|S| − 1) Σ_{k ∈ S} G(k | S \ k).

Finally, from the definition of curvature, note that G(k | S \ k) ≥ (1 − κ_G) G(k). Combining all this together, we obtain:

|S| G(S) ≥ (1 + α_G (1 − κ_G)(|S| − 1)) Σ_{j ∈ S} G(j),

which implies:

Σ_{j ∈ S} G(j) ≤ [|S| / (1 + α_G (1 − κ_G)(|S| − 1))] G(S).

Combining this with the fact that α_G G(S) ≤ Σ_{i ∈ S} G(i), we obtain that:

G(S) ≤ (1/α_G) Σ_{i ∈ S} G(i) ≤ [|S| / (α_G (1 + α_G (1 − κ_G)(|S| − 1)))] G(S).   (32)

Note that |S| / (1 + α_G (1 − κ_G)(|S| − 1)) ≤ 1 / [α_G (1 − κ_G)], so we use this factor in the approximation bound. The approximation guarantee then follows from some simple observations. In particular, given the modular approximation

m_G(S) = (1/α_G) Σ_{i ∈ S} G(i),   (33)

which satisfies G(S) ≤ m_G(S) ≤ β_G G(S), we claim that optimizing m_G essentially gives a β_G approximation factor. To prove this, let S* be the optimal subset, and Ŝ be the subset obtained by optimizing m_G. The following chain of inequalities holds:

G(Ŝ) ≤ m_G(Ŝ) ≤ m_G(S*) ≤ β_G G(S*).

This shows that Ŝ is a β_G approximation of S*. Finally, note that this is just the first iteration of PRESTO; with subsequent iterations, PRESTO is guaranteed to reduce the objective value, since we only proceed if there is a reduction in the objective value.
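The quantities appearing in this bound can be probed numerically on toy set functions. The sketch below brute-forces the α-submodularity constant (the smallest ratio of marginal gains over nested sets) for two illustrative functions; these are stand-ins for illustration, not the paper's G:

```python
import itertools

def marginal(F, j, S):
    """Marginal gain F(S ∪ {j}) − F(S)."""
    return F(S | {j}) - F(S)

def alpha_submodularity(F, V):
    """Smallest ratio marginal(F, j, S) / marginal(F, j, T) over S ⊆ T, j ∉ T.
    The ratio is 1 for submodular F and shrinks for approximately submodular F."""
    subsets = [set(c) for r in range(len(V) + 1)
               for c in itertools.combinations(sorted(V), r)]
    ratios = []
    for S in subsets:
        for T in subsets:
            if S <= T:
                for j in V - T:
                    num, den = marginal(F, j, S), marginal(F, j, T)
                    if den > 0:
                        ratios.append(num / den)
    return min(ratios)

# A concave-of-cardinality function is submodular: the constant is exactly 1.
V = {0, 1, 2, 3}
alpha = alpha_submodularity(lambda S: len(S) ** 0.5, V)
```

For a submodular function the constant is 1, and the approximation factor 1/(α_G²(1 − κ_G)) degrades gracefully as the constant shrinks below 1.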

B COMPARISON WITH EM ALGORITHM

The general EM algorithm usually considers a prior and computes the membership probability of each datapoint at each iteration. It makes assumptions about the generative mechanism of the data as well as a prior on the cluster probabilities. Such a probabilistic approach naturally leads one to develop an EM algorithm. On the other hand, our method makes no assumption about either the data or the cluster membership. Since we make no probabilistic assumptions, we are naturally led to a set optimization problem: we do not make use of any "continuous" or "soft" scores on cluster membership. Thus, our method is an MM algorithm, an iterative set optimization procedure, which is functionally very different from the EM algorithm.
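The contrast can be made concrete with a toy hard-assignment loop in the spirit of this alternating scheme. The component models here are scalar means under squared loss, purely for illustration; Algorithm 1 additionally enforces the matroid span constraint and uses trained models in place of means:

```python
import numpy as np

def hard_assignment_mm(X, K, rounds=20, seed=0):
    """Alternate between (i) refitting each model on its current partition and
    (ii) reassigning every point to the model with the lowest loss on it.
    No soft membership probabilities are ever computed."""
    rng = np.random.default_rng(seed)
    assign = rng.integers(0, K, size=len(X))
    theta = np.zeros(K)
    for _ in range(rounds):
        # Refit step: each model is fit on its own hard partition.
        for k in range(K):
            if (assign == k).any():
                theta[k] = X[assign == k].mean()
        # Partition step: each point moves to the model with the smallest loss.
        assign = ((X[:, None] - theta[None, :]) ** 2).argmin(axis=1)
    return theta, assign
```

Unlike EM, every intermediate state is a hard partition of the data, which is what makes the set-function view applicable.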

C DIFFERENCE WITH MIXTURE OF LINEAR REGRESSION

All our experiments consider classification tasks. Therefore, we modify Learn-MLR (Pal et al., 2022) into a mixture of classifiers. However, we note that the optimization problem formulation of the mixture of linear regressions (or classifiers) can be viewed as a special case of the problem formulation of our method. In most cases, the existing methods make assumptions about the generative mechanism of the cluster membership (e.g., some definite prior) and resort to an algorithm that makes involved use of that assumption and is significantly tailored to a particular task (mixture of regressions or classifiers). In contrast, our framework is generic and does not make any specific assumption about the data or the cluster membership. Moreover, our algorithm is very different from those used by the usual mixture model learning methods.

D ADDITIONAL EXPERIMENTAL SETUP D.1 DATASET DETAILS

We experiment with four real world classification datasets. • CIFAR10 (Krizhevsky, 2009) has images of 10 different real-world objects. • SVHN (Netzer et al., 2011) has real-world images for the 10-digit classification task. • PathMNIST (PMNIST) (Yang et al., 2021) has images of colorectal cancer histology slides. • DermaMNIST (DMNIST) (Yang et al., 2021) has dermatoscopic images of common pigmented skin lesions. Using a ResNet18 pretrained on the ImageNet (Deng et al., 2009) dataset, we obtain feature embeddings of these datasets. For the CIFAR10 dataset, we extract embeddings from the fourth last layer of the pretrained ResNet18, and for the other datasets we extract embeddings from the fifth last layer.

D.2 MODEL ARCHITECTURE USED FOR OUR METHOD, THE UNSUPERVISED PARTITIONING METHODS AND THE MIXTURE MODELS

For our method, the unsupervised partitioning methods and the mixture models (all methods in Table 1 and PRESTO in Table 2), we use a relatively light model. This model is identical in architecture to the final five layers of ResNet18 for PMNIST, DMNIST and SVHN, and to the final four layers of ResNet18 for CIFAR10. This design is chosen because we train on extracted embeddings. The model reduces the number of convolution filters at various junctures: the fifth last layer has a filter bank of size 128, which is replaced with one of size 4; similarly, the filter bank at the fourth last layer is reduced from 256 to 8, and that at the third last layer from 512 to 16. We train an additional classifier, a single-layer fully connected feed forward network that aids in predicting the final class for each instance.

D.3 IMPLEMENTATION DETAILS FOR PRESTO

Number of partitions K. The number of partitions (K) is chosen for each dataset by analyzing classification accuracy vs. K on the validation set. Table 5 shows K for each dataset. Initialization. PRESTO is initialised using one of the clustering techniques, depending on the dataset. We also try to ensure that the clusters are of approximately the same size, imposing a size constraint of |N|/K on each bin, with some leeway. The leeway granted for every dataset is also a tunable hyperparameter. Final selection of (i, k) from the matrix M. The values of the matrix M computed in line 19 of Algorithm 1 are supplemented with the Euclidean distance D of a point to the existing bin center. The final binning criterion becomes M + ϵD. Here, ϵ = 3e−3 for CIFAR10, SVHN and PMNIST, whereas for DMNIST, ϵ = 6e−3. Optimizer. We use the Adam optimizer with the cross-entropy loss to train the PRESTO models and the classifier. The learning rate is set to 0.01, and the regularizer λ = 1e−4. Table 5 encapsulates the various hyperparameters discussed for the datasets. In Algorithm 1, Train(•) trains the models on their respective data partitions for 10 epochs. We set Iterations = 30. The additional classifier π_ϕ is also trained for 10 epochs.
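The final selection criterion can be sketched as follows. This is a hypothetical minimal version: M is assumed to be an (instances × K) matrix of modular scores, and centers holds the current bin centers:

```python
import numpy as np

def select_pair(M, X, centers, eps=3e-3):
    """Pick the (instance, bin) pair minimizing M + eps * D, where M holds the
    modular scores and D the Euclidean distance of every instance to every bin
    center (both of shape: number of instances x K)."""
    D = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)
    i, k = np.unravel_index((M + eps * D).argmin(), M.shape)
    return int(i), int(k)
```

The distance term acts as a mild tie-breaker, nudging each instance toward a geometrically compatible bin when the modular scores are close.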


Table 1 mentions various baselines. Details about each baseline are covered in the following points.

D.4 IMPLEMENTATION DETAILS ABOUT THE UNSUPERVISED PARTITIONING METHODS AND MIXTURE MODELS

Equal-Kmeans. We use the (Josh Levy-Kramer, 2022) implementation of constrained K-means clustering. We run the algorithm for K clusters, and constrain the size of each cluster to lie within [|N|/K − Leeway, |N|/K + Leeway], i.e., roughly equally sized clusters with tolerance equal to the Leeway parameter. A Kmeans++ initialization is used for selecting the initial centroids. We set the number of times the algorithm is run with different centroid seeds to n_init = 10; the final result is the best output of the n_init consecutive runs in terms of inertia. The algorithm runs for a maximum of 300 iterations or until convergence, whichever occurs earlier. The convergence criterion is that the Frobenius norm of the difference in the cluster centers of two consecutive iterations becomes less than 1e−4.

Kmeans++. We use the scikit-learn (Pedregosa et al., 2011) implementation of Kmeans++ with the following parameters. The number of clusters is set to K. A Kmeans++ initialization is used for selecting the initial centroids. We set the number of times the algorithm is run with different centroid seeds to n_init = 10; the final result is the best output of the n_init consecutive runs in terms of inertia. The algorithm runs for a maximum of 300 iterations or until convergence, whichever occurs earlier, with the same convergence criterion as above.

Agglomerative. This is a bottom-up hierarchical clustering approach; we use the scikit-learn (Pedregosa et al., 2011) implementation. We use the Euclidean distance metric to calculate distances between data points. Initially each data point is in its own cluster; the clusters are subsequently merged based on a linkage criterion. We use the Ward linkage criterion, which minimizes the sum of squared differences within each cluster. The number of clusters is set to K.

GMM. We use the scikit-learn (Pedregosa et al., 2011) implementation of GMM with the number of components set to K. The GMM weights are initialized according to a KMeans initialization.
The convergence threshold for the EM iterations is set to 1e−3. BGMM. We use the scikit-learn (Pedregosa et al., 2011) implementation of BGMM with the number of components set to K. The BGMM weights are initialized according to a KMeans initialization. The convergence threshold for the EM iterations is set to 1e−3. The weight concentration prior is a Dirichlet process. Learn-MLR. We adapt the original paper (Pal et al., 2022) for the purpose of classification. The initial partitioning is random, and we do not allow a leeway when the data is re-partitioned. The ϵ parameter, which weights the contribution of the Euclidean distance D, is set to 0. An Adam optimizer is used with the cross-entropy loss; the learning rate is set to 0.01 and the regularizer λ = 1e−4. Mixture of Experts. We use the Sparse-MoE layer described in (Shazeer et al., 2017) for classification. The expert networks have the architecture described in Section D.2. We adapt the (Rau, 2019) implementation for this purpose. We use Noisy Top-K Gating as described in (Shazeer et al., 2017), which balances the number of training examples each expert receives. We set the number of experts used for each batch element to k = 4.
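The size-constrained assignment used by Equal-Kmeans can be illustrated with a simple greedy stand-in. The actual (Josh Levy-Kramer, 2022) implementation solves a proper constrained assignment problem; this sketch only captures the capacity-with-leeway idea:

```python
import numpy as np

def capped_assign(X, centers, leeway=0):
    """Greedily assign points to their nearest center subject to a capacity of
    ceil(N / K) + leeway per cluster. Points with a confident nearest-center
    choice (large margin to the second-nearest center) are placed first."""
    N, K = len(X), len(centers)
    cap = -(-N // K) + leeway  # ceil(N / K) plus the allowed leeway
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)
    two_smallest = np.partition(d, 1, axis=1)
    order = np.argsort(two_smallest[:, 1] - two_smallest[:, 0])[::-1]
    labels, sizes = np.full(N, -1), np.zeros(K, dtype=int)
    for i in order:
        for k in np.argsort(d[i]):  # nearest center that still has room
            if sizes[k] < cap:
                labels[i], sizes[k] = k, sizes[k] + 1
                break
    return labels
```

The capacity guarantees roughly equal cluster sizes even when the data is heavily skewed toward one region.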

D.5 IMPLEMENTATION DETAILS FOR RESOURCE-CONSTRAINED EXPERIMENTS

We evaluate PRESTO in a resource-constrained learning setting by comparing it with model pruning and knowledge distillation methods, whose specifications are given in the following points. SNIP (Lee et al., 2019). SNIP is a pruning-at-initialization method that, given a large reference network, prunes network connections to a desired sparsity level (i.e., number of non-zero weights). Pruning is done prior to training, in a data-dependent way based on the loss function at a variance-scaling initialization, to create a sparse subnetwork which is then used for inference. We adapt the implementation from (Su et al., 2020) to run experiments using SNIP. We train a large reference model with the same architecture as the PRESTO models described in Section D.2. The reference model has a convolution filter size of 256 at the fifth last layer, 512 at the fourth last layer and 1024 at the third last layer. The pruning ratio is set so that the number of nonzero parameters in the reference network after pruning equals that of the corresponding PRESTO model in all experiments. The models are trained using an Adam optimizer and the cross-entropy loss with a learning rate of 0.1. The reference models are trained for 300 epochs, and the subsequent pruned models are trained for a total of 20 epochs until convergence. GraSP (Wang et al., 2020a). GraSP is a pruning-at-initialization method similar to SNIP that also learns a smaller subnetwork given a reference network, which can subsequently be trained independently. The gradient norm after pruning is the pruning criterion in GraSP: those weights whose removal results in the least decrease in the gradient norm after pruning are pruned. We adapt the implementation from (Su et al., 2020) to run experiments using GraSP. We train a large reference model with the same architecture as the PRESTO models described in Section D.2.
The reference model has a convolution filter size of 256 at the fifth last layer, 512 at the fourth last layer and 1024 at the third last layer. The pruning ratio is set so that the number of nonzero parameters in the reference network after pruning equals that of the corresponding PRESTO model in all experiments. The models are trained using an Adam optimizer and the cross-entropy loss with a learning rate of 0.1. The reference models are trained for 300 epochs, and the subsequent pruned models are trained for a total of 20 epochs until convergence. KD (Hinton et al., 2015). We train two models, a teacher model and a student model, whose architecture is the same as the PRESTO models described in Section D.2. The teacher model is heavier: the convolution filter size at the fifth last layer is 128, at the fourth last layer 256, and at the third last layer 512. The lighter student model has convolution filter sizes of 12, 16 and 32. The models are trained using an Adam optimizer and the cross-entropy loss. The KD temperature hyperparameter is set to 5. The teacher model is trained first, and the loss of the student model is modified to include a contribution from the teacher model; the multiplicative hyperparameter controlling this contribution is set to 0.9. DGKD (Son et al., 2021). This method requires us to train three models: a teacher model, a TA model and a student model. The teacher model is the heaviest, with convolution filter sizes of 128, 256 and 512. The TA model is slightly lighter, with filter sizes of 32, 64 and 128. The student model is the lightest, with filter sizes of 12, 16 and 32. The temperature hyperparameter is 5, and the factor controlling the contribution of the teacher and TA models is set to 0.9. We train the teacher and the TA models independently.
As in KD, the loss function of the student model is modified to include a contribution from the teacher and the TA models. Machine configuration. We performed our experiments on a computer system running Ubuntu 16.04.6 LTS, with an 8-core Intel i7 CPU and a total RAM of 528 GB. The system had a single Titan RTX GPU, which was employed in our experiments.
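For concreteness, the distillation objective of KD with the hyperparameters above (temperature 5, teacher weight 0.9) can be written in a few lines of NumPy. The actual training uses PyTorch models; this is only a sketch of the loss:

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(student_logits, teacher_logits, labels, T=5.0, alpha=0.9):
    """alpha * T^2 * KL(teacher_T || student_T) + (1 - alpha) * CE(student, y)."""
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    kl = (p_t * (np.log(p_t) - np.log(p_s))).sum(axis=-1).mean()
    ce = -np.log(softmax(student_logits)[np.arange(len(labels)), labels]).mean()
    return alpha * T ** 2 * kl + (1 - alpha) * ce
```

The T² factor keeps the gradient magnitudes of the softened KL term comparable to the hard cross-entropy term; DGKD adds analogous KL terms from the TA model.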

F ADDITIONAL EXPERIMENTS ON REAL DATA

Accuracy vs. K. Here, we assess the impact of the number of partitions (K) on accuracy. K is varied from 3 to 6. Figure 7 shows the variation of the test accuracy and the Macro-F1 score with K. We observe that the performance is stable across this range.

G EXPLANATION OF PRESTO (ALGORITHM 1)

In PRESTO we aim to minimize a monotone approximately submodular function G. In contrast to submodular or approximately submodular maximization, where one can use greedy algorithms, minimizing the approximately submodular function G needs a completely different approach. Here, we resort to minimizing an upper bound m of G: reducing the value of this upper bound m ensures that the value of the underlying function G remains low. Such an upper bound m has to be chosen so that the gap between m and G is small and minimizing m is convenient. To this end, we choose m to be modular. One can compare such a modular approximation of a set function to a simple linear approximation of a complex nonlinear function in the context of continuous optimization. The modular score of a pair (i, k) involves a marginal gain of G scaled by α_G, depending on whether (i, k) ∈ Ŝ (3rd term in Eq. (39)) or (i, k) ∉ Ŝ (4th term in Eq. (39)). We compute these two quantities in L15 (line no. 15 in Algorithm 1) and L17 of the algorithm, and store the values of m at different pairs (i, k) in the matrix M. Finally, we take the minimum of M to find (i*, k*) (L19), which indicates that partition k* should contain i*. Therefore, we include i* in Ŝ_{k*} (L20). Since any instance can belong to exactly one partition, we remove i* from all other partitions k ≠ k* (L21). Finally, we update Ŝ (L22).

H ADDITIONAL EXPERIMENTS WITH DATASETS HAVING LABEL NOISE

The datasets above are homogeneous, well balanced, and lack noise. In this section, we present experiments where we add different amounts of label noise. Specifically, we change each label y to a wrong label y′ with probability 10%, thereby adding heterogeneity to the dataset, and then probe the performance. The results for CIFAR10 and DMNIST are presented in Table 9. We observe that PRESTO performs much better than the clustering baselines (Equal-Kmeans (Bennett et al., 2000), Kmeans++ (Arthur & Vassilvitskii, 2006), Agglomerative (Müllner, 2011)). Numbers in green and yellow indicate the best and the second best method.
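The noise injection step can be sketched as follows (a minimal version; the offset trick guarantees the flipped label is always wrong):

```python
import numpy as np

def add_label_noise(y, num_classes, p=0.10, seed=0):
    """Flip each label to a uniformly chosen *wrong* label with probability p."""
    rng = np.random.default_rng(seed)
    noisy = y.copy()
    flip = rng.random(len(y)) < p
    # An offset in {1, ..., num_classes - 1} guarantees the new label is wrong.
    offsets = rng.integers(1, num_classes, size=flip.sum())
    noisy[flip] = (noisy[flip] + offsets) % num_classes
    return noisy
```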

I TIME COMPLEXITY ANALYSIS

Although PRESTO performs data partitioning while simultaneously learning a set of mixture models, the partitioning time (lines 19-22 in Algorithm 1) is negligible, as the bulk of the time is devoted to training (lines 10-18 in Algorithm 1). Since the training stage is common to all the methods, PRESTO is not at much of a disadvantage in terms of time.



For brevity, we present our analysis for the classification setup. However, our framework is also applicable, as is, to the regression setup. The Bregman divergence is defined as: B_ϕ(x, y) = ϕ(x) − ϕ(y) − ⟨∇ϕ(y), x − y⟩.
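For two classic choices of ϕ, this definition recovers familiar divergences, which can be verified numerically (a small sketch; the gradients are supplied in closed form):

```python
import numpy as np

def bregman(phi, grad_phi, x, y):
    """B_phi(x, y) = phi(x) - phi(y) - <grad phi(y), x - y>."""
    return phi(x) - phi(y) - grad_phi(y) @ (x - y)

# phi(x) = ||x||^2 recovers the squared Euclidean distance ...
x, y = np.array([1.0, 2.0]), np.array([3.0, -1.0])
b1 = bregman(lambda v: v @ v, lambda v: 2 * v, x, y)  # equals ||x - y||^2 = 13

# ... while negative entropy recovers the KL divergence between distributions.
p, q = np.array([0.2, 0.8]), np.array([0.5, 0.5])
b2 = bregman(lambda v: (v * np.log(v)).sum(), lambda v: np.log(v) + 1, p, q)
```

The squared-Euclidean case connects the Bregman clustering objective to k-means, and the negative-entropy case to clustering with KL divergence.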



Figure 3: Trade off between accuracy and maximum GPU memory for KD, DGKD and PRESTO during training (panel (a)) and inference (panel (b)) for CIFAR10 and PMNIST.

Figure 6: Snapshot of the true partition (panel (a)); and, the snapshots of the trained models {h θ k | k ∈ [K]} and the partitions {S k | k ∈ [K]} predicted by PRESTO (panels (b)-(d)) on the synthetic dataset during progression of Algorithm 1. The dataset is generated using a degenerate mixture model with K = 4 components. Here, an instance (x i , y i ) belongs to exactly one of the four sets {S * k | k ∈ [4]} with probability 1, where S * k are defined in Eqs. (35)-(38). Each model component is an SVM, i.e., h θ k (x) = w ⊤ k x + b k and ℓ(ŷ, y) = (1 -ŷy) + . We observe that as r increases, PRESTO becomes more and more accurate in assigning an instance to the correct mixture component and finally, at r = 50, it is almost able to recover the true mixture component-the ground truth assignments of the instances (panel (a)) is extremely close to the final assignments (panel (d)).

Figure 7: Accuracy vs. the number of partitions K.

Accuracy vs. GPU usage for DMNIST and SVHN. Here, we perform the same experiments on DMNIST and SVHN as in Figure 3. Figure 8 summarizes the results, which reveal the same insights as Figure 3.

Figure 8: Trade off between accuracy and maximum GPU memory for KD, DGKD and PRESTO during training (panel (a)) and inference (panel (b)) for DMNIST and SVHN.

Now, minimizing the modular function m_Ŝ[S] w.r.t. S is equivalent to taking the minimum of m_Ŝ[(i, k)] over different (i, k) pairs (since modular(Set) = Σ_{e ∈ Set} modular(e)). In each iteration of Algorithm 1, we simply compute the minimum of m and update S accordingly. Note that:

m^G_Ŝ[S] = G(Ŝ) − Σ_{(i,k) ∈ Ŝ} α_G G((i, k) | Ŝ \ {(i, k)}) + Σ_{(i,k) ∈ Ŝ ∩ S} α_G G((i, k) | Ŝ \ {(i, k)}) + Σ_{(i,k) ∈ S \ Ŝ} G((i, k) | ∅)   (39)

Hence, minimizing m^G_Ŝ[(i, k)] amounts to maximizing α_G G((i, k) | Ŝ \ {(i, k)}) when (i, k) ∈ Ŝ, or minimizing G((i, k) | ∅) when (i, k) ∉ Ŝ.

Algorithm 1 The PRESTO Algorithm Require: Training data D, αG, K model architectures, Iterations. 1: Output: The learned parameters {θ k

Comparison of classification accuracy P(ŷ = y) and Macro-F1 score of PRESTO against three unsupervised partitioning methods (Equal-Kmeans

lists the further details of these datasets. We see that CIFAR10 is a balanced dataset, whereas PathMNIST, DermaMNIST and SVHN have a large skew in their label proportions.



Dataset specific hyperparameters for PRESTO

Comparison of classification accuracy P(ŷ = y) of PRESTO against three unsupervised partitioning methods (Equal-Kmeans

Table 10 presents an analysis of the per-iteration training time: a comparison of the training time per iteration of PRESTO against Equal-Kmeans (Bennett et al., 2000) and Kmeans++ (Arthur & Vassilvitskii, 2006).

Appendix A PROOFS OF THE TECHNICAL RESULTS IN SECTION 3

A.1 FORMAL DISCUSSION ON MATROIDS

Definition 5. A matroid is a combinatorial structure M := (V, I) defined on a ground set V and a family of independent sets I ⊆ 2^V, which satisfies two conditions: (1) if S ⊆ T and T ∈ I, then S ∈ I; (2) if S ∈ I, T ∈ I and |T| > |S|, then there exists an e ∈ T \ S such that S ∪ {e} ∈ I. From the above definition, it is clear that all maximal independent sets have the same cardinality. A maximal independent set is called a base of the matroid.
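Definition 5 can be checked by brute force on small ground sets. The sketch below does this for a partition matroid of the kind underlying our span constraint, where the ground set consists of (instance, partition) pairs and each instance may be assigned at most once; the element names are purely illustrative:

```python
import itertools

def is_independent(S, blocks, caps):
    """Partition-matroid independence: S takes at most caps[b] elements from
    each block b (here: at most one partition per instance)."""
    counts = {}
    for e in S:
        b = blocks[e]
        counts[b] = counts.get(b, 0) + 1
        if counts[b] > caps[b]:
            return False
    return True

def check_matroid_axioms(V, indep):
    """Brute-force the two conditions of Definition 5 on a small ground set."""
    subsets = [frozenset(c) for r in range(len(V) + 1)
               for c in itertools.combinations(sorted(V), r)]
    ind = [S for S in subsets if indep(S)]
    # (1) Downward closure: every subset of an independent set is independent.
    down = all(indep(S) for T in ind for S in subsets if S <= T)
    # (2) Exchange: a larger independent set can always extend a smaller one.
    exch = all(any(indep(S | {e}) for e in T - S)
               for S in ind for T in ind if len(T) > len(S))
    return down and exch
```

The bases of this partition matroid are exactly the complete assignments in which every instance is placed in exactly one partition, which is why a span (base) constraint encodes a full data partitioning.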

A.2 MONOTONICITY AND α-SUBMODULARITY OF G

Here we prove the claims of Theorem 2. We repeat the theorem for convenience.

Theorem (2). Given the set function G(S) defined in Eq. (4) and the individual regularized loss functions ℓ introduced in Eq. (1), we have the following results. (1) Monotonicity: the function G(S) defined in Eq. (4) is monotone non-decreasing in S. (2) α-submodularity: if ℓ(h_{θ_k}(x), y) is L-Lipschitz for all k ∈ {1, .., K} and the regularizing coefficient λ satisfies the condition used in the proof below, then G is α_G-submodular with α_G ≥ ℓ_min / [ℓ_min + 2L²/(λ + 0.5 ε_min) + λL²/(λ + 0.5 ε_min)²]. (3) Generalized curvature: the generalized curvature κ_G(S) for any set S satisfies κ_G(S) ≤ 1 − ℓ_min/ℓ_max.

(1) (Proof of monotonicity) From Eq. (10), we obtain the inequality in Eq. (11). Now, if we include a new element (j, t) into S, it will only change the loss g(θ*_t(S), S) among all model components, which gives the monotonicity of G.

(2) (Proof of α-submodularity) First, we show the α-submodularity of g(θ*_k(S), S). Hence, we first bound the following ratio, where Eq. (11) directly provides a bound on its numerator.

E EXPERIMENTS ON SYNTHETIC DATA

In this section, we experiment with synthetically generated instances from a latent degenerate mixture distribution and show that Algorithm 1 can accurately recover the partitions which correspond to the true mixture components.

E.1 EXPERIMENTAL SETUP

Dataset generation. We generate 20,000 instances (x, y) with x ∈ X ⊂ R² and y ∈ Y = {−1, +1}. The instances (x, y) belong to one of the K = 4 sets, viz., S*_1, S*_2, S*_3, S*_4, defined in Eqs. (35)-(38). Panel (a) of Figure 6 shows a scatter plot of (x, y). The instances are homogeneously distributed across all components, i.e., |S*_k| = 5000. Choice of h_{θ_k} and ℓ. We consider K = 4 linear support vector machines, with ℓ as the margin-based hinge loss, i.e., ℓ(h_{θ_k}(x), y) = (1 − y(w_k⊤x + b_k))_+. Then we apply PRESTO (Algorithm 1) to simultaneously learn the parameters {θ_k} and the partitions S_k such that D = ∪_{k=1}^4 S_k. In addition to the classification accuracy, we also measure the aggregated error in predicting the correct mixture component, defined as Err_S = Σ_{i,j} 1(i, j ∈ S*_k, i ∈ S_t, j ∈ S_{t′} with t ≠ t′). Thus, Err_S counts the number of pairs that belong to the same latent mixture component S*_k, but are predicted to have different mixture components by PRESTO.
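The Err_S metric can be computed directly from the two labelings (a straightforward sketch):

```python
import itertools

def err_pairs(true_labels, pred_labels):
    """Count pairs (i, j) that share a true mixture component but are placed in
    different predicted partitions (the Err_S metric defined above)."""
    n = len(true_labels)
    return sum(1 for i, j in itertools.combinations(range(n), 2)
               if true_labels[i] == true_labels[j]
               and pred_labels[i] != pred_labels[j])
```

Note that Err_S is invariant to a relabeling of the predicted partitions, which is essential since the component indices recovered by PRESTO carry no intrinsic order.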

E.2 RESULTS

In Figure 6, we plot different snapshots of the models h_{θ_k} and the partitions D = ∪_{k=1}^4 S_k as PRESTO progresses through different iterations r (line 9 in Algorithm 1). We make the following observations. (1) As r increases, the classification accuracy increases and the mixture component prediction error Err_S decreases. For r = 50, we observe that Err_S becomes very small and PRESTO is able to recover the true partitions, i.e., S*_k ≈ S_k. (2) We observe that PRESTO finds it relatively difficult to correctly assign the instances in S*_2 and S*_3 to the right mixture components. This is due to two reasons: first, the instances in S*_2 and S*_3 fit naturally to nonlinear SVMs, whereas we attempt to fit linear SVMs; second, the instances in S*_2 and S*_3 that lie around (0, 0) are close to each other, which poses difficulty in demarcating their model components.

