A BENCHMARK DATASET FOR LEARNING FROM LABEL PROPORTIONS
Anonymous authors
Paper under double-blind review

Abstract

Learning from label proportions (LLP) has recently emerged as an important technique for weakly supervised learning on aggregated labels. In LLP, a model is trained on groups (a.k.a. bags) of feature-vectors and their corresponding label proportions to predict labels for individual feature-vectors. While previous works have developed a variety of techniques for LLP, including novel loss functions, model architectures and their optimization, they typically evaluated their methods on pseudo-synthetically generated LLP training data, obtained from common small scale supervised learning datasets by randomly sampling or partitioning their instances into bags. Despite growing interest in this important task, there are no large scale open source LLP benchmarks to compare various approaches. Construction of such a benchmark is hurdled by two challenges: (a) the lack of natural large scale LLP-like data, and (b) the large number of mostly artificial methods of forming bags from instance-level datasets. In this paper we propose LLP-Bench: a large scale LLP benchmark constructed from the Criteo Kaggle CTR dataset. We perform an in-depth, systematic study of the Criteo dataset and propose a methodology to create a benchmark as a collection of diverse and large scale LLP datasets. We choose the Criteo dataset since it admits multiple natural collections of bags formed by grouping subsets of its 26 categorical features. We analyze all bag collections obtained by grouping on one or two categorical features, in terms of their bag-level statistics as well as embedding based distance metrics quantifying the geometric separation of bags. We then propose to include in LLP-Bench a few groupings that fairly represent real world bag distributions. We also measure the performance of state of the art models, loss functions (adapted to LLP) and optimizers on LLP-Bench, and perform a series of ablations to explain the performance of various techniques on LLP-Bench.
To the best of our knowledge LLP-Bench is the first open source benchmark for the LLP task. We hope that the proposed benchmark and the evaluation methodology will be used by ML researchers and practitioners to better understand and hence devise state-of-the-art LLP algorithms.
We use the squared ℓ2 distance (ℓ2²) for ease of computation, and we also prove certain metric-like properties for the inter-bag distances. Details and analysis of these distances are presented in Section 5.3. From the above analyzed groupings we select a representative subset with diverse statistical properties and include them in LLP-Bench, our proposed LLP benchmark, as a collection of LLP datasets. For each of the 52 groupings we create an LLP model training and testing setup as follows. We remove the small and large bags as above, and then recreate the instance-level dataset out of the remaining bags. We apply a 5-fold split to obtain 5 pairs of train/test splits at the instance level. For each train/test split, the training bags are recreated using the same grouping on the train set. We then train 1-layer MLP, 2-layer MLP and AutoInt models using various hyperparameter and optimizer settings on the training bags, and evaluate w.r.t. AUC scores on the test set. Details of the training and reported AUC scores are present in Section 6.1.
Statistically correlating LLP model performance. From the above experimentation we obtain statistics for different groupings based on their bag and label proportion distributions. We calculate the Pearson correlation scores between these LLP data-statistics and the model training performance for the different groupings.

1. INTRODUCTION

In traditional supervised learning, training data consists of feature-vectors (instances) along with their labels. A model trained using such data is then used during inference to predict the labels of test instances. In recent times, primarily due to privacy concerns and the relative rarity of high quality supervised data, the weakly supervised framework of learning from label proportions (LLP) has gained importance (Scott & Zhang (2020); Saket et al. (2022); O'Brien et al. (2022)). In LLP, the training data is available as a collection of subsets or bags of instances along with the label proportion for each bag. The goal is to learn a classification model for predicting the class-labels of individual instances (de Freitas & Kück (2005); Musicant et al. (2007)). Clearly, supervised learning is the special case of LLP when all bags are unit-sized. Unlike supervised learning however, for which a multitude of task-specific real-world datasets are easily available, the same is not true for LLP. While previous works have developed and explored a variety of algorithmic, optimization and deep-neural net based techniques for LLP (see Sec. 2 for more details), all of them experimentally evaluate their methods on pseudo-synthetic LLP datasets consisting of instances of some supervised learning dataset randomly sampled/partitioned into the different bags. Further, most of the above works use limited scale data, typically small UCI (Dua & Graff (2017)), image and social media datasets. An exception is the work of Saket et al. (2022) which also uses the Criteo Kaggle CTR (Criteo (2014)) and MovieLens-20m (Harper & Konstan (2016)) datasets, which are fairly large in scale: roughly 45 million and 20 million instances respectively. In particular, the Criteo dataset has 13 numerical and 26 categorical features whose semantics are undisclosed.
Each row is an impression and a {0, 1}-valued label indicates a click, comprising in total 7 days of impression-click data. The categorical features can be used to create many different bag collections depending on the subset used for grouping, where each choice of the subset's values yields a bag of instances having those feature values. These groupings simulate the typical aggregation scenarios in real-world use-cases; however, Saket et al. (2022) only experimented in a limited manner with one grouping. In contrast to the above state of affairs, a large number of publicly accessible real-world and large scale supervised-learning datasets have been studied over the years, whereas there are hardly any datasets curated specifically for LLP.

1.1. OUR CONTRIBUTIONS

In this work we address the unavailability of a large scale benchmark and standardized evaluation methodologies for LLP. We make the following contributions towards creating an LLP benchmark built on top of the publicly available Criteo Kaggle CTR dataset.
Bag collections using group-by feature-sets. Typically, for privacy preservation in CTR applications, the impressions are grouped into bags according to the values of features such as advertiser-id, product-id, date etc. Thus, we can simulate such aggregations on the Criteo dataset using any subset of the categorical feature-set. However, we observe that choosing more than two categorical features likely results in small-sized bags, which would be contrary to the goal of large scale LLP datasets. Therefore, our exploration limits the groupings to those obtained using at most two of the categorical features. Below we present the different aspects of our exploration of these groupings. We use a standard preprocessing previously used for training the AutoInt model (Song et al. (2019)) on the Criteo dataset at the instance level. More details can be found in Section 4.
Analysis, categorization, and filtering of groupings. There are 26 categorical features, leading to 26 + (26 choose 2) = 26 + 325 = 351 possible groupings using at most two categorical features. Our goal is to curate LLP datasets with neither too small nor very large bags (as the latter have very weak label supervision), and we always remove bags of size ≤ 50 and those of size > 2500 from these groupings, similar to the work of Saket et al. (2022). Post this removal, we identify as outliers those groupings which have at most 500 bags. The remaining 308 groupings are further analyzed in relation to their bag size and label proportion distributions. For each grouping we calculate the threshold bag sizes such that t% of the bags have at most that size, for t = 50, 70, 85, 95.
Using normalized vectors of these four values, we use k-Means clustering to partition the groupings into four subsets typified by increasing bag sizes. More details of these clusters can be found in Sec 5.1. Further, modeling the labels as iid Bernoulli with bias given by the average label of the dataset, we compute for each grouping the average of the log likelihoods of the bag label proportions. Using this we also cluster the set of groupings into four subsets indicating how far from random their label proportions are. Analysis of this characterisation can be found in Section 5.2. In the above removal of bags, a substantial fraction of the original dataset is also removed, since there is an abundance of small bags for most groupings. For subsequent analysis involving model training, we further filter out those groupings which retain less than 30% of the instances. This is to ensure that we only have large-scale LLP bag collections, and we obtain 52 groupings satisfying the retention condition. Details of removals by these filters can be found in Section 4.2. It turns out that these groupings have a similar number and average size of bags. We then proceed to estimate the geometric clustering of bags by computing the average inter-bag and intra-bag distances for these groupings, for which we use natural definitions based on the ℓ2² distance (Sec. 3.2). The main observations are:
1. A negative correlation with the percentile bag size thresholds, indicating that the model performance degrades when the groupings have larger bags. This is intuitively consistent with larger bags having less label supervision (for the same label proportion) than smaller bags.
2. A negative correlation with the label proportion log likelihood. Since bags with label proportions deviating from the global label bias have lower likelihood, roughly speaking this means that those groupings where the positive labels are concentrated in fewer bags have better training performance.
3.
A positive correlation with the ratio of the Average Inter-Bag Distance to the Average Intra-Bag Distance. A higher ratio indicates a good separation between the bags; hence, the positive correlation indicates that models perform better when there is considerable variation in bag distributions. Some other correlation scores obtained and their interpretation are described in Section 6.2.

2. PREVIOUS WORK

The study of LLP is motivated by applications in which only the aggregated labels for groups (bags) of feature vectors are available due to privacy or legal constraints (Rüping (2010); Wojtusiak et al. (2011)), or inadequate or costly supervision (Dery et al. (2017); Chen et al. (2004)). It has also been used for several weakly supervised tasks such as IVF prediction (Hernández-González et al. (2018)) and image classification (Bortsova et al. (2018); Ørting et al. (2016)). More recently, LLP has been proposed by O'Brien et al. (2022) as a framework for privacy preserving conversion prediction. Several techniques for LLP have been studied over the years. de Freitas & Kück (2005); Hernández-González et al. (2013) trained probabilistic models using Monte-Carlo methods. Subsequent works (Musicant et al. (2007)

3. PRELIMINARIES

In our exploration of LLP we shall only consider binary {0, 1}-valued instance labels.
Notation: X := {x^(i) ∈ R^n}_{i=1}^{m} is the dataset of m feature vectors in n-dimensional space with labels given by Y := {y^(i) ∈ {0, 1}}_{i=1}^{m}. We denote by Ŷ := {ŷ^(i) ∈ [0, 1]}_{i=1}^{m} the corresponding model predictions, which are probabilities of the predicted label being 1.
Definition 3.1 (Bag). A bag B ⊆ [m] consists of the feature vectors X_B := {x^(i) | i ∈ B} with the corresponding label histogram y_B := Σ_{i∈B} y^(i). The label proportion of the bag is y_B/|B|.
With p = µ(B, Y) (the label bias of Definition 3.2), the log-likelihood of a bag is LL(B, y_B) = log f(|B|, y_B, p), where f(r, k, p) := (r choose k) p^k (1−p)^(r−k) is the pmf of B(r, p). The dataset average bag log-likelihood is AvgBagLL(B, Y) := (Σ_{B∈B} LL(B, y_B)) / |B|.
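The AvgBagLL statistic above can be sketched in a few lines. The following is a minimal pure-Python illustration (the function names are ours, not from the paper's code); it computes the label bias p over all bags and then averages the binomial log-pmf of each bag's label histogram.

```python
import math

def binom_log_pmf(r, k, p):
    """log f(r, k, p) = log [ (r choose k) p^k (1-p)^(r-k) ], via lgamma."""
    log_comb = math.lgamma(r + 1) - math.lgamma(k + 1) - math.lgamma(r - k + 1)
    return log_comb + k * math.log(p) + (r - k) * math.log(1 - p)

def avg_bag_ll(bags, labels):
    """AvgBagLL: mean bag log-likelihood under iid Bernoulli(p) labels,
    where p is the dataset label bias mu(B, Y). `bags` is a list of
    index lists into `labels`."""
    total = sum(len(b) for b in bags)
    p = sum(labels[i] for b in bags for i in b) / total  # label bias
    lls = []
    for b in bags:
        y_b = sum(labels[i] for i in b)  # label histogram (count of 1-labels)
        lls.append(binom_log_pmf(len(b), y_b, p))
    return sum(lls) / len(bags)
```

Note that p ∈ {0, 1} would make the log-pmf degenerate; on Criteo the label bias is strictly between 0 and 1 so this is not an issue.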

3.1. LLP MODEL TRAINING

In order to train a model on an LLP dataset, we apply common loss functions at the bag level. In this work we experiment with the binary cross entropy loss L_bce and the mean-squared error loss L_mse, which are defined for a bag B with average label z_B := y_B/|B| and average label prediction ẑ_B := ŷ_B/|B| as:
L_bce(B, z_B, ẑ_B) := −(z_B log ẑ_B + (1 − z_B) log(1 − ẑ_B)),  L_mse(B, y_B, ŷ_B) := |y_B − ŷ_B|². (1)
Note that both the above losses are minimized when ŷ_B = y_B. In our experiments we use mini-batch based model training on the LLP dataset. A mini-batch here consists of k bags B_1, ..., B_k and their corresponding label histograms y_{B_1}, ..., y_{B_k}. The model's predictions on all the instances in the bags of the minibatch are aggregated into the predicted label histograms ŷ_{B_1}, ..., ŷ_{B_k}. The batch-level loss is given by the sum of the bag-level losses over the mini-batch bags.
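The bag-level losses of Eq. (1) and the mini-batch aggregation can be sketched as follows; this is a framework-free illustration (the paper trains with TensorFlow/PyTorch-style autodiff, which this sketch does not attempt), and the eps clamp is our assumption to guard log(0).

```python
import math

def bag_bce(z_true, z_pred, eps=1e-7):
    """L_bce on label *proportions* z = y_B / |B|; eps clamp is an assumption."""
    z_pred = min(max(z_pred, eps), 1 - eps)
    return -(z_true * math.log(z_pred) + (1 - z_true) * math.log(1 - z_pred))

def bag_mse(y_true, y_pred):
    """L_mse on label *histograms* (counts): |y_B - yhat_B|^2."""
    return (y_true - y_pred) ** 2

def minibatch_loss(hist_true, hist_pred, sizes, loss="bce"):
    """Batch-level loss: sum of bag-level losses over the k bags of the
    mini-batch. hist_*: per-bag label histograms; sizes: per-bag |B|."""
    if loss == "bce":
        return sum(bag_bce(t / s, p / s)
                   for t, p, s in zip(hist_true, hist_pred, sizes))
    return sum(bag_mse(t, p) for t, p in zip(hist_true, hist_pred))
```

Both losses vanish exactly when the predicted histogram matches the true one, as noted above.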

3.2. BAG-LEVEL DISTANCES

We also analyse the geometric clustering of the feature vectors in bags, by comparing the separation among feature-vectors within bags and their separation across bags. For this, we define a natural bag separation function (Definition 3.3). While BagSep is not a metric, since BagSep(B, B) is not necessarily zero, the following lemma (proved in Appendix A.1) shows that it does satisfy the other metric properties.
Lemma 3.4. BagSep satisfies non-negativity, symmetry and the triangle inequality.
We use BagSep to compute the average separation between pairs of bags and the average separation within each bag. If the feature-vectors in bags are clustered together and far away from those of other bags, we expect the former to be significantly greater than the latter.
Definition 3.5 (Inter-Bag Separation for a bag). Given B and a metric d on R^n, the average inter-bag distance for a bag B ∈ B is defined as InterBagSep(B, d) := (1/(|B|−1)) Σ_{B′∈B, B′≠B} BagSep(B, B′, d).
For computing the average statistic for the entire dataset we define the following.
Definition 3.6. The mean intra-bag separation of B is defined as MeanIntraBagSep(B, d) := (1/|B|) Σ_{B∈B} BagSep(B, B, d). The mean of the average inter-bag separations is MeanInterBagSep(B, d) := (1/|B|) Σ_{B∈B} InterBagSep(B, d).
We have the following lemma, proved in Appendix A.2.
Lemma 3.7. For any bag B, (i) InterBagSep(B, d)/BagSep(B, B, d) ≥ 1/2 when d is a metric, (ii) InterBagSep(B, d)/BagSep(B, B, d) ≥ 1/4 when d is the ℓ2² distance.
The following is a straightforward corollary of Lemma 3.7.
Corollary 3.8. (i) MeanInterBagSep(B, d)/MeanIntraBagSep(B, d) ≥ 1/2 when d is a metric, (ii) MeanInterBagSep(B, d)/MeanIntraBagSep(B, d) ≥ 1/4 when d is the ℓ2² distance.
We expect this ratio to achieve values substantially less than 1 in adversarial cases; Appendix A.5 provides an example of such a case. For convenience, for B, we use InterIntraRatio to denote MeanInterBagSep(B, d)/MeanIntraBagSep(B, d) when d = ℓ2².
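The quantities above can be sketched directly from the definitions. The following is a naive pure-Python illustration (O(m²n); Appendix A.4 describes a faster computation for ℓ2²) with function names of our choosing:

```python
def sq_dist(x, y):
    """Squared Euclidean (l2^2) distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def bag_sep(B1, B2, d=sq_dist):
    """BagSep(B, B', d): average pairwise distance across the two bags."""
    return sum(d(x, y) for x in B1 for y in B2) / (len(B1) * len(B2))

def inter_intra_ratio(bags, d=sq_dist):
    """InterIntraRatio = MeanInterBagSep / MeanIntraBagSep (Defs. 3.5, 3.6)."""
    k = len(bags)
    intra = sum(bag_sep(b, b, d) for b in bags) / k
    inter = sum(
        sum(bag_sep(b1, b2, d) for b2 in bags if b2 is not b1) / (k - 1)
        for b1 in bags
    ) / k
    return inter / intra
```

On any bag collection with a nonzero mean intra-bag separation, the returned ratio obeys the ℓ2² bound of Corollary 3.8 (≥ 1/4).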

3.3. CRITEO DATASET

The Criteo CTR dataset (Criteo (2014)) has 13 numerical and 26 categorical features and a binary label. Each of the approximately 45 million rows (instances) represents an impression (online ad) and the label indicates a click. The semantics of all the features are undisclosed and the values of all the categorical features are hashed into 32 bits for anonymization. Additionally, the dataset has missing values. We use a preprocessed version of the dataset as done for the AutoInt (Song et al. (2019)) model, described and implemented in their provided code (https://github.com/DeepGraphLearning/RecommenderSystems/tree/master/featureRec). For convenience we label the numerical and categorical features (in their order of occurrence) as N1, ..., N13 and C1, ..., C26. The preprocessing applies the transformation x ↦ int(log²(x)) when x > 2 on the numerical feature values x, and we further additively scale so that their values are non-negative integers. The categorical features are encoded as non-negative integers.
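This preprocessing can be sketched as follows. The sketch is ours, not the AutoInt code: we assume the natural logarithm and assume missing values are imputed with 0, neither of which is stated in the text above.

```python
import math

def transform_numeric(x):
    """AutoInt-style transform for a Criteo numeric value:
    x -> int(log(x)**2) for x > 2, identity otherwise.
    Assumptions: natural log; missing values imputed with 0."""
    if x is None:
        return 0
    x = float(x)
    return int(math.log(x) ** 2) if x > 2 else int(x)

def shift_nonnegative(col):
    """Additively scale a column so all its values become non-negative."""
    m = min(col)
    return [v - m for v in col]
```

For instance, a raw value of 100 maps to int(ln(100)²) = 21, compressing the heavy-tailed numeric features into a small integer range.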

4. LLP DATASET: BAG CREATION

We create the LLP datasets by grouping the instances by subsets C ⊆ {C1, ..., C26} of the categorical columns, where |C| ≤ 2. For each setting of the values of C we obtain a bag of the instances with those values of C. Each such grouping yields an LLP dataset. Thus, we obtain (26 choose 2) + 26 = 351 LLP datasets, each referred to also as a grouping on C (|C| ≤ 2). Note that for any grouping, the set of bags partitions the dataset and therefore each instance occurs in exactly one bag.
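The group-by bag creation can be sketched as follows; this is a minimal pure-Python illustration (a pandas groupby would be the natural choice at the full 45M-row scale), with rows represented as dicts keyed by column name.

```python
from collections import defaultdict

def make_bags(rows, labels, group_cols):
    """Group instances by their values on `group_cols` (|group_cols| <= 2).
    Each distinct value combination yields one bag; returns a dict
    {key: (instance_indices, label_histogram)}."""
    groups = defaultdict(list)
    for i, row in enumerate(rows):
        key = tuple(row[c] for c in group_cols)
        groups[key].append(i)
    # label histogram y_B = sum of instance labels in the bag
    return {k: (idx, sum(labels[i] for i in idx)) for k, idx in groups.items()}
```

Since every instance has exactly one value per grouping column, the resulting bags partition the dataset, matching the remark above.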

4.1. CLIPPING GROUPINGS FOR BAG DISTRIBUTION ANALYSIS

As mentioned in Sec. 1.1 we clip the groupings by discarding all bags of size at most 50 or greater than 2500, as our goal is to analyze reasonable LLP datasets. We observe that some groupings are left with very few or zero bags, while others have a large number of bags. For example, the initial grouping on C9 creates only 3 bags and the grouping on C20 creates only 4 bags; hence, after clipping these groupings have no bags. The groupings on C6, C17, C22, C23 and {C9, C20} all contain fewer than 20 bags. On the other hand, the groupings on {C10, C16} and {C4, C10} each contain more than 8 × 10^6 bags. We compute the mean bag sizes of the clipped groupings. The lowest mean bag size, 62, is obtained for the clipped grouping on C23, which retains just one bag of 62 instances after clipping. Similarly, the highest mean bag size, 1292, is obtained for the clipped grouping on C17, which also retains a single bag, of 1292 instances, after clipping. Table 4 provides these statistics for a sample of the groupings; refer to Appendix A.8 for statistics of all groupings. The bag distribution analysis described in Sec. 5 is performed on the 308 clipped groupings with at least 500 bags remaining.
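The clipping step can be sketched as below. This is our illustration, not the paper's code; it follows the ≤ 50 / > 2500 removal convention of Sec. 1.1, and the retention helper corresponds to the instance-retention fraction used by the 30% filter of Sec. 4.2.

```python
def clip_bags(bags, min_size=50, max_size=2500):
    """Keep only bags with min_size < |B| <= max_size.
    `bags` maps a grouping key to the list of instance indices in that bag."""
    return {k: idx for k, idx in bags.items()
            if min_size < len(idx) <= max_size}

def retention(bags, clipped):
    """Fraction of instances surviving the clipping."""
    before = sum(len(v) for v in bags.values())
    after = sum(len(v) for v in clipped.values())
    return after / before if before else 0.0
```

A grouping such as C23 above, with a single surviving bag, shows how aggressively the clipping can shrink a grouping.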

4.2. FILTERING GROUPINGS FOR MODEL TRAINING

We apply the following filters on the clipped groupings to choose groupings for model training.
Label Information Loss Filter. If the number of remaining bags is less than 10000, we discard the grouping, to ensure a sufficient number of training bags. After applying this filter, we are left with 240 groupings.
Instance Information Loss Filter. We drop a grouping if it is left with less than 30% of the original number of instances (≈ 13.75 × 10^6 instances). After applying this filter, we are left with 52 groupings. All the groupings on single columns are filtered out, as the maximum percentage of instances any of them retains is 21.68% (C4).
We finally obtain a set of 52 groupings which satisfy both of the conditions listed above, all of which are emboldened in Table 10 in the Appendix.

5. BAG DISTRIBUTION ANALYSIS

We perform the bag distribution analysis for all 308 clipped groupings which contain more than 500 bags.

5.1. CHARACTERISING THE DISTRIBUTION OF BAG SIZES

Since we only have a label proportion for each bag, informally speaking, the larger the bag size the lower the amount of label supervision for that bag. The bag sizes for any grouping are characterized by their cumulative distribution function, which plots the fraction of bags of size at most t for all t ≥ 1. In all the groupings, the density of bags drops steeply with increasing bag size in the histograms of bag sizes. Thus, we compute the bag sizes at the 50, 70, 85 and 95 percentiles of the cumulative distribution, for each grouping. Hence, we can naturally classify the groupings into long-tailed and short-tailed distributions. Short-tailed distributions have mostly bags of small size and very few large bags, whereas long-tailed distributions contain many bags of large size. Bags of large size provide very little label information relative to the amount of feature-level information; hence, they can be used for learning representations but are less useful in supervised training. In order to classify the groupings into long-tailed and short-tailed, we compute the threshold bag sizes at which we attain the 50, 70, 85 and 95 percentiles of the bags for each clipped grouping. We normalize these values to obtain a 4-dimensional vector for each clipped grouping. Applying k-Means on these vectors, we cluster the clipped groupings into 4 clusters. As shown in Table 1, the mean t-percentile bag sizes give the same cluster ordering for t = 50, 70, 85, 95. Hence, we name the clusters, in increasing order of these mean bag sizes, as Very Short-tailed, Short-tailed, Long-tailed and Very Long-tailed bag size distributions. Appendix A.9 contains the threshold bag sizes and the cluster labels based on them for all groupings.
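The percentile-threshold features fed to k-Means can be sketched as follows; this is our illustration, and the nearest-rank percentile convention and max-normalization are assumptions (the paper does not specify either).

```python
import math

def percentile_thresholds(bag_sizes, percents=(50, 70, 85, 95)):
    """Threshold bag sizes such that t% of the bags have at most that size,
    using the nearest-rank percentile (an assumption)."""
    s = sorted(bag_sizes)
    out = []
    for t in percents:
        rank = max(0, math.ceil(t / 100 * len(s)) - 1)
        out.append(s[rank])
    return out

def normalize(vec):
    """Scale a threshold vector to unit max before clustering (assumption)."""
    m = max(vec)
    return [v / m for v in vec]
```

The resulting 4-dimensional vectors, one per clipped grouping, are then clustered (e.g. with scikit-learn's KMeans, k = 4) to produce the Very Short-tailed through Very Long-tailed labels.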

5.2. CHARACTERISING THE DISTRIBUTION OF LABEL HISTOGRAMS

We model the distribution of label histograms in a grouping as a binomial distribution, with bias equal to the label proportion of the grouping, and compute for each grouping its AvgBagLL value.
Similar to the MLP models, AutoInt has a 16-dimensional trainable embedding layer for each feature. The output layer has one unit with sigmoid activation. We perform minibatch gradient descent by sampling minibatches of 8 bags each, and the model predictions are aggregated into the predicted label proportions of the bags. The minibatch loss is the sum over the bags of either the bag-level L_mse or L_bce, as described in Sec. 3.1. We then backpropagate this loss and update the weights using the optimizer, either Adam or SGD. The specifications of the experiments are in Table 7. Using Adam, we train for 50 epochs with a learning rate of 1e-5 for the initial 15 epochs and 1e-6 for the rest. Using SGD, we train for 300 epochs with a constant learning rate of 1e-5. We use test AUC scores to assess the tractability of an LLP dataset. For the MLPs trained using SGD and Adam we use the maximum reported AUC score. On the other hand, AutoInt has an increasing trend for both optimizers but it is not (even locally) monotonic, hence for it we use the average over the last 5 epochs of training. The AUC scores (averaged over the 5 splits) for trainable groupings in LLP-Bench and the various experiments (see in conjunction with Table 7) are listed in Table 5 and Table 6. AUC scores for all trainable groupings are listed in Table 8. Appendix A.7 contains details of instance-level training which we perform for completeness. We compute the Pearson correlation between the AUC scores and the bag level statistics computed in Sec. 5. These are visualised in Fig. 1. Some observations from these scores are:
1. Positive correlation with the number of bags and the number of instances. This is as expected, since each bag adds to the label information and each instance adds to the feature information available to the classifier.
2.
Negative correlation with the mean bag size and the percentile bag size thresholds. This is intuitively consistent with larger bags having less label supervision (for the same label proportion) than smaller bags, and typically the model performance degrades when the groupings have larger bags.
3. Negative correlation with the label proportion log likelihood. A lower log likelihood indicates that the label proportions in the dataset are highly skewed. This means that those groupings where the positive labels are concentrated in fewer bags have better training performance; in this case, the bag grouping features provide significant supervision which the model can leverage. We can infer the same from the highly positive correlation with the standard deviation of the label proportions.
4. Positive correlation with the InterIntraRatio. A higher ratio indicates a good separation between the bags in input space; hence, the positive correlation indicates that models perform better when bags are separable in input space. This can be explained as follows. The distribution of label proportions is skewed, as the maximum log-likelihood exhibited is −3.26; hence, substantial label information is present at the bag level. If the InterIntraRatio is high, much of this discriminative bag-level information lies in the input space itself; if the InterIntraRatio is low, most of it is in some latent space that the model needs to learn.
We also adopt an evaluation methodology using which we train various models on an appropriately filtered subset of groupings and demonstrate (as well as explain) the correlation of the model performance with the computed statistics. We believe our work addresses to a great extent the current lack of natural LLP benchmarks, and provides LLP-Bench using which LLP techniques can be systematically evaluated.
A.1 PROOF OF LEMMA 3.4
Proof. From Def. 3.3, the non-negativity and symmetry properties are obvious.
Triangle inequality: let B1, B2, B3 ∈ B, and for convenience write B1 = {x_i | i ∈ [n]}, B2 = {y_j | j ∈ [m]}, B3 = {z_k | k ∈ [l]}. As d is a metric, for all i ∈ [n], j ∈ [m] and k ∈ [l], d(x_i, z_k) ≤ d(x_i, y_j) + d(y_j, z_k). Averaging over j:
d(x_i, z_k) ≤ (1/m) Σ_{j=1}^{m} d(x_i, y_j) + (1/m) Σ_{j=1}^{m} d(y_j, z_k).
Averaging over i:
(1/n) Σ_{i=1}^{n} d(x_i, z_k) ≤ (1/nm) Σ_{i=1}^{n} Σ_{j=1}^{m} d(x_i, y_j) + (1/m) Σ_{j=1}^{m} d(y_j, z_k).
Averaging over k:
(1/ln) Σ_{k=1}^{l} Σ_{i=1}^{n} d(x_i, z_k) ≤ (1/nm) Σ_{i=1}^{n} Σ_{j=1}^{m} d(x_i, y_j) + (1/ml) Σ_{k=1}^{l} Σ_{j=1}^{m} d(y_j, z_k),
i.e., BagSep(B1, B3, d) ≤ BagSep(B1, B2, d) + BagSep(B2, B3, d).
A.2 PROOF OF LEMMA 3.7
Proof. Let B ∈ B. Using the triangle inequality and symmetry from Lemma 3.4:
∀B′ ∈ B, BagSep(B, B, d) ≤ BagSep(B, B′, d) + BagSep(B′, B, d) = 2 BagSep(B′, B, d).
Averaging over the B′ ≠ B:
BagSep(B, B, d) ≤ (2/(|B|−1)) Σ_{B′∈B, B′≠B} BagSep(B′, B, d) = 2 InterBagSep(B, d),
i.e., InterBagSep(B, d)/BagSep(B, B, d) ≥ 1/2.
The squared Euclidean distance is not a metric, as it satisfies all the metric properties other than the triangle inequality. Hence, we show the following.
Lemma A.1. For any a, b ∈ R^n, (1/2)||a + b||²₂ ≤ ||a||²₂ + ||b||²₂.
Theorem A.2. Given X, Y and B, for any B1, B2, B3 ∈ B,
(1/2) BagSep(B1, B3, ℓ2²) ≤ BagSep(B1, B2, ℓ2²) + BagSep(B2, B3, ℓ2²).
Proof. Follows by replacing the triangle inequality in the proof of Lemma 3.4 with the inequality of Lemma A.1.
Corollary A.3. InterBagSep(B, ℓ2²)/BagSep(B, B, ℓ2²) ≥ 1/4.
Proof. Follows by replacing the corresponding inequality in the proof of Lemma 3.7 with that of Theorem A.2.
A.3 PROOF OF COROLLARY 3.8
Proof. Given X, Y and B, and a metric d on R^n.
Starting with the inequality in Lemma 3.7: ∀B ∈ B, BagSep(B, B, d) ≤ 2 InterBagSep(B, d). Summing over B ∈ B:
Σ_{B∈B} BagSep(B, B, d) ≤ 2 Σ_{B∈B} InterBagSep(B, d)
⇒ (1/|B|) Σ_{B∈B} BagSep(B, B, d) ≤ 2 (1/|B|) Σ_{B∈B} InterBagSep(B, d)
⇒ MeanInterBagSep(B, d)/MeanIntraBagSep(B, d) ≥ 1/2.
Starting instead with the ℓ2² inequality of Corollary A.3, we get MeanInterBagSep(B, ℓ2²)/MeanIntraBagSep(B, ℓ2²) ≥ 1/4.
A.4 BAG DISTANCE RESULTS USING SQUARED EUCLIDEAN DISTANCE
We use the squared Euclidean distance to compute the bag distances as it makes the computation faster. Algorithm 1 computes the Bag Separation for a general distance d.
Algorithm 1: Compute Bag Separation of a dataset
  Data: Set of bags B, distance d on R^n
  Result: BagSepMatrix(B, d)
  BagSepMatrix ← [0]^{|B|×|B|}
  for B1 ∈ B do
    for B2 ∈ B do
      for i ∈ B1 do
        for j ∈ B2 do
          BagSepMatrix[B1, B2] ← BagSepMatrix[B1, B2] + d(x^(i), x^(j))
        end
      end
      BagSepMatrix[B1, B2] ← BagSepMatrix[B1, B2]/(|B1||B2|)
    end
  end
Theorem A.4. Assuming the bags to be disjoint, the running time of Algorithm 1 is O(m²n), where m is the number of examples and n is the dimension of the input space.
Proof. Runtime = Σ_{B1∈B} Σ_{B2∈B} |B1||B2| n = (Σ_{B1∈B} |B1|)(Σ_{B2∈B} |B2|) n = m²n.
Now, this computation can be simplified due to the following lemma.
Lemma A.5. For any B, B′ ∈ B, BagSep(B, B′, ℓ2²) = AvgSqNorm(B) + AvgSqNorm(B′) − 2⟨Mean(B), Mean(B′)⟩.
Proof. Let B = {x_i | i ∈ [n]}, B′ = {y_j | j ∈ [m]}. Then
BagSep(B, B′, ℓ2²) = (1/mn) Σ_{i=1}^{n} Σ_{j=1}^{m} ||x_i − y_j||²₂
= (1/n) Σ_{i=1}^{n} ||x_i||²₂ + (1/m) Σ_{j=1}^{m} ||y_j||²₂ − (2/mn) Σ_{i=1}^{n} Σ_{j=1}^{m} ⟨x_i, y_j⟩
= (1/n) Σ_{i=1}^{n} ||x_i||²₂ + (1/m) Σ_{j=1}^{m} ||y_j||²₂ − (2/mn) ⟨Σ_{i=1}^{n} x_i, Σ_{j=1}^{m} y_j⟩.
Algorithm 2 uses this to compute the Bag Separation for the squared Euclidean distance.
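The mean-trick of Lemma A.5 can be sketched in pure Python as follows (an illustration with our function names; in practice the per-bag sums would be vectorized, e.g. with NumPy):

```python
def sq_norm(x):
    return sum(v * v for v in x)

def bag_sep_matrix_fast(bags):
    """BagSep matrix under l2^2 via Lemma A.5:
    BagSep(B, B') = AvgSqNorm(B) + AvgSqNorm(B') - 2 <Mean(B), Mean(B')>.
    One pass over the data, then |B|^2 dot products of bag means:
    O(mn + |B|^2 n) versus the naive O(m^2 n)."""
    n = len(bags[0][0])
    avg_sq = [sum(sq_norm(x) for x in b) / len(b) for b in bags]
    means = [[sum(x[t] for x in b) / len(b) for t in range(n)] for b in bags]
    k = len(bags)
    return [
        [avg_sq[i] + avg_sq[j]
         - 2 * sum(u * v for u, v in zip(means[i], means[j]))
         for j in range(k)]
        for i in range(k)
    ]
```

On the diagonal this yields the (nonzero) intra-bag separations, matching Definition 3.3 with B = B′.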
Algorithm 2: Compute Bag Separation with squared Euclidean distance
  Data: Set of bags B
  Result: BagSepMatrix(B, ℓ2²)
  BagSepMatrix ← [0]^{|B|×|B|}; AvgSqNorm ← [0]^{|B|}; BagMeans ← [0]^{|B|×n}
  for B ∈ B do
    for i ∈ B do
      AvgSqNorm(B) ← AvgSqNorm(B) + ||x^(i)||²₂
      BagMeans(B) ← BagMeans(B) + x^(i)
    end
    AvgSqNorm(B) ← AvgSqNorm(B)/|B|
    BagMeans(B) ← BagMeans(B)/|B|
  end
  for B1 ∈ B do
    for B2 ∈ B do
      BagSepMatrix[B1, B2] ← AvgSqNorm[B1] + AvgSqNorm[B2] − 2⟨BagMeans[B1], BagMeans[B2]⟩
    end
  end
Theorem A.6. Assuming the bags to be disjoint, the running time of Algorithm 2 is O(mn + |B|²n + |B|²), where m is the number of examples and n is the dimension of the input space.
Proof. Runtime = Σ_{B∈B} |B|n + Σ_{B1∈B} Σ_{B2∈B} (1 + n) = mn + |B|²n + |B|².
A.5 ADVERSARIAL EXAMPLE OF BAGS WITH RATIO OF MEAN INTER TO INTRA BAG SEPARATION AS 1/2
Consider X = {x^(1), x^(2), x^(3)} lying on a straight line, with distances:
• d(x^(1), x^(2)) = d1
• d(x^(2), x^(3)) = d2
• d(x^(1), x^(3)) = d1 + d2
We have two bags, B1 = {x^(1), x^(3)} and B2 = {x^(2)}. The intra-bag separations are:
• BagSep(B1, B1, d) = (1/2²)(d(x^(1), x^(1)) + d(x^(1), x^(3)) + d(x^(3), x^(1)) + d(x^(3), x^(3))) = (1/2)(d1 + d2)
• BagSep(B2, B2, d) = 0
Hence, MeanIntraBagSep(B, d) = (1/4)(d1 + d2). The separation between the bags is:
• BagSep(B1, B2, d) = (1/(1×2))(d(x^(1), x^(2)) + d(x^(3), x^(2))) = (1/2)(d1 + d2)
• InterBagSep(B1, d) = (1/(2−1)) BagSep(B1, B2, d) = (1/2)(d1 + d2)
• InterBagSep(B2, d) = (1/(2−1)) BagSep(B2, B1, d) = (1/2)(d1 + d2)
Hence, MeanInterBagSep(B, d) = (1/2)(d1 + d2), and MeanInterBagSep(B, d)/MeanIntraBagSep(B, d) = 1/2.
A.6 LLP MODEL TRAINING RESULTS
This table contains the AUC scores of all the experiments in Table 7. Each value represents the mean AUC score in percentage across the 5 splits, created as described in Sec. 6.1.
The error is the standard deviation of the AUC scores across these 5 splits. We perform all the experiments mentioned in Table 7 on instance-level data. The process remains similar for the different configurations, as described in Sec. 6.1. We perform a train-test split of 80:20 on the dataset. We then train using instance-level mini-batch gradient descent for the same number of epochs, using the same optimizer, model, learning rate schedule and the instance-level variant of the loss function. We again report the AUC scores as described in Sec. 6.1. This table contains the threshold bag sizes such that t% of the bags have at most that size, for t = 50, 70, 85, 95, for all 349 groupings (removing groupings which were left with no bags after clipping). We perform k-Means on the 308 of these groupings that have more than 500 bags after clipping. The cluster assigned to each grouping is also listed.

GROUPINGS

This table contains the Average Log Likelihood of the Label Proportion distributions for all 349 groupings (removing groupings which were left with no bags after clipping). We perform k-Means on the 308 of these groupings that have more than 500 bags after clipping. The cluster assigned to each grouping is also listed.



The url is https://github.com/DeepGraphLearning/RecommenderSystems/tree/master/featureRec.
Note that for model training purposes such bags may be created from only the train set portion of the entire dataset.
80% of the instances are used for training and the rest for validation.
The 1-layer MLP has 64 hidden units; the 2-layer MLP has 128 and 64 units in successive layers, with tanh activation.
Embedding layer for categorical features; numerical features are multiplied by a trainable 16-dim vector.



Definition 3.2 (LLP Dataset). A learning from label proportions (LLP) dataset corresponding to a collection of bags B := {B_j}_{j=1}^{N} is given by {(X_B, y_B) | B ∈ B}. The label bias of the training dataset is µ(B, Y) := (Σ_{B∈B} y_B) / (Σ_{B∈B} |B|). Since the LLP training dataset lacks instance-level labels, we use the dataset label bias to model the label histogram of a bag B as the binomial distribution B(|B|, p), where p = µ(B, Y).

Definition 3.3 (Bag Separation). For a distance d on R^n and a collection of bags B = {B_1, ..., B_M}, the corresponding separation function is defined as BagSep(B, B′, d) := (1/(|B||B′|)) Σ_{x∈B} Σ_{x′∈B′} d(x, x′). We define the M × M matrix BagSepMatrix(B, d) whose (i, j)-th element is given by BagSep(B_i, B_j, d).

Figure 1: Correlation Heatmap

Further, most of the above works use limited scale data, typically small to medium scale UCI datasets (Yu et al. (2013); Patrini et al. (2014); Scott & Zhang (2020)), image datasets (Liu et al. (2019)), social media data (Ardehaly & Culotta (2017)) etc. In general, there have been very few natural, real-world LLP datasets used for evaluations in previous works. As mentioned earlier, Saket et al. (2022) experimented with the MovieLens-20m and Criteo datasets. They used a temporal aggregation into bags for MovieLens-20m, while on the Criteo dataset they only used one pair of categorical features to create a collection of bags for experimentation.

Mean bag sizes at which groupings achieve 50, 70, 85 and 95 percentile in each cluster

Clustering on AvgBagLL.

Clustering on InterIntraRatio.



LLP-Bench Groupings. Bold : Analyzed for model training.

AUC Scores of MLP classifiers obtained on LLP-Bench Groupings

AUC Scores of AutoInt obtained on LLP-Bench Groupings

Experiment Legend

Stefan Rüping. SVM classifier estimation from group probabilities. In Johannes Fürnkranz and Thorsten Joachims (eds.), ICML, pp. 911-918, 2010.
Rishi Saket. Learnability of linear thresholds from label proportions. In NeurIPS, pp. 6555-6566, 2021.
Rishi Saket, Aravindan Raghuveer, and Balaraman Ravindran. On combining bags to better learn from label proportions. In AISTATS, volume 151 of Proceedings of Machine Learning Research, pp. 5913-5927. PMLR, 2022. URL https://proceedings.mlr.press/v151/saket22a.html.
Clayton Scott and Jianxin Zhang. Learning from label proportions: A mutual contamination framework. In NeurIPS, 2020.
Weiping Song, Chence Shi, Zhiping Xiao, Zhijian Duan, Yewen Xu, Ming Zhang, and Jian Tang. Autoint: Automatic feature interaction learning via self-attentive neural networks. In CIKM, 2019.

AUC Scores obtained after different training configurations for all groupings

AUC scores obtained by instance level training on Criteo. The bag level statistics for all 349 groupings are as follows. The groupings which are emboldened pass our filters and are used for training.

Bag Level Statistics of all the Groupings (Emboldened : Used for Training)

Threshold bag size values below which 50, 70, 85, 95% of bags present and clustering based on this distribution

Average Log Likelihood of the Label Distribution of all groupings and their clusters


This table contains MeanInterBagSep, MeanIntraBagSep and their ratio for all 349 clipped groupings (removing groupings which were left with no bags after clipping). We perform k-Means on the 308 of these groupings that have more than 500 bags after clipping. The cluster assigned to each grouping is also listed.

