A BENCHMARK DATASET FOR LEARNING FROM LABEL PROPORTIONS

Anonymous authors
Paper under double-blind review

Abstract

Learning from label proportions (LLP) has recently emerged as an important technique for weakly supervised learning on aggregated labels. In LLP, a model is trained on groups (a.k.a. bags) of feature-vectors and their corresponding label proportions to predict labels for individual feature-vectors. While previous works have developed a variety of techniques for LLP, including novel loss functions, model architectures and their optimization, they typically evaluated their methods on pseudo-synthetically generated LLP training data, created from common small-scale supervised learning datasets by randomly sampling or partitioning their instances into bags. Despite growing interest in this important task, there are no large-scale open-source LLP benchmarks to compare the various approaches. Construction of such a benchmark is hindered by two challenges: a) the lack of natural large-scale LLP-like data, and b) the large number of mostly artificial methods of forming bags from instance-level datasets. In this paper we propose LLP-Bench: a large-scale LLP benchmark constructed from the Criteo Kaggle CTR dataset. We conduct an in-depth, systematic study of the Criteo dataset and propose a methodology to create a benchmark as a collection of diverse and large-scale LLP datasets. We choose the Criteo dataset since it admits multiple natural collections of bags formed by grouping on subsets of its 26 categorical features. We analyze all bag collections obtained through grouping by one or two categorical features, in terms of their bag-level statistics as well as embedding-based distance metrics quantifying the geometric separation of bags. We then propose to include in LLP-Bench a few groupings that fairly represent real-world bag distributions. We also measure the performance of state-of-the-art models, loss functions (adapted to LLP) and optimizers on LLP-Bench. We perform a series of ablations and explain the performance of the various techniques on LLP-Bench.
To the best of our knowledge, LLP-Bench is the first open-source benchmark for the LLP task. We hope that the proposed benchmark and evaluation methodology will be used by ML researchers and practitioners to better understand the LLP task and hence devise state-of-the-art LLP algorithms.

1. INTRODUCTION

In traditional supervised learning, training data consists of feature-vectors (instances) along with their labels. A model trained using such data is then used during inference to predict the labels of test instances. In recent times, primarily due to privacy concerns and the relative rarity of high-quality supervised data, the weakly supervised framework of learning from label proportions (LLP) has gained importance (Scott & Zhang (2020); Saket et al. (2022); O'Brien et al. (2022)). In LLP, the training data is available as a collection of subsets or bags of instances along with the label proportion for each bag. The goal is to learn a classification model for predicting the class-labels of individual instances (de Freitas & Kück (2005); Musicant et al. (2007)). Clearly, supervised learning is the special case of LLP in which all bags are unit-sized. Unlike supervised learning however, for which a multitude of task-specific real-world datasets are easily available, the same is not true for LLP. While previous works have developed and explored a variety of algorithmic, optimization and deep-neural-net based techniques for LLP (see Sec. 2 for more details), all of them experimentally evaluate their methods on pseudo-synthetic LLP datasets consisting of instances of some supervised learning dataset randomly sampled or partitioned into different bags. Further, most of the above works use limited-scale data, typically small UCI (Dua & Graff (2017)), image and social media datasets. An exception to the above is the work of Saket et al. (2022), which also uses the Criteo Kaggle CTR (Criteo (2014)) and MovieLens-20m (Movielens-20M; Harper & Konstan (2016)) datasets, which are fairly large in scale: roughly 45 million and 20 million instances respectively. In particular, the Criteo dataset has 13 numerical and 26 categorical features whose semantics are undisclosed.
Each row is an impression, and a {0, 1}-valued label indicates a click; in total the dataset spans 7 days of impression-click data. The categorical features can be used to create many different bag collections depending on the subset of features used for grouping, where each choice of the subset's values yields a bag of instances having those feature values. These groupings simulate typical aggregation scenarios in real-world use-cases; however, Saket et al. (2022) experimented in only a limited manner with one grouping. In contrast to the above state of affairs, a large number of publicly accessible, real-world and large-scale supervised-learning datasets have been studied over the years, whereas there are hardly any datasets curated specifically for LLP.
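The grouping-by-feature-values construction described above can be sketched as follows. This is a minimal illustration, not the benchmark's actual pipeline: the column names (`C1`, `label`) and the toy data are assumptions standing in for a Criteo-style categorical feature and click label.

```python
# Sketch: forming LLP bags from an instance-level CTR dataset by grouping
# on a categorical feature. Column names and data are illustrative, not
# the actual Criteo schema.
import pandas as pd

def make_bags(df, group_cols):
    """Group instances into bags keyed by categorical feature values.

    Each bag keeps only its size and the proportion of positive labels,
    which is the only label supervision available in LLP.
    """
    bags = df.groupby(group_cols)["label"].agg(bag_size="size", label_prop="mean")
    return bags.reset_index()

df = pd.DataFrame({
    "C1": ["a", "a", "b", "b", "b", "c"],   # a hypothetical categorical feature
    "label": [1, 0, 1, 1, 0, 0],            # click / no-click
})
bags = make_bags(df, ["C1"])
# Bag "a": size 2, label proportion 0.5; bag "b": size 3, proportion 2/3.
```

Grouping by a set of two features works identically by passing both column names, so each distinct pair of feature values defines one bag.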

1.1. OUR CONTRIBUTIONS

In this work we address the unavailability of a large-scale benchmark and standardized evaluation methodologies for LLP. We make the following contributions towards creating an LLP benchmark built on top of the publicly available Criteo Kaggle CTR dataset.

Bag collections using group-by feature-sets. Typically, for privacy preservation in CTR applications, impressions are grouped into bags according to the values of features such as advertiser-id, product-id, date etc. Thus, we can simulate such aggregations on the Criteo dataset using any subset of the categorical feature-set. However, we observe that choosing more than three categorical features likely results in small-sized bags, which would be contrary to the goal of large-scale LLP datasets. Therefore, our exploration limits the groupings to those obtained using at most two of the categorical features. Below we present the different aspects of our exploration of these groupings. We use a standard preprocessing previously used for training the AutoInt model (Song et al. (2019)) on the Criteo dataset at the instance level. More details can be found in Section 4.

Analysis, categorization, and filtering of groupings. There are 26 categorical features, leading to 26 + (26 choose 2) = 26 + 325 = 351 possible groupings using at most two categorical features. Our goal is to curate LLP datasets with bags that are neither too small nor very large (as the latter have very weak label supervision), and we always remove bags of size ≤ 50 and those of size > 2500 from these groupings, similar to the work of Saket et al. (2022). Post this removal, we identify as outliers those groupings which have at most 500 bags. The remaining 308 groupings are further analyzed in terms of their bag-size and label-proportion distributions. For each grouping, we calculate the threshold bag sizes such that t% of the bags have at most that size, for t = 50, 70, 85, 95.
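The size-based filtering and per-grouping bag-size profiling described above can be sketched as follows. The thresholds (50, 2500, 500 bags, and the percentiles t = 50, 70, 85, 95) are taken from the text; the function name and the synthetic bag-size sample are assumptions for illustration only.

```python
# Sketch of the grouping filter and bag-size profiling: drop bags of size
# <= 50 or > 2500, discard groupings left with at most 500 bags, and
# summarize each surviving grouping by four bag-size percentiles.
import numpy as np

MIN_BAG_SIZE, MAX_BAG_SIZE, MIN_NUM_BAGS = 50, 2500, 500
PERCENTILES = (50, 70, 85, 95)

def profile_grouping(bag_sizes):
    """Return the bag-size percentile vector for a grouping, or None if
    the grouping is an outlier (too few bags after size-based removal)."""
    sizes = np.asarray(bag_sizes)
    kept = sizes[(sizes > MIN_BAG_SIZE) & (sizes <= MAX_BAG_SIZE)]
    if len(kept) <= MIN_NUM_BAGS:
        return None  # outlier grouping: at most 500 usable bags
    # Threshold bag sizes such that t% of bags have at most that size.
    return np.percentile(kept, PERCENTILES)

rng = np.random.default_rng(0)
sizes = rng.integers(1, 4000, size=5000)  # synthetic bag sizes for one grouping
profile = profile_grouping(sizes)          # 4-dim descriptor of the grouping
```

In the paper's pipeline, these four-dimensional descriptors are the per-grouping vectors that are subsequently normalized and clustered.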
Using normalized vectors of these four values, we apply k-Means clustering to partition the groupings into four subsets typified by increasing bag sizes. More details of these clusters can be found in Sec. 5.1. Further, modeling the labels as i.i.d. Bernoulli with bias given by the average label of the dataset, we compute for each grouping the average of the log-likelihoods of the bag label proportions. Using this, we also cluster the set of groupings into four subsets indicating how far from random their label proportions are. Analysis of this characterization can be found in Section 5.2. In the above removal of bags, a substantial fraction of the original dataset is also removed, since there is an abundance of small bags for most groupings. For subsequent analysis involving model training, we further filter out those groupings which retain less than 30% of the instances. This ensures that we only have large-scale LLP bag collections; 52 groupings satisfy the retention condition. Details of removals by these filters can be found in Section 4.2. It turns out that these groupings have a similar number and average size of bags. We then proceed to estimate the geometric clustering of bags by computing the average inter-bag and intra-bag distances for these groupings. For this we use natural definitions of these notions based on the

