A BENCHMARK DATASET FOR LEARNING FROM LABEL PROPORTIONS
Anonymous authors
Paper under double-blind review

Abstract

Learning from label proportions (LLP) has recently emerged as an important technique for weakly supervised learning on aggregated labels. In LLP, a model is trained on groups (a.k.a. bags) of feature-vectors and their corresponding label proportions to predict labels for individual feature-vectors. While previous works have developed a variety of techniques for LLP, including novel loss functions, model architectures and their optimization, they typically evaluate their methods on pseudo-synthetic LLP training data derived from common small-scale supervised learning datasets by randomly sampling or partitioning their instances into bags. Despite growing interest in this important task, there is no large-scale open-source LLP benchmark on which to compare approaches. Construction of such a benchmark is hindered by two challenges: (a) the lack of natural large-scale LLP-like data, and (b) the large number of mostly artificial methods of forming bags from instance-level datasets. In this paper we propose LLP-Bench, a large-scale LLP benchmark constructed from the Criteo Kaggle CTR dataset. We conduct an in-depth, systematic study of the Criteo dataset and propose a methodology for creating the benchmark as a collection of diverse, large-scale LLP datasets. We choose the Criteo dataset since it admits multiple natural collections of bags formed by grouping on subsets of its 26 categorical features. We analyze all bag collections obtained by grouping on one or two categorical features, in terms of their bag-level statistics as well as embedding-based distance metrics that quantify the geometric separation of bags. We then propose a few groupings for inclusion in LLP-Bench that fairly represent real-world bag distributions. We also measure the performance of state-of-the-art models, loss functions (adapted to LLP) and optimizers on LLP-Bench, and perform a series of ablations to explain the performance of the various techniques.
To the best of our knowledge, LLP-Bench is the first open-source benchmark for the LLP task. We hope that the proposed benchmark and evaluation methodology will be used by ML researchers and practitioners to better understand, and hence devise, state-of-the-art LLP algorithms.

1. INTRODUCTION

In traditional supervised learning, training data consists of feature-vectors (instances) along with their labels. A model trained using such data is then used during inference to predict the labels of test instances. In recent times, primarily due to privacy concerns and the relative rarity of high-quality supervised data, the weakly supervised framework of learning from label proportions (LLP) has gained importance (Scott & Zhang (2020); Saket et al. (2022); O'Brien et al. (2022)). In LLP, the training data is available as a collection of subsets or bags of instances along with the label proportion for each bag. The goal is to learn a classification model for predicting the class-labels of individual instances (de Freitas & Kück (2005); Musicant et al. (2007)). Clearly, supervised learning is the special case of LLP when all bags are unit-sized. Unlike supervised learning, however, for which a multitude of task-specific real-world datasets is easily available, the same is not true for LLP. While previous works have developed and explored a variety of algorithmic, optimization and deep-neural-net based techniques for LLP (see Sec. 2 for more de-
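The LLP training setup described above can be illustrated with a minimal sketch (using made-up toy data, not drawn from any benchmark): instances are grouped into bags by a categorical key, individual labels are discarded, and only each bag's label proportion is retained for training.

```python
from collections import defaultdict

# Hypothetical toy data: each instance is (feature_vector, group_key, label).
# The group key mimics a categorical feature used to form bags.
instances = [
    ((0.2, 1.5), "A", 1),
    ((0.9, 0.3), "A", 0),
    ((0.4, 0.8), "B", 1),
    ((0.7, 0.1), "B", 1),
    ((0.5, 0.5), "A", 0),
]

# Group instances into bags keyed by the categorical feature.
bags = defaultdict(list)
for x, key, y in instances:
    bags[key].append((x, y))

# Keep only the feature-vectors and the per-bag label proportion;
# individual instance labels are not available to the learner.
llp_data = {
    key: (
        [x for x, _ in members],                      # feature-vectors in bag
        sum(y for _, y in members) / len(members),    # label proportion
    )
    for key, members in bags.items()
}

print(llp_data["A"][1])  # bag "A" has labels {1, 0, 0} -> proportion 1/3
print(llp_data["B"][1])  # bag "B" has labels {1, 1}    -> proportion 1.0
```

A model is then trained so that its average predicted label over each bag matches the bag's label proportion, while inference remains at the instance level.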

