CALIBRATION MATTERS: TACKLING MAXIMIZATION BIAS IN LARGE-SCALE ADVERTISING RECOMMENDATION SYSTEMS

Abstract

Calibration is defined as the ratio of the average predicted click rate to the true click rate. Optimizing calibration is essential to many online advertising recommendation systems because it directly affects the downstream bids in ads auctions and the amount of money charged to advertisers. Despite its importance, calibration often suffers from a problem called "maximization bias": the maximum of predicted values overestimates the true maximum. The problem arises because calibration is computed on the set of ads selected by the prediction model itself. It persists even if unbiased predictions are achieved on every datapoint, and worsens when covariate shifts exist between the training and test sets. To mitigate this problem, we quantify maximization bias and propose a variance-adjusting debiasing (VAD) meta-algorithm in this paper. The algorithm is efficient, robust, and practical: it mitigates the maximization bias problem under covariate shifts without incurring additional online serving costs or compromising ranking performance. We demonstrate the effectiveness of the proposed algorithm using a state-of-the-art recommendation neural network model on a large-scale real-world dataset.

1. INTRODUCTION

The online advertising industry has grown exponentially in the past few decades. According to Statista (2022), the global internet advertising market was worth USD 566 billion in 2020 and is expected to reach USD 700 billion by 2025. In the online advertising industry, to help advertisers reach their target customers, demand-side platforms (DSPs) bid for available ad slots in an ad exchange. A DSP serves many advertisers simultaneously, and the ads provided by those advertisers form the DSP's candidate pool. From the DSP's perspective, the advertising campaign pipeline executes as follows: (1) The DSP uses data to build machine learning (ML) models for advertisement value estimation, where an advertisement's value is often measured by its click-through rate (CTR) or conversion rate. (2) When the ad exchange sends requests, in the form of online bidding auctions for specific ad slots, to the DSP, the DSP uses the ML models to predict values for the ads in its candidate pool. (3) To answer the bidding requests, the DSP must choose the most suitable ads from its candidate pool; based on the estimated values, it selects the ad candidates with the highest values and submits corresponding bids to the auctions in the ad exchange. (4) For each auction, the ad with the highest bid wins and is displayed (i.e., recommended) in the corresponding ad slot; the ad exchange charges the winning DSP an amount determined by the submitted bid and the auction mechanism. For the ML models in Step (2), besides learning the ranking (i.e., which ads are sent to the ad exchange), DSPs also need to accurately estimate the values of the chosen ads, because in Step (3) they bid based on the estimates obtained in Step (2). DSPs thus try to avoid both underbidding and overbidding, the latter of which may result in over-charging advertisers.
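Steps (2)-(3) of the pipeline above can be sketched in a few lines; the candidate pool, the CTR estimates, and the linear bid rule below are hypothetical placeholders for illustration, not the paper's setup:

```python
# Hypothetical sketch of Steps (2)-(3): score every candidate ad,
# then bid on the one with the highest estimated value.
candidate_pool = {"ad_a": 0.031, "ad_b": 0.058, "ad_c": 0.012}  # ad id -> estimated CTR

def choose_and_bid(estimated_ctrs, value_per_click=1.0):
    """Select the arg-max candidate and derive a bid from its estimate."""
    best_ad = max(estimated_ctrs, key=estimated_ctrs.get)
    bid = value_per_click * estimated_ctrs[best_ad]  # an overestimated CTR inflates the bid
    return best_ad, bid

best_ad, bid = choose_and_bid(candidate_pool)
```

Because the bid scales with the estimate of the selected ad, any systematic overestimation on the selected set translates directly into overbidding.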
We measure estimation accuracy by calibration, the ratio of the average estimated value (e.g., the estimated click-through rate) to the average empirical value (e.g., whether the user clicked or not). Calibration is essential to the success of online ads bidding methods, as well-calibrated predictions are critical to the efficiency of ads auctions (He et al., 2014; McMahan et al., 2013). Calibration is also crucial in applications such as weather forecasting (Murphy & Winkler, 1977; DeGroot & Fienberg, 1983; Gneiting & Raftery, 2005), personalized medicine (Jiang et al., 2012), and natural language processing (Nguyen & O'Connor, 2015; Card & Smith, 2018). There is a rich literature on model calibration methods (Zadrozny & Elkan, 2002; 2001; Menon et al., 2012; Deng et al., 2020; Naeini et al., 2015; Kumar et al., 2019; Platt et al., 1999; Guo et al., 2017; Kull et al., 2017; 2019). These existing methods focus on correcting model bias, but they do not explicitly account for the selection procedure in Step (3) of the aforementioned recommendation pipeline. In that case, even if unbiased predictions are obtained for each ad, calibration may perform poorly on the selected set due to maximization bias. Maximization bias occurs when maximization is performed over random estimated values rather than deterministic true values. The following concrete example illustrates the difference between maximization bias and model bias.

Example 1. Assume there are two different ads with the same "true" CTR of 0.5. Consider an ML model that learns the CTR of the two ads from data independently, and suppose it predicts the CTR of each ad as 0.6 or 0.4 with equal probability. Note that each estimate is unbiased, so the model has zero model bias. After both advertisements are submitted to the auction system, the ad with the higher estimated CTR will be selected.
In this case, the system selects an ad with an estimated CTR of 0.6 with probability 75% and an ad with an estimated CTR of 0.4 with probability 25%. Therefore, the model exhibits maximization bias because it overestimates the true value of the selected ad (3/4 × 0.6 + 1/4 × 0.4 = 0.55 > 0.5). This example shows why a model can have maximization bias after selection even when it has zero model bias. Hypothetically, even if the DSP submitted all ads with their corresponding bids to the ad exchange, so that Step (3) involved no selection or maximization at all, an analogous bias, known in auction theory as the winner's curse, would still arise: in common value auctions, winners tend to overbid when they receive noisy private signals. Consequently, this calibration issue arises in a wider context.

What makes calibration even harder is the covariate shift between training and test data (Shen et al., 2021; Wang et al., 2021). The training data consists only of previously winning and displayed ads, but during testing, DSPs must select from a much larger candidate set. The test set therefore contains many ads that are underrepresented in the training set, since ads of those types have never been recommended before. Such covariate shifts invalidate the aforementioned calibration methods, which reduce bias using labeled validation sets.

In this paper, we propose a practical meta-algorithm that tackles maximization bias in calibration and can be applied in tandem with other calibration methods. Our algorithm neither compromises ranking performance nor increases online serving overhead (e.g., inference cost and memory cost). Our contributions are summarized below: (1) We theoretically quantify the maximization bias in generalized linear models with Gaussian distributions.
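The arithmetic in Example 1 can be verified by direct enumeration; this minimal sketch averages the selected (maximum) prediction over the four equally likely joint outcomes:

```python
from itertools import product

true_ctr = 0.5
outcomes = [0.4, 0.6]  # each ad's CTR is predicted as 0.4 or 0.6, independently

# Average the selected (maximum) prediction over the four equally likely cases.
expected_selected = sum(max(p1, p2) for p1, p2 in product(outcomes, repeat=2)) / 4

print(expected_selected)             # 0.55 > 0.5: overestimation despite zero model bias
print(expected_selected / true_ctr)  # calibration of 1.1 on the selected set
```

Even though every individual prediction is unbiased, the selected set is miscalibrated by 10%.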
We show that the calibration error depends mainly on the variances of the predictor and the test distribution, rather than on the number of items selected. (2) We propose an efficient, robust, and practical meta-algorithm called the variance-adjusting debias (VAD) method that can be applied on top of any machine learning model and any existing calibration method. The algorithm is executed mostly offline without any additional online serving costs, and it is robust to the covariate shifts that are common in modern recommendation systems. (3) We conduct extensive numerical experiments demonstrating the effectiveness of the proposed meta-algorithm on both synthetic datasets, using a logistic regression model, and a large-scale real-world dataset, using a state-of-the-art recommendation neural network. In particular, applying VAD in tandem with other calibration methods always improves calibration performance compared with applying those methods alone.
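As a toy illustration of the dependence on predictor variance in contribution (1), the following Monte Carlo sketch (an illustrative simulation under assumed Gaussian noise, not the paper's analysis or the VAD algorithm) shows that the overestimation of the selected ad grows with the standard deviation of the prediction noise:

```python
import random

def selection_bias(n_ads, sigma, trials=20000, true_value=0.5, seed=0):
    """Average overestimation max_i(est_i) - true_value of the selected ad,
    with unbiased Gaussian prediction noise of standard deviation sigma."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        estimates = [true_value + rng.gauss(0.0, sigma) for _ in range(n_ads)]
        total += max(estimates) - true_value
    return total / trials

low = selection_bias(n_ads=10, sigma=0.05)   # small predictor variance
high = selection_bias(n_ads=10, sigma=0.20)  # large predictor variance
print(low, high)  # the selection bias scales with sigma
```

Since each estimate is unbiased, all of the measured bias is introduced by the max operation, and it scales with the noise level rather than vanishing as data on each individual ad improves the ranking.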



Code available at https://anonymous.4open.science/r/VAD.

