CALIBRATION MATTERS: TACKLING MAXIMIZATION BIAS IN LARGE-SCALE ADVERTISING RECOMMENDATION SYSTEMS

Abstract

Calibration is defined as the ratio of the average predicted click rate to the true click rate. Optimizing calibration is essential to many online advertising recommendation systems because it directly affects the downstream bids in ad auctions and the amount of money charged to advertisers. Despite its importance, calibration often suffers from a problem called "maximization bias". Maximization bias refers to the phenomenon that the maximum of predicted values overestimates the true maximum. The problem arises because calibration is computed on the set selected by the prediction model itself. It persists even if unbiased predictions are achieved on every datapoint, and it worsens when covariate shifts exist between the training and test sets. To mitigate this problem, we quantify maximization bias and propose a variance-adjusting debiasing (VAD) meta-algorithm in this paper. The algorithm is efficient, robust, and practical: it mitigates the maximization bias problem under covariate shifts without incurring additional online serving costs or compromising the ranking performance. We demonstrate the effectiveness of the proposed algorithm using a state-of-the-art recommendation neural network model on a large-scale real-world dataset.

1. INTRODUCTION

The online advertising industry has grown exponentially in the past few decades. According to Statista (2022), the total value of the global internet advertising market was worth USD 566 billion in 2020 and is expected to reach USD 700 billion by 2025. In the online advertising industry, to help advertisers reach target customers, demand-side platforms (DSPs) bid for available ad slots in an ad exchange. A DSP serves many advertisers simultaneously, and ads provided by those advertisers form the DSP's ads candidate pool. From the DSP's perspective, the advertising campaign pipeline executes as follows: (1) The DSP uses data to build machine learning (ML) models for advertisement value estimation. An advertisement's value is often measured by the click-through rate (CTR) or conversion rate. (2) When the ad exchange sends requests in the form of online bidding auctions for specific ad slots to a DSP, the DSP uses the ML models to predict values for ads in its ads candidate pool. (3) For the bidding requests, the DSP needs to choose the most suitable ads from its ads candidate pool. Therefore, based on the estimated values, the DSP chooses the ad candidates with the highest values and submits corresponding bids to the ad auctions in the ad exchange. (4) For each auction, the ad with the highest bid wins the auction and is displayed (i.e., recommended) in that ad slot. The ad exchange charges the winning DSP a certain amount of money based on the submitted bid and the auction mechanism. For the machine learning models in Step (2), besides learning the ranking (i.e., which ads are sent to the ad exchange), DSPs also need to accurately estimate the value of the chosen ads, because in Step (3), DSPs bid based on the estimated value obtained from Step (2). Thus, DSPs try to avoid underbidding or overbidding, the latter of which may result in over-charging advertisers.
We measure the estimation accuracy by calibration, which is the ratio of the average estimated value (e.g., estimated click-through rate) to the average empirical value (e.g., whether the user clicks or not). Calibration is essential to the success of online ads bidding methods, as well-calibrated predictions are critical to the efficiency of ads auctions (He et al., 2014; McMahan et al., 2013). Calibration is also crucial in applications such as weather forecasting (Murphy & Winkler, 1977; DeGroot & Fienberg, 1983; Gneiting & Raftery, 2005), personalized medicine (Jiang et al., 2012), and natural language processing (Nguyen & O'Connor, 2015; Card & Smith, 2018). There is a rich literature on model calibration methods (Zadrozny & Elkan, 2002; 2001; Menon et al., 2012; Deng et al., 2020; Naeini et al., 2015; Kumar et al., 2019; Platt et al., 1999; Guo et al., 2017; Kull et al., 2017; 2019). These existing methods focus on calibration for model bias. However, they do not explicitly consider the selection procedure in Step (3) of the aforementioned recommendation system pipeline. In this case, even if unbiased predictions are obtained for each ad, calibration may perform poorly on the selected set due to maximization bias. Maximization bias occurs when maximization is performed on random estimated values rather than deterministic true values. We provide a concrete example to illustrate the difference between maximization bias and model bias. Example 1. Assume there are two different ads with the same "true" CTR 0.5. Now consider an ML model that learns the CTR of the two ads from data independently. We assume that the ML model predicts the CTR of either ad as 0.6 or 0.4 with equal probabilities. Note that the estimation is unbiased and thus has zero model bias. After both advertisements are submitted to the auction system, the ad with the highest estimated CTR will be selected.
In this case, the probability that the system selects an ad with an estimated CTR of 0.6 is 75% and with an estimated CTR of 0.4 is 25%. Therefore, in this example, the model has maximization bias because it overestimates the true value of the selected ad (3/4 × 0.6 + 1/4 × 0.4 = 0.55 > 0.5). This example explains why there may be maximization bias after selection even if the model has zero model bias. Hypothetically, if the DSP submits all the ads with their corresponding bids to an ad exchange, the maximization bias is analogous to the so-called winner's curse, even in the absence of selection and maximization procedures during Step (3). In auction theory, the winner's curse means that in common value auctions, the winners tend to overbid if they receive noisy private signals. Consequently, this calibration issue arises in a wider context. What makes calibration even harder is the covariate shift between training and test data (Shen et al., 2021; Wang et al., 2021). The training data only consists of the previous winning and displayed ads, but during testing, DSPs need to select from a much larger ads candidate set. Therefore, the test set contains many ads that are underrepresented in the training set, since those types of ads have never been recommended before. These covariate shifts invalidate the aforementioned calibration methods that reduce bias using labeled validation sets. In this paper, we propose a practical meta-algorithm to tackle maximization bias in calibration, which can be applied in tandem with other calibration methods. Our algorithm neither compromises the ranking performance nor increases online serving overhead (e.g., inference cost and memory cost). Our contributions are summarized below: (1) We theoretically quantify the maximization bias in generalized linear models with Gaussian distributions.
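Example 1 can be verified by enumerating the four equally likely pairs of estimates (a small illustrative sketch, not from the paper's code):

```python
from itertools import product

TRUE_CTR = 0.5
# Each ad's estimate is 0.4 or 0.6 with equal probability, so each estimate
# is unbiased: E[estimate] = 0.5 = TRUE_CTR for every individual ad.
outcomes = list(product([0.4, 0.6], repeat=2))   # 4 equally likely scenarios
selected = [max(a, b) for a, b in outcomes]      # the system picks the higher estimate
avg_selected_estimate = sum(selected) / len(selected)
# avg_selected_estimate exceeds TRUE_CTR: maximization bias with zero model bias
```

Three of the four scenarios select an estimate of 0.6 and one selects 0.4, so the average selected estimate is 0.55 even though the true CTR of whichever ad is selected is always 0.5.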
We show that the calibration error mainly depends on the variances of the predictor and the test distribution rather than the number of items selected. (2) We propose an efficient, robust, and practical meta-algorithm called the variance-adjusting debiasing (VAD) method that can be applied with any machine learning method and any existing calibration method. The algorithm is executed mostly offline without any additional online serving costs. Furthermore, it is robust to the covariate shifts that are common in modern recommendation systems. (3) We conduct extensive numerical experiments to demonstrate the effectiveness of the proposed meta-algorithm on both synthetic datasets using a logistic regression model and a large-scale real-world dataset using a state-of-the-art recommendation neural network. In particular, applying VAD in tandem with other calibration methods always improves the calibration performance compared with applying those calibration methods alone.

2. RELATED WORK

There is a long line of work on calibration methods. Broadly speaking, existing methods can be classified into two groups: non-parametric and parametric methods (Kweon et al., 2021). On one hand, non-parametric methods utilize binning ideas, which include histogram binning (Zadrozny & Elkan, 2001), isotonic regression (Zadrozny & Elkan, 2002), smoothed isotonic regression (Deng et al., 2020), Bayesian binning (Naeini et al., 2015), and the scaling-binning calibrator (Kumar et al., 2019). On the other hand, parametric methods explicitly learn a parametric function mapping the model's original scores to calibrated probabilities. Example methods include Platt scaling (Platt et al., 1999), temperature scaling (Guo et al., 2017), Beta calibration (Kull et al., 2017), and Dirichlet calibration (Kull et al., 2019). We refer the readers to Kweon et al. (2021) for a comprehensive survey of calibration methods. As discussed in the Introduction, those methods are not designed to correct maximization bias. Maximization bias appears in many different domains, ranging from economics (Van den Steen, 2004; Capen et al., 1971) and decision analysis (Smith & Winkler, 2006) to statistics, which includes model selection (Varma & Simon, 2006), over-fitting (Cawley & Talbot, 2010), selection bias (Heckman, 1979), and feature selection (Ambroise & McLachlan, 2002). Maximization bias is especially well-documented in the reinforcement learning literature (see Sutton & Barto, 2018, Section 6.7). In reinforcement learning, estimating the value function is a fundamental task, where the value function is typically the maximum of the expected values of many different actions. To reduce maximization bias, double learning or cross-validation estimators (Van Hasselt, 2010; 2011; 2013; Van Hasselt et al., 2016) are used and demonstrate strong empirical performance. The basic idea is to train two separate models: one model selects while the other predicts the probability.
However, this type of method is not applicable to large-scale ads recommendation systems since it would double online serving costs, and online serving efficiency is essential to recommendation system performance. Calibration for maximization bias is also closely related to estimating the maximum mean of several random variables in operations research and machine learning. Various estimators have been proposed: Chen & Dudewicz (1976) develop a two-stage procedure that provides a confidence interval for the highest mean; Lesnevski et al. (2007) further integrate their method with screening, variance reduction, and common random numbers techniques; Liu et al. (2019) propose an upper confidence bound (UCB) approach; Chang et al. (2005) incorporate similar UCB components into Monte Carlo tree search; and D'Eramo et al. (2016) use a weighted average of the sample means, where the weights are computed by Gaussian approximations. However, those methods are either simulation-based or require access to multiple i.i.d. copies of the random variables; thus, they cannot be directly applied to supervised learning settings in recommendation systems.

3. PRELIMINARIES AND PROBLEM SETTING

Consider a supervised learning setting. For each data point, we have a high-dimensional feature $X \in \mathcal{X} \subset \mathbb{R}^d$ and a label $Y \in \mathcal{Y} \triangleq \{0, 1\}$. In this paper, we focus on the binary label $Y$ that represents whether the user clicks or not; our method can be easily extended to continuous labels. Suppose we have access to a labeled training set $(X, Y) \sim D_{\text{train}}$ and an unlabeled validation set $X \sim D_{\text{val-test},X}$, which has the same distribution as the $X$ margin of the real test set $D_{\text{test}}$. Note that there are often covariate shifts between $D_{\text{train}}$ and $D_{\text{test}}$, since the training set consists of the historically recommended items, while the test set contains all possible candidates, many of which may have never been seen by users before. However, we can reasonably assume that there is no concept drift between the training and test distributions (Assumption 3.1).

Assumption 3.1 (No concept drift). The conditional distribution of $Y$ given $X$ is the same under the training distribution and the test distribution.

The recommendation system pipeline is summarized in Figure 1. Specifically, in the first step, the predictor $f: \mathcal{X} \to [0, 1]$ is a prediction of $\mathbb{P}(Y = 1 \mid X)$; in the second step, we rank all items by $f(x)$ and pick the top-$\alpha$ (unknown a priori) proportion, where the selected set is denoted by $\hat{D}^\alpha_{\text{test}}$; in the third step, we consider two calibration metrics, the calibration error $\mathcal{E}$ (He et al., 2014) and the expected calibration error (ECE) (Naeini et al., 2015; Kweon et al., 2021), and we provide additional results for the maximum calibration error (MCE) in the Appendix. The calibration error $\mathcal{E}$ and ECE are defined by

$$\mathcal{E} \triangleq \frac{\sum_{i \in \hat{D}^\alpha_{\text{test}}} f(x_i)}{\sum_{i \in \hat{D}^\alpha_{\text{test}}} y_i} - 1, \qquad \mathrm{ECE} \triangleq \sum_{m=1}^{M} \frac{|B_m|}{N} \left| \frac{\sum_{k \in B_m} y_k}{|B_m|} - \frac{\sum_{k \in B_m} f(x_k)}{|B_m|} \right|, \quad (1)$$

where we partition all items in $\hat{D}^\alpha_{\text{test}}$ into $M$ equi-spaced bins by their predicted values, $B_m$ is the $m$-th bin, and $N = |\hat{D}^\alpha_{\text{test}}|$ is the number of samples.

Figure 1: Flowchart visualization of the procedure.
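The two metrics in Equation (1) can be sketched in a few lines of NumPy (function names are ours; `preds` and `labels` are the predictions and labels on the selected set):

```python
import numpy as np

def calibration_error(preds, labels):
    """E = (sum of predictions / sum of labels) - 1 on the selected set."""
    return preds.sum() / labels.sum() - 1.0

def expected_calibration_error(preds, labels, n_bins=50):
    """ECE over M equi-spaced bins of the prediction values."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(preds, bins) - 1, 0, n_bins - 1)  # bin index per item
    n = len(preds)
    ece = 0.0
    for m in range(n_bins):
        mask = idx == m
        if mask.any():
            # weight |B_m|/N times the gap between mean label and mean prediction
            ece += mask.sum() / n * abs(labels[mask].mean() - preds[mask].mean())
    return ece
```

With `n_bins=50` this matches the $M = 50$ setting used in the experiments of Section 6.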
In Section 6.2, we execute the pipeline on a real-world ads recommendation dataset using a state-of-the-art neural network and observe that calibration errors are consistently larger than 3%, as shown in Table 2, if no debiasing method is applied. Therefore, the goal of this paper is to find a predictor f that minimizes calibration errors on the selected set without compromising the ranking performance or incurring additional online serving costs. The serving costs refer to the costs of executing the method on the test sets. We note that the unlabeled validation set D_val-test,X is easily obtained and available offline, because we can use the candidate sets from previous online requests. Furthermore, α is usually small (but unknown) in practice since DSPs usually have a large ads candidate set. Remark: maximization bias vs. model bias. In this paper, we tackle maximization bias, which is different from model bias in machine learning. Model bias is the difference between the expected model prediction and the true value for a given ad. Note that model predictions are random due to random data and stochastic optimization algorithms. Maximization bias, however, is a different type of bias, orthogonal to model bias: it exists because of the selection (maximization) step. Even models with zero model bias may have maximization bias. We refer the readers back to Example 1 to see the difference between model bias and maximization bias.

4. QUANTIFYING MAXIMIZATION BIAS IN GENERALIZED LINEAR MODELS

Generalized linear models are a unifying framework for linear regression, logistic regression, and Poisson regression (Nelder & Wedderburn, 1972). In particular, a neural network (NN) can be viewed as a generalized linear model if we treat the neurons of the second-to-last layer as features and use a sigmoid function as the final activation function. In this section, we provide a rigorous quantification of the maximization bias in generalized linear models with Gaussian features. By a slight abuse of notation, $D_{\text{test}}$ represents the population distribution of $\{X, Y\}$ in the test set, from which we select the top-$\alpha$ proportion. We further assume the underlying true model is generalized linear in both the training and test sets with the same conditional distributions, i.e., $Y \mid X \sim \mathrm{Ber}(\phi(\beta_*^\top X))$, where $\phi(\cdot)$ is a positive, continuously differentiable, and monotonically increasing function. Let $\hat{\beta}_N$ be the parameter learned from the $N$-sample training set drawn from $D_{\text{train}}$, which is independent of $D_{\text{test}}$. We do not assume that the marginal distributions of $X$ in the training data $D_{\text{train}}$ and test data $D_{\text{test}}$ are the same. We let $q_{1-\alpha}(Z)$ be the $1-\alpha$ quantile of a distribution $Z$, i.e., $\mathbb{P}(Z \geq q_{1-\alpha}(Z)) = \alpha$. By the monotonicity of $\phi(\cdot)$, the estimated average probability on the selection set is $\mathbb{E}_{D_{\text{test}}, D_{\text{train}}}[\phi(\hat{\beta}_N^\top X) \mid \hat{\beta}_N^\top X \geq q_{1-\alpha}(\hat{\beta}_N^\top X \mid \hat{\beta}_N)]$; the actual average probability on the selection set is $\mathbb{E}_{D_{\text{test}}, D_{\text{train}}}[\phi(\beta_*^\top X) \mid \hat{\beta}_N^\top X \geq q_{1-\alpha}(\hat{\beta}_N^\top X \mid \hat{\beta}_N)]$. Note that these expectations are taken with respect to the randomness of both $X$ and $\hat{\beta}_N$. To simplify notation, we drop the subscripts $D_{\text{test}}, D_{\text{train}}$ when there is no confusion. We quantify the maximization bias for this generalized linear model with covariate shifts in Theorem 4.1.

Theorem 4.1. Suppose $X \sim \mathcal{N}(\mu, \Sigma)$ and $\phi(\cdot)$ is a positive, Lipschitz continuous, twice differentiable, and monotonically increasing function.
If $\hat{\beta}_N$ is a maximum likelihood estimator, the estimated average probability on the selection set is

$$\mathbb{E}\left[\phi(\hat{\beta}_N^\top X) \mid \hat{\beta}_N^\top X \geq q_{1-\alpha}(\hat{\beta}_N^\top X \mid \hat{\beta}_N)\right] = \mathbb{E}\left[\phi\left(\hat{\beta}_N^\top \mu + \sqrt{\hat{\beta}_N^\top \Sigma \hat{\beta}_N}\, Z\right) \,\Big|\, Z \geq q_{1-\alpha}(Z)\right],$$

and the maximization bias on the selection set is

$$\mathbb{E}\left[\phi(\hat{\beta}_N^\top X) \mid \hat{\beta}_N^\top X \geq q_{1-\alpha}(\hat{\beta}_N^\top X \mid \hat{\beta}_N)\right] - \mathbb{E}\left[\phi(\beta_*^\top X) \mid \hat{\beta}_N^\top X \geq q_{1-\alpha}(\hat{\beta}_N^\top X \mid \hat{\beta}_N)\right] = \mathbb{E}\left[\int_{\hat{\beta}_N^\top \Sigma \beta_* / \sqrt{\hat{\beta}_N^\top \Sigma \hat{\beta}_N}}^{\sqrt{\hat{\beta}_N^\top \Sigma \hat{\beta}_N}} h'(t)\, dt\right] + O\!\left(\frac{1}{N}\right),$$

where $Z \sim \mathcal{N}(0, 1)$ and $h'(t) \triangleq \mathbb{E}[\phi'(\hat{\beta}_N^\top \mu + tZ)\, Z \mid Z \geq q_{1-\alpha}(Z)]$. The $O(1/N)$ term depends only on $\phi$, $\beta_*$, and $\Sigma$, and does not depend on $\alpha$.

Remark: 1. The Gaussianity assumption on the feature $X$ is not necessary. One only needs $\beta_*^\top X$ and $\hat{\beta}_N^\top X$ to be jointly Gaussian conditional on $\hat{\beta}_N$, and the proof still goes through. Figure 5 in Appendix B.2 shows that $\hat{\beta}_N^\top X$ is indeed very close to Gaussian. 2. Note that we do not assume the training and test distributions are the same; Theorem 4.1 holds under arbitrary covariate shifts. In the setting of Theorem 4.1, $\mathrm{Var}(\hat{\beta}_N^\top X \mid X)$ is heterogeneous across $X$, and we allow choosing any top-$\alpha$ proportion of the test set. Furthermore, if $\alpha$ is small and $\mu$ is small, $h'(t)$ has the same order as $\mathbb{E}[Z \mid Z \geq q_{1-\alpha}(Z)]$, which is large, and the maximization bias mainly depends on $\hat{\beta}_N^\top \Sigma \beta_*$ and $\hat{\beta}_N^\top \Sigma \hat{\beta}_N$. The formal statement is Lemma A.2 in Appendix A. To reduce the maximization bias, we discount $\hat{\beta}_N^\top X$ by a factor $\lambda$; Corollary 4.2 studies the bias of this discounted estimator.

Corollary 4.2. Suppose the same assumptions as in Theorem 4.1 are imposed. If we change $\hat{\beta}_N^\top X$ to $\lambda \hat{\beta}_N^\top X + (1-\lambda)\hat{\beta}_N^\top \mu$ for $\lambda \in [0, 1]$, the bias becomes

$$\mathbb{E}\left[\phi\left(\lambda \hat{\beta}_N^\top X + (1-\lambda)\hat{\beta}_N^\top \mu\right) \,\Big|\, \hat{\beta}_N^\top X \geq q_{1-\alpha}(\hat{\beta}_N^\top X \mid \hat{\beta}_N)\right] - \mathbb{E}\left[\phi(\beta_*^\top X) \mid \hat{\beta}_N^\top X \geq q_{1-\alpha}(\hat{\beta}_N^\top X \mid \hat{\beta}_N)\right] = \mathbb{E}\left[\int_{\hat{\beta}_N^\top \Sigma \beta_* / \sqrt{\hat{\beta}_N^\top \Sigma \hat{\beta}_N}}^{\lambda \sqrt{\hat{\beta}_N^\top \Sigma \hat{\beta}_N}} h'(t)\, dt\right] + O\!\left(\frac{1}{N}\right).$$

The transformation $\hat{\beta}_N^\top X \to \lambda \hat{\beta}_N^\top X + (1-\lambda)\hat{\beta}_N^\top \mu$ is monotone linear; thus, it does not alter the item rankings.
Despite its simplicity, we are able to find, in Section 5, a $\lambda$ such that the leading term of (5) is approximately zero, i.e.,

$$\mathbb{E}\left[\int_{\hat{\beta}_N^\top \Sigma \beta_* / \sqrt{\hat{\beta}_N^\top \Sigma \hat{\beta}_N}}^{\lambda \sqrt{\hat{\beta}_N^\top \Sigma \hat{\beta}_N}} h'(t)\, dt\right] \approx 0.$$
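To build intuition for Theorem 4.1 and Corollary 4.2, the following self-contained Monte Carlo sketch (our own illustration, with the identity link $\phi(x) = x$, $\Sigma = I$, $\mu = 0$, and hypothetical parameter values) shows that top-$\alpha$ selection by a noisy score overestimates the true mean on the selected set, and that an oracle shrinkage factor $\lambda = \hat{\beta}^\top \Sigma \beta_* / \hat{\beta}^\top \Sigma \hat{\beta}$ largely removes the bias:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, alpha = 100, 50_000, 0.02
beta_star = np.ones(d)                                # true parameter (hypothetical)
beta_hat = beta_star + 0.5 * rng.standard_normal(d)   # noisy learned parameter

X = rng.standard_normal((n, d))   # test features X ~ N(0, I): Sigma = I, mu = 0
scores = X @ beta_hat             # predicted values under the identity link
truth = X @ beta_star             # true conditional means

k = int(alpha * n)
sel = np.argsort(scores)[-k:]     # top-alpha selection by the noisy prediction

# Maximization bias: average prediction minus average truth on the selected set.
bias = scores[sel].mean() - truth[sel].mean()

# Oracle shrinkage (in practice, lambda is estimated as in Section 5).
lam = (beta_hat @ beta_star) / (beta_hat @ beta_hat)
adj_bias = lam * scores[sel].mean() - truth[sel].mean()
```

With these settings, `bias` comes out substantially positive, while `adj_bias` is much closer to zero, mirroring the cancellation of the leading integral term.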

5. VARIANCE-ADJUSTING DEBIASING META-ALGORITHM

Based on the theory developed in Section 4, the goal is to find $\lambda$ such that

$$\lambda \sqrt{\hat{\beta}_N^\top \Sigma \hat{\beta}_N} \approx \frac{\hat{\beta}_N^\top \Sigma \beta_*}{\sqrt{\hat{\beta}_N^\top \Sigma \hat{\beta}_N}} \iff \lambda \approx \frac{\hat{\beta}_N^\top \Sigma \beta_*}{\hat{\beta}_N^\top \Sigma \hat{\beta}_N},$$

where $\hat{\beta}_N^\top \Sigma \hat{\beta}_N = \mathbb{E}[\hat{\beta}_N^\top \Sigma \hat{\beta}_N] + O_p(1/\sqrt{d})$ and $\hat{\beta}_N^\top \Sigma \beta_* = \mathbb{E}[\hat{\beta}_N^\top \Sigma \beta_*] + O_p(1/\sqrt{d})$. In the context of generalized linear models, we observe that $\hat{\beta}_N^\top X = \phi^{-1}(f)$. Therefore, in general cases, we define $l(x) = \phi^{-1}(f(x))$. In practice, $l(x)$ can be obtained either by inverting $\phi$ or by extracting the last layer of the NN if a neural network with $\phi$ as the final activation is used as the prediction model. Based on this analysis, we propose the variance-adjusting debiasing (VAD) meta-algorithm in Algorithm 1. Since the $\lambda$ estimation procedure is fully non-parametric and depends only on means, variances, and conditional variances, our meta-algorithm applies to any machine learning algorithm. After the $S$ predictors $f_1, \dots, f_S$ are trained with independent randomness (Lines 1-4), the remaining steps of Algorithm 1 are:

5: Compute $l_i(x) = \phi^{-1}(f_i(x))$ for $i \in [S]$. Then compute the means $\bar{Y}^l_i = \mathbb{E}_{D_{\text{val-test},X}}[l_i(X)]$ for $i \in [S]$ and the test variance $(\hat{\sigma}^l_{\hat{Y}})^2 = \mathrm{Var}_{D_{\text{val-test},X}}[l_1(X)]$.
6: Compute the expected conditional variance
$$(\hat{\sigma}^l_f)^2 = \mathbb{E}_{D_{\text{val-test},X}}\left[\frac{1}{S-1} \sum_{j=1}^{S} \left( \left(l_j(X) - \bar{Y}^l_j\right) - \frac{1}{S}\sum_{i=1}^{S} \left(l_i(X) - \bar{Y}^l_i\right) \right)^2\right].$$
7: Compute $\lambda = 1 - (\hat{\sigma}^l_f)^2 / (\hat{\sigma}^l_{\hat{Y}})^2$.
8: Output $f_{\text{VAD}}(\cdot) = \phi\left(\lambda l_1(\cdot) + (1-\lambda)\bar{Y}^l_1\right)$.

Since the debiased predictor $f_{\text{VAD}}$ is a monotonic transformation of the original predictor $f_1$ in Algorithm 1, the prediction rankings remain the same. Further, $\lambda$ and $\bar{Y}^l_1$ are computed purely offline as long as we have an unlabeled validation set with the same distribution as the test set; thus, no additional serving costs are added. Furthermore, since we only need to estimate means and variances, samples from an unlabeled candidate set of reasonable size suffice. We can sample recent data points from the large (and potentially non-stationary) candidate set to get sufficiently accurate estimates.
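The $\lambda$-estimation steps of Algorithm 1 (Lines 5-8) can be sketched in NumPy as follows (a minimal sketch with our own function and variable names; the $S$ predictors are assumed to be already trained with independent randomness, and the sigmoid link is used):

```python
import numpy as np

def expit(z):
    return 1.0 / (1.0 + np.exp(-z))

def logit(p):
    return np.log(p / (1.0 - p))

def vad(predictors, X_val):
    """Variance-adjusting debiasing, steps 5-8 of Algorithm 1 (sketch).

    predictors: list of S independently trained models, each mapping an array
                of features to predicted probabilities in (0, 1).
    X_val:      unlabeled validation features drawn from the test distribution.
    """
    L = np.stack([logit(f(X_val)) for f in predictors])  # (S, n) logits l_i(X)
    means = L.mean(axis=1, keepdims=True)                # per-model means  \bar{Y}^l_i
    sigma_yhat2 = L[0].var()                             # test variance of l_1(X)
    centered = L - means
    # Expected conditional variance: per-sample variance across the S models
    # (unbiased 1/(S-1) normalization), averaged over the validation set.
    sigma_f2 = centered.var(axis=0, ddof=1).mean()
    lam = 1.0 - sigma_f2 / sigma_yhat2
    l1_mean = means[0, 0]
    return lambda X: expit(lam * logit(predictors[0](X)) + (1.0 - lam) * l1_mean)
```

Because the output is a monotone transform of $f_1$, rankings are unchanged, and both `lam` and `l1_mean` are computed entirely offline.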
For the link function ϕ(•) in logistic regression and NNs with a final sigmoid activation function, ϕ(x) = (1 + exp(-x)) -1 and l i is the logit (the last layer in NNs) of the predictor f i . ϕ(•) could also be chosen as the identity mapping ϕ(x) = x. We note that this choice of ϕ(•) has similar performance to the choice of ϕ(x) = (1 + exp(-x)) -1 . We report additional results about the behavior of the identity link function in Appendix B. On Line 4 in Algorithm 1, we recommend bootstrapping (Efron & Tibshirani, 1994) if the base training model lacks intrinsic randomness, e.g., logistic regression, which is an efficiently solvable convex optimization problem. However, if the base model is highly non-convex with multiple local optima, e.g., neural networks, we recommend random initializations and random data orders because many empirical studies (Nixon et al., 2020; Lakshminarayanan et al., 2016; Lee et al., 2015) show that bootstrapping may hurt the performance in deep neural networks and the estimation of the conditional variances would benefit from the algorithmic randomness (Jiang et al., 2021) . In our method, we only need to choose one hyper-parameter S and we do not need to specify α. In fact, S = 2 would be sufficient and results in lower training cost. Therefore, our method doubles the training cost, which is usually acceptable in practice. More importantly, our method does not incur any additional online serving costs. All results reported in Section 6 use S = 2. We report additional results about different S choices in Appendix B.

6. EXPERIMENT RESULTS

In this section, we demonstrate the performance of our method on both synthetic data and a real-world ads recommendation dataset. We use calibration errors, ECE, and MCE to evaluate performance. Note that the evaluation metrics are calculated on the selection set, i.e., we choose the top-α proportion of test data points using the model predictions and compute the evaluation metrics on that selection set. We provide additional numerical results for the Avazu dataset in Appendix B.3.

Data and Model

We consider a logistic regression model. We assume the response Y follows a Bernoulli distribution with probability $(1 + \exp(-\beta^\top X))^{-1}$, for $\beta, X \in \mathbb{R}^d$. Note that we do not compare against other calibration methods under the logistic regression model, since logistic regression already produces well-calibrated predictions (Niculescu-Mizil & Caruana, 2005); applying other calibration methods to a logistic regression model will not improve performance. Table 1 shows that the vanilla model without debiasing has a calibration error of more than 7%, which is mainly due to maximization bias, as logistic regression produces well-calibrated predictions. After applying VAD, the calibration error is sufficiently close to zero. All improvements are statistically significant at the 1% significance level. Additional experiment results are reported in Appendix B.1.
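The synthetic data-generating process described above can be sketched as follows (a minimal example with Gaussian features, matching the Gaussianity setting of Theorem 4.1; the function name and parameter choices are ours, not the paper's exact configuration):

```python
import numpy as np

def make_synthetic(n, beta, rng):
    """Draw n samples with Y ~ Bernoulli(sigmoid(beta^T X))."""
    d = len(beta)
    X = rng.standard_normal((n, d))          # Gaussian features (an assumption)
    p = 1.0 / (1.0 + np.exp(-X @ beta))      # true click probabilities
    Y = rng.binomial(1, p)                   # binary click labels
    return X, Y

rng = np.random.default_rng(0)
X, Y = make_synthetic(10_000, np.zeros(5), rng)  # beta = 0 gives a 0.5 click rate
```

A logistic regression fitted to such data is well-calibrated per point, so any remaining calibration error on a top-α selection isolates the maximization bias.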

6.2. REAL-WORLD DATA

Dataset We use the Criteo Ad Kaggle dataset to demonstrate our method's performance. The Criteo Ad Kaggle dataset is a common benchmark for CTR prediction. It consists of a week's worth of data, approximately 45 million samples in total. Each data point contains a binary label, which indicates whether the user clicks or not, along with 13 continuous and 26 categorical features. Positive labels account for 25.3% of all data. The categorical features contain 1.3 million categories on average, with 1 feature having more than 10 million categories and 5 features having more than 1 million categories. Due to computational constraints in our experiments, we use the first 15 million samples, shuffle the dataset randomly, and split it into 85% train D_train, 1.5% validation-train D_val-train, 1.5% validation-test D_val-test, and 12% test D_test datasets. Base Model We use the state-of-the-art deep learning recommendation model (DLRM) (Naumov et al., 2019) open-sourced by Meta as our baseline model. DLRM employs a standard architecture for ranking tasks, with embeddings to handle categorical features and multilayer perceptrons (MLPs) to handle continuous features and the interactions between categorical and continuous features. Throughout our experiments, we use the default parameters and an SGD optimizer. Note that our method is model-agnostic, so it can be directly applied to other models (e.g., support vector machines, boosted trees, nearest neighbors; see Friedman et al. (2001)). Baseline Calibration Methods We compare our method with various classic calibration methods. For parametric methods, we compare against Platt scaling (Platt et al., 1999). For non-parametric methods, we compare against histogram binning (Zadrozny & Elkan, 2001; 2002), isotonic regression (Menon et al., 2012), and the scaling-binning calibrator (Kumar et al., 2019). We use the labeled validation-train dataset to fit the above calibration methods.
Note that none of the existing methods explicitly considers maximization bias, and they thus fail to perform well in our setting. VAD can be combined with all existing calibration methods to achieve better performance by making a small change to the original VAD algorithm: VAD takes the predictions calibrated by the other calibration method as inputs. Additionally, instead of directly using the λ calculated from D_val-test,X, we first calculate λ_val-test from D_val-test,X using the original predictions and λ_val-train from D_val-train,X (the unlabeled validation set whose distribution matches the X margin of the training set) using the original predictions, and then use λ = λ_val-test / λ_val-train to adjust the predictions calibrated by the other method. This change accounts for the fact that other calibration methods already compensate for the maximization bias under the training distribution to some extent.
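This combination step can be sketched as follows (names are ours; `lam_val_test` and `lam_val_train` stand for the shrinkage factors estimated as in Algorithm 1 on the two unlabeled validation sets):

```python
import numpy as np

def expit(z):
    return 1.0 / (1.0 + np.exp(-z))

def logit(p):
    return np.log(p / (1.0 - p))

def vad_on_calibrated(calibrated_preds, lam_val_test, lam_val_train):
    """Apply VAD on top of another calibrator using the ratio of shrinkage factors."""
    lam = lam_val_test / lam_val_train     # lambda = lambda_val-test / lambda_val-train
    l = logit(np.asarray(calibrated_preds))
    return expit(lam * l + (1.0 - lam) * l.mean())
```

When the two factors coincide (the base calibrator already compensates fully), the ratio is 1 and the calibrated predictions pass through unchanged; a ratio below 1 shrinks the logits toward their mean.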

VAD Parameters

The last layer of the DLRM network uses the sigmoid activation function. In our method, using the link function ϕ(x) = (1 + exp(−x))⁻¹, we compute the means $\bar{Y}^l_i$, variance $(\hat{\sigma}^l_{\hat{Y}})^2$, and expected conditional variance $(\hat{\sigma}^l_f)^2$ of the last layer's neuron. To compute the expected conditional variance $(\hat{\sigma}^l_f)^2$, we keep the training data unchanged and vary only the random initialization and data order, since the optimizer itself provides sufficient randomness. In the experiments, we train S = 2 models for our method. Covariate Shift Since the underlying true data-generating process is unknown, we employ a different strategy to construct out-of-distribution test data than for the synthetic data: we train another DLRM model using 85% × 15 million samples different from the original dataset, and we randomly keep each data point in the original test set with probability 1 − p, where p is the newly-trained DLRM model's prediction for the data point. The training set remains the same. After this shift, positive samples account for 20.1% of all test data. By doing this, we ensure that the distributional change is only a covariate shift and that the positive sample ratio in the test data is lower than in the training data, which is consistent with real-world recommendation systems. Performance We replicate the experiments 40 times and report averages and standard errors of calibration errors and ECE (with the number of bins M = 50) for α ∈ {2%, 10%} in Tables 2 and 3 (MCE is in Appendix B.2), where "Original" stands for using the calibration method alone and "VAD+" represents the tandem combination of the calibration method and VAD. We plot average calibration errors and average ECE for α ∈ [2%, 10%] in Figure 2. We also report the Log Loss improvement in Appendix B.2, indicating that our method also improves prediction quality. Table 2 and Figure 2(a) show that the vanilla model without debiasing has a calibration error of more than 3%.
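The covariate-shift construction above can be sketched in a few lines (a minimal hypothetical version; `aux_preds` stands for the auxiliary model's predictions p):

```python
import numpy as np

def covariate_shift_subsample(X_test, y_test, aux_preds, rng):
    """Keep each test point with probability 1 - p, where p is the
    auxiliary model's predicted click probability for that point."""
    aux_preds = np.asarray(aux_preds)
    keep = rng.random(len(aux_preds)) < (1.0 - aux_preds)
    return X_test[keep], y_test[keep]
```

Points the auxiliary model scores highly (likely positives) are dropped more often, which lowers the positive rate in the shifted test set while leaving the conditional distribution P(Y | X) untouched, so the shift is purely a covariate shift.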
After debiasing by existing calibration methods, 1.5%-2.0% over-calibration remains, largely due to maximization bias. Among all methods, the standalone VAD method (i.e., not in tandem with any other calibration method) performs best but has a large variance. Moreover, from Tables 2 and 3 and Figure 2, we find that our method (VAD) outperforms the vanilla method. In particular, every calibration method in tandem with VAD achieves better performance than the same calibration method alone. All improvements are statistically significant at the 1% significance level. Additional experiment results are reported in Appendix B.2.

7. CONCLUSION

We proposed a theory-certified meta-algorithm, variance-adjusting debiasing (VAD), to tackle maximization bias in recommendation systems. The meta-algorithm is easy to implement (adding only a few lines of code), scalable to large-scale systems with no additional serving costs, applicable to any machine learning method, and robust to covariate shifts between training and test sets. Empirical results show its significant superiority over other methods. Our method can be directly used in industry with minor modifications, e.g., performing VAD separately for each group of data instead of globally. Interesting follow-ups include combining VAD with other calibration methods in a better way and further reducing the training cost. We leave these for future work.

A PROOFS

Before the proof of Theorem 4.1, we first collect some useful results from standard MLE theory.

Lemma A.1. $\hat{\beta}_N$ satisfies the central limit theorem $\sqrt{N}(\hat{\beta}_N - \beta_*) \Rightarrow \mathcal{N}(0, I^{-1})$, where $I$ is the Fisher information matrix defined as

$$I_{jk} = \mathbb{E}_{D_{\text{train}}}\left[-\frac{\partial^2 \left( Y \ln \phi(\beta_*^\top X) + (1-Y)\ln\left(1 - \phi(\beta_*^\top X)\right) \right)}{\partial \beta_j \, \partial \beta_k}\right].$$

Furthermore, the bias of $\hat{\beta}_N$ is of order $1/N$, i.e., $\mathbb{E}[\hat{\beta}_N - \beta_*] = O(1/N)$.

Proof. The first claim follows from the Lipschitzness of the link function $\phi$ and Theorem 5.39 in Van der Vaart (2000). The second claim follows from formula (20) in Cox & Snell (1968).

Let $Z \sim \mathcal{N}(0, 1)$ denote a standard normal random variable. Conditional on $\hat{\beta}_N$, we have $\hat{\beta}_N^\top X \sim \mathcal{N}(\hat{\beta}_N^\top \mu, \hat{\beta}_N^\top \Sigma \hat{\beta}_N)$ and $\beta_*^\top X \sim \mathcal{N}(\beta_*^\top \mu, \beta_*^\top \Sigma \beta_*)$. Then, for the estimated average probability on the selection set conditional on $\hat{\beta}_N$,

$$\mathbb{E}_{D_{\text{test}}}\left[\phi(\hat{\beta}_N^\top X) \mid \hat{\beta}_N^\top X \geq q_{1-\alpha}(\hat{\beta}_N^\top X \mid \hat{\beta}_N), \hat{\beta}_N\right] = \mathbb{E}\left[\phi\left(\hat{\beta}_N^\top \mu + \sqrt{\hat{\beta}_N^\top \Sigma \hat{\beta}_N}\, Z\right) \,\Big|\, Z \geq q_{1-\alpha}(Z), \hat{\beta}_N\right].$$

Note that conditional on $\hat{\beta}_N$, $\mathrm{Cov}_{D_{\text{test}}}(\hat{\beta}_N^\top X, \beta_*^\top X \mid \hat{\beta}_N) = \hat{\beta}_N^\top \Sigma \beta_*$. Therefore,

$$\beta_*^\top X = \frac{\hat{\beta}_N^\top \Sigma \beta_*}{\hat{\beta}_N^\top \Sigma \hat{\beta}_N} \hat{\beta}_N^\top X + \left(\beta_*^\top X - \frac{\hat{\beta}_N^\top \Sigma \beta_*}{\hat{\beta}_N^\top \Sigma \hat{\beta}_N} \hat{\beta}_N^\top X\right),$$

where $\beta_*^\top X - \frac{\hat{\beta}_N^\top \Sigma \beta_*}{\hat{\beta}_N^\top \Sigma \hat{\beta}_N} \hat{\beta}_N^\top X \perp \hat{\beta}_N^\top X$ given $\hat{\beta}_N, \beta_*$. Note that

$$\beta_*^\top X - \frac{\hat{\beta}_N^\top \Sigma \beta_*}{\hat{\beta}_N^\top \Sigma \hat{\beta}_N} \hat{\beta}_N^\top X \,\Big|\, \hat{\beta}_N, \beta_* \sim \mathcal{N}\left(\beta_*^\top \mu - \frac{\hat{\beta}_N^\top \Sigma \beta_*}{\hat{\beta}_N^\top \Sigma \hat{\beta}_N} \hat{\beta}_N^\top \mu,\; \beta_*^\top \Sigma \beta_* - \frac{(\hat{\beta}_N^\top \Sigma \beta_*)^2}{\hat{\beta}_N^\top \Sigma \hat{\beta}_N}\right);$$

then we have

$$\beta_*^\top X \stackrel{d}{=} \frac{\hat{\beta}_N^\top \Sigma \beta_*}{\sqrt{\hat{\beta}_N^\top \Sigma \hat{\beta}_N}}\, Z + \sqrt{\beta_*^\top \Sigma \beta_* - \frac{(\hat{\beta}_N^\top \Sigma \beta_*)^2}{\hat{\beta}_N^\top \Sigma \hat{\beta}_N}}\, Z_2 + \beta_*^\top \mu,$$

where $Z_2 \sim \mathcal{N}(0, 1)$ is independent of $Z$ and $\stackrel{d}{=}$ means equality in distribution. Thus, the actual average probability can be reformulated as

$$\mathbb{E}_{D_{\text{test}}}\left[\phi(\beta_*^\top X) \mid \hat{\beta}_N^\top X \geq q_{1-\alpha}(\hat{\beta}_N^\top X \mid \hat{\beta}_N), \hat{\beta}_N\right] = \mathbb{E}\left[\phi\left(\frac{\hat{\beta}_N^\top \Sigma \beta_*}{\sqrt{\hat{\beta}_N^\top \Sigma \hat{\beta}_N}}\, Z + \sqrt{\beta_*^\top \Sigma \beta_* - \frac{(\hat{\beta}_N^\top \Sigma \beta_*)^2}{\hat{\beta}_N^\top \Sigma \hat{\beta}_N}}\, Z_2 + \beta_*^\top \mu\right) \,\Big|\, Z \geq q_{1-\alpha}(Z), \hat{\beta}_N\right].$$

Published as a conference paper at ICLR 2023

Note that $\hat{\beta}_N$ depends only on the training set; therefore, $\hat{\beta}_N$, $Z$, and $Z_2$ are mutually independent.
By the Taylor expansion, we have

$$\mathbb{E}_{D_{\text{train}}}\left[\phi\left(\frac{\hat{\beta}_N^\top \Sigma \beta_*}{\sqrt{\hat{\beta}_N^\top \Sigma \hat{\beta}_N}}\, Z + \sqrt{\beta_*^\top \Sigma \beta_* - \frac{(\hat{\beta}_N^\top \Sigma \beta_*)^2}{\hat{\beta}_N^\top \Sigma \hat{\beta}_N}}\, Z_2 + \beta_*^\top \mu\right) \,\Big|\, Z\right]$$
$$= \mathbb{E}_{D_{\text{train}}}\left[\phi\left(\frac{\hat{\beta}_N^\top \Sigma \beta_*}{\sqrt{\hat{\beta}_N^\top \Sigma \hat{\beta}_N}}\, Z + \beta_*^\top \mu\right) \,\Big|\, Z\right] + \mathbb{E}_{D_{\text{train}}}\left[\phi'\left(\frac{\hat{\beta}_N^\top \Sigma \beta_*}{\sqrt{\hat{\beta}_N^\top \Sigma \hat{\beta}_N}}\, Z + \beta_*^\top \mu\right) \sqrt{\beta_*^\top \Sigma \beta_* - \frac{(\hat{\beta}_N^\top \Sigma \beta_*)^2}{\hat{\beta}_N^\top \Sigma \hat{\beta}_N}}\, Z \,\Big|\, Z\right] \mathbb{E}[Z_2] + O\!\left(\frac{1}{N}\right)$$
$$= \mathbb{E}_{D_{\text{train}}}\left[\phi\left(\frac{\hat{\beta}_N^\top \Sigma \beta_*}{\sqrt{\hat{\beta}_N^\top \Sigma \hat{\beta}_N}}\, Z + \beta_*^\top \mu\right) \,\Big|\, Z\right] + O\!\left(\frac{1}{N}\right),$$

where the first-order term vanishes since $\mathbb{E}[Z_2] = 0$. By Lemma A.1, we have

$$\mathbb{E}_{D_{\text{train}}}\left[\sqrt{\beta_*^\top \Sigma \beta_* - \frac{(\hat{\beta}_N^\top \Sigma \beta_*)^2}{\hat{\beta}_N^\top \Sigma \hat{\beta}_N}}\right] = O\!\left(\frac{1}{N}\right).$$

Finally, by taking the Taylor expansion of $\phi\left(\hat{\beta}_N^\top \mu + \sqrt{\hat{\beta}_N^\top \Sigma \hat{\beta}_N}\, Z\right)$ around $\frac{\hat{\beta}_N^\top \Sigma \beta_*}{\sqrt{\hat{\beta}_N^\top \Sigma \hat{\beta}_N}}\, Z + \beta_*^\top \mu$, we have

$$\mathbb{E}_{D_{\text{train}}}\left[\phi\left(\frac{\hat{\beta}_N^\top \Sigma \beta_*}{\sqrt{\hat{\beta}_N^\top \Sigma \hat{\beta}_N}}\, Z + \beta_*^\top \mu\right) \,\Big|\, Z\right] = \mathbb{E}_{D_{\text{train}}}\left[\phi\left(\frac{\hat{\beta}_N^\top \Sigma \beta_*}{\sqrt{\hat{\beta}_N^\top \Sigma \hat{\beta}_N}}\, Z + \hat{\beta}_N^\top \mu\right) \,\Big|\, Z\right] + O\!\left(\frac{1}{N}\right).$$

Therefore, the actual average probability is

$$\mathbb{E}\left[\phi(\beta_*^\top X) \mid \hat{\beta}_N^\top X \geq q_{1-\alpha}(\hat{\beta}_N^\top X \mid \hat{\beta}_N)\right] = \mathbb{E}\left[\phi\left(\frac{\hat{\beta}_N^\top \Sigma \beta_*}{\sqrt{\hat{\beta}_N^\top \Sigma \hat{\beta}_N}}\, Z + \hat{\beta}_N^\top \mu\right) \,\Big|\, Z \geq q_{1-\alpha}(Z)\right] + O\!\left(\frac{1}{N}\right).$$

Let $h(t) = \mathbb{E}\left[\phi(\hat{\beta}_N^\top \mu + tZ) \mid Z \geq q_{1-\alpha}(Z)\right]$ and $h'(t) = \mathbb{E}\left[\phi'(\hat{\beta}_N^\top \mu + tZ)\, Z \mid Z \geq q_{1-\alpha}(Z)\right]$. Then, by using the Taylor expansion again, the maximization bias is

$$\mathbb{E}\left[\phi(\hat{\beta}_N^\top X) \mid \hat{\beta}_N^\top X \geq q_{1-\alpha}(\hat{\beta}_N^\top X \mid \hat{\beta}_N)\right] - \mathbb{E}\left[\phi(\beta_*^\top X) \mid \hat{\beta}_N^\top X \geq q_{1-\alpha}(\hat{\beta}_N^\top X \mid \hat{\beta}_N)\right]$$
$$= \mathbb{E}\left[\phi\left(\hat{\beta}_N^\top \mu + \sqrt{\hat{\beta}_N^\top \Sigma \hat{\beta}_N}\, Z\right) \,\Big|\, Z \geq q_{1-\alpha}(Z)\right] - \mathbb{E}\left[\phi\left(\frac{\hat{\beta}_N^\top \Sigma \beta_*}{\sqrt{\hat{\beta}_N^\top \Sigma \hat{\beta}_N}}\, Z + \hat{\beta}_N^\top \mu\right) \,\Big|\, Z \geq q_{1-\alpha}(Z)\right] + O\!\left(\frac{1}{N}\right)$$
$$= \mathbb{E}\left[\int_{\hat{\beta}_N^\top \Sigma \beta_* / \sqrt{\hat{\beta}_N^\top \Sigma \hat{\beta}_N}}^{\sqrt{\hat{\beta}_N^\top \Sigma \hat{\beta}_N}} h'(t)\, dt\right] + O\!\left(\frac{1}{N}\right).$$

Since $\alpha \leq 0.2$, we have

$$\mathbb{E}\left[Z\, \mathbb{1}\{Z \in [q_{1-\alpha}(Z), 2q_{1-\alpha}(Z)]\} \mid Z \geq q_{1-\alpha}(Z)\right] \geq \frac{1}{2}\, \mathbb{E}\left[Z \mid Z \geq q_{1-\alpha}(Z)\right],$$

which yields the desired lower bound.

B NUMERICAL RESULTS

B.1 SYNTHETIC DATA

In this section, we report additional results on the synthetic dataset. We plot average calibration errors, average ECE, and average MCE for $\alpha \in [2\%, 10\%]$ in Figure 3. MCE is defined as
\[
\mathrm{MCE} \triangleq \max_{m \in \{1, \dots, M\}} \left| \frac{\sum_{k \in B_m} y_k}{|B_m|} - \frac{\sum_{k \in B_m} f(x_k)}{|B_m|} \right|.
\]
We also test the methods with different hyperparameters. Specifically, we test our method with $S = 3$, and with the identity mapping $\phi(x) = x$ (denoted as VAD(p)). The results are summarized in Tables 4, 5, and 6. We find that for the VAD method, $S = 3$ outperforms $S = 2$, at the cost of more training resources. VAD(p) and VAD have similar performance.

Log Loss is defined as
\[
\mathrm{LogLoss}(\hat D^{\alpha}_{\mathrm{test}}, f) = -\frac{1}{|\hat D^{\alpha}_{\mathrm{test}}|} \sum_{i \in \hat D^{\alpha}_{\mathrm{test}}} \left( y_i \log f(x_i) + (1 - y_i) \log\left(1 - f(x_i)\right) \right).
\]
The Log Loss reduction is then defined by
\[
\frac{\mathrm{LogLoss}(\hat D^{\alpha}_{\mathrm{test}}, f_{\mathrm{VAD}})}{\mathrm{LogLoss}(\hat D^{\alpha}_{\mathrm{test}}, f_{\mathrm{vanilla}})} - 1.
\]
A negative Log Loss reduction means that we achieve lower loss. We find that after applying our method, we achieve lower Log Loss, meaning that we improve the prediction quality.

On the real-world dataset, we likewise find that for the VAD method, $S = 3$ slightly outperforms $S = 2$, at the cost of more training resources, and VAD(p) and VAD again have similar performance. In addition, Table 11 reports the Log Loss reduction; after applying our method, we achieve uniformly lower Log Loss, meaning that we improve the prediction quality.

Finally, we present histograms of the last-layer neuron (logit) of the neural networks to justify the Gaussianity assumptions in Theorem 4.1. Figure 5 plots the histograms of model predictions from models trained with three different random seeds, where the black line is the estimated Gaussian density. By the Kolmogorov-Smirnov test, we cannot reject the null hypothesis that the empirical distribution is Gaussian at the 5% significance level.

Base Model

We use the xDeepFM model (Lian et al., 2018) open-sourced in (Shen, 2017).
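As a concrete reading of the MCE formula above, here is a minimal sketch of our own (not the paper's code); it assumes equal-width probability bins for the $B_m$, since the binning scheme is not restated in this section, and skips empty bins:

```python
import numpy as np

def max_calibration_error(y, preds, num_bins=50):
    """MCE: the largest absolute gap between the mean label and the mean
    prediction over equal-width probability bins B_m (empty bins skipped)."""
    y, preds = np.asarray(y, float), np.asarray(preds, float)
    bin_ids = np.minimum((preds * num_bins).astype(int), num_bins - 1)
    mce = 0.0
    for m in range(num_bins):
        mask = bin_ids == m
        if mask.any():
            mce = max(mce, abs(y[mask].mean() - preds[mask].mean()))
    return mce
```

For instance, a predictor that always outputs 0.5 on data with a 50% positive rate has MCE 0, while one that always outputs 0.9 on the same data has MCE 0.4.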

Covariate Shift

We train another xDeepFM model. We randomly keep each data point in the original test set with probability $1 - p$ if $p < 0.2$, probability $2.2 - 7p$ if $0.2 \le p \le 0.3$, and probability $0.1$ if $p > 0.3$, where $p$ is the newly-trained xDeepFM model's prediction for the data point. The training set remains the same. By doing this, we ensure that the distributional change is a pure covariate shift, and that the positive sample ratio in the test data is lower than that in the training data, which is consistent with real-world recommendation systems. We report results for the calibration error, ECE, and MCE in Tables 19, 20, and 21, respectively. We observe that all the calibration methods combined with VAD achieve better performance than the calibration methods alone, which is consistent with the results in our paper.
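The piecewise keep-probability above is continuous in $p$ (it equals $0.8$ at $p = 0.2$ and $0.1$ at $p = 0.3$). A minimal sketch of the subsampling rule, with helper names of our own choosing:

```python
import random

def keep_probability(p):
    """Probability of keeping a test point whose model prediction is p."""
    if p < 0.2:
        return 1.0 - p
    if p > 0.3:
        return 0.1
    return 2.2 - 7.0 * p  # linear bridge, continuous at both endpoints

def subsample_test_set(points, predict, rng=None):
    """Keep each point independently with probability keep_probability(p)."""
    rng = rng or random.Random(0)
    return [x for x in points if rng.random() < keep_probability(predict(x))]
```

Because high-prediction points (which are more often positives) are kept with low probability, the retained test set has a lower positive rate than the training set, as intended.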



Code available at https://anonymous.4open.science/r/VAD. Datasets: https://www.kaggle.com/c/avazu-ctr-prediction and https://www.kaggle.com/c/criteo-display-ad-challenge.



Variance-adjusting debiasing (VAD) method
1: Input: training dataset $D_{\mathrm{train}}$, the unlabeled test validation set $D_{\mathrm{val\text{-}test},X}$, a link function $\phi: \mathbb{R} \to [0, 1]$, and the number of replications $S$.
2: Output: a variance-adjusting debiased predictor $f_{\mathrm{VAD}}$.
3: Train a model on the training set and obtain the predictor $f_1$.
4: Bootstrap (i.e., randomly sample with replacement) the dataset $S - 1$ times, or retrain the model $S - 1$ times using different random seeds, and obtain the predictors $f_2, \dots, f_S$.
5: Let
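Step 5 is truncated in this copy, so the following is only a sketch of our own of the retraining-based variant. It assumes, based on the shrinkage form $\lambda \hat\beta_N^\top X + (1-\lambda)\hat\beta_N^\top \mu$ appearing in Corollary 4.2, that the adjustment shrinks each logit toward the mean logit by a factor $\lambda$ estimated from the across-replication variance; all function and variable names are ours:

```python
import numpy as np

def vad_predict(logit_fns, X_valtest, X, link=None):
    """Sketch of VAD: shrink logits toward their mean by a factor lambda.

    logit_fns: list of S logit predictors (f_1 from the original fit,
               f_2..f_S from bootstrap/retrain replications).
    X_valtest: unlabeled validation features used to estimate variances.
    X:         features to predict on.
    """
    if link is None:
        link = lambda t: 1.0 / (1.0 + np.exp(-t))  # sigmoid as default link

    # Logits of each replication on the validation set, shape (S, n_val).
    L_val = np.stack([f(X_valtest) for f in logit_fns])
    mu = L_val[0].mean()          # mean logit of the main model
    total_var = L_val[0].var()    # total logit variance on the validation set
    # Average per-example variance across replications estimates the
    # estimation-noise component of the logit variance.
    noise_var = L_val.var(axis=0, ddof=1).mean()

    # Assumed shrinkage factor (cf. the lambda in Corollary 4.2).
    lam = np.sqrt(max(total_var - noise_var, 0.0) / total_var)
    logits = logit_fns[0](X)
    return link(lam * logits + (1.0 - lam) * mu)
```

Because $\lambda \le 1$, the adjusted logits are less spread out than the raw ones, which is exactly the correction direction Theorem 4.1 calls for on the selected set.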

Figure 3: Average calibration errors, ECE, and MCE on the synthetic data

Table 7 reports the Log Loss reduction on the selection set, with Log Loss as defined in Section B.1.

Figure 5: Histograms of the last-layer neuron (logit) of the neural networks

; this equality is proved in Appendix A. Therefore, $\hat\beta_N^\top \Sigma \beta_*$ can be approximated by
\[
\hat\beta_N^\top \Sigma \hat\beta_N - \mathbb{E}_{D_{\mathrm{test}}}\left[\mathrm{Var}_{D_{\mathrm{train}}}\left(\hat\beta_N^\top X - \hat\beta_N^\top \mu \,\middle|\, X\right)\right],
\]
where the variance of $\hat\beta_N$ can be estimated by bootstrapping or by retraining the model using different random seeds. Note that this approximation is relatively accurate if the feature dimension $d$ is large and the correlations between dimensions are small, because in this case $\hat\beta_N^\top \Sigma \hat\beta_N$ and $\hat\beta_N^\top \Sigma \beta_*$ concentrate around their means with an error $O_p(1/$
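This approximation can be computed directly from the logits of the $S$ retrained models. A sketch with a helper of our own (using `ddof=1` for an unbiased across-seed variance):

```python
import numpy as np

def adjusted_signal_variance(logits_by_seed):
    """Approximate beta_hat^T Sigma beta_* as in the text: the total logit
    variance on the test set minus the average per-example variance of the
    logit across S retrained (or bootstrapped) models."""
    L = np.asarray(logits_by_seed, dtype=float)   # shape (S, n)
    total_var = L[0].var()                        # Var_Dtest(beta_hat^T X)
    noise_var = L.var(axis=0, ddof=1).mean()      # E Var_Dtrain(beta_hat^T X | X)
    return total_var - noise_var
```

On synthetic logits of the form "signal plus independent per-seed noise", the estimate recovers the signal variance: the per-example variance across seeds isolates the noise component, which is then subtracted from the total.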

Average and standard errors of calibration errors and ECE for synthetic data

Average calibration errors on the Criteo Ad Kaggle dataset

Average ECE on the Criteo Ad Kaggle dataset

Average and standard errors of MCE on synthetic data

Average and standard errors of calibration errors on synthetic data

Average and standard errors of ECE on synthetic data

Log loss reduction on synthetic data by applying VAD

Average and standard errors of calibration errors on Criteo Ad Kaggle dataset


We open-sourced our implementation at https://github.com/tofuwen/VAD.

9. ACKNOWLEDGEMENTS

This project was partially supported by the National Institutes of Health (NIH) under Contract R01HL159805, by the NSF Convergence Accelerator Track-D award #2134901, by a grant from Apple Inc., a grant from KDDI Research Inc., and generous gifts from Salesforce Inc., Microsoft Research, and Amazon Research.


Proof of Corollary 4.2. Conditional on $\hat\beta_N$, we have $\lambda \hat\beta_N^\top X + (1-\lambda)\, \hat\beta_N^\top \mu \sim \mathcal{N}\left(\hat\beta_N^\top \mu,\ \lambda^2\, \hat\beta_N^\top \Sigma \hat\beta_N\right)$. The remaining proof follows similar lines as the proof of Theorem 4.1.

Proof of Equation 6. Note that $\mathbb{E}[\hat\beta_N] = \beta_*$. The second term on the right-hand side can then be rewritten by taking the expectation conditional on $X$, which vanishes because $\mathbb{E}[\hat\beta_N] = \beta_*$; by the tower property, we have the desired result.

Lemma A.2. We assume:
1. $\phi'(x) \le C$ for $x \in \mathbb{R}$ and $\phi'(x) \ge c_0 > 0$ for $x \in [l, r]$.
2. $\mathbb{P}\left(\hat\beta_N^\top \mu \in [\mu_l, \mu_r]\right) \ge c_1 > 0$ and $l \le \mu_l + t_l\, q_{1-\alpha}(Z) \le \mu_r + 2 t_r\, q_{1-\alpha}(Z) \le r$ for some $t_l < t_r$.
Then, the following holds for $t \in [t_l, t_r]$ and $\alpha \le 0.2$.

Proof. The upper bound is immediate. Now, we focus on the lower bound.

Log Loss reduction (columns: $\alpha$ from 2% to 10%):

Vanilla             -0.27% -0.26% -0.25% -0.23% -0.21% -0.20% -0.19% -0.18% -0.17%
Histogram Binning   -0.08% -0.07% -0.06% -0.05% -0.04% -0.04% -0.04% -0.03% -0.03%
Platt Scaling       -0.05% -0.05% -0.05% -0.04% -0.04% -0.04% -0.03% -0.03% -0.03%
Scaling-Binning     -0.09% -0.07% -0.06% -0.06% -0.05% -0.04% -0.04% -0.04% -0.04%
Isotonic Regression -0.05% -0.05% -0.04% -0.04% -0.04% -0.03% -0.03% -0.03% -0.03%

We further check the performance using different bin numbers $M \in \{30, 40, 60, 70\}$ in Tables 12 to 15. We observe that regardless of the number of bins we choose, our method always outperforms. A surviving fragment of these tables (values with standard errors; the first row's method label is missing in this copy):

                    0.0372±0.0008 0.0318±0.0007 0.0397±0.0010 0.0346±0.0009
Scaling-Binning     0.0401±0.0010 0.0349±0.0009 0.0425±0.0010 0.0373±0.0009
Isotonic Regression 0.0416±0.0010 0.0364±0.0010 0.0452±0.0012 0.0399±0.0012

We then check the performance using different $S \in \{4, 5, 6\}$ in Tables 16 to 18. We find that different values of $S$ have similar performance, except that for the Vanilla method, VAD+Vanilla with $S = 6$ is noticeably better.
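The truncated-normal inequality invoked at the end of the proof of Theorem 4.1, $\mathbb{E}[Z\,\mathbb{I}\{Z \in [q_{1-\alpha}(Z), 2q_{1-\alpha}(Z)]\} \mid Z \ge q_{1-\alpha}(Z)] \ge \tfrac{1}{2}\mathbb{E}[Z \mid Z \ge q_{1-\alpha}(Z)]$ for $\alpha \le 0.2$, can be verified by a quick Monte Carlo check (our own sketch, assuming standard normal $Z$):

```python
import numpy as np

rng = np.random.default_rng(1)
z = rng.normal(size=2_000_000)
for alpha in (0.01, 0.05, 0.10, 0.20):
    q = np.quantile(z, 1.0 - alpha)
    tail = z[z >= q]                                # condition on Z >= q
    lhs = np.where(tail <= 2.0 * q, tail, 0.0).mean()  # E[Z 1{q<=Z<=2q} | Z>=q]
    rhs = 0.5 * tail.mean()                            # (1/2) E[Z | Z>=q]
    assert lhs >= rhs, (alpha, lhs, rhs)
```

For all four values of $\alpha$ the left-hand side comfortably exceeds the right-hand side (e.g., roughly $0.92$ versus $0.70$ at $\alpha = 0.2$), consistent with the bound used in the proof.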

