NO-REGRET LEARNING IN REPEATED FIRST-PRICE AUCTIONS WITH BUDGET CONSTRAINTS

Anonymous

Abstract

Recently the online advertising market has exhibited a gradual shift from second-price auctions to first-price auctions. Although there is a line of work concerning online bidding strategies in first-price auctions, it remains open how to handle budget constraints in this problem. In the present paper, we initiate the study of how a buyer with a budget can learn her online bidding strategy in repeated first-price auctions. We propose an RL-based bidding algorithm against the optimal non-anticipating strategy under stationary competition. Our algorithm obtains Õ(√T) regret if the bids are all revealed at the end of each round, where Õ(·) is a variant of the big-O notation that hides logarithmic factors. Under the restriction that the buyer only sees the winning bid after each round, our modified algorithm obtains Õ(T^{7/12}) regret by techniques developed from survival analysis. Our analysis extends to the more general scenario where the buyer has any bounded instantaneous utility function, with regrets of the same order. Simulation experiments show that the constant factor inside the regret bound is rather small.

1. INTRODUCTION

There has been extensive growth in the online advertising market in recent years. It was estimated that the volume of online advertising worldwide would reach 500 billion dollars in 2022 (Statista, 2021). In such a market, advertising platforms use auctions to allocate ad opportunities. Typically, each advertiser has a limited amount of capital for an advertisement campaign; consecutive rounds of competition are therefore interconnected by the budgets of participating advertisers. Furthermore, each advertiser has very limited knowledge of 1) her valuation of certain keywords and 2) the competitors she is facing. Many works have been devoted to studying algorithms for learning how to optimally spend the budget in repeated second-price auctions (see Section 1.1). In practice, on the other hand, we have witnessed numerous switches from second-price auctions to first-price auctions in the online advertising market. A recent remarkable example is Google AdSense's integrated move at the end of 2021 (LLC, 2021); earlier examples include AppNexus, Index Exchange, and OpenX (Sluis, 2017). This industry-wide shift is due to various factors, including a fairer transactional process and increased transparency. The shift to first-price auctions thus lends major importance to the following open question, barely considered in previous works: How should budget-constrained advertisers learn to compete in repeated first-price auctions? This paper initiates the study of learning to bid with budget constraints in repeated first-price auctions. The application of first-price auctions with budgets is not limited to the online advertising setting mentioned above: traditional competitive environments like the mussel trade in the Netherlands (van Schaik et al., 2001), modern price competition, and procurement auctions (e.g. the U.S. Treasury Securities auction (Chari and Weber, 1992)) are examples as well.

Challenges and contributions

The challenges in this setting are two-fold. The first challenge relates to the specific information structure of first-price auctions. In practice, it is often the case that only the highest bid is revealed to all participants (Esponda, 2008). This is known as censored feedback, or an informational version of the winner's curse, in the literature (Capen et al., 1971). It affects the information structure of learning, since the buyer learns less when she wins. This makes the problem challenging compared to standard contextual bandits (cf. Section 1.1). The second challenge is more fundamental: strategies in first-price auctions are notoriously complex to analyze, even in the static case (Lebrun, 1996). To get an intuitive feeling for this difficulty compared with repeated second-price auctions, let us consider the offline case where the opponents' bids are all known. Given the budget, the problem for second-price auctions reduces to a pure knapsack problem, where the budget is regarded as the weight capacity and prices as weights. This structure enables mature techniques, including duality theory, to be applied to study the benchmark strategy. Unfortunately, in first-price auctions, since the payment depends on the buyer's own bid, the previous approach/benchmark is not directly usable. We provide a concrete example to further illustrate such difficulties.

Example 1.1. Consider a case where the buyer's value v follows a uniform distribution on [0.4, 1] and the highest bid m of her opponents follows a uniform distribution on [0, 0.5]. The time horizon is T and the buyer's budget is B = 0.5T. The first-best benchmark (an anticipating[1] strategy that knows her values and her opponents' bids) can be viewed as a knapsack problem:

    max_{b_1,...,b_T}  E_v,m [ Σ_{t=1}^T (v_t − b_t) 1{b_t ≥ m_t} ]
    subject to  Σ_{t=1}^T 1{b_t ≥ m_t} b_t ≤ B  for all (v_t)_{t=1}^T, (m_t)_{t=1}^T,

where v_t is her value and m_t is her opponents' highest bid at time t.
The buyer wants to determine each b_t to maximize her utility. In hindsight, she should pay as close to m_t as possible. Using the theory of knapsack, the first-best utility is T · E[(v − m) 1{v ≥ m}] ≈ 0.45T. By contrast, the optimal non-anticipating bidding strategy in a first-price auction is to bid v/2, and the utility is T · E[(v/2) 1{v/2 ≥ m}] = 0.26T. There is thus already an Ω(T) separation between the first-best benchmark and the optimal non-anticipating strategy, even in the ideal case with full distributional knowledge. This example shows that the simple characterization of the optimum in Balseiro and Gur (2019) is not applicable to our problem. Furthermore, it remains unclear what methodology can be applied in first-price auctions with budgets. The state-of-the-art adaptive pacing strategy degenerates to truthful bidding as the budget increases, so in first-price auctions it may result in near-zero reward and thus cannot have any theoretical guarantee (see Balseiro and Gur, 2019, §2.4 for further discussion). The present paper takes the first step to combat the challenges mentioned above with a dynamic programming approach. Correspondingly, our contribution is also two-fold:
• We provide an RL-based learning algorithm. Through a characterization of the optimal strategy, we obtain an Õ(√T)-regret guarantee for the algorithm in the full-feedback case[2].
• In the censored-feedback setting, by techniques developed from survival analysis, we modify our algorithm and obtain a regret of Õ(T^{7/12}).
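The two quantities in Example 1.1 can be checked with a quick Monte Carlo sketch (self-contained; the distribution parameters are those of the example):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 400_000
v = rng.uniform(0.4, 1.0, n)   # buyer's value, v ~ U[0.4, 1]
m = rng.uniform(0.0, 0.5, n)   # opponents' highest bid, m ~ U[0, 0.5]

# First-best (hindsight) benchmark: win whenever v >= m and pay exactly m
first_best = np.mean(np.where(v >= m, v - m, 0.0))

# Optimal non-anticipating strategy: bid v/2 and pay the bid upon winning
half_bid = np.mean(np.where(v / 2 >= m, v - v / 2, 0.0))

print(first_best, half_bid)  # approx. 0.45 vs 0.26, the Omega(T) gap per round
```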

1.1. RELATED WORK

Repeated second-price auctions with budgets There is a flourishing literature on bidding strategies in repeated auctions with budgets. Through the lens of online learning, Balseiro and Gur (2019) identify asymptotically optimal online bidding strategies known as pacing (a.k.a. bid shading) in repeated second-price auctions with budgets. Inspired by the pacing strategy, Flajolet and Jaillet (2017) develop no-regret non-anticipating algorithms for learning with contextual information in repeated second-price auctions. Another line of work using techniques similar to the present paper includes Amin et al. (2012); Tran-Thanh et al. (2014); Gummadi et al. (2012). Gummadi et al. (2012) and Amin et al. (2012) study bidding strategies in repeated second-price auctions with budget constraints, but the former does not involve any learning and the latter does not provide any regret analysis (their estimator is biased). Tran-Thanh et al. (2014) derive regret bounds for the same scenario, but the optimization objective is the number of items won instead of value or surplus. Baltaoglu et al. (2017) also use dynamic programming to tackle repeated second-price auctions with budgets; however, they assume per-round budget constraints, and their dynamic programming algorithm is for allocating bids among multiple items. Again, we emphasize that no prior work exists on repeated first-price auctions with budgets, since the structure of the problem (compared to second-price variants) is fundamentally different (recall Example 1.1).

Repeated first-price auctions without budgets Two notable works concerning repeated first-price auctions are Han et al. (2020b) and Han et al. (2020a). Han et al. (2020b) introduce a new problem called monotone group contextual bandits and obtain an O(√T ln² T)-regret algorithm for repeated first-price auctions without budget constraints under stationary settings. This bound is improved to O(T^{1/3+ε}) by Achddou et al. (2021) with additional assumptions on the distributions. Han et al. (2020a) concentrate on an adversarial setting and develop a minimax-optimal online bidding algorithm with O(√T ln T) regret against all Lipschitz bidding strategies. Badanidiyuru et al. (2021) consider the problem in a contextual setting. A crucial difference is that in the present paper budgets are involved, so the algorithms from these works are not directly suitable for our needs.

Bandits with knapsacks From the bandit side, Badanidiyuru et al. (2013) develop bandit algorithms under resource constraints and show that their algorithm can be used in dynamic procurement, dynamic posted pricing with limited supply, etc. However, since the bidder observes her value before bidding in our problem, the results of Badanidiyuru et al. (2013) cannot be directly applied to our setting. Our setting also relates to contextual bandit problems with resource constraints (Badanidiyuru et al., 2014; Agrawal and Devanur, 2016; Agrawal et al., 2016). Nevertheless, applying this contextual bandit approach requires discretizing the action space, which needs Lipschitz continuity of the distributions; our approach does not rely on any continuity assumption. Further, the resulting performance guarantee (typically O(T^{2/3})) is worse than ours, and it does not fit our information structure when the feedback is censored.

2. PRELIMINARIES

Auction mechanism We consider a repeated first-price auction with budgets. Specifically, we suppose that the buyer has a limited budget B to spend over a time horizon of T ≤ +∞ rounds (T can be unknown to her). At the beginning of each round t = 1, 2, . . . , T, the buyer privately observes a value v_t for a fresh copy of the item and bids b_t according to her past observations h_t and value v_t. Denote the strategy she employs by π : (v_t, B_t, h_t) → b_t, which maps her current budget B_t, value v_t and past history h_t to a bid. Let m_t be the maximum bid of the other bidders. Since the auction is first-price, if b_t is at least m_t, the buyer wins the auction, is charged b_t, and obtains a utility of v_t − b_t; otherwise, she loses and the utility is 0. Therefore, the instantaneous utility of the buyer is r_t = (v_t − b_t) 1{b_t ≥ m_t}. The exact information structure of the history the buyer observes is dictated by how the mechanism reveals m_t. In full generality, we assume that the feedback is censored, i.e. only the highest bid is revealed at the end of each round, so the winner does not observe m_t exactly. This is considered an informational version of the winner's curse (Capen et al., 1971) and is of practical interest (Esponda, 2008). For modeling purposes, we suppose that ties are broken in favor of the buyer, but this choice is arbitrary and by no means a limitation of our approach. Next, we state the assumptions on m_t and v_t. Without loss of generality, we assume that b_t, m_t, v_t are normalized to lie in [0, 1]. In the present paper, we consider a stochastic setting where m_t and v_t are drawn from distributions F and G, respectively, both unknown to the buyer and independent of the past. We also refer to the cumulative distribution functions of F and G by the same symbols. No further assumptions are made on F or G.
Now, the expected instantaneous utility of the buyer at time t with strategy π is R_π(v_t, b_t) = E_{m_t∼F}[r_t] = (v_t − b_t)F(b_t). To argue for the reasonableness of the i.i.d. assumption, note that although other buyers may also exhibit learning behavior, in a real advertising market there are typically a large number of buyers (Kahng et al., 2004). A specific buyer faces a different small portion of them in each round and is completely oblivious of whom she is facing. Therefore, the buyer's sole objective is to maximize her utility (instead of fooling other buyers), and by the law of large numbers, the price m_t and value v_t the buyer observes are independent and identically distributed, at least for a period of time[3].

Buyer's target and regret

The buyer aims to maximize her long-term accumulated utility subject to the budget constraint. Recall that the instantaneous utility of the buyer is r_t = (v_t − b_t) 1{b_t ≥ m_t}. The payment is c_t = b_t 1{b_t ≥ m_t}, and the budget decreases accordingly as payments incur. She can continue to bid as long as the budget has not run out, but must stop at

    τ* = min{ T + 1, min{ t ∈ N : Σ_{τ=1}^t c_τ ≥ B } + 1 }.

The buyer's problem now becomes determining how much to bid in each round to maximize her accumulated utility. In line with Gummadi et al. (2012); Golrezaei et al. (2019); Deng and Zhang (2021), the buyer adopts a discount factor λ ∈ (0, 1). She takes discounts since she knows neither T nor τ*: discount factors can be interpreted as the estimate of the probability that the repeated auction will last for at least t rounds (Devanur et al., 2014; Drutsa, 2018), meaning that the process terminates at each round with probability 1 − λ (Uehara et al., 2021). On the economic side, in important real-world markets like online advertising platforms, buyers are impatient for opportunities, since companies of different sizes have different capabilities; discount factors model how impatient[4] the buyer is (Drutsa, 2017; Vanunts and Drutsa, 2019). The buyer's optimization problem is then to determine a non-anticipating strategy π for

    max_π  E_{v∼G^T, m∼F^T} [ Σ_{t=1}^T λ^{t−1} r_t ]
    subject to  Σ_{t=1}^T 1{b_t ≥ m_t} b_t ≤ B  for all (v_t)_{t=1}^T, (m_t)_{t=1}^T,

where b_t = π(v_t, B_t, h_t). Here, v := (v_1, . . . , v_T) denotes the sequence of private values the buyer observes, and m := (m_1, . . . , m_T) the sequence of highest bids of the other bidders. V_π(B, t) denotes the expected accumulated utility of strategy π with budget B starting from time t. Let π* denote the optimal bidding strategy with knowledge of the underlying distributions F and G; the corresponding expected accumulated utility is V_{π*}(B, t).
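To make the protocol concrete, here is a minimal simulator of one trajectory of the budgeted repeated auction (a sketch only; the uniform choices for F and G are illustrative assumptions, not part of the model):

```python
import numpy as np

def simulate(policy, B, T, lam, rng):
    """Run one trajectory: draw v_t ~ G and m_t ~ F, bid b_t = policy(v_t, B_t),
    collect discounted utility lam^(t-1) * (v_t - b_t) on wins, and stop once
    the budget runs out (the stopping time tau* of Section 2)."""
    budget, total = B, 0.0
    for t in range(1, T + 1):
        v = rng.uniform(0.4, 1.0)           # illustrative G
        m = rng.uniform(0.0, 0.5)           # illustrative F
        b = min(policy(v, budget), budget)  # never bid beyond remaining budget
        if b >= m:                          # ties favor the buyer
            total += lam ** (t - 1) * (v - b)
            budget -= b
            if budget <= 0:
                break
    return total

# Example: the bid-half policy from Example 1.1
rng = np.random.default_rng(1)
u = simulate(lambda v, B: v / 2, B=10.0, T=1000, lam=0.99, rng=rng)
```

The discounted utility of any policy is bounded by 1/(1 − λ), since each per-round utility lies in [0, 1].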
(We sometimes write V(·, ·) for V_{π*}(·, ·) for convenience in the rest of the paper.) We now define the regret. First, write the per-episode suboptimality of round t as SubOpt_t(π_t) = V_{π*}(B_t, t) − V_{π_t}(B_t, t), where π_t is the strategy used in round t. Our evaluation metric is the sum of suboptimalities over t = 1, . . . , T, namely

    Regret(T) = E[ Σ_{t=1}^T SubOpt_t(π_t) ],      (1)

where the expectation is implicitly taken over the realizations of v and of the others' bids. This definition of regret comes from the traditional reinforcement learning (RL) literature on infinite-horizon discounted models (Kaelbling et al., 1996). It is also inspired by recent advances in Yang et al. (2021); He et al. (2021); Liu and Su (2020); Zhou et al. (2021); Zhou et al. (2021) call it cumulative error. It reflects the suboptimality of π_t in learning the optimal value of attending the auction. In the most common scenario, such as Balseiro and Gur (2019), the budget scales linearly with the time horizon T, i.e. B/T = Θ(1). Therefore, a bidder expects to win Θ(T) rounds. With a suboptimal policy, it is easy to suffer Θ(T) regret, which is intolerable for bidders. This raises the challenge of achieving sublinear regret in first-price auctions, and we design algorithms to answer the question.

3. BIDDING ALGORITHM AND ANALYSIS

In this section, we present our bidding algorithm and the high-level ideas in the analysis of regret. We first consider the case where the feedback is not censored, i.e. the buyer is aware of m t no matter whether she wins or not. Then we extend our algorithm to the case where the feedback is censored with techniques developed from survival analysis.

3.1. FULL FEEDBACK

When F and G are known, the buyer's problem can be viewed as an offline one. The technical challenge lies in the observation that, even when the distributions are known, the buyer's problem cannot be directly analyzed as a knapsack problem. To tackle this challenge, we use a dynamic programming approach. In particular, the optimal strategy π* satisfies the following Bellman equation:

    b*(B_τ, v) ∈ argmax_b { [(v − b) + λV(B_τ − b, τ + 1)] F(b) + λV(B_τ, τ + 1)(1 − F(b)) },
    V(B_τ, τ) = E_v{ [(v − b*) + λV(B_τ − b*, τ + 1)] F(b*) + λV(B_τ, τ + 1)(1 − F(b*)) },

for all τ ∈ N and 0 ≤ B_τ ≤ B; for any B_τ < 0, V(B_τ, τ) = −∞. By choosing appropriate initialization conditions, we can solve the equation recursively and obtain the optimal bidding strategy together with the function V(·, ·). The recursion also characterizes the optimal solution, which will be used in the analysis later. When the buyer does not have knowledge of F and G, she can learn the distributions from past observations. It is therefore natural to maintain estimates F̂ and Ĝ of the true distributions. Our algorithm for the full-feedback case is depicted in Algorithm 1. To ease the technical load, we first assume knowledge of G and only estimate F in Algorithm 1; later, we add the estimation of G, whose analysis is presented in Theorem 3.2.

Algorithm 1 Algorithm for the full-feedback case
1: Input: initial budget B and constant C1    ▷ C1 is an arbitrary positive constant
2: Initialize the estimate F̂ of F to the uniform distribution over [0, 1] and B_1 ← B
3: for t = 1, 2, . . . do
4:   Observe the value v_t in round t
5:   Let t0 be the smallest integer that satisfies λ^{t0−t}/(1 − λ) < C1/√t
6:   Set V_F̂(B_{t0}, t0) = 0 for any B_{t0}    ▷ V_F̂ is the algorithm's estimate of V
7:   for τ = t0, t0 − 1, . . . , t do
8:     Q_{v,F̂}(B_τ, τ, b) ← [(v − b) + λV_F̂(B_τ − b, τ + 1)] F̂(b) + λV_F̂(B_τ, τ + 1)(1 − F̂(b))
9:     Solve the optimization problem b*_τ ← argmax_b Q_{v,F̂}(B_τ, τ, b)
10:    V_F̂(B_τ, τ) ← E_{v∼G}[Q_{v,F̂}(B_τ, τ, b*_τ)]
11:  end for
12:  Place a bid b̂_t ← argmax_b Q_{v,F̂}(B_t, t, b)
13:  Observe m_t, c_t and r_t from this round of auction and update F̂(x) = (1/t) Σ_{i=1}^t 1{m_i ≤ x}
14:  B_{t+1} ← B_t − c_t. If B_{t+1} ≤ 0 then halt
15: end for

Similar to prior work (Amin et al., 2012), Algorithm 1 performs exploration and exploitation simultaneously. The buyer initializes the estimate of F to a uniform distribution (Line 2). At round t, the buyer observes her valuation. Then, she uses her estimate of F to solve the dynamic programming problem recursively[5] and obtain an estimate of the optimal bid (Lines 7-11). To provide a base case for the recursion, note that for sufficiently large t0 ≫ t, the impact of V_F̂(·, t0) on V_F̂(·, t) is negligible due to the discount λ^{t0−t}; the buyer can therefore approximate V_F̂(·, t0) by zero (Line 5). Finally, the auction proceeds with m_t, r_t, c_t revealed, and the buyer updates her information accordingly (Lines 13-14).

Analysis of regret

To analyze the algorithm, we first assume that Algorithm 1 knows the distribution G exactly and establish the regret; then we add the contribution of estimating F and G jointly.

Theorem 3.1. When F is unknown, the worst-case regret of Algorithm 1 is Õ(√T), where the regret is computed according to Equation (1). Explicitly, if we take C1 = 1,

    Regret(T) ≤ [ 4 ln(√(2T)) (1 + λ)/(1 − λ)³ + 4/(1 − λ) ] √T + 1.

To show an application of the result, let the budget B scale linearly with T as in Balseiro and Gur (2019); Flajolet and Jaillet (2017). Specifically, assume that T < +∞ and there exists a constant β such that B = βT; then the regret is O(√T) in this special case. Indeed, under this condition we can simply set t0 = T + 1 and V_F̂(B_{T+1}, T + 1) = 0 for any B_{T+1} in Algorithm 1. Therefore, C1 = 0 and the worst-case regret is bounded by 4 ln(√(2T)) (1 + λ)/(1 − λ)³ · √T + 1.

Next, we deal with the case where G is also initially unknown. Based on Algorithm 1, we additionally maintain an estimate Ĝ of G from past observations of valuations. Ĝ is initialized to a uniform distribution and is used to solve the dynamic programming problem (see Line 7 of Algorithm 2). Using similar techniques as before (with more work), we obtain the following theorem.

Theorem 3.2. When F and G are both unknown, the worst-case regret of Algorithm 1 using empirical distribution functions to estimate F and G is Õ(√T). Explicitly, if we take C1 = 1,

    Regret(T) ≤ [ 6(1 + λ) ln(2T)/(1 − λ)³ + 4/(1 − λ) ] √T + 1.

3.2. CENSORED FEEDBACK

In this subsection, we deal with the case where the buyer only sees the winner's bid after each round; in other words, the feedback is left-censored. Concretely, the buyer's observation is o_t = max{b_t, m_t}. When she wins, the exact value of m_t is not revealed: the buyer only knows that m_t is no larger than her bid in the current round. For estimating the distribution of m_t, there is a classical statistical tool, the Kaplan-Meier (KM) estimator (Kaplan and Meier, 1958), developed for exactly this kind of scenario. However, the KM estimator assumes that the sequence (m_t)_{t=1}^T is deterministic, which does not fit our needs. Although the extension of Suzukawa (2004) allows random censorship, it requires independence between b_t and m_t, which is not realistic here since we use the estimated distribution to place bids. To tackle this problem, we introduce an estimator proposed by Zeng (2004), denoted F̂_n, in place of the empirical distribution used in Algorithm 1.

Estimation procedure We now present the procedure for estimating F under censored feedback. It suffices to estimate the distribution function of 1 − m_t, which is right-censored by 1 − b_t. Let y_t = min{1 − m_t, 1 − b_t} and r_t = 1{m_t ≥ b_t}. The observations can now be described as (y_t, r_t, h_t)_{t=1}^T. Roughly speaking, to decouple the dependency between m_t and b_t, we use the fact that b_t and m_t are independent conditional on h_t; intuitively, the history h_t provides the information needed to extract enough effective samples for m_t. Next, we fit models for the hazard rate functions[6] of 1 − m_t and 1 − b_t using h_t. With the hazard rate functions, we use a kernelized maximum likelihood method to compute the final estimate F̂_t and obtain Equation (3). Details follow. We initialize a bandwidth sequence (a_t)_{t=1}^T such that log² a_t/(t a_t²) → 0, t a_t² → ∞ and t a_t⁴ → 0 as t → +∞, and a symmetric kernel function K(·, ·) ∈ C²(R²) with bounded gradient.
Now, at each time t, we compute two vectors β_t, γ_t that maximize the following Cox partial log-likelihoods (these can be regarded as loss functions of the estimation that we aim to optimize):

    f(β) = Σ_{τ=1}^t (r_τ/t) [ β^⊤ h_τ − log Σ_{i : y_i ≥ y_τ} e^{β^⊤ h_i} ],
    g(γ) = Σ_{τ=1}^t ((1 − r_τ)/t) [ γ^⊤ h_τ − log Σ_{i : y_i ≥ y_τ} e^{γ^⊤ h_i} ].      (2)

We arbitrarily pad h_1, . . . , h_t with zeros to make their lengths equal (we will show in a moment that this is without loss of generality). Compute Z_t = (β_t^⊤ h_t, γ_t^⊤ h_t)^⊤. The survival function of 1 − m_t, or equivalently the cumulative distribution function of m_t, is then estimated following Zeng (2004) as

    F̂_t(x) = (1/t) Σ_{i=1}^t Π_{j=1}^t [ 1 − K((Z_i − Z_j)/a_t) 1{y_j ≤ x} r_j / Σ_{m=1}^t K((Z_i − Z_m)/a_t) 1{y_j ≤ y_m} ].      (3)

We are now ready to apply this estimator to design the algorithm for the censored-feedback case. Note that the new estimator's convergence rate is slower than that of the empirical distribution in the full-feedback case. Therefore, compared to Algorithm 1, Algorithm 2 is a multi-phase algorithm: it only updates the estimate of F at the end of each phase (see Figure 1 for an illustration). The other elements of each phase are similar to Algorithm 1.

Algorithm 2 Algorithm for the censored-feedback case
1: Input: initial budget B and constant C1    ▷ C1 is an arbitrary positive constant
2: Initialize the estimate F̂ of F and the estimate Ĝ of G to uniform distributions over [0, 1]
3: B_1 ← B
4: for Phase i = 1, 2, . . . do    ▷ Phase i (i > 1) lasts for 2^i rounds; Phase 1 lasts for 2 rounds
5:   for each t in the time interval of Phase i do
6:     Observe the value v_t in round t
7:     Update Ĝ(x) = (1/t) Σ_{i=1}^t 1{v_i ≤ x}

8:     Let t0 be the smallest integer that satisfies λ^{t0−t}/(1 − λ) < C1/√t
9:     Set V_{F̂,Ĝ}(B_{t0}, t0) = 0 for any B_{t0}    ▷ V_{F̂,Ĝ} is the algorithm's estimate of V
10:    for τ = t0, t0 − 1, . . . , t do    ▷ This loop can be moved to the end of each phase to reduce the invocation time from T to ln T
11:      Q_{F̂,Ĝ}(B_τ, τ, b) ← [(v − b) + λV_{F̂,Ĝ}(B_τ − b, τ + 1)] F̂(b) + λV_{F̂,Ĝ}(B_τ, τ + 1)(1 − F̂(b))
12:      Solve the optimization problem b*_τ ← argmax_b Q_{F̂,Ĝ}(B_τ, τ, b)
13:      V_{F̂,Ĝ}(B_τ, τ) ← E_{v∼Ĝ}[Q_{F̂,Ĝ}(B_τ, τ, b*_τ)]
14:    end for
15:    Place a bid b̂_t ← argmax_b Q_{F̂,Ĝ}(B_t, t, b)
16:    Observe o_t, c_t and r_t from this round of auction
17:    B_{t+1} ← B_t − c_t. If B_{t+1} ≤ 0 then halt
18:  end for
19:  Update F̂ using the estimator specified in Equation (3) with data observed before this phase
20: end for

Analysis of regret To analyze the performance of Algorithm 2, we prove a series of lemmas on the estimation error of Equation (3). We concentrate on the performance of the new estimator, since this is the major difference between Algorithm 1 and Algorithm 2. In particular, our proof relies on the following convergence result.

Lemma 3.3 (Zeng, 2004). Let F̂_n be the estimate of F after n observations. We have √n(F̂_n(1 − x) − F(1 − x)) ⇒ B(x) in ℓ^∞([0, 1]), where B(x) is a Gaussian process.
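An estimator of the form of Equation (3) can be sketched as follows. This is an illustrative implementation under assumptions not fixed by the text (a Gaussian product kernel, a single bandwidth a, and evaluation on a grid): it averages, over i, a Kaplan-Meier-type product whose risk sets are weighted by kernel proximity in the fitted Z-space.

```python
import numpy as np

def localized_km(y, delta, Z, a, u_grid):
    """For each grid point u, compute
        (1/t) * sum_i prod_{j : y_j <= u, delta_j = 1}
                [1 - K((Z_i - Z_j)/a) / sum_m K((Z_i - Z_m)/a) * 1{y_j <= y_m}],
    a kernel-localized product-limit (survival-type) estimate as in Eq. (3)."""
    t = len(y)
    out = np.zeros(len(u_grid))
    for i in range(t):
        d = (Z[i][None, :] - Z) / a
        w = np.exp(-0.5 * np.sum(d ** 2, axis=1))  # Gaussian kernel (assumption)
        surv = np.ones(len(u_grid))
        for j in range(t):
            if delta[j]:                            # uncensored observation
                risk = np.dot(w, (y >= y[j]).astype(float))
                # risk includes w[j] itself, so the factor stays in [0, 1)
                surv[u_grid >= y[j]] *= 1.0 - w[j] / risk
        out += surv
    return out / t  # survival-type curve: nonincreasing in u, values in [0, 1]

# The CDF of m at x is then read off at the transformed point 1 - x
# (cf. Lemma 3.3, which is stated for F_n(1 - x)).
```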



Footnotes:
[1] An algorithm is anticipating if bid selection depends on future observations; see Flajolet and Jaillet (2017).
[2] This is especially practical in public-sector auctions (Chari and Weber, 1992), as regulations mandate all bids to be revealed.
[3] This assumption has support from experimental evidence (Pin and Key, 2011). It can also be theoretically justified using mean-field asymptotics. Please also see Han et al. (2020b) for another justification.
[4] As an additional explanation, in budget-constrained first-price auctions the bidder always bids below or equal to her value, so she is very sensitive to the market price. However, by not winning the auction at a certain price, the bidder creates a future opportunity to win an equivalent auction at a lower price. The discount factor adds flexibility to tune this behavior: as the bidder prefers present utility over future utility, it moderates the extent of underbidding she finds optimal, which makes the model more general.
[5] For the non-trivial case B ≤ T, this can be solved in O(T^{4.5(1−λ)}) time with only O(T^{−1/2}) loss (see, e.g., Chow and Tsitsiklis, 1989).
[6] The hazard rate function of a random variable X with p.d.f. f and c.d.f. F is H_X(x) = f(x)/(1 − F(x)).






Before we proceed to apply the lemma, we verify a series of prerequisites stated in Zeng (2004) to make sure it holds. First, we make sure that, conditional on h_t, the random variables 1 − m_t and 1 − b_t are independent. Indeed, b_t is completely determined by h_t, and m_t is independent of h_t. Second, we note that the maximizations in Equation (2) are essentially Cox's proportional hazards regressions. To establish consistency of the estimator, we show that at least one of m̄_t := 1 − m_t and 1 − b_t follows Cox's proportional hazards model. That is to say, there exist β and a function f(y) such that the hazard rate function of m̄_t or 1 − b_t conditional on h_t exactly follows

    H(y | h_t) = f(y) e^{β^⊤ h_t}.      (4)

Equation (4) holds for m̄_t: taking β = 0 and f(y) = F′(1 − y)/(1 − F(1 − y)) suffices. Since we take β = 0, consistency is established regardless of how we pad h_t. Next, consider some phase at time 2^n ≤ t ≤ 2^{n+1} − 1. The estimate F̂_n is computed using the first 2^n observed data points. Applying techniques similar to those for rates of convergence of empirical processes (Norvaiša and Paulauskas, 1991), we obtain the following lemma on the performance of F̂ in Algorithm 2.

Lemma 3.4. Under the update process of Algorithm 2, for any 2^n ≤ t ≤ 2^{n+1} − 1, we have bounds on the estimation error of F̂_n, where M is a constant depending only on F and Algorithm 2.

Finally, we bound the difference between F̂_n and F with the help of Lemma 3.4.

Lemma 3.5. Recall that we use the first 2^n data points to estimate F. Under the update procedure of Algorithm 2, the deviation of F̂_n from F is bounded by a term in which C_5 is an absolute constant.

With Lemma 3.5 in hand, we now have:

Theorem 3.6. When F and G are both unknown and the feedback is censored, the worst-case regret of Algorithm 2 is Õ(T^{7/12}).

Remark 3.7. The setting in Han et al. (2020b) is a special case of ours, where there are no budget constraints and λ = 0 (thus removing the 1/(1 − λ) factor in our results).
The buyer's aim there is to maximize (v − b)F(b) in each round. This is equivalent to setting V_F̂ = 0 in our framework, with no need to estimate G, yielding regret Õ(√T) in the full-feedback case and Õ(T^{7/12}) in the censored-feedback case.

Remark 3.8. The regret bound may look unusual at first glance. The reason is that the convergence rate of the estimator is slower than in the commonly used Hoeffding-type or Bernstein-type inequalities; due to the information structure, those inequalities are, to the best of our knowledge, not applicable in our environment.
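In the special case of Remark 3.7 (no budget, λ = 0), the bid reduces to a one-line maximization of (v − b)F̂(b). A minimal sketch over a bid grid, with F̂ the empirical CDF (the grid and samples are illustrative):

```python
import numpy as np

def myopic_bid(v, m_samples, grid):
    """Maximize (v - b) * Fhat(b) over a bid grid, Fhat = empirical CDF of m."""
    F = np.mean(m_samples[None, :] <= grid[:, None], axis=1)
    return grid[np.argmax((v - grid) * F)]

# With m ~ U[0, 0.5] as in Example 1.1, (v - b) * F(b) = (v - b) * 2b for
# b <= 0.5, whose maximizer is b = v/2 -- matching the bid-half strategy.
```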

4. LOWER BOUND

Here, we discuss the lower bound for the regret under such settings. This will shed light on the optimality gap of the proposed policies.

Full Feedback

We have proposed a general solution framework that works for any β, where the budget is B = βT, and any discount rate λ ∈ [0, 1). Note that a general lower bound is no smaller than the lower bound for any specific case. Consider the case β = 1 and λ = 0: our problem reduces to one where the buyer essentially faces no budget constraint and is extremely myopic. Under this circumstance, the problem is a multi-armed bandit problem, for which Auer et al. (2002) show an Ω(√T) regret lower bound. This means that our algorithm is optimal up to logarithmic terms.

Censored Feedback

The Ω(√T) lower bound also applies here. However, the upper and lower bounds do not yet match. We leave this as an intriguing open problem, as there is a lack of literature on regret lower bounds in the censored-feedback case. Nevertheless, we provide some evidence that our upper bound is sufficiently good: for example, parallel to our work, Gaitonde et al. (2022) study a related budgeted setting with a comparable regret bound (see also Castiglioni et al., 2020).

5. DISCUSSION AND CONCLUSION

In this paper, we develop a learning algorithm to adaptively bid in repeated first-price auctions with budgets. On the theoretical side, our algorithm, together with its analysis yielding Õ(√T) regret in the full-feedback case and Õ(T^{7/12}) regret in the censored-feedback case, takes the first step in understanding the problem. On the practical side, our algorithm is simple and readily applicable in a digital world that has shifted to first-price auctions. Several questions present themselves for future exploration. We observe that in the limiting case λ → 1, the optimal bidding strategy in Algorithm 2 resembles a pacing strategy, which relates to an open question raised in Balseiro and Gur (2019): as λ → 1, the optimal bid of Algorithm 2 takes the form v_t/(1 + x_t), where x_t is a pacing multiplier that depends only on B_t and F and can be computed without solving the dynamic programming problem. This observation can be viewed as a corollary of Theorem 3.1 of Gummadi et al. (2012). The connection between Algorithm 2 and pacing suggests further investigation. Other immediate open questions include closing the gap between the upper and lower bounds in the censored-feedback case.

