CONTEXTUAL BANDITS WITH CONCAVE REWARDS, AND AN APPLICATION TO FAIR RANKING

Abstract

We consider Contextual Bandits with Concave Rewards (CBCR), a multi-objective bandit problem where the desired trade-off between the rewards is defined by a known concave objective function, and the reward vector depends on an observed stochastic context. We present the first algorithm with provably vanishing regret for CBCR without restrictions on the policy space, whereas prior works were restricted to finite policy spaces or tabular representations. Our solution is based on a geometric interpretation of CBCR algorithms as optimization algorithms over the convex set of expected rewards spanned by all stochastic policies. Building on Frank-Wolfe analyses in constrained convex optimization, we derive a novel reduction from the CBCR regret to the regret of a scalar-reward bandit problem. We illustrate how to apply the reduction off-the-shelf to obtain algorithms for CBCR with both linear and general reward functions, in the case of non-combinatorial actions. Motivated by fairness in recommendation, we describe a special case of CBCR with rankings and fairness-aware objectives, leading to the first algorithm with regret guarantees for contextual combinatorial bandits with fairness of exposure.

1. INTRODUCTION

Contextual bandits are a popular paradigm for online recommender systems that learn to generate personalized recommendations from user feedback. These algorithms have mostly been developed to maximize a single scalar reward which measures recommendation performance for users. Recent fairness concerns have shifted the focus towards item producers, who are also impacted by the exposure they receive (Biega et al., 2018; Geyik et al., 2019), leading to the optimization of trade-offs between recommendation performance for users and fairness of exposure for items (Singh & Joachims, 2019; Zehlike & Castillo, 2020). More generally, there is increasing pressure to acknowledge the multi-objective nature of recommender systems (Vamplew et al., 2018; Stray et al., 2021), which need to optimize several engagement metrics and account for multiple stakeholders' interests (Mehrotra et al., 2020; Abdollahpouri et al., 2019). In this paper, we focus on the problem of contextual bandits with multiple rewards, where the desired trade-off between the rewards is defined by a known concave objective function, which we refer to as Contextual Bandits with Concave Rewards (CBCR). Concave rewards are particularly relevant to fair recommendation, where several objectives can be expressed as (known) concave functions of the (unknown) utilities of users and items (Do et al., 2021). Our CBCR problem is an extension of Bandits with Concave Rewards (BCR) (Agrawal & Devanur, 2014) where the vector of multiple rewards depends on an observed stochastic context. We address this extension because contexts are necessary to model the user/item features required for personalized recommendation. Compared to BCR, the main challenge of CBCR is that optimal policies depend on the entire distribution of contexts and rewards. In BCR, optimal policies are distributions over actions, and are found by direct optimization in policy space (Agrawal & Devanur, 2014; Berthet & Perchet, 2017).
In CBCR, stationary policies are mappings from a continuous context space to distributions over actions. This makes existing BCR approaches inapplicable to CBCR because the policy space is not amenable to tractable optimization without further assumptions or restrictions. As a matter of fact, the only prior theoretical work on CBCR is restricted to a finite policy set (Agrawal et al., 2016). We present the first algorithms with provably vanishing regret for CBCR without restriction on the policy space. Our main theoretical result is a reduction where the CBCR regret of an algorithm is bounded by its regret on a proxy bandit task with a single (scalar) reward. This reduction shows that it is straightforward to turn any contextual (scalar-reward) bandit algorithm into an algorithm for CBCR. We prove this reduction by first re-parameterizing CBCR as an optimization problem in the space of feasible rewards, and then revealing connections between Frank-Wolfe (FW) optimization in reward space and a decision problem in action space. This bypasses the challenges of optimization in policy space. To illustrate how to apply the reduction, we provide two example algorithms for CBCR with non-combinatorial actions, one for linear rewards based on LinUCB (Abbasi-Yadkori et al., 2011), and one for general reward functions based on the SquareCB algorithm (Foster & Rakhlin, 2020), which uses online regression oracles. In particular, we highlight that our reduction can be used together with any exploration/exploitation principle, while previous FW approaches to BCR relied exclusively on upper confidence bounds (Agrawal & Devanur, 2014; Berthet & Perchet, 2017; Cheung, 2019). Since fairness of exposure is our main motivation for CBCR, we show how our reduction also applies to the combinatorial task of fair ranking with contextual bandits, leading to the first algorithm with regret guarantees for this problem, and we show it is computationally efficient.
We compare the empirical performance of our algorithm to relevant baselines on a music recommendation task. Related work. Agrawal et al. (2016) address a restriction of CBCR to a finite set of policies, where explicit search is possible. Cheung (2019) uses FW for reinforcement learning with concave rewards, a problem similar to CBCR. However, they rely on a tabular setting where there are few enough policies to compute them explicitly. Our approach is the only one to apply to CBCR without restriction on the policy space, by removing the need for explicit representation and search of optimal policies. Our work is also related to fairness of exposure in bandits. Most previous works on this topic either do not consider rankings (Celis et al., 2018; Wang et al., 2021; Patil et al., 2020; Chen et al., 2020), or apply to combinatorial bandits without contexts (Xu et al., 2021). Both these restrictions are impractical for recommender systems. Mansoury et al. (2021); Jeunen & Goethals (2021) propose heuristics, supported experimentally, that handle both rankings and contexts, but lack theoretical guarantees. We present the first algorithm with regret guarantees for fair ranking with contextual bandits. We provide a more detailed discussion of the related work in Appendix A.

2. MAXIMIZATION OF CONCAVE REWARDS IN CONTEXTUAL BANDITS

Notation. For any n ∈ N, we denote [n] = {1, . . . , n}. The dot product of two vectors x and y in R^n is denoted either x^⊺y or, in bra-ket notation, ⟨x | y⟩, depending on which is more readable.

Setting. We define a stochastic contextual bandit (Langford & Zhang, 2007) problem with D rewards. At each time step t, the environment draws a context x_t ∼ P, where x_t ∈ X ⊆ R^q and P is a probability measure over X. The learner chooses an action a_t ∈ A, where A ⊆ R^K is the action space, and receives a noisy multi-dimensional reward r_t ∈ R^D with expectation E[r_t | x_t, a_t] = µ(x_t)a_t, where µ : X → R^{D×K} is the matrix-valued contextual expected reward function. 1 The trade-off between the D cumulative rewards is specified by a known concave function f : R^D → R ∪ {±∞}. Let Ā denote the convex hull of A and π : X → Ā be a stationary policy, 2 then the optimal value for the problem is defined as f* = sup_{π:X→Ā} f(E_{x∼P}[µ(x)π(x)]). We rely on either of the following assumptions on f:

Assumption A. f is closed proper concave 3 on R^D and A is a compact subset of R^K. Moreover, there is a compact convex set K ⊆ R^D such that:
• (Bounded rewards) ∀(x, a) ∈ X × A, µ(x)a ∈ K, and for all t ∈ N*, r_t ∈ K with probability 1.
• (Local Lipschitzness) f is L-Lipschitz continuous with respect to ∥.∥₂ on an open set containing K.

Assumption B. Assumption A holds and f has C-Lipschitz-continuous gradients w.r.t. ∥.∥₂ on K.

The most general version of our algorithm, described in Appendix D, removes the need for the smoothness assumption using smoothing techniques; we describe an example in Section 3.3. In the rest of the paper, we denote by D_K = sup_{z,z′∈K} ∥z − z′∥₂ the diameter of K, and use C̄ = (C/2) D_K². We now give two examples of this problem setting, motivated by real-world applications in recommender systems, and which satisfy Assumption A. Example 1 (Optimizing multiple metrics in recommender systems.) Mehrotra et al.
(2020) formalized the problem of optimizing D engagement metrics (e.g. clicks, streaming time) in a bandit-based recommender system. At each t, x_t represents the current user's features. The system chooses one arm among K, represented by a vector a_t in the canonical basis of R^K, which is the action space A. Each entry of the observed reward vector (r_{t,i})_{i∈[D]} corresponds to a metric's value. The trade-off between the metrics is defined by the Generalized Gini Function: f(z) = Σ_{i=1}^D w_i z_i^↑, where (z_i^↑)_{i∈[D]} denotes the values of z sorted increasingly and w ∈ R^D is a vector of non-increasing weights.

Example 2 (Fairness of exposure in rankings.) The goal is to balance the traditional objective of maximizing user satisfaction in recommender systems against the inequality of exposure between item producers (Singh & Joachims, 2018; Zehlike & Castillo, 2020). For a recommendation task with m items to rank, this leads to a problem with D = m + 1 objectives, which correspond to the m items' exposures, plus the user satisfaction metric. The context x_t ∈ X ⊂ R^{md} is a matrix where each x_{t,i} ∈ R^d represents a feature vector of item i for the current user. The action space A is combinatorial, i.e. it is the space of rankings represented by permutation matrices:

A = { a ∈ {0,1}^{m×m} : ∀i ∈ [m], Σ_{k=1}^m a_{i,k} = 1 and ∀k ∈ [m], Σ_{i=1}^m a_{i,k} = 1 }.   (1)

For a ∈ A, a_{i,k} = 1 if item i is at rank k. Even though we use a double-index notation and call a a permutation matrix, we flatten a as a vector of dimension K = m² for consistency of notation. We now give a concrete example for f, which is concave, as is usual for objective functions in fairness of exposure (Do et al., 2021). It is inspired by Morik et al. (2020), who study trade-offs between average user utility and inequality of item exposure:

f(z) = z_{m+1} [user utility] − β (1/2m) Σ_{i=1}^m Σ_{j=1}^m |z_i − z_j| [inequality of item exposure],   (2)

where β > 0 is a trade-off parameter.

The learning problem.
In the bandit setting, P and µ are unknown and the learner can only interact online with the environment. Let h_T = (x_t, a_t, r_t)_{t∈[T−1]} be the history of contexts, actions, and rewards observed up to time T − 1 and δ′ > 0 be a confidence level; then at step t a bandit algorithm A receives as input the history h_t and the current context x_t, returns a distribution over actions A, and selects an action a_t ∼ A(h_t, x_t, δ′). The objective of the algorithm is to minimize the regret R_T = f* − f(ŝ_T) where ŝ_T = (1/T) Σ_{t=1}^T r_t. Note that our setting subsumes classical stochastic contextual bandits: when D = 1 and f(z) = z, maximizing f(ŝ_T) amounts to maximizing the cumulative scalar reward Σ_{t=1}^T r_t. In Lem. 9 (App. C.3), we show that alternative definitions of regret, with different choices of comparator or performance measure, would yield a difference of order O(1/√T), and hence not substantially change our results.
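For concreteness, the two example objectives above can be written in a few lines of Python (an illustrative sketch, not code from the paper; `ggf` and `exposure_fairness` are our own names):

```python
import numpy as np

def ggf(z, w):
    """Generalized Gini Function of Example 1: sum_i w_i * z^up_i,
    where z^up sorts z increasingly and w is non-increasing."""
    return float(np.sort(z) @ w)

def exposure_fairness(z, beta):
    """Eq. (2): user utility z[m] minus beta times the unnormalized
    Gini coefficient of the item exposures z[0:m]."""
    e = z[:-1]                       # item exposures (first m coordinates)
    m = e.shape[0]
    gini = np.abs(e[:, None] - e[None, :]).sum() / (2 * m)
    return float(z[-1] - beta * gini)
```

For instance, with w = (1, 1/2, 1/4) as in the experiments (w_i = 1/2^{i−1}), ggf([3, 1, 2], w) sorts the rewards increasingly before weighting, so the worst-performing metric receives the largest weight.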

3. FRANK-WOLFE FOR CBCR

In this section we describe our general approach for CBCR. We first derive our key reduction from CBCR to a specific scalar-reward bandit problem. We then instantiate our algorithm to the case of linear and general reward functions for smooth objectives f. Finally, we extend to the case of non-smooth objective functions using Moreau-Yosida regularization (Rockafellar & Wets, 2009).

3.1. REDUCTION FROM CBCR TO SCALAR-REWARD CONTEXTUAL BANDITS

There are two challenges in the CBCR problem: 1) the computation of the optimal policy sup_{π:X→Ā} f(E_{x∼P}[µ(x)π(x)]), even with known µ; 2) the learning problem when µ is unknown.

1: Reparameterization of the optimization problem. The first challenge is that optimizing directly in policy space for the benchmark problem sup_{π:X→Ā} f(E_{x∼P}[µ(x)π(x)]) is intractable without any restriction, because the policy space includes all mappings from the continuous context space X to distributions over actions. Our solution is to rewrite the optimization problem as a standard convex constrained problem by introducing the convex set S of feasible rewards: S = { E_{x∼P}[µ(x)π(x)] : π : X → Ā }, so that f* = sup_{π:X→Ā} f(E_{x∼P}[µ(x)π(x)]) = max_{s∈S} f(s). Under Assumption A, S is a compact subset of K (see Lemma 7 in App. C), so f attains its maximum over S. We have thus reduced the complex initial optimization problem to a concave optimization problem over a compact convex set.

2: Reducing the learning problem to scalar-reward bandits. Unfortunately, since P and µ are unknown, the set S is unknown. This precludes the possibility of directly using standard constrained optimization techniques, including gradient descent with projections onto S. We consider Frank-Wolfe, a projection-free optimization method robust to approximate gradients (Lacoste-Julien et al., 2013; Kerdreux et al., 2018). At each iteration t of FW, the update direction is given by the linear subproblem argmax_{s∈S} ⟨∇f(z_{t−1}) | s⟩, where z_{t−1} is the current iterate. Our main technical tool, Lemma 1, connects the FW subproblem in the unknown reward space S to a workable decision problem in the action space (see Lemma 13 in Appendix E for a proof):

Lemma 1. Let E_t[.] be the expectation conditional on h_t. Let z_t ∈ K be a function of contexts, actions and rewards up to time t. Under Assumption A, we have:

∀t ∈ N*,  E_t[ max_{a∈A} ⟨∇f(z_{t−1}) | µ(x_t)a⟩ ] = max_{s∈S} ⟨∇f(z_{t−1}) | s⟩.
Moreover, for all δ ∈ (0, 1], with probability at least 1 − δ, we have:

Σ_{t=1}^T [ max_{s∈S} ⟨∇f(z_{t−1}) | s⟩ − max_{a∈A} ⟨∇f(z_{t−1}) | µ(x_t)a⟩ ] ≤ L D_K √(2T ln(δ⁻¹)).

Lemma 1 shows that FW for CBCR operates closely to a sequence of decision problems of the form (max_{a∈A} ⟨∇f(z_{t−1}) | µ(x_t)a⟩)_{t=1}^T. However, we have yet to address the problem that P and µ are unknown. To solve this issue, we introduce a reduction to scalar-reward contextual bandits. Notice that solving for the sequence of actions maximizing Σ_{t=1}^T ⟨∇f(z_{t−1}) | µ(x_t)a⟩ corresponds to solving a contextual bandit problem with adversarial contexts and stochastic rewards. Formally, using z_t = ŝ_t, we define the extended context x̃_t = (∇f(ŝ_{t−1}), x_t), the average scalar reward μ̃(x̃_t) = ∇f(ŝ_{t−1})^⊺ µ(x_t), and the observed scalar reward r̃_t = ⟨∇f(ŝ_{t−1}) | r_t⟩. This fully defines a contextual bandit problem with scalar reward. The objective of the algorithm is then to minimize the following scalar regret:

R^scal_T = Σ_{t=1}^T max_{a∈A} μ̃(x̃_t)^⊺ a − Σ_{t=1}^T r̃_t = Σ_{t=1}^T max_{a∈A} ⟨∇f(ŝ_{t−1}) | µ(x_t)a⟩ − Σ_{t=1}^T ⟨∇f(ŝ_{t−1}) | r_t⟩.   (6)

In this framework, the only information observed by the learning algorithm is h̃_t := (x̃_{t′}, a_{t′}, r̃_{t′})_{t′∈[t−1]}. This regret minimization problem has been extensively studied (see e.g., Slivkins, 2019, Chap. 8 for an overview). The following key reduction result relates R^scal_T to R_T, the regret of the original CBCR problem:

Theorem 2. Under Assmpt. B, for every T ∈ N* and δ > 0, algorithm A satisfies, with prob. ≥ 1 − δ:

R_T = f* − f(ŝ_T) ≤ (1/T) ( R^scal_T + L D_K √(2T ln(1/δ)) + C̄ ln(eT) ).

The reduction shown in Thm. 2 hints at how to use or adapt scalar bandit algorithms for CBCR. In particular, any algorithm with sublinear regret will lead to a vanishing regret for CBCR. Since the worst-case regret of contextual bandits is Ω(√T) (Dani et al., 2008), we obtain near minimax optimal algorithms for CBCR.
We illustrate this with two algorithms derived from our reduction in Sec. 3.2.

Proof sketch of Theorem 2: CBCR and Frank-Wolfe algorithms (full proof in Appendix E). Although the set S is not known, the standard telescoping sum argument for the analysis of Frank-Wolfe algorithms (see Lemma 14 in Appendix E, and e.g., Berthet & Perchet, 2017, Lemma 12 for similar derivations) gives that under Assumption B, denoting g_t = ∇f(ŝ_{t−1}):

T R_T ≤ Σ_{t=1}^T max_{s∈S} ⟨g_t | s − r_t⟩ + C̄ ln(eT).

The result is true for every sequence (r_t)_{t∈[T]} ∈ K^T, and only tracks the trajectory of ŝ_t in reward space. We now introduce the reference of the scalar regret:

T R_T ≤ Σ_{t=1}^T [ max_{s∈S} ⟨g_t | s⟩ − max_{a∈A} ⟨g_t | µ(x_t)a⟩ ] + Σ_{t=1}^T [ max_{a∈A} ⟨g_t | µ(x_t)a⟩ − ⟨g_t | r_t⟩ ] + C̄ ln(eT) = Σ_{t=1}^T [ max_{s∈S} ⟨g_t | s⟩ − max_{a∈A} ⟨g_t | µ(x_t)a⟩ ] + R^scal_T + C̄ ln(eT).   (8)

Lemma 1 bounds the first sum, from which Theorem 2 immediately follows using (8).
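The reduction behind Thm. 2 can be summarized as a generic wrapper around any scalar-reward contextual bandit. The sketch below is our own illustrative Python, not the paper's code; `env` and `scalar_bandit` are hypothetical objects with `draw_context`/`step` and `select`/`update` methods:

```python
import numpy as np

def fw_cbcr(env, scalar_bandit, grad_f, T, s0):
    """Generic CBCR wrapper: at each step, feed the scalar bandit the
    extended context (g_t, x_t) with g_t = grad_f(ŝ_{t-1}); the observed
    scalar reward is <g_t, r_t>; ŝ_t is the running average of r_t."""
    s = np.array(s0, dtype=float)    # ŝ_0, any point of K
    for t in range(1, T + 1):
        x = env.draw_context()
        g = grad_f(s)                # g_t = ∇f(ŝ_{t-1})
        a = scalar_bandit.select(x, g)
        r = env.step(x, a)           # noisy reward vector r_t ∈ R^D
        scalar_bandit.update(x, g, a, float(g @ r))
        s += (r - s) / t             # ŝ_t = ŝ_{t-1} + (r_t - ŝ_{t-1}) / t
    return s
```

Any scalar bandit with sublinear R^scal_T plugged into `scalar_bandit` yields vanishing CBCR regret by Thm. 2.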

3.2. PRACTICAL APPLICATION: TWO ALGORITHMS FOR MULTI-ARMED CBCR

To illustrate the effectiveness of the reduction from CBCR to scalar-reward bandits, we focus on the case where the action space A is the canonical basis of R^K (as in Example 1). We first study the case of linear rewards. Then, for general reward functions, we introduce the FW-SquareCB algorithm, the first example of an FW-based approach combined with an exploration principle other than optimism. This shows our approach has a much broader applicability to solving (C)BCR than previous strategies.

From LinUCB to FW-LinUCB (details in Appendix G). We consider a CBCR with linear reward function, i.e., µ(x) = θx where θ ∈ R^{D×d} (recall we have D rewards) and x ∈ R^{d×K}, where d is the number of features. Let θ̃ := flatten(θ) and g_t = ∇f(ŝ_{t−1}). Using [.;.] to denote the vertical concatenation of matrices, the expected reward for action a in context x_t at time t can be written ⟨g_t | µ(x_t)a⟩ = g_t^⊺ θ x_t a = ⟨θ̃ | x̃_t a⟩, where x̃_t = [g_{t,1} x_t; . . . ; g_{t,D} x_t] ∈ R^{Dd×K} is the extended context. This is an instance of a linear bandit problem where, at each time t, action a is associated to the vector x̃_t a and its expected reward is ⟨θ̃ | x̃_t a⟩. As a result, we can immediately derive a LinUCB-based algorithm for linear CBCR by leveraging the equivalence FW-LinUCB(h_t, x_t, δ′) = LinUCB(h̃_t, x̃_t, δ′). LinUCB's regret guarantees imply R^scal_T = O(d√T) with high probability, which in turn gives R_T = O(1/√T).

From SquareCB to FW-SquareCB (details in Appendix H). We now consider a CBCR with general reward function µ(x). The SquareCB algorithm (Foster & Rakhlin, 2020) is a randomized exploration strategy that delegates the learning of rewards to an arbitrary online regression algorithm. The scalar regret of SquareCB is bounded in terms of the regret of the base regression algorithm.
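Before detailing FW-SquareCB, here is a quick numerical check of the FW-LinUCB identity ⟨g_t | µ(x_t)a⟩ = ⟨θ̃ | x̃_t a⟩ above: stacking the blocks g_{t,i} x_t vertically is a Kronecker product, so the extended feature of action a is kron(g_t, x_t a) (illustrative NumPy, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)
D, d, K = 3, 4, 5
theta = rng.normal(size=(D, d))   # unknown reward parameter
x = rng.normal(size=(d, K))       # context matrix x_t
g = rng.normal(size=D)            # g_t = ∇f(ŝ_{t-1})
a = np.eye(K)[2]                  # one-hot action

# scalar reward <g_t | µ(x_t)a> with µ(x) = θx ...
lhs = g @ theta @ x @ a
# ... equals <flatten(θ) | x̃_t a>, where x̃_t a = kron(g_t, x_t a)
# stacks g_{t,i} * (x_t a) for i = 1..D (row-major flattening)
rhs = theta.ravel() @ np.kron(g, x @ a)
assert np.allclose(lhs, rhs)
```

This is why a single LinUCB instance in dimension Dd over the extended features suffices to control R^scal_T.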
For FW-SquareCB, we have access to an online regression oracle μ̂_t, an estimate of µ as a function of h_t, whose regression regret is bounded by R_oracle(T). The exploration strategy of FW-SquareCB follows the same principles as SquareCB: let g_t = ∇f(ŝ_{t−1}) and denote μ̂_t = g_t^⊺ μ̂_t(x_t), so that μ̂_t^⊺ a = ⟨g_t | μ̂_t(x_t)a⟩. Let a_t ∈ argmax_{a∈A} μ̂_t^⊺ a and μ̂*_t = max_{a∈A} μ̂_t^⊺ a. Then A_t = FW-SquareCB(h_t, x_t, δ′) is defined by:

∀a ∈ A,  A_t(a) = 1 / (K + γ_t(μ̂*_t − μ̂_t^⊺ a))  if a ≠ a_t,   and   A_t(a_t) = 1 − Σ_{a≠a_t} A_t(a).

The two instantiations are summarized below (reward model assumption and resulting scalar regret bound, up to constants):

Algorithm | Assumption | R^scal_T bound
FW-LinUCB | µ(x)a = θxa for θ ∈ R^{D×d}, x ∈ R^{d×K} | L D_K dD √(T ln((1 + T L D_K/(dD))/δ))
FW-SquareCB | Σ_{t=1}^T ∥μ̂_t(x_t)a_t − µ(x_t)a_t∥²₂ ≤ R_oracle(T) | L √(K T (R_oracle(T) + D_K² ln(T/δ)))
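The sampling rule of FW-SquareCB is SquareCB's inverse-gap weighting applied to the scalarized scores μ̂_t^⊺ a. A minimal sketch of that rule (our own implementation, with `igw_probs` as our naming):

```python
import numpy as np

def igw_probs(scores, gamma):
    """Inverse-gap weighting: each non-greedy arm a gets probability
    1 / (K + gamma * (best_score - score_a)); the greedy arm receives
    the remaining mass. Larger gamma means less exploration."""
    scores = np.asarray(scores, dtype=float)
    K = scores.shape[0]
    best = int(np.argmax(scores))
    p = 1.0 / (K + gamma * (scores[best] - scores))
    p[best] = 0.0
    p[best] = 1.0 - p.sum()   # greedy arm takes the leftover mass
    return p
```

When all scores are equal the distribution is uniform, and as gamma grows the mass concentrates on the greedy arm, which is the intended exploration/exploitation trade-off.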

3.3. THE CASE OF NONSMOOTH f

When f is nonsmooth, we use a smoothing technique where the scalar regret is not measured using ∇f(ŝ_{t−1}), but rather using gradients of a sequence (f_t)_{t∈N} of smooth approximations of f, whose smoothness decreases over time (see e.g., Lan, 2013, for applications of smoothing to FW). We provide a comprehensive treatment of smoothing in our general approach described in Appendix D, while specific smoothing techniques are discussed in Appendix F. We now describe the use of Moreau-Yosida regularization (Rockafellar & Wets, 2009, Def. 1.22):

f_t(z) = max_{y∈R^D} [ f(y) − (√(t+1)/(2β₀)) ∥y − z∥²₂ ].

It is well known that f_t is concave and L-Lipschitz whenever f is, and that f_t is (√(t+1)/β₀)-smooth (see Lemma 15 in Appendix F). A related smoothing method was used by Agrawal & Devanur (2014) for (non-contextual) BCR. Our treatment of smoothing is more systematic than theirs, since we use a smoothing factor β₀/√(t+1) that decreases over time rather than a fixed smoothing factor that depends on a pre-specified horizon. Our regret bound for CBCR is based on a scalar regret R^{scal,sm}_T where ∇f_{t−1}(ŝ_{t−1}) is used instead of ∇f(ŝ_{t−1}):

R^{scal,sm}_T = Σ_{t=1}^T max_{a∈A} ⟨∇f_{t−1}(ŝ_{t−1}) | µ(x_t)a⟩ − Σ_{t=1}^T ⟨∇f_{t−1}(ŝ_{t−1}) | r_t⟩.

Theorem 3. Under Assumption A, for every z₀ ∈ K, every T ≥ 1 and every δ > 0, δ′ > 0, Algorithm A satisfies, with probability at least 1 − δ − δ′:

R_T ≤ R^{scal,sm}_T / T + (L D_K / √T) ( D_K/(Lβ₀) + 3Lβ₀/D_K + √(2 ln(1/δ)) ).

The proof is given in Appendix F. Taking β₀ = D_K/L leads to a simpler bound where D_K/(Lβ₀) + 3Lβ₀/D_K = 4.
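For intuition, the Moreau-Yosida envelope can be checked numerically on a 1-D nonsmooth example, f(y) = −|y|: the envelope dominates f and rounds off the kink at 0. A grid-based sketch (our own; closed forms exist for this example but are not needed for the check):

```python
import numpy as np

def moreau_envelope(f, z, eta, grid):
    """f_t(z) = max_y [ f(y) - (eta/2) * (y - z)^2 ], approximated by
    a grid search; here eta plays the role of sqrt(t+1)/beta_0."""
    return float(np.max(f(grid) - 0.5 * eta * (grid - z) ** 2))

f = lambda y: -np.abs(y)
grid = np.linspace(-5.0, 5.0, 200001)

# the envelope upper-bounds f everywhere:
for z in (-1.0, 0.0, 2.0):
    assert moreau_envelope(f, z, 1.0, grid) >= f(np.array(z)) - 1e-9

# near the kink the envelope is quadratic (-z^2/2 for |z| <= 1 when
# eta = 1), i.e. the nonsmooth corner has been smoothed out:
assert abs(moreau_envelope(f, 0.25, 1.0, grid) - (-0.5 * 0.25**2)) < 1e-3
```

As eta grows with t, the quadratic cap tightens and f_t approaches f, which is why the decreasing smoothing factor β₀/√(t+1) does not require a pre-specified horizon.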

4. CONTEXTUAL RANKING BANDITS WITH FAIRNESS OF EXPOSURE

In this section, we apply our reduction to the combinatorial bandit task of fair ranking, and obtain the first algorithm with regret guarantees in the contextual setting. This task is described in Example 2 (Sec. 2). We recall that there is a fixed set of m items to rank at each timestep t, and that actions are flattened permutation matrices (A is defined in Ex. 2, Eq. (1)). The context x_t ∼ P is a matrix x_t = (x_{t,i})_{i∈[m]} where each x_{t,i} ∈ R^d represents a feature vector of item i for the current user.

Observation model. The user utility u(x_t) is given by a position-based model with position weights b(x_t) ∈ [0,1]^m and expected value for each item v(x_t) ∈ [0,1]^m. Denoting u(x_t) the flattened version of v(x_t)b(x_t)^⊺ ∈ R^{m×m}, the user utility is (Lagrée et al., 2016; Singh & Joachims, 2018):

⟨u(x_t) | a⟩ = Σ_{i=1}^m v_i(x_t) Σ_{k=1}^m a_{i,k} b_k(x_t).

In this model, b_k(x_t) ∈ [0,1] is the probability that the user observes the item at rank k. The quantity Σ_{k=1}^m a_{i,k} b_k(x_t) is thus the probability that the user observes item i given ranking a.

Algorithm 1: FW-LinUCBRank: linear contextual bandits for fair ranking.
  input: δ′ > 0, λ > 0, ŝ₀ ∈ K
  V₀ = λI_d, y₀ = 0_d, θ̂₀ = 0_d
  1: for t = 1, . . . do
  2:   Observe context x_t ∼ P
  3:   ∀i, v̂_{t,i} ← θ̂_{t−1}^⊺ x_{t,i} + α_t^{δ′} ∥x_{t,i}∥_{V⁻¹_{t−1}}   // UCB

Assumption C. sup_{x∈X} ∥x∥₂ ≤ D_X and ∃θ ∈ R^d, ∥θ∥₂ ≤ D_θ, such that ∀x ∈ X, ∀i ∈ [m], v_i(x) = θ^⊺ x_i.

We propose an observation model where the values v_i(x) and position weights b(x) are unknown. However, we assume that at each time step t, after computing the ranking a_t, we receive two types of feedback: first, e_{t,i} ∈ {0,1} is 1 if item i has been exposed to the user, and 0 otherwise; second, c_{t,i} ∈ {0,1} represents a binary like/dislike feedback from the user.
We have:

E[e_{t,i} | x_t, a_t] = Σ_{k=1}^m a_{t,i,k} b_k(x_t),   E[c_{t,i} | x_t, e_{t,i}] = v_i(x_t) if e_{t,i} = 1, and 0 if e_{t,i} = 0.   (11)

This observation model captures well applications such as newsfeed ranking on mobile devices or dating applications where only one post/profile is shown at a time. What we gain with this model is that b(x) can depend arbitrarily on the context x, while previous work on bandits in the position-based model assumes b known and context-independent (Lagrée et al., 2016).

Fairness of exposure. There are D = m + 1 rewards, i.e., µ(x) ∈ R^{(m+1)×m²}. Denoting µ_i(x) the i-th row of µ(x), seen as a column vector, each of the m first rewards is the exposure of a specific item, while the (m+1)-th reward is the user utility:

∀i ∈ [m], ⟨µ_i(x) | a⟩ = Σ_{k=1}^m a_{i,k} b_k(x)   and   µ_{m+1}(x) = u(x).

The observed reward vector r_t ∈ R^D is defined by ∀i ∈ [m], r_{t,i} = e_{t,i} and r_{t,m+1} = Σ_{i=1}^m c_{t,i}. Notice that E[r_{t,m+1} | x_t, a_t] = ⟨u(x_t) | a_t⟩. Let K be the convex hull of {z ∈ {0,1}^{m+1} : Σ_{i=1}^m z_i ≤ k and z_{m+1} ≤ Σ_{i=1}^m z_i}; we have D_K ≤ √(k(k+2)) ≤ k + 1 and r_t ∈ K with probability 1. The objective function f : R^D → R makes a trade-off between average user utility and inequalities in item exposure (we gave an example in Eq. (2)). The remaining assumption of our framework is that the objective function is non-decreasing with respect to average user utility; this is not required, but it is natural (see Example 2) and slightly simplifies the algorithm.

Assumption D. The assumptions of the framework described above hold, as well as Assumption B. Moreover, ∀z ∈ K, ∂f/∂z_{m+1}(z) > 0, and ∀x ∈ X, 1 ≥ b_1(x) ≥ . . . ≥ b_k(x) and b_{k+1}(x) = . . . = b_m(x) = 0.

Algorithm and results. We present the algorithm in the setting of linear contextual bandits, using LinUCB (Abbasi-Yadkori et al., 2011; Li et al., 2010) as the scalar exploration/exploitation algorithm, in Algorithm 1. It builds reward estimates based on Ridge regression with regularization parameter λ. As in the previous section, we focus on the case where f is smooth, but the extension to nonsmooth f is straightforward, as described in Section 3; Appendix I provides the analysis for the general case.

[Figure 1: (left) Multi-armed CBCR: objective values on environments from (Mehrotra et al., 2020). (middle) Ranking CBCR: fairness objective value over timesteps on Last.fm data. (right) Ranking CBCR: trade-off between user utility and item inequality after 5 × 10⁶ iterations on Last.fm data.]

As noted by Do et al. (2021), Frank-Wolfe algorithms are particularly suited for fair ranking in the position-based model. This is illustrated by line 4 of Alg. 1, where for ũ ∈ R^m, top-k(ũ) outputs a permutation (matrix) of [m] that sorts the top-k elements of ũ. Alg. 1 is thus computationally fast, with a cost dominated by the top-k sort. It also has an intuitive interpretation as giving items an adaptive bonus depending on ∇f (e.g., boosting the scores of items which received low exposure in previous steps). The following result is a consequence of (Do et al., 2021, Theorem 1):

Proposition 4. Let t ∈ N* and μ̂_t be such that ∀i ∈ [m], μ̂_{t,i} = µ_i(x_t) and μ̂_{t,m+1} = v̂_t b(x_t)^⊺ viewed as a column vector, with v̂_t defined in line 3 of Algorithm 1. Then, under Assumption D, a_t defined on line 4 of Algorithm 1 satisfies: ⟨∇f(ŝ_{t−1}) | μ̂_t a_t⟩ = max_{a∈A} ⟨∇f(ŝ_{t−1}) | μ̂_t a⟩.

The proposition says that even though computing a_t as in line 4 of Alg. 1 does not require knowledge of b(x_t), we still obtain the optimal update direction according to μ̂_t. Together with the usage of the observed reward r_t in the FW iterates (instead of, e.g., μ̂_t a_t as would be done by Agrawal & Devanur (2014)), this removes the need for explicit estimates of µ(x_t). This is how our algorithm works without knowing the position weights b(x_t), which are then allowed to depend on the context.
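To see why a top-k sort suffices, note that ⟨∇f(ŝ_{t−1}) | μ̂_t a⟩ = Σ_i (g_i + g_{m+1} v̂_{t,i}) Σ_k a_{i,k} b_k(x_t) with g = ∇f(ŝ_{t−1}); since b is non-increasing, ranking items by the score g_i + g_{m+1} v̂_{t,i} is optimal by the rearrangement inequality, without knowing b. A sketch with a brute-force check against this argmax (our own code, not the paper's; `fw_topk_action` is our naming):

```python
import numpy as np
from itertools import permutations

def fw_topk_action(v_ucb, grad_f):
    """Score item i by g_{m+1} * v̂_i + g_i (UCB value weighted by the
    utility gradient, plus the item's exposure gradient), then place
    the highest-scored items at the best ranks."""
    m = v_ucb.shape[0]
    scores = grad_f[m] * v_ucb + grad_f[:m]
    order = np.argsort(-scores)           # item order[k] goes to rank k
    a = np.zeros((m, m))
    a[order, np.arange(m)] = 1.0          # a[i, k] = 1 iff item i at rank k
    return a

# brute-force check on a small instance with known, decreasing b:
m = 3
rng = np.random.default_rng(1)
v, g, b = rng.random(m), rng.normal(size=m + 1), np.array([1.0, 0.5, 0.2])
g[m] = abs(g[m])                          # Assumption D: ∂f/∂z_{m+1} > 0
def obj(a):  # <∇f | μ̂_t a>, with exposures μ_i = sum_k a[i, k] * b[k]
    expo = a @ b
    return g[:m] @ expo + g[m] * (v * expo).sum()
best = max(obj(np.eye(m)[list(p)]) for p in permutations(range(m)))
assert np.isclose(obj(fw_topk_action(v, g)), best)
```

The selection step never evaluates b, matching the discussion of Prop. 4 above.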
The usage of v̂_t to compute a_t follows the usual confidence-based approach to exploration/exploitation for linear bandits, which leads to the following result (proven in Appendix I):

Theorem 5. Under Assumptions B, C and D, for every δ′ > 0, every T ∈ N*, and every λ ≥ D²_X k, with probability at least 1 − δ′, Algorithm 1 has scalar regret bounded by

R^scal_T = O( L √(T k d ln(T/δ′)) ( √(d ln(T/δ′)) + D_θ √λ + √(k/d) ) ).

5. EXPERIMENTS

We present two experimental evaluations of our approach, which are fully detailed in App. B.

5.1. MULTI-ARMED CBCR: APPLICATION TO MULTI-OBJECTIVE BANDITS

We first focus on the multi-objective recommendation task of Example 1, where f(z) = Σ_{i=1}^D w_i z_i^↑.

Algorithms. We evaluate our two instantiations presented in Sec. 3.2 with the Moreau-Yosida smoothing technique of Sec. 3.3: (i) FW-SquareCB with Ridge regression and (ii) FW-LinUCB, where exploration is controlled by a scaling variable ϵ on the exploration bonus of each arm. We compare them to MOLinCB from (Mehrotra et al., 2020).

Environments. We reproduce the synthetic environments of Mehrotra et al. (2020), where the context and reward parameters are generated randomly, and w_i = 1/2^{i−1}. We set K = 50 and D ∈ {5, 20} (we also vary K in App. B). Each simulation is repeated with 100 random seeds.

Results. Following (Mehrotra et al., 2020), we evaluate the algorithms' performance by measuring the value of f((1/T) Σ_{t=1}^T µ(x_t)a_t) over time. Our results are shown in Figure 1 (left). We observe that our algorithm FW-SquareCB obtains performance comparable to the baseline MOLinCB. These algorithms converge after ≈ 100 rounds. In this environment from (Mehrotra et al., 2020), little exploration is needed, hence FW-LinUCB obtains better performance when ϵ is smaller (ϵ = 0.01). The advantage of using an FW instantiation for the multi-objective bandit optimization task is that, unlike MOLinCB, its convergence is also supported by our theoretical regret guarantees.

5.2. RANKING CBCR: APPLICATION TO FAIRNESS OF EXPOSURE IN RANKINGS

We now tackle the ranking problem of Section 4. We show how FW-LinUCBRank allows us to fairly distribute exposure among items on a music recommendation task with bandit user feedback.

Environment. Following (Patro et al., 2020), we use the Last.fm music dataset from (Cantador et al., 2011), from which we extract the top 50 users and items with the most listening counts. We use a protocol similar to Li et al. (2016) to generate contexts and rewards from those. We use k = 10 ranking slots and exposure weights b_k(x) = log(2)/(1+log(k)). Simulations are repeated with 10 seeds.

Algorithms. Our algorithm is FW-LinUCBRank with the nonsmooth objective f of Eq. (2), which trades off between user utility and item inequality. We study other fairness objectives in App. B. Our first baseline is LinUCBRank (Ermis et al., 2020), designed for ranking without fairness. Then, we study two baselines with amortized fairness of exposure criteria. Mansoury et al. (2021) proposed a fairness module for UCB-based ranking algorithms, which we plug into LinUCBRank; we refer to this baseline as Unbiased-LinUCBRank. Finally, the FairLearn(c, α) algorithm (Patil et al., 2020) enforces as a fairness constraint that the pulling frequency of each arm be ≥ c, up to a tolerance α. We implement as a third baseline a simple adaptation of FairLearn to contextual bandits and ranking.

Dynamics. Figure 1 (middle) represents the values of f over time achieved by the competing algorithms, for fixed β = 1. As expected, compared to the fairness-aware and -unaware baselines, our algorithm FW-LinUCBRank reaches the best values of f. Interestingly, Unbiased-LinUCBRank also obtains high values of f over the first 10⁴ rounds, but its performance starts decreasing after more iterations. This is because Unbiased-LinUCBRank is not guaranteed to converge to an optimal trade-off between user utility and item inequality.

At convergence.
We analyse the trade-offs achieved after 5 × 10⁶ rounds between user utility and item inequality, measured by the Gini index. We vary β in the objective f of Eq. (2) for FW-LinUCBRank, and the strength c in FairLearn(c, α) with tolerance α = 1. In Fig. 1 (right), we observe that, compared to FairLearn, FW-LinUCBRank converges to much higher user utility at all levels of inequality among items. In particular, it achieves zero unfairness at little cost to user utility.
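For reference, the Gini index of item exposures reported in Fig. 1 (right) can be computed as follows (our own implementation of the standard normalized formula, related to the unnormalized coefficient in Eq. (2)):

```python
import numpy as np

def gini_index(expo):
    """Normalized Gini index of an exposure vector: mean absolute
    pairwise difference divided by twice the mean. 0 means perfect
    equality; values near 1 mean exposure concentrated on few items."""
    expo = np.asarray(expo, dtype=float)
    mad = np.abs(expo[:, None] - expo[None, :]).mean()
    return float(mad / (2.0 * expo.mean()))
```

For example, equal exposures give a Gini index of 0, while giving all exposure to one item out of m yields (m − 1)/m.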

6. CONCLUSION

We presented the first general approach to contextual bandits with concave rewards. To illustrate the usefulness of the approach, we showed that our results extend randomized exploration with generic online regression oracles to the concave rewards setting, and extend existing ranking bandit algorithms to fairness-aware objective functions. The strength of our reduction is that it can produce algorithms for CBCR from any contextual bandit algorithm, including recent extensions of SquareCB to infinite compact action spaces (Zhu & Mineiro, 2022; Zhu et al., 2022) and future ones. In our main application to fair ranking, the designer sets a fairness trade-off f to optimize. In practice, they may choose f from a small class by varying hyperparameters (e.g. β in Eq. (2)). An interesting open problem is the integration of recent elicitation methods for f (e.g., Lin et al., 2022) into the bandit setting. Another interesting issue is the generalization of our framework to include constraints (Agrawal & Devanur, 2016). Finally, we note that deploying our algorithms requires careful design of the whole machine learning setup, including the specification of reward functions (Stray et al., 2021) and the design of online experiments (Bird et al., 2016), while taking feedback loops into account (Bottou et al., 2013; Jiang et al., 2019; Dean & Morgenstern, 2022).

A RELATED WORK

The non-contextual setting of bandits with concave rewards (BCR) has been previously studied by Agrawal & Devanur (2014), and by Busa-Fekete et al. (2017) for the special case of Generalized Gini indices. In BCR, policies are distributions over actions. These approaches perform a direct optimization in policy space, which is not possible in the contextual setup without restrictions or assumptions on optimal policies. Agrawal et al. (2016) study a setting of CBCR where the goal is to find the best policy in a finite set of policies. Because they rely on explicit search in the policy space, they do not resolve the main challenge of the general CBCR setting we address here. Several works, including Cheung (2019), address multi-objective reinforcement learning with concave aggregation functions, a problem more general than stochastic contextual bandits. In particular, Cheung (2019) uses a FW approach for this problem. However, these works rely on a tabular setting (i.e., finite state and action sets) and explicitly compute policies, which is not possible in our setting, where policies are mappings from a continuous context set to distributions over actions. Our work is the only one amenable to contextual bandits with concave rewards, because it removes the need for an explicit policy representation. Finally, compared to previous FW approaches to bandits with concave rewards, e.g., (Agrawal & Devanur, 2014; Berthet & Perchet, 2017), our analysis is not limited to confidence-based exploration/exploitation algorithms. CBCR is also related to the broad literature on bandit convex optimization (BCO) (Flaxman et al., 2004; Agarwal et al., 2011; Hazan et al., 2016; Shalev-Shwartz et al., 2012). In BCO, the goal is to minimize a cumulative loss of the form $\sum_{t=1}^T \ell_t(\pi_t)$, where the convex loss function $\ell_t$ is unknown and the learner only observes the value $\ell_t(\pi_t)$ of the chosen parameter $\pi_t$ at each timestep. Existing approaches to BCO perform gradient-free optimization in the parameter space.
While BCR considers global objectives rather than cumulative ones, similar approaches have been used in non-contextual BCR (Berthet & Perchet, 2017), where the parameter space is the convex set of distributions over actions. As we previously highlighted, such a parameterization does not apply to CBCR because direct optimization in policy space is infeasible. CBCR is also related to multi-objective optimization (Miettinen, 2012; Drugan & Nowe, 2013), where the goal is to find all Pareto-efficient solutions. (C)BCR focuses on the single point of the Pareto front determined by the concave aggregation function $f$, which is more practical in our application settings, where the decision-maker is interested in a specific (e.g., fairness) trade-off. In recent years, the question of fairness of exposure has attracted a lot of attention, and has mostly been studied in a static ranking setting (Geyik et al., 2019; Beutel et al., 2019; Yang & Stoyanovich, 2017; Singh & Joachims, 2018; Patro et al., 2022; Zehlike et al., 2021; Kletti et al., 2022; Diaz et al., 2020; Do & Usunier, 2022; Wu et al., 2022). Existing work on fairness of exposure in bandits focuses on local exposure constraints on the probability of pulling an arm at each timestep, either in the form of lower/upper bounds (Celis et al., 2018) or merit-based exposure targets (Wang et al., 2021). In contrast, we consider amortized exposure over time, in line with prior work on fair ranking (Biega et al., 2018; Morik et al., 2020; Usunier et al., 2022), along with fairness trade-offs defined by concave objective functions, which are more flexible than fairness constraints (Zehlike & Castillo, 2020; Do et al., 2021; Usunier et al., 2022). Moreover, these works (Celis et al., 2018; Wang et al., 2021) do not address combinatorial actions, while ours applies to ranking in the position-based model, which is more practical for recommender systems (Lagrée et al., 2016; Singh & Joachims, 2018).
The methods of Patil et al. (2020) and Chen et al. (2020) aim at guaranteeing a minimal cumulative exposure over time for each arm, but they also do not apply to ranking. Conversely, Xu et al. (2021) and Li et al. (2019) consider combinatorial bandits with fairness, but they do not address the contextual case, which limits their practical applicability to recommender systems. Mansoury et al. (2021) and Jeunen & Goethals (2021) propose heuristic algorithms for fair ranking in the contextual bandit setting, highlighting the problem's importance for real-world recommender systems, but without theoretical guarantees. Using our FW reduction together with techniques from contextual combinatorial bandits (Lagrée et al., 2016; Li et al., 2016; Qin et al., 2014), we obtain the first principled bandit algorithms for this problem with provably vanishing regret.

B MORE ON EXPERIMENTS

Our experiments are fully implemented in Python 3.9.

B.1.1 DETAILS OF THE ENVIRONMENT AND ALGORITHMS

Environment. Following Patro et al. (2020), who also address fairness in recommender systems, we use the Last.fm music dataset from Cantador et al. (2011), which includes the listening counts of 1,892 users for the tracks of 17,632 artists, which we identify as the items. For the first environment, which we presented in Section 5 and call Lastfm-50 here, we extract the top $n = 50$ users and $m = 50$ items having the most interactions. In order to examine algorithms at larger scale, we also design another environment, Lastfm-2k, where we keep all $n \approx 1.9$k users and the top $m = 2.5$k items having the most interactions. In both cases, to generate contexts and rewards, we follow a protocol similar to other works on linear contextual bandits (Garcelon et al., 2020; Li et al., 2016). Using low-rank matrix factorization with $d'$ latent factors, we obtain user factors $u_j \in \mathbb{R}^{d'}$ and item factors $v_i \in \mathbb{R}^{d'}$ for all $(j, i) \in [n] \times [m]$. We design the context set as $X = \{\mathrm{flatten}(u_j v_i^\top) : (j, i) \in [n] \times [m]\} \subset \mathbb{R}^d$, where $d = d'^2$. At each time step $t$, the environment draws a user $j_t$ uniformly at random from $[n]$ and sends the contexts $x_{t,i} = \mathrm{flatten}(u_{j_t} v_i^\top)$. Given a context $x_t$ and item $i$, clicks are drawn from a Bernoulli distribution: $c_{t,i} \sim \mathcal{B}(u_{j_t}^\top v_i)$. We set $k = 10$, and for the position weights, we use the standard weights of the discounted cumulative gain (DCG): $\forall k \in [k], b_k = \frac{1}{\log_2(1+k)}$ and $b_{k+1} = \ldots = b_m = 0$. Details of the algorithms. For all algorithms, the regularization parameter of the ridge regression is set to $\lambda = 0.1$. The first baseline we consider is the LinUCBRank algorithm of Ermis et al. (2020), a top-$k$ ranking bandit algorithm without fairness. It is equivalent to using FW-LinUCBRank with $f(s) = s_{m+1}$, which corresponds to the usual top-$k$ ranking objective without item fairness. More precisely, at each timestep, the algorithm produces a top-$k$ ranking of $\big(\hat\theta_{t-1}^\top x_{t,i} + \alpha_t(\tfrac{\delta'}{3})\, \|x_{t,i}\|_{V_{t-1}^{-1}}\big)_{i=1}^m$.
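As a concrete illustration, here is a minimal sketch of this environment and of the top-$k$ selection step. The factor matrices are random stand-ins for the ones a real run would fit by matrix factorization on the Last.fm counts, and all names are ours:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, d_prime, k = 50, 50, 4, 10   # users, items, latent factors, slots

# Stand-ins for the factors a real run would fit on the listening counts.
U = rng.uniform(size=(n, d_prime)) / np.sqrt(d_prime)
V = rng.uniform(size=(m, d_prime)) / np.sqrt(d_prime)

# DCG position weights: b_k = 1/log2(1+k) on the k slots, 0 afterwards.
b = np.zeros(m)
b[:k] = 1.0 / np.log2(1.0 + np.arange(1, k + 1))

j = rng.integers(n)                 # user j_t ~ Uniform([n])
# x_{t,i} = flatten(u_j v_i^T), dimension d = d_prime ** 2
contexts = np.stack([np.outer(U[j], V[i]).ravel() for i in range(m)])
# c_{t,i} ~ Bernoulli(u_j^T v_i)
clicks = rng.binomial(1, np.clip(U[j] @ V.T, 0.0, 1.0))

# A top-k ranking of (here arbitrary) index scores, as LinUCBRank does
# with its upper confidence bounds:
scores = rng.normal(size=m)
ranking = np.argsort(-scores)[:k]   # items placed in the k slots
```

The scaling of `U` and `V` keeps the inner products $u_j^\top v_i$ in $[0, 1]$ so they are valid click probabilities.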
Our first fairness baseline uses the module of Mansoury et al. (2021), which tracks the cumulative exposure $\sum_{t'=1}^{t-1} e_{t',i}$ received by item $i$ up to time $t$. Transposed to our setting, their module consists in a simple modification of LinUCBRank, multiplying the exploration bonus of each item $i$ by the factor
$$\eta_{t,i} = 1 - \frac{\sum_{t'=1}^{t-1} e_{t',i}}{\sum_{t'=1}^{t-1} \frac{1}{k}\sum_{i'=1}^m e_{t',i'}}.$$
More precisely, at each timestep, the algorithm produces a top-$k$ ranking of $\big(\hat\theta_{t-1}^\top x_{t,i} + \eta_{t,i} \, \alpha_t(\tfrac{\delta'}{3})\, \|x_{t,i}\|_{V_{t-1}^{-1}}\big)_{i=1}^m$. Following Mansoury et al. (2021), we call this baseline Unbiased-LinUCBRank. Our second baseline with fairness is the FairLearn(c, α) algorithm of Patil et al. (2020) for stochastic bandits, which enforces a fairness constraint on the pulling frequency $N_{t,i}$ of each arm $i$ at each timestep $t$. The constraint is parameterized by a rate $c$ and a tolerance parameter $\alpha$: $\lfloor ct \rfloor - N_{t,i} \le \alpha$. We adapt FairLearn(c, α) to ranking by applying the algorithm sequentially for each recommendation slot, while constraining it not to choose the same item twice within a ranked list. We also adapt FairLearn to contextual bandits by using LinUCB as the underlying learning algorithm: for the current timestep and slot, if the constraint is not violated, the algorithm plays the item with the highest LinUCB upper confidence bound. Objectives. To illustrate the flexibility of our approach, we use FW-LinUCBRank to optimize three existing objectives that trade off between user utility and item fairness, of the form $f(s) = s_{m+1} + \beta f_{\mathrm{item}}(s_{1:m})$. Gini measures item inequality by the Gini index, as in (Biega et al., 2018; Morik et al., 2020; Do & Usunier, 2022), and eq. exposure uses the standard deviation (Do et al., 2021):
$$\text{(Gini)} \quad f_{\mathrm{item}}(s) = \sum_{j=1}^m \frac{m-j+1}{m}\, s_j^\uparrow \qquad \text{(eq. expo)} \quad f_{\mathrm{item}}(s) = -\sqrt{\frac{1}{m}\sum_{j=1}^m \Big(s_j - \frac{1}{m}\sum_{j'=1}^m s_{j'}\Big)^2} \quad (15)$$
Since Gini is nonsmooth, we apply the FW-LinUCBRank algorithm for nonsmooth $f$ with Moreau-Yosida regularization, presented in Section 3.3 and detailed in Appendix F.1 (we use $\beta_0 = 1$ in our experiments).
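The two item objectives of Eq. (15), as we read them, can be sketched directly (the exposure vector below is made up; $s_j^\uparrow$ denotes sorting in increasing order):

```python
import numpy as np

def f_item_gini(s: np.ndarray) -> float:
    """Gini item objective: sum_j ((m - j + 1)/m) * s_j^up, where s^up
    sorts exposures in increasing order, so the worst-off items receive
    the largest weights."""
    m = len(s)
    weights = (m - np.arange(1, m + 1) + 1) / m   # m/m, (m-1)/m, ..., 1/m
    return float(weights @ np.sort(s))

def f_item_eq_expo(s: np.ndarray) -> float:
    """Equality-of-exposure objective: minus the standard deviation of
    item exposures (0 iff all items receive equal exposure)."""
    return float(-np.sqrt(np.mean((s - s.mean()) ** 2)))

s = np.array([0.6, 0.2, 0.2])          # toy item exposures
print(f_item_gini(s), f_item_eq_expo(s))
```

Both are concave, and both are maximized (for a fixed total exposure) when all items are exposed equally, which is what makes them suitable fairness terms in $f$.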
To compute the gradient of the Moreau envelope $f_t$, we use the algorithm of Do & Usunier (2022), which specifically applies to generalized Gini functions and top-$k$ ranking. We also study additive concave welfare functions (Do et al., 2021; Moulin, 2003), where $\alpha$ is a parameter controlling the degree of redistribution of exposure to the worse-off items:
$$\text{(Welf)} \quad f_{\mathrm{item}}(s) = \sum_{j=1}^m s_j^\alpha, \quad \alpha > 0 \quad (16)$$

B.1.2 ADDITIONAL RESULTS

We now present additional results, obtained by repeating each simulation with 10 different random seeds. Dynamics. For the three objectives described above, Figure 2 shows the values of the user and item objectives (left and middle) and of the objective $f$ (right) over time, achieved by the competing algorithms on Lastfm-50. We set $\beta = 0.5$ for all objectives, and for welf we set $\alpha = 0.5$. With this value of $\beta$, the item objective $f_{\mathrm{item}}$ is given more weight in $f$ than the user utility. We observe that for Gini and welf, FW-LinUCBRank achieves the highest value of $f$ across timesteps: unlike LinUCBRank, it accounts for the item objective $f_{\mathrm{item}}$. In both cases, Unbiased-LinUCBRank achieves a high value of $f$ over time but eventually starts decreasing, after $10^4$ iterations for Gini and $5 \cdot 10^5$ iterations for welf, because it is not designed to converge towards an optimum of $f$. For eq. exposure with $\beta = 0.5$, Unbiased-LinUCBRank surprisingly obtains better values of $f$ than FW-LinUCBRank. Therefore, depending on the objective to optimize and the timeframe, Unbiased-LinUCBRank can be chosen as an alternative to FW-LinUCBRank. However, due to its lack of theoretical guarantees, it is difficult to anticipate in which cases it works well, and for how many iterations. Furthermore, unlike Unbiased-LinUCBRank, FW-LinUCBRank can be tuned to optimize a wide variety of functions, by varying the trade-off parameter $\beta$ in all objectives, and $\alpha$ in welf to control the degree of redistribution.
Unbiased-LinUCBRank does not have such controllability and flexibility. Figure 3 shows the objective values for Gini and welf on Lastfm-2k. We observe similar results: FW-LinUCBRank converges more quickly than its competitors (≈5,000 iterations for Gini and ≈500 iterations for welf) and obtains the highest values of $f$. Over the first $10^5$ iterations, Unbiased-LinUCBRank obtains significantly lower values than FW-LinUCBRank on Gini and welf. Fairness trade-off for fixed T. On the larger Lastfm-2k dataset, we study in Figure 4 the trade-offs between user utility and item inequality obtained by FW-LinUCBRank and FairLearn after $T = 10^6$ rounds. The Pareto frontiers are obtained as follows: FW-LinUCBRank optimizes Gini, in which we vary $\beta$, while for FairLearn we vary the constraint value $c$ at fixed $\alpha = 1$. Figure 1 in Section 5 of the main paper showed the same Pareto frontier, but with 5× more iterations and on the smaller Lastfm-50 dataset. Although the algorithms might not have converged on this larger dataset, we observe that FW-LinUCBRank obtains better trade-offs than FairLearn, achieving higher user utility at all levels of inequality. We conclude that even in a setting with more items and a shorter learning time, FW-LinUCBRank effectively reduces item inequality at a lower cost in user utility than the baseline.

B.2 MULTI-ARMED CBCR: APPLICATION TO MULTI-OBJECTIVE BANDITS WITH GENERALIZED GINI FUNCTION

We provide the details and additional simulations on the task of optimizing the Generalized Gini aggregation Function (GGF) in multi-objective bandits (Busa-Fekete et al., 2017; Mehrotra et al., 2020). Recall that the goal is to maximize a GGF of the $D$-dimensional rewards, a nonsmooth concave aggregation function parameterized by nonincreasing weights $w_1 = 1 \ge \ldots \ge w_D \ge 0$: $f(s) = \sum_{i=1}^D w_i s_i^\uparrow$, where $(s_i^\uparrow)_{i=1}^D$ denotes the values of $s$ sorted in increasing order. Mehrotra et al. (2020) study the contextual bandit setting, motivated by music recommendation on Spotify with multiple metrics. They consider atomic actions $a_t \in A$ (i.e., $A$ is the canonical basis of $\mathbb{R}^K$) and a linear reward model: $\forall i \in [D], \exists \theta_i \in \mathbb{R}^d, \mathbb{E}_t[r_{t,i}] = \theta_i^\top x_t^\top a_t$. These are the same assumptions as described in Table 1 of Section 3.2 and in Appendix G. GGFs are concave functions, but they are nondifferentiable. Therefore, we use the variant of our FW approach for nonsmooth $f$ (see Section 3.3), where we smooth the objective via Moreau-Yosida regularization with parameter $\beta_0 = 0.01$, using the algorithm of Do & Usunier (2022) to compute the gradients of the smooth approximations $f_t$. Algorithms. In the main body, we evaluated two instantiations of our FW meta-algorithm, namely FW-LinUCB and FW-SquareCB. The level of exploration in FW-LinUCB is controlled by a variable $\epsilon$: the exploration bonus is multiplied by $\sqrt{\epsilon}$, i.e., the UCBs are computed as $\hat\theta_{t-1,i}^\top x_{t,k} + \sqrt{\epsilon}\, \alpha_t(\delta)\, \|x_{t,k}\|_{V_{t-1}^{-1}}$. In FW-SquareCB, as detailed in Appendix H, the exploration is controlled by a sequence $(\gamma_t)_{t \ge 1}$ growing as $\sqrt{t}$ (higher $\gamma_t$ means less exploration). We set it to $\gamma_t = \gamma_0 \sqrt{t}$ with $\gamma_0 \in \{10^3, 10^4\}$. In addition to the two algorithms presented in Section 5, to show the flexibility of our FW approach, we also implement FW-$\epsilon$-greedy, another instantiation of our FW algorithm which uses $\epsilon$-greedy as the scalar bandit algorithm.
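The GGF itself is a one-liner once the reward vector is sorted. A minimal sketch with the weights used in these experiments (the reward vectors are made up):

```python
import numpy as np

def ggf(s: np.ndarray, w: np.ndarray) -> float:
    """Generalized Gini Function f(s) = sum_i w_i * s_i^up: sort s in
    increasing order, so the largest weight hits the worst objective."""
    return float(np.sort(s) @ w)

D = 4
w = 1.0 / 2.0 ** np.arange(D)      # w_j = 1/2^(j-1), as in the experiments

unbalanced = np.array([0.1, 0.9, 0.5, 0.4])
balanced = np.full(D, unbalanced.mean())   # same total reward, evenly spread

# GGF is Schur-concave: the balanced vector scores strictly higher.
print(ggf(unbalanced, w), ggf(balanced, w))   # 0.5375 vs 0.890625
```

This preference for balanced reward vectors is exactly why the GGF is used as a fair aggregation of the $D$ objectives.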
We compare our algorithms with MOLinCB of Mehrotra et al. (2020), an online gradient-descent-style algorithm designed for this task but introduced without theoretical guarantees, as an extension of the MO-OGDE algorithm of Busa-Fekete et al. (2017), who study the non-contextual problem. We use the default parameters of MOLinCB recommended by Mehrotra et al. (2020). Environments. Since the Spotify dataset of Mehrotra et al. (2020) is not publicly available, we focus on their simulated, controlled environments, which we reproduced exactly as described in Appendix A of their paper. For completeness, we restate the protocol here: we draw a hidden parameter $\theta \in \mathbb{R}^{D \times d}$ uniformly at random in $[0, 1]$, and each element of a context-arm vector $x_{t,k}$ is drawn from $\mathcal{N}(\frac{1}{d}, \frac{1}{d^2})$. Given a context $x_t$ and arm $k_t$, the $D$-dimensional reward is generated as a draw from $\mathcal{N}(\theta x_{t,k_t}, 0.01\,(\theta x_{t,k_t})^2)$. We choose $d = 10$ in the data generation and $\lambda = 0.1$ in the ridge regression, as recommended by Mehrotra et al. (2018). In Section 5 of the main body, we varied the number of objectives $D \in \{5, 20\}$ and set $K = 50$; here we also experiment with $K = 200$ to see the effect of varying the number of arms. The GGF weights are set to $w_j = \frac{1}{2^{j-1}}$. Each simulation is repeated with 100 different random seeds.
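The simulated protocol above can be sketched as follows; this is our own minimal rendering (the `round_` helper and the placeholder arm choice are our additions), not the authors' code:

```python
import numpy as np

rng = np.random.default_rng(0)
D, d, K = 5, 10, 50                          # objectives, context dim, arms

theta = rng.uniform(0.0, 1.0, size=(D, d))   # hidden parameter in [0,1]^(D x d)

def round_():
    """One round: K context vectors with N(1/d, 1/d^2) entries, then a
    D-dimensional noisy reward for the pulled arm."""
    X = rng.normal(1.0 / d, 1.0 / d, size=(K, d))   # std 1/d, variance 1/d^2
    k = rng.integers(K)                             # placeholder arm choice
    mean = theta @ X[k]                             # E[r_t] = theta x_{t,k}
    r = rng.normal(mean, 0.1 * np.abs(mean))        # N(mean, 0.01 * mean^2)
    return X, k, r

X, k, r = round_()
```

A bandit algorithm would replace the placeholder arm choice with its own selection rule based on `X` and past observations.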

Results

The extended results, with more arms and algorithms, are depicted in Figure 5. We observe that FW-$\epsilon$-greedy achieves performance similar to the baseline MOLinCB with small exploration $\epsilon = 0.01$. FW-SquareCB also achieves performance comparable to MOLinCB when there is little exploration, i.e., with $\gamma_0 = 10^4$ rather than $10^3$. This is consistent with our observation in Section 5 that FW-LinUCB obtains better performance with very little exploration on this environment from Mehrotra et al. (2018); note that there is no forced exploration in their algorithm MOLinCB. Overall, we obtain qualitatively similar results for $K = 200$ and $K = 50$.

C PROOFS OF SECTION 2

In this section we give the missing details of Section 2. For completeness, we recall the definitions of Lipschitz continuity and super-gradients in the next subsection. Then, in Section C.2, we begin the analysis of the structure of the set $S$ defined in Section 3 of the main paper, and more precisely of its support function $g \mapsto \max_{s \in S} g^\top s$. This contains new lemmas that are fundamental to the analysis throughout the paper, in particular in the proof of Lemma 9, which is given in Section C.3.

C.1 BRIEF REMINDER ON LIPSCHITZ FUNCTIONS AND SUPER-GRADIENTS

We recall the following definitions. Let $D$ and $D'$ be two integers, and let $f : \mathbb{R}^D \to \mathbb{R}^{D'}$ be a function. We have:
• (Lipschitz continuity) $f$ is $L$-Lipschitz continuous with respect to $\|.\|_2$ on a set $Z \subseteq \mathbb{R}^D$ if
$$\forall z, z' \in Z, \quad \|f(z) - f(z')\|_2 \le L \|z - z'\|_2. \quad (17)$$
• (Super-gradients) If $f : \mathbb{R}^D \to \mathbb{R} \cup \{\pm\infty\}$, a super-gradient of $f$ at a point $z \in \mathbb{R}^D$ where $f(z) \in \mathbb{R}$ is a vector $g$ such that for all $z' \in \mathbb{R}^D$, $f(z') \le f(z) + \langle g \,|\, z' - z \rangle$.
We recall the following results when $f : \mathbb{R}^D \to \mathbb{R} \cup \{\pm\infty\}$ is a proper closed concave function:
• $f$ has a non-empty set of super-gradients at every point $z$ where $f(z) \in \mathbb{R}$;
• if $f$ is $L$-Lipschitz on $Z \subseteq \mathbb{R}^D$ and $Z$ is open, then for every $z \in Z$ and every super-gradient $g$ of $f$ at $z$, we have $\|g\|_2 \le L$.
The assumption of Lipschitz continuity of $f$ on a set $Z$ implicitly requires that $Z$ lies in the domain of $f$. Remark 1 (About our Lipschitzness assumptions) We use Lipschitzness over an open set containing $K$ in Assumption A because we use boundedness of the super-gradients of $f$. In fact, a more precise alternative would be to require that super-gradients are bounded uniformly on $K$ by $L$. We choose the Lipschitz formulation because we believe it is more natural. As a side note, in Assumption B, we use Lipschitzness of the gradients on $K$, not on an open set containing $K$. This is because smoothness is used in the ascent lemma (see Eq. 50), which relies on Inequality 4.3 of Bottou et al. (2018), whose proof directly uses Lipschitz continuity of the gradients on $K$ (Bottou et al., 2018, Appendix B), without relying on an argument of boundedness of gradients.

C.2 PRELIMINARIES: THE STRUCTURE OF THE SET S

We denote by $x_{1:T} = (x_1, \ldots, x_T)$ a sequence of contexts of length $T$. Let
$$S = \Big\{ \mathbb{E}_{x \sim P}\big[\mu(x)\pi(x)\big] : \pi : X \to \bar{A} \Big\} \quad (18)$$
$$\forall x_{1:T} \in X^T, \quad S(x_{1:T}) = \Big\{ \frac{1}{T}\sum_{t=1}^T \mu(x_t)\pi(x_t) : \pi : X \to \bar{A} \Big\}$$
It is straightforward to show that $S(x_{1:T}) = \big\{\frac{1}{T}\sum_{t=1}^T \mu(x_t)\pi_t : (\pi_1, \ldots, \pi_T) \in \bar{A}^T\big\}$. These sets are particularly relevant because of the following equalities, which hold for every $f : \mathbb{R}^D \to \mathbb{R} \cup \{\pm\infty\}$:
$$f^* = \sup_{\pi : X \to \bar{A}} f\Big(\mathbb{E}_{x \sim P}\big[\mu(x)\pi(x)\big]\Big) = \sup_{s \in S} f(s) \quad \text{and} \quad f_T^+ = \sup_{(\pi_t)_{t \in [T]} \in \bar{A}^T} f\Big(\frac{1}{T}\sum_{t=1}^T \mu(x_t)\pi_t\Big) = \sup_{s \in S(x_{1:T})} f(s). \quad (20)$$
We study in this section the structure of these sets. We state here the part of Assumption A that is relevant to this section:
Assumption Ã $A$ is a compact subset of $\mathbb{R}^K$ and there is a compact convex set $K \subseteq \mathbb{R}^D$ such that $\forall (x, a) \in X \times A, \mu(x)a \in K$.
We recall the following basic results on convex sets in Euclidean spaces, which we use throughout the paper without reference:
Lemma 6 Let $A$ be a compact subset of $\mathbb{R}^K$. We have:
• (Rockafellar & Wets, 2009, Corollary 2.30) The convex hull of $A$, denoted by $\bar{A}$, is compact.
• For every $w \in \mathbb{R}^K$, $\max_{a \in \bar{A}} w^\top a = \max_{a \in A} w^\top a$.
The following lemma allows us to use maxima instead of suprema over $S$ and $S(x_{1:T})$; its proof is deferred to Appendix J.1.
Lemma 7 Under Assumption Ã, $S$ is compact and $\forall T \in \mathbb{N}^*, \forall x_{1:T} \in X^T$, $S(x_{1:T})$ is compact.
The next result, on the support functions of $S$ and $S(x_{1:T})$, is the key to our approach:
Lemma 8 Let $w \in \mathbb{R}^D$ and $T \in \mathbb{N}^*$. Under Assumption Ã, we have
$$\mathbb{E}_{x_{1:T} \sim P^T}\Big[\max_{s \in S(x_{1:T})} w^\top s\Big] = \max_{s \in S} w^\top s.$$
Moreover, for every $\delta \in (0, 1]$, we have with probability at least $1 - \delta$:
$$\max_{s \in S(x_{1:T})} w^\top s \le \max_{s \in S} w^\top s + \|w\|_2 D_K \sqrt{\frac{2 \ln \delta^{-1}}{T}}.$$
The inequality $\max_{s \in S} w^\top s \le \max_{s \in S(x_{1:T})} w^\top s + \|w\|_2 D_K \sqrt{\frac{2 \ln \delta^{-1}}{T}}$ also holds with probability $1 - \delta$.
Proof. The first result is a direct consequence of the maximization of linear functions over the simplex.
Using (20) with $f(s) = w^\top s$ and the linearity of expectations, we have $\max_{s \in S} w^\top s = \max_{\pi : X \to \bar{A}} \mathbb{E}_{x \sim P}\big[w^\top \mu(x)\pi(x)\big]$. The optimal policy given $w$, denoted by $\pi_w$, is thus obtained by optimizing, for every $x$, the dot product between $w^\top \mu(x) \in \mathbb{R}^K$ and $\pi(x) \in \bar{A} \subseteq \mathbb{R}^K$. Since, for each $x$, this is a linear optimization, we can find an optimizer in $A$ (see Lemma 6), which gives:
$$\max_{s \in S} w^\top s = \mathbb{E}_{x \sim P}\big[\underbrace{w^\top \mu(x)\pi_w(x)}_{\eta_w(x)}\big] \quad \text{where} \quad \pi_w(x) \in \operatorname*{argmax}_{a \in A} w^\top \mu(x) a,$$
where in the equation above we mean that $\pi_w$ is a measurable selection of $x \mapsto \operatorname{argmax}_{a \in A} w^\top \mu(x) a$. For the same reason, we have $\max_{s \in S(x_{1:T})} w^\top s = \frac{1}{T}\sum_{t=1}^T \eta_w(x_t)$. We obtain
$$\mathbb{E}_{x_{1:T} \sim P^T}\Big[\max_{s \in S(x_{1:T})} w^\top s\Big] = \mathbb{E}_{x_{1:T} \sim P^T}\Big[\frac{1}{T}\sum_{t=1}^T \eta_w(x_t)\Big] = \mathbb{E}_{x \sim P}\big[\eta_w(x)\big] = \max_{s \in S} w^\top s,$$
which is the first equality. For the high-probability inequality, let $X_t = \eta_w(x_t) - \mathbb{E}_{x \sim P}[\eta_w(x)]$. Since the $(x_t)_{t \in [T]}$ are independent and identically distributed (i.i.d.), the variables $(X_t)_{t \in [T]}$ are also i.i.d., and we have
$$|X_t| \le \Big| w^\top \Big( \underbrace{\mu(x_t)\pi_w(x_t)}_{\in K} - \underbrace{\mathbb{E}_{x \sim P}\big[\mu(x)\pi_w(x)\big]}_{\in K} \Big) \Big| \le \|w\|_2 D_K \quad \text{and} \quad \mathbb{E}[X_t] = 0.$$
Given $\delta \in (0, 1]$, Hoeffding's inequality applied to $\frac{1}{T}\sum_{t=1}^T X_t$ gives, with probability at least $1 - \delta$:
$$\max_{s \in S(x_{1:T})} w^\top s - \max_{s \in S} w^\top s = \frac{1}{T}\sum_{t=1}^T X_t \le \|w\|_2 D_K \sqrt{\frac{2 \ln \delta^{-1}}{T}}.$$
The reverse inequality is obtained by applying Hoeffding's inequality to $-\frac{1}{T}\sum_{t=1}^T X_t$.
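On a toy instance with finitely many contexts and a simplex of actions (all quantities below are made up), the first equality of Lemma 8 can be checked numerically: the support function of $S(x_{1:T})$ is the empirical average of the per-context maxima $\eta_w(x_t)$, whose expectation is the support function of $S$:

```python
import numpy as np

rng = np.random.default_rng(0)
D, K, nX, T = 3, 4, 5, 200

mu = rng.normal(size=(nX, D, K))            # mu(x): a D x K matrix per context
P = np.array([0.1, 0.2, 0.3, 0.25, 0.15])   # context distribution
w = rng.normal(size=D)

# eta_w(x) = max_{a} w^T mu(x) a; over the simplex, the max is at a vertex.
eta = (w @ mu).max(axis=1)                  # shape (nX,)

# E[max_{s in S(x_1:T)} w^T s], estimated over many draws of x_1:T
lhs = np.mean([eta[rng.choice(nX, size=T, p=P)].mean() for _ in range(2000)])
rhs = float(P @ eta)                        # max_{s in S} w^T s = E_x[eta_w(x)]
assert abs(lhs - rhs) < 0.05                # Lemma 8's equality, up to MC noise
```

The per-draw deviations around `rhs` are exactly the Hoeffding fluctuations bounded in the second part of the lemma.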

C.3 PROOF OF LEMMA 9

Lemma 9 Under Assumption A, $\forall T \in \mathbb{N}^*, \forall \delta \in (0, 1]$, we have, with probability at least $1 - \delta$:
$$\big|f_T^+ - f^*\big| \le L D_K \sqrt{\frac{2 \ln \frac{4e^2}{\delta}}{T}} \quad \text{where} \quad f_T^+ = \max_{(\pi_1, \ldots, \pi_T) \in \bar{A}^T} f\Big(\frac{1}{T}\sum_{t=1}^T \mu(x_t)\pi_t\Big).$$
We also have, with probability $1 - \delta$ over contexts, actions, and rewards:
$$\big|f(s_T) - f(\hat{s}_T)\big| \le L D_K \sqrt{\frac{2 \ln(2e^2\delta^{-1})}{T}} \quad \text{where} \quad s_T = \frac{1}{T}\sum_{t=1}^T \mu(x_t) a_t.$$
The first statement shows that the performance of the optimal non-stationary policy over $T$ steps converges to $f^*$ at a rate $O(1/\sqrt{T})$. Furthermore, measuring the algorithm's performance by expected rewards instead of observed rewards would also amount to a difference of order $O(1/\sqrt{T})$; this choice would lead to what is commonly referred to as a pseudo-regret. Since the worst-case regret of BCR is $\Omega(1/\sqrt{T})$ (Bubeck & Cesa-Bianchi, 2012), the previous lemma shows that these alternative definitions of regret would not substantially change our results.
Proof. We start with the first inequality. We first prove that with probability greater than $1 - \delta/2$, we have $f_T^+ \le f^* + L D_K \sqrt{\frac{2 \ln \frac{2}{\delta}}{T}}$. Since $f$ is continuous on $K$, $S \subseteq K$, and $S$ is compact by Lemma 7, there is $s^* \in S$ such that $f^* = f(s^*)$. Similarly, since $S(x_{1:T})$ is compact, there is $s_T^*$ such that $f(s_T^*) = \max_{s \in S(x_{1:T})} f(s)$. Using (20), we need to prove that with probability at least $1 - \delta/2$, we have $f(s_T^*) \le f(s^*) + L D_K \sqrt{\frac{2 \ln \frac{2}{\delta}}{T}}$. Using the concavity of $f$, let $g^*$ be a super-gradient of $f$ at $s^*$. We have
$$f(s_T^*) \le f(s^*) + \langle g^* \,|\, s_T^* - s^* \rangle \quad (29)$$
$$\le f(s^*) + \max_{s \in S(x_{1:T})} \langle g^* \,|\, s - s^* \rangle \quad (30)$$
and thus, with probability at least $1 - \delta/2$ (by Lemma 8):
$$f(s_T^*) \le f(s^*) + \underbrace{\max_{s \in S} \langle g^* \,|\, s - s^* \rangle}_{\le 0 \text{ by def. of } s^*} + \|g^*\|_2 D_K \sqrt{\frac{2 \ln \frac{2}{\delta}}{T}} \le f(s^*) + L D_K \sqrt{\frac{2 \ln \frac{2}{\delta}}{T}},$$
where the last inequality uses the Lipschitz assumption ($\|g^*\|_2 \le L$).
Published as a conference paper at ICLR 2023
We now prove $f^* \le f_T^+ + L D_K \sqrt{\frac{2 \ln \frac{4e^2}{\delta}}{T}}$ with probability at least $1 - \delta/2$. Let $\pi^* \in \operatorname{argmax}_{\pi : X \to \bar{A}} f\big(\mathbb{E}_{x \sim P}[\mu(x)\pi(x)]\big)$ (an optimal policy exists by Lemma 7).
Denote by $(X_t = \mu(x_t)\pi^*(x_t))_{t \in [T]}$ a sequence of independent and identically distributed random variables obtained by sampling $x_t \sim P$. We have $\|X_t - \mathbb{E}X_t\|_2 \le D_K$ and $\mathbb{E}X_t = s^*$. By the Lipschitz property of $f$, we obtain
$$f(s^*) \le f\Big(\frac{1}{T}\sum_{t=1}^T X_t\Big) + L \Big\|\frac{1}{T}\sum_{t=1}^T X_t - s^*\Big\|_2.$$
We use the version of Azuma's inequality for vector-valued martingales with bounded increments of Hayes (2005, Theorem 1.8) to obtain, for every $\epsilon > 0$:
$$\mathbb{P}\Big(\frac{1}{D_K}\Big\|\frac{1}{T}\sum_{t=1}^T X_t - s^*\Big\|_2 \ge \epsilon\Big) \le 2e^2 e^{-T\epsilon^2/2}.$$
Setting $\frac{\delta}{2} = 2e^2 e^{-T\epsilon^2/2}$ and solving for $\epsilon$ gives, with probability at least $1 - \delta/2$:
$$f^* \le f\Big(\frac{1}{T}\sum_{t=1}^T X_t\Big) + L D_K \sqrt{\frac{2 \ln \frac{4e^2}{\delta}}{T}} \le f_T^+ + L D_K \sqrt{\frac{2 \ln \frac{4e^2}{\delta}}{T}}.$$
For the second inequality of the lemma: by the $L$-Lipschitzness of $f$, it is a direct consequence of the lemma below, which is itself a direct consequence of (Hayes, 2005, Theorem 1.8). In the following lemma and its proof, we use the two following filtrations:
• $F = (F_t)_{t \in \mathbb{N}^*}$, where $F_t$ is the $\sigma$-algebra generated by $(x_1, a_1, r_1, \ldots, x_{t-1}, a_{t-1}, r_{t-1}, x_t)$;
• $\bar{F} = (\bar{F}_T)_{T \in \mathbb{N}^*}$, where $\bar{F}_T$ is the $\sigma$-algebra generated by $(x_1, a_1, r_1, \ldots, x_{T-1}, a_{T-1}, r_{T-1}, x_T, a_T)$.
Our setup implies that the process $(a_t)_{t \in \mathbb{N}^*}$ is adapted to $\bar{F}$ while $(r_t)_{t \in \mathbb{N}^*}$ is adapted to $F$.
Lemma 10 Under Assumption A, if the actions $(a_1, \ldots, a_T)$ define a process adapted to $(\bar{F}_T)_{T \in \mathbb{N}}$, then, for every $T \in \mathbb{N}$ and every $\delta$, with probability $1 - \delta$, we have:
$$\|s_T - \hat{s}_T\|_2 \le D_K \sqrt{\frac{2 \ln \frac{2e^2}{\delta}}{T}} \quad (34)$$
Proof. Let $X_T = \sum_{t=1}^T \big(r_t - \mu(x_t) a_t\big)$. We have $\|X_T - X_{T-1}\|_2 \le D_K$, and $(X_T)_{T \in \mathbb{N}}$ is a martingale satisfying $X_0 = 0$. We can then use the version of Azuma's inequality for vector-valued martingales with bounded increments of Hayes (2005, Theorem 1.8) to obtain, for every $\epsilon > 0$:
$$\mathbb{P}\Big(\frac{\|X_T\|_2}{D_K} \ge \epsilon\Big) \le 2e^2 e^{-\epsilon^2/(2T)}.$$
Since $\|s_T - \hat{s}_T\|_2 = \|X_T\|_2 / T$, solving for $\epsilon$ gives the desired result.
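As an illustrative sanity check of the $O(1/\sqrt{T})$ concentration in Lemma 10 (the constant expected reward and uniform noise below are made-up stand-ins for $\mu(x_t)a_t$ and $r_t$):

```python
import numpy as np

rng = np.random.default_rng(0)
D, T = 3, 10_000
mean = np.array([0.3, 0.5, 0.7])   # toy expected rewards mu(x_t) a_t (constant)

r = mean + rng.uniform(-0.5, 0.5, size=(T, D))   # bounded observed rewards r_t
s_hat_T = r.mean(axis=0)           # average observed reward
s_T = mean                         # average expected reward (constant here)

gap = float(np.linalg.norm(s_hat_T - s_T))
assert gap < 5 / np.sqrt(T)        # O(1/sqrt(T)) rate, with a generous constant
```

The gap $\|s_T - \hat{s}_T\|_2$ shrinks like $1/\sqrt{T}$, which is what allows the regret to be measured interchangeably on observed or expected rewards.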

D THE GENERAL TEMPLATE FRANK-WOLFE ALGORITHM

A more general framework. The analysis in the next sections is carried out within a more general framework than that of the main paper, described in Algorithm 2. As in the main paper, the action is drawn according to $a_t \sim A(h_t, x_t, \delta')$ (Line 3 of Alg. 2). However, we allow for a generic choice of Frank-Wolfe iterate, with respect to which we compute (an extension of) the scalar regret (presented in (36) below). The update direction is denoted by $\rho_t$ and is chosen according to a function $U(h_{t+1}, \delta')$, a companion function of $A(h_t, x_t, \delta')$. Note that the update direction is chosen given $h_{t+1} = (h_t, (x_t, a_t, r_t))$, the history after the action has been taken and the reward observed.

Algorithm 2: Generic Frank-Wolfe algorithm for CBCR.
input: initial point $z_0 \in K$, approx. RLOO confidence parameter $\delta'$
1: for $t = 1 \ldots T$ do
2:   Observe $x_t \sim P$
3:   Pull $a_t \sim A(h_t, x_t, \delta')$  // explore/exploit step
4:   Observe reward $r_t \in K$, update the running average of observed rewards $\hat{s}_t$
5:   Let $\rho_t = U(h_{t+1}, \delta')$  // generic Frank-Wolfe update
6:   Update $z_t = z_{t-1} + \frac{1}{t}(\rho_t - z_{t-1})$
7: end for

The proofs of the main paper apply to the special case of Alg. 2 where $\forall t \ge 1, \rho_t = r_t$; the FW iterate $z_t$ of Line 6 then satisfies $\forall t \ge 1, z_t = \hat{s}_t$. The reason we study this generalization is to show how our analysis applies in cases where the FW iterate is not the observed reward. In prior work on (non-contextual) BCR, Agrawal & Devanur (2014, Algorithm 4) follow an upper-confidence approach and use the upper confidence bound on the expected reward as update direction; the generalization introduced by $U(h_{t+1}, \delta')$ allows our analysis to encompass their approach. We need to update Assumptions A and B to account for the fact that $\rho_t$ is used in place of $r_t$. Assumption A′ $f$ is closed proper concave on $\mathbb{R}^D$ and $A$ is a compact subset of $\mathbb{R}^K$.
Moreover, there is a compact convex set $K \subseteq \mathbb{R}^D$ such that:
• (Bounded rewards and iterates) For all $t \in \mathbb{N}^*$, $r_t \in K$ and $\rho_t \in K$ with probability 1.
• (Local Lipschitzness) $f$ is $L$-Lipschitz continuous with respect to $\|.\|_2$ on an open set containing $K$.
Assumption B′ Assumption A′ holds and $f$ has $C$-Lipschitz-continuous gradients w.r.t. $\|.\|_2$ on $K$.
In Assumption A we added $\mu(x_t)a_t \in K$ for clarity, but it is not necessary, since $\mu(x_t)a_t \in K$ with probability 1 is implied by $r_t \in K$ with probability 1. The difference between Assumption A′ and Assumption A is to make sure that the updates $\rho_t$, and thus the iterates $z_t$, belong to $K$ and to the domain of definition of $f$. Notice that in the special case $\rho_t = r_t$, Assumption A′ reduces to Assumption A and, similarly, Assumption B′ reduces to Assumption B. We use the term smooth as a synonym of Lipschitz-continuous gradients. Analysis for (possibly) non-smooth objective functions. We now present a single analysis that encompasses both the case where $f$ is smooth (Assumption B of the main paper) and the case where $f$ may not be smooth, which we briefly discussed in Section 3.3. In order for our analysis to be agnostic to the type of smoothing used, and to also encompass the case where $f$ is smooth, we propose the following assumption, where $(f_t)_{t \in \mathbb{N}}$ is a sequence of smooth approximations of $f$:
Assumption E Assumption A′ holds and $\exists (\beta_0, L, M_1, M_2) \in \mathbb{R}_+^4$ such that $(f_t)_{t \in \mathbb{N}}$ satisfies:
1. $\forall t \in \mathbb{N}$, $f_t : \mathbb{R}^D \to \mathbb{R} \cup \{\pm\infty\}$ is proper closed concave on $\mathbb{R}^D$;
2. $\forall t \in \mathbb{N}$, $f_t$ is differentiable on $K$ with $\sup_{z \in K} \|\nabla f_t(z)\|_2 \le L$, and $f_t$ is $\frac{\sqrt{t+1}}{\beta_0}$-smooth on $K$;
3. $\forall t \in \mathbb{N}^*$, $\forall z \in K$, $|f_t(z) - f_{t-1}(z)| \le \frac{M_1}{t\sqrt{t}}$ and $|f_t(z) - f(z)| \le \frac{M_2}{\sqrt{t}}$.
Notice that any function $f$ satisfying Assumption B with smoothness coefficient $C$ satisfies Assumption E with $\beta_0 = 1/C$ and $M_1 = M_2 = 0$.
Regarding non-smooth $f$, we discuss in more detail in Appendix F specific methods to perform this smoothing, including the Moreau envelope used in Section 3.3. The generalization of the scalar regret takes into account both the approximation functions $(f_t)_{t \in \mathbb{N}}$ and the general update $z_t$:
$$R_T^{\mathrm{gen}} = \sum_{t=1}^T \max_{a \in A} \langle \nabla f_{t-1}(z_{t-1}) \,|\, \mu(x_t) a \rangle - \sum_{t=1}^T \langle \nabla f_{t-1}(z_{t-1}) \,|\, \rho_t \rangle + L T \|z_T - \hat{s}_T\|_2. \quad (36)$$
The general regret bound then takes the following form, where we distinguish between smooth and non-smooth $f$. Recall that $\bar{C} = C D_K^2 / 2$.
Theorem 11 Under Assumption B′, using $\forall t \in \mathbb{N}, f_t = f$: for every $T \in \mathbb{N}$, every $z_0 \in K$, and every $\delta > 0$, Algorithm 2 satisfies, with probability at least $1 - \delta$:
$$R_T \le \frac{R_T^{\mathrm{gen}}}{T} + L D_K \sqrt{\frac{2 \ln \frac{1}{\delta}}{T}} + \bar{C}\, \frac{\ln(eT)}{T}.$$
Theorem 12 Under Assumption E, for every $z_0 \in K$, every $T \ge 1$, and every $\delta > 0$, Algorithm 2 satisfies, with probability at least $1 - \delta$:
$$R_T \le \frac{R_T^{\mathrm{gen}}}{T} + \frac{\frac{D_K^2}{\beta_0} + 4M_1 + 2M_2 + L D_K \sqrt{2 \ln \frac{1}{\delta}}}{\sqrt{T}}.$$
The proofs are given in Appendix E. The worst-case regret of contextual bandits is $\Omega(\sqrt{T})$ (Bubeck & Cesa-Bianchi, 2012; Dani et al., 2008; Lattimore & Szepesvári, 2020), which gives a lower bound of $\Omega(\frac{1}{\sqrt{T}})$ for the worst-case regret of CBCR. The dependencies on the problem parameters are all directly inherited from the regret bound $R_T^{\mathrm{gen}}$ of the underlying scalar bandit algorithm (LinUCB, SquareCB, etc.). Therefore, we obtain CBCR algorithms that are near minimax optimal as soon as $R_T^{\mathrm{gen}} \le O(\sqrt{T})$. The residual $O(\frac{1}{\sqrt{T}})$ terms are tied to the use of Azuma's inequality (Lemma 13) and to the FW analysis (using the Lipschitz and smoothness parameters), and the dependencies on these parameters match usual convergence guarantees in optimization (Jaggi, 2013; Clarkson, 2010; Lan, 2013). As we rely on a worst-case analysis in deriving our reduction guarantees, it remains an open question whether problem-dependent optimal bounds could be recovered as well.
We make three remarks in order.
Remark 2 (Why we need a specific result for smooth $f$) The result for $C$-smooth $f$ has a better dependency than the general result with $\beta_0 = 1/C$ ($\ln(eT)$ instead of $\sqrt{T}$), which makes a fundamental difference in practice when the smoothness coefficient is close to $\sqrt{T}$. This is why we keep the two results separate.
Remark 3 (Comparison to the smoothing used by Agrawal & Devanur (2014)) Agrawal & Devanur (2014, Thm 5.4) present an analysis for non-smooth $f$ where, at a high level, they run the smooth algorithm using $f_T$ instead of a sequence $(f_t)_{t \in \mathbb{N}}$, and then apply the convergence bound for smooth $f$. Our analysis has two advantages. 1. Anytime bounds: our approach does not require the horizon to be known in advance. 2. Better bound: by suitably choosing the smoothing parameter, they obtain a bound of order $\sqrt{\ln T / T}$, whereas we obtain a bound of order $1/\sqrt{T}$. In practice, it may not make a difference if $\frac{R_T^{\mathrm{gen}}}{T}$ is itself of order $\sqrt{\ln T / T}$, but the advantage of our approach is clear as far as the analysis of FW for (C)BCR is concerned.
Remark 4 (About the confidence parameter $\delta'$ in $A(h_t, x_t, \delta')$ and $U(h_{t+1}, \delta')$) In practice, exploration/exploitation algorithms need a confidence parameter that defines the probability with which their regret guarantee holds. For instance, in confidence-based approaches, it is the probability with which the confidence intervals are valid at every time step. In our case, this means that explicit upper bounds on $R_T^{\mathrm{gen}}$ are of the form $R^{\mathrm{gen}}(T, \delta')$ and hold with probability $1 - \delta'$, where $\delta'$ is the confidence parameter of $A(h_t, x_t, \delta')$. Using the union bound, we obtain bounds of the form $R_T \le R^{\mathrm{gen}}(T, \delta')/T + O\big(\sqrt{\frac{\ln(1/\delta)}{T}}\big)$ that are valid with probability $1 - \delta - \delta'$. Note the difference in the roles of $\delta$ and $\delta'$: $\delta$ is not a parameter of the algorithm; it is only here to account for the randomization over contexts.
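To make Algorithm 2 concrete, here is a minimal sketch of its loop with placeholder stand-ins for $A$ and $U$. The constant reward distribution and the uniform arm choice are illustrative assumptions; a real instance of $A$ plays a scalar bandit on estimates of $\langle \nabla f_{t-1}(z_{t-1}) \,|\, \mu(x_t) a \rangle$:

```python
import numpy as np

rng = np.random.default_rng(0)
D, T = 2, 2000
mean_reward = np.array([0.8, 0.6])     # illustrative expected reward vector

def A_explore(z, x):
    # stand-in for A(h_t, x_t, delta'): a real instance plays a scalar
    # bandit on the linearized rewards <grad f_{t-1}(z_{t-1}) | mu(x_t) a>
    return int(rng.integers(2))

def U_update(r):
    # stand-in for U(h_{t+1}, delta'); the main paper's case is rho_t = r_t
    return r

z = np.zeros(D)                        # z_0 in K
for t in range(1, T + 1):
    x = rng.normal(size=D)             # observe x_t ~ P            (Line 2)
    a = A_explore(z, x)                # explore/exploit step       (Line 3)
    r = rng.normal(mean_reward, 0.1)   # observed reward r_t in K   (Line 4)
    z = z + (U_update(r) - z) / t      # z_t = z_{t-1} + (1/t)(rho_t - z_{t-1})
# with rho_t = r_t, z_T is exactly the running average s_hat_T of the rewards
```

The recursion $z_t = z_{t-1} + \frac{1}{t}(r_t - z_{t-1})$ with $z_0 = 0$ unrolls to the empirical mean of $r_1, \ldots, r_T$, which is why the main paper's special case satisfies $z_t = \hat{s}_t$.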

E PROOFS FOR SECTION 3 AND APPENDIX D

This section contains the proofs of the results of Section 3. All proofs are carried out in the more general framework described in Appendix D. The framework of the main paper is recovered as the special case ∀t ∈ N, ρ_t = r_t and z_t = ŝ_t.

Proof of Lemma 1. Lemma 1 is the special case of Lemma 13 when f is smooth. Note that every f satisfying Assumption A satisfies the assumptions of Lemma 13.

Proof of Theorem 2. Theorem 2 is a special case of Theorem 11 of Appendix D, using ∀t ∈ N, ρ_t = r_t and z_t = ŝ_t. The proof of Theorem 11 is given in Section E.1.

Lemma 13 Assume that, for every t, f_t is differentiable on K with ∀z ∈ K, ∥∇f_t(z)∥_2 ≤ L. Then, for every z ∈ K, we have:
\[ \mathbb E_{x\sim P}\Big[\max_{a\in\mathcal A}\langle \nabla f_{t-1}(z)\,|\,\mu(x)a\rangle\Big] = \max_{\pi:\mathcal X\to\mathcal A}\ \mathbb E_{x\sim P}\big[\langle \nabla f_{t-1}(z)\,|\,\mu(x)\pi(x)\rangle\big] = \max_{s\in\mathcal S}\langle \nabla f_{t-1}(z)\,|\,s\rangle. \tag{39} \]
Assume furthermore that z_t is a function of the contexts, actions and rewards up to time t. Let a*_t ∈ argmax_{a∈A} ⟨∇f_{t-1}(z_{t-1}) | µ(x_t)a⟩. For all δ ∈ (0,1], with probability at least 1-δ, we have:
\[ \sum_{t=1}^T \max_{s\in\mathcal S}\langle \nabla f_{t-1}(z_{t-1})\,|\,s-\mu(x_t)a^*_t\rangle \le L D_{\mathcal K}\sqrt{2T\ln\tfrac1\delta}. \tag{40} \]

Proof. Let z ∈ K. We first prove (39). The first equality in (39) comes from maximizing a linear objective over policies: define π*_t : X → A such that π*_t(x) ∈ argmax_{a∈A} ⟨∇f_{t-1}(z) | µ(x)a⟩, using an arbitrary tie-breaking rule when the argmax is not unique. For every policy π,
\[ \mathbb E_{x\sim P}\big[\langle \nabla f_{t-1}(z)\,|\,\mu(x)\pi(x)\rangle\big] \le \mathbb E_{x\sim P}\Big[\max_{a\in\mathcal A}\langle \nabla f_{t-1}(z)\,|\,\mu(x)a\rangle\Big] = \mathbb E_{x\sim P}\big[\langle \nabla f_{t-1}(z)\,|\,\mu(x)\pi^*_t(x)\rangle\big], \]
hence max_π E_x[⟨∇f_{t-1}(z)|µ(x)π(x)⟩] ≤ E_x[⟨∇f_{t-1}(z)|µ(x)π*_t(x)⟩]. On the other hand, it is clear that E_x[⟨∇f_{t-1}(z)|µ(x)π*_t(x)⟩] ≤ max_π E_x[⟨∇f_{t-1}(z)|µ(x)π(x)⟩], and we obtain the first equality of (39). The second equality in (39) holds by the definition of S, since for every policy π we have E_x[⟨∇f_{t-1}(z)|µ(x)π(x)⟩] = ⟨∇f_{t-1}(z) | E_x[µ(x)π(x)]⟩.

We now prove (40). Let (E_t)_{t≥1} denote the conditional expectations with respect to the filtration F = (F_t)_{t≥1}, where F_t is the σ-algebra generated by (x_{t′}, a_{t′}, r_{t′})_{t′≤t-1}, i.e., contexts, actions and rewards up to time t-1, so that we have:
E_t[⟨∇f_{t-1}(z_{t-1}) | µ(x_t)π*_t(x_t)⟩] = E_{x∼P}[⟨∇f_{t-1}(z_{t-1}) | µ(x)π*_t(x)⟩].
Using (39), this gives E_t[⟨∇f_{t-1}(z_{t-1}) | µ(x_t)π*_t(x_t)⟩] = max_{s∈S}⟨∇f_{t-1}(z_{t-1})|s⟩, from which we obtain, recalling a*_t = π*_t(x_t),
max_{s∈S}⟨∇f_{t-1}(z_{t-1}) | s - µ(x_t)a*_t⟩ = E_t[⟨∇f_{t-1}(z_{t-1})|µ(x_t)π*_t(x_t)⟩] - ⟨∇f_{t-1}(z_{t-1})|µ(x_t)π*_t(x_t)⟩.
Therefore X_T = Σ_{t=1}^T max_{s∈S}⟨∇f_{t-1}(z_{t-1})|s - µ(x_t)a*_t⟩ defines a martingale adapted to F and, with X_0 = 0, for all t:
\[ |X_t - X_{t-1}| \le L \sup_{s\in\mathcal S,\,x\in\mathcal X,\,a\in\mathcal A} \|s - \mu(x)a\|_2 \le L \sup_{z,z'\in\mathcal K}\|z-z'\|_2 \le L D_{\mathcal K}. \]
The result then follows from Azuma's inequality.
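The Azuma-Hoeffding step above can be checked numerically. The following sketch (our own toy setup) simulates a martingale with increments bounded by c and verifies that the deviation threshold c·√(2T ln(1/δ)) is exceeded with frequency at most δ:

```python
import numpy as np

# Empirical sanity check of the Azuma-Hoeffding bound used in Lemma 13:
# for a martingale with |X_t - X_{t-1}| <= c, P(X_T > c*sqrt(2 T ln(1/delta))) <= delta.
rng = np.random.default_rng(0)
c, T, delta, n_runs = 1.0, 200, 0.05, 2000

# bounded, mean-zero martingale increments (uniform on [-c, c])
increments = rng.uniform(-c, c, size=(n_runs, T))
X_T = increments.sum(axis=1)

threshold = c * np.sqrt(2 * T * np.log(1 / delta))
failure_rate = np.mean(X_T > threshold)
print(failure_rate)  # should be at most delta = 0.05
```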

Published as a conference paper at ICLR 2023

The next lemma is the main technical tool of the paper. Its proof is not technically difficult given the previous result: it uses the telescoping-sum approach of the proof of Lemma 12 of Berthet & Perchet (2017) and organizes the residual terms.

Lemma 14 Under Assumption E, denote, for every t ∈ N, f*_t = max_{s∈S} f_t(s) and R̃_t(z) = f*_t - f_t(z). Let C̄(T), F*(T) ∈ R ∪ {+∞} be such that, for all T ∈ N*:
\[ \sum_{t=1}^T \frac{D_{\mathcal K}^2 C_{t-1}}{2t} \le \bar C(T), \qquad \sum_{t=1}^T t\big(\tilde R_t(z_t)-\tilde R_{t-1}(z_t)\big) \le F^*(T), \]
and let B(T) = C̄(T) + F*(T). Then, for all z_0 ∈ K, all T ∈ N* and all δ > 0, Algorithm 2 satisfies, with probability at least 1-δ:
\[ f^*_T - f_T(\hat s_T) \le \frac{B(T) + R^{\mathrm{gen}}_T + L D_{\mathcal K}\sqrt{2T\ln\frac1\delta}}{T}. \tag{49} \]

Proof. We start with the standard ascent lemma using the bounded curvature on K (Bottou et al., 2018, Inequality 4.3), denoting C̄_t = D_K² C_t / 2:
\[ f_{t-1}(z_t) \ge f_{t-1}(z_{t-1}) + \frac1t\langle \nabla f_{t-1}(z_{t-1})\,|\,\rho_t - z_{t-1}\rangle - \frac{\bar C_{t-1}}{t^2}, \tag{50} \]
that is, f*_{t-1} - f_{t-1}(z_t) ≤ f*_{t-1} - f_{t-1}(z_{t-1}) - (1/t)⟨∇f_{t-1}(z_{t-1}) | ρ_t - z_{t-1}⟩ + C̄_{t-1}/t². Denote g_t = ∇f_{t-1}(z_{t-1}) and let a*_t ∈ argmax_{a∈A}⟨g_t | µ(x_t)a⟩. We first decompose the middle term:
\[ \langle g_t \,|\, \rho_t - z_{t-1}\rangle = \max_{s\in\mathcal S}\langle g_t\,|\,s - z_{t-1}\rangle - \underbrace{\max_{s\in\mathcal S}\langle g_t\,|\,s - \mu(x_t)a^*_t\rangle}_{\alpha_t} - \underbrace{\langle g_t\,|\,\mu(x_t)a^*_t - \rho_t\rangle}_{\bar\rho_t} \ge f^*_{t-1} - f_{t-1}(z_{t-1}) - \alpha_t - \bar\rho_t, \tag{52} \]
where the inequality uses the concavity of f_{t-1}: for any s*_{t-1} ∈ argmax_{s∈S} f_{t-1}(s),
\[ f^*_{t-1} - f_{t-1}(z_{t-1}) \le \langle \nabla f_{t-1}(z_{t-1})\,|\,s^*_{t-1} - z_{t-1}\rangle \le \max_{s\in\mathcal S}\langle \nabla f_{t-1}(z_{t-1})\,|\,s - z_{t-1}\rangle. \tag{53} \]
We thus get
\[ f^*_{t-1} - f_{t-1}(z_t) \le \Big(1-\frac1t\Big)\big(f^*_{t-1} - f_{t-1}(z_{t-1})\big) + \frac{\alpha_t + \bar\rho_t}{t} + \frac{\bar C_{t-1}}{t^2} \tag{54} \]
\[ \Longrightarrow\quad t\,\tilde R_t(z_t) \le (t-1)\tilde R_{t-1}(z_{t-1}) + \alpha_t + \bar\rho_t + \frac{\bar C_{t-1}}{t} + t\big(\tilde R_t(z_t)-\tilde R_{t-1}(z_t)\big) \tag{55} \]
\[ \Longrightarrow\quad T\,\tilde R_T(z_T) \le \sum_{t=1}^T \alpha_t + \sum_{t=1}^T \bar\rho_t + \sum_{t=1}^T t\big(\tilde R_t(z_t)-\tilde R_{t-1}(z_t)\big) + \sum_{t=1}^T \frac{\bar C_{t-1}}{t}. \tag{56} \]
Using the Lipschitz property of f_T, we finally obtain
\[ T\,\tilde R_T(\hat s_T) \le \sum_{t=1}^T \alpha_t + \sum_{t=1}^T \bar\rho_t + TL\|z_T - \hat s_T\|_2 + \sum_{t=1}^T t\big(\tilde R_t(z_t)-\tilde R_{t-1}(z_t)\big) + \sum_{t=1}^T \frac{\bar C_{t-1}}{t}, \]
where Σ_t α_t ≤ L D_K √(2T ln(1/δ)) with probability ≥ 1-δ by Lemma 13, Σ_t ρ̄_t + TL∥z_T - ŝ_T∥_2 ≤ R^gen_T by definition (36), Σ_t t(R̃_t(z_t)-R̃_{t-1}(z_t)) ≤ F*(T), and Σ_t C̄_{t-1}/t ≤ C̄(T). This is the desired result.

E.1 PROOFS OF THE MAIN RESULTS

We prove the results of Appendix D.

Proof of Theorem 11. First, notice that since f is differentiable on K (being smooth), and since both z_T and (1/T)Σ_{t=1}^T µ(x_t)a_t lie in K, using ∀t, f_t = f we have R_T = f* - f(ŝ_T) = f*_T - f_T(ŝ_T). Using the notation of Lemma 14, we then have F*(T) = 0, and since C_t = C for all t:
\[ \sum_{t=1}^T \frac{D_{\mathcal K}^2 C_{t-1}}{2t} = \sum_{t=1}^T \frac{\bar C}{t} \le \bar C(\ln T + 1) = \bar C \ln(eT). \]
The result then follows from Lemma 14.

Proof of Theorem 12. Using the notation of Lemma 14, we specify C̄(T) and F*(T) in turn. Since C_{t-1} = √t/β_0,
\[ \sum_{t=1}^T \frac{D_{\mathcal K}^2 C_{t-1}}{2t} = \sum_{t=1}^T \frac{D_{\mathcal K}^2}{2\beta_0\sqrt t} \le \frac{D_{\mathcal K}^2}{\beta_0}\sqrt T. \]
For F*(T), we decompose R̃_t(z_t) - R̃_{t-1}(z_t) into two terms:
\[ \tilde R_t(z_t) - \tilde R_{t-1}(z_t) = f^*_t - f^*_{t-1} + f_{t-1}(z_t) - f_t(z_t) \le \frac{2M_1}{t\sqrt t}. \]
Using Σ_{t=1}^T 1/√t ≤ 2√T, we obtain F*(T) ≤ 2M_1 Σ_{t=1}^T t/(t√t) ≤ 4M_1√T. Lemma 14 then gives
\[ f^*_T - f_T(\hat s_T) \le \frac{R^{\mathrm{gen}}_T}{T} + \Big(\frac{D_{\mathcal K}^2}{\beta_0} + 4M_1 + L D_{\mathcal K}\sqrt{2\ln\tfrac1\delta}\Big)\frac{1}{\sqrt T}. \tag{61} \]
To finish the proof, notice that:
\[ \big(f^* - f(\hat s_T)\big) - \big(f^*_T - f_T(\hat s_T)\big) \le 2\sup_{z'\in\mathcal K}|f_T(z') - f(z')| \le \frac{2M_2}{\sqrt T}. \tag{62} \]
The result follows from (61) and (62), using R_T = f* - f(ŝ_T) ≤ f*_T - f_T(ŝ_T) + 2M_2/√T.

F SMOOTH APPROXIMATIONS OF NON-SMOOTH FUNCTIONS

We discuss here in more detail two specific smoothing techniques: the Moreau envelope, also called Moreau-Yosida regularization, in Section F.1, then randomized smoothing in Section F.2. As in Appendices D and E, we focus on the general framework described in Algorithm 2.

Proof of Theorem 3. Using Theorem 12 above and Lemma 16 below gives the result, since
\[ \frac{D_{\mathcal K}^2}{\beta_0} + 4M_1 + 2M_2 = \frac{D_{\mathcal K}^2}{\beta_0} + 3L^2\beta_0 = LD_{\mathcal K}\Big(\frac{D_{\mathcal K}}{L\beta_0} + \frac{3L\beta_0}{D_{\mathcal K}}\Big). \]

F.1 SMOOTHING WITH THE MOREAU ENVELOPE

For non-smooth functions, we first propose a smoothing technique based on the Moreau envelope, following the approach described by Lan (2013). Let f : R^D → R ∪ {-∞} be a closed proper concave function. The Moreau envelope (or Moreau-Yosida regularization) of f with parameter β > 0 (Rockafellar & Wets, 2009, Def. 1.22) is defined as
\[ \tilde f_\beta(z) = \max_{y\in\mathbb R^D}\ f(y) - \frac{1}{2\beta}\|y - z\|_2^2. \]
For β > 0, let prox_β(z) = argmax_{y∈R^D} { f(y) - (1/2β)∥y-z∥²_2 } denote the proximal operator. The basic properties of the Moreau envelope (Rockafellar & Wets, 2009, Th. 2.26) are that if f : R^D → R ∪ {±∞} is an upper semicontinuous, proper concave function, then f̃_β is concave, finite everywhere, and continuously differentiable with (1/β)-Lipschitz gradients. Moreover, the proximal operator prox_β is well-defined (the argmax is attained at a single point) and ∇f̃_β(z) = (1/β)(prox_β(z) - z). It is immediate to prove the following inequalities for every z ∈ R^D and every β > 0: f(z) ≤ f̃_β(z) ≤ f(prox_β(z)). The following properties of the Moreau envelope (see (Yurtsever et al., 2018, Appendix A.1) and (Thekumparampil et al., 2020, Lemma 1)) are key to the main results:

Lemma 15 Let β > 0, let f : R^D → R ∪ {±∞} be a proper closed concave function, and let Z ⊆ R^D be a convex set on which f is locally L-Lipschitz-continuous. Then:
• ∀z ∈ Z such that prox_β(z) ∈ Z, we have ∥z - prox_β(z)∥ ≤ Lβ and:
\[ \tilde f_\beta(z) - \frac{L^2\beta}{2} \le f(z) \le \tilde f_\beta(z). \]
• ∀z ∈ Z such that prox_β(z) ∈ Z, and for all β > 0 and β′ > 0 with β′ ≤ β, we have:
\[ \tilde f_\beta(z) - \tilde f_{\beta'}(z) \le \frac12\Big(\frac{1}{\beta'}-\frac{1}{\beta}\Big)\|z-\mathrm{prox}_\beta(z)\|_2^2 \le \frac{L^2\beta}{2}\Big(\frac{\beta}{\beta'}-1\Big). \]

We reformulate the lemma above in the language of Appendix D:

Lemma 16 Under Assumption A, assume furthermore that f is L-Lipschitz on R^D. Let f_t = f̃_{β_t} with β_t = β_0/√(t+1). Then f and (f_t)_{t∈N} satisfy Assumption E with the corresponding values of β_0 and L, M_2 = L²β_0/2 and M_1 = L²β_0/2.

Proof. By Lemma 15, f_t is L-Lipschitz on R^D for every t, and we have M_2 = L²β_0/2. Moreover, Lemma 15 also gives
\[ 0 \le f_{t-1}(z) - f_t(z) \le \frac{L^2\beta_0}{2t}\big(\sqrt{t+1}-\sqrt t\big) \le \frac{L^2\beta_0}{2t\sqrt t}, \]
and thus M_1 = L²β_0/2.
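The sandwich inequality of Lemma 15 can be illustrated numerically. The sketch below (our own toy example) computes the Moreau envelope of the non-smooth concave function f(z) = -|z| (so L = 1) by brute-force maximization over a grid, and checks f(z) ≤ f̃_β(z) ≤ f(z) + L²β/2:

```python
import numpy as np

# Moreau envelope of f(z) = -|z| (concave, 1-Lipschitz), computed on a grid.
f = lambda z: -np.abs(z)
L, beta = 1.0, 0.5

ys = np.linspace(-3, 3, 6001)     # candidate maximizers y
zs = np.linspace(-2, 2, 201)
# f_beta(z) = max_y f(y) - ||y - z||^2 / (2 beta)
f_beta = np.array([np.max(f(ys) - (ys - z) ** 2 / (2 * beta)) for z in zs])

print(float(np.max(f_beta - f(zs))))   # max gap, approximately L^2 * beta / 2
```

For this f the envelope is a (negated) Huber function: it matches -z²/(2β) for |z| ≤ β and -(|z| - β/2) outside, so the maximal gap to f is exactly L²β/2.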

F.2 RANDOMIZED SMOOTHING

We now describe the randomized smoothing technique (Lan, 2013; Nesterov & Spokoiny, 2017; Duchi et al., 2012; Yousefian et al., 2012), which consists in convolving f with a probability density function Λ. Following Lan (2013), who combines Frank-Wolfe with randomized smoothing for non-smooth optimization, we present our results with Λ the uniform distribution on the ℓ_2-ball {z ∈ R^D : ∥z∥_2 ≤ 1}. Let β > 0 and let ξ be a random variable with density Λ. The randomized smoothing approximation of f is defined as:
\[ f_\beta(z) := \mathbb E_\Lambda[f(z+\beta\xi)] = \int_{\mathbb R^D} f(z+\beta y)\,\Lambda(y)\,\mathrm dy. \tag{70} \]
Following (Lan, 2013; Duchi et al., 2012), we abuse notation and take the "gradient" of f inside integrals and expectations below, since f is almost-everywhere differentiable because it is Lipschitz-continuous. We restate the following well-known properties of randomized smoothing (see e.g., (Yousefian et al., 2012, Lemma 8)):

Lemma 17 Let β > 0 and f_β be defined as in Eq. (70). Then:
• ∀z ∈ K, f(z) ≤ f_β(z) ≤ f(z) + Lβ;
• f_β is L-Lipschitz continuous over K;
• f_β is continuously differentiable and its gradient is (L√D/β)-Lipschitz continuous;
• ∀z ∈ K, ∇f_β(z) = E[∇f(z + βξ)].

We obtain the following result, stated in the language of Theorem 12 of Appendix D.

Lemma 18 Under Assumption A, assume furthermore that f is L-Lipschitz on R^D. For t ≥ 1, let f_t = f_{β_t} with β_t = D^{1/4}D_K/√(t+1), and let β_0 = D_K/(L D^{1/4}). Then f and (f_t)_{t∈N} satisfy Assumption E with the corresponding values of β_0 and L, M_2 = L D^{1/4} D_K and M_1 = 2L D^{1/4} D_K.

Proof. By Lemma 17, f_t is L-Lipschitz on R^D for every t, so f_t has L-bounded gradients. Moreover, with this definition of β_0, f_t is (√(t+1)/β_0)-smooth.
We have M_2 = L D^{1/4} D_K because:
\[ |f_t(z) - f(z)| = \big|\mathbb E[f(z+\beta_t\xi)] - f(z)\big| \le \mathbb E\big[|f(z+\beta_t\xi) - f(z)|\big] \le \mathbb E\big[\|L\beta_t\xi\|_2\big] \le \frac{L D^{1/4} D_{\mathcal K}}{\sqrt t}. \]
We also have M_1 = 2L D^{1/4} D_K because:
\[ |f_{t-1}(z) - f_t(z)| \le \mathbb E\big[|f(z+\beta_{t-1}\xi) - f(z+\beta_t\xi)|\big] \le L|\beta_{t-1}-\beta_t|\,\mathbb E[\|\xi\|_2] \tag{72} \]
\[ = L D^{1/4} D_{\mathcal K}\Big(\frac{1}{\sqrt t} - \frac{1}{\sqrt{t+1}}\Big) \le \frac{2L D^{1/4} D_{\mathcal K}}{t^{3/2}}. \tag{73} \]

G FW-LINUCB: UPPER-CONFIDENCE BOUNDS FOR LINEAR BANDITS WITH K ARMS

In this section, we have:
• a finite action space A which is the canonical basis of R^K, i.e., we focus on the multi-armed bandit setting;
• X ⊆ R^{d×K}, where d is the dimension of the feature space; given x ∈ X, the feature representation of arm a ∈ A is the matrix-vector product xa;
• given a matrix θ ∈ R^{D×d}, we denote by ∥θ∥_F the Frobenius norm of θ, i.e., ∥θ∥_F = ∥flatten(θ)∥_2.
In addition, we make here the following linearity assumption on the rewards:

Assumption F There is θ ∈ R^{D×d} with ∥θ∥_F ≤ D_θ such that ∀x ∈ X, ∀a ∈ A, µ(x)a = θxa. Moreover, there is D_X > 0 such that sup_{x∈X, a∈A} ∥xa∥_2 ≤ D_X.

We perform the analysis under Assumption E, which is the most general we consider. In particular, we assume access to a sequence (f_t)_{t∈N} of smooth approximations of f. We focus on the special case of Algorithm 2 described in the main paper, i.e., where ρ_t = r_t.

Algorithm 3: FW-LinUCB: linear CBCR with K arms.
input: δ′ > 0, λ > 0, ŝ_0 ∈ K; V_0 = λI_{dD}, y_0 = 0_{dD}, θ̂_0 = 0_{dD}
1 for t = 1, ... do
2   Observe context x_t ∼ P, x_t ∈ R^{d×K}
3   g_t ← ∇f_{t-1}(ŝ_{t-1}), x̄_t ← [g_{t,1}x_t; ...; g_{t,D}x_t]
4   ∀i ∈ [K], û_{t,i} ← θ̂_{t-1}^⊺ x̄_{t,i} + α_t(δ′/2)∥x̄_{t,i}∥_{V^{-1}_{t-1}}  // see (75) and (76) for the definitions of ∥·∥_{V^{-1}_{t-1}} and α_t
5   a_t ← argmax_{a∈A} û_t a
6   Observe reward r_t, let r̄_t = g_t^⊺ r_t
7   Update ŝ_t ← ŝ_{t-1} + (1/t)(r_t - ŝ_{t-1})
8   V_t ← V_{t-1} + (x̄_t a_t)(x̄_t a_t)^⊺, y_t ← y_{t-1} + r̄_t x̄_t a_t, and θ̂_t ← V_t^{-1} y_t  // regression
9 end

The algorithm.
As hinted in Section 3.2, FW-LinUCB applies the LinUCB algorithm (Abbasi-Yadkori et al., 2011), designed for scalar-reward contextual bandits with adversarial contexts and stochastic rewards, to the following extended rewards and contexts, where we use [.;.] to denote the vertical concatenation of matrices and g_t = ∇f_{t-1}(ŝ_{t-1}):
• x̄_t = [g_{t,1}x_t; ...; g_{t,D}x_t] ∈ R^{Dd×K} is the extended context, so that the feature vector of action a at time t is x̄_t a;
• r̄_t = g_t^⊺ r_t is the observed scalar reward;
• θ̄ = flatten(θ) ∈ R^{dD} is the ground-truth parameter vector, and μ̄(x̄_t)a = θ̄^⊺ x̄_t a is the average scalar reward of action a.
Notice that under Assumptions A and F, denoting X̄ = {[g_1 x; ...; g_D x] : ∥g∥_2 ≤ L, x ∈ X} and D_X̄ = sup_{x̄∈X̄, a∈A} ∥x̄a∥_2, we have x̄_t ∈ X̄ for all t with probability 1 and D_X̄ ≤ L D_X. Moreover, |r̄_t - μ̄(x̄_t)a_t| ≤ L D_K, which implies in particular that for every t, r̄_t is (L D_K/2)-subgaussian.

Given this notation, the FW-LinUCB algorithm is LinUCB applied to the scalar-reward bandit problem above. The algorithm is summarized in Algorithm 3 for completeness, where λ is the regularization parameter of the ridge regression, θ̂_t contains the current regression parameters, and the matrix V_t and the vector y_t are incremental computations of the quantities needed to compute θ̂_t. The crucial part of the algorithm is Line 4, which defines an upper confidence bound û_t ∈ R^K on μ̄(x̄_t)a:
\[ \forall i \in [K],\quad \hat u_{t,i} = \hat\theta_{t-1}^\top \bar x_{t,i} + \alpha_t(\delta'/2)\,\|\bar x_{t,i}\|_{V^{-1}_{t-1}}, \quad\text{where } \|\bar x_{t,i}\|_{V^{-1}_{t-1}} = \sqrt{\bar x_{t,i}^\top V^{-1}_{t-1}\bar x_{t,i}}, \tag{75} \]
and α_t is defined according to Theorem 2 of Abbasi-Yadkori et al. (2011):
\[ \alpha_t(\delta') = \frac{LD_{\mathcal K}}{2}\sqrt{dD\,\ln\Big(\frac{1+TD_{\bar{\mathcal X}}^2/\lambda}{\delta'}\Big)} + \sqrt\lambda\, D_\theta. \tag{76} \]
Under Assumptions A and F, we have with probability ≥ 1-δ′/2: ∀t ∈ N*, ∀a ∈ A, û_t a ≥ μ̄(x̄_t)a (Abbasi-Yadkori et al., 2011, Theorem 2).

The result. Let d̄ = dD.
The regret bound of LinUCB (Abbasi-Yadkori et al., 2011, Theorem 3) and Azuma's inequality give:

Theorem 19 Under Assumption E, for every T ∈ N* and every δ′ > 0, Algorithm 3 satisfies, with probability at least 1-δ′:
\[ R^{\mathrm{scal}}_T \le 4\sqrt{T\bar d\log\Big(1+\frac{TD_{\bar{\mathcal X}}^2}{\lambda\bar d}\Big)}\,\bigg(\sqrt\lambda D_\theta + \frac{LD_{\mathcal K}}{2}\sqrt{2\ln\frac2{\delta'} + \bar d\ln\Big(1+\frac{TD_{\bar{\mathcal X}}^2}{\lambda\bar d}\Big)}\bigg) + LD_{\mathcal K}\sqrt{2T\ln\frac2{\delta'}}. \]

Proof. We decompose:
\[ R^{\mathrm{scal}}_T = \underbrace{\sum_{t=1}^T \max_{a\in\mathcal A}\bar\mu(\bar x_t)a - \sum_{t=1}^T \bar\mu(\bar x_t)a_t}_{\text{pseudo-regret}} + \sum_{t=1}^T \underbrace{\big(\bar\mu(\bar x_t)a_t - \bar r_t\big)}_{\bar X_t}. \tag{78} \]
The pseudo-regret term is bounded using Theorem 3 of Abbasi-Yadkori et al. (2011). The result applies as-is, except that they assume the rewards satisfy |θ̄^⊺ x̄_t a| ≤ 1, which is not the case here. The bound remains valid without changes because, in our case, |max_{a∈A} μ̄(x̄_t)a - μ̄(x̄_t)a_t| ≤ LD_K. The step of the proof where they use the assumption |θ̄^⊺ x̄_t a| ≤ 1 is below Equation 7 of (Abbasi-Yadkori et al., 2011, Appendix C), which in our notation and under our assumptions can be written as:
\[ \max_{a\in\mathcal A}\bar\mu(\bar x_t)a - \bar\mu(\bar x_t)a_t \le \min\big(2\alpha_t(\delta'/2)\|\bar x_t a_t\|_{V^{-1}_{t-1}},\, LD_{\mathcal K}\big) \le 2\alpha_t(\delta'/2)\min\big(\|\bar x_t a_t\|_{V^{-1}_{t-1}},\, 1\big), \tag{79} \]
where the first inequality comes from Abbasi-Yadkori et al. (2011) and the second one holds in our case because 2α_t(δ′) ≥ LD_K. From here on, the proof of Abbasi-Yadkori et al. (2011)'s regret bound proceeds exactly as the original result. Theorem 3 of Abbasi-Yadkori et al. (2011) gives the first term of our regret bound, which holds with probability at least 1-δ′/2 in our case because we use α_t(δ′/2). For the rightmost term, let F = (F_t)_{t∈N*} be the filtration where F_t is the σ-algebra generated by (x_1, a_1, r_1, ..., x_{t-1}, a_{t-1}, r_{t-1}, x_t, a_t). Then (X̄_t)_{t∈N*} is a martingale difference sequence adapted to F with |X̄_t| ≤ LD_K. By Azuma's inequality, Σ_{t=1}^T X̄_t ≤ LD_K√(2T ln(2/δ′)) with probability 1-δ′/2. The final result holds using a union bound.

Bound of Table 1.
The bound is obtained by keeping the main dependencies in T, d, L and D_K, ignoring the dependencies on λ and D_θ, and using the fact that D_X̄ ≤ L D_X (as described below (74)).
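The FW-LinUCB loop can be sketched numerically. Everything in the sketch below is our own simplifying assumption (not the paper's implementation): K = 3 arms, d = 2 features, D = 2 objectives, a toy smooth concave objective, and a fixed exploration coefficient `alpha` in place of the exact α_t(δ′) schedule of (76). It illustrates the key mechanics: the extended context g ⊗ x, the scalarized reward g^⊺r fed to ridge regression, and optimistic arm selection.

```python
import numpy as np

# Toy sketch of FW-LinUCB (Algorithm 3) with a fixed exploration coefficient.
rng = np.random.default_rng(1)
K, d, D, T, lam, alpha = 3, 2, 2, 3000, 1.0, 0.5
theta_true = rng.uniform(-0.5, 0.5, size=(D, d))   # unknown parameter matrix

def grad_f(z):                      # gradient of f(z) = z0 + z1 - (z0 - z1)^2
    dz = 2.0 * (z[0] - z[1])
    return np.array([1.0 - dz, 1.0 + dz])

V = lam * np.eye(d * D)             # ridge statistics on flatten(theta)
y = np.zeros(d * D)
theta_hat = np.zeros(d * D)
s_hat, adv = np.zeros(D), 0.0

for t in range(1, T + 1):
    x = rng.uniform(-1, 1, size=(d, K))       # context: one feature vector per arm
    g = grad_f(s_hat)
    x_ext = np.kron(g.reshape(-1, 1), x)      # extended context (dD x K)
    V_inv = np.linalg.inv(V)
    bonus = np.sqrt(np.einsum('ik,ij,jk->k', x_ext, V_inv, x_ext))
    a = int(np.argmax(theta_hat @ x_ext + alpha * bonus))   # optimistic arm
    r = theta_true @ x[:, a] + 0.05 * rng.standard_normal(D)
    s_hat += (r - s_hat) / t                  # running average of vector rewards
    V += np.outer(x_ext[:, a], x_ext[:, a])   # scalarized LinUCB regression update
    y += (g @ r) * x_ext[:, a]
    theta_hat = np.linalg.solve(V, y)
    means = g @ (theta_true @ x)              # scalarized mean reward per arm
    adv += means[a] - means.mean()            # advantage over uniform play

print(adv / T)   # average scalarized advantage of the played arms
```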

H FW-SQUARECB: CBCR WITH GENERAL REWARD FUNCTIONS

The SquareCB algorithm was recently proposed by Foster & Rakhlin (2020) for zero-regret contextual multi-armed bandits with general reward functions, based on the notion of online regression oracles. For single-reward contextual bandits with adversarial contexts and stochastic rewards, they propose a generic randomized exploration scheme that delegates learning to an online regression algorithm. Their exploration/exploitation strategy then has (bandit) regret bounded as a function of the online regret of the regression algorithm. In this section, we extend the SquareCB approach to our setting of CBCR with concave rewards. The main interest of this section is that, by building on the work of Foster & Rakhlin (2020), we obtain at nearly no cost an algorithm for multi-armed CBCR problems with general reward functions. To simplify the notation, we consider the case of finite K with atomic actions, i.e., |A| = K. Our algorithm is based on an oracle for multi-dimensional regression RegSq, which provides approximate values for µ:
\[ \forall T,\ \forall x\in\mathcal X,\quad \hat\mu_T(x) = \mathrm{RegSq}\big(x, (x_1,a_1,r_1,\ldots,x_{T-1},a_{T-1},r_{T-1})\big). \tag{80} \]
The key assumption is that the problem is realizable and that RegSq has bounded regret:

Assumption G There is a function T ↦ R_oracle(T) ∈ R, non-decreasing in T, and Φ, a class of functions from X to R^{D×K}, such that, for every T ∈ N:
1. (Realizability) µ ∈ Φ;
2. (Regret bound) for every (x_t, a_t, r_t)_{t∈[T]} ∈ (X × A × K)^T, we have:
\[ \sum_{t=1}^T \big\|\hat\mu_t(x_t)a_t - r_t\big\|_2^2 - \inf_{\phi\in\Phi}\sum_{t=1}^T \big\|\phi(x_t)a_t - r_t\big\|_2^2 \le R_{\mathrm{oracle}}(T); \tag{81} \]
3. for every (x_t, a_t, r_t)_{t∈[T]} ∈ (X × A × K)^T, μ̂_T(x_T)a_T ∈ K.

Assumption G is the counterpart for multi-dimensional regression of Assumptions 1 and 2a of Foster & Rakhlin (2020), which are the basis of the original SquareCB algorithm.
Remark 5 (The "informal" assumption used in Table 1) Notice that in Table 1, we describe an "informal" version of this assumption, which reads Σ_{t=1}^T ∥μ̂_t(x_t)a_t - µ(x_t)a_t∥²_2 ≤ R_oracle(T); it is the counterpart for multi-dimensional regression of Assumption 2b of Foster & Rakhlin (2020). We made this choice in the table to simplify the presentation, as this assumption is shorter, and our analysis is also valid under this alternative assumption. Our proofs are carried out under Assumption G because it is more widely applicable (more discussion of these assumptions can be found in (Foster & Rakhlin, 2020)).

Algorithm 4 describes how the SquareCB principles apply to our framework. We use the framework of the main paper or, equivalently, the special case of Algorithm 2 where ∀t ∈ N, ρ_t = r_t and z_t = ŝ_t. Note that the algorithm is parameterized by (γ_t)_{t∈N*} instead of the desired confidence level δ′ to make the analysis more general; Theorem 20 gives a formula for γ_t as a function of the desired confidence δ′. As in the previous sections, we describe the algorithm for the general case of smooth approximations of f, using ∇f_{t-1} rather than ∇f in Line 4 of the algorithm. At time step t, the regression oracle provides an estimate of µ(x_t), then the algorithm computes a distribution A_t over actions, giving larger probability to the action maximizing a ↦ ⟨∇f_{t-1}(ŝ_{t-1}) | μ̂_t(x_t)a⟩. The exact formula for A_t follows the original SquareCB algorithm, except that we use an iteration-dependent γ_t instead of a constant γ. The main result of this section is the following (see Section H.2 and the intermediate lemmas below):

Theorem 20 Let δ′ > 0. For every t ∈ N*, let
\[ \gamma_t = \frac{2}{L}\sqrt{\frac{tK}{R_{\mathrm{oracle}}(t) + 8D_{\mathcal K}^2\ln(4t^2/\delta')}}. \]
Then, under Assumptions E and G, Algorithm 4 satisfies, with probability at least 1-δ′:
\[ R^{\mathrm{gen}}_T \le 4L\sqrt{KT\Big(R_{\mathrm{oracle}}(T) + 8D_{\mathcal K}^2\ln\frac{4T^2}{\delta'}\Big)} + LD_{\mathcal K}\sqrt{2T\ln\frac2{\delta'}}. \tag{83} \]
Recall that Assumption B′ is the special case of Assumption E where ρ_t = r_t and z_t = ŝ_t. The bound on R^gen_T is thus the same irrespective of whether we use the algorithm for smooth f (in which case R^scal_T = R^gen_T) or with smooth approximations (in which case R^{scal,sm}_T = R^gen_T): only the Lipschitzness of (f_t)_{t∈N} is used in the analysis of R^gen_T for FW-SquareCB. The following result is a direct corollary of Theorem 20 and gives the order of magnitude we obtain for smooth f; obtaining a similar result for smooth approximations of f, using Theorem 12 instead of Theorem 11, is straightforward.

Proof of the FW-SquareCB regret bound of Table 1. We apply the bound obtained in Theorem 20 within the bound of Theorem 11, splitting the confidence budget as δ′ := 2δ/3 and δ := δ/3. We obtain:
\[ R_T \le \frac{4L\sqrt{KT\big(R_{\mathrm{oracle}}(T)+8D_{\mathcal K}^2\ln\frac{12T^2}{\delta}\big)} + 2LD_{\mathcal K}\sqrt{2T\ln\frac3\delta} + \bar C\ln(eT)}{T}. \tag{84} \]
The bound given in the table uses the sub-additivity of √· to group the terms in √(ln(1/δ)) for better readability.

The proof of Theorem 20 is decomposed into two subsections: in the next subsection, we make the necessary adaptations to the SquareCB analysis to account for multi-dimensional regression. This proof follows essentially the same steps as the original analysis of SquareCB, with only two changes:
• We use multi-dimensional regression instead of scalar regression, while we need to bound a scalar regret. There is an additional step to go from the scalar regret to the multi-dimensional regression error, but it turns out there is no added difficulty (see the first line of the proof of Lemma 23).
• For coherence with the overall bounds of the paper, we use an anytime analysis with an increasing sequence (γ_t)_{t∈N*}, instead of a fixed exploration parameter γ that needs to be tuned for a specific horizon determined a priori.
This introduces a bit more difficulty, the main tool being Lemma 24. Our choice of an anytime bound is more for coherence in the presentation of the paper than an intended contribution. Nonetheless, what we gain with our anytime bound is that the exploration parameter γ_t does not depend on a fixed horizon. What we lose is that we need a high-probability bound on cumulative errors based on R_oracle(t) that is valid for every t (see Lemma 23), whereas the "fixed γ" case only requires this bound to hold at the horizon T. This is the reason for the ln T factor in our bound, which is not present in the original paper.

In the next sections, we use the following notation: g_t = ∇f_{t-1}(ŝ_{t-1}), µ̄_t = g_t^⊺µ(x_t), µ̄*_t = max_{a∈A} µ̄_t a, µ̂̄_t = g_t^⊺μ̂_t(x_t), and µ̂̄*_t = max_{a∈A} µ̂̄_t a.

H.1 ADAPTATION OF THE SQUARECB PROOF TO CBCR

In the SquareCB paper, Foster & Rakhlin (2020) study high-probability bounds on a different type of regret, based on the average rewards µ(x_t)a_t associated with the chosen actions rather than the observed rewards r_t. However, this difference has little influence, since we can start with the following inequality, which is similar to (Foster & Rakhlin, 2020, Lemma 2).

Lemma 21 Under Assumption E, for every T ∈ N* and every δ′ > 0, Algorithm 4 satisfies, with probability at least 1-δ′:
\[ \sum_{t=1}^T \big(\bar\mu^*_t - g_t^\top r_t\big) \le \sum_{t=1}^T \mathbb E_{a\sim A_t}\big[\bar\mu^*_t - \bar\mu_t a\big] + LD_{\mathcal K}\sqrt{2T\ln(1/\delta')}. \]

Proof. The proof is by Azuma's inequality. Let F = (F_t)_{t∈N*} be the filtration where F_t is the σ-algebra generated by (x_1, a_1, r_1, ..., x_{t-1}, a_{t-1}, r_{t-1}, x_t), and denote X_T = Σ_{t=1}^T (E_{a∼A_t}[µ̄_t a] - g_t^⊺r_t). Then (X_T)_{T∈N} is a martingale adapted to the filtration F and satisfies |X_t - X_{t-1}| ≤ LD_K. We obtain the result by noticing that X_T = Σ_{t=1}^T(µ̄*_t - g_t^⊺r_t) - Σ_{t=1}^T E_{a∼A_t}[µ̄*_t - µ̄_t a] and applying Azuma's inequality to X_T.
Notice that the difference between (Foster & Rakhlin, 2020, Lemma 2) and our Lemma 21 is that we consider the randomization over both actions and rewards, while they only consider the randomization over actions because they study average rewards. However, since this does not change the upper bound on the variations of the martingale, the additional randomness does not change the bound. The next step is the fundamental step in the proof of the original SquareCB algorithm. Even though the notation differs slightly from the original paper, the proof is the same as in (Foster & Rakhlin, 2020, Appendix B):

Lemma 22 ((Foster & Rakhlin, 2020, Lemma 3)) For every t ∈ N*, the choice of γ_t and A(h_t, x_t, δ′) of Algorithm 4 guarantees:
\[ \mathbb E_{a\sim A_t}\big[\bar\mu^*_t - \bar\mu_t a\big] \le \frac{2K}{\gamma_t} + \frac{\gamma_t}{4}\,\mathbb E_{a\sim A_t}\Big[\big(\hat{\bar\mu}_t a - \bar\mu_t a\big)^2\Big]. \]

The last of these preliminary lemmas relates the cumulative expected error to the oracle regret bound. We use here the same proof as (Foster & Rakhlin, 2020, Lemma 2). We then have:

Lemma 23 Under Assumption E, for every δ′ > 0, Algorithm 4 satisfies, with probability at least 1-δ′:
\[ \forall T\in\mathbb N^*,\quad \sum_{t=1}^T \mathbb E_{a\sim A_t}\Big[\big(\hat{\bar\mu}_t a - \bar\mu_t a\big)^2\Big] \le 2L^2 R_{\mathrm{oracle}}(T) + 16L^2D_{\mathcal K}^2\ln\frac{2T^2}{\delta'}. \]

Proof. We first notice that Σ_{t=1}^T E_{a∼A_t}[(µ̂̄_t a - µ̄_t a)²] ≤ L² Σ_{t=1}^T E_{a∼A_t}[∥μ̂_t(x_t)a - µ(x_t)a∥²_2]. We then apply the same steps as in the proof of (Foster & Rakhlin, 2020, Lemma 2) to Σ_{t=1}^T E_{a∼A_t}[∥μ̂_t(x_t)a - µ(x_t)a∥²_2] (which we do not reproduce here) to obtain: for every T ∈ N and every δ′_T > 0, with probability at least 1-δ′_T:
\[ \sum_{t=1}^T \mathbb E_{a\sim A_t}\Big[\big(\hat{\bar\mu}_t a - \bar\mu_t a\big)^2\Big] \le 2L^2 R_{\mathrm{oracle}}(T) + 16L^2D_{\mathcal K}^2\ln\frac{1}{\delta'_T}. \]
Let δ′ > 0. Applying a union bound with δ′_T = δ′/(2T²), so that Σ_{T=1}^∞ δ′_T ≤ (π²/12)δ′ ≤ δ′, we obtain the desired result. Notice the ln T factor in the bound, which appears because the bound is valid for all time steps simultaneously.
This is because we propose anytime convergence bounds, with an exploration parameter that changes over time, whereas Foster & Rakhlin (2020) only prove their result in the case where the exploration parameter is chosen for a specific horizon. The next lemma is the main technical step towards our anytime bound; its proof is deferred to Appendix J.2.

Lemma 24 Let (λ_t)_{t≥1} be a sequence of non-negative numbers, denote Λ_T = Σ_{t=1}^T λ_t, and let (Λ̄_T)_{T≥1} be such that ∀T, Λ̄_T > 0 and Λ̄_T ≥ Λ_T. Then Σ_{t=1}^T λ_t/√(Λ̄_t) ≤ 2√(Λ̄_T).

We get the following corollary:

Lemma 25 Let R′_oracle(T, δ′) = 2L²R_oracle(T) + 16L²D²_K ln(2T²/δ′). Under the conditions of Lemma 23, assume that there is γ_0 > 0 such that ∀t ∈ [T], γ_t = γ_0√(t/R′_oracle(t, δ′)). Then, for every δ′ > 0, Algorithm 4 satisfies, with probability at least 1-δ′:
\[ \sum_{t=1}^T \gamma_t\,\mathbb E_{a\sim A_t}\Big[\big(\hat{\bar\mu}_t a - \bar\mu_t a\big)^2\Big] \le 2\gamma_0\sqrt{T\,R'_{\mathrm{oracle}}(T,\delta')}. \tag{92} \]

Proof. Using γ_t ≤ γ_0√(T/R′_oracle(t, δ′)), the sum on the left-hand side of (92) is bounded by γ_0√T times a sum of the form of Lemma 24, with Λ̄_t = R′_oracle(t, δ′) ≥ Λ_t holding with probability 1-δ′ by Lemma 23. The result thus follows from applying both lemmas.
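The SquareCB action distribution used in Algorithm 4 is the standard inverse-gap-weighting scheme. The sketch below (our own minimal version, with illustrative numbers) computes it for given scalarized reward estimates: each non-greedy arm gets probability 1/(K + γ·gap) and the greedy arm receives the remaining mass.

```python
import numpy as np

# Inverse gap weighting (the SquareCB action distribution).
def squarecb_probs(mu_hat, gamma):
    K = len(mu_hat)
    best = int(np.argmax(mu_hat))
    gaps = mu_hat[best] - mu_hat          # estimated gaps to the greedy arm
    p = 1.0 / (K + gamma * gaps)          # non-greedy arms: 1 / (K + gamma*gap)
    p[best] = 0.0
    p[best] = 1.0 - p.sum()               # greedy arm: remaining probability mass
    return p

probs = squarecb_probs(np.array([0.9, 0.5, 0.2, 0.1]), gamma=20.0)
print(probs)   # greedy arm gets most of the mass; worse arms get less exploration
```

As γ grows (which happens over time in Algorithm 4), the distribution concentrates on the greedy arm, which is the intended exploration decay.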

H.2 FINAL RESULT

Proof of Theorem 20. Notice that the value of γ_t given in the theorem is equal to
\[ \gamma_t = 2\sqrt{\frac{2tK}{R'_{\mathrm{oracle}}(t,\delta'/2)}}. \tag{93} \]
Using this formula, we have
\[ \sum_{t=1}^T \frac{2K}{\gamma_t} = \sqrt{\frac K2}\sum_{t=1}^T \sqrt{\frac{R'_{\mathrm{oracle}}(t,\delta'/2)}{t}} \le \sqrt{\frac K2\,R'_{\mathrm{oracle}}(T,\delta'/2)}\ \sum_{t=1}^T \frac{1}{\sqrt t} \le \sqrt{2KT\,R'_{\mathrm{oracle}}(T,\delta'/2)}, \tag{94} \]
where the first inequality comes from the monotonicity of R_oracle(T) in Assumption G. Using Lemmas 22 and 25, we thus have, with probability 1-δ′/2:
\[ \sum_{t=1}^T \mathbb E_{a\sim A_t}\big[\bar\mu^*_t - \bar\mu_t a\big] \le 2\sqrt{2KT\,R'_{\mathrm{oracle}}(T,\delta'/2)}. \]
Using a union bound and Lemma 21, we obtain, with probability at least 1-δ′:
\[ \sum_{t=1}^T \big(\bar\mu^*_t - g_t^\top r_t\big) \le 2\sqrt{2KT\,R'_{\mathrm{oracle}}(T,\delta'/2)} + LD_{\mathcal K}\sqrt{2T\ln\frac2{\delta'}}. \]

I FW-LINUCBRANK: CBCR FOR FAIR RANKING WITH LINEAR CONTEXTUAL

BANDITS

Even though our linear contextual bandit setup differs from ranking setups such as (Lagrée et al., 2016), the availability of the feedback e_{t,i}, which tells us whether item i has been exposed, makes the analysis of the online linear regression similar to the general setup of linear bandits. Our approach builds on the confidence intervals developed by Li et al. (2016), which extend the analysis of confidence ellipsoids for linear regression of Abbasi-Yadkori et al. (2011) to cascade user models in rankings.

Algorithm 5: FW-LinUCBRank: linear contextual bandits for fair ranking.
input: δ′ > 0, λ > 0, ŝ_0 ∈ K; V_0 = λI_d, y_0 = 0_d, θ̂_0 = 0_d
1 for t = 1, ... do
2   Observe context x_t ∼ P
3   ∀i, v̂_{t,i} ← θ̂_{t-1}^⊺x_{t,i} + α_t(δ′/3)∥x_{t,i}∥_{V^{-1}_{t-1}}  // UCB on v_i(x_t); see Lem. 26 for the definition of α_t
4   a_t ← top-k{ (∂f_{t-1}/∂z_{m+1})(ŝ_{t-1}) v̂_{t,i} + (∂f_{t-1}/∂z_i)(ŝ_{t-1}) }_{i=1}^m  // FW step

Each c_{t,i} is 1/2-subgaussian (because it is Bernoulli), and is conditionally independent of the other random variables conditioned on e_{t,i} and x_{t,i}. The incremental linear regression of line 7 of Algorithm 5 is the same as in (Abbasi-Yadkori et al., 2011). Our observation model satisfies the conditions of the analysis of confidence ellipsoids of Li et al. (2016), from which we obtain:

Lemma 26 Consider the probabilistic model described in Section 4, and let Assumption C hold. Let δ′ > 0 and λ ≥ D²_X k, and let
\[ \alpha_T(\delta') = \frac12\sqrt{\ln\frac{\det(V_T)}{\det(V_0)\,\delta'^2}} + \sqrt\lambda\, D_\theta. \]
Then, with the notation of Algorithm 5, we have:
• ((Li et al., 2016, Lemma 4.2)) with probability ≥ 1-δ′, for all T ≥ 0, θ lies in the confidence ellipsoid C_T = {θ̃ ∈ R^d : ∥θ̂_T - θ̃∥_{V_T} ≤ α_T(δ′)};
• ((Li et al., 2016, Lemma 4.4)):
\[ \alpha_T(\delta') \le \frac12\sqrt{2\ln\frac1{\delta'} + d\ln\Big(1+\frac{TD_{\mathcal X}^2 k}{\lambda d}\Big)} + \sqrt\lambda\, D_\theta. \]
These results stem from (Li et al., 2016, Lemmas A.4 and A.5), which show that, under the assumptions of Lemma 26, the following inequality holds with probability 1:
\[ \sum_{t=1}^T\sum_{i=1}^m \|x_{t,i}\|^2_{V^{-1}_{t-1}}\, e_{t,i} \le 2\ln\frac{\det V_T}{\det V_0} \le 2d\ln\Big(1+\frac{TD_{\mathcal X}^2 k}{\lambda d}\Big). \]
Notice that terms equivalent to D_X and D_θ do not appear in (Li et al., 2016) because they are assumed to be ≤ 1. The D²_X term comes from a modification necessary in (Li et al., 2016, Lemma A.4), while D_θ is required by the initial confidence bound proved by Abbasi-Yadkori et al. (2011). The term k plays the role of the constant C_γ of (Li et al., 2016).
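The Frank-Wolfe top-k step of Algorithm 5 (Line 4) has a simple interpretation: each item's score is its utility UCB weighted by the gradient of the user-utility coordinate, plus the gradient of its own exposure coordinate, so under-exposed items get boosted. The sketch below illustrates this with our own toy numbers and a hypothetical fairness-aware objective f(z) = z_user + c·Σ_i √(z_i) (not the paper's exact objective):

```python
import numpy as np

# Toy FW top-k ranking step: m = 6 items, rankings of size k = 2.
m, k, c = 6, 2, 0.3
ucb_values = np.array([0.9, 0.8, 0.7, 0.3, 0.2, 0.1])   # v_hat: item-utility UCBs
exposure = np.array([0.5, 0.4, 0.3, 0.02, 0.02, 0.01])  # s_hat: past average exposure

grad_user = 1.0                               # df/dz_user for f = z_user + c*sum sqrt(z_i)
grad_items = c * 0.5 / np.sqrt(exposure)      # df/dz_i, large for under-exposed items
scores = grad_user * ucb_values + grad_items  # per-item FW scores (Line 4)
ranking = np.argsort(-scores)[:k]             # top-k items by score
print(ranking)
```

With these numbers, the pure-utility ranking would be items [0, 1], but the fairness gradient promotes the under-exposed items 5 and 3 to the top of the ranking.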

I.2 GUARANTEES FOR FW-LINUCBRANK

We start by writing an alternative to Assumption D for the case where f is not smooth, so as to carry out our analysis with as few assumptions on f as possible; the resulting regret bound is stated in terms of α_T, defined in Lemma 26.

Proof. Let g_t = ∇f_{t-1}(ŝ_{t-1}) and a*_t ∈ argmax_{a∈A} ⟨g_t | µ(x_t)a⟩. Let furthermore δ′ > 0. Assume the algorithm uses α_t(δ′/3), so that C_t = {θ̃ ∈ R^d : ∥θ̂_t - θ̃∥_{V_t} ≤ α_t(δ′/3)}. Let us define μ̂_t similarly to Proposition 4, i.e., ∀t ∈ N*, μ̂_t is such that ∀i ∈ [m], μ̂_{t,i} = µ_i(x_t) and μ̂_{t,m+1} = v̂_t b(x_t)^⊺ viewed as a column vector, with v̂_t defined in line 3 of Algorithm 5.

Step 1: Upper bound on Σ_{t=1}^T A_t via optimism. Let t ≥ 0. For θ̃ ∈ R^d, denote by µ_θ̃(x) ∈ R^{D×K} (recall D = m+1) the average reward function where the parameters θ̃ replace θ. We first show that for every a ∈ A, we have max_{θ̃∈C_t} ⟨g_t | µ_θ̃(x_t)a⟩ ≤ ⟨g_t | μ̂_t a⟩, where v̂_t is given in Line 3 of Algorithm 5. Given a ∈ A, let us denote by mat(a) the view of a as an m×m permutation matrix (instead of an m²-dimensional column vector). Recalling that x_t is an m×d matrix and g_t ∈ R^{m+1}, denote by g_{t,1:m} the vector containing the first m dimensions of g_t. Since g_{t,m+1} ≥ 0 and, by the definition of C_t in Lemma 26, v̂_{t,i} = max_{θ̃∈C_t} θ̃^⊺x_{t,i}, a direct calculation gives max_{θ̃∈C_t}⟨g_t | µ_θ̃(x_t)a⟩ ≤ ⟨g_t | μ̂_t a⟩. By Proposition 4, the action a_t defined at Line 4 of Algorithm 5 maximizes ⟨g_t | μ̂_t a⟩ over a ∈ A. We thus have max_{a∈A} max_{θ̃∈C_t} ⟨g_t | µ_θ̃(x_t)a⟩ ≤ ⟨g_t | μ̂_t a_t⟩. By Lemma 26, we have θ ∈ C_t for all t ≥ 0 with probability 1-δ′/3. Therefore, with probability 1-δ′/3, we have for all t ≥ 0: ⟨g_t | µ_θ(x_t)a*_t⟩ ≤ ⟨g_t | μ̂_t a_t⟩. Noting that µ_θ(x_t) = µ(x_t) by definition of θ, we obtain that ∀t, A_t ≤ 0 and thus Σ_{t=1}^T A_t ≤ 0 with probability 1-δ′/3.
Step 2: Upper bound on Σ_{t=1}^T B_t using linear bandit techniques. Let a_{t,i} ∈ R^m denote the i-th row of mat(a_t), which contains only 0s except for a 1 at the rank of item i in a_t. Since μ̂_t and µ(x_t) only differ in the last dimension, which is the user utility, we have, using (102):
\[ B_t = g_{t,m+1}\,\big(\hat v_t - v(x_t)\big)^\top \mathrm{mat}(a_t)\, b(x_t) = g_{t,m+1}\sum_{i=1}^m\big(\hat v_{t,i}-v_i(x_t)\big)\,a_{t,i}^\top b(x_t). \]
Denoting ē_{t,i} = a_{t,i}^⊺b(x_t) ∈ R the expected exposure of item i in ranking a_t given context x_t, and using g_{t,m+1} ∈ [0, L], we have:
\[ B_t = g_{t,m+1}\sum_{i=1}^m\big(\hat v_{t,i}-v_i(x_t)\big)\bar e_{t,i} \le L\sum_{i=1}^m\Big((\hat\theta_{t-1}-\theta)^\top x_{t,i} + \alpha_t(\delta'/3)\|x_{t,i}\|_{V^{-1}_{t-1}}\Big)\bar e_{t,i} \tag{106} \]
\[ \le L\sum_{i=1}^m\Big(\|\hat\theta_{t-1}-\theta\|_{V_{t-1}}\|x_{t,i}\|_{V^{-1}_{t-1}} + \alpha_t(\delta'/3)\|x_{t,i}\|_{V^{-1}_{t-1}}\Big)\bar e_{t,i} \quad\text{(by Cauchy-Schwarz)}. \]
By Lemma 26, we have, with probability 1-δ′/3, ∥θ̂_{t-1}-θ∥_{V_{t-1}} ≤ α_t(δ′/3), and thus:
\[ B_t \le 2L\alpha_t\Big(\frac{\delta'}3\Big)\sum_{i=1}^m \|x_{t,i}\|_{V^{-1}_{t-1}}\bar e_{t,i} = 2L\alpha_t\Big(\frac{\delta'}3\Big)\bigg(\underbrace{\sum_{i=1}^m \|x_{t,i}\|_{V^{-1}_{t-1}}\big(\bar e_{t,i}-e_{t,i}\big)}_{X'_t} + \sum_{i=1}^m \|x_{t,i}\|_{V^{-1}_{t-1}}\,e_{t,i}\bigg). \]
We first deal with the sum over t of the rightmost term, using e_{t,i} ∈ {0,1} and Cauchy-Schwarz:
\[ \sum_{t=1}^T\sum_{i=1}^m \|x_{t,i}\|_{V^{-1}_{t-1}}e_{t,i} = \sum_{t=1}^T\sum_{i=1}^m\big(\|x_{t,i}\|_{V^{-1}_{t-1}}e_{t,i}\big)\times e_{t,i} \le \sqrt{Tk\sum_{t=1}^T\sum_{i=1}^m \|x_{t,i}\|^2_{V^{-1}_{t-1}}e_{t,i}} \le \sqrt{2Tkd\ln\Big(1+\frac{TD_{\mathcal X}^2k}{\lambda d}\Big)}. \]
For the term X′_t, the partial sums (Σ_{t≤T} X′_t)_{T∈N*} form a martingale adapted to the filtration F = (F_T)_{T∈N*}, where F_T is the σ-algebra generated by (x_1, a_1, r_1, ..., x_{T-1}, a_{T-1}, r_{T-1}, x_T, a_T), with |X′_t| ≤ D_X k/√λ. Thus, with probability at least 1-δ′/3, we have
\[ \sum_{t=1}^T X'_t \le \frac{D_{\mathcal X}k}{\sqrt\lambda}\sqrt{2T\ln\frac3{\delta'}} \le \sqrt{2Tk\ln\frac3{\delta'}}, \]
where the last inequality comes from the assumption λ ≥ D²_X k. We conclude this step: with probability 1-2δ′/3,
\[ \sum_{t=1}^T B_t \le 2L\alpha_T\Big(\frac{\delta'}3\Big)\sqrt{2Tk}\bigg(\sqrt{\ln\frac3{\delta'}} + \sqrt{d\ln\Big(1+\frac{TD_{\mathcal X}^2k}{\lambda d}\Big)}\bigg). \]

Step 3: Upper bound on Σ_{t=1}^T X_t using Azuma's inequality. Following the same arguments as in the proof of Theorem 19, let F = (F_t)_{t∈N*} be the filtration where F_t is the σ-algebra generated by (x_1, a_1, r_1, ..., x_{t-1}, a_{t-1}, r_{t-1}, x_t, a_t).
Then $(X_t)_{t \in \mathbb{N}}$ is a martingale difference sequence adapted to $F$ with $|X_t| \le L D_K$, so that $\sum_{t=1}^T X_t \le L D_K \sqrt{2T \ln\frac{3}{\delta'}}$ with probability $1 - \delta'/3$. The final result is obtained by a union bound, considering that Steps 1 and 2 use the same confidence interval given by Lemma 26, which is valid with probability $\ge 1 - \delta'/3$; Step 2 uses an additional Azuma inequality valid with probability $\ge 1 - \delta'/3$; and Step 3 uses another Azuma inequality valid with probability $\ge 1 - \delta'/3$.

Lemma 7. Under Assumption Ã, $\mathcal{S}$ is compact and, for every $T \in \mathbb{N}^*$ and every $x_{1:T} \in \mathcal{X}^T$, $\mathcal{S}(x_{1:T})$ is compact.

Proof. We start with $\mathcal{S}(x_{1:T})$. Let $x_{1:T} \in \mathcal{X}^T$. We notice that $\mathcal{S}(x_{1:T})$ is the image of $\mathcal{A}^T$ by the continuous mapping $\phi : (\mathbb{R}^K)^T \to \mathbb{R}^D$ defined by $\phi(a_1, \ldots, a_T) = \frac{1}{T} \sum_{t=1}^T \mu(x_t) a_t$. Since $\mathcal{A}$ is compact, $\mathcal{A}^T$ is compact as well. $\mathcal{S}(x_{1:T})$ is thus the image of a compact set by a continuous function, and is therefore compact.

For the set $\mathcal{S}$, we provide a proof using Diestel's theorem (see Yannelis, 1991) and the Aumann integral representation of $\mathcal{S}$ in (117), where $\mathcal{G} \subseteq L^1(\mathcal{X}, P)$ is the collection of all $P$-integrable selections of $G$, i.e., the collection of all $P$-integrable functions $g : \mathcal{X} \to \mathbb{R}^D$ such that $g(x) \in G(x)$ for $P$-almost every $x \in \mathcal{X}$. Now, since $\mathcal{A}$ is compact, convex and nonempty, the values of the set-valued function $G$ are nonempty, convex, and compact. Moreover, since $\sup_{x \in \mathcal{X}, a \in \mathcal{A}} \|\mu(x) a\|_2 < +\infty$ (because $\mu(x) a \in \mathcal{K}$ for all $x, a$), the set-valued function $G$ is $P$-integrably bounded in the sense of (Yannelis, 1991, Section 2.2). It then follows from Diestel's theorem (Yannelis, 1991, Theorem 3.1) that the collection $\mathcal{G}$ of $P$-integrable selections of $G$ is weakly compact in $L^1(\mathcal{X}, P)$. Finally, since $g \mapsto \int_{\mathcal{X}} g \, dP$ is a weakly continuous mapping from $L^1(\mathcal{X}, P)$ to $\mathbb{R}^D$, and $\mathcal{S} \subseteq \mathbb{R}^D$ is the image of $\mathcal{G}$ under this mapping (see the correspondence (117)), we deduce that $\mathcal{S}$ is weakly compact as a subset of $\mathbb{R}^D$, and therefore compact since $\mathbb{R}^D$ is finite-dimensional.

J.2 PROOF OF LEMMA 24

Lemma 24. Let $(\lambda_t)_{t \in \mathbb{N}^*}$ be a sequence of non-negative numbers, denote $\Lambda_T = \sum_{t=1}^T \lambda_t$, and let $(\bar{\Lambda}_T)_{T \in \mathbb{N}^*}$ be such that, for all $T$, $\bar{\Lambda}_T > 0$ and $\bar{\Lambda}_T \ge \Lambda_T$. Then
$$\sum_{t=1}^T \frac{\lambda_t}{\sqrt{\bar{\Lambda}_t}} \le 2\sqrt{\bar{\Lambda}_T}.$$

Proof. First, we treat the case where $\lambda_1 > 0$. Then $\Lambda_t > 0$ for all $t \in [T]$, and since $\bar{\Lambda}_t \ge \Lambda_t$:
$$\sum_{t=1}^T \frac{\lambda_t}{\sqrt{\bar{\Lambda}_t}} \le \sum_{t=1}^T \frac{\lambda_t}{\sqrt{\Lambda_t}}.$$
We now prove that the right-hand side is $\le 2\sqrt{\Lambda_T}$. Observe that for every $\alpha \ge 0$ and every $\beta \ge \alpha$ with $\beta > 0$:
$$\frac{\alpha}{2\sqrt{\beta}} \le \sqrt{\beta} - \sqrt{\beta - \alpha},$$
which is proved using $\sqrt{\beta} - \sqrt{\beta - \alpha} = \int_{\beta - \alpha}^{\beta} \frac{1}{2\sqrt{s}}\, ds \ge \frac{\alpha}{2\sqrt{\beta}}$. Using a telescoping sum (with $\Lambda_0 = 0$):
$$\sum_{t=1}^T \frac{\lambda_t}{\sqrt{\Lambda_t}} \le 2 \sum_{t=1}^T \Big( \sqrt{\Lambda_t} - \underbrace{\sqrt{\Lambda_t - \lambda_t}}_{= \sqrt{\Lambda_{t-1}}} \Big) = 2\sqrt{\Lambda_T} \le 2\sqrt{\bar{\Lambda}_T},$$
we obtain the desired result. More generally, if $\lambda_1 = 0$, there are two cases: (1) if $\lambda_t = 0$ for all $t \in [T]$, then the result trivially holds; (2) otherwise, let $T_0 = \min\{ t \in [T] : \lambda_t > 0 \}$. Using the result above, we have:
$$\sum_{t=1}^T \frac{\lambda_t}{\sqrt{\bar{\Lambda}_t}} = \sum_{t=T_0}^T \frac{\lambda_t}{\sqrt{\bar{\Lambda}_t}} \le 2\sqrt{\bar{\Lambda}_T}.$$
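A quick numerical check of Lemma 24 on random sequences (an illustrative script of our own, not part of the proof):

```python
import math
import random

# Check: for non-negative lambda_t with partial sums Lam_t, and any
# over-estimates Lam_bar_t >= Lam_t > 0, we expect
#   sum_{t <= T} lambda_t / sqrt(Lam_bar_t)  <=  2 * sqrt(Lam_bar_T).
random.seed(0)
worst_gap = -math.inf
for _ in range(200):
    lams = [random.random() for _ in range(50)]
    cum, lhs, bar = 0.0, 0.0, 0.0
    for lam in lams:
        cum += lam                       # Lam_t
        bar = cum + random.random()      # any Lam_bar_t >= Lam_t is allowed
        lhs += lam / math.sqrt(bar)
    rhs = 2.0 * math.sqrt(bar)           # bar now holds Lam_bar_T
    worst_gap = max(worst_gap, lhs - rhs)
# worst_gap should never be positive
```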



Notice that the linear structure between $\mu(x_t)$ and $a_t$ is standard in combinatorial bandits (Cesa-Bianchi & Lugosi, 2012), and it reduces to the usual multi-armed bandit setting when $\mathcal{A}$ is the canonical basis of $\mathbb{R}^K$.

In the multi-armed setting, stationary policies return a distribution over arms given a context vector. In the combinatorial setup, $\pi(x) \in \mathcal{A}$ is the average feature vector of a stochastic policy over $\mathcal{A}$. For the benchmark, we are only interested in expected rewards, so there is no need to specify the full distribution over $\mathcal{A}$.

This means that $f$ is concave and upper semi-continuous, is never equal to $+\infty$, and is finite somewhere.

For simplicity, we presented our reduction with $z_t = \hat{s}_t$, but other choices of $z_t$ are possible (see Appendix D). The important point is that the reduction works without restricting $z_t$ to $\mathcal{S}$.

In practice, this result is used in conjunction with an upper bound $\bar{R}^{\mathrm{scal}}(T, \delta')$ on $R^{\mathrm{scal}}_T$ that holds with probability $\ge 1 - \delta'$, which gives $R_T \le \bar{R}^{\mathrm{scal}}(T, \delta')/T + O\big(\sqrt{\ln(1/\delta)/T}\big)$ with probability at least $1 - \delta - \delta'$ using the union bound.

When $b$ is unknown, depends on the context $x$, and we do not observe $e_t$, several approaches have been proposed to estimate the position weights (see e.g., Fang et al., 2019). Incorporating these approaches into contextual bandits for ranking is likely feasible but out of the scope of this work.

https://www.last.fm; the dataset is publicly available for non-commercial use. We use the Python library Implicit, MIT License: https://implicit.readthedocs.io/

LinUCBRank appears under various names in the literature, including PBMLinUCBRank (Ermis et al., 2020) and CascadeLinUCB (Kveton et al., 2015).

In short, they have different bounds, one involving the variance of $r_t$ and the other one involving the average rewards $\mu(x)$. We assume the rewards $r_t$ are uniformly bounded in $\mathcal{K}$, so we do not have to deal with two different quantities in our bounds, and have $L D_K$ everywhere.
Monotonicity of $\bar{R}^{\mathrm{oracle}}$ is not required in (Foster & Rakhlin, 2020). We use it in (94) below to deal with time-dependent $\gamma_t$. Meaningful $\bar{R}^{\mathrm{oracle}}(T)$ are non-decreasing with $T$ since they bound a cumulative regret. Throughout the paper, we chose to provide anytime bounds rather than bounds that depend on horizon-dependent parameters; the analysis with fixed $\gamma$ is easier.




CBCR: Application to fairness of exposure in rankings with bandit feedback
B.2 Multi-armed CBCR: Application to multi-objective bandits with generalized Gini function
C Proofs of Section 2
C.1 Brief reminder on Lipschitz functions and super-gradients
C.2 Preliminaries: the structure of the set S
C.3 Proof of Lemma 9
D The general template Frank-Wolfe algorithm
E Proofs for Section 3 and Appendix D
E.1 Proofs of the main results
F Smooth approximations of non-smooth functions
F.1 Smoothing with the Moreau envelope
F.2 Randomized smoothing
G FW-LinUCB: upper-confidence bounds for linear bandits with K arms
H FW-SquareCB: CBCR with general reward functions
H.1 Adaptation of the SquareCB proof to CBCR
H.2 Final result
I FW-LinUCBRank: CBCR for fair ranking with linear contextual bandits
I.1 Results for online linear regression (from (Li et al., 2016))
I.2 Guarantees for FW-LinUCB
J Additional technical lemmas
J.1 Proof of Lemma 7 (S is compact)
J.2 Proof of Lemma 24


Figure 2: Lastfm-50: Objective values over time for (top) Gini, (middle) eq. exposure, (bottom) welf.

Figure 3: Lastfm-2k: Objective values over time for (top) Gini, (bottom) welf.

Figure 4: Trade-offs between user utility and inequality on Lastfm-2k, after T = 10 6 rounds.


Figure 5: Multi-objective bandits: GGF value achieved on various synthetic environments.

Assumption D′. The assumptions of the framework of Sec. 4 hold, as well as Assumption E. Moreover, for all $t \in \mathbb{N}$ and all $z \in \mathcal{K}$, $\frac{\partial f_t}{\partial z_{m+1}}(z) > 0$, and for all $x \in \mathcal{X}$, $1 \ge b_1(x) \ge \ldots \ge b_k(x)$ and $b_{k+1}(x) = \ldots = b_m(x) = 0$.

Lemma 27. Under Assumptions D′ and C, let $\lambda \ge D_X^2 k$. Then for every $T > 0$ and every $\delta' > 0$, Algorithm 5 satisfies, with probability at least $1 - \delta'$:
$$R^{\mathrm{gen}}_T \le 2L\alpha_T(\delta'/3)\, \sqrt{Tk} \bigg( \sqrt{2\ln\frac{3}{\delta'}} + \sqrt{2d \ln\Big(1 + \frac{T D_X^2 k}{\lambda d}\Big)} \bigg) + L D_K \sqrt{2T \ln\frac{3}{\delta'}}.$$

Under Assumptions B, C and D, for every $\delta' > 0$, every $T \in \mathbb{N}^*$ and every $\lambda \ge D_X^2 k$, with probability at least $1 - \delta'$, Algorithm 1 has scalar regret bounded by
$$R^{\mathrm{scal}}_T = O\bigg( L\sqrt{Tk}\, \sqrt{d\ln(T/\delta')} \Big( \sqrt{d\ln(T/\delta')} + D_\theta\sqrt{\lambda} + \sqrt{k/d} \Big) \bigg). \qquad (13)$$
Thus, considering only $d$, $T$, $k$ and $\delta = \delta'$, Algorithm 1 has regret $R_T \le O\big(d\sqrt{k}\ln(T/\delta)/\sqrt{T}\big)$ with probability at least $1 - \delta$.

Proof. Let $\delta > 0$, and apply Lemma 27 with confidence level $3\delta/4$ and Theorem 11 with confidence level $\delta/4$ in the bound on $R_T$. Using $\lambda \ge D_X^2 k$ and $D_K = O(k)$, we have:
$$\bar{R}^{\mathrm{scal}}(T, 3\delta/4) = O\Big( L\alpha_T(\delta)\sqrt{Tkd\ln(T/\delta)} + Lk\sqrt{T\ln(1/\delta)} \Big) \qquad (112)$$
and $\alpha_T(\delta) = O\big( \sqrt{d\ln(T/\delta)} + D_\theta\sqrt{\lambda} \big)$. We thus get
$$\bar{R}^{\mathrm{scal}}(T, \delta) = O\Big( L\sqrt{Tk}\sqrt{d\ln(T/\delta)}\big( \sqrt{d\ln(T/\delta)} + D_\theta\sqrt{\lambda} \big) + Lk\sqrt{T\ln(1/\delta)} \Big) \qquad (113)$$
$$= O\Big( L\sqrt{Tk}\sqrt{d\ln(T/\delta)}\big( \sqrt{d\ln(T/\delta)} + D_\theta\sqrt{\lambda} + \sqrt{k/d} \big) \Big). \qquad (114)$$
For the smoothed objective, the total bound adds $O\big( Lk\sqrt{T\ln(1/\delta)} + C\frac{\ln T}{T} \big)$. A bound on the complete regret is thus
$$R_T = O\bigg( \frac{L\sqrt{Tk}\sqrt{d\ln(T/\delta)}\big( \sqrt{d\ln(T/\delta)} + D_\theta\sqrt{\lambda} + \sqrt{k/d} \big)}{T} \bigg).$$

J.1 PROOF OF LEMMA 7 (S IS COMPACT)

Consider the set-valued map $G : \mathcal{X} \to \{ B \mid B \subseteq \mathbb{R}^D \}$ defined by
$$G(x) := \mu(x)\mathcal{A} := \{ \mu(x) a \mid a \in \mathcal{A} \}. \qquad (116)$$
Then $\mathcal{S}$ can be written as the Aumann integral of $G$ over $\mathcal{X}$ w.r.t. $P$, i.e.,
$$\mathcal{S} = \Big\{ \int_{\mathcal{X}} g(x) \, dP(x) \;\Big|\; g \in \mathcal{G} \Big\}, \qquad (117)$$

Regret bounds depending on assumptions and base algorithm $\mathcal{A}$, for multi-armed bandits with $K$ arms (in dimension $d$ for LinUCB). See Appendices G and H for the full details.

on $v_i(x_t)$ (see the definition of $\alpha_t$ in Lemma 26, Appendix I). Here $k \le m$ is the maximum rank that can be exposed to any user; in most practical applications, $k \ll m$. As formalized in Assumption D below, the position weights $b_k(x)$ are always non-increasing with $k$, since the user browses the recommended items in order of their rank. We use a linear assumption for item values, where $D_X$ and $D_\theta$ are known constants:
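To illustrate the exposure model, the expected exposure of item $i$ is $a_{t,i}^\top b(x_t)$: the item placed at rank $j$ receives position weight $b_j$, and items beyond rank $k$ receive zero. A small sketch (function and variable names are ours, not the paper's):

```python
# Sketch: exposure of each of m items under a top-k ranking with
# non-increasing position weights b_1 >= ... >= b_k and zero beyond rank k.
def expected_exposure(ranking, b, m):
    """ranking[j] = item placed at position j (0-indexed); len(ranking) == k."""
    e = [0.0] * m
    for pos, item in enumerate(ranking):
        e[item] = b[pos]          # item at rank pos receives weight b[pos]
    return e                      # unranked items keep exposure 0

b = [1.0, 0.5, 0.25]              # k = 3 exposed positions among m = 5 items
e = expected_exposure([4, 0, 2], b, m=5)
# items 1 and 3 are not exposed at all
```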

Algorithm 4: FW-SquareCB: contextual bandits with concave rewards and regression oracles.
input: initial point $\hat{s}_0 \in \mathcal{K}$, exploration parameters $(\gamma_t)_{t \in \mathbb{N}}$. $\mathcal{A}$ is the canonical basis of $\mathbb{R}^K$.
1 for $t = 1, \ldots$ do: Let $g_t = \nabla f_{t-1}(\hat{s}_{t-1})$ and $\tilde{\mu}_t = g_t^\top \hat{\mu}_t(x_t) \in \mathbb{R}^K$
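The core of the Frank-Wolfe reduction can be sketched as follows (a toy illustration under our own naming, with a greedy oracle standing in for SquareCB and a deterministic two-arm instance; not the authors' implementation):

```python
import numpy as np

# Hedged sketch of the Frank-Wolfe reduction: at each round, the concave
# objective f is linearized at the running average of observed reward vectors,
# and a scalar-reward base bandit picks the action maximizing <g_t | reward>.
def fw_reduction(T, grad_f, scalar_bandit_step, draw_context, dim=2):
    s_hat = np.zeros(dim)                   # running average of reward vectors
    for t in range(1, T + 1):
        x_t = draw_context()
        g_t = grad_f(s_hat)                 # g_t = grad f_{t-1}(s_hat_{t-1})
        r_t = scalar_bandit_step(x_t, g_t)  # base bandit on reward <g_t | r_t>
        s_hat += (r_t - s_hat) / t          # Frank-Wolfe step with rate 1/t
    return s_hat

# Toy instance: f(z) = -||z - (0.5, 0.5)||^2, two deterministic arms e_1, e_2,
# and a greedy oracle (contexts are irrelevant here).
target = np.array([0.5, 0.5])
arms = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
s_final = fw_reduction(
    T=1000,
    grad_f=lambda s: 2.0 * (target - s),
    scalar_bandit_step=lambda x, g: arms[int(np.argmax([g @ a for a in arms]))],
    draw_context=lambda: None,
)
# s_final approaches (0.5, 0.5), the maximizer of f over the reward set
```

The key point of the reduction is visible in the loop: the multi-objective problem is never solved directly; only a sequence of scalarized bandit problems with adaptive weights $g_t$ is.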

Observe exposed items $e_t \in \{0,1\}^m$ and user feedback $c_t \in \{0,1\}^m$.

I.1 RESULTS FOR ONLINE LINEAR REGRESSION (FROM (LI ET AL., 2016))

$$\langle g_t \mid \hat{\mu}_t a_t - r_t \rangle = \langle g_t \mid \hat{\mu}_t a_t - \mu(x_t) a_t \rangle + \langle g_t \mid \mu(x_t) a_t - r_t \rangle$$

ACKNOWLEDGMENTS

The authors would like to thank Clément Vignac, Marc Jourdan, Yaron Lipman, Levent Sagun and the anonymous reviewers for their helpful comments.

