CONTEXTUAL BANDITS WITH CONCAVE REWARDS, AND AN APPLICATION TO FAIR RANKING

Abstract

We consider Contextual Bandits with Concave Rewards (CBCR), a multi-objective bandit problem where the desired trade-off between the rewards is defined by a known concave objective function, and the reward vector depends on an observed stochastic context. We present the first algorithm with provably vanishing regret for CBCR without restrictions on the policy space, whereas prior works were restricted to finite policy spaces or tabular representations. Our solution is based on a geometric interpretation of CBCR algorithms as optimization algorithms over the convex set of expected rewards spanned by all stochastic policies. Building on Frank-Wolfe analyses in constrained convex optimization, we derive a novel reduction from the CBCR regret to the regret of a scalar-reward bandit problem. We illustrate how to apply the reduction off-the-shelf to obtain algorithms for CBCR with both linear and general reward functions, in the case of non-combinatorial actions. Motivated by fairness in recommendation, we describe a special case of CBCR with rankings and fairness-aware objectives, leading to the first algorithm with regret guarantees for contextual combinatorial bandits with fairness of exposure.

1. INTRODUCTION

Contextual bandits are a popular paradigm for online recommender systems that learn to generate personalized recommendations from user feedback. These algorithms have mostly been developed to maximize a single scalar reward which measures recommendation performance for users. Recent fairness concerns have shifted the focus towards item producers, who are also impacted by the exposure they receive (Biega et al., 2018; Geyik et al., 2019), leading to the optimization of trade-offs between recommendation performance for users and fairness of exposure for items (Singh & Joachims, 2019; Zehlike & Castillo, 2020). More generally, there is increasing pressure to account for the multi-objective nature of recommender systems (Vamplew et al., 2018; Stray et al., 2021), which need to optimize for several engagement metrics and account for multiple stakeholders' interests (Mehrotra et al., 2020; Abdollahpouri et al., 2019). In this paper, we focus on the problem of contextual bandits with multiple rewards, where the desired trade-off between the rewards is defined by a known concave objective function, which we refer to as Contextual Bandits with Concave Rewards (CBCR). Concave rewards are particularly relevant to fair recommendation, where several objectives can be expressed as (known) concave functions of the (unknown) utilities of users and items (Do et al., 2021). Our CBCR problem is an extension of Bandits with Concave Rewards (BCR) (Agrawal & Devanur, 2014) in which the vector of multiple rewards depends on an observed stochastic context. We address this extension because contexts are necessary to model the user/item features required for personalized recommendation. Compared to BCR, the main challenge of CBCR is that optimal policies depend on the entire distribution of contexts and rewards. In BCR, optimal policies are distributions over actions, and are found by direct optimization in policy space (Agrawal & Devanur, 2014; Berthet & Perchet, 2017).
In CBCR, stationary policies are mappings from a continuous context space to distributions over actions. This makes existing BCR approaches inapplicable to CBCR, because the policy space is not amenable to tractable optimization without further assumptions or restrictions. In fact, the only prior theoretical work on CBCR is restricted to a finite policy set (Agrawal et al., 2016). We present the first algorithms with provably vanishing regret for CBCR without restriction on the policy space. Our main theoretical result is a reduction where the CBCR regret of an algorithm is bounded by its regret on a proxy bandit task with a single (scalar) reward. This reduction shows that it is straightforward to turn any contextual (scalar-reward) bandit algorithm into an algorithm for CBCR. We prove this reduction by first re-parameterizing CBCR as an optimization problem in the space of feasible rewards, and then revealing connections between Frank-Wolfe (FW) optimization in reward space and a decision problem in action space. This bypasses the challenges of optimization in policy space. To illustrate how to apply the reduction, we provide two example algorithms for CBCR with non-combinatorial actions: one for linear rewards based on LinUCB (Abbasi-Yadkori et al., 2011), and one for general reward functions based on the SquareCB algorithm (Foster & Rakhlin, 2020), which uses online regression oracles. In particular, we highlight that our reduction can be used together with any exploration/exploitation principle, while previous FW approaches to BCR relied exclusively on upper confidence bounds (Agrawal & Devanur, 2014; Berthet & Perchet, 2017; Cheung, 2019). Since fairness of exposure is our main motivation for CBCR, we show how our reduction also applies to the combinatorial task of fair ranking with contextual bandits, leading to the first algorithm with regret guarantees for this problem, and we show it is computationally efficient.
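To give intuition for the FW-style reduction, the sketch below is a toy simulation (with made-up arm means and objective, not the paper's algorithm): the learner tracks the running average ŝ_t of observed reward vectors and, at each round, plays the action maximizing the linearized objective ⟨∇f(ŝ_t), μa⟩. In this toy there are no contexts and μ is known, so the scalar-reward "bandit" is an exact oracle; with contexts and unknown μ, a contextual bandit algorithm would take its place.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy instance (made-up numbers): 3 arms, D = 2 reward dimensions, no contexts.
mu = np.array([[1.0, 0.0],   # arm 0: good on objective 1 only
               [0.0, 1.0],   # arm 1: good on objective 2 only
               [0.4, 0.4]])  # arm 2: mediocre on both

def f(s):                    # concave trade-off: total reward minus imbalance penalty
    return s[0] + s[1] - 0.5 * (s[0] - s[1]) ** 2

def grad_f(s):
    d = s[0] - s[1]
    return np.array([1.0 - d, 1.0 + d])

T = 5000
s_hat = np.zeros(2)          # running average of observed reward vectors
for t in range(1, T + 1):
    # Frank-Wolfe step: a scalar-reward decision on the linearized objective.
    # (With contexts and unknown mu, this is where a contextual bandit plugs in.)
    a = int(np.argmax(mu @ grad_f(s_hat)))
    r = mu[a] + 0.05 * rng.standard_normal(2)   # noisy reward vector
    s_hat += (r - s_hat) / t                    # online averaging

# The iterates alternate between arms 0 and 1, so s_hat approaches (0.5, 0.5)
# and f(s_hat) approaches the optimum f* = 1, which no single arm achieves.
print(round(f(s_hat), 3))
```

Note that the optimum here is only attained by a mixture of arms, which is exactly why BCR/CBCR algorithms must reason about stochastic policies rather than single best actions.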
We compare the empirical performance of our algorithm to relevant baselines on a music recommendation task.

Related work. Agrawal et al. (2016) address a restriction of CBCR to a finite set of policies, where explicit search is possible. Cheung (2019) uses FW for reinforcement learning with concave rewards, a problem similar to CBCR; however, they rely on a tabular setting where there are few enough policies to compute them explicitly. Our approach is the only one to apply to CBCR without restriction on the policy space, by removing the need for explicit representation and search of optimal policies.

Our work is also related to fairness of exposure in bandits. Most previous works on this topic either do not consider rankings (Celis et al., 2018; Wang et al., 2021; Patil et al., 2020; Chen et al., 2020), or apply to combinatorial bandits without contexts (Xu et al., 2021). Both restrictions are impractical for recommender systems. Mansoury et al. (2021) and Jeunen & Goethals (2021) propose heuristics, with experimental support, that apply to both rankings and contexts, but they lack theoretical guarantees. We present the first algorithm with regret guarantees for fair ranking with contextual bandits. We provide a more detailed discussion of the related work in Appendix A.

2. MAXIMIZATION OF CONCAVE REWARDS IN CONTEXTUAL BANDITS

Notation. For any n ∈ N, we denote [n] = {1, . . . , n}. The dot product of two vectors x and y in R^n is denoted either x⊺y or, in bra-ket notation, ⟨x | y⟩, depending on which is more readable.

Setting. We define a stochastic contextual bandit (Langford & Zhang, 2007) problem with D rewards. At each time step t, the environment draws a context x_t ∼ P, where x_t ∈ X ⊆ R^q and P is a probability measure over X. The learner chooses an action a_t ∈ A, where A ⊆ R^K is the action space, and receives a noisy multi-dimensional reward r_t ∈ R^D with expectation E[r_t | x_t, a_t] = μ(x_t)a_t, where μ : X → R^{D×K} is the matrix-valued contextual expected reward function.¹ The trade-off between the D cumulative rewards is specified by a known concave function f : R^D → R ∪ {±∞}. Let conv(A) denote the convex hull of A and let π : X → conv(A) be a stationary policy;² the optimal value for the problem is then defined as

f* = sup_{π : X → conv(A)} f( E_{x∼P}[ μ(x)π(x) ] ).

We rely on either of the following assumptions on f:

Assumption A. f is closed proper concave³ on R^D and A is a compact subset of R^K. Moreover, there is a compact convex set 𝒦 ⊆ R^D such that:
• (Bounded rewards) ∀(x, a) ∈ X × A, μ(x)a ∈ 𝒦, and for all t ∈ N*, r_t ∈ 𝒦 with probability 1.
• (Local Lipschitzness) f is L-Lipschitz continuous with respect to ∥·∥₂ on an open set containing 𝒦.

Assumption B. Assumption A holds and f has C-Lipschitz-continuous gradients w.r.t. ∥·∥₂ on 𝒦.

¹ Notice that the linear structure between μ(x_t) and a_t is standard in combinatorial bandits (Cesa-Bianchi & Lugosi, 2012), and it reduces to the usual multi-armed bandit setting when A is the canonical basis of R^K.
² In the multi-armed setting, stationary policies return a distribution over arms given a context vector. In the combinatorial setup, π(x) ∈ conv(A) is the average feature vector of a stochastic policy over A. For the benchmark, we are only interested in expected rewards, so there is no need to specify the full distribution over A.
³ This means that f is concave and upper semi-continuous, is never equal to +∞, and is finite somewhere.
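As a concrete sanity check of these assumptions, consider the smoothed fairness-style objective f(s) = Σ_d √(s_d + ε) (an illustrative choice of ours, not an objective from the paper). On 𝒦 = [0, 1]^D it is closed proper concave, L-Lipschitz with L = √(D/ε)/2, and has C-Lipschitz gradients with C = ε^{-3/2}/4 (its Hessian is diagonal with entries bounded by C in magnitude). The snippet below verifies both Lipschitz properties numerically on random pairs of points:

```python
import numpy as np

rng = np.random.default_rng(1)
D, eps = 3, 0.01             # hypothetical dimension and smoothing constant

# Illustrative concave objective on K = [0, 1]^D: f(s) = sum_d sqrt(s_d + eps).
def f(s):
    return np.sqrt(s + eps).sum(axis=-1)

def grad_f(s):
    return 0.5 / np.sqrt(s + eps)

# Closed-form constants on K: ||grad f||_2 <= L (Assumption A),
# and the diagonal Hessian has operator norm <= C (Assumption B).
L = 0.5 * np.sqrt(D / eps)
C = 0.25 * eps ** -1.5

# Empirical check on random pairs of points in K.
X = rng.uniform(0.0, 1.0, (1000, D))
Y = rng.uniform(0.0, 1.0, (1000, D))
dist = np.linalg.norm(X - Y, axis=1)
lip_f = np.abs(f(X) - f(Y)) <= L * dist + 1e-12
lip_grad = np.linalg.norm(grad_f(X) - grad_f(Y), axis=1) <= C * dist + 1e-12
print(lip_f.all() and lip_grad.all())   # → True
```

The smoothing constant ε is what keeps both L and C finite: as ε → 0, the square root's gradient blows up at the boundary of 𝒦 and Assumption B fails.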

