HIGH-DIMENSIONAL CONTINUUM ARMED AND HIGH-DIMENSIONAL CONTEXTUAL BANDIT: WITH APPLICATIONS TO ASSORTMENT AND PRICING

Abstract

The bandit problem with high-dimensional continuum arms and high-dimensional contextual covariates is often faced by decision-makers but remains unsolved. Recent developments in contextual bandit problems focus on the setting where the number of arms is small and are impractical for high-dimensional continuous arm spaces. To bridge the gap, we propose a novel model for the high-dimensional continuum armed and high-dimensional contextual bandit problem that captures the effect of the arm and covariates on the reward via a low-rank representation matrix. The representation matrix is endowed with interpretability and predictive power. We further propose an efficient bandit algorithm based on a low-rank matrix estimator, with theoretical justifications. The generality of our model allows wide applications, including business and healthcare. In particular, we apply our method to assortment and pricing, both of which are important decisions for firms such as online retailers. Our method solves the assortment and pricing problems simultaneously, whereas most existing methods address them separately. We demonstrate the effectiveness of our method in jointly optimizing assortment and pricing for revenue maximization for a giant online retailer.

1. INTRODUCTION

The bandit problem dates back to Robbins (1952), who formulated it as the sequential design of experiments, and has been studied extensively in recent years due to the demand for online decision-making, especially from e-commerce and health care. A decision-maker chooses an action (arm) at each round and observes a reward; the goal is to act strategically so as to find an optimal action that maximizes the long-term reward without sacrificing too much along the way. The bandit literature mostly focuses on the problem of a finite number of independent arms, but it is often the case that there are infinitely many arms that share some common structure and thus can be indexed by continuous variables, i.e., the continuum armed bandit problem. In e-commerce, the retailer needs to decide the product assortment and pricing to maximize long-term profits; in mobile health, the personal device provides exercise and dietary suggestions to improve physical and mental health. The possible actions in both examples can be parameterized as continuous variables, which are possibly high-dimensional. In addition, decision-makers observe other covariates/features, i.e., the contextual bandit problem, where the reward is modeled as a function of unknown parameters and the contextual variables; in many practical settings, the covariates are high-dimensional. As the dimensionalities of the action space (for arms) and the contextual variables grow, traditional bandit algorithms suffer from the curse of dimensionality, and it becomes impossible or prohibitively costly to learn the optimal decision.
Although both the arm and the context are high-dimensional, the dimension of the underlying factors is often, fortunately, small: for high-dimensional bandit problems, one can assume a low-dimensional structure on the unknown parameters, such as the LASSO bandit (Bastani & Bayati, 2020); and for high-dimensional continuum armed bandit problems, one can assume the reward function depends only on a low-dimensional subspace of the action space (Tyagi et al., 2016). While low-dimensional representation has been successfully adopted in high-dimensional bandit problems and high-dimensional continuum armed bandit problems respectively, a natural but important question remains open: can we efficiently solve the bandit problem with both high-dimensional continuum arms and high-dimensional contextual variables simultaneously? In this paper, we tackle the above problem by proposing a novel model that captures the effect of the arm and the context with an approximately low-rank matrix representation, together with an efficient algorithm (Hi-CCAB) that solves the problem with theoretical justifications. Specifically, for an action represented as a vector a ∈ R^{d_a} and the corresponding contextual covariates x ∈ R^{d_x}, the reward is r = a^⊤Θx + ε, where Θ ∈ R^{d_a×d_x} is the unknown representation matrix, which is assumed to have rank d ≪ min{d_a, d_x}, and ε is an independent error. To learn the low-rank representation matrix, we adapt the low-rank matrix estimator to the bandit setting. We further demonstrate the benefits of our methodology in e-commerce with real sales data, where the online retailer needs to decide on product assortment and pricing jointly. The generality of our model makes it possible to learn a policy on product assortment and pricing at the same time, while previous literature mostly studies the assortment and pricing problems separately. Contributions. We highlight the following contributions of our paper: 1.
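As a minimal sketch of this reward model, the following generates a rank-d representation matrix and one noisy bilinear reward r = a^⊤Θx + ε (the dimensions, rank, and noise level below are illustrative choices, not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
d_a, d_x, d = 50, 100, 5  # arm dim, covariate dim, rank (illustrative)

# Build a rank-d representation matrix Theta = U S V^T.
U, _ = np.linalg.qr(rng.standard_normal((d_a, d)))
V, _ = np.linalg.qr(rng.standard_normal((d_x, d)))
S = np.diag(np.linspace(1.0, 0.5, d))
Theta = U @ S @ V.T

# One round: observe a covariate, choose an arm, receive a noisy reward.
x = rng.standard_normal(d_x)
a = rng.standard_normal(d_a)
a /= np.linalg.norm(a)                        # keep the action in the unit ball
reward = a @ Theta @ x + rng.normal(0, 0.1)   # r = a^T Theta x + eps
```

Even though Θ has d_a · d_x entries, only d(d_a + d_x) numbers effectively drive the mean reward, which is what the estimator later exploits.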
We propose a new model for the high-dimensional continuum armed and high-dimensional contextual bandit problem, which is often faced by decision-makers but which very little existing literature attempts to solve. The crux of our model is the low-rank representation matrix that exploits the low-dimensional structure of both the high-dimensional arms and the high-dimensional covariates. Our model unifies a large class of bandit models. 2. The low-rank representation matrix is endowed with interpretability and predictive power. One can perform singular value decomposition (SVD) on the representation matrix: the left singular vectors reveal the latent structure and relationships among the arms, while the right singular vectors show the latent factors of the covariates. In other words, our model implicitly performs principal component analysis (PCA) on the effect of arms and covariates on the mean reward. On the other hand, given the covariates, our model is able to predict the reward of an unseen arm. Both interpretability and predictive power can be tremendously useful for decision-makers. 3. We propose an efficient algorithm for the High-dimensional Contextual and High-dimensional Continuum Armed Bandit (Hi-CCAB) by adopting the low-rank matrix estimator. We further provide an upper bound on the convergence rate of Hi-CCAB in terms of the time-averaged expected cumulative regret. 4. The generality of our model allows for a wide range of applications. Specifically, we apply Hi-CCAB to the joint assortment and pricing problem. We show that our model reveals insights for product design, assortment, and pricing, and that the assortment-pricing policy based on Hi-CCAB yields sales four times as high as the original strategy. Literature review. Literature on high-dimensional bandit problems has been expanding recently, especially as statistical tools for high-dimensional problems have matured (Negahban & Wainwright, 2011; Wainwright, 2019).
Much of the high-dimensional bandit literature focuses on contextual bandits with high-dimensional covariates, such as the LASSO bandit problem (Abbasi-Yadkori et al., 2012; Kim & Paik, 2019; Bastani & Bayati, 2020; Hao et al., 2020; Papini et al., 2021), where the mean reward is assumed to be a linear function of a sparse unknown parameter vector; the low-rank matrix bandit, where the covariate and the unknown parameter are both in matrix form (Kveton et al., 2017; Lu et al., 2021); and other non-parametric methods that learn the reward function using random forests or deep learning (Féraud et al., 2016; Zhou et al., 2020; Ban et al., 2022; Chen et al., 2022; Xu et al., 2022). These high-dimensional bandit models are special cases of our model. Another stream of high-dimensional bandit literature studies representation learning in linear bandits, specifically for multi-task learning, where several bandits are played concurrently; the arms for each task are embedded in the same space and share a common low-dimensional representation (Lale et al., 2019; Yang et al., 2020; Hu et al., 2021; Xu & Bastani, 2021). Our problem is different from multi-task learning: at each time we have only one bandit and thus observe one reward, while in the multi-task bandit problem multiple bandits are played at the same time. For continuum armed bandits, there exists a thread of literature that assumes the mean reward function is smooth and continuous on the action space in some sense, e.g., that the function lies in a Lipschitz or Hölder space (Agrawal, 1995; Kleinberg, 2004; Kleinberg et al., 2019). Most work discretizes the arm space or adopts non-parametric regression to estimate the reward function, which is very different from our approach. Recent literature on continuum armed bandits with contextual covariates further assumes the mean reward function is continuous on the arm-covariate space (Lu et al., 2010; Slivkins, 2011; Krishnamurthy et al., 2020).
Literature on high-dimensional continuum armed bandits, however, is scarce (Turgay et al., 2020; Majzoubi et al., 2020). Again, the techniques and assumptions therein are different from ours, and their models are hard to interpret. In terms of matrix estimation techniques, low-rank matrix estimation and recovery have been studied extensively in statistics and widely used in numerous applications (Candes & Plan, 2010; Candès & Tao, 2010; Negahban & Wainwright, 2011; Shabalin & Nobel, 2013; Gavish & Donoho, 2014; Cai & Zhang, 2018; Wainwright, 2019). We adapt the techniques in this literature to provide a convergence analysis for our algorithm. Finally, in operations research, assortment and pricing are important decisions for firms, and there exists voluminous literature on dynamic assortment and dynamic pricing. Most of the work on assortment is based on the multinomial logit (MNL) choice model (Caro & Gallien, 2007; Kök et al., 2008; Sauré & Zeevi, 2013), and recently a strand of work adopts multi-armed bandit techniques in the MNL model (Chen & Wang, 2017; Agrawal et al., 2019; Kallus & Udell, 2020; Chen et al., 2021). For dynamic pricing, the problem usually comes with demand learning. In the presence of covariates, the demand can be modeled as a parametric function (Qiang & Bayati, 2016; Ban & Keskin, 2021) or a nonparametric function (Chen & Gallego, 2021), which adopts the continuum armed bandit techniques of Slivkins (2011). However, there are relatively few papers on the joint assortment-pricing problem. Recently, Miao & Chao (2021) provided a solution using the MNL choice model with finitely many arms, while our model targets infinitely many arms. In addition, their model assumes the products are independent of each other and can only handle a small number of products; it can neither incorporate contextual information nor predict the performance of new products. Roadmap. The rest of the paper is organized as follows.
Section 2 describes the problem formulation and introduces our model with two concrete examples in assortment-pricing and health care. Section 3 presents our Hi-CCAB algorithm and its convergence result. Finally, Section 4 shows the empirical results on simulated data and a case study on real sales data from one of the largest online retailers. The proof of our theorem and additional empirical results are provided in the Appendix.

2. PROBLEM FORMULATION

In this section, we first introduce our high-dimensional continuum armed and high-dimensional contextual bandit model. Since our model is novel and different from traditional bandit models, we further provide intuition and two real applications of our model in assortment-pricing and healthcare. Finally, we show that a large class of bandit models can be reformulated into our model. Notation. We use bold lowercase for vectors and bold uppercase for matrices. For any vector a, we use ∥a∥ to denote its ℓ_2 norm. For any matrix A, we use ∥A∥_F := (Σ_{ij} a_{ij}²)^{1/2} to denote its Frobenius norm, ∥A∥_2 to denote its spectral norm, i.e., ∥A∥_2 := sup_{∥x∥=1} ∥Ax∥, and ∥A∥_* := Σ_{k=1}^d s_k to denote its nuclear norm, where d is the rank and the s_k's are the singular values of A. We use ⟨a, b⟩ := a^⊤b to denote the inner product between two vectors and ⟨A, B⟩ := trace(A^⊤B) between two matrices. Problem setup. At each time t, we make one decision for a batch of objects of size L. Before making the decision, we observe the attributes of these L objects, which can be characterized by potentially high-dimensional contextual vectors x_{t,1}, …, x_{t,L} ∈ R^{d_x}. Then, based on all the observations we have before time t and the contextual vectors at time t, we decide on an action to take (or equivalently an arm to choose), which can be characterized as a high-dimensional vector a_t that takes value in a constraint set A in the high-dimensional space R^{d_a}. After we take the action, we observe a batch of rewards,

r_{t,j} = a_t^⊤ Θ x_{t,j} + ε_{t,j},  j = 1, 2, …, L,   (1)

where Θ is a low-rank matrix and ε_{t,j} is independent noise with E[ε_{t,j}] = 0 and Var[ε_{t,j}] ≤ σ². As a bandit problem, our goal is to design a sequential decision-making policy π that maximizes the expected cumulative reward, or equivalently, minimizes the expected cumulative regret.
Specifically, suppose policy π governs the way we take actions a_1, a_2, a_3, …. The expected cumulative regret measures the difference between the cumulative expected reward of the best possible action when the underlying true parameter (i.e., Θ) is known and that achieved under policy π:

R_T^π = E[ Σ_{t=1}^T ( max_{a∈A_t} Σ_{j=1}^L a^⊤ Θ x_{t,j} − Σ_{j=1}^L a_{t,π}^⊤ Θ x_{t,j} ) ],   (2)

where the expectation is taken with respect to (x_{t,j}, ε_{t,j}) since a_{t,π} depends on both. We seek an optimal policy π* that minimizes the expected cumulative regret R_T^π. Note that R_T^π grows with T. To better measure the performance of a policy, we focus on the time-averaged expected cumulative regret, R_T^π/T. We will omit the subscript π in a for notational simplicity in the rest of the paper. At first sight, our model may seem remote from other bandit models and hard to interpret. Our model is, in fact, a generalization of a large class of bandit models, and this generality makes it applicable to a wide range of decision-making problems. In the following, we parse the model and provide more intuition. Let us consider the classical K-arm bandit model (without context). Each arm can be represented by a K-dimensional standard basis vector and the covariate vector is simply 1. Then Θ becomes a vector where each element is the mean reward of the corresponding arm. For the multi-arm contextual linear bandit problem, we further observe the covariate x as the context. Then each row of Θ becomes the coefficient vector β_k for each arm, i.e., r_k = β_k^⊤ x for k = 1, …, K. Our model further unifies a large class of bandit models, as we formalize later in Proposition 1. The novelty of our model lies in the low-rank representation matrix Θ. It encapsulates the effect of the arm and covariates on the reward and exploits the low-dimensional structure in the high-dimensional arm and covariates.
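One summand of the regret in (2) can be computed in closed form when the action set is the unit ball (the constraint used later in Theorem 1): the oracle action for round t is a* = Θz/∥Θz∥ with z = Σ_j x_{t,j}, since max_{∥a∥≤1} a^⊤Θz = ∥Θz∥. A small sketch, with an arbitrary policy action standing in for a_{t,π}:

```python
import numpy as np

rng = np.random.default_rng(1)
d_a, d_x, L = 8, 12, 4
Theta = rng.standard_normal((d_a, d_x))   # a hypothetical true parameter

X = rng.standard_normal((L, d_x))         # covariates x_{t,1}, ..., x_{t,L}
z = X.sum(axis=0)

# With A_t = {a : ||a|| <= 1}, max_a sum_j a^T Theta x_{t,j} = ||Theta z||,
# attained at a* = Theta z / ||Theta z||.
a_star = Theta @ z / np.linalg.norm(Theta @ z)
best = a_star @ Theta @ z

a_pi = rng.standard_normal(d_a)
a_pi /= np.linalg.norm(a_pi)              # an arbitrary policy action
regret_t = best - a_pi @ Theta @ z        # one summand of (2)
```

By construction regret_t is nonnegative: no feasible action can beat the oracle for that round.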
To be more specific, let us consider Θ to be exactly low-rank with rank d. Suppose its singular value decomposition is Θ = USV^⊤, where U^⊤U = I_d, S is a d×d diagonal matrix with positive diagonal elements, and V^⊤V = I_d. Let the left singular vectors be U = (u_1, …, u_d), the singular values on the diagonal of S be s_1 ≥ s_2 ≥ … ≥ s_d > 0, and the right singular vectors be V = (v_1, …, v_d). Then the mean reward in (1) can be re-expressed as

E[r] = a^⊤Θx = Σ_{i=1}^d s_i ⟨a, u_i⟩ · ⟨v_i, x⟩.

In other words, the mean reward is the sum of inner products between the action projected onto the left singular vectors and the covariates projected onto the right singular vectors, weighted by the singular values. By assuming Θ to be low-rank, the mean reward is assumed to be governed by only a few linear combinations of the arm attributes and covariates. Hence our model automatically exploits the low-dimensional structure of the arm vector and the contextual vector in terms of their effect on the reward, from which we can draw interpretations and insights from the effective subspaces of both the arm and the covariates. As a concrete illustration of our model, and to explain why it is reasonable in real applications, we provide the following use cases in the joint assortment-pricing problem and health care. Example 1 (Assortment and Pricing). In retailing and e-commerce, the assortment problem is to decide what combination of products to present at each given time with constraints on the capacity (Kök et al., 2008), and the pricing problem is to decide the prices of the products. The goal of both problems is to maximize a certain objective, such as revenue or profit. Products can usually be characterized by attributes such as color, pattern, and fit for apparel, or technical specifications for electronics and appliances. We focus on instant noodles, which will be our case study in Section 4.
Each product is single-flavor or assorted with different packs and can be represented as a feature vector p̃ = (#flavor_1, #flavor_2, …, #flavor_m), where m is the number of possible flavors, and is priced at p. The store then needs to decide which products to present and their corresponding prices. Namely, the arm (action) vector can be represented as a = (p̃_1, p_1, p̃_2, p_2, …, p̃_K, p_K, 1), where K is the maximum number of slots and each pair (p̃_k, p_k) consists of the k-th product's feature vector and price. The arm vector clearly lies in a high-dimensional continuous space. At the same time, we observe the contextual covariates x for each period of time, such as the location and season at the aggregate level or demographic information at the user level. The demand and sales of products with similar attributes react similarly to the same market conditions, and it is often the case that there exist latent factors of the products that govern demand and sales. Therefore, it is reasonable to parameterize the reward function in the form of (1) rather than ignoring the similarity between products as in the literature (Miao & Chao, 2021; Kallus & Udell, 2020; Chen et al., 2021). Our model can further suggest new products rather than only the products that have already been offered. Example 2 (Healthcare). In healthcare, for health-monitoring apps that both monitor health conditions and give users suggestions on actions to take, the arm (a) is high-dimensional and continuous (e.g., sleeping time, length and kind of exercise, usage of social media, diet choices including energy, water, protein, minerals, and nutrition intakes), and the health outcome depends not only on our suggestions but also on the user's characteristics (e.g., age, gender, weight, height, basic health status, tendency to follow suggestions) as contextual variables (x). Clearly, both the arm and the contextual vector are possibly high-dimensional, and the arm can take continuous values.
The classical bandit models do not fit this situation. The actions usually share similar effects on health, and the user's characteristics can usually be captured by a few latent factors. Therefore, it is reasonable to assume Θ to be low-rank. To close the section, Proposition 1 shows that the traditional multi-arm bandit, multi-arm high-dimensional contextual bandit, and continuum armed bandit can all be written in the form of model (1). Proposition 1. The following bandit models can be expressed as special cases of our model. 1. (multi-arm bandit) For the i-th arm, a = (0, 0, …, 1, …, 0), where the 1 is in the i-th element. Suppose x has its first element being constant. Then Θ_{i,1} = µ_i, where µ_i is the mean reward of the i-th arm, and Θ_{i,j} = 0 if j ≠ 1. Clearly, Θ has rank 1. 2. (multi-arm high-dimensional contextual bandit) For the i-th arm, a = (0, 0, …, 1, …, 0), where the 1 is in the i-th element, and x is the contextual vector. Then Θ = (β_1, β_2, …, β_m)^⊤, where β_i is the parameter vector corresponding to the i-th arm (Bastani & Bayati, 2020). 3. (continuum armed bandit (without context)) Suppose the arm in the original continuum armed bandit is denoted by a, and the mean reward function is f(a). Since every continuous function on a bounded interval can be approximated by polynomials to arbitrary precision, it is reasonable to assume f(a) to be a polynomial of order n, where n is not known precisely and only an upper bound N is known. Let a = (1, a, a², a³, …, aⁿ, …, a^N), and suppose the first element of x is the constant 1. Then Θ_{i+1,1} = f^{(i)}(0)/i! for i = 0, …, N, and Θ_{i,j} = 0 for j ≠ 1, so that a^⊤Θx recovers the Taylor expansion of f. Clearly, Θ has rank 1.
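The factorized form of the mean reward from Section 2 can be checked numerically: projecting the arm onto the left singular vectors and the covariates onto the right singular vectors and summing with singular-value weights reproduces a^⊤Θx exactly. A small sketch with an illustrative rank-3 Θ:

```python
import numpy as np

rng = np.random.default_rng(2)
d_a, d_x = 20, 30
# A rank-3 Theta for illustration.
Theta = rng.standard_normal((d_a, 3)) @ rng.standard_normal((3, d_x))

U, s, Vt = np.linalg.svd(Theta, full_matrices=False)
d = int(np.sum(s > 1e-10))   # numerical rank, here 3

a = rng.standard_normal(d_a)
x = rng.standard_normal(d_x)

# Mean reward as a sum over latent factors: sum_i s_i <a, u_i> <v_i, x>.
factorized = sum(s[i] * (a @ U[:, i]) * (Vt[i] @ x) for i in range(d))
assert np.isclose(factorized, a @ Theta @ x)
```

Only d of the min(d_a, d_x) singular directions contribute, which is the sense in which the reward is governed by a few latent factors.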

3. HI-CCAB ALGORITHM AND THEORETICAL RESULTS

In this section, we present our learning algorithm together with a regret upper bound. Specifically, we detail the Hi-CCAB algorithm in Section 3.1 and establish an upper bound on the convergence rate of its time-averaged expected cumulative regret in Section 3.2.

3.1. DESCRIPTION OF THE LEARNING ALGORITHM

Our policy consists of two phases at each period t ∈ [T]: the first phase learns a low-rank representation and the second phase determines the assortment and the selling prices. In the first phase, our policy estimates Θ̂_t by a penalized least-squares estimator using (a_i, x_{i,j}, r_{i,j}) for i = 1, …, t and j = 1, …, L. Based on Θ̂_t, we look for the optimal assortment and pricing within the action space A_t. Algorithm 1 describes the detailed procedure of our policy. Low-rank representation learning. As mentioned in Section 2, both the arm vector a ∈ R^{d_a} and the contextual vector x ∈ R^{d_x} are high-dimensional, and thus Θ ∈ R^{d_a×d_x} is also high-dimensional. Fortunately, there often exists structure in both the arm and covariate spaces, as explained in Section 1. To leverage the underlying structure, we impose a low-rank assumption on Θ, which automatically exploits the low-rank structure and the relationships between the action and the contextual covariates. To estimate the low-rank representation of Θ at time t, one can adopt the rank-penalized least squares:

Θ̂_t := arg min_Θ Σ_{i=1}^t Σ_{j=1}^L (a_i^⊤ Θ x_{i,j} − r_{i,j})² + λ_t · rank(Θ),   (4)

where λ_t > 0 is the penalization parameter and rank(Θ) is the rank of the matrix Θ. However, the rank penalization makes (4) a non-convex problem, leading to computational challenges. To address them, the rank penalty is often replaced by the nuclear norm in the matrix estimation and completion literature, so that the optimization problem becomes convex. We adopt a similar idea, and our objective function becomes:

Θ̂_t := arg min_Θ Σ_{i=1}^t Σ_{j=1}^L (a_i^⊤ Θ x_{i,j} − r_{i,j})² + λ_t · ∥Θ∥_*.

The penalization parameter λ_t is updated in each iteration as λ_t = λ_0/√t, where λ_0 is the initial penalization parameter, which can be chosen by cross-validation or guided by ∥(1/(2t_1L)) Σ_{i=1}^{t_1} Σ_{j=1}^L |a_i^⊤ Θ̂_{t_1} x_{i,j} − r_{i,j}| x_{i,j} a_i^⊤∥_2. Algorithm 1: The Hi-CCAB Algorithm.
Result: Actions a_{t_1+1}, …, a_T.
Input: Number of initialization steps t_1, set of possible actions A_{t_1}, action vectors based on domain knowledge {a_i}_{i=1}^{t_1}, covariate vectors {x_{i,j}}_{i=1}^{t_1}, rewards r_{i,j} for j = 1, …, L, and exploration parameter h.
Initialization: λ_0 ← ∥(1/(2t_1L)) Σ_{i=1}^{t_1} Σ_{j=1}^L |a_i^⊤ Θ̂_{t_1} x_{i,j} − r_{i,j}| x_{i,j} a_i^⊤∥_2, t ← t_1.
while t < T do
  λ_t ← λ_0/√t;
  Low-rank representation learning: Θ̂_t ← arg min_Θ (1/(tL)) Σ_{i=1}^t Σ_{j=1}^L (a_i^⊤ Θ x_{i,j} − r_{i,j})² + λ_t ∥Θ∥_*;
  Policy learning: â_{t+1} ← arg max_{a∈A_t} Σ_{j=1}^L a^⊤ Θ̂_t x_{t+1,j};
  Exploitation, if t ∉ {⌊w^{3/2}⌋ : w ∈ Z_+}: a_{t+1} ← â_{t+1};
  Exploration, if t ∈ {⌊w^{3/2}⌋ : w ∈ Z_+}: a_{t+1} ← â_{t+1} + δ_{t+1} and update the action space A_{t+1}, where
    1. δ_{t+1} ∼ N(0_{d_a}, hI_{d_a}), or
    2. δ_{t+1} ∼ N(0_{d_a}, diag(τ̃_t²)) with τ̃²_{t,j} = sd({ã_{i,j}}_{i=1}^t), where sd(·) computes the standard error;
  Apply action a_{t+1} and observe rewards r_{t+1,j} for j = 1, …, L;
  t ← t + 1;
end
Policy learning. Once we have estimated the low-rank representation Θ̂_t, we proceed to the action step. The goal of the action step is to exploit the knowledge learned so far, i.e., Θ̂_t, so as to decide on the next action a_{t+1} that maximizes the reward, and at the same time to explore actions that better inform the true Θ, which in turn helps make better decisions and achieve higher long-term rewards. Specifically, given Θ̂_t and the covariates x_{t+1,j} for j = 1, …, L, we look for an action â_{t+1} in the action space A_t that maximizes the total reward across the L objects: â_{t+1} := arg max_{a∈A_t} Σ_{j=1}^L a^⊤ Θ̂_t x_{t+1,j}. We further perturb â_{t+1} for the purpose of exploration by adding random noise to each coordinate when t ∈ {⌊w^{3/2}⌋ : w ∈ Z_+}, i.e., a_{t+1} = â_{t+1} + δ_{t+1}, where δ_{t+1} ∼ N(0_{d_a}, hI_{d_a}) and h is a tuning parameter. The intuition behind ⌊w^{3/2}⌋ is to explore more in the initial stage and exploit more in the later stage of the algorithm.
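The nuclear-norm-penalized least-squares step can be solved by proximal gradient descent, where the proximal operator of the nuclear norm is singular-value soft-thresholding. The sketch below is one standard way to carry out this step, not the paper's exact solver; the step size, iteration count, and toy dimensions are illustrative choices:

```python
import numpy as np

def svt(M, tau):
    """Singular-value thresholding: the prox operator of tau * ||.||_*."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def fit_theta(A, X, R, lam, step=1e-2, n_iter=1000):
    """Proximal-gradient (ISTA) sketch of the nuclear-norm-penalized
    least squares: minimize over Theta
        (1/(tL)) * sum_{i,j} (a_i^T Theta x_{i,j} - r_{i,j})^2 + lam * ||Theta||_*.
    A: (t, d_a) actions; X: (t, L, d_x) covariates; R: (t, L) rewards."""
    t, L = R.shape
    Theta = np.zeros((A.shape[1], X.shape[2]))
    for _ in range(n_iter):
        resid = np.einsum('ix,ijx->ij', A @ Theta, X) - R   # predictions minus rewards
        W = np.einsum('ij,ijx->ix', resid, X)               # sum_j resid_{ij} x_{ij}
        grad = (2.0 / (t * L)) * A.T @ W                    # gradient of the squared loss
        Theta = svt(Theta - step * grad, step * lam)        # prox step on the nuclear norm
    return Theta

# Toy check: recover a rank-2 Theta from noiseless bilinear observations.
rng = np.random.default_rng(3)
t, L, d_a, d_x = 200, 5, 10, 15
Theta_true = rng.standard_normal((d_a, 2)) @ rng.standard_normal((2, d_x)) / np.sqrt(d_a)
A = rng.standard_normal((t, d_a))
X = rng.standard_normal((t, L, d_x))
R = np.einsum('ix,ijx->ij', A @ Theta_true, X)
Theta_hat = fit_theta(A, X, R, lam=1e-3)
```

With enough iterations and a small penalty, the estimate is close to the true low-rank matrix in Frobenius norm; any off-the-shelf convex solver for the same objective would serve equally well.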
To be specific, there are about T^{2/3} exploration steps before time T, and the density of exploration in a small time window around T is of order T^{-1/3}, which goes to zero as T → ∞. Note that the exponent can be any number larger than 1 instead of 3/2, which affects the convergence rate of the regret, as we discuss later in Remark 2; the polynomial form can be changed as well. For each exploration step, one can also let δ_{t+1} ∼ N(0_{d_a}, diag(τ̃_t²)), where each element of τ̃_t² is the coordinate-wise standard error of the previous actions {a_i}_{i=1}^t. The intuition is to avoid the tuning parameter h while staying at the right scale. Finally, we update the action space A_{t+1} according to a_{t+1}. For example, if the action space A_t ⊂ R^{d_a} is defined by an upper limit ā_t and a lower limit a̲_t, then we simply expand the action space by pushing the boundary of each coordinate to a_{t+1,j} if a_{t+1,j} ∉ [a̲_{t,j}, ā_{t,j}] for j = 1, …, d_a. Remark 1. To take advantage of the interpretability of our model, we can further explore the structure of Θ̂_t. Specifically, we can apply singular value decomposition (SVD) on Θ̂_t to explore the underlying latent structure of the covariates from the right singular vectors, and apply SVD on (Θ̂_t Σ_{j=1}^L x_{t,j}) to explore the latent structure of the arms from the left singular vectors. One can further rotate the singular vectors to reveal the underlying factors using techniques from factor analysis such as Varimax (Kaiser, 1958; Rohe & Zeng, 2020), or perform clustering analysis by applying K-means to the singular vectors.
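The exploration schedule {⌊w^{3/2}⌋ : w ∈ Z_+} described above can be checked numerically: the number of exploration rounds up to horizon T is about T^{2/3}, and the rounds thin out over time (the horizon T below is an illustrative choice):

```python
import numpy as np

T = 10_000
# Exploration rounds: t in {floor(w^(3/2)) : w = 1, 2, ...}, capped at horizon T.
w = np.arange(1, T + 1)
explore_times = {int(v) for v in np.floor(w ** 1.5) if v <= T}

n_explore = len(explore_times)                        # about T^(2/3) rounds
early = sum(1 for s in explore_times if s <= 1000)    # explorations in the first 10%
late = sum(1 for s in explore_times if s > 9000)      # explorations in the last 10%
```

For T = 10,000 this gives roughly T^{2/3} ≈ 464 exploration rounds in total, with noticeably more of them in the first tenth of the horizon than in the last tenth, matching the explore-early, exploit-late intuition.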

3.2. THEORETICAL RESULTS

In this section, we establish in Theorem 1 that the convergence rate of the time-averaged expected cumulative regret of Algorithm 1 is at least as fast as T^{-2/15}, and we outline the proof strategy. We consider the time-averaged expected cumulative regret since it measures the trend of the newly incurred regret in the long run more directly; the expected cumulative regret, on the other hand, grows with time T, which is less interpretable. The theorem implies that the newly incurred regret, roughly speaking, converges to zero quickly. Theorem 1. Suppose x_{t,j} ∼ N(0_{d_x}, I_{d_x}) i.i.d., the errors ε_{t,j} defined in reward model (1) follow a normal distribution ε_{t,j} ∼ N(0, σ²) i.i.d., and Θ has rank d. Suppose the exploration step in Algorithm 1 is a_t = â_t + δ_t for t ∈ {⌊w^{3/2}⌋ : w ∈ Z_+}, where δ_t ∼ N(0_{d_a}, hI_{d_a}), and A_t = {a ∈ R^{d_a} : ∥a∥ ≤ 1}. Then there is a T_1 such that for T ≥ T_1, the expected cumulative regret of Algorithm 1, R_T^π, satisfies

R_T^π/T ≤ 2√L d_x ∥Θ∥_2 T_1 T^{-1} + (72/5) (λ_0 √(2d) d_x √L / h²) T^{-1/6} + (60/13) √L d_x ∥Θ∥_2 T^{-2/15} + (90/13) (σ(d_x + 1)/h²) T^{-2/15},   (7)

where T_1 = C_{h,L,λ_0} (d_x + d_a)^6 (log(d_x + d_a))³ and the constant C_{h,L,λ_0} depends on h, L, and λ_0. For T ≤ T_1, R_T^π/T ≤ 2√L d_x ∥Θ∥_2. Remark 2 (Convergence rate). An intuitive reading of Theorem 1 is that the expected regret incurred at each time converges to zero at speed at least T^{-2/15} as T goes to infinity. The convergence rate depends on the frequency of exploration, which is governed by the exponent 3/2 in the exploration set {⌊w^{3/2}⌋ : w ∈ Z_+}. Recall that the exponent can be replaced with any number larger than 1, so it can be considered a tuning parameter. Remark 3 ("Burnout" term). The first term in inequality (7) is a "burnout" term, covering the period in which the algorithm gains knowledge of Θ from scratch. We do not impose any assumptions on these starting steps, so this "burnout" term is relatively conservative.
However, in practice we usually have historical data to start with, so the algorithm can start from a reasonable estimate of Θ and a much smaller "burnout" term. Recall that the exponent of the exploration set can be any number larger than 1. The order of the "burnout" term depends on the exponent of w in the exploration set: the more exploration there is, the smaller the "burnout" term. The exponent can be chosen depending on how ample the historical data is. Remark 4 (Constant C_{h,L,λ_0} of T_1). While the constant C_{h,L,λ_0} depends on h, L, and λ_0, the primary dependence is on h and L. The order of λ_0 in terms of the dimensions and noise level is σ√d_x. We do not assume the order of λ_0 or bound it via a high-probability bound, in order to show its role in the time-averaged expected cumulative regret. If we utilize the order σ√d_x, then C_{h,L,λ_0} can be replaced by a constant depending on h and L only. Remark 5 (Dependence on dimensions d_a, d_x and rank d). When T is small, the "burnout" term (the first term) dominates. It depends on T and the dimensions, but not the rank, as (d_a + d_x)^6 (log(d_a + d_x))³ T^{-1}, whose order depends on the exponent defining the exploration set (i.e., how frequently we explore). As T grows, the second term dominates. Recalling Remark 4, λ_0 is of order σ√d_x, so the second term depends on T, d_x, and d, but not d_a, at the order Ω(d_x √d T^{-1/6}). Without the low-rank assumption, the order would instead be Ω(d_x^{3/2} T^{-1/6}). When T becomes even larger, the last two terms dominate, at the order Ω(d_x T^{-2/15}). However, this last case rarely happens, as it requires T to be of order d^{15} or larger. Therefore, taking dimensions and rank into consideration, the time-averaged expected cumulative regret is mostly of order Ω(d_x √d T^{-1/6}). Proof sketch. We outline the proof strategy for Theorem 1.
There are two major steps: (1) bounding the estimation error of the low-rank representation matrix estimator; (2) bounding the expected cumulative regret. The detailed proof of Theorem 1 is in Appendix A. (1) Bounding the estimation error of Θ̂_t with a high-probability bound. Denote δΘ_t = Θ̂_t − Θ. We show that for large t,

P( ∥δΘ_t∥_F ≤ (3/t^{2/15}) σ√(d_x+1)/(√L h²) + 6λ_0√(2d)/(h² t^{1/6}) ) ≥ 1 − (3/t + 2/t² + 2/(Lt) + 2/(L³t³) + 1/t^{2/15}).

Note that the action taken is based on previous estimators and affects the accuracy of future estimators, leading to many dependencies, so the classical matrix completion results no longer apply. Through careful use of conditional expectations, martingales, and empirical processes, we separate out the different sources of randomness (i.e., δ_1, …, δ_t, x_{1,•}, …, x_{t,•}) to derive the bounds. Lemma 1 establishes a restricted-strong-convexity-type result for the sum of squares in the objective function. Lemma 2 establishes a Lipschitz-type result for the sum of squares in the objective function. Further analysis of the nuclear-norm-penalized sum of squares with the two lemmas and the low-rank properties gives the tail bound on the estimation error. (2) Bounding the time-averaged expected regret. Let Q_t = {∥δΘ_t∥_F ≤ (3/t^{2/15}) σ√(d_x+1)/(√L h²) + 6λ_0√(2d)/(h² t^{1/6})} be the event that δΘ_t is bounded. From the first step, for large t, P(Q_t^c) ≤ 3/t + 2/t² + 2/(Lt) + 2/(L³t³) + 1/t^{2/15}. We consider the expectation of the regret on Q_t and Q_t^c separately, and both terms vanish with t at a polynomial rate.

4. SIMULATION STUDY AND ASSORTMENT-PRICING CASE STUDY

In this section, we conduct simulation studies to compare the proposed Hi-CCAB with LinUCB (Li et al., 2010), Lasso Bandit (Bastani & Bayati, 2020), NeuralUCB (Zhou et al., 2020) and EE-Net (Ban et al., 2022); we then study the joint assortment-pricing problem on the e-commerce platform of one of the largest instant noodle producers in China. Details on the tuning parameters of each algorithm and additional results of the case study are provided in Appendices B and C.

Simulation study

We consider the multi-armed linear bandit setup, a special case of our model as shown in Proposition 1, i.e., $\Theta = (\beta_1, \beta_2, \dots, \beta_m)^\top$, so that each row of $\Theta$ is the parameter vector of one arm in the multi-armed contextual bandit. Specifically, we set the number of arms $d_a \in \{10, 30, 50\}$ and the dimension of covariates $d_x = 100$. For $\Theta$, we consider a non-sparse and a sparse case. For the non-sparse case, we generate $\Theta = UDV^\top$, where $U \in \mathbb{R}^{d_a\times r}$, $V \in \mathbb{R}^{d_x\times r}$ ($r = 5$), and $D$ is a diagonal matrix with diagonal entries $(1, .9, .9, .8, .5)$. All entries of $U$ and $V$ are first generated i.i.d. from $N(0,1)$, and Gram-Schmidt is then applied to orthogonalize the columns. Each column of $U$ is scaled to have length $\sqrt{d_a}$ so that the rewards are comparable across different $d_a$'s. For the sparse case, each row of $\Theta$ is set to zero except for $s_0 = 2$ randomly selected entries drawn from $N(0,1)$. We generate the covariates $x \overset{\text{i.i.d.}}{\sim} N(0, I_{d_x})$ and the rewards from (1) with $\sigma = 0.1$. Figure 1 shows the cumulative regret (averaged over 50 simulations). For the non-sparse case, Hi-CCAB converges faster than all other methods, and its advantage is more pronounced as the dimension of the arms grows. For the sparse case, which does not favor Hi-CCAB, Lasso Bandit converges faster when the dimension of the arms is relatively small ($d_a = 10$), but the gap between Hi-CCAB and Lasso Bandit is small. As the number of arms increases, Hi-CCAB outperforms all other methods.

Assortment-pricing case study. The original data contain daily sales of 176 products across 369 cities from March 1st, 2021 to May 31st, 2022 ($T = 456$ days). We aggregate the sales over 31 provinces. Each product is of either a single or an assorted flavor (13 possible flavors) with different counts. The assortment and price of each product changed daily. In addition, we know the dates of promotions. The assortment, prices, and promotions were the same across locations.
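The non-sparse simulation design above (Gaussian factors with orthogonalized columns, a fixed spectrum, and rescaled $U$) can be reproduced in a few lines of numpy. This is a sketch of our reading of the setup, with `np.linalg.qr` standing in for Gram-Schmidt and the function name our own:

```python
import numpy as np

def make_low_rank_theta(d_a, d_x, r=5, spectrum=(1.0, 0.9, 0.9, 0.8, 0.5), seed=0):
    # Gaussian factors with orthonormalized columns (QR = Gram-Schmidt),
    # then U rescaled so each column has length sqrt(d_a), as in the text.
    rng = np.random.default_rng(seed)
    U, _ = np.linalg.qr(rng.normal(size=(d_a, r)))
    V, _ = np.linalg.qr(rng.normal(size=(d_x, r)))
    U = U * np.sqrt(d_a)
    return U @ np.diag(spectrum) @ V.T   # rank-r representation matrix
```

The singular values of the resulting $\Theta$ are $\sqrt{d_a}$ times the chosen spectrum, which is what makes the reward scale comparable across different numbers of arms.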
The maximum number of products to be shown on the homepage is $K = 30$. The total number of possible combinations is then $\binom{176}{30}$; if we consider each combination as one arm, we face an extremely high-dimensional arm space for which most multi-armed bandit algorithms are not applicable. To apply Hi-CCAB, we specify the arm $a_t$ and the covariate vectors $\{x_{t,j}\}_{j=1}^{L=31}$ at a given time $t$ following the setup in Example 1. The arm is represented as
$$a = \big(\tilde p_1^\top, (\tilde p_1^2)^\top, p_1, p_1^2, \text{promo}_1, \text{promo}_1^2, \dots, \tilde p_K^\top, (\tilde p_K^2)^\top, p_K, p_K^2, \text{promo}_K, \text{promo}_K^2, 1\big) \in \mathbb{R}^{2(m+2)K+1} = \mathbb{R}^{901},$$
where $\tilde p_k = (\#flavor_{k,1}, \dots, \#flavor_{k,m})$ is a vector of non-negative integers denoting the counts of the $m = 13$ flavors, $p_k$ is the price, $\text{promo}_k$ is the promotion indicator of product $k$, and $\tilde p_k^2$ is the element-wise square. The covariate $x_{t,j} \in \mathbb{R}^{50}$ for location $j$ includes dummy variables for the 31 provinces, year 2021/2022, 12 months, weekdays, and an indicator of the annual sales events on Jun 18 and Nov 11. More details are deferred to Appendix C. To run simulations using the dataset, we first create a pseudo-truth model. Specifically, we estimate $\Theta$ and $\sigma$ using all data of the 456 days and treat them as the pseudo ground truth. Before proceeding to the formal analysis, we perform a sanity check of our model assumption (1) with the pseudo ground truth against the data, and we further examine the structure of the representation matrix $\Theta$ in Appendix C. We evaluate the performance of Hi-CCAB in terms of the cumulative regret (2) and the percentage gain in cumulative sales relative to the original actions, since no existing bandit algorithm is applicable to this problem. Figure 2a shows the time-averaged cumulative regret (averaged over 100 simulations) and Figure 2b shows the percentage gain in cumulative sales compared to the real sales. The time-averaged cumulative regret of Hi-CCAB converges to zero, while that of the original actions remains flat.
In terms of percentage gain in cumulative sales, Hi-CCAB boosts cumulative sales by more than 4 times. Moreover, Hi-CCAB with exploration performs better than Hi-CCAB without exploration in terms of both cumulative regret and percentage sales gain.
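The 901-dimensional arm encoding described above can be sketched as follows; the container format for a product (a dict with `flavors`, `price`, `promo` keys) is our own choice for illustration:

```python
import numpy as np

M, K = 13, 30   # m flavors, K homepage slots

def encode_arm(products):
    # products: list of K items, each with a length-M flavor-count vector,
    # a price, and a 0/1 promotion indicator.
    parts = []
    for prod in products:
        p = np.asarray(prod['flavors'], dtype=float)
        assert p.shape == (M,)
        parts += [p, p ** 2]                              # counts + element-wise squares
        parts += [[prod['price'], prod['price'] ** 2]]    # price and squared price
        parts += [[prod['promo'], prod['promo'] ** 2]]    # promo indicator and its square
    parts += [[1.0]]                                      # intercept
    a = np.concatenate([np.atleast_1d(x) for x in parts])
    assert a.shape == (2 * (M + 2) * K + 1,)              # 2(m+2)K + 1 = 901
    return a
```

Each product contributes $2m + 4 = 2(m+2)$ coordinates (flavor counts and their squares, price and its square, promotion and its square), so $K$ products plus the intercept give $2(m+2)K + 1 = 901$.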

5. CONCLUSION

With the increasing demand for online decision-making, the bandit problem is receiving more and more attention from both theoreticians and practitioners. Even though the volume of bandit literature has been expanding, there is very little work on the high-dimensional continuum armed contextual bandit with high-dimensional covariates. In this work, we formulate and propose a model for this problem. Our model is general, as it unifies a large class of bandit problems, and has interpretability and predictive power. We propose an efficient algorithm, Hi-CCAB, by adopting the low-rank matrix estimator, and provide an upper bound on its convergence rate in terms of the time-averaged expected cumulative regret. The generality and flexibility of our model allow for its application to the joint assortment-pricing problem; the assortment and pricing optimization problems have been studied extensively in operations research separately, but not jointly. By applying our model and algorithm to a real case study of the joint assortment-pricing problem for one of the largest instant noodle producers in China, we are able to boost sales by four times and to provide insights into the underlying structure of the effects of the arms and covariates on the reward, such as purchasing behaviors. Therefore, both the theory and the real case study indicate that our model and algorithm can be effective for the high-dimensional continuum armed and high-dimensional contextual bandit problems faced by decision-makers in various fields. Since our model is new to the bandit literature, there is room for improvement in our regret analysis; this is an interesting future direction.

APPENDIX A PROOF OF THEOREM 1

In this proof, we denote the true parameter as $\Theta^*$. Let
$$\mathcal{L}_T(\Theta) := \frac{1}{2LT}\sum_{t=1}^T\sum_{l=1}^L \big(a_{t,l}^\top \Theta x_{t,l} - r_{t,l}\big)^2.$$
Then we have the following lemmas, which we prove later.

Lemma 1. Suppose all the assumptions in Theorem 1 hold. Denote $\mathcal{E}_T(\Delta) = \mathcal{L}_T(\Theta^* + \Delta) - \mathcal{L}_T(\Theta^*) - \langle\nabla\mathcal{L}_T(\Theta^*), \Delta\rangle$. Then with probability at least $1 - \frac{1}{LT} - \frac{2}{T} - \frac{1}{T^2}$,
$$\mathcal{E}_T(\Delta) \ge \frac{\lfloor T^{2/3}\rfloor}{2T} h^2 \|\Delta\|_F^2 - 14 T^{-2/3}(h+h^2)\big(2d_x + 2d_a + 6\log T + 6\log L\big)^2 \log T\, \|\Delta\|_2^2.$$

Lemma 2. Suppose all the assumptions in Theorem 1 hold. With probability at least $1 - \frac{1}{T^{2/15}} - \frac{2}{L^3T^3} - \frac{1}{LT} - \frac{1}{T} - \frac{1}{T^2}$, the following holds for all $\Delta$:
$$|\langle\nabla\mathcal{L}_T(\Theta^*), \Delta\rangle| \le \|\Delta\|_F\,\frac{\sigma\sqrt{d_x+1}}{\sqrt{LT}}T^{1/30} + \Big(2h\sigma T^{-2/3}\log T\sqrt{\tfrac{\max\{d_a,d_x\}\log(d_a+d_x)}{L}} + \tfrac{8h\sigma}{T}\log(TL)\sqrt{(d_x+3\log(LT))(d_a+3\log T)(\log(d_x+d_a)+2\log T)}\Big)\|\Delta\|_*.$$

Recalling the definition of $\widehat\Theta_T$, we know that
$$\mathcal{L}_T(\widehat\Theta_T) + \lambda_T\|\widehat\Theta_T\|_* \le \mathcal{L}_T(\Theta^*) + \lambda_T\|\Theta^*\|_*. \quad (10)$$
Denote $\delta\Theta_t = \widehat\Theta_t - \Theta^*$; for notational simplicity we drop the subscript $t$ of $\delta\Theta_t$ in the following when there is no confusion. Equation (10) then implies that
$$\mathcal{E}_T(\delta\Theta) \le -\langle\nabla\mathcal{L}_T(\Theta^*), \delta\Theta\rangle + \lambda_T\big(\|\Theta^*\|_* - \|\Theta^* + \delta\Theta\|_*\big). \quad (11)$$
Suppose the singular value decomposition of $\Theta^*$ is $\Theta^* = USV^\top$, where $S$ is a $d\times d$ diagonal matrix. Let $U_\perp$ be a $d_a\times(d_a-d)$ matrix satisfying $(U, U_\perp)(U, U_\perp)^\top = I_{d_a}$, and define $V_\perp$ similarly. Denote $\delta\Theta_\perp = U_\perp U_\perp^\top\, \delta\Theta\, V_\perp V_\perp^\top$. Then
$$\|\Theta^* + \delta\Theta\|_* \ge \|\Theta^* + \delta\Theta_\perp\|_* - \|\delta\Theta - \delta\Theta_\perp\|_* = \|\Theta^*\|_* + \|\delta\Theta_\perp\|_* - \|\delta\Theta - \delta\Theta_\perp\|_* \ge \|\Theta^*\|_* + \|\delta\Theta_\perp\|_* - \sqrt{2d}\,\|\delta\Theta - \delta\Theta_\perp\|_F.$$
Going back to Inequality (11) and combining with Lemma 1 and Lemma 2, with probability at least $1 - \frac{3}{T} - \frac{2}{T^2} - \frac{2}{LT} - \frac{2}{L^3T^3} - \frac{1}{T^{2/15}}$ the following holds:
$$\Big(\frac{\lfloor T^{2/3}\rfloor}{2T}h^2 - 14T^{-2/3}(h+h^2)(2d_x+2d_a+6\log T+6\log L)^2\log T\Big)\|\delta\Theta\|_F^2 \le \|\delta\Theta\|_F\,\frac{\sigma\sqrt{d_x+1}}{\sqrt{LT}}T^{1/30} + \Big(\tfrac{8h\sigma}{T}\log(TL)\sqrt{(d_x+3\log(LT))(d_a+3\log T)(\log(d_x+d_a)+2\log T)} + 2h\sigma T^{-2/3}\log T\sqrt{\tfrac{\max\{d_a,d_x\}\log(d_a+d_x)}{L}}\Big)\big(\|\delta\Theta - \delta\Theta_\perp\|_* + \|\delta\Theta_\perp\|_*\big) + \frac{\lambda_0}{\sqrt{T}}\sqrt{2d}\,\|\delta\Theta\|_F - \frac{\lambda_0}{\sqrt{T}}\|\delta\Theta_\perp\|_*.$$
Note that $\|\delta\Theta - \delta\Theta_\perp\|_* \le \sqrt{2d}\,\|\delta\Theta - \delta\Theta_\perp\|_F$; divide both sides by $\|\delta\Theta\|_F$ and multiply both sides by $3T^{1/3}/h^2$. Suppose $T_1$ satisfies
$$T_1 \ge 8, \qquad T_1^{1/3} \ge 12\times 14\Big(1+\frac{1}{h}\Big)(2d_x+2d_a+6\log T_1+6\log L)^2\log T_1,$$
$$\lambda_0 T_1^{1/6} \ge \frac{8h\sigma}{T_1^{1/3}}\log(T_1L)\sqrt{(d_x+3\log(LT_1))(d_a+3\log T_1)(\log(d_x+d_a)+2\log T_1)} + 2h\sigma T_1^{-2/3}\log T_1\sqrt{\frac{\max\{d_a,d_x\}\log(d_a+d_x)}{L}}. \quad (13)$$
Then we have
$$\|\delta\Theta\|_F \le \frac{3}{T^{2/15}}\frac{\sigma\sqrt{d_x+1}}{\sqrt{L}h^2} + \frac{6\lambda_0\sqrt{2d}}{h^2T^{1/6}}. \quad (14)$$
Note that there is a constant $C_{h,L,\lambda_0}$ depending on $L$, $h$ and $\lambda_0$ such that Inequalities (13) hold for $T_1 \ge C_{h,L,\lambda_0}(d_x+d_a)^6(\log(d_x+d_a))^3$.

Next we proceed to bound the regret. Denote the event that (14) holds at time $t$ by $Q_t$ and its complement by $Q_t^c$. Then
$$P(Q_t^c) \le \frac{3}{t} + \frac{2}{t^2} + \frac{2}{Lt} + \frac{2}{L^3t^3} + \frac{1}{t^{2/15}},$$
and, conditional on $Q_t^c$, $\widehat\Theta_t \perp\!\!\!\perp b_{t+1}$. Let the oracle optimal action at time $t$ be $a_t^*$ and let $b_t = \sum_{l=1}^L x_{t,l}$. Then
$$R_T^\pi - \mathbb{E}\sum_{t=0}^{T_1-1}\sum_{l=1}^L\big(a_{t+1}^{*\top}\Theta^* x_{t+1,l} - a_{t+1}^\top\Theta^* x_{t+1,l}\big) \le \mathbb{E}\sum_{t=T_1}^{T-1}\sum_{l=1}^L\big(a_{t+1}^{*\top}\Theta^* x_{t+1,l} - a_{t+1}^\top\Theta^* x_{t+1,l}\big)$$
$$\le \mathbb{E}\sum_{t=T_1}^{T-1}\Big\langle\frac{\Theta^* b_{t+1}}{\|\Theta^* b_{t+1}\|_2} - \frac{\widehat\Theta_t b_{t+1}}{\|\widehat\Theta_t b_{t+1}\|_2},\ \Theta^* b_{t+1}\Big\rangle$$
$$= \sum_{t=T_1}^{T-1}\mathbb{E}\Big[\Big\langle\frac{\Theta^* b_{t+1}}{\|\Theta^* b_{t+1}\|_2} - \frac{\widehat\Theta_t b_{t+1}}{\|\widehat\Theta_t b_{t+1}\|_2},\ \Theta^* b_{t+1}\Big\rangle\mathbf{1}\{Q_t\}\Big] + \mathbb{E}\Big[\Big\langle\frac{\Theta^* b_{t+1}}{\|\Theta^* b_{t+1}\|_2} - \frac{\widehat\Theta_t b_{t+1}}{\|\widehat\Theta_t b_{t+1}\|_2},\ \Theta^* b_{t+1}\Big\rangle\mathbf{1}\{Q_t^c\}\Big]$$
$$\le \sum_{t=T_1}^{T-1}\mathbb{E}\Big[\Big\langle\frac{(\Theta^* - \widehat\Theta_t)b_{t+1}}{\|\Theta^* b_{t+1}\|_2} + \frac{\|\widehat\Theta_t b_{t+1}\|_2 - \|\Theta^* b_{t+1}\|_2}{\|\widehat\Theta_t b_{t+1}\|_2\,\|\Theta^* b_{t+1}\|_2}\,\widehat\Theta_t b_{t+1},\ \Theta^* b_{t+1}\Big\rangle\mathbf{1}\{Q_t\}\Big] + 2\|\Theta^*\|_2\,\mathbb{E}[\|b_{t+1}\|_2]\Big(\frac{3}{t} + \frac{2}{t^2} + \frac{2}{Lt} + \frac{2}{L^3t^3} + \frac{1}{t^{2/15}}\Big)$$
$$\le \sum_{t=T_1}^{T-1}\mathbb{E}\big[2\|\delta\Theta\|_2\,\|b_{t+1}\|\,\mathbf{1}\{Q_t\}\big] + 2\sqrt{Ld_x}\,\|\Theta^*\|_2\Big(\frac{3}{t} + \frac{2}{t^2} + \frac{2}{Lt} + \frac{2}{L^3t^3} + \frac{1}{t^{2/15}}\Big)$$
$$\le \sum_{t=T_1}^{T-1}\mathbb{E}\big[2\|\delta\Theta\|_F\,\|b_{t+1}\|\,\mathbf{1}\{Q_t\}\big] + 2\sqrt{Ld_x}\,\|\Theta^*\|_2\Big(\frac{3}{t} + \frac{2}{t^2} + \frac{2}{Lt} + \frac{2}{L^3t^3} + \frac{1}{t^{2/15}}\Big)$$
$$\le 2\sum_{t=T_1}^{T-1}\sqrt{Ld_x}\Big(\frac{3}{t^{2/15}}\frac{\sigma\sqrt{d_x+1}}{\sqrt{L}h^2} + \frac{6\lambda_0\sqrt{2d}}{h^2 t^{1/6}}\Big) + 2\sqrt{Ld_x}\,\|\Theta^*\|_2\Big(\frac{3}{t} + \frac{2}{t^2} + \frac{2}{Lt} + \frac{2}{L^3t^3} + \frac{1}{t^{2/15}}\Big). \quad (15)$$
Similar arguments also give
$$\mathbb{E}\sum_{t=0}^{T_1-1}\sum_{l=1}^L\big(a_t^{*\top}\Theta^* x_{t,l} - a_t^\top\Theta^* x_{t,l}\big) \le T_1\times 2\sqrt{Ld_x}\,\|\Theta^*\|_2.$$
Therefore, for $T \ge T_1$,
$$\frac{R_T^\pi}{T} \le 2\sqrt{Ld_x}\,\|\Theta^*\|_2\,\frac{T_1}{T} + \frac{60}{13}\sqrt{Ld_x}\,\|\Theta^*\|_2\,T^{-2/15} + \frac{90}{13}\frac{\sigma(d_x+1)}{h^2}T^{-2/15} + \frac{72}{5}\frac{\lambda_0\sqrt{2dd_xL}}{h^2}T^{-1/6}.$$

A.1 PROOF OF LEMMA 2

Suppose $r_{t,l} = a_{t,l}^\top\Theta^* x_{t,l} + \sigma\varepsilon_{t,l}$. Then
$$\nabla\mathcal{L}_T(\Theta^*) = \frac{\sigma}{LT}\sum_{t=1}^T\sum_{l=1}^L -\varepsilon_{t,l}\,x_{t,l}a_{t,l}^\top = \frac{\sigma}{LT}\sum_{l=1}^L -\varepsilon_{1,l}\,x_{1,l}a_{1,l}^\top + \frac{\sigma}{LT}\sum_{t=2}^T\sum_{l=1}^L\big({-\varepsilon_{t,l}}\,x_{t,l}\hat a_t^\top - \varepsilon_{t,l}\,x_{t,l}\delta_t^\top\big). \quad (18)$$
We now consider the terms in (18) separately. Let
$$S_2 = \frac{\sigma}{LT}\sum_{l=1}^L -\varepsilon_{1,l}\,x_{1,l}a_{1,l}^\top + \frac{\sigma}{LT}\sum_{t=2}^T\sum_{l=1}^L -\varepsilon_{t,l}\,x_{t,l}\hat a_t^\top, \qquad S_3 = \frac{\sigma}{LT}\sum_{t=2}^T\sum_{l=1}^L -\varepsilon_{t,l}\,x_{t,l}\delta_t^\top. \quad (19)$$
Elementary calculation shows that
$$\mathbb{E}(\|S_2\|_F^4) \le \frac{\sigma^4(d_x^2+2d_x)}{L^2T^2}.$$
Therefore,
$$P\Big(\|S_2\|_F \ge \frac{\sigma\sqrt{d_x+1}}{\sqrt{LT}}T^{1/30}\Big) \le \frac{1}{T^{2/15}}.$$
For $S_3$, let $G$ be the event
$$G = \Big\{\max_{1\le t\le T,\,1\le l\le L}|\varepsilon_{t,l}| \le \sqrt{3\log(TL)},\ \max_{1\le t\le T,\,1\le l\le L}\|x_{t,l}\|_2^2 \le 2d_x + 6\log(LT),\ \max_{1\le t\le T}\|\delta_t/h\|_2^2 \le 2d_a + 6\log T\Big\}.$$
Elementary calculation shows that $P(G^c) \le \frac{2}{T^3L^3} + \frac{1}{LT} + \frac{1}{T}$. Using the matrix Bernstein inequality (Tropp, 2012) on the event $G$, the operator norm of $S_3$ on $G$ is bounded as follows:
$$P\Big(\Big\{\Big\|\frac{LT}{\sigma}S_3\Big\|_2 \ge \alpha\Big\}\cap G\Big) \le (d_x+d_a)\exp\Big(\frac{-\alpha^2}{2\sigma_{S_3}^2 + 2D\alpha/3}\Big),$$
where
$$\sigma_{S_3}^2 \ge \max\Bigg\{\bigg\|\sum_{t=1}^T\mathbb{E}\Big(\sum_{l=1}^L\varepsilon_{t,l}x_{t,l}\delta_t^\top\Big)\Big(\sum_{l=1}^L\varepsilon_{t,l}x_{t,l}\delta_t^\top\Big)^\top\bigg\|_2,\ \bigg\|\sum_{t=1}^T\mathbb{E}\Big(\sum_{l=1}^L\varepsilon_{t,l}x_{t,l}\delta_t^\top\Big)^\top\Big(\sum_{l=1}^L\varepsilon_{t,l}x_{t,l}\delta_t^\top\Big)\bigg\|_2\Bigg\} \quad (25)$$
and
$$D = \max_t\sup_{G}\Big\|\sum_{l=1}^L -\varepsilon_{t,l}x_{t,l}\delta_t^\top\Big\|_2 \le 6Lh\sqrt{\log(TL)}\,\sqrt{(d_x+3\log(LT))(d_a+3\log T)}.$$
Elementary calculation shows that taking $\sigma_{S_3}^2 = h^2\lfloor T^{2/3}\rfloor L\max\{d_a,d_x\}$ satisfies Equation (25). Taking
$$\alpha = 2hT^{1/3}\log T\sqrt{L\max\{d_a,d_x\}\log(d_a+d_x)} + 8hL\log(TL)\sqrt{(d_x+3\log(LT))(d_a+3\log T)(\log(d_x+d_a)+2\log T)},$$
we obtain
$$P\Big(\Big\{\Big\|\frac{LT}{\sigma}S_3\Big\|_2 \ge \alpha\Big\}\cap G\Big) \le \frac{1}{T^2}.$$
Therefore,
$$P\Big(\|S_3\|_2 \le 2h\sigma T^{-2/3}\log T\sqrt{\tfrac{\max\{d_a,d_x\}\log(d_a+d_x)}{L}} + \tfrac{8h\sigma}{T}\log(TL)\sqrt{(d_x+3\log(LT))(d_a+3\log T)(\log(d_x+d_a)+2\log T)}\Big) \ge 1 - \frac{2}{L^3T^3} - \frac{1}{LT} - \frac{1}{T} - \frac{1}{T^2}.$$
Recalling that $|\langle\nabla\mathcal{L}_T(\Theta^*), \Delta\rangle| = |\langle S_2, \Delta\rangle + \langle S_3, \Delta\rangle| \le \|S_2\|_F\|\Delta\|_F + \|S_3\|_2\|\Delta\|_*$, we obtain the statement of the lemma.

A.2 PROOF OF LEMMA 1

Let $b_t = \sum_{l=1}^L x_{t,l}$, and let $\delta_t = 0$ for exploitation rounds. Then
$$\mathcal{E}_T(\Delta) = \frac{1}{2LT}\sum_{t=1}^T\sum_{l=1}^L(a_{t,l}^\top\Delta x_{t,l})^2 = \frac{1}{2LT}\sum_{t=1}^T\sum_{l=1}^L\Big(\Big(\frac{b_t^\top\widehat\Theta_{t-1}^\top}{\|b_t^\top\widehat\Theta_{t-1}^\top\|} + \delta_t^\top\Big)\Delta x_{t,l}\Big)^2. \quad (32)$$
Define
$$\mathcal{D}_T(\Delta) = \frac{1}{2LT}\sum_{t=1}^T\sum_{l=1}^L\Big[\Big(\frac{b_t^\top\widehat\Theta_{t-1}^\top}{\|b_t^\top\widehat\Theta_{t-1}^\top\|}\Delta x_{t,l}\Big)^2 + (\delta_t^\top\Delta x_{t,l})^2\Big],$$
$$\mathcal{D}_{1,T}(\Delta) = \frac{1}{2LT}\sum_{t=1}^T\sum_{l=1}^L\Big(\frac{b_t^\top\widehat\Theta_{t-1}^\top}{\|b_t^\top\widehat\Theta_{t-1}^\top\|}\Delta x_{t,l}\Big)^2, \qquad \mathcal{D}_{2,T}(\Delta) = \frac{1}{2LT}\sum_{t=1}^T\sum_{l=1}^L(\delta_t^\top\Delta x_{t,l})^2. \quad (33)$$
Then
$$\mathcal{E}_T(\Delta) - \mathcal{D}_T(\Delta) = \frac{1}{LT}\sum_{t=1}^T\sum_{l=1}^L\Big(\frac{b_t^\top\widehat\Theta_{t-1}^\top}{\|b_t^\top\widehat\Theta_{t-1}^\top\|}\Delta x_{t,l}\Big)(\delta_t^\top\Delta x_{t,l}).$$
Elementary calculation shows that $\mathbb{E}(\mathcal{E}_T(\Delta) - \mathcal{D}_T(\Delta)) = 0$ and $\mathbb{E}(\mathcal{D}_{2,T}(\Delta)) \ge \frac{\lfloor T^{2/3}\rfloor}{2T}h^2\|\Delta\|_F^2$. We now proceed to prove that the following two bounds hold with high probability:
$$\inf_{\|\Delta\|_2>0}\frac{\mathcal{E}_T(\Delta) - \mathcal{D}_T(\Delta)}{\|\Delta\|_2^2} \ge -7T^{-2/3}(h+h^2)(2d_x+2d_a+6\log T+6\log L)^2\log T,$$
$$\inf_{\|\Delta\|_2>0}\frac{\mathcal{D}_{2,T}(\Delta) - \mathbb{E}(\mathcal{D}_{2,T}(\Delta))}{\|\Delta\|_2^2} \ge -7T^{-2/3}(h+h^2)(2d_x+2d_a+6\log T+6\log L)^2\log T.$$
Note that $\|x_{t,l}\|_2^2 \sim \chi^2_{d_x}$ and $\|\delta_t/h\|_2^2 \sim \chi^2_{d_a}$. Therefore,
$$P\Big(\sup_{t,l}\|x_{t,l}\|_2^2 \le d_x + 2\epsilon_1 + 2\sqrt{\epsilon_1 d_x},\ \sup_t\|\delta_t/h\|_2^2 \le d_a + 2\epsilon_2 + 2\sqrt{\epsilon_2 d_a}\Big) \ge 1 - \big(LT\exp(-\epsilon_1) + T\exp(-\epsilon_2)\big).$$
Let $\epsilon_1 = 2\log(LT)$ and $\epsilon_2 = 2\log T$, and denote
$$U_1 = d_x + 2\epsilon_1 + 2\sqrt{\epsilon_1 d_x}, \qquad U_2 = d_a + 2\epsilon_2 + 2\sqrt{\epsilon_2 d_a}. \quad (39)$$
Let the event $O$ be $O = \{\sup_{t,l}\|x_{t,l}\|_2^2 \le U_1,\ \sup_t\|\delta_t/h\|_2^2 \le U_2\}$. In the following, we restrict our attention to the event $O$.
Note that
$$\inf_{U_0/1.1\le\|\Delta\|_2\le U_0}\big(\mathcal{E}_T(\Delta) - \mathcal{D}_T(\Delta)\big) \ge \inf_{U_0/1.1\le\|\Delta\|_2\le U_0,\ \widehat\Theta_{t-1}\ne 0\ \text{for}\ 1\le t\le T}\big(\mathcal{E}_T(\Delta) - \mathcal{D}_T(\Delta)\big),$$
that only $\lfloor T^{2/3}\rfloor$ terms in the sum of $\mathcal{E}_T(\Delta) - \mathcal{D}_T(\Delta)$ are nonzero, and that for any term in an exploration round,
$$\sup_{U_0/1.1\le\|\Delta\|_2\le U_0}\Big(\frac{b_t^\top\widehat\Theta_{t-1}^\top}{\|b_t^\top\widehat\Theta_{t-1}^\top\|}\Delta x_{t,l}\Big)(\delta_t^\top\Delta x_{t,l}) - \inf_{U_0/1.1\le\|\Delta\|_2\le U_0}\Big(\frac{b_t^\top\widehat\Theta_{t-1}^\top}{\|b_t^\top\widehat\Theta_{t-1}^\top\|}\Delta x_{t,l}\Big)(\delta_t^\top\Delta x_{t,l}) \le 2U_1\sqrt{U_2}\,hU_0^2.$$
Therefore, by the functional Hoeffding theorem (Theorem 3.26 in Wainwright (2019)), for $\gamma_1 > 0$,
$$P\big(\mathcal{E}_T(\Delta) - \mathcal{D}_T(\Delta) \le -\gamma_1 \,\big|\, O\big) \le \exp\Big(-\frac{T^2}{\lfloor T^{2/3}\rfloor}\cdot\frac{\gamma_1^2}{16U_1^2U_2h^2U_0^4}\Big).$$
Similarly, for the exploration rounds in $\mathcal{D}_{2,T}(\Delta)$, we have
$$\sup_{U_0/1.1\le\|\Delta\|\le U_0}(\delta_t^\top\Delta x_{t,l})^2 - \inf_{U_0/1.1\le\|\Delta\|\le U_0}(\delta_t^\top\Delta x_{t,l})^2 \le U_1U_2U_0^2h^2.$$
Again by the functional Hoeffding theorem,
$$P\big(\mathcal{D}_{2,T}(\Delta) - \mathbb{E}(\mathcal{D}_{2,T}(\Delta)) \le -\gamma_2 \,\big|\, O\big) \le \exp\Big(-\frac{T^2}{\lfloor T^{2/3}\rfloor}\cdot\frac{\gamma_2^2}{4U_1^2U_2^2U_0^4h^4}\Big).$$
Take $\gamma_1 = \gamma_2 = 7T^{-2/3}(h+h^2)(2d_x+2d_a+6\log T+6\log L)^2\|\Delta\|_2^2\log T$. Therefore,
$$P\Big(\mathcal{E}_T(\Delta) - \mathbb{E}(\mathcal{D}_{2,T}) \le -14T^{-2/3}(h+h^2)(2d_x+2d_a+6\log T+6\log L)^2\|\Delta\|_2^2\log T\Big) \le P\Big(\mathcal{E}_T(\Delta) - \mathcal{D}_T(\Delta) \le -7T^{-2/3}(h+h^2)(2d_x+2d_a+6\log T+6\log L)^2\|\Delta\|_2^2\log T \,\Big|\, O\Big) + P\big(\mathcal{D}_{1,T}(\Delta) \le 0 \,\big|\, O\big) + P\Big(\mathcal{D}_{2,T} - \mathbb{E}(\mathcal{D}_{2,T}) \le -7T^{-2/3}(h+h^2)(2d_x+2d_a+6\log T+6\log L)^2\|\Delta\|_2^2\log T \,\Big|\, O\Big).$$

APPENDIX B DETAILS ON THE SIMULATION STUDY

In this section, we detail the tuning parameters of each algorithm used in the simulation study. Table 1 examines the loadings for the arm on May 29th 2022, the last Sunday in our data (i.e., the leading left singular vector multiplied by $\langle v_1, \bar x\rangle$, where $\bar x$ is the average of $x_j$ for $j = 1, \dots, L$ on May 29th 2022). Specifically, we investigate the effect of the flavors on the reward given the context. We take the average of the loadings of the linear and quadratic terms for each flavor over all 30 products and compare it with the total sales of each flavor across all Sundays in May. For ease of comparison, we further scale the sales and the loadings by their corresponding largest values. The loadings and sales are closely related to each other.¹ As Table 1 shows, on May 29th 2022, flavor 1 (F1) has the largest effect, followed by flavors 10, 13, 7, 9 and 11. Our model therefore learns the (per-unit) values of the flavors.

                 F1    F2    F3    F4    F5    F6    F7    F8    F9    F10   F11   F12   F13
Sales            1.00  0.05  0.00  0.00  0.00  0.03  0.19  0.00  0.08  0.19  0.18  0.00  0.38
ũ1 (linear)      1.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
ũ1 (quadratic)   1.00  0.12 -0.00  0.00  0.00  0.03  0.19  0.00  0.15  0.39  0.16  0.03  0.33

Table 1: Total sales and loadings of the linear and quadratic terms (scaled) of the 13 flavors.

More on the simulation with additional numerical results. We first detail how we ran the simulation and then provide more simulation results. Specifically, we first use $t_1 = 100$ rounds for the initialization step to estimate $\widehat\Theta_{t_1}$; then at each time $t = t_1+1, \dots, T$, we follow Algorithm 1 to decide on the action $a_t$ for assortment and pricing. After determining $a_t$, we generate the sales $r_t$ according to (1) using the pseudo-true $\Theta$ and $\sigma$. We further compare the performance of the assortment-pricing policy with and without exploration and with different initialization times $t_1$. Each setup is simulated 100 times.
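The loading computation described above, i.e., the leading left singular vector $u_1$ scaled by $\langle v_1, \bar x\rangle$, can be sketched as follows (the function name is ours; note that the sign ambiguity of an SVD cancels here, because $u_1$ and $v_1$ flip sign together):

```python
import numpy as np

def arm_loadings(Theta, x_bar):
    # Rank-1 view of the reward: a' Theta x_bar ~= s1 * <u1, a> * <v1, x_bar>,
    # so <v1, x_bar> * u1 gives the (unscaled) effect of each arm coordinate.
    U, s, Vt = np.linalg.svd(Theta, full_matrices=False)
    u1, v1 = U[:, 0], Vt[0]
    return np.dot(v1, x_bar) * u1
```

Since the loadings are later rescaled by their largest entry for Table 1, the overall scale (including the factor $s_1$) is immaterial.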
Figures 6a-6b show the cumulative regret and Figure 6d shows the percentage gain in cumulative sales for $t_1 = 20, 50, 100$, with and without exploration. Hi-CCAB with exploration performs better than without exploration. As expected, a longer initialization step provides a better initial estimate of $\Theta$ and thus helps the performance over a short time window. As time goes by, the time-averaged cumulative regrets all converge to zero and the percentage gains in cumulative sales should converge.



¹The correlation between sales and the linear-term loadings is 0.91, and that between sales and the quadratic-term loadings is 0.97.





Figure 1: Cumulative regret under the non-sparse (first row) and sparse (second row) settings.

Figure 2: Performance of Hi-CCAB compared with real actions over 100 simulations. The boundaries of the shaded areas are the 5th and 95th quantiles.


Total sales by product in 1k RMB.

Number of products with various package sizes.

Figure 4: Real sales vs simulated sales.

Figure 5: Loadings of the leading right singular vectors for the covariates.


APPENDIX C MORE DETAILS ON THE CASE STUDY AND ADDITIONAL NUMERICAL RESULTS

In this section, we provide more background information on the case study and additional interpretations of the representation matrix $\Theta$, together with further numerical results. Figure 3a shows the daily sales by product, where each color represents one product (only products that appeared on more than 95% of the days are colored; the rest are colored grey). The days corresponding to the vertical dashed grey lines are days with promotions, and the two red vertical lines correspond to the annual sales events. The variation between products was large, and one product dominated the rest most of the time. The sales were also driven by promotions: sales went up when there was a promotion. Figure 3b shows the median unit price over time, with the 25th and 75th quantiles as the boundaries of the grey area. The median unit price was around 3.2 RMB, and there was variation in unit price among products. Figure 3c shows the numbers of single-flavor and multi-flavor products; three-quarters of the products were single-flavored. Note that products with the same flavor can have different package sizes. Figure 3d shows the number of products with different package sizes: about 60% of the products have a package size larger than 20, 30% have package sizes between 10 and 20, and the rest have fewer than 10. Figure 4 compares the real sales with the simulated sales based on model (1) using the pseudo ground truth $\Theta$ and $\sigma$ estimated from all the data; i.e., the teal line is $\sum_{j=1}^L r_{t,j}$, where $r_{t,j}$ is generated by $r_{t,j} = a_t^\top\Theta x_{t,j} + \varepsilon_{t,j}$, $\varepsilon_{t,j} \sim N(0, \sigma^2)$, with $a_t$ and $x_{t,j}$ taken from the real data. As shown in Figure 4, the real and simulated sales track each other closely over time, which indicates that both our model and our estimation are reasonable.

Structure of the representation matrix $\Theta$. One advantage of our model is its interpretability, which allows us to gain insights from the representation matrix $\Theta$.
Specifically, our model is able to discover the underlying factors of the effects of both the arms and the covariates on the reward. In the following, we examine the pseudo ground truth $\Theta$ obtained using all the data. The rank of $\Theta$ is 5, with singular values $(2.5, 0.3, 0.2, 0.02, 0.002)$. The leading singular value dominates the rest, so the leading left and right singular vectors are the most important in explaining the effect on the reward, and we focus on them in what follows. Figure 5 shows the loadings of the different covariates (i.e., the leading right singular vector), and our algorithm learns interpretable patterns of the effects on the reward: for weekdays, the effects during the week differ drastically from those during the weekend; for months, the effects show different patterns in the promotion months (June and November) than in other months; for locations, the effects for the coastal provinces differ from the rest, which corresponds exactly to the levels of economic development of the different regions in China. In sum, our model can exploit the underlying structure of the covariates and provide insights into purchasing behavior and seasonality.
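The sanity-check simulation behind Figure 4, regenerating sales from model (1) with the pseudo-true $\Theta$ and $\sigma$, amounts to one matrix-vector product per day; a minimal sketch (function and argument names are ours):

```python
import numpy as np

def simulate_sales(Theta, a_t, X_t, sigma, rng):
    # r_{t,j} = a_t' Theta x_{t,j} + eps_{t,j}, eps ~ N(0, sigma^2), per model (1).
    # X_t holds the L location covariates for day t as rows, shape (L, d_x).
    eps = rng.normal(scale=sigma, size=X_t.shape[0])
    return X_t @ (Theta.T @ a_t) + eps
```

Summing the returned vector over the $L$ locations gives one point of the simulated daily-total curve in Figure 4.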

