HIGH-DIMENSIONAL CONTINUUM ARMED AND HIGH-DIMENSIONAL CONTEXTUAL BANDIT: WITH APPLICATIONS TO ASSORTMENT AND PRICING

Abstract

The bandit problem with high-dimensional continuum arms and high-dimensional contextual covariates is often faced by decision-makers but remains unsolved. Recent developments in contextual bandit problems focus on settings where the number of arms is small and are impracticable for high-dimensional continuous arm spaces. To bridge the gap, we propose a novel model for the high-dimensional continuum armed and high-dimensional contextual bandit problem that captures the effect of the arm and the covariates on the reward via a low-rank representation matrix. The representation matrix is endowed with interpretability and predictive power. We further propose an efficient bandit algorithm based on a low-rank matrix estimator, with theoretical justifications. The generality of our model allows wide applications, including business and healthcare. In particular, we apply our method to assortment and pricing, both of which are important decisions for firms such as online retailers. Our method solves the assortment and pricing problems simultaneously, whereas most existing methods address them separately. We demonstrate the effectiveness of our method in jointly optimizing assortment and pricing for revenue maximization for a giant online retailer.

1. INTRODUCTION

The bandit problem dates back to Robbins (1952), who formulated it as the sequential design of experiments, and has been studied extensively in recent years due to the demand for online decision-making, especially in e-commerce and health care. A decision-maker chooses an action (arm) at each round and observes a reward; the goal is to act strategically so as to find an optimal action that maximizes the long-term reward without sacrificing too much along the way. The bandit literature mostly focuses on problems with a finite number of independent arms, but it is often the case that there are infinitely many arms that share some common structure and can thus be indexed by continuous variables, giving rise to the continuum armed bandit problem. In e-commerce, the retailer needs to decide the product assortment and pricing to maximize long-term profits; in mobile health, a personal device provides exercise and dietary suggestions to improve physical and mental health. The possible actions in both examples can be parameterized as continuous variables, which are possibly high-dimensional. In addition, decision-makers observe other covariates/features, i.e., the contextual bandit problem, where the reward is modeled as a function of unknown parameters and the contextual variables; in many practical settings, these covariates are high-dimensional. As the dimensionalities of the action space (for arms) and the contextual variables grow, traditional bandit algorithms suffer from the curse of dimensionality, and it becomes impossible or prohibitively costly to learn the optimal decision. Although both the arms and the contextual covariates are high-dimensional, the dimension of the underlying factors is often, fortunately, small: for high-dimensional bandit problems, one can assume a low-dimensional structure on the unknown parameters, such as in the LASSO bandit (Bastani & Bayati, 2020); and for high-dimensional continuum armed bandit problems, one can assume the reward function depends only on a low-dimensional subspace of the action space (Tyagi et al., 2016). While low-dimensional representations have been successfully adopted in high-dimensional bandit problems and high-dimensional continuum armed bandit problems respectively, a natural but important question remains open: can we efficiently solve the bandit problem with both high-dimensional continuum arms and high-dimensional contextual variables simultaneously?

In this paper, we tackle the above problem by proposing a novel model that captures the effect of the arm and the contextual covariates via an approximately low-rank matrix representation, together with an efficient algorithm (Hi-CCAB) that solves the problem with theoretical justifications. Specifically, for an action represented as a vector $a \in \mathbb{R}^{d_a}$ and the corresponding contextual covariates $x \in \mathbb{R}^{d_x}$, the reward is $r = a^\top \Theta x + \varepsilon$, where $\Theta \in \mathbb{R}^{d_a \times d_x}$ is the unknown representation matrix, assumed to have rank $d \ll \min\{d_a, d_x\}$, and $\varepsilon$ is an independent error. To learn the low-rank representation matrix, we adapt the low-rank matrix estimator to the bandit setting. We further demonstrate the benefits of our methodology in e-commerce with real sales data, where the online retailer needs to decide on product assortment and pricing jointly. The generality of our model makes it possible to learn a policy on product assortment and pricing at the same time, while previous literature mostly studies the assortment and pricing problems separately.
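To make the model concrete, the following is a minimal illustrative sketch of the bilinear reward model $r = a^\top \Theta x + \varepsilon$ with a rank-$d$ representation matrix; the dimensions, noise level, and random design below are illustrative assumptions rather than the settings used in our experiments.

```python
# A minimal illustrative sketch of the bilinear reward model r = a^T Theta x + eps
# with a rank-d representation matrix Theta; dimensions, noise level, and the
# random design below are assumptions made purely for illustration.
import numpy as np

rng = np.random.default_rng(0)
d_a, d_x, d = 50, 80, 3                       # arm dim, covariate dim, latent rank

# Rank-d representation matrix: Theta = U V^T with U in R^{d_a x d}, V in R^{d_x x d}.
U = rng.normal(size=(d_a, d))
V = rng.normal(size=(d_x, d))
Theta = U @ V.T

def reward(a, x, noise_sd=0.1):
    """Observed reward for arm a and covariates x under the low-rank bilinear model."""
    return a @ Theta @ x + rng.normal(scale=noise_sd)

a = rng.normal(size=d_a)                      # a high-dimensional continuum arm
x = rng.normal(size=d_x)                      # high-dimensional contextual covariates
print(reward(a, x))
```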
Contributions. We highlight the following contributions of our paper:

1. We propose a new model for the high-dimensional continuum armed and high-dimensional contextual bandit problem, which decision-makers often face but which very little existing literature attempts to solve. The crux of our model is the low-rank representation matrix that exploits the low-dimensional structure of both the high-dimensional arms and the high-dimensional covariates. Our model unifies a large class of bandit models.

2. The low-rank representation matrix is endowed with interpretability and predictive power. One can perform singular value decomposition (SVD) on the representation matrix: the left singular vectors reveal the latent structure and relationships among the arms, while the right singular vectors show the latent factors of the covariates. In other words, our model implicitly performs principal component analysis (PCA) on the effect of arms and covariates on the mean reward. On the other hand, given the covariates, our model is able to predict the reward of an unseen arm. Both interpretability and predictive power can be tremendously useful for decision-makers.

3. We propose an efficient algorithm for the High-dimensional Contextual and High-dimensional Continuum Armed Bandit (Hi-CCAB) by adopting the low-rank matrix estimator; an illustrative sketch of such a loop appears after the literature review below. We further provide an upper bound for the convergence rate of Hi-CCAB in terms of the time-averaged expected cumulative regret.

4. The generality of our model allows for a wide range of applications. Specifically, we apply Hi-CCAB to the joint assortment and pricing problem. We show that our model reveals insights for product design, assortment, and pricing, and that the assortment-pricing policy based on Hi-CCAB yields sales four times as high as the original strategy.

Literature review. Literature on high-dimensional bandit problems has been expanding recently, especially as statistical tools for high-dimensional problems have matured (Negahban & Wainwright, 2011; Wainwright, 2019). Much of the high-dimensional bandit literature focuses on contextual bandits with high-dimensional covariates, such as the LASSO bandit problem (Abbasi-Yadkori et al., 2012; Kim & Paik, 2019; Bastani & Bayati, 2020; Hao et al., 2020; Papini et al., 2021), where the mean reward is assumed to be a linear function of a sparse unknown parameter vector, the low-rank matrix bandit, where the covariate and the unknown parameter are both of matrix form (Kveton et al., 2017; Lu et al., 2021), and other non-parametric methods that learn the reward function using random forests or deep learning (Féraud et al., 2016; Zhou et al., 2020; Ban et al., 2022; Chen et al., 2022; Xu et al., 2022). These high-dimensional bandit models are special cases of our model. Another stream of high-dimensional bandit literature studies representation learning in linear bandits, specifically for multi-task learning, where several bandits are played concurrently. The arms for each task are embedded in the same space and share a common low-dimensional representation (Lale et al., 2019; Yang et al., 2020; Hu et al., 2021; Xu & Bastani, 2021). Our problem is different from multi-task learning since at each time we have only one bandit and thus observe one reward, whereas in the multi-task bandit problem multiple bandits are played at the same time.
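As an illustration of contributions 2 and 3 above, the hedged sketch below pairs a nuclear-norm-regularized (low-rank) estimate of $\Theta$ with a simple epsilon-greedy loop over a finite candidate set of arms, and then reads off the latent arm and covariate factors via an SVD of the estimate. The helper names (`svt`, `fit_low_rank`), the candidate-arm construction, the exploration rate, the regularization level, and the refit schedule are all illustrative assumptions, not the exact Hi-CCAB algorithm or its tuning.

```python
# A hedged sketch pairing a nuclear-norm-regularized (low-rank) estimate of Theta with
# an epsilon-greedy bandit loop over a finite candidate arm set, followed by an SVD of
# the estimate to read off latent arm/covariate factors. The exploration scheme,
# candidate arms, regularization, and refit schedule are illustrative assumptions,
# not the exact Hi-CCAB algorithm.
import numpy as np

def svt(M, tau):
    """Singular value thresholding: prox operator of tau * (nuclear norm)."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def fit_low_rank(A, X, r, lam=0.1, eta=0.005, iters=300):
    """Proximal gradient on (1/2n) sum_t (r_t - a_t^T Theta x_t)^2 + lam * ||Theta||_*."""
    n, d_a = A.shape
    d_x = X.shape[1]
    Theta = np.zeros((d_a, d_x))
    for _ in range(iters):
        resid = np.einsum("ti,ij,tj->t", A, Theta, X) - r     # a_t^T Theta x_t - r_t
        grad = np.einsum("t,ti,tj->ij", resid, A, X) / n
        Theta = svt(Theta - eta * grad, eta * lam)
    return Theta

rng = np.random.default_rng(1)
d_a, d_x, d, T = 30, 40, 3, 500
Theta_star = rng.normal(size=(d_a, d)) @ rng.normal(size=(d, d_x))  # true rank-d matrix
candidates = rng.normal(size=(100, d_a))                            # candidate arms

A_hist, X_hist, r_hist = [], [], []
Theta_hat = np.zeros((d_a, d_x))
for t in range(T):
    x = rng.normal(size=d_x)                                        # observed covariates
    if t < 50 or rng.random() < 0.1:                                # explore
        a = candidates[rng.integers(len(candidates))]
    else:                                                           # exploit current estimate
        a = candidates[np.argmax(candidates @ Theta_hat @ x)]
    r = a @ Theta_star @ x + rng.normal(scale=0.1)                  # observed reward
    A_hist.append(a); X_hist.append(x); r_hist.append(r)
    if (t + 1) % 50 == 0:                                           # periodic refit
        Theta_hat = fit_low_rank(np.array(A_hist), np.array(X_hist), np.array(r_hist))

# Interpretation: left singular vectors = latent arm factors,
# right singular vectors = latent covariate factors.
U_hat, s_hat, Vt_hat = np.linalg.svd(Theta_hat)
print("leading singular values of the estimate:", np.round(s_hat[:5], 2))
```

In a fully continuum-armed setting, the exploit step would instead optimize $a^\top \hat{\Theta} x$ over the feasible arm space rather than over a fixed candidate set.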
For continuum armed bandits, there exists a thread of literature that assumes the mean reward function is smooth and continuous on the action space in some sense, e.g., the function lies in a Lipschitz or Hölder space (Agrawal, 1995; Kleinberg, 2004; Kleinberg et al., 2019). Most work discretizes the arm space or adopts non-parametric regression to estimate the reward function, which is very different from our approach. Recent literature studies continuum armed bandits with

