HIGH-DIMENSIONAL CONTINUUM ARMED AND HIGH-DIMENSIONAL CONTEXTUAL BANDIT: WITH APPLICATIONS TO ASSORTMENT AND PRICING

Abstract

The bandit problem with high-dimensional continuum arms and high-dimensional contextual covariates is often faced by decision-makers but remains unsolved. Recent developments in contextual bandit problems focus on the setting where the number of arms is small, and the resulting methods are impracticable for high-dimensional continuous arm spaces. To bridge this gap, we propose a novel model for the high-dimensional continuum armed and high-dimensional contextual bandit problem that captures the effect of the arm and the covariates on the reward via a low-rank representation matrix. The representation matrix offers both interpretability and predictive power. We further propose an efficient bandit algorithm based on a low-rank matrix estimator, with theoretical justifications. The generality of our model allows a wide range of applications, including business and healthcare. In particular, we apply our method to assortment and pricing, both of which are important decisions for firms such as online retailers. Our method solves the assortment and pricing problems simultaneously, whereas most existing methods address them separately. We demonstrate the effectiveness of our method in jointly optimizing assortment and pricing for revenue maximization at a giant online retailer.

1. INTRODUCTION

The bandit problem dates back to Robbins (1952), who formulated it as the sequential design of experiments, and it has been studied extensively in recent years owing to the demand for online decision-making, especially in e-commerce and health care. A decision-maker chooses an action (arm) at each round and observes a reward; the goal is to act strategically so as to find an optimal action that maximizes the long-term reward without sacrificing too much reward while learning. The bandit literature mostly focuses on problems with a finite number of independent arms, but often there are infinitely many arms that share some common structure and can thus be indexed by variables, which gives a continuum armed bandit problem. In e-commerce, the retailer needs to decide the product assortment and pricing to maximize long-term profits; in mobile health, a personal device provides exercise and dietary suggestions to improve physical and mental health. The possible actions in both examples can be parameterized as continuous variables, which are possibly high-dimensional. In addition, decision-makers often observe other covariates (features); this is the contextual bandit problem, where the reward is modeled as a function of unknown parameters and the contextual variables, and in many practical settings the covariates are high-dimensional. As the dimensionalities of the action space (for arms) and of the contextual variables grow, traditional bandit algorithms suffer from the curse of dimensionality, and learning the optimal decision becomes impossible or prohibitively costly.
Although both the arms and the contextual variables are high-dimensional, the dimension of the underlying factors is, fortunately, often small. For high-dimensional bandit problems, one can assume a low-dimensional structure on the unknown parameters, as in the LASSO bandit (Bastani & Bayati, 2020); for high-dimensional continuum armed bandit problems, one can assume that the reward function depends only on a low-dimensional subspace of the action space (Tyagi et al., 2016). While low-dimensional representations have been successfully adopted in high-dimensional bandit problems and in high-dimensional continuum armed bandit problems separately, a natural but important question remains open: can we efficiently solve the bandit problem with both high-dimensional continuum arms and high-dimensional contextual variables simultaneously?

