SCALED NEURAL MULTIPLICATIVE MODEL FOR TRACTABLE OPTIMIZATION

Abstract

Challenging decision problems in retail and beyond are often solved using the predict-then-optimize paradigm. An initial effort to develop and parameterize a model of an uncertain environment is followed by a separate effort to identify the best possible solution of an optimization problem. Linear models are often used to ensure optimization problems are tractable. Remarkably accurate Deep Neural Network (DNN) models have recently been developed for various prediction tasks. Such models have been shown to scale to large datasets without loss of accuracy and with good computational performance. It can, however, be challenging to formulate tractable optimization problems based on DNN models. In this work we consider the problem of shelf space allocation for retail stores using DNN models. We highlight the trade-off between predictive performance and the tractability of optimization problems. We introduce a Scaled Neural Multiplicative Model (SNMM) with shape constraints for demand learning that leads to a tractable optimization formulation. Although this work focuses on a specific application, the formulation of the models is general enough to extend to many real-world applications.

1. INTRODUCTION

The predict-then-optimize framework is ubiquitous in applied research. A predictive model is first developed to approximate the true dynamics of a system under consideration to a given level of accuracy. A mathematical programming formulation based on the predictive model is then used to help researchers identify optimal policies for challenging real-world decision problems. Although predictive models estimated from historical data need not be causal, recent works shed some light on the principles behind this approach (Bertsimas & Kallus, 2020). An alternative to the predict-then-optimize framework is integrating the two stages, predicting and optimizing at the same time, by utilizing a specific loss function in prediction models (Elmachtoub & Grigas, 2022). For example, model-free reinforcement learning (RL) approaches explore their environment while also exploiting it to make optimal decisions. Big-box retailers have thousands of stores located across the country. They use data-driven decisions to optimize operations, e.g., assortment planning, pricing, and supply chain optimization. They work on problems related to shelf space allocation, making the best use of limited space in stores. Space planning for apparel is particularly challenging, due to the different shapes and sizes of the merchandise and the temporal shifts in brand importance. The problem that motivated our work in this area involves deciding how much space, in terms of fixtures, to assign to each of several brands within a category of products for sale. These problems are typically solved at the department or store level. Products are arranged into departments and categories; for example, 'activewear tops' is a category within the women's apparel department. In this work, we introduce a novel approach based on the predict-then-optimize framework for solving a shelf space allocation problem.
In particular, our contributions are as follows:
• We begin by identifying certain key characteristics of relevant real-world data that make modeling challenging.
• We discuss predictive model selection considering the tractability of the resulting optimization problem formulations, with a focus on the convexity of the formulations.
• We propose multiplicative models and establish conditions for tractability of the optimization. We also discuss why the family of linear models is not suitable for the given application.
• We hypothesize that it is possible to convert an intractable optimization into a tractable one without loss of prediction accuracy. We achieve this via a Scaled Neural Multiplicative Model (SNMM) and demonstrate that our proposed model performs well relative to alternative models.

2. RELATED WORK

The relationship between the shelf space allocated to a product and that product's sales has been studied extensively (Bianchi-Aguiar et al., 2021; Hübner & Kuhn, 2012; Karampatsa et al., 2017). Analyses often focus on incremental returns, estimating the space elasticity. Under the assumption of diminishing returns to scale, the relationship between shelf space and sales can be modeled using a concave function (Curhan, 1972; Eisend, 2014). The assumption of diminishing returns makes intuitive sense and has been widely used in production functions (Gopalswamy & Uzsoy, 2019; Aigner & Chu, 1968); the incremental gain in sales from adding more space to showcase a product decreases as the space allocated increases. This assumption can also have the side benefit of making shelf space allocation optimization problems easier to solve. Optimization: There have been a number of scientific articles focused on the assortment optimization problem, a problem related to the one we consider in this work. Retailers solve this problem to determine which products to sell; shelf space in stores is often the most prominent constraint. Kök & Fisher (2007), Yücel et al. (2009), Lo (2019), and numerous others focus on developing optimization frameworks that include complicated consumer choice models. This allows the authors to select the optimal product mix accounting for the fact that some consumers will substitute one product for another depending on what is and is not sold by a retailer. Our problem is somewhat different in that we are not selecting specific products to sell, but rather deciding which brands to carry and how much space to allocate to these brands. These brands will offer products that are relatively unique within a store. Hübner & Kuhn (2011) point out the relationships between space allocated, consumer demand, and inventory costs.
Assuming a fixed amount of space available, a retailer can choose to offer fewer facings of a more diverse set of products to increase consumer interest. This will, however, also increase inventory holding and replenishment costs, and place increasing demands on store labor. Van Ryzin & Mahajan (1999) came to a similar conclusion earlier, looking specifically at apparel. Our problem is, again, somewhat different. Brand managers are responsible for selecting the product mix within brands sold in specific stores, managing inventory costs and the tradeoff between such costs and expected sales or profits. Shape-Constrained Models: An integral part of our prediction model that makes the optimization tractable is shape constraints. These are models that are constrained to be monotonic, convex, concave or non-negative, among other properties. Shape constraints provide effective regularization, reducing the chance that noisy training data or adversarial examples produce a model that does not behave as expected. Such constraints impose strong priors on the data and can be used effectively to produce well-behaved structured prediction models. It is this property that we enforce in our prediction model to yield a tractable optimization. Shape constraints on neural networks have been studied in different applications (Gupta et al., 2018). Most of the prior works consider GAMs (Generalized Additive Models) and neural networks separately. In our work, we combine the strength of neural networks with the simplicity of GAMs and propose a Scaled Neural Multiplicative Model that can model concave, convex or monotone constraints effectively. Differentiable Optimization: Recent works have introduced classes of deep learning models where certain layers involve solving optimization problems (Donti et al., 2017; Agrawal et al., 2019; Wilder et al., 2019). In these differentiable optimization architectures, backpropagation is based on a loss function that reflects the decision problem(s).
The goal is to directly learn a policy that performs well on the optimization problems under the true distributions of uncertain parameters. Donti et al. (2017) demonstrate that this approach can outperform "both traditional modeling and purely blackbox policy optimization approaches" on sample stochastic optimization problems. 

3. PRELIMINARIES

Given a closed set X and a function F_θ parameterized by θ, the decision-making problem is to find the solution to the optimization problem

x* = arg max_{x ∈ X} F_θ(x)    (1)

The complexity of the underlying optimization model depends on the form of the objective function F_θ and the definition of the constraint set X. The problem is generally tractable only for objectives that are concave with respect to x (including affine functions). Although more generic functional forms can be incorporated using Mixed Integer Programming (MIP) with piecewise linear approximations (Vielma et al., 2010; Gopalswamy et al., 2019), the solution time for MIPs in general does not scale well for large problems. By tractability, we refer to the class of optimization problems that can be solved in polynomial time. A linear functional form for the objective function leads to a relatively simple convex (linear) optimization problem if the set X is a convex polytope. In this case, the prediction model must be linear with respect to the optimization variable x. The model parameters c are coefficients in a linear function of x, while the remaining features P \ x in the prediction model have no restrictions; the predictive model for them can be a generic neural network, for instance. Note that features other than x are then irrelevant to the decision problem:

F_θ(x) = c' x + φ(P \ x)    (2)

A multiplicative model form can be a better alternative to a linear model form for two reasons: (i) it can handle heterogeneity in variance with respect to the features in X, and (ii) other features in P can play a role in determining the optimal solution to equation 1. A general multiplicative model is given by

F_θ(x) = Π_i x_i^{β_i} φ(P \ x_i)

If Σ_i β_i = 1 with β_i ≥ 0, and φ(P \ x_i) ∈ R_+, then the objective is concave and the resulting maximization is a convex optimization problem. Such conditions can be difficult to enforce on a general neural network model without explicit architectural design (Amos et al., 2017).
Further, such models can be difficult to interpret and communicate to business stakeholders. To alleviate these problems, we consider the following multiplicative form

F_θ(x) = Π_i x_i^{β_i} Π_{p ∈ P\x_i} φ_p(w_i^p)    (3)

where w_i^p is the feature p that depends on the index i of the optimization variable. For example, variable x_i could represent space for brand i, whereas w_i^p could be a description of brand i embedded using a language model. Details on how to constrain equation 3 to enforce convexity (concavity) are discussed in the next section.
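As a quick numerical sanity check (a sketch, not part of the paper's pipeline), the weighted geometric-mean form above, with non-negative exponents summing to one, is concave on the positive orthant, so maximizing it is a convex optimization problem. The snippet below spot-checks midpoint concavity on random points:

```python
import numpy as np

rng = np.random.default_rng(0)

def multiplicative(x, beta):
    # F(x) = prod_i x_i**beta_i, with the positive phi terms folded into a constant.
    return np.prod(x ** beta)

# Exponents on the simplex: beta_i >= 0 and sum_i beta_i = 1.
beta = rng.random(4)
beta /= beta.sum()

# Midpoint concavity, F((x + y)/2) >= (F(x) + F(y))/2, must hold for a
# concave function; we spot-check it on random positive points.
for _ in range(1000):
    x = rng.random(4) + 0.1
    y = rng.random(4) + 0.1
    mid = multiplicative((x + y) / 2, beta)
    assert mid >= (multiplicative(x, beta) + multiplicative(y, beta)) / 2 - 1e-12
```

The check passes for any exponents on the simplex; it fails readily if some β_i is allowed to exceed 1.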

4. PROBLEM FORMULATION

We discuss a specific instance of the optimization problem based on the predictive model F_θ that captures the relationship between sales and features. The variable of interest in the optimization, fixture count, defines the shelf space in a store. This work specifically considers the problem of finding the optimal space allocation for each brand-category pair for a given department in each store to maximize revenue. The tractability of the above formulation depends on the set X and the parametric function F_θ. In this work we will consider a family of functions from Generalized Linear Models (GLM), such as additive and multiplicative models, as well as DNNs that extend GAMs (Agarwal et al., 2021). We will analyze the models from the perspective of flexibility and convexity with respect to the fixture count feature. We approach this problem in two stages:
• Learn the functional form of F_θ(x) through demand learning.
• Use this relationship to optimize space across all stores.

4.1. DATA SPARSITY

A major challenge in pushing research from academia to application rests with data quality. It is nontrivial to achieve the appropriate functional form or relationship that captures the expected model behavior without loss in performance. In the retail space in particular, we have found that the domains of the functions we would like to use are not fully observed in the historical data. This is due to the challenges of experimentation in physical retail stores at the granularity needed for making decisions. For example, in apparel there are a large number of items sold, while the sales traffic at each item level is very sparse. Ideally, one would be interested in estimating the space elasticity of item i for time period t. This estimation problem is challenging due to sparsity in the data and non-dynamic space allocation in retail stores (i.e., space allocation for an item or category usually stays the same for a quarter to offset the high labor costs involved). We consider models that have a single coefficient for fixture count. Further, the category and brand have multiplicative effects in the models we consider. This way, we are able to estimate the space elasticity using data across different products and also estimate the total effect for any given brand and category.

4.2. GENERALIZED LINEAR MODELS

Linear Fixed Effects Model: A baseline approach involves fitting a linear regression model where y, the dollar sales amount, is a linear combination of the features x and d_i. Here, x denotes the fixture count variable and the d_i denote other features like brand, category and department.

y = β_0 + β_1 x + Σ_i ξ_i d_i    (4)

When used as an objective function with x as the decision variable, the linear model has constant space elasticity across all brand-category pairs, and the effect of features other than fixture count becomes irrelevant to the optimization.
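A tiny illustration of this last point, with toy (not fitted) coefficients: because the brand effects enter additively, they cancel out of any comparison between feasible allocations, so the optimizer cannot use them to differentiate brands.

```python
import numpy as np

# Toy linear model per brand b: sales_b = beta0 + beta1 * x_b + xi_b,
# with illustrative (not fitted) coefficients.
beta0, beta1 = 1.0, 0.7
xi = np.array([0.1, 2.0, -0.5])   # brand fixed effects

def total_sales(x):
    return np.sum(beta0 + beta1 * x + xi)

# With a single shared elasticity beta1, any split of the same space budget
# gives the same total: the brand effects xi are additive constants in x.
alloc_a = np.array([10.0, 0.0, 0.0])
alloc_b = np.array([0.0, 0.0, 10.0])
assert abs(total_sales(alloc_a) - total_sales(alloc_b)) < 1e-12
```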

Log-Log Model:

A log-log model ensures that the effect of fixture count on sales is positive. Moreover, unlike in a simple linear model, the coefficients of a log-log model have a multiplicative effect on predicted sales. This multiplicative effect ensures that features like brand, category and department interact directly with the fixture count variable. These variations across brand, category and department, learned from the historical data, allow the optimization model to identify the right weights associated with each fixture count variable and assign space accordingly.

log y = β_0 + β_1 log x + Σ_i ξ_i d_i    (5)
y = e^{β_0} x^{β_1} e^{Σ_i ξ_i d_i}
y = F(d) x^{β_1}

The log-log model is convex when β_1 > 1 or β_1 < 0. In our motivating example, these would be the cases where allocating additional space to a brand would either reduce sales or increase sales at a faster rate than before (increasing returns to scale). Neither case seems realistic.
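To make the log-log model concrete, here is a minimal sketch, on synthetic data with made-up coefficients, of estimating the space elasticity β_1 by ordinary least squares in log space:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

# Synthetic data from y = F(d) * x**beta1 with a true elasticity beta1 = 0.4
# and a single binary brand indicator d (all values are illustrative).
x = rng.uniform(1.0, 20.0, n)              # fixture counts
d = rng.integers(0, 2, n).astype(float)    # brand indicator
y = np.exp(0.5 + 0.8 * d) * x ** 0.4 * np.exp(rng.normal(0.0, 0.05, n))

# OLS in log space: log y = beta0 + beta1 * log x + xi * d.
A = np.column_stack([np.ones(n), np.log(x), d])
coef, *_ = np.linalg.lstsq(A, np.log(y), rcond=None)
beta0_hat, beta1_hat, xi_hat = coef
```

With moderate noise, the recovered elasticity is close to the true 0.4, and the brand effect exponentiates into the multiplicative factor F(d).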

4.3. DEEP NEURAL NETWORKS

Neural Additive Models (NAM): Deep neural networks have been successful in prediction tasks. We consider a special form of neural network that is additive in nature (Agarwal et al., 2021). NAMs learn a linear combination of networks, each of which attends to a single input feature: each f_i in equation 6 is parameterized by a neural network. NAMs are explainable, elegant and easy-to-understand models.

y = β + f_1(x_1) + f_2(x_2) + ··· + f_K(x_K)    (6)

Neural Multiplicative Models (NMM): Similar to the multiplicative linear models, we consider a multiplicative form of the neural additive model using the log transformation of the dependent variable y and fixture count x:

y = e^β e^{f(log x)} e^{Σ_i f_{d_i}(d_i)}    (7)

Equation 7 in general will not lead to a tractable optimization problem. The functional form considered in Agarwal et al. (2021) involves ExU layers g(x) = e^w (x − b) and linear layers. We consider a linear functional form of f that can be modeled via linear layers. To improve learning and expressiveness we use a hidden dimension of 100, forming a total of two linear layers.

y = e^β x^{w_2' w_1} e^{w_2' b_1 + b_2} e^{Σ_i f_{d_i}(d_i)}    (8)

The parameters of equation 8 can be constrained to produce a concave function well suited for use as an objective function in a subsequent optimization problem. The constraint 0 ≤ w_2' w_1 ≤ 1 can be modeled by considering σ(w_2), σ(w_1) instead of w_2, w_1, where σ(x)_i = e^{x_i} / Σ_{j=1}^n e^{x_j} is the softmax function. It can be shown that 0 ≤ σ(w_2)' σ(w_1) ≤ 1.

Lemma 4.1. For any x, y ∈ Δ, where Δ = {w ∈ R_+^n | Σ_i w_i = 1}, we have 0 ≤ Σ_i x_i y_i ≤ 1.

Proof. In appendix.

Scaled Neural Multiplicative Models (SNMM): Generally, constraining neural network models reduces their capacity; the direct consequence is that they are no longer universal function approximators. We propose an NMM model where the scaling of the feature x before the log transformation can be learned in an end-to-end fashion.
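Lemma 4.1 is easy to check numerically; the sketch below verifies that the softmax-parameterized exponent σ(w_2)'σ(w_1) always lands in [0, 1], regardless of the raw weights:

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())   # numerically stabilized softmax
    return e / e.sum()

rng = np.random.default_rng(2)

# For any raw weights, the dot product of two softmax outputs lies in [0, 1],
# so the exponent sigma(w2)' sigma(w1) is a valid sub-unit elasticity.
for _ in range(1000):
    w1 = rng.normal(size=8)
    w2 = rng.normal(size=8)
    gamma = softmax(w2) @ softmax(w1)
    assert 0.0 <= gamma <= 1.0
```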
y = e^β z^{σ(w_2)' σ(w_1)} e^{σ(w_2)' b_1 + b_2} e^{Σ_i f_{d_i}(d_i)}    (9a)
z = 1 + max(0, s_2' s_1 x + s_2' t_1 + t_2)    (9b)

Equation 9 is tractable with respect to the optimization when s_2' s_1 x + s_2' t_1 + t_2 ≥ 0. Although the max operator cannot be modeled without a mixed integer program, we observed that the above condition for tractability holds in the experiments we carried out with different numbers of stores. In our application, y represents sales in dollars and x represents fixture count. Equation 9 can be written for an explicit category i, brand j and department k as follows:

y_{i,j,k} = a_{i,j,k} z_{i,j,k}^γ    (10a)
z_{i,j,k} = s_2' s_1 x_{i,j,k} + s_2' t_1 + t_2    (10b)
a_{i,j,k} = e^β e^{σ(w_2)' b_1 + b_2} e^{f_i(d_i) + f_j(d_j) + f_k(d_k)}    (10c)

Note that a_{i,j,k} is constant with respect to the optimization variables, and γ = σ(w_2)' σ(w_1) is the learned space elasticity.
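A minimal numpy sketch of the SNMM forward pass in equations 9a-9b, using random stand-ins for the learned parameters (the real model trains them end-to-end, and the category/brand/department networks f_{d_i} are collapsed into a single constant here):

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

rng = np.random.default_rng(3)
h = 100  # hidden dimension used in the paper

# Random stand-ins for learned parameters (illustrative only).
w1, w2 = rng.normal(size=h), rng.normal(size=h)
b1, b2 = rng.normal(size=h), rng.normal()
s1, s2 = rng.normal(size=h), rng.normal(size=h)
t1, t2 = rng.normal(size=h), rng.normal()
beta = 0.3
d_effect = 0.5  # stand-in for sum_i f_{d_i}(d_i)

gamma = softmax(w2) @ softmax(w1)  # learned elasticity, in [0, 1] by Lemma 4.1

def snmm(x):
    # Equation 9b: learned affine rescaling of the space feature, floored at 1.
    z = 1.0 + max(0.0, (s2 @ s1) * x + s2 @ t1 + t2)
    # Equation 9a: multiplicative model with the constrained exponent gamma.
    return np.exp(beta) * z ** gamma * np.exp(softmax(w2) @ b1 + b2) * np.exp(d_effect)
```

Because z ≥ 1 and 0 ≤ γ ≤ 1, the predicted sales are always positive, and the model is concave in x wherever the inner affine term is non-negative.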

4.4. OPTIMIZATION

After learning the functional form of F_θ, we can formulate the stage-two problem as follows. Let the sets I and K denote the set of category-brand pairs (i, j) and the set of departments, respectively. The parameter a_{i,j,k} is the coefficient for category i, brand j and department k based on the multiplicative model in equation 9. Let x_{i,j,k} denote the fixture count variable for category i, brand j and department k, with l_{i,j,k} and u_{i,j,k} as lower and upper bounds on the variable, respectively. Further, let us define the parameters s, s_k and γ as the total fixture count (across all departments), the fixture count for department k, and the space elasticity, respectively. We define the optimization with explicit indices (i, j) to indicate the dependency of the coefficients a_{i,j,k} on the brand, category and department encodings, even though they could be combined into a single index. The objective is to maximize sales revenue from equation 10. The explicit form of equation 10, f(x_{i,j,k}), depends on the model of choice (SNMM in our case).

max Σ_{k∈K} Σ_{(i,j)∈I} y_{i,j,k}    (11a)
s.t. y_{i,j,k} = f(x_{i,j,k}), ∀ i, j, k    (11b)
Σ_{(i,j)∈I} x_{i,j,k} ≤ s_k, ∀ k ∈ K    (11c)
Σ_{k∈K} Σ_{(i,j)∈I} x_{i,j,k} ≤ s    (11d)
l_{i,j,k} ≤ x_{i,j,k} ≤ u_{i,j,k}, ∀ i, j, k    (11e)
x_{i,j,k} ≥ 0, ∀ i, j, k    (11f)
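For intuition about what problem 11 does, consider a simplified instance (one department, no bounds, a single shared elasticity 0 < γ < 1) where the optimum has a closed form from the KKT stationarity condition a_i γ x_i^{γ-1} = λ. This is only an illustration under those assumptions, not the solver used in this work (which relies on cvxpy's power cone formulation):

```python
import numpy as np

def allocate(a, gamma, s):
    """Maximize sum_i a_i * x_i**gamma subject to sum_i x_i = s, x >= 0,
    for 0 < gamma < 1 and a_i > 0. Stationarity a_i*gamma*x_i**(gamma-1) = lam
    gives x_i proportional to a_i**(1 / (1 - gamma))."""
    w = a ** (1.0 / (1.0 - gamma))
    return s * w / w.sum()

# Hypothetical brand coefficients a_{i,j,k} and a capacity of 12 fixtures.
a = np.array([1.0, 2.0, 4.0])
x = allocate(a, gamma=0.5, s=12.0)

# Capacity is exhausted and marginal revenues are equalized across brands.
grad = a * 0.5 * x ** (-0.5)
assert abs(x.sum() - 12.0) < 1e-9
assert np.allclose(grad, grad[0])
```

Brands with larger coefficients receive more space, with the concave exponent γ preventing a winner-take-all corner solution.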

5. EXPERIMENTS

We demonstrate the ability of the proposed model to handle shape constraints, and its predictive power, with two sets of experiments. In the first experiment, we compare SNMM with different models from Gupta et al. (2018) on real-world datasets, testing their ability to handle constraints such as concavity with respect to the input. The models compared include: a "standard unconstrained DNN" (Gupta et al., 2018), partial convex shape-constrained neural networks (Amos et al., 2017), Random Tiny Lattices (RTLs) (Canini et al., 2016), and calibrated linear models (Gupta et al., 2016). See Gupta et al. (2018) for more details on these models; our results regarding their performance are taken from that work. In the second experiment, we provide results on a proprietary dataset from a retailer with respect to modeling space elasticity.

5.1. SHAPE-CONSTRAINED BENCHMARK

We consider three publicly available datasets and test the ability of SNMM to handle convexity and concavity constraints based on established benchmarks.

5.1.1. CAR SALES

The first dataset is a car sales benchmark, for which we predict monthly car sales (in thousands) from the price (in thousands). We enforce a convexity constraint on the price variable with respect to the sales variable. The results can be seen in Table 1. SNMM performed better than all compared models, with the lowest validation and test MSE.

5.1.2. PUZZLE SALES FROM REVIEWS

The data consists of 3 features, 156 training, 169 validation, and 200 non-IID test examples (courtesy of Artifact Puzzles and available at www.kaggle.com/dbahri/puzzles), for which we predict the 6-month sales of different wooden jigsaw puzzles from three features based on their Amazon reviews: the average star rating, the number of reviews, and the average word count of the reviews. Here we assume a convexity constraint on the star rating, and concavity constraints on the number of reviews and word count. Table 1 shows that for SNMM, even though the validation MSE is the highest among the models, the test MSE remains within the range of the highest test MSE. The results are not as impressive as on the Car Sales problem, indicating that our assumptions regarding convexity and concavity are more limiting here. It is important to note that these assumptions are still necessary to develop a tractable optimization problem formulation.

5.1.3. WINE QUALITY

Here we predict the quality of a wine, a score in [80, 100], from the price of the wine (the most important feature), country (21 Booleans), and 39 Boolean features based on the wine description from Wine Enthusiast Magazine. We constrain the model to be concave in the price feature. From Table 2, it can be seen that while the DNN has the lowest error, the test MSE for SNMM is within the range of the highest test MSE across all models.

5.2. CASE STUDY ON REAL DATA

In this section, we perform tests on a random selection of stores across the country from one retailer. We first describe the dataset and the features considered. We then compare eight different models for predicting sales using the metrics given below.

R² = 1 − ∥y − ŷ∥² / ∥y − ȳ∥²
MSE = N⁻¹ ∥y − ŷ∥²    (13)
MAE = N⁻¹ Σ_i |y_i − ŷ_i|

We consider linear and multiplicative models with variants. The variants are explored in the search for a computationally tractable optimization model. Further, certain models offer better modeling capacity with respect to the solution space. The functional relationship between the space feature (the variable in the optimization) and sales defines the objective function. We expect the models to capture the concave relationship from the data, but this seldom happens in reality. Real-world data is noisy, and unobserved confounders can affect the data. Further, retail data in general depends on exogenous uncertainty such as market dynamics, competitors, economic factors, etc. Explicit shape constraints are placed on the model, such as non-negativity of sales and concavity of sales with respect to the space feature. As discussed in Sections 4.2 and 4.3, we enforce the constraints and report performance metrics. We present the metrics in both log scale and linear scale whenever applicable to offer a complete comparison across all the models. The following models are used for the experiment:
• GLM-L: Classical linear regression model that assumes a linear relationship between the dependent and independent variables, equation 4.
• GLM-M: Multiplicative model that takes a log transformation of the dependent variable and space variable, as in equation 5.
• GLM-CM: Constrained GLM-M with space elasticity between 0 and 1.
• NAM: Neural Additive Model as described in equation 6.
• NMM: NAM that takes a log transformation of the dependent variable and space variable, as in equation 7.
• NMM-L: NMM with linear layers for the space variable, as in equation 8.
• NMM-CL: Constrained NMM-L with space elasticity between 0 and 1, using Lemma 4.1.
• SNMM: Proposed model in equation 9.

Dataset: The models consider historical space and sales data of apparel departments from 2018 to 2022. We want to predict and optimize space at the department-category-brand level, so we aggregate data to that level at a weekly granularity.

Model Evaluation: Numerical tests were carried out on sets of 5, 10 and 20 stores. The effectiveness of the models in sales prediction is evaluated using the metrics described above. The proposed model is compared with a variety of different models, and the results are shown in the benchmark tables for 5, 10 and 20 stores. The SNMM model performs the best across all scenarios; the MSE and MAE for SNMM are the lowest across all models. Comparable performance can be seen for the NMM-L model, where we enforce linearity of the space feature in log space. We highlight the benefit of enforcing structured constraints by comparison to several other models that are unconstrained. Although constrained models usually have lower capacity, SNMM is able to recover the performance lost due to the constraints and even improve on the unconstrained models. This suggests that the generalization ability of the shape-constrained model is due to the regularization imposed by the structure.

Proof of Lemma 4.1 (Appendix). By Cauchy-Schwarz, Σ_i x_i y_i ≤ √(Σ_i x_i²) √(Σ_i y_i²). Since x_i ≤ 1 and y_i ≤ 1, we have x_i² ≤ x_i and y_i² ≤ y_i, so Σ_i x_i y_i ≤ √(Σ_i x_i) √(Σ_i y_i) = 1, because Σ_i x_i = Σ_i y_i = 1. Trivially, Σ_i x_i y_i ≥ 0 since x_i, y_i ≥ 0 for all i.
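The evaluation metrics in equation 13 can be implemented directly; a small self-contained sketch with toy values (the paper reports these metrics on held-out retail data):

```python
import numpy as np

def r2(y, yhat):
    # R^2 = 1 - ||y - yhat||^2 / ||y - ybar||^2
    return 1.0 - np.sum((y - yhat) ** 2) / np.sum((y - y.mean()) ** 2)

def mse(y, yhat):
    # MSE = N^-1 ||y - yhat||^2
    return np.mean((y - yhat) ** 2)

def mae(y, yhat):
    # MAE = N^-1 sum_i |y_i - yhat_i|
    return np.mean(np.abs(y - yhat))

# Toy values only, to exercise the definitions.
y = np.array([1.0, 2.0, 3.0, 4.0])
yhat = np.array([1.1, 1.9, 3.2, 3.8])
```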



Synthetic experiments: Car Sales and Puzzle Review

Synthetic experiments: Wine Quality

Benchmark results for 5 stores

Benchmark results for 10 stores

Data summary

Benchmark results for 20 stores. The results are reported on the test split. Models are tuned over the hyper-parameters: batch size {32, 64, 128, 256, 512}, hidden size {10, 50, 100}.

REFERENCES

Venus Hiu Ling Lo. Capturing Product Complementarity in Assortment Optimization. PhD thesis, Cornell University, 2019.
Garrett van Ryzin and Siddharth Mahajan. On the relationship between inventory costs and variety benefits in retail assortments. Management Science, 45(11):1496-1509, 1999.
Juan Pablo Vielma, Shabbir Ahmed, and George Nemhauser. Mixed-integer models for nonseparable piecewise-linear optimization: Unifying framework and extensions. Operations Research, 58(2):303-315, 2010.
Bryan Wilder, Bistra Dilkina, and Milind Tambe. Melding the data-decisions pipeline: Decision-focused learning for combinatorial optimization. Proceedings of the AAAI Conference on Artificial Intelligence, 33, 2019.
Eda Yücel, Fikri Karaesmen, F. Sibel Salman, and Metin Türkay. Optimizing product assortment under customer-driven demand substitution. European Journal of Operational Research, 199(3):759-768, 2009.

6. CONCLUSION AND FUTURE WORK

In this work, we presented a shelf space allocation problem for a retail use case. A general predict-then-optimize framework is discussed in detail with regard to the trade-off between tractability of the optimization and prediction model accuracy. We discuss the need for multiplicative models and the difficulty of estimating space elasticity on a sparse dataset (as is most often the case in real-world problems). To that end, we propose a Scaled Neural Multiplicative Model (SNMM) that satisfies two conditions: non-negativity of sales and concavity with respect to the space feature. The optimization problem is formulated as a convex problem which can be solved using convex solvers; specifically, we use the power cone formulation provided by cvxpy. The model proposed in this work is general enough to extend to many other applications, such as advertisement optimization and revenue maximization. In the future, we plan to explore these areas and enhance the model to handle different sets of constraints jointly.

