PURE: AN UNCERTAINTY-AWARE RECOMMENDATION FRAMEWORK FOR MAXIMIZING EXPECTED POSTERIOR UTILITY OF PLATFORM

Abstract

Commercial recommendation can be regarded as an interactive process between the recommendation platform and its target users. One crucial problem for the platform is how to make full use of its advantages so as to maximize its utility, i.e., the commercial benefits from recommendation. In this paper, we propose a novel recommendation framework which effectively utilizes the information of user uncertainty over different item dimensions and explicitly takes into consideration the impact of the display policy on the user in order to achieve maximal expected posterior utility for the platform. We formulate the problem of deriving the optimal policy as a constrained non-convex optimization problem and further propose an ADMM-based solution to derive an approximately optimal policy. Extensive experiments are conducted over data collected from a real-world recommendation platform and demonstrate the effectiveness of the proposed framework. Besides, we also adopt the proposed framework to conduct experiments with the intent of revealing how the platform achieves its commercial benefits. The results suggest that the platform should cater to the user's preference on item dimensions where the preference is strong, while for item dimensions where the user has high uncertainty, the platform can achieve more commercial benefits by recommending items with high utilities.

1. INTRODUCTION

Commercial recommendation systems have been widely deployed on prevalent content distribution platforms such as YouTube, TikTok, Amazon and Taobao. During the interactive process on the recommendation platform, users may find content of interest and avoid information overload with the help of recommendation services. Meanwhile, the platform may gain commercial benefits from user behaviors on the platform such as clicks and purchases. As the platform may serve millions of users and can determine which contents to recommend, it naturally has some advantages over individual users. Therefore, it is crucial for the platform to make full use of these advantages in order to maximize its commercial benefits. One typical advantage of the platform is its information advantage, i.e., it may collect plenty of information over users and items for conducting better recommendation. Typical state-of-the-art recommendation systems (Covington et al., 2016; Guo et al., 2017; Ren et al., 2019; Zhou et al., 2019) take such information into consideration, including user profiles, item features and historical interactions between users and recommended items. It is worth noting that information over item features is usually directly incorporated into the recommendation models without considering that the user may have different levels of uncertainty over different item dimensions (which can be regarded as different hidden attributes describing different high-order features of the item). For instance, when buying a new coat on the platform, a user may be sure that the logistics is very fast as she (he) has bought clothes from the same online store before (i.e., the user has low uncertainty over the logistics). But she (he) may be uncertain about the quality of the coat since it is of a brand that she (he) does not know much about (i.e., the user has high uncertainty over the quality).
Thus, it is crucial for the platform to figure out whether it is possible to leverage the user uncertainty over different item dimensions to maximize the platform utility, and if so, how. Indeed, by taking the user uncertainty over different item dimensions into consideration, we show that more commercial benefits can be gained from the item dimensions with higher uncertainty. Another advantage of the platform is that it has the capacity to determine which items to display to the users and thus may affect the users' behaviors. It has been shown by many works (Kamenica & Gentzkow, 2011; Immorlica et al., 2019; Abdollahpouri & Mansoury, 2020) that the display signal itself can highly affect users' behaviors, and the affected behaviors apparently result in different benefits for the platform. Regarding recommendation as a game between the platform and the users, it is possible for the platform to achieve more commercial benefits from the game by adopting a proper display (recommendation) policy. However, although there are works exploring the impact of recommendation policies, how to explicitly model and exploit the impact of the display policy on users is still not well studied in the recommendation area. In this paper, we propose an uncertainty-aware expected Posterior Utility maximization framework for REcommendation platforms (denoted as PURE in short). We take both of the two previously mentioned factors, i.e., the user uncertainty over different item dimensions and the influence of the display policy on the user, into account and introduce a generic utility function which can be flexibly adjusted for different real-world scenarios. Then, we formulate the problem of maximizing the expected posterior utility for the platform as a constrained non-convex optimization problem, and correspondingly propose a solution based on the Alternating Direction Method of Multipliers (ADMM, Boyd et al. (2011)) to derive the approximately optimal policy.
To verify the effectiveness of the proposed framework, extensive experiments are conducted over data collected from a real-world recommendation platform. Furthermore, we also provide practical insights derived from carefully designed experiments and empirically reveal how the platform utilizes its information advantage to achieve more commercial benefits, which may help to better understand and conduct commercial recommendation.

2. RELATED WORK

Existing state-of-the-art recommendation systems (Zhou et al., 2018; Pi et al., 2019; Qu et al., 2016) mainly try to make full use of the information advantage of the platform. These works take into consideration information including user profiles, item features, contextual information and historical interactions between users and recommended items. Typically, some works (Qu et al., 2016; Zhou et al., 2018; Li et al., 2019) focus on how to achieve better feature interactions or conduct better user interest modeling, while other works (Ren et al., 2019; Pi et al., 2019) pay more attention to utilizing extremely long sequential interactive information. However, most of them ignore the existence of user uncertainty over different item dimensions, which might be crucial for conducting better commercial recommendation. In the research area exploring the influence of display on the information receiver, Bayesian Persuasion (Kamenica & Gentzkow, 2011) is one of the most crucial works, which theoretically proves that the information sender may benefit from displaying proper information to the receiver. Some works (Immorlica et al., 2019; Mansour et al., 2016) follow this idea and strive to incentivize exploration via information asymmetry in scenarios such as recommendation. In another research direction that develops Reinforcement Learning (RL) based solutions for recommendation scenarios, a series of works (Dulac-Arnold et al., 2015; Zhao et al., 2018; Chen et al., 2019) model the recommendation process as a Markov Decision Process (MDP) and maximize the long-term reward by utilizing learned sequential patterns, which can also be regarded as taking the display (recommendation) influence into consideration to some extent.

3.1. OPTIMAL POLICY FOR MAXIMIZING PLATFORM'S EXPECTED POSTERIOR UTILITY

From the perspective of the platform, the optimal recommendation policy is the one with maximal expected utility (i.e., maximal expected commercial benefits). As mentioned before, the influence of the display policy on users cannot be ignored, as it highly affects the commercial benefits of the platform. In this paper, taking the impact of the display policy on users into consideration, we formulate the platform's optimal policy π_u for user u over a given item set I as follows:

π_u = argmax_π Σ_{i∈I} π_i U_u(i | display; π),  s.t. ∀i ∈ I, π_i ≥ 0 and Σ_{i∈I} π_i = 1,   (1)

where U_u(i | display; π) is the posterior utility of recommending item i to user u with consideration of the influence of the display policy π. With this formulation, the remaining problem is how to model the posterior utility properly. In the following, we illustrate two reasonable assumptions in detail, which make it possible to model the posterior utility with consideration of the user uncertainty over different item dimensions as well as the influence of the display policy on the user. As discussed before, it is crucial to explicitly consider the user uncertainty over different item dimensions when conducting recommendation. For a given user, we assume that the representation of an item is sampled from a multivariate Gaussian distribution and adopt the variances to describe the user uncertainty over different item dimensions, which is formulated as the following assumption.

Assumption 1 (Assumption of uncertainty (correlation) over different item dimensions). For a user u, the representation of item i is sampled from an n-dimensional multivariate Gaussian distribution N(μ_{u,i}, Σ_{u,i}), i.e., the probability density function of the representation is:

p_{u,i}(x) = 1 / √((2π)^n |Σ_{u,i}|) · exp(-(1/2) (x - μ_{u,i})^T Σ_{u,i}^{-1} (x - μ_{u,i})),   (2)

where x ∈ R^n, and μ_{u,i} and Σ_{u,i} denote the mean vector and the covariance matrix respectively.
The covariance matrix can be decomposed as Σ_{u,i} = D_{u,i} C_{u,i} D_{u,i}, where D_{u,i} is the diagonal standard deviation matrix and C_{u,i} is the correlation matrix (see Barnard et al. (2000) for more information). Thus, the covariance matrix can depict the user uncertainty over different item dimensions (with the diagonal standard deviation matrix D_{u,i}) as well as the correlation between different item dimensions (with the correlation matrix C_{u,i}). Note that we provide a practical method to obtain μ_{u,i} and Σ_{u,i} in Section 4.1, while any other reasonable approach to obtain μ_{u,i} and Σ_{u,i} can be applied. From the perspective of users, they may try to understand the display policy of the platform from the interactive process. When an item i is recommended to a user u, the user may consider the corresponding display probability, which could influence his behavior. One reasonable assumption is that the probability of displaying item i, from the perspective of user u, increases with the similarity between the item representation x and the representations of historically recommended items. Without loss of generality, we formulate this assumption as follows.

Assumption 2 (Assumption of the influence of the display policy on the user). Given display policy π and item representation x, the probability of recommending item i to user u from the user's perspective is:

p_{u,i}(display | x; π) = Φ(a v_u^T x + b),   (3)

where display denotes the event of displaying (recommending) the corresponding item to the target user, a and b are scale hyper-parameters with a > 0, Φ(z) = ∫_{-∞}^{z} (1/√(2π)) e^{-t²/2} dt, and v_u = E_{i∼π}[μ_{u,i}] = Σ_{i∈I} π_i μ_{u,i}. Note that Φ is the cumulative distribution function of the standard normal distribution, which is widely adopted to map its input into the range [0, 1] in models such as the probit model, and v_u is the expected value of the item representations w.r.t. the display policy π over item set I.
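To make Assumption 2 concrete, the display likelihood can be sketched in a few lines of NumPy/SciPy. This is a minimal illustration under our own naming conventions (the function name and toy shapes are not from the paper):

```python
import numpy as np
from scipy.stats import norm


def display_likelihood(x, pi, mus, a=1.0, b=0.0):
    """p_{u,i}(display | x; pi) under Assumption 2.

    x:   sampled item representation, shape (n,)
    pi:  display policy over the candidate item set, shape (|I|,)
    mus: mean representations mu_{u,i} of all candidates, shape (|I|, n)
    a, b: scale hyper-parameters, with a > 0
    """
    v = pi @ mus                      # v_u = sum_i pi_i mu_{u,i}
    return norm.cdf(a * (v @ x) + b)  # probit link maps the similarity into [0, 1]
```

With a > 0, a representation x that is more similar (by inner product) to the policy-weighted mean v_u receives a higher display probability, matching the monotonicity discussed in the assumption.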
Thus, with a > 0, higher similarity (calculated by inner product) between the current item representation x and the expected representation of displayed items leads to higher likelihood, which is reasonable as discussed before. So far, we have presented the two crucial assumptions adopted in our framework. In the following, we introduce the utility function and derive the formula of the posterior utility. To ease the illustration, we denote the utility function of recommending item i to user u given a sampled item representation x as f(w_u, x), where w_u is typically the embedding of the user side, and f(w_u, x) depicts the utility (benefit) that the platform can gain from recommending item i to user u with item representation x. For instance, when we regard the click-through rate (CTR) as the platform's utility, one simple way is to model f as the inner product of w_u and x, where w_u can be regarded as the user preference vector of user u. Note that the function f can be flexibly adjusted to fit the requirements of different scenarios. For example, when we try to maximize the gross merchandise volume (GMV) in a recommendation scenario, f can be defined as the product of the corresponding CTR, conversion rate (CVR) and item price. Now we can combine the two assumptions and the utility function f to derive the formula of the posterior utility. By adopting Bayes' theorem and the law of total probability, we present the formula of the posterior utility of recommending item i to user u as follows:

U_u(i | display; π) = ∫_{R^n} f(w_u, x) p_{u,i}(x | display; π) dx   (4)
                    = ∫_{R^n} f(w_u, x) p_{u,i}(x) · p_{u,i}(display | x; π) / p_{u,i}(display; π) dx.   (5)

Equation 5 provides a way to model the posterior utility of the platform, taking into account both the user uncertainty over different item dimensions and the influence of the display policy on the user.
However, it is still challenging to derive the optimal policy π_u for a given user u, as the right-hand side of Equation 5 is not a closed-form expression (which makes it intractable to calculate the exact value of the posterior utility).

3.2. POSTERIOR UTILITY DERIVATION FOR LINEAR AND NON-LINEAR UTILITY FUNCTIONS

In this section, we present how to compute (estimate) the value of the posterior utility for both linear and non-linear utility functions. To derive an effective calculation formula of U_u(i | display; π) for the case with a linear utility function, we first introduce the following lemma.

Lemma 1. ∫_{-∞}^{+∞} (c'x + d') Φ(a'x + b') N(x | 0, 1) dx = (a'c' / √(2π(a'² + 1))) e^{-b'² / (2(a'² + 1))} + d' Φ(b' / √(1 + a'²)), ∀a', b', c', d' ∈ R.

The detailed proof of Lemma 1 is provided in Appendix A.1 due to page limits. Lemma 1 reveals that the definite integral ∫_{-∞}^{+∞} (c'x + d') Φ(a'x + b') N(x | 0, 1) dx can be computed directly, which is a crucial property for deriving an effective calculation formula of U_u(i | display; π) in the case with a linear utility function. By utilizing Lemma 1, the following corollary shows that p_{u,i}(display; π) can be straightforwardly calculated if the covariance matrix is positive definite.

Corollary 1. p_{u,i}(display; π) = Φ((a v_u^T μ_{u,i} + b) / √(1 + a² v_u^T Σ_{u,i} v_u)), if Σ_{u,i} is positive definite.

Due to page limits, the detailed proof of Corollary 1 is presented in Appendix A.2. With Lemma 1 and Corollary 1, we can derive the calculation formula of U_u(i | display; π) if the utility function is linear, which is illustrated in detail as follows.

Corollary 2. U_u(i | display; π) = w_u^T μ_{u,i} + (a v_u^T Σ_{u,i} w_u / (√(2π(a² v_u^T Σ_{u,i} v_u + 1)) · Φ((a v_u^T μ_{u,i} + b) / √(a² v_u^T Σ_{u,i} v_u + 1)))) · e^{-(a v_u^T μ_{u,i} + b)² / (2(a² v_u^T Σ_{u,i} v_u + 1))}, if f(w_u, x) = w_u^T x and Σ_{u,i} is positive definite.

Proof. If Σ_{u,i} is positive definite, Σ_{u,i}^{-1} is also positive definite. Then we can conduct the Cholesky decomposition Σ_{u,i}^{-1} = A_{u,i}^T A_{u,i}, where A_{u,i} is an upper triangular matrix with real and positive diagonal entries.
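As a quick sanity check on the statement of Lemma 1, the identity can be verified numerically against quadrature. This is a sketch assuming SciPy; the helper names are ours:

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm


def lemma1_closed_form(a, b, c, d):
    """Right-hand side of Lemma 1."""
    s = a * a + 1.0
    return a * c / np.sqrt(2.0 * np.pi * s) * np.exp(-b * b / (2.0 * s)) \
        + d * norm.cdf(b / np.sqrt(s))


def lemma1_numerical(a, b, c, d):
    """Left-hand side of Lemma 1, evaluated by numerical quadrature."""
    integrand = lambda x: (c * x + d) * norm.cdf(a * x + b) * norm.pdf(x)
    val, _ = quad(integrand, -np.inf, np.inf)
    return val
```

For any choice of (a', b', c', d'), the two sides agree to quadrature precision.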
Thus, we have:

∫_{R^n} f(w_u, x) p_{u,i}(x) p_{u,i}(display | x; π) dx   (6)
= ∫_{R^n} w_u^T x Φ(a v_u^T x + b) N(x | μ_{u,i}, Σ_{u,i}) dx   (7)
= ∫_{R^n} w_u^T (A_{u,i}^{-1} z + μ_{u,i}) Φ(a v_u^T (A_{u,i}^{-1} z + μ_{u,i}) + b) N(z | 0, I) dz   (8)
= w_u^T μ_{u,i} Φ((a v_u^T μ_{u,i} + b) / √(a² v_u^T Σ_{u,i} v_u + 1)) + (a v_u^T Σ_{u,i} w_u / √(2π(a² v_u^T Σ_{u,i} v_u + 1))) e^{-(a v_u^T μ_{u,i} + b)² / (2(a² v_u^T Σ_{u,i} v_u + 1))}.   (9)

Equation 8 and Equation 9 are derived from Proposition 3 and Proposition 5 respectively. Due to page limits, Proposition 3 and Proposition 5 and their detailed proofs are presented in the Appendix. Combining Corollary 1 and Equation 9, we have:

U_u(i | display; π) = (1 / p_{u,i}(display; π)) ∫_{R^n} f(w_u, x) p_{u,i}(x) p_{u,i}(display | x; π) dx   (10)
= w_u^T μ_{u,i} + (a v_u^T Σ_{u,i} w_u / (√(2π(a² v_u^T Σ_{u,i} v_u + 1)) · Φ((a v_u^T μ_{u,i} + b) / √(a² v_u^T Σ_{u,i} v_u + 1)))) e^{-(a v_u^T μ_{u,i} + b)² / (2(a² v_u^T Σ_{u,i} v_u + 1))}.   (11)

Corollary 2 reveals that the posterior utility can be effectively calculated (since it has a closed-form expression and thus avoids the intractable calculation in Equation 5) when Σ_{u,i} is positive definite and the utility function f is linear, which makes it possible to utilize the result effectively in real-world scenarios. However, when the utility function f is non-linear, it might be challenging or even impossible to calculate the exact value of the posterior utility. To estimate the posterior utility when f is non-linear, we leverage the importance sampling technique. Combining Assumption 2, Equation 5 and Corollary 1, we have:

U_u(i | display; π) = ∫_{R^n} f(w_u, x) p_{u,i}(x) l(x; π) dx,   (12)

where l(x; π) = Φ(a v_u^T x + b) / Φ((a v_u^T μ_{u,i} + b) / √(a² v_u^T Σ_{u,i} v_u + 1)). Thus, computing U_u(i | display; π) can be regarded as calculating the expected value of f(w_u, X) w.r.t. the target probability density function q(X = x) = p_{u,i}(x) l(x; π).
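The closed-form expression of Corollary 2 translates directly into code. The sketch below uses our own variable names and checks the formula against a naive Monte Carlo estimate of Equation 5 (both are illustrations, not the paper's implementation):

```python
import numpy as np
from scipy.stats import norm


def posterior_utility_linear(w, mu, Sigma, v, a, b):
    """Closed-form U_u(i | display; pi) from Corollary 2, for f(w, x) = w^T x.

    w:     user preference vector, shape (n,)
    mu:    mean of the item representation, shape (n,)
    Sigma: covariance matrix (positive definite), shape (n, n)
    v:     policy-weighted mean of item means, v = sum_i pi_i mu_i, shape (n,)
    a, b:  scale hyper-parameters of the display likelihood
    """
    s = a * a * v @ Sigma @ v + 1.0      # a^2 v^T Sigma v + 1
    z = (a * v @ mu + b) / np.sqrt(s)    # argument of the probit term
    correction = (a * v @ Sigma @ w) / (np.sqrt(2.0 * np.pi * s) * norm.cdf(z))
    return w @ mu + correction * np.exp(-0.5 * z * z)
```

The return value is the prior utility w^T μ plus a correction term that grows with the alignment between the policy-weighted mean v and the covariance-scaled preference Σ w, which is the mechanism analyzed in Section 4.4.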
Since p_{u,i}(x) is the probability density function of a multivariate Gaussian distribution from which one can easily sample, we can adopt p_{u,i}(x) as the sampler density; as a result, given a sample x, the corresponding importance weight is l(x; π). Thus, given a sample set X which consists of m samples drawn i.i.d. from the distribution N(μ_{u,i}, Σ_{u,i}), we can approximate the posterior utility as:

Û_u(i | display; π) = (1 / Σ_{x∈X} l(x; π)) Σ_{x∈X} f(w_u, x) l(x; π).   (13)

In this section, we derived an effective calculation formula for the posterior utility in the case with a linear utility function and provided a sampling-based method to estimate the posterior utility in the case with a non-linear utility function. Note that the sampling-based method can also be applied to estimate the posterior utility in the case with a linear utility function. Combining the results in this section (i.e., Equation 11 and Equation 13) with the formulation of the optimal policy (Equation 1), we notice that the problem of deriving the optimal policy for maximizing the platform's expected posterior utility turns into a constrained non-convex optimization problem, even in the case with a linear utility function. To solve the problem, an ADMM-based solution is proposed in the next section.
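Equation 13 is a self-normalized importance-sampling estimator; note that the denominator of l(x; π) is a constant in x, so it cancels in the ratio and the unnormalized weight Φ(a v_u^T x + b) suffices. A minimal sketch under our own naming (not the paper's code):

```python
import numpy as np
from scipy.stats import norm


def estimate_posterior_utility(f, w, mu, Sigma, v, a, b, m=1000, rng=None):
    """Self-normalized importance-sampling estimate of U_u(i | display; pi)
    (Equation 13), usable for an arbitrary, e.g. non-linear, utility f.

    Samples are drawn from the sampler density N(mu, Sigma). The constant
    denominator of the importance weight l(x; pi) cancels between numerator
    and denominator, so Phi(a v^T x + b) is used directly as the weight.
    """
    rng = np.random.default_rng() if rng is None else rng
    X = rng.multivariate_normal(mu, Sigma, size=m)   # (m, n) i.i.d. samples
    weights = norm.cdf(a * X @ v + b)                # unnormalized l(x; pi)
    values = np.array([f(w, x) for x in X])
    return np.sum(values * weights) / np.sum(weights)
```

One useful sanity property: for a constant utility f ≡ c the estimator returns exactly c for any sample set, since the weights cancel.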

3.3. AN ADMM-BASED SOLUTION FOR DERIVING APPROXIMATELY OPTIMAL POLICY

As presented in the previous section, finding the optimal policy to maximize the expected posterior utility of the platform is formulated as a constrained non-convex optimization problem. Noticing that the ADMM technique (Boyd et al., 2011) has been successfully adopted as an effective method for finding approximate solutions to constrained non-convex optimization problems (see Leng et al. (2017) for more details), we develop an ADMM-based solution to solve the aforementioned problem. Denoting g_{u,I}(π) = -Σ_{i∈I} π_i U_u(i | display; π), we rewrite the constrained non-convex optimization problem (formulated in Equation 1) as follows:

argmin_π g_{u,I}(π),  s.t. ∀i ∈ I, π_i ≥ 0 and Σ_{i∈I} π_i = 1.   (14)

By introducing an auxiliary parameter vector π', the problem can be reformulated as:

argmin_{π'} g_{u,I}(π'),  s.t. π' = π, ∀i ∈ I, π_i ≥ 0 and Σ_{i∈I} π_i = 1.   (15)

Note that the target function g_{u,I}(π') and the target policy π are linked via the first constraint, i.e., π' = π. We regard the last two constraints as hard constraints which are required to be satisfied at every optimization step. The augmented Lagrangian w.r.t. the first constraint is:

L_ρ(π, π', λ) = g_{u,I}(π') + λ^T (π' - π) + (ρ/2) ||π' - π||²₂,   (16)

where λ is the vector of Lagrangian multipliers and ρ > 0 is a hyper-parameter. To achieve an approximate solution, the ADMM method conducts the following three steps in each iteration:

π^{(k+1)} := argmin_π L_ρ(π, π'^{(k)}, λ^{(k)})   (17)
π'^{(k+1)} := argmin_{π'} L_ρ(π^{(k+1)}, π', λ^{(k)})   (18)
λ^{(k+1)} := λ^{(k)} + ρ (π'^{(k+1)} - π^{(k+1)})   (19)

We can observe that the last two steps do not involve optimization w.r.t. π. Thus, the third step can be achieved by a simple value update, while the second step can be approximately solved by adopting gradient-based methods such as stochastic gradient descent. The optimization problem w.r.t.
the first step can be reformulated as:

argmin_π L_ρ(π, π'^{(k)}, λ^{(k)}) = argmin_π λ^{(k)T} (π'^{(k)} - π) + (ρ/2) ||π'^{(k)} - π||²₂   (20)
= argmin_π (ρ/2) ||π'^{(k)} + λ^{(k)}/ρ - π||²₂,  s.t. ∀i ∈ I, π_i ≥ 0 and Σ_{i∈I} π_i = 1,   (21)

which is a convex quadratic programming problem and can be solved by standard convex quadratic programming methods such as active set methods (Wong, 2011). It is worth noting that the optimization problem w.r.t. the first step can be regarded as finding the nearest point in a given convex set (the set satisfying the two constraints) to a given point (i.e., π'^{(k)} + λ^{(k)}/ρ). By iteratively conducting the aforementioned three optimization steps until convergence (say, after k' steps), we acquire an approximately optimal policy π^{(k')} for maximizing the expected posterior utility of the platform, which is demonstrated to be sufficiently good in the experiment section.
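Since the π-step in Equation 21 is exactly a Euclidean projection onto the probability simplex, it also admits the classic sort-based O(|I| log |I|) algorithm as an alternative to a general active-set solver. The paper itself uses active set methods; the following is an illustrative sketch of the projection route:

```python
import numpy as np


def project_to_simplex(y):
    """Euclidean projection of y onto {pi : pi_i >= 0, sum_i pi_i = 1}.

    Solves the convex quadratic program of the pi-step (Equation 21) with
    y = pi'^(k) + lambda^(k)/rho, using the sort-based algorithm of
    Duchi et al. (2008).
    """
    u = np.sort(y)[::-1]                    # sort in descending order
    css = np.cumsum(u)
    ks = np.arange(1, len(y) + 1)
    cond = u + (1.0 - css) / ks > 0         # active coordinates condition
    k = ks[cond][-1]                        # number of positive coordinates
    tau = (css[k - 1] - 1.0) / k            # shared shift
    return np.maximum(y - tau, 0.0)
```

Points already on the simplex are fixed points of the projection, and any input is mapped to a valid policy (non-negative, summing to 1).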

4. EXPERIMENTS

In this section, we present the experiment setup and the experiment results in detail. First, we present the detailed setup of the experiments, including a practical way to obtain the mean vector and covariance matrix for a given pair of user and item, which makes it possible to take the user uncertainty over different item dimensions into consideration. Second, we conduct experiments to verify the effectiveness of the proposed technical solutions involved in our framework, i.e., the effectiveness of the ADMM-based solution for maximizing the platform's expected posterior utility and the sampling-based technique for posterior utility estimation. Third, to verify the superiority of the policy derived from the proposed framework, we conduct experiments and present the results of a comparison between our policy and two heuristic policies. Fourth, to answer the crucial question of how the platform maximizes its commercial benefits, we analyse the relation between the learned policy and two significant factors, the user preference and the user uncertainty over a given item dimension, and provide practical insights to better understand and conduct commercial recommendation.

4.1. EXPERIMENT SETUP

To conduct experiments, we first collect click log data of about 1 million users and 5 million items from the GUESS U LIKE scenario of the Taobao application. To encode the users and the items, we adopt the prevalent Embedding&MLP paradigm (He et al., 2017; Guo et al., 2017) and encode the information of the user and item sides respectively (Xu et al., 2019). By training on a specific task (e.g., the click-through rate prediction task), we obtain the embedding of the user side as well as that of the item side. Without loss of generality, we consider the following two model architectures: i) the model output is the inner product of the user and item embeddings; ii) the model output is obtained by applying a non-linear function over the user and item embeddings. Note that these two architectures correspond to the cases with linear and non-linear utility functions respectively. For each pair of user group (divided according to age and gender) and item category, the clicked item set is extracted and the mean vector and covariance matrix of the corresponding item embeddings are calculated. For a given pair of user and item, the mean vector and the covariance matrix are then taken from the corresponding pair of user group and item category. Without loss of generality, we empirically set the values of the hyper-parameters a and b to ensure that most values of p_{u,i}(display | x; π) lie between 0.001 and 0.1 (noticing that other reasonable ranges lead to similar experimental conclusions). Since Φ is a function in the form of an integral, which might be intractable to calculate, we adopt an effective approximation (Waissi & Rossin, 1996) to ease the calculation. To avoid learning a highly unbalanced policy (with probability 1 for one item and 0 for the others) due to unbalanced utilities, we also incorporate an entropy-based regularization term into our model.
Specifically, instead of directly adopting the optimization step shown in Equation 18, we add an extra term to smooth the learned policy, and the corresponding optimization step is:

π'^{(k+1)} := argmin_{π'} L_ρ(π^{(k+1)}, π', λ^{(k)}) + η Σ_{i∈I} (π'_i - 1/|I|)²,   (22)

where |I| denotes the number of items in the item set I, and η is the parameter controlling the smoothness of the learned policy, which is empirically set to 10.0 in the experiments. Besides, the optimizer we adopt for gradient descent is the Adam optimizer (Kingma & Ba, 2014), and the value of ρ is empirically set to 1.0, which is shown to be effective in the following experiments. In the next section, we present experiment results demonstrating the effectiveness of the two proposed technical solutions, i.e., the ADMM-based solution for maximizing the expected posterior utility and the importance sampling technique for posterior utility estimation.
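For illustration, the smoothed π'-update can be sketched with plain gradient descent on a toy objective in which the per-item utilities are treated as a fixed vector u. This is an assumption for brevity: in the framework, U_u(i | display; π') itself depends on π', and the paper uses the Adam optimizer rather than plain gradient descent:

```python
import numpy as np


def smoothed_pi_prime_step(u, pi, lam, rho=1.0, eta=10.0, lr=0.01, steps=500):
    """Gradient-descent sketch of the smoothed pi'-update (Equation 22).

    Toy objective (an assumption): g(pi') = -u^T pi' with FIXED per-item
    utilities u, so the full gradient of the regularized subproblem is
        grad = -u + lam + rho (pi' - pi) + 2 eta (pi' - 1/|I|).
    """
    n = len(u)
    pi_prime = np.full(n, 1.0 / n)       # start from the uniform policy
    for _ in range(steps):
        grad = -u + lam + rho * (pi_prime - pi) + 2.0 * eta * (pi_prime - 1.0 / n)
        pi_prime -= lr * grad
    return pi_prime
```

The η-term pulls the iterate toward the uniform policy 1/|I|, which is exactly the smoothing effect described above: larger η yields a more balanced learned policy.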

4.2. EFFECTIVENESS OF THE PROPOSED TECHNICAL SOLUTIONS

To demonstrate the effectiveness of the proposed ADMM-based solution for maximizing the expected posterior utility, we record the optimization curves (i.e., expected posterior utility w.r.t. the number of optimization iterations) and present the results in Figure 1. According to Section 3.2, there are three cases to consider: i) linear utility function with exact posterior utility (calculated by adopting Equation 11); ii) linear utility function with estimated posterior utility; iii) non-linear utility function with estimated posterior utility. We randomly sample 3 users and 100 items and utilize the proposed ADMM-based optimization method to learn the approximately optimal policy for maximizing the expected posterior utility of the platform. As shown in Figure 1, the expected posterior utility first increases and then converges to its maximum as the number of optimization iterations grows. These results demonstrate that the proposed ADMM-based solution can effectively learn a policy that maximizes the expected posterior utility. Besides, comparing the optimization curves of the three cases, we can observe that the variances of the expected posterior utility in the cases with the estimated posterior utility Û_u(i | display; π) (the last two cases) are higher than in the case with the exactly calculated posterior utility U_u(i | display; π) (the first case). The reason is that extra random noise is introduced by importance sampling when estimating the posterior utility, and as a result, the expected posterior utility has higher variance in those cases. To verify the effectiveness of the adopted importance sampling technique for posterior utility estimation, we record the optimization curves w.r.t. different numbers of samples per estimation for three randomly sampled users and present the results in Figure 2.
The linear utility function is adopted in these experiments, which makes it possible to compare the results of utilizing the estimated posterior utility (estimated according to Equation 13) and the exact posterior utility (calculated according to Equation 11). From Figure 2 we can observe that even in the case with 1 sample per estimation, the tendency of the optimization curve is similar to that of the case with the exact posterior utility, which verifies that the adopted sampling-based technique for posterior utility estimation is effective. Besides, when varying the number of samples per estimation from 1 to 1000, we observe that the optimization curves become more and more stable. This is because a larger number of samples per estimation leads to a more accurate approximation of the posterior utility, and thus results in a more stable optimization curve. More details, including the Root Mean Square Error (RMSE) between the estimated and exact values of the posterior utilities for each sampling setting, are provided in Appendix A.4 due to page limits.

4.3. EXPERIMENTAL COMPARISON WITH HEURISTIC POLICIES

In this section, we present the experiment details and the corresponding results to verify the effectiveness of the policy derived from the proposed framework. To the best of our knowledge, the presented framework is the first complete work that proposes to maximize the expected posterior utility (which explicitly takes the impact of the display policy into consideration) and develops a corresponding solution to derive the approximately optimal policy. Thus, there is a lack of existing solutions for a straightforward comparison, and we instead verify the effectiveness of the learned policy by comparing it with two heuristic policies. One simple policy for comparison is the random policy, which recommends each item with the same probability. Thus, the random policy can be regarded as a uniform probability distribution over the candidate items. We also incorporate a heuristic prior policy for comparison, where the recommendation probability is higher for items with higher prior utility. Specifically, the recommendation probability of item i is given by:

π^pop_i = e^{U_u(i)/τ} / Σ_{j∈I} e^{U_u(j)/τ},   (23)

where τ is the temperature parameter (Hinton et al., 2015) controlling the smoothness of the heuristic prior policy. For a fair comparison, we adjust the value of τ to ensure that the smoothness levels of the heuristic prior policy and our learned policy are comparable. We randomly sample 1000 users, calculate the mean of the expected posterior utilities for each policy and present the results in Table 1. As shown in Table 1, the policy derived from the proposed framework achieves the highest expected posterior utility compared to the random and heuristic prior policies in both the linear and non-linear cases, which demonstrates the superiority of the proposed framework.
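The heuristic prior policy above is a standard temperature-controlled softmax over the prior utilities; a minimal sketch (the function name is ours):

```python
import numpy as np


def heuristic_prior_policy(utilities, tau=1.0):
    """Temperature-controlled softmax over prior utilities (pi^pop):
    higher-utility items get higher display probability, and a larger
    tau yields a smoother (closer-to-uniform) policy.
    """
    z = np.asarray(utilities, dtype=float) / tau
    z -= z.max()                 # subtract the max for numerical stability
    p = np.exp(z)
    return p / p.sum()
```

As τ → ∞ the policy approaches the uniform (random) baseline, while small τ concentrates probability on the highest-utility item; this is the knob used to match smoothness levels in the comparison.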

4.4. HOW DOES THE PLATFORM ACHIEVE ITS COMMERCIAL BENEFITS

In this section, we adopt the proposed framework to reveal how the platform achieves its commercial benefits, taking both the user preference and the user uncertainty over different item dimensions into account. In the experiments, the utility function is realized by applying the sigmoid function. Each element of w_u can be interpreted as the user preference over the corresponding item dimension. For a randomly sampled user, we denote the mean of the user preference vectors over items as w̄ and denote its first element as w̄_0, whose value indicates the user's preference on item dimension 0. We vary the value of w̄_0 and record the learned expected value (w.r.t. the learned policy) of the elements at dimension 0 of the mean vectors, which is denoted as v̄_0. As shown in the left part of Figure 3, with the increase of w̄_0, the value of v̄_0 increases approximately linearly (noting that the Pearson correlation coefficient is 0.9995, which indicates strong linear correlation). According to its definition, v̄_0 reflects to what extent the policy has adjusted to fit item dimension 0 (more details about how the policy adjusts to different levels of user preference or uncertainty are provided in Appendix A.5). Thus, the result suggests that the policy tends to adjust more for dimensions with higher preference values. In other words, the platform can achieve more commercial benefits from the item dimensions with stronger user preference. We also vary the mean of the item variance at dimension 0, which indicates the user uncertainty over item dimension 0, and record v̄_0, as presented in the right part of Figure 3. By a similar analysis, the result indicates that the platform can achieve more commercial benefits from the item dimensions over which the user has higher uncertainty.
Therefore, the results in Figure 3 indicate that commercial benefits mainly come from item dimensions with either strong user preference or high user uncertainty when we take the impact of the display policy on the user into consideration. This result may contribute to a better understanding of how commercial benefits are achieved and to better conducting commercial recommendation. For instance, from the perspective of the platform, one effective way to improve its commercial benefits is to better balance the user experience and its commercial actions: for item dimensions where the user has a strong preference, the platform should cater to that preference, while for flexible item dimensions where the user has high uncertainty, the platform may achieve more benefits by recommending items with high platform utilities.
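The linearity claim in this analysis rests on the Pearson correlation between the swept preference values and the recorded responses. A minimal stdlib sketch of that check (the arrays below are hypothetical stand-ins for the recorded $(\bar{w}_0, v_0)$ pairs, not the paper's measurements):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical sweep of the preference value and the learned responses.
w0_values = [-1.0, -0.5, 0.0, 0.5, 1.0, 1.5]
v0_values = [-0.42, -0.18, 0.03, 0.26, 0.49, 0.71]  # near-linear response
r = pearson(w0_values, v0_values)  # close to 1 for a near-linear relation
```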

5. CONCLUSION

In this paper, we propose a novel recommendation framework that takes into account the user uncertainty over different item dimensions and the influence of the display policy on the user, both of which can highly affect the commercial benefits of the platform. We derive a closed-form expression for the posterior utility in the case of a linear utility function and provide a sampling-based method to estimate the posterior utility in the case of a non-linear utility function. Based on these results, we formulate the problem of deriving the optimal policy for maximizing the expected posterior utility of the platform as a constrained non-convex optimization problem and further propose an ADMM-based solution to derive an approximately optimal policy. To demonstrate the effectiveness of the technical solutions absorbed into our framework, extensive experiments are conducted over data collected from a real-world recommendation scenario. Furthermore, we provide some practical insights into how the platform achieves its commercial benefits, taking both the user preference and the user uncertainty over different item dimensions into consideration, which might contribute to a better understanding and practice of commercial recommendation.

A APPENDIX

A.1 PROOF OF LEMMA 1

To prove Lemma 1, we first prove the following two propositions.

Proposition 1. $\int_{-\infty}^{+\infty} \Phi(a'x + b')\,\mathcal{N}(x|\mu_0, \sigma_0^2)\,dx = \Phi\!\left(\frac{a'\mu_0 + b'}{\sqrt{1 + a'^2\sigma_0^2}}\right)$, $\forall a', b', \mu_0, \sigma_0 \in \mathbb{R}$.

Proof. By introducing two independent variables $X$ and $Y$ which satisfy $X \sim \mathcal{N}(\mu_0, \sigma_0^2)$ and $Y \sim \mathcal{N}(0, 1)$, we have

$$\int_{-\infty}^{+\infty} \Phi(a'x + b')\,\mathcal{N}(x|\mu_0, \sigma_0^2)\,dx = \mathbb{E}_{X \sim \mathcal{N}(\mu_0, \sigma_0^2)}\big[\Phi(a'X + b')\big] \quad (25)$$
$$= \mathbb{P}(Y < a'X + b') \quad (26)$$
$$= \mathbb{P}(Y - a'X < b') \quad (27)$$
$$= \mathbb{P}\!\left(\frac{Y - a'X + a'\mu_0}{\sqrt{1 + a'^2\sigma_0^2}} < \frac{b' + a'\mu_0}{\sqrt{1 + a'^2\sigma_0^2}}\right) \quad (28)$$
$$= \Phi\!\left(\frac{a'\mu_0 + b'}{\sqrt{1 + a'^2\sigma_0^2}}\right) \quad (29)$$

Note that Equation 26 follows from the definition of the function $\Phi$, i.e., $\Phi(z)$ is the probability that a sample from the standard normal distribution is less than $z$. Since $X \sim \mathcal{N}(\mu_0, \sigma_0^2)$ and $Y \sim \mathcal{N}(0, 1)$, we have $\frac{Y - a'X + a'\mu_0}{\sqrt{1 + a'^2\sigma_0^2}} \sim \mathcal{N}(0, 1)$. Thus, Equation 29 follows directly from the definition of $\Phi$, which concludes the proof.

Proposition 2. Denoting by $\mathrm{erf}$ the error function, which satisfies $\mathrm{erf}(z) = \frac{2}{\sqrt{\pi}}\int_0^z e^{-t^2}\,dt$, we have:

$$\int_{-\infty}^{\infty} x e^{-\frac{x^2}{2}}\,\mathrm{erf}(a'x + b')\,dx = \frac{2a'}{\sqrt{a'^2 + \frac{1}{2}}}\,e^{-\frac{b'^2}{2a'^2 + 1}}, \quad \forall a', b' \in \mathbb{R}.$$

Proof. For the case $a' = 0$, we have:

$$\int_{-\infty}^{\infty} x e^{-\frac{x^2}{2}}\,\mathrm{erf}(b')\,dx = \int_{-\infty}^{0} x e^{-\frac{x^2}{2}}\,\mathrm{erf}(b')\,dx + \int_{0}^{\infty} x e^{-\frac{x^2}{2}}\,\mathrm{erf}(b')\,dx \quad (30)$$
$$= -\int_{0}^{\infty} x e^{-\frac{x^2}{2}}\,\mathrm{erf}(b')\,dx + \int_{0}^{\infty} x e^{-\frac{x^2}{2}}\,\mathrm{erf}(b')\,dx \quad (31)$$
$$= 0 \quad (32)$$
$$= \frac{2a'}{\sqrt{a'^2 + \frac{1}{2}}}\,e^{-\frac{b'^2}{2a'^2 + 1}} \quad (33)$$

For the case $a' \neq 0$, we have:

$$\int_{-\infty}^{\infty} x e^{-\frac{x^2}{2}}\,\mathrm{erf}(a'x + b')\,dx = \Big[-e^{-\frac{x^2}{2}}\,\mathrm{erf}(a'x + b')\Big]_{-\infty}^{\infty} - \int_{-\infty}^{\infty} -e^{-\frac{x^2}{2}}\,\frac{2a'}{\sqrt{\pi}}\,e^{-(a'x + b')^2}\,dx \quad (34)$$
$$= \int_{-\infty}^{\infty} \frac{2a'}{\sqrt{\pi}}\,e^{-(a'^2 + \frac{1}{2})x^2 - 2a'b'x - b'^2}\,dx \quad (35)$$
$$= \int_{-\infty}^{\infty} \frac{2a'}{\sqrt{\pi}}\,e^{-(a'^2 + \frac{1}{2})\big(x + \frac{a'b'}{a'^2 + \frac{1}{2}}\big)^2 + \frac{a'^2 b'^2}{a'^2 + \frac{1}{2}} - b'^2}\,dx \quad (36)$$
$$= \int_{-\infty}^{\infty} \frac{2a'}{\sqrt{\pi}}\,e^{\frac{a'^2 b'^2}{a'^2 + \frac{1}{2}} - b'^2}\,e^{-(a'^2 + \frac{1}{2})z^2}\,dz \quad (37)$$
$$= \frac{2a'}{\sqrt{\pi}}\,e^{\frac{a'^2 b'^2}{a'^2 + \frac{1}{2}} - b'^2}\,\sqrt{\frac{\pi}{a'^2 + \frac{1}{2}}} \quad (38)$$
$$= \frac{2a'}{\sqrt{a'^2 + \frac{1}{2}}}\,e^{-\frac{b'^2}{2a'^2 + 1}} \quad (39)$$

Note that Equation 34 is derived by integration by parts together with the Newton-Leibniz formula. Combining the above two cases, we have:

$$\int_{-\infty}^{\infty} x e^{-\frac{x^2}{2}}\,\mathrm{erf}(a'x + b')\,dx = \frac{2a'}{\sqrt{a'^2 + \frac{1}{2}}}\,e^{-\frac{b'^2}{2a'^2 + 1}}, \quad \forall a', b' \in \mathbb{R}.$$

With Proposition 1 and Proposition 2, we now prove Lemma 1.

Proof of Lemma 1. By definition, $\Phi(z) = \frac{1 + \mathrm{erf}(z/\sqrt{2})}{2}$. Thus, we have:

$$\int_{-\infty}^{+\infty} (c'x + d')\,\Phi(a'x + b')\,\mathcal{N}(x|0, 1)\,dx \quad (40)$$
$$= \int_{-\infty}^{+\infty} \frac{1}{2\sqrt{2\pi}}\,(c'x)\Big(1 + \mathrm{erf}\Big(\frac{a'x + b'}{\sqrt{2}}\Big)\Big)e^{-\frac{x^2}{2}}\,dx + \int_{-\infty}^{+\infty} d'\,\Phi(a'x + b')\,\mathcal{N}(x|0, 1)\,dx \quad (41)$$
$$= \frac{a'c'}{\sqrt{2\pi(a'^2 + 1)}}\,e^{-\frac{b'^2}{2a'^2 + 2}} + d'\,\Phi\!\Big(\frac{b'}{\sqrt{1 + a'^2}}\Big) \quad (42)$$

The last equation is derived by utilizing Proposition 1 and Proposition 2, which completes the proof.
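As a sanity check on Lemma 1's closed form, the integral can be compared against direct numerical integration using only the standard library ($\Phi$ is expressed via `math.erf`); the particular constants $a', b', c', d'$ below are arbitrary test values:

```python
import math

def Phi(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def lemma1_closed_form(a, b, c, d):
    """Closed form of the integral of (c*x + d)*Phi(a*x + b)*N(x|0,1)."""
    first = (a * c) / math.sqrt(2.0 * math.pi * (a * a + 1.0)) \
        * math.exp(-b * b / (2.0 * a * a + 2.0))
    second = d * Phi(b / math.sqrt(1.0 + a * a))
    return first + second

def lemma1_numeric(a, b, c, d, lo=-12.0, hi=12.0, n=48000):
    """Trapezoidal approximation of the same integral (tails are negligible)."""
    h = (hi - lo) / n
    def f(x):
        pdf = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
        return (c * x + d) * Phi(a * x + b) * pdf
    total = 0.5 * (f(lo) + f(hi))
    for i in range(1, n):
        total += f(lo + i * h)
    return total * h

# Arbitrary test constants.
a, b, c, d = 1.3, -0.7, 2.0, 0.5
```

The two values agree to high precision for any real constants, including $a' = 0$, where the first term vanishes and only $d'\,\Phi(b')$ remains.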

A.2 PROOF OF COROLLARY 1

In order to prove Corollary 1, we first prove the following two propositions.

Proposition 3. If $\Sigma_{u,i}$ is positive definite, then $\int_{\mathbb{R}^n} h(x)\,\mathcal{N}(x|\mu_{u,i}, \Sigma_{u,i})\,dx = \int_{\mathbb{R}^n} h(A_{u,i}^{-1} z + \mu_{u,i})\,\mathcal{N}(z|0, I)\,dz$, where $A_{u,i}$ is the upper triangular matrix obtained by the Cholesky decomposition of $\Sigma_{u,i}^{-1}$, i.e., $\Sigma_{u,i}^{-1} = A_{u,i}^T A_{u,i}$.

Proof. If $\Sigma_{u,i}$ is positive definite, $\Sigma_{u,i}^{-1}$ is also positive definite. We can then conduct the Cholesky decomposition $\Sigma_{u,i}^{-1} = A_{u,i}^T A_{u,i}$, where $A_{u,i}$ is an upper triangular matrix with real and positive diagonal entries. By integration by substitution with $z = A_{u,i}(x - \mu_{u,i})$, noting that $(x - \mu_{u,i})^T \Sigma_{u,i}^{-1} (x - \mu_{u,i}) = z^T z$ and $|\det A_{u,i}| = \det(\Sigma_{u,i})^{-1/2}$, we have

$$\int_{\mathbb{R}^n} h(x)\,\mathcal{N}(x|\mu_{u,i}, \Sigma_{u,i})\,dx = \int_{\mathbb{R}^n} h(A_{u,i}^{-1} z + \mu_{u,i})\,\mathcal{N}(z|0, I)\,dz, \quad (44)$$

which concludes the proof.

The proof of Corollary 1 then combines these results: Equations 59 and 60 are derived from Proposition 3 and Proposition 4, respectively, yielding

$$\Phi\!\left(\frac{a'\, v_u^T \mu_{u,i} + b'}{\sqrt{1 + a'^2\, v_u^T \Sigma_{u,i} v_u}}\right),$$

which concludes the proof.

A.3 PROPOSITION ADOPTED FOR PROVING COROLLARY 2

Proposition 5. $\int_{\mathbb{R}^n} (u^T x + b')\,\Phi(v^T x + a')\,\mathcal{N}(x|0, I)\,dx = \frac{v^T u}{\sqrt{2\pi(v^T v + 1)}}\,e^{-\frac{a'^2}{2(v^T v + 1)}} + b'\,\Phi\!\big(\frac{a'}{\sqrt{v^T v + 1}}\big)$, $\forall u, v \in \mathbb{R}^n$ and $a', b' \in \mathbb{R}$.

Proof. Denoting $\tilde{u}_i = b' + \sum_{j=i+1}^{n} u_j x_j$ and $\tilde{v}_i = a' + \sum_{j=i+1}^{n} v_j x_j$, we have:

$$\int_{\mathbb{R}^n} (u^T x + b')\,\Phi(v^T x + a')\,\mathcal{N}(x|0, I)\,dx \quad (61)$$
$$= \int_{-\infty}^{+\infty}\!\!\cdots\!\int_{-\infty}^{+\infty} \Big(b' + \sum_{j=1}^{n} u_j x_j\Big)\,\Phi\Big(a' + \sum_{j=1}^{n} v_j x_j\Big)\,\mathcal{N}(x_1|0,1)\,dx_1 \cdots \mathcal{N}(x_n|0,1)\,dx_n \quad (62)$$
$$= \int_{-\infty}^{+\infty}\!\!\cdots\!\int_{-\infty}^{+\infty} (u_1 x_1 + \tilde{u}_1)\,\Phi(v_1 x_1 + \tilde{v}_1)\,\mathcal{N}(x_1|0,1)\,dx_1 \cdots \mathcal{N}(x_n|0,1)\,dx_n \quad (63)$$
$$= \int_{-\infty}^{+\infty}\!\!\cdots\!\int_{-\infty}^{+\infty} \Big(\frac{v_1 u_1}{\sqrt{2\pi(v_1^2 + 1)}}\,e^{-\frac{\tilde{v}_1^2}{2(v_1^2 + 1)}} + \tilde{u}_1\,\Phi\Big(\frac{\tilde{v}_1}{\sqrt{1 + v_1^2}}\Big)\Big)\,\mathcal{N}(x_2|0,1)\,dx_2 \cdots \mathcal{N}(x_n|0,1)\,dx_n \quad (64)$$
$$= \int_{-\infty}^{+\infty}\!\!\cdots\!\int_{-\infty}^{+\infty} \Big(\frac{v_1 u_1}{\sqrt{2\pi(v_1^2 + 1)}}\,e^{-\frac{(v_2 x_2 + \tilde{v}_2)^2}{2(v_1^2 + 1)}} + (u_2 x_2 + \tilde{u}_2)\,\Phi\Big(\frac{v_2 x_2 + \tilde{v}_2}{\sqrt{1 + v_1^2}}\Big)\Big)\,\mathcal{N}(x_2|0,1)\,dx_2 \cdots \mathcal{N}(x_n|0,1)\,dx_n \quad (65)$$
$$= \int_{-\infty}^{+\infty}\!\!\cdots\!\int_{-\infty}^{+\infty} \Big(\frac{v_1 u_1}{\sqrt{2\pi(v_1^2 + v_2^2 + 1)}}\,e^{-\frac{(v_3 x_3 + \tilde{v}_3)^2}{2(v_1^2 + v_2^2 + 1)}} + \frac{v_2 u_2}{\sqrt{2\pi(v_1^2 + v_2^2 + 1)}}\,e^{-\frac{(v_3 x_3 + \tilde{v}_3)^2}{2(v_1^2 + v_2^2 + 1)}} + (u_3 x_3 + \tilde{u}_3)\,\Phi\Big(\frac{v_3 x_3 + \tilde{v}_3}{\sqrt{1 + v_1^2 + v_2^2}}\Big)\Big)\,\mathcal{N}(x_3|0,1)\,dx_3 \cdots \mathcal{N}(x_n|0,1)\,dx_n \quad (66)$$
$$= \int_{-\infty}^{+\infty}\!\!\cdots\!\int_{-\infty}^{+\infty} \Big(\frac{v_1 u_1 + v_2 u_2}{\sqrt{2\pi(v_1^2 + v_2^2 + 1)}}\,e^{-\frac{(v_3 x_3 + \tilde{v}_3)^2}{2(v_1^2 + v_2^2 + 1)}} + (u_3 x_3 + \tilde{u}_3)\,\Phi\Big(\frac{v_3 x_3 + \tilde{v}_3}{\sqrt{1 + v_1^2 + v_2^2}}\Big)\Big)\,\mathcal{N}(x_3|0,1)\,dx_3 \cdots \mathcal{N}(x_n|0,1)\,dx_n \quad (67)$$
$$= \cdots = \frac{v^T u}{\sqrt{2\pi(v^T v + 1)}}\,e^{-\frac{a'^2}{2(v^T v + 1)}} + b'\,\Phi\Big(\frac{a'}{\sqrt{v^T v + 1}}\Big),$$

which completes the proof.

A.4 EXPERIMENTS ON THE IMPORTANCE SAMPLING BASED POSTERIOR UTILITY ESTIMATION

As described in Section 3.2, the importance sampling technique can be adopted to estimate the value of the posterior utility. However, the number of samples used for each posterior utility estimation can highly affect the results. Extra experiments are conducted to verify the effectiveness of the importance sampling based posterior utility estimation, as shown in Table 2. When we vary the number of samples per estimation from 1 to 1000, the Root Mean Square Error (RMSE) between the estimated values and the exact values of the posterior utilities decreases from 0.0369 to 0.0011, which indicates that a larger number of samples per estimation leads to a more accurate approximation.

We also utilize the proposed ADMM-based solution to maximize the expected posterior utility and record the mean and variance of the expected posterior utilities of the last 50 optimization iterations for each sampling setting. For the case with the number of samples per estimation set to 100 (or 1000), the variance of the expected posterior utilities is very small while the final learned value of the expected posterior utility is around 0.2681, which is similar to the result of optimization with the exact value of the posterior utility and thus verifies the effectiveness of the proposed importance sampling based posterior utility estimation.

A.5 ADDITIONAL EXPERIMENTS TO ILLUSTRATE HOW THE POLICY ADJUSTS W.R.T. DIFFERENT LEVELS OF USER PREFERENCE OR UNCERTAINTY

We also provide analysis of how different levels of user preference affect the final learned policy over items. For a randomly sampled user, we vary $\bar{w}_0$ from -1.0 to 1.5 and keep the values of the other dimensions unchanged. We also randomly pick 16 items from the candidate item set, sort these items according to the value of dimension 0 of the item mean vector, and denote the sorted items as $i_1, i_2, \ldots, i_{16}$. For each case (with a specific value of $\bar{w}_0$), we record the final learned policy (recommendation probabilities) over the picked items and present the heat map in Figure 4, where darker color represents larger recommendation probability. As shown in Figure 4, as the value of $\bar{w}_0$ increases, the learned recommendation probabilities of items with a large value of dimension 0 (e.g., $i_{15}$ and $i_{16}$) increase while those of items with a small value decrease. Thus, to maximize the platform's commercial benefits, the platform tends to recommend items that match the user's preference over item dimensions with high user preference. Similar results can be derived when we analyze how the learned policy adjusts to different levels of user uncertainty over item dimensions. Thus, the proposed framework reveals that the policy adjusts more for item dimensions with higher preference or uncertainty so as to maximize the commercial benefits of the platform.
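The importance sampling based posterior utility estimation discussed above can be sketched in a one-dimensional toy form: an expectation of a sigmoid utility under a Gaussian belief is estimated by sampling from a fixed proposal and reweighting by the density ratio. Everything here (the proposal, the belief parameters, the sigmoid utility) is an illustrative assumption, not the paper's exact setup:

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def importance_sampling_utility(w, mu, sigma, n_samples, rng):
    """Estimate E_{x ~ N(mu, sigma^2)}[sigmoid(w*x)] by drawing from a
    N(0, 2^2) proposal and reweighting with the density ratio."""
    prop_sigma = 2.0
    total = 0.0
    for _ in range(n_samples):
        x = rng.gauss(0.0, prop_sigma)
        weight = normal_pdf(x, mu, sigma) / normal_pdf(x, 0.0, prop_sigma)
        total += weight * sigmoid(w * x)
    return total / n_samples

rng = random.Random(0)
# More samples per estimation -> lower error, mirroring the RMSE trend in Table 2.
rough = importance_sampling_utility(1.5, 0.4, 0.8, 10, rng)
fine = importance_sampling_utility(1.5, 0.4, 0.8, 100000, rng)
```

With few samples the estimate is noisy, and with many it concentrates near the exact expectation, which is the behavior the RMSE experiment quantifies.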



Item dimensions: typical state-of-the-art recommendation systems encode each item as an embedding. The item dimensions refer to the different dimensions of the item embedding, which can be interpreted as different high-order features. Similar to the derivation of the posterior utility, we can derive that (i) for the linear case, $U_u(i) = w_u^T \mu_{u,i}$; (ii) for the non-linear case, $U_u(i) = \int_{\mathbb{R}^n} f(w_u, x)\,p_{u,i}(x)\,dx$.



Figure 2: Optimization curves w.r.t. different numbers of samples per estimation

Figure 3: Influence of user preference and item variance

Figure 4: Learned policy w.r.t. different levels of user preference

Figure 5: Learned policy w.r.t. different levels of user uncertainty




Table 2: Statistics of posterior utilities w.r.t. the number of samples per estimation

Notes on Proposition 4. If $\Sigma_{u,i}$ is positive definite and $A_{u,i}$ is the upper triangular matrix obtained by the Cholesky decomposition of $\Sigma_{u,i}^{-1}$, then, denoting $\tilde{v}_i = a' + \sum_{j=i+1}^{n} v_j x_j$ for $v \in \mathbb{R}^n$ and $a' \in \mathbb{R}$, Equations 52 and 54 are derived from Lemma 1 (or Proposition 1); similarly, Equation 56 is achieved by applying Lemma 1 (or Proposition 1) iteratively, which completes the proof. With Proposition 3 and Proposition 4, we can then prove Corollary 1: if $\Sigma_{u,i}$ is positive definite, $\Sigma_{u,i}^{-1}$ is also positive definite, and we can conduct the Cholesky decomposition $\Sigma_{u,i}^{-1} = A_{u,i}^T A_{u,i}$, where $A_{u,i}$ is an upper triangular matrix with real and positive diagonal entries.

