PROVABLY EFFICIENT REINFORCEMENT LEARNING FOR ONLINE ADAPTIVE INFLUENCE MAXIMIZATION

Abstract

Online influence maximization aims to maximize the influence spread of content in a social network with an unknown network model by selecting a few seed nodes. Recent studies have followed a non-adaptive setting, where the seed nodes are selected before the start of the diffusion process and network parameters are updated when the diffusion stops. We consider an adaptive version of the content-dependent online influence maximization problem, where seed nodes are sequentially activated based on real-time feedback. In this paper, we formulate the problem as an infinite-horizon discounted MDP under a linear diffusion process and present a model-based reinforcement learning solution. Our algorithm maintains a network model estimate and selects seed users adaptively, exploring the social network while optimistically improving its policy. We establish an O(√T) regret bound for our algorithm. Empirical evaluations on synthetic and real-world networks demonstrate the efficiency of our algorithm.

1. INTRODUCTION

Influence Maximization (IM) (Kempe et al., 2003; Kitsak et al., 2010; Centola & Macy, 2007), motivated by real-world social-network applications such as viral marketing, has been extensively studied over the past two decades. In viral marketing, a marketer selects a set of users (seed nodes) with significant influence for content promotion. These selected users are expected to influence their social-network neighbors, and such influence propagates across the network. With a limited number of seed nodes, the goal of IM is to maximize the information spread over the network. A typical IM formulation models the social network as a directed graph whose edge weights are the propagation probabilities between users. Influence propagation is commonly modeled by a stochastic diffusion process, such as the independent cascade (IC) model or the linear threshold (LT) model (Kempe et al., 2003). A popular variant is topic-aware IM (Chen et al., 2015; 2016), where the activation probabilities are content-dependent and personalized, i.e., edge weights differ when propagating different contents. Classical influence maximization solutions are studied in an offline setting, assuming the activation probabilities are given (Kempe et al., 2003; Chen et al., 2009; 2010). However, this information may not be fully observable in many real-world applications. Online influence maximization (Chen et al., 2013; Wen et al., 2017; Vaswani et al., 2017) has recently attracted significant attention to tackle this problem: an agent learns the activation probabilities by repeatedly interacting with the network.
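To make the diffusion process concrete, the following is a minimal sketch of one stochastic simulation under the independent cascade model described above. The graph representation (`dict` of weighted adjacency lists) and function name are illustrative, not from any specific library:

```python
import random

def simulate_ic(graph, seeds, rng=random.Random(0)):
    """One stochastic diffusion under the independent cascade (IC) model.

    graph: dict mapping node -> list of (neighbor, activation_probability) edges.
    seeds: initially activated nodes.
    Returns the set of all activated nodes; the influence spread is its size.
    """
    active = set(seeds)
    frontier = list(seeds)
    while frontier:
        newly_active = []
        for u in frontier:
            for v, p in graph.get(u, []):
                # Each newly active node gets exactly one chance to
                # activate each of its inactive out-neighbors.
                if v not in active and rng.random() < p:
                    active.add(v)
                    newly_active.append(v)
        frontier = newly_active
    return active
```

The (offline) IM objective is then to choose a seed set of bounded size maximizing the expected size of the returned set over many such simulations.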
Most existing works formulate online influence maximization as a multi-armed bandit problem with non-adaptive batch decisions: at each round, the seed nodes are computed prior to the diffusion process by balancing exploration of the unknown network against maximization of the influence spread; the agent observes either edge-level (Chen et al., 2013; Wen et al., 2017; Wu et al., 2019) or node-level (Vaswani et al., 2017; Li et al., 2020) activations once the diffusion finishes, and then updates its model. Combinatorial multi-armed bandit (Chen et al., 2013; Wang & Chen, 2017) and combinatorial linear bandit (Wen et al., 2017; Wu et al., 2019) algorithms have been proposed as solutions, most of which follow the independent cascade model with edge-level feedback. In contrast to the non-adaptive setting, adaptive influence maximization allows the agent to select seed nodes sequentially after observing partial diffusion results (Golovin & Krause, 2011; Tong et al., 2016; Peng & Chen, 2019). The agent can achieve a higher influence spread since its decisions adapt to the real-time feedback of the diffusion. In viral marketing, the agent could observe partial diffusion feedback from customers and adjust the campaign for the remaining budget based on the current diffusion state. Unfortunately, online influence maximization in the adaptive setting is under-explored. Previous bandit-based solutions cannot be applied because the decisions of bandit algorithms are independent of the network state. In this paper, we study the content-dependent online adaptive influence maximization problem: at each round, the agent selects a user-content pair to activate based on the current network state, observes the immediate diffusion feedback, and updates its policy in real time. The network's activation probabilities are content-dependent and unknown to the agent. The agent's goal is to maximize the total influence spread.
We formulate this problem as an infinite-horizon discounted Markov decision process (MDP), where the state is the users' current activation status under different contents (user-content pairs), an action picks a user-content pair as the new seed, and the total reward is the discounted sum of active user counts. Specifically, we study the problem under the independent cascade model with node-level feedback. Similar to combinatorial linear bandits (Wen et al., 2017; Vaswani et al., 2017), we formulate a tensor network diffusion process in which activation probabilities are assumed to be linear with respect to both user and content features. To handle node-level feedback, we propose a Bernoulli independent cascade model, a linear approximation to the classic IC model, which otherwise requires edge-level feedback to learn. We propose a model-based reinforcement learning (RL) algorithm to learn the optimal adaptive policy. Our approach builds on prior work on bandit-based influence maximization (Chen et al., 2013; Wen et al., 2017; Wu et al., 2019) and has the following distinct features: (1) Our adaptive IM policy makes decisions and updates its policy on the fly, without waiting for the end of the diffusion process; (2) Our algorithm takes real-time feedback from the network into account, thus approaching a dynamic-optimal policy and outperforming bandit-based static-optimal solutions; (3) Our algorithm learns from node-level feedback, which greatly relaxes the common edge-level feedback assumption of previous works under the IC model; (4) Our policy handles content-dependent networks and selects the best content for the right user in the campaign; (5) To improve computational efficiency, we adopt the slow-switching strategy (Abbasi-Yadkori et al., 2011) that updates model parameters only O(d log T) times, where d is the feature space dimension.
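For intuition on the linear tensor diffusion assumption above, the sketch below models an activation probability as the inner product of an unknown parameter vector with the (flattened) outer product of an edge feature and a content feature. The function name, feature construction, and clipping are illustrative assumptions, not the paper's exact estimator:

```python
import numpy as np

def activation_prob(theta, x_edge, y_content):
    """Linear tensor diffusion sketch: activation probability of an edge under
    a given content, linear in both the edge and content features.

    theta: unknown parameter vector of length len(x_edge) * len(y_content).
    """
    # Tensor (outer-product) feature combining edge and content information.
    feat = np.outer(x_edge, y_content).ravel()
    p = float(theta @ feat)
    # Clip so the linear score is a valid probability.
    return min(max(p, 0.0), 1.0)
```

Under this parameterization, observing node-level activations yields (noisy) linear measurements of `theta`, which is what makes a regression-based model estimate feasible.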
Our contributions are summarized as follows:
• We propose a linear tensor diffusion model for content propagation in social networks and formulate the problem as an infinite-horizon discounted MDP.
• We propose a tensor-regression-based RL influence maximization algorithm with optimistic planning that learns an adaptive policy from node-level feedback, selecting the content and the next seed user based on the current state of the network.
• We prove an O(d√T/∆ + √(dNKT))* regret bound for our algorithm, where T is the total number of rounds, N is the number of users, K is the number of contents, ∆ is the diffusion decay coefficient, and d is the dimension of the user and content features. To the best of our knowledge, this is the first sublinear regret bound for online adaptive influence maximization.
• We empirically validate on synthetic and real-world social networks that our algorithm explores the unknown network more thoroughly than conventional bandit methods, achieving a larger influence spread.

Related Works. The classical works on (offline) influence maximization (Kempe et al., 2003; Chen et al., 2009; 2010) assume the network model, i.e., the activation probabilities, is known to the agent, and the goal is to maximize the influence spread, i.e., the total number of activated users. IM has been studied in a non-adaptive setting, where the agent chooses the seed nodes before the diffusion starts (Kempe et al., 2003; Chen et al., 2009; 2010; Bourigault et al., 2016; Netrapalli & Sanghavi, 2012; Saito et al., 2008), and in an adaptive setting, where the agent sequentially selects seed nodes adapting to the current diffusion results (Golovin & Krause, 2011; Tong et al., 2016; Han et al., 2018; Peng & Chen, 2019; Tong & Wang, 2020). Online influence maximization (Chen et al., 2013; 2015; Lei et al., 2015; Lugosi et al., 2019; Perrault et al., 2020; Zuo et al., 2022) was proposed to learn the network model while selecting seed nodes in the non-adaptive setting.
Existing works on online IM mostly follow the IC model with edge-level feedback (Chen et al., 2013; Wang & Chen, 2017; Wen et al., 2017; Vaswani et al., 2017; Lugosi et al., 2019). Chen et al. (2013) and Wang & Chen (2017) formulated the online IM problem as a combinatorial bandit problem and proposed combinatorial upper confidence

* O(·) ignores all logarithmic terms.

