PROVABLY EFFICIENT REINFORCEMENT LEARNING FOR ONLINE ADAPTIVE INFLUENCE MAXIMIZATION

Abstract

Online influence maximization aims to maximize the influence spread of content in a social network with an unknown network model by selecting a few seed nodes. Recent studies followed a non-adaptive setting, where the seed nodes are selected before the start of the diffusion process and network parameters are updated when the diffusion stops. We consider an adaptive version of the content-dependent online influence maximization problem, where the seed nodes are sequentially activated based on real-time feedback. In this paper, we formulate the problem as an infinite-horizon discounted MDP under a linear diffusion process and present a model-based reinforcement learning solution. Our algorithm maintains a network model estimate and selects seed users adaptively, exploring the social network while improving its policy optimistically. We establish an $O(\sqrt{T})$ regret bound for our algorithm. Empirical evaluations on synthetic and real-world networks demonstrate the efficiency of our algorithm.

1. INTRODUCTION

Influence Maximization (IM) (Kempe et al., 2003; Kitsak et al., 2010; Centola & Macy, 2007), motivated by real-world social-network applications such as viral marketing, has been extensively studied in the past decades. In viral marketing, a marketer selects a set of users (seed nodes) with significant influence for content promotion. These selected users are expected to influence their social network neighbors, and such influence will be propagated across the network. With limited seed nodes, the goal of IM is to maximize the information spread over the network. A typical IM formulation models the social network as a directed graph whose edge weights are the propagation probabilities across users. Influence propagation is commonly modeled by a stochastic diffusion process, such as the independent cascade (IC) model or the linear threshold (LT) model (Kempe et al., 2003). A popular variant is topic-aware IM (Chen et al., 2015; 2016), where the activation probabilities are content-dependent and personalized, i.e., edge weights differ when propagating different contents. Classical influence maximization solutions are studied in an offline setting, assuming activation probabilities are given (Kempe et al., 2003; Chen et al., 2009; 2010). However, this information may not be fully observable in many real-world applications. Online influence maximization (Chen et al., 2013; Wen et al., 2017; Vaswani et al., 2017), where an agent learns the activation probabilities by repeatedly interacting with the network, has recently attracted significant attention as a way to tackle this problem.
Most existing works on online influence maximization are formulated as a multi-armed bandit problem with a non-adaptive batch decision: at each round, the seed nodes are computed prior to the diffusion process by balancing exploration of the unknown network against maximization of the influence spread; the agent observes either edge-level (Chen et al., 2013; Wen et al., 2017; Wu et al., 2019) or node-level (Vaswani et al., 2017; Li et al., 2020) activations when the diffusion finishes and then updates its model. Combinatorial multi-armed bandit (Chen et al., 2013; Wang & Chen, 2017) and combinatorial linear bandit (Wen et al., 2017; Wu et al., 2019) algorithms have been proposed as solutions, and most of these works follow the independent cascade model with edge-level feedback. In contrast to the non-adaptive setting, adaptive influence maximization allows the agent to select seed nodes sequentially after observing partial diffusion results (Golovin & Krause, 2011; Tong et al., 2016; Peng & Chen, 2019). The agent can achieve a higher influence spread since its decisions adapt to the real-time feedback of the diffusion. In viral marketing, for example, the agent could observe partial diffusion feedback from customers and adjust the campaign for the remaining budget based on the current diffusion state. Unfortunately, online influence maximization in an adaptive setting is under-explored. Previous bandit-based solutions cannot be applied because the decisions of bandit algorithms are independent of the network state. In this paper, we study the content-dependent online adaptive influence maximization problem: at each round, the agent selects a user-content pair to activate based on the current network state, observes the immediate diffusion feedback, and updates its policy in real time. The network's activation probabilities are content-dependent and unknown to the agent. The agent's goal is to maximize the total influence spread.
We formulate this problem as an infinite-horizon discounted Markov decision process (MDP), where the state is the users' current activation status under different contents (user-content pairs), an action picks a user-content pair as the new seed, and the total reward is the discounted sum of active user counts. Specifically, we study the problem under the independent cascade model with node-level feedback. Similar to combinatorial linear bandits (Wen et al., 2017; Vaswani et al., 2017), we formulate a tensor network diffusion process where activation probabilities are assumed to be linear with respect to both user and content features. To tackle the problem of node-level feedback, we propose a Bernoulli independent cascade model, a linear approximation to the classic IC model, which itself requires edge-level feedback to learn. We propose a model-based reinforcement learning (RL) algorithm to learn the optimal adaptive policy. Our approach builds on prior work on bandit-based influence maximization algorithms (Chen et al., 2013; Wen et al., 2017; Wu et al., 2019) and has the following distinct features: (1) our adaptive IM policy makes decisions and updates the policy on the fly, without waiting until the end of the diffusion process; (2) our algorithm takes real-time feedback from the network into consideration, thus approaching a dynamic-optimal policy and outperforming bandit-based static-optimal solutions; (3) our algorithm learns from node-level feedback, which greatly relaxes the edge-level feedback assumption common in previous works on the IC model; (4) our policy can handle content-dependent networks and select the best content for the right user in the campaign; (5) to improve computational efficiency, we adopt the slow switching strategy (Abbasi-Yadkori et al., 2011), which updates the model parameter only $O(d \log T)$ times, where $d$ is the feature space dimension.
Our contributions are summarized as follows:
• We propose a linear tensor diffusion model for content propagation in social networks and formulate the problem as an infinite-horizon discounted MDP.
• We propose a tensor-regression-based RL influence maximization algorithm with optimistic planning that learns an adaptive policy from node-level feedback; the policy selects the content and the next seed user based on the current state of the network.
• We prove an $O(d\sqrt{T}/\Delta + \sqrt{dNKT})$ regret bound for our algorithm, where $T$ is the total number of rounds, $N$ is the number of users, $K$ is the number of contents, $\Delta$ is the coefficient for diffusion decay, and $d$ is the dimension related to the user and content features. To the best of our knowledge, this is the first sublinear regret bound for online adaptive influence maximization.
• We empirically validate on synthetic and real-world social networks that our algorithm explores the unknown network more thoroughly than conventional bandit methods, achieving larger influence spread.

Related Works. The classical works on (offline) influence maximization (Kempe et al., 2003; Chen et al., 2009; 2010) assume the network model, i.e., the activation probabilities, is known to the agent, and the goal is to maximize the influence spread, i.e., the total number of activated users. IM has been studied in a non-adaptive setting, where the agent chooses the seed nodes before the diffusion starts (Kempe et al., 2003; Chen et al., 2009; 2010; Bourigault et al., 2016; Netrapalli & Sanghavi, 2012; Saito et al., 2008), and in an adaptive setting, where the agent sequentially selects the seed nodes in response to the current diffusion results (Golovin & Krause, 2011; Tong et al., 2016; Han et al., 2018; Peng & Chen, 2019; Tong & Wang, 2020). Online influence maximization (Chen et al., 2013; 2015; Lei et al., 2015; Lugosi et al., 2019; Perrault et al., 2020; Zuo et al., 2022) was proposed to learn the network model while selecting seed nodes in the non-adaptive setting. Existing work on online IM mostly follows the IC model with edge-level feedback (Chen et al., 2013; Wang & Chen, 2017; Wen et al., 2017; Vaswani et al., 2017; Lugosi et al., 2019). Chen et al. (2013) and Wang & Chen (2017) formulated the online IM problem as a combinatorial bandit problem and proposed the combinatorial upper confidence bound (CUCB) algorithm to estimate the activation probabilities of edges in a tabular manner. Wen et al. (2017) assumed a linear parameterization on each edge with known edge features and proposed a linear-bandit-based solution. Our paper is the first to consider online influence maximization in the adaptive setting and to formulate it as an RL problem. We can also handle the more challenging node-level feedback.

Some recent works have explored settings beyond the IC model and edge-level feedback. Li et al. (2020) studied online IM with the linear threshold model and proposed a linear-bandit-based solution that exploits the linearity of the LT model to handle node-level feedback. We also leverage linearity in the diffusion model to handle node-level feedback, similar to Li et al. (2020), but for the IC model. Vaswani et al. (2017) considered a diffusion-model-independent setting using a heuristic objective function, but without a theoretical guarantee for the heuristic. Olkhovskaya et al. (2018) studied a UCB-based algorithm for node-level feedback, but their algorithm is designed only for certain random graph models such as stochastic block models and Chung-Lu models.

Our analysis is related to the regret analysis of model-based reinforcement learning, which has been studied in various settings such as tabular MDPs (Auer et al., 2008), linear/kernel MDPs (Yang & Wang, 2020; Yang et al., 2020), factored MDPs (Rosenberg & Mansour, 2021), and general model classes (Ayoub et al., 2020). We provide a first problem-specific analysis for influence maximization. Our analysis differs from existing regret analyses in two ways. First, although we focus on a linear model for network diffusion, the state-to-state transition of IM is highly nonlinear, so the value and Q functions for IM do not admit a linear model, which invalidates linear/kernel MDP approaches. Second, due to the nature of the network diffusion process, the state and its value can grow unboundedly for large networks, causing unbounded variance at the same time. Our analysis is specially tailored to such a growth process over large networks and derives a regret bound by focusing on a high-probability event where states stay bounded. To the best of our knowledge, this is the first IM-specific regret analysis for controlling an unbounded growth process over large networks.

2. PROBLEM FORMULATION

We present a tensor network diffusion process to model network propagation that depends on both user features and content features. Our goal is to both select seed users and customize contents for influence maximization. Further, we formulate IM as an RL problem to enable much finer control of the network diffusion process based on real-time feedback.

2.1. TENSOR NETWORK DIFFUSION PROCESS

Consider a social network of $N$ users, where the network structure may be hidden. Let there be $K$ choices of contents. Let $s_{i,k} \in \{0,1\}$ denote the status of a user-content pair, i.e., $s_{i,k} = 1$ if user $i$ is actively tweeting content $k$. The full state of the network is denoted by $s \in \{0,1\}^{N \times K}$, a binary matrix. We focus on the asymptotic regime of large networks, i.e., $N$ can be arbitrarily large or even $N \to \infty$. We assume that each content can be propagated from one user to multiple users following an independent network diffusion process.

Assumption 1 (Bernoulli Independent Cascade Model). Let $s'$ be the next state. For each $k \in [K]$, we assume there is an underlying connectivity matrix $A^k \in \mathbb{R}^{N \times N}$ such that
$$P(s'_{i,k} = 1 \mid s) = \sum_{j \in [N]} A^k_{i,j}\, s_{j,k},$$
and we assume the $s'_{i,k}$'s are independent conditioned on $s$.

Here $A^k_{i,j}$ measures the level of influence user $j$ has over user $i$ for the $k$-th content. Therefore, the aggregate "influence" received by user $i$ is $\sum_j A^k_{i,j} s_{j,k}$. We model the status of user $i$ as a Bernoulli variable parameterized by the aggregate "influence" received by user $i$. Our model is closely related to the independent cascade model (Kempe et al., 2003). In the IC model, the activation probability takes the form
$$P(s'_{i,k} = 1 \mid s) = 1 - \prod_j \big(1 - A^k_{i,j}\, s_{j,k}\big).$$
A limitation is that efficient estimation of the IC model requires edge-level observations (Chen et al., 2013; Wang & Chen, 2017). Assumption 1 can be viewed as a linearized approximation to the IC model, i.e., $1 - \prod_j (1 - A^k_{i,j} s_{j,k}) \approx \sum_j A^k_{i,j} s_{j,k}$ when all the $A$ values are tiny (see Assumption 3). In Appendix F, we extend the assumption to the generalized linear setting and establish the regret bound for our algorithm.

Consider a parameterized network diffusion model based on user features and content features. Let the $i$-th user be associated with a user feature vector $x_i \in \mathbb{R}^{d_1}$, for all $i \in [N]$. Let the $k$-th content be associated with a content feature $\theta_k \in \mathbb{R}^{d_2}$, for all $k \in [K]$.
We assume that the influence is linear with respect to both user and content features.

Assumption 2 (Content-Dependent Linear Tensor Model). There exists a $d_1 \times d_1 \times d_2$ tensor $T^* \in \mathbb{R}^{d_1 \times d_1 \times d_2}$ such that
$$A^k_{i,j} = \langle T^*, x_i \otimes x_j \otimes \theta_k \rangle,$$
where $\otimes$ denotes the outer product and $\langle \cdot, \cdot \rangle$ denotes the inner product.

Note that this is different from the linear MDP model commonly studied in the theoretical RL literature (Jin et al., 2020). We focus on large networks where $N$ can be arbitrarily large. We also assume each individual user has bounded influence over its neighbors and the diffusion process has a natural decay property.

Assumption 3 (Uniform transition probability upper bound). There exists a constant $C > 0$ such that $\|A\|_\infty \le \frac{C}{NK}$, i.e., every entry satisfies $A^k_{i,j} \le \frac{C}{NK}$.

Assumption 4 (Diffusion decay). There exists $\Delta > 0$ such that $\sum_{i \in [N]} A^k_{i,j} \le 1 - \Delta$ for all $k, j$.

Assumption 4 says that influence from any seed user has a discounting nature; without this assumption, some seed user may have infinitely long influence and make the diffusion process unbounded. This assumption also implies that the "influence" of any seed user-content pair lasts $O(1/\Delta)$ time steps on average.
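To make Assumptions 1 and 2 concrete, the following minimal sketch builds a single-content connectivity matrix from a small random core tensor and compares the linearized activation probability with the classic IC probability. The helper names (`inner_with_outer`, `connectivity`, `linear_prob`, `ic_prob`) and the toy sizes are hypothetical choices made for illustration, not part of the paper.

```python
import math
import random

def inner_with_outer(T, x, y, theta):
    """<T, x (outer) y (outer) theta> for T stored as nested lists of shape d1 x d1 x d2."""
    return sum(
        T[a][b][c] * x[a] * y[b] * theta[c]
        for a in range(len(x))
        for b in range(len(y))
        for c in range(len(theta))
    )

def connectivity(T, X, theta):
    """A[i][j] = <T, x_i (outer) x_j (outer) theta> for one content (Assumption 2)."""
    n = len(X)
    return [[inner_with_outer(T, X[i], X[j], theta) for j in range(n)] for i in range(n)]

def linear_prob(A, s, i):
    """Linearized activation probability of user i (Assumption 1)."""
    return sum(A[i][j] * s[j] for j in range(len(s)))

def ic_prob(A, s, i):
    """Classic independent-cascade activation probability of user i."""
    return 1.0 - math.prod(1.0 - A[i][j] * s[j] for j in range(len(s)))

# Toy instance (assumed values): small entries so the linearization is accurate.
rng = random.Random(0)
N, d1, d2 = 6, 2, 2
X = [[rng.random() for _ in range(d1)] for _ in range(N)]
theta = [rng.random() for _ in range(d2)]
T = [[[0.02 * rng.random() for _ in range(d2)] for _ in range(d1)] for _ in range(d1)]
A = connectivity(T, X, theta)
s = [1, 0, 1, 1, 0, 0]  # current activation status for this content
gaps = [linear_prob(A, s, i) - ic_prob(A, s, i) for i in range(N)]
```

The gap is nonnegative (the linear form is a union bound on the IC probability) and is of second order in the entries of $A$, which is why the approximation is reasonable when the entries are tiny.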

2.2. REINFORCEMENT LEARNING MODEL

We formulate the influence maximization problem as an infinite-horizon discounted MDP. Define the state space as $\mathcal{S} = \{0,1\}^{N \times K}$, where 1 refers to an activated user-content pair. At each timestep, the agent observes the current network state $s$ and picks an action $a \in \mathcal{A} := [N] \times [K]$ to activate one user-content pair. Let $s_a$ be the post-action state, i.e., $s_a = s + \mathbf{1}_a$. Then the state of the network transitions following the network diffusion process, i.e., Assumptions 1 and 2. Since users are activated independently of one another, the state-transition law of the MDP admits a product structure:
$$P(s' \mid s, a) = \prod_{i \in [N],\, k \in [K]} P(s'_{i,k} \mid s, a).$$
At each state-action pair, the agent receives a reward $r(s, a) = \sum_{i,k} v_{i,k}\, s_{i,k}$ measuring the amount of influence over the network. For example, if we let $v_{i,k} \equiv 1$, then $r(s, a) = \|s\|_1$, which counts the number of active users. Without loss of generality, we assume $v_{i,k} \le 1$. Let $\pi : \mathcal{S} \to \mathcal{A}$ be a decision policy. We measure the value of policy $\pi$ at state $s$ as the cumulative sum of discounted rewards
$$V^\pi(s) = \mathbb{E}^\pi\Big[\sum_{t=1}^{\infty} \gamma^{t-1} r(s_t, a_t) \,\Big|\, s_1 = s\Big].$$
Recall from Assumption 4 that the influence of any action lasts $O(1/\Delta)$ time steps on average. Thus, a natural choice of the discount factor is $\gamma = 1 - o(\Delta)$. Finally, the policy optimization problem is to find $\pi^* = \operatorname{argmax}_\pi V^\pi(s)$.

Relation between Discounted MDP and Bandit IM model. The discounted MDP formulation differs from the bandit IM optimization in two ways. (1) Our policy is dynamic and makes state-dependent decisions, while the bandit approach makes a batch of decisions only at the beginning of the diffusion process. (2) In both cases, the optimization objective is the sum of the influences of all seed users; the difference lies in how the per-seed influence is measured. In the IM bandit, the per-seed influence is a cumulative sum calculated after the diffusion process is over. In our formulation, the per-seed influence is a cumulative $\gamma$-discounted sum of rewards from this seed's descendants. If we choose $\gamma = 1 - o(\Delta)$, these values differ by only $o(1)$, and we can make the difference arbitrarily small.
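The transition and reward described in this subsection can be simulated in a few lines. The sketch below is an illustration under simplifying assumptions: a single content ($K = 1$), $v_{i,k} \equiv 1$, a finite rollout horizon standing in for the infinite discounted sum, and a hypothetical fixed policy; the function names `step` and `discounted_return` are not from the paper.

```python
import random

def step(A, s, a, rng):
    """One MDP transition for a single content (K = 1).

    A[i][j] is the influence of user j on user i, s is a 0/1 list,
    and a is the index of the user to seed.
    """
    s_a = list(s)
    s_a[a] = 1                     # post-action state: activate the chosen user
    reward = sum(s)                # r(s, a) = ||s||_1 with all v_{i,k} = 1
    nxt = []
    for i in range(len(s)):
        p = sum(A[i][j] * s_a[j] for j in range(len(s)))
        # Each user is an independent Bernoulli draw (product structure).
        nxt.append(1 if rng.random() < min(1.0, p) else 0)
    return nxt, reward

def discounted_return(A, policy, gamma, horizon, seed=0):
    """Finite-horizon rollout estimate of the discounted sum of rewards."""
    rng = random.Random(seed)
    s = [0] * len(A)
    total = 0.0
    for t in range(horizon):
        a = policy(s)
        s, r = step(A, s, a, rng)
        total += (gamma ** t) * r
    return total

# Toy network (assumed): every column of A sums to 0.9, i.e. Delta = 0.1.
A_toy = [[0.9 / 5] * 5 for _ in range(5)]
always_seed_user_0 = lambda state: 0
value_estimate = discounted_return(A_toy, always_seed_user_0, gamma=0.9, horizon=50, seed=1)
```

Averaging `discounted_return` over many seeds would give a Monte Carlo estimate of $V^\pi$ at the empty state for the chosen policy.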

3. MODEL-BASED RL FOR INFLUENCE MAXIMIZATION (MORIMA)

To reduce the statistical complexity, we adopt a model-based RL approach for exploring the unknown network and learning the optimal policy. Our approach alternates between model estimation and policy update. Our algorithm calculates a bonus function based on the collected data and adds it to the reward, which dynamically trades off exploitation against exploration. We also adopt a slow switching technique to reduce the computational burden.

Tensor ridge regression for model estimate. Under the linear tensor model (Assumption 2), we can use tensor ridge regression to perform model-based RL. This reduces the statistical complexity since the dimension of the unknown parameter is smaller. Furthermore, this approach only requires node-level feedback, while previous bandit approaches for the IC model require edge-level feedback (Chen et al., 2013; Wen et al., 2017; Wu et al., 2019). Specifically, let $s_a$ be the altered state after applying action $a$. Observe that, conditioned on $(s, a)$, the random variable $s'_{i,k}$ satisfies a linear relation:
$$\mathbb{E}[s'_{i,k} \mid s, a] = \sum_j A^k_{i,j}\,(s_a)_{j,k} = \Big\langle T^*,\; x_i \otimes \Big(\sum_j x_j \cdot (s_a)_{j,k}\Big) \otimes \theta_k \Big\rangle.$$
Denote for short $\phi_{i,k}(s, a) = x_i \otimes \big(\sum_j x_j \cdot (s_a)_{j,k}\big) \otimes \theta_k \in \mathbb{R}^{d_1 d_1 d_2}$ and $\phi^t_{i,k} = \phi_{i,k}(s_t, a_t)$. At time $t$, after observing the history $(s_1, a_1, \ldots, s_{t-1}, a_{t-1}, s_t)$, we estimate the tensor model by
$$T_t = \operatorname{argmin}_T \sum_{\tau=1}^{t-1} \sum_{k=1}^{K} \sum_{i=1}^{N} \big(\langle T, \phi^\tau_{i,k}\rangle - (s_{\tau+1})_{i,k}\big)^2 + \lambda \|T\|_2^2,$$
where $\|T\|_2^2$ is calculated by vectorizing $T$. This admits an analytical solution $T_t = \Sigma_{t-1}^{-1} B_{t-1}$, where
$$\Sigma_{t-1} = \lambda I + \sum_{\tau=1}^{t-1} \sum_{k=1}^{K} \sum_{i=1}^{N} \phi^\tau_{i,k} (\phi^\tau_{i,k})^\top, \qquad B_{t-1} = \sum_{\tau=1}^{t-1} \sum_{k=1}^{K} \sum_{i=1}^{N} \phi^\tau_{i,k} \cdot (s_{\tau+1})_{i,k}. \tag{4}$$
Notice that the sizes of the covariance matrix $\Sigma_{t-1}$ and the right-hand-side term $B_{t-1}$ are $d \times d$ and $d$, respectively, where $d = d_1^2 d_2$.

Optimistic Planning with truncated-reward model.
To avoid the worst-case $O(NK)$ reward, we identify a high-probability upper bound $\Lambda$ for the rewards and truncate the reward as $\bar{r}(s, a) = \min\{r(s, a), \Lambda\}$. Then, based on the ridge regression estimate $T_t$, we add a bonus term $b_t(s, a)$ to the truncated reward $\bar{r}$ and solve for an optimistic Q-function $Q^*_{T_t,\, \bar{r} + b_t}(s, a)$ using the model estimate. Specifically, we can choose $\Lambda = \frac{6}{\Delta^2} \log(4NKT^3)$. For $T_t$, we define the reward bonus as
$$b_t(s, a) = \frac{2\gamma\Lambda}{1-\gamma} \sum_{i=1}^{N} \sum_{k=1}^{K} \Big(1 \wedge \beta_t \cdot \|\phi_{i,k}(s, a)\|_{\Sigma_{t-1}^{-1}}\Big), \tag{5}$$
where we use the notation $1 \wedge x = \min\{1, x\}$ and
$$\beta_t = \frac{24}{\Delta} \sqrt{\frac{C}{NK} \cdot d \cdot \log\big(1 + NKL^2 t/(d\lambda)\big)} + 4\log\big(8N^2K^2t^2/\delta\big) + \sqrt{\lambda}\,\|T^*\|_2, \tag{6}$$
with $L$ being an upper bound of $\|\phi^t_{i,k}\|_2$ and $d = d_1^2 d_2$. This choice of $\beta_t$ ensures that, with high probability, $Q^*_{T_t,\, \bar{r} + b_t}(s, a)$ is an upper bound of $\overline{Q}^*(s, a)$, the optimal Q-function for the ground-truth transition with the truncated reward. We calculate the optimistic truncated Q-function $Q^*_{T_t,\, \bar{r} + b_t}(s, a)$ using value iteration with truncation (Algorithm 2).

Slow switching. To reduce computation overhead, we adopt a slow switching technique from the bandit and RL literature (Abbasi-Yadkori et al., 2011; Zhou et al., 2021b). The idea is that we only update the model and policy when enough new data has been collected, which we check via the covariance matrix. Specifically, if the most recent switch happened at time $t$, we switch at time $t' > t$ only if $\det(\Sigma_{t'-1}) > 2\det(\Sigma_{t-1})$. After switching, we calculate the optimistic Q-function $Q_{t'} = Q^*_{T_{t'},\, \bar{r} + b_{t'}}(s, a)$. Then we pick actions greedily using $Q_{t'}$, i.e., $a = \operatorname{argmax}_a Q_{t'}(s, a)$, until the next switch.

Algorithm 1 Model-based RL for Influence Maximization (MORIMA)
1: Initialize $\Sigma_1 = \lambda I$, $B_1 = 0$, $Z = \det(\Sigma_1)$.
2: Calculate $T_1$ and $b_1(s, a)$ and compute $Q_1 = Q^*_{T_1,\, \bar{r} + b_1}(s, a)$.
3: Take the greedy action with respect to $Q_1$: $a_1 = \operatorname{argmax}_a Q_1(s_1, a)$.
4: for $t = 2, 3, \ldots$ do
5: Calculate $\Sigma_{t-1}$ and $B_{t-1}$ according to Eqn. (4).
6: if $\det(\Sigma_{t-1}) > 2Z$ then
7: Calculate $T_t$ and $b_t(s, a)$.
8: Compute the optimistic Q-function $Q_t = Q^*_{T_t,\, \bar{r} + b_t}(s, a)$ (Algorithm 2).
9: Set $Z = \det(\Sigma_{t-1})$.
10: else
11: Set $Q_t = Q_{t-1}$.
12: end if
13: Take the greedy action with respect to $Q_t$: $a_t = \operatorname{argmax}_a Q_t(s_t, a)$.

Full algorithm. We put together the pieces and present the full Algorithm 1. The algorithm makes only $O(d \log T)$ model updates and policy updates up to time $T$. Each model update can be done efficiently using least-squares regression. Policy updates require solving a new planning problem, which can be combinatorially hard. In practice, one can solve the planning problem using approximate dynamic programming (Powell, 2007) or Monte-Carlo Tree Search (MCTS) methods (Browne et al., 2012), and we use a two-step lookahead scheme in our experiments. For the theoretical analysis, we assume access to a planning oracle that is able to find the optimal policy with respect to a known model $T$. Relaxing this assumption to an approximate planning oracle can also be done with minor modifications to the algorithm and analysis.
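The estimation and slow-switching steps of this section can be sketched as follows. This is an illustrative implementation under assumptions, not the authors' code: it processes one vectorized feature at a time rather than the batch of $NK$ features per time step, maintains $\Sigma^{-1}$ with the Sherman-Morrison formula and $\log\det\Sigma$ with the matrix determinant lemma (a substitution for recomputing the determinant directly), and uses noise-free targets in the toy run. The class and method names are hypothetical.

```python
import math
import random

class SlowSwitchingRidge:
    """Ridge regression with rank-one updates and a determinant-doubling switch rule."""

    def __init__(self, dim, lam=1.0):
        self.dim = dim
        # Inverse of Sigma = lam * I, stored explicitly.
        self.inv = [[1.0 / lam if r == c else 0.0 for c in range(dim)] for r in range(dim)]
        self.b = [0.0] * dim                    # running sum of phi * y
        self.logdet = dim * math.log(lam)       # log det(Sigma)
        self.logdet_at_switch = self.logdet     # log Z in Algorithm 1
        self.theta = [0.0] * dim                # estimate used since the last switch
        self.num_switches = 0

    def _inv_times(self, v):
        return [sum(row[c] * v[c] for c in range(self.dim)) for row in self.inv]

    def width(self, phi):
        """Confidence width ||phi||_{Sigma^{-1}} used inside the bonus."""
        u = self._inv_times(phi)
        return math.sqrt(max(0.0, sum(p * w for p, w in zip(phi, u))))

    def update(self, phi, y):
        """Add one observation: Sigma += phi phi^T, B += phi * y."""
        u = self._inv_times(phi)                    # Sigma^{-1} phi
        q = sum(p * w for p, w in zip(phi, u))      # phi^T Sigma^{-1} phi
        for r in range(self.dim):                   # Sherman-Morrison update
            for c in range(self.dim):
                self.inv[r][c] -= u[r] * u[c] / (1.0 + q)
        self.logdet += math.log1p(q)                # matrix determinant lemma
        for r in range(self.dim):
            self.b[r] += phi[r] * y

    def solve(self):
        """Current ridge solution Sigma^{-1} B."""
        return self._inv_times(self.b)

    def maybe_switch(self):
        """Refresh the estimate only when det(Sigma) has more than doubled."""
        if self.logdet > self.logdet_at_switch + math.log(2.0):
            self.theta = self.solve()
            self.logdet_at_switch = self.logdet
            self.num_switches += 1
            return True
        return False

# Toy run with assumed parameters and noise-free targets.
rng = random.Random(0)
theta_star = [0.3, -0.2, 0.5]
model = SlowSwitchingRidge(dim=3, lam=0.01)
for _ in range(200):
    phi = [rng.random() for _ in range(3)]
    y = sum(t * p for t, p in zip(theta_star, phi))
    model.update(phi, y)
    model.maybe_switch()
```

In the full algorithm each switch would also trigger a new planning call; the sketch only tracks how rarely that happens, which is the point of the determinant test.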

4. REGRET ANALYSIS

In this section, we provide a regret analysis for Algorithm 1. We define the regret for the infinite-horizon discounted MDP as in Zhou et al. (2021b).

Definition 1. For any possibly non-stationary policy $\pi$, the infinite-horizon discounted regret is defined as
$$\mathrm{Regret}(T) = \sum_{t=1}^{T} \Delta_t, \quad \text{where } \Delta_t = V^*(s_t) - V^\pi_t(s_t),$$
$V^*$ is the optimal value function, and $V^\pi_t$ is defined as
$$V^\pi_t(s) = \mathbb{E}^\pi\Big[\sum_{i=0}^{\infty} \gamma^i r(s_{t+i}, a_{t+i}) \,\Big|\, s_1, \ldots, s_{t-1}, s_t = s\Big].$$

Now we present our main theorem.

Theorem 2. Let Assumptions 1-4 hold. With probability at least $1 - \delta$, Algorithm 1 satisfies the following regret upper bound:
$$\mathrm{Regret}(T) \le O\Big(\frac{1}{\Delta^2 (1-\gamma)^2} \cdot \Big(\frac{d\sqrt{C}}{\Delta} + \sqrt{dNK}\Big) \cdot \sqrt{T}\Big) + \mathrm{polylog}(T)\text{-terms},$$
where $d = \dim(T^*) = d_1^2 d_2$.

We see that the dominant term of the regret is $O\big(\frac{1}{\Delta^2(1-\gamma)^2} \cdot (d\sqrt{C}/\Delta + \sqrt{dNK}) \cdot \sqrt{T}\big)$. Notice that the worst-case reward would scale with $NK$, while we manage to reduce the reward-dependent scaling of the regret to $1/\Delta^2$.

4.1. PROOF SKETCH OF THEOREM 2

Next, we provide a proof sketch and defer the complete proof to Appendix C.

Proof. High probability upper bounds for the size of active user-content pairs. We utilize the diffusion decay assumption (Assumption 4) to provide a high-probability upper bound on the number of active user-content pairs. We show that for any policy $\pi$, with probability at least $1 - p$, we have for all $t \ge 1$,
$$\|s_t\|_1 \le O\Big(\frac{1}{\Delta^2} \log \frac{4t^2}{p}\Big). \tag{7}$$
We see that although there are $NK$ user-content pairs in total, the number of active ones is constrained by a constant intrinsic to the network diffusion dynamics.

Sharper bounds for the confidence region. We derive a batched version of the Bernstein-type self-normalized bounds from Zhou et al. (2021b) and show that, with high probability, $\|T_t - T^*\|_{\Sigma_{t-1}} \le \beta_t$ for all $t$. The key is that the variance of each observation is small: writing $s_{t,a_t}$ for the post-action state at time $t$,
$$\mathrm{var}[(s_{t+1})_{i,k} \mid s_t, a_t] \le \mathbb{E}[(s_{t+1})_{i,k} \mid s_t, a_t] = \sum_j A^k_{i,j} (s_{t,a_t})_{j,k} \le \frac{C}{NK}\big(\|s_t\|_1 + 1\big) \le O\Big(\frac{C}{NK\Delta^2}\Big).$$
Then $\beta_t = O\big(\sqrt{dC/(NK)}/\Delta + 1\big)$, which improves upon the $\beta_t = O(\sqrt{d})$ given by sub-Gaussian-type self-normalized bounds.

Surrogate regret of the truncated-reward model. Since we essentially run our algorithm against the truncated-reward model, we define the surrogate regret as $\overline{\mathrm{Regret}}(T) = \sum_{t=1}^{T} \big(\overline{V}^*(s_t) - \overline{V}^\pi_t(s_t)\big)$, where $\overline{V}^*$ and $\overline{V}^\pi_t$ are computed using the truncated reward $\bar{r}(s, a) = \min\{r(s, a), \Lambda\}$. By Eqn. (7), with probability at least $1 - 1/(2NKT)$, under any policy, we have $r(s_t, a_t) \le \|s_t\|_1 \le \Lambda = O(1/\Delta^2)$ for all $t \le T$. This means that with high probability $\bar{r}(s_t, a_t) = r(s_t, a_t)$, and hence the true regret and the surrogate regret are close. Specifically, we will show $\mathrm{Regret}(T) \le \overline{\mathrm{Regret}}(T) + 1/(1-\gamma)$.

[…] $\|\cdots - P_{i,k}(\cdot \mid s, a)\|_1 \le 2\big(1 \wedge |\langle T^* - T, \phi_{i,k}(s, a)\rangle|\big)$. Therefore, the bonus term can be chosen as in Eqn. (5), and we ensure optimism at each time.

Regret decomposition. We have the following regret decomposition for the surrogate regret:
$$\overline{\mathrm{Regret}}(T) \le O\Big(\frac{1}{1-\gamma} \sum_{t=1}^{T} b_{t_s}(s_t, a_t) + \frac{2\gamma\Lambda}{1-\gamma} \sqrt{T \log\frac{1}{\delta}} + \frac{\Lambda}{1-\gamma} M\Big),$$
where $M = M(T)$ is the total number of switches, and we will show that $M = O(d \log T)$. Then the dominant term of the regret is
$$\frac{1}{1-\gamma} \sum_{t=1}^{T} b_{t_s}(s_t, a_t) = O\Big(\frac{1}{(1-\gamma)^2\Delta^2} \cdot \beta_T \cdot \sum_{t=1}^{T} \sum_{i=1}^{N} \sum_{k=1}^{K} \big(1 \wedge \|\phi_{i,k}(s_t, a_t)\|_{\Sigma_{t_s-1}^{-1}}\big)\Big) \le O\Big(\frac{1}{(1-\gamma)^2\Delta^2} \cdot \beta_T\Big) \cdot O\big(\sqrt{dNKT} + MNK\big),$$
where $t_s$ denotes the last switch up to time $t$, and the last inequality follows from a variant of the Elliptical Potential Lemma. Plugging in the choice of $\beta_T$ yields the result.
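The final plug-in step of the proof sketch can be written out explicitly. The display below is a sketch that suppresses logarithmic factors and uses only quantities stated above, namely $\beta_T = O(\sqrt{dC/(NK)}/\Delta + 1)$ and the $\sqrt{dNKT}$ term from the elliptical potential argument.

```latex
\begin{align*}
\frac{1}{(1-\gamma)^2\Delta^2}\,\beta_T\,\sqrt{dNKT}
  &= \frac{1}{(1-\gamma)^2\Delta^2}\cdot
     O\!\left(\frac{1}{\Delta}\sqrt{\frac{dC}{NK}} + 1\right)\cdot\sqrt{dNKT} \\
  &= O\!\left(\frac{1}{(1-\gamma)^2\Delta^2}
     \left(\frac{d\sqrt{C}}{\Delta} + \sqrt{dNK}\right)\sqrt{T}\right).
\end{align*}
```

The first product uses $\sqrt{dC/(NK)} \cdot \sqrt{dNKT} = d\sqrt{CT}$, which is where the factor $NK$ cancels in the leading term. The remaining $MNK$ contribution depends on $T$ only through the number of switches, which the algorithm section bounds by $O(d \log T)$, so it falls into the polylogarithmic terms of Theorem 2.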

4.2. EXTENSION TO GENERALIZED LINEAR INDEPENDENT CASCADE MODELS

In this subsection, we briefly show how to extend our algorithm and regret bound to generalized linear IC models. The details can be found in Appendix F.

Assumption 5 (Generalized Linear Independent Cascade (GLIC) Model). Let $s'$ be the next state. For each $k \in [K]$, we assume there is an underlying connectivity matrix $A^k \in \mathbb{R}^{N \times N}$ such that
$$P(s'_{i,k} = 1 \mid s) = \mu\Big(\sum_j A^k_{i,j}\, s_{j,k}\Big),$$
where $\mu : \mathbb{R} \to \mathbb{R}$ satisfies $\mu(0) = 0$ and $1/\kappa \le \mu' \le 1$ for some $\kappa \ge 1$. We also assume the $s'_{i,k}$'s are independent conditioned on $s$.

Given Assumption 5, there are two major differences in extending MORIMA to GLIC models: (1) a new objective for tensor model estimation under the GLM, which does not have an analytical solution; and (2) a new optimistic planning step, which is analyzed using the new tensor estimate. We have the following regret bound for MORIMA under GLIC.

Theorem 3. Let Assumption 5 and Assumptions 2-4 hold. With probability at least $1 - \delta$, Algorithm 1 satisfies the following regret upper bound:
$$\mathrm{Regret}(T) \le O\Big(\frac{\kappa}{\Delta^2 (1-\gamma)^2} \cdot \Big(\frac{d\sqrt{C}}{\Delta} + \sqrt{dNK}\Big) \cdot \sqrt{T}\Big) + \mathrm{polylog}(T)\text{-terms},$$
where $d = \dim(T^*) = d_1^2 d_2$.
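As an illustration of Assumption 5, one link function that fits its conditions on a bounded range is the exponential link below. This specific choice of $\mu$, the range bound $x_{\max}$, and the resulting value of $\kappa$ are assumptions made here for illustration and are not specified in the text.

```latex
\begin{align*}
\mu(x) &= 1 - e^{-x}, \qquad \mu(0) = 0, \\
\mu'(x) &= e^{-x} \in \big[e^{-x_{\max}},\, 1\big]
  \quad \text{for } x \in [0, x_{\max}],
  \qquad \text{so } \kappa = e^{x_{\max}}, \\
1 - \prod_j \big(1 - A^k_{i,j} s_{j,k}\big)
  &\approx 1 - \exp\Big(-\sum_j A^k_{i,j} s_{j,k}\Big)
  = \mu\Big(\sum_j A^k_{i,j} s_{j,k}\Big)
  \quad \text{when the entries of } A^k \text{ are small.}
\end{align*}
```

Under this choice the GLIC model tracks the classic IC activation probability more closely than the purely linear form of Assumption 1, at the price of the extra factor $\kappa$ in Theorem 3.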

5. EXPERIMENTS

We experiment with MORIMA (Algorithm 1) for influence maximization on both synthetic networks and a large-scale Twitter social network.

Synthetic networks. We run benchmark experiments on two synthetic networks. The first, a cascade network (see Appendix E, Figure 2a) consisting of $N = 300$ users and $K = 4$ content types, is constructed to have a hierarchical structure with users of high, medium, and low influence, with 9-dimensional user features and 3-dimensional content features (see Appendix E for details). The second network is constructed to have a star-like structure, consisting of $N = 70$ users of various influence levels and $K = 1$ content (see Appendix E, Figure 2b for details). The result for the first network is reported in Figure 1(a), and the result for the star network is reported in Appendix E, Figure 3.

Twitter social network. We further conduct experiments using the Twitter Social Network dataset (Leskovec & Krevl, 2014), which represents real-world social networks. The dataset contains ∼80k nodes and ∼1 million directed edges, where a directed edge $(u, v)$ means that node $u$ follows node $v$ on Twitter. We randomly sample multiple sub-graphs from the Twitter network and construct $K$ networks corresponding to different topics/content types. For the sampled networks, we first pick the $n_1$ nodes with the largest out-degrees and then include all their $\ell$-hop neighbors, with $n_1 = 8$ and $\ell = 5$. We construct a connectivity tensor over these networks by randomly drawing edge weights from $[0, 0.1]$ and normalizing tensor row sums to 0.9. Next we apply non-negative Tucker decomposition (Kossaifi et al., 2019) to extract the tensor core model $T^*$, user features (of dimension $d_1 = 10$), and content features (of dimension $d_2 = 2$). Thus, we have generated a large-scale topic-aware Twitter diffusion network with $K = 3$ content types, $N = 1966$ nodes, and 38023 edges for dynamic influence maximization.

Implementation and Baselines. Exactly solving for the optimal policy, even if the network is fully known, requires solving a combinatorially hard planning problem and is intractable. In our experiments, we adopt the two-step lookahead approximate dynamic programming scheme (Powell, 2007) as the planning oracle for Algorithm 2 of MORIMA. In the implementation of Algorithm 1, we set $\gamma = 0.9$, with $\lambda = 1$ for synthetic networks and $\lambda = 0.01$ for the Twitter network. For comparison with MORIMA, we also test the following baselines: (1) the naive random policy that uniformly selects a user-content pair to activate; (2) IMLinUCB (Wen et al., 2017), a combinatorial linear bandit baseline that was originally designed for non-adaptive online IM; to make a fair comparison with the two-step lookahead oracle, we run IMLinUCB every 2 timesteps and play the 2 selected actions simultaneously; (3) MORIMA without slow switching, where we force the Q-function to be updated at each time step; (4) MORIMA with known $A^k$'s, i.e., pure planning with a fully known model, which serves as a performance upper bound for the reinforcement learning algorithm.

Results and analysis. We report the averaged discounted cumulative rewards and their empirical confidence regions in Figure 1, where each test is repeated 20 times on the synthetic networks and 5 times on the Twitter network. In Figure 1(a), we observe that the discounted sum of rewards of MORIMA reaches the level of the performance upper bound with true $A$ in fewer than 100 rounds on the synthetic network, showing that our algorithm can quickly explore the unknown network and learn to make optimal decisions. In Figure 1(b), we observe that MORIMA can still match the performance of the planning oracle in a large real social network with thousands of users. Across all experiments, our reinforcement-learning-based MORIMA significantly outperforms IMLinUCB because it adapts its decisions to the current state, while IMLinUCB does not take the state into consideration. Further, slow switching did not hurt the performance of MORIMA while greatly reducing the computational cost from $O(T)$ parameter updates to $O(d \log T)$.
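The edge-weight construction described for the Twitter experiment (weights drawn from $[0, 0.1]$, then normalized to sum to 0.9) can be sketched for a single content as below. This is an illustration under stated assumptions: the text does not say along which axis the sums are normalized, so the sketch normalizes the total outgoing influence of each user, which is the quantity bounded by Assumption 4 with $\Delta = 0.1$. The function name and the tiny edge list are hypothetical.

```python
import random

def random_connectivity(edges, n, decay=0.1, max_weight=0.1, seed=0):
    """Build A with A[i][j] = influence of user j on user i.

    edges is a list of (u, v) pairs meaning "u follows v", so v influences u.
    Assumption: each influencer's total outgoing weight is rescaled to 1 - decay.
    """
    rng = random.Random(seed)
    A = [[0.0] * n for _ in range(n)]
    for u, v in edges:
        A[u][v] = rng.uniform(0.0, max_weight)   # raw random edge weight
    for j in range(n):
        total = sum(A[i][j] for i in range(n))   # influence sent out by user j
        if total > 0:
            scale = (1.0 - decay) / total
            for i in range(n):
                A[i][j] *= scale
    return A

# Hypothetical 4-user follower graph.
demo_edges = [(1, 0), (2, 0), (3, 0), (3, 1), (0, 2)]
A_demo = random_connectivity(demo_edges, n=4)
column_sums = [sum(A_demo[i][j] for i in range(4)) for j in range(4)]
```

Users without followers keep a zero column, so the decay condition holds for every column after normalization.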

6. CONCLUSION

In this paper, we study the problem of content-aware online adaptive influence maximization and formulate the problem as an infinite-horizon discounted MDP. We propose MORIMA, a model-based reinforcement learning algorithm that learns the optimal policy using only node-level feedback under the IC model. We provide an $O(d\sqrt{T}/\Delta + \sqrt{dNKT})$ regret bound for our algorithm, which is the first sublinear regret bound for the online adaptive influence maximization problem. We empirically validated the effectiveness of our algorithm on synthetic and real-world social networks.

A LIMITATIONS AND BROADER IMPACT

As a common limitation for all previous work on influence maximization, it is computationally infeasible to find the exact optimal policy, especially when applied to real-world networks with millions of nodes. Therefore, the scalability of our proposed algorithm comes at the cost of using approximations. Given a certain computational budget constraint, the optimality of the policy needs to be traded-off towards affordable computational and storage complexity. In our experiments, we use Monte-Carlo methods, parallel computation and randomized tree search (dynamic programming) methods for approximating the optimal policies. While the real-world social network graph in our experiment is in the same scale as the graphs used in previous online IM studies, scaling up to larger network is an important future work of ours. In the paper, we propose a model-based RL algorithm to learn the optimal policy for online adaptive influence maximization problems, which can be applied to advertisements for promoting beneficial ideas, new knowledge, and innovative products across social networks. However, such algorithms might be exploited to propagate fake news or rumors through the social networks. Addressing the ethical concerns also needs to be considered in future work.  β t = 24 ∆ C/(N K) • d • log(1 + N KL 2 t/(dλ)) + 4 log(8N 2 K 2 t 2 /δ) + √ λ T * 2 . and L = sup φ t i,k 2 , d = d 2 1 d 2 . Before the proof of Lemma B.1, we introduce two lemmas below: Lemma B.2 (High probability bounds for the number of active user-content pairs). Let Assumptions 1-4 hold. For any possibly non-stationary policy π, with probability at least 1 -δ, we have for all t ≥ 1, s t 1 ≤ 2 ∆ ( 2 ∆ log 2t 2 δ + 1). Lemma B.3 (Bernstein-type self-normalized bound, batched version (Zhou et al., 2021a) ). Let {F t } ∞ t=1 be a filtration, {x i t , y i t } t≥1,1≤i≤m be a stochastic process such that x i t ∈ R d is F t - measurable and y i t ∈ R is F t+1 -measurable. 
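Assume that conditioned on $\mathcal{F}_t$, $\{y^1_t, \cdots, y^m_t\}$ are independent, and $|y^i_t| \le R$, $\mathbb{E}[y^i_t|\mathcal{F}_t] = \langle T^*, x^i_t\rangle$, $\mathrm{var}[y^i_t|\mathcal{F}_t] \le \sigma^2$, $\|x^i_t\|_2 \le L$. Then with probability at least $1-\delta$, the following holds simultaneously for all $t \ge 1$:

$$\|\hat T_t - T^*\|_{\Sigma_{t-1}} \le \beta_t, \qquad \Big\|\sum_{i=1}^m\sum_{\tau=1}^{t-1} x^i_\tau\big(y^i_\tau - \langle T^*, x^i_\tau\rangle\big)\Big\|_{\Sigma_{t-1}^{-1}} \le \beta_t - \sqrt{\lambda}\,\|T^*\|_2,$$

where

$$\hat T_t = \Sigma_{t-1}^{-1}B_{t-1}, \qquad \Sigma_{t-1} = \lambda I + \sum_{i=1}^m\sum_{\tau=1}^{t-1} x^i_\tau (x^i_\tau)^\top, \qquad B_{t-1} = \sum_{i=1}^m\sum_{\tau=1}^{t-1} x^i_\tau y^i_\tau,$$

$$\beta_t = 8\sigma\sqrt{d\log\big(1 + mtL^2/(d\lambda)\big)\log\big(4m^2t^2/\delta\big)} + 4R\log\big(4m^2t^2/\delta\big) + \sqrt{\lambda}\,\|T^*\|_2.$$

Proof of Lemma B.1. We use Lemma B.3 for the batched stochastic process $\{\phi^t_{i,k}, (s_{t+1})_{i,k}\}$. Notice that we can choose $\sigma^2$ to be an upper bound of $\mathrm{var}[(s_{t+1})_{i,k}|s_t, a_t]$, and

$$\mathrm{var}[(s_{t+1})_{i,k}|s_t, a_t] \le \mathbb{E}[(s_{t+1})_{i,k}|s_t, a_t] = \sum_j A^k_{i,j}(s_t^{a_t})_{j,k} \le \frac{C}{NK}\sum_j (s_t^{a_t})_{j,k} \le \frac{C}{NK}\big(\|s_t\|_1 + 1\big),$$

where we used the assumption that $A^k_{i,j} \le \frac{C}{NK}$. By Lemma B.2, with probability at least $1-\delta/2$, for all $t$,

$$\|s_t\|_1 \le \frac{2}{\Delta}\Big(\frac{2}{\Delta}\log\frac{4t^2}{\delta} + 1\Big).$$

Therefore, when the above inequalities hold, we have

$$\mathrm{var}[(s_{t+1})_{i,k}|s_t, a_t] \le \frac{C}{NK}\Big(\frac{2}{\Delta}\Big(\frac{2}{\Delta}\log\frac{4t^2}{\delta} + 1\Big) + 1\Big) \le \frac{C}{NK}\Big(\frac{3}{\Delta}\Big)^2\log\frac{4t^2}{\delta}.$$

By Lemma B.3 with $m = NK$, $R = 1$, and $\sigma^2 = \frac{C}{NK}(3/\Delta)^2\log\frac{4t^2}{\delta}$, we have

$$\beta_t = \frac{24}{\Delta}\sqrt{C/(NK)}\cdot\sqrt{d}\cdot\sqrt{\log(4t^2/\delta)\,\log\big(1 + NKL^2t/(d\lambda)\big)\,\log\big(8N^2K^2t^2/\delta\big)} + 4\log\big(8N^2K^2t^2/\delta\big) + \sqrt{\lambda}\,\|T^*\|_2,$$

which is smaller than the result stated in the lemma.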

B.2 DEFERRED PROOFS IN SUBSECTION B.1

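Proof of Lemma B.2. First, we bound the expectation of $\|s_t\|_1$. By the transition, we have

$$\mathbb{E}[(s_{t+1})_{i,k}|s_t, a_t] = \sum_j A^k_{i,j}(s_t^{a_t})_{j,k}.$$

Therefore,

$$\mathbb{E}[\|s_{t+1}\|_1 \,|\, s_t, a_t] = \sum_{i,j,k} A^k_{i,j}(s_t^{a_t})_{j,k}.$$

Recall that we have assumed that for any content $k$ and any user $j$, $\sum_i A^k_{i,j} \le 1-\Delta$. Then we have

$$\mathbb{E}[\|s_{t+1}\|_1 \,|\, s_t, a_t] \le (1-\Delta)\sum_{j,k}(s_t^{a_t})_{j,k} = (1-\Delta)\|s_t^{a_t}\|_1 \le (1-\Delta)\big(\|s_t\|_1 + 1\big), \tag{9}$$

where the last inequality holds since the action alters at most one entry of the state. Next, notice that conditioned on $(s_t, a_t)$, $\|s_{t+1}\|_1$ is the sum of $NK$ independent Bernoulli random variables. By the Bernstein inequality, with probability at least $1-\delta_t$,

$$\|s_{t+1}\|_1 - \mathbb{E}[\|s_{t+1}\|_1 \,|\, s_t, a_t] \le 2\Big(\sqrt{\sum_{i,k}\mathrm{var}[(s_{t+1})_{i,k}|s_t, a_t]\,\log\frac{1}{\delta_t}} + \log\frac{1}{\delta_t}\Big).$$

Since the variance of a Bernoulli random variable is bounded by its expectation, we have

$$\|s_{t+1}\|_1 - \mathbb{E}[\|s_{t+1}\|_1 \,|\, s_t, a_t] \le 2\Big(\sqrt{\mathbb{E}[\|s_{t+1}\|_1 \,|\, s_t, a_t]\,\log\frac{1}{\delta_t}} + \log\frac{1}{\delta_t}\Big).$$

Therefore, by Equation (9), we have

$$\begin{aligned}
\|s_{t+1}\|_1 &\le (1-\Delta)\big(\|s_t\|_1 + 1\big) + 2\sqrt{(1-\Delta)\big(\|s_t\|_1 + 1\big)\log\frac{1}{\delta_t}} + 2\log\frac{1}{\delta_t} \\
&\le (1-\Delta)\big(\|s_t\|_1 + 1\big) + \frac{\Delta}{2}\big(\|s_t\|_1 + 1\big) + \frac{2}{\Delta}(1-\Delta)\log\frac{1}{\delta_t} + 2\log\frac{1}{\delta_t} \\
&= (1-\Delta/2)\big(\|s_t\|_1 + 1\big) + \frac{2}{\Delta}\log\frac{1}{\delta_t},
\end{aligned}$$

where we used $at + b/t \ge 2\sqrt{ab}$ for the last inequality. Finally, we set $\delta_t = \frac{\delta}{2t^2}$ so that $\sum_t \delta_t \le \delta$ and take a union bound over all $t \ge 1$. By solving the recursion, we have with probability at least $1-\delta$,

$$\|s_t\|_1 \le \frac{2}{\Delta}\Big(\frac{2}{\Delta}\log\frac{2t^2}{\delta} + 1\Big) \quad \text{for all } t \ge 1.$$

Proof of Lemma B.3. We consider a "serialized" stochastic process. Let $\mathcal{G}_{t,i} = \sigma(\mathcal{F}_t, y^1_t, \ldots, y^{i-1}_t)$. When $1 \le i \le m$, we have $\mathcal{G}_{t,i} \subseteq \mathcal{G}_{t,i+1}$; while when $i = m+1$, we have $\mathcal{G}_{t,m+1} = \sigma(\mathcal{F}_t, y^1_t, \ldots, y^m_t) \subseteq \mathcal{G}_{t+1,1} = \mathcal{F}_{t+1}$. Then we know that

$$\mathcal{G}_{1,1} \subseteq \mathcal{G}_{1,2} \subseteq \cdots \subseteq \mathcal{G}_{1,m} \subseteq \mathcal{G}_{2,1} \subseteq \mathcal{G}_{2,2} \subseteq \cdots \subseteq \mathcal{G}_{2,m} \subseteq \cdots \subseteq \mathcal{G}_{t,1} \subseteq \mathcal{G}_{t,2} \subseteq \cdots \subseteq \mathcal{G}_{t,m} \subseteq \cdots$$

is a filtration. Clearly $x^i_t$ is $\mathcal{G}_{t,i}$-measurable and $y^i_t$ is $\mathcal{G}_{t,i+1}$-measurable. By the conditional independence assumption, we also have

$$y^i_t \,|\, \mathcal{G}_{t,i} \overset{d}{=} y^i_t \,|\, \mathcal{F}_t, y^1_t, \ldots, y^{i-1}_t \overset{d}{=} y^i_t \,|\, \mathcal{F}_t.$$

Therefore, by Theorem 4.1 of Zhou et al. (2021a), we have with probability at least $1-\delta$, for all $t \ge 1$ and $i = 1, \ldots, m$,

$$\|\hat T_{t,i} - T^*\|_{\Sigma_{t,i}} \le \beta_{t,i},$$

and

$$\Big\|\sum_{\tau=1}^{t-1}\sum_{j=1}^m x^j_\tau\big(y^j_\tau - \langle T^*, x^j_\tau\rangle\big) + \sum_{j=1}^i x^j_t\big(y^j_t - \langle T^*, x^j_t\rangle\big)\Big\|_{\Sigma_{t,i}^{-1}} \le \beta_t - \sqrt{\lambda}\,\|T^*\|_2,$$

where

$$\hat T_{t,i} = \Sigma_{t,i}^{-1}B_{t,i}, \qquad \Sigma_{t,i} = \lambda I + \sum_{\tau=1}^{t-1}\sum_{j=1}^m x^j_\tau (x^j_\tau)^\top + \sum_{j=1}^i x^j_t (x^j_t)^\top, \qquad B_{t,i} = \sum_{\tau=1}^{t-1}\sum_{j=1}^m x^j_\tau y^j_\tau + \sum_{j=1}^i x^j_t y^j_t,$$

and

$$\beta_{t,i} = 8\sigma\sqrt{d\log\big(1 + t_iL^2/(d\lambda)\big)\log\big(4t_i^2/\delta\big)} + 4R\log\big(4t_i^2/\delta\big) + \sqrt{\lambda}\,\|T^*\|_2, \qquad t_i = m(t-1) + i.$$

Then the result of Lemma B.3 follows by setting $i = m$.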
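The step "by solving the recursion" in the proof of Lemma B.2 can be checked numerically. The following is a minimal sketch, not part of the original analysis: it iterates the worst case of the recursion $x_{t+1} = (1-\Delta/2)(x_t+1) + \frac{2}{\Delta}\log\frac{1}{\delta_t}$ with $\delta_t = \delta/(2t^2)$ and compares it with the closed-form bound of the lemma. The function names and the starting value `s1 = 0` are illustrative assumptions.

```python
import math


def state_norm_recursion(gap, delta, horizon, s1=0.0):
    """Iterate the worst case of the recursion in the proof of Lemma B.2.

    x_{t+1} = (1 - gap/2) * (x_t + 1) + (2/gap) * log(1/delta_t),
    with delta_t = delta / (2 t^2). Returns [x_1, ..., x_horizon].
    """
    xs = [s1]
    for t in range(1, horizon):
        log_term = math.log(2 * t * t / delta)  # log(1 / delta_t)
        xs.append((1 - gap / 2) * (xs[-1] + 1) + (2 / gap) * log_term)
    return xs


def closed_form_bound(gap, delta, t):
    """Bound of Lemma B.2: (2/gap) * ((2/gap) * log(2 t^2 / delta) + 1)."""
    return (2 / gap) * ((2 / gap) * math.log(2 * t * t / delta) + 1)


if __name__ == "__main__":
    xs = state_norm_recursion(gap=0.2, delta=0.05, horizon=50)
    for t in (1, 10, 50):
        print(t, round(xs[t - 1], 2), round(closed_form_bound(0.2, 0.05, t), 2))
```

The induction behind the check is that $x_t + 1 \le \frac{2}{\Delta}(c_t + 1)$ with $c_t = \frac{2}{\Delta}\log\frac{2t^2}{\delta}$ is preserved by one step of the recursion, since $c_t$ is nondecreasing in $t$.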

C PROOF OF THEOREM 2

Additional Notation. Let $t_1 = 1$, and for $s \ge 1$, the next switching time $t_{s+1}$ is recursively defined as $t_{s+1} = \min\{t \mid \det(\Sigma_{t-1}) > 2\det(\Sigma_{t_s-1})\}$. Denote the set of switching times by $W = \{t_1, t_2, \ldots, t_M\}$, where $M$ is the total number of switches. We have $1 = t_1 < t_2 < \cdots < t_M \le T < t_{M+1}$. We slightly abuse the notation and use $t_s$ to denote the last switch up to time $t$, i.e., $t_s \le t < t_{s+1}$. Then by slow switching we mean $Q_t = Q^*_{\hat T_{t_s},\, \tilde r + b_{t_s}}$. Recall the definition of the regrets

$$\mathrm{Regret}(T) = \sum_{t=1}^T \big(V^*(s_t) - V^\pi_t(s_t)\big), \qquad \widetilde{\mathrm{Regret}}(T) = \sum_{t=1}^T \big(\tilde V^*(s_t) - \tilde V^\pi_t(s_t)\big),$$

where $V$, $\mathrm{Regret}$ are defined with the original untruncated model and $\tilde V$, $\widetilde{\mathrm{Regret}}$ are defined with the truncated-reward model.

Key Lemmas. Before the proof of Theorem 2, we introduce several key lemmas.

Lemma C.1 (optimism). Let Assumptions 1-4 hold. Set the bonus term to be

$$b_t(s,a) = \frac{2\Lambda\gamma}{1-\gamma}\sum_{i=1}^N\sum_{k=1}^K \big(1 \wedge \beta_t\cdot\|\phi_{i,k}(s,a)\|_{\Sigma_{t-1}^{-1}}\big).$$

Then with probability at least $1-\delta$, the optimistic condition $\tilde Q^*(s,a) \le Q_t(s,a)$ holds for all $t \ge 1$. Furthermore, we have for any $V(s)$ such that $0 \le V(s) \le \Lambda/(1-\gamma)$,

$$\gamma\,\big|\mathbb{E}_{s'\sim\mathbb{P}(s'|s,a)}V(s') - \mathbb{E}_{s'\sim\mathbb{P}_{\hat T_t}(s'|s,a)}V(s')\big| \le b_t(s,a).$$

Lemma C.2 (surrogate regret). Let Assumptions 1-4 hold. Assume that $T\log(1/\gamma) \ge \log(2NKT)$. For any policy $\pi$, the regrets of the two MDPs are connected by

$$\mathrm{Regret}(T) \le \widetilde{\mathrm{Regret}}(T) + \frac{1}{1-\gamma}.$$

Lemma C.3 (regret decomposition (Zhou et al., 2021b)). Let Assumptions 1-4 hold. Assume that at each time step $t$, the results of Lemma C.1 hold. Then with probability at least $1-\delta$, we have the following regret decomposition

$$\widetilde{\mathrm{Regret}}(T) \le \frac{1}{1-\gamma}\Big[2\sum_{t=1}^T b_{t_s}(s_t,a_t) + \frac{2\gamma\Lambda}{1-\gamma}\sqrt{T\log\frac{1}{\delta}} + \gamma\big(2\Lambda/(1-\gamma) + E_T\big)\Big],$$

where $E_T$ is the switching error $E_T = \sum_{t=1}^T \big(V_t(s_{t+1}) - V_{t+1}(s_{t+1})\big)$.

Lemma C.4 (bounding the number of switches). Let Assumptions 1-4 hold.
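The total number of switches $M$ incurred by Algorithm 1 is bounded as

$$M < \frac{1}{\log 2}\, d\log\frac{d + NKTL^2/\lambda}{d} + 1,$$

where $L = \sup\|\phi^t_{i,k}\|_2$.

Next we state the proof of Theorem 2.

Proof of Theorem 2. Combining Lemma C.1, Lemma C.2, and Lemma C.3, we have with probability at least $1-2\delta$, when $T\log(1/\gamma) \ge \log(2NKT)$,

$$\mathrm{Regret}(T) \le \frac{1}{1-\gamma}\Big[2\sum_{t=1}^T b_{t_s}(s_t,a_t) + \frac{2\gamma\Lambda}{1-\gamma}\sqrt{T\log\frac{1}{\delta}} + \gamma\big(2\Lambda/(1-\gamma) + E_T\big)\Big] + \frac{1}{1-\gamma}.$$

Next we provide an upper bound for $\sum_{t=1}^T b_{t_s}(s_t,a_t)$. By Lemma C.1 we know that

$$b_{t_s}(s_t,a_t) \le \frac{2\gamma\Lambda}{1-\gamma}\,\beta_T\sum_{i=1}^N\sum_{k=1}^K \big(1 \wedge \|\phi_{i,k}(s_t,a_t)\|_{\Sigma_{t_s-1}^{-1}}\big).$$

For any $(i,k)$, define

$$\Sigma_{t,i,k} = \Sigma_{t-1} + \sum_{j=1}^{i-1}\sum_{l=1}^K \phi^t_{j,l}(\phi^t_{j,l})^\top + \sum_{l=1}^k \phi^t_{i,l}(\phi^t_{i,l})^\top.$$

By the definition $t_{s+1} = \min\{t \mid \det(\Sigma_{t-1}) > 2\det(\Sigma_{t_s-1})\}$, we have $\det(\Sigma_{t_{s+1}-2}) \le 2\det(\Sigma_{t_s-1})$. Therefore, when $t_s \le t < t_{s+1}-1$, we have

$$\det(\Sigma_{t,i,k}) \le \det(\Sigma_t) \le \det(\Sigma_{t_{s+1}-2}) \le 2\det(\Sigma_{t_s-1}).$$

By Lemma D.4, this implies

$$\|\phi_{i,k}(s_t,a_t)\|^2_{\Sigma_{t_s-1}^{-1}} \le 2\,\|\phi_{i,k}(s_t,a_t)\|^2_{\Sigma_{t,i,k-1}^{-1}}.$$

Therefore,

$$\begin{aligned}
\sum_{t=1}^T\sum_{i=1}^N\sum_{k=1}^K \big(1 \wedge \|\phi_{i,k}(s_t,a_t)\|_{\Sigma_{t_s-1}^{-1}}\big)
&= \sum_{t+1\in W}\sum_{i=1}^N\sum_{k=1}^K \big(1 \wedge \|\phi_{i,k}(s_t,a_t)\|_{\Sigma_{t_s-1}^{-1}}\big) + \sum_{t+1\notin W}\sum_{i=1}^N\sum_{k=1}^K \big(1 \wedge \|\phi_{i,k}(s_t,a_t)\|_{\Sigma_{t_s-1}^{-1}}\big) \\
&\le MNK + \sqrt{NKT\sum_{t+1\notin W}\sum_{i=1}^N\sum_{k=1}^K \big(1 \wedge \|\phi_{i,k}(s_t,a_t)\|^2_{\Sigma_{t_s-1}^{-1}}\big)} \\
&\le MNK + \sqrt{2NKT\sum_{t+1\notin W}\sum_{i=1}^N\sum_{k=1}^K \big(1 \wedge \|\phi_{i,k}(s_t,a_t)\|^2_{\Sigma_{t,i,k-1}^{-1}}\big)} \\
&\le MNK + \sqrt{2NKT\sum_{t=1}^T\sum_{i=1}^N\sum_{k=1}^K \big(1 \wedge \|\phi_{i,k}(s_t,a_t)\|^2_{\Sigma_{t,i,k-1}^{-1}}\big)} \\
&\le MNK + \sqrt{2NKT\cdot 2d\log\frac{d\lambda + NKTL^2}{d\lambda}},
\end{aligned}$$

where the last inequality follows from Lemma D.3. This implies

$$\sum_{t=1}^T b_{t_s}(s_t,a_t) \le \frac{2\gamma\Lambda}{1-\gamma}\,\beta_T\Big[MNK + \sqrt{4NKdT\log\frac{d\lambda + NKTL^2}{d\lambda}}\Big].$$

Next we bound the switching error $E_T$. Since there are in total $M$ switches, there are at most $M$ non-zero terms in the summation of $E_T$. Then we have

$$E_T = \sum_{t=1}^T \big(V_t(s_{t+1}) - V_{t+1}(s_{t+1})\big) \le \frac{\Lambda}{1-\gamma}\,M.$$

Plugging in the result of Lemma C.4, we have $E_T = \tilde{O}\big(\frac{\Lambda}{1-\gamma}\,d\big)$.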
Therefore, we have the final regret upper bound when $T\log(1/\gamma) \ge \log(2NKT)$:

$$\mathrm{Regret}(T) \le \tilde{O}\Big(\frac{1}{\Delta^2(1-\gamma)^2}\cdot\big(d\sqrt{C}/\Delta + \sqrt{dNK}\big)\cdot\sqrt{T}\Big) + \mathrm{polylog}(T)\text{-terms},$$

where $d = \dim(T^*) = d_1^2 d_2$. When $T\log(1/\gamma) \le \log(2NKT)$, the above inequality trivially holds.

C.1 DEFERRED PROOFS

Proof of Lemma C.1. For simplicity, define $\hat{\mathbb{P}} = \mathbb{P}_{\hat T_t}$, which is the estimated transition distribution obtained using $\hat T_t$. Notice that $\mathbb{P}_{i,k}(\cdot|s,a)$ is a Bernoulli distribution with success probability $\langle T^*, \phi_{i,k}(s,a)\rangle$, while $\hat{\mathbb{P}}_{i,k}(\cdot|s,a)$ is a Bernoulli distribution with success probability $\langle \hat T_t, \phi_{i,k}(s,a)\rangle$. Therefore, we have

$$\|\mathbb{P}_{i,k}(\cdot|s,a) - \hat{\mathbb{P}}_{i,k}(\cdot|s,a)\|_1 \le 2\big(1 \wedge |\langle T^* - \hat T_t, \phi_{i,k}(s,a)\rangle|\big).$$

By Lemma B.1, with probability at least $1-\delta$, we have $\|T^* - \hat T_t\|_{\Sigma_{t-1}} \le \beta_t$ for all $t \ge 1$. Then by the Cauchy-Schwarz inequality, the above term can be further bounded as

$$\|\mathbb{P}_{i,k}(\cdot|s,a) - \hat{\mathbb{P}}_{i,k}(\cdot|s,a)\|_1 \le 2\big(1 \wedge \beta_t\cdot\|\phi_{i,k}(s,a)\|_{\Sigma_{t-1}^{-1}}\big).$$

By Lemma D.2, we have

$$\|\mathbb{P}(\cdot|s,a) - \hat{\mathbb{P}}(\cdot|s,a)\|_1 \le \sum_{i,k}\|\mathbb{P}_{i,k}(\cdot|s,a) - \hat{\mathbb{P}}_{i,k}(\cdot|s,a)\|_1 \le 2\sum_{i,k}\big(1 \wedge \beta_t\cdot\|\phi_{i,k}(s,a)\|_{\Sigma_{t-1}^{-1}}\big).$$

Then by Lemma D.1, the desired result holds.

Proof of Lemma C.2. Define $\pi^*$ as the optimal policy under the original model. Then we have

$$\begin{aligned}
V^*(s_t) - \tilde V^*(s_t) &= V^*(s_t) - \tilde V^{\pi^*}(s_t) + \tilde V^{\pi^*}(s_t) - \tilde V^*(s_t) \\
&\le V^*(s_t) - \tilde V^{\pi^*}(s_t) \\
&= \mathbb{E}_{\pi^*}\Big[\sum_{i=1}^\infty \gamma^{i-1} r(s_i,a_i) \,\Big|\, s_1 = s_t\Big] - \mathbb{E}_{\pi^*}\Big[\sum_{i=1}^\infty \gamma^{i-1} \tilde r(s_i,a_i) \,\Big|\, s_1 = s_t\Big],
\end{aligned}$$

where the inequality holds because $\tilde V^*(s_t)$ is the optimal V-function with respect to the truncated-reward model. Notice that by Lemma B.2 with $\delta_0 = 1/(2NKT)$, for policy $\pi^*$, with probability at least $1-\delta_0$, we have $\|s_i\|_1 \le \Lambda = \frac{6}{\Delta^2}\log(4NKT^3)$ for all $1 \le i \le T$. Therefore, with probability at least $1-\delta_0$, $r(s_i,a_i) = \tilde r(s_i,a_i)$ for all $1 \le i \le T$.
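Then

$$\begin{aligned}
V^*(s_t) - \tilde V^*(s_t) &\le \mathbb{E}_{\pi^*}\Big[\sum_{i=1}^\infty \gamma^{i-1}\big(r(s_i,a_i) - \tilde r(s_i,a_i)\big) \,\Big|\, s_1 = s_t\Big] \\
&= \mathbb{E}_{\pi^*}\Big[\sum_{i=1}^T \gamma^{i-1}\big(r(s_i,a_i) - \tilde r(s_i,a_i)\big) \,\Big|\, s_1 = s_t\Big] + \mathbb{E}_{\pi^*}\Big[\sum_{i=T+1}^\infty \gamma^{i-1}\big(r(s_i,a_i) - \tilde r(s_i,a_i)\big) \,\Big|\, s_1 = s_t\Big] \\
&\le (1-\delta_0)\cdot 0 + \delta_0\sum_{i=1}^T \gamma^{i-1}NK + NK\frac{\gamma^T}{1-\gamma} \\
&\le \frac{1}{1-\gamma}\,\delta_0 NK + \frac{1}{1-\gamma}\,\delta_0 NK = \frac{1}{T(1-\gamma)},
\end{aligned}$$

where the last inequality holds when $T\log(1/\gamma) \ge \log(2NKT)$. On the other hand, we have

$$V^\pi_t(s_t) - \tilde V^\pi_t(s_t) = \mathbb{E}_\pi\Big[\sum_{i=0}^\infty \gamma^{t+i}\big(r(s_{t+i},a_{t+i}) - \tilde r(s_{t+i},a_{t+i})\big) \,\Big|\, s_1, \cdots, s_t\Big] \ge 0.$$

Therefore,

$$\mathrm{Regret}(T) \le \widetilde{\mathrm{Regret}}(T) + \sum_{t=1}^T \Big[\big(V^*(s_t) - \tilde V^*(s_t)\big) - \big(V^\pi_t(s_t) - \tilde V^\pi_t(s_t)\big)\Big] \le \widetilde{\mathrm{Regret}}(T) + \frac{1}{1-\gamma}.$$

Proof of Lemma C.3. Define $V_t(s) = \max_a Q_t(s,a)$. By the assumption that $\tilde Q^*(s,a) \le Q_t(s,a)$, we have $\tilde V^*(s) \le V_t(s)$. Then

$$\Delta_t = \tilde V^*(s_t) - \tilde V^\pi_t(s_t) \le V_t(s_t) - \tilde V^\pi_t(s_t) = Q_t(s_t,a_t) - \tilde V^\pi_t(s_t),$$

where the last equality holds since $a_t \in \arg\max_a Q_t(s_t,a)$, i.e., we take the action $a_t$ greedily according to $Q_t$. Since $Q_t = Q^*_{\hat T_{t_s},\, \tilde r + b_{t_s}}$ is the optimal truncated Q-function and satisfies the truncated Bellman equation (10), we have

$$\begin{aligned}
\Delta_t &\le \min\Big\{\tilde r(s_t,a_t) + b_{t_s}(s_t,a_t) + \gamma\,\mathbb{E}_{s'\sim\mathbb{P}_{\hat T_{t_s}}(\cdot|s_t,a_t)}V_t(s'),\ \Lambda/(1-\gamma)\Big\} - \Big[\tilde r(s_t,a_t) + \gamma\,\mathbb{E}_{s'\sim\mathbb{P}(\cdot|s_t,a_t)}\tilde V^\pi_{t+1}(s')\Big] \\
&\le \tilde r(s_t,a_t) + b_{t_s}(s_t,a_t) + \gamma\,\mathbb{E}_{s'\sim\mathbb{P}_{\hat T_{t_s}}(\cdot|s_t,a_t)}V_t(s') - \Big[\tilde r(s_t,a_t) + \gamma\,\mathbb{E}_{s'\sim\mathbb{P}(\cdot|s_t,a_t)}\tilde V^\pi_{t+1}(s')\Big] \\
&= b_{t_s}(s_t,a_t) + \gamma\Big[\mathbb{E}_{s'\sim\mathbb{P}_{\hat T_{t_s}}(\cdot|s_t,a_t)}V_t(s') - \mathbb{E}_{s'\sim\mathbb{P}(\cdot|s_t,a_t)}\tilde V^\pi_{t+1}(s')\Big] \\
&= b_{t_s}(s_t,a_t) + \gamma\Big[\mathbb{E}_{s'\sim\mathbb{P}_{\hat T_{t_s}}(\cdot|s_t,a_t)}V_t(s') - \mathbb{E}_{s'\sim\mathbb{P}(\cdot|s_t,a_t)}V_t(s')\Big] + \gamma\Big[\mathbb{E}_{s'\sim\mathbb{P}(\cdot|s_t,a_t)}V_t(s') - \mathbb{E}_{s'\sim\mathbb{P}(\cdot|s_t,a_t)}\tilde V^\pi_{t+1}(s')\Big].
\end{aligned}$$

For the second term, since $|V_t(s')| \le \Lambda/(1-\gamma)$, by Lemma C.1, we have

$$\gamma\Big[\mathbb{E}_{s'\sim\mathbb{P}_{\hat T_{t_s}}(\cdot|s_t,a_t)}V_t(s') - \mathbb{E}_{s'\sim\mathbb{P}(\cdot|s_t,a_t)}V_t(s')\Big] \le b_{t_s}(s_t,a_t).$$

For the last term, we have

$$\gamma\Big[\mathbb{E}_{s'\sim\mathbb{P}(\cdot|s_t,a_t)}V_t(s') - \mathbb{E}_{s'\sim\mathbb{P}(\cdot|s_t,a_t)}\tilde V^\pi_{t+1}(s')\Big] = \gamma\xi_t + \gamma\big[V_t(s_{t+1}) - \tilde V^\pi_{t+1}(s_{t+1})\big],$$

where $\xi_t = \mathbb{E}_{s'\sim\mathbb{P}(\cdot|s_t,a_t)}\big[V_t(s') - \tilde V^\pi_{t+1}(s')\big] - \big(V_t(s_{t+1}) - \tilde V^\pi_{t+1}(s_{t+1})\big)$.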
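Therefore, we have

$$\sum_{t=1}^T \Delta_t \le \sum_{t=1}^T \big[V_t(s_t) - \tilde V^\pi_t(s_t)\big] \le 2\sum_{t=1}^T b_{t_s}(s_t,a_t) + \gamma\sum_{t=1}^T \xi_t + \gamma\sum_{t=1}^T \big[V_t(s_{t+1}) - \tilde V^\pi_{t+1}(s_{t+1})\big].$$

For the last term above, we have

$$\begin{aligned}
\sum_{t=1}^T \big[V_t(s_{t+1}) - \tilde V^\pi_{t+1}(s_{t+1})\big]
&= \sum_{t=1}^T \big[V_{t+1}(s_{t+1}) - \tilde V^\pi_{t+1}(s_{t+1})\big] + \underbrace{\sum_{t=1}^T \big[V_t(s_{t+1}) - V_{t+1}(s_{t+1})\big]}_{E_T} \\
&= \sum_{t=0}^{T-1} \big[V_{t+1}(s_{t+1}) - \tilde V^\pi_{t+1}(s_{t+1})\big] + \big[V_{T+1}(s_{T+1}) - \tilde V^\pi_{T+1}(s_{T+1}) - V_1(s_1) + \tilde V^\pi_1(s_1)\big] + E_T \\
&\le \sum_{t=1}^T \big[V_t(s_t) - \tilde V^\pi_t(s_t)\big] + 2\Lambda/(1-\gamma) + E_T.
\end{aligned}$$

Notice that $\{\xi_t\}$ is a martingale difference sequence. Therefore, by the Azuma-Hoeffding inequality, we have with probability at least $1-\delta$,

$$\sum_{t=1}^T \xi_t \le \frac{2\Lambda}{1-\gamma}\sqrt{T\log\frac{1}{\delta}}.$$

To summarize, we have

$$\widetilde{\mathrm{Regret}}(T) \le \sum_{t=1}^T \big[V_t(s_t) - \tilde V^\pi_t(s_t)\big] \le 2\sum_{t=1}^T b_{t_s}(s_t,a_t) + \frac{2\gamma\Lambda}{1-\gamma}\sqrt{T\log\frac{1}{\delta}} + \gamma\Big(\sum_{t=1}^T \big[V_t(s_t) - \tilde V^\pi_t(s_t)\big] + 2\Lambda/(1-\gamma) + E_T\Big),$$

which implies

$$\widetilde{\mathrm{Regret}}(T) \le \frac{1}{1-\gamma}\Big[2\sum_{t=1}^T b_{t_s}(s_t,a_t) + \frac{2\gamma\Lambda}{1-\gamma}\sqrt{T\log\frac{1}{\delta}} + \gamma\big(2\Lambda/(1-\gamma) + E_T\big)\Big].$$

Proof of Lemma C.4. On the one hand, we have

$$\frac{\det(\Sigma_T)}{\det(\Sigma_0)} \ge \prod_{s=1}^{M-1}\frac{\det(\Sigma_{t_{s+1}-1})}{\det(\Sigma_{t_s-1})} > 2^{M-1}.$$

On the other hand, we also have

$$\frac{\det(\Sigma_T)}{\det(\Sigma_0)} = \det(\Sigma_0^{-1}\Sigma_T) \le \Big(\frac{\mathrm{Tr}(\Sigma_0^{-1}\Sigma_T)}{d}\Big)^d = \Big(\frac{\mathrm{Tr}\big(I + \lambda^{-1}\sum_{i,k,t}\phi^t_{i,k}(\phi^t_{i,k})^\top\big)}{d}\Big)^d \le \Big(\frac{d + NKTL^2/\lambda}{d}\Big)^d.$$

Therefore, $M < \frac{1}{\log 2}\, d\log\frac{d + NKTL^2/\lambda}{d} + 1$.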
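The determinant-doubling rule behind Lemma C.4 can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the paper's implementation: the features are random vectors with norm at most $L$ (the real $\phi^t_{i,k}$ depend on the state and action), `batch` plays the role of $NK$, and all function names are hypothetical. The covariance inverse and log-determinant are updated with the Sherman-Morrison formula and the matrix determinant lemma.

```python
import math
import random


def count_switches(d=4, batch=6, horizon=200, lam=1.0, L=1.0, seed=0):
    """Count switching times t with det(Sigma_{t-1}) > 2 det(Sigma_{t_s - 1})."""
    rng = random.Random(seed)
    # inv holds Sigma^{-1}; Sigma_0 = lam * I.
    inv = [[(1.0 / lam if i == j else 0.0) for j in range(d)] for i in range(d)]
    logdet = d * math.log(lam)  # log det(Sigma_0)
    ref = None                  # log det at the most recent switch
    switches = 0
    for _ in range(horizon):
        # Here logdet equals log det(Sigma_{t-1}); t = 1 is always a switch.
        if ref is None or logdet > ref + math.log(2.0):
            switches += 1
            ref = logdet
        for _ in range(batch):
            x = [rng.gauss(0.0, 1.0) for _ in range(d)]
            norm = math.sqrt(sum(v * v for v in x))
            scale = L * rng.random() / norm  # enforce ||x||_2 <= L
            x = [v * scale for v in x]
            u = [sum(inv[i][j] * x[j] for j in range(d)) for i in range(d)]
            q = sum(x[i] * u[i] for i in range(d))  # x^T Sigma^{-1} x
            logdet += math.log1p(q)                 # matrix determinant lemma
            for i in range(d):                      # Sherman-Morrison update
                for j in range(d):
                    inv[i][j] -= u[i] * u[j] / (1.0 + q)
    return switches


def switch_bound(d, batch, horizon, lam, L):
    """Right-hand side of Lemma C.4 with NK replaced by `batch`."""
    return d * math.log((d + batch * horizon * L * L / lam) / d) / math.log(2.0) + 1


if __name__ == "__main__":
    print(count_switches(), switch_bound(4, 6, 200, 1.0, 1.0))
```

In this sketch the number of switches grows only logarithmically with the horizon, which is why the switching error $E_T$ stays lower order in the regret bound.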

D AUXILIARY LEMMAS

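Lemma D.1. Let $\mathbb{P}(s'|s,a)$ and $\hat{\mathbb{P}}(s'|s,a)$ be two transition probabilities. Assume that $0 \le r(s,a) \le \Lambda$. Let $Q^*$ be the optimal Q-function for the MDP $M_{\mathbb{P},r}$. Let $\hat Q^*$ be the optimal truncated Q-function for $M_{\hat{\mathbb{P}},r+b}$, which satisfies the following equation

$$\hat Q^*(s,a) = \min\Big\{\Lambda/(1-\gamma),\ r(s,a) + b(s,a) + \gamma\,\mathbb{E}_{s'\sim\hat{\mathbb{P}}(\cdot|s,a)}\max_{a'}\hat Q^*(s',a')\Big\}.$$

Then if

$$b(s,a) \ge \frac{\gamma\Lambda}{1-\gamma}\,\|\mathbb{P}(\cdot|s,a) - \hat{\mathbb{P}}(\cdot|s,a)\|_1,$$

we have $\hat Q^*(s,a) \ge Q^*(s,a)$ for all $(s,a)$. Furthermore, we have for any $V(s)$ such that $0 \le V(s) \le \Lambda/(1-\gamma)$,

$$\gamma\,\big|\mathbb{E}_{s'\sim\mathbb{P}(s'|s,a)}V(s') - \mathbb{E}_{s'\sim\hat{\mathbb{P}}(s'|s,a)}V(s')\big| \le b(s,a).$$

Lemma D.3 (Lemma 11 in (Abbasi-Yadkori et al., 2011)). For any $\{x_t\}_{t=1}^T \subseteq \mathbb{R}^d$, let $\Sigma_t = \lambda I + \sum_{\tau=1}^t x_\tau x_\tau^\top$. Then we have

$$\sum_{t=1}^T \big(1 \wedge \|x_t\|^2_{\Sigma_{t-1}^{-1}}\big) \le 2d\log\frac{d\lambda + TL^2}{d\lambda},$$

where $L = \sup\|x_t\|_2$.

Lemma D.4 (Lemma 12 in (Abbasi-Yadkori et al., 2011)). Let $A, B \in \mathbb{R}^{d\times d}$ be two positive definite matrices with $A \succeq B$. Then for any $x \in \mathbb{R}^d$, we have

$$\|x\|^2_A \le \|x\|^2_B\cdot\frac{\det(A)}{\det(B)}.$$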
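As a quick sanity check of the elliptical potential bound in Lemma D.3, the sketch below accumulates $\sum_t (1 \wedge \|x_t\|^2_{\Sigma_{t-1}^{-1}})$ for randomly generated vectors and compares it with $2d\log\frac{d\lambda + TL^2}{d\lambda}$. The random vectors and the function name are assumptions made for illustration; the lemma itself holds for any sequence with $\|x_t\|_2 \le L$.

```python
import math
import random


def elliptical_potential(d=3, horizon=500, lam=1.0, L=2.0, seed=1):
    """Return (sum_t min(1, ||x_t||^2 in Sigma_{t-1}^{-1} norm), bound of Lemma D.3)."""
    rng = random.Random(seed)
    # inv holds Sigma_{t-1}^{-1}, starting from Sigma_0 = lam * I.
    inv = [[(1.0 / lam if i == j else 0.0) for j in range(d)] for i in range(d)]
    total = 0.0
    for _ in range(horizon):
        x = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(v * v for v in x))
        scale = L * rng.random() / norm  # enforce ||x_t||_2 <= L
        x = [v * scale for v in x]
        u = [sum(inv[i][j] * x[j] for j in range(d)) for i in range(d)]
        q = sum(x[i] * u[i] for i in range(d))  # ||x_t||^2 w.r.t. Sigma_{t-1}^{-1}
        total += min(1.0, q)
        for i in range(d):  # Sherman-Morrison: Sigma_t^{-1} from Sigma_{t-1}^{-1}
            for j in range(d):
                inv[i][j] -= u[i] * u[j] / (1.0 + q)
    bound = 2 * d * math.log((d * lam + horizon * L * L) / (d * lam))
    return total, bound


if __name__ == "__main__":
    print(elliptical_potential())
```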

E ADDITIONAL EXPERIMENTS

E.1 SYNTHETIC CASCADE NETWORK

We conduct experiments on synthetic cascade networks as shown in Figure 2a. K = 4 identical networks are associated with different contents; each cascade network has N = 300 nodes with high/medium/low influential power, 5/20/275 nodes respectively. More specifically, the five high nodes can activate themselves and each other at the next time step with probability 0.1; the high nodes can activate the medium nodes with probability …; the medium nodes can only activate the low nodes with probability 0.12. Therefore, in these networks the high nodes are the better actions but yield a more delayed reward. In addition, the $d_2 = 3$-dimensional content features for the K = 4 contents are {(1, 0, 0), (0, 1, 0), (0, 0, 1), (0.3, 0.3, 0.4)}; the $d_1 = 9$-dimensional user features are one-hot for each type of node associated with each dimension of the content features. Finally, the underlying linear dynamics $T^*$ can be easily computed from the desired edge probabilities. The experimental result for the synthetic cascade networks is reported in Figure 1a.

E.2 SYNTHETIC STAR-SHAPE NETWORK

We conduct additional experiments on another synthetic network as shown in Figure 2b, where a better action has a more delayed reward, so making a decision at each time step pays off. This synthetic network has three influential nodes out of N = 70 nodes in total; the center one is the best choice and can activate the other two influential nodes. The central influential node also has a delayed but higher expected reward than activating either of the other two neighboring influential nodes. For simplicity, we consider a single content here, K = 1 and $d_2 = 1$. The $d_1 = 6$ user features are one-hot vectors, indicating their neighborhood subgraph. The discount factor of the reward is γ = 0.9. In Figure 3, we compare the performance of MORIMA to MORIMA with known $A^k$ as an upper bound and to IMLinUCB as a baseline. The details of these algorithms are the same as in Section 5. MORIMA explores the unknown graph efficiently: its learning curve over the first 40 time steps overlaps with that of MORIMA given the true dynamics $A^k$. In addition, the cumulative rewards of MORIMA and of the upper bound stay at the same high level. Furthermore, we observe that the learning procedure of IMLinUCB takes much longer and converges to a much lower level. The classic IM setting, which activates k seeds at once every k steps, shows its limits, whereas adaptive decision making leads to a better result. We ran all experiments on our internal cluster with 8 CPUs and 128 GB of memory per task.

). Then with probability at least $1-\delta$, the optimistic condition $\tilde Q^*(s,a) \le Q_t(s,a)$ holds for all $t \ge 1$.

Then the result holds by applying the same proof as that of Lemma B.2.

Proof of Lemma F.2.
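We have

$$\begin{aligned}
L_t(T^*) &= \Big\|\lambda T^* + \sum_{\tau=1}^{t-1}\sum_{k=1}^K\sum_{i=1}^N \big[\mu(\langle T^*, \phi^\tau_{i,k}\rangle) - (s_{\tau+1})_{i,k}\big]\phi^\tau_{i,k}\Big\|_{\Sigma_{t-1}^{-1}} \\
&\le \|\lambda T^*\|_{\Sigma_{t-1}^{-1}} + \Big\|\sum_{\tau=1}^{t-1}\sum_{k=1}^K\sum_{i=1}^N \big[\mu(\langle T^*, \phi^\tau_{i,k}\rangle) - (s_{\tau+1})_{i,k}\big]\phi^\tau_{i,k}\Big\|_{\Sigma_{t-1}^{-1}} \\
&\le \|\lambda T^*\|_{(\lambda I)^{-1}} + \Big\|\sum_{\tau=1}^{t-1}\sum_{k=1}^K\sum_{i=1}^N \big[\mu(\langle T^*, \phi^\tau_{i,k}\rangle) - (s_{\tau+1})_{i,k}\big]\phi^\tau_{i,k}\Big\|_{\Sigma_{t-1}^{-1}} \\
&\le \sqrt{\lambda}\,\|T^*\|_2 + \big(\beta_t - \sqrt{\lambda}\,\|T^*\|_2\big) = \beta_t,
\end{aligned}$$

where the last but one inequality holds by applying Lemma B.3 with the same variance upper bound as in the proof of Lemma B.1.

Proof of Lemma F.3. Let $\hat{\mathbb{P}} = \mathbb{P}_{\hat T_t}$. By the assumption that $\mu' \le 1$, we still have

$$\|\mathbb{P}_{i,k}(\cdot|s,a) - \hat{\mathbb{P}}_{i,k}(\cdot|s,a)\|_1 \le 2\big(1 \wedge |\langle T^* - \hat T_t, \phi_{i,k}(s,a)\rangle|\big) \le 2\big(1 \wedge \|T^* - \hat T_t\|_{\Sigma_{t-1}}\cdot\|\phi_{i,k}(s,a)\|_{\Sigma_{t-1}^{-1}}\big).$$

By Lemma F.2, with probability at least $1-\delta$, we have $L_t(T^*) \le \beta_t$ and then $L_t(\hat T_t) \le \beta_t$ for all $t \ge 1$. Therefore, we have

$$\begin{aligned}
2\beta_t &\ge L_t(\hat T_t) + L_t(T^*) \\
&\ge \Big\|\lambda\hat T_t + \sum_{\tau=1}^{t-1}\sum_{k=1}^K\sum_{i=1}^N \big[\mu(\langle \hat T_t, \phi^\tau_{i,k}\rangle) - (s_{\tau+1})_{i,k}\big]\phi^\tau_{i,k} - \lambda T^* - \sum_{\tau=1}^{t-1}\sum_{k=1}^K\sum_{i=1}^N \big[\mu(\langle T^*, \phi^\tau_{i,k}\rangle) - (s_{\tau+1})_{i,k}\big]\phi^\tau_{i,k}\Big\|_{\Sigma_{t-1}^{-1}} \\
&= \Big\|\Big(\lambda I + \sum_{\tau=1}^{t-1}\sum_{k=1}^K\sum_{i=1}^N \mu'(\langle \bar T, \phi^\tau_{i,k}\rangle)\,\phi^\tau_{i,k}(\phi^\tau_{i,k})^\top\Big)\cdot(\hat T_t - T^*)\Big\|_{\Sigma_{t-1}^{-1}} \\
&\ge 1/\kappa\cdot\big\|\Sigma_{t-1}\cdot(\hat T_t - T^*)\big\|_{\Sigma_{t-1}^{-1}} = 1/\kappa\cdot\|\hat T_t - T^*\|_{\Sigma_{t-1}},
\end{aligned}$$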



$\tilde{O}(\cdot)$ ignores all logarithmic terms.



6: if $\det(\Sigma_{t-1}) > 2Z$ then
7: Calculate $\hat T_t$ and $b_t(s,a)$ according to Eqn. (3) and Eqn. (5).

8

14: end for

Algorithm 2 Truncated Value Iteration
1: Input: parameter $T$, reward $r(s,a)$, bonus term $b(s,a)$.
2: Initialize $Q(s,a) = \frac{\Lambda}{1-\gamma}$.
3: while not converged do
4: $Q(s,a) \leftarrow \min\{\frac{\Lambda}{1-\gamma},\ r(s,a) + b(s,a) + \gamma\,\mathbb{E}_{s'\sim\mathbb{P}_T(\cdot|s,a)}\max_{a'}Q(s',a')\}$
5: end while
6: Return: optimistic Q-function $Q(s,a)$.
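The truncated update of Algorithm 2 can be sketched on a small tabular MDP. The version below is only an illustration under simplifying assumptions: the state space of the influence maximization MDP is exponentially large and is handled by approximate planning in the experiments, whereas here the transition model is passed as an explicit table `P[s][a]`. The function name, the tolerance `tol`, and the iteration cap `max_iter` are hypothetical choices.

```python
def truncated_value_iteration(P, r, b, gamma, Lam, tol=1e-10, max_iter=100000):
    """Tabular sketch of Algorithm 2.

    P[s][a] is a list of next-state probabilities, r[s][a] the reward,
    b[s][a] the bonus. Q is initialized at the cap Lam / (1 - gamma) and
    every update is truncated at that cap.
    """
    n_states, n_actions = len(P), len(P[0])
    cap = Lam / (1.0 - gamma)
    Q = [[cap] * n_actions for _ in range(n_states)]
    for _ in range(max_iter):
        V = [max(row) for row in Q]  # V(s') = max_a' Q(s', a')
        new_Q = [
            [
                min(cap, r[s][a] + b[s][a]
                    + gamma * sum(p * v for p, v in zip(P[s][a], V)))
                for a in range(n_actions)
            ]
            for s in range(n_states)
        ]
        diff = max(
            abs(new_Q[s][a] - Q[s][a])
            for s in range(n_states)
            for a in range(n_actions)
        )
        Q = new_Q
        if diff < tol:  # "converged" in the sense of a small sup-norm change
            break
    return Q


if __name__ == "__main__":
    # Two states, two actions; state 1 is absorbing with reward 1.
    P = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [0.0, 1.0]]]
    r = [[0.0, 0.5], [1.0, 1.0]]
    zero = [[0.0, 0.0], [0.0, 0.0]]
    print(truncated_value_iteration(P, r, zero, gamma=0.5, Lam=1.0))
```

Starting from the cap makes every iterate an upper bound on the fixed point, so the returned values stay optimistic, and a nonnegative bonus can only increase them up to the cap.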

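Bonus term. Let $\mathbb{P}(s'|s,a)$ be the true transition probability and $\hat{\mathbb{P}}(s'|s,a)$ be the empirical estimate. As a typical result in MDP theory, we require $b(s,a) \ge \gamma\,\|V\|_\infty\cdot\|\mathbb{P}(\cdot|s,a) - \hat{\mathbb{P}}(\cdot|s,a)\|_1$ to ensure optimism. We exploit the fact that $\mathbb{P}(\cdot|s,a)$ and $\hat{\mathbb{P}}(\cdot|s,a)$ are factorized, i.e., $\mathbb{P}(\cdot|s,a) = \otimes_{i=1}^N\otimes_{k=1}^K \mathbb{P}_{i,k}(\cdot|s,a)$ and $\hat{\mathbb{P}}(\cdot|s,a) = \otimes_{i=1}^N\otimes_{k=1}^K \hat{\mathbb{P}}_{i,k}(\cdot|s,a)$, which stems from the independence assumption (Assumption 1). This gives us $\|\mathbb{P}(\cdot|s,a) - \hat{\mathbb{P}}(\cdot|s,a)\|_1 \le \sum_{i=1}^N\sum_{k=1}^K \|\mathbb{P}_{i,k}(\cdot|s,a) - \hat{\mathbb{P}}_{i,k}(\cdot|s,a)\|_1$. Notice that $\mathbb{P}_{i,k}(\cdot|s,a)$ is a Bernoulli distribution; then by Assumption 2 we have $\|\mathbb{P}_{i,k}(\cdot|s,a) - \dots$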

Figure 1: Real-time discounted sum of rewards of IM on synthetic and Twitter networks. Averaged results with 85% CI bands are included.

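The optimal truncated Q-function $Q^*_{\hat T_{t_s},\, \tilde r + b_{t_s}}(s,a)$ (Algorithm 2) satisfies the following truncated Bellman equation:

$$Q^*_{\hat T_{t_s},\, \tilde r + b_{t_s}}(s,a) = \min\Big\{\Lambda/(1-\gamma),\ \tilde r(s,a) + b_{t_s}(s,a) + \gamma\,\mathbb{E}_{s'\sim\mathbb{P}_{\hat T_{t_s}}(\cdot|s,a)}\max_{a'}Q^*_{\hat T_{t_s},\, \tilde r + b_{t_s}}(s',a')\Big\}. \tag{10}$$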

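Lemma D.2 (Factorization). If $\mathbb{P}(\cdot|s,a) = \otimes_{i=1}^N \mathbb{P}_i(\cdot|s,a)$ and $\hat{\mathbb{P}}(\cdot|s,a) = \otimes_{i=1}^N \hat{\mathbb{P}}_i(\cdot|s,a)$, then

$$\|\mathbb{P}(\cdot|s,a) - \hat{\mathbb{P}}(\cdot|s,a)\|_1 \le \sum_{i=1}^N \|\mathbb{P}_i(\cdot|s,a) - \hat{\mathbb{P}}_i(\cdot|s,a)\|_1.$$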
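The factorization bound can be verified by brute force for a handful of Bernoulli coordinates, which is the case used for the bonus term. The following sketch enumerates all outcomes of two product-Bernoulli distributions and compares their $\ell_1$ distance with the sum of the coordinate-wise distances; the function names and the example probabilities are illustrative assumptions.

```python
from itertools import product


def product_bernoulli(ps):
    """Joint pmf of independent Bernoulli(p_i) coordinates, keyed by outcome tuple."""
    pmf = {}
    for outcome in product((0, 1), repeat=len(ps)):
        prob = 1.0
        for bit, p in zip(outcome, ps):
            prob *= p if bit else (1.0 - p)
        pmf[outcome] = prob
    return pmf


def l1_distance(pmf_a, pmf_b):
    """l1 distance between two pmfs defined on the same outcomes."""
    return sum(abs(pmf_a[k] - pmf_b[k]) for k in pmf_a)


def marginal_l1_sum(ps, qs):
    """Sum of coordinate-wise distances; ||Bern(p) - Bern(q)||_1 = 2 |p - q|."""
    return sum(2.0 * abs(p - q) for p, q in zip(ps, qs))


if __name__ == "__main__":
    ps = [0.1, 0.5, 0.9, 0.3]
    qs = [0.2, 0.4, 0.7, 0.35]
    joint = l1_distance(product_bernoulli(ps), product_bernoulli(qs))
    print(joint, marginal_l1_sum(ps, qs))
```

The enumeration costs $2^N$ and is only meant for small examples; the point of the lemma is that the bonus never needs the joint distance, only the $NK$ coordinate-wise terms.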

Figure 2: Synthetic network visualization. (a) Cascade network. Four identical networks are associated with different content; their nodes are colored differently. The node activation could not happen across content. The cascade network has nodes with high/medium/low influential power, 5/20/275 nodes respectively. (b) Star-shape network. This network has three influential nodes; the center one can activate the other two influential nodes. Only edges with positive probability are visualized.

Optimism). Let Assumption 5, Assumptions 2-4 hold. Set the bonus term to beb t (s, a) β t • φ i,k (s, a) Σ -1 t-1

for any $V(s)$ such that $0 \le V(s) \le \Lambda/(1-\gamma)$, $\gamma\,|\mathbb{E}_{s'\sim\mathbb{P}(s'|s,a)}V(s') - \mathbb{E}_{s'\sim\mathbb{P}_{\hat T_t}(s'|s,a)}V(s')| \le b_t(s,a)$.

Proof of Theorem 3. We can verify that Lemma C.2, Lemma C.3, and Lemma C.4 also hold for the generalized linear model. Therefore, the exact same proof as that of Theorem 2 applies with Lemma C.1 replaced with Lemma F.3. The result only differs by a factor of $\kappa$.

F.4 DEFERRED PROOFS OF LEMMAS

Proof of Lemma F.1. Notice that we have assumed $\mu(0) = 0$ and $\mu' \le 1$. Therefore, for $z > 0$, $\mu(z) = \mu(z) - \mu(0) \le z - 0 = z$. Then

$$\mathbb{E}[(s_{t+1})_{i,k}|s_t, a_t] = \mu\Big(\sum_j A^k_{i,j}(s_t^{a_t})_{j,k}\Big) \le \sum_j A^k_{i,j}(s_t^{a_t})_{j,k}.$$

Dongruo Zhou, Jiafan He, and Quanquan Gu. Provably efficient reinforcement learning for discounted MDPs with feature mapping. In International Conference on Machine Learning, pp. 12793-12802. PMLR, 2021b.

Jinhang Zuo, Xutong Liu, Carlee Joe-Wong, John CS Lui, and Wei Chen. Online competitive influence maximization. In International Conference on Artificial Intelligence and Statistics, pp. 11472-11502. PMLR, 2022.

F EXTENSION TO GENERALIZED LINEAR MODEL

In this section, we show how to extend our algorithm and regret bound to generalized linear models. We first restate below the modified assumption on the transition model.

Assumption 6 (Generalized Bernoulli Independent Cascade Model). Let $s'$ be the next state. For each $k \in [K]$, we assume there is an underlying connectivity matrix $A^k \in \mathbb{R}^{n\times n}$ such that … where $\mu: \mathbb{R} \to \mathbb{R}$ satisfies $\mu(0) = 0$ and $1/\kappa \le \mu' \le 1$ for some $\kappa \ge 1$. And we assume the $s'_{i,k}$'s are independent conditioned on $s$.

F.1 MORIMA FOR GENERALIZED LINEAR MODEL

Next, we state the changes to our algorithm. Under this assumption, we cannot simply use ridge regression to get an empirical estimate of $T^*$. Instead, we estimate the tensor model by … where $\phi^t_{i,k}$ and $\Sigma_t$ are defined as in Section 3. We still perform optimistic planning with respect to the truncated-reward model, where the reward bonus term $b_t(s,a)$ is replaced with …, which is the original bonus term multiplied by $2\kappa$. We adopt the same slow switching method as before.

F.2 REGRET ANALYSIS

We have the following regret bound, which is the original bound multiplied by $\kappa$.

Theorem 4 (Re-state Theorem 3). Let Assumption 5 and Assumptions 2-4 hold. With probability at least $1-\delta$, Algorithm 1 satisfies the following regret upper bound: …

F.3 PROOF SKETCH OF THEOREM 3

The proof of Theorem 3 differs only slightly from the proof of Theorem 2. Next we go through the proof of Theorem 2 and state the corresponding lemmas in the generalized linear model setting.

First, we have exactly the same result for the high probability bounds for the number of active user-content pairs.

Lemma F.1 (High probability bounds for the number of active user-content pairs). Let Assumption 5 and Assumptions 2-4 hold. For any possibly non-stationary policy $\pi$, with probability at least $1-\delta$, we have for all $t \ge 1$, …

The next two lemmas justify the choice of the bonus term.

Lemma F.2 (Confidence Region). Let Assumption 5 and Assumptions 2-4 hold. With probability at least $1-\delta$, we have for all $t \ge 1$, $L_t(T^*) \le \beta_t$, where $\beta_t$ is defined as in Eqn. (6).

where we use the Lagrange Mean Value Theorem in the first equality and $\mu'(\cdot) \ge 1/\kappa$. Then the desired result holds by applying Lemma D.2 and Lemma D.1 in the same way as in the proof of Lemma C.1.

