LEARNING TO TAKE A BREAK: SUSTAINABLE OPTIMIZATION OF LONG-TERM USER ENGAGEMENT

Abstract

Optimizing user engagement is a key goal for modern recommendation systems, but blindly pushing users towards increased consumption risks burn-out, churn, or even addictive habits. To promote digital well-being, most platforms now offer a service that periodically prompts users to take a break. These prompts, however, must be set up manually, and so may be suboptimal for both users and the system. In this paper, we propose a framework for optimizing long-term engagement by learning individualized breaking policies. Using Lotka-Volterra dynamics, we model users as acting based on two balancing latent states: drive and interest, which must be conserved. We then give an efficient learning algorithm, provide theoretical guarantees, and empirically evaluate its performance on semi-synthetic data.

1. INTRODUCTION

As consumers of content, we have come to rely extensively on algorithmic recommendations. This has made the task of recommending, in a relevant, timely, and personalized manner, key to the success of modern media platforms. Most commercial systems are built with the primary aim of optimizing user engagement, a process in which machine learning plays a central role. But alongside their many successes, recommendation systems have also been scrutinized for heedlessly driving users towards excessive and often undesired levels of consumption. This has raised awareness of the need to redesign recommendation systems in ways that actively promote digital well-being. How can media platforms balance business goals with user well-being? One prominent approach, now offered by most major platforms, is to periodically prompt users to take breaks (Constine, 2018; Perez, 2018). The idea behind breaks is that occasional disruptions curb the inertial urge for perpetual consumption, and can therefore aid in reducing 'mindless scrolling' (Rauch, 2018), or even addiction (Montag et al., 2018; Ding et al., 2016). As a general means for promoting well-being, breaking is psychologically well-grounded (e.g., Danziger et al., 2011; Sievertsen et al., 2016). But for platforms, breaks serve a utilitarian purpose: their goal is to foster long-term engagement by compensating for the myopic nature of conventional recommendation algorithms, which are typically trained to optimize immediate engagement. Since breaking schedules are applied heuristically on top of existing recommendation policies, and typically need to be set up manually by users, current solutions are unlikely to utilize their full potential (Monge Roffarello & De Russis, 2019). In this paper, we propose a disciplined learning framework for responsible and sustainable optimization of long-term user engagement by controlling breaks.
Our point of departure is that sustained engagement necessitates sustained user well-being, and here we advocate for breaks as a means to establish both. Focusing on feed-based recommendation, our framework optimizes long-term engagement by learning an optimal breaking policy that prescribes individualized breaking schedules. The challenge in learning to break is that the effects of recommendations on users can slowly accumulate over time, rendering ineffectual any policy that relies on clear signs of over-exposure. To be preemptive, we argue that breaks must be scheduled in a way that anticipates the future trajectory of user behavior, and early on. To achieve this, we introduce a novel class of behavioral models based on Lotka-Volterra (LV) dynamical systems (Lotka, 1910). These depict users as acting based on two balancing forces, drive to consume and intrinsic interest, with corresponding latent states. Intuitively, high interest increases drive to consume, but prolonged consumption decreases interest; together, these describe how user behavior varies over time and in response to recommendations. Our model captures the notion that interest can exhaust long before over-consumption is observed. This arms our approach with the prescience needed to prevent burn-out by ensuring that interest is sustainably preserved; thus, whereas current solutions target the symptom, ours aims for the cause. Our proposed learning algorithm consists of two steps: First, we embed user interaction sequences in 'LV-space', the set of all possible trajectories that our behavioral model class can express. Then, we optimize individualized breaking policies by solving an optimal control problem over this latent space, in which the control variable is a breaking schedule applied on top of the existing recommendation scheme. Here the challenge is that different breaking policies can lead to different counterfactual trajectories, of which observational data is only partially informative.
Since our goal considers long-term outcomes, our solution is to optimize directly for counterfactual steady-states. From a behavioral perspective, we view this as aiming to steer towards sustainable habits; from a computational perspective, under our choice of policy class, this enables tractable learning. As we show, the optimization landscape of LV equilibria admits a compact representation, whose main benefit is that it can be fully described by predictions of individualized user engagement rates. Practically, this is advantageous, as it circumvents the need to take arbitrary and costly exploration steps, and enables learning using readily available predictive tools (e.g., Gupta et al., 2006). We make use of a small set of learned predictive models, each trained on a small and minimally-invasive experimental dataset, which allow us to tune our policy to suit different conditions. The final learned policy has an intuitive interpretation: it takes as input a small set of predictions for a user, and via careful interpolation, applies a decision rule that anticipates the effects of breaking on future outcomes (cf. conventional approaches, which take in predictions and apply the myopic argmax rule). Our main theoretical result is a bound on the expected long-term engagement of our learned breaking policy, relative to the optimal policy in the class. We show that the gap decomposes into three distinct additive terms: (i) predictive error, (ii) modeling error (i.e., embedding distortion), and (iii) variance around the (theoretical) steady state. These provide an intuitive interpretation of the bound, as well as a means to understand the effects of different modeling choices. Our proof technique relies on carefully weaving LV equilibrium analysis into conventional concentration bounds for learning. Finally, we provide an empirical evaluation of our approach on semi-synthetic data.
Using the MovieLens 1M dataset, we generate discrete time-series data in a way that captures the essence of our behavioral model, but is different from the actual continuous-time dynamics we optimize over. Results show that despite this gap, our approach improves significantly over myopic baselines, and often closely matches an optimal oracle. Taken together, these demonstrate the potential utility of our approach. Code is available at: https://github.com/lvml-iclr-2023/lvml.

Broader perspective. At a high level, our work argues for viewing recommendation as a task of sustainable resource management. As with other cognitive tasks, engaging with digital content requires the availability of certain cognitive resources: attentional, executive, or emotional. These resources are inherently limited, and prolonged engagement depletes them (Kahneman, 1973; Muraven & Baumeister, 2000); this, in turn, can reduce the capacity of key cognitive processes (e.g., perception, attention, memory, self-control, and decision-making), and in the extreme, cause ego depletion (Baumeister et al., 1998) or cognitive fatigue (Mullette-Gillman et al., 2015). As a means to allow resources to replenish, 'mental breaks' have been shown to be highly effective (Bergum & Lehr, 1962; Henning et al., 1989; Gilboa et al., 2008; Ross et al., 2014; Helton & Russell, 2017). Nevertheless, traditional approaches to recommendation remain agnostic to the idea that recommending takes a cognitive toll: they simply recommend, at each point in time, the item predicted to be most engaging (Robertson, 1977). As an alternative, our approach explicitly models recommendation as a process which draws on these resources, and therefore must also conserve them.
The subclass of 'Predator-Prey' LV dynamics we draw on is used extensively in ecology for modeling the dynamics of interacting populations, and demonstrates how over-predation can ultimately lead to self-extinction by eliminating the prey population, but also how enabling resources to naturally replenish ensures sustainable relations. As such, here we advocate for studying recommendation systems as human-centric ecosystems, and take one step towards their sustainable design.

1.1. RELATED WORK

User dynamics: latent states and feedback. A recent body of work aims to capture time-varying behavior by modeling users as acting based on dynamic latent states. Broadly, works in this field model the effects of recommendations as either shifting user preferences via positive-only feedback (Jiang et al., 2019; Kalimeris et al., 2021; Sanna Passino et al., 2021; Dean & Morgenstern, 2022), or reducing willingness to consume via negative-only feedback, e.g. via boredom, satiation, or fatigue (Wang & Lin, 2003; Warlop et al., 2018; Kleinberg & Immorlica, 2018; Cao et al., 2020; Leqi et al., 2021). While these restrict behavior to expressing a unidirectional effect, our approach integrates both types of feedback and models internal states as competing but balancing forces, which we argue is more realistic. This draws connections to recent attempts to inject psychological modeling into recommendation system design (Kleinberg et al., 2022; Dubey et al., 2022; Curmei et al., 2022). Works in this field often combine theoretical analysis with simulation studies (Schmit & Riquelme, 2018; Chaney et al., 2018; Mansoury et al., 2020; Krauth et al., 2020), and here we follow suit.

Lotka-Volterra dynamics: modeling, learning, and control. The study of ecosystem dynamics and their conservation has a long and rich history, in which LV analysis plays an integral role (see e.g. Hofbauer et al. (1998); Takeuchi (1996)). LV systems are used primarily for modeling biological ecosystems, but also appear in economics (Weibull, 1997; Samuelson, 1998), finance (Farmer, 2002; Scholl et al., 2021), and behavioral modeling (e.g., drug addiction and relapse (Duncan et al., 2019)). In terms of learning, Gorbach et al. (2017) and Ryder et al. (2018) propose variational techniques for dynamical systems, but do not consider control. Our work aims to directly learn optimal policies, for which we draw on recent advances in turnpike optimal control (Trélat & Zuazua, 2015).

2. LEARNING SETTING

We consider a sequential recommendation setting in which users interact with a stream of recommended items over time. New users u are sampled independently from some unknown distribution D, and time for each user is measured relative to their time of joining. Interactions occur at discrete time-points in continuous time, t ∈ R_+, and upon user request: when a user u queries the system for additional content at time t, the system responds by presenting a recommended item x(t) ∈ X. We assume recommendations are governed by an existing and fixed recommendation policy ψ, which determines x(t) given a request from u at time t. We allow ψ to be stochastic, and make no additional assumptions on its structure or mechanics. User interactions are therefore described by a sequence of pairs {(t_i, x_i)}_i, where x_i = x(t_i) ∼ ψ(u; t_i) ∈ X is the item recommended to u at time t_i. Our key modeling assumption is that subsequent request times, t_{i+1} = t_i + ∆t_i for ∆t_i > 0, are determined jointly by the user's 'state' at time t_i and the recommended item x_i (note this means ∆t_i can depend on t_i). We consider user states as latent and in the abstract, but broadly expect 'good' recommendations to entail frequent requests by inducing small values of ∆t_i. Together, user u's choice behavior, coupled with the policy ψ, induces a temporal point process (TPP) which governs the generation of interaction sequences of duration t as {(t_i, x_i) | t_i ∈ [0, t]} ∼ TPP_ψ(u; t). For each user, the system collects data until some (relative) fixed time T_0, and seeks to optimize engagement in the subsequent interval [T_0, T_0 + T], where T is a predetermined time horizon. We denote the corresponding input sequences by S^0_u = {(t_i, x_i) | t_i ∈ [0, T_0)}, and target sequences by S_u = {(t_i, x_i) | t_i ∈ [T_0, T_0 + T]}.
Defining the engagement rate of u for the chosen time horizon T as (1/T)|S_u|, our goal in learning will be to maximize the expected long-term user engagement rate, namely E_{u∼D} E[(1/T)|S_u|] for (S^0_u, S_u) ∼ TPP_ψ(u; T_0 + T), in expectation over new users.

Breaking policies. The unique aspect of our learning problem is that our only means for increasing engagement is by suggesting breaks. Our task will be to learn a breaking policy π ∈ Π which can override ψ: when user u queries the system at time t_i, the policy π(u; t_i) ∈ {0, 1} determines whether to present the intended item x(t_i) ∼ ψ(u; t_i) (for π = 0), or suggest a break instead (π = 1). Denoting the overall composed policy by π ∘ ψ, our aim is to learn a breaking policy π that increases engagement by complementing an existing ψ. Hence, our learning objective is:

arg max_{π∈Π} E_{u∼D} E_TPP[(1/T)|S_u|],   (S^0_u, S_u) ∼ TPP_{π∘ψ}(u; T_0 + T)    (1)

Broadly, we expect breaks to negatively affect short-term engagement (i.e., entail longer ensuing ∆t_i), but to have the potential to improve engagement in the long run if scheduled appropriately.

Data and exploration. For learning an optimal breaking policy, we assume the system has access to a dataset of previously logged interactions S = {(S^0_1, S_1), . . . , (S^0_m, S_m)} for m train-time users u_1, . . . , u_m ∼ D, collected under a 'clean' recommendation policy ψ. It will be convenient to assume that the system 'featurizes' user input sequences S^0_u via a vector mapping ϕ used in learning. For clarity, we overload notation and represent users as u = ϕ(S^0_u) ∈ R^d. This allows us to support additional user feedback (e.g., ratings) or information (e.g., demographics) as input to ϕ. Since Eq. (1) is a policy problem (note breaks affect outcomes), learning requires some form of exploration or experimentation. Here we aim for experimentation to be simple and minimal.
Specifically, we assume access to a small number N of additional datasets, S^(j) = {(u_k, S_k)}_{k=1}^{m_j}, collected for different users, and under composed policies π_j ∘ ψ for various predetermined π_j. These datasets can either be given at the onset, or collected as part of the learning process; our only use of them will be for learning predictors ŷ = f_j(u) of the engagement rate y = (1/T)|S_u|. Ideally, we would like to make do with only a few S^(j) of small size that can be gathered concurrently; our results show that even a single additional dataset can be highly informative, and in some cases sufficient.

Learning to break. To effectively optimize Eq. (1), learning must anticipate the effects of different breaking policies π ∈ Π on future sequences S_u of unobserved users u. It will therefore be useful to distinguish between the policy π itself, which determines when breaks are applied, and a component for estimating the individualized counterfactual engagement rates (1/T)|Ŝ_u(π)| induced by π, which will aid in choosing a good policy. Our focus will be on learning simple policies coupled with rich and task-appropriate models of responsive user behavior. In particular, we set Π to include all stationary policies, π(u) = π(p_u), which for each user u recommend a break with a time-independent personalized probability p_u = p(u) ∈ [0, 1]. Stationary policies are interpretable, amenable to efficient optimization, and straightforward to implement. As we show, despite their simplicity, such policies can be quite expressive when coupled with an appropriate behavioral model.
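To make the policy class concrete, a stationary breaking policy π(p_u) composed with an existing recommender ψ can be sketched as follows; the `psi` callable and the `BREAK` sentinel are illustrative placeholders rather than the paper's implementation:

```python
import numpy as np

BREAK = "BREAK"  # illustrative sentinel for a break suggestion

def compose(psi, p_u, rng=None):
    """Stationary breaking policy pi(p_u) composed with a recommender psi:
    on each request, suggest a break with probability p_u (pi = 1),
    otherwise defer to the underlying recommendation policy (pi = 0)."""
    rng = rng or np.random.default_rng(0)

    def pi_psi(u, t):
        if rng.random() < p_u:
            return BREAK
        return psi(u, t)

    return pi_psi

# usage with a toy psi that always recommends item "x0";
# p_u = 0 recovers psi exactly, p_u = 1 always breaks
policy = compose(lambda u, t: "x0", p_u=0.0)
```

Note that the composed policy only intercepts requests; it makes no assumption about how ψ chooses items, matching the paper's agnosticism about ψ.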

3. ENGAGEMENT VIA 'PREDATOR-PREY' DYNAMICS

To optimize engagement with breaks, we propose to model user behavior as a dynamical system of Lotka-Volterra (LV) Predator-Prey equations. The model operates over continuous dynamics, which has analytic and optimization benefits; we later make the connection back to discrete sequences.

Behavioral model. Our model of user behavior views each user as acting based on two types of time-dependent latent variables: internal drive for consumption, denoted λ(t) ∈ R_+; and intrinsic interest, denoted q(t) ∈ [0, 1]. Intuitively, we expect that engaging experiences will reinforce a user's desire to further consume; conversely, excessive exposure to content should slowly 'erode' her interest, but left alone, will allow it to replenish (Thoman et al., 2011). Thus, drive and interest act as balancing forces. Together, we model the time-dependent relations between λ(t) and q(t) as:

dλ/dt = −αλ + βλq,   dq/dt = γq(1 − q) − δλq    (2)

where θ = (α, β, γ, δ) ≥ 0 parameterize the dynamics. For λ (drive, or 'predator'), α determines its natural decay rate, and β its interest-dependent self-reinforcement rate; for q (interest, or 'prey'), γ specifies its natural replenishment rate, and δ its rate of depletion from consumption. Note the two equations are coupled: q mediates the reinforcement of λ, and λ mediates the depletion of q.

Dynamical properties. The LV model in Eq. (2) describes consumption as a cycle: when interest q(t) is high, drive to consume λ(t) increases, resulting in positive feedback; conversely, when λ(t) is high, q(t) decreases, which expresses negative feedback. In general, λ grows until interest is too low to sustain consumption, at which point consumption drops sharply, allowing interest to recover, and the cycle repeats. The cycling behavior exhibits oscillations in λ and q, with one lagging behind the other. A typical trajectory is illustrated in Figure 1.
Note how the drop in λ occurs only some time after q is depleted; hence, anticipating (and preventing) the collapse of λ requires knowledge (and conservation) of q. Thus, q serves as a resource: necessary for consumption, and of limited supply. Over time, and if no interventions are applied, the magnitude of oscillations decreases, and the system naturally approaches a stable equilibrium, denoted (λ*, q*), determined by the system parameters θ = (α, β, γ, δ), which attracts all initial conditions λ(0), q(0) (Takeuchi, 1996). Equilibrium plays a key role in our approach, as it captures the notion of habits, which we aim to improve.

Figure 1: Characteristic LV dynamics (Eq. (2)). (Left) Temporal relations between consumption drive λ(t), intrinsic interest q(t), and equilibrium λ*. Note how λ drops only some time after q has depleted. (Right) Evolution of the same system in phase space (λ, q), with its equilibrium (λ*, q*).

Engagement maximization as optimal control. To optimize engagement (Eq. (1)), we propose to optimize λ* as an alternative proxy. Broadly, our approach will be to associate with each user u a dynamical system parameterized by θ_u = (α_u, β_u, γ_u, δ_u), and recommend breaks which lead to high values of the corresponding λ*_u = λ*_{θ_u}. Intuitively, if we think of (1/T)|S_u| as an 'empirical' rate (determined by the ∆t_i in S_u), then λ*_u is its continuous theoretical counterpart, and so ideally we would like to find a θ_u for which λ*_u is the limiting behavior of (1/T)|S_u| (i.e., when ∆t → 0 and T → ∞). In practice, we expect λ*_u to be a useful target when the observed (1/T)|S_u| exhibits habits that are 'close enough' to the theoretical equilibrium, and in Sec. 4.3 we make this condition precise.
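To make these dynamics concrete, the following sketch integrates Eq. (2) with a simple forward-Euler scheme and checks that the trajectory settles at the uncontrolled equilibrium, which from Eq. (2) is (λ*, q*) = ((γ/δ)(1 − α/β), α/β); the parameter values and step size here are illustrative, not taken from the paper:

```python
import numpy as np

def simulate_lv(theta, lam0, q0, T=300.0, dt=1e-3):
    """Forward-Euler integration of the uncontrolled LV model (Eq. 2):
    dlam/dt = -a*lam + b*lam*q,  dq/dt = g*q*(1-q) - d*lam*q."""
    a, b, g, d = theta
    lam, q = lam0, q0
    for _ in range(int(T / dt)):
        dlam = (-a * lam + b * lam * q) * dt
        dq = (g * q * (1.0 - q) - d * lam * q) * dt
        lam, q = lam + dlam, q + dq
    return lam, q

theta = (1.0, 2.0, 1.0, 1.0)                                  # (alpha, beta, gamma, delta)
lam_eq = theta[2] / theta[3] * (1.0 - theta[0] / theta[1])    # gamma/delta * (1 - alpha/beta) = 0.5
q_eq = theta[0] / theta[1]                                    # alpha/beta = 0.5
lam_T, q_T = simulate_lv(theta, lam0=0.3, q0=0.9)
```

Starting from any positive initial condition, the damped oscillations settle at the equilibrium, matching the closed-form values.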
Since our goal is to optimize a breaking policy, we must also be precise about the way breaks affect dynamics in our model. Towards this, we introduce into Eq. (2) a control variable, p(t) ∈ [0, 1], which acts as a 'gate' controlling the mediation strength between λ(t) and q(t):

dλ/dt = −αλ + βλq(1 − p),   dq/dt = γq(1 − q) − δλq(1 − p)    (3)

When p > 0, breaking has a dual effect: it decelerates drive λ, and at the same time, lets q recover. Our goal will now be to solve the optimal control problem of finding the p(t) which maximizes λ*. Note that not every dynamic control schedule guarantees convergence to some λ*. Hence, we focus on fixed controls, p ∈ [0, 1], which we prove converge to equilibrium, and which admit a closed form.

Lemma 1. Let θ = (α, β, γ, δ) define an LV system as in Eq. (3). Then for any p ∈ [0, 1], we have:

λ*_θ(p) = (γ/δ) · (1/(1 − p)) · (1 − (α/β) · (1/(1 − p)))  if p ∈ [0, 1 − α/β], and zero otherwise.    (4)

Proof in Appendix A.1. For optimization, Lemma 1 is useful since it depicts λ* as a function of p, parameterized by θ. Thus, given some θ_u for user u, our objective is to solve the control problem:

p_u = arg max_{p∈[0,1]} λ*_{θ_u}(p)    (5)

Optimizing engagement now reduces to solving Eq. (5); given p_u, we make the connection back to discrete time by setting the breaking policy to be π(p_u), which recommends breaks at rate p_u.
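As a sanity check on Lemma 1, the sketch below implements the closed-form λ*_θ(p) of Eq. (4) and compares it against a forward-Euler simulation of the controlled system in Eq. (3); parameters and step sizes are illustrative:

```python
import numpy as np

def lam_star(theta, p):
    """Equilibrium engagement rate of Eq. (4) under a fixed break rate p."""
    a, b, g, d = theta
    if p > 1.0 - a / b:          # beyond this point the equilibrium collapses
        return 0.0
    z = 1.0 / (1.0 - p)
    return (g / d) * z * (1.0 - (a / b) * z)

def simulate_controlled(theta, p, lam0=0.3, q0=0.9, T=300.0, dt=1e-3):
    """Euler integration of the controlled LV system (Eq. 3)."""
    a, b, g, d = theta
    lam, q = lam0, q0
    for _ in range(int(T / dt)):
        dlam = (-a * lam + b * lam * q * (1.0 - p)) * dt
        dq = (g * q * (1.0 - q) - d * lam * q * (1.0 - p)) * dt
        lam, q = lam + dlam, q + dq
    return lam

theta = (1.0, 2.0, 1.0, 1.0)
# closed form at p = 0.2: (1/0.8) * (1 - 0.5/0.8) = 0.46875
closed = lam_star(theta, 0.2)
simulated = simulate_controlled(theta, 0.2)
```

The simulated long-run rate matches the closed form, and pushing p past 1 − α/β drives the equilibrium to zero, as the lemma states.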

4. LEARNING OPTIMAL BREAKING POLICIES

We now turn to presenting our learning algorithm. Our approach to optimizing engagement consists of two steps: (i) associating with each user u a set of LV parameters θ_u, and then (ii) finding the p_u which maximizes λ*_{θ_u}(p), and plugging it into π(p_u). In practice, we add an intermediate prediction step, which allows us to 'shortcut' directly from observations to optimal policies. We conclude with analysis showing when learned policies π(p_u) lead to good expected engagement E_{u∼D} E_TPP[(1/T)|S_u|].

4.1. EMBEDDING USERS IN LV SPACE

To find θ_u, a seemingly reasonable approach would be to fit an LV trajectory to the initial sequence S^0_u, i.e., by solving min_θ Σ_i |∆t_i − λ_θ(t_i)| for the observed t_i ∈ S^0_u. This can be interpreted as embedding S^0_u in 'LV-space' by finding the nearest continuous trajectory, for which θ provides a compact representation. However, a key issue with this approach is that S^0_u contains past observations made under a single policy π(p) (e.g., the 'clean' policy π(0)). To optimize future engagement, the learned θ_u must correctly account for the effects of general p on possible ensuing sequences S_u. As an alternative, we propose to find θ_u by fitting the entire equilibrium curve λ*_θ(p) (Eq. (4)). Formally, for each p, define the expected empirical engagement rate λ̄_u(p) as:

λ̄_u(p) = E_π[(1/T)|S_u|],   S_u ∼ TPP_{π(p)∘ψ}(u; T)    (6)

i.e., λ̄_u is the rate of counterfactual future trajectories for all possible choices of p. Ideally, we would like to find θ for which the learned curve λ*_θ(p) closely aligns with that of λ̄_u(p) across p ∈ [0, 1]:

θ̄_u = arg min_θ ‖λ̄_u − λ*_θ‖    (7)

for some function norm ‖·‖, and for which λ*_{θ̄_u} and λ̄_u have similar maximizing p (since we aim for optimizing λ*_θ to work well for λ̄_u). Unfortunately, λ̄_u is a theoretical construct, and so θ̄_u cannot be obtained from observations. Hence, we propose to replace λ̄_u with a finite set of predictions.

The role of prediction. Recall our input consists of a primary dataset S^(0) = S = {(u_i, S_i)}_{i=1}^m, as well as additional 'experimental' datasets S^(j) collected under different π_j = π(p_j) and of sizes m_j for j = 1, . . . , N. We can use these to learn individualized, policy-specific predictors f_j(u) = f_{p_j}(u), trained to predict for each user u her engagement rate y = (1/T)|S_u| under π_j. For example, if we train f_j to minimize the squared error Σ_k (f_j(u_k) − y_k)² on pairs (u_k, y_k) ∈ S^(j), then f_j(u) should be a reasonable estimator of the expected λ̄_u(p_j). Hence, for a given u, the finite set of pairs {(p_j, f_j(u))}_{j=1}^N gives points to which we can fit λ*_θ. Our final criterion for choosing θ̂_u is:

θ̂_u = arg min_θ ‖f(u) − λ*_θ‖ = arg min_θ Σ_{j=1}^N (f_j(u) − λ*_θ(p_j))²    (8)

given here with the ℓ2 vector norm, and where f(u) = (f_1(u), f_2(u), . . . , f_N(u)) ∈ R^N_+. As we will see next, optimizing over p requires only the ratios α/β and γ/δ, in which λ* is quadratic. Hence, Eq. (8) can be efficiently solved using a polynomial Non-Negative Least Squares (NNLS) regression solver (Chen & Plemmons, 2010).

The role of experimentation. Two parameters control the goodness of fit of θ̂_u: the number of experimental datasets, N, and their sizes, m_j for j ∈ [N]. In general, increasing N provides more data points for solving Eq. (8), and increasing each m_j reduces noise for that point (i.e., f_j(u) should be closer to λ̄_u(p_j)). But experimentation is costly, and so in reality N and the m_j may be small. As motivation, we next show that under realizability and for accurate predictions, N = 2 suffices.

Proposition 1. Fix N = 2, and let p_0, p_1 ∈ [0, 1 − α/β]. For a user u, if (i) there exists θ_u s.t. λ̄_u = λ*_{θ_u}, and (ii) f_i(u) = λ̄_u(p_i) for i = 0, 1, then θ̂_u is optimal, i.e., solving Eq. (8) gives θ̂_u = θ_u.

Proof in Appendix A.1; it relies on Lemma 1. Next, we discuss how to obtain π(u) from θ̂_u.
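The fitting step of Eq. (8) can be sketched as follows. Writing z = 1/(1 − p), Eq. (4) gives λ*_θ(p) = (γ/δ)z − (αγ/(βδ))z², so fitting reduces to a two-coefficient polynomial regression in z, and the ratio of the coefficients recovers α/β. For brevity this sketch uses ordinary least squares on noiseless synthetic points (where the non-negativity constraint of the paper's NNLS solver is inactive):

```python
import numpy as np

def fit_ratios(ps, fs):
    """Fit lambda*(p) = a*z - b*z^2 with z = 1/(1-p) to prediction pairs
    (p_j, f_j(u)) as in Eq. (8); returns the coefficient estimates
    (a, b) = (gamma/delta, alpha*gamma/(beta*delta))."""
    z = 1.0 / (1.0 - np.asarray(ps, dtype=float))
    A = np.stack([z, -z**2], axis=1)      # design matrix in z
    coef, *_ = np.linalg.lstsq(A, np.asarray(fs, dtype=float), rcond=None)
    return coef[0], coef[1]

# noiseless 'predictions' generated from a known theta = (1, 2, 1, 1):
ps = [0.0, 0.1, 0.2]
true_a, true_b = 1.0, 0.5                 # gamma/delta, alpha*gamma/(beta*delta)
fs = [true_a / (1 - p) - true_b / (1 - p) ** 2 for p in ps]
a_hat, b_hat = fit_ratios(ps, fs)
alpha_over_beta = b_hat / a_hat           # recovers alpha/beta = 0.5
```

With N = 3 exact points the fit is exact; in practice the f_j(u) are noisy predictions, which is where the NNLS constraint and the experimental design of p_j matter.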

4.2. FROM PREDICTIONS TO OPTIMAL POLICIES

Recall our goal is to learn an individualized breaking policy π that for each user u applies an appropriate, personalized breaking schedule π(u). To be able to generalize to unseen u, the conventional approach is to learn a parameterized policy π(u; η), where η is optimized on training data and applied to new users at test time. Our approach, by relying on predictions, circumvents the need to learn parameterized policies: once the predictors {f_j(u)}_j have been learned, the policy problem decomposes over users, and optimal policies π(u) are determined independently for each user.

Figure 2: (Left) Equilibrium curves λ*(p) and optimal policies p* (markers) for user types (θ_u) that: benefit from breaks (green), do not require breaks (orange), and will inevitably churn (blue). (Center) The optimal policy p* for all β/α, showing a second-order phase change at β/α = 2. (Right) An illustration of the true counterfactual engagement curve (orange).

Optimal policies. Given θ_u, our next result establishes a closed-form solution for the optimal p_u. Note that Eq. (4) shows λ*_θ(p) is piece-wise polynomial in z = 1/(1 − p). Solving for z, we get:

Lemma 2. Let θ = (α, β, γ, δ) define an LV system as in Eq. (3). Then the optimal p* is given by:

p*(θ) = arg max_{p∈[0,1]} λ*_θ(p) = 1 − 2α/β  if α/β ≤ 1/2, and zero otherwise.    (9)

Proof in Appendix A.1. Hence, once θ̂_u is found, our learned policy is defined as:

π̂(u) = π(p̂_u), where p̂_u = p*(θ̂_u)    (10)

Altogether, to compute p̂_u, the least-squares approach in Eq. (8) is applied to obtain the polynomial coefficients γ_u/δ_u and α_uγ_u/(β_uδ_u); then α_u/β_u is estimated from their ratio, and plugged into Eq. (9) to obtain p̂_u. For N = 2, π̂(u) has a closed-form formulation as a function of the predictions (Appendix A.2). Additionally, note that by Eq. (10), p̂_u exhibits a phase transition at α_u/β_u = 1/2, above which p̂_u = 0. Corollary 1.
In LV space, π(u) partitions users into those who require breaks and those who don't. Corollary 2. In the realizable case of Prop. 1, π(p_u) idempotently improves over the myopic π(0). Thus, the optimal policy can be interpreted as suggesting breaks only when it deems them necessary. Figure 2 illustrates λ*(p) curves, the phase shift, and optimal and learned policies for various user types.
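The decision rule of Lemma 2 and Eq. (10) then reduces to a one-line function of the estimated ratio α/β, with the phase transition at α/β = 1/2 visible directly:

```python
def p_star(alpha_over_beta):
    """Optimal fixed break rate (Eq. 9): 1 - 2*(alpha/beta) when
    alpha/beta <= 1/2, and zero otherwise (no breaks suggested)."""
    return max(0.0, 1.0 - 2.0 * alpha_over_beta)

# users with strong self-reinforcement (small alpha/beta) get more breaks:
assert p_star(0.25) == 0.5
assert p_star(0.5) == 0.0    # exactly at the phase transition
assert p_star(0.8) == 0.0    # no breaks needed in this regime
```

Combined with a fit of α/β from predictions, this yields the full pipeline: predictions in, personalized break rate out.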

4.3. THEORETICAL GUARANTEES

We are now ready to state our main theoretical result, which bounds the expected long-term engagement obtained by our learned policy, π̂. Our bound shows that the gap between π̂ and the optimal policy, π_opt, is governed by three additive terms, each relating to a different aspect of our approach: modeling error (ε_LV), predictive error (ε_pred), and deviation from expected behavior (ε_dev). A description and interpretation of each term follows shortly. For simplicity, we focus on N = 2.

Theorem 1 (Informal). Let p_0, p_1 ∈ [0, 1], and let π_opt ∈ Π be the optimal stationary policy. Then for the learned breaking policy π̂, we have:

E_{u,π_opt}[(1/T)|S_u|] − E_{u,π̂}[(1/T)|S_u|] ≤ (η_TPP / |p_1 − p_0|) · (ε_LV + ε_pred + ε_dev)

where η_TPP is a TPP-specific constant scale factor. Formal statement, precise definitions, and proof are given in Appendix A.3. The proof consists of three main steps: we begin with a clean LV system at T = ∞, and quantify the downstream effect of perturbing the optimal policy; then, we plug in the learned policy, and bound the gap due to predictive errors and finite T; the final step makes the transition from continuous dynamics to the discrete TPP. We next detail the role of each term in the bound, and how it can be controlled.

• Predictive error: Since targets y = (1/T)|S_u| are real-valued, ε_pred is simply the expected regression error over users, measured here in RMSE. As is standard, ε_pred can be reduced by increasing the number of samples m, or by learning more expressive predictive functions f (e.g., larger neural nets).

• Modeling error: LV dynamics permit tractable learning; but as with any hypothesis class, this trades off with model capacity. Here, ε_LV quantifies the error due to limited expressive power. Further reducing ε_LV can be achieved by considering richer dynamic models, a challenge left for future work.

• Deviation from expectation: The learned p̂_u rely on the predicted equilibrium, but are trained on finite-horizon data. ε_dev captures how finite sequences deviate from their mean. As a rule of thumb, we expect larger T to reduce this form of noise, but this cannot be guaranteed.

• Sensitivity: For N = 2, the term |p_1 − p_0| quantifies the added value of exploring beyond the default breaking policy p_0. Intuitively, when the points are farther apart, fitting the equilibrium curve is easier. Thus, for the likely case of p_0 = 0, the experimental breaking rate p_1 should be chosen to balance performance gain against over-exposure of experimental subjects to breaks.

5. EXPERIMENTS

We conclude with an empirical evaluation of our approach on semi-synthetic data. We use the MovieLens 1M dataset to generate recommendations and simulate user behavior, enabling us to evaluate counterfactual outcomes under different policies. See Appendix B for additional details.

5.1. EXPERIMENTAL SETUP

Data. The MovieLens 1M dataset (Harper & Konstan, 2015) includes 1,000,209 ratings provided by 6,040 users for 3,706 items, which we use to obtain features, determine the dynamics, and emulate ϕ. We sample and hold out 30% of all ratings r_{ux} via user-stratified sampling, to which we apply Collaborative Filtering (CF) to get user features u and item features x that approximate u^⊤x ≈ r_{ux} (d = 8, RMSE = 0.917, r ∈ [1, 5]). This mimics a process where ϕ is based on items recommended by ψ and rated by users. We then take the remaining data points and randomly assign 1,000 users to the test set, on which we evaluate policies. The remaining users are assigned to the train set, which is then further partitioned into the main S and the different S^(j) per experimental condition.

Recommendation policy and user behavior. Following Kalimeris et al. (2021); Hron et al. (2022), we set ψ to recommend based on learned softmax scores, softmax_x(r_u), taken over all non-held-out x having ratings for u. For user behavior, we generate discrete interaction sequences S_u in a way that relates to our dynamic model, but is nonetheless quite distinct. Specifically, each user u is associated with discrete-time latent-state variables, updated via an LV discretization scheme (see Appendix B.5). These, together with the recommended items x_i = x(t_i), determine the consecutive ∆t_i ∈ S_u. Importantly, we let ∆t_i depend on u's rating for x_i, namely r_{u,x_i}, which we interpret as u's utility from consuming x_i. We parameterize this dependence via 'immediate', item-specific consumption gain parameters β_ui that depend on r_{u,x_i}, and for simplicity set α_u, γ_u, δ_u to be fixed. Note that since β_ui depends on the recommended x_i, it varies in time, and hence there is no single β that underlies the dynamics: even in the limit (∆t → 0, T → ∞), user behavior cannot be described by a continuous LV system, which implies ε_LV > 0 (see Appendix Figure 5).
Since the baseline RMSE is high, we set $\beta_{ui} \propto \bar r_{u,x_i}^2$, where $\bar r_{u,x_i} = \kappa r_{u,x_i} + (1-\kappa)\, u^\top x$, so that $\kappa$ interpolates between predicted ratings ($\kappa = 0$) and true ratings ($\kappa = 1$). This allows us to (indirectly) control $\varepsilon_{\mathrm{pred}}$. For all experiments we use $T = 100$, and so expect a roughly fixed $\varepsilon_{\mathrm{dev}} > 0$.

Methods. We compare our approach (LV) to several baselines: (i) a default policy which myopically optimizes for immediate engagement (and so never breaks); (ii) a 'safety switch' policy (safety@τ) that breaks once consumption surpasses a threshold $\tau$; (iii) a prediction-based policy (best-of) that chooses the best observed $p_u = \arg\max_{p_j} f_j(u)$ (rather than optimizing over $p_u \in [0,1]$); and (iv) an oracle benchmark which directly optimizes the (generally unknown) true underlying dynamics. We measure the mean long-term engagement rate (LTE) for each approach, and report averages and standard errors computed over 10 random splits. Performance is measured relative to the default baseline, as it represents no change in policy (typical absolute values are LTE ≈ 10).
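To make the experimental knob concrete, the following is a minimal sketch of the rating interpolation and the convex consumption-gain mapping described above; the $r^2/5$ scaling is taken from Appendix B.4, and the function names are our own (not from the paper's code):

```python
def interpolated_rating(r_true, r_pred, kappa):
    """Blend the true rating with the CF-predicted rating u^T x.
    kappa=0 uses only predictions; kappa=1 uses only true ratings."""
    return kappa * r_true + (1.0 - kappa) * r_pred

def beta_from_rating(r_bar):
    """Convex consumption-gain parameter: beta proportional to r^2
    (scaled by 1/5, as in Appendix B.4)."""
    return r_bar ** 2 / 5.0

# ratings 1..5 map to betas {0.2, 0.8, 1.8, 3.2, 5.0}
betas = [beta_from_rating(r) for r in range(1, 6)]
```

The convexity of the mapping accentuates the gap between low- and high-rated items, which is the stated motivation for squaring the (interpolated) rating.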

5.2. RESULTS AND ANALYSIS

Main results. Figure 3 (left) compares the performance of our method to the other policies. Here we set $p_0 = 0$, use $N = 3$ with $p_j \in \{0.05, 0.1, 0.15\}$, and consider $\kappa = 0.5$ (note that $\kappa$ affects all policies). As can be seen, our approach significantly improves over default (+6.48%). For safety@τ, the improvement over the optimal $\tau = 16$ (+5.68%; chosen in hindsight) shows the importance of being preemptive; for the slightly smaller $\tau = 14$, breaks are harmful. The gap from best-of (+2.19%) quantifies the gain from the optimization step in Eq. (9), and the close performance to oracle (-0.918%) suggests that optimizing the empirical curve (Eq. (8)) works well as a proxy.

User types. Figure 3 (center) shows, for our approach, how the gain in LTE varies across learned breaking policies $p_u > 0$. For increasingly accurate predictions ($\kappa \in \{0, 0.5, 1\}$), the main plot shows performance gains for each group of users, partitioned by their $p_u$ values (binned; the plot shows the average and unit standard deviation per bin). Gains up to $p_u \le 0.15$ are mild, but for $p_u > 0.15$ the general trend is positive: users who are deemed to require more frequent breaks benefit more from breaking. Gains up to $p_u \le 0.3$ increase for all $\kappa$, but for $p_u > 0.3$ extrapolation becomes difficult: note the higher variation within each $\kappa$, as well as the significant differences across $\kappa$. This highlights the importance of accurate predictions for inferring the optimal $p_u$ when the experimental $p_i$ are small. The inset plot shows that, in line with our theory, $p_u$ exhibits an empirical phase shift in the estimated $\hat\theta_u$.

Treatments. Figure 3 (right) shows the effects of the experimental treatments on performance. Focusing on $N = 2$, we fix $p_0 = 0$ and consider increasingly aggressive experimentation by varying $p_1 \in (0, 0.4]$. For our approach, increasing $p_1$ helps, as anticipated by our theoretical bound. For the best-of approach, larger $p_1$ also helps, but exhibits a population-level optimum ($p_1 \approx 0.24$), which is easy to 'overshoot'.
Note that when prediction accuracy is low (κ = 0), experimentation is essential: if p 1 is not sufficiently large, then performance can sharply deteriorate.

6. DISCUSSION

Our paper studies the novel problem of learning optimal breaking policies for recommendation. We posit a tight connection between long-term engagement and user well-being, and argue that optimizing the former requires careful consideration of the latter. Viewing cognitive capacity as a limited but conservable resource, we propose Lotka-Volterra dynamics as a behavioral model that enables effective, responsible, and sustainable optimization of recommendation ecosystems. As with any policy task that involves humans, care must be taken regarding potential risks. While our experiments show that optimizing engagement also improves well-being, this need not always be the case; in fact, merely measuring well-being in practice is challenging. Breaks, as interventions, are presumably 'safe', in the sense that at worst they may lead to suboptimal performance (for the system) or satisfaction (for users). But breaks can also be used nefariously, e.g., by enabling temporally-varying rewards (Eyal, 2019). As such, breaking policies should be deployed accountably and transparently.

A DEFERRED PROOFS

In this section, we formalize the model presented in Section 3 and prove the claims presented in Section 4. The section ends with a formal proof of Theorem 1.

A.1 PROPERTIES OF LOTKA-VOLTERRA SYSTEMS

Definition A.1 (Static policy equilibrium). Let $\lambda(t), q(t)$ denote a Lotka-Volterra model characterized by parameters $\theta = (\alpha, \beta, \gamma, \delta) \in \mathbb{R}^4_+$, as defined in Equation 2. Let $p \in [0,1]$, and denote by $\pi_p$ the static policy corresponding to $p$. For $\lambda(0), q(0) > 0$, the static equilibrium of the system is defined as:
$$\lambda^*(p;\theta) = \lim_{t\to\infty} \lambda(t), \qquad q^*(p;\theta) = \lim_{t\to\infty} q(t)$$
We write $\lambda^*(p) = \lambda^*(p;\theta)$ when $\theta$ is clear from the context, and $\lambda^*(p;u) = \lambda^*(p;\theta_u)$ when a user $u \in U$ characterized by parameters $\theta_u$ is given and clear from the context.

Proposition A.1 (Global stability). $\lambda^*(p;\theta)$ exists and is uniquely defined for all $\theta \in \mathbb{R}^4_+$, $p \in [0,1]$, and all initial conditions $\lambda(0), q(0) > 0$.

Proof. See (Takeuchi, 1996, Section 3.2).

Lemma A.1 (Equilibrium of the LV behavioral model; formal version of Lemma 1). Assume a Lotka-Volterra model characterized by $\theta = (\alpha,\beta,\gamma,\delta) \in \mathbb{R}^4_+$, and let $p \in [0,1]$ denote the proportion of interactions in which a forced break is served. The static equilibrium of the model under the static policy $\pi_p$ is given by:
$$\lambda^*(p) = \begin{cases} \frac{\gamma}{\delta}\frac{1}{1-p}\left(1 - \frac{\alpha}{\beta}\frac{1}{1-p}\right) & p \in \left[0,\, 1-\frac{\alpha}{\beta}\right] \\ 0 & \text{otherwise} \end{cases} \qquad q^*(p) = \begin{cases} \frac{\alpha}{\beta}\frac{1}{1-p} & p \in \left[0,\, 1-\frac{\alpha}{\beta}\right] \\ 1 & \text{otherwise} \end{cases}$$

Proof. The LV dynamical system is given by Equation 3:
$$\frac{d\lambda}{dt} = -\alpha\lambda + \beta(1-p)\lambda q, \qquad \frac{dq}{dt} = \gamma q(1-q) - \delta(1-p)\lambda q$$
When $p \in \left[0, 1-\frac{\alpha}{\beta}\right]$, we equate $\frac{d\lambda}{dt} = 0$ and $\frac{dq}{dt} = 0$ and obtain the result. The solution is guaranteed to be valid, as both $\lambda^*(p) > 0$ and $q^*(p) \in [0,1]$. Conversely, when $p \notin \left[0, 1-\frac{\alpha}{\beta}\right]$, there exists $\epsilon > 0$ such that $\frac{d}{dt}\log\lambda < -\epsilon < 0$ for all $\lambda > 0$, $q \in [0,1]$. From this we obtain that $\log\lambda(t)$ tends to $-\infty$, and therefore $\lambda(t)$ tends to $0$, so $\lambda^*(p) = 0$ as required. When $\lambda(t)$ is close to zero, the interaction term in the $\frac{dq}{dt}$ equation vanishes, and $q(t)$ grows logistically towards 1.
Proposition A.2 (Equilibrium bounds). For a Lotka-Volterra model, the static equilibrium $\lambda^*(p)$ is bounded by:
$$0 \le \lambda^*(p) \le \frac{\beta\gamma}{4\alpha\delta}$$

Proof. Denote $x = \frac{1}{1-p}$. From Lemma A.1, for $x \in \left[1, \frac{\beta}{\alpha}\right]$ the equilibrium consumption $\lambda^*(x)$ is given by:
$$\lambda^*(x) = \frac{\gamma}{\delta}\, x \left(1 - \frac{\alpha}{\beta}x\right)$$
and is zero otherwise. The equilibrium is a quadratic function of $x$ with roots $x \in \{0, \frac{\beta}{\alpha}\}$, and therefore attains its maximum at $x = \frac{\beta}{2\alpha}$. Plugging the maximizing $x$ back into $\lambda^*$ gives the upper bound. The lower bound is attained since the equilibrium in Lemma A.1 is clipped at 0 from below.

Lemma A.2 (Optimal static policy; formal version of Lemma 2). The optimal static policy for a Lotka-Volterra system is given by:
$$p_{\mathrm{opt}} = \begin{cases} 1 - \frac{2\alpha}{\beta} & \frac{\alpha}{\beta} \le \frac{1}{2} \\ 0 & \frac{\alpha}{\beta} > \frac{1}{2} \end{cases}$$
and the optimal equilibrium engagement rate is given by:
$$\lambda^*_{\mathrm{opt}} = \begin{cases} \frac{\beta\gamma}{4\alpha\delta} & \frac{\alpha}{\beta} \le \frac{1}{2} \\ \frac{\gamma}{\delta}\left(1 - \frac{\alpha}{\beta}\right) & \frac{\alpha}{\beta} > \frac{1}{2} \end{cases}$$

Proof. Denote $x = \frac{1}{1-p}$. From Proposition A.2, the global maximum of $\lambda^*(x)$ is attained at $x = \frac{\beta}{2\alpha}$. Consider two cases. When $\frac{\alpha}{\beta} \le \frac{1}{2}$, we obtain $x_{\mathrm{opt}} = \frac{\beta}{2\alpha} \ge 1$, and therefore $p_{\mathrm{opt}} = 1 - \frac{1}{x_{\mathrm{opt}}} \in [0,1]$: the global maximum is attained on the simplex, and is given by the formula from Proposition A.2. Conversely, when $\frac{\alpha}{\beta} > \frac{1}{2}$, we obtain $p = 1 - \frac{1}{x_{\mathrm{opt}}} < 0$, i.e., $x_{\mathrm{opt}}$ translates to a negative value of $p$. As $\lambda^*(p)$ is unimodal, the optimal policy restricted to the simplex $[0,1]$ is attained at the closest boundary point, $p = 0$.
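The closed form of Lemma A.2 is simple enough to state as code; the following is a small sketch (our own helper, not the paper's implementation) of the optimal break probability and its equilibrium rate:

```python
def optimal_static_policy(alpha, beta, gamma, delta):
    """Closed-form optimal break probability p_opt and equilibrium
    engagement rate (Lemma A.2)."""
    r = alpha / beta
    if r <= 0.5:
        # interior optimum: x_opt = beta / (2 alpha) lies on the simplex
        p_opt = 1.0 - 2.0 * r
        rate_opt = beta * gamma / (4.0 * alpha * delta)
    else:
        # boundary optimum: breaking never helps
        p_opt = 0.0
        rate_opt = (gamma / delta) * (1.0 - r)
    return p_opt, rate_opt
```

For instance, with $\alpha = 1$, $\beta = 4$, $\gamma = \delta = 1$ the optimal policy breaks half the time; with $\alpha/\beta > 1/2$ the optimum is no breaks at all.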

Figure 2 provides graphical intuition for this proof (left and center subplots).

Proposition A.3 (Inference of α/β from two-treatment equilibrium data; formal version of Proposition 1). Let $\lambda(t), q(t)$ be a Lotka-Volterra model, and let $p_1, p_2 \in [0,1]$. Denote by $\lambda^*(p_1), \lambda^*(p_2)$ the static equilibrium rates corresponding to the static policies $\pi_{p_1}, \pi_{p_2}$, and assume $\lambda^*(p_1), \lambda^*(p_2) > 0$. The parameter ratio $\frac{\alpha}{\beta}$ is given by:
$$\frac{\alpha}{\beta} = (1-p_1)(1-p_2)\,\frac{(1-p_1)\lambda^*(p_1) - (1-p_2)\lambda^*(p_2)}{(1-p_1)^2\lambda^*(p_1) - (1-p_2)^2\lambda^*(p_2)}$$

Proof. From Lemma A.1, the equilibrium consumption $\lambda^*(p)$ is given by:
$$\lambda^*(p) = \frac{\gamma}{\delta}\frac{1}{1-p}\left(1 - \frac{\alpha}{\beta}\frac{1}{1-p}\right) = \frac{\gamma}{\delta}\frac{1}{1-p} - \frac{\alpha}{\beta}\frac{\gamma}{\delta}\frac{1}{(1-p)^2}$$
When $\lambda^*(p_i)$ is observed for several policies $p_1, \ldots, p_m \in \left[0, 1-\frac{\alpha}{\beta}\right]$, this is a polynomial regression problem in the parameters $\frac{\gamma}{\delta}$ and $\frac{\alpha}{\beta}\frac{\gamma}{\delta}$, which can be solved, e.g., using Non-Negative Least Squares. When $m = 2$, we obtain a system of two linear equations. Applying Cramer's rule:
$$\frac{\gamma}{\delta} = \frac{(1-p_1)^2\lambda^*(p_1) - (1-p_2)^2\lambda^*(p_2)}{p_2 - p_1} \qquad (11)$$
$$\frac{\alpha}{\beta}\frac{\gamma}{\delta} = (1-p_1)(1-p_2)\,\frac{(1-p_1)\lambda^*(p_1) - (1-p_2)\lambda^*(p_2)}{p_2 - p_1} \qquad (12)$$
Dividing Eq. (12) by Eq. (11) yields:
$$\frac{\alpha}{\beta} = (1-p_1)(1-p_2)\,\frac{(1-p_1)\lambda^*(p_1) - (1-p_2)\lambda^*(p_2)}{(1-p_1)^2\lambda^*(p_1) - (1-p_2)^2\lambda^*(p_2)}$$

A.2 MODEL FITTING FROM ENGAGEMENT PREDICTIONS

Notation. In this section only, we use the common shorthand $q = 1-p$ for complementary probabilities.

Definition A.2 (Empirical value of α/β). For single-channel experiments with forced-break probabilities $p_1, p_2$, denote $\lambda_i = \lambda^*(p_i)$, $f_i = f_{p_i}(u)$, and $q_i = 1-p_i$. The empirical value of the $\frac{\alpha}{\beta}$ parameter is given by:
$$\widehat{\frac{\alpha}{\beta}} = \frac{q_1 q_2 (q_1 f_1 - q_2 f_2)}{q_1^2 f_1 - q_2^2 f_2}$$

Proposition A.4 (α/β estimation error from prediction errors). Consider a single-channel Lotka-Volterra system with parameter ratio $\frac{\beta}{\alpha} \ge 1$.
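The identification argument above can be checked numerically: the sketch below (our own helpers, not the paper's code) generates two equilibria from known LV parameters via Lemma A.1 and recovers the ratios with the closed forms of Proposition A.3:

```python
def lv_equilibrium(p, alpha, beta, gamma, delta):
    """lambda*(p) from Lemma A.1; assumes p in [0, 1 - alpha/beta]."""
    x = 1.0 / (1.0 - p)
    return (gamma / delta) * x * (1.0 - (alpha / beta) * x)

def infer_ratios(p1, lam1, p2, lam2):
    """Recover alpha/beta and gamma/delta from two observed equilibrium
    rates under break probabilities p1 != p2 (Proposition A.3)."""
    q1, q2 = 1.0 - p1, 1.0 - p2
    gd = (q1**2 * lam1 - q2**2 * lam2) / (p2 - p1)                          # Eq. (11)
    ab = q1 * q2 * (q1 * lam1 - q2 * lam2) / (q1**2 * lam1 - q2**2 * lam2)  # closed form
    return ab, gd
```

With more than two treatments, the same two unknowns can instead be fit by (non-negative) least squares, as the proof notes.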
Let $p_1, p_2 \in \left[0, 1-\frac{\alpha}{\beta}\right)$, denote $\lambda_i^* = \lambda^*(p_i) \in \mathbb{R}_+$, and let $f_i = \lambda_i^* + \varepsilon_i$ be the predicted engagement rates corresponding to $p_1, p_2$. When $|\varepsilon_1|, |\varepsilon_2| \le \varepsilon \le \frac{\gamma}{\delta}\frac{|p_1-p_2|}{4}$, the estimation error is bounded by:
$$\left|\widehat{\frac{\alpha}{\beta}} - \frac{\alpha}{\beta}\right| \le \frac{\varepsilon}{|p_1-p_2|}\cdot\frac{\beta\delta}{\alpha\gamma}$$

Proof. Denote $q_i = 1-p_i$. The value of $\frac{\alpha}{\beta}$ is given by Proposition A.3:
$$\frac{\alpha}{\beta} = \frac{q_1 q_2 (q_1\lambda_1^* - q_2\lambda_2^*)}{q_1^2\lambda_1^* - q_2^2\lambda_2^*}$$
and the estimator is obtained by replacing the true values with their predictions:
$$\widehat{\frac{\alpha}{\beta}} = \frac{q_1 q_2 (q_1 f_1 - q_2 f_2)}{q_1^2 f_1 - q_2^2 f_2} = \frac{q_1 q_2\left(q_1(\lambda_1^*+\varepsilon_1) - q_2(\lambda_2^*+\varepsilon_2)\right)}{q_1^2(\lambda_1^*+\varepsilon_1) - q_2^2(\lambda_2^*+\varepsilon_2)}$$
A direct computation factors the estimation error as:
$$\left|\widehat{\frac{\alpha}{\beta}} - \frac{\alpha}{\beta}\right| = \underbrace{(q_1 q_2)^2}_{(i)}\;\cdot\;\underbrace{\frac{|q_1-q_2|}{\left|q_1^2\lambda_1^* - q_2^2\lambda_2^*\right|}}_{(ii)}\;\cdot\;\underbrace{\left|\varepsilon_1\lambda_2^* - \varepsilon_2\lambda_1^*\right|}_{(iii)}\;\cdot\;\underbrace{\frac{1}{\left|q_1^2\lambda_1^* - q_2^2\lambda_2^* + q_1^2\varepsilon_1 - q_2^2\varepsilon_2\right|}}_{(iv)}$$
We now proceed to bound each factor:
• For (i), $(q_1 q_2)^2 \le 1$ since $q_1, q_2 \in [0,1]$.
• For (ii), by Eq. (11), $\frac{|q_1-q_2|}{\left|q_1^2\lambda_1^* - q_2^2\lambda_2^*\right|} = \left(\frac{\gamma}{\delta}\right)^{-1} = \frac{\delta}{\gamma}$.
• For (iii), from Proposition A.2 we have $0 \le \lambda_i^* \le \frac{\beta\gamma}{4\alpha\delta}$, and therefore $\left|\varepsilon_1\lambda_2^* - \varepsilon_2\lambda_1^*\right| \le 2\cdot\frac{\beta\gamma}{4\alpha\delta}\varepsilon = \frac{\beta\gamma}{2\alpha\delta}\varepsilon$.
• For (iv), using Eq. (11) again:
$$(iv) = \frac{1}{|p_1-p_2|}\left|\frac{q_1^2\lambda_1^* - q_2^2\lambda_2^*}{p_2-p_1} + \frac{q_1^2\varepsilon_1 - q_2^2\varepsilon_2}{p_2-p_1}\right|^{-1} = \frac{1}{|p_1-p_2|}\left|\frac{\gamma}{\delta} + \frac{q_1^2\varepsilon_1 - q_2^2\varepsilon_2}{p_2-p_1}\right|^{-1}$$
Note that $\left|\frac{q_1^2\varepsilon_1 - q_2^2\varepsilon_2}{p_2-p_1}\right| \le \frac{2\varepsilon}{|p_1-p_2|}$. When $\varepsilon$ is small enough, and specifically when the bound $\varepsilon \le \frac{\gamma}{\delta}\frac{|p_1-p_2|}{4}$ holds, we obtain:
$$\left|\frac{\gamma}{\delta} + \frac{q_1^2\varepsilon_1 - q_2^2\varepsilon_2}{p_2-p_1}\right|^{-1} \le \frac{\delta}{\gamma}\left(1 - \frac{1}{2}\right)^{-1} = 2\,\frac{\delta}{\gamma}$$
and therefore $(iv) \le \frac{2}{|p_1-p_2|}\frac{\delta}{\gamma}$. Aggregating the bounds (i)-(iv), we obtain the overall bound:
$$\left|\widehat{\frac{\alpha}{\beta}} - \frac{\alpha}{\beta}\right| \le 1\cdot\frac{\delta}{\gamma}\cdot\frac{\beta\gamma}{2\alpha\delta}\varepsilon\cdot\frac{2}{|p_1-p_2|}\frac{\delta}{\gamma} = \frac{\varepsilon}{|p_1-p_2|}\cdot\frac{\beta\delta}{\alpha\gamma}$$

Proposition A.5 (Cost of α/β estimation error). Let $\frac{\alpha}{\beta}$ be the engagement ratio parameter of a one-channel Lotka-Volterra system, and let $\widehat{\frac{\alpha}{\beta}}$ be an estimate of this parameter. Let $\lambda^*_{\mathrm{opt}}$ be the engagement rate of the optimal static policy, and denote $\lambda^*(x) = \lambda^*(p(x))$. When $\left|\frac{\alpha}{\beta} - \widehat{\frac{\alpha}{\beta}}\right| \le \min\left(\frac{\alpha}{2\beta}, 1\right)$, the price of the estimation error is bounded by:
$$\lambda^*_{\mathrm{opt}} - \lambda^*\!\left(\widehat{\tfrac{\alpha}{\beta}}\right) \le \frac{\gamma}{\delta}\min\left(\frac{1}{2}\left(\frac{\alpha}{\beta}\right)^{-2}\left|\frac{\alpha}{\beta} - \widehat{\frac{\alpha}{\beta}}\right|,\; \frac{1}{4}\left(\frac{\alpha}{\beta}\right)^{-1}\right)$$

Proof. Denote $r = \frac{\alpha}{\beta}$ and $x = \widehat{\frac{\alpha}{\beta}}$, and assume without loss of generality that $\frac{\gamma}{\delta} = 1$ and $r \le 1$. The optimal equilibrium engagement rate is given by:
$$\lambda^*_{\mathrm{opt}} = \begin{cases} \frac{1}{4r} & r \in \left(0, \frac{1}{2}\right] \\ 1-r & r \in \left(\frac{1}{2}, 1\right] \end{cases}$$
The chosen policy $p(x)$ is given by:
$$p(x) = \begin{cases} 1-2x & x \in \left(0, \frac{1}{2}\right] \\ 0 & \text{otherwise} \end{cases}$$
Assume without loss of generality that $x \in \left(0, \frac{1}{2}\right]$, as values of $x$ outside this interval can be clipped to its edges without affecting the result. The equilibrium engagement rate of the selected policy is given by:
$$\lambda^*(x) = \lambda^*(p(x)) = \begin{cases} 0 & x \in \left(0, \frac{r}{2}\right] \\ \frac{1}{2x}\left(1 - \frac{r}{2x}\right) & x \in \left(\frac{r}{2}, \frac{1}{2}\right] \end{cases}$$
Denote $\Delta(x) = \lambda^*_{\mathrm{opt}} - \lambda^*(x)$. We obtain:
$$\Delta(x) = \begin{cases} \frac{1}{4r} & r \in \left(0,\frac{1}{2}\right],\; x \in \left(0,\frac{r}{2}\right] \\ \frac{(x-r)^2}{4x^2 r} & r \in \left(0,\frac{1}{2}\right],\; x \in \left(\frac{r}{2},\frac{1}{2}\right] \\ 1-r & r \in \left(\frac{1}{2},1\right],\; x \in \left(0,\frac{r}{2}\right] \\ (1-r) - \frac{1}{2x}\left(1 - \frac{r}{2x}\right) & r \in \left(\frac{1}{2},1\right],\; x \in \left(\frac{r}{2},\frac{1}{2}\right] \end{cases}$$
Observe that $\frac{1}{4r} \ge 1-r$ for all $r \in (0,1]$, and therefore, for all $x, r$:
$$\Delta(x) \le \frac{1}{4r} \qquad (13)$$
From the convexity of $\Delta(x)$ in the region around $x = r$, we obtain:
$$\Delta(x) \le \frac{1}{2r^2}\,|x-r| \qquad (14)$$
Finally, combining the two bounds yields the final result. A geometric interpretation of this claim is illustrated in Figure 4.

A.3 OPTIMAL STATIONARY POLICY FROM ENGAGEMENT PREDICTIONS

Definition A.3 (Expected observable rate). Let $u \in U$, $p \in [0,1]$, and $T > 0$, and denote the static policy corresponding to $p$ by $\pi_p$. The expected observable rate $\bar\lambda_u(p;T)$ is defined as:
$$\bar\lambda_u(p;T) = \mathbb{E}_\pi\left[\tfrac{1}{T}\left|\mathrm{TPP}_{\pi_p}(u;T)\right|\right]$$
where the expectation is taken over the stochastic decisions of $\pi_p$.

Definition A.4 (Lotka-Volterra approximation of a TPP). Let $u \in U$ and $T > 0$. Denote by $p^*$ the maximizer of the expected observable rate:
$$p^* = \arg\max_{p\in[0,1]} \bar\lambda_u(p;T)$$
The LV approximation of $\mathrm{TPP}(u;T)$ is defined as:
$$\theta_u^* = \arg\min_\theta \max_{p\in[0,1]} \left|\bar\lambda_u(p;T) - \lambda^*(p;\theta)\right| \quad \text{such that} \quad \arg\max_p \lambda^*(p;\theta) = p^*$$
The corresponding approximation error is defined as:
$$\varepsilon_{\mathrm{LV},u} = \max_{p\in[0,1]} \left|\bar\lambda_u(p;T) - \lambda^*(p;\theta_u^*)\right|$$

Notation. When $u$ is clear from the context, we write $\theta^* = \theta_u^*$ and $\varepsilon_{\mathrm{LV}} = \varepsilon_{\mathrm{LV},u}$, and use $\alpha^*, \beta^*, \ldots$ to refer to the corresponding entries of the Lotka-Volterra parameter vector $\theta^*$.

We are now ready to state and prove the main theorem of this section:

Theorem A.1 (Regret bound for the learned static policy; formal version of Theorem 1). Let $p_1, p_2 \in [0,1]$ denote two static forced-break policies, let $U$ denote the set of users, and assume all users remain engaged under the stationary policies $\pi(p_1)$ and $\pi(p_2)$. Assume $S_u(p;T) \sim \mathrm{TPP}_{\pi_p\circ\psi}(u;T)$, and let $\mu = \max_{u\in U}\frac{\gamma_u}{\delta_u} \cdot \max_{u'\in U}\frac{\delta_{u'}}{\gamma_{u'}}$ and $\nu = \max_{u\in U}\frac{\bar\beta_u}{\alpha_u}$. Let $f_{p_1}, f_{p_2} : U \to \mathbb{R}_+$ be functions predicting $\frac{1}{T}|S_u(p_1;T)|$ and $\frac{1}{T}|S_u(p_2;T)|$, respectively. Denote the learned policy by $\hat p$ and the optimal policy by $p^*$. If (i) the expected RMSE of $f_{p_1}, f_{p_2}$ is bounded by $\varepsilon_{\mathrm{pred}}$, (ii) the average absolute deviation of $\frac{1}{T}|\mathrm{TPP}(u;T)|$ is bounded by $\varepsilon_{\mathrm{dev}}$, and (iii) the expected LV approximation error of the system is bounded by $\varepsilon_{\mathrm{LV}}$, then the learned policy $\hat p$ has bounded regret:
$$\mathbb{E}_{u,\pi}\left[\tfrac{1}{T}|S_u(p^*;T)| - \tfrac{1}{T}|S_u(\hat p;T)|\right] \le \frac{\eta_{\mathrm{TPP}}}{|p_1-p_2|}\left(\varepsilon_{\mathrm{pred}} + \varepsilon_{\mathrm{dev}} + \varepsilon_{\mathrm{LV}}\right)$$
where the expectation is taken over the stochastic choices of the policies, and $\eta_{\mathrm{TPP}} = g(\mu,\nu) \in \mathrm{poly}(\mu,\nu)$.

Proof.

By assumption (i), the functions $f_{p_1}, f_{p_2}$ have bounded expected RMSE:
$$\mathbb{E}_u\left[\left(f_{p_i}(u) - \tfrac{1}{T}|S_u(p_i;T)|\right)^2\right] \le \varepsilon_{\mathrm{pred}}^2 \qquad (15)$$
Applying Jensen's inequality with the convex function $\varphi(x) = x^2$ yields:
$$\left(\mathbb{E}_u\left[\left|f_{p_i}(u) - \tfrac{1}{T}|S_u(p_i;T)|\right|\right]\right)^2 \le \mathbb{E}_u\left[\left(f_{p_i}(u) - \tfrac{1}{T}|S_u(p_i;T)|\right)^2\right]$$
Combining with Eq. (15) and taking the square root, we obtain an upper bound on the expected absolute error:
$$\mathbb{E}_u\left[\left|f_{p_i}(u) - \tfrac{1}{T}|S_u(p_i;T)|\right|\right] \le \varepsilon_{\mathrm{pred}} \qquad (16)$$
Let $\Delta_f = |f_{p_i}(u) - \lambda^*(p_i)|$, and apply the triangle inequality:
$$\Delta_f \le \left|f_{p_i}(u) - \tfrac{1}{T}|S_u(p_i;T)|\right| + \left|\tfrac{1}{T}|S_u(p_i;T)| - \bar\lambda(p_i;u)\right| + \left|\bar\lambda(p_i;u) - \lambda^*(p_i;\theta_u^*)\right|$$
Denote $\varepsilon_f = \varepsilon_{\mathrm{pred}} + \varepsilon_{\mathrm{dev}} + \varepsilon_{\mathrm{LV}}$. Taking expectations and using Eq. (16) together with assumptions (ii) and (iii), we obtain:
$$\mathbb{E}_{u,\pi}[\Delta_f] \le \varepsilon_{\mathrm{pred}} + \varepsilon_{\mathrm{dev}} + \varepsilon_{\mathrm{LV}} = \varepsilon_f \qquad (17)$$
Denote $\theta_u^* = (\alpha, \beta, \gamma, \delta)$. The empirical value $\widehat{\alpha/\beta}$ of $\alpha/\beta$ is given by Definition A.2; denote the estimation error by $\Delta_{\alpha/\beta} = \left|\widehat{\alpha/\beta} - \alpha/\beta\right|$. By Proposition A.4, the following pointwise bound holds whenever $\Delta_f \le \frac{\gamma}{\delta}\frac{|p_1-p_2|}{4}$:
$$\Delta_{\alpha/\beta} \le \frac{\Delta_f}{|p_1-p_2|}\cdot\frac{\beta\delta}{\alpha\gamma} \qquad (18)$$
Plugging the bound on the expected value of $\Delta_f$ into Eq. (18), we obtain in expectation:
$$\mathbb{E}_{u,\pi}\left[\Delta_{\alpha/\beta} \;\middle|\; \Delta_f \le \tfrac{\gamma}{\delta}\tfrac{|p_1-p_2|}{4}\right] \le \frac{\varepsilon_f}{|p_1-p_2|}\max_u \frac{\beta\delta}{\alpha\gamma} \qquad (19)$$
Next, we apply Proposition A.5. Denote $\Delta_{\lambda^*} = \lambda^*(p^*) - \lambda^*(\hat p)$, and define the event:
$$A = \left\{\Delta_f \le \tfrac{\gamma}{\delta}\tfrac{|p_1-p_2|}{4} \;\text{ and }\; \Delta_{\alpha/\beta} \le \tfrac{1}{2\nu}\right\}$$
Note that the bound in Proposition A.5 is a minimum of two functions, one linear in the error and one constant. To leverage this, apply the law of total expectation:
$$\mathbb{E}_{u,\pi}[\Delta_{\lambda^*}] = \mathbb{E}_{u,\pi}[\Delta_{\lambda^*} \mid A]\,\mathbb{P}[A] + \mathbb{E}_{u,\pi}[\Delta_{\lambda^*} \mid \bar A]\,\mathbb{P}[\bar A] \qquad (20)$$
Under $A$, the first term in Eq. (20) can be bounded by the linear term in Proposition A.5. Taking $\mathbb{P}[A] \le 1$ and combining with Eq. (18):
$$\mathbb{E}_{u,\pi}[\Delta_{\lambda^*} \mid A]\,\mathbb{P}[A] \le \mathbb{E}_{u,\pi}\left[\tfrac{\beta^2\gamma}{2\alpha^2\delta}\,\Delta_{\alpha/\beta} \;\middle|\; A\right] \le \mathbb{E}_{u,\pi}\left[\tfrac{\beta^2\gamma}{2\alpha^2\delta}\cdot\tfrac{\Delta_f}{|p_1-p_2|}\cdot\tfrac{\beta\delta}{\alpha\gamma} \;\middle|\; A\right] \le \frac{\nu^3}{2|p_1-p_2|}\,\varepsilon_f \qquad (21)$$
The expectation factor in the second term of Eq. (20) can be bounded by the constant term in Proposition A.5:
$$\mathbb{E}_{u,\pi}[\Delta_{\lambda^*} \mid \bar A] \le \frac{1}{4}\max_u \frac{\beta\gamma}{\alpha\delta} \le \frac{\nu}{4}\max_u\frac{\gamma}{\delta} \qquad (22)$$
Decompose the probability factor $\mathbb{P}[\bar A]$ using the law of total probability:
$$\mathbb{P}[\bar A] = \mathbb{P}\left[\Delta_f > \tfrac{\gamma}{\delta}\tfrac{|p_1-p_2|}{4}\right] + \mathbb{P}\left[\Delta_{\alpha/\beta} > \tfrac{1}{2\nu} \;\middle|\; \Delta_f \le \tfrac{\gamma}{\delta}\tfrac{|p_1-p_2|}{4}\right]\mathbb{P}\left[\Delta_f \le \tfrac{\gamma}{\delta}\tfrac{|p_1-p_2|}{4}\right]$$
$$\le \mathbb{P}\left[\Delta_f > \tfrac{\gamma}{\delta}\tfrac{|p_1-p_2|}{4}\right] + \mathbb{P}\left[\Delta_{\alpha/\beta} > \tfrac{1}{2\nu} \;\middle|\; \Delta_f \le \tfrac{\gamma}{\delta}\tfrac{|p_1-p_2|}{4}\right]$$
Apply Markov's inequality, $\mathbb{P}[|X| \ge a] \le \mathbb{E}[|X|]/a$, to each probability:
$$\mathbb{P}\left[\Delta_f > \tfrac{\gamma}{\delta}\tfrac{|p_1-p_2|}{4}\right] \le \mathbb{E}_{u,\pi}[\Delta_f]\left(\tfrac{\gamma}{\delta}\tfrac{|p_1-p_2|}{4}\right)^{-1} \overset{\text{Eq. (17)}}{\le} \frac{4\,\varepsilon_f}{|p_1-p_2|}\max_u\frac{\delta}{\gamma} \qquad (23)$$
$$\mathbb{P}\left[\Delta_{\alpha/\beta} > \tfrac{1}{2\nu} \;\middle|\; \Delta_f \le \tfrac{\gamma}{\delta}\tfrac{|p_1-p_2|}{4}\right] \le 2\nu\,\mathbb{E}_{u,\pi}\left[\Delta_{\alpha/\beta} \;\middle|\; \Delta_f \le \tfrac{\gamma}{\delta}\tfrac{|p_1-p_2|}{4}\right] \overset{\text{Eq. (19)}}{\le} \frac{2\nu\,\varepsilon_f}{|p_1-p_2|}\max_u\frac{\beta\delta}{\alpha\gamma} \le \frac{2\nu^2\,\varepsilon_f}{|p_1-p_2|}\max_u\frac{\delta}{\gamma} \qquad (24)$$
Plugging Eqs. (21)-(24) back into Eq. (20), we obtain:
$$\mathbb{E}_{u,\pi}[\Delta_{\lambda^*}] \le \frac{\varepsilon_f}{|p_1-p_2|}\left(\frac{\nu^3}{2} + \nu\mu + \frac{\nu^3}{2}\mu\right) =: \varepsilon_{\lambda^*} \qquad (25)$$
To obtain the regret bound on the empirical rates, we apply assumptions (ii) and (iii) once again to bound the expected difference between $\lambda^*(\cdot)$ and $\frac{1}{T}|S_u(\cdot;T)|$, and apply the triangle inequality:
$$\mathbb{E}_{u,\pi}\left[\tfrac{1}{T}|S_u(p^*;T)| - \tfrac{1}{T}|S_u(\hat p;T)|\right] \le \varepsilon_{\lambda^*} + 2(\varepsilon_{\mathrm{dev}} + \varepsilon_{\mathrm{LV}})$$
Note that $\frac{\nu}{|p_1-p_2|} > 1$, as $\frac{\beta}{\alpha} \ge 1$ since all users are assumed to remain engaged in the long term, and $|p_1 - p_2| \le 1$ as $p_1, p_2 \in [0,1]$. Therefore, the function
$$\eta_{\mathrm{TPP}} = g(\mu,\nu) = \frac{\nu^3}{2} + \nu\mu + \frac{\nu^3}{2}\mu + 2\nu$$
satisfies:
$$\mathbb{E}_{u,\pi}\left[\tfrac{1}{T}|S_u(p^*;T)| - \tfrac{1}{T}|S_u(\hat p;T)|\right] \le \frac{\eta_{\mathrm{TPP}}}{|p_1-p_2|}\left(\varepsilon_{\mathrm{pred}} + \varepsilon_{\mathrm{dev}} + \varepsilon_{\mathrm{LV}}\right)$$

B EXPERIMENTAL DETAILS

B.1 DATA

MovieLens-1M

We base our main experimental environment on the MovieLens-1M dataset, a standard benchmark used widely in recommendation systems research (Harper & Konstan, 2015). The dataset includes 1,000,209 ratings provided by 6,040 users for 3,706 movies. Ratings are in the range {1, ..., 5}, and all users in the dataset have at least 20 reported ratings. The dataset is publicly available at: https://grouplens.org/datasets/movielens/1m/.

Data partitioning. To learn latent user and item features, 30% of all ratings were drawn at random. Stratified sampling was applied to ensure that all users and items were covered, and that each user has roughly the same proportion of ratings used for this step. These ratings were used only for learning a CF model, and were discarded afterwards. The remaining 70% of the data points were used for training and testing. For these, we first randomly sampled 1,000 users to form the test set. The remaining users were then partitioned into the main train set S, which included 70% (≈3,528) of these users, and the experimental treatment sets S^{(j)}, each including 10% (≈504) of the users, for N = 3. This procedure was repeated 10 times with different random seeds. We report average results, together with 95% t-distribution confidence intervals representing the variation between runs.

B.2 IMPLEMENTATION DETAILS

• Hardware: All experiments were run on a single laptop with 16GB of RAM and an M1 Pro processor, with no GPU support.
• Runtime: A single run consisting of the entire pipeline (data loading and partitioning, collaborative filtering, training classifiers, simulating dynamics, learning policies, measuring and comparing performance) takes roughly 23 minutes. The main bottleneck is the discrete LV simulation, which takes roughly 70% of the runtime, mostly due to the bookkeeping necessary for the non-stationary baselines. Simulation code was optimized using the NUMBA jit compiler, which improves runtime.
• Optimization packages:
  - Collaborative filtering (CF): We use the SURPRISE package (Hug, 2020), which includes an implementation of the SVD algorithm for CF. All parameters were set to default values.
  - Regression: We use the SCIKIT-LEARN implementation of linear regression for predicting long-term engagement from user features (i.e., the prediction models $f_j(u)$ in Eq. (8)). All parameters were set to default values.
  - Non-Negative Least Squares (NNLS): We use the SCIPY.OPTIMIZE implementation of NNLS, with its default parameters.
• Code: Code for reproducing all of our figures and experiments is available in the following repository: https://github.com/lvml-iclr-2023/lvml.

B.3 OTHER BASELINES

• Safety: At each step of the TPP simulation, look $k$ steps back and calculate the empirical rate $\hat\lambda_i = \frac{k}{t_i - t_{i-k}}$. If this rate exceeds the threshold, $\hat\lambda_i > \tau$, the policy enters a 'cool-down' state, serving only forced breaks until the next time period. In our experiments, we used thresholds $\tau \in \{14, 16\}$, $k = 10$ look-behind steps, and a cool-down period of 0.5 time units.
• Oracle: To estimate the effect of perfect predictions, we implement an oracle predictor $f^{\mathrm{oracle}}_p(u)$ which has access to the latent user parameters. For a given $u$ and for each $p$, the predictor outputs the infinite-horizon LV equilibrium for $u$, namely $f^{\mathrm{oracle}}_p(u) = \lambda^*(p; \bar\theta_u)$. We define $\bar\theta_u = (\alpha_u, \bar\beta_u, \gamma_u, \delta_u)$, where $\alpha_u, \gamma_u, \delta_u$ are the unobserved parameters of the given user, and $\bar\beta_u$ is the expected value of $\beta_{ux}$ under the distribution over recommended items $x$ induced by the recommendation policy $\psi$. We view $\bar\theta_u$ as a useful proxy for the otherwise unattainable $\theta_u$.
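The safety trigger amounts to a sliding-window rate check; a minimal sketch of the rule as we read it (our own helper, not the paper's code):

```python
def safety_break(times, k, tau):
    """Return True if the empirical rate over the last k interactions
    exceeds tau, i.e. rate = k / (t_i - t_{i-k}) > tau (the safety@tau
    baseline; a True result would start the cool-down of forced breaks)."""
    if len(times) <= k:
        return False  # not enough history to look k steps back
    rate = k / (times[-1] - times[-1 - k])
    return rate > tau
```

Note the rule is purely reactive: it only fires after consumption has already exceeded the threshold, which is the contrast with the preemptive learned policy discussed in Section 5.2.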

B.4 HYPERPARAMETERS

• Collaborative filtering: We used $d = 8$ latent factors and enabled bias terms, which ensured performance close to the benchmark of RMSE = 0.873 reported in the SURPRISE documentation. We used the vanilla SVD solver, with all hyperparameters set to their default values.
• Recommendation policy: The softmax temperature was set to 0.5.
• Prediction: We trained regressors $f(u)$ on input feature vectors consisting of three components: $u = (\tilde v_u, b_u, \bar\rho_u) \in \mathbb{R}^{d+2}$. The three components are: (i) the SVD latent user factors $\tilde v_u \in \mathbb{R}^d$; (ii) the SVD user bias term $b_u \in \mathbb{R}$; (iii) an additional feature consisting of the average predicted ratings for unseen items, weighted by recommendation probability, which we found to slightly improve predictive performance:
$$\bar\rho_u = \sum_{x \in \mathrm{holdout}(u)} \hat r_{ux} \cdot \mathrm{softmax}_x(\hat r_u)$$
where $\mathrm{holdout}(u)$ is the set of unseen items corresponding to user $u$, $\hat r_{ux} \in [1,5]$ are the predicted ratings, and $\hat r_u \in [1,5]^{|\mathrm{holdout}(u)|}$ is the vector of all predicted ratings used for softmax recommendation as described in Section 5. We chose to focus on linear models since the treatment datasets are relatively small (each $|S^{(j)}| \approx 500$), and since other model classes (including boosted trees and MLPs) did not perform significantly better.
• Discrete TPP: Interaction sequences for each user were generated according to an LV discretization scheme $\mathrm{TPP}_{\mathrm{LV}}(u;T)$, described in detail in the next section. Latent states were initialized randomly with relative uniform noise around the theoretical LV equilibrium point, $(\lambda_0, q_0) = ((1+\xi_\lambda)\lambda^*, (1+\xi_q)q^*)$, where $\xi_\lambda, \xi_q \sim \mathrm{Uniform}(-0.1, 0.1)$. Latent states were updated every $B = 10$ recommendations to stabilize noise (see Figure 5). When $x$ is recommended to $u$ at time $t$, the latent states and $\Delta t$ are set according to $\beta_u(t)$, which depends on the ratings $r_{ux}$ (true, or mixed with the predictions $u^\top x$ via $\kappa$). Specifically, we use $\beta_u(t) = r_{ux}^2/5 \in \{0.2, 0.8, 1.8, 3.2, 5\}$, which is convex, to accentuate the role of low ratings since they are underrepresented in the data. For $B \ge 1$, we take the effective $\beta_u(t)$ to be the average over the $B$ items recommended in that step. We set $\alpha = 1.3$, and chose $\gamma = 0.2$, $\delta = 0.01$ (which together determine scale) so that typical values of the engagement rate $\frac{1}{T}|S_u|$ are on the order of ≈10 for the chosen $T = 100$.

B.5 DISCRETE TPP FOR LOKTA-VOLTERRA SIMULATION

The Temporal Point Process (TPP) we use for simulating user interaction sequences $S_u$ is based on a discretization of the LV system described in Eq. (2), using the forward Euler method with variable step sizes. We denote this process by $\mathrm{TPP}_{\mathrm{LV}}(u;T)$, and present the sampling procedure in Algorithm 1. Each user is associated with discrete latent states $\lambda_i, q_i$ and parameters $\alpha_u, \gamma_u, \delta_u$. The initial states $\lambda_0, q_0$ are set randomly. At each step $i$, at time $t_i$, the system recommends $x_i = x(t_i)$, which triggers updates to the latent states and determines the next interaction time $t_{i+1}$. As noted, these updates depend on the item-specific parameters $\beta_{u,x_i}$. Our TPP produces discrete sequences that are qualitatively different from their continuous-time analogs (blue lines). Nonetheless, it captures the general properties of our proposed behavioral model: note how the cumulative averaging behavior (orange dashes) exhibits 'habit formation', which our equilibrium approach targets (blue dots). For the same initial conditions $\lambda(0), q(0)$, the figure shows how varying the number of recommended items per step ($B$) 'smooths' the discrete behavior (left: $B = 1$, center: $B = 10$). For fixed $\beta_u(t) = \beta_u$, when $B \to \infty$ and $\Delta t \to 0$, TPP sequences approach a continuous LV trajectory; in general, and particularly when $\beta_u(t)$ varies by step and per recommended item, this is not the case. Under the stationary policy $\pi(p)$, the system recommends an item with probability $1-p$ and suggests a break with probability $p$. The simulator considers $B$ recommendation opportunities at each step. For each $k \in \{1, \ldots, B\}$, denote by $I_k \in \{0,1\}$ the break indicator, equal to 0 when a break is served at the $k$-th slot in the batch. Denote by $x \sim \psi$ the item recommended by the underlying policy $\psi$, and by $\beta(x) = r_{ux}^2/5$ the corresponding LV hyperparameter as defined above.
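The sampling procedure of Algorithm 1 below can be sketched in a few lines of Python. This is a simplified reading, not the paper's implementation: the recommendation policy ψ is replaced by a caller-supplied `sample_rating` hook, the returned sequence keeps only event times, and the small positivity/range clamps on the latent states are our own stabilization assumption:

```python
import random

def sample_tpp_lv(p, sample_rating, theta, T, B=10, lam0=1.0, q0=0.5, seed=0):
    """Euler-discretized LV temporal point process, in the spirit of
    Algorithm 1. `sample_rating()` stands in for drawing an item from psi
    and reading its rating r_ux; theta = (alpha, gamma, delta), since the
    item-specific beta comes from ratings."""
    alpha, gamma, delta = theta
    rng = random.Random(seed)
    t, lam, q, events = 0.0, lam0, q0, []
    while t < T:
        I = [rng.random() < 1.0 - p for _ in range(B)]          # break indicators
        betas = [sample_rating() ** 2 / 5.0 for _ in range(B)]  # beta(r_ux)
        dt = 1.0 / lam                                          # variable step size
        beta_eff = sum(b for b, served in zip(betas, I) if served) / B
        delta_eff = sum(delta for served in I if served) / B
        lam_new = lam + (-alpha + beta_eff * q) * lam * dt      # drive update
        q_new = q + (gamma * (1.0 - q) - delta_eff * lam) * q * dt  # interest update
        events.append(t)
        lam = max(lam_new, 1e-9)                 # clamp (our stabilization)
        q = min(max(q_new, 0.0), 1.0)            # keep interest in [0, 1]
        t += 1.0 / lam
    return events
```

With $p = 0$ and a constant rating, the sampled event times drift towards the spacing implied by the LV equilibrium rate, mirroring the habit-formation behavior described above.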
Algorithm 1: Sample from $\mathrm{TPP}_{\mathrm{LV}}(u;T)$
Input: break probability $p \in [0,1]$; stationary content recommendation policy $\psi$; Lotka-Volterra parameters $\theta_u = (\alpha, \beta, \gamma, \delta)$; time horizon $T > 0$
Output: interaction sequence $S_u \sim \mathrm{TPP}_{\mathrm{LV},\,\pi(p)\circ\psi}(u;T)$
Initialize $i = 0$, $t_0 = 0$, $S_u = \{\}$;
while $t_i < T$ do
    foreach $k \in \{1, \ldots, B\}$ do
        $I_k \sim \mathrm{Bernoulli}(1-p)$; $\quad x_k \sim \psi$; $\quad \beta_k \leftarrow \beta(r_{u x_k})$;
    end
    $\Delta t_i \leftarrow \lambda_i^{-1}$;
    $\lambda_{i+1} \leftarrow \lambda_i + \left(-\alpha + \frac{1}{B}\sum_{k=1}^B I_k\beta_k \cdot q_i\right)\lambda_i\,\Delta t_i$;
    $q_{i+1} \leftarrow q_i + \left(\gamma(1-q_i) - \frac{1}{B}\sum_{k=1}^B I_k\delta \cdot \lambda_i\right) q_i\,\Delta t_i$;
    $t_{i+1} \leftarrow t_i + \lambda_{i+1}^{-1}$;
    $S_u \leftarrow S_u \cup \{(t_i, (x_1,\ldots,x_B), (I_1,\ldots,I_B))\}$;
    $i \leftarrow i + 1$;
end

C ADDITIONAL EMPIRICAL EVALUATION

In this section, we provide additional empirical results as further indicators of potential applicability:
• In subsection C.1, we repeat the experiments presented in Section 5 using a different dataset, showing that the qualitative results remain consistent across recommendation modalities.

Evaluation. For evaluation, we follow the same procedure described in Appendix B, maintaining identical data processing procedures, constants, and hyperparameters. This procedure was repeated 10 times with different random seeds. We report average results, together with 95% t-distribution confidence intervals.

Results. Figure 6 shows results for the Goodreads experiment, in the same format as Figure 3. The results exhibit performance and trends that are qualitatively similar to the MovieLens experiment in Section 5. The LV policy optimization method achieves better performance on this dataset (right pane): the performance of LV is closer to oracle (-0.526% on Goodreads compared to -0.918% on ML-1M), and the gap from the best-of method is larger (+5.77% on Goodreads compared to +2.19% on ML-1M). Varying user types (center pane) shows less variation across values of $\kappa$, indicating that the predictors achieve satisfactory performance even when $\kappa$ is low and prediction is harder (see subsection 5.1).
Varying treatments (right pane) is also consistent with our earlier observations: LV policy optimization on the Goodreads dataset exhibits less performance degradation as $p_1 \to 0$, suggesting that less data may suffice for optimization. We attribute these results to the larger size of the dataset (approximately 3.6M interactions, compared to 1M in ML-1M), and to the possibility that stronger structure exists in this recommendation modality compared to general movie recommendation.

C.2 LEVERAGING ADDITIONAL STRUCTURE USING ADAPTIVE POLICIES

One benefit of our approach in Section 4 is that it requires only predictions of long-term engagement. In reality, however, other sources of information may also be available to the learner, and may be useful for improving decisions on when to recommend breaks. Here we consider additional information in the form of explicit user feedback, collected by the system over the course of the interaction. We model users as providing the system with feedback regarding item quality, namely the true ratings $r_{ux}$, for some of the recommended items $x$ they consume. For simplicity, we assume users report ratings with probability $p_r \in [0,1]$, independently for each item, where we vary $p_r$ across experimental conditions. Intuitively, we would expect user-reported ratings to help improve the learned breaking policy over time; however, over-relying on little data at an early stage may hinder performance. Here we examine when and how such feedback should be utilized.

Adaptive policy optimization. To extend our approach to incorporate ratings-based feedback, we propose an adaptive explore-then-exploit method which can leverage sparse explicit feedback. Fixing an adaptation time $T_0 \in [0,T]$ (a hyperparameter), the method applies the standard learned breaking policy $\pi$ (Equation 9) in the time frame $t \le T_0$, during which it also collects and stores user-reported ratings. Then, at time $t = T_0$, the method updates the learned policy on the basis of the new ratings data, and applies the updated policy until the horizon $T$. In particular, the ratings are used to improve the estimated component $\bar\rho_u$ of the user feature vector $u'$ (described in Equation 27). In this way, the method reuses the existing engagement predictors $f_p(u)$ without requiring any retraining. A formal description of this method is provided in Algorithm 2. Here we denote the set of ratings collected until time $t$ by $r_{[0,t]} \in \{1, \ldots, 5\}^*$; the average rating at time $t$ by $\bar\rho_{[0,t]} = \frac{1}{|r_{[0,t]}|}\sum_{r \in r_{[0,t]}} r$; and the set of long-term engagement predictions for a feature vector $u$ by $f(u) = \{(p_i, f_{p_i}(u))\}$ (as in subsection 4.1).

Algorithm 2: Adaptive policy optimization using sparse rating signals
Input: initial break probability $p \in [0,1]$; time horizon $T > 0$; adaptation time $T_0 \in [0,T]$; user feature vector $u = (\tilde v, b, \bar\rho)$
1. Collect ratings data $r_{[0,T_0]}$ with breaking policy $\pi(p)$ until time $T_0$;
2. Construct the updated feature vector $u' = (\tilde v, b, \bar\rho_{[0,T_0]})$ using the average rating $\bar\rho_{[0,T_0]}$;
3. Compute updated long-term engagement predictions $f(u')$;
4. Use the LV policy optimization method to obtain the updated policy $p' = p^*(f(u'))$;
5. Use breaking policy $\pi(p')$ for the remaining time $t \in [T_0, T]$;

Evaluation. We evaluate the adaptive policies using the MovieLens-1M experimental setup, as described in Section 5 and Appendix B. We use the same linear regression predictors $f_p(u)$ from the main experiment, and maintain the same time horizon $T = 100$. We vary the ratings density parameter $p_r$ in the range $[0,1]$, and vary the adaptation time $T_0$ across three values, $T_0 \in \{0.5, 5, 50\}$.

Results. Results are presented in Figure 7. We observe that LV-adaptive with $T_0 = 5$ represents an optimal point within the hyperparameter space: for this choice, results show significant gains over the non-adaptive LV method even under moderate rating densities, and quick convergence towards the LV-true-ratings upper bound. In contrast, LV-adaptive with $T_0 = 0.5$ represents a "premature optimization" scenario, attempting to adapt the policy without allowing the estimated average $\bar\rho_{[0,T_0]}$ enough time to converge to its expected value. Similarly, $T_0 = 50$ represents "late optimization", which has an accurate estimate of $\bar\rho_{[0,T_0]}$ but not enough time to benefit from the updated policy.
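The explore-then-exploit loop of Algorithm 2 can be sketched as a thin orchestrator; the hook functions below (`collect_ratings`, `predict_engagement`, `optimize_policy`) are hypothetical stand-ins for the paper's components, not its actual API:

```python
def adaptive_policy(u, collect_ratings, predict_engagement, optimize_policy,
                    p_init, T, T0):
    """Explore-then-exploit adaptation in the spirit of Algorithm 2:
    run pi(p_init) until T0 while collecting sparse ratings, refresh the
    average-rating feature, and re-optimize the break probability.
    All three hooks are hypothetical stand-ins."""
    v, b, rho = u
    ratings = collect_ratings(p_init, T0)        # phase 1: t in [0, T0]
    if ratings:                                  # replace rho by observed mean
        rho = sum(ratings) / len(ratings)
    preds = predict_engagement((v, b, rho))      # reuse predictors, no retraining
    p_new = optimize_policy(preds)               # LV policy optimization step
    return p_new                                 # used for t in [T0, T]
```

Note that when no ratings arrive before $T_0$, the sketch keeps the original $\bar\rho$ feature, so the method gracefully degrades to the non-adaptive policy.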
C.3 DISTINCT BEHAVIORAL MODEL

In this work, we propose the LV model as a behavioral hypothesis class for counterfactual prediction of long-term engagement. As such, using the LV model within the learning-to-break optimization framework is a design choice, to be made at the discretion of the learner; our LV model would be a good choice if it fits the data better than alternative model classes, given the amount of available data. This relation is made precise in our error bound (Theorem 1), which bounds the error when using the LV model for any underlying TPP (i.e., we make no assumptions about the true underlying data generation process). Our main experiments evaluate our approach on data that is not generated by LV dynamics, but which nonetheless bears some resemblance to them. To complement these results, here we run an additional experiment in which we empirically evaluate our LV-based approach on data generated by an entirely distinct behavioral model. In particular, we consider a user model in which consumption decisions depend only on the quality of recommended items, without any dependence on internal states, i.e., it is stateless. This conforms to the behavioral models implicitly assumed in conventional recommendation methods such as collaborative filtering.

Stateless content consumption. The data generation process is formalized in Algorithm 3. As a means of capturing stateless behavior, we define a "close-range" temporal point process TPP_CR(u; T) which generates sequences of user interactions based solely on the average rating of recommended items in a batch. Here we relate rating to utility, and assume that items having higher utility (and therefore higher ratings) induce more frequent interactions; in the same way, we consider break prompts as items having zero utility, and hence zero rating. At time t_i, given a batch of size B with k ≤ B recommended items and B − k ≥ 0 break prompts, users acting according to the stateless behavioral model will consume the next batch of content at time:

t_{i+1} = t_i + (1/τ) ( (1/B) Σ_{j=1}^{k} r_{ux_j} )^{-1}     (28)

where (r_{ux_1}, ..., r_{ux_k}) ∈ {1, ..., 5}^k are the ratings of the items recommended at time t_i, and τ > 0 is a constant latent parameter to be learned from data. The breaking policy π decides on the number of items k to recommend at each step. Since r_ux ≥ 1 for all user-item pairs, the time difference ∆t_i = t_{i+1} − t_i in Equation 28 is minimized by taking k = B. This shows that the optimal breaking policy under this behavioral model is the default one, which does not prompt the user to break.

Evaluation. We evaluate the stateless behavioral model using the MovieLens-1M experimental setup, as described in section 5. We set τ = 4 for all users, utilize linear regression for engagement prediction, and maintain all hyperparameters without change. Data processing steps are performed as described in appendix B, and the chosen breaking policies are evaluated.
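The inter-arrival step of Equation 28, and why k = B minimizes it, can be checked numerically. This is a minimal sketch with names of our own choosing; breaks are modeled by padding the rating sum with zeros, as in the text.

```python
def next_interaction_time(t_i, ratings, B, tau):
    """Inter-arrival step of the stateless model (Equation 28):
    t_{i+1} = t_i + (1/tau) * ((1/B) * sum of batch ratings)^{-1}.
    The B - len(ratings) break prompts implicitly contribute rating 0."""
    assert 1 <= len(ratings) <= B
    avg = sum(ratings) / B             # breaks count as zero-rated items
    return t_i + (1.0 / tau) / avg

# Padding the batch with breaks lowers the average rating and lengthens
# the gap, so recommending a full batch (k = B) is optimal:
full = next_interaction_time(0.0, [3] * 10, B=10, tau=4.0)  # k = B, no breaks
half = next_interaction_time(0.0, [3] * 5,  B=10, tau=4.0)  # 5 break prompts
```

Here `full` < `half`: replacing half the items with breaks doubles the waiting time, matching the claim that the optimal policy under this model never prompts a break.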



This is similar in spirit to the 'learning to defer' paradigm (Madras et al., 2018), but in a different context.




Figure 3: Results on the MovieLens 1M dataset. (Left) Performance gain of different approaches (relative to default policy). (Center) Performance of LV by user group, partitioned by learned policies p_u. (Right) Sensitivity to an increasingly aggressive experimental p_1 (N = 2, p_0 = 0).

Figure 4: Graphical illustration of Proposition A.5. Cost of estimation error for different values of α/β, and their corresponding upper bounds given by the claim.

Figure 5: Example discrete sequence S_u ∼ TPP_LV(u; T), compared to continuous LV dynamics. Our TPP produces discrete sequences that are qualitatively different from their continuous-time analogs (blue lines). Nonetheless, it captures the general properties of our proposed behavioral model: note how the cumulative averaging behavior (orange dashes) exhibits 'habit formation', which our equilibrium approach targets (blue dots). For the same initial conditions λ(0), q(0), the figure shows how varying the number of recommended items per step (B) 'smooths' the discrete behavior (left: B = 1, center: B = 10). For fixed β_u(t) = β_u, when B → ∞ and ∆t → 0, TPP sequences approach a continuous LV trajectory; in general, and particularly when β_u(t) varies by step and per recommended item, this is not the case.

Figure 6: Results on the Goodreads comic-books dataset (compare to Figure 3). (Left) Performance gain of different approaches (relative to default policy). (Center) Performance of LV by user group, partitioned by learned policies p_u. (Right) Sensitivity to an increasingly aggressive experimental p_1 (N = 2, p_0 = 0).


Figure 7: Adaptive policy evaluation for varying rating density p_r ∈ [0, 1]. The solid blue line represents the adaptive policy described in Algorithm 2, with adaptation time T_0 = 5. Dashed lines represent the performance of adaptive policies under different choices of T_0. The cyan horizontal line represents LV-true-ratings, a non-adaptive method with oracle access to the true average rating. The remaining horizontal lines represent the performance of selected non-adaptive policies, as presented in Figure 3 (left pane).

• In subsection C.2, we demonstrate how to leverage additional problem structure commonly available in practice, and develop adaptive breaking policies which improve over time.
• Finally, in subsection C.3 we present a stateless consumption model which is unrelated to Lotka-Volterra and does not promote breaks. Evaluation of the LV policy optimization method under this model shows that the optimization procedure does not degrade performance when breaks are not beneficial, thus extending Corollary 2 beyond the realizable case.

C.1 ADDITIONAL DATASET

The Goodreads dataset. The Goodreads book recommendation dataset is a common benchmark in recommendation systems research (Wan & McAuley, 2018; Wan et al., 2019). Ratings are in the range {1, . . . , 5}, and the dataset is filtered to only include users with at least 20 reported ratings. We use the official comic-books genre subset of the dataset, which includes 3,679,076 ratings provided by 41,932 users for 87,565 books after pre-processing. The dataset is publicly available at: https://sites.google.com/eng.ucsd.edu/ucsdbookgraph/. Pre-processing code is available in our code repository: https://github.com/lvml-iclr-2023/lvml.
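The minimum-ratings filter described above can be sketched as follows. This is an illustrative helper (the function name and the triple-based record format are ours), not the repository's pre-processing code.

```python
from collections import Counter

def filter_min_ratings(ratings, min_count=20):
    """Keep only the ratings of users with at least `min_count` ratings,
    mirroring the Goodreads pre-processing step.

    `ratings` is a list of (user, item, rating) triples.
    """
    per_user = Counter(user for user, _, _ in ratings)
    return [(u, i, r) for (u, i, r) in ratings if per_user[u] >= min_count]
```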


Algorithm 3: Sample from TPP_CR(u; T)
Input: Break probability p ∈ [0, 1]; Stationary content recommendation probability ψ; Scalar parameter θ_u = τ > 0; Time horizon T > 0
Output:

(Right) User with extremely low engagement, for which the system tends to over-predict. In both cases, the LV policy optimization method selects the optimal policy p = 0.

Results. Despite the difference between the true TPP and our choice of model class, the LV policy optimization method successfully learned the optimal no-breaks policy, which in this case is p = 0 for all users. Further detail is provided by Figure 8, which illustrates the policy optimization steps under TPP_CR(u; T) for typical users with unbiased and biased engagement predictions. In both examples, the optimal points of both curves coincide, and the optimal policy is selected despite poor point-wise fit. Combined, these results show that our approach is "safe", in the sense that when breaking is sub-optimal, the learned breaking policy does not override the default policy.
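A sampler in the spirit of Algorithm 3 can be sketched as follows. The batch mechanics here are illustrative assumptions: each of the B slots holds a break prompt with probability p (rating 0) and a rated item otherwise, and the stationary content recommendation probability ψ from the algorithm's input list is omitted, as its exact role is not specified in this excerpt.

```python
import random

def sample_tpp_cr(rate_item, p, tau, T, B=10, seed=0):
    """Sketch of sampling interaction times from TPP_CR(u; T).

    rate_item -- samples one item rating in {1, ..., 5}
    p         -- per-slot break probability (breaks are zero-rated)
    tau, T, B -- latent rate parameter, horizon, and batch size
    """
    rng = random.Random(seed)
    t, times = 0.0, []
    while True:
        batch = [0 if rng.random() < p else rate_item() for _ in range(B)]
        avg = sum(batch) / B
        if avg == 0:                   # all-breaks batch: no consumption
            break
        t += (1.0 / tau) / avg         # inter-arrival step (Equation 28)
        if t > T:
            break
        times.append(t)
    return times
```

Consistent with the analysis, raising the break probability only stretches the gaps between interactions: the stateless user interacts most often when no breaks are prompted.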

