OFFLINE REINFORCEMENT LEARNING WITH DIFFERENTIAL PRIVACY

Abstract

The offline reinforcement learning (RL) problem is often motivated by the need to learn data-driven decision policies in financial, legal and healthcare applications. However, the learned policy could retain sensitive information about individuals in the training data (e.g., treatment and outcome of patients) and is thus susceptible to various privacy risks. We design offline RL algorithms with differential privacy guarantees which provably prevent such risks. These algorithms also enjoy strong instance-dependent learning bounds under both tabular and linear Markov Decision Process (MDP) settings. Our theory and simulation suggest that the privacy guarantee comes at (almost) no drop in utility compared to the non-private counterpart for a medium-size dataset.

1. INTRODUCTION

Offline Reinforcement Learning (or batch RL) aims to learn a near-optimal policy in an unknown environment (footnote 1) through a static dataset gathered from some behavior policy $\mu$. Since offline RL does not require access to the environment, it can be applied to problems where interaction with the environment is infeasible, e.g., when collecting new data is costly (trade or finance (Zhang et al., 2020)), risky (autonomous driving (Sallab et al., 2017)) or illegal / unethical (healthcare (Raghu et al., 2017)). In such practical applications, the data used by an RL agent usually contains sensitive information. Take medical history as an example: for each patient, at each time step, the patient reports her health condition (age, disease, etc.), then the doctor decides the treatment (which medicine to use, the dosage of medicine, etc.), and finally there is a treatment outcome (whether the patient feels good, etc.) and the patient transitions to another health condition. Here, (health condition, treatment, treatment outcome) corresponds to (state, action, reward) and the dataset can be considered as $n$ (number of patients) trajectories sampled from an MDP with horizon $H$ (number of treatment steps). However, learning agents are known to implicitly memorize details of individual training data points verbatim (Carlini et al., 2019), even if they are irrelevant for learning (Brown et al., 2021), which makes offline RL models vulnerable to various privacy attacks. Differential privacy (DP) (Dwork et al., 2006) is a well-established definition of privacy with many desirable properties. A differentially private offline RL algorithm will return a decision policy that is indistinguishable from a policy trained in an alternative universe in which any individual user is replaced, thereby preventing the aforementioned privacy risks.
There is a surge of recent interest in developing RL algorithms with DP guarantees, but they focus mostly on the online setting (Vietri et al., 2020; Garcelon et al., 2021; Liao et al., 2021; Chowdhury & Zhou, 2021; Luyo et al., 2021). Offline RL is arguably more practically relevant than online RL in applications with sensitive data. For example, in the healthcare domain, online RL requires actively running new exploratory policies (clinical trials) with every new patient, which often involves complex ethical / legal clearances, whereas offline RL uses only historical patient records that are often accessible for research purposes. Clear communication of the adopted privacy enhancing techniques (e.g., DP) to patients was reported to further improve data access (Kim et al., 2017).
Our contributions. In this paper, we present the first provably efficient algorithms for offline RL with differential privacy. Our contributions are twofold.
• We design two new pessimism-based algorithms, DP-APVI (Algorithm 1) and DP-VAPVI (Algorithm 2), one for the tabular setting (finite states and actions), the other for the case with linear function approximation (under the linear MDP assumption). Both algorithms enjoy DP guarantees (pure DP or zCDP) and instance-dependent learning bounds where the cost of privacy appears as lower order terms.
• We perform numerical simulations to evaluate and compare the performance of our algorithm DP-VAPVI (Algorithm 2) with its non-private counterpart VAPVI (Yin et al., 2022) as well as a popular baseline PEVI (Jin et al., 2021). The results complement the theoretical findings by demonstrating the practicality of DP-VAPVI under strong privacy parameters.
Related work. To our knowledge, differential privacy in offline RL tasks has not been studied before, except for much simpler cases where the agent only evaluates a single policy (Balle et al., 2016; Xie et al., 2019). Balle et al. (2016) privatized the first-visit Monte Carlo-Ridge Regression estimator by an output perturbation mechanism and Xie et al. (2019) used DP-SGD. Neither paper considered offline learning (or policy optimization), which is our focus. There is a larger body of work on private RL in the online setting, where the goal is to minimize regret while satisfying either joint differential privacy (Vietri et al., 2020; Chowdhury & Zhou, 2021; Ngo et al., 2022; Luyo et al., 2021) or local differential privacy (Garcelon et al., 2021; Liao et al., 2021; Luyo et al., 2021; Chowdhury & Zhou, 2021). The offline setting introduces new challenges in DP as we cannot algorithmically enforce good "exploration", but have to work with a static dataset and privately estimate the uncertainty in addition to the value functions. A private online RL algorithm can sometimes be adapted for private offline RL too, but those from existing work yield suboptimal and non-adaptive bounds. We give a more detailed technical comparison in Appendix B. Among non-private offline RL works, we build directly upon the methods known as Adaptive Pessimistic Value Iteration (APVI, for tabular MDPs) (Yin & Wang, 2021b) and Variance-Aware Pessimistic Value Iteration (VAPVI, for linear MDPs) (Yin et al., 2022), as they give the strongest theoretical guarantees to date. We refer readers to Appendix B for a more extensive review of the offline RL literature. Introducing DP to APVI and VAPVI while retaining the same sample complexity (modulo lower order terms) requires nontrivial modifications to the algorithms.
A remark on technical novelty. Our algorithms involve substantial technical innovation over previous works on online DP-RL with joint DP guarantee (footnote 2). Different from previous works, our DP-APVI (Algorithm 1) operates on Bernstein type pessimism, which requires our algorithm to deal with conditional variance using private statistics. Besides, our DP-VAPVI (Algorithm 2) replaces the LSVI technique with variance-aware LSVI (also known as weighted ridge regression, first appearing in (Zhou et al., 2021)). Our DP-VAPVI releases the conditional variance privately, and further applies weighted ridge regression privately. Both approaches ensure tighter instance-dependent bounds on the suboptimality of the learned policy.

2. PROBLEM SETUP

Under a policy $\pi = (\pi_1, \dots, \pi_H)$, actions follow the map $s_h \mapsto \pi_h(\cdot|s_h)$, $\forall h \in [H]$. A random trajectory $s_1, a_1, r_1, \dots, s_H, a_H, r_H, s_{H+1}$ is generated according to $s_1 \sim d_1$, $a_h \sim \pi_h(\cdot|s_h)$, $r_h \sim r_h(s_h, a_h)$, $s_{h+1} \sim P_h(\cdot|s_h, a_h)$, $\forall h \in [H]$. For tabular MDP, $\mathcal{S} \times \mathcal{A}$ is the discrete state-action space and $S := |\mathcal{S}|$, $A := |\mathcal{A}|$ are finite. In this work, we assume that $r$ is known (footnote 3). In addition, we denote the per-step marginal state-action occupancy $d^\pi_h(s, a)$ as $d^\pi_h(s, a) := \mathbb{P}[s_h = s \mid s_1 \sim d_1, \pi] \cdot \pi_h(a|s)$, which is the marginal state-action probability at time $h$.

Value function, Bellman (optimality) equations. The value function $V^\pi_h(\cdot)$ and Q-value function $Q^\pi_h(\cdot, \cdot)$ for any policy $\pi$ are defined as: $V^\pi_h(s) = \mathbb{E}_\pi[\sum_{t=h}^H r_t \mid s_h = s]$, $Q^\pi_h(s, a) = \mathbb{E}_\pi[\sum_{t=h}^H r_t \mid s_h, a_h = s, a]$, $\forall (h, s, a) \in [H] \times \mathcal{S} \times \mathcal{A}$. The performance is defined as $v^\pi := \mathbb{E}_{d_1}[V^\pi_1] = \mathbb{E}_{\pi, d_1}[\sum_{t=1}^H r_t]$. The Bellman (optimality) equations follow, $\forall h \in [H]$: $Q^\pi_h = r_h + P_h V^\pi_{h+1}$, $V^\pi_h = \mathbb{E}_{a \sim \pi_h}[Q^\pi_h]$, $Q^\star_h = r_h + P_h V^\star_{h+1}$, $V^\star_h = \max_a Q^\star_h(\cdot, a)$.

Linear MDP (Jin et al., 2020b). An episodic MDP $(\mathcal{S}, \mathcal{A}, H, P, r)$ is called a linear MDP with known feature map $\phi : \mathcal{S} \times \mathcal{A} \to \mathbb{R}^d$ if there exist $H$ unknown signed measures $\nu_h \in \mathbb{R}^d$ over $\mathcal{S}$ and $H$ unknown reward vectors $\theta_h \in \mathbb{R}^d$ such that $P_h(s' \mid s, a) = \langle \phi(s, a), \nu_h(s') \rangle$, $r_h(s, a) = \langle \phi(s, a), \theta_h \rangle$, $\forall (h, s, a, s') \in [H] \times \mathcal{S} \times \mathcal{A} \times \mathcal{S}$. Without loss of generality, we assume $\|\phi(s, a)\|_2 \le 1$ and $\max(\|\nu_h(\mathcal{S})\|_2, \|\theta_h\|_2) \le \sqrt{d}$ for all $(h, s, a) \in [H] \times \mathcal{S} \times \mathcal{A}$. An important property of linear MDP is that the value functions are linear in the feature map, which is summarized in Lemma E.14.

Offline setting and the goal. Offline RL requires the agent to find a policy $\pi$ that maximizes the performance $v^\pi$, given only the episodic data $D = \{(s^\tau_h, a^\tau_h, r^\tau_h, s^\tau_{h+1})\}_{\tau \in [n]}^{h \in [H]}$ (footnote 4) rolled out from some fixed and possibly unknown behavior policy $\mu$, which means we cannot change $\mu$ and in particular we do not assume the functional knowledge of $\mu$. In conclusion, based on the batch data $D$ and a targeted accuracy $\epsilon > 0$, the agent seeks to find a policy $\pi_{\mathrm{alg}}$ such that $v^\star - v^{\pi_{\mathrm{alg}}} \le \epsilon$.
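The Bellman equations above translate directly into backward induction over $h = H, \dots, 1$. The following is a minimal sketch for a small tabular MDP; the nested-list layout of `P`, `r` and `pi` and the function names are illustrative assumptions, not part of the paper.

```python
def evaluate_policy(P, r, pi, H):
    """Backward induction for V^pi_h and Q^pi_h in a tabular finite-horizon MDP.

    Hypothetical input layout (0-indexed time steps):
      P[h][s][a]  -> list of next-state probabilities
      r[h][s][a]  -> expected immediate reward
      pi[h][s]    -> list of action probabilities
    """
    S = len(r[0])
    A = len(r[0][0])
    V = [[0.0] * S for _ in range(H + 1)]  # V[H] plays the role of V_{H+1} = 0
    Q = [[[0.0] * A for _ in range(S)] for _ in range(H)]
    for h in range(H - 1, -1, -1):
        for s in range(S):
            for a in range(A):
                # Q_h = r_h + P_h V_{h+1}
                Q[h][s][a] = r[h][s][a] + sum(
                    p * v for p, v in zip(P[h][s][a], V[h + 1])
                )
            # V_h = E_{a ~ pi_h}[Q_h]
            V[h][s] = sum(pi[h][s][a] * Q[h][s][a] for a in range(A))
    return V, Q


def performance(V, d1):
    """v^pi = E_{d_1}[V^pi_1]."""
    return sum(p * v for p, v in zip(d1, V[0]))
```

Replacing the expectation over $\pi_h$ with a maximum over actions would give the optimality equations for $Q^\star_h$ and $V^\star_h$ instead.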

2.1. ASSUMPTIONS IN OFFLINE RL

In order to show that our privacy-preserving algorithms can generate a near-optimal policy, certain coverage assumptions are needed. In this section, we list the assumptions we use in this paper.
Assumptions for tabular setting.
Assumption 2.1 (Liu et al., 2019). There exists one optimal policy $\pi^\star$ such that $\pi^\star$ is fully covered by $\mu$, i.e., $\forall (s_h, a_h) \in \mathcal{S} \times \mathcal{A}$, $d^{\pi^\star}_h(s_h, a_h) > 0$ only if $d^\mu_h(s_h, a_h) > 0$. Furthermore, we denote the trackable set as $\mathcal{C}_h := \{(s_h, a_h) : d^\mu_h(s_h, a_h) > 0\}$.
Assumption 2.1 is the weakest assumption needed for accurately learning the optimal value $v^\star$: it only requires $\mu$ to trace the state-action space of one optimal policy ($\mu$ can be agnostic at other locations). Similar to (Yin & Wang, 2021b), we will use Assumption 2.1 for the tabular part of this paper, which enables comparison between our sample complexity and the conclusion in (Yin & Wang, 2021b), whose algorithm serves as a non-private baseline.
Assumptions for linear setting. First, we define the expectation of the covariance matrix under the behavior policy $\mu$ for every time step $h \in [H]$ as below:
$$\Sigma^p_h := \mathbb{E}_\mu\left[\phi(s_h, a_h) \phi(s_h, a_h)^\top\right]. \quad (1)$$
As has been shown in (Wang et al., 2021; Yin et al., 2022), learning a near-optimal policy from offline data requires coverage assumptions. In the linear setting, such coverage is characterized by the minimum eigenvalue of $\Sigma^p_h$. Similar to (Yin et al., 2022), we apply the following assumption for the sake of comparison.
Assumption 2.2 (Feature Coverage, Assumption 2 in (Wang et al., 2021)). The data distributions $\mu$ satisfy the minimum eigenvalue condition: $\forall h \in [H]$, $\kappa_h := \lambda_{\min}(\Sigma^p_h) > 0$. Furthermore, we denote $\kappa = \min_h \kappa_h$.

2.2. DIFFERENTIAL PRIVACY IN OFFLINE RL

In this work, we aim to design privacy-preserving algorithms for offline RL. We apply differential privacy as the formal notion of privacy. Below we revisit the definition of differential privacy.
Definition 2.3 (Differential Privacy (Dwork et al., 2006)). A randomized mechanism $M$ satisfies $(\epsilon, \delta)$-differential privacy ($(\epsilon, \delta)$-DP) if for all neighboring datasets $U, U'$ that differ by one data point and for all possible events $E$ in the output range, it holds that $\mathbb{P}[M(U) \in E] \le e^\epsilon \cdot \mathbb{P}[M(U') \in E] + \delta$. When $\delta = 0$, we say pure DP, while for $\delta > 0$, we say approximate DP.
In the problem of offline RL, the dataset consists of several trajectories, therefore one data point in Definition 2.3 refers to one single trajectory. Hence the definition of differential privacy means that the difference in the distribution of the output policy resulting from replacing one trajectory in the dataset will be small. In other words, an adversary cannot infer much information about any single trajectory in the dataset from the output policy of the algorithm. Throughout the paper, we will use zCDP (defined below) as a surrogate for DP, since it enables cleaner analysis for privacy composition and the Gaussian mechanism. The properties of zCDP (e.g., composition, conversion formula to DP) are deferred to Appendix E.3.
Definition 2.4 (zCDP (Dwork & Rothblum, 2016; Bun & Steinke, 2016)). A randomized mechanism $M$ satisfies $\rho$-Zero-Concentrated Differential Privacy ($\rho$-zCDP) if for all neighboring datasets $U, U'$ and all $\alpha \in (1, \infty)$, $D_\alpha(M(U) \| M(U')) \le \rho \alpha$, where $D_\alpha$ is the Renyi divergence (Van Erven & Harremos, 2014).
Finally, we go over the definition and privacy guarantee of the Gaussian mechanism.
Definition 2.5 (Gaussian Mechanism (Dwork et al., 2014)). Define the $\ell_2$ sensitivity of a function $f : \mathbb{N}^{\mathcal{X}} \to \mathbb{R}^d$ as $\Delta_2(f) = \sup_{\text{neighboring } U, U'} \|f(U) - f(U')\|_2$. The Gaussian mechanism $\mathcal{M}$ with noise level $\sigma$ is then given by $\mathcal{M}(U) = f(U) + \mathcal{N}(0, \sigma^2 I_d)$.
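As an illustration of Definition 2.5, the sketch below adds Gaussian noise calibrated for $\rho$-zCDP, using the standard calibration $\sigma^2 = \Delta_2^2 / (2\rho)$ (Bun & Steinke, 2016; stated as Lemma 2.6 in this paper). The function names and the list-based interface are assumptions made for the example.

```python
import math
import random


def gaussian_sigma_zcdp(l2_sensitivity, rho):
    """Noise standard deviation with sigma^2 = Delta_2^2 / (2 * rho)."""
    return l2_sensitivity / math.sqrt(2.0 * rho)


def gaussian_mechanism(values, l2_sensitivity, rho, rng):
    """Release f(U) + N(0, sigma^2 I_d) for a d-dimensional statistic `values`.

    `rng` is a random.Random instance; the caller is responsible for supplying
    the correct l2 sensitivity of the statistic being released.
    """
    sigma = gaussian_sigma_zcdp(l2_sensitivity, rho)
    return [v + rng.gauss(0.0, sigma) for v in values]


# Usage: privatize a 3-dimensional statistic with l2 sensitivity 2 at rho = 0.5.
rng = random.Random(0)
noisy = gaussian_mechanism([10.0, 4.0, 7.0], l2_sensitivity=2.0, rho=0.5, rng=rng)
```

The guarantee depends only on the sensitivity bound, not on any distributional property of the data.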
Lemma 2.6 (Privacy guarantee of Gaussian mechanism (Dwork et al., 2014; Bun & Steinke, 2016)). Let $f : \mathbb{N}^{\mathcal{X}} \to \mathbb{R}^d$ be an arbitrary $d$-dimensional function with $\ell_2$ sensitivity $\Delta_2$. Then for any $\rho > 0$, the Gaussian mechanism with parameter $\sigma^2 = \frac{\Delta_2^2}{2\rho}$ satisfies $\rho$-zCDP. In addition, for all $0 < \delta, \epsilon < 1$, the Gaussian mechanism with parameter $\sigma = \frac{\Delta_2}{\epsilon} \sqrt{2 \log \frac{1.25}{\delta}}$ satisfies $(\epsilon, \delta)$-DP.
We emphasize that the privacy guarantee covers any input data. It does not require any distributional assumptions on the data. The RL-specific assumptions (e.g., linear MDP and coverage assumptions) are only used for establishing provable utility guarantees.

3. RESULTS UNDER TABULAR MDP: DP-APVI (ALGORITHM 1)

For reinforcement learning, the tabular MDP setting is the most well-studied setting and our first result applies to this regime. We begin with the construction of private counts.
Private Model-based Components. Given data $D = \{(s^\tau_h, a^\tau_h, r^\tau_h, s^\tau_{h+1})\}_{\tau \in [n]}^{h \in [H]}$, let $n_{s_h, a_h} := \sum_{\tau=1}^n \mathbb{1}[s^\tau_h, a^\tau_h = s_h, a_h]$ be the total count of visits to the pair $(s_h, a_h)$ at time $h$ and $n_{s_h, a_h, s_{h+1}} := \sum_{\tau=1}^n \mathbb{1}[s^\tau_h, a^\tau_h, s^\tau_{h+1} = s_h, a_h, s_{h+1}]$ be the total count of visits to the triple $(s_h, a_h, s_{h+1})$ at time $h$. Given the budget $\rho$ for zCDP, we add independent Gaussian noise to all the counts:
$$n'_{s_h, a_h} = \{n_{s_h, a_h} + \mathcal{N}(0, \sigma^2)\}^+, \quad n'_{s_h, a_h, s_{h+1}} = \{n_{s_h, a_h, s_{h+1}} + \mathcal{N}(0, \sigma^2)\}^+, \quad \sigma^2 = \frac{2H}{\rho}.$$
However, after adding noise, the noisy counts $n'$ may not satisfy $n'_{s_h, a_h} = \sum_{s_{h+1} \in \mathcal{S}} n'_{s_h, a_h, s_{h+1}}$. To address this problem, we choose the private counts as the solution to the following optimization problem (here $E_\rho = 4 \sqrt{\frac{H \log \frac{4HS^2A}{\delta}}{\rho}}$):
$$\{\tilde n_{s_h, a_h, s'}\}_{s' \in \mathcal{S}} = \operatorname{argmin}_{\{x_{s'}\}_{s' \in \mathcal{S}}} \max_{s' \in \mathcal{S}} \left| x_{s'} - n'_{s_h, a_h, s'} \right| \ \text{such that} \ \Big| \sum_{s' \in \mathcal{S}} x_{s'} - n'_{s_h, a_h} \Big| \le \frac{E_\rho}{2} \ \text{and} \ x_{s'} \ge 0, \ \forall s' \in \mathcal{S}; \qquad \tilde n_{s_h, a_h} = \sum_{s' \in \mathcal{S}} \tilde n_{s_h, a_h, s'}. \quad (3)$$
Remark 3.1. The optimization problem (3) can be reformulated as:
$$\min t, \ \text{s.t.} \ |x_{s'} - n'_{s_h, a_h, s'}| \le t \ \text{and} \ x_{s'} \ge 0 \ \forall s' \in \mathcal{S}, \quad \Big| \sum_{s' \in \mathcal{S}} x_{s'} - n'_{s_h, a_h} \Big| \le \frac{E_\rho}{2}. \quad (4)$$
Note that (4) is a Linear Programming problem with $S + 1$ variables and $2S + 2$ linear constraints (one constraint on an absolute value is equivalent to two linear constraints), which can be solved efficiently by the simplex method (Ficken, 2015) or other provably efficient algorithms (Nemhauser & Wolsey, 1988).
Remark 3.2. Different from the transition kernel estimates in previous works (Vietri et al., 2020; Chowdhury & Zhou, 2021), which may not be distributions, we have to ensure that ours is a probability distribution, because our Bernstein type pessimism (line 5 in Algorithm 1) needs to take the variance over this transition kernel estimate. The intuition behind the construction of our private transition kernel is that, for those state-action pairs with $\tilde n_{s_h, a_h} \le E_\rho$, we cannot distinguish whether the non-zero private count comes from noise or actual visitation. Therefore we only take the empirical estimate for the state-action pairs with sufficiently large $\tilde n_{s_h, a_h}$.
Algorithm 1 Differentially Private Adaptive Pessimistic Value Iteration (DP-APVI)
1: Input: Offline dataset $D = \{(s^\tau_h, a^\tau_h, r^\tau_h, s^\tau_{h+1})\}_{\tau, h = 1}^{n, H}$
6: $\hat Q^p_h(\cdot, \cdot) \leftarrow \tilde Q_h(\cdot, \cdot) - \Gamma_h(\cdot, \cdot)$.
7: $\bar Q_h(\cdot, \cdot) \leftarrow \min\{\hat Q^p_h(\cdot, \cdot), H - h + 1\}^+$.
8: $\forall s_h$, let $\hat\pi_h(\cdot|s_h) \leftarrow \operatorname{argmax}_{\pi_h} \langle \bar Q_h(s_h, \cdot), \pi_h(\cdot|s_h) \rangle$ and $\tilde V_h(s_h) \leftarrow \langle \bar Q_h(s_h, \cdot), \hat\pi_h(\cdot|s_h) \rangle$.
9: end for
10: Output: $\{\hat\pi_h\}$.
Algorithmic design. Our algorithmic design originates from the idea of pessimism, which takes a conservative view of the locations with high uncertainty and prefers the locations we have more confidence about. Based on the Bernstein type pessimism in APVI (Yin & Wang, 2021b), we design a similar pessimistic algorithm with private counts to ensure differential privacy. If we replace $\tilde n$ and $\tilde P$ with $n$ and $\hat P$ (footnote 6), then our DP-APVI (Algorithm 1) will degenerate to APVI.
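The construction of private counts can be sketched as below. Two caveats: the paper solves problem (4) as a linear program (e.g., by the simplex method), whereas this sketch swaps in a bisection on $t$ that exploits the box structure of the constraints, and the fallback to the uniform distribution $1/S$ when $\tilde n_{s_h, a_h} \le E_\rho$ follows the paper's description of the private transition kernel. All function names are hypothetical.

```python
import math
import random


def noisy_count(count, H, rho, rng):
    """Add N(0, 2H/rho) noise to one visitation count and clip at zero."""
    sigma = math.sqrt(2.0 * H / rho)
    return max(0.0, count + rng.gauss(0.0, sigma))


def project_counts(noisy_sas, noisy_sa, E_rho, iters=100):
    """Approximately solve problem (3) for one (s_h, a_h) pair.

    minimize max_s' |x_s' - noisy_sas[s']|
    subject to |sum(x) - noisy_sa| <= E_rho / 2 and x >= 0.
    Inputs are assumed nonnegative (already clipped noisy counts).
    """
    slack = E_rho / 2.0

    def box(t):
        lo = [max(0.0, c - t) for c in noisy_sas]
        hi = [c + t for c in noisy_sas]
        return lo, hi

    def feasible(t):
        lo, hi = box(t)
        return sum(lo) <= noisy_sa + slack and sum(hi) >= noisy_sa - slack

    # Feasibility is monotone in t, and this upper end is always feasible.
    t_lo = 0.0
    t_hi = max(max(noisy_sas), abs(noisy_sa - sum(noisy_sas)))
    if feasible(0.0):
        t_hi = 0.0
    for _ in range(iters):
        mid = 0.5 * (t_lo + t_hi)
        if feasible(mid):
            t_hi = mid
        else:
            t_lo = mid

    # Pick a point in the box whose sum lies in the allowed interval.
    lo, hi = box(t_hi)
    a = max(noisy_sa - slack, sum(lo))
    b = min(noisy_sa + slack, sum(hi))
    target = min(max(sum(noisy_sas), a), b)
    cap = sum(hi) - sum(lo)
    frac = (target - sum(lo)) / cap if cap > 0 else 0.0
    return [l + frac * (h - l) for l, h in zip(lo, hi)]


def private_transition(private_sas, E_rho):
    """Empirical kernel if the private count exceeds E_rho, else uniform."""
    total = sum(private_sas)
    S = len(private_sas)
    if total > E_rho:
        return [x / total for x in private_sas]
    return [1.0 / S] * S
```

Because the result is always a valid probability vector, a variance under the private kernel is well defined, which is what the Bernstein type pessimism needs.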
Compared to the pessimism defined in APVI, our pessimistic penalty has an additional term $\widetilde O\left(\frac{S H E_\rho}{\tilde n_{s_h, a_h}}\right)$, which accounts for the additional pessimism due to our application of private statistics. We state our main theorem about DP-APVI below; the proof sketch is deferred to Appendix C.1 and the detailed proof to Appendix C due to space limits.
Theorem 3.3. DP-APVI (Algorithm 1) satisfies $\rho$-zCDP. Furthermore, under Assumption 2.1, denote $\bar d_m := \min_{h \in [H]} \{d^\mu_h(s_h, a_h) : d^\mu_h(s_h, a_h) > 0\}$. For any $0 < \delta < 1$, there exists a constant $c_1 > 0$ such that when $n > c_1 \cdot \max\{H^2, E_\rho\} / \bar d_m \cdot \iota$ ($\iota = \log(HSA/\delta)$), with probability $1 - \delta$, the output policy $\hat\pi$ of DP-APVI satisfies ($\widetilde O$ hides constants and polylog terms, $E_\rho = 4 \sqrt{\frac{H \log \frac{4HS^2A}{\delta}}{\rho}}$)
$$0 \le v^\star - v^{\hat\pi} \le 4\sqrt{2} \sum_{h=1}^H \sum_{(s_h, a_h) \in \mathcal{C}_h} d^{\pi^\star}_h(s_h, a_h) \sqrt{\frac{\mathrm{Var}_{P_h(\cdot|s_h, a_h)}(V^\star_{h+1}(\cdot)) \cdot \iota}{n\, d^\mu_h(s_h, a_h)}} + \widetilde O\left(\frac{H^3 + S H^2 E_\rho}{n \cdot \bar d_m}\right).$$
Comparison to non-private counterpart APVI (Yin & Wang, 2021b). According to Theorem 4.1 in (Yin & Wang, 2021b), for large enough $n$, with high probability, the output $\hat\pi$ of APVI satisfies:
$$0 \le v^\star - v^{\hat\pi} \le \widetilde O\left(\sum_{h=1}^H \sum_{(s_h, a_h) \in \mathcal{C}_h} d^{\pi^\star}_h(s_h, a_h) \sqrt{\frac{\mathrm{Var}_{P_h(\cdot|s_h, a_h)}(V^\star_{h+1}(\cdot))}{n\, d^\mu_h(s_h, a_h)}}\right) + \widetilde O\left(\frac{H^3}{n \cdot \bar d_m}\right).$$
Compared to this bound, the additional sub-optimality in our Theorem 3.3 due to differential privacy is $\widetilde O\left(\frac{S H^2 E_\rho}{n \cdot \bar d_m}\right) = \widetilde O\left(\frac{S H^{5/2}}{n \cdot \bar d_m \sqrt{\rho}}\right) = \widetilde O\left(\frac{S H^{5/2}}{n \cdot \bar d_m \epsilon}\right)$ (footnote 7). In the most popular regime where the privacy budget $\rho$ or $\epsilon$ is a constant, the additional term due to differential privacy appears as a lower order term, hence becomes negligible as the sample size $n$ becomes large.
Comparison to Hoeffding type pessimism. We can simply revise our algorithm by using Hoeffding type pessimism, which replaces the pessimism in line 5 with $C_1 H \cdot \sqrt{\frac{\iota}{\tilde n_{s_h, a_h} - E_\rho}} + \frac{C_2 S H E_\rho \cdot \iota}{\tilde n_{s_h, a_h}}$.
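To get a feel for the size of the privacy cost, the sketch below evaluates $E_\rho$ and the term $S H^2 E_\rho / (n \bar d_m)$ with all hidden constants and logarithmic factors dropped. It is only an order-of-magnitude illustration under that assumption; the function names are hypothetical.

```python
import math


def e_rho(H, S, A, delta, rho):
    """E_rho = 4 * sqrt(H * log(4 H S^2 A / delta) / rho)."""
    return 4.0 * math.sqrt(H * math.log(4.0 * H * S ** 2 * A / delta) / rho)


def privacy_cost_term(H, S, A, delta, rho, n, d_m):
    """S H^2 E_rho / (n d_m), ignoring the constants hidden by O-tilde."""
    return S * H ** 2 * e_rho(H, S, A, delta, rho) / (n * d_m)


# The privacy term shrinks like 1/n, while the leading term of
# Theorem 3.3 shrinks like 1/sqrt(n).
for n in (10 ** 4, 10 ** 5, 10 ** 6):
    print(n, privacy_cost_term(H=10, S=5, A=4, delta=0.05, rho=1.0, n=n, d_m=0.1))
```

Quadrupling $\rho$ halves $E_\rho$, which matches the $1/\sqrt{\rho}$ dependence in the displayed bound.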
Then with a similar proof, we can arrive at a sub-optimality bound that holds with high probability:
$$0 \le v^\star - v^{\hat\pi} \le \widetilde O\left(H \cdot \sum_{h=1}^H \sum_{(s_h, a_h) \in \mathcal{C}_h} d^{\pi^\star}_h(s_h, a_h) \sqrt{\frac{1}{n\, d^\mu_h(s_h, a_h)}}\right) + \widetilde O\left(\frac{S H^2 E_\rho}{n \cdot \bar d_m}\right).$$
Compared to this Hoeffding type bound, the bound in our Theorem 3.3 is tighter because we express the dominant term by the system quantities instead of an explicit dependence on $H$ (and $\mathrm{Var}_{P_h(\cdot|s_h, a_h)}(V^\star_{h+1}(\cdot)) \le H^2$). In addition, we highlight that according to Theorem G.1 in (Yin & Wang, 2021b), our main term nearly matches the non-private minimax lower bound. For more detailed discussions about our main term and how it subsumes other optimal learning bounds, we refer readers to (Yin & Wang, 2021b).
Apply Laplace Mechanism to achieve pure DP. To achieve pure DP instead of $\rho$-zCDP, we can simply replace the Gaussian Mechanism with the Laplace Mechanism (defined in Definition E.19). Given the privacy budget $\epsilon$ for pure DP, since the $\ell_1$ sensitivity of $\{n_{s_h, a_h}\} \cup \{n_{s_h, a_h, s_{h+1}}\}$ is $\Delta_1 = 4H$, we can add independent Laplace noise $\mathrm{Lap}(\frac{4H}{\epsilon})$ to each count to achieve $\epsilon$-DP due to Lemma E.20. Then by using $E_\epsilon = \widetilde O\left(\frac{H}{\epsilon}\right)$ instead of $E_\rho$ and keeping everything else ((3), (5) and Algorithm 1) the same, we can reach a similar result to Theorem 3.3 with the same proof. The only difference is that here the additional learning bound is $\widetilde O\left(\frac{S H^3}{n \cdot \bar d_m \epsilon}\right)$, which still appears as a lower order term.
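The pure-DP variant only changes the noise distribution. Below is a minimal sketch of adding $\mathrm{Lap}(4H/\epsilon)$ noise to counts; since the Python standard library has no Laplace sampler, the sketch uses the fact that the difference of two i.i.d. exponential variables with mean $b$ is Laplace with scale $b$. The function names are hypothetical, and the clipping at zero mirrors the treatment of the Gaussian noisy counts.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as a difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def laplace_private_counts(counts, H, epsilon, rng):
    """Add Lap(4H / epsilon) noise to each count and clip at zero.

    The scale 4H / epsilon uses the l1 sensitivity Delta_1 = 4H of the
    full collection of visitation counts.
    """
    scale = 4.0 * H / epsilon
    return [max(0.0, c + laplace_noise(scale, rng)) for c in counts]


# Usage on three hypothetical counts with H = 2 and epsilon = 1.
rng = random.Random(0)
private = laplace_private_counts([10, 0, 3], H=2, epsilon=1.0, rng=rng)
```

The noisy counts would then go through the same projection step (3) with $E_\epsilon$ in place of $E_\rho$.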

4. RESULTS UNDER LINEAR MDP: DP-VAPVI (ALGORITHM 2)

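In large MDPs, to address the computational issues, the technique of function approximation is widely applied, and linear MDP is a concrete model to study linear function approximations. Our second result applies to the linear MDP setting. Generally speaking, function approximation reduces the dimensionality of private releases compared to the tabular MDPs. We begin with private counts.
Private Model-based Components. Given the two datasets $D$ and $D'$ (both from $\mu$) as in Algorithm 2, we can apply variance-aware pessimistic value iteration to learn a near-optimal policy as in VAPVI (Yin et al., 2022). To ensure differential privacy, we add independent Gaussian noise to the $5H$ statistics as in DP-VAPVI (Algorithm 2) below. Since there are $5H$ statistics, by the adaptive composition of zCDP (Lemma E.17), it suffices to keep each count $\rho_0$-zCDP, where $\rho_0 = \frac{\rho}{5H}$. In DP-VAPVI, we use $\phi_1, \phi_2, \phi_3, K_1, K_2$ (footnote 8) to denote the noise we add. For all $\phi_i$, we directly apply the Gaussian Mechanism. For $K_i$, in addition to the noise matrix $\frac{1}{\sqrt{2}}(Z + Z^\top)$, we also add $\frac{E}{2} I_d$ to ensure that all $K_i$ are positive definite with high probability (the detailed definitions of $E$ and $L$ can be found in Appendix A).
Algorithm 2 Differentially Private Variance-Aware Pessimistic Value Iteration (DP-VAPVI)
1: Input: Dataset $D = \{(s^\tau_h, a^\tau_h, r^\tau_h, s^\tau_{h+1})\}_{\tau, h = 1}^{K, H}$, $D' = \{(\bar s^\tau_h, \bar a^\tau_h, \bar r^\tau_h, \bar s^\tau_{h+1})\}_{\tau, h = 1}^{K, H}$. Budget for zCDP $\rho$. Failure probability $\delta$. Universal constant $C$.
2: Initialization: Set $\rho_0 \leftarrow \frac{\rho}{5H}$, $\tilde V_{H+1}(\cdot) \leftarrow 0$. Sample $\phi_1 \sim \mathcal{N}\left(0, \frac{2H^4}{\rho_0} I_d\right)$, $\phi_2, \phi_3 \sim \mathcal{N}\left(0, \frac{2H^2}{\rho_0} I_d\right)$, $K_1, K_2 \leftarrow \frac{E}{2} I_d + \frac{1}{\sqrt{2}}(Z + Z^\top)$, where $Z_{i,j} \sim \mathcal{N}\left(0, \frac{1}{4\rho_0}\right)$ (i.i.d.), $E = \widetilde O\left(\sqrt{\frac{Hd}{\rho}}\right)$. Set $D \leftarrow \widetilde O\left(\frac{H^2 L}{\kappa} + \frac{H^4 E \sqrt{d}}{\kappa^{3/2}} + H^3 \sqrt{d}\right)$.
3: for $h = H, H-1, \dots, 1$ do
4: Set $\tilde\Sigma_h \leftarrow \sum_{\tau=1}^K \phi(\bar s^\tau_h, \bar a^\tau_h) \phi(\bar s^\tau_h, \bar a^\tau_h)^\top + \lambda I + K_1$
5: Set $\tilde\beta_h \leftarrow \tilde\Sigma_h^{-1} \left[\sum_{\tau=1}^K \phi(\bar s^\tau_h, \bar a^\tau_h) \cdot \tilde V_{h+1}(\bar s^\tau_{h+1})^2 + \phi_1\right]$
6: Set $\tilde\theta_h \leftarrow \tilde\Sigma_h^{-1} \left[\sum_{\tau=1}^K \phi(\bar s^\tau_h, \bar a^\tau_h) \cdot \tilde V_{h+1}(\bar s^\tau_{h+1}) + \phi_2\right]$
7: Set $[\widetilde{\mathrm{Var}}_h \tilde V_{h+1}](\cdot, \cdot) \leftarrow \langle \phi(\cdot, \cdot), \tilde\beta_h \rangle_{[0, (H-h+1)^2]} - \left[\langle \phi(\cdot, \cdot), \tilde\theta_h \rangle_{[0, H-h+1]}\right]^2$
8: Set $\tilde\sigma_h(\cdot, \cdot)^2 \leftarrow \max\{1, \widetilde{\mathrm{Var}}_h \tilde V_{h+1}(\cdot, \cdot)\}$
9: Set $\tilde\Lambda_h \leftarrow \sum_{\tau=1}^K \phi(s^\tau_h, a^\tau_h) \phi(s^\tau_h, a^\tau_h)^\top / \tilde\sigma_h^2(s^\tau_h, a^\tau_h) + \lambda I + K_2$
10: Set $\tilde w_h \leftarrow \tilde\Lambda_h^{-1} \left(\sum_{\tau=1}^K \phi(s^\tau_h, a^\tau_h) \cdot \left(r^\tau_h + \tilde V_{h+1}(s^\tau_{h+1})\right) / \tilde\sigma_h^2(s^\tau_h, a^\tau_h) + \phi_3\right)$
11: Set $\Gamma_h(\cdot, \cdot) \leftarrow C \sqrt{d} \cdot \left(\phi(\cdot, \cdot)^\top \tilde\Lambda_h^{-1} \phi(\cdot, \cdot)\right)^{1/2} + \frac{D}{K}$
12: Set $\bar Q_h(\cdot, \cdot) \leftarrow \phi(\cdot, \cdot)^\top \tilde w_h - \Gamma_h(\cdot, \cdot)$
13: Set $\hat Q_h(\cdot, \cdot) \leftarrow \min\{\bar Q_h(\cdot, \cdot), H - h + 1\}^+$
14: Set $\hat\pi_h(\cdot \mid \cdot) \leftarrow \operatorname{argmax}_{\pi_h} \langle \hat Q_h(\cdot, \cdot), \pi_h(\cdot \mid \cdot) \rangle_{\mathcal{A}}$, $\tilde V_h(\cdot) \leftarrow \max_{\pi_h} \langle \hat Q_h(\cdot, \cdot), \pi_h(\cdot \mid \cdot) \rangle_{\mathcal{A}}$
15: end for
16: Output: $\{\hat\pi_h\}_{h=1}^H$.
Below we describe the algorithmic design of DP-VAPVI (Algorithm 2). We divide the offline dataset into two independent parts of equal length: $D = \{(s^\tau_h, a^\tau_h, r^\tau_h, s^\tau_{h+1})\}_{\tau \in [K]}^{h \in [H]}$ and $D' = \{(\bar s^\tau_h, \bar a^\tau_h, \bar r^\tau_h, \bar s^\tau_{h+1})\}_{\tau \in [K]}^{h \in [H]}$, one for estimating the variance and the other for calculating Q-values.
Estimating conditional variance. The first part (line 4 to line 8) aims to estimate the conditional variance of $\tilde V_{h+1}$ via the definition of variance: $[\mathrm{Var}_h \tilde V_{h+1}](s, a) = [P_h (\tilde V_{h+1})^2](s, a) - ([P_h \tilde V_{h+1}](s, a))^2$. For the first term, by the definition of linear MDP, it holds that $[P_h \tilde V^2_{h+1}](s, a) = \phi(s, a)^\top \int_{\mathcal{S}} \tilde V^2_{h+1}(s')\, d\nu_h(s') = \langle \phi, \int_{\mathcal{S}} \tilde V^2_{h+1}(s')\, d\nu_h(s') \rangle$. We can estimate $\beta_h = \int_{\mathcal{S}} \tilde V^2_{h+1}(s')\, d\nu_h(s')$ by applying ridge regression. Below is the output of ridge regression with raw statistics without noise:
$$\operatorname{argmin}_{\beta \in \mathbb{R}^d} \sum_{k=1}^K \left[\langle \phi(\bar s^k_h, \bar a^k_h), \beta \rangle - \tilde V^2_{h+1}(\bar s^k_{h+1})\right]^2 + \lambda \|\beta\|_2^2 = \bar\Sigma_h^{-1} \sum_{k=1}^K \phi(\bar s^k_h, \bar a^k_h) \tilde V^2_{h+1}(\bar s^k_{h+1}),$$
where the definition of $\bar\Sigma_h$ can be found in Appendix A.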
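The initialization step of Algorithm 2 (budget splitting and noise sampling) can be sketched as follows. The exact value of $E$ is defined in Appendix A and is not reproduced here, so the diagonal shift is passed in as a caller-supplied value; this parameter and the function names are assumptions of the sketch. As noted in the paper's footnote on the $5H$ statistics, fresh noise is drawn for every step $h$ in the actual algorithm.

```python
import math
import random


def sample_vapvi_noise(d, H, rho, rng):
    """Split the zCDP budget over 5H statistics and draw the vector noises."""
    rho0 = rho / (5.0 * H)

    def gauss_vec(variance):
        std = math.sqrt(variance)
        return [rng.gauss(0.0, std) for _ in range(d)]

    phi1 = gauss_vec(2.0 * H ** 4 / rho0)  # noise for sum of phi * V^2
    phi2 = gauss_vec(2.0 * H ** 2 / rho0)  # noise for sum of phi * V
    phi3 = gauss_vec(2.0 * H ** 2 / rho0)  # noise for the weighted regression target
    return rho0, phi1, phi2, phi3


def symmetric_noise_matrix(d, rho0, shift, rng):
    """K = shift * I_d + (Z + Z^T) / sqrt(2), with Z_ij ~ N(0, 1 / (4 rho0)).

    `shift` stands in for E / 2 (E is defined in the paper's Appendix A).
    """
    std = math.sqrt(1.0 / (4.0 * rho0))
    Z = [[rng.gauss(0.0, std) for _ in range(d)] for _ in range(d)]
    return [
        [(Z[i][j] + Z[j][i]) / math.sqrt(2.0) + (shift if i == j else 0.0)
         for j in range(d)]
        for i in range(d)
    ]
```

Symmetrizing $Z$ keeps the perturbed Gram matrices symmetric, and the diagonal shift is what makes them positive definite with high probability.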
Instead of using the raw statistics, we replace them with private ones with Gaussian noise as in line 5. The second term is estimated similarly in line 6. The final estimator is defined as in line 8: $\tilde\sigma_h(\cdot, \cdot)^2 = \max\{1, \widetilde{\mathrm{Var}}_h \tilde V_{h+1}(\cdot, \cdot)\}$ (footnote 9).
Variance-weighted LSVI. Instead of directly applying LSVI (Jin et al., 2021), we can solve the variance-weighted LSVI (line 10). The result of variance-weighted LSVI with non-private statistics is shown below:
$$\operatorname{argmin}_{w \in \mathbb{R}^d} \lambda \|w\|_2^2 + \sum_{k=1}^K \frac{\left[\langle \phi(s^k_h, a^k_h), w \rangle - r^k_h - \tilde V_{h+1}(s^k_{h+1})\right]^2}{\tilde\sigma_h^2(s^k_h, a^k_h)} = \hat\Lambda_h^{-1} \sum_{k=1}^K \frac{\phi(s^k_h, a^k_h) \cdot \left[r^k_h + \tilde V_{h+1}(s^k_{h+1})\right]}{\tilde\sigma_h^2(s^k_h, a^k_h)},$$
where the definition of $\hat\Lambda_h$ can be found in Appendix A. For the sake of differential privacy, we use private statistics instead and derive $\tilde w_h$ as in line 10.
Our private pessimism. Notice that if we remove all the Gaussian noise we add, our DP-VAPVI (Algorithm 2) will degenerate to VAPVI (Yin et al., 2022). We design a similar pessimistic penalty using private statistics (line 11), with an additional $\frac{D}{K}$ accounting for the extra pessimism due to DP.
Main theorem. We state our main theorem about DP-VAPVI below; the proof sketch is deferred to Appendix D.1 and the detailed proof to Appendix D due to space limits. Note that the quantities $M_i$, $L$, $E$ can be found in Appendix A and, briefly, $L = \widetilde O(\sqrt{H^3 d / \rho})$, $E = \widetilde O(\sqrt{H d / \rho})$.
Footnote 8: We need to add noise to each of the $5H$ counts, therefore for $\phi_1$, we actually sample $H$ i.i.d. samples $\phi_{1,h}$, $h = 1, \dots, H$ from the distribution of $\phi_1$. Then we add $\phi_{1,h}$ to $\sum_{\tau=1}^K \phi(\bar s^\tau_h, \bar a^\tau_h) \cdot \tilde V_{h+1}(\bar s^\tau_{h+1})^2$, $\forall h \in [H]$. For simplicity, we use $\phi_1$ to represent all the $\phi_{1,h}$. The procedure applied to the other $4H$ statistics is similar.
Footnote 9: The $\max\{1, \cdot\}$ operator here is for technical reasons only: we want a lower bound for each variance estimate.
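Lines 7 and 8 of Algorithm 2 combine two clipped linear predictions into a variance estimate that is floored at 1. A minimal sketch of that computation for a single feature vector is given below; `phi`, `beta_tilde` and `theta_tilde` are plain Python lists standing in for $\phi(s, a)$, $\tilde\beta_h$ and $\tilde\theta_h$, and the function names are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(x * y for x, y in zip(u, v))


def clip(x, lo, hi):
    """Project x onto the interval [lo, hi]."""
    return min(max(x, lo), hi)


def private_variance_estimate(phi, beta_tilde, theta_tilde, H, h):
    """sigma_tilde_h(s, a)^2 = max{1, Var_tilde_h V_tilde_{h+1}(s, a)}.

    The second-moment prediction is clipped to [0, (H - h + 1)^2] and the
    first-moment prediction to [0, H - h + 1] before they are combined.
    """
    second_moment = clip(dot(phi, beta_tilde), 0.0, float((H - h + 1) ** 2))
    first_moment = clip(dot(phi, theta_tilde), 0.0, float(H - h + 1))
    variance = second_moment - first_moment ** 2
    return max(1.0, variance)
```

The floor at 1 keeps the weights $1 / \tilde\sigma_h^2$ in the weighted regression bounded even when the noisy estimate is small or negative.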
For the sample complexity lower bound, within the practical regime where the privacy budget is not very small, $\max\{M_i\}$ is dominated by $\max\{\widetilde O(H^{12} d^3 / \kappa^5), \widetilde O(H^{14} d / \kappa^5)\}$, which also appears in the sample complexity lower bound of VAPVI (Yin et al., 2022). The $\sigma^2_V(s, a)$ in Theorem 4.1 is defined as $\max\{1, \mathrm{Var}_{P_h}(V)(s, a)\}$ for any $V$.
Theorem 4.1. DP-VAPVI (Algorithm 2) satisfies $\rho$-zCDP. Furthermore, let $K$ be the number of episodes. Under the condition that $K > \max\{M_1, M_2, M_3, M_4\}$ and $\sqrt{d} > \xi$, where $\xi := \sup_{V \in [0, H],\, s' \sim P_h(s, a),\, h \in [H]} \left| \frac{r_h + V(s') - (\mathcal{T}_h V)(s, a)}{\sigma_V(s, a)} \right|$, for any $0 < \lambda < \kappa$, with probability $1 - \delta$, for all policies $\pi$ simultaneously, the output $\hat\pi$ of DP-VAPVI satisfies ($\widetilde O$ hides constants and polylog terms)
$$v^\pi - v^{\hat\pi} \le \widetilde O\left(\sqrt{d} \cdot \sum_{h=1}^H \mathbb{E}_\pi\left[\sqrt{\phi(\cdot, \cdot)^\top \Lambda_h^{-1} \phi(\cdot, \cdot)}\right]\right) + \frac{DH}{K},$$
where $\Lambda_h = \sum_{k=1}^K \frac{\phi(s^k_h, a^k_h) \cdot \phi(s^k_h, a^k_h)^\top}{\sigma^2_{\tilde V_{h+1}}(s^k_h, a^k_h)} + \lambda I_d$ and $D = \widetilde O\left(\frac{H^2 L}{\kappa} + \frac{H^4 E \sqrt{d}}{\kappa^{3/2}} + H^3 \sqrt{d}\right)$. In particular, defining $\Lambda^\star_h = \sum_{k=1}^K \frac{\phi(s^k_h, a^k_h) \cdot \phi(s^k_h, a^k_h)^\top}{\sigma^2_{V^\star_{h+1}}(s^k_h, a^k_h)} + \lambda I_d$, we have with probability $1 - \delta$,
$$v^\star - v^{\hat\pi} \le \widetilde O\left(\sqrt{d} \cdot \sum_{h=1}^H \mathbb{E}_{\pi^\star}\left[\sqrt{\phi(\cdot, \cdot)^\top \Lambda^{\star -1}_h \phi(\cdot, \cdot)}\right]\right) + \frac{DH}{K}.$$
Comparison to non-private counterpart VAPVI (Yin et al., 2022). Plugging in the definitions of $L$ and $E$ (Appendix A), under the meaningful case that the privacy budget is not very large, $DH$ is dominated by $\widetilde O\left(\frac{H^{11/2} d / \kappa^{3/2}}{\sqrt{\rho}}\right)$. According to Theorem 3.2 in (Yin et al., 2022), for sufficiently large $K$, with high probability, the output $\hat\pi$ of VAPVI satisfies:
$$v^\star - v^{\hat\pi} \le \widetilde O\left(\sqrt{d} \cdot \sum_{h=1}^H \mathbb{E}_{\pi^\star}\left[\sqrt{\phi(\cdot, \cdot)^\top \Lambda^{\star -1}_h \phi(\cdot, \cdot)}\right]\right) + \frac{2H^4 \sqrt{d}}{K}.$$
Compared to this bound, the additional sub-optimality in our Theorem 4.1 due to differential privacy is $\widetilde O\left(\frac{H^{11/2} d / \kappa^{3/2}}{\sqrt{\rho} \cdot K}\right) = \widetilde O\left(\frac{H^{11/2} d / \kappa^{3/2}}{\epsilon \cdot K}\right)$ (footnote 10). In the most popular regime where the privacy budget $\rho$ or $\epsilon$ is a constant, the additional term due to differential privacy also appears as a lower order term.
Instance-dependent sub-optimality bound. Similar to DP-APVI (Algorithm 1), our DP-VAPVI (Algorithm 2) also enjoys an instance-dependent sub-optimality bound. First, the main term in (10) improves over PEVI (Jin et al., 2021) by a factor of $O(\sqrt{d})$ in feature dependence. Also, our main term admits no explicit dependence on $H$, and thus improves the sub-optimality bound of PEVI in horizon dependence. For more detailed discussions about our main term, we refer readers to (Yin et al., 2022).

5. SIMULATIONS

In this section, we carry out simulations to evaluate the performance of our DP-VAPVI (Algorithm 2), and compare it with its non-private counterpart VAPVI (Yin et al., 2022) and another pessimism-based algorithm PEVI (Jin et al., 2021), which does not have a privacy guarantee.
Experimental setting. We evaluate DP-VAPVI (Algorithm 2) on a synthetic linear MDP example that originates from the linear MDP in (Min et al., 2021; Yin et al., 2022) but with some modifications (footnote 11). For details of the linear MDP setting, please refer to Appendix F. The two MDP instances we use both have horizon $H = 20$. We compare different algorithms in Figure 1(a), while in Figure 1(b), we compare our DP-VAPVI with different privacy budgets. When doing empirical evaluation, we do not split the data for DP-VAPVI or VAPVI, and for DP-VAPVI we run the simulation 5 times and take the average performance.

Results and discussions.

From Figure 1, we can observe that DP-VAPVI (Algorithm 2) performs slightly worse than its non-private version VAPVI (Yin et al., 2022). This is due to the fact that we add Gaussian noise to each count. However, as the size of the dataset grows, the performance of DP-VAPVI converges to that of VAPVI, which supports our theoretical conclusion that the cost of privacy only appears as lower order terms. For DP-VAPVI with a larger privacy budget, the scale of the noise is smaller, thus the performance is closer to VAPVI, as shown in Figure 1(b). Furthermore, in most cases, DP-VAPVI still outperforms PEVI, which does not have a privacy guarantee. This arises from our privatization of variance-aware LSVI instead of LSVI.

6. CONCLUSION AND FUTURE WORKS

In this work, we take the first steps towards the well-motivated task of designing private offline RL algorithms. We propose algorithms for both tabular MDPs and linear MDPs, and show that they enjoy instance-dependent sub-optimality bounds while guaranteeing differential privacy (either zCDP or pure DP). Our results highlight that the cost of privacy only appears as lower order terms, and thus becomes negligible as the number of samples grows. Future extensions are numerous. We believe the techniques in our algorithms (privatization of Bernstein-type pessimism and variance-aware LSVI) and the corresponding analysis can also be used in online settings to obtain tighter regret bounds for private algorithms. For offline RL problems, we plan to consider more general function approximations and differentially private (deep) offline RL, which will bridge the gap between theory and practice in offline RL applications. Many techniques we developed could be adapted to these more general settings.



Footnote 1: The environment is usually characterized by a Markov Decision Process (MDP) in this paper.
Footnote 2: Here we only compare our techniques (for offline RL) with the works for online RL under joint DP guarantee, as both settings allow access to the raw data.
Footnote 3: This is due to the fact that the uncertainty of the reward function is dominated by that of the transition kernel in RL.
Footnote 4: For clarity we use $n$ for tabular MDP and $K$ for linear MDP when referring to the sample complexity.
Footnote 5: This conclusion is summarized in Lemma C.3.
Footnote 6: The non-private empirical estimate, defined as (15) in Appendix C.
Footnote 7: Here we apply the second part of Lemma 2.6 to achieve $(\epsilon, \delta)$-DP; the notation $\widetilde O$ also absorbs $\log \frac{1}{\delta}$ (only here $\delta$ denotes the privacy budget instead of the failure probability).
Footnote 10: Here we apply the second part of Lemma 2.6 to achieve $(\epsilon, \delta)$-DP; the notation $\widetilde O$ also absorbs $\log \frac{1}{\delta}$ (only here $\delta$ denotes the privacy budget instead of the failure probability).
Footnote 11: We keep the state space $\mathcal{S} = \{1, 2\}$, action space $\mathcal{A} = \{1, \dots, 100\}$ and feature map of state-action pairs, while we choose a stochastic transition (instead of the original deterministic transition) and a more complex reward.



Figure 1: Comparison between performance of PEVI, VAPVI and DP-VAPVI (with different privacy budgets) under the linear MDP example described above. Panels: (a) Compare different algorithms, $H = 20$; (b) Different privacy budgets, $H = 20$. In each figure, the y-axis represents the sub-optimality gap $v^\star - v^{\hat\pi}$ while the x-axis denotes the number of episodes $K$. The horizons are fixed to be $H = 20$. The number of episodes takes value from 5 to 1000.

$P_h : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \mapsto [0, 1]$ maps each state-action pair $(s_h, a_h)$ to a probability distribution $P_h(\cdot|s_h, a_h)$, and $P_h$ can be different across time. Besides, $r_h : \mathcal{S} \times \mathcal{A} \mapsto \mathbb{R}$ is the expected immediate reward satisfying $0 \le r_h \le 1$, $d_1$ is the initial state distribution and $H$ is the horizon. A policy $\pi = (\pi_1, \dots, \pi_H)$ assigns each state $s_h \in \mathcal{S}$ a probability distribution over actions according to the map $s_h \mapsto \pi_h(\cdot|s_h)$.

In addition, if we do not solve this optimization problem and directly take $\tilde n_{s_h, a_h, s_{h+1}} = n'_{s_h, a_h, s_{h+1}}$ and $\tilde n_{s_h, a_h} = \sum_{s_{h+1} \in \mathcal{S}} \tilde n_{s_h, a_h, s_{h+1}}$, we can only derive $|\tilde n_{s_h, a_h} - n_{s_h, a_h}| \le \widetilde O(\sqrt{S} E_\rho)$ through concentration on the summation of $S$ i.i.d. Gaussian noises. In contrast, solving (3) ensures that $|\tilde n_{s_h, a_h} - n_{s_h, a_h}| \le E_\rho$ with high probability (footnote 5). The private estimation of the transition kernel is defined as:
$$\tilde P_h(s'|s_h, a_h) = \frac{\tilde n_{s_h, a_h, s'}}{\tilde n_{s_h, a_h}} \ \text{if} \ \tilde n_{s_h, a_h} > E_\rho, \quad \text{and} \quad \tilde P_h(s'|s_h, a_h) = \frac{1}{S} \ \text{otherwise.} \quad (5)$$
Remark 3.2. Different from the transition kernel estimate in previous works

