OFFLINE REINFORCEMENT LEARNING WITH DIFFERENTIAL PRIVACY

Abstract

The offline reinforcement learning (RL) problem is often motivated by the need to learn data-driven decision policies in financial, legal and healthcare applications. However, the learned policy could retain sensitive information of individuals in the training data (e.g., treatment and outcome of patients), and is thus susceptible to various privacy risks. We design offline RL algorithms with differential privacy guarantees which provably prevent such risks. These algorithms also enjoy strong instance-dependent learning bounds under both tabular and linear Markov Decision Process (MDP) settings. Our theory and simulation suggest that the privacy guarantee comes at (almost) no drop in utility compared to the non-private counterpart for a medium-sized dataset.

1. INTRODUCTION

Offline Reinforcement Learning (or batch RL) aims to learn a near-optimal policy in an unknown environment through a static dataset gathered from some behavior policy µ. Since offline RL does not require access to the environment, it can be applied to problems where interaction with the environment is infeasible, e.g., when collecting new data is costly (trade or finance (Zhang et al., 2020)), risky (autonomous driving (Sallab et al., 2017)) or illegal / unethical (healthcare (Raghu et al., 2017)). In such practical applications, the data used by an RL agent usually contains sensitive information. Take medical history for instance: for each patient, at each time step, the patient reports her health condition (age, disease, etc.), then the doctor decides the treatment (which medicine to use, the dosage of medicine, etc.), and finally there is a treatment outcome (whether the patient feels better, etc.) and the patient transitions to another health condition. Here, (health condition, treatment, treatment outcome) corresponds to (state, action, reward), and the dataset can be considered as $n$ (number of patients) trajectories sampled from an MDP with horizon $H$ (number of treatment steps). However, learning agents are known to implicitly memorize details of individual training data points verbatim (Carlini et al., 2019), even if they are irrelevant for learning (Brown et al., 2021), which makes offline RL models vulnerable to various privacy attacks. Differential privacy (DP) (Dwork et al., 2006) is a well-established definition of privacy with many desirable properties. A differentially private offline RL algorithm will return a decision policy that is indistinguishable from a policy trained in an alternative universe where any individual user is replaced, thereby preventing the aforementioned privacy risks.
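For concreteness, the standard $(\varepsilon, \delta)$-DP definition of Dwork et al. (2006), stated here with one user's trajectory as the unit of privacy, reads: a randomized algorithm $M$ is $(\varepsilon, \delta)$-differentially private if for any two neighboring datasets $D, D'$ differing in one trajectory and any measurable event $E$,

```latex
\mathbb{P}\left[ M(D) \in E \right] \;\le\; e^{\varepsilon}\, \mathbb{P}\left[ M(D') \in E \right] + \delta .
```

Pure DP corresponds to $\delta = 0$; smaller $\varepsilon$ means the output distributions under $D$ and $D'$ are harder to distinguish, which formalizes the "alternative universe" indistinguishability described above.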
There is a surge of recent interest in developing RL algorithms with DP guarantees, but they focus mostly on the online setting (Vietri et al., 2020; Garcelon et al., 2021; Liao et al., 2021; Chowdhury & Zhou, 2021; Luyo et al., 2021). Offline RL is arguably more practically relevant than online RL in applications with sensitive data. For example, in the healthcare domain, online RL requires actively running new exploratory policies (clinical trials) with every new patient, which often involves complex ethical / legal clearances, whereas offline RL uses only historical patient records that are often accessible for research purposes. Clear communication of the adopted privacy-enhancing techniques (e.g., DP) to patients was reported to further improve data access (Kim et al., 2017).

Our contributions. In this paper, we present the first provably efficient algorithms for offline RL with differential privacy. Our contributions are twofold.

• We design two new pessimism-based algorithms, DP-APVI (Algorithm 1) and DP-VAPVI (Algorithm 2), one for the tabular setting (finite states and actions), the other for the case with linear function approximation (under the linear MDP assumption). Both algorithms enjoy DP guarantees (pure DP or zCDP) and instance-dependent learning bounds where the cost of privacy appears as lower order terms.

For tabular MDPs, $\mathcal{S} \times \mathcal{A}$ is the discrete state-action space and $S := |\mathcal{S}|$, $A := |\mathcal{A}|$ are finite. In this work, we assume that the reward $r$ is known. In addition, we denote the per-step marginal state-action occupancy $d^{\pi}_h(s, a)$ as
$$d^{\pi}_h(s, a) := \mathbb{P}\left[s_h = s \mid s_1 \sim d_1, \pi\right] \cdot \pi_h(a \mid s),$$
which is the marginal state-action probability at time $h$.
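The occupancy measure above can be computed exactly by a forward recursion over the horizon: push the state distribution through $P_h$ one step at a time and multiply by $\pi_h(a \mid s)$. The sketch below is illustrative only; the function name `occupancy_measures` and its array layout are our own choices, not part of the paper's algorithms.

```python
import numpy as np

def occupancy_measures(P, pi, d1, H):
    """Forward recursion for the per-step state-action occupancy
    d^pi_h(s, a) = P[s_h = s | s_1 ~ d_1, pi] * pi_h(a | s).

    P  : array (H, S, A, S), P[h, s, a, s'] = transition probability P_h(s' | s, a)
    pi : array (H, S, A),    pi[h, s, a]   = action probability pi_h(a | s)
    d1 : array (S,),         initial state distribution
    """
    d_state = d1.copy()                    # marginal P[s_h = s], starts at h = 1
    occ = np.zeros_like(pi)                # occ[h, s, a] = d^pi_h(s, a)
    for h in range(H):
        occ[h] = d_state[:, None] * pi[h]  # multiply by pi_h(a | s)
        # push the state distribution one step forward through P_h
        d_state = np.einsum("sa,sap->p", occ[h], P[h])
    return occ
```

Since each $\pi_h(\cdot \mid s)$ and each row of $P_h$ sums to one, the returned occupancies sum to one at every step $h$, matching the interpretation as a probability distribution over state-action pairs.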

Footnotes. (1) The environment is usually characterized by a Markov Decision Process (MDP) in this paper. (2) Here we only compare our techniques (for offline RL) with the works for online RL under the joint DP guarantee, as both settings allow access to the raw data. (3) We assume the reward $r$ is known because the uncertainty of the reward function is dominated by that of the transition kernel in RL.

• We perform numerical simulations to evaluate and compare the performance of our algorithm DP-VAPVI (Algorithm 2) with its non-private counterpart VAPVI (Yin et al., 2022) as well as a popular baseline PEVI (Jin et al., 2021). The results complement the theoretical findings by demonstrating the practicality of DP-VAPVI under strong privacy parameters.

Related work. To our knowledge, differential privacy in offline RL tasks has not been studied before, except for much simpler cases where the agent only evaluates a single policy (Balle et al., 2016; Xie et al., 2019). Balle et al. (2016) privatized the first-visit Monte Carlo-Ridge Regression estimator by an output perturbation mechanism and Xie et al. (2019) used DP-SGD. Neither paper considered offline learning (or policy optimization), which is our focus. There is a larger body of work on private RL in the online setting, where the goal is to minimize regret while satisfying either joint differential privacy (Vietri et al., 2020; Chowdhury & Zhou, 2021; Ngo et al., 2022; Luyo et al., 2021) or local differential privacy (Garcelon et al., 2021; Liao et al., 2021; Luyo et al., 2021; Chowdhury & Zhou, 2021). The offline setting introduces new challenges in DP as we cannot algorithmically enforce good "exploration", but have to work with a static dataset and privately estimate the uncertainty in addition to the value functions. A private online RL algorithm can sometimes be adapted for private offline RL too, but those from existing work yield suboptimal and non-adaptive bounds. We give a more detailed technical comparison in Appendix B. Among non-private offline RL works, we build directly upon methods known as Adaptive Pessimistic Value Iteration (APVI, for tabular MDPs) (Yin & Wang, 2021b) and Variance-Aware Pessimistic Value Iteration (VAPVI, for linear MDPs) (Yin et al., 2022), as they give the strongest theoretical guarantees to date.
We refer readers to Appendix B for a more extensive review of the offline RL literature. Introducing DP to APVI and VAPVI while retaining the same sample complexity (modulo lower order terms) requires nontrivial modifications to the algorithms.

Markov Decision Process. A finite-horizon Markov Decision Process (MDP) is denoted by a tuple $M = (\mathcal{S}, \mathcal{A}, P, r, H, d_1)$ (Sutton & Barto, 2018), where $\mathcal{S}$ is the state space and $\mathcal{A}$ is the action space. A non-stationary transition kernel $P_h : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0, 1]$ maps each state-action pair $(s_h, a_h)$ to a probability distribution $P_h(\cdot \mid s_h, a_h)$, and $P_h$ can be different across time. Besides, $r_h : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is the expected immediate reward satisfying $0 \le r_h \le 1$, $d_1$ is the initial state distribution and $H$ is the horizon. A policy $\pi = (\pi_1, \cdots, \pi_H)$ assigns each state $s_h \in \mathcal{S}$ a probability distribution over actions according to the map $s_h \mapsto \pi_h(\cdot \mid s_h)$, $\forall h \in [H]$. A random trajectory $s_1, a_1, r_1, \cdots, s_H, a_H, r_H, s_{H+1}$ is generated according to $s_1 \sim d_1$, $a_h \sim \pi_h(\cdot \mid s_h)$, $r_h \sim r_h(s_h, a_h)$, $s_{h+1} \sim P_h(\cdot \mid s_h, a_h)$, $\forall h \in [H]$.
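The data-generating process above is straightforward to mirror in code. The sketch below is a hypothetical helper of our own (not from the paper), and for simplicity it records the mean reward $r_h(s_h, a_h)$ rather than sampling reward noise:

```python
import numpy as np

def sample_trajectory(P, r, pi, d1, H, rng):
    """Roll out one trajectory s_1, a_1, r_1, ..., s_H, a_H, r_H, s_{H+1}.

    P  : array (H, S, A, S), non-stationary transition kernel P_h
    r  : array (H, S, A),    expected immediate reward r_h in [0, 1]
    pi : array (H, S, A),    policy pi_h(a | s)
    d1 : array (S,),         initial state distribution
    """
    S, A = d1.shape[0], P.shape[2]
    traj = []
    s = rng.choice(S, p=d1)                   # s_1 ~ d_1
    for h in range(H):
        a = rng.choice(A, p=pi[h, s])         # a_h ~ pi_h(. | s_h)
        rew = r[h, s, a]                      # mean reward (no reward noise here)
        s_next = rng.choice(S, p=P[h, s, a])  # s_{h+1} ~ P_h(. | s_h, a_h)
        traj.append((s, a, rew))
        s = s_next
    return traj, s                            # s is the final state s_{H+1}
```

An offline dataset in the sense of this paper is then $n$ independent such rollouts under the behavior policy µ.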

