VALUE-BASED MEMBERSHIP INFERENCE ATTACK ON ACTOR-CRITIC REINFORCEMENT LEARNING

Abstract

In actor-critic reinforcement learning (RL), the actor computes candidate policies and the critic learns a value function that evaluates them. Such RL algorithms may be vulnerable to membership inference attacks (MIAs), privacy attacks that infer data membership, i.e., whether a specific data record belongs to the training dataset. We investigate the vulnerability of the value function in actor-critic algorithms to MIAs. We develop CriticAttack, a new MIA that targets black-box RL agents by examining the correlation between the expected reward and the value function. We empirically show that CriticAttack correctly infers approximately 90% of the training data membership, i.e., it achieves 90% attack accuracy. This accuracy is far above the 50% accuracy of random guessing, indicating a severe privacy vulnerability of the value function. To defend against CriticAttack, we design a method called CriticDefense that adds uniform noise to the value function. CriticDefense can reduce the attack accuracy to 60% without significantly affecting the agent's performance.

1. INTRODUCTION

Membership inference attacks (MIAs) expose privacy vulnerabilities in reinforcement learning (RL) algorithms (Gomrokchi et al., 2020). Such attacks can infer whether a particular environment was used in training by observing the outcomes of an RL algorithm. For example, Pan et al. (2019), Wang et al. (2019), and Chen et al. (2021) show that MIAs can infer users' vehicle routes or room layouts.

Most, if not all, existing MIA methods suffer from high computational complexity or make unrealistic assumptions. For example, the methods in Pan et al. (2019) and Yang et al. (2021) rely on observing the learned policies. Both methods are computationally inefficient because they need to learn separate policies for each environment the attacker wants to infer. The methods in Gomrokchi et al. (2021; 2020) do not require learning additional policies for different environments, but assume that the attacker has full access to the RL algorithm, including the states, transitions, actions, and rewards on which the algorithm relies.

We propose a new black-box MIA called CriticAttack that alleviates the computational burden and relaxes the unrealistic assumptions made in existing works. CriticAttack trains one set of policies for all environments, as opposed to one set of policies per environment (e.g., Yang et al. (2021)). It makes inferences based only on the values generated by the value function and the expected rewards, in contrast to the states, transitions, actions, and rewards required by existing work (e.g., Gomrokchi et al. (2021)). We empirically show that CriticAttack achieves 90% accuracy in inferring environments from the MiniGrid library (Chevalier-Boisvert et al., 2018).

We perform the MIA on a state-of-the-art actor-critic RL algorithm (Schulman et al., 2017). The actor-critic algorithm trains two components: an actor and a critic. The actor generates policies that determine an RL agent's actions.
The critic learns a value function that evaluates the policies by predicting the expected rewards, also known as rewards-to-go. The actor and the critic typically memorize their training environments (Haarnoja et al., 2018; Raichuk et al., 2021). Hence, we expect a high correlation between the values and the expected rewards from a training environment. On multiple RL tasks, CriticAttack achieves 90% attack accuracy, significantly higher than the 50% random guessing accuracy. Such high attack accuracy indicates a severe privacy vulnerability of the value function.

We then turn our attention to defending against CriticAttack. We design a simple and efficient defense method called CriticDefense that concentrates on the value function. It adds uniform noise to the value function to reduce the correlation between the values and the rewards-to-go. However, adding noise introduces a trade-off between the attack accuracy and the agent's performance, e.g., measured by the cumulative reward that the agent obtains. CriticDefense can reduce the attack accuracy from 90% to 60% while degrading the agent's performance by no more than 10%.

Furthermore, we provide empirical evidence that the correlation between the values and the rewards-to-go is the primary source of privacy vulnerability. Because RL agents exploit what they have learned, they tend to revisit the states experienced during training. The value function can accurately predict rewards-to-go on experienced states. Hence, the correlation computed from a training environment is significantly higher than that from a test environment, and this high correlation leads to high attack accuracy.

The optimized value function plays a key role in transfer learning and the teacher-student framework. Many well-known transfer learning algorithms for actor-critic require the source agents to release their optimized value functions (Xu et al., 2020; Zhang & Whiteson, 2019; Takano et al., 2010).
In the teacher-student framework, the student agents learn the optimal policies from the teacher's policies and value functions (Kurenkov et al., 2019). Therefore, it is essential to consider the privacy implications of the value function.

2. RELATED WORK

Pan et al. (2019) and Yang et al. (2021) develop MIA methods for deep RL that collect policies or actions for inference, whereas CriticAttack collects values from the value function and the cumulative reward for membership inference. Gomrokchi et al. (2021) and Gomrokchi et al. (2020) introduce two MIA methods to infer the roll-out trajectories in off-policy RL algorithms, which learn the optimal policy independently of the agent's actions. In contrast, CriticAttack works for on-policy RL algorithms, which optimize the policies that determine what actions to take.

From the defense perspective, several works (Garcelon et al., 2021; Lebensold et al., 2019b; Liao et al., 2021; Balle et al., 2016b; Chen et al., 2021) enforce differential privacy on the RL algorithm, which can protect against MIAs. In contrast to these differential privacy mechanisms, we design CriticDefense specifically to protect the value function. CriticDefense provides robust protection against attacks on the value function; however, it has limited ability to protect other components of the algorithm and does not achieve differential privacy.

3. PRELIMINARY

Reinforcement Learning (RL) is an area of machine learning where we train an agent or a set of agents by interacting with a set of environments. The agent observes a state from the environment, takes an action based on its policy $\pi$, and receives a reward from the environment that evaluates this action. We formally define the environment as a Markov decision process (MDP) $E = \{S, A, P, I, R\}$, where $S$ and $A$ are the sets of states and actions, $P : S \times A \to S$ is the state transition function, $I : S \to [0, 1]$ is the initial distribution of the states, and $R : S \to \mathbb{R}$ is the reward function. We consider a set of environments as the training dataset of the agent that may face privacy threats.

Actor-Critic (Konda & Tsitsiklis, 1999) is one of the state-of-the-art RL algorithms that trains two components: an actor and a critic. The actor with parameters $\theta$ takes the current state representation and all possible actions as input and generates a policy $\pi_\theta$. The critic $V^\pi_\phi(s)$ with parameters $\phi$ learns a value function, which takes the current state observation as input and outputs a value that evaluates the actions leading to the current state. We present the details of the actor-critic algorithm in the Appendix.

In the training stage, we run the agent in the set of environments to collect a set of trajectories. Each trajectory consists of a sequence of tuples (state $s_t$, action $a_t$, reward $r_t$, new state $s_{t+1}$) for timestamps $t = 1, ..., T$. We then estimate the advantage $A^\pi_t$ to compute the gradients of the parameters of both the actor and the critic. Most actor-critic algorithms use one of the three advantage estimation methods in Equation 1.

The value function evaluates the current state and the past actions leading to it by estimating the reward-to-go. The reward-to-go is the expected cumulative reward the agent can obtain when starting from the current state:

$$\hat{r}_t = \sum_{k=t}^{T} r_k.$$

Note that $r_k$ with $k > t$ is the expected reward the agent is likely to get at timestamp $k$. In the training stage, we compute the reward-to-go at timestamp $t$ from the values and the advantage estimation method $E$:

$$
\begin{aligned}
\text{TD advantage:} \quad & \hat{r}_t = A^\pi_t + V^\pi_\phi(s_t), & A^\pi_t &= r_t + \gamma V^\pi_\phi(s_{t+1}) - V^\pi_\phi(s_t), \\
\text{N-step advantage:} \quad & \hat{r}_{t,N} = A^\pi_{t,N} + V^\pi_\phi(s_t), & A^\pi_{t,N} &= \sum_{k=0}^{N-1} \gamma^k r_{t+k+1} + \gamma^N V^\pi_\phi(s_{t+N+1}) - V^\pi_\phi(s_t), \\
\text{Generalized advantage:} \quad & \hat{r}_{t,\lambda} = A^\pi_{t,\lambda} + V^\pi_\phi(s_t), & A^\pi_{t,\lambda} &= \sum_{k=t}^{T} (\gamma\lambda)^{k-t} A^\pi_k.
\end{aligned}
\tag{1}
$$

We train the critic to estimate the reward-to-go at a given state $s_t$, and update the parameters $\phi$ to minimize the value loss:

$$L(\phi) = \sum_t \left\| V^\pi_\phi(s_t) - \hat{r}_t \right\|^2. \tag{2}$$

Membership Inference Attack (MIA) is a well-known privacy attack that can be applied to machine learning models to infer whether a selected data record belongs to the training dataset of the given model. The shadow model framework (Shokri et al., 2017) is the standard approach to MIAs on machine learning models, where the shadow models mimic the behavior of the target model. Since the training datasets of the shadow models are known, the attacker can learn to infer whether a data record was used in training a shadow model. We then apply the trained attacker to the target model. We call the percentage of correctly inferred data records the attack accuracy.
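To make the reward-to-go estimates in Equation 1 concrete, the following is a minimal sketch of the TD and generalized-advantage variants for a single finite trajectory. The function names, the zero-based list indexing, and the convention that the value past the end of the trajectory is 0 are assumptions made for illustration and are not taken from the paper's implementation.

```python
from typing import List


def td_reward_to_go(rewards: List[float], values: List[float], gamma: float) -> List[float]:
    """TD estimate: r_hat_t = A_t + V(s_t), with A_t = r_t + gamma * V(s_{t+1}) - V(s_t)."""
    horizon = len(rewards)
    estimates = []
    for t in range(horizon):
        # Assumption: the value of the state after the last step is 0.
        next_value = values[t + 1] if t + 1 < horizon else 0.0
        advantage = rewards[t] + gamma * next_value - values[t]
        estimates.append(advantage + values[t])
    return estimates


def gae_reward_to_go(rewards: List[float], values: List[float],
                     gamma: float, lam: float) -> List[float]:
    """Generalized estimate: A_{t,lam} = sum_k (gamma*lam)^(k-t) A_k, computed backwards."""
    horizon = len(rewards)
    estimates = [0.0] * horizon
    running_advantage = 0.0  # A_{T+1,lam} = 0 by assumption
    for t in reversed(range(horizon)):
        next_value = values[t + 1] if t + 1 < horizon else 0.0
        td_advantage = rewards[t] + gamma * next_value - values[t]
        # A_{t,lam} = A_t + gamma * lam * A_{t+1,lam}
        running_advantage = td_advantage + gamma * lam * running_advantage
        estimates[t] = running_advantage + values[t]
    return estimates
```

Under these assumptions, setting `lam = 0` reduces the generalized estimate to the TD estimate, and setting `gamma = lam = 1` recovers the plain sum of future rewards regardless of the values.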

4. ATTACK METHOD

We design an environment-based MIA on actor-critic algorithms named CriticAttack. In environment-based (user-based) MIA, the attacker infers membership of an environment, as opposed to trajectory-based (sample-based) MIA, where the attacker infers membership of a single trajectory. CriticAttack determines whether the agent has been trained in a particular environment based on the observed values and rewards-to-go.

4.1. ASSUMPTION

A target agent is the RL agent, trained on a set of private environments, that the attacker wants to infer. In this work, we perform CriticAttack on a target agent whose policies and value functions are represented by neural networks and optimized by the actor-critic algorithm. The information of the target agent includes the trained parameters of the actor network and the critic network, the specifications of the two networks (such as the number of layers and the activation function), the training algorithm with its hyper-parameters, the loss functions, the gradient history, and the feedback from the environments.

The attacker typically does not have full access to the information of the target agent. Based on the attacker's access, we can categorize MIAs into two groups: black-box attacks and white-box attacks (Hu et al., 2021). The black-box attacker only has access to the inputs and outputs of the neural networks, here the actor network and the critic network. In contrast, the white-box attacker has full access to the parameters of the neural networks, the loss functions, and the gradients. However, several black-box MIAs (Shokri et al., 2017; Sablayrolles et al., 2019) can also access the network specifications, training algorithm, and hyper-parameters. To distinguish black-box from white-box attacks, we take access to the networks' parameters and gradient history as the borderline between the two types of attacks: the attacker is a black-box attacker if it has access to neither the networks' parameters nor the gradient history.

In this work, we assume the attacker only has access to the inputs and outputs of the actor and critic networks, the training algorithm (including the advantage estimation method), the hyper-parameters, and the rewards from the environments. Therefore, CriticAttack falls into the category of black-box attacks.

4.2. CRITICATTACK

CriticAttack follows the shadow model framework (Shokri et al., 2017). Since we do not have access to the target agent's training environments, we train a set of shadow agents on known environments to mimic the behavior of the target agent. The attacker learns how the shadow agents behave differently in visited and new environments. We assume there is a public universal data distribution from which all environments are drawn, regardless of whether they are used during training. The attacker can therefore obtain similar datasets to train the shadow agents. Once the attacker learns to tell whether an environment has been used in training the shadow agents, we can apply it to the target agent. Training the attacker takes the following three steps.

First, we obtain a set of environments from the data source and evenly partition the environments into two groups: training environments and validation environments. We train each shadow agent on the training environments until its performance is within 5% of the target agent's. We measure the performance by the average reward in the validation environments. We then repeat this step to construct multiple shadow agents.

Second, we run each shadow agent on its training and validation environments to collect trajectories. For each environment $E$, we collect a corresponding trajectory set $S_E$ that contains $n$ critic trajectories, as defined in Definition 4.1.

Definition 4.1. A critic trajectory $T_{v,\hat{r}}$ consists of a sequence of (value, reward-to-go) tuples: $T_{v,\hat{r}} = \{(V^\pi_\phi(s_t), \hat{r}_t) : t = 0, ..., T\}$, where $T$ is the trajectory length. The critic trajectory can be split into a value trajectory $T_v$ and a reward-to-go trajectory $T_{\hat{r}}$: $T_v = \{V^\pi_\phi(s_t) : t = 0, ..., T\}$, $T_{\hat{r}} = \{\hat{r}_t : t = 0, ..., T\}$.

Note that we need to compute the rewards-to-go from the rewards and the value trajectory. We trace the rewards $r = \{r_1, ..., r_T\}$ and the value trajectory $T_v$ backwards from $T$ to $0$ to compute the reward-to-go trajectory $T_{\hat{r}}$ and obtain the critic trajectory $T_{v,\hat{r}}$ using Algorithm 1. After obtaining the critic trajectories, we label each critic trajectory set $S_E$ as 'in' if $E$ belongs to the training environments and 'out' otherwise. By repeating the second step, we get a critic trajectory set and its label for every environment in the training and validation datasets. These critic trajectories and labels form a supervised learning dataset for the attacker.

Third, we train a binary classifier that takes a set of critic trajectories $S_E$ as input and determines whether the corresponding environment $E$ is 'in' or 'out' of the training environments. We design two architectures for the binary classifier: a logistic regression classifier and a deep neural network classifier.

Logistic Regression on Correlation Score (LR) focuses on the correlation between values and rewards-to-go. Suppose we have collected $N$ sets of critic trajectories from the shadow agents, where each set contains $n$ trajectories. For each environment $E$, we extract the value trajectories $T_{v_i}$ and the reward-to-go trajectories $T_{\hat{r}_i}$ from the trajectory set $S_E$, where $i = 1, ..., n$ and $n$ is the number of trajectories in $S_E$.
Then, we compute the average correlation between the value trajectories and reward-to-go trajectories following Equation 3:

$$\rho_E = \frac{1}{n} \sum_{i=1}^{n} \frac{\operatorname{cov}(T_{v_i}, T_{\hat{r}_i})}{\sigma_{T_{v_i}} \sigma_{T_{\hat{r}_i}}} = \frac{1}{n} \sum_{i=1}^{n} \frac{\mathbb{E}\left[(T_{v_i} - \mu_{T_{v_i}})(T_{\hat{r}_i} - \mu_{T_{\hat{r}_i}})\right]}{\sigma_{T_{v_i}} \sigma_{T_{\hat{r}_i}}}. \tag{3}$$

In the training stage, we compute a correlation score $\rho_E$ for each environment $E$ and form a (correlation, label) tuple, where the label 1 represents 'in' and 0 represents 'out.' We then fit the logistic regression classifier on all the (correlation, label) tuples. In the inference stage, we compute the average correlation score $\rho_{E_{val}}$ for the target agent on a given environment $E_{val}$ and use the logistic regression classifier to predict whether $E_{val}$ belongs to the training dataset of the target agent.

Algorithm 1: REWARD-TO-GO ESTIMATION
Input: value trajectory $T_v$, rewards $r$, discount factor $\gamma$, estimation method $E$, hyper-parameters $N$, $\lambda$
Output: critic trajectory $T_{v,\hat{r}}$
  $A^\pi_{T+1,N,\lambda},\ V^\pi_\phi(s_{T+1}),\ r_{t>T} = 0,\ 0,\ r_T$
  for $t = T$ to $0$ do
    if $E$ is TD advantage then
      $A^\pi_{t,N,\lambda} = r_t + \gamma V^\pi_\phi(s_{t+1}) - V^\pi_\phi(s_t)$  /* refer to Equation 1 */
    end
    if $E$ is N-step advantage then
      $A_t = r_{t+1} - V^\pi_\phi(s_t) + \gamma V^\pi_\phi(s_{t+1})$  /* refer to Equation 1 */
      $A_{t+N} = r_{t+N+1} - V^\pi_\phi(s_{t+N+1}) + \gamma V^\pi_\phi(s_{t+N+2})$
      $A^\pi_{t,N,\lambda} = \gamma A^\pi_{t+1,N,\lambda} + A_t - \gamma^N A_{t+N}$  /* Proof: see Appendix */
    end
    if $E$ is Generalized advantage then
      $A^\pi_{t,N,\lambda} = r_t + \gamma V^\pi_\phi(s_{t+1}) - V^\pi_\phi(s_t) + \gamma\lambda A^\pi_{t+1,N,\lambda}$  /* refer to Equation 1 */
    end
    $\hat{r}_t = A^\pi_{t,N,\lambda} + V^\pi_\phi(s_t)$
  end
  $T_{v,\hat{r}} = \{(v_t, \hat{r}_t) : t = 0, ..., T\}$

Deep Neural Network (DNN) takes the concatenation $\oplus$ of the value trajectory and the corresponding reward-to-go trajectory as input and performs binary classification:

$$NN_\omega(T_{v_i} \oplus T_{\hat{r}_i}) = \begin{cases} 1, & \text{as 'in'}, \\ 0, & \text{as 'out'}. \end{cases}$$

In the training stage, we assign a label 'in' or 'out' to each critic trajectory depending on whether it is collected from a training environment.
We then train the neural network with these labeled trajectories. In the inference stage, we run the target agent on a given environment E to obtain n trajectories. We apply the trained neural network to predict each trajectory and take the majority vote as the prediction for the given environment.
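The LR attacker and the majority vote described above can be sketched in a few lines. This is only an illustrative sketch under stated assumptions: the function names, the plain gradient-descent fit, the learning rate and epoch count, the choice to score a constant trajectory as zero correlation, and the tie-breaking rule in the vote are all hypothetical and not taken from the paper.

```python
import math
from typing import List, Sequence, Tuple


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between a value trajectory and a reward-to-go trajectory."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if std_x == 0.0 or std_y == 0.0:
        return 0.0  # assumption: a constant trajectory carries no correlation signal
    return cov / (std_x * std_y)


def correlation_score(critic_trajectories: List[Tuple[Sequence[float], Sequence[float]]]) -> float:
    """Average correlation over the n critic trajectories of one environment (Equation 3)."""
    return sum(pearson(v, r) for v, r in critic_trajectories) / len(critic_trajectories)


def fit_logistic(scores: Sequence[float], labels: Sequence[int],
                 lr: float = 0.5, epochs: int = 5000) -> Tuple[float, float]:
    """Fit p(in | score) = sigmoid(w * score + b) by batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(scores)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(scores, labels):
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))
            grad_w += (p - y) * x
            grad_b += p - y
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


def predict_in(score: float, w: float, b: float) -> int:
    """Return 1 ('in') if the predicted membership probability is at least 0.5, else 0."""
    return 1 if 1.0 / (1.0 + math.exp(-(w * score + b))) >= 0.5 else 0


def majority_vote(predictions: Sequence[int]) -> int:
    """Environment-level decision from per-trajectory predictions; ties count as 'out'."""
    return 1 if 2 * sum(predictions) > len(predictions) else 0
```

With one-dimensional correlation scores, the fitted classifier amounts to a threshold at `-b / w`, which is why a simple model can already separate training from validation environments when their correlations differ.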

5. DEFENSE METHOD

In practice, the best protection is to conceal the value function from users. However, RL agents that users may fine-tune, or that serve as the teacher in transfer learning, must release their value functions. For such scenarios, we introduce a defense method named CriticDefense that specifically targets CriticAttack. We consider the correlation between the values and the rewards-to-go to be one of the primary factors influencing the attacker's decision. Therefore, we develop CriticDefense to modify this correlation and examine whether it can effectively reduce the attack accuracy.

5.1. CRITICDEFENSE

CriticDefense adds uniform noise to a certain percentage of the values during training. In practice, we replace the value function $V^\pi_\phi$ in the actor-critic algorithm with the following:

$$\tilde{V}^\pi_\phi(s_t) = \left(V^\pi_\phi(s_t) + \mathbb{1}_{[0,R)}(u) \cdot u \cdot R\right) \bmod 1,$$

where $R$ is a hyper-parameter that defines the noise percentage, $u$ is sampled from a standard uniform distribution, $u \sim U(0, 1)$, and $\bmod$ is the modulo operation. CriticDefense adds negligible noise to a small proportion of the values as $R$ approaches 0 and adds large noise to almost all the values as $R$ approaches 1. Adding uniform noise to the values indirectly adds noise to the rewards-to-go, since we compute the rewards-to-go from the values according to Algorithm 1. CriticAttack makes inferences based only on the values and rewards-to-go, so adding noise to both components is the most straightforward way to protect against such an attack.

In the RL tasks we consider, values lie between 0 and 1, so we use a modulo to keep the noisy values in this range. Compared to clipping the noisy values, modulo one ensures that the noisy values do not exceed the upper limit and allows a more significant change, which may further break the correlation between the values and rewards-to-go. For instance, an original value of 0.95 with a noise of 0.1 results in an invalid value of 1.05. Value clipping clips the noisy value to 1 to make it valid. In contrast, the modulo results in a new value of 0.05. The modulo is triggered more frequently as $R$ approaches 1.
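The noise rule above is small enough to sketch directly. The function name `critic_defense`, the parameter name `noise_level` (standing for $R$), and the optional argument `u` that lets a caller fix the uniform sample are assumptions made for illustration; the paper does not specify an implementation.

```python
import random
from typing import Optional


def critic_defense(value: float, noise_level: float, u: Optional[float] = None) -> float:
    """Return the noisy value (V + 1_[0,R)(u) * u * R) mod 1 for a single critic output.

    noise_level plays the role of R; u is drawn from U(0, 1) unless supplied.
    """
    if u is None:
        u = random.random()  # u ~ U(0, 1)
    indicator = 1.0 if 0.0 <= u < noise_level else 0.0  # 1_[0,R)(u)
    return (value + indicator * u * noise_level) % 1.0
```

For example, with `noise_level = 0.5` and `u = 0.2` the added noise is 0.1, so a value of 0.95 wraps around to roughly 0.05, matching the modulo example in the text. With `noise_level = 0` the indicator never fires and the value is returned unchanged.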

5.2. COMPATIBILITY

CriticDefense is designed to protect the value function against CriticAttack, so it has limited efficacy in protecting the policies and mitigating overfitting. We integrate two other methods with CriticDefense to strengthen the protection of other components of the actor-critic algorithm, such as the policies. CriticDefense is compatible with the two methods below, meaning that they neither interfere with each other nor introduce extra performance loss.

Regularization is a prominent approach to prevent overfitting by lowering the complexity of the neural networks during training (Kukačka et al., 2017). Many works have demonstrated that regularization reduces MIA accuracy by mitigating overfitting (Ying et al., 2020; Nasr et al., 2018; Kaya et al., 2020). We apply L2 regularization in the actor-critic algorithm by adding a regularization loss with regularization rate $a$ to the original value loss $L(\phi)$.

Value Clipping is another approach that prevents the value function from over-adapting to a newly added training environment and losing the information from previous environments (Schulman et al., 2017). Existing work has shown its effectiveness in protecting RL policies against MIA (Yang et al., 2021). It clips the norm of the value loss to $\epsilon_{clip}$ to restrict the step size of the parameter updates.

6. EMPIRICAL ANALYSIS

In this section, we perform two sets of MIA experiments. In the first set of experiments, we provide empirical evidence that CriticAttack can correctly infer above 90% of the training environments. In the second set of experiments, we demonstrate the vulnerability factor exploited by CriticAttack and show that CriticDefense can reduce the attack accuracy to 60% while maintaining the agent's performance. In both the attack and defense sections, we clip the values with $\epsilon_{clip} = 0.2$ and enforce L2 regularization with $a = 0.01$ to mitigate overfitting while training the agents. We apply the two methods in both sections to show that: 1) CriticAttack works well even if the mitigation methods are applied, and 2) the attack accuracy drops due to CriticDefense rather than the two mitigation methods.

Environment Setup

In all the experiments, we use the MiniGrid toolkit (Chevalier-Boisvert et al., 2018) as the underlying testbed. We choose the four tasks listed in Figure 1, wherein the agent learns to reach a target destination without bumping into obstacles. Once the agent reaches the destination or exhausts a fixed number of steps, the environment resets to a new map while the task remains the same. We control the map's layout by fixing random seeds, since seeds and maps are in one-to-one correspondence. We use the maps to simulate real-world room or warehouse layouts whose privacy may be exposed. Assuming that each new map represents the floor map of a private property, we perform MIA to infer whether the agent has visited a given floor map during training, which violates the privacy of the property.

Experiment Setup

We have presented three different advantage estimation methods in Equation 1, so we perform three subsets of experiments for the three estimation methods separately. We use the Proximal Policy Optimization (PPO) algorithm (Schulman et al., 2017) for deploying the three advantage estimations: TD advantage, N-step advantage (N-Step), and Generalized advantage (GAE). We apply the PPO algorithm on 40 unique maps to train each target agent (we also show the MIA results on targets with more training maps in the Appendix). We implement the PPO algorithm using the RL-Starter-Files library (Willems, 2018) with default hyper-parameters unless specified below.

6.1. ATTACK

Following CriticAttack as described in Section 4, we train five shadow agents for each target agent. Because we assume that the attacker does not have access to the training data size, we use 20 distinct maps to train each shadow agent until it converges to the same reward as the target agent. Then, we apply each shadow agent to 40 distinct maps (20 of which were used to train this shadow agent) to collect 25 critic trajectories from each map. We thus collect 200 trajectory sets with 5,000 critic trajectories from the 5 shadow agents and use them to train the attacker. Once we finish training the attacker, we apply it to the target agent to infer the training maps of the target agent. We have proposed two architectures for the attacker: LR and DNN. We present the MIA results of the two architectures in 

6.2. DEFENSE

We now investigate the effects of CriticDefense and compare it with the well-known differential privacy mechanism DP-SGD. We assume the attacker knows the defense methods applied to the target model, so we also deploy the defense methods with the same parameters while training the shadow models. Note that we do not claim that CriticDefense outperforms DP-SGD. CriticDefense is more suitable for protecting the value function against CriticAttack, but DP-SGD is potentially more effective in protecting the policies.

Privacy-Performance Trade-off

Figure 3 shows the defense results with GAE on the Multi-Rooms task, with MIA settings identical to Section 6.1. We observe that both defense methods can reduce the attack accuracy. CriticDefense can reduce the attack accuracy to approximately 60% with less than 10% performance loss, measured by rewards. We compute the performance loss from the cumulative reward of the unprotected agent $r_T$ and the cumulative reward of the protected agent $r'_T$:

$$L_{perf} = \frac{r_T - r'_T}{r_T}.$$

CriticDefense can reduce the attack accuracy to around 60% with approximately 10% performance loss in the other tasks, regardless of the estimation method. Figure 2 compares the attack results before and after applying CriticDefense and shows how CriticDefense reduces the attack accuracy. We also present the numerical results in Table 4 in the Appendix.

In contrast, the DP-SGD algorithm can reduce the MIA accuracy to approximately 50%, but with over 60% performance loss. Additionally, the actor-critic algorithm with DP-SGD requires a significantly larger number of steps to converge than with CriticDefense. This means that training an agent protected by DP-SGD is ten times more computationally expensive than training an unprotected one. We define the differential privacy budget $\epsilon$, $\delta$, and plot $\sigma$ vs. $\epsilon$ in the Appendix.

6.3. VULNERABILITY FACTOR

Due to the exploitation feature of RL, when we place an agent in its training environment, the agent tends to take the 'best' trajectory that it has experienced during training. Therefore, the value function can accurately predict the rewards-to-go, causing the correlations from training environments to be significantly higher than correlations from validation environments. Hence we can achieve above 85% attack accuracy simply by using a logistic regression classifier to find the threshold between the training and validation correlations. CriticDefense significantly reduces the training correlation and shrinks the gap between the training correlation and validation correlation. We can observe that the attack accuracies are decreasing as the training and validation correlations approach each other. Therefore, we conclude that the correlation between values and rewards-to-go is the primary source of privacy vulnerability to CriticAttack. Since CriticDefense can reduce the correlation by only adding a small amount of noise (e.g., R = 0.3), it is sufficient to protect the value function against CriticAttack while maintaining the agent's performance. We also present specific examples to support our claims in the Appendix.

7. CONCLUSION

In this work, we introduce an effective and efficient black-box membership inference attack named CriticAttack that concentrates on the value function of the actor-critic algorithm. We empirically demonstrate the high vulnerability of the value function of the actor-critic algorithm to MIAs by showing approximately 90% attack accuracy. Therefore, RL services should provide users with the least possible access to the value function. We then design a corresponding defense method called CriticDefense, which can significantly reduce the attack accuracy of CriticAttack without hurting the target agent's performance. A limitation of the current work is that CriticAttack only works for actor-critic algorithms. As a future direction, this MIA can be generalized to other reinforcement learning algorithms that rely on value functions.

The $(t+1)$-th advantage is

$$A^\pi_{t+1,N} = \sum_{k=0}^{N-1} \gamma^k r_{t+k+2} + \gamma^N V^\pi_\phi(s_{t+N+2}) - V^\pi_\phi(s_{t+1}) = r_{t+2} + \gamma r_{t+3} + \dots + \gamma^{N-1} r_{t+N+1} + \gamma^N V^\pi_\phi(s_{t+N+2}) - V^\pi_\phi(s_{t+1}).$$

Then, we can compute

$$\gamma A^\pi_{t+1,N} = \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots + \gamma^N r_{t+N+1} + \gamma^{N+1} V^\pi_\phi(s_{t+N+2}) - \gamma V^\pi_\phi(s_{t+1}),$$

hence

$$A^\pi_{t,N} - \gamma A^\pi_{t+1,N} = r_{t+1} - \gamma^N r_{t+N+1} + \gamma^N V^\pi_\phi(s_{t+N+1}) - V^\pi_\phi(s_t) - \gamma^{N+1} V^\pi_\phi(s_{t+N+2}) + \gamma V^\pi_\phi(s_{t+1}),$$

$$A^\pi_{t,N} = \gamma A^\pi_{t+1,N} + \left[r_{t+1} - V^\pi_\phi(s_t) + \gamma V^\pi_\phi(s_{t+1})\right] - \gamma^N \left[r_{t+N+1} - V^\pi_\phi(s_{t+N+1}) + \gamma V^\pi_\phi(s_{t+N+2})\right].$$
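The recursion for the N-step advantage derived above can be checked numerically against the direct definition in Equation 1. The sketch below is an illustration under stated assumptions: the function names are hypothetical, and rewards and values are stored in plain lists so that `rewards[i]` stands for $r_i$ and `values[i]` for $V^\pi_\phi(s_i)$.

```python
from typing import Sequence


def n_step_direct(rewards: Sequence[float], values: Sequence[float],
                  t: int, n: int, gamma: float) -> float:
    """A_{t,N} = sum_{k=0}^{N-1} gamma^k r_{t+k+1} + gamma^N V(s_{t+N+1}) - V(s_t)."""
    discounted = sum(gamma ** k * rewards[t + k + 1] for k in range(n))
    return discounted + gamma ** n * values[t + n + 1] - values[t]


def n_step_recursive(rewards: Sequence[float], values: Sequence[float],
                     t: int, n: int, gamma: float) -> float:
    """A_{t,N} = gamma * A_{t+1,N} + head - gamma^N * tail, following the derivation."""
    next_advantage = n_step_direct(rewards, values, t + 1, n, gamma)
    # head = r_{t+1} - V(s_t) + gamma * V(s_{t+1})
    head = rewards[t + 1] - values[t] + gamma * values[t + 1]
    # tail = r_{t+N+1} - V(s_{t+N+1}) + gamma * V(s_{t+N+2})
    tail = rewards[t + n + 1] - values[t + n + 1] + gamma * values[t + n + 2]
    return gamma * next_advantage + head - gamma ** n * tail
```

Both functions need entries up to index `t + n + 2`, so the lists must be long enough for the chosen `t` and `n`. The recursive form is what lets Algorithm 1 update the advantage in constant time per step while tracing the trajectory backwards.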

A.5 MORE EXAMPLES OF CRITICATTACK

We present more examples of CriticAttack on all three advantage estimation methods in Figure 5. We also show the value trajectories and reward-to-go trajectories collected from a selected environment that is 'in' agent 1's training dataset but 'out' of agent 2's training dataset. We observe that the value and reward-to-go trajectories generated by agent 1 are more highly correlated than those generated by agent 2.

We also perform CriticAttack on target agents with varying training data sizes. We attack target agents trained on 10, 20, ..., and 100 environments and observe how the training data size affects the attack accuracy. We train the shadow models using the same number of training environments as the target model. We present the results in Figure 6. We observe a negative correlation between the training data size and the attack accuracy: increasing the training data size improves the RL agent's generalization power, which reduces the attack accuracy. However, the impact of the training data size on the attack accuracy is insignificant.

After defining the adjacent dataset, we can define $(\epsilon, \delta)$-differential privacy as follows:

Definition A.2 ($(\epsilon, \delta)$-differential privacy). Let $\epsilon$ be a positive real number and $M$ be a randomized algorithm that takes a dataset as input. Let $Y$ be the image of $M$. The algorithm $M$ is $(\epsilon, \delta)$-differentially private if, for all adjacent datasets $D_1$ and $D_2$, and all $R \subseteq Y$:

$$P[M(D_1) \in R] \le \exp(\epsilon) P[M(D_2) \in R] + \delta,$$

where $\delta$ captures the probability that $\epsilon$-differential privacy fails. If $\delta = 0$, then we say $M$ is $\epsilon$-differentially private.

DP-LSL (Balle et al., 2016a; Lebensold et al., 2019a) is a pretraining protection mechanism that constructs a differentially private value function (critic network). It achieves differential privacy by adding Gaussian noise to the critic's parameters before running the actor-critic algorithm.
In the initialization phase, we construct a differentially private critic in the actor-critic algorithm by adding Gaussian noise to the critic's parameters. DP-LSL refers to this process of adding noise, which takes the differential privacy budget $\epsilon$ and $\delta$ as input. Then, we apply CriticAttack to the actor-critic algorithm with the differentially private critic. Note that we obtain the results by performing MIAs on the actor-critic algorithm with GAE. Lebensold et al. (2019a) empirically show that DP-LSL can achieve differential privacy with minimal loss in performance. However, our experiments demonstrate that DP-LSL has a negligible effect on protecting against CriticAttack. We present the MIA results in Table 3.

Table 3: MIA accuracy and performance loss under DP-LSL. We report the (mean ± standard deviation) tuple across five repetitions.

DP-SGD

DP-SGD (Abadi et al., 2016) is a standard approach that helps deep learning models satisfy differential privacy. DP-SGD modifies the stochastic gradient descent in the training algorithm of the deep learning model to enforce differential privacy on the algorithm itself.

During the training procedure, DP-SGD first clips the gradients computed over the training data, then applies the Gaussian mechanism to add statistical noise drawn from a defined Gaussian distribution to the gradients, and finally updates the model with the noisy gradients. Let $\theta$ be the parameters of the deep learning model; DP-SGD works as follows:

$$\theta_{i+1} = \theta_i - \frac{\alpha}{\beta}\, M_{gauss}\!\left(\sum_{j=1}^{\beta} g\big(\nabla_\theta L_j(\theta), C\big),\ \sigma\right),$$

where $\theta_i$ is the parameters of the deep learning model at iteration $i$; $\alpha$ and $\beta$ are the learning rate and the batch size of the training algorithm; and $g(x, C)$ is the clipping function defined by

$$g(x, C) = x \cdot \min\left(1, \frac{C}{\|x\|}\right).$$

$C$ and $\sigma$ are the clipping value and the noise standard deviation of the DP-SGD algorithm. We define the Gaussian mechanism as

$$M_{gauss}(f(x), \sigma) = f(x) + n, \quad n \sim \mathcal{N}(0, \sigma^2 I).$$
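The update described above (clip each per-sample gradient, sum the clipped gradients, add Gaussian noise with standard deviation σ, then take a step scaled by the learning rate over the batch size) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the function names `clip_gradient` and `dp_sgd_step`, the list-based parameter vectors, and the toy gradients are all hypothetical.

```python
import math
import random


def clip_gradient(grad, clip_value):
    """Clipping function g(x, C) = x * min(1, C / ||x||) on a list of floats."""
    norm = math.sqrt(sum(x * x for x in grad))
    # A zero gradient needs no rescaling (and would divide by zero).
    scale = 1.0 if norm == 0.0 else min(1.0, clip_value / norm)
    return [x * scale for x in grad]


def dp_sgd_step(theta, per_sample_grads, lr, clip_value, sigma, rng):
    """One noisy update: clip, sum, add N(0, sigma^2) per coordinate, step."""
    batch_size = len(per_sample_grads)
    clipped = [clip_gradient(g, clip_value) for g in per_sample_grads]
    # Coordinate-wise sum of the clipped per-sample gradients.
    summed = [sum(column) for column in zip(*clipped)]
    # Gaussian mechanism: independent noise on every coordinate of the sum.
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [t - (lr / batch_size) * n for t, n in zip(theta, noisy)]


# Toy usage with hypothetical gradients; sigma = 0 disables the noise so the
# clipping effect is easy to inspect by hand.
rng = random.Random(0)
theta = [0.0, 0.0, 0.0]
grads = [[3.0, 4.0, 0.0], [0.1, 0.0, 0.0]]
theta_next = dp_sgd_step(theta, grads, lr=0.1, clip_value=1.0, sigma=0.0, rng=rng)
```

In the toy call, the first gradient has norm 5 and is rescaled to norm 1, while the second already lies within the clipping bound and passes through unchanged. Setting `sigma` above zero adds the noise that the privacy guarantee relies on.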

DP-SGD Privacy Budget

We apply the DP-SGD algorithm to enforce differential privacy in actor-critic reinforcement learning. We set the gradient clipping value $\epsilon_{clip} = 0.2$, epoch = 4, batch size $\beta = 256$, privacy offset $\delta = 10^{-4}$, and sample size $n = 1000$. We show the privacy budget $\epsilon$ at each noise variance $\sigma$ in Figure 7. We present five sets of trajectories in Figure 8. The correlations between the value and reward-to-go trajectories in the five rows are 0.88, 0.83, 0.71, 0.32, and 0.26, respectively. We observe that a small amount of noise has little effect on the correlation between the rewards-to-go and the values, as shown in the second and third rows in Figure 8. Instead, a small amount of noise only changes the trajectories' smoothness. We must increase the noise variance to reduce the correlation, as in the third and fourth rows in Figure 8. However, introducing such a large noise already significantly degrades the agent's performance.
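To make the correlation statistic discussed above concrete, the following sketch computes discounted rewards-to-go and their Pearson correlation with a value trajectory. It is a toy under loud assumptions: the reward sequence, the discount `gamma = 0.99`, and all helper names are hypothetical, and Gaussian noise is added directly to a synthetic value trajectory as a stand-in for the effect of noise. It does not reproduce DP-SGD training or the trajectories in Figure 8.

```python
import math
import random


def rewards_to_go(rewards, gamma):
    """Discounted sum of rewards from each step to the end of the episode."""
    out, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        out.append(running)
    return out[::-1]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


def add_gaussian_noise(values, sigma, rng):
    """Perturb each value independently with N(0, sigma^2) noise."""
    return [v + rng.gauss(0.0, sigma) for v in values]


# Hypothetical sparse-reward episode: a single reward at the final step.
rng = random.Random(0)
rewards = [0.0] * 19 + [1.0]
rtg = rewards_to_go(rewards, gamma=0.99)
values = list(rtg)  # idealized critic that predicts the reward-to-go exactly

corr_clean = pearson(values, rtg)
corr_small = pearson(add_gaussian_noise(values, 0.005, rng), rtg)
corr_large = pearson(add_gaussian_noise(values, 5.0, rng), rtg)
```

In this toy setting the small perturbation mostly roughens the trajectory while leaving the correlation high, whereas the large perturbation swamps the signal, which mirrors the qualitative trend described in the text.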



Figure 1: RL tasks from left to right: Multi-Rooms, Door-Key, Lava-Crossing, Four-Rooms.

Figure 2: MIA on a PPO algorithm with GAE trained for the Multi-Rooms task. The figures in the first row show the value and rewards-to-go trajectories. The second and third rows show the MIA results using LR and DNN, respectively.

Figure 3: Protections against MIA. The first row shows the results of CriticDefense, and the second row shows the results of DP-SGD. The columns from left to right present the required number of steps to convergence, final rewards upon convergence, the average correlation between values and rewards-to-go, and attack accuracies. In DP-SGD, we use the noise variance σ as the x-axis. We define the differential privacy budget ϵ, δ, and plot σ vs. ϵ in the Appendix.

Figure 5: The first and second rows show the value and reward-to-go trajectories collected from a selected environment E 0 , generated by two separate agents: agent 1 and agent 2. Note that E 0 is in the training dataset of agent 1 but out of the training dataset of agent 2. The third and fourth rows show CriticAttack results where the attacker uses a logistic regression classifier and a deep neural network, respectively.

Figure 7: DP-SGD privacy budget at each noise level.

Figure 8: Each figure shows the value and reward-to-go trajectories from a selected environment under various conditions.

Table 1 and show some visualized examples in Figure 2. The LR attacker can achieve approximately 90% accuracy by only finding a correlation threshold. The DNN attacker can reach close to 95% accuracy; however, it is computationally inefficient compared to the LR attacker. In summary, both attackers achieve high attack accuracy, which demonstrates the severe vulnerability of the value function.
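The idea of an attacker that only finds a correlation threshold can be sketched as follows. This is a simplified stand-in rather than the logistic regression attacker itself: it searches directly for the cut-off on a single correlation feature that best separates 'in' from 'out' examples. The function names and the shadow-model correlations below are made-up illustrative values, not results from the experiments.

```python
def fit_threshold(correlations, labels):
    """Return the correlation cut-off with the best accuracy on labeled data.

    labels[i] is True when example i was 'in' the training dataset.
    """
    best_threshold, best_accuracy = None, -1.0
    for candidate in sorted(set(correlations)):
        # Predict 'in' whenever the correlation reaches the candidate cut-off.
        predictions = [c >= candidate for c in correlations]
        correct = sum(p == y for p, y in zip(predictions, labels))
        accuracy = correct / len(labels)
        if accuracy > best_accuracy:
            best_threshold, best_accuracy = candidate, accuracy
    return best_threshold, best_accuracy


def infer_membership(correlation, threshold):
    """Classify an environment as 'in' (True) or 'out' (False)."""
    return correlation >= threshold


# Hypothetical value/reward-to-go correlations gathered from shadow models.
shadow_correlations = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
shadow_labels = [True, True, True, False, False, False]
threshold, train_accuracy = fit_threshold(shadow_correlations, shadow_labels)

# Apply the learned cut-off to a correlation measured on the target agent.
is_member = infer_membership(0.85, threshold)
```

On a single feature, a logistic regression decision boundary also reduces to one cut-off, which is why a direct threshold search is a reasonable approximation for illustration.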

MIA accuracies (mean ± standard deviation) across five repetitions. We use GAE for all the results.

MIA accuracies (corresponding performance loss in percentage) across five repetitions.

A.3 PROOF OF THE N-STEP ADVANTAGE IN ALGORITHM 2

Theorem A.1. In the N-Step Advantage, given the $(t+1)$-th advantage $A^{\pi}_{t+1,N}$, the $t$-th advantage is given by Eq. (5).

Proof. We have shown the N-Step Advantage in the Background section:

$$A^{\pi}_{t,N} = r_{t+1} + \gamma r_{t+2} + \dots + \gamma^{N-1} r_{t+N} + \gamma^{N} V^{\pi}_{\phi}(s_{t+N+1}) - V^{\pi}_{\phi}(s_t). \tag{6}$$

