VALUE-BASED MEMBERSHIP INFERENCE ATTACK ON ACTOR-CRITIC REINFORCEMENT LEARNING

Abstract

In actor-critic reinforcement learning (RL), the actor and the critic, respectively, compute candidate policies and a value function that evaluates those policies. Such RL algorithms may be vulnerable to membership inference attacks (MIAs), a class of privacy attacks that infer data membership, i.e., whether a specific data record belongs to the training dataset. We investigate the vulnerability of the value function in actor-critic RL to MIAs. We develop CriticAttack, a new MIA that targets black-box RL agents by examining the correlation between the expected reward and the value function. We empirically show that CriticAttack correctly infers approximately 90% of the training data membership, i.e., it achieves 90% attack accuracy. Such accuracy is far beyond the 50% random-guessing accuracy, indicating a severe privacy vulnerability of the value function. To defend against CriticAttack, we design a method called CriticDefense that adds uniform noise to the value function. CriticDefense reduces the attack accuracy to 60% without significantly affecting the agent's performance.

1. INTRODUCTION

Membership inference attacks (MIAs) pose privacy risks to reinforcement learning (RL) algorithms (Gomrokchi et al., 2020). By observing the outcomes of an RL algorithm, such attacks can infer facts about the training environments, namely whether a particular environment was used during training. For example, Pan et al. (2019), Wang et al. (2019), and Chen et al. (2021) show that MIAs can infer users' vehicle routes or room layouts.

Most, if not all, existing MIA methods suffer from high computational complexity or rely on unrealistic assumptions. For example, the methods in Pan et al. (2019) and Yang et al. (2021) rely on observing the learned policies. Both are computationally inefficient because they must learn separate policies for each environment the attacker wants to infer. The methods in Gomrokchi et al. (2021; 2020) do not require learning additional policies for different environments, but they assume that the attacker has full access to the RL algorithm, including the states, transitions, actions, and rewards on which the algorithm relies.

We propose a new black-box MIA called CriticAttack that alleviates the computational burden and relaxes the unrealistic assumptions of existing work. CriticAttack trains one set of policies for all environments, as opposed to one set of policies per environment (e.g., Yang et al. (2021)). It makes inferences based only on the values generated by the value function and the expected rewards, in contrast to the states, transitions, actions, and rewards required by existing work (e.g., Gomrokchi et al. (2021)). We empirically show that CriticAttack achieves 90% accuracy in inferring environments from the MiniGrid library (Chevalier-Boisvert et al., 2018).

We perform the MIA on a state-of-the-art actor-critic RL algorithm (Schulman et al., 2017). The actor-critic algorithm trains two components: an actor and a critic. The actor generates policies that determine an RL agent's actions.
The critic learns a value function that evaluates the policies by predicting the expected rewards, also known as rewards-to-go. The actor and the critic typically memorize their training environments (Haarnoja et al., 2018; Raichuk et al., 2021). Hence, we expect a high correlation between the values and the expected rewards in a training environment. On multiple RL tasks, CriticAttack achieves 90% attack accuracy, significantly higher than the 50% random-guessing accuracy. Such high attack accuracy indicates a severe privacy vulnerability of the value function.

We then turn our attention to defending against CriticAttack. We design a simple and efficient defense method called CriticDefense that concentrates on the value function: it adds uniform noise to the value function to reduce the correlation between the values and the rewards-to-go. Inserting noise, however, introduces a trade-off between the attack accuracy and the agent's performance, e.g., as measured by the cumulative reward the agent obtains. CriticDefense reduces the attack accuracy from 90% to 60% while degrading the agent's performance by no more than 10%.

Furthermore, we provide empirical evidence that the correlation between the values and the rewards-to-go is the primary source of the privacy vulnerability. Because of the exploitation behavior of RL, agents tend to revisit states experienced during training, and the value function predicts rewards-to-go accurately on those states. Hence the correlation computed in a training environment is significantly higher than in a test environment, and this high correlation leads to high attack accuracy.

The optimized value function also plays a key role in transfer learning and the teacher-student framework. Many well-known transfer learning algorithms for actor-critic require the source agents to release their optimized value functions (Xu et al., 2020; Zhang & Whiteson, 2019; Takano et al., 2010).
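The correlation test at the core of CriticAttack can be sketched as follows. This is a minimal illustration, not the paper's implementation: the helper names, the use of Pearson correlation, and the decision threshold of 0.8 are all assumptions for exposition.

```python
import numpy as np

def rewards_to_go(rewards, gamma=0.99):
    """Discounted cumulative future reward G_t for each step t of a trajectory."""
    g, out = 0.0, []
    for r in reversed(rewards):
        g = r + gamma * g
        out.append(g)
    return np.array(out[::-1])

def critic_attack(values, rewards, threshold=0.8, gamma=0.99):
    """Infer membership from one trajectory: a high correlation between the
    critic's values V(s_t) and the empirical rewards-to-go G_t suggests the
    environment was seen during training (illustrative threshold)."""
    g = rewards_to_go(rewards, gamma)
    corr = np.corrcoef(values, g)[0, 1]
    return bool(corr >= threshold)  # True -> predict "member"
```

On a training environment the critic's predictions track the realized rewards-to-go closely, pushing the correlation toward 1; on an unseen environment the correlation is markedly lower, which is what the attack exploits.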
In the teacher-student framework, the student agents learn optimal policies from the teacher's policies and value functions (Kurenkov et al., 2019). Therefore, it is essential to consider the privacy implications of the value function.

2. RELATED WORK

Pan et al. (2019) and Yang et al. (2021) develop MIA methods for deep RL that collect policies or actions for inference, whereas CriticAttack collects only the values from the value function and the cumulative reward for membership inference. Gomrokchi et al. (2021; 2020) introduce two MIA methods that infer the roll-out trajectories of off-policy RL algorithms, which learn the optimal policy independently of the agent's actions. In contrast, CriticAttack works for on-policy RL algorithms, which optimize the policies that determine what actions to take.

From the defense perspective, several works (Garcelon et al., 2021; Lebensold et al., 2019b; Liao et al., 2021; Balle et al., 2016b; Chen et al., 2021) enforce differential privacy in the RL algorithm, which can protect against MIAs. In contrast to these differential-privacy mechanisms, we design CriticDefense to protect the value function specifically. CriticDefense provides robust protection against attacks on the value function; however, it has limited ability to protect other components of the algorithm and does not achieve differential privacy.
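The noise insertion behind CriticDefense can be sketched as a thin wrapper around the value function. The wrapper shape and the noise scale `b` are assumptions for illustration; the paper's actual parameterization may differ.

```python
import random

def critic_defense(value_fn, b=0.5, rng=None):
    """Wrap a value function so each query returns V(s) + Uniform(-b, b).

    The uniform noise weakens the correlation between released values and
    rewards-to-go that CriticAttack exploits. A larger b lowers the attack
    accuracy but degrades any downstream use of the values, which is the
    trade-off discussed above.
    """
    rng = rng or random.Random()
    def noisy_value(state):
        return value_fn(state) + rng.uniform(-b, b)
    return noisy_value
```

For example, `critic_defense(critic, b=0.5)` yields a drop-in replacement that an agent could release to transfer-learning or teacher-student consumers instead of the raw critic.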

3. PRELIMINARY

Reinforcement Learning (RL) is an area of machine learning in which we train an agent, or a set of agents, through interaction with a set of environments. The agent observes a state from the environment, takes an action based on its policy π, and receives a reward from the environment that evaluates this action.




We formally define an environment as a Markov decision process (MDP) E = {S, A, P, I, R}, where S and A are the sets of states and actions, P : S × A → S is the state transition function, I : S → [0, 1] is the initial distribution over states, and R : S → R is the reward function. We treat the set of environments on which the agent is trained as its training dataset, which may face privacy threats.


Actor-Critic (Konda & Tsitsiklis, 1999) is a family of state-of-the-art RL algorithms that train two components: an actor and a critic. The actor, with parameters θ, takes the current state representation and all possible actions as input and generates a policy π_θ. The critic, with parameters φ, learns a value function V^π(s), which takes the current state observation as input and outputs a value that evaluates the actions leading to the current state. We present the details of the actor-critic algorithm in the Appendix.
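A minimal tabular actor-critic update consistent with the description above might look as follows. The one-step TD error, the softmax parameterization, and the learning rates are standard textbook choices, not details taken from this paper.

```python
import math

def softmax_policy(theta, s, actions):
    """Actor: pi_theta(a|s) computed from per-(state, action) preferences."""
    prefs = [theta[(s, a)] for a in actions]
    m = max(prefs)                       # subtract max for numerical stability
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]

def actor_critic_step(theta, V, s, a, r, s_next, actions,
                      gamma=0.99, alpha=0.1, beta=0.1):
    """One-step update: the critic moves V(s) toward the TD target, and the
    actor raises the probability of actions whose TD error is positive."""
    td_error = r + gamma * V[s_next] - V[s]        # delta = r + gamma*V(s') - V(s)
    V[s] += beta * td_error                         # critic update
    probs = softmax_policy(theta, s, actions)
    for p, act in zip(probs, actions):
        grad = (1.0 if act == a else 0.0) - p       # grad of log pi under softmax
        theta[(s, act)] += alpha * td_error * grad  # actor update
    return td_error
```

After a rewarding transition the preference for the taken action rises and the others fall, while V(s) drifts toward the observed reward-to-go; it is this learned V that CriticAttack later queries.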

