VALUE-BASED MEMBERSHIP INFERENCE ATTACK ON ACTOR-CRITIC REINFORCEMENT LEARNING

Abstract

In actor-critic reinforcement learning (RL), the actor computes candidate policies and the critic computes a value function that evaluates those policies. Such RL algorithms may be vulnerable to membership inference attacks (MIAs), a class of privacy attacks that infer data membership, i.e., whether a specific data record belongs to the training dataset. We investigate the vulnerability of the value function in actor-critic methods to MIAs. We develop CriticAttack, a new MIA that targets black-box RL agents by examining the correlation between the expected reward and the value function. We empirically show that CriticAttack correctly infers the membership of approximately 90% of the training data, i.e., it achieves 90% attack accuracy. This is far above the 50% accuracy of random guessing, indicating a severe privacy vulnerability of the value function. To defend against CriticAttack, we design a method called CriticDefense that adds uniform noise to the value function. CriticDefense reduces the attack accuracy to 60% without significantly affecting the agent's performance.

1. INTRODUCTION

Membership inference attacks (MIAs) pose privacy risks to reinforcement learning (RL) algorithms (Gomrokchi et al., 2020). Such attacks infer information about the training environments, i.e., whether a particular environment was used in training, by observing the outputs of an RL algorithm. Most, if not all, existing MIA methods suffer from high computational complexity or rely on unrealistic assumptions. For example, the methods in Pan et al. (2019) and Yang et al. (2021) rely on observing the learned policies; both are computationally inefficient because they must learn separate policies for each environment the attacker wants to infer. The methods in Gomrokchi et al. (2021; 2020) do not require learning additional policies for different environments, but assume the attacker has full access to the RL algorithm, including the states, transitions, actions, and rewards on which the algorithm relies.

We propose a new black-box MIA called CriticAttack that alleviates the computational burden and relaxes the unrealistic assumptions made in prior work. CriticAttack trains one set of policies for all environments, as opposed to one set of policies per environment (e.g., Yang et al. (2021)). It makes inferences based only on the values generated by the value function and the expected rewards, in contrast to the states, transitions, actions, and rewards required by existing work (e.g., Gomrokchi et al. (2021)). We empirically show that CriticAttack achieves 90% accuracy in inferring environments from the MiniGrid library (Chevalier-Boisvert et al., 2018).

We perform the MIA on a state-of-the-art actor-critic RL algorithm (Schulman et al., 2017). The actor-critic algorithm trains two components: an actor and a critic. The actor generates policies that determine an RL agent's actions. The critic learns a value function that evaluates the policies by predicting the expected rewards, also known as rewards-to-go.
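The rewards-to-go that the critic is trained to predict are the discounted sums of future rewards along a trajectory. As a concrete illustration (the discount factor below is a standard choice, not a value taken from this paper), they can be computed as:

```python
import numpy as np

def rewards_to_go(rewards, gamma=0.99):
    """Discounted rewards-to-go for one trajectory: the target
    quantity the critic's value function learns to predict at
    each timestep. gamma is an illustrative discount factor."""
    rtg = np.zeros(len(rewards))
    running = 0.0
    # Accumulate backwards so each step adds its reward to the
    # discounted sum of everything that follows it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg
```

With `gamma=1.0`, for rewards `[1, 1, 1]` this yields `[3, 2, 1]`: each entry is the undiscounted sum of the remaining rewards.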
The actor and the critic typically memorize their training environments (Haarnoja et al., 2018; Raichuk et al., 2021) . Hence, we expect a high correlation between the values and the expected rewards from a training environment. On multiple RL tasks, CriticAttack achieves 90% attack accuracy, significantly higher than the 50% random guessing accuracy.
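This correlation intuition suggests a simple membership signal. The following is a hedged sketch of such a test, not the paper's exact procedure: `membership_score`, `infer_membership`, and the decision threshold are hypothetical names and values introduced here for illustration.

```python
import numpy as np

def membership_score(values, returns):
    """Pearson correlation between the critic's value predictions
    and the empirical rewards-to-go observed in an environment.
    A high correlation suggests the critic has memorized that
    environment, i.e., it was likely part of training."""
    values = np.asarray(values, dtype=float)
    returns = np.asarray(returns, dtype=float)
    return np.corrcoef(values, returns)[0, 1]

def infer_membership(values, returns, threshold=0.8):
    # The threshold is an assumption for illustration; in
    # practice an attacker would calibrate it, e.g., against
    # environments known to be outside the training set.
    return membership_score(values, returns) >= threshold
```

A black-box attacker only needs the value outputs and the rewards collected while rolling out the agent, matching the limited access the paper assumes.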

For example, Pan et al. (2019), Wang et al. (2019), and Chen et al. (2021) show that MIAs can infer users' vehicle routes or room layouts.

