MERMAIDE: LEARNING TO ALIGN LEARNERS USING MODEL-BASED META-LEARNING

Abstract

Designing mechanisms like auctions or taxation policies can be formulated as a general-sum game between a principal and a self-interested learning agent. The principal aims to induce desirable outcomes in such games and may do so, for example, by dynamically intervening on the agent's learning objective. The intervention policy should generalize well to agents with unseen learning behaviors; in the real world, the principal may know neither the agent's learning algorithm nor its rewards. Moreover, interventions may be costly, e.g., enforcing a tax might require extra labor; hence, interventions should be few-shot adaptable (requiring retraining on only a few agents at test time) and cost-efficient (using few interventions). Here, we introduce a model-based meta-learning framework to train a principal that can quickly adapt when facing out-of-distribution agents with different learning strategies and reward functions. First, in a simple Stackelberg game between the principal and a greedy agent, we show that meta-learning allows adapting to the theoretically known and appropriate Stackelberg equilibrium at meta-test time, with few interactions with the agent. Second, we show empirically that our approach yields strong meta-test time performance against bandit agents with various unseen explore-exploit behaviors. Finally, we outperform baselines that separately use either meta-learning or agent behavior modeling to learn a cost-effective intervention policy that is K-shot adaptable with only partial agent information.

1. INTRODUCTION

General-sum games provide a framework to study diverse applications involving a principal that aims to incentivize an adaptive agent (both are learners) to achieve the principal's goal, e.g., maximizing revenue in auctions (Milgrom & Milgrom, 2004), optimizing social welfare with economic policy (Zheng et al., 2022), or optimizing skill acquisition in personalized education (Maghsudi et al., 2021). In this work, we focus on a principal that directly intervenes on the rewards of the agent. For instance, a government may want to incentivize the use of environmentally-friendly ("clean") products by levying green taxes, but needs to understand how people (strategically) change their consumption behavior as taxes change. Here, existing models of human adaptation that assume rational learning (or use simplified models of bounded rationality) often do not suffice. Hence, interacting with the agents is required to learn (how they change) their behavior, but such interactions are not "free". For example, a tax policy may require effort to apply it fairly and to measure its impact on consumers.

To mitigate the need for costly real-world interactions, we can use simulations with deep reinforcement learning (RL) agents. This is an attractive solution framework: deep neural network behavioral models are expressive enough to emulate real-world entities, and simulations can be run safely and as often as needed. Moreover, we can use deep RL to learn intervention policies that are effective even in the face of complex agent behaviors in sequential general-sum games. However, this approach also faces several challenges. When deploying the learned policies in the real world, interventions can typically only be applied a few times, due to implementation costs, and rarely under identical circumstances; in contrast to simulations, we cannot reset the real world.
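To make the setting concrete, the following is a minimal illustrative sketch, not the paper's actual environment: an epsilon-greedy two-armed bandit agent learns from rewards, while the principal intervenes on those rewards by adding a subsidy to the arm it wants to incentivize (analogous to a green tax shifting consumption). All names, arm rewards, and the intervention value are assumptions chosen for illustration.

```python
import random

def simulate(intervention, agent_eps=0.1, horizon=200, seed=0):
    """Toy principal-agent loop (illustrative assumption, not the paper's
    setup): an epsilon-greedy bandit agent picks one of two arms, and the
    principal adds an `intervention` bonus to the agent's observed reward
    whenever it pulls the desired arm (arm 1). Returns the fraction of
    pulls on the desired arm."""
    rng = random.Random(seed)
    base_reward = [1.0, 0.8]   # the agent intrinsically prefers arm 0
    q = [0.0, 0.0]             # agent's running value estimates per arm
    n = [0, 0]                 # pull counts per arm
    desired_pulls = 0
    for _ in range(horizon):
        # explore-exploit choice: random arm w.p. eps, else greedy
        if rng.random() < agent_eps:
            a = rng.randrange(2)
        else:
            a = 0 if q[0] >= q[1] else 1
        # the principal intervenes on the agent's reward for arm 1
        r = base_reward[a] + (intervention if a == 1 else 0.0)
        n[a] += 1
        q[a] += (r - q[a]) / n[a]   # incremental mean update
        desired_pulls += (a == 1)
    return desired_pulls / horizon
```

Without an intervention the agent settles on its intrinsically preferred arm; a sufficiently large subsidy flips its learned preference. The paper's challenge is that the principal must learn such interventions for agents whose explore-exploit parameters (here, `agent_eps`) and base rewards are unknown and vary at test time.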
Even though principals may adapt their policies to new conditions, they cannot realistically know the true rewards or learning strategy of the agent. Hence, our goal is to learn policies in general-sum games

