HIDDEN INCENTIVES FOR AUTO-INDUCED DISTRIBUTIONAL SHIFT

Abstract

Decisions made by machine learning systems have increasing influence on the world, yet it is common for machine learning algorithms to assume that no such influence exists. An example is the use of the i.i.d. assumption in content recommendation. In fact, the (choice of) content displayed can change users' perceptions and preferences, or even drive them away, causing a shift in the distribution of users. We introduce the term auto-induced distributional shift (ADS) to describe the phenomenon of an algorithm causing a change in the distribution of its own inputs. Our goal is to ensure that machine learning systems do not leverage ADS to increase performance when doing so could be undesirable. We demonstrate that changes to the learning algorithm, such as the introduction of meta-learning, can cause hidden incentives for auto-induced distributional shift (HI-ADS) to be revealed. To address this issue, we introduce 'unit tests' and a mitigation strategy for HI-ADS, as well as a toy environment for modelling real-world issues with HI-ADS in content recommendation, where we demonstrate that strong meta-learners achieve gains in performance via ADS. We show that meta-learning and Q-learning both sometimes fail the unit tests, but pass when our mitigation strategy is used.

1. INTRODUCTION

Consider a content recommendation system whose performance is measured by accuracy of predicting what users will click. This system can achieve better performance by either 1) making better predictions, or 2) changing the distribution of users such that predictions are easier to make. We propose the term auto-induced distributional shift (ADS) to describe this latter kind of distributional shift, caused by the algorithm's own predictions or behaviour (Figure 1). ADS is not inherently bad, and is sometimes even desirable. But it can cause problems if it occurs unexpectedly. It is typical in machine learning (ML) to assume (e.g. via the i.i.d. assumption) that (2) will not happen. However, given the increasing real-world use of ML algorithms, we believe it is important to model and experimentally observe what happens when assumptions like this are violated. This is the motivation of our work. In many cases, including news recommendation, we would consider (2) a form of cheating: the algorithm changed the task rather than solving it as intended. We care about which means the algorithm uses to solve the problem (e.g. (1) and/or (2)), but we only told it about the ends, so it didn't know not to 'cheat'. This is an example of a specification problem (Leike et al., 2017; Ortega et al., 2018): a problem which arises from a discrepancy between the performance metric (maximize accuracy) and "what we really meant": in this case, to maximize accuracy via (1). Ideally, we'd like to quantify the desirability of all possible means, e.g. assign appropriate rewards to all potential strategies and "side-effects", but this is intractable for real-world settings. Using human feedback to learn reward functions which account for such impacts is a promising approach to specifying desired behavior (Leike et al., 2018; Christiano et al., 2017).
But the same issue can arise whenever human feedback is used in training: one means of improving performance could be to alter human preferences, making them easier to satisfy. Thus in this work, we pursue a complementary approach: managing learners' incentives. A learner has an incentive to behave in a certain way when doing so can increase performance (e.g. accuracy or reward). Informally, we say an incentive is hidden when the learner behaves as if it were not present. But we note that changes to the learning algorithm or training regime could cause previously hidden incentives to be revealed, resulting in unexpected and potentially undesirable behaviour. Managing incentives (e.g. controlling which incentives are hidden/revealed) can allow algorithm designers to disincentivize broad classes of strategies (such as any that rely on manipulating human preferences) without knowing their exact instantiation.¹

The goal of our work is to provide insights and practical tools for understanding and managing incentives, specifically hidden incentives for auto-induced distributional shift (HI-ADS). To study which conditions cause HI-ADS to be revealed, we present unit tests for detecting HI-ADS in supervised learning (SL) and reinforcement learning (RL). We also create an environment that models ADS in news recommendation, illustrating possible effects of revealing HI-ADS in this setting. The unit tests both have two means by which the learner can improve performance: one which creates ADS and one which does not. The intended method of improving performance is the one that does not induce ADS; the other is 'hidden' and we want it to remain hidden. A learner "fails" the unit test if it nonetheless pursues the incentive to increase performance via ADS. In both the RL and SL unit tests, we find that introducing an outer loop of meta-learning (e.g. Population-Based Training (PBT; Jaderberg et al., 2017)) can lead to high levels of failure.
Similarly, recommender systems trained with PBT induce larger shifts in the user base and user interests. These results suggest that failure of our unit tests indicates that an algorithm is prone to revealing HI-ADS in other settings. Finally, we propose and test a mitigation strategy we call context swapping. The strategy consists of rotating learners through different environments throughout learning, so that no learner observes the longer-term results of its own actions in any single environment. This effectively mitigates HI-ADS in our unit test environments, but did not work well in our content recommendation experiments.
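The rotation underlying context swapping can be sketched as a simple round-robin schedule. This is a minimal illustration under the assumption that learners are cycled one environment per step; the function name `context_swap_schedule` is hypothetical and the paper's actual implementation may differ in details such as the rotation interval:

```python
def context_swap_schedule(n_learners: int, n_steps: int) -> list:
    """Round-robin assignment of learners to environments.

    At step t, learner i acts in environment (i + t) % n_learners,
    so no learner stays in one environment long enough to observe
    (and thus be rewarded for) the distributional shift it induces.
    Returns, for each step, the environment index of each learner.
    """
    return [[(i + t) % n_learners for i in range(n_learners)]
            for t in range(n_steps)]
```

For example, with 3 learners, learner 0 visits environments 0, 1, 2, 0, ... on successive steps, and each environment is always occupied by exactly one learner.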

2. BACKGROUND

2.1 META-LEARNING AND POPULATION-BASED TRAINING

Meta-learning is the use of machine learning techniques to learn machine learning algorithms. This involves instantiating multiple learning scenarios which run in an inner loop (IL), while an outer loop (OL) uses the outcomes of the inner loop(s) as data points from which to learn which learning algorithms are most effective (Metz et al., 2019). The number of IL steps per OL step is called the interval. Many recent works focus on multi-task meta-learning, where the OL seeks to find learning rules that generalize to unseen tasks by training the IL on a distribution of tasks (Finn et al., 2017). Single-task meta-learning includes learning an optimizer for a single task (Gong et al., 2018), and adaptive methods for selecting models (Kalousis, 2000) or setting hyperparameters (Snoek et al., 2012). For simplicity, in this initial study we focus on single-task meta-learning. Population-based training (PBT; Jaderberg et al., 2017) is a meta-learning algorithm that trains multiple learners L1, ..., Ln in parallel, after each interval (T steps of IL) applying an evolutionary OL step which consists of: (1) evaluate the performance of each learner; (2) replace both parameters and hyperparameters of the 20% lowest-performing learners with copies of those from the 20% highest-performing learners; (3) randomly perturb the copied hyperparameters.
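The evolutionary OL step can be sketched as follows. This is a minimal sketch, not the authors' implementation: `evaluate` and `perturb` are assumed callbacks supplied by the user, and each learner is represented as a plain dict of parameters and hyperparameters:

```python
import copy
import random

def pbt_outer_step(learners, evaluate, perturb, frac=0.2):
    """One outer-loop step of Population-Based Training (sketch).

    learners: list of dicts with 'params' and 'hypers' entries.
    evaluate: callback mapping a learner to a scalar performance.
    perturb:  callback that randomly perturbs hyperparameters.
    """
    scores = [evaluate(l) for l in learners]
    order = sorted(range(len(learners)), key=lambda i: scores[i])
    k = max(1, int(frac * len(learners)))
    worst, best = order[:k], order[-k:]
    for i in worst:
        j = random.choice(best)
        # exploit: copy both parameters and hyperparameters
        learners[i] = copy.deepcopy(learners[j])
        # explore: randomly perturb the copied hyperparameters
        learners[i]['hypers'] = perturb(learners[i]['hypers'])
    return learners
```

The exploit step copies from the top fraction of the population, and the explore step perturbs only the hyperparameters, so the inner loop continues training the copied parameters under slightly different settings.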



¹Note that removing or hiding an incentive for a behavior is different from prohibiting that behavior, which may still occur incidentally. In particular, not having a (revealed) incentive for behaviors that change a human's preferences is not the same as having a (revealed) incentive for behaviors that preserve a human's preferences. The first is often preferable; we don't want to prevent changes in human preferences that occur "naturally", e.g. as a result of good arguments or evidence.



Figure 1: Distributions of users over time. Left: A distribution which remains constant over time, following the i.i.d. assumption. Right: Auto-induced distributional shift (ADS) results in a change in the distribution of users in our content recommendation environment (see Section 5.2 for details).

