IN THE ZONE: MEASURING DIFFICULTY AND PROGRESSION IN CURRICULUM GENERATION

Abstract

A common strategy in curriculum generation for reinforcement learning is to train a teacher network to generate tasks that enable student learning. But what kind of tasks enable this? One answer is tasks belonging to a student's zone of proximal development (ZPD), a concept from developmental psychology. These are tasks that are neither too easy nor too hard for the student. Although intuitive, ZPD is not well understood computationally. We propose ZONE, a novel computational framework that operationalizes ZPD. It formalizes ZPD through the language of Bayesian probability theory, revealing that tasks should be selected by difficulty (the student's probability of task success) and learning progression (the degree of change in the student's model parameters). ZONE instantiates two techniques that constrain the teacher to pick tasks within the student's ZPD. One is REJECT, which rejects tasks outside of a difficulty range, and the other is GRAD, which prioritizes tasks that maximize the student's gradient norm. We apply these techniques to existing curriculum learning algorithms. We show that they improve the student's generalization performance on discrete MiniGrid environments and continuous control MuJoCo domains with up to 9× higher success. ZONE also accelerates the student's learning by training with 10× less data.

1. INTRODUCTION

Many reinforcement learning (RL) problems require designing a distribution of tasks to train effective policies for the real world (Taylor & Stone, 2009). However, designing the full space of tasks is challenging: the real world is complicated, and specifying every edge task is impractical or even infeasible for certain domains (Wang et al., 2019; Parker-Holder et al., 2022b). These tasks might also be onerous for the agent to solve without scaffolding. Training the RL agent on all possible tasks is intractable under a limited training budget (Schmidhuber, 2013; Narvekar; Zeng et al., 2022; Florensa et al., 2018a).

The state-of-the-art approaches to this problem use multi-agent curriculum generation algorithms to automate task generation. A teacher agent learns to generate tasks judiciously to train a student agent (Dennis et al., 2020; Du et al., 2022; Campero et al., 2020; Florensa et al., 2018a; Portelas et al., 2020; Matiisen et al., 2020). The teacher is rewarded based on the difficulty of the tasks it generates. Even though prior methods share intuitions about the teacher objective, there is no framework for understanding the kind of tasks that best enable student learning and how the teacher should be rewarded to this end. These challenges suggest that we should reassess how we formalize and operationalize the objective for the teacher. Thus, our work addresses the following question: What kind of tasks should the teacher generate, and how should the teacher be rewarded?

Through the lens of developmental psychology, one answer is that the teacher should be incentivized to generate tasks within the student's zone of proximal development (ZPD) (Vygotsky & Cole, 1978; Vygotsky, 2012; Cole et al., 2005; Shabani et al., 2010). These tasks have two properties: they are within the student's difficulty range, and they accelerate the student's learning progression. Although intuitive and widely known, ZPD lacks a computational framework. This makes operationalizing ZPD in the teaching setting difficult.

Our work proposes ZONE as a computational framework that operationalizes ZPD. ZONE isolates the two properties of ZPD tasks into techniques that constrain the teacher to generate tasks within a student's ZPD. The first technique is REJECT, which omits training on tasks that fall outside the student's current ability zone; this is akin to rejection sampling. The second technique is GRAD, which rewards the teacher for generating tasks that maximize the norm of the student network's gradient. These are the tasks that induce the largest changes in the student's current model. We apply these techniques to two popular curriculum generation algorithms, PAIRED (Dennis et al., 2020) and Goal GAN (Florensa et al., 2018b), and show that ZONE accelerates student learning on a suite of discrete and continuous environments. We then investigate how these two techniques impact the student's and teacher's learning.

In summary, our work's contributions are the following:

1. We propose ZONE, a novel computational framework that formalizes the zone of proximal development (ZPD) (Vygotsky & Cole, 1978) with Bayesian probability theory.
2. ZONE operationalizes ZPD with two techniques: REJECT, which rejects tasks that fall outside the student's difficulty range, and GRAD, which rewards the teacher for generating tasks that maximize the student's gradient norms.
3. We show that REJECT and GRAD improve the student's generalization performance and learning speed across a variety of discrete and continuous environments.
4. We investigate how the ZONE techniques improve the teacher's ability to generate ZPD tasks for the student.
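
To make the two techniques concrete, the following is a minimal Python sketch under stated assumptions, not the implementation used in this work. The function names `reject` and `grad_norm_reward`, the thresholds `low` and `high`, the scalar-difficulty tasks, and the toy logistic student are all hypothetical stand-ins; an actual student would be an RL policy network whose success probability is estimated from rollouts and whose gradient comes from its policy loss.

```python
import math


def reject(tasks, success_prob, low=0.2, high=0.8):
    """REJECT-style filter: keep tasks whose success probability is in [low, high].

    `low` and `high` are hypothetical thresholds chosen for illustration.
    """
    return [task for task in tasks if low <= success_prob(task) <= high]


def grad_norm_reward(weights, features, label):
    """GRAD-style teacher reward: L2 norm of the student's loss gradient.

    The student here is a toy logistic model with cross-entropy loss, a
    stand-in for an RL policy network and its policy loss.
    """
    logit = sum(w * x for w, x in zip(weights, features))
    prob = 1.0 / (1.0 + math.exp(-logit))
    # Gradient of the cross-entropy loss w.r.t. the weights is (prob - label) * x.
    grad = [(prob - label) * x for x in features]
    return math.sqrt(sum(g * g for g in grad))


# Toy usage: each task is a scalar difficulty d, and the student is assumed
# to succeed with probability 1 - d.
tasks = [0.05, 0.5, 0.95]
kept = reject(tasks, success_prob=lambda d: 1.0 - d)

# An untrained student (zero weights) predicts 0.5 on every task.
reward = grad_norm_reward(weights=[0.0, 0.0], features=[3.0, 4.0], label=1.0)
```

In this toy setting, only the task of intermediate difficulty survives the filter, and the teacher's reward grows with how strongly a task would update the student's parameters.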

Zone of proximal development (ZPD)

The objective of a teacher is to help a student learn effectively and to maximize the student's long-term performance. Prior work in developmental psychology and personalized education argues that a teacher should aid students with problems that fall within their zone of proximal development (ZPD) (Vygotsky & Cole, 1978). ZPD problems are ones that are neither too easy (where the student does not need the teacher's assistance) nor too difficult (where the student could not solve them even with the teacher's assistance). Students benefit most from the teacher's scaffolding, such as a curriculum, when it targets ZPD problems (Warford, 2011; Wass & Golding, 2014). Vygotsky's idea has shaped how we think about teaching human students in different domains, such as the sciences (Chounta et al., 2017; Vainas et al., 2019) or foreign languages (Mu et al., 2021). It has also shaped our understanding of how young infants acquire knowledge over time (Diaz et al., 1991; Wass & Golding, 2014).

Teacher-student curriculum generation

Teacher-student curriculum generation is a longstanding paradigm for accelerating the training and generalization of RL agents (Matiisen et al., 2020; Sukhbaatar et al., 2018; Florensa et al., 2017; 2018b; Zhang et al., 2020; Parker-Holder et al., 2022b; Dennis et al., 2020; Jiang et al., 2021b; Portelas et al., 2020; Fang et al., 2020; Soviany et al., 2022). These algorithms reward the teacher based on measures of difficulty, such as the student's return. Despite this similarity, there is little work on formalizing what makes a good generated task and how to align the teacher's objective to this end. This contrasts with work in developmental psychology, which suggests that the teacher's pedagogical interventions be based on the student's needs and progress.

Our work focuses on two popular curriculum generation algorithms: PAIRED (Dennis et al., 2020) and Goal GAN (Florensa et al., 2018b). PAIRED is a successful regret-based algorithm applied to discrete domains, where an adversarial teacher generates tasks that maximize the regret between a student and a competing student (the anti-student). Goal GAN is a similarly successful algorithm designed for continuous control domains, where a generative adversarial network (GAN) (Goodfellow et al., 2020) teacher generates 2D navigation goals. We focus on these algorithms because several works use them as either a competitive baseline or the basis of their own algorithms (Jiang et al., 2021a; Gur et al., 2021; Du et al., 2022; Parker-Holder et al., 2022a; Zhang et al., 2020). Additionally, the algorithms cover different properties of prior work, such as domains (discrete vs. continuous), difficulty criteria (static vs. dynamic), and teacher objectives (regret vs. non-regret).

Alternatively, prior intrinsic reward methods can enable generalization by supplementing the extrinsic reward with an additional reward that incentivizes exploration of the environment (Campero et al., 2020; Zhang et al., 2021; Raileanu & Rocktäschel, 2019). However, our work focuses on the teacher determining the student's curriculum and considers these methods out of scope.
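
As an illustration of the regret signal that PAIRED's teacher maximizes, the sketch below assumes regret on a task is estimated as the anti-student's best episode return minus the student's average episode return. This estimator, the function name `paired_regret`, and the example returns are assumptions made for illustration and are not specified in this section.

```python
def paired_regret(student_returns, anti_student_returns):
    """Estimate regret on one task from episode returns of both students.

    Assumed estimator: best anti-student return minus mean student return.
    """
    mean_student = sum(student_returns) / len(student_returns)
    return max(anti_student_returns) - mean_student


# Hypothetical returns on a single generated task.
regret = paired_regret(student_returns=[0.0, 0.5], anti_student_returns=[0.25, 1.0])
```

Under this estimate, a task that neither student can solve yields zero regret, so the teacher gains nothing from generating impossible tasks.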

