USING DEEP REINFORCEMENT LEARNING TO TRAIN AND EVALUATE INSTRUCTIONAL SEQUENCING POLICIES FOR AN INTELLIGENT TUTORING SYSTEM

Abstract

We present STEP, a novel Deep Reinforcement Learning solution to the problem of learning instructional sequencing. STEP has three components: (1) simulate the tutor, by specifying what to sequence, and the student, by fitting a knowledge tracing model to data logged by an intelligent tutoring system; (2) train instructional sequencing policies using Proximal Policy Optimization; (3) evaluate the learned policies by estimating their local and global impact on learning gains. STEP leverages the student model by representing the student's knowledge state as a vector of probabilities of knowing each skill, and by using the student's estimated learning gains as the reward function for evaluating candidate policies. A learned policy maps each state to the action that maximizes the reward, i.e., the upward distance to the next state in this multi-dimensional space. We use STEP to discover and evaluate potential improvements to a literacy and numeracy tutor used by hundreds of children in Tanzania.

1. INTRODUCTION

An Intelligent Tutoring System (ITS) aims to teach a set of skills to users by individualizing instruction. Instruction involves many sequential decisions, such as what to teach, what activities to present, what problems to include, and what help to give. Our aim is to make decisions that maximize long-term rewards in the form of learning gains, so Reinforcement Learning (RL) is a natural approach to pursue, and was first proposed for this purpose by Liu (1960). The goal of an RL agent is to learn a policy π, defined as a mapping from state space S to action space A. Given any state, the RL agent follows the series of actions proposed by the learned policy to maximize the long-term expected reward. In the context of an ITS, we specify the RL agent as follows:

• State s_t: the combination of the student state and the tutor state. The tutor state determines the set of actions available to the RL agent at a given timestep. We represent the student state as a vector of probabilities whose element i is the estimated probability that the student knows skill i.

• Action a_t: a tutor decision at a particular grain size.

• Reward r_t(s_t, a_t): the average difference between the prior and posterior knowledge states, based on the simulated student's response to tutor action a_t.

• Next state s_{t+1}: the student knowledge vector after a Bayesian update based on the simulated student's response to tutor action a_t in state s_t, together with the updated tutor state given by the tutor simulator.

We instantiate STEP in the context of RoboTutor, a Finalist in the Global Learning XPRIZE Competition to develop an open-source Android tablet tutor that teaches basic literacy and numeracy to children without requiring adult intervention.
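The state update and reward above can be sketched in a few lines of code. This is a minimal illustration only: it assumes a standard Bayesian Knowledge Tracing posterior update with made-up guess/slip/learn parameters, and `bkt_update` and `step` are hypothetical stand-ins for the paper's actual student model; the tutor-state part of s_{t+1} is omitted for brevity.

```python
import numpy as np

def bkt_update(p_know, correct, p_guess=0.1, p_slip=0.1, p_learn=0.2):
    """Posterior probability of knowing one skill after an observed response
    (standard Bayesian Knowledge Tracing update; parameters are illustrative)."""
    if correct:
        cond = p_know * (1 - p_slip) / (p_know * (1 - p_slip) + (1 - p_know) * p_guess)
    else:
        cond = p_know * p_slip / (p_know * p_slip + (1 - p_know) * (1 - p_guess))
    return cond + (1 - cond) * p_learn  # chance to learn from the attempt

def step(knowledge, skill_idx, correct):
    """One transition: update the exercised skill, then compute the reward
    r_t as the average difference between posterior and prior knowledge."""
    next_knowledge = knowledge.copy()
    next_knowledge[skill_idx] = bkt_update(knowledge[skill_idx], correct)
    reward = float(np.mean(next_knowledge - knowledge))
    return next_knowledge, reward

knowledge = np.array([0.3, 0.5, 0.7])                  # P(student knows skill i)
next_knowledge, r = step(knowledge, 0, correct=True)   # correct response raises skill 0
```

A correct response raises the exercised skill's estimate (and hence the mean gain r_t), while other skills are unchanged; an incorrect response can lower it.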
XPRIZE independently field-tested the Swahili version of RoboTutor for 15 months in 28 villages in Tanzania. Figure 1 shows a diagrammatic overview of STEP, and the rest of the paper is organized as follows. Section 2 discusses the simulation of the tutor and student (the environment block). Section 3 elaborates on the training of decision policies (the RL agent block). Section 4 evaluates the learned policies. Section 5 relates this work to prior research. Section 6 concludes.

Figure 1: The RL setup for STEP

2. SIMULATING THE TUTOR AND THE STUDENT

To apply RL, we need to simulate the tutor's actions and the student's responses to them.

2.1. TUTOR SIMULATOR

The data for this paper comes from the version of RoboTutor used during the last 3 months of XPRIZE's 15-month field study. This version rotates through three content areas (literacy, numeracy, and stories), tracking the child's position in each area's curricular sequence of successively more advanced activities. It lets the child select among doing the activity at that position, advancing to the next activity, repeating the same activity (from the previous content area), or exiting RoboTutor. After selecting an activity, the child may complete all or part of it before selecting the next activity. RoboTutor has 1710 learning activities, each of which gives assisted practice of one or more skills on a sequence of items, such as letters or words to write, number problems to solve, or sentences to read. Each item requires one or more steps, and each step may take one or more attempts.

Figure 2: Thresholds on percentage of correct attempts and their effects on tutor decisions.

The simulated tutor state identifies the current content area and the child's position in it. RoboTutor (actual or simulated) updates the position in the content area based on the percentage of correct attempts to perform the steps in an activity. Specifically, it uses fixed heuristic thresholds (called
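The heuristic position update can be sketched as follows. The threshold values and the function name `update_position` are illustrative assumptions, not RoboTutor's actual constants (which are named in the continuation of this section); the sketch only shows the general shape of a threshold-based advance/stay/back decision.

```python
def update_position(position, pct_correct, low=0.5, high=0.9, n_activities=1710):
    """Move the child's position in a content area's curricular sequence
    based on the percentage of correct attempts in the completed activity.
    Thresholds `low` and `high` are hypothetical placeholders."""
    if pct_correct >= high:
        # High accuracy: advance to the next, more advanced activity.
        return min(position + 1, n_activities - 1)
    if pct_correct < low:
        # Low accuracy: step back to an easier activity for more practice.
        return max(position - 1, 0)
    # Intermediate accuracy: stay at the same position.
    return position
```

For example, with these placeholder thresholds, 95% correct advances the position, 30% correct moves it back, and 70% correct leaves it unchanged.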

