PLAY TO GRADE: GRADING INTERACTIVE CODING GAMES AS CLASSIFYING MARKOV DECISION PROCESS

Abstract

Contemporary coding education often presents students with the task of developing interactive programs with complex dynamics, such as mouse-based games. While pedagogically compelling, such programs require dynamic user input, making them difficult to grade with unit tests. In this paper we formalize the challenge of grading interactive programs as a task of classifying Markov Decision Processes (MDPs). Each student's program fully specifies an MDP in which an agent must operate and decide, under reasonable generalization, whether the dynamics and reward model of the input MDP conform to a set of latent MDPs. We demonstrate that by experiencing a handful of latent MDPs millions of times, we can use the agent to sample trajectories from the input MDP and use a classifier to determine membership. Our method drastically reduces the amount of data needed to train an automatic grading system for interactive code assignments and presents a challenge to state-of-the-art reinforcement learning generalization methods. Together with Code.org, we curated a dataset of 700k student submissions, one of the largest datasets of anonymized student submissions to a single assignment. This Code.org assignment had no previous solution for automatically providing correctness feedback to students, and as such this contribution could lead to a meaningful improvement in the educational experience.

1. INTRODUCTION

The rise of online coding education platforms has accelerated the democratization of high-quality computer science education for millions of students each year. Corbett (2001) suggests that providing feedback to students can have an enormous impact on helping them learn efficiently and effectively. Unfortunately, contemporary coding education has a clear limitation: students can receive automatic feedback only up until they start writing interactive programs. When a student authors a program that requires user interaction, e.g., one where a user interacts with the student's program using a mouse or by clicking buttons, it becomes exceedingly difficult to grade automatically. Even for well-defined challenges, if the user has any creative discretion, or the problem involves any randomness, the task of automatically assessing the work is daunting. Yet creating more open-ended assignments for students can be particularly motivating and engaging, and can help students practice key skills that will be needed in commercial projects.

Generating feedback on interactive programs from humans is more laborious than it might seem. Though the most common student solution to an assignment may be submitted many thousands of times, even in introductory computer science education, the probability distribution of homework submissions follows a heavy-tailed Zipf distribution, the same statistical distribution that governs natural language. This makes grading exceptionally hard for contemporary AI (Wu et al., 2019) as well as for massive crowd-sourced human efforts (Code.org, 2014). While code as text has proved difficult to grade, actually running student code is a promising path forward (Yan et al., 2019).

We formulate the grading-via-playing task as classifying whether an ungraded student program, viewed as a new Markov Decision Process (MDP), belongs to a latent class of correct MDPs (representing correct programming solutions to the assignment).
Given a discrete set of environments E = {e_n = (S_n, A, R_n, P_n) : n = 1, 2, 3, ...}, we can partition them into correct and incorrect subsets. We build a classifier that determines whether e, a new input decision process, is behaviorally identical to the latent correct decision processes.

Prior work on providing feedback for code has focused on text-based syntactic analysis and on automatically constructing the solution space (Rivers & Koedinger, 2013; Ihantola et al., 2015). Such feedback centers on providing hints and cannot determine an interactive program's correctness. Other intelligent tutoring systems have focused on math or other skills that do not require creating interactive programs (Ruan et al., 2019; 2020). Note that in principle one could analyze the raw code and seek to understand whether it produces a dynamics and reward model isomorphic to those generated by a correct program. However, there are many different ways to express the same correct program, and classifying such text might require a large amount of data. As a first approach, we avoid this by instead deploying a policy and observing the resulting program behavior, thereby generating execution traces of the student's implicitly specified MDP that can be used for classification.

Main contributions of this paper:
• We introduce the reinforcement learning challenge of Play to Grade.
• We propose a baseline algorithm in which an agent learns to play a game and uses features such as total reward and anticipated reward to determine correctness.
• Our classifier obtains 93.1% accuracy on the 8,359 most frequent programs, which cover 50% of overall submissions, and 89.0% accuracy on programs submitted fewer than 5 times, a 14-19% absolute improvement over grading programs via code text.
• We will release a dataset of over 700k student submissions to support further research.
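The baseline above can be sketched in code. The following is an illustrative toy only: the `ToyBounce` environment, the random policy, and the threshold classifier are hypothetical stand-ins for the actual Bounce game, trained agent, and learned classifier, chosen to make the trajectory-feature pipeline concrete.

```python
import random
import statistics

class ToyBounce:
    """Stand-in for a student-submitted program; `bug` flips the reward model,
    mimicking an incorrect implementation of the assignment."""
    def __init__(self, bug=False):
        self.bug, self.t = bug, 0

    def reset(self):
        self.t = 0
        return 0

    def step(self, action):
        self.t += 1
        reward = -1.0 if self.bug else 1.0   # a broken program mis-rewards
        return self.t, reward, self.t >= 10  # obs, reward, done

def trajectory_features(env, policy, n_episodes=5):
    """Roll out a policy in the input MDP and summarize the resulting
    trajectories by a simple feature: mean total reward per episode."""
    totals = []
    for _ in range(n_episodes):
        obs, done, total = env.reset(), False, 0.0
        while not done:
            obs, r, done = env.step(policy(obs))
            total += r
        totals.append(total)
    return statistics.mean(totals)

policy = lambda obs: random.choice([0, 1])  # placeholder for a trained agent

# Feature value experienced on a teacher-provided correct reference program.
ref_score = trajectory_features(ToyBounce(bug=False), policy)

def grade(env, threshold=5.0):
    """Classify a program as correct if its trajectory feature is close to
    the reference; a learned classifier would replace this threshold."""
    return abs(trajectory_features(env, policy) - ref_score) < threshold
```

In practice the single mean-reward feature would be replaced by a richer feature vector (e.g., total and anticipated reward at each step) fed into a trained classifier, but the structure of the pipeline, play the input MDP, featurize trajectories, and compare against latent reference MDPs, is the same.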

2. THE PLAY TO GRADE CHALLENGE

We formulate the challenge with constraints that are often found in the real world. Given an interactive coding assignment, a teacher often has a few reference implementations of the assignment. Teachers use them to show students what a correct solution should look like. We also assume that the teacher can prepare a few incorrect implementations that represent their "best guesses" of what a wrong program would look like. To formalize this setting, we consider a set of programs, each of which fully specifies an environment and its dynamics: E = {e_n = (S_n, A, R_n, P_n) : n = 1, 2, 3, ...}. A subset of these environments are labeled by the teacher as correct or incorrect.
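One way to encode this setting is shown below. All names here are illustrative assumptions, not the authors' implementation: each submitted program fully specifies an MDP e_n = (S_n, A, R_n, P_n), and only the teacher's handful of reference programs carry labels.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class ProgramMDP:
    """A student program viewed as the MDP it implicitly specifies."""
    states: list            # S_n, the state space
    actions: list           # A, the shared action space
    reward: Callable        # R_n(s, a) -> float
    transition: Callable    # P_n(s, a) -> next state (deterministic here)
    label: Optional[bool] = None  # True: correct reference, False: teacher's
                                  # "best guess" bug, None: ungraded submission

# A two-state deterministic example standing in for a reference solution.
reference = ProgramMDP(
    states=[0, 1],
    actions=[0, 1],
    reward=lambda s, a: 1.0,
    transition=lambda s, a: (s + 1) % 2,
    label=True,
)
```

Under this encoding, the grading task is to predict `label` for the unlabeled environments by interacting with them, rather than by reading the program text that generated them.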



Figure 1: Bounce can have different "themes" for the background, paddle, and ball. There are two themes to choose from: "hardcourt" and "retro". We show all eight combinations of these themes and what the game looks like under each setting.

