AUTOMATICALLY ANSWERING AND GENERATING MACHINE LEARNING FINAL EXAMS

Abstract

We propose to answer the question "can a machine learn machine learning?" using the same criteria we use to answer a similar question: can a human learn machine learning? We automatically answer final exams in MIT's, Harvard's, and Cornell's large machine learning courses and generate new questions at a human level. Recently, program synthesis and few-shot learning solved university-level problem set questions in mathematics and STEM courses at a human level. In this work, we solve questions from final exams, which differ from problem sets in several ways: the questions are longer, have multiple parts, are more complicated, and span a broader set of topics. We provide a new dataset and benchmark of questions from machine learning final exams, along with code for automatically answering these questions and generating new ones. To make our dataset a reproducible benchmark, we use automatic checkers for multiple-choice questions, questions with numeric answers, and questions with expression answers. We evaluate a large open language model, Meta's OPT, and compare the results with OpenAI's GPT-3 and Codex. A student survey comparing the quality, appropriateness, and difficulty of machine-generated questions with human-written questions shows that, across multiple aspects, machine-generated questions are indistinguishable from human-generated questions and are suitable for final exams. We perform ablation studies comparing zero-shot learning with few-shot learning and chain-of-thought prompting, and comparing GPT-3 and OPT (pre-trained on text) with Codex (fine-tuned on code), on a range of machine learning topics, and find that few-shot learning methods perform best. We make our data and code publicly available for the machine learning community.
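The automatic checkers mentioned above can be quite simple for two of the three answer types: exact matching for multiple-choice answers and tolerance-based comparison for numeric answers. A minimal sketch follows; the function names and the tolerance value are illustrative assumptions, not the released code:

```python
import math

def check_multiple_choice(predicted, correct):
    # Exact match after normalizing case and surrounding whitespace.
    return predicted.strip().lower() == correct.strip().lower()

def check_numeric(predicted, correct, rel_tol=1e-4):
    # Accept numeric answers within a small relative tolerance of the reference.
    return math.isclose(predicted, correct, rel_tol=rel_tol)

print(check_multiple_choice(" B ", "b"))  # True
print(check_numeric(2.9999, 3.0))         # True
print(check_numeric(2.5, 3.0))            # False
```

Expression answers require checking symbolic equivalence rather than string equality, which is why a separate checker is needed for that question type.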

1. INTRODUCTION

Can a machine learn machine learning? This work presents a new dataset of machine learning final exams with 646 question parts and a benchmark of baselines using transformers, along with their respective grade performance, demonstrating that the best baseline performs at a human level. In university-level STEM courses, students complete assignments (including problem sets and labs) and exams throughout the course. Recent work has opened the door for a machine to solve course problem sets (Drori et al., 2022) using language models and few-shot learning. However, final exams remain challenging, and this work is the first to present a structured dataset of machine learning finals and a benchmark of baseline methods for answering them. Final exams differ from problem sets because they serve as a benchmark of cumulative understanding of material learned over a semester and evaluate the students' depth and breadth of expertise. Further, questions on final exams are longer, have multiple parts (around seven parts each), span a broader set of topics, and are more complicated and nuanced. Final exam questions are also multifaceted and multi-step: different parts of a single question require applying different concepts and problem-solving skills, and parts may build upon each other. While weekly problem sets focus on a single topic, finals span topics from the entire semester. Further, final questions are often story-based problems that may require mathematical modeling. Due to the time constraint of these exams, finals are also designed to test core understanding and application of course material over rote calculation. Thus, asking a machine to answer questions from finals tests whether the model is able to learn a breadth and depth of topics beyond problem sets.
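The few-shot learning approach referenced above conditions the model on a handful of solved question–answer pairs before presenting the target question. A minimal sketch of prompt assembly is shown below; the function name and example questions are illustrative, and the real prompts use course questions with worked solutions:

```python
def build_few_shot_prompt(examples, question):
    """Concatenate solved (question, answer) pairs, then the unanswered target question."""
    parts = [f"Question: {q}\nAnswer: {a}\n" for q, a in examples]
    parts.append(f"Question: {question}\nAnswer:")
    return "\n".join(parts)

# Illustrative few-shot examples (not from the dataset).
examples = [
    ("What is 2 + 2?", "4"),
    ("Name a convex surrogate loss for binary classification.", "Hinge loss"),
]
prompt = build_few_shot_prompt(examples, "What is h(10; 0) for the regressor above?")
print(prompt)
```

The completion the model produces after the final "Answer:" is then passed to the automatic checkers; in the zero-shot setting, `examples` is simply the empty list.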



All of the above holds for MIT's and Cornell's Introduction to Machine Learning classes and Harvard's Machine Learning course. These are undergraduate courses with hundreds of students each semester, making them among the largest undergraduate courses offered. Introduction to Machine Learning is a core class in the computer science program. The prerequisites for the course are Python Programming and Multivariate Calculus, with Introduction to Algorithms and Linear Algebra recommended. The class typically consists of weekly exercises, labs, quizzes, homework, a midterm, and a final exam. There were no final exams at MIT in Spring 2020 and Fall 2020 due to COVID-19.

In this work, we present a new dataset curated from final exams of MIT's and Cornell's Introduction to Machine Learning courses and Harvard's Machine Learning class, totaling 646 question parts spanning the broad range of topics in machine learning: regression, classifiers, logistic regression, features, loss functions, neural networks, convolutional neural networks (CNNs), Markov decision processes (MDPs), recurrent neural networks (RNNs), reinforcement learning, clustering, decision trees, model selection, ensemble methods, Bayesian networks, hidden Markov models (HMMs), and optimization. Our dataset covers the finals given at MIT in Fall 2017, Spring 2018, Fall 2018, Spring 2019, Fall 2019, Spring 2021, Fall 2021, and Spring 2022, at Harvard in Spring 2015 and Spring 2021, and at Cornell in Spring 2017 and Fall 2018. Due to the COVID-19 pandemic, no finals were given in the MIT course during 2020.

Table 1: MIT Spring 2021, Question 10.a: question part, solution, Codex Zero-Shot answer, program output, GPT-3 Zero-Shot answer, and GPT-3 Zero-Shot with CoT prompt answer.

Topic: Classifiers

Question Part: Given a set of data D_train = {(x⁽ⁱ⁾, y⁽ⁱ⁾)}, a weighted nearest neighbor regressor has the form

h(x; θ) = [ Σ_{(x⁽ⁱ⁾, y⁽ⁱ⁾) ∈ D_train} f(x, x⁽ⁱ⁾; θ) y⁽ⁱ⁾ ] / [ Σ_{(x⁽ⁱ⁾, y⁽ⁱ⁾) ∈ D_train} f(x, x⁽ⁱ⁾; θ) ].

A typical choice for f is f(x, x′; θ) = e^(−θ∥x − x′∥²), where θ is a scalar and ∥x − x′∥² = Σ_{j=1}^{d} (x_j − x′_j)².

Assume our training data D_train = ((1, 1), (2, 2), (3, 6)). What is h(10; 0)? That is, letting θ = 0, what is our prediction for x = 10?

Solution: 3

Codex Zero-Shot Answer:

import numpy as np

def f(x, x_i, theta):
    return np.exp(-theta * np.linalg.norm(x - x_i) ** 2)
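The weighted nearest neighbor prediction asked for in Table 1 can be computed directly from the definitions. A minimal self-contained sketch (with θ = 0, every training point receives weight e⁰ = 1, so the prediction is just the mean of the labels):

```python
import numpy as np

# Training data from the Table 1 question: D_train = ((1, 1), (2, 2), (3, 6)).
D_train = [(1.0, 1.0), (2.0, 2.0), (3.0, 6.0)]

def f(x, x_i, theta):
    # Weight function: e^(-theta * ||x - x_i||^2).
    return np.exp(-theta * np.linalg.norm(x - x_i) ** 2)

def h(x, theta):
    # Weighted nearest neighbor regressor: weighted average of training labels.
    weights = [f(x, x_i, theta) for x_i, _ in D_train]
    return sum(w * y for w, (_, y) in zip(weights, D_train)) / sum(weights)

print(h(10, 0))  # theta = 0 gives every point weight 1, so h = (1 + 2 + 6) / 3 = 3.0
```

The output matches the reference solution of 3 given in the table.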

