AUTOMATICALLY ANSWERING AND GENERATING MACHINE LEARNING FINAL EXAMS

Abstract

Can a machine learn machine learning? We propose to answer this question using the same criteria we use to answer the analogous question for humans. We automatically answer final exams in MIT's, Harvard's, and Cornell's large machine learning courses and generate new questions at a human level. Recently, program synthesis and few-shot learning have solved university-level problem-set questions in mathematics and STEM courses at a human level. In this work, we solve questions from final exams, which differ from problem sets in several ways: the questions are longer, have multiple parts, are more complicated, and span a broader set of topics. We provide a new dataset and benchmark of questions from machine learning final exams, together with code for automatically answering these questions and generating new ones. To make our dataset a reproducible benchmark, we use automatic checkers for multiple-choice questions, questions with numeric answers, and questions with expression answers; we evaluate a large open language model, Meta's OPT, and compare the results with OpenAI's GPT-3 and Codex. A student survey comparing the quality, appropriateness, and difficulty of machine-generated questions with human-written questions shows that, across multiple aspects, machine-generated questions are indistinguishable from human-generated questions and are suitable for final exams. We perform ablation studies comparing zero-shot learning with few-shot learning and chain-of-thought prompting, and comparing GPT-3 and OPT, pre-trained on text, with Codex, fine-tuned on code, across a range of machine learning topics, and find that few-shot learning methods perform best. We make our data and code publicly available for the machine learning community.

1. INTRODUCTION

Can a machine learn machine learning? This work presents a new dataset of machine learning final exams, with 646 question parts, and a benchmark of transformer-based baselines with their respective grade performance, demonstrating that the best baseline performs at a human level. In university-level STEM courses, students complete assignments (including problem sets and labs) and exams throughout the course. Recent work has opened the door for machines to solve course problem sets (Drori et al., 2022) using language models and few-shot learning. However, final exams remain challenging, and this work is the first to present a structured dataset of machine learning finals and a benchmark of baseline methods for answering them. Final exams differ from problem sets because they serve as a benchmark of cumulative understanding of the material learned over a semester and evaluate students' depth and breadth of expertise. Further, questions on final exams are longer, have multiple parts, span a broader set of topics, and are more complicated and nuanced.
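The automatic checkers used to make the benchmark reproducible can be illustrated with a short sketch. The code below, assuming SymPy is available, grades a multiple-choice answer by normalized string match, a numeric answer within a relative tolerance, and an expression answer by symbolic equivalence. The function names and the tolerance value are illustrative assumptions, not the paper's actual implementation.

```python
from sympy import simplify, sympify


def check_multiple_choice(predicted: str, correct: str) -> bool:
    # Compare answer choices after stripping whitespace and case.
    return predicted.strip().lower() == correct.strip().lower()


def check_numeric(predicted: str, correct: str, rel_tol: float = 1e-4) -> bool:
    # Accept a numeric answer within a relative tolerance of the reference.
    try:
        p, c = float(predicted), float(correct)
    except ValueError:
        return False
    return abs(p - c) <= rel_tol * max(1.0, abs(c))


def check_expression(predicted: str, correct: str) -> bool:
    # Two expressions are equivalent if their difference simplifies to zero.
    try:
        return simplify(sympify(predicted) - sympify(correct)) == 0
    except Exception:
        return False
```

For example, `check_expression("(x-1)*(x+1)", "x**2 - 1")` returns `True`, so a correct answer written in a different algebraic form still receives credit.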



All the above holds for MIT's and Cornell's Introduction to Machine Learning classes and Harvard's Machine Learning course. These are undergraduate courses with hundreds of students each semester, making them among the largest undergraduate courses offered. Introduction to Machine Learning is a core class in the computer science program. The prerequisites for the course are Python Programming and Multivariate Calculus, with Introduction to Algorithms and Linear Algebra recommended. The class typically consists of weekly exercises, labs, quizzes, homework, a midterm, and a final exam. There were no final exams at MIT in Fall 2020 and Spring 2020 due to COVID-19. Introduction to Machine Learning final exams differ from problem sets in several ways, and the experience of solving each varies. First, finals are long, containing around nine questions with

