A UNIFIED FRAMEWORK FOR COMPARING LEARNING ALGORITHMS

Abstract

We propose a framework for (learning) algorithm comparisons, wherein the goal is to find similarities and differences between models trained with two different learning algorithms. We begin by formalizing the goal of algorithm comparison as finding distinguishing feature transformations: input transformations that change the predictions of models trained with one learning algorithm but not the other. We then present a two-stage method for algorithm comparison that contrasts how models trained with each algorithm use the training data, leveraging the recently proposed datamodel representations (Ilyas et al., 2022). We demonstrate our framework through three case studies that compare models trained with and without standard data augmentation, with and without pre-training, and with different optimizer hyperparameters.

1. INTRODUCTION

Building a machine learning model involves a series of design choices. Even after choosing a dataset, for example, one must decide on a model architecture, an optimization method, and a data augmentation pipeline. These design choices together define a learning algorithm: a function mapping training datasets to machine learning models. Even when they do not affect accuracy, design choices determine the biases of the resulting models. For example, Hermann et al. (2020) find significant variation in shape bias (Geirhos et al., 2019) across a group of ImageNet models that vary in accuracy by less than 1%. To understand the impact of design choices, we thus need to be able to differentiate learning algorithms in a more fine-grained way than accuracy alone.

Motivated by this observation, we develop a unified framework for comparing learning algorithms. Our proposed framework comprises (a) a precise, quantitative definition of learning algorithm comparison; and (b) a concrete methodology for comparing any two algorithms. For (a), we frame the algorithm comparison problem as one of finding input transformations that distinguish the two algorithms. This goal is different from, and more general than, quantifying model similarity (Ding et al., 2021; Bansal et al., 2021; Morcos et al., 2018a) or testing for specific biases (Hermann et al., 2020). For (b), we propose a two-stage method for comparing algorithms in terms of how they use the training data (a sketch of this pipeline appears at the end of this section). In the first stage of this method, we leverage datamodel representations (Ilyas et al., 2022) to find weighted combinations of training examples (which we call training directions) that have disparate impact on the test-time behavior of models across learning algorithms. In the second stage, we identify the subpopulation of test examples most influenced by each identified training direction, then manually inspect them to infer a shared feature (e.g., we might notice that all of the test images contain a spider web). We then tie this intuition back to our quantitative definition by designing a distinguishing feature transformation based on the shared feature (e.g., overlaying a spider web pattern onto the background of an image).

We illustrate the utility of our framework through three case studies (Section 3), motivated by typical choices one needs to make within machine learning pipelines, namely:

• Data augmentation: We compare classifiers trained with and without data augmentation on the LIVING17 (Santurkar et al., 2021) dataset. We show that models trained with data augmentation, while attaining higher overall accuracy, are more prone to picking up specific instances of co-occurrence bias and texture bias than models trained without data augmentation.

• ImageNet pre-training: We compare classifiers trained from scratch with classifiers fine-tuned from an ImageNet pre-trained model. We demonstrate that pre-training can either suppress or amplify specific spurious correlations. As an example of the former, adding a yellow patch to random images increases the "landbird" confidence of models trained from scratch by 12%, but actually decreases the confidence of pre-trained models by 4%. As an example of the latter, adding a human face (Figure 1B) to the background increases the "landbird" confidence of pre-trained models by 4%, but decreases the confidence of models trained from scratch by 1%.

• Optimizer hyperparameters: Finally, we compare classifiers trained on CIFAR-10 (Krizhevsky, 2009) using stochastic gradient descent (SGD) with different choices of learning rate and batch size. Our analysis pinpoints subtle differences in model behavior induced by small changes to these hyperparameters. For example, adding a small pattern that resembles windows (Figure 1C) to random images increases the "truck" confidence by 7% on average for models trained with a smaller learning rate, but by only 2% for models trained with a larger learning rate.

Across all three case studies, our framework surfaces fine-grained differences between models trained with different learning algorithms, enabling us to better understand the role of the design choices that make up a learning algorithm.
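To make the two-stage procedure above concrete, the sketch below shows one way it could be instantiated, assuming datamodel weight matrices for each algorithm have already been estimated (as in Ilyas et al., 2022). The comparison step shown here (PCA on the difference of normalized datamodel rows) is only an illustrative stand-in for the procedure detailed in Section 2.2, and the file names and variables are hypothetical.

```python
import numpy as np

# Hypothetical inputs: datamodel weight matrices for the two algorithms,
# each of shape (num_test_examples, num_train_examples). Row i holds the
# datamodel weights of test example i over the training set.
theta_A = np.load("datamodels_algorithm_A.npy")  # e.g., with augmentation
theta_B = np.load("datamodels_algorithm_B.npy")  # e.g., without augmentation

# --- Stage 1: find candidate "training directions" -----------------------
# Normalize rows so that differences reflect *how* training data is used,
# not how strongly each test prediction depends on it overall.
def normalize_rows(m):
    return m / (np.linalg.norm(m, axis=1, keepdims=True) + 1e-12)

delta = normalize_rows(theta_A) - normalize_rows(theta_B)

# Top right-singular vectors of the per-example differences give weighted
# combinations of training examples (training directions) along which the
# two algorithms' datamodels disagree most.
_, _, vt = np.linalg.svd(delta, full_matrices=False)
num_directions = 5
directions = vt[:num_directions]   # shape: (num_directions, num_train_examples)

# --- Stage 2: surface test examples most influenced by each direction ----
# Rank test examples by how strongly their datamodel weights align with a
# direction, then inspect the top-ranked images by hand to hypothesize a
# shared feature (e.g., a spider web in the background).
direction = directions[0]
alignment = theta_A @ direction            # shape: (num_test_examples,)
top_test_idx = np.argsort(-np.abs(alignment))[:20]
print("Inspect these test examples for a shared feature:", top_test_idx)
```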

2. COMPARING LEARNING ALGORITHMS

In this section, we describe our (learning) algorithm comparison framework. In Section 2.1, we formalize algorithm comparison as the task of identifying distinguishing transformations: functions that, when applied to test examples, significantly and consistently change the predictions of one model class but not the other. In Section 2.2, we describe our method for identifying distinguishing feature transformations by comparing how each model class uses the training data.
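As a rough illustration of what "distinguishing" means operationally, the sketch below applies a candidate transformation (e.g., pasting a small patch onto each image, as in the case studies of Section 1) to a batch of test inputs and compares the resulting shift in predicted confidence across models drawn from each model class. The models, image batch, patch, and target class index are hypothetical placeholders; this is only meant to convey the flavor of the criterion formalized in this section, not the exact definition.

```python
import torch

def mean_confidence_shift(models, images, transform, target_class):
    """Average change in softmax confidence for `target_class` when a
    candidate transformation is applied, averaged over models and images."""
    shifts = []
    with torch.no_grad():
        for model in models:
            model.eval()
            p_clean = torch.softmax(model(images), dim=1)[:, target_class]
            p_trans = torch.softmax(model(transform(images)), dim=1)[:, target_class]
            shifts.append((p_trans - p_clean).mean().item())
    return sum(shifts) / len(shifts)

# Hypothetical candidate transformation: paste a small yellow patch in the
# corner of each image (assumes inputs are scaled to [0, 1]).
def add_yellow_patch(images, size=8):
    out = images.clone()
    out[:, 0, :size, :size] = 1.0   # red channel
    out[:, 1, :size, :size] = 1.0   # green channel
    out[:, 2, :size, :size] = 0.0   # blue channel
    return out

# models_A / models_B: lists of trained models sampled from the two model
# classes (e.g., pre-trained vs. trained from scratch); images: a test batch.
# A transformation is a candidate distinguishing transformation if the shift
# is large for one model class but small for the other, e.g.:
# shift_A = mean_confidence_shift(models_A, images, add_yellow_patch, target_class=0)
# shift_B = mean_confidence_shift(models_B, images, add_yellow_patch, target_class=0)
```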

2.1. FORMALIZING ALGORITHM COMPARISONS VIA DISTINGUISHING TRANSFORMATIONS

The goal of algorithm comparison is to understand the ways in which two learning algorithms (trained on the same data distribution) differ in the models they yield. More specifically, we are interested in comparing the model classes induced by the two learning algorithms:

Definition 1 (Induced model class). Given an input space X, a label space Y, and a model space M ⊆ {f : X → Y}, a learning algorithm A : (X × Y)* → M is a (potentially random) function mapping a set of input-label pairs to a model. Fixing a data distribution D, the model class induced by algorithm A is the distribution over M that results from applying A to datasets randomly sampled from D.
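In symbols (a light restatement of Definition 1, written with calligraphic letters and assuming training sets of a fixed size n drawn i.i.d. from D, a detail left implicit above), the induced model class is the distribution of the trained model under resampling of the training set:

```latex
% A learning algorithm maps datasets to models:
%   \mathcal{A} : (\mathcal{X} \times \mathcal{Y})^{*} \to \mathcal{M}
% The model class induced by \mathcal{A} on distribution \mathcal{D}:
\mathcal{M}_{\mathcal{A},\mathcal{D}}
  \;:=\; \text{distribution of } \mathcal{A}(S),
  \qquad S = \{(x_i, y_i)\}_{i=1}^{n} \overset{\text{i.i.d.}}{\sim} \mathcal{D}.
```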



Figure 1: A visual summary of our case studies. We use our method to study the differences between training with and without standard data augmentation, with and without ImageNet pre-training, and with different choices of SGD optimizer hyperparameters. In all three cases, our framework allows us to pinpoint concrete ways in which the two algorithms being compared differ.

