MODEL CHANGELISTS: CHARACTERIZING CHANGES IN ML PREDICTION APIS

Abstract

Updates to Machine Learning as a Service (MLaaS) APIs may affect downstream systems that depend on their predictions. However, performance changes introduced by these updates are poorly documented by providers and seldom studied in the literature. As a result, users are left wondering: do model updates introduce subtle performance changes that could adversely affect my system? Ideally, users would have access to a detailed ChangeList specifying the slices of data where model performance has improved and degraded since the update. But producing a ChangeList is challenging because it requires (1) discovering slices in the absence of detailed annotations or metadata, (2) accurately attributing coherent concepts to the discovered slices, and (3) communicating them to the user in a digestible manner. We introduce Mocha, an interactive framework for building, verifying and releasing ChangeLists that addresses these challenges. Using it, we perform a large-scale analysis of three real-world MLaaS API updates. We produce a ChangeList for each, identifying over 100 coherent data slices on which the model's performance changed significantly. Notably, we find 63 instances where an update improves performance globally, but hurts performance on a coherent slice, a phenomenon not previously documented at scale in the literature. These findings underscore the importance of producing a detailed ChangeList when the model behind an API is updated.

1. INTRODUCTION

Modern software systems often depend on Machine Learning as a Service (MLaaS) APIs developed by cloud providers (e.g. AWS, GCP, Azure) or research organizations (e.g. OpenAI, HuggingFace). The models behind these APIs are periodically updated and new versions are released. However, it is typically unclear to users how a new update will affect their broader system. Consider, for example, a newspaper that uses an image tagging API to source archival photos for retrospective stories (Greenfield, 2018). Updates to the underlying model could lead to unexpected changes in the workflow of photo editors and journalists who rely on the system.

MLaaS providers rarely provide transparent evaluations of their updates, and those that do focus on global metrics and vague notions of improvement. Release notes from cloud providers like Amazon, Google and Microsoft are terse and provide little information; e.g., the Feb. '22 update to Microsoft's Vision API notes only "general performance and AI quality improvements" (Microsoft, b). These release notes tell an incomplete story: saying that one model improves on another obscures the fact that models may perform very differently on fine-grained slices of data (Ribeiro et al., 2020; de Vries et al., 2019). Returning to the newspaper described above, performance after the update may improve globally, while still deteriorating on historic photos, the kind of photos commonly found in the newspaper's archives. Without more detailed evaluations, users are left wondering: Do model updates introduce subtle changes that could adversely affect my system? While many studies include detailed comparisons of MLaaS APIs (Buolamwini & Gebru, 2018; Goel et al., 2021a; b; Ribeiro et al., 2020; Qi et al., 2020), they lack comparisons of the same API before and after an update.
Recent work shows that API updates can lead to performance drops on benchmarks (Chen et al., 2021), but that analysis is limited to simple tasks and global measurements. Answering the question above would be easier if providers released detailed reports specifying the slices of data where performance has changed. We formalize this using the notion of a ChangeList. Ideally, a ChangeList is interactive, allowing a user to explore how the model's behavior has changed on the slices most important to their system. For the example above, the newspaper should be able to draw conclusions like: "the updated API detects objects in historic photos with 10% lower recall". Such conclusions would inform decisions around whether or not to integrate the update. However, producing a comprehensive ChangeList is difficult due to three main challenges:

1. For complex data like images, the set of slices that partition the data is extremely large and unknown a priori. How can we gather coherent slices that explain the change?
2. When interpreting slices, we typically attribute concepts (e.g. historic) to them. However, a slice discovered automatically may not align perfectly with a concept, leading to false conclusions about performance on that concept. How do we quickly perform accurate attribution?
3. The number of slices with significant changes can be very large, and not all changes will be relevant to all users. How do we help users surface the slices most important to their system?

To address these challenges we introduce Mocha, an interactive framework for building, verifying and releasing ChangeLists for model updates. Mocha consists of three phases:

1. Discovery: First, we adapt a recently proposed slice discovery method (Eyuboglu et al., 2022) to gather slices for the ChangeList in Mocha. We use cross-modal embeddings and a simple mixture model to identify slices of data where the models differ.
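To make the discovery phase concrete, the sketch below shows one simple way to combine cross-modal embeddings with a mixture model to surface regions where two model versions disagree. This is an illustrative simplification, not the actual method of Eyuboglu et al. (2022); the function name, the appended correctness-change feature, and the ad-hoc 5.0 weighting are all assumptions for exposition.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def discover_slices(embeddings, correct_old, correct_new, n_slices=10, seed=0):
    """Illustrative slice discovery for a model update.

    embeddings:  (n, d) cross-modal (e.g. CLIP) embeddings of the examples
    correct_old: (n,) 1.0 if the pre-update model was correct, else 0.0
    correct_new: (n,) 1.0 if the post-update model was correct, else 0.0
    Returns cluster assignments and per-slice change in accuracy,
    sorted so the most-degraded slices come first.
    """
    # Append the per-example change in correctness as an extra feature so
    # the mixture is encouraged to separate regions where the models differ.
    delta = (correct_new - correct_old).reshape(-1, 1).astype(float)
    features = np.hstack([embeddings, 5.0 * delta])  # 5.0: ad-hoc upweighting

    gmm = GaussianMixture(n_components=n_slices, random_state=seed)
    assignments = gmm.fit_predict(features)

    # Rank candidate slices by how much accuracy changed after the update.
    changes = {}
    for k in range(n_slices):
        mask = assignments == k
        if mask.any():
            changes[k] = float(correct_new[mask].mean() - correct_old[mask].mean())
    return assignments, dict(sorted(changes.items(), key=lambda kv: kv[1]))
```

In practice the discovered clusters would then be inspected and attributed to concepts in the subsequent phase; here the per-slice accuracy deltas simply indicate which clusters are worth examining.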
Mocha also supports manual slicing over metadata, and can incorporate slices from any slice discovery method.

2. Attribution: Next, we ascribe concepts to the discovered slices. Via an interactive process termed micro-labeling, we verify the accuracy of the attributions and dynamically correct them. Cross-modal embeddings (e.g. CLIP) are used to guide an importance sampler (Owen, 2013) that surfaces a small number of examples for labeling. Labeled examples are then used to estimate the precision and recall of the user attributions, and to update slices to reflect label feedback.

3. Release: Finally, to help users understand model updates, we release the ChangeList in the Mocha web interface. The slices in the ChangeList are indexed by cross-modal embeddings, and are therefore easily searchable by text or image. Further, if the ChangeList is missing slices important to the user, they can initiate discovery and attribution to edit the ChangeList.

While Mocha can be used to prepare ChangeLists for any pair of models, we focus particularly on demonstrating its application to the challenging real-world problem of documenting MLaaS APIs. We use Mocha to study updates to three image tagging APIs with the HAPI database (Chen et al.), which gathers predictions on the same test examples before and after an update. We produce one ChangeList per API update; key findings from our study follow:

• The ChangeLists include over 100 coherent slices on which the model's performance changed significantly. These slices were not annotated in the dataset and were discovered by Mocha.
• There are 63 slices in the ChangeLists on which an API update introduced a statistically significant degradation in performance. For example, between 2020 and 2022, the accuracy of a model from Google Cloud Vision on the task of tagging "people" degraded by 17.7%-points for



Figure 1: Overview of Mocha. (left) An MLaaS API updates and changes predictions for downstream users; (right) Mocha is an interactive framework for building, verifying and releasing ChangeLists to explain model updates using slices of data.
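The micro-labeling step of the attribution phase can also be sketched in code. The idea, as described above, is to use embedding similarity to a concept to guide an importance sampler, then estimate attribution quality from the small labeled sample. The sketch below is a minimal, assumed implementation: the function name, the uniform/similarity mixture proposal, and the use of a self-normalized importance estimator are illustrative choices, not the paper's exact procedure.

```python
import numpy as np

def estimate_precision(in_slice, concept_scores, label_fn, budget=20, seed=0):
    """Illustrative micro-labeling: estimate the precision of a concept
    attribution for a discovered slice from a small labeled sample.

    in_slice:       (n,) boolean mask of examples in the discovered slice
    concept_scores: (n,) nonnegative similarity of each example to the
                    concept (e.g. CLIP image-text similarity)
    label_fn:       stand-in for the interactive labeler; returns 1 if the
                    example truly shows the concept, else 0
    """
    rng = np.random.default_rng(seed)
    idx = np.flatnonzero(in_slice)
    scores = concept_scores[idx]

    # Proposal: mix a uniform floor with concept similarity so that both
    # likely-positive and likely-negative examples get sampled.
    q = 0.5 / len(idx) + 0.5 * scores / scores.sum()
    q = q / q.sum()
    k = min(budget, len(idx))
    picks = rng.choice(len(idx), size=k, replace=False, p=q)

    # Self-normalized importance weights correct for the biased proposal.
    w = 1.0 / q[picks]
    labels = np.array([label_fn(idx[j]) for j in picks])
    return float((w * labels).sum() / w.sum())
```

Recall of the attribution could be estimated analogously by sampling from examples outside the slice; the same labeled examples can then feed back into updating the slice, as the attribution phase describes.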

