MODEL CHANGELISTS: CHARACTERIZING CHANGES IN ML PREDICTION APIS

Abstract

Updates to Machine Learning as a Service (MLaaS) APIs may affect downstream systems that depend on their predictions. However, performance changes introduced by these updates are poorly documented by providers and seldom studied in the literature. As a result, users are left wondering: do model updates introduce subtle performance changes that could adversely affect my system? Ideally, users would have access to a detailed ChangeList specifying the slices of data on which model performance has improved and degraded since the update. However, producing a ChangeList is challenging because it requires (1) discovering slices in the absence of detailed annotations or metadata, (2) accurately attributing coherent concepts to the discovered slices, and (3) communicating them to the user in a digestible manner. We introduce Mocha, an interactive framework for building, verifying, and releasing ChangeLists that addresses these challenges. Using it, we perform a large-scale analysis of three real-world MLaaS API updates. We produce a ChangeList for each, identifying over 100 coherent data slices on which the model's performance changed significantly. Notably, we find 63 instances where an update improves performance globally but hurts performance on a coherent slice, a phenomenon not previously documented at scale in the literature. These findings underscore the importance of producing a detailed ChangeList when the model behind an API is updated.

1. INTRODUCTION

Modern software systems often depend on Machine Learning as a Service (MLaaS) APIs developed by cloud providers (e.g., AWS, GCP, Azure) or research organizations (e.g., OpenAI, HuggingFace). The models behind these APIs are periodically updated and new versions are released. However, it is typically unclear to the user how a new update will affect the behavior of their broader system. Consider, for example, a newspaper that uses an image tagging API to source archival photos for retrospective stories (Greenfield, 2018). Updates to the underlying model could lead to unexpected changes in the workflow of the photo editors and journalists who rely on the system. MLaaS providers rarely publish transparent evaluations of their updates, and those that do focus on global metrics and vague notions of improvement. Release notes from cloud providers like Amazon, Google, and Microsoft for their APIs are terse and provide little information, e.g., Microsoft's Vision API (Feb '22 update) noting "general performance and AI quality improvements" (Microsoft, b). These release notes tell an incomplete story: saying that one model improves on another obscures the fact that models may perform very differently on fine-grained slices of data (Ribeiro et al., 2020; de Vries et al., 2019). Returning to the newspaper described above, performance after the update may improve globally while still deteriorating on historic photos, the kind of photos commonly found in the newspaper's archives. Without more detailed evaluations, users are left wondering: do model updates introduce subtle changes that could adversely affect my system? While many studies include detailed comparisons of MLaaS APIs (Buolamwini & Gebru, 2018; Goel et al., 2021a;b; Ribeiro et al., 2020; Qi et al., 2020), they lack comparisons of the same API before and after an update.
Recent work shows that API updates can lead to performance drops on benchmarks (Chen et al., 2021), but the analysis is limited to simple tasks and global measurements.
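To make the notion of a ChangeList concrete, the following minimal sketch compares two hypothetical versions of a model slice by slice, reporting the change in accuracy on each slice. The function, data, and slice definitions are illustrative assumptions for exposition, not Mocha's actual implementation or the paper's evaluation data.

```python
def accuracy(preds, labels):
    """Fraction of predictions that match the labels."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def slice_deltas(preds_v1, preds_v2, labels, slices):
    """For each named slice (a boolean membership mask over the
    evaluation set), report the change in accuracy from model
    version 1 to model version 2."""
    report = {}
    for name, mask in slices.items():
        idx = [i for i, in_slice in enumerate(mask) if in_slice]
        p1 = [preds_v1[i] for i in idx]
        p2 = [preds_v2[i] for i in idx]
        y = [labels[i] for i in idx]
        report[name] = accuracy(p2, y) - accuracy(p1, y)
    return report

# Toy example: the update improves accuracy globally (+0.125)
# while degrading it on one coherent slice (-0.25).
labels   = [1, 1, 1, 1, 0, 0, 0, 0]
preds_v1 = [1, 0, 1, 0, 0, 0, 0, 0]
preds_v2 = [1, 1, 1, 1, 0, 0, 1, 0]
slices = {
    "global":    [True] * 8,
    "last_four": [False] * 4 + [True] * 4,  # stand-in for e.g. "historic photos"
}
print(slice_deltas(preds_v1, preds_v2, labels, slices))
# → {'global': 0.125, 'last_four': -0.25}
```

The harder problems Mocha addresses begin where this sketch ends: in practice the slice masks are not given and must be discovered, attributed to coherent concepts, and communicated to users.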

