STEALING AND DEFENDING TRANSFORMER-BASED ENCODERS

Abstract

Self-supervised learning (SSL) has become the predominant approach to training on large amounts of unlabeled data. New real-world APIs offer services that generate high-dimensional representations for given inputs based on SSL encoders with transformer architectures. Recent efforts highlight that it is possible to steal high-quality SSL encoders based on convolutional neural networks. In this work, we are the first to extend this line of work to stealing and defending transformer-based encoders in both language and vision domains. We show that it is possible to steal transformer-based sentence embedding encoders solely using their returned representations, and with 40x fewer queries than the number of the victim's training data points. We further decrease the number of required stealing queries for language encoders by reusing extracted representations on semantically similar sentences, and for vision encoders by leveraging semi-supervised learning. Finally, to defend transformers against stealing attacks, we develop a new scheme to watermark language encoders: for the last training iterations, we alternate between optimizing for the standard sentence embedding objective and a chosen downstream task. For the vision domain, we design a defense technique that combines watermarking with dataset inference. Our method creates a unique encoder signature based on a private data subset that acts as a secret seed during training. By applying dataset inference on the seed, we can then successfully identify stolen transformers.

1. INTRODUCTION

The success of self-supervised learning (SSL) motivates the emergence of large-scale services offering API access to encoders that return high-dimensional representations for given inputs. These representations serve to train a diverse set of downstream tasks with a small amount of labeled data. The latest APIs (Clarifai, 2022; Cohere, 2022; OpenAI, 2022) use transformer-based encoders (Devlin et al., 2018; Dosovitskiy et al., 2020) to generate representations. Such encoders have a high number of parameters (e.g., the state-of-the-art RoBERTa-Large language encoder (Liu et al., 2019) has roughly 355M parameters) and are trained on datasets consisting of millions of data points, yielding a highly expensive training procedure (Sharir et al., 2020). Therefore, these encoders are lucrative targets for stealing attacks (Tramèr et al., 2016), in which an adversary extracts a victim encoder by submitting queries and using the outputs to train a local stolen copy, often at a fraction of the victim's training cost (Sha et al., 2022; Dziedzic et al., 2022a). The stolen encoder can then be used for inference without the owner's permission, violating their intellectual property rights and causing financial loss. While stealing and defending supervised models has been heavily studied (Tramèr et al., 2016; Juuti et al., 2019; Orekondy et al., 2020), research on stealing and defending transformer-based encoders is limited.
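To make the threat model concrete, the following is a minimal, self-contained sketch of the query-based stealing loop described above. It is not the paper's attack: the "victim encoder" is a toy linear map standing in for a black-box embedding API, and the attacker's "training" is an ordinary least-squares fit standing in for training a student encoder on the returned representations. All names (`victim_api`, `W_stolen`, etc.) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical victim "encoder": a fixed linear map from 32-dim
# inputs to 16-dim representations. This stands in for an API
# that returns high-dimensional embeddings for given inputs.
W_victim = rng.standard_normal((32, 16))

def victim_api(x):
    """Black-box query interface: input batch -> representations."""
    return x @ W_victim

# Attacker: submit queries (far fewer than the victim's training
# set size), collect the returned representations, and fit a local
# copy on the (query, representation) pairs.
queries = rng.standard_normal((200, 32))
reps = victim_api(queries)
W_stolen, *_ = np.linalg.lstsq(queries, reps, rcond=None)

# The stolen copy now reproduces the victim's representations on
# fresh inputs, without further API access.
test_x = rng.standard_normal((50, 32))
err = np.max(np.abs(test_x @ W_stolen - victim_api(test_x)))
print(f"max representation error: {err:.2e}")
```

For a real transformer victim, the least-squares step would be replaced by gradient-based training of a student network against the queried embeddings (e.g., with an MSE or cosine loss), but the query-collect-fit structure of the attack is the same.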
Despite the immediate practical importance of this problem, to the best of our knowledge, all previous works on model stealing attacks and defenses against SSL encoders are conducted offline in contrived experimental settings (Cong et al., 2022; Sha et al., 2022; Dziedzic et al., 2022a;b). They focus on the vision domain with convolutional neural network (CNN)-based architectures and do not attack the popular transformer architecture (Vaswani et al., 2017), which currently holds state-of-the-art results on many vision (Caron et al., 2021) and natural language processing (NLP) tasks (Gao et al., 2021). Since transformers are trained differently than CNNs and represent larger and more complex architectures, stealing them is more challenging.

