STEALING AND DEFENDING TRANSFORMER-BASED ENCODERS

Abstract

Self-supervised learning (SSL) has become the predominant approach to training on large amounts of unlabeled data. New real-world APIs offer services that generate high-dimensional representations for given inputs based on SSL encoders with transformer architectures. Recent efforts highlight that it is possible to steal high-quality SSL encoders based on convolutional neural networks. In this work, we are the first to extend this line of work to stealing and defending transformer-based encoders in both the language and vision domains. We show that it is possible to steal transformer-based sentence embedding encoders solely through their returned representations, using 40x fewer queries than the number of the victim's training data points. We further decrease the number of required stealing queries for language encoders by reusing extracted representations for semantically similar sentences, and for vision encoders by leveraging semi-supervised learning. Finally, to defend transformers against stealing attacks, we develop a new scheme to watermark language encoders: for the last training iterations, we alternate between optimizing for the standard sentence embedding objective and a chosen downstream task. For the vision domain, we design a defense technique that combines watermarking with dataset inference. Our method creates a unique encoder signature based on a private data subset that acts as a secret seed during training. By applying dataset inference to the seed, we can successfully identify stolen transformers.

1. INTRODUCTION

The success of self-supervised learning (SSL) has motivated the emergence of large-scale services offering API access to encoders that return high-dimensional representations for given inputs. These representations serve to train a diverse set of downstream tasks with small amounts of labeled data. The latest APIs (Clarifai, 2022; Cohere, 2022; OpenAI, 2022) use transformer-based encoders (Devlin et al., 2018; Dosovitskiy et al., 2020) to generate representations. Such encoders have a high number of parameters (e.g., the state-of-the-art RoBERTa-Large language encoder (Liu et al., 2019) has roughly 355M parameters) and are trained on datasets consisting of millions of data points, yielding a highly expensive training procedure (Sharir et al., 2020). Therefore, these encoders are lucrative targets for stealing attacks (Tramèr et al., 2016), where an adversary extracts a victim encoder by submitting queries and using the outputs to train a local stolen copy, often at a fraction of the victim's training cost (Sha et al., 2022; Dziedzic et al., 2022a). The stolen encoder can then be used for inference without the owner's permission, violating their intellectual property rights and causing financial loss. While stealing and defending supervised models has been heavily studied (Tramèr et al., 2016; Juuti et al., 2019; Orekondy et al., 2020), research on stealing and defending transformer-based encoders is limited.
Despite the immediate practical importance of this problem, to the best of our knowledge, all previous works on model stealing attacks and defenses against SSL encoders are conducted offline in contrived experimental settings (Cong et al., 2022; Sha et al., 2022; Dziedzic et al., 2022a;b), focus on the vision domain with convolutional neural network (CNN)-based architectures, and do not attack the popular transformer architecture (Vaswani et al., 2017), which currently holds state-of-the-art results on many vision (Caron et al., 2021) and natural language processing (NLP) tasks (Gao et al., 2021). Since transformers are trained differently than CNNs and represent larger and more complex architectures, stealing them is more challenging. We show how stealing attacks (Sha et al., 2022; Dziedzic et al., 2022a) can be successfully applied to extract transformer encoders through their returned representations. In NLP, we are able to steal sentence encoders using only a small number of queries; up to 40x fewer stealing queries than the number of the original training data points. We also show that this number can be further reduced by reusing the representations obtained from the victim encoder for semantically similar sentences of the stealing queries. For vision encoders, we decrease the number of queries against victim encoders by adapting semi-supervised learning based on MixMatch (Berthelot et al., 2019), which has, so far, only been applied to attacks in the supervised setting (Jagielski et al., 2020).
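At its core, the attack collects (query, representation) pairs from the victim's API and fits a local encoder to reproduce them. The sketch below is a toy illustration using a linear encoder as a stand-in for a transformer; the function and variable names are illustrative, not taken from the paper.

```python
import numpy as np

def steal_linear_encoder(victim_fn, queries):
    """Query the victim API once per input, then fit a local copy by
    least squares on the collected (input, representation) pairs."""
    reps = victim_fn(queries)                     # victim's returned representations
    W, *_ = np.linalg.lstsq(queries, reps, rcond=None)
    return lambda x: x @ W                        # the stolen local encoder
```

A real attack would replace the closed-form fit with gradient descent on, e.g., an MSE loss between the student transformer's outputs and the victim's representations, but the query-then-fit structure is the same.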
The successful applicability of encoder stealing to transformer-based architectures in public API settings motivates the urgent need for defenses. We first propose DataSeed Inference (DSI), a combination of watermarking (Uchida et al., 2017; Jia et al., 2021; Adi et al., 2018) and Dataset Inference (DI) (Dziedzic et al., 2022b), as a successful defense for vision encoders. While standard DI operates on the assumption that an encoder is uniquely characterized by its whole training data and exploits the fact that the unique data signature of the victim encoder is transferred to stolen copies, DSI adds a private data subset as a unique seed to the original training data and uses this seed to identify stolen copies. This is a necessary adaptation of DI to transformer-based encoders, since these are trained on large amounts of public data, possibly scraped from the Internet (Radford et al., 2021), and thereby not necessarily unique to a single encoder. We note that the signal from the private seed is transferred during stealing. Additionally, our defense does not harm the performance of the defended encoder on downstream tasks, which distinguishes it from watermarks in supervised settings. Since we find that DSI is not successful in defending language encoders, we propose a new watermarking scheme to protect them from theft. Our watermark relies on alternating between the actual sentence embedding task and a secretly chosen downstream task during the last iterations of training. This transforms the representations so that they preserve their high performance on sentence embedding tasks while increasing their accuracy on the downstream task. To embed the watermark task, we append a fully-connected layer to the encoder. This additional layer acts as our secret key. We verify whether a given encoder is a stolen copy by attaching that layer and checking the agreement with the victim encoder's output on the watermark downstream task.
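The DSI decision step described above can be sketched as a likelihood-ratio test: fit a density model on the suspect encoder's representations of one seed subset and compare how it scores the other seed subset against held-out data. The function name, the number of GMM components, and the decision margin below are illustrative choices, not values from the paper.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def dsi_is_stolen(encoder_fn, d1, d2, d_test, margin=0.0):
    """Fit a GMM on the suspect encoder's representations of seed subset D1,
    then flag the encoder as stolen if the seed subset D2 is significantly
    more likely under that GMM than the held-out DTest."""
    gmm = GaussianMixture(n_components=3, random_state=0)
    gmm.fit(encoder_fn(d1))
    ll_seed = gmm.score(encoder_fn(d2))       # mean log-likelihood of D2
    ll_test = gmm.score(encoder_fn(d_test))   # mean log-likelihood of DTest
    return ll_seed - ll_test > margin
```

The intuition is that a stolen copy inherits the seed's signature, so its representations of D2 score clearly higher than those of DTest; an independently trained encoder shows no such gap.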
Victim and independent encoders return significantly different outputs, whereas victim and stolen copies return similar outputs. To summarize, we make the following contributions:
• We successfully steal NLP and vision transformer-based encoders in a real-world API setting, assuming access to representations only and using up to 40x fewer queries than the number of samples in the respective encoder's training dataset. Our stolen encoders achieve comparable performance to the respective victims on standard benchmarks.
• We further reduce the number of stealing queries by using semantically similar sentences for language encoders and semi-supervised methods for vision encoders.




Figure 1: Defenses against Stealing Transformer-based Encoders. (a) Our DSI relies on a private dataset, randomly partitioned into three subsets D1, D2, and DTest. D1 ∪ D2 is injected into the encoder's public training data as a secret seed. For ownership resolution, a Gaussian Mixture Model (GMM) is trained on the representations of D1. DSI marks an encoder as stolen if the average likelihood of D2 under the GMM is significantly higher than the likelihood of DTest. (b) To embed our watermark for sentence embedding encoders, training alternates between the main task and a secretly chosen downstream task during the last iterations. For verification, the fully-connected (FC) layer is attached to a potentially stolen copy and its agreement with the victim's output on the downstream task is measured.
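The verification in panel (b) reduces to attaching the defender's secret FC layer to both encoders and measuring how often their downstream-task predictions agree. A minimal sketch with toy stand-in encoders follows; the function name and the agreement thresholds are illustrative assumptions, not values from the paper.

```python
import numpy as np

def watermark_agreement(victim_fn, suspect_fn, fc_weights, inputs):
    """Attach the defender's secret fully-connected layer (the key) to both
    encoders and return the fraction of inputs on which their downstream-task
    predictions agree. Stolen copies inherit the watermark task's structure,
    so their agreement with the victim stays high."""
    W, b = fc_weights
    def predict(encoder_fn):
        logits = encoder_fn(inputs) @ W + b   # secret FC layer on top of encoder
        return logits.argmax(axis=1)          # downstream-task predictions
    return np.mean(predict(victim_fn) == predict(suspect_fn))
```

Ownership is then resolved by thresholding this agreement: near-copies of the victim agree on almost all inputs, while independently trained encoders agree roughly at chance level.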

