STEALING AND DEFENDING TRANSFORMER-BASED ENCODERS

Abstract

Self-supervised learning (SSL) has become the predominant approach to training on large amounts of unlabeled data. New real-world APIs offer services to generate high-dimensional representations for given inputs based on SSL encoders with transformer architectures. Recent efforts highlight that it is possible to steal high-quality SSL encoders built on convolutional neural networks. In this work, we are the first to extend this line of work to stealing and defending transformer-based encoders in both language and vision domains. We show that it is possible to steal transformer-based sentence embedding encoders solely using their returned representations and with 40x fewer queries than the number of the victim's training data points. We further decrease the number of required stealing queries for language encoders by reusing extracted representations on semantically similar sentences and for the vision encoders by leveraging semi-supervised learning. Finally, to defend transformers against stealing attacks, we develop a new scheme to watermark the language encoders, where for the last training iterations, we alternate between optimizing for the standard sentence embedding and a chosen downstream task. For the vision domain, we design a defense technique that combines watermarking with dataset inference. Our method creates a unique encoder signature based on a private data subset that acts as a secret seed during training. By applying dataset inference on the seed, we can then successfully identify stolen transformers.

1. INTRODUCTION

The success of self-supervised learning (SSL) motivates the emergence of large-scale services offering API access to encoders which return high-dimensional representations for given inputs. These representations serve to train a diverse set of downstream tasks with a small amount of labeled data. The latest APIs (Clarifai, 2022; Cohere, 2022; OpenAI, 2022) use transformer-based encoders (Devlin et al., 2018; Dosovitskiy et al., 2020) to generate representations. Such encoders have a high number of parameters (e.g., the state-of-the-art RoBERTa-Large language encoder (Liu et al., 2019) has roughly 355M parameters) and are trained on datasets consisting of millions of data points, yielding a highly expensive training procedure (Sharir et al., 2020). Therefore, these encoders are lucrative targets for stealing attacks (Tramèr et al., 2016), where an adversary extracts a victim encoder by submitting queries and using the outputs to train a local stolen copy, often at a fraction of the victim's training cost (Sha et al., 2022; Dziedzic et al., 2022a). The stolen encoder can then be used for inference without the owner's permission, violating their intellectual property rights and causing financial loss. While stealing and defending supervised models has been heavily studied (Tramèr et al., 2016; Juuti et al., 2019; Orekondy et al., 2020), research on stealing and defending transformer-based encoders is limited.
Despite the immediate practical importance of this problem, to the best of our knowledge, all previous works on model stealing attacks and defenses against SSL encoders are conducted offline in contrived experimental settings (Cong et al., 2022; Sha et al., 2022; Dziedzic et al., 2022a;b), focusing on the vision domain with convolutional neural network (CNN)-based architectures, and do not attack the popular transformer (Vaswani et al., 2017) architecture, which currently holds state-of-the-art results on many vision (Caron et al., 2021) and natural language processing (NLP) tasks (Gao et al., 2021). Since transformers are trained differently than CNNs and represent larger and more complex architectures, stealing them is more challenging.

[Figure 1 (caption as recovered): (a) For ownership resolution, a Gaussian Mixture Model (GMM) is trained on the representations of $D_1$. DSI marks an encoder as stolen if the average likelihood from the GMM on $D_2$ is significantly higher than the likelihood on $D_{Test}$. (b) To embed our watermark for sentence embedding encoders, training alternates between the main task and a secretly chosen downstream task during the last iterations. For verification, the fully-connected (FC) layer is attached to a potentially stolen copy and agreement to the victim's output on the downstream task is measured.]

We show how stealing attacks (Sha et al., 2022; Dziedzic et al., 2022a) can be successfully applied to extract transformer encoders through their returned representations. In NLP, we are able to successfully steal sentence encoders using only a small number of queries; up to 40x fewer stealing queries than the number of the original training data points. We also show that this number can be further reduced by re-using the representations obtained from the victim encoder for semantically similar sentences of the stealing queries.
For vision encoders, we decrease the number of queries against victim encoders by adapting semi-supervision based on MixMatch (Berthelot et al., 2019), which has, so far, only been applied to attacks in the supervised setting (Jagielski et al., 2020).

The successful applicability of encoder stealing to transformer-based architectures in public API settings motivates the urgent need for defenses. We first propose DataSeed Inference (DSI), a combination of watermarking (Uchida et al., 2017; Jia et al., 2021; Adi et al., 2018) and Dataset Inference (DI) (Dziedzic et al., 2022b), as a successful defense for vision encoders. While standard DI operates on the assumption that an encoder is uniquely characterized by its whole training data and exploits that the unique data signature of the victim encoder is transferred to stolen copies, DSI adds a private data subset as a unique seed to the original training data and uses this seed to identify stolen copies. This is a necessary adaptation of DI to transformer-based encoders since these are trained on a large amount of public data, possibly scraped from the Internet (Radford et al., 2021), and thereby not necessarily unique to a single encoder. We note that the signal from the private seed is transferred during stealing. Additionally, our defense does not harm the performance of the defended encoder on downstream tasks, which distinguishes it from watermarks in supervised settings.

Since we find that DSI is not successful in defending language encoders, we propose a new watermarking scheme to protect them from theft. Our watermark relies on alternating between the actual sentence embedding task and a secretly chosen downstream task during the last iterations of training. This transforms the representations so that they preserve their high performance on sentence embedding tasks while increasing their accuracy on the downstream task. To embed the watermark task, we append a fully-connected layer to the encoder.
The additional layer acts as our secret key. We verify whether a given encoder is a stolen copy by attaching that layer and checking the agreement to the victim encoder's output for the watermark downstream task. Victim and independent encoders have significantly different outputs whereas victim and stolen copies return similar outputs.

To summarize, we make the following contributions:
• We successfully steal NLP and vision transformer-based encoders in a real-world API setting, assuming access to representations only and using up to 40x fewer queries than the number of samples in the respective encoder's training dataset. Our stolen encoders achieve comparable performance to the respective victims on standard benchmarks.
• We further reduce the number of stealing queries by using semantically similar sentences for language encoders and semi-supervised methods for vision encoders.
• For the vision domain, we propose DSI as a defense which adds a private dataset as a secret seed to the defender's original training data and uses this seed for ownership resolution.
• For NLP sentence embedding transformers, we propose a method to watermark their representations by alternating between the actual sentence embedding and a secretly chosen downstream task during the last iterations of training.

2. BACKGROUND AND RELATED WORK

Model extraction attacks. The goal of model extraction attacks is to replicate the functionality of a victim model $f_v$ trained on a dataset $D_P$. An attacker has black-box access to the victim model and uses a stealing dataset $D_s = \{q_i, f_v(q_i)\}_{i=1}^{n}$, consisting of queries $q_i$ and the corresponding predictions or representations returned by the victim model, to train a stolen model $f_s$. Model extraction attacks have been shown against various types of models including classification (Tramèr et al., 2016; Jagielski et al., 2020) and representation models (Sha et al., 2022; Dziedzic et al., 2022a).

Language Encoder. We use SimCSE (Gao et al., 2021) to learn sentence representations since it outperforms other methods and is exposed via public APIs (Clarifai, 2022). The SimCSE framework (Gao et al., 2021) proposes unsupervised and supervised approaches to generate sentence embeddings. It starts from a pre-trained checkpoint of a BERT-based encoder, e.g., RoBERTa, and takes the representation for the classification token ([CLS]) as the sentence embedding. In this work, we rely on their supervised approach leveraging pairs of sentences from natural language inference (NLI) datasets within a contrastive learning framework. It uses the entailment pairs as positives and contradictions as hard negatives.

Vision Encoder. In our work, we use DINO (Caron et al., 2021) to train vision transformers (ViT) since it returns high-quality representations that achieve SoTA performance on downstream tasks when assessed by training a linear classifier directly on representations. DINO trains student and teacher encoders, both with the same architecture but different parameters, where the teacher is updated with an (exponential moving) average of the student. Different random transformations of the same image are generated and passed through both the student and teacher.
The student is provided with smaller crops of the inputs than the teacher, which forces the student to generate representations that restore parts of the initial image. The training objective is to minimize the cross-entropy loss between representations from teacher and student.

Stealing Encoders. Thus far, methods for stealing encoders through representations have been shown in the computer vision domain and only for CNNs (Sha et al., 2022; Dziedzic et al., 2022a). Previous work in the NLP domain focuses on classification tasks and performs stealing against fine-tuned models through labels based on a given pre-trained language encoder (Krishna et al., 2020; Zanella-Beguelin et al., 2021; He et al., 2021). Model extraction attacks against NLP APIs are shown by Xu et al. (2021), specifically for sentiment classification and machine translation tasks. The setup of previous work differs from ours, which is concerned with stealing through representations instead of low-dimensional outputs, such as labels. This is motivated by the fact that these representations are exactly what new public APIs expose (Cohere, 2022; Clarifai, 2022). Distillation methods used in the NLP domain (Jiao et al., 2019), which could, in principle, be applied to stealing encoders, usually require white-box access to the original model, for example, to the attention layers. Therefore, distillation cannot be applied to stealing in public API-access scenarios.

Defending Encoders. Dataset inference (DI) (Maini et al., 2021) is a defense against model stealing attacks. It uses the victim's training dataset as a unique signature, leveraging the following observation: for a victim encoder trained on its private data as well as for its stolen copies, the distribution of the representations generated from the victim's training data differs from the distribution of the representations generated on the test data.
In contrast, for an independently trained encoder, these two distributions cannot be distinguished, allowing the detection of stolen copies (Dziedzic et al., 2022b) . For modeling the distributions, Gaussian Mixture Models (GMMs) are trained on a fraction of the private training data and applied to a disjoint fraction of the training data and the test data. An encoder is identified as a victim or stolen copy if the log-likelihood on private representations is significantly higher than on the test representations. Recently, watermarking (Uchida et al., 2017; Jia et al., 2021; Adi et al., 2018) methods have been proposed for encoders (Dziedzic et al., 2022a; Cong et al., 2022; Wu et al., 2022) . The main difference between previous work and our DSI lies in verification. Previous watermarking techniques use downstream tasks to detect a watermark while we resolve ownership based on the representations directly. For a more detailed overview on watermarking for encoders and a more thorough overview of related work, see Appendix A.

3. STEALING TRANSFORMER-BASED ENCODERS

We aim at stealing BERT-based transformers, fine-tuned to return sentence embeddings in the language domain, and general embeddings for images in the vision domain. Our stealing operates in a public API setting where the adversary can query the models through a pre-defined interface to obtain high-dimensional representations for their inputs. Stealing is then performed following previous work (Dziedzic et al., 2022a): (i) The adversary sends N raw or augmented inputs to the victim encoder. These inputs can, in principle, be taken from any data distribution of the target domain, using open-source data. (ii) With the obtained representations, the adversary trains a stolen copy of the victim. The goal of this training is to maximize the similarity between the stolen copy's output and the representations output by the victim. Therefore, the adversary either imitates a self-supervised training using a contrastive loss function, e.g., InfoNCE (Chen et al., 2020) or SoftNN (Frosst et al., 2019), or directly matches both models' representations via the Mean Squared Error (MSE) loss.

Stealing Transformer-Based Encoders. For language, public APIs (Cohere, 2022) expose transformers which are first pre-trained on a large corpus of text data to return per-token representations and then fine-tuned to return high-dimensional embeddings for a given full-text input, e.g., a sentence. In the vision domain, APIs such as Clarifai (2022) expose encoders trained from scratch on large amounts of image data to return per-image representations. We find that public APIs (Clarifai, 2022) provide metadata about exposed encoders, which can contain information about datasets used for pre-training as well as the encoder architecture. Thus, we can instantiate our stolen encoders with the victim encoder's architecture. We also follow the API setting and initialize the stolen copies of language encoders with publicly available pre-trained transformers.
The stolen copies of vision encoders are initialized with random weights.

Stealing with Semi-Supervision (Vision). To reduce the number of stealing queries, we apply semi-supervised learning which has been used to improve stealing in the supervised setting (Jagielski et al., 2020). The approach relies on semi-supervision based on MixMatch (Berthelot et al., 2019) and helps to leverage a large pool of unlabeled data while having access to only a small fraction of labeled data (Sohn et al., 2020; Assran et al., 2021). The inspiration for applying MixMatch-based methods to stealing encoders comes from DINO (see Section 2), which normalizes representations with a temperature softmax and uses the cross-entropy loss to minimize distances between outputs from student and teacher encoders. Similarly, during stealing, we pass the victim and stolen representations through the softmax layer whose outputs act as our new labels for MixMatch. Note that this method is not applicable to the fine-tuned sentence embedding encoders since these do not rely on data augmentations, which are necessary for MixMatch.

Re-using Representations (NLP). For stealing sentence encoders, we reduce the number of stealing queries by re-using representations over semantically similar sentences. This is possible since sentence encoders are required to return similar representations for such semantically similar sentences. Hence, when a stealing dataset holds such similar sentences (e.g., in the form of positive pairs, such as all the datasets used in this work), we only have to query one of these sentences and assign the same representation to all semantically similar sentences to augment our fine-tuning dataset for the stolen copy. Our experimental evaluation in Section 5.1 shows the effectiveness of this approach.
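The representation re-use described above can be sketched in a few lines. This is a minimal illustration rather than the implementation used in this work: the function name `build_stealing_set`, the `query_victim` callback, and the toy stand-in for the victim API are all hypothetical, and the stealing dataset is assumed to come as (sentence, positive partner) pairs.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Vector = List[float]


def build_stealing_set(
    positive_pairs: Sequence[Tuple[str, str]],
    query_victim: Callable[[str], Vector],
) -> Tuple[List[Tuple[str, Vector]], int]:
    """Query the victim once per positive pair and share the returned target.

    Returns the (sentence, target representation) fine-tuning set for the
    stolen copy and the number of queries actually sent to the victim.
    """
    cache: Dict[str, Vector] = {}
    dataset: List[Tuple[str, Vector]] = []
    n_queries = 0
    for anchor, positive in positive_pairs:
        if anchor not in cache:
            cache[anchor] = query_victim(anchor)  # the only call to the API
            n_queries += 1
        target = cache[anchor]
        dataset.append((anchor, target))
        # Re-use: the semantically similar partner receives the same target.
        dataset.append((positive, target))
    return dataset, n_queries


def toy_victim(sentence: str) -> Vector:
    # Hypothetical stand-in for the victim API; NOT a real sentence encoder.
    return [float(len(sentence)), float(sentence.count(" "))]


pairs = [
    ("A man plays guitar.", "Someone is playing a guitar."),
    ("Two dogs run outside.", "Dogs are running outdoors."),
]
dataset, n_queries = build_stealing_set(pairs, toy_victim)
```

With two positive pairs, the sketch sends two queries and yields four fine-tuning samples, mirroring the 10k-queries-to-20k-samples setup evaluated in Section 5.1.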

4. DEFENSES AGAINST STEALING TRANSFORMERS

Given the high training costs of transformer-based encoders, defending them from stealing is an urgent need. Yet, their representations' high dimensionality and complexity make defenses challenging.

4.1. DATASEED INFERENCE FOR VISION TRANSFORMERS

Prior work (Dziedzic et al., 2022b) has successfully applied dataset inference (DI) (Maini et al., 2021) to identify stolen copies of encoders with CNN-architectures. To verify ownership, DI assumes the training data to be private in order to act as a unique model signature. However, transformers are usually trained on millions of public data points, including noisy and uncurated data scrapes from the Internet (Radford et al., 2021) . Thus, large transformers are trained on overlapping datasets which makes DI non-applicable in this setting. To overcome this difficulty and leverage transformers' training signatures for ownership resolution, we propose DataSeed Inference (DSI). This defense combines DI with watermarking (Uchida et al., 2017) by including private data as a secret seed into the original training data and training the defended transformer on the combined datasets. The private seed then serves to uniquely identify encoders and their stolen copies. Even if an adversary is aware of our defense, they still require knowledge about the private seed to avoid extracting features contained in it. After stealing an encoder with an arbitrary dataset from the same data domain, we use the private data as well as its corresponding private test data for DSI. We visualize our approach in Figure 1a . An advantage of DSI is that the private subset does not harm the performance of the defended encoder. We observe that by adding more diverse training data, the performance can be improved for downstream tasks that share features with the private subset. The observation of increased performance is contrary to standard watermarking in supervised settings. An intuition to this behavior is that our defense does not directly influence the downstream task but only transforms the representations. In contrast, in supervised learning, watermarking directly impacts the task that the model is trained for, yielding performance loss. 
To provide the defended encoder with a stronger unique signal for DSI, we train differently on the private seed than on the original training data. Specifically, we apply weaker augmentations to the private data points. This enables the defended encoder to overfit more easily to the private seed. To this end, we modify the training procedure of the DINO framework (Caron et al., 2021) . Concretely, we change the standard DINO's data pre-processing and increase the size of crops for images passed through the student and teacher encoders. This is applied only to data points from the private subset and not to the original training points. Instead of using the student's crops that cover only small areas (less than 50%) of an input image, we increase the size of crops to the range between 70% and 90%. Furthermore, we also increase the size of the crops for the teacher encoder from greater than 50% to greater than 90%. DSI leverages this property and detects different behavior of encoders on their private training data versus unseen test data. Next, in our empirical evaluation we show that the unique signature from the private data seed is transferred to stolen encoder copies, making DSI applicable to perform ownership resolution. To further amplify the signal from the training signature, we add the victim's projection head on top of the verified encoders.
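The ownership test behind DSI can be illustrated with a simplified, standard-library-only sketch. Several assumptions are made here and none of them come from the original method: a single diagonal Gaussian stands in for the Gaussian Mixture Model, the p-value uses a normal approximation to a Welch-style test, the representations are synthetic, and all function names are hypothetical.

```python
import math
import random
from statistics import NormalDist, fmean, variance
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def fit_diag_gaussian(reps: Sequence[Vector]) -> Tuple[List[float], List[float]]:
    """Per-dimension mean and variance (a one-component, diagonal stand-in for a GMM)."""
    dims = list(zip(*reps))
    means = [fmean(d) for d in dims]
    variances = [max(variance(d), 1e-6) for d in dims]  # floor for stability
    return means, variances


def log_likelihood(x: Vector, means: Sequence[float], variances: Sequence[float]) -> float:
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, means, variances)
    )


def dsi_test(
    d1: Sequence[Vector], d2: Sequence[Vector], d_test: Sequence[Vector]
) -> Tuple[float, float]:
    """Return (delta_mu, approximate one-sided p-value).

    The density is fitted on d1 (part of the private seed). A positive
    delta_mu with a small p-value means held-out seed points (d2) are more
    likely under that density than unseen test points (d_test).
    """
    means, variances = fit_diag_gaussian(d1)
    ll_seed = [log_likelihood(x, means, variances) for x in d2]
    ll_test = [log_likelihood(x, means, variances) for x in d_test]
    delta_mu = fmean(ll_seed) - fmean(ll_test)
    std_err = math.sqrt(variance(ll_seed) / len(ll_seed) + variance(ll_test) / len(ll_test))
    t_stat = delta_mu / std_err
    # Normal approximation to the t-distribution; reasonable for large samples.
    p_value = 1.0 - NormalDist().cdf(t_stat)
    return delta_mu, p_value


def sample(mean: float, std: float, n: int, dim: int = 4) -> List[List[float]]:
    return [[random.gauss(mean, std) for _ in range(dim)] for _ in range(n)]


random.seed(0)
# Synthetic representations (an assumption): an encoder that overfits the seed
# maps seed images to a tight cluster, while test images are spread out.
seed_fit, seed_held_out, test_reps = sample(1.0, 0.3, 200), sample(1.0, 0.3, 200), sample(0.0, 1.0, 200)
delta_mu, p_value = dsi_test(seed_fit, seed_held_out, test_reps)
```

For an independent encoder, seed and test representations would be expected to follow the same distribution, so the gap in mean log-likelihood should not be significant.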

4.2. WATERMARKED SENTENCE-EMBEDDING ENCODERS

We find that DSI (as well as standard DI) does not yield significant results in detecting ownership for stolen language encoders. This happens for multiple reasons. First, all victims, stolen copies, and independent encoders use the same underlying pre-trained transformers. This similarity overshadows the individual signals from datasets used for fine-tuning to sentence embeddings. Second, victim encoders are usually fine-tuned over a small number of epochs (10 for TinyBERT and 3 for BERT and RoBERTa). We observe that with longer fine-tuning, the results returned by DI on victim encoders become more confident. However, the signal on the stolen encoders remains insignificant.

As an alternative approach, we develop a new watermarking-based defense against stealing sentence embedding encoders. We embed the watermark starting from an already fine-tuned encoder. This is a realistic scenario where the model owner would like to add a watermark post hoc at a low cost. To embed the watermark, we perform a few iterations of training, where we always alternate between one iteration of the original sentence embedding training (with SimCSE) and one iteration of training for a downstream task. During training of the downstream task, we add an additional fully-connected layer, which serves as our secret key during verification. Our watermarking approach is visualized in Figure 1b. In this work, we select SST2 (binary classification for sentiment analysis) as the watermark downstream task. Note, however, that a defender can select from many possible downstream tasks, reshuffle or flip the labels, or use their own private downstream task, which makes the detection of the watermark much harder. To resolve ownership, a verifier simply attaches the fully-connected layer (secret key) to the output of an encoder suspected to be a stolen copy.
Then, agreement between the outputs of the victim encoder (plus fully-connected layer), and the outputs of the potentially stolen copy (plus fully-connected layer) on the secret downstream task is measured as the percentage of labels where both outputs agree. We resolve that an encoder was stolen if the agreement is above the threshold of 95%, otherwise, the encoder is marked as independent. We assume that the adversary does not obfuscate the output representations as described in Dziedzic et al. (2022b) .
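The agreement-based verification can be sketched as follows. This is a toy illustration under stated assumptions: the encoders are hypothetical two-dimensional functions rather than real sentence encoders, the secret fully-connected layer is a hand-picked weight matrix, and the names `fc_predict`, `agreement`, and `is_stolen` are introduced only for this example. The 95% threshold is the one stated above.

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]
Encoder = Callable[[Vector], Vector]


def fc_predict(rep: Vector, weights: Sequence[Vector], bias: Vector) -> int:
    """Apply the secret fully-connected layer and return the arg-max label."""
    logits = [sum(w * x for w, x in zip(row, rep)) + b for row, b in zip(weights, bias)]
    return max(range(len(logits)), key=logits.__getitem__)


def agreement(
    victim: Encoder, suspect: Encoder, inputs: Sequence[Vector],
    weights: Sequence[Vector], bias: Vector,
) -> float:
    """Fraction of inputs on which victim+FC and suspect+FC predict the same label."""
    same = sum(
        fc_predict(victim(x), weights, bias) == fc_predict(suspect(x), weights, bias)
        for x in inputs
    )
    return same / len(inputs)


def is_stolen(
    victim: Encoder, suspect: Encoder, inputs: Sequence[Vector],
    weights: Sequence[Vector], bias: Vector, threshold: float = 0.95,
) -> bool:
    return agreement(victim, suspect, inputs, weights, bias) > threshold


# Toy setup (all hypothetical): a 10 x 10 grid of 2-D "inputs".
grid = [-0.9, -0.7, -0.5, -0.3, -0.1, 0.1, 0.3, 0.5, 0.7, 0.9]
inputs: List[List[float]] = [[a, b] for a in grid for b in grid]

secret_weights = [[1.0, 0.0], [-1.0, 0.0]]  # the defender's secret key
secret_bias = [0.0, 0.0]


def victim(x: Vector) -> Vector:
    return [x[0], x[1]]


def stolen(x: Vector) -> Vector:
    # A stolen copy closely imitates the victim's representations.
    return [x[0] + 0.01, x[1] - 0.01]


def independent(x: Vector) -> Vector:
    # An independent encoder arranges features differently.
    return [x[1], x[0]]


stolen_agreement = agreement(victim, stolen, inputs, secret_weights, secret_bias)
independent_agreement = agreement(victim, independent, inputs, secret_weights, secret_bias)
```

In this toy setting the stolen copy agrees with the victim on every input, while the independent encoder agrees on only half of them and therefore stays below the threshold.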

5. EMPIRICAL EVALUATION

We evaluate our methods for stealing and defending transformer-based encoders trained on different vision and NLP datasets.

5.1. STEALING TRANSFORMERS

Vision. For ImageNet victims, we use the ViT's checkpoints released by the original DINO paper (Caron et al., 2021) . For CIFAR10, we train Tiny ViTs from scratch. All training procedures follow (Caron et al., 2021) unless otherwise specified, using 300 epochs, a batch size of 256, and the learning rate is set to 5e-4 with a cosine annealing scheduler. For stealing, we experiment with different numbers of queries from various datasets, including CIFAR10, CIFAR100, SVHN, STL10, and ImageNet. Details on the datasets are included in Appendix B.1, while further insights on the training can be found in Appendix B.2. We also use ViTs to instantiate stolen encoders and experiment with different loss functions, such as MSE, InfoNCE (Chen et al., 2020) , or SoftNN (Frosst et al., 2019) . Unless otherwise specified, our stealing procedure follows Dziedzic et al. (2022a) . We assess the performance of victim and stolen encoders using standard linear evaluation (Chen et al., 2020) on four downstream tasks: CIFAR10, CIFAR100, SVHN, and STL10 (see Table 1 ). We observe that stealing with the victim's training distribution yields the closest performance of the stolen copy to the victim. The MSE loss performs better for the lower query regime (<100k) while contrastive losses like InfoNCE yield higher accuracy scores for the larger number of queries. When stealing with more complex datasets (e.g., CIFAR10 vs SVHN) we can obtain a better generalization of the stolen copy, which is shown by higher accuracy on more downstream tasks. We also apply MixMatch for the case when an attacker has limited access to the API and wants to use as few queries as possible. Thus, in this case, we operate in the regime with a number of queries ranging from 4k to 10k, see Table 2 . 
Our results show that MixMatch outperforms all other methods, even when provided with fewer labeled queries, and retains high similarity to the victim encoder when measured with the cosine similarity score between the representations from the victim encoder and a stolen copy.

Table 1: Performance of Vision Transformers. We follow SimCLR (Chen et al., 2020) and do a linear evaluation of the encoders (denoted as EN) on downstream tasks. $f_v$ and $f'_v$ are victim encoders trained on data $D_v$, $f_s$ is the stolen encoder extracted using queries from a given stealing dataset $D_s$. M is the type of the underlying encoder, where T is ViT Tiny, S is ViT Small, and B is ViT Base. Victim encoder with an asterisk (*) is a pre-trained encoder provided by Clarifai (2022). Other encoders are trained using the DINO code-base (https://github.com/facebook). CE denotes the Cross-Entropy loss and MSE is Mean Squared Error. For a given stealing dataset, we test loss functions: CE, MSE, InfoNCE, and SoftNN and report results for the best performing one.

Language. We steal from BERT-based sentence embedding encoders fine-tuned on nli-for-SimCSE (Gao et al., 2021) ("nli"), QQP (Iyer et al.) ("qqp"), and Flickr30k (Young et al., 2014) ("flickr"). For more details on the datasets and our pre-processing, see Appendix B.3. For TinyBERT, we fine-tune the victims using SimCSE (Gao et al., 2021); for BERT Base ("BERT") and RoBERTa Large ("RoBERTa"), we use models from Hugging Face as victim encoders. We fine-tune independent models using SimCSE and initialize our stolen copies with pretrained transformers of the victim architecture. Following SimCSE, we replace the original BERT pooling layer of our stolen models with a custom pooler that returns the [CLS] representation and add a randomly initialized MLP layer on top before stealing. For stealing, we issue 60K queries, using sentences from our three datasets against the victim encoders, and fine-tune our stolen copies with the resulting representations for 20 (TinyBERT and BERT) or 5 (RoBERTa) epochs, respectively, using MSE loss. Further details on our training and stealing methods are provided in Appendix B.4. Table 3 depicts the performance of our victim and stolen encoders, evaluated on tasks from the SentEval benchmark. We observe that across all base encoders (TinyBERT, BERT, and RoBERTa), the performance of the stolen copies is comparable to their respective victim encoders over most benchmark tasks. This holds true even for the large qqp dataset (∼2.6M training samples) and stolen copies obtained with only 60K queries, i.e., ∼40x fewer queries than training samples. In general, the performance of encoders stolen with nli and qqp is higher than that of encoders stolen with flickr. We suspect this is due to the low semantic diversity in flickr, which consists only of 30K images with five semantically equal captions each, leading to semantic overlap within the 60K stealing queries. We further explore the impact of the number of stealing queries on the performance of the stolen copies. Our results in Table 9 in Appendix C.1 highlight a performance decrease when reducing the number of stealing queries. The performance drop is most significant between 10k and 20k queries. This motivates an evaluation of the effectiveness of our method to re-use representations for semantically similar sentences in this setup. We query the victim encoder with 10k sentences from the nli dataset and also assign each obtained representation to the semantically equal positive partner of the queried sentence. This results in 20k fine-tuning samples for the stolen copy. Our results in Table 4 highlight that the performance of the stolen copy with only 10k queries (augmented to 20k data points) is similar to the original stealing with 20k sentences.

5.2. DEFENDING TRANSFORMERS

DataSeed Inference. We design DataSeed Inference to incorporate a private seed into the victim's training set. Concretely, we insert randomly-selected 10k images from the SVHN and GTSRB training splits into the CIFAR10 training sets. We train ViT Tiny and ViT Small on the combined datasets as victims. We use weaker augmentations for the private set, which results in more similar inputs during training and evaluation and provides a stronger signal for DSI. Otherwise, the training procedures for the victim and stolen encoders are the same as before. Table 12 shows that DSI is more effective when the projection head in training DINO uses shallower MLPs. This result meets our intuition: since the training loss is minimized directly with the projection head, some information about the training signature is lost when the head is removed. Therefore, to make the training signature easier to detect, we attach the victim's projection head on top of the verified models. We use 20k randomly-selected images from the rest of the training splits as the validation set in DSI. We train GMMs with 10 components across all settings. For more details, see Appendix B.2. The results in Table 5 demonstrate that DSI is able to differentiate the stolen transformers of CIFAR10 by injecting a private subset of SVHN and GTSRB, even when the stealing datasets are different from training. Similar to DI for supervised learning (Maini et al., 2021) , the victim encoder typically has the largest ∆µ and the smallest p-values. We also find that increasing the size of the private set makes this defense more effective. Finally, our comparisons with the independent models ensure that we avoid false positives when detecting stolen models (see last two rows in Table 5 ). We also assess the performance of the encoders trained with CIFAR10 and a private data seed. 
We observe that these encoders achieve similar performance to the victim trained only on CIFAR10 and can outperform it when assessed on the downstream tasks that share features with the private subset, see Table 1 . Watermarking Language Encoders. We present the performance of the watermarked encoders as well as the success of our watermarking in Table 6 . We use the initial fine-tuned sentence embedding encoder as the independent encoder. This can be considered the worst-case evaluation because the representations returned from this encoder are expected to be most similar to its watermarked derivation. We show that even in this worst-case, the independent model is never incorrectly resolved as being a stolen copy. To compute the p-values, we leverage the confidence scores (softmax outputs) for the correct labels from the downstream task and use the t-test. The p-values indicate that there is a significant difference between the distribution of the confidence scores from independent vs victim encoders (p-value < 5%). In contrast, the difference is not significant between the victim and stolen encoders. Our results highlight that a relatively small number of fine-tuning steps (e.g., 200 alternations between the original task and downstream task) are sufficient to successfully embed the watermark into the encoder while preserving the high performance of the defended encoder on other unrelated and general downstream tasks.

6. CONCLUSIONS AND FUTURE WORK

Modern APIs offer access to high-value transformer-based encoders for generating representations of given input text or images. We demonstrate how to steal these transformers in the language and vision domains by using only representations. Our stealing requires up to 40x fewer queries than the number of training data points used to train the victim, and it yields stolen copies with comparable performance on standard benchmarks. We further decrease the number of stealing queries by using semantically similar sentences in the language domain and semi-supervised learning in the vision domain. We propose a new defense for vision transformers, where a private data seed acts as the training signature of the victim encoder. We find that this type of defense is unable to protect sentence embedding encoders since the signal of the private data subset used during fine-tuning is overlaid with the data used during pre-training of the base encoder. To overcome this obstacle, we propose a method for embedding a watermark into the encoders by fine-tuning a defended encoder with a specific downstream task.

7. ETHICS STATEMENT

Even though we place our work in a public access setting and implement our stealing based on representations such as the ones returned by public APIs, we do not actually steal the models from real-world APIs. All the models stolen in this work are either trained by ourselves or obtained from open-source platforms, such as Hugging Face. However, we note that several public APIs expose access to the same or similar models. Hence, our attacks could potentially be applied against such APIs to steal their models. Given that the models attacked in this work are publicly available anyway, their extraction from an API would not cause additional harm to the API provider. We still decided to keep the code implementing our model stealing confidential to minimize the risk that our work exposes such providers. Instead, we submit only the code implementing our defenses.

8. REPRODUCIBILITY STATEMENT

Due to the reasons elaborated in the ethics statement, we do not publicly submit the code of our stealing procedure. Instead, for reproducibility, we describe our approach in detail, documenting the frameworks we used, all hyperparameters for training, and the architectures. We submit the code for our defense in the supplementary material. In the README.md file, we provide the main commands needed to run our code. We also provide pointers to the crucial parts of the code that directly reflect the implementation described in the main part of the submission. We also describe the defenses used in our experiments thoroughly.

A ADDITIONAL RELATED WORK

Self-Supervised Learning. In computer vision, one of the most popular self-supervised algorithms is contrastive learning (Chen et al., 2020; van den Oord et al., 2018), where representations that come from differently transformed views of the same image are brought closer to each other and the representations from views of different inputs are repelled. In NLP, a popular self-supervised pre-training approach is to mask selected words in the input sequence and train the model to predict the masked words (Devlin et al., 2018).

Transformers. The transformer (Vaswani et al., 2017) is becoming a ubiquitous architecture in NLP and computer vision. While the original transformer consists of an encoder and a decoder component, our work only studies the encoder part for representation learning. Transformers are composed of several identical layers, each consisting of a multi-head attention sublayer followed by a feed-forward sublayer. The multi-head attention sublayer utilizes the self-attention mechanism to learn the pairwise relationships between all tokens. Self-attention is the key to the success of transformers, since it makes parallel training and learning long-range dependencies between tokens easier.

NLP Transformers. In this paper, we investigate BERT-based models (Devlin et al., 2018), which are pre-trained bidirectional transformers (Vaswani et al., 2017). In addition to BERT Base, we also use TinyBERT (Jiao et al., 2019), obtained by distilling BERT to a smaller transformer architecture, and RoBERTa Large (Liu et al., 2019), which is pre-trained on a larger dataset than BERT. We then analyze the task of learning highly generic sentence representations, a fundamental problem in NLP (Kiros et al., 2015; Hill et al., 2016; Logeswaran & Lee, 2018).

Vision Transformers. Vision transformers (Dosovitskiy et al., 2020; Touvron et al., 2021; Park & Kim, 2022), inspired by transformer models from the NLP domain, achieve state-of-the-art performance for many downstream tasks, such as standard image classification (He et al., 2022) or segmentation (Caron et al., 2021). They are based on either self-distillation with no labels (Caron et al., 2021) or masked autoencoders (MAE) (Devlin et al., 2018; Dosovitskiy et al., 2020). MAE is a masked autoencoder for the self-supervised training of vision transformers. The approach is based on training an asymmetric encoder-decoder architecture that learns to reconstruct randomly masked patches of an input image. BEiT (Bao et al., 2022) is an approach similar to MAE that also masks patches of an image and learns to recover the original inputs.

Watermarking Encoders. The watermark scheme proposed by Dziedzic et al. (2022a) trains the defended encoder simultaneously with an augmentation predictor. As the watermarking task, the defended encoder is trained to predict the rotation range of an input image. For watermark verification, an encoder's representations are evaluated on the augmentation predictor to obtain the accuracy on the watermark task. When the accuracy is significantly higher than 50%, the encoder is marked as stolen. SSLGuard (Cong et al., 2022) is another watermarking-based defense. First, it generates a secret key, a verification dataset, and a decoder. A watermarked encoder and its stolen copies map samples in the verification dataset to secret representations that in turn can be transformed into the secret key through the decoder. For independently trained encoders, the decoder transforms the representations generated from the verification dataset into random vectors. SSLGuard assumes a specific attack to which the trigger is transferred, which makes it vulnerable to new types of attacks. In contrast, our DSI uses a private seed as a signature and does not rely on any assumptions about the type of stealing attack. Wu et al. (2022) manipulate the verification samples to obtain unique feature representations for them from protected and stolen encoders, which leads to unique predicted labels for any downstream task. However, it was shown that representations can be easily obfuscated (Dziedzic et al., 2022b) to remove their similarity to a stolen encoder.

Stealing from NLP Classifiers

The stealing methods proposed in Chen et al. (2021) and Rafi et al. (2022) target downstream classification tasks, while we steal representations from self-supervised encoders in both the language and vision domains. For the vision encoders, we consider a more challenging task and start stealing from randomly initialized encoders instead of pre-trained ones. Apart from improving the stealing process by leveraging semantically similar sentences, we also propose a watermark-based defense against stealing sentence embedding encoders.

B.2 EXPERIMENTAL SETUP FOR STEALING VISION TRANSFORMERS

For vision, we use the ViT Tiny and ViT Small architectures for the CIFAR10 and SVHN victim encoders. We use DINO (Caron et al., 2021) to train ViT with patch size 16 in a self-supervised way. We resize the images to 224 × 224 so that we can use the hyperparameters provided in the DINO paper. By default, as in the DINO paper, the projection head is a 3-layer MLP followed by an L2 normalization and a weight-normalized fully-connected layer; we experiment with different numbers of MLP layers in ??. We train all victim models for 300 epochs and all stolen encoders for 150 epochs. We use the AdamW optimizer and a cosine annealing scheduler when training victim and stolen models. When training the victim model, the initial learning rate is 4e-5, while for training stolen models, the initial learning rate is 1e-3. We use a batch size of 128 or 256. The temperature we use for InfoNCE is 0.07, and the temperature for the soft nearest neighbors loss is 1000. For the stealing experiments, we use the last 4 block outputs of dummy representations, in the same way as in DINO (Caron et al., 2021). Thus, the dimensions of the output representations for ViT-tiny and ViT-small are 768 and 1536, respectively.

Victim. We train tiny-ViTs on CIFAR10. All training procedures follow Caron et al. (2021) unless otherwise specified. The number of epochs is 300, with bs=256 and lr=5e-4 with a cosine annealing scheduler. The summary of the victim encoder's performance can be found in Table 1.

Stolen. When stealing from the victim encoders, we experiment with different numbers of queries from various datasets, including CIFAR10, CIFAR100, SVHN, and STL10. We also use both ResNet and TinyViT as the architectures for stolen encoders. Stolen encoders are trained by minimizing different loss functions, such as MSE, InfoNCE (Chen et al., 2020), or SoftNN (Frosst et al., 2019). Unless otherwise specified, the stealing procedure follows Dziedzic et al. (2022a).

Independent. We use ViTs trained on datasets other than the victim's training set as independent encoders.

Private Data Subsets. For DSI, we insert 5k, 10k SVHN, and 10k MNIST images into the CIFAR10 training sets respectively. The training procedures for the victim and stolen encoders are the same as before.
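To make the stealing objectives concrete, the following is a minimal sketch of an MSE loss and a simplified, one-directional InfoNCE loss between stolen and victim representations, using the temperature of 0.07 stated above. Treating the victim's representation of the same input as the positive and all other victim representations in the batch as negatives is an assumption of this sketch; the losses actually used follow Dziedzic et al. (2022a) and may differ in detail.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def mse(stolen_reps, victim_reps):
    """Mean squared error over all coordinates of all representation pairs."""
    total = sum(
        (s - v) ** 2
        for s_rep, v_rep in zip(stolen_reps, victim_reps)
        for s, v in zip(s_rep, v_rep)
    )
    return total / sum(len(v_rep) for v_rep in victim_reps)


def info_nce(stolen_reps, victim_reps, temperature=0.07):
    """Mean InfoNCE loss; (stolen_reps[i], victim_reps[i]) is the positive pair."""
    losses = []
    for i, stolen in enumerate(stolen_reps):
        logits = [cosine(stolen, victim) / temperature for victim in victim_reps]
        max_logit = max(logits)  # subtract the maximum for numerical stability
        log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)


# Toy batch: a perfectly aligned copy versus a copy that confuses the inputs.
victim_batch = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
aligned_copy = [list(rep) for rep in victim_batch]
confused_copy = victim_batch[1:] + victim_batch[:1]

print("aligned:", info_nce(aligned_copy, victim_batch))
print("confused:", info_nce(confused_copy, victim_batch))
```

The low temperature sharpens the softmax over the batch, so a copy that matches each victim representation more closely than any other in the batch receives a loss close to zero.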

B.3 NLP DATASETS AND PROCESSING

nli. We use the nli-for-SimCSE dataset of Gao et al. (2021), consisting of 275,602 data rows from SNLI (Bowman et al., 2015) and MNLI (Williams et al., 2018). Each row holds three sentences: an original sentence, a positive entailment, and a contradiction. In training, the contradiction acts as a hard negative.

qqp. We use the merve/qqp dataset from Hugging Face. The train split consists of 2,607,949 data rows, each holding two semantically equal questions. We use this data for training as positive pairs.

flickr. The flickr dataset consists of images, each annotated with five human-written captions. Following Gao et al. (2021), we consider any two captions of the same image as a positive pair. We split the data into a training set (90%) and a test set (10%), making sure that all caption pairs related to an image end up in the same set. This yields 286,050 positive-pair training examples. When using flickr as a dataset for stealing, we drop the duplicates arising from generating all possible caption pairs before sampling the stealing queries. This avoids reducing the diversity of the stealing set through repeated queries.
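The flickr processing described above can be sketched as follows. The helper names and the toy captions are hypothetical, and the split is done on image ids so that all pairs of an image stay in the same set. With five captions per image, each image contributes 10 unordered caption pairs.

```python
import random
from itertools import combinations


def split_by_image(image_ids, train_fraction=0.9, seed=0):
    """Shuffle image ids and split them so each image lands in exactly one set."""
    ids = sorted(image_ids)
    random.Random(seed).shuffle(ids)
    cut = round(len(ids) * train_fraction)
    return ids[:cut], ids[cut:]


def caption_pairs(captions_by_image, image_ids):
    """All unordered caption pairs per image (5 captions give 10 pairs)."""
    pairs = []
    for image_id in image_ids:
        pairs.extend(combinations(captions_by_image[image_id], 2))
    return pairs


def unique_queries(pairs):
    """Flatten pairs into sentences and drop duplicates, keeping first-seen order."""
    return list(dict.fromkeys(sentence for pair in pairs for sentence in pair))


# Toy stand-in for the real dataset: 10 images with 5 captions each.
captions_by_image = {
    f"img{i}": [f"caption {j} of image {i}" for j in range(5)] for i in range(10)
}
train_ids, test_ids = split_by_image(captions_by_image)
train_pairs = caption_pairs(captions_by_image, train_ids)
stealing_queries = unique_queries(train_pairs)
print(len(train_pairs), "pairs,", len(stealing_queries), "unique stealing queries")
```

Because every caption appears in four pairs of its image, sampling queries directly from the pairs would repeat each sentence several times; deduplicating first keeps every stealing query distinct.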

B.4 EXPERIMENTAL SETUP FOR STEALING LANGUAGE TRANSFORMERS

Victim. As victim encoders, we use TinyBERT-based encoders and fine-tune them for the sentence embedding task on nli, qqp, and flickr by using SimCSE (Gao et al., 2021). For more details on data pre-processing and the datasets, see Appendix B.3. We fine-tune our encoders for 10 epochs, with batch size (bs)=128, learning rate (lr)=5e-5, and temperature=0.05. Additionally, we use transformers fine-tuned with nli, namely BERT (https://huggingface.co/princeton-nlp/sup-simcse-bert-base-uncased) and RoBERTa (https://huggingface.co/princeton-nlp/sup-simcse-roberta-large) from Hugging Face, as victim encoders. For fine-tuning and stealing over all encoders, we set the maximal input sequence length to 32 and use truncation and padding. The performance of our victim encoders can be found in Table 3.

Stolen. We initialize our stolen encoders with pre-trained transformers from Hugging Face (prajjwal1/bert-tiny, bert-base-uncased, and roberta-large), in accordance with the respective victim encoder. During stealing, TinyBERT and BERT use lr=1e-5, bs=256, and linear lr-scheduling with patience 200 iterations and factor 0.5. For RoBERTa, we use the same setup, but with bs=64, lr-patience 600 iterations, and lr=5e-6 when stealing with nli or flickr. As in SimCSE, we evaluate our stolen encoders on the SentEval benchmark.

Independent. We fine-tune the independent TinyBERT-based encoders on nli, qqp, and flickr in the same setup as the victim encoders. To obtain independent encoders based on BERT and RoBERTa, we fine-tune the respective base encoders on nli, qqp, and flickr using SimCSE. We keep lr=5e-5 and temperature=0.05, and use bs=128 for BERT and bs=32 for RoBERTa, but following Gao et al. (2021), we fine-tune only for 3 epochs.
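As a small illustration of the fixed input length used above, the sketch below truncates or pads a list of token ids to 32 positions and builds the matching attention mask. The padding id and the example token ids are hypothetical, and real tokenizers additionally take care of special tokens when truncating, which this sketch ignores.

```python
MAX_LEN = 32  # maximal input sequence length stated in the text
PAD_ID = 0    # assumed padding token id (hypothetical)


def pad_or_truncate(token_ids, max_len=MAX_LEN, pad_id=PAD_ID):
    """Return (input_ids, attention_mask), both of length max_len."""
    clipped = list(token_ids[:max_len])
    num_padding = max_len - len(clipped)
    input_ids = clipped + [pad_id] * num_padding
    attention_mask = [1] * len(clipped) + [0] * num_padding
    return input_ids, attention_mask


# A short sentence is padded, a long one is cut off after 32 tokens.
short_ids, short_mask = pad_or_truncate([101, 2023, 2003, 102])
long_ids, long_mask = pad_or_truncate(list(range(1, 51)))
print(len(short_ids), sum(short_mask), len(long_ids), sum(long_mask))
```

Using the same length, padding, and truncation for the victim and for any encoder compared against it keeps their inputs identical, so differences in representations come from the encoders rather than from pre-processing.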

DSI.

To evaluate DSI, we generate training datasets with a private dataset as secret seeds. We generate three such sets by inserting 1) 50K flickr data points into the qqp dataset, 2) the full flickr dataset, i.e., ∼300K data points, into qqp, and 3) 50K qqp data points into flickr. We fine-tune sentence embedding encoders based on TinyBERT using the same procedure as for the victim and independent encoders. We analyze the performance of encoders trained with the data subsets in Table 7 and note that for most tasks the flickr+qqp encoder surpasses the performance of the encoder trained only on flickr, while the performance of qqp+flickr increases only for the STS benchmarks. Since DSI is ineffective in protecting our sentence embedding encoders, we evaluate the similarity between the victim, stolen, and independent encoders as an alternative, using the cosine similarity score as in Dziedzic et al. (2022b). Note that the cosine similarity score should not be used for ownership resolution. It is a very fragile metric, and any deviation in the training of the stolen copy from the victim increases the distances between their representations. As shown in Dziedzic et al. (2022b), most obfuscations of stolen representations leave a stolen copy undetectable by the cosine similarity score. We track the cosine similarity between representations of the stolen and victim encoder over the course of stealing on 50 representations from the test dataset (Figure 2) and observe that the similarity increases from the very beginning of stealing, providing evidence of a highly accurate stealing process. Table 8 depicts the final cosine similarity scores between language encoders, stolen copies, and independent models after stealing is completed. We find that the stolen copies have higher scores than independent encoders trained on datasets different from the victim's data.
In Table 8, we also evaluate the cosine similarity of our encoders trained with a private seed against their stolen copies and independent encoders. We evaluate qqp and flickr in both combinations as public and private data and do not consider nli. This is because the nli dataset contains SNLI, which, in turn, contains data from flickr, which might corrupt our results. Experimenting with the full flickr dataset inserted into qqp allows a fair comparison in terms of the fraction of subset data, given that qqp holds roughly 10x more data than flickr (2.6M vs. 300k). Our results highlight that inserting a large enough subset of flickr increases the cosine similarity of the qqp+flickr encoder to an independent flickr encoder while it decreases the similarity to an independent qqp encoder. The same holds for the qqp subset. By inserting larger fractions of flickr, the qqp+flickr encoder's representations also resemble more closely the ones of the nli encoder, which can be explained by nli containing some sentences from flickr. For meaningful ownership resolution, the similarity between the victim and its stolen copies has to be compared to the similarity of the victim and an independent encoder trained on the same public dataset but without the private subset (color-coded in blue in Table 8). Our results show that if a victim is stolen with the private data subset or an independent dataset, the similarity to its stolen copies is higher than to the independent encoder. However, the information on the private data subset is not transferred to the stolen copies when the stealing is done with the victim's original training data distribution. In a public setting, this is problematic since APIs reveal information about the types of data used to pre-train and fine-tune their encoders.
Note that for the pre-trained encoders, BERT* and RoBERTa* from Hugging Face, the cosine similarity scores to our independently trained encoders are close to zero. This might be caused by different factors, such as modifications to the code base or differences in the execution environment. Moreover, (Dziedzic et al., 2022b) showed that even a small obfuscation of the representations decreases the cosine similarity between two encoders substantially. Note that despite some inconsistencies in representations between BERT and RoBERTa from Hugging Face and our independently trained encoders, we are able to steal the encoders from Hugging Face. Our stolen copies do not only achieve high performance on benchmarks (Table 3 ) but also produce representations with high cosine similarity scores to the pre-trained encoders (Table 8 ). This highlights that while language transformers can be stolen effectively, the signal in their representations does not easily benefit ownership resolution. Therefore, to evaluate a worst-case scenario (the exact data and training procedure of the victim are public and the independent encoder is trained in exactly the same way), we train BERT and RoBERTa-based encoders on nli twice and report the cosine similarity scores between these two encoders, instead. The encoders trained independently on the same data and with the same procedure yield a similarity score close to one (color-coded in gray)-higher than the victim and its stolen copies. This indicates that the scores cannot be used for ownership resolution since an honest party who has trained an encoder using the same data and procedure as the victim would be falsely accused of theft. These results are expected. In fact, standard DI cannot distinguish between a victim encoder and an independent encoder trained with the same procedure using the same data. 
Therefore, DI assumes that the victim's training data is private, which, as we motivated above, might not be a realistic assumption for transformers trained on public data. This motivates our novel DSI. Our experiments with DI, DSI, and cosine similarity for language transformers yield some additional insights that need to be taken into account when designing future defenses against stealing NLP encoders. We observe that if the training and test sets of an encoder consist of semantically equal or similar sentences, DI is not even able to mark a victim encoder. This is because the encoders are trained to output similar or identical embeddings for semantically equal sentences. For example, when splitting the flickr dataset such that captions of the same image are distributed over the train and test sets, the distributions of the resulting representations are so similar that DI cannot extract any signal. Therefore, when attempting to identify stolen encoder copies in NLP, auditors need to be very careful not to accuse honest third parties of theft when these parties train their encoders on data that is semantically equivalent to the victim's training data. Furthermore, we observe that data pre-processing and tokenization in NLP play a significant role for DI. While DI is able to mark victim encoders if their input data is batched and tokenized in exactly the same way, it is unable to do so when the data is tokenized differently (e.g., with different padding, truncation, or batching). This observation is intuitive since DI relies on an exact training signature. For ownership resolution, this does not represent a limitation since the owner of the victim knows the exact pre-processing method used during their training or fine-tuning. We also experimented with DI using only sentences above or below a certain length for ownership resolution at inference time. 
This implements the intuition that some very short or long sentences might represent a more unique training signature, however, neither DI nor DSI were able to successfully mark stolen copies. Finally, we observed that in nli (in contrast to qqp and flickr), DI was not even always able to mark the victim model. Results varied broadly, depending on the data sampled from the training and test set to fit the GMMs and estimate likelihood. We assume that this is due to nli containing data from many different domains (the underlying MNLI contains text from, among others, fiction, telephone speech, and letters). Subsampling might, hence, lead to having data from completely different distributions in training and test data and, thereby, prevent DI.
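The cosine similarity score between two encoders discussed in this section can be sketched as the mean and standard deviation of the per-input cosine similarity between their representations of the same inputs. The toy representations below are hypothetical and only illustrate the computation; they say nothing about the values reported in Table 8.

```python
import math
from statistics import mean, stdev


def cosine_similarity(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))


def encoder_similarity(reps_a, reps_b):
    """Mean and standard deviation of the cosine similarity over paired inputs."""
    scores = [cosine_similarity(a, b) for a, b in zip(reps_a, reps_b)]
    return mean(scores), stdev(scores)


# Representations of the same three inputs under three encoders (toy values).
victim_reps = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
stolen_reps = [[2.0, 0.0], [0.0, 3.0], [0.5, 0.5]]        # same directions, rescaled
independent_reps = [[0.0, 1.0], [1.0, 0.0], [1.0, -1.0]]  # orthogonal directions

stolen_mean, stolen_std = encoder_similarity(victim_reps, stolen_reps)
independent_mean, independent_std = encoder_similarity(victim_reps, independent_reps)
print("victim vs stolen:", stolen_mean, "victim vs independent:", independent_mean)
```

The score depends only on the direction of the representations, which is one reason why simple obfuscations that rotate or otherwise transform the outputs can drive it down without hurting downstream utility.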

C.2 NLP: KNOWN ARCHITECTURE BUT UNKNOWN CHECKPOINT

Table 10: Base Checkpoints and Influence on Model Performance. We follow SimCSE (Gao et al., 2021) and use the SentEval benchmark. f_v denotes the victim encoder trained on data D_v. f_s is the stolen encoder extracted using queries from a given stealing dataset D_s. For stealing, we use 60,000 stealing queries to the victim encoder and fine-tune our stolen copy with the resulting outputs for 20 epochs. We use the BERT-based model pre-trained and taken from Hugging Face obtained by the SimCSE code-base, marked with an asterisk * (https://github.com/princeton-nlp/SimCSE), and steal using the nli dataset. While the original BERT was initialized with bert-base-uncased, we vary the weight initialization of the checkpoint used for stealing. We also check performance after 1 epoch of fine-tuning (1 epoch). We use two different tokenizers. The API uses their original tokenizer (bert-base-uncased, potentially unknown to the attacker), while the attacker uses the tokenizer that corresponds to the checkpoint from which they load the weights.

We observe that when stealing is started from a different checkpoint for the underlying encoder than the one used by the victim, the performance changes depending on the difference between the pre-trained checkpoints (first 3 rows in Table 10). For example, while the difference between using bert-base-uncased and bert-base-cased as the starting point for stealing is relatively small when the original model used bert-base-uncased, the performance difference when using bert-base-multilingual-cased is larger. Next, we show that the underlying encoder leaves distinct traces on the fine-tuned sentence embedding model. We analyze the performance of different checkpoints after only a single epoch of fine-tuning during which we use the stolen embeddings (last 3 rows in Table 10). This computationally inexpensive step allows attackers to quickly identify the best-performing checkpoint that they have access to and continue the fine-tuning only for the best checkpoint. 
Note that evaluating different model checkpoint initializations does not require obtaining additional representations from the victim model. Instead, the obtained representations can be reused over all checkpoints.

C.3 DATASET INFERENCE VS DISTRIBUTION INFERENCE IN VISION

Table 11 shows that DI is a robust detection method. We train two ResNet models on equal partitions of the CIFAR10 train set. The result suggests that we are able to identify two such encoders as being trained independently. We also run the same experiments for the original version of dataset inference, where we train two classifiers on equal splits of the CIFAR10 train set. We observe that both methods are robust in this setting. To apply GMMs to supervised models, we discard the last classification layer and use the representations from the penultimate layer to distinguish between these two classifiers. This shows that dataset inference is not merely a distribution inference but the intended dataset inference. We can use the method based on GMMs when the public API exposes access to the softmax values for the supervised models.

Table 11: Dataset Inference. We divide the CIFAR10 train set into two equal splits. Then, we train on each split. We test if two such encoders are marked as independent. Each value is an average of 3 trials. ∆µ is the effect size from the statistical t-test. DI denotes the original Dataset Inference proposed by Maini et al. (2021). For Blind Walk (rand), the results fluctuate due to the randomness coming from the selected data points and the embedding generation method.

Table 12: Effects of Projection Heads on DI. We perform dataset inference for DINO trained on CIFAR10 with different projection heads. DINO uses a 3-layer MLP followed by ℓ2 normalization and a linear layer as the projection head during training, which is then discarded for evaluations. The first column denotes the number of layers for the MLP in the projection head. The second column denotes whether the projection head is included in DI. For each set-up, we also vary the number of components for the GMM. A bolded number indicates that DI successfully detects the victim.
We investigate the effect of the depth and the final-layer dimension of DINO's projection head on the model's performance on downstream tasks. Table 13 shows that having a deeper MLP is crucial to the model's quality.

C.5 MORE EXPERIMENTS ON STEALING VIT SMALL

Table 14 shows the performance of stolen models using different datasets, numbers of queries, and loss functions. The results suggest that stealing with the dataset on which the victim was pre-trained gives the best result.

C.6 NUMBER OF PARAMETERS

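We count the number of parameters for the base encoders and their corresponding heads and show the results in Table 15. ViT-tiny has far fewer parameters than even the single projection head. ViT-small has fewer parameters than the full projection head used in DINO. Only ViT-base has a larger number of parameters than the full projection head from DINO. The number of parameters in the projection heads differs across encoder types due to the different dimensionality of their output representations.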



We use transformers from Hugging Face (https://huggingface.co/).

This is relevant in public API settings since costs usually increase linearly with the number of queries and since the number of representations that can be obtained in a given time unit is often limited.

https://huggingface.co/princeton-nlp/sup-simcse-bert-base-uncased

https://huggingface.co/princeton-nlp/sup-simcse-roberta-large

We also experimented with stolen copies initialized without the MLP layer and obtained similar performance in the benchmarks.



Figure 1: Defenses against Stealing Transformer-based Encoders. (a) Our DSI relies on a private dataset, randomly partitioned into three subsets D_1, D_2, and D_Test. D_1 ∪ D_2 is injected into the encoder's public training data as a secret seed. For ownership resolution, a Gaussian Mixture Model (GMM) is trained on the representations of D_1. DSI marks an encoder as stolen if the average likelihood from the GMM on D_2 is significantly higher than the likelihood on D_Test. (b) To embed our watermark for sentence embedding encoders, training alternates between the main task and a secretly chosen downstream task during the last iterations. For verification, the fully-connected (FC) layer is attached to a potentially stolen copy and agreement with the victim's output on the downstream task is measured.
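The decision rule in panel (a) can be sketched in a few lines. To keep the example dependency-free, the sketch replaces the GMM with a single diagonal-covariance Gaussian (a one-component mixture) and approximates the one-sided t-test p-value with a normal distribution; both are simplifications assumed here, not the setup used in the experiments. The representations are synthetic and all function names are hypothetical.

```python
import math
import random
from statistics import NormalDist, mean, variance


def fit_diag_gaussian(reps):
    """Per-dimension mean and variance (stand-in for fitting a GMM on D_1)."""
    dims = range(len(reps[0]))
    means = [mean([r[d] for r in reps]) for d in dims]
    variances = [variance([r[d] for r in reps]) + 1e-6 for d in dims]
    return means, variances


def log_likelihood(rep, means, variances):
    """Log-density of one representation under the diagonal Gaussian."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * var) + (x - mu) ** 2 / var)
        for x, mu, var in zip(rep, means, variances)
    )


def dsi_test(reps_d1, reps_d2, reps_test, alpha=0.05):
    """Return (delta_mu, approximate one-sided p-value, marked_as_stolen)."""
    means, variances = fit_diag_gaussian(reps_d1)
    ll_seed = [log_likelihood(r, means, variances) for r in reps_d2]
    ll_test = [log_likelihood(r, means, variances) for r in reps_test]
    delta_mu = mean(ll_seed) - mean(ll_test)  # effect size
    standard_error = math.sqrt(
        variance(ll_seed) / len(ll_seed) + variance(ll_test) / len(ll_test)
    )
    p_value = 1.0 - NormalDist().cdf(delta_mu / standard_error)
    return delta_mu, p_value, p_value < alpha


def synthetic_reps(rng, n, dims, spread):
    return [[rng.gauss(0.0, spread) for _ in range(dims)] for _ in range(n)]


# Synthetic case: seed samples (D_1, D_2) are embedded more tightly than unseen data.
rng = random.Random(0)
reps_d1 = synthetic_reps(rng, 200, 8, 0.5)
reps_d2 = synthetic_reps(rng, 200, 8, 0.5)
reps_unseen = synthetic_reps(rng, 200, 8, 1.5)

delta_mu, p_value, marked = dsi_test(reps_d1, reps_d2, reps_unseen)
print("delta_mu:", delta_mu, "marked as stolen:", marked)
```

Holding D_2 out of the density fit matters: the test compares two sets the density model has never seen, so a higher likelihood on D_2 reflects the encoder's training signature rather than overfitting of the density model itself.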

Figure 2: Cosine Similarity Score over Stealing.


Performance of NLP Transformers. We follow SimCSE (Gao et al., 2021) and use the SentEval benchmark. f_v denotes the victim encoder trained on data D_v. f_s is the stolen encoder extracted using queries from a given stealing dataset D_s. For stealing, we use 60,000 queries to the victim encoder and fine-tune our stolen copy with the resulting outputs for 20 (TinyBERT (T) and BERT (B)) or 5 (RoBERTa (R)) epochs. Victim encoders with an asterisk (*) are pretrained encoders from Hugging Face obtained by the SimCSE code-base (https://github.com/princeton-nlp/SimCSE), while other encoders are trained using the SimCSE code-base.

Re-using Representations. We assign the same extracted representation to a given query and its semantically similar sentences. # Samples denotes the final number of sentences used to fine-tune the stolen encoder.
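The reuse of extracted representations can be sketched as follows. The sketch assumes that a mapping from each query to its semantically similar sentences is already available; how these sentences are obtained is outside its scope. The `CountingVictim` class is a hypothetical stand-in for a paid embedding API and returns a toy embedding.

```python
def build_stealing_set(queries, similar_sentences, query_victim):
    """Return (sentence, representation) pairs with one victim query per original sentence."""
    pairs = []
    for sentence in queries:
        representation = query_victim(sentence)  # the only call to the victim API
        pairs.append((sentence, representation))
        # Semantically similar sentences inherit the representation at no query cost.
        for paraphrase in similar_sentences.get(sentence, []):
            pairs.append((paraphrase, representation))
    return pairs


class CountingVictim:
    """Hypothetical stand-in for an embedding API that counts its queries."""

    def __init__(self):
        self.num_queries = 0

    def __call__(self, sentence):
        self.num_queries += 1
        return [float(len(sentence)), float(sentence.count(" "))]  # toy embedding


victim_api = CountingVictim()
queries = ["a dog runs on the beach", "two people ride bikes"]
similar_sentences = {
    "a dog runs on the beach": ["a dog is running along the shore", "a puppy runs on sand"],
}
stealing_set = build_stealing_set(queries, similar_sentences, victim_api)
print(len(stealing_set), "training samples from", victim_api.num_queries, "queries")
```

The number of training samples grows with the number of similar sentences, while the number of queries sent to the victim stays equal to the number of original sentences.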

DataSeed Inference for Vision Transformers. We train ViT-Tiny (left two columns) and ViT-Small (right two columns) using DINO. We add 10k data points from SVHN and GTSRB into the CIFAR10 training set. f_v is the victim transformer trained on data D, f_s is a transformer stolen with queries from a dataset D, and f_i is an independent encoder trained on data D (different than the victim's private training data). Each value is an average over 3 trials. ∆µ is the effect size from the statistical t-test. An encoder is marked as stolen if the p-value is smaller than a threshold of 0.05.

Watermarking Sentence Embedding Encoders. We embed the watermark into an encoder and present the performance of the downstream task, the underlying encoder, and the comparison between the victim f_v, stolen f_s, and independent f_i encoders. Steps denotes the number of fine-tuning steps for watermarking. Agreement is denoted by Agr and accuracy on the watermark downstream task by Acc. Both are given in %. Test loss is denoted as L, p-value as p, and effect size as ∆µ.

Chen et al. (2021) steal fine-tuned classification models, while our extraction targets encoders and is hence more general. Rafi et al. (2022) extract architectures and weights, while our attack method extracts representations. Additionally, Rafi et al. (2022) require adversaries to have more knowledge about the victim than is assumed in our work. For example, architecture extractions require access to the target machine.

SVHN (Netzer et al., 2011): The SVHN dataset contains 32x32 coloured images with 10 classes. There are roughly 73000 training images, 26000 test images and 530000 "extra" images.

ImageNet (Deng et al., 2009): Larger sized coloured images with 1000 classes. As is commonly done, we resize all images to be of size 224x224. There are approximately 1 million training images and 50000 test images.

Performance of NLP Transformers with Mixed Dataset. We follow SimCSE (Gao et al., 2021) and use the SentEval benchmark. f_v denotes the victim encoder trained on data D_v. f_s is the stolen encoder extracted using queries from a given stealing dataset D_s.

Cosine Similarity Scores. Scores calculated over 20K data points between a victim encoder f_v, copies f_s stolen with data D_s, and independent encoders f_i trained on D_i. Results reported with standard deviation. In case f_v and f_i use the same data, we train two independent encoders with the same architecture, data, and hyperparameters (gray). Results with a double asterisk (**) indicate that we compare two independently trained BERT or RoBERTa-based encoders, instead of comparing the public Hugging Face encoder (*) with an independently trained encoder. This serves for better comparability to results obtained with TinyBERT. The notation follows Table 3. For encoders trained with the additional private subset, stolen copies should be compared with independent encoders trained on the same public data without the inserted subset (blue).

Number of Stealing Queries and Impact on Model Performance. We follow SimCSE (Gao et al., 2021) and use the SentEval benchmark. f_v denotes the victim encoder trained on data D_v. f_s is the stolen encoder extracted using queries from a given stealing dataset D_s. For stealing, we use a different number of queries to the victim encoder and fine-tune our stolen copy with the resulting outputs for 20 epochs. We use the BERT-based model pre-trained and taken from Hugging Face obtained by the SimCSE code-base, marked with an asterisk * (https://github.com/princeton-nlp/SimCSE), and steal using the nli dataset.

Base Checkpoints and Influence on Model Performance. We follow SimCSE(Gao


Table 13: Influence of Projection Heads. We train ViT Tiny from scratch using DINO on CIFAR10 with different configurations of the projection head with respect to the number of layers and the output dimension (denoted as od), and evaluate them on different downstream tasks: CIFAR10, SVHN.

ViT-base has a larger number of parameters than the full projection head from DINO. The number of parameters in the projection heads is different for various encoder types due to their different dimensionality of output representations.

C.7 FINE-TUNING

In Table 16, we present the results after fine-tuning with the SVHN dataset on ViT Small/16 pre-trained on ImageNet.

C.8 MIXMATCH FOR SVHN

We present the results on applying the MixMatch semi-supervised learning to the SVHN dataset in Table 17 . 

