PRIVACY-PRESERVING VISION TRANSFORMER ON PERMUTATION-ENCRYPTED IMAGES

Abstract

Massive amounts of human-related data are collected to train neural networks for computer vision tasks. Potential incidents, such as data leakages, pose significant privacy risks to applications. In this paper, we propose an efficient privacy-preserving learning paradigm, where images are first encrypted via one of two encryption strategies: (1) randomly shuffling a set of equally-sized patches and (2) mixing up sub-patches. A permutation-equivariant vision transformer is then designed to learn on the encrypted images for vision tasks, including image classification and object detection. Extensive experiments on ImageNet and COCO show that the proposed paradigm achieves accuracy comparable to competitive methods. Moreover, decrypting the encrypted images requires solving an NP-hard jigsaw puzzle or an ill-posed inverse problem, which we show empirically to be intractable even for powerful vision transformer-based attackers. We thus show that the proposed paradigm can destroy human-recognizable contents while preserving machine-learnable information. Code will be released publicly.

1. INTRODUCTION

Deep models trained on massive human-related data have come to dominate many computer vision tasks, e.g., image classification He et al. (2016), face recognition Li et al. (2021), etc. However, most existing approaches are built upon images that can be recognized by human eyes, leading to the risk of privacy leaks, since visually perceptible images containing faces or places may reveal privacy-sensitive information. This raises concerns about privacy breaches, limiting the deployment of deep models in privacy-sensitive or security-critical application scenarios and increasing people's doubts about using deep models deployed in cloud environments.

To address these growing privacy concerns, researchers have integrated privacy protection strategies into all phases of machine learning, such as data preparation, model training and evaluation, model deployment, and model inference Xu et al. (2021). The emerging federated learning paradigm allows multiple participants to jointly train a machine learning model while keeping their private data from being exposed Liu et al. (2022). However, attackers can still recover images with high accuracy from leaked gradients Hatamizadeh et al. (2022) or confidence information Fredrikson et al. (2015). To protect privacy-sensitive data at a confidential level, learning and inference directly on encrypted data is emerging as a promising direction. Unfortunately, two major complications remain Karthik et al. (2019): (1) the encryption methods themselves, such as fully homomorphic encryption, have very high computational complexity, and (2) training deep models in the encrypted domain is extremely challenging due to the need for calculations in the ciphertext space.

Recent studies in natural language processing suggest that higher-order co-occurrence statistics of words play a major role in masked language models like BERT Sinha et al. (2021).
Moreover, it has been shown that word order contains surprisingly little information compared to that contained in the bag of words, since the understanding of syntax and the compressed world knowledge held by large models (e.g., BERT and GPT-2) are sufficient to infer the word order Malkin et al. (2021). Analogously, due to the properties of the attention operation, Vision Transformer (ViT) Dosovitskiy et al. (2020) is permutation-equivariant w.r.t. its attentive tokens when positional encoding is removed. As evaluated in our experiments, removing the positional embedding from ViT leads to only a moderate performance drop (3.1%; please see Table 1). This phenomenon inspires us to explore permutation-based encryption strategies.

To maximize the usability of the resulting paradigm, two requirements need to be satisfied. First, the encryption process should preserve the machine-learnable information of the inputs: compared with existing deep models applied to non-encrypted images, the performance drop should be insignificant and acceptable, making it possible to replace existing deep models at scale in privacy-sensitive circumstances. Second, the human-recognizable contents should be largely destroyed, and the decryption algorithm should have a very high complexity or an unaffordable cost. In addition, it is preferable for the encryption algorithm to be decoupled from the models to be trained, as this allows image contents to be encrypted in scenarios where only limited computing resources are available.

To this end, we propose an efficient privacy-preserving learning paradigm. The key insight of our paradigm is two-fold: (1) designing encryption strategies based on permutation-equivariance and (2) making part of or the whole network permutation-equivariant, which allows it to learn on the encrypted images.
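The permutation-equivariance underlying this insight can be verified with a minimal sketch (our illustration, not the paper's code; identity Q/K/V projections are assumed for brevity, though any per-token projection preserves the same property): with positional encoding removed, permuting the input tokens permutes the attention outputs in exactly the same way.

```python
import numpy as np

def self_attention(x):
    """Single-head self-attention with identity Q/K/V projections and no
    positional encoding (illustrative simplification of a ViT block)."""
    scores = x @ x.T / np.sqrt(x.shape[1])           # (N, N) token similarities
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)    # row-wise softmax
    return weights @ x                               # (N, D) attended tokens

rng = np.random.default_rng(0)
tokens = rng.standard_normal((6, 8))   # 6 patch tokens, 8 dims each
perm = rng.permutation(6)              # a secret shuffling of the patches

# Permuting the inputs permutes the outputs identically (equivariance):
out_then_perm = self_attention(tokens)[perm]
perm_then_out = self_attention(tokens[perm])
assert np.allclose(out_then_perm, perm_then_out)
```

This is why a positional-encoding-free transformer can process shuffled patches: the per-token outputs are unchanged up to the same reordering, so permutation-invariant heads (e.g., pooling over tokens for classification) see the same information.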
Two strategies are proposed to encrypt images: randomly shuffling (RS) an image into a set of equally-sized patches and mixing up (MI) sub-patches; see Figure 1. Decrypting an image encrypted by RS amounts to solving a jigsaw puzzle, which can incur a large computational overhead since the problem is NP-hard Demaine & Demaine (2007). Decrypting an image encrypted by MI amounts to solving an ill-posed inverse problem, which is hard due to the difficulty of modelling the sub-patch distribution. We show that both kinds of encrypted images remain machine-learnable by further designing the architectures PEViT and PEYOLOS, based on ViT and YOLOS Fang et al. (2021), for the image classification and object detection tasks, respectively; see Figure 2.

Specifically, our main contributions are summarized as follows:

• We propose an efficient privacy-preserving learning paradigm that destroys human-recognizable contents while preserving machine-learnable information. The paradigm adopts a decoupled encryption process that exploits permutation-equivariance, so that encrypted images remain learnable for networks that are (partially) permutation-equivariant.

• RS is tailored for standard image classification with vision transformers. By substituting reference-based positional encoding for the original one, the network is capable of learning on images encrypted by RS.

• Another hallmark of our paradigm is that, by further designing MI, it extends to position-sensitive tasks such as object detection, for which we adapt the way image patches are mapped to make the network partially permutation-equivariant.

• Extensive attack experiments show the security of our encryption strategies. Comparison results on large-scale benchmarks show that both PEViT and PEYOLOS achieve promising performance even with highly encrypted images as input.
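The RS strategy can be sketched as follows (our illustrative implementation, not the paper's released code; the function names, patch size, and key handling are assumptions): the image is split into equally-sized patches that are permuted under a secret key, and only the key holder can invert the shuffle, while an attacker faces the jigsaw-puzzle problem.

```python
import numpy as np

def rs_encrypt(image, patch=4, key=None):
    """Random-shuffling (RS) encryption sketch: split an HxWxC image into
    equally-sized patches and permute them with a secret key."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0
    # Reshape into a grid of patches, then flatten to a patch sequence.
    grid = image.reshape(h // patch, patch, w // patch, patch, c)
    patches = grid.transpose(0, 2, 1, 3, 4).reshape(-1, patch, patch, c)
    perm = np.random.default_rng(key).permutation(len(patches))
    return patches[perm], perm  # the permutation acts as the decryption key

def rs_decrypt(patches, perm, h, w):
    """Invert the shuffle given the secret permutation (key holder only)."""
    inv = np.argsort(perm)
    patch, c = patches.shape[1], patches.shape[3]
    grid = patches[inv].reshape(h // patch, w // patch, patch, patch, c)
    return grid.transpose(0, 2, 1, 3, 4).reshape(h, w, c)

img = np.arange(8 * 8 * 3, dtype=np.float32).reshape(8, 8, 3)
enc, key = rs_encrypt(img, patch=4, key=42)
assert np.allclose(rs_decrypt(enc, key, 8, 8), img)  # key holder recovers the image
```

Note that a permutation-equivariant network never needs `rs_decrypt`: it consumes the shuffled patch sequence directly, which is what keeps the encryption decoupled from training.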



Figure 1: Illustration of images encrypted by random shuffling (RS), mixing-up (MI), and their combination. The visual contents of encrypted images are near-completely protected from recognition by human eyes.

