ROBUST DEEPFAKE DETECTION USING DISJOINT ENSEMBLES ON REDUNDANT FEATURES

Anonymous authors
Paper under double-blind review

Abstract

Detecting GAN-generated deepfake images remains an open problem. Current detection methods fail against an adversary who adds imperceptible adversarial perturbations to the deepfake to evade detection. We propose Disjoint Deepfake Detection (D3), a detector designed to improve adversarial robustness beyond de facto solutions such as adversarial training. D3 uses an ensemble of models over disjoint subsets of the frequency spectrum to significantly improve robustness. Our key insight is to leverage a redundancy in the frequency domain and apply a saliency partitioning technique to disjointly distribute frequency components across multiple models. We formally prove that these disjoint ensembles lead to a reduction in the dimensionality of the input subspace where adversarial deepfakes lie. We then empirically validate the D3 method against white-box and black-box attacks and find that D3 significantly outperforms existing state-of-the-art defenses applied to GAN-generated deepfake detection.

1. INTRODUCTION

Significant advances in deep learning are responsible for the advent of "deepfakes", which can be misused by bad actors for malicious purposes. Deepfakes broadly refer to digital media that has been synthetically generated or modified by deep neural networks (DNNs). Modern DNNs such as generative adversarial networks (GANs) (Goodfellow et al., 2014) are now capable of synthesizing hyper-realistic deepfakes, which can then be used to craft fake social media profiles (Martineau, 2019), generate pornography (Cole, 2017), spread political propaganda, and manipulate elections.

While recent work has made respectable efforts towards solving the deepfake detection problem, many of these detectors are rendered ineffective by adversarial examples. The deepfake detection problem asks the defender to classify a given image as deepfake or real (we focus on GAN-generated deepfake images). State-of-the-art detectors often leverage DNNs on the frequency space of the input image (Frank et al., 2020). However, Carlini & Farid (2020) (amongst others) have shown that such defenses fail: the adversary can simply use adversarial perturbation techniques to evade detection (Szegedy et al., 2014; Goodfellow et al., 2015; Biggio et al., 2013; Carlini & Wagner, 2017). Defending against adversarial examples, in general, has been shown to be a difficult task (Athalye et al., 2018), and is a critical problem in the deepfake detection setting.

Our key intuition to mitigate this problem is to utilize redundant information in the frequency feature space of deepfakes to generate disjoint ensembles for deepfake detection. Specifically, we show in Section 3.1 that we can achieve good detection performance with only a subset of the features, particularly in the frequency domain. This enables us to build an ensemble of performant classifiers, each using a disjoint set of frequencies.
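To make the disjoint-partitioning idea concrete, the following is a minimal sketch of how frequency coefficients could be split across ensemble members. Here a random round-robin assignment stands in for the saliency-based partitioning described later in the paper, and an FFT stands in for the frequency transform (prior detectors such as Frank et al. (2020) use a DCT); the function names and the choice of `k` are illustrative, not taken from the paper.

```python
import numpy as np

def disjoint_partition(shape, k, seed=0):
    """Assign every frequency coefficient to exactly one of k disjoint masks.

    This random assignment is a placeholder for the paper's saliency
    partitioning; the point is only that the masks are pairwise disjoint
    and jointly cover the full spectrum.
    """
    rng = np.random.default_rng(seed)
    order = rng.permutation(int(np.prod(shape)))
    masks = []
    for i in range(k):
        m = np.zeros(int(np.prod(shape)), dtype=bool)
        m[order[i::k]] = True  # round-robin over a random ordering
        masks.append(m.reshape(shape))
    return masks

def ensemble_features(img, masks):
    """One masked frequency representation per ensemble member."""
    spec = np.fft.fft2(img)  # illustrative transform; the paper operates in frequency space
    return [np.where(m, spec, 0) for m in masks]

img = np.random.default_rng(1).random((32, 32))
masks = disjoint_partition(img.shape, k=4)
feats = ensemble_features(img, masks)

# Disjointness: no coefficient is visible to two models.
assert not np.any(masks[0] & masks[1])
# Coverage: every coefficient is visible to exactly one model.
assert np.all(sum(m.astype(int) for m in masks) == 1)
```

Because each model sees a disjoint coefficient subset, a perturbation concentrated on one non-robust frequency can influence at most one ensemble member, which is the property the next paragraph builds on.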
In contrast to traditional ensembles (where each model shares the same set of frequencies), a key advantage of this design is that non-robust frequencies are partitioned across all the models in the ensemble. Thus, an adversary is no longer able to perturb a single non-robust frequency to evade all models; rather, they must find perturbations that evade multiple sets of disjoint frequencies, which raises the attack cost.

Leveraging the above intuition, our key contributions are as follows:

1. We propose D3, a GAN-generated deepfake detection framework designed to be adversarially robust. D3 builds a robust ensemble of models that use disjoint partitions of the input

