ROBUST DEEPFAKE DETECTION USING DISJOINT ENSEMBLES ON REDUNDANT FEATURES

Anonymous authors
Paper under double-blind review

Abstract

Detecting GAN-generated deepfake images remains an open problem. Current detection methods fail against an adversary who adds imperceptible adversarial perturbations to a deepfake in order to evade detection. We propose Disjoint Deepfake Detection (D3), a detector designed to improve adversarial robustness beyond de facto solutions such as adversarial training. D3 uses an ensemble of models over disjoint subsets of the frequency spectrum to significantly improve robustness. Our key insight is to leverage redundancy in the frequency domain, applying a saliency partitioning technique to disjointly distribute frequency components across multiple models. We formally prove that these disjoint ensembles reduce the dimensionality of the input subspace in which adversarial deepfakes lie. We then empirically validate D3 against white-box and black-box attacks and find that it significantly outperforms existing state-of-the-art defenses applied to GAN-generated deepfake detection.

1. INTRODUCTION

Significant advances in deep learning are responsible for the advent of "deepfakes", which can be misused by bad actors for malicious purposes. Deepfakes broadly refer to digital media that has been synthetically generated or modified by deep neural networks (DNNs). Modern DNNs such as generative adversarial networks (GANs) (Goodfellow et al., 2014) are now capable of synthesizing hyper-realistic deepfakes, which can then be used to craft fake social media profiles (Martineau, 2019), generate pornography (Cole, 2017), spread political propaganda, and manipulate elections.

While recent work has made respectable progress towards solving the deepfake detection problem, many of these detectors are rendered ineffective by adversarial examples. The deepfake detection problem asks the defender to classify a given image as deepfake or real (we focus on GAN-generated deepfake images). State-of-the-art detectors often apply DNNs to the frequency space of the input image (Frank et al., 2020). However, Carlini & Farid (2020), amongst others, have shown that such defenses fail: the adversary can simply use adversarial perturbation techniques to evade detection (Szegedy et al., 2014; Goodfellow et al., 2015; Biggio et al., 2013; Carlini & Wagner, 2017). Defending against adversarial examples has, in general, been shown to be a difficult task (Athalye et al., 2018), and it is a critical problem in the deepfake detection setting.

Our key intuition for mitigating this problem is to exploit redundant information in the frequency feature space of deepfakes to build disjoint ensembles for deepfake detection. Specifically, we show in Section 3.1 that we can achieve good detection performance with only a subset of the features, particularly in the frequency domain. This enables us to build an ensemble of performant classifiers, each using a disjoint set of frequencies.
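The feature-redundancy intuition above can be made concrete with a short sketch: transform an image into the DCT domain and hand each ensemble member a disjoint subset of the coefficients. The snippet below is a minimal illustration, not the paper's actual pipeline; in particular, the random round-robin split stands in for D3's saliency-based partitioning, and the image is synthetic:

```python
import numpy as np
from scipy.fft import dctn

def dct_features(img):
    """Flattened 2D type-II DCT of an image (the frequency features)."""
    return dctn(img, norm="ortho").ravel()

def disjoint_partition(n_features, n_models, seed=0):
    """Split feature indices into disjoint subsets, one per model.

    D3 partitions frequencies by saliency; a random split is used here
    purely as a placeholder.
    """
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n_features), n_models)

img = np.random.default_rng(1).random((32, 32))   # stand-in image
feats = dct_features(img)
parts = disjoint_partition(feats.size, n_models=4)

# Each ensemble member only ever sees its own frequency subset,
# so perturbing one frequency affects exactly one model.
views = [feats[p] for p in parts]

# The subsets are pairwise disjoint and jointly cover all features.
covered = np.concatenate(parts)
assert len(covered) == len(set(covered.tolist())) == feats.size
```

A classifier would then be trained on each `views[i]` independently, and the members' outputs combined with a voting rule.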
In contrast to traditional ensembles (where each model shares the same set of input features), a key advantage of this design is that non-robust frequencies are partitioned across all models in the ensemble. An adversary can therefore no longer perturb a single non-robust frequency to evade every model; instead, they must find perturbations that simultaneously evade multiple disjoint sets of frequencies, which raises the attack cost. Leveraging this intuition, our key contributions are as follows:

1. We propose D3, a GAN-generated deepfake detection framework designed to be adversarially robust. D3 builds a robust ensemble of models that use disjoint partitions of the input features, which is achieved by leveraging redundancy in the feature space. D3 achieves robustness while still exhibiting natural deepfake detection AUC-ROC scores as high as 99% (see Section 4.4 for details).

2. Building on Tramèr et al. (2017b)'s analysis of the dimensionality of adversarial subspaces, we prove new bounds on the maximum number of adversarial directions that can be found under an ensemble with disjoint inputs. Our bounds are tight for both the ℓ2 and ℓ∞ perturbation norms (Lemmas 3.1 and 3.2 in Section 3.3) and indicate that D3 reduces the dimension of the adversarial subspace.

3. We evaluate D3 against white-box and black-box attacks on a variety of GAN-generated deepfake images and find that D3 significantly outperforms state-of-the-art defenses such as ADP (Pang et al., 2019), GAL (Kariyappa & Qureshi, 2019), and DVERGE (Yang et al., 2020), suggesting a reduction in the dimension of the adversarial subspace. For example, as indicated by our evaluation in Section 4.2, D3 maintains 100% adversarial accuracy against AutoAttack (Croce & Hein, 2020b) where baselines drop below 20%. Increasing the number of attack steps to 1000 reduces D3's robustness to 61%, compared to 0% for all baselines.

2. BACKGROUND AND RELATED WORK

Notation. We consider a distribution D over X × Y, where X ⊆ R^d is the input space and Y ⊆ Z is the finite label space of c classes. We denote vectors in boldface (e.g., x). We denote a trained classifier as a function F : X → Y (the classifier is usually parameterized by its weights w, omitted for brevity), and the loss function as L(x, y). An ensemble classifier is a function M_(F1,F2,...,Fn) : X → Y that combines the logit outputs l1, l2, ..., ln of multiple classifiers F1, F2, ..., Fn with a voting aggregation function A : R^{n×c} → Y. We now define adversarial examples against a classifier. For a classifier F and input-label pair (x, y), an adversarial example is a perturbed input x′ such that (1) x′ is misclassified, i.e., F(x′) ≠ y, and (2) ||x − x′|| lies within a small ϵ-ball, where ||·|| is a given norm. The value of ϵ is chosen to be small so that the perturbation is imperceptible to humans.

Deepfake Detection. The research community has made rapid progress towards detecting GAN-generated deepfake images. Examples of proposed detection schemes include DNN classifiers (Wang et al., 2020; Yu et al., 2019), color-space anomaly detection (McCloskey & Albright, 2018), and co-occurrence matrix analysis (Nataraj et al., 2019), amongst others (Tariq et al., 2019; Mi et al., 2020; Marra et al., 2018; Guarnera et al., 2020; Marra et al., 2019). A recent line of work has significantly advanced the state of the art by leveraging frequency-space analysis. For example, Frank et al. (2020) proposed detecting deepfakes by applying the Discrete Cosine Transform (DCT) as a pre-processing transform before a binary DNN-based deepfake classifier. Similar work has also achieved remarkable performance: Zhang et al. (2019) use GAN simulators to extract similar frequency artifacts, and Durall et al. (2019) successfully train DNNs in the frequency domain to detect deepfakes. Unfortunately, these detectors have been rendered ineffective in adversarial settings. Specifically, Carlini & Farid (2020) showed that frequency-based detectors are vulnerable to adversarial examples: an adversary can add imperceptible adversarial perturbations to a deepfake that evade such detectors. Others have corroborated this observation (Hussain et al., 2021; 2022b; Neekhara et al., 2021; Gandhi & Jain, 2020; Vo et al., 2022; Shahriyar & Wright, 2022; Fernandes & Jha, 2020; Liao et al., 2021; Hussain et al., 2022a).

Adversarial Deepfakes. A widely accepted solution to countering adversarial examples is to train the model on adversarial examples generated during training (Madry et al., 2018). However, Carlini & Farid (2020) suggest that adversarial training alone is unlikely to achieve significant improvement in robustness in the difficult deepfake detection setting. Our experiments in Section 4.2 confirm that Frank et al. (2020)'s frequency-space deepfake detector, even when adversarially trained, cannot withstand adversarial deepfakes crafted using a variety of attacks (see Tables 1 and 2). While we also adversarially train each model in our ensembles, we find that our ensembles significantly improve robustness over a standalone adversarially trained model (see Section 4.2).
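To make the threat model concrete, the sketch below crafts a one-step ℓ∞ adversarial perturbation (an FGSM-style step in the spirit of Goodfellow et al., 2015) against a toy logistic "detector" over raw pixels. The detector, its weights, and the ϵ budget are illustrative assumptions only; the attacks discussed above target real DNN detectors with stronger multi-step methods:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
w, b = rng.normal(size=64), 0.0          # toy detector parameters

def detect_prob(x):
    """Toy differentiable detector: P(deepfake) for a flat 'image' x."""
    return sigmoid(x @ w + b)

def fgsm_evade(x, eps):
    """One signed-gradient step that lowers the deepfake score.

    d/dx sigmoid(x.w + b) = p * (1 - p) * w, so stepping against
    sign(grad), clipped to the valid pixel range, reduces P(deepfake).
    """
    p = detect_prob(x)
    grad = p * (1.0 - p) * w
    return np.clip(x - eps * np.sign(grad), 0.0, 1.0)

x = rng.random(64)                        # stand-in deepfake image
x_adv = fgsm_evade(x, eps=8 / 255)

assert np.max(np.abs(x_adv - x)) <= 8 / 255 + 1e-12  # stays in the eps-ball
assert detect_prob(x_adv) <= detect_prob(x)          # score never rises
```

Multi-step attacks such as PGD or AutoAttack (used in Section 4.2) iterate this step while projecting back into the ϵ-ball, which is what makes them so much harder to defend against.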

