ANALYZING AND IMPROVING GENERATIVE ADVERSARIAL TRAINING FOR GENERATIVE MODELING AND OUT-OF-DISTRIBUTION DETECTION

Anonymous

Abstract

Generative adversarial training (GAT) is a recently introduced adversarial defense method. Previous works have focused on empirical evaluations of its application to training robust predictive models. In this paper we focus on a theoretical understanding of the GAT method and on extending its application to generative modeling and out-of-distribution detection. We analyze the optimal solutions of the maximin formulation employed by the GAT objective, and compare them with those of the minimax formulation employed by GANs. We use theoretical analysis and 2D simulations to understand the convergence property of the training algorithm. Based on these results, we develop an unconstrained GAT algorithm, and conduct comprehensive evaluations of the algorithm's application to image generation and adversarial out-of-distribution detection. Our results suggest that generative adversarial training is a promising new direction for the above applications.

1. INTRODUCTION

Generative adversarial training (GAT) (Yin et al., 2020) is a recently introduced defense mechanism that can be used for adversarial example detection and robust classification. The defense consists of a committee of detectors (binary discriminators), each trained to discriminate natural data of a particular class from adversarial examples perturbed from data of other classes. Like most other work in the area of robust machine learning, the defense is designed for norm-constrained adversaries, i.e., adversaries that may perturb the data only up to a certain amount as measured by some norm. The defense's robustness is achieved by training each detector model against adversarial examples produced by the norm-constrained PGD attack (Madry et al., 2017).

Existing work: training and evaluating robust predictive models. A detector trained with GAT has strong interpretability: an unbounded attack that maximizes the detector's output produces images that resemble the target class data, which suggests the detector has learned the target class data distribution. However, all previous works (Yin et al., 2020; Tramer et al., 2020) focus on empirical evaluations of GAT's application to training robust predictive models; a theoretical understanding of why this training method causes the detector to learn the data distribution is missing.

This work: theoretical understanding, improved training algorithm, and extended applications. To better understand the GAT method, we first analyze the optimal solutions of the training objective. We start with a maximin formulation (eq. (5)) of the objective, and connect it with the minimax formulation (eq. (1)) employed by GANs (Goodfellow et al., 2014). We find that the differences between the solutions of these two formulations become immediately clear once we take a game-theoretic perspective.
We then use theoretical analysis and 2D simulations to understand the convergence property of the GAT training algorithm. Building on these theoretical and experimental insights, we develop an unconstrained GAT algorithm and apply it to the tasks of generative modeling and out-of-distribution detection. We find the maximin-based generative model more stable to train than its minimax counterpart (GANs), and at the same time more flexible: it has no fixed generator and can transform arbitrary inputs into target-distribution data, which can be particularly useful for certain applications (e.g., face manipulation). The model trained with the unconstrained GAT algorithm also outperforms several state-of-the-art methods on the task of adversarial out-of-distribution detection. In summary, our key contributions are:

• We analyze the optimal solutions of the GAT objective and the convergence property of the training algorithm. We discuss the implications of these results for improved training of robust predictive models, generative modeling, and out-of-distribution detection.

• We develop an unconstrained generative adversarial training algorithm, and conduct a comprehensive evaluation of its application to image generation and adversarial out-of-distribution detection.

• Our comparative analysis of the maximin and minimax problems clarifies misconceptions and provides new insights into how they can be utilized to solve different problems.

2. RELATED WORK AND BACKGROUND

Generative adversarial networks (GANs). The GANs framework (Goodfellow et al., 2014) learns a generator function G and a discriminator function D by solving the minimax problem

min_G max_D V(D, G) = E_{x~p_data}[log D(x)] + E_{z~p_z}[log(1 - D(G(z)))].  (1)

The generator G implicitly defines a distribution p_g by mapping a prior distribution p_z from a low-dimensional latent space Z ⊆ R^z to the high-dimensional data space X ⊆ R^d. D : X → [0, 1] is a function that discriminates the target data distribution p_data from the generated distribution p_g. The minimax problem is solved by alternating between the optimization of D and the optimization of G; under certain conditions, the alternating training procedure converges to a solution where p_g matches p_data (the Jensen-Shannon divergence is zero), and D outputs 1/2 on the support of p_data.
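As a toy check of these optimality claims (an illustration of our own, not part of the original method), the sketch below evaluates V(D, G) for discrete distributions on a shared finite support: when p_g equals p_data, the best-response discriminator is 1/2 everywhere and V equals -log 4.

```python
import numpy as np

def gan_value(D, p_data, p_g):
    """V(D, G) = E_{p_data}[log D] + E_{p_g}[log(1 - D)] on a finite support."""
    return float(np.sum(p_data * np.log(D)) + np.sum(p_g * np.log(1.0 - D)))

def best_response_D(p_data, p_g):
    """Optimal discriminator for a fixed p_g: D*(x) = p_data(x) / (p_data(x) + p_g(x))."""
    return p_data / (p_data + p_g)

p_data = np.array([0.2, 0.3, 0.5])
p_g = p_data.copy()                    # generator has matched the data distribution
D_star = best_response_D(p_data, p_g)  # equals 1/2 at every support point
v = gan_value(D_star, p_data, p_g)     # equals -log(4) at the equilibrium
```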

Generative adversarial training (GAT)

The GAT method (Yin et al., 2020) is designed for training adversarial example detection and robust classification models. In a K-class classification problem, the robust detection/classification system consists of K base detectors, each trained by minimizing the following objective

L(D) = -E_{x~p_k}[log D(x)] - E_{x~p_-k}[log(1 - max_{x'∈B(x,ε)} D(x'))].  (2)

In the above objective, p_k is the k-th class's data distribution, p_-k is the mixture distribution of all other classes, p_-k = (1/(K-1)) Σ_{i=1,...,K, i≠k} p_i, and B(x, ε) is a neighborhood of x: {x' ∈ X : ||x' - x||_2 ≤ ε}. The objective is characterized by an inner maximization problem and an outer minimization problem; when the inner maximization is perfectly solved and D achieves a vanishing loss, D becomes a perfectly robust model capable of separating data of p_k from any ε-constrained adversarial examples perturbed from data of p_-k. A committee of K detectors then provides a complete solution for detecting any adversarial example perturbed from an arbitrary class. Objective 2 is solved using an alternating gradient method (Algorithm 1), with the first step crafting adversarial examples by solving the inner maximization, and the second step improving the D model on these adversarial examples. Clearly, the detector's robustness depends on how well the inner maximization is solved. Although D is a highly non-concave function when parameterized by a deep neural network, Madry et al. (2017) showed that the inner problem can be reasonably well solved using projected gradient descent (the PGD attack), a first-order method that employs the following iterative update rule (for the L_2-based attack, with initialization x^0 ← x)

x^{i+1} ← Proj(x^i + λ ∇log D(x^i) / ||∇log D(x^i)||_2),  (3)

where λ is the step size, and Proj is the operation of projecting onto the feasible set B(x, ε).
The normalized steepest ascent rule inside the Proj function was introduced to deal with the issue of vanishing gradients when optimizing the cross-entropy loss (Kolter & Madry, 2019). The PGD attack also employs random restarts to improve its effectiveness: for an input x, first generate a set of randomized inputs by uniformly sampling from B(x, ε), perform the PGD attack on each of them, and use the most effective one as the actual attack. A review of related work on out-of-distribution detection is provided in Appendix A.

Algorithm 1 Generative Adversarial Training
1: Sample a minibatch of m samples {x_1^k, ..., x_m^k} from p_k and m samples {x_1, ..., x_m} from p_-k.
2: For each x_i, compute the adversarial example x̂_i ∈ B(x_i, ε) by solving the inner maximization with the PGD attack (eq. (3)).
3: Update D by minimizing (1/m) Σ_{i=1}^m [-log D(x_i^k) - log(1 - D(x̂_i))] (single step).
4: Return to step 1.
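As an illustration of the PGD procedure just described, here is a minimal numpy sketch of an L_2-constrained attack with random restarts on a toy detector. The detector, its analytic gradient, and all parameter values are our own assumptions for the example.

```python
import numpy as np

def l2_pgd(x0, log_D, grad_log_D, eps, step, n_steps, n_restarts=5, seed=0):
    """Maximize log D(x') over the L2 ball B(x0, eps) with normalized ascent
    steps, keeping the best result over several random restarts."""
    rng = np.random.default_rng(seed)
    best_x, best_val = x0, log_D(x0)
    for _ in range(n_restarts):
        u = rng.normal(size=x0.shape)                 # random start inside the ball
        x = x0 + eps * rng.uniform() * u / np.linalg.norm(u)
        for _ in range(n_steps):
            g = grad_log_D(x)
            x = x + step * g / (np.linalg.norm(g) + 1e-12)  # normalized ascent
            d = x - x0                                       # project onto B(x0, eps)
            n = np.linalg.norm(d)
            if n > eps:
                x = x0 + eps * d / n
        if log_D(x) > best_val:
            best_x, best_val = x, log_D(x)
    return best_x

# Toy detector peaked at c: log D(x) = -||x - c||^2, so D is in (0, 1].
c = np.array([2.0, 0.0])
log_D = lambda x: -np.sum((x - c) ** 2)
grad_log_D = lambda x: -2.0 * (x - c)

x0 = np.zeros(2)
x_adv = l2_pgd(x0, log_D, grad_log_D, eps=1.0, step=0.1, n_steps=50)
# x_adv ends up on the ball boundary near the point closest to c, about (1, 0)
```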

3. THEORETICAL RESULTS

In this section we first reformulate objective 2 into a maximin problem, and then analyze the optimal solutions of the maximin problem and the convergence property of Algorithm 1. We then discuss the optimal solution of the corresponding minimax formulation and the differences between the solutions of the two formulations. The popular generative modeling approach of GANs learns a data distribution by solving the minimax problem, but there appears to be a misconception about the differences between the solutions of these two problems, and, as a result, a false impression that the GANs algorithm could solve the maximin problem (Goodfellow (2016), section 5.1.1). Our analysis of the optimal solutions is based on a game-theoretic interpretation of these problems, under which the differences between the solutions become immediately clear.

3.1. THE MAXIMIN PROBLEM

In this section we provide an analysis of the optimal solutions of objective 2. Maximizing D is equivalent to minimizing log(1 - D), hence eq. (2) is equivalent to

L(D) = -E_{x~p_k}[log D(x)] - E_{x~p_-k}[min_{x'∈B(x,ε)} log(1 - D(x'))].  (4)

For the convenience of analysis, instead of using ε-balls imposed on individual data samples, we use the notion of a common perturbation space: the perturbation space S is a subspace of the data space X, and allows mass of p_-k to be moved to any location in S. A new distribution p_t can be obtained by transporting the mass of p_-k to appropriate locations in S, via a transformation function T : S → S. Using the technique of random variable transformation, we can write the density function of p_t as a function of p_-k:

p_t(y) = ∫_S p_-k(x) δ(y - T(x)) dx.

The left panel of Figure 1 is a schematic illustration of this process. Let M_+^1(S) be the set of distributions attainable by applying such transformations to the support of p_-k. With the notion of a perturbation space, the inner problem in eq. (4) can then be interpreted as determining the distribution in M_+^1(S) that causes the highest (expected) loss of the D function. Assuming B(x, ε) = S, the interplay of the D model and the adversary can be formulated as a maximin problem:

max_D min_{p_t∈M_+^1(S)} U(D, p_t) = E_{x~p_k}[log D(x)] + E_{x~p_t}[log(1 - D(x))].  (5)

In this maximin game, player 1 (controlling D) makes the first move: player 1 first presents different Ds; for each D, player 2 determines a p_t^D that minimizes U under the considered D; then, over all the combinations (D, p_t^D), player 1 chooses the combination (D, p_t^D)* that gives the highest U value. By analyzing both players' best strategies for playing the game, we can derive the optimal solutions (Table 1) for the three scenarios depicted in Figure 1. From a game-playing perspective, the claims in Table 1 can be verified by assuming a D configuration different from the claimed one, and showing that there always exists a p_t that results in a lower U value than the one achievable with the claimed D configuration. Mathematical derivations of these optimal solutions are included in Appendix D.
A discussion about the scenario 2 result and its implications for training robust models is provided in Appendix E.
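The game-playing argument can be checked numerically on a toy discrete version of problem 5 (the support, the D grid, and the distributions below are our own illustrative assumptions): the adversary's best response moves all p_t mass to the point where D is largest, and a brute-force search over D confirms that the optimal D outputs 1/2 on Supp(p_k) and no more than 1/2 elsewhere.

```python
import itertools
import numpy as np

# Toy setting: S = {s0, s1, s2}; p_k puts all of its mass on s2.
vals = np.linspace(0.05, 0.95, 19)           # candidate D outputs per point

def worst_case_U(D):
    """U(D, p_t) under the adversary's best response: all p_t mass at argmax D."""
    return np.log(D[2]) + np.log(1.0 - np.max(D))

best_D, best_U = None, -np.inf
for D in itertools.product(vals, repeat=3):  # brute-force outer maximization over D
    u = worst_case_U(np.array(D))
    if u > best_U:
        best_D, best_U = np.array(D), u

# Optimum: D = 1/2 on Supp(p_k) = {s2}, off-support outputs <= 1/2, U = 2 log(1/2).
```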

3.2. THE MAXIMIN PROBLEM SOLVER

The method for training adversarially robust detectors (Algorithm 1) is in fact a solver (assuming B(x, ε) = S) for the maximin problem 5: maximizing D(x') is equivalent to minimizing log(1 - D(x')) (step 2), and minimizing the loss is equivalent to maximizing U (step 3). Algorithm 1 has the following convergence property:

Proposition 1. If step 2 always perfectly solves the inner problem (i.e., the mass of p_t is always moved to the location(s) where D has the largest output(s)), step 3's updates happen in D's function space, and each update is sufficiently small, then the algorithm converges to the optimal solution of D.

Proof. We consider scenario 1 in Figure 1. Let α := max_{S\Supp(p_k)} D, A := {x ∈ S \ Supp(p_k) : D(x) = α}, and β := max_{Supp(p_k)} D, B := {x ∈ Supp(p_k) : D(x) = β}. We focus on the case 1 > α, β > 1/2; other cases can be proved using a similar argument. Recall that in Algorithm 1, step 2 solves the inner minimization by moving mass of p_-k to locations where D has the largest outputs, and step 3 updates D by decreasing its outputs on p_t and increasing its outputs on p_k. We further assume that when mass of p_-k is moved to multiple locations with equal D outputs, the algorithm has no preference over locations (i.e., the mass of p_-k is uniformly distributed over these locations). Algorithm 1 can be interpreted as a finite state machine that constantly switches between the following three states:

• State 1: α > β. Step 2 moves the mass of p_-k to A, and step 3 decreases α while increasing β; the algorithm switches to state 2 or state 3.

• State 2: α < β. Step 2 moves the mass of p_-k to B, and step 3 maintains α while decreasing β (Appendix F.1); the algorithm switches to state 1 or state 3.

• State 3: α = β. Step 2 moves the mass of p_-k to A ∪ B, and step 3 decreases α. Because points of B have non-zero density under p_k, if β is decreased at all, the decrease is always smaller than that of α; the algorithm switches to state 2.

In particular, step 3 in states 1 and 2 always results in a decrease of max{α, β} (though β cannot be decreased below 1/2, Appendix D), and step 3 in state 3 always results in a decrease of α. The algorithm therefore converges to the D solution with α ≤ 1/2 and β = 1/2.
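The state-machine dynamics of the proof can be simulated directly. In this toy sketch (our own construction: one off-support point carrying α, one support point carrying β, and idealized update magnitudes), both values are driven toward the claimed fixed point α ≤ 1/2, β = 1/2:

```python
import numpy as np

# alpha: largest D output off Supp(p_k); beta: largest D output on Supp(p_k).
alpha, beta = 0.9, 0.6
lr = 0.002

for _ in range(5000):
    if alpha > beta:
        # State 1: adversarial mass sits at the off-support point.
        # Step 3 pushes D down there (log(1 - D) term) and up on p_k (log D term).
        alpha -= lr / (1.0 - alpha)
        beta += lr / beta
    else:
        # States 2/3: adversarial mass sits on Supp(p_k); the p_k and p_t
        # terms both act on beta, with a net drift toward 1/2.
        beta += lr * (1.0 / beta - 1.0 / (1.0 - beta))
    alpha = float(np.clip(alpha, 0.01, 0.99))
    beta = float(np.clip(beta, 0.01, 0.99))

# Converges to the optimum of scenario 1: beta near 1/2, alpha at most about 1/2.
```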

Practical considerations

The above proof relies on the assumption that step 2 always perfectly solves the inner minimization (i.e., mass of p_-k is always moved to the location(s) where D has the largest output(s)). In practice, as a gradient-based search procedure (see eq. (3)), step 2 is unlikely to reach the global maxima when D is a highly non-concave function. This issue with gradient-based search is alleviated by the alternating optimization procedure: if at step 2 samples of p_-k are stuck at local maxima, step 3 immediately decreases D's outputs on these samples. In other words, local maxima are constantly being eliminated. We can clearly observe this pattern in a 2D simulation of the algorithm (Figure 4). However, it appears that local maxima elimination cannot solve all the issues. As illustrated in Figure 2(b), the maximin solver can converge to a solution where D has outputs > 1/2 at places other than Supp(p_k). Inspecting the gradient vector field in Figure 2(b), we find that by starting from p_-k and following the gradient of D, p_t is always "trapped" at Supp(p_k). As a result, other local maxima lose the chance of being visited by p_t, and cannot be eliminated. This observation points to a straightforward solution: use a p_-k that is distributed over the entire data space, as opposed to one that is concentrated in a subspace. For instance, when we use a uniform distribution over the data space as p_-k, in multiple experiments we consistently obtained D solutions with no local maxima and global maxima at Supp(p_k) (Figure 2(c) and Figure 5). The mathematical proof that, when p_-k is a uniform distribution, we always obtain such a D solution is provided in Appendix M. (Note, however, that the use of a uniform distribution is not a prerequisite here; any "well distributed" data should work just as well.)
The fact that these D solutions have no local maxima also means we can translate an arbitrary data point outside Supp(p_k) to Supp(p_k) by performing gradient ascent on D.

Algorithm 2 Minimax Problem Solver
1: repeat
2: Sample a minibatch of m samples from p_k and m samples from p_t.
3: Update D by maximizing E_{x~p_k}[log D(x)] + E_{x~p_t}[log(1 - D(x))] (until convergence).
4: For each sample x ∈ p_t, update its value x ← x - λ ∇log(1 - D(x)) / ||∇log(1 - D(x))||_2.
5: until p_t converges to p_k

3.3. THE MINIMAX PROBLEM

The corresponding minimax game min_{p_t} max_D U(D, p_t) has a reversed rule: player 1 first presents different p_t s; then, for each p_t, player 2 determines a D_{p_t} that maximizes U under the considered p_t. Then, over all the combinations (p_t, D_{p_t}), player 1 chooses the combination that gives the least U value. The solution of the game is analyzed in Goodfellow et al. (2014); Goodfellow (2016): the optimal strategy for player 2 is to choose a D such that U measures the Jensen-Shannon divergence (JSD), U(D*, p_t) = -log(4) + 2 · JSD(p_t || p_k) (the actual solution is D* = p_k / (p_k + p_t)), and the optimal strategy for player 1 is to choose a p_t that minimizes the JSD: p_t* = argmin_{p_t∈M_+^1(S)} JSD(p_t || p_k). When Supp(p_k) ⊂ S (corresponding to scenario 1 in Figure 1), p_t* matches p_k (the JSD is zero), and D* outputs 1/2 on Supp(p_k). A solver for the minimax problem (Algorithm 2) is readily obtained by removing the generator from GANs' training algorithm. It is straightforward to transfer the GANs algorithm's convergence property to Algorithm 2: if at each step p_t is updated with a sufficiently small step λ, and D is trained to reach its optimum, then p_t converges to p_k. In Figure 2(d) and Figure 6 we provide 2D simulation results of this algorithm.
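The identity U(D*, p_t) = -log 4 + 2 · JSD(p_t || p_k) can be verified numerically for discrete distributions; the particular p_k and p_t below are arbitrary choices for the check.

```python
import numpy as np

def kl(p, q):
    """KL divergence for discrete distributions on a shared support."""
    return float(np.sum(p * np.log(p / q)))

p_k = np.array([0.2, 0.3, 0.5])
p_t = np.array([0.5, 0.25, 0.25])

# Player 2's best response and the resulting value of U.
D_star = p_k / (p_k + p_t)
U = float(np.sum(p_k * np.log(D_star)) + np.sum(p_t * np.log(1.0 - D_star)))

# Jensen-Shannon divergence via the mixture m = (p_k + p_t) / 2.
m = 0.5 * (p_k + p_t)
jsd = 0.5 * kl(p_k, m) + 0.5 * kl(p_t, m)

# U equals -log(4) + 2 * JSD(p_t || p_k), exactly as in the analysis.
```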

3.4. THE DIFFERENCE

There are a few differences between the solutions of the maximin problem and the minimax problem. First, both D*s output 1/2 on Supp(p_k). But while D* in the maximin problem outputs ≤ 1/2 on S \ Supp(p_k), D* in the minimax game does not need to be defined on S \ Supp(p_k) (Goodfellow et al., 2014). In other words, D* in the minimax problem has unpredictable values between 0 and 1 in most of the data space. We can observe this phenomenon in Figure 6. The intuition here is that in the maximin game, p_t* is decided in the second move, with knowledge of the current D; to prevent p_t* from exploiting this advantage, the best strategy for player 1 is to specify D outputs for the entire perturbation space. In the minimax game, on the contrary, D* is decided in the second move, with knowledge of p_t, so the player does not need to worry about D* outputs outside the supports of p_t and p_k. Another difference, which can also be observed in Figure 2, is that in the minimax game p_t* exactly matches p_k, while in the maximin game the mass of p_t* can be at any place where D* outputs 1/2.

Overall, we find that these two formulations give rise to different applications. The minimax formulation, which is the formulation used by GANs, is well suited for learning a generator that produces a distribution exactly matching the target data distribution. The discriminator (the D model), because of its undefined behavior in most of the data space, may not be very useful. The maximin problem, if well solved (Figure 2(c)), gives a D function that models a characteristic function of the data distribution, and can be used to solve problems that require this feature (Section 4).

Algorithm 3 Unconstrained Generative Adversarial Training
1: for K in [0, 1, ..., N] do
2:   for number of training iterations do
3:     Sample a minibatch of m samples {x_1, ..., x_m} from p_k, and m samples {x̄_1, ..., x̄_m} from p_-k.
4:     For each sample x̄_i in {x̄_1, ..., x̄_m}, compute the perturbed sample x̄_i^K by performing K steps of normalized steepest descent, x̄_i^{k+1} ← x̄_i^k - λ ∇log(1 - D(x̄_i^k)) / ||∇log(1 - D(x̄_i^k))||_2 (at initialization, x̄_i^0 ← x̄_i).
5:     Update D by maximizing (1/m) Σ_{i=1}^m [log D(x_i) + log(1 - D(x̄_i^K))] (single step).
6:   end for
7: end for
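To make the training loop concrete, here is a toy 1D instantiation of Algorithm 3. Everything below, from the logistic detector to the Gaussian class distributions and hyperparameters, is our own illustrative assumption: D is a logistic model, negatives are perturbed with K normalized steps (in 1D the normalized gradient reduces to a sign), and D takes a single ascent step per minibatch.

```python
import numpy as np

rng = np.random.default_rng(0)
w, b = 0.0, 0.0                          # logistic detector D(x) = sigmoid(w*x + b)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
D = lambda x: sigmoid(w * x + b)
lam = 0.1                                # perturbation step size (lambda in Algorithm 3)
lr = 0.05                                # detector learning rate

for K in range(4):                       # outer loop: gradually raise the budget
    for _ in range(500):
        x_pos = rng.normal(2.0, 0.1, 16)    # minibatch from p_k
        x_neg = rng.normal(-2.0, 0.1, 16)   # minibatch from p_-k
        for _ in range(K):
            # normalized steepest descent on log(1 - D); in 1D the normalized
            # gradient of log(1 - D) is -sign(w), so the update moves by +lam*sign(w)
            x_neg = x_neg + lam * np.sign(w)
        # single ascent step on mean[log D(x_pos) + log(1 - D(x_neg))]
        g_pos, g_neg = 1.0 - D(x_pos), -D(x_neg)
        w += lr * np.mean(g_pos * x_pos + g_neg * x_neg)
        b += lr * np.mean(g_pos + g_neg)

# After training, D scores p_k data far above the (perturbed) p_-k data.
```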

4. APPLICATIONS

In Figure 2(c) we show that when p_-k is uniformly distributed over the data space, the maximin problem solver gives us a D function that has no local maxima and whose global maxima are at the support of p_k. This D function is very useful; we can identify at least two important applications:

• Application 1: out-of-distribution (OOD) detection. Since the global maxima are at Supp(p_k), any input with a lower D output can be correctly identified as an OOD input.

• Application 2: generative modeling. New samples of p_k can be generated by first randomly sampling from the data space and then translating the samples to the support of p_k by performing gradient ascent on D.

For practical applications, we have to deal with spaces of high dimensionality. We first find that with uniform noise as the p_-k dataset, we are unable to obtain a D model that is useful for detecting real OOD data (Appendix H.1). This leads us to use a large, diverse, real image dataset, specifically ImageNet, as the p_-k dataset. Our ablation study in Appendix H.1 confirms that a larger and more diverse dataset leads to better OOD detection performance. We further use data augmentation to increase the dataset's diversity. Even with these strategies, the data of p_-k can still be very sparse in a high-dimensional space. In order to cover more space, we impose large perturbations on the p_-k data.

Unconstrained training. To facilitate training with large (potentially unlimited) perturbations, we propose Algorithm 3, an unconstrained generative adversarial training algorithm. Because of the normalized steepest descent update rule (line 4 in Algorithm 3; note there is no Proj operation here), for a given K the perturbation imposed on each sample always has size ≤ λK; hence the algorithm can be thought of as gradually increasing the perturbation limit. We found this incremental training technique necessary for training models in high-dimensional space, a phenomenon also observed by Yin et al. (2020).
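The ≤ λK bound on the perturbation size follows from the triangle inequality: each of the K updates is a step of length exactly λ in some direction. A quick numerical check, using a toy gradient field of our own choosing:

```python
import numpy as np

lam, K = 0.5, 7
rng = np.random.default_rng(1)
grad = lambda x: rng.normal(size=x.shape)   # arbitrary nonzero "gradient" field

x0 = np.zeros(3)
x = x0.copy()
for _ in range(K):
    g = grad(x)
    x = x - lam * g / np.linalg.norm(g)     # normalized step of length exactly lam

# By the triangle inequality, ||x - x0|| <= lam * K, with no projection needed.
```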
According to the analysis in Section 3.2, the step size λ should be set to a sufficiently small value in order for step 4 to converge to local maxima and for step 5 to eliminate these local maxima. We observed that training is stable as long as λ is below a certain threshold, and this threshold is related to the input size and D's architecture. In the algorithm we start K from 0, which means that at the first stage D is trained to discriminate between p_k and p_-k; this is not critical, but we found that this pre-training causes the subsequent optimization to converge faster.

5. EXPERIMENTS

In this section we evaluate our method on the tasks of generative modeling and out-of-distribution detection. Following Kurach et al. (2018), we evaluate our method on CIFAR-10 (Krizhevsky et al., 2009), CelebA-HQ-128 (Karras et al., 2017), and LSUN Bedroom-128 (Yu et al., 2015). Details of model training, data preprocessing, and dataset statistics are provided in Appendix G.

OOD detection evaluation. For each of the above three datasets, we use multiple OOD datasets (see Table 6) to test a D model's OOD detection performance. We further assume OOD inputs remain OOD under small L_p-norm perturbations. Under this assumption, we consider the problem of detecting adversarial OOD inputs, i.e., OOD inputs that are adversarially perturbed to cause the detection to fail, and evaluate our method under this challenging scenario. We also observe that increasing K in Algorithm 3 changes D's performance on OOD and adversarial OOD detection. To study this phenomenon, we evaluate D models trained with different Ks using OOD inputs under various levels of perturbation. We use Outlier Exposure (OE, Hendrycks et al. (2018)) as the baseline method. The idea of OE is to train the OOD detector against an auxiliary OOD dataset consisting of a large amount of diverse data. Because we also use a large-scale, diverse dataset (ImageNet) as the p_-k dataset, the OE approach can be thought of as the special case of Algorithm 3 in which K is fixed to 0. We use the area under the receiver operating characteristic curve (AUROC) as the performance metric (details of how AUROC and adversarial AUROC are computed are in Appendix G).

Generative modeling evaluation. We generate new p_k samples by starting from seed images and performing gradient ascent on D (Appendix G provides more details on generation). Due to the similarity between the studied approach and GANs (the former solves the maximin problem while the latter solves the minimax problem), we focus on a comparison with GANs. Kurach et al. (2018) is a large-scale study of the effects of various regularization and normalization techniques on GANs, and we compare our results with the best results obtained in their work.
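AUROC over detector scores has a simple rank interpretation: the probability that a randomly chosen in-distribution sample scores higher than a randomly chosen OOD sample, with ties counted as 1/2. A minimal sketch (the score values below are made up for illustration):

```python
import numpy as np

def auroc(in_scores, ood_scores):
    """AUROC = P(score_in > score_ood) + 0.5 * P(score_in == score_ood),
    computed over all in/OOD pairs (the Mann-Whitney U statistic)."""
    s_in = np.asarray(in_scores, dtype=float)[:, None]
    s_ood = np.asarray(ood_scores, dtype=float)[None, :]
    return float((s_in > s_ood).mean() + 0.5 * (s_in == s_ood).mean())

# Hypothetical D outputs: in-distribution data should score high, OOD data low.
score = auroc([0.9, 0.8, 0.7], [0.6, 0.75])   # 5 of the 6 pairs ranked correctly
```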

OOD detection results

Table 2 shows the average OOD detection performance of our method (see Appendix J for the complete results on individual OOD datasets). For each dataset, we train multiple D models with different Ks, and test the models under various levels of perturbation (ε_test, measured by L_2 norm). We can observe a general pattern across all datasets: training with a larger K causes model performance at lower ε_test to decrease, a phenomenon also observed in other adversarial training scenarios (Madry et al., 2017; Tsipras et al., 2018). The baseline method (the K = 0 models) becomes completely ineffective when exposed to adversarial OOD inputs. On the CelebA-HQ-128 and Bedroom-128 datasets our method obtains strong performance on detecting both OOD and adversarial OOD inputs. Performance on the CIFAR-10 dataset is relatively low. Considering the small size (4.6 MB of disk space) of the default ResNet-CIFAR architecture, we replaced it with ResNet18, a much larger model in terms of disk space (43 MB), but observed only marginal improvements on OOD and adversarial OOD detection (Appendix H.2). The Table 2 results are based on perturbations computed using PGD attacks with particular combinations of steps and step size. To verify model robustness, we use a testing strategy that is widely adopted in the ML security community: use PGD attacks with different combinations of steps and step size to test model robustness. Appendix J shows that the worst results obtained with this grid search are only marginally lower than those reported in Table 2. In Table 3 and Table 4 we report the standard and adversarial OOD detection performance of our method and several state-of-the-art methods. As discussed earlier, there is a trade-off between standard and adversarial OOD detection performance as we increase K in Algorithm 3. For this reason, we include the performance of our model trained with different Ks (K = 0 and K = 5).
Our method uses the ResNet18 architecture and 80 Million Tiny Images (Torralba et al., 2008) as the p_-k dataset; this combination gives the best performance, and data for other settings are provided in Appendix H.2. (We note that the methods in Table 4 that rely on auxiliary data also use 80 Million Tiny Images.) Table 3 shows that state-of-the-art OOD detection methods achieve strong performance on the standard OOD detection task. However, in the adversarial OOD detection task, even a tiny perturbation of ε = 0.01 can cause non-robust models (OE and our method with K = 0) to fail. Meanwhile, our method with K = 5 outperforms several state-of-the-art methods on SVHN and CIFAR-100 in this task.

Image generation results. Figure 3 shows samples generated by our method. In general we find our results to be more recognizable than the GANs results in Figure 10. We observed that the quality of generated images can be affected by the type of seed images (Appendix K.2), and that increasing K in Algorithm 3 generally leads to better generations (Appendix K.1). In Figure 12 we demonstrate the method's application to face retouching. The fact that the generated images are not realistic and have various artifacts suggests that the maximin problem is not well solved. This might be due to the model not being exposed to enough p_-k data (consider increasing the number of iterations in the inner loop of Algorithm 3), or to limitations of the model architecture or capacity. While GANs have various training stability issues, we found Algorithm 3 to be as stable as ordinary supervised training. The only failure mode we observed (gradient ascent on D resulting in noisy images) is caused by λ being too large (Appendix I).

Limitation. Because our method uses gradient ascent, which is susceptible to local maxima, to generate samples, it tends to produce similar samples if the seed samples are not diverse enough.
In practice, due to the numerical algorithm and mini-batch training, we are more likely to get a D solution that has local maxima. In that case, when performing gradient ascent on D, seed samples that are not diverse enough can be trapped at the same local maximum. This is likely the case in Figure 3, where we find several face images that are quite similar to each other. On the other hand, in the less likely situation where the D solution has no local maxima (e.g., Figure 2(b) and Figure 2(c)), all the seed samples could be concentrated at a few maximum points on the support of the p_k distribution. While the problem in the latter case seems more severe, it can be mitigated by properly constraining the number of steps and the step size when performing gradient ascent on D.

Table 4: Adversarial OOD detection performance (AUROC scores) when the in-distribution dataset is CIFAR-10. Performance data of methods other than ours is collected from Bitterwolf et al. (2020). The results of our method are based on a PGD attack with 100 steps and step size 0.002. We also used the full datasets to run the test, as opposed to the 1000 samples per dataset used by Bitterwolf et al. (2020). For results using 1000 samples and under different attack configurations, including one with random restarts, see Table 30.

OOD dataset (with an L_∞ perturbation of ε = 0.01)
Method | Uniform Noise | Gaussian Noise | SVHN | CIFAR-100
OE (Hendrycks et al., 2018) | 75.7 | N/A | 3.7 | 11.0
CCU (Meinke & Hein, 2019) | 100 | N/A | 14.8 | 23.3
ACET (Hein et al., 2019) | 98.9 | N/A | 88.0 | 74.5
GOOD (Bitterwolf et al., 2020) | | | |

A RELATED WORK ON OUT-OF-DISTRIBUTION DETECTION

Deep generative models such as flow-based models (Dinh et al., 2016) tend to assign higher likelihood to OOD inputs than they do to in-distribution inputs. Despite this challenge, several recent works (Ren et al., 2019; Choi et al., 2018; Nalisnick et al., 2019; Kirichenko et al., 2020; Serrà et al., 2019; Song et al., 2019; Huang et al., 2019; Daxberger & Hernández-Lobato, 2019) investigated the issue and successfully applied deep generative models to OOD detection. There is also a plethora of OOD detection methods (Hendrycks & Gimpel, 2016; Lee et al., 2018; Liang et al., 2017; Sastry & Oore, 2019; Quintanilha et al., 2018; Abdelzad et al., 2019; Chen et al., 2018; Malinin & Gales, 2018) that make use of statistics computed from the predictions or intermediate activations of standard classifiers trained on in-distribution data. To name a few, Lee et al. (2018) fit class-conditional Gaussian distributions using multiple levels of activations of the classifier, and use the Mahalanobis distance to compute confidence scores for identifying OOD inputs.
The ODIN method (Liang et al., 2017) improves the effectiveness of a softmax-score-based detection approach by using temperature scaling and adding small perturbations to the input. Sastry & Oore (2019) make use of Gram matrices computed from the classifier's intermediate activations to identify OOD inputs. Another branch of work utilizes various alternative training strategies (Liu et al., 2020; Lee et al., 2017; Hendrycks et al., 2018; 2019; DeVries & Taylor, 2018; Shalev et al., 2018; Vernekar et al., 2019; Yu & Aizawa, 2019; Golan & El-Yaniv, 2018). A notable example is the Outlier Exposure (OE) method developed by Hendrycks et al. (2018). OE works by training the OOD detector against a large, diverse out-of-distribution dataset, and has been widely adopted as a baseline method. While methods based on generative models and standard classifiers yield high performance on naturally occurring OOD inputs, several such methods have been shown (Meinke & Hein, 2019; Bitterwolf et al., 2020) to be vulnerable to adversarial manipulation of the OOD inputs. This should come as no surprise, as both generative models and standard classifiers are themselves vulnerable to adversarial attacks (Kos et al., 2018; Szegedy et al., 2013). Given this limitation of current approaches, a recent trend considers the worst-case scenario for OOD detection (Hein et al., 2019). Like these approaches, our detection method employs adversarial training on OOD inputs to induce robustness. The difference is that our method uses the GAT objective, whose optimal solution naturally solves the robust OOD detection problem, while the optimal solution of the objective used by Hein et al. (2019), which is essentially a multi-class classification objective with an extra term on OOD inputs, is unclear.
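As a toy illustration of the temperature-scaling component of ODIN (the logits and temperatures here are invented for the example, and we omit ODIN's input-perturbation step), higher temperatures flatten the softmax and shrink the max-probability score:

```python
import numpy as np

def softmax_score(logits, T=1.0):
    """Max softmax probability under temperature T; ODIN-style scores use T > 1."""
    z = np.asarray(logits, dtype=float) / T
    z = z - z.max()                       # subtract max for numerical stability
    p = np.exp(z) / np.exp(z).sum()
    return float(p.max())

logits = [5.0, 1.0, 0.0]
s_low = softmax_score(logits, T=1.0)      # confident: close to 1
s_high = softmax_score(logits, T=1000.0)  # flattened: close to uniform (1/3)
```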

B MAXIMIN AND MINIMAX PROBLEMS IN GAME THEORY

In game theory, a two-player zero-sum game is a mathematical representation of a situation in which one player's gain is balanced by another player's loss. Such a game is described by its payoff function f : R^(p+q) → R, which represents the amount of payment that one player (player 1) makes to the other player (player 2). The goal of player 1 is to choose a strategy u ∈ R^p such that the payoff is minimized, while the goal of player 2 is to choose a strategy v ∈ R^q such that the payoff is maximized. The best strategies for both players, and the resulting payoff, depending on the order of play, can be solved via min_u max_v f(u, v) or max_v min_u f(u, v). In the minimax game min_u max_v, player 1 makes the first move. Player 2, after learning that player 1 has made the move u, will choose a v to maximize f(u, v), which results in a payoff of max_v f(u, v). Player 1, anticipating player 2's strategy, will choose a u such that the worst-case payoff max_v f(u, v) is minimized, which results in a payoff of min_u max_v f(u, v). In the maximin game max_v min_u, the order of play is reversed. Player 2 makes the first move, and then player 1 minimizes the payoff by choosing u = arg min_u f(u, v). Player 2 knows that player 1 will follow this strategy and will choose a v such that the worst-case payoff min_u f(u, v) is maximized, which results in a payoff of max_v min_u f(u, v). The payoff min_u max_v f(u, v) is always greater than or equal to max_v min_u f(u, v). This difference can be intuitively understood as the result of player 2's extra knowledge gained by moving second. According to the minimax theorem (Neumann, 1928), when f is a continuous concave-convex function (i.e., for each v, f(u, v) is a convex function of u, and for each u, f(u, v) is a concave function of v), these two quantities are equal. We refer the reader to Boyd et al. (2004) (§5.4.3, §10.3.4) for more details on this topic.
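The gap between the two orders of play can be checked numerically. The following sketch (with an illustrative, made-up payoff matrix) computes both quantities and confirms that the minimax value is never smaller than the maximin value:

```python
# Illustrative payoff matrix f(u, v): rows index player 1's strategies u,
# columns index player 2's strategies v. The values are made up.
f = [
    [3, 1],  # u0
    [0, 2],  # u1
    [5, 4],  # u2
]

# Minimax: player 1 moves first and minimizes the worst-case (max over v) payoff.
minimax = min(max(row) for row in f)

# Maximin: player 2 moves first and maximizes the worst-case (min over u) payoff.
maximin = max(min(f[i][j] for i in range(len(f))) for j in range(len(f[0])))
```

Here minimax = 2 and maximin = 1, so the second mover indeed benefits from observing the first mover's strategy.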

C A DEMONSTRATION ON HOW TO SOLVE A MAXIMIN PROBLEM

Without a game-theory interpretation, in Table 5 we present a minimal example demonstrating how to solve a maximin problem max_v min_u f(u, v), with f : R^(p+q) → R, u ∈ R^p, and v ∈ R^q. In this example, u has three values u_0, u_1, u_2, and v has two values: v_0, v_1. To solve the maximin problem we first solve the inner minimization for each value of v. For example, when we fix v to v_0, we solve the inner problem by choosing the u that, when combined with v_0, yields the lowest f value. We do the same computation for v_1, and we have solved the inner problem. We then move to the outer problem by choosing, from the above two solutions (red boxes), the one with the highest f value (the green box).

Table 5: A minimal example demonstrating how to solve a maximin problem. The solutions of the inner problem for each value of v are labeled in red, and the final solution is highlighted in green.

     u_0          u_1          u_2
v_0  f(u_0, v_0)  f(u_1, v_0)  f(u_2, v_0)
v_1  f(u_0, v_1)  f(u_1, v_1)  f(u_2, v_1)
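The table walk above is easy to mechanize. A minimal sketch with hypothetical payoff values (the dictionary below plays the role of Table 5):

```python
# Hypothetical payoff table f[v][u]: two v strategies (rows), three u
# strategies (columns). The numbers are made up for illustration.
f = {
    "v0": {"u0": 4.0, "u1": 1.0, "u2": 3.0},
    "v1": {"u0": 2.0, "u1": 5.0, "u2": 6.0},
}

# Inner problem: for each fixed v, the minimal payoff over u (the "red boxes").
inner = {v: min(row.values()) for v, row in f.items()}

# Outer problem: the v whose inner solution is largest (the "green box").
v_star = max(inner, key=inner.get)
maximin_value = inner[v_star]
```

With these values the inner solutions are 1.0 (for v_0) and 2.0 (for v_1), so the maximin solution picks v_1 with value 2.0.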

D MATHEMATICAL ANALYSIS OF OPTIMAL SOLUTIONS OF THE MAXIMIN PROBLEM

Recall that the support of p_t can be any subset of the perturbation space S, and that U(D, p_t) = ∫ p_k(x) log D(x) dx + ∫ p_t(x) log(1 - D(x)) dx. For convenience, we define the contour set inside S of D at α as C_α^D := {x ∈ S : D(x) = α}, the region of Supp(p_k) that is outside of S as Ω_ko := Supp(p_k) \ S, and the region of Supp(p_k) that is in S as Ω_ki := Supp(p_k) ∩ S. Note that Supp(p_k) = Ω_ko ∪ Ω_ki. Fix D, and let α_k = max_{Ω_ko} D and α_S = max_S D. It is easy to check that U is minimized when Supp(p_t) lies in the contour set C_{α_S}^D. Let p*_t be a distribution such that Supp(p*_t) ⊂ C_{α_S}^D. By direct computation we have

U(D, p*_t) = ∫_{Ω_ko} p_k(x) log D(x) dx + ∫_{Ω_ki} p_k(x) log D(x) dx + log(1 - α_S)
           ≤ (∫_{Ω_ko} p_k)(log α_k) + (∫_{Ω_ki} p_k)(log α_S) + log(1 - α_S)
           ≤ 0 + β_ki log(β_ki / (1 + β_ki)) + log(1 / (1 + β_ki)),

where β_ki = ∫_{Ω_ki} p_k. Note that here we have used the fact that the function f(y) = a log y + b log(1 - y) achieves its maximum at y = a / (a + b). It is not difficult to see that the above inequality becomes an equality when

D(x) = α_k for x ∈ Ω_ko, D(x) = α_S for x ∈ Ω_ki, and D(x) ≤ α_S for x ∈ S \ Supp(p_k),

where α_k = 1 and α_S = β_ki / (1 + β_ki). Note that D does not need to be defined outside S ∪ Supp(p_k).

Scenario 1 Here we deal with the case where ε is large enough that Supp(p_k) ⊂ S, in which case Ω_ko = ∅, Ω_ki = Supp(p_k), and α_S = 1/2. Hence, by the above analysis, U achieves its optimum when D ≡ α_S = 1/2 on Supp(p_k) and D ≤ 1/2 on S \ Supp(p_k). In summary, the maximin problem achieves its optimum when D outputs 1/2 on the support of p_k and values less than or equal to 1/2 on samples outside the support of p_k but in S.

Scenario 2 Here we deal with the case where ε is small enough that S ∩ Supp(p_k) = ∅, in which case Ω_ko = Supp(p_k), Ω_ki = ∅, and α_S = 0. Hence U achieves its optimum when D ≡ 1 on Supp(p_k) and D ≡ 0 on S.
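The step that picks α_S relies on the fact that f(y) = a log y + b log(1 - y) peaks at y = a/(a + b). A quick numeric sanity check of this fact, with an illustrative value standing in for β_ki:

```python
import math

a, b = 0.3, 1.0  # a plays the role of beta_ki (illustrative value); b = 1

def f(y):
    return a * math.log(y) + b * math.log(1 - y)

y_star = a / (a + b)  # the claimed maximizer, beta_ki / (1 + beta_ki)

# f at y_star should dominate f at nearby points.
assert f(y_star) > f(y_star - 0.01)
assert f(y_star) > f(y_star + 0.01)
```

With a = β_ki and b = 1, y_star is exactly the optimal detector output α_S = β_ki/(1 + β_ki) on Ω_ki from the analysis above.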
In summary, the maximin problem achieves its optimum when D outputs 1 on the support of p_k and 0 on the perturbation space S. Scenario 3 Here we deal with the case where S ∩ Supp(p_k) ≠ ∅ and Supp(p_k) ⊄ S. In summary, the maximin problem achieves its optimum when D outputs 1 on the set of samples inside the support of p_k but outside of the perturbation space S, outputs β_ki/(1 + β_ki) on the intersection of the support of p_k and S, and outputs values less than or equal to β_ki/(1 + β_ki) elsewhere on S. Remark The first two scenarios can be seen as special cases of the third.

E SCENARIO 2 DISCUSSION

In the robust machine learning literature, it is common to consider a very small value for ε. For instance, one of the most commonly used limits for training L∞-robust models is ε = 8/255 (L∞ norm). A perturbation space characterized by a small ε limit can be thought of as a semantic-preserving space: translating a sample inside the space doesn't change the sample's underlying label/class membership. A small perturbation limit corresponds to scenario 2, which is also the focus of Yin et al. (2020). We can define robust models as models that output consistent predictions for inputs under semantic-preserving transformations. In this sense, the optimal D for scenario 2 is a robust detector, as it always outputs 0 on the perturbation space. However, the limitation of training against a small ε is obvious: because the optimal D's outputs outside S ∪ Supp(p_k) are unspecified, any semantic-preserving operation whose perturbation goes beyond S can result in a high D output, thereby fooling detection. The above analysis suggests that for predictive models based on the generative adversarial training method, robustness can be improved by training against a larger perturbation space.

F ALGORITHM 1 CONVERGENCE

G EXPERIMENTAL SETUP

Model training We use Algorithm 3 to train D models. Depending on the studied dataset, p_k is set to one of the above three datasets, but we always use ImageNet (with data augmentation) as the p_-k dataset (images are resized to 32 × 32 × 3 or 128 × 128 × 3, depending on the p_k dataset resolution). We also use the same D architectures and batch size as Kurach et al. (2018) (see Appendix G for more details of the D architectures). Following Yin et al. (2020), we use a pretrained D model to bootstrap optimization: for CIFAR-10 the D model is pretrained on the CIFAR-10 classification task, and for the other datasets it is pretrained on the ImageNet classification task (Russakovsky et al., 2015). For all datasets, the D update step in Algorithm 3 is performed using an SGD optimizer with momentum 0.9. The learning rate of the optimizer is 0.0005 for the CIFAR-10 dataset, 0.001 for CelebA-HQ-128, and 0.0025 for Bedroom-128. The λ value is set to 0.1 for CIFAR-10, and 0.6 for CelebA-HQ-128 and Bedroom-128. For all trainings we use a batch size of 64, the same as Kurach et al. (2018). The training follows standard supervised training, and doesn't use any regularization or normalization.

Dataset preprocessing and statistics The CelebA-HQ-128 dataset is downloaded from Manna (2020). The Bedroom-128 dataset is created from the corresponding LSUN dataset by center-cropping the images with a square and then resizing to 128 × 128. CIFAR-10 has 60K training images and 10K test images. We manually split CelebA-HQ-128 into a training split of 27K images and a test split of 3K images, and Bedroom-128 into a training split of 300K images and a test split of 3K images.

Detector architecture Following Kurach et al. (2018), we use two network architectures for the experiments. For the CIFAR-10 task we use the "ResNet-CIFAR" architecture (Kurach et al. (2018), Table 7a). The architecture has 4 customized ResBlocks and takes 4.6MB of disk space.
For the other 128 × 128 × 3 datasets, we use the "ResNet19 discriminator" architecture (Kurach et al. (2018), Table 5a). The architecture has 6 customized ResBlocks and takes 60MB of disk space.

Image generation details We generate a new sample of p_k by starting from some seed sample and performing gradient ascent on D using the update rule in eq. (3). In this case, the seed sample is supposed to be out of the distribution of p_k. When using the update rule in eq. (3), we need to specify the step size λ, the number of steps, and the ε for the Proj operation. We use the configurations listed below for generation; we note that different configurations could lead to different generation results.

AUROC computation AUROC is a metric that measures a discriminative model's ability to separate two sets of data. To compute the AUROC score of a trained D model for given p_k and p_OOD datasets, we first use the D model to get the logit outputs of samples from these two datasets, and then use scikit-learn (Pedregosa et al., 2011)'s "sklearn.metrics.auc" function to compute the score (with samples in p_k labeled as 1s and samples in p_OOD labeled as 0s). We always use the test splits of the p_OOD and p_k datasets for the above calculation.

Adversarial AUROC computation To compute the adversarial AUROC score of a D model for given p_OOD and p_k datasets, we first compute an adversarial OOD dataset by taking samples from p_OOD and performing an L2-based PGD attack (Madry et al., 2017) against the D model. We then compute the adversarial AUROC score by computing the AUROC score on the adversarial OOD dataset and the p_k dataset. As with the AUROC computation, we always use the test splits of p_OOD and p_k for the above calculation.
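AUROC can also be computed without scikit-learn, since it equals the Mann-Whitney statistic: the probability that a randomly drawn in-distribution score exceeds a randomly drawn OOD score, with ties counted as 1/2. A dependency-free sketch with hypothetical detector logits:

```python
def auroc(pos_scores, neg_scores):
    """AUROC via the Mann-Whitney statistic: the fraction of (positive,
    negative) pairs where the positive score is higher, ties counted 1/2."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Hypothetical detector logits: p_k samples are labeled 1, p_OOD samples 0.
in_dist_logits = [2.1, 1.5, 3.0]
ood_logits = [0.2, -1.0, 1.7]
score = auroc(in_dist_logits, ood_logits)
```

A perfectly separating detector yields 1.0; an uninformative one yields about 0.5.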
Table 6: In-distribution dataset and corresponding out-of-distribution datasets (images of OOD datasets are resized to the image size of the corresponding in-distribution dataset). In-distribution dataset (p_k): CIFAR-10. Out-of-distribution datasets (p_OOD): Gaussian noise, Uniform noise, SVHN (Netzer et al., 2011), CIFAR-100 (Krizhevsky et al., 2009).

In this ablation study, we use CIFAR-10 class 0 data as the target data distribution p_k, and train models with different p_-k datasets. It is observed in Tables 11 and 12 that when p_-k is uniform noise, the D models only develop the capability to identify uniform noise and Gaussian noise as OOD inputs. This result seems to contradict the mathematical analysis in Appendix M, which says that with a uniform distribution as p_-k, a D function useful for detecting any kind of OOD input could be obtained. According to the manifold hypothesis, real image data lie on lower-dimensional manifolds embedded within the high-dimensional space. In contrast, uniform noise is highly concentrated on the surface of the unit d-cube in the high-dimensional space [0, 1]^d. Our conjecture is that, due to these geometric properties, in terms of Euclidean distance real image samples are close to each other while uniform noise samples live far away from the real data. As a result, uniform noise is much less data-efficient than real data for training OOD detection models, and a much larger number of inner iterations and a larger K value in Algorithm 3 may be needed to reach a satisfying detection performance. For the real-image-data experiments, we respectively use ImageNet and a CIFAR-10 subset consisting of CIFAR-10 data from class 1 to class 9 as the p_-k dataset.
ImageNet is a considerably larger and more diverse dataset than CIFAR-10, and it is seen from Table 9 and Table 10 that the model trained against ImageNet performs much better on OOD and adversarial OOD detection than the model trained against the CIFAR-10 subset.

I FAILURE MODE DIAGNOSIS

We observe that in Algorithm 3, if λ is set to too large a value, the algorithm fails to learn a D that is useful for image generation. In this section we discuss the training dynamics for the case of an appropriate λ value and the case of λ being too large.

λ is small enough In Algorithm 3, as we increase K, p_t gradually converges to p_k. In this process it becomes increasingly difficult for the D model to differentiate the two distributions. This phenomenon can be observed in Figure 7: the training loss (binary cross-entropy loss) of the D model becomes larger and larger (left subfigure), and eventually the two distributions become indistinguishable (AUROC ≈ 0.5, middle subfigure). From the right subfigure we can see that D's performance on p_-k vs. p_k is also affected by the increase in the K value.

λ is too large The failure mode caused by λ being too large is easy to identify (Figure 8): the training loss quickly decreases to 0 as K increases (left subfigure), p_t and p_k become perfectly separable (middle subfigure), and the D model becomes unable to separate p_-k from p_k (right subfigure). In general, with a small enough λ value, an increase in sample quality can be expected after the model is trained with a larger K. This is the case when λ is 0.1, but not when it is 0.6 (Figure 9).

J.1 CIFAR-10

Table 15: The performances of the CIFAR-10 K = 25 model under PGD attacks of different combinations of steps and step size. The perturbation limit is ε = 2.0 (L2 norm). Each entry is computed using 500 positive samples and 500 negative samples.

Each entry in this table and the following two tables is computed using 3000 positive samples and 3000 negative samples. When ε_test > 0, perturbations are computed using PGD attacks with 200 steps and step size 2.0.

ε_test | CIFAR-10 | Gaussian noise | Uniform noise | ImageNet | SVHN | CelebA-HQ | CIFAR-100 | mean
0.0    | 1.0000   | 1.0000         | 1.0000        | 0.9713   | 1.0000 | 0.9999  | 1.0000    | 0.9959
5.0    | 0.0000   | 0.0000         | 0.0010        | 0.0002   | 0.0000 | 0.0000  | 0.0000    | 0.0002
10.0   | 0.0000   | 0.0000         | 0.0000        | 0.0000   | 0.0000 | 0.0000  | 0.0000    | 0.0000

N EXTENDED ADVERSARIAL OOD DETECTION RESULTS

Table 30: The performance (AUROC scores) of the CIFAR-10 K = 5 model (the in-distribution dataset is CIFAR-10) under PGD attacks of different configurations. Following Bitterwolf et al. (2020), we used 1000 samples for both in-distribution data and OOD data. Similarly, we used 5 random restarts to enhance the default attack, but the performance decrease is negligible. Column groups: OOD dataset (with an L∞ perturbation of ε = 0.01) and OOD dataset (with an L∞ perturbation of ε = 0.03).

Method | Uniform Noise | Gaussian Noise | CIFAR-10 | CIFAR-100
OE (Hendrycks et al., 2018) | 98.2 | N/A | 62.5 | 60.2
CCU (Meinke & Hein, 2019) | 100 | N/A | 56.8 | 52.5
ACET (Hein et al., 2019) | 96.3 | N/A | 99.5 | 99.4
GOOD (Bitterwolf et al., 2020) |



In scenarios 1 and 3, D* doesn't need to be defined on X \ (S ∪ Supp(p_k)). The ε values are based on https://github.com/MadryLab/robustness. The volume of the unit d-cube shrunk by some small ε in each dimension is given by V = (1 - 2ε)^d. This quantity quickly approaches 0 as d increases.
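The footnote's volume claim is easy to verify numerically: the interior fraction (1 - 2ε)^d collapses even for tiny ε once d reaches image dimensionalities (the ε below is illustrative):

```python
eps = 0.01  # illustrative shrinkage per dimension

# Fraction of the unit d-cube's volume that lies more than eps away from
# every face: V = (1 - 2*eps) ** d.
for d in (3, 32 * 32 * 3, 128 * 128 * 3):
    interior = (1 - 2 * eps) ** d
    print(d, interior)
```

At d = 3 the interior still holds about 94% of the volume; at d = 3072 (a 32 × 32 × 3 image) it is below 10^-26, so uniform noise mass sits essentially entirely near the surface.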



Figure 1: Left panel: a distribution p_t is obtained by applying a transformation T to the support of p_-k. Right panel: three scenarios to consider when analyzing problem 5. The red distribution represents p_-k and the blue distribution represents p_k. The data space X is represented by the whole space inside the square, and the perturbation space S is represented by the gray area.

Optimal solutions A convenient way of analyzing the above problem is to consider it as a two-player game: player 1 first presents different D configurations; then, for each D, player 2 determines a perturbed distribution p_t^D that minimizes U under the considered D. Then over all combinations of

Figure 2: Plots of contours and gradient vector fields of the D functions (gradient vectors are normalized to unit length). (a) The initial positions of p_-k and p_k. (b) The solution obtained by the maximin problem solver. (c) The solution obtained by the maximin problem solver when p_-k is a uniform distribution over the data space. (d) The solution obtained by the minimax problem solver.

Figure 3: Uncurated samples generated by our method; GAN results are in Figure 10. Seed images used to generate these results are in Figure 11. The training times for the models used to produce these generations are in Table 7.

STEP 3 ALWAYS DECREASES β We assume that when α < β, D has a single global maximum point (i.e., |B| = 1). Lemma 1. If α < β and |B| = 1, Algorithm 1 always decreases β. Proof. Let β := max_{Supp(p_k)} D, C := {x ∈ Supp(p_k) : D(x) < β}, and γ := max_C D. (As in the proof of Proposition 2, we consider the case of β, γ > 1/2.) Going back to Algorithm 1, step 2 moves the mass of p_-k to B. Since B has only one element, the mass is concentrated on this single point. Step 3's optimization causes γ to increase and β to decrease. The intuition here is that if step 3's update is small enough, these two values will meet at an intermediate point. Let the resulting values be γ_1 and β_1. If step 3's update on D is sufficiently small that γ_1 - γ < β - γ, then we have max{β_1, γ_1} < β; that is, the maximum value of D on Supp(p_k) has decreased.

Figure 4: The results of p_t and D in the first few iterations of a 2D simulation of Algorithm 1. Step 2 solves the inner minimization, causing the support of p_t (red points) to be concentrated at local maximum points. Step 3 updates D by increasing its outputs on the support of p_k and decreasing its outputs on the support of p_t, causing local maxima to be suppressed.

• For CIFAR-10 generation (Figure 3(a), Figure 13, and Figure 16), we use step size 0.1, 200 steps, and ε = 15.
• For CelebA-HQ-128 generation (Figure 3(b), Figure 14, and Figure 17), we use step size 1.2, 100 steps, and ε = 40.
• For Bedroom-128 generation (Figure 3(c), Figure 15, and Figure 18), we use step size 0.8, 400 steps, and ε = 70.
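The gradient-ascent-with-projection loop behind these configurations can be sketched on a toy 1-D problem. The quadratic D and its analytic gradient below are stand-ins for the trained detector; step size, steps, and ε mirror the parameters listed above:

```python
def D(x):
    return -(x - 3.0) ** 2  # toy stand-in detector, maximized at x = 3

def generate(seed, step_size, steps, eps):
    """Gradient ascent on D starting from a seed sample, projecting back
    into the epsilon-ball around the seed after every step (the Proj
    operation of the update rule in eq. (3))."""
    x = seed
    for _ in range(steps):
        grad = -2.0 * (x - 3.0)  # analytic gradient of the toy D
        x = x + step_size * grad
        x = min(max(x, seed - eps), seed + eps)  # Proj onto B(seed, eps)
    return x

sample = generate(seed=0.0, step_size=0.1, steps=200, eps=15.0)
```

With a generous ε the iterate reaches D's maximizer; with a tight ε (say 1.0) it stops at the ball's boundary, which is how the ε above trades off seed fidelity against sample quality.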

CelebA-HQ-128 K = 80 model: 7 days 12 hours (2 2080Ti GPUs). Bedroom-128 K = 55 model: 14 days 15 hours (2 2080Ti GPUs).

Figure 16: Samples generated by the CIFAR-10 K = 40 model using the seed images on the left. Seed images are from the OOD datasets (Table 6).

Figure 17: Samples generated by the CelebA-HQ-128 K = 80 model using the seed images on the left. Seed images are random samples from the OOD datasets (Table 6).

Figure 20: Uncurated 256 × 256 generation results on the Bedroom-256 dataset. The state-of-the-art results on this dataset can be found in Figure 10 of Karras et al. (2019).

Figure 21: Uncurated 256 × 256 generation results on the ImageNet Dog 256 dataset. The state-of-the-art results on this dataset can be found in Brock et al. (2018), although their results are of resolution 128 × 128 and are class-conditional. Unconditional generation results on this dataset can be found in Zhang et al. (2018).

Algorithm 1 GAT Detector Training Method (The Maximin Problem Solver)
1: Sample a minibatch of m samples {x_1^k, ..., x_m^k} from p_k, and m samples {x_1^-k, ..., x_m^-k} from p_-k.
2: Compute adversarial examples {x̃_1, ..., x̃_m} by solving max_{x' ∈ B(x, ε)} D(x') for each x_i^-k.
3: Train the detector by minimizing the binary cross-entropy loss -(1/m) Σ_{i=1}^m [log D(x_i^k) + log(1 - D(x̃_i))].
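The loop structure of Algorithm 1 can be sketched end-to-end on a toy 1-D problem. Everything below is a stand-in: a logistic detector replaces the deep network, the inner maximization is a few steps of projected gradient ascent on the input, and the outer step is one minibatch of the binary cross-entropy update:

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

class ToyDetector:
    """A 1-D logistic detector D(x) = sigmoid(w*x + b), standing in for the
    deep detector of Algorithm 1."""
    def __init__(self):
        self.w, self.b = 0.0, 0.0
    def __call__(self, x):
        return sigmoid(self.w * x + self.b)

def inner_max(D, x, eps, steps=10, step=0.1):
    """Step 2: approximately solve max over B(x, eps) of D by projected
    gradient ascent on the input."""
    x_adv = x
    for _ in range(steps):
        d = D(x_adv)
        grad = d * (1.0 - d) * D.w  # dD/dx for the logistic detector
        x_adv = min(max(x_adv + step * grad, x - eps), x + eps)
    return x_adv

def gat_step(D, xs_k, xs_neg, eps, lr=0.5):
    """Steps 2-3 for one minibatch: craft adversarial examples from the
    p_-k samples, then take one gradient step on the binary cross-entropy
    with p_k samples labeled 1 and adversarial examples labeled 0."""
    advs = [inner_max(D, x, eps) for x in xs_neg]
    gw = gb = 0.0
    for x in xs_k:        # label 1: gradient of -log D(x)
        e = D(x) - 1.0
        gw += e * x
        gb += e
    for x in advs:        # label 0: gradient of -log(1 - D(x))
        e = D(x)
        gw += e * x
        gb += e
    m = len(xs_k) + len(advs)
    D.w -= lr * gw / m
    D.b -= lr * gb / m

random.seed(0)
D = ToyDetector()
for _ in range(200):
    xs_k = [random.gauss(2.0, 0.2) for _ in range(8)]     # target class p_k
    xs_neg = [random.gauss(-2.0, 0.2) for _ in range(8)]  # other classes p_-k
    gat_step(D, xs_k, xs_neg, eps=0.5)
```

After a couple hundred iterations the toy detector outputs high values on p_k samples and low values throughout the ε-ball around p_-k samples, mirroring the intended maximin behavior.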

Optimal solutions for the three scenarios in Figure 1.

Scenario Supp(p_k) ⊂ S: D* outputs 1/2 on Supp(p_k) and ≤ 1/2 on S \ Supp(p_k); p*_t has its mass distributed to locations where D* outputs 1/2.

Scenario Supp(p_k) ∩ S = ∅: D* outputs 1 on Supp(p_k) and 0 on S; p*_t can be an arbitrary distribution in M^1_+(S).

Otherwise: D* outputs 1 on Supp(p_k) \ S and α = β_ki/(1 + β_ki) on Supp(p_k) ∩ S, where β_ki = ∫_{Supp(p_k) ∩ S} p_k (by definition, ∫_S p*_t = 1); for other places inside S, D* outputs ≤ α; p*_t has its mass distributed to locations where D* outputs α.

Out-of-distribution detection performances (see Appendix J for expanded results)



Qing Yu and Kiyoharu Aizawa. Unsupervised out-of-distribution detection by maximum classifier discrepancy. In Proceedings of the IEEE International Conference on Computer Vision, pp. 9518-9526, 2019.

Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang, and Dimitris N Metaxas. Stackgan++: Realistic image synthesis with stacked generative adversarial networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(8):1947-1962, 2018.

A RELATED WORK ON OUT-OF-DISTRIBUTION DETECTION

Out-of-distribution (OOD) detection, also known as novelty detection or anomaly detection, deals with the problem of identifying novel, or unusual, data within a dataset. OOD detection has gained much research attention due to its practical importance in safety-critical applications and its challenging nature. A comprehensive review of classical OOD detection methods can be found in Pimentel et al. (2014). A recent surge of research interest in this topic is due to the emergence of deep generative models. Such models (specifically, explicit density models (Goodfellow, 2016)) estimate the generative probability density function of the data, and should serve as an ideal candidate for OOD detection. However, it was observed (Kirichenko et al., 2020; Nalisnick et al., 2018; Shafaei et al., 2018; Hendrycks et al., 2018) that several state-of-the-art deep generative models, including Glow (Kingma & Dhariwal, 2018), PixelCNN (Oord et al., 2016), PixelCNN++ (Salimans et al.), VAEs (Kingma, 2013; Rezende et al., 2014), and the RealNVP flow model

2019; Sehwag et al., 2019; Meinke & Hein, 2019; Bitterwolf et al., 2020). The Adversarial Confidence Enhanced Training (ACET) method proposed by Hein et al. (2019) uses adversarial training (Madry et al., 2017) on OOD inputs to improve detection robustness. Meinke & Hein (2019) use a density estimator to provide guarantees on the maximal confidence within an L2 ball around uniform noise. Bitterwolf et al. (2020) use interval bound propagation (IBP) to certify worst-case guarantees for general OOD inputs under an L∞ threat model.

Model training time.



Average OOD detection performance (AUROC scores) on CIFAR-10 class 0 data (p_k = CIFAR-10 class 0, and p_-k = ImageNet).

Average OOD detection performance (AUROC scores) on CIFAR-10 class 0 data. (p k = CIFAR-10 class 0, and p -k = CIFAR-10 class 1 -class 9).

OOD detection performance (AUROC scores) of K = 0 model on CIFAR-10 class 0 data (p k = CIFAR-10 class 0, and p -k = uniform noise).

OOD detection performance (AUROC scores) of K = 15 model on CIFAR-10 class 0 data. (p k = CIFAR-10 class 0, and p -k = uniform noise).

Adversarial OOD detection performances (AUROC scores) of our method trained with different model architectures and p -k datasets. The in-distribution dataset is CIFAR-10. The definition of ResNet18 can be found at https://github.com/MadryLab/robustness.

Standard OOD detection performances (AUROC scores) of our method trained with different model architectures and p -k datasets. The in-distribution dataset is CIFAR-10.

OOD detection performances of the CIFAR-10 K = 0 model on individual datasets. Each entry in this table and the following two tables is computed using 3000 positive samples and 3000 negative samples. When ε_test > 0, perturbations are computed using PGD attacks with 200 steps and step size 0.5.

OOD detection performances of the CIFAR-10 K = 15 model on individual datasets

OOD detection performances of the CIFAR-10 K = 25 model on individual datasets

OOD detection performances of the CIFAR-10 K = 0 model on individual datasets. Each entry in this table and the following two tables is computed using 1000 positive samples (the test set only has 1000 samples) and 1000 negative samples. PGD attack setting follows the CIFAR-10 experiment.

OOD detection performances of the CIFAR-10 K = 15 model on individual datasets

OOD detection performances of the CIFAR-10 K = 25 model on individual datasets

The performances of the CelebA-HQ-128 K = 40 model under PGD attacks of different combinations of steps and step size. The perturbation limit is ε = 10 (L2 norm). Each entry is computed using 500 positive samples and 500 negative samples.

OOD detection performances of the CelebA-HQ-128 K = 0 model on individual datasets. Each entry in this table and the following two tables is computed using 3000 positive samples and 3000 negative samples. When ε_test > 0, perturbations are computed using PGD attacks with 200 steps and step size 2.0.

OOD detection performances of the CelebA-HQ-128 K = 20 model on individual datasets

OOD detection performances of the CelebA-HQ-128 K = 40 model on individual datasets

The performances of the Bedroom-128 K = 40 model under PGD attacks of different combinations of steps and step size. The perturbation limit is ε = 10 (L2 norm). Each entry is computed using 500 positive samples and 500 negative samples.

OOD detection performances of the Bedroom-128 K = 0 model on individual datasets.

OOD detection performances of the Bedroom-128 K = 20 model on individual datasets. Columns: CIFAR-10, Gaussian noise, Uniform noise, ImageNet, SVHN, CelebA-HQ, CIFAR-100, mean.

OOD detection performances of the Bedroom-128 K = 40 model on individual datasets. Columns: CIFAR-10, Gaussian noise, Uniform noise, ImageNet, SVHN, CelebA-HQ, CIFAR-100, mean.

Adversarial OOD detection performances (AUROC scores) when the in-distribution dataset is SVHN. Performance data of methods other than ours is collected from Bitterwolf et al. (2020). The results of our method are based on a PGD attack with 100 steps and step size 0.005. For results under different attack configurations, including one with random restarts, see Table 32. Our SVHN model was trained with the 80 Million Tiny Images dataset as the p_-k dataset and used the ResNet18 architecture.

The performance (AUROC scores) of the SVHN K = 45 model (the in-distribution dataset is SVHN) under attacks of different configurations. Following Bitterwolf et al. (2020), we used 1000 samples for both in-distribution data and OOD data. Similarly, we used 5 random restarts to enhance the default attack, but the performance decrease is negligible. OOD dataset (with an L∞ perturbation of ε = 0.03).

6. CONCLUSIONS AND FUTURE WORK

In this paper we analyzed the optimal solutions of the GAT training objective and the convergence property of the training algorithm. The analysis of optimal solutions justifies the application of the GAT method to training robust predictive models. We made a comparative analysis of the maximin and minimax formulations that are respectively employed by GAT and GANs. Guided by these theoretical results, we designed an unconstrained GAT algorithm, and evaluated it on the tasks of image generation and adversarial out-of-distribution detection. The competitive performance and training stability of the algorithm suggest that the studied approach could serve as a new tool for content creation, although we believe its performance could be further improved by optimizing hyperparameters and model architectures. The out-of-distribution detection results indicate that an OOD detection model's robustness can be improved by training the model against an adversary equipped with large-scale, diverse OOD data. Future work includes scaling up the training to larger images and higher-capacity models, and extending the method's application to sequential and tabular data.


Sample quality improved when λ = 0.1, but not when λ = 0.6.

K EXPANDED GENERATION RESULTS

Figure 10: Samples generated by GANs (Kurach et al., 2018); results of our method are in Figure 3.

M PROOF OF ALGORITHM 1'S CONVERGENCE PROPERTY

In this section we provide a proof that when p_-k is a uniform distribution over the space X \ Supp(p_k), Algorithm 1 converges to a D solution with no local maxima and global maxima on the support of p_k. To recap Algorithm 1: in each iteration, step 1 samples points from p_k and p_-k; step 2 solves the inner maximization by moving samples of p_-k to locations where D has maximal outputs; and step 3 solves the outer optimization by increasing D's outputs on p_k samples (maximizing log D(x)) and decreasing its outputs on p_-k samples (maximizing log(1 - D(x))). Step 2 is implemented by performing gradient ascent on D, using the initial samples of p_-k as starting points. Given that D could be a non-concave function during the course of Algorithm 1's execution, samples of p_-k could get stuck at local maximum points in X \ Supp(p_k). We now show that due to this gradient-based search method used by step 2, Algorithm 1 has the following convergence property:

Proposition 2. When p_-k is a uniform distribution over the space X \ Supp(p_k), Algorithm 1 converges to a D solution with no local maxima and global maxima on the support of p_k.

Proof. We assume that D has enough capacity such that step 3's update of D on Supp(p_k) does not affect D's outputs on X \ Supp(p_k). We assume that the environment in which Algorithm 1 is simulated has a numeric resolution limit (e.g., 10^-12) such that X \ Supp(p_k) is a finite set. (This assumption is valid when the algorithm runs on a computer.) Since X \ Supp(p_k) is a finite set, we consider the case where p_-k is a discrete uniform distribution. This distribution has non-zero probability at every point in X \ Supp(p_k). We first prove that any local maximum point in X \ Supp(p_k) can be eliminated by running Algorithm 1 for a sufficient, finite number of iterations.
To proceed, we first state the condition under which a local maximum point is eliminated: a local maximum point q in X \ Supp(p_k) will be eliminated if, over one or more iterations of the algorithm, a sufficient number of p_-k samples reach q. When this condition is satisfied, the cumulative effect of step 3 causes the local maximum to disappear by decreasing D(q) to a sufficiently small value. We next show that the above condition is always satisfied when Algorithm 1 runs for a finite number of iterations. Let U be the set of points in X \ Supp(p_k) that reach q when performing gradient ascent on D in step 2. U is non-empty when a sufficiently small step size is used for the gradient ascent, as it at least contains the point q itself when a step size of 0 is used. Since U is non-empty, a sufficient number of p_-k samples will fall in U, and subsequently reach q, if step 1 performs enough samplings of p_-k. For a given D, the set of local maximum points in X \ Supp(p_k) is a finite set. However, as new local maxima are constantly being created by D's update in each iteration, it is conceivable that this set never becomes empty. We now prove that within a finite number of iterations this set does become empty. Let Q_t be the set of local maximum points of D in X \ Supp(p_k) at iteration t. We have shown in the first part of the proof that all elements of Q_t will be reached by p_-k samples within a finite number of iterations; hence, if Q_t remained non-empty as t → ∞, D's values on X \ Supp(p_k) would be decreased by at least a fixed amount infinitely many times, which contradicts the fact that D is a bounded function on the finite set X \ Supp(p_k). We note that the above convergence property holds for any random initialization of D. Moreover, a uniform distribution is not a necessary condition here; any p_-k distribution that has non-zero density everywhere in the data space will suffice.
For a particular initialization of D, or a particular type of initialization, the local maximum points could follow some pattern, and hence the assumption on p_-k could be relaxed. However, in practice, whether or not a given p_-k distribution is sufficient for a given D can be difficult to determine.
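The eliminate-local-maxima dynamic in the proof can be illustrated with a tiny discretized simulation: a 1-D grid stands in for X, a single grid point for Supp(p_k), and hill climbing for step 2's gradient ascent. This is an illustration of the argument, not the full proposition:

```python
import random

random.seed(1)
N = 40                    # discretized 1-D data space X = {0, ..., N-1}
support = {N // 2}        # Supp(p_k): a single grid point
D = [random.random() for _ in range(N)]  # random initialization of D

def hill_climb(i):
    """Step 2: 'gradient ascent' from seed i -- move to the higher neighbor
    (staying outside the support) until a local maximum is reached."""
    while True:
        nbrs = [j for j in (i - 1, i, i + 1)
                if 0 <= j < N and j not in support]
        best = max(nbrs, key=lambda j: D[j])
        if best == i:
            return i
        i = best

eta = 0.05  # illustrative update size for step 3
for _ in range(2000):
    # Step 1: sample p_-k uniformly over X \ Supp(p_k).
    outside = [j for j in range(N) if j not in support]
    seeds = [random.choice(outside) for _ in range(4)]
    # Steps 2-3: suppress D at the reached maxima, raise it on the support.
    for i in seeds:
        D[hill_climb(i)] -= eta
    for s in support:
        D[s] += eta
```

After enough iterations the support point holds the global maximum of D: its value only ever increases, while every maximum outside the support is eventually found by some seed and suppressed.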

