NON-NEGATIVE BREGMAN DIVERGENCE MINIMIZATION FOR DEEP DIRECT DENSITY RATIO ESTIMATION

Anonymous

Abstract

This paper aims to estimate the ratio of probability densities using flexible models, such as state-of-the-art deep neural networks. Density ratio estimation (DRE) has garnered attention because the density ratio is useful in various machine learning tasks, such as anomaly detection and domain adaptation. For estimating the density ratio, methods collectively known as direct DRE have been explored. These methods are based on the minimization of the Bregman (BR) divergence between a density ratio model and the true density ratio. However, when flexible models such as deep neural networks are used, existing direct DRE methods suffer from serious train-loss hacking, a kind of over-fitting caused by the form of the empirical risk function. In this paper, we introduce a non-negative correction to the empirical risk using only prior knowledge of an upper bound of the density ratio. This correction makes a DRE method robust against train-loss hacking and thereby enables the use of flexible models, such as state-of-the-art deep neural networks. In the theoretical analysis, we show a generalization error bound for the BR divergence minimization. In our experiments, the proposed methods show favorable performance in inlier-based outlier detection and covariate shift adaptation.

1. INTRODUCTION

The density ratio estimation (DRE) problem has attracted a great deal of attention as an essential task in data science for its various industrial applications, such as domain adaptation (Shimodaira, 2000; Plank et al., 2014; Reddi et al., 2015), learning with noisy labels (Liu & Tao, 2014; Fang et al., 2020), anomaly detection (Smola et al., 2009; Hido et al., 2011; Abe & Sugiyama, 2019), two-sample testing (Keziou & Leoni-Aubin, 2005; Kanamori et al., 2010; Sugiyama et al., 2011a), causal inference (Kato et al., 2020), change point detection in time series (Kawahara & Sugiyama, 2009), and binary classification only from positive and unlabeled data (PU learning; Kato et al., 2019). For example, anomaly detection is hard to perform with standard machine learning methods such as binary classification since anomalous data are often scarce; however, it can be solved by estimating the density ratio when anomaly-free training data as well as unlabeled test data are available (Hido et al., 2008). Among the various approaches to DRE, we focus on the Bregman (BR) divergence minimization framework (Bregman, 1967; Sugiyama et al., 2011b), which generalizes various DRE methods, e.g., moment matching (Huang et al., 2007; Gretton et al., 2009), probabilistic classification (Qin, 1998; Cheng & Chu, 2004), density matching (Nguyen et al., 2010; Yamada et al., 2010), and density-ratio fitting (Kanamori et al., 2009). Recently, Kato et al. (2019) also proposed using the risk of PU learning for DRE, which can likewise be generalized from the BR divergence minimization viewpoint, as we show below.
However, existing DRE methods mainly adopt a linear-in-parameter model for nonparametric DRE (Kanamori et al., 2012) and have rarely discussed the use of more flexible models, such as deep neural networks, even though recent developments in machine learning suggest that deep neural networks can significantly improve the performance of various tasks, such as computer vision (Krizhevsky et al., 2012) and natural language processing (Bengio et al., 2001). This motivates us to use deep neural networks for DRE. However, existing DRE studies have not fully discussed using such state-of-the-art deep neural networks. For instance, although Nam & Sugiyama (2015) and Abe & Sugiyama (2019) proposed using neural networks for DRE, their neural networks are simple and shallow. When using deep neural networks in combination with empirical minimization of the BR divergence, we often observe a serious over-fitting problem, as demonstrated through experiments in Figure 2 of Section 5. We hypothesize that this is mainly because the empirical BR divergence approximated with finite samples has no lower bound, i.e., we can achieve an infinitely negative value in minimization. This hypothesis is based on Kiryo et al. (2017), which reports a similar problem in PU learning. While Kiryo et al. (2017) call this phenomenon over-fitting, we refer to it as train-loss hacking because the nuance differs slightly from the standard meaning of over-fitting. Here, we briefly introduce the train-loss hacking discussed in the PU learning literature (Kiryo et al., 2017). In a standard binary classification problem, we train a classifier $\psi$ by minimizing the following empirical risk using $\{(y_i, X_i)\}_{i=1}^{n}$:
$$\frac{1}{n}\sum_{i=1}^{n}\mathbb{1}[y_i = +1]\,\ell(\psi(X_i)) + \frac{1}{n}\sum_{i=1}^{n}\mathbb{1}[y_i = -1]\,\ell(-\psi(X_i)), \qquad (1)$$
where $y_i \in \{\pm 1\}$ is a binary label, $X_i$ is a feature, and $\ell$ is a loss function. On the other hand, in PU learning as formulated by du Plessis et al.
(2015), because we only have positive data $\{(y'_i = +1, X'_i)\}_{i=1}^{n'}$ and unlabeled data $\{X''_j\}_{j=1}^{n''}$, we minimize the following alternative empirical risk:
$$\frac{\pi}{n'}\sum_{i=1}^{n'}\ell(\psi(X'_i)) - \frac{\pi}{n'}\sum_{i=1}^{n'}\ell(-\psi(X'_i)) + \frac{1}{n''}\sum_{j=1}^{n''}\ell(-\psi(X''_j)), \qquad (2)$$
where $\pi$ is a hyper-parameter representing $p(y = +1)$; the second term is the cause of train-loss hacking. Note that the empirical risk (2) is unbiased for the population binary classification risk corresponding to (1) (du Plessis et al., 2015). While the empirical risk (1) of standard binary classification is lower bounded under an appropriate choice of $\ell$, the empirical risk (2) of PU learning proposed by du Plessis et al. (2015) is not lower bounded owing to the existence of the second term. Therefore, if a model is sufficiently flexible, we can make the empirical risk arbitrarily small by minimizing the second term $-\frac{\pi}{n'}\sum_{i=1}^{n'}\ell(-\psi(X'_i))$ alone without increasing the other terms. Kiryo et al. (2017) proposed a non-negative risk correction to avoid this problem when using neural networks. We discuss this problem again in Section 2 and Figure 1. In the existing DRE literature, this train-loss hacking has rarely been discussed, although we often face this problem when using neural networks, as mentioned in Section 5. One reason for this is that the existing methods assume a linear-in-parameter model for the density ratio (Kanamori et al., 2012), which is not as flexible as neural networks and thus does not cause the phenomenon. To mitigate the train-loss hacking, we propose a general procedure to modify the empirical BR divergence using prior knowledge of an upper bound of the density ratio. Our correction is inspired by Kiryo et al. (2017). However, their non-negative correction is only immediately applicable to binary classification; thus, we require a non-trivial rewriting of the BR divergence to generalize the approach to our problem.
We call the proposed empirical risk the non-negative BR (nnBR) divergence; it is a generalization of the method of Kiryo et al. (2017). In addition, for special cases of DRE, we can still use a lower-bounded loss (see bounded uLSIF and BKL introduced in the following section). However, such a loss also suffers from train-loss hacking (bounded uLSIF in Figure 2 and BKL-NN in Figure 4); in that case, the train-loss hacking occurs because the loss sticks to its lower bound. This type of train-loss hacking is also avoided by the proposed nnBR divergence. Our main contributions are: (1) a general procedure to modify a BR divergence to enable DRE with flexible models, (2) theoretical justification of the proposed estimator, and (3) experimental validation of the proposed method using benchmark data.

2. PROBLEM SETTING

Let $\mathcal{X}^{\mathrm{nu}} \subseteq \mathbb{R}^d$ and $\mathcal{X}^{\mathrm{de}} \subseteq \mathbb{R}^d$ be the spaces of the $d$-dimensional covariates $\{X^{\mathrm{nu}}_i\}_{i=1}^{n_{\mathrm{nu}}}$ and $\{X^{\mathrm{de}}_i\}_{i=1}^{n_{\mathrm{de}}}$, respectively, which are independent and identically distributed (i.i.d.) as $\{X^{\mathrm{nu}}_i\}_{i=1}^{n_{\mathrm{nu}}} \overset{\mathrm{i.i.d.}}{\sim} p_{\mathrm{nu}}(X)$ and $\{X^{\mathrm{de}}_i\}_{i=1}^{n_{\mathrm{de}}} \overset{\mathrm{i.i.d.}}{\sim} p_{\mathrm{de}}(X)$, where $p_{\mathrm{nu}}(X)$ and $p_{\mathrm{de}}(X)$ are probability densities over $\mathcal{X}^{\mathrm{nu}}$ and $\mathcal{X}^{\mathrm{de}}$, respectively. Here, "nu" and "de" indicate the numerator and the denominator. Our goal is to estimate the density ratio $r^*(X) = \frac{p_{\mathrm{nu}}(X)}{p_{\mathrm{de}}(X)}$. To identify the density ratio, we assume the following:

Assumption 1. The density $p_{\mathrm{nu}}(X)$ is strictly positive over the space $\mathcal{X}^{\mathrm{nu}}$, the density $p_{\mathrm{de}}(X)$ is strictly positive over the space $\mathcal{X}^{\mathrm{de}}$, and $\mathcal{X}^{\mathrm{nu}} \subseteq \mathcal{X}^{\mathrm{de}}$. In addition, the density ratio $r^*$ is bounded from above on $\mathcal{X}^{\mathrm{de}}$: $R = \sup_{X \in \mathcal{X}^{\mathrm{de}}} r^*(X) < \infty$.

Note that the assumption $\mathcal{X}^{\mathrm{nu}} \subseteq \mathcal{X}^{\mathrm{de}}$ is typical in the context of DRE. For instance, in anomaly detection with unlabeled test data, $\mathcal{X}^{\mathrm{de}}$ corresponds to a sample space including both clean and anomalous data, and $\mathcal{X}^{\mathrm{nu}}$ corresponds to a sample space with only clean data. Here, we introduce the notation of this paper. Let $E_{\mathrm{nu}}$ and $E_{\mathrm{de}}$ denote the expectations over $p_{\mathrm{nu}}(X)$ and $p_{\mathrm{de}}(X)$, respectively. Let $\hat{E}_{\mathrm{nu}}$ and $\hat{E}_{\mathrm{de}}$ denote the sample averages over $\{X^{\mathrm{nu}}_i\}_{i=1}^{n_{\mathrm{nu}}}$ and $\{X^{\mathrm{de}}_i\}_{i=1}^{n_{\mathrm{de}}}$, respectively. Let $\mathcal{H} \subset \{r : \mathbb{R}^d \to (b_r, B_r)\}$ be the hypothesis class of the density ratio, where $0 \le b_r < R < B_r$.

Table 1: Summary of methods for DRE (Sugiyama et al., 2011b). For PULogLoss, we use $C < \frac{1}{R}$.

2.1. DENSITY RATIO MATCHING UNDER THE BREGMAN DIVERGENCE

A naive way to implement DRE would be to estimate the numerator and denominator densities separately and take their ratio. However, according to Vapnik's principle, we should avoid solving a more difficult intermediate problem than the target problem (Vapnik, 1998). Therefore, various methods for directly estimating the density ratio have been proposed (Gretton et al., 2009; Sugiyama et al., 2008; Kanamori et al., 2009; Nguyen et al., 2010; Yamada et al., 2010; Kato et al., 2019). Sugiyama et al. (2011b) showed that these methods can be generalized as density ratio matching under the BR divergence. The BR divergence is an extension of the Euclidean distance to a class of divergences that share similar properties (Bregman, 1967). Formally, let $f : (b_r, B_r) \to \mathbb{R}$ be a twice continuously differentiable convex function with a bounded derivative. Then, the point-wise BR divergence associated with $f$ from $t^*$ to $t$ is defined as
$$\mathrm{BR}_f(t^* \,\|\, t) := f(t^*) - f(t) - \partial f(t)(t^* - t),$$
where $\partial f$ is the derivative of $f$. Now, the discrepancy from the true density ratio function $r^*$ to a density ratio model $r$ is measured by integrating the point-wise BR divergence as follows (Sugiyama et al., 2011b):
$$\mathrm{BR}_f(r^* \,\|\, r) := \int p_{\mathrm{de}}(X)\left[ f(r^*(X)) - f(r(X)) - \partial f(r(X))\left(r^*(X) - r(X)\right) \right] \mathrm{d}X. \qquad (3)$$
We estimate the density ratio by finding a function $r$ that minimizes the BR divergence defined in (3). Here, we subtract the constant $\overline{\mathrm{BR}} = E_{\mathrm{de}}\left[f(r^*(X))\right]$ from (3) to obtain
$$\mathrm{BR}'_f(r^* \,\|\, r) := \int p_{\mathrm{de}}(X)\left[ \partial f(r(X)) r(X) - f(r(X)) \right] \mathrm{d}X - \int p_{\mathrm{nu}}(X)\, \partial f(r(X))\, \mathrm{d}X. \qquad (4)$$
Here, Sugiyama et al. (2012) used $r^*(X) p_{\mathrm{de}}(X) = p_{\mathrm{nu}}(X)$ for removing $r^*(X)$, which is a common technique in the DRE literature. Since $\overline{\mathrm{BR}}$ is constant with respect to $r$, we have $\arg\min_r \mathrm{BR}'_f(r^* \,\|\, r) = \arg\min_r \mathrm{BR}_f(r^* \,\|\, r)$. Then, let us define the sample analogue of (4) as
$$\widehat{\mathrm{BR}}_f(r) := \hat{E}_{\mathrm{de}}\left[ \partial f(r(X_i)) r(X_i) - f(r(X_i)) \right] - \hat{E}_{\mathrm{nu}}\left[ \partial f(r(X_j)) \right]. \qquad (5)$$
For a hypothesis class $\mathcal{H}$, we estimate the density ratio by solving $\min_{r \in \mathcal{H}} \widehat{\mathrm{BR}}_f(r)$. Sugiyama et al.
(2011b) showed that various DRE methods can be unified from the viewpoint of BR divergence minimization. Furthermore, Menon & Ong (2016) showed an equivalence between conditional probability estimation and DRE by BR divergence minimization. In addition, we can derive a novel method for DRE from the methods proposed by du Plessis et al. (2015) and Kato et al. (2019). We summarize the DRE methods in Table 1. The empirical risks of least-squares importance fitting (LSIF), the Kullback-Leibler importance estimation procedure (KLIEP), logistic regression (LR), and PU learning with the log loss (PULogLoss) are given in the next subsection.

2.2. EXAMPLES OF DRE

$$\widehat{\mathrm{BR}}_{\mathrm{LSIF}}(r) := -\hat{E}_{\mathrm{nu}}\left[r(X_j)\right] + \frac{1}{2}\hat{E}_{\mathrm{de}}\left[(r(X_i))^2\right],$$
$$\widehat{\mathrm{BR}}_{\mathrm{UKL}}(r) := \hat{E}_{\mathrm{de}}\left[r(X_i)\right] - \hat{E}_{\mathrm{nu}}\left[\log r(X_j)\right],$$
$$\widehat{\mathrm{BR}}_{\mathrm{BKL}}(r) := -\hat{E}_{\mathrm{de}}\left[\log \frac{1}{1 + r(X_i)}\right] - \hat{E}_{\mathrm{nu}}\left[\log \frac{r(X_j)}{1 + r(X_j)}\right],$$
$$\widehat{\mathrm{BR}}_{\mathrm{PU}}(r) := -\hat{E}_{\mathrm{de}}\left[\log\left(1 - r(X_i)\right)\right] + C\hat{E}_{\mathrm{nu}}\left[-\log r(X_j) + \log\left(1 - r(X_j)\right)\right],$$
where $0 < C < \frac{1}{R}$. Here, $\widehat{\mathrm{BR}}_{\mathrm{LSIF}}(r)$ and $\widehat{\mathrm{BR}}_{\mathrm{PU}}(r)$ correspond to LSIF and PULogLoss, respectively. We can derive KLIEP and LR from $\widehat{\mathrm{BR}}_{\mathrm{UKL}}(r)$ and $\widehat{\mathrm{BR}}_{\mathrm{BKL}}(r)$, which are called the unnormalized Kullback-Leibler (UKL) divergence and the binary Kullback-Leibler (BKL) divergence, respectively (Sugiyama et al., 2011b). In $\widehat{\mathrm{BR}}_{\mathrm{PU}}(r)$, we restrict the model to $r \in (0, 1)$ and obtain an estimator of $Cr^*$ as a result of the risk minimization. Details of the existing methods are shown in Appendix A.
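To make the estimators above concrete, the following is a minimal NumPy sketch of the empirical uLSIF and UKL risks (the function and variable names are ours, not from the paper); it also illustrates that $\widehat{\mathrm{BR}}_{\mathrm{LSIF}}$ has no lower bound, which is the root of the train-loss hacking discussed in Section 3:

```python
import numpy as np

def lsif_risk(r_nu, r_de):
    """Empirical uLSIF risk: -E^_nu[r] + (1/2) E^_de[r^2]."""
    return -np.mean(r_nu) + 0.5 * np.mean(r_de ** 2)

def ukl_risk(r_nu, r_de):
    """Empirical UKL risk: E^_de[r] - E^_nu[log r]."""
    return np.mean(r_de) - np.mean(np.log(r_nu))

# A flexible model can push the LSIF risk arbitrarily low by inflating r
# on numerator samples while keeping r near zero on denominator samples.
r_de = np.zeros(5)
for scale in [1.0, 10.0, 100.0]:
    print(lsif_risk(np.full(5, scale), r_de))  # -1.0, -10.0, -100.0
```

The loop mimics what gradient descent over a highly flexible hypothesis class does implicitly: the risk decreases without bound while the model moves away from any reasonable density ratio.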

3. DEEP DIRECT DRE BASED ON NON-NEGATIVE RISK ESTIMATOR

In this section, we develop methods for DRE with flexible models such as neural networks.

3.1. DIFFICULTIES OF DRE USING NEURAL NETWORKS

Training neural networks for DRE by minimizing the empirical BR divergence tends to suffer from an over-fitting phenomenon. For example, we show in Section 5 that LSIF with neural networks (Nam & Sugiyama, 2015) suffers from a serious over-fitting issue. One possible cause of the over-fitting of flexible models is train-loss hacking, as shown by Kiryo et al. (2017) in the context of PU learning (Figure 1). The same mechanism applies to DRE: the term $-\hat{E}_{\mathrm{nu}}\left[\partial f(r(X_j))\right]$ can easily diverge to $-\infty$ if we minimize the empirical BR divergence (5) over a highly flexible hypothesis class $\mathcal{H}$. This is because the term is not lower bounded, unlike the loss functions of other machine learning tasks such as classification. Note that the other term $\hat{E}_{\mathrm{de}}\left[\partial f(r(X_i)) r(X_i) - f(r(X_i))\right]$ does not introduce a sufficient trade-off to prevent the divergence to $-\infty$ when the hypothesis class is highly flexible (Figure 1).

3.2. NON-NEGATIVE BR DIVERGENCE

Although DRE using flexible models suffers from serious train-loss hacking, we still have a strong motivation to use those models for analyzing data such as images and text. For example, Nam & Sugiyama (2015) and Abe & Sugiyama (2019) approximated the density ratio with neural networks, and Uehara et al. (2016) applied direct DRE to generative adversarial nets (GANs; Goodfellow et al., 2014), both by minimizing the empirical BR divergence (5). To alleviate the train-loss hacking problem, we propose the non-negative BR divergence estimator. The proposed method is inspired by Kiryo et al. (2017), who suggested a non-negative correction to the empirical risk of PU learning based on the knowledge that a part of the population risk is non-negative. In DRE, on the other hand, it is not obvious how to correct the empirical risk because we do not know which part of the population risk (4) is non-negative. However, this can be resolved by using the upper bound $R$ of the density ratio $r^*$: with prior knowledge of $R$, we can identify a part of the risk of DRE (4) that is non-negative and apply a non-negative correction to the empirical risk (5) based on the non-negativity of the population risk. Let us define $\tilde{f}$ to be a function such that $\partial f(t) = C\left(\partial f(t)t - f(t)\right) + \tilde{f}(t)$, where $0 < C < \frac{1}{R}$, and put the following assumption.

Assumption 2. Assume that $\tilde{f}(t)$ is bounded from above, and that there exists a constant $A$ such that $\partial f(t)t - f(t) + A \ge 0$ for $t \in (b_r, B_r)$.

Assumption 2 is satisfied by most of the loss functions that appear in previously proposed DRE methods (see Appendix B for examples). Under Assumption 2, because $\tilde{f}(t)$ is bounded from above, the train-loss hacking $-\hat{E}_{\mathrm{nu}}\left[\partial f(r(X_j))\right] \to -\infty$ in the minimization of the empirical risk (5) must be caused by $-\hat{E}_{\mathrm{nu}}\left[C\left\{\partial f(r(X_j))r(X_j) - f(r(X_j))\right\}\right] \to -\infty$, since
$$-\hat{E}_{\mathrm{nu}}\left[\partial f(r(X_j))\right] = \underbrace{-\hat{E}_{\mathrm{nu}}\left[\tilde{f}(r(X_j))\right]}_{\text{bounded from below}} \; \underbrace{-\,\hat{E}_{\mathrm{nu}}\left[C\left\{\partial f(r(X_j))r(X_j) - f(r(X_j))\right\}\right]}_{\to -\infty}.$$
Thus, we have identified the part of the empirical risk that causes the train-loss hacking. Therefore, by preventing this term from diverging to negative infinity, we can avoid the train-loss hacking. To achieve this, we impose a non-negative correction on the empirical risk (5); that is, under Assumption 2, we restrict the behavior of the problematic term. To incorporate the assumption, we first rewrite the population risk (4) as
$$\mathrm{BR}'_f(r^* \,\|\, r) = \underbrace{\int \left(p_{\mathrm{de}}(X) - C p_{\mathrm{nu}}(X)\right)\left[\partial f(r(X))r(X) - f(r(X)) + A\right] \mathrm{d}X}_{(*)} - \int p_{\mathrm{nu}}(X)\, \tilde{f}(r(X))\, \mathrm{d}X - A(1 - C).$$
Note that the introduced constant $A$ is irrelevant to the original optimization problem. The problematic term is incorporated into $(*)$; therefore, we next consider a constraint on $(*)$. Let us define $\ell_1(t) := \partial f(t)t - f(t) + A$ and $\ell_2(t) := -\tilde{f}(t)$. In the above equation, since Assumption 2 implies $\ell_1(t) \ge 0$, and since $0 < C < \frac{1}{R}$ and $\frac{p_{\mathrm{nu}}(X)}{p_{\mathrm{de}}(X)} \le R$ imply $p_{\mathrm{de}}(X) - C p_{\mathrm{nu}}(X) > 0$ for all $X \in \mathcal{X}$, we have
$$\int \left(p_{\mathrm{de}}(X) - C p_{\mathrm{nu}}(X)\right) \ell_1(r(X))\, \mathrm{d}X \ge 0.$$
Motivated by this inequality, we propose an empirical risk with the non-negative correction as
$$\widehat{\mathrm{nnBR}}_f(r) := \hat{E}_{\mathrm{nu}}\left[\ell_2(r(X_j))\right] + \left(\hat{E}_{\mathrm{de}}\left[\ell_1(r(X_i))\right] - C\hat{E}_{\mathrm{nu}}\left[\ell_1(r(X_j))\right]\right)_+,$$
where $(\cdot)_+ := \max\{0, \cdot\}$. Note that the non-negativity of $(*)$ always holds in population (with infinite samples) but can be violated with finite samples, causing the train-loss hacking. Our deep direct DRE (D3RE) is based on minimizing $\widehat{\mathrm{nnBR}}_f(r)$.

Remark 1 (Choice of $C$). In practice, selecting the hyper-parameter $C$ does not require accurate knowledge of $R$ because any $0 < C < 1/R$ is sufficient to justify the non-negative correction. However, selecting a $C$ that is much smaller than $1/R$ may damage the empirical performance (see Section G.1.1). This does not mean that $1/R$ itself must not be small; if $1/R$ is small, $C$ can also be small. Hence, we recommend choosing $C$ close to $1/R$ rather than much smaller.
nnBR Divergence with Existing Methods: The above strategy can be instantiated for the various methods previously proposed for DRE. Here, we introduce the non-negative BR divergence estimators corresponding to LSIF, UKL, BKL, and PULogLoss:
$$\widehat{\mathrm{nnBR}}_{\mathrm{LSIF}}(r) := -\hat{E}_{\mathrm{nu}}\left[r(X_j) - \frac{C}{2} r^2(X_j)\right] + \left(\frac{1}{2}\hat{E}_{\mathrm{de}}\left[r^2(X_i)\right] - \frac{C}{2}\hat{E}_{\mathrm{nu}}\left[r^2(X_j)\right]\right)_+,$$
$$\widehat{\mathrm{nnBR}}_{\mathrm{UKL}}(r) := -\hat{E}_{\mathrm{nu}}\left[\log r(X_j) - C r(X_j)\right] + \left(\hat{E}_{\mathrm{de}}\left[r(X_i)\right] - C\hat{E}_{\mathrm{nu}}\left[r(X_j)\right]\right)_+,$$
$$\widehat{\mathrm{nnBR}}_{\mathrm{BKL}}(r) := -\hat{E}_{\mathrm{nu}}\left[\log \frac{r(X_j)}{1 + r(X_j)} + C \log \frac{1}{1 + r(X_j)}\right] + \left(-\hat{E}_{\mathrm{de}}\left[\log \frac{1}{1 + r(X_i)}\right] + C\hat{E}_{\mathrm{nu}}\left[\log \frac{1}{1 + r(X_j)}\right]\right)_+,$$
$$\widehat{\mathrm{nnBR}}_{\mathrm{PU}}(r) := -C\hat{E}_{\mathrm{nu}}\left[\log r(X_j)\right] + \left(C\hat{E}_{\mathrm{nu}}\left[\log\left(1 - r(X_j)\right)\right] - \hat{E}_{\mathrm{de}}\left[\log\left(1 - r(X_i)\right)\right]\right)_+.$$
A more detailed derivation of $\tilde{f}$ is given in Appendix B. In Appendix C, we provide the pseudo code of D3RE. To improve the performance heuristically, we use gradient ascent in our main experiments. Note that the gradient ascent only slightly improves the performance and its use is not essential; in Appendices E and G.1.3, we also show experimental results without the gradient ascent for readers concerned about its effect.
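As a concrete illustration, the nnBR-LSIF risk above can be computed in a few lines. The following is a minimal NumPy sketch (names are ours; a PyTorch version would replace `max` with a ReLU-style clamp so that gradients flow, and the gradient-ascent heuristic mentioned above is omitted):

```python
import numpy as np

def nnbr_lsif_risk(r_nu, r_de, C):
    """Non-negative corrected LSIF risk (nnBR-LSIF):
    -E^_nu[r - (C/2) r^2] + max(0, (1/2) E^_de[r^2] - (C/2) E^_nu[r^2]).
    """
    bounded_part = -np.mean(r_nu - 0.5 * C * r_nu ** 2)
    corrected_part = max(0.0, 0.5 * np.mean(r_de ** 2)
                         - 0.5 * C * np.mean(r_nu ** 2))
    return bounded_part + corrected_part

# Unlike the uncorrected uLSIF risk, inflating r on numerator samples no
# longer drives the risk to minus infinity: the second term is clipped at
# zero and the first term grows for large r.
print(nnbr_lsif_risk(np.full(5, 100.0), np.zeros(5), C=1 / 3))
```

The same pattern (bounded part plus a clipped correction term) carries over to the UKL, BKL, and PULogLoss instances by substituting the corresponding $\ell_1$ and $\ell_2$.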

4. THEORETICAL JUSTIFICATION OF D3RE

In this section, we confirm the validity of the proposed method by providing a generalization error bound. We omit $r^*$ from the notation of $\mathrm{BR}_f$ when there is no ambiguity.

4.1. GENERALIZATION ERROR BOUND ON BR DIVERGENCE

Theorem 4 in Appendix I provides a generalization error bound under the following assumption.

Assumption 3. Let $I_r := (b_r, B_r)$. Assume that there exists an empirical risk minimizer $\hat{r} \in \arg\min_{r \in \mathcal{H}} \widehat{\mathrm{nnBR}}_f(r)$ and a population risk minimizer $\bar{r} \in \arg\min_{r \in \mathcal{H}} \mathrm{BR}_f(r)$. Assume $B_\ell := \sup_{t \in I_r}\left\{\max\{|\ell_1(t)|, |\ell_2(t)|\}\right\} < \infty$. Also assume $\ell_1$ (resp. $\ell_2$) is $L_{\ell_1}$-Lipschitz (resp. $L_{\ell_2}$-Lipschitz) on $I_r$. Assume also that $\inf_{r \in \mathcal{H}} (E_{\mathrm{de}} - C E_{\mathrm{nu}})\,\ell_1(r(X)) > 0$ holds.

For the boundedness and Lipschitz continuity in Assumption 3 to hold for the loss functions involving a logarithm (UKL, BKL, PU), the technical assumption $b_r > 0$ is sufficient. We obtain a theoretical guarantee for D3RE from Theorem 4 in Appendix I by additionally imposing Assumption 4 to bound the Rademacher complexities using a previously known result (Golowich et al., 2019, Theorem 1).

Assumption 4 (Neural networks with bounded complexity). Assume that $p_{\mathrm{nu}}$ and $p_{\mathrm{de}}$ have bounded supports: $\sup_{x \in \mathcal{X}^{\mathrm{de}}} \|x\| < \infty$. Also assume that $\mathcal{H}$ consists of real-valued neural networks of depth $L$ over the domain $\mathcal{X}$, where each parameter matrix $W_j$ has Frobenius norm at most $B_{W_j} \ge 0$, and with 1-Lipschitz activation functions $\varphi_j$ that are positive-homogeneous (i.e., $\varphi_j$ is applied element-wise and $\varphi_j(\alpha t) = \alpha \varphi_j(t)$ for all $\alpha \ge 0$).

Under Assumption 4, Lemma 3 in Appendix J reveals $\mathfrak{R}^{p_{\mathrm{nu}}}_{n_{\mathrm{nu}}}(\mathcal{H}) = O(1/\sqrt{n_{\mathrm{nu}}})$ and $\mathfrak{R}^{p_{\mathrm{de}}}_{n_{\mathrm{de}}}(\mathcal{H}) = O(1/\sqrt{n_{\mathrm{de}}})$. By combining these with Theorem 4 in Appendix I, we obtain the following theorem:

Theorem 1 (Generalization error bound for D3RE). Under Assumptions 3 and 4, for any $\delta \in (0, 1)$, we have, with probability at least $1 - \delta$,
$$\mathrm{BR}_f(\hat{r}) - \mathrm{BR}_f(\bar{r}) \le \frac{\kappa_1}{\sqrt{n_{\mathrm{de}}}} + \frac{\kappa_2}{\sqrt{n_{\mathrm{nu}}}} + 2\Phi_{f,C}(n_{\mathrm{nu}}, n_{\mathrm{de}}) + B_\ell \sqrt{8\left(\frac{1}{n_{\mathrm{de}}} + \frac{(1 + C)^2}{n_{\mathrm{nu}}}\right)\log \frac{1}{\delta}},$$
where $\kappa_1, \kappa_2$ are constants that depend on $C, f, B_{p_{\mathrm{de}}}, B_{p_{\mathrm{nu}}}, L$, and $B_{W_j}$. See Remark 6 in Appendix I for the explicit form of this bound.
By transforming the BR divergence, Theorem 1 yields generalization error bounds for various problems. For instance, by defining $f(t)$ as $\log(1 - t) + Ct\left(\log(t) - \log(1 - t)\right)$ for $0 < t < 1$, the BR divergence $\mathrm{BR}_f(r)$ becomes the risk functional of PU learning (see Appendix A for the derivation). Then, the generalization bound is similar to the one shown by Kiryo et al. (2017); i.e., it matches the classification error bound of classification only from positive and unlabeled data. Note that the dependency of the bound is standard for a classification risk bound with a Lipschitz loss (see Corollary 15 of Bartlett & Mendelson (2003)). Thus, this result implies that BR divergence minimization with D3RE also minimizes a generalization error bound of binary classification.

4.2. ESTIMATION ERROR BOUND ON L 2 NORM

Next, we derive the estimation error bound of $\hat{r}$ in the $L_2$ norm. We aim to derive the standard convergence rate of non-parametric regression; that is, under appropriate conditions, the order of $\|\hat{r} - r^*\|_{L_2(p_{\mathrm{de}})}$ is nearly $O_P(1/(n_{\mathrm{de}} \wedge n_{\mathrm{nu}}))$ (Kanamori et al., 2012). Note that, unlike the generalization error bound of the BR divergence, which is related to classification problems, we need to restrict the neural network model to achieve such a convergence rate in general. To the best of our knowledge, it is difficult to show the convergence rate for arbitrary neural network models such as ResNet (Schmidt-Hieber, 2020) in a general way. In the following Theorem 2, for a multi-layer perceptron with the ReLU activation function (Definition 3), we derive a convergence rate in the $L_2$ distance, which is the same rate as non-parametric regression using radial basis functions with the Gaussian kernel and the LSIF loss (Kanamori et al., 2012). This result also corresponds to a tighter generalization error bound than Theorem 1 under the model restriction. Note that without such a model restriction, the convergence rate is slower, following the rate of Theorem 1. The proof is shown in Appendix K. To support this result, we empirically investigate the estimation error using an artificially generated dataset with a known true density ratio in Appendix E.

Theorem 2 ($L_2$ convergence rate). Assume $f$ is $\mu$-strongly convex. Let $\mathcal{H}$ be defined as in Definition 3, and assume $r^* = \frac{p_{\mathrm{nu}}}{p_{\mathrm{de}}} \in \mathcal{H}$. Also assume the same conditions as Theorem 3. Then, for any $0 < \gamma < 2$, we have
$$\|\hat{r} - r^*\|_{L_2(p_{\mathrm{de}})} \le O_P\left(\left(\min\{n_{\mathrm{de}}, n_{\mathrm{nu}}\}\right)^{-1/(2+\gamma)}\right) \quad (n_{\mathrm{de}}, n_{\mathrm{nu}} \to \infty).$$

5. EXPERIMENT WITH IMAGE DATA

In this section, we experimentally show how the existing estimators fail to estimate the density ratio when using neural networks and how our proposed estimators succeed. To investigate the performance, we consider PU learning. For a binary classification problem with labels $y \in \{-1, +1\}$, we train a classifier only from $p(X \mid y = +1)$ and $p(X)$ to find positive data points in test data sampled from $p(X)$. The goal is to maximize the area under the receiver operating characteristic curve (AUROC), a criterion used in anomaly detection, by estimating the density ratio $r^*(X) = p(X \mid y = +1)/p(X)$. We construct the positive and negative datasets from the CIFAR-10 dataset (Krizhevsky, 2009) with 10 classes. The positive dataset comprises 'airplane', 'automobile', 'ship', and 'truck'; the negative dataset comprises 'bird', 'cat', 'deer', 'dog', 'frog', and 'horse'. We use 1,000 positive data points sampled from $p(X \mid y = +1)$ and 1,000 unlabeled data points sampled from $p(X)$ to train the models. Then, we calculate the AUROCs using 10,000 test data points sampled from $p(X)$. In this case, it is desirable to set $C < \frac{1}{2}$ because
$$\frac{p(X \mid y = +1)}{0.5\, p(X \mid y = +1) + 0.5\, p(X \mid y = -1)} = \frac{1}{0.5 + 0.5\, \frac{p(X \mid y = -1)}{p(X \mid y = +1)}} \le 2,$$
i.e., the density ratio is at most 2. For demonstrative purposes, we use the CNN architecture from the PyTorch tutorial (Paszke et al., 2019). Details of the network structure are shown in Appendix D.2. The model is trained using Adam without weight decay, and the parameters $(\beta_1, \beta_2, \epsilon)$ in Kingma & Ba (2015) are fixed at the PyTorch defaults, namely $(0.9, 0.999, 10^{-8})$. First, we compare two of the proposed estimators, $\widehat{\mathrm{nnBR}}_{\mathrm{PU}}$ (nnBR-PU) and $\widehat{\mathrm{nnBR}}_{\mathrm{LSIF}}$ (nnBR-LSIF), with the existing estimators $\widehat{\mathrm{BR}}_{\mathrm{PU}}$ (PU-NN) and $\widehat{\mathrm{BR}}_{\mathrm{LSIF}}$ (uLSIF-NN). We use the logistic loss for PULogLoss. Additionally, we conduct an experiment with uLSIF-NN using a naively capped model $\tilde{r}(X) = \min\{r(X), 1/C\}$ (Bounded uLSIF). We fix the hyper-parameter $C$ at $1/3$. We report the results for two learning rates, $1 \times 10^{-4}$ and $1 \times 10^{-5}$.
We conducted 10 trials and calculated the average AUROCs. We also compute $\hat{E}_{\mathrm{de}}[r(X)]$, which should be close to 1 if we successfully estimate the density ratio, because $\int \left(p(X \mid y = +1)/p(X)\right) p(X)\, \mathrm{d}X = 1$. The results are shown in Figure 2. In all cases, the proposed estimators outperform the other methods. In contrast, the unstable behaviors of PU-NN and uLSIF-NN are caused by the train-loss hacking (also see Kiryo et al. (2017) and Appendix G.1). The experiment also demonstrates that naive capping (Bounded uLSIF) fails to prevent train-loss hacking and leads to suboptimal behavior. Naive capping is insufficient because an unreasonable model with $r(X^{\mathrm{de}}_i) = 0$ and $r(X^{\mathrm{nu}}_i) = 1/C$ can still be a minimizer by making one part of the empirical BR divergence largely negative, e.g.,
$$\frac{1}{2}\hat{E}_{\mathrm{de}}\left[r^2(X)\right] - \frac{C}{2}\hat{E}_{\mathrm{nu}}\left[r^2(X)\right] = 0 - \frac{1}{2C}.$$
Additional experimental results, including a sensitivity analysis on the upper bound of the density ratio and comparisons with various estimators using the nnBR divergence, are shown in Appendix G.1.
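The failure mode of naive capping can be reproduced numerically. In this small sketch (variable names are ours), the degenerate capped model attains the negative value $-\frac{1}{2C}$ on the quadratic part of the uLSIF risk, whereas the non-negative correction clips that part to zero:

```python
import numpy as np

C = 1.0 / 3.0
# Degenerate capped model: r = 1/C on all numerator samples and r = 0 on
# all denominator samples; capping r at 1/C does not rule this model out.
r_nu = np.full(1000, 1.0 / C)
r_de = np.zeros(1000)

inner = 0.5 * np.mean(r_de ** 2) - 0.5 * C * np.mean(r_nu ** 2)
print(inner)            # -1/(2C) = -1.5
print(max(0.0, inner))  # 0.0 under the non-negative correction
```

This is why Bounded uLSIF still hacks the train loss while nnBR-LSIF does not: capping bounds the model output, but only the non-negative correction bounds the risk itself.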

6. INLIER-BASED OUTLIER DETECTION

As an application of D3RE, we introduce inlier-based outlier detection with experiments using benchmark datasets. Moreover, we introduce other applications, such as covariate shift adaptation, in Appendix H. In addition to CIFAR-10, we use MNIST (LeCun et al., 1998) and fashion-MNIST (FMNIST) (Xiao et al., 2017). Hido et al. (2008; 2011) applied direct DRE to this problem, and Nam & Sugiyama (2015) and Abe & Sugiyama (2019) proposed using neural networks with DRE for this problem. In relation to the experimental setting of Section 5, this problem setting can be seen as a transductive variant of PU learning (Kato et al., 2019). We follow the setting proposed by Golan & El-Yaniv (2018) using MNIST, CIFAR-10, and FMNIST. There are ten classes in each dataset; we use one class as the inlier class and all other classes as the outliers. For example, in the case of CIFAR-10, there are 5,000 training data points per class and 1,000 test data points per class, so each test set consists of 1,000 inlier samples and 9,000 outlier samples. The AUROC is used as a metric to evaluate whether the outlier classes can be detected in the test data. We compare the proposed methods with the benchmark methods deep semi-supervised anomaly detection (DeepSAD) (Ruff et al., 2020) and geometric transformations (GT) (Golan & El-Yaniv, 2018). The details of each method are shown in Appendix F. To compare the methods fairly, we use the same neural network architectures as Golan & El-Yaniv (2018); the results are shown in Table 4 in Appendix G.2. The proposed methods result in better AUROCs than the existing methods.

[Figure 2: Experimental results of Section 5. The horizontal axis is the epoch, and the vertical axis is the AUROC. The learning rates of the left and right graphs are $1 \times 10^{-4}$ and $1 \times 10^{-5}$, respectively. The upper graphs show the AUROCs and the lower graphs show $\hat{E}_{\mathrm{de}}[r(X)]$, which approaches 1 when we successfully estimate the density ratio.]
The largest performance gain is seen on CIFAR-10: the mean AUROC is improved by 0.157 on average from uLSIF-NN to nnBR-LSIF. Although GT and DeepSAD are designed for a different problem, to the best of our knowledge, they are the state-of-the-art algorithms among methods not based on DRE.

7. CONCLUSION

We proposed a non-negative correction to the empirical BR divergence for DRE. Using the prior knowledge of the upper bound of the density ratio, we can prevent train-loss hacking when using flexible models. In our theoretical analysis, we showed the generalization bound of the algorithm.

A DETAILS OF EXISTING METHODS FOR DRE

In this section, we give an overview of examples of DRE methods in the framework of density ratio matching under the BR divergence.

Least-Squares Importance Fitting (LSIF): LSIF minimizes the squared error between a density ratio model $r$ and the true density ratio $r^*$, defined as follows (Kanamori et al., 2009):
$$R_{\mathrm{LSIF}}(r) = E_{\mathrm{de}}\left[(r(X) - r^*(X))^2\right] = E_{\mathrm{de}}\left[(r^*(X))^2\right] - 2E_{\mathrm{nu}}\left[r(X)\right] + E_{\mathrm{de}}\left[(r(X))^2\right].$$
In the unconstrained LSIF (uLSIF) (Kanamori et al., 2009), we ignore the first term in the above equation and estimate the density ratio by the following minimization problem:
$$\hat{r} = \arg\min_{r \in \mathcal{H}} \left\{ \frac{1}{2}\hat{E}_{\mathrm{de}}\left[(r(X))^2\right] - \hat{E}_{\mathrm{nu}}\left[r(X)\right] + \mathcal{R}(r) \right\},$$
where $\mathcal{R}$ is a regularization term. This empirical risk minimization is equal to minimizing the empirical BR divergence defined in (5) with $f(t) = (t - 1)^2/2$.

Unnormalized Kullback-Leibler (UKL) Divergence and KL Importance Estimation Procedure (KLIEP): The KL importance estimation procedure (KLIEP) is derived from the unnormalized Kullback-Leibler (UKL) divergence objective (Sugiyama et al., 2008; Nguyen et al., 2010; Tsuboi et al., 2009; Yamada & Sugiyama, 2009; Yamada et al., 2010), which uses $f(t) = t\log(t) - t$. Ignoring the terms that are irrelevant to the optimization, we obtain the UKL divergence objective (Nguyen et al., 2010; Sugiyama et al., 2012) as
$$\mathrm{BR}_{\mathrm{UKL}}(r) = E_{\mathrm{de}}\left[r(X)\right] - E_{\mathrm{nu}}\left[\log r(X)\right].$$
Directly minimizing the UKL divergence was proposed by Nguyen et al. (2010). KLIEP solves the same problem while further imposing the constraints that the ratio model $r(X)$ is non-negative for all $X$ and normalized as $\hat{E}_{\mathrm{de}}[r(X)] = 1$. The optimization criterion of KLIEP is then (Sugiyama et al., 2008):
$$\max_r \hat{E}_{\mathrm{nu}}\left[\log r(X)\right] \quad \text{s.t.} \quad \hat{E}_{\mathrm{de}}\left[r(X)\right] = 1 \ \text{and} \ r(X) \ge 0 \ \text{for all } X.$$
Logistic Regression: By using $f(t) = t\log(t) - (1 + t)\log(1 + t)$, we obtain the following BR divergence, called the binary Kullback-Leibler (BKL) divergence:
$$\mathrm{BR}_{\mathrm{BKL}}(r) = -E_{\mathrm{de}}\left[\log \frac{1}{1 + r(X)}\right] - E_{\mathrm{nu}}\left[\log \frac{r(X)}{1 + r(X)}\right].$$
This BR divergence is derived from a formulation based on logistic regression (Hastie et al., 2001; Sugiyama et al., 2011b).

PU Learning with the Log Loss: Consider a binary classification problem and let $X$ and $y \in \{\pm 1\}$ be the feature and the label of a sample, respectively. In PU learning, the goal is to train a classifier only using positive data sampled from $p(X \mid y = +1)$ and unlabeled data sampled from $p(X)$ (Elkan & Noto, 2008). More precisely, this problem setting of PU learning is called the case-control scenario (Elkan & Noto, 2008; Niu et al., 2016). Let $\mathcal{G}$ be the set of measurable functions from $\mathcal{X}$ to $[\epsilon, 1 - \epsilon]$, where $\epsilon \in (0, 1/2)$ is a small positive value. For a loss function $\ell : \mathbb{R} \times \{\pm 1\} \to \mathbb{R}_+$, du Plessis et al. (2015) showed that the classification risk of $g \in \mathcal{G}$ in the PU problem setting can be expressed as
$$R_{\mathrm{PU}}(g) = \pi \int \left[\ell(g(X), +1) - \ell(g(X), -1)\right] p(X \mid y = +1)\, \mathrm{d}X + \int \ell(g(X), -1)\, p(X)\, \mathrm{d}X. \qquad (7)$$
According to Kato et al. (2019), we can derive the following risk for DRE from the PU learning risk (7):
$$\mathrm{BR}_{\mathrm{PU}}(g) = \frac{1}{R} E_{\mathrm{nu}}\left[-\log(g(X)) + \log(1 - g(X))\right] - E_{\mathrm{de}}\left[\log(1 - g(X))\right],$$
and Kato et al. (2019) showed that $g^* = \arg\min_{g \in \mathcal{G}} \mathrm{BR}_{\mathrm{PU}}(g)$ satisfies the following:

Proposition 1. It holds almost everywhere that
$$g^*(X) = \begin{cases} 1 - \epsilon & (X \notin D_2), \\ C\,\frac{p_{\mathrm{nu}}(X)}{p_{\mathrm{de}}(X)} & (X \in D_1 \cap D_2), \\ \epsilon & (X \notin D_1), \end{cases}$$
where $C = \frac{1}{R}$, $D_1 = \{X \mid C p_{\mathrm{nu}}(X) \ge \epsilon p_{\mathrm{de}}(X)\}$, and $D_2 = \{X \mid C p_{\mathrm{nu}}(X) \le (1 - \epsilon) p_{\mathrm{de}}(X)\}$.

Using this result, we define the empirical version of $\mathrm{BR}_{\mathrm{PU}}(g)$ as follows:
$$\widehat{\mathrm{BR}}_{\mathrm{PU}}(r) := C\hat{E}_{\mathrm{nu}}\left[-\log r(X_j) + \log\left(1 - r(X_j)\right)\right] - \hat{E}_{\mathrm{de}}\left[\log\left(1 - r(X_i)\right)\right].$$
Note that for $f(t)$ in the BR divergence, we use $f(t) = \log(1 - t) + Ct\left(\log(t) - \log(1 - t)\right)$. Then, we have
$$\partial f(t) = -\frac{C}{1 - t} + C\left(\log(t) - \log(1 - t)\right) + Ct\left(\frac{1}{t} + \frac{1}{1 - t}\right).$$
Therefore, we have
$$
\begin{aligned}
\mathrm{BR}_f(r) &:= \mathbb{E}_{\mathrm{de}}\big[\partial f(r(X))\, r(X) - f(r(X))\big] - \mathbb{E}_{\mathrm{nu}}\big[\partial f(r(X))\big] \\
&= \mathbb{E}_{\mathrm{de}}\left[ -\frac{C r(X)}{1 - r(X)} + C r(X)\big(\log r(X) - \log(1 - r(X))\big) + C r(X)^2 \left(\frac{1}{r(X)} + \frac{1}{1 - r(X)}\right) \right] \\
&\quad - \mathbb{E}_{\mathrm{de}}\big[ \log(1 - r(X)) + C r(X)\big(\log r(X) - \log(1 - r(X))\big) \big] \\
&\quad - \mathbb{E}_{\mathrm{nu}}\left[ -\frac{C}{1 - r(X)} + C\big(\log r(X) - \log(1 - r(X))\big) + C r(X)\left(\frac{1}{r(X)} + \frac{1}{1 - r(X)}\right) \right] \\
&= \mathbb{E}_{\mathrm{de}}\left[ -\frac{C r(X)}{1 - r(X)} + C r(X)\big(\log r(X) - \log(1 - r(X))\big) + \frac{C r(X)}{1 - r(X)} \right] \\
&\quad - \mathbb{E}_{\mathrm{de}}\big[ \log(1 - r(X)) + C r(X)\big(\log r(X) - \log(1 - r(X))\big) \big] \\
&\quad - \mathbb{E}_{\mathrm{nu}}\left[ -\frac{C}{1 - r(X)} + C\big(\log r(X) - \log(1 - r(X))\big) + \frac{C}{1 - r(X)} \right] \\
&= -\mathbb{E}_{\mathrm{de}}\big[\log(1 - r(X))\big] - C\,\mathbb{E}_{\mathrm{nu}}\big[\log r(X) - \log(1 - r(X))\big].
\end{aligned}
$$

Remark 2 (DRE and PU learning). Menon & Ong (2016) showed that minimizing a proper CPE loss is equivalent to minimizing a BR divergence to the true density ratio, and demonstrated the viability of using existing losses from one of CPE and DRE for the other. Kato et al. (2019) pointed out the relation between PU learning and density ratio estimation and leveraged it to solve a sample selection bias problem in PU learning. In this paper, we introduced the BR divergence with $f(t) = \log(1 - Ct) + Ct(\log(Ct) - \log(1 - Ct))$, inspired by the objective function of PU learning with the log loss. In the terminology of Menon & Ong (2016), this $f$ results in a DRE objective without a link function; in other words, it yields a direct DRE method.

B EXAMPLES OF f̃

Here, we show examples of $\tilde{f}$ such that $\partial f(t) = C(\partial f(t)t - f(t)) + \tilde{f}(t)$, where $\tilde{f}(t)$ is bounded from above and $\partial f(t)t - f(t) + A$ is non-negative.

First, we consider $f(t) = (t-1)^2/2$, which results in the LSIF objective. Because $\partial f(t) = t - 1$, we have
$$t - 1 = C\big((t-1)t - (t-1)^2/2\big) + \tilde{f}(t) \iff \tilde{f}(t) = -C\big((t-1)t - (t-1)^2/2\big) + t - 1 = -\frac{C}{2}t^2 + \frac{C}{2} + t - 1.$$
This is a concave quadratic function, and it is therefore bounded from above. Note that uLSIF satisfies Assumption 2 with $A = \frac{1}{2}$ since $\partial f(t)t - f(t) + \frac{1}{2} = \frac{1}{2}t^2$, and Assumption 3 automatically holds. Also note that the constant $A$ is irrelevant to the optimization.

Second, we consider $f(t) = t\log(t) - t$, which results in the UKL or KLIEP objective. Because $\partial f(t) = \log(t)$, we have
$$\log(t) = C\big(\log(t)t - t\log(t) + t\big) + \tilde{f}(t) \iff \tilde{f}(t) = -Ct + \log(t).$$
We can easily confirm that this function is bounded from above by taking the derivative and finding that $t = 1/C$ gives the maximum. Note that UKL satisfies Assumption 2 with $A = 0$ since $\partial f(t)t - f(t) = t\log(t) - t\log(t) + t = t$, and Assumption 3 automatically holds.

Third, we consider $f(t) = t\log(t) - (1+t)\log(1+t)$, which is used for DRE based on LR or BKL. Because $\partial f(t) = \log(t) - \log(1+t)$, we have
$$\log(t) - \log(1+t) = C\big((\log(t) - \log(1+t))t - t\log(t) + (1+t)\log(1+t)\big) + \tilde{f}(t)$$
$$\iff \tilde{f}(t) = -C\log(1+t) + \log(t) - \log(1+t) = -C\log(1+t) + \log\frac{t}{1+t}.$$
We can easily confirm that this function is bounded from above, as both terms are negative for all $t > 0$. Note that BKL satisfies Assumption 2 with $A = 0$ since $\partial f(t)t - f(t) = (\log(t) - \log(1+t))t - t\log(t) + (1+t)\log(1+t) = \log(1+t)$, and Assumption 3 automatically holds.

Fourth, we consider DRE based on PULog. By setting $f(t) = \log(1-t) + Ct(\log(t) - \log(1-t))$, we can obtain the same risk functional as introduced in Kiryo et al. (2017). Note that PULog satisfies Assumption 2 with $A = 0$ since $\partial f(t)t - f(t) = -\log(1-t)$ with $t < 1$, and Assumption 3 automatically holds.
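The non-negativity of $\ell_1(t) = \partial f(t)t - f(t) + A$ claimed above is easy to verify numerically; this is a small sketch checking the three cases with closed forms (LSIF with $A = 1/2$, UKL and BKL with $A = 0$), with function names of our own choosing:

```python
import math

def l1_lsif(t):  # f(t) = (t-1)^2/2, df(t) = t-1, A = 1/2; equals t^2/2
    return (t - 1) * t - (t - 1) ** 2 / 2 + 0.5

def l1_ukl(t):   # f(t) = t*log(t) - t, df(t) = log(t), A = 0; equals t
    return math.log(t) * t - (t * math.log(t) - t)

def l1_bkl(t):   # f(t) = t*log(t) - (1+t)*log(1+t), A = 0; equals log(1+t)
    df = math.log(t) - math.log(1 + t)
    return df * t - (t * math.log(t) - (1 + t) * math.log(1 + t))

grid = [i / 100 for i in range(1, 1001)]  # t in (0, 10]
assert all(l1_lsif(t) >= 0 for t in grid)
assert all(abs(l1_ukl(t) - t) < 1e-9 for t in grid)
assert all(abs(l1_bkl(t) - math.log(1 + t)) < 1e-9 for t in grid)
print("l1 is non-negative for LSIF, UKL, and BKL on the grid")
```

The simplified forms in the comments ($t^2/2$, $t$, $\log(1+t)$) match the closed-form expressions derived above.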

C IMPLEMENTATION

The algorithm for D3RE is described in Algorithm 1. For training with a large amount of data, we adopt stochastic optimization by splitting the dataset into mini-batches. In stochastic optimization, we separate the samples into $N$ mini-batches $(\{X^{\mathrm{nu}}_i\}_{i=1}^{n_{\mathrm{nu},j}}, \{X^{\mathrm{de}}_i\}_{i=1}^{n_{\mathrm{de},j}})$ for $j = 1, \ldots, N$, where $n_{\mathrm{nu},j}$ and $n_{\mathrm{de},j}$ are the sample sizes of each mini-batch. Then, we take the sample average within each mini-batch. Let $\hat{\mathbb{E}}^j_{\mathrm{nu}}$ and $\hat{\mathbb{E}}^j_{\mathrm{de}}$ be the sample averages over $\{X^{\mathrm{nu}}_i\}_{i=1}^{n_{\mathrm{nu},j}}$ and $\{X^{\mathrm{de}}_i\}_{i=1}^{n_{\mathrm{de},j}}$, respectively. In addition, we use regularization such as L1 and L2 penalties, denoted by $\mathcal{R}(r)$. To improve the performance, we heuristically employ the gradient ascent technique of Kiryo et al. (2017) when $\hat{\mathbb{E}}_{\mathrm{de}}[\ell_1(r(X))] - C\hat{\mathbb{E}}_{\mathrm{nu}}[\ell_1(r(X))]$ becomes negative; that is, we update the model in the direction that increases this term. Let us note that our proposed method is agnostic to the optimization procedure, and other methods such as plain gradient descent can be combined with our method. We consider further theoretical investigation of the optimization procedure to be out of scope.
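As a framework-agnostic illustration, the branch between descent and ascent can be sketched as follows; `l1_de`, `l1_nu`, and `l2_nu` stand for precomputed loss values on one mini-batch, and the function name and signature are hypothetical, not taken from any released code:

```python
def nnbre_batch_objective(l1_de, l1_nu, l2_nu, C):
    """Return (objective, mode) for one mini-batch of D3RE-style training.

    l1_de  -- values of l1(r(X)) on denominator samples
    l1_nu  -- values of l1(r(X)) on numerator samples
    l2_nu  -- values of l2(r(X)) on numerator samples
    C      -- the hyperparameter derived from the prior upper bound on the
              density ratio (C <= 1/R in the paper's notation)
    """
    mean = lambda v: sum(v) / len(v)
    corrected = mean(l1_de) - C * mean(l1_nu)  # (E_de - C*E_nu)[l1]
    if corrected >= 0:
        # Ordinary case: descend on the full corrected empirical risk.
        return mean(l2_nu) + corrected, "descent"
    # Train-loss hacking regime: the corrected term went negative, so
    # ascend on it instead (i.e., minimize its negation).
    return -corrected, "ascent"

obj, mode = nnbre_batch_objective([1.0, 1.0], [1.0], [0.2], C=0.5)
print(mode)  # "descent", since 1.0 - 0.5 * 1.0 >= 0
```

In a deep learning framework, the returned objective would be the quantity differentiated with respect to the model parameters before the optimizer step.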

D NETWORK STRUCTURE USED IN SECTIONS 5 AND 6

We explain the structures of neural networks used in the experiments.

D.1 NETWORK STRUCTURE USED IN SECTION 5

In Section 5, we used the CIFAR-10 dataset. The model was a convolutional net (Springenberg et al., 2015): (32 × 32 × 3)-C(3 × 6, 3)-C(3 × 16, 3)-128-84-1, where the input is a 32 × 32 RGB image and C(3 × 6, 3) indicates a convolution from 3 channels to 6 channels with 3 × 3 kernels, followed by ReLU. This structure has been adopted from the tutorial of Paszke et al. (2019).

Algorithm 1 D3RE
Input: Training data $\{X^{\mathrm{nu}}_i\}_{i=1}^{n_{\mathrm{nu}}}$ and $\{X^{\mathrm{de}}_i\}_{i=1}^{n_{\mathrm{de}}}$; an algorithm for stochastic optimization such as Adam (Kingma & Ba, 2015); the learning rate $\gamma$; the regularization coefficient $\lambda$.
Output: A density ratio estimator $\hat{r}$.
while no stopping criterion has been met do
  Create $N$ mini-batches $\{(\{X^{\mathrm{nu}}_i\}_{i=1}^{n_{\mathrm{nu},j}}, \{X^{\mathrm{de}}_i\}_{i=1}^{n_{\mathrm{de},j}})\}_{j=1}^{N}$.
  for $j = 1$ to $N$ do
    if $\hat{\mathbb{E}}^j_{\mathrm{de}}[\ell_1(r(X))] - C\hat{\mathbb{E}}^j_{\mathrm{nu}}[\ell_1(r(X))] \geq 0$ then
      Gradient descent: set the gradient $\nabla_r \big( \hat{\mathbb{E}}^j_{\mathrm{nu}}[\ell_2(r(X))] + \hat{\mathbb{E}}^j_{\mathrm{de}}[\ell_1(r(X))] - C\hat{\mathbb{E}}^j_{\mathrm{nu}}[\ell_1(r(X))] + \lambda\mathcal{R}(r) \big)$.
    else
      Gradient ascent: set the gradient $\nabla_r \big( -\hat{\mathbb{E}}^j_{\mathrm{de}}[\ell_1(r(X))] + C\hat{\mathbb{E}}^j_{\mathrm{nu}}[\ell_1(r(X))] + \lambda\mathcal{R}(r) \big)$.
    end if
    Update $r$ with the gradient and the learning rate $\gamma$.
  end for
end while

D.2 NETWORK STRUCTURE USED IN SECTION 6

Inlier-based Outlier Detection: We used the same LeNet-type CNNs proposed in Ruff et al. (2020). In these CNNs, each convolutional module consists of a convolutional layer followed by leaky ReLU activations with leakiness α = 0.1 and (2 × 2)-max-pooling. For MNIST, we employ a CNN with two modules: (28 × 28 × 1)-C(1 × 8, 5)-C(8 × 4, 5)-1. For CIFAR-10, we employ a CNN with three modules, (32 × 32 × 3)-C(3 × 32, 5)-C(32 × 64, 5)-C(64 × 128, 5)-1, with batch normalization (Ioffe & Szegedy, 2015) after each convolutional layer. The WRN architecture was proposed in Zagoruyko & Komodakis (2016) and is also used in Golan & El-Yaniv (2018). This structure improved the performance of image recognition by decreasing the depth and increasing the width of residual networks (He et al., 2015). We omit the detailed description of the structure here.

Covariate Shift Adaptation: We used a 5-layer perceptron with ReLU activations. The structure is 10000-1000-1000-1000-1000-1.

E EXPERIMENTS FOR REGRESSION ERROR USING SYNTHETIC DATASET

In this experiment, we investigate the regression error of the proposed D3RE. We compare our method with uLSIF (Kanamori et al., 2009) using a reproducing kernel Hilbert space (Kanamori et al., 2012). For uLSIF, we use the open implementation at https://github.com/hoxo-m/densratio_py. For D3RE, we use nnBR-LSIF with a 3-layer perceptron with ReLU activation functions, where the number of nodes in the middle layer is 100. We ran nnBR-LSIF for each C ∈ {0.8, 1, 2, 3, 4, 5, 10, 15, 20}. We also compare these methods with a naively implemented LSIF with the 3-layer perceptron. Let the dimension of the domain be $d$, and let $p_{\mathrm{nu}}(X) = \mathcal{N}(X; \mu_{\mathrm{nu}}, I_d)$ and $p_{\mathrm{de}}(X) = \mathcal{N}(X; \mu_{\mathrm{de}}, I_d)$, where $\mathcal{N}(\mu, \Sigma)$ denotes the multivariate normal distribution with mean $\mu$ and covariance $\Sigma$, $\mu_{\mathrm{nu}} = (1, 0, \ldots, 0)^\top$ and $\mu_{\mathrm{de}} = (0, 0, \ldots, 0)^\top$ are $d$-dimensional vectors, and $I_d$ is the $d$-dimensional identity matrix. We fix the sample sizes at $n_{\mathrm{nu}} = n_{\mathrm{de}} = 1{,}000$ and estimate the density ratio using uLSIF, LSIF, and D3RE (nnBR-LSIF). As a performance metric, we use the mean squared error (MSE) and the standard deviation (SD) calculated over 50 trials. Note that in this setting, we can calculate the true density ratio $r^*$. The results are shown in Table 3, with the lowest-MSE methods highlighted in bold. As shown in Table 3, the proposed nnBR-LSIF methods estimate the density ratio better than the other methods; that is, they achieve lower MSEs. Note that in this setting, the upper bound of $r^*$ is infinite because, for simplicity, we do not restrict the support of $X$. In many cases, nnBR-LSIF achieves the best performance around C = 2. This result implies that we do not need to know the exact C for better estimation; that is, we can ignore some "outlier" samples, which cause train-loss hacking.
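For reference, in this synthetic Gaussian setting the true density ratio has a closed form: with unit covariances, $r^*(x) = p_{\mathrm{nu}}(x)/p_{\mathrm{de}}(x) = \exp(\mu_{\mathrm{nu}}^\top x - \|\mu_{\mathrm{nu}}\|^2/2) = \exp(x_1 - 1/2)$, which is the target the MSE is computed against. A minimal pure-Python sketch (the function names are ours; only the first coordinate matters):

```python
import math
import random

def true_ratio(x1):
    """r*(x) = N(x; mu_nu, I_d) / N(x; mu_de, I_d) with mu_nu = (1,0,...,0)
    and mu_de = 0: the quadratic terms cancel except exp(x1 - 1/2), so the
    ratio depends only on the first coordinate."""
    return math.exp(x1 - 0.5)

def mse_against_true(estimator, n=10_000, seed=0):
    """Monte Carlo MSE of `estimator` against r* under x1 ~ N(0, 1) (p_de)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x1 = rng.gauss(0.0, 1.0)
        total += (estimator(x1) - true_ratio(x1)) ** 2
    return total / n

# Sanity checks: the true ratio itself has zero error, while the
# constant model r(x) = 1 does not.
assert mse_against_true(true_ratio) == 0.0
assert mse_against_true(lambda x1: 1.0) > 0.0
```

The closed form also makes explicit why the true upper bound is infinite here: $r^*$ grows without bound as $x_1 \to \infty$.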
F EXISTING METHODS FOR ANOMALY DETECTION

This section introduces existing methods for anomaly detection. DeepSAD is a method for semi-supervised anomaly detection that tries to take advantage of labeled anomalies (Ruff et al., 2020). GT, proposed by Golan & El-Yaniv (2018), trains neural networks on a self-labeled dataset constructed by performing 72 geometric transformations. The anomaly score of GT is calculated from the Dirichlet distribution obtained by maximum likelihood estimation using the softmax outputs of the trained network. In the problem setting of DeepSAD, we have access to a small pool of labeled samples, e.g., a subset verified by a domain expert as being normal or anomalous. The experimental results in Ruff et al. (2020) indicate that, when such samples are available, DeepSAD outperforms the other methods. In our experiments, however, such samples are not assumed to be available; hence the method does not perform well. The problem settings of Ruff et al. (2020) and ours are both termed semi-supervised anomaly detection, but the two settings are different.

G DETAILS OF EXPERIMENTS

The details of the experiments are shown in this section. The descriptions of the datasets are as follows. MNIST: The MNIST database is one of the most popular benchmark datasets for image classification; it consists of 28 × 28 pixel handwritten digits from 0 to 9, with 60,000 training samples and 10,000 test samples (LeCun et al., 1998).

CIFAR-10:

The CIFAR-10 dataset consists of 60,000 color images of size 32 × 32 from 10 classes, each having 6,000 images. There are 50,000 training images and 10,000 test images (Krizhevsky et al., 2012).

fashion-MNIST:

The fashion-MNIST dataset consists of 70,000 grayscale images of size 28 × 28 from 10 classes. There are 60,000 training images and 10,000 test images (Xiao et al., 2017). Amazon Review Dataset: Blitzer et al. (2007) published text data of Amazon reviews. The data originally consist of ratings (0-5 stars) for four different genres of products on the electronic commerce site Amazon.com: books, DVDs, electronics, and kitchen appliances. Blitzer et al. (2007) also released a pre-processed and balanced version of the original data, which consists of text data with the four labels 1, 2, 4, and 5. We map the text data into 10,000-dimensional vectors by TF-IDF with that vocabulary size. In the experiment, for the pre-processed data, we solve the regression problem in which the text data are the inputs and the ratings 1, 2, 4, and 5 are the outputs. When evaluating the performance, following Menon & Ong (2016), we calculate the PD (= 1 - AUROC) by regarding ratings 4 and 5 as positive labels and ratings 1 and 2 as negative labels.

G.1 EXPERIMENTS WITH IMAGE DATA

We show additional results for Section 5. In Figure 3, we show the training loss of the LSIF-based methods to demonstrate the train-loss hacking phenomenon caused by an objective function without a lower bound; we also show $\hat{\mathbb{E}}_{\mathrm{de}}[r(X)]$, which should approach 1 when the density ratio is successfully estimated. In Figure 3, even though the training losses of uLSIF-NN and bounded uLSIF decrease more rapidly than that of nnBR-LSIF, the test AUROC score (the higher, the better) either drops or fails to increase. These graphs are manifestations of the severe train-loss hacking in DRE without our proposed device.

G.1.1 EMPIRICAL SENSITIVITY ANALYSIS ON THE UPPER BOUNDS OF THE DENSITY RATIO

Next, we investigate the sensitivity of D3RE to the hyperparameter C. We use nnBR-LSIF and nnBR-PU as in Section 5, but vary the hyperparameter C over {1/1.2, 1/1.5, 1/2.0, 1/3.0, 1/5.0}. The other settings remain unchanged from the previous section. The results are shown in Figure 4. For R = 2.0, the estimators show better performance when 1/C is close to 2.0.

G.1.2 COMPARISON WITH VARIOUS ESTIMATORS USING NNBR DIVERGENCE

Let UKL-NN and BKL-NN denote the DRE methods using the UKL and BKL losses with neural networks without the non-negative correction. Finally, we examine the performances of nnBR-LSIF, nnBR-PU, UKL-NN, BKL-NN, nnBR-UKL, and nnBR-BKL. The learning rate was 1 × 10^-4, and the other settings were identical to those in the previous experiments. The results are shown in Figure 4. UKL-NN and BKL-NN also suffer from train-loss hacking, although the BKL loss seems more robust against train-loss hacking than the other loss functions. Although nnBR-UKL and nnBR-BKL show better performance in earlier epochs, nnBR-LSIF and nnBR-PU appear more stable.

G.1.3 RESULTS WITHOUT GRADIENT ASCENT

We also show the experimental results without the gradient ascent heuristic. Figure 5 corresponds to Figure 2 without the gradient ascent heuristic; likewise, Figure 6 corresponds to Figure 3, and Figure 7 corresponds to Figure 4. As these experiments show, although the gradient ascent/descent heuristic improves the performance, there is no significant difference between the empirical performance with and without the heuristic. Therefore, we recommend that practitioners use the gradient ascent/descent heuristic, but readers concerned with the theoretical guarantee can use the plain gradient descent algorithm; that is, naively minimize the proposed original empirical nnBR risk.

G.2 EXPERIMENTS OF INLIER-BASED OUTLIER DETECTION

In Table 4, we show the full results of inlier-based outlier detection. In almost all cases, D3RE for inlier-based outlier detection outperforms the other methods. As explained in Section F, we consider that DeepSAD does not work well because the method assumes the availability of labeled anomaly data, which is not available in our problem setting. Remark 3 (Benchmark Methods). Although GT is outperformed by our proposed method, the problem setting of the comparison is not in favor of GT, as GT does not assume access to the test data. The recently proposed methods for semi-supervised anomaly detection by Ruff et al. (2020) did not perform well without the additional side information used in Ruff et al. (2020). On the other hand, to the best of our knowledge, there are no other competitive methods in this problem setting. We use the Amazon review dataset (Blitzer et al., 2007) for multi-domain sentiment analysis. This dataset consists of text reviews from four different product domains: books, electronics (elec), DVDs, and kitchen appliances. Following Chen et al. (2012) and Menon & Ong (2016), we transform the text data using TF-IDF to map them into the instance space X = R^10000 (Salton & McGill, 1986). Each review is endowed with one of four labels indicating the positivity of the review, and our goal is to conduct regression on these labels. To this end, we perform kernel ridge regression with the polynomial kernel. We compare regression without importance weighting (w/o IW) with regression using the density ratio estimated by PU-NN, uLSIF-NN, nnBR-LSIF, nnBR-PU, uLSIF with Gaussian kernels (Kernel uLSIF), and KLIEP with Gaussian kernels (Kernel KLIEP). We train on 2,000 samples from one domain and test on 2,000 samples. Following Menon & Ong (2016), we reduce the dimensionality to 100 by principal component analysis when using Kernel uLSIF, Kernel KLIEP, and the regressions.
Following Menon & Ong (2016) and Cortes & Mohri (2011), we report the mean and standard deviation of the pairwise disagreement (PD), 1 - AUROC. A part of the results is shown in Table 6; the full results are in Appendix G.3. The methods with D3RE show preferable performance, but the improvement is not as significant as for the image data. We attribute this to the difficulty of the covariate shift problem in this dataset.

H OTHER APPLICATIONS OF DENSITY RATIO ESTIMATION

f-divergence Estimation: f-divergences (Ali & Silvey, 1966; Csiszár, 1967) are discrepancy measures between probability densities based on the density ratio; hence, the proposed method can be used for their estimation. Examples include the KL divergence (Kullback & Leibler, 1951), the Hellinger distance (Hellinger, 1909), and the Pearson divergence (Pearson, 1900).

Two-sample Homogeneity Test:

The purpose of a homogeneity test is to determine whether two or more datasets come from the same distribution (Loevinger, 1948). For two-sample testing, semiparametric f-divergence estimators with nonparametric density ratio models have been studied (Keziou, 2003; Keziou & Leoni-Aubin, 2005). Kanamori et al. (2010) and Sugiyama et al. (2011a) employed direct DRE for the nonparametric density ratio models. Generative Adversarial Networks: Generative adversarial networks (GANs) are successful deep generative models that learn to generate new data with the same distribution as the training data (Goodfellow et al., 2014). Various GAN methods have been proposed, among which Nowozin et al. (2016) proposed f-GAN, which minimizes a variational estimate of an f-divergence. Uehara et al. (2016) extended the idea of Nowozin et al. (2016) to use BR divergence minimization for DRE. The estimator proposed in this paper also has the potential to improve the method of Uehara et al. (2016). Average Treatment Effect Estimation and Off-policy Evaluation: One of the goals of causal inference is to estimate the expected treatment effect, which is a counterfactual value. Therefore, following the causality framework formulated by Rubin (1974), we consider estimating the average treatment effect (ATE). Recently, the machine learning community has also proposed off-policy evaluation (OPE), which is a generalization of ATE estimation (Dudík et al., 2011; Imai & Ratkovic, 2014; Wang et al., 2017; Narita et al., 2019; Bibaut et al., 2019; Kallus & Uehara, 2019; Oberst & Sontag, 2019). OPE has garnered attention in applications such as advertisement design selection, personalized medicine, search engines, and recommendation systems (Beygelzimer & Langford, 2009; Li et al., 2010; Athey & Wager, 2017). The central problem in ATE estimation and OPE is sample selection bias, and the density ratio plays a critical role in removing this bias.
The idea of using the density ratio dates back to Rosenbaum (1987), which proposed an inverse probability weighting (IPW) method (Horvitz & Thompson, 1952) for ATE estimation. In the IPW method, we approximate the parameter of interest by a sample average weighted by the inverse of the assignment probability of the treatment (action), also called the propensity score. Here, it is known that using the true assignment probability yields higher variance than using an estimated assignment probability, even if we know the true value (Hirano et al., 2003; Henmi & Eguchi, 2004; Henmi et al., 2007). This property can be explained from the viewpoint of semiparametric efficiency (Bickel et al., 1998): while the asymptotic variance of the IPW estimator with an estimated propensity score can achieve the efficiency bound, that of the IPW estimator with the true propensity score does not. By extending the IPW estimator, more robust ATE estimators have been proposed, such as the doubly robust (DR) estimator (Rosenbaum, 1983). The DR estimator is not only robust to model misspecification but also useful for establishing asymptotic normality. In particular, when the density ratio and the other nuisance parameters are estimated by machine learning methods, the conventional IPW and DR estimators do not have asymptotic normality (Chernozhukov et al., 2018). This is because the nuisance estimators do not satisfy Donsker's condition, which is required for showing the asymptotic normality of semiparametric models. However, by using the sample-splitting methods proposed by Klaassen (1987), Zheng & van der Laan (2011), and Chernozhukov et al. (2018), we can show asymptotic normality when using the DR estimator. Note that for the IPW estimator, we cannot show asymptotic normality even with sample splitting.
When using the IPW and DR estimator, we often consider a two-stage approach: in the first stage, we estimate the nuisance parameters, including the density ratio; in the second stage, we construct a semiparametric ATE estimator including the first-stage nuisance estimators. This is also called two-step generalized method of moments (GMM). On the other hand, from the causal inference community, there are also weighting-based covariate balancing methods (Qin & Zhang, 2007; Tan, 2010; Hainmueller, 2012; Imai & Ratkovic, 2014) . In particular, Imai & Ratkovic (2014) proposed a covariate balancing propensity score (CBPS), which simultaneously estimates the density ratio and ATE. The idea of CBPS is to construct moment conditions, including the density ratios, and estimate the ATE and density ratio via GMM simultaneously. Although the asymptotic property of the CBPS is the same as other conventional estimators, existing empirical studies report that the CBPS outperforms them (Wyss et al., 2014) . Readers may feel that the CBPS has a close relationship with the direct DRE. However, Imai & Ratkovic (2014) is less relevant to the context of the direct DRE. From the DRE perspective, the method of Imai & Ratkovic (2014) boils down to the method of Gretton et al. (2009) , which proposed direct DRE through moment matching. The research motivation of Imai & Ratkovic (2014) is to estimate the ATE with estimating a nuisance density ratio estimator simultaneously. Therefore, the density ratio itself is nuisance parameter; that is, they are not interested in the estimation performance of the density ratio. Under their motivation, they are interested in a density ratio estimator satisfying the moment condition for estimating the ATE, not in a density ratio estimator predicting the true density ratio well. 
In addition, while direct DRE methods adopt linear-in-parameter models and neural networks (our work), it is not appropriate to use such models with the CBPS (Chernozhukov et al., 2018), because the resulting density ratio estimator does not satisfy Donsker's condition; even naive Ridge and Lasso regression estimators do not satisfy it. Therefore, when using machine learning methods to estimate the density ratio, we cannot show asymptotic normality of an ATE estimator obtained by the CBPS, and we need to use the sample-splitting method of Chernozhukov et al. (2018). This means that, with the CBPS, we can only use a naive parametric linear model without regularization or classic nonparametric kernel regression. Recently, for GMM with such non-Donsker nuisance estimators, Chernozhukov et al. (2016) also proposed a new GMM method based on the conventional two-step approach. For these reasons, the CBPS is less relevant to the direct DRE context. Off-policy Evaluation with External Validity: In a problem setting combining causal inference and domain adaptation, Kato et al. (2020) recently proposed using covariate shift adaptation to solve the external validity problem in OPE, i.e., the case in which the distribution of covariates differs between the historical and evaluation data (Cole & Stuart, 2010; Pearl & Bareinboim, 2014).

Change Point Detection:

The methods for change-point detection try to detect abrupt changes in time-series data (Basseville & Nikiforov, 1993; Brodsky & Darkhovsky, 1993; Gustafsson, 2000; Nguyen et al., 2011) . There are two types of problem settings in change-point detection, namely the real-time detection (Adams, 2007; Garnett et al., 2009; Paquet, 2007) and the retrospective detection (Basseville & Nikiforov, 1993; Yamanishi & Takeuchi, 2002) . In retrospective detection, which requires longer reaction periods, Liu et al. (2012) proposed using techniques of direct DRE. Whereas the existing methods rely on linear-in-parameter models, our proposed method enables us to employ more complex models for change point detection. Similarity-based Sentiment Analysis: Kato (2019) used the density ratio estimated from PU learning for sentiment analysis of text data based on similarity.

I GENERALIZATION ERROR BOUND

The generalization error bound can be proved by building upon the proof techniques of Kiryo et al. (2017) and Lu et al. (2020).

Notation for the Theoretical Analysis: We denote the set of real values by $\mathbb{R}$ and the set of positive integers by $\mathbb{N}$. Let $\mathcal{X} \subset \mathbb{R}^d$. Let $p_{\mathrm{nu}}(x)$ and $p_{\mathrm{de}}(x)$ be probability density functions over $\mathcal{X}$, and assume that the density ratio $r^*(x) := \frac{p_{\mathrm{nu}}(x)}{p_{\mathrm{de}}(x)}$ exists and is bounded: $R := \|r^*\|_\infty < \infty$. Assume $0 < C < \frac{1}{R}$. Since $R \geq 1$ (because $1 = \int p_{\mathrm{de}}(x) r^*(x)\,\mathrm{d}x \leq 1 \cdot \|r^*\|_\infty$), we have $C \in (0, 1]$ and hence $p_{\mathrm{mod}} := p_{\mathrm{de}} - C p_{\mathrm{nu}} > 0$.

Problem Setup: Let the hypothesis class of density ratios be $\mathcal{H} \subset \{r : \mathbb{R}^d \to (b_r, B_r) =: I_r\}$, where $0 \leq b_r < R < B_r$. Let $f : I_r \to \mathbb{R}$ be a twice continuously differentiable convex function with a bounded derivative. Define $\tilde{f}$ by $\partial f(t) = C(\partial f(t)t - f(t)) + \tilde{f}(t)$, where $\partial f$ is the derivative of $f$, continuously extended to $0$ and $B_r$. Recall the definitions $\ell_1(t) := \partial f(t)t - f(t) + A$ and $\ell_2(t) := -\tilde{f}(t)$, and
$$
\begin{aligned}
\mathrm{BR}_f(r) &:= \mathbb{E}_{\mathrm{de}}[\partial f(r(X))r(X) - f(r(X)) + A] - \mathbb{E}_{\mathrm{nu}}[\partial f(r(X))] \\
&= \mathbb{E}\big[\hat{\mathbb{E}}_{\mathrm{mod}}[\partial f(r(X))r(X) - f(r(X)) + A]\big] - \mathbb{E}_{\mathrm{nu}}[\tilde{f}(r(X))] \\
&= \mathbb{E}\big[\hat{\mathbb{E}}_{\mathrm{mod}}[\ell_1(r(X))]\big] + \mathbb{E}_{\mathrm{nu}}[\ell_2(r(X))] \ \big(= (\mathbb{E}_{\mathrm{de}} - C\mathbb{E}_{\mathrm{nu}})[\ell_1(r(X))] + \mathbb{E}_{\mathrm{nu}}[\ell_2(r(X))]\big), \\
\widehat{\mathrm{nnBR}}_f(r) &:= \rho\big(\hat{\mathbb{E}}_{\mathrm{mod}}[\ell_1(r(X))]\big) + \hat{\mathbb{E}}_{\mathrm{nu}}[\ell_2(r(X))] = \rho\big((\hat{\mathbb{E}}_{\mathrm{de}} - C\hat{\mathbb{E}}_{\mathrm{nu}})[\ell_1(r(X))]\big) + \hat{\mathbb{E}}_{\mathrm{nu}}[\ell_2(r(X))],
\end{aligned}
$$
where we denote $\hat{\mathbb{E}}_{\mathrm{mod}} := \hat{\mathbb{E}}_{\mathrm{de}} - C\hat{\mathbb{E}}_{\mathrm{nu}}$ and $\rho$ is a consistent correction function with Lipschitz constant $L_\rho$ (Definition 1).

Remark 4. The true density ratio $r^*$ minimizes $\mathrm{BR}_f$.

Definition 1 (Consistent correction function; Lu et al. (2020)). A function $\rho : \mathbb{R} \to \mathbb{R}$ is called a consistent correction function if it is Lipschitz continuous, non-negative, and $\rho(x) = x$ for all $x \geq 0$.

Definition 2 (Rademacher complexity). Given $n \in \mathbb{N}$ and a distribution $p$, define the Rademacher complexity $\mathfrak{R}^p_n(\mathcal{H})$ of a function class $\mathcal{H}$ as
$$\mathfrak{R}^p_n(\mathcal{H}) := \mathbb{E}_p \mathbb{E}_\sigma \left[ \sup_{r \in \mathcal{H}} \left| \frac{1}{n} \sum_{i=1}^n \sigma_i r(X_i) \right| \right],$$
where $\{\sigma_i\}_{i=1}^n$ are Rademacher variables (i.e., independent variables following the uniform distribution over $\{-1, +1\}$) and $\{X_i\}_{i=1}^n \overset{\text{i.i.d.}}{\sim} p$.
The theorem in the main paper is a special case of Theorem 3 with $\rho(\cdot) := \max\{0, \cdot\}$ (in which case $L_\rho = 1$) and Theorem 4.

Theorem 3 (Generalization error bound). Assume that $B_\ell := \sup_{t \in I_r} \max\{|\ell_1(t)|, |\ell_2(t)|\} < \infty$. Assume $\ell_1$ is $L_{\ell_1}$-Lipschitz and $\ell_2$ is $L_{\ell_2}$-Lipschitz. Assume that there exist an empirical risk minimizer $\hat{r} \in \arg\min_{r \in \mathcal{H}} \widehat{\mathrm{nnBR}}_f(r)$ and a population risk minimizer $\bar{r} \in \arg\min_{r \in \mathcal{H}} \mathrm{BR}_f(r)$. Also assume $\inf_{r \in \mathcal{H}} \mathbb{E}[\hat{\mathbb{E}}_{\mathrm{mod}}[\ell_1(r(X))]] > 0$ and that $(\rho - \mathrm{Id})$ is $L_{\rho-\mathrm{Id}}$-Lipschitz. Then for any $\delta \in (0, 1)$, with probability at least $1 - \delta$, we have
$$\mathrm{BR}_f(\hat{r}) - \mathrm{BR}_f(\bar{r}) \leq 8 L_\rho L_{\ell_1} \mathfrak{R}^{p_{\mathrm{de}}}_{n_{\mathrm{de}}}(\mathcal{H}) + 8(L_\rho C L_{\ell_1} + L_{\ell_2}) \mathfrak{R}^{p_{\mathrm{nu}}}_{n_{\mathrm{nu}}}(\mathcal{H}) + 2\Phi_{(C,f,\rho)}(n_{\mathrm{nu}}, n_{\mathrm{de}}) + B_\ell \sqrt{8\left(\frac{L_\rho^2}{n_{\mathrm{de}}} + \frac{(1 + L_\rho C)^2}{n_{\mathrm{nu}}}\right) \log\frac{1}{\delta}},$$
where $\Phi_{(C,f,\rho)}(n_{\mathrm{nu}}, n_{\mathrm{de}})$ is defined as in Lemma 2.

Proof. Since $\hat{r}$ minimizes $\widehat{\mathrm{nnBR}}_f$, we have
$$
\begin{aligned}
\mathrm{BR}_f(\hat{r}) - \mathrm{BR}_f(\bar{r}) &= \mathrm{BR}_f(\hat{r}) - \widehat{\mathrm{nnBR}}_f(\hat{r}) + \widehat{\mathrm{nnBR}}_f(\hat{r}) - \mathrm{BR}_f(\bar{r}) \\
&\leq \mathrm{BR}_f(\hat{r}) - \widehat{\mathrm{nnBR}}_f(\hat{r}) + \widehat{\mathrm{nnBR}}_f(\bar{r}) - \mathrm{BR}_f(\bar{r}) \\
&\leq 2 \sup_{r \in \mathcal{H}} |\widehat{\mathrm{nnBR}}_f(r) - \mathrm{BR}_f(r)| \\
&\leq \underbrace{2 \sup_{r \in \mathcal{H}} |\widehat{\mathrm{nnBR}}_f(r) - \mathbb{E}[\widehat{\mathrm{nnBR}}_f(r)]|}_{\text{maximal deviation}} + \underbrace{2 \sup_{r \in \mathcal{H}} |\mathbb{E}[\widehat{\mathrm{nnBR}}_f(r)] - \mathrm{BR}_f(r)|}_{\text{bias}}.
\end{aligned}
$$
By McDiarmid's inequality, with probability at least $1 - \delta$,
$$\sup_{r \in \mathcal{H}} |\widehat{\mathrm{nnBR}}_f(r) - \mathbb{E}[\widehat{\mathrm{nnBR}}_f(r)]| \leq \underbrace{\mathbb{E}\left[\sup_{r \in \mathcal{H}} |\widehat{\mathrm{nnBR}}_f(r) - \mathbb{E}[\widehat{\mathrm{nnBR}}_f(r)]|\right]}_{\text{expected maximal deviation}} + B_\ell \sqrt{2\left(\frac{L_\rho^2}{n_{\mathrm{de}}} + \frac{(1 + L_\rho C)^2}{n_{\mathrm{nu}}}\right) \log\frac{1}{\delta}}.$$
Applying Lemma 1 to the expected maximal deviation term and Lemma 2 to the bias term, we obtain the assertion.

The following lemma generalizes the symmetrization lemmas proved in Kiryo et al. (2017) and Lu et al. (2020).

Lemma 1 (Symmetrization under Lipschitz-continuous modification). Let $0 \leq a < b$, $J \in \mathbb{N}$, and $\{K_j\}_{j=1}^J \subset \mathbb{N}$. Given i.i.d. samples $D^{(j,k)} := \{X^{(j,k)}_i\}_{i=1}^{n_{(j,k)}}$, each from a distribution $p^{(j,k)}$ over $\mathcal{X}$, consider a stochastic process $\hat{S}$ indexed by $\mathcal{F} \subset (a,b)^{\mathcal{X}}$ of the form
$$\hat{S}(f) = \sum_{j=1}^J \rho_j\left( \sum_{k=1}^{K_j} \hat{\mathbb{E}}^{(j,k)}[\ell^{(j,k)}(f(X))] \right),$$
where each $\rho_j$ is an $L_{\rho_j}$-Lipschitz function on $\mathbb{R}$, each $\ell^{(j,k)}$ is an $L_{\ell^{(j,k)}}$-Lipschitz function on $(a,b)$, and $\hat{\mathbb{E}}^{(j,k)}$ denotes the expectation with respect to the empirical measure of $D^{(j,k)}$. Denote $S(f) := \mathbb{E}[\hat{S}(f)]$, where $\mathbb{E}$ is the expectation with respect to the product measure of $\{D^{(j,k)}\}_{(j,k)}$. Here, the index $j$ denotes the grouping of terms due to $\rho_j$, and $k$ denotes each sample-average term. Then we have
$$\mathbb{E}\left[\sup_{f \in \mathcal{F}} |\hat{S}(f) - S(f)|\right] \leq 4 \sum_{j=1}^J \sum_{k=1}^{K_j} L_{\rho_j} L_{\ell^{(j,k)}} \mathfrak{R}^{p^{(j,k)}}_{n_{(j,k)}}(\mathcal{F}).$$

Proof. First, we consider a continuous extension of $\ell^{(j,k)}$, defined on $(a,b)$, to $[0,b)$. Since the functions in $\mathcal{F}$ take values only in $(a,b)$, this extension can be performed without affecting the values of $\hat{S}(f)$ or $S(f)$. We extend the function by defining the values for $x \in [0,a]$ as $\ell^{(j,k)}(x) := \lim_{x' \downarrow a} \ell^{(j,k)}(x')$, where the limit is guaranteed to exist since $\ell^{(j,k)}$ is Lipschitz continuous, hence uniformly continuous. Then, $\ell^{(j,k)}$ remains an $L_{\ell^{(j,k)}}$-Lipschitz continuous function on $[0,b)$. Now we perform symmetrization (Vapnik, 1998), deal with the $\rho_j$'s, and then bound the symmetrized process by the Rademacher complexity. Denoting independent (ghost) copies of the samples $\{D^{(j,k)}\}$ by primed symbols, and the corresponding expectations and sample averages likewise,
$$
\begin{aligned}
\mathbb{E}\sup_{f \in \mathcal{F}} |\hat{S}(f) - S(f)| &\leq \sum_{j=1}^J \mathbb{E}\sup_{f \in \mathcal{F}} \left|\rho_j\Big(\sum_{k=1}^{K_j} \hat{\mathbb{E}}^{(j,k)}[\ell^{(j,k)}(f(X))]\Big) - \mathbb{E}'\rho_j\Big(\sum_{k=1}^{K_j} \hat{\mathbb{E}}'^{(j,k)}[\ell^{(j,k)}(f(X'))]\Big)\right| \\
&\leq \sum_{j=1}^J \mathbb{E}\mathbb{E}'\sup_{f \in \mathcal{F}} \left|\rho_j\Big(\sum_{k=1}^{K_j} \hat{\mathbb{E}}^{(j,k)}[\ell^{(j,k)}(f(X))]\Big) - \rho_j\Big(\sum_{k=1}^{K_j} \hat{\mathbb{E}}'^{(j,k)}[\ell^{(j,k)}(f(X'))]\Big)\right| \\
&\leq \sum_{j=1}^J L_{\rho_j} \sum_{k=1}^{K_j} \mathbb{E}\mathbb{E}'\sup_{f \in \mathcal{F}} \left|\hat{\mathbb{E}}^{(j,k)}[\ell^{(j,k)}(f(X))] - \hat{\mathbb{E}}'^{(j,k)}[\ell^{(j,k)}(f(X'))]\right| \\
&= \sum_{j=1}^J L_{\rho_j} \sum_{k=1}^{K_j} \mathbb{E}\mathbb{E}'\sup_{f \in \mathcal{F}} \left|\hat{\mathbb{E}}^{(j,k)}[\ell^{(j,k)}(f(X)) - \ell^{(j,k)}(0)] - \hat{\mathbb{E}}'^{(j,k)}[\ell^{(j,k)}(f(X')) - \ell^{(j,k)}(0)]\right| \\
&\leq \sum_{j=1}^J L_{\rho_j} \sum_{k=1}^{K_j} 2\,\mathfrak{R}^{p^{(j,k)}}_{n_{(j,k)}}\big(\{\ell^{(j,k)} \circ f - \ell^{(j,k)}(0) : f \in \mathcal{F}\}\big) \\
&\leq \sum_{j=1}^J L_{\rho_j} \sum_{k=1}^{K_j} 2 \cdot 2 L_{\ell^{(j,k)}} \mathfrak{R}^{p^{(j,k)}}_{n_{(j,k)}}(\mathcal{F}),
\end{aligned}
$$
where we applied Talagrand's contraction lemma for the two-sided Rademacher complexity (Ledoux & Talagrand, 1991; Bartlett & Mendelson, 2001) with respect to $t \mapsto \ell^{(j,k)}(t) - \ell^{(j,k)}(0)$ in the last inequality.

Lemma 2 (Bias due to risk correction). Assume $\inf_{r \in \mathcal{H}} \mathbb{E}[\hat{\mathbb{E}}_{\mathrm{mod}}[\ell_1(r(X))]] > 0$ and that $(\rho - \mathrm{Id})$ is $L_{\rho-\mathrm{Id}}$-Lipschitz on $\mathbb{R}$. Then there exists $\alpha > 0$ such that
$$\sup_{r \in \mathcal{H}} |\mathbb{E}[\widehat{\mathrm{nnBR}}_f(r)] - \mathrm{BR}_f(r)| \leq (1 + C) B_\ell L_{\rho-\mathrm{Id}} \exp\left( -\frac{2\alpha^2}{(B_\ell^2/n_{\mathrm{de}}) + (C^2 B_\ell^2/n_{\mathrm{nu}})} \right) =: \Phi_{(C,f,\rho)}(n_{\mathrm{nu}}, n_{\mathrm{de}}).$$

Remark 5. Note that we already have $p_{\mathrm{mod}} \geq 0$ and $\ell_1 \geq 0$, and hence $\inf_{r \in \mathcal{H}} \mathbb{E}[\hat{\mathbb{E}}_{\mathrm{mod}}[\ell_1(r(X))]] \geq 0$. Therefore, the assumption of Lemma 2 essentially refers to the strict positivity of the infimum. Here, $\mathbb{E}$ and $\mathbb{P}$ denote the expectation and the probability with respect to the joint distribution of the samples included in $\hat{\mathbb{E}}_{\mathrm{mod}}$.

Proof. Fix an arbitrary $r \in \mathcal{H}$. Writing $\mathbb{1}\{\cdot\}$ for the indicator function and $\mathrm{Id}$ for the identity function, we have
$$
\begin{aligned}
|\mathbb{E}[\widehat{\mathrm{nnBR}}_f(r)] - \mathrm{BR}_f(r)| &= \big|\mathbb{E}\big[\rho(\hat{\mathbb{E}}_{\mathrm{mod}}[\ell_1(r(X))]) - \hat{\mathbb{E}}_{\mathrm{mod}}[\ell_1(r(X))]\big]\big| \\
&\leq \mathbb{E}\big[|\rho(\hat{\mathbb{E}}_{\mathrm{mod}}[\ell_1(r(X))]) - \hat{\mathbb{E}}_{\mathrm{mod}}[\ell_1(r(X))]|\big] \\
&= \mathbb{E}\big[\mathbb{1}\{\rho(\hat{\mathbb{E}}_{\mathrm{mod}}[\ell_1(r(X))]) \neq \hat{\mathbb{E}}_{\mathrm{mod}}[\ell_1(r(X))]\} \cdot |(\rho - \mathrm{Id})(\hat{\mathbb{E}}_{\mathrm{mod}}[\ell_1(r(X))])|\big].
\end{aligned}
$$
Since $|(\rho - \mathrm{Id})(s)| = |(\rho - \mathrm{Id})(s) - (\rho - \mathrm{Id})(0)| \leq L_{\rho-\mathrm{Id}}|s|$ and $|\hat{\mathbb{E}}_{\mathrm{mod}}[\ell_1(r(X))]| \leq (1 + C)B_\ell$, the right-hand side is bounded by
$$(1 + C) B_\ell L_{\rho-\mathrm{Id}} \cdot \mathbb{E}\big[\mathbb{1}\{\rho(\hat{\mathbb{E}}_{\mathrm{mod}}[\ell_1(r(X))]) \neq \hat{\mathbb{E}}_{\mathrm{mod}}[\ell_1(r(X))]\}\big].$$
On the other hand, since $\inf_{r \in \mathcal{H}} \mathbb{E}[\hat{\mathbb{E}}_{\mathrm{mod}}[\ell_1(r(X))]] > 0$ is assumed, there exists $\alpha > 0$ such that $\mathbb{E}[\hat{\mathbb{E}}_{\mathrm{mod}}[\ell_1(r(X))]] > \alpha$ for any $r \in \mathcal{H}$. Therefore, denoting the support of a function by $\mathrm{supp}(\cdot)$,
$$
\begin{aligned}
\mathbb{E}\big[\mathbb{1}\{\rho(\hat{\mathbb{E}}_{\mathrm{mod}}[\ell_1(r(X))]) \neq \hat{\mathbb{E}}_{\mathrm{mod}}[\ell_1(r(X))]\}\big] &= \mathbb{P}\big(\hat{\mathbb{E}}_{\mathrm{mod}}[\ell_1(r(X))] \in \mathrm{supp}(\rho - \mathrm{Id})\big) \\
&\leq \mathbb{P}\big(\hat{\mathbb{E}}_{\mathrm{mod}}[\ell_1(r(X))] < 0\big) \\
&\leq \mathbb{P}\big(\hat{\mathbb{E}}_{\mathrm{mod}}[\ell_1(r(X))] < \mathbb{E}[\hat{\mathbb{E}}_{\mathrm{mod}}[\ell_1(r(X))]] - \alpha\big).
\end{aligned}
$$
Now we apply McDiarmid's inequality to the right-most quantity. The absolute change of $\hat{\mathbb{E}}_{\mathrm{mod}}[\ell_1(r(X))]$ caused by altering one data point is bounded by $\frac{B_\ell}{n_{\mathrm{de}}}$ if the altered point is a sample from $p_{\mathrm{de}}$, and by $\frac{C B_\ell}{n_{\mathrm{nu}}}$ otherwise. Therefore, McDiarmid's inequality implies
$$\mathbb{P}\big(\hat{\mathbb{E}}_{\mathrm{mod}}[\ell_1(r(X))] < \mathbb{E}[\hat{\mathbb{E}}_{\mathrm{mod}}[\ell_1(r(X))]] - \alpha\big) \leq \exp\left( -\frac{2\alpha^2}{(B_\ell^2/n_{\mathrm{de}}) + (C^2 B_\ell^2/n_{\mathrm{nu}})} \right).$$
Combining the bounds completes the proof.

Theorem 4 (Generalization error bound).
Under Assumption 3, for any $\delta\in(0,1)$, with probability at least $1-\delta$, we have
$$\mathrm{BR}_f(\hat r) - \mathrm{BR}_f(r^*) \le L_{\ell_1}\mathfrak R^{p_{\mathrm{de}}}_{n_{\mathrm{de}}}(\mathcal H) + 8(CL_{\ell_1}+L_{\ell_2})\mathfrak R^{p_{\mathrm{nu}}}_{n_{\mathrm{nu}}}(\mathcal H) + 2\Phi^f_C(n_{\mathrm{nu}},n_{\mathrm{de}}) + B_\ell\sqrt{8\Big(\frac{1}{n_{\mathrm{de}}}+\frac{(1+C)^2}{n_{\mathrm{nu}}}\Big)\log\frac{1}{\delta}},$$
where $\Phi^f_C(n_{\mathrm{nu}},n_{\mathrm{de}}) := (1+C)B_\ell\exp\big(-\tfrac{2\alpha^2}{(B_\ell^2/n_{\mathrm{de}})+(C^2B_\ell^2/n_{\mathrm{nu}})}\big)$ and $\alpha>0$ is a constant determined in the proof of Lemma 2 in Appendix I.

Remark 6 (Explicit form of the bound in Theorem 1). Here, we show the explicit form of the bound in Theorem 1 as follows:
\begin{align*}
\mathrm{BR}_f(\hat r) - \mathrm{BR}_f(r^*) &\le \frac{\kappa_1}{\sqrt{n_{\mathrm{de}}}} + \frac{\kappa_2}{\sqrt{n_{\mathrm{nu}}}} + 2\Phi^f_C(n_{\mathrm{nu}},n_{\mathrm{de}}) + B_\ell\sqrt{8\Big(\frac{1}{n_{\mathrm{de}}}+\frac{(1+C)^2}{n_{\mathrm{nu}}}\Big)\log\frac{1}{\delta}} \\
&= \frac{L_{\ell_1}B_{p_{\mathrm{de}}}\big(\sqrt{2\log(2)L}+1\big)\prod_{j=1}^L B_{W_j}}{\sqrt{n_{\mathrm{de}}}} + \frac{8(CL_{\ell_1}+L_{\ell_2})B_{p_{\mathrm{nu}}}\big(\sqrt{2\log(2)L}+1\big)\prod_{j=1}^L B_{W_j}}{\sqrt{n_{\mathrm{nu}}}} \\
&\quad + (1+C)B_\ell\exp\!\left(-\frac{2\alpha^2}{(B_\ell^2/n_{\mathrm{de}})+(C^2B_\ell^2/n_{\mathrm{nu}})}\right) + B_\ell\sqrt{8\Big(\frac{1}{n_{\mathrm{de}}}+\frac{(1+C)^2}{n_{\mathrm{nu}}}\Big)\log\frac{1}{\delta}}.
\end{align*}
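For concreteness, the explicit bound of Remark 6 can be evaluated numerically. The sketch below is a direct transcription of the formula; every constant value passed in is an illustrative assumption, not a value taken from the paper.

```python
import math

def remark6_bound(n_de, n_nu, C, B_ell, L_ell1, L_ell2,
                  B_p_de, B_p_nu, B_W, depth, alpha, delta):
    """Evaluate the explicit generalization bound of Remark 6.

    B_W is the list of Frobenius-norm bounds (B_{W_1}, ..., B_{W_L});
    all arguments are the constants appearing in the bound and must be
    supplied by the caller (illustrative values in the example below).
    """
    def rademacher_bound(B_p, n):
        # Lemma 3 (Golowich et al., 2019): B_p (sqrt(2 log(2) L) + 1) prod_j B_{W_j} / sqrt(n)
        return B_p * (math.sqrt(2 * math.log(2) * depth) + 1) * math.prod(B_W) / math.sqrt(n)

    # Bias term Phi_C^f introduced by the non-negative correction (Lemma 2).
    phi = (1 + C) * B_ell * math.exp(
        -2 * alpha ** 2 / (B_ell ** 2 / n_de + C ** 2 * B_ell ** 2 / n_nu))
    # High-probability deviation term from McDiarmid's inequality.
    dev = B_ell * math.sqrt(8 * (1 / n_de + (1 + C) ** 2 / n_nu) * math.log(1 / delta))
    return (L_ell1 * rademacher_bound(B_p_de, n_de)
            + 8 * (C * L_ell1 + L_ell2) * rademacher_bound(B_p_nu, n_nu)
            + 2 * phi + dev)
```

As expected from the $O(n^{-1/2})$ Rademacher terms and the exponentially decaying bias term, the bound shrinks monotonically as both sample sizes grow.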

J RADEMACHER COMPLEXITY BOUND

The following lemma provides an upper bound on the Rademacher complexity of multi-layer perceptron models in terms of the Frobenius norms of their parameter matrices. Alternatively, other approaches to bounding the Rademacher complexity can be employed. The assertion of the lemma follows immediately from the proof of Theorem 1 of Golowich et al. (2019) after a slight modification to incorporate the absolute value function in our definition of the Rademacher complexity.

Lemma 3 (Rademacher complexity bound (Golowich et al., 2019, Theorem 1)). Assume the distribution $p$ has a bounded support: $B_p := \sup_{x\in\mathrm{supp}(p)}\|x\| < \infty$. Let $\mathcal H$ be the class of real-valued networks of depth $L$ over the domain $\mathcal X$, where each parameter matrix $W_j$ has Frobenius norm at most $B_{W_j}\ge 0$, and with $1$-Lipschitz activation functions $\varphi_j$ that are positive-homogeneous (i.e., $\varphi_j$ is applied element-wise and $\varphi_j(\alpha t)=\alpha\varphi_j(t)$ for all $\alpha\ge 0$). Then
$$\mathfrak R^p_n(\mathcal H) \le \frac{B_p\big(\sqrt{2\log(2)L}+1\big)\prod_{j=1}^L B_{W_j}}{\sqrt n}.$$

Proof. The assertion immediately follows once we modify the beginning of the proof of Theorem 1 of Golowich et al. (2019) by introducing the absolute value function inside the supremum of the Rademacher complexity as
$$\mathbb E_\sigma\sup_{r\in\mathcal H}\Big|\sum_{i=1}^n\sigma_i r(x_i)\Big| \le \frac{1}{\lambda}\log \mathbb E_\sigma\sup_{r\in\mathcal H}\exp\Big(\lambda\Big|\sum_{i=1}^n\sigma_i r(x_i)\Big|\Big)$$
for $\lambda>0$. The rest of the proof is identical to that of Theorem 1 of Golowich et al. (2019).
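As a quick numerical illustration of Lemma 3, consider the simplest admissible case: depth $L = 1$ with the identity activation (which is 1-Lipschitz and positive-homogeneous), i.e., the linear class $\{x\mapsto\langle w,x\rangle : \|w\|\le B_W\}$. For this class the supremum in the empirical (conditional on the sample) Rademacher complexity has the closed form $B_W\|\frac1n\sum_i\sigma_ix_i\|$, so it can be estimated by Monte Carlo and compared against the bound. The sample distribution below is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, B_W = 200, 5, 1.0
X = rng.uniform(-1.0, 1.0, size=(n, d))          # sample from a distribution with bounded support
B_p = float(np.max(np.linalg.norm(X, axis=1)))   # empirical proxy for sup_{supp(p)} ||x||

# Exact supremum over the linear class {x -> <w, x> : ||w|| <= B_W}:
#   sup_w |(1/n) sum_i sigma_i <w, x_i>| = B_W * ||(1/n) sum_i sigma_i x_i||
trials = 500
vals = []
for _ in range(trials):
    sigma = rng.choice([-1.0, 1.0], size=n)      # Rademacher signs
    vals.append(B_W * float(np.linalg.norm(sigma @ X / n)))
rademacher_mc = float(np.mean(vals))

# Lemma 3 bound with depth L = 1:
bound = B_p * (np.sqrt(2 * np.log(2) * 1) + 1) * B_W / np.sqrt(n)
assert rademacher_mc <= bound
```

The Monte Carlo estimate sits well below the bound, consistent with the slack of the factor $\sqrt{2\log(2)L}+1$.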

K PROOF OF THEOREM 2

We first relate the $L_2$ error bound to the BR divergence generalization error bound in the following lemma.

Lemma 4 ($L_2$ distance bound). Let $\mathcal H := \{r:\mathcal X\to(b_r,B_r)=:I_r \mid \int|r(x)|^2\,dx<\infty\}$ and assume $r^*\in\mathcal H$. If $\inf_{t\in I_r}f''(t)>0$, then with $\mu := \inf_{t\in I_r}f''(t)$ we have, for all $r\in\mathcal H$,
$$\|r-r^*\|^2_{L_2(p_{\mathrm{de}})} \le \frac{2}{\mu}\big(\mathrm{BR}_f(r)-\mathrm{BR}_f(r^*)\big).$$

Proof. Since $\mu := \inf_{t\in I_r}f''(t)>0$, the function $f$ is $\mu$-strongly convex on $I_r$. By the definition of the BR divergence and strong convexity,
$$\mathrm{BR}_f(r)-\mathrm{BR}_f(r^*) = \mathbb E_{\mathrm{de}}\big[f(r^*(X))-f(r(X))-\partial f(r(X))(r^*(X)-r(X))\big] \ge \mathbb E_{\mathrm{de}}\Big[\frac{\mu}{2}\big(r^*(X)-r(X)\big)^2\Big] = \frac{\mu}{2}\|r^*-r\|^2_{L_2(p_{\mathrm{de}})}.$$

Lemma 5 ($\ell_2$ distance bound). Fix $r\in\mathcal H$. Given $n$ samples $\{x_i\}_{i=1}^n$ from $p_{\mathrm{de}}$, with probability at least $1-\delta$, we have
$$\frac1n\sum_{i=1}^n\big(r(x_i)-r^*(x_i)\big)^2 \le \mathbb E\big[(r-r^*)^2(X)\big] + (2R)^2\sqrt{\frac{\log(1/\delta)}{2n}} = \|r-r^*\|^2_{L_2(p_{\mathrm{de}})} + (2R)^2\sqrt{\frac{\log(1/\delta)}{2n}}.$$

Proof. The assertion follows from McDiarmid's inequality after noting that altering one sample changes the left-hand side by at most $\frac1n(2R)^2$.

Thus, a generalization error bound in terms of $\mathrm{BR}_f$ can be converted into an $L_2$-distance bound when the true density ratio and the density ratio model are square-integrable and $f$ is strongly convex. However, the convergence rate obtained by combining this conversion with Theorem 1 is slower than $O_P\big((\min\{n_{\mathrm{de}},n_{\mathrm{nu}}\})^{-1/4}\big)$. On the other hand, Kanamori et al. (2012) derived an $O_P\big((\min\{n_{\mathrm{de}},n_{\mathrm{nu}}\})^{-1/(2+\gamma)}\big)$ convergence rate. To derive this rate for neural networks, we need to restrict the network models. In the following part, we prove Theorem 2 for the hypothesis class $\mathcal H$ defined below.

Definition 3 (ReLU neural networks; Schmidt-Hieber, 2020). For $L\in\mathbb N$ and $p=(p_0,\dots,p_{L+1})\in\mathbb N^{L+2}$,
$$\mathcal F(L,p) := \big\{f: x\mapsto W_L\sigma_{v_L}W_{L-1}\sigma_{v_{L-1}}\cdots W_1\sigma_{v_1}W_0x \;:\; W_i\in\mathbb R^{p_{i+1}\times p_i},\ v_i\in\mathbb R^{p_i}\ (i=0,\dots,L)\big\},$$
where $\sigma_v(y) := \sigma(y-v)$ and $\sigma(\cdot)=\max\{\cdot,0\}$ is applied in an element-wise manner.
Then, for $s\in\mathbb N$, $F\ge 0$, $L\in\mathbb N$, and $p\in\mathbb N^{L+2}$, define
$$\mathcal H(L,p,s,F) := \Big\{f\in\mathcal F(L,p) : \sum_{j=0}^L\big(\|W_j\|_0+|v_j|_0\big)\le s,\ \|f\|_\infty\le F\Big\},$$
where $\|\cdot\|_0$ denotes the number of non-zero entries of the matrix or the vector, and $\|\cdot\|_\infty$ denotes the supremum norm. Now, fixing $\bar L, \bar p, s\in\mathbb N$ as well as $F>0$, we define $\mathrm{Ind}_{\bar L,\bar p} := \{(L,p) : L\in\mathbb N,\ L\le\bar L,\ p\in[\bar p]^{L+2}\}$, and we consider the hypothesis classes
$$\tilde{\mathcal H} := \bigcup_{(L,p)\in\mathrm{Ind}_{\bar L,\bar p}}\mathcal H(L,p,s,F), \qquad \mathcal H := \{r\in\tilde{\mathcal H} : \mathrm{Im}(r)\subset(b_r,B_r)\}.$$
Moreover, we define $I_1:\mathrm{Ind}_{\bar L,\bar p}\to\mathbb R$ and $I:\mathcal H\to[0,\infty)$ by
$$I_1(L,p) := 2|\mathrm{Ind}_{\bar L,\bar p}|^{\frac{1}{s+1}}(L+1)V^2, \qquad I(r) := \max\Big\{\|r\|_\infty,\ \min_{(L,p)\in\mathrm{Ind}_{\bar L,\bar p}:\, r\in\mathcal H(L,p,s,F)} I_1(L,p)\Big\},$$
where $V := \prod_{l=0}^{L+1}(p_l+1)$, and we define $\mathcal H_M := \{r\in\mathcal H : I(r)\le M\}$. Note that the requirement on the hypothesis class in Theorem 1 is not as tight as that in Theorem 2. We now prove Theorem 2 as follows.

Proof. Thanks to the strong convexity, by Lemma 4 we have
\begin{align*}
\frac{\mu}{2}\|\hat r-r^*\|^2_{L_2(p_{\mathrm{de}})} &\le \mathrm{BR}_f(\hat r)-\mathrm{BR}_f(r^*) \\
&= \big(\mathrm{BR}_f(\hat r)-\widehat{\mathrm{BR}}_f(\hat r)\big) + \big(\widehat{\mathrm{BR}}_f(\hat r)-\widehat{\mathrm{nnBR}}_f(\hat r)\big) + \big(\widehat{\mathrm{nnBR}}_f(\hat r)-\widehat{\mathrm{nnBR}}_f(r^*)\big) + \big(\widehat{\mathrm{nnBR}}_f(r^*)-\widehat{\mathrm{BR}}_f(r^*)\big) + \big(\widehat{\mathrm{BR}}_f(r^*)-\mathrm{BR}_f(r^*)\big) \\
&\le \underbrace{\big(\mathrm{BR}_f(\hat r)-\widehat{\mathrm{BR}}_f(\hat r)+\widehat{\mathrm{BR}}_f(r^*)-\mathrm{BR}_f(r^*)\big)}_{=:A} + \underbrace{2\sup_{r\in\mathcal H}\big|\widehat{\mathrm{BR}}_f(r)-\widehat{\mathrm{nnBR}}_f(r)\big|}_{=:B},
\end{align*}
where we used $\widehat{\mathrm{nnBR}}_f(\hat r)\le\widehat{\mathrm{nnBR}}_f(r^*)$.

To bound $A$, for ease of notation, let $\ell^r_1 = \ell_1(r(X))$ and $\ell^r_2=\ell_2(r(X))$. Since
$$\mathrm{BR}_f(r) = \mathbb E_{\mathrm{de}}\ell_1(r(X)) - C\,\mathbb E_{\mathrm{nu}}\ell_1(r(X)) + \mathbb E_{\mathrm{nu}}\ell_2(r(X)), \qquad \widehat{\mathrm{BR}}_f(r) = \hat{\mathbb E}_{\mathrm{de}}\ell_1(r(X)) - C\,\hat{\mathbb E}_{\mathrm{nu}}\ell_1(r(X)) + \hat{\mathbb E}_{\mathrm{nu}}\ell_2(r(X)),$$
we have
$$A = (\mathbb E_{\mathrm{de}}-\hat{\mathbb E}_{\mathrm{de}})(\ell^{\hat r}_1-\ell^{r^*}_1) - C(\mathbb E_{\mathrm{nu}}-\hat{\mathbb E}_{\mathrm{nu}})(\ell^{\hat r}_1-\ell^{r^*}_1) + (\mathbb E_{\mathrm{nu}}-\hat{\mathbb E}_{\mathrm{nu}})(\ell^{\hat r}_2-\ell^{r^*}_2) \le \big|(\mathbb E_{\mathrm{de}}-\hat{\mathbb E}_{\mathrm{de}})(\ell^{\hat r}_1-\ell^{r^*}_1)\big| + C\big|(\mathbb E_{\mathrm{nu}}-\hat{\mathbb E}_{\mathrm{nu}})(\ell^{\hat r}_1-\ell^{r^*}_1)\big| + \big|(\mathbb E_{\mathrm{nu}}-\hat{\mathbb E}_{\mathrm{nu}})(\ell^{\hat r}_2-\ell^{r^*}_2)\big|.$$
By applying Lemma 10, for any $0<\gamma<2$ we have
$$A \le O_P\Bigg(\max\Bigg\{\frac{\|\hat r-r^*\|^{1-\gamma/2}_{L_2(p_{\mathrm{de}})}}{\sqrt{\min\{n_{\mathrm{de}},n_{\mathrm{nu}}\}}},\ \frac{1}{(\min\{n_{\mathrm{de}},n_{\mathrm{nu}}\})^{2/(2+\gamma)}}\Bigg\}\Bigg).$$
On the other hand, by Lemma 12 and Lemma 7, together with the assumption $\inf_{r\in\mathcal H}\mathbb E\,\hat{\mathbb E}_{\mathrm{mod}}\ell_1(r(X))>0$, there exists $\alpha>0$ such that
$$B \le O_P\Bigg(\exp\!\Big(-\frac{2\alpha^2}{(B_\ell^2/n_{\mathrm{de}})+(C^2B_\ell^2/n_{\mathrm{nu}})}\Big)\Bigg).$$
Combining the above bounds on $A$ and $B$, for any $0<\gamma<2$ we get
$$\|\hat r-r^*\|^2_{L_2(p_{\mathrm{de}})} \le O_P\Bigg(\max\Bigg\{\frac{\|\hat r-r^*\|^{1-\gamma/2}_{L_2(p_{\mathrm{de}})}}{\sqrt{\min\{n_{\mathrm{de}},n_{\mathrm{nu}}\}}},\ \frac{1}{(\min\{n_{\mathrm{de}},n_{\mathrm{nu}}\})^{2/(2+\gamma)}}\Bigg\}\Bigg) + O_P\Bigg(\exp\!\Big(-\frac{2\alpha^2}{(B_\ell^2/n_{\mathrm{de}})+(C^2B_\ell^2/n_{\mathrm{nu}})}\Big)\Bigg) \le O_P\Bigg(\max\Bigg\{\frac{\|\hat r-r^*\|^{1-\gamma/2}_{L_2(p_{\mathrm{de}})}}{\sqrt{\min\{n_{\mathrm{de}},n_{\mathrm{nu}}\}}},\ \frac{1}{(\min\{n_{\mathrm{de}},n_{\mathrm{nu}}\})^{2/(2+\gamma)}}\Bigg\}\Bigg),$$
since the exponential term decays faster than any polynomial rate. Solving this inequality in $\|\hat r-r^*\|_{L_2(p_{\mathrm{de}})}$, we obtain
$$\|\hat r-r^*\|_{L_2(p_{\mathrm{de}})} \le O_P\big((\min\{n_{\mathrm{de}},n_{\mathrm{nu}}\})^{-\frac{1}{2+\gamma}}\big).$$
Each lemma used in the proof is provided as follows.
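As a numerical sanity check of the strong-convexity step (Lemma 4): for the LSIF generator $f(t)=(t-1)^2/2$ we have $f''\equiv 1$, so $\mu=1$ and the excess Bregman risk $\mathbb E_{\mathrm{de}}[f(r^*)-f(r)-f'(r)(r^*-r)]$ equals $\frac12\|r-r^*\|^2_{L_2(p_{\mathrm{de}})}$ exactly. The density and the ratio models in this sketch are illustrative assumptions, not those of the paper's experiments.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, size=200_000)        # samples from p_de (standard normal, assumed)

f = lambda t: 0.5 * (t - 1.0) ** 2            # LSIF generator; f'' = 1, hence mu = 1
df = lambda t: t - 1.0                        # derivative of f
r_star = lambda x: np.exp(-0.5 * x)           # hypothetical "true" ratio values (illustrative)
r = lambda x: np.exp(-0.4 * x)                # hypothetical model

# Excess Bregman risk  E_de[ f(r*) - f(r) - f'(r)(r* - r) ]
excess = float(np.mean(f(r_star(x)) - f(r(x)) - df(r(x)) * (r_star(x) - r(x))))
l2sq = float(np.mean((r(x) - r_star(x)) ** 2))   # ||r - r*||^2 in L_2(p_de)

# Lemma 4 with mu = 1:  l2sq <= 2 * excess  (with equality for the quadratic f)
assert abs(2.0 * excess - l2sq) < 1e-9
```

For a non-quadratic generator (e.g., the KL case $f(t)=t\log t - t$), the identity becomes the one-sided inequality of Lemma 4 with $\mu=\inf_{t\in I_r}f''(t)$.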

K.1 COMPLEXITY OF THE HYPOTHESIS CLASS

For the function classes in Definition 3, we have the following evaluations of their complexities.

Lemma 6 (Lemma 5 in Schmidt-Hieber (2020)). For $L\in\mathbb N$ and $p\in\mathbb N^{L+2}$, let $V := \prod_{l=0}^{L+1}(p_l+1)$. Then, for any $\delta>0$,
$$\log N\big(\delta, \mathcal H(L,p,s,\infty), \|\cdot\|_\infty\big) \le (s+1)\log\big(2\delta^{-1}(L+1)V^2\big).$$

Lemma 7. There exists $c>0$ such that $\mathfrak R^{p_{\mathrm{nu}}}_{n_{\mathrm{nu}}}(\mathcal H) \le c\,n_{\mathrm{nu}}^{-1/2}$ and $\mathfrak R^{p_{\mathrm{de}}}_{n_{\mathrm{de}}}(\mathcal H) \le c\,n_{\mathrm{de}}^{-1/2}$.

Proof. By Dudley's entropy integral bound (Wainwright, 2019, Theorem 5.22) and Lemma 6, we have
$$\mathfrak R^{p_{\mathrm{nu}}}_{n_{\mathrm{nu}}}\big(\mathcal H(L,p,s,F)\big) \le 32\int_0^{2F}\sqrt{\frac{\log N(\delta,\mathcal H(L,p,s,F),\|\cdot\|_\infty)}{n_{\mathrm{nu}}}}\,d\delta \le 32\left(\int_0^{2F}\big((s+1)\log(2\delta^{-1}(L+1)V^2)\big)^{1/2}\,d\delta\right) n_{\mathrm{nu}}^{-1/2}.$$
Therefore, there exists $c>0$ such that
$$\mathfrak R^{p_{\mathrm{nu}}}_{n_{\mathrm{nu}}}(\mathcal H) \le \sum_{(L,p)\in\mathrm{Ind}_{\bar L,\bar p}}\mathfrak R^{p_{\mathrm{nu}}}_{n_{\mathrm{nu}}}\big(\mathcal H(L,p,s,F)\big) \le c\,n_{\mathrm{nu}}^{-1/2}.$$
The same argument applies to $\mathfrak R^{p_{\mathrm{de}}}_{n_{\mathrm{de}}}(\mathcal H)$, and we obtain the assertion.

Lemma 8. There exists $c_0>0$ such that for any $\gamma>0$, any $\delta>0$, and any $M\ge 1$, we have
$$\log N\big(\delta,\mathcal H_M,\|\cdot\|_\infty\big) \le \frac{s+1}{\gamma}\Big(\frac{M}{\delta}\Big)^\gamma \qquad\text{and}\qquad \sup_{r\in\mathcal H_M}\|r-r^*\|_\infty \le c_0 M.$$

Proof. The first assertion is a result of the following calculation:
$$\log N\big(\delta,\mathcal H_M,\|\cdot\|_\infty\big) \le \log\!\!\sum_{\substack{(L,p)\in\mathrm{Ind}_{\bar L,\bar p}\\ I_1(L,p)\le M}}\!\! N\big(\delta,\mathcal H(L,p,s,F),\|\cdot\|_\infty\big) \le \log\!\!\sum_{\substack{(L,p)\in\mathrm{Ind}_{\bar L,\bar p}\\ I_1(L,p)\le M}}\!\!\big(2\delta^{-1}(L+1)V^2\big)^{s+1} \le \log\Bigg(|\mathrm{Ind}_{\bar L,\bar p}|\Big(\frac1\delta M|\mathrm{Ind}_{\bar L,\bar p}|^{-\frac{1}{s+1}}\Big)^{s+1}\Bigg) = (s+1)\log\frac{M}{\delta} < \frac{s+1}{\gamma}\Big(\frac{M}{\delta}\Big)^\gamma,$$
where the first inequality follows from $\mathcal H_M\subset\bigcup_{(L,p)\in\mathrm{Ind}_{\bar L,\bar p}:I_1(L,p)\le M}\mathcal H(L,p,s,F)$, and the last inequality from $\log x = \frac1\gamma\log x^\gamma < \frac1\gamma x^\gamma$, which holds for all $x,\gamma>0$. The second assertion can be confirmed by noting that for any $r\in\mathcal H_M$ with $M\ge 1$,
$$\|r-r^*\|_\infty \le \|r\|_\infty+\|r^*\|_\infty \le M+\|r^*\|_\infty \le (1+\|r^*\|_\infty)M$$
holds, using $M\ge 1$.

Definition 4 (Derived function class and bracketing entropy). Given a real-valued function class $\mathcal F$, define $\ell\circ\mathcal F := \{\ell\circ f : f\in\mathcal F\}$. By extension, we define $I:\ell\circ\mathcal H\to[1,\infty)$ by $I(\ell\circ r) = I(r)$ and $\ell\circ\mathcal H_M := \{\ell\circ r : r\in\mathcal H_M\}$. Note that, as a result, $\ell\circ\mathcal H_M$ coincides with $\{\ell\circ r\in\ell\circ\mathcal H : I(\ell\circ r)\le M\}$.

Lemma 9. Let $\ell:(b_r,B_r)\to\mathbb R$ be a $\nu$-Lipschitz continuous function.
Let $H_B\big(\delta,\mathcal F,\|\cdot\|_{L_2(P)}\big)$ denote the bracketing entropy of $\mathcal F$ with respect to a distribution $P$. Then, for any distribution $P$, any $\gamma>0$, any $M\ge 1$, and any $\delta>0$, we have
$$H_B\big(\delta,\ell\circ\mathcal H_M,\|\cdot\|_{L_2(P)}\big) \le \frac{(s+1)(2\nu)^\gamma}{\gamma}\Big(\frac{M}{\delta}\Big)^\gamma.$$
Moreover, there exists $c_0>0$ such that for any $M\ge 1$ and any distribution $P$,
$$\sup_{\ell\circ r\in\ell\circ\mathcal H_M}\|\ell\circ r-\ell\circ r^*\|_{L_2(P)} \le c_0\nu M, \qquad \sup_{\substack{\ell\circ r\in\ell\circ\mathcal H_M\\ \|\ell\circ r-\ell\circ r^*\|_{L_2(P)}\le\delta}}\|\ell\circ r-\ell\circ r^*\|_\infty \le c_0\nu M \quad\text{for all }\delta>0.$$

Proof. By combining Lemma 2.1 in van de Geer (2000) with Lemma 8, we have
$$H_B\big(\delta,\ell\circ\mathcal H_M,\|\cdot\|_{L_2(P)}\big) \le \log N\Big(\frac{\delta}{2},\ell\circ\mathcal H_M,\|\cdot\|_\infty\Big) \le \log N\Big(\frac{\delta}{2\nu},\mathcal H_M,\|\cdot\|_\infty\Big) \le \frac{s+1}{\gamma}\Big(\frac{2\nu M}{\delta}\Big)^\gamma.$$
For $M\ge 1$, both suprema in the second assertion are bounded by $\sup_{\ell\circ r\in\ell\circ\mathcal H_M}\|\ell\circ r-\ell\circ r^*\|_\infty$, and Lemma 8 implies
$$\sup_{\ell\circ r\in\ell\circ\mathcal H_M}\|\ell\circ r-\ell\circ r^*\|_\infty \le \nu\sup_{r\in\mathcal H_M}\|r-r^*\|_\infty \le \nu c_0 M.$$

Lemma 11 below is a proposition originally presented in van de Geer (2000), which was rephrased in Kanamori et al. (2012) in a form that is convenient for our purpose.

Proof of Lemma 12 (the lemma is stated below). First, by combining Lemma 13, the assumption on the Rademacher complexities, and Markov's inequality, there exist $\alpha>0$ and $n^0_{\mathrm{de}},n^0_{\mathrm{nu}}\in\mathbb N$ such that for any $n_{\mathrm{de}}\ge n^0_{\mathrm{de}}$, $n_{\mathrm{nu}}\ge n^0_{\mathrm{nu}}$, and any $\delta\in(0,1)$, we have with probability at least $1-\delta$,
$$\sup_{r\in\mathcal H}\big|\widehat{\mathrm{nnBR}}_f(r)-\widehat{\mathrm{BR}}_f(r)\big| \le \frac{(1+C)B_\ell L_{\rho-\mathrm{Id}}}{\delta}\exp\!\Big(-\frac{2\alpha^2}{(B_\ell^2/n_{\mathrm{de}})+(C^2B_\ell^2/n_{\mathrm{nu}})}\Big).$$
Therefore, we have the assertion.

Lemma 13. Assume $\mathfrak R^{p_{\mathrm{de}}}_{n_{\mathrm{de}}}(\mathcal H) = o(1)$ $(n_{\mathrm{de}}\to\infty)$ and $\mathfrak R^{p_{\mathrm{nu}}}_{n_{\mathrm{nu}}}(\mathcal H) = o(1)$ $(n_{\mathrm{nu}}\to\infty)$. Also assume the same conditions as Theorem 3. Then, there exist $\alpha>0$ and $n^0_{\mathrm{de}},n^0_{\mathrm{nu}}\in\mathbb N$ such that for any $n_{\mathrm{de}}\ge n^0_{\mathrm{de}}$ and $n_{\mathrm{nu}}\ge n^0_{\mathrm{nu}}$,
$$\mathbb E\sup_{r\in\mathcal H}\big|\widehat{\mathrm{nnBR}}_f(r)-\widehat{\mathrm{BR}}_f(r)\big| \le (1+C)B_\ell L_{\rho-\mathrm{Id}}\exp\!\Big(-\frac{2\alpha^2}{(B_\ell^2/n_{\mathrm{de}})+(C^2B_\ell^2/n_{\mathrm{nu}})}\Big).$$

Proof. As derived in the detailed computation below, there exists $\beta>0$ such that
$$\mathbb E\sup_{r\in\mathcal H}\big|\widehat{\mathrm{nnBR}}_f(r)-\widehat{\mathrm{BR}}_f(r)\big| \le (1+C)B_\ell L_{\rho-\mathrm{Id}}\,P\Big(\beta < \sup_{r\in\mathcal H}\big(\mathbb E\,\hat{\mathbb E}_{\mathrm{mod}}\ell_1(r(X))-\hat{\mathbb E}_{\mathrm{mod}}\ell_1(r(X))\big)\Big).$$
Take an arbitrary $\alpha\in(0,\beta)$. Since $\mathfrak R^{p_{\mathrm{de}}}_{n_{\mathrm{de}}}(\mathcal H)\to 0$ $(n_{\mathrm{de}}\to\infty)$ and $\mathfrak R^{p_{\mathrm{nu}}}_{n_{\mathrm{nu}}}(\mathcal H)\to 0$ $(n_{\mathrm{nu}}\to\infty)$, we can apply Lemma 14 and obtain the assertion.

Lemma 14. Let $\beta>\alpha>0$. Assume that there exist $n^0_{\mathrm{de}},n^0_{\mathrm{nu}}\in\mathbb N$ such that for any $n_{\mathrm{de}}\ge n^0_{\mathrm{de}}$ and $n_{\mathrm{nu}}\ge n^0_{\mathrm{nu}}$,
$$4L_{\ell_1}\mathfrak R^{p_{\mathrm{de}}}_{n_{\mathrm{de}}}(\mathcal H) + 4CL_{\ell_1}\mathfrak R^{p_{\mathrm{nu}}}_{n_{\mathrm{nu}}}(\mathcal H) < \beta-\alpha.$$
Then, for any $n_{\mathrm{de}}\ge n^0_{\mathrm{de}}$ and $n_{\mathrm{nu}}\ge n^0_{\mathrm{nu}}$, we have
$$P\Big(\beta<\sup_{r\in\mathcal H}\big(\mathbb E\,\hat{\mathbb E}_{\mathrm{mod}}\ell_1(r(X))-\hat{\mathbb E}_{\mathrm{mod}}\ell_1(r(X))\big)\Big) \le \exp\!\Big(-\frac{2\alpha^2}{(B_\ell^2/n_{\mathrm{de}})+(C^2B_\ell^2/n_{\mathrm{nu}})}\Big).$$

Proof. First, we will apply McDiarmid's inequality. The absolute difference caused by altering one data point in $\sup_{r\in\mathcal H}(\mathbb E\,\hat{\mathbb E}_{\mathrm{mod}}\ell_1(r(X))-\hat{\mathbb E}_{\mathrm{mod}}\ell_1(r(X)))$ is bounded by $B_\ell/n_{\mathrm{de}}$ if the change is in a sample from $p_{\mathrm{de}}$ and by $CB_\ell/n_{\mathrm{nu}}$ otherwise. This can be confirmed by letting $\hat{\mathbb E}'_{\mathrm{mod}}$ denote the sample-averaging operator obtained by altering one data point in $\hat{\mathbb E}_{\mathrm{mod}}$ and observing
$$\sup_{r\in\mathcal H}\big\{\mathbb E\,\hat{\mathbb E}_{\mathrm{mod}}\ell_1(r(X))-\hat{\mathbb E}_{\mathrm{mod}}\ell_1(r(X))\big\} - \sup_{r\in\mathcal H}\big\{\mathbb E\,\hat{\mathbb E}_{\mathrm{mod}}\ell_1(r(X))-\hat{\mathbb E}'_{\mathrm{mod}}\ell_1(r(X))\big\} \le \sup_{r\in\mathcal H}\big\{\hat{\mathbb E}'_{\mathrm{mod}}\ell_1(r(X))-\hat{\mathbb E}_{\mathrm{mod}}\ell_1(r(X))\big\}.$$
The right-most expression can be bounded by $B_\ell/n_{\mathrm{de}}$ if the change is in a sample from $p_{\mathrm{de}}$ and by $CB_\ell/n_{\mathrm{nu}}$ otherwise; likewise, the reverse difference $\sup_{r\in\mathcal H}\{\mathbb E\,\hat{\mathbb E}_{\mathrm{mod}}\ell_1(r(X))-\hat{\mathbb E}'_{\mathrm{mod}}\ell_1(r(X))\} - \sup_{r\in\mathcal H}\{\mathbb E\,\hat{\mathbb E}_{\mathrm{mod}}\ell_1(r(X))-\hat{\mathbb E}_{\mathrm{mod}}\ell_1(r(X))\}$ can be bounded by one of these quantities. Therefore, McDiarmid's inequality implies, for any $\epsilon>0$,
$$P\Big(\epsilon < \sup_{r\in\mathcal H}\big(\mathbb E\,\hat{\mathbb E}_{\mathrm{mod}}\ell_1(r(X))-\hat{\mathbb E}_{\mathrm{mod}}\ell_1(r(X))\big) - \mathbb E\sup_{r\in\mathcal H}\big(\mathbb E\,\hat{\mathbb E}_{\mathrm{mod}}\ell_1(r(X))-\hat{\mathbb E}_{\mathrm{mod}}\ell_1(r(X))\big)\Big) \le \exp\!\Big(-\frac{2\epsilon^2}{(B_\ell^2/n_{\mathrm{de}})+(C^2B_\ell^2/n_{\mathrm{nu}})}\Big). \tag{10}$$
Moreover, by symmetrization,
$$\mathbb E\sup_{r\in\mathcal H}\big(\mathbb E\,\hat{\mathbb E}_{\mathrm{mod}}\ell_1(r(X))-\hat{\mathbb E}_{\mathrm{mod}}\ell_1(r(X))\big) \le \mathbb E\sup_{r\in\mathcal H}\big|\mathbb E_{\mathrm{de}}\ell_1(r(X))-\hat{\mathbb E}_{\mathrm{de}}\ell_1(r(X))\big| + C\,\mathbb E\sup_{r\in\mathcal H}\big|\mathbb E_{\mathrm{nu}}\ell_1(r(X))-\hat{\mathbb E}_{\mathrm{nu}}\ell_1(r(X))\big| \le 4L_{\ell_1}\mathfrak R^{p_{\mathrm{de}}}_{n_{\mathrm{de}}}(\mathcal H) + 4CL_{\ell_1}\mathfrak R^{p_{\mathrm{nu}}}_{n_{\mathrm{nu}}}(\mathcal H) =: R.$$
By the assumption, if $n_{\mathrm{de}}\ge n^0_{\mathrm{de}}$ and $n_{\mathrm{nu}}\ge n^0_{\mathrm{nu}}$, we have $R<\beta-\alpha$. Therefore,
$$\mathbb E\sup_{r\in\mathcal H}\big(\mathbb E\,\hat{\mathbb E}_{\mathrm{mod}}\ell_1(r(X))-\hat{\mathbb E}_{\mathrm{mod}}\ell_1(r(X))\big) \le R < \beta-\alpha < \beta,$$
hence $\beta-\mathbb E\sup_{r\in\mathcal H}(\mathbb E\,\hat{\mathbb E}_{\mathrm{mod}}\ell_1(r(X))-\hat{\mathbb E}_{\mathrm{mod}}\ell_1(r(X)))>0$. Therefore, we can take $\epsilon = \beta-\mathbb E\sup_{r\in\mathcal H}(\mathbb E\,\hat{\mathbb E}_{\mathrm{mod}}\ell_1(r(X))-\hat{\mathbb E}_{\mathrm{mod}}\ell_1(r(X)))$ in Equation (10) to obtain
$$P\Big(\beta<\sup_{r\in\mathcal H}\big(\mathbb E\,\hat{\mathbb E}_{\mathrm{mod}}\ell_1(r(X))-\hat{\mathbb E}_{\mathrm{mod}}\ell_1(r(X))\big)\Big) \le \exp\!\Big(-\frac{2\big(\beta-\mathbb E\sup_{r\in\mathcal H}(\mathbb E\,\hat{\mathbb E}_{\mathrm{mod}}\ell_1(r(X))-\hat{\mathbb E}_{\mathrm{mod}}\ell_1(r(X)))\big)^2}{(B_\ell^2/n_{\mathrm{de}})+(C^2B_\ell^2/n_{\mathrm{nu}})}\Big) \le \exp\!\Big(-\frac{2(\beta-R)^2}{(B_\ell^2/n_{\mathrm{de}})+(C^2B_\ell^2/n_{\mathrm{nu}})}\Big) \le \exp\!\Big(-\frac{2\alpha^2}{(B_\ell^2/n_{\mathrm{de}})+(C^2B_\ell^2/n_{\mathrm{nu}})}\Big),$$
where we used $0<\alpha<\beta-R$.
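The exponential tail $\exp\big(-2\alpha^2/((B_\ell^2/n_{\mathrm{de}})+(C^2B_\ell^2/n_{\mathrm{nu}}))\big)$ recurs throughout Lemmas 2 and 12–14. A small helper makes its behavior concrete; the constant values used in the example are illustrative assumptions.

```python
import math

def correction_bias_tail(n_de, n_nu, alpha, B_ell, C):
    """McDiarmid tail governing the bias of the non-negative correction.

    Altering one of the n_de denominator samples moves the corrected
    empirical term by at most B_ell / n_de; altering one of the n_nu
    numerator samples moves it by at most C * B_ell / n_nu.  McDiarmid's
    inequality then yields this exponential bound.
    """
    denom = B_ell ** 2 / n_de + C ** 2 * B_ell ** 2 / n_nu
    return math.exp(-2.0 * alpha ** 2 / denom)
```

The tail decays exponentially in $\min\{n_{\mathrm{de}}, n_{\mathrm{nu}}\}$, which is why the bias term $2\Phi^f_C$ in Theorem 4 is asymptotically negligible compared with the $O(n^{-1/2})$ Rademacher terms.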



Dataset links: CIFAR-10 (https://www.cs.toronto.edu/~kriz/cifar.html); MNIST, which has 10 classes from 0 to 9 (http://yann.lecun.com/exdb/mnist/); FMNIST, which has 10 classes (https://github.com/zalandoresearch/fashion-mnist); http://john.blitzer.com/software.html



Figure 1: Illustration of the train-loss hacking phenomenon. Given finitely many data points, a sufficiently flexible model $r$ can easily make the training loss, e.g., $\widehat{\mathrm{BR}}_{\mathrm{LSIF}}(r)$, diverge to $-\infty$.
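In code, the non-negative correction that prevents this divergence amounts to clipping the part of the empirical risk that can be driven to $-\infty$. The sketch below uses an LSIF-style decomposition with $\rho(x)=\max(x,0)$; the particular choices of $\ell_1$, $\ell_2$, and the role of the constant `C` are illustrative assumptions, not the paper's exact specification.

```python
import numpy as np

def nnbr_lsif(r_de, r_nu, C):
    """Non-negatively corrected empirical LSIF-type risk (a sketch).

    r_de, r_nu : model outputs r(x) on denominator / numerator samples.
    C          : constant tied to the assumed upper bound on the true
                 density ratio (illustrative decomposition).
    """
    ell1 = lambda t: 0.5 * t ** 2                    # ell_1 >= 0 (illustrative choice)
    ell2 = lambda t: C * 0.5 * t ** 2 - t + 1.0      # >= 0 whenever C >= 1/2
    e_mod = float(np.mean(ell1(r_de)) - C * np.mean(ell1(r_nu)))
    return max(e_mod, 0.0) + float(np.mean(ell2(r_nu)))   # rho = max(., 0)

def naive_lsif(r_de, r_nu):
    # Uncorrected empirical risk: drivable to -inf by inflating r on x_nu.
    return float(np.mean(0.5 * r_de ** 2) - np.mean(r_nu))
```

With $\ell_1,\ell_2\ge 0$ and $\rho=\max(\cdot,0)$, the corrected risk is bounded below by $0$, so a flexible model can no longer hack the training loss by inflating $r$ on the numerator samples.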

) and Ruff et al. (2020), LeNet and Wide ResNet, for D3RE. The details of the structures are shown in Appendix D. A part of the experimental results with CIFAR-10 and FMNIST is shown in Table 2 owing to space limitations. The full results are shown in Table

Figure 3: The learning curves of the experiments in Section 5. The horizontal axis is epoch. The vertical axes of the top figures indicate the training losses. The vertical axes of the bottom figures show the AUPRC for the test data. The bottom figures are identical to the ones displayed in Section 5.

Figure 4: Top figures: the detailed experimental results for Section G.1.1. Bottom figure: the detailed experimental results for Section G.1.2. The horizontal axis is epoch, and the vertical axis is AUROC.

Figure 5: Experimental results of Section 5 without the gradient ascent/descent heuristic. The horizontal axis is epoch, and the vertical axis is AUROC. The learning rates of the left and right graphs are 1 × 10⁻⁴ and 1 × 10⁻⁵, respectively. The upper graphs show the AUROCs, and the lower graphs show $\hat{\mathbb E}_{\mathrm{de}}[r(X)]$, which approaches 1 when the density ratio is estimated successfully.

Figure 6: The learning curves of the experiments in Section 5 without the gradient ascent/descent heuristic. The horizontal axis is epoch. The vertical axes of the top figures indicate the training losses. The vertical axes of the bottom figures show the AUPRC for the test data. The bottom figures are identical to the ones displayed in Section 5.

Figure 7: Top figures: the detailed experimental results for Section G.1.1 without gradient ascent/descent heuristic. Bottom figure: the detailed experimental results for Section G.1.2. The horizontal axis is epoch, and the vertical axis is AUROC.


K.2 BOUNDING THE EMPIRICAL DEVIATIONS

Lemma 10. Under the conditions of Theorem 2, for any $0<\gamma<2$, we have
$$\sup_{r\in\mathcal H}\big|(\mathbb E_{\mathrm{de}}-\hat{\mathbb E}_{\mathrm{de}})(\ell^r_1-\ell^{r^*}_1)\big| = O_P\Bigg(\max\Bigg\{\frac{\|r-r^*\|^{1-\gamma/2}_{L_2(p_{\mathrm{de}})}}{\sqrt{n_{\mathrm{de}}}},\ \frac{1}{n_{\mathrm{de}}^{2/(2+\gamma)}}\Bigg\}\Bigg), \qquad n_{\mathrm{de}}\to\infty,$$
and the analogous bounds hold for $(\mathbb E_{\mathrm{nu}}-\hat{\mathbb E}_{\mathrm{nu}})(\ell^r_1-\ell^{r^*}_1)$ and $(\mathbb E_{\mathrm{nu}}-\hat{\mathbb E}_{\mathrm{nu}})(\ell^r_2-\ell^{r^*}_2)$.

Proof. Since $0<\gamma<2$, we can apply Lemma 11 in combination with Lemma 9 to obtain
$$\sup_{r\in\mathcal H}\frac{\big|(\mathbb E_{\mathrm{de}}-\hat{\mathbb E}_{\mathrm{de}})(\ell^r_1-\ell^{r^*}_1)\big|}{D_1(r)} = O_P(1), \qquad \sup_{r\in\mathcal H}\frac{\big|(\mathbb E_{\mathrm{nu}}-\hat{\mathbb E}_{\mathrm{nu}})(\ell^r_1-\ell^{r^*}_1)\big|}{D_2(r)} = O_P(1),$$
and similarly for $\ell^r_2$, where $D_1$ and $D_2$ denote the normalizers $D(f)$ of Lemma 11 with $P = p_{\mathrm{de}}$ and $P = p_{\mathrm{nu}}$, respectively. Noting that $\sup_{r\in\mathcal H}I(r)<\infty$, that $\ell_1$ and $\ell_2$ are Lipschitz continuous, and that
$$\|r-r^*\|_{L_2(p_{\mathrm{nu}})} \le \Big(\sup_{x\in\mathcal X}\frac{p_{\mathrm{nu}}(x)}{p_{\mathrm{de}}(x)}\Big)^{1/2}\|r-r^*\|_{L_2(p_{\mathrm{de}})}$$
holds, we have the assertion.

Lemma 11 (Lemma 5.14 in van de Geer (2000); Proposition 1 in Kanamori et al. (2012)). Let $\mathcal F\subset L_2(P)$ be a function class and let $I(f)$ be a complexity measure of $f\in\mathcal F$, where $I$ is a non-negative function on $\mathcal F$ and $I(f_0)<\infty$ for a fixed $f_0\in\mathcal F$. We now define $\mathcal F_M = \{f\in\mathcal F : I(f)\le M\}$, satisfying $\mathcal F = \bigcup_{M\ge 1}\mathcal F_M$. Suppose that there exist $c_0>0$ and $0<\gamma<2$ such that
$$\sup_{f\in\mathcal F_M}\|f-f_0\|_{L_2(P)} \le c_0M, \qquad \sup_{\substack{f\in\mathcal F_M\\ \|f-f_0\|_{L_2(P)}\le\delta}}\|f-f_0\|_\infty \le c_0M \quad\text{for all }\delta>0,$$
and that $H_B(\delta,\mathcal F_M,P) = O\big((M/\delta)^\gamma\big)$. Then, we have
$$\sup_{f\in\mathcal F}\frac{\big|\int(f-f_0)\,d(P-P_n)\big|}{D(f)} = O_P(1), \qquad n\to\infty,$$
where $D(f)$ is defined by $D(f) := \max\big\{\|f-f_0\|^{1-\gamma/2}_{L_2(P)}\,n^{-1/2},\ n^{-2/(2+\gamma)}\big\}$ (up to constants depending on $I(f)$).

K.3 THE DIFFERENCE OF THE BR DIVERGENCE ESTIMATORS

Lemma 12. Assume $\mathfrak R^{p_{\mathrm{de}}}_{n_{\mathrm{de}}}(\mathcal H) = o(1)$ $(n_{\mathrm{de}}\to\infty)$ and $\mathfrak R^{p_{\mathrm{nu}}}_{n_{\mathrm{nu}}}(\mathcal H) = o(1)$ $(n_{\mathrm{nu}}\to\infty)$. Also assume the same conditions as Theorem 3. Then,
$$\sup_{r\in\mathcal H}\big|\widehat{\mathrm{nnBR}}_f(r)-\widehat{\mathrm{BR}}_f(r)\big| = O_P\Bigg(\exp\!\Big(-\frac{2\alpha^2}{(B_\ell^2/n_{\mathrm{de}})+(C^2B_\ell^2/n_{\mathrm{nu}})}\Big)\Bigg)$$
as $n_{\mathrm{nu}},n_{\mathrm{de}}\to\infty$.

$\mathbb E\sup_{r\in\mathcal H}\big|\widehat{\mathrm{nnBR}}_f(r)-\widehat{\mathrm{BR}}_f(r)\big| \le (1+C)B_\ell L_{\rho-\mathrm{Id}}\exp\!\big(-\tfrac{2\alpha^2}{(B_\ell^2/n_{\mathrm{de}})+(C^2B_\ell^2/n_{\mathrm{nu}})}\big)$ holds; the detailed derivation of this bound, asserted in Lemma 13, is as follows.

Proof. First, we have
\begin{align*}
\mathbb E\sup_{r\in\mathcal H}\big|\widehat{\mathrm{nnBR}}_f(r)-\widehat{\mathrm{BR}}_f(r)\big| &= \mathbb E\sup_{r\in\mathcal H}\big|\rho(\hat{\mathbb E}_{\mathrm{mod}}\ell_1(r(X)))-\hat{\mathbb E}_{\mathrm{mod}}\ell_1(r(X))\big| \\
&= \mathbb E\sup_{r\in\mathcal H}\Big[\mathbb 1\{\rho(\hat{\mathbb E}_{\mathrm{mod}}\ell_1(r(X)))\neq\hat{\mathbb E}_{\mathrm{mod}}\ell_1(r(X))\}\,\big|\rho(\hat{\mathbb E}_{\mathrm{mod}}\ell_1(r(X)))-\hat{\mathbb E}_{\mathrm{mod}}\ell_1(r(X))\big|\Big] \\
&\le \mathbb E\Big[\sup_{r\in\mathcal H}\mathbb 1\{\rho(\hat{\mathbb E}_{\mathrm{mod}}\ell_1(r(X)))\neq\hat{\mathbb E}_{\mathrm{mod}}\ell_1(r(X))\}\Big]\sup_{s:|s|\le(1+C)B_\ell}|\rho(s)-s|,
\end{align*}
where $\mathbb 1\{\cdot\}$ denotes the indicator function, and we used $|\hat{\mathbb E}_{\mathrm{mod}}\ell_1(r(X))|\le(1+C)B_\ell$. Further, we have
$$\sup_{s:|s|\le(1+C)B_\ell}|\rho(s)-s| \le \sup_{s:|s|\le(1+C)B_\ell}\big|(\rho-\mathrm{Id})(s)-(\rho-\mathrm{Id})(0)\big| + |(\rho-\mathrm{Id})(0)| \le \sup_{s:|s|\le(1+C)B_\ell}L_{\rho-\mathrm{Id}}|s-0| + 0 \le (1+C)B_\ell L_{\rho-\mathrm{Id}},$$
where $\mathrm{Id}$ denotes the identity function. On the other hand, since $\inf_{r\in\mathcal H}\mathbb E\,\hat{\mathbb E}_{\mathrm{mod}}\ell_1(r(X))>0$ is assumed, there exists $\beta>0$ such that $\mathbb E\,\hat{\mathbb E}_{\mathrm{mod}}\ell_1(r(X))>\beta$ for any $r\in\mathcal H$. Therefore, denoting the support of a function by $\mathrm{supp}(\cdot)$,
\begin{align*}
\mathbb E\sup_{r\in\mathcal H}\mathbb 1\{\rho(\hat{\mathbb E}_{\mathrm{mod}}\ell_1(r(X)))\neq\hat{\mathbb E}_{\mathrm{mod}}\ell_1(r(X))\} &= \mathbb E\sup_{r\in\mathcal H}\mathbb 1\{\hat{\mathbb E}_{\mathrm{mod}}\ell_1(r(X))\in\mathrm{supp}(\rho-\mathrm{Id})\} = \mathbb E\sup_{r\in\mathcal H}\mathbb 1\{\hat{\mathbb E}_{\mathrm{mod}}\ell_1(r(X))<0\} \\
&= \mathbb E\,\mathbb 1\{\exists r\in\mathcal H : \hat{\mathbb E}_{\mathrm{mod}}\ell_1(r(X))<0\} = P\big(\exists r\in\mathcal H : \hat{\mathbb E}_{\mathrm{mod}}\ell_1(r(X))<0\big) \\
&\le P\big(\exists r\in\mathcal H : \hat{\mathbb E}_{\mathrm{mod}}\ell_1(r(X))<\mathbb E\,\hat{\mathbb E}_{\mathrm{mod}}\ell_1(r(X))-\beta\big) \le P\Big(\beta<\sup_{r\in\mathcal H}\big(\mathbb E\,\hat{\mathbb E}_{\mathrm{mod}}\ell_1(r(X))-\hat{\mathbb E}_{\mathrm{mod}}\ell_1(r(X))\big)\Big),
\end{align*}
and Lemma 14 bounds this last probability by $\exp\big(-2\alpha^2/((B_\ell^2/n_{\mathrm{de}})+(C^2B_\ell^2/n_{\mathrm{nu}}))\big)$ for any $\alpha\in(0,\beta)$ and sufficiently large $n_{\mathrm{de}}$ and $n_{\mathrm{nu}}$, which completes the proof.


Given $n\in\mathbb N$ and a distribution $p$, we define the Rademacher complexity
$$\mathfrak R^p_n(\mathcal F) := \mathbb E_{\{X_i\}_{i=1}^n\sim p,\ \sigma}\sup_{f\in\mathcal F}\Big|\frac1n\sum_{i=1}^n\sigma_i f(X_i)\Big|,$$
where $\sigma_1,\dots,\sigma_n$ are independent uniform sign variables.

Average AUROC (mean) with the standard deviation (SD) over 5 trials of anomaly detection methods. For all datasets, each model was trained on a single class and tested against all other classes. The best score is shown in bold.

Experimental results (MSEs and SDs) of DRE using synthetic datasets.

Average PD (Mean) with standard deviation (SD) over 10 trials with different seeds per method. The best performing method in terms of the mean PD is specified by bold face.

Average PD (Mean) with standard deviation (SD) over 10 trials with different seeds per method. The best performing method in terms of the mean PD is specified by bold face.



Each row lists, for one class, the mean (SD) AUROC of the seven compared methods:

Class 0: 0.999 (0.000), 0.997 (0.000), 0.999 (0.000), 1.000 (0.000), 1.000 (0.000), 0.592 (0.051), 0.963 (0.002)
Class 1: 1.000 (0.000), 0.999 (0.000), 1.000 (0.000), 1.000 (0.000), 1.000 (0.000), 0.942 (0.016), 0.517 (0.039)
Class 2: 0.997 (0.001), 0.994 (0.000), 0.997 (0.001), 1.000 (0.000), 1.000 (0.001), 0.447 (0.027), 0.992 (0.001)
Class 3: 0.997 (0.000), 0.995 (0.001), 0.998 (0.000), 1.000 (0.000), 1.000 (0.000), 0.562 (0.035), 0.974 (0.001)
Class 4: 0.998 (0.000), 0.997 (0.001), 0.999 (0.000), 1.000 (0.000), 1.000 (0.000), 0.646 (0.015), 0.989 (0.001)
Class 5: 0.997 (0.000), 0.996 (0.001), 0.998 (0.000), 1.000 (0.000), 1.000 (0.000), 0.502 (0.046), 0.990 (0.001)
Class 6: 0.997 (0.001), 0.997 (0.001), 0.999 (0.000), 1.000 (0.000), 1.000 (0.000), 0.671 (0.027), 0.998 (0.000)
Class 7: 0.996 (0.001), 0.993 (0.001), 0.998 (0.001), 1.000 (0.000), 1.000 (0.001), 0.685 (0.032), 0.927 (0.004)
Class 8: 0.997 (0.000), 0.994 (0.001), 0.997 (0.000), 0.999 (0.000), 0.999 (0.000), 0.

G.3 EXPERIMENTS OF COVARIATE SHIFT ADAPTATION

In Table 5, we show the detailed results of the experiments on covariate shift adaptation. Even when the training data and the test data follow the same distribution, covariate shift adaptation based on D3RE improves the mean PD. We conjecture that this is because importance weighting emphasizes the loss in regions where the empirical density of the test examples is higher.

H OTHER APPLICATIONS

In this section, we explain other potential applications of the proposed method.

H.1 COVARIATE SHIFT ADAPTATION BY IMPORTANCE WEIGHTING

We consider training a model when the training input distribution differs from the test input distribution, a setting called covariate shift (Bickel et al., 2009). To solve this problem, the density ratio has been used via importance weighting (IW) (Shimodaira, 2000; Yamada et al., 2010; Reddi et al., 2015).
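A minimal sketch of IW with an estimated ratio follows; the optional clipping constant mirrors the upper-bound assumption used by D3RE, and all names and values here are illustrative assumptions.

```python
import numpy as np

def iw_risk(losses, ratio_hat, C=None):
    """Importance-weighted empirical risk: (1/n) sum_i r_hat(x_i) * loss_i.

    losses    : per-example training losses under the training distribution.
    ratio_hat : estimated density ratio p_test / p_train at each example.
    C         : optional upper bound on the ratio; clipping at C mirrors the
                boundedness assumption on the true ratio (illustrative).
    """
    w = np.asarray(ratio_hat, dtype=float)
    if C is not None:
        w = np.clip(w, 0.0, C)                     # cap the weights at the assumed bound
    return float(np.mean(w * np.asarray(losses, dtype=float)))
```

Minimizing this weighted risk over the training sample approximates minimizing the expected loss under the test input distribution, which is the standard IW correction for covariate shift.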

