NON-NEGATIVE BREGMAN DIVERGENCE MINIMIZATION FOR DEEP DIRECT DENSITY RATIO ESTIMATION

Anonymous

Abstract

This paper aims to estimate the ratio of probability densities using flexible models, such as state-of-the-art deep neural networks. Density ratio estimation (DRE) has garnered attention because the density ratio is useful in various machine learning tasks, such as anomaly detection and domain adaptation. For estimating the density ratio, methods collectively known as direct DRE have been explored. These methods are based on the minimization of the Bregman (BR) divergence between a density ratio model and the true density ratio. However, when flexible models such as deep neural networks are used, existing direct DRE methods suffer from serious train-loss hacking, a kind of over-fitting caused by the form of the empirical risk function. In this paper, we introduce a non-negative correction to the empirical risk that requires only prior knowledge of an upper bound of the density ratio. This correction makes DRE methods robust against train-loss hacking and thereby enables the use of flexible models, such as state-of-the-art deep neural networks. In the theoretical analysis, we establish a generalization error bound for BR divergence minimization. In our experiments, the proposed methods show favorable performance in inlier-based outlier detection and covariate shift adaptation.

1. INTRODUCTION

The density ratio estimation (DRE) problem has attracted a great deal of attention as an essential task in data science for its various industrial applications, such as domain adaptation (Shimodaira, 2000; Plank et al., 2014; Reddi et al., 2015), learning with noisy labels (Liu & Tao, 2014; Fang et al., 2020), anomaly detection (Smola et al., 2009; Hido et al., 2011; Abe & Sugiyama, 2019), two-sample testing (Keziou & Leoni-Aubin, 2005; Kanamori et al., 2010; Sugiyama et al., 2011a), causal inference (Kato et al., 2020), change point detection in time series (Kawahara & Sugiyama, 2009), and binary classification only from positive and unlabeled data (PU learning; Kato et al., 2019). For example, anomaly detection is hard to perform with standard machine learning methods such as binary classification because anomalous data is often scarce, but it can be solved by estimating the density ratio when anomaly-free training data and unlabeled test data are available (Hido et al., 2008). Among the various approaches to DRE, we focus on the Bregman (BR) divergence minimization framework (Bregman, 1967; Sugiyama et al., 2011b), which generalizes various DRE methods, e.g., moment matching (Huang et al., 2007; Gretton et al., 2009), probabilistic classification (Qin, 1998; Cheng & Chu, 2004), density matching (Nguyen et al., 2010; Yamada et al., 2010), and density-ratio fitting (Kanamori et al., 2009). Recently, Kato et al. (2019) also proposed using the risk of PU learning for DRE, which can likewise be generalized from the BR divergence minimization viewpoint, as we show below.
However, existing DRE methods mainly adopt a linear-in-parameter model for nonparametric DRE (Kanamori et al., 2012) and have rarely discussed the use of more flexible models, such as deep neural networks, even though recent developments in machine learning suggest that deep neural networks can significantly improve performance on various tasks, such as computer vision (Krizhevsky et al., 2012) and natural language processing (Bengio et al., 2001). This motivates us to use deep neural networks for DRE. However, existing DRE studies have not fully discussed using such state-of-the-art deep neural networks. For instance, although Nam & Sugiyama (2015) and Abe & Sugiyama (2019) proposed using neural networks for DRE, their networks are simple and shallow. When using deep neural networks in combination with empirical minimization of the BR divergence, we often observe a serious over-fitting problem, as demonstrated through the experiments in Figure 2 of Section 5. We hypothesize that this is mainly because the empirical BR divergence approximated from finite samples has no lower bound, i.e., it can be driven to an infinitely negative value during minimization. This hypothesis is based on Kiryo et al. (2017), which reports a similar problem in PU learning. While Kiryo et al. (2017) call this phenomenon over-fitting, we refer to it as train-loss hacking because its nuance differs slightly from the standard meaning of over-fitting. Here, we briefly introduce the train-loss hacking discussed in the PU learning literature (Kiryo et al., 2017). In a standard binary classification problem, we train a classifier $\psi$ by minimizing the following empirical risk using $\{(y_i, X_i)\}_{i=1}^{n}$:
$$\frac{1}{n}\sum_{i=1}^{n} \mathbb{1}[y_i = +1]\,\ell(\psi(X_i)) + \frac{1}{n}\sum_{i=1}^{n} \mathbb{1}[y_i = -1]\,\ell(-\psi(X_i)), \qquad (1)$$
where $y_i \in \{\pm 1\}$ is a binary label, $X_i$ is a feature, and $\ell$ is a loss function. On the other hand, in PU learning as formulated by du Plessis et al.
(2015), because we only have positive data $\{(y'_i = +1, X'_i)\}_{i=1}^{n'}$ and unlabeled data $\{X''_j\}_{j=1}^{n''}$, we minimize the following alternative empirical risk:
$$\frac{\pi}{n'}\sum_{i=1}^{n'} \ell(\psi(X'_i)) - \frac{\pi}{n'}\sum_{i=1}^{n'} \ell(-\psi(X'_i)) + \frac{1}{n''}\sum_{j=1}^{n''} \ell(-\psi(X''_j)), \qquad (2)$$
where $\pi$ is a hyper-parameter representing $p(y = +1)$. Note that the empirical risk (2) is unbiased for the population binary classification risk (1) (du Plessis et al., 2015). While the empirical risk (1) of standard binary classification is lower bounded under an appropriate choice of $\ell$, the empirical risk (2) of PU learning proposed by du Plessis et al. (2015) is not lower bounded, owing to the existence of the second term; this is the cause of train-loss hacking. Therefore, if a model is sufficiently flexible, we can make the empirical risk arbitrarily small by minimizing only the second term $-\frac{\pi}{n'}\sum_{i=1}^{n'} \ell(-\psi(X'_i))$ without increasing the other terms. Kiryo et al. (2017) proposed a non-negative risk correction to avoid this problem when using neural networks. We discuss this problem again in Section 2 and Figure 1. In the existing DRE literature, this train-loss hacking has rarely been discussed, although we often face it when using neural networks, as mentioned in Section 5. One reason is that existing methods assume a linear-in-parameter density ratio model (Kanamori et al., 2012), which is not as flexible as neural networks and does not cause this phenomenon. To mitigate train-loss hacking, we propose a general procedure that modifies the empirical BR divergence using prior knowledge of an upper bound of the density ratio. Our correction is inspired by Kiryo et al. (2017). However, their non-negative correction is only immediately applicable to binary classification; thus, a non-trivial rewriting of the BR divergence is required to generalize the approach to our problem.
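To make the mechanism concrete, the following is a minimal NumPy sketch (our own illustration, not the paper's implementation; the sigmoid loss and the toy inputs are assumptions) of the unbiased PU risk (2) and the non-negative correction of Kiryo et al. (2017), which clips the part of the risk whose population value is non-negative:

```python
import numpy as np

def sigmoid_loss(z):
    """Sigmoid loss l(z) = 1 / (1 + exp(z)); small when z is large."""
    return 1.0 / (1.0 + np.exp(z))

def pu_risk_unbiased(psi_pos, psi_unl, pi):
    """Unbiased PU risk, Eq. (2): pi/n' * sum l(psi(X')) - pi/n' * sum l(-psi(X'))
    + 1/n'' * sum l(-psi(X'')).  Not lower bounded for flexible models."""
    pos_term = pi * np.mean(sigmoid_loss(psi_pos))
    neg_term = np.mean(sigmoid_loss(-psi_unl)) - pi * np.mean(sigmoid_loss(-psi_pos))
    return pos_term + neg_term

def pu_risk_nn(psi_pos, psi_unl, pi):
    """Non-negative correction (Kiryo et al., 2017): the bracketed negative-class
    term is non-negative in the population, so clip its empirical value at zero."""
    pos_term = pi * np.mean(sigmoid_loss(psi_pos))
    neg_term = np.mean(sigmoid_loss(-psi_unl)) - pi * np.mean(sigmoid_loss(-psi_pos))
    return pos_term + max(0.0, neg_term)
```

If the model outputs large positive scores on both positive and unlabeled points, the unbiased risk becomes negative (train-loss hacking), while the corrected risk stays non-negative.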
We call the proposed empirical risk the non-negative BR (nnBR) divergence; it generalizes the method of Kiryo et al. (2017). In addition, for a special case of DRE, a lower-bounded loss can still be used (see the bounded uLSIF and BKL introduced in the following section). However, such a loss also suffers from train-loss hacking (bounded uLSIF in Figure 2 and BKL-NN in Figure 4). In this case, train-loss hacking occurs because the loss sticks to its lower bound. This type of train-loss hacking is also avoided by the proposed nnBR divergence. Our main contributions are: (1) a general procedure to modify a BR divergence so that DRE can be performed with flexible models; (2) theoretical justification of the proposed estimator; and (3) experimental validation of the proposed method on benchmark data.
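As an illustration of the clipping idea for DRE itself (a sketch of our own, not the paper's exact nnBR risk, which is derived in later sections), consider least-squares density-ratio fitting, whose empirical risk $\frac{1}{2n_{\mathrm{de}}}\sum_i r(X^{\mathrm{de}}_i)^2 - \frac{1}{n_{\mathrm{nu}}}\sum_i r(X^{\mathrm{nu}}_i)$ is unbounded below in $r$. Assuming a known upper bound $C \geq \sup_x r^*(x)$, the term $C\,\mathbb{E}_{\mathrm{de}}[r] - \mathbb{E}_{\mathrm{nu}}[r]$ is non-negative in the population (since $\mathbb{E}_{\mathrm{nu}}[r] = \mathbb{E}_{\mathrm{de}}[r^* r] \leq C\,\mathbb{E}_{\mathrm{de}}[r]$ for $r \geq 0$), so its empirical counterpart can be clipped at zero:

```python
import numpy as np

def ulsif_risk(r_de, r_nu):
    """Empirical least-squares density-ratio fitting risk:
    0.5 * E_de[r^2] - E_nu[r].  Unbounded below, since inflating r
    on numerator samples alone drives the risk to -infinity."""
    return 0.5 * np.mean(r_de**2) - np.mean(r_nu)

def ulsif_risk_nn(r_de, r_nu, C):
    """Illustrative non-negative correction assuming r*(x) <= C and r >= 0.
    Rewrite the risk as E_de[r^2/2 - C*r] + (C*E_de[r] - E_nu[r]) and clip
    the second bracket, whose population value is non-negative, at zero.
    The result is bounded below by -C^2/2 pointwise."""
    bounded_part = np.mean(0.5 * r_de**2 - C * r_de)
    clipped_part = max(0.0, C * np.mean(r_de) - np.mean(r_nu))
    return bounded_part + clipped_part
```

Inflating the model's outputs on numerator samples makes the uncorrected risk arbitrarily negative, whereas the clipped risk cannot fall below $-C^2/2$.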

2. PROBLEM SETTING

Let $\mathcal{X}^{\mathrm{nu}} \subseteq \mathbb{R}^d$ and $\mathcal{X}^{\mathrm{de}} \subseteq \mathbb{R}^d$ be the spaces of the $d$-dimensional covariates $X^{\mathrm{nu}} \sim p_{\mathrm{nu}}(X)$ and $X^{\mathrm{de}} \sim p_{\mathrm{de}}(X)$, where $p_{\mathrm{nu}}(X)$ and $p_{\mathrm{de}}(X)$ are probability densities over $\mathcal{X}^{\mathrm{nu}}$ and $\mathcal{X}^{\mathrm{de}}$. Suppose that we observe two sets of samples $\{X^{\mathrm{nu}}_i\}_{i=1}^{n_{\mathrm{nu}}}$ and $\{X^{\mathrm{de}}_i\}_{i=1}^{n_{\mathrm{de}}}$, respectively, which are independent and identically distributed (i.i.d.) as $\{X^{\mathrm{nu}}_i\}_{i=1}^{n_{\mathrm{nu}}} \overset{\mathrm{i.i.d.}}{\sim} p_{\mathrm{nu}}(X)$ and $\{X^{\mathrm{de}}_i\}_{i=1}^{n_{\mathrm{de}}} \overset{\mathrm{i.i.d.}}{\sim} p_{\mathrm{de}}(X)$.
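As a toy instance of this setting (our own example; the one-dimensional Gaussian densities are an assumption for illustration), one can generate the two samples and evaluate the true ratio $r^*(x) = p_{\mathrm{nu}}(x)/p_{\mathrm{de}}(x)$ in closed form:

```python
import numpy as np

rng = np.random.default_rng(0)

# Numerator and denominator densities: p_nu = N(0, 1), p_de = N(1, 1).
x_nu = rng.normal(loc=0.0, scale=1.0, size=500)  # X_i^nu ~ p_nu
x_de = rng.normal(loc=1.0, scale=1.0, size=500)  # X_i^de ~ p_de

def true_ratio(x):
    """r*(x) = p_nu(x) / p_de(x) for the two Gaussians above.
    With equal variances the normalizers cancel and the ratio
    simplifies to exp(-(x^2 - (x - 1)^2) / 2) = exp(-x + 1/2)."""
    return np.exp(-x + 0.5)
```

Note that this $r^*$ is bounded above only on domains bounded from below, which is why an assumed upper bound on the density ratio (as used by the proposed correction) is a restriction on the problem setting.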

