NON-NEGATIVE BREGMAN DIVERGENCE MINIMIZATION FOR DEEP DIRECT DENSITY RATIO ESTIMATION

Anonymous

Abstract

This paper aims to estimate the ratio of probability densities using flexible models, such as state-of-the-art deep neural networks. Density ratio estimation (DRE) has garnered attention because the density ratio is useful in various machine learning tasks, such as anomaly detection and domain adaptation. For estimating the density ratio, methods collectively known as direct DRE have been explored. These methods are based on the minimization of the Bregman (BR) divergence between a density ratio model and the true density ratio. However, when flexible models such as deep neural networks are used, existing direct DRE methods suffer from serious train-loss hacking, a form of overfitting caused by the form of the empirical risk function. In this paper, we introduce a non-negative correction to the empirical risk that requires only prior knowledge of an upper bound on the density ratio. This correction makes a DRE method robust against train-loss hacking and thereby enables the use of flexible models, such as state-of-the-art deep neural networks. In our theoretical analysis, we establish a generalization error bound for BR divergence minimization. In our experiments, the proposed methods show favorable performance in inlier-based outlier detection and covariate shift adaptation.

1. INTRODUCTION

The density ratio estimation (DRE) problem has attracted a great deal of attention as an essential task in data science owing to its various industrial applications, such as domain adaptation (Shimodaira, 2000; Plank et al., 2014; Reddi et al., 2015), learning with noisy labels (Liu & Tao, 2014; Fang et al., 2020), anomaly detection (Smola et al., 2009; Hido et al., 2011; Abe & Sugiyama, 2019), two-sample testing (Keziou & Leoni-Aubin, 2005; Kanamori et al., 2010; Sugiyama et al., 2011a), causal inference (Kato et al., 2020), change point detection in time series (Kawahara & Sugiyama, 2009), and binary classification only from positive and unlabeled data (PU learning; Kato et al., 2019). For example, anomaly detection is difficult to perform with standard machine learning methods such as binary classification because anomalous data is often scarce, but it can be solved by estimating the density ratio when anomaly-free training data and unlabeled test data are available (Hido et al., 2008). Among the various approaches to DRE, we focus on the Bregman (BR) divergence minimization framework (Bregman, 1967; Sugiyama et al., 2011b), which generalizes various DRE methods, e.g., moment matching (Huang et al., 2007; Gretton et al., 2009), probabilistic classification (Qin, 1998; Cheng & Chu, 2004), density matching (Nguyen et al., 2010; Yamada et al., 2010), and density-ratio fitting (Kanamori et al., 2009). Recently, Kato et al. (2019) also proposed using the risk of PU learning for DRE, which can also be generalized from the BR divergence minimization viewpoint, as we show below.
However, existing DRE methods mainly adopt linear-in-parameter models for nonparametric DRE (Kanamori et al., 2012) and have rarely discussed the use of more flexible models, such as deep neural networks, even though recent developments in machine learning suggest that deep neural networks can significantly improve performance on various tasks, such as computer vision (Krizhevsky et al., 2012) and natural language processing (Bengio et al., 2001). This motivates us to use deep neural networks for DRE. However, existing DRE studies have not fully explored such state-of-the-art deep neural networks. For instance, although Nam & Sugiyama (2015) and Abe & Sugiyama (2019) proposed using neural networks for DRE, their networks are simple and shallow.
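To illustrate both the train-loss hacking problem and the flavor of the non-negative correction described in the abstract, the sketch below applies a clipping correction, in the spirit of non-negative PU learning, to the least-squares (LSIF) empirical risk. It uses the inequality E_nu[r] <= C * E_de[r], which holds for a non-negative model r when the true ratio is bounded by C. This is an illustrative reconstruction under stated assumptions: the constant `C` is a hypothetical upper bound, and the exact corrected risk used in this paper may differ.

```python
import numpy as np

def lsif_risk(r_nu, r_de):
    """Plain empirical LSIF risk: (1/2) E_de[r^2] - E_nu[r].
    The -E_nu[r] term is unbounded below, so a sufficiently flexible model
    can drive the training loss toward -inf (train-loss hacking)."""
    return 0.5 * np.mean(r_de ** 2) - np.mean(r_nu)

def nn_lsif_risk(r_nu, r_de, C=10.0):
    """Non-negative variant (a sketch): rewrite the risk using
    -E_nu[r] = -C E_de[r] + (C E_de[r] - E_nu[r]) and clip the second
    term at zero, since it is non-negative at the true ratio when r* <= C."""
    return (0.5 * np.mean(r_de ** 2) - C * np.mean(r_de)
            + max(0.0, C * np.mean(r_de) - np.mean(r_nu)))

# When the clipped term is inactive (non-negative), both risks coincide.
r_nu = np.array([1.0, 2.0])
r_de = np.array([0.5, 1.0])
assert np.isclose(nn_lsif_risk(r_nu, r_de, C=10.0), lsif_risk(r_nu, r_de))
```

When the model starts to hack the loss by inflating its outputs on numerator samples, the clipped term activates and the corrected risk stays bounded below, whereas the plain risk keeps decreasing without bound.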

