ROBUST LEARNING OF FIXED-STRUCTURE BAYESIAN NETWORKS IN NEARLY-LINEAR TIME

Abstract

We study the problem of learning Bayesian networks where an ε-fraction of the samples are adversarially corrupted. We focus on the fully-observable case where the underlying graph structure is known. In this work, we present the first nearly-linear time algorithm for this problem with a dimension-independent error guarantee. Previous robust algorithms with comparable error guarantees are slower by at least a factor of Ω(d/ε), where d is the number of variables in the Bayesian network and ε is the fraction of corrupted samples. Our algorithm and analysis are considerably simpler than those in previous work. We achieve this by establishing a direct connection between robust learning of Bayesian networks and robust mean estimation. As a subroutine in our algorithm, we develop a robust mean estimation algorithm whose runtime is nearly-linear in the number of nonzeros in the input samples, which may be of independent interest.

1. INTRODUCTION

Probabilistic graphical models (Koller & Friedman, 2009) offer an elegant and succinct way to represent structured high-dimensional distributions. Inference and learning in probabilistic graphical models are important problems that arise in many disciplines (see Wainwright & Jordan (2008) and the references therein) and have been studied extensively over the past decades (see, e.g., Chow & Liu (1968); Dasgupta (1997); Abbeel et al. (2006); Wainwright et al. (2006); Anandkumar et al. (2012); Santhanam & Wainwright (2012); Loh & Wainwright (2012); Bresler et al. (2013; 2014); Bresler (2015)). Bayesian networks (Jensen & Nielsen, 2007) are an important family of probabilistic graphical models that represent conditional dependence by a directed graph (see Section 2 for a formal definition). In this paper, we study the problem of learning Bayesian networks where an ε-fraction of the samples are adversarially corrupted. We focus on the simplest setting: all variables are binary and observable, and the structure of the Bayesian network is given to the algorithm. Formally, we work with the following corruption model:

Definition 1.1 (ε-Corrupted Set of Samples). Given 0 < ε < 1/2 and a distribution family P on R^d, the algorithm first specifies the number of samples N, and N samples X_1, X_2, ..., X_N are drawn from some unknown P ∈ P. The adversary inspects the samples, the ground-truth distribution P, and the algorithm, and then replaces εN samples with arbitrary points. The resulting set of N points is given to the algorithm as input. We say that a set of samples is ε-corrupted if it is generated by this process.

This is a strong corruption model that generalizes many existing models. In particular, it is stronger than Huber's contamination model (Huber, 1964), because we allow the adversary to both add bad samples and remove good samples, and to do so adaptively.
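To make Definition 1.1 concrete, the following sketch simulates one (non-adaptive) instance of the corruption process. A true adversary may inspect the clean samples and the algorithm before choosing replacements; even this simple heuristic of planting an ε-fraction of identical far-away points already biases the naive empirical estimator. The function name and the all-ones replacement points are illustrative choices, not part of the paper.

```python
import numpy as np

def eps_corrupt(samples, eps, seed=None):
    """Replace an eps-fraction of the rows of `samples` with arbitrary points,
    as in Definition 1.1. This toy adversary plants all-ones points; the real
    adversary may choose its replacements adaptively."""
    rng = np.random.default_rng(seed)
    n = samples.shape[0]
    corrupted = samples.copy()
    idx = rng.choice(n, size=int(eps * n), replace=False)  # rows to replace
    corrupted[idx] = 1.0  # arbitrary replacement points
    return corrupted

# Clean samples from a binary product distribution with mean 0.1 per coordinate.
rng = np.random.default_rng(0)
clean = (rng.random((10000, 20)) < 0.1).astype(float)
bad = eps_corrupt(clean, eps=0.1, seed=1)
# The overall empirical mean shifts by roughly eps * (1 - 0.1) = 0.09.
print(clean.mean(), bad.mean())
```

This is why the empirical estimator fails under corruption: a single planted cluster moves every coordinate's empirical mean by Θ(ε).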
Our goal is to design robust algorithms for learning Bayesian networks with dimension-independent error. More specifically, given as input an ε-corrupted set of samples drawn from some ground-truth Bayesian network P and the graph structure of P, we want the algorithm to output a Bayesian network Q such that the total variation distance between P and Q is upper bounded by a function that depends only on ε (the fraction of corruption) but not on d (the number of variables in P).

In the fully-observable fixed-structure setting, the problem is straightforward when there is no corruption: the empirical estimator (which computes the empirical conditional probabilities) is sample-efficient and runs in linear time (Dasgupta, 1997). The problem becomes much more challenging when there is corruption. Even for robust learning of binary product distributions (i.e., Bayesian networks with an empty dependency graph), the first computationally efficient algorithms with dimension-independent error were only discovered in (Diakonikolas et al., 2019a). Subsequently, (Cheng et al., 2018) gave the first polynomial-time algorithms for robust learning of fixed-structure Bayesian networks. The main drawback of the algorithm in (Cheng et al., 2018) is that it runs in time Ω(Nd²/ε), which is slower by at least a factor of Ω(d/ε) than the fastest non-robust estimator. Motivated by this gap in the running time, in this work we aim to resolve the following question:

Can we design a robust algorithm for learning Bayesian networks in the fixed-structure fully-observable setting that runs in nearly-linear time?
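For reference, the non-robust empirical estimator mentioned above is simple to state. The sketch below is an illustrative rendering (the `parents` list and dict-based conditional probability tables are our own representation, not the paper's): each entry of the conditional probability table is estimated by an empirical conditional frequency.

```python
import itertools
import numpy as np

def empirical_bayes_net(samples, parents):
    """Non-robust empirical estimator for a fixed-structure Bayesian network
    over binary variables. For each node i and each configuration of its
    parents, estimate P(X_i = 1 | parent configuration) by the empirical
    conditional frequency. `parents[i]` is a tuple of parent indices."""
    n, d = samples.shape
    cpt = []
    for i in range(d):
        pa = list(parents[i])
        table = {}
        for cfg in itertools.product([0, 1], repeat=len(pa)):
            if pa:
                mask = np.all(samples[:, pa] == np.array(cfg), axis=1)
            else:
                mask = np.ones(n, dtype=bool)
            # Empirical conditional probability; fall back to 1/2 if this
            # parental configuration never occurs in the samples.
            table[cfg] = samples[mask, i].mean() if mask.any() else 0.5
        cpt.append(table)
    return cpt

# Example: a two-node chain X0 -> X1 with P(X0=1) = 0.3,
# P(X1=1 | X0=1) = 0.8, P(X1=1 | X0=0) = 0.2.
rng = np.random.default_rng(0)
x0 = (rng.random(20000) < 0.3).astype(float)
u = rng.random(20000)
x1 = np.where(x0 == 1, u < 0.8, u < 0.2).astype(float)
cpt = empirical_bayes_net(np.stack([x0, x1], axis=1), parents=[(), (0,)])
print(cpt[1][(1,)])  # close to 0.8
```

With an empty dependency graph (all `parents[i] = ()`) this reduces to estimating the mean of a product distribution, which an adversarial ε-fraction of samples can bias enough to incur dimension-dependent total variation error, hence the need for robust estimators.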

1.1. OUR RESULTS AND CONTRIBUTIONS

We resolve this question affirmatively by proving Theorem 1.2. We say a Bayesian network is c-balanced if all its conditional probabilities are between c and 1 − c. For the ground-truth Bayesian network P, let m be the size of its conditional probability table and α be its minimum parental configuration probability (see Section 2 for formal definitions).

Theorem 1.2 (informal statement). Consider an ε-corrupted set of N = Ω̃(m/ε²) samples drawn from a d-dimensional Bayesian network P. Suppose P is c-balanced and has minimum parental configuration probability α, where both c and α are universal constants. We can compute a Bayesian network Q in time Õ(Nd) such that d_TV(P, Q) ≤ O(ε√(ln(1/ε))).¹

For simplicity, we stated our result in the very special case where both c and α are Ω(1). Our approach works for general values of α and c, and our error guarantee degrades gracefully as α and c get smaller. A formal version of Theorem 1.2 is given as Theorem 4.1 in Section 4.

Our algorithm has optimal error guarantee, sample complexity, and running time (up to logarithmic factors). There is an information-theoretic lower bound of Ω(ε) on the error guarantee, which holds even for Bayesian networks with only one variable. A sample-complexity lower bound of Ω(m/ε²) holds even without corruption (see, e.g., (Canonne et al., 2017)).

Our Contributions. We establish a novel connection between robust learning of Bayesian networks and robust mean estimation. At a high level, we show that one can essentially reduce the former to the latter. This allows us to take advantage of recent (and future) advances in robust mean estimation and apply those algorithms almost directly to obtain new algorithms for learning Bayesian networks. Our algorithm and analysis are considerably simpler than those in previous work. For simplicity, consider learning binary product distributions as an example. Cheng et al.
(2018) tried to remove samples so as to make the empirical covariance matrix closer to a diagonal matrix (the true covariance matrix is diagonal because the coordinates are independent). They used a "filtering" approach, which requires proving specific tail bounds on the samples. In contrast, we show that it suffices to use any robust mean estimation algorithm that minimizes the spectral norm of the empirical covariance matrix (regardless of whether it is close to diagonal).

As a subroutine in our approach, we develop the first robust mean estimation algorithm that runs in nearly input-sparsity time (i.e., in time nearly linear in the total number of nonzero entries in the input), which may be of independent interest. The main computational bottleneck of current nearly-linear time robust mean estimation algorithms (Cheng et al., 2019a; Depersin & Lecué, 2019; Dong et al., 2019) is running matrix multiplicative weights updates with the Johnson-Lindenstrauss lemma, which we show can be done in nearly input-sparsity time.
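The role that the spectral norm of the empirical covariance plays can be illustrated with a bare-bones spectral filter for robust mean estimation. The sketch below is a simplified textbook-style variant, not the paper's nearly-linear-time algorithm (which uses matrix multiplicative weights with Johnson-Lindenstrauss sketching); the threshold and removal schedule are arbitrary illustrative choices. The invariant it drives toward is the one used in our reduction: the retained samples have a covariance matrix with small spectral norm.

```python
import numpy as np

def filter_mean(samples, eps, iters=20, spectral_threshold=2.0):
    """Simplified spectral filter for robust mean estimation. Repeatedly find
    the direction of largest empirical variance and discard the points that
    deviate most along it, until the spectral norm of the empirical
    covariance is small. Thresholds here are illustrative, not tuned."""
    X = samples.copy()
    for _ in range(iters):
        mu = X.mean(axis=0)
        cov = np.cov(X, rowvar=False)
        eigvals, eigvecs = np.linalg.eigh(cov)     # eigenvalues in ascending order
        if eigvals[-1] <= spectral_threshold:
            break                                  # covariance is spectrally small
        v = eigvecs[:, -1]                         # top eigenvector
        scores = ((X - mu) @ v) ** 2               # squared deviation along v
        X = X[scores <= np.quantile(scores, 1 - eps / 4)]  # drop biggest outliers
    return X.mean(axis=0)
```

On a clean standard Gaussian sample in 10 dimensions with a planted 10%-fraction of outliers centered at 10·e₁, the naive empirical mean is off by about 0.9 in the first coordinate, while the filtered mean lands close to the true mean, even though no per-coordinate tail bounds were used.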



¹Throughout the paper, we use Õ(f) to denote O(f · polylog(f)), and similarly Ω̃(f) to denote Ω(f / polylog(f)).

