ROBUST LEARNING OF FIXED-STRUCTURE BAYESIAN NETWORKS IN NEARLY-LINEAR TIME

Abstract

We study the problem of learning Bayesian networks where an ε-fraction of the samples are adversarially corrupted. We focus on the fully-observable case where the underlying graph structure is known. In this work, we present the first nearly-linear time algorithm for this problem with a dimension-independent error guarantee. Previous robust algorithms with comparable error guarantees are slower by at least a factor of Ω̃(d/ε), where d is the number of variables in the Bayesian network and ε is the fraction of corrupted samples. Our algorithm and analysis are considerably simpler than those in previous work. We achieve this by establishing a direct connection between robust learning of Bayesian networks and robust mean estimation. As a subroutine in our algorithm, we develop a robust mean estimation algorithm whose runtime is nearly-linear in the number of nonzeros in the input samples, which may be of independent interest.

1. INTRODUCTION

Probabilistic graphical models (Koller & Friedman, 2009) offer an elegant and succinct way to represent structured high-dimensional distributions. The problem of inference and learning in probabilistic graphical models is an important problem that arises in many disciplines (see Wainwright & Jordan (2008) and the references therein), and it has been studied extensively during the past decades (see, e.g., Chow & Liu (1968); Dasgupta (1997); Abbeel et al. (2006); Wainwright et al. (2006); Anandkumar et al. (2012); Santhanam & Wainwright (2012); Loh & Wainwright (2012); Bresler et al. (2013; 2014); Bresler (2015)). Bayesian networks (Jensen & Nielsen, 2007) are an important family of probabilistic graphical models that represent conditional dependence by a directed graph (see Section 2 for a formal definition). In this paper, we study the problem of learning Bayesian networks where an ε-fraction of the samples are adversarially corrupted. We focus on the simplest setting: all variables are binary and observable, and the structure of the Bayesian network is given to the algorithm. Formally, we work with the following corruption model:

Definition 1.1 (ε-Corrupted Set of Samples). Given 0 < ε < 1/2 and a distribution family P on R^d, the algorithm first specifies the number of samples N, and N samples X_1, X_2, . . . , X_N are drawn from some unknown P ∈ P. The adversary inspects the samples, the ground-truth distribution P, and the algorithm, and then replaces εN samples with arbitrary points. The resulting set of N points is given to the algorithm as input. We say that a set of samples is ε-corrupted if it is generated by this process.

This is a strong corruption model that generalizes many existing models. In particular, it is stronger than Huber's contamination model (Huber, 1964), because we allow the adversary to both add bad samples and remove good samples, and it can do so adaptively.
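To make the corruption model concrete, here is a minimal numpy sketch (our own illustration, not from the paper) of one simple adversary: it replaces an ε-fraction of samples from a binary product distribution with a fixed point, which biases the empirical mean. A real adversary is stronger, since it may inspect the good samples and act adaptively.

```python
import numpy as np

def eps_corrupt(samples, eps, rng):
    """Replace an eps-fraction of rows with adversarial points (toy adversary).

    This adversary moves every corrupted point to the all-ones vector, a
    classic way to bias the empirical mean of a binary product distribution."""
    corrupted = samples.copy()
    n_bad = int(eps * len(samples))
    idx = rng.choice(len(samples), size=n_bad, replace=False)
    corrupted[idx] = 1.0
    return corrupted

rng = np.random.default_rng(0)
N, d, eps = 10000, 20, 0.1
good = rng.binomial(1, 0.5, size=(N, d)).astype(float)  # product distribution, p = 1/2
bad = eps_corrupt(good, eps, rng)
# Each coordinate of the empirical mean is shifted by roughly eps * (1 - 1/2).
shift = np.abs(bad.mean(axis=0) - 0.5).max()
```

Note that a coordinate-wise mean shift of Θ(ε) translates into an ℓ₂ error of Θ(ε√d), which is exactly the dimension dependence robust algorithms aim to avoid.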
Our goal is to design robust algorithms for learning Bayesian networks with dimension-independent error. More specifically, given as input an ε-corrupted set of samples drawn from some ground-truth Bayesian network P and the graph structure of P, we want the algorithm to output a Bayesian network Q such that the total variation distance between P and Q is upper bounded by a function that depends only on ε (the fraction of corruption) but not on d (the number of variables in P). In the fully-observable fixed-structure setting, the problem is straightforward when there is no corruption: the empirical estimator (which computes the empirical conditional probabilities) is sample-efficient and runs in linear time (Dasgupta, 1997). The problem becomes much more challenging in the presence of corruption. Even for robust learning of binary product distributions (i.e., Bayesian networks with an empty dependency graph), the first computationally efficient algorithms with dimension-independent error were only discovered in (Diakonikolas et al., 2019a). Subsequently, (Cheng et al., 2018) gave the first polynomial-time algorithms for robust learning of fixed-structure Bayesian networks. The main drawback of the algorithm in (Cheng et al., 2018) is that it runs in time Ω̃(Nd²/ε), which is slower by at least a factor of Ω̃(d/ε) compared to the fastest non-robust estimator. Motivated by this gap in the running time, in this work we aim to resolve the following question: Can we design a robust algorithm for learning Bayesian networks in the fixed-structure fully-observable setting that runs in nearly-linear time?

1.1. OUR RESULTS AND CONTRIBUTIONS

We resolve this question affirmatively by proving Theorem 1.2. We say a Bayesian network is c-balanced if all its conditional probabilities are between c and 1 − c. For the ground-truth Bayesian network P, let m be the size of its conditional probability table and α be its minimum parental configuration probability (see Section 2 for formal definitions).

Theorem 1.2 (informal statement). Consider an ε-corrupted set of N = Ω̃(m/ε²) samples drawn from a d-dimensional Bayesian network P. Suppose P is c-balanced and has minimum parental configuration probability α, where both c and α are universal constants. We can compute a Bayesian network Q in time Õ(Nd) such that d_TV(P, Q) ≤ O(ε√(ln(1/ε))).

For simplicity, we stated our result in the very special case where both c and α are Ω(1). Our approach works for general values of α and c, and our error guarantee degrades gracefully as α and c get smaller. A formal version of Theorem 1.2 is given as Theorem 4.1 in Section 4. Our algorithm has optimal error guarantee, sample complexity, and running time (up to logarithmic factors): there is an information-theoretic lower bound of Ω(ε) on the error guarantee, which holds even for Bayesian networks with only one variable, and a sample complexity lower bound of Ω(m/ε²) holds even without corruption (see, e.g., (Canonne et al., 2017)).

Our Contributions. We establish a novel connection between robust learning of Bayesian networks and robust mean estimation. At a high level, we show that one can essentially reduce the former to the latter. This allows us to take advantage of recent (and future) advances in robust mean estimation and apply those algorithms almost directly to obtain new algorithms for learning Bayesian networks. Our algorithm and analysis are considerably simpler than those in previous work. For simplicity, consider learning binary product distributions as an example. Cheng et al.
(2018) tried to remove samples to make the empirical covariance matrix closer to a diagonal matrix (the true covariance matrix is diagonal because the coordinates are independent). They used a "filtering" approach, which requires proving specific tail bounds on the samples. In contrast, we show that it suffices to use any robust mean estimation algorithm that minimizes the spectral norm of the empirical covariance matrix (regardless of whether the covariance is close to diagonal or not). As a subroutine in our approach, we develop the first robust mean estimation algorithm that runs in nearly input-sparsity time (i.e., in time nearly linear in the total number of nonzero entries in the input), which may be of independent interest. The main computational bottleneck of current nearly-linear time robust mean estimation algorithms (Cheng et al., 2019a; Depersin & Lecué, 2019; Dong et al., 2019) is running the matrix multiplicative weights update with the Johnson-Lindenstrauss lemma, which we show can be done in nearly input-sparsity time.

1.2. RELATED WORK

Bayesian Networks. Probabilistic graphical models (Koller & Friedman, 2009) provide an appealing and unifying formalism to succinctly represent structured high-dimensional distributions. The general problem of inference in graphical models is of fundamental importance and arises in many applications across several scientific disciplines (see Wainwright & Jordan (2008) and references therein). The problem of learning graphical models from data (Neapolitan, 2003; Daly et al., 2011) has many variants: (i) the family of graphical models (e.g., directed, undirected), (ii) whether the data is fully or partially observable, and (iii) whether the graph structure is known or not. This learning problem has been studied extensively (see, e.g., Chow & Liu (1968); Dasgupta (1997); Abbeel et al. (2006); Wainwright et al. (2006); Anandkumar et al. (2012); Santhanam & Wainwright (2012); Loh & Wainwright (2012); Bresler et al. (2013; 2014); Bresler (2015)), resulting in a beautiful theory and a collection of algorithms in various settings.

Robust Statistics. Learning in the presence of outliers has been studied since the 1960s (Huber, 1964). For the most basic problem of robust mean estimation, it is well known that the empirical median works in one dimension. However, most natural generalizations of the median to high dimensions (e.g., coordinate-wise median, geometric median) incur an error of Ω(ε√d), even in the infinite-sample regime (see, e.g., Diakonikolas et al. (2019a); Lai et al. (2016)). After decades of work, sample-efficient robust estimators have been discovered (e.g., the Tukey median (Tukey, 1975; Devroye & Györfi, 1985; Chen et al., 2018)). However, the Tukey median is NP-hard to compute in the worst case (Johnson & Preparata, 1978; Amaldi & Kann, 1995), and many heuristics for approximating it perform poorly as the dimension scales (Clarkson et al., 1993; Chan, 2004; Miller & Sheehy, 2010).

Computationally Efficient Robust Estimators.
Recent work (Diakonikolas et al., 2019a; Lai et al., 2016) gave the first polynomial-time algorithms for several high-dimensional unsupervised learning tasks (e.g., mean and covariance estimation) with dimension-independent error guarantees. After the dissemination of (Diakonikolas et al., 2019a; Lai et al., 2016), algorithmic high-dimensional robust statistics has attracted a lot of recent attention, and there has been a flurry of research that obtained polynomial-time robust algorithms for a wide range of machine learning and statistical tasks (see, e.g., Balakrishnan et al. (2017) and the references therein). In particular, the most relevant prior work is (Cheng et al., 2018), which gave the first polynomial-time algorithms for robust learning of fixed-structure Bayesian networks.

Faster Robust Estimators. While recent work gave polynomial-time robust algorithms for many tasks, these algorithms are often significantly slower than the fastest non-robust ones (e.g., the sample average for mean estimation). Cheng et al. (2019a) gave the first nearly-linear time algorithm for robust mean estimation and initiated the research direction of designing robust estimators that are as efficient as their non-robust counterparts. Since then, there have been several works that develop faster robust algorithms for various learning and statistical tasks, including robust mean estimation for heavy-tailed distributions Dong et al. (2019); Depersin & Lecué (2019), robust covariance estimation Cheng et al. (2019b); Li & Ye (2020), robust linear regression Cherapanamjeri et al. (2020a), and list-decodable mean estimation Cherapanamjeri et al. (2020b); Diakonikolas et al. (2020).

Organization. In Section 2, we define our notations and provide some background on robust learning of Bayesian networks and robust mean estimation. In Section 3, we give an overview of our approach and highlight some of our key technical results.
In Section 4, we present our algorithm for robust learning of Bayesian networks and prove our main result.

2. PRELIMINARIES

Bayesian Networks. Fix a d-node directed acyclic graph H whose nodes are labelled [d] = {1, 2, . . . , d} in topological order (every edge goes from a node with a smaller index to one with a larger index). Let Parents(i) be the set of parents of node i in H. A probability distribution P on {0, 1}^d is a Bayesian network (or Bayes net) with graph H if, for each i ∈ [d], the conditional probability Pr_{X∼P}[X_i = 1 | X_1, . . . , X_{i−1}] depends only on the values X_j for j ∈ Parents(i).

Conditional Probability Table. Let P be a Bayesian network with graph H. Let Γ = {(i, a) : i ∈ [d], a ∈ {0, 1}^{|Parents(i)|}} be the set of all possible parental configurations, and let m = |Γ|. For (i, a) ∈ Γ, the parental configuration Π_{i,a} is defined to be the event that X(Parents(i)) = a. The conditional probability table p ∈ [0, 1]^m of P is given by p_{i,a} = Pr_{X∼P}[X(i) = 1 | Π_{i,a}]. In this paper, we often index p as an m-dimensional vector. We use the notation p_k and the associated events Π_k, where each k ∈ [m] stands for an (i, a) ∈ Γ in lexicographic order.

Notations. For a vector v, let ‖v‖₂ and ‖v‖_∞ be the ℓ₂ and ℓ_∞ norms of v respectively. We write √v and 1/v for the entrywise square root and entrywise inverse of a vector v respectively. For two vectors x and y, we write x⊤y for their inner product, and x • y for their entrywise product. We use I to denote the identity matrix. For a matrix M, let M_i be the i-th column of M, and let ‖M‖₂ be the spectral norm of M. For a vector v ∈ R^n, let diag(v) ∈ R^{n×n} denote the diagonal matrix with v on the diagonal.

Throughout this paper, we use P to denote the ground-truth Bayesian network. We use d for the dimension (i.e., the number of nodes) of P, N for the number of samples, ε for the fraction of corrupted samples, and m = Σ_{i=1}^{d} 2^{|Parents(i)|} for the size of the conditional probability table of P. We use p ∈ R^m to denote the (unknown) ground-truth conditional probabilities of P, and q ∈ R^m for our current guess of p. Let G′ be the original set of N uncorrupted samples drawn from P. After the adversary corrupts an ε-fraction of G′, let G ⊆ G′ be the remaining set of good samples, and let B be the set of bad samples added by the adversary. The set of samples S = G ∪ B is given to the algorithm as input. Let X ∈ R^{d×N} denote the sample matrix whose i-th column X_i ∈ R^d is the i-th input sample. Abusing notation, we sometimes also use X as a random variable (e.g., a sample drawn from P). We use π^P ∈ R^m to denote the parental configuration probabilities of P; that is, π^P_k = Pr_{X∼P}[X ∈ Π_k]. For a set S of samples, we use π^S ∈ R^m to denote the empirical parental configuration probabilities over S: π^S_k = Pr_X[X ∈ Π_k], where X is drawn uniformly from S.

Balance and Minimum Configuration Probability.
We say a Bayesian network P is c-balanced if all conditional probabilities of P are between c and 1 − c. We use α for the minimum parental configuration probability of P: α = min_k π^P_k. In this paper, we assume that the ground-truth Bayesian network is c-balanced, and that its minimum parental configuration probability α satisfies α = Ω((ε√(ln(1/ε)))^{2/3} c^{−1/3}). Without loss of generality, we further assume that both c and α are given to the algorithm.
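The definitions above can be made concrete with a small self-contained example (the graph and probabilities here are made up for illustration): a 3-node Bayes net, its conditional probability table indexed by Γ, and the empirical estimates of the parental configuration probabilities π and the conditional probabilities.

```python
import numpy as np

# A 3-node Bayes net on {0,1}^3 in topological order: node 0 is a root,
# node 1 has parent {0}, node 2 has parents {0,1}. (Toy example.)
parents = {0: (), 1: (0,), 2: (0, 1)}

# Conditional probability table p, indexed by Gamma = {(i, a)}:
# p[(i, a)] = Pr[X_i = 1 | parental configuration Pi_{i,a}].
p = {(0, ()): 0.6,
     (1, (0,)): 0.3, (1, (1,)): 0.7,
     (2, (0, 0)): 0.4, (2, (0, 1)): 0.5, (2, (1, 0)): 0.6, (2, (1, 1)): 0.2}
m = len(p)  # m = sum_i 2^{|Parents(i)|} = 1 + 2 + 4 = 7

def sample(rng):
    x = np.zeros(3, dtype=int)
    for i in range(3):  # topological order, so parents are already set
        a = tuple(x[list(parents[i])])
        x[i] = rng.random() < p[(i, a)]
    return x

rng = np.random.default_rng(1)
X = np.array([sample(rng) for _ in range(50000)])

# Empirical parental configuration probabilities and conditional
# probabilities, i.e., the quantities pi^S and q^S of the paper.
pi_hat, q_hat = {}, {}
for (i, a) in p:
    mask = np.all(X[:, list(parents[i])] == a, axis=1)
    pi_hat[(i, a)] = mask.mean()
    q_hat[(i, a)] = X[mask, i].mean()
```

For each node i, the configurations a partition the sample space, so the empirical π values for a fixed i sum to exactly 1; the minimum over all entries of pi_hat estimates α.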

2.1. TOTAL VARIATION DISTANCE BETWEEN BAYESIAN NETWORKS

Let P and Q be two distributions supported on a finite domain D. For a set of outcomes A ⊆ D, let P(A) = Pr_{X∼P}[X ∈ A]. The total variation distance between P and Q is defined as d_TV(P, Q) = max_{A⊆D} |P(A) − Q(A)|. For two balanced Bayesian networks that share the same structure, it is well known that closeness of their conditional probabilities implies closeness in total variation distance. Formally, we use the following lemma from Cheng et al. (2018), which upper bounds the total variation distance between two Bayesian networks in terms of their conditional probabilities.

Lemma 2.1 (Cheng et al. (2018)). Let P and Q be two Bayesian networks that share the same structure. Let p and q denote the conditional probability tables of P and Q respectively. We have

(d_TV(P, Q))² ≤ 2 Σ_k √(π^P_k π^Q_k) (p_k − q_k)² / ((p_k + q_k)(2 − p_k − q_k)) .
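As a numerical sanity check on the bound of Lemma 2.1 (as reconstructed above), the snippet below compares it against the exact total variation distance for two product distributions, where every parental configuration probability equals 1; the specific probability vectors are made up for illustration.

```python
import itertools
import numpy as np

def tv_bound_sq(pi_P, pi_Q, p, q):
    """Right-hand side of Lemma 2.1:
    2 * sum_k sqrt(pi^P_k pi^Q_k) (p_k - q_k)^2 / ((p_k + q_k)(2 - p_k - q_k))."""
    return 2.0 * np.sum(np.sqrt(pi_P * pi_Q) * (p - q) ** 2
                        / ((p + q) * (2.0 - p - q)))

def exact_tv_product(p, q):
    """Exact d_TV between two product distributions on {0,1}^d (brute force)."""
    tv = 0.0
    for x in itertools.product([0, 1], repeat=len(p)):
        x = np.array(x)
        prob_p = np.prod(np.where(x == 1, p, 1 - p))
        prob_q = np.prod(np.where(x == 1, q, 1 - q))
        tv += abs(prob_p - prob_q)
    return tv / 2.0

p = np.array([0.40, 0.50, 0.60, 0.30])
q = np.array([0.45, 0.50, 0.55, 0.35])
ones = np.ones_like(p)  # for a product distribution, every pi_k equals 1
lhs = exact_tv_product(p, q) ** 2
rhs = tv_bound_sq(ones, ones, p, q)
```

For d = 1 the bound reduces to |p − q|² ≤ 2(p − q)²/((p + q)(2 − p − q)), which holds because u(2 − u) ≤ 1 for u = p + q ∈ [0, 2].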

2.2. EXPANDING THE DISTRIBUTION TO MATCH CONDITIONAL PROBABILITY TABLE

Lemma 2.1 states that to learn a known-structure Bayesian network P, it is sufficient to learn its conditional probabilities p. However, a given coordinate of X ∼ P may contain information about multiple conditional probabilities (depending on which parental configuration happens). To address this issue, we use a similar approach as in Cheng et al. (2018): we expand each sample X into an m-dimensional vector f(X, q), such that each coordinate of f(X, q) corresponds to an entry in the conditional probability table. Intuitively, q ∈ R^m is our current guess for p, and initially we set q to be the empirical conditional probabilities. We use q to fill in the entries of f(X, q) whose parental configurations fail to happen.

Definition 2.2. Let f : {0, 1}^d × R^m → R^m be defined as follows:

f(X, q)_{i,a} = X_i − q_{i,a} if X ∈ Π_{i,a}, and f(X, q)_{i,a} = 0 otherwise.

When X ∼ P and q = p, the distribution of f(X, p) has many good properties. Using the conditional independence structure of Bayesian networks, we can compute the first and second moments of f(X, p) and show that f(X, p) has subgaussian tails.

Lemma 2.3. For X ∼ P and f(X, p) as defined in Definition 2.2, we have (i) E[f(X, p)] = 0; (ii) Cov[f(X, p)] = diag(π^P • p • (1 − p)); (iii) for any unit vector v ∈ R^m, Pr_{X∼P}[|v⊤f(X, p)| ≥ T] ≤ 2 exp(−T²/2).

We defer the proof of Lemma 2.3 to Appendix A. A slightly stronger version of Lemma 2.3 was proved in Cheng et al. (2018), which discusses tail bounds for f(X, q). For our analysis, Lemma 2.3 is sufficient. For general values of q, we can similarly compute the mean of f(X, q):

Lemma 2.4. Let π^P denote the parental configuration probabilities of P. For X ∼ P and f(X, q) as defined in Definition 2.2, we have E[f(X, q)] = π^P • (p − q).
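The expansion of Definition 2.2 and the identities of Lemma 2.3(i) and Lemma 2.4 can be checked empirically on a toy two-node network (the graph and probabilities here are illustrative, not from the paper):

```python
import numpy as np

# Two-node net: node 0 is a root, node 1 has parent {0}.
# Gamma = {(0,()), (1,(0,)), (1,(1,))}, so m = 3.
parents = {0: (), 1: (0,)}
p = {(0, ()): 0.6, (1, (0,)): 0.3, (1, (1,)): 0.7}
keys = sorted(p)  # fix an order on Gamma so vectors live in R^m

def sample(rng):
    x = np.zeros(2, dtype=int)
    for i in (0, 1):
        a = tuple(x[list(parents[i])])
        x[i] = rng.random() < p[(i, a)]
    return x

def expand(x, q):
    """f(x, q): coordinate (i,a) is x_i - q_{i,a} if Pi_{i,a} happens, else 0."""
    out = np.zeros(len(keys))
    for k, (i, a) in enumerate(keys):
        if tuple(x[list(parents[i])]) == a:
            out[k] = x[i] - q[(i, a)]
    return out

rng = np.random.default_rng(2)
X = [sample(rng) for _ in range(50000)]

# Lemma 2.3 (i): E[f(X, p)] = 0.
mean_fp = np.mean([expand(x, p) for x in X], axis=0)

# Lemma 2.4: E[f(X, q)] = pi^P . (p - q) for any guess q.
q = {k: v + 0.1 for k, v in p.items()}           # a deliberately wrong guess
pi = np.array([1.0, 0.4, 0.6])                   # pi^P_k = Pr[Pi_k], by hand
diff = np.array([p[k] - q[k] for k in keys])     # p - q (equals -0.1 everywhere)
mean_fq = np.mean([expand(x, q) for x in X], axis=0)
```

Note that each expanded vector has exactly d = 2 nonzero coordinates out of m = 3, since exactly one parental configuration per node occurs; this sparsity is what Section 3 exploits.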

2.3. DETERMINISTIC CONDITIONS ON GOOD SAMPLES

To avoid dealing with the randomness of the good samples, we require the following deterministic conditions to hold for the original set G′ of N good samples (before the adversary's corruption). We prove in Appendix A that these three conditions hold simultaneously with probability at least 1 − τ if we draw N = Ω(m log(m/τ)/ε²) samples from P.

The first condition states that we can obtain a good estimate of p from G′. Let p^{G′} denote the empirical conditional probabilities over G′. We require

‖√π^P • (p − p^{G′})‖₂ ≤ O(ε) .    (1)

The second condition says that we can estimate the parental configuration probabilities π^P from any (1 − 2ε)-fraction of G′. Formally, for any subset T ⊆ G′ with |T| ≥ (1 − 2ε)N, we have

‖π^T − π^P‖_∞ ≤ O(ε) .    (2)

The third condition is that the empirical mean and covariance of any (1 − 2ε)-fraction of G′ are very close to the true mean and covariance of f(X, p). Formally, for any subset T ⊆ G′ with |T| ≥ (1 − 2ε)N, we require the following to hold for δ₁ = ε√(ln(1/ε)) and δ₂ = ε ln(1/ε):

‖(1/|T|) Σ_{i∈T} f(X_i, p)‖₂ ≤ O(δ₁) ,  ‖(1/|T|) Σ_{i∈T} f(X_i, p) f(X_i, p)⊤ − Σ‖₂ ≤ O(δ₂) ,    (3)

where Σ = Cov[f(X, p)] = diag(π^P • p • (1 − p)).

2.4. ROBUST MEAN ESTIMATION AND STABILITY CONDITIONS

Robust mean estimation is the problem of learning the mean of a d-dimensional distribution from an ε-corrupted set of samples. As we will see in later sections, to robustly learn Bayesian networks, we repeatedly use robust mean estimation algorithms as a subroutine. Recent work (Diakonikolas et al., 2019a; Lai et al., 2016) gave the first polynomial-time algorithms for robust mean estimation with dimension-independent error guarantees. The key observation in Diakonikolas et al. (2019a) is the following: if the empirical mean is inaccurate, then many samples must be far from the true mean in roughly the same direction. Consequently, these samples must alter the variance in this direction by more than they distort the mean. Therefore, if the empirical covariance behaves as we expect it to, then the empirical mean provides a good estimate of the true mean. Many robust mean estimation algorithms follow this intuition, and they require the following stability condition to work (Definition 2.5). Roughly speaking, the stability condition states that the mean and covariance of the good samples are close to those of the true distribution, and more importantly, that this continues to hold if we remove any 2ε-fraction of the samples.

Definition 2.5 (Stability Condition (see, e.g., Diakonikolas & Kane (2019))). Fix 0 < ε < 1/2. Fix a d-dimensional distribution X with mean µ_X. We say a set S of samples is (ε, β, γ)-stable with respect to X if, for every subset T ⊆ S with |T| ≥ (1 − 2ε)|S|, the following conditions hold:

(i) ‖(1/|T|) Σ_{X∈T} (X − µ_X)‖₂ ≤ β ,
(ii) ‖(1/|T|) Σ_{X∈T} (X − µ_X)(X − µ_X)⊤ − I‖₂ ≤ γ .

Subsequent work (Cheng et al., 2019a; Dong et al., 2019; Depersin & Lecué, 2019) improved the runtime of robust mean estimation to nearly-linear time. Formally, we use the following result from Dong et al. (2019). A set S is an ε-corrupted version of a set T if |S| = |T| and |S \ T| ≤ ε|S|.

Lemma 2.6 (Robust Mean Estimation in Nearly-Linear Time (Dong et al., 2019)).
Fix a set of N samples G′ in R^d. Suppose G′ is (ε, β, γ)-stable with respect to a d-dimensional distribution X with mean µ_X ∈ R^d. Let S be an ε-corrupted version of G′. Given as input S, ε, β, γ, there is an algorithm that outputs an estimator μ̂ ∈ R^d in time Õ(Nd), such that with high probability, ‖μ̂ − µ_X‖₂ ≤ O(√(εγ) + β + ε√(log(1/ε))).

As we will see later, a black-box use of Lemma 2.6 does not give the desired runtime in our setting. Instead, we extend Lemma 2.6 to handle sparse input so that it runs in time nearly-linear in the number of nonzeros in the input (see Lemma 3.3).
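The key observation above can be seen numerically: to shift the empirical mean by Δ, an ε-fraction of outliers must sit at distance about Δ/ε from the true mean, which inflates the variance along that direction by about Δ²/ε. A small illustration with made-up numbers, for an identity-covariance Gaussian (so the clean top eigenvalue is near 1):

```python
import numpy as np

def mean_err_and_top_eig(X):
    """Empirical mean error (the truth is 0) and top eigenvalue of the
    empirical covariance; for clean N(0, I) data the latter is close to 1."""
    cov = np.cov(X.T, bias=True)
    top_eig = np.linalg.eigvalsh(cov)[-1]
    return np.linalg.norm(X.mean(axis=0)), top_eig

rng = np.random.default_rng(3)
N, d, eps = 5000, 20, 0.1
good = rng.normal(size=(N, d))
err_clean, eig_clean = mean_err_and_top_eig(good)

# Plant eps*N outliers at distance 3 from the true mean, all along the same
# direction u = (1,...,1)/sqrt(d); this shifts the mean by about eps * 3.
corrupted = good.copy()
corrupted[: int(eps * N)] = 3.0 / np.sqrt(d)
err_bad, eig_bad = mean_err_and_top_eig(corrupted)
# The mean error jumps from ~sqrt(d/N) to ~0.3, and the top eigenvalue
# rises with it: the covariance "certifies" that the mean has been moved.
```

This is exactly why a small spectral norm of the empirical covariance certifies the empirical mean, the principle our reduction relies on.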

3. OVERVIEW OF OUR APPROACH

In this section, we give an overview of our approach and highlight some of our key technical results. To robustly learn the ground-truth Bayesian network P, it is sufficient to learn its conditional probabilities p ∈ R^m. At a high level, we start with a guess q ∈ R^m for p and then iteratively improve our guess to get closer to p. For any q ∈ R^m, we can expand the input samples into m-dimensional vectors f(X, q) as in Definition 2.2. We first show that the expectation of f(X, q) gives us useful information about (p − q). Recall that π^P is the vector of parental configuration probabilities of P. By Lemma 2.4, we have E_{X∼P}[f(X, q)] = π^P • (p − q). Note that if we had access to this expectation and the vector π^P, we could recover p immediately: setting q′ = E[f(X, q)] • (1/π^P) + q simplifies to q′ = p. Since S is an ε-corrupted set of samples from P, the set {f(X_i, q)}_{i∈S} is an ε-corrupted set of samples from the distribution of f(X, q) (with X ∼ P). Therefore, we can run robust mean estimation algorithms on {f(X_i, q)}_{i∈S} to learn E[f(X, q)]. It turns out that a good approximation of E[f(X, q)] can help us improve our current guess q.

There are two main difficulties in getting this approach to work. The first difficulty is that, to use robust mean estimation algorithms, we need to show that f(X, q) satisfies the stability condition in Definition 2.5. This requires us to analyze the first two moments and the tail bounds of f(X, q). Consider the second moment for example. Ideally, we would like to have a statement of the form Cov[f(X, q)] ≈ Cov[f(X, p)] + (p − q)(p − q)⊤, but this is false, because we only have f(X, p)_k − f(X, q)_k = (p − q)_k when the k-th parental configuration happens for X. Intuitively, the "error" (p − q) is shattered across the samples, where each sample only reveals d out of the m coordinates of (p − q), and there is no succinct representation for Cov[f(X, q)]. The second difficulty is that f(X, q) is m-dimensional.
We cannot explicitly write down all the samples {f(X_i, q)}_{i=1}^{N}, because this takes time Ω(Nm), which could be much slower than our desired running time of Õ(Nd). Similarly, a black-box use of nearly-linear time robust mean estimation algorithms (e.g., Lemma 2.6) runs in time Ω̃(Nm), which is too slow. In the rest of this section, we explain how we handle these two issues.

Stability Condition of f̃(X, q). Because the second-moment stability condition in Definition 2.5 is defined with respect to I, we first scale the samples so that the covariance of f(X, p) becomes I. Lemma 2.3 shows that Cov[f(X, p)] = diag(π^P • p • (1 − p)). To make it close to I, we can multiply the k-th coordinate of f(X, p) by (π^P_k p_k(1 − p_k))^{−1/2}. However, we do not know the exact values of π^P or p; instead we use the corresponding empirical estimates π^S and q^S (see Algorithm 1).

Definition 3.1. Let π^S and q^S denote the parental configuration probabilities and conditional means estimated over S. Let s = 1/√(π^S • q^S • (1 − q^S)). Throughout this paper, for a vector v ∈ R^m, we use ṽ ∈ R^m to denote v • s. In particular, we have X̃_i = X_i • s (and similarly p̃, q̃, f̃(X, q)).

Now we analyze the concentration bounds for f̃(X, q). Formally, we prove the following lemma.

Lemma 3.2. Assume the conditions in Section 2.3 hold for the original set of good samples G′. Then, for δ₁ = ε√(log(1/ε)) and δ₂ = ε log(1/ε), the set of samples {f̃(X_i, q)}_{i∈G′} is

(ε, O(δ₁/√(αc) + ε‖p̃ − q̃‖₂), O(δ₂/(αc) + B + √B))-stable,

where B = ‖√π^P • (p̃ − q̃)‖₂².

We provide some intuition for Lemma 3.2 and defer its proof to Appendix B. For the first moment, the difference between E[f̃(X, q)] and the empirical mean of f̃(X, q) comes from several places. Even if q = p, we would incur an error of δ₁ from the concentration bound in Equation (3), which becomes at most δ₁(αc)^{−1/2} after the scaling by s. Moreover, on average a π^P_k-fraction of the samples give us information about (p − q)_k.
Since an ε-fraction of the samples can be removed when proving stability, we may only see a (π^P_k − ε)-fraction instead, which introduces an error of ε‖p̃ − q̃‖₂. This is why the first-moment parameter is δ₁(αc)^{−1/2} + ε‖p̃ − q̃‖₂. For the second moment, after the scaling, we have Cov[f̃(X, p)] ≈ I. Ideally, we would like to prove Cov[f̃(X, q)] ≈ I + (π^P • (p̃ − q̃))(π^P • (p̃ − q̃))⊤, but this is too good to be true. For two coordinates k ≠ ℓ, whether a sample gives information about (p − q)_k or (p − q)_ℓ is not independent. We can upper bound the probability that both parental configurations happen by min(π^P_k, π^P_ℓ); if they were independent we would have a bound of π^P_k π^P_ℓ. The difference between these two upper bounds is intuitively why √π^P appears in the second-moment parameter. See Appendix B for more details.

Robust Mean Estimation with Sparse Input. To overcome the second difficulty, we exploit the sparsity of the expanded vectors. Observe that each vector f(X, q) is guaranteed to be d-sparse, because exactly d parental configurations can happen (see Definition 2.2). The same is true for f̃(X, q), because scaling does not change the number of nonzeros. Therefore, there are in total O(Nd) nonzero entries in the set of samples {f̃(X_i, q)}_{i∈S}. We develop a robust mean estimation algorithm that runs in time nearly-linear in the number of nonzeros in the input. Combined with the above argument, if we invoke this mean estimation algorithm only polylogarithmically many times, we get the desired running time of Õ(Nd).

Lemma 3.3. Consider the same setting as in Lemma 2.6. Suppose ‖X‖₂ ≤ R for all X ∈ S. There is an algorithm A_mean with the same error guarantee that runs in time Õ(log R · (nnz(S) + N + d)), where nnz(S) is the total number of nonzeros in S. That is, given an ε-corrupted version of an (ε, β, γ)-stable set of N samples w.r.t.
a d-dimensional distribution with mean µ_X, the algorithm A_mean outputs an estimator μ̂ ∈ R^d in time Õ(log R · (nnz(S) + N + d)) such that with high probability, ‖μ̂ − µ_X‖₂ ≤ O(√(εγ) + β + ε√(log(1/ε))).

We prove Lemma 3.3 by extending the algorithm in Dong et al. (2019) to handle sparse input. The main computational bottleneck of recent nearly-linear time robust mean estimation algorithms (Cheng et al., 2019a; Dong et al., 2019) is in the matrix multiplicative weights update (MMWU) method. In each iteration of MMWU, a score is computed for each sample; roughly speaking, this score indicates whether one should continue to increase the weight on the corresponding sample. Previous algorithms use the Johnson-Lindenstrauss lemma to approximate the scores for all N samples simultaneously. We show that the sparsity of the input vectors allows for faster application of the Johnson-Lindenstrauss lemma, and all N scores can be computed in time nearly-linear in nnz(S). We defer the proof of Lemma 3.3 to Appendix C.
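The speedup can be illustrated as follows. Scores of the form ‖A f_i‖₂² (for some matrix A derived from the current weight matrix) can be approximated for all samples at once by sketching A with a Johnson-Lindenstrauss matrix G of k = O(log N) rows; multiplying the sketched k × m matrix with the sparse sample matrix then costs O(k · nnz) instead of a dense product. The snippet below is our own rough illustration of this idea (the exact scores used by Dong et al. differ):

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(4)
N, m, d_sp, k = 4000, 500, 20, 40   # each sample has d_sp nonzeros; k ~ O(log N)

# Sparse "expanded" samples: one d_sp-sparse row of R^m per sample.
rows = np.repeat(np.arange(N), d_sp)
cols = rng.integers(0, m, size=N * d_sp)
vals = rng.normal(size=N * d_sp)
F = sparse.csr_matrix((vals, (rows, cols)), shape=(N, m))

A = rng.normal(size=(m, m)) / np.sqrt(m)   # stand-in for the MMWU matrix

# Exact scores ||A f_i||^2: the product F @ A.T costs about nnz(F) * m.
exact = np.asarray(F @ A.T)                # N x m dense array
exact_scores = np.einsum('ij,ij->i', exact, exact)

# JL: precompute S = G @ A once (k x m); then one sparse multiply of cost
# O(k * nnz(F)) approximates all N scores simultaneously.
G = rng.normal(size=(k, m)) / np.sqrt(k)
sketched = np.asarray(F @ (G @ A).T)       # N x k
approx_scores = np.einsum('ij,ij->i', sketched, sketched)
ratio = approx_scores / exact_scores
```

The ratios concentrate around 1 with multiplicative error roughly √(1/k), which is all the MMWU scores need, since they are only used to decide which sample weights to adjust.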

4. ROBUST LEARNING OF BAYESIAN NETWORKS IN NEARLY-LINEAR TIME

In this section, we prove our main result. We present our algorithm (Algorithm 1), prove its correctness, and analyze its running time (Theorem 4.1).

Theorem 4.1. Fix 0 < ε < ε₀, where ε₀ is a sufficiently small universal constant. Let P be a c-balanced Bayesian network on {0, 1}^d with known structure H. Let α be the minimum parental configuration probability of P. Assume α = Ω̃(ε^{2/3} c^{−1/3}). Let S be an ε-corrupted set of N = Ω̃(m/ε²) samples drawn from P. Given H, S, ε, c, and α, Algorithm 1 outputs a Bayesian network Q in time Õ(Nd) such that, with probability at least 9/10, d_TV(P, Q) ≤ O(ε√(log(1/ε))/√(αc)).

The c and α terms in the error guarantee also appear in prior work (Cheng et al., 2018). Removing this dependence is an important technical question that is beyond the scope of this paper.

Theorem 4.1 follows from three key technical lemmas. At the beginning of Algorithm 1, we first scale all the input vectors as in Definition 3.1. We maintain a guess q for p and gradually move it closer to p. In our analysis, we track our progress by the ℓ₂-norm of π^P • (p̃ − q̃). Initially, we set q to be the empirical conditional mean over S. Lemma 4.2 proves that ‖π^P • (p̃ − q̃₀)‖₂ is not too large for our first guess. Lemma 4.3 shows that, as long as q is still relatively far from p, we can compute a new guess such that ‖π^P • (p̃ − q̃)‖₂ decreases by a constant factor. Lemma 4.4 states that, when the algorithm terminates and ‖π^P • (p̃ − q̃)‖₂ is small, we can conclude that the output Q is close to the ground-truth P in total variation distance. In the following three lemmas, we consider the same setting as in Theorem 4.1 and assume the conditions in Section 2.3 hold.

Lemma 4.2 (Initialization). In Algorithm 1, we have ‖π^P • (p̃ − q̃₀)‖₂ ≤ O(ε√d/√(αc)).

Lemma 4.3 (Iterative Refinement). Fix an iteration t in Algorithm 1. Assume the robust mean estimation algorithm A_mean succeeds.
If ‖π^P • (p̃ − q̃_t)‖₂ ≤ ρ_t and ρ_t = Ω(ε√(log(1/ε))/√(αc)), then we have ‖π^P • (p̃ − q̃_{t+1})‖₂ ≤ c₁ρ_t for some universal constant c₁ < 1.

Lemma 4.4. Let Q be a Bayesian network that has the same structure as P. Suppose that (1) P is c-balanced, (2) α = Ω(r + ε/c), and (3) ‖π^P • (p̃ − q̃)‖₂ ≤ r/2. Then we have d_TV(P, Q) ≤ r.

We defer the proofs of Lemmas 4.2, 4.3, and 4.4 to Appendix D and first prove Theorem 4.1.

Proof of Theorem 4.1. We first prove the correctness of Algorithm 1.

Algorithm 1: Robustly Learning Bayesian Networks
Input: The dependency graph H of a c-balanced Bayesian network P with minimum parental configuration probability α, an ε-corrupted set S of N = Ω̃(m/ε²) samples {X_i}_{i=1}^{N} drawn from P, and the values of ε, c, and α.
Output: A Bayesian network Q such that, with probability at least 9/10, d_TV(P, Q) ≤ O(ε√(log(1/ε))/√(αc)).
  Compute the empirical probabilities π^S, where π^S_{i,a} = Pr_{X∈S}[Π_{i,a}];
  Compute the empirical conditional probabilities q^S, where q^S_{i,a} = Pr_{X∈S}[X(i) = 1 | Π_{i,a}];
  Compute the scaling vector s = 1/√(π^S • q^S • (1 − q^S));
  Let T = O(log d) and q_0 = q^S;
  Let ρ_0 = O(ε√d/√(αc)). (We maintain upper bounds ρ_t such that ‖π^P • (p̃ − q̃_t)‖₂ ≤ ρ_t for all t);
  for t = 0 to T − 1 do
    β_t = O(ερ_t/α), γ_t = O((ρ_t)²/α + ρ_t/√α);
    Solve a robust mean estimation problem: let ν = A_mean({f̃(X_i, q_t)}_{i∈S}, ε, β_t, γ_t);
    q_{t+1} = ν • (1/s) • (1/π^S) + q_t;
    ρ_{t+1} = c₁ρ_t;
  return the Bayesian network Q with graph H and conditional probabilities q_T;

The original set of N = Ω(m log(m)/ε²) good samples drawn from P satisfies the conditions in Section 2.3 with probability at least 1 − 1/20. With high probability, the robust mean estimation oracle A_mean succeeds in all iterations. For the rest of this proof, we assume the above conditions hold, which by a union bound happens with probability at least 9/10. From Lemma 4.2, we have the following condition on the initial estimate q_0: ‖π^P • (p̃ − q̃₀)‖₂ = O(ε√d/√(αc)).
We start with an upper bound ρ₀ = O(ε√d/√(αc)) on ‖π^P • (p̄ − q̄₀)‖₂. By Lemma 4.3, in each iteration, if ρ_t = Ω(ε√(log(1/ε))/√(αc)), we can obtain a new estimate q_{t+1} and an upper bound ρ_{t+1} on ‖π^P • (p̄ − q̄_{t+1})‖₂ such that ρ_{t+1} is smaller than ρ_t by a constant factor. Hence, after O(log d) iterations, we obtain a vector q_t such that ‖π^P • (p̄ − q̄_t)‖₂ = O(ε√(log(1/ε))/√(αc)). Let Q be the Bayesian network with conditional probability table q_t. The assumption that α = Ω(ε^{2/3} c^{−1/3}) allows us to apply Lemma 4.4 with r = O(ε√(log(1/ε))/√(αc)), which gives the claimed upper bound on d_TV(P, Q).

Now we analyze the runtime of Algorithm 1. First, q^S and π^S can be computed in time O(Nd) because each sample only affects d entries of q. We do not explicitly write down f̃(X, q). In each iteration, we solve a robust mean estimation problem with input {f̃(X_i, q_t)}_{i∈S}, which takes time Õ(Nd). This is because there are N input vectors, each vector is d-sparse, and the robust mean estimation algorithm runs in time nearly-linear in the number of nonzeros in the input (Lemma 3.3). We can compute q_{t+1} = ν • (1/s) • (1/π^S) + q_t in time O(m). Since there are O(log d) iterations, the overall running time is O(Nd) + O(log d) · (Õ(Nd) + O(m)) = Õ(Nd).
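The iterate-and-contract structure of Algorithm 1 can be sketched as follows. This is a minimal illustration, not the paper's implementation: `robust_mean` stands in for the oracle A_mean, `samples_f` for the scaled residual vectors f̃(X_i, q), and the constants (e.g., the contraction factor) are placeholders.

```python
import numpy as np

def learn_fixed_structure_bn(q_S, pi_S, s, eps, alpha, c, samples_f, robust_mean):
    """Sketch of Algorithm 1's main loop (all names hypothetical).

    q_S, pi_S, s : empirical conditional probabilities, parental configuration
                   probabilities, and the scaling vector (all length m).
    samples_f    : callable returning the scaled residual vectors for the
                   current guess q (an N x m array).
    robust_mean  : stand-in for the robust mean estimation oracle A_mean.
    """
    m = q_S.shape[0]
    q = q_S.copy()
    rho = eps * np.sqrt(m) / np.sqrt(alpha * c)    # initial bound (Lemma 4.2)
    c1 = 0.5                                       # contraction factor (Lemma 4.3)
    for _ in range(int(np.ceil(np.log2(m))) + 1):  # T = O(log d) iterations
        beta = eps * rho / alpha                   # stability parameters
        gamma = rho**2 / alpha + rho / np.sqrt(alpha)
        nu = robust_mean(samples_f(q), eps, beta, gamma)
        q = nu / s / pi_S + q                      # undo scaling, shift the guess
        rho = c1 * rho
    return q
```

With an exact mean oracle and uncorrupted residuals, a single step already recovers p; the robust oracle only guarantees the weaker multiplicative decrease of ρ_t per iteration.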

A DETERMINISTIC CONDITIONS ON GOOD SAMPLES

In this section, we first prove Lemma 2.3, and then prove that the deterministic conditions in Section 2.3 hold with high probability if we take enough samples.

Lemma A.1. For X ∼ P and f(X, p) as defined in Definition 2.2, we have: (i) E[f(X, p)] = 0. (ii) Cov[f(X, p)] = diag(π^P • p • (1 − p)). (iii) For any unit vector v ∈ R^m, we have Pr_{X∼P}[|v·f(X, p)| ≥ T] ≤ 2 exp(−T²/2).

Proof. We first claim that E_{X∼P}[f(X, p)_k | f(X, p)_1, ..., f(X, p)_{k−1}] = 0 for all k ∈ [m]. Let k = (i, a). Conditioned on f(X, p)_1, ..., f(X, p)_{k−1}, the event Π_{i,a} may or may not happen. A simple calculation shows that in both cases, E_{X∼P}[f(X, p)_k | f(X, p)_1, ..., f(X, p)_{k−1}] = 0.

For (i), for each k = (i, a) we have E[f(X, p)_k] = π^P_k · E[X_i − p_k | Π_{i,a}] = π^P_k (p_k − p_k) = 0.

For (ii), we first show that for any (i, a) ≠ (j, b), we have E[f(X, p)_{i,a} f(X, p)_{j,b}] = 0. For the case i = j, at least one of Π_{i,a} and Π_{j,b} does not occur, so f(X, p)_{i,a} f(X, p)_{j,b} is always 0. For the case i ≠ j, we assume without loss of generality that i > j; then E[f(X, p)_{i,a} | f(X, p)_{j,b}] = 0. For all (i, a) ∈ [m], we have E[f(X, p)²_{i,a}] = π^P_{i,a} E[(X_i − p_{i,a})² | Π_{i,a}] = π^P_{i,a} p_{i,a}(1 − p_{i,a}). Combining these two facts, we get Cov[f(X, p)] = diag(π^P • p • (1 − p)).

For (iii), recall that E_{X∼P}[f(X, p)_k | f(X, p)_1, ..., f(X, p)_{k−1}] = 0; thus the sequence Σ_{k=1}^{ℓ} v_k f(X, p)_k for 1 ≤ ℓ ≤ m is a martingale, and we can apply Azuma's inequality. Note that |v_k| ≥ |v_k f(X, p)_k|, hence Pr_{X∼P}[|v·f(X, p)| ≥ T] ≤ 2 exp(−T²/(2‖v‖₂²)) = 2 exp(−T²/2).

The conditions in Equations 1 and 2 are proved in Lemma A.2, and the conditions in Equation 3 are proved in Corollary A.4.

Lemma A.2. Let P be a Bayesian network. Let G be a set of Ω((m log(m/τ))/ε²) samples drawn from P. Let π^G and p^G be the empirical parental configuration probabilities and conditional probabilities of P given by G.
Then, with probability 1 − τ, the following conditions hold: (i) For any subset T ⊂ G with |T| ≥ (1 − 2ε)N, we have ‖π^T − π^P‖∞ ≤ O(ε). (ii) ‖√π^P • (p − p^G)‖₂ ≤ O(ε).

Proof. For (i), first consider the case T = G and fix an entry 1 ≤ k ≤ m in the conditional probability table. Because each sample is drawn independently from P, by the Chernoff bound, when N = Ω(log(m/τ)/ε²), |π^P_k − π^G_k| ≤ ε holds with probability at least 1 − τ/m. Hence, after taking a union bound over k, ‖π^T − π^P‖∞ ≤ ε holds with probability at least 1 − τ. Now for a general subset T ⊂ G, notice that removing an O(ε)-fraction of the samples can change π^T by at most O(ε). Thus, condition (i) holds with probability at least 1 − τ.

For (ii), for any k = (i, a), note that p^G_k is estimated from π^G_k N samples. In these samples, the parental configuration Π_k happens and the value of X_i is decided independently. By the Chernoff bound and a union bound, when N = Ω((m log(m/τ))/ε²), |p^G_k − p_k| ≤ ε/√(m π^G_k) holds for every k with probability at least 1 − τ, which implies ‖√π^G • (p − p^G)‖₂ ≤ O(ε). Combining this with ‖π^P − π^G‖∞ ≤ ε, we get that condition (ii) holds.

To prove Equation 3, we use the following concentration bounds for subgaussian distributions. Recall that a distribution D on R^d with mean µ is subgaussian if for any unit vector v ∈ R^d we have Pr_{x∼D}[|⟨v, x − µ⟩| ≥ t] ≤ exp(−ct²), where c is a universal constant.

Lemma A.3. Let G be a set of N = Ω((d + log(1/τ)) log(1/ε)/ε²) samples drawn from a d-dimensional subgaussian distribution with mean µ and covariance matrix Σ ⪯ I. Here A ⪯ B means that B − A is a positive semi-definite matrix.
Then, with probability 1 − τ, the following conditions hold: for δ₁ = c₁ε√(log(1/ε)) and δ₂ = c₁ε log(1/ε), where c₁ is a universal constant, we have that for any subset T ⊂ G with |T| ≥ (1 − 2ε)N,
‖(1/|T|) Σ_{i∈T} (X_i − µ)‖₂ ≤ δ₁, and ‖(1/|T|) Σ_{i∈T} (X_i − µ)(X_i − µ)^⊤ − Σ‖₂ ≤ δ₂.

A special case of Lemma A.3 where Σ = I is proved in Diakonikolas et al. (2019a). The proof for the general case where Σ ⪯ I is almost identical. In particular, the concentration inequalities used in Diakonikolas et al. (2019a) for subgaussian distributions still hold when Σ ⪯ I (see, e.g., Vershynin (2010)). From Lemmas A.3 and 2.3, we have the following corollary:

Corollary A.4. Let G be a set of N = Ω((m + log(1/τ)) log(1/ε)/ε²) samples drawn from P. Then, with probability 1 − τ, the following conditions hold: for δ₁ = c₁ε√(log(1/ε)) and δ₂ = c₁ε log(1/ε), where c₁ is a universal constant, we have that for any subset T ⊂ G with |T| ≥ (1 − 2ε)N,
‖(1/|T|) Σ_{i∈T} f(X_i, p)‖₂ ≤ O(δ₁), and ‖(1/|T|) Σ_{i∈T} f(X_i, p) f(X_i, p)^⊤ − Σ‖₂ ≤ O(δ₂),
where Σ = Cov[f(X, p)] = diag(π^P • p • (1 − p)).

B STABILITY CONDITION OF f̃(X, q)

In this section, we prove the stability condition for the samples f̃(X, q) (Lemma 3.2). Recall the definitions of f(X, q) and f̃(X, q) from Definitions 2.2 and 3.1. We first restate Lemma 3.2.

Lemma 3.2. Assume the conditions in Section 2.3 hold for the original set of good samples G. Then, for δ₁ = ε√(log(1/ε)) and δ₂ = ε log(1/ε), the set of samples {f̃(X_i, q)}_{i∈G} is (ε, O(δ₁/√(αc) + ε‖p̄ − q̄‖₂), O(δ₂/(αc) + B + √B))-stable, where B = ‖√π^P • (p̄ − q̄)‖₂².

We will first prove the stability of f(X, q); the stability of f̃(X, q) follows directly. We introduce a matrix C_{D,q} that is crucial in proving the stability of f(X, q). Intuitively, C_{D,q} is related to the difference between the covariance of f(X, p) and that of f(X, q) on the sample set D.

Definition B.1.
For any set D of samples {X_i}_{i∈D}, we define the m × m matrix C_{D,q} = (1/|D|) Σ_{i∈D} (f(X_i, p) − f(X_i, q))(f(X_i, p) − f(X_i, q))^⊤.

Observe that for x ∈ {0,1}^d with x ∉ Π_k, we have f(x, p)_k = f(x, q)_k = 0. On the other hand, if x ∈ Π_k for some k = (i, a), then f(x, p)_k − f(x, q)_k = (x_i − p_k) − (x_i − q_k) = q_k − p_k. In the very special case where all parental configurations happen (i.e., a binary product distribution), we would have C_{D,q} = (p − q)(p − q)^⊤. In general, however, the information related to (p − q) is spread among the samples. We show that even though C_{D,q} does not have a succinct representation, we can prove the following upper bound on its spectral norm.

Lemma B.2. ‖C_{D,q}‖₂ ≤ Σ_k π^D_k (p_k − q_k)².

Proof. For notational convenience, let C = C_{D,q}. For every 1 ≤ k, ℓ ≤ m, we have |C_{k,ℓ}| = (Pr_D[Π_k ∧ Π_ℓ]) |(p_k − q_k)(p_ℓ − q_ℓ)| ≤ min{π^D_k, π^D_ℓ} · |(p_k − q_k)(p_ℓ − q_ℓ)| ≤ √(π^D_k) |p_k − q_k| · √(π^D_ℓ) |p_ℓ − q_ℓ|. We can upper bound the spectral norm of C in terms of its Frobenius norm: ‖C‖₂² ≤ ‖C‖_F² = Σ_{k,ℓ} C²_{k,ℓ} ≤ Σ_{k,ℓ} π^D_k (p_k − q_k)² π^D_ℓ (p_ℓ − q_ℓ)² ≤ (Σ_k π^D_k (p_k − q_k)²)².

The following lemma essentially proves the stability of f(X, q), except that the second-order condition involves Σ instead of Cov[f(X, q)]. We will bridge this gap in Lemma B.4.

Lemma B.3. Assume the conditions in Section 2.3 hold. For δ₁ = ε√(log(1/ε)) and δ₂ = ε log(1/ε), we have that for any subset T ⊂ G with |T| ≥ (1 − ε)|G|,
‖(1/|T|) Σ_{i∈T} (f(X_i, q) − π^P • (p − q))‖₂ ≤ O(δ₁ + ε‖p − q‖₂), and
‖(1/|T|) Σ_{i∈T} (f(X_i, q) − π^P • (p − q))(f(X_i, q) − π^P • (p − q))^⊤ − Σ‖₂ ≤ O(δ₂ + B + √B),
where B = ‖√π^P • (p − q)‖₂² ≤ (1/α)‖π^P • (p − q)‖₂², and Σ = diag(π^P • p • (1 − p)) is the true covariance of f(X, p).

Proof.
For the first moment, we have
‖(1/|T|) Σ_{i∈T} (f(X_i, q) − π^P • (p − q))‖₂ = ‖(1/|T|) Σ_{i∈T} (f(X_i, p) + f(X_i, q) − f(X_i, p) − π^P • (p − q))‖₂
≤ ‖(1/|T|) Σ_{i∈T} f(X_i, p)‖₂ + ‖(1/|T|) Σ_{i∈T} (f(X_i, q) − f(X_i, p) − π^P • (p − q))‖₂
= ‖(1/|T|) Σ_{i∈T} f(X_i, p)‖₂ + ‖(π^T − π^P) • (p − q)‖₂ = O(δ₁ + ε‖p − q‖₂).

For the second moment, consider any unit vector v ∈ R^m. We have
v^⊤ ((1/|T|) Σ_{i∈T} f(X_i, q) f(X_i, q)^⊤) v = (1/|T|) Σ_{i∈T} ⟨f(X_i, q), v⟩²
= (1/|T|) Σ_{i∈T} [⟨f(X_i, p), v⟩² + ⟨f(X_i, p) − f(X_i, q), v⟩² − 2⟨f(X_i, p), v⟩⟨f(X_i, p) − f(X_i, q), v⟩]
≤ (1/|T|) Σ_{i∈T} [⟨f(X_i, p), v⟩² + ⟨f(X_i, p) − f(X_i, q), v⟩²] + 2√((1/|T|) Σ_{i∈T} ⟨f(X_i, p), v⟩²) · √((1/|T|) Σ_{i∈T} ⟨f(X_i, p) − f(X_i, q), v⟩²),
where the last inequality follows from the Cauchy–Schwarz inequality. Therefore, we have
‖(1/|T|) Σ_{i∈T} f(X_i, q) f(X_i, q)^⊤ − Σ‖₂ ≤ ‖(1/|T|) Σ_{i∈T} f(X_i, p) f(X_i, p)^⊤ − Σ‖₂ + ‖(1/|T|) Σ_{i∈T} (f(X_i, p) − f(X_i, q))(f(X_i, p) − f(X_i, q))^⊤‖₂ + 2√(‖(1/|T|) Σ_{i∈T} f(X_i, p) f(X_i, p)^⊤‖₂) · √(‖(1/|T|) Σ_{i∈T} (f(X_i, p) − f(X_i, q))(f(X_i, p) − f(X_i, q))^⊤‖₂)
≤ δ₂ + ‖C_{T,q}‖₂ + 2√(1 + δ₂) · √‖C_{T,q}‖₂ = O(δ₂ + ‖C_{T,q}‖₂ + √‖C_{T,q}‖₂).

Finally, we show that the second moment matrix (1/|T|) Σ_{i∈T} f(X_i, q) f(X_i, q)^⊤ is not too far from the empirical covariance matrix of f(X, q):
‖(1/|T|) Σ_{i∈T} f(X_i, q) f(X_i, q)^⊤ − (1/|T|) Σ_{i∈T} (f(X_i, q) − π^P • (p − q))(f(X_i, q) − π^P • (p − q))^⊤‖₂
≤ 2‖(1/|T|) Σ_{i∈T} (f(X_i, q) − π^P • (p − q))‖₂ · ‖π^P • (p − q)‖₂ + ‖π^P • (p − q)‖₂²
≤ O((δ₁ + ε‖p − q‖₂) · ‖π^P • (p − q)‖₂) ≤ O(δ₂ + B + √B).
Putting everything together and using Lemma B.2, we conclude the proof.

The stability of f̃(X, q) follows from the stability of f(X, q) (Lemma B.3) by scaling all samples by s and replacing Σ with I in the second-order condition using Lemma B.4.

Lemma B.4. Assume the conditions in Section 2.3 hold. Then after scaling, we have ‖Σ̃ − I‖₂ ≤ O(ε/(αc)), where Σ̃ is the covariance matrix of f̃(X, p).

Proof. Recall that π^P_k p_k(1 − p_k) ≥ (1/2) π^P_k min(p_k, 1 − p_k) = Ω(αc). Because ‖s‖∞ = O(1/√(αc)) and 1/s_k² = π^S_k q^S_k (1 − q^S_k), it suffices to show that ‖π^P • p • (1 − p) − π^S • q^S • (1 − q^S)‖∞ = O(ε).

Let π^G and p^G be the empirical parental configuration probabilities and conditional probabilities of P given by G. We first prove that ‖π^G • p^G • (1 − p^G) − π^S • q^S • (1 − q^S)‖∞ = O(ε). Note that ‖π^G − π^S‖∞ = O(ε) and q^S_k(1 − q^S_k) < 1, so it is sufficient to show that ‖π^G • p^G • (1 − p^G) − π^G • q^S • (1 − q^S)‖∞ = O(ε). We have
‖π^G • p^G • (1 − p^G) − π^G • q^S • (1 − q^S)‖∞ = ‖π^G • (p^G − q^S) − π^G • (p^G + q^S) • (p^G − q^S)‖∞ ≤ 3‖π^G • (p^G − q^S)‖∞.
Let n_k denote the number of times that the event Π_k happens over G, and let t_k be the number of times that X(i) = 1 when Π_k = Π_{i,a} happens. Because S is obtained by changing at most εN samples in G, we get
|π^G_k • (p^G_k − q^S_k)| ≤ (n_k/N) · ((t_k + εN)/(n_k − εN) − t_k/n_k) = ε(n_k + t_k)/(n_k − εN) ≤ 2εn_k/(0.5n_k) = 4ε.
The last inequality follows from t_k ≤ n_k and n_k − εN ≥ 0.5n_k (because we assume the minimum parental configuration probability is Ω(ε)).

This concludes ‖π^G • p^G • (1 − p^G) − π^S • q^S • (1 − q^S)‖∞ = O(ε). Similarly, in order to prove that ‖π^P • p • (1 − p) − π^G • p^G • (1 − p^G)‖∞ = O(ε), we just need to show ‖√π^P • (p − p^G)‖₂ = O(ε), which follows from (ii) in Lemma A.2. An application of the triangle inequality finishes the proof.

We are now ready to prove Lemma 3.2.

Proof of Lemma 3.2. By Lemma B.3 and the fact that ‖s‖∞ = O(1/√(αc)), we know that for any subset T ⊂ G with |T| ≥ (1 − ε)|G|, we have
‖(1/|T|) Σ_{i∈T} (f̃(X_i, q) − π^P • (p̄ − q̄))‖₂ ≤ O(δ₁/√(αc) + ε‖p̄ − q̄‖₂), and
‖(1/|T|) Σ_{i∈T} (f̃(X_i, q) − π^P • (p̄ − q̄))(f̃(X_i, q) − π^P • (p̄ − q̄))^⊤ − Σ̃‖₂ ≤ O(δ₂/(αc) + B + √B),
where B = ‖√π^P • (p̄ − q̄)‖₂², and Σ̃ is the true covariance of f̃(X, p). This is because the scaling is applied to all vectors on both sides of the inequalities, so we only need to scale the scalars δ₁ and δ₂ appropriately. We conclude the proof by replacing Σ̃ in the second-order condition with I using Lemma B.4.
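The two facts underlying this scaling step — Cov[f(X, p)] = diag(π^P • p • (1 − p)) (Lemma A.1(ii)), and that multiplying by s = 1/√(π • p • (1 − p)) whitens this covariance to the identity — can be checked by exact enumeration on a toy two-node network X₁ → X₂. The parameters below are arbitrary illustrative values, not from the paper.

```python
import itertools
import numpy as np

# Toy network X1 -> X2 with m = 3 conditional-probability entries:
# k=0: Pr[X1=1]; k=1: Pr[X2=1 | X1=0]; k=2: Pr[X2=1 | X1=1].
p = np.array([0.3, 0.6, 0.8])
pi = np.array([1.0, 1 - p[0], p[0]])   # parental configuration probabilities

def f(x, p):
    """f(x, p)_k = 1[x in Pi_k] * (x_i - p_k)."""
    return np.array([x[0] - p[0],
                     (x[1] - p[1]) if x[0] == 0 else 0.0,
                     (x[1] - p[2]) if x[0] == 1 else 0.0])

def prob(x):
    """Exact probability of the outcome x under the toy network."""
    a = p[0] if x[0] else 1 - p[0]
    b = p[2] if x[0] else p[1]
    return a * (b if x[1] else 1 - b)

outcomes = list(itertools.product([0, 1], repeat=2))
mean = sum(prob(x) * f(x, p) for x in outcomes)                 # E[f(X, p)]
cov = sum(prob(x) * np.outer(f(x, p), f(x, p)) for x in outcomes)
s = 1 / np.sqrt(pi * p * (1 - p))
cov_scaled = np.outer(s, s) * cov       # covariance of s • f(X, p)
```

Here `mean` is exactly zero, `cov` equals diag(π • p • (1 − p)), and `cov_scaled` is the 3 × 3 identity.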

C ROBUST MEAN ESTIMATION WITH SPARSE INPUT

In this section, we give a robust mean estimation algorithm that runs in nearly input-sparsity time. We build on the following lemma, which is essentially the main result of Dong et al. (2019).

Lemma C.1 (essentially Dong et al. (2019)). Given an ε-corrupted version of an (ε, β, γ)-stable set S of N samples w.r.t. a d-dimensional distribution with mean µ_X, and assuming further that ‖X‖₂ ≤ R for all X ∈ S, there is an algorithm that outputs an estimator μ̂ ∈ R^d such that, with high probability, ‖μ̂ − µ_X‖₂ ≤ O(√(εγ) + β + ε√(log(1/ε))).

Lemma C.1 is stated differently from Dong et al. (2019) because we use a more concise stability condition than theirs. We show that Lemma C.1 is equivalent to the version in Dong et al. (2019) in Appendix C.1.

The computational bottleneck of the algorithm in Dong et al. (2019) is a logarithmic number of runs of matrix multiplicative weights update (MMWU). In each iteration of every MMWU run, they need to compute a score for each sample. Intuitively, these scores help the algorithm decide whether it should continue to increase the weight on each sample. We define some notation before we formally define the approximate score oracle. Let Δ_N = {w ∈ R^N : 0 ≤ w_i ≤ 1, Σ_i w_i = 1} be the N-dimensional simplex. Given a set of N samples X₁, ..., X_N and a weight vector w ∈ Δ_N, let µ(w) = (1/‖w‖₁) Σ_i w_i X_i and Σ(w) = (1/‖w‖₁) Σ_i w_i (X_i − µ(w))(X_i − µ(w))^⊤ denote the empirical mean and covariance weighted by w.

Definition C.2 (Approximate Score Oracle). Given as input a set of N samples X₁, ..., X_N ∈ R^d, a sequence of t + 1 = O(log d) weight vectors w₀, ..., w_t ∈ Δ_N, and a parameter α > 0, an approximate score oracle O_apx outputs (1 ± 0.1)-approximations (τ̂_i)_{i=1}^N to each of the N scores τ_i = (X_i − µ(w_t))^⊤ U (X_i − µ(w_t)), where U = exp(α Σ_{j=0}^{t−1} Σ(w_j)) / tr[exp(α Σ_{j=0}^{t−1} Σ(w_j))]. In addition, O_apx outputs a scalar q̂ such that |q̂ − q| ≤ 0.1q + 0.05‖Σ(w_t) − I‖₂, where q = ⟨Σ(w_t) − I, U⟩.
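As a reference point for Definition C.2, the scores τ_i and the scalar q can be computed exactly with a full eigendecomposition (this costs O(d³) time, which is precisely what Algorithm 2 avoids). A minimal numpy sketch, with all function names hypothetical:

```python
import numpy as np

def exact_scores(X, ws, alpha):
    """Exact tau_i and q from Definition C.2 (no JL approximation).
    X: (N, d) array of samples; ws: weight vectors w_0, ..., w_t on the simplex."""
    def mu(w):
        return w @ X / w.sum()
    def Sigma(w):
        Z = X - mu(w)
        return (Z * w[:, None]).T @ Z / w.sum()
    B = alpha * sum(Sigma(w) for w in ws[:-1])
    lam, V = np.linalg.eigh(B)              # B is symmetric
    expB = (V * np.exp(lam)) @ V.T          # exp(B) via eigendecomposition
    U = expB / np.trace(expB)
    Z = X - mu(ws[-1])
    tau = np.einsum('ni,ij,nj->n', Z, U, Z) # tau_i = (X_i - mu)^T U (X_i - mu)
    q = np.trace((Sigma(ws[-1]) - np.eye(X.shape[1])) @ U)
    return tau, q
```

For α = 0, U = I/d and τ_i reduces to ‖X_i − µ(w_t)‖₂²/d, which is a convenient sanity check.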
These scores are computed using the Johnson–Lindenstrauss lemma. Our algorithm for computing them is given in Algorithm 2. Let r = O(log N log(1/δ)), ℓ = O(log d), and let Q ∈ R^{r×d} be a matrix with i.i.d. entries drawn from N(0, 1/r). Algorithm 2 computes an r × d matrix
A = Q · P_ℓ((α/2) Σ_{j=0}^{t−1} Σ(w_j)),   (6)
where P_ℓ(Y) = Σ_{j=0}^{ℓ} Y^j/j! is the degree-ℓ Taylor approximation to exp(Y). The estimates for the individual scores are then given by
τ̂_i = (1/tr(AA^⊤)) ‖A(X_i − µ(w_t))‖₂²,   (7)
and the estimate for q is given by
q̂ = Σ_{i=1}^N (τ̂_i − 1).   (8)

Algorithm 2: Nearly-linear time approximate score computation
Input: A set S of N samples X₁, ..., X_N ∈ R^d, a sequence of weight vectors w₀, ..., w_t, a parameter α, and a failure probability δ > 0.
Let r = O(log N log(1/δ)) and ℓ = O(log d);
Let Q ∈ R^{r×d} have entries drawn i.i.d. from N(0, 1/r);
Compute the matrix A ∈ R^{r×d} as in Equation 6;
return (τ̂_i)_{i=1}^N given by Equation 7 and q̂ given by Equation 8;

The correctness of Algorithm 2 was proved in Dong et al. (2019).

Lemma C.3 (Dong et al. (2019)). With probability at least 1 − δ, the output of Algorithm 2 satisfies |q̂ − q| ≤ 0.1q + 0.05‖Σ(w_t) − I‖₂ and |τ̂_i − τ_i| ≤ 0.1τ_i for all 1 ≤ i ≤ N.

By the triangle inequality, we have ‖π^G • (p^G − q^S)‖₂ ≤ ‖π^G • p^G − π^S • q^S‖₂ + ‖π^G − π^S‖₂ ≤ 3ε√d. Using the condition in Equation 1 from Section 2.3, i.e., ‖π^G • (p^G − p)‖₂ ≤ O(ε), we get ‖π^G • (p − q^S)‖₂ ≤ O(ε√d). Now by Equation 2 from Section 2.3 and the assumption that the minimum parental configuration probability min_k π^P_k = α = Ω(ε), we have π^P_k ≤ π^G_k + O(ε) ≤ O(π^G_k), and hence ‖π^P • (p − q^S)‖₂ ≤ O(ε√d). After scaling by s, we have ‖π^P • (p̄ − q̄₀)‖₂ ≤ O(ε√d/√(αc)).

Lemma 4.3 shows that, when q is relatively far from p, the algorithm can find a new q such that ‖π^P • (p̄ − q̄)‖₂ decreases by a constant factor.

Lemma 4.3. Consider the same setting as in Theorem 4.1. Assume the conditions in Section 2.3 hold.
Fix an iteration t in Algorithm 1. Assume the robust mean estimation algorithm A_mean succeeds. If ‖π^P • (p̄ − q̄_t)‖₂ ≤ ρ_t and ρ_t = Ω(ε√(log(1/ε))/√(αc)), then we have ‖π^P • (p̄ − q̄_{t+1})‖₂ ≤ c₁ρ_t for some universal constant c₁ < 1.

Proof. We assume ρ_t > c₄ · ε√(log(1/ε))/√(αc) and α > c₅ε for some sufficiently large universal constants c₄ and c₅. Because ‖π^P • (p̄ − q̄_t)‖₂ ≤ ρ_t, Lemma 3.2 shows that {f̃(X_i, q_t)}_{i∈S} is (ε, β_t, γ_t)-stable with β_t = O(ερ_t/α) and γ_t = O((ρ_t)²/α + ρ_t/√α). By Lemma 3.3, the robust mean estimation oracle A_mean, which we assume to succeed, outputs a ν ∈ R^m such that, for some universal constant c₃,
‖ν − π^P • (p̄ − q̄_t)‖₂ ≤ c₃ (√(ε/α) ρ_t + √(ερ_t)/α^{1/4} + (ε/α) ρ_t + ε√(log(1/ε))/√(αc)) < (c₃/√c₅ + c₃/√c₄ + c₃/c₅ + c₃/c₄) ρ_t.
From Section 2.3, we have ‖π^S − π^P‖∞ = O(ε), which implies ‖(π^S − π^P) • (p̄ − q̄_t)‖₂ ≤ (ε/α)‖π^P • (p̄ − q̄_t)‖₂ ≤ (ε/α)ρ_t. By the triangle inequality, we have
‖ν − π^S • (p̄ − q̄_t)‖₂ ≤ (c₃/√c₅ + c₃/√c₄ + (c₃ + 1)/c₅ + c₃/c₄) ρ_t.
Algorithm 1 sets q̄_{t+1} = ν • (1/π^S) + q̄_t, which is equivalent to π^S • (p̄ − q̄_{t+1}) = π^S • (p̄ − q̄_t) − ν. Since ‖π^S − π^P‖∞ = O(ε) and α = Ω(ε), we have π^P_i ≤ 1.1π^S_i for all 1 ≤ i ≤ m. Putting everything together, and letting c₁ = 1.1(c₃/√c₅ + c₃/√c₄ + (c₃ + 1)/c₅ + c₃/c₄), we have
‖π^P • (p̄ − q̄_{t+1})‖₂ ≤ 1.1‖π^S • (p̄ − q̄_{t+1})‖₂ < 1.1(c₃/√c₅ + c₃/√c₄ + (c₃ + 1)/c₅ + c₃/c₄) ρ_t = c₁ρ_t.
Because c₄ and c₅ can be sufficiently large, we have c₁ < 1, as needed.

Lemma 4.4 shows that, when the algorithm terminates, the output Q is close to the ground-truth P in total variation distance.

Lemma 4.4. Consider the same setting as in Theorem 4.1. Assume the conditions in Section 2.3 hold. Let Q be a Bayesian network that shares the same structure as P. Suppose that (1) P is c-balanced, (2) α = Ω(r + ε/c), and (3) ‖√π^P • (p − q)‖₂ ≤ r/2. Then we have d_TV(P, Q) ≤ r.

Proof of Lemma 4.4. We have (p_k + q_k)(2 − p_k − q_k) ≥ p_k(1 − p_k).
Hence,
Σ_k √(π^P_k π^Q_k) (p_k − q_k)² / ((p_k + q_k)(2 − p_k − q_k)) ≤ Σ_k √(π^P_k π^Q_k) · π^P_k (p_k − q_k)² / (π^P_k p_k (1 − p_k)).
From the proof of Lemma B.4, we know |π^P_k p_k(1 − p_k) − 1/s_k²| = O(ε) and π^P_k p_k(1 − p_k) = Ω(αc) = Ω(ε), so we have
Σ_k √(π^P_k π^Q_k) · π^P_k (p_k − q_k)² / (π^P_k p_k (1 − p_k)) ≤ 1.1 Σ_k √(π^P_k π^Q_k) · π^P_k (p_k − q_k)² s_k² = 1.1 Σ_k √(π^P_k π^Q_k) · π^P_k (p̄_k − q̄_k)².
It suffices to show that |π^P_k − π^Q_k| ≤ r for all k: since α = Ω(r), this implies π^Q_k ≤ 1.1π^P_k, which together with the inequalities above implies d_TV(P, Q) ≤ 2‖√π^P • (p − q)‖₂ ≤ r.
Let P_{≤i} and Q_{≤i} be the distributions of the first i coordinates of P and Q, respectively. We prove |π^P_k − π^Q_k| ≤ r by induction on i. Suppose that for all 1 ≤ j < i and all a ∈ {0,1}^{|parents(j)|} we have |π^P_{j,a} − π^Q_{j,a}| ≤ r; then d_TV(P_{≤(i−1)}, Q_{≤(i−1)}) ≤ r. Because the events Π_{i,a} depend only on the coordinates j < i, we have |π^P_{i,a} − π^Q_{i,a}| ≤ d_TV(P_{≤(i−1)}, Q_{≤(i−1)}) ≤ r for all a. Consequently, we have d_TV(P, Q) = d_TV(P_{≤d}, Q_{≤d}) ≤ r.
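The conclusion of Lemma 4.4 — total variation distance controlled by ‖√π^P • (p − q)‖₂ — can be sanity-checked by exact enumeration on a toy two-node network X₁ → X₂. The parameters below are arbitrary illustrative values well inside the balancedness regime, not from the paper.

```python
import itertools
import numpy as np

def joint(params):
    """Exact joint distribution of the 2-node network X1 -> X2.
    params = (Pr[X1=1], Pr[X2=1 | X1=0], Pr[X2=1 | X1=1])."""
    p1, p20, p21 = params
    probs = {}
    for x1, x2 in itertools.product([0, 1], repeat=2):
        a = p1 if x1 else 1 - p1
        b = p21 if x1 else p20
        probs[(x1, x2)] = a * (b if x2 else 1 - b)
    return probs

p = np.array([0.5, 0.5, 0.5])
q = np.array([0.45, 0.55, 0.5])
P, Q = joint(p), joint(q)
tv = 0.5 * sum(abs(P[x] - Q[x]) for x in P)          # exact d_TV(P, Q)
pi_P = np.array([1.0, 1 - p[0], p[0]])               # parental config probs of P
bound = 2 * np.linalg.norm(np.sqrt(pi_P) * (p - q))  # 2 * ||sqrt(pi^P) . (p - q)||_2
```

Here `tv` evaluates to 0.0525 while `bound` is about 0.1225, consistent with the lemma.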



Throughout the paper, we use Õ(f) to denote O(f · polylog(f)).




Moreover, this algorithm runs in time O((nnz(S) + N + d + T(O_apx)) · log R), where nnz(S) is the total number of nonzeros in the samples in S and T(O_apx) is the runtime of an approximate score oracle as defined in Definition C.2.

From the proof of Lemma B.4, we know |π^P_k p_k(1 − p_k) − 1/s_k²| = O(ε) and π^P_k p_k(1 − p_k) ≥ (1/2) π^P_k min(p_k, 1 − p_k) = Ω(αc).

ACKNOWLEDGMENTS

Part of this work was done while Yu Cheng was visiting the Institute for Advanced Study. Part of this work was done while Honghao Lin was an undergraduate student at Shanghai Jiao Tong University.


Proof. We first show that the matrix A ∈ R^{r×d} can be computed in time O(r·ℓ·t·(N + d + nnz(S))), where r = O(log N log(1/δ)). We multiply each row of Q (from the left) through the matrix polynomial to obtain A. Let v ∈ R^{1×d} be one of the rows of Q and let w ∈ R^N be any weight vector. Observe that we can compute all of the scalars v(X_i − µ(w)) in time O(nnz(S) + N + d): we compute µ(w) and vµ(w) just once, then compute vX_i for every i and subtract vµ(w) from it. Then, we can compute vΣ(w) = (1/‖w‖₁) Σ_i w_i (v(X_i − µ(w))) (X_i − µ(w))^⊤ as a weighted sum of N sparse vectors minus a dense vector, in time O(nnz(S) + N + d).
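The row-by-row evaluation can be sketched as follows: a degree-ℓ Taylor polynomial of exp applied to half the exponent, evaluated with Horner's rule using only vector–matrix products, followed by a Gaussian sketch. The parameters (r, ℓ, α) below are illustrative, not the paper's constants, and with modest r the scores are only coarse approximations; for simplicity the sketch forms Σ densely rather than exploiting sparsity.

```python
import numpy as np

def rows_times_exp_taylor(B, V, ell):
    """Return V @ P_ell(B), where P_ell is the degree-ell Taylor polynomial
    of exp, via Horner's rule: only vector-matrix products with B are used."""
    R = V.copy()
    for j in range(ell, 0, -1):
        R = V + (R @ B) / j
    return R

rng = np.random.default_rng(0)
N, d, r, ell, alpha = 500, 10, 400, 12, 0.3
X = rng.normal(size=(N, d))
Z = X - X.mean(axis=0)
Sigma = Z.T @ Z / N

# Sketch A = Q_mat @ P_ell((alpha/2) * Sigma), so that A^T A ~ exp(alpha * Sigma).
Q_mat = rng.normal(size=(r, d)) / np.sqrt(r)
A = rows_times_exp_taylor((alpha / 2) * Sigma, Q_mat, ell)
AZ = A @ Z.T
tau_hat = (AZ**2).sum(axis=0) / (A**2).sum()   # JL estimates of the scores

# Exact scores for comparison (O(d^3) eigendecomposition).
lam, W = np.linalg.eigh(alpha * Sigma)
expB = (W * np.exp(lam)) @ W.T
tau = np.einsum('ni,ij,nj->n', Z, expB, Z) / np.trace(expB)
```

Increasing r tightens the approximation, matching the (1 ± 0.1) guarantee of Lemma C.3 for r = O(log N log(1/δ)) with suitable constants.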

Therefore, for any weight vector w, we can compute vΣ(w) in time O(nnz(S) + N + d).

Because P_ℓ is a degree-ℓ matrix polynomial of Σ_{j=0}^{t−1} Σ(w_j), we can use Horner's method for polynomial evaluation to compute v · P_ℓ((α/2) Σ_{j=0}^{t−1} Σ(w_j)) in time O(ℓ·t·(nnz(S) + d + N)). Since we need to multiply each of the r rows of Q through the polynomial, we can compute A in time O(r·ℓ·t·(nnz(S) + d + N)).

It remains to show that (τ̂_i)_{i=1}^N and q̂, as defined in Equations 7 and 8, can be computed quickly. Note that tr(AA^⊤) is the entrywise inner product of A with itself, so it can be computed in time O(rd). Each vector A(X_i − µ(w_t)) equals AX_i − Aµ(w_t), where Aµ(w_t) is computed only once in time O(rd). Because r = O(log N log(1/δ)), we can compute all τ̂_i in time O(r·(nnz(S) + d)). Given the τ̂_i's, q̂ can be computed in O(N) time. Recall that r = O(log N log(1/δ)) and ℓ = O(log d). Putting everything together, the overall runtime of the oracle is T(O_apx) = O(r·ℓ·t·(nnz(S) + N + d)).

C.1 EQUIVALENCE OF STABILITY CONDITIONS

In this section, we show the equivalence between Lemma C.1 and Lemma C.6. Lemma C.6 is a restatement of the result of Dong et al. (2019) using their stability notions. We first state the stability condition used throughout Dong et al. (2019).

Definition C.5 (Dong et al. (2019)). We say a set of points S is (ε, γ₁, γ₂, β₁, β₂)-good with respect to a distribution D with true mean µ if, for any subset T ⊂ S with |T| = 2ε|S|, the empirical first and second moments of T deviate from µ and the covariance of D by at most the corresponding parameters.

Then, Dong et al. (2019) showed the following result.

Lemma C.6. Let D be a distribution on R^d with unknown mean µ. Let 0 < ε < ε₀ for some universal constant ε₀. Let S be a set of N samples of which an ε-fraction is corrupted, such that the set S_g of good samples is (ε, γ₁, γ₂, β₁, β₂)-good with respect to D. Let O_apx be an approximate score oracle for S. Suppose ‖X‖₂ ≤ R for all X ∈ S. Then there is an algorithm QUEScoreFilter(S, O_apx, δ) that outputs a μ̂ such that, with high probability, ‖μ̂ − µ‖₂ is bounded in terms of the goodness parameters.

We first show the connection between our stability notion (Definition 2.5) and theirs (Definition C.5).

Lemma C.7.
Fix a d-dimensional distribution D with mean µ. If a set S of N samples is (ε, β, γ)-stable with respect to D, then S is (ε, β, γ, β/ε, γ/ε + 3β²/ε²)-good with respect to D.

Proof. Consider any subset T ⊂ S with |T| = 2ε|S|. Applying the stability of S to the two subsets S and S \ T and taking the difference controls the first moment of T; the second moment of T is controlled similarly. Combining the two inequalities via the triangle inequality yields the claimed goodness parameters; the last step in each case uses the assumption that S is (ε, β, γ)-stable with respect to D.

When S is (ε, β, γ)-stable, Lemma C.7 thus shows that S is (ε, β, γ, β/ε, γ/ε + 3β²/ε²)-good with respect to D, so Lemma C.6 applies and gives the error guarantee stated in Lemma C.1.

D OMITTED PROOFS FROM SECTION 4

In this section, we prove the technical lemmas from Section 4. We restate each lemma before proving it. Lemma 4.2 states that the (scaled) initial estimate is not too far from the true conditional probabilities p.

Lemma 4.2. Consider the same setting as in Theorem 4.1. Assume the conditions in Section 2.3 hold. In Algorithm 1, we have ‖π^P • (p̄ − q̄₀)‖₂ ≤ O(ε√d/√(αc)).

Proof. Recall that q₀ = q^S is the vector of empirical conditional probabilities over S, and that v̄ = v • s, where s is the scaling vector with ‖s‖∞ ≤ O(1/√(αc)). Let π^G and p^G be the empirical parental configuration probabilities and conditional probabilities given by G. We first show that ‖π^G − π^S‖₂ ≤ O(ε√d). Let n^G_k and n^S_k denote the number of times that Π_k happens in G and in S, respectively. Note that changing one sample in G can increase or decrease each n^G_k by at most 1. Moreover, in a single sample, exactly d parental configuration events happen, so changing one sample can affect at most 2d of the n^G_k's. Since S is obtained from G by changing εN samples, we have |n^G_k − n^S_k| ≤ εN for every k and Σ_k |n^G_k − n^S_k| ≤ 2εdN, which together imply ‖π^G − π^S‖₂ ≤ O(ε√d). By a similar argument, we can show that ‖π^G • p^G − π^S • q^S‖₂ ≤ O(ε√d), where π^G_k · p^G_k is the probability that Π_k happens and X(i) = 1 over G (for k = (i, a)).

