UNDERSTANDING THE ROLE OF IMPORTANCE WEIGHTING FOR DEEP LEARNING

Abstract

The recent paper by Byrd & Lipton (2019), based on empirical observations, raises a major concern about the impact of importance weighting on over-parameterized deep learning models. They observe that as long as the model can separate the training data, the impact of importance weighting diminishes as training proceeds. Nevertheless, a rigorous characterization of this phenomenon has been lacking. In this paper, we provide formal characterizations and theoretical justifications of the role of importance weighting with respect to the implicit bias of gradient descent and margin-based learning theory. We reveal both the optimization dynamics and the generalization performance under deep learning models. Our work not only explains the various novel phenomena observed for importance weighting in deep learning, but also extends to settings where the weights are optimized as part of the model, which applies to a number of topics under active research.

1. INTRODUCTION

Importance weighting is a standard tool for estimating a quantity under a target distribution when only samples from some source distribution are accessible. It has been drawing extensive attention in the statistics and machine learning communities. Research on causal inference for deep learning relies heavily on propensity score weighting, which is applied to off-policy optimization with counterfactual estimators (Gilotte et al., 2018; Jiang & Li, 2016), modelling with observational feedback (Schnabel et al., 2016; Xu et al., 2020), and learning from controlled interventions (Swaminathan & Joachims, 2015). Importance weighting methods are also applied to characterize distribution shifts for deep learning models (Fang et al., 2020), with modern applications such as domain adaptation (Azizzadenesheli et al., 2019; Lipton et al., 2018) and learning from noisy labels (Song et al., 2020). Other usages include curriculum learning (Bengio et al., 2009) and knowledge distillation (Hinton et al., 2015), where the weights characterize the model's confidence in each sample. To reduce the discrepancy between the source and target distributions for model training, a standard routine is to minimize a weighted risk (Rubinstein & Kroese, 2016). Many techniques have been developed to this end, and the common strategy is re-weighting the classes proportionally to the inverse of their frequencies (Huang et al., 2016; 2019; Wang et al., 2017). For example, Cui et al. (2019) propose re-weighting by the inverse of the effective number of samples. The focal loss (Lin et al., 2017) down-weights the well-classified examples, and the work by Li et al. (2019) suggests an improved technique that down-weights examples based on the magnitude of the gradients. Despite the empirical successes of various re-weighting methods, it remains unclear, from a theoretical standpoint, how importance weighting exerts its influence.
The recent study of Byrd & Lipton (2019) observes from experiments that importance weights have little impact on the converged deep neural network if the data can be separated by the model using gradient descent. They connect this phenomenon to the implicit bias of gradient descent (Soudry et al., 2018), a novel topic that studies why over-parameterized models trained on separable data are biased toward solutions that generalize well. The implicit bias of gradient descent has been observed and studied for linear models (Soudry et al., 2018; Ji & Telgarsky, 2018b), linear neural networks (Ji & Telgarsky, 2018a; Gunasekar et al., 2018), two-layer neural networks with homogeneous activations (Chizat & Bach, 2020), and smooth neural networks (Nacson et al., 2019; Lyu & Li, 2019). To summarize, these works reveal that the direction of the parameters (for linear predictors) and the normalized margin (for nonlinear predictors), regardless of the initialization, respectively converge to those of a max-margin solution. The pivotal role of the margin for deep learning models has been explored actively after the long journey of understanding the generalization of over-parameterized neural networks (Bartlett et al., 2017; Golowich et al., 2018; Neyshabur et al., 2018). For instance, Wei et al. (2019) study the margin of neural networks for separable data under weak regularization. They show that the normalized margin also converges to the max-margin solution, and provide a generalization bound for a neural network that hinges on its margin. Although there are rich understandings of the implicit bias of gradient descent and margin-based generalization, very few efforts are dedicated to studying how they adjust to the weighted empirical-risk minimization (ERM) setting. The established results do not directly transfer, since importance weighting can change both the optimization geometry and how the generalization is measured.
In this paper, we fill this gap by showing the impact of importance weighting on the implicit bias of gradient descent as well as on the generalization performance. By studying the optimization dynamics of linear models, we first reveal the effect of importance weighting on the convergence speed under linearly separable data. When the data is not linearly separable, we characterize the unique role of importance weighting in defining the intercept term on top of the implicit bias. We then investigate non-linear neural networks under a weak regularization, as in Wei et al. (2019). We provide a novel generalization bound that reflects how importance weighting leads to an interplay between the empirical risk and a compounding term that consists of the model complexity as well as the deviation between the source and target distributions. Based on our theoretical results, we discuss several exploratory developments on importance weighting that are worthy of further investigation.
• A good set of weights for learning can be inversely proportional to the margin, i.e., proportional to how hard a sample is to classify. For example, a sample that is close to (far from) the oracle decision boundary should have a large (small) weight.
• If the importance weights are jointly trained according to a weighting model, the impact of the weighting model eventually diminishes after showing a strong correlation with the hard-to-classify extent, such as the margin.
• The usefulness of explicit regularization on weighted ERM can be studied, via its impact on the margin, in terms of balancing the empirical loss and the distribution divergence.
In summary, our contributions are threefold.
• We characterize the impact of importance weighting on the implicit bias of gradient descent.
• We establish a generalization bound that hinges on the importance weights. For finite-step training, the role of importance weighting in the generalization bound is reflected in how the margin is affected, and in how it balances the source and target distributions.
• We propose several exploratory topics for importance weighting that are worth further investigation from both the application and theoretical perspectives.
The rest of the paper is organized as follows. In Section 2, we introduce the background, preliminary results, and the experimental setup. In Sections 3 and 4, we demonstrate the influence of importance weighting for linear and non-linear models in terms of the implicit bias of gradient descent and the generalization performance. We then discuss extended investigations in Section 5.

2. PRELIMINARIES

We use bold-font letters for vectors and matrices, uppercase letters for random variables and distributions, and ‖·‖ to denote the ℓ2 norm when no confusion arises. We denote the training data by D = {(w_i, x_i, y_i)}_{i=1}^n, where x_i ∈ X denotes the features, y_i is binary or categorical, and the importance weight is bounded such that w_i ∈ [1/M, M] for some M > 1. We mention that the importance weights are often defined with respect to the source distribution P_s, from which the training data is drawn, and the target distribution P_t. We do not make this assumption here, because importance weighting is often applied for more general purposes; therefore, w_i can be defined arbitrarily. We use f(θ, x) to denote the predictor and define F = {f(θ, ·) | θ ∈ Θ ⊂ R^d}. For ease of notation, we focus on the binary setting: y_i ∈ {−1, +1} with f(θ, x) ∈ R. However, it will become clear later that our results can be easily extended to the multi-class setting. Consider the weighted empirical risk minimization (ERM) task with the risk given by L(θ; w) = (1/n) Σ_{i=1}^n w_i ℓ(y_i f(θ, x_i)) for some non-negative loss function ℓ(·). The weight-agnostic counterpart is denoted by L(θ) = (1/n) Σ_{i=1}^n ℓ(y_i f(θ, x_i)). We focus particularly on the exponential loss ℓ(u) = exp(−u) and the log loss ℓ(u) = log(1 + exp(−u)). For the multi-class problem where y_i ∈ [k], we extend our setup using the softmax function, where the logits are now given by {f_j(θ, x)}_{j=1}^k. For optimization, we consider using gradient descent to minimize the total loss: θ^{(t+1)}(w) = θ^{(t)}(w) − η_t ∇L(θ; w)|_{θ=θ^{(t)}(w)}, where the learning rate η_t can be constant or step-dependent.
From parameter norm divergence to support vectors. Suppose D is separated by f(θ^{(t)}, x) after some point during training. The key factor that contributes to the implicit bias, for both linear and non-linear predictors under a weak regularization, is that the norm of the parameters diverges after separation, i.e.
lim_{t→∞} ‖θ^{(t)}‖_2 = ∞, as a consequence of using gradient descent. Now we examine ‖θ^{(t)}(w)‖_2. The heuristic is that if ℓ(·) is exponential-like, multiplying by w_i only changes its tail property up to a constant, while the asymptotic behavior is not affected. In particular, the necessary conditions for norm divergence under gradient descent can be summarized as:
• C1. The loss function ℓ(·) has an exponential tail behavior (which we formalize in Appendix A.1) such that lim_{u→∞} ℓ(u) = lim_{u→∞} ℓ'(u) = 0;
• C2. The predictor f(θ, x) is α-homogeneous, such that f(c·θ, x) = c^α f(θ, x) for all c > 0.
In addition, we need certain regularities from f(θ, x) to ensure the existence of critical points and the convergence of gradient descent:
• C3. For any x ∈ X, f(·, x) is β-smooth and l-Lipschitz on R^d.
C1 can be satisfied by the exponential loss, the log loss, and the cross-entropy loss under the multi-class setting. For standard deep learning models such as the multilayer perceptron (MLP), C2 implies that the activation functions are homogeneous, such as ReLU and LeakyReLU, and that bias terms are disallowed. C3 is a common technical assumption whose practical implications are discussed in Appendix A.1. Among the three necessary conditions, importance weighting only affects C1 up to a constant, so its impact on the norm divergence diminishes in the asymptotic regime. The formal statement is provided below.
Claim 1. There exists a constant learning rate for gradient descent such that for any w ∈ [1/M, M]^n, with a weak regularization, lim_{t→∞} ‖θ^{(t)}(w)‖ = ∞ under C1-C3.
Compared with previous work, we extend the norm divergence result not only to weighted ERM but also to a more general setting where a weak regularization is considered. We defer the proof to Appendix A.1. A direct consequence of parameter norm divergence is that both the risk and the gradient are dominated by the terms with the smallest margin, i.e.
arg min_i y_i f(θ, x_i), which are also referred to as the "support vectors". To make sense of this point, notice that both the risk and the gradient have the form Σ_i C_i exp(−y_i f(θ, x_i)), where the C_i are low-order terms. Since f(θ, x_i) = ‖θ‖_2^α f(θ/‖θ‖_2, x_i) due to the homogeneity assumption in C2, it holds that lim_{t→∞} exp(−y_i f(θ^{(t)}(w), x_i)) = 0, with the smallest-margin terms vanishing the slowest and thus dominating. Therefore, the decision boundaries may share certain characteristics with the support vector machine (SVM), since they rely on the same support vectors. As a matter of fact, the current understandings of the implicit bias of gradient descent are mostly established via the connection with the hard-margin SVM: min_{θ∈R^d} ‖θ‖_2 s.t. y_i f(θ, x_i) ≥ 1, ∀i = 1, 2, . . . , n, whose optimization path coincides with that of the max-margin problem: max_{‖θ‖_2 ≤ 1} min_{i=1,...,n} y_i f(θ, x_i), as shown by Nacson et al. (2019). Define γ(θ) := min_i y_i f(θ, x_i). We use θ* to denote the optimal solution and γ* = γ(θ*) := min_i y_i f(θ*, x_i) to denote the corresponding margin.
Implicit bias of gradient descent. We start by considering the weight-agnostic setting. When D is linearly separable, it is reasonable to conjecture that the separating hyperplane under a linear f(θ, ·) overlaps with the solution of the hard-margin SVM. Soudry et al. (2018) and Ji & Telgarsky (2018b) first show that θ^{(t)} converges in direction to θ*, i.e. lim_{t→∞} θ^{(t)}/‖θ^{(t)}‖_2 = θ*. For nonlinear predictors, however, the parameter direction is less meaningful. Instead, it has been pointed out that neural networks often achieve perfect separation of the training data (Zhang et al., 2016). Therefore, we are more interested in the margin, whose pivotal role in the generalization of neural networks has been studied extensively (Neyshabur et al., 2017; Bartlett et al., 2017; Golowich et al., 2018). Specifically, it has been shown in Nacson et al.
(2019) and Lyu & Li (2019) that the normalized margin, defined by γ̄(θ^{(t)}) := γ(θ^{(t)}/‖θ^{(t)}‖_2), converges to the maximum margin γ* without regularization. It becomes clear at this point that to understand the role of importance weighting for deep learning, we must characterize the impact of the weights on the implicit bias, since it reveals the optimization geometry and the generalization performance. Formally, we address the following critical questions.
• Q1. Does importance weighting modify the convergence results (convergence in direction for linear predictors and in normalized margin for nonlinear predictors)?
• If the convergence results remain unchanged, then:
- Q2. in what way is importance weighting affecting the optimization process;
- Q3. how does importance weighting influence the generalization from the source distribution to the target distribution?
Experiment setup. Throughout this paper, we use the standard linear model as the linear predictor. The nonlinear predictor is a two-layer MLP with five hidden units and ReLU as the activation function. All the models are trained with gradient descent using 0.1 as the learning rate. We use the exponential loss and the standard normal initialization. The generated datasets for our illustrative experiments are shown in Figure 1, which correspond to the different settings of our major topics.
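The weighted ERM dynamics above can be sketched in a few lines. The following is a minimal illustration on our own toy data (not the paper's experiment) of gradient descent on the weighted exponential loss with a linear predictor; once the data is separated, the parameter norm keeps growing, as Claim 1 formalizes.

```python
import numpy as np

# Toy sketch (our own setup): gradient descent on the weighted
# exponential-loss ERM with a linear predictor f(theta, x) = <theta, x>.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([2.0, 2.0], 0.3, (20, 2)),
               rng.normal([-2.0, -2.0], 0.3, (20, 2))])
y = np.concatenate([np.ones(20), -np.ones(20)])
w = np.ones(40) / 40.0            # uniform weights; any w in [1/M, M]^n works

theta = 0.01 * rng.normal(size=2)
eta = 0.1
norms = []
for t in range(2000):
    m = y * (X @ theta)                       # margins y_i <theta, x_i>
    coef = w * np.exp(-m)                     # w_i * exp(-y_i f(theta, x_i))
    grad = -(coef[:, None] * y[:, None] * X).sum(axis=0)
    theta -= eta * grad
    norms.append(np.linalg.norm(theta))

separated = np.all(y * (X @ theta) > 0)
# After separation the norm diverges (slowly, ~log t): the weighted loss can
# always be reduced by scaling theta up, so the iterates never stop growing.
print(separated, norms[-1] > norms[1000] > norms[100])
```

The weight vector `w` only rescales each sample's contribution; as the text notes, it perturbs the tail behavior by a constant factor and does not stop the norm from diverging.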

3. IMPORTANCE WEIGHTING FOR LINEAR PREDICTOR

We begin with linear predictors, which allow a more refined analysis of the gradient dynamics. Without loss of generality, we assume the exponential loss. Also, we do not consider the weak regularization here, since its practical impact on linear models is trivial when λ → 0 (Rosset et al., 2004a;b), but this is not the case for nonlinear predictors. One sophistication with linear predictors is that the data may not be perfectly separated, as opposed to the nonlinear case where neural networks can in theory separate any non-degenerate data. With this kept in mind, we first assume D is linearly separable and characterize the new convergence result in the following proposition.
Proposition 1. With a constant learning rate η_t ≲ β^{−1}, and normalizing the weights w ∈ [1/M, M]^n such that Σ_i w_i = 1 without loss of generality, it holds that:
‖θ^{(t)}(w)/‖θ^{(t)}(w)‖_2 − θ*‖ ≲ (log n + D_KL(p* ‖ w) + M) / (log t · γ*),
where p* = [p*_1, . . . , p*_n] characterizes the dual optimum of the hard-margin SVM, such that θ* = Σ_{i=1}^n y_i x_i · p*_i with p*_i ≥ 0 and Σ_{i=1}^n p*_i = 1. Here, D_KL is the Kullback-Leibler divergence.
We leave the proof to Appendix A.2. We find that importance weighting changes neither the convergence result nor the 1/log t convergence rate. However, it does affect the convergence speed under finite-step optimization. In particular, we show that the extra constant term induced by importance weighting is given by the KL-divergence between the (normalized) weights and the dual optimum of the hard-margin SVM, where samples with smaller margins usually have larger values. Therefore, importance weighting may accelerate gradient descent in finite-step optimization by matching the weights with the inverse margin. As we show in Figures 2a and 2b, this type of "inverse-margin weighted" design is able to accelerate the convergence and bring better performance under finite-step optimization.
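The KL constant in Proposition 1 can be made concrete on a toy problem. In the sketch below (our own three-point example, not from the paper), we find the max-margin direction by a grid search, recover the dual variables p* supported on the support vectors, and compare D_KL(p* ‖ w) for inverse-margin, uniform, and margin-proportional weights; aligning w with the inverse margin yields the smallest constant in the bound.

```python
import numpy as np

# z_i = y_i * x_i for a tiny linearly separable problem (our own example).
z = np.array([[0.5, 1.0], [2.0, -1.0], [2.0, 1.0]])

# Max-margin unit direction via a fine grid search over angles.
phi = np.linspace(0.0, np.pi / 2, 200001)
U = np.stack([np.cos(phi), np.sin(phi)], axis=1)
u_star = U[(U @ z.T).min(axis=1).argmax()]
g = z @ u_star                                  # per-sample margins
gamma_star = g.min()

# Dual optimum p*: theta* = sum_i p_i z_i, supported on the support vectors.
sv = np.isclose(g, gamma_star, atol=1e-3)
p = np.zeros(len(z))
p[sv] = np.linalg.solve(z[sv].T, u_star)        # two SVs -> 2x2 system

def kl(p, q):
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

w_inv = (1 / g) / (1 / g).sum()                 # inverse-margin weights
w_uni = np.ones(3) / 3                          # uniform weights
w_pro = g / g.sum()                             # margin-proportional weights
# The KL constant in Proposition 1 is smallest for inverse-margin weights.
print(kl(p, w_inv), kl(p, w_uni), kl(p, w_pro))
```

The ordering kl(p, w_inv) < kl(p, w_uni) < kl(p, w_pro) illustrates why matching weights with the inverse margin shrinks the constant in the finite-step bound.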
When D is not linearly separable, the key insight is that we can always partition D into D_sep ∪ D_non-sep, where D_sep is the maximal linearly separable subset defined in Ji & Telgarsky (2018b). Let Π_non-sep be the (orthogonal) projection onto the subspace S spanned by the x_i's in D_non-sep, and let Π_sep be the projection onto the orthogonal complement S⊥. The partition allows us to study the two projected parts independently, since by construction we have θ^{(t)}(w) = Π_non-sep θ^{(t)}(w) + Π_sep θ^{(t)}(w). It is intuitive that the optimization path of Π_sep θ^{(t)}(w) behaves similarly to the linearly separable case as in Proposition 1, so we can focus on the properties of Π_non-sep θ^{(t)}(w), which we summarize in the following proposition.
Proposition 2 (Informal). Let L_non-sep(θ; w) be the weighted risk defined on the non-separable subset. Then, with a constant learning rate:
• θ̄(w) = arg min_θ L_non-sep(θ; w) is uniquely defined and ‖θ̄(w)‖_2 = O(1);
• ‖Π_non-sep θ^{(t)}(w) − θ̄(w)‖ ≲ (C_{θ̄(w)} + log² t / γ_sep) / t, where γ_sep is the maximum margin on D_sep and C_{θ̄(w)} = O(1).
The formal statement, which involves how D_sep is defined, is deferred to Appendix A.2 together with the proof. Proposition 2 shows that importance weighting uniquely defines the solution θ̄(w) on the non-separable subset of the data, to which Π_non-sep θ^{(t)}(w) converges. Hence, we expect lim_{t→∞} θ^{(t)}(w) = θ̄(w) + θ*_sep, where θ*_sep is the solution on the separable subset D_sep, whose direction does not depend on w, as implied by Proposition 1. We can therefore think of θ̄(w) as an intercept term, where the weights control how the intercept shifts on the subspace of the non-separable data. We also illustrate this finding in Figure 3. So far, we have provided an in-depth understanding, and our theoretical results fully explain the observations made in Byrd & Lipton (2019) on how importance weighting affects the implicit bias of gradient descent for linear predictors.
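The weight-dependent solution on the non-separable part admits a closed form in the simplest case. A one-dimensional sketch (our own example, not from the paper): two samples sharing the same feature x = 1 but with opposite labels are non-separable, and the weighted exponential risk w1·e^{−θ} + w2·e^{θ} has the unique finite minimizer θ̄(w) = ½ log(w1/w2), so the weights directly set where the "intercept" sits.

```python
import numpy as np

# Two non-separable samples: x = 1 with y = +1 (weight w1) and y = -1 (w2).
# Weighted exponential risk: L(theta) = w1 * exp(-theta) + w2 * exp(theta).
w1, w2 = 4.0, 1.0

theta = 0.0
for _ in range(5000):                       # plain gradient descent
    grad = -w1 * np.exp(-theta) + w2 * np.exp(theta)
    theta -= 0.05 * grad

# Setting the gradient to zero gives theta_bar = 0.5 * log(w1 / w2): the
# minimizer is finite, unique, and depends on the weights, unlike the
# direction on the separable subspace.
theta_bar = 0.5 * np.log(w1 / w2)
print(abs(theta - theta_bar) < 1e-6)
```

Changing (w1, w2) moves θ̄(w) continuously, which is exactly the intercept-shifting behavior Proposition 2 describes on the non-separable subspace.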
Figure 3: The role of importance weighting in defining the intercept term, in addition to the implicit bias, for the linearly non-separable case, where the hyperplane shifts in the non-separable subspace depending on the class weights.

4. IMPORTANCE WEIGHTING FOR NONLINEAR PREDICTOR

Now we investigate the influence of importance weighting on non-linear predictors, e.g., neural networks. Here we are more interested in the regularized setting: min_θ L_λ(θ; w) := L(θ; w) + λ‖θ‖^r, where r > 0 is fixed and λ is the regularization coefficient. We use the notation θ_λ(w) ∈ arg min L_λ(θ; w). Recall that γ* := max_{‖θ‖≤1} min_i y_i f(θ, x_i). Unlike the linear case, characterizing the gradient dynamics for nonlinear predictors is often insurmountable. Therefore, we mainly consider the asymptotic regime or the regime with sufficiently large t. We omit the superscript in θ^{(t)} when there is no confusion. The only assumption we need to make is:
A1. The data is separated by f at some point during gradient descent, i.e. ∃t > 0 s.t. y_i f(θ^{(t)}, x_i) > 0 for all i = 1, . . . , n. In addition, y_i f(θ*, x_i) ≥ γ* > 0 for each i.
In Section 4.1, we show that by solving the regularized objective above with an infinitesimal (weak) regularizer, gradient descent leads to the optimal margin γ*, regardless of the choice of the importance weights. In Section 4.2, we show that importance weighting affects the generalization bound via a multiplicative factor as well as via the margin in the finite-sample scenario.

4.1. MARGIN IS INVARIANT TO IMPORTANCE WEIGHTING UNDER WEAK REGULARIZATION

We show that for any bounded w, γ̄(θ_λ(w)) := γ(θ_λ(w)/‖θ_λ(w)‖) converges to γ* as λ decreases to zero. In practice, however, we might not obtain θ_λ(w) in limited time. We show that as long as the regularized objective is close enough to its optimum, the normalized margin of the associated θ(w) (under finite-step optimization) is lower bounded by γ* multiplied by a non-trivial factor. Formally:
Proposition 3. Suppose C1-C3 and A1 hold. For any w ∈ [1/M, M]^n, it follows that:
• (Asymptotic) lim_{λ→0} γ̄(θ_λ(w)) = γ*.
• (Finite steps) There exists a λ := λ(r, α, γ*, w, c) such that for θ(w) with L_λ(θ(w); w) ≤ τ L_λ(θ_λ(w); w) and τ ≤ 2, the associated normalized margin satisfies γ̄(θ(w)) ≥ c · γ*/τ^{α/r}, where 1/10 ≤ c < 1.
This result is adapted from Wei et al. (2019) and relies on Claim 1. The proof is relegated to Appendix A.4.1. We see that importance weighting does not affect the asymptotic margin when λ is sufficiently small. To get the intuition, note that when ‖θ_λ(w)‖ is large enough and λ is small enough to be ignored, L_λ(θ_λ(w); w) ≈ exp(−‖θ_λ(w)‖^α γ̄_λ), which favors a large margin. In addition, even if L_λ(θ(w); w) has not yet converged but is close enough to its optimum, the corresponding normalized margin has a reasonable lower bound. We point out that this result does not rely on the choice of λ. The assumption L_λ(θ(w); w) ≤ τ L_λ(θ_λ(w); w) has already accounted for the major influence of importance weighting in terms of the optimization: with a "good" set of importance weights, we can meet this criterion (by approaching the global optimum) faster. We leave detailed discussions to Section 5. Figure 2d also demonstrates that the choice of the importance weights has a significant influence on the convergence speed for the non-linear predictor.
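The normalized margin used above is invariant to parameter rescaling, which is what makes it a meaningful quantity for homogeneous networks. A quick numerical check (our own sketch): a bias-free two-layer ReLU network is 2-homogeneous, so scaling all parameters by c scales outputs by c² and leaves γ(θ/‖θ‖) unchanged.

```python
import numpy as np

rng = np.random.default_rng(1)
W1, W2 = rng.normal(size=(5, 3)), rng.normal(size=(1, 5))
X = rng.normal(size=(8, 3))
y = np.sign(rng.normal(size=8))

def f(W1, W2, x):
    # bias-free two-layer ReLU network: homogeneous of degree alpha = 2
    return float(W2 @ np.maximum(W1 @ x, 0.0))

def normalized_margin(W1, W2):
    # gamma(theta / ||theta||) with ||theta|| over all concatenated weights
    norm = np.sqrt(np.sum(W1**2) + np.sum(W2**2))
    return min(y[i] * f(W1 / norm, W2 / norm, X[i]) for i in range(8))

c = 3.0
out_scaling = f(c * W1, c * W2, X[0]) / f(W1, W2, X[0])
print(np.isclose(out_scaling, c**2))                       # 2-homogeneity
print(np.isclose(normalized_margin(c * W1, c * W2),
                 normalized_margin(W1, W2)))               # scale invariance
```

This scale invariance is why the divergence of ‖θ^{(t)}‖ does not obscure the margin: γ̄ filters out the growing norm and isolates the directional information.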

4.2. IMPORTANCE WEIGHTING AFFECTS THE GENERALIZATION BOUND

Proposition 3 characterizes the behavior of the margin corresponding to the optimum of L_λ(θ; w), which does not rely on the sample size. To bridge the connection between importance weighting and the behavior of f(θ, ·) in the finite-sample setting, we investigate the generalization bound of f when the training sample distribution deviates from the testing sample distribution. Let P_s be the source distribution and P_t be the target distribution, with the corresponding densities p_s(·) and p_t(·). Assume that P_s and P_t have the same support. We consider the Pearson χ²-divergence to measure the difference between P_s and P_t, i.e. D_χ²(P_t ‖ P_s) = ∫ ((dP_t/dP_s)² − 1) dP_s. The training covariates x_1, . . . , x_n are generated from P_s, and the testing covariates are generated from P_t. Denote by p_train and p_test the joint distributions of (x, y) for the training data and the testing data, respectively. We minimize the weighted objective over the H-layer feedforward neural network given by f_NN(θ, x) := W_H σ(W_{H−1} σ(· · · σ(W_1 x) · · · )), where θ = [W_1, · · · , W_H] are the parameter matrices and σ(·) is an element-wise activation function such as ReLU. Denote η(x) = p_t(x)/p_s(x). We show that the generalization performance is affected by importance weighting via the interplay between the empirical risk, which hinges on η, and a term that depends on the model complexity and the deviation of the target distribution from the source distribution.
Theorem 1. Assume σ is 1-Lipschitz and 1-positively homogeneous.
Then with probability at least 1 − δ, we have
P_{(x,y)∼p_test}(y f_NN(θ(w), x) ≤ 0) ≤ (1/n) Σ_{i=1}^n η(x_i) I[y_i f_NN(θ(w)/‖θ(w)‖, x_i) < γ]  (I)
+ C · (D_χ²(P_t ‖ P_s) + 1)/γ · H^{(H−1)/2}/√n  (II)
+ ε(γ, n, δ),
where (I) is the empirical risk, (II) reflects the compounding effect of the model complexity of the class of H-layer neural networks and the deviation between the target distribution and the source distribution, and ε(γ, n, δ) = √(log log₂(4C/γ)/n) + √(log(1/δ)/n) is a small quantity compared to (I) and (II). Here, C := sup_x ‖x‖ and γ can take any positive value.
The proof is deferred to Appendix A.4.2. Compared to Wei et al. (2019), the empirical risk (I) hinges on η and there is an additional multiplicative factor D_χ²(P_t ‖ P_s) + 1 on (II). In the two discussions below, we argue that the role of importance weighting in the generalization bound of Theorem 1 is reflected not only in how the margin is affected, but also in how it balances the source and target distributions:
1. Suppose θ(w) enables f_NN to separate the data. Let γ_{θ(w)} := min_i y_i f_NN(θ(w)/‖θ(w)‖, x_i). In the generalization bound of Theorem 1, if we let γ = γ_{θ(w)}, then (I) vanishes and only (II) remains. In this case, the importance weights affect the generalization bound via γ_{θ(w)} in finite steps, as discussed in Section 4.1. That is, within finite training steps, a good set of weights w can bring γ_{θ(w)} closer to the optimum than a bad set, thus giving a better generalization performance. Also note that Theorem 1 holds for the non-separable cases as well.
2. We point out that (II) is a strictly decreasing function of γ, while (I) is a non-decreasing step function of γ. Therefore, there must exist a trade-off γ that minimizes the sum of (I) and (II), which is usually attained at some γ > γ_{θ(w)}. When γ grows, certain samples will activate the indicator I(y_i f_NN(θ(w)/‖θ(w)‖, x_i) < γ) and inflate (I).
The hope is that a sample (indicator term) activated early in (I) corresponds to a small η(x_i), while one with a large η(x_i) has a large value of y_i f_NN(θ(w)/‖θ(w)‖, x_i) and is thus activated later. This can be achieved by aligning w with η, because a large weight on sample i forces the decision boundary to drift away from this data point, giving a larger value of y_i f_NN(θ(w)/‖θ(w)‖, x_i). Therefore, the generalization bound with w aligned with η can be smaller than that with w deviating from η. The empirical results in Figure 2c provide numerical evidence of the strong effect of importance weighting on the generalization behavior.
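The trade-off in γ described above can be visualized numerically. The sketch below uses hypothetical discrete densities and stand-in margins (chosen only for illustration, not from the paper's experiments) to evaluate the weighted indicator term (I) and a stand-in for the complexity term (II) over a grid of γ: (I) is non-decreasing, (II) is strictly decreasing, and their sum is minimized at an interior γ.

```python
import numpy as np

# Hypothetical discrete source/target densities (illustration only).
p_s = np.array([0.5, 0.3, 0.2])
p_t = np.array([0.2, 0.3, 0.5])
chi2 = float(np.sum(p_t**2 / p_s) - 1.0)     # Pearson chi^2(P_t || P_s)

rng = np.random.default_rng(0)
n = 200
bins = rng.choice(3, size=n, p=p_s)          # covariate bins drawn from P_s
eta = (p_t / p_s)[bins]                      # density ratio eta(x_i)
margins = rng.uniform(0.1, 2.0, size=n)      # stand-in normalized margins

C, H = 1.0, 2                                # sup-norm bound and depth
def term_I(g):                               # weighted empirical 0-1 term
    return float(np.mean(eta * (margins < g)))
def term_II(g):                              # complexity x divergence term
    return C * (chi2 + 1.0) * H ** ((H - 1) / 2) / (g * np.sqrt(n))

grid = np.linspace(0.05, 2.5, 500)
bound = np.array([term_I(g) + term_II(g) for g in grid])
g_best = grid[bound.argmin()]
# The minimizing gamma sits strictly inside the grid: neither the smallest
# gamma (huge (II)) nor the largest (full (I)) is optimal.
print(grid[0] < g_best < grid[-1])
```

The interior minimizer is the "trade-off γ" of point 2 above; shifting weight mass toward large-η samples pushes their margins up and delays their indicators, flattening (I) near the minimizer.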

5. EXTENSION

What makes a good set of weights for learning? We show in both Sections 3 and 4 that importance weighting can affect how fast the classifier separates the data and converges to the max-margin solution. We also justify how the small-margin support vectors, which can be thought of as the hard-to-classify data points, are of significant importance. Imagine that we have access to an oracle that outputs the distance of each sample to the max-margin decision boundary. It is intuitive that by putting more weight on the small-margin samples, we "inform" gradient descent of their importance from the beginning and therefore accelerate the optimization. We also provide a rigorous result for linear predictors in Proposition 1. Our high-level intuition justifies a number of methodologies where various methods are used to measure the hardness of classifying a sample and to use that as the weight, explicitly or implicitly. Examples include curriculum learning (Bengio et al., 2009), MentorNet (Jiang et al., 2018), co-teaching (Han et al., 2018) and knowledge distillation (Li et al., 2017; Hinton et al., 2015), where auxiliary models are employed (replacing the oracle) to represent the hardness of each data point.

The effect of jointly optimizing a weighting model

It is not unusual that the importance weights, when they depend on another model, are jointly trained with the classifier to achieve a better overall performance, as in counterfactual modelling (Schnabel et al., 2016; Xu et al., 2020) and learning from noisy labels (Song et al., 2020). For illustration purposes, we consider the following setup:
minimize_{ψ,θ} (1/n) Σ_{i=1}^n g(ψ, x_i) · ℓ(y_i f(θ, x_i)), s.t. 1/M < g(ψ, x_i) < M,
where g(ψ, x_i) is the weighting model. By our main results, it is not difficult to conjecture that if the data is separable by f, the convergence of f to the max-margin solution will still hold, and the weighting model g(ψ, x_i) will concentrate to a constant for all i = 1, . . . , n. This is because the general convergence results are agnostic to the weights, so the weighting model will eventually be nullified. Also, during the beginning phase of training, the learned weights may correlate negatively with the margin (as this helps to speed up the convergence), and the correlation will diminish eventually as the weights converge to the same constant. The above conjectures are supported by the empirical evidence that we discuss in Figure 4. Therefore, jointly optimizing the weighting model may not change the convergence result, but the speed of convergence is affected.
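A naive instantiation of this objective already exhibits the nullification effect. In the sketch below (our own toy setup; the sigmoid parameterization of g is an assumption, not the paper's), per-sample weights g(ψ, x_i) = 1/M + (M − 1/M)·sigmoid(ψ_i) are trained jointly with a linear classifier; since every loss term is positive, the gradient on ψ pushes all weights toward the common boundary constant 1/M, while the classifier still separates the data.

```python
import numpy as np

# Joint training of a linear classifier f(theta, x) = theta * x and a
# per-sample weighting model, both by gradient descent (toy sketch).
x = np.array([0.5, 2.0, -0.5, -2.0])
y = np.array([1.0, 1.0, -1.0, -1.0])
M = 5.0

def weights(psi):
    # g in (1/M, M) via a sigmoid parameterization (our assumption)
    return 1.0 / M + (M - 1.0 / M) / (1.0 + np.exp(-psi))

theta, psi = 0.0, np.zeros(4)
w_start = weights(psi).copy()
for _ in range(3000):
    loss = np.exp(-y * theta * x)                  # exp loss per sample
    # d/dtheta of (1/n) sum_i g_i * loss_i
    theta -= 0.05 * np.mean(weights(psi) * loss * (-y * x))
    # d/dpsi_i of (1/n) g_i * loss_i: loss_i > 0, so weights drift down
    sig = 1.0 / (1.0 + np.exp(-psi))
    psi -= 0.05 * (loss / 4.0) * (M - 1.0 / M) * sig * (1.0 - sig)

w_end = weights(psi)
separated = np.all(y * theta * x > 0)
# All weights move toward the same constant (the lower bound 1/M); the
# classifier's convergence is weight-agnostic and unaffected.
print(separated, np.all(w_end < w_start))
```

In this particular formulation the constant the weights concentrate to is the lower bound 1/M; the qualitative point matches the conjecture above, namely that the jointly trained weighting model is eventually nullified.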

Interaction with explicit regularizations

Deep learning models are often trained with explicit regularization. To see how it interacts with importance weighting, we first check whether it alters the norm divergence in Claim 1. It is obvious that both early stopping and strong regularization on θ prohibit the norm divergence, so f(θ, ·) will not achieve the max-margin solution or even separate the training data. In such cases, as has been observed by Byrd & Lipton (2019), the impact of importance weighting on θ_λ(w) and γ̄(θ_λ(w)) will be significant. However, this may not help generalization, according to our arguments in Section 4.2, since the margins will be altered as well. Indeed, Zhang et al. (2016) show that explicit regularization may not lead to better generalization for neural networks. For weighted ERM, Theorem 1 provides a powerful tool to characterize the trade-off induced by explicit regularization via the margin size. Dropout, as a counterexample, does not prohibit norm divergence and may not interfere with our main conclusions.

6. DISCUSSION

In this paper, we study the impact of importance weighting on the implicit bias of gradient descent as well as on the generalization performance. Based on our theoretical findings, we propose the following future directions that are worth investigating from both the application and theoretical perspectives: 1) Is there an optimal way to construct importance weights using, for example, the oracle margin? 2) How can we correctly understand and utilize the role of a jointly-trained weighting model? 3) What is the combined effect of importance weighting and explicit regularization for deep learning models?

A APPENDIX

We provide the omitted discussions, proofs, and extra numerical results in the appendix.

A.1 SUPPLEMENTARY MATERIAL FOR SECTION 2

We discuss the exponential-tail behavior of loss functions, the practical implications of condition C3, and the proof of Claim 1.

A.1.1 LOSS FUNCTION WITH EXPONENTIAL-TAIL BEHAVIOR

Having an exponential decay on the tail of the loss function is essential for realizing the implicit bias of gradient descent, since we need ℓ(u) to behave like exp(−u) as u → ∞. Soudry et al. (2018) first propose the notion of a tight exponential tail, where the negative loss derivative −ℓ'(u) behaves like:
−ℓ'(u) ≲ (1 + exp(−c₁u)) e^{−u} and −ℓ'(u) ≳ (1 − exp(−c₂u)) e^{−u},
for sufficiently large u, where c₁ and c₂ are positive constants. There is also a smoothness assumption on ℓ(·). It is obvious that under this definition, the tail behavior of the loss function is constrained from both sides by exponential-type functions. There is a more general (and perhaps more direct) definition of exponential-tail loss functions in Lyu & Li (2019), where ℓ(u) = exp(−f(u)), such that:
• f is smooth and f'(u) ≥ 0 for all u;
• there exists c > 0 such that f'(u)u is non-decreasing for u > c and f'(u)u → ∞ as u → ∞.
It is easy to verify that the exponential loss, the log loss, and the cross-entropy loss satisfy both definitions. Since our focus is not to study the implicit bias of gradient descent, it suffices to work with the above loss functions.

A.1.2 PRACTICAL IMPLICATIONS OF CONDITION C3

C3 asserts the Lipschitz and smoothness properties. The Lipschitz condition is a rather mild assumption for neural networks, and several recent papers are dedicated to obtaining the Lipschitz constants of certain deep learning models (Fazlyab et al., 2019; Virmaux & Scaman, 2018). The $\beta$-smoothness condition, on the other hand, is more technically driven, so that we can analyze gradient descent. In practice, neural networks with ReLU activation do not satisfy the smoothness condition. However, there are smooth homogeneous activation functions, such as the quadratic activation $\sigma(x) = x^2$ and the higher-order ReLU activation $\sigma(x) = \mathrm{ReLU}(x)^c$ for $c > 2$. Still, in our experiments, we use ReLU as the activation function for its convenience.

Proof. We first state a technical lemma that characterizes the dynamics of gradient descent.

Lemma A.1 (Theorem E.10 of Lyu & Li (2019)). Under the conditions that:
• $\ell(\cdot)$ is given by the exponential loss;
• $f(\cdot, x)$ is a smooth function on $\mathbb{R}^d$ for all $x \in \mathcal{X}$;
• $f(\theta, x)$ is $\alpha$-homogeneous as in C2;
• the data is separated by $f$ during gradient descent at some point $t_0$;
• the learning rate satisfies $\eta_t := \eta_0 \big( L(\theta^{(t)}; w) \log\big(1/L(\theta^{(t)}; w)\big)^{3 - 2/\alpha} \big)^{-1}$ for all $t$;
then under the exponential loss we have:
$$\frac{1}{L(\theta^{(t)}; w)^2 \log\big(1/L(\theta^{(t)}; w)\big)^{2 - 2/\alpha}} \ge \frac{1}{2} \alpha^2 \gamma_{\theta^{(t_0)}}(w)^{2/\alpha} \sum_{i = t_0}^{t} \eta_i.$$
To use the results of Lemma A.1, we simply need to show two things for weak regularization:
• the total risk is still smooth and we can still achieve zero risk;
• there exists a critical (stationary) point such that $\lim_{\lambda \to 0} L_\lambda(\theta^*; w) = 0$.
Notice that the risk without regularization is a smooth function of $\theta$ for all $x$, since the composition of smooth functions is still smooth. It is easy to see that adding a weak regularization, e.g. $\lambda \|\theta\|_2^r$ for $r > 1$, does not alter the smoothness condition as $\lambda \to 0$. However, weak $\ell_1$ regularization would make the total risk non-smooth, and therefore we have excluded it from our discussion.
For the second point, it is obvious that $\|\theta\|_2 \to \infty$ gives a critical point under the exponential loss when $\lambda \to 0$. Recall that:
$$L_\lambda(\theta; w) = \frac{1}{n} \sum_i w_i \exp\big(-y_i f(\theta/\|\theta\|_2, x_i) \cdot \|\theta\|_2^\alpha\big) + \lambda \|\theta\|_2^r,$$
and
$$\nabla L_\lambda(\theta; w) = \frac{1}{n} \sum_i -w_i \exp\big(-y_i f(\theta/\|\theta\|_2, x_i) \cdot \|\theta\|_2^\alpha\big) \cdot y_i \nabla f(\theta, x_i) + \lambda \nabla \|\theta\|_2^r.$$
Therefore, for both the loss function and the gradient, the main term decreases exponentially fast as $\|\theta\|_2$ increases, while the remainder terms are only polynomial in $\|\theta\|_2$, so we can always find a small enough $\lambda$ satisfying:
$$\lim_{\lambda \to 0} \lim_{\|\theta\| \to \infty} L_\lambda(\theta; w) = 0 \quad \text{and} \quad \lim_{\lambda \to 0} \lim_{\|\theta\| \to \infty} \|\nabla L_\lambda(\theta; w)\| = 0,$$
in the same fashion as we show in (A.1) below. By a standard result on gradient descent for smooth functions, which we summarize in Lemma A.2, gradient descent always converges to a critical (stationary) point of the weighted ERM problem.

Lemma A.2 (Lemma 10 of Soudry et al. (2018)). Let $L_\lambda(\theta; w)$ be a $B(w)$-smooth non-negative objective. With a constant learning rate $\eta_0 \le B(w)^{-1}$, the gradient descent sequence satisfies:
• $\lim_{t \to \infty} \sum_{i=1}^{t} \|\nabla L_\lambda(\theta^{(i)}; w)\|^2 < \infty$;
• $\lim_{t \to \infty} \|\nabla L_\lambda(\theta^{(t)}; w)\| = 0$.

Now we need to show that, under the appropriate learning rate specified in Lemma A.1, gradient descent converges to the stationary point that corresponds to zero risk under weak regularization. Using the result of Lemma A.1, notice that if $L_\lambda(\theta^{(t)}; w)$ does not decrease to $0$, then the quantity $L_\lambda(\theta^{(t)}; w)^2 \log\big(1/L_\lambda(\theta^{(t)}; w)\big)^{2-2/\alpha}$ is bounded from below. However, there exists a constant learning rate such that $\sum_{i=t_0}^{t} \eta_i \to \infty$ as $t \to \infty$, which leads to a contradiction. Therefore, for weighted ERM with weak regularization, gradient descent converges to the stationary point where $L_\lambda(\theta^{(t)}; w) = 0$.

Finally, we show that to make $L_\lambda(\theta^{(t)}; w) \to 0$, we must have $\|\theta^{(t)}(w)\| \to \infty$. We argue by contradiction. Suppose $\|\theta^{(t)}(w)\|$ is bounded from above by some constant $C > 0$ for all $\lambda < \bar\lambda$, where $\bar\lambda$ is chosen later.
Then the loss for each sample $i$ is bounded from below by a positive value that depends on $C$: $w_i \exp(-y_i f(\theta^{(t)}, x_i)) \ge l(C) > 0$. Hence, letting $K := \lambda^{-1/(r+1)}$, we have
$$l(C) \le L_\lambda(\theta_\lambda(w); w) \le L_\lambda(K\theta^*; w) \le M \exp\big(-\lambda^{-\alpha/(r+1)} \cdot \gamma^*\big) + \lambda^{1/(1+r)}, \tag{A.1}$$
and it is obvious that the RHS $\to 0$ for a sufficiently small $\lambda$, which contradicts $l(C) > 0$. Hence we have $\|\theta^{(t)}(w)\| \to \infty$ for all $\lambda < \bar\lambda$, which completes the proof.

A.2 SUPPLEMENTARY MATERIAL FOR SECTION 3

We provide the proofs of Propositions 1 and 2 in this part of the appendix.

A.2.1 PROOF FOR PROPOSITION 1

Proof. We first characterize the $1/\log t$ rate using asymptotic arguments similar to those of Soudry et al. (2018). The key purpose here is to rigorously show that importance weighting plays a negligible role in the asymptotic regime. Let $\delta(t, w)$ be the residual term at step $t$:
$$\delta(t, w) := \theta^{(t)}(w) - \theta^* \log t. \tag{A.2}$$
To show the $1/\log t$ rate, we simply need to prove that $\|\delta(t, w)\|$ is bounded for any $w \in [1/M, M]^n$. Notice that
$$\|\delta(t+1, w)\|^2 = \|\delta(t+1, w) - \delta(t, w)\|^2 + 2\big\langle \delta(t+1, w) - \delta(t, w), \, \delta(t, w)\big\rangle + \|\delta(t, w)\|^2.$$
For the first term, we have:
$$\begin{aligned} \|\delta(t+1, w) - \delta(t, w)\|^2 &= \big\| -\eta \nabla L\big(\theta^{(t)}(w); w\big) - \theta^* \big(\log(t+1) - \log t\big) \big\|^2 \\ &= \eta^2 \big\|\nabla L\big(\theta^{(t)}(w); w\big)\big\|^2 + \|\theta^*\|^2 \log^2(1 + 1/t) + 2\eta \big\langle \theta^*, \nabla L\big(\theta^{(t)}(w); w\big)\big\rangle \log(1 + 1/t) \\ &\le \eta^2 \big\|\nabla L\big(\theta^{(t)}(w); w\big)\big\|^2 + \|\theta^*\|^2 t^{-2}, \end{aligned}$$
where in the last line we use:
• $\log(1+u) \le u$ for all $u > 0$;
• $\big\langle \theta^*, \nabla L(\theta^{(t)}(w); w)\big\rangle = \frac{1}{n}\sum_i -w_i \exp\big(-y_i \langle \theta^{(t)}(w), x_i\rangle\big)\, y_i \langle \theta^*, x_i\rangle \le 0$, because $\theta^*$ separates the data.
Also, from the first conclusion of Lemma A.2, $\|\nabla L(\theta^{(t)}(w); w)\|^2$ is summable, so $\|\delta(t+1, w) - \delta(t, w)\|^2$ is summable and the running sum converges to some finite number:
$$\sum_{t=1}^{\infty} \|\delta(t+1, w) - \delta(t, w)\|^2 = C_0 < \infty.$$
We see that the role of the weights is totally negligible because $\theta^*$ separates the data (the second bullet point above). The same argument applies to the second term $2\langle \delta(t+1, w) - \delta(t, w), \delta(t, w)\rangle$, where $w$ plays no part as long as $\theta^*$ separates the data. The detailed proof is technical, and we refer to Lemma 6 of Soudry et al. (2018), which states that:
$$\big\langle \delta(t+1, w) - \delta(t, w), \, \delta(t, w)\big\rangle = o(1/t).$$
Therefore, by telescoping, it holds that:
$$\|\delta(t, w)\|^2 - \|\delta(0, w)\|^2 \le C_0 + \sum_{i=1}^{t} 2\big\langle \delta(i+1, w) - \delta(i, w), \, \delta(i, w)\big\rangle < \infty,$$
hence $\|\delta(t, w)\|$ is bounded and $\|\delta(t, w)\|/\log t = O(1/\log t)$, i.e.
$$\Big\| \frac{\theta^{(t)}(w)}{\|\theta^{(t)}(w)\|_2} - \theta^* \Big\| = O\Big(\frac{1}{\log t}\Big). \tag{A.3}$$
It is now obvious that under the asymptotic characterization of (A.2), the weights play only a negligible role since $\theta^*$ separates the data.
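This asymptotic weight-independence is easy to see numerically. Below is an illustrative sketch (ours, not the paper's code): gradient descent on the weighted exponential risk for a toy separable dataset whose max-margin direction $\theta^* = [1,1]/\sqrt{2}$ is known analytically; the two weight vectors are arbitrary choices for the demonstration.

```python
import numpy as np

# Toy linearly separable data; the max-margin direction is [1, 1]/sqrt(2).
X = np.array([[1.0, 1.0], [2.0, 0.5], [-1.0, -1.0], [-0.5, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

def gd_direction(w, steps=20000, eta=0.1):
    """GD on L(theta; w) = mean(w * exp(-y * X theta)); returns theta/||theta||."""
    theta = np.zeros(2)
    for _ in range(steps):
        margins = y * (X @ theta)
        grad = -((w * np.exp(-margins) * y) @ X) / len(y)
        theta -= eta * grad
    return theta / np.linalg.norm(theta)

# Two very different weightings end up in (nearly) the same direction,
# with the residual gap shrinking like 1/log t.
d_uniform = gd_direction(np.ones(4))
d_skewed = gd_direction(np.array([8.0, 0.5, 1.0, 2.0]))
print(np.linalg.norm(d_uniform - d_skewed))
```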
However, the definition of $\delta$ in (A.2) also prohibits us from studying the finite-step behavior, since it absorbs all the constant factors. We now use the Fenchel–Young inequality to give a more precise characterization of the convergence speed. First, recall that by the KKT conditions, the max-margin problem for a linear predictor on separable data has the dual representation:
$$\theta^* = \sum_i y_i x_i \, p_i^* \big/ \gamma^*, \tag{A.4}$$
where $p^*$ is the dual optimum such that
$$\gamma^* = -\min_{\|\theta\|=1} \max_i \; -y_i x_i^\top \theta \;\; \equiv \;\; \min\Big\{ \Big\| \sum_i y_i x_i p_i \Big\| \; : \; p_i \ge 0, \; \sum_i p_i = 1 \Big\}.$$
Now, we work directly with $\theta^{(t)}(w)/\|\theta^{(t)}(w)\|_2 - \theta^*$:
$$\Big\| \frac{\theta^{(t)}(w)}{\|\theta^{(t)}(w)\|_2} - \theta^* \Big\|^2 = 2 - 2\Big\langle \theta^*, \frac{\theta^{(t)}(w)}{\|\theta^{(t)}(w)\|_2} \Big\rangle,$$
and from (A.4) and the Fenchel–Young inequality we have:
$$-\Big\langle \theta^*, \frac{\theta^{(t)}(w)}{\|\theta^{(t)}(w)\|_2} \Big\rangle = \frac{\big\langle p^*, \big(-y_i x_i^\top \theta^{(t)}(w)\big)_i \big\rangle}{\gamma^* \|\theta^{(t)}(w)\|_2} \le \frac{g^*(p^*) + g\big(\big(-y_i x_i^\top \theta^{(t)}(w)\big)_i\big)}{\gamma^* \|\theta^{(t)}(w)\|_2}, \tag{A.5}$$
where $g$ is a convex function with conjugate $g^*$. To build the connection with the loss and risk, we choose $g$ such that $g(u) = \log\big( \frac{1}{n} \sum_i w_i \exp(u_i) \big)$. As a consequence, letting $u_i = -y_i x_i^\top \theta^{(t)}(w)$ and $u = [u_1, \dots, u_n]$, we have $g(u) = \log L(\theta^{(t)}(w); w)$. With simple algebraic computations, the conjugate $g^*(p)$ is given by:
$$g^*(p) = \log n + \sum_i p_i \log\frac{p_i}{w_i} = D_{\mathrm{KL}}(p \,\|\, w) + \log n.$$
Plugging the above results into (A.5):
$$\frac{1}{2} \Big\| \frac{\theta^{(t)}(w)}{\|\theta^{(t)}(w)\|_2} - \theta^* \Big\|^2 \le 1 + \frac{\log L(\theta^{(t)}(w); w)}{\|\theta^{(t)}(w)\|_2 \, \gamma^*} + \frac{\log n + D_{\mathrm{KL}}(p^* \| w)}{\|\theta^{(t)}(w)\|_2 \, \gamma^*}. \tag{A.6}$$
Following the convergence analysis of AdaBoost, we have the following technical lemma.

Lemma A.3 (Schapire & Freund (2013)). Suppose $\ell$ is the exponential loss with a linear predictor, and the learning rate is sufficiently small so that $\eta_t L(\theta^{(t)}) \le 1$. Then:
$$L(\theta^{(t+1)}) \le L(\theta^{(t)}) \Big( 1 - \eta_t L(\theta^{(t)}) \big(1 - \eta_t L(\theta^{(t)})/2\big) \frac{\|\nabla L(\theta^{(t)})\|^2}{L(\theta^{(t)})^2} \Big), \tag{A.7}$$
and thus
$$L(\theta^{(t+1)}) \le L(\theta^{(0)}) \exp\Big( -\sum_{j < t} \eta_j L(\theta^{(j)}) \big(1 - \eta_j L(\theta^{(j)})/2\big) \frac{\|\nabla L(\theta^{(j)})\|^2}{L(\theta^{(j)})^2} \Big). \tag{A.8}$$
Also,
$$\|\theta^{(t+1)}\| \le \sum_{j < t} \eta_j L(\theta^{(j)}) \frac{\|\nabla L(\theta^{(j)})\|_2}{L(\theta^{(j)})}.$$
To use the results of Lemma A.3, we define the following shorthand notation. Let
$$a_t(w) := \eta_t L\big(\theta^{(t)}(w); w\big) \quad \text{and} \quad b_t(w) := \frac{\|\nabla L(\theta^{(t)}(w); w)\|_2}{L(\theta^{(t)}(w); w)}.$$
Then (A.6) can be further bounded by:
$$\begin{aligned} \frac{1}{2} \Big\| \frac{\theta^{(t)}(w)}{\|\theta^{(t)}(w)\|_2} - \theta^* \Big\|^2 &\le 1 + \frac{\log L(\theta^{(0)}; w)}{\|\theta^{(t)}(w)\|_2 \, \gamma^*} - \sum_{i=0}^{t-1} \frac{a_i(w)\big(1 - a_i(w)/2\big) b_i(w)^2}{\|\theta^{(i)}(w)\|_2 \, \gamma^*} + \frac{\log n + D_{\mathrm{KL}}(p^* \| w)}{\|\theta^{(t)}(w)\|_2 \, \gamma^*} \\ &\le 1 - \sum_{i=1}^{t-1} \frac{a_i(w) b_i^2(w)}{\|\theta^{(i)}(w)\|_2 \, \gamma^*} + 2\sum_{i=1}^{t-1} \frac{a_i^2(w) b_i^2(w)}{\|\theta^{(i)}(w)\|_2 \, \gamma^*} + \frac{\log n + D_{\mathrm{KL}}(p^* \| w)}{\|\theta^{(t)}(w)\|_2 \, \gamma^*}. \end{aligned} \tag{A.9}$$
Notice that Lemma A.3 also implies:
$$\sum_{i=1}^{t-1} a_i^2(w) b_i^2(w) = \sum_{i=1}^{t-1} \eta_i^2 \big\|\nabla L\big(\theta^{(i)}(w); w\big)\big\|^2 \le 2 \sum_{i=1}^{t-1} \Big( L\big(\theta^{(i)}(w); w\big) - L\big(\theta^{(i+1)}(w); w\big) \Big),$$
which is bounded from above by $2M$. Finally, it is easy to verify that $b_t(w) \ge \gamma^*$, and Lemma A.3 also implies that $\|\theta^{(t)}(w)\| \le \sum_{i<t} a_i(w) b_i(w)$. We therefore simplify (A.9) to:
$$\Big\| \frac{\theta^{(t)}(w)}{\|\theta^{(t)}(w)\|_2} - \theta^* \Big\|^2 \le \frac{2\big( \log n + D_{\mathrm{KL}}(p^* \| w) + M \big)}{\|\theta^{(t)}(w)\|_2 \, \gamma^*},$$
and obtain the desired result.
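In this finite-step bound, the weights enter only through the factor $\log n + D_{\mathrm{KL}}(p^*\|w) + M$. A small numeric sketch of this weight-dependent factor follows; `p_star` is a hypothetical dual optimum supported on the support vectors (an assumption for illustration, not computed from data).

```python
import numpy as np

# D_KL(p || w) for distributions p on n points; w holds the raw sample
# weights in [1/M, M] and need not be normalized, matching the bound.
def kl(p, w):
    p, w = np.asarray(p, float), np.asarray(w, float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / w[mask])))

n = 4
p_star = np.array([0.5, 0.0, 0.5, 0.0])  # hypothetical dual optimum

def bound_factor(w, M):
    return 2.0 * (np.log(n) + kl(p_star, w) + M)

# Weights in [1/M, M] with M = 2: tilting mass toward the support vectors
# (entries where p* > 0) lowers the KL term, hence the bound, without
# changing the limit direction theta*.
uniform = bound_factor(np.ones(n), M=2.0)
tilted = bound_factor(np.array([2.0, 0.5, 2.0, 0.5]), M=2.0)
print(uniform, tilted)
```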

A.3 PROOF FOR PROPOSITION 2

We first present a greedy approach for constructing the maximal separable subset $D_{\mathrm{sep}}$, which was proposed by Ji & Telgarsky (2018b). For each sample $(x_i, y_i)$, if there exists a $\theta_i$ such that $y_i \theta_i^\top x_i > 0$ and $\min_{j=1,\dots,n} y_j \theta_i^\top x_j \ge 0$, we add it to $D_{\mathrm{sep}}$; otherwise, we add it to $D_{\mathrm{non\text{-}sep}}$. To see why this approach works, first notice that by choosing $\theta^*_{\mathrm{sep}} = \sum_{i \in D_{\mathrm{sep}}} \theta_i$, $\theta^*_{\mathrm{sep}}$ separates the data in $D_{\mathrm{sep}}$. Then we check that it is indeed maximal: for any $\theta$ that is correct on some $(x_i, y_i)$ in $D_{\mathrm{non\text{-}sep}}$, there must exist another $(x_j, y_j)$ in $D_{\mathrm{non\text{-}sep}}$ with $y_j \theta^\top x_j < 0$, since otherwise $(x_i, y_i)$ would have been placed in $D_{\mathrm{sep}}$.

It has been shown in Ji & Telgarsky (2018b) that the risk is strongly convex on $D_{\mathrm{non\text{-}sep}}$ under conditions that are satisfied by our setting.

Lemma A.4 (Theorem 2.1 of Ji & Telgarsky (2018b)). If $\ell$ is twice differentiable with $\ell'' > 0$, $\ell \ge 0$ and $\lim_{u \to \infty} \ell(u) = 0$, then $L(\theta) = \frac{1}{n}\sum_i \ell(y_i \theta^\top x_i)$ is strongly convex on $D_{\mathrm{non\text{-}sep}}$.

Now we provide the proof of Proposition 2.

Proof. The first part is a direct consequence of Lemma A.4: $L(\theta; w) = \frac{1}{n}\sum_i w_i \exp(-y_i \theta^\top x_i)$ is strongly convex on $D_{\mathrm{non\text{-}sep}}$. Therefore, the optimum $\theta(w)$ is uniquely defined and $\|\theta(w)\| = O(1)$. To show the second part, we leverage a standard argument for gradient descent under the smoothness condition.

Lemma A.5 (Bubeck (2014)). Suppose $L(\theta)$ is convex and $\beta$-smooth. Then with learning rate $\eta_t \le 2/\beta$, the gradient descent sequence satisfies:
$$L(\theta^{(t+1)}) \le L(\theta^{(t)}) - \eta_t \big(1 - \eta_t \beta/2\big) \|\nabla L(\theta^{(t)})\|^2.$$
Then for any $z \in \mathbb{R}^d$:
$$2 \sum_{i=0}^{t-1} \eta_i \big( L(\theta^{(i)}) - L(z) \big) \le \|\theta^{(0)} - z\|^2 - \|\theta^{(t)} - z\|^2 + \sum_{i=0}^{t-1} \eta_i \big(1 - \beta\eta_i/2\big) \big( L(\theta^{(i)}) - L(z) \big).$$
• (Asymptotic) $\lim_{\lambda \to 0} \gamma_\lambda(w) \to \gamma^*$.
• (Finite steps) There exists a $\lambda := \lambda(r, \alpha, \gamma^*, w, c)$ such that for $\theta'(w)$ with $L_\lambda(\theta'(w); w) \le \tau L_\lambda(\theta_\lambda(w); w)$ and $\tau \le 2$, the associated margin $\gamma(\theta'(w))$ satisfies $\gamma(\theta'(w)) \ge c \cdot \gamma^* \tau^{-\alpha/r}$, where $\frac{1}{10} \le c < 1$.

Proof of the asymptotic part:

Proof. We first consider the exponential loss $\ell(u) = \exp(-u)$.
The log loss $\ell(u) = \log(1 + \exp(-u))$ can be handled in a similar fashion. Suppose the weights $w = (w_1, \dots, w_n)$ are normalized so that



The regularized loss is given by $L_\lambda(\theta; w) = L(\theta; w) + \lambda \|\theta\|^r$ for a fixed $r > 0$. Weak regularization refers to the case where $\lambda \to 0$.
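As a reference point, the weighted exponential risk with weak norm regularization and its gradient can be sketched as follows for a linear predictor $f(\theta, x) = \theta^\top x$ (i.e. $\alpha = 1$); the values of `lam` and `r` are illustrative, not prescribed by the text.

```python
import numpy as np

# L_lambda(theta; w) = (1/n) sum_i w_i exp(-y_i theta.x_i) + lam * ||theta||^r
def risk(theta, X, y, w, lam=1e-6, r=2):
    margins = y * (X @ theta)
    return float(np.mean(w * np.exp(-margins)) + lam * np.linalg.norm(theta) ** r)

def risk_grad(theta, X, y, w, lam=1e-6, r=2):
    margins = y * (X @ theta)
    data_term = -((w * np.exp(-margins) * y) @ X) / len(y)
    nrm = np.linalg.norm(theta)
    # gradient of lam * ||theta||^r is lam * r * ||theta||^(r-2) * theta
    reg_term = lam * r * nrm ** (r - 2) * theta if nrm > 0 else np.zeros_like(theta)
    return data_term + reg_term
```

A quick finite-difference check confirms the gradient matches the risk, which is how we sanity-checked the sketch.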



Figure 1: (a) Linearly separable data; (b) non-separable data; (c) balanced moon-shaped nonlinearly separable data; (d) unbalanced moon-shaped data after down-sampling both classes (20% for the blue class and 80% for the orange class). We use a solid line to denote the separating hyperplane of the trained linear model and shading to represent the decision boundary of the trained nonlinear model.

Figure 2: (a) Epoch-wise training performance measured by the angle between the decision boundary (at that epoch) and the max-margin solution, using a linear predictor on the linearly separable data of Figure 1a; (b) epoch-wise training performance measured by the average margin in the same setting as (a); (c) the generalization error on testing data (the remaining 80% of the orange class and 20% of the blue class that are not part of the down-sampling in Figure 1d) when the nonlinear model is trained under different class weights, as training progresses; (d) the average margin for the nonlinear model on the nonlinearly separable training data shown in Figure 1c, under different class weights, as training progresses.

Figure 4: The five figures on the left show that the distribution of the learned weights concentrates to a constant as training progresses. The rightmost figure indicates the correlation pattern between the margin and the learned weights: the correlation increases rapidly in the beginning, and then slowly decreases to zero (the process is much slower for the nonlinear predictor, so we only show the first part). Here, $g(x_i) = \sigma(\psi^\top x_i + b) + 1$, where $\sigma(\cdot)$ is the sigmoid function; the constant one is added to avoid numerical issues.

A.1.3 PROOF FOR CLAIM 1

Soudry et al. (2018) and Ji & Telgarsky (2018b) show norm divergence for linear predictors, and the follow-up work by Ji & Telgarsky (2018a) and Gunasekar et al. (2018) extends the result to linear neural networks. For nonlinear predictors such as multi-layer neural networks with homogeneous activations, Nacson et al. (2019) and Lyu & Li (2019) prove norm divergence for gradient descent in the absence of explicit regularization. Rosset et al. (2004a) and Wei et al. (2019) consider weak regularization for linear and nonlinear predictors; however, they only study the properties of the critical points instead of the gradient descent sequence.

$\sum_{i=1}^{n} w_i = 1$ and $w_i \ge 0$. Consider
$$L_\lambda(A\theta; w) = \sum_{i=1}^{n} w_i \exp\big(-A^\alpha \cdot y_i f(\theta; x_i)\big) + \lambda A^r \|\theta\|^r \le \exp\big(-A^\alpha \cdot \min_i y_i f(\theta; x_i)\big) + \lambda A^r \|\theta\|^r, \tag{A.13}$$
where $A > 0$, and we disregard the $1/n$ term in $L_\lambda$ for the sake of notation. In addition, we have the lower bound
$$L_\lambda(A\theta; w) \ge w_{i^*} \exp\big(-A^\alpha \cdot \min_i y_i f(\theta; x_i)\big) + \lambda A^r \|\theta\|^r \ge w_{[n]} \exp\big(-A^\alpha \cdot \min_i y_i f(\theta; x_i)\big) + \lambda A^r \|\theta\|^r, \tag{A.14}$$
where $i^* = \arg\min_i y_i f(\theta; x_i)$ and $w_{[n]} = \min_i w_i$. By taking $A = \|\theta_\lambda(w)\|$, $\theta = \theta^*$ in the upper bound, and $A = 1$, $\theta = \theta_\lambda(w)$ in the lower bound, it follows that
$$w_{[n]} \exp\big(-\|\theta_\lambda(w)\|^\alpha \gamma_\lambda(w)\big) + \lambda \|\theta_\lambda(w)\|^r \le L_\lambda(\theta_\lambda(w); w) \le L_\lambda\big(\|\theta_\lambda(w)\|\,\theta^*; w\big) \le \exp\big(-\|\theta_\lambda(w)\|^\alpha \cdot \gamma^*\big) + \lambda \|\theta_\lambda(w)\|^r.$$
This implies $w_{[n]} \exp\big(-\|\theta_\lambda(w)\|^\alpha \gamma_\lambda(w)\big) \le \exp\big(-\|\theta_\lambda(w)\|^\alpha \cdot \gamma^*\big)$, i.e.
$$w_{[n]} \exp\big(\|\theta_\lambda(w)\|^\alpha \big(\gamma^* - \gamma_\lambda(w)\big)\big) \le 1.$$
By Claim 1, $\|\theta_\lambda(w)\| \to \infty$ as $\lambda \to 0$ (or Lemma C.4 in Wei et al. (2019)), so the above inequality implies that $\gamma_\lambda(w) \to \gamma^*$ as $\lambda \to 0$.

Proof of the finite-steps part:

Proof. Consider $A = \big[ \frac{1}{\gamma^*} \log\big( (\gamma^*)^{r/\alpha} / \lambda \big) \big]^{1/\alpha}$. By the upper bound (A.13), it follows that
$$L_\lambda(\theta'(w); w) \le \tau L_\lambda(A\theta^*; w) \le \tau \exp(-A^\alpha \cdot \gamma^*) + \tau \lambda A^r = \frac{\lambda \tau}{(\gamma^*)^{r/\alpha}} \Big( 1 + \big( \log\big((\gamma^*)^{r/\alpha}/\lambda\big) \big)^{r/\alpha} \Big). \tag{A.15}$$
Then by the lower bound (A.14), it follows that
$$w_{[n]} \exp\big(-\|\theta'(w)\|^\alpha \gamma'(w)\big) \le L_\lambda(\theta'(w); w) \le \text{(A.15)},$$
where $\gamma'(w) = \min_i y_i f\big(\theta'(w)/\|\theta'(w)\|, x_i\big)$. Note also that $\lambda \|\theta'(w)\|^r \le \text{(A.15)}$. It implies that
$$\gamma'(w) \ge \frac{-\log\big(\text{(A.15)}/w_{[n]}\big)}{\|\theta'(w)\|^\alpha} \ge \frac{-\log\Big( \frac{\lambda \tau}{w_{[n]} (\gamma^*)^{r/\alpha}} \Big( 1 + \big(\log\big((\gamma^*)^{r/\alpha}/\lambda\big)\big)^{r/\alpha} \Big) \Big) \cdot \gamma^*}{\tau^{\alpha/r} \Big( 1 + \big(\log\big((\gamma^*)^{r/\alpha}/\lambda\big)\big)^{r/\alpha} \Big)^{\alpha/r}}.$$


It is immediately clear that we may choose the $z$ in Lemma A.5 such that it combines the optima from $D_{\mathrm{sep}}$ and $D_{\mathrm{non\text{-}sep}}$. In particular, we have shown that the optimum on $D_{\mathrm{non\text{-}sep}}$ is uniquely given by $\theta(w)$. For $D_{\mathrm{sep}}$ we assume the max-margin linear predictor is given by $\theta^*_{\mathrm{sep}}$ (so $\|\theta^*_{\mathrm{sep}}\| = 1$). Therefore, according to Proposition 1, the optimum is given by $\log t \cdot \theta^*_{\mathrm{sep}}$. Now define $z := \theta(w) + \theta^*_{\mathrm{sep}} \cdot \log t / \gamma_{\mathrm{sep}}$, where we add the extra constant $\gamma_{\mathrm{sep}}$, the maximum margin on the separable subset of the data, to simplify the following bound. Without loss of generality, we assume the features are bounded in $\|\cdot\|_2$ norm such that $\|x_i\|_2 \le 1$. As a consequence, the risk at $z$ decomposes, where we use $L_{\mathrm{non\text{-}sep}}$ and $L_{\mathrm{sep}}$ to denote the risks associated with $D_{\mathrm{non\text{-}sep}}$ and $D_{\mathrm{sep}}$. To invoke Lemma A.5, first note that the required smoothness condition is guaranteed by Lemma A.3, i.e. in each step the risk is $\eta_t L(\theta^{(t)})$-smooth; without loss of generality, we assume $\eta_t L(\theta^{(t)}) \le \eta_t$. Therefore, applying Lemma A.5 with our choice of $z$, together with the result in (A.10), we obtain the bound in terms of the risk. Since we assume a constant learning rate, when $\sum_{i<t} \eta_i = O(t)$ the above result simplifies accordingly. Finally, from Lemma A.4 we know $L(\theta; w)$ is strongly convex (say, $\omega$-strongly convex), so the convergence in terms of the risk can be transformed to convergence of the parameters, which leads to our desired results.
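The $D_{\mathrm{sep}}$/$D_{\mathrm{non\text{-}sep}}$ split used above can also be recovered empirically, consistent with the characterization of Ji & Telgarsky (2018b): on the separable subset the margins grow without bound under gradient descent, while on the non-separable subset they stay bounded. The following is a heuristic sketch of ours on toy data where the split is known in advance.

```python
import numpy as np

# Run plain gradient descent on the logistic loss and classify samples by
# whether their margin y_i <theta, x_i> keeps growing. The two points
# sharing x = [1, 0] with opposite labels form the non-separable part;
# the other two are separable along the second coordinate.
X = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 2.0]])
y = np.array([1.0, -1.0, 1.0, 1.0])

theta = np.zeros(2)
eta = 0.5
for _ in range(50000):
    m = y * (X @ theta)
    grad = -((y / (1.0 + np.exp(m))) @ X) / len(y)  # gradient of mean log loss
    theta -= eta * grad

margins = y * (X @ theta)
sep = margins > 2.0  # crude "growing margin" threshold
print(sep)  # the clashing pair stays bounded; the last two samples grow
```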

A.4 SUPPLEMENTARY MATERIAL FOR SECTION 4

In this section, we establish the detailed proofs of Proposition 3 and Theorem 1. Recall that the loss function we are interested in is:
$$L_\lambda(\theta; w) = \frac{1}{n} \sum_{i=1}^{n} w_i \exp\big(-y_i f(\theta; x_i)\big) + \lambda \|\theta\|^r.$$

A.4.1 PROOF OF PROPOSITION 3

We first restate the proposition.

Proposition A.1. Suppose C1, C2, A1 hold. For any $w \in [1/M, M]^n$, it follows that
Published as a conference paper at ICLR 2021
Note that the numerator is of the scale $\log\big(\frac{1}{\lambda} \big/ \log\frac{1}{\lambda}\big)$ and the denominator is of the scale $\log\frac{1}{\lambda}$. So for a sufficiently small $\lambda = \lambda(r, \alpha, \gamma^*, w, c)$, we have $\gamma'(w) \ge c \cdot \gamma^* \tau^{-\alpha/r}$, where $\frac{1}{10} \le c < 1$. We leave the details of tracing the dependence of $\lambda(r, \alpha, \gamma^*, w, c)$ on $c$ to the reader; it follows from basic analysis.

A.4.2 PROOF OF THEOREM 1

When the training distribution $p_{\mathrm{train}}$ deviates from the testing distribution $p_{\mathrm{test}}$, we develop a generalization bound that characterizes this deviation. Denote by $p_s$ and $p_t$ the respective densities of $x$ under the training and testing data. Let
$$D(P_t \,\|\, P_s) = \int \Big( \Big(\frac{p_t(x)}{p_s(x)}\Big)^2 - 1 \Big) p_s(x)\, dx \quad \text{and} \quad \eta(x_i) = \frac{p_t(x_i)}{p_s(x_i)}.$$
We first restate Theorem 1.

Theorem A.1. Assume $\sigma$ is 1-Lipschitz and 1-positive-homogeneous. Then with probability at least $1 - \delta$, the stated bound holds, where (I) is the empirical risk, (II) reflects the compounding effect of the model complexity of the class of $H$-layer neural networks and the deviation of the target distribution from the source distribution, and $\epsilon(\gamma, n, \delta)$ is a small quantity compared to (I) and (II). Here $C := \sup_x \|x\|$, and $\gamma$ is any positive value.

To prove Theorem A.1, we first establish a few lemmas.

Lemma A.6. Consider an arbitrary function class $\mathcal{F}$ such that $\sup_{x \in \mathcal{X}} |f(x)| \le C$ for all $f \in \mathcal{F}$. Then, with probability at least $1 - \delta$ over the sample, the corresponding margin bound holds for all margins $\gamma > 0$ and all $f \in \mathcal{F}$.

Proof. This lemma is adapted from Theorem 1 of Koltchinskii et al. (2002) by considering the deviation of the testing distribution from the training distribution. It is then obtained following Theorem 5 of Kakade et al. (2009).

Lemma A.7. Let $\mathcal{F}_H$ be the class of real-valued networks of depth $H$ over the domain $\mathcal{X}$, where each parameter matrix $W_h$ has Frobenius norm at most $M_F(h)$, and with an activation that is 1-Lipschitz and positive-homogeneous. Then the corresponding complexity bound holds, where $C := \sup_{x \in \mathcal{X}} \|x\|$.

Proof. From Theorem 1 of Golowich et al. (2018), we arrive at the initial bound. By Jensen's inequality, we pass to the expectation. In addition, we note that $Z$ satisfies the bounded-difference condition (Boucheron et al., 2013), so $Z$ is sub-Gaussian with the variance factor in (A.17).

Proof. This lemma is obtained by reorganizing the proofs of Lemma D.3 and Proposition D.1 of Wei et al. (2019).

Proof of Theorem A.1.

Proof. Theorem A.1 follows from Lemmas A.6, A.7 and A.8.
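The divergence $D(P_t \| P_s)$ above is the chi-square divergence, which can be estimated from source samples via the importance weights $\eta(x)$. A Monte Carlo sketch for a hypothetical source/target pair of equal-variance Gaussians (our choice for illustration, not from the paper), where the closed form $e^{(\mu_t - \mu_s)^2/\sigma^2} - 1$ is available for comparison:

```python
import numpy as np

rng = np.random.default_rng(0)

# D(P_t || P_s) = E_{x ~ p_s}[(p_t(x)/p_s(x))^2] - 1, estimated by Monte Carlo.
def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

mu_s, mu_t, sigma = 0.0, 0.5, 1.0
xs = rng.normal(mu_s, sigma, size=200000)                        # source samples
eta = gauss_pdf(xs, mu_t, sigma) / gauss_pdf(xs, mu_s, sigma)    # importance weights
d_est = float(np.mean(eta ** 2) - 1.0)

# Closed form for equal-variance Gaussians: exp((mu_t - mu_s)^2 / sigma^2) - 1.
d_true = float(np.exp((mu_t - mu_s) ** 2 / sigma ** 2) - 1.0)
print(d_est, d_true)
```

The estimate concentrates around the closed form; note that for larger mean shifts the variance of $\eta^2$ grows exponentially, which is exactly why this divergence penalizes strong distribution shift in the bound.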

