ON THE INDUCTIVE BIAS OF A CNN FOR DISTRIBUTIONS WITH ORTHOGONAL PATTERNS

Abstract

Training overparameterized convolutional neural networks with gradient-based optimization is the most successful learning method for image classification. However, their generalization properties are far from understood. In this work, we consider a simplified image classification task where images contain orthogonal patches and are learned with a 3-layer overparameterized convolutional network and stochastic gradient descent (SGD). We empirically identify a novel phenomenon of SGD in our setting, where the dot-product between the learned pattern detectors and their detected patterns is governed by the pattern statistics in the training set. We call this phenomenon Pattern Statistics Inductive Bias (PSI) and empirically verify it in a large number of instances. We prove that in our setting, if a learning algorithm satisfies PSI then its sample complexity is $O(d^2 \log(d))$, where $d$ is the filter dimension. In contrast, we show a VC dimension lower bound which is exponential in $d$. We perform experiments with overparameterized CNNs on a variant of MNIST with non-orthogonal patches, and show that the empirical observations are in line with our analysis.

1. INTRODUCTION

Convolutional neural networks (CNNs) have achieved remarkable performance in various computer vision tasks (Krizhevsky et al., 2012; Xu et al., 2015; Taigman et al., 2014). In practice, these networks typically have more parameters than needed to achieve zero train error (i.e., they are overparameterized). Despite non-convexity and the potential problem of overfitting, training these models with gradient-based methods leads to solutions with low test error. It is still largely unknown why such simple optimization algorithms have outstanding test performance for learning overparameterized convolutional networks. Recently, there have been major efforts to provide generalization guarantees for overparameterized CNNs. However, current generalization guarantees either depend on the number of channels of the network (Long & Sedghi, 2020) or hold under specific constraints on the weights (Li et al., 2018). Clearly, the generalization of overparameterized CNNs depends on both the learning algorithm (gradient-based methods) and unique properties of the data. Providing generalization guarantees while incorporating these factors is a major challenge. Indeed, this requires analyzing non-convex optimization methods and mathematically defining properties of the data, which is extremely difficult for real-world problems. Therefore, it is necessary to first understand simple settings which are amenable to theoretical and empirical analysis and share salient features with real-world problems. Towards this goal, we analyze a simplified pattern recognition task where all patterns in the images are orthogonal and the classification is binary. The architecture is a 3-layer overparameterized convolutional neural network, and it is learned using stochastic gradient descent (SGD).
We take a unique approach that combines novel empirical observations with theoretical guarantees to provide a novel generalization bound which is independent of the number of channels and is a low-degree polynomial of the filter dimension, which is usually low in practice. Empirically, we identify a novel property of the solutions found by SGD. We observe that the statistics of patterns in the training data govern the magnitude of the dot-product between learned pattern detectors and their detected patterns. Specifically, patterns that appear almost exclusively in one of the classes will have a large dot-product with the channels that detect them. On the other hand, patterns that appear roughly equally in both classes will have a low dot-product with their detecting channels. We formally define this as the "Pattern Statistics Inductive Bias" condition (PSI) and provide empirical evidence that PSI holds across a large number of instances. We also prove that SGD indeed satisfies PSI in a simple setup of two points in the training set. Under the assumption that PSI holds, we analyze the sample complexity and prove that it is at most $O(d^2 \log d)$, where $d$ is the filter dimension. In contrast, we show that the VC dimension of the class of functions we consider is exponential in $d$, and thus there exist other learning algorithms (not SGD) that will have exponential sample complexity. Together, these results provide firm evidence that even though SGD can in principle overfit, it is nonetheless biased towards solutions which are determined by the statistics of the patterns in the training set, and consequently it has good generalization performance. We perform experiments with overparameterized CNNs on a variant of MNIST that has non-orthogonal patterns. We use our analysis to better understand why SGD has low sample complexity in this setting. We empirically show that the inductive bias of SGD is similar to PSI.
This suggests that the idea of PSI is not unique to the orthogonal case and can be useful for understanding overparameterized CNNs in other challenging settings.

2. RELATED WORK

Several recent works have studied the generalization properties of overparameterized CNNs. Some of these propose generalization bounds that depend on the number of channels (Long & Sedghi, 2020; Jiang et al., 2019). Others provide guarantees for CNNs with constraints on the weights (Zhou & Feng, 2018; Li et al., 2018). Convergence of gradient descent to KKT points of the max-margin problem is shown in Lyu & Li (2020) and Nacson et al. (2019) for homogeneous models. However, their results do not provide generalization guarantees in our setting. Gunasekar et al. (2018) study the inductive bias of linear CNNs. Yu et al. (2019) study a pattern classification problem similar to ours. However, their analysis holds for an unbounded hinge loss which is not used in practice. Furthermore, their sample complexity depends on the network size, and thus does not explain why large networks do not overfit. Other works have studied learning under certain ground-truth distributions. For example, Brutzkus & Globerson (2019) study a simple extension of the XOR problem, showing that overparameterized CNNs generalize better than smaller CNNs. Single-channel CNNs are analyzed in (Du et al., 2018b;a; Brutzkus & Globerson, 2017; Du et al., 2018c). Other works study the inductive bias of gradient descent on fully connected linear or non-linear networks (Ji & Telgarsky, 2019; Arora et al., 2019a; Wei et al., 2019; Brutzkus et al., 2018; Dziugaite & Roy, 2017; Allen-Zhu et al., 2019; Chizat & Bach, 2020). Fully connected networks were also analyzed via the NTK approximation (Du et al., 2019; 2018d; Arora et al., 2019b; Fiat et al., 2019). Kushilevitz & Roth (1996) and Shvaytser (1990) study the learnability of visual pattern distributions. However, our focus is on learnability using a specific algorithm and architecture: overparameterized CNNs trained with SGD.

3. THE ORTHOGONAL PATTERNS PROBLEM

Data Generating Distribution: We consider a learning problem that captures a key property of visual classification. Many visual classes are characterized by the existence of certain patterns. For example, an 8 will typically contain an x-like pattern somewhere in the image. Here we consider an abstraction of this behavior where images consist of a set of patterns. Furthermore, each class is characterized by a pattern that appears exclusively in it. We define this formally below. Let $P$ be a set of orthogonal vectors in $\mathbb{R}^d$, where $|P| \le d$. For simplicity, we assume that $\|p\|_2 = 1$ for all $p \in P$. We consider input vectors $x$ with $n$ patterns of dimension $d$. Formally, $x = (x[1], \ldots, x[n]) \in \mathbb{R}^{nd}$ where $x[i] \in P$ is the $i$th pattern of $x$ and $n < d$. We write $p \in x$ if $x$ contains the pattern $p \in P$.foot_0 Let $P(x) = \{p \in P \mid p \in x\}$ denote the set of all patterns in $x$. Next, we define how labeled points are generated. Consider three non-overlapping sets of patterns $P_+, P_-, P_s \subset P$ whose disjoint union is $P$. $P_+$ is the set of positive patterns, $P_-$ the set of negative patterns and $P_s$ is the set of spurious patterns. For simplicity, in this work we consider the case where $|P_+| = |P_-| = 1$. We denote $P = \{p_1, p_2, \ldots, p_{|P|}\}$, $P_+ = \{p_1\}$ and $P_- = \{p_2\}$. For convenience, we also refer to a set of patterns $A$ by the indices of its patterns, e.g., we write $i \in A$ if $p_i \in A$. We consider distributions $D$ over $(x, y) \in \mathbb{R}^{nd} \times \{\pm 1\}$ with the following properties: (1) $P(y = 1) = P(y = -1) = \frac{1}{2}$. (2) Given $y = 1$, a vector $x$ is sampled as follows. Choose the positive pattern $p_1$ and randomly choose a set of $n - 1$ patterns from $P_s$. Denote this set of $n$ chosen patterns by $A$. Let $x$ be some $x'$ such that $P(x') = A$, i.e., the location of each pattern in $x$ is chosen arbitrarily.foot_1 For example, if $n = 3$ and the $n - 1$ patterns are $p_3, p_7$, this can result in samples such as $([p_1, p_7, p_3], 1)$ or $([p_3, p_1, p_7], 1)$.
(3) Similarly for $y = -1$, only choose $p_2 \in P_-$ instead of $p_1$. We will consider several distributions that satisfy the above and have different sampling schemes for the spurious patterns (see Sec. 7). Fig. 4 in the supplementary shows an example of samples generated using the above procedure. Note that any distribution which satisfies the above is linearly separable, and each vector $x$ can be classified solely based on whether $p_1 \in x$ or $p_2 \in x$. Neural Architecture: For learning the above pattern detection problems, a natural model in this context is a 3-layer network with a convolutional layer, followed by ReLU, max pooling and a fully-connected layer. Each channel in the first layer can be thought of as a detector for a given pattern. We say that a detector detects pattern $p \in P$ if $p$ has the largest dot product with the detector among all patterns in $P_+ \cup P_s$ or $P_- \cup P_s$, and this dot product is positive. For simplicity we fix the weights of the last linear layer to values $\pm 1$. Let $2k$ denote the number of channels. We partition the channels into two sets: $w^{(1)}, \ldots, w^{(k)}$ and $u^{(1)}, \ldots, u^{(k)}$. These will have weights of $+1$ and $-1$ in the output, respectively. Finally, let $W \in \mathbb{R}^{2k \times d}$ be the weight matrix whose rows are the $w^{(i)}$ followed by the $u^{(i)}$. For an input $x = (x[1], \ldots, x[n]) \in \mathbb{R}^{nd}$ where $x[i] \in \mathbb{R}^d$, the output of the network is:

$$N_W(x) = \sum_{i=1}^{k} \left[ \max_j \sigma\left(w^{(i)} \cdot x[j]\right) - \max_j \sigma\left(u^{(i)} \cdot x[j]\right) \right] \quad (1)$$

where $\sigma(z) = \max\{0, z\}$ is the ReLU activation. Let $H$ denote the class of all networks $N_W$ in Eq. 1, with $k > 0$. Finally, we note that $H$ can perfectly fit the distribution $D$ above, by setting $k = 1$, $w^{(1)} = p_1$ and $u^{(1)} = p_2$. Therefore, for $k > 1$ the network is overparameterized. Training Algorithm: Let $S$ be a training set with $m$ IID samples from $D$. We consider minimizing the hinge loss: $\ell(W) = \frac{1}{m} \sum_{(x_i, y_i) \in S} \max\{1 - y_i N_W(x_i), 0\}$. For optimization, we use SGD with constant learning rate $\eta$.
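As a concrete illustration, the sampling procedure and the network of Eq. 1 can be sketched in a few lines of NumPy. This is a hypothetical instantiation (all names are ours): it fixes the patterns to one-hot vectors, as in the experiments of Section E.2, and samples the spurious patterns uniformly without replacement, one of several schemes considered in Sec. 7.

```python
import numpy as np

def sample_point(n, d, rng):
    """Sample (x, y) from a distribution as in Sec. 3 (uniform spurious patterns)."""
    y = int(rng.choice([1, -1]))
    disc = 0 if y == 1 else 1                   # p_1 for y = +1, p_2 for y = -1
    spurious = rng.choice(np.arange(2, d), size=n - 1, replace=False)
    idx = np.concatenate(([disc], spurious))
    rng.shuffle(idx)                            # pattern locations are arbitrary
    return np.eye(d)[idx], y                    # x: (n, d), rows are patterns

def net_forward(x, W_pos, W_neg):
    """Eq. 1: sum_i max_j sigma(w^(i).x[j]) - sum_i max_j sigma(u^(i).x[j])."""
    relu = lambda z: np.maximum(z, 0.0)
    return relu(W_pos @ x.T).max(axis=1).sum() - relu(W_neg @ x.T).max(axis=1).sum()
```

With $k = 1$, $w^{(1)} = p_1$ and $u^{(1)} = p_2$, this network classifies every sample perfectly, matching the observation above that $H$ can fit $D$ with $k = 1$.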
The parameters $W$ are initialized as IID Gaussians with zero mean and standard deviation $\sigma_g$. Let $W_t$ be the weight matrix at iteration $t$ of SGD, and let $w^{(i)}_t, u^{(i)}_t$ be the corresponding vectors at iteration $t$. Detection Ratios: We now define the notion of detection ratios. The detection ratios are a property of the model and will be key to our analysis. We first define the set of neurons that are maximally activated by pattern $p_i$ among all patterns in $P_s \cup P_+$ (i.e., all detectors of pattern $p_i$):foot_4

$$W_+(i) = \left\{ j \,\middle|\, \arg\max_{l \in P_s \cup P_+} w^{(j)} \cdot p_l = i,\; w^{(j)} \cdot p_i > 0 \right\}, \quad U_+(i) = \left\{ j \,\middle|\, \arg\max_{l \in P_s \cup P_+} u^{(j)} \cdot p_l = i,\; u^{(j)} \cdot p_i > 0 \right\} \quad (2)$$

Next we define $M_w^+(i) = \sum_{j \in W_+(i)} w^{(j)} \cdot p_i$ and $M_u^+(i) = \sum_{j \in U_+(i)} u^{(j)} \cdot p_i$. The quantity $M_w^+(i)$ is the sum of the dot products between a pattern $p_i$ and its detectors $w^{(j)}$; it can be interpreted as the overall response of the $w^{(j)}$ detectors of pattern $p_i$. Similarly, $M_u^+(i)$ is defined with the detectors $u^{(j)}$. For all $p_i \in P_+ \cup P_s$ we refer to $\frac{M_u^+(i)}{M_w^+(1)}$ as positive detection ratios. A detection ratio can be interpreted as the ratio between the undesired response of pattern detectors of $p_i$ and the desired response of discriminative pattern detectors (detectors of $p_1$). Therefore, we would like this ratio to be small. Indeed, for any positive point $x_+$ we have that:

$$N_W(x_+) \ge M_w^+(1) - \sum_{p_i \in P_s \cup P_+} M_u^+(i) = M_w^+(1) \left( 1 - \sum_{p_i \in P_s \cup P_+} \frac{M_u^+(i)}{M_w^+(1)} \right) \quad (3)$$

where the inequality follows since positive points have only patterns in $P_+ \cup P_s$, by Eq. 1 and the definitions of $M_w^+(i)$ and $M_u^+(i)$. Notice that if all positive detection ratios are small, then the positive point is classified correctly. We will empirically show that for SGD, the magnitudes of the detection ratios are governed by the statistics of the patterns in the training set, which will imply our generalization result.
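The sets $W_+(i)$ and quantities $M_w^+(i), M_u^+(i)$ of Eq. 2 translate directly into code. The sketch below is our own helper (names and array layout are assumptions): it computes the positive detection ratios for a model whose filters are given as rows of two matrices.

```python
import numpy as np

def positive_detection_ratios(W_pos, W_neg, patterns, pos_spurious_idx):
    """Compute the positive detection ratios M_u^+(i) / M_w^+(1) of Sec. 3.

    W_pos, W_neg: (k, d) filter matrices (the w^(j) and u^(j) rows);
    patterns: (|P|, d) orthonormal pattern matrix; pos_spurious_idx: indices
    of P_+ ∪ P_s, with the positive pattern p_1 listed first.
    """
    def responses(F):
        resp = F @ patterns[pos_spurious_idx].T      # (k, |P_+ ∪ P_s|)
        best = resp.argmax(axis=1)
        # M(i): total response of the filters whose argmax pattern is p_i
        return {l: resp[(best == col) & (resp[:, col] > 0), col].sum()
                for col, l in enumerate(pos_spurious_idx)}
    M_w, M_u = responses(W_pos), responses(W_neg)
    denom = M_w[pos_spurious_idx[0]]                 # M_w^+(1)
    return {l: M_u[l] / denom for l in pos_spurious_idx}
```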
Similarly, we define $W_-(i)$, $U_-(i)$ and $M_w^-(i)$, $M_u^-(i)$, where the only difference is using $P_-$ instead of $P_+$. Furthermore, for all $p_i \in P_- \cup P_s$ we say that $\frac{M_w^-(i)}{M_u^-(2)}$ are negative detection ratios. Then, as in Eq. 3, we have for all negative points $x_-$ that:

$$-N_W(x_-) \ge M_u^-(2) \left( 1 - \sum_{p_i \in P_s \cup P_-} \frac{M_w^-(i)}{M_u^-(2)} \right)$$

This shows that if the negative detection ratios are small, then all negative points are classified correctly. In the rest of the paper, we refer to both positive and negative detection ratios as detection ratios. Empirical Pattern Bias: In a given training set, patterns will appear in both positive and negative examples. The following measure captures how well-balanced the patterns are between the labels. For any pattern $p_i \in P$, define the following statistic of the training set:

$$s_i = \frac{1}{m} \sum_{j=1}^{m} y_j \mathbb{1}\{p_i \in x_j\} \quad (4)$$

The detection ratios are quantities of the learned model. On the other hand, Eq. 4 is a quantity of the sampled training set. In the next section, we define the PSI property, which specifies how these two measures should be related to guarantee good generalization for the learned model.
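The statistic of Eq. 4 is straightforward to compute. A sketch (assuming, as in the experiments of Section E.2, that patterns are standard basis vectors, so the presence of $p_i$ can be read off coordinate $i$):

```python
import numpy as np

def pattern_statistics(xs, ys):
    """s_i = (1/m) * sum_j y_j * 1{p_i in x_j}  (Eq. 4).

    xs: list of (n, d) matrices whose rows are one-hot patterns;
    returns the length-d vector (s_1, ..., s_d).
    """
    s = np.zeros(xs[0].shape[1])
    for x, y in zip(xs, ys):
        s += y * (x.sum(axis=0) > 0)    # indicator of p_i appearing in x
    return s / len(xs)
```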

4. PATTERN STATISTICS INDUCTIVE BIAS

The inductive bias of a learning algorithm refers to how the algorithm chooses among all models that fit the data equally well. For example, an SVM algorithm has an inductive bias towards low norm. Understanding the success of deep learning requires understanding the inductive bias of the learning algorithms used to train networks, and in particular SGD (Zhang et al., 2017). In what follows, we define a certain inductive bias of an algorithm in our setting, which we refer to as the Pattern Statistics Inductive Bias (PSI) property. The PSI property states a simple relation between the relative frequency of patterns $s_i$ (see Eq. 4) and the detection ratios. We begin by providing the formal definition of PSI, and then provide further intuition. For the definition, we let $A$ be any learning algorithm which, given a training set $S$, returns a network $A(S)$ as in Eq. 1. Definition 4.1. We say that a learning algorithm $A$ satisfies the Pattern Statistics Inductive Bias condition with constants $b, c, \delta > 0$ ((b,c,δ)-PSI) if the following holds. For any $m \ge 1$,foot_5 with probability at least $1 - \delta$ over the randomization of $A$ and a training set $S$ of size $m$, $A(S)$ satisfies the following conditions:

$$\forall i \in P_s \cup P_+: \quad \frac{M_u^+(i)}{M_w^+(1)} \le b \max\left\{ -\frac{s_i}{s_1}, 0 \right\} + \frac{c}{\sqrt{m}} \quad (5)$$

$$\forall i \in P_s \cup P_-: \quad \frac{M_w^-(i)}{M_u^-(2)} \le b \max\left\{ -\frac{s_i}{s_2}, 0 \right\} + \frac{c}{\sqrt{m}} \quad (6)$$

We next provide some informal intuition as to why SGD updates may lead to PSI (in Sec. 7 we provide a proof of this for a restricted setting). We will consider updates made by gradient descent (full-batch SGD). Define $W_+^t(i)$ to be the set $W_+(i)$ with weight vectors $w^{(j)}_t$ instead of $w^{(j)}$. Similarly, define $U_+^t(i)$, $W_-^t(i)$ and $U_-^t(i)$. Throughout the discussion below, we assume that these sets have roughly the same size. We will show that in certain cases, a high value of $-\frac{s_i}{s_1}$ implies that the detection ratio $\frac{M_u^+(i)}{M_w^+(1)}$ has a high value, and a low value of $-\frac{s_i}{s_1}$ implies a low value of $\frac{M_u^+(i)}{M_w^+(1)}$.
This motivates the bound in the PSI definition. As we will show, this follows since the statistics of the patterns in the training set, $s_i$, govern the magnitude of the dot-product between a detector and its detected pattern. By our distribution assumption we should have $s_1 \approx \frac{1}{2}$. First assume that $s_i \approx -\frac{1}{2}$ for $p_i \in P_s$, i.e., $-\frac{s_i}{s_1} \approx 1$. Now let us see what the detection ratio $\frac{M_u^+(i)}{M_w^+(1)}$ should be under the gradient update. Note that the gradient is a sum of updates, one for each point in the training set. Assume that $j \in W_+^t(1)$, i.e., $w^{(j)}$ detects $p_1$. Then by the gradient update, the value $\frac{\eta}{m} p_1$ is added to $w^{(j)}_t$ for all positive points that have non-zero hinge loss. The value $-\frac{\eta}{m} p_i$ is also added for a few $p_i \in P_- \cup P_s$ and a subset of the negative points ($i$ depends on the specific negative point). In the next iteration, it holds that $j \in W_+^{t+1}(1)$ and the updates continue similarly. Overall, we see that $w^{(j)}_t \cdot p_1$, which is the dot-product between the detector and its detected pattern, increases in each iteration and should be large after a few iterations. Therefore, $M_w^+(1)$ should be large. By exactly the same argument, we should expect that for $j \in U_+^t(i)$, $u^{(j)}_t \cdot p_i$ increases in each iteration, and now $M_u^+(i)$ should be large. Under the assumption that $|U_+^t(i)| \approx |W_+^t(1)|$, we should have $\frac{M_u^+(i)}{M_w^+(1)} \approx 1$ as well. Therefore, if $-\frac{s_i}{s_1} \approx 1$ then we should expect that $\frac{M_u^+(i)}{M_w^+(1)} \approx 1$. On the other hand, if $p_i$ appears in roughly an equal number of positive and negative points, i.e., $s_i \approx 0$, then we should expect $M_u^+(i)$ to be low. To see this, consider a filter $j \in U_+^t(i)$. In this case, positive points that contain $p_i$ and have non-zero loss add $-\frac{\eta}{m} p_i$ to $u^{(j)}_t$, while negative points that contain $p_i$ and have non-zero loss add $\frac{\eta}{m} p_i$. Thus, $u^{(j)}_t \cdot p_i$ should not increase significantly. Therefore, both $\frac{M_u^+(i)}{M_w^+(1)}$ and $-\frac{s_i}{s_1}$ should be small in this case.
Given the intuition above, one possible conjecture is that the detection ratio $\frac{M_u^+(i)}{M_w^+(1)}$ is bounded by an affine function of $\max\{-\frac{s_i}{s_1}, 0\}$, which leads to the PSI condition in Definition 4.1. The bias term in the affine function takes into account that our intuition above is not exact. Finally, a similar argument with $-\frac{s_i}{s_2}$ motivates Eq. 6.
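To make this intuition concrete, the following toy sketch (our own construction, not from the paper) performs one full-batch subgradient step of the hinge loss with $k = 1$ on two points: the responses $w \cdot p_1$ and $u \cdot p_2$ grow, while the balanced spurious pattern $p_3$ (present in both points, so $s_3 = 0$) receives cancelling cross-class updates and its responses do not grow.

```python
import numpy as np

d, eta = 4, 0.1
P = np.eye(d)
x_pos, x_neg = P[[0, 2]], P[[1, 2]]          # x+ = (p1, p3), x- = (p2, p3)
w = np.array([0.01, 0.0, 0.005, 0.0])        # w weakly detects p1
u = np.array([0.0, 0.01, 0.005, 0.0])        # u weakly detects p2

def gd_step(w, u):
    """One full-batch hinge-loss subgradient step (m = 2); assumes the
    max-pooled pre-activations are positive, so ReLU passes the gradient."""
    gw, gu = np.zeros(d), np.zeros(d)
    for x, y in [(x_pos, 1.0), (x_neg, -1.0)]:
        out = np.maximum(x @ w, 0).max() - np.maximum(x @ u, 0).max()
        if 1 - y * out > 0:                  # hinge is active
            jw, ju = (x @ w).argmax(), (x @ u).argmax()
            gw -= y * x[jw] / 2              # max pooling routes the gradient
            gu += y * x[ju] / 2              # to the argmax pattern
    return w - eta * gw, u - eta * gu

w1, u1 = gd_step(w, u)
```

Here each filter gains $\frac{\eta}{2}$ times its detected discriminative pattern, exactly the $\frac{\eta}{m} p_1$ update described above.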

5. VC DIMENSION BOUND AND RELATION TO PSI

Here we show that the architecture in Section 3 is highly expressive, and can thus potentially overfit and generalize poorly. Moreover, we show examples of networks that overfit and do not satisfy PSI. First, a simple argument shows that $VC(H) \le d^n$ in our setting. The proof is given in Section A. The lower bound below is more challenging, and reveals interesting connections to the PSI property. Theorem 5.1. Assume that $d = 2n$ and $n \ge 2$. Then $VC(H) \ge 2^{d/2 - 1}$. The full proof is given in Section B; here we give a sketch. We construct a set $B$ of size $2^{n-1} = 2^{d/2-1}$ that can be shattered. For a given $I \in \{0,1\}^{n-1}$, let $I[j]$ be its $j$th entry. For any such $I$, define a point $x_I$ such that for any $1 \le j \le n-1$, $x_I[j] = I[j] p_{2j+1} + (1 - I[j]) p_{2j+2}$. Given labels $y_I$, we set $w^{(I)} = \max\{\alpha_I, 0\} \sum_{1 \le j \le n-1} x_I[j]$ and $u^{(I)} = \max\{-\alpha_I, 0\} \sum_{1 \le j \le n-1} x_I[j]$ for each $I \in \{0,1\}^{n-1}$ and constants $\alpha_I$. Then, we prove that there exist constants $\alpha_I$ such that $N(x_I) = y_I$ for all $I$ by solving a linear system. Relation to PSI: Theorem 5.1 shows that there are exponentially large training sets that can be exactly fit with $H$. This fact can be used to show a lower bound on sample complexity that is exponential in $d$ for general ERM algorithms (Anthony & Bartlett, 2009). The networks that fit these datasets are those defined by $w^{(I)}, u^{(I)}$. It is easy to see that these networks do not satisfy the PSI property. To see this, note that $M_w^+(1) = M_u^-(2) = 0$, which implies that the left-hand sides of Eqs. 5 and 6 in Definition 4.1 are infinite. Therefore, PSI is not satisfied for these networks. These networks classify points based on the spurious patterns $P_s$, and not on the patterns which determine the class. Networks that satisfy PSI are essentially the opposite: they classify a point mostly based on detectors for the patterns $p_1$ and $p_2$, and thus generalize well, as we show next.

6. PSI IMPLIES GOOD GENERALIZATION

In the previous section we showed that a general ERM algorithm for the class $H$ may need exponentially many training samples to achieve low test error. Here we show that any algorithm satisfying the PSI condition (see Definition 4.1) has low-degree polynomial sample complexity when the patterns in $P_s$ are unbiased (i.e., $E[y \mathbb{1}\{p_i \in x\}] = 0$ for $p_i \in P_s$). Specifically, in the following theorem we show that such an algorithm will have zero test error w.h.p., given only $O(|P|^2 \log(|P|))$ training samples. Note that this also implies a sample complexity of $O(d^2 \log(d))$, since $|P| \le d$. Theorem 6.1. Assume that $D$ satisfies the conditions in Section 3 and $E[y \mathbb{1}\{p_i \in x\}] = 0$ for all $p_i \in P_s$. Let $A$ be a learning algorithm which satisfies (b,c,δ)-PSI with $b, c \ge 1$. Then, if $m > 300 b^2 c^2 |P|^2 \log(|P|)$, with probability at least $1 - \delta - \frac{4}{|P|^3}$,foot_8 $A(S)$ has 0 test error. We defer the proof to the supplementary, but here we sketch the main argument. By the assumption $E[y \mathbb{1}\{p_i \in x\}] = 0$ and standard concentration of measure, $|s_i|$ should be small, and therefore the detection ratios should be small by PSI. Then, by the key observation that small detection ratios imply perfect classification (e.g., Eq. 3), the algorithm achieves zero test error with respect to $D$.

7. EMPIRICAL AND THEORETICAL EVIDENCE THAT SGD SATISFIES PSI

Empirical Analysis: Thus far we have established that the PSI property implies good generalization. Here we provide empirical evidence that SGD indeed learns such models with overparameterized CNNs. We also provide a qualitative analysis that further confirms that the statistics of the patterns in the training set correlate with the detection ratios. Full details of the experiments are provided in the supplementary. We perform experiments with two distributions, denoted $D_u$ and $D_{vc}$, that satisfy the properties defined in Section 3 and such that $E[y \mathbb{1}\{p_i \in x\}] = 0$ for all $p_i \in P_s$. Thus, given Theorem 6.1, if PSI holds, good generalization is implied. See Section E.2 for details on the distributions. Next, we show that PSI holds with small constants $b$ and $c$ which do not change the order of magnitude of the bound in Theorem 6.1, i.e., $b^2 c^2 < 10$. We trained a neural network in our setting with SGD as described in Section 3. We performed more than 1000 experiments with different parameter values for $n$, $d$, $k$ and $m$ (see Section E.3 for details), with 10 experiments for each set of values of $n$, $d$, $k$ and $m$. For each experiment, we set $b = 2$ and empirically calculated the lowest constant $c$ which satisfies the PSI definition, which we denote by $c^*$. The formal definition of $c^*$ is given in Eq. 10 in the supplementary. Figure 1a shows that across all experiments, the value of $c^*$ is less than 1, i.e., $b^2 (c^*)^2 < 10$. We further checked how $c^*$ varies with $k$ for $D = D_u$, $d = 50$ and $n = 20$. Figure 1b shows that $c^*$ is at most slightly correlated with $k$ and has a low value for large $k$. The intuition we described in Section 4 suggests that there is a positive correlation between $\frac{M_u^+(i)}{M_w^+(1)}$ and $\max\{-\frac{s_i}{s_1}, 0\}$. To test this, we experimented with a distribution in which the probability of a spurious pattern being selected can be varied, thus controlling $\max\{-\frac{s_i}{s_1}, 0\}$.
Figure 1c clearly shows a positive correlation between these quantities, strongly suggesting that the statistics of the patterns in the training set govern the magnitude of the detection ratios. See Section E.5 for further details. Theoretical Analysis in a Simplified Setup: Here we show that PSI holds for a setup of two training points, $S = \{(x_+, 1), (x_-, -1)\}$. We further assume that $x_+$ and $x_-$ have exactly the same patterns in $P_s$. We analyze gradient descent with a constant learning rate $\eta$ (Theorem 7.1). The theorem holds for overparameterized networks, which coincides with our empirical findings in Section 7. It also holds for sufficiently small initialization, and thus is not in the same regime as NTK analyses, where the initialization is large (Woodworth et al., 2019; Chizat et al., 2019).

8. EXPERIMENT ON MNIST

In this section we report experiments on a variant of MNIST (LeCun, 1998) and show that we can use our analysis to better understand the performance of overparameterized CNNs in this setting. Full details of the experiments are given in Section G. Our PSI results thus far can be summarized informally as follows. In a pattern detection problem, an algorithm has PSI bias if the dot product between a discriminative pattern and its detector is large, and the dot product between a spurious pattern and its detector is low. Furthermore, the gap between these dot products increases with the training set size, and a sufficiently large gap implies perfect accuracy. While our analysis required the patterns to be orthogonal, the above idea can work beyond the orthogonal case, as the experiment below shows. We consider data generated as follows. Each data point consists of 9 randomly sampled MNIST images, where if $y = 1$ one of the 9 digits is randomly chosen to be colored blue and the remaining 8 digits are colored green. For $y = -1$ a similar sampling procedure is performed, but with the color red instead of blue. Figure 2a shows examples of data points (note that in our notation $n = 9$ and $d = 28 \cdot 28 \cdot 3 = 2352$). Thus, in this setting, red and blue digits are discriminative whereas green digits are spurious. We use the network in Eq. 1 and SGD to learn a classifier for this data. The data can be perfectly classified with a network with $k = 1$ (see Figure 2b). (Footnote from Sec. 7: To empirically validate PSI and show that it implies good generalization, we could in principle show that the conditions of Theorem 6.1 hold empirically, i.e., that there exist $b$, $c$ and $m$ such that $m > 300 b^2 c^2 d^2 \log(d)$ and PSI holds with constants $b, c$ and high probability $1 - \delta$. However, as with most generalization results, the numerical value (including constants) results in a large $m$ which cannot be empirically tested. Instead, we show that $b$ and $c$ do not change the order of magnitude of the bound.)
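A minimal sketch of the colored-digit construction described above (our own implementation; the RGB channel ordering and the normalization of the grayscale images are assumptions):

```python
import numpy as np

def colorize(digits, y, rng):
    """Build one data point from 9 grayscale MNIST digits.

    digits: (9, 28, 28) images in [0, 1]. For y = +1, one random digit is
    colored blue and the rest green; for y = -1, red instead of blue.
    Returns a (9, 28, 28, 3) array: n = 9 patterns of dim 28*28*3 = 2352.
    """
    rgb = np.zeros((9, 28, 28, 3))
    rgb[..., 1] = digits                 # green channel for all digits
    special = rng.integers(9)            # the single discriminative digit
    rgb[special] = 0
    channel = 2 if y == 1 else 0         # blue if positive, red if negative
    rgb[special, ..., channel] = digits[special]
    return rgb
```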
We train an overparameterized network with $k = 20$ for different training set sizes. Figure 2c shows subsets of the learned filters (see Section G for all filters) for different training set sizes. The figures show the positive filter weight entries in color; the pattern that appears is the pattern that maximally activates the filter, i.e., the pattern it detects. First, we can see that the filters come in three colors (blue, red, green), corresponding to the three pattern types they detect (positive, negative and spurious, respectively). This fact is not trivial, and is similar to what we obtained in the proof of Theorem 7.1 and explained in Section 4. As noted above, the PSI prediction is that the dot product between detectors and detected patterns is large for discriminative patterns (i.e., red and blue) and low for spurious ones (i.e., green). Furthermore, this difference should increase with the data size $m$. Indeed, Figure 2c shows precisely this behavior: as we increase the training set size, the green pattern detectors become darker and thus have a low dot product with their detected pattern, while the red and blue detectors maintain a bright color and thus have large dot products with their detected patterns. Finally, the test accuracy for $m = 6, 20, 1000$ is 88%, 100%, 100%, respectively. This is in line with Theorem 6.1, which shows that PSI can lead to perfect test accuracy when the gap between dot products is sufficiently large.

9. CONCLUSIONS

Understanding the inductive bias of gradient methods for deep learning is an important challenge. In this paper, we study the inductive bias of overparameterized CNNs in a novel setup and provide theoretical and empirical support that SGD exhibits good generalization performance. Our results on MNIST suggest that the PSI phenomenon goes beyond orthogonal patterns. We use a unique approach of combining novel empirical observations with theoretical guarantees to make headway in a challenging setting of overparameterized CNNs. We believe that our work can pave the way for studying inductive bias of neural networks in other challenging settings.

A VC DIMENSION UPPER BOUND

Without considering the order of the patterns in the images, there are at most $d^n$ input points in $D$. Since the network in Eq. 1 is invariant to the order of the patterns in an image, this implies $VC(H) \le d^n$. Note that for the definition of VC dimension, we assume that the domain of possible inputs is the domain of images of the distribution. This gives a tighter upper bound for our problem.

B PROOF OF THEOREM 5.1

We will construct a set $B$ of size $2^{n-1} = 2^{d/2-1}$ that can be shattered. For a given $I \in \{0,1\}^{n-1}$, let $I[j]$ be its $j$th entry. For any such $I$, define a point $x_I$ such that for any $1 \le j \le n-1$, $x_I[j] = I[j] p_{2j+1} + (1 - I[j]) p_{2j+2}$. Furthermore, arbitrarily choose $x_I[n] = p_1$ or $x_I[n] = p_2$, and define $B = \{x_I \mid I \in \{0,1\}^{n-1}\}$. Now, assume that each point $x_I \in B$ has label $y_I$. We will show that there is a network $N \in H$ such that $N(x_I) = y_I$ for all $I$. For each $I \in \{0,1\}^{n-1}$, define $w^{(I)} = \max\{\alpha_I, 0\} \sum_{1 \le j \le n-1} x_I[j]$ and $u^{(I)} = \max\{-\alpha_I, 0\} \sum_{1 \le j \le n-1} x_I[j]$, where $\{\alpha_I\}$ is the unique solution of the following linear system with $2^{n-1}$ equations. For each $I \in \{0,1\}^{n-1}$ the system has the equation:

$$\sum_{I' \in \{0,1\}^{n-1} \setminus \{I\}} \alpha_{I'} = y_{I^c} \quad (7)$$

where for any $I \in \{0,1\}^{n-1}$, $I^c \in \{0,1\}^{n-1}$ is defined by $I^c[j] = 1 - I[j]$ for all $1 \le j \le n-1$. There is a unique solution because the corresponding matrix of the linear system is the difference between an all-1's matrix and the identity matrix. By the Sherman-Morrison formula (Sherman & Morrison, 1950), this matrix is invertible, where in the formula the rank-1 outer-product matrix is the all-1's matrix and the invertible matrix is minus the identity matrix. Then for $N$ with the above weights and any $x_I$:

$$N(x_I) = \sum_{I' \in \{0,1\}^{n-1}} \left[ \max_j \sigma\left(w^{(I')} \cdot x_I[j]\right) - \max_j \sigma\left(u^{(I')} \cdot x_I[j]\right) \right] = \sum_{I' \in \{0,1\}^{n-1}} \alpha_{I'} \max_j \sigma\left( \sum_{1 \le i \le n-1} x_{I'}[i] \cdot x_I[j] \right) = \sum_{I' \in \{0,1\}^{n-1} \setminus \{I^c\}} \alpha_{I'} = y_I$$

by the definition of $N$, the orthogonality of the patterns, and Eq. 7.
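The construction can be checked numerically on a small instance. The sketch below takes $n = 4$ (so $d = 2n = 8$ and $|B| = 8$), draws an arbitrary labeling, solves the system of Eq. 7, and verifies that the resulting network of Eq. 1 fits every label; all variable names are our own.

```python
import numpy as np
from itertools import product

n, d = 4, 8                                     # d = 2n; |B| = 2^(n-1) = 8
P = np.eye(d)                                   # orthonormal patterns p_1..p_d
Is = list(product([0, 1], repeat=n - 1))
comp = lambda I: tuple(1 - b for b in I)        # the complement I^c

def x_of(I):
    """x_I[j] = p_{2j+1} if I[j] = 1 else p_{2j+2} (1-based j), x_I[n] = p_1."""
    rows = [P[2 * j + 2 + (0 if I[j] else 1)] for j in range(n - 1)]
    return np.array(rows + [P[0]])

rng = np.random.default_rng(0)
y = {I: rng.choice([-1.0, 1.0]) for I in Is}    # arbitrary labeling to fit
A = np.ones((len(Is),) * 2) - np.eye(len(Is))   # all-ones minus identity
rhs = np.array([y[comp(I)] for I in Is])        # Eq. 7 right-hand sides
alpha = dict(zip(Is, np.linalg.solve(A, rhs)))

def net(x):                                     # Eq. 1 with filters w^(I), u^(I)
    relu = lambda z: np.maximum(z, 0.0)
    out = 0.0
    for I in Is:
        v = x_of(I)[: n - 1].sum(axis=0)        # sum of the n-1 patterns of x_I
        w, u = max(alpha[I], 0.0) * v, max(-alpha[I], 0.0) * v
        out += relu(x @ w).max() - relu(x @ u).max()
    return out

fits = all(np.isclose(net(x_of(I)), y[I]) for I in Is)
```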
We have shown that any labeling $y_I$ can be achieved; hence the set is shattered, completing the proof.

C PROOF OF THEOREM 6.1

WLOG we prove the theorem for $|P| = d$. By the assumption, for $p_i \in P_s$, $s_i$ is an average of $m$ IID binary variables $y_j \mathbb{1}\{p_i \in x_j\}$ with zero expected value. Thus, by Hoeffding's inequality we have for all $p_i \in P_s$ that:

$$P\left( |s_i| \ge \sqrt{\frac{4 \log(d)}{m}} \right) \le \frac{2}{d^4}$$

Therefore, by a union bound over all patterns $p_i \in P_s$, with probability at least $1 - \frac{2}{d^3}$, for all $p_i \in P_s$:

$$b |s_i| \le b \sqrt{\frac{4 \log(d)}{m}} \le \frac{1}{6cd} \quad (8)$$

Next we consider $p_1$ (the positive pattern), for which $E[s_1] = 0.5$ (because it appears only in the positive examples, and the prior over $y$ is 0.5). Hoeffding's bound and the definition of $m$ imply that $s_1 \ge \frac{1}{3}$ with probability at least $1 - \frac{1}{d^3}$.foot_9 We can now take a union bound over all patterns and the PSI condition to obtain that, with probability at least $1 - \delta - \frac{3}{d^3}$, we have by the PSI property and Eq. 8, for all $p_i \in P_s$:

$$\frac{M_u^+(i)}{M_w^+(1)} \le b \frac{|s_i|}{s_1} + \frac{c}{\sqrt{m}} \le \frac{1}{2cd} + \frac{c}{\sqrt{m}} < \frac{1}{d} \quad (9)$$

From PSI we also have $\frac{M_u^+(1)}{M_w^+(1)} \le \frac{c}{\sqrt{m}} < \frac{1}{d}$. Therefore, for any positive point $(x_+, 1)$, Eq. 3 implies:

$$N_W(x_+) > M_w^+(1) \left( 1 - \frac{d-1}{d} \right) > 0$$

Thus, $x_+$ is classified correctly. By the symmetry of the problem and Eq. 6 in Definition 4.1, any negative point is classified correctly as well.

D FURTHER EXPERIMENTS FOR VALIDATION OF PSI

To further validate the PSI condition, we tested whether the conditions in the proof of Theorem 6.1 hold empirically. Specifically, in the proof we showed that $\frac{M_u^+(i)}{M_w^+(1)} < \frac{1}{d}$ for all $p_i \in P_s$ (Eq. 9). We checked this for all settings of $(n, d)$ and the largest possible $k$ and $m$, namely $k = 10000$ and $m = 40000$. In all of our experiments, SGD converged to a solution with 0 test error such that Eq. 9 holds for all $p_i \in P_s$. Finally, we checked how $c^*$ varies with $m$. In the same setup as Section E.3, we performed experiments with distribution $D_u$, $n = 20$, $d = 50$, $k = 2500$ and $m \in \{100, 200, 500, 1000, 2000, 5000, 20000, 40000, 80000, 120000\}$. Figure 3 shows that $c^*$ is at most slightly correlated with $m$ and has a low value for large $m$.

E EXPERIMENTAL DETAILS OF SECTION 7

Here we provide details of the experiments performed in Section 7. All experiments were run on NVidia Titan Xp GPUs with 12GB of memory. Training algorithms were implemented in TensorFlow. All of the empirical results can be replicated in approximately 150 hours on a single NVidia Titan Xp GPU.

E.1 VALUE OF c *

We use the following formula to compute $c^*$ in the experiments:

$$c^* = \sqrt{m} \cdot \max\left\{ \max_{i \in P_s \cup P_+} \left[ \frac{M^+_u(i)}{M^+_w(1)} - 2\max\left\{-\frac{s_i}{s_1}, 0\right\} \right],\ \max_{i \in P_s \cup P_-} \left[ \frac{M^-_w(i)}{M^-_u(2)} - 2\max\left\{-\frac{s_i}{s_2}, 0\right\} \right],\ 0 \right\}$$
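A direct transcription of this formula (a minimal sketch; the function name `c_star` and the convention of passing detection ratios and pattern statistics as precomputed lists are ours):

```python
import numpy as np

def c_star(m, ratios_pos, s_pos, s1, ratios_neg, s_neg, s2):
    """Empirical PSI constant c* (sketch of the formula in Section E.1).

    ratios_pos[i] plays the role of M^+_u(i) / M^+_w(1) for patterns in
    P_s ∪ P_+, with s_pos[i] the matching pattern statistic s_i; the
    *_neg arguments are the analogous quantities for the negative side.
    """
    pos = max(r - 2 * max(-s / s1, 0.0) for r, s in zip(ratios_pos, s_pos))
    neg = max(r - 2 * max(-s / s2, 0.0) for r, s in zip(ratios_neg, s_neg))
    return np.sqrt(m) * max(pos, neg, 0.0)

print(c_star(100, [0.02, 0.05], [0.1, -0.2], 0.5, [0.01], [0.0], 0.5))  # → 0.2
```

The final `max` with 0 reflects that $c^*$ is clipped at zero, matching the non-negativity of the PSI bound.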

E.2 DISTRIBUTIONS IN EXPERIMENTS

We perform experiments with two types of distributions that satisfy the properties defined in Section 3. They differ in the random sampling procedure of spurious patterns described in Section 3. In both distributions, $P$ is the set of all one-hot vectors in $\mathbb{R}^d$. In the first distribution, $D_u$, the $n-1$ spurious patterns are selected uniformly at random without replacement from $P_s$. In the second distribution, $D_{vc}$, for each $1 \leq j \leq n-1$ one of the patterns $p_{2j+1}, p_{2j+2}$ is selected uniformly at random. Importantly, both $D_u$ and $D_{vc}$ satisfy $E[y\mathbb{1}\{p_i \in x\}] = 0$ for all $p_i \in P_s$. Thus, by Theorem 6.1, good generalization is implied whenever PSI holds. Remark E.1. The support of $D_{vc}$ is the shattered set $B$ in the proof of Theorem 5.1. The proof implies that for any sampled training and test sets which are subsets of $B$, there exists a network with 0 training error and arbitrarily high test error. Therefore, by optimizing the training error, SGD could converge to these solutions. However, as we show empirically, SGD does not converge to these solutions; rather, it satisfies PSI and converges to solutions with good generalization performance.
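The two sampling procedures can be sketched as follows (a minimal numpy sketch; the function name `sample_point` and the zero-based index conventions are ours, and $d \geq 2n$ is assumed for $D_{vc}$):

```python
import numpy as np

def sample_point(n, d, dist, rng):
    """Sample (x, y) from D_u or D_vc (pattern order is irrelevant).

    Patterns are one-hot vectors in R^d: p_1 is the positive pattern,
    p_2 the negative one, and p_3, ..., p_d are the spurious patterns.
    """
    y = rng.choice([-1, 1])
    signal = 0 if y == 1 else 1  # zero-based index of p_1 or p_2
    if dist == "u":
        # D_u: n-1 spurious patterns uniformly without replacement from P_s
        spurious = rng.choice(np.arange(2, d), size=n - 1, replace=False)
    else:
        # D_vc: for each 1 <= j <= n-1, pick one of p_{2j+1}, p_{2j+2}
        spurious = np.array([2 * j + rng.integers(2) for j in range(1, n)])
    patches = np.zeros((n, d))
    patches[0, signal] = 1.0
    for row, idx in enumerate(spurious, start=1):
        patches[row, idx] = 1.0
    return patches, y
```

Each returned point is an $n \times d$ array of one-hot patches together with its label.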

E.3 FIGURE 1A EXPERIMENT

We performed more than 1000 experiments with the network in Eq. 1 and SGD. All orthogonal patterns were one-hot vectors. We trained only the weights of the first convolutional layer. We used a batch size of 20 for $k = 10000$ and a batch size of 100 for $k = 1000$. The learning rate was set to $\max\{\frac{0.001}{2k}, 10^{-7}\}$ and $\sigma_g$ to $10^{-6}$. SGD was stopped either after 50000 epochs or at the first epoch where the training loss was less than $10^{-5}$. For each experiment, we set $b = 2$ and empirically calculated $c^*$.

E.4 FIGURE 1B EXPERIMENT

In the same setup as Section E.3 (i.e., batch size, stopping criteria, learning rate, etc.), we performed experiments with distribution $D_u$, $n = 20$, $d = 50$, $m = 2000$ and $k \in \{50, 100, 1000, 2500, 5000, 7500, 10000\}$.

E.5 FIGURE 1C EXPERIMENT

We experimented with a distribution $D_p$ which can vary the probability of a spurious pattern being selected, and can thus control $\max\{-\frac{s_i}{s_1}, 0\}$. Given $y = 1$, it selects $p_3$ with probability $p$ or $p_4$ with probability $1-p$. Then it selects the remaining $n-2$ patterns from $P_s \setminus \{p_3, p_4\}$ uniformly at random without replacement. Similarly, given $y = -1$, it selects $p_3$ with probability $1-p$ or $p_4$ with probability $p$. The remaining $n-2$ patterns are selected uniformly without replacement from $P_s \setminus \{p_3, p_4\}$. We experimented with various $p$ and plotted, for each solution of SGD, $\frac{M^+_u(i)}{M^+_w(1)}$ and $\max\{-\frac{s_i}{s_1}, 0\}$ for all $p_i \in P_s \cup P_+$. In the setup of Section E.3 we experimented with distributions $D_p$ for $p$ values in $\{0.0, 0.01, 0.03, 0.05, 0.07, 0.1, 0.12, 0.2, 0.21, 0.28, 0.3, 0.4, 0.44, 0.5, 0.51, 0.59, 0.6, 0.68, 0.7, 0.78, 0.8, 0.9, 0.91, 0.94, 0.95, 0.98, 0.99, 1.0\}$.

F PROOF OF THEOREM 7.1

F.1 NOTATION

Here we define additional notation that will be useful for the proof of the theorem. Let $P_T$ be the set of all patterns that appear in either $x^+$ or $x^-$. Similarly to Eq. 2, define:

$$W^+_t(i) = \left\{ j \,\middle|\, \arg\max_{l \in P_T \setminus \{2\}} w^{(j)}_t \cdot p_l = i,\ w^{(j)}_t \cdot p_i > 0 \right\} \qquad U^+_t(i) = \left\{ j \,\middle|\, \arg\max_{l \in P_T \setminus \{2\}} u^{(j)}_t \cdot p_l = i,\ u^{(j)}_t \cdot p_i > 0 \right\}$$

and

$$W^-_t(i) = \left\{ j \,\middle|\, \arg\max_{l \in P_T \setminus \{1\}} w^{(j)}_t \cdot p_l = i,\ w^{(j)}_t \cdot p_i > 0 \right\} \qquad U^-_t(i) = \left\{ j \,\middle|\, \arg\max_{l \in P_T \setminus \{1\}} u^{(j)}_t \cdot p_l = i,\ u^{(j)}_t \cdot p_i > 0 \right\}$$

Define:

$$A_w = \bigcup_{i \in P_T \setminus \{2\}} W^+_0(i) \qquad A_u = \bigcup_{i \in P_T \setminus \{1\}} U^-_0(i)$$

Finally, we define $\mathrm{poly}(x)$ to be any polynomial function of $x$.
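The sizes of these initialization sets are easy to simulate: with one-hot patterns, $w^{(j)}_0 \cdot p_l$ is just a Gaussian coordinate. The following sketch (our own, assuming $|P_T \setminus \{2\}| = n$ and Gaussian initialization) checks that $|A_w|/k \approx 1 - 2^{-n}$ and that each $W^+_0(i)$ receives about a $1/n$ fraction of $A_w$:

```python
import numpy as np

rng = np.random.default_rng(1)
k, n = 200000, 10

# w^(j) . p_l for a one-hot p_l is a single Gaussian coordinate; draw n of
# them per filter, one per pattern of P_T \ {2} (assumed to number n here).
W = rng.normal(size=(k, n))
in_Aw = W.max(axis=1) > 0                      # j in A_w iff some dot product > 0
print(abs(in_Aw.mean() - (1 - 2.0 ** -n)))     # should be ≈ 0

winners = W.argmax(axis=1)[in_Aw]              # which pattern each filter detects
counts = np.bincount(winners, minlength=n) / in_Aw.sum()
print(np.abs(counts - 1.0 / n).max())          # each W^+_0(i) gets ≈ 1/n of A_w
```

These are exactly the quantities bounded with high probability in Lemmas F.1 and F.2.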

F.2 AUXILIARY LEMMAS

We now prove several technical lemmas; in Section F.3 we use them to prove the theorem. In the next three lemmas we provide high-probability bounds on the sizes of certain sets that are functions of the sets in Eq. 11 and Eq. 12.

Lemma F.1. For any $0 < \epsilon < \frac{1}{4}$, with probability at least $1 - 4e^{-8}$, for any $k > \mathrm{poly}(\frac{1}{\epsilon})$:

$$\left| \frac{|A_w|}{k} - (1 - 2^{-n}) \right| \leq \epsilon \quad \text{and} \quad \left| \frac{|A_u|}{k} - (1 - 2^{-n}) \right| \leq \epsilon$$

Proof. It suffices to show that for any $k$:

$$\left| |A_w| - (1 - 2^{-n})k \right| \leq 2\sqrt{k} \quad \text{and} \quad \left| |A_u| - (1 - 2^{-n})k \right| \leq 2\sqrt{k}$$

For each $1 \leq j \leq k$ it holds that $j \in A_w$ with probability $1 - 2^{-n}$. Therefore, by Hoeffding's inequality, with probability at least $1 - 2e^{-8}$, $\left| |A_w| - (1 - 2^{-n})k \right| \leq 2\sqrt{k}$. The same argument applies for $A_u$; a union bound and setting $k > \frac{1}{\epsilon^3}$ conclude the proof.

Lemma F.2. For any $\epsilon > 0$, with probability at least $1 - \frac{4}{d^7} - 4e^{-8}$, for $k > \mathrm{poly}(\log d, \frac{1}{\epsilon})$ and for all $i \in P_T \setminus \{2\}$:

$$\frac{|W^+_0(i)|}{k(1 - 2^{-n} + \epsilon)} \leq \frac{1}{n} + \epsilon \quad \text{and} \quad \frac{|W^+_0(i)|}{k(1 - 2^{-n} - \epsilon)} \geq \frac{1}{n} - \epsilon$$

Similarly, for all $i \in P_T \setminus \{1\}$:

$$\frac{|U^-_0(i)|}{k(1 - 2^{-n} + \epsilon)} \leq \frac{1}{n} + \epsilon \quad \text{and} \quad \frac{|U^-_0(i)|}{k(1 - 2^{-n} - \epsilon)} \geq \frac{1}{n} - \epsilon$$

Proof. Without loss of generality, consider $|W^+_0(i)|$. We first condition on the random variable $|A_w|$, given that the event $k(1 - 2^{-n} - \epsilon) \leq |A_w| \leq k(1 - 2^{-n} + \epsilon)$ holds. By symmetry, we have $E\left[\frac{|W^+_0(i)|}{|A_w|}\right] = \frac{1}{n}$, where the expectation is with respect to the initialization. Thus, we get by Hoeffding's inequality:

$$P\left( \left| \frac{|W^+_0(i)|}{|A_w|} - \frac{1}{n} \right| \geq 2\sqrt{\frac{\log d}{|A_w|}} \right) \leq 2e^{-2|A_w| \cdot \frac{4\log d}{|A_w|}} = \frac{2}{d^8}$$

By the law of total probability, applying Lemma F.1 and a union bound over $i \in P_T$ twice (for both $|W^+_0(i)|$ and $|U^-_0(i)|$), we get the desired result.

Lemma F.3. For any $\epsilon > 0$, with probability at least $1 - \frac{4}{d^7} - 4e^{-8}$, for $k > \mathrm{poly}(\log d, \frac{1}{\epsilon})$ and for all $i \in P_T \setminus \{1, 2\}$ the following holds:

$$\frac{|W^+_0(i) \cap W^-_0(2)|}{k\left(\frac{1}{2} - 2^{-n-1} + \epsilon\right)} \leq \frac{1}{n(n+1)} + \epsilon \quad \text{and} \quad \frac{|W^+_0(i) \cap W^-_0(2)|}{k\left(\frac{1}{2} - 2^{-n-1} - \epsilon\right)} \geq \frac{1}{n(n+1)} - \epsilon$$

$$\frac{|U^+_0(1) \cap U^-_0(i)|}{k\left(\frac{1}{2} - 2^{-n-1} + \epsilon\right)} \leq \frac{1}{n(n+1)} + \epsilon \quad \text{and} \quad \frac{|U^+_0(1) \cap U^-_0(i)|}{k\left(\frac{1}{2} - 2^{-n-1} - \epsilon\right)} \geq \frac{1}{n(n+1)} - \epsilon$$

Proof. The proof is similar to the proofs of Lemma F.1 and Lemma F.2.
The difference is that we use the equalities $E\left[|A_w \cap W^-_0(2)|\right] = E\left[|A_u \cap U^+_0(1)|\right] = \left(\frac{1}{2} - 2^{-n-1}\right)k$ instead of $E[|A_w|] = E[|A_u|] = (1 - 2^{-n})k$ as in Lemma F.1. Furthermore, we use $E\left[\frac{|W^+_0(i) \cap W^-_0(2)|}{|A_w \cap W^-_0(2)|}\right] = E\left[\frac{|U^+_0(1) \cap U^-_0(i)|}{|A_u \cap U^+_0(1)|}\right] = \frac{1}{n(n+1)}$ for fixed $|A_w \cap W^-_0(2)|$ and $|A_u \cap U^+_0(1)|$, instead of $E\left[\frac{|W^+_0(i)|}{|A_w|}\right] = E\left[\frac{|U^-_0(i)|}{|A_u|}\right] = \frac{1}{n}$ for fixed $|A_w|$ and $|A_u|$ as in Lemma F.2.

Lemma F.4. For any $M > 0$ and $\delta > 0$, there exists a sufficiently small $\sigma_g > 0$ such that with probability at least $1 - \delta$, for all $1 \leq i \leq k$, $\|w^{(i)}_0\| \leq M$ and $\|u^{(i)}_0\| \leq M$.

Proof. The proof is immediate.

We now proceed to analyze the dynamics of gradient descent in the next two lemmas. Define $M$ such that for all $1 \leq i \leq k$, $\|w^{(i)}_0\| \leq M$. Let $E$ be the set of all $t$ such that for all $x \in S$, it holds that $N_{W_t}(x) < 1$. Let $t^* = \arg\min_t \{t - 1 \in E, t \notin E\}$. We assume that $\eta$ and $\sigma_g$ are sufficiently small such that $t^* \geq 2$.

Lemma F.5. For sufficiently small $\epsilon$, $M$, $c_\eta$ such that $M \ll \eta$, the following holds for any $1 \leq t \leq t^*$:
1. If $j \notin A_w$, then $w^{(j)}_t = w^{(j)}_0 - \alpha\frac{\eta}{2}p_2$ where $\alpha \in \{0, 1\}$.
2. If $j \in W^+_0(1)$, then $w^{(j)}_t = w^{(j)}_0 - \frac{\eta}{2}\sum_{i \in P_T \setminus \{1\}} \alpha_i p_i + \frac{\eta t}{2}p_1$, where $\alpha_i \in \{0, 1\}$.
3. If $i \in P_T \cap P_s$ and $j \in W^+_0(i) \cap W^-_0(i)$, then $w^{(j)}_t = w^{(j)}_0$.
4. If $i \in P_T \cap P_s$ and $j \in W^+_0(i) \cap W^-_0(2)$, then $w^{(j)}_t = w^{(j)}_0 - \frac{\eta}{2}p_2 + \frac{\eta}{2}p_i$.
5. If $j \notin A_u$, then $u^{(j)}_t = u^{(j)}_0 - \alpha\frac{\eta}{2}p_1$ where $\alpha \in \{0, 1\}$.
6. If $j \in U^-_0(2)$, then $u^{(j)}_t = u^{(j)}_0 - \frac{\eta}{2}\sum_{i \in P_T \setminus \{2\}} \alpha_i p_i + \frac{\eta t}{2}p_2$, where $\alpha_i \in \{0, 1\}$.
7. If $i \in P_T \cap P_s$ and $j \in U^-_0(i) \cap U^+_0(i)$, then $u^{(j)}_t = u^{(j)}_0$.
8. If $i \in P_T \cap P_s$ and $j \in U^-_0(i) \cap U^+_0(1)$, then $u^{(j)}_t = u^{(j)}_0 - \frac{\eta}{2}p_1 + \frac{\eta}{2}p_i$.

Proof. 1. If $j \notin W^-_0(2)$, then for $t = 1$ the gradient of the loss with respect to $w^{(j)}$ is 0, because every pattern in $P_T$ has a negative dot product with $w^{(j)}$

where the minimum is over integral times $t$. Notice that
$t_+ \neq t_-$ can only occur when there exists an integer $r$ such that

$$\left|1 - \frac{c_\eta r \gamma}{2}\right| \leq 2\max\{\beta_+, \beta_-\} \quad (15)$$

Choose $c_\eta$ to be a small number which is not an integral multiple of $\frac{2}{\gamma}$ (e.g., choose an irrational $c_\eta$). Then $\max\{\beta_+, \beta_-\}$ can be made sufficiently small such that Eq. 15 does not hold. In this case, after $\frac{1}{\gamma c_\eta} \leq t_+ = t_- = t^* \leq \frac{3}{\gamma c_\eta}$ iterations, gradient descent converges to a global minimum.

F.3 FINISHING THE PROOF OF THEOREM 7.1

We are now ready to prove the theorem. Gradient descent converges to a global minimum after $\frac{1}{\gamma c_\eta} \leq T \leq \frac{3}{\gamma c_\eta}$ iterations by Lemma F.6. Furthermore, by the proof of Lemma F.6, $W^+_T(1) = W^+_0(1)$. For each $j \in W^+_0(1)$, the norm of $w^{(j)}_T$ is at least $\frac{\eta T}{2} \geq \frac{1}{2\gamma k}$. Therefore, for a sufficiently small $\epsilon$, by Lemma F.2:

$$M^+_w(1) \geq \frac{|W^+_0(1)|}{2\gamma k} \geq \frac{1}{3}$$

Now, by Lemma F.6, for all $j \notin W^+_0(1)$, it holds that $\|w^{(j)}_T\|$

G EXPERIMENTAL DETAILS OF SECTION 8

Here we provide details of the experiments performed in Section 8. All experiments were run on NVidia Titan Xp GPUs with 12GB of memory. Training algorithms were implemented in PyTorch. All of the empirical results can be replicated in approximately one hour on a single NVidia Titan Xp GPU. We now describe how we created train and test sets for our setting. For training we sampled digits from the original MNIST training set, and for testing we sampled digits from the original MNIST test set. To sample a data point, we randomly sampled a label $y \in \{\pm 1\}$. Then, if $y = 1$, we randomly sampled 9 MNIST digits (either from the MNIST train or test set), randomly chose 8 of them to be colored green and one to be colored blue. If $y = -1$, we followed the same procedure with blue replaced by red. For training we implemented the setting in Section 3. Specifically, here we have $n = 9$ and $d = 28 \times 28 \times 3 = 2352$. We trained the network in Eq. 1 with $k = 20$ for training set sizes $m = 6$, $m = 20$ and $m = 1000$.
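The data construction just described can be sketched as follows (a minimal sketch; `make_colored_example` is our name, and the digits are assumed to be grayscale $28 \times 28$ arrays with values in $[0, 1]$):

```python
import numpy as np

def make_colored_example(digits, y, rng):
    """Build one colored-MNIST data point: 9 patches, 8 green digits plus
    one blue digit for y = +1, or one red digit for y = -1."""
    assert len(digits) == 9
    odd_channel = 2 if y == 1 else 0             # RGB: blue for +1, red for -1
    odd_idx = rng.integers(9)                    # position of the odd-colored digit
    patches = np.zeros((9, 28, 28, 3))
    for i, img in enumerate(digits):
        ch = odd_channel if i == odd_idx else 1  # channel 1 is green
        patches[i, :, :, ch] = img
    return patches
```

The 9 patches correspond to $n = 9$ in the setting of Section 3, with $d = 28 \times 28 \times 3$ per patch.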
For each training set size we performed 10 different experiments with different sampled training sets and initializations. We ran SGD with batch size $\min\{10, m\}$ and learning rate 0.0001 for 200 epochs. We report the test accuracy and train accuracy at the final epoch (200). In all runs SGD attains 100% train accuracy. For $m = 6$ the mean test accuracy is 88.09% with standard deviation 12.7, for $m = 20$ the mean test accuracy is 99.84% with standard deviation 0.35, and for $m = 1000$ the mean test accuracy is 100% with standard deviation 0. To plot the learned filters, for each entry $x$ of a filter we calculated $\max\{0, x\}$. We then scaled the weights of the network to values between 0 and 255 by dividing all entries by the maximum entry across all parameters of the network (after applying $\max\{0, x\}$) and multiplying by 255.
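The filter-rendering step can be sketched as follows (`filters_to_images` is our name; note that the uint8 conversion truncates fractional values):

```python
import numpy as np

def filters_to_images(weights):
    """Rescale learned filters for display as described above: replace each
    entry x by max{0, x}, then map to [0, 255] by dividing by the global
    maximum over all parameters and multiplying by 255."""
    clipped = [np.maximum(w, 0.0) for w in weights]
    global_max = max(w.max() for w in clipped)
    if global_max == 0:
        return [np.zeros_like(w, dtype=np.uint8) for w in clipped]
    return [(w / global_max * 255).astype(np.uint8) for w in clipped]
```

The global (rather than per-filter) maximum keeps relative magnitudes between filters comparable in the rendered images.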



Footnotes:
- We say that $x$ contains $p$ if there exists $j$ such that $x[j] = p$. The order of the patterns does not matter, because the convolutional network is invariant to it.
- The reason we consider these two sets of patterns will become clear when we discuss detection ratios.
- Note that this does not affect the expressive power of the network.
- Given a tie between sets, we assume the weight is assigned arbitrarily to one of them.
- We state $m \geq 1$ for simplicity. Alternatively, one can assume $m \geq C$ for a constant $C$.
- This holds with high probability at initialization for a sufficiently large network. Furthermore, in Section 7, we show that it holds during training in the case of two training points.
- We consider $\max\{-\frac{s_i}{s_1}, 0\}$ in the PSI definition because the detection ratios are non-negative.
- We note that the $\frac{4}{d^3}$ may be improved to an arbitrary $\gamma > 0$ if we scale $m$ by $\log\frac{1}{\gamma}$.
- In fact we can have an exponential dependence here, but we use $d^3$ to simplify later expressions.
- Note that $\max\{\beta_+, \beta_-\}$ does not depend on $c_\eta$.



Figure 1: Empirical analysis of $c^*$. (a) Empirical calculation of $c^*$ for $D_u$ and $D_{vc}$. Values are in log scale. (b) $c^*$ as a function of the network size $k$. (c) Positive correlation between $\frac{M^+_u(i)}{M^+_w(1)}$ and $\max\{-\frac{s_i}{s_1}, 0\}$. The depicted line is the best PSI bound with $b = 2$ (lowest $c$).

The following theorem shows that PSI holds with constants $b = 1$ and $c = \sqrt{18c_\eta}$. The proof analyzes the trajectory of gradient descent and is provided in Section F.

Theorem 7.1. For sufficiently small $\epsilon$, $\sigma_g$, $c_\eta$ such that $\sigma_g \ll \eta$ and $k \geq \mathrm{poly}(\log d, \frac{1}{\epsilon})$, with probability at least $1 - \frac{9}{d^7} - 8e^{-8}$, gradient descent converges to a global minimum after $T \leq O(\frac{1}{c_\eta})$ iterations and the PSI condition is satisfied with $b = 1$ and $c = \sqrt{18c_\eta}$.

Figure 2: Experiments on a variant of MNIST. (a) Examples of data points. (b) Filters of a CNN with $k = 1$ that perfectly classifies the data. (c) Examples of learned filters of overparameterized CNNs for different training set sizes. Figures of all learned filters are given in Section G.

Figure 3: $c^*$ as a function of the training set size $m$.

Figure 4: Example of points in an orthogonal patterns distribution. Here there are 25 possible orthogonal patterns ($|P| = 25$) and each pattern is a $10 \times 10$ image patch, and thus $d = 100$. The number of patterns in each image is $n = 16$: the image consists of 4 rows of 4 patches each. The positive examples contain the pattern in $P_+$; the negative examples contain the pattern in $P_-$. In the two leftmost images of each class, the corresponding pattern is shown. All other patterns in an image are from the set of spurious patterns $P_s$.
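A minimal sketch of this grid layout, assuming one-hot patterns as in the experiments of Section E (`build_image` and its argument convention are ours):

```python
import numpy as np

def build_image(pattern_ids, patch_hw=10, grid=4):
    """Arrange n = grid*grid one-hot patterns into a (grid*patch_hw)^2 image,
    one patch per grid cell, as in Figure 4 (d = patch_hw**2 per pattern).
    pattern_ids[i] in [0, patch_hw*patch_hw) selects the hot pixel of patch i."""
    img = np.zeros((grid * patch_hw, grid * patch_hw))
    for i, pid in enumerate(pattern_ids):
        r, c = divmod(i, grid)            # grid cell holding patch i
        pr, pc = divmod(pid, patch_hw)    # hot pixel inside the patch
        img[r * patch_hw + pr, c * patch_hw + pc] = 1.0
    return img
```

With the defaults this produces a $40 \times 40$ image of 16 patches, matching the $n = 16$, $d = 100$ example in the caption.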

We experimented with parameter values $k \in \{1000, 10000\}$, $m \in \{100, 500, 1000, 2000, 5000, 20000, 40000\}$, $(n, d) \in \{(10, 20), (10, 80), (20, 50), (40, 60)\}$ for $D = D_u$ and $(n, d) \in \{(10, 20), (40, 80), (25, 50), (30, 60)\}$ for $D = D_{vc}$. For each distribution $D_u$ or $D_{vc}$, we performed 10 experiments for each set of values of $n$, $d$, $k$ and $m$. For each set of values we plot the mean of the 10 experiments with standard deviation error bars in shaded regions. In each of the 10 experiments we randomly sampled the training and test sets according to the given distribution $D_u$ or $D_{vc}$ and randomly sampled the initialization of the network. We used a test set of size 1000.

We experimented with values $n = 40$, $d = 60$, $m = 1000$ and $k = 2500$. SGD was stopped either after 2000 epochs or at the first epoch where the training loss was less than $10^{-5}$.

for all $t \geq 1$. If $j \in W^-_0(2)$, then $\frac{\eta}{2}p_2$ will be subtracted in the first iteration, and $w^{(j)}_t$ will not change in later iterations.

for $1 \leq t \leq t^*$, where $\beta_-$ is sufficiently small. Our goal is to show that gradient descent converges to a global minimum at $T = t^*$. Let $t_+ = \arg\min_{t \leq t^*}$

for all $i \in P \setminus \{2\}$, $M^-_w(i) \leq c_\eta$. By symmetry, it follows that the PSI property holds with $b = 1$ and $c = \sqrt{18c_\eta}$.

Figure 5, Figure 6 and Figure 7 show the set of all filters in the experiments reported in Figure 2c, for $m = 6$, $m = 20$ and $m = 1000$, respectively. The filters were rendered as described in Section G: each entry $x$ was replaced by $\max\{0, x\}$ and the result rescaled to $[0, 255]$.

Figure 5: Learned filters for experiment with m = 6.

Figure 6: Learned filters for experiment with m = 20.


2. The proof follows directly from the gradient update. In each iteration, $\frac{\eta}{2}p_1$ is added and a pattern in $P_T \setminus \{1\}$ is subtracted, unless all such patterns already have a negative dot product with $w^{(j)}_t$. Note that we used here the fact that $M \ll \eta$. 3. For $t = 1$, the gradient updates induced by $x^+$ and $x^-$ cancel, and thus $w^{(j)}_t = w^{(j)}_0$ for all $1 \leq t \leq t^*$. 4. The proof follows by the gradient update as in the previous proofs. For $t = 1$, the term $\frac{\eta}{2}p_2$ is subtracted by the update of $x^-$, since $j \in W^-_0(2)$. The term $\frac{\eta}{2}p_i$ is added due to the update of $x^+$. Now $j \in W^+_1(i) \cap W^-_1(i)$, and thus $w^{(j)}_t$ will not change in subsequent iterations, as in the proof of part 3. This concludes the proof. By symmetry, the proofs of parts 5-8 are identical to the proofs of parts 1-4.

Define $\gamma$ as in Eq. 14;

then we have the following:

Lemma F.6. For sufficiently small $\epsilon$, $M$, $c_\eta$ such that $M \ll \eta$ and $k \geq \mathrm{poly}(\log d, \frac{1}{\epsilon})$, with probability at least $1 - \frac{9}{d^7} - 8e^{-8}$, gradient descent converges to a global minimum after $\frac{1}{\gamma c_\eta} \leq T \leq \frac{3}{\gamma c_\eta}$ iterations.

Proof. Throughout the proof we use Lemma F.4 to choose a sufficiently small $M$ such that $\|w^{(i)}_0\| \leq M$ with probability at least $1 - \frac{1}{d^7}$. We further apply Lemma F.1, Lemma F.2 and Lemma F.3, which together with Lemma F.4 hold with probability at least $1 - \frac{9}{d^7} - 8e^{-8}$. Define the sets of weights $B_i$ such that $B_i$ corresponds to the set of weights in part $i$ of Lemma F.5. For example, $B_2 = W^+_0(1)$ and $B_8 = \bigcup_{i \in P_T \cap P_s} U^-_0(i) \cap U^+_0(1)$. We would like to analyze the dynamics of $N_{W_t}(x^+)$. To do so, we address each $N^{(i)}_{W_t}(x^+)$ separately.

Bounding $N^{(1)}_{W_t}(x^+)$: By Lemma F.5 part 1, the corresponding bound holds for all $1 \leq t \leq t^*$. Recall the definition of $\gamma$ in Eq. 14. Then, by Lemma F.2, for sufficiently small $\epsilon$ and $k > \mathrm{poly}(\log d, \frac{1}{\epsilon})$, the corresponding size bounds hold, and we apply Lemma F.4 to bound the contribution of the initialization. Therefore, after $1 \leq t \leq t^*$ iterations, by choosing $M$ and $\epsilon$ to be sufficiently small (given an upper bound on $t$ that does not depend on $M$ and $\epsilon$, which we show later), $N^{(1)}_{W_t}(x^+)$ is sufficiently small.

Calculating $N^{(2)}_{W_t}(x^+)$: Notice that after $n - 1$ iterations, we have by Lemma F.5 part 6 that $N^{(2)}_{W_t}(x^+) = 0$. By taking $c_\eta$ to be sufficiently small, we can ensure that $n - 1 < t^*$.

Bounding $N^{(3)}_{W_t}(x^+)$: By Lemma F.5 parts 4 and 8, and given that $M \ll c_\eta$, the corresponding bound holds for $1 \leq t \leq t^*$. By Lemma F.3, for sufficiently small $\epsilon$ and any $i \in P_T \cap P_s$, the relevant difference is sufficiently small. We conclude that $N^{(3)}_{W_t}(x^+)$ is sufficiently small for small $\epsilon$.

Bounding $N^{(4)}_{W_t}(x^+)$: By Lemma F.5, parts 1, 3, 5 and 7, it follows that $N^{(4)}_{W_t}(x^+) \leq kM$ and thus can be made sufficiently small for small $M$.

Finishing the proof: By combining the previous arguments, we obtain the claimed bound for $1 \leq t \leq t^*$, where $\beta_+$ is sufficiently small. By symmetry, we have:

