ON THE INDUCTIVE BIAS OF A CNN FOR DISTRIBUTIONS WITH ORTHOGONAL PATTERNS

Abstract

Training overparameterized convolutional neural networks with gradient-based optimization is the most successful learning method for image classification. However, their generalization properties are far from understood. In this work, we consider a simplified image classification task where images contain orthogonal patches and are learned with a 3-layer overparameterized convolutional network and stochastic gradient descent (SGD). We empirically identify a novel phenomenon of SGD in our setting, where the dot-product between the learned pattern detectors and their detected patterns is governed by the pattern statistics in the training set. We call this phenomenon Pattern Statistics Inductive Bias (PSI) and empirically verify it in a large number of instances. We prove that in our setting, if a learning algorithm satisfies PSI then its sample complexity is O(d^2 log(d)), where d is the filter dimension. In contrast, we show a VC dimension lower bound which is exponential in d. We perform experiments with overparameterized CNNs on a variant of MNIST with non-orthogonal patches, and show that the empirical observations are in line with our analysis.

1. INTRODUCTION

Convolutional neural networks (CNNs) have achieved remarkable performance in various computer vision tasks (Krizhevsky et al., 2012; Xu et al., 2015; Taigman et al., 2014). In practice, these networks typically have more parameters than needed to achieve zero train error (i.e., are overparameterized). Despite non-convexity and the potential problem of overfitting, training these models with gradient-based methods leads to solutions with low test error. It is still largely unknown why such simple optimization algorithms have outstanding test performance for learning overparameterized convolutional networks. Recently, there have been major efforts to provide generalization guarantees for overparameterized CNNs. However, current generalization guarantees either depend on the number of channels of the network (Long & Sedghi, 2020) or hold under specific constraints on the weights (Li et al., 2018). Clearly, the generalization of overparameterized CNNs depends on both the learning algorithm (gradient-based methods) and unique properties of the data. Providing generalization guarantees while incorporating these factors is a major challenge. Indeed, this requires analyzing non-convex optimization methods and mathematically defining properties of the data, which is extremely difficult for real-world problems. Therefore, it is necessary to first understand simple settings which are amenable to theoretical and empirical analysis and share salient features with real-world problems. Towards this goal, we analyze a simplified pattern recognition task where all patterns in the images are orthogonal and the classification is binary. The architecture is a 3-layer overparameterized convolutional neural network and it is learned using stochastic gradient descent (SGD).
We take a unique approach that combines novel empirical observations with theoretical guarantees to provide a novel generalization bound which is independent of the number of channels and is a low-degree polynomial of the filter dimension, which is usually low in practice. Empirically, we identify a novel property of the solutions found by SGD. We observe that the statistics of patterns in the training data govern the magnitude of the dot-product between learned pattern detectors and their detected patterns. Specifically, patterns that appear almost exclusively in one of the classes will have a large dot-product with the channels that detect them. On the other hand, patterns that appear roughly equally in both classes will have a low dot-product with their detecting channels. We formally define this as the "Pattern Statistics Inductive Bias" condition (PSI) and provide empirical evidence that PSI holds across a large number of instances. We also prove that SGD indeed satisfies PSI in a simple setup of two points in the training set. Under the assumption that PSI holds, we analyze the sample complexity and prove that it is at most O(d^2 log(d)), where d is the filter dimension. In contrast, we show that the VC dimension of the class of functions we consider is exponential in d, and thus there exist other learning algorithms (not SGD) that will have exponential sample complexity. Together, these results provide firm evidence that even though SGD can in principle overfit, it is nonetheless biased towards solutions which are determined by the statistics of the patterns in the training set, and consequently it has good generalization performance. We perform experiments with overparameterized CNNs on a variant of MNIST that has non-orthogonal patterns. We use our analysis to better understand why SGD has low sample complexity in this setting. We empirically show that the inductive bias of SGD is similar to PSI.
This suggests that the idea of PSI is not unique to the orthogonal case and can be useful for understanding overparameterized CNNs in other challenging settings.
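The PSI quantity described above can be illustrated with a small sketch. The function names, the toy counts, and the stand-in random filters below are ours (in the paper, the filters would come from a trained first convolutional layer); the sketch only shows which two quantities PSI relates: per-pattern class exclusivity in the training set and the largest filter-pattern dot-product.

```python
import numpy as np

def pattern_detector_strength(filters, patterns):
    """filters: (channels, d) first-layer weights; patterns: (num_patterns, d)
    orthogonal unit patterns. Returns, per pattern, the maximal dot-product
    over channels, i.e., the response of its best 'detecting channel'."""
    dots = patterns @ filters.T            # (num_patterns, channels)
    return dots.max(axis=1)

def class_exclusivity(counts_pos, counts_neg):
    """Fraction of a pattern's training occurrences that fall in its majority
    class: 1.0 means the pattern appears in only one class, 0.5 means it
    appears equally often in both."""
    total = counts_pos + counts_neg
    return np.maximum(counts_pos, counts_neg) / total

rng = np.random.default_rng(0)
d, channels, num_patterns = 8, 16, 4
patterns = np.eye(d)[:num_patterns]        # orthogonal unit patterns
filters = rng.normal(size=(channels, d))   # stand-in for trained weights

strength = pattern_detector_strength(filters, patterns)
excl = class_exclusivity(np.array([90, 10, 50, 5]),
                         np.array([10, 90, 50, 95]))
# PSI predicts: for trained filters, patterns with exclusivity near 1
# get a large strength, and patterns near 0.5 get a small strength.
```

With actual SGD-trained filters, plotting `strength` against `excl` is one way to check the PSI condition empirically.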

2. RELATED WORK

Several recent works have studied the generalization properties of overparameterized CNNs. Some of these propose generalization bounds that depend on the number of channels (Long & Sedghi, 2020; Jiang et al., 2019). Others provide guarantees for CNNs with constraints on the weights (Zhou & Feng, 2018; Li et al., 2018). Convergence of gradient descent to KKT points of the max-margin problem is shown in Lyu & Li (2020) and Nacson et al. (2019) for homogeneous models. However, their results do not provide generalization guarantees in our setting. Gunasekar et al. (2018) study the inductive bias of linear CNNs. Yu et al. (2019) study a pattern classification problem similar to ours. However, their analysis holds for an unbounded hinge loss which is not used in practice. Furthermore, their sample complexity depends on the network size, and thus does not explain why large networks do not overfit. Other works have studied learning under certain ground truth distributions. For example, Brutzkus & Globerson (2019) study a simple extension of the XOR problem, showing that overparameterized CNNs generalize better than smaller CNNs. Single-channel CNNs are analyzed in (Du et al., 2018b;a; Brutzkus & Globerson, 2017; Du et al., 2018c).

Other works study the inductive bias of gradient descent on fully connected linear or non-linear networks (Ji & Telgarsky, 2019; Arora et al., 2019a; Wei et al., 2019; Brutzkus et al., 2018; Dziugaite & Roy, 2017; Allen-Zhu et al., 2019; Chizat & Bach, 2020). Fully connected networks were also analyzed via the NTK approximation (Du et al., 2019; 2018d; Arora et al., 2019b; Fiat et al., 2019). Kushilevitz & Roth (1996); Shvaytser (1990) study the learnability of distributions over visual patterns. However, our focus is on learnability using a specific algorithm and architecture: SGD trained on overparameterized CNNs.

3. THE ORTHOGONAL PATTERNS PROBLEM

Data Generating Distribution: We consider a learning problem that captures a key property of visual classification. Many visual classes are characterized by the existence of certain patterns. For example, an 8 will typically contain an 'x'-like pattern somewhere in the image. Here we consider an abstraction of this behavior where images consist of a set of patterns. Furthermore, each class is characterized by a pattern that appears exclusively in it. We define this formally below. Let P be a set of orthogonal vectors in R^d, where |P| ≤ d. For simplicity, we assume that ||p||_2 = 1 for all p ∈ P. We consider input vectors x with n patterns of dimension d. Formally, x = (x[1], ..., x[n]) ∈ R^{nd}, where x[i] ∈ P is the i-th pattern of x and n < d. We say that x contains the pattern p ∈ P, denoted p ∈ x, if there exists j such that x[j] = p. Let P(x) = {p ∈ P : p ∈ x} denote the set of all patterns in x.
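The input model above can be sketched in a few lines of NumPy. The helper names below are ours; the sketch builds a set of orthonormal patterns in R^d (here via a QR decomposition, one convenient choice) and concatenates n of them into an input x ∈ R^{nd}.

```python
import numpy as np

def make_patterns(d, num_patterns, seed=0):
    """Orthonormal patterns in R^d (requires num_patterns <= d).
    The columns of Q from a QR decomposition of a random matrix are
    orthonormal; we take num_patterns of them as the pattern set P."""
    assert num_patterns <= d
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.normal(size=(d, d)))
    return q[:, :num_patterns].T           # (num_patterns, d), orthonormal rows

def make_input(patterns, n, seed=0):
    """Concatenate n patterns (sampled with replacement) into x in R^{n*d}."""
    rng = np.random.default_rng(seed)
    idx = rng.integers(len(patterns), size=n)
    return np.concatenate([patterns[i] for i in idx]), idx

d, num_patterns, n = 10, 6, 4              # satisfies n < d and num_patterns <= d
patterns = make_patterns(d, num_patterns)
x, idx = make_input(patterns, n)
assert x.shape == (n * d,)
# orthonormality of the pattern set: P P^T = I
assert np.allclose(patterns @ patterns.T, np.eye(num_patterns))
```

A class-conditional distribution over such inputs is then obtained by fixing which patterns may appear in each class, e.g., reserving one pattern exclusively for each label.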

