ON THE INDUCTIVE BIAS OF A CNN FOR DISTRIBUTIONS WITH ORTHOGONAL PATTERNS

Abstract

Training overparameterized convolutional neural networks with gradient-based optimization is the most successful learning method for image classification. However, their generalization properties are far from understood. In this work, we consider a simplified image classification task where images contain orthogonal patches and are learned with a 3-layer overparameterized convolutional network and stochastic gradient descent (SGD). We empirically identify a novel phenomenon of SGD in our setting, where the dot-product between a learned pattern detector and its detected pattern is governed by the pattern statistics in the training set. We call this phenomenon Pattern Statistics Inductive Bias (PSI) and empirically verify it in a large number of instances. We prove that in our setting, if a learning algorithm satisfies PSI then its sample complexity is O(d^2 log(d)), where d is the filter dimension. In contrast, we show a VC dimension lower bound which is exponential in d. We perform experiments with overparameterized CNNs on a variant of MNIST with non-orthogonal patches, and show that the empirical observations are in line with our analysis.

1. INTRODUCTION

Convolutional neural networks (CNNs) have achieved remarkable performance in various computer vision tasks (Krizhevsky et al., 2012; Xu et al., 2015; Taigman et al., 2014). In practice, these networks typically have more parameters than needed to achieve zero train error (i.e., they are overparameterized). Despite non-convexity and the potential problem of overfitting, training these models with gradient-based methods leads to solutions with low test error. It is still largely unknown why such simple optimization algorithms achieve outstanding test performance when learning overparameterized convolutional networks.

Recently, there have been major efforts to provide generalization guarantees for overparameterized CNNs. However, current generalization guarantees either depend on the number of channels of the network (Long & Sedghi, 2020) or hold under specific constraints on the weights (Li et al., 2018). Clearly, the generalization of overparameterized CNNs depends on both the learning algorithm (gradient-based methods) and unique properties of the data. Providing generalization guarantees while incorporating these factors is a major challenge: it requires analyzing non-convex optimization methods and mathematically defining properties of the data, which is extremely difficult for real-world problems. Therefore, it is necessary to first understand simple settings which are amenable to theoretical and empirical analysis and share salient features with real-world problems.

Towards this goal, we analyze a simplified pattern recognition task where all patterns in the images are orthogonal and the classification is binary. The architecture is a 3-layer overparameterized convolutional neural network, and it is learned using stochastic gradient descent (SGD).
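The setting above can be sketched in code. The following is a minimal illustration, not the authors' exact experimental setup: the patch dimensions, channel count, label rule (a single distinguishing pattern), and the fixed ±1 second-layer weights are all assumptions chosen to mirror the description of orthogonal patches and a 3-layer CNN trained with SGD.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_patches, k, n_samples = 8, 4, 32, 200  # filter dim, patches/image, channels, samples

# Orthogonal patterns: rows of a random orthogonal matrix (via QR decomposition).
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))
patterns = Q

# Hypothetical label rule (an assumption for this sketch):
# class +1 iff pattern 0 appears somewhere in the image.
def make_sample():
    idx = rng.choice(d, size=n_patches, replace=False)
    return patterns[idx], 1.0 if 0 in idx else -1.0

X, Y = map(np.stack, zip(*[make_sample() for _ in range(n_samples)]))

# 3-layer CNN: conv filters W, ReLU, max-pool over patch positions,
# and a fixed second layer where half the channels vote +1, half vote -1.
W = rng.normal(scale=0.1, size=(k, d))
v = np.repeat([1.0, -1.0], k // 2)

def forward(x, W):
    a = np.maximum(x @ W.T, 0.0)  # (n_patches, k) ReLU conv activations
    return v @ a.max(axis=0)      # max-pool each channel, then vote

lr = 0.05
for epoch in range(200):
    for i in rng.permutation(n_samples):
        x, y = X[i], Y[i]
        if y * forward(x, W) < 1.0:         # hinge-loss SGD step
            a = x @ W.T
            pos = a.argmax(axis=0)          # patch winning the max-pool, per channel
            active = (a[pos, np.arange(k)] > 0).astype(float)
            W += lr * y * (v * active)[:, None] * x[pos]

train_acc = np.mean(np.sign([forward(x, W) for x in X]) == Y)
# PSI-style statistic: how strongly each pattern is detected by its best channel.
detector_strength = np.abs(W @ patterns.T).max(axis=0)
```

In this sketch, `detector_strength` is the quantity the PSI phenomenon concerns: the dot-product between learned filters and the orthogonal patterns, which can then be compared against how often each pattern co-occurs with each class in the training set.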
We take a unique approach that combines novel empirical observations with theoretical guarantees to provide a generalization bound which is independent of the number of channels and is a low-degree polynomial of the filter dimension, which is usually low in practice. Empirically, we identify a novel property of the solutions found by SGD. We observe that the statistics of patterns in the training data govern the magnitude of the dot-product between learned pattern detectors and their detected patterns. Specifically, patterns that appear almost exclusively in one of the classes will have a large dot-product with the channels that detect them. On the other hand,

