PLATEAU IN MONOTONIC LINEAR INTERPOLATION - A "BIASED" VIEW OF THE LOSS LANDSCAPE FOR DEEP NETWORKS

Abstract

Monotonic linear interpolation (MLI) - the phenomenon that, on the line connecting a random initialization with the minimizer it converges to, the loss and accuracy are monotonic - is commonly observed in the training of neural networks. Such a phenomenon may seem to suggest that optimization of neural networks is easy. In this paper, we show that the MLI property is not necessarily related to the hardness of optimization problems, and that empirical observations on MLI for deep neural networks depend heavily on the bias terms. In particular, we show that linearly interpolating weights and biases affects the final output very differently, and that when different classes have different last-layer biases on a deep network, there is a long plateau in both the loss and accuracy interpolation (which existing theory of MLI cannot explain). We also show, using a simple model, how the last-layer biases for different classes can differ even on a perfectly balanced dataset. Empirically, we demonstrate that similar intuitions hold on practical networks and realistic datasets.

1. INTRODUCTION

Deep neural networks can often be optimized using simple gradient-based methods, despite the objectives being highly nonconvex. Intuitively, this suggests that the loss landscape must have nice properties that allow efficient optimization. To understand the properties of the loss landscape, Goodfellow et al. (2014) studied the linear interpolation between a random initialization and the local minimum found after training. They observed that the loss interpolation curve is monotonic and approximately convex (see the MNIST curve in Figure 1) and concluded that these tasks are easy to optimize. However, more recent empirical work, such as Frankle (2020), observed that for deep neural networks on more complicated datasets, both the loss and the error curves have a long plateau along the interpolation path, i.e., the loss and error remain high until close to the optimum (see the CIFAR-10 curve in Figure 1). Does the long plateau along the linear interpolation suggest these tasks are harder to optimize? Not necessarily, since the hardness of optimization problems need not be related to the shape of interpolation curves (see examples in Appendix A). In this paper we give the first theory that explains the plateau in both loss and error interpolations. We attribute the plateau to simple causes such as the bias terms, the network initialization scale, and the network depth, which need not be related to the difficulty of optimization. Note that there are many different theories for the optimization of overparametrized neural networks, in particular the neural tangent kernel (NTK) analysis (Jacot et al., 2018; Du et al., 2018; Allen-Zhu et al., 2019; Arora et al., 2019) and mean-field analysis (Chizat & Bach, 2018; Mei et al., 2018). However, they do not explain the plateau in both loss and error interpolations.
In the NTK regime, the network output is nearly linear in the parameters, so the loss interpolation curve is monotonically decreasing and convex - there is no plateau in the loss interpolation. The mean-field regime typically uses a smaller initialization on a homogeneous neural network (as considered in Chizat & Bach (2018); Mei et al. (2018)). In this case, the interpolated network output is essentially a scaled version of the network output at the minimum and makes the same label predictions - there is no plateau in the error interpolation curve.

1.1 OUR RESULTS

Our theoretical results consist of two parts. In the first part (see Section 3), we give a plausible explanation for the plateau in both the loss and error curves.

Claim 1 (informal). If a deep network has a relatively small initialization, and its last-layer biases are significantly different for different classes, then both the loss and error curves will have a plateau. The length of the plateau is longer for a deeper network.

We formalize this claim in Theorem 1. For intuition, consider an r-layer neural network that only has a bias on the last layer, and consider Xavier initialization (Glorot & Bengio, 2010), which typically gives a small output and zero bias. If we consider the α-interpolation point (with coefficient α for the minimum and (1 - α) for the initialization), then the weight "signal" from the minimum scales as α^r (as it is the product of r layers), while the bias scales as α. As illustrated in Figure 2 (right), when r is large and there is a difference in biases, the bias will dominate, which creates a plateau in the error. For the loss, one can also show that the weight signal is near 0 for small α, so the network output is dominated by the biases and the loss cannot beat random guessing at initialization. Note that this explanation for the plateau has no implication for the hardness of optimization.
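As a toy illustration of this scaling argument, the sketch below uses a deep linear network with a bias only on the last layer, under the idealized assumption that the initialization is exactly zero (all dimensions, depths, and variable names here are hypothetical placeholders, not the paper's experimental setup). Along the interpolation path, the weight "signal" shrinks as α^r while the bias shrinks only as α, so for small α the bias dominates the output:

```python
import numpy as np

rng = np.random.default_rng(0)
r, d = 6, 10                          # depth and width (illustrative only)
x = rng.standard_normal(d)            # a fixed input

# "Minimum" found after training: r weight matrices and a last-layer bias.
Ws = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(r)]
b = rng.standard_normal(d)

def output(alpha):
    # Idealized zero initialization: the alpha-interpolated weights are
    # alpha * W and the interpolated last-layer bias is alpha * b.
    h = x
    for W in Ws:
        h = (alpha * W) @ h           # each layer contributes one factor of alpha
    return h + alpha * b              # weight signal ~ alpha^r, bias ~ alpha

for alpha in [0.1, 0.5, 0.9]:
    signal = output(alpha) - alpha * b        # the alpha^r weight term
    print(alpha, np.linalg.norm(signal), np.linalg.norm(alpha * b))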
However, why would the last-layer biases be different for different classes, especially when the biases are initialized as zeros and all classes are balanced? In the second part (see Section 4), we focus on a simple model that we call the r-homogeneous-weight network. This is a two-layer network whose i-th output is ⟨W_i,:, x⟩^r + b_i, where x ∈ R^d is the network input, W_i,: ∈ R^d is the weight vector and b_i ∈ R is the bias (see Figure 2 (left)). Our simple model simulates a depth-r ReLU/linear network with bias on the output layer, in the sense that the signal is r-homogeneous while the bias is 1-homogeneous in the parameters. Under this model we can show:

Claim 2 (informal). For the r-homogeneous-weight network on a simple balanced dataset, the class that is learned last has the largest bias.

Here, a class is learned when all the samples in this class are classified correctly with good confidence. We show that once a class is learned, the bias associated with this class starts decreasing, so eventually the class that is learned last has the largest bias. We formalize this claim in Theorem 2.

In Section 5, we verify these ideas empirically on fully-connected networks for MNIST (Deng, 2012) and Fashion-MNIST (Xiao et al., 2017), and on VGG-16 (Simonyan & Zisserman, 2014) for CIFAR-10 and CIFAR-100 (Krizhevsky et al., 2009). We first show that if we train a neural network without any bias terms, the error curve has a much shorter plateau, or no plateau at all. Even for networks trained normally with biases, we design a homogeneous interpolation scheme for the biases, which ensures that both the biases and the weights are r-homogeneous. Such an interpolation indeed significantly shortens the plateau in the error. We also show that decreasing the initialization scale or increasing the network depth produces a longer plateau in both the error and loss curves.
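The r-homogeneous-weight model, and why interpolating the bias with α^r rather than α removes the mismatch, can be sketched in a few lines. This is a minimal illustration (the weights W, biases b, and input x are random placeholders, and we again assume an exactly-zero initialization, which is an idealization): with the homogeneous scheme the interpolated output is an exact positive rescaling of the output at the minimum, so the predicted labels along the path match those at the minimum.

```python
import numpy as np

rng = np.random.default_rng(1)
r, d, k = 3, 8, 4                   # homogeneity degree, input dim, classes
x = rng.standard_normal(d)          # a single input (placeholder)
W = rng.standard_normal((k, d))     # "trained" weights (placeholder)
b = rng.standard_normal(k)          # "trained" last-layer biases (placeholder)

def f(W_, b_):
    # i-th output of the r-homogeneous-weight model: <W_i, x>^r + b_i
    return (W_ @ x) ** r + b_

alpha = 0.3
linear = f(alpha * W, alpha * b)        # naive linear interpolation (zero init):
                                        # signal scales as alpha^r, bias as alpha
homog = f(alpha * W, alpha ** r * b)    # homogeneous interpolation of the biases

# With zero init, the homogeneous scheme exactly rescales the final output by
# alpha^r > 0, so the argmax (predicted label) is preserved along the path.
assert np.allclose(homog, alpha ** r * f(W, b))
assert np.argmax(homog) == np.argmax(f(W, b))
```

Under the naive scheme, by contrast, the bias term αb_i decays more slowly than the α^r signal, which is exactly the mismatch that produces the plateau.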
Finally, we show that the biases correlate with the order in which the classes are learned on small datasets, which suggests that even though the model we consider in the convergence analysis is simple, it captures some of the behavior in practice.



Figure 1: Loss interpolation curve and error interpolation curve for a four-layer fully-connected network (FCN4) on MNIST and for VGG16 on CIFAR-10.

