VISUALIZING HIGH-DIMENSIONAL TRAJECTORIES ON THE LOSS-LANDSCAPE OF ANNS

Abstract

Training artificial neural networks requires the optimization of highly non-convex loss functions. Over the years, the scientific community has developed an extensive set of tools and architectures that render this optimization task tractable, and a general intuition has emerged for choosing hyperparameters that help models reach minima that generalize well to unseen data. However, the differences in trainability between architectures and tasks, and even the gap in generalization abilities between networks, remain largely unexplained. Visualization tools have played a key role in uncovering geometric characteristics of the loss landscape of ANNs and how these impact trainability and generalization. However, most visualization methods proposed so far have been limited in their capabilities: they are linear in nature and capture features in only a small number of dimensions. We propose the use of PHATE, a modern dimensionality reduction method that represents the state of the art in capturing both global and local structures of high-dimensional data. We apply this method to visualize the loss landscape during and after training. Our visualizations reveal differences in training trajectories and generalization capabilities when comparing optimization methods, initializations, architectures, and datasets. Given this success, we anticipate that this method can be used to make informed choices about these aspects of neural networks.

1. INTRODUCTION

Artificial neural networks (ANNs) have been successfully used to solve a number of complex tasks in a diverse array of domains, such as object recognition, machine translation, image generation, 3D protein structure prediction, and many more. Despite being highly overparameterized for the tasks they solve, and having the capacity to memorize the entire training data, ANNs tend to generalize to unseen data. This is a spectacular feat, since the highly non-convex optimization typically encountered in them should (theoretically) be a significant obstacle to using these models (Blum & Rivest, 1993). Questions such as why ANNs favor generalization over memorization, and why they find good minima despite intricate loss functions, remain largely unanswered. One promising research direction for answering them is to look at the loss landscape of deep learning models. Recent work has approached this task by proposing various visualization methods. An emerging challenge here is how to look at such an extremely high-dimensional optimization landscape, whose dimension grows linearly with the number of parameters of the network, with respect to the minimized loss. In past work, loss functions and their level lines were visualized along random directions starting at a minimum, or by means of linear methods like PCA. In some cases, this approach proved effective in uncovering underlying structures in the loss landscape and linking them to network characteristics, such as generalization capabilities or structural features (Keskar et al., 2016; Li et al., 2018). However, these methods have two key drawbacks: (1) they are linear, in that they only choose directions that are linear combinations of parameter axes while the loss landscape itself is highly nonlinear, and (2) they choose only two among thousands (if not millions) of axes to visualize and ignore all others.
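To make drawback (2) concrete: such plots evaluate the loss on a fixed 2D affine plane through a point in parameter space, spanned by two chosen directions, so everything orthogonal to that plane is invisible. A minimal sketch with a toy quadratic loss standing in for a real network (all names here are illustrative, not from any cited implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 1000  # toy "parameter" dimensionality

# Toy loss: an anisotropic quadratic with its minimum at theta_star.
theta_star = rng.normal(size=dim)
curvature = rng.uniform(0.1, 10.0, size=dim)

def loss(theta):
    return float(np.sum(curvature * (theta - theta_star) ** 2))

# Two fixed random unit directions span the 2D slice; the remaining
# dim - 2 orthogonal directions never appear in the resulting plot.
d1 = rng.normal(size=dim); d1 /= np.linalg.norm(d1)
d2 = rng.normal(size=dim); d2 /= np.linalg.norm(d2)

alphas = np.linspace(-1.0, 1.0, 25)
betas = np.linspace(-1.0, 1.0, 25)
surface = np.array([[loss(theta_star + a * d1 + b * d2)
                     for b in betas] for a in alphas])

# The slice attains its minimum at (0, 0), i.e. at theta_star itself.
i, j = np.unravel_index(surface.argmin(), surface.shape)
print(alphas[i], betas[j])  # 0.0 0.0
```

The resulting `surface` is what contour or surface plots of this kind display; the choice of `d1` and `d2` (random vectors, or PCA components of the training trajectory) fully determines which two linear combinations of parameters are shown.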
In this work, we utilize and adapt the PHATE dimensionality reduction method (Moon et al., 2019), which relies on diffusion-based manifold learning, to study ANN loss landscapes by visualizing the evolution of network weights during training in low dimensions. Visualizations like PHATE are specifically designed to squeeze as much variability as possible into two dimensions, and thus provide an advantage over previous approaches. In particular, our choice of PHATE over other popular methods, such as tSNE (van der Maaten & Hinton, 2008), is due to its ability to capture both global and local structures of the data, and especially to keep intact the training trajectories traversed during gradient descent. Indeed, during training, the high-dimensional neural network weights change significantly while remaining on a connected manifold defined by the support of viable configurations (e.g., with sufficiently low training loss), which is what we refer to when discussing the geometry of the loss landscape. We show that PHATE is suitable for tracking such continuous weight trajectories, as opposed to tSNE or UMAP, which tend to shatter them. Moreover, our approach provides a general view of relevant geometric patterns that emerge in the high-dimensional parameter space, offering insights into the properties of ANN training and their impact on the loss landscape.
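The diffusion intuition behind this choice can be sketched in a few lines: build a cosine-distance affinity matrix between weight snapshots, row-normalize it into a Markov diffusion operator, and embed with its top non-trivial eigenvectors. The following is a bare diffusion-map sketch of that idea on synthetic snapshots, not the full PHATE algorithm (which additionally uses adaptive kernels, diffusion powering, and potential distances); all names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

# Fake "training trajectory": 200 noisy snapshots of a weight vector
# drifting along a smooth curve embedded in 50 dimensions.
t = np.linspace(0.0, 1.0, 200)[:, None]
base = np.concatenate([np.cos(3 * t), np.sin(3 * t), t], axis=1)
snapshots = base @ rng.normal(size=(3, 50)) + 0.01 * rng.normal(size=(200, 50))

# Pairwise cosine distances between snapshots.
normed = snapshots / np.linalg.norm(snapshots, axis=1, keepdims=True)
dist = 1.0 - normed @ normed.T

# Gaussian kernel -> row-stochastic diffusion operator.
eps = np.median(dist) ** 2
K = np.exp(-dist ** 2 / eps)
P = K / K.sum(axis=1, keepdims=True)

# Largest eigenvalue of a row-stochastic matrix is 1 (trivial, constant
# eigenvector); the next two eigenvectors give a 2D diffusion embedding.
vals, vecs = np.linalg.eig(P)
order = np.argsort(-vals.real)
embedding = vecs.real[:, order[1:3]] * vals.real[order[1:3]]

print(embedding.shape)  # (200, 2)
```

Because the operator averages over neighborhoods on the snapshot manifold rather than over raw coordinate axes, nearby points on the trajectory stay nearby in the embedding, which is the property that keeps training paths connected rather than shattered.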

Contributions:

We propose a novel loss-landscape visualization based on a variation of PHATE, implemented with cosine distance, in Section 4. Our method is, to our knowledge, different from all other proposed methods for loss visualization in that it is naturally nonlinear and captures data characteristics from all dimensions. In Section 5.1, we show that our method uncovers key geometric patterns characterizing loss-landscape regions surrounding good and bad generalization optima, as well as memorization optima. Finally, we establish the robustness of our method by applying it to numerous tasks, architectures, and optimizers in Sections 5.2 and 5.3, suggesting that our method can be used in a consistent manner to validate training and design choices.

2. RELATED WORK

Loss-landscape visualization methods have been proposed in numerous contexts. Goodfellow et al. (2014) proposed the "linear path experiment," in which the loss of an ANN is evaluated at a series of points θ = (1 - α)θ_i + αθ_f for values of α ∈ [0, 1], where θ_i and θ_f correspond to the initial parameters of the model and the found optimum in parameter space, respectively. This one-dimensional linear interpolation allowed them to show that popular state-of-the-art ANNs typically do not encounter significant obstacles along a straight path from initialization to the convergent solution. They also used the method to visualize the loss along directions connecting two distinct minima and to show that these are linearly separated by a region of higher loss. This method was further developed by Im et al. (2016), who adapted it to enable the visualization of two-dimensional projections of the loss landscape using barycentric and bilinear interpolation for groups of three or four points in parameter space. This analysis allowed them to establish that, even when starting from the same parameter initialization, different optimization algorithms find different minima. Furthermore, they noticed that the loss landscape around minima has characteristic, optimizer-specific shapes, and that batch normalization smooths the loss function.

More recently, Li et al. (2018) addressed the scale invariance and network symmetry problems discussed in Neyshabur et al. (2017) and Dinh et al. (2017), which prevented meaningful comparisons between loss-landscape plots from different networks. They proposed 1D and 2D linear interpolation plots, similar to past techniques, but using filter-wise normalized directions to remove the scaling effect. This allowed them to visualize and compare the loss-landscape regions surrounding minima from multiple networks in a meaningful way, and to correlate the "flatness" of a region with the generalization capabilities of the corresponding network. Furthermore, they studied the effects of network depth, width, and the presence of skip connections on the geometry of the loss landscape and on network generalization.

The importance of loss-landscape visualization methods like the one presented in this paper grows with the scientific community's interest in furthering our understanding of ANNs. As our understanding of this landscape deepens, we uncover more and more high-dimensional and complex geometric and topological characteristics. For instance, Draxler et al. (2018) have found nonlinear pathways in parameter space connecting distinct minima, along which the training and test errors remain small and losses are consistently low. This suggests that minima are not situated in isolated valleys but rather lie on connected manifolds representing low-loss regions. However, such characteristics are intrinsically high-dimensional, making linear methods inadequate for visualizing these structures. Even in standard applications of the linear methods one inevitably asks if the thousands (or
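The linear path experiment of Goodfellow et al. (2014) discussed above reduces to a one-line parameter interpolation. A minimal sketch, again with a convex toy loss standing in for a trained network (the names `theta_init`, `theta_final`, and `loss` are illustrative, not from the cited work):

```python
import numpy as np

rng = np.random.default_rng(2)
dim = 500

# Toy stand-in for a trained network: quadratic loss with known minimum.
theta_final = rng.normal(size=dim)        # "found optimum" theta_f
theta_init = rng.normal(size=dim) * 3.0   # "initialization" theta_i

def loss(theta):
    return float(np.mean((theta - theta_final) ** 2))

# Evaluate the loss along the straight segment
# theta(alpha) = (1 - alpha) * theta_init + alpha * theta_final.
alphas = np.linspace(0.0, 1.0, 51)
path_losses = [loss((1 - a) * theta_init + a * theta_final)
               for a in alphas]

# For this convex toy loss the curve decreases monotonically from
# initialization to optimum; the interesting empirical finding is that
# real, non-convex networks often show similarly benign curves.
print(path_losses[0] > path_losses[-1])  # True
```

Plotting `path_losses` against `alphas` reproduces the one-dimensional interpolation plot; the 2D variants of Im et al. (2016) and Li et al. (2018) extend the same evaluation to planes spanned by more interpolation points or by filter-wise normalized directions.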

