CONNECTING SPHERE MANIFOLDS HIERARCHICALLY FOR REGULARIZATION

Abstract

This paper considers classification problems with hierarchically organized classes. We force the classifier (hyperplane) of each class to belong to a sphere manifold, whose center is the classifier of its super-class. Then, individual sphere manifolds are connected based on their hierarchical relations. Our technique replaces the last layer of a neural network by combining a spherical fully-connected layer with a hierarchical layer. This regularization is shown to improve the performance of widely used deep neural network architectures (ResNet and DenseNet) on publicly available datasets (CIFAR100, CUB200, Stanford dogs, Stanford cars, and Tiny-ImageNet).

1. INTRODUCTION

Applying inductive biases or prior knowledge to inference models is a popular strategy to improve their generalization performance (Battaglia et al., 2018). For example, a hierarchical structure is found based on the similarity or shared characteristics between samples, and thus becomes a natural criterion for categorizing objects. Known hierarchical structures provided with datasets (e.g., ImageNet (Deng et al., 2009), classified according to the WordNet graph; CIFAR100 (Krizhevsky, 2009), whose classes are grouped into twenty super-classes) can help the network identify similarities between the given samples. In classification tasks, the final layer of a neural network maps embedding vectors to a discrete target space. However, no mechanism forces similar categories to be distributed close to each other in the embedding. Instead, we may observe classes to be uniformly distributed after training, as this simplifies the separation performed by the last fully-connected layer. This behavior is a consequence of treating the label structure as 'flat,' i.e., of omitting the hierarchical relationships between classes (Bilal et al., 2017).

To alleviate this problem, in this study we force similar classes to be closer in the embedding by constraining their hyperplanes to follow a given hierarchy. One way to realize this is to make children nodes depend on their parent nodes and to constrain their distance through a regularization term. However, the norm itself does not carry relevant information about the closeness of classifiers: two classifiers are close if they classify two similar points into the same class, which means that similar classifiers must point in similar directions. Therefore, we focus on the angle between classifiers, which can be controlled through spherical constraints.

Contributions.
In this paper, we propose a simple strategy to incorporate hierarchical information into deep neural network architectures with minimal changes to the training procedure, modifying only the last layer. Given a hierarchical structure of the labels in the form of a tree, we explicitly force the classifier of each class to lie on a sphere whose center is the classifier of its super-class, recursively until we reach the root (see Figure 2). We introduce the spherical fully-connected layer and the hierarchically connected layer, whose combination implements our technique. Finally, we investigate the impact of Riemannian optimization as opposed to simple norm normalization. By its nature, the proposed technique is quite versatile, because the modifications affect only the structure of the last fully-connected layer of the neural network. It can therefore be combined with many other strategies (such as the spherical CNN of Xie et al. (2017), or other deep neural network architectures).

Related works. Hierarchical structures are well studied, and their properties can be effectively learned using manifold embeddings. Designing an optimal embedding to learn a latent hierarchy is a complex task that has been studied extensively over the past decade. For example, Word2Vec (Mikolov et al., 2013b;a) and Poincaré embeddings (Nickel & Kiela, 2017) showed remarkable performance in hierarchical representation learning. Du et al. (2018) forced the representations of sub-classes to "orbit" around the representation of their super-class to obtain a similarity-based embedding. More recently, works using elliptical manifold embeddings (Batmanghelich et al., 2016), hyperbolic manifolds (Nickel & Kiela, 2017; De Sa et al., 2018; Tifrea et al., 2018), and a combination of the two (Gu et al., 2019; Bachmann et al., 2019) showed that the latent structure of many datasets is non-Euclidean (Zhu et al., 2016; Bronstein et al., 2017; Skopek et al., 2019).
Xie et al. (2017) showed that spheres (with angular constraints) in the hidden layers also induce diversity, thus reducing over-fitting in latent space models. Mixing hierarchical information and structured prediction is not new, especially in text analysis (Koller & Sahami, 1997; McCallum et al., 1998; Weigend et al., 1999; Wang et al., 1999; Dumais & Chen, 2000). Vendrov et al. (2016) exploit the partial-order structure of the visual-semantic hierarchy using simple ordered pairs with a max-margin loss function. The results of previous studies indicate that exploiting hierarchical information during training yields better and more resilient classifiers, in particular when the number of classes is large (Cai & Hofmann, 2004). For a given hierarchy, it is possible to design structured models that incorporate this information to improve the efficiency of the classifier. For instance, for support vector machines (SVMs), the techniques reported in (Cai & Hofmann, 2004; 2007; Gopal et al., 2012; Sela et al., 2011) use hierarchical regularization, forcing the classifier of a super-class to be close to the classifiers of its sub-classes. However, the intuition is very different in that case, because SVMs do not learn the embedding. In this study, we assume that the hierarchy of the class labels is known. Moreover, we do not change the prior layers of the deep neural network, and work only on the last layer, which directly builds the hyperplanes used for classification. Our work is thus orthogonal to works on embedding learning, but not incompatible with them.

Comparison with hyperbolic/Poincaré/graph networks. Hyperbolic networks are a recent technique showing impressive results in hierarchical representation learning. Poincaré networks (Nickel & Kiela, 2017) were originally designed to learn the latent hierarchy of data using a low-dimensional embedding.
To alleviate their transductive limitation, which prevents inference on unseen graphs, hyperbolic neural networks equipped with set-aggregation operations have been proposed (Chami et al., 2019; Liu et al., 2019). These methods mostly focus on learning embeddings with hyperbolic activation functions for hierarchical representation. Our technique is orthogonal to these works. First, we assume that the hierarchical structure is not learned but already known. Second, our model focuses on generating individual hyperplanes for the embedding vectors produced by the network architecture. Moreover, spherical geometry has a positive curvature, whereas hyperbolic space has a constant negative curvature. Nevertheless, our technique and hyperbolic networks are not mutually exclusive: while we focus on spheres embedded in R^d in this study, it is straightforward to consider spheres embedded in hyperbolic spaces.
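To make the construction concrete, the following is a minimal numerical sketch of the idea behind the hierarchically connected layer: each sub-class classifier is written as its super-class classifier plus a unit-norm offset scaled by a per-depth radius. The tree (two super-classes with two sub-classes each), the dimension d, and the radii r1, r2 are illustrative assumptions, not values from the paper.

```python
import numpy as np

d = 8                                    # embedding dimension (assumed)
e = np.eye(d)                            # orthonormal directions for a toy example

def unit(v):
    """Norm normalization: project v onto the unit sphere."""
    return v / np.linalg.norm(v)

# Hypothetical two-level label tree; sphere radii per depth are assumptions.
r1, r2 = 1.0, 0.25

# Super-class classifiers lie on a sphere of radius r1 centred at the root
# classifier (taken as the origin here).
supers = [r1 * e[0], r1 * e[1]]

# Sub-class classifiers lie on spheres of radius r2 centred at their
# super-class classifier: w_child = w_parent + r2 * u with ||u|| = 1.
leaves = {
    (0, 0): supers[0] + r2 * unit(e[2] + e[3]),
    (0, 1): supers[0] + r2 * unit(e[2] - e[3]),
    (1, 0): supers[1] + r2 * unit(e[4] + e[5]),
    (1, 1): supers[1] + r2 * unit(e[4] - e[5]),
}

def cosine(u, v):
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Siblings are at most 2 * r2 apart in Euclidean distance ...
assert np.linalg.norm(leaves[0, 0] - leaves[0, 1]) <= 2 * r2 + 1e-9

# ... and therefore nearly aligned, unlike classifiers from different
# super-classes, whose angle stays large in this construction.
print(round(cosine(leaves[0, 0], leaves[0, 1]), 3))  # 0.941
print(round(cosine(leaves[0, 0], leaves[1, 0]), 3))  # 0.0
```

Here closeness in the hierarchy translates into small angles between classifiers, which is exactly the property the norm-based regularization alone does not guarantee.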

2.1. DEFINITIONS AND NOTATION

Figure 1: To reference the node at the bottom, we use the notation n_p with p = {1, 3, 2}. We use curly brackets {} to write a path, and angle brackets ⟨•⟩ for the concatenation of paths.

We assume we have samples with hierarchically ordered classes. For instance, apple, banana, and orange are classes that may belong to the super-class "fruits." We represent such hierarchical relationships with trees, as depicted in Figure 1. We identify nodes in the graph through the path taken in the tree. To refer to the leaf (highlighted in blue in Figure 1), we use the notation n_{1,3,2}: it is the second child of the super-class n_{1,3}, and so on recursively until we reach the root.
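The path notation can be read as a sequence of child indices followed from the root. The sketch below illustrates this with a hypothetical nested-dictionary representation of the label tree; the helper `node_at` and the node names are assumptions for illustration, not part of the paper.

```python
# Hypothetical helper: resolve a path such as {1, 3, 2} to a node in a
# nested-dictionary representation of the label tree.
def node_at(tree, path):
    """Follow child indices (1-based, as in the paper's notation) down the tree."""
    node = tree
    for i in path:
        node = node["children"][i - 1]
    return node

tree = {"name": "root", "children": [
    {"name": "n{1}", "children": [
        {"name": "n{1,1}", "children": []},
        {"name": "n{1,2}", "children": []},
        {"name": "n{1,3}", "children": [
            {"name": "n{1,3,1}", "children": []},
            {"name": "n{1,3,2}", "children": []},
        ]},
    ]},
]}

# n_{1,3,2} is the 2nd child of n_{1,3}, which is the 3rd child of n_{1}.
print(node_at(tree, [1, 3, 2])["name"])  # n{1,3,2}
```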

