LEARNING STRUCTURED REPRESENTATIONS BY EMBEDDING CLASS HIERARCHY

Abstract

Existing models for learning representations in supervised classification problems are permutation invariant with respect to class labels. However, structured knowledge about the classes, such as hierarchical label structures, widely exists in many real-world datasets, e.g., the ImageNet and CIFAR benchmarks. How to learn representations that can preserve such structures among the classes remains an open problem. To approach this problem, given a tree of class hierarchy, we first define a tree metric between any pair of nodes in the tree to be the length of the shortest path connecting them. We then provide a method to learn the hierarchical relationship of class labels by approximately embedding the tree metric in the Euclidean space of features. More concretely, during supervised training, we propose to use the Cophenetic Correlation Coefficient (CPCC) as a regularizer for the cross-entropy loss to correlate the tree metric of classes and the Euclidean distance in the class-conditioned representations. Our proposed regularizer is computationally lightweight and easy to implement. Empirically, we demonstrate that this approach helps to learn more interpretable representations due to the preservation of the tree metric, and leads to better generalization in-distribution as well as under sub-population shifts over multiple datasets.

1. INTRODUCTION

In supervised learning, the cross-entropy loss is often used for classification tasks. As a common practice in deep learning, in order to train a model for classification, practitioners build a linear layer over the representation to obtain the logit score of each class. A softmax transformation is then applied to convert the logits into a vector belonging to the probability simplex. As a result, we can randomly permute the representations of any classes without affecting the performance of the original classification task. However, in many real-world datasets, as we move towards fine-grained classification, labels are no longer independent of each other: ImageNet (Deng et al., 2009) inherits its label relationships from WordNet (Fellbaum, 1998), which contains both semantic and lexical connections; iNaturalist (Van Horn et al., 2017) borrows the biological taxonomy so that each image carries seven labels that reflect the morphological characteristics of the organism. Many existing works (Deng et al., 2014; Yan et al., 2014; Ristin et al., 2015; Guo et al., 2018; Chen et al., 2019) have investigated how to leverage this hierarchical information for various purposes, but how to explicitly project this knowledge onto representations remains unexplored. In this paper, we focus on the most common label relationship: tree hierarchy. As illustrated in Fig. 1b, given a tree hierarchy of classes, our goal is to learn representations in feature space such that the Euclidean distances between different class centers approximate the distances between these classes in the tree. More concretely, we shall first define a tree metric to be the length of the shortest path connecting two subsets of classes in the tree hierarchy.
Based on this tree metric, we then propose a regularizer, the cophenetic correlation coefficient (CPCC) between sequences of tree metric and Euclidean distance of the feature space, to ensure that the class-conditional representations inherit the tree structure of the classes. Unlike the original cross-entropy loss with softmax activation, the proposed CPCC regularizer helps to break the symmetry of permutation invariance among the classes, and thus also improves the interpretability of the learned representations. We show that the proposed CPCC regularizer is computationally lightweight with negligible overhead, and can be applied to a wide range of supervised learning paradigms, including standard flat empirical risk minimization and other hierarchical objectives such as multi-task learning and curriculum learning. For generalization, over six real-world datasets, we demonstrate that our proposed CPCC regularizer leads to improved generalization performance on some unseen tasks with sub-population shifts when there is only a limited amount of labeled data.

2. PRELIMINARIES

In this section we first introduce the notations used throughout the paper, formulate our learning problem, and then briefly review the CPCC score to quantify the correlation of two sequences.

Notations and Setup

We shall use $X$ and $Y$ to denote the input and target random variables, living in spaces $\mathcal{X}$ and $\mathcal{Y}$, respectively. In this work, we mainly focus on the supervised classification setting where for each input data point $x \in \mathcal{X} \subseteq \mathbb{R}^d$, there is a ground-truth label $y \in \mathcal{Y} = [k] := \{1, \dots, k\}$, where $k$ is the number of output classes. We let $\mu$ be the joint distribution over $(X, Y)$ from where the data is sampled. During the learning process, the learner has access to a dataset $D = \{(x_j, y_j)\}_{j=1}^{n}$ of size $n$ sampled from $\mu$. In the context of representation learning, a learned representation $z = f_\theta(x)$ is obtained by applying a feature encoder $f_\theta : \mathcal{X} \to \mathcal{Z}$ parametrized by $\theta$ to $x$, where $\mathcal{Z} \subseteq \mathbb{R}^p$ denotes the feature space. Upon feature vector $z$, we further apply a linear predictor $g : \mathcal{Z} \to \Delta_k$, where we use $\Delta_k$ to denote the $(k-1)$-dimensional probability simplex. The cross-entropy loss is our objective function. Specifically, let $q_y \in \Delta_k$ be a one-hot vector with the $y$-th component being 1. The cross-entropy loss $\ell_{CE}(\cdot, \cdot)$ between the prediction $g \circ f(x)$ and the label $y$ is given by $\ell_{CE}(g \circ f(x), y) := -\sum_{i \in [k]} q_i \log(g(f(x))_i)$. For $z, z'$, $\|z - z'\|_2$ denotes the Euclidean distance between them.
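As a small numeric illustration of the notation above, the following sketch evaluates the softmax and the cross-entropy loss for a single example. The logit values and helper names are hypothetical, and labels are 0-based here, whereas the text uses $[k] = \{1, \dots, k\}$. The last lines check the permutation invariance mentioned in the introduction: relabeling the classes consistently leaves the loss unchanged.

```python
import math

def softmax(logits):
    """Map a list of logits to a point on the probability simplex."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, y):
    """-sum_i q_i log(probs_i) with q one-hot at index y (0-based)."""
    return -math.log(probs[y])

# Hypothetical logits produced by the linear layer for k = 3 classes.
logits = [2.0, 0.5, -1.0]
probs = softmax(logits)
loss = cross_entropy(probs, y=0)

# Permuting the class order (and the label accordingly) gives the same loss.
permuted_logits = [0.5, -1.0, 2.0]
permuted_loss = cross_entropy(softmax(permuted_logits), y=2)
```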

2.1. CLASS HIERARCHY

In classification problems, the target label $Y \in [k]$ is treated as a categorical random variable that can take $k$ different nominal values. However, there is no particular ordering among these $k$ categories, i.e., for different categories $i, j \in [k]$, one can only compare whether $i = j$ or not. Formally, letting $d_H(i, j) = 0$ if $i = j$ and $d_H(i, j) = 1$ otherwise defines a metric $d_H(\cdot, \cdot)$ over $\mathcal{Y}$. However, in many real-world applications the similarity between different classes is not binary. Consider object classification in ImageNet (Deng et al., 2009) as an example. Intuitively, one would think the distance between the classes corgi and chihuahua to be smaller than that between corgi and panda. One way to characterize this distance between different classes is through a tree of class hierarchy, also known as a dendrogram. An example is shown in Fig. 1a. Formally, let $T := (V, E, d)$ be a weighted tree, where $V$ is the set of nodes, $E$ the set of weighted edges in $T$, and $d : V \times V \to \mathbb{R}_+$ specifies the distance between nodes in the tree. Each node $v_S$ in $T$ is associated with a subset of class labels $S \subseteq [k]$, and can be recursively defined as follows:
1. For each class $i \in [k]$, there is a corresponding leaf node $v_i \in V$. Conversely, each leaf node $v_i \in V$ is identified with a single class label $i \in [k]$.
2. For some $S \subseteq [k]$, if $v_S \in V$ is not a leaf node in $T$, then its children form a partition of $S$. In other words, if $v_{S_1}, \dots, v_{S_c}$ are the children of $v_S$, then $\forall i \neq j$, $S_i \cap S_j = \emptyset$ and $\cup_{i \in [c]} S_i = S$.
3. The root node of $T$ is $v_{[k]}$.

At a colloquial level, the tree $T$ specifies a hierarchy of class labels that represents the structured knowledge among them. For example, as shown in Fig. 1a for the MNIST dataset, the two children of the root node correspond to the odd and even numbers, respectively. Accordingly, the distance between digits 1 and 3 is smaller than that between digits 1 and 2.

Figure 1: Fig. 1a: MNIST class hierarchy. The root node contains all the 10 digit classes. The two children nodes of the root node correspond to the coarse classes of odd and even digits, respectively. Each leaf node in this class hierarchy corresponds to a fine class label (digit). Fig. 1b: An example of a class hierarchy tree $T$ along with a visualization of the data in the feature space. The CPCC score computes the correlation coefficient of the tree metric from $T$ in the left panel and the corresponding Euclidean distance obtained from the feature space in the right panel.
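The recursive definition of the class hierarchy can be made concrete with a small data structure. The sketch below encodes the MNIST odd/even hierarchy of Fig. 1a as a dictionary from internal nodes to children; the dictionary layout and the function name `label_set` are illustrative assumptions, not part of the original method. The function recovers the label subset $S$ attached to a node and checks that the children of every internal node are pairwise disjoint, as required by condition 2.

```python
# Each internal node maps to its children; leaves are the class labels.
mnist_tree = {
    "root": ["even", "odd"],
    "even": [0, 2, 4, 6, 8],
    "odd": [1, 3, 5, 7, 9],
}

def label_set(tree, node):
    """Return the set S of class labels associated with the node v_S."""
    if node not in tree:  # leaf: identified with a single class label
        return {node}
    child_sets = [label_set(tree, child) for child in tree[node]]
    merged = set().union(*child_sets)
    # The children must be pairwise disjoint to form a partition of S.
    assert sum(len(s) for s in child_sets) == len(merged), "children overlap"
    return merged

root_labels = label_set(mnist_tree, "root")  # all ten digits
odd_labels = label_set(mnist_tree, "odd")
```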

2.2. COPHENETIC CORRELATION COEFFICIENT (CPCC)

In the context of clustering, Sokal & Rohlf (1962) introduced the cophenetic correlation coefficient (CPCC) to evaluate the correspondence between two dendrograms. The CPCC is the Pearson correlation coefficient between two sequences of pairwise distances. For a class hierarchy $T$ and a node $v \in V$, the depth $\mathrm{dt}(v)$ of $v$ is the length of the shortest path from $v$ to the root of $T$. In the original applications of CPCC, the "dendrogrammatic" ground-truth distance $t(v_i, v_j)$ between a pair of nodes $v_i, v_j$ in $T$ is defined as follows:

$$t(v_i, v_j) := \max\{\mathrm{dt}(v_i), \mathrm{dt}(v_j)\} - \mathrm{dt}(\mathrm{LCA}(v_i, v_j)),$$

where $\mathrm{LCA}(v_i, v_j)$ is the least common ancestor (LCA) of $v_i$ and $v_j$. In Fig. 1a, the LCA of 1 and 3 is "odd" while the LCA of 1 and 0 is the root node. As an example of ground-truth distance, let us consider the class hierarchy tree $T$ and clustering of classes given in Fig. 1b. In Fig. 1b, $t(\triangle_a, \triangle_b) = 1$ since they share an LCA $\triangle$ at L1, and $t(\triangle_a, \square_c) = 2$ as they go up 2 levels to meet at the root node, which is their LCA. Now consider a dataset $D$. For a node $v_i \in T$, since $v_i$ corresponds to a subset of classes, we use $D_i \subseteq D$ to denote the subset of data points whose class label belongs to $v_i$. The pairwise distance between $D_i$ and $D_j$ is then defined as the Euclidean distance between the centers of $D_i$ and $D_j$:

$$\rho(v_i, v_j) := \left\| \frac{1}{n_i} \sum_{x \in D_i} x - \frac{1}{n_j} \sum_{x' \in D_j} x' \right\|_2,$$

where $n_i = |D_i|$ and $n_j = |D_j|$ are the number of points in each cluster. Then, the CPCC score $\mathrm{CPCC}(t, \rho)$ between distances $t$ and $\rho$ is defined as:

$$\mathrm{CPCC}(t, \rho) := \frac{\sum_{i<j} (t(v_i, v_j) - \bar{t})(\rho(v_i, v_j) - \bar{\rho})}{\left(\sum_{i<j} (t(v_i, v_j) - \bar{t})^2\right)^{1/2} \left(\sum_{i<j} (\rho(v_i, v_j) - \bar{\rho})^2\right)^{1/2}},$$

where $\bar{t} := \frac{2 \sum_{i<j} t(v_i, v_j)}{k(k-1)}$ and $\bar{\rho} := \frac{2 \sum_{i<j} \rho(v_i, v_j)}{k(k-1)}$ are the averages of all the pairwise distances for $t$ and $\rho$.
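The CPCC definition above can be sketched in a few lines of plain Python. The example below is a minimal illustration under assumed inputs: four hypothetical fine classes (0 and 1 under one coarse class, 2 and 3 under another), with hand-picked two-dimensional class centers standing in for the cluster means. The function names `pearson` and `cpcc` are illustrative choices.

```python
import math
from itertools import combinations

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def cpcc(tree_dist, centers):
    """Correlate tree distances with Euclidean distances between class centers.

    tree_dist maps a pair (i, j) with i < j to t(v_i, v_j);
    centers maps a class label to its cluster center.
    """
    pairs = list(combinations(sorted(centers), 2))
    t = [tree_dist[p] for p in pairs]
    rho = [math.dist(centers[i], centers[j]) for i, j in pairs]
    return pearson(t, rho)

# Hypothetical two-level hierarchy: {0, 1} and {2, 3} are the coarse classes.
tree_dist = {(0, 1): 1, (2, 3): 1, (0, 2): 2, (0, 3): 2, (1, 2): 2, (1, 3): 2}

# Centers that respect the hierarchy versus centers that mix coarse classes.
aligned = {0: [0.0, 0.0], 1: [1.0, 0.0], 2: [10.0, 0.0], 3: [11.0, 0.0]}
scrambled = {0: [0.0, 0.0], 2: [1.0, 0.0], 1: [10.0, 0.0], 3: [11.0, 0.0]}

score_aligned = cpcc(tree_dist, aligned)
score_scrambled = cpcc(tree_dist, scrambled)
```

In this toy configuration the aligned centers give a score close to 1, while the scrambled centers give a negative score, which is the behavior the coefficient is meant to capture.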

3. OUR METHOD

In this section we first define an alternative tree metric used to measure the distance between two nodes in a tree. Then, we proceed to discuss our method that uses the proposed tree metric to learn structured representations when a class hierarchy T is available during supervised learning.

3.1. TREE METRIC

Given a tree $T = (V, E, d)$ forming a class hierarchy, we formally define the tree metric $d_T$ as:

Definition 3.1 (Tree Metric). The tree metric $d_T(v, v')$ for any pair of nodes $v, v' \in V$ is the weighted length of the shortest path in $T$ connecting $v$ and $v'$.

Proposition 3.1. For any undirected weighted graph $G$, $d_T(\cdot, \cdot)$ is a well-defined metric over $\mathcal{Y}$, if all edge weights of $G$ are positive and $G$ is connected.

Our main motivation to use the tree metric $d_T$ instead of $t$ as defined in Section 2.2 is twofold. First, while each edge in the class hierarchy $T$ corresponds to a subset relationship, there are other kinds of structured relationships between classes that go beyond the subset relationship. For example, the benchmark dataset MetaShift (Liang & Zou, 2021) also contains a graph to describe the relationship between different subsets of classes. However, in MetaShift the weighted edges do not correspond to the subset relationship as in the case of a class hierarchy, but rather to a similarity/discrepancy measure between them. In this case, the relationship between classes corresponds to an undirected and weighted graph, where the notion of LCA does not apply any more. In fact, even in the case of a tree, the LCA is subject to change depending on which node in the tree is chosen as the root node. However, the tree metric $d_T$ is invariant to rotations of the tree, and applies to both trees and general graphs as shown in Proposition 3.1. Second, the definition of $t(\cdot, \cdot)$ does not account for the weights of edges in $T$. This implies that all the fine-grained classes under a given super-class are treated identically. However, in many applications, different fine-grained classes may have different proportions or importance for a given super-class. In these cases, the tree metric is better suited since it also takes into account the edge weights.
Nevertheless, when $T$ is an unweighted class hierarchy tree, we have the following relationship between the proposed tree metric $d_T(\cdot, \cdot)$ and $t(\cdot, \cdot)$:

Proposition 3.2. Let $T$ be an unweighted tree with a fixed root node. Then for any pair of nodes $v_i, v_j \in V$, $d_T(v_i, v_j) = 2t(v_i, v_j) - |\mathrm{dt}(v_i) - \mathrm{dt}(v_j)|$.

The proof is deferred to App. A. In particular, Proposition 3.2 states that if the depths of $v_i$ and $v_j$ are the same in $T$, then our tree metric reduces to twice $t(v_i, v_j)$. Multiplying a variable by a positive constant does not affect its correlation with others. Hence, in what follows, we use the tree metric $d_T(\cdot, \cdot)$ in place of $t(\cdot, \cdot)$ to compute CPCC.
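Proposition 3.2 can be checked numerically on a small example. The sketch below uses a hypothetical unweighted hierarchy stored as a child-to-parent map, computes $t$ from depths and the LCA, computes $d_T$ independently by breadth-first search on the undirected tree, and compares the two sides of the identity for every pair of nodes. All names are illustrative assumptions.

```python
from collections import deque
from itertools import combinations

# Hypothetical unweighted hierarchy given as child -> parent.
parent = {"odd": "root", "even": "root", 1: "odd", 3: "odd", 0: "even", 2: "even"}

def ancestors(v):
    """Path from v up to the root, inclusive."""
    path = [v]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path

def depth(v):
    return len(ancestors(v)) - 1

def lca(u, v):
    """Least common ancestor: first ancestor of v that is also above u."""
    seen = set(ancestors(u))
    return next(a for a in ancestors(v) if a in seen)

def t(u, v):
    """Dendrogrammatic distance from Section 2.2."""
    return max(depth(u), depth(v)) - depth(lca(u, v))

def tree_metric(u, v):
    """Shortest-path length via breadth-first search on the undirected tree."""
    adj = {}
    for child, par in parent.items():
        adj.setdefault(child, []).append(par)
        adj.setdefault(par, []).append(child)
    dist = {u: 0}
    queue = deque([u])
    while queue:
        x = queue.popleft()
        for y in adj[x]:
            if y not in dist:
                dist[y] = dist[x] + 1
                queue.append(y)
    return dist[v]

nodes = ["root", "odd", "even", 0, 1, 2, 3]
identity_holds = all(
    tree_metric(u, v) == 2 * t(u, v) - abs(depth(u) - depth(v))
    for u, v in combinations(nodes, 2)
)
```

For two leaves at the same depth, such as digits 1 and 3, the absolute-value term vanishes and the tree metric equals twice $t$.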

3.2. STRUCTURED REPRESENTATIONS BY EMBEDDING THE TREE METRIC

Now that we have defined our tree metric $d_T(\cdot, \cdot)$, we are interested in learning representations $f_\theta(\cdot)$, by optimizing the model parameter $\theta$, such that the Euclidean distance between the representations of any pair of class centers $i, j \in [k]$ approximates $d_T(v_i, v_j)$ in the tree $T$. Consider a dataset $D = \{(x_i, y_i)\}_{i=1}^{n}$ of size $n$ for a classification problem over $\Delta_k$. We propose to use the CPCC between $\rho_Z(\cdot, \cdot)$ and $d_T(\cdot, \cdot)$ as a regularizer to the cross-entropy loss, resulting in the loss function:

$$\mathcal{L}(D) = \sum_{(x, y) \in D} \ell_{CE}(g(f_\theta(x)), y) - \lambda \cdot \mathrm{CPCC}(d_T, \rho_Z),$$

where $\lambda > 0$ is the regularization strength ($\lambda = 1$ in the experiments). The Euclidean distance $\rho_Z$ is computed in feature space. Concretely, we first apply the encoder $f_\theta$ to $D$ and obtain a set of points in $\mathcal{Z} \times \mathcal{Y}$: $D_Z := \{(f_\theta(x_i), y_i)\}_{i=1}^{n}$. Then, we partition $D_Z$ into $k$ subsets according to the ground-truth labels, and consider the same tree structure $T$ on $D_Z$. Note the negative sign before the coefficient $\lambda$ in the above formulation, as we wish to maximize the CPCC score. In practice, since stochastic optimization methods are used, at each training iteration we process a batch of inputs instead of the whole dataset $D$. For each incoming batch, we track the number of fine classes represented in the batch before any pairwise calculation. When all the inputs in a batch come from the same coarse class, the CPCC score is not well-defined because the variance of $d_T$ is 0. This can happen when the batch size is relatively small. In such cases, we fix the value of the CPCC regularizer to 0 to avoid a division by zero.

Time Complexity of the CPCC Regularizer. The computation of our CPCC regularizer is lightweight. For a feature space with $p$ dimensions, for each training iteration, there will be at most $O(p \min(b^2, k^2))$ additional computations, where $b$ is the batch size.
Such an overhead is often negligible when compared with the computations needed to train a neural network. In App. B we also provide a brief discussion on the convergence of optimizing the above objective function with SGD.
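The per-batch computation described in this section can be sketched as follows. This is a minimal plain-Python illustration, not the authors' implementation: it assumes a two-level unweighted hierarchy (so that sibling fine classes are 2 edges apart and all other pairs are 4 edges apart), and the names `batch_cpcc`, `total_loss`, and `coarse_of` are hypothetical. In actual training the same quantities would be computed with an automatic differentiation library so that gradients flow back to $\theta$.

```python
import math
from itertools import combinations

def batch_cpcc(features, labels, coarse_of):
    """CPCC between the tree metric and class-center distances for one batch."""
    groups = {}
    for z, y in zip(features, labels):
        groups.setdefault(y, []).append(z)
    # Class-conditional mean feature for every fine class present in the batch.
    centers = {y: [sum(col) / len(zs) for col in zip(*zs)]
               for y, zs in groups.items()}
    pairs = list(combinations(sorted(centers), 2))
    # Assumed two-level unweighted tree: siblings are 2 edges apart, others 4.
    d_tree = [2.0 if coarse_of[i] == coarse_of[j] else 4.0 for i, j in pairs]
    rho = [math.dist(centers[i], centers[j]) for i, j in pairs]
    if len(set(d_tree)) < 2 or len(set(rho)) < 2:
        return 0.0  # zero variance: correlation undefined, regularizer set to 0
    mean_d = sum(d_tree) / len(pairs)
    mean_r = sum(rho) / len(pairs)
    cov = sum((d - mean_d) * (r - mean_r) for d, r in zip(d_tree, rho))
    var_d = sum((d - mean_d) ** 2 for d in d_tree)
    var_r = sum((r - mean_r) ** 2 for r in rho)
    return cov / math.sqrt(var_d * var_r)

def total_loss(ce_loss, cpcc_value, lam=1.0):
    """Cross-entropy minus lambda times CPCC, since CPCC is to be maximized."""
    return ce_loss - lam * cpcc_value

# Hypothetical batch: fine classes 0, 1 under coarse "A" and 2, 3 under "B".
coarse_of = {0: "A", 1: "A", 2: "B", 3: "B"}
features = [[0.0, 0.0], [0.0, 2.0], [1.0, 0.0], [1.0, 2.0],
            [10.0, 0.0], [10.0, 2.0], [11.0, 0.0], [11.0, 2.0]]
labels = [0, 0, 1, 1, 2, 2, 3, 3]

full_batch_score = batch_cpcc(features, labels, coarse_of)
single_coarse_score = batch_cpcc(features[:4], labels[:4], coarse_of)
```

The number of class centers in a batch is at most $\min(b, k)$, so the loop over pairs touches at most $O(\min(b^2, k^2))$ pairs and each distance costs $O(p)$, which matches the bound stated above.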

3.3. THE BENEFITS OF STRUCTURED REPRESENTATIONS

In what follows, we describe two potential benefits of learning structured representations with the proposed CPCC regularizer, before providing thorough empirical validation in Section 4.

Interpretability. As we briefly discussed before, one potential drawback of the representations learned through supervised learning is the lack of interpretability. Recent works (Papyan et al., 2020; Han et al., 2021) have both empirically and theoretically (under certain assumptions) shown that under the cross-entropy loss, when enough training has happened, the learned representations will have reduced variance within each class, and the set of features corresponding to different classes will converge to the so-called simplex equiangular tight frame (ETF). Yet, the vertices of the simplex ETF are symmetric (in the sense of being permutation-invariant), hence the class features do not necessarily reflect the similarities/differences between different classes, even in feature space. By enforcing the Euclidean distances in feature space between different classes to be close to the tree metric through our CPCC regularization, we attempt to break the symmetry in learning the features. This can potentially lead to more interpretable features, as closer classes (in the sense of the tree metric) are closer to each other in feature space.

Generalization. Another by-product of structured representations is potentially better generalization, both in-distribution when only a limited number of labels is available, and under sub-population shifts (Santurkar et al., 2020). To see this, note that the goal of our CPCC regularizer is consistent with classification accuracy: it essentially pushes data from different classes away proportionally to their distance in the tree.
Consequently, for sub-population shifts, if the hierarchy correctly captures the coarse-fine relationship, future unseen fine-grained classes from the same coarse category will be further away from those under a different coarse category. This may help generalize to unseen fine-grained classes in zero- or few-shot learning.

4. EXPERIMENTS

In this section, we apply our proposed method to: (i) study how using CPCC during training affects the representation learnt under various training objectives (Section 4.3), and (ii) see how the learned structured representations can improve generalization (Section 4.4).

4.1. DATA

We conduct our experiments on MNIST (Lecun et al., 1998), CIFAR100 (Krizhevsky, 2009), and BREEDS (Santurkar et al., 2020). By using this variety of datasets and hierarchies, we get a comprehensive overview of the usefulness of CPCC as a regularizer. See App. G for the full hierarchies.

MNIST contains handwritten digits from 0 to 9. We define odd and even digits to be two coarse classes (Fig. 1a). The digits in the leaves are called fine classes below. The artificial level based on odd- and even-ness corresponds to concepts that are not visually observable.

CIFAR100 comes with a predefined hierarchy: its coarse level has 20 classes, each containing 5 fine classes (e.g., beaver, dolphin, otter, seal and whale belong to aquatic mammals). While the hierarchies are semantically meaningful, the coarse level labels are not purely defined by visual similarities. For instance, it is hard to tell the size of an animal from its image (making the coarse classes large omnivores/herbivores and mid-size mammals difficult to distinguish).

BREEDS is a benchmark built on ImageNet (Deng et al., 2009). It contains a manually calibrated label hierarchy, based solely on shared visual characteristics. Santurkar et al. (2020) proposed four tasks: LIVING17, ENTITY13, ENTITY30, and NONLIVING26. For each, we consider the leaf nodes, which are ImageNet classes, as our fine level classes, and define the coarse levels to be their "superclasses" at different depths. We end up with trees that only contain the root node and two levels of the initial hierarchy, and ignore intermediate relationships for CPCC regularization.

New Levels. Based on the coarse and fine levels in the hierarchy, we insert a mid level between the coarse and fine ones, as well as a coarser level between the coarse one and the root node. This results in class counts satisfying $k_{coarser} < k_{coarse} < k_{mid} < k_{fine}$.
In MNIST, the mid classes are 1,3,5 (odd numbers ≤ 5), 7,9 (odd numbers > 5), 0,2,4 (even numbers ≤ 5), and 6,8 (even numbers > 5). We do not consider a coarser level (it is trivial to train on the root node). In CIFAR100, each coarse class (containing 5 fine classes) is split into arbitrary groups of 2 and 3 fine classes, creating 40 classes in the mid level. Arbitrary pairs of coarse classes are merged into 1 coarser label each, creating 10 coarser labels. Since BREEDS contains 8 non-root levels in total, and all 4 datasets' coarse levels have a depth ≥ 2, we use the original hierarchy and let the mid level be one level above the fine classes, and the coarser level be one level above the coarse classes.

Source & Target Split. We split BREEDS into source ($s$) and target ($t$): $s$ and $t$ have the same coarser and coarse labels, but mid and fine classes are different. Following this idea, recall that we split the coarse classes of MNIST and CIFAR into groups of 2 and 3. We take 60 fine classes as CIFAR$_s$ and the rest as CIFAR$_t$, and 6 classes as MNIST$_s$ and 4 as MNIST$_t$. Due to this construction, there is only one mid class in each coarse class of CIFAR$_{s/t}$ and MNIST$_{s/t}$. On the other hand, the coarse classes of BREEDS have many mid children.

4.2. BASELINES AND METRICS

As mentioned above, we operate in a fully supervised setting. We denote our neural network as a function $h : \mathcal{X} \to \Delta_k$, where $h \doteq g \circ f_\theta$ can be decomposed into a feature extractor $f_\theta$ and a linear classifier $g$ to which the softmax is applied. Training is performed using the following objectives, on the fine-coarse hierarchy of MNIST, CIFAR, and BREEDS, with and without CPCC as a regularizer (see App. C and F for more details). Our baselines include:
• Flat $\ell_{CE}$: training on the fine classes only, without leveraging any hierarchical information.
• Multi-task Learning: jointly training a two-headed network to treat fine and coarse as two separate tasks. The loss function is the sum of the cross-entropies on the fine and coarse classification tasks, and we simply set the weight of the two parts to 1.
• Curriculum Learning: In the spirit of curriculum learning, we first train on the coarse classes using $\ell_{CE}$ with $y_{coarse}$ instead of $y_{fine}$. In the second step, we remove the linear classifier and fine-tune a new one on the fine level labels with $\ell_{CE}$ as the loss function.
• Sum Loss: We define a hierarchical Sum Loss as $\ell_{CE}(Wh(x), y_{coarse}) + \ell_{CE}(h(x), y_{fine})$, where $W$ is a $k_1$ by $k_2$ matrix representing the relationships in the label tree: if a fine class $i$ belongs to a coarse class $j$, then $W_{ji}$ is 1, otherwise the entry is set to 0.
• HXE: The Hierarchical Cross Entropy (Bertinetto et al., 2020), which replaces the predicted output in $\ell_{CE}$ with weighted hierarchical class conditional probabilities.
• Soft: The soft labels objective (Bertinetto et al., 2020), where labels in $\ell_{CE}$ are derived from a mapping function to encode class node similarity in $y$.
• Quad: The Quadruplet (Zhang et al., 2016) multi-task loss, which combines $\ell_{CE}$ with a generalized triplet loss to enforce different margins at different levels of the hierarchy.

Metrics. To evaluate the representation structure learnt with the various loss functions, as well as the influence of CPCC, we use (i) silhouette scores (Rousseeuw, 1987) to measure the salience of clustering patterns at the coarse level, (ii) CPCC as a metric to measure how similar the whole representation structure is to the fine-coarse hierarchy, (iii) t-SNE (Van der Maaten & Hinton, 2008) for visualization of the learnt embeddings in 2D, and (iv) a symmetric distance matrix to evaluate the hierarchical structure. Specifically, we calculate the Euclidean distance between the mean representation vectors for each pair of fine classes. The matrix is organized so that fine classes from the same coarse class are grouped together: the coarse within-cluster distances appear around the diagonal, while the other entries show coarse level between-cluster distances.
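The Sum Loss baseline listed above relies on the indicator matrix $W$ to turn fine-class probabilities into coarse-class probabilities. The following sketch illustrates this on a hypothetical label tree with 5 fine classes and 2 coarse classes; the variable names and the probability vector are made up for illustration, and class indices are 0-based.

```python
import math

# Hypothetical label tree: fine classes 0, 1 -> coarse 0; fine 2, 3, 4 -> coarse 1.
fine_to_coarse = [0, 0, 1, 1, 1]
k_fine, k_coarse = 5, 2

# W[j][i] = 1 if fine class i belongs to coarse class j, otherwise 0.
W = [[1.0 if fine_to_coarse[i] == j else 0.0 for i in range(k_fine)]
     for j in range(k_coarse)]

def matvec(matrix, vector):
    """Plain matrix-vector product."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def sum_loss(p_fine, y_fine):
    """Cross-entropy on coarse probabilities W p plus cross-entropy on fine ones."""
    p_coarse = matvec(W, p_fine)
    y_coarse = fine_to_coarse[y_fine]
    return -math.log(p_coarse[y_coarse]) - math.log(p_fine[y_fine])

# Hypothetical softmax output h(x) over the fine classes.
p_fine = [0.1, 0.2, 0.3, 0.25, 0.15]
p_coarse = matvec(W, p_fine)
loss = sum_loss(p_fine, y_fine=2)
```

Because every fine class belongs to exactly one coarse class, each column of $W$ contains a single 1, so the coarse probabilities still sum to one.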

4.3. STRUCTURE OF THE LEARNT REPRESENTATIONS

Fig. 2 shows the effects of training with CPCC, which match our expectations shown in Fig. 1b. In the distance matrix, we can see that the within-coarse cluster distance is much smaller than the between-coarse cluster distance (corresponding to diagonal 5 by 5 blocks). This is confirmed qualitatively in the t-SNE plots, where coarse groups tend to be better separated. Similar patterns are observed when CPCC is paired with the other loss functions described above. We want to point out that although other setups yield a more structured representation to some extent, the structure is not as clear as when they are paired with CPCC (see App. D for the figures). In Table 1, we see that the objectives leveraging the hierarchical information tend to increase the CPCC. This is particularly true of the multi-task and soft labels settings. However, directly optimizing the CPCC score still gives the largest gains, both on CPCC and silhouette scores.

4.4. GENERALIZATION ON DATASETS WITH A SHARED HIERARCHY

Given a representation $f_\theta$ trained on the fine-coarse hierarchy (Section 4.2), we want to see if structured representations help performance in-hierarchy, i.e., on the fine and coarse classes the model was trained on, as well as out-of-hierarchy, i.e., on new levels and/or new classes of the hierarchy.

In-hierarchy. In this setting, we evaluate the models on classes and levels used during training to construct the various objectives and the tree metric (i.e., the fine and coarse classes). Results can be found in the FineAcc and CoarseAcc columns of Table 1. Adding our CPCC regularizer leads to better test accuracy at both levels, across objectives and datasets, with gains sometimes exceeding 1%. According to Goyal et al. (2021), such a performance gain (especially on fine classes) is rarely observed when hierarchical information is leveraged. Overall, our findings suggest that when such information is available, using CPCC as a regularizer is beneficial.

Out-of-hierarchy. Two questions naturally arise in our hierarchical setting. The first one is how well CPCC structured representations generalize to new levels of the hierarchy. To answer this question, we report in Table 1 the accuracy on the new mid and coarser levels defined in Section 4.1, which are not used during training. This accuracy is obtained zero-shot, via a simple marginalization (e.g., the probability of a mid class is the sum of the probabilities of all fine classes that belong to it). Here too, adding the CPCC regularization results in performance gains (also see App. Table 4). The second natural question is whether CPCC structured representations can generalize to classes unseen in the training hierarchy. Assume a model has learned that cats and dogs are animals. Does knowing the animal concept help it understand giraffes or horses better? To explore this, we train our models on the source split of the mid and fine classes (Section 4.1), and evaluate their performance on the target split.
We can still apply "zero-shot" transfer to the coarse and coarser levels via marginalization, but fine-tuning is necessary to classify the new mid and fine classes: we freeze $f_\theta$ and fine-tune a linear classifier $g_{mid}$ or $g_{fine}$ on a single image from each new target label (one-shot generalization). Results are shown in Table 2. First, under this sub-population shift, using CPCC still outperforms the original loss functions on coarse and coarser levels (zero-shot), which is consistent with the results in Table 1. Second, in one-shot generalization to new mid levels, CPCC often gives a large advantage. Intuitively, as all fine classes are grouped together within coarse groups, if one data point is randomly selected, then other data points in the same coarse class will readily be assigned the same label. Without this structure in the representation, generalization is more difficult as all fine labels are evenly distributed. The only notable exception is ENTITY13, where each coarse label has too many mid level children and grouping by coarse level hurts. Third, CPCC regularization is often harmful to one-shot fine level generalization due to coarse grouping: new fine classes are close together at the coarse level, making them hard to linearly separate. The structure of the label tree matters: compared to LIVING17 and NONLIVING26, the fine level labels of ENTITY partition coarse labels into much more fine-grained subsets, resulting in the performance difference in BREEDS. We do observe that other hierarchical methods have some advantage compared to the flat cross-entropy.
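The zero-shot marginalization used for the new levels can be illustrated with the MNIST mid classes from Section 4.1. In the sketch below the group names, function names, and the fine-class probability vector are hypothetical; only the grouping of digits follows the text. The mid-class score is the sum of the fine-class probabilities of its members, and the prediction is the group with the largest score.

```python
# MNIST mid classes as described in Section 4.1 (group names are illustrative).
mid_classes = {
    "odd<=5": [1, 3, 5],
    "odd>5": [7, 9],
    "even<=5": [0, 2, 4],
    "even>5": [6, 8],
}

def marginalize(p_fine, groups):
    """Probability of each group = sum of probabilities of its fine classes."""
    return {name: sum(p_fine[i] for i in members)
            for name, members in groups.items()}

def predict(p_fine, groups):
    """Zero-shot prediction at the new level: group with the largest mass."""
    scores = marginalize(p_fine, groups)
    return max(scores, key=scores.get)

# Hypothetical fine-level softmax output over the digits 0..9.
p_fine = [0.0, 0.3, 0.0, 0.0, 0.0, 0.0, 0.25, 0.0, 0.25, 0.2]

mid_scores = marginalize(p_fine, mid_classes)
mid_prediction = predict(p_fine, mid_classes)
fine_prediction = max(range(10), key=lambda i: p_fine[i])
```

In this made-up example the most likely fine class is digit 1, yet the predicted mid class is the group containing 6 and 8, because the probability mass is aggregated per group before taking the maximum.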

5. RELATED WORK

Many works have exploited label hierarchies, and we only refer to the most closely related ones. To the best of our knowledge, none of the previous works set learning structured representations as their main objective or embedded the tree metric in this context.

Background of Baseline Methods

The simplest label hierarchy contains only two levels, coarse and fine, which can be treated as two tasks trained jointly or sequentially. The former originates from Multi-task Learning (MTL), where part of a single network is shared by multiple heads, one for each task, during training (Caruana, 1997; Zhao et al., 2020; Inoue et al., 2020). The latter echoes Curriculum Learning (CL) (Bengio et al., 2009), where pretraining on an easy task helps the convergence and performance on a hard task. We define coarse level classes following hierarchical methods, as well as label embedding methods (encoding hierarchical information into labels, the Soft objective), hierarchical losses (Quad (Zhang et al., 2016), HXE). None of these use a regularization method, with the exception of group overlapping lasso (Zhao et al., 2011). However, it was introduced for logistic regression, making it hard to apply to modern neural networks that use the penultimate layer as their representation.

Learning with Label Hierarchy. The most common motivation of hierarchical models is to improve the fine-level accuracy. Interestingly, accuracy improvements are often mixed: while most works claimed performance improvements, Wang & Cottrell (2015) stated that this improvement was limited, and Goyal et al. (2021) claimed that most hierarchical models lead to worse performance on non-hierarchical accuracy metrics. Additionally, coarse level labels often appear in a weakly supervised setting, where coarse classes are always available but fine class labels are only accessible for part of the data, to reduce annotation cost at a finer level (Taherkhani et al., 2019; Lei et al., 2017; Ristin et al., 2015). Other works build the hierarchy from the dataset (Murdock et al., 2016; Li et al., 2010; Verma et al., 2012; Han et al., 2018; Zheng et al., 2017). We only name a few since they are very different from our setting, where the hierarchy is defined before training.

6. CONCLUSION

How to incorporate label relations into representations is an open question. In this paper, in the context of tree label hierarchies, we use the cophenetic correlation coefficient as a regularizer to embed this hierarchical relationship into representations, and outperform other baseline methods. CPCC has multiple advantages, including low time complexity, better interpretability, compatibility with any supervised learning paradigm, and applicability to any common label relation graph. We also demonstrate that it leads to better generalization performance on several downstream tasks. All these benefits show that our method provides an interesting solution to this important problem.



Figure 2: The matrices show the distance between fine CIFAR100 classes with and without CPCC for Flat (lighter color means smaller distance, same color palette used for both). The light diagonal blocks with CPCC correspond to the coarse classes. We also show t-SNE visualization of the representations (colored by coarse labels) learnt using Flat with and without CPCC regularization. The pattern is aligned with the distance matrix. When CPCC is used, fine classes from the same coarse classes tend to be closer, and coarse classes tend to be further apart.

Table 1: Mean % and standard deviation over 5 seeds for various datasets, objectives and metrics, with and without CPCC (overall best in bold, best for a given objective with/without CPCC underlined). Results for BREEDS are on the source split. Regularizing with CPCC never hurts performance, and in most cases leads to consistent and sometimes significant improvements on all metrics.

Table 2: The superscript denotes 1- or 0-shot generalization. All models are trained on the source split s and evaluated on the target split t. s and t have different fine/mid classes but the same coarse/coarser classes. CPCC shows an advantage on mid, coarse and coarser, but not fine, levels.

Flat      23.50 (1.74)  25.37 (2.36)  39.31 (0.21)  53.14 (0.14)
FlatCPCC  24.04 (1.04)  27.99 (2.53)  42.49 (0.54)  56.14 (0.73)

ACKNOWLEDGMENTS

Han Zhao would like to acknowledge support from a Facebook research award and Amazon AWS Cloud Credits.

