LEARNING STRUCTURED REPRESENTATIONS BY EMBEDDING CLASS HIERARCHY

Abstract

Existing models for learning representations in supervised classification problems are permutation invariant with respect to class labels. However, structured knowledge about the classes, such as hierarchical label structures, widely exists in many real-world datasets, e.g., the ImageNet and CIFAR benchmarks. How to learn representations that preserve such structures among the classes remains an open problem. To approach this problem, given a tree of class hierarchy, we first define a tree metric between any pair of nodes in the tree to be the length of the shortest path connecting them. We then provide a method to learn the hierarchical relationship of class labels by approximately embedding the tree metric in the Euclidean space of features. More concretely, during supervised training, we propose to use the Cophenetic Correlation Coefficient (CPCC) as a regularizer for the cross-entropy loss to correlate the tree metric of classes and the Euclidean distance in the class-conditioned representations. Our proposed regularizer is computationally lightweight and easy to implement. Empirically, we demonstrate that this approach helps to learn more interpretable representations due to the preservation of the tree metric, and leads to better generalization both in-distribution and under sub-population shifts over multiple datasets.

1. INTRODUCTION

In supervised learning, the cross-entropy loss is often used for classification tasks. As a common practice in deep learning, in order to train a model for classification, practitioners build a linear layer over the representation to obtain the logit score of each class. A softmax transformation is then applied to convert the logits into a vector belonging to the probability simplex. As a result, we can randomly permute the representations of any classes without affecting the performance of the original classification task. However, in many real-world datasets, as we move towards fine-grained classification, labels are no longer independent of each other: ImageNet (Deng et al., 2009) inherits its label relationships from WordNet (Fellbaum, 1998), which contains both semantic and lexical connections; iNaturalist (Van Horn et al., 2017) borrows the biological taxonomy, so that each image carries seven labels reflecting the morphological characteristics of the organism. Many existing works (Deng et al., 2014; Yan et al., 2014; Ristin et al., 2015; Guo et al., 2018; Chen et al., 2019) investigated how to leverage this hierarchical information for various purposes, but how to explicitly project this knowledge onto representations remains unexplored. In this paper, we focus on the most common label relationship: tree hierarchy. As illustrated in Fig. 1b, given a tree hierarchy of classes, our goal is to learn representations in feature space such that the Euclidean distances between different class centers approximate the distances between these classes in the tree. More concretely, we shall first define a tree metric to be the length of the shortest path connecting two subsets of classes in the tree hierarchy.
Based on this tree metric, we then propose a regularizer, the cophenetic correlation coefficient (CPCC) between the sequence of pairwise tree distances and the sequence of pairwise Euclidean distances in feature space, to ensure that the class-conditional representations inherit the tree structure of the classes. Different from the original cross-entropy loss with softmax activation, the proposed CPCC regularizer helps to break the symmetry of permutation invariance among the classes, and thus also improves the interpretability of the learned representations. We show that the proposed CPCC regularizer is computationally lightweight with negligible overhead, and can be applied to a wide range of supervised learning paradigms, from standard flat empirical risk minimization to hierarchical objectives such as multitask learning and curriculum learning. For generalization, over six real-world datasets, we demonstrate that our proposed CPCC regularizer leads to improved generalization performance on some unseen tasks with sub-population shifts when there is only a limited amount of labeled data.
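To make the regularizer concrete, the CPCC is simply the Pearson correlation between the pairwise tree distances and the pairwise Euclidean distances of the class-mean features. Below is a minimal NumPy sketch on fixed features; the names `cpcc` and `tree_dists` are ours for illustration, and during training the same quantity would be computed on batch class-conditional means in a differentiable framework so that the objective ℓ_CE - λ·CPCC can be optimized end to end.

```python
import numpy as np

def cpcc(tree_dists, feats, labels):
    """Cophenetic correlation coefficient: the Pearson correlation between
    the tree metric t(i, j) and the Euclidean distance between the
    class-conditional mean features of classes i and j.

    tree_dists: (k, k) matrix of pairwise tree distances between classes.
    feats:      (n, p) feature matrix.
    labels:     (n,) integer class labels in {0, ..., k-1}.
    """
    k = tree_dists.shape[0]
    # Class-conditional mean feature ("class center") for each class.
    means = np.stack([feats[labels == c].mean(axis=0) for c in range(k)])
    t, d = [], []
    for i in range(k):
        for j in range(i + 1, k):
            t.append(tree_dists[i, j])
            d.append(np.linalg.norm(means[i] - means[j]))
    t, d = np.asarray(t), np.asarray(d)
    # Pearson correlation between the two distance sequences.
    tc, dc = t - t.mean(), d - d.mean()
    return (tc * dc).sum() / (np.sqrt((tc ** 2).sum()) * np.sqrt((dc ** 2).sum()))
```

A CPCC of 1 means the Euclidean geometry of the class centers perfectly reproduces the tree metric (up to a monotone linear relationship), which is why the CPCC is *subtracted* from the cross-entropy loss rather than added.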

2. PRELIMINARIES

In this section we first introduce the notation used throughout the paper and formulate our learning problem, and then briefly review the CPCC score, which quantifies the correlation between two sequences.

Notations and Setup

We shall use X and Y to denote the input and target random variables, living in spaces X and Y, respectively. In this work, we mainly focus on the supervised classification setting where for each input data point x ∈ X ⊆ R^d, there is a ground-truth label y ∈ Y = [k] := {1, . . . , k}, where k is the number of output classes. We let µ be the joint distribution over (X, Y) from which the data is sampled. During the learning process, the learner has access to a dataset D = {(x_j, y_j)}_{j=1}^n of size n sampled from µ. In the context of representation learning, a learned representation z = f_θ(x) is obtained by applying a feature encoder f_θ : X → Z parametrized by θ to x, where Z ⊆ R^p denotes the feature space. On top of the feature vector z, we further apply a linear predictor g : Z → ∆_k, where we use ∆_k to denote the (k-1)-dimensional probability simplex. We use the cross-entropy loss as our objective function. Specifically, let q_y ∈ ∆_k be the one-hot vector with the y-th component being 1. The cross-entropy loss ℓ_CE(·, ·) between the prediction g ∘ f(x) and the label y is given by ℓ_CE(g ∘ f(x), y) := -∑_{i∈[k]} (q_y)_i log(g(f(x))_i). For z, z′ ∈ Z, ∥z - z′∥_2 denotes the Euclidean distance between them.
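For concreteness, the softmax transformation and the cross-entropy loss defined above can be sketched in a few lines of NumPy (the function names are ours; in practice one would use a library implementation):

```python
import numpy as np

def softmax(logits):
    """Map a logit vector to the probability simplex (numerically stable)."""
    shifted = logits - logits.max()  # subtract max to avoid overflow
    exp = np.exp(shifted)
    return exp / exp.sum()

def cross_entropy(logits, y):
    """ℓ_CE for a single example: since q_y is one-hot, the sum over i
    reduces to the negative log-probability of the true class y."""
    probs = softmax(logits)
    return -np.log(probs[y])

# Example: 3 classes, true label y = 1.
loss = cross_entropy(np.array([2.0, 1.0, 0.5]), y=1)
```

Note that permuting the logit entries (and the label accordingly) leaves the loss unchanged, which is exactly the permutation invariance among classes that the CPCC regularizer is designed to break.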

2.1. CLASS HIERARCHY

In classification problems, the target label Y ∈ [k] is treated as a categorical random variable that can take k different nominal values. However, there is no particular ordering among these k categories, i.e., for different categories i, j ∈ [k], one can only compare whether i = j or not. Formally, letting d_H(i, j) = 0 if i = j and d_H(i, j) = 1 otherwise defines a metric d_H(·, ·) over Y. However, in many real-world applications the similarity between different classes is not binary. Consider object classification in ImageNet (Deng et al., 2009) as an example. Intuitively, one would expect the distance between the classes corgi and chihuahua to be smaller than that between corgi and panda. One way to characterize this distance between different classes is through a tree of class hierarchy, also known as a dendrogram. An example is shown in Fig. 1a. Formally, let T := (V, E, d) be a weighted tree, where V is the set of nodes, E the set of weighted edges in T, and d : V × V → R_+ specifies the distance between nodes in the tree. Each node v_S in T is associated with a subset of class labels S ⊆ [k], and can be recursively defined as follows:
1. For each class i ∈ [k], there is a corresponding leaf node v_i ∈ V. Conversely, each leaf node v_i ∈ V is identified with a single class label i ∈ [k].
2. For some S ⊆ [k], if v_S ∈ V is not a leaf node in T, then its children form a partition of S. In other words, if v_{S_1}, . . . , v_{S_c} are the children of v_S, then ∀i ≠ j, S_i ∩ S_j = ∅ and ∪_{i∈[c]} S_i = S.
3. The root node of T is v_{[k]}.
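Under this definition, the tree metric between two classes is the length of the shortest path connecting their leaf nodes in T. The following is a minimal sketch, assuming unit edge weights and a hierarchy given as a child-to-parent map (the names `tree_metric` and `parent`, and the toy taxonomy, are ours for illustration):

```python
def tree_metric(parent, u, v):
    """Shortest-path length between nodes u and v in a tree, where the tree
    is given as a child -> parent map with the root mapping to None.
    In a tree the shortest path goes through the lowest common ancestor."""
    def ancestors(x):
        # Map each ancestor of x (including x) to its distance from x.
        dists, d = {}, 0
        while x is not None:
            dists[x] = d
            x = parent.get(x)
            d += 1
        return dists

    anc_u = ancestors(u)
    # Walk up from v until we hit an ancestor of u (the LCA).
    d_v = 0
    while v not in anc_u:
        v = parent[v]
        d_v += 1
    return anc_u[v] + d_v

# Toy hierarchy in the spirit of Fig. 1a: corgi and chihuahua share the
# parent "dog", so they are closer to each other than either is to panda.
parent = {"corgi": "dog", "chihuahua": "dog", "panda": "bear",
          "dog": "root", "bear": "root", "root": None}
tree_metric(parent, "corgi", "chihuahua")  # -> 2
tree_metric(parent, "corgi", "panda")      # -> 4
```

For weighted edges, the same walk would accumulate d(child, parent) instead of counting unit steps.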

