DEEP REPULSIVE CLUSTERING OF ORDERED DATA BASED ON ORDER-IDENTITY DECOMPOSITION

Abstract

We propose the deep repulsive clustering (DRC) algorithm for ordered data to achieve effective order learning. First, we develop the order-identity decomposition (ORID) network to divide the information in an object instance into an order-related feature and an identity feature. Then, we group object instances into clusters according to their identity features using a repulsive term. Moreover, we estimate the rank of a test instance by comparing it with references within the same cluster. Experimental results on facial age estimation, aesthetic score regression, and historical color image classification show that the proposed algorithm clusters ordered data effectively and also yields excellent rank estimation performance.

1. INTRODUCTION

There are various types of 'ordered' data. For instance, in facial age estimation (Ricanek & Tesafaye, 2006), face photos are ranked according to age. Also, in a video-sharing platform, videos can be sorted according to the numbers of views or likes. In such ordered data, the classes, representing ranks or preferences, form an ordered set (Schröder, 2003). Attempts have been made to estimate the classes of objects, including multi-class classification (Pan et al., 2018), ordinal regression (Frank & Hall, 2001), and metric regression (Fu & Huang, 2008). Recently, a new approach, called order learning (Lim et al., 2020), was proposed to solve this problem. Order learning is based on the idea that it is easier to predict the ordering relationship between objects than to estimate their absolute classes (or ranks); telling the older one between two people is easier than estimating their exact ages. Hence, in order learning, the pairwise ordering relationship is learned from training data. Then, the rank of a test object is estimated by comparing it with reference objects with known ranks. However, some objects cannot be easily compared. It is less easy to tell the older one between people of different genders than between those of the same gender. Lim et al. (2020) tried to deal with this issue by dividing an ordered dataset into disjoint chains. However, the chains were not clearly separated, and no meaningful properties were discovered from them. In this paper, we propose a reliable clustering algorithm, called deep repulsive clustering (DRC), of ordered data based on order-identity decomposition (ORID). Figure 1 shows a clustering example of ordered data. Note that some characteristics of objects, such as genders or races in age estimation, are not related to their ranks, and the ranks of objects sharing such characteristics can be compared more reliably.
To discover such characteristics without any supervision, the proposed ORID network decomposes the information of an object instance into an order-related feature and an identity feature unrelated to the rank. Then, the proposed DRC algorithm clusters object instances using their identity features; within each cluster, the instances share similar identity features. Furthermore, given a test instance, we decide its cluster based on the nearest neighbor (NN) rule, and compare it with reference instances within the cluster to estimate its rank. To this end, we develop a maximum a posteriori (MAP) estimation rule. Experimental results on ordered data for facial age estimation, aesthetic score regression (Kong et al., 2016), and historical color image classification (Palermo et al., 2012) demonstrate that the proposed algorithm separates ordered data clearly into meaningful clusters and provides excellent rank estimation performance for unseen test instances. The contributions of this paper can be summarized as follows.
• We first propose the notion of identity features of ordered data and develop the ORID network for order-identity decomposition.
• We develop the DRC algorithm to cluster data on a unit sphere effectively using a repulsive term. We also prove the local optimality of the solution.
• We propose the MAP decision rule for rank estimation. The proposed algorithm provides state-of-the-art performance for facial age estimation and aesthetic score regression.

2. RELATED WORK

2.1. ORDER LEARNING

The notion of order learning was first proposed by Lim et al. (2020). It aims to determine the order graph of classes and classify an object into one of the classes. In practice, it trains a pairwise comparator, which is a ternary classifier, to categorize the relationship between two objects into one of three cases: one object is bigger than, similar to, or smaller than the other. Then, it estimates the rank of a test object by comparing it with reference objects with known ranks. However, not every pair of objects is easily comparable. Although Lim et al. (2020) attempted to group objects into clusters, in which objects could be compared more accurately, their clustering results were unreliable. Pairwise comparison has been used to estimate object ranks, because relative evaluation is generally easier than absolute evaluation. Saaty (1977) proposed the scaling method to estimate absolute priorities from relative priorities, which has been applied to various decision processes, including aesthetic score regression (Lee & Kim, 2019). Also, some learning to rank (LTR) algorithms are based on pairwise comparison (Liu, 2009; Cohen et al., 1998; Burges et al., 2005; Tsai et al., 2007). Order learning attempts to combine (possibly inconsistent) pairwise ordering results to determine the rank of each object. Thus, it is closely related to the LTR algorithm of Cohen et al. (1998), which learns a pairwise preference function and obtains a total order of a set to maximize agreements among preference judgments of pairs of elements. Also, order learning is related to rank aggregation (Dwork et al., 2001), in which partially ordered sets are combined into a linearly ordered set to achieve the maximum consensus among those partial sets. Rank aggregation has been studied in various fields (Brüggemann et al., 2004). Since optimal aggregation is NP-hard, Dwork et al. (2001) proposed an approximate algorithm, called Markov chain ordering.
There are many other approximate schemes, such as the local Kemenization, Borda count, and scaled footrule aggregation.

2.2. CLUSTERING

Data clustering is the fundamental problem of partitioning data into disjoint groups, such that elements in the same group are similar to one another while elements from different groups are dissimilar. Although various clustering algorithms have been proposed (Hartigan & Wong, 1979; Ester et al., 1996; Kohonen, 1990; Dhillon & Modha, 2001; Reynolds, 2009), conventional algorithms often yield poor performance on high-dimensional data due to the curse of dimensionality and the ineffectiveness of similarity metrics. Dimensionality reduction and feature transform methods have been studied to map raw data into a new feature space, in which they are more easily separated. Linear transforms, such as PCA (Wold et al., 1987), and non-linear transforms, including kernel methods (Hofmann et al., 2008) and spectral clustering (Ng et al., 2002), have been proposed. Recently, deep neural networks have been adopted effectively as feature embedding functions (LeCun et al., 2015), and these deep-learning-based feature embedding functions have been combined with classical clustering algorithms. For instance, Caron et al. (2018) proposed a deep clustering algorithm based on k-means. It clusters features from a neural network and then trains the network using the cluster assignments as pseudo-labels; this process is repeated iteratively. Also, Yang et al. (2016) jointly learned feature representations and clustered images based on agglomerative clustering. Chang et al. (2017) recast the image clustering task into a binary classification problem to predict whether a pair of images belong to the same cluster or different clusters. Similarly to these algorithms, we use a neural network to determine a feature space in which clustering is done more effectively. However, we consider the clustering of ordered data, and each cluster should consist of elements whose ranks can be compared more accurately.
There are conventional approaches that use clustering ideas to aid classification or rank estimation. For example, Yan et al. (2015) developed a hierarchical classifier, which clusters fine categories into coarse category groups and classifies an object into a fine category within its coarse category group. For extreme multiclass classification, Daumé III et al. (2017) proposed to predict a class label among only candidate classes, which are dynamically selected by the recall tree. Note, however, that the leaves of the recall tree do not partition the set of classes. Also, for age estimation, Li et al. (2019) proposed a tree-like structure, called the bridge-tree, to divide data into overlapping age groups and train a local regressor for each group. The set of local regressors can be more accurate than a global regressor dealing with the entire age range. Whereas these conventional approaches group data in the label dimension to perform their tasks more effectively, the proposed algorithm clusters data in the dimension orthogonal to the label dimension. In other words, we cluster data using identity features, instead of order features.

3.1. PROBLEM DEFINITION

An order is a binary relation, often denoted by ≤, on a set Θ = {θ_1, θ_2, ..., θ_m} (Schröder, 2003). It should satisfy the three properties of reflexivity (θ_i ≤ θ_i for all i), antisymmetry (θ_i ≤ θ_j and θ_j ≤ θ_i imply θ_i = θ_j), and transitivity (θ_i ≤ θ_j and θ_j ≤ θ_k imply θ_i ≤ θ_k). Then, Θ is called a partially ordered set. Furthermore, if every pair of elements is comparable (θ_i ≤ θ_j or θ_j ≤ θ_i for all i, j), Θ is called a chain or linearly ordered set. An order describes ranks or priorities of classes. For example, in age estimation, θ_i may represent the age class of i-year-olds. Then, θ_14 ≤ θ_49 represents that 14-year-olds are younger than 49-year-olds. As mentioned previously, it is less easy to tell the older one between people of different genders. Hence, an algorithm may compare a subject with reference subjects of the same gender only. In such a case, each age class θ_i consists of two subclasses θ_i^female and θ_i^male of different types, and the algorithm compares only subjects of the same type. Lim et al. (2020) assumed that subclasses of different types are incomparable and thus that the set of subclasses is the union of k disjoint chains, where k is the number of types. However, in many ranking applications, objects of different types can be compared (although less easily than those of the same type). Thus, instead of assuming incomparability across chains, we assume that there is a total order on Θ = {θ_1, θ_2, ..., θ_m}, in which each class θ_i consists of k types of subclasses, and that object instances of the same type are more easily compared than those of different types. Suppose that n training instances in X = {x_1, x_2, ..., x_n} are given. Also, suppose that there are m ranks and the ground-truth rank of each instance is known. In this sense, X contains ordered data. The problem is twofold.
The first goal is to decompose the whole set of instances X into k disjoint clusters {C_j}_{j=1}^k, in which instances are more easily compared:

    X = ∪_{j=1}^k C_j    (1)

where C_i ∩ C_j = ∅ for i ≠ j. In other words, we aim to partition the ordered data in X into k clusters by grouping them according to characteristics unrelated to their ranks. These characteristics, which tend to remain the same even when an object experiences rank changes, are referred to as 'identity' features in this work. For example, in age estimation, genders or races can be identity features. However, we perform the clustering without any supervision for identity features. Notice that instances within a cluster can be compared more easily than those across clusters, since they have similar identity features. The number k of clusters is assumed to be known a priori. Impacts of k on the clustering performance are discussed in Appendix B.7. The second goal is to assign an unseen test instance to one of the clusters and determine its rank by comparing it with reference instances within the cluster. To achieve these goals, we propose the ORID network and the DRC algorithm.

3.2. ORDER-IDENTITY DECOMPOSITION

In general, object instances can be compared more easily when they have more similar identity features irrelevant to order. Therefore, we decompose the information of each object instance into an order feature and an identity feature. To this end, we propose the ORID network in Figure 2, composed of three parts: autoencoder, discriminator, and comparator.

1) Autoencoder: Similarly to the deep clustering algorithms in (Yang et al., 2017; Dizaji et al., 2017; Chen et al., 2017; Ji et al., 2017), we use the neural-network-based autoencoder G(F(·)) to extract feature vectors. The encoder h^x = F(x) maps an input vector x to a feature vector h^x, while the decoder x̂ = G(h^x) reconstructs x from h^x. By minimizing the reconstruction loss ‖x − x̂‖_1, F is trained to represent x compactly with as little loss of information as possible. We decompose the overall feature h^x ∈ R^{d_or + d_id} into the order feature h^x_or and the identity feature h^x_id, given by

    h^x_or = [h^x_1, h^x_2, ..., h^x_{d_or}]^t    (2)
    h^x_id = [h^x_{d_or+1}, h^x_{d_or+2}, ..., h^x_{d_or+d_id}]^t / ‖[h^x_{d_or+1}, h^x_{d_or+2}, ..., h^x_{d_or+d_id}]‖    (3)

where d_or and d_id are the dimensions of h^x_or and h^x_id, respectively. However, without additional control, the output h^x of the neural network F would be highly entangled (Higgins et al., 2018). To put order-related information together into h^x_or, we employ the comparator.

2) Comparator: Using the order features h^x_or and h^y_or of a pair of instances x and y, we train the comparator, which classifies their ordering relationship into one of the three categories 'bigger,' 'similar,' and 'smaller':

    x ≻ y if θ(x) − θ(y) > τ,
    x ≈ y if |θ(x) − θ(y)| ≤ τ,    (4)
    x ≺ y if θ(x) − θ(y) < −τ,

where θ(·) denotes the class of an instance. As in (Lim et al., 2020), '≻, ≈, ≺' represent the ordering relationship between instances, while '>, =, <' denote the mathematical order between classes. The comparator outputs the softmax probability vector p^{xy} = (p^{xy}_≻, p^{xy}_≈, p^{xy}_≺).
It is trained to minimize the cross-entropy between p^{xy} and the ground-truth one-hot vector q^{xy} = (q^{xy}_≻, q^{xy}_≈, q^{xy}_≺). Because it is trained jointly with the autoencoder, the information deciding the ordering relationship tends to be encoded into the order features h^x_or and h^y_or. On the other hand, the remaining information necessary for the reconstruction of x̂ and ŷ is encoded into the identity features h^x_id and h^y_id.

3) Discriminator: We adopt the discriminator D, which tells real images from synthesized images generated by the decoder G. Using the GAN loss (Goodfellow et al., 2014), the discriminator helps the decoder reconstruct more realistic outputs x̂ and ŷ. Appendix A provides detailed network structures of these components of ORID.
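To make Eqs. (2)-(4) concrete, the feature split and the ternary ground-truth labeling can be sketched in NumPy as follows. This is an illustrative sketch only: `decompose` and `order_label` are hypothetical names, the default τ is a placeholder, and the actual encoder F is a CNN (Appendix A), from which only the dimensions d_or = 128 and d_id = 896 are taken.

```python
import numpy as np

D_OR, D_ID = 128, 896  # feature dimensions d_or and d_id (Appendix A)

def decompose(h):
    """Split an encoder output h into the order feature (Eq. 2)
    and the l2-normalized identity feature (Eq. 3)."""
    h_or = h[:D_OR]                              # first d_or entries
    h_id_raw = h[D_OR:D_OR + D_ID]               # remaining d_id entries
    h_id = h_id_raw / np.linalg.norm(h_id_raw)   # project onto the unit sphere
    return h_or, h_id

def order_label(theta_x, theta_y, tau=0.1):
    """Ground-truth ternary ordering relationship of Eq. (4);
    tau is a placeholder threshold, not the paper's setting."""
    d = theta_x - theta_y
    if d > tau:
        return "bigger"    # x succeeds y
    if d < -tau:
        return "smaller"   # x precedes y
    return "similar"
```

The l2 normalization in `decompose` is what places the identity features on the unit sphere, which Section 3.3 relies on for the cosine-similarity clustering.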

3.3. DEEP REPULSIVE CLUSTERING

After obtaining the identity features h^{x_1}_id, h^{x_2}_id, ..., h^{x_n}_id of all instances x_i ∈ X, we partition them into k clusters. Each cluster should contain instances that are more easily comparable to one another. The identity features are normalized in Eq. (3) and lie on the unit sphere in R^{d_id}. In other words, we cluster data points on the unit sphere, so the cosine similarity is a natural affinity metric. Let C_j, 1 ≤ j ≤ k, denote the k clusters. Also, let c_j, constrained to be on the unit sphere, denote the 'centroid' or representative vector for the instances in cluster C_j. We define the quality of cluster C_j as

    Σ_{x∈C_j} [ (h^x_id)^t c_j − α (1/(k−1)) Σ_{l≠j} (h^x_id)^t c_l ]    (5)

where the first term is the similarity of an instance in C_j to the centroid c_j, the second term with the negative sign quantifies the average dissimilarity of the instance from the other centroids, and α is a nonnegative weight. For a high-quality cluster, instances should be concentrated around the centroid and be far from the other clusters. The second term is referred to as the repulsive term, as its objective is similar to that of the repulsive rule in (Lee et al., 2015). Although conventional methods also try to increase inter-cluster dissimilarity (Ward Jr, 1963; Lee et al., 2015), to the best of our knowledge, DRC is the first attempt to use an explicit repulsive term in deep clustering, which jointly optimizes clustering and feature embedding. Next, we measure the overall quality of the clustering by

    J({C_j}_{j=1}^k, {c_j}_{j=1}^k) = Σ_{j=1}^k Σ_{x∈C_j} [ (h^x_id)^t c_j − α (1/(k−1)) Σ_{l≠j} (h^x_id)^t c_l ].    (6)

We aim to find the optimum clusters to maximize this objective function J, yet finding the global optimum is NP-complete (Kleinberg et al., 1998; Garey et al., 1982). Hence, we propose an iterative algorithm, called DRC, to find a local optimum, as in the k-means algorithm (Gersho & Gray, 1991). 1.
Centroid rule: After fixing the clusters {C_j}_{j=1}^k, we update the centroids {c_j}_{j=1}^k to maximize J in Eq. (6). Because the centroids should lie on the unit sphere, we solve the constrained optimization problem

    maximize J({c_j}_{j=1}^k) subject to c_j^t c_j = 1 for all j = 1, ..., k.    (7)

Using Lagrangian multipliers (Bertsekas, 1996), the optimal centroids are obtained as

    c_j = ( Σ_{x∈C_j} h^x_id − α (1/(k−1)) Σ_{x∈X\C_j} h^x_id ) / ‖ Σ_{x∈C_j} h^x_id − α (1/(k−1)) Σ_{x∈X\C_j} h^x_id ‖.    (8)

2. NN rule: On the other hand, after fixing the centroids, we update the membership of each instance to maximize J in Eq. (6). The optimal cluster C_j is given by

    C_j = { x | (h^x_id)^t c_j ≥ (h^x_id)^t c_l for all 1 ≤ l ≤ k }.    (9)

In other words, an instance should be assigned to C_j if its nearest centroid is c_j. We apply the centroid rule and the NN rule iteratively until convergence. Because both rules monotonically increase the same objective function J and the inequality J ≤ n + α n/(k−1) always holds, J is guaranteed to converge to a local maximum. Readers interested in the convergence are referred to (Sabin & Gray, 1986; Pollard, 1982). Without the repulsive term in Eq. (6) (i.e., at α = 0), centroid c_j in Eq. (8) is updated by

    c_j = Σ_{x∈C_j} h^x_id / ‖ Σ_{x∈C_j} h^x_id ‖,    (10)

as done in the spherical k-means (Dhillon & Modha, 2001). In contrast, with a positive α, the objective function J increases when the centroids are far from one another. Ideally, in equilibrium, the centroid direction of a cluster should be the opposite of the mean direction of all the other clusters;

    ( Σ_{x∈C_j} h^x_id / ‖ Σ_{x∈C_j} h^x_id ‖ )^t ( Σ_{x∈X\C_j} h^x_id / ‖ Σ_{x∈X\C_j} h^x_id ‖ ) = −1 for all j = 1, 2, ..., k.    (11)

Note that the ORID network, and thus the encoded feature space, are trained jointly with the repulsive clustering. As the training goes on, the centroids repel one another, and the clusters are separated more clearly due to the repulsive term. We jointly optimize the clusters and the ORID network parameters, as described in Algorithm 1.
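One alternation of the centroid rule in Eq. (8) and the NN rule in Eq. (9) can be sketched in NumPy as follows. Random unit vectors stand in for the ORID identity features here; in the actual algorithm, the features themselves are retrained jointly with the clustering.

```python
import numpy as np

def drc_step(H, labels, k, alpha=0.1):
    """One DRC iteration on unit-norm identity features H (n x d)."""
    n, d = H.shape
    total = H.sum(axis=0)
    # Centroid rule (Eq. 8): attract own members, repel the rest.
    C = np.zeros((k, d))
    for j in range(k):
        s_in = H[labels == j].sum(axis=0)
        s_out = total - s_in                   # sum over X \ C_j
        v = s_in - alpha / (k - 1) * s_out
        C[j] = v / np.linalg.norm(v)           # project back onto the unit sphere
    # NN rule (Eq. 9): assign each instance to its most similar centroid.
    labels = np.argmax(H @ C.T, axis=1)
    return labels, C

rng = np.random.default_rng(0)
H = rng.normal(size=(200, 8))
H /= np.linalg.norm(H, axis=1, keepdims=True)  # features lie on the unit sphere
labels = rng.integers(0, 2, size=200)
for _ in range(10):
    labels, C = drc_step(H, labels, k=2)
```

Both rules only ever increase the bounded objective J of Eq. (6), which is why the alternation converges to a local maximum; at alpha=0 the centroid update reduces to the spherical k-means rule of Eq. (10).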
First, we train the ORID network for warm-up epochs, employing every pair of instances x and y as input. Then, using the identity features, we partition the input data into k clusters via k-means. Second, we repeat the fine-tuning of the ORID network and the repulsive clustering alternately. In the fine-tuning, a pair of instances x and y are constrained to be from the same cluster, and the following loss function is employed:

    ℓ = λ_rec ℓ_rec + λ_clu ℓ_clu + λ_com ℓ_com + λ_gan ℓ_gan.    (12)

Algorithm 1 DRC-ORID
  Train the ORID network for warm-up epochs
  Partition the data into k clusters by k-means on the identity features
  repeat
    Fine-tune the ORID network with pairs from the same cluster
    repeat
      for all j = 1, 2, ..., k do
        Update centroid c_j via Eq. (8)        ▷ Centroid rule
      end for
      for all j = 1, 2, ..., k do
        Update cluster C_j via Eq. (9)         ▷ NN rule
      end for
    until convergence or a predefined number of iterations
  until a predefined number of epochs
  Output: Clusters {C_j}_{j=1}^k, centroids {c_j}_{j=1}^k, ORID network

3.4. RANK ESTIMATION

Using the output of the DRC-ORID algorithm, we can estimate the rank of an unseen test instance x. First, we extract its identity feature h^x_id using the ORID encoder. By comparing h^x_id with the centroids {c_j}_{j=1}^k based on the NN rule, we find the most similar centroid c_l. Then, x is declared to belong to cluster C_l. Without loss of generality, let us assume that the classes (or ranks) are the first m natural numbers, Θ = {1, 2, ..., m}. Then, for each i ∈ Θ, we select a reference instance y_i with rank i from cluster C_l, so that it is the most similar to x. Specifically,

    y_i = arg max_{y∈C_l : θ(y)=i} (h^x_id)^t h^y_id.    (13)

We estimate the rank θ(x) of the test instance x by comparing it with the chosen references y_i, 1 ≤ i ≤ m. For the rank estimation, Lim et al. (2020) developed the maximum consistency rule, which however does not exploit the probability information generated by the comparator. In this paper, we use the maximum a posteriori (MAP) estimation rule, which is described in detail in Appendix B.10.
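The test-time procedure (cluster assignment by the NN rule, one reference per rank as in Eq. (13), and comparison against those references) can be sketched as follows. The actual MAP rule is given in Appendix B.10; here a simple consistency-counting score stands in for it, and `compare` is an abstract stand-in for the trained comparator.

```python
import numpy as np

def estimate_rank(h_id, centroids, refs, compare):
    """Assign a test instance to its nearest cluster, then estimate its rank
    by comparing it with one reference per rank from that cluster.
    refs[l] maps rank i -> feature of the chosen reference y_i in cluster l.
    compare(h_id, ref) returns 'bigger', 'similar', or 'smaller'."""
    l = int(np.argmax(centroids @ h_id))       # NN rule over centroids
    votes = {}
    for i, y in refs[l].items():
        r = compare(h_id, y)
        # consistency count standing in for the MAP rule of Appendix B.10:
        # each comparison votes for all ranks it is consistent with
        if r == "similar":
            votes[i] = votes.get(i, 0) + 1
        elif r == "bigger":
            for j in refs[l]:
                if j > i:
                    votes[j] = votes.get(j, 0) + 1
        else:  # "smaller"
            for j in refs[l]:
                if j < i:
                    votes[j] = votes.get(j, 0) + 1
    return max(votes, key=votes.get)
```

With a perfectly consistent comparator, the true rank agrees with every comparison and therefore collects the most votes; the MAP rule refines this by weighting each comparison with the comparator's softmax probabilities.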

4. EXPERIMENTAL RESULTS

This section provides various experimental results. Due to space limitation, implementation details and more results are available in Appendices C, D, and E.

4.1. FACIAL AGE ESTIMATION

Datasets: We use two datasets. First, MORPH II (Ricanek & Tesafaye, 2006) is a collection of about 55,000 facial images in the age range [16, 77]. It also provides gender (female, male) and race (African American, Asian, Caucasian, Hispanic) labels. We employ the four evaluation settings A, B, C, and D described in Appendix C.2. Second, the balanced dataset (Lim et al., 2020) is sampled from the three datasets MORPH II, AFAD (Niu et al., 2016), and UTK (Zhang et al., 2017) to overcome bias toward specific ethnic groups or genders. It contains about 6,000 images for each combination of gender in {female, male} and ethnic group in {African, Asian, European}.

Clustering: Figure 3 shows clustering results on MORPH II (setting A), when the number of clusters is k = 2. Setting A contains faces of Caucasian descent only. Thus, the proposed DRC-ORID divides those faces into two clusters according to genders in general, although the annotated gender information is not used. Most males are assigned to cluster 1, while a majority of females are assigned to cluster 2. On the other hand, setting B consists of Africans and Caucasians. Thus, those images are clustered according to races, as shown in Appendix C.3. Figure 4 shows the results on the balanced dataset at k = 3, which is composed of MORPH II, AFAD, and UTK images. Due to the different characteristics of these sources, images are clearly divided according to their sources. At k = 2, MORPH II images are separated from the others. This is because, unlike the MORPH II images, the boundaries of most AFAD and UTK images are zeroed for alignment using SeetaFaceEngine (Zhang et al., 2014). Lim et al. (2020) also tried the clustering of the balanced dataset. Figure 5 visualizes the feature space using t-SNE (Maaten & Hinton, 2008). Although their method aligns the features according to ages, their clusters are not separated, overlapping one another.
In contrast, the proposed DRC-ORID separates the three clusters clearly, as well as sorts features according to the ages within each cluster. More t-SNE plots for analyzing the impacts of the repulsive term are available in Appendix B.5.

Age transformation:

We assess the decomposition performance of ORID. Although ORID is not designed for age transformation (Or-El et al., 2020), it decomposes an image x into the order and identity features, h^x_or and h^x_id. Thus, the age can be transformed in two steps. First, we replace h^x_or of x with h^y_or of a reference image y at a target age. Second, we decode the resultant feature (the concatenation of h^y_or and h^x_id) to obtain the transformed image. Figure 6 shows some results on MORPH II images. Order-related properties, such as skin textures and hair colors, are modified plausibly, while identity information is preserved. This indicates the reliability of ORID.

Age estimation: Table 1 compares the proposed algorithm with conventional age estimators on the four evaluation settings of MORPH II. These conventional algorithms take 224 × 224 or bigger images as input, while ORID takes 64 × 64 images. Moreover, most of them adopt VGG16 (Simonyan & Zisserman, 2015) as their backbones, which is more complicated than the ORID encoder. Thus, for comparison, after fixing clusters using DRC-ORID, we train another pairwise comparator based on VGG16, whose architecture is the same as that in Lim et al. (2020). We measure the age estimation performance by the mean absolute error (MAE) and the cumulative score (CS). MAE is the average absolute error between estimated and ground-truth ages, and CS computes the percentage of test samples whose absolute errors are less than or equal to a tolerance level of 5. Mainly due to the smaller input size of 64 × 64, the vanilla version yields poorer performances than the conventional algorithms. The VGG version, however, outperforms them significantly. First, in the proposed-VGG (k = 1), all instances can be compared, as in the OL algorithm; in other words, clustering is not performed. Thus, the pairwise comparators of OL and the proposed-VGG (k = 1) are trained in the same way, but their rank estimation rules are different.
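The two evaluation metrics can be stated precisely in a few lines of NumPy (a sketch of the standard definitions; the tolerance of 5 years matches the CS level used above):

```python
import numpy as np

def mae(pred, gt):
    """Mean absolute error between estimated and ground-truth ages."""
    return float(np.mean(np.abs(np.asarray(pred) - np.asarray(gt))))

def cs(pred, gt, tol=5):
    """Cumulative score: fraction of samples with absolute error <= tol."""
    err = np.abs(np.asarray(pred) - np.asarray(gt))
    return float(np.mean(err <= tol))
```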

4.2. AESTHETIC SCORE REGRESSION

The aesthetics and attribute database (AADB) is composed of 10,000 photographs of various themes, such as scenery and close-ups (Kong et al., 2016). Each image is annotated with an aesthetic score in [0, 1]. We quantize the continuous scores with a step size of 0.01 to form 101 score classes. Compared to facial images, AADB contains more diverse data; it is hence more challenging to cluster AADB images. Figure 8 shows example images in each cluster at k = 8. Images in the same cluster have similar colors, contents, or composition. This means that ORID extracts identity features effectively, corresponding to contents or styles that are not directly related to aesthetic scores. Using those identity features, DRC discovers meaningful clusters. Figure 9 visualizes the feature space of AADB. Aesthetic scores are sorted along one direction, while clusters are separated in the orthogonal direction. In other words, the scores look like latitudes, while the clusters appear to be separated by meridians (lines of longitude). As a point on the earth's surface can be located by its latitude and longitude, an image is represented by its aesthetic score (order feature) and cluster (identity feature). Table 2 compares regression results. Even without the clustering process, the proposed algorithm outperforms the Reg-Net and ASM algorithms. Moreover, by using the eight unsupervised clusters in Figure 8, the proposed algorithm further reduces the MAE to yield the state-of-the-art result.

5. IMPACTS OF APPLICATIONS

The proposed algorithm can be applied to various ranking problems. In this paper, we demonstrated three vision applications: facial age estimation, aesthetic score regression, and historical color image classification. In particular, the proposed age estimator has various potential uses. For example, it can block or recommend media contents to people according to their ages. However, it may have harmful impacts as well as positive ones. Moreover, although age information lacks the distinctiveness to identify an individual, identity features extracted by ORID can be misused in facial recognition systems, causing serious problems such as unwanted invasion of privacy (Raji et al., 2020). Hence, ethical considerations should be made before the use of the proposed algorithm. Recently, ethical concerns about the fairness and safety of automated systems have been raised (Castelvecchi, 2020; Roussi, 2020; Noorden, 2020). In particular, due to the intrinsic imbalance of facial datasets (Ricanek & Tesafaye, 2006; Zhang et al., 2017; Niu et al., 2016), most deep learning methods for facial analysis (Wen et al., 2020; Or-El et al., 2020) have unwanted gender or racial bias. The proposed algorithm is not free from this bias either. Hence, before any practical usage, the bias should be resolved. Also, even though the proposed algorithm groups data in an unsupervised manner, data are clustered according to genders or races on MORPH II. These results should never be misinterpreted in such a way as to encourage any racial or gender discrimination. We recommend using the proposed age estimator for research only.

6. CONCLUSIONS

The DRC algorithm for ordered data based on ORID was proposed in this work. First, the ORID network decomposes the information of an object into the order and identity features. Then, DRC groups objects into clusters using their identity features in a repulsive manner. Also, we can estimate the rank of an unseen test instance by comparing it with references within the corresponding cluster based on the MAP decision rule. Extensive experimental results on various ordered data demonstrated that the proposed algorithm provides excellent clustering and rank estimation performance.

A NETWORK STRUCTURE OF ORID

As described in Section 3.2, the ORID network consists of the encoder F, the decoder G, the comparator C, and the discriminator D. The network structures of these components are detailed in Tables 4-7, where 'k_h×k_w-s-c Conv' and 'k_h×k_w-s-c Deconv' denote the 2D convolution and 2D deconvolution with kernel size k_h×k_w, stride s, and c output channels, respectively. 'BN' means batch normalization (Ioffe & Szegedy, 2015), and 'c Dense' is a dense layer with c output channels. Note that the encoder takes a 64 × 64 RGB image as input, and the identity feature of the encoder output is l2-normalized in Eq. (3). Also, we set d_or = 128 and d_id = 896.

B.1 DERIVATION OF CENTROID RULE

To solve the constrained optimization problem in Eq. (7), we construct the Lagrangian function

    L = Σ_{j=1}^k Σ_{x∈C_j} [ (h^x_id)^t c_j − α (1/(k−1)) Σ_{l≠j} (h^x_id)^t c_l ] − Σ_{j=1}^k λ_j (c_j^t c_j − 1)    (14)

where λ_j, 1 ≤ j ≤ k, are Lagrangian multipliers (Bertsekas, 1996). By differentiating L with respect to c_j and setting it to zero, we have

    ∂L/∂c_j = Σ_{x∈C_j} h^x_id − α (1/(k−1)) Σ_{l≠j} Σ_{x∈C_l} h^x_id − 2λ_j c_j    (15)
            = Σ_{x∈C_j} h^x_id − α (1/(k−1)) Σ_{x∈X\C_j} h^x_id − 2λ_j c_j    (16)
            = 0    (17)

for j = 1, ..., k. Therefore, the optimal centroid c_j is given by

    c_j = ( Σ_{x∈C_j} h^x_id − α (1/(k−1)) Σ_{x∈X\C_j} h^x_id ) / (2λ_j).    (18)

Because of the normalization constraint c_j^t c_j = 1, we have 2λ_j = ‖ Σ_{x∈C_j} h^x_id − α (1/(k−1)) Σ_{x∈X\C_j} h^x_id ‖, which leads to the centroid rule in Eq. (8).

B.2 OPTIMALITY OF NN RULE

Let us consider two cases. First, instance x is declared to belong to cluster C_j. It then contributes to the objective function J in Eq. (6) by

    β_j = (h^x_id)^t c_j − α (1/(k−1)) Σ_{l≠j} (h^x_id)^t c_l.    (21)

Second, x is declared to belong to another cluster C_{j'}. Then, its contribution is

    β_{j'} = (h^x_id)^t c_{j'} − α (1/(k−1)) Σ_{l≠j'} (h^x_id)^t c_l.

By comparing the two contributions, we have

    β_j − β_{j'} = (h^x_id)^t (c_j − c_{j'}) − α (1/(k−1)) (h^x_id)^t (c_{j'} − c_j)    (22)
                 = (1 + α (1/(k−1))) (h^x_id)^t (c_j − c_{j'}).    (23)

This means that β_j ≥ β_{j'} when (h^x_id)^t c_j ≥ (h^x_id)^t c_{j'}. Therefore, x should be assigned to the optimal cluster C_{j*} such that the cosine similarity (h^x_id)^t c_{j*} is maximized. Equivalently, we have the NN rule in Eq. (9).

B.3 REGULARIZATION CONSTRAINT IN DRC

To prevent empty clusters and balance the partitioning, we enforce a regularization constraint so that every cluster contains at least a predefined number of instances. More specifically, when applying the NN rule, we enforce that at least a fraction 1/(2k) of the instances are assigned to each cluster C_j. The instances are selected in decreasing order of the cosine similarity (h^x_id)^t c_j.
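A sketch of this constrained NN rule in NumPy: each cluster first claims its quota of most similar instances, and the remainder follow the plain NN rule. The greedy per-cluster claiming order is an assumption for illustration; the text above does not specify the exact mechanism.

```python
import numpy as np

def balanced_nn_assign(H, C):
    """NN rule with the regularization constraint: each of the k clusters
    first claims its n/(2k) most similar unassigned instances; the rest
    follow the plain NN rule of Eq. (9).
    H: (n, d) unit-norm identity features; C: (k, d) unit-norm centroids."""
    n, k = H.shape[0], C.shape[0]
    sim = H @ C.T                        # cosine similarities, shape (n, k)
    labels = np.full(n, -1)
    quota = n // (2 * k)                 # minimum cluster size: 1/(2k) of instances
    for j in range(k):
        free = np.flatnonzero(labels == -1)
        # take the quota unassigned instances most similar to centroid j
        top = free[np.argsort(-sim[free, j])[:quota]]
        labels[top] = j
    rest = np.flatnonzero(labels == -1)
    labels[rest] = np.argmax(sim[rest], axis=1)   # plain NN rule for the rest
    return labels
```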

B.4 LOSS FUNCTIONS

In the DRC-ORID algorithm, we use the loss function

    ℓ = λ_rec ℓ_rec + λ_clu ℓ_clu + λ_com ℓ_com + λ_gan ℓ_gan

where the reconstruction, clustering, comparator, and GAN losses are given by

    ℓ_rec = (1/2N) Σ_{i=1}^N ( ‖x_i − G(F(x_i))‖_1 + ‖y_i − G(F(y_i))‖_1 ),
    ℓ_clu = −(1/2N) Σ_{i=1}^N ( (h^{x_i}_id)^t c_j + (h^{y_i}_id)^t c_j ),
    ℓ_com = −(1/N) Σ_{i=1}^N ( q^{x_i y_i}_≻ log p^{x_i y_i}_≻ + q^{x_i y_i}_≈ log p^{x_i y_i}_≈ + q^{x_i y_i}_≺ log p^{x_i y_i}_≺ ),
    ℓ_gan = −(1/2N) Σ_{i=1}^N ( log(1 − D(G(F(x_i)))) + log(1 − D(G(F(y_i)))) ),

respectively. Here, N is the number of image pairs in a minibatch. The weighting parameters are set to λ_rec = 5, λ_clu = 0.1, λ_com = 1, and λ_gan = 1.
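The comparator cross-entropy term and the weighted combination can be sketched as follows (the network-dependent terms ℓ_rec, ℓ_clu, and ℓ_gan are passed in as precomputed scalars in this sketch):

```python
import numpy as np

LAMBDAS = {"rec": 5.0, "clu": 0.1, "com": 1.0, "gan": 1.0}  # weights from above

def comparator_loss(q, p, eps=1e-12):
    """Cross-entropy between one-hot ordering labels q and softmax outputs p,
    averaged over the N pairs in a minibatch (the l_com term)."""
    return float(-np.mean(np.sum(q * np.log(p + eps), axis=1)))

def total_loss(losses):
    """Weighted sum l = sum_t lambda_t * l_t over the four loss terms."""
    return sum(LAMBDAS[t] * losses[t] for t in LAMBDAS)
```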

B.5 IMPACTS OF REPULSIVE TERM ON CLUSTERING

To analyze the impacts of the repulsive term in Eq. (6), we first compare clustering qualities with α = 0 and α = 0.1. At α = 0, the repulsive term is excluded from the objective function J, and the centroid rule reduces to Eq. (10) in the spherical k-means (Dhillon & Modha, 2001). However, unlike in the spherical k-means, even at α = 0, the clustering is performed jointly with the training of the ORID network. We adopt two metrics to measure the quality of clustering: normalized mutual information (NMI) (Strehl & Ghosh, 2002) and centroid affinity (CA). NMI measures the information shared between two different partitionings A = ∪_{i=1}^{U} A_i and B = ∪_{j=1}^{V} B_j of the same data,

$$\mathrm{NMI}(A, B) = \frac{\sum_{i=1}^{U} \sum_{j=1}^{V} |A_i \cap B_j| \log \frac{N |A_i \cap B_j|}{|A_i| |B_j|}}{\sqrt{\big(\sum_{i=1}^{U} |A_i| \log \frac{|A_i|}{N}\big)\big(\sum_{j=1}^{V} |B_j| \log \frac{|B_j|}{N}\big)}}$$

where U and V are the numbers of clusters in A and B, respectively, N is the total number of samples, and |·| denotes the cardinality. Also, we define the centroid affinity (CA) as

$$\mathrm{CA}(\{c_j\}_{j=1}^{k}) = \frac{2}{k(k-1)} \sum_{j=1}^{k} \sum_{l > j} c_j^t c_l.$$

For high-quality clustering, the centroids should be far from one another and thus should yield a low CA score.

Figure 10 plots how NMI and CA vary as the iteration goes on. In this test, MORPH II (setting B) is used, and the number of clusters k is set to 2. Since setting B consists of Africans and Caucasians, we use the race groups as the ground-truth partitioning for the NMI measurement. At early iterations, the NMI score of DRC-ORID with α = 0.1 is slightly better than that with α = 0. However, as the iterative training and clustering go on, the score gap gets larger. After convergence, DRC-ORID with α = 0.1 outperforms the option α = 0 by a significant NMI gap of 0.13. Also, the CA of the option α = 0.1 gradually decreases, whereas that of α = 0 does not. At α = 0.1, the repulsive term makes the centroids repel each other. As a result, CA, which is the cosine similarity between the two centroids, becomes almost -1, which means that the equilibrium state in Eq. (11) is almost achieved. We also visualize the feature spaces of the two options, α = 0 and α = 0.1, using t-SNE in Figure 11. It is observed that the two clusters are separated more clearly by DRC-ORID with α = 0.1. Figure 12 shows the t-SNE results after convergence with age labels.

Figure 13 compares the NMI curves at different α's. The choice of α affects the quality of clustering, since α controls the intensity of the repulsive force between centroids. When α is too large, the centroids move too quickly, making the training of the ORID network difficult. On the other hand, when α is too small, the repulsive term does not affect the clustering meaningfully. Hence, α should be selected to strike a balance between training reliability and effective repulsion. We found experimentally that clustering is performed well around α = 0.1. Finally, it is worth pointing out that, if the identity features were not normalized as in Eq. (3) and the repulsive clustering were performed in an unbounded space, the distances between centroids would grow without bound as the iteration goes on. Thus, convergence would not be achieved. This is why we perform DRC on the bounded unit sphere.
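The two clustering-quality metrics, NMI and CA, are straightforward to compute. The sketch below follows the definitions in this section (function names are ours; the NMI normalization is assumed to be the geometric-mean form of Strehl & Ghosh, 2002):

```python
import numpy as np

def nmi(labels_a, labels_b):
    """Normalized mutual information between two partitionings of the same data."""
    a_ids, b_ids = np.unique(labels_a), np.unique(labels_b)
    n = len(labels_a)
    num = 0.0
    for i in a_ids:
        for j in b_ids:
            nij = np.sum((labels_a == i) & (labels_b == j))
            if nij > 0:
                num += nij * np.log(n * nij / (np.sum(labels_a == i) * np.sum(labels_b == j)))
    ha = sum(np.sum(labels_a == i) * np.log(np.sum(labels_a == i) / n) for i in a_ids)
    hb = sum(np.sum(labels_b == j) * np.log(np.sum(labels_b == j) / n) for j in b_ids)
    return num / np.sqrt(ha * hb)

def centroid_affinity(C):
    """CA: average pairwise cosine similarity between unit-norm centroids."""
    k = len(C)
    return 2.0 / (k * (k - 1)) * sum(C[j] @ C[l] for j in range(k) for l in range(j + 1, k))
```

For example, two identical partitionings give NMI = 1, and two antipodal centroids on the unit sphere give CA = -1, the equilibrium state at k = 2.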

B.6 IMPACTS OF REPULSIVE TERM ON RANK ESTIMATION

Table 8 compares the rank estimation results when the clustering is performed with and without the repulsive term. In this experiment, we use MORPH II (setting A) and set k = 2. Without the repulsive term, lower-quality clusters make the training of the comparator more difficult. As a result, the age estimation performance degrades significantly in terms of both MAE and CS. In other words, the quality of clustering greatly affects the rank estimation performance, and the proposed DRC algorithm provides high-quality clusters suitable for rank estimation.

B.8 CLUSTERING USING DIFFERENT FEATURES

Figure 14 compares the clustering results using different features. When using order features or whole features, instances are divided by their ages. We see that instances younger than 30 mostly belong to cluster 1 and the others to cluster 2. Table 11 compares the performances of the age estimators trained using these clustering results. The best performance is achieved when the clustering is done on identity features.

B.9 RELIABILITY OF FEATURE DECOMPOSITION

Performing the comparison using order features only does not theoretically guarantee that order-related information is fully excluded from identity features. However, we observed empirically that the decomposition is sufficiently reliable if the dimension of an identity feature is selected properly. If the dimension is too small, the encoder may lose a significant portion of order-irrelevant information. On the contrary, if the dimension is too large, the encoder may encode order information redundantly. In our experiments, we use 128- and 896-dimensional vectors for order and identity features (d_or = 128 and d_id = 896), and obtain satisfactory decomposition results.

To show that order-related information is excluded from identity features, we compare the accuracies of the comparator (i.e. the ternary classifier) when identity features are used instead of order features.
Specifically, we first extract order features and identity features from all instances in MORPH II using the pretrained ORID network. Then, we train two comparators that predict the ordering relationship between two instances x and y: one takes the order features h^x_or and h^y_or as input, and the other takes the identity features h^x_id and h^y_id. Table 12 lists the comparator accuracies. We see that the comparator fails to predict ordering relationships from identity features. Also, Figure 15 shows t-SNE visualizations of the identity feature spaces with age or cluster labels, which confirm that order-related information is excluded effectively from identity features.

We set τ = 0.1. We initialize the feature extractor of the comparator using VGG16 pre-trained on the ILSVRC2012 dataset (Deng et al., 2009) and its fully connected layers using the Glorot normal method. We employ the Adam optimizer with a minibatch size of 32. We start with a learning rate of 10^-4 and shrink it by a factor of 0.5 after every 80,000 steps.

OL-supervised trains the comparator using supervised clusters separated according to gender or ethnic group annotations. Specifically, the supervised clusters at k = 2, 3, and 6 are divided according to genders, ethnic groups, and both genders and ethnic groups, respectively. On the other hand, OL-unsupervised and the proposed algorithm determine their clusters in unsupervised manners. We see that the proposed algorithm performs better than the conventional algorithms in all tests. By employing multiple clusters, the proposed algorithm improves MAE by 0.12 and CS by 0.73% on average. In contrast, OL-unsupervised improves MAE by only 0.04 and CS by only 0.07%. This indicates that, by employing identity features, the proposed DRC-ORID algorithm groups instances into meaningful clusters, in which instance ranks can be compared more accurately.

C.5 AGE TRANSFORMATION

More age transformation results are in Figure 17 . Note that, in Figure 6 , given an image x, we select the reference y at a target age, whose identity feature is the most similar to that of x, as in Eq. ( 13). Hence, the image x and the reference y have similar appearance. On the other hand, Figure 17 shows transformed images using randomly selected references. The first two cases transform the same image x with different references, but the transformed images are similar. Also, even when the gender and/or race of y are different from those of x, the identity information of x is preserved well in the transformed image. This confirms the reliability of ORID. 

C.6 RECONSTRUCTION

Figure 18 shows reconstructed faces using the whole feature (h^x_or ⊕ h^x_id), the order feature only (h^x_or ⊕ 0), and the identity feature only (0 ⊕ h^x_id). Without the order feature, each decoded face is degraded, but the person can still be identified. In contrast, without the identity feature, the reconstruction is not related to the person, except that it seems to be an average face of people at the same age as the person. These results confirm that order and identity features are complementary.

C.7 REFERENCE IMAGES

For facial age estimation, reference images are selected by Eq. (13) from the training set. In the default mode, a single reference image is selected for each age i. However, the top r most similar references can also be selected and used for the estimation. We use a single reference because multiple references improve the estimation performance only negligibly. In Figure 19, the top three reference images are shown for each age from 16 to 53. In setting D, the two clusters are, in general, divided into Africans and the others. However, we see that test and reference images tend to have the same gender, as well as the same race. Furthermore, they have similar appearances, even when they have a big age difference.

D AESTHETIC SCORE REGRESSION - MORE EXPERIMENTS AND DETAILS

D.1 IMPLEMENTATION DETAILS

For aesthetic score regression, we implement a pairwise comparator based on EfficientNet-B4 (Tan & Le, 2019). The pairwise comparator has the same architecture as that for facial age estimation, except for the backbone network. To initialize the backbone, we adopt the parameters pre-trained on the ILSVRC2012 dataset. We initialize the other layers using the Glorot normal method. We update the network parameters using the Adam optimizer with a minibatch size of 16. We start with a learning rate of 10^-4 and shrink it by a factor of 0.8 every 8,000 steps. Training images are augmented by random horizontal flipping. We set τ = 0.15 for the ternary categorization in Eq. (4).
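The ternary categorization with threshold τ can be sketched as follows. This assumes the difference-based rule of Eq. (4) (x ≻ y if θ(x) − θ(y) > τ, and so on); note that for facial ages the paper instead uses a geometric-ratio variant following Lim et al. (2020), which this sketch does not implement:

```python
def ternary_label(theta_x, theta_y, tau):
    """Ternary ordering category: '>' if theta(x) - theta(y) > tau,
    '<' if theta(y) - theta(x) > tau, and '~' (similar) otherwise."""
    d = theta_x - theta_y
    if d > tau:
        return ">"
    if d < -tau:
        return "<"
    return "~"
```

These labels form the one-hot targets q in the comparator loss.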

D.2 CLUSTERING

Notice that the AADB dataset contains images of diverse contents and styles. Hence, when clustering with a small k, it is hard to observe the characteristics shared by images within each cluster, whereas k = 2 or 3 is sufficient for facial age data. We empirically found that at least eight clusters are required (k = 8) to partition the AADB dataset by meaningful criteria. Figure 20 provides more examples of clustering results at k = 8. 



Figure 1: A clustering example of facial photos, which are ordered according to ages. Without any supervision, the proposed algorithm can obtain meaningful clusters using identity features.

Figure 2: An overview of the ORID network.

Figure 3: MORPH II images in setting A are divided into two clusters.

Figure 5: t-SNE visualization of the feature spaces of the balanced dataset at k = 3.

Figure 9: t-SNE visualization of feature space of AADB at k = 8.

Figure 7: Example HCI images grouped into four clusters (k = 4).

The historical color image (HCI) dataset (Palermo et al., 2012) is a dataset for determining the decade when a photograph was taken. It contains images from five decades, from the 1930s to the 1970s. Each decade category has 265 images: 210, 5, and 50 are used for training, validation, and testing, respectively. Figure 7 shows the clustering results at k = 4. We observe similarity of contents in each cluster. Table 3 compares the quinary classification results. Frank & Hall (2001), Cardoso & da Costa (2007), Palermo et al. (2012), and RED-SVM use traditional features, while the others use deep features. The performance gaps between these two approaches are not huge, since 1,050 images are insufficient for training deep networks.

Figure 10: Comparison of (a) NMI and (b) CA curves with and without the repulsive term.

Figure 13: Comparison of NMI curves at different α's.

Figure 14: Clustering results using different features: (a) identity features, (b) order features, and (c) whole features.

Figure 15: t-SNE visualization of identity feature spaces of MORPH II with age or cluster labels: (a) setting A and (b) setting B.

Figure 18: Reconstruction results. For each test, the input x, the reconstruction x̂ = G(F(x)) using the whole feature, the reconstruction G(0 ⊕ h^x_id) using the identity feature, and the reconstruction G(h^x_or ⊕ 0) using the order feature are shown.

Figure 19: Examples of reference images in facial age estimation.

Figure 20: Example AADB images grouped into eight clusters (k = 8).

Algorithm 1 DRC-ORID
Input: Ordered data X = {x_1, x_2, ..., x_n}, k = the number of clusters
1: Train the ORID network for warm-up epochs to minimize the loss λ_rec ℓ_rec + λ_com ℓ_com + λ_gan ℓ_gan
2: Partition X into C_1, C_2, ..., C_k using k-means
3: Fine-tune the ORID network to minimize the loss λ_rec ℓ_rec + λ_clu ℓ_clu + λ_com ℓ_com + λ_gan ℓ_gan
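The clustering core of Algorithm 1 can be sketched as the alternation of the centroid rule and the NN rule on unit-norm identity features. This sketch (our own function and variable names) omits the joint fine-tuning of the ORID network; in the full algorithm, the features would be re-extracted after each network update:

```python
import numpy as np

def drc_iterations(H_id, k, alpha=0.1, n_iter=20, seed=0):
    """Alternate the centroid rule and the NN rule on unit-norm identity
    features H_id of shape (n, d), starting from a random partition."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, k, size=len(H_id))   # random initial partition
    for _ in range(n_iter):
        total = H_id.sum(axis=0)
        C = np.empty((k, H_id.shape[1]))
        for j in range(k):                        # centroid rule with repulsion
            own = H_id[labels == j].sum(axis=0)
            c = own - alpha / (k - 1) * (total - own)
            C[j] = c / np.linalg.norm(c)
        labels = (H_id @ C.T).argmax(axis=1)      # NN rule
    return labels, C
```

On well-separated toy data, this recovers the underlying groups; in practice, the balancing constraint of Appendix B.3 would also be applied during the NN step.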

Comparison of age estimation results on MORPH II. Here, * means that the algorithm is pre-trained on the IMDB-WIKI dataset (Rothe et al., 2018).

Aesthetic score regression performances of the proposed algorithm and the conventional Reg-Net (Kong et al., 2016) and ASM (Lee & Kim, 2019) on the AADB dataset.

Comparison of classification performances on the HCI dataset.

The encoder F in the ORID network.

The decoder G in the ORID network.

The comparator C in the ORID network.

The discriminator D in the ORID network.

Comparison of age estimation results on MORPH II (setting A) when clustering is performed with and without the repulsive term.

B.7 IMPACTS OF THE NUMBER k OF CLUSTERS ON RANK ESTIMATION

Tables 9 and 10 compare the rank estimation results according to the number k of clusters on the MORPH II (setting A) and AADB datasets, respectively. On MORPH II, the age estimation performance decreases as k increases. Since the training set in setting A consists of only 4,394 images, each cluster at a large k contains too few instances. Thus, the comparator is trained inefficiently with fewer training pairs, degrading the performance. In contrast, AADB contains a large number of diverse images. Due to the diversity, a relatively large k should be used to group images into meaningful clusters. Also, even at a large k, each cluster contains a sufficient amount of data. Thus, as compared to MORPH II, the results on AADB are less sensitive to k. In addition, we provide age estimation results on the balanced dataset in Table 14, in which k has marginal impacts on the rank estimation performance.

As mentioned previously, the quality of clustering significantly affects the rank estimation performance. Also, similarly to other algorithms based on k-means, the clustering quality of DRC is affected by k. Hence, for the proposed algorithm to be used on a new ordered dataset, k should be determined effectively to obtain good clustering and rank estimation results. Readers interested in the selection of k are referred to Pham et al. (2005).

Age estimation results according to k on MORPH II (setting A).

Aesthetic score regression results according to k on AADB.

Age estimation performances when the three clustering results in Figure 14 are used.

Comparison of the comparator accuracies when different input features are used.

Comparison of age estimation results on the balanced dataset.

Table 14 lists age estimation results on the balanced dataset according to the number k of clusters.

ACKNOWLEDGMENTS

This work was supported in part by the MSIT, Korea, under the ITRC support program (IITP-2020-2016-0-00464) supervised by the IITP, and in part by the National Research Foundation of Korea (NRF) through the Korea Government (MSIP) under Grant NRF-2018R1A2B3003896.

B.10 MAP ESTIMATION

Let us describe the MAP estimation rule for rank estimation in Section 3.4. Given a test instance x, we select references y_i by Eq. (13). Then, by comparing x with y_i, the comparator yields the probability vector p^{xy_i} = (p^{xy_i}_{\succ}, p^{xy_i}_{\approx}, p^{xy_i}_{\prec}) for the three cases in Eq. (4). Thus, given y_i, the probability of θ(x) = r can be written in terms of these three cases. Suppose that x ≻ y_i. Then, θ(x) - θ(y_i) = r - i > τ from Eq. (4). Also, the maximum possible rank is m. We hence assume that θ(x) has the uniform distribution between i + τ + 1 and m; in other words, θ(x) ∼ U[i + τ + 1, m], where U denotes a discrete uniform distribution. Similarly, we assume that θ(x) ∼ U[i - τ, i + τ] when x ≈ y_i, and θ(x) ∼ U[1, i - τ - 1] when x ≺ y_i. Then, we approximate the a posteriori probability P_{θ(x)}(r | y_1, ..., y_m) by averaging these single-reference inferences in Eq. (31). Finally, we obtain the MAP estimate of the rank of x, which is given by θ̂(x) = arg max_r P_{θ(x)}(r | y_1, ..., y_m).
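A sketch of this MAP rule is given below, assuming an integer τ, one reference per rank (reference y_i has rank i), and the discrete uniform supports described above; the function name and input layout are our own:

```python
import numpy as np

def map_rank(p, m, tau=0):
    """MAP rank estimate from pairwise comparisons with references y_1..y_m.

    p[i] = (p_succ, p_approx, p_prec): comparator probabilities that the test
    instance is greater than, similar to, or less than reference y_{i+1},
    whose rank is i+1. Each case induces a discrete uniform distribution
    over the consistent ranks; the per-reference posteriors are averaged
    and the arg-max rank is returned.
    """
    post = np.zeros(m)
    for idx in range(m):
        i = idx + 1                               # rank of reference y_i
        for prob, lo, hi in (
            (p[idx][0], i + tau + 1, m),          # x > y_i
            (p[idx][1], i - tau, i + tau),        # x ~ y_i
            (p[idx][2], 1, i - tau - 1),          # x < y_i
        ):
            lo, hi = max(1, lo), min(m, hi)
            if lo <= hi:                          # spread prob uniformly
                post[lo - 1:hi] += prob / (hi - lo + 1)
    post /= m                                     # average over references
    return int(np.argmax(post)) + 1
```

For instance, if the comparator confidently reports x above the first two references, similar to the third, and below the last two, the estimate is rank 3.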

C FACIAL AGE ESTIMATION - MORE EXPERIMENTS AND DETAILS

C.1 IMPLEMENTATION DETAILS

We initialize the parameters of the ORID network for facial age estimation using the Glorot normal method (Glorot & Bengio, 2010). We use the Adam optimizer with a learning rate of 10^-4 and decrease the rate by a factor of 0.5 every 50,000 steps. For data augmentation, we do random horizontal flips only. This is because other augmentation schemes, such as brightness or contrast modification, may deform identity information such as skin colors. Also, d_or and d_id are set to 128 and 896, respectively. In Eq. (6), we set α to 0.1 and decrease it to 0.05 after 200 epochs.

C.2 EVALUATION SETTINGS

For evaluation on the MORPH II dataset, we adopt four widely used testing protocols.

• Setting A - 5,492 images of the Caucasian race are selected and then randomly divided into two non-overlapping parts: 80% for training and 20% for testing.

C.3 CLUSTERING

We provide more clustering results on MORPH II. Figure 16 shows the clustering results on setting B at k = 2. Since setting B consists of Africans and Caucasians, the images are clustered according to the races. Also, Table 13 summarizes the clustering results for settings A, B, and C at k = 2. The clustering result on setting D is omitted, since it is almost identical to that on setting C. In all settings, the proposed DRC-ORID divides facial images into two clusters with meaningful criteria: gender for setting A and race for settings B, C, and D.

C.4 AGE ESTIMATION

We implement a VGG-based pairwise comparator and follow the settings of Lim et al. (2020). Specifically, instead of Eq. (4), we use the ternary categorization based on the geometric ratio.

D.3 REFERENCE IMAGES

Figure 21 shows examples of reference images, which are used for the aesthetic score regression. Given a test image x, the reference image y i of aesthetic class i is selected by Eq. ( 13). For the aesthetic score regression, we use a single reference image for each aesthetic class, as done in the facial age estimation. Thus, 101 reference images are used in total. 

E HCI CLASSIFICATION - MORE EXPERIMENTS AND DETAILS

E.1 IMPLEMENTATION DETAILS

For HCI classification, we set all hyper-parameters of DRC-ORID in the same way as in Appendix C.1. We set τ = 1 for the ternary categorization of the ordering relationship in Eq. (4). Note that there are five decade classes, labeled from 1 to 5.

E.2 CLUSTERING

Figure 22 shows some sample images in the HCI dataset, which are ordered according to the decade classes. Figure 23 shows more example HCI images grouped into four clusters (k = 4). 

E.3 REFERENCE IMAGES

Figure 24 shows the five reference images for each of six test image examples. Note that, given a test image, reference images of similar contents, tones, or composition are selected from the five decade classes.

Figure 24: Examples of reference images in historical color image classification.

