LEARNING DISCRIMINATIVE REPRESENTATIONS FOR CHROMOSOME CLASSIFICATION WITH SMALL DATASETS

Abstract

Chromosome classification is crucial for karyotype analysis in cytogenetics, the fundamental approach by which clinical cytogeneticists identify numerical and structural chromosomal abnormalities. However, classifying chromosomes accurately and robustly in clinical practice remains challenging because of 1) diverse deformations of chromosome shape, 2) the similarity between chromosome types, and 3) imbalanced and insufficient labelled data. This paper proposes a novel pipeline for the automatic classification of chromosomes. Unlike existing methods, our approach is primarily based on learning meaningful data representations rather than only finding classification features in given samples. The proposed pipeline comprises three stages. The first stage extracts meaningful visual features of chromosomes using ResNet with triplet loss. The second stage optimizes the features from stage one to obtain a linear discriminative representation via maximal coding rate reduction, which ensures that clusters representing different chromosome types are far apart while embeddings of the same type stay close together within their cluster. The third stage identifies the chromosomes: given the feature representation learned in the previous stage, traditional machine learning algorithms such as SVM are adequate for the classification task. Evaluation on a publicly available dataset shows that our method achieves 97.22% accuracy and outperforms state-of-the-art methods.

1. INTRODUCTION

Human chromosome classification is crucial for karyotype analysis in cytogenetics. Karyotype analysis is a fundamental approach for clinical cytogeneticists to identify numerical and structural chromosomal abnormalities, such as Turner syndrome, chronic myelogenous leukaemia, Edwards syndrome, and Down syndrome (Stebbins & Ledyard, 1950; Sharma et al., 2017). In clinical practice, karyotyping requires preparing a complete set of micro-photographed metaphase chromosomes from a cell, or more precisely, a karyogram (Figure 1). To do so, cytogeneticists need to classify and sort these chromosomes into 23 pairs, comprising 22 pairs of autosomes and one pair of sex chromosomes (X and Y chromosomes in male cells and two X chromosomes in female cells) (Jindal et al., 2017). Chromosomes are highly coiled and condensed in metaphase, which makes karyotyping challenging, complicated, and laborious. Even experienced cytogeneticists need considerable time and manual effort to classify and sort the various types of chromosomes into a karyogram.
To reduce the burden of karyotyping, many researchers (Piper & Granum, 1989; Errington et al., 1993; Mashadi et al., 2007; Sharma et al., 2018; Lin et al., 2020) have worked on computer-assisted automatic karyotyping for decades. However, classifying chromosomes accurately and robustly in clinical practice remains challenging. The challenges mainly stem from three aspects (Jindal et al., 2017; Lin et al., 2020): 1) diverse deformations of chromosome shape, 2) the similarity between chromosome types, and 3) imbalanced and insufficient labelled data. To tackle these challenges, we propose a novel pipeline for the automatic classification of chromosomes (Figure 2). Unlike existing methods, our approach is primarily based on learning meaningful data representations rather than only finding classification features in given samples.
Specifically, our proposed pipeline comprises three stages. The first stage extracts meaningful visual features of chromosomes using ResNet (He et al., 2016) with triplet loss (Weinberger et al., 2009; Schroff et al., 2015). The second stage optimizes the features from stage one to obtain a linear discriminative representation via maximal coding rate reduction (Chan et al., 2021), which ensures that clusters representing different chromosome types are far apart while vectors of the same type stay close together within their cluster. The third stage identifies the chromosomes: given the feature representation learned in the previous stages, traditional machine learning algorithms such as the Support Vector Machine (SVM) (Boser et al., 1992) can easily be plugged into our pipeline as classifiers. We use a public dataset to validate the performance and generalizability of our approach. Extensive experiments on this dataset corroborate that our proposed method achieves better performance than state-of-the-art methods. Our contributions can be summarised as follows:
• We propose a representation learning-oriented approach for chromosome classification. Based on the enhanced feature representation, simple and efficient machine learning algorithms are adequate for the classification task.
• Inspired by FaceNet's (Schroff et al., 2015) capability of distinguishing similar faces and ReduNet's (Chan et al., 2021) competence in representation structure optimization, we propose combining them for more meaningful and effective feature learning with a small dataset.
• We evaluate the proposed approach on a public dataset. The results demonstrate superior performance compared with the state-of-the-art method.
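As an illustration of the triplet loss used in the first stage, the following minimal sketch computes the loss for a single (anchor, positive, negative) triplet of embedding vectors, in the squared-Euclidean form given by Schroff et al. (2015). The function names, the toy embeddings, and the margin of 0.2 are illustrative assumptions rather than the settings used in our experiments; in practice the loss would be computed over mini-batches of ResNet embeddings in a deep learning framework.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss for one triplet of embeddings.

    The loss is zero once the anchor is closer to the positive (same
    chromosome type) than to the negative (different type) by at least
    `margin`. The default margin is an assumed value for illustration.
    """
    d_pos = squared_distance(anchor, positive)
    d_neg = squared_distance(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


if __name__ == "__main__":
    anchor = [0.0, 0.0]
    positive = [0.1, 0.0]        # same type, close to the anchor
    easy_negative = [1.0, 0.0]   # different type, already far away
    hard_negative = [0.2, 0.0]   # different type, still close to the anchor

    print("easy triplet:", triplet_loss(anchor, positive, easy_negative))
    print("hard triplet:", triplet_loss(anchor, positive, hard_negative))
```

Only the hard triplet contributes a non-zero loss, so training pushes apart visually similar chromosome types while leaving well-separated ones alone.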

2. RELATED WORK

In the early years, automatic chromosome classification relied primarily on hand-crafted and geometrical features (e.g., a chromosome's axis, centromere position, banding pattern features, and length) and on complex pre-processing steps such as straightening the chromosomes, combined with traditional machine learning algorithms (Ledley & Lubs, 1980; Egmont-Petersen et al., 2002; Javan-Roshtkhari 



Figure 1: (a) A G-stained microscopic image of male chromosomes for one case. (b) The karyogram of (a).

Figure 2: The flow diagram of our proposed pipeline. Chromosome images C are processed by the backbone network (ResNet) with triplet loss for initial feature learning. The learned feature embeddings K are then fed to ReduNet for discriminative representation learning. Finally, the enhanced feature representations Z are used by the classifier for chromosome classification.
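The discriminative representation learning step in Figure 2 builds on the coding rate reduction objective (Chan et al., 2021): the rate needed to code all features together minus the weighted sum of the rates needed to code each class separately. The sketch below evaluates this quantity on toy data in plain Python. The function names, the distortion eps = 0.5, and the toy vectors are illustrative assumptions, and the code only evaluates the objective rather than constructing ReduNet itself.

```python
import math


def logdet_spd(matrix):
    """Log-determinant of a symmetric positive-definite matrix via Cholesky."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    logdet = 0.0
    for i in range(n):
        for j in range(i + 1):
            s = matrix[i][j] - sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][i] = math.sqrt(s)
                logdet += 2.0 * math.log(lower[i][i])
            else:
                lower[i][j] = s / lower[j][j]
    return logdet


def coding_rate(vectors, eps=0.5):
    """R(Z, eps) = 1/2 * logdet(I + d / (n * eps^2) * Z Z^T).

    `vectors` holds n feature vectors of dimension d (the columns of Z).
    """
    n, d = len(vectors), len(vectors[0])
    scale = d / (n * eps ** 2)
    m = [[(1.0 if i == j else 0.0) + scale * sum(v[i] * v[j] for v in vectors)
          for j in range(d)] for i in range(d)]
    return 0.5 * logdet_spd(m)


def rate_reduction(vectors, labels, eps=0.5):
    """Coding rate of all features minus the class-weighted per-class rates."""
    n = len(vectors)
    per_class = 0.0
    for c in set(labels):
        members = [v for v, y in zip(vectors, labels) if y == c]
        per_class += len(members) / n * coding_rate(members, eps)
    return coding_rate(vectors, eps) - per_class


if __name__ == "__main__":
    # Toy unit-norm features lying along two orthogonal directions.
    features = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
    print("classes separated:", rate_reduction(features, [0, 0, 1, 1]))
    print("classes mixed:    ", rate_reduction(features, [0, 1, 0, 1]))
```

On these toy features, assigning the two orthogonal directions to different classes yields a larger rate reduction than mixing them within each class, which is the behaviour the second stage is meant to encourage.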

