LEARNING HYPERBOLIC REPRESENTATIONS FOR UNSUPERVISED 3D SEGMENTATION

Abstract

There is a need for unsupervised 3D segmentation of complex volumetric data, particularly when annotation capacity is limited or discovery of new categories is desired. Motivated by the observation that much 3D volumetric data is innately hierarchical, we propose learning effective representations of 3D patches for unsupervised segmentation through a variational autoencoder (VAE) with a hyperbolic latent space and a proposed gyroplane convolutional layer, which better model the underlying hierarchical structure within a 3D image. We also introduce a hierarchical triplet loss and a multi-scale patch sampling scheme to embed relationships across varying levels of granularity. We demonstrate the effectiveness of our hyperbolic representations for unsupervised 3D segmentation on a hierarchical toy dataset, the BraTS whole tumor dataset, and cryogenic electron microscopy data.

1. INTRODUCTION

Recent advances in technology have greatly increased both the availability of 3D data and the need to process and learn from it. In particular, technologies such as magnetic resonance imaging and cryogenic electron microscopy (cryo-EM) have led to greater availability of 3D voxel data. Deep learning is a promising technique for learning from such data, but producing annotations for 3D data can be extremely expensive, especially for richer tasks such as segmentation in dense voxel grids. In some cases, labels may also be impossible to produce due to the limitations of current knowledge, or may introduce bias when the goal is scientific discovery. Unsupervised learning, which does not require annotations, is a promising approach for overcoming these limitations.

In this work, we tackle the challenging problem of unsupervised segmentation on complex 3D voxel data by addressing the essential challenge of representation learning. We expand on prior literature in the hyperbolic domain, which conducts classification on simple data, to the task of segmentation in 3D images, which requires significantly more representation discriminability. To learn effective representations, we need to capture the structure of our input data. We observe that 3D images often have inherent hierarchical structure: as a biomedical example, a cryo-EM tomogram of a cell has a hierarchy that at the highest level comprises the entire cell; at a finer level comprises organelles such as the mitochondria and nucleus; and at an even finer level comprises sub-structures such as the nucleolus of a nucleus or proteins within organelles. For downstream analysis, we are typically interested in the unsupervised discovery and segmentation of structures spanning multiple levels of hierarchy. However, prior work on representation learning for unsupervised 3D segmentation does not explicitly model hierarchical structure between different regions of a 3D image.
We argue that this hampers the ability to leverage hierarchical relationships to improve segmentation in complex 3D images. Our key insight is that we can utilize a hyperbolic embedding space to learn effective hierarchical representations of voxel regions in 3D images. Hyperbolic representations have been proposed as a continuous way to represent hierarchical data, as trees can be embedded in hyperbolic space with arbitrarily low error (Sarkar, 2011). These methods have shown promise for modeling data types such as natural language word taxonomies (Nickel & Kiela, 2017; 2018), graphs (Nickel & Kiela, 2017; Mathieu et al., 2019; Ovinnikov, 2019; Chami et al., 2019), as well as simple MNIST (LeCun et al., 2010) image data for classification (Mathieu et al., 2019). To the best of our knowledge, our work is the first to introduce learning hyperbolic representations to capture hierarchical structure among subregions of complex 3D images, and to utilize the learned hyperbolic representations to perform a complex computer vision task such as segmentation.

Our approach for learning hyperbolic representations of 3D voxel grid data is based on several key innovations. First, to handle larger and more complex 3D data such as biomedical images, we propose a hyperbolic 3D convolutional VAE along with a new gyroplane convolutional layer that respects hyperbolic geometry. Second, we enhance our VAE training objective with a novel self-supervised hierarchical triplet loss that helps our model learn hierarchical structure within the VAE's hyperbolic latent space. Finally, since our goal in segmentation is to learn hierarchy within voxel regions of 3D input, we present a multi-scale sampling scheme such that our 3D VAE can simultaneously embed hierarchical relationships across varying levels of granularity.
In summary, our key contributions are as follows:
• We introduce a hyperbolic 3D convolutional VAE with a novel gyroplane convolutional layer that scales the learning of hyperbolic representations to complex 3D data.
• We propose a multi-scale sampling scheme and hierarchical triplet loss in order to encode hierarchical structure in the latent space and perform 3D unsupervised segmentation.
• We demonstrate the effectiveness of our approach through experiments on a synthetic 3D toy dataset, the Brain Tumor Segmentation (BraTS) dataset (Menze et al., 2014; Bakas et al., 2017; 2018), and cryo-EM data.
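To make the underlying hyperbolic operations concrete, the following is a minimal NumPy sketch of the Poincaré-ball exponential map, the geodesic distance, and a hinge-style triplet loss built on that distance. This is illustrative only; the function names, curvature convention, and margin value are our own assumptions, not the paper's implementation.

```python
import numpy as np

def exp_map_origin(v, c=1.0):
    # Poincare-ball exponential map at the origin:
    # exp_0(v) = tanh(sqrt(c) * ||v||) * v / (sqrt(c) * ||v||)
    norm = np.linalg.norm(v)
    if norm < 1e-12:
        return np.zeros_like(v)
    return np.tanh(np.sqrt(c) * norm) * v / (np.sqrt(c) * norm)

def poincare_distance(x, y, c=1.0):
    # Geodesic distance on the Poincare ball of curvature -c
    sq_diff = c * np.sum((x - y) ** 2)
    denom = (1.0 - c * np.sum(x ** 2)) * (1.0 - c * np.sum(y ** 2))
    return np.arccosh(1.0 + 2.0 * sq_diff / denom) / np.sqrt(c)

def hierarchical_triplet_loss(anchor, positive, negative, margin=0.2, c=1.0):
    # Hinge-style triplet loss, with hyperbolic rather than Euclidean distances
    return max(0.0, poincare_distance(anchor, positive, c)
               - poincare_distance(anchor, negative, c) + margin)
```

Note that the origin and `exp_map_origin(v)` lie at geodesic distance 2‖v‖ apart, reflecting the conformal factor of 2 at the origin; points near the ball's boundary are exponentially far from the center, which is what makes the space suitable for embedding trees.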

2. RELATED WORK

Segmentation on 3D voxel data Since 3D voxel grids are dense, computer vision tasks such as supervised segmentation are commonly performed using deep learning architectures with 3D convolutional layers (Chen et al., 2016; Dou et al., 2017; Hesamian et al., 2019; Zheng et al., 2019). However, due to the challenges of obtaining voxel-level segmentations in 3D, there has been significant effort in developing semi-supervised approaches, including using labels from only a few fully annotated 2D slices of an input volume (Çiçek et al., 2016), using a smaller set of segmentations with joint segmentation and registration (Xu & Niethammer, 2019), and using one segmented input in conjunction with other unlabelled data (Zhao et al., 2019).

Unsupervised approaches for 3D segmentation are useful not only for further reducing the manual annotation effort required, but also for scientific discovery tasks where we lack sufficient knowledge to provide representative training examples for structures of interest. Moriya et al. (2018) extends to 3D data an iterative approach of feature learning followed by clustering (Yang et al., 2016). Nalepa et al. (2020) uses a 3D convolutional autoencoder architecture and performs clustering of the latent representations. Another approach (Dalca et al., 2018) uses a network pre-trained on manual segmentations from a separate dataset to perform unsupervised segmentation of 3D biomedical images. However, this limits applicability to areas where a manually annotated dataset already exists, and makes it unsuitable for unbiased unsupervised discovery. Gur et al. (2019) and Kitrungrotsakul et al. (2019) developed unsupervised methods for 3D segmentation of vessel structures, but these are specialized and do not generalize to the segmentation of other structures. Beyond unsupervised 3D segmentation, there has been work such as Ji et al. (2019), which performs unsupervised 2D segmentation based on a mutual information objective, and Caron et al. (2018), which proposes using the clustered output of an encoder as pseudo-labels. While these methods can be applied to 2D slices of a 3D volume to perform 3D segmentation, they generally suffer from insufficient modeling of 3D spatial information. None of the aforementioned approaches explicitly model hierarchical structure, which is the main focus of our work.

Hyperbolic representations A recent line of work has employed hyperbolic space to model hierarchical structure, with the intuition that tree structures can be naturally embedded into continuous hyperbolic space (Nickel & Kiela, 2017). Several works have proposed hyperbolic variational autoencoders (VAEs) as an unsupervised method to learn hyperbolic representations. Ovinnikov (2019) proposes a Wasserstein autoencoder on the Poincaré ball model of hyperbolic geometry. Nagano et al. (2019) proposes a VAE on the hyperboloid model of hyperbolic geometry where the last layer of the encoder is an exponential map, and derives a reparametrisable sampling scheme for the wrapped normal distribution, which they use for the prior and posterior. Mathieu et al. (2019)
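The wrapped normal sampling scheme on the hyperboloid (sample in the tangent space at the origin, parallel transport to the mean, then apply the exponential map) can be sketched as follows. This is a minimal NumPy illustration under our own naming, not the authors' released code.

```python
import numpy as np

def lorentz_inner(x, y):
    # Minkowski inner product <x, y>_L = -x_0 y_0 + sum_i x_i y_i
    return -x[0] * y[0] + np.dot(x[1:], y[1:])

def exp_map(mu, u):
    # Exponential map on the hyperboloid at point mu, tangent vector u
    norm_u = np.sqrt(max(lorentz_inner(u, u), 1e-15))
    return np.cosh(norm_u) * mu + np.sinh(norm_u) * u / norm_u

def parallel_transport(u, mu0, mu):
    # Transport tangent vector u from mu0 to mu along the connecting geodesic
    alpha = -lorentz_inner(mu0, mu)
    return u + lorentz_inner(mu - alpha * mu0, u) / (alpha + 1.0) * (mu0 + mu)

def sample_wrapped_normal(mu, sigma, rng):
    # 1) Gaussian sample in the tangent space at the hyperboloid origin,
    # 2) parallel transport to mu, 3) exponential map onto the manifold
    n = mu.shape[0] - 1
    v = rng.normal(scale=sigma, size=n)
    u0 = np.concatenate(([0.0], v))   # tangent vector at the origin (1, 0, ..., 0)
    mu0 = np.zeros_like(mu)
    mu0[0] = 1.0
    u = parallel_transport(u0, mu0, mu)
    return exp_map(mu, u)
```

Because each step is a differentiable, deterministic transformation of a Euclidean Gaussian sample, this construction admits the reparametrisation trick, which is what makes it usable as a VAE prior and posterior.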

