SHAPE-TAILORED DEEP NEURAL NETWORKS

Abstract

We present Shape-Tailored Deep Neural Networks (ST-DNN). ST-DNN extend convolutional networks (CNN), which aggregate data from fixed-shape (square) neighborhoods, to compute descriptors defined on arbitrarily shaped regions. This is natural for segmentation, where descriptors should describe regions (e.g., of objects) that have diverse shape. We formulate these descriptors through the Poisson partial differential equation (PDE), which can be used to generalize convolution to arbitrary regions. We stack multiple PDE layers to generalize a deep CNN to arbitrary regions, and apply it to segmentation. We show that ST-DNN are covariant to translations and rotations and robust to domain deformations, properties natural for segmentation that existing CNN-based methods lack. ST-DNN are 3-4 orders of magnitude smaller than CNNs used for segmentation. We show that they exceed the segmentation performance of state-of-the-art CNN-based descriptors on the texture segmentation problem while using 2-3 orders of magnitude smaller training sets.

1. INTRODUCTION

Convolutional neural networks (CNNs) have been used extensively for segmentation problems in computer vision He et al. (2017); He et al. (2016); Chen et al. (2017); Xie & Tu (2015). CNNs provide a framework for learning descriptors that discriminate different textured or semantic regions within images. Much progress has been made in segmentation with CNNs, but results are still far from human performance, and significant engineering must be performed to adapt CNNs to segmentation problems. A basic component in architectures for segmentation involves labeling or grouping the dense descriptors returned by a backbone CNN. A difficulty in grouping these descriptors arises, especially near the boundaries of segmentation regions: CNN descriptors aggregate data from fixed-shape (square) neighborhoods at each pixel and may thus aggregate data from different regions. This makes grouping such descriptors into a unique region difficult and often results in grouping errors. In segmentation problems (e.g., semantic segmentation), current methods attempt to mitigate these errors by adding post-processing layers that simultaneously group the (coarse-scale) descriptors from the CNN backbone and the fine-level pixel data. However, the errors introduced might not always be fixed. A more natural approach is to consider the coarse and fine structure together, avoiding aggregation across boundaries, so as to prevent errors at the outset. To this end, one can design descriptors that aggregate data only within boundaries. Khan et al. (2015) introduced "shape-tailored" descriptors that aggregate data within a region of interest and used these descriptors for segmentation. However, these descriptors are hand-crafted and do not perform on par with learned approaches.
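The boundary-mixing problem described above can be illustrated with a minimal numpy sketch (illustrative only, not the paper's construction): a fixed square window averages data from both sides of a region boundary, while a region-restricted average, in the spirit of shape-tailored descriptors, uses only pixels inside the region mask.

```python
import numpy as np

def box_filter(u, r=1):
    """Fixed-shape aggregation: average over a (2r+1)x(2r+1) window,
    regardless of any region boundary."""
    H, W = u.shape
    out = np.zeros_like(u, dtype=float)
    for i in range(H):
        for j in range(W):
            out[i, j] = u[max(i - r, 0):i + r + 1,
                          max(j - r, 0):j + r + 1].mean()
    return out

def region_box_filter(u, mask, r=1):
    """Shape-tailored aggregation: average only over window pixels
    that lie inside the region given by `mask`."""
    H, W = u.shape
    out = np.zeros_like(u, dtype=float)
    for i in range(H):
        for j in range(W):
            if not mask[i, j]:
                continue
            win = u[max(i - r, 0):i + r + 1, max(j - r, 0):j + r + 1]
            wm = mask[max(i - r, 0):i + r + 1, max(j - r, 0):j + r + 1]
            out[i, j] = win[wm].mean()
    return out

# Two constant "textures" separated by a vertical boundary.
u = np.zeros((6, 6)); u[:, 3:] = 1.0
mask = np.zeros((6, 6), dtype=bool); mask[:, :3] = True  # left region

# Near the boundary the fixed window mixes both regions...
assert box_filter(u)[2, 2] > 0
# ...while the region-restricted average stays pure.
assert region_box_filter(u, mask)[2, 2] == 0.0
```

The contaminated fixed-window descriptor near the boundary is what makes grouping ambiguous; the region-restricted version remains constant within each region.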
Khan & Sundaramoorthi (2018) introduced learned shape-tailored descriptors by learning a neural network that operates on the channel dimension of hand-crafted shape-tailored input descriptors for segmentation. However, these networks, though deep in the channel dimension, did not filter data spatially within layers. Since an advantage of CNNs comes from exploiting spatial filtering at each depth of the network, in this work we design shape-tailored networks that are deep and perform shape-tailored spatial filtering at each layer using solutions of the Poisson PDE. This results in shape-tailored networks that provide more discriminative descriptors than a single shape-tailored kernel. The extension requires techniques to back-propagate through PDEs, which we derive in this work. Our contributions are specifically:
1. We construct and show how to train ST-DNN, deep networks that perform shape-tailored spatial filtering via the Poisson PDE at each depth, so as to generalize a CNN to arbitrarily shaped regions.
2. We show analytically and empirically that ST-DNN are covariant to translations and rotations, as they inherit this property from the Poisson PDE. In segmentation, covariance (a.k.a. equivariance) to translation and rotation is a desired property: if a segment is found in an image, then the corresponding segment should be found in the translated / rotated image (or object). This property is not generally present in existing CNN-based segmentation methods, even when trained with augmented translated and rotated images Azulay & Weiss (2019), and requires special consideration.
3. We show analytically and empirically that ST-DNN are robust to domain deformations. These result from viewpoint change or object articulation, and so they should not affect the descriptor.
4. To demonstrate ST-DNN and the properties above, we validate them on the task of segmentation, an important problem in low-level vision Malik & Perona (1990); Arbelaez et al. (2011b).
Because of properties of the PDE, ST-DNN also have desirable generalization properties. This is because: a) the robustness and covariance properties are built into our descriptors and do not need to be learned from data; b) the PDE solutions, generalizations of Gabor-like filters Olshausen & Field (1996); Zador (2019), have natural image structure inherent in them, and so this structure does not need to be learned from data; and c) our networks have fewer parameters than existing networks for segmentation, because the PDE solutions form a basis and only linear combinations of a few basis elements are needed to learn discriminative descriptors. In contrast, CNNs spend many parameters to learn this structure.
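The translation-covariance property in contribution 2 can be checked numerically for the simpler case of a shift-invariant linear smoother on the full image domain (a toy stand-in, not the paper's shape-tailored operator): applying the operator and then translating gives the same result as translating and then applying the operator.

```python
import numpy as np

def smooth(u):
    # 4-neighbor averaging with periodic boundary: a linear,
    # shift-invariant smoothing operator.
    return (u + np.roll(u, 1, 0) + np.roll(u, -1, 0)
              + np.roll(u, 1, 1) + np.roll(u, -1, 1)) / 5.0

rng = np.random.default_rng(0)
u = rng.random((8, 8))
shift = (2, 3)

# Covariance (equivariance): smooth-then-translate equals
# translate-then-smooth.
lhs = np.roll(smooth(u), shift, axis=(0, 1))
rhs = smooth(np.roll(u, shift, axis=(0, 1)))
assert np.allclose(lhs, rhs)
```

For ST-DNN the analogous statement is stronger: because the Poisson PDE is defined intrinsically on the region, the descriptor of a translated / rotated region is the translated / rotated descriptor, without data augmentation.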

1.1. RELATED WORK

Traditional approaches to segmentation rely on hand-crafted features, e.g., computed through a filter bank Haralick & Shapiro (1985). These features are ambiguous near the boundaries of objects. In Khan et al. (2015), hand-crafted descriptors that aggregate data within object boundaries are constructed to avoid this, but they lack sufficient capacity to capture the diversity of textures or to be invariant to nuisances.

Deep-learning based approaches have shown state-of-the-art results in edge-based methods Xie & Tu (2017); He et al. (2019); Deng et al. (2018), where watershed is applied on edge-maps to obtain the segmentation. The main drawback of these methods is that it is often difficult to form segmentations due to extraneous or faint edges, particularly when "textons" in textures are large. CNNs have been applied to compute descriptors for semantic segmentation, where pixels in an image are classified into certain semantic object classes Li et al. (2019); Huang et al. (2019); Du et al. (2019); Pang et al. (2019); Zhu et al. (2019); Liu et al. (2019). Usually these classes are limited to a few object classes and do not tackle general textures, where the number of classes may be far greater; thus such approaches are not directly applicable to texture segmentation. Semantic segmentation approaches may nevertheless eventually benefit from our methodology, as descriptors that aggregate data only within objects or regions are also relevant to these problems.

A learned shape-tailored descriptor Khan & Sundaramoorthi (2018) is constructed with a Siamese network on hand-crafted shape-tailored descriptors. However, Khan & Sundaramoorthi (2018) only performs shape-tailored filtering in pre-processing, as layering such filters requires new training methods. We further examine covariance and robustness, not examined in Khan & Sundaramoorthi (2018). Covariance to rotation in CNNs has been examined in recent works, e.g., Weiler et al. (2018); Yin et al. (2019); Anderson et al. (2019). These networks, however, are not shape-tailored and so do not aggregate data only within shaped regions. Lack of robustness to deformation (and translation) in CNNs is examined empirically in Azulay & Weiss (2019) and theoretically in Bietti & Mairal (2017). Sifre & Mallat (2013) constructs deformation-robust descriptors inspired by CNNs, but these are hand-crafted.

2. CONSTRUCTION OF SHAPE-TAILORED DNN AND PROPERTIES

In this section, we design a deep neural network that outputs descriptors at each pixel within an arbitrarily shaped region of interest, aggregating data only from within the region. We want the descriptors to be discriminative of different textures, yet robust to nuisances within the region (e.g., local photometric and geometric variability), to be useful for segmentation. Our construction uses a Poisson PDE, which naturally smooths data only within a region of interest. Smoothing naturally yields robustness to geometric nuisances (domain deformations). By taking linear combinations of derivatives of the output of the PDE, we can approximate the effect of general convolutional kernels while avoiding mixing data across the boundary of the region of interest. ST-DNN are also covariant to translations and rotations, inheriting this property from the Poisson PDE.
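A minimal sketch of region-restricted smoothing of this kind (illustrative only: the function `poisson_smooth`, the parameter `alpha`, and the plain Jacobi scheme are assumptions, not the paper's implementation) solves a screened Poisson equation u - alpha * Laplacian(u) = I inside the region, where neighbors outside the mask are simply ignored, so no data crosses the region boundary.

```python
import numpy as np

def poisson_smooth(I, mask, alpha=10.0, iters=200):
    """Jacobi iteration for u - alpha * Laplacian(u) = I inside `mask`,
    with Neumann-type boundary handling: neighbors outside the region
    are dropped, so no data is aggregated across the boundary."""
    H, W = I.shape
    u = np.where(mask, I, 0.0).astype(float)
    for _ in range(iters):
        u_new = u.copy()
        for i in range(H):
            for j in range(W):
                if not mask[i, j]:
                    continue
                s, n = 0.0, 0  # sum and count of inside-region neighbors
                for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ii, jj = i + di, j + dj
                    if 0 <= ii < H and 0 <= jj < W and mask[ii, jj]:
                        s += u[ii, jj]
                        n += 1
                # From u - alpha * (s - n * u) = I:
                # u = (I + alpha * s) / (1 + alpha * n)
                u_new[i, j] = (I[i, j] + alpha * s) / (1.0 + alpha * n)
        u = u_new
    return u

# An image with two constant textures; the region mask covers the left one.
I = np.zeros((6, 6)); I[:, 3:] = 1.0
mask = np.zeros((6, 6), dtype=bool); mask[:, :3] = True
u = poisson_smooth(I, mask, alpha=5.0, iters=100)

# Data from the right texture never leaks into the region's descriptor.
assert np.allclose(u[mask], 0.0)
```

In practice one would solve the discretized PDE with a fast linear solver rather than pointwise Jacobi sweeps; the sketch only shows that the boundary condition keeps the smoothing confined to the region.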

