ITNET: ITERATIVE NEURAL NETWORKS FOR FAST AND EFFICIENT ANYTIME PREDICTION

Abstract

Deep neural networks usually have to be compressed and accelerated for use in low-power devices, e.g. mobile phones. Common requirements are high accuracy, high throughput, low latency, and a small memory footprint. Networks comprising multiple intermediate outputs have been shown to offer a good trade-off between accuracy and latency. In this study, we introduce a multi-output network whose computational graph has a tiny memory footprint, which allows its execution on novel, massively-parallel hardware accelerators designed for extremely high throughput. To this end, the graph is designed to contain loops by iteratively executing a single network building block. These so-called iterative neural networks enable state-of-the-art results for semantic segmentation on the CamVid and Cityscapes datasets, which are especially demanding in terms of computational resources. In ablation studies, we investigate the improvement of network training by intermediate network outputs as well as the trade-off between weight sharing over iterations and network size.

1. INTRODUCTION

On massively-parallel hardware accelerators (Schemmel et al., 2010; Merolla et al., 2014; Yao et al., 2020; Graphcore, 2020a), every neuron and synapse of the network model has a physical counterpart on the hardware system. Usually, by design, memory and computation are no longer separated; instead, neuron activations are computed next to the memory, i.e. the parameters, and fully in parallel. This is in contrast to the rather sequential data processing of CPUs and GPUs, for which the computation of a network model is tiled and the same arithmetic unit is re-used multiple times for different neurons. Since the computation is performed fully in parallel and in memory, the throughput of massively-parallel accelerators is usually much higher than that of CPUs and GPUs. This can be attributed to the fact that the latency and power consumption of accessing local memory, as for in-memory computing, are much lower than for computations on CPUs and GPUs, which require frequent access to non-local memory like DRAM (Sze et al., 2017). However, the network graph has to fit into the memory of the massively-parallel hardware accelerator to allow for maximal throughput. If the network graph exceeds the available memory, the hardware in principle has to be re-configured at high frequency, as is the case for CPUs and GPUs, and the throughput would be substantially reduced. Mixed-signal massively-parallel hardware systems usually operate on shorter time scales than digital ones (e.g. compare Schemmel et al. (2010) and Yao et al. (2020) to Merolla et al. (2014) and Graphcore (2020a)) and would allow for even higher throughputs.

In order to achieve neural networks with tiny computational graphs, in which nodes are operations and edges are activations, we heavily re-use a single building block of the network (see the iterative block in Figure 1a). Not only the structure of computations, i.e.
the type of network layers including their dimensions and connectivity, is identical for each iteration of this building block, but also the weights are shared between iterations. In the computational graphs of these so-called iterative neural networks (ItNets), the re-used building blocks with shared weights can be represented by nodes with self-loops. Compared to conventional feed-forward networks, loops simplify the graph by reducing the number of unique nodes and, consequently, its computational footprint. However, the restriction of sharing weights usually decreases the number of free parameters and, hence, the accuracy of networks. To isolate and quantify this effect, we compare networks with weight sharing to networks for which the parameters of the building blocks are chosen to be independent between iterations of the building block. In contrast to the above proposal, conventional deep neural networks for image processing usually do not share weights and have no (e.g. Huang et al., 2017) or few (e.g. one building block for each scale, as by Greff et al., 2017) layers of identical structure. Liao & Poggio (2016) share weights between re-used building blocks, but use multiple unique building blocks. To improve the training of networks that contain loops in their graphs, and to reduce the latency of networks during inference, we use multiple intermediate outputs.
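The weight-sharing scheme can be sketched in a few lines. The following is a minimal, hypothetical NumPy toy — dense layers stand in for the convolutional blocks, and all dimensions and names (`W_iter`, `W_heads`, `N_ITER`) are made up for illustration, not taken from the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

D = 16         # feature width of the iterative block (toy value)
N_ITER = 4     # number of iterations of the single shared block
N_CLASSES = 3  # toy number of output classes

# One set of weights for the iterative block, shared across all iterations.
W_iter = rng.standard_normal((D, D)) * 0.1
# Independent classification-head weights for each intermediate output m_n.
W_heads = [rng.standard_normal((N_CLASSES, D)) * 0.1 for _ in range(N_ITER)]

def data_block(x):
    # Stand-in for the pre-processing / down-scaling sub-network.
    return x / (np.linalg.norm(x) + 1e-8)

def iterative_block(h):
    # The same W_iter is applied at every iteration, i.e. a single node
    # with a self-loop in the computational graph.
    return np.maximum(W_iter @ h, 0.0)

def itnet_forward(x):
    h = data_block(x)
    outputs = []
    for n in range(N_ITER):
        h = iterative_block(h)          # output fed back to the input
        outputs.append(W_heads[n] @ h)  # intermediate prediction m_n
    return outputs

preds = itnet_forward(rng.standard_normal(D))
```

Note that the number of unique graph nodes (and the parameters of the iterative part) is independent of `N_ITER`, while the MACs grow linearly with it — the trade-off the text describes.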
Multi-output networks that heavily re-use intermediate activations are beneficial for a wide range of applications, especially in the mobile domain. In an online manner, they allow trading off latency against accuracy with barely any overhead (e.g. Huang et al., 2018). From an application point of view, the benefit of this trade-off is best described by the following two scenarios (Huang et al., 2018): In the so-called anytime prediction scenario, the prediction of a network is progressively updated, whereby the first output defines the initial latency of the network. In the second scenario, a limited computational budget can be unevenly distributed over a set of samples with different "difficulties" in order to increase the average accuracy. Since all nodes in the network graph are computed in parallel on massively-parallel hardware systems (e.g. Esser et al., 2016), the latency for inference is dominated by the depth of the network, i.e. the longest path from input to output (Fischer et al., 2018). Consequently, we prefer networks that compute all scales in parallel (similar to Huang et al., 2018; Ke et al., 2017) and increase their depth for each additional intermediate output over networks that keep the depth constant and progressively increase their width (Yu et al., 2019). Furthermore, multi-scale networks are also beneficial for the integration of global information, as especially required by dense prediction tasks like semantic segmentation (Zhao et al., 2017). To further reduce the latency, we also reduce the depth of the building blocks for each scale. In the deep learning literature, computational costs are usually quantified by counting the parameters and/or the multiply-accumulate operations (MACs) required for the inference of a single sample. For fully convolutional networks, the number of parameters is independent of the spatial resolution
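The budgeted-batch scenario can be sketched directly: spend a MAC budget iteration by iteration and exit early on confident ("easy") samples, leaving budget for harder ones. This is a hypothetical illustration with made-up per-iteration costs and thresholds, not the procedure used in the paper or by Huang et al. (2018):

```python
import numpy as np

MACS_PER_ITER = 1.0  # hypothetical cost of one iteration (arbitrary units)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def budgeted_predict(logits_per_iter, budget, conf_threshold=0.9):
    """Anytime prediction: keep the latest available output; stop when the
    budget is spent or, for 'easy' samples, when the output is confident."""
    pred, spent = None, 0.0
    for logits in logits_per_iter:
        if spent + MACS_PER_ITER > budget:
            break                # budget exhausted: return the latest m_n
        spent += MACS_PER_ITER
        pred = softmax(logits)
        if pred.max() >= conf_threshold:
            break                # early exit frees budget for hard samples
    return pred, spent

# Toy per-iteration logits that sharpen with n (output quality improves).
logits = [np.array([2.0, 0.0, 0.0]) * (n + 1) for n in range(6)]
pred, spent = budgeted_predict(logits, budget=6.0)
```

Here the sample becomes confident after two of six iterations, so two thirds of its budget remains available for more difficult samples in the batch.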



Figure 1: This study in a nutshell. a) The iterative neural network (ItNet): first, images are preprocessed and potentially down-scaled by a sub-network called the data block. Then, the output of this data block is processed by another sub-network that is iteratively executed. After every iteration n, the output of this iterative block is fed back to its input and, at the same time, is further processed by the classification block to predict the semantic map m_n. This network generates multiple outputs m_n with increasing quality and computational costs and heavily re-uses intermediate network activations. While the parameters are shared between iterations of the iterative block, the parameters of the classification block are independent for each n. b-d) Mean intersection-over-union (mIoU) over multiply-accumulate operations (MACs), the size of the computational graph, and the latency, respectively, on the validation set of the Cityscapes dataset. To our knowledge, ESPNetv2 (Mehta et al., 2019) is the state-of-the-art in terms of mIoU over MACs, for which we show the mIoU when not using pre-trained weights (Ruiz & Verbeek, 2019). Wang et al. (2019) share the weights of U-Nets and recurrently connect their bottlenecks. Note that the ItNet requires more MACs, but has a substantially smaller computational graph and a lower latency than the reference networks. The throughput of ItNets is potentially also higher, since the small graph size allows for execution on massively-parallel hardware systems. e) Image, label, and network predictions (with the mIoU in their corner) over the network outputs of the ItNet in (c). The data sample with the 75th-percentile mIoU is shown.

