ITNET: ITERATIVE NEURAL NETWORKS FOR FAST AND EFFICIENT ANYTIME PREDICTION

Abstract

Deep neural networks usually have to be compressed and accelerated for use in low-power devices, e.g. mobile devices. Common requirements are high accuracy, high throughput, low latency, and a small memory footprint. Networks comprising multiple intermediate outputs have been shown to offer a good trade-off between accuracy and latency. In this study, we introduce a multi-output network that has a tiny memory footprint in terms of its computational graph, which allows its execution on novel, massively-parallel hardware accelerators designed for extremely high throughput. To this end, the graph is designed to contain loops by iteratively executing a single network building block. These so-called iterative neural networks enable state-of-the-art results for semantic segmentation on the CamVid and Cityscapes datasets, which are especially demanding in terms of computational resources. In ablation studies, we investigate how intermediate network outputs improve network training, as well as the trade-off between weight sharing over iterations and the network size.

1. INTRODUCTION

For massively-parallel hardware accelerators (Schemmel et al., 2010; Merolla et al., 2014; Yao et al., 2020; Graphcore, 2020a), every neuron and synapse in the network model has its physical counterpart on the hardware system. Usually, by design, memory and computation are no longer separated: neuron activations are computed next to the memory, i.e. the parameters, and fully in parallel. This is in contrast to the rather sequential data processing of CPUs and GPUs, for which the computation of a network model is tiled and the same arithmetic unit is re-used multiple times for different neurons. Since the computation is performed fully in parallel and in memory, the throughput of massively-parallel accelerators is usually much higher than for CPUs and GPUs. This can be attributed to the fact that the latency and power consumption for accessing local memory, as in in-memory computing, are much lower than for computations on CPUs and GPUs, which require frequent access to non-local memory like DRAM (Sze et al., 2017). However, the network graph has to fit into the memory of the massively-parallel hardware accelerators to allow for maximal throughput. If the network graph exceeds the available memory, the hardware would, in principle, have to be re-configured at high frequency, as is the case for CPUs and GPUs, and the throughput would be substantially reduced. Mixed-signal massively-parallel hardware systems usually operate on shorter time scales than digital ones, e.g. compare Schemmel et al. (2010) and Yao et al. (2020) to Merolla et al. (2014) and Graphcore (2020a), and would allow for even higher throughputs.

In order to achieve neural networks with tiny computational graphs, in which nodes are operations and edges are activations, we heavily re-use a single building block of the network (see the iterative block in Figure 1a). Not only is the structure of computations, i.e. the type of network layers including their dimensions and connectivity, identical for each iteration of this building block, but the weights are also shared between iterations. In the computational graphs of these so-called iterative neural networks (ItNet), the re-used building blocks with shared weights can be represented by nodes with self-loops. Compared to conventional feed-forward networks, loops simplify the graph by reducing the number of unique nodes and, consequently, its computational footprint. However, the restriction of sharing weights usually decreases the number of free parameters and, hence, the accuracy of networks. To isolate and quantify this effect, we compare networks with weight sharing to networks for which the parameters of the building blocks are chosen to be independent between iterations of

