SEDONA: SEARCH FOR DECOUPLED NEURAL NETWORKS TOWARD GREEDY BLOCK-WISE LEARNING

Abstract

Backward locking and update locking are well-known sources of inefficiency in backpropagation that prevent layers from being updated concurrently. Several recent works have suggested using local error signals to train network blocks asynchronously to overcome these limitations. However, they often require numerous iterations of trial-and-error to find the best configuration for local training, including how to decouple network blocks and which auxiliary networks to use for each block. In this work, we propose a differentiable search algorithm named SEDONA to automate this process. Experimental results show that our algorithm consistently discovers transferable decoupled architectures for VGG and ResNet variants, and significantly outperforms models trained with end-to-end backpropagation and other state-of-the-art greedy-learning methods on CIFAR-10, Tiny-ImageNet, and ImageNet.

1. INTRODUCTION

Backpropagation (Rumelhart et al., 1986) has made a significant contribution to the success of deep learning as the core learning algorithm for SGD-based optimization. However, backpropagation is sequential in nature and supports only synchronous weight updates. Specifically, its limited concurrency breaks down into two locking problems (Jaderberg et al., 2017). First, update locking: a forward pass must complete before any weight update. Second, backward locking: gradient computation in upper layers must precede that in lower layers. Backpropagation may also be biologically implausible, since the human brain appears to prefer local learning rules over the global movement of error signals (Crick, 1989; Marblestone et al., 2016; Lillicrap et al., 2020).

Greedy block-wise learning is a competitive alternative to backpropagation that overcomes these limitations. It splits the layers into a stack of gradient-isolated blocks, each of which is trained with local error signals. It is therefore possible to compute the gradients of different network components simultaneously, with more fine-grained locks. Limiting the depth of error propagation graphs also reduces the vanishing gradient and increases memory efficiency. Recently, Belilovsky et al. (2019), Nøkland & Eidnes (2019), Belilovsky et al. (2020), and Löwe et al. (2019) empirically demonstrated that greedy block-wise learning can yield performance competitive with end-to-end backpropagation.

However, greedy block-wise learning introduces a group of new architectural decisions. Consider a case where we want to decouple an L-layer network into K blocks for a given K ∈ {1, . . . , L}. The number of all possible groupings is $\binom{L-1}{K-1}$. If we further want to choose one of M candidate auxiliary networks to generate local error gradients for each block, we would have to consider $\binom{L-1}{K-1} M^{K-1}$ different configurations. Moreover, if the local signals are not representative of the global goal, the final performance can be damaged significantly.

In this work, we introduce a novel search method named SEDONA (SEarching for DecOupled Neural Architectures), which enables efficient search of decoupled neural architectures toward greedy block-wise learning. Given a base neural network, SEDONA optimizes the validation loss by grouping layers into blocks and selecting the best auxiliary network for each block. Inspired by DARTS (Liu et al., 2019), we first relax the decision variables representing error propagation graphs to be continuous, so that they can be optimized by gradient descent.
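As a minimal sketch of gradient isolation (a hypothetical two-block linear network with hand-derived gradients, not an architecture from this work), each block below updates its weights from a purely local loss while treating its input as a constant, so no gradient crosses the block boundary:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 8))   # inputs
y = rng.normal(size=(64, 1))   # regression targets

# Two gradient-isolated "blocks". Block 1 is a linear map trained through a
# linear auxiliary head with a local MSE loss; block 2 receives block 1's
# output as a constant (detached) input, mimicking greedy block-wise learning.
W1 = rng.normal(scale=0.1, size=(8, 16))   # block-1 weights
A1 = rng.normal(scale=0.1, size=(16, 1))   # block-1 auxiliary head
W2 = rng.normal(scale=0.1, size=(16, 1))   # block-2 weights

lr, n = 0.05, len(X)
losses = []
for _ in range(300):
    # Block 1: local forward pass and local MSE gradients
    # (constant factors folded into the learning rate).
    h1 = X @ W1
    e1 = h1 @ A1 - y
    gW1 = X.T @ (e1 @ A1.T) / n
    gA1 = h1.T @ e1 / n
    W1 -= lr * gW1
    A1 -= lr * gA1
    # Block 2: its input is detached, so only W2 receives a gradient here;
    # in a real system the two blocks could now update concurrently.
    h1_const = h1                      # no backward path into block 1
    e2 = h1_const @ W2 - y
    W2 -= lr * h1_const.T @ e2 / n
    losses.append(float(np.mean(e2 ** 2)))
```

Because each block's update depends only on locally available quantities, the backward-locking constraint between blocks disappears; the price is that block 1's local objective may not align perfectly with the final loss.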

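To see how quickly the search space described above grows, the configuration count can be computed directly (the specific values of L, K, and M below are illustrative, not taken from the paper):

```python
from math import comb

def num_configurations(L: int, K: int, M: int) -> int:
    """Number of ways to split an L-layer network into K contiguous blocks
    (choose K-1 of the L-1 layer boundaries) and assign one of M auxiliary
    networks to each of the K-1 locally trained blocks."""
    return comb(L - 1, K - 1) * M ** (K - 1)

print(num_configurations(16, 4, 3))  # C(15, 3) * 3**3 = 455 * 27 = 12285
```

Even for a modest 16-layer network with 4 blocks and 3 auxiliary-network candidates, exhaustive trial-and-error over more than twelve thousand configurations is impractical, which motivates a differentiable search.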
