DYNAMIC GRAPH: LEARNING INSTANCE-AWARE CONNECTIVITY FOR NEURAL NETWORKS

Abstract

A common practice when employing deep neural networks is to apply the same architecture to all input instances. However, a fixed architecture may not be representative enough for data with high diversity. To increase model capacity, existing approaches usually employ larger convolutional kernels or deeper network structures, which increases the computational cost. In this paper, we address this issue by proposing the Dynamic Graph Network (DG-Net). The network learns instance-aware connectivity, which creates different forward paths for different instances. Specifically, the network is initialized as a complete directed acyclic graph, where the nodes represent convolutional blocks and the edges represent connection paths. We generate edge weights with a learnable module, called the router, and select the edges whose weights are larger than a threshold to adjust the connectivity of the network structure. Instead of using the same path for every input, DG-Net aggregates features dynamically in each node, which gives the network more representational ability. To facilitate training, we represent the network connectivity of each sample as an adjacency matrix. The matrix is updated to aggregate features in the forward pass, cached in memory, and used for gradient computation in the backward pass. We verify the effectiveness of our method with several static architectures, including MobileNetV2, ResNet, ResNeXt, and RegNet. Extensive experiments on ImageNet classification and COCO object detection show the effectiveness and generalization ability of our approach.

1. INTRODUCTION

Deep neural networks have driven a shift from feature engineering to feature learning. Much of this progress comes from well-designed networks with increasing model capacity (He et al., 2016a; Xie et al., 2017; Huang et al., 2017; Tan & Le, 2019). To achieve superior performance, a common practice is to add more layers (Szegedy et al., 2015) or expand the size of existing convolutions (kernel width, number of channels) (Huang et al., 2019; Tan & Le, 2019; Mahajan et al., 2018). Meanwhile, the computational cost increases significantly, hindering the deployment of these models in realistic scenarios. Instead of adding much more computational burden, we prefer adding sample-dependent modules to networks, increasing model capacity by accommodating the variance of the data. Several existing works attempt to augment networks with sample-dependent modules. For example, the Squeeze-and-Excitation network (SENet) (Hu et al., 2018) learns to scale the activations in the channel dimension conditionally on the input. Conditionally Parameterized Convolution (CondConv) (Yang et al., 2019) uses over-parameterized weights and generates individual convolutional kernels for each sample. GaterNet (Chen et al., 2018) adopts a gate network to extract features and generate sparse binary masks for selecting filters in the backbone network based upon the input. All these methods focus on adjusting the micro structure of neural networks, using a data-dependent module to influence the feature representation at the same level. Recalling the analogy between deep neural networks and the mammalian brain (Rauschecker, 1984): neurons are linked by synapses and are responsible for sensing different information, and the synapses are activated to varying degrees when the neurons perceive external information. This phenomenon inspires us to design a data-dependent network structure so that different samples activate different network paths.
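To make the notion of a sample-dependent module concrete, the following is a minimal numpy sketch of SE-style channel gating as described above: the activations are rescaled per channel, conditionally on the input. The function name `se_gate` and the two-layer bottleneck shapes are illustrative assumptions, not the exact SENet implementation.

```python
import numpy as np

def se_gate(x, w1, w2):
    """SE-style gate (sketch): rescale channels conditionally on the input.
    x has shape (C, H, W); w1 is (r, C) and w2 is (C, r) for bottleneck r."""
    # Squeeze: global average pool over the spatial dimensions -> (C,)
    z = x.mean(axis=(1, 2))
    # Excitation: bottleneck MLP with ReLU, then sigmoid to get (0, 1) scales
    h = np.maximum(w1 @ z, 0.0)
    s = 1.0 / (1.0 + np.exp(-(w2 @ h)))
    # Rescale each channel of the input by its data-dependent weight
    return x * s[:, None, None]
```

The key property is that `s` is a function of `x` itself, so two different inputs are scaled differently even though the module's parameters are fixed.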
In this paper, we learn to optimize the connectivity of neural networks based upon the input. Instead of using stacked-style or hand-designed wirings, we allow a more flexible selection of forward paths. Specifically, we reformulate the network as a directed acyclic graph, where nodes represent convolution blocks and edges indicate connections. Different from randomly wired neural networks (Xie et al., 2019), which generate random graphs as connectivity using predefined generators, we rewire the graph as a complete graph so that all nodes establish connections with each other. Such a setting allows more possible connections and makes the task of finding the most suitable connectivity for each sample equivalent to finding the optimal sub-graph in the complete graph. In the graph, each node aggregates features from the preceding nodes, performs feature transformation (e.g. convolution, normalization, and non-linear operations), and distributes the transformed features to the succeeding nodes. The output of the last node in the topological order is employed as the representation of the graph. To adjust the contribution of different nodes to the feature representation, we further assign weights to the edges in the graph. The weights are generated dynamically for each input via an extra module (denoted as router) attached to each node. During inference, only the crucial connections are maintained, which creates different paths for different instances. As the connectivity for each sample is generated through non-linear functions determined by the routers, our method gives the network more representation power than a static network. We call our method the Dynamic Graph Network (DG-Net). It does not increase the depth or width of the network, and introduces only a negligible extra cost to compute the edge weights and aggregate the features.
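The forward pass described above can be sketched as follows. This is a simplified numpy illustration under stated assumptions: the router is modeled as a single linear map over globally pooled features followed by a sigmoid, `transform` stands in for the node's conv/BN/ReLU block, and all names and shapes are ours, not the paper's exact formulation.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def dgnet_forward(x0, routers, transform, tau=0.5):
    """Sketch of a DG-Net forward pass over a complete DAG.
    routers[i-1] maps pooled features to weights for edges j -> i (j < i)."""
    outs = [x0]                                  # node 0 holds the input features
    for i, w_r in enumerate(routers, start=1):
        pooled = outs[-1].mean(axis=(1, 2))      # global pooling feeds the router
        edge_w = sigmoid(w_r @ pooled)           # one weight per preceding node
        agg = np.zeros_like(x0)
        for j in range(i):
            if edge_w[j] > tau:                  # keep only edges above threshold
                agg = agg + edge_w[j] * outs[j]  # weighted feature aggregation
        outs.append(transform(agg))              # conv/BN/ReLU block stand-in
    return outs[-1]                              # last node in topological order
```

Because the edge weights come from differentiable functions of the input, the routers can be trained jointly with the network parameters by ordinary backpropagation; thresholding at `tau` is what produces a distinct sub-graph per instance at inference time.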
To facilitate the training, we represent the network connectivity of each sample as an adjacency matrix and design a buffer mechanism to cache the matrices of a sample batch during training. With the buffer mechanism, we can conveniently aggregate the feature maps in the forward pass and compute the gradients in the backward pass by looking up the adjacency matrices. The main contributions of our work are as follows:

• We first introduce input-dependent dynamic connectivity to exploit the model capacity of neural networks. Without bells and whistles, simply replacing static connectivity with a dynamic one in many networks achieves solid improvement with only a slight increase in parameters (∼1%) and computational cost (∼2%) (see table 1).

• DG-Net is easy and memory-efficient to train. The parameters of the networks and routers can be optimized in a differentiable manner. We also design a buffer mechanism for convenient access to the network connectivity, in order to aggregate the feature maps in the forward pass and compute the gradients in the backward pass.

BlockDrop (Wu et al., 2018) and HydraNet (Mullapudi et al., 2018) use reinforcement learning to learn the subset of blocks needed to process a given input. Some approaches prune channels (Lin et al., 2017a; You et al., 2019) for efficient inference. However, most prior methods are challenging to train, because they need to make discrete routing decisions for individual examples. Different from these approaches, DG-Net learns continuous weights for connectivity to enable branched propagation of features, so it can be easily optimized in a differentiable way.
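The buffer mechanism for the per-sample adjacency matrices can be sketched as below. The class and method names are hypothetical; the point is only the mechanism: each sample's connectivity is stored once in the forward pass and looked up again, rather than recomputed, when features are aggregated or gradients are routed.

```python
import numpy as np

class ConnectivityBuffer:
    """Sketch of a buffer caching one adjacency matrix per sample, so the
    same connectivity can be looked up in the backward pass (names ours)."""

    def __init__(self):
        self._cache = {}

    def store(self, sample_id, adj):
        # adj[j, i] holds the weight of edge j -> i (0.0 if the edge is pruned)
        self._cache[sample_id] = adj

    def lookup(self, sample_id):
        # Backward pass reads the same matrix used in the forward pass
        return self._cache[sample_id]

    def aggregate(self, sample_id, node_outputs, i):
        """Aggregate features into node i using the cached connectivity."""
        adj = self._cache[sample_id]
        return sum(adj[j, i] * node_outputs[j] for j in range(i))
```

Since only one small matrix per sample is cached, the memory overhead of the buffer is negligible compared with the feature maps themselves.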



• We show that DG-Net not only improves the performance of human-designed networks (e.g. MobileNetV2, ResNet, ResNeXt) but also boosts the performance of automatically searched architectures (e.g. RegNet). It demonstrates good generalization ability on ImageNet classification (see table 1) and COCO object detection (see table 2) tasks.

Different from modularized networks that consist of topologically identical blocks, some work explores more flexible wiring patterns. MaskConnect (Ahmed & Torresani, 2018) removes predefined architectures and learns the connections between modules in the network with k connections. Randomly wired neural networks (Xie et al., 2019) use classical graph generators to yield random wiring instances and achieve competitive performance with manually designed networks. DNW (Wortsman et al., 2019) treats each channel as a node and searches for fine-grained sparse connectivity among layers. TopoNet (Yuan et al., 2020) learns to optimize the connectivity of neural networks in a complete graph that adapts to the specific task. While prior work demonstrates the potential of more flexible wirings, DG-Net pushes the boundaries of this paradigm by enabling each example to be processed with different connectivity.

Dynamic Networks. Dynamic networks, which adjust the network architecture to the corresponding input, have recently been studied in the computer vision domain. SkipNet (Wang et al., 2018b),

