COMNET: CORTICAL MODULES ARE POWERFUL

Abstract

Existing CNN architectures typically achieve efficiency in only one or two dimensions (FLOPs, depth, accuracy, representation power, or latency) but not in all. In this work, we present a pragmatically designed novel CNN architecture, "CoMNet", which offers multi-dimensional efficiency at once: it is simple yet accurate, has lower latency and FLOPs, offers high representation power with limited parameters, low memory consumption, negligible branching, smaller depth, and only a few design hyperparameters. The key to achieving this multi-dimensional efficiency is our use of biological underpinnings in CoMNet, primarily the organization of cortical modules in the visual cortex. To realize CoMNet, a few concepts from well-understood CNN designs, such as residual learning, are directly inherited. Our extensive experimental evaluation demonstrates the superiority of CoMNet over many state-of-the-art architectures dominant in industry and academia, such as ResNet and RepVGG. For instance, CoMNet surpasses ResNet-50 on ImageNet while being 50% shallower, with 22% fewer parameters, 25% lower FLOPs and latency, and 16% fewer training epochs. Code will be open-sourced after the review process.

1. INTRODUCTION

To date, a wide variety of CNN architectures exist: branchless (Krizhevsky et al., 2012; Simonyan & Zisserman, 2014), single branch (He et al., 2016), multi-branch with variable filter topology (Szegedy et al., 2015; 2016), feature reuse (Huang et al., 2017), and the mobile series (Howard et al., 2017; Zhang et al., 2018). Despite their architectural diversity, they all have one characteristic in common: their primary design objectives span only a limited set of dimensions. For instance, (Simonyan & Zisserman, 2014; Szegedy et al., 2015; He et al., 2016; Huang et al., 2017) seek higher accuracy regardless of network size and runtime, while mobile networks (Howard et al., 2017; Zhang et al., 2018) aim for fewer FLOPs at the cost of reduced representation power. Some neural architecture search based designs (Zoph et al., 2018; Tan & Le, 2019) target both higher accuracy and lower FLOPs, but are heavily branched, disregard latency, and run slower on GPUs. More recently, (Ding et al., 2021) focused on accelerating (Simonyan & Zisserman, 2014) during the inference phase while leaving the training phase unaddressed. In contrast, many other dimensions that are crucial in real-time applications, such as autonomous driving and robotics, remain unaddressed. Foremost is latency, or per-sample runtime, which is still ignored in CNNs in favour of FLOPs or throughput. Optimizing latency is crucial because real-time applications tend to process as few as one frame at a time rather than large batches, and because fewer FLOPs do not imply lower latency (Ding et al., 2021). Second, neural networks are replacing their traditional counterparts, posing the challenge of running multiple networks on a single device (Kumar et al., 2020) and bringing in resource efficiency constraints.
Third, the rising complexity of tasks (Kumar et al., 2020; Kumar & Behera, 2019) demands large networks to achieve satisfactory results, because mobile networks are insufficient due to their smaller representation power (Sec. 2). In existing designs, the primary dimensions such as higher accuracy (He et al., 2016; Szegedy et al., 2016), fewer FLOPs (Sandler et al., 2018), and inference-time FLOPs (Ding et al., 2021) are explored only individually. However, considering the above discussion, multi-dimensional efficiency is the need of the hour. Consequently, in this work, we aim to achieve multi-dimensional efficiency at once, while offering strong trade-offs in some dimensions where efficiency in all is not feasible. To the best of our knowledge, multi-dimensional efficiency has not been explored, as it is a difficult task: the dimensions are highly correlated, so improving one often worsens another. Towards this objective, we propose a novel architecture, "CoMNet", that offers the multi-dimensional benefits of lower architectural complexity, smaller depth, compatibility with hardware accelerators, low memory consumption, low memory access cost on parallel computing hardware, low latency, and parameter efficiency. CoMNet is primarily based on our translation of the biological underpinnings of cortical modules (Mountcastle, 1997), which predominantly exist in the visual cortex. We particularly refer to the structure of cortical modules in the ventral stream (Tanaka, 1996), which is responsible for object recognition in mammals. To realize these biological design inspirations, we inherit a few concepts from well-established CNN designs, e.g., residual learning (He et al., 2016). CoMNet outperforms many key CNN designs, and where it does not, the trade-offs are minimal.
Although CoMNet has its design inspired by biological studies, we neither investigate nor assert that CoMNet is functionally similar to the visual cortex. In brief, the key takeaways of the paper are: First, the notion of multi-dimensional efficiency in CNNs. Second, the notion of artificial cortical modules (ACM), which helps achieve high representation power with few parameters, controlled parameter growth, and increased computational density. Third, the concept of columnar organization (Mountcastle, 1997), which helps achieve smaller depth, lower latency and FLOPs, and faster convergence. Fourth, long-range connections similar to pyramidal neurons (Mountcastle, 1997), which further improve the accuracy of CoMNet. We also provide detailed ablations and a minimal design space of CoMNet to help one choose a suitable model. Using GradCAM (Selvaraju et al., 2017), we show that CoMNet learns better data representations. Finally, we briefly discuss BrainScore (Schrimpf et al., 2020) and the future prospects of CoMNet. In the next section, we review the most relevant works, and then brief our biological insights and their translation into the CoMNet architecture. In Sec. 5, we present a rigorous experimental analysis. Finally, in Sec. 6, we draw conclusions.

2. RELATED WORK

Parameters and Representation Power: The earlier CNN designs (Krizhevsky et al., 2012; Simonyan & Zisserman, 2014; Szegedy et al., 2015; He et al., 2016) possess high representation power. In these designs, the deeper layers consist of a large number of channels, e.g., 512 (Simonyan & Zisserman, 2014), to compensate for the reduction in resolution, leading to exponential growth in the parameters and synaptic connections of a kernel in these layers. This becomes a predominant cause of overfitting (Simonyan & Zisserman, 2014), which is alleviated via dropout (Krizhevsky et al., 2012), but at the cost of increased training time. ResNet (He et al., 2016) avoids this by reducing and expanding the number of channels via 1 × 1 conv layers placed before and after the 3 × 3 convolutions. Xie et al. (2017) restructure the residual unit of ResNet in the form of groups; however, the issues of large depth and overall parameter count remain. The mobile CNNs (Howard et al., 2017; Sandler et al., 2018; Zhang et al., 2018; Ma et al., 2018), on the other hand, employ depthwise convolutions (Sifre & Mallat) to control parameter growth and reduce FLOPs. However, representation power decreases quickly. Moreover, depthwise convolutions are devoid of cross-channel context, which is crucial for better performance (Zhang et al., 2018). Therefore, such convolutions are followed by a 1 × 1 conv to intertwine cross-channel information.

Depth: In the above networks, the use of 1 × 1 layers increases network depth rapidly, e.g., two 1 × 1 layers for each 3 × 3 in (He et al., 2016; Sandler et al., 2018; Zhang et al., 2018; Tan & Le, 2019), and one in (Howard et al., 2017). Despite being beneficial, these layers constitute a significant fraction of the depth, e.g., 66% in (He et al., 2016). Moreover, due to their pointwise nature, 1 × 1 convs do not contribute to the receptive field, which is instead governed by the 3 × 3 layers.
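The parameter arithmetic behind these design choices can be sketched as follows. This is an illustrative back-of-the-envelope calculation, not taken from any of the cited papers; the channel width of 512 matches the deep-layer example above, and the bottleneck reduction factor of 4 follows ResNet.

```python
# Illustrative weight counts (biases and BN ignored) for:
#   (a) a standard 3x3 convolution,
#   (b) a ResNet-style bottleneck (1x1 reduce -> 3x3 -> 1x1 expand),
#   (c) a depthwise-separable block (depthwise 3x3 -> pointwise 1x1).

def standard_conv_params(c_in, c_out, k=3):
    return c_in * c_out * k * k

def bottleneck_params(c, reduction=4, k=3):
    mid = c // reduction                       # 1x1 reduces to mid channels
    return c * mid + mid * mid * k * k + mid * c

def depthwise_separable_params(c_in, c_out, k=3):
    depthwise = c_in * k * k                   # one k x k filter per input channel
    pointwise = c_in * c_out                   # 1x1 conv restores cross-channel mixing
    return depthwise + pointwise

c = 512  # a typical deep-layer width
print(standard_conv_params(c, c))        # 2359296
print(bottleneck_params(c))              # 278528
print(depthwise_separable_params(c, c))  # 266752
```

Both the bottleneck and the depthwise-separable block cut parameters by roughly an order of magnitude relative to the plain 3 × 3 convolution, but at the price of the extra 1 × 1 layers, i.e., extra depth, which is exactly the trade-off discussed above.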
Branching: Over time, CNN architectures have grown from branchless (Krizhevsky et al., 2012; Simonyan & Zisserman, 2014) to single branch (He et al., 2016) to multi-branch (Szegedy et al., 2016; Schneider et al., 2017). Neural architecture search has produced even more heavily branched designs (Zoph et al., 2018; Tan & Le, 2019). Although branching helps improve accuracy, it also raises memory access cost on parallel computing hardware (Ding et al., 2021), which directly impacts latency and memory consumption.

Latency: Both depth and heavy branching increase latency despite fewer FLOPs. In CNNs, each layer requires a certain computing time, and many such layers are linked serially. The output of one layer cannot be computed until all outputs of the preceding layers are available, even when ample computing power remains idle. This dramatically increases latency despite fewer calculations per layer. For example, 100 layers of 1 ms each result in 100 ms of latency, while a shallow network of 15 layers with a 3 ms per-layer runtime results in only 45 ms. The best illustration of this phenomenon is (Tan & Le, 2019), which, despite having fewer FLOPs, runs as slowly as a five times bigger network (He et al., 2016). More recently, (Ding et al., 2021) proposed structural reparameterization to accelerate (Simonyan & Zisserman, 2014) during the inference phase; however, the training-phase network is still parameter heavy and branches even more than its predecessor (He et al., 2016), showing no improvement in training time (Table 4).
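The serial-dependency argument can be made concrete with a minimal sketch. The layer counts and per-layer times are the illustrative numbers from the text, and the per-layer FLOPs are hypothetical values chosen only to show that total FLOPs and latency can rank in opposite orders; none of these are measurements.

```python
# Two hypothetical networks on a device whose per-layer capacity exceeds
# what either network uses: layers execute serially, so layer i+1 cannot
# start before layer i finishes, and per-layer runtimes simply add up.
def network_cost(num_layers, flops_per_layer, per_layer_ms):
    total_flops = num_layers * flops_per_layer
    latency_ms = num_layers * per_layer_ms   # serial chain: times accumulate
    return total_flops, latency_ms

deep_flops, deep_ms = network_cost(100, flops_per_layer=2e6, per_layer_ms=1.0)
shallow_flops, shallow_ms = network_cost(15, flops_per_layer=20e6, per_layer_ms=3.0)

print(deep_flops < shallow_flops)  # True: the deep net does fewer total FLOPs...
print(deep_ms > shallow_ms)        # True: ...yet its latency is higher (100 ms vs 45 ms)
```

This is the sense in which fewer FLOPs do not imply lower latency: with small per-layer workloads, the hardware is under-utilized and the serial chain of layers, not the arithmetic, dominates the runtime.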

