SACOD: SENSOR ALGORITHM CO-DESIGN TOWARDS EFFICIENT CNN-POWERED INTELLIGENT PHLATCAM

Abstract

There has been a booming demand for integrating Convolutional Neural Network (CNN) powered functionalities into Internet-of-Things (IoT) devices to enable ubiquitous intelligent "IoT cameras". However, more extensive applications of such IoT systems are still limited by two challenges. First, some applications, especially medicine- and wearable-related ones, impose stringent requirements on the camera form factor. Second, powerful CNNs often require considerable storage and energy, whereas IoT devices often suffer from limited resources. PhlatCam, with its form factor potentially reduced by orders of magnitude, has emerged as a promising solution to the first challenge, while the second remains a bottleneck. Existing compression techniques, which can potentially tackle the second challenge, are far from realizing the full potential in storage and energy reduction, because they mostly focus on the CNN algorithm itself. To this end, this work proposes SACoD, a Sensor Algorithm Co-Design framework for developing more efficient CNN-powered PhlatCam. In particular, the mask coded in the PhlatCam sensor and the backend CNN model are jointly optimized in terms of both model parameters and architectures via differentiable neural architecture search. Extensive experiments, including both simulation and physical measurement on manufactured masks, show that the proposed SACoD framework achieves aggressive model compression and energy savings while maintaining or even boosting the task accuracy, when benchmarked against two state-of-the-art (SOTA) designs on six datasets across four different tasks. We also evaluate the performance of SACoD on the actual PhlatCam imaging system with visualizations and experimental results. All code will be released publicly upon acceptance.

1. INTRODUCTION

Recent CNN breakthroughs have triggered a growing demand for intelligent IoT devices, such as wearables and biology devices (e.g., swallowed endoscopes). However, two major challenges are hampering more extensive applications of CNN-powered IoT devices. First, some applications, especially medicine- and biology-related ones, impose strict requirements on the form factor, especially the thickness, which are often too stringent for existing lens-based imaging systems. Second, powerful CNNs require considerable hardware costs, whereas IoT devices have only limited resources.

For the first challenge, lensless imaging systems (Asif et al., 2015; Shimano et al., 2018; Adams et al., 2017; Antipa et al., 2018; Boominathan et al., 2020) have emerged as a promising remedy. For example, PhlatCam (Boominathan et al., 2020) replaces the focal lenses with a set of phase masks, which encode the incoming light instead of directly focusing it. The encoded information can be either computationally decoded to reconstruct the images or processed specifically for different applications. Such lensless imaging systems can be made much smaller and thinner, because the phase masks are smaller than the focal lens, can be placed much closer to the sensors, and can be fabricated at much lower cost.

For the second challenge, many recent works focus on designing CNNs with improved hardware efficiency, for example by applying generic neural architecture search (NAS) to find efficient CNNs. As such, a naive way to address the two aforementioned challenges simultaneously is to introduce lensless cameras as the signal acquisition frontend and then apply NAS to optimize the backend CNN. However, such approaches would result in disjoint optimization that can be far from optimal. A generic NAS would treat the camera as given and only optimize the CNN. Likewise, existing phase mask designs for lensless cameras treat the CNNs as given and only optimize the masks. Such disjoint optimization fails to (1) take advantage of the masks' potential computational capacity, with which the NAS optimization can be fundamentally improved, and (2) perform end-to-end optimization.

[Figure: The proposed SACoD framework, with panel labels Input and Output.]

It is shown in (Boominathan et al., 2020) that, under some assumptions, the phase masks in PhlatCam essentially perform 2D convolutions on the incoming light, and the convolution kernel is encoded in the masks. Moreover, unlike other convolutional layers, the phase masks' convolutions are almost free: they do not consume additional energy, computation power, or storage, regardless of what value each mask takes. Therefore, we aim to incorporate the phase mask design into NAS to enable an end-to-end optimization of the sensing-processing pipeline, while exempting a portion of the pipeline from the efficiency penalties. Such co-designs are expected to achieve better tradeoffs between accuracy and efficiency. To this end, we propose a Sensor Algorithm Co-Design (SACoD) framework to enable more energy-efficient CNN-powered IoT devices. While SACoD is general and can be applied to different sensing and intelligent processing systems, it is developed and evaluated in the context of PhlatCam (Boominathan et al., 2020) based imaging systems. Our main contributions are:

• We propose SACoD, a novel co-design framework that jointly optimizes the sensor and neural networks to enable more energy-efficient CNN-powered IoT devices. To the best of our knowledge, SACoD is the first to propose sensor algorithm co-design for CNN inference.

• We develop an effective design of the optical layer to (1) exploit its potential computation capability and (2) enable co-search of the optical layer and backend algorithm. We then characterize the trade-off between accuracy and the required area of the corresponding imaging systems to demonstrate its effectiveness under practical size constraints.

• Extensive experiments and ablation studies validate that the proposed SACoD consistently achieves reduced hardware costs/area while offering comparable or even better task accuracy, when evaluated against two SOTA lensless imaging systems on four tasks and six datasets. Part of the experiments are further evaluated with fabricated masks to validate SACoD's effectiveness in physical measurement in addition to simulation.
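The observation that the phase mask acts as a 2D convolution whose kernel is encoded in the mask can be illustrated with a minimal sketch. This is not the SACoD implementation: the function name `conv2d_valid`, the toy `scene`, and the 2x2 kernel `psf` are all hypothetical, and the sketch assumes an idealized, noise-free sensor with a "valid" output region. It only shows that a point source produces a measurement equal to the kernel, which is why the mask can be treated as a convolutional layer with no digital compute cost.

```python
def conv2d_valid(image, kernel):
    """True 2D convolution (kernel flipped), keeping only the 'valid' region."""
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for r in range(ih - kh + 1):
        row = []
        for c in range(iw - kw + 1):
            acc = 0.0
            for i in range(kh):
                for j in range(kw):
                    # Flip the kernel in both axes (convolution, not correlation).
                    acc += image[r + i][c + j] * kernel[kh - 1 - i][kw - 1 - j]
            row.append(acc)
        out.append(row)
    return out


# Hypothetical toy data: a single point source at the centre of a 3x3 scene.
scene = [
    [0, 0, 0],
    [0, 1, 0],
    [0, 0, 0],
]
# Hypothetical kernel standing in for the pattern encoded by the phase mask.
psf = [
    [1, 2],
    [3, 4],
]

# Idealized sensor measurement: the scene convolved with the mask's kernel.
measurement = conv2d_valid(scene, psf)
print(measurement)  # [[1.0, 2.0], [3.0, 4.0]] : a point source reproduces the kernel
```

In a co-design setting, the entries of `psf` would be learnable parameters optimized together with the backend network, whereas here they are fixed for illustration.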

2. RELATED WORKS

Neural architecture search. NAS has recently attracted increasing attention. It eliminates the handcrafting process and automatically designs neural architectures. Existing NAS techniques can be divided into three categories: evolution-based, reinforcement-learning (RL)-based, and one-shot NAS. As the computational overheads of evolution- or RL-based approaches can be unacceptably high, many techniques (Brock et al., 2017; Cai et al., 2018a; Liu et al., 2017; 2018; Pham et al., 2018; Xie et al., 2018) have been proposed to reduce the search cost, among which differentiable architecture search (DARTS) has gained intense interest. While conceptually general, SACoD in this paper adopts the DARTS method, where a super-network is optimized during search and the strongest sub-network is retained and then retrained.

Lensless imaging systems. To eliminate the size or thickness burden caused by lenses, various lensless imaging systems have been developed. While lensless imaging systems have been widely used for capturing X-rays and gamma rays (Dicke, 1968; Caroli et al., 1987), they are still at an exploratory stage for visible-spectrum uses (Asif et al., 2015; Shimano et al., 2018; Antipa et al., 2018; Boom-




