A NEW PHOTORECEPTOR-INSPIRED CNN LAYER ENABLES DEEP LEARNING MODELS OF RETINA TO GENERALIZE ACROSS LIGHTING CONDITIONS

Abstract

As we move our eyes, and as lighting changes in our environment, the light intensity reaching our retinas changes dramatically and on multiple timescales. Despite these changing conditions, our retinas effortlessly extract visual information that allows downstream brain areas to make sense of the visual world. Such processing capabilities are desirable in many settings, including computer vision systems that operate in dynamic lighting environments, such as self-driving cars, and algorithms that translate visual inputs into neural signals for use in vision-restoring prosthetics. To mimic retinal processing, we first require models that can reliably predict retinal ganglion cell (RGC) responses. While existing state-of-the-art deep learning models can accurately predict RGC responses to visual scenes under steady-state lighting conditions, these models fail under dynamic lighting conditions. This is because changes in lighting markedly alter RGC responses: adaptation mechanisms dynamically tune RGC receptive fields on multiple timescales. Because current deep learning models of the retina have no built-in notion of light level or of these adaptive mechanisms, they are unable to accurately predict RGC responses under lighting conditions that they were not trained on. We present here a new deep learning model of the retina that can predict RGC responses to visual scenes at different light levels without requiring training at each light level. Our model combines a fully trainable biophysical front end capturing the fast and slow adaptation mechanisms of the photoreceptors with convolutional neural networks (CNNs) capturing downstream retinal processing. We tested our model's generalization performance across light levels using monkey and rat retinal data.
Whereas conventional CNN models without the photoreceptor layer failed to predict RGC responses when the lighting conditions changed, our model with the photoreceptor layer as a front end fared much better in this challenge. Overall, our work demonstrates a new hybrid approach that equips deep learning models with biological vision mechanisms enabling them to adapt to dynamic environments.

1. INTRODUCTION

A key problem in visual neuroscience is to generate models that can accurately predict how neurons will respond to visual stimuli. Along with their role in basic neuroscience, these models have applications in prosthetic devices and can form the basis for bio-inspired computer vision systems that aim to mimic the impressively robust functions of the human visual system. Machine learning models have become increasingly ubiquitous in such neuroscience applications, given their strong performance in computer vision tasks like object recognition (Chollet, 2017; Simonyan & Zisserman, 2015; Krizhevsky et al., 2017). For example, convolutional neural networks (CNNs) have been used to predict responses of neurons in visual cortex (Kindel et al., 2019; Cadena et al., 2017) and retina (McIntosh et al., 2016; Tanaka et al., 2019; Yan et al., 2022; Goldin et al., 2022) to visual stimuli. We focus here on the retina. Under carefully controlled experimental conditions with constant lighting, CNN models can predict responses of retinal ganglion cells (RGCs, the "output" cells of the retina, whose axons form the optic nerve) to visual stimuli with high accuracy (McIntosh et al., 2016). In natural vision, however, lighting conditions are highly dynamic: the amount of light falling on the retina can change by several orders of magnitude at multiple timescales. For example, light input can change locally on a region of the retina by an order of magnitude in less than a second following rapid eye movements such as saccades; global light levels may fluctuate by an order of magnitude every few seconds on a cloudy day, as clouds pass between the observer and the sun.
These changes in light level substantially alter RGC responses to visual stimuli (Tikidji-Hamburyan et al., 2015; Ruda et al., 2020; Idrees et al., 2020; 2022; Farrow et al., 2013), and pose the as-yet-unanswered question of whether CNNs can accurately predict RGC responses under these more challenging conditions. To answer this question, we first tested the ability of conventional CNN models (i.e., the Deep Retina model of McIntosh et al. (2016)) to predict RGC responses in the presence of changing light levels. These CNNs could accurately predict RGC responses when tested under the same lighting conditions under which they were trained, but were unable to accurately predict responses under different lighting conditions. To improve the performance of machine learning (ML) models in predicting RGC responses under varied lighting conditions, we created a new type of convolutional neural network layer to serve as a front end for CNNs. This fully trainable input layer mimics the transformation of light into electrical signals by the retina's photoreceptors, including the adaptation mechanisms that modulate photoreceptor response sensitivity and kinetics and thereby enable the retina to operate over a wide dynamic range of inputs. This built-in adaptation allows the model to adapt to changing light levels locally, at each pixel location in the input, the same way our retinas do. CNNs with this new input layer could generalize to test lighting conditions very different from those under which they were trained. We anticipate that our new photoreceptor-inspired CNN layer could have significant impact in several areas of vision science. First, our model can be used directly in vision-restoring prosthetics, enabling the translator algorithms to function under diverse lighting conditions and thereby improving the operating range of these prosthetics.
Second, our model can be used by visual neuroscientists to investigate dynamic retinal computations. Using the model as a front end for algorithms that predict the responses of neurons in visual cortex will enable those cortical models to operate under more naturalistic and more varied lighting conditions. Finally, because their local (pixel-by-pixel) adaptation to changing light levels can filter out image changes due to changing lighting and shadow, our photoreceptor-inspired CNNs could have substantial application in computer vision systems operating in outdoor environments. There, lighting changes and shadows are known to confuse existing object detection and recognition algorithms (Yadron & Tynan, 2016; Levin, 2018; Janai et al., 2020; Gomez-Ojeda et al., 2015; Kolaman et al., 2019), as these algorithms do not naturally separate image changes due to local lighting variations from image changes due to changing object content. Photoreceptor-inspired CNN models can instead discount disruptive events in a video stream, such as sudden large changes in the amount of light reflected off a tracked object due to shadows or changes in the intensity of the light source.
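The core idea behind such local luminance adaptation can be illustrated with a toy sketch. The divisive-normalization form below, and the parameters `tau` and `eps`, are illustrative assumptions for exposition only, not the biophysical photoreceptor model used in this work: each pixel is normalized by a running average of its own recent intensity, so stimuli presented at very different light levels converge to similar adapted outputs.

```python
import numpy as np

def adaptive_photoreceptor(frames, tau=0.9, eps=1e-6):
    """Toy per-pixel luminance adaptation (hypothetical sketch).
    Each pixel is divisively normalized by a slow running average of
    its own recent intensity, making the steady-state output roughly
    light-level invariant. In a trainable layer, tau (the adaptation
    time constant) would be a learned parameter.
    frames: array of shape (T, H, W)."""
    running_mean = np.zeros_like(frames[0], dtype=float)
    out = np.empty(frames.shape, dtype=float)
    for t, frame in enumerate(frames):
        running_mean = tau * running_mean + (1.0 - tau) * frame  # slow average
        out[t] = frame / (running_mean + eps)                    # divisive gain
    return out

# Two constant stimuli whose light levels differ tenfold converge to
# nearly identical adapted outputs once the running average settles.
dim = adaptive_photoreceptor(np.full((200, 2, 2), 1.0))
bright = adaptive_photoreceptor(np.full((200, 2, 2), 10.0))
```

Because the normalization is applied per pixel, a shadow falling on part of the image is discounted locally, while spatial contrast within each region is preserved.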

2.1. DEEP LEARNING MODELS OF THE VISUAL SYSTEM

Deep learning approaches have become increasingly popular in modeling the visual system, as they can describe neural responses to a visual scene markedly more accurately than linear-nonlinear (LN) models. The current state-of-the-art retina predictor, Deep Retina (McIntosh et al., 2016), is based on a 2-layer CNN. This model takes as input a movie, from which the CNN layers extract spatiotemporal features, and outputs the spike rates of retinal ganglion cells. When trained to predict tiger salamander retinal ganglion cell responses to spatiotemporal white noise stimuli, this model outperformed other models such as LN models and generalized linear models (GLMs). Such CNN-based models of the retina have been used to describe responses to natural stimuli (McIntosh et al., 2016; Tanaka et al., 2019) and to explain the underlying neural computations that lead to RGC responses (Tanaka et al., 2019; Yan et al., 2022; Goldin et al., 2022). CNN-based models can also capture the activity of other visual areas, such as primary visual cortex (V1), better than standard LN models (Kindel et al., 2019; Cadena et al., 2017). While these models reliably predict responses to visual stimuli with image statistics similar to those of the training set, it is unclear whether they generalize to stimuli with different image statistics, such as different ambient luminance levels.
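The overall shape of such a 2-layer CNN retina model can be sketched as follows. This is a minimal PyTorch illustration in the spirit of Deep Retina, not the published architecture: the filter counts, kernel sizes, frame count, and readout are illustrative assumptions. The model consumes a short clip of grayscale frames stacked along the channel dimension and emits a non-negative firing rate per modeled ganglion cell.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MiniDeepRetina(nn.Module):
    """Hypothetical 2-conv-layer retina model sketch (sizes are
    illustrative, not Deep Retina's published hyperparameters).
    Input: (batch, n_frames, H, W) clip of grayscale frames.
    Output: (batch, n_cells) non-negative firing rates."""

    def __init__(self, n_frames=40, n_cells=5, frame_size=50):
        super().__init__()
        self.conv1 = nn.Conv2d(n_frames, 8, kernel_size=15)  # spatiotemporal features
        self.conv2 = nn.Conv2d(8, 16, kernel_size=11)
        spatial = frame_size - 15 + 1 - 11 + 1               # spatial size after both convs
        self.readout = nn.Linear(16 * spatial * spatial, n_cells)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = F.relu(self.conv2(x))
        # Softplus keeps predicted firing rates non-negative.
        return F.softplus(self.readout(x.flatten(1)))

model = MiniDeepRetina()
rates = model(torch.randn(2, 40, 50, 50))  # shape (2, 5)
```

Stacking frames as input channels is one simple way to expose temporal context to 2D convolutions; the photoreceptor layer proposed in this work would sit in front of such a network, adapting each pixel before the convolutional stages.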

