CAN A FRUIT FLY LEARN WORD EMBEDDINGS?

Abstract

The mushroom body of the fruit fly brain is one of the best-studied systems in neuroscience. At its core it consists of a population of Kenyon cells, which receive inputs from multiple sensory modalities. These cells are inhibited by the anterior paired lateral neuron, thus creating a sparse, high-dimensional representation of the inputs. In this work we study a mathematical formalization of this network motif and apply it to learning the correlational structure between words and their context in a corpus of unstructured text, a common natural language processing (NLP) task. We show that this network can learn semantic representations of words and can generate both static and context-dependent word embeddings. Unlike conventional methods (e.g., BERT, GloVe) that use dense representations for word embedding, our algorithm encodes the semantic meaning of words and their context in the form of sparse binary hash codes. The quality of the learned representations is evaluated on word similarity analysis, word-sense disambiguation, and document classification. It is shown that the fruit fly network motif not only achieves performance comparable to existing NLP methods, but also uses only a fraction of the computational resources (shorter training time and smaller memory footprint).

1. INTRODUCTION

Deep learning has made tremendous advances in computer vision, natural language processing and many other areas. While taking high-level inspiration from biology, the current generation of deep learning methods is not necessarily biologically realistic. This raises the question of whether biological systems can further inform the development of new network architectures and learning algorithms that lead to competitive performance on machine learning tasks or offer additional insights into intelligent behavior. Our work is inspired by this motivation. We study a well-established neurobiological network motif from the fruit fly brain and investigate the possibility of reusing it for solving common machine learning tasks in NLP. We consider this exercise a toy model example illustrating the possibility of "reprogramming" naturally occurring algorithms and behaviors (clustering combinations of input stimuli from the olfactory, visual, and thermo-hygrosensory systems) into a target algorithm of interest (learning word embeddings from raw text) that the original biological organism does not naturally engage in.

The mushroom body (MB) is a major area of the fruit fly brain responsible for processing sensory information. It receives inputs from a set of projection neurons (PNs) conveying information from several sensory modalities. The major modality is olfaction [2], but there are also inputs from PNs responsible for sensing temperature and humidity [29], as well as visual inputs [45; 6]. These sensory inputs are forwarded to a population of approximately 2000 Kenyon cells (KCs) through a set of synaptic weights [26]. The KCs are reciprocally connected to an anterior paired lateral (APL) neuron, which sends a strong inhibitory signal back to the KCs. This recurrent network effectively implements winner-takes-all competition between the KCs, silencing all but a small fraction of the most strongly activated neurons [8]. This is the network motif that we study in this paper; its schematic is shown in Fig. 1. KCs also send their outputs to mushroom body output neurons (MBONs), but this part of the MB network is not included in our mathematical model.

Behaviorally, it is important for a fruit fly to distinguish sensory stimuli, e.g., different odors. If a fruit fly senses a smell associated with danger, it is best to avoid it; if it smells food, the fruit fly might want to approach it. The network motif shown in Fig. 1 is believed to be responsible for clustering sensory stimuli so that similar stimuli elicit similar patterns of neural responses at the level of the KCs, allowing generalization, while distinct stimuli result in different neural responses, allowing discrimination. Importantly, this biological network has evolved to accomplish this task in a very efficient way.

In computational linguistics there is a long tradition [19] of using the distributional properties of linguistic units to quantify semantic similarities between them, as summarized in the famous quote by J. R. Firth: "a word is characterized by the company it keeps" [14]. This idea has led to powerful tools such as Latent Semantic Analysis [9], topic modelling [3], and language models like word2vec [30], GloVe [34], and, more recently, BERT [10], which relies on the Transformer model [44].
Specifically, word2vec models are trained to maximize the likelihood of a word given its context, GloVe models utilize global word-word co-occurrence statistics, and BERT uses a deep neural network with attention to predict masked words (and the next sentence). As such, all these methods utilize the correlations between individual words and their contexts in order to learn useful word embeddings. In our work we ask the following questions: can the correlations between words and their contexts be extracted from raw text by the biological network of KCs shown in Fig. 1? Further, how do the word representations learned by the KCs differ from those obtained by existing NLP methods? Although this network has evolved to process sensory stimuli from olfaction and other modalities, and not to "understand" language, it uses a general-purpose algorithm to embed inputs (from different modalities) into a high-dimensional space with several desirable properties, which we discuss below.

Our approach relies on a recent proposal that the recurrent network of mutually inhibited KCs can be used as a "biological" model for generating sparse binary hash codes for the input data presented at the projection neuron layer [8]. It was argued that a matrix of random weights projecting from the PN layer into the KC layer leads to the highly desirable property of making the generated hash codes locality-sensitive, i.e., placing similar inputs close to each other in the embedding space and pushing distinct stimuli far apart. A subsequent study [39] demonstrated that the locality sensitivity of the hash codes can be significantly increased, compared to the random case, if the matrix of weights from the PNs to the KCs is learned from data. The idea of using the network of KCs with random projections for NLP tasks has also been explored previously in [37]; see the discussion in Section 6.
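To make the above description concrete, the following is a minimal sketch, in the spirit of the random-projection proposal of [8] rather than the learned model developed in this paper, of how the motif generates a sparse binary code: a random PN-to-KC weight matrix followed by a hard top-k winner-takes-all, standing in for APL inhibition. The function name fly_hash and all sizes are illustrative choices, not part of the original model specification.

    import numpy as np

    def fly_hash(x, W, k):
        """Sparse binary code for an input x presented at the PN layer.

        W is the PN-to-KC weight matrix (random in [8], learned from
        data in [39]); k is the number of KCs left active after the
        APL-mediated winner-takes-all competition.
        """
        drive = W @ x                       # feedforward PN -> KC activations
        code = np.zeros(W.shape[0])
        code[np.argsort(drive)[-k:]] = 1.0  # APL silences all but the top k KCs
        return code

    # Illustrative sizes: ~2000 KCs as in the fly mushroom body; the PN
    # dimensionality and the sparsity level k are arbitrary choices here.
    rng = np.random.default_rng(0)
    num_pn, num_kc, k = 50, 2000, 100
    W = rng.standard_normal((num_kc, num_pn))
    x = rng.standard_normal(num_pn)
    print(int(fly_hash(x, W, k).sum()))     # -> 100 active KCs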



Figure 1: Network architecture. Several groups of PNs corresponding to different modalities send their activities to the layer of KCs, which are inhibited through the reciprocal connections to the APL neuron.
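As a quick, self-contained illustration of the locality-sensitivity property discussed above, the sketch below compares the hash codes of a stimulus, a slightly perturbed copy of it, and an unrelated stimulus. All sizes, the noise level, and the overlap helper are illustrative assumptions, not values or code from the paper.

    import numpy as np

    rng = np.random.default_rng(1)
    num_pn, num_kc, k = 50, 2000, 100
    W = rng.standard_normal((num_kc, num_pn))        # random PN -> KC projection

    def fly_hash(x):
        code = np.zeros(num_kc)
        code[np.argsort(W @ x)[-k:]] = 1.0           # top-k winner-takes-all
        return code

    x = rng.standard_normal(num_pn)                  # a stimulus
    x_near = x + 0.05 * rng.standard_normal(num_pn)  # slightly perturbed copy
    x_far = rng.standard_normal(num_pn)              # an unrelated stimulus

    def overlap(a, b):
        return int(np.sum(a * b))                    # number of shared active KCs

    print(overlap(fly_hash(x), fly_hash(x_near)))    # large overlap, close to k
    print(overlap(fly_hash(x), fly_hash(x_far)))     # small overlap, near k*k/num_kc

Similar stimuli thus land on largely overlapping sets of active KCs, while unrelated stimuli share only a chance-level number of active cells, mirroring the generalization/discrimination trade-off described above.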

