SEMI-SUPERVISED AUDIO REPRESENTATION LEARNING FOR MODELING BEEHIVE STRENGTHS

Abstract

Honey bees are critical pollinators for our ecosystem and food security, contributing to 35% of global agricultural yield (Klein et al., 2007). In spite of their importance, beekeeping remains exclusively dependent on human labor and experience-derived heuristics, and it requires frequent checkups to ensure the colony is healthy, which can disrupt the colony. Pollinator populations are increasingly declining due to threats from climate change, pests, and environmental toxicity, making their management more critical than ever for sustained global food security. To start addressing this pressing challenge, we developed an integrated hardware sensing system for beehive monitoring through audio and environment measurements, together with a hierarchical semi-supervised deep learning model, composed of an audio modeling module and a predictor, to model the strength of beehives. The model is trained jointly on an audio reconstruction loss and a prediction loss based on human inspections, so that it captures both low-level audio features and circadian temporal dynamics. We show that this model performs well despite limited labels, and learns an audio embedding useful for characterizing the different sound profiles of beehives. To our knowledge, this is the first application of audio-based deep learning to model beehives and population size in an observational setting across a large number of hives.

1. INTRODUCTION

Pollinators are one of the most fundamental parts of crop production worldwide (Klein et al., 2007). Without honey bee pollinators, there would be a substantial decrease in both the diversity and yield of our crops, including most common produce (van der Sluijs & Vaage, 2016). As a model organism, bees are also often studied through controlled behavioral experiments, as they exhibit complex responses to many environmental factors, many of which are yet to be fully understood. A colony of bees coordinates its efforts to maintain overall health, with different types of bees tasked with various purposes. One of the signature modalities for characterizing bee behavior is the buzzing frequencies emitted through the vibration of the wings, which can correlate with various properties of the surroundings, including temperature, potentially allowing for a descriptive 'image' of the hive in terms of strength (Howard et al., 2013; Ruttner, 1988). However, despite what is known about honey bee behavior and its importance to agriculture and natural diversity, there remains a substantial gap between controlled academic studies and field practice (López-Uribe & Simone-Finstrom, 2019). In particular, beekeepers use their long-tenured experience to derive heuristics for maintaining colonies, which necessitates frequent visual inspections of each frame of every box, many of which make up a single hive. During each inspection, beekeepers visually examine each frame and note any deformities, changes in colony size, the amount of stored food, and the amount of brood maintained by the bees. This process is labor intensive, limiting the number of hives that can be managed effectively. As growing risk factors make human inspection more difficult at scale, computational methods are needed to track changing hive dynamics on a faster timescale and to allow for scalable management.
With modern sensing hardware that can record data for months and scalable modeling with state-of-the-art tools in machine learning, we can begin to tackle some of the challenges facing the management of our pollinators, a key player in ensuring food security for the future.

2. BACKGROUND AND RELATED WORKS

Our work falls broadly within applied machine learning for computational ethology, where automated data collection methods and machine learning models are developed to monitor and characterize biological species in natural or controlled settings (Anderson & Perona, 2014). In the context of honey bees, while there has been substantial work characterizing bee behavior through controlled audio, image, and video data collection with classical signal processing methods, there has not been a large-scale effort studying how current deep learning techniques can be applied at scale to the remote monitoring of beehives in the field. Part of the challenge lies in data collection. Visual sensing within beehives is nearly impossible given the current design of the boxes used to house bees. These boxes are heavily confined, with narrow spaces between many stacked frames where bees hatch, rear brood, and store food. This makes it difficult to position cameras to capture complete data without a redesign of existing boxes. Environment sensors, however, can capture information over a larger region, such as temperature and humidity. Sound, likewise, can travel across many stacked boxes, which are typically made from wood and have good acoustics. Previous works have explored characterizing colony status with audio in highly stereotyped events, such as extremely diseased vs. healthy beehives (Robles-Guerrero et al., 2017) or swarming (Krzywoszyja et al., 2018; Ramsey et al., 2020), where the old queen leaves with a large portion of the original colony. However, we have not seen work that attempts to characterize more sensitive measurements, such as the population of beehives, based on audio.
We were inspired by these works and by the latest advances in hardware sensing and deep learning audio models to collect audio data in a longitudinal setting, across many months and a large number of managed hives, and to attempt to characterize some of the standard hive inspection items through machine learning. While audio makes it possible to capture a more complete picture of the inside of a hive, there are still challenges related to data semantics in the context of annotations. Image and video data can be readily processed and labeled post-collection if the objects of interest are recognizable. With honey bees, however, the sound captured by microphones is extremely difficult to discriminate, even for experts, because it is not semantically meaningful, and sensitivity deviations across microphones make it difficult to compare data across different hives. It is therefore not possible to retrospectively assign labels to the data, making human inspections during data collection the only source of annotations. As beekeepers cannot inspect a hive frequently, due to the large number of hives managed and the potential disturbance caused to the hive, the task becomes one of few-shot learning. In low-shot learning for audio, various works have highlighted the usefulness of semi-supervised or unsupervised objectives and/or of learning an embedding of the audio data, mostly for sound classification or speech recognition (Jansen et al., 2020; Lu et al., 2019). These models typically capture semantic differences between different sound sources. We were inspired by this audio classification work with semi-supervised or contrastive-learning objectives to build an architecture that models our audio and learns an embedding without relying only on task-specific supervision.
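The semi-supervised objective described above, an unsupervised reconstruction term computed on all audio plus a supervised prediction term computed only on the few inspected samples, can be sketched as follows. This is a minimal illustration with toy linear modules standing in for the neural networks; the shapes, the label mask `labeled`, and the trade-off weight `alpha` are all assumptions for illustration, not values from this work.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: a small batch of audio feature frames with sparse labels.
batch, n_mels, latent_dim = 8, 64, 16

# Toy linear "encoder", "decoder", and "predictor" weights stand in for the
# learned neural modules.
W_enc = rng.normal(scale=0.1, size=(n_mels, latent_dim))
W_dec = rng.normal(scale=0.1, size=(latent_dim, n_mels))
w_pred = rng.normal(scale=0.1, size=latent_dim)

x = rng.normal(size=(batch, n_mels))            # audio features (all samples)
y = rng.normal(size=batch)                      # hive-strength labels
labeled = np.array([1, 0, 0, 1, 0, 0, 0, 1.0])  # mask: only inspected samples

z = x @ W_enc                                   # shared audio embedding
x_hat = z @ W_dec                               # reconstruction branch
y_hat = z @ w_pred                              # strength-prediction branch

# Reconstruction loss uses every sample; prediction loss averages
# only over the labeled (inspected) subset.
recon_loss = np.mean((x - x_hat) ** 2)
pred_loss = np.sum(labeled * (y - y_hat) ** 2) / np.sum(labeled)

alpha = 0.5                                     # assumed trade-off weight
joint_loss = recon_loss + alpha * pred_loss
```

Because the unsupervised term covers every recording, the embedding `z` keeps being shaped by abundant unlabeled audio even when inspections are rare, which is the core appeal of this training scheme in the few-shot regime.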
Unlike the audio datasets used in prior works, longitudinal data is unlikely to discretize into distinct groups, because its dynamics shift slowly and continuously over the course of weeks. We therefore assume that, unlike current audio datasets containing distinct classes that can be clustered into sub-types, our data more likely occupy a smooth latent space, owing to the slow progression over time of changing properties, such as the transition between health and low-severity disease, or changes in population size, since bee colonies grow by only around one frame per week during periods of colony growth (Russell et al., 2013; Sakagami & Fukuda, 1968).

3. METHODS

Hive Setup Each hive is composed of multiple 10-frame standard Langstroth boxes stacked on top of one another, with the internal sensor located at the center frame of the bottom-most box and the external sensor on the outside side wall of that box. This sensor placement is based on prior knowledge that bees tend to collect near the bottom box first, before moving up the tower (Winston, 1987). Due to difficulties in obtaining data that would span the spectrum of different colony sizes

