JOINT IMPLICIT NEURAL REPRESENTATIONS FOR GLOBAL-SCALE SPECIES MAPPING

Abstract

Estimating the geographical range of a species from sparse observations is a challenging and important geospatial prediction problem. Given a set of locations where a species has been observed, the goal is to learn a model that can predict whether the species is present or absent at any location. This problem has a long history in ecology, but traditional methods struggle to take advantage of emerging large-scale crowdsourced datasets, which can include tens of millions of records for hundreds of thousands of species. We propose a new approach based on implicit neural representations that jointly estimates the geographical ranges of 47k species. We also introduce a series of benchmarks that measure different aspects of species range estimation and spatial representation learning. We find that our approach scales gracefully, making better predictions as we increase both the number of species used for training and the amount of training data per species. Despite being trained on noisy and biased crowdsourced data, our models can approximate expert-developed gold-standard range maps for many species.

1. INTRODUCTION

We are witnessing a dramatic decline in global biodiversity, which has severe ramifications for natural resource management, food security, and ecosystem services that are crucial to human health (Watson et al., 2019). If we want to take effective conservation action, we must understand where different species live. In ecology, this problem is known as species distribution modeling (SDM) (Elith & Leathwick, 2009). Ideally we would have up-to-date global SDMs for all species. Unfortunately, we only have SDMs for a relatively small number of species and locations.

A key obstacle is that most SDM methods are incompatible with the most common form of species data. An SDM must predict whether a species is present or absent at any location given spatially sparse observation records. With a sufficient amount of presence-absence data (records of where a species has been observed to occur and where it has been confirmed to be absent), this problem can be approached using standard methods from machine learning and statistics (Beery et al., 2021).¹ Most SDM methods require presence-absence data. Unfortunately, presence-absence data is scarce due to the difficulty of verifying that a species is absent from an area. Presence-only data (locations where a species has been observed, with no confirmed absences) is much more abundant. For instance, at the end of 2022 the iNaturalist community science platform (iNa) had collected over 70M observations across 300k species, all of which is presence-only data. Though presence-only data is not without drawbacks (Hastie & Fithian, 2013), it is important to develop SDM methods that can take advantage of this vast supply of data.

Deep learning is currently one of our best tools for making use of large-scale datasets. Deep neural networks also have a key advantage over many prior SDM methods because they can jointly learn the distribution of many species in the same model (Chen et al., 2017; Tang et al., 2018; Mac Aodha et al., 2019). By learning representations that share information across species, the models can make improved predictions (Chen et al., 2017). However, the majority of these deep learning approaches need presence-absence data for training, which prevents them from scaling beyond the small number of species and regions for which sufficient presence-absence data is available.



¹ The term "presence-absence" should not be taken to convey absolute certainty about whether a species is present or absent. False absences and, to a lesser extent, false presences are a serious concern in SDM (MacKenzie et al., 2002; Guillera-Arroita, 2017).

