JOINT IMPLICIT NEURAL REPRESENTATIONS FOR GLOBAL-SCALE SPECIES MAPPING

Abstract

Estimating the geographical range of a species from sparse observations is a challenging and important geospatial prediction problem. Given a set of locations where a species has been observed, the goal is to learn a model that can predict whether the species is present or absent at any location. This problem has a long history in ecology, but traditional methods struggle to take advantage of emerging large-scale crowdsourced datasets, which can include tens of millions of records for hundreds of thousands of species. We propose a new approach based on implicit neural representations that jointly estimates the geographical ranges of 47k species. We also introduce a series of benchmarks that measure different aspects of species range estimation and spatial representation learning. We find that our approach scales gracefully, making better predictions as we scale up the number of species used for training and the amount of training data per species. Despite being trained on noisy and biased crowdsourced data, our models can approximate expert-developed gold standard range maps for many species.

1. INTRODUCTION

We are witnessing a dramatic decline in global biodiversity, which has severe ramifications for natural resource management, food security, and ecosystem services that are crucial to human health (Watson et al., 2019). If we want to take effective conservation action we must understand where different species live. In ecology, this problem is known as species distribution modeling (SDM) (Elith & Leathwick, 2009). Ideally we would have up-to-date global SDMs for all species. Unfortunately, we only have SDMs for a relatively small number of species and locations.

A key obstacle is that most SDM methods are incompatible with the most common form of species data. An SDM must predict whether a species is present or absent at any location given spatially sparse observation records. With a sufficient amount of presence-absence data (records of where a species has been observed to occur and where it has been confirmed to be absent), this problem can be approached using standard methods from machine learning and statistics (Beery et al., 2021).¹

Most SDM methods require presence-absence data. Unfortunately, presence-absence data is scarce due to the difficulty of verifying that a species is absent from an area. Presence-only data, which consists only of locations where a species has been observed, with no confirmed absences, is much more abundant. For instance, at the end of 2022 the iNaturalist community science platform (iNa) had collected over 70M observations across 300k species, all of which is presence-only data. Though presence-only data is not without drawbacks (Hastie & Fithian, 2013), it is important to develop SDM methods that can take advantage of this vast supply of data.

Deep learning is currently one of our best tools for making use of large-scale datasets.
Deep neural networks also have a key advantage over many prior SDM methods because they can jointly learn the distribution of many species in the same model (Chen et al., 2017; Tang et al., 2018; Mac Aodha et al., 2019). By learning representations that share information across species, the models can make improved predictions (Chen et al., 2017). However, the majority of these deep learning approaches need presence-absence data for training, which prevents them from scaling beyond the small number of species and regions for which sufficient presence-absence data is available.

Figure 1: We show that sparse species observation data (left) can be used to learn meaningful geospatial representations (middle). We evaluate the ability of these models to (right): estimate species ranges, assist computer vision classifiers, and transfer to other geospatial prediction tasks.

Our work makes the following contributions:
(i) A novel implicit neural representation approach to joint SDM across tens of thousands of species, trained with crowdsourced presence-only data.
(ii) A detailed investigation of loss functions for learning from presence-only data, their scaling properties, and the resulting geospatial representations.
(iii) A new suite of geospatial benchmark tasks, ranging from species mapping to fine-grained visual categorization, which will facilitate future research on large-scale SDM and geospatial representation learning.
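
To make the problem setup concrete, the following is a minimal illustrative sketch of a coordinate network that maps a location to per-species presence probabilities and is scored on presence-only records. It is not our implementation: the sin/cos location encoding, the single hidden layer, the layer sizes, the function names, and the loss (which simply treats every species not observed at a location as an unverified negative) are all assumptions chosen for brevity.

```python
import numpy as np


def encode_location(lon_deg, lat_deg):
    """Encode longitude/latitude (degrees) as sin/cos features in [-1, 1].

    Assumed encoding for illustration; longitude wraps smoothly at +/-180.
    """
    lon = np.deg2rad(np.asarray(lon_deg, dtype=np.float64))
    lat = np.deg2rad(np.asarray(lat_deg, dtype=np.float64))
    return np.stack([np.sin(lon), np.cos(lon), np.sin(lat), np.cos(lat)], axis=-1)


def init_params(rng, n_features, n_hidden, n_species):
    """Randomly initialise a one-hidden-layer network (hypothetical sizes)."""
    return {
        "W1": rng.normal(0.0, 0.5, (n_features, n_hidden)),
        "b1": np.zeros(n_hidden),
        "W2": rng.normal(0.0, 0.5, (n_hidden, n_species)),
        "b2": np.zeros(n_species),
    }


def predict_presence(params, lon_deg, lat_deg):
    """Return presence probabilities with shape (n_locations, n_species)."""
    x = encode_location(lon_deg, lat_deg)
    hidden = np.maximum(x @ params["W1"] + params["b1"], 0.0)  # ReLU
    logits = hidden @ params["W2"] + params["b2"]
    return 1.0 / (1.0 + np.exp(-logits))  # independent sigmoid per species


def presence_only_loss(probs, observed_species, eps=1e-9):
    """Binary cross-entropy with one positive label per record.

    The observed species is the positive; all other species are treated as
    negatives even though their absence was never confirmed (an assumption).
    """
    targets = np.zeros_like(probs)
    targets[np.arange(len(observed_species)), observed_species] = 1.0
    bce = -(targets * np.log(probs + eps)
            + (1.0 - targets) * np.log(1.0 - probs + eps))
    return bce.mean()


# Toy presence-only records: (longitude, latitude, index of observed species).
rng = np.random.default_rng(0)
params = init_params(rng, n_features=4, n_hidden=16, n_species=5)
lon = np.array([-122.4, 2.35, 151.2])
lat = np.array([37.8, 48.85, -33.9])
species = np.array([0, 3, 1])

probs = predict_presence(params, lon, lat)
loss = presence_only_loss(probs, species)
print(probs.shape, float(loss))
```

Because every species shares the hidden layer, the learned location features are shared across species; only the final layer is species-specific. Training would minimise the loss over the network parameters with any gradient-based optimiser, which is omitted here.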

2. RELATED WORK

Species distribution modeling (SDM) refers to a set of methods that aim to predict where (and sometimes when and in what quantities) species of interest are likely to be found (Elith & Leathwick, 2009). The literature on SDM is vast. Readers interested in an overview of SDM should see the classic review by Elith & Leathwick (2009) or the recent review of SDM for computer scientists by Beery et al. (2021). Note that we focus narrowly on the problem of estimating species range, i.e. we do not consider more complex problems like abundance estimation (Potts & Elith, 2006).

Traditional approaches to SDM train conventional supervised learning models (e.g. logistic regression (Pearce & Ferrier, 2000), random forests (Cutler et al., 2007), etc.) to learn a mapping between hand-designed sets of environmental features (e.g. altitude, average rainfall, etc.) and species presence or absence (Phillips et al., 2004; Elith et al., 2006). Readers interested in traditional SDM approaches should consult Norberg et al. (2019), Valavi et al. (2021; 2022), and the references therein. More recently, deep learning methods have been introduced that instead jointly represent multiple different species within the same model (Chen et al., 2017; Botella et al., 2018b; Tang et al., 2018; Mac Aodha et al., 2019). These models are typically trained on crowdsourced data, which can introduce additional challenges and biases that need to be accounted for during training (Fink et al., 2010; Chen & Gomes, 2019; Johnston et al., 2020; Botella et al., 2021). We build on the work of Mac Aodha et al. (2019), who proposed a neural network approach that forgoes the need for environmental features (as required by e.g. Botella et al. (2018b) and Tang et al. (2018)) by learning to predict species presence from geographical location alone.

The problem of joint SDM with presence-only data can be viewed as an instance of multi-label classification with limited supervision.
In particular, it is an example of single positive multi-label learning (SPML) (Cole et al., 2021; Verelst et al., 2022; Zhou et al., 2022). The goal is to train a model that is capable of making multi-label predictions at test time, despite having only ever observed one positive label per training instance (i.e. no confirmed negative training labels). Our work connects the SPML literature and species range mapping literature, and sets up large-scale joint species distribution modeling as a challenging real-world SPML task. This setting presents significant new challenges for SPML, which has previously been limited to relatively small label spaces (< 100 categories). We investigate the role of the number of categories in our experiments. Some SPML methods such as ROLE (Cole et al., 2021) are not computationally viable when the label space is large. One of our baselines will be the SPML method of Zhou et al. (2022), which is scalable and attains state-of-the-art performance on standard SPML benchmarks.

Our task is related to the growing number of works that use coordinate neural networks for implicitly representing images (Tancik et al., 2020) and 3D scenes (Sitzmann et al., 2019), including those that perform novel-view image synthesis (Mildenhall et al., 2020). There are many design choices in these methods that are being actively studied, including the impact of the activation functions in the network (Sitzmann et al., 2019; Ramasinghe & Lucey, 2022) and the effect of different input encodings (Tancik et al., 2020; Zheng et al., 2022). In most research on implicit neural representa-



¹ The term "presence-absence" should not be taken to convey absolute certainty about whether a species is present or absent. False absences and, to a lesser extent, false presences are a serious concern in SDM (MacKenzie et al., 2002; Guillera-Arroita, 2017).

