SOCIAL AND ENVIRONMENTAL IMPACT OF RECENT DEVELOPMENTS IN MACHINE LEARNING ON BIOLOGY AND CHEMISTRY RESEARCH

Abstract

Recent developments in machine learning have raised concerns about potential societal and environmental effects: rapidly increasing resource use and its associated environmental impact, reproducibility issues and growing exclusivity, the privatization of ML research and the resulting brain-drain from public research, a narrowing of research effort caused by a focus on deep learning, and the introduction of biases through a lack of sociodemographic diversity in data and personnel. These concerns are a current topic of discussion and scientific publication. However, the discussion has so far focused mainly on computer science-adjacent fields, such as computer vision, natural language processing, and basic ML research. Using bibliometric analysis of the complete literature and full-text analysis of the open-access literature, we show that the same observations hold for applied machine learning in chemistry and biology. These developments can affect both basic and applied research, such as drug discovery and development, beyond the known issue of biased data sets.

1. INTRODUCTION

The unprecedented progress of machine learning during the past two decades has been catalysed, and remains driven, by the development of increasingly powerful computer hardware. This progress is enabled by the ability of deep neural networks to scale exceptionally well with increasing data availability and model complexity compared to other approaches. Thus, they can be trained for linear regression on small data sets and, with conceptually simple changes to the network architecture, for language translation or image generation on immense text corpora and image collections. Yet, while comparatively exceptional, the performance of deep neural networks is understood to still scale only linearly with exponentially increasing cost (Schwartz et al., 2020), leading to diminishing returns (Thompson et al., 2021). Among the machine learning community, this has raised concern over the future direction of the field and a growing exclusivity driven by ever-increasing hardware and energy costs (Schwartz et al., 2020; Thompson et al., 2021; Jurowetzki et al., 2021). After a discourse on the intertwined recent history of deep learning and hardware advances, we analyse the applicability of the most prominent concerns raised in machine learning research to applied machine learning research in biology and chemistry. We have categorised these concerns under socioeconomic, scientific, and environmental considerations.

The hard- and software that catalysed rapid developments in machine learning

In late 2002 and early 2003, the release of the Radeon 9700 and GeForce FX video cards introduced a fully programmable graphics pipeline, extending and later replacing the existing fixed-function pipelines. Unlike the fixed-function pipeline, which only allowed the user to supply input matrices and parameters to built-in operations, the programmable pipeline enabled the execution of user-written shader programs on the GPU (Contributors, 2015).
This fundamental change allowed programmers and researchers to exploit the intrinsic parallelism of GPUs two years before Intel would introduce its first dual-core CPU. Within months of the availability of this new hardware and the accompanying APIs, researchers implemented linear algebra methods on GPUs and introduced programming frameworks to use GPUs for general-purpose computations (Thompson et al., 2002; Krüger & West-

