SOCIAL AND ENVIRONMENTAL IMPACT OF RECENT DEVELOPMENTS IN MACHINE LEARNING ON BIOLOGY AND CHEMISTRY RESEARCH

Abstract

Recent developments in machine learning have raised concerns about their potential societal and environmental effects: rapidly increasing resource use and the associated environmental impact, reproducibility issues, growing exclusivity, the privatization of ML research leading to a public-research brain drain, a narrowing of research effort caused by a focus on deep learning, and the introduction of biases through a lack of sociodemographic diversity in data and personnel. However, these discussions and publications have focused mainly on computer science-adjacent fields, including computer vision and natural language processing, and on basic ML research. Using bibliometric analysis of the complete literature and full-text analysis of the open-access literature, we show that the same observations can be made for applied machine learning in chemistry and biology. These developments can potentially affect basic and applied research, such as drug discovery and development, beyond the known issue of biased data sets.

1. INTRODUCTION

The unprecedented progress of machine learning during the past two decades has been catalysed, and remains driven, by the development of increasingly powerful computer hardware. This progress is enabled by the ability of deep neural networks to scale exceptionally well with increasing data availability and model complexity compared to other approaches. Thus, they can be trained for linear regression on small data sets and, with conceptually simple changes to the network architecture, for language translation or image generation on immense text corpora and image collections. While comparatively exceptional, deep neural networks are understood to still only scale linearly at an exponential cost (Schwartz et al., 2020), leading to diminishing returns (Thompson et al., 2021). Among the machine learning community, this has raised concern over the future direction of the field and a growing exclusivity driven by ever-increasing hardware and energy costs (Schwartz et al., 2020; Thompson et al., 2021; Jurowetzki et al., 2021). After a discourse on the intertwined recent history of deep learning and hardware advances, we will analyse the applicability of the most prominent concerns raised in machine learning research to applied machine learning research in biology and chemistry. We have categorised these concerns under socioeconomic, scientific, and environmental considerations.

The hard- and software that catalysed rapid developments in machine learning

In late 2002 and early 2003, the release of the Radeon 9700 and GeForce FX video cards introduced a fully programmable graphics pipeline, extending and later replacing the existing fixed-function pipelines. Unlike the fixed-function pipeline, which allowed the user to supply only input matrices and parameters to built-in operations, the programmable pipeline introduced the execution of user-written shader programs on the GPU (Contributors, 2015). This fundamental change allowed programmers and researchers to exploit the intrinsic parallelism of GPUs two years before Intel would introduce its first dual-core CPU. Within months of the availability of this new hardware and the accompanying APIs, researchers implemented linear algebra methods on GPUs and introduced programming frameworks to use GPUs for general-purpose computations (Thompson et al., 2002; Krüger & Westermann, 2003). This rapid development marked the dawn of general-purpose computing on graphics processing units (GPGPU). In a presentation at ICS '08, Harris highlighted the successes of GPGPU, citing speed-ups in molecular docking, N-body simulations, HD video stream transcoding, and image processing; applications in machine learning were not discussed. However, just one year later, the introduction of GPUs as general-purpose processors catalysed the deep learning explosion of the early 2010s by allowing deep learning algorithms, pioneered by Alexey Ivakhnenko in 1971, to be run within practical time on widely available consumer hardware, when Raina et al. showed that GPUs outperform CPUs by an order of magnitude in large-scale deep unsupervised learning tasks (Ivakhnenko, 1971; Raina et al., 2009).

Hardware and energy requirements increase in machine learning research

In 2010, Cireşan et al. (2010) introduced a multi-layer perceptron (MLP) with up to 12.11 million free parameters, with forward and backward propagation implemented on a GPU using NVIDIA's proprietary CUDA API, presented by Harris at ICS '08 two years before, speeding up these routines by a factor of 40.
In their arXiv paper, they also report the computer's hardware specifications as a "Core2 Quad 9450 2.66GHz processor, 3GB of RAM, and a GTX280 graphics card". The GTX 280 graphics card by NVIDIA was, at the time of the paper's writing, two years old and cost USD 893 when first released (adjusted for inflation). Equipped with this two-year-old hardware costing well below USD 1,000, Cireşan et al. were able to improve upon the state-of-the-art performance on the MNIST classification benchmark set four years prior by Ranzato et al. (2006). As they reported not only the hardware used but also the time it took to train the model, the energy usage of the GPU and CPU, with thermal design powers (TDP) of 236 and 95 Watt, respectively, can be calculated as 114.5h × (236 + 95)W = 37.9kWh. Seven years later, Vaswani et al. (2017) introduced the transformer architecture. The training used 8 NVIDIA Tesla P100 GPUs, priced at ∼USD 55,100 at the time, and took 84 hours, resulting in an overall energy usage of 84h × 8 × 250W = 168kWh.

Hardware and energy requirements explode in applied machine learning

Applying the novel transformer architecture, NVIDIA reportedly trained the 345 million parameter BERT model in 2019, previously introduced by Google the year before (Devlin et al., 2018), on 4 DGX-2H servers (64 Tesla V100s) in 79.2 hours, with a maximum power usage of 12,000 Watt per server, resulting in a total energy use of 3.8 MWh (79.2h × 4 × 12kW). The cost of this system at the time of training was USD 1,596,000. Alternatively, the BERT model could be trained on on-demand Google Cloud GPUs for USD 2.48 per GPU hour, resulting in total costs of USD 12,570 (2.48 × 64 × 79.2). The MT-NLG model presented by NVIDIA and Microsoft in 2021 represents the acceleration of hardware and energy cost in the field (Smith et al., 2022). The 530 billion parameter model was trained on 560 DGX A100 servers (a total of 4,480 NVIDIA A100 80GB Tensor Core GPUs) for 2,160 hours (Rajbhandari et al., 2022). The energy usage of a cluster of 560 DGX A100 servers over 2,160 hours is 7.862 GWh (2,160h × 560 × 6.5kW). Taking the world average electricity price of USD 0.131 per kWh during December 2021, the total electricity bill for training MT-NLG was USD 1,029,922. The total cost of the hardware is hard to estimate, as specialised network hardware is required to build such a cluster; however, the DGX A100 was priced at USD 199,000 on release, resulting in a minimum total hardware cost of USD 111,440,000 (199,000 × 560). Training the model on on-demand Google Cloud GPUs for USD 2,141.82 per GPU month (2,160 hours corresponds to 3 months) results in a total cost of USD 28,786,060.80 (3 × 4,480 × 2,141.82).
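All of the estimates above follow the same back-of-the-envelope formula: energy equals training time multiplied by the number of units and the power draw per unit, and electricity cost is energy multiplied by a price per kWh. The following minimal Python sketch reproduces the figures quoted above; the helper name training_energy_kwh is ours for illustration, and all inputs are the hardware counts, power ratings, and durations reported in the cited sources.

def training_energy_kwh(hours: float, n_units: int, unit_power_kw: float) -> float:
    """Upper-bound energy estimate: every unit is assumed to draw its
    maximum rated power for the entire duration of the training run."""
    return hours * n_units * unit_power_kw

# Cireşan et al. (2010): one GTX 280 (236 W TDP) plus one Core2 Quad
# (95 W TDP) running for 114.5 hours.
mlp_2010 = training_energy_kwh(114.5, 1, (236 + 95) / 1000)   # ~37.9 kWh

# Vaswani et al. (2017): 8 Tesla P100 GPUs (250 W each) for 84 hours.
transformer_2017 = training_energy_kwh(84, 8, 250 / 1000)     # 168 kWh

# BERT: 4 DGX-2H servers (12 kW maximum each) for 79.2 hours.
bert_2019 = training_energy_kwh(79.2, 4, 12)                  # ~3,802 kWh = 3.8 MWh

# MT-NLG: 560 DGX A100 servers (6.5 kW each) for 2,160 hours.
mt_nlg_2021 = training_energy_kwh(2160, 560, 6.5)             # 7,862,400 kWh = 7.862 GWh

# Electricity cost at the December 2021 world average of USD 0.131/kWh.
mt_nlg_cost_usd = mt_nlg_2021 * 0.131                         # ~USD 1.03 million

Note that this is a deliberately simple upper bound: it ignores cooling and networking overhead as well as periods in which the hardware runs below its rated power, so real consumption may differ in either direction.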
Hardware and energy costs drive the de-democratization of machine learning

The examples discussed above illustrate the increasing hardware and energy cost of conducting basic and applied deep learning research. The resulting diminishing returns and the environmental impact have previously been discussed by Thompson et al. (2021) and Schwartz et al. (2020). This development of increasing costs following a potential breakthrough stands in contrast to similar or even more disruptive changes in other fields, such as CRISPR-Cas9 lowering costs in molecular biology or the ever-decreasing costs of genome and RNA sequencing (Ledford, 2015; Wetterstrand, 2021; Gierahn et al., 2017). While CRISPR-Cas9 and affordable sequencing have led to what has been called the democratization of access to sequencing and genome editing (Guernet & Grumolato, 2017; McPherson, 2014; Srivathsan et al., 2019), cutting-edge machine learning research is potentially becoming increasingly expensive and exclusive (Ahmed & Wahed, 2020). Indeed, in a 2020 article on the most cited research articles, all mentioned machine learning research was conducted by, or in collaboration with, OpenAI, Microsoft, and Alphabet (Kingma & Ba, 2014; Ren et al.,

