COMPRESSING MULTIDIMENSIONAL WEATHER AND CLIMATE DATA INTO NEURAL NETWORKS

Abstract

Weather and climate simulations produce petabytes of high-resolution data that are later analyzed by researchers in order to understand climate change or severe weather. We propose a new method of compressing this multidimensional weather and climate data: a coordinate-based neural network is trained to overfit the data, and the resulting parameters are taken as a compact representation of the original grid-based data. At compression ratios ranging from 300× to more than 3,000×, our method outperforms the state-of-the-art compressor SZ3 in terms of weighted RMSE and MAE. It faithfully preserves important large-scale atmospheric structures and does not introduce significant artifacts. When the resulting neural network is used as a 790× compressed dataloader to train the WeatherBench forecasting model, its RMSE increases by less than 2%. This three-order-of-magnitude compression democratizes access to high-resolution climate data and enables numerous new research directions.

1. INTRODUCTION

Numerical weather and climate simulations can produce hundreds of terabytes to several petabytes of data (Kay et al., 2015; Hersbach et al., 2020), and these volumes keep growing as higher-resolution simulations are needed to tackle climate change and the associated extreme weather (Schulthess et al., 2019; Schär et al., 2019). In fact, kilometer-scale climate data are expected to be among the largest, if not the largest, scientific datasets worldwide in the near future. It is therefore valuable to compress these data so that ever-growing supercomputers can perform more detailed simulations while end users gain faster access to the results.

Data produced by numerical weather and climate simulations contain geophysical variables such as geopotential, temperature, and wind speed. They are usually stored as multidimensional arrays in which each element represents one variable evaluated at one point of a multidimensional grid spanning space and time. Most compression methods (Yeh et al., 2005; Lindstrom, 2014; Lalgudi et al., 2008; Liang et al., 2022; Ballester-Ripoll et al., 2018) follow an auto-encoder-like approach that compresses blocks of data into compact representations which can later be decompressed back into the original format. This approach prohibits the flow of information between blocks: larger blocks are required to achieve higher compression ratios, yet larger block sizes also lead to higher latency and lower bandwidth when only a subset of the data is needed. Moreover, even with the largest possible block size, these methods cannot use all the information due to computation or memory limitations, and thus cannot further improve the compression ratio or accuracy.

We present a new lossy compression method for weather and climate data by taking an alternative view: we compress the data by training a neural network to act as a surrogate for the geophysical variables, i.e., a continuous function mapping space and time coordinates to scalar values (Figure 1).
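The coordinate-based view can be illustrated with a toy sketch. The following is not the paper's architecture: the random Fourier features and the closed-form least-squares fit are stand-ins for the trained neural network, and the 1-D signal stands in for a gridded geophysical variable. The point is only the accounting: the model's parameters replace the grid values, and the compression ratio is their size ratio.

```python
import numpy as np

rng = np.random.default_rng(0)

# "Dataset": 10,000 samples of a smooth signal on a 1-D grid,
# standing in for one geophysical variable on a space-time grid.
t = np.linspace(0.0, 1.0, 10_000)
data = np.sin(2 * np.pi * 3 * t) + 0.5 * np.cos(2 * np.pi * 7 * t)

# Random Fourier features followed by a linear readout; a least-squares
# solve stands in for overfitting the network by gradient descent.
B = rng.normal(scale=10.0, size=(1, 64))          # frequency matrix
proj = 2 * np.pi * t[:, None] @ B
feats = np.concatenate([np.sin(proj), np.cos(proj)], axis=1)
w, *_ = np.linalg.lstsq(feats, data, rcond=None)  # "trained" parameters

recon = feats @ w
rmse = np.sqrt(np.mean((recon - data) ** 2))

# Compression ratio = grid values stored / model parameters stored.
n_params = B.size + w.size
ratio = data.size / n_params
print(f"RMSE {rmse:.4f}, compression ratio {ratio:.0f}x")
```

Decompression here is just evaluating `feats @ w` at any coordinate, which is what makes subset access and off-grid interpolation cheap in the full method.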
The input horizontal coordinates are transformed into three-dimensional Cartesian coordinates on the unit sphere, where the distance between two points is monotonically related to their geodesic distance, so that the periodic boundary conditions over the sphere are enforced strictly. The resulting Cartesian coordinates, together with the remaining coordinates, are then transformed into Fourier features before flowing into fully connected feed-forward layers, so that the neural network can capture high-frequency signals (Tancik et al., 2020). After the neural network is trained, its weights are quantized to further shrink the representation.

Besides high compression ratios, our method provides extra features desirable for data analysis. Users can access any subset of the data at a cost proportional only to the size of the subset, because evaluations at different coordinates are independent and thus trivially parallelizable on modern computation devices such as GPUs or multi-core CPUs. In addition, the functional nature of the neural network provides "free interpolation" when accessing coordinates that do not match existing grid points. Both features are impossible or impractical to implement in traditional compressors at a matching compression ratio.
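The two input transforms above can be sketched as follows, using the sphere mapping from Figure 1; the frequency matrix passed to the Fourier feature encoding is an assumed placeholder (in practice its scale and size are hyperparameters).

```python
import numpy as np

def latlon_to_xyz(lat_deg, lon_deg):
    """Map latitude psi / longitude phi (degrees) to Cartesian coordinates
    on the unit sphere: x = cos(psi) sin(phi), y = cos(psi) cos(phi),
    z = sin(psi). Euclidean distance then grows monotonically with geodesic
    distance, and the longitude wrap-around is enforced exactly."""
    psi = np.deg2rad(lat_deg)
    phi = np.deg2rad(lon_deg)
    return np.stack([np.cos(psi) * np.sin(phi),
                     np.cos(psi) * np.cos(phi),
                     np.sin(psi)], axis=-1)

def fourier_features(v, freqs):
    """Sine/cosine encoding of coordinates (Tancik et al., 2020);
    `freqs` is an assumed (d_in, d_feat) frequency matrix."""
    proj = 2 * np.pi * v @ freqs
    return np.concatenate([np.sin(proj), np.cos(proj)], axis=-1)

# Periodic boundary: longitude 0 and 360 degrees map to the same point.
a = latlon_to_xyz(10.0, 0.0)
b = latlon_to_xyz(10.0, 360.0)
assert np.allclose(a, b)

pts = latlon_to_xyz(np.array([0.0, 45.0]), np.array([0.0, 120.0]))
rng = np.random.default_rng(0)
feats = fourier_features(pts, rng.normal(scale=2.0, size=(3, 8)))
```

The wrap-around assertion shows why the sphere mapping is preferable to feeding raw longitudes: no discontinuity is introduced at the 0°/360° seam.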

1.1. RELATED WORK

Compression methods for multidimensional data Existing lossy compression methods compress multidimensional data by transforming it into a space that admits sparse representations, truncating, and then quantizing and optionally entropy-encoding the resulting sparse data. As one of the most successful methods in the area, SZ3 (Liang et al., 2022) finds sparse representations in the space of locally spanning splines. It provides an error-bounded compression method and can achieve a 400× compression ratio on our test dataset. TTHRESH (Ballester-Ripoll et al., 2018) compresses data by decomposing it into lower-dimensional tensors. It performs well on isotropic data such as medical images and turbulence data, with compression ratios of around 300×, but not on heavily stratified weather and climate data, where it either fails to compress or yields poor compression ratios of around 3×. ZFP (Lindstrom, 2014) is another popular compression method that provides a fixed-bitrate mode by truncating blocks of orthogonally transformed data. It provides only low compression ratios that usually do not exceed 10×. While there are lossless methods for multidimensional data (Yeh et al., 2005), they cannot achieve compression ratios above 2× because scientific data stored in multidimensional floating-point arrays rarely contain repeated bit patterns (Zhao et al., 2020). SimFS (Girolamo et al., 2019) is a special lossless compression method that compresses the simu-



Figure 1: Diagram of the neural network structure. The input horizontal coordinates (ψ_i, φ_i) are mapped onto the unit sphere via x_i = cos ψ_i sin φ_i, y_i = cos ψ_i cos φ_i, z_i = sin ψ_i and, together with the remaining coordinates (t_i, p_i), fed to the Fourier feature layer (green). They then flow into a series of fully connected blocks (light blue), where each block consists of two feed-forward layers (dark blue) with batch norms (orange) and a skip connection. Solid lines indicate flows of data in both the compression (training) and decompression (inference) processes, and dashed lines correspond to the compression-only process, which includes the quantization of the trained weights.
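The fully connected block described in the figure caption can be sketched as a forward pass. This is a minimal numpy illustration under stated assumptions: the activation function (a tanh-approximated GELU) and the placement of batch norm before the activation are guesses not fixed by the figure, and the batch norm uses per-batch statistics rather than the learned scale/shift and running statistics a trained model would carry.

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    # Normalize each feature over the batch dimension (sketch only; a
    # trained model would use learned affine parameters and running stats).
    mu = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def gelu(x):
    # tanh approximation of GELU; the activation choice is an assumption.
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def residual_block(x, w1, w2):
    """One block from Figure 1: two feed-forward layers, each followed by
    a batch norm and activation, plus a skip connection around the block."""
    h = gelu(batch_norm(x @ w1))
    h = gelu(batch_norm(h @ w2))
    return x + h  # skip connection

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 128))        # batch of Fourier-encoded coordinates
w1 = rng.normal(size=(128, 128))
w2 = rng.normal(size=(128, 128))
y = residual_block(x, w1, w2)
```

The skip connection keeps the block's input dimension equal to its output dimension, which is why the blocks can be stacked into a series as the figure shows.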

