TRANSFORMERS FOR MODELING PHYSICAL SYSTEMS

Abstract

Transformers are widely used in natural language processing due to their ability to model longer-term dependencies in text. Although these models achieve state-of-the-art performance for many language-related tasks, their applicability outside of the natural language processing field has been minimal. In this work, we propose the use of transformer models for the prediction of dynamical systems representative of physical phenomena. Koopman-based embeddings provide a unique and powerful method for projecting any dynamical system into a vector representation which can then be predicted by a transformer model. The proposed model is able to accurately predict various dynamical systems and outperforms classical methods commonly used in the scientific machine learning literature.

1. INTRODUCTION

The transformer model (Vaswani et al., 2017), built on self-attention, has largely become the state-of-the-art approach for a wide set of natural language processing (NLP) tasks including language modeling, text classification, question answering, etc. Although more recent transformer work is focused on unsupervised pre-training of extremely large models (Devlin et al., 2018; Radford et al., 2019; Dai et al., 2019; Liu et al., 2019), the original transformer model garnered attention due to its ability to outperform other state-of-the-art methods by learning longer-term dependencies without recurrent connections. Given that the transformer model was originally developed for NLP, nearly all related work has been rightfully confined within this field with only a few exceptions. Here, we focus on the development of transformers to model dynamical systems that can replace otherwise expensive numerical solvers. In other words, we are interested in using transformers to learn the language of physics.

The surrogate modeling of physical systems is a research field that has existed for several decades and is a large ongoing effort in scientific machine learning. A surrogate model is an approximate model of a physical phenomenon designed to replace an expensive computational solver that would otherwise be needed to resolve the system of interest. The key characteristic of surrogate models is their ability to model a distribution of initial or boundary conditions rather than learning just one solution. This is arguably essential for justifying the training of a deep learning model over the use of a standard numerical solver. The most tangible applications of surrogates are optimization, design and inverse problems where many repeated simulations are typically needed. With the growing interest in deep learning, deep neural networks have been used for surrogate modeling a large range of physical systems in recent literature.
Standard deep neural network architectures such as auto-regressive (Mo et al., 2019; Geneva & Zabaras, 2020a), residual/Euler (González-García et al., 1998; Sanchez-Gonzalez et al., 2020), and recurrent/LSTM based models (Mo et al., 2019; Tang et al., 2020; Maulik et al., 2020) have largely been demonstrated to be effective at modeling various physical dynamics. Such models generally rely exclusively on the past time-step to provide complete information on the current state of the system's evolution. Particularly for dynamical systems, present machine learning models lack generalizable time-cognisant capabilities to predict multi-time-scale phenomena present in systems including turbulent fluid flow, multi-scale materials modeling, molecular dynamics, chemical processes, etc. Thus, currently adopted models struggle to maintain true physical accuracy for long-time predictions. Much work is needed to scale such deep learning models to larger physical systems that are of scientific and industrial interest.

This work deviates from this pre-existing literature by investigating the use of transformers for the prediction of physical systems, relying entirely on self-attention to model dynamics. In the recent work of Shalova & Oseledets (2020), such self-attention models were tested to learn single solutions of several low-dimensional ordinary differential equations. In this work, we propose a physics-inspired embedding methodology to model a distribution of dynamical solutions. We will demonstrate our model on high-dimensional partial differential equation problems that far surpass the complexity of past works. To the authors' best knowledge, this is the first work to explore transformer NLP architectures for the prediction of physical systems.

2. METHODS

When discussing dynamical systems, we are interested in systems that can be described through a dynamical ordinary or partial differential equation:

φ_t = F(x, φ(t, x, η), ∇_x φ, ∇²_x φ, φ·∇_x φ, ...),  F : R × R^n → R^n,  t ∈ T ⊂ R^+,  x ∈ Ω ⊂ R^m,  (1)

in which φ ∈ R^n is the solution of this differential equation with parameters η, in the time interval T and spatial domain Ω with a boundary Γ ⊂ Ω. This general form can embody a vast spectrum of physical phenomena including fluid flow and transport processes, mechanics and materials physics, and molecular dynamics. In this work, we are interested in learning the set of solutions for a distribution of initial conditions φ_0 ∼ p(φ_0), boundary conditions B(φ) ∼ p(B) ∀x ∈ Γ, or equation parameters η ∼ p(η). This accounts for modeling initial value, boundary value and stochastic problems. We emphasize that this is fundamentally different, more difficult and of greater interest for most scientific applications compared to learning a single solution.

To make this problem applicable to the use of transformer models, the continuous solution is discretized in both the spatial and temporal domains such that the solution of the differential equation is Φ = {φ_0, φ_1, ..., φ_T}; φ_i ∈ R^{n×d}, for which φ_i has been discretized by d points in Ω. We assume an initial state φ_0 and that the time interval T is discretized by T time-steps with a time-step size ∆t. Hence, we pose the problem of modeling a dynamical system as a time-series problem. The machine learning methodology has two core components: the transformer for modeling dynamics and the embedding network for projecting physical states into a vector representation. Similar to NLP, the embedding model is trained prior to the transformer. This embedding model is then frozen and the entire data-set is converted to the embedded space, in which the transformer is then trained, as illustrated in Fig. 1.
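As a concrete illustration, the discretization above can be sketched in a few lines; the toy travelling-wave solution, grid sizes and time-step below are illustrative assumptions, not the paper's settings:

```python
import numpy as np

# Hypothetical sketch: discretizing a continuous solution phi(t, x) into the
# tensor Phi = {phi_0, ..., phi_T}, phi_i in R^{n x d}. All names/sizes are
# illustrative assumptions.
n_state = 3      # number of state variables n
d_points = 64    # number of spatial discretization points d
T_steps = 100    # number of time-steps T
dt = 0.01        # time-step size

# A toy solution: each state variable is a travelling sine wave on Omega = [0, 1]
x = np.linspace(0.0, 1.0, d_points)
t = np.arange(T_steps + 1) * dt
Phi = np.stack([
    np.stack([np.sin(2 * np.pi * (x - ti) * (k + 1)) for k in range(n_state)])
    for ti in t
])  # shape (T+1, n, d): a time-series of discretized physical states

assert Phi.shape == (T_steps + 1, n_state, d_points)
```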
During testing, the embedding decoder is used to reconstruct the physical states from the transformer's predictions.

2.1. TRANSFORMER

The transformer model was originally designed with NLP as the sole application, with word vector embeddings of a passage of text being the primary input (Vaswani et al., 2017). However, recent works have explored using attention mechanisms for different machine learning tasks (Veličković et al., 2017; Zhang et al., 2019; Fu et al., 2019) and a few investigate the use of transformers for applications outside of the NLP field (Chen et al., 2020). This suggests that self-attention, and in particular transformer models, may work well for any problem that can be posed as a sequence of vectors. In this work, the primary input to the transformer is an embedded dynamical system, Ξ = {ξ_0, ξ_1, ..., ξ_T}, where the embedded state at time-step i is denoted as ξ_i ∈ R^e. Given that we are interested in the prediction of a physical time-series, this motivates the usage of a language modeling architecture that is designed for the sequential prediction of words in a body of text. We select the transformer decoder architecture used in the GPT models (Radford et al., 2018; 2019).

Figure 2: The transformer decoder model used for the prediction of physical dynamics.

Our model follows the GPT-2 architecture based on the implementation in the Hugging Face transformer repository (Wolf et al., 2019), but is significantly smaller in size than these modern NLP transformers. This model consists of a stack of transformer decoder layers that use masked attention, as depicted in Fig. 2. The inputs to the transformer are the embedded representation of the physical system with the sinusoidal positional encoding proposed in the original transformer (Vaswani et al., 2017). To train the model, consider a data set of D embedded i.i.d. time-series D = {Ξ^i}_{i=1}^{D} for which we can use the standard time-series Markov model (language modeling) log-likelihood:

L_D = Σ_{i=1}^{D} Σ_{j=1}^{T} −log p(ξ^i_j | ξ^i_{j−k}, ..., ξ^i_{j−1}, θ),

where k is the context window and θ are the model's parameters.
Contrary to the standard NLP approach, which poses the likelihood as a softmax over a dictionary of tokens, the likelihood here is taken as a standard Gaussian between the transformer's prediction and the target embedded value, resulting in an L2 loss. This is because the solution of most physical systems cannot be condensed into a finite discrete set, making tokenization into a finite dictionary, and thus a softmax approach, inapplicable. Training follows the standard auto-regressive method used in GPT (Radford et al., 2018), as opposed to word masking (Devlin et al., 2018), constrained to the embedded space. The physical states, φ_i, have the potential to be very high-dimensional, thus training the transformer in the lower-dimensional embedded space can significantly lower training costs.
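A minimal sketch of this training objective, with a toy single-head causal attention layer standing in for the full GPT-2 decoder stack (all names, sizes and the identity projections are illustrative assumptions):

```python
import numpy as np

# Minimal sketch (not the authors' implementation): the transformer predicts
# the next embedded state, and the Gaussian likelihood reduces to an L2 loss
# between prediction and target embedding.
rng = np.random.default_rng(0)
T, e = 16, 8                      # sequence length and embedding dimension
xi = rng.standard_normal((T, e))  # embedded time-series Xi = {xi_0, ..., xi_{T-1}}

def causal_self_attention(x):
    """Single-head masked (causal) self-attention with identity projections."""
    n = x.shape[0]
    scores = x @ x.T / np.sqrt(x.shape[1])
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)  # block future positions
    scores[mask] = -np.inf
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ x

pred = causal_self_attention(xi[:-1])    # predict from past states only
target = xi[1:]                          # next embedded state
l2_loss = np.mean((pred - target) ** 2)  # Gaussian log-likelihood up to constants
assert np.isfinite(l2_loss)
```

The key point mirrored here is that the loss is computed directly between real-valued embedded vectors, with no vocabulary or softmax involved.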

2.2. EMBEDDING MODEL

The second major component of the machine learning methodology is the embedding model, responsible for projecting the discretized physical state space into a 1D vector representation. In NLP, the standard approach is to tokenize then embed a finite vocabulary of words, syllables or characters using methods such as n-gram models, Byte Pair Encoding (Gage, 1994), Word2Vec (Mikolov et al., 2013a;b), GloVe (Pennington et al., 2014), etc. These methods allow language to be represented by a series of 1D vectors that serve as the input to the transformer. Clearly a finite tokenization and such NLP embeddings are not directly applicable to a physical system, thus we propose our own embedding method designed specifically for dynamical systems. Consider learning the generalized mapping between the system's state space and embedded space: F : R^{n×d} → R^e and G : R^e → R^{n×d}. Naturally, multiple approaches can be used, especially if the dimensionality of the embedded space is less than that of the state space, but this is not always the case.

The primary approach that we propose is a Koopman observable embedding, a technique that can be applied universally to all dynamical systems. Considering the discrete-time form of the dynamical system in Eq. 1, the evolution of the state variables can be abstracted by φ_{i+1} = F(φ_i), for which F is the dynamic map from one time-step to the next. The foundation of Koopman theory states that for any dynamical system of this form, there exists an infinite set of state observables, g(φ_i), that evolve linearly in time such that:

K g(φ_i) ≜ g ∘ F(φ_i),

where K is known as the Koopman operator (Koopman, 1931), which is time-invariant. Namely, the Koopman observables can evolve in time through continual matrix multiplication with the Koopman operator:

g(φ_{i+1}) = K g(φ_i),  g(φ_{i+2}) = K² g(φ_i),  g(φ_{i+3}) = K³ g(φ_i),  ...

in which Kⁿ denotes n matrix products (e.g. K³ = K·K·K).
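The linear evolution of the observables can be sketched with a toy two-dimensional Koopman operator (a rotation matrix chosen purely for illustration):

```python
import numpy as np

# Sketch of linear Koopman evolution: once observables g(phi_i) and a finite
# Koopman operator K are in hand, future observables follow by repeated
# matrix multiplication, g(phi_{i+n}) = K^n g(phi_i). K here is a toy rotation.
theta = 0.1
K = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])  # stable toy Koopman operator
g0 = np.array([1.0, 0.0])                        # observables at time-step i

g3_stepwise = K @ (K @ (K @ g0))                 # three single time-steps
g3_power = np.linalg.matrix_power(K, 3) @ g0     # K^3 g(phi_i)
assert np.allclose(g3_stepwise, g3_power)
```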
Modeling the dynamics of a system through the linear Koopman space can be attractive due to its simplification of the dynamics but also the potential physical insights it brings along with it. Spectral analysis of the Koopman operator can reveal fundamental dynamical modes that drive the system's evolution in time. However, these observables, g(φ_i), are typically unknown and are theoretically infinite-dimensional. Thus Koopman theory can be viewed as a trade-off: the state space is lifted into an observable space with more complex states but simpler dynamics. For practical application, the Koopman operator and observables are finitely approximated. In recent years, various machine learning based approaches have been proposed to learn both the Koopman operator and observables approximated in a finite space for modeling, control and dynamical mode analysis (Takeishi et al., 2017; Li et al., 2017; Lusch et al., 2018; Korda & Mezić, 2018; Otto & Rowley, 2019; Korda et al., 2020; Mezic, 2020). While deep learning methods have enabled greater success with discovering Koopman observables and operators (Lusch et al., 2018; Otto & Rowley, 2019), applications have still been limited to fairly simple systems and do not work for long-time prediction. This is likely due to the approximation of the finite-dimensional Koopman observables, limited data and complete dependence on the discovered Koopman operator K to model the dynamics. Predicting a system through a single linear transform clearly has significant limitations and is fundamentally a naive approach from a machine learning perspective. In this work, we propose using approximate Koopman observables as a methodology to develop embeddings for the transformer model such that F(φ_i) ≜ g(φ_i). As seen in Fig. 3, the embedding model follows a standard encoder-decoder model with the middle latent variables being the Koopman observables.
Embedding model architectures for each numerical example are provided in Appendix A. We introduce a learnable Koopman operator, which takes the form of a banded matrix that is symmetrical about the diagonal to encourage the discovery of dominant dynamical modes rather than high-frequency oscillations (Lusch et al., 2018; Otto & Rowley, 2019). This learned Koopman operator is disposed of once training of the embedding model is complete. Given the data set of physical state time-series, D_Φ = {Φ^i}_{i=1}^{D}, the Koopman embedding model is trained with the following loss:

L_{D_Φ} = Σ_{i=1}^{D} Σ_{j=0}^{T} λ_0 MSE(φ^i_j, G ∘ F(φ^i_j)) + λ_1 MSE(φ^i_j, G ∘ K^j F(φ^i_0)) + λ_2 ||K||²_2,

in which the three terms are a reconstruction, dynamics and decay loss, respectively. The first is a reconstruction loss which ensures a consistent mapping to and from the embedded representation. The second is the Koopman dynamics loss, which pushes ξ_j to follow linear dynamics, resulting in time-steps of similar dynamics having similar embeddings. The last term decays the Koopman operator's parameters to help force the model to discover lower-dimensional dynamical modes. In reference to traditional NLP embeddings, we believe our Koopman observable embedding has a motivation similar to Word2Vec (Mikolov et al., 2013a) as well as more recent embedding methods such as context2vec (Melamud et al., 2016), ELMo (Peters et al., 2018), etc. Namely, these methods are based on the principle of context: words that are contextually related to each other have a similar embedding. The Koopman embedding has a similar objective, encouraging physical realizations containing similar time-invariant dynamical modes to also have similar embeddings. Hence, our goal with the embedding model is not to find the true Koopman observables or operator, but rather to leverage Koopman theory to enforce physical context.
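A hedged sketch of this three-part loss, with stand-in linear encoder/decoder maps and illustrative weights λ_i (the actual neural architectures are given in Appendix A; none of the values below are the paper's settings):

```python
import numpy as np

# Sketch of the three-part embedding loss: reconstruction, Koopman dynamics,
# and operator decay. F/G are stand-in linear encoder/decoder maps and the
# weights lam are illustrative assumptions.
rng = np.random.default_rng(1)
n, e, T = 4, 6, 10
W_enc = rng.standard_normal((e, n)) * 0.1    # encoder F: R^n -> R^e
W_dec = rng.standard_normal((n, e)) * 0.1    # decoder G: R^e -> R^n
K = np.eye(e) * 0.9                          # learnable Koopman operator (toy)
Phi = rng.standard_normal((T, n))            # one training time-series

F = lambda p: W_enc @ p
G = lambda z: W_dec @ z
mse = lambda a, b: np.mean((a - b) ** 2)
lam = (1.0, 1.0, 1e-2)

recon = sum(mse(Phi[j], G(F(Phi[j]))) for j in range(T))
dynamics = sum(mse(Phi[j], G(np.linalg.matrix_power(K, j) @ F(Phi[0])))
               for j in range(T))
decay = np.sum(K ** 2)                       # squared norm of K's parameters
loss = lam[0] * recon + lam[1] * dynamics + lam[2] * decay
assert loss >= 0.0
```

In training, all three terms would be minimized jointly by gradient descent over the encoder, decoder and Koopman operator parameters.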

3. CHAOTIC DYNAMICS

As a foundational numerical example to rigorously compare the proposed model to other classical machine learning techniques, we first look at surrogate modeling of the Lorenz system governed by:

dx/dt = σ(y − x),  dy/dt = x(ρ − z) − y,  dz/dt = xy − βz.

We use the classical parameters ρ = 28, σ = 10, β = 8/3. For this numerical example, we wish to develop a surrogate model for predicting the Lorenz system given a random initial state x_0 ∼ U(−20, 20), y_0 ∼ U(−20, 20) and z_0 ∼ U(10, 40). In other words, we wish to surrogate model various initial value problems for this system of ODEs. The Lorenz system is used here because of its well-known chaotic dynamics, which make it extremely sensitive to numerical perturbations and thus an excellent benchmark for assessing a machine learning model's accuracy.

We plot four separate test cases in Fig. 4, for which only the initial state is provided and the transformer model predicts 320 time-steps. Several test predictions for alternative models are provided in Appendix B. In general, we can see the transformer model is able to yield extremely accurate predictions even beyond its context length. Additionally, we plot the Lorenz solution for 25k time-steps from a numerical solver and predicted by the transformer model in Fig. 5a. Note both have the same structure, which qualitatively indicates that the transformer indeed maintains physical dynamics.
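Data generation for this example might be sketched as follows, using a simple RK4 integrator; the integrator choice and time-step size are assumptions for illustration, not the authors' solver settings:

```python
import numpy as np

# Sketch of generating one Lorenz training time-series with a random initial
# state, using a hand-rolled RK4 integrator (dt is an illustrative assumption).
rho, sigma, beta = 28.0, 10.0, 8.0 / 3.0

def lorenz(s):
    x, y, z = s
    return np.array([sigma * (y - x), x * (rho - z) - y, x * y - beta * z])

def rk4_step(s, dt):
    k1 = lorenz(s)
    k2 = lorenz(s + 0.5 * dt * k1)
    k3 = lorenz(s + 0.5 * dt * k2)
    k4 = lorenz(s + dt * k3)
    return s + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)

rng = np.random.default_rng(2)
s = np.array([rng.uniform(-20, 20), rng.uniform(-20, 20), rng.uniform(10, 40)])
traj = [s]
for _ in range(320):                 # 320 time-steps, as in the test cases
    s = rk4_step(s, 0.01)
    traj.append(s)
traj = np.stack(traj)                # shape (321, 3): one training time-series
assert traj.shape == (321, 3)
```

Repeating this for many sampled initial states yields the distribution of initial value problems the surrogate is trained on.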

4. 2D FLUID DYNAMICS

The next dynamical system we test is transient 2D fluid flow governed by the Navier-Stokes equations:

∂u_i/∂t + u_j ∂u_i/∂x_j = −(1/ρ) ∂p/∂x_i + ν ∂²u_i/∂x_j²,

in which u_i and p are the velocity and pressure, respectively, and ν is the viscosity of the fluid. We consider modeling the classical problem of flow around a cylinder at various Reynolds numbers, defined by Re = u_in·d/ν, in which u_in = 1 and d = 2 are the inlet velocity and cylinder diameter, respectively. In this work, we choose to develop a surrogate model to predict the solution at any Reynolds number between Re ∼ U(100, 750). This is a fairly classical flow to investigate in scientific machine learning with various levels of difficulty (Lee & You, 2017; Morton et al., 2018; Lusch et al., 2018; Han et al., 2019; Xu & Duraisamy, 2020; Geneva & Zabaras, 2020b). Here we choose one of the more difficult forms: modeling the flow starting from a steady-state flow field at t = 0, meaning the model is provided zero information on the structure of the cylinder wake during testing. Training, validation and testing data are obtained using the OpenFOAM simulator (Jasak et al., 2007), from which a rectangular structured sub-domain is sampled centered around the cylinder's wake. Given that the flow is two-dimensional, our model predicts the x-velocity, y-velocity and pressure fields, {u_x, u_y, p} ∈ R^{3×128×64}. The training, validation and test data sets consist of 27, 6 and 7 fluid flow simulations, respectively, with 400 time-steps each at a physical time-step size of ∆t = 0.5. As a baseline model, we train a convolutional encoder-decoder model with a stack of three convolutional LSTMs (Xingjian et al., 2015) in the center. We note that convolutional LSTMs have been used extensively in recent scientific machine learning literature for modeling various physical systems (Han et al., 2019; Wiewel et al., 2019; Tang et al., 2020; Geneva & Zabaras, 2020b), and thus can be considered a state-of-the-art approach.
The convolutional LSTM model is trained for 500 epochs. Additionally, three different embedding methods are implemented. The first is the proposed Koopman observable embedding using a convolutional auto-encoder. This model encodes the fluid x-velocity, y-velocity, pressure and viscosity fields to an embedded dimension of 128, F : R^{4×128×64} → R^{128}; G : R^{128} → R^{3×128×64}. The second embedding method uses the same convolutional auto-encoder model but without the enforcement of Koopman dynamics on the embedded variables. The last embedding method tested is principal component analysis (PCA), a classical dimensionality reduction benchmark for all fields of science. For each embedding method, an identical transformer model with a context length of 128 and 4 transformer decoder layers is trained. Similar to the previous example, the embedding models are trained for 300 epochs when applicable and the transformer is trained for 200.
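The PCA embedding baseline can be sketched via the SVD of the flattened flow snapshots; the shapes below are shrunk for illustration and the data is random, so this is only a shape-level sketch of the technique, not the paper's pipeline:

```python
import numpy as np

# Sketch of the PCA embedding baseline: flatten each flow snapshot and project
# onto the leading principal components. Shapes mirror the paper's fields
# {u_x, u_y, p} in R^{3 x 128 x 64} but are shrunk here; data is random.
rng = np.random.default_rng(3)
snapshots = rng.standard_normal((400, 3 * 16 * 8))  # (time-steps, flattened state)
e = 32                                              # embedding dimension (assumed)

mean = snapshots.mean(axis=0)
U, S, Vt = np.linalg.svd(snapshots - mean, full_matrices=False)
components = Vt[:e]                                 # leading principal directions

embed = lambda phi: components @ (phi - mean)       # F: state -> R^e
recover = lambda xi: components.T @ xi + mean       # G: R^e -> state

xi0 = embed(snapshots[0])
assert xi0.shape == (e,)
```

Unlike the auto-encoder embeddings, this mapping is linear and requires no training beyond the SVD, which is what makes it a useful classical benchmark.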

5. 3D REACTION-DIFFUSION DYNAMICS

The final numerical example to demonstrate the proposed transformer model is a 3D reaction-diffusion system governed by the Gray-Scott model:

∂u/∂t = r_u ∇²u − uv² + f(1 − u),  ∂v/∂t = r_v ∇²v + uv² − (f + k)v,

in which u and v are the concentrations of two species, r_u and r_v are their respective diffusion rates, k is the kill rate and f is the feed rate. This is a classical system of particular application to chemical processes, as it models the following chemical reaction: U + 2V → 3V; V → P. For a set region of feed and kill rates, this seemingly simple system can yield a wide range of complex dynamics (Pearson, 1993; Lee et al., 1993). Hence, under the right settings, this system is an excellent case study to push the proposed methodology to its predictive limits. In this work, we use the parameters r_u = 0.2, r_v = 0.1, k = 0.055 and f = 0.025, which result in a complex dynamical reaction seen in Fig. 8. Akin to the first numerical example, the initial condition of this system is stochastic such that the system is seeded with 3 randomly placed perturbations within the periodic domain.

Despite this system having complex dynamics in 3D space, the transformer is able to produce acceptable predictions with very similar structures as the numerical solver. To increase the model's predictive accuracy, we believe the limitation here is not in the transformer, but rather the amount of training data and the inaccuracies of the embedding model due to the dimensionality reduction needed. This is supported by the fact that increasing the transformer's depth does not yield considerable improvements for the test errors in Table 4. Additional results are provided in Appendix D.
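One explicit finite-difference update of the Gray-Scott system with the stated parameters can be sketched as follows; the grid size, time-step, grid spacing and seeding scheme are illustrative assumptions:

```python
import numpy as np

# Sketch of one explicit finite-difference update of the Gray-Scott model with
# the paper's reaction parameters; grid size, dt and seeding are illustrative.
r_u, r_v, k, f = 0.2, 0.1, 0.055, 0.025
N, dt = 32, 0.5

def laplacian(a):
    # Periodic-domain Laplacian via rolled differences (unit grid spacing assumed)
    return sum(np.roll(a, shift, axis)
               for shift in (1, -1) for axis in (0, 1, 2)) - 6 * a

rng = np.random.default_rng(4)
u = np.ones((N, N, N))
v = np.zeros((N, N, N))
for _ in range(3):                       # three random seed perturbations
    i, j, l = rng.integers(2, N - 2, size=3)
    u[i-2:i+2, j-2:j+2, l-2:l+2] = 0.5
    v[i-2:i+2, j-2:j+2, l-2:l+2] = 0.25

# Forward-Euler step of the two coupled reaction-diffusion equations
u_next = u + dt * (r_u * laplacian(u) - u * v**2 + f * (1 - u))
v_next = v + dt * (r_v * laplacian(v) + u * v**2 - (f + k) * v)
assert u_next.shape == (N, N, N)
```

Iterating such updates from randomly seeded initial conditions is what produces the stochastic training time-series for this example.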

6. CONCLUSION

While transformers and self-attention models have been established as a powerful framework for NLP tasks, the adoption of such methodologies has yet to fully permeate other fields. In this work, we have demonstrated the potential transformers have for modeling the dynamics of physical phenomena. The transformer architecture with self-attention allows the model to learn longer and more complex temporal dependencies than alternative machine learning methods, which is beneficial for many physical systems that exhibit multi-scale dynamics. This can be directly attributed to the transformer's ability to draw information efficiently from many past time-steps through self-attention, rather than just the most recent time-step as in auto-regressive, Euler-based and standard RNN models. The key challenge of using the transformer model is identifying appropriate embeddings to represent the physical state of the system, for which we propose the use of Koopman dynamics to enforce dynamical context. Using the proposed methods, our transformer model can outperform alternative models widely seen in recent scientific machine learning literature. The investigation of unsupervised pre-training of such transformer models, as well as gaining a better understanding of what attention mechanisms imply for physical systems, will be the subject of future work.

A KOOPMAN EMBEDDING MODELS

The architectures for the Koopman embedding models are provided. In general, each is a classical auto-encoder architecture using standard machine learning methodologies. For complete details we refer the reader to the code. 

B LORENZ SUPPLEMENTARY RESULTS

Predictions for all of the tested models for three test cases are plotted in Fig. 12 in which we can see the transformer model is consistently accurate within the plotted time frame. Additional test predictions for the Lorenz system using the transformer model are shown in Fig. 13 . Lastly, in Fig. 14 we illustrate the two different noise levels used to contaminate the training data for the results listed in Table 2 . 



Code available at: [URL available after review]. Supplementary videos available at: https://sites.google.com/view/transformersphysx.



Figure 1: The two training stages for modeling physical dynamics using transformers. (Left to right) The embedding model is first trained using Koopman based dynamics. The embedding model is then frozen (fixed), all training data is embedded and the transformer is trained in the embedded space.

Figure 3: Example of a Koopman embedding for a two-dimensional system using a convolutional encoder-decoder model. The encoder model, F, projects the physical states into the approximate Koopman observable embedding. The decoder model, G, recovers the physical states from the embedding.

Figure 4: Four test case predictions using the transformer model for 320 time-steps.

Figure 5: (a) Lorenz solution of 25k time-steps. (b) The Lorenz map produced by each model. (c) The test relative mean-squared-error (MSE) with respect to time.

Figure 6: Vorticity, ω = ∇ x u y -∇ y u x , of two test case predictions using the proposed transformer with Koopman embeddings at Reynolds numbers 233 (top) and 633 (bottom).

Figure 7: Test set relative mean-squared-error (MSE) of the base-line convolutional LSTM model (ConvLSTM) and transformers with Koopman (KM), auto-encoder (AE) and PCA embedding methods over time.

Training, validation and testing data are obtained from a Runge-Kutta finite difference simulation on a structured grid, {u, v} ∈ R^{2×32×32×32}. The training, validation and test data sets consist of 512, 16 and 56 time-series, respectively, with 200 time-steps each at a physical time-step size of ∆t = 5. A 3D convolutional encoder-decoder is used to embed the two species into a 512-dimensional embedding, F : R^{2×32×32×32} → R^{512}; G : R^{512} → R^{2×32×32×32}. Transformer models with varying depth are trained, all with a context length of 128. All other model and training parameters for the transformer models are consistent. A test prediction using the transformer model is shown in Fig. 8 and the errors for each trained transformer are listed in Table 4.

Figure 8: u volume plots of the target (top) and transformer prediction (bottom) for a test case of the Gray-Scott system. Isosurfaces displayed span the range u = [0.3, 0.5] to show the inner structure.

Figure 9: Fully-connected embedding network with ReLU activation functions for the Lorenz system.

Figure 10: 2D convolutional embedding network with ReLU activation functions for the flow around a cylinder system consisting of 5 convolutional encoding/decoding layers. Each convolutional operator has a kernel size of (3, 3). In the decoder, the feature maps are up-sampled before applying a standard convolution. Additionally, two auxiliary fully-connected networks are used to predict the diagonal and off-diagonal elements of the Koopman operator for each viscosity ν.

Figure 11: 3D convolutional embedding network with leaky ReLU activation functions for the Gray-Scott system. Batch-normalization is used between each of the convolutional layers. In the decoder, the feature maps are up-sampled before applying a standard 3D convolution.

Figure 12: Three Lorenz test case predictions using each tested model for 128 time-steps.

Figure 13: Lorenz test case predictions using the transformer model for 320 time-steps.

Figure 14: Comparison between the clean and contaminated (noisy) training data.



Test set relative mean-squared-error (MSE) for surrogate modeling the Lorenz system at several time-step intervals with noisy data.

The proposed transformer and alternative models' relative mean squared errors for the test set are plotted in Fig. 5c as a function of time and listed in Table 1, segmented into several intervals based on the transformer's context length. We can see that the transformer model is able to achieve the best performance using the least number of parameters, with the exception of the Koopman-only model for early times. However, the Koopman model is unstable for long-time predictions, rendering it useless for surrogate predictions. To quantify the accuracy of the chaotic dynamics of each model, the Lorenz map is plotted in Fig. 5b, which is a well-defined relation between successive local maxima of z despite the Lorenz system's chaotic nature. Calculated using 25k time-step predictions from each model, again we can see that the transformer model agrees best with the numerical solver, indicating that it has learned the physical dynamics best of all the tested models.

Test set relative mean-squared-error (MSE) of each output field for surrogate modeling 2D fluid flow past a cylinder. Models listed include the convolutional LSTM model (ConvLSTM) and transformers with Koopman (KM), auto-encoder (AE) and PCA embedding methods.



ACKNOWLEDGMENTS

The authors acknowledge computing resources provided by the AFOSR Office of Scientific Research through the DURIP program.

C CYLINDER SUPPLEMENTARY RESULTS

Prediction fields of all the tested models for a single test case are plotted in Figs. 15 and 16. While all models are able to perform adequately with qualitatively good results, the transformer model with Koopman embeddings (Transformer-KM) outperforms the alternatives. The evolution of the flow field projected onto the dominant eigenvectors of the learned Koopman operator for two Reynolds numbers is plotted in Fig. 17. This reflects the dynamical modes that were learned by the embedding model to impose physical "context" on the embeddings. Given that the eigenvectors are complex, we plot both the magnitude, |ψ|, and angle, ∠ψ. For both Reynolds numbers it is easy to see the initial transition region, t < 100, before the system enters the periodic vortex shedding. Once in the periodic region, we can see that the higher Reynolds number has a higher frequency, which reflects the increased vortex shedding speed.

