TRANSFORMERS FOR MODELING PHYSICAL SYSTEMS

Abstract

Transformers are widely used in natural language processing due to their ability to model longer-term dependencies in text. Although these models achieve state-of-the-art performance for many language-related tasks, their applicability outside of the natural language processing field has been minimal. In this work, we propose the use of transformer models for the prediction of dynamical systems representative of physical phenomena. The use of Koopman-based embeddings provides a unique and powerful method for projecting any dynamical system into a vector representation, which can then be predicted by a transformer model. The proposed model is able to accurately predict various dynamical systems and outperforms classical methods that are commonly used in the scientific machine learning literature.

1. INTRODUCTION

The transformer model (Vaswani et al., 2017), built on self-attention, has largely become the state-of-the-art approach for a large set of natural language processing (NLP) tasks including language modeling, text classification and question answering. Although more recent transformer work has focused on unsupervised pre-training of extremely large models (Devlin et al., 2018; Radford et al., 2019; Dai et al., 2019; Liu et al., 2019), the original transformer model garnered attention due to its ability to out-perform other state-of-the-art methods by learning longer-term dependencies without recurrent connections. Given that the transformer model was originally developed for NLP, nearly all related work has been rightfully confined within this field with only a few exceptions. Here, we focus on the development of transformers to model dynamical systems that can replace otherwise expensive numerical solvers. In other words, we are interested in using transformers to learn the language of physics.

The surrogate modeling of physical systems is a research field that has existed for several decades and is a large ongoing effort in scientific machine learning. A surrogate model is an approximate model of a physical phenomenon designed to replace an expensive computational solver that would otherwise be needed to resolve the system of interest. The key characteristic of surrogate models is their ability to model a distribution of initial or boundary conditions rather than learning just one solution; this is arguably essential for justifying the training of a deep learning model over the use of a standard numerical solver. The most tangible applications of surrogates are optimization, design and inverse problems, where many repeated simulations are typically needed. With the growing interest in deep learning, deep neural networks have been used to build surrogates for a large range of physical systems in recent literature.
Standard deep neural network architectures such as auto-regressive (Mo et al., 2019; Geneva & Zabaras, 2020a), residual/Euler (González-García et al., 1998; Sanchez-Gonzalez et al., 2020), and recurrent/LSTM based models (Mo et al., 2019; Tang et al., 2020; Maulik et al., 2020) have been demonstrated to be effective at modeling various physical dynamics. Such models generally rely exclusively on the past time-step to provide complete information on the current state of the system's evolution. Particularly for dynamical systems, present machine learning models lack generalizable time-cognizant capabilities to predict the multi-time-scale phenomena present in systems including turbulent fluid flow, multi-scale materials modeling, molecular dynamics and chemical processes. Thus currently adopted models struggle to maintain true physical accuracy over long-time predictions, and much work is needed to scale such deep learning models to larger physical systems of scientific and industrial interest.

This work deviates from the pre-existing literature by investigating the use of transformers for the prediction of physical systems, relying entirely on self-attention to model dynamics. In the recent work of Shalova & Oseledets (2020), such self-attention models were tested to learn single solutions of several low-dimensional ordinary differential equations. In this work, we propose a physics-inspired embedding methodology to model a distribution of dynamical solutions, and we demonstrate our model on high-dimensional partial differential equation problems that far surpass the complexity of past works. To the authors' best knowledge, this is the first work to explore transformer NLP architectures for the prediction of physical systems.

2. METHODS

When discussing dynamical systems, we are interested in systems that can be described through a dynamical ordinary or partial differential equation:

φ_t = F(x, φ(t, x; η), ∇_x φ, ∇_x² φ, φ · ∇_x φ, . . .),   F : R × R^n → R^n,   t ∈ T ⊂ R^+,   x ∈ Ω ⊂ R^m,

in which φ ∈ R^n is the solution of this differential equation with parameters η, on the time interval T and the spatial domain Ω with boundary Γ ⊂ Ω. This general form can embody a vast spectrum of physical phenomena including fluid flow and transport processes, mechanics and materials physics, and molecular dynamics. In this work, we are interested in learning the set of solutions for a distribution of initial conditions φ_0 ∼ p(φ_0), boundary conditions B(φ) ∼ p(B) ∀x ∈ Γ, or equation parameters η ∼ p(η). This accounts for modeling initial value, boundary value and stochastic problems. We emphasize that this is fundamentally different, more difficult and of greater interest for most scientific applications than learning a single solution.

To make this problem amenable to transformer models, the continuous solution is discretized in both the spatial and temporal domains such that the solution of the differential equation is Φ = {φ_0, φ_1, . . . , φ_T}; φ_i ∈ R^{n×d}, where each φ_i has been discretized by d points in Ω. We assume an initial state φ_0 and that the time interval T is discretized into T time-steps of size ∆t. Hence, we pose the problem of modeling a dynamical system as a time-series problem.

The machine learning methodology has two core components: the transformer for modeling dynamics, and the embedding network for projecting physical states into a vector representation. Similar to NLP, the embedding model is trained prior to the transformer. The embedding model is then frozen and the entire data-set is converted to the embedded space, in which the transformer is trained, as illustrated in Fig. 1.
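The discretization above can be made concrete with a small example. The sketch below (hypothetical sizes and a simple explicit finite-difference solver, not the systems studied in this work) builds the sequence Φ = {φ_0, . . . , φ_T} for a 1-D viscous Burgers-type equation on d grid points, then forms the input/target pairs that a time-series model would train on:

```python
import numpy as np

# Hypothetical discretization: n = 1 state variable, d spatial points, T time-steps.
d, T, dt, nu = 64, 50, 1e-3, 0.1
x = np.linspace(0, 2 * np.pi, d, endpoint=False)
dx = x[1] - x[0]

phi = np.sin(x)                      # initial condition phi_0 ~ p(phi_0)
Phi = [phi.copy()]                   # solution sequence {phi_0, ..., phi_T}
for _ in range(T):
    # Explicit update for phi_t = -phi * phi_x + nu * phi_xx on a periodic domain.
    phi_x = (np.roll(phi, -1) - np.roll(phi, 1)) / (2 * dx)
    phi_xx = (np.roll(phi, -1) - 2 * phi + np.roll(phi, 1)) / dx**2
    phi = phi + dt * (-phi * phi_x + nu * phi_xx)
    Phi.append(phi.copy())

Phi = np.stack(Phi)                  # shape (T + 1, d): the discretized solution
inputs, targets = Phi[:-1], Phi[1:]  # time-series pairs (phi_i -> phi_{i+1})
print(Phi.shape, inputs.shape, targets.shape)
```

A surrogate in this setting is trained on many such sequences drawn from p(φ_0), not on a single trajectory.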
During testing, the embedding decoder is used to reconstruct the physical states from the transformer's predictions. 
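The test-time data flow can be sketched as follows. The learned embedding encoder/decoder and the transformer are replaced here by simple linear stand-ins (all matrices below are hypothetical placeholders, not the trained networks; the linear one-step operator is merely in the spirit of the Koopman dynamics used for the embedding). The sketch shows only the structure: encode the initial state, roll out autoregressively in the embedded space, decode back to physical states:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, T = 16, 8, 5                  # state dim, embedding dim, rollout length

# Stand-ins for the trained networks (hypothetical linear maps).
W_enc = rng.standard_normal((k, d)) / np.sqrt(d)  # frozen embedding encoder
W_dec = np.linalg.pinv(W_enc)                     # embedding decoder
K = 0.9 * np.eye(k)                               # stand-in for the one-step dynamics model

phi0 = rng.standard_normal(d)       # initial physical state
z = W_enc @ phi0                    # project into the embedded space
states = [phi0]
for _ in range(T):
    z = K @ z                       # predict the next embedded state
    states.append(W_dec @ z)        # decode back to a physical state

rollout = np.stack(states)          # shape (T + 1, d)
print(rollout.shape)
```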

2.1. TRANSFORMER

The transformer model was originally designed with NLP as its sole application, with word vector embeddings of a passage of text being the primary input (Vaswani et al., 2017). However, recent works have explored using attention mechanisms for different machine learning tasks (Veličković et al., 2017; Zhang et al., 2019; Fu et al., 2019), and a few investigate the use of transformers for applications outside of the NLP field (Chen et al., 2020). This suggests that self-attention, and in turn transformer models, can be applicable beyond language tasks.
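For reference, the scaled dot-product self-attention at the core of the transformer (Vaswani et al., 2017) can be written in a few lines. Below is a minimal single-head sketch with randomly initialized projection matrices (illustrative only; dimensions are hypothetical):

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over a sequence X of shape (T, d_model)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])               # (T, T) pairwise similarities
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    A = np.exp(scores)
    A = A / A.sum(axis=-1, keepdims=True)                 # softmax over keys
    return A @ V                                          # (T, d_k) attended values

rng = np.random.default_rng(0)
T, d_model, d_k = 6, 16, 8
X = rng.standard_normal((T, d_model))                     # embedded input sequence
Wq, Wk, Wv = (rng.standard_normal((d_model, d_k)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)   # (6, 8)
```

For predicting dynamics, a causal mask would additionally restrict step i to attend only to steps ≤ i, so that predictions depend only on past states.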



Code available at: [URL available after review]. Supplementary videos available at: https://sites.google.com/view/transformersphysx.



Figure 1: The two training stages for modeling physical dynamics using transformers. (Left to right) The embedding model is first trained using Koopman based dynamics. The embedding model is then frozen (fixed), all training data is embedded and the transformer is trained in the embedded space.

