TAYLORNET: A TAYLOR-DRIVEN GENERIC NEURAL ARCHITECTURE

Abstract

In this work, we propose the Taylor Neural Network (TaylorNet), a generic neural architecture that parameterizes Taylor polynomials using DNNs without non-linear activation functions. The key challenges of developing TaylorNet lie in: (i) mitigating the curse of dimensionality caused by higher-order terms, and (ii) improving the stability of model training. To overcome these challenges, we first adopt Tucker decomposition to decompose the higher-order derivatives in the Taylor expansion parameterized by DNNs into low-rank tensors. Then we propose a novel reducible TaylorNet to further reduce the computational complexity by removing redundant parameters in the hidden layers. To improve training accuracy and stability, we develop a new Taylor initialization method. Finally, the proposed models are evaluated on a broad spectrum of applications, including image classification, natural language processing (NLP), and dynamical systems. The results demonstrate that our proposed Taylor-Mixer, which replaces the MLP and activation layers in the MLP-Mixer with Taylor layers, achieves comparable accuracy on image classification, and similarly on sentiment analysis in NLP, while significantly reducing the number of model parameters. More importantly, our method can explicitly learn and interpret some dynamical systems with Taylor polynomials. The results also demonstrate that our Taylor initialization significantly improves classification accuracy compared to Xavier and Kaiming initialization.

1. INTRODUCTION

This paper proposes a generic neural architecture, called TaylorNet, that parameterizes Taylor polynomials using deep neural networks. It can be applied to a variety of application domains, including image classification, dynamical systems, and natural language processing (NLP). Importantly, the proposed method does not use non-linear activation functions, thus promoting interpretability of DNNs in some applications, such as dynamical systems. This work is motivated by the growing popularity of physics-guided machine learning (ML) (Jia et al., 2021; Daw et al., 2017), which integrates physical priors into neural networks and thereby endows them with the ability to generalize better to new domains. As a result, physics-guided ML has been widely applied to a variety of areas, such as dynamical systems (Cranmer et al., 2020; Lusch et al., 2018; Greydanus et al., 2019), quantum mechanics (Schütt et al., 2017), and climate change (Kashinath et al., 2021; Pathak et al., 2022). However, existing DNN-based methods are either tailored to specific problems, such as PDEs (Li et al., 2020; Raissi et al., 2017) and dynamics prediction (Greydanus et al., 2019; Lusch et al., 2018; Wang et al., 2019), or produce results that are hard to explain. Hence, the question is: can we develop a generic interpretable neural architecture that can be used in a wide range of machine learning domains?

In this paper, we develop a novel Taylor-driven neural architecture, called TaylorNet, that parameterizes Taylor polynomials using DNNs without non-linear activation functions, as shown in Fig. 1. The proposed TaylorNet is able to generalize to a wide spectrum of ML tasks, ranging from computer vision and dynamical systems to NLP. However, developing TaylorNet poses two main challenges. First, the computational complexity of a Taylor polynomial parameterized by DNNs grows exponentially as the polynomial order increases.
Second, its higher-order terms often lead to training instability. To deal with these challenges, we first adopt Tucker decomposition to decompose the higher-order tensors in TaylorNet into a set of low-rank tensors (Malik & Becker, 2018; Kolda & Bader, 2009). To further reduce its computational complexity, we propose a reducible TaylorNet that removes more redundant learnable parameters in the hidden layers. In order to show the generalization of our architecture, we propose a Taylor-Mixer that uses Taylor layers in place of both the MLP layers and activation functions in the MLP-Mixer (Tolstikhin et al., 2021). Then a new Taylor initialization method is developed to improve the stability and accuracy of the model. Finally, we evaluate the proposed Taylor-Mixer and TaylorNet in a variety of applications, including image classification, dynamical systems, and NLP. The results show that our Taylor-Mixer can achieve comparable accuracy to the MLP-Mixer on image classification while exhibiting a considerable reduction of model parameters. The proposed TaylorNet also explicitly learns and interprets the dynamics of two classic physical systems with polynomials. Besides, our method can also be applied to NLP. The evaluation results on sentiment analysis demonstrate a competitive accuracy to the recently proposed pNLP-Mixer (Fusco et al., 2022). Meanwhile, our Taylor initialization can reach an accuracy that is over 10% higher than the Xavier and Kaiming initialization methods for the proposed TaylorNet (Glorot & Bengio, 2010; He et al., 2015).

[Figure 1: (a) Taylor Layer, with 1st-order and 2nd-order terms; (b) TaylorNet.]
In summary, our contributions include: 1) we design TaylorNet, a novel neural architecture without activation functions that can generalize to a broad spectrum of ML domains, 2) we propose a reducible TaylorNet that remarkably reduces the number of learnable parameters, 3) we propose a new Taylor initialization method to stabilize model training, 4) we develop a Taylor-Mixer that replaces both the MLP layers and activation functions in the MLP-Mixer with Taylor layers, which can achieve comparable accuracy to the MLP-Mixer on image classification and sentiment analysis, and 5) our approach can explicitly learn and explain some dynamical systems with polynomials, making way for interpretable ML.

2. RELATED WORK

Polynomial Neural Networks. We summarize the significant differences between the proposed TaylorNet and existing polynomial neural networks. Earlier work (Nikolaev & Iba, 2006) mainly adopted the Group Method of Data Handling to learn partial descriptors in polynomial neural networks. The pi-sigma network (Shin & Ghosh, 1991) was developed to filter the input features using a predefined basis. These methods, however, fail to scale to high-dimensional data (Chrysos et al., 2020). Recently, researchers designed Π-Nets (Chrysos et al., 2020; 2022), which parameterize polynomial functions using deep neural networks and tensor decomposition. However, the performance of Π-Nets degrades in larger networks. In addition, Π-Net can be viewed as a special case of TaylorNet at expansion point 0, since the CP decomposition it adopts is a special case of Tucker decomposition. Furthermore, we develop a novel Taylor initialization to improve training stability, which Π-Net lacks. Our initialization method is different from the initialization paradigm for Tensorial Convolutional Neural Networks (Pan et al., 2022).

Related Work on Taylor Series. Some recent studies developed Taylor-based neural networks. For example, TaylorSwiftNet (Pourheydari et al., 2021) was developed for time-series prediction, but it is not applicable to high-dimensional classification. Recently, Mao et al. (2022) developed Phy-Taylor to learn physical dynamics based on partial physics knowledge, but it suffers from the curse of dimensionality. Different from these methods, our TaylorNet can be used in a wide spectrum of application domains without using activation functions. Moreover, it can interpret some dynamical systems using Taylor polynomials.

Learning Dynamics. Some researchers developed physics-based DNNs to learn the dynamics of physical systems. For example, Lusch et al.
designed Koopman-operator-based models (Geneva & Zabaras, 2022; Lusch et al., 2018) that map the non-linear dynamics into a linear Koopman representation space for predicting the future states of dynamical systems. Recent studies proposed Hamiltonian and Lagrangian neural networks (Cranmer et al., 2020) that strictly follow conservation laws. However, these methods are designed for specific problems, and they are hard to apply to other domains, such as computer vision and natural language processing (NLP).

Tensor Decomposition. Tensor decomposition (Kolda & Bader, 2009) aims to represent high-dimensional tensors using multilinear operations over factor matrices or tensors. Some popular tensor decomposition methods include CP decomposition (Carroll & Chang, 1970), Tucker decomposition (Tucker, 1966), tensor train (TT) decomposition (Oseledets, 2011), and tensor ring (TR) decomposition (Zhao et al., 2016). Thanks to their ability to reduce computational complexity without breaking the data structure, tensor decomposition techniques are increasingly used in machine learning applications (Wu et al., 2020; Pan et al., 2022; Qiu et al., 2021). Inspired by prior work, we adopt tensor decomposition to deal with the curse of dimensionality in our TaylorNet.

3. PRELIMINARIES

Notations. We summarize the main notations used throughout this paper in Appendix A. Regarding mathematical symbols, scalars are denoted by normal letters, e.g. $x$ and $y$, while vectors are denoted by lowercase boldface letters, e.g. $\mathbf{x}$. In addition, matrices are denoted by uppercase boldface letters, e.g. $\mathbf{X}$, while tensors are denoted by calligraphic letters, e.g. $\mathcal{X}$ and $\mathcal{W}$.

Taylor Polynomial. A Taylor polynomial is able to approximate non-linear smooth functions on an arbitrary compact Hausdorff space according to the Stone-Weierstrass theorem (Stone, 1948). According to Taylor's theorem (Thomas et al., 2005), for a vector-valued multivariate function $f : \mathbb{R}^{d} \to \mathbb{R}^{o}$, its Taylor polynomial at a point $\mathbf{x} = \mathbf{x}_0$ can be expressed as

$$f(\mathbf{x}) \approx \sum_{k=0}^{N} \frac{1}{k!} \left[ \left( \sum_{j=1}^{d} \Delta x_j \frac{\partial}{\partial x_j} \right)^{k} f \right]_{\mathbf{x}_0}, \tag{1}$$

where $\mathbf{x} \in \mathbb{R}^{d}$, $\mathbf{x}_0 \in \mathbb{R}^{d}$, and $\Delta x_j = x_j - x_{j,0}$, $j = 1, \ldots, d$. It can also be written in the following tensor form (Chrysos et al., 2020),

$$f(\mathbf{x}) \approx f(\mathbf{x}_0) + \sum_{k=1}^{N} \mathcal{W}^{[k]} \prod_{j=2}^{k+1} \times_j \Delta\mathbf{x}, \tag{2}$$

where $\times_m$ denotes the mode-$m$ vector product, $\Delta\mathbf{x} = \mathbf{x} - \mathbf{x}_0$, and $\mathcal{W}^{[k]} \in \mathbb{R}^{o \prod_{m=1}^{k} \times d}$ (an $o \times d \times \cdots \times d$ tensor with $k$ copies of $d$) are the scaled higher-order derivatives of $f$ at $\mathbf{x} = \mathbf{x}_0$. However, the problem with the Taylor polynomial is that its computational complexity grows exponentially as the order increases, making it hard to apply to high-dimensional data.

Tucker Decomposition. Tucker decomposition aims to decompose a tensor into a small core tensor multiplied by a matrix along each mode (Tucker, 1966; Kolda & Bader, 2009). In essence, Tucker decomposition can be viewed as a higher-order principal component analysis. Let $\mathcal{X}$ be an $N$-way tensor; then its Tucker decomposition is given by

$$\mathcal{X} = \mathcal{G} \times_1 \mathbf{A}^{(1)} \times_2 \cdots \times_N \mathbf{A}^{(N)}, \tag{3}$$

where $\mathcal{G}$ is the core tensor and $\mathbf{A}^{(n)}$ ($n = 1, \ldots, N$) are the factor matrices. According to mode-$n$ unfolding (Kolda & Bader, 2009), Eq. 3 can be expressed in the following matricized form:

$$\mathbf{X}_{(n)} = \mathbf{A}^{(n)} \mathbf{G}_{(n)} \left( \mathbf{A}^{(N)} \otimes \cdots \otimes \mathbf{A}^{(n+1)} \otimes \mathbf{A}^{(n-1)} \otimes \cdots \otimes \mathbf{A}^{(1)} \right)^{\top}, \tag{4}$$

where $\mathbf{X}_{(n)}$ and $\mathbf{G}_{(n)}$ are the mode-$n$ matricizations of the tensors $\mathcal{X}$ and $\mathcal{G}$, and $\otimes$ denotes the Kronecker product.
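The relation between the Tucker form (Eq. 3) and its matricized form (Eq. 4) can be checked numerically. The following is a minimal NumPy sketch, not the authors' code: the helper names (`mode_n_product`, `unfold`, `kron_except`) and the tensor sizes are assumptions chosen for illustration, and the unfolding uses the column ordering of Kolda & Bader (2009), where the earliest remaining mode varies fastest.

```python
import numpy as np

def mode_n_product(X, A, n):
    """Mode-n product X x_n A: contracts mode n of X with the columns of A."""
    out = np.tensordot(A, X, axes=(1, n))   # the new axis lands at position 0
    return np.moveaxis(out, 0, n)

def unfold(X, n):
    """Mode-n unfolding with the earliest remaining mode varying fastest."""
    return np.reshape(np.moveaxis(X, n, 0), (X.shape[n], -1), order="F")

def kron_except(factors, n):
    """A(N) kron ... kron A(n+1) kron A(n-1) kron ... kron A(1)."""
    out = np.ones((1, 1))
    for m in reversed(range(len(factors))):
        if m != n:
            out = np.kron(out, factors[m])
    return out

rng = np.random.default_rng(0)
G = rng.standard_normal((2, 3, 4))                     # core tensor
factors = [rng.standard_normal((5, 2)),
           rng.standard_normal((6, 3)),
           rng.standard_normal((7, 4))]                # A(1), A(2), A(3)

# Eq. 3: multiply the core by one factor matrix along each mode.
X = G
for n, A in enumerate(factors):
    X = mode_n_product(X, A, n)

# Eq. 4: compare both sides for every mode n.
max_err = max(
    np.abs(unfold(X, n) - factors[n] @ unfold(G, n) @ kron_except(factors, n).T).max()
    for n in range(3)
)
```

Here `max_err` should be at the level of floating-point round-off. The agreement depends on pairing the Fortran-order unfolding with the reversed Kronecker order; other unfolding conventions need a correspondingly different factor order.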

4. PROPOSED METHOD

In this section, we first introduce a lightweight Taylor Neural Network using Tucker decomposition. As an extension, we propose a reducible TaylorNet to further improve the computational efficiency by removing redundant trainable parameters in the middle layers. In order to stabilize the model training process and improve accuracy, a novel Taylor initialization method is developed in this work. Moreover, we present the connection between TaylorNet and some other existing neural networks.

4.1. TAYLOR NEURAL NETWORKS

Since the mapping function $f$ in Eq. 2 is unknown and needs to be learned by deep neural networks, we cannot calculate the derivatives $\mathcal{W}^{[k]}$ of $f$ directly. To deal with this issue, this work adopts deep neural networks to parameterize the Taylor polynomial in Eq. 2, where $f(\mathbf{x}_0)$ and $\mathcal{W}^{[k]}$ ($k = 1, \ldots, N$) are learnable parameters during model training. However, the computational complexity of the tensor $\mathcal{W}^{[k]}$ grows exponentially, as $O(d^{k})$, when the polynomial order $k$ increases. To overcome this issue, Tucker decomposition is adopted in this work. According to Eq. 3, the scaled derivatives $\mathcal{W}^{[k]}$ can be decomposed into

$$\mathcal{W}^{[k]} = \mathcal{G}^{[k]} \times_1 \mathbf{O}_k \times_2 \mathbf{I}_{k1} \cdots \times_{k+1} \mathbf{I}_{kk} = \mathcal{G}^{[k]} \times_1 \mathbf{O}_k \prod_{j=1}^{k} \times_{j+1} \mathbf{I}_{kj}, \tag{5}$$

where $\mathcal{G}^{[k]} \in \mathbb{R}^{r_{\text{out},k} \prod_{j=1}^{k} \times r_{\text{in},k,j}}$ is the core tensor, and $\mathbf{I}_{kj} \in \mathbb{R}^{d \times r_{\text{in},k,j}}$ ($j = 1, \ldots, k$) and $\mathbf{O}_k \in \mathbb{R}^{o \times r_{\text{out},k}}$ are the input and output factor matrices, respectively. Here $r_{\text{in},k,j}$ and $r_{\text{out},k}$ represent the Tucker ranks corresponding to the $j$-th input dimension and the output dimension in the $k$-th-order term of the Taylor polynomial. Substituting Eq. 5 into Eq. 2, the $k$-th term of the Taylor polynomial can be written as

$$\mathcal{W}^{[k]} \prod_{j=2}^{k+1} \times_j \Delta\mathbf{x} = \mathcal{W}^{[k]} \prod_{j=2}^{k+1} \times_j \Delta\mathbf{x}^{\top} = \mathcal{G}^{[k]} \times_1 \mathbf{O}_k \prod_{i=1}^{k} \times_{i+1} \mathbf{I}_{ki} \prod_{j=1}^{k} \times_{j+1} \Delta\mathbf{x}^{\top}. \tag{6}$$

Based on the commutative law and associative property of the mode-$n$ product (Kolda & Bader, 2009), Eq. 6 can be reformulated as

$$\mathcal{W}^{[k]} \prod_{j=2}^{k+1} \times_j \Delta\mathbf{x} = \mathcal{G}^{[k]} \times_1 \mathbf{O}_k \prod_{j=1}^{k} \times_{j+1} \left( \Delta\mathbf{x}^{\top} \mathbf{I}_{kj} \right) \in \mathbb{R}^{o}. \tag{7}$$

Please refer to Appendix B for the detailed transformation. However, to our knowledge, current deep learning frameworks (e.g., PyTorch and TensorFlow) do not support the batched mode-$n$ product in Eq. 7. Fortunately, since the result of Eq. 7 is a vector, according to mode-$n$ unfolding as illustrated in Eqs. 3 and 4, we can convert it into the following matricized form:

$$\mathcal{W}^{[k]} \prod_{j=2}^{k+1} \times_j \Delta\mathbf{x} = \mathbf{O}_k \mathbf{G}_k \left( \mathbf{I}_{kk}^{\top} \Delta\mathbf{x} \otimes \cdots \otimes \mathbf{I}_{k1}^{\top} \Delta\mathbf{x} \right), \tag{8}$$

where $\mathbf{G}_k$ is the mode-1 matricization of $\mathcal{G}^{[k]}$. Finally, substituting Eq. 8 into the Taylor polynomial in Eq. 2 yields a lightweight $N$-th-order Taylor layer:

$$f(\mathbf{x}) = \boldsymbol{\beta} + \sum_{k=1}^{N} \mathbf{O}_k \mathbf{G}_k \left( \mathbf{I}_{kk}^{\top} \Delta\mathbf{x} \otimes \cdots \otimes \mathbf{I}_{k1}^{\top} \Delta\mathbf{x} \right), \tag{9}$$

where $\boldsymbol{\beta} = f(\mathbf{x}_0)$, $\mathbf{O}_k$, $\mathbf{G}_k$, and $\mathbf{I}_{kj}$ ($k = 1, \ldots, N$; $j = 1, \ldots, k$) are learnable parameters. After that, we can stack $L$ Taylor layers with an $N$-th-order expansion to construct a new neural network, referred to as the TaylorNet. According to Eq. 9, the output of the $l$-th layer in our TaylorNet with an $N$-th-order expansion can be written as

$$\mathbf{y}^{(l)} = \boldsymbol{\beta}^{(l)} + \sum_{k=1}^{N} \mathbf{O}_k^{(l)} \mathbf{G}_k^{(l)} \left( \mathbf{I}_{kk}^{(l)\top} \mathbf{y}^{(l-1)} \otimes \cdots \otimes \mathbf{I}_{k1}^{(l)\top} \mathbf{y}^{(l-1)} \right), \tag{10}$$

where $\mathbf{y}^{(l)} \in \mathbb{R}^{d^{(l+1)}}$ is the output of the $l$-th layer and $\mathbf{y}^{(0)} = \Delta\mathbf{x} \in \mathbb{R}^{d^{(1)}}$, $d^{(1)} = d$. Here $d^{(l)}$ is the input dimension of the $l$-th layer. In addition, $\boldsymbol{\beta}^{(l)} \in \mathbb{R}^{d^{(l+1)}}$, $\mathbf{O}_k^{(l)} \in \mathbb{R}^{d^{(l+1)} \times r^{(l)}_{\text{out},k}}$, $\mathbf{G}_k^{(l)} \in \mathbb{R}^{r^{(l)}_{\text{out},k} \times \prod_{j=1}^{k} r^{(l)}_{\text{in},k,j}}$, and $\mathbf{I}_{kj}^{(l)} \in \mathbb{R}^{d^{(l)} \times r^{(l)}_{\text{in},k,j}}$ ($k = 1, \ldots, N$; $j = 1, \ldots, k$) are the learnable parameters of the $l$-th Tucker Taylor layer. Fig. 1 shows the overall framework of the proposed TaylorNet. In this paper, we use the Taylor layer with a second-order expansion, since it is able to mitigate the overfitting problem and also reduce the number of trainable parameters in DNNs.

Remark 4.1. The computational complexity of the $k$-th-order term in the Taylor layer decreases from $O(o d^{k})$ to $O\big(r_{\text{out},k} \prod_{j=1}^{k} r_{\text{in},k,j} + o\, r_{\text{out},k} + d \sum_{j=1}^{k} r_{\text{in},k,j}\big)$, where $d$ and $o$ denote the dimensions of the input and the output. When $o$ and $d$ are much larger than the ranks of the core tensor in the Tucker decomposition, the number of parameters is reduced by orders of magnitude.
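As an illustration of Eq. 9 with $N = 2$, the sketch below implements a batched second-order Taylor layer in NumPy and compares it with the dense form built from the full tensors $\mathcal{W}^{[1]}$ and $\mathcal{W}^{[2]}$. It is a sketch under stated assumptions rather than the authors' implementation: the function name `taylor_layer_2nd_order`, the sizes, and the random weights are hypothetical, and $\mathbf{G}_2$ is stored directly as a matrix whose column index is `i22 * r21 + i21`, matching the Kronecker order $\mathbf{z}_{22} \otimes \mathbf{z}_{21}$.

```python
import numpy as np

def taylor_layer_2nd_order(dx, beta, O1, G1, I11, O2, G2, I21, I22):
    """Second-order Taylor layer (Eq. 9 with N = 2) for a batch dx of shape (B, d)."""
    z11 = dx @ I11                              # I_11^T dx, shape (B, r_in)
    first = z11 @ G1.T @ O1.T                   # O_1 G_1 z_11
    z21, z22 = dx @ I21, dx @ I22               # I_21^T dx and I_22^T dx
    # Row-wise Kronecker product z22 kron z21 (index = i22 * r21 + i21).
    kron = np.einsum("bi,bj->bij", z22, z21).reshape(dx.shape[0], -1)
    second = kron @ G2.T @ O2.T                 # O_2 G_2 (z_22 kron z_21)
    return beta + first + second

rng = np.random.default_rng(0)
B, d, o, r_in, r_out = 5, 6, 4, 3, 2            # hypothetical sizes
beta = rng.standard_normal(o)
O1 = rng.standard_normal((o, r_out))
G1 = rng.standard_normal((r_out, r_in))
I11 = rng.standard_normal((d, r_in))
O2 = rng.standard_normal((o, r_out))
G2 = rng.standard_normal((r_out, r_in * r_in))
I21 = rng.standard_normal((d, r_in))
I22 = rng.standard_normal((d, r_in))
dx = rng.standard_normal((B, d))

y = taylor_layer_2nd_order(dx, beta, O1, G1, I11, O2, G2, I21, I22)

# Dense reference: rebuild the full first- and second-order tensors.
W1 = O1 @ G1 @ I11.T                                    # shape (o, d)
G2_tensor = G2.reshape(r_out, r_in, r_in)               # axes: out, z22 index, z21 index
W2 = np.einsum("op,pji,ai,bj->oab", O2, G2_tensor, I21, I22)
y_dense = beta + dx @ W1.T + np.einsum("oab,na,nb->no", W2, dx, dx)

# Parameter counts of the second-order term, following Remark 4.1.
params_dense_2nd = o * d * d
params_tucker_2nd = r_out * r_in * r_in + o * r_out + d * (r_in + r_in)
```

The layer output `y` and the dense reference `y_dense` should agree to round-off, while the factorized second-order term stores fewer parameters than the dense tensor. The gap widens as `d` and `o` grow relative to the ranks.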

4.2. REDUCIBLE TAYLORNET AND TAYLOR-MIXER

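Reducible TaylorNet. We further propose a reducible TaylorNet, called R-TaylorNet, to reduce the number of trainable parameters of TaylorNet. The basic idea is to use a single matrix as the new trainable parameter in place of the product of the current layer's output factor matrix and the next layer's input factor matrix, namely, to merge the two consecutive parameter matrices $\mathbf{O}_k^{(l)}$ and $\mathbf{I}_{kj}^{(l+1)}$ into a single parameter matrix. Below, we derive the R-TaylorNet theoretically. According to the block multiplication of matrices, Eq. 10 can be rewritten as

$$\mathbf{y}^{(l)} = \boldsymbol{\beta}^{(l)} + \begin{bmatrix} \mathbf{O}_1^{(l)} & \mathbf{O}_2^{(l)} & \cdots & \mathbf{O}_N^{(l)} \end{bmatrix} \begin{bmatrix} \mathbf{G}_1^{(l)} \mathbf{z}_{11}^{(l)} \\ \mathbf{G}_2^{(l)} \big( \mathbf{z}_{22}^{(l)} \otimes \mathbf{z}_{21}^{(l)} \big) \\ \vdots \\ \mathbf{G}_N^{(l)} \big( \mathbf{z}_{NN}^{(l)} \otimes \cdots \otimes \mathbf{z}_{N1}^{(l)} \big) \end{bmatrix}, \tag{11}$$

where $\mathbf{z}_{kj}^{(l)} = \mathbf{I}_{kj}^{(l)\top} \mathbf{y}^{(l-1)} \in \mathbb{R}^{r^{(l)}_{\text{in},k,j}}$; we call these the hidden features of the $l$-th layer in this work. In order to further simplify the above equation, we define the following notations:

$$\mathbf{O}^{(l)} \overset{\text{def}}{=} \begin{bmatrix} \mathbf{O}_1^{(l)} & \mathbf{O}_2^{(l)} & \cdots & \mathbf{O}_N^{(l)} \end{bmatrix} \in \mathbb{R}^{d^{(l+1)} \times \sum_{k=1}^{N} r^{(l)}_{\text{out},k}}, \qquad h\big(\mathbf{z}_{11}^{(l)}, \ldots, \mathbf{z}_{NN}^{(l)}\big) \overset{\text{def}}{=} \begin{bmatrix} \mathbf{G}_1^{(l)} \mathbf{z}_{11}^{(l)} \\ \mathbf{G}_2^{(l)} \big( \mathbf{z}_{22}^{(l)} \otimes \mathbf{z}_{21}^{(l)} \big) \\ \vdots \\ \mathbf{G}_N^{(l)} \big( \mathbf{z}_{NN}^{(l)} \otimes \cdots \otimes \mathbf{z}_{N1}^{(l)} \big) \end{bmatrix} \in \mathbb{R}^{\sum_{k=1}^{N} r^{(l)}_{\text{out},k}}. \tag{12}$$

Then $\mathbf{z}_{kj}^{(l+1)}$ in the $(l+1)$-th hidden layer can be expressed in the following recursive form:

$$\mathbf{z}_{kj}^{(l+1)} = \mathbf{I}_{kj}^{(l+1)\top} \Big( \boldsymbol{\beta}^{(l)} + \mathbf{O}^{(l)} h\big(\mathbf{z}_{11}^{(l)}, \ldots, \mathbf{z}_{NN}^{(l)}\big) \Big) = \mathbf{I}_{kj}^{(l+1)\top} \boldsymbol{\beta}^{(l)} + \mathbf{I}_{kj}^{(l+1)\top} \mathbf{O}^{(l)} h\big(\mathbf{z}_{11}^{(l)}, \ldots, \mathbf{z}_{NN}^{(l)}\big), \tag{13}$$

where $k = 1, \ldots, N$ and $j = 1, \ldots, k$. In order to remove some intermediate parameters in DNNs, we introduce new lower-dimensional matrices (vectors) to replace the products of matrices in Eq. 13, namely, $\mathbf{v}_{kj}^{(l)} = \mathbf{I}_{kj}^{(l+1)\top} \boldsymbol{\beta}^{(l)}$ and $\mathbf{T}_{kj}^{(l)} = \mathbf{I}_{kj}^{(l+1)\top} \mathbf{O}^{(l)}$. By doing so, we can simplify Eq. 13 as

$$\mathbf{z}_{kj}^{(l+1)} = \mathbf{v}_{kj}^{(l)} + \mathbf{T}_{kj}^{(l)} h\big(\mathbf{z}_{11}^{(l)}, \ldots, \mathbf{z}_{NN}^{(l)}\big), \quad k = 1, \ldots, N;\ j = 1, \ldots, k, \tag{14}$$

where $\mathbf{v}_{kj}^{(l)} \in \mathbb{R}^{r^{(l+1)}_{\text{in},k,j}}$ and $\mathbf{T}_{kj}^{(l)} \in \mathbb{R}^{r^{(l+1)}_{\text{in},k,j} \times \sum_{k=1}^{N} r^{(l)}_{\text{out},k}}$ are the new parameters in DNNs. Finally, we can use Eq. 14 to implement an $L$-layer reducible TaylorNet. The feedforward method of a single layer is summarized in Algorithm 1 in Appendix C.

Remark 4.2. In R-TaylorNet, the computational complexity of calculating the hidden features $\mathbf{z}_{11}^{(l+1)}, \ldots, \mathbf{z}_{NN}^{(l+1)}$ from $\mathbf{z}_{11}^{(l)}, \ldots, \mathbf{z}_{NN}^{(l)}$ in the $l$-th layer is reduced by

$$O\!\left( \sum_{k=1}^{N} \left( d^{(l+1)} r^{(l)}_{\text{out},k} + d^{(l+1)} \sum_{j=1}^{k} r^{(l+1)}_{\text{in},k,j} - \Big( \sum_{m=1}^{N} r^{(l)}_{\text{out},m} \Big) \Big( \sum_{j=1}^{k} r^{(l+1)}_{\text{in},k,j} \Big) \right) \right)$$

compared to the original TaylorNet.

Taylor-Mixer. Building on the R-TaylorNet, we also propose a new Taylor-Mixer that replaces both the MLP layers and non-linear activation functions in the MLP-Mixer (Tolstikhin et al., 2021) with Taylor layers. The resulting Taylor-Mixer can be applied to image classification and natural language processing (NLP).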
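The step from Eq. 13 to Eq. 14 for R-TaylorNet amounts to folding $\mathbf{I}_{kj}^{(l+1)\top}$ into $\boldsymbol{\beta}^{(l)}$ and $\mathbf{O}^{(l)}$. The NumPy sketch below checks this for one $(k, j)$ pair and counts the parameters involved. The sizes are hypothetical, and the fused parameters are computed from existing factors only to show the equivalence; in R-TaylorNet itself, $\mathbf{v}$ and $\mathbf{T}$ are trained directly.

```python
import numpy as np

rng = np.random.default_rng(0)
d_next, r_out_total, r_in_next = 64, 10, 4   # hypothetical sizes

beta = rng.standard_normal(d_next)                   # beta^(l)
O = rng.standard_normal((d_next, r_out_total))       # O^(l) = [O_1 ... O_N]
I_next = rng.standard_normal((d_next, r_in_next))    # I_kj^(l+1)
h = rng.standard_normal(r_out_total)                 # h(z_11^(l), ..., z_NN^(l))

# Eq. 13: go through the d_next-dimensional layer output.
z_full = I_next.T @ (beta + O @ h)

# Eq. 14: fused parameters, which never form the layer output.
v = I_next.T @ beta                                  # shape (r_in_next,)
T = I_next.T @ O                                     # shape (r_in_next, r_out_total)
z_reduced = v + T @ h

params_full = beta.size + O.size + I_next.size
params_reduced = v.size + T.size
```

The two hidden-feature vectors coincide, and the fused form needs far fewer parameters whenever the layer width `d_next` is much larger than the ranks.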

4.3. TAYLOR INITIALIZATION

We also develop a robust Taylor initialization method to mitigate the training instability caused by higher-order terms. For simplicity, we omit the superscript $(l)$ unless otherwise specified. Following Xavier (Glorot & Bengio, 2010) and Kaiming (He et al., 2015) initialization, we assume that: 1) the input elements (variables) of the $l$-th layer, denoted by $\mathbf{y}^{(l-1)}$, follow independent zero-mean Gaussian distributions; 2) the weights in $\mathbf{O}_k$, $\mathbf{G}_k$, $\mathbf{I}_{kj}$ ($k = 1, \ldots, N$; $j = 1, \ldots, k$) are initialized independently with zero mean; 3) $\boldsymbol{\beta}$ is initialized to 0. Then we have the following proposition.

Proposition 4.1. The variances of the input and output variables of the $l$-th layer satisfy

$$\big(\sigma_y^{(l)}\big)^2 = \sum_{k=1}^{N} r_{\text{out},k}\, \sigma_{O,k}^2 \left( \sigma_{G,k}^2 \prod_{j=1}^{k} r_{\text{in},k,j} \right) \frac{(d + 2k - 2)!!}{(d - 2)!!}\, \sigma_{I,k}^{2k}\, \big(\sigma_y^{(l-1)}\big)^{2k}, \tag{15}$$

where $!!$ denotes the double factorial, $\big(\sigma_y^{(l)}\big)^2$ and $\big(\sigma_y^{(l-1)}\big)^2$ denote the variances of $\mathbf{y}^{(l)}$ and $\mathbf{y}^{(l-1)}$, and $\sigma_{O,k}^2$, $\sigma_{G,k}^2$, $\sigma_{I,k}^2$ denote the initialization variances of $\mathbf{O}_k$, $\mathbf{G}_k$, and $\mathbf{I}_{kj}$ (see Appendix A). To satisfy the requirement that each layer keeps the same variance for its input and output, the variances of the weights should be initialized to

$$\sigma_{O,k}^2 = \lambda_k \frac{1}{r_{\text{out},k}}, \quad \sigma_{G,k}^2 = \frac{1}{\prod_{j=1}^{k} r_{\text{in},k,j}}, \quad \sigma_{I,k}^{2k} = \frac{(d - 2)!!}{(d + 2k - 2)!!}, \quad \text{s.t.} \ \sum_{k=1}^{N} \lambda_k = 1, \tag{16}$$

where $\lambda_k$ is a coefficient that can be used to scale the importance of the $k$-th-order term. Please refer to Appendix E for the theoretical analysis of Proposition 4.1. Similarly, we have also developed an initialization method for the Reducible TaylorNet; please refer to Appendix F for more details.
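The variances in Eq. 16 can be computed in a few lines. The following sketch uses only the Python standard library; the function names are hypothetical, and the ranks and coefficients are taken from the appendix settings purely as illustrative inputs, with an assumed input width of 128. It writes the ratio $(d + 2k - 2)!!/(d - 2)!!$ as the product $d (d + 2) \cdots (d + 2k - 2)$, and then plugs the result back into the right-hand side of Proposition 4.1 with unit input variance.

```python
import math

def moment_ratio(d, k):
    """(d + 2k - 2)!! / (d - 2)!! written as the product d (d + 2) ... (d + 2k - 2)."""
    return math.prod(d + 2 * i for i in range(k))

def taylor_init_variances(d, r_out, r_in, lam):
    """Per-order weight variances from Eq. 16.

    r_out[k-1] is r_out,k; r_in[k-1] is the list [r_in,k,1, ..., r_in,k,k];
    lam[k-1] is lambda_k, and the lambdas must sum to 1.
    """
    assert abs(sum(lam) - 1.0) < 1e-12
    variances = []
    for k, (ro, ri, lk) in enumerate(zip(r_out, r_in, lam), start=1):
        var_O = lk / ro
        var_G = 1.0 / math.prod(ri)
        var_I = moment_ratio(d, k) ** (-1.0 / k)    # so that var_I ** k = 1 / ratio
        variances.append((var_O, var_G, var_I))
    return variances

def output_variance(d, r_out, r_in, variances, var_in=1.0):
    """Right-hand side of Proposition 4.1 for input variance var_in."""
    total = 0.0
    for k, (ro, ri, (var_O, var_G, var_I)) in enumerate(zip(r_out, r_in, variances), start=1):
        total += (ro * var_O * math.prod(ri) * var_G
                  * moment_ratio(d, k) * var_I ** k * var_in ** k)
    return total

# Illustrative second-order configuration (values are assumptions).
d, r_out, r_in, lam = 128, [140, 140], [[110], [110, 110]], [0.99, 0.01]
variances = taylor_init_variances(d, r_out, r_in, lam)
var_out = output_variance(d, r_out, r_in, variances)
```

With unit input variance, each order contributes exactly $\lambda_k$ to the sum, so `var_out` equals $\sum_k \lambda_k = 1$ up to round-off. In practice the weights would be drawn from zero-mean distributions with these variances.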

4.4. CONNECTIONS TO EXISTING MODELS

We present the connection between TaylorNet and some existing neural networks. According to Eq. 2, a one-layer TaylorNet of order 1 is a linear function, $f(\mathbf{x}) = f(\mathbf{x}_0) + \mathbf{J}(\mathbf{x} - \mathbf{x}_0)$, where $\mathbf{J}$ is the Jacobian matrix of $f(\mathbf{x})$ at $\mathbf{x} = \mathbf{x}_0$. Thus, the TaylorNet of order 1 can be viewed as fully connected layers in deep neural networks. The second-order term in a Taylor layer can be expressed as $\mathbf{H}_{(1)} [(\mathbf{x} - \mathbf{x}_0) \otimes (\mathbf{x} - \mathbf{x}_0)]$, where $\mathcal{H}$ is the scaled second-order derivative tensor of $f(\mathbf{x})$ at $\mathbf{x} = \mathbf{x}_0$. The Kronecker product of $\mathbf{x} - \mathbf{x}_0$ with itself can be viewed as pixel-level attention, analogous to the token-level attention in the Transformer (Vaswani et al., 2017; Dosovitskiy et al., 2020). Finally, our TaylorNet adopts higher-order terms to compensate for the residual errors, as shown in Fig. 1.

First of all, we compare our Taylor initialization with two commonly used initialization approaches: Xavier and Kaiming initialization. In this task, we conduct experiments on CIFAR10 using a four-layer Taylor-Mixer (34.4M parameters) as described in Section 4.2. Fig. 2 illustrates the comparison results of the different methods using 3 random seeds. We can observe that our Taylor initialization increases the classification accuracy by over 10% compared to the next best approach, Xavier initialization. The primary reason why both Xavier and Kaiming initialization do not perform well is that they fail to ensure the same variance for the input and output at each layer.

5.2. EVALUATION ON IMAGE CLASSIFICATION

We evaluate the performance of our proposed Taylor-Mixer on image classification. We compare it with the MLP-Mixer (Tolstikhin et al., 2021), which can achieve competitive results on image classification benchmarks. In our experiment, we choose the point of expansion $\mathbf{x}_0 = 0$ for Taylor-Mixer since the input data are normalized. Following a method similar to that of the MLP-Mixer, we pre-train our model on ILSVRC2012. Table 1 shows the performance comparison of our Taylor-Mixer and the baselines under different model sizes. We can see that the proposed Taylor-Mixer performs better than Π-nets and achieves comparable accuracy to the MLP-Mixer with fewer model parameters on both CIFAR10 and CIFAR100. In particular, the parameters of our Base model can be reduced by about 42% compared to the MLP-Mixer. Therefore, we can conclude that our Taylor-Mixer is more parameter-efficient than the MLP-Mixer at comparable accuracy.

5.3. EVALUATION ON DYNAMICAL SYSTEMS

Next, we apply our TaylorNet to predict and interpret the dynamics of physical systems. We evaluate it on two dynamical systems, the Duffing equation and a high-dimensional non-linear flow attractor. To train our model, we generate 100 trajectories by randomly choosing 100 initial conditions. We then use 20 trajectories generated from 20 different initial conditions as the validation data. The time span of each trajectory is $t = 0, 0.01, 0.02, \ldots, 10$, with a sampling time of 0.01. Thus, we can convert a continuous dynamical system into a discrete dynamical system, $\mathbf{x}_{k+1} = f(\mathbf{x}_k)$. Next, we use regression to predict the next state using TaylorNet. We compare our approach with two methods: an ODE solver (ground truth), odeint from the SciPy package, and a 3-layer MLP.

Duffing equation. We first adopt TaylorNet to predict the dynamics of the Duffing equation, given by

$$\ddot{x} = x - x^{3} \;\Longrightarrow\; \begin{cases} \dot{x}_1 = x_2, \\ \dot{x}_2 = x_1 - x_1^{3}. \end{cases} \tag{17}$$

In our experiment, we choose $x_1(0), x_2(0) \in [-1, 1]$. Fig. 3 illustrates the trajectory prediction of the different methods on Duffing dynamics using one random initial condition. We can observe that our TaylorNet attains very good trajectory prediction, with an error of $1.492 \times 10^{-7}$ compared to the ODE solver, odeint, from the SciPy package. It thus significantly outperforms the MLP, whose error is about 0.3514. More importantly, since our method does not use activation functions, it has the ability to explicitly learn the dynamical system, as given in Eq. 18:

$$\dot{x}_1 \approx 1.001 x_2, \qquad \dot{x}_2 \approx 1.001 x_1 - 1.001 x_1^{3}. \tag{18}$$

Comparing with the original Duffing equation, we can see that our predicted model is very close to the ground-truth model in Eq. 17.

High-dimensional non-linear flow attractor. We then apply our method to predict the dynamics of non-linear fluid flow. According to Noack et al. (2003), the dynamical system can be described by the following low-dimensional model.
$$\dot{x}_1 = \mu x_1 - \omega x_2 + A x_1 x_3, \qquad \dot{x}_2 = \omega x_1 + \mu x_2 + A x_2 x_3, \qquad \dot{x}_3 = -\lambda (x_3 - x_1^{2} - x_2^{2}). \tag{19}$$

Following the prior work of Lusch et al. (2018), we choose $\mu = 0.1$, $\omega = 1$, $\lambda = 10$, $A = -0.1$, and $x_1(0), x_2(0) \in [-1.1, 1.1]$, $x_3(0) \in [0, 2.42]$ in the experiment. Fig. 4 shows the trajectory predictions of the flow attractor using the different methods. We can observe that our approach predicts the trajectory as accurately as the ODE solver, odeint (ground truth). In addition, the error of our method is about $3.361 \times 10^{-6}$, which is three orders of magnitude smaller than that of the MLP ($4.447 \times 10^{-3}$). Finally, we leverage our TaylorNet to reconstruct the dynamical system as follows:

$$\dot{x}_1 \approx 0.095 x_1 - 1.003 x_2 - 0.100 x_1 x_2, \qquad \dot{x}_2 \approx 1.002 x_1 + 0.095 x_2 + 0.100 x_2 x_3, \qquad \dot{x}_3 \approx -9.513 x_3 + 9.521 x_1^{2} + 9.521 x_2^{2}. \tag{20}$$

We can see that the predicted model is very close to the ground truth in Eq. 19. Based on these two examples, we can conclude that our TaylorNet is able to explicitly learn and interpret the dynamics of some physical systems with polynomials.
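The data pipeline for the Duffing experiment can be illustrated with a small, self-contained sketch. It is not the paper's method: instead of SciPy's odeint it uses a hand-written fourth-order Runge-Kutta step, and instead of training a TaylorNet it fits ordinary least squares on cubic monomials, which is a much simpler stand-in for learning a polynomial model. The trajectory length (200 steps rather than the full time span) and all names are assumptions made for brevity.

```python
import numpy as np

def duffing(x):
    """Right-hand side of Eq. 17 for states x of shape (..., 2)."""
    x1, x2 = x[..., 0], x[..., 1]
    return np.stack([x2, x1 - x1**3], axis=-1)

def rk4_step(f, x, dt):
    """One classical Runge-Kutta step of size dt."""
    k1 = f(x)
    k2 = f(x + 0.5 * dt * k1)
    k3 = f(x + 0.5 * dt * k2)
    k4 = f(x + dt * k3)
    return x + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)

def cubic_features(x):
    """All monomials x1^a x2^b with a + b <= 3, plus their exponent pairs."""
    x1, x2 = x[:, 0], x[:, 1]
    exps = [(a, b) for a in range(4) for b in range(4 - a)]
    return np.stack([x1**a * x2**b for a, b in exps], axis=1), exps

rng = np.random.default_rng(0)
dt, n_steps = 0.01, 200
x = rng.uniform(-1.0, 1.0, size=(100, 2))    # 100 random initial conditions
states = [x]
for _ in range(n_steps):
    x = rk4_step(duffing, x, dt)
    states.append(x)
states = np.stack(states)                    # shape (n_steps + 1, 100, 2)

X_k = states[:-1].reshape(-1, 2)
X_next = states[1:].reshape(-1, 2)
Phi, exps = cubic_features(X_k)

# Least-squares fit of the finite-difference derivative (x_{k+1} - x_k) / dt.
coef, *_ = np.linalg.lstsq(Phi, (X_next - X_k) / dt, rcond=None)
learned = {e: coef[i] for i, e in enumerate(exps)}
```

In this sketch, `learned[(0, 1)][0]` is the coefficient of $x_2$ in $\dot{x}_1$, and `learned[(1, 0)][1]` and `learned[(3, 0)][1]` are the coefficients of $x_1$ and $x_1^3$ in $\dot{x}_2$. They should land close to 1, 1, and -1 respectively, with a small bias of order `dt` because of the forward difference.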

5.4. EVALUATION ON NLP

Finally, we also explore our method in NLP applications. In this work, we use sentiment analysis on IMDB Maas et al. (2011) 

6. CONCLUSION

This paper developed a Taylor-driven generic neural architecture, called TaylorNet, that is able to naturally introduce inductive bias to deep neural networks (DNNs). Different from classical DNNs, our TaylorNet adopted higher-order terms to replace the conventional non-linear activation functions. More specifically, we first proposed a lightweight Taylor Neural Network (TaylorNet) based on Tucker decomposition. As an extension, we also developed a reducible TaylorNet that can remove redundant parameters in hidden layers to improve computational efficiency. Then we proposed a new Taylor-Mixer that replaces both the MLP layers and activation functions in the MLP-Mixer with Taylor layers. In order to improve the model performance, a novel Taylor initialization approach was proposed. Evaluation results illustrated that the proposed method can achieve comparable accuracy to the baselines on image classification and sentiment analysis in NLP. In particular, our approach can significantly reduce the number of desired model parameters on image classification. Importantly, our approach could explicitly learn and interpret some dynamical systems with polynomials, making way for explainable ML.

A NOTATIONS

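We summarize the main notations used throughout the paper in the following table.

Symbol | Space | Description
$\mathbf{y}$ or $f(\mathbf{x})$ | $\mathbb{R}^{o}$ | Output of TaylorNet/Reducible TaylorNet
$\mathbf{z}_{kj}^{(l)} = \mathbf{I}_{kj}^{(l)\top} \mathbf{y}^{(l-1)}$ | $\mathbb{R}^{r^{(l)}_{\text{in},k,j}}$ | Pre-G hidden features of the $l$-th layer in TaylorNet
$h\big(\mathbf{z}_{11}^{(l)}, \ldots, \mathbf{z}_{NN}^{(l)}\big)$ | $\mathbb{R}^{\sum_{k=1}^{N} r^{(l)}_{\text{out},k}}$ | Post-G hidden features of the $l$-th layer in TaylorNet
$\boldsymbol{\beta} = f(\mathbf{x}_0)$ | $\mathbb{R}^{o}$ | Learnable vector parameter
$\mathcal{G}^{[k]}$ | $\mathbb{R}^{r_{\text{out},k} \prod_{j=1}^{k} \times r_{\text{in},k,j}}$ | Learnable core tensor of TaylorNet
$\mathbf{G}_k \overset{\text{def}}{=} \mathcal{G}^{[k]}_{(1)}$ | $\mathbb{R}^{r_{\text{out},k} \times \prod_{j=1}^{k} r_{\text{in},k,j}}$ | Mode-1 matricization of $\mathcal{G}^{[k]}$
$\mathbf{I}_{kj}$ | $\mathbb{R}^{d \times r_{\text{in},k,j}}$ | Learnable input factor matrices of TaylorNet
$\mathbf{O}_k$ | $\mathbb{R}^{o \times r_{\text{out},k}}$ | Learnable output factor matrices of TaylorNet
$\mathbf{v}_{kj}^{(l)}$ | $\mathbb{R}^{r^{(l)}_{\text{in},k,j}}$ | New learnable vector parameters in Reducible TaylorNet
$\mathbf{T}_{kj}^{(l)}$ | $\mathbb{R}^{r^{(l)}_{\text{in},k,j} \times \sum_{k=1}^{N} r^{(l-1)}_{\text{out},k}}$ | New learnable matrix parameters in Reducible TaylorNet
$\big(\sigma_y^{(l)}\big)^2$, $\big(\sigma_y^{(l-1)}\big)^2$, $\sigma_{O,k}^2$, $\sigma_{G,k}^2$, $\sigma_{I,k}^2$ | N | Initialization variance for $\mathbf{y}^{(l)}$, $\mathbf{y}^{(l-1)}$, $\mathbf{O}_k^{(l)}$, $\mathbf{G}_k^{(l)}$, $\mathbf{I}_k^{(l)}$
$\lambda_k$ | N | Initialization coefficient for $\sigma_{O,k}^2$

B PROPERTIES OF TENSOR MODE-N PRODUCT

Lemma B.1. The mode-$n$ matrix product satisfies the commutative law (Kolda & Bader, 2009)

$$\mathcal{X} \times_m \mathbf{A} \times_n \mathbf{B} = \mathcal{X} \times_n \mathbf{B} \times_m \mathbf{A} \quad (m \neq n),$$

which means that the order of multiplication is irrelevant when a series of mode matrix products acts on different modes.

Lemma B.2. The mode-$n$ matrix product satisfies the following associative property (Kolda & Bader, 2009):

$$\mathcal{X} \times_n \mathbf{A} \times_n \mathbf{B} = \mathcal{X} \times_n (\mathbf{B}\mathbf{A}).$$

Proof. Based on Lemma B.1, the $k$-th term of the Taylor expansion in Eq. 6 can be rewritten as

$$\mathcal{W}^{[k]} \prod_{j=2}^{k+1} \times_j \Delta\mathbf{x} = \mathcal{G}^{[k]} \times_1 \mathbf{O}_k \prod_{j=1}^{k} \times_{j+1} \mathbf{I}_{kj} \times_{j+1} \Delta\mathbf{x}^{\top}.$$

Based on Lemma B.2, the above equation can be reformulated as

$$\mathcal{W}^{[k]} \prod_{j=2}^{k+1} \times_j \Delta\mathbf{x} = \mathcal{G}^{[k]} \times_1 \mathbf{O}_k \prod_{j=1}^{k} \times_{j+1} \left( \Delta\mathbf{x}^{\top} \mathbf{I}_{kj} \right) \in \mathbb{R}^{o}.$$

Proof finished.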
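Both properties of the mode-$n$ product are easy to confirm numerically. The sketch below is an illustration with hypothetical tensor sizes and a small helper `mode_n_product` written for this purpose; it uses zero-based mode indices, so mode 1 and mode 2 in the code are the second and third modes of the tensor.

```python
import numpy as np

def mode_n_product(X, A, n):
    """Mode-n product X x_n A: contracts mode n of X with the columns of A."""
    out = np.tensordot(A, X, axes=(1, n))   # the new axis lands at position 0
    return np.moveaxis(out, 0, n)

rng = np.random.default_rng(1)
X = rng.standard_normal((3, 4, 5))
A = rng.standard_normal((6, 4))   # acts on mode 1 (size 4)
B = rng.standard_normal((7, 5))   # acts on mode 2 (size 5)
C = rng.standard_normal((2, 6))   # acts on mode 1 after A has been applied

# Lemma B.1: products along different modes commute.
lhs_comm = mode_n_product(mode_n_product(X, A, 1), B, 2)
rhs_comm = mode_n_product(mode_n_product(X, B, 2), A, 1)

# Lemma B.2: two products along the same mode collapse into one with C A.
lhs_assoc = mode_n_product(mode_n_product(X, A, 1), C, 1)
rhs_assoc = mode_n_product(X, C @ A, 1)
```

Note the order in the second identity: the matrix applied last appears on the left of the product, which is what turns $\mathbf{I}_{kj}$ followed by $\Delta\mathbf{x}^{\top}$ into $\Delta\mathbf{x}^{\top} \mathbf{I}_{kj}$ in the proof above.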

C FEEDFORWARD METHOD FOR REDUCIBLE TAYLORNET

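We summarize the feedforward method for Reducible TaylorNet below.

Algorithm 1: Feedforward Method for Reducible TaylorNet
Input: $\Delta\mathbf{x} \in \mathbb{R}^{d}$
Output: $\mathbf{y} \in \mathbb{R}^{o}$
Initialize $\mathbf{v}_{kj}^{(l)}$, $\mathbf{T}_{kj}^{(l)}$, $\boldsymbol{\beta}^{(L)}$, $\mathbf{O}^{(L)}$, $\mathbf{G}_k$, $\mathbf{I}_{kj}^{(1)}$
/* from the 1-st layer to the L-th layer */
for $l = 1, \ldots, L$ do
    /* from the 1-st order to the N-th order */
    for $k = 1, \ldots, N$ do
        for $j = 1, \ldots, k$ do
            if $l = 1$ then
                $\mathbf{z}_{kj}^{(1)} = \mathbf{I}_{kj}^{(1)\top} \Delta\mathbf{x}$
            else
                $\mathbf{z}_{kj}^{(l)} = \mathbf{v}_{kj}^{(l-1)} + \mathbf{T}_{kj}^{(l-1)} h\big(\mathbf{z}_{11}^{(l-1)}, \ldots, \mathbf{z}_{NN}^{(l-1)}\big)$
            end
        end
    end
end
$\mathbf{y} = \mathbf{y}^{(L)} = \boldsymbol{\beta}^{(L)} + \mathbf{O}^{(L)} h\big(\mathbf{z}_{11}^{(L)}, \ldots, \mathbf{z}_{NN}^{(L)}\big)$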
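Algorithm 1 can be written out for the second-order case ($N = 2$) as the following NumPy sketch. It is an illustrative reading of the algorithm, not the authors' code: the container layout (lists of dictionaries keyed by $(k, j)$), the function names, and the layer sizes are all assumptions. To have something to compare against, it also builds an ordinary stacked TaylorNet (Eq. 10) and derives the reduced parameters from it.

```python
import numpy as np

KJ = [(1, 1), (2, 1), (2, 2)]    # all (k, j) pairs for N = 2

def h(z, G):
    """Post-G hidden features (Eq. 12) for N = 2."""
    return np.concatenate([G[0] @ z[(1, 1)], G[1] @ np.kron(z[(2, 2)], z[(2, 1)])])

def r_taylornet_forward(dx, I_first, G, v, T, beta_L, O_L):
    """Algorithm 1: v[l-1], T[l-1] connect layer l to layer l + 1."""
    L = len(G)
    z = {kj: I_first[kj].T @ dx for kj in KJ}                  # l = 1
    for l in range(1, L):                                      # layers 2, ..., L
        hz = h(z, G[l - 1])
        z = {kj: v[l - 1][kj] + T[l - 1][kj] @ hz for kj in KJ}
    return beta_L + O_L @ h(z, G[L - 1])

def taylornet_forward(dx, layers):
    """Unreduced reference: stacked Taylor layers as in Eq. 10."""
    y = dx
    for p in layers:
        z = {kj: p["I"][kj].T @ y for kj in KJ}
        y = p["beta"] + p["O"] @ h(z, p["G"])
    return y

rng = np.random.default_rng(0)
dims, r_in, r_out = [5, 6, 7, 4], 3, 2     # hypothetical sizes, L = 3 layers
L = len(dims) - 1
full = [{
    "I": {kj: rng.standard_normal((dims[l], r_in)) for kj in KJ},
    "G": (rng.standard_normal((r_out, r_in)), rng.standard_normal((r_out, r_in * r_in))),
    "O": rng.standard_normal((dims[l + 1], 2 * r_out)),
    "beta": rng.standard_normal(dims[l + 1]),
} for l in range(L)]

# Reduced parameters derived from the full ones (in training they are learned directly).
G = [p["G"] for p in full]
v = [{kj: full[l + 1]["I"][kj].T @ full[l]["beta"] for kj in KJ} for l in range(L - 1)]
T = [{kj: full[l + 1]["I"][kj].T @ full[l]["O"] for kj in KJ} for l in range(L - 1)]

dx = rng.standard_normal(dims[0])
y_full = taylornet_forward(dx, full)
y_reduced = r_taylornet_forward(dx, full[0]["I"], G, v, T, full[-1]["beta"], full[-1]["O"])
```

Both forward passes should return the same output vector. Only the first layer's input factors and the last layer's $\boldsymbol{\beta}^{(L)}$ and $\mathbf{O}^{(L)}$ keep the full input and output widths, while every intermediate width disappears from the reduced model.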

D MODEL CONFIGURATIONS AND PARAMETER SETTINGS

In this section, we present the detailed model configurations and parameter settings for the following four different tasks.

D.1 EXPERIMENTAL DETAILS FOR IMAGE CLASSIFICATION

For image classification, our Taylor-Mixer is built on the existing MLP-Mixer (Tolstikhin et al., 2021). Thus we follow the experimental settings of the MLP-Mixer for pre-training and fine-tuning, unless stated otherwise. Pre-training. We pre-train all the models at resolution 224 using linear learning rate warm-up and cosine learning rate decay. We set the batch size to 1024 for the Base and Small models due to the GPU memory capacity limitation of our servers. Since the input data are normalized, we choose the Taylor expansion point $\mathbf{x}_0 = 0$ in our model. The detailed model configurations and parameter settings are presented in Table 4. Our Taylor-Mixer is set to 2 and 4 layers in our experiments. The Tucker ranks of the input and output factor matrices are set to 110 and 140, respectively. We describe the rule of thumb for choosing the ranks as follows. For an $N$-th-order Taylor layer with input and output ranks $r_{\text{in},k,j}$ and $r_{\text{out},k}$, the effective width of this layer is approximately min(

We use the momentum SGD optimizer and a cosine learning rate scheduler with a linear warm-up. The batch size for fine-tuning is set to 512. We also use gradient clipping at global norm 1. In addition, we do not use dropout, the same as the MLP-Mixer.

D.2 EXPERIMENTAL DETAILS FOR DYNAMICAL SYSTEMS

In this experiment, we leverage one-layer TaylorNet based on Tucker decomposition. The rank of each dimension in the core tensor is set to 16. In addition, the batch size is set to 128. We use Adam optimizer with learning rate 0.001.

D.3 EXPERIMENTAL DETAILS FOR NLP

For sentiment analysis in NLP, we follow the same experimental setup as the pNLP-Mixer (Fusco et al., 2022) unless otherwise stated. Following pNLP-Mixer, we set both the batch size and the hidden size to 256. We use the Adam optimizer with learning rate $10^{-4}$. Different from pNLP-Mixer, the length of the input tokens is set to 512. We use BERT embeddings for a token by averaging the embeddings of its subword units. In order to make the number of parameters similar to that of pNLP-Mixer, we choose a 2-layer Taylor-NLP with ranks of 30 and 50 for the input and output matrices, respectively. We also use dropout of 0.5 and weight decay of 0.01 to mitigate overfitting. The initialization coefficients $\lambda_1$, $\lambda_2$ are set to 0.99 and 0.01.

For $k = 1$, Eq. 30 can be rewritten as

$$\sum_{i_1=1}^{d} E[x_{i_1}^{2}] = d \times 1 = d.$$

For $k = 2$, Eq. 30 can be rewritten and proved as

$$\sum_{i_1=1, i_2=1}^{d} E[x_{i_1}^{2} x_{i_2}^{2}] = \sum_{i_1=1}^{d} E[x_{i_1}^{4}] + \sum_{i_1, i_2 \neq i_1}^{d} E[x_{i_1}^{2} x_{i_2}^{2}] = d \times 3 + d(d-1) \times 1 = d(d+2). \tag{33}$$

For $k = 3$, Eq. 30 can be rewritten and proved as

$$\sum_{i_1, i_2, i_3}^{d} E[x_{i_1}^{2} x_{i_2}^{2} x_{i_3}^{2}] = \sum_{i_1}^{d} E[x_{i_1}^{6}] + C_3^1 \sum_{i_1, i_2 \neq i_1}^{d} E[x_{i_1}^{4} x_{i_2}^{2}] + \sum_{i_1, i_2 \neq i_1, i_3 \neq i_1, i_2}^{d} E[x_{i_1}^{2} x_{i_2}^{2} x_{i_3}^{2}]$$
$$= d \times 15 + 3 \times d(d-1) \times 3 + d(d-1)(d-2) \times 1 = d(d+2)(d+4).$$

For $k = 4$, Eq. 30 can be rewritten and proved as

$$\sum_{i_1, i_2, i_3, i_4}^{d} E[x_{i_1}^{2} x_{i_2}^{2} x_{i_3}^{2} x_{i_4}^{2}] = \sum_{i_1}^{d} E[x_{i_1}^{8}] + C_4^1 \sum_{i_1, i_2 \neq i_1}^{d} E[x_{i_1}^{6} x_{i_2}^{2}] + \frac{C_4^2}{2} \sum_{i_1, i_2 \neq i_1}^{d} E[x_{i_1}^{4} x_{i_2}^{4}]$$
$$+\, C_4^2 \sum_{i_1, i_2 \neq i_1, i_3 \neq i_1, i_2}^{d} E[x_{i_1}^{4} x_{i_2}^{2} x_{i_3}^{2}] + \sum_{i_1, i_2 \neq i_1, i_3 \neq i_1, i_2, i_4 \neq i_1, i_2, i_3}^{d} E[x_{i_1}^{2} x_{i_2}^{2} x_{i_3}^{2} x_{i_4}^{2}]$$
$$= d \times 105 + 4 \times d(d-1) \times 15 + 3 \times d(d-1) \times 9 + 6 \times d(d-1)(d-2) \times 3 + d(d-1)(d-2)(d-3) \times 1 = d(d+2)(d+4)(d+6),$$

completing the proof. Based on the above proof for small $k$, we can extrapolate Conjecture E.1 to all $k \leq d$. We have empirically validated that this conjecture still holds for $d \in \{1, \ldots, 64\}$, $k \leq d$ using computer simulation. Nevertheless, a thorough proof is left for future work.
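The case-by-case expansions above can be reproduced exactly by brute force for small $d$ and $k$. The sketch below is an independent check, not the simulation mentioned in the text: it assumes standard normal coordinates, uses the known moments $E[x^{2m}] = (2m - 1)!!$, enumerates every index tuple, and compares the total with $d(d+2)\cdots(d+2k-2)$ in integer arithmetic. The function names are hypothetical.

```python
import itertools
import math
from collections import Counter

def gaussian_even_moment(m):
    """E[x^(2m)] = (2m - 1)!! for x ~ N(0, 1)."""
    return math.prod(range(1, 2 * m, 2))

def moment_sum(d, k):
    """Sum over all index tuples (i_1, ..., i_k) of E[x_{i_1}^2 ... x_{i_k}^2]."""
    total = 0
    for idx in itertools.product(range(d), repeat=k):
        # Independent coordinates: the expectation factorizes over distinct indices.
        total += math.prod(gaussian_even_moment(c) for c in Counter(idx).values())
    return total

def closed_form(d, k):
    """d (d + 2) ... (d + 2k - 2), i.e. (d + 2k - 2)!! / (d - 2)!!."""
    return math.prod(d + 2 * i for i in range(k))

checks = {(d, k): moment_sum(d, k) == closed_form(d, k)
          for d in range(1, 6) for k in range(1, 5)}
```

Every entry of `checks` should be true. One way to read the identity is that the sum equals $E[(x_1^2 + \cdots + x_d^2)^k]$, the $k$-th raw moment of a chi-square variable with $d$ degrees of freedom, whose standard closed form is the same product; this suggests the pattern is not limited to the small cases enumerated here.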
Based on the above lemmas and Conjecture E.1, we can offer the proof of Proposition 4.1 below.

Proof. We first define two hidden features in the Taylor layer, $z_k = I_{kk}^\top y^{(l-1)} \otimes \cdots \otimes I_{k1}^\top y^{(l-1)}$ and $h_k = G_k z_k$. Let $\sigma_{h,k}^2 = \mathrm{Var}[(h_k)_j]$ denote the variance of $h_k$, and let $\nu_{z,k} = E[(z_k)_i^2]$ denote the second-order raw moment of $z_k$.

We first derive the relationship between the variance of $y^{(l)}$ and $h_k$. According to Eq. 10, in TaylorNet we have $y^{(l)} = \beta + \sum_{k=1}^{N} O_k h_k$, which can be decomposed element-wise into

$$y_i^{(l)} = \beta_i + \sum_{k=1}^{N} \sum_{j=1}^{r_{\text{out},k}} (O_k)_{i,j} (h_k)_j.$$

Therefore, according to Lemma E.2, we can derive the variance of $y^{(l)}$ as

$$(\sigma_y^{(l)})^2 = \sum_{k=1}^{N} \sum_{j=1}^{r_{\text{out},k}} \sigma_{O,k}^2 \, E[(h_k)_j^2]. \tag{37}$$

Given that $E[h_k] = E[G_k z_k] = 0$, Eq. 37 can be further simplified as

$$(\sigma_y^{(l)})^2 = \sum_{k=1}^{N} r_{\text{out},k} \, \sigma_{O,k}^2 \, \sigma_{h,k}^2.$$

Next, we establish the relationship between $\sigma_{h,k}^2$ and $\nu_{z,k}$. We can decompose $h_k = G_k z_k$ into

$$(h_k)_j = \sum_{i_1}^{r_{\text{in},k,1}} \cdots \sum_{i_k}^{r_{\text{in},k,k}} (G_k)_{j,\, i_1 \times \cdots \times i_k} \, (z_k)_{i_1 \times \cdots \times i_k}. \tag{39}$$

Therefore, according to Lemma E.2, we can derive $\sigma_{h,k}^2$ as

$$\sigma_{h,k}^2 = \prod_{j=1}^{k} r_{\text{in},k,j} \, \sigma_{G,k}^2 \, \nu_{z,k}.$$

Next, we establish the relationship between $\nu_{z,k}$ and $(\sigma_y^{(l-1)})^2$. We can decompose $z_k = I_{kk}^\top y^{(l-1)} \otimes \cdots \otimes I_{k1}^\top y^{(l-1)}$ into

$$(z_k)_{i_1 \times \cdots \times i_k} = \Big( \sum_{j_k}^{d} (I_{kk})_{j_k, i_k} (y^{(l-1)})_{j_k} \Big) \cdots \Big( \sum_{j_1}^{d} (I_{k1})_{j_1, i_1} (y^{(l-1)})_{j_1} \Big).$$

Therefore we can derive $\nu_{z,k}$ as

$$\nu_{z,k} = E[(z_k)_{i_1 \times \cdots \times i_k}^2].$$

In this section, we elaborate on the initialization method for Reducible TaylorNet (R-TaylorNet). We keep using the notation of Sections 4.2, 4.3, and E. Since the original input and output variables $y^{(l-1)}$, $y^{(l)}$ are omitted in Reducible TaylorNet, we instead examine the relationship between the variances of $h^{(l-1)}$ and $h^{(l)}$. Recall that $h^{(l)}$ is defined in Sections 4.2 and E as

$$h^{(l)} \overset{\text{def}}{=} \begin{bmatrix} h_1^{(l)} \\ h_2^{(l)} \\ \vdots \\ h_N^{(l)} \end{bmatrix} = \begin{bmatrix} \vdots \\ G_N^{(l)} \, z_{NN}^{(l)} \otimes \cdots \otimes z_{N1}^{(l)} \end{bmatrix}.$$

Similar to the analysis in Section 4.3, we assume that (1) all elements in $h^{(l-1)}$ follow independent zero-mean Gaussian distributions, (2) the weights in $T_k$, $G_k$ are initialized independently with zero mean, and (3) $v_k$ is initialized to 0.

First, we can establish the relationship between $(\sigma_{h,k}^{(l)})^2$ and $\nu_{z,k}$ in the same way as described in Section E, which can be written as

$$(\sigma_{h,k}^{(l)})^2 = \prod_{j=1}^{k} r_{\text{in},k,j} \, \sigma_{G,k}^2 \, \nu_{z,k}.$$

Next, we establish the relationship between $\nu_{z,k}$ and $(\sigma_h^{(l-1)})^2$. In Reducible TaylorNet, $z_k$ is calculated as

$$z_{kj} = v_{kj} + T_{kj} h^{(l-1)}, \qquad z_k = z_{k1} \otimes \cdots \otimes z_{kk}. \tag{50}$$

Below, we introduce a more fine-grained block multiplication notation for $T_{kj} h^{(l-1)}$:

$$T_{kj} h^{(l-1)} = T_{kj1} \ldots$$
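As an illustration of Eq. 50, the following NumPy sketch computes $z_{kj} = v_{kj} + T_{kj} h^{(l-1)}$ for a second-order term and forms $z_k$ as the Kronecker product of the factors. It also checks that multiplying by $T_{kj}$ agrees with a block-wise sum $\sum_m T_{kjm} h_m^{(l-1)}$. The column-block split of $T_{kj}$ along the sub-vectors of $h^{(l-1)}$, the dimensions, and all variable names are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: h^{(l-1)} is the concatenation of N = 2 blocks h_1, h_2.
block_sizes = [3, 4]   # lengths of h_1^{(l-1)} and h_2^{(l-1)} (assumed)
r_in = [2, 5]          # r_{in,2,1}, r_{in,2,2} for the order-2 term (assumed)

h_blocks = [rng.standard_normal(size) for size in block_sizes]
h = np.concatenate(h_blocks)


def make_blocks(rows):
    """One column block T_{kjm} per block h_m (assumed layout)."""
    return [rng.standard_normal((rows, size)) for size in block_sizes]


T_blocks = [make_blocks(rows) for rows in r_in]  # T_{21}, T_{22} in block form
v = [np.zeros(rows) for rows in r_in]            # v_{kj} initialised to 0


def z_factor(T_kj_blocks, v_kj, h_parts):
    """z_{kj} = v_{kj} + sum_m T_{kjm} h_m, the block form of T_{kj} h."""
    return v_kj + sum(T_m @ h_m for T_m, h_m in zip(T_kj_blocks, h_parts))


z_factors = [z_factor(T, v_kj, h_blocks) for T, v_kj in zip(T_blocks, v)]

# z_k = z_{k1} (x) ... (x) z_{kk}; for 1-D arrays np.kron gives this product.
z_k = z_factors[0]
for z in z_factors[1:]:
    z_k = np.kron(z_k, z)

# Reference computation with the assembled matrices T_{kj} = [T_{kj1} T_{kj2}].
T_full = [np.hstack(T) for T in T_blocks]
z_full = [v_kj + T @ h for T, v_kj in zip(T_full, v)]

print("block form matches full form:",
      all(np.allclose(a, b) for a, b in zip(z_factors, z_full)))
print("length of z_k:", z_k.shape[0])
```

The length of $z_k$ is the product of the input ranks, which is why the variance of $h_k = G_k z_k$ scales with $\prod_j r_{\text{in},k,j}$ in the analysis.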



Figure 1: (a) Architecture of Taylor Layer of order 2 using Tucker decomposition; (b) TaylorNet consists of N Taylor layers of order 2.

Let $\sigma_{O,k}^2$, $\sigma_{G,k}^2$, and $\sigma_{I,k}^2$ denote the variance of the weights in $O_k$, $G_k$, and $I_{kj}$ ($k = 1, \ldots, N$; $j = 1, \ldots, k$), respectively. Following prior works (He et al., 2015; Glorot & Bengio, 2010), we should enforce $(\sigma_y^{(l)})^2 = (\sigma_y^{(l-1)})^2$ to stabilize model training. In addition, we would like all intermediate features inside the Taylor layer to have a variance similar to $(\sigma_y^{(l-1)})^2$.

Figure 2: Accuracy comparison of different initialization methods using 3 random seeds.

Figure 3: Trajectory prediction of different methods on Duffing Equation.

Figure 4: Trajectory prediction of different methods on Non-linear Fluid Flow.

in,k,j , N k=1 r out,k ). Therefore, in order to achieve a larger width with a fixed number of parameters, we should set N k=1 k j=1 jr in,k,j ≈ N k=1 r out,k . Then we can scale r in,k,j and r out,k together to adjust the number of parameters of the model.

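Expanding the square and applying Lemma E.2 and Conjecture E.1, the second-order moment of $z_k$ becomes

$$\begin{aligned}
\nu_{z,k} &= E\Big[ \Big( \sum_{j_k}^{d} (I_{kk})_{j_k, i_k} (y^{(l-1)})_{j_k} \Big)^2 \cdots \Big( \sum_{j_1}^{d} (I_{k1})_{j_1, i_1} (y^{(l-1)})_{j_1} \Big)^2 \Big] \\
&= E\Big[ \sum_{j_k}^{d} (I_{kk})_{j_k, i_k}^2 (y^{(l-1)})_{j_k}^2 \cdots \sum_{j_1}^{d} (I_{k1})_{j_1, i_1}^2 (y^{(l-1)})_{j_1}^2 \Big] \\
&= \sum_{j_1, \ldots, j_k}^{d} E\big[ (I_{kk})_{j_k, i_k}^2 \cdots (I_{k1})_{j_1, i_1}^2 \big] \, E\big[ (y^{(l-1)})_{j_k}^2 \cdots (y^{(l-1)})_{j_1}^2 \big] \\
&= \frac{(d + 2k - 2)!!}{(d - 2)!!} \, \sigma_{I,k}^{2k} \, (\sigma_y^{(l-1)})^{2k}.
\end{aligned}$$

Combining this with the relationships between $(\sigma_y^{(l)})^2$, $\sigma_{h,k}^2$, and $\nu_{z,k}$ gives

$$(\sigma_y^{(l)})^2 = \sum_{k=1}^{N} r_{\text{out},k} \, \sigma_{O,k}^2 \prod_{j=1}^{k} r_{\text{in},k,j} \, \sigma_{G,k}^2 \, \frac{(d + 2k - 2)!!}{(d - 2)!!} \, \sigma_{I,k}^{2k} \, (\sigma_y^{(l-1)})^{2k}.$$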

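In order to stabilize the model training of our TaylorNet, we should choose the initialization variances such that $(\sigma_y^{(l)})^2 = (\sigma_y^{(l-1)})^2$. Another desired property is that the second-order moments of the intermediate features $h_k$, $z_k$ should also be equal to the variance of the input. Namely, we would also like to ensure that $\sigma_{h,k}^2 = \nu_{z,k} = (\sigma_y^{(l-1)})^2$.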
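To make these conditions concrete, the sketch below solves them for the special case of unit input variance, $(\sigma_y^{(l-1)})^2 = 1$. It assumes the relations from Section E, namely $\nu_{z,k} = d(d+2)\cdots(d+2k-2)\,\sigma_{I,k}^{2k}$, $\sigma_{h,k}^2 = \prod_j r_{\text{in},k,j}\,\sigma_{G,k}^2\,\nu_{z,k}$, and $(\sigma_y^{(l)})^2 = \sum_k r_{\text{out},k}\,\sigma_{O,k}^2\,\sigma_{h,k}^2$. It further assumes that the weights $\lambda_k$ sum to 1 and split the output variance across orders. This is one possible choice that satisfies the stated conditions, not necessarily the exact form of Eq. 16. All function names and the example dimensions are hypothetical.

```python
from math import prod


def gaussian_moment_factor(d, k):
    """d (d + 2) ... (d + 2k - 2), the value given by Conjecture E.1."""
    return prod(d + 2 * j for j in range(k))


def taylor_init_variances(d, r_in, r_out, lam):
    """Per-order (var_I, var_G, var_O), assuming unit input variance.

    d     : input dimension
    r_in  : r_in[k-1] is the list of k input ranks of the order-k term
    r_out : r_out[k-1] is the output rank of the order-k term
    lam   : lam[k-1] is the share of output variance given to order k
            (assumed to sum to 1)
    """
    variances = []
    for k, (ranks, rank_out, lam_k) in enumerate(zip(r_in, r_out, lam), start=1):
        var_I = gaussian_moment_factor(d, k) ** (-1.0 / k)  # gives nu_{z,k} = 1
        var_G = 1.0 / prod(ranks)                           # gives sigma_{h,k}^2 = 1
        var_O = lam_k / rank_out                            # order k contributes lam_k
        variances.append((var_I, var_G, var_O))
    return variances


def output_variance(d, r_in, r_out, variances):
    """Propagate a unit input variance through the relations of Section E."""
    total = 0.0
    for k, (ranks, rank_out, (var_I, var_G, var_O)) in enumerate(
        zip(r_in, r_out, variances), start=1
    ):
        nu_z = gaussian_moment_factor(d, k) * var_I ** k
        var_h = prod(ranks) * var_G * nu_z
        total += rank_out * var_O * var_h
    return total


# Hypothetical second-order layer: d = 256, input ranks 30, output ranks 50.
d = 256
r_in = [[30], [30, 30]]
r_out = [50, 50]
lam = [0.99, 0.01]

variances = taylor_init_variances(d, r_in, r_out, lam)
print("output variance:", output_variance(d, r_in, r_out, variances))
```

With $\lambda_1 = 0.99$ and $\lambda_2 = 0.01$, the first-order term carries almost all of the output variance at initialization, so the layer starts close to a linear map.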

Performance comparison for different methods using 5 random seeds. Here Small/16 and Base/32 mean the patch size is 16 × 16 and 32 × 32, respectively. We can observe that Taylor-Mixer has slightly higher accuracy but fewer parameters than the MLP-Mixer. In particular, our Base model exhibits a significant reduction in model parameters.

Performance comparison of the proposed Taylor-NLP and pNLP-Mixer using IMDB dataset.

Summary of notations.

Configurations of Taylor-Mixer architectures for different model scales: Small and Base. For a fair comparison, we follow the experimental settings in the MLP-Mixer work.

E ANALYSIS ON TAYLOR INITIALIZATION IN PROPOSITION 4.1

In this section, we offer the theoretical analysis of the Taylor initialization in Proposition 4.1. First, we introduce two lemmas used to decompose the variance of the output variables.

Lemma E.1. Suppose that (i) $w_1$ is independent of $w_2$, $x_1$, and $x_2$, (ii) $w_2$ is independent of $w_1$, $x_1$, and $x_2$, and (iii) … completing the proof.

Lemma E.2. If $E[w_j] = 0$, and $w_j$ is independent of $w_k$ and $x_i$ for all $j$, $k \neq j$, $i$, then we have … According to Lemma E.1, we can eliminate the second term above. Thus we have … completing the proof.

Next, we introduce a conjecture which will be used in our main proof.

Conjecture E.1. For a random vector $x$ following the standard multivariate Gaussian distribution and for arbitrary $k \le d$, we have

$$\sum_{i_1, \ldots, i_k}^{d} E[x_{i_1}^2 \cdots x_{i_k}^2] = \frac{(d + 2k - 2)!!}{(d - 2)!!}, \tag{30}$$

where $i_1, \ldots, i_k$ denote the indices of $x$, $d$ is the dimension of the input, and $k$ is the order of the Taylor series expansion.

We first offer a proof of this conjecture for small $k$ ($k = 1, 2, 3, 4$) using enumeration. This is because we often choose a lower-order Taylor expansion for each layer in TaylorNet, considering the computational cost.

Proof. We use the property of the unit Gaussian distribution that $E[x^p] = (p - 1)!!$, where $x$ follows the unit Gaussian distribution and $p$ is a positive even integer.

Given that $v_k$ is initialized to 0, we can decompose the above matrix multiplication into … When choosing the initialization variance for the original TaylorNet as shown in Eq. 16, we can set different $\lambda_k$ to scale the importance of the $k$-th-order term. Similarly, in Reducible TaylorNet, we would also like to scale the variance of $T_{kjm}$ for different $m$. Namely, let $\sigma_{T,km}^2$ denote the variance of $T_{kjm}$, and let $\sigma_{T,k}^2$ be the standard variance for $T_{kj}$; then they should satisfy $\sigma_{T,km}^2 = \lambda_m \sigma_{T,k}^2$. Now we focus on the second-order Reducible TaylorNet. We can derive $\nu_{z,k}$ for $k = 1, 2$ as …

Initialization. Using the same methodology as in Section E, we need to choose $\sigma \ldots)^2$. On the other hand, we would also like to ensure that $\nu_{z,k} = (\sigma \ldots$

