TAYLORNET: A TAYLOR-DRIVEN GENERIC NEURAL ARCHITECTURE

Abstract

In this work, we propose the Taylor Neural Network (TaylorNet), a generic neural architecture that parameterizes Taylor polynomials using DNNs without non-linear activation functions. The key challenges in developing TaylorNet lie in: (i) mitigating the curse of dimensionality caused by higher-order terms, and (ii) improving the stability of model training. To overcome these challenges, we first adopt Tucker decomposition to decompose the higher-order derivatives in the Taylor expansion, parameterized by DNNs, into low-rank tensors. We then propose a novel reducible TaylorNet that further reduces computational complexity by removing more redundant parameters in the hidden layers. To improve training accuracy and stability, we develop a new Taylor initialization method. Finally, the proposed models are evaluated on a broad spectrum of applications, including image classification, natural language processing (NLP), and dynamical systems. The results demonstrate that our proposed Taylor-Mixer, which replaces the MLP and activation layers in the MLP-Mixer with Taylor layers, achieves comparable accuracy on image classification and on sentiment analysis in NLP, while significantly reducing the number of model parameters. More importantly, our method can explicitly learn and interpret some dynamical systems with Taylor polynomials. The results also demonstrate that our Taylor initialization significantly improves classification accuracy compared to Xavier and Kaiming initialization.

1. INTRODUCTION

This paper proposes a generic neural architecture, called TaylorNet, that parameterizes Taylor polynomials using deep neural networks. It can be applied to a variety of domains, including image classification, dynamical systems, and natural language processing (NLP). Importantly, the proposed method does not use non-linear activation functions, thus promoting the interpretability of DNNs in some applications, such as dynamical systems. This work is motivated by the growing popularity of physics-guided machine learning (ML) (Jia et al., 2021; Daw et al., 2017), which integrates physical priors into neural networks and thereby endows them with the ability to generalize better to new domains. As a result, physics-guided ML has been widely applied to a variety of areas, such as dynamical systems (Cranmer et al., 2020; Lusch et al., 2018; Greydanus et al., 2019), quantum mechanics (Schütt et al., 2017), and climate modeling (Kashinath et al., 2021; Pathak et al., 2022). However, existing DNN-based methods are either tailored to specific problems, such as PDEs (Li et al., 2020; Raissi et al., 2017) and dynamics prediction (Greydanus et al., 2019; Lusch et al., 2018; Wang et al., 2019), or hard to interpret. Hence, the question is: can we develop a generic interpretable neural architecture that can be used in a wide range of machine learning domains? In this paper, we develop a novel Taylor-driven neural architecture, called TaylorNet, that parameterizes Taylor polynomials using DNNs without non-linear activation functions, as shown in Fig. 1. The proposed TaylorNet generalizes to a wide spectrum of ML tasks, ranging from computer vision and dynamical systems to NLP. However, developing TaylorNet poses two main challenges. First, the computational complexity of a Taylor polynomial parameterized by DNNs grows exponentially as the polynomial order increases.
Second, its higher-order terms often lead to training instability. To address these challenges, we first adopt Tucker decomposition to decompose the higher-order tensors in the Taylor expansion into a set of low-rank tensors (Malik & Becker, 2018; Kolda & Bader, 2009). To further reduce its computational complexity, we propose a reducible TaylorNet that removes more redundant learnable parameters in the hidden layers. To demonstrate the generality of our architecture, we propose a Taylor-Mixer that uses Taylor layers in place of both the MLP layers and activation functions in the MLP-Mixer (Tolstikhin et al., 2021). We then develop a new Taylor initialization method to improve the stability and accuracy of the model. Finally, we evaluate the proposed Taylor-Mixer and TaylorNet on a variety of applications, including image classification, dynamical systems, and NLP. The results show that our Taylor-Mixer achieves accuracy comparable to the MLP-Mixer on image classification while exhibiting a considerable reduction in model parameters. The proposed TaylorNet also explicitly learns and interprets the dynamics of two classic physical systems with polynomials. Besides, our method can also be applied to NLP: the evaluation results on sentiment analysis demonstrate accuracy competitive with the recently proposed pNLP-Mixer (Fusco et al., 2022). Meanwhile, our Taylor initialization reaches an accuracy that is over 10% higher than the Xavier and Kaiming initialization methods for the proposed TaylorNet (Glorot & Bengio, 2010; He et al., 2015).
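To make the construction concrete, the following sketch shows what an order-2 Taylor layer with a Tucker-factored quadratic term could look like. This is a minimal illustration of the general idea, not the paper's implementation; the dimensions, rank, and all variable names are assumptions chosen for readability.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, rank = 8, 4, 3  # illustrative sizes, not from the paper

# First-order term: an ordinary weight matrix and bias.
W1 = rng.normal(size=(d_out, d_in))
b = rng.normal(size=d_out)

# Second-order term: a full (d_out, d_in, d_in) tensor would cost
# d_out * d_in^2 parameters.  Tucker factors replace it with a small
# core G and three factor matrices, costing rank^3 + rank*(d_out + 2*d_in).
G = rng.normal(size=(rank, rank, rank))
U = rng.normal(size=(d_out, rank))   # output-mode factor
V = rng.normal(size=(d_in, rank))    # first input-mode factor
Wf = rng.normal(size=(d_in, rank))   # second input-mode factor

def taylor_layer(x):
    """Order-2 Taylor layer: bias + linear term + Tucker-factored quadratic term."""
    linear = W1 @ x + b
    # Contract x with both input modes of the implicit third-order tensor.
    quadratic = U @ np.einsum("pqr,q,r->p", G, V.T @ x, Wf.T @ x)
    return linear + quadratic

x = rng.normal(size=d_in)
y = taylor_layer(x)

# Sanity check: the factored form equals contracting the full tensor T.
T = np.einsum("pqr,op,iq,jr->oij", G, U, V, Wf)
y_full = W1 @ x + b + np.einsum("oij,i,j->o", T, x, x)
assert np.allclose(y, y_full)
```

Note that no activation function appears anywhere: the non-linearity comes entirely from the quadratic (and, at higher orders, higher-degree) terms, which is what makes the learned polynomial directly readable.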
In summary, our contributions include: 1) we design TaylorNet, a novel neural architecture without activation functions that generalizes to a broad spectrum of ML domains; 2) we propose a reducible TaylorNet that remarkably reduces the number of learnable parameters; 3) we propose a new Taylor initialization method that stabilizes model training; 4) we develop a Taylor-Mixer that replaces both the MLP layers and activation functions in the MLP-Mixer with Taylor layers and achieves comparable accuracy on image classification and sentiment analysis; and 5) our approach can explicitly learn and explain some dynamical systems with polynomials, paving the way for interpretable ML.



Figure 1: (a) Architecture of a Taylor layer of order 2 using Tucker decomposition; (b) TaylorNet consisting of N Taylor layers of order 2.

Polynomial Neural Networks. We summarize the key differences between the proposed TaylorNet and existing polynomial neural networks. Earlier work (Nikolaev & Iba, 2006) mainly adopted the Group Method of Data Handling to learn partial descriptors in polynomial neural networks. A follow-up pi-sigma network (Shin & Ghosh, 1991) was developed to filter the input features using predefined bases. These methods, however, fail to scale to high-dimensional data (Chrysos et al., 2020). Recently, researchers designed Π-Nets (Chrysos et al., 2020; 2022), which parameterize polynomial functions using deep neural networks and tensor decomposition. However, the performance of Π-Nets degrades in larger networks. In addition, Π-Net can be viewed as a special case of TaylorNet at expansion point 0, since its CP decomposition is a special case of Tucker decomposition. Furthermore, we develop a novel Taylor initialization to improve training stability, while Π-Net does not. Our initialization method also differs from the initialization paradigm for Tensorial Convolutional Neural Networks (Pan et al., 2022).

Related Work on Taylor Series. Some recent studies developed Taylor-based neural networks. For example, TaylorSwiftNet (Pourheydari et al., 2021) was developed for time-series prediction, but it is not applicable to high-dimensional classification. Recently, Mao et al. (2022) developed Phy-Taylor to learn physical dynamics based on partial physics knowledge, but it suffers from
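The claim above that CP decomposition is a special case of Tucker decomposition can be checked numerically: a CP decomposition is exactly a Tucker decomposition whose core tensor is superdiagonal. The sizes and names below are illustrative, not from any cited implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
d, rank = 5, 3  # illustrative sizes

# CP decomposition of a third-order tensor: sum_r a_r (x) b_r (x) c_r.
A = rng.normal(size=(d, rank))
B = rng.normal(size=(d, rank))
C = rng.normal(size=(d, rank))
T_cp = np.einsum("ir,jr,kr->ijk", A, B, C)

# The same tensor written as a Tucker decomposition whose core G is
# superdiagonal: ones on the (r, r, r) diagonal, zeros elsewhere.
G = np.zeros((rank, rank, rank))
for r in range(rank):
    G[r, r, r] = 1.0
T_tucker = np.einsum("pqr,ip,jq,kr->ijk", G, A, B, C)

assert np.allclose(T_cp, T_tucker)
```

A general (dense) Tucker core can mix every combination of factor columns, which is why a Tucker-factored TaylorNet layer strictly contains the CP-factored Π-Net form.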

