RETHINKING SKIP CONNECTION MODEL AS A LEARNABLE MARKOV CHAIN

Abstract

In the years since the birth of ResNet, the skip connection has become the de facto standard in the design of modern architectures due to its widespread adoption, easy optimization, and proven performance. Prior work has explained the effectiveness of the skip connection mechanism from different perspectives. In this work, we take a deep dive into the behavior of models with skip connections, which can be formulated as a learnable Markov chain. An efficient Markov chain is preferred, as it always maps the input data to the target domain in a better way. However, although a model can be explained as a Markov chain, existing SGD-based optimizers, prone to getting trapped in local optima, do not guarantee that it is optimized as an efficient Markov chain. To move towards a more efficient Markov chain, we propose a simple routine, the penal connection, that makes any residual-like model a learnable Markov chain. Beyond that, the penal connection can also be viewed as a particular form of model regularization and can be implemented with one line of code in the most popular deep learning frameworks. Encouraging experimental results on multi-modal translation and image recognition empirically confirm our conjecture of the learnable Markov chain view and demonstrate the superiority of the proposed penal connection.

1. INTRODUCTION

Over the last decade, deep learning has been dominant in many tasks, including image recognition (Voulodimos et al., 2018), machine translation (Singh et al., 2017), speech recognition (Zhang et al., 2018), etc. Many SGD-based methods and excellent network structures have come to the fore (Alom et al., 2019). Among them, the skip connection (He et al., 2016) is a widely used technique to improve the performance and the convergence of deep neural networks. Aided by the skip connection, models with very deep layers can be easily optimized by SGD-based methods (Amari, 1993), e.g., vanilla SGD (Cherry et al., 1998), Momentum SGD (Sutskever et al., 2013), Adagrad (Lydia and Francis, 2019), and Adam (Kingma and Ba, 2014). Several theoretical explanations of how it works have been proposed (Li and Yuan, 2017; Allen-Zhu et al., 2019). In this work, we continue to explore the behavior of models with skip connections and view such a model as a learnable Markov chain (abbreviated as Markov chain hereafter) (Gagniuc, 2017). To the best of our knowledge, this is the first analysis from this perspective. In the conception of the Markov chain, the output of a residual block is denoted as the predicted direction with respect to the input. For better elaboration, we introduce another term, the ideal direction. The ideal direction always points in a more accurate direction than the predicted one and can translate an input to the target domain more efficiently. We then define an indicator ε, based on the angle between the predicted and ideal directions, to reflect how efficient a learned Markov chain is. In contrast to the original predicted direction, an efficient Markov chain following the ideal direction is preferred, since it always maps the input to the target domain in a better way. However, we are aware that existing SGD-based optimizers are quite lazy in updating the model to follow an efficient Markov chain, which limits the upper bound of performance.
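The angle between the predicted and ideal directions can be made concrete with a short sketch. The function name `direction_angle` and the toy vectors are illustrative stand-ins for z_l|x_{l-1} and d_l|x_{l-1}; the formal definition of ε is given in the method section, and this sketch only shows the angle computation the indicator is based on.

```python
import numpy as np

def direction_angle(predicted, ideal):
    """Angle (radians) between a predicted direction z_l and an ideal direction d_l."""
    cos = np.dot(predicted, ideal) / (np.linalg.norm(predicted) * np.linalg.norm(ideal))
    # Clip to guard against floating-point values slightly outside [-1, 1].
    return np.arccos(np.clip(cos, -1.0, 1.0))

# A node whose predicted direction is nearly aligned with the ideal one
# contributes to a more efficient chain (small angle):
z = np.array([1.0, 0.1])   # predicted direction of one node
d = np.array([1.0, 0.0])   # ideal direction toward the target domain
angle = direction_angle(z, d)
```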
To train a more efficient Markov chain, we propose a very simple routine, the penal connection, which converts a residual-like model into a Markov chain by adding just one line of code in existing deep learning frameworks. On the one hand, the penal connection is capable of forcing the optimizer to update the model following the rules of an efficient Markov chain. On the other hand, it can be viewed as a type of additional model regularization, which alleviates over-fitting and enhances generalization. Compared with the original residual-like model, the Markov chain also has more benefits for deeper networks, which suffer from performance degradation as the number of learnable parameters grows. The experimental results on multi-modal translation and image recognition not only demonstrate the feasibility of regarding a residual-like model as a Markov chain but also verify the superiority of the proposed penal connection throughout the optimization process. Our main contributions can be summarized as two-fold. First, we present a new perspective that understands the skip connection model as a learnable Markov chain and carry out exhaustive theoretical analysis and experimental verification. Second, we propose the penal connection, a simple method that enables a network to be optimized as a more efficient Markov chain, which can substantially improve performance in both NLP and CV.

2. METHOD

In this section, we first reformulate the residual-like model as a Markov chain and introduce ε to reflect the efficiency of a learned Markov chain. We then define a δ-convex chain and provide a convergence proof based on it. Before exploring the optimization algorithm, the dilemma between the behavior of an efficient Markov chain and existing back-propagation algorithms is thoroughly discussed. Lastly, we propose the penal connection mechanism to boost the performance of the Markov chain.
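As a rough illustration of how a penal connection can be wired in as a single extra line per node, the sketch below accumulates a hypothetical penalty on each node's predicted direction and adds it to the task loss. The squared-norm form of the penalty, the coefficient `lam`, and the names `forward_with_penalty` and `task_loss` are placeholder assumptions for illustration only, not the penalty derived in this paper.

```python
import numpy as np

def task_loss(pred, target):
    """Toy task loss: mean squared error."""
    return float(np.mean((pred - target) ** 2))

def forward_with_penalty(x, blocks, target, lam=0.01):
    """Residual forward pass plus a one-line penal term accumulated per node."""
    penalty = 0.0
    for f in blocks:
        z = f(x)                                  # predicted direction z_l|x_{l-1}
        penalty += lam * float(np.sum(z ** 2))    # hypothetical one-line penal connection
        x = x + z                                 # x_l = x_{l-1} + z_l|x_{l-1}
    return task_loss(x, target) + penalty

# Two toy blocks, each predicting a small step proportional to the input:
blocks = [lambda x: 0.1 * x] * 2
loss = forward_with_penalty(np.ones(3), blocks, np.zeros(3))
```

With `lam = 0` the routine reduces to the ordinary residual forward pass, so the penal term behaves as additional regularization on top of the unchanged model.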



Figure 1: A model M with L skip connections can be recognized as a Markov chain C consisting of L nodes. The forward pass corresponds to a Markov process. As shown in Fig. 1(a), a skip connection together with a residual-like block f_{θ_l}(·) builds up a Markov chain node n_l (the gray dashed box in the middle). The input of n_l is x_{l-1}, i.e., the output of the previous Markov node. The output of n_l can be formulated as x_l = x_{l-1} + z_l|x_{l-1}, where z_l|x_{l-1} = f_{θ_l}(x_{l-1}) is the direction predicted by the residual-like block. As shown in Fig. 1(b), guided by z_l, l ∈ {1, ..., L}, the input data x_0 ∈ A can gradually shift to the target label y ∈ B along the learned Markov chain. The red dashed arrow d_l|x_{l-1} is the ideal direction with respect to x_{l-1} and z_l.
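The forward pass in the caption can be sketched as follows, assuming toy hand-written blocks in place of learned residual blocks; `chain_forward` and the halving blocks are illustrative names, not part of the paper. Each node adds its predicted direction, so the state drifts from the source domain toward the target.

```python
import numpy as np

def chain_forward(x0, blocks):
    """Forward pass as a Markov process: each node adds its predicted direction."""
    x = x0
    for f in blocks:
        x = x + f(x)   # x_l = x_{l-1} + z_l|x_{l-1}
    return x

# Toy chain whose blocks each step halfway toward a target y, so the
# input gradually shifts from the source domain toward y:
y = np.array([4.0, 4.0])
blocks = [lambda x: 0.5 * (y - x)] * 3   # hypothetical blocks, not learned ones
out = chain_forward(np.zeros(2), blocks)  # distance to y shrinks at every node
```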

Funding

* Equal contribution. † Corresponding author. The work is supported in part by NSFC Grants (62072449).

