RETHINKING SKIP CONNECTION MODEL AS A LEARNABLE MARKOV CHAIN

Abstract

In the years since the birth of ResNet, the skip connection has become the de facto standard in the design of modern architectures thanks to its widespread adoption, easy optimization, and proven performance. Prior work has explained the effectiveness of the skip connection mechanism from different perspectives. In this work, we take a deep dive into the behavior of models with skip connections, which can be formulated as a learnable Markov chain. An efficient Markov chain is preferred, as it always maps the input data to the target domain in a better way. However, although a model can be explained as a Markov chain, it is not guaranteed to be optimized as an efficient one by existing SGD-based optimizers, which are prone to getting trapped in local optima. To move towards a more efficient Markov chain, we propose a simple routine, the penal connection, that makes any residual-like model a learnable Markov chain. The penal connection can also be viewed as a particular form of model regularization and can be implemented with one line of code in the most popular deep learning frameworks. Encouraging experimental results on multi-modal translation and image recognition empirically confirm our conjecture of the learnable Markov chain view and demonstrate the superiority of the proposed penal connection.

1. INTRODUCTION

Over the last decade, deep learning has been dominant in many tasks, including image recognition (Voulodimos et al., 2018), machine translation (Singh et al., 2017), speech recognition (Zhang et al., 2018), etc. Many SGD-based methods and excellent network structures have come to the fore (Alom et al., 2019). Among them, the skip connection (He et al., 2016) is a widely used technique to improve the performance and convergence of deep neural networks. Aided by skip connections, models with very deep layers can be easily optimized by SGD-based methods (Amari, 1993), e.g., vanilla SGD (Cherry et al., 1998), Momentum SGD (Sutskever et al., 2013), Adagrad (Lydia and Francis, 2019), and Adam (Kingma and Ba, 2014). Several theoretical explanations of how the skip connection works have been proposed (Li and Yuan, 2017; Allen-Zhu et al., 2019), yet the question remains largely underexplored. In this work, we continue to explore the behavior of models with skip connections and view such a model as a learnable Markov chain (Markov chain for short) (Gagniuc, 2017). To the best of our knowledge, this is the first analysis from this perspective. Under the Markov chain view, the output of a residual block is denoted as the predicted direction with respect to the input. For better elaboration, we introduce another term, the ideal direction. The ideal direction always points in a more accurate direction than the predicted direction, i.e., one that can translate an input to the target domain more efficiently. We then define an indicator ε to reflect how efficient a learned Markov chain is, based on the angle between the predicted and ideal directions. An efficient Markov chain following the ideal direction is preferred over the original predicted direction, since it always maps the input to the target domain in a better way. However, we are aware that existing SGD-based optimizers are quite lazy about updating the model to follow an efficient Markov chain, which limits the upper bound of performance.
To train a more efficient Markov chain, we propose a very simple routine, the penal connection, which converts a residual-like model into a Markov chain by adding just one line of code in existing deep learning frameworks. On the one hand, the penal connection is capable of forcing the optimizer to update the model following the rules of an efficient Markov chain. On the other hand, it can be viewed as an additional form of model regularization, which alleviates over-fitting and enhances generalization. Compared with the original residual-like model, the Markov chain also brings more benefits for deeper networks, which suffer from performance degradation as the number of learnable parameters grows. The experimental results on multi-modal translation and image recognition not only demonstrate the feasibility of regarding a residual-like model as a Markov chain but also verify the superiority of the proposed penal connection throughout the optimization process. Our main contributions can be summarized in two folds. First, we present a new perspective that understands the skip connection model as a learnable Markov chain, and carry out exhaustive theoretical analysis and experimental verification. Second, we propose the penal connection, a simple method that enables a network to be optimized as a more efficient Markov chain, which can substantially improve performance in both NLP and CV.

Figure 1: A model M with L skip connections can be recognized as a Markov chain C consisting of L nodes. The forward pass corresponds to a Markov process. As shown in Fig. 1(a), a skip connection along with a residual-like block f_{θ_l}(·) builds up a Markov chain node n_l (the gray dashed box in the middle). The input of n_l is x_{l-1}, i.e., the output of the previous Markov node. The output of n_l can be formulated as x_l = x_{l-1} + z_l|x_{l-1}, where z_l|x_{l-1} = f_{θ_l}(x_{l-1}) is the predicted direction by the residual-like block. As shown in Fig. 1(b), guided by z_l, l ∈ {1, · · · , L}, the input data x_0 ∈ A gradually shifts to the target label y ∈ B along the learned Markov chain. The red dashed arrow d_l|x_{l-1} is the ideal direction with respect to x_{l-1} and z_l.
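In code, one such chain node can be sketched as follows (a minimal PyTorch illustration; the two-layer MLP standing in for f_θ is our own choice, since the paper allows any residual-like transform):

```python
import torch
import torch.nn as nn

class ChainNode(nn.Module):
    """One Markov chain node: a residual-like block f_theta plus a skip
    connection, so x_l = x_{l-1} + z_l with z_l = f_theta(x_{l-1})."""
    def __init__(self, dim: int):
        super().__init__()
        self.f_theta = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x_prev: torch.Tensor) -> torch.Tensor:
        z = self.f_theta(x_prev)   # predicted direction z_l | x_{l-1}
        return x_prev + z          # next state depends only on x_{l-1}

# A chain C_L is just L nodes applied in sequence (a Markov process).
torch.manual_seed(0)
chain = nn.Sequential(*[ChainNode(16) for _ in range(4)])
x0 = torch.randn(2, 16)
xL = chain(x0)
```

A forward pass through `chain` is exactly the Markov process described above: each node sees only the previous state.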

2. METHOD

In this section, we first reformulate the residual-like model as a Markov chain and introduce ε to reflect the efficiency of a learned Markov chain. Then, we define a δ-convex chain and prove convergence based on it. Before exploring the optimization algorithm, the dilemma between the behavior of an efficient Markov chain and existing back-propagation algorithms is thoroughly discussed. Lastly, we propose the penal connection mechanism to boost the performance of the Markov chain.

2.1. THE LEARNABLE MARKOV CHAIN

Similar to the definition of a traditional Markov chain in Gagniuc (2017), a learnable Markov chain can be defined as:

Definition 1 (The learnable Markov chain C_L). A learnable Markov chain C_L is a stochastic model with learnable parameters {θ_1, · · · , θ_L} describing a sequence of L possible events, in which the state x_l of the current event depends only on the state x_{l-1} attained in the previous event.

Figure 2: A simple task showing the performance difference between a plain net, a skip net, and our proposed Markov net. In this task, we build a model consisting of three fully connected layers with 28 learnable parameters to shift a coordinate (x, y) to a target distribution formulated as x² − y². More details are listed in the supplementary materials. We update the model with the SGD optimizer for 10000 steps. As shown in Fig. 2(b), the Markov net completes this task better. An inefficient chain is easy to learn; an efficient chain is hard to learn.

As shown in Fig. 1(a), M_L denotes a model with L residual-like blocks, and can also be considered a learnable Markov chain C_L with L nodes, since the output state x_l of node n_l (i.e., the output feature map of a residual block) depends only on the state x_{l-1} from the previous node n_{l-1}. As a result, a forward pass through M_L can be viewed as a Markov process, as shown in Fig. 1(b). In a bit more detail, the Markov chain with respect to the input data x_0 ∈ A is formulated as:

$$C_L(x_0 \to x_L) := x_0 \xrightarrow{+z_1|x_0} x_1 \to \cdots \to x_{l-1} \xrightarrow{+z_l|x_{l-1}} x_l \to \cdots \to x_{L-1} \xrightarrow{+z_L|x_{L-1}} x_L \qquad (1)$$

where f_{θ_l} is the feature transformation function of the l-th residual-like block in M_L, which is also the l-th chain node in C_L with learnable parameters θ_l, and z_l|x_{l-1} = f_{θ_l}(x_{l-1}) is the predicted direction for the previous state x_{l-1} in node n_l. Correspondingly, the ideal direction d_l|x_{l-1} with respect to x_{l-1} and z_l can be defined as:

Definition 2 (The ideal direction d_l|x_{l-1}).
Assume the function ℓ measures the distance between two variables, and that if ℓ(a, c) ≥ ℓ(b, c), then ℓ(a, c) ≥ ℓ(µa + (1 − µ)b, c) ≥ ℓ(b, c) for µ ∈ [0, 1]. d_l|x_{l-1} is an ideal direction with respect to x_{l-1} and z_l as long as:

$$\ell(x_{l-1} + \eta d_l|x_{l-1}, y) \le \ell(x_{l-1} + \eta z_l|x_{l-1}, y), \qquad (2)$$

where η is a small step size. Obviously, given x_{l-1}, z_l, and ℓ, the ideal direction d_l is not unique, because any direction that outperforms the predicted direction z_l under the measurement of ℓ is qualified, as long as Eq. 2 holds. We collect all the ideal directions for x_{l-1}, z_l under ℓ as D_{ℓ,z_l}|x_{l-1}.

Lemma 1. If d_l ∈ D_{ℓ,z_l}|x_{l-1}, then d'_l = µd_l + (1 − µ)z_l ∈ D_{ℓ,z_l}|x_{l-1}, where µ ∈ [0, 1].

Proof 2.1. Since d_l is an ideal direction, then:

$$\ell(x_{l-1} + \eta d_l, y) \le \ell\big(\mu(x_{l-1} + \eta d_l) + (1-\mu)(x_{l-1} + \eta z_l), y\big) = \ell\big(x_{l-1} + \eta \underbrace{(\mu d_l + (1-\mu) z_l)}_{d'_l}, y\big) \le \ell(x_{l-1} + \eta z_l, y), \qquad (3)$$

thus d'_l = µd_l + (1 − µ)z_l is an ideal direction with respect to z_l and x_{l-1}.

Different functions f_{θ_l} take different effects on the set D_{ℓ,z_l}|x_{l-1}. f_{θ_l} is typically a sequence of convolutions (e.g., a residual block in ResNet (He et al., 2016)) or a popular Transformer block introduced in Vaswani et al. (2017), as long as it makes C_L shift the input x_0 ∈ A to the target domain B correctly. Importantly, we next devise an intuitive indicator ε to reflect how efficient C_L(x_0 → x_L) is:

Definition 3 (The efficiency of a Markov chain C_L(x_0 → x_L)). The efficiency indicator ε measures the average cosine similarity between z_l|x_{l-1} and d_l|x_{l-1}:

$$\varepsilon := \frac{1}{L}\sum_{l=1}^{L} \cos \alpha_l = \frac{1}{L}\sum_{l=1}^{L} \langle \vec{z}_l, \vec{d}_l \rangle, \qquad (4)$$

where $\vec{v}$ is the normalized tensor of v, defined as $\vec{v} := v/\|v\|_2$, and ∠α_l is the angle between z_l|x_{l-1} and d_l|x_{l-1}. By this definition, a larger ε indicates a smaller ∠α_l between the predicted direction z_l and the ideal direction d_l, mirroring a more efficient Markov chain, and vice versa.
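Definition 3 reduces to a mean of cosine similarities, which can be computed directly; a minimal NumPy sketch (the toy vectors are our own illustration):

```python
import numpy as np

def chain_efficiency(zs, ds):
    """Efficiency indicator from Definition 3:
    eps = (1/L) * sum_l <z_hat_l, d_hat_l>, the mean cosine similarity
    between predicted directions z_l and ideal directions d_l."""
    cos = [float(np.dot(z.ravel(), d.ravel())
                 / (np.linalg.norm(z) * np.linalg.norm(d)))
           for z, d in zip(zs, ds)]
    return sum(cos) / len(cos)

zs = [np.array([1.0, 0.0]), np.array([0.0, 2.0])]
aligned = chain_efficiency(zs, zs)                  # identical directions
opposed = chain_efficiency(zs, [-z for z in zs])    # opposite directions
```

Perfect alignment gives ε = 1 and exactly opposite directions give ε = −1, matching the interpretation that a larger ε means a smaller angle ∠α_l.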
More specifically, if z_l always has a positive fraction along d_l for all nodes, we call the chain a convex chain, formally defined as follows:

Definition 4 (δ-convex chain). If ∃δ > 0 such that ∀n_l ∈ C_L(x_0 → x_L), ⟨z_l, d_l⟩ > δ∥d_l∥²₂, then C_L(x_0 → x_L) is dubbed a δ-convex chain.

Lemma 2. If C_L(x_0 → x_L) is δ-convex, then ε > 0.

Proof 2.2. Since C_L(x_0 → x_L) is δ-convex, then for all l:

$$\langle z_l, d_l\rangle = \langle \|z_l\|_2 \vec z_l, \|d_l\|_2 \vec d_l\rangle = \|z_l\|_2\|d_l\|_2 \langle \vec z_l, \vec d_l\rangle > \delta \|d_l\|_2^2 > 0. \qquad (5)$$

It is noteworthy that ∥z_l∥₂∥d_l∥₂ > 0, thus ∀n_l, ⟨z⃗_l, d⃗_l⟩ > 0, which yields ε = (1/L)Σ_l ⟨z⃗_l, d⃗_l⟩ > 0.

Lemma 2 tells us that if C_L(x_0 → x_L) is δ-convex, it must be an efficient Markov chain, while the reverse is not necessarily true. As shown in Fig. 1(b), the plotted chain is an efficient Markov chain, but it is not a δ-convex chain, since no positive δ makes ∠α_1 satisfy Definition 4. If C_L(x_0 → x_L) is δ-convex, then in every node n_l a positive fraction of the predicted direction z_l, i.e., ∥z_l∥ cos ∠α_l, points in the same direction as d_l. Given an appropriate step size η, the input can eventually arrive at the target domain, despite following a winding path. Fig. 2(a) also illustrates a simple model to verify this. The source input gets closer to the target domain B at every step while moving along a δ-convex Markov chain (visualized in brown). Formally, we conclude with the following lemma ensuring convergence.

Lemma 3. For each chain node n_l ∈ C_L(x_0 → x_L), consider the forward process x_l = x_{l-1} + z_l, where E[z_l] = d_l and E[∥z_l∥²_F] ≤ Z². Suppose that for all nodes, x_l always stays inside the δ-convex region with diameter D, i.e., ∥x_l − y∥_F ≤ D. Then, for any a > 0 and any L such that L^a log L ≥ D²δ²/((1+a)Z²), we have E[∥x_L − y∥²_F] ≤ (1+a)Z² log L/(δ²L).

The proof of Lemma 3 appears in Appendix B. Notably, Lemma 3 does not imply that x_L equals y.
It only states that x_L is sufficiently close to y if z_l predicts the correct direction for x_{l-1}. For a longer chain (i.e., a deeper network with larger L), x_L is restricted to be closer to the target y with a relatively smaller error. Taken together, Lemma 3 guarantees that a Markov chain C_L can shift a source input x_0 ∈ A to the target domain B through L nodes in an efficient way if C_L is δ-convex. With this in hand, we can confidently settle down to optimizing M_L from the perspective of C_L.
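The qualitative content of Lemma 3 can be checked with a toy simulation (entirely our own construction, not the paper's trained model): when every step keeps a positive fraction along the ideal direction d_l = y − x_{l-1}, a longer chain ends closer to the target despite per-step noise.

```python
import numpy as np

def run_chain(L, dim=8, eta=0.1, seed=0):
    """Toy delta-convex chain: each predicted direction z_l keeps a
    positive fraction along the ideal direction d_l = y - x_{l-1},
    plus small noise, so <z_l, d_l> > 0 in expectation."""
    rng = np.random.default_rng(seed)
    x, y = np.zeros(dim), np.ones(dim)
    for _ in range(L):
        d = y - x                                 # ideal direction
        z = eta * d + 0.05 * rng.normal(size=dim) # noisy predicted direction
        x = x + z
    return float(np.linalg.norm(x - y))

# Longer chains (deeper networks) end closer to the target domain.
dist_short, dist_long = run_chain(5), run_chain(200)
```

The final distance shrinks with L up to a noise floor, consistent with the O(log L / L) bound in Lemma 3.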

2.2. MARKOV CHAIN OPTIMIZATION

Before exploring the optimization method of a Markov chain, it is necessary to discuss carefully how the efficiency of a Markov chain takes effect in the optimization process. According to the aforementioned Lemma 3, if all parameters θ_l are optimized in a way that keeps C_L(x_0 → x_L) a δ-convex chain for any input x_0, the convergence speed of optimization should be substantially improved, and the final performance as well. However, we caution that a δ-convex chain is not guaranteed by existing SGD-based optimizers, such as SGD and Adam. In fact, existing optimizers tend toward an efficient or even an inefficient Markov chain, where the loss landscape is smoother, instead of a δ-convex one. Through analysis and practice, we find that SGD-based optimizers have the potential to bridge this gap, as discussed later. A diagrammatic sketch of efficient and inefficient Markov chains is illustrated in Fig. 2(a) to aid understanding. Intuitively, as ε → 1, an efficient Markov chain acts more like a δ-convex chain and inherits its good properties. However, a too-large ε is a disaster for existing optimizers, resulting in hard convergence. Hence, the main idea of optimizing a Markov chain is to train a "reasonably" efficient one, i.e., to keep ∥ε∥²₂ no larger than a given threshold ρ. The objective function can thus be formulated as:

$$\mathcal{L} = \mathcal{L}_{M_L}(x_L, y) + \lambda \|\varepsilon\|_2^2. \qquad (6)$$

The first term of Eq. 6 is the original loss function of the model M_L, and the second term is the additional penalization weighted by λ to hold ∥ε∥²₂ ≤ ρ. If we set λ = 0 in Eq. 6, the objective degenerates into the plain mode without any remedy toward a more efficient Markov chain. Instead, if we set λ > 0, the penalization takes effect, enabling a more efficient Markov model. The toy example in Fig.
2(b) visualizes the significant effect of this penalization term over the plain net (MLPs) and the skip net (MLPs with skip connections).

Figure 3: (a) The proposed ideal direction, computed based on g_{x_l}; x'_l = x_l − ηg_{x_l} can be viewed as updating x_l by g_{x_l} with a small learning rate η, so that d'_l = z_l − ηg_{x_l}. (b) The backward propagation of gradients within a Markov chain node. Compared with the residual-like model, we only add an additional gradient g_{z_l} when computing the gradient with respect to f_{θ_l}.

In order to solve Eq. 6, the correct ideal direction d_l with respect to x_{l-1} needs to be figured out. As discussed previously, d_l is not unique; different definitions of ℓ thus lead to different sets of d_l. In fact, since the feature space of x_l usually lies in a very high-dimensional space, it would be a great challenge to find a suitable ℓ to define an ideal direction d_l. For a chain node n_l, we reuse the target loss function L_{M_L} to build a valid ℓ function:

$$\ell(x_l, y) := \mathcal{L}_{M_L}(C_L(x_l \to x_L), y). \qquad (7)$$

That is, x_l is passed forward along the remaining chain nodes n_{l+1}, · · · , n_L, and the final output x_L is used to compute the loss against y via L_{M_L}. This way, the gradient with respect to x_l is:

$$g_{x_l} := \frac{\partial \ell(x_l, y)}{\partial x_l} = \frac{\partial \mathcal{L}_{M_L}(x_L, y)}{\partial x_l}. \qquad (8)$$

Hence, g_{x_l} can be obtained from the backward propagation during the training process, and an ideal direction d'_l based on g_{x_l} is z_l − ηg_{x_l}, where η is a small step size, as shown in Fig. 3(a).
Proof 2.3. Since x_l = x_{l-1} + z_l, and ℓ(x_l − ηg_{x_l}, y) < ℓ(x_l, y) always holds, set d'_l := z_l − ηg_{x_l}; then

$$\ell(x_{l-1} + d'_l, y) = \ell(x_{l-1} + z_l - \eta g_{x_l}, y) = \ell(x_l - \eta g_{x_l}, y) < \ell(x_l, y) = \ell(x_{l-1} + z_l, y). \qquad (9)$$

Thus, d'_l = z_l − ηg_{x_l} is an ideal direction for x_{l-1}.

From Lemma 1, we know that d_l = µd'_l + (1 − µ)z_l = z_l − ηµg_{x_l} is also an ideal direction, i.e., z_l − ηµg_{x_l} ∈ D_{ℓ,z_l}|x_{l-1}. Assume ∥g_{x_l}∥₂ ≥ 1; then we can set µ = 1/∥g_{x_l}∥₂ ∈ [0, 1], which yields d_l = z_l − η g⃗_{x_l}. Since η is small, d⃗_l can be approximated as z⃗_l − η g⃗_{x_l}, and ε can be reformulated as:

$$\varepsilon = \frac{1}{L}\sum_{l=1}^{L}\langle \vec z_l, \vec d_l\rangle \approx \frac{1}{L}\sum_{l=1}^{L}\langle \vec z_l, \vec z_l - \eta \vec g_{x_l}\rangle = \frac{1}{L}\sum_{l=1}^{L}\big(1 + \langle \vec z_l, -\eta \vec g_{x_l}\rangle\big) = 1 + \eta\, \underbrace{\frac{1}{L}\sum_{l=1}^{L}\langle \vec z_l, -\vec g_{x_l}\rangle}_{\varepsilon'} \qquad (10)$$

where ε′ will be used in the following experiments instead of ε, as it indicates the same efficiency properties of the Markov chain but is easier to interpret (a larger ε′ indicates a more efficient C_L). Then, the gradient with respect to z_l can be computed as:

$$g_{z_l} := \frac{\partial \mathcal{L}_{M_L}(x_L, y)}{\partial x_l}\frac{\partial x_l}{\partial z_l} + \frac{\partial \lambda\|\varepsilon\|_2^2}{\partial z_l} = g_{x_l}\frac{\partial (x_{l-1}+z_l)}{\partial z_l} + \lambda \frac{\partial \big\|\frac{1}{L}\sum_{l=1}^{L} \big(1 + \eta\langle \vec z_l, -\vec g_{x_l}\rangle\big)\big\|_2^2}{\partial z_l} \approx g_{x_l} + \lambda \frac{\partial \|1 + \eta\langle c z_l, -\vec g_{x_l}\rangle\|_2^2}{\partial z_l} \qquad (11)$$

$$= g_{x_l} + 2\lambda(-\eta \vec g_{x_l})\big(1 + \langle c z_l, -\eta \vec g_{x_l}\rangle\big) \approx \Big(1 - \frac{2\lambda\eta}{\|g_{x_l}\|_2}\Big) g_{x_l} + 2\lambda c \eta^2 z_l \approx g_{x_l} + \tau z_l \qquad (12)$$

where τ is a hyper-parameter and c := 1/∥z_l∥₂ in Eq. 11 is regarded as a constant to simplify the gradient derivation. The analysis of this estimation can be found in Appendix C. Despite the several hyper-parameters introduced to facilitate the derivation, e.g., λ, η, c, ρ, we only need to specify a single hyper-parameter τ in the final formulation, which relieves the heavy burden of a hyper-parameter sweep.
We dub this special optimization method the penal connection, since it appears to add a simple penalization on the norm of z_l as a type of model regularization. The compute graph is plotted in Fig. 3(b), and it can be easily implemented in the PyTorch framework by adding one line of code (see Algo. 1).

Algorithm 1: Pseudo code of the penal connection in a PyTorch-like style.

z_l = f_theta_l(x_prev)
# Only the following line is added, registering a backward hook on z_l
z_l.register_hook(lambda g_x_l, z_l=z_l.detach().clone(): g_x_l + tau * z_l)
x_l = x_prev + z_l

Lastly, due to the various choices of ℓ, the gradient with respect to z_l varies correspondingly. Further community exploration of more reasonable and effective ways to compute the ideal direction would be of great value to continually push forward the performance of the Markov chain.
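The one-line hook can be checked end to end. The sketch below (the toy linear block and τ value are our own choices) verifies that the hook delivers g_{x_l} + τ·z_l to f_θ, which coincides with the gradient obtained by explicitly adding a (τ/2)∥z_l∥²₂ penalty to the loss, supporting the regularization reading of the penal connection.

```python
import torch

def penal_connection(z: torch.Tensor, tau: float) -> torch.Tensor:
    """During backward, the gradient reaching z_l becomes g_{x_l} + tau*z_l
    (z_l is detached, so the added term acts as a constant)."""
    z.register_hook(lambda g, zc=z.detach().clone(): g + tau * zc)
    return z

torch.manual_seed(0)
tau = 0.1
f = torch.nn.Linear(4, 4)        # toy stand-in for a residual block f_theta
x_prev = torch.randn(3, 4)

# --- chain node with the penal connection hook ---
z = f(x_prev)
penal_connection(z, tau)
x = x_prev + z
x.pow(2).sum().backward()        # any downstream loss
grad_hook = f.weight.grad.clone()

# --- reference: explicit regularizer (tau/2) * ||z||^2 added to the loss ---
f.zero_grad()
z2 = f(x_prev)
x2 = x_prev + z2
(x2.pow(2).sum() + 0.5 * tau * z2.pow(2).sum()).backward()
grad_reg = f.weight.grad.clone()

assert torch.allclose(grad_hook, grad_reg, atol=1e-6)
```

Note that the hook must close over a detached copy of z_l, in line with the footnote about memory leaks.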

3. EXPERIMENTS

In this section, we conduct intensive experiments to demonstrate the superiority of the Markov chain in the fields of natural language processing and computer vision. Unless otherwise specified, all the residual-like blocks in M are converted to Markov chain nodes n_l in C. Specifically, a Transformer block (Vaswani et al., 2017), which consists of a multi-head self-attention module (MSA) and a feed-forward network (FFN), is converted into two chain nodes, since both MSA and FFN employ their own skip connections. For simplicity, we set τ to the same value for all nodes in a Markov chain. All experiments are carried out with publicly available projects implemented in PyTorch (Paszke et al., 2017) on a device equipped with 8 NVIDIA A100 GPUs. More experimental details can be found in the supplementary materials.
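A hedged sketch of this two-node conversion (layer-norm placement, hidden sizes, and the τ value here are our assumptions, not the paper's exact recipe):

```python
import torch
import torch.nn as nn

def penal(z: torch.Tensor, tau: float) -> torch.Tensor:
    """Penal connection on one chain node's output z_l."""
    if z.requires_grad:  # hooks only apply when gradients are tracked
        z.register_hook(lambda g, zc=z.detach().clone(): g + tau * zc)
    return z

class MarkovTransformerBlock(nn.Module):
    """A Transformer block treated as two Markov chain nodes: one for
    multi-head self-attention (MSA) and one for the feed-forward network
    (FFN), each with its own skip and penal connection."""
    def __init__(self, dim: int, heads: int, tau: float = 3e-4):
        super().__init__()
        self.tau = tau
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.msa = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm1(x)
        z1 = self.msa(h, h, h, need_weights=False)[0]  # node 1: MSA
        x = x + penal(z1, self.tau)
        z2 = self.ffn(self.norm2(x))                   # node 2: FFN
        return x + penal(z2, self.tau)

torch.manual_seed(0)
blk = MarkovTransformerBlock(dim=16, heads=4)
out = blk(torch.randn(2, 5, 16))
```

Each of the two residual sums is a chain node, so a depth-N Transformer yields a chain with 2N nodes.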

3.1. MULTI-MODAL TRANSLATION

Multi-modal translation requires a model to be capable of translating text from the source language A to the target language B. How efficient and inefficient Markov chains perform during training can be clearly observed in this experiment.

Figure 4: The testing curve across different epochs on WMT16 English-to-German translation tasks (the German-to-English curve can be found in the supplementary materials).

Dataset. WMT16 (Bojar et al., 2016) is a widely used translation dataset based on the data from statmt.org, which contains various interesting translation tasks on specific domains. Here, we focus on the news translation tasks between English and German. The text for all the test sets is drawn from news articles.

Implementation details. We adopt the most widely used benchmark Transformer (Vaswani et al., 2017) as our strong baseline. The embedding size d_emb is 512, and the source and target embedding layers are shared. We empirically set τ to 3 × 10⁻⁴, which generalizes well to all translation tasks. We opt for the mutual translation tasks between English and German for validation. All models are trained using the Adam optimizer with β₁ = 0.9, β₂ = 0.98. We use a batch size of 1024 and a weight decay of 0.05; other training recipes are identical to the original implementation (Vaswani et al., 2017). We set up two regular training settings, Q and S, separately (see Table 1).

Result analysis. From Fig. 4 and Tab. 1, a few patterns can be observed. One is that the Markov chain with the penal connection converges faster than its residual counterpart. Surprisingly, the Markov chain trained for 200 epochs can outperform the baseline model well-trained for 2000 epochs (a 10× training schedule), demonstrating the merits of the proposed method. It is worth noting that the Transformer model does not saturate without an adequate training schedule length. The ε′ value plotted in Fig. 4 may illustrate this phenomenon. From the curve, we find that the Transformer model is more likely to be an inefficient Markov chain whose ε′ is always negative. Aided by the proposed penal connection, it becomes an efficient Markov chain whose ε′ is mostly positive. Furthermore, we observe that at the early training stage, C_L is an inefficient chain, even worse than the baseline. After several epochs, it conversely turns into a δ-convex chain (with a large ε′).
After that, C_L decays to a less efficient Markov chain with a small positive ε′. This phenomenon also confirms our analysis in the method section: moving along a δ-convex chain helps converge to the target in the most efficient way, but it may not be the best way under existing optimizers. To achieve higher performance, the model is pushed to be a less efficient chain. As for the baseline model, we find that it always acts as an inefficient Markov chain, which is more easily trapped in local minima.

3.2. IMAGE CLASSIFICATION

Different from translation tasks, the gradient for image classification is quite sparse, and the redundant parameters often guide the optimizer toward an efficient Markov chain, or even a δ-convex chain, suggesting that even without the penal connection the model can be optimized efficiently. Despite that, we still observe that a slight τ (smaller than 10⁻⁷) can further improve the final accuracy by a non-trivial margin. This benefit is at least partly due to the model regularization effect of τ, which alleviates over-fitting.

Dataset. We conduct a series of experiments on the task of image classification using the ImageNet-1K dataset (Deng et al., 2009), which consists of 1.28 million training images across 1000 classes and 50k images for validation.

Implementation details. There are many variants of residual-like models used in image classification. We conduct experiments on representative types of models, i.e., ResNet (He et al., 2016) and ViT (Dosovitskiy et al., 2020) together with its variants, e.g., DeiT (Touvron et al., 2021) and Swin (Liu et al., 2021). During the experiments, we find that all these models can learn an efficient Markov chain or even a δ-convex chain, so only a very small τ (less than 10⁻⁶) is applied. All models are trained with a 10-epoch linear warmup followed by a cosine decay schedule for 290 epochs. More details can be found in the supplementary materials.

Result analysis. As listed in Tab. 2, although the baseline models have already achieved saturated accuracy, a suitable τ with the penal connection routine can push this performance further by a non-trivial margin. Across our experiments, a large τ usually hurts model performance. This observation reflects another effect of the penal connection: τ not only engages the model to learn an efficient Markov chain, but also keeps the Markov chain from becoming too efficient, i.e., away from a δ-convex chain. We think this is also reasonable, because a δ-convex chain may lead to hard optimization, and a slight τ can take effect in this situation. In other words, the penal connection can also be viewed as a model regularization that improves the final performance by alleviating over-fitting, from a new standpoint.

3.3. MODEL DEGRADATION

One of the main reasons residual-like models have become widely used across various tasks is their capacity to counter the problem of model degradation in deeper networks. Model degradation refers to the phenomenon that, with increasing depth of a model M_L, performance no longer improves and may even worsen. We find that by converting a residual-like model M_L to a Markov chain C_L, the model degradation problem can be mitigated. With the aid of the penal connection routine, a deep model can reach higher performance than its original counterpart.

Dataset. The CIFAR10 dataset consists of 60k 32×32 color images in 10 classes, with 50k training images and 10k test images. The CIFAR100 dataset is identical to CIFAR10, except that the number of classes is 100.

Implementation details. We take ResNet (He et al., 2016) for comparison. To investigate the performance trend of models over different depths, we build seven models with different depths, i.e., L ∈ {18, 20, 32, 34, 44, 50, 56}. We use a momentum SGD optimizer with a batch size of 128 and a weight decay of 0.0001 for CIFAR10 and 0.0005 for CIFAR100. τ is 3 × 10⁻⁹ for all experiments. We train all models for 200 epochs from an initial learning rate of 0.1. The learning rate decays by a factor of 10 at epochs 100 and 150 for CIFAR10, and at epochs 60, 120, and 160 for CIFAR100.

Result analysis. The results are listed in Tab. 3. All the Markov chain models significantly advance the baseline residual models. In particular, as the models go deeper, the performance of the baselines saturates first and decays afterward. On the contrary, the Markov chain with the penal connection consistently achieves stable gains. This evidence confirms that our proposed method can further alleviate the model degradation problem, motivating future scaling efforts in depth.
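The CIFAR10 schedule described above can be sketched as follows (a minimal sketch: the Linear module merely stands in for a ResNet, while the milestones and hyper-parameters follow the text):

```python
import torch

model = torch.nn.Linear(10, 10)  # placeholder for a ResNet of any depth
opt = torch.optim.SGD(model.parameters(), lr=0.1,
                      momentum=0.9, weight_decay=1e-4)
# Decay by 10x at epochs 100 and 150 (the CIFAR10 setting above)
sched = torch.optim.lr_scheduler.MultiStepLR(
    opt, milestones=[100, 150], gamma=0.1)

lrs = []
for epoch in range(200):
    opt.step()        # (actual training step elided)
    sched.step()
    lrs.append(opt.param_groups[0]["lr"])
```

For CIFAR100 the same pattern applies with milestones [60, 120, 160].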

4. CONCLUSIONS

In this work, we introduce the conception of a learnable Markov chain for residual-like models, and propose a simple routine, the penal connection, to boost model performance and alleviate model degradation at large depth. Adequate theoretical analysis and comprehensive experiments on different types of architectures across a spectrum of tasks jointly demonstrate the rationality and effectiveness of the learnable Markov chain. While these initial results are encouraging, many challenges remain. For example, a better way to compute the ideal direction of the Markov chain would likely lead to further improved performance. We expect more research to pay attention to this new perspective and inspire future work.

Markov chain. A Markov chain or Markov process is a stochastic model describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event (Gagniuc, 2017). Informally, it can be thought of as "what happens next depends only on the state of affairs now." Markov chains have many applications as statistical models (Karlin, 2014), such as studying cruise control systems in motor vehicles, queues of customers arriving at an airport, currency exchange rates, and animal population dynamics (Meyn and Tweedie, 2009). Markov processes are the basis for general stochastic simulation methods known as Markov chain Monte Carlo, which are used for sampling from complex probability distributions and have found application in Bayesian statistics, thermodynamics, statistical mechanics, physics, chemistry, economics, finance, signal processing, information theory, and speech processing (Gamerman and Lopes, 2006). Recently, the notion of the Markov chain has also played an important role in reinforcement learning (Otterlo and Wiering, 2012), image generation (Ho et al., 2020), and other deep learning-related fields (Mardt et al., 2018).
Shwartz-Ziv and Tishby (2017) regard the feed-forward process of a neural network as a Markov chain and explain the behavior of the neural network during optimization from the perspective of Shannon information theory.

B PROOF OF LEMMA 3

Proof B.1. According to the forward pass, we have

$$E[\|x_{l+1} - y\|_F^2] = E[\|x_l - y + z_{l+1}\|_F^2] = E[\|x_l - y\|_F^2] + 2\langle x_l - y, d_{l+1}\rangle + \|z_{l+1}\|_F^2 \qquad (13)$$

$$\le E[\|x_l - y\|_F^2] + 2\langle x_l - y, d_{l+1}\rangle + Z^2 \le (1 - 2\delta)E[\|x_l - y\|_F^2] + Z^2. \qquad (14)$$

Now if δE[∥x_l − y∥²_F] ≥ Z², we know that E[∥x_l − y∥²_F] decreases by a factor of (1 − δ) at every chain node. Otherwise, although it could increase, we know

$$E[\|x_l - y\|_F^2] \le \frac{Z^2}{\delta}. \qquad (15)$$

After L nodes, either E[∥x_L − y∥²_F] is already smaller than Z²/δ = (1+a)Z² log L/(δ²L), or it decreases by a factor of (1 − δ) at every node, which means

$$E[\|x_L - y\|_F^2] \le E[\|x_0 - y\|_F^2](1-\delta)^L \le D^2 e^{-\delta L} = D^2 e^{-(1+a)\log L} = \frac{D^2}{L^a\, L} \le \frac{(1+a) Z^2 \log L}{\delta^2 L}. \qquad (16)$$

The last inequality holds since

$$L^a \log L \ge \frac{D^2 \delta^2}{(1+a)Z^2}. \qquad (17)$$

Thus, E[∥x_L − y∥²_F] is smaller than (1+a)Z² log L/(δ²L).

C ANALYSIS OF ESTIMATION g z l

Here, we give a detailed analysis of the estimation error of g_{z_l} in Eq. 12. First, we strictly follow the chain rule to derive the accurate formula of g_{z_l}:

$$g_{z_l} := \frac{\partial \mathcal{L}_{M_L}(x_L, y)}{\partial x_l}\frac{\partial x_l}{\partial z_l} + \frac{\partial \lambda \|\varepsilon\|_2^2}{\partial z_l} = g_{x_l} + \lambda \frac{\partial \|1 + \eta\langle \vec z_l, -\vec g_{x_l}\rangle\|_2^2}{\partial z_l} = g_{x_l} + 2\lambda\Big(1 + \Big\langle \frac{z_l}{\|z_l\|_2}, -\eta \vec g_{x_l}\Big\rangle\Big)\underbrace{\frac{\partial \langle \frac{z_l}{\|z_l\|_2}, -\eta \vec g_{x_l}\rangle}{\partial z_l}}_{A} \qquad (18)$$

where

$$A := \frac{\partial}{\partial z_l}\frac{\langle z_l, -\eta \vec g_{x_l}\rangle}{\|z_l\|_2} = \frac{(-\eta \vec g_{x_l})\|z_l\|_2 - \|z_l\|_2^{-1} z_l \langle z_l, -\eta \vec g_{x_l}\rangle}{\|z_l\|_2^2} = -c\eta \vec g_{x_l} + c^3 \eta t\, z_l, \qquad (19)$$

with c := 1/∥z_l∥₂ as defined in Eq. 11 and t := ⟨z_l, g⃗_{x_l}⟩. Substituting Eq. 19 into Eq. 18:

$$g_{z_l} = g_{x_l} + 2\lambda(1 - c\eta t)\big(-c\eta \vec g_{x_l} + c^3 \eta t\, z_l\big), \qquad (20)$$

which can be reformulated as

$$g_{z_l} = \Big(1 - \frac{2c\eta\lambda(1 - c\eta t)}{\|g_{x_l}\|_2}\Big) g_{x_l} + 2\eta\lambda c^3 t (1 - c\eta t)\, z_l. \qquad (21)$$

Hence, the estimation error term ϵ can be calculated by subtracting Eq. 12 from Eq. 21:

$$\epsilon = -\frac{2c\eta\lambda(1 - c\eta t)}{\|g_{x_l}\|_2}\, g_{x_l} + \big(2\eta\lambda c^3 t(1 - c\eta t) - \tau\big) z_l. \qquad (22)$$

Since λ and η are very close to zero, the influence of the first term on the right can be ignored, so the second term mainly contributes to the estimation error. When we choose a suitable hyper-parameter τ such that τ = 2ηλc³t(1 − cηt) holds, the estimation error ϵ can be driven very close to zero. Empirically, the value of τ varies significantly across different tasks.

D THE QUALITY OF THE PROPOSED IDEAL DIRECTION

There are many ways to define an ideal direction. As shown in Fig. 5, our proposed ideal direction calculation approach satisfies the definition in most situations.

E ADVANTAGE OF RESNET OVER VGG

Lemma 4. If C_L(x_0 → x_L) is δ-convex, then

$$\varepsilon' \ge \varepsilon'' \qquad (23)$$

where ε′ is the efficiency of the ResNet chain and ε″ is the efficiency of the VGG chain.

Proof E.1. Without loss of generality, we discuss only one node here. As shown in Fig. 6, since C is δ-convex, x_l always gets closer to the target compared with x_{l-1} and the origin. Hence,

$$\langle \vec x_{l-1}, \vec d_l\rangle \ge 0 \qquad (24)$$

holds. Suppose the parameters θ_l are the same for ResNet and VGG; then:

$$x'_l = z_l + x_{l-1}, \qquad (25)$$

$$x''_l = z_l, \qquad (26)$$

where z_l := f_{θ_l}(x_{l-1}). x'_l and x''_l are the next chain states for ResNet and VGG, respectively. For chain node x_{l-1}, the estimated direction for ResNet is z_l and the estimated direction for VGG is z_l − x_{l-1}. Therefore,

$$\varepsilon' = \langle z_l, d_l\rangle, \qquad (27)$$

$$\varepsilon'' = \langle z_l - x_{l-1}, d_l\rangle. \qquad (28)$$

According to Eq. 24, we have:

$$\langle x_{l-1}, d_l\rangle \ge 0 \;\Rightarrow\; \langle z_l, d_l\rangle - \langle z_l, d_l\rangle + \langle x_{l-1}, d_l\rangle \ge 0 \;\Rightarrow\; \langle z_l, d_l\rangle \ge \langle z_l - x_{l-1}, d_l\rangle \;\Rightarrow\; \varepsilon' \ge \varepsilon''. \qquad (29)$$

Lemma 4 indicates that ResNet can form a more efficient Markov chain than VGG, leading to better performance.
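The inequality in Lemma 4 follows from ⟨z_l, d_l⟩ − ⟨z_l − x_{l-1}, d_l⟩ = ⟨x_{l-1}, d_l⟩ ≥ 0, which can be checked numerically (our own toy vectors and premise enforcement, not the paper's experiment):

```python
import numpy as np

def scores(x_prev, z, d):
    """ResNet-style vs VGG-style alignment with the ideal direction d_l
    (cf. Eqs. 27-28): ResNet's predicted direction is z_l, while VGG's
    is z_l - x_{l-1}."""
    return np.dot(z, d), np.dot(z - x_prev, d)

rng = np.random.default_rng(1)
for _ in range(1000):
    x_prev, z, d = rng.normal(size=(3, 8))
    if np.dot(x_prev, d) < 0:
        d = -d                  # enforce the delta-convex premise <x, d> >= 0
    res, vgg = scores(x_prev, z, d)
    assert res >= vgg           # Lemma 4's inequality holds exactly
```

The assertion never fires because the gap between the two scores equals the premise quantity ⟨x_{l-1}, d_l⟩ itself.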

F MORE DISCUSSION ON MODEL DEGRADATION

As shown in Fig. 7, a randomly initialized Markov chain is prone to move in zigzags and even turn back as the chain grows longer (i.e., the model goes deeper), which hinders the model from fitting the target distribution efficiently and probably results in worse performance. When we apply the penal connection to enforce an efficient Markov chain, the turn-back chain nodes no longer exist, so each node contributes at least a non-negative effect no matter how long the chain is. As a result, the model degradation problem can be alleviated.



Footnotes: (1) Writing z_l.register_hook(lambda g_x_l: g_x_l + tau * z_l) without detaching z_l is not allowed, as it could lead to a memory leak. (2) CIFAR10: https://www.cs.toronto.edu/~kriz/cifar.html (3) To simplify the derivation process, we use the F-norm to measure the distance between each state x_l and the target y, without loss of generality.




Figure 2: (a) The influence of an efficient/inefficient Markov chain. (b) A toy model demonstrating the importance of a suitable ε.

Figure 5: The difference ℓ(x_l − ηg_{x_l}, y) − ℓ(x_l, y) during training in the toy model. In more than 70.0% of situations, ℓ(x_l − ηg_{x_l}, y) < ℓ(x_l, y) strictly holds.

Figure 6: Advantage of ResNet over VGG under the Markov chain's guidance. As long as C_L(x_0 → x_L) is a δ-convex chain, the chain formed by ResNet is more efficient than the one formed by VGG.

Figure 7: Structural comparison between a folding Markov chain (a) and an unfolding Markov chain (b).

Table 1: Testing PPL (the lower the better) and accuracy (the higher the better) of the Transformer (Trans.) and its counterpart Markov chain (Markov.).

Table 2: The top-1 accuracy on ImageNet-1K for different architectures with different τ.


Table 3: The accuracy of ResNet over different depths on CIFAR10, CIFAR100, and ImageNet-1K.

FUNDING

* Equal contribution. † Corresponding author. This work is supported in part by NSFC Grants (62072449).

