GRAPH PERMUTATION SELECTION FOR DECODING OF ERROR CORRECTION CODES USING SELF-ATTENTION

Anonymous authors
Paper under double-blind review

Abstract

Error correction codes are an integral part of communication applications and boost the reliability of transmission. The optimal decoding of transmitted codewords is the maximum likelihood rule, which is NP-hard. For practical realizations, suboptimal decoding algorithms are employed; however, the lack of theoretical insight currently impedes the exploitation of the full potential of these algorithms. One key insight is the choice of permutation in permutation decoding. We present a data-driven framework for permutation selection, combining domain knowledge with machine learning concepts such as node embedding and self-attention. Significant and consistent improvements in the bit error rate are shown for the simulated Bose-Chaudhuri-Hocquenghem (BCH) codes as compared to the baseline decoders. To the best of our knowledge, this work is the first to leverage the benefits of self-attention networks in physical layer communication systems.

1. INTRODUCTION

Shannon's well-known channel coding theorem (Shannon, 1948) states that for every channel there exists a code such that encoded messages can be transmitted and decoded with an error as low as desired, as long as the transmission rate is below the channel's capacity. For practical applications, latency and computational complexity constrain code size. Thus, structured codes with low-complexity encoding and decoding schemes were devised. Some structured codes possess a key feature known as the permutation group (PG). The permutations in the PG map each codeword to some distinct codeword. This is crucial to different decoders, such as the parallelizable soft-decision Belief Propagation (BP) decoder (Pearl, 2014). The benefit stems from empirical evidence that whereas decoding a corrupted word may fail, decoding a permuted version of the same corrupted word may succeed (MacWilliams, 1964). For instance, this is exploited in the mRRD (Dimnik & Be'ery, 2009) and BPL (Elkelesh et al., 2018) algorithms, which perform multiple runs over different permuted versions of the same corrupted codeword, trading off complexity for higher decoding gains. Nonetheless, there is room for improvement, since not all permutations are required for successful decoding of a given word: a single fitting one is enough. Our work deals with obtaining the best-fit permutation per word, removing redundant runs and thus preserving computational resources. Nevertheless, it remains unclear how to obtain this type of permutation, as indicated by the authors of (Elkelesh et al., 2018), who stated in their Section III.A that "there exists no clear evidence on which graph permutation performs best for a given input". Explicitly, the goal is to approximate a function mapping from a single word to the most-probable-to-decode permutation. While analytical derivation of this function is hard, advances in the machine learning field may be of use in its computation.
The recent emergence of Deep Learning (DL) has demonstrated the advantages of Neural Networks (NN) in a myriad of communication and information theory applications where no analytical solution exists (Simeone, 2018; Zappone et al., 2019). For instance, in (Belghazi et al., 2018) a tight lower bound on the mutual information between two high-dimensional continuous variables was estimated with NN. Another recurring motivation for the use of NN in communications has to do with the amount of data at hand. Several data-driven solutions were described in (Caciularu & Burshtein, 2018; Lin et al., 2019) for scenarios with small amounts of data, since data samples in the real world are costly and hard to collect on the fly. On the other hand, one should not belittle the benefits of unlimited simulated data; see (Be'ery et al., 2020; Simeone et al., 2020). Lately, two main classes of machine learning decoders have been put forward. The first is the class of model-free decoders employing neural network architectures as in (Gruber et al., 2017; Kim et al., 2018). The second is composed of model-based decoders (Nachmani et al., 2016; 2018; Doan et al., 2018; Lian et al., 2019; Carpi et al., 2019) implementing parameterized versions of classical BP decoders. Currently, the model-based approach dominates, but it suffers from a regularized hypothesis space due to its inductive bias. Our work leverages permutation groups and DL to enhance the decoding capabilities of constrained model-based decoders. First, a self-attention model (Vaswani et al., 2017), described in Section 3, is employed to embed all the distinct group permutations of a code in a word-independent manner by extracting relevant features. This is done once, in a preprocessing phase before test time. At test time, a trained NN accepts a corrupted word together with the embedded permutations and predicts the probability of successful decoding for each permutation.
Thereafter, a set of the one, five or ten most-probable-to-decode permutations is chosen, and decoding is carried out on the permuted channel words, rather than decoding an arbitrary dataset with all permutations and empirically choosing the best subset of them. Our method is evaluated on the renowned BCH codes.

2. RELATED WORK

Permutation decoding (PD) has attracted renewed attention (Kamenev et al., 2019; Doan et al., 2018; Hashemi et al., 2018) given its proven gains for the 5G-standard-approved polar codes. (Kamenev et al., 2019) suggested a novel PD method for these codes; however, the main novelty lies in the proposed stopping criterion for the list decoder, whereas the permutations are chosen at random. The authors in (Doan et al., 2018) presented an algorithm that forms a permutation set by fixing the first several layers of the underlying structure of the polar decoder and permuting only the last layers. The original graph is included in this set by default, with additional permutations added during a limited-space search. Finally, we refer to (Hashemi et al., 2018), which proposes a successive permutations scheme that finds suitable permutations as decoding progresses. Again, due to the exploding search space, they only considered the cyclic shifts of each layer. This limited search first appeared in (Korada, 2009). Most PD methods, like the ones mentioned above, have made valuable contributions. We, on the other hand, see the choice of permutation as the most integral part of PD, and suggest a pre-decoding module to choose the best-fitting one. Note, however, that direct comparisons between the model-based PD works mentioned and ours are infeasible. Regarding model-free approaches, we refer in particular to (Bennatan et al., 2018), since it integrates permutation groups into a model-free approach. In that paper, the decoding network accepts the syndrome of the hard decisions as part of the input; this way, domain knowledge is incorporated into the model-free approach. We introduce domain knowledge by training the permutation embedding on the parity-check matrix and accepting the permuted syndrome. Furthermore, in that work a fitting permutation is chosen for each word such that the sum of LLRs in the positions of the information bits is maximized. Note that this approach only benefits model-free decoders; here as well, comparisons are infeasible.

3. BACKGROUND

Coding. In a typical communication system, a length-k binary message m ∈ {0, 1}^k is first encoded by a generator matrix G into a length-n codeword c = Gm ∈ {0, 1}^n. Every codeword c satisfies Hc = 0, where H is the parity-check matrix (satisfying HG = 0). Next, the codeword c is modulated by the Binary Phase Shift Keying (BPSK) mapping (0 → 1, 1 → −1), resulting in a modulated word x. After transmission through the additive white Gaussian noise (AWGN) channel, the received word is y = x + z, where z ∼ N(0, σ_z² I_n). At the receiver, the received word is first checked for detectable errors. For that purpose, an estimated codeword ĉ is calculated using the hard decision (HD) rule ĉ_i = 1{y_i < 0}. If the syndrome s = Hĉ is all zeros, one outputs ĉ and concludes. A non-zero syndrome indicates that channel errors occurred; then, a decoding function dec : y → {0, 1}^n is utilized, with output ĉ. One standard soft-decision decoding algorithm is Belief Propagation (BP). BP is a graph-based inference algorithm that decodes corrupted codewords in an iterative manner, working over a factor graph known as the Tanner graph. The Tanner graph is an undirected graphical model depicting the constraints that define the code. In these graphs, BP messages propagated along cycles become correlated after several BP iterations, preventing convergence to the correct posterior distribution and thus reducing overall decoding performance. We refer the interested reader to (Richardson & Urbanke, 2008) for a full derivation of BP for linear codes, and to (Dehghan & Banihashemi, 2018) for more details on the effects of cycles in codes. Other works (Nachmani et al., 2016; 2018) assigned learnable weights θ to the BP algorithm. This formulation unfolds the BP algorithm into a NN, referred to as weighted BP (WBP).
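The modulation, hard-decision and syndrome-check steps above can be sketched in a few lines. This is a toy illustration, not the paper's code: the (7,4) Hamming parity-check matrix stands in for a BCH code, and the noise level is an arbitrary choice.

```python
import numpy as np

# Hypothetical parity-check matrix in systematic form H = [P | I_{n-k}]
# for the (7,4) Hamming code, standing in for a BCH code.
H = np.array([[1, 1, 0, 1, 1, 0, 0],
              [1, 0, 1, 1, 0, 1, 0],
              [0, 1, 1, 1, 0, 0, 1]])

def bpsk(c):
    # BPSK mapping: 0 -> +1, 1 -> -1
    return 1.0 - 2.0 * c

def hard_decision(y):
    # HD rule: c_hat_i = 1 if y_i < 0, else 0
    return (y < 0).astype(int)

def syndrome(H, c_hat):
    # s = H c_hat over GF(2); non-zero s indicates channel errors
    return H.dot(c_hat) % 2

c = np.zeros(7, dtype=int)                 # the all-zero word is a codeword
y = bpsk(c) + 0.5 * np.random.randn(7)     # AWGN with sigma_z = 0.5 (arbitrary)
errors_detected = syndrome(H, hard_decision(y)).any()
```

When `errors_detected` is true, a decoder such as BP would be invoked on y.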
The intuition offered was that the trained weights compensate for the short cycles (the most damaging to performance) in the Tanner graph.

Permutation Group of a code. Let π be a permutation on {1, ..., n}. A permutation of a codeword c = (c_1, ..., c_n) exchanges the positions of the entries of c: π(c) = (c_{π(1)}, c_{π(2)}, ..., c_{π(n)}). A permutation π is an automorphism of a given code C if c ∈ C implies π(c) ∈ C. The group of all automorphisms of a code C is denoted Aut(C), also referred to as the PG of the code. Only a few codes have known PGs (Guenda, 2010); among them are the BCH codes, whose PG is given in (MacWilliams & Sloane, 1977, p. 233) as

π_{α,β}(i) = 2^α · i + β (mod n),

with α ∈ {1, ..., log₂(n + 1)} and β ∈ {1, ..., n}. Thus a total of n log₂(n + 1) permutations compose Aut(C). One possible way to mitigate the detrimental effects of cycles is to use code permutations: we can apply BP to the permuted received word and then apply the inverse permutation to the decoded word. This can be viewed as applying BP to the original received word with different weights on the variable nodes. Since there are cycles in the Tanner graph, there is no guarantee that BP will converge to an optimal solution, and each permutation enables a different decoding attempt. This strategy has been shown to yield better convergence and overall decoding performance gains (Dimnik & Be'ery, 2009), as observed in our experiments in Section 5.
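The BCH permutation group above is small enough to enumerate directly. The sketch below is a hedged illustration using 0-indexed positions (with β running over the same n values as the paper's β ∈ {1, ..., n}); `bch_permutation_group` is a name chosen here, not from the paper.

```python
def bch_permutation_group(n):
    """Enumerate pi_{alpha,beta}(i) = 2^alpha * i + beta (mod n) for a
    BCH code of length n = 2^m - 1, with 0-indexed positions."""
    m = (n + 1).bit_length() - 1   # n = 2^m - 1, so m = log2(n + 1)
    perms = []
    for alpha in range(1, m + 1):
        for beta in range(n):
            # gcd(2^alpha, n) = 1 since n is odd, so this map is a bijection
            perms.append(tuple((2**alpha * i + beta) % n for i in range(n)))
    return perms

perms = bch_permutation_group(31)   # 31 * log2(32) = 155 permutations
```

For BCH(63, k) the same call yields 63 · 6 = 378 permutations, matching the n log₂(n + 1) count above.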

Graph Node Embedding

The method we propose uses a node embedding technique for embedding the variable nodes of the code's Tanner graph, thus taking the code structure into consideration. Specifically, in Sec. 4.2 we employ the node2vec method (Grover & Leskovec, 2016). We briefly describe this method here; the reader can refer to the paper for more technical details. The task of node embedding is to encode the nodes of a graph as low-dimensional vectors that summarize their relative graph position and the structure of their local neighborhood. Each learned vector corresponds to a node in the graph, and it has been shown that geometric relations are captured in the learned vector space; e.g., interactions that are modeled as edges between the nodes in the graph. Specifically, node2vec is trained by maximizing the mean probability of the occurrence of subsequent nodes in fixed-length sampled random walks. It employs both breadth-first (BFS) and depth-first (DFS) graph searches to produce high-quality, informative node representations.

Self-Attention. Attention is a mechanism designed to enable neural models to focus on the most relevant parts of the input. This modern neural architecture allows for the use of weighted averaging to optimize a task objective and to deal with variable-sized inputs. When an input sequence is fed into an attention model, the resulting output is an embedded representation of the input. When a single sequence is fed, the attention mechanism attends to all positions within the same sequence; this is commonly referred to as the self-attention representation of a sequence. Initially, self-attention modelling was used in conjunction with recurrent neural networks (RNNs) and convolutional neural networks (CNNs), mostly for natural language processing (NLP) tasks. In (Bahdanau et al., 2015), this setup was first employed and shown to produce superior results on multiple automatic machine translation tasks.
In this work we use self-attention for permutation representation. This mechanism enables better and richer permutation modelling compared to a non-attentive representation. The rationale behind using self-attention comes from the preservation of permutation distance metrics: a pair of "similar" permutations will have close geometric self-attentive representations in the learned vector space, since the number of index swaps between permutations only affects the positional embedding additions.

Figure 1: A schematic architecture of the Graph Permutation Selection (GPS) classifier.

4.1. PROBLEM FORMULATION AND ALGORITHM OVERVIEW

Assume we want to decode a received word y encoded by a code C. Picking a permutation from the PG Aut(C) may result in better decoding capabilities. However, executing the decoding algorithm for each permutation within the PG is computationally prohibitive, especially if the code's permutation group is large. An alternative approach is to first choose the best permutation and only then decode the corresponding permuted word. Given a received word y, the optimal single permutation π ∈ Aut(C) is the one that minimizes the bit error rate (BER):

π̂ = arg min_{π ∈ Aut(C)} BER(π⁻¹(dec(π(y))), c),    (1)

where c is the transmitted codeword and BER is the Hamming distance between binary vectors. The solution to Eq. (1) is intractable, since the correct codeword is not known in the decoding process. We propose a data-driven approach as an approximate solution. The gist of our approach is to estimate the best permutation without applying a tedious decoding process for each code permutation and without relying on the correct codeword c. We highlight the key points of our approach below, and elaborate on each one in the rest of this section. Our architecture is depicted in Fig. 1. The main components are the permutation embedding (Section 4.2) and the permutation classifier (Section 4.3). First, the permutation embedding block perm2vec receives a permutation π and outputs an embedding vector q_π. Next, the vectors π(y) and q_π are the input to the permutation classifier, which computes an estimate p(y, π) of the probability of the word π(y) being successfully decoded by dec. We then select the permutation whose estimated probability of successful decoding is maximal:

π̂ = arg max_{π ∈ Aut(C)} p(y, π),    (2)

and decoding is done on π̂(y). Finally, the decoded word ĉ = π̂⁻¹(dec(π̂(y))) is output.
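Under assumed stand-in names (`classifier`, `decoder` and `perm_embeddings` are hypothetical placeholders, not the paper's API), the select-then-decode rule above can be sketched as:

```python
import numpy as np

def select_and_decode(y, perms, perm_embeddings, classifier, decoder):
    """Score every embedded permutation, decode only under the best one,
    then undo the permutation on the decoder output."""
    # One classifier forward pass per permutation; the passes are
    # independent and could run in parallel.
    scores = [classifier(y[list(p)], q)
              for p, q in zip(perms, perm_embeddings)]
    best = int(np.argmax(scores))       # pi_hat = argmax_pi p(y, pi)
    p = perms[best]
    inv = np.argsort(p)                 # inverse permutation
    c_hat = decoder(y[list(p)])         # decode the permuted word
    return c_hat[inv]                   # pi_hat^{-1}(dec(pi_hat(y)))
```

The inverse permutation is obtained via `argsort`, since `p[inv[i]] = i` for any permutation array `p`.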

4.2. PERMUTATION EMBEDDING

Our permutation embedding model consists of two sublayers: self-attention followed by an average pooling layer. To the best of our knowledge, our work is the first to leverage the benefits of the self-attention network in physical layer communication systems. In (Vaswani et al., 2017), positional encodings are vectors whose entries are based on sinusoids of varying frequency. They are added to the input elements prior to the first self-attention layer, in order to add a position-dependent signal to each embedded token and help the model incorporate the order of the input tokens by injecting information about their relative or absolute position. Inspired by this method and other recent NLP works (Devlin et al., 2019; Liu et al., 2019; Yang et al., 2019), we use learned positional embeddings, which have been shown to yield better performance than constant positional encodings; but instead of initializing them randomly, we first pre-train node2vec node embeddings over the corresponding code's Tanner graph and take the variable-node output embeddings to serve as the initial positional embeddings. This helps our model incorporate some of the graph structure and use the code information. We denote by d_w the dimension of the output embedding space (this hyperparameter is set before the node embedding training). It should be noted that any other node embedding model could be trained instead of node2vec; we leave this for future work. Self-attention sublayers usually employ multiple attention heads, but we found that using one attention head was sufficient. Furthermore, using more self-attention layers did not improve the results either. Denote the embedding vector of π(i) by u_i ∈ R^{d_w} and the embedding of the i-th variable node by v_i ∈ R^{d_w}. Note that both u_i and v_i are learned, but as stated above, v_i is initialized with the output of the pre-trained variable node embedding over the code's Tanner graph.
Thereafter, the augmented attention head operates on an input vector sequence W = (w_1, ..., w_n) of n vectors, where w_i ∈ R^{d_w} and w_i = u_i + v_i. The attention head computes a same-length vector sequence P = (p_1, ..., p_n), where p_i ∈ R^{d_p}. Each output vector is computed as a weighted sum of linearly transformed input entries,

p_i = Σ_{j=1}^{n} a_{ij} (V w_j),

where the attention weight coefficients are computed with the softmax function,

a_{ij} = exp(b_{ij}) / Σ_{m=1}^{n} exp(b_{im}),

of the normalized relative attention between two input vectors w_i and w_j,

b_{ij} = (Q w_i)ᵀ (K w_j) / √d_p.

Note that Q, K, V ∈ R^{d_p × d_w} are learned parameter matrices. Finally, the vector representation of the permutation π is computed by applying average pooling across the sequence of output vectors, q_π = (1/n) Σ_{i=1}^{n} p_i, and is passed to the permutation classifier.
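A minimal NumPy sketch of the single attention head and average pooling described above; the dimensions and random weights are illustrative stand-ins for the learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_w, d_p = 7, 8, 4                 # illustrative sizes only
Q, K, V = (rng.normal(size=(d_p, d_w)) for _ in range(3))

def attention_pool(W_seq, Q, K, V):
    """Single-head self-attention over rows of W_seq (shape (n, d_w)),
    followed by average pooling to a d_p-dimensional vector q_pi."""
    d_p = Q.shape[0]
    q_proj = W_seq @ Q.T              # rows are Q w_i
    k_proj = W_seq @ K.T              # rows are K w_i
    v_proj = W_seq @ V.T              # rows are V w_i
    B = q_proj @ k_proj.T / np.sqrt(d_p)        # b_ij
    A = np.exp(B - B.max(axis=1, keepdims=True))
    A /= A.sum(axis=1, keepdims=True)           # a_ij: softmax over j
    P = A @ v_proj                              # p_i = sum_j a_ij (V w_j)
    return P.mean(axis=0)                       # q_pi: average pooling

W_seq = rng.normal(size=(n, d_w))     # stands in for w_i = u_i + v_i
q_pi = attention_pool(W_seq, Q, K, V)
```

The softmax is computed with the usual max-subtraction for numerical stability; it leaves the weights a_{ij} unchanged.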

4.3. PERMUTATION CLASSIFIER

We next describe a classifier that predicts the probability of successful decoding given a received word y and a permutation π represented by a vector q_π. It is more convenient to work with log-likelihood ratios (LLRs) for soft decoding. In the AWGN case the LLR values are given by ℓ = (2/σ_z²) · y, where knowledge of σ_z is assumed. The input to a neural multilayer perceptron (MLP) is formed from the absolute value of the permuted input LLRs |π(ℓ)| and the syndrome s ∈ R^{n−k} of the permuted word π(ℓ). We first use linear mappings to obtain ℓ' = W_ℓ · |π(ℓ)| and s' = W_s · s, where W_ℓ ∈ R^{d_p × n} and W_s ∈ R^{d_p × (n−k)} are learned matrices. Then, inspired by (Wang et al., 2018), we use the following similarity function:

g(h) = w_4ᵀ φ_3(φ_2(φ_1(h))) + b_4,    (3)

where

h = [q_π; ℓ'; s'; q_π ∘ ℓ'; q_π ∘ s'; ℓ' ∘ s'; |q_π − ℓ'|; |q_π − s'|; |ℓ' − s'|].    (4)

Here [·] stands for concatenation and ∘ stands for the Hadamard product. We also define φ_i(x) = LeakyReLU(W_i x + b_i), where W_1 ∈ R^{9d_p × 2d_p}, W_2 ∈ R^{2d_p × d_p}, W_3 ∈ R^{d_p × d_p/2} and w_4 ∈ R^{d_p/2} are the learned matrices and b_1 ∈ R^{2d_p}, b_2 ∈ R^{d_p}, b_3 ∈ R^{d_p/2} and b_4 ∈ R are the learned biases, respectively. Finally, the estimated probability of successful decoding of π(y) is computed as p(y, π) = σ(g(h)), where g(h) is the last hidden layer and σ(·) is the sigmoid function. The Graph Permutation Selection (GPS) algorithm for choosing the most suitable permutation is depicted in Fig. 1.

Table 1: Values of the hyper-parameters of the permutation embedding and classifier.
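The feature vector of Eq. (4) and the MLP of Eq. (3) can be sketched as follows; all weights are random stand-ins for the learned parameters, and the layer shapes follow the dimensions stated above (read as input-to-output sizes).

```python
import numpy as np

rng = np.random.default_rng(1)
d_p = 4                                # illustrative size only

def leaky_relu(x, slope=0.01):
    return np.where(x > 0, x, slope * x)

def features(q, lp, sp):
    # h = [q; l'; s'; q*l'; q*s'; l'*s'; |q-l'|; |q-s'|; |l'-s'|]
    return np.concatenate([q, lp, sp, q * lp, q * sp, lp * sp,
                           np.abs(q - lp), np.abs(q - sp), np.abs(lp - sp)])

# phi_1: 9*d_p -> 2*d_p, phi_2: 2*d_p -> d_p, phi_3: d_p -> d_p/2
W1 = rng.normal(size=(2 * d_p, 9 * d_p)); b1 = np.zeros(2 * d_p)
W2 = rng.normal(size=(d_p, 2 * d_p));     b2 = np.zeros(d_p)
W3 = rng.normal(size=(d_p // 2, d_p));    b3 = np.zeros(d_p // 2)
w4 = rng.normal(size=d_p // 2);           b4 = 0.0

def g(h):
    x = leaky_relu(W1 @ h + b1)
    x = leaky_relu(W2 @ x + b2)
    x = leaky_relu(W3 @ x + b3)
    return w4 @ x + b4

def p_success(q, lp, sp):
    # p(y, pi) = sigmoid(g(h))
    return 1.0 / (1.0 + np.exp(-g(features(q, lp, sp))))
```

Here `q`, `lp` and `sp` stand for q_π, ℓ' and s', all assumed to already live in R^{d_p}.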

4.4. TRAINING DETAILS

We jointly train the permutation embedding and the permutation classifier, employing a single decoder dec. The cross-entropy loss computed for a single received word y is

L = − Σ_π [ d_{y,π} log p(y, π) + (1 − d_{y,π}) log(1 − p(y, π)) ],

where d_{y,π} = 1 if decoding of π(y) was successful and d_{y,π} = 0 otherwise. The set of decoders dec used for dataset generation is described in Section 5. Each mini-batch consists of K received words from the generated training dataset. This dataset contains permuted words (y, π) together with the corresponding labels d_{y,π}. We used the all-zero transmitted codeword; empirically, using only the all-zero word seems to be sufficient for training. Nonetheless, the test dataset is composed of randomly chosen binary codewords c ∈ C, as one would expect, without any degradation in performance. Each codeword is transmitted over the AWGN channel with σ_z specified by a given signal-to-noise ratio (SNR), with an equal number of positive examples (d = 1) and negative examples (d = 0) in each batch. The overall hyperparameters used for training perm2vec and the GPS classifier are depicted in Table 1. To pre-train the node embeddings, we used the default hyperparameters suggested in the original work (Grover & Leskovec, 2016), except for the following modifications: number of random walks 2000, walk length 10, neighborhood size 10 and node embedding dimension d_w = 80. Regarding computational latency, our perm2vec component is executed only at training time, resulting in pre-trained permutation embeddings. All the embeddings are then stored in memory. At test time, we determine the probability p(y, π) of a permutation leading to successful decoding with a single forward pass of the permutation classifier. To find the most suitable permutation, one has to compute n log₂(n + 1) such forward passes. These computations are independent, hence they can be done in parallel.
To conclude, the overall computational latency of our scheme is that of a single forward pass through the permutation classifier network.
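The per-word loss above is a standard binary cross-entropy over (permutation, label) pairs. A minimal sketch, assuming `probs` holds the classifier outputs p(y, π) over the permutations and `labels` the decoding-success indicators d_{y,π}:

```python
import numpy as np

def bce_loss(probs, labels, eps=1e-12):
    """Binary cross-entropy averaged over (permutation, label) pairs;
    probs are clipped to avoid log(0)."""
    probs = np.clip(probs, eps, 1.0 - eps)
    return -np.mean(labels * np.log(probs)
                    + (1.0 - labels) * np.log(1.0 - probs))
```

With balanced batches (equal numbers of d = 1 and d = 0 examples), an uninformative classifier outputting 0.5 everywhere incurs a loss of log 2 per pair.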

5. EXPERIMENTAL SETUP AND RESULTS

The proposed GPS algorithm is evaluated on four different BCH codes: (31, 16), (63, 36), (63, 45) and (127, 64). As for the decoder dec, we applied GPS on top of BP (GPS+BP) and on top of a pre-trained WBP (GPS+WBP), trained with the same configuration as in (Nachmani et al., 2017). All decoders are run with 5 BP iterations, and the syndrome stopping criterion is checked after each iteration. These decoders are based on the systematic parity-check matrices, H = [P | I_{n−k}], since these matrices are commonly used. For comparison, we employ random permutation selection (from the PG) as a baseline for each decoder: rand+BP and rand+WBP. In addition, we depict the maximum likelihood results, which are the theoretical lower bound for each code (for more details, see (Richardson & Urbanke, 2008, Section 1.5)).

Performance Analysis. We assess the quality of our GPS using the BER metric for different SNR values [dB], simulating until at least 1000 word errors occurred. Note that we refer to the SNR as the normalized SNR (E_b/N_0), as commonly used in digital communication. Fig. 2 presents the results for BCH(31, 16) and BCH(63, 36), and Table 2 lists the results for all codes and decoders, with our GPS method and random selection. For clarity, in Table 2 we present the BER negative decimal logarithm only for the baselines, considered as the top-1 results. As can be seen, our preprocessing method outperforms the examined baselines. For BCH(31, 16) (Fig. 2a), perm2vec together with BP gains up to 2.75 dB compared to random BP and up to 1.8 dB over random WBP. Similarly, for BCH(63, 36) (Fig. 2b), our method outperforms random BP by up to 2.75 dB and random WBP by up to 2.2 dB. We also observed a small gap between our method and the maximum likelihood lower bound: the maximal gaps are 0.4 dB and 1.4 dB for BCH(31, 16) and BCH(63, 36), respectively.
Top-κ Evaluation. In order to evaluate our classifier's confidence, we also investigated the performance of the top-κ permutations; this method can be considered a list decoder with a smart permutation selection. This extends Eq. (2) from top-1 to the desired top-κ. The selected codeword ĉ is chosen from the list of κ decoded candidates by the minimum-distance criterion ĉ = arg min_{ĉ_κ} ‖y − ĉ_κ‖²₂, as in (Dimnik & Be'ery, 2009). The results for κ ∈ {1, 5} are depicted in Table 2 and Fig. 3a. Generally, better performance is observed as κ increases, with the added gain gradually eroding. Furthermore, we plot the empirical BP lower bound, achieved by decoding with a 5-iteration BP over all κ = n log₂(n + 1) permutations and selecting the output word by the criterion mentioned above. In Fig. 3a the reported results are for BCH(63, 45). We observed an improvement of 0.4 dB between κ = 1 and κ = 5, and only 0.2 dB between κ = 5 and κ = 10. Furthermore, the gap between κ = 10 and the BP lower bound is small (0.4 dB). Note that using the BP lower bound is impractical, since each BP scales as O(n log n) while our method only scales as O(n). Moreover, in our simulations, we found that the latency of five BP iterations was 10-100 times greater than that of our classifier's inference.

Embedding Size Evaluation. In Fig. 3b we present the performance of our method using two embedding sizes. We compare our base model, which uses embedding size d_q = 80, to a small model with embedding size d_q = 20 (note that d_q = d_w). Recall that changing the embedding size also affects the number of parameters in g, as in Eq. (3). Using a smaller embedding size causes a slight degradation in performance, but still dramatically improves on the random BP baseline. For the shorter BCH(63, 36) the gap is 0.5 dB, and for BCH(127, 64) the gap is 0.2 dB.
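The top-κ list selection can be sketched as follows. This is a hedged illustration: `decode_under` is a hypothetical helper (decode y under the i-th permutation and return the inverse-permuted binary candidate), and the candidate closest to the channel word is kept, with candidates compared in their BPSK-modulated form.

```python
import numpy as np

def topk_decode(y, scores, decode_under, kappa=5):
    """Decode under the kappa highest-scoring permutations and keep the
    candidate whose BPSK-modulated form is closest to y."""
    order = np.argsort(scores)[::-1][:kappa]        # kappa best permutations
    candidates = [decode_under(i, y) for i in order]
    # Squared Euclidean distance to the modulated candidate (0->+1, 1->-1)
    dists = [np.sum((y - (1.0 - 2.0 * c)) ** 2) for c in candidates]
    return candidates[int(np.argmin(dists))]
```

Each of the κ decodings is independent, so, as with the classifier passes, they can run in parallel.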

Ablation Study

We present an analysis of a number of facets of our permutation embedding and classifier for BCH(63, 36), (63, 45) and (127, 64). We fixed the BER to 10⁻³ and inspected the SNR degradation caused by excluding various components, with respect to our complete model. We present the ablation analysis for the permutation classifier and the permutation embedding separately. Regarding the permutation classifier, we evaluated the complete classifier (described in Section 4.3) against three partial versions. Omitting the permutation embedding feature vector q_π caused a performance degradation of 1.5 to 2 dB; note that the permutation π still affects both ℓ' and s'. Excluding ℓ' or s' caused degradations of 1-1.5 dB and 2.5-3 dB, respectively. In addition, we tried a simpler feature vector h = [q_π; ℓ'; s'], which led to a performance degradation of 1 to 1.5 dB. Regarding the permutation embedding, we compared the complete perm2vec (described in Section 4.2) against two partial versions: omitting the self-attention mechanism decreased performance by 1.25 to 1.75 dB, and initializing the positional embeddings randomly instead of with node embeddings also caused a degradation of 1.25 to 1.75 dB. These results illustrate the advantages of our complete method and, as observed, the importance of the permutation embedding component. Note that we preserved the total number of parameters after each exclusion for a fair comparison.

6. CONCLUSION

We presented a self-attention mechanism to improve the decoding of linear error correction codes. For every received noisy word, the proposed model selects a suitable permutation out of the code's PG without actually trying all the permutation-based decodings. Our method pre-computes the permutation representations, thus allowing for fast and accurate permutation selection at inference time. Furthermore, our method is independent of the code length and can therefore be considered scalable. We demonstrated the effectiveness of perm2vec by showing significant BER performance improvements compared to the baseline decoding algorithms for various code lengths. Future research could extend our method to polar codes, replacing the embedded Tanner graph variable nodes with embedded factor graph variable nodes.



Figure 2: BER vs. SNR for GPS and random permutation selection. Both BP and WBP are considered.

Figure 3: BER vs. SNR performance comparison for various experiments and BCH codes. (a) Top-κ evaluation for BCH(63,45). (b) Embedding size evaluation.

Table 2: A comparison of the BER negative decimal logarithm for three SNR values [dB]. Higher is better. We bold the best results and underline the second best ones.

