GRAPH PERMUTATION SELECTION FOR DECODING OF ERROR CORRECTION CODES USING SELF-ATTENTION Anonymous authors Paper under double-blind review

Abstract

Error correction codes are an integral part of communication applications and boost the reliability of transmission. The optimal decoding of transmitted codewords is the maximum likelihood rule, which is NP-hard. For practical realizations, suboptimal decoding algorithms are employed; however, the lack of theoretical insights currently impedes the exploitation of the full potential of these algorithms. One key insight is the choice of permutation in permutation decoding. We present a data-driven framework for permutation selection combining domain knowledge with machine learning concepts such as node embedding and self-attention. Significant and consistent improvements in the bit error rate are shown for the simulated Bose Chaudhuri Hocquenghem (BCH) code as compared to the baseline decoders. To the best of our knowledge, this work is the first to leverage the benefits of self-attention networks in physical layer communication systems.

1. INTRODUCTION

Shannon's well-known channel coding theorem (Shannon, 1948) states that for every channel there exists a code such that encoded messages can be transmitted and decoded with arbitrarily small error, provided the transmission rate is below the channel's capacity. For practical applications, latency and computational complexity constrain code size. Thus, structured codes with low-complexity encoding and decoding schemes were devised. Some structured codes possess a key feature known as the permutation group (PG): each permutation in the PG maps every codeword to some distinct codeword. This property is crucial to various decoders, such as the parallelizable soft-decision Belief Propagation (BP) decoder (Pearl, 2014). Its usefulness stems from the empirical evidence that whereas decoding a corrupted word may fail, decoding a permuted version of the same corrupted word may succeed (MacWilliams, 1964). For instance, this is exploited in the mRRD (Dimnik & Be'ery, 2009) and BPL (Elkelesh et al., 2018) algorithms, which perform multiple runs over different permuted versions of the same corrupted codeword, trading off complexity for higher decoding gains. Nonetheless, there is room for improvement, since not all permutations are required for successful decoding of a given word: a single fitting one suffices. Our work deals with obtaining the best-fit permutation per word, removing redundant runs and thus preserving computational resources. It remains unclear, however, how to obtain this type of permutation, as indicated by the authors of (Elkelesh et al., 2018), who stated in their Section III.A that "there exists no clear evidence on which graph permutation performs best for a given input". Explicitly, the goal is to approximate a function mapping each word to its most-probable-to-decode permutation. While analytical derivation of this function is hard, advances in the machine learning field may be of use in computing this type of function.
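The multiple-run strategy described above can be sketched as a simple loop: attempt decoding on each permuted copy of the channel word and stop at the first attempt that yields a valid codeword. The helper names (`bp_decode`, `permutation_decode`) are our illustrative assumptions, not an implementation from the paper.

```python
import numpy as np

def syndrome_ok(H, c_hat):
    """True if the hard-decision word satisfies all parity checks (mod 2)."""
    return not np.any(H.dot(c_hat) % 2)

def permutation_decode(y, H, permutations, bp_decode):
    """Try a BP decoder on permuted copies of the channel word y.

    `bp_decode` is an assumed callable returning hard decisions. Because the
    permutations belong to the code's permutation group, a permuted codeword
    is still a codeword, so a success can be de-permuted afterwards.
    """
    for pi in permutations:
        c_hat = bp_decode(y[pi])      # decode the permuted word
        if syndrome_ok(H, c_hat):
            inv = np.argsort(pi)      # invert the permutation
            return c_hat[inv]
    return None                       # every attempted permutation failed
```

Choosing the single best-fitting permutation in advance, as this work proposes, replaces the loop over many permutations with one (or a few) decoding runs.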
The recent emergence of Deep Learning (DL) has demonstrated the advantages of Neural Networks (NN) in a myriad of communication and information theory applications where no analytical solutions exist (Simeone, 2018; Zappone et al., 2019). For instance, in (Belghazi et al., 2018), a tight lower bound on the mutual information between two high-dimensional continuous variables was estimated with NN. Another recurring motive for the use of NN in communications has to do with the amount of data at hand. Several data-driven solutions were described in (Caciularu & Burshtein, 2018; Lin et al., 2019) for scenarios with small amounts of data, since obtaining data samples in


the real world is costly and hard to collect on-the-fly. On the other hand, one should not belittle the benefits of unlimited simulated data, see (Be'ery et al., 2020; Simeone et al., 2020).

Lately, two main classes of decoders have been put forward in machine learning for decoding. The first is the class of model-free decoders employing neural network architectures as in (Gruber et al., 2017; Kim et al., 2018). The second is composed of model-based decoders (Nachmani et al., 2016; 2018; Doan et al., 2018; Lian et al., 2019; Carpi et al., 2019) implementing parameterized versions of classical BP decoders. Currently, the model-based approach dominates, but it suffers from a regularized hypothesis space due to its inductive bias.

Our work leverages permutation groups and DL to enhance the decoding capabilities of constrained model-based decoders. First, a self-attention model (described in Section 3) (Vaswani et al., 2017) is employed to embed all the distinct group permutations of a code in a word-independent manner, by extracting relevant features. This is done once, during a preprocessing phase before the test phase. At test time, a trained NN accepts a corrupted word together with the embedded permutations and predicts, for each permutation, the probability of successful decoding. Thereafter, a set of the one, five, or ten most-probable-to-decode permutations is chosen, and decoding is carried out on the correspondingly permuted channel words, rather than decoding an arbitrary dataset with all permutations and empirically choosing the best subset of them. Our method is evaluated on the renowned BCH code.
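The selection step at test time reduces to a top-k choice over the network's per-permutation scores. A minimal sketch, assuming the NN outputs are already available as a probability vector (the names and values here are illustrative):

```python
import numpy as np

def select_top_k(scores, k):
    """Indices of the k permutations predicted most likely to decode successfully."""
    return np.argsort(scores)[::-1][:k]

scores = np.array([0.1, 0.7, 0.3, 0.9])   # hypothetical NN success probabilities
print(select_top_k(scores, 2))             # → [3 1]
```

Decoding is then run only on the channel word permuted by these k candidates, instead of on all permutations.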

2. RELATED WORK

Permutation decoding (PD) has attracted renewed attention (Kamenev et al., 2019; Doan et al., 2018; Hashemi et al., 2018) given its proven gains for 5G-standard approved polar codes. (Kamenev et al., 2019) suggested a novel PD method for these codes; however, the main novelty lies in the proposed stopping criteria for the list decoder, whereas the permutations are chosen in a random fashion. The authors in (Doan et al., 2018) presented an algorithm to form a permutation set, computed by fixing the first several layers of the underlying structure of the polar decoder and only permuting the last layers. The original graph is included in this set by default, with additional permutations added during a limited-space search. Finally, we refer to (Hashemi et al., 2018), which proposes a successive permutation scheme that finds suitable permutations as decoding progresses. Again, due to the exploding search space, they only considered the cyclic shifts of each layer. This limited search first appeared in (Korada, 2009).

Most PD methods, like the ones mentioned above, have made valuable contributions. We, on the other hand, see the choice of permutation as the most integral part of PD, and suggest a pre-decoding module to choose the best-fitting one. Note, however, that a direct comparison between the model-based PD works mentioned and ours is infeasible.

Regarding model-free approaches, we refer in particular to (Bennatan et al., 2018), since it integrates permutation groups into a model-free approach. In that paper, the decoding network accepts the syndrome of the hard decisions as part of the input; this way, domain knowledge is incorporated into the model-free approach. We introduce domain knowledge by training the permutation embedding on the parity-check matrix and accepting the permuted syndrome. Furthermore, for each word, a fitting permutation is chosen such that the sum of LLRs in the positions of the information bits is maximized.
Note that this approach only benefits model-free decoders; here, too, direct comparisons are infeasible.
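The LLR-based selection rule mentioned above can be sketched as follows. We read "sum of LLRs in the positions of the information bits" as a reliability criterion and use absolute LLR magnitudes accordingly; the function names and that reading are our assumptions, not code from the cited work.

```python
import numpy as np

def best_permutation(llr, permutations, info_positions):
    """Pick the group permutation placing the most reliable channel LLRs
    on the information-bit positions (reliability = |LLR|)."""
    def score(pi):
        return np.abs(llr[pi])[info_positions].sum()
    return max(permutations, key=score)
```

For example, with two candidate permutations and a single information position, the rule selects the permutation that moves the highest-magnitude LLR into that position.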

3. BACKGROUND

Coding In a typical communication system, first, a length-k binary message m ∈ {0, 1}^k is encoded by a generator matrix G into a length-n codeword c = Gm ∈ {0, 1}^n. Every codeword c satisfies Hc = 0, where H is the parity-check matrix (satisfying GH = 0). Next, the codeword c is modulated by the Binary Phase Shift Keying (BPSK) mapping (0 → +1, 1 → -1), resulting in a modulated word x. After transmission through the additive white Gaussian noise (AWGN) channel, the received word is y = x + z, where z ∼ N(0, σ_z^2 I_n). At the receiver, the received word is checked for any detectable errors. For that purpose, an estimated codeword ĉ is calculated using the hard decision (HD) rule ĉ_i = 1_{y_i < 0}. If the syndrome s = Hĉ is all zeros, one outputs ĉ and concludes. A non-zero syndrome indicates that channel errors occurred.
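The pipeline in this paragraph can be walked through numerically with a toy parity-check matrix (the matrix, codeword, and noise level below are illustrative, not a code from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
H = np.array([[1, 1, 0], [0, 1, 1]])     # toy parity-check matrix
c = np.zeros(3, dtype=int)                # the all-zeros codeword (Hc = 0)
x = 1 - 2 * c                             # BPSK mapping: 0 -> +1, 1 -> -1
y = x + 0.1 * rng.standard_normal(3)      # AWGN channel with sigma_z = 0.1
c_hat = (y < 0).astype(int)               # hard decision: c_hat_i = 1{y_i < 0}
s = H.dot(c_hat) % 2                      # syndrome check over GF(2)
print(s)                                   # → [0 0]  (all-zero: no detected error)
```

At this low noise level the hard decisions recover the codeword, so the syndrome is all-zero; channel errors would flip some c_hat_i and yield a non-zero syndrome.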

