DENOISING DIFFUSION ERROR CORRECTION CODES

Abstract

Error correction code (ECC) is an integral part of the physical communication layer, ensuring reliable data transfer over noisy channels. Recently, neural decoders have demonstrated their advantage over classical decoding techniques. However, recent state-of-the-art neural decoders suffer from high complexity and lack the important iterative scheme characteristic of many legacy decoders. In this work, we propose to employ denoising diffusion models for the soft decoding of linear codes at arbitrary block lengths. Our framework models the forward channel corruption as a series of diffusion steps that can be reversed iteratively. Three contributions are made: (i) a diffusion process suitable for the decoding setting is introduced, (ii) the neural diffusion decoder is conditioned on the number of parity errors, which indicates the level of corruption at a given step, and (iii) a line-search procedure based on the code's syndrome obtains the optimal reverse diffusion step size. The proposed approach demonstrates the power of diffusion models for ECC and achieves state-of-the-art accuracy, outperforming the other neural decoders by sizable margins, even for a single reverse diffusion step. Our code is attached as supplementary material.

1. INTRODUCTION

Reliable digital communication is of major importance in the modern information age and involves the design of codes that can be robustly decoded despite noisy transmission channels. The target decoding is defined by the NP-hard maximum likelihood rule, and the efficient decoding of commonly employed families of codes, such as algebraic block codes, remains an open problem. Recently, powerful learning-based techniques have been introduced. Model-free decoders (O'Shea & Hoydis, 2017; Gruber et al., 2017; Kim et al., 2018) employ generic neural networks and may potentially benefit from the powerful deep architectures that have emerged in recent years in various fields. A Transformer-based decoder that is able to incorporate the code into the architecture was recently proposed by Choukroun & Wolf (2022). It outperforms existing methods by sizable margins, at a fraction of their time complexity. The decoder's objective in this model is to predict the noise corruption, in order to recover the transmitted codeword (Bennatan et al., 2018).
Deep generative neural networks have shown significant progress over the last years. Denoising Diffusion Probabilistic Models (DDPM) (Ho et al., 2020b) are an emerging class of likelihood-based generative models. Such methods use diffusion models and denoising score matching to generate new samples, for example, images (Dhariwal & Nichol, 2021) or speech (Chen et al., 2020a). The DDPM model learns to perform a reverse diffusion process on a Markov chain of latent variables, and generates samples by gradually removing noise from a given signal.
One major drawback of model-free approaches is the high space/memory requirement and time complexity that hamper their deployment on constrained hardware. Moreover, the lack of an iterative solution means that both highly and slightly corrupted codewords go through the same computationally demanding neural decoding procedure.
In this work, we consider the error correcting code paradigm via the prism of diffusion processes. The channel codeword corruption can be viewed as an iterative forward diffusion process to be reversed via an adapted DDPM. As far as we can ascertain, this is the first adaptation of diffusion models to error correction codes. Beyond the conceptual novelty, we make three technical contributions: (i) our framework is based on an adapted diffusion process that simulates the coding and transmission processes, (ii) we further condition the denoising model on the number of parity-check errors, as an indicator of the signal's level of corruption, and (iii) we propose a line-search procedure that minimizes the denoised code syndrome, in order to provide an optimal step size for the reverse diffusion. Applied to a wide variety of codes, our method outperforms the state-of-the-art learning-based solutions by very large margins, employing extremely shallow architectures. Furthermore, we show that even a single reverse diffusion step with a controlled step size can outperform concurrent methods.

2. RELATED WORKS

The emergence of deep learning for communication and information theory applications has demonstrated the advantages of neural networks in many tasks, such as channel equalization, modulation, detection, quantization, compression, and decoding (Ibnkahla, 2000). Model-free decoders employ general neural network architectures (Cammerer et al., 2017; Gruber et al., 2017; Kim et al., 2018; Bennatan et al., 2018). However, the exponential number of possible codewords makes the decoding of large codes unfeasible. Bennatan et al. (2018) preprocess the channel output to allow the decoder to remain provably invariant to the transmitted codeword and to eliminate the risk of overfitting. Model-free approaches generally make use of multilayer perceptron or recurrent neural networks to simulate the iterative process found in many legacy decoders (Gruber et al., 2017; Kim et al., 2018; Bennatan et al., 2018). However, many architectures have difficulty learning the code or analyzing the reliability of the output, and require prohibitive parameterization or expensive graph permutation preprocessing (Bennatan et al., 2018). Recently, Choukroun & Wolf (2022) proposed the Error Correction Code Transformer (ECCT), obtaining SOTA performance. The model embeds the signal elements into a high-dimensional space where analysis is more efficient, while information about the code is integrated via a masked self-attention mechanism.
Diffusion Probabilistic Models were first introduced by Sohl-Dickstein et al. (2015), who presented the idea of using a slow iterative diffusion process to break the structure of a given distribution while learning the reverse neural diffusion process, in order to restore that structure in the data. Song & Ermon (2019) proposed a new score-based generative model, building on the work of Hyvärinen & Dayan (2005), as a way of modeling a data distribution using its gradients, and then sampling using Langevin dynamics (Welling & Teh, 2011).
The DDPM method of Ho et al. (2020b) is a generative model based on the neural diffusion process that applies score matching for image generation. Song et al. (2020b) leverage techniques from stochastic differential equations to improve the sample quality obtained by score-based models; Song et al. (2020a) and Nichol & Dhariwal (2021a) propose methods for improving sampling speed; Nichol & Dhariwal (2021a) and Saharia et al. (2021) demonstrated promising results on the difficult ImageNet generation task, using upsampling diffusion models. Several extensions to other fields, such as audio (Kong et al., 2020; Chen et al., 2020b), have been proposed.

3. BACKGROUND

We provide in this section the necessary background on error correction coding and DDPM.

Coding. We assume a standard transmission that uses a linear code C. The code is defined by the binary generator matrix G of size k × n and the binary parity-check matrix H of size (n − k) × n, defined such that GH^T = 0 over the order-2 Galois field GF(2). The input message m ∈ {0,1}^k is encoded by G to a codeword x ∈ C ⊂ {0,1}^n satisfying Hx = 0 and transmitted via a Binary-Input Symmetric-Output channel, e.g., an AWGN channel. Let y denote the channel output, represented as y = x_s + ε, where x_s denotes the Binary Phase Shift Keying (BPSK) modulation of x (i.e., over {±1}) and ε is random noise independent of the transmitted x. The main goal of the decoder f : R^n → R^n is to provide a soft approximation x̂ = f(y) of the codeword. We follow the preprocessing of Bennatan et al. (2018); Choukroun & Wolf (2022), in order to remain provably invariant to the transmitted codeword and to avoid overfitting. The preprocessing transforms y into a vector of dimensionality 2n − k, defined as ỹ = h(y) = [|y|, s(y)], where [·, ·] denotes vector concatenation, |y| denotes the absolute value (magnitude) of y, and s(y) ∈ {0,1}^{n−k} denotes the binary code syndrome. The syndrome is obtained via GF(2) multiplication of the binary mapping of y with the parity-check matrix, such that

s(y) = H y_b := H bin(y) := H (0.5(1 − sign(y))).   (2)

The induced parameterized decoder ε_θ : R^{2n−k} → R^n with parameters θ aims to predict the multiplicative noise, denoted ε̃ and defined such that y = x_s ⊙ ε̃.

Denoising Diffusion Probabilistic Model (DDPM). Ho et al. (2020a) assume a data distribution x_0 ∼ q(x) and a Markovian noising process q that gradually adds noise to the data to produce noisy samples {x_t}_{t=1}^T.
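The coding-side preprocessing h(y) = [|y|, s(y)] described above can be sketched in a few lines of numpy; the parity-check matrix below is a toy (7,4) Hamming example, not one of the codes used in the paper:

```python
import numpy as np

def bin_map(y):
    # bin(y) = 0.5 * (1 - sign(y)): positive entries map to 0, negative to 1
    return (0.5 * (1 - np.sign(y))).astype(int)

def syndrome(H, y):
    # s(y) = H bin(y) over GF(2)
    return (H @ bin_map(y)) % 2

def preprocess(H, y):
    # h(y) = [|y|, s(y)], the (2n - k)-dimensional decoder input
    return np.concatenate([np.abs(y), syndrome(H, y)])

# Toy (7,4) Hamming parity-check matrix (illustrative; not from the paper)
H = np.array([[1, 1, 0, 1, 1, 0, 0],
              [1, 0, 1, 1, 0, 1, 0],
              [0, 1, 1, 1, 0, 0, 1]])

# BPSK all-zero codeword (all +1) plus mild noise: no parity check fails
y = np.array([0.9, 1.1, 0.8, 1.2, 1.0, 0.7, 1.1])
assert np.all(syndrome(H, y) == 0)
```

Flipping the sign of any single entry of y makes the syndrome equal the corresponding column of H, which is the classical single-error signature.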
Each step of the corruption process adds Gaussian noise according to some variance schedule given by β_t, such that

q(x_t | x_{t−1}) = N(x_t; √(1 − β_t) x_{t−1}, β_t I),   x_t = √(1 − β_t) x_{t−1} + √β_t z_{t−1},   z_{t−1} ∼ N(0, I).   (3)

q(x_t | x_0) can be expressed as a Gaussian distribution such that, with α_t := 1 − β_t and ᾱ_t := ∏_{s=0}^{t} α_s, we have

q(x_t | x_0) = N(x_t; √ᾱ_t x_0, (1 − ᾱ_t) I),   x_t = √ᾱ_t x_0 + √(1 − ᾱ_t) ε,   ε ∼ N(0, I).   (4)

The intractable reverse diffusion process q(x_{t−1} | x_t) approaches a diagonal Gaussian distribution as β_t → 0 (Sohl-Dickstein et al., 2015) and can be approximated using a neural network p_θ in order to predict the Gaussian statistics. The model is trained by stochastically optimizing the random terms of the variational lower bound of the negative log-likelihood function. One can find via Bayes' theorem that the posterior q(x_{t−1} | x_t, x_0) is also Gaussian, making the objective a sum of tractable KL divergences between Gaussians. Ho et al. (2020a) found a more practical objective, defined via the training of a model ε_θ^DDPM(x_t, t) that predicts the additive noise ε from Eq. 4, as follows:

L_DDPM(θ) = E_{t∼U[1,T], x_0∼q(x), ε∼N(0,I)} ||ε − ε_θ^DDPM(x_t, t)||².   (5)

The distribution q(x_T) is assumed to be a nearly isotropic Gaussian distribution, such that sampling x_T is trivial. The reverse diffusion process is then given by the following iterative procedure:

x_{t−1} = (1/√α_t) (x_t − ((1 − α_t)/√(1 − ᾱ_t)) ε_θ^DDPM(x_t, t)).
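As a quick numerical illustration of the standard (scaled) forward process above, the sketch below builds a linear β schedule (an arbitrary illustrative choice) and checks that the iterative corruption is consistent with the closed form of Eq. 4:

```python
import numpy as np

T = 100
betas = np.linspace(1e-4, 0.02, T)   # a linear schedule (assumed, illustrative)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)       # abar_t = prod_s alpha_s

rng = np.random.default_rng(0)
x0 = rng.standard_normal(8)

# One sample of the iterative corruption x_t = sqrt(1-b_t) x_{t-1} + sqrt(b_t) z
x = x0.copy()
for b in betas:
    x = np.sqrt(1.0 - b) * x + np.sqrt(b) * rng.standard_normal(8)

# Sanity check of the closed form x_t = sqrt(abar_t) x0 + sqrt(1-abar_t) eps:
# with the noise switched off, the iteration contracts x0 by exactly sqrt(abar_T)
x_det = x0.copy()
for b in betas:
    x_det = np.sqrt(1.0 - b) * x_det
assert np.allclose(x_det, np.sqrt(alpha_bar[-1]) * x0)
```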

4. DENOISING DIFFUSION ERROR CORRECTION CODES

We present the elements of the proposed denoising diffusion framework for decoding and the proposed architecture, together with its training procedure. An illustration of the coding setting and of the proposed decoding framework is given in Figure 1.

Figure 1: Illustration of the communication system. We train a parameterized iterative decoder ε_θ conditioned on the number of parity-check errors. The decoding is performed iteratively through the reverse diffusion process, as described in this paper.

4.1. DATA TRANSMISSION AS A FORWARD DIFFUSION PROCESS

Given a codeword x_0 sampled from the code distribution x_0 ∼ q(x), we propose to define the codeword transmission procedure y = x_0 + σε as a forward diffusion process that adds a small amount of Gaussian noise to the sample in t steps, with t ∈ (0, …, T), where the step sizes are controlled by a variance schedule {β_t}_{t=0}^{T}. In our setting, we propose the following unscaled forward diffusion:

q(x_t := y | x_{t−1}) = N(x_t; x_{t−1}, β_t I).

Thus, for a given received word y and a corresponding t, we consider y as a codeword that has been corrupted gradually, such that for ε ∼ N(0, I),

y := x_t = x_0 + σε = x_0 + √β̄_t ε ∼ N(x_t; x_0, β̄_t I),   (8)

where β̄_t = Σ_{i=1}^{t} β_i and σ defines the level of corruption of the AWGN channel. Thus, the transmission of data over a noisy communication channel can be viewed as a modified iterative diffusion process, to be reversed for decoding.
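The unscaled process can be illustrated with a small numpy sketch; the block length, step budget, and the mapping from a channel noise level σ to an equivalent step t (via β̄_t ≈ σ²) are illustrative assumptions:

```python
import numpy as np

n, T = 16, 8                      # toy block length and step budget
betas = np.full(T, 0.01)          # constant schedule, as used in the paper
beta_bar = np.cumsum(betas)       # unscaled process: the variances simply add

rng = np.random.default_rng(1)
x0 = np.sign(rng.standard_normal(n))       # a BPSK-like word (illustrative)
eps = rng.standard_normal(n)
t = 5
y = x0 + np.sqrt(beta_bar[t - 1]) * eps    # received word viewed at "time" t

# A channel with noise level sigma corresponds to the first step t whose
# accumulated variance beta_bar_t reaches sigma^2
sigma = 0.25
t_equiv = int(np.searchsorted(beta_bar, sigma ** 2)) + 1
assert 1 <= t_equiv <= T
```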

4.2. DECODING AS A REVERSE DIFFUSION PROCESS

Following Bayes' theorem, the posterior q(x_{t−1} | x_t, x_0) is a Gaussian such that

q(x_{t−1} | x_t, x_0) = N(x_{t−1}; μ̃_t(x_t, x_0), β̃_t I),

where, according to Eq. 8, we have

μ̃_t(x_t, x_0) = (β̄_{t−1}/(β̄_{t−1} + β_t)) x_t + (β_t/(β̄_{t−1} + β_t)) x_0 = x_t − (√β̄_t β_t/(β̄_{t−1} + β_t)) ε,   (9)

and β̃_t = β̄_{t−1} β_t/(β̄_{t−1} + β_t). The full derivation is given in Appendix A. Similarly to Sohl-Dickstein et al. (2015); Ho et al. (2020b), we wish to approximate the intractable Gaussian reverse diffusion process q(x_{t−1} | x_t), such that

q(x_{t−1} | x_t) ≈ p_θ(x_{t−1} | x_t) = N(x_{t−1}; μ_θ(x_t, t), β̃_t I),

with fixed variance β̃_t. Following the simplified objective of Ho et al. (2020b), one would adapt the negative log-likelihood approximation such that the decoder predicts the additive noise of the adapted diffusion process:

L(θ) = E_{t∼U[1,T], x_0∼q(x), ε∼N(0,I)} [||ε − ε_θ(x_0 + √β̄_t ε, t)||²].

One interesting property of the syndrome-based approach of Bennatan et al. (2018) is that, similarly to denoising diffusion models, the decoder's objective for retrieving the original codeword is to predict the channel's noise. However, in contrast to classic diffusion models, the syndrome-based approach enforces the prediction of the multiplicative noise ε̃ instead of the additive noise ε. We note, however, that the exact value of the multiplicative noise is not important for hard decoding; only its sign matters, since x_s = sign(y ⊙ ε̃). Therefore, we propose to learn the hard (i.e., sign) prediction of the multiplicative noise using the binary cross-entropy loss as a surrogate objective, such that

Figure 3: Influence of the noise, or E_b/N_0 (normalized SNR), on the number of parity-check errors for several codes. The greater the noise, the higher the number of parity-check errors, which demonstrates that the syndrome conveys information about the level of noise.
L(θ) = E_{t, x_0, ε} [BCE(ε_θ(x_0 + √β̄_t ε, t), ε̃_b)],

where the target binary multiplicative noise is defined as ε̃_b = bin(x_0 ⊙ (x_0 + √β̄_t ε)), and BCE denotes the binary cross-entropy loss.
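A minimal numpy sketch of this surrogate objective; the sigmoid "predictor" below is an oracle stand-in for the network ε_θ, used only to exercise the loss:

```python
import numpy as np

def bin_map(v):
    return 0.5 * (1 - np.sign(v))

def bce(p, target, tiny=1e-9):
    # binary cross-entropy between predicted probabilities and a {0,1} target
    return float(np.mean(-(target * np.log(p + tiny)
                           + (1 - target) * np.log(1 - p + tiny))))

rng = np.random.default_rng(2)
n = 16
x0 = np.sign(rng.standard_normal(n))       # modulated codeword over {+1, -1}
beta_bar_t = 0.05
xt = x0 + np.sqrt(beta_bar_t) * rng.standard_normal(n)   # corrupted word

# Target: the binarized sign of the multiplicative noise, eps_b = bin(x0 * xt)
eps_b = bin_map(x0 * xt)

# Oracle stand-in for eps_theta(xt, e_t): a sigmoid of the true product x0*xt
pred = 1.0 / (1.0 + np.exp(x0 * xt))       # = sigmoid(-x0 * xt)
assert bce(pred, eps_b) < bce(1.0 - pred, eps_b)   # aligned beats anti-aligned
```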

4.3. DENOISING VIA PARITY CHECK CONDITIONING

The reverse denoising process of traditional DDPM is conditioned on the time step. Thus, by sampling Gaussian noise, which is treated as equivalent to step t = T, one can fully reverse the diffusion in up to T iterations. In our case, we are not interested in a generative model, but in an exact iterative denoising scheme, where the original signal is corrupted only to a measured extent. Moreover, a given noisy code conveys information about the level of noise via its syndrome, since s(y) = Hy = Hx + Hz = Hz. Fig. 3 illustrates the impact of noise on the number of parity-check errors. Evidently, one can approximate an injective function between the number of parity-check errors and the amount of noise. Such a function is a direct indication of the proximity of the current iterate to a solution (codeword). Therefore, we suggest conditioning the diffusion decoder on the number of parity-check errors e_t, such that

e_t := e(x_t) = Σ_{i=1}^{n−k} s(x_t)_i ∈ {0, …, n − k}.

The resulting training objective is now given by

L(θ) = E_{t, x_0, ε} [BCE(ε_θ(x_0 + √β̄_t ε, e_t), ε̃_b)].

Following this logic, the number of required denoising steps T = n − k is set as the maximum number of parity-check errors. Similarly to the classical DDPM training procedure, sampling a time step t ∼ U(0, …, T) produces noise, which in turn induces a certain number of parity errors. The training procedure of our method is given in Alg. 1. The framework assumes a random "time" sampling, producing a noise and then a syndrome to be corrected. Note that our model-free solution is invariant to the transmitted codeword, and the diffusion decoding can be trained with one single codeword (Alg. 1, line 1). Since the denoising model predicts the multiplicative noise ε̃, at inference time it needs to be transformed into its additive counterpart ε in order to perform the gradient step in the original additive diffusion process domain.
We obtain the additive noise by subtracting the modulated predicted codeword sign(x̂) from the noisy signal, such that ε̂ = y − sign(x̂) = y − sign(ε̃ ⊙ y). Therefore, following Eq. 9, at inference time the reverse process is given by

x_{t−1} = x_t − (√β̄_t β_t/(β̄_{t−1} + β_t)) (x_t − sign(x_t ⊙ ε̃_θ(x_t, e_t))).

The inference procedure is defined in Alg. 2. If the syndrome is non-zero, we predict the multiplicative noise, extract the corresponding additive noise, and perform the reverse step. We illustrate in Fig. 2 the reverse diffusion dynamics (gradient field) for a (3,1) repetition code, i.e., G = (1, 1, 1).
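One reverse step can be sketched as follows; the oracle noise predictor below stands in for ε_θ (it reads the true sign of the multiplicative noise) and is not part of the method:

```python
import numpy as np

def reverse_step(xt, eps_tilde_pred, beta_bar_t, beta_bar_prev, beta_t):
    # x_{t-1} = x_t - sqrt(bb_t) b_t / (bb_{t-1} + b_t) * (x_t - sign(x_t*pred))
    eps_hat = xt - np.sign(xt * eps_tilde_pred)
    coef = np.sqrt(beta_bar_t) * beta_t / (beta_bar_prev + beta_t)
    return xt - coef * eps_hat

rng = np.random.default_rng(3)
n, T = 16, 8
betas = np.full(T, 0.01)
beta_bar = np.cumsum(betas)

x0 = np.sign(rng.standard_normal(n))        # modulated codeword (illustrative)
t = T
xt = x0 + np.sqrt(beta_bar[t - 1]) * rng.standard_normal(n)

# Oracle stand-in: with the true sign of the multiplicative noise,
# eps_hat reduces to xt - x0 and the step contracts toward the codeword
oracle = np.sign(x0 * xt)
x_prev = reverse_step(xt, oracle, beta_bar[t - 1], beta_bar[t - 2], betas[t - 1])
assert np.all(np.abs(x_prev - x0) <= np.abs(xt - x0))
```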

4.4. SYNDROME-BASED LINE SEARCH FOR REVERSE DIFFUSION STEP SIZE

One major limitation of the generative neural diffusion process is the large number of diffusion steps required (generally about a thousand) to generate high-quality samples. Several methods propose faster sampling procedures that accelerate data generation via schedule subsampling or step-size correction (Nichol & Dhariwal, 2021b; San-Roman et al., 2021). In our setting, one can assess the quality of the denoised signal via its syndrome, i.e., the number of parity-check errors, where a zero syndrome means a valid codeword. Therefore, we propose to find the optimal step size λ by solving the following optimization problem:

λ* = argmin_{λ ∈ R+} ||s(x_t − λ (√β̄_t β_t/(β̄_{t−1} + β_t)) ε̂)||_1,   (14)

where s(·) denotes the syndrome computed over GF(2), as in Eq. 2. While many line-search (LS) methods exist in numerical optimization (Nocedal & Wright, 2006), since the objective is highly non-differentiable, we adopt a grid-search procedure, restricting the search space to λ ∈ I, where I is a predefined discrete segment. This parallelizable procedure reduces the number of iterations by a sizable factor, as shown in Section 5. Details regarding the grid-search procedure are discussed in Appendix C.
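The syndrome-based grid search can be sketched as follows, again with a toy (7,4) Hamming matrix and an oracle descent direction standing in for the model's prediction:

```python
import numpy as np

def syndrome_weight(H, v):
    # number of failed parity checks of the hard decision of v, over GF(2)
    b = (0.5 * (1 - np.sign(v))).astype(int)
    return int(((H @ b) % 2).sum())

def line_search(H, xt, direction, grid):
    # lambda* = argmin_lambda || s(x_t - lambda * d_t) ||_1 on a discrete grid
    weights = [syndrome_weight(H, xt - lam * direction) for lam in grid]
    return grid[int(np.argmin(weights))]

# Toy (7,4) Hamming code (illustrative choice, not one of the paper's codes)
H = np.array([[1, 1, 0, 1, 1, 0, 0],
              [1, 0, 1, 1, 0, 1, 0],
              [0, 1, 1, 1, 0, 0, 1]])
x0 = np.ones(7)                               # BPSK all-zero codeword
rng = np.random.default_rng(4)
xt = x0 + 0.45 * rng.standard_normal(7)

direction = xt - x0                           # an oracle descent direction
grid = np.linspace(0.05, 2.0, 40)             # discrete segment I
lam = line_search(H, xt, direction, grid)
assert syndrome_weight(H, xt - lam * direction) == 0
```

Because every grid point can be scored independently, the search is embarrassingly parallel, which is what makes the per-iteration overhead small in practice.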

Architecture and Training

The state-of-the-art ECCT architecture of Choukroun & Wolf (2022) is used as ε_θ. In this architecture, the capacity of the model is defined by the chosen embedding dimension d and the number of self-attention layers N. To condition the network on the number of parity errors e_t ∈ {0, …, n − k}, we employ a d-dimensional one-hot encoding, multiplied via a Hadamard product with the initial elements' embedding of the ECCT. Denoting the ECCT's embedding of the i-th element as φ_i, the new embedding is defined as φ̂_i = φ_i ⊙ ψ(e_t), ∀i, where ψ denotes the one-hot embedding of e_t. As a transformation of the syndrome, e_t also remains invariant to the codeword. Additional details on the DDECCT architecture are given in Appendix F. The discrete grid search over λ is uniformly sampled on I = [1, 20] with 20 samples, in order to find the optimal step size. A denser or code-adaptive sampling may improve the results, according to a predefined accuracy-speed trade-off. We show the distribution of optimal λ in Appendix C. The Adam optimizer (Kingma & Ba, 2014) is used with 128 samples per mini-batch, for 2000 epochs, with 1000 mini-batches per epoch. The noise scheduling is constant and set to β_t = 0.01, ∀t. An extended discussion of the β scheduler can be found in Appendix G. We initialized the learning rate to 10^{-4}, coupled with a cosine decay scheduler down to 5·10^{-6} at the end of training. No warmup (Xiong et al., 2020) was employed.
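A sketch of the conditioning mechanism; the paper describes a one-hot encoding mapped to dimension d, and here a learned lookup table (a random matrix standing in for learned weights, equivalent to one-hot times a projection) plays that role:

```python
import numpy as np

rng = np.random.default_rng(5)
n, k, d = 15, 7, 32
num_elems = 2 * n - k                          # length of the decoder input

# Stand-ins for learned parameters (random here, learned in the real model)
phi = rng.standard_normal((num_elems, d))      # initial ECCT element embeddings
psi = rng.standard_normal((n - k + 1, d))      # one row per parity-error count

def conditioned_embedding(phi, psi, e_t):
    # phi_hat_i = phi_i ⊙ psi(e_t): gate every element embedding by the
    # embedding attached to the current number of failed parity checks
    return phi * psi[e_t]                      # broadcast Hadamard product

e_t = 3                                        # e.g., three failed checks
phi_hat = conditioned_embedding(phi, psi, e_t)
assert phi_hat.shape == (num_elems, d)
```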

5. EXPERIMENTS

To evaluate our method, we train the proposed architecture on three classes of linear block codes: Low-Density Parity-Check (LDPC) codes (Gallager, 1962), Polar codes (Arikan, 2008), and Bose-Chaudhuri-Hocquenghem (BCH) codes (Bose & Ray-Chaudhuri, 1960). All parity-check matrices are taken from Helmling et al. (2019). The proposed architecture is defined solely by the number of encoder layers N and the dimension of the embedding d. We compare our method with the BP algorithm (Pearl, 1988), the recent autoregressive BP decoder, and the ECCT. We refer the reader to Choukroun & Wolf (2022) for a detailed complexity analysis of the ECCT. Details about the computational overhead of the DDECCT are given in Appendix F. Note that LDPC codes are designed specifically for BP-based decoding (Richardson et al., 2001). The results are reported as bit error rates (BER) for different normalized SNR values (E_b/N_0). We follow the testing benchmark of Nachmani & Wolf (2019); Choukroun & Wolf (2022). During testing, our decoder decodes at least 10^5 random codewords, to obtain at least 500 frames with errors at each SNR value. All baseline results were obtained from the corresponding papers. The results are reported in Tab. 1, where we present the negative natural logarithm of the BER.
For each code, we present the results of the BP-based competing methods for 5 and 50 iterations (first and second rows), corresponding to neural networks with 10 and 100 layers, respectively. As in Choukroun & Wolf (2022), our framework's performance with Line Search (LS), as described in Section 4.4, is evaluated for six different architectures, with N = {2, 6} and d = {32, 64, 128} (first to third rows). BER plots with respect to the SNR are given in Appendix I. As can be seen, our approach outperforms the current SOTA results of the ECCT by extremely large margins on several codes, at a fraction of the capacity. Especially for shallow models, the difference can be an order of magnitude. Performance is closer on short high-rate codes, for which ECCT performance is already very high. We present in Figure 5 the performance of the proposed DDECCT on larger codes. As can be seen, DDECCT can learn to efficiently decode larger codes and outperforms the ECCT. A separate comparison to the non-neural SCL Polar decoder of Tal & Vardy (2015) is given in Appendix H, demonstrating the need to train bigger architectures in order to surpass this specialized decoder. We present in Table 2 the difference in accuracy ∆ between the line-search procedure and the regular reverse diffusion, together with convergence statistics (mean and standard deviation of the number of iterations) for both. The full table with statistics for all of the codes is given in Appendix E. Evidently, the line-search procedure enables extremely fast convergence, requiring as little as one iteration at high SNR. Note that we measure the number of iterations required to reach a syndrome with zero failed checks. We do not apply early stopping to the decoding, which could further reduce the average number of iterations when the decoder stagnates and does not converge to a zero syndrome.

Non-Gaussian Channel

We test our framework on a non-Gaussian Rayleigh fading channel, which is often used to simulate the propagation environment of a signal, e.g., for wireless devices. In this fading model, the transmission of the codeword x ∈ {0,1}^n is defined as y = h ⊙ x_s + z, where h is an n-dimensional i.i.d. Rayleigh-distributed vector with scale parameter α, and z ∼ N(0, σ²I_n). In our simulations, we assume a high scale α = 1 in order to easily compare and reproduce the results, while the level of Gaussian noise and the testing procedure remain as described above. The overall variance of the received word y in the Rayleigh channel is roughly twice that of the AWGN channel over the tested SNR range. The results are presented in Figure 4. As can be observed, our method is still able to learn to decode, even under these very noisy fading channels.
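The fading channel can be simulated in a few lines; the sizes and noise levels below are illustrative, not the paper's experimental settings:

```python
import numpy as np

rng = np.random.default_rng(6)
n, sigma, alpha = 64, 0.5, 1.0                 # toy sizes and noise levels

x = rng.integers(0, 2, n)                      # codeword bits (illustrative)
xs = 1.0 - 2.0 * x                             # BPSK modulation over {+1, -1}

h = rng.rayleigh(scale=alpha, size=n)          # i.i.d. Rayleigh fading gains
z = sigma * rng.standard_normal(n)
y = h * xs + z                                 # faded, noisy channel output

# Per symbol, h * xs adds Var(h) = (2 - pi/2) * alpha^2 on top of sigma^2,
# which is why this channel is markedly harder than plain AWGN
assert y.shape == (n,) and np.all(h >= 0)
```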

BER evolution through iteration/time

We illustrate in Figure 6 the denoising process for several codes, showing how the BER decreases over time for the regular proposed method and for the augmented line-search procedure. We can observe the very fast convergence of the line-search approach. We further provide in Appendix D the performance of the proposed framework for one, two, and three iteration steps. The LS procedure outperforms the original ECCT even with a single step.

6. CONCLUSIONS

We present a novel denoising diffusion method for the decoding of algebraic block codes. It is based on an adapted diffusion process that simulates the channel corruption we wish to reverse. The method makes use of the syndrome as a conditioning signal and employs a line-search procedure to control the step size. Since it inherits the iterative nature of the underlying process, both training and deployment are extremely efficient. Even with very low-capacity networks, the proposed approach outperforms existing neural decoders by sizable margins for a broad range of code families.

A UNSCALED DIFFUSION DERIVATION

According to Bayes' rule,

q(x_{t−1} | x_t, x_0) = q(x_t | x_{t−1}, x_0) q(x_{t−1} | x_0) / q(x_t | x_0)
∝ exp(−(1/2) [(x_t − x_{t−1})²/β_t + (x_{t−1} − x_0)²/β̄_{t−1} − (x_t − x_0)²/β̄_t])
= exp(−(1/2) [(1/β_t + 1/β̄_{t−1}) x_{t−1}² − 2(x_t/β_t + x_0/β̄_{t−1}) x_{t−1} + C(x_t, x_0)]),   (15)

where C(x_t, x_0) collects the terms that do not depend on x_{t−1}. Following the standard Gaussian density function, the mean and variance can be parameterized as

β̃_t = (1/β_t + 1/β̄_{t−1})^{−1} = β̄_{t−1} β_t/(β̄_{t−1} + β_t),
μ̃_t(x_t, x_0) = (x_t/β_t + x_0/β̄_{t−1}) β̃_t = (β̄_{t−1}/(β̄_{t−1} + β_t)) x_t + (β_t/(β̄_{t−1} + β_t)) x_0
= (β̄_{t−1}/(β̄_{t−1} + β_t)) x_t + (β_t/(β̄_{t−1} + β_t)) (x_t − √β̄_t ε) = x_t − (√β̄_t β_t/(β̄_{t−1} + β_t)) ε.   (16)
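The two identities above can be checked numerically under the unscaled process of Section 4.1 (a sanity check of the reconstruction, with arbitrary toy schedule values):

```python
import numpy as np

rng = np.random.default_rng(7)
T = 10
betas = np.full(T, 0.01)
beta_bar = np.cumsum(betas)
t = 6                                        # any step with t >= 2
b_t, bb_t, bb_prev = betas[t - 1], beta_bar[t - 1], beta_bar[t - 2]

# Posterior variance: the harmonic combination equals the closed form
tilde_beta = 1.0 / (1.0 / b_t + 1.0 / bb_prev)
assert np.isclose(tilde_beta, bb_prev * b_t / (bb_prev + b_t))

# Posterior mean: the x_t / x_0 mixture equals the epsilon form whenever
# x_0 = x_t - sqrt(beta_bar_t) * eps
x0 = rng.standard_normal(4)
eps = rng.standard_normal(4)
xt = x0 + np.sqrt(bb_t) * eps
mu_mixture = (bb_prev * xt + b_t * x0) / (bb_prev + b_t)
mu_eps = xt - (np.sqrt(bb_t) * b_t / (bb_prev + b_t)) * eps
assert np.allclose(mu_mixture, mu_eps)
```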

B FORWARD DIFFUSION PROCESS VISUALIZATION

We provide a visualization of the forward diffusion process described in Section 4.1, using the three-dimensional repetition code discussed in Section 4.3. Figure 7 presents random diffusion processes through time, starting from one of the two valid codewords. As can be seen, there is a migration from the valid codewords (depicted by a blue or a red cross) to the vicinity of invalid words (black crosses).

D DDECCT PERFORMANCE WITH FEW ITERATIONS

Table 3 presents the performance of the proposed framework for one, two, and three iteration steps with the regular reverse diffusion method and with the proposed line-search approach. As can be seen, the line-search approach improves over the regular reverse diffusion by orders of magnitude.

F.1 DDECCT ARCHITECTURE

The DDECCT architecture is depicted in Figure 9. In order to condition the network on the number of parity errors e_t ∈ {0, …, n − k}, we employ a d-dimensional one-hot encoding, multiplied via a Hadamard product with the initial elements' embedding of the ECCT. Denoting the ECCT's embedding of the i-th element as φ_i, the new embedding is defined as φ̂_i = φ_i ⊙ ψ(e_t), ∀i, where ψ denotes the one-hot embedding of e_t.

F.2 DDECCT COMPUTATIONAL COMPLEXITY

The computational overhead consists of the conditioning, which is a negligible Hadamard product of the initial ECCT embedding with the one-hot encoding of the number of parity-check errors; the parallel line-search procedure, which scales linearly with the density of the code (and most codes are sparse); and the number of reverse diffusion iterations, which is reduced to extremely few by the line-search framework. Therefore, the large improvement of DDECCT over ECCT is obtained with only a modest increase in computational complexity. Most importantly, the space complexity, which determines the capacity of the network, is greatly reduced with the DDECC framework, since even shallow DDECCT architectures outperform deep ECCT models. It should also be mentioned that, as described in the introduction, the iterative DDECC framework supports a differential treatment of samples based on their level of corruption, which is not possible with the ECCT.

G BETA SCHEDULING

The choice of the scheduling range, i.e., t ∈ {0, 1, …, T}, is explained as follows. In contrast to classical denoising diffusion models used as generative models, the conditioning in our model is performed over the number of parity-check errors, in order to obtain an estimate of the proximity to the solution. Thus, T is defined as the maximum possible number of parity-check errors, i.e., n − k. The values of the beta schedule have been set empirically such that the cumulative sum of the diffusion noise roughly corresponds to the average training noise statistics, i.e., √β̄_T ≈ σ for y = x + σz, z ∼ N(0, I), as described in Eq. 8. In addition to the empirical support, constant scheduling has been chosen to induce a uniform treatment of the noise space. Finally, the full method uses the line-search procedure, reducing the need to precisely define or tune the scheduler.

H COMPARISON WITH SUCCESSIVE CANCELLATION LIST (SCL) POLAR DECODER

We tested the SCL algorithm for L = {1, 4} using the AFF3CT software (Cassagne et al., 2019). Note that with this software package, we cannot ensure an exact match to our settings, even with regard to the polar code used. Increasing the capacity of the network, especially with more layers, is expected to lead to better results, as demonstrated for LDPC codes. Similarly, SCL with bigger lists would obtain improved accuracy. We present in Figure 10 the performance of different architecture sizes. We observe that a larger architecture is able to close the gap with the SOTA on Polar codes. We believe that the high number of degrees of freedom in our model, coupled with careful training and hyper-parameter tuning, should close the remaining gap. Most importantly, we believe that permutations of the parity-check matrix may have a large impact on performance, as described in many previous neural decoding works (e.g., Bennatan et al., 2018; Raviv et al., 2020).



x_s ⊙ ε̃. The prediction of the multiplicative noise, instead of the additive physical one, is done in order to remain invariant to the transmitted codeword (|y| = |x_s ⊙ ε̃| = |ε̃|), thereby avoiding the risk of code overfitting, as described by Bennatan et al. (2018) and in the proof of Lemma 1 of Richardson & Urbanke (2001). The final prediction takes the form x̂_s = sign(y ⊙ ε_θ(|y|, H y_b)).

Figure 2: Reverse diffusion dynamics on a (3,1) repetition code. The two points represent the only two signed codewords, ±(1, 1, 1). The colors are defined by maximum likelihood decoding. Evidently, the denoising diffusion model reverses noisy codes toward the right distribution. An illustration of the forward process for this code is provided in Appendix B.

Algorithm 1: DDECC training procedure.
1: x_0 ∈ C
2: Input: parity-check matrix H, noise schedule β_1, …, β_T
3: repeat
4: t ∼ U(1, …, T)
5: ε ∼ N(0, I)
6: x_t = x_0 + √β̄_t ε = x_0 ⊙ ε̃
7: Take a gradient descent step on BCE(ε_θ(x_t, e_t), bin(ε̃))
8: until converged

Algorithm 2: DDECC sampling procedure.
1: Input: parity-check matrix H, channel output y
2: for n − k iterations do
3: γ = e(y)
4: if γ = 0 then break
5: ε̃ = ε_θ(y, γ)
6: ε̂ = y − sign(ε̃ ⊙ y)
7: Get λ according to Eq. 14
8: y = y − λ (√β̄_γ β_γ/(β̄_{γ−1} + β_γ)) ε̂
9: return bin(y)

Training and experiments were performed on a 12GB Titan V GPU. The total training time ranged from 12 to 24 hours, depending on the code length, and no optimization of the self-attention mechanism was employed. Per epoch, the training time was in the range of 19–40 seconds and 40–102 seconds for the N = 2 and N = 6 architectures, respectively.

Figure 4: BER comparison between the ECCT N = 6, d = 32 and the proposed DDECCT, for the Rayleigh fading channel, for (a) Polar(64,32), (b) BCH(63,36), and (c) LDPC(49,24) codes.

Figure 6: BER vs the number of iterations (up to n -k) for regular and line search reverse diffusion.

Figure 7: Visualization of the forward diffusion process. The two valid codewords are represented as a blue cross and a red cross. The other six words are denoted by black crosses.

Figure 9: Illustration of the proposed DDECCT architecture. The main difference from ECCT is the green component in the Initial Embedding module.

Figure 10: BER plots comparing ECCT, DDECCT and the SCL algorithm for various Eb/N0 and N values.

Figure 11: BER plots comparing ECCT and the proposed DDECCT for various Eb/N0 values.

A comparison of the negative natural logarithm of the Bit Error Rate (BER) for three normalized SNR values (4, 5, 6) of our method with literature baselines. Higher is better. The best results are in bold, the second best underlined. BP-based results in the first row are obtained after L = 5 BP iterations (i.e., a 10-layer neural network), and results in the second row are obtained at convergence, after L = 50 BP iterations (i.e., a 100-layer neural network). Our performance is presented for six different architectures: N = {2, 6} and d = {32, 64, 128}. The presented results are obtained with the LS procedure.

A comparison between the line search procedure and the regular reverse diffusion. The ∆ column denotes the difference between the logarithms of the Bit Error Rate (BER) for three normalized SNR values (i.e., ∆ = −log(BER_LS) + log(BER_Reg)). The other columns report the mean and standard deviation of the number of iterations of the reverse process until convergence, i.e., until a zero syndrome is reached.

Negative natural logarithm of BER by number of iterations for N = 2 models, for the regular and the LS diffusion methods. Higher is better. The (column) average over the codes and the dimensions d for the ECCT is {4.58, 6.20, 8.43} at Eb/N0 = {4, 5, 6}, respectively, and {4.64, 6.37, 8.78} for LS with one iteration.

presents the convergence statistics (the mean and standard deviation of the number of iterations) for the regular reverse diffusion and for the line search procedure. Evidently, the line-search approach substantially reduces the number of steps, especially at low SNRs, where the improvement can reach an order of magnitude.


compares the performance of ECCT and DDECCT to the SOTA SCL Polar decoder (Tal & Vardy, 2015) for several Polar codes. The SCL decoder has time and space complexities of O(LN log N) and O(LN), respectively.

A comparison of the negative natural logarithm of the Bit Error Rate (BER) for three normalized SNR values (4, 5, 6) between the proposed method with N = 6, d = 128, the ECCT, and the SOTA SCL algorithm. Higher is better.

Polar(…,64): 8.37 11.69 13.70 | 9.60 13.16 17.42 | 5.92 8.64 12.18 | 9.11 12.90 16.30
Polar(128,86): 7.54 10.74 15.14 | 9.26 13.04 17.13 | 6.31 9.01 12.45 | 7.60 10.81 15.17
Polar(128,96): 6.74 9.53 13.53 | 8.02 11.60 18.16 | 6.31 9.12 12.47 | 7.16 10.30 13.19

ACKNOWLEDGMENTS

This project has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant ERC CoG 725974). This work was further supported by a grant from the Tel Aviv University Center for AI and Data Science (TAD). The contribution of the first author is part of a Ph.D. thesis research conducted at Tel Aviv University.

C LINE SEARCH HISTOGRAMS AND COMPLEXITY

C.1 HISTOGRAMS

Figure 8 presents the distribution of the optimal step size λ for several codes. Each code exhibits a different distribution of optimal step sizes, as can be seen from the large variation in the x-axis ranges.

C.2 COMPLEXITY OVERHEAD OF THE LINE SEARCH PROCEDURE

The computation of the syndrome consists of a series of efficient binary operations (xor), inducing a computational complexity that is proportional to the density of the code. The line search consists of the parallel computation of the syndrome over the multiple words obtained for different λ values sampled on a predefined grid. Thus, the time complexity of the line-search procedure can be reduced to the very efficient computation of the syndrome, which can be assumed to be constant. Without parallelization, the complexity is linear in the grid size, but the overall process remains extremely efficient.
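A vectorized sketch of this syndrome-based line search follows. The candidate update form y − λ ε̃ and the selection criterion (fewest parity-check errors over the grid) are our assumptions; all candidates are evaluated in one batched mod-2 product:

```python
import numpy as np

def syndrome_line_search(y, eps_mult, H, lambdas):
    """Pick the step size lambda whose updated word violates the fewest
    parity checks. Sketch only: grid and update form are assumptions."""
    # Candidate words for every lambda on the grid: shape (L, n).
    cand = y[None, :] - lambdas[:, None] * eps_mult[None, :]
    bits = (cand < 0).astype(int)          # hard decisions per candidate
    # Parity-check errors per candidate: batched xor as a mod-2 matmul.
    errs = (bits @ H.T % 2).sum(axis=1)
    best = np.argmin(errs)
    return lambdas[best], cand[best]

# Toy usage on a (3,1) repetition code: the first bit is flipped, and
# lambda = 1 moves it back across zero, yielding a zero syndrome.
H = np.array([[1, 1, 0], [0, 1, 1]])
y = np.array([-0.5, 1.0, 1.0])
eps_mult = np.array([-1.5, 0.0, 0.0])
lam, y_new = syndrome_line_search(y, eps_mult, H, np.array([0.0, 1.0]))
```

Because the syndrome of every candidate is computed in a single batched operation, the grid search adds essentially no latency on parallel hardware.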

