Reintroducing Straight-Through Estimators as Principled Methods for Stochastic Binary Networks

Abstract

Training neural networks with binary weights and activations is a challenging problem due to the lack of gradients and the difficulty of optimization over discrete weights. Many successful experimental results have been achieved with empirical straight-through (ST) approaches, proposing a variety of ad-hoc rules for propagating gradients through non-differentiable activations and updating discrete weights. At the same time, ST methods can be truly derived as estimators in the stochastic binary network (SBN) model with Bernoulli weights. We advance these derivations to a more complete and systematic study. We analyze properties and estimation accuracy, obtain different forms of correct ST estimators for activations and weights, explain existing empirical approaches and their shortcomings, and explain how latent weights arise from the mirror descent method when optimizing over probabilities. This allows us to reintroduce ST methods, long known empirically, as sound approximations, to apply them with clarity and to develop further improvements.

1. Introduction

Neural networks with binary weights and activations have much lower computation costs and memory consumption than their real-valued counterparts [18, 26, 45]. They are therefore very attractive for applications in mobile devices, robotics and other resource-limited settings, in particular for solving vision and speech recognition problems [8, 56]. The seminal works that showed the feasibility of training networks with binary weights [15] and binary weights and activations [27] used the empirical straight-through gradient estimation approach. In this approach the derivative of a step function like sign, which is zero, is substituted on the backward pass with the derivative of some other function, hereafter called a proxy function. One possible choice is to use the identity proxy, i.e., to completely bypass sign on the backward pass, hence the name straight-through [5]. This ad-hoc solution appears to work surprisingly well, and the later mainstream research on binary neural networks heavily relies on it [2, 6, 9, 11, 18, 34, 36, 45, 52, 60].

Fig. 1: The sign function and different proxy functions for derivatives used in empirical ST estimators: (a) sign, (b) identity, (c) Htanh / uniform noise, (d) tanh / logistic noise, (e) ApproxSign / triangular noise. Variants (c-e) can be obtained by choosing the noise distribution in our framework. Specifically, for a real-valued noise z with cdf F, the upper plots show E_z[sign(a - z)] = 2F(a) - 1 and the lower plots show twice the density, 2F'. Choosing the uniform distribution for z gives the density p(z) = (1/2)·1[z ∈ [-1, 1]] and recovers the common Htanh proxy in (c). The logistic noise has cdf F(z) = σ(2z), which recovers the tanh proxy in (d). The triangular noise has density p(z) = max(0, (2 - |z|)/4), which recovers a scaled version of ApproxSign [34] in (e). The scaling (standard deviation) of the noise in each case is chosen so that 2F'(0) = 1. The identity ST form in (b) we recover as latent weight updates with mirror descent.

The de-facto standard straight-through approach in the above-mentioned works is to use deterministic binarization and the clipped identity proxy, as proposed by Hubara et al. [27]. However, other proxy functions were tried experimentally, including tanh and the piece-wise quadratic ApproxSign [18, 34], illustrated in Fig. 1. This gives rise to a diversity of empirical ST methods, where various choices are studied purely experimentally [2, 6, 52]. Since binary weights can also be represented as a sign mapping of some real-valued latent weights, the same type of methods is applied to weights. However, often a different proxy is used for the weights, producing additional unclear choices. The dynamics and interpretation of latent weights are also studied purely empirically [51]. With such obscurity of latent weights, Helwegen et al. [24] argue that "latent weights do not exist", meaning that discrete optimization over binary weights needs to be considered. The existing partial justifications of deterministic straight-through approaches are limited to one-layer networks with Gaussian data [58] or to binarization of weights only [1] and do not lead to practical recommendations. In contrast to the deterministic variant used by the mainstream SOTA, straight-through methods were originally proposed (also empirically) for stochastic autoencoders [25] and studied in models with stochastic binary neurons [5, 44]. In the stochastic binary network (SBN) model which we consider, all hidden units and/or weights are Bernoulli random variables. The expected loss is a truly differentiable function of the parameters (i.e., weight probabilities) and its gradient can be estimated. This framework allows us to pose questions such as: "What is the true expected gradient?"
and "How far from it is the estimate computed by ST?" Towards computing the true gradient, unbiased gradient estimators were developed [20, 55, 57], which however have not been applied to networks with deep binary dependencies due to the increased variance in deep layers and a complexity that grows quadratically with the number of layers [48]. Towards explaining ST methods in SBNs, Tokui & Sato [54] and Shekhovtsov et al. [48] showed how to derive ST under linearizing approximations in SBNs. These results were however secondary in those works, obtained from more complex methods. They remained unnoticed in the works applying ST in practice and in recent works on its analysis [13, 58]. They were not properly related to existing empirical ST variants for activations and weights and did not offer an analysis. The goal of this work is to reintroduce straight-through estimators in a principled way in SBNs, and to formalize and systematize empirical ST approaches for activations and weights in shallow and deep models. Towards this goal we review the derivation and formalize many empirical variants and algorithms using the derived method and sound optimization frameworks: we show how different kinds of ST estimators can occur as valid modeling choices or valid optimization choices. We further study the properties of the ST estimator and its utility for optimization: we theoretically predict and experimentally verify the improvement of accuracy with network width, and show that popular modifications such as deterministic ST decrease this accuracy. For deep SBNs with binary weights we demonstrate that several estimators lead to equivalent results, as long as they are applied consistently with the model and the optimization algorithm. We discuss more details on related work, including alternative approaches for SBNs, in Appendix A.

2. Derivation and Analysis

Notation We model random states x ∈ {-1, 1}^n using the noisy sign mapping:

x_i = sign(a_i - z_i),    (1)

where z_i are real-valued independent noises with a fixed cdf F and a_i are (input-dependent) parameters. Equivalently to (1), we can say that x_i follows a {-1, 1}-valued Bernoulli distribution with probability p(x_i = 1) = P(a_i - z_i ≥ 0) = P(z_i ≤ a_i) = F(a_i). The noise cdf F will play an important role in understanding different schemes. For logistic noise, its cdf F is the logistic sigmoid function σ.

Algorithm 1: Straight-Through-Activations

Derivation The straight-through method was first proposed empirically [25, 32] in the context of stochastic autoencoders, highly relevant to date [e.g. 16]. In contrast to more recent works applying variants of deterministic ST methods, these earlier works considered stochastic networks. It turns out that in this context it is possible to derive ST estimators exactly in the same form as originally proposed by Hinton. This is why we will first derive, using observations of [48, 54], analyze and verify it for stochastic autoencoders. Let y denote observed variables. The encoder network, parametrized by φ, computes logits a(y; φ) and samples a binary latent state x via (1). As the noises z are independent, the conditional distribution of hidden states given observations p(x|y; φ) factors as ∏_i p(x_i|y; φ). The decoder reconstructs observations with p_dec(y|x; θ), another neural network, parametrized by θ. The autoencoder reconstruction loss is defined as

E_{y∼data} E_{x∼p(x|y;φ)} [-log p_dec(y|x; θ)].    (2)

The main challenge is in estimating the gradient w.r.t. the encoder parameters φ (differentiation in θ can simply be taken under the expectation). The problem for a fixed observation y takes the form

∂/∂φ E_{x∼p(x;φ)} [L(x)] = ∂/∂φ E_z [L(sign(a - z))],    (3)

where p(x; φ) is a shorthand for p(x|y; φ) and L(x) = -log p_dec(y|x; θ).
The reparametrization trick, i.e., to draw one sample of z in (3) and differentiate L(sign(a - z)), fails: since the loss as a function of a and z is not continuously differentiable, we cannot interchange the gradient and the expectation in z. If we nevertheless attempt the interchange, we obtain that the gradient of sign(a - z) is zero, as well as its expectation. Instead, the following steps lead to an unbiased low-variance estimator. From the LHS of (3) we express the derivative as

∂/∂φ ∑_x ∏_i p(x_i; φ) L(x) = ∑_x ∑_i ∏_{i'≠i} p(x_{i'}; φ) ∂p(x_i; φ)/∂φ L(x).    (4)

Then we apply derandomization [40, ch. 8.7], which performs summation over x_i holding the rest of the state x fixed. Because x_i takes only two values, we have

∑_{x_i} ∂p(x_i; φ)/∂φ L(x) = ∂p(x_i; φ)/∂φ L(x) + ∂(1 - p(x_i; φ))/∂φ L(x_{↓i}) = ∂p(x_i; φ)/∂φ (L(x) - L(x_{↓i})),    (5)

where x_{↓i} denotes the full state vector x with the sign of x_i flipped. Since this expression is now invariant of x_i, we can multiply it with 1 = ∑_{x_i} p(x_i; φ) and express the gradient (4) in the form:

∑_i ∑_{x_{¬i}} ∏_{i'≠i} p(x_{i'}; φ) ∑_{x_i} p(x_i; φ) ∂p(x_i; φ)/∂φ (L(x) - L(x_{↓i})) = ∑_x p(x; φ) ∑_i ∂p(x_i; φ)/∂φ (L(x) - L(x_{↓i})) = E_{x∼p(x;φ)} [∑_i ∂p(x_i; φ)/∂φ (L(x) - L(x_{↓i}))],    (6)

where x_{¬i} denotes all states excluding x_i. To obtain an unbiased estimate, it suffices to take one sample x ∼ p(x; φ) and compute the sum over i in (6). This estimator is known as local expectations [53] and coincides in this case with go-gradient [14], ram [54] and psa [48]. However, evaluating L(x_{↓i}) for all i may be impractical. A huge simplification is obtained if we assume that the change of the loss L when only a single latent bit x_i is changed can be approximated via linearization. Assuming that L is defined as a differentiable mapping R^n → R (i.e., that the loss is built up of arithmetic operations and differentiable functions), we can approximate

L(x) - L(x_{↓i}) ≈ 2x_i ∂L(x)/∂x_i,    (7)

where we used the identity x_i - (-x_i) = 2x_i.
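Before linearization, the estimator in (6) can be implemented directly. The following minimal numpy sketch (our own illustration, not the authors' code; it assumes logistic noise F = σ and a loss given as a callable) evaluates L at the sampled state and at each single-bit flip, and includes an exact enumeration reference for checking unbiasedness on small n:

```python
import itertools
import numpy as np

def sigmoid(t):
    return 1.0/(1.0 + np.exp(-t))

def local_expectations_grad(a, L, rng):
    """One-sample unbiased estimate of d/da E_x[L(x)], the estimator in (6),
    for x_i in {-1,+1} with p(x_i = 1) = F(a_i), logistic noise F = sigmoid."""
    p = sigmoid(a)
    x = np.where(rng.random(a.shape) < p, 1.0, -1.0)   # sample x ~ p(x; a)
    g = np.empty_like(a)
    for i in range(a.size):
        x_flip = x.copy()
        x_flip[i] = -x_flip[i]                          # flip bit i only
        # d p(x_i; a)/d a_i = x_i F'(a_i); no linearization of L here
        g[i] = x[i]*p[i]*(1.0 - p[i]) * (L(x) - L(x_flip))
    return g

def exact_grad(a, L):
    """Reference: d/da E_x[L(x)] by enumerating all 2^n states."""
    p, g = sigmoid(a), np.zeros_like(a)
    for xs in itertools.product([-1.0, 1.0], repeat=a.size):
        x = np.array(xs)
        prob = np.prod(np.where(x > 0, p, 1 - p))
        for i in range(a.size):
            p_xi = p[i] if x[i] > 0 else 1 - p[i]
            g[i] += prob/p_xi * x[i]*p[i]*(1 - p[i]) * L(x)
    return g
```

Averaged over many samples, the one-sample estimate matches the enumeration reference, at the cost of n extra loss evaluations per sample, which is exactly why the linearization below is attractive.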
Expanding the derivative of the conditional density, ∂p(x_i; φ)/∂φ = x_i F'(a_i(φ)) ∂a_i(φ)/∂φ, we obtain

∂p(x_i; φ)/∂φ (L(x) - L(x_{↓i})) ≈ 2F'(a_i(φ)) ∂a_i(φ)/∂φ ∂L(x)/∂x_i.    (8)

If we now define ∂x_i/∂a_i ≡ 2F'(a_i), the summation over i in (6) with the approximation (8) can be written in the form of a chain rule:

∑_i 2F'(a_i(φ)) ∂a_i(φ)/∂φ ∂L(x)/∂x_i = ∑_i ∂L(x)/∂x_i ∂x_i/∂a_i ∂a_i(φ)/∂φ.    (9)

To clarify, the estimator is already defined by the LHS of (9). We simply want to compute this expression by (ab)using the standard tools, and this is the sole purpose of introducing ∂x_i/∂a_i. Indeed, the RHS of (9) is a product of matrices that would occur in standard backpropagation. We have thus obtained the ST algorithm, Alg. 1. We can observe that it matches exactly the one described by Hinton [25]: to sample on the forward pass and use the derivative of the noise cdf on the backward pass, up to the multiplier 2, which occurs due to the use of ±1 encoding for x.
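In code, this amounts to sampling on the forward pass and scaling the incoming gradient by 2F'(a) on the backward pass. A minimal numpy sketch (our own illustration, assuming logistic noise F = σ; the function names are ours, not a reference implementation):

```python
import numpy as np

def sigmoid(t):
    return 1.0/(1.0 + np.exp(-t))

def st_forward(a, rng):
    """Forward pass: sample binary state x = sign(a - z), logistic noise (F = sigmoid)."""
    z = rng.logistic(size=np.shape(a))
    return np.where(a - z >= 0, 1.0, -1.0)

def st_backward(a, grad_x):
    """Backward pass: scale the incoming gradient dL/dx by 2 F'(a)."""
    p = sigmoid(a)
    return 2.0 * p * (1.0 - p) * grad_x
```

The forward sample satisfies E[x] = 2F(a) - 1, the mean plotted in Fig. 1; in a framework with automatic differentiation the backward rule would be registered as a custom gradient of the sampling operation.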

2.1. Analysis

Next we study properties of the derived ST algorithm and its relation to empirical variants. We will call a modification of Alg. 1 that does not use sampling in Line 3, but instead computes x = sign(a), a deterministic ST; and a modification that uses the derivative of some other function G instead of F in Line 5, as using a proxy G. Invariances Observe that binary activations (and hence the forward pass) stay invariant under transformations: sign(a_i - z_i) = sign(T(a_i) - T(z_i)) for any strictly monotone mapping T. Consistently, the ST gradient by Alg. 1 is also invariant to T. In contrast, empirical straight-through approaches, in which the derivative proxy is hand-designed, fail to maintain this property. In particular, rescaling the proxy leads to different estimators. Furthermore, when applying the transform T = F (the noise cdf), the backpropagation rule in Line 5 of Alg. 1 becomes equivalent to using the identity proxy. Hence we see that a common description of ST in the literature, "to back-propagate through the hard threshold function as if it had been the identity function", is also correct, but only for the case of uniform noise in [-1, 1]. Otherwise, and especially so for deterministic ST, this description is ambiguous because the resulting gradient estimator crucially depends on what transformations were applied under the hard threshold.
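The invariance claim can be checked numerically: reparametrizing both a and z by a strictly monotone T leaves the backward factor of Alg. 1 unchanged (we drop the common factor 2, which cancels in the comparison). A small self-contained sketch, our own illustration with the arbitrary choices of logistic noise and T(t) = t³:

```python
import numpy as np

def sigmoid(t):
    return 1.0/(1.0 + np.exp(-t))

# strictly monotone transform applied to both the logit a and the noise z
T       = lambda t: t**3
T_inv   = np.cbrt
T_prime = lambda t: 3*t**2

a, h = 0.7, 1e-5
# cdf of the transformed noise z' = T(z) for logistic z: F_T(u) = F(T^{-1}(u))
F_T = lambda u: sigmoid(T_inv(u))
# backward factor of Alg. 1 in the transformed parametrization,
# chained through a' = T(a) (finite-difference derivative of F_T at T(a)):
factor_transformed = (F_T(T(a) + h) - F_T(T(a) - h))/(2*h) * T_prime(a)
# backward factor in the original parametrization: F'(a)
factor_original = sigmoid(a)*(1.0 - sigmoid(a))
```

By the chain rule, F_T'(T(a)) = F'(a)/T'(a), so the extra T'(a) cancels and both factors agree; a hand-designed proxy has no such mechanism.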

ST Variants

Using the invariance property, many works applying randomized ST estimators are easily seen to be equivalent to Alg. 1: [16, 44, 49]. Furthermore, using different noise distributions for z, we can obtain correct ST analogues for the common choices of sign proxies used in empirical ST works, as shown in Fig. 1 (c-e). In our framework they correspond to the choice of parametrization of the conditional Bernoulli distribution, which should be understood similarly to how a neural network can be parametrized in different ways. If the "straight-through" idea is applied informally, however, this may lead to confusion and poor performance. The most cited reference for the ST estimator is Bengio et al. [5]. However, [5, Eq. 13] in fact defines the identity ST variant, incorrectly attributing it to Hinton (see Appendix A). We will show this variant to be less accurate for hidden units, both theoretically and experimentally. Pervez et al. [42] use ±1 binary encoding but apply the ST estimator without the coefficient 2. When such an estimator is used in a VAE, where the gradient of the prior KL divergence is computed analytically, it leads to a significant bias of the total gradient towards the prior. In Fig. 2 we illustrate that the difference in performance may be substantial. We analyze other techniques introduced in FouST in more detail in [47]. An inappropriate scaling by a factor of 2 can be detrimental in deep models as well, where the factor would be applied multiple times (in each layer). Bias Analysis Given the rather crude linearization involved, it is indeed hard to obtain fine theoretical guarantees about the ST method. We propose an analysis targeted at understanding the effect of common empirical variants and the conditions under which the estimator becomes more accurate. The respective formal theorems are given in Appendix B. I) When is ST unbiased? As we used linearization as the only biased approximation, it follows that Alg.
1 is unbiased if the objective function L is multilinear in x. A simple counter-example, where ST is biased, is L(x) = x².

Fig. 2: Comparison with the ST variant of [42]. The plots show the training loss (negative ELBO) over epochs for different learning rates. The variant of the ST algorithm used in [42] is misspecified because of the scaling factor and performs substantially worse for all learning rates. The full experiment specification is given in Appendix D.1.

In this case the expected value of the loss is 1, independently of the a that determines x, and the true gradient is zero. However, the expected ST gradient is E[2F'(a)2x] = 4F'(a)(2F(a) - 1), which may be positive or negative depending on a. On the other hand, any function of binary variables has an equivalent multilinear expression. In particular, if we consider L(x) = ‖Wx - y‖², analyzed by Yin et al. [58], then L̃(x) = ‖Wx - y‖² - ∑_i x_i² ‖W_{:,i}‖² + ∑_i ‖W_{:,i}‖² coincides with L on all binary configurations and is multilinear. It follows that ST applied to L̃ gives an unbiased gradient estimate of E[L], an immediate improvement compared to [58]. In the special case when L is linear in x, the ST estimator is not only unbiased but has zero variance, i.e., it is exact. II) How does using a mismatched proxy in Line 5 of Alg. 1 affect the gradient in φ? Since diag(F') occurs in the backward chain, we call estimators that use some matrix Λ instead of diag(F') internally rescaled. We show that for any Λ ⪰ 0, the expected rescaled estimator has a non-negative scalar product with the expected original estimator. Note that this is not completely obvious, as the claim is about the final gradient in the model parameters φ (e.g., weights of the encoder network in the case of autoencoders). However, if the ST gradient by Alg. 1 is biased (when L is not multilinear) but is nevertheless an ascent direction in expectation, the expected rescaled estimator may fail to be an ascent direction, i.e., to have a positive scalar product with the true gradient.
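Property I can be verified numerically for the quadratic example above: applying the ST rule to the multilinear surrogate L̃ yields, in exact expectation, the true gradient of E[L]. A small numpy check (our own illustration; logistic noise and the random problem sizes are arbitrary assumptions):

```python
import itertools
import numpy as np

def sigmoid(t):
    return 1.0/(1.0 + np.exp(-t))

rng = np.random.default_rng(1)
n = 3
W = rng.standard_normal((4, n))
y = rng.standard_normal(4)
a = rng.standard_normal(n)                  # logits, p(x_i = 1) = sigmoid(a_i)
p = sigmoid(a)
col = (W**2).sum(axis=0)                    # squared column norms ||W_{:,i}||^2

L   = lambda x: ((W @ x - y)**2).sum()
# multilinear surrogate: agrees with L whenever x_i^2 = 1
Lt  = lambda x: L(x) - (x**2 * col).sum() + col.sum()
dLt = lambda x: 2*W.T @ (W @ x - y) - 2*x*col   # gradient of the surrogate in x

def expect(f):
    """Exact expectation of f(x) over x in {-1,+1}^n."""
    return sum(np.prod(np.where(np.array(xs) > 0, p, 1 - p)) * f(np.array(xs))
               for xs in itertools.product([-1.0, 1.0], repeat=n))

# expected ST gradient: 2 F'(a_i) E[dLt(x)_i], Alg. 1 applied to the surrogate
st_grad = np.array([2*p[i]*(1 - p[i]) * expect(lambda x: dLt(x)[i]) for i in range(n)])

# true gradient of E[L] in a, via the mean mu = E[x] = 2p - 1 (multilinearity)
mu = 2*p - 1
true_grad = (2*W.T @ (W @ mu - y) - 2*mu*col) * (2*p*(1 - p))
```

Because dLt is linear in x, its expectation is obtained by substituting the mean, so the two quantities coincide exactly, whereas ST applied to L itself would carry the x_i² bias of the counter-example.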
III) When does the ST gradient provide a valid ascent direction? Assuming that all partial derivatives g_i(x) = ∂L(x)/∂x_i are L-Lipschitz continuous for some L, we can show that the expected ST gradient is an ascent direction for any such network if and only if |E_x[g_i(x)]| > L for all i. IV) Can we decrease the bias? Assume that the loss function is applied to a linear transform of Bernoulli variables, i.e., takes the form L(x) = ℓ(Wx). A typical initialization uses a random W normalized by the size of the fan-in, i.e., such that ‖W_{k,:}‖² = 1 ∀k. In this case the Lipschitz constant of the gradients of L scales as O(1/√n), where n is the number of binary variables. Therefore, using more binary variables decreases the bias, at least at initialization. V) Does deterministic ST give an ascent direction? Let g* be the deterministic ST gradient for the state x* = sign(a) and p* = p(x*|a) its probability. We show that the deterministic ST gradient forms a positive scalar product with the expected ST gradient if |g*_i| ≥ 2L(1 - p*) and with the true gradient if |g*_i| ≥ 2L(1 - p*) + L. From this we conclude that deterministic ST positively correlates with the true gradient when L is multilinear, improves with the number of hidden units in the case described by IV, and approaches the expected stochastic ST as units learn to be deterministic, so that the factor (1 - p*) decreases. Deep ST So far we have derived and analyzed ST for a single-layer model. It turns out that simply applying Alg. 1 in each layer of a deep model with conditional Bernoulli units gives the correct extension for this case. We will not focus on deriving deep ST here, but remark that it can be derived rigorously by chaining the derandomization and linearization steps, discussed above, for each layer [48]. In particular, [48] show that ST can be obtained by making additional linearizations in their (more accurate) PSA method. The insights from the derivation are twofold.
First, since derandomization is performed recurrently, the variance for deep layers is significantly reduced. Second, we know which approximations contribute to the bias: they are precisely the linearizations of all conditional Bernoulli probabilities in all layers and of the loss function as a function of the last Bernoulli layer. We may expect that using more units, similarly to property IV, would improve the linearizing approximations of intermediate layers, increasing the accuracy of the deep ST gradient.

3. Latent Weights do Exist!

Responding to the work "Latent weights do not exist: Rethinking binarized neural network optimization" [24] and to the lack of a formal basis for introducing latent weights in the literature (e.g., [27]), we show that such weights can be formally defined in SBNs and that several empirical update rules do in fact correspond to sound optimization schemes: projected gradient descent, mirror descent and variational Bayesian learning. Let w be ±1-Bernoulli weights with p(w_i = 1) = θ_i, and let L(w) be the loss function for a fixed training input. Consistently with the model for activations (1), we can define w_i = sign(η_i - z_i) in order to model the weights w_i using parameters η_i ∈ R, which we will call latent weights. It follows that θ_i = F_z(η_i). We need to tackle two problems in order to optimize E_{w∼p(w|θ)}[L(w)] in the probabilities θ: i) how to estimate the gradient and ii) how to handle the constraints θ ∈ [0, 1]^m.

Projected Gradient A basic approach to handle the constraints is projected gradient descent:

θ^{t+1} := clip(θ^t - εg^t, 0, 1),    (10)

where g^t = ∇_θ E_{w∼p(w|θ^t)}[L(w)] and clip(x, a, b) := max(min(x, b), a) is the projection. Observe that for the uniform noise distribution on [-1, 1] we have F(z) = clip((z + 1)/2, 0, 1) and θ_i = p(w_i = 1) = F(η_i) = clip((η_i + 1)/2, 0, 1). Because this F is linear on [-1, 1], the update (10) can be equivalently reparametrized in η as

η^{t+1} := clip(η^t - ε'h^t, -1, 1),    (11)

where ε' is a suitably rescaled step size and h^t = ∇_η E_{w∼p(w|F(η^t))}[L(w)].

Mirror Descent As an alternative approach to handle the constraints θ ∈ [0, 1]^m, we study the application of mirror descent (MD) and connect it with the identity ST update variants. A step of MD is found by solving the following proximal problem:

θ^{t+1} = argmin_θ ⟨g^t, θ - θ^t⟩ + (1/ε) D(θ, θ^t).    (12)

The divergence term (1/ε)D(θ, θ^t) weights how much we trust the linear approximation ⟨g^t, θ - θ^t⟩ when considering a step from θ^t to θ. When the gradient is stochastic we speak of stochastic mirror descent (SMD) [3, 59].
A common choice of divergence to handle probability constraints is the KL divergence D(θ_i, θ^t_i) = KL(Ber(θ_i) ‖ Ber(θ^t_i)) = θ_i log(θ_i/θ^t_i) + (1 - θ_i) log((1 - θ_i)/(1 - θ^t_i)). Solving for a stationary point of (12) gives

0 = g^t_i + (1/ε) (log(θ_i/(1 - θ_i)) - log(θ^t_i/(1 - θ^t_i))).

Observe that when F = σ we have log(θ_i/(1 - θ_i)) = η_i. Then the MD step can be written in the well-known convenient form using the latent weights η (natural parameters of the Bernoulli distribution):

θ^t := σ(η^t);    η^{t+1} := η^t - εg^t.    (13)

We have thus obtained the rule where on the forward pass θ = σ(η) defines the sampling probability of w, and on the backward pass the derivative of σ, which otherwise occurs in Line 5 of Alg. 1, is bypassed exactly as if the identity proxy were used. We define such an ST rule for optimization in the weights as Alg. 2. Its correctness is not limited to logistic noise. We show that for any strictly monotone noise distribution F there is a corresponding divergence function D:

Proposition 1. Common SGD in the latent weights η using the identity straight-through-weights Alg. 2 implements SMD in the weight probabilities θ with the divergence corresponding to F.

Proof in Appendix C. Proposition 1 reveals that although Bernoulli weights can be modeled in the same way as activations using the injected noise model w = sign(η - z), the noise distribution F for the weights corresponds to the choice of the optimization proximity scheme. Despite the generality of Proposition 1, we view the KL divergence as a more reliable choice in practice. Azizan et al. [3] have shown that optimization with SMD has an inductive bias to find the solution closest to the initialization point, as measured by the divergence used in MD, which has a strong impact on generalization. This suggests that MD with KL divergence will prefer higher-entropy solutions, making more diverse predictions. It follows that SGD on latent weights with logistic noise and identity straight-through Alg.
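The equivalence between the latent-weight update (13) and the KL-proximal MD step can be checked directly: after the update η ← η − εg with θ = σ(η), the stationarity condition of (12) holds exactly. A minimal numpy sketch (our own illustration; the step size, latent weights and gradient values are arbitrary):

```python
import numpy as np

def sigmoid(t):
    return 1.0/(1.0 + np.exp(-t))

def logit(p):
    return np.log(p/(1.0 - p))

eps = 0.1
eta = np.array([0.3, -1.2, 0.0])        # latent weights, theta = sigmoid(eta)
g   = np.array([0.5, -0.2, 1.0])        # (stochastic) gradient in theta

# identity-ST update on latent weights, the rule in (13):
eta_new   = eta - eps*g
theta_new = sigmoid(eta_new)

# stationarity of the KL-proximal MD step:
# 0 = g + (1/eps)(logit(theta_new) - logit(theta_old))
residual = g + (logit(theta_new) - logit(sigmoid(eta)))/eps
```

Since logit(σ(η)) = η, the residual reduces to g + (η − εg − η)/ε = 0, which is exactly why plain SGD on η implements MD on θ.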
2 enjoys the same properties.

Variational Bayesian Learning Extending the results above, we study the variational Bayesian learning formulation and show the following:

Proposition 2. Common SGD in the latent weights η with weight decay and identity straight-through-weights Alg. 2 is equivalent to optimizing a factorized variational approximation to the weight posterior p(w|data) using a composite SMD method.

Proof in Appendix C.2. As we can see, powerful and sound learning techniques can be obtained in the form of simple update rules using identity straight-through estimators. Therefore, identity-ST is fully rehabilitated in this context.

4. Experiments

Stochastic Autoencoders Previous work has demonstrated that the Gumbel-Softmax (biased) and arm (unbiased) estimators give better results than ST on training variational autoencoders with Bernoulli latents [16, 29, 57]. However, only the test performance was revealed to readers. We investigate in more detail what happens during training. Besides studying the training loss under the same training setup, we measure the gradient approximation accuracy using arm with 1000 samples as the reference. We train a simple yet realistic variant of a stochastic autoencoder for the task of text retrieval with binary representations on the 20newsgroups dataset. The autoencoder is trained by minimizing the reconstruction loss (2). Please refer to Appendix D.2 for the full specification of the model and experimental setup. For each estimator we perform the following protocol. First, we train the model with this estimator using Adam with lr = 0.001 for 1000 epochs. We then switch the estimator to arm with 10 samples and continue training for 500 more epochs (denoted as the arm-10 correction phase). Fig. 3 (top) shows the training performance for different numbers of latent bits n. It is seen (especially for 8 and 64 bits) that some estimators (especially st and det st) appear to make no visible progress, and even increase the loss, while switching them to arm makes a rapid improvement. Does it mean that these estimators are bad and arm is very good? An explanation of this phenomenon is offered in Fig. 3: these estimators accumulate a significant bias due to a systematic error component, which nevertheless can be easily corrected by an unbiased estimator.

Fig. 3: Training loss: each estimator is applied for 1000 epochs and then switched to arm-10 in order to correct the accumulated bias. Expected improvement: lower is better (measures the expected change of the loss); the dashed line shows the maximal possible improvement knowing the true gradient. Cosine similarity: higher is better; close to 1 means that the direction is accurate, while below 0 means the estimated gradient is not an ascent direction; error bars indicate empirical 70% confidence intervals over 100 trials.

To measure the bias and alignment of directions, as theoretically analyzed in Section 2.1, we evaluate the different estimators at the same parameter points, located along the learning trajectory of the reference arm estimator. At each such point we estimate the true gradient g by arm-1000. To measure the quality of a candidate one-sample estimator ĝ, we compute the expected cosine similarity and the expected improvement, defined respectively as

ECS = E[⟨g, ĝ⟩/(‖g‖‖ĝ‖)],    EI = -E[⟨g, ĝ⟩]/√(E[‖ĝ‖²]).

The expectations are taken over 100 trials and all batches. A detailed explanation of these metrics is given in Appendix D.2. These measurements, displayed in Fig. 3 for different bit lengths, clearly show that with a small bit length biased estimators consistently run into producing wrong directions. Identity ST and deterministic ST clearly introduce an extra bias over ST. However, when we increase the number of latent bits, the accuracy of all biased estimators improves, confirming our analyses IV and V.
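The cosine-similarity part of this evaluation is straightforward to compute from a reference gradient and a batch of one-sample estimates. A minimal sketch (our own helper, not the authors' code; the EI metric depends on details deferred to Appendix D.2 and is omitted here):

```python
import numpy as np

def ecs(g_ref, g_samples):
    """Expected cosine similarity between a reference gradient g_ref (vector)
    and a batch of one-sample gradient estimates g_samples (one per row)."""
    num = g_samples @ g_ref
    den = np.linalg.norm(g_samples, axis=1) * np.linalg.norm(g_ref)
    return float((num/den).mean())
```

A value near 1 indicates the estimator's direction is accurate, while a value below 0 indicates the estimates are, on average, not ascent directions.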
G c m c R I d v G / + v n + r q q m K s q k s B Q E v x r e h Y u X L l / Z u t q 8 d v 3 G z e 2 d 3 V s j m + a G 4 5 C n M j X H E b M o h c Y h C Z J 4 n B l k K p J 4 F J 2 + r P K j L 2 i s S P V 7 W m Q 4 U S z R I h a c U Y l G e w 8 + f A z 3 p j v t w A 9 q t c 6 b c G X a L 3 5 D r c F 0 t / F t P E t 5 r l A T l 8 z a k z D I q M N 6 n V j i Z z 1 x z J D g E o v m O L e Y M X 7 K E j x h v Y x l a D o t y f T M 8 t J 3 F D O J 0 M 9 D r i Y u w V Q h m c V m U U 7 x 0 4 k T O s s J N d / I H F P W L l R U t P Y V o 7 k 9 m 1 X w f 1 m c a r K b 3 S y S L X 3 1 6 X q c I / w 6 n W F s / d K U s L T j v h t X 7 a L I 9 Y s V G q z R 4 C M P 2 Y Q 6 X O u C H Y j g 1 F n E 6 N o q 4 = " > A A A C n H i c d Z H b i h N B E I Y r 4 2 m N h 9 3 V S 0 G C 2 Q U v 4 j A T 4 + l C X N C A I E J k T X Z x E 0 N P p 2 b S b n f P 2 F 0 j h m b A R / B W 3 8 y 3 8 B G c m c R I d v G / + v n + r q q m K s q k s B Q E v x r e h Y u X L l / Z u t q 8 d v 3 G z e 2 d 3 V s j m + a G 4 5 C n M j X H E b M o h c Y h C Z J 4 n B l k K p J 4 F J 2 + r P K j L 2 i s S P V 7 W m Q 4 U S z R I h a c U Y l G e w 8 + f O z u T X f a g R / U a p 0 3 4 c q 0 X / y G W o P p b u P b e J b y X K E m L p m 1 J 2 G Q U Y f 1 O r H E z 3 r i m C H B J R b N c W 4 x Y / y U J X j C e h n L 0 H R a k u m Z 5 a X v K G Y S o Z + H X E 1 c g q l C M o v N o p z i p x M n d J Y T a r 6 R O a a s X a i o a O 0 r R n N 7 N q v g / 7 I 4 1 W Q 3 u 1 k k W / r q 0 / U 4 R / h 1 O s P Y + q U p Y W n H f T e u 2 k W R 6 x c r N F i j w V 9 0 u E S c S X e 4 Z v 1 / s K 6 t x p F B 4 n M X + o / O g m 7 R 3 K + I R J 3 Q f F k q 9 K z c d + H K V R f N + m b P K j 1 e X + i i F B o j E i T x N D f I V C x x H J 8 f 1 f n 4 C x o r M v 2 R l j l O F U u 1 S A R n V K F o f / y p v z / b 7 Q Z + 0 K h z 2 Y R r 0 3 3 z G x q N Z n u t b 5 N 5 x g u F m r h k 1 p 6 F Q U 4 9 N u g l E j / r q W O G B J d Y t i e F x Z z x c 5 b i G R v k L E f T 6 0 i m 5 5 Z X v q e Y S Y V + H X I 1 d S l m C s 
k s t 4 s K S l 5 O n d B 5 Q a j 5 V u a Y s n a p 4 r J z o B g t 7 M W s h v / L k k y T 3 e 5 m k W z l 6 0 8 3 4 x z h 1 9 k c E + t X p o K V n Q z d p G 4 X x 2 5 Y r t F o g 0 Z / 0 f E K c S b d 8 Y Y N / 8 G m t h 5 H B o k v X O g / u w j 6 Z f u g J h J 1 S o t V q d D z a t + l q 1 Z d t p u b v a r 1 f H O h y + a k 7 4 d P / c G H f v f Q X x 0 P d u Q K K R r F U 6 Y j K f y s f r M Q L h 8 t w = " > A A A C n n i c d Z H P b t N A E M Y 3 p k B J C 7 R w 5 B I 1 r V S k y L J D 6 J 9 D R S U U x A F E U J v W U h O i 9 W b s r L q 7 d n f H i G h l i W f g S l + s b 8 E j Y D s h V V r x n T 7 9 v p 2 Z 1 U y Y C m 7 Q 8 2 5 q z o O V h 4 8 e r z 6 p r 6 0 / f f Z 8 Y / P F m U k y z a D P E p H o I K Q G B F f Q R 4 4 C g l Q D l a G A 8 / D y f Z m f f w d t e K J O c Z r C U N J Y 8 Y g z i g U K t q P d 4 N u n 1 9 u j j a b n e p U a 9 4 0 / N 8 1 3 f 0 i l 3 m i z 9 n M w T l g m Q S E T 1 J g L 3 0 u x R T u t S M C V G l q q k T M B e X 2 Q G U g p u 6 Q x X N B O S l P Q r Y a g a m x Y 4 V u S 6 p i r I 5 / J o Y 0 h k Y B 6 u l y U Y X Q w t F y l G Y J i S 5 m l 0 p i p D P P G j q Q 4 M X e z E v 4 v i x K F Z = " > A A A C m 3 i c d Z H b b t N A E I Y n 5 l T C q Y V L h B S R V u I i s u w Q T h e I S l U k h H o R V N x W a k K 0 3 o y d V X f X Z n c M R C t L v A G 3 8 G i 8 B Y + A 7 Y S g t O K / + v X 9 O z O r m T i X w l I Q / G p 5 V

X L

Fig. 4: Schematic of the SBN architecture. BN layers have real-valued scale and bias parameters that can adjust the scaling of activations relative to the noise. Z are independent injected noises with a chosen distribution. Binary weights W_ij are random ±1 Bernoulli(θ_ij) with learnable probabilities θ_ij. In experiments we consider an SBN with the same convolutional architecture as [15, 27]: (2×128C3) - MP2 - (2×256C3) - MP2 - (2×512C3) - MP2 - (2×1024FC) - 10FC - softmax.

The practical takeaways are as follows: 1) biased estimators may perform significantly better than unbiased ones, but might require a correction of the systematically accumulated bias; 2) with more units the ST approximation clearly improves and the bias has a less detrimental effect, requiring less correction; 3) Alg. 1 is more accurate than other ST variants in estimating the true gradient.

Classification with Deep SBN In this section we verify Alg. 1 with different choices of noise in a deep network, and verify optimization in binary weight probabilities using SGD on latent weights with Alg. 2. We consider the CIFAR-10 dataset and use the SBN model illustrated in Fig. 4. The SBN model, its initialization and the full learning setup are detailed in Appendix D.3.
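The forward pass of one stochastic binary layer of the kind used in this model can be sketched as follows. This is our own illustrative code, not the authors' implementation; the logistic noise (cdf F(z) = σ(2z), as in Fig. 1(d)) and the layer sizes are arbitrary choices for the example.

```python
import numpy as np

def stochastic_binary_layer(x, W, rng=None):
    """One SBN layer: pre-activation a = W x, injected noise z, output sign(a - z).

    With rng=None the layer runs in deterministic test mode, i.e. sign(a).
    Here z is logistic noise with cdf F(z) = sigmoid(2z); choosing uniform or
    triangular noise instead recovers the other proxies of Fig. 1.
    """
    a = W @ x
    if rng is None:
        z = np.zeros_like(a)
    else:
        u = rng.uniform(size=a.shape)
        z = 0.5 * np.log(u / (1 - u))  # inverse-cdf sample of the logistic noise
    return np.where(a - z >= 0, 1.0, -1.0)

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16)) / 4.0
x = np.where(rng.uniform(size=16) < 0.5, 1.0, -1.0)
x_det = stochastic_binary_layer(x, W)         # deterministic ('det') mode
x_sto = stochastic_binary_layer(x, W, rng)    # stochastic mode
```

Sampling several stochastic outputs and averaging the predictions corresponds to the multi-sample test mode discussed below.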
We trained this SBN with three choices of noise distributions corresponding to the proxies used by prior work, as in Fig. 1 (c-e). Table 1 shows the test results in comparison with baselines. We see that training with the different choices of noise distributions, corresponding to different ST rules, achieves similar results in all cases. This is in contrast to empirical studies advocating specific proxies, and is enabled by the consistency of the model, initialization and training. The identity ST applied to weights, implementing SMD updates, works well. Comparing to the empirical ST baselines (all except Peters & Welling), we see that there is no significant difference in the 'det' column, indicating that our derived ST method is on par with the well-guessed baselines. If the same networks are tested in the stochastic mode ('10-sample' column), there is a clear boost in performance, indicating an advantage of SBN models. Of the two experiments of Hubara et al., randomized training (rand.) also appears better, confirming the advantage of stochastic ST. In the stochastic mode, there is a small gap to Peters & Welling, who use a different estimation method and pretraining. Pretraining a real-valued network also seems important; e.g., [19] report 91.7% accuracy with VGG-Small using pretraining and a smooth transition from a continuous to a binarized model. When our method is applied with an initialization from a pretrained model, improved results (92.6% 10-sample test accuracy) can be obtained with an even smaller network [35]. There are, however, still stronger results in the literature; e.g., using neural architecture search with residual real connections, advanced data augmentation techniques and model distillation, [10] achieve 96.1%. The takeaway message here is that ST can be considered in the context of deep SBN models as a simple and robust method if the estimator matches the model and is applied correctly. Since we experimentally achieve near 100% training accuracy in all cases, the optimization fully succeeds and thus the bias of ST is tolerable.

Table 1: Test accuracy for different methods on CIFAR-10 with the same/similar architecture. SBN can be tested either with zero noises (det) or using an ensemble of several samples (we use 10-sample). Standard deviations are given w.r.t. 4 trials with random initialization. The two quotations for Hubara et al. [27] refer to their result with the Torch7 implementation using randomized Htanh and the Theano implementation using deterministic Htanh, respectively.

Fig. 5: Schematic explanation of the optimization process using a biased estimator followed by a correction with an unbiased estimator. Initially, the biased estimator makes good progress, but then the value of the true loss function may start growing while the optimization steps nevertheless come closer to the optimal location in the parameter space.
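The estimator verified in these experiments can be read as a pair of forward/backward rules. The following is a minimal numpy sketch under our reading of Alg. 1 (not the authors' code), for the logistic noise choice where F(a) = σ(2a) as in Fig. 1(d): sample x = sign(a - z) on the forward pass and multiply the incoming gradient by 2F'(a) on the backward pass.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def st_forward(a, rng):
    """Forward pass: sample x_i = sign(a_i - z_i) via p_i = F(a_i) = P(z_i <= a_i)."""
    p = sigmoid(2 * a)                 # F(a) for logistic noise with cdf sigmoid(2a)
    return np.where(rng.uniform(size=a.shape) < p, 1.0, -1.0)

def st_backward(a, grad_x):
    """Backward pass: dL/da is approximated by 2 F'(a) * dL/dx."""
    s = sigmoid(2 * a)
    Fprime = 2 * s * (1 - s)           # derivative of sigmoid(2a)
    return 2 * Fprime * grad_x

rng = np.random.default_rng(0)
a = np.array([-1.0, 0.0, 2.0])
x = st_forward(a, rng)
g = st_backward(a, np.ones_like(a))    # e.g. with incoming gradient of ones
```

Swapping the cdf for the uniform or triangular one yields the Htanh and ApproxSign variants of Fig. 1.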

5. Conclusion

We have put many ST methods on a solid basis by deriving and explaining them from first principles in one framework. It is now well-defined what they estimate and what the bias means. We obtained two different main estimators, for propagating through activations and through weights, bringing an understanding of what function they serve, what approximations they involve and what limitations these approximations impose. The resulting methods are in all cases strikingly simple; no wonder they were first discovered empirically long ago. We showed how our theory leads to a useful understanding of bias properties and to reasonable choices that allow for a more reliable application of these methods. We hope that researchers will continue to use these simple techniques, now with less guesswork and obscurity, as well as develop improvements to them.

Fig. 4] show that such relaxed objectives may indeed significantly diverge during training. To obtain good results, a strong dropout regularization and/or pretraining is needed [43, 46]. Despite these difficulties, they demonstrate on-par or improved results, especially when using the average prediction over multiple noise samples at test time. Dai et al. [17] perform a correct interchange of the derivative and the integral in (3) using weak (distributional) derivatives. After computing local expectations in this more complicated formalism, they arrive back at the finite differences (6), which they also propose to linearize as in (7). Thus their distributional SGD is equivalent to common SGD with the ST estimator of Alg. 1.

Bayesian Learning Bayesian learning in the form of simple update rules was recently (contemporaneously) studied by Khan & Rue [30]. We emphasize the interplay with the identity-ST estimator and the connection to implicit regularization. The recent work by Meng et al. [37] proposed Bayesian learning with binary weights using the Gumbel-Softmax estimator of gradients.
We analyze it in [47] and demonstrate that it reduces to a different, non-Bayesian, rule when applied in large-scale experiments.

B.1 Invariances

We have the following simple yet desirable and useful property. It is easy to observe that binary activations admit equivalent reformulations as

sign(a_i - z_i) = sign(T(a_i) - T(z_i))   (16)

for any strictly monotone mapping T: R → R.

Proposition B.1. The gradient computed by Alg. 1 is invariant to equivalent transformations under sign as in (16).

Proof. Let us denote the transformed noise as z̃_i = T(z_i), its cdf as G, and the transformed activation as ã_i = T(a_i). The sampling probability in line 2 of Alg. 1 does not change, since after the transformation it computes p = G(ã_i) = P(z̃_i ≤ ã_i | ã_i) = P(z_i ≤ a_i | a_i) = F(a_i). The gradient returned by line 5 does not change, since (d/da_i) G(T(a_i)) = (d/da_i) F(a_i) = F'(a_i).

In contrast, empirical straight-through approaches, where the proxy is hand-designed, fail to maintain this property. In particular, in the deterministic straight-through approach, transformations such as replacing sign(a_i) with sign(T(a_i)), while keeping the proxy of sign used in backprop fixed, lead to different gradient estimates. This partially explains why so many proxies have been tried, e.g. ApproxSign [34], and why their scale needed tuning. Another pathological special case, which leads to a confusion between identity straight-through and other forms, is as follows.

Corollary B.1. Let F be strictly monotone. Then letting T = F leads to T(z_i) being uniformly distributed. Let ã_i = T(a_i). In this case the backpropagation rule in line 5 of Alg. 1 can be interpreted as replacing the gradient of sign(ã_i - T(z_i)) in ã_i with just the identity. Indeed, since z̃_i = T(z_i) is uniform, we have G' = 1 on (0, 1), and ã_i = F(a_i) is guaranteed to be in (0, 1) by strict monotonicity. The gradient back-propagated by the usual rules through ã_i (outside of the ST Alg. 1) encounters the derivative of F as before.
Hence we see that the description "to back-propagate through the hard threshold function as if it had been the identity function" can be misleading: the resulting estimator crucially depends on which transformations are applied under the hard threshold, even though they do not affect the network predictions in any way. We refer to the variant of [5] as identity-ST, as it specifically uses the identity proxy for the gradient in the pre-sigmoid activation.
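Proposition B.1 can be checked numerically. In this small sketch (our own, with logistic noise and the strictly monotone map T(t) = t³ + t chosen for illustration), the two formulations of the activation in (16) agree sample-by-sample, so any estimator built on them is identical:

```python
import numpy as np

def T(t):
    """An arbitrary strictly monotone map (our choice for the demo)."""
    return t**3 + t

rng = np.random.default_rng(0)
a = 0.3
u = rng.uniform(size=100000)
z = 0.5 * np.log(u / (1 - u))   # logistic noise with cdf F(z) = sigmoid(2z)

x = np.sign(a - z)              # original activation
x_t = np.sign(T(a) - T(z))      # transformed activation as in (16)

assert np.array_equal(x, x_t)   # identical for every sample of z
```

The empirical frequency of x = 1 also matches F(a) = σ(2a), confirming that the sampling probability is unchanged by the transformation.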

B.2 Bias Analysis

I) Since the only approximation we made was the linearization of the objective L, we have the following basic property.

Proposition B.2. If the objective function L is multilinear(foot_1) in the binary variables x, then Alg. 1 is unbiased.

Proof. In this case (7) holds as an equality.

While extremely simple, this is an important point for understanding the ST estimator. As an immediate consequence, we can easily design counter-examples where ST is wrong.

Example 1. Let a ∈ R, x = sign(a - z) and L(x) = x². In this case the expected value of the loss is 1, independent of a, and the true gradient is zero. However, the expected ST gradient is E[2F'(a)·2x] = 4F'(a)(2F(a) - 1) and can be positive or negative depending on a.

Example 2 (Tokui & Sato [54]). Let L(x) = x - sin(2πx). Then the finite difference L(1) - L(0) = 1, but the derivative ∂L/∂x = 1 - 2π cos(2πx) equals 1 - 2π < 0 at the binary points. In this failure example ST, even in expectation, points in exactly the opposite direction of the true gradient.

An important observation from the above examples is that the result of ST is not invariant with respect to reformulations of the loss that preserve its values in all binary points. In particular, we have L ≡ 1 in the first example and L(x) ≡ x in the second example for any x ∈ {-1, 1}. If we used these equivalent representations instead, the ST estimator would have been correct. More generally, any real-valued function of binary variables has a unique polynomial (and hence multilinear) representation [7], and therefore it is possible to find a loss reformulation such that the ST estimator is unbiased. Unfortunately, this representation is intractable in most cases; it is tractable, however, e.g. for a quadratic loss, useful in regression and in autoencoders with a Gaussian observation model.

Proposition B.3. Let L(x) = ‖Wx - y‖². Then the multilinear equivalent reformulation of L is given by L̃(x) = ‖Wx - y‖² - Σ_i x_i² ‖W_{:,i}‖² + Σ_i ‖W_{:,i}‖², where W_{:,i} is the i-th column of W.

Proof.
By expanding the square and using the identity x_i² = 1 for x_i ∈ {-1, 1}.

Simply adjusting the loss using this equivalence and applying ST to it fixes the bias problem.

II) Next we ask whether dropping the multiplier diag(F'(a)), or replacing it by another multiplier, which we call an (internal) rescaling of the estimator, can lead to an incorrect estimation.

Proposition B.4. If instead of diag(F'(a)) any positive semidefinite diagonal matrix Λ is used in Alg. 1, the expected rescaled estimator preserves a non-negative scalar product with the expected original estimator.

Proof. We write the chain (9) in matrix form as J₁ᵀ Λ₀(a) J₂ᵀ(x), with the Jacobians J₁ = ∂a/∂φ, Λ₀ = diag(F'(a)) and J₂(x) = ∂L(x)/∂x. The modified gradient with Λ is then defined as J₁ᵀ Λ(a) J₂ᵀ(x). We are interested in the scalar product between the expected gradient estimates ⟨E[J₁ᵀ Λ₀ J₂ᵀ], E[J₁ᵀ Λ J₂ᵀ]⟩, where the expectation is over x. Since neither J₁ nor Λ, Λ₀ depend on x, we can move the expectations to J₂. Let J̄₂ = E[∂L(x)/∂x]. Then the scalar product between the expected estimates becomes ⟨J₁ᵀ Λ₀ J̄₂ᵀ, J₁ᵀ Λ J̄₂ᵀ⟩ = Tr(J̄₂ Λ J₁ J₁ᵀ Λ₀ J̄₂ᵀ). Notice that J₁J₁ᵀ is positive semidefinite, and Λ₀ is also positive semidefinite since it is diagonal with non-negative entries. It follows that R = Λ J₁ J₁ᵀ Λ₀ is positive semidefinite, and hence so is J̄₂ R J̄₂ᵀ; its trace is non-negative.

We obtained that the use of an internal rescaling, in particular the identity instead of F', is not too destructive: if Alg. 1 was unbiased, the rescaled estimator may be biased, but it is guaranteed to give an ascent direction in expectation, so that the optimization can in principle succeed. However, if Alg. 1 is itself biased (when L is not multilinear) but gives an ascent direction in expectation, the ascent direction property can no longer be guaranteed for the rescaled gradient.
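Example 1 above is easy to reproduce numerically. The following sketch (our own, with logistic noise) shows the Monte Carlo average of the ST estimate converging to 4F'(a)(2F(a) - 1) rather than to the true gradient, which is zero:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

rng = np.random.default_rng(0)
a = 0.7
F = sigmoid(2 * a)                 # F(a) for logistic noise
Fp = 2 * F * (1 - F)               # F'(a)

u = rng.uniform(size=200000)
x = np.where(u < F, 1.0, -1.0)     # samples of x = sign(a - z)
st_grad = 2 * Fp * 2 * x           # ST estimate for L(x) = x^2: 2 F'(a) * dL/dx

est = st_grad.mean()               # -> 4 F'(a)(2F(a) - 1), not the true gradient 0
```

Applying ST to the equivalent multilinear reformulation L ≡ 1 instead would give exactly zero.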
III) Next, we study whether the ST gradient is a valid ascent direction even when L is not multilinear.

Proposition B.5. Let L(x) be such that its partial derivative g_i = ∂L/∂x_i, as a function of x_i, is Lipschitz continuous for all i with a constant L. Then the expected ST gradient is an ascent direction for any a(φ) and L(x) if and only if |E[g_i]| ≥ L for all i.   (20)

Proof. Sufficiency (if part). The true gradient, using the local expectation form (6), expresses as E[Σ_i (∂a_i/∂φ) p_z(a_i) x_i(L(x) - L(x_{↓i}))] = E[J∆], where the expectation is w.r.t. x ∼ p(x; φ) and we introduced the matrix notation J = (∂a/∂φ)ᵀ diag(p_z(a)) and ∆_i = x_i(L(x) - L(x_{↓i})). The ST gradient replaces ∆_i with 2g_i(x). Since in both cases J does not depend on x, the expectation can be moved to the last term. Respectively, let us define ∆̄ = E[∆] and ḡ = E[g]. The scalar product between the true gradient and the expected ST gradient can then be expressed as ⟨J∆̄, Jḡ⟩ = Tr(J ḡ ∆̄ᵀ Jᵀ). From the relation x_i(L(x) - L(x_{↓i})) = ∫_{-1}^{1} g_i(x) dx_i and Lipschitz continuity of g_i in x_i we have the bounds

2(g_i(x) - L) ≤ x_i(L(x) - L(x_{↓i})) ≤ 2(g_i(x) + L).   (24)

It follows that 2(E[g] - L) ≤ E[∆] ≤ 2(E[g] + L) coordinate-wise.   (25)

The outer product ḡ∆̄ᵀ is positive semidefinite iff ḡ_i ∆̄_i ≥ 0 for all i. According to the bounds above, this holds true if

(∀i with ḡ_i ≥ 0): 2(ḡ_i - L) ≥ 0;  (∀i with ḡ_i < 0): 2(ḡ_i + L) ≤ 0;

or simply (∀i) |ḡ_i| ≥ L.

Necessity (only if part). We want to show that the requirements (20), which are simultaneous over all coordinates of g, cannot be relaxed unless we make some further assumptions about a or L. Namely, if ∃i* such that ḡ_{i*} ∆̄_{i*} < 0, then there exists a such that ⟨Jḡ, J∆̄⟩ < 0, i.e. a single wrong direction can potentially be rescaled by the downstream Jacobians to dominate the contribution of the other components. This is detailed in the following steps. Assume (∃i*) |ḡ_{i*}| < L. Then there exists L(x) such that the bounds (24) are tight (e.g.
L(x) = x²), and therefore ḡ_{i*} ∆̄_{i*} < 0 will hold. Since Λ = diag(p_z(a)) is positive semidefinite, Λ ḡ ∆̄ᵀ Λ preserves the non-positive sign of the component (i*, i*). There exists a(φ) such that ∂a/∂φ scales down all coordinates i ≠ i* and scales up i* such that Tr(J ḡ ∆̄ᵀ Jᵀ) is dominated by the entry (i*, i*). The resulting scalar product between the expected gradient and the true gradient can thus be negative.

IV) Next we study a typical use case when hidden binary variables are combined using a linear layer initialized randomly. A typical initialization procedure rescales the weights according to the size of the fan-in of each output.

Proposition B.6. Assume that the loss function is applied after a linear normalized transform of Bernoulli variables, i.e., takes the form L(x) = ℓ(Wx), where W ∈ R^{K×n} is a matrix of normally distributed weights, normalized to satisfy ‖W_{k,:}‖² = 1 ∀k. Then the expected Lipschitz constant of the gradients of L scales as O(1/√n).

Proof. Let u = Wx and let ∂ℓ/∂u be Lipschitz continuous with constant L. The gradient of L expresses as g_i = dL(x)/dx_i = ⟨∂ℓ(u)/∂u, W_{:,i}⟩. By the assumptions of random initialization and normalization, W_{k,i} ∼ N(0, 1/n). If we consider |g_i| in expectation over the initialization, we obtain E_W |g_i(x) - g_i(y)| = E_W ⟨∂ℓ(Wx)/∂u - ∂ℓ(Wy)/∂u, W_{:,i}⟩ ≤ L E_W ‖W_{:,i}‖ ≤ L K √(2/(nπ)). Therefore g_i has expected Lipschitz constant L K √(2/(nπ)). The normal distribution assumption is not essential for the O(1/√n) dependence: for any distribution with a finite variance it would hold as well, differing only in the constant factors. We obtain an important corollary.

Corollary B.2. As we increase the number of hidden binary units n in the model, the bias of ST decreases, at least at initialization.

V) Finally, we study conditions under which a deterministic version of ST gives a valid ascent direction.

Proposition B.7. Let x* = sign(a).
Let g_i = ∂L(x)/∂x_i be Lipschitz continuous with constant L. Let g* = g(x*) and p* = p(x*|a). The deterministic ST gradient at x* forms a positive scalar product with the expected stochastic ST gradient if |g*_i| ≥ 2(1 - p*)L ∀i.

Proof. Similarly to Proposition B.5, let J = (∂a/∂φ)ᵀ diag(p_z(a)). The scalar product between the expected ST gradient and the deterministic ST gradient is given by ⟨J E[g(x)], J g*⟩ = Tr(J E[g(x)] g*ᵀ Jᵀ). In order for it to be non-negative we need E[g(x)_i] g*_i ≥ 0 ∀i. Observe that E[g(x)_i] is a sum that includes g*_i with the weight p*. We therefore need

Σ_{x ≠ x*} p(x|a) g(x)_i g*_i + p* g*_i² ≥ 0.   (33)

From Lipschitz continuity of g_i we have the bound |g(x)_i - g*_i| ≤ L|x_i - x*_i|, or, using that |x_i - x*_i| ≤ 2,

g*_i - 2L ≤ g(x)_i ≤ g*_i + 2L.   (34)

Therefore g(x)_i g*_i ≥ g*_i² - 2L|g*_i|. We thus can lower bound (33) as

Σ_{x ≠ x*} p(x|a)(|g*_i| - 2L)|g*_i| + p* g*_i² = -2L|g*_i|(1 - p*) + g*_i².

This lower bound is non-negative if |g*_i| ≥ 2L(1 - p*).

Compared to Proposition B.5, this condition has an extra factor of 2(1 - p*). Since p* is the product of the probabilities of all units x*_i, we expect initially p* ≪ 1. This condition improves at the same rate with the increase in the number of hidden units as the case covered by Proposition B.6. In addition, it becomes progressively more accurate as units learn to be more deterministic, because then the factor (1 - p*) decreases. Note, however, that this proposition describes the gap between deterministic ST and stochastic ST; even when this gap diminishes, the gap between ST and the true gradient remains. We can obtain a similar sufficient condition for the scalar product between deterministic ST and the expected true gradient that (unlike a direct combination of Proposition B.5 and Proposition B.7) ensures an ascent direction.

Proposition B.8. Let x* = sign(a).
Let g(x)_i = ∂L(x)/∂x_i be Lipschitz continuous with constant L. Let g* = g(x*) and p* = p(x*|a). The deterministic ST gradient at x* forms a positive scalar product with the true gradient if |g*_i| ≥ 2(1 - p*)L + L ∀i.

Proof. The proof is similar to Proposition B.7, only in this case we need to ensure E[∆_i] g*_i ≥ 0. Using (25) we get the bounds 2(E[g] - L) ≤ E[∆] ≤ 2(E[g] + L), and using additionally (34) we get

2(p* g*_i + (1 - p*)(g*_i - 2L) - L) ≤ E[∆_i] ≤ 2(p* g*_i + (1 - p*)(g*_i + 2L) + L).

Collecting the terms,

2(g*_i - (1 - p*)2L - L) ≤ E[∆_i] ≤ 2(g*_i + (1 - p*)2L + L).

Multiplying by g*_i, we obtain that a sufficient condition for E[∆_i] g*_i ≥ 0 is |g*_i| ≥ (1 - p*)2L + L.

C Mirror Descent and Variational Mirror Descent

C.1 Mirror Descent

Mirror descent (MD) is a widely used method for constrained optimization of the form min_{x∈X} f(x), where X ⊂ R^n, introduced by Nemirovsky & Yudin [39]. Let Φ: X → R be strictly convex and differentiable on X, called a mirror map. The Bregman divergence D_Φ(x, y) associated with Φ is defined as

D_Φ(x, y) = Φ(x) - Φ(y) - ⟨∇Φ(y), x - y⟩.

An update of the MD algorithm can be written as

x^{t+1} = argmin_{x∈X} ⟨x, ∇f(x^t)⟩ + (1/ε) D_Φ(x, x^t).

In the unconstrained case, when X = R^n, or in the case when the critical point is guaranteed to be in X (as typically ensured by the design of D_Φ), the solution can be found from the critical point equations, leading to the general form of the iterates

∇Φ(x^{t+1}) = ∇Φ(x^t) - ε∇f(x^t),   (45)
x^{t+1} = (∇Φ)^{-1}(∇Φ(x^t) - ε∇f(x^t)).

Proposition 1. Common SGD in the latent weights η using the identity straight-through-weights Alg. 2 implements SMD in the weight probabilities θ with the divergence corresponding to F.

Proof. The proof closely follows Ajanthan et al. [1]. Differently from us, they considered deterministic ST; their argumentation involves taking the limit in which F is squashed into the step function, which renders MD invalid. This limit is not needed in our formulation. We start from the defining equation of the MD update in the form (45). In order for (45) to match common SGD on η with η_i = F^{-1}(θ_i), the mirror map Φ must satisfy ∇Φ(θ) = F^{-1}(θ), where F^{-1} is applied coordinate-wise. We can therefore consider coordinate-wise mirror maps Φ: R → R. The inverse F^{-1} exists if F is strictly monotone, i.e., if the noise density is non-zero on the support. Finding the mirror map Φ explicitly is not necessary for our purpose; in the 1D case it can be expressed simply as Φ(x) = ∫_0^x F^{-1}(η) dη. With this coordinate-wise mirror map, the MD update can be written as

η^{t+1} = η^t - ε (dL/dθ)|_{θ = F(η^t)}.   (46)

Thus MD on θ takes the form of a descent step on η with the gradient dL/dθ.
A common SGD on η would instead use the gradient dL/dη = (∂θ/∂η)(∂L/∂θ). Thus (46) bypasses the Jacobian ∂θ/∂η, which is exactly what Alg. 2 does. More precisely, when applying the same derivations that we used to obtain ST for activations in order to estimate dL/dθ, since F(η_i) = θ_i, the factor (∂/∂θ) p(w_i; θ) present in (6) expresses as dF(η)/dθ = ∂F(F^{-1}(θ))/∂θ = 1 and thus can be omitted from the chain rule, as defined in Alg. 2.
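Proposition 1 can be illustrated with a tiny numeric check (our own sketch, choosing logistic noise so that F = σ(2·) and η = F^{-1}(θ)): one MD step on θ in the form (45), with mirror map gradient F^{-1}, produces exactly the same new probabilities as one SGD step on the latent weights η that bypasses ∂θ/∂η.

```python
import numpy as np

def F(eta):
    """cdf of the logistic noise: F(eta) = sigmoid(2 eta)."""
    return 1.0 / (1.0 + np.exp(-2 * eta))

def F_inv(theta):
    """Mirror map gradient: F^{-1}(theta)."""
    return 0.5 * np.log(theta / (1 - theta))

eps = 0.1
eta_t = np.array([-0.4, 0.2, 1.3])       # latent weights
theta_t = F(eta_t)                       # weight probabilities
g = np.array([0.5, -1.0, 0.2])           # some gradient dL/dtheta

# Mirror descent step (45) on theta:
theta_md = F(F_inv(theta_t) - eps * g)

# SGD on latent weights with identity straight-through (Alg. 2):
theta_sgd = F(eta_t - eps * g)

assert np.allclose(theta_md, theta_sgd)  # the two updates coincide
```

The same identity holds for any strictly monotone F; the logistic choice above is only for concreteness.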

C.2 Latent Weight Decay Implements Variational Bayesian Learning

In the Bayesian learning setting we consider a model with binary weights w and are interested in estimating p(w|D), the posterior distribution of the weights given the data D and the weight prior p(w). In the variational Bayesian (VB) formulation, this difficult and multi-modal posterior is approximated by a simpler one, q(w), commonly a fully factorized distribution. The approximation is achieved by minimizing KL(q(w) ‖ p(w|D)). Let q(w) = Ber(w; θ) and p(w) = Ber(w; 1/2), both meant component-wise, i.e. fully factorized. Then the VB problem takes the form

argmin_θ -E_{(x⁰,y)∼data} E_{w∼Ber(θ)} log p(y|x⁰; w) + (1/N) KL(Ber(θ) ‖ Ber(1/2)),

where we have rewritten the data likelihood as an expectation, hence the coefficient 1/N in front of the KL term. This problem is commonly solved by SGD, taking one sample from the training data and one sample of w and applying backpropagation [21]. We can in principle do the same by applying an estimator for the gradient in θ. The trick that we apply, different from common practice, is not to compute the gradient of the KL term but to keep this term explicit through to the proximal step, leading to a composite MD [59]. With this we have

Proposition 2. Common SGD in latent weights η with a weight decay and identity straight-through-weights Alg. 2 is equivalent to optimizing a factorized variational approximation to the weight posterior p(w|data) using a composite SMD method.

Proof. Expanding the data log-likelihood as the sum over all data points, we get log p(D|w) = Σ_i log p(x_i|w) =: Σ_i l_i(w). When multiplied by 1/N, the first term becomes the usual expected data likelihood, where the expectation is over training data and weights w ∼ q(w). Expanding also the parametrization q(w) = Ber(w|θ), the variational inference reads

argmin_θ -E_{w∼Ber(θ)} (1/N) Σ_i l_i(w) + (1/N) KL(q(w) ‖ p(w)) + const.
We employ mirror descent to handle the constraints θ ∈ [0, 1]^m similarly to the above, but now we apply it to this composite function, linearizing only the data part and keeping the prior KL part non-linear. Let g = (1/|I|) Σ_{i∈I} ∇_θ E_{w∼Ber(θ)} l_i(w) be the stochastic gradient of the data term in the weight probabilities θ using a mini-batch I. The SMD step subproblem reads

min_θ gᵀθ + (1/ε) KL(Ber(θ) ‖ Ber(θ^t)) + (1/N) KL(Ber(θ) ‖ Ber(1/2)).

We notice that KL(Ber(θ) ‖ Ber(1/2)) = -H(Ber(θ)) up to an additive constant, where H is the entropy, and we also introduce the prior scaling coefficient λ = 1/N in front of the entropy, which may optionally be lowered to decrease the regularization effect. With these notations, the composite proximal problem becomes

min_θ gᵀθ + (1/ε) KL(Ber(θ) ‖ Ber(θ^t)) - λH(Ber(θ)).

The solution is found from the critical point equation in θ:

∇_θ [gᵀθ + (1/ε) KL(Ber(θ) ‖ Ber(θ^t)) - λH(Ber(θ))] = 0,   (53a)
g_i + (1/ε)(log(θ_i/(1-θ_i)) - log(θ^t_i/(1-θ^t_i))) + λ log(θ_i/(1-θ_i)) = 0,   (53b)
(ελ + 1) log(θ_i/(1-θ_i)) = log(θ^t_i/(1-θ^t_i)) - εg_i,   (53c)
log(θ_i/(1-θ_i)) = (1/(ελ+1)) log(θ^t_i/(1-θ^t_i)) - (ε/(ελ+1)) g_i.

For the natural parameters we obtain

η = (η^t - εg)/(ελ + 1) = η^t - (ε/(ελ+1))(λη^t + g).

We can further drop the correction of the step size, ε/(ελ+1) ≈ ε, since ελ + 1 ≈ 1 and the step size needs to be selected by cross-validation anyhow. This gives us an update of the form η = η^t - ε(g + λη^t), which is the form of a standard step in any SGD or adaptive SGD optimizer. The difference is that the gradient in the probabilities θ is applied to make a step in the logits η, and the prior KL divergence contributes the logit decay λ, which in this case is the latent weight decay. Since the ST gradient in θ differs from the ST gradient in η by the factor diag(F'), the claim of Proposition 2 follows.
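The closed-form update (53c) can be verified by substituting it back into the critical point equation (53b); the numbers in this sketch of ours are arbitrary:

```python
import numpy as np

def logit(t):
    return np.log(t / (1 - t))

eps, lam = 0.1, 1e-3                        # step size and prior weight 1/N
eta_t = np.array([-0.7, 0.1, 2.0])          # current logits (latent weights)
theta_t = 1 / (1 + np.exp(-eta_t))          # current probabilities
g = np.array([0.3, -0.8, 0.05])             # stochastic gradient in theta

# Closed-form solution of the composite proximal problem, eq. (53c):
eta_next = (eta_t - eps * g) / (eps * lam + 1)
theta_next = 1 / (1 + np.exp(-eta_next))

# The critical point equation (53b) vanishes at this solution:
residual = g + (logit(theta_next) - logit(theta_t)) / eps + lam * logit(theta_next)
assert np.allclose(residual, 0)

# ...and it matches the 'latent weight decay' form of the update:
assert np.allclose(eta_next, eta_t - eps / (eps * lam + 1) * (g + lam * eta_t))
```

Dropping the factor 1/(ελ+1) ≈ 1 then gives the familiar SGD-with-weight-decay step on the logits.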

D.1 MNIST VAE

Here we give the specification of the experiment in Fig. 2, which illustrates the point that mismatching the constant factor in front of the ST estimator leads to poor performance when the gradient is to be combined with other gradients, in this case with the analytic gradient of the KL divergence in the VAE.

Dataset We use the MNIST dataset(foot_5). It contains 60000 training and 10000 test images of handwritten digits. We used 50000 images for training; the remainder was kept as a validation set, however it was not utilized in this experiment.

Preprocessing No preprocessing or augmentation was performed. The grayscale image intensities in [0, 1] are interpreted as target Bernoulli probabilities for the decoder.

Model We used {0, 1} encoding of the hidden states x. Closely following the experiment design of [22, 42], we used the following network as the encoder: Linear, which outputs the logits ν of a conditionally independent Bernoulli generative model p_dec(y_i = 1|x) = σ(ν_i). The data images t ∈ R^784 are interpreted as target probabilities, and the negative conditional log-likelihood becomes

H = -Σ_i [t_i log p_dec(y_i = 1|x) + (1 - t_i) log p_dec(y_i = 0|x)].

We optimize the negative lower bound on the log-likelihood:

L = E_{y∼data} [E_{x∼p(x|y)} H + KL(p(x|y) ‖ p(x))],

where p(x) is the uniform prior, p(x_i) = 1/2.

Optimization We compute the KL term analytically for a mini-batch and use its exact gradient. The gradient of the expectation of H is estimated. We used the Adam optimizer with a learning rate in {0.001, 0.0003, 0.0001}.
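For the uniform prior p(x_i) = 1/2, the analytic KL term reduces per latent unit to KL(Ber(p) ‖ Ber(1/2)) = p log 2p + (1 - p) log 2(1 - p). A minimal sketch of this computation (our own):

```python
import numpy as np

def kl_bernoulli_uniform(p, eps=1e-12):
    """KL(Ber(p) || Ber(1/2)) = p log 2p + (1-p) log 2(1-p), elementwise.

    Clipping avoids log(0) at p in {0, 1}, where the KL approaches log 2.
    """
    p = np.clip(p, eps, 1 - eps)
    return p * np.log(2 * p) + (1 - p) * np.log(2 * (1 - p))

p = np.array([0.5, 0.9, 0.999])
kl = kl_bernoulli_uniform(p)   # 0 at p = 1/2, approaching log 2 near the extremes
```

Since this term is available in closed form, only the gradient of the expectation of H needs to be estimated, which is what makes the scale mismatch of the ST factor visible in this experiment.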

D.2 Stochastic Autoencoder

It was shown in the literature that semantic hashing can achieve superior results using learned binary hash codes, in particular based on the variational autoencoder (VAE) formulation, e.g., the recent works [12, 16, 38]. We propose a series of experiments that targets measuring the accuracy of gradient estimators through Bernoulli units and studying the dependence of this accuracy on the number of hidden units. It is appropriate to study here the plain stochastic autoencoder (2) and not a variational autoencoder (57) for the following reasons: 1) the gradient of the prior KL term is known and need not be estimated; 2) a VAE usually finds solutions in a partial posterior collapse (efficiently selecting the number of hidden units to use), which is in contradiction with our goal of studying the dependence on the number of hidden units. In practice, the KL prior often needs to be tuned (in the public implementation of Ñanculef et al. [38] one can find that β = 0.015 is used), which is complicating and irrelevant for our goals.

Dataset The 20Newsgroups dataset is a collection of approximately 20,000 text documents, partitioned (nearly) evenly across 20 different newsgroups. In our experiments we do not use the partitioning. We used the processed version of the dataset denoted as Matlab/Octave on the dataset's web site. It contains bag-of-words representations of documents given by one sparse word-document count matrix. We worked with the training set, which contains 11269 documents in the bag-of-words representation.

Preprocessing We keep only the 10000 most frequent words in the training set to reduce the computation requirements. Each of the omitted rare words occurs in no more than 10 documents.

Reconstruction Loss Let y ∈ N^d be the vector of word counts of a document and x ∈ {0, 1}^n be a latent binary code representing the topic that we will learn.
The decoder network, given the code x, deterministically outputs word frequencies f ∈ [0, 1]^d, Σ_i f_i = 1, and the reconstruction loss −log p_dec(y|x; θ) is defined as −Σ_i y_i log f_i, i.e., the negative log-likelihood of a generative model in which the word counts y follow a multinomial distribution with probabilities f and the number of trials equal to the length of the document. The encoder p(x|f; φ) obtains word frequencies f from y and maps them deterministically to Bernoulli probabilities p(x_i|f; φ). The loss of the autoencoder (2) is then E_{y∼data} E_{x∼p(x|y)} [−log p_dec(y|x; θ)].

Networks The encoder network takes on the input the word frequencies f ∈ R^d and applies the following stack: FC(d × 512), ReLU, FC(512 × n), where FC is a fully connected layer. The output is the vector of logits of the Bernoulli latent bits. The decoder network is symmetric: FC(n × 512), ReLU, FC(512 × d), Softmax. Its input is a binary latent code x and its output is the vector of word probabilities f. Standard weight initialization is applied to all linear layers W, setting W_{i,j} ∼ U[−1/√k, 1/√k], where k is the number of input dimensions of the layer. This is a standard initialization scheme [23], which is consistent with the assumptions we make in Proposition B.6 and hence important for verification of our analysis.
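The multinomial reconstruction loss above (dropping the count-dependent normalization constant, which does not affect gradients) can be sketched as follows; the function name is ours:

```python
import math

def multinomial_nll(counts, freqs):
    """Reconstruction loss -sum_i y_i log f_i for word counts y and
    decoder word frequencies f (assumed to sum to 1)."""
    return -sum(y * math.log(f) for y, f in zip(counts, freqs) if y > 0)
```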

Table D.1: List of estimators evaluated in the stochastic autoencoder experiment.

Name         Details
arm          State-of-the-art unbiased estimator [57].
Gumbel(τ)    Gumbel-Softmax estimator [29] with temperature parameter τ.
ST           Straight-through estimator, Alg. 1.
det ST       Deterministic version of ST, setting the noise z = 0 during training.
identity ST  Identity ST variant described by [5].

Estimators The estimators evaluated in this experiment are described in Table D.1. As detailed in Section 2, in the identity ST we still draw random samples in the forward pass as in Alg. 1 but omit the multiplication by F′. Alg. 1 is correctly instantiated for the {0, 1} rather than ±1 encoding in all cases. For the arm-10 correction phase and the arm-1000 ground-truth estimation, the average of arm estimates with the respective number of samples is taken.

Optimizer We used the Adam [31] optimizer with a fixed starting learning rate lr = 0.001 in both phases of the training. When switching to the arm-10 correction phase, we reinitialize Adam in order to reset the running averages.

Evaluation For each bit length we save the encoder and decoder parameter vectors φ, θ every 100 epochs along the arm training trajectory. At each such point, offline to the training, we first apply arm-1000 in order to obtain an accurate estimate of the true gradient g. We then evaluate each of the 1-sample estimators, including arm itself (= arm-1). The next question we discuss is how to measure the estimator accuracy. Clearly, if we just consider the expected local performance such as E[⟨g, ĝ⟩], unbiased estimators win regardless of their variance. This is therefore not appropriate for measuring their utility in optimization. We evaluate three metrics tailored for the comparison of biased and unbiased estimators.

Cosine Similarity This metric evaluates the expected cosine similarity, measuring the alignment of directions: E[⟨g, ĝ⟩/(‖g‖ ‖ĝ‖)], where the expectation is over all training data batches and 100 stochastic trials of the estimator ĝ. This metric is well aligned with our theoretical analysis in Section 2.1.
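The cosine similarity metric can be sketched directly from its definition; a minimal sketch over a list of stochastic trials (the function name is ours):

```python
import math

def expected_cosine_similarity(g_true, g_samples):
    """Estimate E[<g, g_hat> / (||g|| ||g_hat||)] over stochastic trials g_hat."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))
    def norm(u):
        return math.sqrt(dot(u, u))
    sims = [dot(g_true, gh) / (norm(g_true) * norm(gh)) for gh in g_samples]
    return sum(sims) / len(sims)
```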
It does not, however, measure how well the gradient length is estimated. If the length has a high variance, this may hinder the optimization but would not be reflected by this metric.

Expected Improvement To estimate the utility of the estimator for optimization, we propose to measure the expected optimization improvement using the same proximal problem objective that is used in SGD or SMD to find an optimization step. Namely, let g = ∇_φ L(φ^t) be the true gradient at the current point. The common SGD step is defined as

φ^{t+1} = φ^t + argmin_{Δφ} [ ⟨g, Δφ⟩ + (1/(2ε)) ‖Δφ‖² ].

The optimal solution is given by Δφ = −εg. Since instead of g only an approximation is available to the optimizer, we allow it to use the solution Δφ = −αĝ, where ĝ is an estimator of g and α is one scalar parameter to adapt the step size. We then consider the expected decrease of the proxy objective:

E[ ⟨g, −αĝ⟩ + (α²/(2ε)) ‖ĝ‖² ].   (62)

The parameter α corresponds to a learning rate that can be tuned or adapted during learning. We set it optimistically for each estimator by minimizing the expected objective (62), which is a simple quadratic function in α. One scalar α is thus estimated for one measuring point (i.e., for one expectation over all training batches and all 100 trials). As such, it is not overfitting to each estimator. The optimal α is given by

α = ε E[⟨g, ĝ⟩] / E[‖ĝ‖²]   (63)

and the value of the objective for this optimal α is

−(ε/2) E[⟨g, ĝ⟩]² / E[‖ĝ‖²].   (64)

For the purpose of comparing estimators, the factor ε/2 is irrelevant and the comparison can be made on the signed square root of (64). We obtain an equivalent metric that is the expected loss decrease normalized by the RMS of the gradient estimates:

−E[⟨g, ĝ⟩] / √(E[‖ĝ‖²]).   (65)

Compare with common adaptive methods, which divide the step length exactly by the square root of a running average of the second moment of gradients, in particular Adam (applied per-coordinate there). This suggests that this metric is well tailored to measure the utility of the estimator for optimization.
For brevity, we refer to (65) as the expected improvement. Note also that in (65) we preserve the sign of E[⟨g, ĝ⟩], and if the estimator is systematically in the wrong direction, we expect to measure a positive value in (65), i.e., predicting objective ascent rather than descent.

Root Mean Squared Error It is rather common to measure the error of biased estimators as

RMSE = √(E[‖g − ĝ‖²]).   (66)

This metric however may be less indicative and less discriminative of the utility of the estimator for optimization. In Fig. D.1 it is seen that the RMSE of the arm estimator can be rather high, especially with more latent bits, yet it performs rather well in optimization.

We normalize the noise distribution so that it has zero mean and F(0) = 1/2. This choice ensures that the Jacobian 2F′(a) in Line 5 of Alg. 1 at the mean value of the pre-activations is the identity matrix and therefore gradients do not vanish. We want to initialize the weight probabilities θ_i = F_ξ(η_i) as uniform in [0, 1]. The corresponding initialization of the latent weights is then η_i = F_ξ^{-1}(θ_i) (which would be a completely non-obvious choice to propose empirically for deterministic ST methods).

Dataset The dataset consists of 60000 32×32 color images divided in 10 classes, 6000 images per class. There is a predefined training set of 50000 examples and a test set of 10000 examples.

Preprocessing During training we use the standard augmentation for CIFAR-10, namely random horizontal flipping and random cropping of a 32×32 region with a random padding of 0-4 px on each side.

Optimizer We use the Adam optimizer [31] in all the experiments. The initial learning rate γ = 0.01 is used for 300 epochs; we then divide it by 10 at epochs 300 and 400 and stop at epoch 500. This is fixed for all models. All other Adam hyper-parameters such as β1, β2, ε are set to their corresponding default values in the PyTorch [41] framework.
Training Loss Let the softmax prediction of the network on the input image x_0 with noise realizations z in all layers be denoted as p(x|z, x_0). The training loss for the stochastic binary network is the expected loss under the noises: E_{x_0∼data} E_z [−log p(x|z, x_0)]. The training procedure is identical to how neural networks with dropout noises are trained [50]: one sample of the noise is generated alongside each random data point.

Evaluation At test time we can set z = 0 to obtain a deterministic binary network (denoted as 'det'). We can also consider the network as a stochastic ensemble and obtain the prediction via the expected predictive distribution E_z[p(x|z, x_0)], approximated by several samples. In the experiments we report performance in this mode using 10 samples. We observed that increasing the number of samples further improves the accuracy only marginally. We compute the mean and standard deviation of the obtained accuracy values by averaging the results over 4 different random learning trials for each experiment.
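The ensemble evaluation mode can be sketched as a simple Monte-Carlo average of predictive distributions; `forward` is a hypothetical callable assumed to draw fresh noise z internally on every call:

```python
def ensemble_predict(forward, x0, num_samples=10):
    """Approximate E_z[p(x | z, x0)] by averaging softmax outputs over noise samples."""
    probs = None
    for _ in range(num_samples):
        p = forward(x0)  # one stochastic forward pass with freshly sampled noise
        probs = p if probs is None else [a + b for a, b in zip(probs, p)]
    return [a / num_samples for a in probs]
```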



Footnotes:
The conditions allow applying the Leibniz integral rule to exchange derivative and integral. Other conditions may suffice, e.g., when using weak derivatives [17].
E.g., x_1 x_2 x_3 is trilinear and thus qualifies, but x_1^2 is not multi-linear.
http://yann.lecun.com/exdb/mnist/
http://qwone.com/~jason/20Newsgroups/



Fig. 2: Training VAE on MNIST, closely following the experimental setup of [42]. The plots show the training loss (negative ELBO) over epochs for different learning rates. The variant of the ST algorithm used in [42] is misspecified because of the scaling factor and performs substantially worse for all learning rates. The full experiment specification is given in Appendix D.1.



Fig. 3: Comparison of the training performance and gradient estimation accuracy for a stochastic autoencoder with different numbers of latent Bernoulli units (bits). Training loss: each estimator is applied for 1000 epochs and then switched to arm-10 in order to correct the accumulated bias. Expected improvement: lower is better (measures the expected change of the loss); the dashed line shows the maximal possible improvement knowing the true gradient. Cosine similarity: higher is better; close to 1 means that the direction is accurate, while below 0 means the estimated gradient is not an ascent direction; error bars indicate empirical 70% confidence intervals using 100 trials.

Fig. 4: Stochastic Binary Network: the first and last layers have real-valued weights. BN layers have real-valued scale and bias parameters that can adjust the scaling of activations relative to the noise. Z are independent injected noises with a chosen distribution. Binary weights W_ij are random ±1 Bernoulli(θ_ij) with learnable probabilities θ_ij. In experiments we consider an SBN with the same convolutional architecture as [15, 27]: (2×128C3) - MP2 - (2×256C3) - MP2 - (2×512C3) - MP2 -


Funding

We gratefully acknowledge support by the Czech OP VVV project "Research Center for Informatics (CZ.02.1.01/0.0/0.0/16019/0000765)".

Algorithm 1 (fragment):
  /* a: preactivation */
  /* F: injected noise cdf */
  /* x ∈ {-1, 1}^n */
  1 Forward(a)
  2   p = F(a);

A Related Work

Hinton's vs Bengio's ST The name straight-through and the first experimental comparison were proposed by Bengio et al. [5]. Referring to Hinton's lecture, they describe the idea as "simply to back-propagate through the hard threshold function as if it had been the identity function". In the aforementioned lecture [25], however, we find a somewhat different description: "during the forward pass we stochastically pick a binary value using the output of the logistic, and then during the backward pass we pretend that we've transmitted the real valued probability from the logistic". We can make two observations: 1) different variants appeared early on, and 2) many subsequent works [e.g. 58] attribute these two variants in the exact opposite way, adding to the confusion.

ST Analysis Yin et al. [58] analyze deterministic ST variants. The theoretical analysis is applicable to a model with one hidden layer, a quadratic loss and input data following a Gaussian distribution. The input distribution assumption is arguably artificial; however, it allows analyzing the expected loss and its gradient. They show that population ST gradients using the ReLU and clipped ReLU proxies correlate positively with the true population gradient and allow for convergence, while identity ST does not. In Appendix B we show that in the SBN model, a simple correction of the quadratic loss function makes the base ST estimator unbiased, and all rescaled estimators, including identity, are ascent directions in expectation. Also note that the approach of analyzing deterministic ST methods by considering the expectation over the input has a principal limitation for extending to deep models: the expectation over the input of a deterministic network with two hidden binary layers is still non-smooth (non-differentiable) in the parameters of the second layer. Cheng et al. [13] show for networks with one hidden layer that STE is approximately related to the projected Wasserstein gradient flow method proposed there.

On the weights side of the problem, Ajanthan et al. [1] connected mirror descent updates for constrained optimization (e.g., w ∈ [0, 1]^m) with straight-through methods. The connection of deterministic straight-through for weights and proximal updates was also observed in [4]. Mirror descent has been applied to variational Bayesian learning of continuous weights, e.g., in Lin et al. [33], taking the form of an update in the natural parameters with the gradient in the mean parameters, the same as in our case.

Alternative Estimators For deep binary networks, several gradient estimation approaches are based on stochastic gradients of an analytically smoothed/approximated loss [43, 46]. There is however a discrepancy between the analytic approximation and the binary samples used at test time.

D.3 Deep Stochastic Binary Networks

The verification of the ST estimator in training deep neural networks with mirror descent is conducted on the CIFAR-10 dataset.

Model Our deep SBN model with L binary layers is defined as x^k = sign(a^k − z^k), where a^k are the pre-activations, i.e., linear mappings of the preceding layer states x^{k−1} with weights w^k. The injected noises ξ^k, z^k are independent for all units. The weights w^k in each inner layer are ±1 Bernoulli with probability F_ξ(η). The weights in the first and last layers are real-valued. The pre-activations a consist of a linear operation and batch normalization [28]: a^k = BN(Linear(x^{k−1}, w^k)), where Linear is a binary fully connected or convolutional transform and BN has real-valued affine terms (scale, bias) enabled. In several layers Max Pooling is also applied on top. The architecture specification and an illustration of the model are given in Fig. 4.

Initialization The role of the affine parameters (s, b) in BN is to reintroduce the scale and bias degrees of freedom removed by the normalization [28]. In our model these degrees of freedom are important as they control the strength of the pre-activation relative to the noise. With the sign activation, they could indeed be equivalently represented as learnable bias and variance parameters of the noise, since sign(x_i s_i + b_i − z_i) = sign(x_i − (z_i − b_i)/s_i), assuming s_i > 0. Without the BN layer, the result of Linear(x^{k−1}, w^k) is an integer in a range that depends on the size of x. If the noise variance is set to 1, this will lead to vanishing gradients in a large network. With BN and its affine transform, the right proportion can be learned, but it is important to initialize it so that the learning can make progress. We propose the following initialization. We set s_i = 1 and b_i = 0 (the defaults for BN).
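The forward pass of one binary layer, x^k = sign(a^k − z^k) with a^k given by the affine part of BN applied to the linear transform, can be sketched as follows; this is a minimal sketch that omits the BN normalization statistics and uses our own names (`sbn_layer_forward`, `noise_sample`):

```python
def sbn_layer_forward(x, w, scale, bias, noise_sample):
    """One stochastic binary layer: x_k = sign(a_k - z_k), where a_k is the
    (affine-only) BN output of Linear(x, w) and z_k is injected noise."""
    a = [sum(wi * xi for wi, xi in zip(row, x)) for row in w]   # Linear(x, w)
    a = [scale * ai + bias for ai in a]                          # affine terms of BN
    z = [noise_sample() for _ in range(len(w))]                  # injected noise z_k
    return [1.0 if ai - zi >= 0 else -1.0 for ai, zi in zip(a, z)]
```

With zero noise this reduces to the deterministic 'det' mode of the paper; sampling the noise gives one stochastic forward pass.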

