UNSUPERVISED NON-PARAMETRIC SIGNAL SEPARATION USING BAYESIAN NEURAL NETWORKS

Abstract

Bayesian neural networks (BNN) combine the best of two worlds: flexible and scalable neural networks, and probabilistic graphical models, the latter allowing for a probabilistic interpretation of inference results. We take one further step towards the unification of these two domains and render the BNN an elementary unit of abstraction in the framework of probabilistic modeling, which allows us to promote well-known distributions to distribution fields. We use transformations to obtain field versions of several popular distributions and demonstrate the utility of our approach on the problem of signal/background separation. Starting from the prior knowledge that a certain region of space contains predominantly one of the components, we recover, in an unsupervised and non-parametric manner, the representations of both previously unseen components as well as their proportions.

1. INTRODUCTION

Neural networks as predictive models have been wildly successful across a variety of domains, be it image recognition or language modeling. And while they may be used to make predictions on previously unseen samples, one of the fundamental weaknesses of traditional neural networks is their inability to quantify prediction uncertainty. Evaluation of prediction uncertainty is important in basic research (identification of fundamental laws), reinforcement learning (identification of value functions), anomaly detection, etc. Uncertainty quantification in neural networks has been addressed both from the frequentist (see, for instance, Pearce et al. (2018)) and the Bayesian (Kendall and Gal (2017)) sides. In the Bayesian setting it was naturally proposed to promote the weights of neural layers to normally distributed random variables (MacKay (1992)). Later it was shown that the learnt uncertainty in the weights improves generalization in non-linear regression problems, and that it can be applied to drive the exploration-exploitation trade-off in reinforcement learning (Blundell et al. (2015)). Depeweg et al. (2017) designed a method for separating uncertainty into epistemic and aleatoric components. Epistemic uncertainty expresses uncertainty inherent to the model and can be reduced with additional observations, whereas aleatoric uncertainty captures the amount of noise in the observations and cannot be reduced. In physics the former and the latter are referred to as the systematic and statistical uncertainties, respectively. By treating both types of uncertainty within the same framework, the authors essentially bridged the gap towards graphical models. Graphical models, unlike traditional neural networks, are probabilistic in nature and allow for the incorporation of prior beliefs about models. They are flexible in representing various processes and allow for the introduction of latent degrees of freedom.
Initially, graphical models used various point distributions as building blocks, while mostly the normal distribution has been promoted to a random field, in the notable example of Gaussian random fields. In this work we propose using Bayesian Neural Networks (BNN) as building blocks in graphical models and demonstrate the power of the synthesis of Probabilistic Graphical Models (PGM) and BNNs on a synthetic example of signal/background separation. As a demonstration of our approach we propose an additive mixture model: a superposition of signal and background spectra whose proportion varies in space. During inference we are able to learn the proportions of signal and background as well as their spectral shapes, which match the ground truth values to adequate precision.

The paper is organized as follows. In Section 2 we recapitulate the feed-forward (vanilla) BNN and the variational inference approach. In Section 3 we present the transformations of a vanilla BNN that allow us to emulate various distribution fields. To illustrate the power of composition of transformed BNNs, in Section 4 we introduce a model of an additive BNN mixture. In Section 5 we describe our experiments, and in Section 6 we discuss the results and evaluate model performance. In Section 7 we conclude with a discussion of limitations as well as prospective domains of application of our framework.

2. BACKGROUND: VANILLA BAYESIAN NEURAL NETWORK

We consider feed-forward deep neural architectures that are composed of dense layers. A dense layer k is an affine transformation L_k with weight W_k and bias B_k, followed by an element-wise non-linear transformation σ, also known as the activation function: h_k = L_k(h_{k-1}) = σ(h_{k-1} W_k + B_k). In our experiments we set σ to be ReLU, defined as ReLU(x) = max(0, x). In what follows we work with a simple linear deep architecture, defined as a consecutive application (composition) of dense layers: y = (L_K ∘ ⋯ ∘ L_1)(x).

In order to enable a probabilistic interpretation of inference using neural networks, the weights and the biases of each layer are promoted to random variables and are sampled from a Normal distribution with corresponding parameters: W_k ∼ N(µ_W, Σ_W), B_k ∼ N(µ_B, Σ_B). In Fig. 1, right panel, we depict, in plate notation, an elementary Bayesian Neural Network composed of k layers that takes as input x, consisting of N samples, and renders y as output. We consider a simple BNN in the spirit of Blundell et al. (2015), where the authors use stochastic variational inference (SVI) (Hoffman et al. (2013); Wingate and Weber (2013)) to learn Gaussian posterior distributions from prior distributions of weights, biases and observations. Under these conditions it is natural to use the Evidence Lower Bound (ELBO) (Mehta et al. (2019)) as the loss function. The ELBO consists of two terms: the log evidence of the observable variable x with learnable parameters θ, log p_θ(x), and the Kullback-Leibler (KL) divergence between the approximation of the posterior distribution q_φ(z), parametrized by φ, and the true posterior p_θ(z|x):

ELBO = log p_θ(x) - KL(q_φ(z) || p_θ(z|x)).  (1)

Taking steps in φ that increase the ELBO decreases the divergence between the approximate posterior and the true posterior. We further illustrate this in Fig. 1, left.
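A single stochastic forward pass through such a layer can be sketched as follows; this is a minimal illustration, not the implementation used in the paper, and all shapes, values and the random seed are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)

def bayesian_dense(h, mu_W, sigma_W, mu_B, sigma_B, rng):
    """One stochastic forward pass through a Bayesian dense layer:
    weights and biases are drawn from factorized Gaussians with
    parameters (mu, sigma), then the usual affine map and the ReLU
    activation are applied."""
    W = mu_W + sigma_W * rng.standard_normal(mu_W.shape)
    B = mu_B + sigma_B * rng.standard_normal(mu_B.shape)
    return relu(h @ W + B)

# hypothetical shapes: a batch of 4 inputs with 3 features, 2 output units
h = rng.standard_normal((4, 3))
mu_W, sigma_W = np.zeros((3, 2)), 0.1 * np.ones((3, 2))
mu_B, sigma_B = np.zeros(2), 0.1 * np.ones(2)
out = bayesian_dense(h, mu_W, sigma_W, mu_B, sigma_B, rng)
```

Repeated calls with the same input produce different outputs, since each pass resamples the weights; averaging over many passes yields the predictive mean and uncertainty.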
Inference results depend on the choice of the optimizer, the learning rate and the number of iterations.
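In practice the KL term of Eq. (1), when taken between factorized Gaussian distributions over the weights as in the SVI setting above, is available in closed form. A minimal sketch of this term (the function name and test values are illustrative):

```python
import numpy as np

def kl_diag_gaussians(mu_q, sigma_q, mu_p, sigma_p):
    """Closed-form KL(q || p) between two factorized (diagonal)
    Gaussian distributions, summed over all dimensions."""
    return np.sum(
        np.log(sigma_p / sigma_q)
        + (sigma_q**2 + (mu_q - mu_p)**2) / (2.0 * sigma_p**2)
        - 0.5
    )

# identical distributions have zero divergence
mu, sigma = np.zeros(5), np.ones(5)
zero_kl = kl_diag_gaussians(mu, sigma, mu, sigma)

# distinct distributions have strictly positive divergence
pos_kl = kl_diag_gaussians(np.ones(3), 0.5 * np.ones(3),
                           np.zeros(3), np.ones(3))
```

Gradient steps on the variational parameters (mu_q, sigma_q) trade this divergence from the prior against the expected log-likelihood of the data.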

3. TRANSFORMED BNNS

Non-trivial examples of probabilistic models combine distributions of various types. Consider a K-component Gaussian Mixture model: each component of the mixture is normally distributed, where the mean is parameterized by real-valued parameters and the scale by a positive parameter, while the overall proportions are sampled from a Dirichlet distribution X ∼ Dir(α), which, in turn, is parameterized by a positive vector α_1, . . . , α_K > 0, and X belongs to the (K-1)-simplex: Σ_k X_k = 1. We are therefore motivated to introduce a family of transformed BNNs with various ranges. In this manuscript we consider an exponential transformation (which maps an unconstrained vector of K dimensions to a positive vector of K dimensions) and a stick-breaking transformation (which maps an unconstrained vector of K-1 dimensions into a simplex vector of K dimensions). We propose to apply transformations after the last layer of the BNN, so that the range of the output is constrained to be strictly positive for the exponential transform, and a K-dimensional vector summing to unity (a point on the (K-1)-simplex) for the stick-breaking transformation. In what follows we denote a vector y sampled from a Bayesian neural network as y ∼ BNN(x, (W, B)), and we denote BNN outputs transformed by the exponential and stick-breaking transforms BNN_e and BNN_s, respectively. Another type of BNN transformation we consider is prompted by probability distributions: in certain applications it is particularly useful to work not just with positive random fields but with normalized positive random fields. Practically, such a transformation consists of an approximate normalization of the BNN output given the data.
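The two transformations can be sketched as follows; this is an illustrative implementation of a standard sigmoid-based stick-breaking construction, assumed here for concreteness (the paper does not specify the exact parameterization), with hypothetical input values:

```python
import numpy as np

def exp_transform(y):
    """Map an unconstrained K-vector to a strictly positive K-vector."""
    return np.exp(y)

def stick_breaking(y):
    """Map an unconstrained (K-1)-vector to a point on the (K-1)-simplex,
    i.e. a K-vector of positive entries summing to one. Each sigmoid(y_k)
    gives the fraction broken off the remaining 'stick'."""
    fracs = 1.0 / (1.0 + np.exp(-y))   # sigmoid: unconstrained -> (0, 1)
    pieces = []
    remaining = 1.0
    for f in fracs:
        pieces.append(remaining * f)
        remaining *= (1.0 - f)
    pieces.append(remaining)           # the last piece closes the simplex
    return np.array(pieces)

pos = exp_transform(np.array([-1.0, 0.0, 2.0]))   # strictly positive
p = stick_breaking(np.array([0.3, -1.2, 0.7]))    # 4-vector on the simplex
```

Applying either function to the last-layer output of a BNN yields the positive field BNN_e or the simplex-valued field BNN_s, respectively.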

