DROP-BOTTLENECK: LEARNING DISCRETE COMPRESSED REPRESENTATION FOR NOISE-ROBUST EXPLORATION

Abstract

We propose a novel information bottleneck (IB) method named Drop-Bottleneck, which discretely drops features that are irrelevant to the target variable. Drop-Bottleneck not only enjoys a simple and tractable compression objective but also provides a deterministic compressed representation of the input variable, which is useful for inference tasks that require a consistent representation. Moreover, it can jointly learn a feature extractor and select features by considering each feature dimension's relevance to the target task, which is unattainable for most neural network-based IB methods. Based on Drop-Bottleneck, we propose an exploration method for reinforcement learning tasks. On a range of noisy and reward-sparse maze-navigation tasks in VizDoom (Kempka et al., 2016) and DMLab (Beattie et al., 2016), our exploration method achieves state-of-the-art performance. As a new IB framework, we demonstrate that Drop-Bottleneck outperforms Variational Information Bottleneck (VIB) (Alemi et al., 2017) in multiple aspects, including adversarial robustness and dimensionality reduction.

1. INTRODUCTION

Data with noise or task-irrelevant information can easily harm the training of a model; the noisy-TV problem (Burda et al., 2019a) is a well-known example in reinforcement learning. If observations from the environment are modified to contain a TV screen that changes its channel randomly in response to the agent's actions, the performance of curiosity-based exploration methods degrades dramatically (Burda et al., 2019a;b; Kim et al., 2019; Savinov et al., 2019). The information bottleneck (IB) theory (Tishby et al., 2000; Tishby & Zaslavsky, 2015) provides a framework for dealing with such task-irrelevant information, and has been actively adopted for exploration in reinforcement learning (Kim et al., 2019; Igl et al., 2019). For an input variable X and a target variable Y, the IB theory introduces another variable Z, a compressed representation of X. The IB objective trains Z to contain as little information about X and as much information about Y as possible, where the two quantities are measured by the mutual information terms I(Z; X) and I(Z; Y), respectively. IB methods such as Variational Information Bottleneck (VIB) (Alemi et al., 2017; Chalk et al., 2016) and Information Dropout (Achille & Soatto, 2018) show that the compression of the input variable X can be performed by neural networks.

In this work, we propose a novel information bottleneck method named Drop-Bottleneck, which compresses the input variable by discretely dropping a subset of its input features that are irrelevant to the target variable. Drop-Bottleneck has the following appealing properties:

• The compression term of Drop-Bottleneck's objective is simple and can be optimized tractably.
• Drop-Bottleneck provides a deterministic compressed representation that still retains most of the learned indistinguishability, i.e. compression. This is useful for inference tasks that require the input representation to be consistent and stable.
• Drop-Bottleneck jointly trains a feature extractor and performs feature selection, as it learns the feature-wise drop probability according to each feature dimension's relevance to the target task. Hence, unlike the compression provided by most neural network-based IB methods, our deterministic representation reduces the feature dimensionality, making subsequent inference more efficient with less data.
• Compared to VIB, both Drop-Bottleneck's original (stochastic) and deterministic compressed representations greatly improve robustness to adversarial examples.

Based on the newly proposed Drop-Bottleneck, we design an exploration method that is robust to noisy observations in reinforcement learning environments with very sparse rewards. Our method maintains an episodic memory and generates intrinsic rewards based on the predictability of new observations from the compressed representations of those in the memory. As a result, it achieves state-of-the-art performance on multiple environments of VizDoom (Kempka et al., 2016) and DMLab (Beattie et al., 2016). We also show that combining our exploration method with VIB instead of Drop-Bottleneck degrades the performance by meaningful margins.

2. RELATED WORK

Igl et al. (2019) employ VIB to make features generalize better and to encourage the compression of states given as input to an actor-critic algorithm. Curiosity-Bottleneck (Kim et al., 2019) uses the VIB framework to train a compressor that produces a compressed state representation that remains informative about the value function, and uses the degree of compressiveness as an exploration signal. InfoBot (Goyal et al., 2019) proposes a conditional version of VIB to improve the transferability of a goal-conditioned policy by minimizing the policy's dependence on the goal. The variational bandwidth bottleneck (Goyal et al., 2020) decides dynamically whether privileged information is worth accessing at its information cost.
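Since this excerpt does not reproduce the paper's formal notation, the feature-wise dropping described in the bullets above can be caricatured with a minimal NumPy sketch. The function names, the 0.5 threshold, and the toy drop probabilities are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

def drop_bottleneck(features, drop_prob, rng):
    # Stochastic compressed representation: each feature dimension i is
    # independently zeroed out with its (learned) drop probability drop_prob[i].
    keep_mask = rng.random(features.shape) >= drop_prob
    return features * keep_mask

def deterministic_representation(features, drop_prob, threshold=0.5):
    # Deterministic variant for inference: keep only the dimensions whose learned
    # drop probability is below the threshold, which also reduces dimensionality.
    kept = drop_prob < threshold
    return features[..., kept], kept

rng = np.random.default_rng(0)
x = np.array([[1.0, 2.0, 3.0, 4.0, 5.0, 6.0]])      # one 6-dim feature vector
p = np.array([0.05, 0.10, 0.05, 0.95, 0.90, 0.99])  # hypothetical learned drop probs

z_stochastic = drop_bottleneck(x, p, rng)           # random mask, mostly keeps dims 0-2
z_det, kept = deterministic_representation(x, p)
print(kept)   # → [ True  True  True False False False]
print(z_det)  # → [[1. 2. 3.]]
```

In the actual method, the drop probabilities would be learned jointly with the feature extractor (which requires a differentiable relaxation of the Bernoulli mask during training); the sketch only shows the resulting stochastic and deterministic representations.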

We also empirically compare with VIB to show Drop-Bottleneck's superior robustness to adversarial examples and its ability to reduce the feature dimensionality for inference, on the ImageNet dataset (Russakovsky et al., 2015). We further demonstrate that Drop-Bottleneck's deterministic representation can reasonably replace its original representation in terms of the learned indistinguishability, on the Occluded CIFAR dataset (Achille & Soatto, 2018).

Achille & Soatto (2018) relate the IB principle to multiple practices in deep learning, including Dropout, disentanglement, and variational autoencoding. Moyer et al. (2018) obtain representations invariant to specific factors under the variational autoencoder (VAE) (Kingma & Welling, 2013) and VIB frameworks. Amjad & Geiger (2019) discuss the use of IB theory for classification tasks from a theoretical point of view. Dai et al. (2018) employ IB theory to compress neural networks by pruning neurons. Schulz et al. (2020) propose an attribution method that determines each input feature's importance by enforcing compression of the input variable via the IB framework.

Similar to our goal, some previous research has proposed variants of the original IB objective. The deterministic information bottleneck (DIB) (Strouse & Schwab, 2017) replaces the compression term with an entropy term and solves the new objective with a deterministic encoder. The nonlinear information bottleneck (NIB) (Kolchinsky et al., 2019) modifies the IB objective by squaring the compression term and uses a non-parametric upper bound on it. While DIB always takes the deterministic form, our method can flexibly use the stochastic form for training and the deterministic form at test time. Compared to NIB, which is more computationally demanding than VIB due to its non-parametric upper bound, our method is faster.
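The episodic-memory exploration scheme described in the introduction (intrinsic rewards reflecting how poorly a new observation is predicted from the compressed representations already in memory) can likewise be illustrated with a toy novelty bonus. The distance-based score below is a hypothetical stand-in for illustration, not the paper's actual reward:

```python
import numpy as np

def intrinsic_reward(z_new, memory, scale=1.0):
    # Hypothetical novelty bonus: large when the compressed observation z_new
    # is far from every compressed observation stored in the episodic memory.
    if not memory:
        return scale                             # first observation: maximally novel
    nearest = min(np.linalg.norm(z_new - z) for z in memory)
    return scale * nearest / (1.0 + nearest)     # squashed into [0, scale)

memory, rewards = [], []
for z in [np.array([0.0, 0.0]),    # novel: memory is empty
          np.array([0.0, 0.1]),    # barely novel: near the first observation
          np.array([5.0, 5.0])]:   # novel again: far from everything in memory
    rewards.append(intrinsic_reward(z, memory))
    memory.append(z)
print([round(r, 3) for r in rewards])  # → [1.0, 0.091, 0.875]
```

Drop-Bottleneck's role in such a scheme is to compress observations before they enter the memory, so that noisy, task-irrelevant dimensions (e.g., a noisy TV screen) cannot inflate the novelty signal.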

