TRAINING INDEPENDENT SUBNETWORKS FOR ROBUST PREDICTION

Abstract

Recent approaches to efficiently ensemble neural networks have shown that strong robustness and uncertainty performance can be achieved with a negligible gain in parameters over the original network. However, these methods still require multiple forward passes for prediction, leading to a significant computational cost. In this work, we show a surprising result: the benefits of using multiple predictions can be achieved 'for free' under a single model's forward pass. In particular, we show that, using a multi-input multi-output (MIMO) configuration, one can utilize a single model's capacity to train multiple subnetworks that independently learn the task at hand. By ensembling the predictions made by the subnetworks, we improve model robustness without increasing compute. We observe a significant improvement in negative log-likelihood, accuracy, and calibration error on CIFAR10, CIFAR100, ImageNet, and their out-of-distribution variants compared to previous methods.
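The multi-input multi-output configuration described above can be sketched in a few lines. The following is a minimal illustrative toy, not the paper's implementation: a single linear "network" in NumPy whose input is the concatenation of M examples and whose output is split into M sets of logits, one per subnetwork. All names (`M`, `W`, `mimo_forward`) are assumptions introduced here for illustration.

```python
# Minimal sketch of a multi-input multi-output (MIMO) forward pass.
# A toy linear map stands in for the network; the M subnetworks are
# implicit in the shared weight matrix W.
import numpy as np

rng = np.random.default_rng(0)

M = 3  # number of subnetworks
D = 8  # input feature dimension
K = 4  # number of classes

# One shared weight matrix maps the M concatenated inputs (M*D features)
# to M sets of logits (M*K outputs).
W = rng.normal(size=(M * D, M * K)) * 0.1

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def mimo_forward(inputs):
    """inputs: list of M arrays of shape (D,). Returns (M, K) probabilities."""
    x = np.concatenate(inputs)      # stack the M inputs into one vector
    logits = x @ W                  # a single forward pass
    return softmax(logits.reshape(M, K), axis=-1)

# During training, each of the M input slots would receive an independent
# example, so each head learns the task on its own. At test time the same
# example is fed to all M slots and the M predictions are averaged,
# giving an ensemble at the cost of one forward pass.
x_test = rng.normal(size=(D,))
probs = mimo_forward([x_test] * M)  # shape (M, K)
ensemble = probs.mean(axis=0)       # averaged ensemble prediction
```

The key design point is that the ensemble members share all parameters and compute; only the input and output layers are widened by a factor of M.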

1. INTRODUCTION

Uncertainty estimation and out-of-distribution robustness are critical problems in machine learning. In medical applications, a confident misprediction may be a misdiagnosis that is not referred to a physician, as in decision-making with a "human-in-the-loop." This can have disastrous consequences, and the problem is particularly challenging when patient data deviates significantly from the training set, such as in demographics, disease types, epidemics, and hospital locations (Dusenberry et al., 2020b; Filos et al., 2019). Using a distribution over neural networks is a popular solution stemming from the classic Bayesian and ensemble learning literature (Hansen & Salamon, 1990; Neal, 1996), and recent advances such as BatchEnsemble and extensions thereof achieve strong uncertainty and robustness performance (Wen et al., 2020; Dusenberry et al., 2020a; Wenzel et al., 2020). These methods demonstrate that significant gains can be achieved with negligible additional parameters compared to the original model. However, they still require multiple (typically 4-10) forward passes for prediction, leading to a significant runtime cost.

In this work, we show a surprising result: the benefits of using multiple predictions can be achieved "for free" under a single model's forward pass. The insight we build on comes from sparsity. Neural networks are heavily overparameterized models. The lottery ticket hypothesis (Frankle & Carbin, 2018) and other works on model pruning (Molchanov et al., 2016; Zhu & Gupta, 2017) show that one can prune away 70-80% of the connections in a

