UNIFORM PRIORS FOR DATA-EFFICIENT TRANSFER

Abstract

Deep Neural Networks have shown great promise on a variety of downstream applications, but their ability to adapt and generalize to new data and tasks remains a challenge. At the same time, the ability to perform few- or zero-shot adaptation to novel tasks is important for the scalability and deployment of machine learning models. It is therefore crucial to understand what makes for good, transferable features in deep networks that best allow for such adaptation. In this paper, we shed light on this by showing that the most transferable features have high uniformity in the embedding space, and we propose a uniformity regularization scheme that encourages better transfer and feature reuse. We evaluate this regularization on its ability to facilitate adaptation to unseen tasks and data, conducting a thorough experimental study covering four relevant and distinct domains: few-shot Meta-Learning, Deep Metric Learning, Zero-Shot Domain Adaptation, and Out-of-Distribution classification. Across all experiments, we show that uniformity regularization consistently offers benefits over baseline methods and is able to achieve state-of-the-art performance in Deep Metric Learning and Meta-Learning.

1. INTRODUCTION

Deep Neural Networks have enabled great success in various machine learning domains such as computer vision (Girshick, 2015; He et al., 2016; Long et al., 2015), natural language processing (Vaswani et al., 2017; Devlin et al., 2018; Brown et al., 2020), decision making (Schulman et al., 2015; 2017; Fujimoto et al., 2018) and medical applications (Ronneberger et al., 2015; Hesamian et al., 2019). This can be largely attributed to the ability of networks to extract abstract features from data, which, given sufficient data, can effectively generalize to held-out test sets. However, the degree of generalization scales with the semantic difference between test and training tasks, caused e.g. by domain or distributional shifts between training and test data. Understanding how to achieve generalization under such shifts is an active area of research in fields like Meta-Learning (Snell et al., 2017; Finn et al., 2017; Chen et al., 2020), Deep Metric Learning (DML) (Roth et al., 2020b; Hadsell et al., 2006), Zero-Shot Domain Adaptation (ZSDA) (Tzeng et al., 2017; Kodirov et al., 2015) and low-level vision tasks (Tang et al., 2020). In the few-shot Meta-Learning setting, a meta-learner is tasked to quickly adapt to novel test data given its training experience and a limited labeled data budget; similarly, fields like DML and ZSDA study generalization at the limit of such adaptation, where predictions on novel test data are made without any test-time finetuning. Yet, despite the motivational differences, each of these fields requires representations to be learned from the training data that allow for better generalization and adaptation to novel tasks and data. Although there exists a large corpus of domain-specific training methods, in this paper we seek to investigate what fundamental properties learned features and feature spaces should have to facilitate such generalization.
Fortunately, recent literature provides pointers towards one such property: the notion of "feature uniformity" for improved generalization. For Unsupervised Representation Learning, Wang & Isola (2020) highlight a link between the uniform distribution of hyperspherical feature representations and the transfer performance in downstream tasks, which has been implicitly adapted in the design of modern contrastive learning methods (Bachman et al., 2019; Tian et al., 2020a; b). Similarly, Roth et al. (2020b) show that for Deep Metric Learning, uniformity in hyperspherical embedding space coverage as well as a uniform singular value distribution of the embedding space are strongly connected to zero-shot generalization performance. Both Wang & Isola (2020) and Roth et al. (2020b) link uniformity in the feature representation space to the preservation of maximal information and reduced overfitting. This suggests that actively imposing a uniformity prior on learned feature representations should encourage better transfer properties by retaining more information and reducing bias towards training tasks, which in turn facilitates better adaptation to novel tasks. However, while both Wang & Isola (2020) and Roth et al. (2020b) propose methods to incorporate this notion of uniformity, they are defined only for hyperspherical embedding spaces or contrastive learning approaches¹, thus severely limiting their applicability to other domains. To address these limitations and leverage the benefits of uniformity for any type of novel task and data adaptation for deep neural networks, we propose uniformity regularization, which places a uniform hypercube prior on the learned feature space during training, without being limited to contrastive training approaches or a hyperspherical representation space. Unlike e.g.
a multivariate Gaussian, the uniform prior puts equal likelihood over the feature space, which enables the network to make fewer assumptions about the data, limiting model overfitting to the training task. This incentivizes the model to learn more task-agnostic and reusable features, which in turn improve generalization (Raghu et al., 2019). Our uniformity regularization follows an adversarial learning framework, since the uniform distribution does not admit a closed-form divergence minimization scheme. Using this setup, we experimentally demonstrate that uniformity regularization aids generalization in zero-shot setups such as Deep Metric Learning, Domain Adaptation and Out-of-Distribution Detection, as well as in few-shot Meta-Learning. Furthermore, for Deep Metric Learning and few-shot Meta-Learning, we are even able to set a new state-of-the-art over benchmark datasets. Overall, our contributions can be summarized as:

• We propose to perform uniformity regularization in the embedding spaces of a deep neural network, using a GAN-like alternating optimization scheme, to increase the transferability of learned features and the ability to better adapt to novel tasks and data.

• Using our proposed regularization, we achieve strong improvements over baseline methods in Deep Metric Learning, Zero-Shot Domain Adaptation, Out-of-Distribution Detection and Meta-Learning. Furthermore, uniformity regularization allows us to set a new state-of-the-art in Meta-Learning on the Meta-Dataset (Triantafillou et al., 2019) as well as in Deep Metric Learning over two benchmark datasets (Welinder et al., 2010; Krause et al., 2013).
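The alternating scheme above can be sketched in a minimal NumPy illustration; this is not the paper's implementation, and the linear discriminator, the prior bounds U[-1, 1], and the 8-dimensional embeddings are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical linear discriminator D(e) = sigmoid(e @ w + b), trained to
# output 1 for samples from the uniform prior and 0 for network embeddings.
w = rng.normal(scale=0.1, size=(8,))
b = 0.0

def discriminator(e):
    return sigmoid(e @ w + b)

def discriminator_objective(prior_samples, embeddings, eps=1e-7):
    # Discriminator side of the alternating game: binary cross-entropy,
    # labeling prior samples as 1 and network embeddings as 0.
    real = -np.mean(np.log(discriminator(prior_samples) + eps))
    fake = -np.mean(np.log(1.0 - discriminator(embeddings) + eps))
    return real + fake

def uniformity_regularizer(embeddings, eps=1e-7):
    # Encoder side: embeddings should fool the discriminator, i.e. be
    # scored as if drawn from the uniform prior (D(e) -> 1). Added to the
    # task loss, this pushes the embedding distribution toward uniformity.
    return -np.mean(np.log(discriminator(embeddings) + eps))

# Samples from a uniform hypercube prior U[-1, 1]^8 versus stand-in embeddings.
prior = rng.uniform(-1.0, 1.0, size=(32, 8))
emb = rng.normal(0.0, 0.5, size=(32, 8))
d_loss = discriminator_objective(prior, emb)
reg = uniformity_regularizer(emb)
```

In a full training loop, the discriminator step and the encoder step (task loss plus the regularizer, weighted by some coefficient) would alternate, mirroring standard GAN training.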

2. BACKGROUND

2.1. GENERATIVE ADVERSARIAL NETWORKS (GANS)

Generative Adversarial Networks (GANs, Goodfellow et al. (2014)) were proposed as a generative model that utilizes an alternating optimization scheme to solve a minimax two-player game between a generator, G, and a discriminator, D. The generator G(z) is trained to map samples from a prior z ∼ p(z) to the target space, while the discriminator is trained to be an arbiter between the target data distribution p(x) and the generator distribution. The generator is trained to fool the discriminator into predicting that samples G(z) actually stem from the target distribution. While many different GAN objectives have been proposed, the standard discriminator and generator objectives can be written as

L_D = \max_D \mathbb{E}_{z \sim p(z)}[\log(1 - D(G(z)))] + \mathbb{E}_{x \sim p(x)}[\log D(x)]

L_G = \min_G \mathbb{E}_{z \sim p(z)}[\log(1 - D(G(z)))]

with p(z) the generator prior and p(x) a defined target distribution (e.g. natural images). In practice, the generator is often trained with the "Non-Saturating Cost" \min_G -\mathbb{E}_{z \sim p(z)}[\log D(G(z))], which provides stronger gradients early in training.
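As a concrete illustration, the objectives above can be expressed as losses to minimize (their negated expectations); this is a small NumPy sketch operating on raw discriminator logits, not tied to any particular GAN implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def discriminator_loss(d_real_logits, d_fake_logits, eps=1e-7):
    # Negation of L_D: maximize E_x[log D(x)] + E_z[log(1 - D(G(z)))]
    # by minimizing the corresponding binary cross-entropy.
    real_term = -np.mean(np.log(sigmoid(d_real_logits) + eps))
    fake_term = -np.mean(np.log(1.0 - sigmoid(d_fake_logits) + eps))
    return real_term + fake_term

def generator_loss_minimax(d_fake_logits, eps=1e-7):
    # Minimax generator cost L_G: minimize E_z[log(1 - D(G(z)))].
    return np.mean(np.log(1.0 - sigmoid(d_fake_logits) + eps))

def generator_loss_nonsaturating(d_fake_logits, eps=1e-7):
    # Non-saturating alternative: minimize -E_z[log D(G(z))], which avoids
    # vanishing gradients when the discriminator easily rejects G(z).
    return -np.mean(np.log(sigmoid(d_fake_logits) + eps))
```

The two generator variants share the same fixed points but differ in gradient behavior, which is why the non-saturating form is the common practical choice.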

2.2. FAST ADAPTATION AND GENERALIZATION

Throughout this work, we use the notion of "fast adaptation" to novel tasks to measure the transferability of learned features, and as such the generalization and adaptation capacities of a model. Fast adaptation has recently been popularized by different meta-learning strategies (Finn et al., 2017; Snell et al., 2017) . These methods assume distinct meta-training and meta-testing task distributions,



¹ By imposing a Gaussian potential over hyperspherical embedding distances or pairwise sample relations.

