DISTRIBUTION-BASED INVARIANT DEEP NETWORKS FOR LEARNING META-FEATURES

Abstract

Recent advances in deep learning from probability distributions successfully achieve classification or regression from distribution samples, and are thus invariant under permutations of these samples. The first contribution of this paper is to extend such neural architectures to also achieve invariance under permutations of the features. The proposed architecture, called DIDA, inherits the universal approximation property of neural networks, and its robustness with respect to Lipschitz-bounded transformations of the input distribution is established. The second contribution is to empirically and comparatively demonstrate the merits of the approach on two tasks defined at the dataset level. On both tasks, DIDA learns meta-features supporting the characterization of a (labelled) dataset. The first task consists of predicting whether two dataset patches are extracted from the same initial dataset. The second task consists of predicting whether the learning performance achieved by one hyper-parameter configuration of a fixed algorithm (k-NN, SVM, logistic regression, or linear SGD) dominates that of another configuration, for datasets extracted from the OpenML benchmarking suite. On both tasks, DIDA outperforms the state of the art, namely the DSS and DATASET2VEC architectures, as well as models based on the hand-crafted meta-features of the literature.

1. INTRODUCTION

Deep network architectures, initially devised for structured data such as images (Krizhevsky et al., 2012) and speech (Hinton et al., 2012), have been extended to enforce invariance or equivariance properties (Shawe-Taylor, 1993) for more complex data representations. Typically, the network output is required to be invariant with respect to permutations of the input points when dealing with point clouds (Qi et al., 2017), graphs (Henaff et al., 2015) or probability distributions (De Bie et al., 2019). The merit of invariant or equivariant neural architectures is twofold. On the one hand, they inherit the universal approximation properties of neural nets (Cybenko, 1989; Leshno et al., 1993). On the other hand, the fact that these architectures comply with the requirements attached to the data representation yields more robust and more general models, by constraining the neural weights and/or reducing their number.

Related works. Invariance or equivariance properties are relevant to a wide range of applications. In the sequence-to-sequence framework, one might want to relax the sequence order (Vinyals et al., 2016). When modelling dynamic cell processes, one might want to follow the cell evolution at a macroscopic level, in terms of distributions, as opposed to a set of individual cell trajectories (Hashimoto et al., 2016). In computer vision, one might want to handle a set of pixels, as opposed to a voxelized representation, for the sake of better scalability in terms of data dimensionality and computational resources (De Bie et al., 2019). Neural architectures enforcing invariance or equivariance properties were pioneered by (Qi et al., 2017; Zaheer et al., 2017) for learning from point clouds subject to permutation invariance or equivariance, and have since been extended to permutation equivariance across sets (Hartford et al., 2018).
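The sample-permutation invariance discussed above can be illustrated with a minimal DeepSets-style sketch: a function of the form f(z) = rho(sum_i phi(x_i)) is invariant to any reordering of the samples because the sum is a symmetric aggregation. Here phi and rho are illustrative fixed functions standing in for the learned sub-networks; none of these names come from the paper.

```python
import random

def phi(x):
    # per-sample "feature map": a stand-in for a learned sub-network
    return [x[0] + x[1], x[0] * x[1]]

def rho(v):
    # readout applied to the pooled representation (also learned in practice)
    return sum(v)

def deepsets(points):
    feats = [phi(p) for p in points]
    # symmetric (sum) pooling over the sample axis enforces invariance
    pooled = [sum(f[k] for f in feats) for k in range(2)]
    return rho(pooled)

points = [(1.0, 2.0), (3.0, 4.0), (5.0, 6.0)]
shuffled = points[:]
random.shuffle(shuffled)
assert deepsets(points) == deepsets(shuffled)  # sample order is irrelevant
```

Replacing the sum with any other symmetric pooling (mean, max) preserves the invariance; the architectures cited above differ mainly in how phi, rho and the pooling are parameterized.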
Characterizations of invariance or equivariance under group actions have been proposed in the finite (Gens & Domingos, 2014; Cohen & Welling, 2016; Ravanbakhsh et al., 2017) or infinite case (Wood & Shawe-Taylor, 1996; Kondor & Trivedi, 2018). On the theoretical side, (Maron et al., 2019a; Keriven & Peyré, 2019) have proposed a general characterization of linear layers enforcing invariance or equivariance properties with respect to the whole permutation group on the feature set. The universal approximation properties of such architectures have been established in the case of sets (Zaheer et al., 2017).

Motivations. A main motivation for DIDA is the ability to characterize datasets through learned meta-features. Meta-features, aimed at representing a dataset as a vector of characteristics, have appeared in the ML literature for over 40 years, in relation with several key ML challenges: (i) learning a performance model, predicting a priori the performance of an algorithm (and of its hyper-parameters) on a dataset (Rice, 1976; Wolpert, 1996; Hutter et al., 2018); (ii) learning a generic model capable of quick adaptation to new tasks, e.g. in one-shot or few-shot mode, through the so-called meta-learning approach (Finn et al., 2018; Yoon et al., 2018); (iii) hyper-parameter transfer learning (Perrone et al., 2018), aimed at transferring the performance model learned on one task to another task. A large number of meta-features have been manually designed over the years (Muñoz et al., 2018), ranging from sufficient statistics to the so-called landmarks (Pfahringer et al., 2000), which compute the performance of fast ML algorithms on the considered dataset. Meta-features, expected to describe the joint distribution underlying the dataset, should also be inexpensive to compute. To the best of our knowledge, the learning of meta-features was first tackled by (Jomaa et al., 2019), defining the DATASET2VEC representation.
Specifically, DATASET2VEC is provided with two dataset patches (two subsets of examples, described by two (different) subsets of features), and is trained to predict whether those patches are extracted from the same initial dataset.

Contributions. The proposed DIDA approach extends the state of the art (Maron et al., 2020; Jomaa et al., 2019) in two ways. Firstly, it is designed to handle discrete or continuous probability distributions, as opposed to point sets (Section 2). As said, this extension makes it possible to leverage the more general topology of the Wasserstein distance, as opposed to that of the Hausdorff distance (Section 3). This framework is used to derive theoretical guarantees of stability under bounded distribution transformations, as well as universal approximation results, extending (Maron et al., 2020) to the continuous setting. Secondly, the empirical validation of the approach on two tasks defined at the dataset level demonstrates its merits compared to the state of the art (Maron et al., 2020; Jomaa et al., 2019; Muñoz et al., 2018) (Section 4).

Notations. ⟦1; m⟧ denotes the set of integers {1, . . . , m}. Distributions, including discrete distributions (datasets), are noted in bold font. Vectors are noted in italics, with x[k] denoting the k-th coordinate of vector x.
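The patch-extraction protocol described above can be sketched as follows: a patch keeps a random subset of the examples and a random subset of the features of a dataset. The function name and the patch sizes below are illustrative, not taken from the paper.

```python
import random

def sample_patch(dataset, n_samples, n_features, rng):
    """Extract a random patch: a subset of rows and a subset of columns.

    dataset is a list of rows, each row a list of feature values.
    """
    rows = rng.sample(range(len(dataset)), n_samples)
    cols = rng.sample(range(len(dataset[0])), n_features)
    return [[dataset[i][j] for j in cols] for i in rows]

rng = random.Random(0)
dataset = [[float(i * 10 + j) for j in range(6)] for i in range(20)]
patch_a = sample_patch(dataset, 5, 3, rng)
patch_b = sample_patch(dataset, 5, 3, rng)
# A positive training pair (label 1): both patches come from the same
# initial dataset; patches drawn from two different datasets get label 0.
assert len(patch_a) == 5 and all(len(row) == 3 for row in patch_a)
```

Since two patches from the same dataset generally involve different rows and different columns, a model trained on this task is pushed to be invariant under both sample and feature permutations.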

2. DISTRIBUTION-BASED INVARIANT NETWORKS FOR META-FEATURE LEARNING

This section describes the core of the proposed distribution-based invariant neural architectures, specifically the mechanism mapping a point distribution onto another one, invariant under both sample and feature permutations, referred to as an invariant layer. For the sake of readability, this section focuses on the case of discrete distributions, referring the reader to Appendix A for the general case of continuous distributions.

2.1. INVARIANT FUNCTIONS OF DISCRETE DISTRIBUTIONS

Let z = {(x_i, y_i) ∈ R^d, i ∈ ⟦1; n⟧} denote a dataset including n labelled samples, with x_i ∈ R^{d_X} an instance and y_i ∈ R^{d_Y} the associated multi-label, where d_X and d_Y respectively denote the dimensions of the instance and label spaces (d = d_X + d_Y).
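To make the double invariance requirement concrete, here is a deliberately simple summary of such a dataset z that is unchanged under any reordering of the samples and any reordering of the features. This is only an illustration of the invariance property, far simpler than DIDA's learned layers; the function name is ours.

```python
def doubly_invariant_summary(z):
    """z: list of samples, each a list of feature values.

    Pooling symmetrically over BOTH the sample axis and the feature axis
    yields a representation invariant under both kinds of permutation.
    """
    flat = [v for sample in z for v in sample]
    n = len(flat)
    mean = sum(flat) / n
    var = sum((v - mean) ** 2 for v in flat) / n
    return (mean, var)  # unchanged by any reordering of rows or columns

z = [[1.0, 2.0], [3.0, 4.0]]
z_perm = [[4.0, 3.0], [2.0, 1.0]]  # samples swapped AND features swapped
assert doubly_invariant_summary(z) == doubly_invariant_summary(z_perm)
```

Such fully symmetric pooling discards most of the joint structure of the data; the point of the invariant layers introduced next is to retain expressiveness while keeping exactly this invariance.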



The universal approximation properties of such architectures have also been established for point clouds (Qi et al., 2017), equivariant point clouds (Segol & Lipman, 2019), discrete measures (De Bie et al., 2019), and invariant (Maron et al., 2019b) and equivariant (Keriven & Peyré, 2019) graph neural networks. The approach most related to our work is that of (Maron et al., 2020), handling point clouds and presenting a neural architecture invariant w.r.t. the ordering of both the points and their features. In this paper, the proposed distribution-based invariant deep architecture (DIDA) extends (Maron et al., 2020) as it handles (discrete or continuous) probability distributions instead of point clouds. This makes it possible to leverage the topology of the Wasserstein distance and to provide more general approximation results, covering (Maron et al., 2020) as a special case.
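As a side illustration of the Wasserstein topology invoked above (not an excerpt from the paper): a dataset can be viewed as an empirical distribution, and in the one-dimensional case the Wasserstein-1 distance between two empirical distributions with equal sample counts reduces to comparing sorted samples, so it is by construction insensitive to the order of the samples.

```python
def wasserstein_1d(a, b):
    """Wasserstein-1 distance between two 1-D empirical distributions
    with the same number of samples: mean absolute difference of the
    sorted samples (the optimal transport plan is the monotone matching).
    """
    assert len(a) == len(b)
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)

a = [0.0, 1.0, 2.0]
b = [2.0, 0.5, 1.5]
# permutation invariance: the distance ignores sample ordering
assert wasserstein_1d(a, b) == wasserstein_1d(list(reversed(a)), b)
```

In higher dimension the distance no longer has this closed form, but the permutation invariance of the underlying distributions is exactly what DIDA's layers are designed to respect.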

