DISTRIBUTION-BASED INVARIANT DEEP NETWORKS FOR LEARNING META-FEATURES

Abstract

Recent advances in deep learning from probability distributions successfully achieve classification or regression from distribution samples, and are thus invariant under permutation of the samples. The first contribution of the paper is to extend these neural architectures to achieve invariance under permutation of the features as well. The proposed architecture, called DIDA, inherits the universal approximation property of neural networks, and its robustness with respect to Lipschitz-bounded transformations of the input distribution is established. The second contribution is to empirically and comparatively demonstrate the merits of the approach on two tasks defined at the dataset level. On both tasks, DIDA learns meta-features supporting the characterization of a (labelled) dataset. The first task consists of predicting whether two dataset patches are extracted from the same initial dataset. The second task consists of predicting whether the learning performance achieved by a hyper-parameter configuration under a fixed algorithm (ranging over k-NN, SVM, logistic regression and linear SGD) dominates that of another configuration, for datasets extracted from the OpenML benchmarking suite. On both tasks, DIDA outperforms the state of the art: the DSS and DATASET2VEC architectures, as well as models based on the hand-crafted meta-features of the literature.

1. INTRODUCTION

Deep network architectures, initially devised for structured data such as images (Krizhevsky et al., 2012) and speech (Hinton et al., 2012), have been extended to enforce invariance or equivariance properties (Shawe-Taylor, 1993) for more complex data representations. Typically, the network output is required to be invariant with respect to permutations of the input points when dealing with point clouds (Qi et al., 2017), graphs (Henaff et al., 2015) or probability distributions (De Bie et al., 2019). The merit of invariant or equivariant neural architectures is twofold. On the one hand, they inherit the universal approximation properties of neural nets (Cybenko, 1989; Leshno et al., 1993). On the other hand, the fact that these architectures comply with the requirements attached to the data representation yields more robust and more general models, through constraining the neural weights and/or reducing their number.

Related works. Invariance or equivariance properties are relevant to a wide range of applications. In the sequence-to-sequence framework, one might want to relax the sequence order (Vinyals et al., 2016). When modelling dynamic cell processes, one might want to follow the cell evolution at a macroscopic level, in terms of distributions as opposed to a set of individual cell trajectories (Hashimoto et al., 2016). In computer vision, one might want to handle a set of pixels, as opposed to a voxelized representation, for the sake of better scalability in terms of data dimensionality and computational resources (De Bie et al., 2019). Neural architectures enforcing invariance or equivariance properties were pioneered by (Qi et al., 2017; Zaheer et al., 2017) for learning from point clouds subject to permutation invariance or equivariance, and have been extended to permutation equivariance across sets (Hartford et al., 2018).
Characterizations of invariance or equivariance under group actions have been proposed in the finite (Gens & Domingos, 2014; Cohen & Welling, 2016; Ravanbakhsh et al., 2017) or infinite case (Wood & Shawe-Taylor, 1996; Kondor & Trivedi, 2018). On the theoretical side, (Maron et al., 2019a; Keriven & Peyré, 2019) have proposed a general characterization of linear layers enforcing invariance or equivariance properties with respect to the whole permutation group on the feature set. The universal approximation properties of such architectures have been established in the case of sets (Zaheer et al., 2017), point clouds (Qi et al., 2017), equivariant point clouds (Segol & Lipman, 2019), discrete measures (De Bie et al., 2019), and invariant (Maron et al., 2019b) and equivariant (Keriven & Peyré, 2019) graph neural networks. The approach most related to our work is that of (Maron et al., 2020), handling point clouds and presenting a neural architecture invariant w.r.t. the ordering of both points and their features. In this paper, the proposed distribution-based invariant deep architecture (DIDA) extends (Maron et al., 2020) as it handles (discrete or continuous) probability distributions instead of point clouds. This makes it possible to leverage the topology of the Wasserstein distance and to provide more general approximation results, covering (Maron et al., 2020) as a special case.

Motivations.

A main motivation for DIDA is the ability to characterize datasets through learned meta-features. Meta-features, aimed at representing a dataset as a vector of characteristics, have been mentioned in the ML literature for over 40 years, in relation to several key ML challenges: (i) learning a performance model, predicting a priori the performance of an algorithm (and its hyper-parameters) on a dataset (Rice, 1976; Wolpert, 1996; Hutter et al., 2018); (ii) learning a generic model capable of quick adaptation to new tasks, e.g. one-shot or few-shot, through the so-called meta-learning approach (Finn et al., 2018; Yoon et al., 2018); (iii) hyper-parameter transfer learning (Perrone et al., 2018), aimed at transferring the performance model learned for one task to another task. A large number of meta-features have been manually designed over the years (Muñoz et al., 2018), ranging from sufficient statistics to the so-called landmarks (Pfahringer et al., 2000), computing the performance of fast ML algorithms on the considered dataset. Meta-features, expected to describe the joint distribution underlying the dataset, should also be inexpensive to compute. The learning of meta-features was first tackled, to our best knowledge, by (Jomaa et al., 2019), defining the DATASET2VEC representation. Specifically, DATASET2VEC is provided with two dataset patches (two subsets of examples, described by two (different) sets of features), and is trained to predict whether those patches are extracted from the same initial dataset.

Contributions. The proposed DIDA approach extends the state of the art (Maron et al., 2020; Jomaa et al., 2019) in two ways. Firstly, it is designed to handle discrete or continuous probability distributions, as opposed to point sets (Section 2). As said, this extension makes it possible to leverage the more general topology of the Wasserstein distance, as opposed to that of the Hausdorff distance (Section 3).
This framework is used to derive theoretical guarantees of stability under bounded distribution transformations, as well as universal approximation results, extending (Maron et al., 2020) to the continuous setting. Secondly, the empirical validation of the approach on two tasks defined at the dataset level demonstrates the merit of the approach compared to the state of the art (Maron et al., 2020; Jomaa et al., 2019; Muñoz et al., 2018) (Section 4).

Notations. ⟦1; m⟧ denotes the set of integers {1, . . . , m}. Distributions, including discrete distributions (datasets), are denoted in bold font. Vectors are denoted in italics, with x[k] the k-th coordinate of vector x.

2. DISTRIBUTION-BASED INVARIANT NETWORKS FOR META-FEATURE LEARNING

This section describes the core of the proposed distribution-based invariant neural architectures, specifically the mechanism mapping a point distribution onto another one subject to sample and feature permutation invariance, referred to as an invariant layer. For the sake of readability, this section focuses on the case of discrete distributions, referring the reader to Appendix A for the general case of continuous distributions.

2.1. INVARIANT FUNCTIONS OF DISCRETE DISTRIBUTIONS

Let z = {(x_i, y_i) ∈ ℝ^d, i ∈ ⟦1; n⟧} denote a dataset of n labelled samples, with x_i ∈ ℝ^{d_X} an instance and y_i ∈ ℝ^{d_Y} the associated multi-label. With d_X and d_Y respectively the dimensions of the instance and label spaces, let d := d_X + d_Y. By construction, z is invariant under permutation of the sample ordering; it is viewed as an n-size discrete distribution (1/n) ∑_{i=1}^n δ_{z_i} on ℝ^d, with δ_{z_i} the Dirac mass at z_i. In the following, Z_n(ℝ^d) denotes the space of such n-size point distributions, and Z(ℝ^d) = ∪_n Z_n(ℝ^d).

Let G = S_{d_X} × S_{d_Y} denote the group of permutations independently operating on the feature and label spaces. For σ = (σ_X, σ_Y) ∈ G, the image σ(z) of a labelled sample z = (x, y) is defined as (σ_X(x), σ_Y(y)), with x = (x[k], k ∈ ⟦1; d_X⟧) and σ_X(x) := (x[σ_X(k)], k ∈ ⟦1; d_X⟧). For simplicity and by abuse of notation, the operator mapping a distribution z = (z_i, i ∈ ⟦1; n⟧) to {σ(z_i)} := σ ⋆ z is still denoted σ. Let Z(Ω) denote the space of distributions supported on some domain Ω ⊂ ℝ^d, with Ω invariant under the permutations in G. The goal of the paper is to define and train deep architectures, implementing functions ϕ on Z(Ω), Ω ⊂ ℝ^d, that are invariant under G, i.e. such that ∀σ ∈ G, ϕ(σ ⋆ z) = ϕ(z)¹. By construction, a multi-label dataset is invariant under permutations of the samples, of the features, and of the multi-labels. Therefore, any meta-feature, that is, a feature describing a multi-label dataset, is required to satisfy the above sample and feature permutation invariance properties.

2.2. DISTRIBUTION-BASED INVARIANT LAYERS

The building block of the proposed architecture, the invariant layer meant to satisfy the feature and label invariance requirements, is defined as follows, taking inspiration from (De Bie et al., 2019).

Definition 1. (Distribution-based invariant layers) Let an interaction functional ϕ : ℝ^d × ℝ^d → ℝ^r be G-invariant: ∀σ ∈ G, ∀(z₁, z₂) ∈ ℝ^d × ℝ^d, ϕ(z₁, z₂) = ϕ(σ(z₁), σ(z₂)). The distribution-based invariant layer f_ϕ is defined as

f_ϕ : z = (z_i)_{i∈⟦1;n⟧} ∈ Z(ℝ^d) ↦ f_ϕ(z) := ( (1/n) ∑_{j=1}^n ϕ(z₁, z_j), …, (1/n) ∑_{j=1}^n ϕ(z_n, z_j) ) ∈ Z(ℝ^r).   (1)

By construction, f_ϕ is G-invariant if ϕ is G-invariant. The construction of f_ϕ is extended to the general case of possibly continuous probability distributions by essentially replacing sums by integrals (Appendix A).

Remark 1. (Varying dimensions d_X and d_Y) Both in practice and in theory, it is important that f_ϕ layers (in particular the first layer of the neural architecture) handle datasets with arbitrary numbers of features d_X and multi-labels d_Y. The proposed approach, used in the experiments (Section 4), is to define ϕ as follows. Letting z = (x, y) and z′ = (x′, y′) be two samples in ℝ^{d_X} × ℝ^{d_Y}, let u be defined from ℝ⁴ onto ℝ^t; consider the sum of u(x[k], x′[k], y[ℓ], y′[ℓ]) for k ranging in ⟦1; d_X⟧ and ℓ in ⟦1; d_Y⟧, and apply a mapping v from ℝ^t to ℝ^r to this sum:

ϕ(z, z′) = v( ∑_{k=1}^{d_X} ∑_{ℓ=1}^{d_Y} u(x[k], x′[k], y[ℓ], y′[ℓ]) )   (2)

Likewise, by construction ϕ is G-invariant, i.e. it is invariant to both feature and label permutations. As shown in Section 4, this invariance property is instrumental to good empirical performance. The above definition of f_ϕ is based on the aggregation of pairwise terms ϕ(z_i, z_j). The motivation for using a pairwise ϕ is twofold.
On the one hand, capturing local sample interactions allows more expressive architectures, which is important to improve performance on some complex datasets, as illustrated in the experiments (Section 4). On the other hand, interaction functionals are crucial to design universal architectures (Appendix C, Theorem 2).

Remark 2. (Varying sample size n) By construction, f_ϕ is defined on Z(ℝ^d) = ∪_n Z_n(ℝ^d), independently of n, such that it supports inputs of arbitrary cardinality n. Two particular cases are when ϕ only depends on its first or second input: (i) if ϕ(z, z′) = ψ(z′), then f_ϕ computes a global "moment" descriptor of the input, as f_ϕ(z) = (1/n) ∑_{j=1}^n ψ(z_j) ∈ ℝ^r; (ii) if ϕ(z, z′) = ξ(z), then f_ϕ transports the input distribution via ξ, as f_ϕ(z) = {ξ(z_i), i ∈ ⟦1; n⟧} ∈ Z(ℝ^r). This operation is referred to as a push-forward.

Remark 3. (Discussion w.r.t. (Maron et al., 2020)) The proposed theoretical framework relies on the Wasserstein distance (corresponding to the convergence in law of probability distributions), which makes it possible to compare distributions with varying numbers of points, or even with continuous densities. In contrast, Maron et al. (2020) do not use interaction functionals, and establish the universality of their DSS architecture for fixed dimension d and number of points n. Moreover, DSS happens to resort to max-pooling operators, which are discontinuous w.r.t. the Wasserstein topology (see Remark 6).

Remark 4. (Localized computation) In practice, the quadratic complexity of f_ϕ w.r.t. the number n of samples can be reduced by only computing ϕ(z_i, z_j) for pairs z_i, z_j sufficiently close to each other. Layer f_ϕ thus extracts and aggregates information related to the neighborhood of the samples.

Remark 5. (Link to kernels) The use of an interaction functional ϕ is inspired by kernel methods, albeit with significant differences: (i) in f_ϕ(z_i), the detail of the pairwise interactions ϕ(z_i, z_j) is lost through averaging; (ii) ϕ takes the labels into account; (iii) ϕ is learnt. Further work will be devoted to investigating the properties of the matrix of pairwise interactions (ϕ(z_i, z_j))_{i,j}.
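To make the layer concrete, the following sketch (a minimal NumPy implementation with toy random choices of u and v, not the authors' code) instantiates Eq. (1) with the dimension-agnostic interaction functional of Eq. (2), and checks numerically that the output is unchanged under feature and label permutations:

```python
import numpy as np

rng = np.random.default_rng(0)
t, r = 8, 5
A_u, b_u = rng.normal(size=(t, 4)), rng.normal(size=t)   # u : R^4 -> R^t
A_v, b_v = rng.normal(size=(r, t)), rng.normal(size=r)   # v : R^t -> R^r
relu = lambda a: np.maximum(a, 0.0)

def u(xk, xk2, yl, yl2):
    return relu(A_u @ np.array([xk, xk2, yl, yl2]) + b_u)

def v(e):
    return relu(A_v @ e + b_v)

def phi(z1, z2):
    # Eq. (2): aggregate u over all feature/label index pairs, then map through v
    (x1, y1), (x2, y2) = z1, z2
    e = sum(u(x1[k], x2[k], y1[l], y2[l])
            for k in range(len(x1)) for l in range(len(y1)))
    return v(e)

def f_phi(Z):
    # Eq. (1): z_i -> mean_j phi(z_i, z_j); the output is again a point distribution, in R^r
    return np.array([np.mean([phi(zi, zj) for zj in Z], axis=0) for zi in Z])

# a toy dataset: n=6 samples, d_X=3 features, d_Y=2 labels
Z = [(rng.normal(size=3), rng.normal(size=2)) for _ in range(6)]
out = f_phi(Z)

# permuting the features (and labels) of every sample leaves the output unchanged
perm_X, perm_Y = [2, 0, 1], [1, 0]
Z_perm = [(x[perm_X], y[perm_Y]) for x, y in Z]
assert np.allclose(f_phi(Z_perm), out)
```

Permuting the samples of Z would instead permute the rows of the output accordingly, consistent with f_ϕ mapping distributions to distributions.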

2.3. LEARNING META-FEATURES

The proposed distributional neural architectures defined on point distributions (DIDA) are sought as

z ∈ Z(ℝ^d) ↦ F_ζ(z) := f_{ϕ_m} ∘ f_{ϕ_{m−1}} ∘ … ∘ f_{ϕ_1}(z) ∈ ℝ^{d_{m+1}}   (3)

where ζ are the trainable parameters of the architecture (below). Only the case d_Y = 1 is considered in the remainder. The k-th layer is built on top of ϕ_k, mapping pairs of vectors in ℝ^{d_k} onto ℝ^{d_{k+1}}, with d_1 = d (the dimension of the input samples). The last layer is built on ϕ_m, which only depends on its second argument; it maps the distribution of layer m − 1 onto a vector, whose coordinates are referred to as meta-features. The G-invariance and dimension-agnosticity of the whole architecture only depend on the first layer f_{ϕ_1} satisfying these properties. In the first layer, ϕ_1 is sought as ϕ_1((x, y), (x′, y′)) = v(∑_k u(x[k], x′[k], y, y′)) (Remark 1), with u(x[k], x′[k], y, y′) = (ρ(A_u · (x[k]; x′[k]) + b_u), 𝟙_{y=y′}) in ℝ^t × {0, 1}, where ρ is a non-linear activation function, A_u a (t, 2) matrix, (x[k]; x′[k]) the 2-dimensional vector concatenating x[k] and x′[k], and b_u a t-dimensional vector. With e = ∑_k u(x[k], x′[k], y, y′), function v likewise applies a non-linear activation function ρ to an affine transformation of e: v(e) = ρ(A_v · e + b_v), with A_v an (r, t) matrix and b_v an r-dimensional vector. Note that the subsequent layers need neither be invariant w.r.t. the number of samples, nor handle a varying number of dimensions. However, maintaining the distributional nature across several layers is shown to improve performance in practice (Section 4). Every ϕ_k, k ≥ 2, is defined as ϕ_k = ρ(A_k · + b_k), with ρ an activation function, A_k a (d_{k+1}, d_k) matrix and b_k a d_{k+1}-dimensional vector. The DIDA neural net is thus parameterized by ζ := (A_u, b_u, A_v, b_v, {A_k, b_k}_k), which is classically learned by stochastic gradient descent from the loss function defined after the task at hand (Section 4).
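As a minimal illustration of the composition F_ζ = f_{ϕ_m} ∘ … ∘ f_{ϕ_1}, the sketch below (NumPy, with arbitrary random weights; the feature-invariant first layer ϕ_1 is omitted, and the affine ϕ_k is taken to act on the concatenated pair, one possible reading of ρ(A_k · + b_k)) stacks an invariant layer and a final layer depending only on its second argument, yielding a meta-feature vector that is invariant under sample reordering:

```python
import numpy as np

rng = np.random.default_rng(1)
relu = lambda a: np.maximum(a, 0.0)

def invariant_layer(Z, phi):
    # Eq. (1): each sample is replaced by its mean interaction with all samples
    return np.stack([np.mean([phi(zi, zj) for zj in Z], axis=0) for zi in Z])

# phi_2 : pairs in R^4 -> R^6 (affine map on the concatenated pair, then ReLU)
A2, b2 = rng.normal(size=(6, 8)), rng.normal(size=6)
phi2 = lambda zi, zj: relu(A2 @ np.concatenate([zi, zj]) + b2)

# phi_3 : last layer, depending only on its (second) argument
A3, b3 = rng.normal(size=(3, 6)), rng.normal(size=3)
phi3 = lambda z: relu(A3 @ z + b3)

def F(Z):
    Z1 = invariant_layer(Z, phi2)                   # distribution in R^6
    return np.mean([phi3(z) for z in Z1], axis=0)   # meta-features in R^3

Z = rng.normal(size=(5, 4))                         # n=5 samples in R^4
assert np.allclose(F(Z), F(Z[::-1]))                # invariant under sample reordering
```

In a real implementation the weights would of course be trained by SGD as described above, rather than drawn at random.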

3. THEORETICAL ANALYSIS

This section analyzes the properties of invariant-layer based neural architectures, specifically their robustness w.r.t. bounded transformations of the involved distributions, and their approximation abilities w.r.t. the convergence in law, which is the natural topology for distributions. As already said, the discrete distribution case is considered in this section for the sake of readability, referring the reader to Appendix A for the general case of continuous distributions.

3.1. OPTIMAL TRANSPORT COMPARISON OF DATASETS

Point clouds vs. distributions. Our claim is that datasets should be seen as probability distributions rather than point clouds. Typically, including many copies of a point in a dataset amounts to increasing its importance, which usually makes a difference in a standard machine learning setting. Accordingly, the topological framework used to define and learn meta-features in the following is that of convergence in law, with the distance between two datasets quantified using the Wasserstein distance (below). In contrast, the point cloud setting (see for instance (Qi et al., 2017)) relies on the Hausdorff distance between sets to theoretically assess the robustness of these architectures. While this is standard for the 2D and 3D data involved in graphics and vision domains, it faces some limitations in higher-dimensional domains, e.g. due to max-pooling being a non-continuous operator w.r.t. the convergence in law topology.

Wasserstein distance. Referring the reader to (Santambrogio, 2015; Peyré & Cuturi, 2019) for a more comprehensive presentation, the standard 1-Wasserstein distance between two discrete probability distributions (z, z′) ∈ Z_n(ℝ^d) × Z_m(ℝ^d) is defined as:

W₁(z, z′) := max_{f ∈ Lip₁(ℝ^d)} (1/n) ∑_{i=1}^n f(z_i) − (1/m) ∑_{j=1}^m f(z′_j)

with Lip₁(ℝ^d) the space of 1-Lipschitz functions f : ℝ^d → ℝ. To account for the invariance requirement (making z = (z₁, …, z_n) and its permuted image σ ⋆ z := (σ(z₁), …, σ(z_n)) indistinguishable under σ ∈ G), we introduce the G-invariant 1-Wasserstein distance, still denoted W₁ by abuse of notation: for z ∈ Z_n(ℝ^d), z′ ∈ Z_m(ℝ^d),

W₁(z, z′) = min_{σ∈G} W₁(σ ⋆ z, z′)

such that W₁(z, z′) = 0 if and only if z and z′ belong to the same equivalence class (Appendix A), i.e. are equal in the sense of probability distributions up to sample and feature permutations.

Lipschitz property. In this context, a map f from Z(ℝ^d) onto Z(ℝ^r) is continuous for the convergence in law (a.k.a. weak convergence of distributions, denoted ⇀) iff for any sequence z^(k) ⇀ z, f(z^(k)) ⇀ f(z). The Wasserstein distance metrizes the convergence in law, in the sense that z^(k) ⇀ z is equivalent to W₁(z^(k), z) → 0. Furthermore, a map f is said to be C-Lipschitz for the permutation-invariant 1-Wasserstein distance iff

∀z, z′ ∈ Z(ℝ^d), W₁(f(z), f(z′)) ≤ C W₁(z, z′).   (4)

The C-Lipschitz property entails the continuity of f w.r.t. its input: if two input distributions are close in the permutation-invariant 1-Wasserstein sense, the corresponding outputs are close too.
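For intuition, the G-invariant distance can be evaluated exactly on tiny examples by brute force (a sketch, not an efficient solver: for uniform discrete measures of equal size, the optimal W₁ coupling is a one-to-one matching, enumerated below together with the feature permutations; d_Y = 1 and the label column is left fixed):

```python
import numpy as np
from itertools import permutations

def w1(Z, Zp):
    # 1-Wasserstein distance between two uniform discrete measures of equal size n
    # (the optimal coupling is a permutation; brute-force over matchings for tiny n)
    n = len(Z)
    return min(sum(np.linalg.norm(Z[i] - Zp[pi[i]]) for i in range(n)) / n
               for pi in permutations(range(n)))

def w1_G(Z, Zp, d_X):
    # G-invariant version: minimize over permutations of the d_X feature columns of Z
    cols = lambda sigma: list(sigma) + list(range(d_X, Z.shape[1]))
    return min(w1(Z[:, cols(sigma)], Zp) for sigma in permutations(range(d_X)))

rng = np.random.default_rng(2)
Z = rng.normal(size=(4, 3))        # 4 samples, d_X = 2 features + 1 label
Zp = Z[:, [1, 0, 2]]               # the same dataset with its two features swapped
assert w1(Z, Zp) > 0               # as point sets, the two datasets look different
assert w1_G(Z, Zp, d_X=2) < 1e-12  # up to feature permutation, they coincide
```

Both enumerations are factorial in n and d_X; practical solvers would use linear programming (e.g. an optimal transport library) instead.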

3.2. REGULARITY OF DISTRIBUTION-BASED INVARIANT LAYERS

Assuming the interaction functional satisfies the Lipschitz property:

∀z ∈ ℝ^d, ϕ(z, ·) and ϕ(·, z) are C_ϕ-Lipschitz,   (5)

the robustness of invariant layers with respect to different variations of their input is established (proofs in Appendix B). We first show that invariant layers also satisfy a Lipschitz property, ensuring that deep architectures of the form (3) map close inputs onto close outputs.

Proposition 1. An invariant layer f_ϕ of type (1) is (2r C_ϕ)-Lipschitz in the sense of (4).

A second result regards the case where two datasets z and z′ are such that z′ is the image of z through some diffeomorphism τ (z = (z₁, …, z_n) and z′ = τ ⋆ z = (τ(z₁), …, τ(z_n))). If τ is close to the identity, then the following proposition shows that f_ϕ(τ ⋆ z) and f_ϕ(z) are close too. More generally, if continuous transformations τ and ξ respectively apply to the input and output spaces of f_ϕ, and are close to the identity, then ξ ⋆ f_ϕ(τ ⋆ z) and f_ϕ(z) are also close.

Proposition 2. Let τ : ℝ^d → ℝ^d and ξ : ℝ^r → ℝ^r be two Lipschitz maps with Lipschitz constants C_τ and C_ξ respectively. Then,

∀z ∈ Z(Ω), W₁(ξ ⋆ f_ϕ(τ ⋆ z), f_ϕ(z)) ≤ sup_{x∈f_ϕ(τ(Ω))} ‖ξ(x) − x‖₂ + 2r Lip(ϕ) sup_{x∈Ω} ‖τ(x) − x‖₂

∀z, z′ ∈ Z(Ω), if τ is equivariant, W₁(ξ ⋆ f_ϕ(τ ⋆ z), ξ ⋆ f_ϕ(τ ⋆ z′)) ≤ 2r C_ϕ C_τ C_ξ W₁(z, z′)
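For intuition on the constant in Proposition 1, one plausible route is the following estimate (a sketch under assumption (5), for two same-size distributions whose samples are optimally matched, i.e. (1/n)∑_i ‖z_i − z′_i‖₂ = W₁(z, z′); the full argument is in Appendix B):

\begin{aligned}
W_1\big(f_\varphi(\mathbf{z}), f_\varphi(\mathbf{z}')\big)
&\le \frac{1}{n}\sum_{i=1}^n \big\| f_\varphi(\mathbf{z})_i - f_\varphi(\mathbf{z}')_i \big\|_2
\le \frac{1}{n}\sum_{i=1}^n \sum_{k=1}^r \Big| \frac{1}{n}\sum_{j=1}^n \varphi(z_i, z_j)[k] - \varphi(z'_i, z'_j)[k] \Big| \\
&\le \frac{r}{n}\sum_{i=1}^n C_\varphi \Big( \|z_i - z'_i\|_2 + \frac{1}{n}\sum_{j=1}^n \|z_j - z'_j\|_2 \Big)
= 2\, r\, C_\varphi \cdot \frac{1}{n}\sum_{i=1}^n \|z_i - z'_i\|_2
= 2\, r\, C_\varphi\, W_1(\mathbf{z}, \mathbf{z}'),
\end{aligned}

where the first inequality uses the dual definition of W₁ with 1-Lipschitz test functions, and the third splits ϕ(z_i, z_j) − ϕ(z′_i, z′_j) through the intermediate term ϕ(z′_i, z_j), applying (5) once per argument.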

3.3. UNIVERSALITY OF INVARIANT LAYERS

Lastly, the universality of the proposed architecture is established, showing that the composition of an invariant layer (1) and a fully-connected layer is enough to enjoy the universal approximation property over all functions defined on Z(ℝ^d), for any dimension d less than some D (Remark 1).

Theorem 1. Let F : Z(Ω) → ℝ be a G-invariant map on a compact Ω, continuous for the convergence in law. Then ∀ε > 0, there exist two continuous maps ψ, ϕ such that

∀z ∈ Z(Ω), |F(z) − ψ ∘ f_ϕ(z)| < ε

where ϕ is G-invariant and independent of F.

Proof. The sketch of the proof is as follows (complete proof in Appendix C). Let us define ϕ = g ∘ h where: (i) h is the collection of the d_X elementary symmetric polynomials in the features and the d_Y elementary symmetric polynomials in the labels, which is invariant under G; (ii) a discretization of h(Ω) on a grid is then considered, achieved thanks to g, which collects integrals over each cell of the discretization; (iii) ψ applies function F to this discretized measure; this requires h to be bijective, which is achieved through a projection on the quotient space S^d/G and a restriction to its compact image Ω′. To sum up, f_ϕ defined as such computes an expectation which collects integrals over each cell of the grid, so as to approximate the measure h ⋆ z by a discrete counterpart ĥ⋆z. Hence ψ applies F to h⁻¹ ⋆ (ĥ⋆z). Continuity is obtained as follows: (i) the proximity of h ⋆ z and ĥ⋆z follows from Lemma 1 in (De Bie et al., 2019), and gets tighter as the grid discretization step tends to 0; (ii) the map h⁻¹ is 1/d-Hölder, after Theorem 1.3.1 from (Rahman & Schmeisser, 2002); therefore Lemma 2 entails that W₁(z, h⁻¹ ⋆ (ĥ⋆z)) can be upper-bounded; (iii) since Ω is compact, by the Banach-Alaoglu theorem, Z(Ω) is compact too. Since F is continuous, it is thus uniformly weakly continuous: choosing a discretization step small enough ensures the result.

Remark 6. (Comparison with (Maron et al., 2020)) The above proof holds for functionals of arbitrary input sample size n, as well as for continuous distributions, generalizing the results in (Maron et al., 2020). Note that the two types of architectures radically differ (more in Section 4).

Remark 7. (Approximation by an invariant NN) After Theorem 1, any invariant continuous function defined on distributions with compact support can be approximated with arbitrary precision by an invariant neural network (Appendix C). The proof mainly involves three steps: (i) an invariant layer f_ϕ can be approximated by an invariant network; (ii) the universal approximation theorem (Cybenko, 1989; Leshno et al., 1993) is applied; (iii) uniform continuity is used to obtain uniform bounds.

Remark 8. (Extension to different spaces) Theorem 1 also extends to distributions supported on different spaces, via embedding them into a common high-dimensional space. Therefore, any invariant function on distributions with compact support in ℝ^d with d ≤ D can be uniformly approximated by an invariant network (Appendix C).
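The key object in the proof, the map h built from elementary symmetric polynomials, can be probed numerically (a toy NumPy check, not the paper's full construction): e₁, …, e_d are invariant under coordinate permutations, and determine the coordinates up to permutation, since they are, up to sign, the coefficients of ∏_k (t − x[k]).

```python
import numpy as np
from itertools import combinations

def elem_sym(x):
    # e_1, ..., e_d: the elementary symmetric polynomials of the coordinates of x
    d = len(x)
    return np.array([sum(np.prod(c) for c in combinations(x, k))
                     for k in range(1, d + 1)])

x = np.array([1.0, 2.0, 3.0])
# invariance under any permutation of the coordinates
assert np.allclose(elem_sym(x), elem_sym(x[[2, 0, 1]]))
# e(x) = (6, 11, 6): coefficients of (t-1)(t-2)(t-3) = t^3 - 6t^2 + 11t - 6, up to sign
assert np.allclose(elem_sym(x), [6.0, 11.0, 6.0])
```

This is exactly the injectivity-up-to-permutation property that makes h bijective on the quotient space in the proof above.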

4. EXPERIMENTAL VALIDATION

The experimental validation presented in this section considers two goals: (i) assessing the ability of DIDA to learn accurate meta-features; (ii) assessing the merit of the DIDA invariant layer design, building the invariant f_ϕ on top of an interaction functional ϕ (Eq. 1). As said, this architecture is expected to grasp contrasts among samples, e.g. belonging to different classes; the proposed experimental setting aims to empirically investigate this conjecture.

Baselines. These goals are tackled by comparing DIDA to three baselines: DSS layers (Maron et al., 2020); hand-crafted meta-features (HC) (Muñoz et al., 2018) (Table 5 in Appendix D); DATASET2VEC (Jomaa et al., 2019). We implemented DSS, the code being not available². In order to cope with varying dataset dimensions (as required by the UCI and OpenML benchmarks), the original DSS was augmented with an aggregator summing over the features. Three DSS baselines are considered: linear or non-linear invariant layers, possibly preceded by equivariant layers. Similarly, the original DATASET2VEC implementation has been augmented to address our experimental setting. The baselines are detailed in Appendix D.3.

Figure 1: Learning meta-features with DIDA. Top: the DIDA architecture (BN stands for batch norm; FC for fully connected layer). Bottom left: learning meta-features for patch identification using a Siamese architecture (Section 4.1). Bottom right: learning meta-features for performance modelling, specifically to rank two hyper-parameter configurations θ₁ and θ₂ (Section 4.2).

Experimental setting. Two tasks defined at the dataset level are considered: patch identification (Section 4.1) and performance modelling (Section 4.2). The dataset preprocessing protocols are detailed in Appendix D.1. On both tasks, the same DIDA architecture is considered (Fig. 1), involving 2 invariant layers followed by 3 fully connected (FC) layers. Meta-features F_ζ(z) consist of the output of the third FC layer, with ζ denoting the trained DIDA parameters. All experiments run on 1 NVIDIA Tesla V100-SXM2 GPU with 32GB memory, using the Adam optimizer with base learning rate 10⁻³.

4.1. TASK 1: PATCH IDENTIFICATION

The patch identification task consists of detecting whether two blocks of data are extracted from the same original dataset (Jomaa et al., 2019). Letting u denote an n-sample, d-dimensional dataset, an (n_z, d_z) patch z is constructed from u by selecting n_z examples of u (sampled uniformly with replacement) and retaining their description along d_z features (sampled uniformly with replacement). The size n_z and number of features d_z of the patch are uniformly selected in fixed intervals (Table 4, Appendix D). To each pair of patches z, z′ with the same number of instances n_z = n_{z′} is associated a binary meta-label ℓ(z, z′), set to 1 iff z and z′ are extracted from the same initial dataset u. DIDA parameters ζ are trained to minimize the cross-entropy loss of the model ℓ̂_ζ(z, z′) = exp(−‖F_ζ(z) − F_ζ(z′)‖₂), with F_ζ(z) and F_ζ(z′) the meta-features computed for z and z′:

Minimize L(ζ) = − ∑_{z,z′} ℓ(z, z′) log(ℓ̂_ζ(z, z′)) + (1 − ℓ(z, z′)) log(1 − ℓ̂_ζ(z, z′))   (6)

DIDA and all baselines are trained using a Siamese approach (Figure 1, bottom left): the same DIDA (or baseline) architecture is used to compute meta-features F_ζ(z) and F_ζ(z′) from patches z and z′, and trained to minimize the cross-entropy loss w.r.t. ℓ(z, z′). The classification results on toy datasets and UCI datasets (Table 1, detailed in Appendix D) show the pertinence of the DIDA meta-features, particularly so on the UCI datasets, where the number of features widely varies from one dataset to another. The relevance of the interaction-based invariant layer design is established on this problem, as DIDA outperforms DATASET2VEC and DSS, as well as the function learned on top of the hand-crafted meta-features. An ablation study is conducted to assess the impact of (i) the feature permutation invariance; (ii) considering one vs two invariant layers of type (1).

The so-called NO-FINV-DSS baseline, detailed in Appendix D, is built upon (Zaheer et al., 2017); it only differs from the DSS baseline in that it is not feature permutation invariant. With approximately the same number of parameters as DSS, its performance is significantly lower (Table 1), showcasing the benefits of enforcing the feature invariance property. Secondly, we compare the 2-invariant-layer DIDA with the 1-invariant-layer DIDA (2L-DIDA and 1L-DIDA for short): 1L-DIDA yields significantly lower performance, which confirms the advantage of maintaining the distributional nature across several layers, as already noted by (De Bie et al., 2019). Note that 1L-DIDA still outperforms the non feature-invariant baseline, while requiring much fewer parameters.
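The patch-sampling and pairwise loss above can be sketched as follows (NumPy; the meta-feature extractor F below is a toy stand-in for the trained DIDA network, and the patch sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)

def sample_patch(U, n_z, d_z):
    # n_z examples and d_z feature columns, both drawn uniformly with replacement
    rows = rng.integers(0, U.shape[0], size=n_z)
    cols = rng.integers(0, U.shape[1], size=d_z)
    return U[np.ix_(rows, cols)]

# stand-in meta-feature extractor (a real run would use the trained F_zeta)
F = lambda z: np.array([z.mean(), z.std()])

def pair_loss(z, zp, label):
    # predicted probability that the two patches share an origin, and its log-loss
    p = np.exp(-np.linalg.norm(F(z) - F(zp)))
    p = np.clip(p, 1e-12, 1 - 1e-12)   # numerical guard for the cross-entropy
    return -(label * np.log(p) + (1 - label) * np.log(1 - p))

U1, U2 = rng.normal(0, 1, (500, 8)), rng.normal(5, 3, (500, 8))
z, zp = sample_patch(U1, 64, 4), sample_patch(U1, 64, 4)   # same origin: label 1
zq = sample_patch(U2, 64, 4)                               # different origin: label 0
print(pair_loss(z, zp, 1), pair_loss(z, zq, 0))
```

In the Siamese setup, the gradient of this loss flows through both meta-feature branches, which share all their weights.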

4.2. TASK 2: PERFORMANCE MODEL LEARNING

The performance modelling task aims to assess a priori the accuracy of the classifier learned by a given machine learning algorithm with a given configuration θ (a vector of hyper-parameters ranging in a hyper-parameter space Θ, Table 6 in Appendix D) on a dataset z (for brevity, the performance of θ on z) (Rice, 1976). For each ML algorithm, ranging over Logistic Regression (LR), SVM, k-Nearest Neighbours (k-NN), and a linear classifier learned with stochastic gradient descent (SGD), a set of meta-features is learned to predict whether some configuration θ₁ outperforms some configuration θ₂ on dataset z: to each triplet (z, θ₁, θ₂) is associated a binary value ℓ(z, θ₁, θ₂), set to 1 iff θ₂ yields better performance than θ₁ on z. DIDA parameters ζ are trained to build a model ℓ̂_ζ minimizing the (weighted) cross-entropy loss (6), where ℓ̂_ζ(z, θ₁, θ₂) is a 2-layer FC network with input vector [F_ζ(z); θ₁; θ₂], depending on the considered ML algorithm and its configuration space. In each epoch, a batch of triplets (z, θ₁, θ₂) is built, with θ₁, θ₂ uniformly drawn in the algorithm configuration space (Table 6) and z an n-sample, d-dimensional patch of a dataset in the OpenML CC-2018 suite (Bischl et al., 2019), with n uniformly drawn in [700; 900] and d in [3; 10]. Algorithm 1 summarizes the training procedure. The quality of the DIDA meta-features is assessed from the ranking accuracy (Table 2), showing their relevance. The performance gap compared to the baselines is highest for the k-NN modelling task.
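The triplet construction of the training procedure can be sketched as follows (NumPy; `error` is a toy stand-in for the 3-fold CV error of the classifier under configuration θ, and the meta-feature extractor is again a placeholder):

```python
import numpy as np

rng = np.random.default_rng(4)

F = lambda z: np.array([z.mean(), z.std(), float(z.shape[1])])  # placeholder meta-features
error = lambda z, theta: abs(theta - z.std())                   # toy CV-error oracle

def make_triplet(U):
    # patch size n in [700; 900], dimension d in [3; 10], as in the training protocol
    n, d = rng.integers(700, 901), rng.integers(3, 11)
    rows = rng.integers(0, U.shape[0], size=n)
    cols = rng.integers(0, U.shape[1], size=d)
    z = U[np.ix_(rows, cols)]
    theta1, theta2 = rng.uniform(0.0, 2.0, size=2)              # two sampled configurations
    label = int(error(z, theta2) < error(z, theta1))            # 1 iff theta_2 dominates theta_1
    return np.concatenate([F(z), [theta1, theta2]]), label      # input to the 2-layer FC ranker

x, y = make_triplet(rng.normal(size=(1000, 20)))
assert x.shape == (5,) and y in (0, 1)
```

Each such (input, label) pair feeds one gradient step of the pairwise ranking model; only the CV-error oracle changes across the four ML algorithms considered.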

5. CONCLUSION

The theoretical contribution of the paper is the DIDA architecture, able to learn from discrete and continuous distributions on ℝ^d, invariant w.r.t. feature ordering, and agnostic w.r.t. the size and dimension d of the considered distribution sample (with d less than some upper bound D). This architecture enjoys universal approximation and robustness properties, generalizing former results obtained for point clouds (Maron et al., 2020). The merits of DIDA are demonstrated on two tasks defined at the dataset level, patch identification and performance model learning, compared to the state of the art (Maron et al., 2020; Jomaa et al., 2019; Muñoz et al., 2018). The ability to accurately describe a dataset in the landscape defined by ML algorithms opens new perspectives for comparing datasets and algorithms, e.g. for domain adaptation (Ben-David et al., 2007; 2010) and meta-learning (Finn et al., 2018; Yoon et al., 2018), in light of kernel methods.



¹ As opposed to G-equivariant functions, which are characterized by ∀σ ∈ G, ϕ(σ ⋆ z) = σ ⋆ ϕ(z).
² The source code of DIDA and (our implementation of) the baselines is available in the supplementary material.




Algorithm 1: Performance model training.
1: F_ζ ← meta-feature extractor (DIDA, DSS, DATASET2VEC, or hand-crafted)
2: MLP ← NN[Linear(64)-ReLU-Linear(32)-ReLU-Linear(1)]
3: CLF ← machine learning classifier (SGD, SVM, LR or k-NN)
4: error ← 3-CV classification error function
5: for iteration = 1, 2, . . . do
6:   Sample (θ₁, θ₂), two hyper-parameter configurations of CLF ▷ Search space: Table 6
7:   Sample patch z from dataset u ▷ Patch dimension: Table 4
8:   pred ← (MLP(F_ζ(z), θ₁), MLP(F_ζ(z), θ₂))
9:   Backpropagate logloss(pred, 0 if error(z, CLF(θ₁)) < error(z, CLF(θ₂)) else 1)
10: end for

The performance gap on the k-NN modelling task is explained as the sought performance model only depends on the local geometry of the examples. Still, good performances are observed over all considered algorithms. Note that 2L-DIDA yields significantly better (respectively, similar) performances than 1L-DIDA on the k-NN model (resp. on all other models).

Meta-feature assessment. A regression setting is thereafter considered, aimed at predicting the actual performance of a configuration θ based on the (frozen) meta-features F_ζ(z). The regression accuracy is illustrated for the configurations of the k-NN algorithm in Figure 2, left (results for the other algorithms are presented in Appendix D). The comparison with the regression models based on the DSS meta-features or hand-crafted features (Figure 2, middle and right) shows the merits of the DIDA architecture; a tentative interpretation of DIDA's better performance is based on the interactional nature of the DIDA architecture, better capturing local interactions.

Figure 2: k-NN: true performance vs performance predicted by regression on top of the meta-features (i) learned by DIDA, (ii) learned by DSS, or (iii) hand-crafted statistics.


Table 1: Patch identification (binary classification accuracy) on 10 runs of DIDA and the considered baselines.

…                              … % ± 0.26       81.29 % ± 1.65   87.65 % ± 0.03   68.55 % ± 2.84
DSS (Non-linear aggregation)   74.13 % ± 1.01   83.38 % ± 0.37   87.92 % ± 0.27   73.07 % ± 0.77
DIDA (1 invariant layer)       77.31 % ± 0.16   84.05 % ± 0.71   90.16 % ± 0.17   74.41 % ± 0.93
DIDA (2 invariant layers)      78.41 % ± 0.41   84.14 % ± 0.02   89.77 % ± 0.50   78.91 % ± 0.54

Table 2: Pairwise ranking of configurations, for the ML algorithms SGD, SVM, LR and k-NN: performance on the test set of DIDA, hand-crafted meta-features, DATASET2VEC and DSS (average and standard deviation over 3 runs).

