LEARNING AGGREGATION FUNCTIONS

Abstract

Learning on sets is increasingly gaining attention in the machine learning community, due to its widespread applicability. Typically, representations over sets are computed by using fixed aggregation functions such as sum or maximum. However, recent results showed that universal function representation by sum- (or max-) decomposition requires either highly discontinuous (and thus poorly learnable) mappings, or a latent dimension equal to the maximum number of elements in the set. To mitigate this problem, we introduce LAF (Learning Aggregation Functions), a learnable aggregator for sets of arbitrary cardinality. LAF can approximate several extensively used aggregators (such as average, sum, maximum) as well as more complex functions (e.g. variance and skewness). We report experiments on semi-synthetic and real data showing that LAF outperforms state-of-the-art sum- (max-) decomposition architectures such as DeepSets and library-based architectures like Principal Neighborhood Aggregation.

1. INTRODUCTION

The need to aggregate representations is ubiquitous in deep learning. Some recent examples include max-over-time pooling used in convolutional networks for sequence classification (Kim, 2014), average pooling of neighbors in graph convolutional networks (Kipf & Welling, 2017), max-pooling in Deep Sets (Zaheer et al., 2017), in (generalized) multi-instance learning (Tibo et al., 2017) and in GraphSAGE (Hamilton et al., 2017). In all the above cases (with the exception of LSTM-pooling in GraphSAGE) the aggregation function is predefined, i.e., not tunable, which may in general be a disadvantage (Ilse et al., 2018). Sum-based aggregation has been advocated based on theoretical findings showing that permutation-invariant functions can be sum-decomposed (Zaheer et al., 2017; Xu et al., 2019). However, recent results (Wagstaff et al., 2019) showed that this universal function representation guarantee requires either highly discontinuous (and thus poorly learnable) mappings, or a latent dimension equal to the maximum number of elements in the set. This suggests that learning set functions that are accurate on sets of large cardinality is difficult.

Inspired by previous work on learning uninorms (Melnikov & Hüllermeier, 2016), we propose a new parametric family of aggregation functions that we call LAF, for learning aggregation functions. A single LAF unit can approximate standard aggregators like sum, max or mean as well as model intermediate behaviours (possibly different in different areas of the space). In addition, LAF layers with multiple aggregation units can approximate higher-order moments of distributions like variance, skewness or kurtosis. In contrast, other authors (Corso et al., 2020) suggest employing a predefined library of elementary aggregators to be combined. Since LAF can represent sums, it can be seen as a smooth version of the class of functions that are shown in Zaheer et al. (2017) to enjoy universality results in representing set functions.
The hope is that, being smoother, LAF is more easily learnable. Our empirical findings show that this can indeed be the case, especially when asking the model to generalize over large sets. In particular, in this paper we offer an extensive experimental analysis showing that:

• LAF layers can learn a wide range of aggregators (including higher-order moments) on sets of scalars, without background knowledge on the nature of the aggregation task;
• LAF layers on top of traditional layers can learn the same wide range of aggregators on sets of high-dimensional vectors (MNIST images);
• LAF outperforms state-of-the-art set learning methods such as DeepSets and PNA on real-world problems involving point clouds and text concept set retrieval.

The rest of this work is structured as follows. In Section 2 we define the LAF framework and show how appropriate parametrizations of LAF allow it to represent a wide range of popular aggregation functions. In Section 3 we discuss some relevant related work. Section 4 reports synthetic and real-world experiments showing the advantages of LAF over (sets of) predefined aggregators. Finally, conclusions and pointers to future work are discussed in Section 5.

Table 1: Different functions achievable by varying the parameters in the formulation in Eq. 2.

Name | Definition | a | b | c | d | e | f | g | h | α | β | γ | δ | limits
constant | c ∈ R | 0 | 1 | - | - | 0 | 1 | - | - | c | 0 | 1 | 0 |
max | max_i x_i | 1/r | r | - | - | 0 | 1 | - | - | 1 | 0 | 1 | 0 | r → ∞
min | min_i x_i | 0 | 1 | 1/r | r | 0 | 1 | - | - | 1 | -1 | 1 | 0 | r → ∞
sum | Σ_i x_i | 1 | 1 | - | - | 0 | 1 | - | - | 1 | 0 | 1 | 0 |
nonzero count | |{i : x_i ≠ 0}| | 1 | 0 | - | - | 0 | 1 | - | - | 1 | 0 | 1 | 0 |
mean | (1/N) Σ_i x_i | 1 | 1 | - | - | 1 | 0 | - | - | 1 | 0 | 1 | 0 |
kth moment | (1/N) Σ_i x_i^k | 1 | k | - | - | 1 | 0 | - | - | 1 | 0 | 1 | 0 |
lth power of kth moment | ((1/N) Σ_i x_i^k)^l | l | k | - | - | l | 0 | - | - | 1 | 0 | 1 | 0 |
min/max | min_i x_i / max_i x_i | 0 | 1 | 1/r | r | 1/s | s | - | - | 1 | -1 | 1 | 0 | r, s → ∞
max/min | max_i x_i / min_i x_i | 1/r | r | - | - | 0 | 1 | 1/s | s | 1 | 0 | 1 | -1 | r, s → ∞
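As a quick numerical illustration of the limit rows in Table 1 (an illustrative sketch of ours, not code from the paper), the generalized norm (Σ_i x_i^r)^(1/r) approaches max_i x_i as r grows:

```python
import numpy as np

x = np.array([0.2, 0.5, 0.9])

def gen_norm(x, a, b):
    # Generalized norm of the form (sum_i x_i^b)^a with a, b >= 0
    return np.sum(x ** b) ** a

# With a = 1/r and b = r, the norm tends to max(x) as r -> infinity:
for r in [1, 5, 50]:
    print(r, gen_norm(x, 1.0 / r, r))
```

Already at r = 50 the value is within 10^-3 of max(x) = 0.9, which is why the max and min rows of Table 1 are only asymptotic parametrizations.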

2. THE LEARNING AGGREGATION FUNCTION FRAMEWORK

We use x = {x_1, . . . , x_N} to denote finite multisets of real numbers x_i ∈ R. Note that directly taking x to be a multiset, not a vector, means that there is no need to define properties like exchangeability or permutation equivariance for operations on x. An aggregation function agg is any function that returns, for any multiset x of arbitrary cardinality N ∈ N, a value agg(x) ∈ R.

Standard aggregation functions like mean and max can be understood as (normalized) L_p-norms. We therefore build our parametric LAF aggregator around generalized L_p-norms of the form

    L_{a,b}(x) := (Σ_i x_i^b)^a    (a, b ≥ 0).    (1)

L_{a,b} is invariant under the addition of zeros: L_{a,b}(x) = L_{a,b}(x ∪ 0), where 0 is a multiset of zeros of arbitrary cardinality. In order to also enable aggregations that can represent conjunctive behavior such as min, we make symmetric use of aggregators of the multisets 1 − x := {1 − x_i | x_i ∈ x}. For L_{a,b}(1 − x) to be a well-behaved, dual version of L_{a,b}(x), the values in x need to lie in the range [0, 1]. We therefore restrict the following definition of our learnable aggregation function to sets x whose elements are in [0, 1]:

    LAF(x) := (α L_{a,b}(x) + β L_{c,d}(1 − x)) / (γ L_{e,f}(x) + δ L_{g,h}(1 − x))    (2)

defined by tunable parameters a, . . . , h ≥ 0 and α, . . . , δ ∈ R. In cases where sets whose elements are not already bounded by [0, 1] need to be aggregated, we apply a sigmoid function to the set elements prior to aggregation.

Table 1 shows how a number of important aggregation functions are special cases of LAF (for values in [0, 1]). We make repeated use of the fact that L_{0,1} returns the constant 1. For max and min, LAF only provides an asymptotic approximation in the limit of specific function parameters (as indicated in the limits column of Table 1). In most cases, the parameterization of LAF for the functions in Table 1 will not be unique. Being able to encode the powers of moments implies that, e.g., the variance of x can be expressed as the difference (1/N) Σ_i x_i^2 − ((1/N) Σ_i x_i)^2 of two LAF aggregators. Since LAF includes sum-aggregation, we can adapt the results of Zaheer et al. (2017) and Wagstaff et al. (2019) on the theoretical universality of sum-aggregation as follows.
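To make the definition concrete, here is a minimal NumPy sketch of a single LAF unit (our own illustrative code, with parameter names taken from the definition above); the Table 1 parametrizations for sum, mean, and (approximately) max are checked numerically, with set elements assumed to lie in [0, 1]:

```python
import numpy as np

def L(x, a, b):
    # Generalized norm L_{a,b}(x) = (sum_i x_i^b)^a, with a, b >= 0
    return np.sum(x ** b) ** a

def laf(x, a, b, c, d, e, f, g, h, alpha, beta, gamma, delta):
    # LAF(x) = (alpha*L_{a,b}(x) + beta*L_{c,d}(1-x))
    #        / (gamma*L_{e,f}(x) + delta*L_{g,h}(1-x))
    num = alpha * L(x, a, b) + beta * L(1.0 - x, c, d)
    den = gamma * L(x, e, f) + delta * L(1.0 - x, g, h)
    return num / den

x = np.array([0.1, 0.4, 0.7])

# Table 1 rows; dashes (unused parameters) are filled with 0 and 1,
# which is harmless since their coefficients beta/delta are 0 here.
sum_val  = laf(x, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0)  # denominator L_{0,1} = 1
mean_val = laf(x, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0)  # denominator L_{1,0} = N
r = 100
max_val  = laf(x, 1 / r, r, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0)  # -> max(x) as r grows
print(sum_val, mean_val, max_val)
```

In a learnable setting these twelve scalars would be trainable parameters (with the non-negativity of a, . . . , h enforced, e.g., by an exponential or softplus reparametrization); here they are fixed by hand to recover the classical aggregators.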




