LEARNING AGGREGATION FUNCTIONS

Abstract

Learning on sets is increasingly gaining attention in the machine learning community, due to its widespread applicability. Typically, representations over sets are computed using fixed aggregation functions such as sum or maximum. However, recent results showed that universal function representation by sum- (or max-) decomposition requires either highly discontinuous (and thus poorly learnable) mappings, or a latent dimension equal to the maximum number of elements in the set. To mitigate this problem, we introduce LAF (Learning Aggregation Functions), a learnable aggregator for sets of arbitrary cardinality. LAF can approximate several extensively used aggregators (such as average, sum, maximum) as well as more complex functions (e.g., variance and skewness). We report experiments on semi-synthetic and real data showing that LAF outperforms state-of-the-art sum- (max-) decomposition architectures such as DeepSets and library-based architectures like Principal Neighborhood Aggregation.

1. INTRODUCTION

The need to aggregate representations is ubiquitous in deep learning. Some recent examples include max-over-time pooling used in convolutional networks for sequence classification (Kim, 2014), average pooling of neighbors in graph convolutional networks (Kipf & Welling, 2017), max-pooling in Deep Sets (Zaheer et al., 2017), in (generalized) multi-instance learning (Tibo et al., 2017) and in GraphSAGE (Hamilton et al., 2017). In all the above cases (with the exception of LSTM-pooling in GraphSAGE) the aggregation function is predefined, i.e., not tunable, which can be a disadvantage in general (Ilse et al., 2018). Sum-based aggregation has been advocated based on theoretical findings showing that permutation-invariant functions can be sum-decomposed (Zaheer et al., 2017; Xu et al., 2019). However, recent results (Wagstaff et al., 2019) showed that this universal function representation guarantee requires either highly discontinuous (and thus poorly learnable) mappings, or a latent dimension equal to the maximum number of elements in the set. This suggests that learning set functions that are accurate on sets of large cardinality is difficult.

Inspired by previous work on learning uninorms (Melnikov & Hüllermeier, 2016), we propose a new parametric family of aggregation functions that we call LAF, for learning aggregation functions. A single LAF unit can approximate standard aggregators like sum, max or mean, as well as model intermediate behaviours (possibly different in different areas of the space). In addition, LAF layers with multiple aggregation units can approximate higher-order moments of distributions like variance, skewness or kurtosis. In contrast, other authors (Corso et al., 2020) suggest employing a predefined library of elementary aggregators to be combined. Since LAF can represent sums, it can be seen as a smooth version of the class of functions that are shown in Zaheer et al. (2017) to enjoy universality results in representing set functions.
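To make the interpolation idea concrete, the sketch below shows how a single smooth parametric family can recover sum, mean, and (approximately) max by varying two scalar parameters. It uses a cardinality-rescaled power mean over positive scalars; this family, the function name, and the parameter names are purely illustrative assumptions for this note, not LAF's actual parameterization.

```python
import numpy as np

def power_mean_agg(x, p, a):
    """Illustrative parametric aggregator over a set of positive scalars.

    Computes n**a * (mean(x**p))**(1/p): a generalized power mean
    rescaled by the set cardinality n. Different (p, a) settings
    recover standard aggregators smoothly:
      p=1, a=0  -> mean
      p=1, a=1  -> sum
      p large, a=0 -> approaches max
    Note: this is a toy family for intuition, not the LAF formula.
    """
    x = np.asarray(x, dtype=float)   # assumes x > 0 so x**p is well-defined
    n = len(x)
    return n ** a * np.mean(x ** p) ** (1.0 / p)

x = [1.0, 2.0, 3.0, 4.0]
print(power_mean_agg(x, p=1.0, a=0.0))   # mean -> 2.5
print(power_mean_agg(x, p=1.0, a=1.0))   # sum -> 10.0
print(power_mean_agg(x, p=50.0, a=0.0))  # close to max (4.0)
```

Because the family is smooth in (p, a), these parameters could in principle be fitted by gradient descent, which is the kind of tunability that fixed sum/max pooling lacks.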
The hope is that, being smoother, LAF is more easily learnable. Our empirical findings show that this is indeed the case, especially when asking the model to generalize over large sets. In particular, in this paper we offer an extensive experimental analysis showing that:

• LAF layers can learn a wide range of aggregators (including higher-order moments) on sets of scalars without background knowledge on the nature of the aggregation task;

• LAF layers on top of traditional layers can learn the same wide range of aggregators on sets of high-dimensional vectors (MNIST images);

• LAF outperforms state-of-the-art set learning methods such as DeepSets and PNA on real-world problems involving point clouds and text concept set retrieval.
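For context, the sum-decomposition form f(X) = ρ(Σᵢ φ(xᵢ)) from Zaheer et al. (2017), which DeepSets instantiates, can be sketched in a few lines. The weight shapes, random initialization, and function names below are illustrative assumptions, not the architecture used in the experiments:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy weights for phi (per-element encoder) and rho (post-aggregation
# decoder); sizes are arbitrary choices for this sketch.
W_phi = rng.normal(size=(3, 8))
W_rho = rng.normal(size=(8, 1))

def deepsets(X):
    """Sum-decomposed set function f(X) = rho(sum_i phi(x_i)),
    the form shown by Zaheer et al. (2017) to represent any
    permutation-invariant set function (under their conditions)."""
    H = np.tanh(X @ W_phi)   # phi applied independently to each element
    z = H.sum(axis=0)        # sum aggregation: the only set-level step
    return np.tanh(z @ W_rho).item()

X = rng.normal(size=(5, 3))          # a set of 5 elements, 3 features each
perm = rng.permutation(5)
assert np.isclose(deepsets(X), deepsets(X[perm]))  # order does not matter
```

Since the sum in the middle is the only interaction between set elements, replacing it with a learnable aggregator (as LAF proposes) changes exactly one line of such a model.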

