A TRAINABLE OPTIMAL TRANSPORT EMBEDDING FOR FEATURE AGGREGATION AND ITS RELATIONSHIP TO ATTENTION

Abstract

We address the problem of learning on sets of features, motivated by the need to perform pooling operations in long biological sequences of varying sizes, with long-range dependencies and possibly few labeled data. To address this challenging task, we introduce a parametrized representation of fixed size, which embeds and then aggregates elements from a given input set according to the optimal transport plan between the set and a trainable reference. Our approach scales to large datasets and allows end-to-end training of the reference, while also providing a simple unsupervised learning mechanism with small computational cost. Our aggregation technique admits two useful interpretations: it may be seen as a mechanism related to attention layers in neural networks, or as a scalable surrogate of a classical optimal transport-based kernel. We experimentally demonstrate the effectiveness of our approach on biological sequences, achieving state-of-the-art results on protein fold recognition and chromatin profile detection tasks, and, as a proof of concept, we show promising results for processing natural language sequences. We provide an open-source implementation of our embedding that can be used alone or as a module in larger learning models at https://github.com/claying/OTK.

1. INTRODUCTION

Many scientific fields, such as bioinformatics or natural language processing (NLP), require processing sets of features with positional information (biological sequences, or sentences represented by sets of local features). These objects are delicate to manipulate due to their varying lengths and the potentially long-range dependencies between their elements. For many tasks, the difficulty is even greater since the sets can be arbitrarily large, or only provided with few labels, or both. Deep learning architectures specifically designed for sets have recently been proposed (Lee et al., 2019; Skianis et al., 2020). Our experiments show that these architectures perform well for NLP tasks, but achieve mixed performance for long biological sequences of varying size with few labeled data. Some of these models use attention (Bahdanau et al., 2015), a classical mechanism for aggregating features. Its typical implementation is the transformer (Vaswani et al., 2017), which has been shown to achieve state-of-the-art results for many sequence modeling tasks, e.g., in NLP (Devlin et al., 2019) or in bioinformatics (Rives et al., 2019), when trained with self-supervision on large-scale data. Beyond sequence modeling, in this paper we are interested in finding a good representation for sets of features of potentially diverse sizes, with or without positional information, when the amount of training data may be scarce. To this end, we introduce a trainable embedding, which can operate directly on the feature set or be combined with existing deep approaches. More precisely, our embedding marries ideas from optimal transport (OT) theory (Peyré & Cuturi, 2019) and kernel methods (Schölkopf & Smola, 2001). We call this embedding OTKE (Optimal Transport Kernel Embedding).
Concretely, we embed the feature vectors of a given set into a reproducing kernel Hilbert space (RKHS) and then perform a weighted pooling operation, with weights given by the transport plan between the set and a trainable reference. To gain scalability, we then obtain a finite-dimensional embedding by using kernel approximation techniques (Williams & Seeger, 2001). The motivation for using kernels is to provide a non-linear transformation of the input features before pooling, whereas optimal transport allows us to align the features on a trainable reference with fast algorithms (Cuturi, 2013). Such a combination provides us with a theoretically grounded, fixed-size embedding that can be learned either without any label or with supervision. Our embedding can indeed become adaptive to the problem at hand, by optimizing the reference with respect to a given task. It can operate on large sets of varying size, model long-range dependencies when positional information is present, and scales gracefully to large datasets. We demonstrate its effectiveness on biological sequence classification tasks, including protein fold recognition and the detection of chromatin profiles, where we achieve state-of-the-art results. We also show promising results in natural language processing tasks, where our method outperforms strong baselines. Contributions. In summary, our contribution is three-fold. We propose a new method to embed sets of features of varying sizes into fixed-size representations that are well adapted to downstream machine learning tasks, and whose parameters can be learned in either an unsupervised or a supervised fashion. We demonstrate the scalability and effectiveness of our approach on biological and natural language sequences. We provide an open-source implementation of our embedding that can be used alone or as a module in larger learning models.
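To make the aggregation step concrete, the following is a minimal numerical sketch of the idea, not the paper's implementation: it uses a plain dot-product similarity in place of the kernel embedding, computes an entropy-regularized transport plan between the input set and the reference with a few Sinkhorn iterations (Cuturi, 2013), and pools the features with the plan's weights. The function names sinkhorn_plan and otke_pool are ours.

```python
import numpy as np

def sinkhorn_plan(K, n_iter=50):
    """Entropic OT plan between uniform marginals, given a positive
    Gibbs kernel K of shape (n, p); returns a plan P whose rows sum
    to 1/n and whose columns sum (approximately) to 1/p."""
    n, p = K.shape
    u = np.ones(n)
    v = np.ones(p)
    for _ in range(n_iter):
        v = (1.0 / p) / (K.T @ u)  # rescale columns
        u = (1.0 / n) / (K @ v)    # rescale rows
    return u[:, None] * K * v[None, :]

def otke_pool(x, z, eps=0.1, n_iter=50):
    """Pool a variable-size set x of shape (n, d) against a trainable
    reference z of shape (p, d); the (p, d) output has a fixed size
    regardless of n."""
    sim = x @ z.T                       # similarity between set and reference
    P = sinkhorn_plan(np.exp(sim / eps), n_iter)
    return z.shape[0] * (P.T @ x)       # one weighted average per reference row
```

Whatever the input length n, the output has shape (p, d); in the actual model, learning the reference amounts to backpropagating through the Sinkhorn iterations.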

2. RELATED WORK

Kernel methods for sets and OT-based kernels. The kernel associated with our embedding belongs to the family of match kernels (Lyu, 2004; Tolias et al., 2013), which compare all pairs of features between two sets via a similarity function. Another line of research builds kernels by matching features through the Wasserstein distance. A few of these kernels are shown to be positive definite (Gardner et al., 2018) and/or fast to compute (Rabin et al., 2011; Kolouri et al., 2016). Except for a few hyperparameters, however, these kernels cannot be trained end-to-end, as opposed to our embedding, which relies on a trainable reference. Efficient and trainable kernel embeddings for biological sequences have also been proposed by Chen et al. (2019a;b). Our work can be seen as an extension of these earlier approaches, using optimal transport rather than mean pooling to aggregate local features, which performs significantly better for long sequences in practice. Deep learning for sets. Deep Sets (Zaheer et al., 2017) feeds each element of an input set into a feed-forward neural network. The outputs are aggregated with a simple pooling operation before further processing. Lee et al. (2019) propose a Transformer-inspired encoder-decoder architecture for sets which also uses latent variables. Skianis et al. (2020) compute comparison costs between an input set and reference sets; these costs are then used as features in a subsequent neural network, and the reference sets are learned end-to-end. Unlike our approach, such models do not allow unsupervised learning. We will use the last two approaches as baselines in our experiments. Interpretations of attention. Using the transport plan as an ad-hoc attention score was proposed by Chen et al. (2019c) in the context of network embedding, to align data modalities. Our paper goes further and uses the transport plan as a principle for pooling a set within a model with trainable parameters. Tsai et al. (2019) provide a view of the Transformer's attention via kernel methods, yet in a very different fashion, where attention is cast as kernel smoothing and not as a kernel embedding.

3.1. PRELIMINARIES

We handle sets of features in R^d and consider sets x living in X = { x | x = {x_1, . . . , x_n} such that x_1, . . . , x_n ∈ R^d for some n ≥ 1 }. Elements of X are typically vector representations of local data structures, such as k-mers for sequences, patches for natural images, or words for sentences. The size of x, denoted by n, may vary; this is not an issue, since the methods we introduce may take a sequence of any size as input while providing a fixed-size embedding. We now revisit important results on optimal transport and kernel methods, which will be useful to describe our embedding and its computation algorithms.
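As a simple illustration of such a set representation (our toy example, not part of the paper), the sketch below encodes a DNA sequence as a set of one-hot k-mer feature vectors: each position yields one vector in R^d with d = k·|alphabet|, and the set size n varies with the sequence length.

```python
import numpy as np

ALPHABET = "ACGT"  # toy DNA alphabet; proteins would use 20 amino acids

def kmer_features(seq, k=3):
    """Turn a sequence into a set of one-hot k-mer vectors:
    one d-dimensional feature per position, d = k * |ALPHABET|."""
    idx = {c: i for i, c in enumerate(ALPHABET)}
    feats = []
    for start in range(len(seq) - k + 1):
        v = np.zeros(k * len(ALPHABET))
        for j, c in enumerate(seq[start:start + k]):
            v[j * len(ALPHABET) + idx[c]] = 1.0
        feats.append(v)
    return np.stack(feats)  # shape (n, d), with n = len(seq) - k + 1
```

Two sequences of different lengths thus give sets of different sizes n, but with features of the same dimension d, which is exactly the setting the embedding is designed for.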

