COLES: CONTRASTIVE LEARNING FOR EVENT SEQUENCES WITH SELF-SUPERVISION

Abstract

We address the problem of self-supervised learning on discrete event sequences generated by real-world users. Self-supervised learning distills complex information from raw data into low-dimensional fixed-length vector representations that can be readily used in various downstream machine learning tasks. In this paper, we propose a new method, CoLES, which adapts contrastive learning, previously used in the audio and computer vision domains, to discrete event sequences in a self-supervised setting. Unlike most previous studies, we theoretically justify, under mild conditions, that the augmentation method underlying CoLES provides representative samples of discrete event sequences. We evaluated CoLES on several public datasets and show that CoLES representations consistently outperform other methods on different downstream tasks.

1. INTRODUCTION

A promising and rapidly growing approach known as self-supervised learning (foot_0) is the main choice for pre-training in situations where the amount of labeled data for the target task of interest is limited. Most research on self-supervised learning has focused on the core machine learning domains, including NLP (e.g., ELMo (Peters et al., 2018), BERT (Devlin et al., 2019)), speech (e.g., CPC (van den Oord et al., 2018)) and computer vision (Doersch et al., 2015; van den Oord et al., 2018).

However, there has been very little research on self-supervised learning in the domain of discrete event sequences, including user behavior sequences (Ni et al., 2018) such as credit card transactions at banks, phone calls and messages at telecom operators, purchase histories in retail, and click-stream data of online services. Produced in many business applications, such data is key to the growth of modern companies. A user behavior sequence is attributed to a single person and captures that person's regular and routine actions of a certain type. The analysis of these sequences constitutes an important sub-field of machine learning (Laxman et al., 2008; Wiese and Omlin, 2009; Zhang et al., 2017; Bigon et al., 2019).

The NLP, audio and computer vision domains are similar in the sense that their data is "continuous": a short term in NLP can be accurately reconstructed from its context (like a pixel from its neighboring pixels). This fact underlies popular NLP approaches to self-supervision such as BERT's Cloze task (Devlin et al., 2019), as well as approaches to self-supervision in audio and computer vision, like CPC (van den Oord et al., 2018). In contrast, for many types of event sequence data, a single token cannot be determined from its nearby tokens, because the mutual information between a token and its context is small. For this reason, most state-of-the-art self-supervised methods are not applicable to event sequence data.
In this paper, we propose the COntrastive Learning for Event Sequences (CoLES) method, which learns low-dimensional representations of discrete event sequences. It is based on a novel, theoretically grounded data augmentation strategy that adapts the ideas of contrastive learning (Xing et al., 2002; Hadsell et al., 2006) to the discrete event sequences domain in a self-supervised setting. The aim of contrastive learning is to map semantically similar objects (positive pairs of images, videos, audio, etc.) closer to each other, while pushing dissimilar ones (negative pairs) further apart. Positive pairs for training are obtained either explicitly, e.g., through a manual labeling process, or implicitly, using data augmentation strategies (Falcon and Cho, 2020). We treat the explicit case as a supervised approach and the implicit case as a self-supervised one. In most applications, where each person is represented by a single sequence of events, there are no explicit positive pairs, and thus only self-supervised approaches are applicable.

Our CoLES method is self-supervised and builds on the observation that event sequences usually exhibit periodicity and repeatability of their events. We propose and theoretically justify a new augmentation algorithm that generates sub-sequences of an observed event sequence and uses them as different high-dimensional views of the same (sequence) object for contrastive learning. Representations produced by the CoLES model can be used directly as a fixed vector of features in a supervised downstream task (e.g., classification), similarly to (Mikolov et al., 2013; Song et al., 2017; Zhai et al., 2019). Alternatively, the trained CoLES model can be fine-tuned (Devlin et al., 2019) for a specific downstream task. We applied CoLES to several user behavior sequence datasets with different downstream classification tasks.
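To illustrate the sub-sequence augmentation idea, the sketch below draws several random contiguous slices from one user's event sequence; any two slices of the same user can then serve as a positive pair for contrastive learning, while slices of different users serve as negatives. The function and parameter names (`sample_subsequences`, `n_views`, etc.) are ours for illustration and do not reflect the authors' actual implementation.

```python
import random

def sample_subsequences(events, n_views=2, min_len=2, max_len=5, seed=None):
    """Draw several contiguous sub-sequences from one event sequence.

    Each sub-sequence is treated as a different 'view' of the same
    user, so two views of one user form a positive pair, while views
    of different users form negative pairs.
    """
    rng = random.Random(seed)
    views = []
    for _ in range(n_views):
        length = rng.randint(min_len, min(max_len, len(events)))
        start = rng.randint(0, len(events) - length)
        views.append(events[start:start + length])
    return views

# Example: two views of one (hypothetical) user's transaction sequence
user_events = ["grocery", "fuel", "cafe", "grocery", "rent", "cafe"]
view_a, view_b = sample_subsequences(user_events, n_views=2, seed=0)
```

Because the slices are drawn from the same underlying sequence, the periodicity and repeatability of user behavior make each slice a representative sample of the whole, which is the property the paper's theoretical analysis formalizes.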
When used directly as feature vectors, CoLES representations achieve strong performance, comparable to hand-crafted features produced by data scientists. We demonstrate that fine-tuned CoLES representations consistently outperform methods based on other representations by a significant margin. We provide the full source code for all the experiments described in the paper (foot_1).

This paper makes the following contributions: (1) we present the CoLES method, which adapts contrastive learning in the self-supervised setting to the discrete event sequence domain; (2) we propose a novel, theoretically grounded augmentation method for discrete event sequences; (3) we demonstrate that CoLES consistently outperforms previously introduced supervised, self-supervised and semi-supervised learning baselines adapted to the event sequence domain.

We also conducted a pilot study on event sequence data of a large European bank. We tested CoLES against the baselines and achieved superior performance on downstream tasks, which produced significant financial gains, measured in hundreds of millions of dollars yearly.

The rest of the paper is organized as follows. In the next section, we discuss related studies on self-supervised and contrastive learning. In Section 3 we introduce our new method, CoLES, for discrete event sequences. In Section 4 we demonstrate that CoLES outperforms several strong baselines, including previously proposed contrastive learning methods adapted to event sequence datasets. Section 5 is dedicated to the discussion of our results and conclusions.

2. RELATED WORK

Contrastive learning has been successfully applied to constructing low-dimensional representations (embeddings) of various objects, such as images (Chopra et al., 2005; Schroff et al., 2015), texts (Reimers and Gurevych, 2019), and audio recordings (Wan et al., 2018). The aim of these studies is to identify an object based on one of its samples (Schroff et al., 2015; Hu et al., 2014; Wan et al., 2018). Their training datasets therefore explicitly contain several independent samples for each object, which form the positive pairs that are a critical component for learning. These supervised approaches are not applicable to our setting.

For situations when positive pairs are not available or their number is limited, augmentation techniques were proposed in the computer vision domain. One of the first frameworks with augmentation was proposed by Dosovitskiy et al. (2014), where surrogate classes for model training were constructed from augmentations of the same image. Several recent works (Bachman et al., 2019; He et al., 2019; Chen et al., 2020) extended this idea with contrastive learning methods; they are nicely summarised by Falcon and Cho (2020). Although the augmentation techniques proposed in these studies perform well empirically, we note that no theoretical justification for these augmentation approaches has been proposed so far.

Contrastive Predictive Coding (CPC) is a self-supervised learning approach proposed for non-discrete sequential data (van den Oord et al., 2018). CPC extracts meaningful representations by predicting latent representations of future observations of the input sequence with an autoregressive model. CPC representations demonstrated strong performance in four distinct domains: audio, computer vision, natural language and reinforcement learning. We adapted the CPC-based approach to the domain of discrete event sequences and compared it with our CoLES approach (see Section 4.2).
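As a rough illustration of the objective behind CPC, the snippet below computes an InfoNCE-style loss for a single prediction step, assuming the latent vectors have already been produced by an encoder and an autoregressive context model. The function name, the plain-list vector representation, and the dot-product scoring are our simplifications for exposition, not CPC's actual implementation.

```python
import math

def info_nce_loss(context, true_future, negatives):
    """InfoNCE-style loss for one prediction step (illustrative).

    context:     context vector c_t from the autoregressive model
    true_future: latent z_{t+k} of the actual future observation
    negatives:   latents sampled from other positions or sequences
    The loss is the negative log-probability of identifying the true
    future among all candidates, scored by dot-product similarity.
    """
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    # The true future is candidate 0; negatives follow.
    scores = [dot(context, true_future)] + [dot(context, z) for z in negatives]
    max_s = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - max_s) for s in scores]
    return -math.log(exps[0] / sum(exps))

# A context aligned with the true future scores below the chance
# level of log(#candidates) = log(3) ≈ 1.10:
loss = info_nce_loss([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0], [-1.0, 0.0]])
# loss ≈ 0.41
```

Minimizing this quantity pushes the context representation to be predictive of future latents, which is what makes CPC representations useful downstream; the difficulty in the discrete event sequence domain, as noted in the introduction, is that a single future token carries little mutual information with its context.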



foot_0: See, e.g., keynote by Yann LeCun at ICLR-20: https://www.iclr.cc/virtual_2020/speaker_7.html
foot_1: https://github.com/***/*** (the link was anonymized for double-blind peer review purposes)

