USERBERT: SELF-SUPERVISED USER REPRESENTATION LEARNING

Abstract

This paper extends the BERT model to user data for pretraining user representations in a self-supervised way. By viewing actions (e.g., purchases and clicks) in behavior sequences (i.e., usage history) analogously to words in sentences, we propose methods for tokenization, for the generation of input representation vectors, and a novel pretext task that enables the pretraining model to learn from its own input, removing the burden of collecting additional data. Further, our model adopts a unified structure to simultaneously learn from long-term and short-term user behavior as well as user profiles. Extensive experiments demonstrate that the learned representations result in significant improvements when transferred to three different real-world tasks, particularly in comparison with task-specific modeling and with representations obtained from multi-task learning.

1. INTRODUCTION

The choice of data representations, i.e., how to create meaningful features, has a tremendous impact on the performance of machine learning applications (Bengio et al., 2013). Therefore, data processing and feature engineering have been decisive steps in developing machine learning models. To extend the applicability of these models, recent research on representation learning aims to discover the underlying explanatory factors hidden in raw data. With rapid advances in this direction, we have witnessed many breakthroughs in the areas of computer vision (CV) (Doersch et al., 2015; Sharif Razavian et al., 2014; Simo-Serra et al., 2015) and natural language processing (NLP) (Mikolov et al., 2013; Pennington et al., 2014; Lin et al., 2017). Similarly, for building user-oriented industrial applications like next-purchase prediction and recommendation, much effort has been spent on understanding business models and user behavior to create useful features (Richardson et al., 2007; Covington et al., 2016). This is a time-consuming and application-specific process. It is also challenging to reuse these features or share the gained knowledge between different services and applications.

To solve the issues of isolated feature engineering and task-oriented pipeline design, the pretraining-transfer learning paradigm has been explored. For example, multi-task learning (MTL) has shown promising results (Ni et al., 2018). Nevertheless, MTL has intrinsic challenges, e.g., deciding which tasks to learn jointly (Standley et al., 2019) or how to weigh tasks (Kendall et al., 2018) to achieve optimal performance. More importantly, the learning still hinges on large amounts of well-annotated user labels. Inspired by the BERT model and its variations (Devlin et al., 2019; Lan et al., 2020), this paper explores the feasibility of understanding users in a similar way to how language is understood.
We think it is conceptually intuitive to make such an analogy, since understanding language and understanding users share a similar goal, i.e., understanding a conveyed message, but through different mediums. The former models what is said (sentences) while the latter learns from what is done (behavior). The syntax and semantics of a sentence are comparable with the behavioral patterns and the characteristics of a user. Hence, we hypothesize that the learning procedure can be consistent in methodology as well, and propose to build upon BERT for pretraining user representations on unlabeled behavior data.

Our proposal, UserBERT, simultaneously learns from three categories of user data, i.e., long-term and short-term behavior as well as user profiles, via a unified architecture. In particular, different action types (e.g., page views, clicks and purchases) and attributes (e.g., shop and item genre) are chosen to represent long-term and short-term user behavior. For these two behavior types, we first present distinct strategies to discretize them into a sequence of behavioral words. Instead of modeling single user actions sequentially, the applied discretization leads to better generalization. The token representation of these behavioral words is computed by concatenating and averaging the word embeddings of the attribute IDs in each action, followed by the summation of token, position and segment embeddings. These representation vectors are finally aligned with the word embeddings of the user's categorical profiles as the input to UserBERT. On this input, we design a novel pretext task, masked multi-label classification, and the UserBERT model is pretrained by optimizing the multi-label classification of the multiple attributes in the masked behavioral words. Despite the parallels between user behavior and sentences, there are substantial differences and challenges in designing the learning procedure in a coherent way.
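To make the input construction concrete, the following is a minimal NumPy sketch of one plausible reading of the description above: per action, the embeddings of its attribute IDs are concatenated; per behavioral word, the concatenated vectors of its grouped actions are averaged; position and segment embeddings are then summed in, as in BERT. All names, dimensions and the shared attribute vocabulary are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

EMB_DIM = 16     # per-attribute embedding size (illustrative)
N_ATTRS = 3      # e.g., action type, shop ID, item genre
VOCAB = 100      # attribute-ID vocabulary (shared here for brevity)
MAX_POS = 32     # maximum sequence length
N_SEGMENTS = 3   # long-term behavior, short-term behavior, profile

attr_emb = rng.normal(size=(VOCAB, EMB_DIM))
pos_emb = rng.normal(size=(MAX_POS, N_ATTRS * EMB_DIM))
seg_emb = rng.normal(size=(N_SEGMENTS, N_ATTRS * EMB_DIM))

def behavioral_word_embedding(actions):
    """actions: list of attribute-ID tuples, one tuple per raw action.
    Concatenate the attribute embeddings of each action, then average
    over the actions grouped into this behavioral word."""
    per_action = [np.concatenate([attr_emb[a] for a in ids]) for ids in actions]
    return np.mean(per_action, axis=0)

def input_representation(words, segment_id):
    """words: list of behavioral words (each a list of actions).
    Sum token, position and segment embeddings, as in BERT."""
    tokens = np.stack([behavioral_word_embedding(w) for w in words])
    positions = pos_emb[: len(words)]
    segment = seg_emb[segment_id][None, :]
    return tokens + positions + segment

# Example: two behavioral words, each grouping two 3-attribute actions.
seq = [[(1, 5, 9), (2, 5, 9)], [(3, 7, 11), (3, 8, 11)]]
x = input_representation(seq, segment_id=0)
print(x.shape)  # (2, 48)
```

Averaging over the actions inside a behavioral word is what makes the token representation invariant to the exact ordering of raw actions within the discretized group.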
Our model is able to deal with heterogeneous user behavior data, and achieves generalization via effective tokenization and the pretraining task. While there is prior work applying BERT to task-specific user modeling (Sun et al., 2019b), this paper is built upon the assumption that behavioral patterns can be understood like the structure of a language. The UserBERT model explores integrating various types of user data in a unified architecture and learning generic representations with self-supervised signals. In our experiments, the pretrained model is fine-tuned on three different real-world tasks, and the results show that UserBERT outperforms task-specific modeling and multi-task learning based pretraining. Our contributions are summarized as follows:

• We propose UserBERT, a self-supervised learning model, to pretrain user representations by analogizing actions in a user behavior sequence to words in sentences. It eliminates the need of previous approaches to collect additional user-annotated labels.

• We design the discretization of raw user data sequences, the generation of the input representation, and a novel pretext task for pretraining.

• UserBERT adopts a unified model architecture to enable simultaneous learning from heterogeneous data, including long- and short-term behavior as well as demographics.

• We demonstrate the empirical power of UserBERT with extensive experiments. Our model is compared with task-specific models without pretraining and with multi-task learning based pretraining models, and achieves performance gains on three real-world applications.
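The masked multi-label classification pretext task mentioned above can be sketched as follows: some behavioral words are masked, and for each masked word the model must predict the set of attribute IDs it contains, scored with a binary cross-entropy over a multi-hot target. This is a hedged reading of the paper's one-line description; the function names, masking rate and vocabulary size are illustrative assumptions.

```python
import numpy as np

def masked_word_targets(words, mask_prob=0.15, rng=None):
    """Pick behavioral words to mask; the target for each masked word is
    the set of attribute IDs appearing in its actions (multi-label)."""
    rng = rng or np.random.default_rng(0)
    masked_idx, targets = [], []
    for i, actions in enumerate(words):
        if rng.random() < mask_prob:
            masked_idx.append(i)
            targets.append(sorted({a for ids in actions for a in ids}))
    return masked_idx, targets

def multilabel_bce(logits, labels, vocab_size):
    """Binary cross-entropy of sigmoid(logits) against a multi-hot
    target vector over the attribute-ID vocabulary."""
    y = np.zeros(vocab_size)
    y[labels] = 1.0
    p = 1.0 / (1.0 + np.exp(-logits))
    eps = 1e-9
    return float(-np.mean(y * np.log(p + eps) + (1.0 - y) * np.log(1.0 - p + eps)))

# Example: mask everything to see the multi-label targets per word.
words = [[(1, 5, 9), (2, 5, 9)], [(3, 7, 11), (3, 8, 11)]]
idx, targets = masked_word_targets(words, mask_prob=1.0)
print(idx, targets)  # [0, 1] [[1, 2, 5, 9], [3, 7, 8, 11]]
loss = multilabel_bce(np.zeros(12), targets[0], vocab_size=12)
```

Because the prediction is over a set of attributes rather than a single token, each mask yields several supervision signals at once, which is what lets the model learn from heterogeneous multi-attribute actions without any external labels.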

2. RELATED WORK
2.1. PRETRAINING AND TRANSFER LEARNING

Recent studies have demonstrated that pretraining on large, auxiliary datasets followed by fine-tuning on target tasks is a promising paradigm for boosting performance (Oquab et al., 2014; Donahue et al., 2014; Hendrycks et al., 2019; Ghadiyaram et al., 2019). Multi-task learning has been one of the commonly adopted approaches for pretraining due to its ability to improve generalization (Zhang & Yang, 2017; Ruder, 2017). It has been shown that pretrained MTL models can boost performance even when transferred to unseen tasks (Liu et al., 2015; Ni et al., 2018). Despite its success, MTL still faces many challenges, such as negative transfer and the learning adjustment between different tasks (Guo et al., 2018). Also, MTL requires large amounts of well-annotated labels to produce satisfactory results. There are two common forms of adaptation when transferring pretrained models to a given target task: feature-based adaptation, in which the pretrained weights are frozen, and fine-tuning, in which the pretrained model is updated directly (Peters et al., 2019). We fine-tune pretrained models in our experiments.
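The two adaptation forms can be illustrated with a toy linear encoder and task head: in the feature-based form, gradient updates touch only the head while the pretrained encoder stays frozen; in fine-tuning, both are updated. This is a minimal NumPy sketch under illustrative shapes and a squared loss, not the paper's actual training setup.

```python
import numpy as np

rng = np.random.default_rng(0)
W_enc = rng.normal(size=(4, 8))   # toy "pretrained" encoder weights
W_head = np.zeros((8, 1))         # task-specific head, trained from scratch

def sgd_step(x, y, lr=0.1, finetune=False):
    """One SGD step on 0.5 * ||x @ W_enc @ W_head - y||^2.
    finetune=False: feature-based transfer, the encoder is frozen.
    finetune=True: the whole model is updated."""
    global W_enc, W_head
    h = x @ W_enc                      # features from the pretrained encoder
    err = h @ W_head - y               # d(loss)/d(prediction)
    g_head = h.T @ err                 # gradients computed before any update
    g_enc = x.T @ (err @ W_head.T)
    W_head -= lr * g_head
    if finetune:
        W_enc -= lr * g_enc

x, y = rng.normal(size=(2, 4)), rng.normal(size=(2, 1))
frozen = W_enc.copy()
sgd_step(x, y, finetune=False)
print(np.allclose(W_enc, frozen))  # True: encoder untouched
sgd_step(x, y, finetune=True)
print(np.allclose(W_enc, frozen))  # False: fine-tuning updates the encoder
```

The same distinction carries over to deep models: freezing corresponds to excluding the pretrained parameters from the optimizer, while fine-tuning passes all parameters to it, usually with a smaller learning rate for the pretrained part.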

2.2. SELF-SUPERVISED LEARNING

Deep learning models can already compete with humans on challenging tasks like semantic segmentation in the CV area (He et al., 2015) as well as a few language understanding tasks (Liu et al., 2019). However, such success relies on adequate amounts of quality training data, which can be extremely expensive or even impossible to obtain (Kolesnikov et al., 2019). As a result, a lot of

