USERBERT: SELF-SUPERVISED USER REPRESENTATION LEARNING

Abstract

This paper extends the BERT model to user data for pretraining user representations in a self-supervised way. By viewing actions (e.g., purchases and clicks) in behavior sequences (i.e., usage history) analogously to words in sentences, we propose methods for tokenization and for generating input representation vectors, as well as a novel pretext task that enables the model to learn from its own input, avoiding the burden of collecting additional labeled data. Further, our model adopts a unified structure to simultaneously learn from long-term and short-term user behavior as well as user profiles. Extensive experiments demonstrate that the learned representations yield significant improvements when transferred to three different real-world tasks, particularly in comparison with task-specific modeling and with representations obtained from multi-task learning.

1. INTRODUCTION

The choice of data representations, i.e., how to create meaningful features, has a tremendous impact on the performance of machine learning applications (Bengio et al., 2013). Therefore, data processing and feature engineering have been decisive steps in developing machine learning models. To extend the applicability of such models, recent research on representation learning aims to discover the underlying explanatory factors hidden in raw data. With rapid advances in this direction, we have witnessed many breakthroughs in computer vision (CV) (Doersch et al., 2015; Sharif Razavian et al., 2014; Simo-Serra et al., 2015) and natural language processing (NLP) (Mikolov et al., 2013; Pennington et al., 2014; Lin et al., 2017). Similarly, for building user-oriented industrial applications such as next-purchase prediction and recommendation, much effort has been spent on understanding business models and user behavior to create useful features (Richardson et al., 2007; Covington et al., 2016). This is a time-consuming and application-specific process, and it is challenging to reuse these features or to share the gained knowledge across different services and applications.

To address the issues of isolated feature engineering and task-oriented pipeline design, the pretraining-transfer learning paradigm has been explored. For example, multi-task learning (MTL) has shown promising results (Ni et al., 2018). Nevertheless, MTL has intrinsic challenges, e.g., deciding which tasks to learn jointly (Standley et al., 2019) or how to weigh tasks (Kendall et al., 2018) to achieve optimal performance. More importantly, the learning still hinges on large amounts of well-annotated user labels. Inspired by the BERT model and its variants (Devlin et al., 2019; Lan et al., 2020), this paper explores the feasibility of understanding users in a way similar to how language is understood.
We think this analogy is conceptually intuitive since understanding language and understanding users share a similar goal, i.e., understanding a conveyed message, but through different mediums: the former models what is said (sentences) while the latter learns from what is done (behavior). The syntax and semantics of a sentence are comparable to the behavioral patterns and characteristics of a user. Hence, we hypothesize that the learning procedure can be consistent in methodology as well, and propose to build upon BERT for pretraining user representations on unlabeled behavior data. Our proposal, UserBERT, simultaneously learns from three categories of user data, i.e., long-term and short-term behavior as well as user profiles, via a unified architecture. In particular, different action types (e.g., page views, clicks and purchases) and attributes (e.g., shop and item genre)
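To make the sentence analogy concrete, the sketch below shows one way a behavior sequence could be mapped to discrete token IDs, much as words are mapped to a vocabulary before being fed to BERT. This is only an illustrative sketch under our own assumptions: the `BehaviorTokenizer` class, its field names (`type`, `attrs`), and the special tokens are hypothetical and do not reproduce the paper's actual tokenizer.

```python
# Hypothetical sketch: treating user actions like words in a sentence.
# The action schema and vocabulary construction here are illustrative
# assumptions, not the paper's actual tokenization method.

class BehaviorTokenizer:
    def __init__(self):
        # Special tokens mirror BERT's [PAD]/[CLS]/[SEP]/[MASK] conventions.
        self.vocab = {"[PAD]": 0, "[CLS]": 1, "[SEP]": 2, "[MASK]": 3}

    def _token(self, action):
        # One action = action type plus its attributes, joined into a
        # single discrete token, e.g. "click|genre:books|shopA".
        return "|".join([action["type"]] + sorted(action["attrs"]))

    def encode(self, actions):
        # A behavior sequence becomes [CLS] a_1 ... a_n [SEP], analogous
        # to a tokenized sentence.
        ids = [self.vocab["[CLS]"]]
        for action in actions:
            tok = self._token(action)
            if tok not in self.vocab:  # grow the vocabulary on the fly
                self.vocab[tok] = len(self.vocab)
            ids.append(self.vocab[tok])
        ids.append(self.vocab["[SEP]"])
        return ids

tokenizer = BehaviorTokenizer()
seq = tokenizer.encode([
    {"type": "view", "attrs": ["shopA", "genre:books"]},
    {"type": "click", "attrs": ["shopA", "genre:books"]},
    {"type": "purchase", "attrs": ["shopA", "genre:books"]},
])
```

Because each token fuses an action type with its attributes, two actions on the same item but of different types (a view versus a purchase) receive distinct IDs, which is what lets a BERT-style model learn behavioral patterns the way it learns word co-occurrence.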

