CONSTRAINING LATENT SPACE TO IMPROVE DEEP SELF-SUPERVISED E-COMMERCE PRODUCT EMBEDDINGS FOR DOWNSTREAM TASKS

Abstract

The representation of products in an e-commerce marketplace is a key asset to be exploited when trying to improve the user experience on the site. Well-known examples of the importance of a good product representation are tasks such as product search or product recommendation. There is, however, a multitude of lesser-known tasks relevant to the business, such as the detection of counterfeit items, the estimation of package sizes, or the categorization of products, among others. In this setting, good vector representations of products that can be reused across different tasks are very valuable. Past years have seen a major increase in research on latent representations for products in e-commerce. Examples are models like Prod2Vec or Meta-Prod2Vec, which leverage the information of a user session in order to generate product vectors that can be used in product recommendation. This work proposes a novel deep encoder model for learning product embeddings to be applied in several downstream tasks. The model uses pairs of products that appear together in a user's browsing session and adds a proximity constraint on the final latent space in order to project the embeddings of similar products close to each other. This has a regularization effect which yields better feature representations for use across multiple downstream tasks; we explore this effect in our experimentation by assessing its impact on the performance of the tasks. Our experiments show effectiveness in transfer learning scenarios comparable to several industrial baselines.

1. INTRODUCTION

The e-commerce environment has been growing at a fast rate in recent years, and new tasks pose new challenges to be solved. Some key tasks like product search and recommendation usually have large amounts of data available and dedicated teams working on them. On the other hand, some lesser-known but still valuable tasks have less quality annotated data available, and the main goal is to solve them with a small investment. Examples of the latter are counterfeit/forbidden product detection, package size estimation, etc. For these scenarios, the use of complex systems is discouraged in favor of industry-proven baselines like bag-of-words or fastText (Joulin et al., 2016). In particular, with the advent of "Feature Stores" (Li et al., 2017), industrial applications are seeing a rise in the adoption of organization-wide representations of business entities (customers, products, etc.). These are needed in order to speed up the process of building machine learning pipelines, enabling both batch training and real-time predictions with as little effort as possible. In the present work we explore representation learning for marketplace products to apply in downstream tasks. More specifically, we aim to train an encoder that can transform products into embeddings to be used as features of a linear classifier for a specific task, thus avoiding feature engineering for the task. The encoder model is trained in a self-supervised fashion by leveraging browsing session data of users in our marketplace. Using product metadata and an architecture inspired by the recent work of Grill et al. (2020), we explore how the use of pairs of products in a session can enable transfer learning into several downstream tasks. As we discuss further in Section 3, we extend the work of Grill et al. (2020) with a new objective function that combines their original idea with a cross-entropy objective.
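To make the intended downstream usage concrete, the sketch below fits a logistic-regression probe on frozen product embeddings, which is the role the linear classifier plays in our setup. The toy data, the `train_linear_probe` name, and the plain per-sample gradient-descent loop are illustrative assumptions of ours, not the paper's implementation.

```python
import math

def train_linear_probe(embeddings, labels, epochs=200, lr=0.5):
    """Fit a binary logistic-regression probe on frozen embeddings.

    The encoder is never updated here: embeddings are treated as
    fixed feature vectors, so no feature engineering is needed.
    """
    dim = len(embeddings[0])
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(embeddings, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))   # sigmoid
            g = p - y                        # gradient of binary cross-entropy
            for i in range(dim):
                w[i] -= lr * g * x[i]
            b -= lr * g
    return w, b

def predict(w, b, x):
    """Predict the class of one embedding with the trained probe."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if z > 0 else 0
```

In practice the embeddings would come from the trained encoder and the labels from a specific downstream task (e.g. counterfeit detection); any off-the-shelf linear classifier could replace this hand-rolled loop.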
Our experiments show that the added objective helps the model converge to better representations for our tasks. We also show, through experimental evaluation, that the encoder model learns good representations that achieve results comparable with several strong baselines, including fastText (Joulin et al., 2016), Meta-Prod2Vec (Vasile et al., 2016), Text Convolutional Networks (Kim, 2014) and BERT (Devlin et al., 2018), on a set of downstream tasks that come from some of our industrial datasets. This paper is structured as follows: Section 2 presents other works in the area of product representation as well as the works we take inspiration from to design the encoder model, and establishes how our approach differs from the previous literature. Section 3 describes in detail all the components of our proposed architecture. Section 4 lists our experimental evaluation setup. Section 5 shows the results of our experimentation. Finally, in Section 6 we summarize our findings and outline our future work.

2. BACKGROUND

Recent years have seen a dramatic increase in the use of latent representations, which have proven relevant in transfer learning scenarios in Computer Vision with the aid of large pre-trained models (Raina et al., 2007; Huh et al., 2016). More recently, with the aid of architectures for training unsupervised language models like LSTMs (Merity et al., 2017) or the attention mechanism (Vaswani et al., 2017), transfer learning has seen an explosion of applications in Natural Language Processing (Howard & Ruder, 2018; Devlin et al., 2018; Radford et al., 2018). For the e-commerce environment, there is extensive research in the area of latent representations for some of the main tasks. In the area of recommender systems there is a very large body of work in which the idea is to use information from the user shopping session to generate latent representations of the products. The Prod2Vec algorithm (Grbovic et al., 2015) proposed the use of word2vec (Mikolov et al., 2013) on sequences of product receipts coming from emails. The Meta-Prod2Vec algorithm (Vasile et al., 2016) extended Prod2Vec by adding information on the metadata of a user shopping session. Using product metadata during a stream of user clicks is explored with the aid of parallel recurrent neural networks (Hidasi et al., 2016), where the authors use the images and text of a product in order to have richer features to model the products in the session. Another work that uses more metadata, in this case the user review of a product, is DeepCoNN (Zheng et al., 2017), which consists of two parallel neural networks coupled in the last layers: one network learns user behaviour and the other learns product properties, based on the reviews written.
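The core of the Prod2Vec idea, treating a session (or receipt sequence) of product IDs like a sentence of words, reduces to skip-gram pair extraction before the word2vec objective is applied. The minimal sketch below illustrates that step only; the function name and default window size are our own choices, not those of Grbovic et al. (2015).

```python
def skipgram_pairs(session, window=2):
    """Generate (target, context) product pairs from one browsing
    session, word2vec skip-gram style: every product is paired with
    its neighbours inside a symmetric context window."""
    pairs = []
    for i, target in enumerate(session):
        lo = max(0, i - window)
        hi = min(len(session), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, session[j]))
    return pairs
```

These pairs would then feed a standard word2vec training loop (e.g. skip-gram with negative sampling), yielding one embedding per product ID; Meta-Prod2Vec additionally injects metadata tokens into the same pair stream.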
There is also work in the area of modelling information in session-aware recommender systems (Twardowski, 2016), where user information is not present and the focus leans towards using the session information to recommend products. This extensive research on representation learning for marketplace products is heavily influenced by the end goal of recommendation. Many of these works also leverage the unsupervised information available, such as sessions, reviews, metadata, etc. In this work the end goal of the representations is not recommendation, but the different downstream tasks that arise from challenges we face in our marketplace. To that end we propose a deep encoder architecture that follows the work presented in "Bootstrap Your Own Latent" (BYOL) (Grill et al., 2020), with the intended objective of learning embeddings of products of the same session close to each other in the latent space. However, our experiments showed that this was not enough to ensure the transfer of knowledge; we therefore extended the learning objective of BYOL with a cross-entropy objective using the product category as target, and we explore how the correct combination of each part of the objective function impacts the quality of the final embeddings. The main contributions of our paper are the following: 1) a novel deep encoder architecture that can be trained on pairs of products found in user browsing sessions, 2) a study of how this architecture performs on downstream tasks compared to some strong baselines, 3) an extension of the BYOL architecture to a domain different from the one proposed by Grill et al. (2020), and an analysis of how it impacts the final results.
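The combined objective described above can be sketched as a weighted sum of the BYOL regression term (2 minus twice the cosine similarity between the online prediction and the target projection) and a softmax cross-entropy on category logits. The weighting coefficient `lam` and the exact functional form here are illustrative assumptions; the precise objective used in this work is defined in Section 3.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (returns v unchanged if zero)."""
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]

def byol_loss(pred, target):
    """BYOL-style regression term: 2 - 2 * cos(pred, target).
    Zero when the normalized vectors coincide, 4 when opposite."""
    p, t = l2_normalize(pred), l2_normalize(target)
    return 2.0 - 2.0 * sum(a * b for a, b in zip(p, t))

def cross_entropy(logits, label):
    """Softmax cross-entropy on category logits (max-shifted for
    numerical stability)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    return -math.log(exps[label] / sum(exps))

def combined_loss(pred, target, logits, label, lam=1.0):
    """Hypothetical combination of the two terms: the BYOL part pulls
    same-session products together, the cross-entropy part anchors
    the space with product-category supervision."""
    return byol_loss(pred, target) + lam * cross_entropy(logits, label)
```

Varying `lam` trades off the session-proximity constraint against the category signal, which is the balance whose effect on embedding quality the experiments examine.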

