WHERE TO GO NEXT FOR RECOMMENDER SYSTEMS? ID- VS. MODALITY-BASED RECOMMENDER MODELS REVISITED

Abstract

Recommender models that use unique identities (IDs for short) to represent distinct users and items have been the state of the art and have dominated the recommender system (RS) literature for over a decade. In parallel, pre-trained modality encoders, such as BERT (Devlin et al., 2018) and ResNet (He et al., 2016), have become increasingly powerful at modeling raw modality features, e.g., text and images. In light of this, a natural question arises: can modality-only (a.k.a. content-only) recommender models (MoRec) exceed or match ID-only models (IDRec) when item modality features are available? In fact, this question was answered once, a decade ago, when IDRec beat MoRec by strong margins in both recommendation accuracy and efficiency. We aim to revisit this 'old' question and study MoRec systematically from several aspects. Specifically, we study the following sub-questions: (i) which recommender paradigm, MoRec or IDRec, performs better in practical scenarios, including the regular, cold-start and new-item settings, and does the answer hold for items with different modality features? (ii) will MoRec benefit from the latest technical advances in the corresponding communities, for example, natural language processing and computer vision? (iii) what is an effective way to leverage item modality representations: freezing them, or adapting them by fine-tuning on new data? (iv) are there other factors that affect the efficacy of MoRec? To answer these questions, we conduct rigorous experiments on item recommendation with two popular modalities, i.e., text and vision. We provide empirical evidence that MoRec with standard end-to-end training is highly competitive and even exceeds IDRec in some cases. Many of our observations imply that the dominance of IDRec in recommendation accuracy does not hold well when items' raw modality features are available. We promise to release all related code & datasets upon acceptance.

1. INTRODUCTION

Recommender systems (RS) model the historical interactions of users and items and recommend items that users may interact with in the future. RS play a key role in search engines, advertising systems, e-commerce websites, video and music streaming services, and various other Internet platforms. Mainstream recommender models usually use unique IDs to represent items and can be broadly categorized into two classes: two-tower architectures (Rendle et al., 2012; Huang et al., 2013) and sequence- or session-based neural architectures (Hidasi et al., 2015; Yuan et al., 2019; Kang & McAuley, 2018; Sun et al., 2019). These ID-only or ID-based recommender models (IDRec) are well established and have dominated the RS field for over a decade. Despite their popularity and success, they have key weaknesses that should not be ignored. First, IDRec rely heavily on ID interactions and fail to provide recommendations when users and items have few interactions (Yuan et al., 2020), a.k.a. the cold-start setting. Second, pre-trained IDRec are not transferable across platforms, given that user IDs and item IDs are in general not shareable in practice. This issue severely limits the development of big, general-purpose RS models (Ding et al., 2021; Bommasani et al., 2021; Wang et al., 2022), an emerging paradigm in other deep learning application areas. Third, IDRec represent items mainly by ID embedding features, ignoring their inherent content features, and are thus prone to sub-optimal performance. Moreover, maintaining a large and frequently updated ID embedding matrix for users and items remains a key challenge in industrial applications (Sun et al., 2020). Beyond these issues, ID-only recommender models cannot benefit from advances in other communities, such as the powerful representation models developed in NLP (natural language processing) and CV (computer vision).
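To make the ID-based paradigm concrete, the following is a minimal sketch of how IDRec represents and scores items: each item is just a learned row of an embedding table. All names and sizes here are illustrative (not from the paper), and NumPy stands in for a real deep learning framework.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_ITEMS, DIM = 1000, 64  # hypothetical catalogue size and embedding width

# IDRec represents each item solely by a learned row of an embedding table.
# The table is tied to this platform's item IDs, which is why a trained IDRec
# cannot be transferred to a platform with a different ID space.
item_embedding = rng.normal(size=(NUM_ITEMS, DIM))

def score_all_items(user_vec: np.ndarray) -> np.ndarray:
    """Dot-product relevance of every item for one user representation."""
    return item_embedding @ user_vec

user_vec = rng.normal(size=DIM)   # stand-in for a learned user/sequence vector
scores = score_all_items(user_vec)
top_k = np.argsort(-scores)[:10]  # the 10 item IDs the model would recommend
```

Note that nothing in this sketch inspects item content: if an item has no interactions, its embedding row is never trained, which is exactly the cold-start weakness described above.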
Last but not least, recommender models that rely on ID features have clear drawbacks in terms of interpretability, visualization and evaluation.

In contrast to IDRec, content-based recommender models (CoRec) rely heavily on item features, i.e., characteristics of the item such as the color of an object, the authors of a book, or the keywords in an article. While intuitive and interpretable, they have been far less prevalent than IDRec over the past decade. A key reason could be that content-based item encoders are not as expressive as the standard item ID embedding, leading to unsatisfactory performance. Nevertheless, we believe that given the recent extraordinary success of deep representation learning, it is time to revisit the comparison between CoRec and IDRec. In particular, BERT (Devlin et al., 2018), GPT-3 (Brown et al., 2020) and Vision Transformers (Dosovitskiy et al., 2020; Liu et al., 2021) have revolutionized the NLP and CV fields in representing raw text and vision features. Whether the item representations learned by these backbone models are better suited to recommender systems than ID embeddings has remained largely unknown until now. In this paper, we rethink the potential of CoRec and study a key question: should we still stick to the ID-based recommender paradigm? We concentrate on item recommendation based on the text and vision modalities, the two most common modalities in the literature. To differentiate from traditional attribute-based CoRec, we refer to recommender models that directly encode items' raw modality features as MoRec. To be concise, we attempt to address the following sub-questions:

Q(i): Equipped with strong modality encoders (ME), can MoRec perform comparably to or better than IDRec in various recommendation scenarios? To answer this question, we conduct empirical studies covering the two most representative recommender architectures (i.e., the two-tower based DSSM (Huang et al., 2013; Rendle et al., 2020) and the session-based SASRec (Kang & McAuley, 2018)), equipped with four powerful ME and evaluated on three large-scale recommendation datasets with two modalities (text and vision) and three recommendation scenarios (regular, cold & new item settings).

Q(ii): If the answer to Q(i) is yes, can the recent technical advances in the NLP and CV fields be translated into accuracy improvements for MoRec when it utilizes text and vision features? We address this question with three experiments. First, we evaluate MoRec by comparing modality-based item encoders (e.g., BERT and ResNet (He et al., 2016)) with vs. without pre-training on the corresponding NLP and CV datasets; second, we compare weaker vs. stronger ME, where strength is determined by performance on NLP and CV tasks; third, we compare smaller vs. larger ME, given that ME with larger model sizes tend to outperform their smaller counterparts on various downstream tasks.

Q(iii): How can we effectively employ item modality representations derived from an NLP or CV encoder network? Is the end-to-end (E2E) fine-tuned representation largely superior to the frozen representation, given that E2E training requires much more compute and training time? The de facto practice for industrial recommender systems is to first extract item modality representations through some ME as 'off-the-shelf' features and then incorporate them into a recommender model (McAuley et al., 2015; Covington et al., 2016), often referred to as the two-stage (TS) paradigm. While the TS paradigm is architecturally flexible, easy to implement and requires less compute and training time, we show that it incurs a substantial accuracy loss compared with the E2E paradigm.

Q(iv): Beyond these key questions, we also identify several other factors that affect the training of MoRec in practice.

To serve as a foundation for further research on MoRec, we will publish all our code and datasets, including a large-scale real-world video recommendation dataset (collected by ourselves) containing over 4 million user-video interactions with around 128K video thumbnails and 400K users.¹

¹ K is short for thousand.
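As a rough illustration of why the item encoder is a pluggable component in such architectures, the sketch below contrasts an ID-embedding item tower with a modality-based one inside a DSSM-style dot-product match. The shapes, the toy `modality_item_tower` projection, and the use of NumPy are simplifying assumptions, not the paper's implementation; in MoRec the projection would be a pre-trained ME such as BERT or ResNet.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_ITEMS, FEAT_DIM, DIM = 200, 128, 64        # illustrative sizes

id_table = rng.normal(size=(NUM_ITEMS, DIM))   # IDRec item tower: a lookup table
projection = rng.normal(size=(DIM, FEAT_DIM))  # toy stand-in for a real ME

def id_item_tower(item_id: int) -> np.ndarray:
    """IDRec: the item vector is simply the item's embedding row."""
    return id_table[item_id]

def modality_item_tower(raw_features: np.ndarray) -> np.ndarray:
    """MoRec: the item vector is computed from the item's raw modality features."""
    return np.tanh(projection @ raw_features)

def two_tower_score(user_vec: np.ndarray, item_vec: np.ndarray) -> float:
    """DSSM-style match: both towers meet in a shared space via a dot product."""
    return float(user_vec @ item_vec)

user_vec = rng.normal(size=DIM)
s_id = two_tower_score(user_vec, id_item_tower(7))
s_mo = two_tower_score(user_vec, modality_item_tower(rng.normal(size=FEAT_DIM)))
```

Because the user tower and the scoring function are unchanged, swapping `id_item_tower` for `modality_item_tower` is exactly the IDRec-to-MoRec substitution the empirical study performs.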

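The difference between the two-stage and end-to-end paradigms compared in Q(iii) can be sketched as follows. The mean-pooled word-embedding "encoder" is a deliberately tiny stand-in for a real pre-trained ME, and all names here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, DIM = 500, 64                      # toy vocabulary and hidden size
word_emb = rng.normal(size=(VOCAB, DIM))  # the "encoder weights" of our tiny ME

def modality_encoder(token_ids: list) -> np.ndarray:
    """Tiny stand-in ME: mean-pooled word embeddings of an item's text."""
    return word_emb[token_ids].mean(axis=0)

catalogue = {7: [3, 41, 99], 8: [12, 5]}  # item_id -> toy text tokens

# Two-stage (TS): encode every item once, cache the vectors, and train the
# recommender on these frozen 'off-the-shelf' features.
frozen_vecs = {i: modality_encoder(toks) for i, toks in catalogue.items()}

# End-to-end (E2E): the encoder stays in the training loop, so item vectors are
# recomputed each step and the ME weights receive gradients from the
# recommendation loss (only the recomputation is shown here, not the update).
def e2e_item_vec(item_id: int) -> np.ndarray:
    return modality_encoder(catalogue[item_id])
```

Before any training update the two paradigms produce identical vectors; they diverge only once E2E fine-tuning starts adapting the encoder weights, which is where the accuracy gap discussed above comes from, at the cost of recomputing the encoder on every step.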
