RECOMMENDER TRANSFORMERS WITH BEHAVIOR PATHWAYS

Abstract

Sequential recommendation requires the recommender to capture evolving behavior characteristics from logged user behavior data for accurate recommendations. However, a user behavior sequence can be viewed as a script with multiple ongoing threads intertwined, and we find that only a small set of pivotal behaviors evolve into the user's future action. As a result, the user's future behavior is hard to predict. We summarize this characteristic of each user's sequential behaviors as the behavior pathway; different users have their own unique behavior pathways. Among existing sequential models, transformers have shown great capacity for capturing global-range dependencies. However, these models mainly produce a dense attention distribution over all previous behaviors via the self-attention mechanism, so the final predictions are overwhelmed by trivial behaviors and not adapted to each user. In this paper, we build the Recommender Transformer (RETR) with a novel Pathway Attention mechanism. RETR dynamically plans the behavior pathway specific to each user and sparsely activates the network through this pathway to effectively capture the evolving patterns useful for recommendation. The key design is a learned binary route that prevents the behavior pathway from being overwhelmed by trivial behaviors. Pathway attention is model-agnostic and can be applied to a broad family of transformer-based models for sequential recommendation. We empirically evaluate RETR on seven intra-domain benchmarks, where it yields state-of-the-art performance, and on five cross-domain benchmarks, where it captures more domain-invariant representations for sequential recommendation.

1. INTRODUCTION

Recommender systems (Hidasi et al., 2015; Lu et al., 2015; Zhao et al., 2021) have been widely adopted in real-world industrial applications such as E-commerce and social media. Benefiting from the increase in computing power and model capacity, recent efforts formulate recommendation as a time-series forecasting problem, known as sequential recommendation (Kang & McAuley, 2018; Sun et al., 2019; Chen et al., 2021). The core idea of this field is to infer upcoming actions from a user's historical behaviors, reorganized as time-ordered sequences. This intuitive formulation has proved time-sensitive and context-aware enough to make precise predictions. Recent sequential recommendation models, such as SASRec (Kang & McAuley, 2018), Bert4Rec (Sun et al., 2019) and S3-Rec (Zhou et al., 2020), have achieved significant improvements. Transformers enable these models to recognize global-range sequential patterns and to model how future behaviors are anchored in historical ones. The self-attention mechanism makes it possible to explore all previous behaviors of each user, with the whole network activated. However, using all user information indiscriminately, whether informative or not, floods models with trivial signals, makes them dense in neuron connections and inefficient in computation, and drowns out the key behaviors. This clearly contradicts the way our brain works: the human brain has many parts specialized for various tasks, yet it only calls upon the relevant pieces in a given situation (Zaidi, 2010). To some extent, a user behavior sequence can be viewed as a script with multiple ongoing threads intertwined, in which only key clues suggest what will happen next. In sequential recommendation, we find that only a small set of pivotal behaviors evolve into the user's future action.
We summarize this characteristic of sequential behaviors as the behavior pathway. Different users have their own unique behavior pathways; we give three typical examples. (a) Correlated behavior pathway: the pathway is closely associated with behaviors in a certain period. As shown in the first row of Figure 1, the mouse is clicked many times recently, leading to the final decision to buy a mouse. (b) Casual behavior pathway: the user shows interest in a specific item only at casual times. In the second row of Figure 1, the backpack is clicked at scattered moments in a multi-hop manner. (c) Drifted behavior pathway: the user's interest in a particular brand might drift over time. In the third row of Figure 1, the user was initially interested in a keyboard but eventually became interested in buying a phone. Dynamically capturing these potential behaviors for each user is challenging but essential for precise recommendation. Motivated by Pathways (Dean, 2021), a new way of thinking about AI that builds a single, sparsely activated model for all tasks, with small pathways through the network called into action as needed, we propose a novel Recommender Transformer (RETR) with a Pathway Attention mechanism. RETR dynamically explores the behavior pathway of each user and then effectively captures evolving patterns through this pathway. Specifically, the user-dependent pathway attention incorporates a pathway router that determines whether or not a behavior token is maintained in the behavior pathway. Technically, the pathway router generates a customized binary route for each token based on its information redundancy. RETR has a stacked structure, and the successive pathway routers constitute a hierarchical evolution of user behaviors.
To enable the pathway router to be optimized end-to-end, we propose an adaptive Gumbel-Softmax sampling strategy that overcomes the non-differentiability of sampling from a Bernoulli distribution. To effectively capture the evolving patterns via the behavior pathway, our pathway attention mechanism makes RETR attend mainly to the obtained pathway: the query is routed through the behavior pathway, which forces the model to focus on the most informative behaviors, and the interaction between the query and off-pathway behaviors is cut off. Compared with attending to all previous behaviors, pathway attention is clearly more effective and prevents the most informative tokens from being overwhelmed by trivial ones. Moreover, pathway attention is model-agnostic and can be easily applied to existing transformer-based models. To validate the effectiveness of our approach, we conduct experiments on seven competitive intra-domain datasets for sequential recommendation, where RETR achieves state-of-the-art performance. Furthermore, RETR also achieves consistent improvements under the cross-domain setting, indicating that it captures more domain-universal representations for sequential recommendation. Our main contributions can be summarized as follows:
• We propose the concept of the behavior pathway for sequential recommendation and find that the key to an effective recommender is to dynamically capture the behavior pathway of each user.
• We propose the Recommender Transformer (RETR) with a novel pathway attention mechanism, which generates the behavior pathway hierarchically and dynamically captures the evolving patterns through it.
• We validate the effectiveness of RETR on seven intra-domain benchmarks and five cross-domain benchmarks, achieving state-of-the-art performance on both. RETR captures more domain-invariant representations, and pathway attention can be applied together with a rich family of transformer-based models to yield consistent performance improvements.
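The Gumbel-Softmax strategy mentioned above can be sketched concretely. The following is a minimal numpy illustration of the standard straight-through Gumbel-Softmax trick for a binary keep/skip route; the function name, the two-logit layout, and the straight-through discretization are our assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def gumbel_softmax_binary(logits, tau=1.0, hard=True, rng=None):
    """Relaxed Bernoulli sample via the Gumbel-Softmax trick.

    logits: array of shape (n, 2) with unnormalized log-probs for [skip, keep].
    Returns a {0, 1} route per token (hard=True) or the soft keep-probability
    (hard=False). In an autograd framework, the hard sample would be combined
    with the soft one via the straight-through estimator so gradients flow."""
    rng = np.random.default_rng(rng)
    # Standard Gumbel noise: G = -log(-log(U)), U ~ Uniform(0, 1)
    gumbel = -np.log(-np.log(rng.uniform(size=logits.shape)))
    y = (logits + gumbel) / tau
    # Numerically stable softmax over the two categories
    y = np.exp(y - y.max(axis=-1, keepdims=True))
    soft = y / y.sum(axis=-1, keepdims=True)
    if hard:
        # Forward pass discretizes; index 1 means "keep on the pathway".
        return (soft.argmax(axis=-1) == 1).astype(np.float32)
    return soft[..., 1]
```

Lowering the temperature `tau` makes the soft samples approach the discrete route, which is why an adaptive schedule over `tau` is a natural training strategy.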

2. RELATED WORK

Traditional recommendation approaches. Capturing evolving behavior characteristics is crucial for many online applications, such as advertising, social media and E-commerce, and it is the key challenge for sequential recommendation (Adomavicius & Tuzhilin, 2005; Kang & McAuley, 2018; Cui et al., 2018; Yan et al., 2019; Fang et al., 2020; Pi et al., 2020; Bian et al., 2021; Zhou et al., 2022; Liu et al., 2022). Traditional recommendation approaches, such as collaborative filtering (CF) (Herlocker et al., 1999) based on matrix approximation (Koren, 2008; Koren et al., 2009), typically assume that user behavior is static. In practice, however, user behaviors change over time for various reasons, causing CF to deteriorate in real-world applications.
Sequential recommendation approaches. To overcome this challenge, methods such as FPMC (He & McAuley, 2016) and HRM (Wang et al., 2015) use Markov chains to capture sequential patterns by learning user-specific transition matrices. Higher-order Markov chains assume the next action is related to several previous actions. Benefiting from this strong inductive bias, MC-based methods (He & McAuley, 2016; He et al., 2016) show superior performance in capturing short-term patterns. At the same time, these approaches face a potential state-space explosion when confronted with many possible sequences (Wu et al., 2017). In recent years, many works have applied deep neural networks to sequential recommendation. GRU4Rec (Hidasi et al., 2015) and RepeatNet (Ren et al., 2019) adopt recurrent networks to capture dynamic patterns from position-dependent user behaviors. RNN-based models achieve competitive performance on short-term behavior patterns but cannot capture long-term patterns effectively.
CNN-based models such as Caser (Tang & Wang, 2018) apply convolutional operations to extract transitions but tend to overlook the intrinsic relationships across user behaviors. GNN-based methods such as SRGNN (Wu et al., 2019), GCSAN (Xu et al., 2019), Jodie (Kumar et al., 2019) and TGN (Rossi et al., 2020) model behavior sequences as graph-structured data and incorporate an attention mechanism for session-based recommendation. In addition, DIN (Zhou et al., 2018) uses a gate mechanism to weight different user behaviors; however, concatenating all behaviors makes such models overlook the sequential characteristics. Recently, MLP-based models like FMLP-Rec (Zhou et al., 2022) use an MLP as the backbone for sequential recommendation. Nevertheless, these methods are still overwhelmed by trivial behaviors.
Transformer-based models for sequential recommendation. SASRec (Kang & McAuley, 2018), Bert4Rec (Sun et al., 2019), S3-Rec (Zhou et al., 2020), TGSRec (Fan et al., 2021b), LightSANs (Fan et al., 2021a) and SSE-PT (Wu et al., 2020) introduce the transformer architecture into sequential recommendation, which can lead to over-parameterized architectures. These models capture evolving patterns with the self-attention mechanism, interacting with all previous behaviors. However, such dense interactions keep the model from adapting to different users and overwhelm the behavior pathways. Methods like Locker (He et al., 2021) and Recdenoiser (Chen et al., 2022) propose sparse attention mechanisms with learned masks, yet they may overlook the ability to capture the behavior pathway at the token level. To tackle this challenge, we build the Recommender Transformer (RETR) with a new Pathway Attention mechanism that is dynamically activated along the behavior pathway of each user.
Distinct from previous routing architectures such as the Switch Transformer (Fedus et al., 2021), which uses an MoE (Shazeer et al., 2017) structure for natural language tasks, or TRAR (Zhou et al., 2021), which uses learned sparse attention for visual question answering, our RETR is designed explicitly for sequential recommendation: the pathway router adaptively routes the sequential behaviors of each user rather than routing the feed-forward experts as in the Switch Transformer.

3. METHOD

Suppose that we have a set of users U and a set of items I. In sequential recommendation, the chronologically ordered behaviors of a user u ∈ U are represented by an interacted-item sequence {i_1, ..., i_n}. Formally, given a user u with behavior sequence {i_1, ..., i_n}, the goal is to predict the next item the user u will interact with at the (n+1)-th step, i.e., to model p(i_{n+1} | i_{1:n}). As aforementioned, we highlight that the key to sequential recommendation is the exploration of user-tailored behavior pathways, through which evolving characteristics can be learned. Motivated by this, we propose a novel Recommender Transformer (RETR) with a new Pathway Attention mechanism, the core subassembly of which is a pathway router. Besides the architectural modification, we additionally introduce a hierarchical update strategy for the behavior pathway in the feed-forward procedure.

3.1. RECOMMENDER TRANSFORMER

Considering the limitation of overwhelming attention in Transformers (Bhattacharjee & Das, 2017) for sequential recommendation, we renovate the vanilla architecture into the Recommender Transformer (RETR) with a Pathway Attention mechanism, as shown in Figure 2. Model inputs. To obtain the model inputs, we follow the sliding-window practice and transform the user's behavior sequence into a fixed-length sequence s = (s_1, s_2, ..., s_N) of length N. We then produce an item embedding matrix E_I ∈ R^{|I|×d}, where d is the embedding dimensionality, and perform a look-up from E_I to retrieve the input embedding matrix E_s ∈ R^{N×d} for sequence s. We also add a learnable position embedding P_s ∈ R^{N×d}, yielding the input embedding of each behavior sequence s as X_s = E_s + P_s ∈ R^{N×d}. Overall architecture. The Recommender Transformer stacks Pathway Attention blocks and feed-forward layers alternately, containing L blocks in total. This stacked structure is conducive to learning behavior representations hierarchically. The equations of block l are formalized as:

$$\hat{Z}^l, R^l = \text{Path-MSA}(Z^{l-1}, R^{l-1}),$$
$$\tilde{Z}^l = \mathrm{LN}(\hat{Z}^l + Z^{l-1}),$$
$$Z^l = \mathrm{LN}(\mathrm{FFN}(\tilde{Z}^l) + \tilde{Z}^l),$$

where Z^l ∈ R^{N×d}, l ∈ {1, ..., L} denotes the output of the l-th block. The initial input Z^0 = X_s ∈ R^{N×d} is the raw behavior embedding. R^{l-1} ∈ R^{N×1} is the route from the (l-1)-th block, and all elements of the route R^0 are initialized to 1. Path-MSA(·) denotes pathway multi-head self-attention, LN(·) denotes layer normalization (Ba et al., 2016), and FFN denotes the point-wise feed-forward network (Bhattacharjee & Das, 2017).
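To make the block structure concrete, the equations above can be sketched in PyTorch as follows; the FFN width and the `path_msa` stand-in (any callable returning the attended tokens and the updated route) are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class RETRBlock(nn.Module):
    """One RETR block, per the block equations above:
    Path-MSA -> residual + LayerNorm -> point-wise FFN -> residual + LayerNorm."""
    def __init__(self, d, path_msa, d_ff=None):
        super().__init__()
        self.path_msa = path_msa          # stand-in for pathway multi-head self-attention
        self.ln1 = nn.LayerNorm(d)
        self.ln2 = nn.LayerNorm(d)
        d_ff = d_ff or 4 * d              # assumed FFN width
        self.ffn = nn.Sequential(nn.Linear(d, d_ff), nn.ReLU(), nn.Linear(d_ff, d))

    def forward(self, z, route):
        z_hat, route = self.path_msa(z, route)   # Z_hat^l, R^l = Path-MSA(Z^{l-1}, R^{l-1})
        z_tilde = self.ln1(z_hat + z)            # Z_tilde^l = LN(Z_hat^l + Z^{l-1})
        z_out = self.ln2(self.ffn(z_tilde) + z_tilde)  # Z^l = LN(FFN(Z_tilde^l) + Z_tilde^l)
        return z_out, route
```

Stacking L such blocks, with Z^0 = X_s and R^0 all ones, reproduces the feed-forward procedure described above.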

3.1.1. PATHWAY ATTENTION

Note that the single-branch self-attention mechanism (Bhattacharjee & Das, 2017) in the vanilla transformer cannot model the behavior pathway dynamically, resulting in key behaviors being overwhelmed by non-pivotal or trivial ones. To solve this problem, we propose the Pathway Attention mechanism, as shown in Figure 2, which can dynamically attend to the pathway of pivotal behavior tokens. Pathway router. Pathway attention employs a sequence-adaptive pathway router to custom-tailor behavior pathway routes for each user. The router generates a binary route R^l ∈ {0,1}^N that determines whether each behavior token belongs to the behavior pathway. Each router takes the pre-order route R^{l-1} and the user behavior tokens Z^{l-1} ∈ R^{N×d} of the (l-1)-th block as inputs. All elements of the route are initialized to 1 and updated progressively during training. First, to suppress potential disturbance to the model caused by local drifted interest (Figure 1), it is crucial to incorporate global information into route generation. We apply average pooling to all behavior tokens preserved by R^{l-1} and produce a global sequential representation via a multilayer perceptron (MLP) module. We then combine this global representation with the inputs through a residual connection to maintain the original input information, and feed the result to another MLP layer to predict the probability of keeping or dropping each behavior token. All MLP layers are column-wise and operate on the embedding dimension. The above procedure is formulated as:

$$Z^l_{\mathrm{emb}} = Z^{l-1} + Z^{l-1} \odot \mathrm{MLP}\!\left(\frac{\sum_{i=1}^{N} R^{l-1}_i Z^{l-1}_i}{\sum_{i=1}^{N} R^{l-1}_i}\right),$$
$$\pi = \mathrm{Softmax}\!\left(\mathrm{MLP}(Z^l_{\mathrm{emb}})\right) \in R^{N \times 2},$$

where ⊙ is the Hadamard product. For t ∈ {1, 2, ..., N}, we let π_t = [1 - α_t, α_t], where the logit α_t denotes the probability that the t-th behavior token is kept alive for the behavior pathway. Adaptive Gumbel-Softmax sampling from π for the router.
Our goal is to generate the binary route from π. However, sampling from π directly is non-differentiable and would impede gradient-based training. Gumbel-Softmax (Jang et al., 2016) is an effective way to approximate a non-differentiable sample from a discrete distribution with a differentiable sample from a Gumbel-Softmax distribution, so we adapt this technique for the sampling procedure. Instead of directly sampling a keep-or-drop decision for the t-th behavior token from the distribution π_t, we generate it as:

$$\hat{R}^l_t = \arg\max_{j \in \{0,1\}} \left(\log \pi_t(j) + G_t(j)\right),$$

where G_t = -log(-log U_t) follows a standard Gumbel distribution, and U_t is sampled i.i.d. from a uniform distribution Uniform(0, 1). To remove the non-differentiable argmax in the equation above, the standard Gumbel-Softmax uses the reparameterization trick (Jang et al., 2016) as a differentiable approximation, relaxing the one-hot decision to v_t ∈ R^2:

$$v_t(j) = \frac{\exp\!\left((\log \pi_t(j) + G_t(j))/\tau\right)}{\sum_{i \in \{0,1\}} \exp\!\left((\log \pi_t(i) + G_t(i))/\tau\right)}, \quad j \in \{0,1\},$$

where τ is the Softmax temperature. However, tuning this temperature remains a well-known challenge: a low temperature causes high variance in gradient magnitude, while a high temperature leads to over-smoothed probabilities. Furthermore, a fixed temperature is not adaptive across different datasets or across the behaviors of each user, which incurs a heavy tuning cost. Motivated by these difficulties, we propose an adaptive variant of Gumbel-Softmax that introduces a token-specific weight mechanism:

$$\omega = \mathrm{ReLU}\!\left(\mathrm{MLP}(Z^l_{\mathrm{emb}})\right) \in R^{N \times 1},$$
$$v_t(j) = \frac{\exp\!\left(\omega_t \log \pi_t(j) + G_t(j)\right)}{\sum_{i \in \{0,1\}} \exp\!\left(\omega_t \log \pi_t(i) + G_t(i)\right)}, \quad j \in \{0,1\},$$

where ω_t is the weight specific to each token t of each sequence.
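A compact sketch of the pathway router with adaptive Gumbel-Softmax sampling; the MLP hidden sizes and the straight-through estimator used to keep the hard keep-or-drop decision differentiable are our assumptions, not details fixed by the paper.

```python
import torch
import torch.nn as nn

class PathwayRouter(nn.Module):
    """Sketch of the pathway router: masked average pooling of kept tokens,
    a residual Hadamard mix, keep/drop probabilities pi, token-specific
    weights omega, and adaptive Gumbel-Softmax sampling of the route."""
    def __init__(self, d):
        super().__init__()
        self.global_mlp = nn.Sequential(nn.Linear(d, d), nn.GELU(), nn.Linear(d, d))
        self.logit_mlp = nn.Sequential(nn.Linear(d, d), nn.GELU(), nn.Linear(d, 2))
        self.omega_mlp = nn.Sequential(nn.Linear(d, 1), nn.ReLU())

    def forward(self, z, route):
        # z: (B, N, d) behavior tokens; route: (B, N, 1) binary route from block l-1
        pooled = (route * z).sum(1) / route.sum(1).clamp(min=1.0)   # masked average pool
        z_emb = z + z * self.global_mlp(pooled).unsqueeze(1)        # residual Hadamard mix
        pi = torch.softmax(self.logit_mlp(z_emb), dim=-1)           # (B, N, 2) keep/drop
        omega = self.omega_mlp(z_emb)                               # (B, N, 1) token weights
        u = torch.rand_like(pi).clamp(1e-9, 1 - 1e-9)
        g = -torch.log(-torch.log(u))                               # standard Gumbel noise
        v = torch.softmax(omega * torch.log(pi.clamp_min(1e-9)) + g, dim=-1)
        keep = (v[..., 1:] > v[..., :1]).float()                    # hard keep-or-drop
        keep = keep + v[..., 1:] - v[..., 1:].detach()              # straight-through grads
        return keep * route                                         # hierarchical update
```

The final multiplication by the pre-order route means a token dropped once can never rejoin the pathway, matching the hierarchical update strategy.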
For different user behaviors, an MLP module dynamically derives the weight from the inputs Z^l_emb, avoiding high variance in the gradient and mitigating the over-smoothing phenomenon. Our adaptive Gumbel-Softmax lets RETR adapt to diverse datasets and user behaviors without tuning the temperature. Hierarchical update strategy for the router. The preliminary route R̂^l, sampled from π, is not the final decision. In our design, once a token fails to be routed in a certain block, it permanently loses the privilege of joining the behavior pathway in the subsequent feed-forward procedure, which constitutes a more efficient hierarchical routing strategy. We therefore formulate the final route R^l ∈ R^{N×1} as the Hadamard product of R̂^l and the pre-order route R^{l-1} of the (l-1)-th block:

$$R^l = \hat{R}^l \odot R^{l-1}.$$

Multi-head pathway attention. The standard multi-head self-attention mechanism retrieves sequential characteristics by exploiting all behavior tokens, leaving the behavior pathway overwhelmed by trivial behaviors. In the proposed pathway attention, the pathway router is first applied to the input behavior tokens to route information. The router does not pare down the number of tokens, but only the interactions between off-pathway and on-pathway tokens, since off-pathway tokens may still convey contextual information. Specifically, for the query Q, key K, and value V in pathway attention: the query is routed by the pathway router through token-wise multiplication between R^l_t and Z^{l-1}_t, to prevent the pathway from being overwhelmed and to force the pathway attention to attend to the behavior pathway.
The key and value are the original input behavior tokens, ensuring that contextual information from off-pathway behavior tokens can still be captured:

$$Q_m, K_m, V_m = (R^l \odot Z^{l-1}) W^l_{Q_m},\; Z^{l-1} W^l_{K_m},\; Z^{l-1} W^l_{V_m},$$
$$\hat{Z}^l_m = \mathrm{Softmax}\!\left(\frac{Q_m K_m^{\top}}{\sqrt{d/h}}\right) V_m,$$

where m ∈ {1, 2, ..., h} is the head index of the multi-head self-attention, and W^l_{Q_m}, W^l_{K_m}, W^l_{V_m} ∈ R^{d×(d/h)} are transformation matrices learned from data. Finally, the outputs {Ẑ^l_m ∈ R^{N×(d/h)}}_{1≤m≤h} of the multiple heads are concatenated into Ẑ^l ∈ R^{N×d}. We summarize the above pathway attention as Ẑ^l, R^l = Path-MSA(Z^{l-1}, R^{l-1}); its output is further transformed by the block equations in Section 3.1 to form the final output Z^l ∈ R^{N×d} of the l-th block. When predicting the (t+1)-th behavior, only the first t observable behaviors should be taken into account. To avoid future information leakage and ensure causality, we apply causal pathway attention: a look-ahead mask removes all links between Q_j and K_i with i > j. Model-agnostic pathway mechanism. It is worth noting that our pathway attention is a lightweight module readily pluggable into any transformer-based model, by replacing the self-attention mechanism with pathway attention while leaving the rest of the architecture unchanged. To verify the effectiveness of this model-agnostic pathway mechanism, we apply pathway attention to mainstream sequential recommendation transformers: BERTRec (Sun et al., 2019), SASRec (Kang & McAuley, 2018), SMRec (Chen et al., 2021), S3-Rec (Zhou et al., 2020), TGSRec (Fan et al., 2021b), and LightSANs (Fan et al., 2021a), which further enhances the performance and generalization of these models.
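A minimal sketch of the causal pathway attention above; the per-head split and boolean look-ahead mask are standard, while the function signature and shapes are illustrative assumptions.

```python
import torch

def pathway_attention(z, route, w_q, w_k, w_v, h):
    """Causal pathway attention sketch: queries come from routed tokens
    (route * z) so only on-pathway tokens drive retrieval, while keys and
    values use all tokens so off-pathway context stays visible.
    z: (B, N, d); route: (B, N, 1) binary; w_q/w_k/w_v: (d, d); h heads."""
    B, N, d = z.shape
    def split(x):                              # (B, N, d) -> (B, h, N, d/h)
        return x.view(B, N, h, d // h).transpose(1, 2)
    q = split((route * z) @ w_q)               # routed queries
    k, v = split(z @ w_k), split(z @ w_v)      # full keys and values
    scores = q @ k.transpose(-2, -1) / (d / h) ** 0.5
    mask = torch.triu(torch.ones(N, N, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))   # no attention to future keys
    out = torch.softmax(scores, dim=-1) @ v
    return out.transpose(1, 2).reshape(B, N, d)        # concatenate heads
```

Only the query is routed; keys and values see every token, so off-pathway behaviors still contribute context without driving retrieval.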

3.2. PREDICTION LAYER AND TRAINING OBJECTIVE

Prediction layer. In the final layer of RETR, we calculate the user's preference score for item k at step (t+1), in the context of the user's behavior history, as

$$p(i_{t+1} = k \mid i_{1:t}) = e_k \cdot Z^L_t,$$

where e_k is the representation of item k from the item embedding matrix E_I, and Z^L_t is the output of the L-th block of RETR at step t, with L being the number of RETR blocks. Training objective. We adopt the pairwise ranking loss to optimize the RETR model parameters:

$$\mathcal{L} = -\sum_{u \in \mathcal{U}} \sum_{t=1}^{n} \log \sigma\!\left(p(i_{t+1} \mid i_{1:t}) - p(i^{-}_{t+1} \mid i_{1:t})\right),$$

where we pair each ground-truth item i_{t+1} with a randomly sampled negative item i^-_{t+1}. In each epoch, we randomly generate one negative item for each time step in each sequence. This pairwise ranking loss is widely adopted in the previous literature on sequential recommendation (Kang & McAuley, 2018; Zhou et al., 2022).
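A sketch of the score and loss above, assuming batched tensors of sequence outputs and item embeddings:

```python
import torch
import torch.nn.functional as F

def pairwise_ranking_loss(seq_out, pos_items, neg_items):
    """BPR-style pairwise ranking objective from the text above: the score
    is a dot product between the step-t output Z^L_t and an item embedding,
    and each ground-truth item is paired with one sampled negative.
    seq_out, pos_items, neg_items: (B, N, d)."""
    pos = (seq_out * pos_items).sum(-1)   # p(i_{t+1} | i_{1:t})
    neg = (seq_out * neg_items).sum(-1)   # p(i^-_{t+1} | i_{1:t})
    return -F.logsigmoid(pos - neg).mean()
```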

4. EXPERIMENTS

We extensively evaluate the proposed Recommender Transformer (RETR) on seven intra-domain real-world benchmarks and five cross-domain benchmarks. Due to page limitations, we include further ablation study results and visualization examples in Appendix B and Appendix C, respectively. Intra-domain setting. We evaluate RETR on seven intra-domain datasets: Netflix, MSD, Taobao, Yelp, Tmall, Steam, and MovieLens1M. All methods are trained from scratch on these datasets. The statistics of the seven datasets are summarized in Table 4 of Appendix A, together with descriptions of the datasets. All datasets are widely used for the sequential recommendation task; notably, Netflix, MSD, Taobao and Steam are large-scale. Cross-domain setting. We evaluate the ability of RETR to capture domain-invariant representations for sequential recommendation under the cross-domain setting. Specifically, we follow the training strategy of UniSRec (Hou et al., 2022) to pre-train RETR on the datasets "Grocery and Gourmet Food", "Home and Kitchen", "CDs and Vinyl", "Kindle Store" and "Movies and TV", and then fine-tune the pre-trained RETR on the target datasets "Prime Pantry", "Industrial and Scientific", "Musical Instruments", "Arts, Crafts and Sewing" and "Office Products". These datasets are all sub-categories of the Amazon review datasets (Ni et al., 2019); detailed descriptions of the source and target datasets are deferred to Appendix A. Following previous works (Kang & McAuley, 2018), we group the interaction records by users or sessions for all datasets and sort them by timestamp in ascending order. We split the historical sequence of each user into three parts: (1) the most recent behavior for testing, (2) the second most recent behavior for validation, and (3) all remaining behaviors for training. During testing, the input sequences contain the training and validation behaviors.
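The leave-one-out protocol above can be sketched in plain Python (assuming a time-ordered interaction list per user):

```python
def leave_one_out_split(behaviors):
    """Leave-one-out split: the most recent behavior is held out for
    testing, the second most recent for validation, and the rest are
    used for training. `behaviors` is assumed time-ordered."""
    assert len(behaviors) >= 3, "need at least three interactions"
    train, valid, test = behaviors[:-2], behaviors[-2], behaviors[-1]
    # at test time, the input sequence contains training + validation behaviors
    test_input = behaviors[:-1]
    return train, valid, test, test_input
```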
We filter out less popular items and inactive users with fewer than five interaction records. Note that we also provide rolling-prediction results following the protocol of previous works (Chen et al., 2021) in Appendix C. Evaluation metrics. Following the previous literature (Zhou et al., 2022; Kang & McAuley, 2018), we apply top-k Hit Ratio (HR@k), top-k Normalized Discounted Cumulative Gain (NDCG@k) and Mean Reciprocal Rank (MRR) for evaluation, and report HR@10, NDCG@10 and MRR. Following the standard strategy of SASRec (Kang & McAuley, 2018), we pair the ground-truth item with 100 randomly sampled negative items that the user has not interacted with. All metrics are calculated from the ranking of the items, and we report the average score. Baseline methods. We compare RETR with several state-of-the-art sequential recommendation models. Specifically, we compare against state-of-the-art transformer-based models: Rec-denoiser (Chen et al., 2022), Locker (He et al., 2021), SASRec (Kang & McAuley, 2018), BertRec (Sun et al., 2019), SMRec (Chen et al., 2021), S3-Rec (Zhou et al., 2020), TGSRec (Fan et al., 2021b) and LightSANs (Fan et al., 2021a). These methods adopt the attention mechanism to make precise recommendations; note that Rec-denoiser (Chen et al., 2022) and Locker (He et al., 2021) are novel transformer-based models with learnable sparse attention. We also compare RETR with state-of-the-art graph-based sequential recommendation methods, Jodie (Kumar et al., 2019) and TGN (Rossi et al., 2020), and with the cross-domain recommendation models RecGURU (Li et al., 2022) and UniSRec (Hou et al., 2022). All baseline methods are configured with the default parameters from the original papers, or with the optimal parameters found by grid search. Implementation details.
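Under this 100-sampled-negatives protocol, each metric reduces to a simple function of the ground-truth item's rank among the 101 candidates; a sketch:

```python
import math

def rank_metrics(rank, k=10):
    """HR@k, NDCG@k and MRR for one test case, given the 1-based rank of
    the ground-truth item among itself plus the 100 sampled negatives.
    Per-user scores are averaged to obtain the reported numbers."""
    hr = 1.0 if rank <= k else 0.0
    ndcg = 1.0 / math.log2(rank + 1) if rank <= k else 0.0
    mrr = 1.0 / rank
    return hr, ndcg, mrr
```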
Our model is supervised by the pairwise ranking loss defined in Section 3.2, using the Adam (Kingma & Ba, 2015) optimizer with an initial learning rate of 0.001 and a batch size of 512. The maximum number of training epochs for all methods is 300, and training is early-stopped if the validation performance does not improve for 10 epochs. All hyperparameters are tuned on the validation set. Our RETR has L = 2 layers, each with h = 4 heads (an ablation study of multi-head attention can be found in Appendix B), and the dimension d is set to 256. The maximum sequence length N is set to 200 for MovieLens1M and 100 for the other intra-domain and cross-domain datasets. In the cross-domain setting, we pre-train the proposed approach and the baseline methods on multiple datasets following the training strategy of UniSRec (Hou et al., 2022) for 300 epochs, using UniSRec's default parameters. All models are pre-trained without item ID embeddings; when fine-tuning on the target domain, we use the item ID embeddings and fix the backbone for all competing pre-trained models. All experiments are repeated three times, implemented in PyTorch (Paszke et al., 2019), and conducted on a single NVIDIA 3090 GPU.

4.1. INTRA-DOMAIN RESULTS

The results of different methods on the seven intra-domain datasets are shown in Table 1. We can easily find that the transformer-based models, SASRec (Kang & McAuley, 2018), BertRec (Sun et al., 2019), SMRec (Chen et al., 2021), S3-Rec (Zhou et al., 2020), TGSRec (Fan et al., 2021b), Rec-denoiser (Chen et al., 2022) and LightSANs (Fan et al., 2021a), achieve competitive performance on most datasets, indicating that transformer-based models have a better capacity to capture sequential behaviors with complex characteristics. These models capture the interaction information among all previous user behaviors via the attention mechanism.

Table 1: Performance comparison to state-of-the-art models under the intra-domain setting.

The graph-based models Jodie (Kumar et al., 2019) and TGN (Rossi et al., 2020) also achieve competitive performance. Rec-denoiser (Chen et al., 2022) and Locker (He et al., 2021) introduce novel sparse attention mechanisms and thereby achieve better performance than the other baselines, which validates the effectiveness of learned attention masks. While these baselines are strong competitors, our RETR achieves state-of-the-art performance by a large margin on most datasets compared with Rec-denoiser and Locker. Results on Yelp, MovieLens1M and Tmall. Our RETR achieves competitive performance on Yelp and Tmall. These datasets are sparse, containing less action information, and thus carry a lot of noisy logged information. By effectively capturing the behavior pathway, RETR is not affected by this trivial behavior information and captures the most informative behavior representation, achieving better performance. Notably, on the Tmall benchmark, RETR gains 7% HR@10, 12% NDCG@10 and 14% MRR over the strongest baseline SMRec (Chen et al., 2021). For the MovieLens1M benchmark, RETR also achieves the best performance among all competing baselines. Results on large-scale datasets.
Our RETR consistently achieves state-of-the-art results on the large-scale datasets (Netflix, MSD, Taobao, and Steam). These datasets are challenging: it is difficult to capture, from the rich but noisy user behaviors, the pivotal behavior pathway needed for precise recommendation. Especially on the Taobao dataset, RETR gains relative improvements of 12% HR@10, 37% NDCG@10 and 20% MRR over the strongest baseline SINE (Tan et al., 2021). This provides evidence that RETR achieves competitive performance on both small- and large-scale datasets. The substantial performance gains indicate that focusing on the behavior pathway enables RETR to capture sequential characteristics more efficiently and effectively than the vanilla self-attention mechanism, which considers all previous user behaviors and is easily overwhelmed.

4.2. CROSS-DOMAIN RESULTS

To verify the ability of RETR to capture domain-invariant representations, we evaluate the pre-trained RETR on five target datasets under the cross-domain setting. The multi-domain pre-training version of RETR, denoted X-RETR, can be effectively transferred to new recommendation domains. We also provide results for the multi-domain pre-training versions of Rec-denoiser, SASRec, SMRec, and LightSANs, denoted X-Rec-denoiser, UniSRec, X-SMRec, and X-LightSAN, respectively. Technically, we follow the pre-training strategy of UniSRec (Hou et al., 2022) to train all models. As shown in Table 2, X-RETR achieves competitive cross-domain performance, outperforming the state-of-the-art cross-domain method UniSRec by a large margin on most target datasets. Compared with the other multi-domain pre-trained backbones, X-RETR achieves the highest performance and empowers better transferability. Specifically, X-RETR gains 12% HR@10 and 22.5% NDCG@10 over X-SMRec on the Scientific benchmark. These results indicate that RETR extracts domain-invariant representations for sequential recommendation, and that a strong backbone is as crucial as the transfer-learning method for enhancing transferability; RETR can thus be regarded as a general backbone for capturing domain-invariant representations.

Table 2: Performance comparison of different competitive methods under the cross-domain setting. X-* indicates a model pre-trained on multiple datasets and fine-tuned on a target dataset ("X" stands for "cross-domain") following the training procedure of UniSRec (Hou et al., 2022).

Pathway attention with different transformer-based models. As described before, our RETR yields state-of-the-art performance on all datasets.
We further apply our pathway mechanism to different transformer-based models: BERTRec (Sun et al., 2019), SMRec (Chen et al., 2021), S3-Rec (Zhou et al., 2020), TGSRec (Fan et al., 2021b), and LightSANs (Fan et al., 2021a). In Table 3, we observe that pathway attention substantially improves the performance of all baseline transformer-based models. RETR can be further enhanced by using advanced backbones instead of the vanilla transformer, achieving the best results among all competing methods. These results provide strong evidence that the proposed pathway attention is model-agnostic for transformer-based methods and not limited to a particular architectural choice.

5. CONCLUSION

A sequential recommender is designed to make accurate recommendations based on users' historical behaviors. However, user behaviors are dynamic and evolve continually: a user's current decision may only call upon interests from certain relevant behaviors of the past. We conclude these sequential characteristics as the behavior pathway. We propose the Recommender Transformer (RETR) with a novel pathway attention mechanism to tackle these challenges. Pathway attention develops a pathway router to dynamically allocate the behavior pathway for each user and capture the evolving patterns. RETR can capture more domain-invariant representations, and pathway attention is model-agnostic and can be easily applied to a series of transformer-based methods. RETR achieves state-of-the-art performance on seven intra-domain datasets and five cross-domain benchmarks for sequential recommendation.

A DESCRIPTIONS OF THE DATASETS

Cross-domain datasets. We choose five categories from the Amazon review datasets (Hou et al., 2022), Grocery and Gourmet Food, Home and Kitchen, CDs and Vinyl, Kindle Store, and Movies and TV, as source-domain datasets for pre-training. For the target datasets, we choose another five categories, Prime Pantry, Industrial and Scientific, Musical Instruments, Arts, Crafts and Sewing, and Office Products, to evaluate the proposed approach under the cross-domain setting. The detailed statistics are shown in Table 5.

B FURTHER ABLATION STUDY

Number of heads and maximum sequence length. In the left column of Table 6, we adjust the number of heads for RETR on Yelp. The performance first increases rapidly with the number of heads and peaks at h = 4; we perform a similar grid search on the other datasets. In the right column of Table 6, we adjust the maximum sequence length N for RETR on Yelp. The performance first increases rapidly with the maximum sequence length and peaks at N = 100; again, we perform a similar grid search on the other datasets. Effectiveness of each model component and number of blocks. In the left column of Table 7, we analyze the efficacy of each component of RETR on the Yelp dataset, with the following observations. First, we remove the pathway router module and instead randomly choose whether each input behavior token is kept or dropped. Removing the pathway router decreases the prediction performance considerably (MRR: 0.4354 → 0.3887), showing the necessity of learning the behavior pathway with a data-dependent module. Second, discarding the hierarchical update strategy for the behavior pathway also decreases the prediction performance, suggesting that this strategy is crucial for RETR to obtain a more accurate behavior pathway. In the right column of Table 7, we adjust the number of blocks for RETR on Yelp; the performance first increases rapidly with the number of blocks and peaks at 2 blocks. We perform a similar grid search on the other datasets. Model efficiency. We compare the computation cost of RETR with SASRec (Kang & McAuley, 2018), SINE (Tan et al., 2021) and SMRec (Chen et al., 2021) on the Yelp dataset. The computation cost is measured in GFLOPs (giga floating-point operations) on the self-attention module with position encoding; the model scale, measured in parameters, is also presented.
As shown in Table 8, RETR has almost the same number of parameters and GFLOPs as SASRec, indicating that the pathway router is a lightweight module: pathway attention does not bring extra cost. It is worth noting that the parameter scales and GFLOPs of the other competing transformers (apart from SASRec) are larger than those of RETR, yet RETR achieves higher performance. This shows that RETR is more efficient and effective than other competing attention-based models. Why do we need the pathway router in sequential recommendation? Previous sequential recommendation methods have shown that a recommender benefits greatly from the user's historical behaviors, even when the behavior sequence is short. However, even with short sequences, the recommender still needs to deal with various behavior pathways and can be overwhelmed by trivial behaviors. In Figure 4, we show the last 10 behaviors of a random user from the Steam dataset, and further show that the state-of-the-art MLP-based model FMLP-Rec (Zhou et al., 2022) is still overwhelmed by old drifted behaviors (simulation games). To avoid the recommender being overwhelmed by trivial behaviors, we design the pathway router to capture the pivotal behavior pathway that explains the user's preferences, whether the behavior sequence is short or long; developing the pathway router is thus crucial for making precise recommendations. Replacing the proposed pathway-based method with other sparse attention methods. We replace the proposed pathway-based method with two sparse attention methods: LogSparse (Li et al., 2019) and sparse attention (Child et al., 2019). As shown in Table 9, our RETR with pathway attention remarkably outperforms the two sparse attention variants on Tmall. These results show that sparse attention methods cannot capture the exact behavior pathway and perform worse than RETR. Quantitative results on whether the proposed model effectively captures a useful pathway. We give quantitative results to validate that RETR effectively captures various behavior pathways.
Specifically, we evaluate RETR using a subset of sequences derived from the captured behavior pathway on Tmall. Technically, we first train RETR on Tmall; then, for each user, we take the behavior pathway captured by RETR as input to retrain a new RETR, rather than using the user's whole behavior sequence. As shown in Table 10, using the behavior pathway as input achieves results comparable to the original RETR trained on complete user behaviors. This provides evidence that RETR aptly captures the useful pathway for each user. What is the essential difference between RETR and sequential models with attention mechanisms? RETR has two essential differences from sequential models with attention mechanisms. (1) RETR designs the pathway router to capture the behavior pathway, which previous sequential models have not considered. As shown in Figure 4, the previous self-attention mechanism mainly focuses on recent behaviors and cannot capture the accurate behavior pathway; only RETR captures the precise behavior pathway. (2) The pathway attention of RETR is a cross-attention between on-pathway and off-pathway behavior tokens, which avoids trivial interactions among off-pathway tokens. Why use the cross-attention mechanism? We choose the cross-attention mechanism for three main reasons: (1) it forces the pathway attention to attend to the behavior pathway; (2) it ensures that contextual information from off-pathway behavior tokens is still captured, by using the original input behavior tokens as the key and value; (3) it avoids trivial interactions among off-pathway tokens, whereas the self-attention mechanism of previous sequential models can be overwhelmed by trivial information in the off-pathway behavior tokens.
To verify this explanation, we conduct further evaluation experiments on Tmall. Specifically, we train RETR on Tmall and then use the trained model to capture the behavior pathway of each user. We then train SASRec using either the pathway behaviors or the off-pathway behaviors as inputs. As shown in Table 10, SASRec achieves better performance with the behavior pathway as input than the original SASRec trained on each user's whole behavior sequence. On the contrary, the off-pathway inputs hurt SASRec's performance severely. Finally, our RETR achieves the best performance, indicating that the pathway-to-off-pathway cross-attention is more effective than self-attention over the pathway alone.
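The pathway / off-pathway partition used in this evaluation can be sketched as follows. This is a hypothetical helper, where `route` stands for any binary mask produced by the trained router:

```python
import numpy as np

def split_by_route(behaviors, route):
    """Partition a user's behavior sequence into pathway and off-pathway
    subsequences according to a binary route mask, preserving time order."""
    behaviors = np.asarray(behaviors)
    route = np.asarray(route, dtype=bool)
    return behaviors[route].tolist(), behaviors[~route].tolist()

# toy user: 6 logged item ids, 3 of them judged pivotal by the router
items = [101, 205, 101, 333, 404, 333]
route = [1, 0, 1, 0, 0, 1]
pathway, off_pathway = split_by_route(items, route)
# pathway -> [101, 101, 333]; off_pathway -> [205, 333, 404]
```

Either subsequence can then be fed to a sequential model such as SASRec in place of the full behavior sequence, which is exactly the protocol of the ablation above.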

C VISUAL EXAMPLES

Setups. We also provide qualitative visualizations for our RETR and SASRec (Kang & McAuley, 2018). (2) Correlated behavior pathway: as shown in Figure 4 (b), the indie game is clicked many times recently, leading to the final decision on an indie game. Our RETR effectively captures this correlated behavior pathway, whereas SASRec assigns higher attention scores to the recent RPG games. On the contrary, our RETR pays no attention to these wrong results, showing a greater ability to cope with the correlated behavior pathway. (3) Drifted behavior pathway: as shown in Figure 4 (c), the user was initially interested in indie games, suddenly became interested in simulation games recently, and yet chose an indie game at last. Our RETR captures the drifted behavior pathway toward the indie game and does not concentrate on the old drifted pathway of simulation games, while SASRec is affected by the trivial behaviors of simulation games. These visualization results strongly show that our RETR can capture various behavior pathways dynamically for each user.
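Heatmaps like those in Figure 4 are typically produced by collapsing the model's attention weights onto the input positions. The following is a minimal sketch of that reduction; the averaging over heads and queries is a common convention and an assumption here, not necessarily the paper's exact recipe:

```python
import numpy as np

def behavior_heatmap(attn_weights):
    """Collapse per-head, per-query attention weights of shape (H, Q, L)
    into one importance score per historical behavior, normalized so the
    scores form a distribution suitable for heatmap visualization."""
    heat = attn_weights.mean(axis=(0, 1))   # average over heads and queries
    return heat / heat.sum()                # normalize to sum to 1

# toy example: 2 heads, 3 query positions, 6 historical behaviors
attn = np.random.rand(2, 3, 6)
heat = behavior_heatmap(attn)
```

A darker cell in the rendered heatmap then corresponds to a larger entry of `heat`, i.e. a behavior the model attends to more.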

D ROLLING PREDICTION ON HELD-OUT TEST USER SEQUENCES

Setup. We build a rolling prediction setup on held-out test user sequences. Specifically, we split the users into train/val/test sets with the ratio 8 : 1 : 1. For training users, the last record is used for prediction and all remaining behaviors for training. For validation and test users, the last 5 records of each user are used for rolling prediction, and we tune hyper-parameters on the validation users. We report the average results over the last 5 records of the test users, for the model that achieves the best results on the validation users.

Rolling prediction results. We conduct rolling prediction experiments on the Tmall dataset. Note that we evaluate the models on users that are not present at training time; this setup is harder than the case where test and training users overlap. As shown in Table 11, our RETR achieves state-of-the-art performance among all baselines. Rec-denoiser and Locker also achieve competitive performance with the help of learned sparse attention, but our RETR outperforms both methods substantially. These results show that our RETR develops an effective pathway attention mechanism with a strong capacity for rolling predictions on held-out test user sequences.

E COMPARISON WITH TRAR

TRAR (Zhou et al., 2021) uses an efficient sparse attention mechanism for visual question answering by routing the attention span, while our RETR is designed explicitly for sequential recommendation. Beyond the different application domains, the architecture of RETR differs substantially from TRAR in three main aspects:

• TRAR routes the choice of adjacency masks, automatically selecting the attention span. This sparse attention mechanism is effective for visual question answering, but it is not designed to be effective for sequential recommendation.
As shown in Table 12, we replace the pathway attention in RETR with the sparse attention in TRAR and conduct experiments on Tmall, MovieLens1M, and Yelp. Unsurprisingly, TRAR achieves worse performance than SASRec, while our RETR outperforms both TRAR and SASRec considerably. These results are strong evidence that capturing the behavior pathway is crucial for sequential recommendation and that our pathway attention mechanism is more effective for this task.

• RETR is aimed at capturing the behavior pathway. It develops the pathway router to capture the behavior pathway dynamically for each user. Technically, the pathway router can embed global information from the whole behavior sequence while keeping the original information from the input representation via a residual connection, whereas the pathway controller in TRAR only captures global information via attention pooling. As shown in Table 13, we replace the pathway router in RETR with the pathway controller in TRAR; RETR performs worse with the pathway controller, showing that the pathway controller cannot capture the behavior pathway effectively.

• RETR develops a hierarchical update strategy for the router, as described in Eq. (6). As shown in Table 7, discarding the hierarchical update procedure leads to worse performance, indicating that this strategy is crucial for capturing the behavior pathway from the intrinsically hierarchical representations. In contrast, the pathway controller in TRAR cannot update the choice of masks hierarchically.
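The contrast drawn above, global pooling plus a residual connection versus attention pooling alone, can be sketched as follows. The shapes, the linear scoring heads, and the hard 0/1 thresholding at inference are illustrative assumptions, not the paper's exact parameterization:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def pathway_router(tokens, w_pool, w_score, threshold=0.5):
    """Sketch of a pathway router: attention-pool the whole sequence into a
    global summary, add it back onto each token via a residual connection
    (keeping the original per-token information), then score each token and
    binarize into a 0/1 route."""
    pool_logits = tokens @ w_pool                  # (L,) pooling scores
    global_ctx = softmax(pool_logits) @ tokens     # (d,) global summary of sequence
    fused = tokens + global_ctx                    # residual: original info preserved
    probs = 1.0 / (1.0 + np.exp(-(fused @ w_score)))  # (L,) per-token route prob
    return (probs > threshold).astype(int)         # hard binary route

rng = np.random.default_rng(0)
tokens = rng.standard_normal((6, 8))
route = pathway_router(tokens, rng.standard_normal(8), rng.standard_normal(8))
```

Dropping the `tokens +` residual in `fused` would reduce this sketch to attention pooling alone, i.e. the TRAR-style controller the ablation in Table 13 compares against.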

F ABLATION STUDY FOR ADAPTIVE GUMBEL-SOFTMAX

The temperature parameter τ is a crucial hyperparameter for the standard Gumbel-Softmax. A fixed temperature cannot adapt across different datasets or users, and the temperature is widely known to be hard to tune: a lower value may lead to high variance in gradients, while a higher value may lead to over-smoothed probabilities. To mitigate these issues, we propose a novel adaptive Gumbel-Softmax mechanism that eliminates the need for temperature tuning by producing token-specific weights automatically adjusted to the varying behaviors of each user.

As shown in Tables 14 and 15, RETR with the standard Gumbel-Softmax achieves its highest performance at different temperatures on different datasets (τ = 0.8 for Tmall and τ = 0.6 for Yelp), showing that a fixed temperature does not adapt across diverse datasets. Moreover, lower temperatures (τ = 0.2, 0.4) cause training failure for RETR due to the high variance in gradients, and higher temperatures (τ = 1, 2) perform worse because of over-smoothed probabilities. It is thus difficult to tune the temperature for the standard Gumbel-Softmax. Previous work such as TRAR (Zhou et al., 2021) develops a schedule that starts with a high temperature and gradually anneals it to a small but non-zero value; this schedule stabilizes training but cannot achieve the best performance across different datasets. In contrast, RETR with the proposed adaptive Gumbel-Softmax, featuring the token-specific weight mechanism, achieves the best performance compared with the standard Gumbel-Softmax under any fixed temperature. These results indicate that the adaptive Gumbel-Softmax is much more effective in capturing the behavior pathway and can dynamically adapt to different datasets without temperature tuning.
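The two variants can be contrasted in a minimal NumPy sketch. The softplus head that predicts a per-token temperature from token features is a hypothetical stand-in for the paper's learned token-specific weights:

```python
import numpy as np

def gumbel_softmax(logits, tau, rng):
    """Standard Gumbel-Softmax: perturb logits with Gumbel noise, then apply
    a softmax sharpened by one fixed, hand-tuned temperature tau."""
    g = -np.log(-np.log(rng.uniform(size=logits.shape)))
    y = (logits + g) / tau
    e = np.exp(y - y.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def adaptive_gumbel_softmax(logits, token_feats, w_temp, rng, eps=1e-2):
    """Adaptive variant: each token predicts its own positive temperature
    from its features (softplus keeps it > 0), so no global tau needs
    hand-tuning and the sharpness adapts per behavior."""
    tau = np.log1p(np.exp(token_feats @ w_temp)) + eps   # (L, 1), token-specific
    g = -np.log(-np.log(rng.uniform(size=logits.shape)))
    y = (logits + g) / tau                               # broadcast over classes
    e = np.exp(y - y.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
logits = rng.standard_normal((6, 2))            # per-token keep/drop logits
feats = rng.standard_normal((6, 8))
p_fixed = gumbel_softmax(logits, 0.8, rng)      # needs tau chosen per dataset
probs = adaptive_gumbel_softmax(logits, feats, rng.standard_normal((8, 1)), rng)
```

The fixed-τ call is what Tables 14 and 15 sweep over, while the adaptive call replaces that sweep with a learned, token-specific temperature.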
This adaptive mechanism can simultaneously overcome the over-smoothing phenomenon and dynamically avoid high variance in gradients.

We further explore how the behavior pathway updates hierarchically via the visualization in Figure 5. Technically, we use Grad-CAM (Selvaraju et al., 2017) to generate behavior heat maps of the output of the first and last layers in RETR, respectively. As shown in Figure 5, our RETR updates the captured behavior pathway hierarchically, dropping the drifted behavior pathway (Adventure) in the last layer (row 2) relative to the first layer (row 1). This visualization shows that our RETR can determine the practical behavior pathway via the hierarchical update procedure.



Figure 1: Three typical examples of the behavior pathway for different users: correlated, causal, and drifted. The behavior pathway is outlined by the red boxes.



Figure 2: The architecture of Recommender Transformer (RETR) on the right subfigure. Pathway Attention (left) explores the behavior pathway via the pathway router (orange module) and captures the evolving sequential characteristics of the user behaviors via multi-head attention.




Figure 3: Illustration of how RETR, SASRec, and FMLP-Rec (Zhou et al., 2022) differ in utilizing the historical behaviors of a random user in the Steam dataset, visualized as behavior heatmaps for the three models.


Figure 4: Visualizations of behavior heatmaps for RETR and SASRec of three random users in the Steam dataset, corresponding to the causal, correlated, and drifted behavior pathways respectively.


Figure 5: Visualization of the hierarchical update of RETR. We provide behavior heat maps of RETR at different blocks for a random user in the Steam dataset.


Ablation study of model-agnostic pathway attention on MovieLens. Results in each column are obtained without/with pathway attention. (↑: positive improvement using pathway attention.)

Statistics of the intra-domain datasets. Tmall contains users' shopping logs on the Tmall online shopping platform, from the IJCAI-15 competition. (6) Steam (Kang & McAuley, 2018): the Steam dataset is collected from a large online video game distribution platform, and includes 2,567,538 users, 15,474 games, and 7,793,069 English reviews from October 2010 to January 2018. (7) MovieLens1M: a widely used benchmark dataset for evaluating collaborative filtering algorithms; we use the MovieLens-1M version, which includes 1 million user ratings.

Statistics of the cross-domain datasets.

Ablation study of the head number (Left) and the maximum sequence length for RETR (Right). Experiments are conducted on the Yelp Dataset.

Ablation study of (Left) the effectiveness of each model component and (Right) the number of blocks for each RETR block. Experiments are conducted on the Yelp Dataset.

Ablation study of (Left) the effectiveness of different temperatures and (Right) comparison of parameters and GFLOPs. All ablation experiments are conducted on the Yelp Dataset.


Ablation study of sparse attention methods on the Tmall dataset.

Quantitative results on the Tmall dataset.

Rolling prediction performance comparison to state-of-the-art models under the intra-domain setting on Tmall. Rec-denoiser (Chen et al., 2022) uses Bert4Rec as the backbone.

Quantitative results on the Tmall, MovieLens1M, and Yelp dataset.

Ablation on the pathway attention mechanism on the Tmall dataset.

Quantitative results on the Tmall dataset. "-" indicates failure case of model training.

Quantitative results on the Yelp dataset. "-" indicates failure case of model training.

