DO WE REALLY NEED COMPLICATED MODEL ARCHITECTURES FOR TEMPORAL NETWORKS?

Abstract

Recurrent neural networks (RNNs) and self-attention mechanisms (SAMs) are the de facto methods to extract spatial-temporal information for temporal graph learning. Interestingly, we found that although both RNN and SAM could lead to good performance, in practice neither of them is always necessary. In this paper, we propose GraphMixer, a conceptually and technically simple architecture that consists of three components: (1) a link-encoder that is based only on multi-layer perceptrons (MLPs) and summarizes the information from temporal links, (2) a node-encoder that is based only on neighbor mean-pooling and summarizes node information, and (3) an MLP-based link classifier that performs link prediction based on the outputs of the two encoders. Despite its simplicity, GraphMixer attains outstanding performance on temporal link prediction benchmarks, with faster convergence and better generalization. These results motivate us to rethink the importance of simpler model architectures.

1. INTRODUCTION

In recent years, temporal graph learning has been recognized as an important machine learning problem and has become the cornerstone of a wealth of high-impact applications Yu et al. (2018); Bui et al. (2021); Kazemi et al. (2020); Zhou et al. (2020); Cong et al. (2021b). Temporal link prediction is one of the classic downstream tasks, focusing on predicting future interactions among nodes. For example, in an ads ranking system, user-ad clicks can be modeled as a temporal bipartite graph whose nodes represent users and ads, and whose links carry timestamps indicating when users click ads; link prediction can then be used to predict whether a user will click an ad. Designing graph learning models that can capture node evolutionary patterns and accurately predict future links is a crucial direction for many real-world recommender systems. In temporal graph learning, the recurrent neural network (RNN) and the self-attention mechanism (SAM) have become the de facto standard Kumar et al. (2019); Sankar et al. (2020); Xu et al. (2020); Rossi et al. (2020); Wang et al. (2020), and the majority of existing works focus on designing neural architectures around one of them, plus additional components, to learn representations from raw data. Although powerful, these methods are conceptually and technically complicated, with advanced model architectures; it is non-trivial to understand which parts of the model design truly contribute to their success, and whether those components are indispensable. Thus, in this paper, we aim to answer the following two questions: Q1: Are RNN and SAM always indispensable for temporal graph learning? To answer this question, we propose GraphMixer, a simple architecture based entirely on multi-layer perceptrons (MLPs) and neighbor mean-pooling, which uses neither RNN nor SAM in its model architecture (Section 3).
Despite its simplicity, GraphMixer obtains outstanding results when compared against baselines equipped with RNNs and SAM. In practice, it achieves state-of-the-art performance in terms of different evaluation metrics (e.g., average precision, AUC, Recall@K, and MRR) on real-world temporal graph datasets, with an even smaller number of model parameters and hyper-parameters and a conceptually simpler input structure and model architecture (Section 4). Q2: What are the key factors that lead to the success of GraphMixer?

Figure 1: Illustration of a temporal graph. Each node has node features (e.g., x^node_1 for v_1) and each temporal link has link features (e.g., x^link_{1,2}(t_1) and x^link_{1,2}(t_5) are the link features between v_1 and v_2 at t_1 and t_5). For scenarios without node or link features, we use all-zero vectors instead.
We identify three key factors that contribute to the success of GraphMixer: (1) The simplicity of GraphMixer's input data and neural architecture. Unlike most deep learning methods, which focus on designing conceptually complicated data preparation techniques and technically complicated neural architectures, we choose to simplify the neural architecture and use conceptually simpler data as input; both choices lead to better model performance and better generalization (Section 4.4). (2) A time-encoding function that encodes any timestamp as an easily distinguishable input vector for GraphMixer. Unlike most existing methods, which learn the time-encoding function from the raw input data, our time-encoding function uses conceptually simple features and is fixed during training. Interestingly, we show that our fixed time-encoding function is preferable to the trainable version (used by most previous studies) and leads to a smoother optimization landscape, faster convergence, and better generalization (Section 4.2). (3) A link-encoder that can better distinguish temporal sequences. Unlike most existing methods, which summarize sequences using SAM, our encoder module is entirely based on MLPs. Interestingly, our encoder can distinguish temporal sequences that cannot be distinguished by SAM, and it generalizes better due to its simpler neural architecture and lower model complexity (Section 4.3). To this end, we summarize our contributions as follows: (1) we propose GraphMixer, a conceptually and technically simple architecture; (2) even without RNN and SAM, GraphMixer not only outperforms all baselines but also enjoys faster convergence and better generalization; (3) an extensive study identifies three factors that contribute to the success of GraphMixer; (4) our results could motivate future research to rethink the importance of conceptually and technically simpler methods.

2. PRELIMINARY AND EXISTING WORKS

Preliminary. Figure 1 is an illustration of a temporal graph. Our goal is to predict whether two nodes are connected at a specific timestamp t_0 based on all the temporal graph information available before that timestamp. For example, to predict whether v_1, v_2 are connected at t_0, we only have access to the graph structure, node features, and link features with timestamps from t_1 to t_6. Related works. Most temporal graph learning methods are conceptually and technically complicated, with advanced neural architectures, and it is non-trivial to fully understand the algorithmic details without looking into their implementations. Therefore, we select the four most representative and most closely related methods to introduce and compare in more detail. • JODIE Kumar et al. (2019) is an RNN-based method. Let us denote by x_i(t) the embedding of node v_i at time t, by x^link_{ij}(t) the link feature between v_i, v_j at time t, and by m_i the timestamp at which v_i last interacted with another node. JODIE pre-processes and updates the representation of each node via RNNs. More specifically, when an interaction between v_i, v_j happens at time t, JODIE updates the temporal embedding using an RNN as x_i(t) = RNN(x_i(m_i), x_j(m_j), x^link_{ij}(t), t − m_i). Then, the dynamic embedding of node v_i at time t_0 is computed as h_i(t_0) = (1 + (t_0 − m_i)w) ∘ x_i(m_i).
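As a concrete illustration of JODIE's time-projection step, consider the following minimal NumPy sketch; the embedding dimension and the random placeholder parameters are hypothetical, and the RNN update itself is omitted:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                              # hypothetical embedding dimension
x_i = rng.normal(size=d)           # x_i(m_i): v_i's embedding at its last interaction m_i
w = rng.normal(size=d) * 0.01      # learnable time-projection vector (random placeholder)
m_i, t0 = 3.0, 5.0                 # last interaction time and prediction time

# Dynamic embedding: h_i(t0) = (1 + (t0 - m_i) * w) ∘ x_i(m_i), an elementwise product.
h_i = (1.0 + (t0 - m_i) * w) * x_i
```

At t_0 = m_i the projection reduces to the stored embedding x_i(m_i); the further t_0 is from m_i, the more the embedding drifts along w.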

3. GRAPHMIXER: A CONCEPTUALLY AND TECHNICALLY SIMPLE METHOD

In this section, we first introduce the neural architecture of GraphMixer in Section 3.1 and then explicitly highlight its differences from baseline methods in Section 3.2.

3.1. DETAILS ON GRAPHMIXER: NEURAL ARCHITECTURE AND INPUT DATA

GraphMixer has three modules: (1) the link-encoder summarizes the information from temporal links (e.g., link timestamps and link features); (2) the node-encoder summarizes the information from nodes (e.g., node features and node identity); (3) the link classifier predicts whether a link exists based on the outputs of the two encoders. Link-encoder. The link-encoder is designed to summarize, for each node, the temporal link information sorted by timestamps, where temporal link information refers to the timestamp and features of each link. For example in Figure 1, the temporal link information for node v_2 is {(t_1, x^link_{1,2}(t_1)), (t_3, x^link_{2,4}(t_3)), (t_4, x^link_{2,4}(t_4)), (t_5, x^link_{1,2}(t_5))} and for node v_5 is {(t_2, x^link_{3,5}(t_2)), (t_6, x^link_{4,5}(t_6))}. In practice, we only keep the top-K most recent temporal links, where K is a dataset-dependent hyper-parameter. If multiple links share the same timestamp, we simply keep them in the same order as in the raw input data. To summarize temporal link information, the link-encoder must be able to distinguish different timestamps (achieved by our time-encoding function) and different temporal link information (achieved by the Mixer module). • Time-encoding function.
To distinguish different timestamps, we introduce our time-encoding function cos(tω), which utilizes fixed features ω = {α^{-(i-1)/β}}_{i=1}^{d} to encode each timestamp into a d-dimensional vector. More specifically, we first map each t to a vector with monotonically exponentially decreasing values tω ∈ (0, t] along the feature dimension, then use the cosine function to project all values into cos(tω) ∈ [-1, +1]. The selection of α, β depends on the scale of the maximum timestamp t_max we wish to encode: to distinguish all timestamps, we must make sure t_max × α^{-(i-1)/β} → 0 as i → d. In practice, we found d = 100 and α = β = √d work well for all datasets.

Figure 2: (a) The time-encoding function pre-processes a timestamp t into a vector cos(tω); the x-axis is the vector dimension and the y-axis is the cosine value. (b) The link-encoder takes the temporal link information of node v_2 as input and outputs a vector t_2(t_0) that will be used for link prediction.
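The fixed time-encoding function can be sketched in a few lines; this is a minimal NumPy sketch using the paper's suggested d = 100 and α = β = √d, not the authors' implementation:

```python
import numpy as np

def time_encoding(t, d=100):
    """Fixed time encoding cos(t * omega) with omega_i = alpha^(-(i-1)/beta).

    omega is fixed and never trained; alpha = beta = sqrt(d) as in the paper.
    """
    alpha = beta = np.sqrt(d)
    i = np.arange(1, d + 1)
    omega = alpha ** (-(i - 1) / beta)   # monotonically decreasing frequencies
    return np.cos(t * omega)
```

Since ω_1 = 1, the first entry is simply cos(t); later entries use ever slower frequencies, so nearby timestamps produce nearby encodings.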
Notice that ω is fixed and will not be updated during training. As shown in Figure 2a, the output of this time-encoding function has two main properties that help GraphMixer distinguish different timestamps: similar timestamps have similar time-encodings (e.g., the plots of t_1, t_2), and the larger the timestamp, the later the values in its time-encoding converge to +1 (e.g., the plots of t_1, t_3 or t_1, t_4). • Mixer for information summarizing. We use a 1-layer MLP-Mixer Tolstikhin et al. (2021) to summarize the temporal link information. Figure 2b is an example of summarizing the temporal link information of node v_2. Recall that the temporal link information of node v_2 is {(t_1, x^link_{1,2}(t_1)), (t_3, x^link_{2,4}(t_3)), (t_4, x^link_{2,4}(t_4)), (t_5, x^link_{1,2}(t_5))}. We first encode each timestamp with our time-encoding function and concatenate it with the corresponding link features. For example, we encode (t_1, x^link_{1,2}(t_1)) as [cos((t_0 − t_1)ω) || x^link_{1,2}(t_1)], where t_0 is the timestamp at which we want to predict whether the link exists. Then, we stack all the outputs into a matrix and zero-pad it to the fixed length K, denoted as T_2(t_0). Finally, we use a 1-layer MLP-Mixer with mean-pooling to compress T_2(t_0) into a single vector t_2(t_0). Specifically, the MLP-Mixer takes T_2(t_0) as input H_input and computes H_token = H_input + W^(2)_token GeLU(W^(1)_token LayerNorm(H_input)), H_channel = H_token + GeLU(LayerNorm(H_token) W^(1)_channel) W^(2)_channel, and outputs the temporal encoding t_2(t_0) = Mean(H_channel). Please notice that the zero-padding operator is important for capturing how often a node interacts with other nodes: a node with more zero-padded dimensions has fewer temporal neighbors. This information is very important in practice according to our experimental observations. Node-encoder.
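The token- and channel-mixing steps above can be sketched as follows; this is a minimal NumPy sketch where the hidden width r and the tanh-approximated GeLU are our assumptions, not the authors' exact implementation:

```python
import numpy as np

def layer_norm(H, eps=1e-5):
    # normalize each row to zero mean and unit variance
    return (H - H.mean(-1, keepdims=True)) / np.sqrt(H.var(-1, keepdims=True) + eps)

def gelu(x):
    # tanh approximation of GeLU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x ** 3)))

def mixer_block(H_input, Wt1, Wt2, Wc1, Wc2):
    """One MLP-Mixer layer over H_input of shape (K, d): K padded links, d channels."""
    # Token mixing acts across the K rows (the temporal links).
    H_token = H_input + Wt2 @ gelu(Wt1 @ layer_norm(H_input))
    # Channel mixing acts across the d columns (the feature channels).
    H_channel = H_token + gelu(layer_norm(H_token) @ Wc1) @ Wc2
    return H_channel.mean(axis=0)        # mean-pooling gives t_i(t0)
```

Here Wt1 has shape (r, K) and Wt2 shape (K, r) so the token-mixing residual shapes match; similarly Wc1 is (d, r) and Wc2 is (r, d).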
The node-encoder is designed to capture node identity and node feature information via neighbor mean-pooling. Let us define the 1-hop neighbors of node v_i with link timestamps between t and t_0 as N(v_i; t, t_0). For example in Figure 1, we have N(v_2; t_4, t_0) = {v_1, v_4} and N(v_5; t_4, t_0) = {v_3}. Then, the node-info feature is computed from the 1-hop neighbors as s_i(t_0) = x^node_i + Mean{x^node_j | v_j ∈ N(v_i; t_0 − T, t_0)}, where T is a dataset-dependent hyper-parameter. In practice, we found 1-hop neighbors are enough to achieve good performance, and we use one-hot node representations for datasets without node features. Link classifier. The link classifier predicts whether a link exists at time t_0 using the output of the link-encoder t_i(t_0) and the output of the node-encoder s_i(t_0). Let us denote node v_i's representation at time t_0 as the concatenation of the two encodings, h_i(t_0) = [s_i(t_0) || t_i(t_0)]. Then, the prediction of whether an interaction between nodes v_i, v_j happens at time t_0 is computed by applying a 2-layer MLP to [h_i(t_0) || h_j(t_0)], i.e., p_ij = MLP([h_i(t_0) || h_j(t_0)]).
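The node-encoder and link classifier can be sketched together in a minimal NumPy form; the one-hot features mirror the v_2 example from Figure 1, and the MLP weights are random placeholders rather than trained parameters:

```python
import numpy as np

def node_encoding(x_node, i, neighbors):
    # s_i(t0) = x_i + mean of 1-hop neighbor features within [t0 - T, t0]
    return x_node[i] + np.stack([x_node[j] for j in neighbors]).mean(axis=0)

def link_prob(h_i, h_j, W1, W2):
    # 2-layer MLP on the concatenation [h_i || h_j], followed by a sigmoid
    z = np.concatenate([h_i, h_j])
    return 1.0 / (1.0 + np.exp(-(W2 @ np.maximum(W1 @ z, 0.0))))

# One-hot node features for a 5-node graph (used when no node features exist).
x_node = np.eye(5)
s2 = node_encoding(x_node, 1, [0, 3])   # v2's in-window neighbors: {v1, v4}

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(3, 10)), rng.normal(size=3)
p = link_prob(s2, s2, W1, W2)           # toy score for a (v2, v2) pair
```

In the full model h_i(t_0) would concatenate s_i(t_0) with the link-encoder output t_i(t_0); here we feed the node encodings alone only to keep the sketch small.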

3.2. COMPARISON TO EXISTING METHODS

In the following, we highlight some differences between GraphMixer and other methods, which we explicitly study via ablations in the experiment section (Section 4.4). Temporal graph as undirected graph. Most existing works consider temporal graphs as directed graphs where information flows only from the source nodes (e.g., users in a recommender system) to the destination nodes (e.g., ads in a recommender system). However, we consider the temporal graph as an undirected graph. By doing so, if two nodes are frequently connected in the last few timestamps, the "most recent 1-hop neighbors" sampled for the two nodes on the "undirected" temporal graph will be similar. In other words, the similarity between the sampled neighbors indicates whether two nodes were frequently connected in the last few timestamps, which is essential for temporal link prediction: intuitively, if two nodes were frequently connected in the last few timestamps, they are also likely to be connected in the near future. Selection of neighbors. Existing methods consider either "multi-hop recent neighbors" or "multi-hop uniformly sampled neighbors", whereas we only consider the "1-hop most recent neighbors". (CAWs, for instance, samples neighbors by random walks, which can also be thought of as multi-hop recent neighbors.) Although sampling more neighbors could provide a sufficient amount of information for models to reason about, it could also carry much spurious or noisy information.

Table 1: Comparison of the average precision score for link prediction. GraphMixer uses one-hot node encodings for datasets without node features (marked by ♮). For each dataset, we indicate which features are available ("L" link features, "N" node features, and "T" link timestamps). Red is the best score; Blue is the best score excluding GraphMixer and its variants.
As a result, more complicated model architectures (e.g., RNN or SAM) are required to extract useful information from the raw data, which can lead to poor model trainability and potentially weaker generalization. Instead, we only take the "most recent 1-hop neighbors" into consideration, which is conceptually simpler and enjoys better performance.
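Sampling the "most recent 1-hop neighbors" on the undirected view needs only a stable sort; this is a minimal sketch, and the (u, v, t) edge-list format is a hypothetical example rather than the paper's data loader:

```python
def recent_neighbors(edges, node, K):
    """K most recent 1-hop neighbors of `node` in an undirected temporal graph.

    `edges` is a list of (u, v, t) events; every event is visible from both
    endpoints because the graph is treated as undirected. Ties keep input order.
    """
    incident = [(t, v if u == node else u)
                for (u, v, t) in edges if node in (u, v)]
    incident.sort(key=lambda e: e[0])        # stable sort by timestamp
    return [n for _, n in incident[-K:]]

# Edges of the Figure 1 example, with timestamp t_i written as i.
edges = [(1, 2, 1), (3, 5, 2), (2, 4, 3), (2, 4, 4), (1, 2, 5), (4, 5, 6)]
```

On this graph, the two most recent neighbors of v_2 are v_4 (at t_4) and v_1 (at t_5), matching the temporal link information listed in Section 3.1.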

4. EXPERIMENTS

Dataset. We conduct experiments on five real-world datasets, including the Reddit, Wiki, MOOC, and LastFM datasets used in Kumar et al. (2019) and the GDELT dataset introduced in Zhou et al. (2022). Besides, since GDELT is the only dataset with both node and link features, we create two variants of it to understand the effect of training data on model performance: GDELT-e removes the link features from GDELT and keeps the node features and link timestamps; GDELT-ne removes both the node and link features from GDELT and keeps only the link timestamps. For each dataset, we use the same 70%/15%/15% chronological splits for the train/validation/test sets as existing works. The detailed dataset statistics are summarized in Appendix A.2. Baselines. We compare against the baselines introduced in Section 2. Besides, we create two GraphMixer variants to better understand how node- and link-information contribute to our results: GraphMixer-L only uses the link-encoder and GraphMixer-N only uses the node-encoder. We conduct experiments under the transductive learning setting and use average precision for evaluation. The detailed model configuration, training, and evaluation process are summarized in Appendix A.3. Due to the space limit, more experimental results using Recall@K, MRR, and AUC as evaluation metrics, as well as comparisons of wall-clock time and number of parameters, are deferred to Appendix C. Outline. We first compare GraphMixer with baselines in Section 4.1, then highlight the three key factors that contribute to its success in Sections 4.2, 4.3, and 4.4.

4.1. MAIN EMPIRICAL RESULTS.

GraphMixer achieves outstanding performance. We compare the average precision scores with baselines in Table 1 and make the following observations: (1) GraphMixer outperforms all baselines on all datasets. These results provide strong support for our argument that neither RNN nor SAM is necessary for temporal graph link prediction. (2) From the performance of GraphMixer-L on datasets that only have link timestamp information (MOOC, LastFM, and GDELT-ne), we know that our time-encoding function successfully pre-processes each timestamp into a meaningful vector. In fact, we show later in Section 4.2 that our fixed time-encoding function is preferable to the baselines' trainable version. (3) By comparing the performance of GraphMixer-N and GraphMixer on the Wiki, MOOC, and LastFM datasets, we know that the node-encoder alone is not enough to achieve good performance; however, it provides useful information that benefits the link-encoder. (4) By comparing the performance of GraphMixer-N on GDELT and GDELT-ne, we observe that using one-hot encodings outperforms using node features. This also shows the importance of node identity information, because one-hot encodings capture only such information. (5) More complicated methods (e.g., CAWs, TGSRec, and DDGCL) do not perform well when using their default hyper-parameters, which is understandable because these methods have more components with an excessive number of hyper-parameters to tune. GraphMixer enjoys a smoother loss landscape. To understand why "GraphMixer converges faster and generalizes better, while baselines suffer from training instability and generalize poorly", we explore the loss landscape using the visualization tools introduced in Li et al. (2018a). We illustrate the loss landscape in Figure 4 by calculating and visualizing the loss surface along two random directions near the pre-trained optimal parameters.
The x- and y-axes indicate how much the optimal solution is stretched along the two random directions, and the optimal point is where both axes are zero. (1) From Figures 4a and 4d, we see that GraphMixer enjoys a smoother landscape with a flatter surface at the optimal point, while the slope becomes steeper when stretching along the two random directions. The steeper slope on the periphery explains why GraphMixer converges fast, and the flatter surface at the optimal point explains why it generalizes well. (2) Surprisingly, we find that the baselines have non-smooth landscapes with many spikes on the surface (Figures 4b, 4c, 4e, 4f). This observation explains the training instability and poor generalization of the baselines shown in Figures 3 and 8. Interestingly, as we will show in Section 4.2, the trainable time-encoding function in the baselines is the key cause of this non-smooth landscape; replacing it with our fixed time-encoding function flattens the landscape and boosts their performance.

4.2. ON THE IMPORTANCE OF THE FIXED TIME-ENCODING FUNCTION

Existing works (i.e., JODIE, TGAT, and TGN) leverage a trainable time-encoding function z(t) = cos(tw + b) to represent timestamps. However, we argue that a trainable time-encoding function can cause instability during training: its gradient ∂cos(tw + b)/∂w = −t sin(tw + b) scales proportionally to the timestamp, which can lead to training instability and causes the baselines' non-smooth landscapes shown in Figure 4. As an alternative, we utilize the fixed time-encoding function z(t) = cos(tω) with fixed features ω that capture the relative difference between two timestamps (introduced in Section 3.1).
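The gradient-scaling argument can be checked numerically; this is a minimal sketch with hypothetical scalar parameters w = 0.01 and b = 0:

```python
import numpy as np

def grad_wrt_w(t, w=0.01, b=0.0):
    # d/dw cos(t*w + b) = -t * sin(t*w + b): the magnitude grows linearly with t
    return -t * np.sin(t * w + b)
```

For timestamps on the order of 10^6, the gradient magnitude is several orders larger than for timestamps near 1, matching the instability argument for the trainable encoding.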
To verify this, we design a simple experiment to test whether the time-encoding functions (both our fixed version and the baselines' trainable version) are expressive enough that a simple linear classifier can distinguish the encodings of two different timestamps. Specifically, the goal is to classify whether t1 > t2 by learning a linear classifier on [z(t1) || z(t2)]. During training, we randomly generate two timestamps t1, t2 ∈ [0, 10^6] and ask a fully connected layer to classify whether one timestamp is greater than the other. As shown in Figure 5a, the trainable time-encoding function (orange curve) suffers from an exploding gradient issue (upper left figure) and its accuracy remains almost unchanged during training (lower left figure). In contrast, our fixed time-encoding function (blue curve) does not suffer from exploding gradients and quickly achieves high accuracy within a few iterations. Meanwhile, we compare the parameter trajectories of the two models in Figure 5b and observe that the parameters of the trainable time-encoding function change drastically more than those of our fixed version; such a large change in the weight parameters can deteriorate model performance. Most importantly, by replacing the baselines' trainable time-encoding function with our fixed version, most baselines obtain a smoother optimization landscape (Figure 6) and better performance (Table 2), which further verifies that our fixed time-encoding function is preferable to the trainable version.
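The t-scaled gradient of the trainable encoder can be checked directly. The snippet below evaluates the analytic gradient −t·sin(tw + b) at a small and a large timestamp (the random w and b are illustrative, not trained values):

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=8)   # illustrative trainable frequencies
b = rng.normal(size=8)

def grad_wrt_w(t):
    # d/dw cos(t*w + b) = -t * sin(t*w + b): magnitude grows linearly in t.
    return -t * np.sin(t * w + b)

small_norm = np.linalg.norm(grad_wrt_w(10.0))
large_norm = np.linalg.norm(grad_wrt_w(1e6))
```

A timestamp of 10^6 (the range used in the experiment above) yields gradients several orders of magnitude larger than a timestamp of 10, which is exactly the exploding-gradient behavior seen in Figure 5a.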

4.3. ON THE IMPORTANCE OF MLP-MIXER IN GRAPHMIXER'S LINK-ENCODER

In this section, we aim for a deeper understanding of the expressive power of the link-encoder by answering two questions: "Can we replace the MLP-mixer in the link-encoder with self-attention?" and "Why is MLP-mixer a good alternative to self-attention?" To answer these questions, we first conduct experiments that replace the MLP-mixer in the link-encoder with full/1-hop self-attention and sum/mean-pooling, where full self-attention is widely used in Transformers and 1-hop self-attention is widely used in graph attention networks. As shown in Table 3, GraphMixer suffers from performance degradation when using self-attention: the best performance is achieved with MLP-mixer and zero-padding, performance drops slightly with self-attention and sum-pooling (rows 2 and 4), and it drops significantly with self-attention and mean-pooling (rows 3 and 5). Self-attention with mean-pooling performs worse because it cannot distinguish "temporal sequences with identical link timestamps and features" (e.g., it cannot distinguish [a 1 , a 1 ] from [a 1 ]) and it cannot explicitly capture "the length of temporal sequences" (e.g., it cannot tell whether [a 1 , a 2 ] is longer than [a 3 ]); both are very important for GraphMixer to understand how frequently a node interacts with other nodes. We explicitly verify this in Figure 7 by first generating two temporal sequences (with timestamps but without link features), then encoding the timestamps into vectors via the time-encoding function, and asking full self-attention and MLP-mixer to distinguish them.
As shown in Figure 7, self-attention with mean-pooling cannot distinguish two temporal sequences with identical timestamps (because all self-attention weights are equal when the features of the nodes on the two sides of a link are identical) and cannot capture the sequence length (because mean-pooling simply averages the inputs and does not take the input size into account). In contrast, the MLP-mixer in GraphMixer can distinguish both sequences because of zero-padding. Fortunately, the two weaknesses above can be alleviated by replacing the mean-pooling in temporal self-attention with sum-pooling, which explains why sum-pooling yields better performance than mean-pooling. However, since self-attention modules have more parameters and are harder to train, they can generalize poorly when the downstream task is not too complicated.
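This failure mode is easy to reproduce with plain pooling operators. The example below is a minimal sketch using raw feature vectors (no attention weights), which suffices because with identical inputs all attention weights are equal and attention reduces to mean-pooling:

```python
import numpy as np

a1 = np.array([1.0, 2.0, 3.0])   # one link's (encoded) feature

def mean_pool(seq):
    return np.mean(np.stack(seq), axis=0)

def sum_pool(seq):
    return np.sum(np.stack(seq), axis=0)

def zero_pad(seq, k):
    # Fixed-slot input as used by the MLP-mixer link-encoder: k rows,
    # unused slots filled with zeros, so sequence length stays visible.
    out = np.zeros((k, len(seq[0])))
    out[: len(seq)] = np.stack(seq)
    return out
```

mean_pool([a1, a1]) equals mean_pool([a1]), so the two sequences collapse; sum_pool and zero_pad keep them apart, matching the ranking mean-pooling < sum-pooling < MLP-mixer with zero-padding in Table 3.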

4.4. KEY FACTORS TO THE BETTER PERFORMANCE

One of the major factors contributing to GraphMixer's success is the simplicity of its neural architecture and input data. Using conceptually simple input data that is better aligned with the labels allows a simple neural network to capture the underlying mapping from inputs to labels, which leads to better generalization. In the following, we explicitly verify this by comparing the performance of GraphMixer with different input data in Table 4: 1 Recall from Section 3.2 that the "most recent 1-hop neighbors" sampled for the two nodes on the "undirected" temporal graph indicate whether the two nodes were frequently connected in the last few timestamps, which is essential for temporal graph link prediction. To verify this, we conduct an ablation study comparing the model performance on directed and undirected temporal graphs. As shown in the 1st and 2nd rows of Table 4, changing from the undirected to the directed graph results in a significant performance drop because this information is missing. 2 Recall from Section 3.1 that, instead of feeding raw timestamps to GraphMixer and encoding each timestamp with a trainable time-encoding function, GraphMixer encodes the timestamps with our fixed time-encoding function and feeds the encoded representations to the model, which removes the burden of learning a time-encoding function from data. This is verified by the 3rd and 4th rows of Table 4, where using the pre-encoded time information gives better performance. 3 Selecting input data whose distribution is similar in the training and evaluation sets can also reduce the evaluation error.
For example, using relative timestamps (i.e., each neighbor's timestamp is subtracted from its root node's timestamp) is better than absolute timestamps (e.g., Unix timestamps), because under chronological splits the absolute timestamps in the training and evaluation sets come from different ranges, whereas relative timestamps are very likely to overlap. As shown in the 3rd to 6th rows of Table 4, using relative time information always gives better performance than using absolute time information. 4 Selecting the most representative neighbors for each node also matters. For example, we found that the 1-hop most recently interacted neighbors are the most representative for link prediction; switching to either 2-hop neighbors or uniformly sampled neighbors hurts performance, according to the 7th to 10th rows of Table 4.
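The distribution-overlap argument for relative timestamps can be illustrated with a toy chronological split (the numbers are synthetic):

```python
import numpy as np

# Synthetic absolute timestamps, split chronologically 70/30.
ts = np.arange(0.0, 1000.0)
train_ts, eval_ts = ts[:700], ts[700:]

def relative_ts(neighbor_t, root_t):
    # Each neighbor's timestamp is subtracted from its root's timestamp,
    # so the model sees "how long ago", not "when".
    return root_t - neighbor_t
```

The absolute train and evaluation ranges are disjoint under a chronological split, while a neighbor that interacted 5 steps before its root yields the same relative input in both splits.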

5. CONCLUSION

In this paper, we propose GraphMixer, a conceptually and technically simple architecture for temporal link prediction. GraphMixer not only outperforms all baselines but also enjoys faster convergence and better generalization. An extensive study identifies three key factors that contribute to the success of GraphMixer and highlights the importance of simpler neural architectures and input data structures. An interesting future direction, not limited to temporal graph learning, is designing algorithms that automatically select the best input data and data pre-processing strategies for different downstream tasks.

Training and evaluation. A unified training and evaluation process (e.g., mini-batching and data preparation) is used for GraphMixer and all baselines. Specifically, each mini-batch is constructed by first sampling a set of positive node pairs and an equal number of negative node pairs. Then, an algorithm-dependent node sampler samples the neighbors of each mini-batch node, and node representations are computed from the sampled neighborhoods. Finally, we concatenate the representations of each node pair and use the link prediction classifier (introduced in Section 3.1) for binary classification. We conduct experiments under the transductive learning setting and use average precision for evaluation.

B MORE DISCUSSION ON EXISTING TEMPORAL GRAPH METHODS

B.1 RECENT METHODS THAT WE DO NOT COMPARE WITH

There are other temporal graph learning algorithms related to the temporal link prediction task that we did not compare GraphMixer with, because (1) the official implementations of some of these works are not released by the authors and we could not reproduce the results reported in their papers, and (2) we already compare with many recent baselines, which we believe is enough to verify the success of GraphMixer. For example, MeTA Wang et al. (2021c) proposes data augmentation to overcome the over-fitting issue in temporal graph learning. More specifically, it generates several graphs with different data augmentation magnitudes and performs message passing between these graphs to provide adaptively augmented inputs for every prediction. TCL Wang et al. (2021a) uses a transformer to separately extract the temporal neighborhood representations associated with the two interacting nodes, and then utilizes a co-attentional transformer to model inter-dependencies at a semantic level. To boost performance, contrastive learning is used to maximize the mutual information between the predictive representations of the two future interaction nodes. Please note that we are not claiming that conceptually and technically complicated methods are bad. Instead, we are simply suggesting that simpler methods might be preferable to complicated ones if they achieve similar performance. • We say a method is conceptually complicated if the underlying idea behind the method is non-trivial. For example, CAWs represents network dynamics by "motifs extracted using temporal random walks" and represents node identity by "hitting counts of the nodes based on a set of sampled walks"; TGSRec takes "temporal collaborative signals" into consideration. These concepts are non-trivial to understand in the first place and could require considerable domain knowledge from other fields to understand the behavior of the method.
• We say a method is technically complicated if it is non-trivial to implement due to many hyper-parameters and many details that need to be taken care of, which can make applying it to a real-world scenario challenging. For example, JODIE and TGN maintain a "memory" for each node using an RNN, and this "memory" needs to be reset after every evaluation because it might otherwise carry information about the evaluation data. CAWs extracts features using multiple temporal random walks, which makes implementation and hyper-parameter fine-tuning more challenging. MeTA and TCL consider many data augmentation strategies, each of which is non-trivial to implement and can affect model performance in different ways.

B.3 THEORETICAL WORKS ON TEMPORAL GRAPH LEARNING

Recently, researchers have investigated the expressive power of temporal graph neural networks using graph isomorphism tests. For instance, Gao & Ribeiro (2022) categorize temporal graph learning methods into "time-and-graph" and "time-then-graph" and compare their expressiveness, demonstrating that "time-then-graph" is more expressive than "time-and-graph" in terms of the 1-WL test. This partially explains the good performance of GraphMixer, which can be thought of as a "time-then-graph" algorithm. Additionally, Souza et al. (2022) analyze temporal graph neural networks through the lens of the 1-WL test and show that incorporating temporal encoding can enhance their expressive power.

C.1 COMPARISON ON CONVERGENCE SPEED AND GENERALIZATION

We include the missing figures of Section 4. Results on other datasets and the discussion on the experiment results could be found next to Figure 3 . 

C.2 TRANSDUCTIVE LEARNING WITH RECALL@K AND MRR AS EVALUATION METRIC

Recall@K and MRR (mean reciprocal rank) are popular evaluation metrics in real-world recommender systems; the larger the number, the better the model performance. Our Recall@K and MRR are implemented based on the Open Graph Benchmark's link prediction evaluation metrics. More specifically, we first sample 100 negative destination nodes for the source node of each temporal link node pair; the goal is then to rank the positive temporal link node pair higher than the 100 negative destination nodes. In the following, we compare the Recall@5 and MRR scores of GraphMixer with the four most representative baselines. Please note that since these methods are implemented under the same framework and evaluated with the same models used in Table 1, the comparison is guaranteed to be fair. We have the following observations: • According to the results in Table 6, GraphMixer achieves outstanding performance across all datasets, especially on the LastFM and GDELT datasets (denser graphs than the other datasets). This might imply that GraphMixer is more suitable for denser graphs than the baseline methods. • Some baseline methods perform noticeably worse under the Recall@K and MRR evaluation metrics, e.g., TGN on the LastFM dataset and TGAT on LastFM and GDELT. These results also point to the limitation of only considering average precision and AUC scores for temporal link evaluation.
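For a single positive link scored against its sampled negatives, the two metrics can be sketched as follows (ties are broken pessimistically; this is a simplification of the OGB evaluator we build on):

```python
import numpy as np

def recall_at_k_and_mrr(pos_score, neg_scores, k=5):
    """Rank one positive destination against sampled negatives
    (100 negatives per source node in our setup). Rank 1 is best."""
    rank = 1 + int(np.sum(np.asarray(neg_scores) >= pos_score))
    return float(rank <= k), 1.0 / rank
```

Averaging the two returned values over all positive links gives the dataset-level Recall@K and MRR.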

C.3 COMPARISON ON WALL-CLOCK TIME

In the following, we compare the wall-clock time it takes GraphMixer and the baselines to finish a single epoch of training. Compared with CAWs, TGSRec, and DDGCL, GraphMixer takes significantly less time, which indicates its efficiency. Compared with JODIE, DySAT, TGAT, APAN, and TGN, GraphMixer is very close to, or even slightly faster than, some baseline methods. GraphMixer is slightly slower than a few baselines (e.g., JODIE and TGN) mainly because those baselines use well-optimized computation functions from DGL Wang et al. (2019), while GraphMixer uses a composition of basic PyTorch functions. In fact, according to Table 8, GraphMixer has a similar or smaller number of parameters than these baselines. Besides, our current implementation also needs to pre-process the input data for each epoch of training, e.g., sorting the nodes in each subgraph in temporal order and removing duplicated edges. For example, the data preparation at each epoch takes 41 sec on Reddit, 9 sec on Wiki, 20 sec on MOOC, 48 sec on LastFM, and 71 sec on GDELT. However, by caching the pre-processed data in memory, we only need to pre-process the input data in the first epoch, because our neighbor selection is deterministic and the input data does not change across epochs.
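The cache-once strategy is straightforward; here is a minimal sketch (the helper names are ours for illustration, not from the released code):

```python
_cache = {}

def cached_preprocess(key, build_fn):
    """Build the per-epoch inputs once and reuse them afterwards.
    Valid because neighbor selection is deterministic and the input
    data never changes across epochs."""
    if key not in _cache:
        _cache[key] = build_fn()
    return _cache[key]

def sort_and_dedup(links):
    """The preprocessing described above: sort (src, dst, t) links by
    timestamp and drop duplicate edges, keeping the earliest one."""
    seen, out = set(), []
    for src, dst, t in sorted(links, key=lambda e: e[2]):
        if (src, dst) not in seen:
            seen.add((src, dst))
            out.append((src, dst, t))
    return out
```

The second and later lookups return the cached object directly, so the per-epoch preparation cost quoted above is paid only once per dataset.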

C.4 TRANSDUCTIVE LEARNING WITH AUC AS EVALUATION METRIC

AUC (Area Under the ROC Curve) is one of the most widely accepted evaluation metrics for link prediction and has been used in many existing works Xu et al. (2020); Rossi et al. (2020). In the following, we compare the AUC score of GraphMixer with the baselines. We have the following observations: 1 GraphMixer outperforms all baselines on all datasets. In particular, it attains more than 1% gain over all baselines on the LastFM, GDELT-ne, and GDELT-e datasets, around 2% gain over the non-RNN methods DySAT and TGAT on the Wiki dataset, and around 11% gain over DySAT and TGAT on the GDELT-ne dataset. These results provide strong support for our argument that neither RNN nor SAM is necessary for temporal graph link prediction. 2 The performance of GraphMixer-L on datasets that only have link timestamp information (MOOC, LastFM, and GDELT-ne) shows that our time-encoding function can successfully pre-process each timestamp into a meaningful vector. 3 Comparing GraphMixer-N and GraphMixer on the Wiki, MOOC, and LastFM datasets shows that the node-encoder alone is not enough to achieve good performance; however, it provides useful information that benefits the link-encoder. 4 Comparing GraphMixer-N on GDELT and GDELT-ne shows that one-hot encoding outperforms node features, which again underlines the importance of node identity information, because one-hot encoding captures only such information. GraphMixer utilizes the undirected temporal graph to capture whether two nodes were frequently connected in the last few timestamps. In the following, we test whether using the undirected temporal graph can also improve the performance of the baseline methods.
As we can see from Table 10, using the undirected temporal graph does not improve the performance of the baseline methods much, because such information is already implicitly captured by their neural architecture design or sampling methods. Our results show that GraphMixer outperforms the baselines on the LastFM dataset by a large margin, which is due to a composite effect of multiple factors. In the following, we summarize several potential factors that lead to this observation. • Larger average time-gap. LastFM has a larger average time-gap (t_max − t_min)/|E| than the other datasets. As shown in the dataset statistics (Table 5), LastFM has an average time-gap of 106, which is significantly larger than the other datasets: Reddit's average time-gap is 4, Wiki's is 17, MOOC's is 3.6, and GDELT's is 0.1. Since the baseline methods rely on RNN and SAM to process the historical temporal information, they implicitly assume the temporal information is "smooth" with a small average time-gap. Therefore, the baselines could work well on datasets with a small average time-gap but are less ideal on LastFM. GraphMixer does not rely on RNN or SAM and is therefore less affected by this issue. • Larger average node degree. LastFM has a larger average node degree |E|/|V| than the other datasets, which makes it potentially prone to over-smoothing (aggregating features from many neighbors makes the output representations less distinguishable Li et al. (2018b)), over-squashing (aggregating too much information into a limited memory might compress away useful information Alon & Yahav (2020)), and over-fitting Cong et al. (2021a).
For example, according to the dataset statistics in Table 5, LastFM has an average node degree of 653, which is larger than Reddit's 61, Wiki's 17, MOOC's 57, and GDELT's 216. Existing methods either use the memory cell of an RNN to store temporal information or use SAM to aggregate temporal information from multiple hops, which could be less ideal on a dense graph due to over-smoothing and over-squashing. GraphMixer relies less on such an aggregation scheme and therefore performs better than the baseline methods. • Larger maximum timestamp. GraphMixer uses a fixed time-encoding function, while the baselines use trainable ones. Since the largest timestamp t_max in LastFM is larger than in the other datasets, the trainable time-encoding function is more affected by the unbounded gradient issue discussed in Table 2 and Section 4.2.
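Both statistics quoted in this discussion are simple to compute from the list of link timestamps and the node count (a sketch; E and V denote the link and node sets):

```python
def avg_time_gap(timestamps):
    # (t_max - t_min) / |E|: mean spacing between consecutive events.
    return (max(timestamps) - min(timestamps)) / len(timestamps)

def avg_node_degree(num_links, num_nodes):
    # |E| / |V|: links per node.
    return num_links / num_nodes
```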



The GDELT dataset used in our paper is a sub-sampled version because the original dataset is too big to fit into memory for single-machine training; in practice, we keep 1 temporal link per 100 consecutive temporal links.
In fact, we tried different hyper-parameters based on their default values, but the results are similar.
In fact, other baselines (e.g., CAWs, TGSRec, APAN) also utilize this trainable time-encoding function; however, we focus our discussion on the selected methods for ease of presentation.
Reddit: download from http://snap.stanford.edu/jodie/reddit.csv
Wikipedia: download from http://snap.stanford.edu/jodie/wikipedia.csv
LastFM: download from http://snap.stanford.edu/jodie/lastfm.csv
MOOC: download from http://snap.stanford.edu/jodie/mooc.csv
GDELT: download from https://github.com/amazon-research/tgl/blob/main/down.sh
The TGL framework can be found at https://github.com/amazon-research/tgl
CAWs' official implementation can be found at https://github.com/snap-stanford/CAW
TGSRec's official implementation can be found at https://github.com/DyGRec/TGSRec
DDGCL's official implementation can be found at https://github.com/ckldan520/DDGCL
OGB link prediction evaluator: https://github.com/snap-stanford/ogb/blob/master/ogb/linkproppred/evaluate.py



For example, TGAT Xu et al. (2020), DySAT Sankar et al. (2020), and TGSRec Fan et al. (2021) consider multi-hop uniformly sampled neighbors; JODIE Kumar et al. (2019), TGN Rossi et al. (2020), and APAN Wang et al. (2021b) maintain the historical node interactions via RNN, which can be thought of as multi-hop recent neighbors; CAWs Wang et al. (

Figure 3: Comparison on the training set average precision and generalization gap for the first 100 training epochs. Results on other datasets can be found in Figure 8.

GraphMixer enjoys better convergence and generalization ability. To better understand the model performance, we take a closer look at the dynamics of the training accuracy and the generalization gap (the absolute difference between the training and evaluation scores). The results are reported in Figure 3 and Figure 8: 1 The slope of the training curves reflects the expressive power and convergence speed of an algorithm. From the first-row figures, we observe that GraphMixer always converges to a high average precision score in just a few epochs, and its training curve is very smooth compared to the baselines. Interestingly, the baseline methods cannot always fit the training data, and their training curves fluctuate considerably throughout training. 2 The generalization gap reflects how well and how stably a model performs on unseen data (the smaller the better). From the second-row figures, the generalization gap of GraphMixer is smaller and smoother than that of the baselines, which indicates its generalization power.

Figure 4: Comparison on the training loss landscape. Results on other datasets and other baselines can be found in Appendix E.

Figure 5: (a) Comparison on the gradient/parameter norms and accuracy at each iteration. (b) Comparison on the trajectories of parameter change, where the radius is r_t = ∥δ_t∥/∥δ_0∥, the angle is θ_t = arccos⟨δ_t/∥δ_t∥_2, δ_0/∥δ_0∥_2⟩, and δ_t = w_t − w⋆ is the difference between w_t and the optimal point w⋆. The more the model parameters change during training, the larger the semicircle.
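For reference, the polar coordinates used in Figure 5b can be computed as follows (a direct transcription of the caption's formulas):

```python
import numpy as np

def trajectory_polar(w_t, w_0, w_star):
    """Radius r_t = ||d_t|| / ||d_0|| and angle
    theta_t = arccos<d_t/||d_t||, d_0/||d_0||>, with d_t = w_t - w_star."""
    d_t, d_0 = w_t - w_star, w_0 - w_star
    r = np.linalg.norm(d_t) / np.linalg.norm(d_0)
    cos = np.dot(d_t, d_0) / (np.linalg.norm(d_t) * np.linalg.norm(d_0))
    return r, np.arccos(np.clip(cos, -1.0, 1.0))
```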

Figure 6: Comparison on the training loss landscape with the fixed time-encoding function. Results on other datasets and baselines can be found in Appendix F.

Figure 7: (a) We generate identical timestamp sequences with different lengths, then ask MLP-mixer and GAT to distinguish whether the generated sequences are identical. (b) We generate random sequences with different lengths, then ask MLP-mixer and GAT to classify which sequence is longer.

Figure 8: Comparison of the link prediction training average precision and generalization gap for the first 100 training epochs. Results on other datasets can be found in Figure 3.

Figure 9: Comparison on the training loss landscape (Contour) on Wiki Dataset.

Figure 12: Comparison on the training loss landscape (Surface) on Reddit Dataset.

Figure 15: Comparison on the training loss landscape (Contour) on LastFM Dataset.

Figure 16: Comparison on the training loss landscape (Surface) on LastFM Dataset.

Figure 18: Comparison on the training loss landscape (Surface) on GDELT-ne Dataset.

Figure 20: Comparison on the training loss landscape (Surface) on GDELT-e Dataset.

Figure 22: Training loss landscape (Surface) of TGAT with fixed time-encoding function.

Figure 26: Training loss landscape (Surface) of JODIE with fixed time-encoding function.

Figure1: (Left) Temporal graph with nodes v 1 , . . . , v 5 , per-link timestamps t 1 , . . . , t 6 indicate when two nodes interact. For example, v 1 , v 2 interact at t 1 , t 5 . (Right) Each node has its node features (e.g., x node

Finally, the prediction on any node pair at time t_0 is computed by MLP([h_i(t_0) || h_j(t_0)]), where [•||•] is the concatenation operation and MLP(x) applies a 2-layer MLP to x. • DySAT Sankar et al. (2020) is a SAM-based method. DySAT requires pre-processing the temporal graph into multiple snapshot graphs by first splitting all timestamps into multiple time-slots, then merging all edges in each time-slot. Let G_t(V, E_t) denote the t-th snapshot graph. To capture spatial information, DySAT first applies a Graph Attention Network (GAT) Veličković et al. (2018) on each snapshot graph G_t independently by X(t) = GAT(G_t). Then, to capture the temporal information for each node, a Transformer is applied to x_i(t) = [X(t)]_i at different timestamps: h_i(t_k), ..., h_i(t_0) = Transformer(x_i(t_k), ..., x_i(t_0)). Finally, the prediction on any node pair at time t_0 is computed by MLP([h_i(t_0) || h_j(t_0)]). • TGAT Xu et al. (2020) is a SAM-based method that captures the spatial and temporal information simultaneously. TGAT first generates the time-augmented feature of node i at time t by concatenating the raw feature x_i with a trainable time encoding z(t), i.e., x_i(t) = [x_i || z(t)] with z(t) = cos(tw + b). Then, SAM is applied to the time-augmented features and produces the node representation h_i(t_0) = SAM(x_i(t_0), {x_u(h_u) | u ∈ N_t0(i)}), where N_t0(i) denotes the neighbors of node i at time t_0 and h_u denotes the timestamp of the latest interaction of node u. Finally, the prediction on any node pair at time t_0 is computed by MLP([h_i(t_0) || h_j(t_0)]). • TGN Rossi et al. (2020) is a mixture of RNN- and SAM-based methods: TGN first captures the temporal information using an RNN (similarly to JODIE), then applies graph attention convolution to capture the spatial and temporal information jointly (similarly to TGAT).
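All of the methods above share the same prediction head. A sketch of MLP([h_i || h_j]) follows, with an assumed ReLU hidden layer (the activation is not specified in the text):

```python
import numpy as np

def mlp_link_classifier(h_i, h_j, w1, b1, w2, b2):
    """2-layer MLP on the concatenated node representations,
    returning a raw link-prediction logit."""
    x = np.concatenate([h_i, h_j])          # [h_i || h_j]
    hidden = np.maximum(0.0, w1 @ x + b1)   # assumed ReLU hidden layer
    return w2 @ hidden + b2                 # scalar logit
```

The logit is then fed to a binary cross-entropy loss for the link/no-link classification described in the training setup.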

± 0.02 98.14 ± 0.01 98.70 ± 0.98 69.39 ± 0.81 95.96 ± 0.10 97.38 ± 0.23 96.77 ± 0.18 GraphMixer-L 99.84 ± 0.01 99.70 ± 0.01 99.81 ± 0.01 95.50 ± 0.03 98.99 ± 0.02 96.14 ± 0.02 98.99 ± 0.02 GraphMixer-N 99.24 ± 0.01 ♮ 90.33 ± 0.01 ♮ 97.35 ± 0.02 ♮ 63.80 ± 0.03 ♮ 94.44 ± 0.02 96.00 ± 0.02 ♮ 98.81 ± 0.02 ♮ GraphMixer 99.93 ± 0.01 ♮ 99.85 ± 0.01 ♮ 99.91 ± 0.01 ♮ 96.31 ± 0.02 ♮ 98.89 ± 0.02 98.39 ± 0.02 ♮ 98.22 ± 0.02 ♮

Comparison on the average precision score with fixed/trainable time-encoding function (TEF). The results before "→" are for the trainable TEF (same as Table 1) and those after "→" are for the fixed TEF.

Comparison on the average precision score. ♮ use 20 neighbors due to out of GPU memory.

Comparison on the average precision score of GraphMixer with different input data. The highlighted rows are identical to our default setting.

Comparison on the Recall@K and MRR.

Comparison on the wall-clock computation time for single-epoch of training.

Comparison on the number of model parameters.

Comparison on the AUC score for link prediction. GraphMixer uses one-hot node encoding for datasets without node features (marked by ♮). For each dataset we indicate whether we have the corresponding feature ("L" link features, "N" node features, and "T" link timestamps). ± 0.01 97.25 ± 0.01 98.58 ± 0.01 62.73 ± 0.64 96.46 ± 0.11 98.39 ± 0.17 97.85 ± 0.19 GraphMixer-L 99.84 ± 0.01 99.70 ± 0.01 99.87 ± 0.01 97.04 ± 0.02 98.99 ± 0.02 96.54 ± 0.02 98.99 ± 0.02 GraphMixer-N 99.53 ± 0.01 ♮ 91.49 ± 0.01 ♮ 98.66 ± 0.02 ♮ 71.51 ± 0.03 ♮ 94.44 ± 0.02 96.00 ± 0.02 ♮ 98.81 ± 0.02 ♮ GraphMixer 99.94 ± 0.01 ♮ 99.82 ± 0.01 ♮ 99.93 ± 0.01 ♮ 97.38 ± 0.02 ♮ 98.89 ± 0.02 98.50 ± 0.02 ♮ 98.48 ± 0.02 ♮

C.5 BASELINES WITH UNDIRECTED TEMPORAL GRAPH

Comparison on baselines with undirected temporal graph (Average precision | AUC score).

For example, t_max in LastFM is 137 million, while t_max is 0.2 million in GDELT, 2.6 million in Reddit, 2.6 million in Wiki, and 2.6 million in MOOC.

ACKNOWLEDGEMENTS

This work was supported in part by NSF grant 2008398. The majority of this work was completed during Weilin Cong's internship at Meta AI under the mentorship of Si Zhang. We also extend our gratitude to Long Jin for his co-mentorship and for his contribution to the idea of using MLP-Mixer on graphs.

A EXPERIMENT SETUP DETAILS A.1 HARDWARE SPECIFICATION AND ENVIRONMENT

We run our experiments on a single machine with Intel i9-10850K, Nvidia RTX 3090 GPU, and 64GB RAM memory. The code is written in Python 3.8 and we use PyTorch 1.12.1 on CUDA 11.6 to train the model on the GPU. Implementation details could be found at https://github.com/CongWeilin/GraphMixer.

A.2 DETAILS ON DATASET

The datasets used in this paper can be downloaded automatically with the provided script. The Reddit dataset consists of one month of posts made by users on subreddits; the link feature is extracted by converting the text of each post into a feature vector. The Wikipedia dataset consists of one month of edits made by users on Wikipedia pages; the link feature is extracted by converting the edit text into an LIWC feature vector. The LastFM dataset consists of one month of who-listens-to-which-song information. The MOOC dataset consists of actions done by students in a MOOC online course. The GDELT dataset is a temporal knowledge graph originating from the Event Database, which records events happening in the world from news and articles.

Baseline implementations. The implementations of JODIE, DySAT, TGAT, TGN, and APAN follow the temporal graph learning framework Zhou et al. (2022). Compared to the baselines' original implementations, this framework achieves a better overall score. The implementations of CAWs-mean and CAWs-attn follow their official implementation; we choose the number of random walk steps from {8, 16, 32} to balance the training time. The implementations of TGSRec and DDGCL follow their respective official implementations; we directly test them by converting our data structure to their required format and using their default hyper-parameters.

GraphMixer implementation. We implement GraphMixer under the TGL framework Zhou et al. (2022) and use its default hyper-parameters (e.g., learning rate 0.0001, weight decay 10^-6, batch size 600, hidden dimension 100, etc.) to achieve a fair comparison. GraphMixer has only two hyper-parameters, as introduced in Section 3: the number of 1-hop most recent neighbors K and the time-slot size T.
In practice, the hyper-parameter T is set to the time-gap of the last 2,000 interactions, which is fixed for all datasets; K = 10 for Reddit and LastFM, K = 20 for MOOC, and K = 30 for GDELT and Wiki.
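A sketch of how T can be computed; reading "the time-gap of the last 2,000 interactions" as the time span covered by the most recent 2,000 link timestamps is our assumption:

```python
def time_slot_size(sorted_timestamps, last_n=2000):
    """Time-slot size T: the span covered by the most recent `last_n`
    link timestamps (our reading of "the time-gap of the last 2,000
    interactions"; the exact definition is an assumption)."""
    recent = sorted_timestamps[-last_n:]
    return recent[-1] - recent[0]
```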

