GRAPH CONTRASTIVE LEARNING FOR SKELETON-BASED ACTION RECOGNITION

Abstract

In the field of skeleton-based action recognition, current top-performing graph convolutional networks (GCNs) exploit intra-sequence context to construct adaptive graphs for feature aggregation. However, we argue that such context is still local since the rich cross-sequence relations have not been explicitly investigated. In this paper, we propose a graph contrastive learning framework for skeleton-based action recognition (SkeletonGCL) to explore the global context across all sequences. Specifically, SkeletonGCL associates graph learning across sequences by enforcing graphs to be class-discriminative, i.e., intra-class compact and inter-class dispersed, which improves the GCN capacity to distinguish various action patterns. Besides, two memory banks are designed to enrich cross-sequence context from two complementary levels, i.e., instance and semantic levels, enabling graph contrastive learning in multiple context scales. Consequently, SkeletonGCL establishes a new training paradigm, and it can be seamlessly incorporated into current GCNs. Without loss of generality, we combine SkeletonGCL with three GCNs (2S-AGCN, CTR-GCN, and InfoGCN), and achieve consistent improvements on the NTU60, NTU120, and NW-UCLA benchmarks. The source code is publicly available.

1. INTRODUCTION

Graph convolutional networks (GCNs) have been widely applied in skeleton-based action recognition since they can naturally process non-grid skeleton sequences. For GCN-based methods, how to effectively learn the graphs remains a core and challenging problem. In particular, ST-GCN (Yan et al., 2018) is a milestone work, using pre-defined graphs to extract the action patterns. However, the pre-defined graphs only enable each joint to access fixed neighboring joints and can hardly capture long-range dependencies adaptively. Therefore, a mainstream of subsequent works (Li et al., 2019; Shi et al., 2019; Zhang et al., 2020b;a; Ye et al., 2020; Chen et al., 2021b; Chi et al., 2022) strives to solve this issue by generating adaptive graphs. Adaptive graphs can dynamically aggregate the features within each sequence and thus show significant performance advantages. Generally, adaptive graphs are constructed using intra-sequence context. However, such context is still "local" when the cross-sequence information is viewed as an available context. Therefore, we wonder: Is it possible to involve the cross-sequence context in graph learning?

To find out, in Fig. 1, we visualize the adaptive graphs learned from sequences of two easily confused classes ("point to something" and "take a selfie"). The graphs are learned by a strong GCN, i.e., CTR-GCN (Chen et al., 2021b). From the visualization, we find that (1) for sequences that are correctly classified in Fig. 1(a) and Fig. 1(b), the learned graphs in the same class look similar, while graphs in different classes have distinct differences; (2) for a misclassified sequence in Fig. 1(c), the learned graph resembles the graphs from the misclassified class more than those from the ground-truth class. These observations remind us that graph learning in current adaptive GCNs can implicitly learn class-specific graph representations to some extent. But without explicit constraints, it leads to class-ambiguous representations in some cases, thereby affecting the GCN capacity to discriminate classes (in Tab. 9 of Sec. 4.4, we provide quantitative results to further support our hypothesis). Therefore, we speculate that if the cross-sequence semantic relations are incorporated as guidance in graph learning, the class-ambiguity issue will be alleviated and the graph representations will better express the individual characteristics of actions.

Figure 1: Graph visualization of sequences from two easily confused classes ("point to something" and "take a selfie"). The graphs are learned by CTR-GCN (Chen et al., 2021b). We take the tip of the hand that performs the action as the anchor. The size of the red circles and the width of the blue lines both denote the strengths of connections between joints. For simplicity, only representative frames are visualized. (a) Three sequences from class "point to something" are correctly classified, where the graphs contain connections to the body joints. (b) Three sequences from class "take a selfie" are correctly classified, where the graphs highly emphasize the connections to the hands, while the connections to the body are suppressed. (c) A sequence from class "point to something" is misclassified as "take a selfie", whose graph resembles the graphs in (b) but is dissimilar from the graphs in (a). Hence, class-ambiguous graph representations can negatively impact recognition performance.

In recent years, contrastive learning has achieved great success in self-supervised representation learning (He et al., 2020; Chen et al., 2020; 2021a). It studies cross-sample relations in the dataset. The essence of contrastive learning is "comparing", which pulls together the feature embeddings of positive pairs and pushes apart the feature embeddings of negative pairs.
Based on the analysis above and the advances in contrastive learning, we propose a graph contrastive learning framework for skeleton-based action recognition in the fully-supervised setting, dubbed SkeletonGCL. Instead of just using the local information within each sequence, SkeletonGCL explores the cross-sequence global context to improve graph learning. The core idea is to pull together the learned graphs from the same class while pushing apart the learned graphs from different classes. Since graphs can reveal the action patterns of sequences, enforcing graph consistency within the same class and inconsistency among different classes helps the model understand various motion modes. In addition, to enrich the cross-sequence context, we build memory banks to store the graphs from historical sequences. Specifically, an instance-level memory bank stores the sequence-wise graphs, which hold the individual properties of each sequence. In contrast, a semantic-level memory bank stores the aggregation of graphs from each class, which contains the class-level representation. The two banks are complementary to each other, enabling us to leverage more samples. SkeletonGCL can be seamlessly combined with existing GCNs. Eventually, we combine SkeletonGCL with three GCNs (2S-AGCN (Shi et al., 2019), CTR-GCN (Chen et al., 2021b), and InfoGCN (Chi et al., 2022)), and conduct experiments on three popular datasets (NTU60 (Shahroudy et al., 2016), NTU120 (Liu et al., 2019), and NW-UCLA (Wang et al., 2014)). SkeletonGCL achieves consistent improvements with these models using different testing protocols (single-modal or multi-modal) on all three datasets, which broadly demonstrates the effectiveness of our design. Notably, SkeletonGCL only introduces a small training overhead and has no impact at the test stage.
Though there exist some works that apply contrastive learning in skeleton-based action recognition (Li et al., 2021; Guo et al., 2022; Mao et al., 2022), our method differs from them as follows: (1) The previous methods took pooled feature vectors to conduct contrastive learning as in (He et al., 2020; Chen et al., 2020), where the structural properties of skeletons are lost. In contrast, SkeletonGCL uses graphs to contrast, which maintains the structural details of skeletons and offers high-order connection information between joints. (2) The previous methods used memory banks to store instance-level representations only. Differently, our memory banks store both instance-level and semantic-level representations, allowing us to leverage context from individual sequences and class-specific aggregations, which are complementary to each other. (3) The previous methods were used in the pre-training stage, while SkeletonGCL is incorporated into the fully-supervised setting without extra pre-training cost.

In summary, the contributions of this paper are as follows:

• We present a new perspective for graph learning of GCN models in skeleton-based action recognition. Specifically, we propose to make use of the cross-sequence context to guide graph learning, whose goal is to enforce graphs to be intra-class compact and inter-class dispersed.

• Motivated by the advances in contrastive learning, we smoothly combine the ideas of contrastive learning and cross-sequence graph learning, and propose a new training paradigm for skeleton-based action recognition, called SkeletonGCL. SkeletonGCL incorporates an instance-level and a semantic-level memory bank to comprehensively enrich the cross-sequence context. Besides, it can be seamlessly incorporated into current GCNs.
• SkeletonGCL achieves consistent improvements combined with three GCNs (2S-AGCN, CTR-GCN, and InfoGCN) on three popular benchmarks (NTU60, NTU120, NW-UCLA) using both single-modal and multi-modal testing protocols. In addition, SkeletonGCL is training-efficient and has no impact at the test stage.

2. RELATED WORK

2.1. SKELETON-BASED ACTION RECOGNITION

Skeleton-based action recognition aims to classify actions from sequences of estimated key points. The early deep-learning methods applied convolutional neural networks (CNNs) (Chéron et al., 2015; Liu et al., 2017b) or recurrent neural networks (RNNs) (Du et al., 2015; Lev et al., 2016; Wang & Wang, 2017; Liu et al., 2017a) to model the skeletons, but they could not explicitly explore the topological structure of skeletons, so their performance was limited. Recently, PoseC3D (Duan et al., 2022) revisited the CNN-based approach by stacking heatmaps as 3D volumes, which maintained the spatial-temporal properties of skeletons and obtained performance improvements. In the past few years, the mainstream works in skeleton-based action recognition have been GCN models. ST-GCN (Yan et al., 2018) was the first work that adopted a GCN as the feature extractor, heuristically designing fixed graphs to model the skeletons. The follow-up methods proposed spatial-temporal graphs (Liu et al., 2020), multi-scale graph convolutions (Chen et al., 2021c), channel-decoupled graphs (Chen et al., 2021b; Cheng et al., 2020a), and adaptive graphs (Li et al., 2019; Shi et al., 2019; Ye et al., 2020; Zhang et al., 2020b; Chen et al., 2021b; Chi et al., 2022) to improve the capacity of GCNs. Tracking the development of GCN-based methods, we find that graph learning has always been a core problem, and adaptive GCNs are now leading since they can model the intrinsic topology of skeletons. However, current adaptive GCNs generate graphs based on the local context within each sequence, where the cross-sequence relations are neglected. In contrast, we propose to explore the cross-sequence global context to shape graph representations. In this way, the learned graphs can not only describe the individual characteristics within each sequence but also emphasize the similarity and dissimilarity of motion patterns across sequences.

2.2. CONTRASTIVE LEARNING

In recent years, numerous representation learning methods (Wu et al., 2018; Oord et al., 2018; He et al., 2020; Chen et al., 2020; Wang et al., 2021) with contrastive learning have emerged, especially in self-supervised representation learning. The key idea is to pull together the positive pairs and push away the negative pairs in the feature space. Generally, the features are vectors obtained from feature extractors followed by a pooling layer. A standard approach to obtaining the positive pairs is augmenting an original sample into two different views. The negative samples are selected randomly or using hard mining strategies (Khosla et al., 2020; Robinson et al., 2020; Kalantidis et al., 2020) . To increase the capacity of negative samples, the memory bank mechanism was devised in (He et al., 2020; Misra & Maaten, 2020) to store more negative instances. By contrasting positive pairs against negative pairs, the model can learn to focus on semantic representations. In the field of skeleton-based action recognition, prior works (Li et al., 2021; Mao et al., 2022; Guo et al., 2022) proposed to apply contrastive learning in the pre-training stage by roughly following the frameworks mentioned above. CrossCLR (Li et al., 2021) mined positive pairs in the data space and explored the cross-modal distribution relationships. Further, CMD (Mao et al., 2022) transferred the cross-modal knowledge in a distillation manner. And AimCLR (Guo et al., 2022) used extreme augmentations to improve the representation universality. Compared with the above methods, we use graph representations to contrast instead of using pooled feature vectors. Meanwhile, we establish two different memory banks at complementary levels, i.e., instance and semantic levels, to enrich the context scales. Besides, the proposed method is used with GCNs under the fully-supervised setting, which requires no pre-training procedure.

3. METHOD

3.1. PRELIMINARY

We denote a human skeleton as a vertex set V = {v_1, v_2, ..., v_N}, where N denotes the number of vertices. For each vertex v_i, the feature dimension is C. Hence, a skeleton sequence with T frames can be denoted as X ∈ R^{T×N×C}. The graph topology, formulated as g, represents the correlations between joints.

GCNs in Skeleton-Based Action Recognition. Generally, GCN models alternately apply graph convolutions and temporal convolutions to extract the spatial configuration and motion pattern of skeletons, respectively. The graph g is vital for graph convolutions since it determines the message passing among joints. In current adaptive GCNs, g is learned within each sequence and has different sizes, e.g., g ∈ R^{K_S×N×N} in 2S-AGCN (Shi et al., 2019) and g ∈ R^{K_S×C×N×N} in CTR-GCN (Chen et al., 2021b). K_S denotes the number of sub-graphs, normally set as 3. In general, the graph convolution is defined as:

X_S = Σ_{k=1}^{K_S} g_k X W_S^k,   (1)

where X_S ∈ R^{T×N×C'} denotes the spatially extracted feature with C' channels, and W_S ∈ R^{K_S×C×C'} denotes the spatial feature transformation filters. Next, temporal convolutions are applied on X_S, producing the motion-extracted feature X_T ∈ R^{T×N×C'}. After stacking layers of graph convolutions and temporal convolutions, a global average pooling (GAP) layer summarizes the global features, then a classification head (fully-connected layer) followed by a Softmax activation is applied to obtain the class prediction ŷ ∈ R^{C_k}, where C_k denotes the number of classes. Finally, a cross-entropy loss L_CE supervises the class prediction with the ground-truth label y as follows:

L_CE = -Σ_i y_i log ŷ_i.   (2)

Self-Supervised Contrastive Learning. In the context of self-supervised contrastive learning, each input sample is processed by data augmentations to produce a positive pair: I and I^+. Through a feature extraction network, I and I^+ are transformed into feature vectors f ∈ R^D and f^+ ∈ R^D.
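The graph convolution defined above can be sketched in a few lines of NumPy; this is an illustrative sketch with toy shapes, not the authors' implementation:

```python
import numpy as np

def graph_conv(X, g, W):
    """Spatial graph convolution: sum over sub-graphs of g_k @ X @ W_k.

    X: (T, N, C)       input skeleton features
    g: (K_S, N, N)     learned adjacency, one matrix per sub-graph
    W: (K_S, C, C_out) per-sub-graph feature transforms
    """
    # for each sub-graph k: aggregate joints with g_k, transform channels with W_k
    return sum(np.einsum('nm,tmc,cd->tnd', g[k], X, W[k]) for k in range(g.shape[0]))

T, N, C, C_out, K_S = 4, 25, 3, 8, 3
X = np.random.randn(T, N, C)
g = np.random.rand(K_S, N, N)
W = np.random.randn(K_S, C, C_out)
X_S = graph_conv(X, g, W)   # spatially extracted feature, shape (T, N, C_out)
```

Per frame, each output joint is a g_k-weighted mixture of all joints' features, so a learned g directly controls which joints exchange information.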
As for the negative samples, they are selected from the dataset excluding I and I^+, represented as a negative set N^-. Each negative in N^- is denoted as f^- ∈ R^D. The similarity between two feature vectors is calculated as sim(f_1, f_2) = f_1·f_2 / (‖f_1‖‖f_2‖). The InfoNCE loss (Gutmann & Hyvärinen, 2010; Oord et al., 2018) is widely adopted for contrastive learning, which is formulated as:

L_NCE = -log [ exp(sim(f, f^+)/τ) / ( exp(sim(f, f^+)/τ) + Σ_{f^-∈N^-} exp(sim(f, f^-)/τ) ) ],   (3)

where the temperature τ > 0 is a hyper-parameter.
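A minimal NumPy rendering of the InfoNCE loss above (our sketch; the temperature and dimensions are illustrative):

```python
import numpy as np

def sim(a, b):
    """Cosine similarity between two feature vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def info_nce(f, f_pos, negs, tau=0.5):
    """InfoNCE: the positive competes against all negatives in a softmax."""
    pos = np.exp(sim(f, f_pos) / tau)
    neg = sum(np.exp(sim(f, fn) / tau) for fn in negs)
    return -np.log(pos / (pos + neg))

rng = np.random.default_rng(0)
f = rng.normal(size=64)
aligned = f + 0.01 * rng.normal(size=64)            # a near-duplicate "augmented view"
randoms = [rng.normal(size=64) for _ in range(8)]
loss_easy = info_nce(f, aligned, randoms)           # well-aligned positive -> low loss
loss_hard = info_nce(f, randoms[0], [aligned] * 8)  # positive random, negatives aligned -> high loss
```

The loss shrinks as the anchor-positive similarity grows relative to the anchor-negative similarities, which is exactly the "pull together / push apart" behavior described above.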

3.2. GRAPH CONTRASTIVE LEARNING

The proposed SkeletonGCL is illustrated in Fig. 2. The framework consists of two branches, where the classification branch is parallel to the graph contrast branch. Taking a skeleton sequence I as input, the GCN encoder outputs a feature vector f for classification and a graph g for graph contrast.

Graph Projection Head. In order to contrast the graphs in a common feature space, we embed the graphs into vectors by a graph projection head. The projection heads for different GCNs are similar (see App. 6.1 for details). In Fig. 2, taking the graph g ∈ R^{K_S×C×N×N} learned in CTR-GCN (Chen et al., 2021b) as an example, we first squeeze g along the channel dimension by an average pooling layer into ḡ ∈ R^{K_S×N×N}. Then, we flatten ḡ into a 1D vector g' ∈ R^{K_S N²} and project g' into a vector v ∈ R^{C_g} by an FC layer W_G ∈ R^{K_S N²×C_g}. Since different channels in W_G are specific to different vertices in the graph, the graph projection is vertex-aware and thus can encode the structures of skeletons. Afterward, we update two memory banks with v. The memory banks are illustrated in Fig. 2 and detailed next.

Memory Bank. To enrich the cross-sequence context, we build memory banks to store the cross-batch graphs. Specifically, two memory banks are constructed, i.e., an instance-level memory bank M_Ins ∈ R^{C_k×P×C_g} and a semantic-level memory bank M_Sem ∈ R^{C_k×C_g}, where P denotes the number of instances stored for each class in M_Ins. Particularly, each element in M_Ins denotes a graph instance from a class, whereas each element in M_Sem denotes the graph aggregation of a class. Therefore, the two memory banks are on complementary levels: the instance-level memory bank emphasizes the instance discrimination of each sequence, while the semantic-level memory bank covers the class properties across sequences. We update M_Ins in a first-in-first-out manner, which maintains the number of instances for each class as P.
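The projection head and the FIFO instance bank described above can be sketched as follows (a simplified NumPy version under our own naming, not the authors' code):

```python
import numpy as np
from collections import deque

def project_graph(g, W_G):
    """Graph projection head (CTR-GCN style): channel average pool, flatten, FC.

    g:   (K_S, C, N, N)     learned graph
    W_G: (K_S * N * N, C_g) vertex-aware projection matrix
    """
    g_bar = g.mean(axis=1)          # squeeze the channel dimension -> (K_S, N, N)
    return g_bar.reshape(-1) @ W_G  # flatten, then FC -> (C_g,)

class InstanceBank:
    """Instance-level memory bank: up to P graph vectors per class, first-in-first-out."""
    def __init__(self, num_classes, P):
        self.slots = [deque(maxlen=P) for _ in range(num_classes)]
    def update(self, v, label):
        self.slots[label].append(v)   # the oldest instance is evicted automatically
    def positives(self, label):
        return list(self.slots[label])
    def negatives(self, label):
        return [v for c, s in enumerate(self.slots) if c != label for v in s]

K_S, C, N, C_g = 3, 64, 25, 256
g = np.random.rand(K_S, C, N, N)
W_G = np.random.randn(K_S * N * N, C_g)
v = project_graph(g, W_G)
bank = InstanceBank(num_classes=60, P=4)
bank.update(v, label=7)
```

A `deque` with `maxlen=P` gives the first-in-first-out behavior for free: appending the (P+1)-th vector silently drops the oldest one.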
As for M_Sem, we use a momentum update strategy, which integrates the graphs of the same class from the current timestamp and all previous timestamps, regarded as a long-term representation. The momentum update is defined as follows:

m_sem^{c*} ← α m_sem^{c*} + (1 - α) v,   (4)

where m_sem^{c*} is the representation for class c*, c* is the class label of the input I, and 0 < α < 1 is a hyper-parameter.

Loss. To achieve the graph contrast, we sample positives and negatives from the memory banks M_Ins and M_Sem. For M_Ins, the vector v is set as the anchor; hence samples in the positive set N_Ins^+ have label c*, and samples in the negative set N_Ins^- have different labels. Consequently, the InfoNCE loss in Eq. 3 can be rewritten as:

L_NCE^Ins = -Σ_{v^+∈N_Ins^+} log [ exp(sim(v, v^+)/τ) / ( exp(sim(v, v^+)/τ) + Σ_{v^-∈N_Ins^-} exp(sim(v, v^-)/τ) ) ],   (5)

L_NCE^Sem = -Σ_{v^+∈N_Sem^+} log [ exp(sim(v, v^+)/τ) / ( exp(sim(v, v^+)/τ) + Σ_{v^-∈N_Sem^-} exp(sim(v, v^-)/τ) ) ].   (6)

L_NCE^Ins leverages multiple positives compared with Eq. 3 by using label information, which mines more semantically related samples. Similarly, the InfoNCE loss L_NCE^Sem is defined for the memory bank M_Sem. The overall contrastive loss is:

L_NCE = L_NCE^Ins + L_NCE^Sem,   (7)

and the overall loss function is:

L = L_NCE + L_CE.   (8)

Hard Sampling. As training continues, most samples become too easy and contribute less to the training. Therefore, methods in (Tabassum et al., 2022; Robinson et al., 2020; Kalantidis et al., 2020; Wang et al., 2021) adopt hard mining strategies to focus on informative samples. In this paper, considering the massive number of instances in M_Ins, contrasting with all these instances naturally leads to redundancy and hinders the training. To alleviate this issue, we propose to mine hard examples in M_Ins. Specifically, we take the similarity sim(v, v') as a criterion to evaluate hardness.
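The momentum update and the multi-positive InfoNCE losses described above can be sketched in NumPy (toy sizes and our own names, for illustration only):

```python
import numpy as np

def l2n(x):
    """L2-normalize a vector so the dot product equals cosine similarity."""
    return x / np.linalg.norm(x)

def momentum_update(m_sem, v, label, alpha=0.85):
    """Semantic-level bank update: move the class slot slightly toward the new graph vector."""
    m_sem[label] = alpha * m_sem[label] + (1 - alpha) * v
    return m_sem

def multi_positive_info_nce(v, positives, negatives, tau=1.0):
    """Sum one InfoNCE term per positive, all sharing the same negative set."""
    v = l2n(v)
    neg = sum(np.exp(l2n(n) @ v / tau) for n in negatives)
    loss = 0.0
    for p in positives:
        pos = np.exp(l2n(p) @ v / tau)
        loss -= np.log(pos / (pos + neg))
    return loss

C_k, C_g = 5, 8                              # toy sizes: 5 classes, 8-dim graph vectors
m_sem = np.zeros((C_k, C_g))
v = np.ones(C_g)
m_sem = momentum_update(m_sem, v, label=2)   # only class 2 moves toward v
loss = multi_positive_info_nce(v, positives=[v, v], negatives=[-v])
```

With a large α, each class slot changes only slightly per step, so it behaves as the long-term class representation the text describes.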
Harder positives have lower similarities, and harder negatives have higher similarities. In total, for M_Ins, we select the K_H^+ hardest positives, the K_H^- hardest negatives, and K_R^- random negatives for contrast.
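Under these criteria, sampling from the instance bank could look like the following sketch (function and variable names are ours, for illustration; the random negatives here may overlap with the hard ones):

```python
import numpy as np

def mine_hard(v, positives, negatives, k_pos, k_neg, k_rand, rng=None):
    """Sample from the instance bank using cosine similarity to the anchor v as hardness.

    Hard positives: the k_pos least similar positives.
    Hard negatives: the k_neg most similar negatives, plus k_rand random negatives.
    """
    rng = rng or np.random.default_rng()
    cos = lambda M: (M / np.linalg.norm(M, axis=1, keepdims=True)) @ (v / np.linalg.norm(v))
    P, Ng = np.asarray(positives, float), np.asarray(negatives, float)
    hard_pos = P[np.argsort(cos(P))[:k_pos]]           # least similar positives
    hard_neg = Ng[np.argsort(cos(Ng))[::-1][:k_neg]]   # most similar negatives
    rand_neg = Ng[rng.choice(len(Ng), size=min(k_rand, len(Ng)), replace=False)]
    return hard_pos, hard_neg, rand_neg

v = np.array([1.0, 0.0])
pos = [[1.0, 0.0], [0.0, 1.0]]    # the second points away from v -> hard positive
neg = [[0.9, 0.1], [-1.0, 0.0]]   # the first points toward v -> hard negative
hp, hn, rn = mine_hard(v, pos, neg, k_pos=1, k_neg=1, k_rand=1)
```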

4.2. IMPLEMENTATION DETAILS

To thoroughly validate SkeletonGCL, we take three GCNs (2S-AGCN (Shi et al., 2019), CTR-GCN (Chen et al., 2021b), and InfoGCN (Chi et al., 2022)) as baseline models. For CTR-GCN and InfoGCN, we follow their training recipes. Particularly, for 2S-AGCN, since its training recipe is out of date, we borrow the training recipe from CTR-GCN, which effectively improves its baseline performance. P, the number of stored instances per class in M_Ins, is set as 684 on NTU60 and NTU120, and 342 on NW-UCLA. The dimension of the graph vector, C_g, is set to 256. For all datasets, the numbers of sampled examples K_H^+, K_H^-, and K_R^- are set as 128, 512, and 512, respectively. For different models used in different modalities, we experiment with temperatures τ of 0.5, 0.8, 1.0, and 1.5, and choose the best one. The hyper-parameter α for the momentum update is set as 0.85. Besides, we fix the random seed to ensure reproducibility. All experiments are conducted on a single NVIDIA V100 GPU.
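For reference, these settings can be collected into a single config (values transcribed from the text above; the key names are ours):

```python
# Hyper-parameter summary (tau is tuned per model and modality from the listed grid).
CONFIG = {
    "P": {"NTU60": 684, "NTU120": 684, "NW-UCLA": 342},  # instances per class in M_Ins
    "C_g": 256,             # graph vector dimension
    "K_H_pos": 128,         # hardest positives sampled
    "K_H_neg": 512,         # hardest negatives sampled
    "K_R_neg": 512,         # random negatives sampled
    "tau_grid": (0.5, 0.8, 1.0, 1.5),
    "alpha": 0.85,          # momentum for the semantic-level bank
}
```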

4.3. COMPARISON WITH THE STATE-OF-THE-ART

In this section, we combine our method with three GCNs and compare them with the state-of-the-art (SoTA) methods. In Tab. 1 and Tab. 2, we list current SoTA methods in skeleton-based action recognition except PoseC3D (Duan et al., 2022). PoseC3D is a promising CNN-based method, but it uses non-official skeleton data and applies a multi-crop test protocol (GCN methods typically use one crop), which makes a direct comparison unfair. In evaluation, four modalities are used: "joint stream" (J) denotes the joint coordinates, "bone stream" (B) denotes the coordinate difference between spatially connected joints, "joint motion" (J-M) denotes the coordinate difference between temporally adjacent frames, and "bone motion" (B-M) denotes the bone difference between temporally adjacent frames. The 4-stream ensemble (4S) denotes using the four modalities together. Following the widely-adopted protocol, we evaluate models using the J, B, J + B, and 4S modalities.

NTU60 and NTU120. Tab. 1 lists the results on NTU60 and NTU120. From the results, we find that: (1) combined with SkeletonGCL, all three baseline models achieve solid improvements on these two benchmarks over different settings and modalities. Taking the J modality on NTU60 X-Sub as an example, 2S-AGCN improves by 1.0% (88.9% to 89.9%), CTR-GCN improves by 1.0% (89.8% to 90.8%), and InfoGCN improves by 0.7% (89.4% to 90.1%). Considering NTU60 is an extensively benchmarked dataset, such improvements are hard-won.
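The bone and motion modalities are simple transforms of the joint coordinates; a sketch with a toy skeleton tree (the parent list here is illustrative, not the NTU joint hierarchy):

```python
import numpy as np

def bone_stream(joints, parents):
    """Bone (B): coordinate difference between each joint and its spatially connected parent."""
    return joints - joints[:, parents, :]

def motion_stream(x):
    """Motion (J-M / B-M): difference between temporally adjacent frames, zero-padded at the end."""
    m = np.zeros_like(x)
    m[:-1] = x[1:] - x[:-1]
    return m

T, N = 8, 4
parents = [0, 0, 1, 2]                  # toy 4-joint chain; the root is its own parent
joints = np.random.randn(T, N, 3)       # joint stream (J)
B = bone_stream(joints, parents)        # bone stream
J_M = motion_stream(joints)             # joint motion
B_M = motion_stream(B)                  # bone motion
```

The 4-stream ensemble then trains one model per stream and sums (or weights) their class scores at test time.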

4.4. DIAGNOSTIC EXPERIMENTS

In this section, we conduct diagnostic experiments to verify the design of SkeletonGCL. Unless otherwise stated, we use CTR-GCN as the GCN encoder and perform the experiments on the NTU60 dataset under the X-Sub setting using the joint modality (J). See App. 6.2 for more diagnostic experiments.

Intra-batch vs. Inter-batch Graph Contrast. In Tab. 3, the effectiveness of introducing cross-sequence context is investigated. We find that contrasting the graphs only within one batch brings an improvement of 0.4% (89.8% to 90.2%), which owes to the cross-sequence relation mining. Further exploring the inter-batch relations brings a larger improvement of 1.0% (89.8% to 90.8%), which shows that different batches provide richer context than a single batch.

Graph Contrast vs. Feature Contrast. In Tab. 4, we compare using the features f against using the graphs g to contrast. We find that feature contrast improves over the baseline by 0.4% (89.8% to 90.2%), but graph contrast clearly outperforms it by 0.6% (90.2% to 90.8%). The results suggest that, due to the high-order structural information in graphs, graph contrast better benefits graph convolution learning in GCNs.

Memory Banks. In Tab. 5, the effectiveness of the instance-level and semantic-level memory banks is investigated. We find that both memory banks benefit recognition, and using them together achieves much higher performance, which proves their complementarity.

Sampling Strategy. In Tab. 6, we compare different sampling strategies for SkeletonGCL. We find that selecting hard positive/negative examples generally improves recognition. Random negative samples are also meaningful, as they allow the contrastive loss to involve more negative samples.

InfoNCE Loss vs. Triplet Loss. In Tab. 7, we compare the performance of another popular metric learning loss, i.e., the triplet loss (Schroff et al., 2015).
We find that the triplet loss achieves similar performance to the InfoNCE loss. The results indicate the generality of our idea: it does not depend on a certain loss but boosts performance with different losses.

Training Consumption. In Tab. 8, we report the training consumption on NTU60. With our method, the training time only slightly increases over the different baseline models, ranging from 2.6% to 7.0%, which proves the efficiency of the design.

Quantitative Results on Graph Similarities. As shown in Tab. 9, we statistically calculate the graph distances between each sample and all classes (detailed in App. 6.4). For incorrectly-classified samples, we find that: (1) the graph distance to the misclassified class (0.68) is much lower than the average distance (1.70) to all classes; (2) the graph distance to the misclassified class (0.68) is indeed slightly lower than the distance to the correct class (0.74), which shows that failing to learn class-specific graphs can truly degrade recognition performance. In addition, for correctly-classified samples, we notice that: (1) the average graph distance (2.20) is higher than that for the misclassified ones (1.70), which indicates that the inter-class graph representations are more dispersed for the correctly-classified samples; (2) the distance to the correct class (0.47) is lower than that for the misclassified ones (0.74), which reveals that the intra-class representations are more compact for the correctly-classified samples. To sum up, these quantitative results illustrate the strong correlation between recognition performance and class-specific graph representations.

Performance vs. Graph Quality. In Tab. 10, we first calculate the graph distances between each sample and all classes (detailed in App. 6.4) for CTR-GCN. Then, we rank the distances from low to high. In Tab. 10, we report the recognition accuracies of samples according to their distance ranks to the correct class.
Here, higher ranks indicate that graphs are of higher quality (intra-class compact and inter-class dispersed), while lower ranks indicate that graphs are of lower quality (intra-class dispersed and inter-class aliased). We note that: (1) from lower ranks to higher ranks, performance improves monotonically, revealing the significant correlation between graph quality and recognition performance; (2) combined with the proposed method, we improve performance in all cases, with larger improvements obtained on the samples with lower-quality graphs. These results prove that our method can alleviate the problem caused by learning low-quality graphs.

5. CONCLUSION

In this paper, we establish a new training paradigm for skeleton-based action recognition, called SkeletonGCL, which explicitly explores the rich semantic context across sequences. Concretely, SkeletonGCL contrasts the learned graphs among sequences, guiding the graph representations to be class-associated and hence improving the GCN capacity to recognize different actions. SkeletonGCL improves the current methods significantly, achieving SoTA on three benchmarks.

Limitation. In this paper, we push away negative pairs from different classes uniformly, without considering their intrinsic relations. A more comprehensive contrasting scheme that delicately involves cross-class relations may be more suitable. We leave this for future work.

6. APPENDIX

6.1 IMPLEMENTATIONS OF GRAPH PROJECTION HEADS FOR GCNS. In Fig. 3, we illustrate the implementation details of the graph projection heads for different GCNs.

Figure 3: Graph projection heads. (a) CTR-GCN: g ∈ R^{K_S×C×N×N} → Average Pool → R^{K_S×N×N} → Flatten → g' ∈ R^{K_S N²} → FC → v ∈ R^{C_g}. (b) 2S-AGCN: g ∈ R^{K_S×N×N} → Flatten → g' ∈ R^{K_S N²} → FC → v ∈ R^{C_g}. (c) InfoGCN: g ∈ R^{T×K_S×N×N} → Average Pool → R^{K_S×N×N} → Flatten → g' ∈ R^{K_S N²} → FC → v ∈ R^{C_g}.

Particularly, for CTR-GCN and InfoGCN, we first apply an average pooling layer to summarize the information along the channel and temporal dimensions, respectively. Then, the same as in 2S-AGCN, we flatten the graphs and embed them with an FC layer.

6.2 MORE DIAGNOSTIC EXPERIMENTS.

Comparison with cross-entropy loss. Since cross-entropy loss is widely used to learn class-discriminative representations, in Tab. 11, we investigate using it to supervise graph learning. We find that directly using a cross-entropy loss for graph learning has a negligible effect on performance (89.8% to 89.7%), which indicates that it is impractical to learn favorable class-discriminative graphs by naively applying a classification loss. In this paper, we find a practical way to achieve this goal by introducing the cross-sequence context to guide graph learning.

Impact of FC in Projection Head. In Tab. 12, the effectiveness of the transformation (FC) layer in the graph projection head is investigated. We find that the model achieves an obvious improvement (90.1% to 90.8%) when equipped with the FC layer, which proves the importance of vertex-aware graph encoding. In addition, we find that using an MLP achieves similar but slightly lower accuracy, hence we use a simple FC layer in the framework.

Which Layer to Contrast Graphs? In Tab. 13, we apply graph contrast on different layers. We find that contrasting graphs at deeper layers outperforms contrasting at shallower layers.
One possible explanation is that deeper layers provide higher-level semantics that are more relevant to recognition.

Impact of the size of M_Ins. In Tab. 14, the impact of the size of M_Ins is investigated, where we use different values of P to control the size. We find that appropriately increasing the size can effectively expand the cross-sequence context and improve recognition performance. However, an overly large memory bank stores old samples from many batches ago, which hinders representation learning.

Impact of dimension C_g. In Tab. 15, the influence of the dimension of the graph vector C_g is investigated. For the best performance, we set C_g as 256.

Impact of temperature τ. In Tab. 16, the influence of the temperature τ is investigated. For the best performance, we set τ as 1.0.

Impact of α. In Tab. 17, the influence of the momentum update hyper-parameter α is investigated. For the best performance, we set α as 0.85.

Impact of the number of sampled examples. In Tab. 18, the impact of selecting the K_H^+ hardest positive examples is investigated. In Tab. 19, the impact of selecting the K_H^- hardest negative examples is investigated. In Tab. 20, the impact of selecting K_R^- random negative examples is investigated. For the best performance, we set K_H^+, K_H^-, and K_R^- to 128, 512, and 512, respectively.

Quantitative analysis of accuracy improvement. In Tab. 21, the recognition accuracies of the top-10 hardest classes for CTR-GCN on NTU60 are presented. The improvements in four classes (i.e., "reading", "typing on a keyboard", "headache", and "point to something") exceed 4%. Though the performances in three classes decrease, the drops are relatively small (-1.5% on "writing", -1.0% on "take off a shoe", and -2.5% on "sneeze/cough") compared with the gains in the other classes. Overall, we obtain an average improvement of 2.7% over these 10 classes. In Tab.
22, the recognition accuracies of the top-10 improved classes for CTR-GCN on NTU60 are presented. The accuracies of these 10 classes show an average gain of 4.6%.

In Tab. 9, the statistics of graph distances for all samples are investigated. In CTR-GCN, the graph g ∈ R^{K_S×C×N×N} is learned for graph convolution. For convenience of calculation, we use average pooling to squeeze g and reshape it into ḡ ∈ R^{1×N²} as the graph embedding for distance measurement.
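The distance measurement could be sketched as follows; the per-class mean embedding as the class representative and the Euclidean metric are our assumptions of the simplest choice, not a transcription of the authors' code:

```python
import numpy as np

def graph_embedding(g):
    """Squeeze a CTR-GCN graph (K_S, C, N, N) over sub-graphs and channels, then flatten."""
    return g.mean(axis=(0, 1)).reshape(-1)       # (N * N,)

def class_distances(sample_g, centroids):
    """Euclidean distance from one sample's graph embedding to each class centroid."""
    e = graph_embedding(sample_g)
    return {c: float(np.linalg.norm(e - m)) for c, m in centroids.items()}

K_S, C, N = 3, 16, 25
g = np.random.rand(K_S, C, N, N)
centroids = {0: graph_embedding(g), 1: np.zeros(N * N)}
d = class_distances(g, centroids)                # d[0] == 0: identical embedding
```

With such distances in hand, "distance to the correct class" vs. "average distance to all classes" gives exactly the statistics reported in Tab. 9.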



Figure 2: Overview of SkeletonGCL. An input skeleton sequence I is fed into a GCN encoder, producing a feature vector f for classification and a learned graph g for graph contrastive learning. The graph g is embedded into a vector by a projection head, and two memory banks are built to store the embedded graphs. From the memory banks, we sample positives and negatives according to the labels, then apply the contrastive loss. The memory banks are only used during training and are removed at test time.
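The two memory banks in the pipeline above can be sketched as follows (a simplified illustration, not the authors' code; class and method names are ours). The instance-level bank is modeled as a per-class FIFO queue of recent graph embeddings, and the semantic-level bank as one momentum-updated centroid per class, using the momentum hyper-parameter α:

```python
import numpy as np
from collections import deque

class GraphMemoryBanks:
    """Sketch of the instance- and semantic-level memory banks.

    Instance bank: a FIFO queue keeping the most recent `per_class`
    graph embeddings of each class. Semantic bank: one momentum-updated
    centroid per class (alpha is the momentum hyper-parameter).
    """
    def __init__(self, num_classes, dim, per_class=128, alpha=0.85):
        self.instance = [deque(maxlen=per_class) for _ in range(num_classes)]
        self.semantic = np.zeros((num_classes, dim))
        self.alpha = alpha

    def update(self, g, label):
        # Enqueue into the instance bank (oldest entry is dropped).
        self.instance[label].append(g)
        # Momentum update of the class centroid in the semantic bank.
        self.semantic[label] = self.alpha * self.semantic[label] \
                               + (1 - self.alpha) * g

    def positives(self, label):
        # Same-class embeddings currently stored in the instance bank.
        return np.array(self.instance[label])
```

In practice the released implementation also handles normalization and warm-up of the centroids; this sketch only conveys the two complementary context levels.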

NTU RGB+D. NTU RGB+D (NTU60) (Shahroudy et al., 2016) is a large-scale skeleton-based action recognition dataset containing 60 action classes and 56,880 sequences. Each sequence is annotated as skeletons with 25 joints. All sequences are performed by 40 subjects and filmed by 3 Kinect cameras from three different views. Two protocols are commonly used for evaluation: (1) cross-subject (X-Sub): training data come from 20 subjects, and test data come from the other 20 subjects; (2) cross-view (X-View): training data come from cameras 2 and 3, and test data come from camera 1.

NTU RGB+D 120. NTU RGB+D 120 (NTU120) (Liu et al., 2019) is an extension of the NTU RGB+D dataset, adding 57,367 skeleton sequences of 60 extra classes. All sequences are performed by 106 subjects and filmed by three cameras from three different views. In addition, NTU RGB+D 120 has 32 setups, each denoting a unique location. Two protocols are commonly used for evaluation: (1) cross-subject (X-Sub): training data come from 53 subjects, and test data come from the other 53 subjects; (2) cross-setup (X-Set): training data are samples with even setup IDs, and test data are samples with odd setup IDs.

Northwestern-UCLA. The Northwestern-UCLA (NW-UCLA) dataset (Wang et al., 2014) contains 1,494 sequences from 10 action classes. Each sequence is annotated as skeletons with 20 joints. All sequences are performed by 10 subjects and filmed by three Kinect cameras from different views.
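As a small illustration of the cross-subject protocol, NTU file names encode the performer ID as "P###" (e.g. "S001C002P003R002A013.skeleton"), so the X-Sub split can be derived by filtering on that field. The subject set below is a placeholder, NOT the official 20-subject training list, which should be taken from the dataset release:

```python
import re

# Placeholder subject IDs -- NOT the official NTU X-Sub training list.
TRAIN_SUBJECTS = {1, 2, 4, 5, 8}

def split_xsub(filenames, train_subjects=TRAIN_SUBJECTS):
    """Partition sequences into train/test by performer ID (X-Sub)."""
    train, test = [], []
    for name in filenames:
        # "P###" in the file name is the performer (subject) ID.
        subject = int(re.search(r"P(\d{3})", name).group(1))
        (train if subject in train_subjects else test).append(name)
    return train, test
```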

With SkeletonGCL, CTR-GCN and InfoGCN set new state-of-the-art performance. NW-UCLA. Tab. 2 lists the results on NW-UCLA. SkeletonGCL still achieves consistent improvements over the three models, and new state-of-the-art performance is obtained by combining SkeletonGCL with CTR-GCN and InfoGCN.

Figure 3: Illustration of graph projection heads for GCNs.

We visualize the t-SNE distributions of graph and feature representations of sequences from six classes to illustrate the impact of SkeletonGCL. As shown in Fig. 4(a), SkeletonGCL shapes the graph representation structure: graphs from the same class cluster together, while graphs from different classes spread out. Consequently, as shown in Fig. 4(b), with SkeletonGCL the features from different classes become more distinguishable, indicating that graph contrast indeed improves the feature extraction capacity.

Figure 4: t-SNE visualization. t-SNE (Van der Maaten & Hinton, 2008) visualization of graph and feature representations from sequences in the test set of NTU60. Each color denotes a certain class. Best viewed zoomed in.

Top-1 accuracy comparison (%) with the state-of-the-art methods on the NTU60 and NTU120 datasets. Numbers in gray indicate the results reported in their papers. * indicates that we retrain the models using their officially released code. In particular, 2S-AGCN is retrained using the stronger training recipe from CTR-GCN.

Top-1 accuracy comparison (%) with the state-of-the-art methods on the NW-UCLA dataset. Numbers in gray denote the results reported in their papers. * indicates that we retrain the models using their officially released code. In particular, 2S-AGCN is retrained using the stronger training recipe from CTR-GCN.

Comparison of intra-batch and inter-batch contrast.

Comparison of feature and graph contrast.

Training consumption on NTU60.

Graph distance (dis.) comparison using Euclidean distance (×10^-5).

Performance (%) of samples with different graph distance ranks to the correct class.

Comparison of using cross-entropy loss to supervise graph learning.

Impact of the FC layer in graph projection head.

Impact of the size of M_Ins.

Performance comparison with different C_g.

Impact of τ .

Impact of α.

Performance comparison with different K_H^+.

Performance (%) on the top-10 hardest classes for CTR-GCN.

Performance (%) on the top-10 most improved classes for CTR-GCN.

ACKNOWLEDGEMENTS

This research is supported by the NSFC (grants No. 61773176 and No. 61733007).

CODE AVAILABILITY

The code will be available at https://github.com/OliverHxh/SkeletonGCL.

Published as a conference paper at ICLR 2023

To acquire the graph embedding for each class c*, we first calculate the centroid vector s_{c*} as

s_{c*} = (1 / N_{c*}) Σ_{i: y_i = c*} ḡ_i,

where N_{c*} denotes the number of samples belonging to class c*. To reveal the relation between graph quality and classification accuracy, we introduce three types of graph distance, i.e., d_all, d_cor, and d_mis. d_all measures the average distance to all class centroids, d_cor the distance to the centroid of the correct class c*, and d_mis the distance to the centroid of the misclassified class c_mis:

d_all = (1 / K) Σ_{c=1}^{K} ||ḡ − s_c||_2,  d_cor = ||ḡ − s_{c*}||_2,  d_mis = ||ḡ − s_{c_mis}||_2,

where K denotes the number of classes in the dataset.
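Under the distance definitions above, the computation can be sketched as follows (a NumPy illustration assuming Euclidean distance, consistent with Tab. 9; the function and variable names are ours):

```python
import numpy as np

def graph_distances(g, embeddings, labels, correct, predicted):
    """Compute d_all, d_cor, d_mis for one sample's graph embedding.

    g:          (D,) graph embedding of the sample.
    embeddings: (M, D) graph embeddings of all samples.
    labels:     (M,) integer class ids of all samples.
    correct:    ground-truth class id; predicted: predicted class id.
    """
    classes = np.unique(labels)  # sorted class ids
    # Class centroid = mean embedding over that class's N_c samples.
    centroids = np.stack([embeddings[labels == c].mean(axis=0)
                          for c in classes])
    dists = np.linalg.norm(centroids - g, axis=1)       # (K,) per-class
    d_all = dists.mean()                                # average over K
    d_cor = dists[np.searchsorted(classes, correct)]    # correct class
    d_mis = dists[np.searchsorted(classes, predicted)]  # predicted class
    return d_all, d_cor, d_mis
```

For a correctly recognized sample, one would expect d_cor to be small relative to d_mis, which is what Tab. 9 quantifies over the whole test set.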

