GRAPH CONTRASTIVE LEARNING FOR SKELETON-BASED ACTION RECOGNITION

Abstract

In the field of skeleton-based action recognition, current top-performing graph convolutional networks (GCNs) exploit intra-sequence context to construct adaptive graphs for feature aggregation. However, we argue that such context is still local, since the rich cross-sequence relations have not been explicitly investigated. In this paper, we propose a graph contrastive learning framework for skeleton-based action recognition (SkeletonGCL) to explore the global context across all sequences. Specifically, SkeletonGCL associates graph learning across sequences by enforcing graphs to be class-discriminative, i.e., intra-class compact and inter-class dispersed, which improves the capacity of GCNs to distinguish various action patterns. Besides, two memory banks are designed to enrich cross-sequence context at two complementary levels, i.e., the instance and semantic levels, enabling graph contrastive learning at multiple context scales. Consequently, SkeletonGCL establishes a new training paradigm that can be seamlessly incorporated into current GCNs. Without loss of generality, we combine SkeletonGCL with three GCNs (2S-AGCN, CTR-GCN, and InfoGCN) and achieve consistent improvements on the NTU60, NTU120, and NW-UCLA benchmarks. The source code will be available at https://github.com/OliverHxh/SkeletonGCL.
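To make the abstract's core idea concrete, the following is a minimal, hypothetical sketch of a supervised contrastive objective over learned graphs: flattened adjacency matrices are treated as embeddings, and an InfoNCE-style loss pulls same-class graphs together (intra-class compact) while pushing different-class graphs apart (inter-class dispersed). All function and variable names are illustrative and not taken from the authors' implementation, which additionally uses instance- and semantic-level memory banks.

```python
import numpy as np

def graph_contrastive_loss(graphs, labels, temperature=0.1):
    """Hypothetical sketch: InfoNCE-style supervised contrastive loss over
    flattened adjacency matrices, encouraging graphs of the same action class
    to be similar and graphs of different classes to be dissimilar."""
    n = len(graphs)
    z = graphs.reshape(n, -1)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # l2-normalize graph embeddings
    sim = (z @ z.T) / temperature                      # pairwise cosine similarities
    losses = []
    for i in range(n):
        mask = np.arange(n) != i                       # exclude self-similarity
        logits = sim[i, mask]
        positives = labels[mask] == labels[i]          # same-class graphs are positives
        if not positives.any():
            continue                                   # no positive pair for this anchor
        log_prob = logits - np.log(np.exp(logits).sum())
        losses.append(-log_prob[positives].mean())
    return float(np.mean(losses))
```

Under this sketch, a batch whose same-class graphs are already identical yields a much lower loss than one where labels and graphs are mismatched, which is the class-discriminative behavior the paper aims for.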

1. INTRODUCTION

Graph convolutional networks (GCNs) have been widely applied to skeleton-based action recognition since they can naturally process non-grid skeleton sequences. For GCN-based methods, how to effectively learn the graphs remains a core and challenging problem. In particular, ST-GCN (Yan et al., 2018) is a milestone work that uses pre-defined graphs to extract action patterns. However, pre-defined graphs only allow each joint to access a fixed set of neighboring joints, and thus struggle to capture long-range dependencies adaptively. Therefore, a mainstream of subsequent works (Li et al., 2019; Shi et al., 2019; Zhang et al., 2020b;a; Ye et al., 2020; Chen et al., 2021b; Chi et al., 2022) addresses this issue by generating adaptive graphs. Adaptive graphs can dynamically aggregate the features within each sequence and thus show significant performance advantages. Generally, adaptive graphs are constructed using intra-sequence context. However, such context is still "local" when cross-sequence information is viewed as an available context. Therefore, we wonder: is it possible to involve cross-sequence context in graph learning? To find out, in Fig. 1 we visualize the adaptive graphs learned from sequences of two easily confused classes ("point to something" and "take a selfie"). The graphs are learned by a strong GCN, i.e.,
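The adaptive graph construction described above can be sketched as follows: in the spirit of 2S-AGCN-style methods, per-joint features are passed through two learnable projections, their pairwise affinities are softmax-normalized into a data-dependent adjacency, and this is fused with the fixed skeleton graph. This is a simplified illustration under assumed shapes; the projection names (`w_theta`, `w_phi`) and the scalar `alpha` are placeholders, not the cited papers' exact formulations.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def adaptive_graph(x, w_theta, w_phi, a_skeleton, alpha=1.0):
    """Sketch of intra-sequence adaptive graph construction: joint features x
    (joints x channels) are embedded twice, and their normalized pairwise
    affinities refine a pre-defined skeleton adjacency a_skeleton."""
    theta = x @ w_theta                          # query-like joint embeddings
    phi = x @ w_phi                              # key-like joint embeddings
    affinity = softmax(theta @ phi.T, axis=-1)   # data-dependent adjacency, rows sum to 1
    return a_skeleton + alpha * affinity         # fuse fixed and adaptive graphs
```

Because the affinity term depends only on the features of the current sequence, this construction uses purely intra-sequence context, which is exactly the limitation the paper's cross-sequence contrastive objective is meant to address.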



* Work done when Xiaohu Huang was an intern at Baidu VIS. † Corresponding authors.

