IS END-TO-END LEARNING ENOUGH FOR FITNESS ACTIVITY RECOGNITION?

Abstract

End-to-end learning has taken hold of many computer vision tasks, in particular those related to still images, with task-specific optimization yielding very strong performance. Nevertheless, human-centric action recognition is still largely dominated by hand-crafted pipelines in which only individual components are replaced by neural networks that typically operate on individual frames. As a testbed for studying the relevance of such pipelines, we present a new, fully annotated video dataset of fitness activities. Recognition in this domain depends almost exclusively on human poses and their temporal dynamics, so pose-based solutions should perform well. We show that, with this labelled data, end-to-end learning on raw pixels can compete with state-of-the-art action recognition pipelines based on pose estimation. We also show that end-to-end learning can support temporally fine-grained tasks such as real-time repetition counting.

1. INTRODUCTION

Action recognition in videos has slowly been transitioning to real-world applications following extensive advances in feature representation and deep learning-based architectures. In many applications, models need to extract detailed information about the underlying spatio-temporal dynamics. Towards this, end-to-end learning has recently had considerable success on generic action recognition datasets comprised of varied everyday activities (Carreira & Zisserman, 2017; Goyal et al., 2017; Materzynska et al., 2019). However, pose-based pipelines remain the preferred solution when the task is strongly related to analyzing body motion (Bazarevsky et al., 2020; Shahroudy et al., 2016; Liu et al., 2020a;b; Yan et al., 2018), such as in the rapidly growing application domain of virtual fitness, where an AI system can deliver real-time form feedback and count exercise repetitions. In this paper, we present a new fitness action recognition dataset with granular intra-exercise labels and compare the few-shot learning abilities of pose-estimation-based pipelines with those of end-to-end learning from raw pixels. We also compare the influence of different pre-training datasets on the chosen models and additionally train them for repetition counting.

Common approaches to generic video understanding based on end-to-end learning include combinations of 2D-CNNs for spatial feature extraction followed by an LSTM module for learning temporal dynamics (Donahue et al., 2017; Ng et al., 2015), directly learning spatio-temporal dynamics with a 3D-CNN (Ji et al., 2013), or combining a 3D-CNN with an LSTM (Molchanov et al., 2016). Temporal understanding can be further improved in a two-stream approach with a second CNN-based stream trained on optical flow (Carreira & Zisserman, 2017; Feichtenhofer et al., 2016; Simonyan & Zisserman, 2014).
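To make the repetition-counting task concrete, a minimal sketch of how repetitions could be counted in real time from a stream of per-frame predictions follows. The phase labels ("up"/"down"), the `count_repetitions` function, and the debounce length are illustrative assumptions, not details taken from this paper; any classifier that emits a phase per frame could feed such a counter.

```python
# Hypothetical sketch: count exercise repetitions from per-frame phase
# predictions (e.g. emitted by a video classifier). Names and the
# debounce parameter are assumptions for illustration only.

def count_repetitions(phase_stream, debounce=3):
    """Count one repetition per completed down->up transition.

    A phase must be observed for `debounce` consecutive frames before it
    is accepted, which suppresses single-frame prediction flicker.
    """
    reps = 0
    stable_phase = None      # last accepted phase
    candidate, run = None, 0  # phase currently being confirmed
    for phase in phase_stream:
        if phase == candidate:
            run += 1
        else:
            candidate, run = phase, 1
        if run >= debounce and candidate != stable_phase:
            # A new phase has stabilised.
            if stable_phase == "down" and candidate == "up":
                reps += 1  # one full repetition completed
            stable_phase = candidate
    return reps

# Example: two squats, with one spurious "up" frame in the middle.
frames = (["up"] * 4 + ["down"] * 4 + ["up"] * 4
          + ["down"] * 3 + ["up"] + ["down"] * 3 + ["up"] * 4)
print(count_repetitions(frames))  # -> 2
```

Because the counter only consumes one label per frame, it runs in constant memory and is suitable for the real-time setting the paper targets; the debounce step stands in for the temporal smoothing a learned model would otherwise provide.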
The large parameter space of 3D-CNNs can be prohibitive; efforts to reduce it include dual-pathway approaches operating at low/high frame rates (Feichtenhofer et al., 2019) and resolutions (Fan et al., 2019), temporally shifting frames in a 2D-CNN (Lin et al., 2019), and non-uniformly aggregating features over time (Li et al., 2020). Using a multi-task approach, an end-to-end model jointly trained for pose estimation and subsequent action classification was shown to improve the performance of both components (Li et al., 2017), but pose information is still needed for training. Pose-based solutions for action recognition have two main stages: pose extraction and action classification. While bottom-up pose estimation approaches extract skeletons in one step (Cao et al., 2016;

