IS END-TO-END LEARNING ENOUGH FOR FITNESS ACTIVITY RECOGNITION?

Abstract

End-to-end learning has taken hold of many computer vision tasks, in particular, related to still images, with task-specific optimization yielding very strong performance. Nevertheless, human-centric action recognition is still largely dominated by hand-crafted pipelines, and only individual components are replaced by neural networks that typically operate on individual frames. As a testbed to study the relevance of such pipelines, we present a new fully annotated video dataset of fitness activities. Any recognition capabilities in this domain are almost exclusively a function of human poses and their temporal dynamics, so pose-based solutions should perform well. We show that, with this labelled data, end-to-end learning on raw pixels can compete with state-of-the-art action recognition pipelines based on pose estimation. We also show that end-to-end learning can support temporally fine-grained tasks such as real-time repetition counting.

1. INTRODUCTION

Action recognition in videos has slowly been transitioning to real-world applications following extensive advances in feature representation and deep learning-based architectures. In many applications, models need to extract detailed information about the underlying spatio-temporal dynamics. Towards this, end-to-end learning has recently had considerable success on generic action recognition datasets comprised of varied everyday activities (Carreira & Zisserman, 2017; Goyal et al., 2017; Materzynska et al., 2019). However, pose-based pipelines remain the preferred solution when the task is strongly tied to analyzing body motion (Bazarevsky et al., 2020; Shahroudy et al., 2016; Liu et al., 2020a;b; Yan et al., 2018), such as in the rapidly growing application domain of virtual fitness, where an AI system can deliver real-time form feedback and count exercise repetitions. In this paper, we present a new fitness action recognition dataset with granular intra-exercise labels and compare the few-shot learning abilities of pose estimation-based pipelines with those of end-to-end learning from raw pixels. We also compare the influence of different pre-training datasets on the chosen models and additionally train them for repetition counting.

Common approaches to generic video understanding based on end-to-end learning include combinations of 2D-CNNs for spatial feature extraction followed by an LSTM module for learning temporal dynamics (Donahue et al., 2017; Ng et al., 2015), directly learning spatio-temporal dynamics with a 3D-CNN (Ji et al., 2013), or combining a 3D-CNN with an LSTM (Molchanov et al., 2016). Temporal understanding can be further improved in a two-stream approach with a second CNN-based stream trained on optical flow (Carreira & Zisserman, 2017; Feichtenhofer et al., 2016; Simonyan & Zisserman, 2014).
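The 2D-CNN-plus-LSTM family above shares a common structure: a spatial backbone maps each frame to a feature vector, and a recurrent module aggregates those vectors over time into a clip-level representation. The following minimal NumPy sketch illustrates only this data flow; the backbone is stubbed out as a random projection and the recurrence is reduced to a single-gate update, so all names and sizes are illustrative and do not correspond to the models evaluated in this paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a 2D-CNN backbone: maps one RGB frame to a feature vector.
def backbone_2d(frame, w):
    return np.tanh(frame.reshape(-1) @ w)

# Simplified recurrent aggregation over time (an LSTM-style module reduced
# to a single tanh recurrence for illustration).
def aggregate_temporal(features, w_rec):
    h = np.zeros(w_rec.shape[0])
    for f in features:
        h = np.tanh(w_rec @ h + f)
    return h

T, H, W, C = 8, 16, 16, 3                        # clip: 8 frames of 16x16 RGB
video = rng.standard_normal((T, H, W, C))
w_spatial = rng.standard_normal((H * W * C, 32)) * 0.01
w_rec = rng.standard_normal((32, 32)) * 0.1

per_frame = [backbone_2d(f, w_spatial) for f in video]  # T feature vectors
clip_embedding = aggregate_temporal(per_frame, w_rec)   # one clip-level vector
print(clip_embedding.shape)  # (32,)
```

A 3D-CNN instead convolves jointly over the (T, H, W) axes, trading this two-stage factorization for direct spatio-temporal filters at the cost of a larger parameter space.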
The large parameter space of 3D-CNNs can be prohibitive, and efforts to reduce it include dual-pathway approaches operating at low and high frame rates (Feichtenhofer et al., 2019) or resolutions (Fan et al., 2019), temporally shifting frames in a 2D-CNN (Lin et al., 2019), and non-uniform temporal feature aggregation (Li et al., 2020). Using a multi-task approach, an end-to-end model jointly trained for pose estimation and subsequent action classification was shown to improve the performance of the individual components (Li et al., 2017), but pose information is still needed for training.

Pose-based solutions for action recognition have two main stages: pose extraction and action classification. While bottom-up pose estimation approaches extract skeletons in one step (Cao et al., 2016; Cheng et al., 2019; Newell & Deng, 2016; Geng et al., 2021), top-down methods split pose estimation into localization first and then pose extraction (Bazarevsky et al., 2020; Newell et al., 2016; Sun et al., 2019; Xiao et al., 2018). The classification stage is then optimized independently, with no end-to-end finetuning of the whole pipeline. Pose-based action classifiers typically use either hand-crafted features (Ofli et al., 2014; Wang et al., 2012; Vemulapalli et al., 2014) or, increasingly, deep learning-based modules. Recent approaches have employed CNNs (Ke et al., 2017; Kim & Reiter, 2017; Li et al., 2017), LSTMs (Liu et al., 2016; Zhu et al., 2016; Shahroudy et al., 2016; Zhang et al., 2017), graph CNNs (Yan et al., 2018; Thakkar & Narayanan, 2019; Si et al., 2019), or 3D-CNNs on top of pose heatmaps (Duan et al., 2021).

In addition to an appropriate model architecture, a dataset with a fine-grained action taxonomy is crucial to learning robust action representations. Existing RGB-based video datasets such as Kinetics (Kay et al., 2017), Moments in Time (Monfort et al., 2019) and Sports-1M (Karpathy et al., 2014) are based on a high-level taxonomy and, further, possess correlated scene-action pairings, resulting in pronounced representation bias (Choi et al., 2019; Li et al., 2018). These concerns can be mitigated through crowd-sourced collection of predefined labels, where the same action is collected from multiple workers, as in the Something-Something (Goyal et al., 2017) and Charades (Sigurdsson et al., 2016) datasets. However, the "everyday general human actions" within these datasets are loosely specified and left to the workers' interpretation, resulting in high inter-worker action variance. On the other hand, FineGym (Shao et al., 2020) focuses on specific fine-grained body motions but includes variability in camera position, resulting in lower overall action salience. In contrast, gesture recognition datasets such as Jester (Materzynska et al., 2019) control camera and worker positioning and additionally constrain human motion to appropriately specified hand gestures. A similarly constrained dataset for exact human body movements that also controls camera motion does not exist, and we believe home fitness is the perfect domain in which to create one, as workers can be instructed to move in very specific ways to perform exercises. Pose-specific datasets contain an additional layer of annotated skeletal joints, obtained either by annotating scraped video datasets (manually or with a pose estimation model (Liu et al., 2020b)) or via a sensor-derived approach in constrained lab settings (Shahroudy et al., 2016; Liu et al., 2020a).

We present a new crowd-sourced benchmark dataset to fill this gap in the dataset landscape (see Table 1): videos of fitness exercises in a home setting are recorded in the wild, providing challenging scene variety, while also following a fine-grained label taxonomy. We compare end-to-end action classification models with state-of-the-art pose estimation-based action classifiers and show that the end-to-end approaches can outperform the pose estimation-based alternatives if the end-to-end models are pre-trained on a large video corpus with granular labels. We also show that the pose estimation models themselves can greatly benefit from pre-training on the large labelled dataset.

Table 1: Side-by-side comparison of the Exercise Videos Dataset (ours) versus common video datasets, including NTU RGB+D (Liu et al., 2020a), FineGym (Shao et al., 2020), Jester (Materzynska et al., 2019), Something-Something (Goyal et al., 2017), Charades (Sigurdsson et al., 2016), Kinetics (Kay et al., 2017) and Moments (Monfort et al., 2019), based on five criteria: a) focus on body motion, b) fine-grained label taxonomy (e.g. presence of intra-activity variations), c) controlled environment (e.g. fixed camera angle in a home environment), d) "in the wild" (as opposed to e.g. recorded in a lab), and e) dataset size sufficient for stand-alone pre-training.

                        Ours  NTU  FineGym  Jester  Sth-Sth  Charades  Kinetics  Moments
Focus on body motion     ✓    ✓      ✓       ✓        ×        ×         ×        ×
Fine-grained labels      ✓    ✓      ✓       ×        ✓        ✓         ×        ×
Controlled environment   ✓    ✓      ×       ✓        ×        ×         ×        ×
"In the wild"            ✓    ×      ✓       ✓        ✓        ✓         ✓        ✓
Large-scale              ×    ✓      ✓       ✓        ✓        ✓         ✓        ✓

2. THE EXERCISE VIDEOS DATASET¹ - A NEW BENCHMARK DATASET

Fitness activities are defined by a well-constrained set of body movements outside of which an individual risks injury or ineffectiveness. There is an opportunity for AI systems to detect mistakes and

¹ The name is a temporary placeholder due to the double-blind submission.

