BALLROOM DANCE MOVEMENT RECOGNITION USING A SMART WATCH AND REPRESENTATION LEARNING

Abstract

Smart watches are increasingly being used to detect human gestures and movements. Whole body movement recognition using a single smart watch remains a hard problem, because the movements may not be adequately captured by the sensors in the watch. In this paper, we present a whole body movement detection study using a single smart watch in the context of ballroom dancing. Deep learning representations are used to classify well-defined sequences of movements, called figures. Those representations are found to outperform ensembles of decision trees and hidden Markov models. The classification accuracy of 85.95% was improved to 92.31% by modeling a dance as a first-order Markov chain of figures.

1. INTRODUCTION

Recent work has used low-cost smart watches to track the movement of human body parts. ArmTrak tracks arm movement, assuming that the body and torso are stationary (Shen et al., 2016). In this paper, we perform whole body movement recognition using a single smart watch, which is a hard problem given that body movements need to be inferred using readings taken from a single location on the body (the wrist). The movements in the study are from ballroom dancing, which engages tens of thousands of competitors in the U.S. and other countries. Competitors dance at different skill levels, and each level is associated with an internationally recognized syllabus, set by the World DanceSport Federation. The syllabus breaks each dance into smaller segments with well-defined body movements. Those segments are called figures. In the waltz, for example, each figure has a length of one measure of the waltz song being danced to; the entire dance is a sequence of 40 to 60 figures (depending on the length of the song). The sequence is random, but the figures themselves are well-defined. The sequence is illustrated in Fig. 1. The International Standard ballroom dances are a subset of the ballroom dances danced around the world, and they include the waltz, tango, foxtrot, quickstep and Viennese waltz. A unique characteristic of all these dances is that the couple is always in a closed hold, meaning that the partners never separate. Also, both dancers in the couple maintain a rigid frame, meaning that the arms and torso move together as one unit. The head and the lower body, however, move independently of that arms-torso unit. Our hypothesis in this paper is that the figures in each of these dances can be recognized with high accuracy using deep learning representations of data obtained from a single smart watch worn by the lead in the couple.
That is possible because the rigid frame makes it unnecessary to separately instrument the arms and torso, and because most figures are characterized by distinct movements (translations and rotations in space) of the arms and torso. We refer the interested reader to the website www.ballroomguide.com for free videos and details on the various syllabus figures in all the International Standard ballroom dance styles. In this paper, we validate our hypothesis on the quintessential ballroom dance, the waltz. We chose 16 waltz figures that are most commonly danced by amateurs. The full names of the figures are included in Appendix A. Our goal is to accurately classify those figures in real time using data from a smart watch. The classification results can be pushed to mobile devices in the hands of spectators at ballroom competitions, providing them with real-time commentary on the moves that they have just watched being performed. That is an augmented-reality platform serving laymen in the audience who want to become more engaged with the nuances of the dance that they are watching. The main beneficiary of the analysis of dance movements would be the dancers themselves. The analysis will help them identify whether or not they are dancing the figures correctly. If a figure is confused for a different figure, it may be because the dancers have not sufficiently emphasized the difference in their dancing and need to improve their technique on that figure. Those confusion metrics could also be used by competition judges to mark competitors on how well they are performing figures; that task is currently done by eyeballing multiple competitors on the floor, and is challenging when there are over ten couples to keep track of. We make three main contributions in detecting ballroom dance movements using learning representations.
• First, we show that representations using data from a single smart watch are sufficient for discriminating between complex dancing movements.
• Second, we identify and evaluate six learning representations that can be used for classifying the figures with varying accuracies. The representations are 1) Gaussian Hidden Markov Model, 2) Extra Trees Classifier, 3) Feed-Forward Neural Network, 4) Recurrent Neural Network (LSTM), 5) Convolution Neural Network, and 6) a Convolution Neural Network that feeds into a Recurrent Neural Network.
• Finally, we model the sequence of figures as a Markov chain, using the fact that the transitions between figures are memoryless. We use the rules of the waltz to determine which transitions are possible and which are not. With that transition knowledge, we correct the immediately previous figure's estimate. This leads to an average estimation accuracy improvement of 5.33 percentage points.
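The third contribution can be illustrated with a minimal sketch. It is not the procedure used in this paper, only one plausible way to apply transition knowledge, under stated assumptions: the classifier emits a probability per figure for the previous and current measures, the dance rules are encoded as a boolean matrix, and the previous estimate is revised by picking the permitted (previous, current) pair with the highest joint score. The function name `correct_previous`, the three-figure toy rules, and all probabilities are hypothetical.

```python
def correct_previous(prev_probs, curr_probs, allowed):
    """Revise the previous figure's estimate using the current figure's scores.

    prev_probs, curr_probs: per-figure classifier probabilities (hypothetical).
    allowed[i][j]: True if figure j may follow figure i (assumed rule encoding).
    Returns the (previous, current) index pair with the highest joint score
    among permitted transitions.
    """
    best_score, best_pair = 0.0, None
    for i, p_prev in enumerate(prev_probs):
        for j, p_curr in enumerate(curr_probs):
            if allowed[i][j] and p_prev * p_curr > best_score:
                best_score, best_pair = p_prev * p_curr, (i, j)
    if best_pair is None:
        # No permitted pair has a positive score: keep the independent estimates.
        prev_best = max(range(len(prev_probs)), key=lambda i: prev_probs[i])
        curr_best = max(range(len(curr_probs)), key=lambda j: curr_probs[j])
        return prev_best, curr_best
    return best_pair


# Toy rules for three hypothetical figures 0, 1 and 2.
allowed = [
    [False, True, True],
    [True, False, True],
    [True, False, False],  # figure 2 may only be followed by figure 0
]
prev_probs = [0.30, 0.25, 0.45]  # taken alone, the previous measure looks like figure 2
curr_probs = [0.05, 0.90, 0.05]  # the current measure is confidently figure 1

prev_fig, curr_fig = correct_previous(prev_probs, curr_probs, allowed)
print(prev_fig, curr_fig)
```

In this toy case the transition from figure 2 to figure 1 is not permitted, so the previous estimate is revised from figure 2 to figure 0, the most likely figure that can precede figure 1.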

2. DATASET DESCRIPTION

2.1 DATA COLLECTION

The data was collected using an Android app on a Samsung Gear Live smart watch. The app was developed for this work on top of the ArmTrak data collection app. We were able to reliably collect two derived sensor measurements from the Android API:
• Linear Acceleration. This contains accelerometer data in the X, Y and Z directions of the smart watch, with the effect of gravity removed.
• Rotation Vector. This provides the Euler angles (roll, pitch and yaw) by fusing accelerometer, gyroscope and magnetometer readings in the global coordinate space. We use only the yaw (rotation about the vertical axis) in this study, based on prior knowledge that roll and pitch are insignificant in the waltz figures included in the study.
In total, we collected readings from 4 sensor axes (three from the Linear Acceleration sensor and the yaw from the Rotation Vector sensor). The readings were reported by the watch operating system asynchronously, at irregular intervals, whenever a change was sensed. In order to facilitate signal processing, we downsampled the data such that each figure contained exactly 100 sensor samples, which was possible because the effective sampling rate was greater than that. The downsampling was done by taking the median (instead of the mean, which is sensitive to outliers) values of 100



[Diagram labels: Figure 1, Figure 2, …, Figure N]

Figure 1: A dance is a random sequence of well-defined figures (movements). If the dancer is instrumented with sensors, the figures emit sensor readings that should be similar for each type of figure.

