MMTSA: MULTIMODAL TEMPORAL SEGMENT ATTENTION NETWORK FOR EFFICIENT HUMAN ACTIVITY RECOGNITION

Abstract

Multimodal sensors (e.g., visual, non-visual, and wearable) provide complementary information for developing robust perception systems that recognize activities. However, most existing algorithms use dense sampling and heterogeneous subnetworks to extract unimodal features and fuse them only at the end of their frameworks, which causes data redundancy, insufficient exploitation of complementary multimodal information, and high computational cost. In this paper, we propose a novel multimodal neural architecture based on RGB and IMU wearable sensors (e.g., accelerometer, gyroscope) for human activity recognition, called the Multimodal Temporal Segment Attention Network (MMTSA). MMTSA first employs a multimodal data isomorphism mechanism based on the Gramian Angular Field (GAF) and then applies a novel multimodal sparse sampling method to reduce redundancy. Moreover, we propose an inter-segment attention module in MMTSA to fuse multimodal features effectively and efficiently. We demonstrate the importance of IMU data imaging and the attention mechanism for human activity recognition through rigorous evaluation on three public datasets, achieving substantial improvements (11.13% on the MMAct dataset) over previous state-of-the-art methods.
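To make the GAF-based data isomorphism concrete, the following is a minimal sketch of how a 1-D IMU channel can be encoded as a 2-D image using the Gramian Angular Summation Field (GASF) variant. The normalization scheme, GAF variant, and channel stacking used by MMTSA may differ; the function name `gasf` and the synthetic signal are illustrative only.

```python
import numpy as np

def gasf(signal: np.ndarray) -> np.ndarray:
    """Encode a 1-D time series as a Gramian Angular Summation Field image."""
    # Rescale the signal to [-1, 1] so that arccos is well defined.
    x = signal.astype(np.float64)
    x = 2.0 * (x - x.min()) / (x.max() - x.min() + 1e-8) - 1.0
    x = np.clip(x, -1.0, 1.0)
    # Polar encoding: angle phi_i = arccos(x_i).
    phi = np.arccos(x)
    # GASF_ij = cos(phi_i + phi_j), computed via the outer sum of angles.
    return np.cos(phi[:, None] + phi[None, :])

# Example: a 128-sample accelerometer channel becomes a 128x128 "image"
# that can be stacked with other IMU channels and fed to a 2-D backbone,
# giving the IMU branch the same input form as the RGB branch.
acc_x = np.sin(np.linspace(0, 4 * np.pi, 128))   # synthetic stand-in data
image = gasf(acc_x)
print(image.shape)  # (128, 128)
```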

1. INTRODUCTION

With the increasing interest in wearable technologies and the development of deep learning, human activity recognition (HAR) has recently attracted widespread attention in human-computer interaction, healthcare, and multimedia analysis. Uni-modal (e.g., RGB video, audio, acceleration, infrared sequence, etc.) HAR methods have been extensively investigated in the past years Wang et al. (2016); Lin et al. (2019); García-Hernández et al. (2017); Slim et al. (2019); Akula et al. (2018). However, unimodal HAR methods have limited generalization ability in real-world scenarios. The performance of a video-based HAR method is usually sensitive to illumination intensity, visual occlusion, or complex backgrounds, while noisy or missing data and movement variance between users can negatively affect a sensor-based HAR method. In these cases, it is important to leverage both vision-based and sensor-based modalities to compensate for the weaknesses of a single modality and improve HAR performance in a multimodal manner.

For instance, the movement trajectories of the hands are similar when people eat or drink, so it is difficult to distinguish the two actions based only on IMU sensor data. However, when the vision modality is considered, the two actions can be distinguished based on the visual characteristics of the objects held in the hands. Another example is outdoor activity recognition. Due to the similarity of outdoor environments, it is challenging for a single-modality system to classify whether a user is walking or running by relying only on the visual data from smart glasses. On the other hand, the IMU sensor data (i.e., accelerometer data) of these two activities exhibit significantly different feature representations. In recent years, sensors have been widely embedded in smart glasses, smartwatches, and smartphones, making more input modalities available for HAR. Therefore, various multimodal learning methods have been proposed to exploit complementary properties among modalities for fusion- and co-learning-based HAR.

Existing multimodal learning methods that fuse visual and IMU sensor data have the following shortcomings: 1) Owing to data heterogeneity, most existing methods feed uni-modal data into separate sub-networks with different structures to extract features and fuse them only at the end stages. Obviously, there is a huge structural divergence between IMU sensor and

