MMTSA: MULTIMODAL TEMPORAL SEGMENT ATTENTION NETWORK FOR EFFICIENT HUMAN ACTIVITY RECOGNITION

Abstract

Multimodal sensors (e.g., visual, non-visual, and wearable) provide complementary information that can be used to build robust perception systems for activity recognition. However, most existing algorithms use dense sampling and heterogeneous sub-networks to extract unimodal features and fuse them only at the end of the pipeline, which causes data redundancy, loss of complementary multimodal information, and high computational cost. In this paper, we propose a novel multimodal neural architecture for human activity recognition based on RGB and IMU wearable sensors (e.g., accelerometer, gyroscope), called the Multimodal Temporal Segment Attention Network (MMTSA). MMTSA first employs a multimodal data isomorphism mechanism based on the Gramian Angular Field (GAF) and then applies a novel multimodal sparse sampling method to reduce redundancy. Moreover, we propose an inter-segment attention module in MMTSA to fuse multimodal features effectively and efficiently. We demonstrate the importance of IMU data imaging and the attention mechanism for human activity recognition through rigorous evaluation on three public datasets, achieving substantial improvements (11.13% on the MMAct dataset) over previous state-of-the-art methods.

1. INTRODUCTION

With the increasing interest in wearable technologies and the development of deep learning, human activity recognition (HAR) has recently attracted widespread attention in human-computer interaction, healthcare, and multimedia analysis. Unimodal HAR methods (e.g., based on RGB video, audio, acceleration, or infrared sequences) have been extensively investigated in the past years Wang et al. (2016); Lin et al. (2019); García-Hernández et al. (2017); Slim et al. (2019); Akula et al. (2018). However, unimodal HAR methods have limited generalization ability in real-world scenarios. The performance of a video-based HAR method is usually sensitive to illumination intensity, visual occlusion, or complex backgrounds. Meanwhile, noisy or missing data and movement variance between users can negatively affect a sensor-based HAR method. In these cases, it is important to leverage both vision-based and sensor-based modalities to compensate for the weaknesses of a single modality and improve HAR performance in a multimodal manner. For instance, the movement trajectories of the hands are similar when people eat or drink, so it is difficult to distinguish the two actions based on IMU sensor data alone. However, when the vision modality is considered, the two actions can be distinguished from the visual characteristics of the objects held in the hands. Another example is outdoor activity recognition. Due to the similarity of outdoor environments, it is challenging for a single-modality system to classify whether a user is walking or running by relying only on the visual data of smart glasses. On the other hand, the IMU sensor data (e.g., accelerometer data) of these two activities exhibit significantly different feature representations. In recent years, sensors have been widely integrated into smart glasses, smartwatches, and smartphones, making the available input modalities for HAR more abundant.
Therefore, various multimodal learning methods have been proposed to exploit complementary properties among modalities for fusion- and co-learning-based HAR. Existing multimodal learning methods that fuse vision and IMU sensor data have the following shortcomings: 1) Owing to data heterogeneity, most existing methods feed unimodal data into separate sub-networks with different structures to extract features and fuse them at the final stages. There is a large structural divergence between IMU and vision-sensor data. Since IMU sensor data are one-dimensional time-series signals, most previous work uses 1D-CNN or LSTM networks to extract spatial and temporal features from raw IMU data Steven Eyobu & Han (2018); Panwar et al. (2017); Wang et al. (2019a). Vision-based activity data, however, are usually images or videos with two or more dimensions, which suit 2D-CNN or 3D-CNN visual feature extractors Simonyan & Zisserman (2014); Karpathy et al. (2014b); Sun et al. (2017). Thus, this form of input in existing multimodal learning models not only ignores the temporal synchronization correlation between modalities but also loses valuable complementary information. Additionally, it is costly to add new sub-networks to learn representations of new input modalities. 2) Dense temporal sampling, i.e., sampling frames densely within a video clip or using the entire series of sensor data over a time period, is widely used in previous work to capture long-range temporal information in long-lasting activities. For example, Tanberk et al. (2020); Wei et al. (2019) rely mainly on dense temporal sampling to improve performance, which results in data redundancy and unnecessary computation, since adjacent video frames differ negligibly and the IMU data of some activities are periodic. 3) Although some recently proposed attention-based multimodal learning methods Islam & Iqbal (2021; 2022) have improved the performance of HAR tasks, their complicated architectures incur high computational overhead and make them challenging to deploy on mobile and wearable devices. To address the challenges above, we propose MMTSA, a novel and efficient multimodal HAR network for vision and IMU sensors.

Our method makes three contributions:
• We design MMTSA, a novel multimodal neural architecture based on RGB videos and IMU sensor data for end-to-end human activity recognition. MMTSA efficiently exploits heterogeneous multimodal data through an isomorphism mechanism and supports extension to new modalities. Moreover, MMTSA leverages a segment-based sparse co-sampling scheme that enables long-lasting activity recognition at low computational cost.
• MMTSA is the first to apply the Gramian Angular Field (GAF) method to encode IMU sensor data into multi-channel grayscale images for multimodal human activity recognition. This transformation narrows the structural gap between RGB and IMU sensor data and improves the reusability of the model's sub-networks, which reduces computation.
• To better mine the complementary information between synchronized data of different modalities and the correlation between different temporal stages, we design an inter-segment modality attention mechanism for spatiotemporal feature fusion.
We compared our method with several state-of-the-art and traditional HAR algorithms on three multimodal activity datasets. MMTSA achieves improvements of 11.13% and 2.59% (F1-score) on the MMAct dataset under the cross-subject and cross-session evaluations, respectively.
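As a concrete illustration of the GAF encoding that underpins the isomorphism mechanism, the following is a minimal NumPy sketch of the Gramian Angular Summation Field; the function name, window length, and example signal are illustrative, not taken from the paper:

```python
import numpy as np

def gasf(x):
    """Encode a 1-D signal as a Gramian Angular Summation Field image.

    x: 1-D array of sensor readings (e.g., one accelerometer axis).
    Returns an (n, n) matrix with values in [-1, 1] that can be treated
    as one grayscale image channel.
    """
    x = np.asarray(x, dtype=np.float64)
    # Min-max rescale to [-1, 1] so arccos is well defined.
    x_min, x_max = x.min(), x.max()
    x = 2.0 * (x - x_min) / (x_max - x_min) - 1.0
    x = np.clip(x, -1.0, 1.0)          # guard against rounding error
    phi = np.arccos(x)                 # polar-coordinate angle
    # GASF(i, j) = cos(phi_i + phi_j)
    return np.cos(phi[:, None] + phi[None, :])

# Each IMU axis yields one channel; stacking the axes gives a
# multi-channel grayscale image that a shared 2D-CNN backbone can consume.
signal = np.sin(np.linspace(0, 4 * np.pi, 64))
img = gasf(signal)
print(img.shape)  # (64, 64)
```

Because the image is symmetric and bounded like a normalized RGB frame, the same 2D convolutional backbone can, in principle, be reused across modalities.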

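A rough sketch of the kind of inter-segment attention fusion MMTSA proposes, reduced to plain scaled dot-product self-attention over per-segment features of both modalities; the identity query/key/value projections and function names are simplifying assumptions, not the paper's exact module:

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def inter_segment_attention(rgb_feats, imu_feats):
    """Fuse per-segment RGB and IMU features with scaled dot-product
    self-attention across all (segment, modality) tokens.

    rgb_feats, imu_feats: (num_segments, dim) arrays.
    Returns fused tokens of shape (2 * num_segments, dim), where every
    token attends to every segment of both modalities.
    """
    tokens = np.concatenate([rgb_feats, imu_feats], axis=0)  # (2K, dim)
    d = tokens.shape[-1]
    # Identity projections keep the sketch dependency-free; a real model
    # would learn separate query/key/value weight matrices.
    q, k, v = tokens, tokens, tokens
    attn = softmax(q @ k.T / np.sqrt(d), axis=-1)            # (2K, 2K)
    return attn @ v
```

Letting tokens from one modality attend to time-aligned tokens of the other is what allows the fusion to exploit the temporal synchronization between streams, rather than merging features only after all temporal structure has been pooled away.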
The accuracy of video-based HAR methods depends heavily on video representation learning. Early works use 2D-based methods Karpathy et al. (2014a) to learn video representations. Later, 3D convolutional networks Tran et al. (2015); Xie et al. (2018) were explored and achieved excellent performance. However, 3D-based methods take several consecutive frames as input, so their high complexity and computational cost make them expensive to use. Recently, 2D convolutional networks with temporal aggregation Wang et al. (2016); Lin et al. (2019) have achieved significant improvements at much lower computational and memory cost.
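The segment-based sparse sampling that these temporal-aggregation methods rely on, and that MMTSA's co-sampling scheme builds on, can be sketched as follows; the function name and defaults are illustrative, not from any specific implementation:

```python
import numpy as np

def sparse_segment_indices(num_frames, num_segments, rng=None):
    """TSN-style sparse sampling: split a clip into equal-length segments
    and draw one frame index per segment, so the whole clip is covered
    with only `num_segments` frames instead of a dense frame stream."""
    edges = np.linspace(0, num_frames, num_segments + 1)
    if rng is None:
        # Inference: deterministic segment centers.
        return ((edges[:-1] + edges[1:]) / 2).astype(int)
    # Training: one uniformly random index per segment for augmentation.
    return np.array([rng.integers(int(edges[i]), int(edges[i + 1]))
                     for i in range(num_segments)])

# A 300-frame clip reduced to 8 representative frames.
print(sparse_segment_indices(300, 8))  # [ 18  56  93 131 168 206 243 281]
```

Because synchronized IMU windows can be indexed by the same segment boundaries, the two modalities stay temporally aligned while both benefit from the reduced sampling rate.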

