SELF-SUPERVISED SPEECH ENHANCEMENT USING MULTI-MODAL DATA

Abstract

Modern earphones come equipped with microphones and inertial measurement units (IMUs). When a user wears the earphone, the IMU can serve as a second modality for detecting speech signals. Specifically, as humans speak to their earphones (e.g., during phone calls), the throat's vibrations propagate through the skull and ultimately induce a vibration in the IMU. The IMU data is heavily distorted (compared to the microphone's recordings), but IMUs offer a critical advantage: they are not affected by ambient sounds. This presents an opportunity in multi-modal speech enhancement, i.e., can the distorted but interference-free IMU signal enhance the user's speech when the microphone's signal suffers from strong ambient interference, and mitigate the need for labeled data in model development? We combine the best of both modalities (microphone and IMU) by designing a cooperative and self-supervised network architecture that does not rely on clean speech data from the user. Instead, using only noisy speech recordings, the IMU learns to give hints on where the target speech is likely located. The microphone uses these hints to enrich the speech signal, which in turn trains the IMU to improve subsequent hints. This iterative approach yields promising results, comparable to a supervised denoiser trained on clean speech signals. When clean signals are also available to our architecture, we observe promising SI-SNR improvement. We believe this result can aid speech-related applications in earphones and hearing aids, and potentially generalize to others, like audio-visual denoising.

1. INTRODUCTION

Speech enhancement/denoising are long-standing problems in audio analysis. Recent deep learning approaches have successfully broken through the performance walls, to the extent that even pre-trained voice assistants like Siri, Alexa, and Google are remarkably successful (Tulsiani et al., 2020). It is not surprising that the bar on speech enhancement is being raised, with newer form factors and more challenging use cases on the horizon. A growing domain of interest is in the context of "earables" (e.g., earphones, hearing aids, and glasses). Even though the user speaks from close to the earphone, the problem is particularly challenging because: (1) the background interference can be high in real-world public environments (e.g., restaurants, airports, buses, trains) (Schwartz, 2022); (2) users tend to speak softly, lest they disturb others around them; and (3) the relatively few microphones on earable devices must forgo some of the array processing gains (compared to, say, table-top devices such as Amazon Alexa or teleconferencing systems). In sum, the SINR (Signal to Interference Noise Ratio) of the target speech signal can be very low in real-world scenarios. Although challenging, unique opportunities emerge as well. Modern earphones are equipped with IMUs that sense the vibrations due to human speech (Jabra, 2022). Of course, the IMU's sampling rate is ≈ 400 Hz, hence the recording of the human speech is heavily aliased and distorted by the non-linear human bone-channel (Blue et al., 2013). However, ambient sounds do not induce vibrations in the IMU, implying that the IMU signal remains immune to external interference. The microphone, on the other hand, records a high-quality signal from the user's mouth (44 kHz), but can be heavily polluted by ambient interference.
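The aliasing caused by the IMU's low sampling rate can be illustrated with a minimal NumPy sketch (an assumption for illustration, not part of the system described here): a speech harmonic above the IMU's ≈ 200 Hz Nyquist limit folds back into the band, which is one reason the IMU recording is so distorted.

```python
import numpy as np

fs_mic, fs_imu = 44_000, 400     # microphone and IMU sampling rates (values from the text)
f_tone = 300                     # a 300 Hz speech harmonic, above the IMU's 200 Hz Nyquist limit
t = np.arange(0, 1, 1 / fs_mic)  # one second of signal
tone = np.sin(2 * np.pi * f_tone * t)

# Naive decimation (no anti-alias filter) models the IMU's undersampling
imu = tone[:: fs_mic // fs_imu]  # 44 kHz / 400 Hz = 110x decimation

# The 300 Hz tone folds to |300 - 400| = 100 Hz in the IMU recording
spectrum = np.abs(np.fft.rfft(imu))
freqs = np.fft.rfftfreq(len(imu), 1 / fs_imu)
print(freqs[np.argmax(spectrum)])  # peak appears at the aliased frequency, 100 Hz
```

Note that a real bone-conducted signal also passes through a non-linear channel, so the distortion is worse than pure spectral folding; the sketch only captures the aliasing component.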
Non-stationary interference, such as the voices of other people, is difficult to denoise; even today's best denoisers (Wang & Chen, 2018), which perform remarkably well on stationary noise or on pre-trained distributions, falter against speech and music. Moreover, existing techniques mostly require clean speech for training the models. With multi-modal

