SELF-SUPERVISED SPEECH ENHANCEMENT USING MULTI-MODAL DATA

Abstract

Modern earphones come equipped with microphones and inertial measurement units (IMUs). When a user wears the earphone, the IMU can serve as a second modality for detecting speech signals. Specifically, as humans speak into their earphones (e.g., during phone calls), the throat's vibrations propagate through the skull and ultimately induce a vibration in the IMU. The IMU data is heavily distorted (compared to the microphone's recordings), but IMUs offer a critical advantage: they do not pick up ambient sounds. This presents an opportunity in multi-modal speech enhancement, i.e., can the distorted but interference-free IMU signal enhance the user's speech when the microphone's signal suffers from strong ambient interference, and remove the need for labeled data in model development? We combine the best of both modalities (microphone and IMU) by designing a cooperative, self-supervised network architecture that does not rely on clean speech data from the user. Instead, using only noisy speech recordings, the IMU learns to give hints on where the target speech is likely located. The microphone uses these hints to enrich the speech signal, which in turn trains the IMU to produce better hints. This iterative approach yields promising results, comparable to a supervised denoiser trained on clean speech signals. When clean signals are also available to our architecture, we observe promising SI-SNR improvement. We believe these results can aid speech-related applications in earphones and hearing aids, and potentially generalize to others, such as audio-visual denoising.

1. INTRODUCTION

Speech enhancement/denoising is a long-standing problem in audio analysis. Recent deep learning approaches have broken through earlier performance walls, to the extent that pre-trained voice assistants like Siri, Alexa, and Google Assistant are remarkably successful (Tulsiani et al., 2020). It is not surprising that the bar for speech enhancement keeps rising, with newer form factors and more challenging use cases on the horizon. A growing domain of interest is "earables" (e.g., earphones, hearing aids, and glasses). Even though the user speaks close to the earphone, the problem is particularly challenging because: (1) the background interference can be high in real-world public environments (e.g., restaurants, airports, buses, trains) (Schwartz, 2022); (2) users tend to speak softly, lest they disturb others around them; and (3) the relatively few microphones on earable devices must forgo some of the array-processing gains available to, say, table-top devices such as Amazon Alexa or teleconferencing systems. In sum, the SINR (Signal to Interference plus Noise Ratio) of the target speech can be very low in real-world scenarios. Although challenging, unique opportunities emerge as well. Modern earphones are equipped with IMUs that sense the vibrations due to human speech (Jabra, 2022). Of course, the IMU's sampling rate is ≈ 400 Hz, so its recording of human speech is heavily aliased and further distorted by the non-linear human bone-conduction channel (Blue et al., 2013). However, ambient sounds do not induce vibrations in the IMU, so the IMU signal remains immune to external interference. The microphone, on the other hand, records a high-quality signal from the user's mouth (44 kHz), but can be heavily polluted by ambient interference.
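To make the aliasing concrete, the following sketch (illustrative only; the tone frequency and the filter-free decimation are our assumptions, not the paper's measurement pipeline) shows how a 950 Hz speech harmonic, undersampled at the IMU's 400 Hz rate, folds down to 150 Hz:

```python
import numpy as np

fs_mic, fs_imu = 44_000, 400              # sampling rates quoted in the paper
t_mic = np.arange(0, 1.0, 1 / fs_mic)     # one second of "microphone" time
tone = np.sin(2 * np.pi * 950 * t_mic)    # a 950 Hz component of speech

# Naive decimation without an anti-aliasing filter mimics the IMU's
# undersampled view of the same vibration (110 mic samples per IMU sample).
step = fs_mic // fs_imu
imu = tone[::step]

# Locate the dominant frequency in the aliased IMU recording.
spectrum = np.abs(np.fft.rfft(imu))
freqs = np.fft.rfftfreq(len(imu), 1 / fs_imu)
alias = freqs[np.argmax(spectrum)]
print(round(alias))  # the 950 Hz tone folds to 150 Hz (= 950 - 2 * 400)
```

Any speech content above the 200 Hz Nyquist limit folds back into the IMU band in this way, which is why the Translator cannot simply upsample the IMU signal.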
Non-stationary interference, such as the voices of other people, is difficult to denoise; even today's best denoisers (Wang & Chen, 2018), which perform remarkably well on stationary noise or on pre-trained distributions, falter against speech and music. Moreover, existing techniques mostly require clean speech for training. With multi-modal data from both the microphone and the IMU (see spectrograms in Fig. 1), we see an opportunity to close gaps in speech enhancement and to remove the need to collect clean, labeled speech.

We propose a self-supervised architecture that does not rely on clean speech data to train the network. Instead, we utilize the everyday, noisy recordings that the user's earphone can capture on the fly. The key idea is to develop a cooperation between the IMU and the microphone, so that each modality can teach and learn from the other. To this end, our architecture, called IMU-to-Voice (IMUV), is composed of two separate models: a Translator and a Denoiser. Briefly, the Translator translates the distorted IMU signal to higher-resolution audio, and then constructs a time-frequency mask that crudely identifies the locations of the user's speech. The Denoiser, which sees only noisy speech signals, uses this crude mask to slightly enrich the user's speech. The Denoiser's output, the slightly enriched speech signal, now offers a reference from which the Translator learns a better mask, which is in turn fed back to the Denoiser to further enrich the speech signal. The iterations converge to an SNR-enhanced voice signal at the output of the Denoiser, even in the presence of strong interference. Importantly, no clean speech is needed to bootstrap or train this network; the noisy voice signal can even be at 0 dB SINR. Zooming out to a higher level, the result is intuitive: since the IMU is completely unaffected by strong interference, it can guide the audio Denoiser down the correct path of gradient descent.
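The cooperative loop can be caricatured with a toy alternation (this is our illustrative stand-in with synthetic frame energies and threshold "models", not the paper's neural Translator/Denoiser): an interference-free but distorted IMU stream seeds a crude mask, the masked microphone output then re-derives a better mask, and the denoised signal improves over iterations.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic per-frame energies: the user speaks in frames 50..119,
# ambient interference pollutes every frame of the microphone.
frames = 200
speech = np.zeros(frames)
speech[50:120] = 1.0 + 0.2 * rng.standard_normal(70)       # target speech
noise = 0.4 * np.abs(rng.standard_normal(frames))          # interference
mic = speech + noise                                       # noisy microphone
imu = speech + 0.3 * np.abs(rng.standard_normal(frames))   # distorted, but interference-free

def snr_db(est, ref):
    """SNR of an estimate against the (here known) target, in dB."""
    err = est - ref
    return 10 * np.log10(np.sum(ref**2) / np.sum(err**2))

# "Translator": a crude binary mask from the IMU's energy envelope.
mask = (imu > imu.mean()).astype(float)

for _ in range(3):
    # "Denoiser": keep microphone energy only where the mask allows it.
    denoised = mic * mask
    # Feedback: the enriched output refines the mask (a re-threshold here,
    # standing in for retraining the Translator).
    mask = (denoised > 0.5 * denoised[denoised > 0].mean()).astype(float)

print(round(snr_db(mic, speech), 1), round(snr_db(denoised, speech), 1))
```

The point of the caricature is the information flow, not the thresholds: the mask removes interference outside speech regions without ever seeing clean speech, which is exactly the role the Translator's hints play for the neural Denoiser.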
The only risk stems from the fact that the IMU has no way to validate whether its guidance is correct, and since the IMU signal is heavily distorted, mistakes are easy to make. However, this is where we find that even a noisy voice recording gives the needed validation to the IMU, so the Translator and Denoiser can teach each other and independently descend in the right direction.



Figure 1: Earphone measurements: (a) Audio signal recorded without interference, (b) Audio recording with interference, (c) IMU recording from earphone. (d) Zoomed in view of IMU signal.


We show that our proposed two-step model inherits the structure of expectation maximization (EM) (Dempster et al., 1977), with the likelihood and posterior functions estimated by neural networks. EM is known to be sensitive to its initialization, much as the initial mask from the Translator is crucial for downstream convergence. Although our Translator provides an acceptable initial mask, the questions of the optimal mask (upper bound) and of the minimally adequate mask that still ensures convergence (lower bound) remain open. We leave them to future research.
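One way to make the EM analogy explicit (the notation below is ours, not from the paper): let $X$ denote the noisy microphone recording, $M$ the latent time-frequency mask, and $\theta$ the model parameters. The Translator/Denoiser alternation then reads as:

```latex
% E-step: the Translator plays the role of the posterior over the latent
% mask M, given the observation X and the current parameters.
% M-step: the Denoiser update maximizes the expected complete-data
% log-likelihood under that posterior.
\begin{align}
  \text{E-step:}\quad & Q\big(\theta \mid \theta^{(t)}\big)
      = \mathbb{E}_{M \sim p(M \mid X;\, \theta^{(t)})}
        \big[ \log p(X, M;\, \theta) \big] \\
  \text{M-step:}\quad & \theta^{(t+1)}
      = \arg\max_{\theta}\; Q\big(\theta \mid \theta^{(t)}\big)
\end{align}
```

Under this reading, the sensitivity to the Translator's initial mask is exactly EM's well-known sensitivity to its initialization $\theta^{(0)}$.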


Summary of Results: With help from 4 volunteers, we gathered IMU and microphone data from earphones, and injected interference from a public audio dataset into the microphone stream. The self-supervised IMUV model is trained on this noisy dataset (at varying SINR levels). We evaluate the final denoised signal using two metrics: scale-invariant signal-to-noise ratio (SI-SNR), and word error rate (WER) from an automatic speech recognizer (ASR) (Yu & Deng, 2016). Results show that in terms of WER, self-supervised IMUV is comparable to the supervised audio Denoiser (trained with clean voice data), with less than 1% difference. When we allow IMUV to also train on clean signals, supervised IMUV exceeds self-supervised IMUV by 5%. Finally, when using SI-SNR as the metric, the gains are higher and more consistent for both supervised and self-supervised IMUV. In closing, we find that the IMU offers one of two advantages: we can either choose to improve denoising performance, or relieve the user from collecting clean voice data.
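For reference, SI-SNR projects the estimate onto the target and measures the residual, so any rescaling of the estimate leaves the score unchanged. A minimal sketch of the standard definition (the helper name and the synthetic signals are ours):

```python
import numpy as np

def si_snr(est, ref, eps=1e-8):
    """Scale-invariant SNR in dB between an estimate and a reference."""
    est = est - est.mean()
    ref = ref - ref.mean()
    # Project the estimate onto the reference to isolate the target part.
    s_target = (np.dot(est, ref) / (np.dot(ref, ref) + eps)) * ref
    e_noise = est - s_target
    return 10 * np.log10(np.dot(s_target, s_target)
                         / (np.dot(e_noise, e_noise) + eps))

rng = np.random.default_rng(0)
clean = rng.standard_normal(16_000)                # 1 s of "speech" at 16 kHz
noisy = clean + 0.5 * rng.standard_normal(16_000)  # additive interference
print(round(si_snr(noisy, clean), 1))              # ≈ 6 dB before denoising
```

A denoiser's SI-SNR improvement is then simply `si_snr(denoised, clean) - si_snr(noisy, clean)`; the scale invariance matters because neural denoisers often return outputs with arbitrary gain.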

2. MULTI-MODAL SELF-SUPERVISION

2.1 PROBLEM STATEMENT

We consider two input streams: a high-resolution audio signal H from the microphone, and a low-resolution surface-vibration signal L from the IMU. Since all recordings are from everyday

