ACOUSTIC NEIGHBOR EMBEDDINGS

Abstract

This paper proposes a novel acoustic word embedding called Acoustic Neighbor Embeddings, where speech or text of arbitrary length is mapped to a vector space of fixed, reduced dimension by adapting stochastic neighbor embedding (SNE) to sequential inputs. The Euclidean distance between coordinates in the embedding space reflects the phonetic confusability between their corresponding sequences. Two encoder neural networks are trained: an acoustic encoder that accepts speech signals in the form of frame-wise subword posterior probabilities obtained from an acoustic model, and a text encoder that accepts text in the form of subword transcriptions. Compared to a triplet loss criterion, the proposed method is shown to have more effective gradients for neural network training. Experimentally, it also gives more accurate results with low-dimensional embeddings when the two encoder networks are used in tandem in a word (name) recognition task, and when the text encoder network is used standalone in an approximate phonetic matching task. In particular, in an isolated name recognition task that depends solely on Euclidean nearest-neighbor search between the proposed embedding vectors, the recognition accuracy is identical to that of conventional finite state transducer (FST)-based decoding on test data with up to 1 million names in the vocabulary and 40 dimensions in the embeddings.

1. INTRODUCTION

Acoustic word embeddings (Levin et al., 2013; Maas et al., 2012) are vector representations of words that capture information on how the words sound, as opposed to word embeddings that capture information on what the words mean. A number of acoustic word embedding methods have been proposed and applied to word discrimination (He et al., 2017; Jung et al., 2019), lattice rescoring in automatic speech recognition (ASR) (Bengio & Heigold, 2014), and query-by-example keyword search (Settle et al., 2017) or detection (Chen et al., 2015). Previous works have also applied multilingual acoustic word embeddings (Kamper et al., 2020; Hu et al., 2020) to zero-resource languages, and acoustic word embeddings or acoustically-grounded word embeddings to improve acoustic-to-word (A2W) speech recognition (Settle et al., 2019; Shi et al., 2020). Recently, triplet loss functions (He et al., 2017; Settle et al., 2019; Jung et al., 2019) have been used to train two neural networks simultaneously: an acoustic encoder network f(·) that accepts speech [1], and a text encoder network g(·) that accepts text as input. By training the two to transform their inputs into a common space where matching speech and text get mapped to the same coordinates, f and g can be used in tandem in applications where a speech utterance is compared against a database of text, or vice versa. They can also each be used as a standalone, general-purpose word embedding network that maps similar-sounding speech (in the case of f) or text (in the case of g) to similar locations in the embedding space. Note that in stricter choices of terminology, one may use the term acoustic word embedding to refer only to the output of g, and speech embedding (Algayres et al., 2020) to refer to the output of f. An obvious application of such embeddings is highly-scalable isolated word (or name) recognition (e.g., in a music player app where the user can tap the search bar to say a song or album title), where the spoken query is encoded by f and matched against a database of text encoded by g via nearest-neighbor search.

[1] It is interesting to note that the basic notion of "memorizing" audio signals in fixed dimensions can be traced back to as early as Longuet-Higgins (1968).
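The tandem use of the two encoders described above can be sketched as a plain Euclidean nearest-neighbor search. The sketch below is illustrative only: the vocabulary, the stand-in embeddings, and all variable names are hypothetical (real systems would obtain the vectors from trained f and g networks); only the search step itself reflects the recognition procedure described in the text.

```python
# Minimal sketch of isolated name recognition by nearest-neighbor search
# in a shared embedding space. The encoder outputs are stand-ins here
# (random vectors); in practice they would come from trained networks
# f (speech) and g (text).
import numpy as np

rng = np.random.default_rng(0)
dim = 40  # embedding dimensionality (the abstract reports results at 40)
vocab = ["abba", "abbey road", "back in black"]  # hypothetical name database

# g(text): one fixed-dimensional embedding per vocabulary entry (stand-in).
text_embeddings = rng.normal(size=(len(vocab), dim))

# f(speech): embedding of the spoken query; simulated here as a point
# lying very close to the embedding of "abbey road".
query_embedding = text_embeddings[1] + 0.01 * rng.normal(size=dim)

# Euclidean nearest-neighbor search between f's output and g's outputs.
distances = np.linalg.norm(text_embeddings - query_embedding, axis=1)
recognized = vocab[int(np.argmin(distances))]
print(recognized)  # "abbey road": the query lies closest to that entry
```

Because the search is a single distance computation plus an argmin, it scales to very large vocabularies (the abstract reports up to 1 million names) with standard approximate nearest-neighbor indexing if needed.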

