ACOUSTIC NEIGHBOR EMBEDDINGS

Abstract

This paper proposes a novel acoustic word embedding called Acoustic Neighbor Embeddings where speech or text of arbitrary length is mapped to a vector space of fixed, reduced dimensions by adapting stochastic neighbor embedding (SNE) to sequential inputs. The Euclidean distance between coordinates in the embedding space reflects the phonetic confusability between their corresponding sequences. Two encoder neural networks are trained: an acoustic encoder that accepts speech signals in the form of frame-wise subword posterior probabilities obtained from an acoustic model, and a text encoder that accepts text in the form of subword transcriptions. Compared to a triplet loss criterion, the proposed method is shown to have more effective gradients for neural network training. Experimentally, it also gives more accurate results with low-dimensional embeddings when the two encoder networks are used in tandem in a word (name) recognition task, and when the text encoder network is used standalone in an approximate phonetic matching task. In particular, in an isolated name recognition task depending solely on Euclidean nearest-neighbor search between the proposed embedding vectors, the recognition accuracy is identical to that of conventional finite state transducer (FST)-based decoding using test data with up to 1 million names in the vocabulary and 40 dimensions in the embeddings.

1. INTRODUCTION

Acoustic word embeddings (Levin et al., 2013; Maas et al., 2012) are vector representations of words that capture information on how the words sound, as opposed to word embeddings that capture information on what the words mean. A number of acoustic word embedding methods have been proposed, applied to word discrimination (He et al., 2017; Jung et al., 2019), lattice rescoring in automatic speech recognition (ASR) (Bengio & Heigold, 2014), and query-by-example keyword search (Settle et al., 2017) or detection (Chen et al., 2015). Previous works have also applied multilingual acoustic word embeddings (Kamper et al., 2020; Hu et al., 2020) to zero-resource languages, and acoustic word embeddings or acoustically-grounded word embeddings to improve acoustic-to-word (A2W) speech recognition (Settle et al., 2019; Shi et al., 2020). Recently, triplet loss functions (He et al., 2017; Settle et al., 2019; Jung et al., 2019) have been used to train two neural networks simultaneously: an acoustic encoder network f(·)¹ that accepts speech, and a text encoder network g(·) that accepts text as the input. By training the two to transform their inputs into a common space where matching speech and text get mapped to the same coordinates, f and g can be used in tandem in applications where a speech utterance is compared against a database of text, or vice versa. They can also each be used as a standalone, general-purpose word embedding network that maps similar-sounding speech (in the case of f) or text (in the case of g) to similar locations in the embedding space. Note that in stricter choices of terminology, one may use the term acoustic word embedding to refer only to the output of g, and speech embedding (Algayres et al., 2020) to refer to the output of f. An obvious application of such embeddings is highly-scalable isolated word (or name) recognition (e.g.
in a music player app where the user can tap the search bar to say a song or album title), where a given speech input is mapped via f to an embedding vector f, which is then compared against a database of embedding vectors {g_1, ..., g_N}, prepared via g, that represents a vocabulary of N words, to classify the speech. Using the Euclidean distance, the classification rule is:

î = argmin_{1≤j≤N} ||f − g_j||,   (1)

where ||·|| is the L2 norm. Vector distances have been used in the past for other matching problems (e.g. Schroff et al. (2015)). For speech recognition, a rule like (1) is interesting because it can be easily parallelized for fast evaluation on modern hardware (e.g. Garcia et al. (2008)), and the vocabulary can also be trivially updated since each entry is a vector. If proven to work well in isolated word recognition, the embedding vectors could be used in continuous speech recognition where speech segments hypothesized to contain named entities can be recognized via nearest-neighbor search, particularly when we want an entity vocabulary that is both large and dynamically updatable. However, none of the aforementioned papers have reported results in isolated word recognition. The main contribution in this paper is a new training method for f and g which adapts stochastic neighbor embedding (SNE) (Hinton & Roweis, 2003) to sequential data. It will be shown by analysis of the gradients of the proposed loss function that it is more effective than the triplet loss for mapping similar-sounding inputs to similar locations in a vector space. It will also be shown that the low-dimensional embeddings produced by the proposed method, called Acoustic Neighbor Embeddings (ANE), work better than embeddings produced by a triplet loss in isolated name recognition and approximate phonetic matching experiments.
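The nearest-neighbor classification rule above can be sketched in a few lines of NumPy. This is an illustrative toy, not the paper's implementation: the vocabulary, the 3-dimensional embeddings, and the `nearest_word` helper are all hypothetical stand-ins for the outputs of trained f and g encoders.

```python
import numpy as np

def nearest_word(f_vec, g_matrix, words):
    """Classify a speech embedding by Euclidean nearest-neighbor search
    against a database of text embeddings, as in rule (1).

    f_vec:    (D,) acoustic embedding f of the input speech
    g_matrix: (N, D) stacked text embeddings g_1 ... g_N
    words:    list of N vocabulary entries corresponding to the rows
    """
    # ||f - g_j||^2 for all j at once; the argmin is the recognized word.
    dists = np.sum((g_matrix - f_vec) ** 2, axis=1)
    i = int(np.argmin(dists))
    return words[i], float(np.sqrt(dists[i]))

# Toy usage with made-up 3-dimensional embeddings.
vocab = ["john", "jon", "joan"]
G = np.array([[0.1, 0.90, 0.00],
              [0.1, 0.80, 0.10],
              [0.7, 0.20, 0.50]])
f = np.array([0.1, 0.88, 0.02])
word, dist = nearest_word(f, G, vocab)
```

Because the distance computation is a single vectorized operation over all N rows, it maps directly onto the parallel hardware evaluation mentioned above, and adding a vocabulary entry is just appending a row to `G`.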
This paper also claims to be the first work that successfully uses a simple L2 distance between vectors to directly perform isolated word recognition that can match the accuracy of conventional FST-based recognition over large vocabularies. One design choice in this work for the acoustic encoder f is that instead of directly reading acoustic features, a separate acoustic model is used to preprocess them into (frame-wise) subword posterior probability estimates. "Posteriorgrams" have been used in past studies (e.g. Hazen et al. (2009)) for speech information retrieval. In the proposed work, they allow the f network to be much smaller, since the task of resolving channel and speaker variability can be delegated to a state-of-the-art ASR acoustic model. One can still use acoustic features with the proposed method, but in many practical scenarios an acoustic model is already available for ASR that computes subword scores for every speech input, so it is feasible to reuse those scores. As inputs to the text encoder g, this study experiments with two types of text: phone sequences and grapheme sequences. In the former case, a grapheme-to-phoneme converter (G2P) is used to convert each text input to one or more phoneme sequences, and each phoneme sequence is treated as a separate "word." This approach reduces ambiguities caused by words that could be pronounced multiple ways, such as "A.R.P.A", which could be pronounced as (in ARPABET phones) [aa r p aa], or [ey aa r p iy ey], or [ey d aa t aa r d aa t p iy d aa t ey] (pronouncing every "." as "dot"). In the latter case using graphemes, more errors can occur in word recognition because a single embedding vector may not capture all the variations in how a word may sound. On the other hand, such a system can be more feasible because it does not require a separate G2P.
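The phone-sequence setup above, where each G2P pronunciation variant becomes its own vocabulary entry, can be sketched as follows. Everything here is a hypothetical stand-in: the `pronunciations` dictionary mimics G2P output, and `toy_encoder` is a deterministic dummy in place of the trained text encoder g, used only so the sketch runs end to end.

```python
import numpy as np

# Hypothetical G2P output: each written form maps to one or more
# phone sequences, and each sequence gets its own embedding entry.
pronunciations = {
    "A.R.P.A": ["aa r p aa", "ey aa r p iy ey"],
    "read":    ["r iy d", "r eh d"],
}

def expand_vocabulary(pronunciations, text_encoder):
    """Embed every pronunciation variant separately, keeping a map
    back to the written word it came from."""
    entries, owners = [], []
    for word, variants in pronunciations.items():
        for phones in variants:
            entries.append(text_encoder(phones))
            owners.append(word)
    return np.stack(entries), owners

# Dummy stand-in for the trained text encoder g(.): a deterministic
# pseudo-random projection of the phone string into 4 dimensions.
def toy_encoder(phones, dim=4):
    rng = np.random.default_rng(sum(ord(c) for c in phones))
    return rng.standard_normal(dim)

G, owners = expand_vocabulary(pronunciations, toy_encoder)
# Recognition: the nearest variant wins, and its owning word is returned.
f = toy_encoder("r iy d")  # pretend the speech matched this variant
best = owners[int(np.argmin(np.sum((G - f) ** 2, axis=1)))]
```

The key point is the `owners` list: distances are computed per pronunciation variant, but the result reported to the user is the written word that variant belongs to, which is how the "A.R.P.A" ambiguity is absorbed.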
Also note that while we use the term "word embedding" following known terminology in the literature, a "word" in this work can actually be a sequence of multiple words, such as "John W Smith" or "The Hilton Hotel" as in the name recognition experiments in Section 5.

2. REVIEW OF STOCHASTIC NEIGHBOR EMBEDDING

In short, stochastic neighbor embedding (SNE) (Hinton & Roweis, 2003) is a method of reducing the dimensionality of vectors while preserving relative distances, and is a popular method for data visualization. Given a set of N coordinates {x_1, ..., x_N}, SNE provides a way to train a function f(·) that maps each coordinate x_i to another coordinate f_i of lower dimensionality such that the relative distances among the x_i's are preserved among the corresponding f_i's. The distance between two points x_i and x_j in the input space is defined as the squared Euclidean distance with some scale factor σ_i:

d_ij^2 = ||x_i − x_j||^2 / (2σ_i^2).   (2)
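Eq. (2) feeds into SNE's neighbor probabilities p_{j|i} = exp(−d_ij^2) / Σ_{k≠i} exp(−d_ik^2), which is the quantity the rest of the method preserves. A minimal sketch of that computation, assuming for simplicity a single shared σ rather than a per-point σ_i:

```python
import numpy as np

def sne_probabilities(X, sigma=1.0):
    """Row-stochastic matrix of SNE neighbor probabilities p_{j|i},
    with d_ij^2 = ||x_i - x_j||^2 / (2 sigma^2) as in Eq. (2).
    A single shared sigma is assumed here for simplicity."""
    # Pairwise squared Euclidean distances via broadcasting.
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    d2 = sq / (2.0 * sigma ** 2)
    P = np.exp(-d2)
    np.fill_diagonal(P, 0.0)           # a point is not its own neighbor
    P /= P.sum(axis=1, keepdims=True)  # normalize each row over k != i
    return P

X = np.array([[0.0, 0.0],
              [0.1, 0.0],
              [5.0, 5.0]])
P = sne_probabilities(X)
# Nearby points dominate each other's rows: P[0, 1] is close to 1,
# while the far-away third point gets negligible probability.
```

In full SNE, σ_i is tuned per point (e.g. to hit a target perplexity); the shared-σ version here is only meant to make the role of Eq. (2) concrete.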

¹ It is interesting to note that the basic notion of "memorizing" audio signals in fixed dimensions can be traced back to as early as Longuet-Higgins (1968).

