WAV2TOK: DEEP SEQUENCE TOKENIZER FOR AUDIO RETRIEVAL

Abstract

Search over audio sequences is a fundamental problem. In this paper, we propose a method to extract concise discrete representations for audio that can be used for efficient retrieval. Our motivation comes from orthography, which represents the speech of a given language in a concise and distinct discrete form. The proposed method, wav2tok, learns such representations for any kind of audio, speech or non-speech, from pairs of similar audio. wav2tok compresses the query and target sequences into shorter sequences of tokens that are faster to match. The learning method makes use of the CTC loss and the expectation-maximization algorithm, which are generally used for supervised automatic speech recognition and for learning discrete latent variables, respectively. Experiments show the consistent performance of wav2tok across two audio retrieval tasks: music search (query by humming) and speech search via audio query, outperforming state-of-the-art baselines.

1. INTRODUCTION

Sequence retrieval aims at retrieving sequences similar to a query sequence, under the constraint that an ordered alignment exists between the query and the target sequence. In this paper, we address the following problem: can we extract discrete tokens from any continuous signal for the purpose of retrieving similar signals? This problem has deep connections with tasks such as child language acquisition, music cognition, and learning languages without written forms. Direct applications of the proposed task include speech search, where the order of constituent units, such as phonemes, syllables, or words, remains the same; and music search (query by humming or query by example), where the order of constituent units, such as relative notes or phrases, remains the same. Beyond audio, the problem extends to tasks such as handwritten word search and gesture search.

One can define similarity metrics over sequences using methods based on Dynamic Time Warping (DTW) (Müller, 2007). These methods are inefficient when the sequences are continuous-valued and have high sampling rates. Moreover, they depend on matching handcrafted features, which are ineffective in the face of high variability in query sequences. Problems such as spoken term detection involve detecting a query utterance in a long speech audio; the search space is huge, and DTW-based search of the query takes a long time (Rodriguez-Fuentes et al., 2014).

A more efficient way of retrieving sequences is to map them to sequences of discrete tokens. Automatic speech recognition (ASR) can be employed for this purpose (Mamou et al., 2013). However, ASR training requires knowledge of the basic units of transcription, the popular choices being phonemes and graphemes, which makes this method language dependent. Non-linguistic sounds, such as coughs and sneezes, could be mapped to tokens defined for them, but this approach cannot be used when precise tokens are not defined, e.g., in music search.
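To make the cost argument concrete, the following is a minimal NumPy sketch of the classical DTW recursion, which is quadratic in the sequence lengths; the `dtw_cost` helper and the toy sequences are ours, purely for illustration:

```python
import numpy as np

def dtw_cost(a, b):
    """Classical DTW alignment cost between feature sequences a (T1 x n)
    and b (T2 x n). The O(T1*T2) table fill is what makes frame-level
    matching of long, high-sample-rate sequences slow."""
    T1, T2 = len(a), len(b)
    D = np.full((T1 + 1, T2 + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, T1 + 1):
        for j in range(1, T2 + 1):
            c = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = c + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[T1, T2]

query = np.array([[0.0], [1.0], [2.0]])
target = np.array([[0.0], [0.0], [1.0], [2.0], [2.0]])  # time-stretched copy
print(dtw_cost(query, target))  # 0.0: the warping absorbs the stretch
```

Tokenizing both sequences and matching with edit distance, as proposed in this paper, replaces this frame-level table fill with a much shorter dynamic program over compressed token sequences.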
In query-by-humming music search, audio is mapped to discrete melody-related tokens, such as notes, and these token sequences are matched for search (Unal et al., 2008). However, several music traditions do not have precise transcription systems: one can tell whether two pieces, or motifs, are similar but cannot precisely transcribe them to tokens. The embellishments used in music can be too dynamic to be transcribed precisely. Moreover, a musically untrained user singing a query cannot hit the right notes matching the target song, so the matching may rely on several factors other than notes, such as phonemes of lyrics (Mesaros & Virtanen, 2010), onset times (rhythm) (Kosugi et al., 2000), and note transitions (Ranjan & Arora, 2020). Hence, the tokens to be used may not be derived from notes alone. In this way, each tokenizer (for speech, music, or other signals in general) uses domain-specific hand-made tokens defined by a domain expert.

In this paper, we propose a tokenizer that maps audio sequences to sequences of discrete tokens with the aim of retrieval. The mapping is learned only from pairs of similar audio sequences; the tokens are not defined manually but correspond to distinct semantic units learned from those pairs. The method is general and can be applied to signals other than audio. Here, we apply it to speech and music audio search, for the problems of spoken term detection and query by humming, respectively. The proposed method, named wav2tok, encodes audio via a BiLSTM (Schuster & Paliwal, 1997) network. The encoder-generated representations are then mapped to discrete tokens via a K-means vector quantizer network. Each discrete token corresponds to a discrete representation in the vector quantizer's codebook, which is initialized and updated via offline K-means clustering only.
wav2tok is trained on pairs of similar audio sequences in a self-supervised fashion, without any transcription, using a novel training algorithm. For each pair, we average the encoder-generated representations that the K-means vector quantizer maps to the same token, generating a prototype for that token. We then perform a contrastive learning task to increase the similarity between the prototype for a token and the codebook representation of that same token. Simultaneously, we minimize the edit distance between the token sequences generated from the two sequences in the pair via the Connectionist Temporal Classification (CTC) (Graves et al., 2006) framework, constraining both sequences to map to the same token sequence. We compare wav2tok to state-of-the-art (SOTA) methods for discrete representation learning, such as wav2vec 2.0, and to SOTA ASR models fine-tuned to perform phonetic tokenization. We evaluate the generalization capability of the generated tokens on search experiments, namely query by humming and spoken term detection. wav2tok outperforms the baselines while using far fewer trainable parameters, ensuring faster inference and easier deployment.

2. RELATED WORK

Sequence Labelling. With expert-defined tokens, various methods are popularly used for mapping sequences to tokens. Among conventional methods, Hidden Markov Models (Rabiner & Juang, 1986) and Conditional Random Fields (Lafferty et al., 2001) have been widely used for sequence labeling. These methods require a significant amount of domain knowledge and many assumptions to keep the models tractable, which are avoided by end-to-end models such as Recurrent Neural Networks (RNNs) trained with the Connectionist Temporal Classification framework (Graves et al., 2006). Sequence labeling can be used for sequence retrieval by converting the sequences to tokens, which are easy to search over, but this approach inevitably depends on expert-defined tokens.

Unsupervised Speech Representation Learning. Automatic speech recognition systems are pre-trained on large amounts of untranscribed speech data to generate SOTA continuous representations, which encode the slowly varying phoneme features in raw speech. The representations are then mapped to phoneme tokens via Connectionist Temporal Classification (CTC) (Graves et al., 2006) fine-tuning on a small amount of transcribed audio. Works like Contrastive Predictive Coding (CPC) (van den Oord et al., 2018), Autoregressive Predictive Coding (APC) (Chung & Glass, 2020), and wav2vec (Schneider et al., 2019) generate continuous representations with powerful autoregressive models pre-trained to predict future time-step representations. Later works started discretizing the continuous representations with vq-VAE (van den Oord et al., 2017) to generate discrete representations for speech. Works like vq-wav2vec (Baevski et al., 2019) and vq-APC (Chung et al., 2020) discretize the representations and perform the same prediction tasks as wav2vec (Schneider et al., 2019) and APC (Chung & Glass, 2020), respectively, but over discrete representations.
In vq-wav2vec, the discrete representations are generated with either a K-means vector quantizer or a Gumbel-Softmax based vector quantizer (Baevski et al., 2019). The learned discrete representations are used to pre-train BERT (Devlin et al., 2018) to generate stronger continuous representations, much like BERT pre-training in natural language processing. wav2vec 2.0 (Baevski et al., 2020) uses a Gumbel-Softmax based vector quantizer (Baevski et al., 2019) to generate discrete representations; training involves masking spans of time steps and then predicting the correct discrete representation at each masked time step from the transformer representation at that time step. In these methods, raw audio is discretized in a latent space to model all possible acoustic units rather than only phonetic or sub-phonetic units. The tokens generated by the vector quantizers are not constrained to be interpretable and are initialized in large numbers (∼102.4K codes). After pre-training, a subset of these codes, or tokens, is chosen more often by the vector quantizers and is considered to represent acoustic units. CTC-based fine-tuning with transcriptions groups these discrete acoustic units into K distinct phonemes or linguistic units as present in the transcriptions. Works like HuBERT and wav2vec-Unsupervised learn phonemic units directly. HuBERT (Hsu et al., 2021) pre-trains a transformer network via a BERT-like masked prediction task over noisy targets generated with a clustering model trained offline; the targets may be generated with an ensemble of K-means clusterers with K = {100, 500} clusters on MFCC features or transformer representations. wav2vec-Unsupervised (Baevski et al., 2021) learns phonetic tokens adversarially from phonemized unlabelled text: a discriminator identifies whether the phoneme sequence generated by the model is real or fake based on the phonemized unlabelled text.
All the aforementioned approaches use powerful autoregressive models pre-trained on large amounts of unlabeled audio and fine-tuned on transcribed audio. Our learning approach can learn semantic tokens with small models while training pairwise on a small amount of unlabelled audio data.

Audio Representations for Retrieval. Now Playing (Arcas et al., 2017) and (Chang et al., 2020) use a Neural Network Fingerprinter (NNFP) module that outputs representations efficient for search in query-by-example tasks, where the difference between the query and the actual song is minute in comparison to humming, where only the melody is sung. Now Playing (Arcas et al., 2017) trains representations by optimizing the triplet loss (Schroff et al., 2015), while (Chang et al., 2020) trains representations by simulating Maximum Inner Product Search (MIPS) on mini-batches of representations. For the query-by-humming task, (Mostafa et al., 2016) and (Mostafa & Fung, 2017) use deep learning models such as DNNs and CNNs to generate representations that they map to MIDI numbers or note tokens; such works require note-transcribed data to train the models. For the spoken term detection task, approaches like (Zhang & Glass, 2009), (Rodriguez-Fuentes et al., 2014), (Lee et al., 2015), and (Ram et al., 2018) convert audio to sequences of feature vectors and apply variations of DTW-based template matching to detect the query in long utterances of speech, which is time-consuming.

Cross-Domain Alignment. Given a pair of semantically similar inputs for training, tasks such as visual question answering (text and image) and machine translation (text) involve learning an alignment. The alignment here is not ordered, and the inputs may come from different modalities. Attention models have been used to find alignments between output entities and input regions (Yao et al., 2018). (Chen et al., 2020) use the Gromov-Wasserstein distance between output and input entities to match them. However, there is no notion of tokens there; rather, the salient entities in the input are represented as vectors in a graph.

Graph Matching. Graph Neural Networks (Gori et al., 2005) are used to generate embeddings for graphs, and these embeddings are used to perform graph matching to find the similarity of structured graphs (Li et al., 2019). However, these methods perform the matching jointly on the pair of inputs rather than representing each input independently, which makes them unsuitable for the search problem at hand due to their large run-time complexity. The distance metrics used for graph matching are based on edit distance (Li et al., 2019) and Wasserstein distance (Chen et al., 2020).

3. PROBLEM STATEMENT

We aim to map $X$, a sequence of vectors, to $\mathcal{T}$, a sequence of discrete tokens from a finite alphabet $\mathcal{A}$, such that the similarity of sequences is preserved in the sense of edit distance. The length of the token sequence may be less than or equal to that of $X$. In other words, given a pair of similar sequences $(X_i, X_j)$ and a sequence $X_k$ that is similar to neither, we want to map them to token sequences such that $ED(\bar{\mathcal{T}}_i, \bar{\mathcal{T}}_j) < \min\{ED(\bar{\mathcal{T}}_i, \bar{\mathcal{T}}_k), ED(\bar{\mathcal{T}}_j, \bar{\mathcal{T}}_k)\}$, where $ED(\cdot, \cdot)$ is the edit distance between two sequences.

wav2tok comprises an encoder $f: X \to Z$ that takes as input a temporal sequence of audio features $X = [x_t \in \mathbb{R}^n;\, t \in [T]]$ of length $T$, where $x_t$ is the feature vector at time step $t$, and outputs a sequence of L2-normalised representations $Z = [z_t = f(x_t) \in \mathbb{R}^m;\, t \in [T]]$. The encoder is implemented as a 2-layer BiLSTM followed by an L2-normalization layer; BiLSTMs summarise information in both directions and encode the surrounding context. A K-means vector quantizer network $g: Z \to \mathcal{T}$ then labels the sequence $Z$ at each time step with tokens from a finite $K$-element alphabet $\mathcal{A} = [K]$ and generates the token sequence $\mathcal{T} = [\tau_t = g(z_t) \in \mathcal{A};\, t \in [T]]$. Network $g$ vector-quantizes input $z_t$ with a codebook $E = \{e_k \in Z;\, k \in [K]\}$ of $|\mathcal{A}| = K$ discrete representations, which are cluster centroids in the representation space $Z$, and outputs the token $\tau_t = \arg\max_k z_t \cdot e_k$. Since both vectors are L2-normalized, the dot product gives a cosine similarity score; as a result, the $e_k \in E$ closest to $z_t$ is chosen as its discrete representation and the index $k$ as its token $\tau_t$. The $K$ discrete representations in network $g$ are trainable parameters. A compressor $C$ compresses the token sequence $\mathcal{T}$ to a sequence $\bar{\mathcal{T}}$ of length $\bar{T} \le T$ by deleting all consecutive repetitions of tokens. $C$ also generates the corresponding compressed sequence $\bar{Z}$ of length $\bar{T}$ by averaging the representations $z_t \in Z$ over each run of consecutive identical tokens and L2-normalising the averaged representation. Figure 1a presents an illustration of our model architecture.
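The quantize-then-compress pipeline described above can be sketched in a few lines of NumPy; the toy codebook `E`, the inputs, and the helper names are ours, and a real encoder would supply the representations `Z`:

```python
import numpy as np

def tokenize(Z, E):
    """Quantizer g: each L2-normalised representation z_t receives the
    token of its most cosine-similar codebook vector e_k."""
    return np.argmax(Z @ E.T, axis=1)

def compress(tokens, Z):
    """Compressor C: delete consecutive token repetitions; for each run,
    average the corresponding representations and L2-normalise the mean."""
    out_toks, out_reps, start = [], [], 0
    for i in range(1, len(tokens) + 1):
        if i == len(tokens) or tokens[i] != tokens[start]:
            out_toks.append(int(tokens[start]))
            m = Z[start:i].mean(axis=0)
            out_reps.append(m / np.linalg.norm(m))
            start = i
    return out_toks, np.array(out_reps)

E = np.eye(3)  # toy codebook with K = 3 tokens in a 3-d space
Z = np.array([[1, 0, 0], [0.9, 0.1, 0], [0, 1, 0], [0, 0, 1]], float)
toks = tokenize(Z, E)        # [0, 0, 1, 2]: first two frames share a token
print(compress(toks, Z)[0])  # [0, 1, 2]
```

The compressed token sequence is what the retrieval stage matches with edit distance, while the averaged, re-normalised representations form the corresponding compressed continuous sequence.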

5. TRAINING

wav2tok is trained on pairs of sequences of audio features $(X, X')$, where the raw audio corresponding to $X'$ is an augmented replica of that corresponding to $X$. We apply pitch shift, time stretch, or both to the raw audio to generate its augmented replica. $X$ and $X'$ may also differ in source, i.e., a different person may sing the recording corresponding to $X'$. The discrete representations in the codebook $E$ of quantizer $g$ are initialized as $K$ centroids obtained via offline K-means clustering over the freshly initialized encoder-generated representations. Given $(X, X')$, encoder $f$ generates the sequence of representations $Z$ from input $X$ and $Z'$ from $X'$. Quantizer $g$ generates a token sequence $\mathcal{T}$ from $Z$ and $\mathcal{T}'$ from $Z'$ via cosine-similarity comparison with the codebook vectors $e \in E$. Compressor $C$ compresses $\mathcal{T}$ to $\bar{\mathcal{T}}$ and $\mathcal{T}'$ to $\bar{\mathcal{T}}'$. We average all encoder-generated representations in the pair $(Z, Z')$ that map to the same token, say $\tau$, to generate a prototype for $\tau$. We then perform a contrastive task in which we compare the prototype with each of the $K$ discrete representations in codebook $E$ and increase its similarity with the discrete representation corresponding to $\tau$. We also increase the likelihood that wav2tok maps the pair $(X, X')$ to the same token sequence via the CTC framework, minimizing $ED(\bar{\mathcal{T}}, \bar{\mathcal{T}}')$. Our loss function is defined as
$$\mathcal{L} = \mathcal{L}_m(X, X') + \alpha\, \mathcal{L}_{ctc}(X, \bar{\mathcal{T}}') + \beta\, \mathcal{L}_{ctc}(X', \bar{\mathcal{T}}),$$
where $\mathcal{L}_m$ is the loss for the contrastive task, $\mathcal{L}_{ctc}$ is the loss maximising the aforementioned likelihood, and $\alpha, \beta$ are positive constants. We optimize this loss function in a manner similar to the Expectation-Maximization algorithm: the clustering acts as the E-step, updating the discrete representations in the codebook of quantizer $g$, while gradient descent over $\mathcal{L}$ acts as the M-step. Contrastive Loss.
Let the set of unique tokens occurring in the pair $(\bar{\mathcal{T}}, \bar{\mathcal{T}}')$ be $U \subset [K]$, $|U| = K' \le K$. We generate a list of token prototypes $P = \{p_\tau;\, \tau \in U\}$, where $p_\tau$ is the L2-normalised mean of the representations in $\{z \in \{Z; Z'\} : g(z) = \tau\}$. Figure 1b presents an illustration of how the list of token prototypes $P$ is generated. Given $p_\tau \in P$, we perform a contrastive task to increase its similarity with the discrete representation $e_\tau \in E$. To compare $p_\tau$ with the codebook, metrics such as cosine similarity and Euclidean distance could be used; however, we find that the following parameterized score gives better performance:
$$s_{\tau,k} = \sigma\big(W (p_\tau - sg(e_k))\big) \in [0, 1],$$
where $sg(x) \equiv x$, $\frac{d}{dx} sg(x) \equiv 0$ is the stop-gradient operator, $\sigma(\cdot)$ is the sigmoid function generating a score in the range $[0, 1]$, and $W \in \mathbb{R}^{1 \times d}$ is a parameter matrix. $s_{\tau,k}$ acts as a parameterized similarity score between $p_\tau$ and the discrete representation $e_k \in E$. We define our contrastive loss $\mathcal{L}_m$ as
$$\mathcal{L}_m(X, X') = -\sum_{\tau \in U} \log \frac{\exp(s_{\tau,\tau})}{\sum_{k=1}^{K} \exp(s_{\tau,k})}.$$
Likelihood Loss. We maximize the likelihood that sequence $X$ maps to the token sequence $\bar{\mathcal{T}}'$, which corresponds to $X'$, via the CTC framework (see Figure 1c); this constrains $X$ and $X'$ to generate the same token sequence. We calculate the probability of $x_t$ mapping to token $\tau_t = k$ as
$$l_{t,k} = \frac{\exp(f(x_t) \cdot sg(e_k))}{\sum_{i=1}^{K} \exp(f(x_t) \cdot sg(e_i))}.$$
The likelihood $P(\bar{\mathcal{T}}'|X)$ is then calculated as the sum of the probabilities of all $T$-length paths $\pi$ over tokens $\tau \in \mathcal{A}$ such that $C(\pi) = \bar{\mathcal{T}}'$. The loss is defined as
$$\mathcal{L}_{ctc}(X, \bar{\mathcal{T}}') = -\log \sum_{\pi \in C^{-1}(\bar{\mathcal{T}}')} P(\pi|X),$$
where the path probabilities are calculated over the token probability scores in the sequence $l = \{l_t \in \mathbb{R}^K;\, t \in [T]\}$ via the CTC forward-backward framework (Graves et al., 2006) without the use of blanks. We present the CTC forward and backward variables for our use case in Appendix B. Clustering.
We perform offline K-means clustering on a subset of encoder representations at initialization and at regular intervals during training to set the discrete representations in codebook $E$ of network $g$. Initializing the clusters in this way prevents wav2tok from converging to a local optimum during the matching task, as we found to happen with random initialization of the centroids. The intermittent clustering during training iteratively refines the discrete representations and prevents codebook collapse. We use the sklearn library to perform K-means clustering. We train wav2tok using the Adam (Kingma & Ba, 2017) optimizer and a linear learning-rate schedule with a learning rate of 0.001 and 8% of the training steps as warm-up steps.
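The two pieces of this alternating scheme, offline clustering (E-step) and the contrastive scoring inside the loss (M-step objective), can be sketched as follows. This is a forward-only illustration under our own simplifying assumptions: the paper uses sklearn's KMeans, while this version runs a few Lloyd iterations in plain NumPy; the stop-gradient only affects backpropagation and is dropped; the toy codebook and helper names are ours:

```python
import numpy as np

def kmeans_codebook(Z, K, iters=20, seed=0):
    """E-step sketch: K-means (Lloyd's algorithm) over a subset of
    encoder representations Z (N x d). Centroids are L2-normalised at
    the end so the cosine-similarity quantizer can use them as E."""
    rng = np.random.default_rng(seed)
    E = Z[rng.choice(len(Z), K, replace=False)].copy()
    for _ in range(iters):
        d2 = ((Z[:, None, :] - E[None, :, :]) ** 2).sum(-1)  # (N, K)
        assign = d2.argmin(axis=1)
        for k in range(K):
            if (assign == k).any():
                E[k] = Z[assign == k].mean(axis=0)
    return E / np.linalg.norm(E, axis=1, keepdims=True)

def contrastive_loss(prototypes, taus, E, W):
    """Forward value of the contrastive loss: each prototype p_tau is
    scored against every codebook vector with sigmoid(W (p_tau - e_k)),
    followed by softmax cross-entropy toward the matching entry e_tau.
    W is the 1 x d parameter matrix, passed here flattened."""
    loss = 0.0
    for p, tau in zip(prototypes, taus):
        s = 1.0 / (1.0 + np.exp(-((p[None, :] - E) @ W)))  # (K,) scores
        loss += -np.log(np.exp(s[tau]) / np.exp(s).sum())
    return loss

E = np.eye(3)  # toy codebook with K = d = 3
loss = contrastive_loss(E, [0, 1, 2], E, np.ones(3))
print(round(float(loss), 3))  # 3.296 (= 3 log 3: all scores tie at 0.5)
```

With an untrained `W`, the scores tie and the loss sits at its uniform-softmax value; training shapes `W` so that each prototype scores highest against its own codebook entry.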

6. EXPERIMENTS

We test the performance of tokens and encoder-generated continuous representations of wav2tok in audio retrieval. We perform Query by Humming (QbH) and Spoken Term Detection experiments to evaluate the performance of wav2tok in comparison to the baselines.

6.1. MUSIC MELODY SEARCH: QUERY BY HUMMING

Task. Given a test query audio, we are to find the audio with the most similar melody in the search database. Experiment Details. We use the MIR-QbSH dataset, which comprises 4431 humming recordings of 30 s duration corresponding to 48 songs. Each song is sung by several individuals, all singing the same part of the song. The recordings vary in recording environment, tonal quality, voice, pitch, and time stretch. We train our models on hums of 40 songs in the MIR-QbSH dataset and evaluate search performance on hums of the remaining 8 songs. The training dataset has 1970 hums for training and 676 for validation. The test dataset has 225 hums as the search database and 659 query hums. We evaluate the performance of our models in identifying which song a given query corresponds to by comparing it with all sequences in the search database. Each model converts all the audio in the test dataset to sequences of tokens or representations. Each query sequence is compared to all sequences in the search database via Edit Distance (ED) for tokens or DTW for representations, and the song id of the most similar sequence in the search database is selected as the query's song id. We calculate the Mean Reciprocal Rank (MRR) score against the ground-truth song ids of the queries for evaluation: the Reciprocal Rank (RR) score is 1/r if the r-th most similar sequence in the search database is the first with the same song id as the query. All audio recordings are converted to Short-Time Fourier Transform (STFT) matrices before being passed as inputs to our models. The STFT matrices are computed with 513 frequency bins, a window length of 1024 samples (summarising 128 ms of audio), and a hop length of 512 samples.
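The MRR evaluation just described amounts to the following computation; the tiny distance matrix below is fabricated purely for illustration:

```python
import numpy as np

def mean_reciprocal_rank(query_ids, db_ids, dist):
    """MRR over a distance matrix dist[i, j] between query i and search-
    database item j (edit distance for tokens, DTW cost for continuous
    representations). RR = 1/r, where r is the rank of the first
    database item whose song id matches the query's."""
    rrs = []
    for i, qid in enumerate(query_ids):
        order = np.argsort(dist[i])
        rank = next(r for r, j in enumerate(order, start=1)
                    if db_ids[j] == qid)
        rrs.append(1.0 / rank)
    return float(np.mean(rrs))

dist = np.array([[0.1, 0.2, 0.3],   # both queries rank a wrong item first,
                 [0.5, 0.1, 0.2]])  # the correct item second (RR = 1/2)
print(mean_reciprocal_rank([0, 1], [1, 0, 1], dist))  # 0.5
```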

6.2. SPOKEN TERM DETECTION

Task. Given a test query audio, we are to detect its occurrence in a long utterance. Experiment Details. We use the TIMIT dataset, which comprises 6300 utterances of English speech with time-aligned word transcriptions. We choose the 59 most frequently occurring words with more than 2 characters as keywords and treat all other words as non-keywords. We train on utterances of random sentences formed with 6 words sampled from a subset of 25 keywords, and evaluate STD on the detection of the remaining 34 keywords. The test dataset is composed of 337 utterances corresponding to the 34 queries and 100 long utterances per query, half containing a single occurrence of the query amongst non-keywords and the other half containing only non-keywords. Given a query and a long utterance, we convert both to sequences of tokens using each audio tokenizer and perform approximate string matching (Hall & Dowling, 1980) to detect the query in the utterance. The STFT matrix inputs to the models are computed with 185 frequency bins, a window length of 368 samples (summarising 23 ms of audio), and a hop length of 92 samples.
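For reference, this STFT front end can be sketched with NumPy. A 368-sample window summarising 23 ms implies a 16 kHz sample rate, and the rfft of a 368-sample frame yields 368/2 + 1 = 185 frequency bins; the Hann window, test tone, and helper name are our own choices:

```python
import numpy as np

def stft_features(audio, win=368, hop=92):
    """Magnitude STFT with the settings above: windowed frames of `win`
    samples taken every `hop` samples; rfft gives win/2 + 1 bins."""
    w = np.hanning(win)
    frames = [audio[s:s + win] * w
              for s in range(0, len(audio) - win + 1, hop)]
    return np.abs(np.fft.rfft(np.array(frames), axis=1))

# 1 s of a 440 Hz tone at an assumed 16 kHz sample rate
x = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
F = stft_features(x)
print(F.shape[1])  # 185 frequency bins per frame
```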

6.3. BASELINES

Triplet. We train an encoder $f: X \to Z$ to generate L2-normalized continuous representations for retrieval. Encoder $f$ is trained by optimizing the triplet loss (Schroff et al., 2015), as done when training an NNFP in Now Playing (Arcas et al., 2017). Given a pair of similar sequences $(X, X')$, encoder $f$ generates sequences $Z$ and $Z'$. We form a mini-batch of $N$ triplets $\{z, z^+, z^-\}$, where representation $z$ is sampled from sequence $Z$, and $z^+$ and $z^-$ are positive and negative samples for $z$, respectively, sampled from sequence $Z'$. The loss is defined as
$$\mathcal{L}_{Triplet} = \sum_{i=1}^{N} \max\{\|z_i - z_i^+\| - \|z_i - z_i^-\| + m,\, 0\},$$
where $m$ is a similarity margin.

MIPS. We train an encoder $f: X \to Z$ to generate L2-normalized continuous representations for retrieval. Encoder $f$ is trained by simulating MIPS (Mussmann & Ermon, 2016) on mini-batches of representations, as proposed by (Chang et al., 2020). Given a pair of similar sequences $(X, X')$, encoder $f$ generates sequences $Z$ and $Z'$. We form a mini-batch of $N$ pairs $\{z, z^+\}$, where the encoder-generated representation $z$ is sampled from sequence $Z$ and $z^+$ is a positive for $z$ sampled from $Z'$. The loss is defined as
$$\mathcal{L}_{MIPS} = -\sum_{i=1}^{N} \log \frac{\exp(z_i \cdot z_i^+)}{\sum_{j \neq i}\big(\exp(z_i \cdot z_j^+) + \exp(z_i \cdot z_j)\big)}.$$

wav2vec2. We train our audio tokenizer via the wav2vec 2.0 (Baevski et al., 2020) learning framework. Quantizer $g$ in our audio tokenizer is chosen to be a Gumbel-Softmax based vector quantizer (see Appendix C for details), as used in (Baevski et al., 2020), but with a single codebook with $K$ members. Given sequence $X$, encoder $f$ outputs a sequence of L2-normalised representations $Z$ of length $T$. Quantizer $g$ outputs a sequence of discrete representations $Q = \{q_t = g(z_t \in Z);\, t = 1, \dots, T\}$. We mask spans of 10 time steps with random starting indices in sequence $Z$ and then pass the new sequence to a transformer network $h: Z \to O$, which generates a sequence of contextualized representations $O = \{o_t = h(z_t \in Z);\, t = 1, \dots, T\}$. For the transformer output $o_t$ over a masked time step $t$, we identify the true discrete representation $q_t$ from a set $D_t$ composed of $q_t$ and $D$ distractors, which are discrete representations sampled from other time steps. The loss is defined as
$$\mathcal{L}_w(o_t, D_t) = -\log \frac{\exp(sim(o_t, q_t))}{\sum_{\tilde{q} \in D_t} \exp(sim(o_t, \tilde{q}))} + \mathcal{L}_d,$$
where $sim(a, b) = \frac{a^\top b}{\|a\|\,\|b\|}$ is the cosine similarity and $\mathcal{L}_d$ is a codebook diversity loss.

wav2vec2P. We train the wav2vec2 audio tokenizer with our variation of the wav2vec 2.0 (Baevski et al., 2020) learning framework, which learns discrete representations from pairs of similar sequences. Given a pair $(X, X')$, encoder $f$ outputs sequences $Z$ of length $T$ and $Z'$ of length $T'$, respectively. Assuming $T \le T'$, we generate a sequence $Z^+$ of length $T$ whose $t$-th element $z_t^+$ is a positive for $z_t \in Z$ sampled from sequence $Z'$. The Gumbel-Softmax based vector quantizer $g$ quantizes each representation in sequence $Z^+$ to generate sequence $Q^+$. We mask sequences $Z$ and $Z^+$ at the same time steps. Transformer $h$ takes the masked sequences as input and generates sequences $O$ and $O^+$. For a masked time step $t$, we use the transformer output $o_t$ to identify $q_t^+ \in Q^+$ from a set $D_t^+$ with distractors sampled from sequence $Q^+$, and the transformer output $o_t^+$ to identify $q_t \in Q$ from a set $D_t$ with distractors sampled from sequence $Q$. The loss is defined as
$$\mathcal{L}_{wP} = \mathcal{L}_w(o_t, D_t^+) + \mathcal{L}_w(o_t^+, D_t).$$

wav2vec2-O. The original wav2vec 2.0 base model with 12 transformer blocks and 95M parameters, as proposed by (Baevski et al., 2020). It is pre-trained on 960 hours of LibriSpeech data and fine-tuned on the TIMIT dataset, and uses $K = 32$ tokens for tokenization.

wav2vec2-Multi. A wav2vec 2.0 large model with 24 transformer blocks and 317M parameters, pre-trained on 53 languages as proposed by (Conneau et al., 2020). It is fine-tuned on Common Voice to detect all possible phonemes in the training languages with $K = 392$ tokens.

Triplet and MIPS use a 2-layer BiLSTM as encoder with 3.6M parameters.
We use the LAMB optimizer (You et al., 2020) and a cosine annealing learning-rate schedule (Loshchilov & Hutter, 2017) with a learning-rate restart of 0.0001 to train them. wav2vec2 and wav2vec2P use a 2-layer BiLSTM encoder with 3.6M parameters to generate latent representations and 3 transformer blocks with 8.5M parameters. Both are trained using the Adam (Kingma & Ba, 2017) optimizer and a linear learning-rate schedule with a learning rate of 0.001 and 8% of the training steps as warm-up steps. The proposed wav2tok uses only a 2-layer BiLSTM encoder with 3.6M parameters.
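For concreteness, the MIPS-simulation loss used by the MIPS baseline above can be sketched over a mini-batch of representation pairs (forward value only; `mips_loss` and the toy anchors are our own):

```python
import numpy as np

def mips_loss(Z, Zp):
    """Batch MIPS-simulation loss: each anchor z_i must score its
    positive z_i^+ above both the other positives z_j^+ and the other
    anchors z_j in the mini-batch (rows of Z and Zp are paired)."""
    N = len(Z)
    S_pos = Z @ Zp.T  # S_pos[i, j] = z_i . z_j^+
    S_anc = Z @ Z.T   # S_anc[i, j] = z_i . z_j
    loss = 0.0
    for i in range(N):
        den = sum(np.exp(S_pos[i, j]) + np.exp(S_anc[i, j])
                  for j in range(N) if j != i)
        loss += -np.log(np.exp(S_pos[i, i]) / den)
    return loss / N

Z = np.eye(2)  # two L2-normalised anchors
loss_good = mips_loss(Z, Z)              # positives aligned with anchors
loss_bad = mips_loss(Z, Z[::-1].copy())  # positives swapped
print(loss_good < loss_bad)  # True
```

As expected, swapping the positives raises the loss, since each anchor then scores a mismatched positive in the numerator.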

7.1. MUSIC MELODY SEARCH: QUERY BY HUMMING

We present search performances for 3 query settings, namely Vanilla Query (V, no augmentation), Time-Stretched Query (TS), and Pitch-Shifted Query (PS). Time stretch and pitch shift are the most common variations encountered in query-by-humming data. No augmentations were applied to the audio in the search database. Evaluations are performed on sequences corresponding to songs not seen during training, so the results reflect the generalizability of the tokens or representations generated by the models. We set the number of tokens to K = 25 for wav2tok, wav2vec2, and wav2vec2P (see Appendix A.2 for experiments supporting our choices). Quality of Tokenization. Table 1 presents the performance of the token sequences generated by the audio tokenizers on ED-based similarity search. Tokens generated by wav2tok show good generalization in terms of MRR and outperform all the baselines. wav2tok generates time- and pitch-invariant tokens, as we see no drop in performance when either augmentation is applied to the query. wav2vec2-O is trained on English speech only; the tokens it generates do not contain much melodic information but are robust to augmentations. The multilingual training of wav2vec2-Multi infuses both melodic and phonetic information into its 392 tokens, giving good performance. wav2tok outperforms both wav2vec2-O and wav2vec2-Multi thanks to its pairwise training, which infuses more melodic information into the tokens while training on a small amount of unlabelled data. The Gumbel-Softmax based quantizer in wav2vec2 and wav2vec2P is not ideal for infusing melodic information into tokens, but it does infuse phonetic information, as will be seen in Section 7.2. We compare the tokens with the representations learned by MIPS and Triplet, evaluated on DTW-based similarity search; the continuous representations show sub-par generalization to unseen songs.
We compare wav2tok with the SOTA melody extraction algorithm proposed in (Salamon & Gómez, 2012), which converts hums to MIDI sequences. wav2tok generates token sequences far shorter than the corresponding MIDI sequences and outperforms the MIDI tokens in search performance, search time, and robustness; it also outperforms the algorithm in inference time. We further compare wav2tok with the SOTA QbH system proposed in (Mostafa & Fung, 2017). In our implementation, we map audio to MIDI sequences using the aforementioned SOTA melody extraction algorithm instead of a CNN. Given the MIDI sequence 53, 53, 58, 50 with durations 0 s, 0.5 s, 1 s, 2 s, a Relative Note sequence is generated as (0, 0), (0, 0.5), (5, 1), (-8, 2), over which DTW is performed for retrieval. wav2tok tokens outperform the SOTA QbH system in both performance and robustness; the performance of the latter drops drastically under time stretch. We present the performances of the uncompressed sequences $\mathcal{T}$ and $Z$ and the compressed sequence $\bar{Z}$ generated by the audio tokenizers in Appendix A.1. We observe a drop in performance for all audio tokenizers when sequence compression is applied to $\mathcal{T}$ and $Z$. wav2tok outperforms all the baselines and generates superior-quality continuous representations and discrete tokens. Search Time. Table 1 presents the search time taken for similarity search over the tokens or representations generated by the models. The search time per query is two orders of magnitude lower for ED-based search over the compressed token sequence $\bar{\mathcal{T}}$ than for standard DTW-based search over the continuous representations $Z$. The pre-trained models fine-tuned on transcribed audio give the best tokens in terms of compression and search time; wav2tok gives comparable tokens but outperforms the pre-trained models in inference time. Ablations. We train wav2tok without pairs of similar sequences (wav2tok+NoSim). There is a significant drop in token robustness and performance, but the representations suffer only a small drop (see Appendix A.3).
Hence, although the representation space may be well clustered, wav2tok adds more semantics to the tokens than wav2tok+NoSim because it is trained with pairs of similar sequences. We train wav2tok with cosine similarity scores instead of the parameterized score (wav2tok+Cos); the drop in performance validates the enhancement brought about by the parameterized score. We also train wav2tok with the CTC loss only (wav2tok+CTC). The CTC loss considers all possible paths that compress to the target label sequence; as a result, the learnt tokens carry little semantic structure. Using both losses gives the best tokens. Some Variations. In wav2tok+NewInit, we associate the discrete representations with K centroids in the input space X. Such an association does not initialize our tokenizer with centroids that cluster the space Z well, resulting in a significant drop in performance and robustness, as shown in Table 2. We train wav2tok on the MIR-1K dataset (wav2tok+MIR1K), which is composed of polyphonic music recordings of 1000 distinct songs. The tokens generalize well to the monophonic hums in the MIR-QbSH dataset, giving performance comparable to MIDI tokens. This validates that wav2tok tokens do learn melodic information and are robust to the variations incurred in hums. We further compare wav2tok with log-mel features and with token sequences (without compression) obtained via quantization of log-mel features; wav2tok tokens outperform both.
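The Relative Note conversion used by the QbH baseline above can be reproduced in one line; `relative_notes` is our illustrative name for the transform:

```python
def relative_notes(midi, durations):
    """Relative Note sequence for the QbH baseline: each note becomes
    (interval from the previous note, its duration value), with the
    first note assigned interval 0."""
    return [(0 if i == 0 else midi[i] - midi[i - 1], durations[i])
            for i in range(len(midi))]

print(relative_notes([53, 53, 58, 50], [0, 0.5, 1, 2]))
# [(0, 0), (0, 0.5), (5, 1), (-8, 2)]
```

The interval representation makes the sequence invariant to the key the user hums in, which is why the baseline matches relative rather than absolute notes.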

7.2. SPOKEN TERM DETECTION

Quality of Tokenization. Table 3 presents the quality of tokenization of the query keywords by the models evaluated in the Spoken Term Detection experiments. We present the performances of wav2vec2, wav2vec2P, wav2vec2-O, wav2vec2-Multi, and the proposed wav2tok. We conduct search experiments on a test dataset composed of a search database of 337 utterances of the 34 keywords used as queries in the STD experiments and 1289 query utterances. We identify the keyword to which each query corresponds by comparing it to all 337 utterances in the search database via an ED-based similarity score; the word id of the most similar utterance is selected as the word to which the query corresponds. We set K = 40, roughly the number of phonemes in English. wav2tok gives the best performance in terms of MRR score. It outperforms large models like wav2vec2-O and wav2vec2-Multi, which are fine-tuned for phonetic tokenization of speech audio, while using a small number of parameters. wav2vec2 and wav2vec2P also outperform wav2vec2-Multi and wav2vec2-O while using a smaller number of parameters. wav2vec2-O and wav2vec2-Multi use a blank token to handle consecutive occurrences of the same token and to label background noise. The utterances of each keyword in the test dataset are very short in duration; this causes wav2vec2-O to mistake word utterances for background noise, generate sequences of blank tokens, and perform poorly in search. wav2vec2-Multi, which uses a larger number of phonetic tokens, does not suffer from this issue. wav2tok, wav2vec2, and wav2vec2P have no such blank token, which brings a drop in search performance with sequence compression. We further present the performance of wav2tok trained on the much larger LibriSpeech 100-hours dataset (wav2tok+Libri). It outperforms wav2vec2-O and gives comparable performance to wav2vec2-Multi. Spoken Term Detection.
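The retrieval-based evaluation described above (rank the database by an ED-based similarity, then score by MRR) can be sketched as follows, with toy data standing in for the 337-utterance database:

```python
# Rank database token sequences by edit distance to each query and compute the
# Mean Reciprocal Rank of the first entry with the correct word id.
# Toy data only; the real evaluation uses 337 database and 1289 query utterances.

def edit_distance(a, b):
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[m][n]

def mrr(queries, database):
    """queries, database: lists of (token_sequence, word_id) pairs."""
    total = 0.0
    for q_toks, q_id in queries:
        ranked = sorted(database, key=lambda e: edit_distance(q_toks, e[0]))
        rank = next(r for r, (_, w_id) in enumerate(ranked, 1) if w_id == q_id)
        total += 1.0 / rank
    return total / len(queries)
```

A higher MRR means the correct keyword's utterances sit nearer the top of the edit-distance ranking.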
We convert the query word utterance and the long utterance into sequences of tokens with each of our models and detect the occurrence of the query via approximate string matching, using the fuzzysearch library, which automatically chooses the fastest matching algorithm. Table 4 presents the performance of wav2vec2, wav2vec2P, wav2vec2-O, wav2vec2-Multi, and the proposed wav2tok in STD. All the models give comparable F1-scores, with wav2tok performing slightly better. We also implement the STD system proposed in (Anguera & Ferrarons, 2013), which performs highly competitive STD via subsequence DTW (S-DTW) over Gaussian posterior features. In our implementation, we extract the posterior features with SOTA ASR models, namely wav2vec2-O and wav2vec2-Multi. The results are presented in the DTW column of Table 4; note that the results for the other models in the same column are for STD via S-DTW over representations. We observe that STD over tokens gives a better F1-score.
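The core of the approximate-matching step can be shown with a self-contained stand-in (the experiments use the fuzzysearch library; this sketch implements a simple Sellers-style dynamic program instead, not the library's internals): the query token string may start anywhere in the long utterance's token string, and every end position within `max_dist` edits is reported.

```python
# Sellers-style approximate substring search: edit-distance DP where a match
# may begin at any position in the text (first DP row is pinned to 0).
# Returns (end_index, distance) pairs for every end position within max_dist.

def approx_find(query, text, max_dist):
    prev = list(range(len(query) + 1))  # column for the empty text prefix
    hits = []
    for j, t in enumerate(text):
        cur = [0]  # free start: matching may begin at any text position
        for i, q in enumerate(query, 1):
            cost = 0 if q == t else 1
            cur.append(min(prev[i] + 1,        # deletion from query
                           cur[i - 1] + 1,     # insertion into query
                           prev[i - 1] + cost  # match / substitution
                           ))
        if cur[-1] <= max_dist:
            hits.append((j, cur[-1]))
        prev = cur
    return hits
```

Running time is O(|query| x |text|) per utterance, which is why it matters that tokenization shrinks both strings before matching.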

8. CONCLUSION AND FUTURE WORK

In this paper, we present an audio sequence tokenizer, wav2tok, that generates semantically meaningful ordered representations (or tokens) that can be used for efficient retrieval by query sequences. The model learns only from pairs of semantically similar sequences and outperforms state-of-the-art approaches for spoken term detection and query by humming. One may apply more efficient search algorithms, such as locality-sensitive hashing and longest common subsequence search, to the generated tokens to further speed up the search. The proposed framework can also be extended to image and video retrieval, as these modalities also have spatial ordering. We would like to investigate the domain-specific, i.e., linguistic or musicological, aspects of the extracted tokens. For instance, during retrieval, the matching algorithm assumes all tokens to be equidistant from each other; one may instead study or exploit the metric space of these tokens.

9. REPRODUCIBILITY

The code is available at https://github.com/madhavlab/wav2tok. The experiments are performed using standard datasets.

C GUMBEL SOFTMAX BASED VECTOR QUANTIZER

The Gumbel-Softmax based Vector Quantizer (Baevski et al., 2019) quantizes an input latent representation z_t ∈ R^m with C codebooks, each containing K quantizers e ∈ R^{K × m/C}. For our experiments, we set C = 1 and K ∈ {15, 25, 40}. Given z_t, one of the K quantizers from each of the C codebooks is chosen, resulting in vectors e_1, ..., e_C. The codebook vectors are then concatenated and linearly transformed from R^m to R^d to output a discrete representation q_t ∈ R^d. z_t is mapped to logits l ∈ R^{C×K} that give probability scores for the choice of codeword. The probability p_{c,k} of choosing the k-th quantizer in the c-th codebook is given as

p_{c,k} = exp((l_{c,k} + n_k)/τ) / Σ_{i=1}^{K} exp((l_{c,i} + n_i)/τ),

where τ is a non-negative temperature, n = -log(-log(u)), and u are samples from the uniform distribution Unif(0, 1). During the forward pass, the codeword is chosen as κ = argmax_j p_{c,j}. During the backward pass, the loss is calculated over the Gumbel-Softmax distribution p. We use the straight-through gradient estimator (Yin et al., 2019) to estimate the gradient. Codebook Diversity Loss L_d. This loss promotes equal use of all entries in each of the C codebooks. Minimizing it maximizes the entropy of the softmax distribution p̄_c averaged over a batch of utterances for each codebook:

L_d = (1/CK) Σ_{c=1}^{C} Σ_{k=1}^{K} p̄_{c,k} log p̄_{c,k}.
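A plain-Python sketch of the sampling step and the diversity loss above, for a single codebook (C = 1). This is illustrative only; the actual model runs in a differentiable framework with the straight-through estimator for gradients.

```python
import math
import random

# Gumbel-Softmax codeword probabilities: perturb logits with Gumbel(0, 1)
# noise n = -log(-log(u)), u ~ Unif(0, 1), then apply a tempered softmax.

def gumbel_softmax_probs(logits, tau, rng):
    noisy = [(l - math.log(-math.log(rng.uniform(1e-12, 1.0)))) / tau
             for l in logits]
    m = max(noisy)                       # subtract max for numerical stability
    exps = [math.exp(x - m) for x in noisy]
    s = sum(exps)
    return [e / s for e in exps]

def diversity_loss(p_bar):
    # L_d = (1/CK) sum_{c,k} p_bar[c][k] * log p_bar[c][k];
    # minimised exactly when the averaged distribution is uniform.
    C, K = len(p_bar), len(p_bar[0])
    return sum(p * math.log(p) for row in p_bar for p in row) / (C * K)

rng = random.Random(0)
p = gumbel_softmax_probs([0.0] * 25, tau=2.0, rng=rng)
codeword = p.index(max(p))  # hard arg-max choice used in the forward pass
```

The temperature τ trades off between a near-uniform soft choice (large τ) and a near-one-hot choice (small τ).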



Figure 1: X′ is an augmented replica of X. 1a illustrates our model architecture, 1b demonstrates the generation of P required for the calculation of L_m, and 1c demonstrates our likelihood loss calculation.

The backward variable is initialised as

β_T(|T′|) = l_{T, T′_{|T′|}},    β_T(s) = 0, ∀ s < |T′|    (9)

and recursively calculated as

β_t(s) = (β_{t+1}(s) + β_{t+1}(s + 1)) l_{t, T′_s}.    (10)

We set β_t(s) = 0, ∀ s > |T′|.


Ablation Studies and Some Variations. Query by humming involves similarity based on melody information, which is carried by the semantic pairing of the audio in the training data. We relax this pairing to also include sequences that are not semantically similar and call this model wav2tok+NoSim. We optimize the contrastive loss L_m to train the model. The results are shown in Table 2 (full table in Appendix A.3).



A FURTHER STUDIES

A.1 SEQUENCE COMPRESSION

We present the quality of the token sequences T and representation sequences Z, and their compressed versions T̄ and Z̄, generated by the audio tokenizers in Table 5. wav2tok outperformed the baselines and generated the best-quality sequences T, Z, T̄, and Z̄. Sequence compression brings an order-of-magnitude drop in search time for all the audio tokenizers, with a trade-off in search performance. Compression from T to T̄ increases the robustness of the token sequences generated by wav2tok to various augmentations. wav2vec2P learnt better tokens and representations than wav2vec2 because of its pairwise training on similar audio. The effect of varying the size of the alphabet A is shown in Table 6. We train wav2vec2, wav2vec2P, and the proposed wav2tok with alphabets of size K ∈ {15, 25, 40}. Of the three settings, K = 25 gives the best performance for all models, and wav2tok gives the best performance for every setting of K.
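The compression from T to T̄ referred to above is de-duplication of consecutive tokens, as used by the compressor C in the likelihood computation. A minimal sketch:

```python
# Collapse runs of identical consecutive tokens: T -> T-bar.
# This shortens the sequence for edit-distance search at the cost of
# discarding duration information.

def compress(tokens):
    out = []
    for t in tokens:
        if not out or out[-1] != t:
            out.append(t)
    return out

print(compress([5, 5, 5, 2, 2, 7, 7, 7, 7, 5]))  # [5, 2, 7, 5]
```

Because edit-distance search cost grows with sequence length, collapsing long runs of repeated tokens is what buys the order-of-magnitude drop in search time.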

A.3 ABLATION STUDIES AND SOME VARIATIONS

We present the full version of Table 2 in Table 7. Note that the wav2tok+NoSim representations are well clustered. The wav2tok+Trans representations are also comparable with wav2tok, but the tokens are of lower quality, which is due to model overfitting.

A.4 QUALITY OF REPRESENTATIONS

We present the performance of the continuous representations generated by wav2tok and the baselines in Table 8. wav2tok generates the best representations for music, outperforming the representations generated by the large wav2vec 2.0 models, and wav2tok trained on MIR-1K generates representations that outperform the domain-specific QbH baselines. Note that wav2vec2-O outperforms wav2vec2-Multi because the hums in the dataset were all in English: wav2vec2-O is pre-trained and fine-tuned on English speech only, while wav2vec2-Multi is pre-trained multilingually. We train wav2tok on the 100-hours subset of the LibriSpeech dataset (Panayotov et al., 2015) and evaluate the quality of its tokenization of word utterances on the TIMIT dataset (Garofolo et al., 1993). We use a 2-layer BiLSTM network with 3.6 million parameters as the encoder network, which takes MFCC feature sequences as input, and perform tokenization with K = 40 tokens. wav2tok outperforms wav2vec2-O by a large margin and gives comparable performance to wav2vec2-Multi in terms of MRR score, despite using a tiny number of parameters compared to the 95 million of wav2vec2-O and the 317 million of wav2vec2-Multi. Note that wav2vec2-O and wav2vec2-Multi were pre-trained on large amounts of unlabelled speech data, yet wav2tok trained only on LibriSpeech (Panayotov et al., 2015) generalised well to TIMIT (Garofolo et al., 1993).

The likelihood of the target token sequence T′ is computed as a sum over all T-length paths π over the tokens such that C(π) = T′, where C is a compressor which compresses π, a T-length sequence of tokens, via de-duplication. The forward variable is initialised as

α_1(1) = l_{1, T′_1},    α_1(s) = 0, ∀ s > 1

and recursively calculated as

α_t(s) = (α_{t-1}(s) + α_{t-1}(s - 1)) l_{t, T′_s}.

We set α_t(s) = 0, ∀ s < 1. The backward variable is defined analogously (Equations 9 and 10).
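As a sanity check on the forward recursion above, the following sketch computes the forward variable for the blank-free (de-duplication) compressor and verifies the resulting likelihood against a brute-force sum over all frame-level paths. The per-frame token probabilities `l` below are toy values, not model outputs.

```python
from itertools import product

# alpha[t][s]: total probability of length-(t+1) frame prefixes whose
# de-duplication equals the first s tokens of the target (1-indexed s).
# The target is already de-duplicated, so consecutive target tokens differ.

def forward_likelihood(l, target):
    T, S = len(l), len(target)
    alpha = [[0.0] * (S + 1) for _ in range(T)]
    alpha[0][1] = l[0][target[0]]          # alpha_1(1) = l_{1,T'_1}; alpha_1(s>1) = 0
    for t in range(1, T):
        for s in range(1, S + 1):
            # stay on token T'_s (repeat) or advance from T'_{s-1}
            alpha[t][s] = (alpha[t - 1][s] + alpha[t - 1][s - 1]) * l[t][target[s - 1]]
    return alpha[T - 1][S]

def brute_force_likelihood(l, target, K):
    def dedup(path):
        out = [path[0]]
        for p in path[1:]:
            if p != out[-1]:
                out.append(p)
        return out
    total = 0.0
    for path in product(range(K), repeat=len(l)):
        if dedup(path) == list(target):
            prob = 1.0
            for t, k in enumerate(path):
                prob *= l[t][k]
            total += prob
    return total

l = [[0.6, 0.4], [0.5, 0.5], [0.3, 0.7]]   # toy per-frame token probabilities
```

The dynamic program costs O(T·|T′|) versus the exponential cost of the path sum, while yielding the same likelihood.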

