SEMANTIC HASHING WITH LOCALITY SENSITIVE EMBEDDINGS

Abstract

Semantic hashing methods have been explored for learning transformations into binary vector spaces. These learned binary representations may then be used in hashing-based retrieval methods, typically by retrieving all neighboring elements in the Hamming ball of radius 1 or 2. Prior studies focus on tasks with at most a few dozen to a few hundred semantic categories, and it is not currently well known how these methods scale to domains with richer semantic structure. In this study, we focus on learning embeddings for use in exact hashing retrieval, where Approximate Nearest Neighbor search consists of a simple table lookup. We propose similarity learning methods in which the optimized base similarity is the angular similarity (the probability of collision under SimHash). We demonstrate the benefits of these embeddings on a variety of domains, including a cooccurrence modelling task on a large-scale text corpus, whose rich structure cannot be captured by a few hundred semantic groups.

1. INTRODUCTION

One of the most challenging aspects of many Information Retrieval (IR) systems is the discovery and identification of the nearest neighbors of a query element in a vector space. This is typically solved using Approximate Nearest Neighbor (ANN) methods, as exact solutions typically do not scale well with the dimension of the vector space. ANN methods typically fall into one of three categories: space-partitioning trees, such as the kd-tree (Bentley (1975); Friedman et al. (1977); Arya et al. (1998)); neighborhood graph search (Chen et al. (2018); Iwasaki & Miyazaki (2018)); or Locality Sensitive Hashing (LSH) methods (Charikar (2002); Gionis et al. (1999); Lv et al. (2007)). Despite their theoretical, intuitive, and computational appeal, LSH methods are not as prevalent in modern IR systems as space-partitioning trees or neighborhood graph methods (Bernhardsson (2013); Chen et al. (2018); Johnson et al. (2017); Iwasaki & Miyazaki (2018)). Empirical studies demonstrate that LSH techniques frequently do not attain the same level of quality as space-partitioning trees (Muja & Lowe (2009)). Nonetheless, space-partitioning and neighborhood graph search methods are expensive, both in data structure construction and in query time, and remain a bottleneck in many modern IR pipelines.

As many modern retrieval tasks revolve around solving ANN for vector representations learned from raw, structured data, one might attempt to learn representations which are better suited to efficient retrieval. Metric learning methods (Xing et al. (2003); Weinberger et al. (2006); Chechik et al. (2010); Hoffer & Ailon (2015); Kulis et al. (2013)) have been proposed for learning linear and non-linear transformations of given representations for improved clustering and retrieval quality. A class of related methods, semantic hashing or hash learning methods (Salakhutdinov & Hinton (2009)), have also been explored for learning transformations into binary vector spaces.
These learned binary representations may then be used in hashing-based retrieval methods, typically by retrieving all neighboring elements in the Hamming ball of radius 1 or 2. Exact hashing retrieval algorithms, that is, Hamming ball "search" with radius 0, have a particular computational appeal: neither search data structures nor enumeration of all codes within a Hamming ball is needed. In addition, binary representations that are suitable for exact hashing retrieval can also be used to identify groups of related items that can be interpreted as clusters in the traditional sense. As the number of clusters discovered by the algorithm is not explicitly controlled (only bounded by 2^d), algorithms generating binary embeddings suitable for exact hashing retrieval can be viewed as nonparametric clustering methods.

To this end, we propose a method for learning continuous representations in which the optimized similarity is the angular similarity. The angular similarity corresponds to the collision probability of SimHash, a hyperplane-based LSH function (Charikar (2002)). Angular distance gives a sharp topology on the embedding space which encourages similar objects to have nearly identical embeddings suitable for exact hashing retrieval. Related work on similarity learning, LSH, and hash learning can be found in Section 2. The proposed models are described in Section 3. The experimental results, and other technical details, can be found in Section 4. Finally, we conclude in Section 5.
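As a concrete illustration of why exact hashing retrieval is computationally appealing, the entire "search" reduces to one hash-table lookup on a binarized code. The following sketch is our own toy example (the item ids and embeddings are hypothetical, not from the paper):

```python
import numpy as np
from collections import defaultdict

def binarize(u):
    """Threshold a dense embedding into its binary code (sign pattern)."""
    return tuple(int(x > 0) for x in u)

def build_index(embeddings):
    """Map each binary code to the list of item ids sharing it."""
    index = defaultdict(list)
    for item_id, u in embeddings.items():
        index[binarize(u)].append(item_id)
    return index

# Hypothetical embeddings: items 0 and 1 share a sign pattern, item 2 does not.
embs = {0: np.array([0.9, -0.8]),
        1: np.array([0.7, -0.9]),
        2: np.array([-0.5, 0.6])}
index = build_index(embs)
# Exact hashing retrieval: a single table lookup, with no search structure
# and no enumeration of a Hamming ball.
neighbors = index[binarize(embs[0])]  # -> [0, 1]
```

Each nonempty bucket of the table is one of the "clusters" described above; the number of buckets is bounded by 2^d but is otherwise determined by the data.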

2. PRELIMINARIES

2.1 SIMILARITY MODELLING

Similarity learning methods are a class of techniques for learning a similarity function between objects. One successful approach to similarity learning is the "twin network" or "two tower" architecture, in which two neural networks are joined to produce a similarity prediction (Bromley et al. (1994); Chopra et al. (2005); Huang et al. (2013)). The weights of these networks may or may not be shared, depending on whether the two input domains are equivalent. Let i ∈ U and j ∈ V be the identities of two objects, where U and V are the two domains across which a similarity function is to be learned. Let φ_u(i) and φ_v(j) be the input representations for the objects (these functions φ may be identity functions if the input domains are discrete). These representations are then transformed through parameterized vector-valued functions f_u(·|θ_u) and f_v(·|θ_v), whose outputs are typically the learned representations u_i = f_u(φ_u(i)|θ_u) and v_j = f_v(φ_v(j)|θ_v). A loss is then defined using pairwise labels y_ij and an interaction function s(u_i, v_j) which denotes the similarity or relevancy of the pair.

Taking f_u to be a mapping from each index i to an independent parameter vector u_i (similarly for f_v and v_j), and taking s(u_i, v_j) = u_i^T v_j with an appropriate loss, results in a variety of matrix factorization approaches (Koren et al. (2009); Lee & Seung (2001); Mnih & Salakhutdinov (2008); Blei et al. (2003); Rendle et al. (2012); Pennington et al. (2014)). Taking f_u to be a neural network mapping a context φ_u(i) to a representation u_i allows for similarity models that readily make use of complex contextual information. Common choices for the similarity function include transformations of Euclidean distance (Chopra et al. (2005)) and cosine similarity (Huang et al. (2013)):

s(u_i, v_j) = u_i^T v_j / (||u_i|| ||v_j||).

In addition, the loss can be defined for pairs (Chopra et al. (2005)), triplets (one positive pair, one negative pair) (Rendle et al. (2012); Chechik et al. (2010)), or on larger sets (Huang et al. (2013)).

2.2. LOCALITY SENSITIVE HASHING AND ANGULAR SIMILARITY

A Locality Sensitive Hash (LSH) family F is a distribution of hashes h on a collection of objects Q such that for q_i, q_j ∈ Q (Indyk & Motwani (1998); Gionis et al. (1999); Charikar (2002)),

Pr[h(q_i) = h(q_j)] = s(q_i, q_j)    (1)

for some similarity function s on the objects. SimHash (Charikar (2002)) is an LSH technique developed for document deduplication but may be used in other contexts. For a vector representation q_i ∈ R^d, SimHash draws a random matrix Z ∈ R^{d×M} with standard Normal entries. The hash h(q_i) ∈ {0, 1}^M is then constructed as h(q_i)_m = 1[q_i^T Z_{:m} > 0]. Intuitively, SimHash draws random hyperplanes through the origin to separate points. A useful property of this hash function, as stated in Charikar (2002), is that

ψ(q_i, q_j) := Pr[h(q_i)_m = h(q_j)_m] = 1 − (1/π) cos⁻¹( q_i^T q_j / (||q_i|| ||q_j||) ),    (2)

where the probability is measured with respect to Z. The collision probability ψ(q_i, q_j) is also known as the angular similarity, and ξ = 1 − ψ is the angular distance, which is a proper metric (unlike the cosine distance 1 − q_i^T q_j / (||q_i|| ||q_j||)). As the columns of Z are independent, the collision probability for a K-bit hash is ψ^K.
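The SimHash construction and its collision probability can be verified numerically. The sketch below is our own illustration (the vectors, noise level, and sample sizes are arbitrary choices): it draws many independent hash matrices Z and checks that the empirical per-bit collision rate matches the angular similarity ψ.

```python
import numpy as np

def simhash(q, Z):
    """M-bit SimHash: the sign pattern of projections onto random hyperplanes."""
    return (q @ Z > 0).astype(int)

def angular_similarity(qi, qj):
    """Single-bit collision probability: psi = 1 - arccos(cos)/pi."""
    c = qi @ qj / (np.linalg.norm(qi) * np.linalg.norm(qj))
    return 1.0 - np.arccos(np.clip(c, -1.0, 1.0)) / np.pi

rng = np.random.default_rng(0)
d, M, n_draws = 8, 16, 1250
qi = rng.normal(size=d)
qj = qi + 0.3 * rng.normal(size=d)

# Empirical per-bit collision rate over many independent draws of Z.
hits = 0
for _ in range(n_draws):
    Z = rng.normal(size=(d, M))
    hits += int((simhash(qi, Z) == simhash(qj, Z)).sum())
empirical = hits / (n_draws * M)
# empirical should closely match angular_similarity(qi, qj)
```

Because the columns of Z are independent, the 20,000 bits drawn here behave as independent Bernoulli(ψ) trials, so the empirical rate concentrates tightly around the closed-form value.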

2.3. LEARNING TO HASH

A related approach to similarity learning is hash learning, introduced in Salakhutdinov & Hinton (2009). These methods train binary embeddings directly and then use hash collisions or Hamming ball search to retrieve approximate nearest neighbors. Binary representations lead to some technical challenges: Salakhutdinov & Hinton (2009) use contrastive divergence for training, whereas Hubara et al. (2016) implement binary threshold activation functions with stochastic neurons. Another approach (and the one followed in this work) is to avoid explicit binary representations in training, introduce a quantization loss to penalize embeddings that are not close to binary, and subsequently threshold these near-binary embeddings to binary ones. This type of quantization loss is distinct from those used in vector quantization methods (Ahalt et al. (1990); Kohonen (1990); Sato & Yamada (1996)), in which the data representations are fixed and the codes are learned; here the codes are fixed and the representations are learned. The quantization loss introduced in Deep Hashing Networks (DHN) (Zhu et al. (2016)) is of the form

b(u_i|θ) = Σ_d log cosh(|u_{id}| − 1) ≈ || |u_i| − 1 ||_1.    (3)

Other quantization losses based on distances to binary codes have been used in Li et al. (2016) and Liu et al. (2016). Cao et al. (2017) utilize a quantization loss whose strength increases over time. Finally, Deep Cauchy Hashing (DCH) (Cao et al. (2018)) has shown improvements by utilizing a heavy-tailed similarity function with a similarly inspired quantization loss.
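As a minimal NumPy rendering of the quantization penalty in equation 3 (our own sketch, not the paper's training code), note that log cosh(|u_d| − 1) vanishes exactly when every coordinate sits at ±1 and grows smoothly as coordinates drift away:

```python
import numpy as np

def quantization_loss(u):
    """Smooth surrogate for || |u| - 1 ||_1: the sum over dimensions of
    log cosh(|u_d| - 1). Zero exactly when each coordinate is +/-1."""
    return float(np.sum(np.log(np.cosh(np.abs(u) - 1.0))))

near_binary = np.array([0.98, -1.01, 0.99])
far_from_binary = np.array([0.10, -0.20, 0.05])
# The penalty is tiny for near-binary embeddings and large otherwise,
# nudging dense embeddings toward codes that survive thresholding.
```

Because log cosh is smooth everywhere, the surrogate avoids the non-differentiability of the L1 distance to the nearest binary code while approximating it closely.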

3. LOCALITY SENSITIVE EMBEDDINGS

Many similarity learning methods utilize dot products or cosine similarity to relate the embeddings of a pair to each other. For example, GloVe (Pennington et al. (2014)) minimizes the weighted error between the dot product of the embeddings and a log-cooccurrence matrix, and the DSSM model (Huang et al. (2013)) utilizes cosine similarity as the "crossing" layer between the two halves of a twin network. In general, embeddings trained in this way are not suitable for SimHash retrieval, as can be seen in Figure 1. If models are trained to minimize the error of a prediction made by cosine similarity, extremely low tolerances are required in order to achieve embeddings with significant collision probability. Similar observations on the misspecification of cosine distance for semantic hashing were made in Cao et al. (2018). In this section, we define models in which the collision probabilities of learned representations are directly optimized.
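The observation in Figure 1 can be reproduced with a two-line computation using the closed-form ψ from Section 2.2 (this is our own numeric check of the claim): a pair at cosine distance 0.001 still fails to collide under an 8-bit SimHash roughly 11% of the time.

```python
import numpy as np

def psi_from_cosine(c):
    """Angular similarity (single-bit SimHash collision prob.) at cosine c."""
    return 1.0 - np.arccos(c) / np.pi

# A pair at cosine distance 0.001 (cosine similarity 0.999)...
p1 = psi_from_cosine(0.999)
# ...collides on all 8 bits of an 8-bit SimHash with probability psi^8.
p8 = p1 ** 8  # roughly 0.89
```

So even a model that predicts cosine similarity to three decimal places leaves a substantial fraction of relevant pairs in different buckets.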

3.1. LOSS DEFINITION

In the following, we define a population loss through a data distribution D of relevant and irrelevant pairs. Each sample from D is a tuple (y, i, j) ∈ {0, 1} × U × V, where U and V are the sets across which a similarity is to be learned, for example, "users" and "items" in a recommender system, and y is the relevancy of the pair (i, j). The population losses we consider are expectations over D of a per-tuple loss l with regularization terms r per item:

L(θ) = E_{(y,i,j)∼D}[ l(y, i, j|θ) + λ r(i|θ) + λ r(j|θ) ].    (4)

In practice, we minimize the empirical loss constructed from a finite sample from D, and we use r(i|θ) = b(u_i|θ) as defined in equation 3. Here θ represents all parameters of the model, including any learned representations for the elements of the sets U and V. An embedding u_i for element i may either be a vector of free parameters, as in a fixed-vocabulary embedding model, or may be the output of a model on a raw input, u_i = f_u(φ_u(i)), as in a twin network model. In addition, each half of the pair (u_i, v_j) may represent a different input space, as in the DSSM model.

3.2. BINARY CROSS ENTROPY LOSS

We may simply model the relevancy y_ij for the pair (u_i, v_j) with a binary cross entropy loss:

l(y_ij, i, j|θ) = −y_ij log(p(y_ij|i, j, θ)) − β(1 − y_ij) log(1 − p(y_ij|i, j, θ)),    (5)

where p is the learned estimate for E[y_ij|i, j, θ], and β is a scalar hyperparameter tuning the relative importance of positive and negative samples. One standard choice for p in representation learning is

p(y_ij|i, j, θ) = s_σ(u_i, v_j) := σ( α u_i^T v_j / (||u_i|| ||v_j||) ),    (6)

where σ is the logistic function and α is a scalar. As the logistic function saturates quickly, the embeddings u_i and v_j do not need to be extremely close (when y_ij is positive) in order to achieve low error. Thus, to encourage representations that are amenable to hashing retrieval, we might consider other transformations of the embeddings that do not saturate so quickly. For example, one may take a polynomial transformation of cosine similarity,

p(y_ij|i, j, θ) = s_c(u_i, v_j)^K := (1/2^K) (1 + u_i^T v_j / (||u_i|| ||v_j||))^K,    (7)

or a polynomial transformation of the angular similarity,

p(y_ij|i, j, θ) = ψ(u_i, v_j)^K = (1 − (1/π) cos⁻¹( u_i^T v_j / (||u_i|| ||v_j||) ))^K.    (8)

The choice p(y_ij|i, j, θ) = ψ(u_i, v_j)^K has a natural interpretation: it uses the SimHash collision probability under a K-bit hash as the estimation function. Intuitively, we are training representations whose collision probability distribution under SimHash has minimum cross entropy with the pairwise label distribution y. Embeddings trained with equation 8 are termed Locality Sensitive Embeddings (LSE) and are the proposed method of this paper. Deterministic thresholding is still used to derive binary embeddings from the dense versions. DCH (Cao et al. (2018)) introduced the following similarity measure for defining the loss:

p(y_ij|i, j, θ) = s_h(u_i, v_j) := γ / ( γ + (d/2)(1 − u_i^T v_j / (||u_i|| ||v_j||)) ).    (9)
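A minimal NumPy sketch of the LSE objective follows (our own rendering of equations 5 and 8, not the paper's training code; the K, β, and vector values here are arbitrary):

```python
import numpy as np

def lse_probability(u, v, K=4):
    """SimHash K-bit collision probability psi^K used as p(y = 1 | u, v)."""
    c = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    psi = 1.0 - np.arccos(np.clip(c, -1.0, 1.0)) / np.pi
    return psi ** K

def bce_loss(y, u, v, K=4, beta=1.0):
    """Binary cross entropy of the label against the collision probability."""
    p = lse_probability(u, v, K)
    eps = 1e-12  # numerical guard for log(0)
    return -y * np.log(p + eps) - beta * (1 - y) * np.log(1 - p + eps)

u = np.array([1.0, 0.2])
v_close = np.array([0.9, 0.3])
v_far = -u
# A positive pair incurs more loss the lower its collision probability.
```

Since ψ^K decays polynomially rather than saturating like the logistic, positive pairs keep receiving gradient pressure until their embeddings are nearly angularly identical.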

3.3. TOPOLOGICAL ANALYSIS

The SimHash method and the angular similarity can be used to study the topologies induced by the different similarity measures of the previous section.

Theorem 1. Let B(q_i) = N(δ, q_i, 1 − s) denote a ball around q_i with radius δ under the 1 − s distance. For an arbitrary point q_j ∈ B(q_i), consider the probability that q_j and q_i collide under a single bit of SimHash; denote this P_s(δ). Then,

1. (LSE) P_ψ(δ) ≥ 1 − δ
2. (COS) P_{s_c}(δ) ≥ 1 − 2√δ/π − O(δ^{3/2})
3. (DCH) P_{s_h}(δ) ≥ 1 − (4γδ / (π² d(1 − δ)))^{1/2} − O(δ^{3/2})

Proof in Appendix. These bounds are tight, as we know the asymptotic error for each (for LSE there is no error term). Theorem 1 reveals that, for a similarity model trained to tolerance level δ for positive pairs, the single-bit collision probability under SimHash has linear scaling for LSE, while COS and DCH have sublinear scaling. Note that for the logistic-based similarity, P_{s_σ}(δ) is only well defined for α > |log(δ) − log(1 − δ)| (otherwise 1 − s_σ cannot be below δ); any analysis here requires choosing a rate for α.

There is also a relationship between angular and Hamming distances: angular distance can be viewed as a dimension-scaled version of Hamming distance applied to randomly rotated inputs.

Lemma 1. Let b(q_i) be the vector indicating the signs of q_i, that is, b(q_i)_m := 1[q_{im} > 0]. Denote the Hamming distance of the sign vectors as ρ_H(q_i, q_j) := ||b(q_i) − b(q_j)||_1, which defines a semimetric on R^d. Take R to be a uniformly random orthogonal matrix. Then

1 − ψ(q_i, q_j) = (1/d) E_R[ρ_H(R q_i, R q_j)].

Proof in Appendix. Lemma 1 demonstrates that angular distance may be viewed as an expectation of a dimension-scaled Hamming distance, where the expectation is taken with respect to the choice of basis. In other words, minimizing angular distance is equivalent to minimizing the Hamming semimetric ρ_H averaged over all possible bases.
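Lemma 1 can be checked by Monte Carlo: average the sign-pattern Hamming distance over Haar-random rotations and compare with the angular distance. The sketch below is our own verification (the dimension, noise level, and rotation count are arbitrary choices):

```python
import numpy as np

def angular_distance(qi, qj):
    """1 - psi: the angular distance between two vectors."""
    c = qi @ qj / (np.linalg.norm(qi) * np.linalg.norm(qj))
    return np.arccos(np.clip(c, -1.0, 1.0)) / np.pi

def sign_hamming(qi, qj):
    """rho_H: Hamming distance between the sign patterns of two vectors."""
    return int(np.sum((qi > 0) != (qj > 0)))

rng = np.random.default_rng(1)
d, n_rot = 16, 2000
qi = rng.normal(size=d)
qj = qi + 0.5 * rng.normal(size=d)

total = 0
for _ in range(n_rot):
    g = rng.normal(size=(d, d))
    q_fac, r_fac = np.linalg.qr(g)
    rot = q_fac * np.sign(np.diag(r_fac))  # Haar-distributed orthogonal matrix
    total += sign_hamming(rot @ qi, rot @ qj)
estimate = total / (n_rot * d)
# estimate should approximate the angular distance 1 - psi(qi, qj)
```

The sign correction on the QR factor makes the rotation Haar-distributed; without it, NumPy's sign convention biases the sample away from the uniform measure.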

4. EXPERIMENTS

In this section, we compare representations trained using equation 8 (LSE), using equation 7 (COS), using DHN's logistic-based similarity, and using DCH.

4.1. STOCHASTIC BLOCK MODEL EXPERIMENT

We first evaluate on synthetic data in which individuals belong to "factions"; the resulting cooccurrence matrix D is roughly block diagonal, with additional edges appearing between "nearby" factions at a lower rate. The task is to hash the individuals from each faction together, while separating them from all other factions. In order to enable higher precision in the resulting models, a set of hard negatives is generated by taking pairs (i, j) such that (D^T D)_ij > 0 and D_ij = 0. Easy negatives are also generated by taking pairs at random and assuming a cooccurrence of 0, as is common in methods inspired by Noise Contrastive Estimation (Gutmann & Hyvärinen (2010)).

Each individual is given a free parameter u_i which is the output of an embedding layer followed by a tanh activation. This output is dropped out (with shared randomization for each half of the training pair) before the cosine similarity computation. Batches are constructed from 1024 positive pairs and 3072 negative pairs (1024 of which may be either hard or easy negatives, determined as a hyperparameter). We trained all models with 32-dimensional representations for 50 epochs, where 1 epoch is the number of batches required to iterate through all positive pairs. We explore 4 hyperparameters: K, β, λ, and the use of hard negative samples.

During evaluation, we retrieve all individuals which have the same binarized embedding as the query individual. We measure precision, recall, and F1-score with the data-generating factions as the target. 10 trials are run for each hyperparameter setting, and the mean over trials is reported. Figure 2 shows a detailed view of the hyperparameter tuning, and Table 1 shows the chosen hyperparameters for each model when ranking by F1-score. DCH and LSE are competitive; however, the LSE model is able to achieve surprisingly accurate recovery of the data-generating structure, with an F1 of nearly 0.99.
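The hard negative construction above, pairs that share a neighbor but never cooccur directly, can be sketched as follows (our own toy dense-matrix version; a corpus-scale implementation would use sparse matrices):

```python
import numpy as np

def mine_hard_negatives(D):
    """Hard negatives: pairs (i, j) with (D^T D)_ij > 0 (they share at least
    one neighbor) but D_ij = 0 (they never cooccur directly)."""
    two_hop = (D.T @ D) > 0
    direct = D > 0
    hard = two_hop & ~direct
    np.fill_diagonal(hard, False)  # exclude self-pairs
    return np.argwhere(hard)

# Toy symmetric cooccurrence matrix: 0-1 and 1-2 cooccur, so the pair
# (0, 2) shares neighbor 1 without cooccurring -- a hard negative.
D = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]])
pairs = mine_hard_negatives(D)  # -> [[0, 2], [2, 0]]
```

These two-hop-but-not-adjacent pairs are exactly the ones most likely to be hashed together by mistake, which is why including them sharpens precision.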
4.2. IMAGE RETRIEVAL EXPERIMENTS

We show results for binary embedding sizes of 64, 48, 32, and 16 bits. We do not alter any other setting and use K = 1, λ = 0.1 for all models. In pilot experiments we did not find any significant improvement from higher K for any of the methods. All baseline papers use mean average precision for evaluation (Cao et al. (2018); Zhu et al. (2016)), which is the evaluation method we adopt for this experiment. The LSE model is comparable to the baseline methods. Cifar10 results are statistically significant with a p-value of 7.5 × 10^-4, NUS Wide with a p-value of 2.81 × 10^-146, and Imagenet with a p-value of 7.99 × 10^-4 according to the widely used Iman-Davenport test (Garcia & Herrera, 2008). We point out that all the experimental results on DCH and DHN (both in the respective baseline papers and in this paper) use a pretrained AlexNet. The use of a pretrained network and well-separated categories reduces the need for a model with a strong inductive bias like LSE. Nonetheless, LSE remains a good choice for these types of tasks as well.

4.3. OSCAR COOCCURRENCE MODEL

Finally, we compare all methods on a cooccurrence matrix generated from the OSCAR English dataset (Ortiz Suarez et al. (2019); Ortiz Suarez et al. (2020)). We take the deduplicated version of the corpus (1.2TB compressed) and generate an initial symmetric cooccurrence matrix by counting word pairs within a window of size 10, inversely weighting the counts by the distance of the words within the sentence, as in Pennington et al. (2014). This initial matrix is then filtered to remove extremely common and extremely rare terms. Additional filtering based on row- and column-normalized cooccurrences is used to retain pairs that are atypical compared to the marginal frequencies of the two terms. For each row, the top 100 pairs ranked by the original cooccurrence are kept, and the resulting binary matrix is symmetrized. The resulting cooccurrence matrix D has a vocabulary of 660K unique terms.

Popular models for text data frequently allow each word's representation to have multiple contexts, as in topic modelling (Hofmann (1999); Blei et al. (2003)) or multi-sense embedding models (Nguyen et al. (2017)). To incorporate multiple context representations into semantic hashing methods, we represent each word i with L embeddings u_il (these are free parameters per word; no subword information is used). The maximum among all pairwise cosine similarities is then taken as the base similarity function:

s(u_i, u_j) = max_{l,m} u_{il}^T u_{jm} / (||u_{il}|| ||u_{jm}||).    (11)

This base similarity is then used in place of cosine similarity in defining the loss for all models. At retrieval time, a query word is mapped to its L hashes, corresponding to L "clusters." The union of the L clusters is the retrieved set for the query; no search is performed, and the only data structures used are hash tables. As each word is associated with L hashes, this model may be understood as a "word2hashes" method.
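The max-over-senses base similarity of equation 11 can be sketched as below (our own minimal version; the shapes and values are illustrative):

```python
import numpy as np

def max_pairwise_cosine(U_i, U_j):
    """Base similarity of equation 11: the maximum cosine similarity over
    all L x L pairs of sense embeddings. U_i and U_j have shape (L, dim)."""
    Ui = U_i / np.linalg.norm(U_i, axis=1, keepdims=True)
    Uj = U_j / np.linalg.norm(U_j, axis=1, keepdims=True)
    return float(np.max(Ui @ Uj.T))

# Word i's second sense aligns exactly with word j's first sense.
U_i = np.array([[1.0, 0.0], [0.0, 1.0]])
U_j = np.array([[0.0, 1.0], [-1.0, 0.0]])
s = max_pairwise_cosine(U_i, U_j)  # -> 1.0
```

Two words are thus considered similar if any one of their senses matches, which is what allows a polysemous word to join L distinct hash clusters.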
The base architecture used for all models is a 32-dimensional tanh-activated embedding layer followed by dropout (with shared randomization across all 2L embeddings), with L = 3. The dropout layer is followed by equation 11. Batches are constructed from 8192 positives, 8192 hard negatives, and 16384 easy negatives. Models are trained for 20 epochs through the positive set using 2 GPUs.

We utilize a semantic quality measure based on Wu-Palmer similarity (WP) (Wu & Palmer (1994)) on WordNet (WN) (Miller (1995); Fellbaum et al. (1998)). We take all nouns, verbs, and adjectives from the WordNet corpus and remove all words with no hypernyms (these are typically isolated nodes in the WordNet graph for which WP values are not available). The intersection with the 660K vocabulary leaves 46K words, which we index based on the semantic hashing models. For each query word w and its retrieved set V(w), the average WP similarity is computed across all pairs (w, v) with v ∈ V(w). Self-pairs are removed, and empty V(w) are given 0 values. This WP measure is bounded between 0 and 1, with higher values indicating more semantically meaningful clusters.

Figure 3 shows WP of the models on a 1K-word query set (sampled randomly from the 46K WN vocabulary) which we use as a tuning data set. We also report F1 score on a 1K query set sampled from the 660K vocabulary to evaluate how well each model reconstructs the training data. All models used the same tuning grid, except COS, for which K = 16 was added, as the initial sweep showed potentially large improvement from expanding the grid for the COS model. Table 3 shows the final model comparison. The hyperparameters with the highest WP per model are taken, and models for each are trained for 100 epochs. We compare the scores of these models on a non-tuning set of 1K queries from WN for WP, and 1K queries from the full vocabulary for Precision, Recall, and F1.
In addition, we evaluate a HitRatio (HR) score on the heldout 4600 pairs: all colliding words for a query are retrieved, and if the target word appears in the top n items ranked by cosine similarity (of the dense embeddings), the query achieves a HR of 1. This is the only measure to use the dense embeddings. We also report the number of non-singleton clusters. As can be seen, LSE outperforms the baselines on training, test, and semantic quality measures. We display some example queries and their retrieved hash siblings from the 100-epoch LSE model in Table 7. t-SNE (Maaten & Hinton (2008)) plots of the dense embeddings on the WordNet vocabulary are shown in Figures 5, 6, and 7. Within the discrete hash-based clusters used in retrieval, there is still additional structure in the dense embeddings that may be leveraged. This can be seen from the t-SNE plots (Figures 8 and 9) of the dense embeddings that collide together in the "Generic Food" cluster seen in Table 7.

A APPENDIX

A.1 EFFECT OF HARD NEGATIVES AND ABLATION STUDY

In this section we study the impact of hard negatives. Recall from the main text that we define hard negatives from a dataset D by taking pairs where (D^T D)_ij > 0 and D_ij = 0. See Figure 4 for a diagram showing these hard negative pairs on the SBM synthetic data. We performed an experiment on the OSCAR dataset in which a modified loss function is used where hard negative samples are weighted (that is, easy negatives are always given a weight of 1). These tuned models are compared to the tuned models of the main text in Table 5. This tuning gives a modest improvement to the COS and DCH models in terms of WP (and a degradation for LSE), while keeping F1 much the same. However, DCH is not able to obtain the same F1 measure as LSE (in either tuning). In addition, the LSE model from the original tuning outperforms the other models in WP+F1. Also note that the DHN model improves substantially on the removal of hard negatives; however, it still remains the worst performing algorithm of the four. We also use this experiment to perform an ablation study over the quantization loss, K, and the negative weight. All methods have comparable performance when the quantization loss is removed. And finally, DCH performs best when K = 1.

A.2 PROOFS

Theorem 1. Let B(q_i) = N(δ, q_i, 1 − s) denote a ball around q_i with radius δ under the 1 − s distance. For an arbitrary point q_j ∈ B(q_i), consider the probability that q_j and q_i collide under a single bit of SimHash; denote this P_s(δ). Then,

1. (LSE) P_ψ(δ) ≥ 1 − δ
2. (COS) P_{s_c}(δ) ≥ 1 − 2√δ/π − O(δ^{3/2})
3. (DCH) P_{s_h}(δ) ≥ 1 − (4γδ / (π² d(1 − δ)))^{1/2} − O(δ^{3/2})

Proof. (1): 1 − ψ(q_i, q_j) ≤ δ by definition, so P_ψ(δ) ≥ 1 − δ.

For (2) and (3) we use the series expansion cos⁻¹(1 − x) = √(2x) + O(x^{3/2}).

(2): 1 − s_c(q_i, q_j) ≤ δ. Substituting q_i^T q_j / (||q_i|| ||q_j||) = cos(π(1 − ψ(q_i, q_j))) into s_c and rearranging gives (note that cos⁻¹ is monotonically decreasing):

1 − s_c(q_i, q_j) = (1/2)(1 − q_i^T q_j / (||q_i|| ||q_j||)) ≤ δ
⟹ cos(π(1 − ψ(q_i, q_j))) ≥ 1 − 2δ
⟹ 1 − ψ(q_i, q_j) ≤ (1/π) cos⁻¹(1 − 2δ) = 2√δ/π + O(δ^{3/2}).

(3): 1 − s_h(q_i, q_j) ≤ δ. Proceeding as above,

1 − s_h(q_i, q_j) = 1 − γ / (γ + (d/2)(1 − q_i^T q_j / (||q_i|| ||q_j||))) ≤ δ
⟹ cos(π(1 − ψ(q_i, q_j))) ≥ 1 − 2γδ / (d(1 − δ))
⟹ 1 − ψ(q_i, q_j) ≤ (1/π) cos⁻¹(1 − 2γδ / (d(1 − δ))) = (4γδ / (π² d(1 − δ)))^{1/2} + O(δ^{3/2}).

Note that for the logistic-based similarity, P_{s_σ}(δ) is only well defined for α > |log(δ) − log(1 − δ)| (otherwise 1 − s_σ cannot be below δ); any analysis here requires choosing a rate for α.

Lemma 1. Let b(q_i) be the vector indicating the signs of q_i, that is, b(q_i)_m := 1[q_{im} > 0]. Denote the Hamming distance of the sign vectors as ρ_H(q_i, q_j) := ||b(q_i) − b(q_j)||_1, which defines a semimetric on R^d. Take R to be a uniformly random orthogonal matrix. Then 1 − ψ(q_i, q_j) = (1/d) E_R[ρ_H(R q_i, R q_j)].

Proof. Consider a modified SimHash algorithm where z is taken uniformly at random from the standard basis vectors {e_m}, m ∈ {1, ..., d}, and let h̃(q_i) := 1[q_i^T z > 0] denote this hash.
The collision probability is simply the chance that the two embeddings share the same sign in a randomly chosen dimension, so

Pr[h̃(q_i) = h̃(q_j)] = E_{m∼(1,...,d)}[1[b(q_i)_m = b(q_j)_m]] = 1 − ρ_H(q_i, q_j)/d.

Let z_0 ∈ R^d be a vector with independent standard Normal entries, and take z = z_0/||z_0||, which is distributed uniformly on the unit sphere; thus Rz =_d z. The original SimHash function satisfies h(q_i) = 1[q_i^T z_0 > 0] = 1[q_i^T z > 0], and so h(q_i) =_d h̃(R q_i), and thus

ψ(q_i, q_j) = Pr[h(q_i) = h(q_j)] = E_R[1 − ρ_H(R q_i, R q_j)/d].

A.3 EVALUATION METRICS

• Precision — Retrieve all items with the same hash, and compute precision.
• Recall — Retrieve all items with the same hash, and compute recall.
• F1 — Compute the F1-measure using the above Precision and Recall.
• WP — Wu-Palmer similarity measure for evaluating the semantic quality of hash groups. Wu-Palmer similarity (Wu & Palmer (1994)) is computed on WordNet (WN) (Miller (1995); Fellbaum et al. (1998)). We take all nouns, verbs, and adjectives from the WordNet corpus and remove all words with no hypernyms (these are typically isolated nodes in the WordNet graph for which WP values are not available). The intersection with the 660K OSCAR vocabulary leaves 46K words, which we index based on the semantic hashing models. For each query word w and its retrieved set V(w), the average WP similarity is computed across all pairs (w, v) with v ∈ V(w). Self-pairs are removed, and empty V(w) are given 0 values. This WP measure is bounded between 0 and 1, with higher values indicating more semantically meaningful clusters.
• HR@n — HitRatio score on the heldout (test) 4600 pairs: all colliding words for a query are retrieved, and if the target word appears in the top n items ranked by cosine similarity (of the dense embeddings), the query achieves a HR of 1. This is the only measure in the OSCAR experiment to use the dense embeddings.
• Mean Average Precision (MAP) — Consider the rank position of each relevant retrieved item (in the top R) based on Hamming distance: K_1, K_2, ..., K_R. First compute precision@k for each rank threshold k ≤ R as the number of relevant items in the top k divided by k (ignoring items ranked lower than k). Next, average the precision@k values over the ranks at which relevant items appear. Finally, the mean average precision is the mean of the average precision over all queries.
• Recall@2 — Retrieve everything within Hamming distance 2 and compute recall.

A.4 QUALITATIVE PLOTS AND TABLES

Table 7 shows example hash clusters retrieved in the OSCAR model for a set of queries. These are constructed from exact hash collisions only; no search or ANN is performed in either the binary representation space or the dense embedding space. Figures 5, 6, and 7 show t-SNE plots for the dense embeddings on the WordNet set. Figures 8 and 9 show t-SNE plots for the dense embeddings associated with the single hash that resembles a "Generic Food" cluster. These figures demonstrate that significant semantically meaningful structure remains in the dense version of the embeddings.
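The MAP computation described in A.3 can be sketched as follows (our own minimal version; the ranked relevance lists are illustrative):

```python
def average_precision(ranked_relevance):
    """AP over a ranked list of 0/1 relevance flags: the mean of
    precision@k taken at each rank k where a relevant item appears."""
    hits, precisions = 0, []
    for k, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(rankings_per_query):
    """MAP: the mean of the per-query average precisions."""
    return sum(average_precision(r) for r in rankings_per_query) / len(rankings_per_query)

# Two queries: a perfect ranking and one whose hit appears at rank 2.
map_score = mean_average_precision([[1, 0, 0], [0, 1, 0]])  # -> 0.75
```

In the image experiments the ranking inside each query's list is induced by Hamming distance on the binary codes, but the computation is otherwise identical.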




https://github.com/swuxyj/DeepHash-pytorch

As DHN uses the unnormalized inner product, for which the max operation has undesirable properties, we modify the DHN implementation to use 5·s(u_i, u_j) as the input to the logistic function.

5. CONCLUSION

We extend semantic hashing methods to problems with substantial label noise and to the exact hashing retrieval case via the introduction of Locality Sensitive Embeddings, which leverage angular similarity as the main component of an output prediction. The learned representations show superior performance in the exact hashing retrieval setting. We applied LSE in a multiple-context representation learning model on a cooccurrence matrix generated from the OSCAR English corpus, producing a "word2hashes" model which is novel to the best of the authors' knowledge. DHN performs poorly here as it is unable to properly handle the inclusion of hard negative samples. See Appendix A.1 for an ablation study and a discussion of the impact of hard negatives.



Figure 1: (left) Response of 8-bit SimHash collision probability vs. cosine distance. The plot indicates that vectors that are extremely close in cosine distance may not collide under SimHash. For example, a cosine distance of 0.001 corresponds to a collision probability of only 0.9. (right) Response of various similarities vs cosine distance. Angular distance (1-bit Simhash collision probability) induces the sharpest topology. The DCH distance uses γ = 5 and d = 32.


Figure 2: Results of the SBM experiment hyperparameter search, sliced by different variables. Each set of bars represents the scores of the highest F1 model with the listed constraint.

Figure 3: Results of the OSCAR experiment hyperparameter search, sliced by different variables. Each set of bars represents the scores of the highest combined WP+F1 measures (where WP is Wu-Palmer similarity, described in the main text) with the listed hyperparameter value fixed. F1 scores are computed on a query set randomly sampled from the 660K vocabulary, with retrieval performed on the full 660K vocabulary. Wu-Palmer similarity is computed on the 46K WordNet evaluation vocabulary, with a query set sampled from the WN vocabulary. (bottom right) t-SNE plot of the dense embeddings assigned to a single hash cluster ("Generic Food"), revealing additional structure.

Figure 4: (left) Subset of SBM dataset, indicating near block-diagonal structure. (right) Hard negatives sampled for SBM dataset

Table 1: Results of Tuned Models on the SBM Experiment (columns: Model, K, β, Hard Negatives?)

Mean Average Precision on Image Models

Table 3: Tuned Parameters and Performance of 100-Epoch Models on the OSCAR Dataset

Ruslan Salakhutdinov and Geoffrey Hinton. Semantic hashing. International Journal of Approximate Reasoning, 50(7):969-978, 2009.

Atsushi Sato and Keiji Yamada. Generalized learning vector quantization. In Advances in Neural Information Processing Systems, pp. 423-429, 1996.

Kilian Q. Weinberger, John Blitzer, and Lawrence K. Saul. Distance metric learning for large margin nearest neighbor classification. In Advances in Neural Information Processing Systems, pp. 1473-1480, 2006.

DHN is most competitive when hard negatives are removed, achieving the highest WP (but still the lowest F1). It is also in this case that LSE achieves a Recall of 0.54; this metric is essentially traded off for Precision.

Table 7: Example Retrieved Hash Clusters

