SEMANTIC HASHING WITH LOCALITY SENSITIVE EMBEDDINGS

Abstract

Semantic hashing methods have been explored for learning transformations into binary vector spaces. These learned binary representations may then be used in hashing-based retrieval methods, typically by retrieving all neighboring elements in the Hamming ball with radius 1 or 2. Prior studies focus on tasks with at most a few dozen to a few hundred semantic categories, and it is not currently well known how these methods scale to domains with richer semantic structure. In this study, we focus on learning embeddings for use in exact hashing retrieval, where Approximate Nearest Neighbor search comprises a simple table lookup. We propose similarity learning methods in which the optimized base similarity is the angular similarity (the probability of collision under SimHash). We demonstrate the benefits of these embeddings on a variety of domains, including a co-occurrence modelling task on a large-scale text corpus, the rich structure of which cannot be handled by a few hundred semantic groups.

1. INTRODUCTION

One of the most challenging aspects of many Information Retrieval (IR) systems is the discovery and identification of the nearest neighbors of a query element in a vector space. This is typically solved using Approximate Nearest Neighbor (ANN) methods, as exact solutions typically do not scale well with the dimension of the vector space. ANN methods typically fall into one of three categories: space-partitioning trees, such as the kd-tree (Bentley (1975); Friedman et al. (1977); Arya et al. (1998)); neighborhood graph search (Chen et al. (2018); Iwasaki & Miyazaki (2018)); or Locality Sensitive Hashing (LSH) methods (Charikar (2002); Gionis et al. (1999); Lv et al. (2007)). Despite their theoretical, intuitive, and computational appeal, LSH methods are not as prevalent in modern IR systems as space-partitioning trees or neighborhood graph methods (Bernhardsson (2013); Chen et al. (2018); Johnson et al. (2017); Iwasaki & Miyazaki (2018)), and empirical studies demonstrate that LSH techniques frequently do not attain the same level of quality as space-partitioning trees (Muja & Lowe (2009)). Nonetheless, space-partitioning and neighborhood graph search methods are expensive, both in data structure construction and in query time, and remain a bottleneck in many modern IR pipelines. As many modern retrieval tasks revolve around solving ANN for vector representations learned from raw, structured data, one might attempt to learn representations which are more suited towards efficient retrieval. Metric learning methods (Xing et al. (2003); Weinberger et al. (2006); Chechik et al. (2010); Hoffer & Ailon (2015); Kulis et al. (2013)) have been proposed for learning linear and non-linear transformations of given representations for improved clustering and retrieval quality. A class of related methods, semantic hashing or hash learning (Salakhutdinov & Hinton (2009)), has also been explored for learning transformations into binary vector spaces. These learned binary representations may then be used in hashing-based retrieval methods, typically by retrieving all neighboring elements in the Hamming ball with radius 1 or 2. Exact hashing retrieval algorithms, that is, Hamming-ball "search" with radius 0, have a particular computational appeal in that neither search data structures nor enumeration of all codes within a Hamming ball is needed. In addition, binary representations that are suitable for exact hashing retrieval can also be used to identify groups of related items that can be interpreted as clusters in the traditional sense.
As the number of clusters discovered by the algorithm is not explicitly controlled (only bounded by 2^d), algorithms generating binary embeddings suitable for exact hashing retrieval can be viewed as nonparametric clustering methods. To this end, we propose a method for learning continuous representations in which the optimized similarity is the angular similarity. The angular similarity corresponds to the collision probability of SimHash, a hyperplane-based LSH function (Charikar (2002)). Angular distance gives a sharp topology on the embedding space which encourages similar objects to have nearly identical embeddings suitable for exact hashing retrieval. Related work on similarity learning, LSH, and hash learning can be found in Section 2. The proposed models are described in Section 3. Experimental results, and other technical details, can be found in Section 4. Finally, we conclude in Section 5.
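To make the computational appeal of radius-0 retrieval concrete, the following is a minimal sketch (the function and variable names are our illustration, not from the paper): once every item has a binary code, exact hashing retrieval is a single dictionary lookup, and the resulting buckets are exactly the "clusters" described above.

```python
import numpy as np
from collections import defaultdict

def build_hash_table(codes):
    """Index item ids by their binary code (packed to a bytes key)."""
    table = defaultdict(list)
    for item_id, code in enumerate(codes):
        table[np.packbits(code).tobytes()].append(item_id)
    return table

def exact_hash_retrieve(table, query_code):
    """Hamming-ball "search" with radius 0: a single table lookup."""
    return table.get(np.packbits(query_code).tobytes(), [])

# Toy example: four items with 8-bit codes; items 0 and 1 share a code.
codes = np.array([[1, 0, 1, 0, 1, 0, 1, 0],
                  [1, 0, 1, 0, 1, 0, 1, 0],
                  [0, 1, 1, 1, 0, 0, 0, 1],
                  [1, 1, 1, 1, 0, 0, 0, 0]], dtype=np.uint8)
table = build_hash_table(codes)
print(exact_hash_retrieve(table, codes[0]))  # -> [0, 1]
```

Note that no tree or graph structure is built, and no neighboring codes are enumerated; the number of distinct keys in the table is the (data-dependent) number of clusters, bounded by 2^d.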

2. PRELIMINARIES

2.1. SIMILARITY MODELLING

Similarity learning methods are a class of techniques for learning a similarity function between objects. One successful approach to similarity learning is the "twin network" or "two tower architecture" model, in which two neural network architectures are joined to produce a similarity prediction (Bromley et al. (1994); Chopra et al. (2005); Huang et al. (2013)). The weights of these networks may be shared or not, depending on whether the two input domains are equivalent. Let i ∈ U and j ∈ V be the identities of two objects, where U and V are the two domains across which a similarity function is to be learned. Let φ_u(i) and φ_v(j) be the input representations for the objects (these functions φ may be identity functions if the input domains are discrete). These representations are then transformed through parameterized vector-valued functions f_u(•|θ_u) and f_v(•|θ_v), whose outputs are the learned representations u_i = f_u(φ_u(i)|θ_u) and v_j = f_v(φ_v(j)|θ_v). A loss is then defined using pairwise labels y_ij and an interaction function s(u_i, v_j) which denotes the similarity or relevancy of the pair. Taking f_u to be a mapping from each index i to an independent parameter vector u_i (similarly for f_v and v_j), and taking s(u_i, v_j) = u_i^T v_j with an appropriate loss, results in a variety of matrix factorization approaches (Koren et al. (2009); Lee & Seung (2001); Mnih & Salakhutdinov (2008); Blei et al. (2003); Rendle et al. (2012); Pennington et al. (2014)). Taking f_u to be a neural network mapping a context φ_u(i) to a representation u_i allows for similarity models that readily make use of complex contextual information. Common choices for the interaction function include transformations of Euclidean distance (Chopra et al. (2005)) and cosine similarity: s(u_i, v_j) = u_i^T v_j / (||u_i|| ||v_j||) (Huang et al. (2013)). In addition, the loss can be defined for pairs (Chopra et al. (2005)), triplets (one positive pair, one negative pair) (Rendle et al. (2012); Chechik et al. (2010)), or on larger sets (Huang et al. (2013)).
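As an illustrative sketch of the setup above (names and dimensions are our own, not from the paper), the simplest choice of towers is an embedding lookup per index, i.e. the matrix-factorization special case, combined with the cosine interaction function:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simplest towers: f_u and f_v are embedding lookups, mapping each index
# to an independent parameter vector (the matrix-factorization case).
U = rng.normal(size=(100, 16))  # learned representations u_i for domain U
V = rng.normal(size=(50, 16))   # learned representations v_j for domain V

def cosine_sim(u, v):
    """Interaction function s(u_i, v_j) = u^T v / (||u|| ||v||)."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

s = cosine_sim(U[3], V[7])  # a value in [-1, 1]
```

In a full two-tower model, the lookups would be replaced by neural networks consuming the contexts φ_u(i) and φ_v(j), and U, V would be fit by minimizing a pairwise, triplet, or set-based loss over labels y_ij.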

2.2. LOCALITY SENSITIVE HASHING AND ANGULAR SIMILARITY

A Locality Sensitive Hash (LSH) family F is a distribution of hashes h on a collection of objects Q such that for q_i, q_j ∈ Q (Indyk & Motwani (1998); Gionis et al. (1999); Charikar (2002)), Pr[h(q_i) = h(q_j)] = s(q_i, q_j) for some similarity function s on the objects. SimHash (Charikar (2002)) is an LSH technique developed for document deduplication but may be used in other contexts. For a vector representation q ∈ R^d, SimHash draws a random matrix Z ∈ R^(d×M) with standard Normal entries. The hash h(q_i) ∈ {0, 1}^M is then constructed as h(q_i)_m = 1[q_i^T Z_:m > 0]. Intuitively, SimHash draws random hyperplanes intersecting the origin to separate points. A useful property of this hash function, as stated in Charikar (2002), is that ψ(q_i, q_j) := Pr[h(q_i)_m = h(q_j)_m] = 1 - (1/π) cos^-1( q_i^T q_j / (||q_i|| ||q_j||) ), the angular similarity between q_i and q_j.
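The collision probability above can be checked numerically. The following sketch (our own illustration) implements SimHash as defined and compares the empirical per-bit collision rate of two random vectors against the angular similarity ψ:

```python
import numpy as np

rng = np.random.default_rng(0)

def simhash(q, Z):
    """h(q)_m = 1[q^T Z_:m > 0] for a standard Normal matrix Z in R^(d x M)."""
    return (q @ Z > 0).astype(np.uint8)

d, M = 8, 100_000              # large M to estimate the per-bit collision rate
Z = rng.normal(size=(d, M))    # random hyperplanes through the origin

q_i = rng.normal(size=d)
q_j = rng.normal(size=d)

# Empirical collision probability: fraction of matching hash bits.
empirical = float(np.mean(simhash(q_i, Z) == simhash(q_j, Z)))

# Angular similarity psi(q_i, q_j) = 1 - arccos(cos angle) / pi.
cos_theta = q_i @ q_j / (np.linalg.norm(q_i) * np.linalg.norm(q_j))
angular = 1.0 - np.arccos(cos_theta) / np.pi

print(abs(empirical - angular))  # small for large M
```

With M = 100,000 hash bits the two quantities agree to within a few tenths of a percent, since each bit is an independent Bernoulli(ψ) draw.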
