Draft of an article to be published in IEEE Signal Processing Magazine, Lecture Notes, March 2008.

Locality-Sensitive Hashing: Finding a Needle in a Haystack

Malcolm Slaney (Yahoo! Research) and Michael Casey (Goldsmiths College, University of London)

Fifth Draft! Do not distribute

I. SCOPE

One of the more surprising changes in computing during the last few years is the wealth of data that is now available at our fingertips. We can easily carry in our pockets thousands of songs, hundreds of thousands of images, and hundreds of hours of video. But even with the rapid growth of computer performance, we don’t have the processing power to search this amount of data by brute force. This note describes a technique known as locality-sensitive hashing (LSH) that allows one to quickly find similar entries in large databases of text, images, or music. This approach belongs to a novel and interesting class of algorithms known as randomized algorithms. A randomized algorithm does not guarantee an exact answer, but instead provides a high-probability guarantee that it will return the correct answer or one close to it. By investing additional computational effort in more sampling, the probability can be pushed as high as desired.

II. RELEVANCE

There are many problems that involve finding similar items. These problems are often solved by finding the nearest neighbor to an object in some metric space. This is an easy problem to state, but when the database is large and the objects are complicated, the processing time grows linearly with the number of items and the complexity of the object. LSH is most valuable when searching for near matches (as opposed to exact matches) of high-dimensional items in very large databases. In these searches it can drastically reduce the computational time, at the cost of a small probability of failing to find the absolute closest match.

III. PREREQUISITES

This tutorial note is based on simple geometric reasoning. Some knowledge of probability and a comfort with the mathematics of high-dimensional spaces are useful.

IV. PROBLEM STATEMENT

Given a query point, we wish to find the points in a large database that are closest to the query.
We wish to guarantee with high probability, 1 − δ, that we return the nearest neighbor for any query point.
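As a baseline, the exhaustive search that LSH is meant to avoid can be sketched in a few lines of Python. This is only an illustration: the function name `nearest_neighbor` and the three-point toy database are our own assumptions, and the Euclidean distance is assumed as the metric.

```python
import math

def nearest_neighbor(query, database):
    """Return (index, distance) of the database point closest to the query.

    Every point is visited once, so the cost grows linearly with the
    number of points and with their dimensionality.
    """
    best_index, best_dist = -1, math.inf
    for i, point in enumerate(database):
        d = math.dist(query, point)  # Euclidean distance (Python 3.8+)
        if d < best_dist:
            best_index, best_dist = i, d
    return best_index, best_dist

# Toy data, purely for illustration.
database = [(0.0, 0.0), (3.0, 4.0), (1.0, 1.0)]
index, dist = nearest_neighbor((0.9, 1.2), database)
```

The loop always returns the exact nearest neighbor; the rest of this note trades that certainty for speed.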

Conceptually, this problem is easily solved by iterating through each point in the database and calculating the distance to the query object. But our database may contain billions of objects, each object described by a vector

September 21, 2007

DRAFT

that contains hundreds of dimensions. Therefore, we wish to find a solution that does not depend on a linear search of the database.

A. Trees

In a one-dimensional world, it is easy to search for values by building a tree of objects. Then, given a query, we start at the top node, ask if our query object is to the left or to the right of the current node, and then recursively descend the tree. If the tree is properly constructed, this solves the query problem in O(log N) time, where N is the number of objects. In a one-dimensional world, this is a binary search; the k-d tree algorithm is a multi-dimensional version of this idea [?]. But multi-dimensional algorithms, such as k-d trees, break down when the dimensionality of the search space is greater than a few dimensions: we end up testing nearly all the nodes in the data set and the computational complexity grows to O(N).

B. Hashes

Many search problems are solved using conventional computer hashing algorithms. A hash table is a data structure that allows one to quickly map between a symbol (e.g., a string) and a value. This is done by calculating an arbitrary, pseudo-random function of the symbol that maps the symbol into an integer that indexes a table. Thus a symbol with dozens of characters, and perhaps hundreds of bits of data, is mapped to a relatively small index into the table. A collision occurs when two symbols hash to the same value, and there are special provisions to allow more than one symbol per hash value. But a well-designed hash table allows a symbol lookup in O(1) time with O(N) memory, where N is the number of entries in the table.

A hash table returns an exact match. A well-designed hash function separates two symbols that are close together into different buckets. This makes a hash table a good means of finding exact matches, but not of finding approximate matches. By contrast, a locality-sensitive hash is an efficient means of finding near matches.

V. SOLUTION

LSH is based on a simple idea.
Consider the most general view: if two points are close together, then after a “projection” operation these two points will remain close together. Figure ?? illustrates the basic idea. Two points that are close together on the sphere are also close together when the sphere is imaged (or projected) onto the two-dimensional page. This is true no matter how we rotate the sphere. Two other points on the sphere that are far apart will, for some orientations, be close together on the page, but it is more likely that the points will remain far apart. We will describe two different kinds of projection operators, but thinking about rendering a multi-dimensional sphere onto a two-dimensional page is a good metaphor.

Given a random “projection” operation, we note which points are close to our query. A projection maps a data point from a high-dimensional space to a low-dimensional subspace. We create projections from a number of different directions and keep track of the nearby points. We keep a list of these found points and note the points that appear close to each other in more than one projection. Part of the art of solving this problem is defining a notion of “nearby” (a threshold) so that we keep track of a manageable number of points. Commonly, the projection operation projects the points onto a line, so the similarity test is a simple comparison.

Fig. 1. Splat. Two examples showing projections of two close (circles) and two distant points (squares) onto the printed page.

There are two common ways to do these projections. The original formulation for LSH assumed all points are described by a large number of binary features [?]. Projections are formed by selecting a subset of the dimensions. A better formulation for signal processing applications computes a dot product with Gaussian vectors, creating arbitrary projections [?].

Calculating the distance between two objects, x and y, in a binary feature space is easy with the Hamming distance

$$D = \sum_i \left( x_i \neq y_i \right) \tag{1}$$

where $x_i$ is the value of the $i$th feature for object x and $\neq$ is the exclusive-or operator. We implement a locality-sensitive hash by performing the Hamming calculation over subsets of the dimensions. On average, points that are close together because they share many of the same features will remain close together in a random subspace.

Binary features are not a limitation for signal processing because integers can be represented with a unary code. Thus in an N-bit unary code, bit i for 0 ≤ i < N represents the feature that the value of the number is less than i. In an implementation of LSH, this bit vector does not need to be calculated. Instead, we expand the calculation in Equation ?? to include the numerical comparison for bit i.

A. Random Projections

The key to a locality-sensitive hash of a query point $\vec{v}$ from a real-valued high-dimensional space is the dot product

$$h(\vec{v}) = \vec{v} \cdot \vec{x} \tag{2}$$

where the elements of the vector $\vec{x}$ are chosen at random from a Gaussian distribution, for example $N(0, 1)$. This scalar projection is then quantized into a set of hash bins, with the intention that nearby items in the original space will fall into the same bin. The full hash function is given by

$$h_{x,b}(\vec{v}) = \left\lfloor \frac{\vec{x} \cdot \vec{v} + b}{w} \right\rfloor \tag{3}$$

where $\lfloor \cdot \rfloor$ is the floor operation, $w$ is the width of each quantization bin, and $0 \le b < w$ is a uniformly distributed random variable that makes the quantization error easier to analyze, with no loss in performance.
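The quantized dot product of Equation (3) can be sketched as follows. This is a minimal sketch, not the E2LSH implementation: the helper name `make_hash`, the seeded generator, and the choice w = 4 are our own assumptions for illustration.

```python
import math
import random

def make_hash(dim, w, rng):
    """Build one hash h(v) = floor((x . v + b) / w).

    x has independent N(0, 1) entries and b is uniform on [0, w).
    """
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    b = rng.random() * w

    def h(v):
        projection = sum(xi * vi for xi, vi in zip(x, v))
        return math.floor((projection + b) / w)

    return h

rng = random.Random(0)          # seeded only to make the sketch repeatable
h = make_hash(dim=3, w=4.0, rng=rng)
point = [1.0, 2.0, 3.0]
bin_index = h(point)            # an integer bin number, possibly negative
```

Each call to `make_hash` draws a fresh random direction and offset, so different hashes partition the space along different lines.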

In order for the projection operator to “converge,” it must project nearby points to positions that are close together. Thus, for any points $p$ and $q$ in $\mathbb{R}^d$, we want a high probability, $P_1$, that two close points fall into the same bucket:

$$P_H[h(p) = h(q)] \ge P_1 \quad \text{for } \|p - q\| \le R_1 \tag{4}$$

and we want a low probability, $P_2 < P_1$, that two points that are far apart, $R_2 > R_1$, fall into the same bucket:

$$P_H[h(p) = h(q)] \le P_2 \quad \text{for } \|p - q\| \ge cR_1 = R_2. \tag{5}$$
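The two conditions can be checked empirically by drawing many random hashes and counting how often a near pair and a far pair share a bin. The sketch below is illustrative only: the distances 1 and 4 (so c = 4), the bin width w = 4, the dimension 8, and the trial count are all assumed values.

```python
import math
import random

def collision_rate(p, q, w, trials, rng):
    """Fraction of random hashes floor((x . v + b) / w) that put p and q in the same bin."""
    hits = 0
    for _ in range(trials):
        x = [rng.gauss(0.0, 1.0) for _ in range(len(p))]
        b = rng.random() * w
        hp = math.floor((sum(xi * pi for xi, pi in zip(x, p)) + b) / w)
        hq = math.floor((sum(xi * qi for xi, qi in zip(x, q)) + b) / w)
        hits += (hp == hq)
    return hits / trials

rng = random.Random(1)
origin = [0.0] * 8
near = [1.0] + [0.0] * 7   # distance 1 from the origin, playing the role of R1
far = [4.0] + [0.0] * 7    # distance 4 from the origin, playing the role of R2 = c R1
p_near = collision_rate(origin, near, w=4.0, trials=5000, rng=rng)
p_far = collision_rate(origin, far, w=4.0, trials=5000, rng=rng)
```

With these settings the near pair collides far more often than the far pair, which is the gap between $P_1$ and $P_2$ that the rest of the algorithm amplifies.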

Because of the linearity of the dot product, the difference between two image points, $\|h(p) - h(q)\|_2$, has a magnitude whose distribution is proportional to $\|p - q\|_2$. By this argument we see that $P_1 > P_2$. We further magnify the difference between $P_1$ and $P_2$, thus increasing the performance of each projection, by performing $k$ dot products in parallel. This increases the ratio of the probability of nearby points over not-so-close points:

$$\left( \frac{P_1}{P_2} \right)^k > \frac{P_1}{P_2}. \tag{6}$$

The $k$ independent hashes are a projection we call $g_j$. The projection $g_j$ transforms the query point $\vec{v}$ into $k$ real numbers. For efficient comparisons, we put all points (the query points and all the points in the database) into buckets, quantizing the hash values in Equation ??, with the hope that similar points will fall in the same bucket. The width of the buckets, $w$, determines how many points collide. A small value for $w$ means there is a bigger table and fewer nearest-neighbor points to check; a large value means we have to sort through many points to find the true nearest neighbors.

Within each set of $k$ dot products that form a projection, we achieve success if the query and the nearest neighbor are in the same bin in all $k$ dot products. Hence, the probability of success for one projection, $P_1^k$, falls as we include more dot products. However, when we repeat this $L$ times, only some of the projections will fail to find the nearest neighbor. This gives us additional error tolerance. Thus, we form $L$ of these projections to get the desired level of probability. By increasing $L$ we can find the nearest neighbor with arbitrarily high probability. We will discuss how these parameters are chosen in Section ??.

B. Implementation

At this point, we have mapped a data point into a hash bucket described by $k$ integer indices. This $k$-dimensional space is sparse, but we can use conventional hashes, with no loss of performance, to efficiently find the right bucket. For illustration, we describe the approach used by E2LSH [?], but more sophisticated approaches based on reusing hashes [?], or different spatial tilings [?], are also possible.

Even after the projection operator, one still must find nearest neighbors along a line. A naïve algorithm could easily take O(log N) operations, but we reduce this to O(1) operations using a pair of conventional hash tables.
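To make the roles of $k$ and $L$ concrete, here is a toy index that keeps $L$ tables, each keyed by a tuple of $k$ quantized dot products. It is a sketch under our own assumptions, not the E2LSH code: the class name `LSHIndex` and its methods are hypothetical, and a Python dictionary stands in for the conventional hash table that locates a bucket.

```python
import math
import random
from collections import defaultdict

class LSHIndex:
    """Toy index: L tables, each keyed by a tuple of k quantized dot products."""

    def __init__(self, dim, k, L, w, seed=0):
        rng = random.Random(seed)
        self.w = w
        # One projection g_j is k (Gaussian vector, offset) pairs; there are L of them.
        self.projections = [
            [([rng.gauss(0.0, 1.0) for _ in range(dim)], rng.random() * w)
             for _ in range(k)]
            for _ in range(L)
        ]
        self.tables = [defaultdict(list) for _ in range(L)]
        self.points = []

    def _key(self, projection, v):
        return tuple(
            math.floor((sum(xi * vi for xi, vi in zip(x, v)) + b) / self.w)
            for x, b in projection
        )

    def add(self, v):
        self.points.append(v)
        for projection, table in zip(self.projections, self.tables):
            table[self._key(projection, v)].append(len(self.points) - 1)

    def query(self, v):
        """Return the index of the closest colliding point, or None if nothing collides."""
        candidates = set()
        for projection, table in zip(self.projections, self.tables):
            candidates.update(table.get(self._key(projection, v), []))
        if not candidates:
            return None
        return min(candidates, key=lambda i: math.dist(v, self.points[i]))

index = LSHIndex(dim=2, k=2, L=8, w=4.0)
for p in [(0.0, 0.0), (10.0, 10.0), (-20.0, 5.0)]:
    index.add(p)
hit = index.query((0.0, 0.0))
```

Only the points that collide with the query in at least one table are compared exactly, which is where the savings over a linear scan come from.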


One projection of the LSH algorithm using dot products is shown in Figure ??. This processing puts one point through $k$ dot products and then stores a fingerprint ($T_2$) that describes the $k$-dimensional result into a hash bucket computed by hash $T_1$. Collisions are potential nearest neighbors. This single projection allows us to find nearby points in a small fraction of the time it takes to look at all the points in the database.

Fig. 2. A block diagram of one LSH projection.
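The bucket index $T_1$ and fingerprint $T_2$ in the block diagram can be sketched as two weighted sums of the $k$ integer bin indices, each reduced modulo a prime. Everything specific here is an assumption made for illustration: the two primes, the independently drawn integer weights for the two hashes, and the helper names `fold`, `insert` and `lookup`.

```python
import random

P1 = 1_000_003   # prime used as the table size (assumed value)
P2 = 65_521      # largest prime below 2**16, so fingerprints fit in 16 bits (assumed value)

def make_weights(k, rng, modulus):
    """Random non-zero integer weights used to fold k bin indices into one integer."""
    return [rng.randrange(1, modulus) for _ in range(k)]

def fold(indices, weights, modulus):
    """Weighted sum of the k integer bin indices, reduced modulo a prime."""
    return sum(h * r for h, r in zip(indices, weights)) % modulus

rng = random.Random(0)
k = 4
w1 = make_weights(k, rng, P1)   # weights for the bucket index T1
w2 = make_weights(k, rng, P2)   # separate weights for the fingerprint T2

table = {}  # T1 -> list of (T2 fingerprint, point id)

def insert(indices, point_id):
    t1, t2 = fold(indices, w1, P1), fold(indices, w2, P2)
    table.setdefault(t1, []).append((t2, point_id))

def lookup(indices):
    t1, t2 = fold(indices, w1, P1), fold(indices, w2, P2)
    return [pid for fp, pid in table.get(t1, []) if fp == t2]

insert([3, -1, 0, 7], point_id=0)
insert([3, -1, 1, 7], point_id=1)
matches = lookup([3, -1, 0, 7])
```

Comparing one small fingerprint per stored entry replaces a comparison of all $k$ indices, at the price of a small chance of a false match.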

We first use a conventional hash to map the $k$-dimensional projection output into a single linear index. This is implemented by computing

$$T_1 = \left( \sum_i H_{ij} k_i \right) \bmod P_1 \tag{7}$$

for an arbitrary prime number (and hash-table size) $P_1$. (In this subsection $P_1$ and $P_2$ denote primes, not the collision probabilities defined earlier.) With a well-designed hash, the hash table is efficient, but we still have a chance that two points will collide under $T_1$. Thus, we need to chain the entries in a bucket and verify that we have found the right entry by comparing the query’s $k$-dimensional projection value to the database values. The cost of this comparison grows as $k$ gets larger, in addition to taking more space. Instead, we use a fingerprint, similar to $T_1$, for the projection vector. The fingerprint is calculated with

$$T_2 = \left( \sum_i H_{ij} k_i \right) \bmod P_2. \tag{8}$$

Now when checking to see if the query projection is in the $T_1$ hash bucket, we just compare the fingerprints. Even with a 16-bit fingerprint as determined by $P_2$, the chance of a mistake is very small.

C. s-Stable distributions

As described above, any projection operation can be used to reduce the data to a lower-dimensional space. Forming a dot product with a vector based on a family of random variables from an s-stable distribution simplifies the analysis of the performance of LSH. A weighted sum of random variables from an s-stable distribution has a probability distribution of the same form as the original distribution, up to a scale factor. More formally, a distribution D is s-stable if for any independent, identically distributed (iid) random variables $X_1, \ldots, X_n$ distributed according to D and any real numbers $v_1, \ldots, v_n$ the random variable $\sum_i v_i X_i$ has a probability distribution that is the same as the random variable

$$\left( \sum_i |v_i|^s \right)^{1/s} X \tag{9}$$

where X is drawn from D. For s = 2, or the $L_2$ norm, a Gaussian probability distribution is s-stable.
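The s = 2 case is easy to check numerically: a weighted sum of independent N(0, 1) variables should have the same spread as a single N(0, 1) variable scaled by the $L_2$ norm of the weights. The weights, sample size and seed below are arbitrary choices made for this sketch.

```python
import math
import random
import statistics

rng = random.Random(2)
v = [3.0, -4.0, 12.0]                          # arbitrary weights; their L2 norm is 13
norm = math.sqrt(sum(vi * vi for vi in v))

# Samples of sum_i v_i X_i with independent X_i ~ N(0, 1).
weighted = [sum(vi * rng.gauss(0.0, 1.0) for vi in v) for _ in range(20000)]
# Samples of ||v||_2 X with a single X ~ N(0, 1).
scaled = [norm * rng.gauss(0.0, 1.0) for _ in range(20000)]

std_weighted = statistics.pstdev(weighted)
std_scaled = statistics.pstdev(scaled)
```

Both standard deviations come out close to the norm of the weight vector, which is why the projected distance between two points is governed by their $L_2$ distance.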


Using an s-stable distribution for our projections allows us to analytically describe the performance of LSH. We start by calculating the probability that two points, p and q, separated by distance $u = \|p - q\|_2$, collide and fall into the same hash bucket. The projections of two close points will always be close, but because of quantization they might fall on opposite sides of a bin boundary and thus land in different buckets. The probability that these two points hash to the same value is given by

$$p(u) = \Pr_{a,b}\left[ h_{a,b}(p) = h_{a,b}(q) \right] = \int_0^w \frac{1}{u} f_s\!\left( \frac{t}{u} \right) \left( 1 - \frac{t}{w} \right) dt \tag{10}$$

where $f_s$ is the probability density function (pdf) of the hash H as given by Equation ?? and the $1 - t/w$ term represents the probability that the two points fall in the same bin of width $w$. For any given bucket width, $w$, this probability falls as the distance $u$ grows. This probability can be used to calculate the probabilities in Equations ?? and ?? for an $L_2$ space with $R_1$ equal to the bin width $w$ [?]:

$$P_2 = 1 - 2F(-w/c) - \frac{2}{\sqrt{2\pi}\, w/c} \left( 1 - e^{-w^2/(2c^2)} \right). \tag{11}$$

Here F() is the cumulative distribution function of a Gaussian random variable. $P_1$ is found by setting c = 1. $P_1^k$ is the probability that a nearby point falls within the same bucket as the query in a single projection, so the probability that all $L$ projections fail to produce a collision between the query and the true nearest neighbor is equal to $(1 - P_1^k)^L$.
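The closed form in Equation (11) can be checked against a direct numerical evaluation of the integral in Equation (10). In the sketch below we assume that $f_s$ is the density of the absolute value of a standard Gaussian, which is the choice that reproduces Equation (11); the function names, the midpoint rule and w = 4 are likewise our own choices for illustration.

```python
import math

def gaussian_cdf(z):
    """Cumulative distribution function of a standard Gaussian."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def collision_probability(w, c):
    """Closed form of Equation (11); c = 1 gives P1."""
    a = w / c
    return (1.0 - 2.0 * gaussian_cdf(-a)
            - 2.0 / (math.sqrt(2.0 * math.pi) * a) * (1.0 - math.exp(-a * a / 2.0)))

def collision_integral(w, u, steps=20000):
    """Midpoint-rule evaluation of Equation (10), assuming f_s is the pdf of |N(0, 1)|."""
    dt = w / steps
    total = 0.0
    for i in range(steps):
        t = (i + 0.5) * dt
        f = 2.0 * math.exp(-((t / u) ** 2) / 2.0) / math.sqrt(2.0 * math.pi)
        total += (f / u) * (1.0 - t / w) * dt
    return total

w = 4.0
P1 = collision_probability(w, c=1.0)   # near points
P2 = collision_probability(w, c=4.0)   # points four times farther away
```

The two routes agree to within the accuracy of the numerical integration, and the collision probability drops as the separation grows.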

If we want the probability that our algorithm fails to find the true nearest neighbor to be no more than $\delta$, we require $(1 - P_1^k)^L \le \delta$, so $L$ must be at least

$$L = \left\lceil \frac{\log \delta}{\log(1 - P_1^k)} \right\rceil \tag{12}$$

for a fixed value of $k$ to be determined.
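Equation (12) is simple to evaluate. The sketch below uses assumed values ($P_1 = 0.8$ and $\delta = 0.1$) purely to show how quickly the required number of projections grows with $k$; the function name `tables_needed` is our own.

```python
import math

def tables_needed(p1, k, delta):
    """Smallest integer L with (1 - p1**k)**L <= delta, as in Equation (12)."""
    return math.ceil(math.log(delta) / math.log(1.0 - p1 ** k))

p1 = 0.8       # assumed single-hash collision probability for a near neighbor
delta = 0.1    # acceptable probability of missing the nearest neighbor
L_by_k = {k: tables_needed(p1, k, delta) for k in (1, 5, 10)}
```

Larger $k$ makes each projection more selective but less likely to catch the true neighbor, so more projections are needed to keep the failure probability below $\delta$.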

The amount of time needed to find a nearest neighbor is the time needed to calculate the hash functions, plus the time needed to search the buckets for collisions. Because there are $kL$ dot products of $n$-dimensional vectors, the first time, $T_g$, is $O(nkL)$, where $n$ is the dimensionality of the search space. The second time, $T_c$, increases linearly with the expected number of collisions for each projection, $T_c = O(dLN_c)$, where $d$ is the average number of points in each bucket. The expected number of collisions for a single projection is

$$N_c = \sum_{q' \in D} p^k\left( \|q - q'\| \right) \tag{13}$$

where p() from Equation ?? gives the probability that each point contributes to a collision and D represents all the points in the database. It is easy to see that $T_g$ increases as a function of $k$, while $T_c$ decreases since $p^k < p$ for $p < 1$ and $k > 1$.

The E2LSH algorithm finds the best value for $k$ by experimentally evaluating the cost of the calculation for samples in the given data set. E2LSH scales the data so that $w$ is always equal to 4. In the cover-song experiments described elsewhere [?], E2LSH used between 7 and 14 dot products per hash ($k$) and more than 150 projections ($L$).
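The trade-off can be illustrated by evaluating Equation (13) for a made-up database. The sketch assumes Gaussian projections, so that p() takes the closed form of Equation (11) with the distance in place of c; the database (one neighbor at distance 1 and 999 points at distance 8) and the values of $k$ are hypothetical.

```python
import math

def collision_probability(u, w):
    """Same-bin probability for two points at distance u (Gaussian projections, bin width w)."""
    a = w / u
    cdf = 0.5 * (1.0 + math.erf(-a / math.sqrt(2.0)))   # Gaussian CDF evaluated at -a
    return (1.0 - 2.0 * cdf
            - 2.0 / (math.sqrt(2.0 * math.pi) * a) * (1.0 - math.exp(-a * a / 2.0)))

def expected_collisions(distances, k, w=4.0):
    """N_c of Equation (13): sum over database points of p(distance)**k."""
    return sum(collision_probability(u, w) ** k for u in distances)

# Hypothetical database: one true neighbor at distance 1 and 999 points at distance 8.
distances = [1.0] + [8.0] * 999
nc_by_k = {k: expected_collisions(distances, k) for k in (1, 4, 8)}
neighbor_share = {
    k: collision_probability(1.0, 4.0) ** k / nc_by_k[k] for k in (1, 4, 8)
}
```

As $k$ grows the expected number of colliding points shrinks, and the true neighbor accounts for a growing share of the collisions that remain.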


VI. APPLICATIONS

The LSH algorithm, and related randomized algorithms, make it possible to quickly find nearest neighbors in very large databases. Conventional hashes work well for finding exact matches, but do not help us find neighbors. Instead, a hashing algorithm, as we have described, needs to take into account the locality of the points so that nearby points remain nearby. These locality-sensitive hashes have been applied in a number of domains, which we now use as illustrations of this idea.

A. Web

A randomized algorithm was first applied to finding duplicate pages on the web. The web is full of duplicate pages, partly because content is duplicated across sites, and partly because there is more than one URL that points to the same file on a disk. Yet search engines don’t want to return 10 copies of the same page. One solution is based on shingles: each shingle represents a portion of a web page and is computed by forming a histogram of the words found within that portion of the page. We can test to see if a portion of the page is duplicated elsewhere on the web by looking for other shingles with the same histogram. Given that there are billions of pages on the web, and any portion of any page might be a duplicate, there are a large number of shingles to test. Broder [?] solved this problem by considering random selections, analogous to LSH, to test the similarity of pages. If the shingles of the new page match shingles from the database, then it is likely the new page bears a strong resemblance to an existing page. The nearest-neighbor solution is important because web pages are surrounded by navigational and other information that changes from site to site. An approximate solution to this problem is fine, especially when balanced with the computational savings of a solution like LSH.

B. Image retrieval

A second application of LSH is for object recognition [?]. We compute a detailed metric for many different orientations and configurations of an object we want to recognize. Then, given a new image, we simply check our database to see if a pre-computed object’s metrics are close to our query. This database can contain millions of these poses. Using LSH allows us to quickly check to see if the query object is known. A similar idea was applied to genomic data [?].

C. Music retrieval

We use conventional hashes to find exact musical matches. Fingerprints are representations of an audio signal that are robust to common types of abuse that are performed to audio before it reaches our ears [?]. This can be done, for example, by noting the peaks in the spectrum, because they are robust to noise, and encoding their position in time and space. One then just has to query the database for the same fingerprint. With such robust features, one can use conventional hashing, especially when looking at many samples over time, because one only needs to find one exact match to reduce the search space.

To find similar songs, as might happen when a song is remixed for a new audience, or more dramatically when a different artist performs the same song, we cannot use a fingerprint. Instead, we can use several seconds of the


song, a snippet, as a shingle. To determine if two songs are similar, we need to query the database and see if a large enough number of the query shingles are close to one song in the database [?]. Closeness depends on the feature vector, but long shingles provide specificity, and make LSH more important. This similarity measure is important in the Internet era because we can eliminate duplicates to improve search results, and link recommendation data between similar songs. LSH is important for this nearest-neighbor check because of the size of the database.

VII. CONCLUSIONS - WHAT WE HAVE LEARNED

In this note, we have described the theory and implementation of a randomized algorithm known as locality-sensitive hashing (LSH). Unlike conventional computer hashes that are designed to return exact matches in O(1) time, an LSH algorithm uses dot products with random vectors to quickly find nearest neighbors. LSH provides a probabilistic guarantee that it will return the correct answer. In systems that have other sources of error, perhaps due to mislabeled data or the difficulties of pattern recognition, one can reduce the error due to LSH below other sources of error, and gain significant improvement in computational effort. These randomized algorithms are important in today’s world of Internet-sized databases.

VIII. ACKNOWLEDGMENTS

We appreciate thoughtful comments we have received from Alex Jaffe, Sara Anderson, and several reviewers.

REFERENCES

[1] Alexandr Andoni and Piotr Indyk. E2LSH 0.1 User Manual. http://web.mit.edu/andoni/www/LSH, June 21, 2005.
[2] A. Andoni, M. Datar, N. Immorlica, V. Mirrokni. Locality-sensitive hashing using stable distributions. In Nearest Neighbor Methods in Learning and Vision: Theory and Practice, T. Darrell, P. Indyk, G. Shakhnarovich (eds.), MIT Press, 2006.
[3] Alexandr Andoni and Piotr Indyk. Near-Optimal Hashing Algorithms for Near Neighbor Problem in High Dimensions. In Proceedings of the Symposium on Foundations of Computer Science (FOCS’06), 2006.
[4] J.L. Bentley. Multidimensional binary search trees used for associative searching. Communications of the ACM, 18:509–517, 1975.
[5] Andrei Z. Broder, Steven C. Glassman, Mark S. Manasse, and Geoffrey Zweig. Syntactic clustering of the web. In Proc. of WWW, pages 1157–1166, Santa Clara, CA, 1997.
[6] Jeremy Buhler. Efficient large-scale sequence comparison by locality-sensitive hashing. Bioinformatics, 17:419–428.
[7] P. Cano, E. Batlle, T. Kalker, and J. Haitsma. A review of algorithms for audio fingerprinting. In International Workshop on Multimedia Signal Processing, December 2002.
[8] Michael Casey and Malcolm Slaney. Fast Recognition of Remixed Music Audio. In Proceedings ICASSP 2007, Volume 4, IV-1425–IV-1428, 2007.
[9] Aristides Gionis, Piotr Indyk, and Rajeev Motwani. Similarity Search in High Dimensions via Hashing. In The VLDB Journal, 518–529, 1999.
[10] P. Indyk and R. Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In STOC, 1998.
[11] Gregory Shakhnarovich, Paul Viola, Trevor Darrell. Fast Pose Estimation with Parameter-Sensitive Hashing. In Nearest Neighbor Methods in Learning and Vision: Theory and Practice, T. Darrell, P. Indyk, G. Shakhnarovich (eds.), MIT Press, 2006.

IX. AUTHORS

Malcolm Slaney (Senior Member, IEEE) received his PhD at Purdue University for his work on diffraction tomography. Since the start of his career he has been a researcher at Bell Labs, Schlumberger Palo Alto Research, Apple’s Advanced Technology Lab, Interval Research, IBM Almaden Research Center, and most recently at Yahoo! Research. Since 1990 he has organized the Stanford CCRMA Hearing Seminar, where he now holds the title (Consulting) Professor. He is a coauthor (with A. C. Kak) of the book Principles of Computerized Tomographic Imaging, which has been republished as a Classics in Applied Mathematics volume by SIAM Press. He is a coeditor of the book Computational Models of Hearing. Malcolm once wondered what computer science theory researchers did of any practical importance. He is pleasantly surprised by the applicability of LSH to signal processing problems, and this note is partial penance.

Michael Casey (Member, IEEE) received his PhD from the MIT Media Lab’s Machine Listening Group in 1998 for research in structured audio analysis and synthesis. He was a Research Scientist at Mitsubishi Electric Research Laboratories (MERL) and Professor of Computer Science at Goldsmiths College, University of London, where he still holds the title of Visiting Research Professor, prior to taking up his current post as Professor of Music at Dartmouth College, Hanover, NH, USA. Michael was a co-editor for the Audio section of the MPEG-7 International Standard for Multimedia Content Description and principal investigator for several large grants in the UK in the field of music information retrieval, which is the main subject of his current research.
