Finding documents that are near duplicates of each other is difficult. Exact hashing algorithms do not catch them and pairwise comparisons do not scale. Locality Sensitive Hashing (LSH) is a scalable method for detecting near duplicate content that lets you trade a small amount of accuracy for a large reduction in computation. I will present the theoretical side of LSH and an open source Python implementation of the technique.
Finding exact duplicates is trivially easy (md5(doc1) == md5(doc2)), but finding near duplicates is much harder. Many document collections, from tweets to news text to forum discussions, contain documents that are almost exactly the same, differing only in a few sentences. A hash function such as md5 will not find these near duplicates because they match only at the semantic level, not at the byte level. Semantic duplicates can be found, for instance, by building n-gram shingles from each document and computing the Jaccard similarity of the shingle sets. Done directly this is computationally very expensive, as the similarity has to be computed between every pair of documents.
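As a rough sketch of that baseline (the shingles and jaccard helpers and the char_ngram parameter are illustrative and not part of the library), shingling and exact Jaccard similarity could look like this:

def shingles(text, char_ngram=5):
    # Overlapping character n-grams ("shingles") of the document.
    return set(text[i:i + char_ngram] for i in range(len(text) - char_ngram + 1))

def jaccard(doc1, doc2):
    # Jaccard similarity of the two documents' shingle sets.
    a, b = shingles(doc1), shingles(doc2)
    return len(a & b) / len(a | b)

Calling jaccard on every pair of documents is quadratic in the size of the collection, which is exactly the cost LSH is designed to avoid.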
Locality sensitive hashing (LSH) relies on two components: a hash fingerprint of each document, and a locality sensitive hashing scheme applied to that fingerprint. If the fingerprint is generated using minhash, the probability of a hash collision at any fingerprint position is equal to the Jaccard similarity of the documents, so similar documents are likely to collide. In this talk I will present the idea behind locality sensitive hashing as well as an MIT licensed Python implementation of the method.
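A minimal sketch of those two steps, assuming the shingles helper above (the minhash, estimated_jaccard, and candidate_buckets functions and the num_seeds/bands parameters are invented for this illustration and are not the library's API):

import hashlib

def minhash(shingle_set, num_seeds=100):
    # One seeded hash function per fingerprint position; keep the minimum
    # hash value over the document's shingles. For a single position the
    # probability that two documents agree equals their Jaccard similarity.
    fingerprint = []
    for seed in range(num_seeds):
        fingerprint.append(min(
            int.from_bytes(hashlib.md5(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in shingle_set
        ))
    return fingerprint

def estimated_jaccard(fp1, fp2):
    # The fraction of agreeing positions estimates the Jaccard similarity.
    return sum(a == b for a, b in zip(fp1, fp2)) / len(fp1)

def candidate_buckets(fingerprint, bands=20):
    # The locality sensitive step: split the fingerprint into bands and hash
    # each band. Documents that share any bucket become candidate pairs, and
    # only those candidates need a full similarity check.
    rows = len(fingerprint) // bands
    return {(b, tuple(fingerprint[b * rows:(b + 1) * rows])) for b in range(bands)}

Two documents land in the same bucket for a band only if every position in that band agrees, so the chance of becoming a candidate pair rises sharply with similarity; tuning the number of bands and rows trades recall against the number of comparisons.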
The library is available at https://github.com/mattilyra/lsh