Clustering news stories at scale

Our news aggregator ingests thousands of news stories a day. To cluster similar stories at scale, we use locality-sensitive hashing (LSH) with MinHash, a data mining technique for finding near-duplicate documents. It scales far better than comparing the full bag-of-words representation of every pair of stories.

In this blog post, we'll look at:

  1. The news story clustering challenge
  2. Hashing
  3. Minwise Hashing
  4. Jaccard Similarity Coefficient
  5. Locality-sensitive Hashing (LSH) with MinHash
  6. Execution time of LSH with MinHash