Clustering news stories at scale

Our news aggregator consumes thousands of news stories a day. To cluster similar news stories at scale, we use data mining algorithms such as Locality-sensitive Hashing (LSH) with MinHash. This technique scales much better than comparing the bag-of-words representations of every pair of stories, because only stories that hash to the same bucket need to be compared.

In this blog post, we'll look at:

  1. The news story clustering challenge
  2. Hashing
  3. Minwise Hashing
  4. Jaccard Similarity Coefficient
  5. Locality-sensitive Hashing (LSH) with MinHash
  6. Execution time of LSH with MinHash
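To make the outline above concrete, here is a minimal sketch of Jaccard similarity, MinHash signatures, and LSH banding in Python. It is an illustration under stated assumptions, not Newshound's actual implementation: the function names, the 3-word shingles, the 64-value signature, and the 16-band split are all hypothetical choices made for this example.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

NUM_HASHES = 64                 # signature length (assumed for this sketch)
BANDS = 16                      # number of LSH bands (assumed)
ROWS = NUM_HASHES // BANDS      # signature values per band


def shingles(text, k=3):
    """Return the set of k-word shingles in a text."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Exact Jaccard similarity: |A & B| / |A | B|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def _hash(seed, token):
    """Deterministic 64-bit hash of a token under a given seed."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(token_set):
    """For each seeded hash function, keep the minimum hash over the set."""
    return [min(_hash(seed, t) for t in token_set) for seed in range(NUM_HASHES)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of matching positions, which estimates the Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


def candidate_pairs(signatures):
    """Given {story_id: signature}, return pairs that share a bucket in any band."""
    pairs = set()
    for band in range(BANDS):
        buckets = defaultdict(list)
        for story_id, sig in signatures.items():
            key = tuple(sig[band * ROWS:(band + 1) * ROWS])
            buckets[key].append(story_id)
        for ids in buckets.values():
            pairs.update(combinations(sorted(ids), 2))
    return pairs
```

Only the pairs returned by `candidate_pairs` would then be checked with `estimate_jaccard` (or the exact `jaccard`), which is what avoids comparing every story against every other story. More bands with fewer rows each catch less similar pairs at the cost of more false candidates.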

Under the hood of Newshound's news aggregation platform

In the coming weeks, we will be talking about how our news aggregator works under the hood. Please comment on the articles below to help us tackle our technical challenges better.

#1: Gathering and ranking news stories
#2: Clustering news stories at scale
#3: Discovering the biggest newsmakers of the day
#4: Searching news archives
#5: Analyzing news sentiment for fun and profit

#1: Gathering and ranking news stories

A news aggregator collects many news stories from multiple publishers, which raises the question: how do we surface the most important stories of the day? We use Natural Language Processing techniques such as the bag-of-words model and approximate string matching to come up with the answer.
↣ Click here to read the detailed blog post.
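As a rough illustration of the two techniques named above, the sketch below compares texts with a bag-of-words cosine similarity and compares headlines with approximate string matching from Python's standard `difflib` module. The function names and the choice of cosine similarity are assumptions made for this example; the excerpt does not say how Newshound actually scores or ranks stories.

```python
import math
import re
from collections import Counter
from difflib import SequenceMatcher


def bag_of_words(text):
    """Count word occurrences, ignoring case and punctuation."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two word-count vectors (0.0 to 1.0)."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def headline_ratio(h1, h2):
    """Approximate string match between two headlines (0.0 to 1.0)."""
    return SequenceMatcher(None, h1.lower(), h2.lower()).ratio()
```

One plausible use, again as an assumption: stories whose headlines or bodies score above some threshold are treated as coverage of the same event, and an event covered by many publishers is a candidate for a top story.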

Launching a web-based news aggregator


We are two friends, happy and excited to launch Newshound, a web-based news aggregator. We have been working on it for the past year. We are releasing it publicly today in the hope that others will also find it useful. Ultimately, we hope that you make it a part of your daily life, just as it is for us.

Main features

Newshound gathers news from multiple organizations that publish stories of national interest representing a broad spectrum of political views. An algorithm clusters related news stories and categorizes them into broad sections like politics, business, technology, sports, etc. Op-eds are in a section of their own, separate from straight news reporting. A separate algorithm identifies the top newsmakers of the day.

Roughly once an hour, the top headlines of the moment are published on the website. The page loads quickly in the browser; it is about 1 MB in total. It consists of a few hundred KB of pure HTML and CSS with no JS. Thumbnails that accompany the news stories make up the bulk of the page size.

Everyone gets to see the same news stories. All links go directly to the news publisher's website with no AMP and no redirection involved. There is no algorithm to track what you read and show you more of the same, so there is no filter bubble. The website does not use cookies. There are no ads. AWStats, an open-source web analytics tool, is installed on the server side to collect anonymized, aggregated data about website visits. This data allows us to estimate usage and provision our web servers to handle the expected load.
