Clustering news stories at scale

Our news aggregator consumes thousands of news stories a day. To cluster similar news stories at scale, we use data mining algorithms such as Locality-sensitive Hashing (LSH) with MinHash. This technique dramatically improves the scalability of the bag-of-words model.

In this blog post, we'll look at:

  1. The news story clustering challenge
  2. Hashing
  3. Minwise Hashing
  4. Jaccard Similarity Coefficient
  5. Locality-sensitive Hashing (LSH) with MinHash
  6. Execution time of LSH with MinHash

The news story clustering challenge

In a day, Newshound pulls in thousands of news stories from multiple news sources. Stories that share similar words signal that they probably cover the same news event and so belong to the same cluster. The challenge lies in identifying these clusters as rapidly as the news stories come in, to stay up-to-date with breaking news.

Newshound deploys several algorithms from the field of data mining to meet this challenge. In essence, these algorithms first represent news stories as numbers and then match these numbers to determine how similar the stories are. We break down these algorithms one by one in a simple, non-mathematical way.

Using Natural Language Processing, each news story is first stripped down to its essential words by removing common words like 'a' and 'the'. (These common words are also called stopwords.) Each news story is thus represented as a bag-of-words.

For example,

{
  "publisher": "NPR",
  "headline": "Astronomers Strike Gravitational Gold In Colliding Neutron Stars",
  "url": "https://www.npr.org/sections/thetwo-way/2017/10/16/557557544/astronomers-strike-gravitational-gold-in-colliding-neutron-stars",
  "bagOfWords": [ "astronomers", "strike", "gravitational", "gold", "colliding", "neutron", "stars" ]
}

Hashing

Next, Newshound uses hashing to represent a news story's bag-of-words as a set of numbers.

Mathematically speaking, hashing is a technique where data of arbitrary size is mapped to a value of fixed size using a hash function. In the context of a news aggregator, the data is news stories whose headlines and summaries can consist of an arbitrary number of words. Hashing is used to convert these words into numbers. These numbers are called hash values since they derive from a hash function.

A universal hash function has the form:

  h(x) = (a*x + b) % c
   where 'x' is the input value,
         h(x) is the hash value,
         'a' and 'b' are random numbers between 0 and the maximum hash value, and
         'c' is the prime number just greater than the maximum hash value.

If we want the hash value to be an unsigned 32-bit integer, the input value would also be an unsigned 32-bit integer. The maximum hash value would be the highest value of an unsigned 32-bit integer, which is:

 2^32 - 1 = 4,294,967,295

The prime number 'c' just greater than this maximum hash value is 4,294,967,311. (It's easy to just look it up on a list of known prime numbers.)

If we pick two random numbers, for example, 'a' = 1,319,152,729 and 'b' = 9,549,623,503 (note that 'b' here is larger than the maximum hash value; the modulo by 'c' wraps such values around, so the hash function still works), our hash function then looks like this:

  h(x) = (a*x + b) % c
  h(x) = (1,319,152,729*x + 9,549,623,503) % 4,294,967,311
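As a quick sanity check, this hash function can be sketched in a few lines of Python, using the values of 'a', 'b', and 'c' chosen above (the names are illustrative):

```python
# Universal hash function h(x) = (a*x + b) % c with the constants chosen above.
A = 1_319_152_729
B = 9_549_623_503
C = 4_294_967_311  # smallest prime greater than 2^32 - 1

def h(x: int) -> int:
    """Map an unsigned 32-bit integer x to a hash value below the prime C."""
    return (A * x + B) % C

print(h(2_746_117_606))  # hash of the CRC-32 value for 'astronomers'
```

Python's arbitrary-precision integers make the intermediate product a*x exact, so no overflow handling is needed.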

When a news story is represented as a bag-of-words, each word in the bag needs to be converted into an unsigned 32-bit integer for it to serve as input to the hash function. The usual way to do this is to use the CRC-32 algorithm.

  h('ascii string') => crc-32('ascii string')

For example, the CRC-32 hash for the ASCII string 'astronomers' is 2,746,117,606.

h('astronomers') =  2,746,117,606

Applying the CRC-32 algorithm to the entire bag-of-words, the news story now is:

{
  "publisher": "NPR",
  "headline": "Astronomers Strike Gravitational Gold In Colliding Neutron Stars",
  "url": "https://www.npr.org/sections/thetwo-way/2017/10/16/557557544/astronomers-strike-gravitational-gold-in-colliding-neutron-stars",
  "bagOfWords": [ "astronomers", "strike", "gravitational", "gold", "colliding", "neutron", "stars" ],
  "bagOfWordsCRC32": [ 2746117606, 231155648, 921500173, 1202265425, 4140090749, 3567888722, 18726956 ]
}
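The CRC-32 step can be sketched with Python's standard zlib module (a minimal sketch, assuming the values above come from the standard CRC-32 algorithm, which zlib.crc32 implements):

```python
import zlib

bag_of_words = ["astronomers", "strike", "gravitational", "gold",
                "colliding", "neutron", "stars"]

# CRC-32 maps each word to an unsigned 32-bit integer.
bag_of_words_crc32 = [zlib.crc32(word.encode("ascii")) for word in bag_of_words]
print(bag_of_words_crc32)
```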

Plugging the CRC-32 hash value for 'astronomers' into our universal hash function, we get:

  h(x) = (a*x + b) % c
  h(x) = (a*crc-32(x) + b) % c
  h('astronomers') = (1,319,152,729*2,746,117,606 + 9,549,623,503) % 4,294,967,311
                   = (3,622,548,534,109,846,774 + 9,549,623,503) % 4,294,967,311
                   = 3,622,548,543,659,470,277 % 4,294,967,311
                   = 609,633,111

Thus we have used a universal hash function to transform the string 'astronomers' into the unsigned 32-bit integer 609633111.

'astronomers' -> 609633111

Minwise Hashing

MinHash is a technique for quickly estimating how similar two sets are. In the context of a news aggregator, this technique is applied in large-scale clustering of news stories by estimating the similarity between their bag-of-words.

News stories differ in the size of their individual bag-of-words; some stories have just 7 words (like the one above) while others can be double that. When we need to compare as many as 10,000 news stories, we need to make sure their representation has the same size. For example, if all news stories could be represented as a set of 120 numbers each, they would be easier to compare. One way to do this, i.e. to represent a large, diverse-sized set as a smaller, fixed-sized set, is to use Minwise Hashing.

Using Minwise Hashing, we convert each news story's bag-of-words into a set of numbers, let us say 120 for the sake of the current discussion. This set of numbers is called the signature of the news story and its size (120) plays a major role further ahead when we look for similarities among the stories.

To create the story signature, we pre-select 120 universal hash functions of the type h(x) = (a*x + b) % c.

Each hash function is run on each word in the bag-of-words of a news story. Of all the hash values generated, the lowest is recorded (i.e., the minimum hash value or MinHash).

For example, if the hash function:

  h(x) = (1,319,152,729*x + 9,549,623,503) % 4,294,967,311

is run on each of the words:

[ "astronomers", "strike", "gravitational", "gold", "colliding", "neutron", "stars" ]

we would get the hash values:

[ 609633111, 1801401218, 1489107419, 3700651032, 2748198042, 312278623, 934128849 ]

of which the lowest value (the minimum hash value or MinHash) is 312,278,623.

If we run another 119 hash functions on this bag-of-words, we will get another 119 minimum hash values (or MinHashes). This total set of 120 minimum hash values (or MinHashes) constitutes the news story's numerical signature.

For example,

{
  "publisher": "NPR",
  "headline": "Astronomers Strike Gravitational Gold In Colliding Neutron Stars",
  "url": "https://www.npr.org/sections/thetwo-way/2017/10/16/557557544/astronomers-strike-gravitational-gold-in-colliding-neutron-stars",
  "bagOfWords": [ "astronomers", "strike", "gravitational", "gold", "colliding", "neutron", "stars" ],
  "bagOfWordsCRC32": [ 2746117606, 231155648, 921500173, 1202265425, 4140090749, 3567888722, 18726956 ],
  "minHash": [ 312278623, 1327289309, 708373158, 483657827 ... total 120 MinHashes ]
}

If we use the same 120 hash functions on all the thousands of news stories streaming into the aggregator, we will have 120-element numerical signatures for each news story.
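Putting the pieces together, signature generation can be sketched as follows (a minimal sketch: the 120 (a, b) coefficient pairs are drawn at random once and reused for every story; the seed and helper names are illustrative):

```python
import random
import zlib

C = 4_294_967_311            # smallest prime greater than 2^32 - 1
SIGNATURE_SIZE = 120

# Pre-select 120 universal hash functions by fixing their (a, b) coefficients.
rng = random.Random(42)      # fixed seed so every story uses the same functions
COEFFS = [(rng.randrange(1, C), rng.randrange(0, C)) for _ in range(SIGNATURE_SIZE)]

def minhash_signature(bag_of_words):
    """Return the MinHash signature of a bag-of-words: for each of the
    120 hash functions, the minimum hash value over all words."""
    crc_values = [zlib.crc32(w.encode()) for w in bag_of_words]
    return [min((a * x + b) % C for x in crc_values) for a, b in COEFFS]

sig = minhash_signature(["astronomers", "strike", "gravitational", "gold",
                         "colliding", "neutron", "stars"])
```

Because the same coefficients are reused, two stories with overlapping words tend to produce overlapping signature elements.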

Jaccard Similarity Coefficient

If we compare the 120-element numerical signatures of any 2 stories and find that they have a few elements in common, we can say that they are somewhat similar.

We can quantify the similarity by calculating their Jaccard similarity coefficient. It is mathematically defined as the ratio of the size of the intersection of the two sets to the size of their union.

This value is 0 when the two signatures have no elements in common, 1 when they are identical, and strictly between 0 and 1 when they have a few elements in common.

Two stories are more similar (i.e. their signatures have relatively more elements in common) when their Jaccard similarity coefficient is closer to 1.
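As a sketch, the Jaccard similarity coefficient of two signatures (treated as sets of elements) can be computed like this:

```python
def jaccard(a, b):
    """Jaccard similarity: |intersection| / |union| of two sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

print(jaccard({1, 2, 3, 4}, {2, 3, 4, 5}))  # 3 common out of 5 total -> 0.6
```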

For example, consider the following 2 stories that show similarity in their bag-of-words model:

{
 {
  "publisher": "NPR",
  "headline": "Astronomers Strike Gravitational Gold In Colliding Neutron Stars",
  "url": "https://www.npr.org/sections/thetwo-way/2017/10/16/557557544/astronomers-strike-gravitational-gold-in-colliding-neutron-stars",
  "bagOfWords": [ "astronomers", "strike", "gravitational", "gold", "colliding", "neutron", "stars" ],
  "bagOfWordsCRC32": [ 2746117606, 231155648, 921500173, 1202265425, 4140090749, 3567888722, 18726956 ],
  "minHash": [ 312278623, 1327289309, 708373158, 483657827 ... total 120 MinHashes ]
 },
 {
  "publisher": "The Guardian US",
  "headline": "New frontier for science as astronomers witness neutron stars colliding",
  "url": "https://www.theguardian.com/science/2017/oct/16/astronomers-witness-neutron-stars-collide-global-rapid-response-event-ligo",
  "bagOfWords": [ "new", "frontier", "science", "astronomers", "witness", "neutron", "stars", "colliding" ],
  "bagOfWordsCRC32": [ 1810056261, 3526069090, 1729573288, 2746117606, 3488721422, 3567888722, 18726956, 4140090749 ],
  "minHash": [ 834178841, 2357289300, 608278152, 317659826 ... total 120 MinHashes ]
 }
}

Calculating their Jaccard similarity coefficient from their MinHashes, we get a value of 0.84.

Now consider another pair of stories that do not show similarity in their bag-of-words model:

{
 {
  "publisher": "NPR",
  "headline": "Astronomers Strike Gravitational Gold In Colliding Neutron Stars",
  "url": "https://www.npr.org/sections/thetwo-way/2017/10/16/557557544/astronomers-strike-gravitational-gold-in-colliding-neutron-stars",
  "bagOfWords": [ "astronomers", "strike", "gravitational", "gold", "colliding", "neutron", "stars" ],
  "bagOfWordsCRC32": [ 2746117606, 231155648, 921500173, 1202265425, 4140090749, 3567888722, 18726956 ],
  "minHash": [ 312278623, 1327289309, 708373158, 483657827 ... total 120 MinHashes ]
 },
 {
  "publisher": "Reuters",
  "headline": "Iraq says captures positions south of Kirkuk including airbase",
  "url": "https://www.reuters.com/article/mideast-crisis-iraq-kurds-kirkuk/iraq-says-captures-positions-south-of-kirkuk-including-airbase-idINKBN1CL0PA",
  "bagOfWords": [ "iraq", "says", "captures", "positions", "south", "kirkuk", "including", "airbase" ],
  "bagOfWordsCRC32": [ 1811661031, 640324416, 3420792666, 1091925364, 1246704307, 4229448983, 4256453743 ],
  "minHash": [ 634126840, 1127239327, 409371136, 285652883 ... total 120 MinHashes ]
 }
}

Calculating their Jaccard similarity coefficient from their MinHashes, we get a value of 0.114.

Thus we can see that similar stories have a Jaccard similarity coefficient closer to 1 whereas dissimilar stories have a value closer to 0.

Comparing N stories two at a time by computing their Jaccard similarity coefficient requires N*(N-1)/2 comparisons, which is O(N^2). For N = 10,000 stories, that is roughly 50 million comparisons. This technique does not scale for large and ever-increasing values of N, so we use Locality-sensitive Hashing (LSH) with MinHash instead.

Locality-sensitive Hashing (LSH) with MinHash

Locality-sensitive hashing is a technique that maps large sets of objects down to smaller hash values in such a way that, when two objects have a small distance from each other, their hash values are likely to be the same. In the context of a news aggregator, this implies that similar news stories will have the same hash values.

Using Minwise Hashing, we generate a 120-element signature for each news story. If we have 10,000 news stories, we have 10,000 signatures to compare, each consisting of 120 elements.

We line up all the 10,000 signatures as columns of a table. There would be 120 rows, each table cell representing one element of a signature. We group the rows into bands such that each band contains several rows.

For example, we could divide the 120-element signatures into 40 bands of 3 rows each.

120 (story signature size) = 40 (number of bands) * 3 (number of rows in a band)

Now, we match all 10,000 stories across each of the 40 bands by hashing each band. Stories that are similar will have the same hash value within a band. Thus if we maintain a hash map of stories -> band hash values, we will be able to count how many band hash values are identical, and thus, which stories are similar.

To hash each band, we can once again use the universal hash function that has the form:

  h(x) = (a*x + b) % c
   where 'x' is the input value,
         h(x) is the hash value,
         'a' and 'b' are pre-selected random numbers between 0 and the maximum hash value, and
         'c' is the prime number just greater than the maximum hash value.

The same hash function is used for hashing each of the 40 bands of the 10,000 stories. The hash value of each band of a story is the XOR of the hash values of each row within that band for that story.
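The banding step can be sketched as follows (a minimal sketch: 40 bands of 3 rows, the band hash computed as the XOR of the row hashes as described above; the bucketing dictionary and function names are illustrative):

```python
from collections import defaultdict

NUM_BANDS, ROWS_PER_BAND = 40, 3   # 40 * 3 = 120, the signature size

A, B = 1_319_152_729, 9_549_623_503
C = 4_294_967_311                  # smallest prime greater than 2^32 - 1

def band_hashes(signature):
    """Split a 120-element signature into 40 bands and hash each band."""
    hashes = []
    for band in range(NUM_BANDS):
        rows = signature[band * ROWS_PER_BAND:(band + 1) * ROWS_PER_BAND]
        # Band hash = XOR of the universal hash of each row in the band.
        h = 0
        for row in rows:
            h ^= (A * row + B) % C
        hashes.append(h)
    return hashes

# Bucket stories by (band index, band hash); stories that share a bucket
# in some band become candidates for the same cluster.
buckets = defaultdict(list)

def add_story(story_id, signature):
    for band, h in enumerate(band_hashes(signature)):
        buckets[(band, h)].append(story_id)
```

Counting how often two story IDs land in the same bucket gives the number of matching bands between them.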

Stories that have multiple matching hash values across several bands are more likely to belong to a cluster.

Here are some likely scenarios:

  • 2 stories have all 40 band hash values matching. The stories are likely to be duplicates of each other.
  • 6 stories have 4 bands out of 40 with the same hash value. They are likely to be similar and belong to the same cluster.
  • 2 stories have just 1 band out of 40 with matching hash values. They may not be similar enough to cluster together.

Through trial and error, we can set a threshold for the number of matches; matches above this threshold strongly indicate that the stories are similar, while matches below it probably need another test to be applied. For example, we can set the threshold at 3 matches. Then we can say that if stories have at least 3 bands out of 40 with matching hash values, they are similar enough to cluster together.

Another way to conduct trial and error is to increase or decrease the signature size. For example, we can increase the signature size to 240 MinHashes to be divided into 60 bands of 4 rows each, or 80 bands of 3 rows each.

In general, the greater the number of bands, the more chance of the stories having matching hash values across the bands, thus increasing the false positives (stories that are not actually similar). Conversely, the fewer the number of bands, the lower the chance of stories having matching hash values across the bands, thus increasing the false negatives (missing out on matching stories).
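This trade-off can be quantified. If two stories have a true Jaccard similarity of s, and a band matches exactly when all of its rows match, the probability that the stories match in at least one of b bands of r rows each is 1 - (1 - s^r)^b (see the Mining of Massive Datasets reference at the end of this post). A sketch:

```python
def match_probability(s: float, bands: int = 40, rows: int = 3) -> float:
    """Probability that two stories with Jaccard similarity s
    share an identical hash value in at least one band."""
    return 1 - (1 - s ** rows) ** bands

# More bands shifts the curve left (more candidates, more false positives);
# fewer bands shifts it right (fewer candidates, more false negatives).
print(round(match_probability(0.8), 3))
print(round(match_probability(0.1), 3))
```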

After trial and error, once we settle on a signature size and its corresponding number of bands and rows to ensure that we are not missing matching stories (false negatives), we can eliminate false positives by applying a quality check such as the Jaccard similarity coefficient. For example, through trial and error, we can require the Jaccard similarity coefficient to be above a certain threshold, say 0.778, for a good match.

For example, consider two breaking news stories reported on June 14, 2017:

{
 {
  "publisher": "The Mercury News",
  "headline": "Several reportedly injured in shooting at UPS facility in San Francisco",
  "url": "https://www.mercurynews.com/2017/06/14/several-reportedly-injured-in-shooting-at-ups-facility-in-san-francisco/",
  "bagOfWords": [ "several", "reportedly", "injured", "shooting", "ups", "facility", "san", "francisco" ],
  "bagOfWordsCRC32": [ 220746452, 1063466911, 1006567869, 2526503741, 1263020761, 274306226, 2132416930, 1709995174 ],
  "minHash": [ 620743458, 93462713, 806967162, 1596533840 ... total 120 MinHashes ]
 },
 {
  "publisher": "Washington Post",
  "headline": "Multiple people injured after shooting in Alexandria",
  "url": "https://www.washingtonpost.com/local/public-safety/multiple-people-injured-after-shooting-in-alexandria/2017/06/14/0289c768-50f6-11e7-be25-3a519335381c_story.html",
  "bagOfWords": [ "multiple", "people", "injured", "shooting", "alexandria" ],
  "bagOfWordsCRC32": [ 95911172, 672557606, 2728657021, 2526503741, 2825194882 ],
  "minHash": [ 152913074, 826556324, 738957122, 72043055 ... total 120 MinHashes ]
 }
}

The 2 news stories cover 2 different events and so, intuitively speaking, they should not be clustered together.

To kick off the algorithmic clustering process, we use MinHash to generate a 120-element numerical signature for each story. Then we run the locality-sensitive hashing algorithm to divide the signatures into 40 bands of 3 rows each to find out how many band hash values match.

Let us suppose that the 2 stories have 3 band hash values out of 40 match (which meets the threshold of at least 3 band hash value matches). The diagram below illustrates this.

[Diagram: MinHash with LSH]

Then we calculate their Jaccard similarity coefficient as a quality check. Let us suppose that the Jaccard similarity coefficient is 0.629, which is less than the threshold of 0.778. At this point we can say that the 2 stories are not similar enough to belong to the same cluster, algorithmically speaking.

(This is a non-mathematical explanation of LSH with MinHash with examples. A full mathematical treatment is available in the book Mining of Massive Datasets (1st edition, 2010) by Rajaraman and Ullman, Chapter 3: Finding Similar Items)

Execution time of LSH with MinHash

Newshound consumes thousands of news stories in a day. Generating a k-element numerical signature for a news story using MinHash takes O(k) time (each story's bag-of-words is small and roughly constant in size). Since k is fixed once its value has been established through trial and error, this per-story cost is constant.

Running the locality-sensitive hashing (LSH) algorithm over N stories takes O(N) time, which implies a linear increase in time complexity with increasing number of stories.

Thus LSH with MinHash takes O(k*N) time, which scales linearly with N since k is a constant (and N >> k in most circumstances).

For example, if we take k to be 120, generating a k-element numerical signature for a news story takes around 1 millisecond.

If we take N to be 10,000, the time it takes to run the LSH algorithm is of the order of 3 milliseconds.

Thus the total time taken is:
(10,000 news stories * 1 millisecond per numerical signature) + (3 milliseconds for LSH)
= (10 seconds for signature generation) + (3 milliseconds for LSH)
≈ 10 seconds

Thus if Newshound consumes a thousand news stories and then spends just 1 second clustering them, it becomes possible to keep up with breaking news.


This blog post is part of a series Under the hood of Newshound's news aggregation platform.

Subscribe to our RSS feed to keep up with the latest from Newshound Engineering.