#1: Gathering and ranking news stories

Newshound deploys multiple algorithms to surface the most important news stories of the day.

In this blog post, we'll look at:

  1. Identifying news sources
  2. Gathering news stories
  3. Clustering news stories that cover the same event
  4. Ranking news stories
  5. Displaying news stories
  6. Staying current with breaking news

Identifying news sources

Newshound gets its news stories from multiple sources: news agencies like Associated Press and Reuters; national newspapers like The New York Times and The Wall Street Journal; mass media like ABC, CBS, NBC, Fox and NPR; magazines like The Economist and Rolling Stone; and digital publishers like Guardian US and Gizmodo.

All of Newshound's news sources have been selected because they have their own newsrooms that publish original reporting. In aggregate, their output represents a broad spectrum of political views.

Gathering news stories

Many news sources make their stories available online via RSS feeds (see The New York Times) while some have APIs that require sign-up (see Associated Press). Most news sources classify their news stories into categories like politics, business, technology, sports, etc. (see Reuters) while others publish all of their stories through a single firehose feed (see BBC US edition).

Generally speaking, each news story has a headline, a snippet containing more details about the story, its date and time of publication, a thumbnail of the main photograph associated with the story, and a URL linking to the full story on the publisher's website.
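
To make those fields concrete, here is a minimal sketch of reading them from an RSS 2.0 feed with Python's standard library. The Story class, the parse_rss function and the sample feed are illustrative names and made-up data, not a description of Newshound's production code. Real feeds differ in which elements they provide, and the thumbnail lookup assumes the feed uses the Media RSS extension.

```python
from dataclasses import dataclass
from datetime import datetime
from email.utils import parsedate_to_datetime
from typing import List, Optional
import xml.etree.ElementTree as ET

# Media RSS namespace, which some feeds use to carry thumbnails.
MEDIA_NS = {"media": "http://search.yahoo.com/mrss/"}


@dataclass
class Story:
    headline: str
    snippet: str
    published: Optional[datetime]
    thumbnail_url: Optional[str]
    url: str


def parse_rss(xml_text: str) -> List[Story]:
    """Turn the <item> elements of an RSS 2.0 document into Story records."""
    root = ET.fromstring(xml_text)
    stories = []
    for item in root.iter("item"):
        pub_date = item.findtext("pubDate")
        thumb = item.find("media:thumbnail", MEDIA_NS)
        stories.append(Story(
            headline=(item.findtext("title") or "").strip(),
            snippet=(item.findtext("description") or "").strip(),
            # RSS 2.0 dates use the RFC 822 format, which this helper parses.
            published=parsedate_to_datetime(pub_date) if pub_date else None,
            thumbnail_url=thumb.get("url") if thumb is not None else None,
            url=(item.findtext("link") or "").strip(),
        ))
    return stories


# Made-up feed, used only for illustration.
SAMPLE_FEED = """<rss version="2.0" xmlns:media="http://search.yahoo.com/mrss/">
  <channel>
    <title>Example feed</title>
    <item>
      <title>Example headline</title>
      <description>Example snippet.</description>
      <pubDate>Mon, 16 Oct 2017 14:00:00 GMT</pubDate>
      <link>https://example.com/story</link>
      <media:thumbnail url="https://example.com/thumb.jpg"/>
    </item>
  </channel>
</rss>"""

stories = parse_rss(SAMPLE_FEED)
```

Sources that expose an API instead of RSS would need their own parser, but could fill in the same Story record.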

Clustering news stories that cover the same event

When a newsworthy event occurs, several publications cover it. Clustering those news stories together allows readers to examine media coverage of the event from several different perspectives and come to a balanced understanding.

Newshound uses algorithms to cluster similar news articles from among the thousands that it consumes on a daily basis. These algorithms come from the field of Natural Language Processing. They examine the text of each article, strip it down to its essential words, and look for other articles that have similar words. The closer the similarity, the more likely that the articles belong together in a cluster.

Consider some of the newsworthy events that unfolded on October 16, 2017:

NPR
Astronomers Strike Gravitational Gold In Colliding Neutron Stars
https://www.npr.org/sections/thetwo-way/2017/10/16/557557544/astronomers-strike-gravitational-gold-in-colliding-neutron-stars

Reuters
Iraq says captures positions south of Kirkuk including airbase
https://www.reuters.com/article/mideast-crisis-iraq-kurds-kirkuk/iraq-says-captures-positions-south-of-kirkuk-including-airbase-idINKBN1CL0PA

BBC News
Hurricane Ophelia: Three killed as storm lashes Ireland
https://www.bbc.com/news/uk-northern-ireland-41632835

The Guardian US
Malta car bomb kills Panama Papers journalist
https://www.theguardian.com/world/2017/oct/16/malta-car-bomb-kills-panama-papers-journalist

The New York Times
Iraqi Forces Sweep Into Kirkuk, Checking Kurdish Independence Drive
https://www.nytimes.com/2017/10/16/world/middleeast/kirkuk-iraq-kurds.html

The Guardian US
New frontier for science as astronomers witness neutron stars colliding
https://www.theguardian.com/science/2017/oct/16/astronomers-witness-neutron-stars-collide-global-rapid-response-event-ligo

The field of Natural Language Processing offers a model called Bag-of-Words that considers each news story to be a set of its words after removing common words like 'a' and 'the'. (These common words are also called stopwords.)

Representing each news story as a bag-of-words, we get:

NPR
Astronomers Strike Gravitational Gold In Colliding Neutron Stars
https://www.npr.org/sections/thetwo-way/2017/10/16/557557544/astronomers-strike-gravitational-gold-in-colliding-neutron-stars
{ "astronomers", "strike", "gravitational", "gold", "colliding", "neutron", "stars" }

Reuters
Iraq says captures positions south of Kirkuk including airbase
https://www.reuters.com/article/mideast-crisis-iraq-kurds-kirkuk/iraq-says-captures-positions-south-of-kirkuk-including-airbase-idINKBN1CL0PA
{ "iraq", "says", "captures", "positions", "south", "kirkuk", "including", "airbase" }

BBC News
Hurricane Ophelia: Three killed as storm lashes Ireland
https://www.bbc.com/news/uk-northern-ireland-41632835
{ "hurricane", "ophelia", "three", "killed", "storm", "lashes", "ireland" }

The Guardian US
Malta car bomb kills Panama Papers journalist
https://www.theguardian.com/world/2017/oct/16/malta-car-bomb-kills-panama-papers-journalist
{ "malta", "car", "bomb", "kills", "panama", "papers", "journalist" }

The New York Times
Iraqi Forces Sweep Into Kirkuk, Checking Kurdish Independence Drive
https://www.nytimes.com/2017/10/16/world/middleeast/kirkuk-iraq-kurds.html
{ "iraqi", "forces", "sweep", "kirkuk", "checking", "kurdish", "independence", "drive" }

The Guardian US
New frontier for science as astronomers witness neutron stars colliding
https://www.theguardian.com/science/2017/oct/16/astronomers-witness-neutron-stars-collide-global-rapid-response-event-ligo
{ "new", "frontier", "science", "astronomers", "witness", "neutron", "stars", "colliding" }
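
Bags like the ones above could be produced by a sketch along these lines. It assumes that only the headline is used, that words are lowercased, and that a tiny hand-picked stopword list is enough for these examples. A real system would use a much longer list and would probably look at the snippet too.

```python
import re
from typing import Set

# A tiny, hypothetical stopword list chosen to cover the headlines above.
STOPWORDS = {"a", "an", "and", "as", "for", "in", "into", "of", "the"}


def bag_of_words(headline: str) -> Set[str]:
    """Lowercase the headline, split it into words, and drop stopwords."""
    words = re.findall(r"[a-z0-9]+", headline.lower())
    return {word for word in words if word not in STOPWORDS}


npr_bag = bag_of_words(
    "Astronomers Strike Gravitational Gold In Colliding Neutron Stars"
)
```

Lowercasing matters here: without it, "Kirkuk" in one headline and "kirkuk" in another would not match as strings.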

If each story is compared to every other story by simple string matching on the words in their bags, we get the following two clusters of stories:

Cluster #1:

NPR
Astronomers Strike Gravitational Gold In Colliding Neutron Stars
https://www.npr.org/sections/thetwo-way/2017/10/16/557557544/astronomers-strike-gravitational-gold-in-colliding-neutron-stars
{ "astronomers", "strike", "gravitational", "gold", "colliding", "neutron", "stars" }

The Guardian US
New frontier for science as astronomers witness neutron stars colliding
https://www.theguardian.com/science/2017/oct/16/astronomers-witness-neutron-stars-collide-global-rapid-response-event-ligo
{ "new", "frontier", "science", "astronomers", "witness", "neutron", "stars", "colliding" }

Cluster #2:

Reuters
Iraq says captures positions south of Kirkuk including airbase
https://www.reuters.com/article/mideast-crisis-iraq-kurds-kirkuk/iraq-says-captures-positions-south-of-kirkuk-including-airbase-idINKBN1CL0PA
{ "iraq", "says", "captures", "positions", "south", "kirkuk", "including", "airbase" }

The New York Times
Iraqi Forces Sweep Into Kirkuk, Checking Kurdish Independence Drive
https://www.nytimes.com/2017/10/16/world/middleeast/kirkuk-iraq-kurds.html
{ "iraqi", "forces", "sweep", "kirkuk", "checking", "kurdish", "independence", "drive" }
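
The comparison that yields these two clusters can be sketched as follows. This is a simplified greedy pass, not necessarily how Newshound does it: a story joins the first cluster that contains a story sharing at least min_shared words with it, and min_shared = 1 is an assumption chosen to suit this small example. The labels in BAGS are ours.

```python
from typing import Dict, List, Set

# The six bags from the example, with every word lowercased.
BAGS: Dict[str, Set[str]] = {
    "NPR": {"astronomers", "strike", "gravitational", "gold",
            "colliding", "neutron", "stars"},
    "Reuters": {"iraq", "says", "captures", "positions", "south",
                "kirkuk", "including", "airbase"},
    "BBC News": {"hurricane", "ophelia", "three", "killed", "storm",
                 "lashes", "ireland"},
    "The Guardian US (Malta)": {"malta", "car", "bomb", "kills", "panama",
                                "papers", "journalist"},
    "The New York Times": {"iraqi", "forces", "sweep", "kirkuk", "checking",
                           "kurdish", "independence", "drive"},
    "The Guardian US (neutron stars)": {"new", "frontier", "science",
                                        "astronomers", "witness", "neutron",
                                        "stars", "colliding"},
}


def cluster_stories(bags: Dict[str, Set[str]],
                    min_shared: int = 1) -> List[List[str]]:
    """Greedily group stories whose bags share at least min_shared words."""
    clusters: List[List[str]] = []
    for name, bag in bags.items():
        for members in clusters:
            if any(len(bag & bags[m]) >= min_shared for m in members):
                members.append(name)
                break
        else:
            # No existing cluster matched, so this story starts its own.
            clusters.append([name])
    return clusters


# Keep only clusters with more than one story, as in the listing above.
multi_story_clusters = [c for c in cluster_stories(BAGS) if len(c) > 1]
```

Note that the two Kirkuk stories are joined by the single word "kirkuk": as plain strings, "iraq" and "iraqi" do not match. The hurricane and Malta stories share no words with any other story and remain on their own.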

Now if Newshound consumes another news story shortly afterwards, it knows to add it to an existing cluster with which it shares a similar bag-of-words. For example:

New story:

CNN
First-seen neutron star collision creates light, gravitational waves and gold
https://www.cnn.com/2017/10/16/world/neutron-star-collision-gravitational-waves-light/index.html
{ "first", "seen", "neutron", "star", "collision", "creates", "light", "gravitational", "waves", "gold" }

This story's bag-of-words is similar to those of the stories in Cluster #1: it shares "neutron", "gravitational" and "gold" with the NPR story, and "neutron" with the Guardian US story. After adding it, the cluster is now:

Cluster #1:

NPR
Astronomers Strike Gravitational Gold In Colliding Neutron Stars
https://www.npr.org/sections/thetwo-way/2017/10/16/557557544/astronomers-strike-gravitational-gold-in-colliding-neutron-stars
{ "astronomers", "strike", "gravitational", "gold", "colliding", "neutron", "stars" }

The Guardian US
New frontier for science as astronomers witness neutron stars colliding
https://www.theguardian.com/science/2017/oct/16/astronomers-witness-neutron-stars-collide-global-rapid-response-event-ligo
{ "new", "frontier", "science", "astronomers", "witness", "neutron", "stars", "colliding" }

CNN
First-seen neutron star collision creates light, gravitational waves and gold
https://www.cnn.com/2017/10/16/world/neutron-star-collision-gravitational-waves-light/index.html
{ "first", "seen", "neutron", "star", "collision", "creates", "light", "gravitational", "waves", "gold" }
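
Adding a newly consumed story to an existing cluster can be sketched like this. The function name best_cluster and the rule of picking the cluster with the largest word overlap are our assumptions for illustration.

```python
from typing import List, Optional, Set


def best_cluster(new_bag: Set[str], clusters: List[List[Set[str]]],
                 min_shared: int = 1) -> Optional[int]:
    """Return the index of the cluster sharing the most words, or None."""
    best_index, best_overlap = None, 0
    for index, bags in enumerate(clusters):
        # Score a cluster by its best-matching story.
        overlap = max((len(new_bag & bag) for bag in bags), default=0)
        if overlap >= min_shared and overlap > best_overlap:
            best_index, best_overlap = index, overlap
    return best_index


cluster_1 = [
    {"astronomers", "strike", "gravitational", "gold", "colliding",
     "neutron", "stars"},
    {"new", "frontier", "science", "astronomers", "witness", "neutron",
     "stars", "colliding"},
]
cluster_2 = [
    {"iraq", "says", "captures", "positions", "south", "kirkuk",
     "including", "airbase"},
    {"iraqi", "forces", "sweep", "kirkuk", "checking", "kurdish",
     "independence", "drive"},
]
clusters = [cluster_1, cluster_2]

cnn = {"first", "seen", "neutron", "star", "collision", "creates", "light",
       "gravitational", "waves", "gold"}

index = best_cluster(cnn, clusters)
if index is None:
    clusters.append([cnn])       # no match: the story starts a new cluster
else:
    clusters[index].append(cnn)  # match: the story joins that cluster
```

Because the new story is compared only against stories already in clusters, the rest of the day's stories do not need to be clustered again.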

Ranking news stories

The more important a news story, the more news outlets cover it. Thus, the larger a cluster of stories, the higher it should rank. This heuristic provides a practical way to surface the important stories of the day.

It is important to keep in mind that some news outlets syndicate articles from wire agencies when they are not in a position to provide original coverage on their own. Syndication may happen when local papers license national or international stories from news wires, or when several newspapers pool resources to generate shared content.

Here is an example of syndication: a news story originally published by the Associated Press is syndicated by the Tampa Bay Times and the Chicago Tribune.

Nobel Peace Prize awarded to anti-nuclear campaign group
By Jamey Keaten and Mark Lewis, Associated Press | October 6, 2017
Associated Press: https://apnews.com/26f35c8abeea49ce931e82f0e29b7d5b/Nobel-Peace-Prize-awarded-to-anti-nuclear-campaign-group
Tampa Bay Times: https://www.tampabay.com/news/politics/group-opposing-nuclear-weapons-wins-nobel-peace-prize/2340223/
Chicago Tribune: https://www.chicagotribune.com/nation-world/ct-nobel-peace-prize-ican-20171006-story.html

Such syndicated stories should typically not count towards cluster size as much as original reporting from Pulitzer Prize-winning publications that have their own newsrooms.

Another important factor in ranking clusters is the recency of their news stories. Given two clusters of the same size, the one containing the more recent news stories ranks higher.
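
Here is a minimal sketch of how these two factors could be combined. The weight of 0.25 for a syndicated story is a made-up number, and the Story fields and outlet names are placeholders; the text above only says that syndicated stories should count for less and that recency breaks ties between clusters of the same size.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List

# Hypothetical weight: a syndicated copy counts a quarter of an original.
SYNDICATED_WEIGHT = 0.25


@dataclass
class Story:
    source: str
    published: datetime
    syndicated: bool = False


def cluster_score(stories: List[Story]) -> float:
    """Weighted cluster size: originals count 1.0, syndicated copies less."""
    return sum(SYNDICATED_WEIGHT if s.syndicated else 1.0 for s in stories)


def rank_clusters(clusters: List[List[Story]]) -> List[List[Story]]:
    """Sort by weighted size, then by the most recent story, highest first."""
    return sorted(
        clusters,
        key=lambda stories: (cluster_score(stories),
                             max(s.published for s in stories)),
        reverse=True,
    )


older = [Story("Outlet A", datetime(2017, 10, 16, 9)),
         Story("Outlet B", datetime(2017, 10, 16, 10))]
newer = [Story("Outlet C", datetime(2017, 10, 16, 12)),
         Story("Outlet D", datetime(2017, 10, 16, 13))]
wire = [Story("Wire agency", datetime(2017, 10, 16, 14)),
        Story("Paper 1", datetime(2017, 10, 16, 15), syndicated=True),
        Story("Paper 2", datetime(2017, 10, 16, 16), syndicated=True)]

ranked = rank_clusters([older, wire, newer])
```

In this example the wire cluster has three stories but a weighted size of only 1.5, so it ranks below both two-story clusters. Those two tie on size, and the one with the more recent story comes first.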

Displaying news stories

Newshound displays each news story cluster in the category in which it belongs, listing the higher-ranked clusters at the top and the lower-ranked at the bottom.

Each news cluster displays the 3 latest stories with their headlines, snippets, timestamp of publication and name of the news source. More stories in the cluster are collapsed at the bottom of the cluster, linked to their news source. If any story in the cluster has a thumbnail of a photo associated with it, that thumbnail is displayed as part of the cluster.

Here is an example of how a cluster is displayed:

First-seen neutron star collision creates light, gravitational waves and gold
CNN · Oct 16, 2017 · For the first time, two neutron stars in a nearby galaxy have been observed engaging in a spiral death dance around one another until they collided.

Astronomers Strike Gravitational Gold In Colliding Neutron Stars
NPR · Oct 16, 2017 · The collision of two neutron stars, seen in an artist's rendering, created both gravitational waves and gamma rays. Researchers used those signals to locate the event with optical telescopes.

New frontier for science as astronomers witness neutron stars colliding
The Guardian · Oct 16, 2017 · Extraordinary event has been ‘seen’ for the first time, in both gravitational waves and light – ending decades-old debate about where gold comes from.
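
The display rule described in this section could be sketched as follows. The render_cluster function, the text layout and the "More coverage" label are illustrative assumptions; the actual page is rendered differently.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class Story:
    headline: str
    source: str
    published: datetime
    snippet: str
    thumbnail_url: Optional[str] = None


def render_cluster(stories: List[Story], visible: int = 3) -> str:
    """Show the latest `visible` stories in full and collapse the rest."""
    ordered = sorted(stories, key=lambda s: s.published, reverse=True)
    lines = []
    # Use the first available thumbnail, if any story in the cluster has one.
    thumbnail = next((s.thumbnail_url for s in ordered if s.thumbnail_url), None)
    if thumbnail:
        lines.append(f"[thumbnail: {thumbnail}]")
    for story in ordered[:visible]:
        lines.append(story.headline)
        lines.append(
            f"{story.source} · {story.published:%b %d, %Y} · {story.snippet}"
        )
    collapsed = ordered[visible:]
    if collapsed:
        lines.append("More coverage: " + ", ".join(s.source for s in collapsed))
    return "\n".join(lines)


# Placeholder stories, used only to exercise the function.
sample = [
    Story(f"Headline {hour}", f"Source {hour}",
          datetime(2017, 10, 16, hour), "Snippet")
    for hour in range(1, 5)
]
text = render_cluster(sample)
```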

Staying current with breaking news

Newshound consumes thousands of new stories in a day. Comparing the bag-of-words of each of N stories to that of every other story requires N*(N-1)/2 comparisons. The number of comparisons therefore grows quadratically with N, which is O(N^2).

For example, if we take N to be 6, the number of comparisons is 6 * 5 / 2 = 15.
For N = 1000, the number of comparisons is 1000 * 999 / 2 = 499,500.
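
These counts can be checked with a few lines of Python. The function name pairwise_comparisons is ours, and for a small N the formula is cross-checked against an explicit enumeration of the pairs.

```python
from itertools import combinations


def pairwise_comparisons(n: int) -> int:
    """Number of unordered pairs among n stories: n * (n - 1) / 2."""
    return n * (n - 1) // 2


# Cross-check the formula by listing every pair for a small N.
pairs_for_six = list(combinations(range(6), 2))

counts = {n: pairwise_comparisons(n) for n in (6, 1000, 10000)}
for n, count in counts.items():
    print(f"N = {n}: {count:,} comparisons")
```

Multiplying N by 10, from 1,000 to 10,000, multiplies the number of comparisons by roughly 100.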

If each comparison takes 1 millisecond, the total time taken is:
N = 6, comparisons = 15, time taken = 15 * 1 ms = 15 ms.
N = 1000, comparisons = 499,500, time taken = 499,500 * 1 ms = 499.5 seconds = 8.325 minutes.

Eight minutes for a thousand stories may sound tolerable, but the cost grows quadratically: at N = 10,000, the 49,995,000 comparisons would take nearly 14 hours. Since Newshound consumes thousands of news stories a day, it would become impossible to keep up with breaking news. In effect, this approach of comparing every news story with every other news story does not scale. A different approach is required and that is precisely what we'll discuss in our next blog post.
