Gathering and ranking news stories
Newshound deploys multiple algorithms to surface the most important news stories of the day.
In this blog post, we'll look at:
- Identifying news sources
- Gathering news stories
- Clustering news stories that cover the same event
- Ranking news stories
- Displaying news stories
- Staying current with breaking news
Identifying news sources
Newshound gets its news stories from multiple sources: news agencies like Associated Press and Reuters; national newspapers like The New York Times and The Wall Street Journal; mass media like ABC, CBS, NBC, Fox and NPR; magazines like The Economist and Rolling Stone; and digital publishers like Guardian US and Gizmodo.
All of Newshound's news sources have been carefully selected because they have their own newsrooms that publish original reporting. In aggregate, their output represents a broad spectrum of political views.
Gathering news stories
Many news sources make their stories available online via RSS feeds (see The New York Times) while some have APIs that require sign-up (see Associated Press). Most news sources classify their news stories into categories like politics, business, technology, sports, etc. (see Reuters) while others have all of theirs come out of a single fire hose (see BBC US edition).
Generally speaking, each news story has a headline, a snippet containing more details about the story, its date and time of publication, a thumbnail of the main photograph associated with the story, and a URL linking to the full story on the publisher's website.
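As a rough illustration of the fields listed above, here is a minimal sketch that reads an RSS 2.0 feed using only the Python standard library. The Story class, the parse_rss function and the sample feed are hypothetical names and data invented for this example; they are not Newshound's actual code, and real feeds vary in which fields they provide.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from datetime import datetime
from email.utils import parsedate_to_datetime
from typing import List, Optional

# Hypothetical sample feed, invented for illustration only.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example News</title>
    <item>
      <title>Example headline</title>
      <description>Example snippet with more details.</description>
      <pubDate>Mon, 16 Oct 2017 14:00:00 GMT</pubDate>
      <link>https://example.com/story-1</link>
    </item>
  </channel>
</rss>"""


@dataclass
class Story:
    publisher: str
    headline: str
    snippet: str
    published: Optional[datetime]
    url: str
    # Thumbnails are usually carried in feed-specific extensions,
    # so this sketch leaves the field empty.
    thumbnail_url: Optional[str] = None


def parse_rss(xml_text: str, publisher: str) -> List[Story]:
    """Turn every <item> of an RSS 2.0 document into a Story."""
    root = ET.fromstring(xml_text)
    stories = []
    for item in root.iter("item"):
        pub_date = item.findtext("pubDate")
        stories.append(Story(
            publisher=publisher,
            headline=(item.findtext("title") or "").strip(),
            snippet=(item.findtext("description") or "").strip(),
            # RSS 2.0 dates use the RFC 822 format, which email.utils can parse.
            published=parsedate_to_datetime(pub_date) if pub_date else None,
            url=(item.findtext("link") or "").strip(),
        ))
    return stories


stories = parse_rss(SAMPLE_RSS, "Example News")
```

Feeds that sit behind an API with sign-up would need their own client, but could plausibly be mapped onto the same Story shape.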
Clustering news stories that cover the same event
When a newsworthy event occurs, several publications cover it. Clustering those news stories together allows readers to examine media coverage of the event from several different perspectives and come to a balanced understanding.
Newshound uses algorithms from the field of Natural Language Processing to cluster similar news articles from among the thousands that it consumes daily. These algorithms examine the text of each article, strip it down to its essential words, and look for other articles that have similar words. The closer the similarity, the more likely it is that the articles belong together in a cluster.
Consider some of the newsworthy events that unfolded on October 16, 2017:
[ { "publisher": "NPR", "headline": "Astronomers Strike Gravitational Gold In Colliding Neutron Stars", "url": "https://www.npr.org/sections/thetwo-way/2017/10/16/557557544/astronomers-strike-gravitational-gold-in-colliding-neutron-stars" }, { "publisher": "Reuters", "headline": "Iraq says captures positions south of Kirkuk including airbase", "url": "https://www.reuters.com/article/mideast-crisis-iraq-kurds-kirkuk/iraq-says-captures-positions-south-of-kirkuk-including-airbase-idINKBN1CL0PA" }, { "publisher": "BBC News", "headline": "Hurricane Ophelia: Three killed as storm lashes Ireland", "url": "https://www.bbc.com/news/uk-northern-ireland-41632835" }, { "publisher": "The Guardian US", "headline": "Malta car bomb kills Panama Papers journalist", "url": "https://www.theguardian.com/world/2017/oct/16/malta-car-bomb-kills-panama-papers-journalist" }, { "publisher": "The New York Times", "headline": "Iraqi Forces Sweep Into Kirkuk, Checking Kurdish Independence Drive", "url": "https://www.nytimes.com/2017/10/16/world/middleeast/kirkuk-iraq-kurds.html" }, { "publisher": "The Guardian US", "headline": "New frontier for science as astronomers witness neutron stars colliding", "url": "https://www.theguardian.com/science/2017/oct/16/astronomers-witness-neutron-stars-collide-global-rapid-response-event-ligo" } ]
The field of Natural Language Processing offers a model called bag-of-words that considers each news story to be a set of its words after removing common words like 'a' and 'the'. (These common words are also called stopwords.)
Representing each news story as a bag-of-words, we get:
[ { "publisher": "NPR", "headline": "Astronomers Strike Gravitational Gold In Colliding Neutron Stars", "url": "https://www.npr.org/sections/thetwo-way/2017/10/16/557557544/astronomers-strike-gravitational-gold-in-colliding-neutron-stars", "bagOfWords": [ "astronomers", "strike", "gravitational", "gold", "colliding", "neutron", "stars" ] }, { "publisher": "Reuters", "headline": "Iraq says captures positions south of Kirkuk including airbase", "url": "https://www.reuters.com/article/mideast-crisis-iraq-kurds-kirkuk/iraq-says-captures-positions-south-of-kirkuk-including-airbase-idINKBN1CL0PA", "bagOfWords": [ "iraq", "says", "captures", "positions", "south", "kirkuk", "including", "airbase" ] }, { "publisher": "BBC News", "headline": "Hurricane Ophelia: Three killed as storm lashes Ireland", "url": "https://www.bbc.com/news/uk-northern-ireland-41632835", "bagOfWords": [ "hurricane", "ophelia", "three", "killed", "storm", "lashes", "ireland" ] }, { "publisher": "The Guardian US", "headline": "Malta car bomb kills Panama Papers journalist", "url": "https://www.theguardian.com/world/2017/oct/16/malta-car-bomb-kills-panama-papers-journalist", "bagOfWords": [ "malta", "car", "bomb", "kills", "panama", "papers", "journalist" ] }, { "publisher": "The New York Times", "headline": "Iraqi Forces Sweep Into Kirkuk, Checking Kurdish Independence Drive", "url": "https://www.nytimes.com/2017/10/16/world/middleeast/kirkuk-iraq-kurds.html", "bagOfWords": [ "iraqi", "forces", "sweep", "kirkuk", "checking", "kurdish", "independence", "drive" ] }, { "publisher": "The Guardian US", "headline": "New frontier for science as astronomers witness neutron stars colliding", "url": "https://www.theguardian.com/science/2017/oct/16/astronomers-witness-neutron-stars-collide-global-rapid-response-event-ligo", "bagOfWords": [ "new", "frontier", "science", "astronomers", "witness", "neutron", "stars", "colliding" ] } ]
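A bag-of-words like the ones above could be derived from a headline along the following lines. This is a minimal sketch, not Newshound's implementation: the bag_of_words function name and the tiny STOPWORDS list are assumptions made for illustration, and a production stopword list would be much longer.

```python
import re
from typing import List

# A deliberately tiny stopword list (an assumption for this sketch).
STOPWORDS = {"a", "an", "the", "in", "of", "as", "for", "into", "and", "to", "on"}


def bag_of_words(headline: str) -> List[str]:
    """Lowercase the headline, split it into words, and drop stopwords."""
    # Runs of letters and digits count as words, so punctuation and
    # hyphens act as separators ("First-seen" becomes "first", "seen").
    words = re.findall(r"[a-z0-9]+", headline.lower())
    return [word for word in words if word not in STOPWORDS]


npr_bag = bag_of_words(
    "Astronomers Strike Gravitational Gold In Colliding Neutron Stars"
)
```

For the NPR headline, the only word removed is the stopword "in", leaving the seven words shown in the example data.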
If we compare each story's bag-of-words to every other story's using a simple string match, the Hurricane Ophelia and Malta stories share no words with any other story, while the remaining four stories fall into the following two clusters:
Cluster #1:
[ { "publisher": "NPR", "headline": "Astronomers Strike Gravitational Gold In Colliding Neutron Stars", "url": "https://www.npr.org/sections/thetwo-way/2017/10/16/557557544/astronomers-strike-gravitational-gold-in-colliding-neutron-stars", "bagOfWords": [ "astronomers", "strike", "gravitational", "gold", "colliding", "neutron", "stars" ] }, { "publisher": "The Guardian US", "headline": "New frontier for science as astronomers witness neutron stars colliding", "url": "https://www.theguardian.com/science/2017/oct/16/astronomers-witness-neutron-stars-collide-global-rapid-response-event-ligo", "bagOfWords": [ "new", "frontier", "science", "astronomers", "witness", "neutron", "stars", "colliding" ] } ]
Cluster #2:
[ { "publisher": "Reuters", "headline": "Iraq says captures positions south of Kirkuk including airbase", "url": "https://www.reuters.com/article/mideast-crisis-iraq-kurds-kirkuk/iraq-says-captures-positions-south-of-kirkuk-including-airbase-idINKBN1CL0PA", "bagOfWords": [ "iraq", "says", "captures", "positions", "south", "kirkuk", "including", "airbase" ] }, { "publisher": "The New York Times", "headline": "Iraqi Forces Sweep Into Kirkuk, Checking Kurdish Independence Drive", "url": "https://www.nytimes.com/2017/10/16/world/middleeast/kirkuk-iraq-kurds.html", "bagOfWords": [ "iraqi", "forces", "sweep", "kirkuk", "checking", "kurdish", "independence", "drive" ] } ]
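The pairwise comparison that produces these clusters can be sketched as follows. This is one possible approach under stated assumptions, not necessarily how Newshound does it: two stories are linked when their bags share at least min_shared words, and linked stories are merged with a small union-find structure. The function name, the min_shared parameter and the short story ids are all invented for illustration.

```python
from itertools import combinations
from typing import Dict, List, Set

# Short ids are hypothetical labels for the six stories in the example.
BAGS: Dict[str, Set[str]] = {
    "npr-neutron": {"astronomers", "strike", "gravitational", "gold",
                    "colliding", "neutron", "stars"},
    "reuters-kirkuk": {"iraq", "says", "captures", "positions", "south",
                       "kirkuk", "including", "airbase"},
    "bbc-ophelia": {"hurricane", "ophelia", "three", "killed", "storm",
                    "lashes", "ireland"},
    "guardian-malta": {"malta", "car", "bomb", "kills", "panama", "papers",
                       "journalist"},
    "nyt-kirkuk": {"iraqi", "forces", "sweep", "kirkuk", "checking",
                   "kurdish", "independence", "drive"},
    "guardian-neutron": {"new", "frontier", "science", "astronomers",
                         "witness", "neutron", "stars", "colliding"},
}


def cluster_by_shared_words(bags: Dict[str, Set[str]],
                            min_shared: int = 1) -> List[List[str]]:
    """Group stories whose bags share at least min_shared exact words."""
    parent = {key: key for key in bags}

    def find(key: str) -> str:
        # Follow parent links to the representative of key's group.
        while parent[key] != key:
            parent[key] = parent[parent[key]]
            key = parent[key]
        return key

    # Compare every story with every other story exactly once.
    for a, b in combinations(bags, 2):
        if len(bags[a] & bags[b]) >= min_shared:
            parent[find(a)] = find(b)

    groups: Dict[str, List[str]] = {}
    for key in bags:
        groups.setdefault(find(key), []).append(key)
    # Stories that matched nothing are left out of the result.
    return [members for members in groups.values() if len(members) > 1]


clusters = cluster_by_shared_words(BAGS)
```

Because the match is on exact strings, "iraq" and "iraqi" do not count as shared; the two Kirkuk stories are linked only through the word "kirkuk".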
Now if Newshound consumes another news story shortly afterwards, it knows to add it to an existing cluster with which it shares a similar bag-of-words. For example:
New story:
{ "publisher": "CNN", "headline": "First-seen neutron star collision creates light, gravitational waves and gold", "url": "https://www.cnn.com/2017/10/16/world/neutron-star-collision-gravitational-waves-light/index.html", "bagOfWords": [ "first", "seen", "neutron", "star", "collision", "creates", "light", "gravitational", "waves", "gold" ] }
This story's bag-of-words shares words such as "neutron", "gravitational" and "gold" with the stories in Cluster #1. After adding the story, the cluster becomes:
Cluster #1:
[ { "publisher": "NPR", "headline": "Astronomers Strike Gravitational Gold In Colliding Neutron Stars", "url": "https://www.npr.org/sections/thetwo-way/2017/10/16/557557544/astronomers-strike-gravitational-gold-in-colliding-neutron-stars", "bagOfWords": [ "astronomers", "strike", "gravitational", "gold", "colliding", "neutron", "stars" ] }, { "publisher": "The Guardian US", "headline": "New frontier for science as astronomers witness neutron stars colliding", "url": "https://www.theguardian.com/science/2017/oct/16/astronomers-witness-neutron-stars-collide-global-rapid-response-event-ligo", "bagOfWords": [ "new", "frontier", "science", "astronomers", "witness", "neutron", "stars", "colliding" ] }, { "publisher": "CNN", "headline": "First-seen neutron star collision creates light, gravitational waves and gold", "url": "https://www.cnn.com/2017/10/16/world/neutron-star-collision-gravitational-waves-light/index.html", "bagOfWords": [ "first", "seen", "neutron", "star", "collision", "creates", "light", "gravitational", "waves", "gold" ] } ]
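Assigning a newly arrived story to an existing cluster might look like the sketch below. The scoring rule (the largest number of words shared with any single member of a cluster) and the names best_cluster and min_shared are assumptions chosen for illustration; the post does not specify how Newshound measures similarity here.

```python
from typing import List, Optional, Set


def best_cluster(new_bag: Set[str],
                 clusters: List[List[Set[str]]],
                 min_shared: int = 1) -> Optional[int]:
    """Return the index of the best-matching cluster, or None if none match."""
    best_index, best_score = None, 0
    for index, cluster in enumerate(clusters):
        # Score a cluster by its single most similar member.
        score = max((len(new_bag & bag) for bag in cluster), default=0)
        if score >= min_shared and score > best_score:
            best_index, best_score = index, score
    return best_index


neutron_cluster = [
    {"astronomers", "strike", "gravitational", "gold", "colliding",
     "neutron", "stars"},
    {"new", "frontier", "science", "astronomers", "witness", "neutron",
     "stars", "colliding"},
]
kirkuk_cluster = [
    {"iraq", "says", "captures", "positions", "south", "kirkuk",
     "including", "airbase"},
    {"iraqi", "forces", "sweep", "kirkuk", "checking", "kurdish",
     "independence", "drive"},
]
cnn_bag = {"first", "seen", "neutron", "star", "collision", "creates",
           "light", "gravitational", "waves", "gold"}

index = best_cluster(cnn_bag, [neutron_cluster, kirkuk_cluster])
if index is not None:
    [neutron_cluster, kirkuk_cluster][index].append(cnn_bag)
```

A story that matches no cluster returns None and could start a new cluster of its own.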
Ranking news stories
The more important a news story, the more news outlets tend to cover it. Thus, the larger a cluster of stories, the higher it should rank. This heuristic provides a practical approach to surfacing the important stories of the day.
It is important to keep in mind that some news outlets reprint articles from a press syndicate when they are not in a position to provide original coverage on their own. Syndication may happen when local papers license national or international stories from news wires, or when several newspapers pool resources to generate shared content.
Here is an example of syndication: a news story originally published by the Associated Press is syndicated by the Tampa Bay Times and the Chicago Tribune.
{ "publisher": "Associated Press", "headline": "Nobel Peace Prize awarded to anti-nuclear campaign group", "url": "https://apnews.com/26f35c8abeea49ce931e82f0e29b7d5b/Nobel-Peace-Prize-awarded-to-anti-nuclear-campaign-group", "date": "2017-10-06", "syndication": [ { "publisher": "Tampa Bay Times", "url": "https://www.tampabay.com/news/politics/group-opposing-nuclear-weapons-wins-nobel-peace-prize/2340223/" }, { "publisher": "Chicago Tribune", "url": "https://www.chicagotribune.com/nation-world/ct-nobel-peace-prize-ican-20171006-story.html" } ] }
Such syndicated stories should typically not count towards cluster size as much as original reporting from Pulitzer Prize-winning publications that have their own newsrooms.
Another important factor in ranking clusters is the recency of their news stories. Given two clusters of the same size, the one containing the more recent news stories ranks higher.
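The two ranking signals described above, cluster size discounted for syndication and recency as a tie-breaker, could be combined as in this minimal sketch. The 0.25 weight for syndicated copies is an arbitrary assumption, as are the names RankedStory, cluster_score and rank_clusters; the post does not state the actual weights Newshound uses.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List

# Assumed discount for a syndicated copy relative to original reporting.
SYNDICATED_WEIGHT = 0.25


@dataclass
class RankedStory:
    publisher: str
    published: datetime
    syndicated: bool = False


def cluster_score(cluster: List[RankedStory]) -> float:
    """Weighted size: originals count 1.0, syndicated copies count less."""
    return sum(SYNDICATED_WEIGHT if story.syndicated else 1.0
               for story in cluster)


def rank_clusters(clusters: List[List[RankedStory]]) -> List[List[RankedStory]]:
    """Sort by weighted size, then by most recent story; clusters must be non-empty."""
    return sorted(
        clusters,
        key=lambda c: (cluster_score(c), max(s.published for s in c)),
        reverse=True,
    )


older = [RankedStory("A", datetime(2017, 10, 16, 10)),
         RankedStory("B", datetime(2017, 10, 16, 11))]
newer = [RankedStory("C", datetime(2017, 10, 16, 9)),
         RankedStory("D", datetime(2017, 10, 16, 12))]
mostly_syndicated = [RankedStory("E", datetime(2017, 10, 16, 13)),
                     RankedStory("F", datetime(2017, 10, 16, 13), syndicated=True),
                     RankedStory("G", datetime(2017, 10, 16, 13), syndicated=True)]

ranked = rank_clusters([older, mostly_syndicated, newer])
```

Here the cluster with two syndicated copies scores 1.5 and ranks below both two-story clusters of original reporting, even though it contains more stories and the most recent one.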
Displaying news stories
Newshound displays each news story cluster in the category in which it belongs, listing the higher-ranked clusters at the top and the lower-ranked at the bottom.
Each news cluster may display its most recent stories with their headlines, snippets, publication timestamps and news source names. The remaining stories in the cluster may be collapsed at the bottom of the cluster, each linked to its news source. If any story in the cluster has a thumbnail of a photo associated with it, that thumbnail may be displayed as part of the cluster.
Here is a simple example of how a cluster of stories might be displayed:
First-seen neutron star collision creates light, gravitational waves and gold · CNN · Oct 16, 2017 · For the first time, two neutron stars in a nearby galaxy have been observed engaging in a spiral death dance around one another until they collided.
Astronomers Strike Gravitational Gold In Colliding Neutron Stars · NPR · Oct 16, 2017 · The collision of two neutron stars, seen in an artist's rendering, created both gravitational waves and gamma rays. Researchers used those signals to locate the event with optical telescopes.
New frontier for science as astronomers witness neutron stars colliding · The Guardian · Oct 16, 2017 · Extraordinary event has been ‘seen’ for the first time, in both gravitational waves and light – ending decades-old debate about where gold comes from.
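A display layer along these lines could be sketched as follows. Everything here is illustrative: the DisplayStory class, the top_n cut-off of three visible stories and the exact line format are assumptions, not a description of Newshound's front end.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Tuple

MONTHS = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")


@dataclass
class DisplayStory:
    headline: str
    publisher: str
    published: date
    snippet: str


def format_story(story: DisplayStory) -> str:
    """Render one story as 'headline · publisher · date · snippet'."""
    d = story.published
    when = f"{MONTHS[d.month - 1]} {d.day}, {d.year}"
    return " · ".join([story.headline, story.publisher, when, story.snippet])


def split_for_display(cluster: List[DisplayStory],
                      top_n: int = 3) -> Tuple[List[DisplayStory], List[DisplayStory]]:
    """Return (visible, collapsed) stories, most recent first."""
    ordered = sorted(cluster, key=lambda s: s.published, reverse=True)
    return ordered[:top_n], ordered[top_n:]


example = DisplayStory(
    headline="Astronomers Strike Gravitational Gold In Colliding Neutron Stars",
    publisher="NPR",
    published=date(2017, 10, 16),
    snippet="Placeholder snippet.",
)
line = format_story(example)
```

The month names are spelled out in a tuple rather than taken from the system locale so that the output does not change from one machine to another.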
Staying current with breaking news
Newshound consumes thousands of news stories in a day. Comparing each story's bag-of-words to every other story's bag-of-words would require N*(N-1)/2 comparisons, which grows as O(N^2).
For example, if we take N to be 6, the number of comparisons is 6 * 5 / 2 = 15.
For N = 1000, the number of comparisons is 1000 * 999 / 2 = 499,500.
If each comparison takes 1 millisecond, the total time taken is:
N = 6, comparisons = 15, time taken = 15 * 1 ms = 15 ms.
N = 1000, comparisons = 499,500, time taken = 499,500 * 1 ms = 499.5 seconds, or about 8.3 minutes.
Because the cost grows quadratically, ten times as many stories take roughly a hundred times as long: for N = 10,000, the 49,995,000 comparisons would take about 13.9 hours. If Newshound consumes thousands of news stories and then spends hours clustering them, it becomes impossible to keep up with breaking news. In effect, this approach of comparing every news story with every other news story does not scale. A different approach is required, and that is precisely what we'll discuss in our next blog post.
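The pair-counting arithmetic can be checked with a few lines of code. This is a small sketch; the function names are invented for illustration, and the 1 ms cost per comparison is the same simplifying assumption used in the text.

```python
def pairwise_comparisons(n: int) -> int:
    """Number of distinct pairs among n stories: n * (n - 1) / 2."""
    return n * (n - 1) // 2


def total_seconds(n: int, ms_per_comparison: float = 1.0) -> float:
    """Total comparison time in seconds at a fixed cost per comparison."""
    return pairwise_comparisons(n) * ms_per_comparison / 1000.0


for n in (6, 1000, 10000):
    print(n, pairwise_comparisons(n), total_seconds(n))
```

Converting milliseconds to seconds means dividing by 1,000, so 499,500 ms comes to 499.5 seconds.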
This blog post is part of the series "Under the hood of Newshound's news aggregation platform".
Subscribe to our RSS feed to keep up with the latest from Newshound Engineering.