Can Data Science Deliver a Fake News Detector?
Regardless of your political opinion, fake news has dominated the conversation since the 2016 US presidential election. The crux of the problem is that the very definition of what qualifies as fake news is in dispute. Still, most of us would like to know if the news story we’re reading reflects actual events – or if it’s intentionally misrepresenting those events to mislead the public.
The Fake News Detector Test
Although imperfect, and therefore inadmissible in court, a lie detector test is generally accepted as a good indicator of truthfulness. Mental disorders aside, we all have a good idea of what the truth is, and of when we’re lying about it. If you think you’re lying, your physiological response will show as much.
Let’s assume purveyors of fake news stories are very much aware they are not engaging in ethical journalism (to put it mildly). So, couldn’t data scientists and engineers come up with a “Fake News Detector” based on nothing but the web data trail news articles generate? Most believe it’s possible. As of this writing, however, nobody’s pulled it off just yet. We’re hoping that is about to change, thanks to several teams working with data pulled from the webhose.io web data feed API.
The Training Dataset Problem
Any data scientist will tell you that creating a machine learning model for fake news requires at least two distinct datasets: one of news articles you know are fake, and another of confirmed legitimate news stories. Acquiring a high-quality dataset of actual news articles takes some work but is a straightforward task. For better or worse, confirmed fake news stories are far less abundant. You’d need to feed thousands of fake news items into a machine to create a data-driven model of what constitutes a fake news story.
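To make the two-dataset requirement concrete, here is a minimal sketch of a word-count (naive Bayes) classifier trained on one list of fake articles and one list of legitimate ones. It is an illustration only, not the method of any team mentioned in this article. The function names and the four toy headlines are invented for the example, and a real model would need the thousands of labeled items described above.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Lowercase the text and split it into simple word tokens.
    return re.findall(r"[a-z']+", text.lower())

def train(fake_articles, real_articles):
    # Count word frequencies separately for each labeled dataset.
    # Assumes both lists are non-empty.
    datasets = {"fake": fake_articles, "real": real_articles}
    counts = {label: Counter() for label in datasets}
    for label, articles in datasets.items():
        for article in articles:
            counts[label].update(tokenize(article))
    return {
        "counts": counts,
        "docs": {label: len(articles) for label, articles in datasets.items()},
        "vocab": set(counts["fake"]) | set(counts["real"]),
    }

def classify(model, text):
    # Return the label whose word statistics best match the text.
    total_docs = sum(model["docs"].values())
    vocab_size = len(model["vocab"])
    scores = {}
    for label, word_counts in model["counts"].items():
        word_total = sum(word_counts.values())
        # Start from the share of training articles carrying this label.
        score = math.log(model["docs"][label] / total_docs)
        for word in tokenize(text):
            if word not in model["vocab"]:
                continue  # ignore words never seen in training
            # Add-one smoothing keeps unseen word/label pairs above zero.
            score += math.log((word_counts[word] + 1) / (word_total + vocab_size))
        scores[label] = score
    return max(scores, key=scores.get)

# Toy, hand-written examples (hypothetical, not real articles).
fake_examples = [
    "shocking miracle cure doctors hate",
    "you won't believe this shocking secret",
]
real_examples = [
    "city council approves budget for schools",
    "council votes on new transit budget",
]

model = train(fake_examples, real_examples)
print(classify(model, "shocking secret cure"))
print(classify(model, "council budget votes"))
```

With so few examples the classifier only memorizes vocabulary, which is exactly why the shortage of confirmed fake articles matters: the model can be no better than the labeled data behind it.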
Approach 1: Identify Political Bias
Confronted with the training challenge, one team of computer science students at the University of Pennsylvania decided to pivot. For their graduation project, initially designed as a fake news detector, Nova Fallen, Sebi Lozano, Scott Freeman, and Sacha Best changed their goal from detecting fake news to indicating political bias. As a familiar example, New York Times readers would likely disagree with fans of Fox News about which stories count as fake news. However, most would agree that a given news story is influenced by and caters to a particular political viewpoint. It gets easier when you review news articles published around the time of the 2016 US presidential election. An article favoring Donald Trump (presenting the candidate in a positive context) is a good indication of a conservative bias. An article supporting Hillary Clinton signals a more liberal slant.
In this case, the available dataset is significantly larger than any collection of confirmed misleading or bogus news stories. Rather than a fake news detector, the U Penn team is working on a bias indicator.
Approach 2: Identify Known Sources of Fake News
Several teams, including the crowdsourced Kaggle community, are working on fake news detection based on known flagged sources of unverified news content.
Just out of stealth mode, the CrossCheck team has even suggested adding humor and entertainment sites to a source list. CrossCheck CFO Jay Khurana explained the team’s approach:
“We know that websites devoted to satire, spoof, and hoax content are publishers of known false content. It’s worth adding them to a source list, especially for algorithm training purposes.”
Assuming they’re maintained, these sources could contribute large datasets you can use to develop sophisticated machine learning models.
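As a rough sketch of how a source list could be used to label articles for training, the snippet below looks up an article's domain in a small dictionary of flagged sources. The domains, the category names, and the function names are placeholders invented for this example. They do not come from Kaggle, CrossCheck, or any real list.

```python
from urllib.parse import urlparse

# Hypothetical source list: domain -> reason it was flagged.
FLAGGED_SOURCES = {
    "satire.example": "satire",
    "hoax.example": "hoax",
}

def source_domain(url):
    # Extract the host name, dropping a leading "www." if present.
    # URLs without a scheme have no host name and yield "".
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host

def label_by_source(url):
    # Return the flag for a known source, or "unlisted" otherwise.
    return FLAGGED_SOURCES.get(source_domain(url), "unlisted")

print(label_by_source("https://www.satire.example/story-1"))
print(label_by_source("https://news.example/story-2"))
```

A lookup like this labels the publisher, not the individual story, so it cannot say anything about an article from a source that is missing from the list.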
Obstacles to Overcome: Mixed Sources
While the above are interesting approaches to the problem, they still don’t address the real issue troubling consumers: is the news story I’m reading now deliberately attempting to deceive me? Most people reading satire news sites are unlikely to accept the content as hard news. What happens when the sources are mixed? How can we spot fake news when it somehow appears in an established media publication?
The only way to scientifically confirm the accuracy of any fake news indicator is to show a confirmed positive. Such a tool would need to provide an example of a news article identified as fake before it is publicly exposed.
Top Liked News Stories Published on April 1st 2017
A well-known example of a mixed source would be any news publication participating in the traditional April Fool’s mischief. Assuming you don’t want to wait until the hoax is paraded in public the next day, how can you know for sure?
It gets even trickier when you consider that the modern news game rewards sensational headlines and stories that are at the edge of conventional wisdom. Consider the top three news stories published on April 1st 2017, sorted in descending order of Facebook Likes:
Mike Will Made-It Says He Has More Music on Kendrick Lamar’s Album
Published 4/1/2017 By Omar Burgess @ complex.com
Stolen Rockwell painting returned after 41 years
Published 4/1/2017 By Evan Simko-Bednarski, CNN
Bob Dylan receives Nobel Prize in literature in Sweden
Published 4/1/2017 By Alex Stambaugh and Deanna Hackney, CNN
The first item fits the youth culture theme consistent with the complex.com publication. However, a stolen Rockwell painting retrieved 41 years later and published on April 1st could go either way. You have to read the entire article and watch the video to discover that it is indeed an amazing story 41 years in the making. It just happened to be published on a date that could raise a red flag in the news bubble reality of 2017. Did Bob Dylan really receive the Nobel Prize in literature?
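A ranking like the one above amounts to filtering articles by publication date and sorting them by Like count. The sketch below shows that step on placeholder records. The field names, titles, and Like counts are all hypothetical and do not reflect the webhose.io API or the real figures for the stories listed.

```python
from datetime import date

# Placeholder records; every value here is invented for illustration.
articles = [
    {"title": "Story A", "published": date(2017, 4, 1), "likes": 120},
    {"title": "Story B", "published": date(2017, 4, 1), "likes": 950},
    {"title": "Story C", "published": date(2017, 3, 31), "likes": 5000},
    {"title": "Story D", "published": date(2017, 4, 1), "likes": 430},
]

def top_liked(articles, day, n=3):
    # Keep only articles published on the given day,
    # then sort by Like count in descending order.
    same_day = [a for a in articles if a["published"] == day]
    return sorted(same_day, key=lambda a: a["likes"], reverse=True)[:n]

for article in top_liked(articles, date(2017, 4, 1)):
    print(article["title"], article["likes"])
```

Note that popularity and date say nothing about accuracy on their own. They only select which stories a reader, or a detector, is most likely to encounter.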
News publishers are aware of this emerging digital trend. As web data-driven news publisher Vocativ reported last week, several Scandinavian outlets announced they would not participate in the long-standing tradition of April Fool’s items.
Well, in 2017 we can safely say that since the news source is Scandinavian, it’s arguably more reliable considering the April 1st timestamp!