Avoid Biased Data Analysis with Clean and Structured Data

Posted on March 10, 2019 by Shai Schwartz


I want to share with you an unfortunate truth: All data is biased.

Here at Webhose, we’ve written about this at length in our posts that explained how surveys are biased and the danger of fake reviews. News headlines throughout 2018 were full of examples of disinformation and fake news, and of questions about their impact on political elections and public opinion. And we all know that even a seemingly mainstream article from a respected publication is inherently biased, because both the publication and the author often have their own political and personal agendas. That’s life – there’s not a lot you can do about it as a reader except to learn critical thinking skills and carefully read and evaluate the information in front of you.

It’s important to realize, however, that data analysis is also biased. And unlike much of the data that harmlessly floats around the web, data analysis applications that are biased can cause harm to a great number of people.

So what happens when the data an analysis is based on is biased, inaccurate or fake?

Webhose has made it our mission to provide you with clean, structured and organized data ready-made for further processing so that your data analysis is as fair and accurate as possible. At the same time, we want to walk you through a few of the types of problematic data analysis out there, so your organization can avoid biasing its own analysis as much as possible.

1) Biased or Inaccurate Information Can Alter the Perception of Reality

The proliferation of biased and inaccurate data and its effects on various industries was one of the most talked-about news items in 2018. From the rise of fake news sites for pure monetary gain to talk of more pernicious goals of Russian disinformation, it’s clear that the data we consume can have a powerful effect on how we perceive and react to world events.

A 2017 study from Yale researchers even demonstrated that the more often people hear false information, the more likely they are to believe it. (By the way, the success of political disinformation is based on this principle.)

More recently, inaccurate data was one of several factors responsible for the December 2018 fall in the stock market. According to Marko Kolanovic, the top quantitative analyst at JP Morgan Chase, inaccurate negative news can seriously affect the markets. “We trace the disconnect between negative sentiment and macroeconomic reality to the reinforcing feedback loop of real and fake negative news…. If we add to this an increased number of algorithms that trade based on posts and headlines, the impact on price action and investor psychology can be significant,” Kolanovic said. Traders made decisions based on an inaccurate perception of reality, which caused them to lose money and contributed to a decrease in trust in the stock market.

2) Algorithms Can Be Developed to Be Inherently Biased

While inaccurate or biased research can bias your algorithms, you should be aware that organizations can also develop algorithms that are themselves biased. As Cathy O’Neil, who holds a PhD in mathematics from Harvard and is a former quantitative analyst on Wall Street, explains in her book Weapons of Math Destruction, many people perceive algorithms to be objective tools, a black box of sorts. But all too often, these models turn out to be opinions embedded in mathematics, resulting in pervasive unfairness and discrimination in all sectors of society. The results of these algorithms can have life-altering consequences, deciding which future homeowners are granted a mortgage, which teachers to hire and fire, and which accused criminals to put in jail and for how long.

3) Datasets May Not Accurately Represent the Average Population

When data is gathered for analysis, there’s a lot that can unfortunately go wrong. For one, data analysts need to make sure that the data sample they gather accurately represents the population as a whole. For instance, a majority of the research from the field of psychology for the last 50 years has been performed on American college students, which should call into question the accuracy of most of this research. College students should actually be considered outliers in a dataset, as they are Westernized, educated people from developed and prosperous democracies. More than half of the students participating in these studies are psychology majors, making them even more of an outlier than a regular college student! These are the types of samples that don’t accurately represent the average population and can lead to biased and inaccurate results in your data analysis.
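One simple way to spot this kind of problem is to compare each group's share of your sample with its share of the population you want to describe. The snippet below is a minimal sketch of that idea, not a full statistical test. The function name `representation_gaps`, the group labels and all of the numbers are hypothetical and chosen only for illustration.

```python
from collections import Counter


def representation_gaps(sample_groups, population_shares):
    """Return (sample share - population share) for each group.

    A positive value means the group is over-represented in the sample,
    a negative value means it is under-represented.
    """
    counts = Counter(sample_groups)
    total = len(sample_groups)
    return {
        group: counts.get(group, 0) / total - share
        for group, share in population_shares.items()
    }


# Hypothetical figures for illustration only, not real survey data.
population_shares = {"college_student": 0.05, "other_adult": 0.95}
sample_groups = ["college_student"] * 80 + ["other_adult"] * 20

gaps = representation_gaps(sample_groups, population_shares)
for group, gap in gaps.items():
    print(f"{group}: {gap:+.2f}")
```

A large gap for any group is a signal that results from the sample may not carry over to the wider population without reweighting or gathering more data.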

4) Datasets May Not Be Big Enough

Another common problem in gathering data is simply sample size. Generally speaking, the more data an organization is able to gather, the more likely it is that the analysis based on that dataset will be accurate. Models built on the general population, for example, may be less biased because more data exists for them. These models cannot necessarily be extended to minorities, and models built for minorities are often more biased and inaccurate due to smaller sample sizes.
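To give a rough sense of how much sample size matters, the sketch below uses the standard normal-approximation margin of error for an estimated proportion, z * sqrt(p * (1 - p) / n). It assumes a simple random sample and a 95% confidence level (z = 1.96). The function name `margin_of_error` and the sample sizes are chosen purely for illustration.

```python
import math


def margin_of_error(p, n, z=1.96):
    """Approximate margin of error for a proportion p estimated from n samples.

    Uses the normal approximation and assumes a simple random sample;
    z=1.96 corresponds to a 95% confidence level.
    """
    return z * math.sqrt(p * (1 - p) / n)


# Illustrative sample sizes; p=0.5 gives the widest (worst-case) margin.
for n in (100, 1000, 10000):
    print(f"n={n}: +/- {margin_of_error(0.5, n):.3f}")
```

The margin shrinks with the square root of the sample size, so a group with 100 times fewer data points has an estimate roughly 10 times less precise. Note that this only measures random error: a larger sample does not by itself correct a sample that is unrepresentative to begin with.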

Organizations that strive to provide accurate analysis need access to bigger datasets. That’s why an advanced web data provider that gathers datasets from whatever sources you choose – either mainstream or more independent sources – can help your organization’s data analysis to be as accurate and unbiased as possible. Researchers and leading news organizations have used Webhose’s vast sources of structured datasets to combat fake news and to alert the public to the biased predictions of the national election polls.

How Webhose’s Structured Data Comes to the Rescue

As the world continues to do its best to solve increasingly complex problems with data, we will need to rely more and more on diverse, comprehensive datasets. For this, organizations need a way to clean, structure and unify data – long before it reaches their data analysts. As the saying goes: Your analysis is only as good as your data. Our goal from the beginning has been to give any organization or developer access to crawled web data, leveling the playing field with the big players like Google and Microsoft. That way they can concentrate on what they do best, while at the same time ensuring that their data analysis is as fair and unbiased as possible.