Today we’re very excited to announce the latest milestone in our journey to make structured web data easily accessible to every organization, developer and researcher: the Online News Archive has now been officially launched!
TL;DR version: it’s a massive database of online news articles in structured format, gathered from thousands of sources in over 115 languages by the Webhose.io crawlers since December 2014. The data is available in JSON or XML through a simple and affordable on-demand consumption model. Read on for more details, or create a free account to see it in action (you can also use your Webhose login).
What is this all about?
Webhose.io constantly crawls a very wide range of online news websites to power our News API and firehose services. We crawl thousands of sources, with new ones constantly added by our bot’s automated content discovery, as well as by specific requests from our clients – which include some of the world’s leading web monitoring, AI and financial analysis companies.
While the API is meant for live consumption and so is limited to the past 30 days, all the data is stored on our servers, and over time this has grown into a large and comprehensive database of the world’s online news – including items that have since been removed or are difficult to reach by other means. The Online News Archive is a simple and straightforward way to access this database and retrieve data related to any topic that can be expressed as a boolean query.
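To make “a topic expressed as a boolean query” a little more concrete, here is a minimal local sketch in Python. It does not call the Archive; it only shows how AND / OR / NOT terms narrow a set of JSON records. The sample records, the `title` and `language` field names and the `matches` helper are all assumptions made up for illustration, not the Archive’s actual schema or query syntax.

```python
import json

# Hypothetical sample records; the field names are assumptions for
# illustration, not the Archive's documented schema.
SAMPLE = json.loads("""
[
  {"title": "Tesla shares jump after earnings", "language": "english"},
  {"title": "Tesla recall announced in Europe", "language": "english"},
  {"title": "Apple unveils new phone", "language": "english"}
]
""")


def matches(article, all_of=(), any_of=(), none_of=()):
    """True if the title contains every all_of term, at least one any_of
    term (when any are given), and no none_of term."""
    words = set(article["title"].lower().split())
    if not all(term in words for term in all_of):
        return False
    if any_of and not any(term in words for term in any_of):
        return False
    return not any(term in words for term in none_of)


# In spirit: tesla AND (earnings OR shares) AND NOT recall
hits = [
    a["title"]
    for a in SAMPLE
    if matches(a, all_of=["tesla"], any_of=["earnings", "shares"], none_of=["recall"])
]
print(hits)
```

With the sample above, only the first headline satisfies all three conditions.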
Why a separate service?
This was a dilemma we struggled with, but after much internal debate we decided to launch the Online News Archive as a separate – albeit closely related – website and service, rather than keep it under the Webhose.io roof. While both are essentially means of accessing structured web data at scale, the News Archive is meant for different use cases and audiences, which is why we eventually opted to separate the two.
For the same reasons, and after looking at the types of clients that are currently using our historical data, we tried to make the website and user experience simpler and more accessible. Our idea is to cater to a wider range of users compared to the Webhose API, which is primarily intended for developers.
Is this replacing the Webhose.io archive?
Nope! The Webhose.io archive is still active, and there is no intention to retire it. It’s a larger database that you can use for other types of historical data: blogs, online discussions and comment threads. If you’re already a Webhose.io user, you don’t need to create a new account to continue using the existing archive.
What is the Online News Archive good for?
Anything you can think of! In our experience, our users are far more creative than we are at coming up with use cases. That said, here are some of the more common ways we’ve seen our historical data used so far:
- Financial analysis and modeling – looking at past news stories lets you test financial models meant to predict future changes in stock prices; e.g., if you hypothesize a correlation between trends in public sentiment and market behavior, you can see how that model would have performed if applied to past data. Essentially, you’re traveling back in time to predict the future.
- Data science, AI and machine learning – one reason data has been called “the oil of the 21st century” is that training data is essential to developing AI and machine learning systems. The News Archive provides terabytes of structured textual data, which is perfect if you’re looking to give your ML or NLP algorithms something substantial to chew on.
- Market research – looking to launch a new product or gain some insight into your competitors? Use the Archive to extract relevant headlines and stories from the past 3.5 years, then apply your own know-how and tools to mine the data for insights you can’t find anywhere else.
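As a rough illustration of the financial-modeling idea above, the sketch below checks whether a daily news-sentiment score lines up with the following day’s price move, using a plain Pearson correlation. Every number here is invented for the example, and the sentiment scores are assumed to come from whatever scoring tool you apply to the archived articles; none of this is part of the Archive itself.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Made-up inputs: one sentiment score per day (days 0..5) and one
# percentage return per day (days 0..6). Not real market data.
daily_sentiment = [0.2, -0.1, 0.4, 0.0, -0.3, 0.5]
daily_returns = [0.5, 0.1, -0.4, 0.9, 0.2, -0.6, 1.1]

# Pair sentiment on day t with the return on day t + 1.
next_day_returns = daily_returns[1:]
r = pearson(daily_sentiment, next_day_returns)
print(f"correlation between sentiment and next-day return: {r:.2f}")
```

A real backtest would need far more data, out-of-sample checks and careful handling of publication timestamps; this only shows the shape of the comparison.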
How do I get started?
Simply go to www.onlinenewsarchive.com, create a free account and you’re good to go. Or, if you already have a Webhose.io account, you can skip this stage and use your Webhose credentials to log in.
How awesome is the Online News Archive on a scale of 1-10?
You’ll need a scale that goes up to 11.