What Type of Datasets Do You Offer and What is the Price Range?

Webhose offers a range of free, high-quality datasets across multiple vertical content domains, including online reviews, news, blog sources and online discussions. Our datasets are divided into different subjects according to language, category and organization. In addition, they are highly granular and easy to consume. We index millions of web pages daily, structuring the data into extracted, inferred and enriched fields that are ready for AI or machine learning. Extracted fields include fields common to a particular source type (e.g. URL, title, body text and other elements such as comments). Inferred fields include language and associated information such as country, author name and date. Finally, enriched fields include ranking and scoring data for posts, which provide a measurement of traffic levels, social distribution volume and relevance.

All datasets are delivered as machine-readable data in JSON format with the same structure.
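
Because every record shares one structure, a consumer can sort the fields of any post into the extracted, inferred and enriched groups described above. The snippet below is a minimal sketch of that idea, not the actual Webhose schema: the sample record, its field names (such as `domain_rank` and `shares`) and the `group_fields` helper are all assumptions made for illustration.

```python
import json

# Hypothetical record: these field names are assumptions for illustration,
# not a confirmed Webhose schema.
SAMPLE = """
{
  "url": "https://example.com/blog/post-1",
  "title": "Example post",
  "text": "Body text of the post.",
  "language": "english",
  "country": "US",
  "author": "Jane Doe",
  "published": "2019-01-15T08:30:00Z",
  "domain_rank": 1234,
  "shares": 87
}
"""

# Assumed mapping of field names to the three groups named in the article.
FIELD_GROUPS = {
    "extracted": ("url", "title", "text"),
    "inferred": ("language", "country", "author", "published"),
    "enriched": ("domain_rank", "shares"),
}


def group_fields(record):
    """Split one flat record into the three field groups, skipping absent keys."""
    return {
        group: {name: record[name] for name in names if name in record}
        for group, names in FIELD_GROUPS.items()
    }


record = json.loads(SAMPLE)
grouped = group_fields(record)
for group, fields in grouped.items():
    print(group, sorted(fields))
```

Since the structure is the same for every dataset, a helper like this could be written once and reused across news, blog, discussion and review data.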

Free Datasets Publicly Accessible for Data Analysis

Some well-known organizations such as Microsoft and Google also crawl the web at a massive scale and have high-quality search capabilities. But many do not give other organizations access to the raw datasets, so the data isn’t available for further analysis. What if you’re a student or researcher and need datasets to gain insights for your PhD or paper? Do you have to invest your own resources in gathering the datasets yourself? Where do you start?

Webhose’s vision is to make the same massive quantity of filtered data that the above brand names collect available to everyone. Our free datasets are available for students and researchers as well as commercial enterprises, so they don’t have to waste precious brainpower, time and resources on crawling, extracting and structuring web datasets themselves. The free plan includes 1,000 requests per month with a limit of 100,000 posts a month. High-growth organizations that need over 1,000 requests per month will need to upgrade to either the enterprise firehose or archive plan.

Finally, students and researchers can now access the same datasets used by industry leaders like Salesforce, Meltwater and Kantar Media.

Unified Data is the First Step Before AI or Machine Learning

Datasets are the first step for developing a machine learning or AI model.

For example, say you want to develop a machine learning model that can forecast whether new content will go viral before it is posted. First, you’ll have to find a set of blog posts, in a unified, machine-readable format, that have been widely shared. Then you’d need to feed the model posts that have gone 30 days without being shared. Only after that can you test the model on a new dataset to check its accuracy. At this point you’re ready to run the model on the latest content. To develop a forecasting model, you would repeat these steps continuously until the model is forecasting effectively.
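
The workflow above can be sketched in a few lines of Python. This is only a skeleton under stated assumptions: the toy posts are made up, the `shares` and `age_days` fields and the 100-share cutoff are hypothetical, and a majority-class baseline stands in for a real model so that the labeling, hold-out split and accuracy check are the focus.

```python
import random

SHARE_THRESHOLD = 100  # assumed cutoff for "widely shared"
WINDOW_DAYS = 30       # the 30-day window mentioned in the text


def label(post):
    """Return 1 for widely shared, 0 for unshared after the window, else None."""
    if post["shares"] >= SHARE_THRESHOLD:
        return 1
    if post["shares"] == 0 and post["age_days"] >= WINDOW_DAYS:
        return 0
    return None  # ambiguous posts are left out of training


def split(examples, test_fraction=0.25, seed=0):
    """Shuffle deterministically and hold out a fraction for testing."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]


def train_majority_baseline(train):
    """Placeholder model: always predicts the most common training label."""
    positives = sum(y for _, y in train)
    majority = 1 if positives * 2 >= len(train) else 0
    return lambda post: majority


def accuracy(model, test):
    """Fraction of held-out posts whose label the model predicts correctly."""
    return sum(model(post) == y for post, y in test) / len(test)


# Made-up toy posts; "shares" and "age_days" are hypothetical field names.
posts = [
    {"title": "post-1", "shares": 250, "age_days": 40},
    {"title": "post-2", "shares": 0, "age_days": 45},
    {"title": "post-3", "shares": 12, "age_days": 35},   # ambiguous, dropped
    {"title": "post-4", "shares": 900, "age_days": 10},
    {"title": "post-5", "shares": 0, "age_days": 31},
    {"title": "post-6", "shares": 0, "age_days": 5},     # too recent, dropped
    {"title": "post-7", "shares": 130, "age_days": 60},
    {"title": "post-8", "shares": 0, "age_days": 90},
    {"title": "post-9", "shares": 400, "age_days": 20},
    {"title": "post-10", "shares": 0, "age_days": 33},
]

examples = [(p, label(p)) for p in posts if label(p) is not None]
train, test = split(examples)
model = train_majority_baseline(train)
print(f"held-out accuracy: {accuracy(model, test):.2f}")
```

In practice the baseline would be replaced by a model trained on features of the post text, and the held-out accuracy would be compared against this baseline on each iteration.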

But without that first step of access to datasets, many researchers, students and even enterprises would be stuck.

Want to Learn More?

Learn more about Webhose’s free datasets that include data from millions of blog posts, online message boards and news articles from a range of categories and languages.


Idan Hagai
Idan Hagai is the Head of Development at Webhose.io, a leading web data provider used by hundreds of data analytics, cybersecurity and web monitoring companies worldwide. He brings over 8 years of experience as a full-stack developer to Webhose.io from his position at Buzilla Ltd, a web monitoring and analytics company that helps brands track, monitor, analyze and extract insight from online content.