Common Crawl vs. Webhose

Posted on April 13, 2020 by Ran Geva

Web archives are an important resource for both academic and commercial research. Access to historical web data is crucial for political event analysis, fake news detection, financial trend correlation and training machine learning models, among other things.

If you would like to conduct large-scale data mining research and explore questions about the linking structure of the web or analyze the textual content of pages, you will need access to a web archive. 

In this post we will compare two leading archive solutions: Webhose and Common Crawl. Before we dive into the detailed comparison, here is a brief overview of both.

Common Crawl crawls the web and freely provides its archives and datasets to the public. The Common Crawl corpus contains petabytes of data collected since 2011. It includes raw web page data, extracted metadata and plain text extractions. Amazon Web Services began hosting Common Crawl’s archive through its Public Datasets Program in 2012.

Webhose offers an easy and cost-effective way to access segmented and structured web data. Webhose provides access to pre-defined data verticals such as news, blogs, forums, reviews and dark nets. This includes access to both a live data stream and an archive going back to 2008.

Common Crawl vs. Webhose

|   | Common Crawl | Webhose |
| --- | --- | --- |
| Archive Time Frame | 2011 – Present | 2008 – Present |
| Site Types | Unclassified HTML pages from all around the web, regardless of site type. | Data from pre-defined verticals: news, blogs, forums and review sites. |
| Data Structure | URL, raw HTML, HTML & server metadata, extracted plaintext | No HTML, rather clean structured data extracted out of the HTML: title, publication date, post text, comments, author, language, post URL, section URL & title, country, entities, external links, number of likes/shares |
| Data Format | WARC file format; also contains metadata (WAT) and text data (WET) extracts | NDJSON |
| Data Access Method | Bulk download of all the data per crawled month | Filtered by Boolean keywords in the title/text or by any extracted metadata such as language, country, site type, date etc. |
| Support for Live Data | No support for live data. Data is available at the end of the crawled month. | Live access to crawled data going back 30 days |
| API Access |   | RESTful API |
| Support for AJAX-Based Sites | No | Yes |
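To make the NDJSON and filtering rows more concrete, here is a minimal Python sketch of what working with newline-delimited JSON records might look like. It is an illustration only: the field names (`title`, `text`, `language`, `site_type`, `published`) and the sample records are assumptions invented for this example, not Webhose's actual schema, and the filtering here runs locally on records you already have rather than through the Webhose API.

```python
import json
from typing import Iterable, Iterator, Optional

# Hypothetical sample data: one JSON object per line (NDJSON).
# Field names are illustrative assumptions, not an actual vendor schema.
SAMPLE_NDJSON = """\
{"title": "Markets rally on earnings", "text": "Stocks rose sharply.", "language": "english", "site_type": "news", "published": "2020-04-01T09:00:00Z"}
{"title": "Mon avis sur ce film", "text": "Un film superbe.", "language": "french", "site_type": "blogs", "published": "2020-04-02T12:30:00Z"}
{"title": "Earnings thread", "text": "Anyone watching earnings today?", "language": "english", "site_type": "forums", "published": "2020-04-03T15:45:00Z"}
"""


def read_ndjson(lines: Iterable[str]) -> Iterator[dict]:
    """Parse NDJSON: each non-empty line is an independent JSON document."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def matches(post: dict, keyword: str,
            language: Optional[str] = None,
            site_type: Optional[str] = None) -> bool:
    """Case-insensitive keyword match on title/text plus optional metadata filters."""
    haystack = f"{post.get('title', '')} {post.get('text', '')}".lower()
    if keyword.lower() not in haystack:
        return False
    if language is not None and post.get("language") != language:
        return False
    if site_type is not None and post.get("site_type") != site_type:
        return False
    return True


posts = list(read_ndjson(SAMPLE_NDJSON.splitlines()))
hits = [p for p in posts
        if matches(p, "earnings", language="english", site_type="news")]
print([p["title"] for p in hits])  # ['Markets rally on earnings']
```

Because every line is a complete JSON object, the records can be streamed and filtered one at a time without loading a whole file into memory.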

Here’s the bottom line: Common Crawl has a huge archive available for free for anyone to download. The downside is that the data isn’t structured, cannot be filtered, and is only available in bulk. By comparison, Webhose provides an affordable commercial solution for clean and structured data spanning more than 10 years. Unlike Common Crawl data, which isn’t limited to certain types of sites, Webhose’s crawled data is available in pre-defined verticals (news, blogs, forums and reviews).
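In practice, "cannot be filtered" means that any filtering of Common Crawl data happens on your side, after the bulk download. The sketch below shows roughly what that involves for WET-style plain-text records, using only the Python standard library. It rests on several assumptions: the sample records are built in memory and simplified (real WARC records carry more headers, such as a record ID and date), the URLs are placeholders, and real files are gzip-compressed, so they would need decompressing first. For production use, a dedicated WARC library is likely the more robust choice.

```python
import io
from typing import BinaryIO, Dict, Iterator, Tuple


def make_record(uri: str, text: str) -> bytes:
    """Build a simplified WET-style record (illustrative, not a complete WARC record)."""
    body = text.encode("utf-8")
    head = (
        "WARC/1.0\r\n"
        "WARC-Type: conversion\r\n"
        f"WARC-Target-URI: {uri}\r\n"
        "Content-Type: text/plain\r\n"
        f"Content-Length: {len(body)}\r\n"
        "\r\n"
    ).encode("utf-8")
    # Each record ends with two CRLFs after the payload.
    return head + body + b"\r\n\r\n"


def iter_records(stream: BinaryIO) -> Iterator[Tuple[Dict[str, str], bytes]]:
    """Yield (headers, payload) for each record in an uncompressed WARC-style stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # blank separator lines between records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"unexpected line where a record should start: {line!r}")
        headers: Dict[str, str] = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # an empty line ends the header block
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        # Read exactly Content-Length bytes so newlines in the payload are preserved.
        payload = stream.read(int(headers["Content-Length"]))
        yield headers, payload


# Two hypothetical records standing in for a downloaded (and decompressed) file.
sample = (
    make_record("http://example.com/a", "Election results are in.")
    + make_record("http://example.com/b", "Recipe for soup.\nServes four.")
)

# Keyword filtering has to be done by scanning every record yourself.
uris = [
    headers["WARC-Target-URI"]
    for headers, payload in iter_records(io.BytesIO(sample))
    if b"election" in payload.lower()
]
print(uris)  # ['http://example.com/a']
```

The point of the sketch is the shape of the work: every record in the download is read and inspected, even if only a handful match.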

If you require access to free historical web data in bulk, Common Crawl is most likely your best solution. If you need filtered granular structured data, then Webhose is probably a better tool for the job.