Ever imagined what "Big Data" looks like?

Posted on October 13, 2015 by Ran Geva

We have created a fun little experiment that lets you navigate a 3D universe of real data from the open web. The data consists of important news and blog titles and their metadata, such as dates, comment counts, domains and more. It’s called INSIDE BIG Data: https://webhose.io/demo/big-data

Posted in Big Data

30 Days of Historical Data Access for Webhose.io Now Available

Posted on September 10, 2015 by Ran Geva

I’m very happy to announce the launch of extended access to 30 days of historical data from Webhose.io, available to our paying customers immediately. No waiting list. With 30-day data access, Webhose.io customers don’t have to worry about missing posts in the real-time stream since they can now...

Posted in News

To crawl or not to crawl, that is the question

Posted on August 24, 2015 by Ran Geva

To write an efficient crawler, you must be smart about the content you download. When your crawler downloads an HTML page, it uses bandwidth, memory and CPU, not only its own but also those of the server the resource resides on. Knowing when not to download a resource is more important than downloading one,...

Posted in Technology

Dead simple {for devs} Python crawler (script) for extracting structured data from any website into CSV

Posted on August 16, 2015 by Ran Geva

In my previous post I described a very basic web crawler I wrote that can randomly scour the web and mirror/download websites. Today I want to share with you a very simple script that can extract structured data from (almost) any website. Use the following script to extract specific information from any website (e.g. prices, IDs, titles,...

Posted in API

Tiny basic multi-threaded web crawler in Python

Posted on August 12, 2015 by Ran Geva

If you need a simple web crawler that will scour the web for a while and download content from random sites, this code is for you. Usage: $ python tinyDirtyIffyGoodEnoughWebCrawler.py https://cnn.com where https://cnn.com is your seed site. It could be any site that contains content and links to other sites. My colleagues described this piece of code I wrote...

Posted in API

How we quadrupled the performance of Elasticsearch

Posted on July 19, 2015 by Ran Geva

Well, that’s a misleading title. We actually quadrupled the performance of our brand monitoring alert system, which uses Elasticsearch’s Percolator, but that would have been a much longer title. Some background: Buzzilla has two main products. The first is Webhose.io, which provides businesses worldwide with access to structured data from the open web, and the second...

Posted in Technology