Webhose.io Tip: Search for top performing (viral) posts

Posted on April 30, 2015 by Ran Geva

Here at Webhose, our crawlers download millions of posts a day from millions of sources. When searching for web data among these many sources, you may want to limit your results to news or blog posts that had some kind of social impact. To provide you with this capability, we are introducing a new score...

Posted in API

Building a Better Search Query

Posted on December 10, 2014 by Ran Geva

Many factors can affect the relevancy of streaming data. When the data you consume is ordered not by relevancy but by the time it was crawled, getting the relevant posts is essential. I would like to share a few tips you can use to greatly increase the relevancy of the data you consume via the Webhose.io API...

Posted in API

Webhose.io Tips & Tricks: Search for Reviews

Posted on December 10, 2014 by Ran Geva

Are you looking to focus your data search specifically on consumer-generated reviews? Here are a couple of simple Webhose.io tricks that might help. Limit your query to specific sites: you can limit your search to specific “review sites” like amazon.com, bestbuy.com, newegg.com, cnet.com, engadget.com, pcmag.com, etc. Here is an example for how you should...

Posted in API

Vertical aggregation & Pattern matching crawlers

Posted on November 27, 2014 by Ran Geva

After bashing various crawling techniques, I would like to describe the technique we use here at webhose.io, a technology developed over the past 8 years. Our crawlers were developed with the following demands in mind: efficient on server resources, i.e., CPU and bandwidth; fast in fetching and extracting content; easily add new sites...

Posted in API

Crawling Horrors – Browser Scraping

Posted on November 25, 2014 by Ran Geva

In my previous blog post, I wrote about RSS crawlers, and why they don’t really work. In this post I want to discuss the technique of using a headless browser to parse a website and extract its content. A headless browser is a web browser without a graphical user interface. The logic behind using a...

Posted in Technology