Author Archives: Ran Geva

4 Reasons to Stick to a Free Plan Model

Posted on May 3, 2018 by Ran Geva

I’ve always been a proponent of keeping as much of your software free for as long as possible. However, in recent years, as Webhose.io has experienced rapid growth, and since a large part of our revenue comes from enterprise sales, I’m increasingly finding myself on the defensive. I can summarize the objections I hear from

Continue reading

Posted in Technology

Financial success using AI and Time Travel

Posted on January 18, 2018 by Ran Geva

Wait, let me explain. I can explain every part of this click-bait title; it will make sense, I promise. So, a great philosopher named Homer Simpson once said: "Trying is the first step towards failure." And I agree; however, failure is the first step towards success. Learning from past mistakes is a crucial step to

Continue reading

Posted in Machine Learning

Calling all (almost) Kimono Labs developers to migrate to Webhose.io

Posted on February 16, 2016 by Ran Geva

Kimono Labs announced today that it has been acquired by Palantir. Unfortunately, Kimono Labs users will have only two weeks to migrate to a different service, because the team will shut down the Kimono service on February 29, 2016. The good news is that if you are a Kimono Labs user who used

Continue reading

Posted in Technology

Article’s publication date extractor – an overview

Posted on December 13, 2015 by Ran Geva

A few days ago I released an open-source Python module that provides a simple way to extract and normalize the publication date of any online blog or news post. There are some commercial solutions out there, but why not just use this module for free? The logic behind the code: Here

Continue reading

Posted in API

Ever imagined what “Big Data” looks like?

Posted on October 13, 2015 by Ran Geva

We have created a fun little experiment that lets you navigate a 3D universe of real data from the open web. The data consists of important news and blog titles and their metadata, such as dates, comment counts, domains and more. It’s called INSIDE BIG Data – https://webhose.io/demo/big-data

Continue reading

Posted in Big Data

30-Days of Historical Data Access for Webhose.io Now Available

Posted on September 10, 2015 by Ran Geva

I’m very happy to let you know about the launch of our extended access to 30 days of historical data from Webhose.io, which is available to our paying customers immediately. No waiting list. With 30-day data access, Webhose.io customers don’t have to worry about missing posts in the real-time stream since they can now

Continue reading

Posted in News

Dead simple {for devs} python crawler (script) for extracting structured data from any website into CSV

Posted on August 16, 2015 by Ran Geva

In my previous post I described a very basic web crawler I wrote that can randomly scour the web and mirror/download websites. Today I want to share with you a very simple script that can extract structured data from any <almost> website. Use the following script to extract specific information from any website (e.g. prices, IDs, titles,

Continue reading

Posted in API

Tiny basic multi-threaded web crawler in Python

Posted on August 12, 2015 by Ran Geva

If you need a simple web crawler that will scour the web for a while to download random sites’ content – this code is for you. Usage: $ python tinyDirtyIffyGoodEnoughWebCrawler.py https://cnn.com, where https://cnn.com is your seed site. It could be any site that contains content and links to other sites. My colleagues described this piece of code I wrote

Continue reading

Posted in API

How we quadrupled the performance of Elasticsearch

Posted on July 19, 2015 by Ran Geva

Well, that’s a misleading title. We actually quadrupled the performance of our brand monitoring alert system that uses Elasticsearch’s Percolator, but that would have been a much longer title. Some background: Buzzilla has two main products. The first is Webhose.io, which provides businesses worldwide with access to structured data from the open web, and the second

Continue reading

Posted in Technology

Webhose.io Tip: Search for top performing (viral) posts

Posted on April 30, 2015 by Ran Geva

Our crawlers download millions of posts a day from millions of sources. Sometimes you may want to sift only through news or blog posts that had some kind of social impact. To provide you with this capability, we are introducing a new score we call the “Performance Score”.

Continue reading

Posted in API

Building a Better Search Query

Posted on December 10, 2014 by Ran Geva

Many factors can affect the relevancy of streaming data. When the data you consume is ordered not by relevancy but by the time it was crawled, getting the relevant posts is essential. I would like to share with you a few tips you can use to greatly increase the relevancy of the data you consume via the Webhose.io API

Continue reading

Posted in API

Webhose.io Tips & Tricks: Search for Reviews

Posted on December 10, 2014 by Ran Geva

Are you looking to focus your data search specifically on consumer-generated reviews? Here are a couple of simple Webhose.io tricks that might help. Limit your query to specific sites: you can limit your search to specific “review sites” like amazon.com, bestbuy.com, newegg.com, cnet.com, engadget.com, pcmag.com, etc. Here is an example of how you should

Continue reading

Posted in API

Vertical aggregation & Pattern matching crawlers

Posted on November 27, 2014 by Ran Geva

After bashing various crawling techniques, I would like to describe the technique we use here at Webhose.io, a technology that was developed over the past 8 years. Our crawlers were developed with the following demands in mind: efficient on server resources, i.e. CPU & bandwidth; fast in fetching and extracting content; easily add new sites

Continue reading

Posted in API

Crawling Horrors – Computer Vision Crawlers

Posted on November 26, 2014 by Ran Geva

So if RSS crawlers are bad and browser scraping isn’t efficient, what about computer vision web-page analyzers? This technology uses machine learning and computer vision to extract information from web pages by interpreting pages visually, as a human being might.

Continue reading

Posted in Technology

Crawling Horrors – Browser Scraping

Posted on November 25, 2014 by Ran Geva

In my previous blog post, I wrote about RSS crawlers, and why they don’t really work. In this post I want to discuss the technique of using a headless browser to parse a website and extract its content. A headless browser is a web browser without a graphical user interface. The logic behind using a

Continue reading

Posted in Technology

Crawling Horrors – RSS Crawlers

Posted on November 24, 2014 by Ran Geva

One of the fastest, simplest and unfortunately wrong ways of extracting content out of a website is by reading its RSS feeds. I will show you how it’s done and why it’s useless. Each RSS feed already contains the data, structured and ready for harvesting, so content extraction is indeed simple and fast. Let’s take

Continue reading

Posted in Technology