Survey Results: What Matters to Web Data Collection Buyers

Posted on June 28, 2018 by webhose

While structured web data presents exciting possibilities in many fields of endeavor – including finance, cyber-security, artificial intelligence and more – the market for data extraction platforms is still fairly young. Only a handful of companies are providing online data at scale, and unlike other technologies which are covered extensively by analysts and professional publications,...

Posted in Technology

Quick Guide to News APIs

Posted on October 10, 2017 by eranl

Monitoring mass media has come a long way since the days of the press-cutting agency. The bulk of today’s news is published online, while modern technology lets us store, index and query massive amounts of textual data in milliseconds. Digitization presents clear advantages for consumers, who can now read or watch the news from the...

Posted in API

Article’s publication date extractor – an overview

Posted on December 13, 2015 by Ran Geva

A few days ago I released an open source Python module that provides you with a simple way to extract and normalize the publication date of any online blog or news post. There are some commercial solutions out there, but why not just use this module for free? The logic behind the code Here at...

Posted in API

Dead simple {for devs} Python crawler (script) for extracting structured data from any website into CSV

Posted on August 16, 2015 by Ran Geva

In my previous post I described a very basic web crawler I wrote that can randomly scour the web and mirror/download websites. Today I want to share with you a very simple script that can extract structured data from <almost> any website. Use the following script to extract specific information from any website (e.g. prices, ids, titles,...

Posted in API

Crawling Horrors – Browser Scraping

Posted on November 25, 2014 by Ran Geva

In my previous blog post, I wrote about RSS crawlers, and why they don’t really work. In this post I want to discuss the technique of using a headless browser to parse a website and extract its content. A headless browser is a web browser without a graphical user interface. The logic behind using a...

Posted in Technology