Crawling Horrors – Browser Scraping

Posted on November 25, 2014 by Ran Geva

In my previous blog post, I wrote about RSS crawlers and why they don’t really work. In this post I want to discuss using a headless browser to parse a website and extract its content. A headless browser is a web browser without a graphical user interface. The logic behind using a...

Posted in Technology

Crawling Horrors – RSS Crawlers

Posted on November 24, 2014 by Ran Geva

One of the fastest, simplest and, unfortunately, wrong ways of extracting content from a website is reading its RSS feeds. I will show you how it’s done and why it’s useless. Each RSS feed already contains the data, structured and ready for harvesting, so content extraction is indeed simple and fast. Let’s take...

Posted in Technology