To crawl or not to crawl, that is the question

Posted on August 24, 2015 by Ran Geva

To write an efficient crawler, you must be smart about the content you download. When your crawler downloads an HTML page, it uses bandwidth, memory and CPU: not only its own, but also those of the server hosting the resource. Knowing when not to download a resource is more important than downloading one,...
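As a rough illustration of deciding when not to download, here is a minimal sketch in Python. It is not taken from the post itself: the function name `should_download`, the `SKIP_EXTENSIONS` list and the optional `content_type` argument are all assumptions made for this example.

```python
import posixpath
from urllib.parse import urlparse

# Hypothetical skip list: file extensions that are unlikely to be HTML pages.
SKIP_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".pdf",
                   ".zip", ".mp3", ".mp4", ".css", ".js"}


def should_download(url, seen, content_type=None):
    """Return True only if the URL looks like a new HTML page worth fetching."""
    parts = urlparse(url)
    # Only crawl web URLs; skip mailto:, javascript:, ftp: and similar schemes.
    if parts.scheme not in ("http", "https"):
        return False
    # Never fetch the same URL twice.
    if url in seen:
        return False
    # Skip obvious binary or static assets based on the path extension.
    extension = posixpath.splitext(parts.path)[1].lower()
    if extension in SKIP_EXTENSIONS:
        return False
    # If a Content-Type is already known (for example from a HEAD request),
    # accept only HTML.
    if content_type is not None and not content_type.lower().startswith("text/html"):
        return False
    return True
```

Every check here runs before any page body is transferred, so a rejected URL costs neither your crawler nor the remote server a full download.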


Posted in Technology

Tiny basic multi-threaded web crawler in Python

Posted on August 12, 2015 by Ran Geva

If you need a simple web crawler that will scour the web for a while and download content from random sites, this code is for you.

Usage:

$ python tinyDirtyIffyGoodEnoughWebCrawler.py https://cnn.com

where https://cnn.com is your seed site. It could be any site that contains content and links to other sites. My colleagues described this piece of code I wrote...
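For readers who want a feel for the idea before following the link, here is a minimal sketch of a multi-threaded crawler in Python 3. It is not the author's tinyDirtyIffyGoodEnoughWebCrawler.py. The names `LinkParser`, `fetch` and `crawl`, and the `max_pages` and `num_threads` parameters, are assumptions made for this example.

```python
import queue
import threading
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collect the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch(url):
    """Download a page and return its text (requires network access)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(seed, fetch_page=fetch, max_pages=20, num_threads=4):
    """Crawl outward from seed; return the set of URLs that were attempted."""
    todo = queue.Queue()
    visited = set()
    lock = threading.Lock()
    todo.put(seed)

    def worker():
        while True:
            url = todo.get()
            try:
                if url is None:  # sentinel: no more work
                    return
                with lock:
                    if url in visited or len(visited) >= max_pages:
                        continue
                    visited.add(url)
                try:
                    html = fetch_page(url)
                except Exception:
                    continue  # skip pages that fail to download
                parser = LinkParser()
                parser.feed(html)
                for link in parser.links:
                    absolute = urljoin(url, link)
                    if absolute.startswith(("http://", "https://")):
                        todo.put(absolute)
            finally:
                todo.task_done()

    threads = [threading.Thread(target=worker, daemon=True)
               for _ in range(num_threads)]
    for thread in threads:
        thread.start()
    todo.join()            # wait until every queued URL has been handled
    for _ in threads:
        todo.put(None)     # tell each worker to exit
    for thread in threads:
        thread.join()
    return visited
```

Calling `crawl("https://cnn.com")` would start from the seed site mentioned above. The `fetch_page` parameter exists so the downloader can be swapped for a stub when trying the code offline.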


Posted in API