To crawl or not to crawl, that is the question
To write an efficient crawler, you must be smart about what you download. Every HTML page your crawler fetches costs bandwidth, memory, and CPU, not only on your own machine but also on the server that hosts the resource. Knowing when not to download a resource is more important than downloading one,...
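
As a rough illustration of deciding before downloading, here is a minimal Python sketch that judges a URL from its text alone, so no request is made for resources the crawler would discard anyway. The function name `should_download`, the `seen` set, and the `SKIPPED_EXTENSIONS` list are all assumptions made for this example and do not come from the text above; a real crawler would likely apply further checks of its own.

```python
import posixpath
from urllib.parse import urldefrag, urlsplit

# Hypothetical skip list: extensions picked for illustration only.
SKIPPED_EXTENSIONS = {".jpg", ".png", ".gif", ".pdf", ".zip", ".css", ".js"}


def should_download(url, seen):
    """Return True if the URL looks worth fetching, judged from the URL alone.

    `seen` is a set of URLs already accepted; it is updated in place
    whenever a URL is accepted.
    """
    # The fragment is never sent to the server, so drop it before comparing.
    url, _fragment = urldefrag(url)
    parts = urlsplit(url)

    # Only plain web links are candidates (skips mailto:, javascript:, ...).
    if parts.scheme not in ("http", "https"):
        return False

    # Never fetch the same resource twice.
    if url in seen:
        return False

    # Skip resources that are very unlikely to be HTML pages.
    extension = posixpath.splitext(parts.path)[1].lower()
    if extension in SKIPPED_EXTENSIONS:
        return False

    seen.add(url)
    return True
```

A crawler loop could call `should_download(link, seen)` for every extracted link and only queue those that return `True`, so that the cheap check runs before any bandwidth is spent.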