Tiny basic multi-threaded web crawler in Python

Posted on August 12, 2015 by Ran

If you need a simple web crawler that will scour the web for a while, downloading random sites' content – this code is for you.

Usage:

$ python tinyDirtyIffyGoodEnoughWebCrawler.py https://cnn.com

Where https://cnn.com is your seed site. It could be any site that contains content and links to other sites.

My colleagues described this piece of code I wrote as "Dirty", "Iffy", "Bad", "Not very good". I say it gets the job done and downloads thousands of pages from multiple sites in a matter of hours. No setup is required, no external dependencies – just run the following Python code with a seed site and sit back (or go do something else, because it could take a few hours, or days, depending on how much data you need).

tinyDirtyIffyGoodEnoughWebCrawler.py

import sys, thread, Queue, re, urllib, urlparse, time

dupcheck = set()      # links we have already seen/queued
q = Queue.Queue(100)  # bounded queue of links waiting to be fetched
q.put(sys.argv[1])    # seed the crawl with the site given on the command line

def queueURLs(html, origLink):
    # Pull href targets out of the page and queue the ones we haven't seen.
    for url in re.findall('''<a[^>]+href=["'](.[^"']+)["']''', html, re.I):
        # Keep absolute links as-is; resolve relative links against the
        # scheme and host of the page they came from. Strip #fragments.
        link = url.split("#", 1)[0] if url.startswith("http") else '{uri.scheme}://{uri.netloc}'.format(uri=urlparse.urlparse(origLink)) + url.split("#", 1)[0]
        if link in dupcheck:
            continue
        dupcheck.add(link)
        if len(dupcheck) > 99999:  # cap memory use; accept some re-crawling after a reset
            dupcheck.clear()
        q.put(link)

def getHTML(link):
    # Download one page, save it to disk, and queue its outgoing links.
    try:
        html = urllib.urlopen(link).read()
        # Save the page, with its source URL recorded as an HTML comment up top.
        with open(str(time.time()) + ".html", "w") as f:
            f.write("<!-- %s -->" % link + "\n" + html)
        queueURLs(html, link)
    except (KeyboardInterrupt, SystemExit):
        raise
    except Exception:
        pass  # dead links, timeouts, parse trouble: skip and move on

while True:
    thread.start_new_thread(getHTML, (q.get(),))  # one new thread per page
    time.sleep(0.5)  # throttle: at most ~2 new fetches per second
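
The code above is Python 2 (it uses the thread, Queue, urllib and urlparse modules). If you want to run it on Python 3, a minimal, untested sketch of the same crawler – same throttle, same file naming, same dirty error handling – could look like this:

import sys, re, time, threading, queue
from urllib.parse import urlparse
from urllib.request import urlopen

dupcheck = set()
q = queue.Queue(100)
q.put(sys.argv[1])

def queue_urls(html, orig_link):
    # Same regex-based link extraction as the Python 2 version.
    for url in re.findall(r'''<a[^>]+href=["'](.[^"']+)["']''', html, re.I):
        base = '{u.scheme}://{u.netloc}'.format(u=urlparse(orig_link))
        link = url.split("#", 1)[0] if url.startswith("http") else base + url.split("#", 1)[0]
        if link in dupcheck:
            continue
        dupcheck.add(link)
        if len(dupcheck) > 99999:
            dupcheck.clear()
        q.put(link)

def get_html(link):
    try:
        # urlopen() returns bytes in Python 3, so decode before saving.
        html = urlopen(link).read().decode("utf-8", "replace")
        with open(str(time.time()) + ".html", "w") as f:
            f.write("<!-- %s -->" % link + "\n" + html)
        queue_urls(html, link)
    except (KeyboardInterrupt, SystemExit):
        raise
    except Exception:
        pass

while True:
    # daemon=True so stray worker threads don't block interpreter exit.
    threading.Thread(target=get_html, args=(q.get(),), daemon=True).start()
    time.sleep(0.5)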

Features:

  • Multi-threaded – for speed
  • Duplicate elimination (kinda) – for link uniqueness
  • Saves both the page source and its link (see the example below) – the purpose it was built for
  • FREE
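
Each saved file is named after the fetch timestamp, and its first line records the page's URL as an HTML comment, so you can always tell where a page came from. A file saved as, say, 1439370000.0.html (timestamp and URL made up here, just for illustration) would start roughly like this:

<!-- https://edition.cnn.com/world -->
<!DOCTYPE html>
<html lang="en">
...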

Enjoy,

Ran
