In a technologically driven environment, the temptation to develop a proprietary web crawling solution is virtually irresistible. Our latest report examines the true cost of computing and software development resources required to deliver a data crawling and structuring solution at scale:
Development & Maintenance
Development could mean coding a proprietary solution from scratch, or modifying an existing crawling software development environment (SDE) to meet your specific needs. Either way, delivering a stable solution takes months. At least one full-time employee is usually assigned to own the development and maintenance of such a crawling infrastructure. The goal is to deliver comprehensive coverage, low latency, and high granularity. The cost of hiring a developer varies across geographies and industries, but it is always a significant expense. Even if you simply assign the initiative to your existing development team, you are still diverting resources away from other initiatives.
Your multi-threaded crawler software will run 24/7 on a dedicated machine. Depending on your efficiency requirements, expect to pay at least $100 per month for a machine that downloads thousands of pages a day from multiple sites.
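To give a sense of what even the rudimentary version involves, here is a minimal sketch of a multi-threaded fetch loop using only the Python standard library. The function names (fetch, crawl) and the default of eight worker threads are illustrative assumptions, not part of any particular product. A production crawler would also need politeness delays, robots.txt handling, retries, and persistent storage.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def fetch(url, timeout=10.0):
    """Download one page and return its raw bytes (hypothetical default fetcher)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def crawl(urls, fetch_page=fetch, max_workers=8):
    """Fetch many URLs concurrently.

    Returns a dict mapping each URL to its page body, or to the exception
    raised while fetching it, so one bad source does not stop the run.
    """
    def safe_fetch(url):
        try:
            return url, fetch_page(url), None
        except Exception as exc:  # record the failure and keep crawling
            return url, None, exc

    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for url, body, error in pool.map(safe_fetch, urls):
            results[url] = body if error is None else error
    return results
```

The fetch function is passed in as a parameter so the loop can be exercised with a stub instead of live network calls. Everything beyond this loop is where the months of development time go.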
While developing a rudimentary crawling solution is not rocket science, the approach is rarely scalable (assuming you’re not in the web crawling business). As your business grows, your web data requirements grow with it. Pretty soon, you need to add more crawlers, filter the content you crawl, maintain hundreds or thousands of new sources per day, and the list goes on. That database you’ve been putting off installing has to go live ASAP so you can filter out duplicates and schedule crawling jobs across multiple crawlers. Efficient data filtering also requires an indexing solution you can query consistently. Over time, you’ll need to hire even more developers to build a more robust solution that can keep up with the dynamic nature of the web. That’s when the CFO starts asking why you’re spending so much on an initiative that is not aligned with your core competence.
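The duplicate filtering and job scheduling mentioned above can be sketched in a few lines of Python. This is an in-memory illustration under stated assumptions: the names normalize and schedule are hypothetical, the set of seen URLs stands in for the database, and assigning URLs to crawlers by hashing the host name is just one simple policy among many.

```python
import hashlib
from collections import defaultdict
from urllib.parse import urlsplit, urlunsplit


def normalize(url):
    """Canonicalize a URL so trivially different spellings compare equal."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/") or "/"
    # Lowercase scheme and host, drop the fragment, keep the query string.
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, "")
    )


def schedule(urls, num_crawlers):
    """Drop duplicate URLs and assign the rest to crawlers by host.

    All URLs from one host go to the same crawler, which keeps per-site
    rate limiting in one place. Returns {crawler_index: [urls]}.
    """
    seen = set()  # stands in for the database of already-known URLs
    jobs = defaultdict(list)
    for url in urls:
        canonical = normalize(url)
        if canonical in seen:
            continue  # duplicate, skip it
        seen.add(canonical)
        host = urlsplit(canonical).netloc
        digest = hashlib.sha256(host.encode("utf-8")).digest()
        crawler = int.from_bytes(digest[:8], "big") % num_crawlers
        jobs[crawler].append(canonical)
    return dict(jobs)
```

Once the set of known URLs no longer fits in memory, or several crawler machines must share it, this logic has to move into the database and index described above, which is where the real engineering cost begins.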