While structured web data presents exciting possibilities in many fields of endeavor – including finance, cyber-security, artificial intelligence and more – the market for data extraction platforms is still fairly young. Only a handful of companies are providing online data at scale, and unlike other technologies, which are covered extensively by analysts and professional publications,
I’ve always been a proponent of keeping as much of your software free for as long as possible. However, in recent years, as Webhose.io has experienced rapid growth, and since a large part of our revenue comes from enterprise sales, I’m increasingly finding myself on the defensive. I can summarize the objections I hear from
The vast majority of us use the web every single day – for news, shopping, socializing and really any type of activity you can imagine. But when it comes to acquiring data from the web for analytical or research purposes, you need to start looking at web content in a more technical way – breaking
When is it okay to grab data from someone else’s website without their explicit permission? A new ruling by a federal judge in California might have dramatic implications for this question, and for the open nature of the web in general. As reported in several outlets (including The Verge, Engadget, and The Register), the ruling
In a technologically driven environment, the temptation to develop a proprietary web crawling solution is virtually irresistible. Our latest report examines the true cost of computing and software development resources required to deliver a data crawling and structuring solution at scale:

Development & Maintenance

Development could mean coding a proprietary solution from scratch, or modifying an existing crawling
The analysis you provide is only as good as the raw data you start with. Although data from the open web is often perceived as a commodity, not all crawled data is created equal. Whether you’re relying on a proprietary crawling technology, tapping into a vendor’s firehose, or implementing a combination of both strategies –
Kimono Labs announced today that it has been acquired by Palantir. Unfortunately, Kimono Labs users will have only two weeks to migrate to a different service, because the team will shut down the Kimono service on February 29, 2016. The good news is that if you are a Kimono Labs user who used
In order to write an efficient crawler, you must be smart about the content you download. When your crawler downloads an HTML page, it uses bandwidth, memory and CPU – not only its own, but also those of the server where the resource resides. Knowing when not to download a resource is more important than downloading one,
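One common way to put this principle into practice is HTTP conditional requests: the crawler remembers the `ETag` and `Last-Modified` values from the previous fetch and asks the server to send the body only if the page actually changed. A minimal sketch (the function names and cache shape here are mine, not from the post):

```python
def build_conditional_headers(cache_entry):
    """Build request headers from what we cached on the last fetch.

    If the server still has the same version, it can answer
    304 Not Modified with no body, saving bandwidth on both ends.
    """
    headers = {}
    if cache_entry.get("etag"):
        headers["If-None-Match"] = cache_entry["etag"]
    if cache_entry.get("last_modified"):
        headers["If-Modified-Since"] = cache_entry["last_modified"]
    return headers


def should_parse(status_code):
    """304 means the cached copy is still valid: skip download and re-parse."""
    return status_code != 304
```

On a 304 response the crawler reuses its cached copy and moves on, which is exactly the "knowing when not to download" the post describes.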
Well, that’s a misleading title. We actually quadrupled the performance of our brand monitoring alert system that uses Elasticsearch’s Percolator, but that would have been a much longer title.

Some background

Buzzilla has two main products. The first is Webhose.io, which provides businesses worldwide access to structured data from the open web, and the second
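For readers unfamiliar with the Percolator: it inverts the usual search flow. Each brand alert is stored as a *query*, and every newly crawled document is matched against all of the stored queries in one request. A minimal sketch of the request bodies involved, assuming a recent Elasticsearch version (the index layout and field names here are illustrative, not the post’s actual setup):

```python
# Mapping for the alerts index: a "percolator" field stores the saved queries,
# and a regular "text" field tells Elasticsearch how to analyze percolated docs.
alerts_mapping = {
    "mappings": {
        "properties": {
            "query": {"type": "percolator"},
            "text": {"type": "text"},
        }
    }
}


def make_alert(brand_phrase):
    """A saved alert: matches any document mentioning the brand phrase."""
    return {"query": {"match_phrase": {"text": brand_phrase}}}


def make_percolate_request(post_text):
    """Search body asking: which of the saved alerts match this new post?"""
    return {
        "query": {
            "percolate": {
                "field": "query",
                "document": {"text": post_text},
            }
        }
    }
```

Each stored alert is indexed once with `make_alert`, and every incoming post triggers a single search built by `make_percolate_request`; the hits are the alerts that fired.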
So if RSS crawlers are bad and browser scraping isn’t efficient, what about computer vision web-page analyzers? This technology uses machine learning and computer vision to extract information from web pages by interpreting them visually, as a human being might.
In my previous blog post, I wrote about RSS crawlers, and why they don’t really work. In this post I want to discuss the technique of using a headless browser to parse a website and extract its content. A headless browser is a web browser without a graphical user interface. The logic behind using a
One of the fastest, simplest and, unfortunately, wrong ways of extracting content from a website is by reading its RSS feeds. I will show you how it’s done and why it’s useless. Each RSS feed already contains the data, structured and ready for harvesting, so content extraction is indeed simple and fast. Let’s take
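To see just how simple the RSS route is, here is a minimal sketch that pulls title, link and description out of a feed using only the standard library (the sample feed is mine, invented for illustration):

```python
import xml.etree.ElementTree as ET

# A hypothetical RSS 2.0 feed, standing in for any blog's real feed.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Blog</title>
    <item>
      <title>First post</title>
      <link>http://example.com/1</link>
      <description>Hello world</description>
    </item>
  </channel>
</rss>"""


def parse_rss(xml_text):
    """Extract the already-structured fields from every <item> in the feed."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title"),
            "link": item.findtext("link"),
            "description": item.findtext("description"),
        })
    return items
```

A dozen lines and the content is structured, which is exactly the appeal; the catch, as this post goes on to argue, is what the feed leaves out.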