
Crawling Horrors – RSS Crawlers

November 24, 2014

One of the fastest, simplest and, unfortunately, wrong ways of extracting content from a website is reading its RSS feeds. I will show you how it’s done and why it’s useless for content extraction.

Each RSS feed already contains the data, structured and ready for harvesting, so content extraction is indeed simple and fast. Take the RSS feed from TechCrunch as an example. (You can often find the RSS feed URL by reading the <link rel="alternate" type="application/rss+xml"…> tag in the site’s main HTML page; in TechCrunch’s case, it’s https://techcrunch.com/feed/.) The output is XML made up of <item> elements, in which you can find the author name, the post date, images and even part of the content.
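To make the two steps above concrete, here is a minimal sketch using only the Python standard library: one helper looks for the feed URL in a page’s <link> tags, the other pulls the fields out of each <item>. The function names, the example.com URLs and the sample documents are made up for illustration, and reading the author from a dc:creator element is an assumption about how a given feed is laid out; real feeds vary.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Assumption: the feed carries the author in a Dublin Core <dc:creator> element.
DC_NS = {"dc": "http://purl.org/dc/elements/1.1/"}


class FeedLinkFinder(HTMLParser):
    """Collects the href of every <link rel="alternate" type="application/rss+xml">."""

    def __init__(self):
        super().__init__()
        self.feed_urls = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        rel = (attr.get("rel") or "").lower().split()
        mime = (attr.get("type") or "").lower()
        if (tag == "link" and "alternate" in rel
                and mime == "application/rss+xml" and attr.get("href")):
            self.feed_urls.append(attr["href"])


def find_feed_urls(html_text):
    """Return the RSS feed URLs advertised by an HTML page."""
    finder = FeedLinkFinder()
    finder.feed(html_text)
    return finder.feed_urls


def parse_items(rss_text):
    """Return one dict per <item>; fields the feed omits come back as None."""
    root = ET.fromstring(rss_text)
    return [
        {
            "title": item.findtext("title"),
            "link": item.findtext("link"),
            "author": item.findtext("dc:creator", namespaces=DC_NS),
            "published": item.findtext("pubDate"),
            "summary": item.findtext("description"),
        }
        for item in root.iter("item")
    ]


# Hypothetical sample documents, standing in for a real page and a real feed.
SAMPLE_HTML = (
    '<html><head><link rel="alternate" type="application/rss+xml" '
    'href="https://example.com/feed/"></head><body></body></html>'
)
SAMPLE_RSS = """<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Example Blog</title>
    <item>
      <title>First post</title>
      <link>https://example.com/posts/1</link>
      <dc:creator>Jane Doe</dc:creator>
      <pubDate>Mon, 24 Nov 2014 10:00:00 +0000</pubDate>
      <description>Only the first couple of lines of the article [...]</description>
    </item>
  </channel>
</rss>"""

if __name__ == "__main__":
    print(find_feed_urls(SAMPLE_HTML))
    for entry in parse_items(SAMPLE_RSS):
        print(entry)
```

Note that the summary field holds whatever the publisher chose to put in <description>, which is frequently just an excerpt.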

So why is this wrong, you ask? Because getting only part of the content defeats the purpose of a good crawler. Getting 2-3 lines out of the complete article is useless, not to mention that you don’t get the comments for the article (some sites provide a comments feed, but again it contains only a fraction of the comment content).

True, it’s fast, simple, very low on bandwidth, and you get structured data, but you don’t get the complete data, and in my book that disqualifies this method as a valid crawling option. You can use an RSS crawler as a starting point to discover article URLs, but not as a content extractor.
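Using the feed purely for discovery could look something like the sketch below: read the <link> of each item, skip the URLs already seen, and hand the new ones to whatever fetches and parses the full article page. The function names and the fetch_fn parameter are hypothetical, and extracting content from the downloaded page is deliberately left out; this only shows the hand-off.

```python
import urllib.request
import xml.etree.ElementTree as ET


def discover_article_urls(rss_text, seen):
    """Return feed links not yet in `seen` (in feed order) and record them."""
    new_urls = []
    for item in ET.fromstring(rss_text).iter("item"):
        link = (item.findtext("link") or "").strip()
        if link and link not in seen:
            seen.add(link)
            new_urls.append(link)
    return new_urls


def fetch(url, timeout=10):
    """Download the raw bytes behind a URL (a feed or an article page)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def crawl_once(feed_url, seen, fetch_fn=fetch):
    """One polling pass: read the feed, then fetch every new article in full.

    The feed is only used to learn which URLs exist; the returned dict maps
    each new URL to the complete page, ready for a real content extractor.
    """
    feed_text = fetch_fn(feed_url)
    return {url: fetch_fn(url) for url in discover_article_urls(feed_text, seen)}
```

Passing fetch_fn in as a parameter keeps the sketch testable without network access; calling crawl_once repeatedly with the same seen set means each article page is downloaded only once.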
