2017 was a turbulent year. With Donald Trump shaking up the American political system, cryptocurrencies causing riptides throughout financial markets, and advances in artificial intelligence sparking both anticipation and anxiety in the scientific world, the past year seems to have been dominated by a sense of uncertainty, as if a sea change could happen at any moment.
All of these phenomena deserve deep and thoughtful consideration as we race towards the end of the current decade, and you can probably find that elsewhere. As for us, in 2018 we’ll keep focusing on the thing we do best, which is turning websites into data. This is one area where we can offer some predictions of our own. Here are the three big trends that we believe will shape the coming year in the realm of web data.
1. AI and Cyber Security Become Dominant Consumers
As one of a handful of leaders in the web data provisioning space, we have noticed a shift in the profiles of the users signing up for Webhose.io accounts (both free and paid).
Traditionally, our platform catered mostly to media monitoring and reputation management services – organizations for which monitoring the world wide web is the very core of the business. However, in recent years, and especially in 2017, we have witnessed the emergence of new types of data consumers, whose interests lie in artificial intelligence development and in cybersecurity threat intelligence.
While both of these domains are growing at breakneck speed, each has a different need for web data. AI developers often see the web as a massive repository of natural language content, which their machine learning algorithms can ingest in order to become more robust. Cybersecurity companies (or teams), on the other hand, want a way to scan both the open and dark web in order to identify suspicious behavior and find the proverbial needle in the haystack that could indicate a data breach, or the sale of illicit items such as stolen credit cards.
We expect these two types of players to become even more dominant in this space in 2018, as the cybersecurity and AI industries themselves are on a clear growth trajectory. In turn, this will affect the way providers such as ourselves collect, structure and commercialize extracted web data.
2. Maturity and Growing Legitimacy
A judge’s ruling in a court case involving the social network LinkedIn and a data mining startup could prove to be a watershed moment when it comes to web data extraction. In the ruling, the court sided with the principle that data which is publicly available can legitimately be collected and analyzed by third parties – even without the permission of the site owners.
As we mentioned in our previous article, this debate isn’t directly relevant to Webhose.io, as we refrain from crawling any site that has indicated we are not welcome, whether by putting its content behind a login form or through its robots.txt file. However, it does represent a trend that we hope and believe will continue in 2018: the increased acceptance of web crawling as a legitimate business practice, and as a necessity for further research and innovation.
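For readers curious what honoring robots.txt looks like in practice, here is a minimal sketch using Python's standard `urllib.robotparser` module. It is an illustration only, not a description of how Webhose.io's crawler works; the `ExampleBot` user agent, the sample rules and the example.com URLs are all made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content. A real crawler would download the file
# from the site's /robots.txt path before requesting any other page.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ("https://example.com/news/story",
                "https://example.com/private/report"):
        allowed = is_allowed(SAMPLE_ROBOTS_TXT, "ExampleBot", url)
        print(url, "->", "crawl" if allowed else "skip")
```

This only covers the robots.txt signal. Content behind a login form is a separate case that a crawler has to recognize and skip on its own.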
Doubtless, there are still “bad actors” in this space – providers that, for semi-legal or plainly illegal purposes, scrape data which is clearly not meant to be publicly available. However, the delineation between blackhat and whitehat web crawling services is becoming clearer, with the latter being accepted as a necessary tool in the pursuit of a deeper understanding of the web.
3. Data Structure and Segmentation Become Crucial
From a technical perspective, the ability to accommodate different data structures for different types of data is becoming crucial. Today’s web is far more complex than it was a decade ago, as more and more parts of our lives move online. What’s more, the level of analysis that organizations wish to perform is often much deeper and more intricate (which also relates to the above-mentioned shift from brand monitoring to artificial intelligence).
Hence, bulk scraping web content is becoming less useful. We predict that in 2018, organizations that monitor and analyze the web will look for structured data that is easily machine-readable, and which they can slice and dice according to predefined dimensions. These dimensions will have to vary according to the type of content being analyzed – e.g., an e-commerce website differs wildly from an online news outlet, and so would typically need to be approached differently from an analytical perspective.
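To make the idea of content-specific dimensions concrete, here is a small Python sketch. The record types, field names and sample values are hypothetical, chosen for illustration rather than taken from any real data feed: a news article and a product page each get their own structure, and one helper slices either kind of record along whichever dimension fits it.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class NewsArticle:
    # Hypothetical dimensions that suit news content.
    url: str
    language: str
    published: str  # ISO date, e.g. "2018-01-02"
    title: str


@dataclass
class ProductPage:
    # Hypothetical dimensions that suit e-commerce content.
    url: str
    currency: str
    price: float
    name: str


def group_by(records, dimension):
    """Group records by the value of one named field (a 'dimension')."""
    groups = defaultdict(list)
    for record in records:
        groups[getattr(record, dimension)].append(record)
    return dict(groups)


# Made-up sample records.
ARTICLES = [
    NewsArticle("https://example.com/a", "en", "2018-01-02", "Markets open"),
    NewsArticle("https://example.com/b", "fr", "2018-01-02", "Les marchés"),
    NewsArticle("https://example.com/c", "en", "2018-01-03", "Markets close"),
]
PRODUCTS = [
    ProductPage("https://example.com/p1", "USD", 19.99, "Mug"),
    ProductPage("https://example.com/p2", "EUR", 24.50, "Lamp"),
]

if __name__ == "__main__":
    for language, items in group_by(ARTICLES, "language").items():
        print(language, [item.title for item in items])
    for currency, items in group_by(PRODUCTS, "currency").items():
        print(currency, [item.name for item in items])
```

The same helper works for both record types because each type declares its own fields. Grouping articles by language and products by currency are simply two different dimensions over two different structures.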
Do you have any predictions for web data in 2018? Tell us in the comments! Want to learn more about how web data can help your business? Join us for a weekly live demo.