Web Data Collection FAQ

Experts chime in on questions about data feeds for the open and dark web and provide insight on how to best utilize the Webhose API and Firehose service for data extraction.

What type of data are you able to extract?

Webhose extracts data from news articles, blogs, online discussions/forums and reviews and is able to filter data from search results on the open web according to keyword, publication date, person, location or organization, language, country and more. Granular filtering capabilities also include delivering data according to social signals, performance score (grading posts by how viral they are), and sentiment analysis.

What if a source I’m looking for on the open web is not covered?

Missing sources can be ingested quickly on request through our customer service team, free of hassle. New sources are constantly added for our customers.

Can I Use Webhose to Capture Web Data?

Yes. Webhose crawls the open (or “clear”) web and allows users to retrieve pre-filtered data in machine-readable format. It indexes millions of posts from news sites, online discussions, blogs and reviews. Users can query the API by keywords, author, sentiment, location, and organization. The internet is a resource that is constantly shifting in both size and content and no one tool, including major search engines, can ever index it completely. For organizations at any stage of their growth, however, Webhose’s open web API can serve as their primary means for ingesting data from the open internet or serve as a supplement to existing sources.
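As a sketch of how such a filtered query might be assembled, the snippet below builds a request URL from a keyword plus language and source-type filters. The endpoint path and the `language`/`site_type` parameter names are illustrative assumptions, not the documented interface; consult the Webhose API reference for the real parameters.

```python
from urllib.parse import urlencode

def build_query(token, keywords, language=None, site_type=None):
    """Assemble a filtered open-web search request URL.

    The endpoint and filter names below are assumptions for this
    sketch; check the Webhose API documentation for the real ones.
    """
    filters = [keywords]
    if language:
        filters.append(f"language:{language}")
    if site_type:
        filters.append(f"site_type:{site_type}")
    params = {"token": token, "format": "json", "q": " ".join(filters)}
    return "https://webhose.io/filterWebContent?" + urlencode(params)

url = build_query("YOUR_TOKEN", '"acme corp"',
                  language="english", site_type="news")
```

The returned URL can then be fetched with any HTTP client to retrieve the matching posts as JSON.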

Can I Use the News API for Financial Analysis?

Those trading the financial markets, particularly high-frequency traders, require access to real-time financial market data to stay on top of market and geopolitical trends. Webhose’s News API gives traders a broader perspective on the financial markets by adding global news, forums, reviews, and blogs to the pool of sources that inform their trading decisions, helping them make smarter, more profitable trades. AI-powered decision-making is currently responsible for more than two-thirds of global financial transactions, and a rapid supply of data is required to feed the continuously improving machine learning models that power these technologies. Additionally, by correlating historic market data with subsequent market movements, financial institutions can deduce trends that inform future trading decisions and develop predictive analytics systems to forecast the trajectory of financial instruments and markets.

Can I Use Webhose to Build my own Datasets for Machine Learning?

Yes. Building artificial intelligence (AI) models that rely on machine learning requires supplying them with datasets. Webhose delivers both historical and real-time data feeds at scale that can power use cases such as predictive analytics engines, natural language processing (NLP) tools, and financial analysis programs. Webhose users can leverage over 25TB of historical data through the archived web data service, and prospective users can evaluate the service with free datasets that include information retrieved from blog posts, online message boards, and news articles. Webhose has been successfully used to power models that identify fake news.

Do You Provide Data from Social Media Accounts?

Webhose can provide data from several social networking sites, including 4chan.org, Reddit, and VKontakte, a popular Russian social media service. User engagement information extracted from these platforms, such as likes, comments, and shares, is often enough to build machine-learning models that can parse and answer sentiment-based queries, an invaluable asset for brands that need to understand how their perception on social media evolves in real time. Webhose can also assess the virality of social media content by assigning a performance score from one to ten based on the number of times it was shared on social networks. Webhose only crawls websites that have given permission to be indexed and honors all nofollow requests.

Do You Provide Data from the Google Search API?

The Google Search API was deprecated in April 2017 and replaced by the Custom Search API, which returns results from user-built Google custom search engines. These custom search engines are commonly embedded in websites and blogs and poll a list of URLs defined by the user. While Google is inarguably the best-known search engine in the world, it no longer provides its general search results in machine-readable format. By comparison, the Webhose API scans and extracts data from hundreds of thousands of global data sources, providing organizations with access to a wide range of data from blogs, news sites, and other content sources. Only Webhose’s APIs are suitable for providing enterprise-class applications with the volume of data they require from open web search results.

How Can I Get Access to Data Feeds in Near Real-Time?

Webhose’s data feeds offer accurate, up-to-date information about your brand from relevant online sources, including app reviews and rated discussions, which can be vital for keeping tabs on the dynamic nature of product sentiment and the voice of the customer. Our web crawlers are scheduled to collect data from major websites several times a day and deliver it to you in a structured, machine-readable JSON format, ready for analysis.
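To illustrate what "ready for analysis" means in practice, the sketch below flattens a JSON feed into simple records. The feed shown is a trimmed, invented example, and the field names (`posts`, `title`, `sentiment`) are assumptions for illustration rather than the documented schema:

```python
import json

# A trimmed, invented example of the kind of JSON feed a crawl might
# return; the field names are assumptions for this sketch.
feed = json.loads("""
{
  "posts": [
    {"title": "Great service", "published": "2021-03-01T09:00:00Z",
     "sentiment": "positive", "url": "https://example.com/review/1"},
    {"title": "Slow delivery", "published": "2021-03-01T11:30:00Z",
     "sentiment": "negative", "url": "https://example.com/review/2"}
  ]
}
""")

def summarize(feed):
    """Reduce a feed to (title, sentiment) pairs ready for analysis."""
    return [(p["title"], p["sentiment"]) for p in feed["posts"]]

rows = summarize(feed)
```

Because the feed is already structured, no HTML parsing or scraping logic is needed on the client side.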

How Does Webhose Monitor Online Forums?

Message boards, forums, and online review sites are home to millions of exchanges every day — and brands that want to develop a comprehensive picture of public sentiment about them can analyze brand mentions on these platforms to extract actionable insights about their business.
Using the Webhose API to supplement brand mention monitoring from social media and news websites is a great advantage for organizations that need 360-degree visibility into brand perception on the open web. One Webhose customer, for example, used the Webhose forum monitoring API to help their customers understand sentiment in reviews about their restaurants, focusing queries on natural language parameters such as “ambiance”, “service”, and “food” to determine which aspects of the dining experience best resonated with customers.

Can I Use Webhose to Capture Blog Texts?

Yes. On-demand access to machine-readable blog data feeds is an essential tool for media monitoring companies, major brands that need to keep abreast of competitor product developments, and many other entities.

The Webhose Blog Search API includes advanced filters that allow for comprehensive coverage of blog postings, allowing users to build queries with the API based on keywords, language, country, detected sentiment, publication date, author and more. Data can be retrieved in JSON formats. Research and financial institutions can also tap into Webhose’s Archived Web Data (billed separately) to help build and refine predictive analytics algorithms.

How Can I Get Access to Comprehensive Coverage of the Web?

The web is constantly expanding and capturing as many sources as possible is a constant challenge for those responsible for providing internal data analytics teams with enough data points to develop accurate models to both understand current conditions and advance predictions for the future. Webhose can form a pivotal part of that picture by providing reliable, near real-time coverage of a wide variety of data sources, including news, blog posts, forum discussions, and the dark web. Customer implementations range from using it as the sole API source for grabbing information from the internet, using a Webhose API as a backup to a primary data source, or using Webhose data alongside that of other providers.

What Are Alternatives to News Scraping Tools?

News scraping usually refers to an automated process that copies and extracts data from the web into a central database or spreadsheet. It mimics the human routine of browsing the web for data and copying and pasting it into a file saved on your computer. Those who want to develop their own self-hosted scraping tools need to build and maintain a list of websites for the scraper to visit, develop a parsing engine to extract syntactic and semantic components from natural language, and ensure that they have the online infrastructure in place to host their tools.

By contrast, using a structured web data API such as Webhose, those that need to supply data to internal analysis tools merely need to sign up to a service and purchase enough API calls to meet their expected needs, letting the API do the heavy lifting instead of their development staff. Webhose also offers advanced filters that extract sentiment and determine social signals. This frees up internal resources to focus on extracting insights and value from the data — rather than building the infrastructure to take it in.

What Tools Can I Use to Acquire Data from the Web?

Scraping tools are able to automatically capture and export information from the internet and can also detect and output both syntactic and semantic components, such as phone numbers, email addresses, and other contact details. Many tools can be used to scrape data from the internet: dedicated web scrapers, RSS feeds, and various web APIs are among the most popular tools.

Webhose provides an API that facilitates scraping the web — at scale. Its spiders index, capture, and analyze millions of posts a day, including content from the dark and deep webs which is notoriously difficult to capture for structured analysis.

What Tools Can I Use to Monitor the Dark Web?

Scanners, crawlers, and scraping tools can all be used to extract data from the dark web, but due to the ephemeral nature of much of the content uploaded by criminals to these websites, low latency extraction tools are a preferable methodology for capturing information for analysis.

Webhose’s Cyber API scans and extracts data from millions of dark web (.onion) sites, files, marketplaces, messaging apps, and forums and can serve the data extracted in both structured and unstructured formats.

Webhose’s technology also understands the meaning of abbreviations commonly used by criminals operating on these networks, such as “DUMP” (full credit card information), “fullz” (a full package of an individual’s information), and “fishscale” (high-quality drugs). It also retrieves information from password-protected deep web sites and communities, indexes gated content, and can automatically solve complex triple CAPTCHA puzzles.

Due to the sensitive nature of information often posted on the dark web, unlike Webhose’s open web APIs, prospective users of the Cyber API must pass through a short approval process. National security and law enforcement agencies are among those whose dark web interception and analysis efforts are powered in part by Webhose’s service.

What Type of Sentiment Analysis Can I Measure?

Sentiment analysis uses natural language processing (NLP) and text analysis to assess and label the connotation of text. Unlike competitor tools, however, Webhose’s sophisticated sentiment analysis algorithm can process the data it parses with a high level of granularity. One Webhose customer, for example, was able to use the open web API to receive detailed insights about the average customer experience at their business simply by receiving the output of determined sentiment in online reviews from the Webhose API.

Additionally, data obtained from Webhose can be used to train and develop internal sentiment analysis engines. Training sentiment analysis AI tools requires providing them with two datasets split in an 80% to 20% ratio with a separate test dataset. Reliable customer sentiment data allows organizations to improve marketing campaigns, train their salespeople to better understand their target market, and has even been used by Microsoft Research Labs to flag Twitter users at risk of developing postnatal depression.
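The 80/20 split mentioned above can be sketched in a few lines; this is a generic train/validation split over labelled records (the seeded shuffle and record layout are illustrative choices, not anything prescribed by Webhose):

```python
import random

def train_val_split(records, train_fraction=0.8, seed=42):
    """Shuffle labelled records and split them 80/20 into training and
    validation sets; a separate held-out test set should be kept aside
    entirely, as the passage above notes."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)  # deterministic shuffle
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# 100 invented (text, label) records for demonstration.
records = [("post %d" % i, i % 2) for i in range(100)]
train, val = train_val_split(records)
```

With sentiment-labelled feed data as the records, the two sets can be fed directly to most text-classification training loops.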

Continuous access to sentiment monitoring is useful for both brands that need continuous updates as to how public perception is shaping around internal or industry developments as well as financial users who can use sentiment data to develop predictive analytics models for how markets’ human participants will react to key variables.

How Does Webhose Extract Data from the Dark Web?

Gathering data from the dark web is difficult. Unlike the open web, there is no straightforward means of indexing the network, and criminals tend to migrate data between its websites, networks, and secret forums in order to keep law enforcement and national security agencies at bay.

To provide a means for their customers to search through this information with their own analysis tools, Webhose’s team of cyber analysts constantly monitor the dark web to develop and maintain a proprietary index of websites to crawl. This continuously updated index includes millions of active properties, many of which facilitate illicit activities.

The API, accessed through simple RESTful calls, can output both structured and unstructured data in machine-readable formats and can be polled for entities such as email addresses, organizations, locations, and cryptocurrency wallet IDs. In particular, Webhose’s crawling bots focus on capturing non-public information (NPI), personally identifiable information (PII), and information which may have implications for national security.
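As an illustration of polling for entities, the sketch below gathers one entity type across all posts in a response. The response shown is invented, and the `posts`/`entities`/`wallet_ids` field names are assumptions for this sketch, not the documented Cyber API schema:

```python
# A trimmed, invented Cyber API-style response; the field layout is an
# assumption for illustration only.
response = {
    "posts": [
        {"text": "payment to wallet",
         "entities": {
             "emails": ["seller@example.onion"],
             "wallet_ids": ["1BoatSLRHtKNngkdXEeobR76b53LETtpyT"],
         }},
        {"text": "no entities here", "entities": {}},
    ]
}

def collect_entities(response, kind):
    """Gather every extracted entity of one kind (e.g. cryptocurrency
    wallet IDs) across all posts in a response."""
    found = []
    for post in response["posts"]:
        found.extend(post.get("entities", {}).get(kind, []))
    return found

wallets = collect_entities(response, "wallet_ids")
```

The same helper works for any entity type the API extracts, such as email addresses or organizations.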

What are the Different Services Available for Aggregating Business Reviews?

Staying current with online reviews about your brand is an essential best practice for any marketing team to adopt. Reviews can come from many online services, including Google My Business, Yelp, Amazon, and Booking.com, to name but a few.

Customers expect their public feedback to be addressed quickly and taken seriously, especially when it is not complimentary to the brand. Even positive feedback can provide useful datapoints about the customer experience and help product and customer service teams improve and refine their business offering.

While small teams may be able to review and act on such feedback manually, for large businesses and those with multinational operations attempting to do so quickly becomes an exercise in futility. Brand monitoring tools exist that leverage machine learning to optimize this process, but without a reliable supply of real-time data they cannot be effective.

Webhose’s dedicated Reviews API utilizes Natural Language Processing (NLP) to index review websites across the internet and can detect sentiment and mentions of discrete entities based upon keywords ‑ such as specific branches or product mentions. Users can also query the API based on what category of website the review site falls into and receive the domain rank of the website that matched the search keyword — allowing teams to prioritize responding to reviews from high-traffic websites and forge better relationships with potential brand influencers.
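A sketch of the prioritization idea above: ordering reviews so those from high-traffic sites are handled first. The review records are invented, and the `domain_rank` field name, along with the convention that a lower rank number means more traffic (as in Alexa-style rankings), are assumptions for this illustration:

```python
# Invented review records; field names are illustrative assumptions.
reviews = [
    {"title": "Cold food", "site": "bigreviews.example",
     "domain_rank": 120},
    {"title": "Lovely ambiance", "site": "tinyblog.example",
     "domain_rank": 90543},
    {"title": "Rude staff", "site": "toptraffic.example",
     "domain_rank": 35},
]

def prioritize(reviews):
    """Order reviews so those from high-traffic sites come first,
    assuming a lower domain_rank number means a more popular site."""
    return sorted(reviews, key=lambda r: r["domain_rank"])

queue = prioritize(reviews)
```

A response team working through `queue` from the top would address the most visible reviews before the long tail of low-traffic sites.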

Subscribing to the Webhose Reviews API can give your marketing and brand monitoring team the data it needs to effectively map user sentiment at scale and ensure that your customers are not raising their online voices in vain.

Does Webhose Offer Prepared Datasets?

Webhose offers a range of high-quality, free datasets spanning multiple content domains, including online reviews, news, blogs, and discussions. Additionally, millions of open and dark web posts are indexed every day and structured for delivery by API to clients. Both API results and the static datasets available for free download include extracted elements (entities common to a particular source type), inferred elements such as language and author name, and enriched data such as web ranking and social distribution volume. Students, researchers, and commercial enterprises that want straightforward access to pre-structured web data in a unified format (the first stage in building viable AI and machine learning models) can all leverage Webhose’s datasets to further their objectives without having to waste time developing their own data collection and structuring systems.