Web Data Collection Article

Experts chime in on questions about data feeds for the open and dark web and provide insight on how to best utilize the Webhose API and Firehose service for data extraction.

Do you have any filter that will return data only from the top sites you crawl?

Yes. There are multiple ways to get higher-quality results, either posts from popular websites or posts that are themselves popular. The first is the domain_rank filter, which ranks a domain by popularity (monthly traffic). To search for posts from the top 1,000 sites, use the following:

domain_rank:<1000

The second option is to use either the performance_score filter or social signals to find posts that went viral or were shared or liked a lot on social networks.
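
As a rough sketch, the domain_rank filter could be sent to the API like this. The endpoint and the token, format, and q parameters follow the example call shown in the sorting answer of this FAQ; the build_url helper and the placeholder token are illustrative assumptions, not part of any official client.

```python
from urllib.parse import urlencode

# Endpoint taken from the example call in the sorting answer of this FAQ.
BASE_URL = "https://webhose.io/filterWebContent"


def build_url(token: str, query: str) -> str:
    """Return a request URL with the query percent-encoded (hypothetical helper)."""
    params = {"token": token, "format": "json", "q": query}
    return f"{BASE_URL}?{urlencode(params)}"


# Keyword "opera" restricted to the top 1,000 sites by domain rank.
url = build_url("XXXX-XXXX-XXXX", "opera domain_rank:<1000")
print(url)
```

Encoding the query matters here because characters such as ":" and "<" are not safe to place in a URL unescaped.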

Can you share the list of sites you are crawling?

Webhose.io doesn’t rely on a whitelist to crawl the web. Our crawlers find new sites and new content dynamically, so any list we sent would be misleading. If you want to know whether we crawl a source, you can either use the “site:” filter or email support@webhose.io with the list of sites you want to check.

Can I disable stemming when searching for an exact term?

Yes. Just append the dollar sign ($) to the end of the keyword. For example, searching for the keyword “simplivity” will also return hits for the word “simple” since we index the stemmed version of the word, but if you want to find documents that contain “simplivity” and nothing else, search for “simplivity$”.
Stemmed searches are currently supported for English, Spanish, Arabic and Russian.

Do you rate limit API calls?

Rate limits are applied per access token. You can make one request per second. Exceeding the limit results in an HTTP 429 error.
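
One way a client might stay under this limit is sketched below. It assumes a caller-supplied send function that performs the request and returns a (status_code, body) pair; the Throttle class, the call_with_retry helper, and the backoff values are hypothetical choices, not documented Webhose behavior.

```python
import time


class Throttle:
    """Space calls at least `interval` seconds apart."""

    def __init__(self, interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.interval = interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, if any

    def wait(self):
        """Block until at least `interval` seconds have passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def call_with_retry(send, throttle, max_attempts=3, backoff=2.0, sleep=time.sleep):
    """Call send() under the throttle; back off and retry while it returns 429."""
    status, body = None, None
    for attempt in range(max_attempts):
        throttle.wait()
        status, body = send()
        if status != 429:
            break
        # Assumed policy: wait a little longer after each rate-limit response.
        sleep(backoff * (attempt + 1))
    return status, body
```

Because the limit is per token, a single Throttle instance should be shared by every caller that uses the same token.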

Does the API support wildcard expressions as the query?

The query syntax is based on Elasticsearch query string syntax, which means you can use wildcards.

Do you limit the length of the query, or the maximum number of Boolean clauses I can use?

The maximum length of a query is 4,000 characters.

Does the API support nested boolean expressions as well?

Boolean expressions can be nested in as many levels as you want.

For example: (exp1 AND exp2 AND exp3) OR (exp4 AND (exp5 OR exp6)) -(exp7 AND (exp8 OR exp9))
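
Nested expressions like the one above can be assembled programmatically. The following is a minimal sketch; the all_of, any_of, and exclude helpers are hypothetical names, and the length check uses the 4,000-character limit stated earlier in this FAQ.

```python
MAX_QUERY_LENGTH = 4000  # maximum query length stated in this FAQ


def all_of(*terms: str) -> str:
    """Join terms with AND inside parentheses."""
    return "(" + " AND ".join(terms) + ")"


def any_of(*terms: str) -> str:
    """Join terms with OR inside parentheses."""
    return "(" + " OR ".join(terms) + ")"


def exclude(expr: str) -> str:
    """Negate an expression with a leading minus sign."""
    return "-" + expr


included = " OR ".join([
    all_of("exp1", "exp2", "exp3"),
    all_of("exp4", any_of("exp5", "exp6")),
])
excluded = exclude(all_of("exp7", any_of("exp8", "exp9")))
query = f"{included} {excluded}"

if len(query) > MAX_QUERY_LENGTH:
    raise ValueError("query exceeds the 4,000-character limit")
print(query)
```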

Can I get the highlighted fragments that matched my query?

Yes. Just add highlight=true as a parameter to your call.

How can I get all the posts of a thread?

To extract an entire thread, use the “thread.url” filter. This will return all the posts belonging to the thread URL provided. Example:

thread.url:http\:\/\/domain.com\/param=val

(note that you must escape the http:// part of the URL like so: http\:\/\/).
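
The escaping can be automated with a small helper. This sketch escapes every colon and forward slash, which reproduces the example above; whether other characters also need escaping is not stated here, so treat thread_filter as a hypothetical starting point.

```python
def thread_filter(url: str) -> str:
    """Build a thread.url filter, escaping ':' and '/' with backslashes."""
    escaped = url.replace(":", "\\:").replace("/", "\\/")
    return f"thread.url:{escaped}"


print(thread_filter("http://domain.com/param=val"))
# prints thread.url:http\:\/\/domain.com\/param=val
```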

Pricing section says ‘100 results per request’. Does that mean we get only 100 results?

No. If your query produced more than 100 results, you can call the URL appearing in the “next” key in the results set to receive the next page presenting the next set of 100 posts.
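
Following the “next” key can be wrapped in a loop. The sketch below assumes the response is a JSON object whose posts sit under a “posts” key, and that an empty page means there is nothing more to read; both are assumptions, as are the iter_posts name and the caller-supplied fetch function that returns the decoded response for a URL.

```python
from typing import Callable, Iterator


def iter_posts(
    first_url: str,
    fetch: Callable[[str], dict],
    max_pages: int = 10,
) -> Iterator[dict]:
    """Yield posts page by page, following the "next" URL in each response."""
    url = first_url
    for _ in range(max_pages):
        page = fetch(url)
        posts = page.get("posts", [])  # "posts" key name is an assumption
        if not posts:
            break  # assumed stop condition: an empty page ends the result set
        yield from posts
        url = page.get("next")
        if not url:
            break
```

The max_pages cap keeps a broad query from consuming more of the monthly request quota than intended.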

Does your search support entity extraction (like people, companies, locations)?

Yes. You can search by person, location or organization on news or blog posts in English. For example, organization:apple will return news or blog posts mentioning Apple the company and not the fruit.

How many sources do you crawl? / Can you share your complete list of sources on your crawling cycle?

Webhose.io does not share this information. We could never provide a comprehensive list that is up to date as it is by nature an ever evolving and continuously updating dataset that aggregates a vast volume of sources.
What we can tell you, however, is that the number is in the millions, with over 10MM posts indexed daily. We pride ourselves on our ability to add sources that we don’t yet cover within a few hours.
Moreover, you can quickly use the API query builder domain field to confirm coverage for a particular source. Customers send us source requests (often including a long list of sources), and we can report back to you regarding our coverage in a day or two.

How many keywords can we track per month?

You can enter any Boolean query with no set limit to the number of tracked keywords. The plan limit refers to the number of monthly requests, which you can upgrade at any time.

My result set shows the same article link multiple times – don’t you filter out duplicates?

We do filter out duplicates. You may get the same article link multiple times if your query matches multiple comments on the same article. Webhose.io searches at the post level, so results include each post that matched your query. Each post also contains information about its containing thread, and one of the thread’s properties is the article link. That’s why you might see the same link multiple times. If you want to search only for the first post (i.e. only the article and no comments), add is_first:true to your query. For example:

opera is_first:true

This will return only articles (i.e. no comments) containing the word “opera”.
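
If you do want the comments but need them grouped under their article, the grouping can be done on the client side. This is a sketch only: it assumes each post is a dictionary with a nested “thread” object whose “url” field holds the article link, and those field names are assumptions rather than a documented schema.

```python
from collections import defaultdict


def group_by_article(posts):
    """Group posts by their thread's article link, keeping first-seen order."""
    groups = defaultdict(list)
    for post in posts:
        # Field names "thread" and "url" are assumed for illustration.
        groups[post["thread"]["url"]].append(post)
    return dict(groups)


sample = [
    {"text": "article body", "thread": {"url": "http://example.com/a"}},
    {"text": "first comment", "thread": {"url": "http://example.com/a"}},
    {"text": "other article", "thread": {"url": "http://example.com/b"}},
]
for link, items in group_by_article(sample).items():
    print(link, len(items))
```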

Do you filter out spam?

Each thread is given a spam score ranging from 0 to 1, indicating how spammy the text is. For example, you can filter out threads with a spam score higher than 0.5 by adding the term “spam_score:<=0.5” to the search query.

Why do the thread and post URLs go through Omgili.com?

On the free plan, URLs for post and threads redirect through Omgili.com with a 5 second redirect lag. This way we show site owners webhose.io is a significant traffic referral source.

How are results sorted?

By default (when the sort parameter isn’t specified) the results are sorted by the recommended order of crawl date. You can however change the sort order by using the following values:

  • relevancy
  • social.facebook.likes
  • social.facebook.shares
  • social.facebook.comments
  • social.gplus.shares
  • social.pinterest.shares
  • social.linkedin.shares
  • social.stumbledupon.shares
  • social.vk.shares
  • replies_count
  • participants_count
  • spam_score
  • performance_score
  • domain_rank
  • ord_in_thread
  • rating

For example, the following call will return posts ordered by the number of Facebook likes:

https://webhose.io/filterWebContent?token=XXXX-XXXX-XXXX&format=json&q=*&sort=social.facebook.likes
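
A client could validate the sort value before making the call. The sketch below uses the values listed above and the endpoint from the example call; the sorted_url helper is a hypothetical name.

```python
from urllib.parse import urlencode

# Sort values copied from the list in this answer.
SORT_VALUES = {
    "relevancy",
    "social.facebook.likes",
    "social.facebook.shares",
    "social.facebook.comments",
    "social.gplus.shares",
    "social.pinterest.shares",
    "social.linkedin.shares",
    "social.stumbledupon.shares",
    "social.vk.shares",
    "replies_count",
    "participants_count",
    "spam_score",
    "performance_score",
    "domain_rank",
    "ord_in_thread",
    "rating",
}


def sorted_url(token: str, query: str, sort: str) -> str:
    """Build a request URL, rejecting sort values that are not in the list."""
    if sort not in SORT_VALUES:
        raise ValueError(f"unsupported sort value: {sort!r}")
    params = {"token": token, "format": "json", "q": query, "sort": sort}
    return "https://webhose.io/filterWebContent?" + urlencode(params)


# urlencode writes "*" as %2A, which decodes back to the same query.
print(sorted_url("XXXX-XXXX-XXXX", "*", "social.facebook.likes"))
```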

Is it possible to query for posts in multiple languages?

Yes. Use a simple OR Boolean query. For example:

(language:german OR language:chinese)

This will return posts written in either German or Chinese.
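
For longer language lists, the OR clause can be generated rather than typed by hand. A minimal sketch, with language_filter as a hypothetical helper name:

```python
def language_filter(*languages: str) -> str:
    """Return an OR clause matching posts in any of the given languages."""
    if not languages:
        raise ValueError("at least one language is required")
    return "(" + " OR ".join(f"language:{lang}" for lang in languages) + ")"


print(language_filter("german", "chinese"))
# prints (language:german OR language:chinese)
```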

What is your language and geographic coverage?

Webhose.io supports 80 languages across every geographic territory with online access.

Do you provide historical data?

Yes, you can access the archive to get access to data older than 30 days.

What type of data are you able to extract?

Webhose extracts data from news articles, blogs, online discussions/forums and reviews and is able to filter data from search results on the open web according to keyword, publication date, person, location or organization, language, country and more. Granular filtering capabilities also include delivering data according to social signals, performance score (grading posts by how viral they are), and sentiment analysis.

What if a source I’m looking for on the open web is not covered?

Sources can quickly be ingested as needed, free of hassle simply by request from our customer service. New sources are constantly added for our customers.

Can I Use Webhose to Capture Web Data?

Yes. Webhose crawls the open (or “clear”) web and allows users to retrieve pre-filtered data in machine-readable format. It indexes millions of posts from news sites, online discussions, blogs and reviews. Users can query the API by keywords, author, sentiment, location, and organization. The internet is a resource that is constantly shifting in both size and content and no one tool, including major search engines, can ever index it completely. For organizations at any stage of their growth, however, Webhose’s open web API can serve as their primary means for ingesting data from the open internet or serve as a supplement to existing sources.

Can I Use the News API for Financial Analysis?

Those trading the financial markets, particularly high-frequency traders, require access to real-time financial market data to stay on top of market and geopolitical trends. Webhose’s News API can give traders a broader perspective on the financial market by adding global news, forums, reviews, and blogs to the sources on which they base trading decisions, helping them make smarter, more profitable decisions. AI-powered decision-making is currently responsible for more than two-thirds of global financial transactions. A rapid supply of data is required to feed the continuously improving machine learning models that power these technologies. Additionally, by correlating historic market data with subsequent market movements, financial institutions can deduce trends which can inform future trading decisions and develop predictive analytics systems to determine the future trajectory of financial instruments and markets.

Can I Use Webhose to Build my own Datasets for Machine Learning?

Yes. Building artificial intelligence (AI) models that rely on machine learning requires supplying datasets. Webhose delivers both historical and real-time data feeds at scale that can power use-cases such as predictive analytics engines, natural language processing (NLP) tools, and financial analysis programs. Webhose users can draw on over 25TB of historical data by using the open web data and the historical archive. In addition, those considering the service can evaluate it with free datasets that include information retrieved from blog posts, online message boards, and news articles. Webhose has been successfully used to power models which identify fake news.

Do You Provide Data from Social Media Accounts?

Webhose can provide data from several social networking sites, including 4chan.org, Reddit, and vkontakte, a popular Russian social media service. User engagement information extracted from these platforms, such as likes, comments, and shares, is often enough to build machine-learning models that can parse and return sentiment-based queries, an invaluable asset for brands concerned with understanding their perception on social media as it evolves in real-time. Additionally, Webhose can also assess the virality of social media content by assigning a performance score from one to ten based on the number of times it was shared on social networks. Webhose only crawls websites that have given their permission to be indexed and honors all nofollow requests.

Do You Provide Data from the Google Search API?

The Google Search API was deprecated in April 2017 and was replaced by the Custom Search API, which returns results from user-built Google custom search engines. These custom search engines are commonly embedded in websites and blogs and poll a list of URLs defined by the user. While Google is inarguably the best-known search engine in the world, it no longer provides its search results in machine-readable format. By comparison, the Webhose API scans and extracts data from hundreds of thousands of global data sources, providing organizations with access to a wide range of data from blogs, news sources, and other content sources. Only Webhose’s APIs are suitable for providing enterprise-class applications with the volume of data they require from open web search results.

How Can I Get Access to Data Feeds in Near Real-Time?

Webhose’s data feeds offer accurate, up-to-date information about your brand from relevant online sources, such as app reviews and rated discussions, which can be vital for keeping tabs on shifting product sentiment and the voice of the customer. Our web crawlers are scheduled to collect data from major websites several times a day and deliver it to you in a structured, machine-readable JSON format, ready for analysis.

How Does Webhose Monitor Online Forums?

Message boards, forums, and online review sites are home to millions of exchanges every day — and brands that want to develop a comprehensive picture of public sentiment about them can analyze brand mentions on these platforms to extract actionable insights about their business.
Using the Webhose API to supplement brand mention monitoring from social media and news websites is a great advantage for organizations that need 360-degree visibility into brand perception on the open web. One Webhose customer, for example, was able to use the Webhose forum monitoring API to help their customers understand sentiment based on reviews about their restaurants, focusing the queries on natural language parameters such as “ambiance”, “service”, and “food” to determine which aspects of the diner experience best resonated with customers.

Can I Use Webhose to Capture Blog Texts?

Yes. On-demand access to machine-readable blog data feeds is an essential tool for media monitoring companies, major brands that need to keep abreast of competitor product developments, and many other entities.

The Webhose Blog Search API includes advanced filters that allow for comprehensive coverage of blog postings, allowing users to build queries with the API based on keywords, language, country, detected sentiment, publication date, author and more. Data can be retrieved in JSON formats. Research and financial institutions can also tap into Webhose’s Archived Web Data (billed separately) to help build and refine predictive analytics algorithms.

How Can I Get Access to Comprehensive Coverage of the Web?

The web is constantly expanding, and capturing as many sources as possible is a constant challenge for those responsible for providing internal data analytics teams with enough data points to develop accurate models, both to understand current conditions and to make predictions about the future. Webhose can form a pivotal part of that picture by providing reliable, near real-time coverage of a wide variety of data sources, including news, blog posts, forum discussions, and the dark web. Customer implementations range from using it as the sole API source for grabbing information from the internet, to using a Webhose API as a backup to a primary data source, to using Webhose data alongside that of other providers.

What Are Alternatives to News Scraping Tools?

News scraping usually refers to an automated process which copies and extracts data from the web into a central local database or spreadsheet. It mimics the human version of browsing the web for data and copying and pasting it into a file to save on your computer. Those who want to develop their own self-hosted scraping tools need to build and maintain a list of websites for the scraper to visit, develop a parsing engine to extract syntactic and semantic components from natural language, and ensure that they have the online infrastructure in place to host their tools.

By contrast, using a structured web data API such as Webhose, those that need to supply data to internal analysis tools merely need to sign up to a service and purchase enough API calls to meet their expected needs, letting the API do the heavy lifting instead of their development staff. Webhose also offers advanced filters that extract sentiment and determine social signals. This frees up internal resources to focus on extracting insights and value from the data — rather than building the infrastructure to take it in.

What Tools Can I Use to Acquire Data from the Web?

Scraping tools are able to automatically capture and export information from the internet and can also detect and output both syntactic and semantic components, such as phone numbers, email addresses, and other contact details. Many tools can be used to scrape data from the internet: dedicated web scrapers, RSS feeds, and various web APIs are among the most popular tools.

Webhose provides an API that facilitates scraping the web — at scale. Its spiders index, capture, and analyze millions of posts a day, including content from the dark and deep webs which is notoriously difficult to capture for structured analysis.

What Tools Can I Use to Monitor the Dark Web?

Scanners, crawlers, and scraping tools can all be used to extract data from the dark web, but due to the ephemeral nature of much of the content uploaded by criminals to these websites, low latency extraction tools are a preferable methodology for capturing information for analysis.

Webhose’s Cyber API scans and extracts data from millions of dark web (.onion) sites, files, marketplaces, messaging apps, and forums and can serve the data extracted in both structured and unstructured formats.

Webhose’s technology also understands the meaning of abbreviations commonly used by criminals operating on these networks, such as “DUMP” (full credit card information), “fullz” (full package of an individual’s information), and “fishscale” (high quality drugs). It also retrieves information from password-protected deep web communities, indexes gated content, and can automatically solve complex triple CAPTCHA puzzles.

Due to the sensitive nature of information often posted on the dark web, unlike Webhose’s open web APIs, prospective users of the Cyber API must pass through a short approval process. National security and law enforcement agencies are among those whose dark web interception and analysis efforts are powered in part by Webhose’s service.

What Type of Sentiment Analysis Can I Measure?

Sentiment analysis uses natural language processing (NLP) and text analysis to assess and label the connotation of text. Unlike competitor tools, however, Webhose’s sophisticated sentiment analysis algorithm can process the data it parses with a high level of granularity. One Webhose customer, for example, was able to use the open web API to receive detailed insights about the average customer experience at their business simply by receiving the output of determined sentiment in online reviews from the Webhose API.

Additionally, data obtained from Webhose can be used to train and develop internal sentiment analysis engines. Training sentiment analysis AI tools requires providing them with two datasets split in an 80% to 20% ratio with a separate test dataset. Reliable customer sentiment data allows organizations to improve marketing campaigns, train their salespeople to better understand their target market, and has even been used by Microsoft Research Labs to flag Twitter users at risk of developing postnatal depression.

Continuous access to sentiment monitoring is useful for both brands that need continuous updates as to how public perception is shaping around internal or industry developments as well as financial users who can use sentiment data to develop predictive analytics models for how markets’ human participants will react to key variables.

How Does Webhose Extract Data from the Dark Web?

Gathering data from the dark web is difficult. Unlike the open web, there is no straightforward means of indexing the network, and criminals tend to migrate data between its websites, networks, and secret forums in order to keep law enforcement and national security agencies at bay.

To give its customers a means of searching through this information with their own analysis tools, Webhose’s team of cyber analysts constantly monitors the dark web to develop and maintain a proprietary index of websites to crawl. This continuously updated index includes millions of active properties, many of which facilitate illicit activities.

The API, which is interacted with by making a simple RESTful API call, can output both structured and unstructured data in machine readable formats and be polled for entities such as email addresses, organizations, locations, and cryptocurrency wallet IDs. In particular, Webhose’s crawling bots are focused on capturing non-public information (NPI), personally identifiable information (PII), and information which may have implications for national security.

What are the Different Services Available for Aggregating Business Reviews?

Staying current with online reviews about your brand is an essential best practice for any marketing team to adopt. Reviews can come from many online services, including Google My Business, Yelp, Amazon, and Booking.com, to name but a few.

Customers expect their public feedback to be addressed quickly and taken seriously, especially when it is not complimentary to the brand. Even positive feedback can provide useful data points about the customer experience and help product and customer service teams to improve and refine their business offer.

While small teams may be able to review and action such feedback on a manual basis, for large businesses and those with multinational operations attempting to do so quickly becomes an exercise in futility. Brand monitoring tools exist which leverage machine learning to optimize this process, but without a reliable supply of real-time data they cannot be effective.

Webhose’s dedicated Reviews API utilizes Natural Language Processing (NLP) to index review websites across the internet and can detect sentiment and mentions of discrete entities based upon keywords ‑ such as specific branches or product mentions. Users can also query the API based on what category of website the review site falls into and receive the domain rank of the website that matched the search keyword — allowing teams to prioritize responding to reviews from high-traffic websites and forge better relationships with potential brand influencers.

Subscribing to the Webhose Reviews API can give your marketing and brand monitoring team the data it needs to effectively map user sentiment at scale and ensure that your customers are not raising their online voices in vain.

Does Webhose Offer Prepared Datasets?

Webhose offers a range of high-quality, free datasets spanning multiple content domains, including online reviews, news, blogs, and discussions. Additionally, millions of open and dark web posts are indexed every day and structured for delivery by API to clients. Both API results and the static datasets available for free download include extracted elements (entities common to a particular source type), inferred elements such as language and author name, and enriched data such as web ranking and social distribution volume. Students, researchers, and commercial enterprises that want straightforward access to pre-structured web data in a unified format (the first stage in building viable AI and machine learning models) can all use Webhose’s datasets to further their objectives without having to spend time developing their own data collection and structuring systems.