for their machine learning and AI models.
Learn more about these projects and their real-world applications.
In April 2018 it became impossible to extract content from Facebook without violating their Terms of Service. As a result, any software or services that relied on Facebook data were rendered useless. This article expands on the dangers of relying on such APIs exclusively and explores other solutions for gathering data from the web.
Many different techniques can be used to extract data from the web. This paper highlights a combination of web scraping techniques integrated into a browser as an extension. It also discusses the methods used for this data extraction and how users can best be notified of it. This new technique could encourage more collaboration, annotation and sharing among users.
Biometric technology, used in immigration and refugee management, has components of human identification and identity management which are now a top target for cybercriminals. A team from the computer science department at the University of North Carolina at Charlotte developed a methodology to identify biometric technology vulnerabilities in addition to the different limitations of identity management. They did this by publicly monitoring different sources for threats. Open web data for this project was taken from the Google Search API and dark web data was taken from Webhose.
This study’s goal is to find an approach to extract any web data in a structured manner, regardless of the different sources. The idea is that this approach would be used as a foundation for developing a web scraping application for tracking information.
Many people receive news from both fake and real news sources without having the ability to differentiate between the two. This team built a model that identifies fake news through reverse plagiarism and natural language processing (NLP). The model has room for improvement, especially when compared to models based on logistic regression and multinomial naïve Bayes. Datasets for this project were taken from the Webhose archive.
This paper compares topics of public opinion based on news and blogs on the web in both Italy and Germany.
Online data is ever-expanding, and storing it in all its different formats, particularly unstructured data, is becoming increasingly challenging. This paper presents different approaches to storing, retrieving, and analyzing the unstructured data.
Developing a Global Indicator for Aichi Target 1 by Merging Online Data Sources to Measure Biodiversity Awareness and Engagement
Traditionally, public support for biodiversity has been measured through public-opinion surveys that are costly, geographically restricted, and time-consuming. Seeking a reliable alternative to these surveys, a team from the University of Maryland tracked biodiversity-related keywords in 31 different languages in data across the web in real time, globally, at a much lower cost. Researchers used datasets from Webhose’s new repository.
FactExtract: Automatic Collection and Aggregation of Articles and Journalistic Factual Claims from Online Newspaper
Manual fact-checking of the web is now impossible with the exponential increase of data in real-time. Web scraping is one solution to this manual and time-consuming method for the extraction of specific and highly structured data. This team presented a method for the automatic extraction of articles that was tested on 15 Senegalese news websites.
A need exists for technology that can classify and filter images into different subjects and categories. Although this is not a new idea, scaling it with datasets based on deep learning is new, as are the training models used to classify these images. Data used to evaluate the image-recognition models comes from Webhose’s datasets, which include 7 different categories: entertainment, finance, politics, sports, technology, travel and world news.
Many content delivery engines are known to deliver polarized results due to the personalization and filter bubble of content online. A team of researchers developed an alternative: a prototype for a news search engine with balanced viewpoints according to flexible user-defined constraints. The goal is to demonstrate that balanced content delivery systems are a real possibility and that the ability to control bias of results can be given to the user as well. This project utilized the Webhose News Search API to rank articles according to popularity.
Social engineering focuses on attacking humans rather than the technology of information systems. This research offers a threat assessment model to determine which social media users are vulnerable as well as detect malicious credit card resellers.
The team created a model that could predict if news articles were real or fake. Webhose’s API allowed the team to query its database with site URLs to retrieve articles and download articles from the most reliable news sources available.
Since manual text classification can be very time-consuming and costly, several supervised approaches have been introduced, but they require lots of labeled training data. Semi-supervised approaches, on the other hand, don’t require as much labeling. A newer approach, dataless text classification, also needs less labeled data and instead relies on identifying the semantic similarity between documents and predefined categories. But this approach does not allow researchers to take advantage of large-scale knowledge bases. Another approach, the TECNE approach, does not require any training data and is the focus of this project. Webhose’s news article datasets were used for this research.
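As a rough illustration of the dataless idea described above, the sketch below assigns a document to whichever predefined category it most resembles, with no labeled training documents. It is not the TECNE approach or any method from the paper: the category descriptions are hypothetical, and plain word-overlap cosine similarity stands in for a real semantic similarity measure.

```python
import math
from collections import Counter

# Hypothetical category descriptions (assumption for illustration).
# No labeled training documents are used anywhere.
CATEGORIES = {
    "sports": "match team player score league goal coach",
    "finance": "market stock bank investor shares profit economy",
    "technology": "software computer internet device app data chip",
}

def bag_of_words(text):
    """Lowercase, whitespace-split word counts (no punctuation handling)."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine similarity between two word-count vectors."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def classify(document, categories=CATEGORIES):
    """Return the category whose description is most similar to the document."""
    doc_vec = bag_of_words(document)
    return max(
        categories,
        key=lambda name: cosine(doc_vec, bag_of_words(categories[name])),
    )

print(classify("the bank said its shares rose as the stock market rallied"))
```

Word overlap only matches exact terms, so a practical system would presumably swap `cosine` over word counts for a similarity computed on richer semantic representations.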
What is an institution’s weakest link? The humans operating it. This paper proposes a model for measuring social engineering vulnerability. The model can detect vulnerabilities in an automated social engineering attack and in an automated reverse social engineering attack, as well as vulnerabilities in different social networking sites, blogs and forums.
The ability to detect the novelty, or originality, of a document has many applications, including tracking news events, predicting the impact of research, and other NLP applications. The researchers found no existing way of measuring novelty. This paper presents a benchmark for measuring the level of novelty in documents.
This paper demonstrates the value of web mining and data mining approaches to gain information regarding the labor market. It applies both to determine clusters for careers, showing how these results can then be used to identify decision trees for modeling career paths.
Social media is becoming more and more influential in both a positive and negative way on society. For example, governments can use it for effective disaster management or as a way of spreading disinformation. As a result, the ability to detect fake news in social media can be advantageous in order to prevent it from spreading. This paper introduces a method for identifying tweets with fake news content. The research team used large miscellaneous event datasets for this project.
Ecommerce sites can provide a wealth of data to customers about their products, but often this data can become overwhelming and time-consuming for the customer. This paper examines an alternative by using web scrapers to gather that information and present it to customers in a cross-comparison display.
Data on the internet is expanding at an ever-increasing rate. Raw data must be analyzed quickly, for example through sentiment analysis. This paper introduces an approach that extracts the hottest topics and news headlines, offers an efficient sentiment analysis, and provides a model to quickly understand the relationship between words in the article. This new model could be useful for society analysts, sociologists and politicians when analyzing news articles. The team collected news from six different countries using Webhose’s API.
Until now, big data used to combat terrorism has focused on only one type of social media at a time. This paper introduces a way of harnessing data from multiple social media sources to detect terrorist activities, called the Social Media Analysis for Combating Terrorism (SMACT) model. Webhose was one of the sources used for collecting data from social media.
What is the public’s opinion of the Supplemental Nutrition Assistance Program (SNAP), once known as the Food Stamp Program? This project seeks to find an answer to that question through natural language processing (NLP) tools, machine learning, text mining and sentiment analysis. Webhose was able to provide access to news articles with rich datasets for this project.
The purpose of this project was to first build a thematic database about news of the Zika virus from online sources such as news, blogs, and online discussions. After that, the project was aimed at being able to perform queries and find connections in the datasets with the goal of better understanding the impacts of this disease in social media. Data for news articles for this project was collected from Webhose’s repository.
REALM: A Computational Framework for Investigating Research Impacts Using Alternative Metrics (Portuguese)
During emergencies such as the Zika virus in Brazil, researchers and physicians rely on social media to exchange information in a faster and more efficient manner. When this method is used during emergencies, however, citizens need a way to verify the validity of the information and the reputation of the researcher. This paper offers a way to identify the reputation of these researchers based on a method called altmetrics. Data was taken from Webhose’s API, whose advanced filters enable extraction of only specific publications related to the topic, such as the Zika virus.
Word embedding takes words or phrases from a language and maps them to vectors of real numbers. This paper examines the quality of these word embeddings when used in word analogy tasks (e.g., Paris is to France as Lisbon is to Portugal) and when used in sentiment analysis. The research found word analogy tasks had the highest performance when models of text included data that was large, rich in vocabulary, and multi-thematic. Sentiment analysis tasks also needed datasets that were large and rich in vocabulary for best results. For this research, sentiment analysis was conducted on tweets, song lyrics, movie reviews and phone reviews. Research was conducted with news articles from Webhose’s datasets.
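To make the word analogy task concrete, here is a minimal sketch of the commonly used vector-offset method, in which "a is to b as c is to ?" is answered by finding the word closest to b - a + c. The three-dimensional vectors are invented for illustration and are not taken from the paper; trained embeddings typically have hundreds of dimensions learned from large corpora.

```python
import math

# Toy embeddings invented for illustration only (assumption);
# a trained model would supply these vectors.
EMBEDDINGS = {
    "paris":    [0.9, 0.1, 0.8],
    "france":   [0.9, 0.1, 0.1],
    "lisbon":   [0.1, 0.9, 0.8],
    "portugal": [0.1, 0.9, 0.1],
    "music":    [0.5, 0.5, -0.7],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def solve_analogy(a, b, c, embeddings=EMBEDDINGS):
    """Answer 'a is to b as c is to ?' using the offset b - a + c."""
    target = [
        vb - va + vc
        for va, vb, vc in zip(embeddings[a], embeddings[b], embeddings[c])
    ]
    # The three query words are excluded from the candidate answers.
    candidates = (w for w in embeddings if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(embeddings[w], target))

print(solve_analogy("paris", "france", "lisbon"))
```

With these toy vectors the offset lands on the vector for "portugal", mirroring the Paris/France/Lisbon example in the summary above.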
This paper describes the system developed to deliver insight to Eastman Chemical Company about the current chemical landscape. The team at Virginia Tech did this by creating a search interface that enabled them to drill down to a particular time period and look at what people were writing about with certain keywords.
This project introduces the development of a data warehouse (DW) of athletic results as well as a way to integrate that data with both the geographic location and the atmospheric conditions of the competitions. First, data must be parsed by tokenizing the text, extracting data, and defining the different data hierarchies. Next, data must be scraped for the results and converted into PDF and plain text to store in the DW.
Semantic interpretations of text can be achieved through cognitive modeling, though these interpretations are not usually subjective. This paper takes into account this subjective factor of text in cognitive modeling through automated synthesis with an in-depth cognitive interpretation of model components.
New Technologies, Continuing Ideologies: Online Reader Comments as a Support for Media Perspectives of Minority Religions
Nationalist interests have strengthened in Europe in the past few years, leading to increased numbers of discussions in right-leaning news websites. This paper examines these discussions for stereotypical representations of Islam and Catholicism in Daily Mail and Telegraph websites.
This paper introduces a design for a system that collects data about items sold on the Dark Web. The system lets users search the data and also notifies them of any changes in the data that occur in the markets being searched. A prototype of this system was used for the Cyber Crime Unit of the Police of the Czech Republic.
This paper gives a comprehensive analysis of style transfer, an area of Natural Language Processing (NLP), and explains the challenges of granularity, transferability and distinguishability inherent in sentiment transfer. Other solutions are discussed, including news outlet style transfer and non-parallel error correction.
The use of cryptocurrency has become very popular in the past few years. The anonymity of the currency, however, enables it to be used for criminal purposes, such as the purchase of weapons, forged documents and drugs. However, if the same bitcoin address is used for multiple purchases or if the user publishes it to receive payments in forums, for example, it is possible to establish the user’s identity. This paper presents a tool for the recovery of bitcoin addresses and information related to them.
The chapter of this book discusses several larger examples of web scrapers.
Although personalized news and shopping have grown in popularity, the public has become more aware of their inherent bias and the polarization of opinions. This paper presents an alternative balanced news feed in which users can see beforehand the political leaning of their news consumption as well as set their polarization constraints.
Data from social media is becoming increasingly valuable to social scientists even though the solutions for crawling and gathering this data don’t allow for individual crawling scenarios. This paper addresses this challenge with an approach based on a purpose-built domain-specific language (DSL) and the architecture of a distributed crawling system, which requires the user to define a description of the needed data.
This paper presents a web crawler that would collect information from the websites of Slovenian companies and their profiles on social networks and apply a model for evaluating the quality of information. The conclusion of the project was that we can find and explain the variance in information quality among the websites of Slovenian companies based on their size and business type.
Online personalization has delivered many positive results to users, including better product recommendations and more relevant news and content. But it also can create biases which influence users through a filter bubble which they do not have much control over. This paper presents a proven scalable algorithm that allows us to avoid polarization yet still optimize individual utility.
This paper demonstrates the ability to conduct cognitive modeling while taking into account semantic interpretations. This is done through a variety of methods, including convergent decision making, heuristic algorithms, or a combination of an automated synthesis of cognitive models with an in-depth cognitive interpretation of model components.
The collection of relevant data from the web can be done with the help of web monitoring tools. This paper presents a web monitoring tool for a company in the automotive industry.
A Hybrid Approach for Alarm Verification Using Stream Processing, Machine Learning and Text Analytics
“False alarms triggered by security sensors can be expensive for all parties involved. This paper presents a scalable alarm verification system for 30K alarms a second with up to 90% accuracy through a combination of machine learning, stream and batch processing technologies. The team used Webhose to collect descriptions of fire and intrusion incidents from different online resources, such as Twitter, RSS feeds, or web pages to calculate an a-priori risk factor for intrusion and fire alarms.”
Machine learning is a valid alternative to the manual classification of large quantities of text documents. This paper demonstrates how automatic classification of texts works and gives a quick overview of the most common algorithms used for this purpose. The paper continues this work by beginning to extrapolate the application to handle official Swedish documents. Webhose’s free datasets were used for this project.
This paper’s goal is to build a model that predicts cryptocurrency trends based on data from the Poloniex online market. The forecast for the model was based on news obtained through the Reddit website. The conclusion was that the model would fail to yield profitable returns.