Academic Research

Researchers from leading global institutions rely on Webhose’s free datasets
for their machine learning and AI models.
Learn more about these projects and their real-world applications.

Computational Research in the Post-API Age

Authors: Deen Freelon | Oct 2018 | Political Communication

In April 2018 it became impossible to extract content from Facebook without violating its Terms of Service. As a result, any software or service that relied on Facebook data was rendered useless. This article examines the dangers of relying exclusively on such APIs and explores other solutions for gathering data from the web.


Change Detection and Notification Method of the Rich Internet Application Content

Authors: Emil Gatial, Zoltán Balogh, Ladislav Hluchý | June 2018 | 2018 IEEE 22nd International Conference on Intelligent Engineering Systems (INES)

Many different techniques can be used to extract data from the web. This paper highlights a combination of web scraping techniques integrated into a browser as an extension, discusses the extraction methods used, and considers how best to notify users of content changes. The technique could encourage more collaboration, annotation, and sharing among users.


Technical Report: An Inventory of Open and Dark Web Marketplace for Identity Misrepresentation

Authors: Arunkumar Bagavathi, Sai Eshwar Prasad Muppalla, Siddharth Krishnan and Bojan Cukic | 2018 | Team research paper

Biometric technology, used in immigration and refugee management, has components of human identification and identity management that are now a top target for cybercriminals. A team from the computer science department at the University of North Carolina at Charlotte developed a methodology to identify biometric technology vulnerabilities, along with the limitations of identity management, by monitoring public sources for threats. Open web data for this project was taken from the Google Search API; dark web data was taken from Webhose.


Conceptual Approach for Development of Web Scraping Application for Tracking Information

Authors: Plamen Milev | 2017 | Economic Alternatives

This study’s goal is to find an approach for extracting web data in a structured manner, regardless of source. The idea is that this approach would serve as the foundation for a web scraping application for tracking information.


Finding Truth in Fake News: Reverse Plagiarism and other Models of Classification

Authors: Matthew Przybyla, David Tran, Amber Whelpley, Daniel W. Engels | 2018 | SMU Data Science Review

Many people receive news from both fake and real news sources without the ability to differentiate between the two. This team built a model that identifies fake news through reverse plagiarism and natural language processing (NLP). The model has room for improvement, especially when compared to baselines based on logistic regression and multinomial naive Bayes. Datasets for this project came from the Webhose archive.
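
For context, the baseline models mentioned above can be sketched in a few lines of scikit-learn. The example below is a minimal, illustrative version of a TF-IDF plus logistic regression / multinomial naive Bayes setup, with toy data standing in for the Webhose articles; it is not the authors' actual pipeline.

    # Minimal sketch of the baselines named above (toy data, not the
    # paper's corpus or code).
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import MultinomialNB

    texts = [
        "officials confirm the budget passed after a lengthy debate",
        "scientists publish peer reviewed study on crop yields",
        "city council approves new transit plan for downtown",
        "central bank holds interest rates steady this quarter",
        "shocking miracle cure doctors do not want you to know",
        "celebrity secretly replaced by clone says anonymous source",
        "aliens endorse candidate in leaked memo claims blogger",
        "one weird trick erases all debt overnight post goes viral",
    ]
    labels = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = real, 0 = fake

    X_train, X_test, y_train, y_test = train_test_split(
        texts, labels, test_size=0.25, stratify=labels, random_state=0)

    vec = TfidfVectorizer(stop_words="english")
    X_train_v = vec.fit_transform(X_train)
    X_test_v = vec.transform(X_test)

    for model in (MultinomialNB(), LogisticRegression(max_iter=1000)):
        model.fit(X_train_v, y_train)
        print(type(model).__name__, model.score(X_test_v, y_test))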


The Web News Coverage of Industry 4.0 in Italy and Germany (Italian)

Authors: Achille Pierre Paliotta | 2018 | INAPP Public Policy Innovation

This paper compares topics of public opinion based on news and blogs on the web in both Italy and Germany.


Unstructured Data: Various approaches for Storage, Extraction and Analysis

Authors: Dr. Babu Reddy | 2017 | Journal of Computer Science

Online data is ever-expanding, and storing it in all its different formats, particularly unstructured data, is becoming increasingly challenging. This paper presents different approaches to storing, retrieving, and analyzing unstructured data.


Developing a Global Indicator for Aichi Target 1 by Merging Online Data Sources to Measure Biodiversity Awareness and Engagement

Authors: Matthew W. Cooper, Enrico Di Minin, Anna Hausmann, Siyu Qin, Aaron Schwartz, Ricardo Aleixo Correia | February 2019 | Biological Conservation

Traditionally, public support for biodiversity has been measured through public-opinion surveys that are costly, geographically restricted, and time-consuming. Seeking a reliable alternative, a team from the University of Maryland tracked biodiversity-related keywords in 31 different languages across the web, in real time, globally, at a much lower cost. The researchers used datasets from Webhose’s repository.
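
At its core, that kind of multilingual keyword tracking amounts to matching per-language term lists against a stream of posts. Below is a toy sketch; the keyword lists are illustrative only, while the actual study tracked terms in 31 languages.

    # Toy multilingual keyword tracker; term lists are illustrative only.
    import re

    KEYWORDS = {
        "en": ["biodiversity", "endangered species"],
        "es": ["biodiversidad", "especies amenazadas"],
        "de": ["biodiversität", "artenvielfalt"],
    }

    def count_mentions(posts):
        """posts: iterable of (language_code, text) pairs."""
        counts = {lang: 0 for lang in KEYWORDS}
        for lang, text in posts:
            for term in KEYWORDS.get(lang, []):
                if re.search(r"\b%s\b" % re.escape(term), text.lower()):
                    counts[lang] += 1
                    break  # count each post at most once
        return counts

    print(count_mentions([("en", "New report on endangered species"),
                          ("es", "La biodiversidad en peligro")]))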


FactExtract: Automatic Collection and Aggregation of Articles and Journalistic Factual Claims from Online Newspaper

Authors: Edouard Ngor Sarr, Ousmane Sall, Aminata Diallo | February 2019 | 2018 Fifth International Conference on Social Networks Analysis, Management and Security (SNAMS)

Manually fact-checking the web is now impossible given the exponential, real-time growth of data. Web scraping offers an alternative to this manual, time-consuming process by extracting specific, highly structured data. This team presents a method for the automatic extraction of articles, tested on 15 Senegalese news websites.
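
The generic shape of such an extractor — fetch a page, pull the headline and body out of the HTML — looks roughly like the sketch below. The CSS selectors are hypothetical placeholders; in practice each of the 15 sites would need its own.

    # Rough per-site article extractor; selectors are hypothetical.
    import requests
    from bs4 import BeautifulSoup

    def extract_article(url, title_sel="h1", body_sel="div.article-body p"):
        html = requests.get(url, timeout=10).text
        soup = BeautifulSoup(html, "html.parser")
        title = soup.select_one(title_sel)
        body = [p.get_text(strip=True) for p in soup.select(body_sel)]
        return {
            "url": url,
            "title": title.get_text(strip=True) if title else None,
            "text": "\n".join(body),
        }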


Text-Enriched Representations for News Image Classification

Authors: Elias Moons, Tinne Tuytelaars, Marie-Francine Moens | April 2018 | International World Wide Web Conferences Steering Committee

A need exists for technology that can classify and filter images into different subjects and categories. Although the task is not new, scaling it with deep learning, and the training models used to classify the images, is. The data used to evaluate the image-recognition models comes from Webhose’s datasets, spanning seven categories: entertainment, finance, politics, sports, technology, travel, and world news.


Balanced News Using Constrained Bandit-based Personalization

Authors: Sayash Kapoor, Vijay Keswani, Nisheeth K. Vishnoi, L. Elisa Celis | 2018 | arXiv:1806.09202v1

Many content delivery engines are known to deliver polarized results due to the personalization and filter bubbles of online content. A team of researchers developed an alternative: a prototype news search engine that balances viewpoints according to flexible user-defined constraints. The goal is to demonstrate that balanced content delivery systems are a real possibility and that control over the bias of results can be handed to the user. The project used the Webhose News Search API to rank articles by popularity.
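
A toy version of constraint-aware selection — pick the highest-scoring article while a user-set quota keeps any one viewpoint from dominating — might look like the following. This is a simplification for illustration, not the authors' bandit algorithm.

    # Simplified constraint-aware article selection (illustrative only).
    def pick_article(candidates, shown_counts, max_share=0.6):
        """candidates: list of (article_id, viewpoint, score) tuples.
        shown_counts: dict mapping viewpoint -> items already shown."""
        total = sum(shown_counts.values())
        allowed = [c for c in candidates
                   if (shown_counts.get(c[1], 0) + 1) / (total + 1) <= max_share]
        pool = allowed or candidates  # fall back if constraint excludes all
        return max(pool, key=lambda c: c[2])

    shown = {"left": 3, "right": 1}
    print(pick_article([("a1", "left", 0.9), ("a2", "right", 0.7)], shown))
    # -> ("a2", "right", 0.7): the quota overrides the higher raw score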


Social Engineering Threat Assessment Using a Multi-Layered Graph-Based Model

Authors: Omar Jaafor, Babiga Birregah | April 2017 | Trends in Social Network Analysis

Social engineering attacks humans rather than the technology of information systems. This research offers a threat assessment model that determines which social media users are vulnerable and detects malicious credit card resellers.


Development of Classification Models for Fake News Detection

Authors: Javier Pascual Mesa | N/A | Individual Research Project Report

The team created a model that predicts whether news articles are real or fake. Webhose’s API allowed the team to query its database by site URL and download articles from the most reliable news sources available.
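
Querying by site URL with Webhose's Python client looked roughly like the sketch below; the exact filter syntax is from memory and should be checked against the API documentation.

    # Sketch of a Webhose query by site URL (filter syntax may differ;
    # consult the API docs).
    import webhoseio

    webhoseio.config(token="YOUR_API_KEY")
    output = webhoseio.query("filterWebContent",
                             {"q": "site:reuters.com language:english"})
    for post in output["posts"]:
        print(post["title"], post["url"])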


TECNE: Knowledge Based Text Classification Using Network Embeddings

Authors: Rima Turker, Maria Koutraki, Lei Zhang, Harald Sack | N/A | N/A

Since manual text classification is time-consuming and costly, several supervised approaches have been introduced, but they require large amounts of labeled training data. Semi-supervised approaches require less labeling. A newer family of dataless approaches needs even less labeled data, instead identifying semantic similarity between documents and predefined categories, but it does not let researchers take advantage of large-scale knowledge bases. The TECNE approach, the focus of this paper, requires no training data at all. Webhose’s news article datasets were used for this research.


Multi-layered Graph-based Model for Social Engineering Vulnerability Assessment

Authors: Omar Jaafor, Babiga Birregah | 2015 | 2015 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM)

What is an institution’s weakest link? The humans operating it. This paper proposes a model for measuring social engineering vulnerability. The model can detect vulnerabilities to automated social engineering attacks and automated reverse social engineering attacks across different social networking sites, blogs, and forums.


TAP-DLND 1.0: A Corpus for Document Level Novelty Detection

Authors: Tirthankar Ghosal, Amitra Salam, Swati Tiwari, Asif Ekbal, Pushpak Bhattacharyya | February 2018 | Language Resources and Evaluation Conference (LREC) 2018

The ability to detect the novelty, or originality, of a document has many applications, including tracking news events, predicting the impact of research, and other NLP tasks. The researchers found no existing way of measuring novelty, so this paper presents a benchmark for measuring the level of novelty in documents.


Analysing of the Labour Demand and Supply Using Web Mining and Data Mining

Authors: Ciprian Panzaru and Claudiu Brandas | November 2015 | Big Data and the Complexity of Labour Market Policies: New Approaches in Regional and Local Labour Market Monitoring for Reducing Skills Mismatches

This paper demonstrates the value of web mining and data mining for understanding the labor market. It applies both to identify career clusters, showing how the results can then be used to build decision trees for modeling career paths.


Identifying Tweets with Fake News

Authors: Saranya Krishnan, Min Chen | 2018 | 2018 IEEE International Conference on Information Reuse and Integration (IRI)

Social media is becoming increasingly influential on society, in both positive and negative ways: governments can use it for effective disaster management, but it can also spread disinformation. The ability to detect fake news on social media is therefore advantageous for preventing its spread. This paper introduces a method for identifying tweets with fake news content. The research team used large miscellaneous event datasets for the project.


Multi Vendor Weighted Product Recommendation Systems

Authors: Vijaya Lakshmi Illuri, Harshita Cheemakurthi, Harika Redlapalli, T. Subha Mastan Rao, G. Rama Krishna | 2017 | International Journal of Pure and Applied Mathematics

Ecommerce sites can provide customers a wealth of data about products, but this data often becomes overwhelming and time-consuming for the customer. This paper examines an alternative: using web scrapers to gather that information and present it to customers in a cross-comparison display.


Sentimental Content Analysis and Knowledge Extraction from News Articles

Authors: Mohammad Kamel, Neda Keyvani, Hadi Sadoghi Yazdi | August 2018 | arXiv:1808.03027

Data on the internet is expanding at an ever-increasing rate, and raw data must be analyzed quickly, for example through sentiment analysis. This paper introduces an approach that extracts the hottest topics and news headlines, offers efficient sentiment analysis, and provides a model for quickly understanding the relationships between words in an article. The model could be useful to analysts, sociologists, and political scientists studying news articles. The team collected news from six different countries using Webhose’s API.
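
As a point of comparison, even an off-the-shelf scorer shows what fast headline-level sentiment analysis looks like. Below is a sketch using the VADER library; it is not the model proposed in the paper.

    # Off-the-shelf headline sentiment scoring (not the paper's model).
    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

    analyzer = SentimentIntensityAnalyzer()
    headlines = [
        "Markets rally as trade tensions ease",
        "Floods devastate coastal towns",
    ]
    for h in headlines:
        score = analyzer.polarity_scores(h)["compound"]  # -1 .. +1
        print(f"{score:+.2f}  {h}")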


Leveraging Big Data to Combat Terrorism in Developing Countries

Authors: N/A | March 2017 | 2017 Conference on Information Communication Technology and Society (ICTAS)

Until now, big data used to combat terrorism has focused on only one type of social media at a time. This paper introduces the Social Media Analysis for Combating Terrorism (SMACT) model, a way of harnessing data from multiple social media sources to detect terrorist activity. Webhose was one of the sources used to collect the social media data.


Food for Thought: Analyzing Public Opinion on the Supplemental Nutrition Assistance Program

Authors: "Miriam Chappelka, Jihwan Oh , Dorris Scott, Mizzani Walker-Holmes"
  |  
2017
  |  
2017 Bloomberg Data for Good Exchange Conference

What is the public’s opinion of the Supplemental Nutrition Assistance Program (SNAP), once known as the Food Stamp Program? This project seeks an answer through natural language processing (NLP) tools, machine learning, text mining, and sentiment analysis. Webhose provided access to news articles with rich datasets for the project.


Data Triplification of Zika News (Portuguese)

Authors: Luís Fernando Monsores Passos Maia, Marcela Mayumi Mauricio Yagui | 2017 | XIII Brazilian Symposium on Information Systems, Lavras, Minas Gerais

The purpose of this project was first to build a thematic database of Zika virus news from online sources such as news sites, blogs, and online discussions, and then to run queries and find connections in the datasets, with the goal of better understanding the disease’s impact on social media. News article data for the project was collected from Webhose’s repository.


REALM: A Computational Framework for Investigating Research Impacts Using Alternative Metrics (Portuguese)

Authors: "Luís Fernando Monsores, Passos Maia, Jonice Oliveira"
  |  
August 2018
  |  
Brazilian Symposium on Databases (SBBD)

During emergencies such as the Zika virus in Brazil, researchers and physicians rely on social media to exchange information in a faster and more efficient manner. When this method is used during emergencies, however, citizens need a way to verify the validity of the information and the reputation of the researcher. This paper offers a way to identify the reputation of these researchers based on a method called altmetrics. Data was taken from Webhose’s API, whose advanced filters enable extraction of only specific publications related to the topic, such as the Zika virus.


Word Embeddings for Sentiment Analysis: A Comprehensive Empirical Survey

Authors: Erion Cano, Maurizio Morisio | February 2019 | Project of Academic Computing in the Department of Control and Computer Engineering, Politecnico di Torino

Word embedding maps words or phrases of a language to vectors of real numbers. This paper examines the quality of these embeddings in word analogy tasks (e.g., Paris is to France as Lisbon is to Portugal) and in sentiment analysis. The research found that word analogy tasks performed best when the models were trained on text that was large, rich in vocabulary, and multi-thematic; sentiment analysis likewise needed large, vocabulary-rich datasets for best results. Sentiment analysis was conducted on tweets, song lyrics, movie reviews, and phone reviews; the research also used news articles from Webhose’s datasets.
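
The word analogy task reduces to vector arithmetic over the embeddings: the vector for "Paris" minus "France" plus "Portugal" should land near "Lisbon". With gensim-style pretrained vectors this is a single call; the model file below is a placeholder for any word2vec-format embeddings.

    # Word analogy via embedding arithmetic; the vectors file is a
    # placeholder for any pretrained word2vec-format embeddings.
    from gensim.models import KeyedVectors

    vectors = KeyedVectors.load_word2vec_format("embeddings.bin", binary=True)
    # Paris - France + Portugal ≈ Lisbon
    result = vectors.most_similar(positive=["Paris", "Portugal"],
                                  negative=["France"], topn=1)
    print(result)  # expected: [('Lisbon', <similarity>)]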


Common Crawl Mining

Authors: Brian Clarke, Tommy Dean, Ali Pasha, Casey Butenhoff | 2017 | Multimedia, Hypertext and Information Access

This paper describes a system developed to give Eastman Chemical Company insight into the current chemical landscape. The team at Virginia Tech built a search interface that lets users drill down to a particular time period and see what people were writing about given keywords.


Extraction and Multidimensional Analysis of Athletics from Non-Structured Data (Portuguese)

Authors: Rui José da Rocha Lima | February 2018 | N/A

This project develops a data warehouse (DW) of athletics results and a way to integrate that data with the geographic location and atmospheric conditions of the competitions. First, the data is parsed by tokenizing the text, extracting values, and defining the data hierarchy. Next, the data is scraped for results and converted into PDF and plain text for storage in the DW.


Convergent Synthesis of Cognitive Model Based on Deep Learning and Quantum Semantics (Russian)

Authors: A. N. Raikov | 2018 | International Journal of Open Information Technologies

Semantic interpretations of text can be achieved through cognitive modeling, though such interpretations usually fail to account for subjectivity. This paper incorporates the subjective factor of text into cognitive modeling by combining automated synthesis with an in-depth cognitive interpretation of model components.


New Technologies, Continuing Ideologies: Online Reader Comments as a Support for Media Perspectives of Minority Religions

Authors: Tayyiba Bruce | August 2018 | Discourse, Context and Media

Nationalist interests have strengthened in Europe in recent years, fueling discussion on right-leaning news websites. This paper examines those discussions for stereotypical representations of Islam and Catholicism on the Daily Mail and Telegraph websites.


Design of Darknet Market Monitoring System (Czech)

Authors: Josef Smrž | 2018 | Individual research project

This paper introduces a design for a system that collects data about items sold on the dark web. The system lets users search the data and notifies them of any changes that occur in the monitored markets. A prototype of the system was used by the Cyber Crime Unit of the Police of the Czech Republic.


Evaluating Style Transfer in Natural Language

Authors: Nicholas Matthews | 2017 | N/A

This paper gives a comprehensive analysis of style transfer, an area of natural language processing (NLP), explaining the challenges of granularity, transferability, and distinguishability inherent in sentiment transfer. Other applications are discussed, including news outlet style transfer and non-parallel error correction.


Recovery of Bitcoin Addresses from the Web (Italian)

Authors: Alessio Santoru | 2016 | Individual thesis project

Cryptocurrency has become very popular in the past few years, but its anonymity allows it to be used for criminal purposes, such as purchasing weapons, forged documents, and drugs. If the same bitcoin address is used for multiple purchases, however, or if the user publishes it to receive payments in forums, it becomes possible to establish the user’s identity. This paper presents a tool for recovering bitcoin addresses and the information related to them.


Examples

Authors: Seppe vanden Broucke and Bart Baesens | April 2018 | Chapter from the book Practical Web Scraping for Data Science

This book chapter works through several larger examples of web scrapers.


A Dashboard for Controlling Polarization in Personalization

Authors: L. Elisa Celis, Sayash Kapoor, Farnood Salehi, Vijay Keswani, Nisheeth K. Vishnoi | 2019 | AI Communications

Although personalized news and shopping have grown in popularity, the public has become more aware of their inherent bias and the polarization of opinions. This paper presents an alternative, balanced news feed through which users can see the political leaning of their news consumption beforehand and set their own polarization constraints.


Unified Domain-specific Language for Collecting and Processing Data of Social Media

Authors: Nikolay Butakov, Maxim Petrov, Ksenia Mukhina, Denis Nasonov, Sergey Kovalchuk | October 2018 | Journal of Intelligent Information Systems

Data from social media is becoming increasingly valuable to social scientists, yet existing solutions for crawling and gathering this data don’t allow for individual crawling scenarios. This paper addresses the challenge with an approach based on a purpose-built domain-specific language (DSL) and a distributed crawling architecture, in which the user simply describes the data they need.


Information Quality Dimensions Analysis of Slovenian Companies' Websites (Slovenian)

Authors: Matic Jazbec | 2016 | Individual thesis project

This paper presents a web crawler that collects information from the websites of Slovenian companies and their profiles on social networks, then applies a model for evaluating information quality. The project concluded that the variance in information quality among Slovenian companies’ websites can be found and explained by company size and business type.


Controlling Polarization in Personalization: An Algorithmic Framework

Authors: L. Elisa Celis, Sayash Kapoor, Farnood Salehi, Nisheeth Vishnoi | January 2019 | Proceedings of the Conference on Fairness, Accountability, and Transparency

Online personalization has delivered many benefits to users, including better product recommendations and more relevant news and content. But it can also create biases that influence users through a filter bubble over which they have little control. This paper presents a scalable algorithm that avoids polarization while still optimizing individual utility.


User Centered Development of Prototypical Web-Monitoring-Tools (in German)

Authors: Ulrike Exner | April 2017 | Individual master thesis

The collection of relevant data from the web can be done with the help of web monitoring tools. This paper presents a web monitoring tool for a company in the automotive industry.


A Hybrid Approach for Alarm Verification Using Stream Processing, Machine Learning and Text Analytics

Authors: Ana Sima, Kurt Stockinger, Katrin Affolter, Martin Braschler, Peter Monte, Lukas Kaiser | 2018 | Industrial and Applications paper

False alarms triggered by security sensors can be expensive for all parties involved. This paper presents a scalable alarm verification system that handles 30K alarms a second with up to 90% accuracy through a combination of machine learning, stream processing, and batch processing technologies. The team used Webhose to collect descriptions of fire and intrusion incidents from online sources such as Twitter, RSS feeds, and web pages, in order to calculate an a priori risk factor for intrusion and fire alarms.


Automatic Document Classification with Machine Learning Help (in Swedish)

Authors: Johan Dufberg | 2018 | Individual thesis paper

Machine learning is a valid alternative to manually classifying large quantities of text documents. This paper demonstrates how automatic text classification works and gives a quick overview of the most common algorithms used for the purpose, then begins extending the application to handle official Swedish documents. Webhose’s free datasets were used for the project.


Modeling Cryptocurrency Market Trends Using Textual Data (in Slovenian)

Authors: Benjamin Fele | September 2017 | Thesis paper

This paper builds a model that predicts cryptocurrency trends using data from the Poloniex online market, with forecasts based on news obtained from Reddit. The conclusion was that the model would fail to yield profitable returns.


Use Webhose Data Feeds For Your Academic Research