A Big Data Framework for Unstructured Text Processing With Applications Towards Political Science and Healthcare
Abstract
Machine learning and deep neural networks have soared in popularity in recent years, allowing us to enhance many aspects of everyday life. While these methods are intuitive, they are heavily reliant on the dataset used to build the model. A high-quality dataset boosts the
model’s accuracy and validates the model’s output in the context of a real-world scenario.
Furthermore, continuous improvement of the dataset contributes to tuning the model in a time-consistent way and to mitigating temporal inconsistencies. However, preparing
datasets, particularly for text domains, is difficult due to the inherent unstructured nature
of the data and the use of multiple languages. Furthermore, the amount of text produced in
the form of news articles or social media posts is massive, necessitating large-scale processing. The velocity at which new texts are produced demands an elastic, scalable system that can accommodate any surge of input while remaining resource-efficient when not in use. Texts are created in a variety of ways and must be preprocessed and analyzed in order
to provide well-structured, consistent data. This can be accomplished through the use of a
well-defined domain-specific ontology (rule-based approach) or machine learning approaches.
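As a minimal, purely illustrative sketch of the rule-based route (the actor and verb entries below are hypothetical placeholders, not drawn from this dissertation's ontology), such a coder might match dictionary phrases against a sentence:

```python
# Minimal, hypothetical sketch of dictionary-driven (rule-based) event coding.
# The actor/verb phrases and codes are illustrative placeholders only.
ACTORS = {"german chancellor": "DEUGOV", "un secretary-general": "IGOUNO"}
VERBS = {"met with": "040", "criticized": "111"}

def code_sentence(sentence: str):
    """Return (actor code, event code) pairs found by simple phrase matching."""
    text = sentence.lower()
    events = []
    for actor_phrase, actor_code in ACTORS.items():
        for verb_phrase, event_code in VERBS.items():
            if actor_phrase in text and verb_phrase in text:
                events.append((actor_code, event_code))
    return events

print(code_sentence("The German Chancellor met with business leaders on Monday."))
# [('DEUGOV', '040')]
```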
While rule-based systems can provide more precise information and are preferred in a variety of circumstances, they lack flexibility, as the ontologies are often fixed and do not respond well to continuous changes in their respective domains.

We propose solutions to the challenges described above in this dissertation. First, we describe a scalable
architecture for collecting news stories from around the world and utilizing a rule-based approach with the Conflict and Mediation Event Observations (CAMEO) ontology to generate
political events. We present a summary of the generated dataset, along with some basic analysis, to demonstrate how it relates to real-world events. We then present techniques for dynamically adding information to the ontology using a mining approach that discovers new political actors; the approach works as a recommender system and retrieves more than 80% of the missing information, including political figures and their roles. We discuss an extended
data processing system that handles articles published in several languages, with a focus
on translation methodologies and the tools developed. We demonstrate the efficacy of the coder on Spanish text relative to English: when compared to equivalent events in English articles, the revised event coder with a translated knowledge base recognized 83% of the information in Spanish.
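As a hedged, illustrative sketch (not the actual tooling developed in this work), translating a coder's knowledge base might amount to mapping each dictionary phrase into the target language while keeping the event codes fixed, as in the hypothetical Python below:

```python
# Hypothetical sketch: port a rule-based coder to Spanish by translating its
# dictionary phrases while keeping the event codes unchanged.
# `translate_phrase` stands in for whatever translation tool a real system uses.
from typing import Callable, Dict

def translate_knowledge_base(dictionary: Dict[str, str],
                             translate_phrase: Callable[[str], str]) -> Dict[str, str]:
    """Return a new dictionary with translated phrases; the codes stay the same."""
    return {translate_phrase(phrase): code for phrase, code in dictionary.items()}

# Toy translator for demonstration only; a real pipeline would call an MT service.
toy_spanish = {"met with": "se reunió con", "criticized": "criticó"}
verbs_es = translate_knowledge_base({"met with": "040", "criticized": "111"},
                                    lambda phrase: toy_spanish[phrase])
print(verbs_es)  # {'se reunió con': '040', 'criticó': '111'}
```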
For healthcare, we propose an alternative strategy in which we use several machine learning algorithms together with social media posts, such as tweets, to extract the location and severity of Road Traffic Incidents (RTIs). We highlight a pipeline that goes from collecting tweets to summarizing the tweets related to an RTI. We also demonstrate how semi-automatic ontology learning
can be useful in determining severity and offer a simplified example in which 100% of the
target rules were identified using an iterative technique.
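A hedged, minimal sketch of what one tweet-classification step in such a pipeline could look like (the example tweets, labels, and model choice are assumptions made purely for illustration, not the system evaluated above):

```python
# Hypothetical sketch: classify tweets by road-traffic-incident severity with a
# simple bag-of-words model. The tweets and labels below are invented examples.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

tweets = [
    "Multi-car pileup on I-35, several injured, avoid the area",
    "Minor fender bender near 5th and Main, traffic moving slowly",
    "Fatal crash reported on Highway 183, road closed",
    "Small bump in the parking garage, no injuries",
]
severity = ["severe", "minor", "severe", "minor"]

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
model.fit(tweets, severity)
print(model.predict(["Three vehicles collided on I-35, injuries reported"]))
```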