A Big Data Framework for Unstructured Text Processing With Applications Towards Political Science and Healthcare
Machine learning and deep neural networks have soared in popularity in recent years, allowing us to enhance many aspects of everyday life. While these methods are intuitive, they are heavily reliant on the dataset used to build the model. A high-quality dataset boosts the model's accuracy and validates the model's output in the context of a real-world scenario. Furthermore, continuous improvement of the dataset contributes to tuning the model consistently over time and mitigating temporal inconsistencies. However, preparing datasets, particularly for text domains, is difficult due to the inherently unstructured nature of the data and the use of multiple languages. Furthermore, the amount of text produced in the form of news articles or social media posts is massive, necessitating large-scale processing. The velocity at which new texts are produced demands an elastic and scalable system that can accommodate any surge of inputs while remaining resource-efficient when idle. Texts are created in a variety of ways and must be preprocessed and analyzed in order to produce well-structured, consistent data. This can be accomplished through the use of a well-defined domain-specific ontology (a rule-based approach) or through machine learning approaches. While rule-based systems can provide more precise information and are preferred in many circumstances, they lack flexibility: the ontologies are often fixed and do not respond well to continuous changes in their respective domains.

We propose solutions to the challenges described above in this dissertation. First, we describe a scalable architecture for collecting news stories from around the world and utilizing a rule-based approach with the Conflict and Mediation Event Observations (CAMEO) ontology to generate political events. We present a summary of the generated dataset, as well as some basic analysis, to demonstrate how it relates to the real-world scenario.
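The rule-based coding step can be illustrated with a minimal sketch. The actor and verb dictionaries below are toy stand-ins, not the real CAMEO knowledge base, and the event codes are only illustrative of CAMEO's numbering scheme; the actual coder operates over a far larger ontology and parsed sentence structure.

```python
# Toy actor and verb dictionaries in the spirit of the CAMEO ontology.
# These entries and codes are illustrative, not the real knowledge base.
ACTORS = {
    "president of france": "FRAGOV",  # hypothetical actor code
    "rebel forces": "REB",
}
VERBS = {
    "met with": "040",   # CAMEO 04x: Consult (illustrative)
    "attacked": "190",   # CAMEO 19x: Fight (illustrative)
}

def code_event(sentence: str):
    """Return (source, event_code, target) if the sentence matches the rules."""
    s = sentence.lower()
    # Locate every known actor and verb phrase in the sentence.
    found_actors = sorted((s.find(k), v) for k, v in ACTORS.items() if k in s)
    found_verbs = sorted((s.find(k), v) for k, v in VERBS.items() if k in s)
    if len(found_actors) >= 2 and found_verbs:
        # Earliest actor is treated as source, next as target.
        return (found_actors[0][1], found_verbs[0][1], found_actors[1][1])
    return None  # sentence not covered by the ontology

print(code_event("The President of France met with rebel forces on Tuesday."))
# -> ('FRAGOV', '040', 'REB')
```

The inflexibility noted above shows up directly here: a sentence mentioning an actor absent from `ACTORS` produces no event, which motivates the dynamic ontology-extension work described next.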
We present techniques for dynamically adding information to the ontology using a mining approach that discovers new political actors. It works as a recommender system and retrieves more than 80% of the missing information, including political figures and their roles. We discuss an extended data processing system for processing articles published in several languages, with a focus on translation methodologies and the tools developed. We demonstrate the efficacy of the coder on Spanish text in comparison to English: when compared against equivalent events in English articles, the revised event coder with the translated knowledge base recognized 83% of the information in Spanish. For healthcare, we propose an alternative strategy in which we use several machine learning algorithms and social media posts, such as tweets, to extract the location and severity of Road Traffic Incidents (RTIs). We highlight a pipeline that goes from collecting tweets to summarizing related tweets for an RTI. We also demonstrate how semi-automatic ontology learning can be useful in determining severity, and offer a simplified example in which 100% of the target rules were identified using an iterative technique.
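The tweet-processing side can likewise be sketched in miniature. The keyword rules and severity labels below are hypothetical stand-ins for the learned ontology rules, and the location heuristic is a deliberately naive regular expression; the actual pipeline uses trained machine learning models over collected tweet streams.

```python
# Hypothetical sketch: extract a coarse severity label and location cue
# for a Road Traffic Incident (RTI) from a tweet. Rules are illustrative.
import re

# Ordered severity rules: first matching rule wins (most severe first).
SEVERITY_RULES = [
    ("fatal", ["fatal", "killed", "death"]),
    ("major", ["injured", "hospital", "overturned"]),
    ("minor", ["fender bender", "minor", "delays"]),
]

def classify_tweet(tweet: str):
    text = tweet.lower()
    severity = next(
        (label for label, kws in SEVERITY_RULES if any(k in text for k in kws)),
        "unknown",
    )
    # Naive location cue: capitalized text following "on"/"at"/"near".
    m = re.search(r"\b(?:on|at|near)\s+([A-Z][\w .\-]+)", tweet)
    location = m.group(1).strip() if m else None
    return severity, location

print(classify_tweet("Two people injured in a crash on I-35 near downtown"))
# -> ('major', 'I-35 near downtown')
```

In the iterative ontology-learning setting described above, rule lists like `SEVERITY_RULES` would be proposed and refined across iterations rather than written by hand.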