Design and Development of Real-Time Big Data Analytics Frameworks

Date

2017-12

Abstract

Today, many sophisticated technologies, such as the Internet of Things (IoT), autonomous driving, cloud computing, and data center consolidation, demand smarter IT infrastructure and real-time operations. They continuously generate large amounts of data, known as “Big Data”, to report their operational activities. In response, we need advanced analytics frameworks that capture, filter, and analyze this data and make quick decisions in real time. The high volume, velocity, and variety of the data make this an overwhelming task for humans to perform in real time.

Current state-of-the-art techniques, such as advanced analytics, machine learning (ML), and natural language processing (NLP), can be used to handle heterogeneous Big Data. However, most of these algorithms suffer from scalability issues and cannot meet real-time constraints. In this dissertation, we focus on two areas: anomaly detection on structured VMware performance data (e.g., CPU and memory usage metrics) and text mining for politics in unstructured text data. We have developed real-time distributed frameworks with ML and NLP techniques. For anomaly detection, we have implemented an adaptive clustering technique to identify individual anomalies and a Chi-square-based statistical technique to detect group anomalies in real time. For text mining, we have developed a real-time framework, SPEC, that captures online news articles in different languages from the web and annotates them using CoreNLP, PETRARCH, and the CAMEO dictionary to generate structured political events in a ‘who-did-what-to-whom’ format. We later extend this framework to code atrocity events, which are machine-coded structured data containing perpetrators, actions, victims, etc. Finally, we have developed a novel, distributed, window-based political actor recommendation framework to discover and recommend new political actors along with their possible roles. We have implemented these scalable distributed streaming frameworks using Kafka as the message broker, Spark, and both unsupervised and supervised machine learning techniques.
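
To make the Chi-square-based group anomaly idea concrete, the following is a minimal single-machine sketch in Python. It is not the dissertation's implementation, which the abstract does not detail. It assumes that a window of CPU usage readings is binned into a histogram and compared against a baseline histogram with Pearson's chi-square statistic. The function names, bin edges, sample values, and add-one smoothing are all hypothetical choices made for illustration.

```python
def histogram(values, edges):
    """Count values per bin; `edges` are ascending cut points (assumed)."""
    counts = [0] * (len(edges) + 1)
    for v in values:
        # Bin index = number of cut points the value has reached or passed.
        counts[sum(1 for e in edges if v >= e)] += 1
    return counts


def chi_square_statistic(observed, expected_probs):
    """Pearson's chi-square: sum of (observed - expected)^2 / expected."""
    total = sum(observed)
    stat = 0.0
    for obs, p in zip(observed, expected_probs):
        expected = total * p
        if expected > 0:
            stat += (obs - expected) ** 2 / expected
    return stat


def is_group_anomaly(window, baseline, edges, critical_value):
    """Flag a window whose distribution deviates from the baseline."""
    # Add-one smoothing (an assumption) keeps every expected count positive.
    smoothed = [c + 1 for c in histogram(baseline, edges)]
    probs = [c / sum(smoothed) for c in smoothed]
    stat = chi_square_statistic(histogram(window, edges), probs)
    return stat > critical_value, stat


# Hypothetical CPU usage percentages; 4 bins give 3 degrees of freedom.
EDGES = [25, 50, 75]
CRITICAL_VALUE = 7.815  # chi-square critical value, df = 3, alpha = 0.05
baseline = [10, 20, 30, 40, 15, 35, 22, 45, 12, 38, 28, 48]
normal_window = [18, 33, 27, 41, 21, 36, 14, 44]
spike_window = [92, 88, 95, 30, 91, 97, 85, 90]

normal_flag, normal_stat = is_group_anomaly(
    normal_window, baseline, EDGES, CRITICAL_VALUE)
spike_flag, spike_stat = is_group_anomaly(
    spike_window, baseline, EDGES, CRITICAL_VALUE)
print(normal_flag, normal_stat)
print(spike_flag, spike_stat)
```

In a streaming deployment of the kind the abstract describes, a check like this would presumably run once per window over data consumed from the message broker, with the baseline refreshed over time. That wiring is omitted here.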

Keywords

Big data, SPARK (Computer program language), Anomaly detection (Computer security), Machine learning, Text processing (Computer science), Data mining

Rights

Copyright ©2017 is held by the author. Digital access to this material is made possible by the Eugene McDermott Library. Further transmission, reproduction or presentation (such as public display or performance) of protected items is prohibited except with permission of the author.
