Novel Class Detection and Cross-Lingual Duplicate Detection over Online Data Stream
Mustafa, Ahmad M.
Data streams are continuous flows of data points. They are now common in domains such as e-commerce, education, health, security, and social networks. Their sheer volume and throughput make it a great challenge for the data mining community to extract useful knowledge from them. Data stream classification is the task of predicting the class labels of data instances using classification models trained on past labeled data. It has been a major research thrust for the past several years because of increasing demand in many business and security applications. Compared with traditional datasets, data streams exhibit several unique properties, such as infinite length, concept-drift, and concept-evolution. Concept-drift occurs when the underlying concept of the data changes over time, making previously trained models obsolete. Concept-evolution refers to the emergence of a new or novel class. Existing classification techniques that address concept-evolution and concept-drift require labeled data to detect stream changes, but labeled data is scarce and expensive.
This dissertation addresses the aforementioned challenges in a number of ways. We address infinite length, concept-drift, and concept-evolution by proposing an ensemble-based classification model that exploits unsupervised deep embeddings and non-parametric change point detection. We show empirically on both synthetic and several benchmark data streams that the proposed techniques outperform state-of-the-art techniques.
The detection of duplicate news reports, that is, reports that cover the same event, plays an important role in condensing information about an event into an easily digestible format. Several methods of duplicate report detection have been developed. However, these existing methods either yield poor accuracy or are not effective when the reports are in different languages. We propose a novel unsupervised method to measure the similarity between news reports written in different languages. Experimental results on publicly available datasets of multi-lingual reports show that our approach efficiently and significantly reduces duplicate detection errors compared to state-of-the-art techniques.