Semi-supervised Adaptive Classification over Data Streams

Haque, Ahsanul

dc.contributor.advisor	Khan, Latifur
dc.creator	Haque, Ahsanul
dc.date.accessioned	2018-03-22T20:49:32Z
dc.date.available	2018-03-22T20:49:32Z
dc.date.created	2017-12
dc.date.issued	2017-12
dc.date.submitted	December 2017
dc.date.updated	2018-03-22T20:49:33Z
dc.description.abstract	Data streams are ubiquitous in today’s digital world. Efficient extraction of knowledge from these streams may help in making important decisions in (near) real time and in unveiling hidden opportunities. Traditional data mining techniques are inadequate for streaming data because of its inherent properties, such as infinite length, concept drift, concept evolution, limited labeled data, and covariate shift between labeled training data and unlabeled test data. In this dissertation, we study the challenges that these properties pose for data stream classification and propose solutions to them. A data stream is essentially an infinite flow of data. The classifier in a streaming scenario needs to be updated regularly, as the underlying class boundary may change and entirely new classes may emerge in the stream over time; these are known as the concept drift and concept evolution problems, respectively. As labeled data instances are scarce in real-world data streams, the classifier must be trained and updated in a semi-supervised setting in order to capitalize on the large portion of data that is unlabeled. Due to the semi-supervised setting, covariate shift such as sampling bias may be introduced between the training and test distributions. An efficient data stream classification approach would compensate for this difference in distributions. In addition to addressing these challenges, a classifier in the streaming context must be scalable in order to address the additional challenges posed by any Big Data or Internet of Things (IoT) stream. In this dissertation, we propose four paradigms for addressing challenges in classifying data streams, namely ECHO, FUSION, SDKMM, and CASTLE. First, we propose ECHO, a semi-supervised approach for addressing infinite length, concept drift, and concept evolution using a limited amount of labeled data. Next, we study the consequences of covariate shift in data stream classification. A covariate shift between the training and test distributions may occur due to the difficulty of collecting labeled data instances, often resulting in sampling bias. In a streaming scenario, we consider two separate streams, where one stream provides only labeled training data and the other provides unlabeled test data. These streams may exhibit covariate shift and asynchronous concept drifts between them. The second approach proposed in this dissertation, referred to as FUSION, addresses the challenges in this scenario, also known as the Multistream classification problem. In the third and fourth approaches, we propose two scalable paradigms for data stream classification. The third approach, SDKMM, is a sampling-based distributed approach for addressing covariate shift between the training and test distributions. The last approach, referred to as CASTLE, is a hierarchical ensemble classification model for data streams, in which the individual classifiers in the hierarchy are trained in parallel and in a distributed fashion. We theoretically analyze various properties of the proposed approaches. Moreover, we evaluate each of them on benchmark datasets and compare them with a number of baseline approaches. Empirical results indicate the effectiveness of the proposed approaches.
dc.format.mimetype	application/pdf
dc.identifier.uri	http://hdl.handle.net/10735.1/5664
dc.language.iso	en
dc.rights	Copyright ©2017 is held by the author. Digital access to this material is made possible by the Eugene McDermott Library. Further transmission, reproduction or presentation (such as public display or performance) of protected items is prohibited except with permission of the author.
dc.subject	Streaming technology (Telecommunications)
dc.subject	Data mining
dc.subject	Supervised learning (Machine learning)
dc.subject	Analysis of covariance
dc.subject	Big data
dc.subject	Internet of things
dc.title	Semi-supervised Adaptive Classification over Data Streams
dc.type	Dissertation
dc.type.material	text
thesis.degree.department	Computer Science
thesis.degree.grantor	The University of Texas at Dallas
thesis.degree.level	Doctoral
thesis.degree.name	PHD

Files

Original bundle

Name: ETD-5608-7468.48.pdf
Size: 2.84 MB
Format: Adobe Portable Document Format

License bundle

Name: LICENSE.txt
Size: 1.84 KB
Format: Plain Text

Name: PROQUEST_LICENSE.txt
Size: 5.84 KB
Format: Plain Text

Collections

UTD Theses and Dissertations