Semi-supervised Adaptive Classification over Data Streams
Data streams are ubiquitous in today’s digital world. Efficient extraction of knowledge from these streams may help in making important decisions in (near) real-time and in unveiling hidden opportunities. Traditional data mining techniques are inadequate for streaming data due to its inherent properties, such as infinite length, concept drift, concept evolution, limited labeled data, and covariate shift between labeled training data and unlabeled test data. In this dissertation, we study the challenges that these properties pose for data stream classification, and propose solutions to them.

A data stream is essentially an infinite flow of data. A classifier in a streaming scenario needs to be updated regularly, as the underlying class boundary may change and entirely new classes may emerge in the stream over time; these are known as the concept drift and concept evolution problems, respectively. Because labeled instances are scarce in real-world data streams, the classifier must be trained and updated in a semi-supervised setting in order to capitalize on the large portion of data that is unlabeled. The semi-supervised setting may in turn introduce covariate shift, such as sampling bias, between the training and test distributions, and an effective data stream classification approach must compensate for this difference. In addition to addressing these challenges, a classifier in the streaming context must be scalable to handle the additional challenges posed by Big Data and Internet of Things (IoT) streams.

In this dissertation, we propose four paradigms for addressing challenges in classifying data streams, namely ECHO, FUSION, SDKMM, and CASTLE. First, we propose ECHO, a semi-supervised approach that addresses infinite length, concept drift, and concept evolution using a limited amount of labeled data. Next, we study the consequences of covariate shift in data stream classification.
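Concept drift is typically handled by monitoring some statistic of the stream, such as classifier confidence, and flagging a significant change. The following is only a minimal illustrative sketch of this general idea, not ECHO’s actual change-detection procedure; the function `detect_drift` and its `window` and `threshold` parameters are hypothetical choices.

```python
from collections import deque

def detect_drift(confidences, window=50, threshold=0.15):
    """Flag a drift point when the mean classifier confidence over the most
    recent `window` instances drops by more than `threshold` compared with
    the mean over the window immediately preceding it.

    Illustrative sketch only; real stream classifiers use more principled
    change-detection tests."""
    recent = deque(maxlen=window)     # most recent window of confidences
    reference = deque(maxlen=window)  # the window just before `recent`
    for i, c in enumerate(confidences):
        if len(recent) == window:
            # oldest value slides out of `recent` into `reference`
            reference.append(recent[0])
        recent.append(c)
        if len(reference) == window:
            ref_mean = sum(reference) / window
            cur_mean = sum(recent) / window
            if ref_mean - cur_mean > threshold:
                return i  # stream position where drift is detected
    return None

# Simulated stream: confidence drops after position 200 (a drift),
# so a detection is flagged shortly after that point.
stream = [0.9] * 200 + [0.5] * 200
print(detect_drift(stream))
```

On drift detection, the model would typically be retrained or an ensemble member replaced; concept evolution additionally requires an outlier/novel-class buffer, which this sketch omits.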
A covariate shift between the training and test distributions may occur due to the difficulty of collecting labeled instances, often resulting in a sampling bias. In a streaming scenario, we consider two separate streams, where one stream provides only labeled training data and the other provides unlabeled test data. These streams may exhibit covariate shift and asynchronous concept drifts between them. The second approach proposed in this dissertation, referred to as FUSION, addresses the challenges of this scenario, also known as the Multistream classification problem. The third and fourth approaches proposed in this dissertation are two scalable paradigms for data stream classification. The third approach, SDKMM, is a sampling-based distributed approach for addressing covariate shift between the training and test distributions. The last approach, referred to as CASTLE, is a hierarchical ensemble classification model for data streams in which the individual classifiers in the hierarchy are trained in parallel and in a distributed fashion. We theoretically analyze various properties of the proposed approaches. Moreover, we evaluate each approach on benchmark datasets and compare it with a number of baseline approaches. Empirical results indicate the effectiveness of the proposed approaches.
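A common remedy for covariate shift is importance weighting: each training instance x is weighted by the density ratio w(x) = p_test(x) / p_train(x), so that the weighted training distribution matches the test distribution. The sketch below estimates such weights with simple one-dimensional histograms; it only illustrates the idea and is not the method used by FUSION or SDKMM (the function name, `bins` parameter, and data are hypothetical).

```python
def importance_weights(train_x, test_x, bins=10):
    """Estimate w(x) = p_test(x) / p_train(x) for 1-D features via
    histograms. Training instances from regions over-represented relative
    to the test stream are down-weighted, correcting the sampling bias.

    Sketch only; kernel-based methods estimate these ratios without
    explicit density estimation."""
    lo = min(min(train_x), min(test_x))
    hi = max(max(train_x), max(test_x))
    width = (hi - lo) / bins or 1.0  # guard against a zero-width range

    def hist(xs):
        # normalized histogram over [lo, hi] with `bins` equal-width bins
        counts = [0] * bins
        for x in xs:
            b = min(int((x - lo) / width), bins - 1)  # clamp x == hi
            counts[b] += 1
        return [c / len(xs) for c in counts]

    p_tr, p_te = hist(train_x), hist(test_x)
    weights = []
    for x in train_x:
        b = min(int((x - lo) / width), bins - 1)
        weights.append(p_te[b] / p_tr[b] if p_tr[b] > 0 else 0.0)
    return weights

# Biased labeled sample concentrated at small x; test stream is spread out.
train = [0.1, 0.1, 0.2, 0.2, 0.8]
test = [0.1, 0.3, 0.5, 0.7, 0.9]
w = importance_weights(train, test, bins=5)
print(w)  # the lone high-x training point receives the largest weight
```

A classifier trained on the weighted instances then approximates one trained on an unbiased sample from the test distribution; distributed variants can compute such weights on partitions of the data in parallel.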