Multistream Mining and Its Applications
Abstract
The growth of Internet technology has produced a large amount of stream data in daily life. Data streams come from a variety of sources, such as online shopping, social media, smart sensors, and transportation. Data streams have some unique properties: they are fast changing, time stamped, high in volume, and potentially infinite. Hence, traditional machine learning approaches have limited ability to handle a whole data stream, since a data stream usually has tremendous volume. In a supervised learning scenario, the availability of sufficient labeled data is critical. However, collecting ground truth data is time consuming and very expensive in many real-world applications. Particularly when performing prediction over a non-stationary data stream, limited labeled data affects the classifier's long-term performance by limiting its adaptability to changes in the data distribution over time. In this dissertation, the approaches we propose to solve this problem fall into two directions: Multistream Classification and Multistream Regression. For the first direction, we propose a transfer learning framework that addresses the covariate shift and concept drift challenges in a data stream setting. We consider two independent non-stationary streams. One stream, called the source stream, contains labeled data. The other, called the target stream, contains unlabeled data. Data instances in the source stream have a biased distribution compared to data instances in the target stream. The label prediction task under this scenario is called Multistream Classification. In this task, data instances in the source stream and the target stream occur independently. While previous studies have addressed various challenges in this multistream setting, existing approaches still suffer from large computational overhead, mainly due to frequently employed bias correction and drift adaptation methods. In this dissertation, we propose a multistream framework called MSCRDR.
In MSCRDR, we focus on utilizing an alternative bias correction technique, called relative density-ratio estimation, which is known to be computationally faster. Importantly, we propose a novel mechanism to automatically learn an appropriate mixture of relative density that adapts to changes in the multistream setting over time. In addition, we theoretically study its properties and empirically illustrate its superior performance. We then extend our research to the Multistream Regression task, proposing a multistream regression framework that unifies concept drift detection and covariate shift adaptation. We use multiple real-world and synthetic datasets to evaluate the performance of our proposed framework, and we include multiple state-of-the-art approaches in our experiments. Empirical evaluation results indicate the effectiveness of our proposed framework. We demonstrate our approaches through several applications: a game cheating application, a flight delay project, and COVID-19 spread analysis.
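
As an illustration of the quantity behind relative density-ratio estimation, the following minimal sketch evaluates the alpha-relative density ratio, p_target(x) / (alpha * p_target(x) + (1 - alpha) * p_source(x)), for two known one-dimensional Gaussian densities. It is not the MSCRDR method and it does not estimate anything from data. The Gaussian parameters, the function names, and the value alpha = 0.5 are assumptions chosen only for this example. The point it shows is that, for alpha > 0, the relative ratio is bounded above by 1/alpha, whereas the plain ratio (alpha = 0) can grow without bound where the source density is small.

```python
import math


def normal_pdf(x, mean, std):
    """Density of a 1-D Gaussian; used here as a stand-in for a stream's distribution."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def relative_density_ratio(p_target, p_source, alpha):
    """Alpha-relative density ratio: p_target / (alpha * p_target + (1 - alpha) * p_source).

    alpha = 0 gives the plain ratio p_target / p_source.
    For alpha > 0 the value never exceeds 1 / alpha.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return p_target / (alpha * p_target + (1.0 - alpha) * p_source)


if __name__ == "__main__":
    # Hypothetical setting: source ~ N(0, 1), target ~ N(1, 1).
    alpha = 0.5  # assumed mixing value, for illustration only
    for x in (-1.0, 0.0, 1.0, 3.0):
        p_s = normal_pdf(x, 0.0, 1.0)
        p_t = normal_pdf(x, 1.0, 1.0)
        plain = relative_density_ratio(p_t, p_s, 0.0)
        relative = relative_density_ratio(p_t, p_s, alpha)
        print(f"x={x:+.1f}  plain={plain:.3f}  relative={relative:.3f}")
```

In a covariate shift setting, such ratios would typically serve as importance weights on labeled source instances. In practice the densities are unknown, so the ratio has to be estimated directly from samples of the two streams.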