Scalable and Secure Learning with Limited Supervision over Data Streams







Applications that employ machine learning over a stream of data provide the knowledge their users need to make informed decisions at the right time. With the advantages of cloud computing infrastructure, these applications can potentially reach a myriad of users. However, challenges arising from the statistical properties of data evolving continuously over time, together with concerns about data security, have severely limited their adoption in the real world. This dissertation contributes new results that address critical challenges in both of these complementary research areas, particularly when such applications are deployed over a third-party resource. The first part of this dissertation introduces a novel framework for data classification over a non-stationary data stream, where the goal is to learn from limited labeled data over time. It considers a scenario in which multiple processes continuously generate data, under the constraint that labeled data is generated only by a small set of processes whose data distribution is biased compared to the population. The effect of learning with such sampling bias in a concept-drifting data stream is explored. Changes in data distribution over time, combined with biased labeled data, degrade classifier performance. By representing instances along the stream as two independent streams, one containing labeled instances (called the source stream) and the other containing unlabeled instances (called the target stream), methodologies that uniquely combine transfer learning mechanisms with drift detection are presented. While the above framework may adapt existing batch-wise bias correction techniques, these are computationally expensive and do not scale over a data stream. The next part of this dissertation explores sampling and ensemble techniques to address this challenge. The theoretical and empirical results show large improvements in computational time while maintaining performance similar to that of the baseline methods.
The final part of this dissertation considers security concerns when deploying applications that use machine learning systems on an untrusted third-party resource. Here, the focus is on protecting the learning system against insider threats. A strong adversary can circumvent the security and privacy mechanisms of an application that aims to protect its code and data. Using Intel SGX, a recent commercial off-the-shelf (COTS) hardware-based cryptographic platform, a black-box system can be achieved that protects against such direct attacks. Unfortunately, the platform has side channels that leak information during computation. A novel defense strategy that leverages the trade-off between computational efficiency and privacy to address this challenge is presented, with results demonstrating a large gain in computational time compared to competing strategies.



Supervised learning (Machine learning), Computer multitasking, Streaming technology (Telecommunications), Data mining, Data encryption (Computer science)


©2018 The Author. Digital access to this material is made possible by the Eugene McDermott Library. Further transmission, reproduction or presentation (such as public display or performance) of protected items is prohibited except with permission of the author.