Guided Subset Selection Based Unified Active Learning Framework: Formulations, Algorithms and Applications


May 2023




Deep learning has been successful in a wide variety of domains, ranging from face recognition to self-driving cars. The success of deep learning algorithms is due to large amounts of labeled data, which are easily available in today’s digital world. However, training deep models on large datasets incurs high compute and labeling costs. Moreover, large datasets are often plagued by data imperfections such as class imbalance, out-of-distribution (OOD) data, and redundancy. Training machine learning models on such datasets yields biased models despite the expense incurred. These models also underperform on rare yet critical scenarios, which can be catastrophic. One or more of these problems can occur at any point during the development of machine learning models. Hence, there is a need for a holistic framework that can serve as a one-stop solution for improving data efficiency and model efficiency and for reducing data imperfections. In this dissertation, we focus on designing a Guided Active Learning framework that can serve this purpose. In particular, we propose four phases of the Guided Active Learning framework: 1) Seed Set Selection, 2) Discovery, 3) Targeting, and 4) Filtering. The framework is designed to be modular, since different teams may need to optimize for different phases. The Seed Set Selection phase focuses on finding an initial set that represents the larger dataset, such that the majority of the information and semantics of the dataset are covered. The Discovery phase focuses on finding unknown instances that do not exist in the current labeled dataset or were potentially missed during the exploration phase. The Targeting phase aims to mitigate data imbalance by finding data points that are semantically similar to rare classes or slices. The Filtering phase focuses on preventing out-of-distribution and redundant data points from being selected.
We provide algorithms and mathematical formulations for executing these phases in several applications. In particular, we demonstrate the effectiveness of the Guided Active Learning framework on a wide range of real-world domains, including autonomous driving, medical imaging, and automated speech recognition. We hope that the Guided Active Learning framework will help practitioners navigate data better and improve the performance of their downstream machine learning models.
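
As a rough illustration of what the Seed Set Selection phase described above asks for, the sketch below greedily picks a small set of points that "covers" the rest of a dataset. It is a minimal sketch under stated assumptions, not the dissertation's formulation: the abstract does not say which objective is used, so a facility-location-style coverage score over cosine similarities is assumed here, and the names `cosine_similarity` and `greedy_seed_set` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def greedy_seed_set(features, budget):
    """Greedily select up to `budget` indices that best represent `features`.

    Assumed objective (not taken from the source): the sum, over all points,
    of each point's highest similarity to any selected point, floored at 0.
    """
    n = len(features)
    budget = min(budget, n)
    sim = [[cosine_similarity(features[i], features[j]) for j in range(n)]
           for i in range(n)]

    best_cover = [0.0] * n  # how well each point is currently represented
    selected = []
    for _ in range(budget):
        best_gain, best_j = -1.0, None
        for j in range(n):
            if j in selected:
                continue
            # Marginal gain: total improvement in coverage if j is added.
            gain = sum(max(sim[i][j] - best_cover[i], 0.0) for i in range(n))
            if gain > best_gain:
                best_gain, best_j = gain, j
        selected.append(best_j)
        best_cover = [max(best_cover[i], sim[i][best_j]) for i in range(n)]
    return selected


# Toy usage: two loose clusters of 2-D features, budget of two seed points.
toy_features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
seed_indices = greedy_seed_set(toy_features, budget=2)
```

On the toy data the two selected indices fall in different clusters, which is the behavior the phase description implies: once one cluster is represented, the next pick gains the most by covering the other. In practice the features would come from a learned embedding, and the similarity measure and objective would be chosen to match the application.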



Computer Science