Data Subset Selection for Compute Efficient Deep Learning


August 2023




Deep learning has achieved tremendous success in a variety of machine learning applications, such as computer vision, natural language processing, and speech recognition. However, the pursuit of better performance has led to larger models and training datasets, increasing training times, energy usage, and carbon emissions. Real-world applications frequently involve hyperparameter tuning, which necessitates multiple training cycles and further exacerbates computational expenses. Additionally, real-world datasets often exhibit challenges like class imbalance, out-of-distribution (OOD) data, and label noise, emphasizing the need for a scalable and comprehensive framework for efficient and robust model training. Such a framework would also democratize deep learning for practitioners with limited resources and reduce the carbon footprint of training, taking a step closer to greener AI.

This dissertation addresses these challenges by developing theoretically sound and scalable data subset selection techniques that identify small, informative data subsets for training machine learning models with minimal performance loss. We provide theoretical evidence that our data subset selection approaches are effective by deriving the rate at which a model trained on the selected subsets with gradient descent converges to its optimal value. Additionally, we demonstrate the versatility of our data subset selection approaches in supervised learning, semi-supervised learning, and hyperparameter tuning across diverse real-world domains such as natural language processing, image classification, and tabular data classification. We also empirically demonstrate that some of the proposed subset selection formulations effectively handle data imperfections like class imbalance, OOD data, and label noise, providing a comprehensive solution for efficient deep learning.
To help practitioners train their models more quickly and robustly, we release a modular, user-friendly open-source repository called “CORDS,” which includes several data subset selection implementations for efficient model training and tuning. We hope that our work will make it easier for researchers and practitioners to develop and deploy deep learning models efficiently.
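To give a flavor of what data subset selection looks like in practice, the sketch below scores each training example by the norm of its per-sample loss gradient and keeps the highest-scoring examples. This is a minimal illustrative baseline under assumed logistic-regression losses, not the dissertation's actual algorithms or the CORDS API; the function name and synthetic data are hypothetical.

```python
import numpy as np

def select_subset_by_gradient_norm(X, y, w, k):
    """Illustrative subset selection: keep the k samples whose per-sample
    logistic-loss gradients have the largest norms (a simple importance-style
    scoring baseline, not the dissertation's method).

    X: (n, d) features; y: (n,) labels in {0, 1}; w: (d,) model weights.
    Returns the indices of the k selected samples.
    """
    # Per-sample gradient of the logistic loss: (sigmoid(x @ w) - y) * x
    p = 1.0 / (1.0 + np.exp(-X @ w))        # predicted probabilities, shape (n,)
    grads = (p - y)[:, None] * X            # per-sample gradients, shape (n, d)
    scores = np.linalg.norm(grads, axis=1)  # one gradient-norm score per sample
    return np.argsort(scores)[-k:]          # indices of the k highest scores

# Tiny demonstration on synthetic data (hypothetical example)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = (X[:, 0] > 0).astype(float)
w = np.zeros(5)
subset = select_subset_by_gradient_norm(X, y, w, k=10)
print(len(subset))
```

Real subset selection methods typically use richer criteria (e.g., gradient matching against the full-data gradient or submodular coverage objectives) and re-select subsets periodically during training, but the score-then-select structure shown here is the common skeleton.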



Computer Science