Data Subset Selection for Compute Efficient Deep Learning

dc.contributor.advisor: Iyer, Rishabh
dc.contributor.advisor: Obaid, Girgis
dc.contributor.committeeMember: Ramakrishnan, Ganesh
dc.contributor.committeeMember: Natarajan, Sriraam
dc.contributor.committeeMember: Gogate, Vibhav
dc.creator: Killamsetty, Krishnateja 1994-
dc.creator.orcid: 0000-0001-5565-9126
dc.date.accessioned: 2023-10-24T21:16:34Z
dc.date.available: 2023-10-24T21:16:34Z
dc.date.created: 2023-08
dc.date.issued: August 2023
dc.date.submitted: August 2023
dc.date.updated: 2023-10-24T21:16:35Z
dc.description.abstract:
Deep learning has achieved tremendous success in a variety of machine learning applications, such as computer vision, natural language processing, and speech recognition. However, the pursuit of better performance has led to larger models and training datasets, increasing training times, energy usage, and carbon emissions. Real-world applications frequently involve hyperparameter tuning, which requires multiple training cycles and further compounds computational costs. Additionally, real-world datasets often exhibit challenges such as class imbalance, out-of-distribution (OOD) data, and label noise, underscoring the need for a scalable and comprehensive framework for efficient and robust model training. Such a framework democratizes these models for practitioners with limited resources and reduces carbon emissions, taking a step closer to greener AI. This dissertation addresses these challenges by developing theoretically sound and scalable data subset selection techniques that identify small, informative data subsets for training machine learning models with minimal performance loss. We provide theoretical evidence that our data subset selection approaches are effective by deriving the rate at which a model trained on the selected subsets with gradient descent converges to its optimal value. Additionally, we demonstrate the versatility of our data subset selection approaches in supervised learning, semi-supervised learning, and hyperparameter tuning across diverse real-world domains such as natural language processing, image classification, and tabular data classification. We also empirically demonstrate that some of the proposed subset selection formulations effectively handle data imperfections such as class imbalance, OOD data, and label noise, providing a comprehensive solution for efficient deep learning. To help practitioners train their models more quickly and robustly, we release a modular, user-friendly open-source repository called “CORDS,” which includes several data subset selection implementations for efficient model training and tuning. We hope that our work will make it easier for researchers and practitioners to develop and deploy deep learning models efficiently.
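The central idea the abstract describes, training on a small subset whose aggregate gradient approximates that of the full dataset, can be illustrated with a toy sketch. The snippet below is a minimal, hypothetical NumPy illustration of greedy gradient matching in the spirit of this line of work; it is not the CORDS API, and all names here (greedy_gradient_matching, the synthetic gradient matrix G) are invented for illustration.

```python
import numpy as np

def greedy_gradient_matching(grads, k):
    """Greedily pick k rows of `grads` (per-example gradients, shape [n, d])
    whose running average best approximates the full-data mean gradient.

    Toy stand-in for gradient-matching subset selection; real methods also
    learn per-example weights and use faster (e.g., OMP-style) solvers.
    """
    n, d = grads.shape
    target = grads.mean(axis=0)            # mean gradient of the full dataset
    selected, running_sum = [], np.zeros(d)
    remaining = set(range(n))
    for step in range(k):
        best_i, best_err = None, np.inf
        for i in remaining:
            # matching error if example i were added to the subset
            cand = (running_sum + grads[i]) / (step + 1)
            err = np.linalg.norm(cand - target)
            if err < best_err:
                best_i, best_err = i, err
        selected.append(best_i)
        running_sum += grads[best_i]
        remaining.remove(best_i)
    return selected

rng = np.random.default_rng(0)
G = rng.normal(size=(200, 16))             # synthetic per-example gradients
subset = greedy_gradient_matching(G, k=20)
print(sorted(subset))
```

In practice the per-example gradients are recomputed periodically during training (they change as the model updates), and the selected subset replaces the full dataset for the intervening epochs, which is where the compute savings come from.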
dc.format.mimetype: application/pdf
dc.identifier.uri: https://hdl.handle.net/10735.1/9958
dc.language.iso: English
dc.subject: Computer Science
dc.title: Data Subset Selection for Compute Efficient Deep Learning
dc.type: Thesis
dc.type.material: text
thesis.degree.college: College of Engineering
thesis.degree.department: Computer Science
thesis.degree.grantor: The University of Texas at Dallas
thesis.degree.name: PHD

Files

Original bundle (1 file)

Name: KILLAMSETTY-PRIMARY-2023.pdf
Size: 16.25 MB
Format: Adobe Portable Document Format

License bundle (2 files)

Name: proquest_license.txt
Size: 6.39 KB
Format: Plain Text

Name: license.txt
Size: 2.01 KB
Format: Plain Text