Cold Start Active Learning With Submodular Mutual Information for Imbalanced Text Classification
Abstract
This study tackles the cold start problem in active learning for imbalanced binary text classification. Focusing on three datasets with imbalanced training data (YouTube spam, SMS spam, tweet sentiment), we investigate the efficacy of Submodular Mutual Information (SMI) methods in the initial active learning stage. These methods aim to balance class representation using a query set of roughly one percent of the unlabeled pool. We compare four SMI approaches (two facility location variants, log determinant, and graph cut) against a custom regular-expression-matching baseline and five established sampling strategies (Random, BADGE, Entropy, Least Confidence, and Margin Sampling) across all datasets. Our experiments, repeated ten times per dataset, reveal that SMI methods, especially log determinant, on average outperform both regex matching and the traditional baselines. We also analyze how the number of query samples affects performance. This work highlights the potential of SMI for efficiently addressing the cold start challenge in imbalanced text classification.
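To make the selection step concrete, below is a minimal sketch of greedy maximization of a facility-location-style SMI objective (in the spirit of FL2MI from the SMI literature). It assumes a precomputed similarity matrix between the unlabeled pool and a small rare-class exemplar set Q (e.g., seeds found via keyword or regex matching); the function name, the `eta` trade-off parameter, and this exact instantiation are illustrative assumptions, not necessarily the paper's implementation.

```python
import numpy as np

def fl2mi_greedy(sim_uq, budget, eta=1.0):
    """Greedily maximize a facility-location SMI objective (FL2MI-style):
        I(A; Q) = sum_{q in Q} max_{a in A} s(a, q)
                + eta * sum_{a in A} max_{q in Q} s(a, q)
    sim_uq: (n_unlabeled, n_query) similarity matrix between the unlabeled
    pool and the exemplar/query set Q. Returns indices of selected points.
    Hypothetical helper for illustration, not the authors' exact code.
    """
    n, m = sim_uq.shape
    selected = []
    # coverage[q] = best similarity of query point q to any selected point
    coverage = np.zeros(m)
    # The second (relevance) term contributes a fixed per-candidate bonus.
    relevance = eta * sim_uq.max(axis=1)
    for _ in range(budget):
        # Marginal gain of each candidate: improvement in coverage of Q
        # plus its relevance bonus.
        gains = np.maximum(sim_uq - coverage[None, :], 0.0).sum(axis=1)
        gains = gains + relevance
        gains[selected] = -np.inf  # never reselect a point
        j = int(np.argmax(gains))
        selected.append(j)
        coverage = np.maximum(coverage, sim_uq[j])
    return selected
```

In a cold-start setting, `sim_uq` might be cosine similarities between sentence embeddings of the unlabeled texts and a handful of minority-class seed examples; the coverage term spreads the query set over regions similar to Q, while `eta` controls how strongly individual candidates are pulled toward the rare class.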