Cold Start Active Learning With Submodular Mutual Information for Imbalanced Text Classification

Date

December 2023

Abstract

This study tackles the cold start problem in active learning for imbalanced binary text classification. Focusing on three datasets (YouTube spam, SMS spam, tweet sentiment) with class imbalances in their training data, we investigate the efficacy of Submodular Mutual Information (SMI) methods in the initial active learning stage. These methods aim to balance class representation using a query set of roughly one percent of the unlabeled data size. We compare four SMI approaches (two facility location variants, log determinant, graph cut) with a custom regular expression matching baseline and five established baseline sampling strategies (Random, BADGE, Entropy, Least Confidence, and Margin Sampling) across all datasets. Our experiments, conducted ten times per dataset, reveal that SMI methods on average outperform both regex matching and the traditional baselines, with log determinant performing best. We also analyze how the number of query samples affects performance. This work highlights the potential of SMI for efficiently addressing the cold start challenge in imbalanced text classification contexts.
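The SMI selection step described above can be sketched as greedy maximization of a facility-location mutual information objective between the unlabeled pool and a small set of rare-class query exemplars. The sketch below is illustrative only: the specific objective variant, the embeddings, the cosine similarity choice, and the function name are assumptions, not the thesis's exact setup.

```python
import numpy as np

def flqmi_greedy(sim, budget):
    """Greedy maximization of one facility-location SMI variant:
    f(A) = sum over queries q of max_{a in A} sim[q, a].
    sim: (num_queries, num_unlabeled) similarity matrix."""
    selected = []
    best_cover = np.zeros(sim.shape[0])  # current max similarity per query
    for _ in range(budget):
        # Marginal gain of adding each candidate to the selected set.
        gains = np.maximum(sim, best_cover[:, None]).sum(axis=0) - best_cover.sum()
        gains[selected] = -np.inf  # never re-pick a selected candidate
        j = int(np.argmax(gains))
        selected.append(j)
        best_cover = np.maximum(best_cover, sim[:, j])
    return selected

# Toy example: 4 unlabeled embeddings, one rare-class query exemplar.
unlabeled = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
queries = np.array([[1.0, 0.0]])
# Cosine similarities, shape (num_queries, num_unlabeled).
u = unlabeled / np.linalg.norm(unlabeled, axis=1, keepdims=True)
q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
selected = flqmi_greedy(q @ u.T, budget=2)
```

The greedy loop exploits submodularity: each step adds the point with the largest marginal increase in query coverage, which tends to pull in points similar to the rare-class exemplars and thus balance the seed set.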

Keywords

Cold start, Active learning, Computer science, Rare class detection, Submodular mutual information, Imbalanced text classification
