Semi-Supervised Learning in Drug Classification and Medical Image Analysis





Machine learning classification tasks are divided into two broad categories: supervised learning and unsupervised learning. Supervised learning requires a large amount of labeled data. In practice, large labeled datasets are very difficult to obtain, because annotation requires domain experts and the data itself is costly to collect. Unsupervised learning requires no labeled data and allows the model to detect previously unseen patterns and information. Semi-supervised learning (SSL) sits midway between the two. SSL has proven effective at leveraging unannotated instances, reducing the reliance on large amounts of labeled data while improving model performance. This dissertation focuses on recently proposed unsupervised contrastive learning algorithms for learning from unlabeled data, which aim to produce an embedding space in which positive pairs are concentrated and negative pairs are separated. We introduce MultiCon, a multi-contrastive learning paradigm that learns feature embeddings that are invariant to data augmentation and spread out across instances. Specifically, the multi-contrastive approach maximizes the similarity between differently augmented views of the same example and pushes the embeddings of different instances apart in the latent space. We have tested MultiCon in several application domains: traditional image classification datasets (CIFAR-10, CIFAR-100, SVHN, etc.), therapeutic classification of drugs from their chemical structures, detection of COVID-19 from CT scans and chest X-rays, and breast cancer screening.
Despite its tremendous potential, semi-supervised learning has yet to be implemented in the field of drug discovery. Empirical testing and classification of drugs is costly and time-consuming. In contrast, predicting the therapeutic applications of drugs from their structural formulas using semi-supervised learning would reduce cost and time significantly. We employ MultiCon to classify drugs into 12 categories, according to therapeutic application, on the basis of image analysis of their structural formulas. Through rational use of data balancing, online augmentation of the drug image data during training, and the combination of a multi-contrastive loss with consistency regularization, MultiCon achieves better class prediction accuracy than state-of-the-art machine learning methods across a variety of existing semi-supervised learning benchmarks. We have also employed SSL approaches to detect COVID-19 and breast cancer cases accurately by analyzing digital chest X-rays, CT scans, and mammograms. MultiCon performed strongly in all domains regardless of the experimental setting; these results are presented in the chapters of this dissertation.
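
The contrastive objective described in the abstract (pull two augmented views of the same example together, push different instances apart) can be illustrated with a generic pairwise contrastive loss. The sketch below is not the MultiCon loss, whose exact form is not given here; it is a minimal NT-Xent-style example in plain Python, and the function names, the temperature value, and the toy embeddings are all assumptions made for illustration. It also assumes every embedding is a non-zero vector.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nt_xent_loss(views_a, views_b, temperature=0.5):
    """Generic contrastive loss over paired views (illustrative only).

    views_a[i] and views_b[i] are embeddings of two augmentations of
    example i. Each embedding is attracted to its paired view and
    repelled from every other embedding in the batch.
    """
    n = len(views_a)
    z = views_a + views_b  # 2n embeddings in total
    total = 0.0
    for i in range(2 * n):
        positive = (i + n) % (2 * n)  # index of the paired view
        # Similarities to every other embedding (the anchor itself is excluded).
        logits = [cosine(z[i], z[k]) / temperature
                  for k in range(2 * n) if k != i]
        # Numerically stable log-sum-exp for the denominator.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denominator - cosine(z[i], z[positive]) / temperature
    return total / (2 * n)


# Toy usage: paired views that agree give a lower loss than mismatched pairs.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = nt_xent_loss(anchors, [[0.9, 0.1], [0.1, 0.9]])
swapped = nt_xent_loss(anchors, [[0.1, 0.9], [0.9, 0.1]])
print(aligned < swapped)
```

In a semi-supervised setting, a term of this kind would typically be computed on unlabeled data and combined with a supervised loss on the labeled subset; how MultiCon weights and combines its terms is described in the dissertation itself.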



Machine learning, Supervised learning (Machine learning), Diagnostic imaging