Feature Selection and Extraction - Algorithms and Applications
Shah, Swair Rajesh
MetadataShow full item record
Feature selection is a very important process in statistics and machine learning. It removes redundant and irrelevant features and selects the most useful set of features from a given dataset. This tends to improve generalization of machine learning algorithms and reduces training time. Feature selection is used to make the models more interpretable. Recently it has been also used to reduce bias of such models and ensure fairness of the outcome. Feature extraction is another dimensionality reduction process which finds a small set of features to approximate a given dataset. Unlike feature selection in extraction the resulting features can be arbitrary functions of the features in the original dataset. There are fast algorithms to compute feature extraction but it doesn’t provide the interpretability aspect of feature selection and it tends to be less effective than feature selection in making models generalize better. One of the problems addressed in this dissertation is a hybrid problem which combines feature selection and extraction. This hybrid problem is at least as hard as feature selection which is known to be NP-hard. We show how simplistic sequential application of optimal selection and extraction does not provide an optimal solution for this problem. We develop an algorithm to solve the hybrid problem optimally using methods inspired by the classic A* search algorithm. One of the most widely used feature extraction methods is the Principal Component Analysis (PCA). It is known to be very sensitive to the outliers in the data. There have been various attempts in the literature to address this issue none promising an optimal solution to the problem. We model this problem as a graph search problem and again apply our heuristic search framework to design an algorithm which solves this problem optimally. We show that we compare favorably to the state-of-the-art convex relaxation approach. PCA is closely tied to a very popular linear algebra problem called the eigenvalue problem. The third part of the dissertation uses the eigenvalue problem and a variant of it known as the generalized eigenvalue problem to achieve the privacy of the user data. Today there are many companies which provide predictive models as services. In order to use these services one needs to send one’s data to such a service for prediction or inference. It is possible that this data can be used to infer some confidential information about the data sender. We design algorithms to apply transformations to this data so that the inference of the confidential information is prevented while the data can still be used to infer the desired information.