Feature Engineering for Data Analytics
Data plays a fundamental role in modern science, engineering, and business. We investigate two important problems in data analytics. The first is feature selection, where we consider both the unsupervised and the supervised case. The second is data privacy, where we propose a new model and describe algorithms that improve data privacy under that model.

Feature selection is the process of removing redundant and irrelevant features from the data. There are two general classes of feature selection: unsupervised and supervised. In the unsupervised case, features are selected to approximate the entire data matrix; in the supervised case, features are selected to predict a set of labels from the data matrix. We describe several new algorithms for the unsupervised case that are closely related to the A∗ heuristic search algorithm. These algorithms can effectively select features from large datasets, and we show experimentally that they are more accurate than the current state of the art. The evaluation criterion for feature selection is typically an error measured in the Frobenius norm. We generalize this criterion to a large family of unitarily invariant norms, including, among others, the spectral norm, the nuclear norm, and the Schatten p-norms.

We also propose several algorithms for supervised feature selection that improve the running time and the accuracy of the current state of the art. A common approach to reducing the running time is to perform the selection in two stages. In the first stage, a fast filter selects good candidates; in the second stage, a more accurate but significantly slower algorithm reduces the number of candidates further. We describe a general framework that can use an arbitrary off-the-shelf unsupervised algorithm for the second stage, applying it to an appropriately weighted version of the selection obtained in the first stage.
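As a concrete illustration of the selection criterion discussed above, the sketch below scores a candidate subset of columns by the norm of the residual after projecting the data matrix onto their span. The function name and the naive pinv-based projection are illustrative assumptions, not the thesis algorithms; with p=None the score is the Frobenius-norm error, and an integer p gives the corresponding Schatten p-norm of the residual.

```python
import numpy as np

def selection_error(A, S, p=None):
    """Error of approximating A by its projection onto the span of the
    selected columns A[:, S].  p=None gives the Frobenius norm of the
    residual; an integer p gives the Schatten p-norm (p=2 coincides
    with Frobenius).  Hypothetical illustration of the criterion only."""
    A_S = A[:, S]
    # Orthogonal projection of every column of A onto span(A_S).
    R = A - A_S @ np.linalg.pinv(A_S) @ A
    if p is None:
        return np.linalg.norm(R, 'fro')
    s = np.linalg.svd(R, compute_uv=False)  # singular values of residual
    return (s ** p).sum() ** (1.0 / p)

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 10))
err_all = selection_error(A, list(range(10)))  # all columns: residual vanishes
err_some = selection_error(A, [0, 1, 2])
```

Since the Schatten 2-norm is exactly the Frobenius norm, `selection_error(A, S, p=2)` agrees with the default, which is a handy sanity check on any implementation of the generalized criterion.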
Another common approach to accelerating feature selection is a greedy technique called “forward selection”. We show how to use this technique to address the multi-label classification problem; experimental results on real-world data demonstrate the effectiveness of the proposed approach. We also generalize the error criterion of forward selection to unitarily invariant functions. In particular, we show how to minimize Schatten p-norms to solve the outlier-robust PCA problem. The resulting algorithm is very efficient, and experimental results show that it outperforms a well-known outlier-robust PCA method based on convex optimization.

In standard machine learning and regression, feature values are used to predict some desired information from the data. One privacy concern is that the feature values may also expose information that one wishes to keep confidential. We propose a model that formalizes this concern and show that such privacy can be achieved with almost no effect on the quality of predicting the desired information. We describe two algorithms for the case in which the prediction model starts with a linear operator; the desired effect can be achieved by zeroing out the feature components that lie in the approximate null space of that operator.
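The greedy forward-selection strategy mentioned above can be sketched as follows: at each step, add the column that most reduces the reconstruction error. This naive version re-evaluates the Frobenius-norm error from scratch for every candidate, which is far slower than the efficient updates a practical implementation would use; the function name and structure are illustrative assumptions, not the thesis code.

```python
import numpy as np

def forward_select(A, k):
    """Greedy forward selection of k columns of A: at each step add the
    column that most reduces the Frobenius-norm error of projecting A
    onto the span of the selected columns.  Naive re-evaluation for
    clarity (hypothetical sketch, not an efficient implementation)."""
    n = A.shape[1]
    S = []
    for _ in range(k):
        best_j, best_err = None, np.inf
        for j in range(n):
            if j in S:
                continue
            cand = A[:, S + [j]]
            # Residual of A after projection onto span of the candidate set.
            R = A - cand @ np.linalg.pinv(cand) @ A
            err = np.linalg.norm(R, 'fro')
            if err < best_err:
                best_j, best_err = j, err
        S.append(best_j)
    return S

rng = np.random.default_rng(0)
A = rng.standard_normal((30, 8))
S = forward_select(A, 3)
```

Replacing the Frobenius norm in the inner loop with another unitarily invariant function (e.g., a Schatten p-norm of the residual) is the kind of generalization described above.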
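The null-space idea behind the privacy result can be illustrated with a short sketch. Since a linear operator W is blind to the components of an input that lie in its null space, projecting those components away erases information they carry while leaving the prediction W x (numerically) unchanged. The function below is a hypothetical illustration of this principle, not the two algorithms described in the text.

```python
import numpy as np

def sanitize(W, x, tol=1e-10):
    """Zero out the components of x lying in the approximate null space
    of W.  The returned vector yields the same prediction W @ x, while
    the null-space directions, which may expose confidential information,
    are erased.  Hypothetical sketch of the idea only."""
    # Orthonormal basis of the row space of W from its SVD.
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    r = int((s > tol * s.max()).sum())  # numerical rank
    V_r = Vt[:r].T                      # basis of the row space
    return V_r @ (V_r.T @ x)            # project out the null-space part

rng = np.random.default_rng(1)
W = rng.standard_normal((3, 8))         # wide operator: large null space
x = rng.standard_normal(8)
x_clean = sanitize(W, x)
```

Because W here maps 8 features to 3 outputs, its null space has dimension 5, so `x_clean` differs substantially from `x` even though both produce the same prediction.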