Algorithms for Robust Data Analysis
Abstract
Abstract
Data analysis plays an important role in making decisions, making predictions, and helping
business operate. Unfortunately, in many situations the data is not reliable and robust
analysis is needed to obtain stable results. This may be challenging when the data is high dimensional. Transforming high-dimensional data into low-dimensional data is an important
prior step in applications such as managing the data, performing efficient learning, retrieving
information, and other data analytics tasks.
Feature selection and feature extraction are two classical techniques to achieve dimensionality
reduction. Feature selection removes irrelevant and redundant features and keeps only the
most important ones. Unlike feature selection, feature extraction generates the features
as arbitrary functions of the data. They are typically more accurate than those obtained
by feature selection. But the extracted features are harder to interpret than the selected
features. Another disadvantage is that feature extraction is less robust than feature selection.
In this dissertation we describe algorithms and methods to improve the robustness of feature
selection and feature extraction.
We propose computing a hybrid low rank representation (HLR) of selected features and
extracted features. The robustness of the HLR model comes from the selected features, and
its accuracy comes from the extracted features. We develop an algorithm to solve this hybrid
problem optimally by combinatorial search. We propose optimal, sub-optimal, and greedy
variants of the algorithm to solve this hybrid problem. The sub-optimal and the greedy
variants come with the exact bounds on the representation accuracy.
Principal Component Analysis (PCA) is a widely used feature extraction algorithm. It is
known to be sensitive to outliers that reduce its robustness. Robust Principal Component
Analysis (RPCA), also known as Robust Subspace Recovery (RSR), is a classical approach
to improve the robustness of the PCA by identifying and removing outliers. We develop a
novel RPCA algorithm, which converts this problem into graph search. We show how to
solve the graph search problem optimally by applying heuristic search techniques from AI.
The results obtained by our algorithm are optimal in terms of accuracy. We also describe a
sub-optimal variant that runs much faster than the optimal variant and produces a solution
that is guaranteed to be near the optimal.
Outlier based RPCA removes outliers from the data and computes the principal components
of the remaining data. The centered variant requires the center of non-outliers, which is
unknown until after the outliers are determined. Not using an accurate center may lead
to the detection of wrong outliers. We propose a method that can be used to improve the
robustness of many currently known RPCA algorithms. Our method implicitly centers the
non-outliers; it is implemented by appending a bias value to each data element. It can be
used with “black box” RPCA algorithms since only their input needs to be augmented.