## AI-Inspired Algorithms for Several Combinatorial Optimization Problems in Data Science

##### Abstract

Combinatorial optimization is a class of problems that consists of finding an optimal solution from a finite set of feasible solutions. Many important problems in Data Science can
be viewed as combinatorial optimization problems, typically described in terms of selecting
a small number of items from a much larger set. We describe Artificial Intelligence (AI)-inspired
combinatorial optimization algorithms for three selection problems that have important
practical applications. The first is the “unsupervised column subset selection problem”,
which has important applications to dimensionality reduction. The second is the “supervised
column subset selection problem”, which can be viewed as a direct generalization of the
unsupervised case. The third is outlier detection for Principal Component Analysis (PCA).
We use ideas from AI to derive new algorithms for these classical problems, which are
known to be NP-hard. Our algorithms compare favorably with the current state of the art
and come with guarantees on the quality of the solutions.
In the unsupervised column subset selection problem, one attempts to represent an entire
matrix as a linear combination of a small fraction of its columns. We study a generalization
that approximates the matrix with both selected and extracted features. We show that
an optimal solution to this hybrid problem involves a combinatorial search, and cannot
be trivially obtained even if one can optimally solve the separate problems of selection and
extraction. Our approach, which yields both optimal and approximate solutions, uses a
combinatorial search in a setting similar to the weighted A* algorithm.
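To make the selection objective concrete, the sketch below shows the standard column subset selection error (the Frobenius-norm residual of projecting a matrix onto the span of chosen columns) together with a simple greedy baseline. This is a minimal NumPy illustration with hypothetical helper names; it is only a baseline, not the combinatorial-search (weighted A*-style) method described above, and it ignores the extracted-feature component of the hybrid problem.

```python
import numpy as np

def css_error(A, cols):
    """Frobenius-norm error of representing A by the selected columns."""
    C = A[:, list(cols)]
    # Least-squares coefficients: best linear combination of the chosen columns.
    X, *_ = np.linalg.lstsq(C, A, rcond=None)
    return float(np.linalg.norm(A - C @ X))

def greedy_css(A, k):
    """Greedy baseline: repeatedly add the column that most reduces the error."""
    chosen = []
    for _ in range(k):
        best = min((j for j in range(A.shape[1]) if j not in chosen),
                   key=lambda j: css_error(A, chosen + [j]))
        chosen.append(best)
    return chosen
```

Because greedy choices can be myopic, the optimal subset generally requires searching over combinations of columns, which motivates the A*-style search.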
In the supervised column subset selection problem, we study the approximation of a “target”
matrix in terms of several selected columns of another matrix, sometimes called a “dictionary” matrix. We propose the first nontrivial optimal algorithm for this problem, using a
combinatorial search setting similar to the classical A* algorithm. We also propose practical
sub-optimal algorithms in a setting similar to the classical weighted A* algorithm. Experimental
results show that our sub-optimal algorithms compare favorably with the current state of
the art. The fastest previously proposed nontrivial algorithms have a running time
proportional to the product of the number of columns of the two matrices. We describe
a significantly faster algorithm with complexity proportional to the sum of the number of
columns of the two matrices.
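The supervised objective can be illustrated with a brute-force baseline: choose the subset of dictionary columns whose least-squares fit to the target matrix has the smallest residual. This is an exhaustive sketch (exponential in the subset size) meant only to show why the problem is combinatorial; the function names are hypothetical, and the A*-based algorithms above avoid this exhaustive enumeration.

```python
import itertools
import numpy as np

def supervised_css_error(A, B, cols):
    """Residual of approximating target B by the chosen columns of dictionary A."""
    C = A[:, list(cols)]
    X, *_ = np.linalg.lstsq(C, B, rcond=None)
    return float(np.linalg.norm(B - C @ X))

def brute_force_css(A, B, k):
    """Optimal but exhaustive: try every size-k subset of A's columns."""
    return min(itertools.combinations(range(A.shape[1]), k),
               key=lambda S: supervised_css_error(A, B, S))
```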
Outliers negatively affect the accuracy of data analysis. Algorithms that attempt to detect
outliers and remove them from the data prior to applying PCA are sometimes called “Robust
PCA” algorithms. We propose a new algorithm to detect outliers for PCA that combines two
ideas. The first is “chunk recursive elimination”, which was used effectively to accelerate
feature selection; the second is combinatorial search in a setting similar to the weighted A*
algorithm. Our main result shows how to combine these two ideas to balance speed and
accuracy. The resulting algorithm is called Chunk-A*, with variants that compute optimal
and sub-optimal solutions. We also propose a fast algorithm to address this problem. The
main idea is to rank each data point by looking ahead and evaluating the change in the
global PCA error when an inlier is converted into an outlier. We show that this lookahead
procedure can be implemented efficiently and is much more accurate than current
state-of-the-art algorithms.
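The lookahead idea can be sketched as follows: score each point by how much the global PCA reconstruction error would drop if that point were declared an outlier and removed. This naive NumPy version (with hypothetical names) recomputes an SVD per point; the abstract's claim is precisely that the lookahead can be implemented much more efficiently than this.

```python
import numpy as np

def pca_error(X, r):
    """Sum of squared residuals of the best rank-r approximation of centered X."""
    Xc = X - X.mean(axis=0)
    s = np.linalg.svd(Xc, compute_uv=False)
    return float(np.sum(s[r:] ** 2))

def lookahead_scores(X, r):
    """Score each point by the drop in global PCA error when it is removed;
    a larger drop means the point looks more like an outlier."""
    base = pca_error(X, r)
    return np.array([base - pca_error(np.delete(X, i, axis=0), r)
                     for i in range(len(X))])
```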