AI Inspired Algorithms for Several Combinatorial Optimization Problems in Data Science
Date
Authors
ORCID
Journal Title
Journal ISSN
Volume Title
Publisher
item.page.doi
Abstract
Combinatorial optimization is a class of problems that consists of finding an optimal solution from a finite set of feasible solutions. Many important problems in Data Science can be viewed as combinatorial optimization problems, typically described in terms of selecting a small number of items from a much larger set. We describe Artificial Intelligence (AI) inspired combinatorial optimization algorithms to three selection problems that have important practical applications. The first is the “unsupervised column subset selection problem”, which has important applications to dimensionality reduction. The second is the “supervised column subset selection problem”, which can be viewed as a direct generalization of the unsupervised case. The third is the “outlier detection for Principal Component Analysis (PCA)”. We use ideas from AI to derive new algorithms for these classical problems that are known to be NP-hard. Our algorithms compare favorably with the current state of the art and come with guarantees on the quality of the solutions. In the unsupervised column subset selection problem, one attempts to represent an entire matrix as a linear combination of a small fraction of its columns. We study a generalization that approximates the matrix with both selected and extracted features. We show that an optimal solution to this hybrid problem involves a combinatorial search, and cannot be trivially obtained even if one can optimally solve the separate problems of selection and extraction. Our approach that gives optimal and approximate solutions uses a combinatorial search in a setting similar to the weighted A* algorithm. In the supervised column subset selection problem, we study the approximation of a “target” matrix in terms of several selected columns of another matrix, sometimes called a “dictionary” matrix. We propose the first nontrivial optimal algorithm for this problem, using a combinatorial search setting similar to the classical A∗ algorithm. We also propose practical sub-optimal algorithms in a setting similar to the classical weighted A∗ algorithm. Experimental results show that our sub-optimal algorithms compare favorably with the current state of the art. Previously proposed fastest nontrivial algorithms have a running time proportional to the product of the number of columns of the two matrices. We describe a significantly faster algorithm with complexity proportional to the sum of the number of columns of the two matrices. Outliers negatively affect the accuracy of data analysis. Algorithms that attempt to detect outliers and remove them from the data prior to applying PCA are sometimes called “Robust PCA” algorithms. We propose a new algorithm to detect outliers for PCA that combines two ideas. The first is “chunk recursive elimination” that was used effectively to accelerate feature selection, and the second is combinatorial search, in a setting similar to the weighted A* algorithm. Our main result is showing how to combine these two ideas to balance speed and accuracy. The resulting algorithm is called Chunk-A∗ , with variants that compute optimal and sub-optimal solutions. We also propose a fast algorithm to address this problem. The main idea is to rank each data point by looking ahead and evaluating the change in the global PCA error when an inlier is converted into an outlier. We show that this lookahead procedure can be implemented efficiently, and it is much more accurate than the current state-of-the-art algorithms.