Sparse Partial Least Square Correspondence Analysis (sPLS-CA): Applications to Genetics and Behavioral Studies
Abstract
Most research questions in modern cognitive neuroscience relate two sets of variables collected from the same observations (e.g., participants). A popular multivariate method for
such research questions is the partial least square correlation (PLS-C) method. PLS-C is a
component-based method that generates, from a two-table data set, components (also called
dimensions) that best explain the association between the two sets of variables. Recently,
Beaton et al. (2016) combined PLS-C with multiple correspondence analysis (MCA)—a
component-based method for categorical data—and introduced partial least square correspondence analysis (PLS-CA), which is tailored to analyze two-table data sets that include
categorical data. For all these component-based methods, the variables are weighted by
non-zero values—called loadings—to build a component, and the extracted components are
interpreted based on the contributing variables that have high loadings. However, when the
data set has a large number of variables, the extracted components are often difficult to
interpret because of the sheer number of variables. To alleviate this problem, statisticians have incorporated sparsification into the singular value decomposition (SVD)—the core
method of these component-based methods—to keep only variables that contribute the most
to each component. Among different solutions, Guillemot et al. (2019) recently developed
the constrained singular value decomposition (CSVD), a method that obtains loadings that
are not only sparse but also orthogonal across different components. This orthogonality is
important because it facilitates the interpretation of the components by ensuring that they
do not share confounding information and can, therefore, be interpreted independently. In
this dissertation, I extend the CSVD to develop new methods that extract orthogonal loadings and components: sparse PLS-C (sPLS-C), sparse MCA (sMCA), and their combination,
sparse PLS-CA (sPLS-CA). I show, with simulated and empirical data, that sparsification
facilitates the interpretation of components for PLS-C, MCA, and PLS-CA. For example, I
analyzed the genetics data obtained from the Alzheimer’s Disease Neuroimaging Initiative
(ADNI) database with (s)MCA and (s)PLS-CA. Because the genetics information was represented by the genotypes of single nucleotide polymorphisms (SNPs), these methods analyzed the genetics data as categorical variables. These analyses showed that, with sparse and orthogonal components, the results from sMCA and sPLS-CA were easier to interpret.
In addition, because (s)MCA and (s)PLS-CA analyzed SNPs as categorical data (i.e., genotypes), these methods were useful for exploring—without a priori information—heterozygote
risks or advantages. In the future, this framework can be extended to develop sparse versions
of other component-based methods and so improve the analysis of large data sets.
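As a rough illustration of the core idea summarized above, the sketch below (in Python with NumPy; the tables, their sizes, and the threshold value are invented for the example) computes PLS-C loadings as the singular vectors of the cross-product matrix of two centered data tables, then sparsifies the first pair of loadings by soft-thresholding. This is only a caricature of sparsification: the CSVD of Guillemot et al. (2019) enforces sparsity and orthogonality across components jointly through an iterative projection scheme, which simple per-component thresholding does not.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two tables measured on the same 50 observations (rows):
# X could hold behavioral variables, Y genetic variables.
X = rng.standard_normal((50, 8))
Y = rng.standard_normal((50, 6))

# Column-center both tables, as PLS-C requires.
X = X - X.mean(axis=0)
Y = Y - Y.mean(axis=0)

# PLS-C decomposes the cross-product matrix R = X'Y with the SVD;
# the singular vectors give the loadings of the two variable sets.
R = X.T @ Y
U, s, Vt = np.linalg.svd(R, full_matrices=False)

def soft_threshold(w, lam):
    """Shrink each entry toward zero; entries with |w_i| <= lam become exactly 0."""
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

# Sparsify the first pair of loadings and renormalize the survivors.
lam = 0.3  # arbitrary threshold chosen for illustration
u_sparse = soft_threshold(U[:, 0], lam)
v_sparse = soft_threshold(Vt[0, :], lam)
for w in (u_sparse, v_sparse):
    n = np.linalg.norm(w)
    if n > 0:
        w /= n  # in-place normalization

print("non-zero X loadings:", np.count_nonzero(u_sparse), "of", u_sparse.size)
print("non-zero Y loadings:", np.count_nonzero(v_sparse), "of", v_sparse.size)
```

With sparsified loadings, only the variables whose loadings survive thresholding enter a component, which is what makes the components easier to interpret when the number of variables is large.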