Sparse Partial Least Squares Correspondence Analysis (sPLS-CA): Applications to Genetics and Behavioral Studies






Most research questions in modern cognitive neuroscience relate two sets of variables collected from the same observations (e.g., participants). A popular multivariate method for such questions is partial least squares correlation (PLS-C). PLS-C is a component-based method that generates, from a two-table data set, components (also called dimensions) that best explain the association between the two sets of variables. Recently, Beaton et al. (2016) combined PLS-C with multiple correspondence analysis (MCA), a component-based method for categorical data, and introduced partial least squares correspondence analysis (PLS-CA), which is tailored to two-table data sets that include categorical data. In all these component-based methods, each component is built by weighting the variables with non-zero values, called loadings, and the extracted components are interpreted based on the variables with high loadings. However, when the data set has a large number of variables, the extracted components are often difficult to interpret because so many variables contribute to each one. To alleviate this problem, statisticians have incorporated sparsification into the singular value decomposition (SVD), the core of these component-based methods, to keep only the variables that contribute the most to each component. Among the available solutions, Guillemot et al. (2019) recently developed the constrained singular value decomposition (CSVD), which obtains loadings that are not only sparse but also orthogonal across components. This orthogonality is important because it facilitates interpretation by ensuring that components do not share confounding information and can therefore be interpreted independently. In this dissertation, I extend the CSVD to develop new methods that extract sparse, orthogonal loadings and components: sparse PLS-C (sPLS-C), sparse MCA (sMCA), and their combination, sparse PLS-CA (sPLS-CA).
Using simulated and empirical data, I show that sparsification facilitates the interpretation of components for PLS-C, MCA, and PLS-CA. For example, I analyzed genetic data from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database with (s)MCA and (s)PLS-CA. Because the genetic information consisted of single nucleotide polymorphisms (SNPs) coded by genotype, these methods analyzed the genetic data as categorical variables. With sparse and orthogonal components, the results from sMCA and sPLS-CA were easier to interpret than those from their non-sparse counterparts. In addition, because (s)MCA and (s)PLS-CA analyze SNPs as categorical data (i.e., genotypes), these methods are useful for exploring heterozygote risks or advantages without a priori information. In the future, the approach can be extended to build sparse versions of other component-based methods and so improve the analysis of large data sets.
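
As a rough illustration of the ideas summarized above, the following Python sketch computes PLS-C as the SVD of the cross-correlation matrix between two tables and then sparsifies the first pair of loadings by alternating soft-thresholded power iterations. This is not the CSVD of Guillemot et al. (2019): it handles a single component and does not enforce orthogonality across components. The function names (`plsc`, `soft_threshold_frac`, `sparse_first_pair`) and the `frac` parameter are hypothetical and chosen only for this example.

```python
import numpy as np


def plsc(X, Y):
    """PLS-C sketch: SVD of the cross-correlation matrix of two tables.

    X and Y hold the same observations in rows. Columns are centered and
    scaled so that R contains the correlations between X and Y variables.
    """
    Zx = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    Zy = (Y - Y.mean(axis=0)) / Y.std(axis=0, ddof=1)
    R = Zx.T @ Zy / (X.shape[0] - 1)
    U, s, Vt = np.linalg.svd(R, full_matrices=False)
    # Columns of U and Vt.T are the (dense) loadings for X and Y.
    return R, U, s, Vt.T


def soft_threshold_frac(a, frac):
    """Shrink entries of a toward zero by frac * max|a| (0 <= frac < 1).

    Assumption for this sketch: tying the threshold to the largest entry
    guarantees that at least one entry survives when a is non-zero.
    """
    lam = frac * np.abs(a).max()
    return np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)


def sparse_first_pair(R, frac_u=0.5, frac_v=0.5, n_iter=100):
    """One sparse loading pair via alternating soft-thresholding.

    A simplified stand-in for a sparse SVD; orthogonality with later
    components is NOT enforced here, unlike in the CSVD.
    """
    _, _, Vt = np.linalg.svd(R, full_matrices=False)
    v = Vt[0]  # start from the dense first right singular vector
    for _ in range(n_iter):
        u = soft_threshold_frac(R @ v, frac_u)
        u /= np.linalg.norm(u)
        v = soft_threshold_frac(R.T @ u, frac_v)
        v /= np.linalg.norm(v)
    return u, v


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 6))
    Y = rng.normal(size=(100, 5))
    Y[:, 0] = X[:, 0] + 0.1 * rng.normal(size=100)  # one shared signal
    R, U, s, V = plsc(X, Y)
    u, v = sparse_first_pair(R)
    print("dense X loadings :", np.round(U[:, 0], 2))
    print("sparse X loadings:", np.round(u, 2))
```

In the dense solution every variable receives a non-zero loading, whereas the thresholded loadings keep only the variables that contribute most, which is the interpretability benefit that sparsification is meant to provide.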



Correspondence analysis (Statistics), Singular value decomposition, Multivariate analysis, Least squares