Implementing Appropriate Multivariate Methods for Higher Quality Results from Genetic Association Studies in Substance Abuse Populations


For nearly a century, detecting the genetic contributions to cognitive and behavioral phenomena has been a core interest of psychological research. Recently, this interest has been reinvigorated across many related domains, especially psychiatric research. Furthermore, genotyping technologies (e.g., microarrays) that provide genetic data, such as single nucleotide polymorphisms (SNPs), are now routinely available and easily accessible to almost any researcher. These SNPs, which represent pairs of nucleotide letters (e.g., AA, AG, or GG) found at specific positions on human chromosomes, are best considered as categorical variables. However, a categorical coding scheme can make it difficult to analyze their relationships with behavioral, diagnostic, or clinical measurements, because most multivariate techniques developed for analyzing relationships between sets of variables are designed for quantitative variables. Moreover, there are many (not just one or a few) genetic contributions to complex behaviors and disorders such as substance abuse, so multivariate techniques are required to fully understand them. To address this problem, I present a generalization of partial least squares (PLS), a technique used to extract the information common to two different data tables measured on the same observations. This generalization, called partial least squares correspondence analysis (PLS-CA), is specifically tailored for the analysis of categorical and mixed (“heterogeneous”) data types. I further extend PLS-CA with a ridge-like regularization called Smoothed PLS-CA (SmooPLS-CA). SmooPLS-CA adjusts for overfitting and noise that can lead to the interpretation of spurious effects in high-dimensional, low-sample-size data such as those in genetics and genomics. PLS-CA and SmooPLS-CA were both applied to two genetic data sets within substance use disorders (SUDs) that focused on a large number of genes: an archived set (“discovery”) and an external set (“validation”).
The goal of using two data sets was to discover markers of SUDs in one set and then validate those markers in an independent and completely sequestered set. SmooPLS-CA showed no advantage over standard PLS-CA: bootstrap resampling techniques provided robust results regardless of regularization. Finally, multiple genes were identified as contributors to a broad case-control (i.e., SUDs vs. control group) effect. Some of the identified genes play key roles in the glutamatergic (e.g., GRIN2B) and dopaminergic systems (e.g., CCKBR), whereas other genes play complex or even undefined roles (e.g., PRKCE). In sum, there are many robust, albeit small, genetic effects, as opposed to only a few large effects, that contribute to SUDs.
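
To make the idea of treating SNPs as categorical variables concrete, the sketch below one-hot ("disjunctive") codes a small genotype table and a case-control label, forms their cross-product, and applies a standard correspondence analysis to that table. It is a minimal illustration under stated assumptions, not necessarily the formulation used in the thesis: the toy data, the function names `disjunctive_code` and `ca_of_crosstab`, and the choice to analyze the cross-product table directly are all hypothetical, and no regularization or bootstrap step is shown.

```python
import numpy as np


def disjunctive_code(table):
    """One-hot code each column of a 2-D array of category labels.

    Returns the indicator matrix and one label per indicator column.
    """
    table = np.asarray(table)
    blocks, labels = [], []
    for j in range(table.shape[1]):
        # levels are the sorted distinct categories observed in column j
        levels, inverse = np.unique(table[:, j], return_inverse=True)
        blocks.append(np.eye(len(levels))[inverse])
        labels += [f"v{j}.{lev}" for lev in levels]
    return np.hstack(blocks), labels


def ca_of_crosstab(X, Y):
    """Correspondence analysis of the cross-product of two indicator matrices."""
    R = X.T @ Y                        # co-occurrence counts between levels
    P = R / R.sum()                    # relative frequencies
    r = P.sum(axis=1)                  # row masses
    c = P.sum(axis=0)                  # column masses
    expected = np.outer(r, c)          # frequencies expected under independence
    S = (P - expected) / np.sqrt(expected)   # standardized residuals
    U, s, Vt = np.linalg.svd(S, full_matrices=False)
    return R, U, s, Vt


# Hypothetical toy data: 8 participants, 2 SNPs, and a case-control label.
genotypes = np.array([
    ["AA", "CT"],
    ["AG", "CC"],
    ["GG", "CT"],
    ["AG", "TT"],
    ["AA", "CC"],
    ["AG", "CT"],
    ["GG", "TT"],
    ["AA", "CC"],
])
group = np.array([["SUD"]] * 4 + [["control"]] * 4)

X, x_labels = disjunctive_code(genotypes)
Y, y_labels = disjunctive_code(group)
R, U, s, Vt = ca_of_crosstab(X, Y)

# Total inertia: the sum of squared singular values.
inertia = float(np.sum(s ** 2))
print(x_labels, y_labels)
print(np.round(s, 3), round(inertia, 3))
```

In this sketch each genotype level becomes its own column, so no ordering or additive effect of alleles is assumed. The total inertia multiplied by the table total equals the chi-square statistic of the cross-product table, which is one way to read the decomposition as a breakdown of departure from independence between genotype levels and group membership.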



Least squares, Genetics, Substance abuse, Correspondence analysis (Statistics)


Copyright ©2017 is held by the author. Digital access to this material is made possible by the Eugene McDermott Library. Further transmission, reproduction or presentation (such as public display or performance) of protected items is prohibited except with permission of the author.