A Bayesian Modeling for Paired Data in Genome-Wide Association Studies with Application to Breast Cancer
Abstract
Many complex human diseases are associated with genetic factors. Identifying genetic markers is a key step in accounting for disease heritability and in developing strategies for disease diagnosis, risk
prediction, prevention, and therapy. The genome-wide association study (GWAS)
has emerged as a powerful tool to identify genetic variants that are associated with various
cancers. The common statistical methodologies in GWAS focus on case-control data where
cases and controls are sampled independently from the populations. Despite the success of
GWAS in finding a number of genetic variants that are associated with cancers, the power
of conventional GWAS is limited. Extensive research has shown that many tumors develop
as a consequence of the progressive accumulation of somatic mutations over time. We focus
on GWAS data from tumor and paired normal tissues to unravel the genetic association
of somatic mutations. Because conventional GWAS methods are not
applicable to matched-pair data, we propose in this dissertation a framework that incorporates allelic relative risk, frequency, and mutation rate to accommodate the structure of
paired data.
We first apply penalized maximum likelihood estimation (MLE) to perform single-marker
analysis based on the framework. Simulation studies are carried out to assess the performance of penalized MLE. To further improve the estimation accuracy and power of single-marker
analysis, we develop a Bayesian hierarchical model that applies
Bayesian shrinkage and bases inference on the posterior distributions. The hierarchical Bayesian model is flexible enough to incorporate prior knowledge and
to extend to multiple-marker analysis. We find that the single-marker Bayesian model
improves estimation accuracy and power in most simulation scenarios. To identify
DNA segments and SNP sets that are associated with the disease, rather than single genetic variants,
we develop a multiple-SNP Bayesian model that considers SNP sets
grouped in a biologically meaningful way, such as by gene or pathway. The multiple-SNP analysis considers the joint effects of the SNP set, which improves the power to identify
SNPs that have only moderate marginal effects. Simulation studies show that the
multi-marker Bayesian model has higher power to identify associated SNPs and lower type
I error rates. Next, we apply the proposed methods to a breast cancer data set from The
Cancer Genome Atlas (TCGA). We compare the most significant genes identified by single-marker
analysis and multiple-marker analysis with external resources on somatic mutations in
breast cancer. We find that both methods identify genes associated with breast cancer, and that
multiple-marker analysis yields results more consistent with the external resources.