A Bayesian Modeling for Paired Data in Genome-Wide Association Studies with Application to Breast Cancer
Many complex human diseases are associated with genetic factors. Identifying genetic markers is a key step in accounting for disease heritability and in developing strategies for diagnosis, risk prediction, prevention, and therapy. Genome-wide association studies (GWAS) have emerged as a powerful tool for identifying genetic variants associated with various cancers. Common statistical methodologies in GWAS focus on case-control data in which cases and controls are sampled independently from the population. Despite the success of GWAS in finding a number of genetic variants associated with cancers, the power of conventional GWAS is limited. Extensive research has shown that many tumors develop as a consequence of the progressive accumulation of somatic mutations over time. We therefore focus on GWAS data from tumor and paired normal tissues to unravel the genetic association of somatic mutations. Because conventional GWAS methods are not applicable to matched-pair data, we propose in this dissertation a framework that incorporates allelic relative risk, frequency, and mutation rate to accommodate the structure of paired data. We first apply penalized maximum likelihood estimation (MLE) to perform single-marker analysis based on this framework. Simulation studies are carried out to assess the performance of the penalized MLE. To further improve the estimation accuracy and power of single-marker analysis, we develop a Bayesian hierarchical model that applies Bayesian shrinkage and bases inference on the posterior distributions. The hierarchical Bayesian model has the flexibility to incorporate prior knowledge and to extend to multiple-marker analysis. We find that the single-marker Bayesian model improves estimation and power in most simulation scenarios.
To identify DNA segments and SNP sets, rather than single genetic variants, that are associated with the disease, we develop a multiple-SNP Bayesian model that considers SNP sets grouped in a biologically meaningful way, such as by gene or pathway. The multiple-SNP analysis considers the joint effects of the SNP set, which improves the power to identify SNPs that have only moderate marginal effects. Simulation studies show that the multi-marker Bayesian model has higher power to identify associated SNPs and lower type I error rates. Next, we apply the proposed methods to a breast cancer data set from The Cancer Genome Atlas (TCGA). We compare the most significant genes identified by single-marker and multiple-marker analysis with external resources on somatic mutations in breast cancer. We find that both methods identify genes associated with breast cancer, and that multiple-marker analysis gives results more consistent with external resources.
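
The abstract does not spell out the likelihood or priors, so the following is only a rough illustration of what Bayesian shrinkage can do for a single marker in paired data. It is not the dissertation's model: it swaps in the textbook matched-pair log odds ratio computed from discordant pair counts, and a conjugate normal prior centered at zero. The function names, the 0.5 continuity correction, the prior variance `tau2`, and the example counts are all assumptions made for this sketch.

```python
import math


def paired_log_odds(b, c):
    """Matched-pair log odds ratio from discordant counts (illustrative stand-in).

    b: pairs where the variant is seen in the tumor sample only
    c: pairs where the variant is seen in the normal sample only
    A 0.5 continuity correction (an assumption) keeps the estimate finite.
    """
    est = math.log((b + 0.5) / (c + 0.5))
    var = 1.0 / (b + 0.5) + 1.0 / (c + 0.5)
    return est, var


def shrink(est, var, tau2=1.0):
    """Posterior mean and variance under a Normal(0, tau2) prior on the effect.

    Treats the estimate as Normal(effect, var); tau2 is a hypothetical
    prior variance chosen only for illustration.
    """
    weight = tau2 / (tau2 + var)  # fraction of the raw estimate that is kept
    return weight * est, weight * var


# Hypothetical discordant-pair counts for two markers with the same 3:1 ratio.
counts = {"snp_a": (30, 10), "snp_b": (3, 1)}

for name, (b, c) in counts.items():
    est, var = paired_log_odds(b, c)
    post_mean, post_var = shrink(est, var)
    print(f"{name}: raw={est:.3f} shrunk={post_mean:.3f} post_var={post_var:.3f}")
```

In this toy setting the marker backed by fewer discordant pairs has a noisier raw estimate and is pulled more strongly toward zero, which is the general behavior that motivates shrinkage when many markers are tested at once.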