Some Methods and Applications of Large-scale Genomic Data Analysis






Modern genomic and epigenomic studies have found numerous genomic regions that interact with biochemical factors and regulate gene activity. Many studies focus on the transcriptional regulation of disease-related genes that such regions mediate. Identifying driver elements in non-protein-coding regions for a disease remains a challenging, unsolved problem. Chapter 2 develops a novel statistical test based on single sequence modeling, named the DNA-protein binding changer test. The test predicts insertions and deletions of bases in the genome (InDels) that change protein binding to DNA. It is the first computational and statistical approach that directly evaluates the influence of InDels on protein binding. The proposed binding changer test statistic is based on the binding p-values of the reference and mutation sequences. We employ an importance sampling algorithm in which the binding changer p-value is computed from sequence pairs generated from an importance distribution. We derive the importance distribution, together with the optimal tilting parameter that determines it, so as to maximize the efficiency of the algorithm. The binding changer test is a general approach that applies to any protein-binding motif and to InDel mutations found in any disease type. Simulation studies demonstrate that the test performs well in Type I error control, statistical power, and prediction of binding changer InDels. Applying the test to leukemia data, we obtain InDels that are potentially responsible for leukemia through creating or eliminating binding of the transcription factor MYC.

Chapter 3 introduces an integrative analysis that improves the prediction of cancer driver SNVs (single nucleotide variants) that change transcriptional regulation and influence cancer genes, by leveraging cancer-specific data collected from experiments. It uses an existing noncoding mutation scoring scheme, which makes it possible to retain high-priority SNVs. Highly expressed, non-housekeeping TF (transcription factor) genes are selected using mRNA expression data. The SNVs that may change TF binding to the DNA sequence are then predicted with the atSNP binding change detection methods. Applied to leukemia SNV data, the analysis finds potential leukemia-associated genes along with driver SNVs and TFs in the cis-regulatory structure. Further, we confirm that the integrative approach improves the power to detect regulatory mutations.
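
The importance sampling idea behind the Chapter 2 p-value computation can be illustrated with a minimal sketch. It is not the binding changer test itself, which works on reference and mutation sequence pairs. Under assumed toy inputs, it only shows how exponential tilting with a parameter theta estimates a small tail probability for a single-sequence motif score. The score matrix `TOY_SCORES`, the uniform background `UNIFORM`, the threshold, and both function names are hypothetical and chosen for illustration.

```python
import itertools
import math
import random

# Hypothetical toy motif: 6 positions, bases indexed 0..3 (A, C, G, T).
# The preferred base at each position scores 2.0, every other base -1.0.
TOY_SCORES = [
    [2.0, -1.0, -1.0, -1.0],
    [-1.0, 2.0, -1.0, -1.0],
    [-1.0, -1.0, 2.0, -1.0],
    [-1.0, -1.0, -1.0, 2.0],
    [2.0, -1.0, -1.0, -1.0],
    [-1.0, 2.0, -1.0, -1.0],
]
UNIFORM = [0.25, 0.25, 0.25, 0.25]  # assumed i.i.d. background model


def tilted_tail_probability(scores, background, threshold, theta, n_samples, seed=0):
    """Estimate P(total score >= threshold) under the background model
    by sampling from an exponentially tilted distribution."""
    rng = random.Random(seed)
    tilted = []
    log_z = 0.0
    for row in scores:
        # Tilted base probabilities: proportional to p(b) * exp(theta * score(b)).
        unnorm = [p * math.exp(theta * s) for p, s in zip(background, row)]
        z = sum(unnorm)
        tilted.append([u / z for u in unnorm])
        log_z += math.log(z)
    acc = 0.0
    for _ in range(n_samples):
        total = 0.0
        for row, probs in zip(scores, tilted):
            base = rng.choices(range(len(row)), weights=probs)[0]
            total += row[base]
        if total >= threshold:
            # Likelihood ratio background/tilted = exp(log_z - theta * total).
            acc += math.exp(log_z - theta * total)
    return acc / n_samples


def exact_tail_probability(scores, background, threshold):
    """Brute-force reference value; feasible only for short motifs."""
    prob = 0.0
    for seq in itertools.product(range(len(background)), repeat=len(scores)):
        total = sum(row[b] for row, b in zip(scores, seq))
        if total >= threshold:
            seq_prob = 1.0
            for b in seq:
                seq_prob *= background[b]
            prob += seq_prob
    return prob


if __name__ == "__main__":
    exact = exact_tail_probability(TOY_SCORES, UNIFORM, 9.0)
    estimate = tilted_tail_probability(TOY_SCORES, UNIFORM, 9.0, theta=0.9, n_samples=20000)
    print(exact, estimate)
```

In this toy setting, theta = 0.9 is chosen so that the mean score under the tilted distribution is close to the threshold. This is a common heuristic and is not the optimal tilting parameter derived in the thesis.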