Machine Learning Approaches to Unravelling the Role of Mammalian DNA Methylation in Gene Regulation

Transcriptional regulation is a highly complex and dynamic process established by regulatory pathways involving cascades, feedback loops and other sophisticated control mechanisms. Epigenetic mechanisms are key regulatory processes involving heritable modifications to the genome that do not require the substitution of constituent nucleotides in the DNA sequence but which may be suitably reprogrammed in germ cells. 5-Methylcytosine and 5-hydroxymethylcytosine in DNA are major epigenetic modifications known to be implicated in mammalian gene regulation. The literature suggests that DNA methylation in a promoter or enhancer region causes transcriptional repression, while hydroxymethylation abundance in enhancers coincides with elevated expression of proximal genes. Accordingly, obtaining, analyzing, and interpreting next-generation sequencing methylation data could give us deeper insight into the transcriptome, as well as into modes of epigenetic gene regulation. However, whole-genome methylation assays are expensive, infeasible to conduct for every physiological or perturbation condition, and often generate incomplete genome-wide methylation profiles. To address this, we created a novel, supervised, ensemble-learning classification framework to predict whole-genome methylation and hydroxymethylation status at CpG dinucleotides. Additionally, we developed a platform for in silico, high-throughput hypothesis testing based on such predictions. For de novo methylome reconstruction, we adopted the concept of invariant methylation across mammalian reference methylomes, and incorporated it into our framework by creating a consensus reference methylome. Our toolkit performs fast and accurate prediction and imputation on large amounts (on the order of terabytes) of data in existing sequencing datasets.
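As an illustration of the consensus idea, the sketch below keeps only those CpG sites whose methylation call is identical across every reference methylome. It is a minimal sketch under stated assumptions and not the framework's actual procedure: the binary 0/1 encoding, the strict all-references-agree rule, and the names `Methylome` and `consensus_methylome` are hypothetical choices made for this example.

```python
from typing import Dict, List, Tuple

# Hypothetical representation (an assumption, not the toolkit's format):
# a methylome maps a CpG site (chromosome, position) to a binary call,
# 1 = methylated, 0 = unmethylated.
CpG = Tuple[str, int]
Methylome = Dict[CpG, int]


def consensus_methylome(references: List[Methylome]) -> Methylome:
    """Return the CpG sites that are covered in every reference and
    carry the same call in all of them (the 'invariant' sites)."""
    if not references:
        return {}

    # Sites present in all references.
    shared = set(references[0])
    for ref in references[1:]:
        shared &= set(ref)

    consensus: Methylome = {}
    for site in shared:
        calls = {ref[site] for ref in references}
        if len(calls) == 1:  # all references agree on this site
            consensus[site] = calls.pop()
    return consensus


# Toy example with three made-up reference methylomes.
refs = [
    {("chr1", 100): 1, ("chr1", 200): 0, ("chr1", 300): 1},
    {("chr1", 100): 1, ("chr1", 200): 1, ("chr1", 300): 1},
    {("chr1", 100): 1, ("chr1", 200): 0},
]
invariant = consensus_methylome(refs)
```

In this toy input only `("chr1", 100)` survives: position 200 is called differently across references, and position 300 is not covered in the third one. A real pipeline would more plausibly work with continuous methylation levels and a tolerance threshold rather than strict equality.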
Since we do not use cell-type-specific features such as transcription factor binding sites, models trained on one cell type can be used to predict the epigenetic profile of a related cell type, which shows great promise for transfer-learning scenarios. We test our approach on H1 human embryonic stem cells and H1-derived neural progenitor cells. Our predictive model is comparable in accuracy to other state-of-the-art DNA methylation prediction algorithms, and is the first in silico predictor of hydroxymethylation to achieve high whole-genome accuracy, paving the way for large-scale reconstruction of hydroxymethylation maps in mammalian model systems. We designed a novel, beam-search-driven feature selection algorithm to identify the most discriminative predictor variables, and developed a platform for integrative analysis and reconstruction of the epigenome. Our toolkit, DIRECTION, provides predictions at single-nucleotide resolution and identifies relevant features based on resource availability. This offers enhanced biological interpretability of results, potentially leading to a better understanding of epigenetic gene regulation. Our tool is publicly available and can be downloaded from:

Machine learning, Methylation, Genetic transcription—Regulation, Epigenetics
©2017 Milos Pavlovic. All Rights Reserved.