Computational Methods for Analyzing and Visualizing NGS Data

Date

2019-12

Abstract

Advances in next-generation sequencing (NGS) technology have made large quantities of DNA and RNA sequence data rapidly and widely available. Sequences from both model and non-model organisms can now be acquired at low cost. Large-scale genomic and proteomic data have enabled scientific achievements, many of which were once thought impossible, and novel biological applications have been developed to study the genetic contribution to human disease and evolution. This is especially true for uncovering new insights ranging from comparative genomics to the evolution of disease. For example, NGS allows researchers to identify all changes between sequences in a sample set, which could be used in a clinical setting for applications such as early cancer detection. This dissertation describes a set of computational bioinformatics approaches that bridge the gap between the large-scale, high-throughput sequencing data now available and the lack of computational tools to make predictions for, and assist in, evolutionary studies. Specifically, I focused on developing computational methods that enable analysis and visualization for three distinct research tasks. These tasks focus on NGS data and range in scope from processed genomic data, to raw sequencing data, to viral proteomic data.
The first task focused on visualizing two genomes and the changes required to transform one sequence into the other, which mimics the evolutionary process that these organisms have undergone. My contribution to this task is DCJVis, a visualization tool based on a linear-time algorithm that computes the distance between two genomes and visualizes the number and type of genomic operations necessary to transform one genome set into another.
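
As an illustration of the kind of distance computation mentioned for the first task, the sketch below assumes that "DCJ" in DCJVis refers to the double-cut-and-join model and covers only a simplified case: two genomes with the same genes, all on circular chromosomes. In that case the distance is the number of genes minus the number of cycles in the adjacency graph; linear chromosomes require an additional term for odd paths, which is omitted here. The function names and the signed-integer encoding are hypothetical and are not taken from DCJVis.

```python
def extremities(gene):
    """Return (left, right) extremities of a signed gene as it is read.

    A positive gene is read tail -> head, a negative gene head -> tail.
    """
    g = abs(gene)
    tail, head = (g, "t"), (g, "h")
    return (tail, head) if gene > 0 else (head, tail)


def adjacencies(genome):
    """Map every gene extremity to its neighbour (circular chromosomes only)."""
    partner = {}
    for chrom in genome:
        for i, gene in enumerate(chrom):
            nxt = chrom[(i + 1) % len(chrom)]  # wrap around: circular
            right = extremities(gene)[1]
            left = extremities(nxt)[0]
            partner[right] = left
            partner[left] = right
    return partner


def dcj_distance_circular(genome_a, genome_b):
    """DCJ distance for circular genomes with equal gene content: N - C."""
    adj_a = adjacencies(genome_a)
    adj_b = adjacencies(genome_b)
    if adj_a.keys() != adj_b.keys():
        raise ValueError("genomes must contain the same genes")

    seen = set()
    cycles = 0
    for start in adj_a:
        if start in seen:
            continue
        cycles += 1
        x = start
        # Alternate between an adjacency of A and an adjacency of B
        # until the walk returns to its starting extremity.
        while x not in seen:
            seen.add(x)
            y = adj_a[x]
            seen.add(y)
            x = adj_b[y]

    n_genes = len(adj_a) // 2  # two extremities per gene
    return n_genes - cycles


# A single inverted gene needs one operation; identical genomes need none.
print(dcj_distance_circular([[1, -2, 3]], [[1, 2, 3]]))
print(dcj_distance_circular([[1, 2, 3]], [[1, 2, 3]]))
```

Each extremity is visited once, so the walk runs in time linear in the number of genes, which is consistent with the linear-time claim in the abstract.
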
The second task focused on developing a software application and an efficient algorithmic workflow for analyzing and comparing the raw sequence reads of two samples without the need for a reference genome. Most sequence analysis pipelines start by aligning reads to a known reference. However, this approach is not ideal, because reference genomes are not available for all organisms and alignment inaccuracies can lead to biased results. I developed NoRef, a reference-free sequence analysis tool based on k-length substring (k-mer) analysis. I also proposed an efficient k-mer sorting algorithm that reduces execution time threefold compared to traditional sorting methods. Finally, the NoRef workflow outputs its results in the raw sequence read format, based on user-selected filters, so that they can be used directly for downstream analysis.
The third task focused on viral proteomic data analysis and addresses the following questions:

  1. How many viral genes originate as "stolen host" (human) genes?
  2. Which viruses most often steal genes from a host (human) and are specific to certain locations within the host?
  3. Can we understand the function of the host (human) gene through a viral perspective?

To address these questions, I took a computational approach, starting with string sequence comparisons and localization prediction using machine learning models, to create a comprehensive community data resource that will enable researchers to gain insights into viruses that affect human immunity and diseases.
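
The reference-free comparison described for the second task can be illustrated with a minimal k-mer sketch. It counts k-mers in two read sets and keeps the reads from the first sample that contain at least one k-mer absent from the second. This is an assumption about the general technique only: the function names and the `min_count` filter are hypothetical, reverse complements are ignored, and neither NoRef's actual workflow nor its k-mer sorting algorithm is reproduced here.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-length substring across a list of read sequences."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def reads_with_sample_specific_kmers(reads_a, reads_b, k, min_count=1):
    """Return reads from sample A holding a k-mer never seen in sample B.

    min_count is a hypothetical filter: a k-mer must occur at least this
    many times in sample A before it is treated as sample-specific.
    """
    counts_a = count_kmers(reads_a, k)
    counts_b = count_kmers(reads_b, k)
    specific = {
        kmer
        for kmer, n in counts_a.items()
        if n >= min_count and kmer not in counts_b
    }
    return [
        read
        for read in reads_a
        if any(read[i:i + k] in specific for i in range(len(read) - k + 1))
    ]


# Toy reads: only the second read of sample A differs from sample B.
sample_a = ["ACGTAC", "TTTTGG"]
sample_b = ["ACGTAC"]
print(reads_with_sample_specific_kmers(sample_a, sample_b, k=4))
```

Returning whole reads rather than bare k-mers mirrors the abstract's point that results stay in the raw read format for downstream analysis.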

Keywords

Bioinformatics, DNA--Analysis, Application software, Viral proteins

Rights

©2019 Sruthi Chappidi. All Rights Reserved.
