Disease Gene Identification by Random Walk on Multigraphs Merging Heterogeneous Genomic and Phenotype Data
BACKGROUND: High throughput experiments resulted in many genomic datasets and hundreds of candidate disease genes. To discover the real disease genes from a set of candidate genes, computational methods have been proposed and worked on various types of genomic data sources. As a single source of genomic data is prone of bias, incompleteness and noise, integration of different genomic data sources is highly demanded to accomplish reliable disease gene identification. RESULTS: In contrast to the commonly adapted data integration approach which integrates separate lists of candidate genes derived from the each single data sources, we merge various genomic networks into a multigraph which is capable of connecting multiple edges between a pair of nodes. This novel approach provides a data platform with strong noise tolerance to prioritize the disease genes. A new idea of random walk is then developed to work on multigraphs using a modified step to calculate the transition matrix. Our method is further enhanced to deal with heterogeneous data types by allowing cross-walk between phenotype and gene networks. Compared on benchmark datasets, our method is shown to be more accurate than the state-of-the-art methods in disease gene identification. We also conducted a case study to identify disease genes for Insulin-Dependent Diabetes Mellitus. Some of the newly identified disease genes are supported by recently published literature. CONCLUSIONS: The proposed RWRM (Random Walk with Restart on Multigraphs) model and CHN (Complex Heterogeneous Network) model are effective in data integration for candidate gene prioritization.