Reconstructing Disease Phenome-genome Association by Bi-Random Walk

MaoQiang Xie1, TaeHyun Hwang2 and Rui Kuang3*

1. College of Software, Nankai University, Tianjin, 300071, China
2. Masonic Cancer Center, University of Minnesota Twin Cities, Minneapolis, USA
3. Department of Computer Science and Engineering, University of Minnesota Twin Cities, Minneapolis, USA
*Correspondence: kuang@cs.umn.edu

ABSTRACT

Motivation: Promising results were recently reported in utilizing network information in the phenotype-similarity network and the gene-interaction network with graph-based learning to derive new disease phenotype-gene associations. However, a more fundamental understanding of how the network information is relevant to disease phenotype-gene associations is lacking. In this paper, we analyze the circular bigraphs (CBGs) in OMIM phenotype-gene association networks, and introduce a bi-random walk (BiRW) algorithm to capture the CBG patterns in the networks for unveiling the associations between the complete collection of disease phenotypes (phenome) and genes. BiRW performs random walks simultaneously on the gene interaction network and the phenotype similarity network to explore gene paths and phenotype paths in CBGs of different sizes.

Results: In the analysis of OMIM associations, we discovered that on average 81% of the associations are covered by CBG patterns of path-length up to 3, with variability across the 21 disease classes, and that there is a clear correlation between the CBG coverage and the prediction performance of the phenotype-gene associations. Some prominent examples are cancers, nutritional diseases, dermatological diseases, bone diseases, cardiovascular diseases and respiratory diseases. Experiments on recovering known associations in cross-validation and predicting new associations in a holdout set validated that BiRW effectively improved prediction performance over existing methods by ranking more known associations in the top 100 out of more than 12,000 candidate genes. The investigation of the global disease phenome-genome association map also revealed interesting new predictions and phenotype-gene modules by disease classes.

SUPPLEMENTARY DATA

Supplementary figures (19 disease modules; Nutritional diseases and Ear, Nose, Throat diseases are not included because these two disease classes are too small) [zip],

GO biological processes [zip].

DATASET

1. Disease Phenotype Similarity Network: [Matlab data]

Each node represents a disease phenotype, and each edge is weighted by the disease similarity obtained by text mining. There are 5,080 disease phenotypes in the phenotype network. The first column gives the OMIM number of each disease phenotype. The original data are from the following reference: [Marc A. van Driel, Jorn Bruggeman, Gert Vriend, Han G. Brunner, and Jack A.M. Leunissen. "A text-mining analysis of the human phenome." (2006), European Journal of Human Genetics, 14, 535-542. PMID: 16493445]

The phenotype similarity network used in the experiments was transformed by a logistic function (Vanunu et al., 2010). [phenotype_logistic.mat]
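As a rough illustration of such a transform, the sketch below applies a logistic function of the form L(x) = 1 / (1 + exp(c*x + d)) to a small similarity matrix. The function name, the toy matrix, and the default values of c and d are assumptions made for illustration; the parameters actually used to produce phenotype_logistic.mat are not stated on this page.

```python
import numpy as np

def logistic_transform(similarity, c=-15.0, d=np.log(9999.0)):
    """Apply L(x) = 1 / (1 + exp(c*x + d)) element-wise.

    c and d are illustrative defaults (an assumption), not values
    confirmed by this page.
    """
    similarity = np.asarray(similarity, dtype=float)
    return 1.0 / (1.0 + np.exp(c * similarity + d))

# Toy 2 x 2 phenotype similarity matrix (hypothetical values).
raw = np.array([[1.0, 0.3],
                [0.3, 1.0]])
transformed = logistic_transform(raw)
```

With a negative c, the transform is increasing in the raw similarity, so it suppresses low similarity scores while keeping high ones close to 1.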

2. Protein-Protein Interaction: [Matlab data]

Each node represents a protein, and each edge represents a protein interaction. There are 8,919 proteins. Note that we removed self-interactions (i.e., protein A interacting with itself) in our experiments. The original data are from the following reference: [Wu X, Jiang R, Zhang MQ, Li S (2008) Network-based global inference of human disease genes. Molecular Systems Biology, 4:189]
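Removing self-interactions amounts to zeroing the diagonal of the interaction matrix. A minimal sketch, assuming the network has been loaded as a dense square adjacency matrix (the function name and the toy matrix are hypothetical):

```python
import numpy as np

def remove_self_interactions(adjacency):
    """Return a copy of the adjacency matrix with its diagonal set to zero."""
    cleaned = np.array(adjacency, dtype=float)  # np.array copies, so the input is untouched
    if cleaned.ndim != 2 or cleaned.shape[0] != cleaned.shape[1]:
        raise ValueError("adjacency must be a square matrix")
    np.fill_diagonal(cleaned, 0.0)
    return cleaned

# Toy network of three proteins; protein 0 has a self-interaction.
ppi = np.array([[1, 1, 0],
                [1, 0, 1],
                [0, 1, 0]])
ppi_clean = remove_self_interactions(ppi)
```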

3. Disease Phenotype-Gene Network (in May 2007): [Matlab data]

Each row represents a disease phenotype, and each column represents a gene. Binary values indicate disease-gene associations (i.e., if gene 1 is a causative gene for disease phenotype 1, matrix(1,1) is 1). The original data are from the following reference: [Wu X, Jiang R, Zhang MQ, Li S (2008) Network-based global inference of human disease genes. Molecular Systems Biology, 4:189]

4. Disease Phenotype-Gene Network (in May 2010): [Matlab data]

The format of this phenotype-gene network is the same as that of the May 2007 version; it was extracted from the OMIM database downloaded in May 2010. The associations newly added by May 2010 can be downloaded here and used to evaluate the predictions of the BiRW algorithm.
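For readers who want to derive the newly added associations themselves, the sketch below compares two binary phenotype-by-gene matrices. It assumes both versions share the same row and column ordering, which should be checked against the actual files; the function name and the toy matrices are hypothetical.

```python
import numpy as np

def new_associations(assoc_2007, assoc_2010):
    """Return (phenotype_index, gene_index) pairs present in 2010 but not in 2007.

    Assumes both matrices use identical row and column ordering.
    Indices are 0-based, unlike the 1-based Matlab convention.
    """
    old = np.asarray(assoc_2007).astype(bool)
    new = np.asarray(assoc_2010).astype(bool)
    if old.shape != new.shape:
        raise ValueError("the two association matrices must have the same shape")
    return np.argwhere(new & ~old)

# Toy example: 2 phenotypes x 3 genes.
a2007 = np.array([[1, 0, 0],
                  [0, 0, 0]])
a2010 = np.array([[1, 0, 1],
                  [0, 1, 0]])
added = new_associations(a2007, a2010)
```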

SOURCE CODE

Source code of the BiRW algorithm can be downloaded here.
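The downloadable code is the authoritative implementation. For orientation only, the following is a simplified sketch of the idea of walking on both networks at once. It assumes a symmetric degree normalization of each network and an update of the form R = alpha * P * R * G + (1 - alpha) * A with equal numbers of steps on both sides; the function names, the normalization, and the values of alpha and steps are all assumptions, not taken from this page.

```python
import numpy as np

def normalize(weights):
    """Symmetric normalization D^(-1/2) W D^(-1/2); isolated nodes stay all-zero."""
    weights = np.asarray(weights, dtype=float)
    degree = weights.sum(axis=1)
    inv_sqrt = np.zeros_like(degree)
    nonzero = degree > 0
    inv_sqrt[nonzero] = 1.0 / np.sqrt(degree[nonzero])
    return weights * inv_sqrt[:, None] * inv_sqrt[None, :]

def bi_random_walk(pheno_sim, gene_net, assoc, alpha=0.8, steps=4):
    """Score phenotype-gene pairs by walking on both networks simultaneously.

    pheno_sim: (n_phenotypes x n_phenotypes) similarity matrix
    gene_net:  (n_genes x n_genes) interaction matrix
    assoc:     (n_phenotypes x n_genes) known associations
    alpha and steps are illustrative values (an assumption).
    """
    P = normalize(pheno_sim)
    G = normalize(gene_net)
    A = np.asarray(assoc, dtype=float)
    total = A.sum()
    if total > 0:
        A = A / total  # scale known associations so they sum to 1
    R = A.copy()
    for _ in range(steps):
        # One step on the phenotype side (P @ R) and one on the gene side (@ G),
        # blended with the known associations.
        R = alpha * (P @ R @ G) + (1.0 - alpha) * A
    return R

# Toy data: 2 phenotypes, 3 genes on a path 0 - 1 - 2.
pheno_sim = np.array([[0.0, 1.0],
                      [1.0, 0.0]])
gene_net = np.array([[0.0, 1.0, 0.0],
                     [1.0, 0.0, 1.0],
                     [0.0, 1.0, 0.0]])
assoc = np.array([[1.0, 0.0, 0.0],
                  [0.0, 0.0, 0.0]])
scores = bi_random_walk(pheno_sim, gene_net, assoc)
```

In this toy example phenotype 1 has no known gene, yet it receives a positive score for gene 1 because it is similar to phenotype 0, whose known gene 0 interacts with gene 1.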

Source code of the 100-fold cross-validation can be downloaded here. (new)
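How the downloadable script partitions the data is not described on this page. As a hedged sketch of one common approach, the code below shuffles the known association pairs, splits them into folds, and masks one fold at a time so that a predictor can be asked to recover it. The function names, the random seed, and the toy matrix are hypothetical.

```python
import numpy as np

def split_into_folds(assoc, n_folds=100, seed=0):
    """Shuffle the known (phenotype, gene) pairs and split them into n_folds groups."""
    pairs = np.argwhere(np.asarray(assoc) != 0)
    rng = np.random.default_rng(seed)
    rng.shuffle(pairs)  # shuffles rows (pairs), keeping each pair intact
    return np.array_split(pairs, n_folds)

def mask_fold(assoc, fold):
    """Return a copy of assoc with the held-out pairs in `fold` set to zero."""
    train = np.array(assoc, dtype=float)
    train[fold[:, 0], fold[:, 1]] = 0.0
    return train

# Toy association matrix with 5 known pairs, split into 3 folds.
toy_assoc = np.array([[1, 0, 1, 0],
                      [0, 1, 0, 0],
                      [1, 0, 0, 1]])
folds = split_into_folds(toy_assoc, n_folds=3)
train_matrix = mask_fold(toy_assoc, folds[0])
```

Each fold's masked matrix would then be passed to the prediction method, and the ranks of the held-out genes among all candidate genes recorded.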

 

(Last update: May 17, 2013)