Motivation: Promising results were
recently reported on using network information from the phenotype-similarity
network and the gene-interaction network with graph-based learning to derive new
disease phenotype-gene associations. However, a more fundamental understanding
of how this network information relates to disease phenotype-gene
associations is lacking. In this paper, we analyze the circular bigraphs (CBGs) in OMIM phenotype-gene association
networks and introduce a bi-random walk (BiRW)
algorithm that captures the CBG patterns in the networks to unveil the
associations between the complete collection of disease phenotypes (phenome) and genes. BiRW performs
random walks simultaneously on the gene interaction network and the phenotype
similarity network to explore gene paths and phenotype paths in CBGs of
different sizes.
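As a rough illustration of the idea of walking on both networks at once, the following Python sketch propagates the known associations through the phenotype network (from the left) and the gene network (from the right), then averages the two walks. It is a minimal sketch under stated assumptions, not the released BiRW source code: the symmetric normalization, the averaging of the two walks, the decay factor `alpha`, and the step limits `left_steps` and `right_steps` are all choices made here for illustration.

```python
import numpy as np

def sym_normalize(W):
    """Symmetric normalization D^{-1/2} W D^{-1/2}; isolated nodes stay zero."""
    W = np.asarray(W, dtype=float)
    d = W.sum(axis=1)
    inv_sqrt = np.zeros_like(d)
    nz = d > 0
    inv_sqrt[nz] = 1.0 / np.sqrt(d[nz])
    return W * inv_sqrt[:, None] * inv_sqrt[None, :]

def bi_random_walk(P, G, A, alpha=0.8, left_steps=3, right_steps=3):
    """Illustrative bi-random walk (all parameter names are assumptions).

    P: phenotype similarity matrix (n_p x n_p)
    G: gene interaction matrix (n_g x n_g)
    A: binary phenotype-gene association matrix (n_p x n_g)
    Returns an n_p x n_g matrix of association scores.
    """
    P = sym_normalize(P)
    G = sym_normalize(G)
    A = np.asarray(A, dtype=float)
    total = A.sum()
    A0 = A / total if total > 0 else A  # scale the seed associations
    R = A0.copy()
    for t in range(1, max(left_steps, right_steps) + 1):
        walks = []
        if t <= left_steps:   # one step on the phenotype network
            walks.append(alpha * (P @ R) + (1 - alpha) * A0)
        if t <= right_steps:  # one step on the gene network
            walks.append(alpha * (R @ G) + (1 - alpha) * A0)
        R = sum(walks) / len(walks)  # average the walks taken at this step
    return R
```

Limiting the number of steps on each side bounds how long the phenotype and gene paths can be, which loosely mirrors exploring CBGs of different sizes.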
Results: In the analysis of OMIM
associations, we discovered that on average 81% of the associations are covered
by CBG patterns of path-length up to 3, with variability across the 21 disease classes,
and there is a clear correlation between the CBG coverage and the prediction
performance of the phenotype-gene associations. Some prominent examples are
cancers, nutritional diseases, dermatological diseases, bone diseases,
cardiovascular diseases and respiratory diseases. Experiments on recovering
known associations in cross-validation and predicting new associations in a
holdout set validated that BiRW effectively improved
prediction performance over existing methods by ranking more known associations
in the top 100 out of more than 12,000 candidate genes. The investigation of
the global disease phenome-genome association map
also revealed interesting new predictions and phenotype-gene modules by disease
classes.
Supplementary
figures (19 disease modules; Nutritional and Ear, Nose, Throat diseases are not included because
these two disease classes are too small) [zip],
GO biological
processes [zip].
Each node represents a disease phenotype, and each edge is weighted by the disease similarity
obtained by text mining. There are 5,080 disease phenotypes in the phenotype
network. The first column gives the OMIM number of each disease phenotype. Original
data is from the following reference: [Marc A. van Driel,
Jorn Bruggeman, Gert Vriend, Han G. Brunner, and
Jack A.M. Leunissen. "A text-mining analysis of
the human phenome." (2006), European Journal of
Human Genetics, 14, 535-542. PMID:
16493445]
The phenotype similarity network used in the experiments was transformed by a logistic function (Vanunu et al., 2010). [phenotype_logistic.mat]
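A logistic transformation of similarity scores can be sketched in Python as follows. This page does not state the functional form or its parameters, so the form 1 / (1 + exp(c*x + d)) and the default values of `c` and `d` below are assumptions for illustration; consult Vanunu et al. (2010) for the settings actually used.

```python
import numpy as np

def logistic_transform(S, c=-15.0, d=np.log(9999.0)):
    """Map similarity scores through 1 / (1 + exp(c*x + d)).

    The defaults for c and d are assumed values, not taken from this page.
    With c < 0 the output increases with the input similarity.
    """
    S = np.asarray(S, dtype=float)
    return 1.0 / (1.0 + np.exp(c * S + d))
```

With these assumed defaults, low similarities are pushed toward 0 and high similarities toward 1, which suppresses weak text-mining scores.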
Each node represents a protein, and
each edge represents a protein interaction. There are 8,919 proteins. Note that
we removed self-interactions (i.e., protein A interacting with itself) in our
experiments. Original data is from the following reference: [Wu X, Jiang R,
Zhang MQ, Li S (2008) Network-based global inference of human disease genes.
Molecular Systems Biology, 4:189]
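Removing self-interactions amounts to zeroing the diagonal of the interaction matrix. A minimal Python sketch, assuming the network is held as a dense square adjacency matrix (the function name is hypothetical):

```python
import numpy as np

def remove_self_interactions(adj):
    """Return a copy of a square adjacency matrix with its diagonal set to zero."""
    out = np.array(adj, dtype=float, copy=True)
    np.fill_diagonal(out, 0.0)  # drop edges from a protein to itself
    return out
```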
Each row represents a disease
phenotype, and each column represents a gene. Binary values in the data indicate
disease-gene associations (i.e., if gene 1 is a causative
gene for disease phenotype 1, matrix(1,1) is 1). Original data is from the
following reference: [Wu X, Jiang R, Zhang MQ, Li S (2008) Network-based global
inference of human disease genes. Molecular Systems Biology, 4:189]
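The layout of such a binary matrix can be illustrated with a short Python sketch that builds it from a list of (phenotype, gene) pairs. The identifiers in the usage example are placeholders, not real OMIM or gene identifiers, and the function name is hypothetical. Note that Python indexes from 0, so matrix(1,1) in the description above corresponds to `A[0, 0]` here.

```python
import numpy as np

def build_association_matrix(pairs, phenotype_ids, gene_ids):
    """Build a binary phenotype-by-gene matrix from (phenotype, gene) pairs."""
    p_index = {p: i for i, p in enumerate(phenotype_ids)}
    g_index = {g: j for j, g in enumerate(gene_ids)}
    A = np.zeros((len(phenotype_ids), len(gene_ids)), dtype=np.int8)
    for p, g in pairs:
        A[p_index[p], g_index[g]] = 1  # 1 marks a known association
    return A

# Placeholder identifiers, for illustration only.
phenotypes = ["pheno1", "pheno2"]
genes = ["gene1", "gene2", "gene3"]
A = build_association_matrix([("pheno1", "gene1"), ("pheno2", "gene3")],
                             phenotypes, genes)
```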
The format of this phenotype-gene network is the same as that of the May 2007 version; it was extracted from the OMIM database downloaded in May 2010. Associations that newly emerged in the May 2010 version can be downloaded here and can be used to evaluate the predictions of the BiRW algorithm.
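One way to use a set of newly emerged associations for evaluation is to count how many of them are ranked among the top k candidate genes for their phenotype. The Python sketch below is an assumed evaluation procedure, not the authors' script: the function name, the exclusion of already-known genes from the ranking, and the optimistic handling of tied scores are all choices made here for illustration.

```python
import numpy as np

def count_hits_in_top_k(scores, known, held_out, k=100):
    """Count held-out associations ranked within the top k genes.

    scores:   n_p x n_g matrix of predicted association scores
    known:    n_p x n_g binary matrix of associations used for training
    held_out: n_p x n_g binary matrix of new associations to recover
    """
    scores = np.asarray(scores, dtype=float)
    known = np.asarray(known)
    hits = 0
    for p, g in zip(*np.nonzero(held_out)):
        row = scores[p].copy()
        row[known[p] > 0] = -np.inf          # do not rank training genes
        rank = 1 + int(np.sum(row > row[g])) # ties resolved optimistically
        if rank <= k:
            hits += 1
    return hits
```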
The source code of the BiRW algorithm can be downloaded here.
The source code for 100-fold cross-validation can be downloaded here. (new)
(Last update: May 17, 2013)