Co-clustering Phenome-genome for Phenotype Classification and Disease Gene Discovery
TaeHyun Hwang1, Gowtham Atluri2, MaoQiang Xie3, Sanjoy Dey2, Changjin Hong4, Vipin Kumar2, and Rui Kuang2
1Biostatistics and Bioinformatics, Masonic Cancer Center, University of Minnesota Twin Cities, USA
2Department of Computer Science and Engineering, University of Minnesota Twin Cities, USA
3College of Software, Nankai University, China
4Computational Biomedicine, Boston University, USA
Abstract
Categorization of human diseases is critical for reliably identifying
disease causal genes. Recently, genome-wide studies of abnormal chromosomal
locations related to diseases have mapped more than 2000 phenotype-gene
relations, which provide valuable information for classifying diseases and
identifying candidate genes as drug targets. In this paper, several regularized
non-negative matrix tri-factorization (R-NMTF) algorithms are introduced to
co-cluster phenotypes and genes, and simultaneously detect associations between
the detected phenotype clusters and gene clusters. The R-NMTF algorithms
factorize the phenotype-gene association matrix using prior knowledge from a
phenotype similarity network and a protein-protein interaction network,
supervised by label information from known disease classes and biological
pathways. In experiments on disease phenotype-gene associations in OMIM,
R-NMTF significantly improved the classification of disease phenotypes compared
with SVMs in cross-validation on the annotated phenotypes. It was further
validated that the newly predicted phenotypes in each disease class are highly
consistent with the Human Phenotype Ontology (HPO) annotations. An extensive
literature review also confirmed many new members of the disease classes and
pathways, as well as the predicted associations between disease phenotype
clusters (classes) and gene clusters (pathways).
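
To make the idea of co-clustering by tri-factorization concrete, the following is a minimal sketch of a graph-regularized non-negative matrix tri-factorization, X ≈ F S Gᵀ, in NumPy. It is not the authors' R-NMTF implementation: the objective, the multiplicative update rules, the function name `graph_nmtf`, and the parameters `alpha` and `beta` are assumptions chosen for illustration, and the supervised label terms mentioned in the abstract are omitted. The toy networks are derived from the association matrix itself purely to keep the example self-contained.

```python
import numpy as np

def graph_nmtf(X, A_p, A_g, k1, k2, alpha=0.1, beta=0.1,
               n_iter=200, seed=0, eps=1e-9):
    """Illustrative graph-regularized NMTF (hypothetical helper).

    Minimizes, by multiplicative updates,
        ||X - F S G^T||_F^2 + alpha * tr(F^T L_p F) + beta * tr(G^T L_g G)
    where L = D - A is the graph Laplacian of each network.
    X   : (n, m) non-negative phenotype-gene association matrix
    A_p : (n, n) symmetric non-negative phenotype similarity network
    A_g : (m, m) symmetric non-negative gene (e.g. PPI) network
    """
    rng = np.random.default_rng(seed)
    n, m = X.shape
    F = rng.random((n, k1)) + eps   # phenotype cluster memberships
    S = rng.random((k1, k2)) + eps  # cluster-to-cluster associations
    G = rng.random((m, k2)) + eps   # gene cluster memberships
    D_p = np.diag(A_p.sum(axis=1))  # degree matrices of the two networks
    D_g = np.diag(A_g.sum(axis=1))

    def objective():
        fit = np.linalg.norm(X - F @ S @ G.T) ** 2
        reg_p = alpha * np.trace(F.T @ (D_p - A_p) @ F)
        reg_g = beta * np.trace(G.T @ (D_g - A_g) @ G)
        return fit + reg_p + reg_g

    history = []
    for _ in range(n_iter):
        # Each factor is rescaled by (negative gradient part) / (positive part),
        # which keeps all entries non-negative.
        F *= (X @ G @ S.T + alpha * A_p @ F) / (
            F @ S @ G.T @ G @ S.T + alpha * D_p @ F + eps)
        G *= (X.T @ F @ S + beta * A_g @ G) / (
            G @ S.T @ F.T @ F @ S + beta * D_g @ G + eps)
        S *= (F.T @ X @ G) / (F.T @ F @ S @ G.T @ G + eps)
        history.append(objective())
    return F, S, G, history

# Toy data: 6 phenotypes x 8 genes with two planted blocks.
X = np.zeros((6, 8))
X[:3, :4] = 1.0
X[3:, 4:] = 1.0
# Toy networks (an illustrative shortcut): link items that share an association.
A_p = (X @ X.T > 0).astype(float)
np.fill_diagonal(A_p, 0.0)
A_g = (X.T @ X > 0).astype(float)
np.fill_diagonal(A_g, 0.0)

F, S, G, history = graph_nmtf(X, A_p, A_g, k1=2, k2=2)
print("phenotype clusters:", F.argmax(axis=1))
print("gene clusters:     ", G.argmax(axis=1))
print("cluster association matrix S:\n", S.round(3))
```

In this sketch, the rows of F and G give soft cluster memberships for phenotypes and genes, and the entries of S score how strongly each phenotype cluster is associated with each gene cluster, which mirrors the three outputs the abstract describes.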
Full Paper [PDF]
Supplementary Information and Source Code