Co-clustering Phenome-genome for Phenotype Classification and Disease Gene Discovery 

TaeHyun Hwang1, Gowtham Atluri2, MaoQiang Xie3, Sanjoy Dey2, Changjin Hong4, Vipin Kumar2, and Rui Kuang2

 

1Biostatistics and Bioinformatics, Masonic Cancer Center, University of Minnesota Twin Cities, USA

2Department of Computer Science and Engineering, University of Minnesota Twin Cities, USA

3College of Software, Nankai University, China

4Computational Biomedicine, Boston University, USA

 

Abstract

Categorization of human diseases is critical for reliably identifying disease-causing genes. Recently, genome-wide studies of abnormal chromosomal locations related to diseases have mapped more than 2000 phenotype-gene relations, which provide valuable information for classifying diseases and identifying candidate genes as drug targets. In this paper, several regularized non-negative matrix tri-factorization (R-NMTF) algorithms are introduced to co-cluster phenotypes and genes and simultaneously detect associations between the resulting phenotype clusters and gene clusters. The R-NMTF algorithms factorize the phenotype-gene association matrix using prior knowledge from a phenotype similarity network and a protein-protein interaction network, supervised by label information from known disease classes and biological pathways. In experiments on disease phenotype-gene associations in OMIM, R-NMTF significantly improved the classification of disease phenotypes compared with SVMs in cross-validation on the annotated phenotypes. The newly predicted phenotypes in each disease class were further shown to be highly consistent with Human Phenotype Ontology (HPO) annotations. An extensive literature review also confirmed many new members of the disease classes and pathways, as well as the predicted associations between disease phenotype clusters (classes) and gene clusters (pathways).
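
To make the tri-factorization idea concrete, the following is a minimal sketch, not the R-NMTF algorithms from the paper. It assumes a squared-error objective, standard multiplicative updates, and an optional graph penalty on the phenotype side only; the function name `nmtf`, the parameters `A_rows` and `lam`, and the toy data are all hypothetical. It factorizes a non-negative association matrix X (phenotypes by genes) as X ≈ F S Gᵀ, where F holds phenotype cluster memberships, G holds gene cluster memberships, and S holds the associations between the two sets of clusters.

```python
import numpy as np

def nmtf(X, k_rows, k_cols, A_rows=None, lam=0.0, n_iter=200, eps=1e-9, seed=0):
    """Illustrative non-negative matrix tri-factorization X ~ F @ S @ G.T.

    A_rows is an optional non-negative row-similarity (adjacency) matrix;
    lam weights a graph smoothness penalty tr(F.T (D - A_rows) F).
    This is a generic sketch, not the paper's R-NMTF update rules.
    """
    rng = np.random.default_rng(seed)
    n, m = X.shape
    # Strictly positive random initialization keeps the updates well defined.
    F = rng.random((n, k_rows)) + 0.1
    S = rng.random((k_rows, k_cols)) + 0.1
    G = rng.random((m, k_cols)) + 0.1
    if A_rows is None:
        A_rows = np.zeros((n, n))
    D_rows = np.diag(A_rows.sum(axis=1))  # degree matrix of the row graph
    errors = []
    for _ in range(n_iter):
        # Update row (phenotype) factors, with the graph term if lam > 0.
        GS = G @ S.T
        F *= (X @ GS + lam * (A_rows @ F)) / (F @ (GS.T @ GS) + lam * (D_rows @ F) + eps)
        # Update column (gene) factors.
        FS = F @ S
        G *= (X.T @ FS) / (G @ (FS.T @ FS) + eps)
        # Update the cluster-to-cluster association matrix.
        S *= (F.T @ X @ G) / (F.T @ F @ S @ G.T @ G + eps)
        # Track the reconstruction error only (the graph penalty is not included).
        errors.append(np.linalg.norm(X - F @ S @ G.T))
    return F, S, G, errors

# Toy association matrix: 6 "phenotypes" x 8 "genes" with two blocks.
X = np.zeros((6, 8))
X[:3, :4] = 1.0
X[3:, 4:] = 1.0

# Toy phenotype similarity network linking rows within the same block.
A = np.zeros((6, 6))
A[:3, :3] = 1.0
A[3:, 3:] = 1.0
np.fill_diagonal(A, 0.0)

F, S, G, errors = nmtf(X, k_rows=2, k_cols=2, A_rows=A, lam=0.1)
row_clusters = F.argmax(axis=1)  # hard phenotype cluster assignment
col_clusters = G.argmax(axis=1)  # hard gene cluster assignment
```

Reading cluster assignments off the largest entry in each row of F and G is one simple convention; large entries of S then indicate which phenotype cluster is associated with which gene cluster. A gene-side network penalty and label supervision, as described in the abstract, would require additional terms that this sketch leaves out.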


Full Paper [PDF]

Supplementary Information and Source Code