Our lab is particularly interested in large-scale genomic and biomedical data analysis, using machine learning and network-based methods to address research problems in the health and biological sciences. Our research falls into two broad areas: 1) phenome-genome association analysis and 2) cancer outcome prediction and biomarker identification. In the first area, we performed large-scale association analysis between all genes and the complete collection of phenotypes (the phenome) using network-based machine learning methods. In the second area, we developed graph-based learning models and kernel methods that capture the structure in single-cell RNA sequencing data, high-dimensional gene (isoform) expression data and DNA copy number variations for improved cancer outcome prediction and robust biomarker identification. In addition, we developed kernel methods for protein classification. Our current projects center on the following topics:
- Spatial and single-cell transcriptomics: Spatial transcriptomics technologies have enabled spatially-resolved RNA profiling of single cells with cell identities and localizations for understanding cells’ organizations and functions. Our group develops new machine learning methods for mining RNA profiles collected from single cells and their spatial locations.
Zhang, Huanan; Lee, Catherine A. A.; Li, Zhuliu; Garbe, John R.; Eide, Cindy R.; Petegrosso, Raphael; Kuang, Rui; Tolar, Jakub
A Multitask Clustering Approach for Single-cell RNA-Seq Analysis in Recessive Dystrophic Epidermolysis Bullosa Journal Article
In: PLOS Computational Biology, vol. 14, no. 4, 2018.
@article{multitask_zhang,
title = {A Multitask Clustering Approach for Single-cell RNA-Seq Analysis in Recessive Dystrophic Epidermolysis Bullosa},
author = {Huanan Zhang and Catherine A. A. Lee and Zhuliu Li and John R. Garbe and Cindy R. Eide and Raphael Petegrosso and Rui Kuang and Jakub Tolar},
url = {http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006053},
doi = {10.1371/journal.pcbi.1006053},
year = {2018},
date = {2018-04-05},
journal = {PLOS Computational Biology},
volume = {14},
number = {4},
abstract = {Single-cell RNA sequencing (scRNA-seq) has been widely applied to discover new cell types by detecting sub-populations in a heterogeneous group of cells.
Since scRNA-seq experiments have lower read coverage/tag counts and introduce more technical biases compared to bulk RNA-seq experiments, the limited number of sampled cells combined with the experimental biases and other dataset specific variations presents a challenge to cross-dataset analysis and discovery of relevant biological variations across multiple cell populations. In this paper, we introduce a method of variance-driven multitask clustering of single-cell RNA-seq data (scVDMC) that utilizes multiple single-cell populations from biological replicates or different samples. scVDMC clusters single cells in multiple scRNA-seq experiments of similar cell types and markers but varying expression patterns such that the scRNA-seq data are better integrated than typical pooled analyses which only increase the sample size. By controlling the variance among the cell clusters within each dataset and across all the datasets, scVDMC detects cell sub-populations in each individual experiment with shared cell-type markers but varying cluster centers among all the experiments. Applied to two real scRNA-seq datasets with several replicates and one large-scale Drop-seq dataset on three patient samples, scVDMC more accurately detected cell populations and known cell markers than pooled clustering and other recently proposed scRNA-seq clustering methods. In the case study applied to in-house Recessive Dystrophic Epidermolysis Bullosa (RDEB) scRNA-seq data, scVDMC revealed several new cell types and unknown markers validated by flow cytometry.
},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
- Cancer genomics: Development of graph-based learning algorithms, sequence alignment algorithms and association rule-mining algorithms for building predictive models and mining biomarkers of cancer phenotypes from microarray or sequencing transcriptome data, DNA copy number variations, SNPs and protein-protein interactions.
Chien, Jeremy; Kuang, Rui; Landen, Charles; Shridhar, Viji
Platinum-sensitive recurrence in ovarian cancer: the role of tumor microenvironment Journal Article
In: Frontiers in oncology, vol. 3, pp. 251, 2013.
@article{chien2013platinumb,
title = {Platinum-sensitive recurrence in ovarian cancer: the role of tumor microenvironment},
author = {Jeremy Chien and Rui Kuang and Charles Landen and Viji Shridhar},
url = {http://journal.frontiersin.org/article/10.3389/fonc.2013.00251/full},
doi = {10.3389/fonc.2013.00251},
year = {2013},
date = {2013-09-23},
journal = {Frontiers in oncology},
volume = {3},
pages = {251},
publisher = {Frontiers},
abstract = {Despite several advances in the understanding of ovarian cancer pathobiology, in terms of driver genetic alterations in high-grade serous cancer, histologic heterogeneity of epithelial ovarian cancer, cell-of-origin for ovarian cancer, the survival rate from ovarian cancer is disappointingly low when compared to that of breast or prostate cancer. One of the factors contributing to the poor survival rate from ovarian cancer is the development of chemotherapy resistance following several rounds of chemotherapy. Although unicellular drug resistance mechanisms contribute to chemotherapy resistance, tumor microenvironment and the extracellular matrix (ECM), in particular, is emerging as a significant determinant of a tumor’s response to chemotherapy. In this review, we discuss the potential role of the tumor microenvironment in ovarian cancer recurrence and resistance to chemotherapy. Finally, we propose an alternative view of platinum-sensitive recurrence to describe a potential role of the ECM in the process.},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
Hwang, TaeHyun; Atluri, Gowtham; Kuang, Rui; Kumar, Vipin; Starr, Timothy; Silverstein, Kevin AT; Haverty, Peter M; Zhang, Zemin; Liu, Jinfeng
Large-scale integrative network-based analysis identifies common pathways disrupted by copy number alterations across cancers Journal Article
In: BMC genomics, vol. 14, no. 1, pp. 440, 2013.
@article{hwang2013large,
title = {Large-scale integrative network-based analysis identifies common pathways disrupted by copy number alterations across cancers},
author = {TaeHyun Hwang and Gowtham Atluri and Rui Kuang and Vipin Kumar and Timothy Starr and Kevin AT Silverstein and Peter M Haverty and Zemin Zhang and Jinfeng Liu},
url = {http://bmcgenomics.biomedcentral.com/articles/10.1186/1471-2164-14-440},
doi = {10.1186/1471-2164-14-440},
year = {2013},
date = {2013-07-03},
journal = {BMC genomics},
volume = {14},
number = {1},
pages = {440},
publisher = {BioMed Central Ltd},
abstract = {Many large-scale studies analyzed high-throughput genomic data to identify altered pathways essential to the development and progression of specific types of cancer. However, no previous study has been extended to provide a comprehensive analysis of pathways disrupted by copy number alterations across different human cancers. Towards this goal, we propose a network-based method to integrate copy number alteration data with human protein-protein interaction networks and pathway databases to identify pathways that are commonly disrupted in many different types of cancer.},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
Zhang, Wei; Ota, Takayo; Shridhar, Viji; Chien, Jeremy; Wu, Baolin; Kuang, Rui
Network-based survival analysis reveals subnetwork signatures for predicting outcomes of ovarian cancer treatment Journal Article
In: PLoS Comput Biol, vol. 9, no. 3, pp. e1002975, 2013.
@article{zhang2013network,
title = {Network-based survival analysis reveals subnetwork signatures for predicting outcomes of ovarian cancer treatment},
author = {Wei Zhang and Takayo Ota and Viji Shridhar and Jeremy Chien and Baolin Wu and Rui Kuang},
url = {http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1002975},
doi = {10.1371/journal.pcbi.1002975},
year = {2013},
date = {2013-03-21},
journal = {PLoS Comput Biol},
volume = {9},
number = {3},
pages = {e1002975},
publisher = {Public Library of Science},
abstract = {Cox regression is commonly used to predict the outcome by the time to an event of interest and in addition, identify relevant features for survival analysis in cancer genomics. Due to the high-dimensionality of high-throughput genomic data, existing Cox models trained on any particular dataset usually generalize poorly to other independent datasets. In this paper, we propose a network-based Cox regression model called Net-Cox and applied Net-Cox for a large-scale survival analysis across multiple ovarian cancer datasets. Net-Cox integrates gene network information into the Cox's proportional hazard model to explore the co-expression or functional relation among high-dimensional gene expression features in the gene network. Net-Cox was applied to analyze three independent gene expression datasets including the TCGA ovarian cancer dataset and two other public ovarian cancer datasets. Net-Cox with the network information from gene co-expression or functional relations identified highly consistent signature genes across the three datasets, and because of the better generalization across the datasets, Net-Cox also consistently improved the accuracy of survival prediction over the Cox models regularized by L1-norm or L2-norm. This study focused on analyzing the death and recurrence outcomes in the treatment of ovarian carcinoma to identify signature genes that can more reliably predict the events. The signature genes comprise dense protein-protein interaction subnetworks, enriched by extracellular matrix receptors and modulators or by nuclear signaling components downstream of extracellular signal-regulated kinases. 
In the laboratory validation of the signature genes, a tumor array experiment by protein staining on an independent patient cohort from Mayo Clinic showed that the protein expression of the signature gene FBN1 is a biomarker significantly associated with the early recurrence after 12 months of the treatment in the ovarian cancer patients who are initially sensitive to chemotherapy. Net-Cox toolbox is available at http://localhost/~raphaelpetegrosso/wpcb/Net-Cox/.},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
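The Net-Cox abstract above describes adding gene network information to Cox's proportional hazards model. As a rough illustration only, the sketch below evaluates one plausible form of such an objective: the negative Cox partial log-likelihood plus a graph-Laplacian smoothness penalty on the coefficients. The function names, the unnormalized Laplacian, the weight `lam`, and the toy data are all assumptions made for illustration; none of them is taken from the Net-Cox toolbox.

```python
import numpy as np

def neg_partial_loglik(beta, X, time, event):
    """Negative Cox partial log-likelihood (no special handling of tied times)."""
    eta = X @ beta                        # linear risk score per patient
    nll = 0.0
    for i in np.flatnonzero(event):       # only observed events contribute
        at_risk = time >= time[i]         # patients still at risk at time[i]
        nll -= eta[i] - np.log(np.exp(eta[at_risk]).sum())
    return nll

def laplacian_penalty(beta, W):
    """beta^T (D - W) beta: small when genes linked in W have similar coefficients."""
    L = np.diag(W.sum(axis=1)) - W        # unnormalized graph Laplacian (assumed form)
    return float(beta @ L @ beta)

def network_cox_objective(beta, X, time, event, W, lam=1.0):
    """Hypothetical network-regularized objective: Cox loss + lam * network penalty."""
    return neg_partial_loglik(beta, X, time, event) + lam * laplacian_penalty(beta, W)

# Toy data (made up): 4 patients, 3 genes, genes 0 and 1 linked in the network.
X = np.array([[1.0, 0.5, 0.0],
              [0.2, 0.1, 1.0],
              [0.9, 0.8, 0.3],
              [0.1, 0.0, 0.7]])
time = np.array([5.0, 8.0, 3.0, 10.0])
event = np.array([1, 1, 1, 1])
W = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 0.0],
              [0.0, 0.0, 0.0]])

beta_smooth = np.array([0.5, 0.5, -0.2])   # linked genes share a coefficient
beta_rough = np.array([0.5, -0.5, -0.2])   # linked genes disagree

print(network_cox_objective(beta_smooth, X, time, event, W))
print(network_cox_objective(beta_rough, X, time, event, W))
```

In this form the penalty is zero for `beta_smooth` and positive for `beta_rough`, so minimizing the objective favors coefficient vectors that vary smoothly over the gene network.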
Zhang, Wei; Hwang, Baryun; Wu, Baolin; Kuang, Rui
Network propagation models for gene selection Proceedings Article
In: 2010 IEEE International Workshop on Genomic Signal Processing and Statistics (GENSIPS), IEEE, 2010, ISBN: 978-1-61284-791-7.
@inproceedings{zhang2010network,
title = {Network propagation models for gene selection},
author = {Wei Zhang and Baryun Hwang and Baolin Wu and Rui Kuang},
url = {http://compbio.cs.umn.edu/wp-content/uploads/2017/10/NP.pdf},
doi = {10.1109/GENSIPS.2010.5719689},
isbn = {978-1-61284-791-7},
year = {2010},
date = {2010-10-12},
booktitle = {2010 IEEE International Workshop on Genomic Signal Processing and Statistics (GENSIPS)},
publisher = {IEEE},
abstract = {In this paper, we explore several network propagation methods for gene selection from microarray gene expression datasets. The network propagation methods capture gene co-expression and differential expression with unified machine learning frameworks. Large scale experiments on five breast cancer datasets validated that the network propagation methods are capable of selecting genes that are more biologically interpretable and more consistent across multiple datasets, compared with the existing approaches.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
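The abstract above describes propagating differential-expression evidence over a gene co-expression network. A minimal sketch of one standard propagation scheme is given below, assuming the common closed form f = (1 - alpha) (I - alpha S)^{-1} y with S the symmetrically normalized network. This generic formulation, the function name `propagate_scores`, and the toy network are assumptions for illustration and may differ from the specific models studied in the paper.

```python
import numpy as np

def propagate_scores(W, y, alpha=0.5):
    """Closed-form propagation f = (1 - alpha) * (I - alpha * S)^{-1} y,
    where S = D^{-1/2} W D^{-1/2} is the symmetrically normalized network."""
    d = W.sum(axis=1)
    # Guard against isolated genes (degree 0) when inverting the degrees.
    d_inv_sqrt = np.where(d > 0, 1.0 / np.sqrt(np.where(d > 0, d, 1.0)), 0.0)
    S = W * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    n = W.shape[0]
    return (1 - alpha) * np.linalg.solve(np.eye(n) - alpha * S, y)

# Toy co-expression network (made up): chain 0-1-2, gene 3 isolated.
W = np.array([[0.0, 1.0, 0.0, 0.0],
              [1.0, 0.0, 1.0, 0.0],
              [0.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 0.0, 0.0]])
# Initial differential-expression scores: only gene 0 is differentially expressed.
y = np.array([1.0, 0.0, 0.0, 0.0])

f = propagate_scores(W, y, alpha=0.5)
ranking = np.argsort(-f)   # genes ordered by propagated score, highest first
print(f, ranking)
```

Genes close to a differentially expressed gene in the network receive part of its score, so the selected genes tend to form connected groups rather than isolated hits.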
Gupta, Rohit; Agrawal, Smita; Rao, Navneet; Tian, Ze; Kuang, Rui; Kumar, Vipin
Integrative Biomarker Discovery for Breast Cancer Metastasis from Gene Expression and Protein Interaction Data Using Error-tolerant Pattern Mining Proceedings Article
In: Citeseer, 2009.
@inproceedings{mining2009integrative,
title = {Integrative Biomarker Discovery for Breast Cancer Metastasis from Gene Expression and Protein Interaction Data Using Error-tolerant Pattern Mining},
author = {Rohit Gupta and Smita Agrawal and Navneet Rao and Ze Tian and Rui Kuang and Vipin Kumar},
url = {http://compbio.cs.umn.edu/wp-content/uploads/2017/10/BICOB2010.pdf},
year = {2009},
date = {2009-11-29},
publisher = {Citeseer},
abstract = {Biomarker discovery for complex diseases is a challenging problem. Most of the existing approaches identify individual genes as disease markers, thereby missing the interactions among genes. Moreover, often only a single biological data source is used to discover biomarkers. These factors account for the discovery of inconsistent biomarkers. In this paper, we propose a novel error-tolerant pattern mining approach for integrated analysis of gene expression and protein interaction data. This integrated approach incorporates constraints from protein interaction network and efficiently discovers patterns (groups of genes) in a bottom-up fashion from the gene-expression data. We call these patterns active sub-network biomarkers. To illustrate the efficacy of our proposed approach, we used four breast cancer gene expression data sets and a human protein interaction network and showed that active sub-network biomarkers are more biologically plausible and genes discovered are more reproducible across studies. Finally, through pathway analysis, we also showed a substantial enrichment for known cancer genes and hence were able to generate relevant hypotheses for understanding the molecular mechanisms of breast cancer metastasis.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
- Phenome-genome association analysis: Development of graph-based learning algorithms for analyzing disease and gene associations in a network context.
Xie, MaoQiang; Xu, YingJie; Zhang, YaoGong; Hwang, TaeHyun; Kuang, Rui
Network-based Phenome-Genome Association Prediction by Bi-Random Walk Journal Article
In: PloS one, vol. 10, no. 5, pp. e0125138, 2015.
@article{xie2015network,
title = {Network-based Phenome-Genome Association Prediction by Bi-Random Walk},
author = {MaoQiang Xie and YingJie Xu and YaoGong Zhang and TaeHyun Hwang and Rui Kuang},
url = {http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0125138},
doi = {10.1371/journal.pone.0125138},
year = {2015},
date = {2015-05-01},
journal = {PloS one},
volume = {10},
number = {5},
pages = {e0125138},
publisher = {Public Library of Science},
abstract = {The availability of ontologies and systematic documentations of phenotypes and their genetic associations has enabled large-scale network-based global analyses of the association between the complete collection of phenotypes (phenome) and genes. To provide a fundamental understanding of how the network information is relevant to phenotype-gene associations, we analyze the circular bigraphs (CBGs) in OMIM human disease phenotype-gene association network and MGI mouse phenotype-gene association network, and introduce a bi-random walk (BiRW) algorithm to capture the CBG patterns in the networks for unveiling human and mouse phenome-genome association. BiRW performs separate random walk simultaneously on gene interaction network and phenotype similarity network to explore gene paths and phenotype paths in CBGs of different sizes to summarize their associations as predictions.},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
Hwang, TaeHyun; Atluri, Gowtham; Xie, MaoQiang; Dey, Sanjoy; Hong, Changjin; Kumar, Vipin; Kuang, Rui
Co-clustering phenome–genome for phenotype classification and disease gene discovery Journal Article
In: Nucleic acids research, vol. 40, no. 19, pp. e146–e146, 2012.
@article{hwang2012co,
title = {Co-clustering phenome--genome for phenotype classification and disease gene discovery},
author = {TaeHyun Hwang and Gowtham Atluri and MaoQiang Xie and Sanjoy Dey and Changjin Hong and Vipin Kumar and Rui Kuang},
url = {http://nar.oxfordjournals.org/content/40/19/e146.short},
doi = {10.1093/nar/gks615},
year = {2012},
date = {2012-06-26},
journal = {Nucleic acids research},
volume = {40},
number = {19},
pages = {e146--e146},
publisher = {Oxford Univ Press},
abstract = {Understanding the categorization of human diseases is critical for reliably identifying disease causal genes. Recently, genome-wide studies of abnormal chromosomal locations related to diseases have mapped >2000 phenotype–gene relations, which provide valuable information for classifying diseases and identifying candidate genes as drug targets. In this article, a regularized non-negative matrix tri-factorization (R-NMTF) algorithm is introduced to co-cluster phenotypes and genes, and simultaneously detect associations between the detected phenotype clusters and gene clusters. The R-NMTF algorithm factorizes the phenotype–gene association matrix under the prior knowledge from phenotype similarity network and protein–protein interaction network, supervised by the label information from known disease classes and biological pathways. In the experiments on disease phenotype–gene associations in OMIM and KEGG disease pathways, R-NMTF significantly improved the classification of disease phenotypes and disease pathway genes compared with support vector machines and Label Propagation in cross-validation on the annotated phenotypes and genes. The newly predicted phenotypes in each disease class are highly consistent with human phenotype ontology annotations. The roles of the new member genes in the disease pathways are examined and validated in the protein–protein interaction subnetworks. Extensive literature review also confirmed many new members of the disease classes and pathways as well as the predicted associations between disease phenotype classes and pathways.},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
Xie, Maoqiang; Hwang, Taehyun; Kuang, Rui
Prioritizing disease genes by bi-random walk Proceedings Article
In: Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 292–303, Springer 2012.
@inproceedings{xie2012prioritizing,
title = {Prioritizing disease genes by bi-random walk},
author = {Maoqiang Xie and Taehyun Hwang and Rui Kuang},
url = {http://compbio.cs.umn.edu/wp-content/uploads/2017/10/PAKDD2012.pdf},
doi = {10.1007/978-3-642-30220-6_25},
year = {2012},
date = {2012-05-29},
booktitle = {Pacific-Asia Conference on Knowledge Discovery and Data Mining},
pages = {292--303},
organization = {Springer},
abstract = {Random walk methods have been successfully applied to prioritizing disease causal genes. In this paper, we propose a bi-random walk algorithm (BiRW) based on a regularization framework for graph matching to globally prioritize disease genes for all phenotypes simultaneously. While previous methods perform random walk either on the protein-protein interaction network or the complete phenome-genome heterogeneous network, BiRW performs random walk on the Kronecker product graph between the protein-protein interaction network and the phenotype similarity network. Three variations of BiRW that perform balanced or unbalanced bi-directional random walks are analyzed and compared with other random walk methods. Experiments on analyzing the disease phenotype-gene associations in Online Mendelian Inheritance in Man (OMIM) demonstrate that BiRW effectively improved disease gene prioritization over existing methods by ranking more known associations in the top 100 out of nearly 10,000 candidate genes.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
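The abstract above describes a random walk on the Kronecker product graph of the protein-protein interaction network and the phenotype similarity network. A minimal sketch of a balanced walk is shown below, assuming the update R ← alpha * P R G + (1 - alpha) * A, where P and G are symmetrically normalized phenotype and gene networks and A holds known phenotype-gene associations. The exact update rule, the normalization, the function names, and the toy matrices are assumptions for illustration rather than the published implementation.

```python
import numpy as np

def sym_normalize(W):
    """Symmetric normalization D^{-1/2} W D^{-1/2}; isolated nodes stay zero."""
    d = W.sum(axis=1)
    d_inv_sqrt = np.where(d > 0, 1.0 / np.sqrt(np.where(d > 0, d, 1.0)), 0.0)
    return W * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def bi_random_walk(P, G, A, alpha=0.8, n_iter=200):
    """Iterate R <- alpha * P_n @ R @ G_n + (1 - alpha) * A (assumed update form).

    Multiplying by P_n on the left and G_n on the right takes one step on both
    networks at once, which corresponds to one step on their Kronecker product graph.
    """
    P_n, G_n = sym_normalize(P), sym_normalize(G)
    R = A.astype(float)
    for _ in range(n_iter):
        R = alpha * P_n @ R @ G_n + (1 - alpha) * A
    return R

# Toy inputs (made up): 2 phenotypes, 3 genes.
P = np.array([[0.0, 1.0],
              [1.0, 0.0]])            # phenotype similarity network
G = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])       # gene interaction network (chain 0-1-2)
A = np.array([[1, 0, 0],
              [0, 0, 0]])             # known association: phenotype 0 with gene 0

R = bi_random_walk(P, G, A)
print(np.round(R, 3))                 # rows: phenotypes, columns: candidate genes
```

Each row of `R` can be sorted to rank candidate genes for one phenotype. Because the walk runs over all phenotypes at once, a phenotype with no known genes (row 1 here) still receives scores through its similar phenotypes.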
Hwang, TaeHyun; Zhang, Wei; Xie, Maoqiang; Liu, Jinfeng; Kuang, Rui
Inferring disease and gene set associations with rank coherence in networks Journal Article
In: Bioinformatics, vol. 27, no. 19, pp. 2692–2699, 2011.
@article{hwang2011inferring,
title = {Inferring disease and gene set associations with rank coherence in networks},
author = {TaeHyun Hwang and Wei Zhang and Maoqiang Xie and Jinfeng Liu and Rui Kuang},
url = {http://bioinformatics.oxfordjournals.org/content/27/19/2692},
doi = {10.1093/bioinformatics/btr463},
year = {2011},
date = {2011-08-02},
journal = {Bioinformatics},
volume = {27},
number = {19},
pages = {2692--2699},
publisher = {Oxford Univ Press},
abstract = {Motivation: To validate the candidate disease genes identified from high-throughput genomic studies, a necessary step is to elucidate the associations between the set of candidate genes and disease phenotypes. The conventional gene set enrichment analysis often fails to reveal associations between disease phenotypes and the gene sets with a short list of poorly annotated genes, because the existing annotations of disease-causative genes are incomplete. This article introduces a network-based computational approach called rcNet to discover the associations between gene sets and disease phenotypes. A learning framework is proposed to maximize the coherence between the predicted phenotype–gene set relations and the known disease phenotype-gene associations. An efficient algorithm coupling ridge regression with label propagation and two variants are designed to find the optimal solution to the objective functions of the learning framework.
Results: We evaluated the rcNet algorithms with leave-one-out cross-validation on Online Mendelian Inheritance in Man (OMIM) data and an independent test set of recently discovered disease–gene associations. In the experiments, the rcNet algorithms achieved best overall rankings compared with the baselines. To further validate the reproducibility of the performance, we applied the algorithms to identify the target diseases of novel candidate disease genes obtained from recent studies of Genome-Wide Association Study (GWAS), DNA copy number variation analysis and gene expression profiling. The algorithms ranked the target disease of the candidate genes at the top of the rank list in many cases across all the three case studies.},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
Hwang, TaeHyun; Kuang, Rui
A Heterogeneous Label Propagation Algorithm for Disease Gene Discovery Proceedings Article
In: Society for Industrial and Applied Mathematics. Proceedings of the SIAM International Conference on Data Mining, pp. 583, Society for Industrial and Applied Mathematics 2010.
@inproceedings{hwang2010heterogeneous,
title = {A Heterogeneous Label Propagation Algorithm for Disease Gene Discovery},
author = {TaeHyun Hwang and Rui Kuang},
url = {http://compbio.cs.umn.edu/wp-content/uploads/2017/10/SDM2010.pdf},
doi = {10.1137/1.9781611972801.51},
year = {2010},
date = {2010-04-29},
booktitle = {Society for Industrial and Applied Mathematics. Proceedings of the SIAM International Conference on Data Mining},
pages = {583},
organization = {Society for Industrial and Applied Mathematics},
abstract = {Label propagation is an effective and efficient technique to utilize local and global features in a network for semi-supervised learning. In the literature, one challenge is how to propagate information in heterogeneous networks comprising several subnetworks, each of which has its own cluster structures that need to be explored independently. In this paper, we introduce an intuitive algorithm MINProp (Mutual Interaction-based Network Propagation) and a simple regularization framework for propagating information between subnetworks in a heterogeneous network. MINProp sequentially performs label propagation on each individual subnetwork with the current label information derived from the other subnetworks and repeats this step until convergence to the global optimal solution to the convex objective function of the regularization framework. The independent label propagation on each subnetwork explores the cluster structure in the subnetwork. The label information from the other subnetworks is used to capture mutual interactions (bicluster structures) between the vertices in each pair of the subnetworks. The MINProp algorithm is applied to disease gene discovery from a heterogeneous network of disease phenotypes and genes. In the experiments, MINProp significantly outperformed the original label propagation algorithm on a single network and the state-of-the-art methods for discovering disease genes. The results also suggest that MINProp is more effective in utilizing the modular structures in a heterogeneous network. Finally, MINProp discovered new disease-gene associations that were only recently reported.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
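For readers unfamiliar with the building block named in the abstract above, the following is a minimal sketch of ordinary single-network label propagation. It is not MINProp itself: the update rule, the symmetric normalization, the function name `label_propagation`, the parameter `alpha`, and the toy graph are all assumptions chosen for illustration.

```python
import numpy as np

def label_propagation(W, y, alpha=0.5, tol=1e-9, max_iter=1000):
    """Iterate f <- alpha * S f + (1 - alpha) * y until the scores stop changing.

    W is a symmetric non-negative adjacency matrix and y holds the initial
    labels (for example, 1 for known seed vertices and 0 elsewhere).
    S = D^{-1/2} W D^{-1/2} is the symmetrically normalized adjacency matrix.
    """
    W = np.asarray(W, dtype=float)
    y = np.asarray(y, dtype=float)
    degree = W.sum(axis=1)
    # Isolated vertices (degree 0) get a zero scaling factor instead of 1/0.
    safe_degree = np.where(degree > 0, degree, 1.0)
    inv_sqrt = np.where(degree > 0, 1.0 / np.sqrt(safe_degree), 0.0)
    S = inv_sqrt[:, None] * W * inv_sqrt[None, :]

    f = y.copy()
    for _ in range(max_iter):
        f_next = alpha * (S @ f) + (1.0 - alpha) * y
        if np.max(np.abs(f_next - f)) < tol:
            return f_next
        f = f_next
    return f

# Toy example (hypothetical data): a path graph 0 - 1 - 2 - 3 seeded at vertex 0.
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]])
y = np.array([1.0, 0.0, 0.0, 0.0])
scores = label_propagation(W, y)
```

Under these assumptions the iteration converges to the closed form (1 - alpha) (I - alpha S)^{-1} y, so vertices closer to the seed in the graph receive higher scores.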
- Protein remote homology detection: Development of string kernel algorithms and label propagation algorithms to infer remote protein homologs and study their structures and functions.
Weston, Jason; Kuang, Rui; Leslie, Christina; Noble, William Stafford
Protein ranking by semi-supervised network propagation Journal Article
In: BMC bioinformatics, vol. 7, no. 1, pp. 9, 2006.
@article{weston2006protein,
title = {Protein ranking by semi-supervised network propagation},
author = {Jason Weston and Rui Kuang and Christina Leslie and William Stafford Noble},
url = {http://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-7-S1-S10},
doi = {10.1186/1471-2105-7-S1-S10},
year = {2006},
date = {2006-03-20},
journal = {BMC bioinformatics},
volume = {7},
number = {1},
pages = {9},
publisher = {BioMed Central},
abstract = {Background: Biologists regularly search DNA or protein databases for sequences that share an evolutionary or functional relationship with a given query sequence. Traditional search methods, such as BLAST and PSI-BLAST, focus on detecting statistically significant pairwise sequence alignments and often miss more subtle sequence similarity. Recent work in the machine learning community has shown that exploiting the global structure of the network defined by these pairwise similarities can help detect more remote relationships than a purely local measure.
Methods: We review RankProp, a ranking algorithm that exploits the global network structure of similarity relationships among proteins in a database by performing a diffusion operation on a protein similarity network with weighted edges. The original RankProp algorithm is unsupervised. Here, we describe a semi-supervised version of the algorithm that uses labeled examples. Three possible ways of incorporating label information are considered: (i) as a validation set for model selection, (ii) to learn a new network, by choosing which transfer function to use for a given query, and (iii) to estimate edge weights, which measure the probability of inferring structural similarity.
Results: Benchmarked on a human-curated database of protein structures, the original RankProp algorithm provides significant improvement over local network search algorithms such as PSI-BLAST. Furthermore, we show here that labeled data can be used to learn a network without any need for estimating parameters of the transfer function, and that diffusion on this learned network produces better results than the original RankProp algorithm with a fixed network.
Conclusion: In order to gain maximal information from a network, labeled and unlabeled data should be used to extract both local and global structure.},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
Noble, William Stafford; Kuang, Rui; Leslie, Christina; Weston, Jason
Identifying remote protein homologs by network propagation Journal Article
In: FEBS J, vol. 272, no. 20, 2005.
@article{noble2005idetifying,
title = {Identifying remote protein homologs by network propagation},
author = {William Stafford Noble and Rui Kuang and Christina Leslie and Jason Weston},
url = {http://onlinelibrary.wiley.com/doi/10.1111/j.1742-4658.2005.04947.x/abstract},
doi = {10.1111/j.1742-4658.2005.04947.x},
year = {2005},
date = {2005-10-07},
journal = {FEBS J},
volume = {272},
number = {20},
abstract = {Perhaps the most widely used applications of bioinformatics are tools such as psi-blast for searching sequence databases. We describe a recently developed protein database search algorithm called rankprop. rankprop relies upon a precomputed network of pairwise protein similarities. The algorithm performs a diffusion operation from a specified query protein across the protein similarity network. The resulting activation scores, assigned to each database protein, encode information about the global structure of the protein similarity network. This type of algorithm has a rich history in associationist psychology, artificial intelligence and web search. We describe the rankprop algorithm and its relatives, and we provide evidence that the algorithm successfully improves upon the rankings produced by psi-blast.},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
Kuang, Rui; Ie, Eugene; Wang, Ke; Wang, Kai; Siddiqi, Mahira; Freund, Yoav; Leslie, Christina
Profile-based string kernels for remote homology detection and motif extraction Journal Article
In: Journal of bioinformatics and computational biology, vol. 3, no. 03, 2005.
@article{kuang2005profile,
title = {Profile-based string kernels for remote homology detection and motif extraction},
author = {Rui Kuang and Eugene Ie and Ke Wang and Kai Wang and Mahira Siddiqi and Yoav Freund and Christina Leslie},
url = {http://compbio.cs.umn.edu/paper/jbcb-profile-kernel.pdf},
doi = {10.1142/S021972000500120X},
year = {2005},
date = {2005-10-02},
journal = {Journal of bioinformatics and computational biology},
volume = {3},
number = {03},
publisher = {World Scientific},
abstract = {We introduce novel profile-based string kernels for use with support vector machines (SVMs) for the problems of protein classification and remote homology detection. These kernels use probabilistic profiles, such as those produced by the PSI-BLAST algorithm, to define position-dependent mutation neighborhoods along protein sequences for inexact matching of k-length subsequences (“k-mers”) in the data. By use of an efficient data structure, the kernels are fast to compute once the profiles have been obtained. For example, the time needed to run PSI-BLAST in order to build the profiles is significantly longer than both the kernel computation time and the SVM training time. We present remote homology detection experiments based on the SCOP database where we show that profile-based string kernels used with SVM classifiers strongly outperform all recently presented supervised SVM methods. We further examine how to incorporate predicted secondary structure information into the profile kernel to obtain a small but significant performance improvement. We also show how we can use the learned SVM classifier to extract “discriminative sequence motifs”—short regions of the original profile that contribute almost all the weight of the SVM classification score—and show that these discriminative motifs correspond to meaningful structural features in the protein data. The use of PSI-BLAST profiles can be seen as a semi-supervised learning technique, since PSI-BLAST leverages unlabeled data from a large sequence database to build more informative profiles. Recently presented “cluster kernels” give general semi-supervised methods for improving SVM protein classification performance. We show that our profile kernel results also outperform cluster kernels while providing much better scalability to large datasets.},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
Kuang, Rui; Weston, Jason; Noble, William Stafford; Leslie, Christina
Motif-based protein ranking by network propagation Journal Article
In: Bioinformatics, vol. 21, no. 19, 2005.
@article{kuang2005motif,
title = {Motif-based protein ranking by network propagation},
author = {Rui Kuang and Jason Weston and William Stafford Noble and Christina Leslie},
url = {http://bioinformatics.oxfordjournals.org/content/21/19/3711.full},
doi = {10.1093/bioinformatics/bti608},
year = {2005},
date = {2005-08-02},
journal = {Bioinformatics},
volume = {21},
number = {19},
publisher = {Oxford Univ Press},
abstract = {Motivation: Sequence similarity often suggests evolutionary relationships between protein sequences that can be important for inferring similarity of structure or function. The most widely-used pairwise sequence comparison algorithms for homology detection, such as BLAST and PSI-BLAST, often fail to detect less conserved remotely-related targets.
Results: In this paper, we propose a new general graph-based propagation algorithm called MotifProp to detect more subtle similarity relationships than pairwise comparison methods. MotifProp is based on a protein-motif network, in which edges connect proteins and the k-mer based motif features that they contain. We show that our new motif-based propagation algorithm can improve the ranking results over a base algorithm, such as PSI-BLAST, that is used to initialize the ranking. Despite the complex structure of the protein-motif network, MotifProp can be easily interpreted using the top-ranked motifs and motif-rich regions induced by the propagation, both of which are helpful for discovering conserved structural components in remote homologies.},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
Leslie, Christina; Kuang, Rui
Fast string kernels using inexact matching for protein sequences Journal Article
In: Journal of Machine Learning Research, vol. 5, no. Nov, 2004.
@article{leslie2004fast,
title = {Fast string kernels using inexact matching for protein sequences},
author = {Christina Leslie and Rui Kuang},
url = {http://jmlr.csail.mit.edu/papers/volume5/leslie04a/leslie04a.pdf},
year = {2004},
date = {2004-11-01},
journal = {Journal of Machine Learning Research},
volume = {5},
number = {Nov},
abstract = {We describe several families of k-mer based string kernels related to the recently presented mismatch kernel and designed for use with support vector machines (SVMs) for classification of protein sequence data. These new kernels – restricted gappy kernels, substitution kernels, and wildcard kernels – are based on feature spaces indexed by k-length subsequences (“k-mers”) from the string alphabet Σ. However, for all kernels we define here, the kernel value K(x,y) can be computed in O(c_K(|x| + |y|)) time, where the constant c_K depends on the parameters of the kernel but is independent of the size |Σ| of the alphabet. Thus the computation of these kernels is linear in the length of the sequences, like the mismatch kernel, but we improve upon the parameter-dependent constant c_K = k^{m+1}|Σ|^m of the (k,m)-mismatch kernel. We compute the kernels efficiently using a trie data structure and relate our new kernels to the recently described transducer formalism. In protein classification experiments on two benchmark SCOP data sets, we show that our new faster kernels achieve SVM classification performance comparable to the mismatch kernel and the Fisher kernel derived from profile hidden Markov models, and we investigate the dependence of the kernels on parameter choice.},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
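To make the idea of a feature space indexed by k-mers concrete, here is a minimal sketch of the plain k-spectrum kernel, which counts exact k-mer matches only. It is a simpler relative of the kernels described in the abstract above, not the gappy, substitution, or wildcard kernels themselves, and it uses hash-based counting rather than a trie. The function names and the example sequences are hypothetical.

```python
from collections import Counter

def kmer_counts(seq, k):
    """Count every contiguous length-k substring (k-mer) of seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def spectrum_kernel(x, y, k=3):
    """Inner product of the k-mer count vectors of sequences x and y."""
    cx, cy = kmer_counts(x, k), kmer_counts(y, k)
    # Iterate over the smaller dictionary; k-mers absent from the other count as 0.
    if len(cx) > len(cy):
        cx, cy = cy, cx
    return sum(count * cy[kmer] for kmer, count in cx.items())

# Hypothetical toy sequences over the amino-acid alphabet.
value = spectrum_kernel("MKVLMKV", "KVLA", k=3)
```

Counting k-mers takes time linear in the sequence lengths for fixed k, which is the same linear dependence on |x| + |y| that the abstract highlights. Inexact-matching kernels extend this by letting each k-mer also contribute to nearby features.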