Our lab is particularly interested in large-scale genomic and biomedical data analysis with machine learning and network-based methods for research problems in health-related and biological science. The two broad areas of our research are 1) phenome-genome association analysis and 2) cancer outcome prediction and biomarker identification. In the first area, we performed large-scale association analysis between all genes and the complete collection of phenotypes (the phenome) using network-based machine learning methods. In the second area, we developed graph-based learning models and kernel methods to capture the structures in single-cell RNA sequencing data, high-dimensional gene (isoform) expressions and DNA copy number variations for improved cancer outcome prediction and robust biomarker identification. We also developed kernel methods for protein classification. Our current projects center around the following topics:
- Spatial and single-cell transcriptomics: Spatial transcriptomics technologies have enabled spatially-resolved RNA profiling of single cells with cell identities and localizations for understanding cells’ organizations and functions. Our group develops new machine learning methods for mining RNA profiles collected from single cells and their spatial locations.
Song, Tianci; Broadbent, Charlie; Kuang, Rui
GNTD: Reconstructing Spatial Transcriptomes with Graph-guided Neural Tensor Decomposition Informed by Spatial and Functional Relations Journal Article
In: Nature Communications, vol. 14, no. 8276, 2023.
@article{GNTD,
title = {GNTD: Reconstructing Spatial Transcriptomes with Graph-guided Neural Tensor Decomposition Informed by Spatial and Functional Relations},
author = {Tianci Song and Charlie Broadbent and Rui Kuang},
url = {https://www.nature.com/articles/s41467-023-44017-0},
year = {2023},
date = {2023-04-01},
urldate = {2023-04-01},
journal = {Nature Communications},
volume = {14},
number = {8276},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
Atkins, Thomas Karl; Song, Tianci; Kuang, Rui
FIST-nD: A tool for n-dimensional spatial transcriptomics data imputation via graph-regularized tensor completion Technical Report
2022.
@techreport{FIST_nD,
title = {FIST-nD: A tool for n-dimensional spatial transcriptomics data imputation via graph-regularized tensor completion},
author = {Thomas Karl Atkins and Tianci Song and Rui Kuang},
url = {https://www.biorxiv.org/content/10.1101/2022.10.12.511928v1.article-metrics},
doi = {10.1101/2022.10.12.511928},
year = {2022},
date = {2022-10-16},
urldate = {2022-10-16},
keywords = {},
pubstate = {published},
tppubtype = {techreport}
}
Song, Tianci; Markham, Kathleen K.; Li, Zhuliu; Muller, Kristen E.; Greenham, Kathleen; Kuang, Rui
Detecting Spatially Co-expressed Gene Clusters with Functional Coherence by Graph-regularized Convolutional Neural Network Journal Article
In: Bioinformatics, vol. 38, no. 5, pp. 1344–1352, 2022.
@article{spatialGCNNb,
title = {Detecting Spatially Co-expressed Gene Clusters with Functional Coherence by Graph-regularized Convolutional Neural Network},
author = {Tianci Song and Kathleen K. Markham and Zhuliu Li and Kristen E. Muller and Kathleen Greenham and Rui Kuang},
url = {https://academic.oup.com/bioinformatics/article/38/5/1344/6448221},
year = {2022},
date = {2022-03-01},
urldate = {2021-11-30},
journal = {Bioinformatics},
volume = {38},
number = {5},
pages = {1344--1352},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
Li, Zhuliu; Song, Tianci; Yong, Jeongsik; Kuang, Rui
Imputation of spatially-resolved transcriptomes by graph-regularized tensor completion Journal Article
In: PLoS Computational Biology, vol. 17, no. 4, pp. e1008218, 2021.
@article{li2021imputation,
title = {Imputation of spatially-resolved transcriptomes by graph-regularized tensor completion},
author = {Zhuliu Li and Tianci Song and Jeongsik Yong and Rui Kuang},
url = {https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1008218},
year = {2021},
date = {2021-01-01},
journal = {PLoS Computational Biology},
volume = {17},
number = {4},
pages = {e1008218},
publisher = {Public Library of Science San Francisco, CA USA},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
Petegrosso, Raphael; Li, Zhuliu; Kuang, Rui
Machine Learning and Statistical Methods for Clustering Single-cell RNA-sequencing Data Journal Article
In: Briefings in Bioinformatics, 2019.
@article{petegrosso2019scrnaseq,
title = {Machine Learning and Statistical Methods for Clustering Single-cell RNA-sequencing Data},
author = {Raphael Petegrosso and Zhuliu Li and Rui Kuang},
url = {https://doi.org/10.1093/bib/bbz063},
year = {2019},
date = {2019-06-29},
journal = {Briefings in Bioinformatics},
abstract = {Single-cell RNA sequencing (scRNA-seq) technologies have enabled the large-scale whole-transcriptome profiling of each individual single cell in a cell population. A core analysis of the scRNA-seq transcriptome profiles is to cluster the single cells to reveal cell subtypes and infer cell lineages based on the relations among the cells. This article reviews the machine learning and statistical methods for clustering scRNA-seq transcriptomes developed in the past few years. The review focuses on how conventional clustering techniques such as hierarchical clustering, graph-based clustering, mixture models, $k$-means, ensemble learning, neural networks and density-based clustering are modified or customized to tackle the unique challenges in scRNA-seq data analysis, such as the dropout of low-expression genes, low and uneven read coverage of transcripts, highly variable total mRNAs from single cells and ambiguous cell markers in the presence of technical biases and irrelevant confounding biological variations. We review how cell-specific normalization, the imputation of dropouts and dimension reduction methods can be applied with new statistical or optimization strategies to improve the clustering of single cells. We will also introduce those more advanced approaches to cluster scRNA-seq transcriptomes in time series data and multiple cell populations and to detect rare cell types. Several software packages developed to support the cluster analysis of scRNA-seq data are also reviewed and experimentally compared to evaluate their performance and efficiency. Finally, we conclude with useful observations and possible future directions in scRNA-seq data analytics.
AVAILABILITY:
All the source code and data are available at https://github.com/kuanglab/single-cell-review},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
- Cancer genomics: Development of graph-based learning algorithms, sequence alignment algorithms and association rule-mining algorithms for building predictive models and mining biomarkers of cancer phenotypes from microarray or sequencing transcriptome data, DNA copy number variations, SNPs and protein-protein interactions.
Zhang, Wei; Chien, Jeremy; Yong, Jeongsik; Kuang, Rui
Network-based Machine Learning and Graph Theory Algorithms for Precision Oncology Journal Article
In: NPJ Precision Oncology, no. 25, 2017.
@article{networkreview2017,
title = {Network-based Machine Learning and Graph Theory Algorithms for Precision Oncology},
author = {Wei Zhang and Jeremy Chien and Jeongsik Yong and Rui Kuang},
url = {https://www.nature.com/articles/s41698-017-0029-7},
doi = {10.1038/s41698-017-0029-7},
year = {2017},
date = {2017-08-08},
journal = {NPJ Precision Oncology},
number = {25},
abstract = {Network-based analytics plays an increasingly important role in precision oncology. Growing evidence in recent studies suggests that cancer can be better understood through mutated or dysregulated pathways or networks rather than individual mutations and that the efficacy of repositioned drugs can be inferred from disease modules in molecular networks. This article reviews network-based machine learning and graph theory algorithms for integrative analysis of personal genomic data and biomedical knowledge bases to identify tumor-specific molecular mechanisms, candidate targets and repositioned drugs for personalized treatment. The review focuses on the algorithmic design and mathematical formulation of these methods to facilitate applications and implementations of network-based analysis in the practice of precision oncology. We review the methods applied in three scenarios to integrate genomic data and network models in different analysis pipelines, and we examine three categories of network-based approaches for repositioning drugs in drug-disease-gene networks. In addition, we perform a comprehensive subnetwork/pathway analysis of mutations in 31 cancer genome projects in the Cancer Genome Atlas (TCGA) and present a detailed case study on ovarian cancer. Finally, we discuss interesting observations, potential pitfalls and future directions in network-based precision oncology.},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
Zhang, Wei; Chang, Jae-Woong; Lin, Lilong; Minn, Kay; Wu, Baolin; Chien, Jeremy; Yong, Jeongsik; Zheng, Hui; Kuang, Rui
Network-based Isoform Quantification with RNA-Seq Data for Cancer Transcriptome Analysis Journal Article
In: PLoS Computational Biology, vol. 11, pp. e1004465, 2015.
@article{Net-RSTQ,
title = {Network-based Isoform Quantification with RNA-Seq Data for Cancer Transcriptome Analysis},
author = {Wei Zhang and Jae-Woong Chang and Lilong Lin and Kay Minn and Baolin Wu and Jeremy Chien and Jeongsik Yong and Hui Zheng and Rui Kuang},
url = {http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1004465},
doi = {10.1371/journal.pcbi.1004465},
year = {2015},
date = {2015-12-23},
journal = {PLoS Computational Biology},
volume = {11},
pages = {e1004465},
abstract = {New sequencing technologies for transcriptome-wide profiling of RNAs have greatly promoted the interest in isoform-based functional characterizations of a cellular system. Elucidation of gene expressions at the isoform resolution could lead to new molecular mechanisms such as gene-regulations and alternative splicings, and potentially better molecular signals for phenotype predictions. However, it could be overly optimistic to derive the proportion of the isoforms of a gene solely based on short read alignments. Inherently, systematic sampling biases from RNA library preparation and ambiguity of read origins in overlapping isoforms pose a problem in reliability. The work in this paper examines the possibility of using protein domain-domain interactions as prior knowledge in isoform transcript quantification. We first made the observation that protein domain-domain interactions positively correlate with isoform co-expressions in TCGA data and then designed a probabilistic EM approach to integrate domain-domain interactions with short read alignments for estimation of isoform proportions. Validated by qRT-PCR experiments on three cell lines, simulations and classifications of TCGA patient samples in several cancer types, Net-RSTQ is proven a useful tool for isoform-based analysis in functional genomes and systems biology.},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
Chien, Jeremy; Sicotte, Hugues; Fan, Jian-Bing; Humphray, Sean; Cunningham, Julie M; Kalli, Kimberly R; Oberg, Ann L; Hart, Steven N; Li, Ying; Davila, Jaime I; others
TP53 mutations, tetraploidy and homologous recombination repair defects in early stage high-grade serous ovarian cancer Journal Article
In: Nucleic Acids Research, pp. gkv111, 2015.
@article{chien2015tp53,
title = {TP53 mutations, tetraploidy and homologous recombination repair defects in early stage high-grade serous ovarian cancer},
author = {Jeremy Chien and Hugues Sicotte and Jian-Bing Fan and Sean Humphray and Julie M Cunningham and Kimberly R Kalli and Ann L Oberg and Steven N Hart and Ying Li and Jaime I Davila and others},
url = {http://nar.oxfordjournals.org/content/43/14/6945},
doi = {10.1093/nar/gkv111},
year = {2015},
date = {2015-02-02},
journal = {Nucleic Acids Research},
pages = {gkv111},
publisher = {Oxford Univ Press},
abstract = {To determine early somatic changes in high-grade serous ovarian cancer (HGSOC), we performed whole genome sequencing on a rare collection of 16 low stage HGSOCs. The majority showed extensive structural alterations (one had an ultramutated profile), exhibited high levels of p53 immunoreactivity, and harboured TP53 mutation, deletion or inactivation. BRCA1 and BRCA2 mutations were observed in two tumors, with nine showing evidence of a homologous recombination (HR) defect. Combined analysis with The Cancer Genome Atlas indicated that low and late stage HGSOCs have similar mutation and copy number profiles. We also found evidence that deleterious TP53 mutations are the earliest events, followed by deletions or loss of heterozygosity (LOH) of chromosomes carrying TP53, BRCA1 or BRCA2. Inactivation of HR appears to be an early event, as 62.5% of tumours showed a LOH pattern suggestive of HR defects. Three tumours with the highest ploidy had little genome-wide LOH, yet one of these had a homozygous somatic frame-shift BRCA2 mutation, suggesting that some carcinomas begin as tetraploid then descend into diploidy accompanied by genome-wide LOH. Lastly, we found evidence that structural variants (SV) cluster in HGSOC, but are absent in one ultramutated tumor, providing insights into the pathogenesis of low stage HGSOC.},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
Johnson, Nicholas; Zhang, Huanan; Fang, Gang; Kumar, Vipin; Kuang, Rui
SubPatCNV: approximate subspace pattern mining for mapping copy-number variations Journal Article
In: BMC Bioinformatics, vol. 16, no. 1, pp. 1, 2015, ISSN: 1471-2105.
@article{johnson2015subpatcnv,
title = {SubPatCNV: approximate subspace pattern mining for mapping copy-number variations},
author = {Nicholas Johnson and Huanan Zhang and Gang Fang and Vipin Kumar and Rui Kuang},
url = {https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-014-0426-7},
doi = {10.1186/s12859-014-0426-7},
issn = {1471-2105},
year = {2015},
date = {2015-01-16},
journal = {BMC Bioinformatics},
volume = {16},
number = {1},
pages = {1},
publisher = {BioMed Central},
abstract = {Background
Many DNA copy-number variations (CNVs) are known to lead to phenotypic variations and pathogenesis. While CNVs are often only common in a small number of samples in the studied population or patient cohort, previous work has not focused on customized identification of CNV regions that only exhibit in subsets of samples with advanced data mining techniques to reliably answer questions such as “Which are all the chromosomal fragments showing nearly identical deletions or insertions in more than 30% of the individuals?”.
Results
We introduce a tool for mining CNV subspace patterns, namely SubPatCNV, which is capable of identifying all aberrant CNV regions specific to arbitrary sample subsets larger than a support threshold. By design, SubPatCNV is the implementation of a variation of approximate association pattern mining algorithm under a spatial constraint on the positional CNV probe features. In benchmark test, SubPatCNV was applied to identify population specific germline CNVs from four populations of HapMap samples. In experiments on the TCGA ovarian cancer dataset, SubPatCNV discovered many large aberrant CNV events in patient subgroups, and reported regions enriched with cancer relevant genes. In both HapMap data and TCGA data, it was observed that SubPatCNV employs approximate pattern mining to more effectively identify CNV subspace patterns that are consistent within a subgroup from high-density array data.
Conclusions
SubPatCNV, available through http://sourceforge.net/projects/subpatcnv/, is a unique scalable open-source software tool that provides the flexibility of identifying CNV regions specific to sample subgroups of different sizes from high-density CNV array data.},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
Zhang, Huanan; Tian, Ze; Kuang, Rui
Transfer learning across cancers on DNA copy number variation analysis Proceedings Article
In: 2013 IEEE 13th International Conference on Data Mining, pp. 1283–1288, IEEE, 2013, ISBN: 978-0-7695-5108-1.
@inproceedings{zhang2013transfer,
title = {Transfer learning across cancers on DNA copy number variation analysis},
author = {Huanan Zhang and Ze Tian and Rui Kuang},
url = {http://compbio.cs.umn.edu/wp-content/uploads/2017/10/TLFL-10Page.pdf},
doi = {10.1109/ICDM.2013.58},
isbn = {978-0-7695-5108-1},
year = {2013},
date = {2013-12-07},
booktitle = {2013 IEEE 13th International Conference on Data Mining},
pages = {1283--1288},
publisher = {IEEE},
organization = {IEEE},
abstract = {DNA copy number variations (CNVs) are prevalent in all types of tumors. It is still a challenge to study how CNVs play a role in driving tumorigenic mechanisms that are either universal or specific in different cancer types. To address the problem, we introduce a transfer learning framework to discover common CNVs shared across different tumor types as well as CNVs specific to each tumor type from genome-wide CNV data measured by array CGH and SNP genotyping array. The proposed model, namely Transfer Learning with Fused LASSO (TLFL), detects latent CNV components from multiple CNV datasets of different tumor types to distinguish the CNVs that are common across the datasets and those that are specific in each dataset. Both the common and type-specific CNVs are detected as latent components in matrix factorization coupled with fused LASSO on adjacent CNV probe features. TLFL considers the common latent components underlying the multiple datasets to transfer knowledge across different tumor types. In simulations and experiments on real cancer CNV datasets, TLFL detected better latent components that can be used as features to improve classification of patient samples in each individual dataset compared with the model without the knowledge transfer. In cross-dataset analysis on bladder cancer and cross-domain analysis on breast cancer and ovarian cancer, TLFL also learned latent CNV components that are both predictive of tumor stages and correlate with known cancer genes.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
- Phenome-genome association analysis: Development of graph-based learning algorithms for analyzing disease and gene associations in a network context.
Li, Zhuliu; Petegrosso, Raphael; Smith, Shaden; Sterling, David; Karypis, George; Kuang, Rui
Scalable Label Propagation for Multi-relational Learning on the Tensor Product of Graphs Journal Article
In: IEEE Transactions on Knowledge and Data Engineering, 2021.
@article{li2021scalable,
title = {Scalable Label Propagation for Multi-relational Learning on the Tensor Product of Graphs},
author = {Zhuliu Li and Raphael Petegrosso and Shaden Smith and David Sterling and George Karypis and Rui Kuang},
url = {https://ieeexplore.ieee.org/document/9369895/},
year = {2021},
date = {2021-01-01},
urldate = {2021-01-01},
journal = {IEEE Transactions on Knowledge and Data Engineering},
publisher = {IEEE},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
Petegrosso, Raphael; Song, Tianci; Kuang, Rui
Hierarchical Canonical Correlation Analysis Reveals Phenotype, Genotype, and Geoclimate Associations in Plants Journal Article
In: Plant Phenomics, vol. 2020, no. 1969142, 2020.
@article{Petegrosso2020,
title = {Hierarchical Canonical Correlation Analysis Reveals Phenotype, Genotype, and Geoclimate Associations in Plants},
author = {Raphael Petegrosso and Tianci Song and Rui Kuang},
url = {https://spj.sciencemag.org/plantphenomics/2020/1969142/cta/},
doi = {10.34133/2020/1969142},
year = {2020},
date = {2020-03-31},
journal = {Plant Phenomics},
volume = {2020},
number = {1969142},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
Li, Zhuliu; Zhang, Wei; Huang, R Stephanie; Kuang, Rui
Learning a Low-rank Tensor of Pharmacogenomic Multi-relations from Biomedical Networks Proceedings
IEEE International Conference on Data Mining 2019.
@proceedings{GTCORP2019b,
title = {Learning a Low-rank Tensor of Pharmacogenomic Multi-relations from Biomedical Networks},
author = {Zhuliu Li and Wei Zhang and R Stephanie Huang and Rui Kuang},
url = {http://compbio.cs.umn.edu/08970888.pdf},
year = {2019},
date = {2019-08-31},
organization = {IEEE International Conference on Data Mining},
abstract = {Learning pharmacogenomic multi-relations among diseases, genes and chemicals from content-rich biomedical and biological networks can provide important guidance for drug discovery, drug repositioning and disease treatment. Most of the existing methods focus on imputing missing values in the disease-gene, disease-chemical and gene-chemical pairwise relations from the observed relations instead of being designed for learning high-order disease-gene-chemical multi-relations. To achieve the goal, we propose a general tensor-based optimization framework and a scalable Graph-Regularized Tensor Completion from Observed Pairwise Relations (GT-COPR) algorithm to infer the multi-relations among the entities across multiple networks in a low-rank tensor, based on manifold regularization with the graph Laplacian of a Cartesian, tensor or strong product of the networks, and consistencies between the collapsed tensors and the observed bipartite relations. Our theoretical analyses also prove the convergence and efficiency of GT-COPR. In the experiments, the tensor fiber-wise and slice-wise evaluations demonstrate the accuracy of GT-COPR for predicting the disease-gene-chemical associations across the large-scale protein-protein interactions network, chemical structural similarity network and phenotype-based human disease network; and the validation on Genomics of Drug Sensitivity in Cancer cell line dataset shows a potential clinical application of GT-COPR for learning disease-specific chemical-gene interactions. Statistical enrichment analysis demonstrates that GT-COPR is also capable of producing both topologically and biologically relevant disease, gene and chemical components with high significance.
Source code: https://github.com/kuanglab/GT-COPR},
keywords = {},
pubstate = {published},
tppubtype = {proceedings}
}
Zhang, Wei; Chien, Jeremy; Yong, Jeongsik; Kuang, Rui
Network-based Machine Learning and Graph Theory Algorithms for Precision Oncology Journal Article
In: NPJ Precision Oncology, no. 25, 2017.
@article{networkreview2017,
title = {Network-based Machine Learning and Graph Theory Algorithms for Precision Oncology},
author = {Wei Zhang and Jeremy Chien and Jeongsik Yong and Rui Kuang},
url = {https://www.nature.com/articles/s41698-017-0029-7},
doi = {10.1038/s41698-017-0029-7},
year = {2017},
date = {2017-08-08},
journal = {NPJ Precision Oncology},
number = {25},
abstract = {Network-based analytics plays an increasingly important role in precision oncology. Growing evidence in recent studies suggests that cancer can be better understood through mutated or dysregulated pathways or networks rather than individual mutations and that the efficacy of repositioned drugs can be inferred from disease modules in molecular networks. This article reviews network-based machine learning and graph theory algorithms for integrative analysis of personal genomic data and biomedical knowledge bases to identify tumor-specific molecular mechanisms, candidate targets and repositioned drugs for personalized treatment. The review focuses on the algorithmic design and mathematical formulation of these methods to facilitate applications and implementations of network-based analysis in the practice of precision oncology. We review the methods applied in three scenarios to integrate genomic data and network models in different analysis pipelines, and we examine three categories of network-based approaches for repositioning drugs in drug-disease-gene networks. In addition, we perform a comprehensive subnetwork/pathway analysis of mutations in 31 cancer genome projects in the Cancer Genome Atlas (TCGA) and present a detailed case study on ovarian cancer. Finally, we discuss interesting observations, potential pitfalls and future directions in network-based precision oncology.},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
Petegrosso, Raphael; Park, Sunho; Hwang, Tae Hyun; Kuang, Rui
Transfer Learning across Ontologies for Phenome-Genome Association Prediction Journal Article
In: Bioinformatics, vol. 33, no. 4, pp. 529–536, 2016.
@article{petegrosso2016transfer,
title = {Transfer Learning across Ontologies for Phenome-Genome Association Prediction},
author = {Raphael Petegrosso and Sunho Park and Tae Hyun Hwang and Rui Kuang},
url = {http://bioinformatics.oxfordjournals.org/content/early/2016/10/20/bioinformatics.btw649.abstract},
doi = {10.1093/bioinformatics/btw649},
year = {2016},
date = {2016-11-23},
journal = {Bioinformatics},
volume = {33},
number = {4},
pages = {529--536},
publisher = {Oxford Univ Press},
abstract = {Motivation: To better predict and analyze gene associations with the collection of phenotypes organized in a phenotype ontology, it is crucial to effectively model the hierarchical structure among the phenotypes in the ontology and leverage the sparse known associations with additional training information. In this paper, we first introduce Dual Label Propagation (DLP) to impose consistent associations with the entire phenotype paths in predicting phenotype-gene associations in Human Phenotype Ontology (HPO). DLP is then used as the base model in a transfer learning framework (tlDLP) to incorporate functional annotations in Gene Ontology (GO). By simultaneously reconstructing GO term-gene associations and HPO phenotype-gene associations for all the genes in a protein-protein interaction network, tlDLP benefits from the enriched training associations indirectly through relation with GO terms.
Results: In the experiments to predict the associations between human genes and phenotypes in HPO based on human protein-protein interaction network, both DLP and tlDLP improved the prediction of gene associations with phenotype paths in HPO in cross-validation and the prediction of the most recent associations added after the snapshot of the training data. Moreover, the transfer learning through GO term-gene associations significantly improved association predictions for the phenotypes with no more specific known associations by a large margin. Examples are also shown to demonstrate how phenotype paths in phenotype ontology and transfer learning with gene ontology can improve the predictions.
Availability: Source code is available at http://localhost/~raphaelpetegrosso/wpcb/ontophenome.},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
- Protein remote homology detection: Development of string kernel algorithms and label propagation algorithms to infer protein remote homologs and study their protein structures and functions.
Petegrosso, Raphael; Li, Zhuliu; Srour, Molly A.; Saad, Yousef; Zhang, Wei; Kuang, Rui
Scalable Remote Homology Detection and Fold Recognition in Massive Protein Networks Journal Article
In: PROTEINS: Structure, Function, and Bioinformatics, vol. 87, no. 6, pp. 478–491, 2019.
@article{scalable2019petegrosso,
title = {Scalable Remote Homology Detection and Fold Recognition in Massive Protein Networks},
author = {Raphael Petegrosso and Zhuliu Li and Molly A. Srour and Yousef Saad and Wei Zhang and Rui Kuang},
url = {https://onlinelibrary.wiley.com/doi/abs/10.1002/prot.25669},
year = {2019},
date = {2019-01-31},
journal = {PROTEINS: Structure, Function, and Bioinformatics},
volume = {87},
number = {6},
pages = {478--491},
abstract = {The global connectivities in very large protein similarity networks contain traces of evolution among the proteins for detecting protein remote evolutionary relations or structural similarities. To investigate how well a protein network captures the evolutionary information, a key limitation is the intensive computation of pairwise sequence similarities needed to construct very large protein networks. In this paper, we introduce Label Propagation on Low-rank Kernel Approximation (LP-LOKA) for searching massively large protein networks. LP-LOKA propagates initial protein similarities in a low-rank graph by Nystrom approximation without computing all pairwise similarities. With scalable parallel implementations based on distributed-memory using message-passing interface and Apache-Hadoop/Spark on cloud, LP-LOKA can search protein networks with one million proteins or more. In the experiments on Swiss-Prot/ADDA/CASP data, LP-LOKA significantly improved protein ranking over the widely used HMM-HMM or profile-sequence alignment methods utilizing large protein networks. It was observed that the larger the protein similarity network, the better the performance, especially on relatively small protein superfamilies and folds. The results suggest that computing massively large protein network is necessary to meet the growing need of annotating proteins from newly sequenced species and LP-LOKA is both scalable and accurate for searching massively large protein networks.},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
Min, Martin Renqiang; Kuang, Rui; Bonner, Anthony J; Zhang, Zhaolei
Learning Random-Walk Kernels for Protein Remote Homology Identification and Motif Discovery Proceedings Article
In: SDM, pp. 133–144, SIAM 2009, ISBN: 978-0-89871-682-5.
@inproceedings{min2009learning,
title = {Learning Random-Walk Kernels for Protein Remote Homology Identification and Motif Discovery},
author = {Martin Renqiang Min and Rui Kuang and Anthony J Bonner and Zhaolei Zhang},
url = {http://compbio.cs.umn.edu/wp-content/uploads/2017/10/12E97816119727952E12.pdf},
doi = {10.1137/1.9781611972795.12},
isbn = {978-0-89871-682-5},
year = {2009},
date = {2009-04-30},
booktitle = {SDM},
pages = {133--144},
organization = {SIAM},
abstract = {Random-walk based algorithms are good choices for solving many classification problems with limited labeled data and a large amount of unlabeled data. However, it is difficult to choose the optimal number of random steps, and the results are very sensitive to the parameter chosen. In this paper, we will discuss how to better identify protein remote homology than any other algorithm using a learned random-walk kernel based on a positive linear combination of random-walk kernels with different random steps, which leads to a convex combination of kernels. The resulting kernel has much better prediction performance than the state-of-the-art profile kernel for protein remote homology identification. On the SCOP benchmark dataset, the overall mean ROC50 score on 54 protein families we obtained using the new kernel is above 0.90, which has almost perfect prediction performance on most of the 54 families and has significant improvement over the best published result; moreover, our approach based on learned random-walk kernels can effectively identify meaningful protein sequence motifs that are responsible for discriminating the memberships of protein sequences' remote homology in SCOP.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
Ngo, Thanh; Kuang, Rui
Partial profile alignment kernels for protein classification Proceedings Article
In: 2009 IEEE International Workshop on Genomic Signal Processing and Statistics, pp. 1–4, IEEE 2009, ISBN: 978-1-4244-4761-9.
@inproceedings{ngo2009partial,
title = {Partial profile alignment kernels for protein classification},
author = {Thanh Ngo and Rui Kuang},
url = {http://compbio.cs.umn.edu/wp-content/uploads/2017/10/05174328.pdf},
doi = {10.1109/GENSIPS.2009.5174328},
isbn = {978-1-4244-4761-9},
year = {2009},
date = {2009-01-01},
booktitle = {2009 IEEE International Workshop on Genomic Signal Processing and Statistics},
pages = {1--4},
organization = {IEEE},
abstract = {Remote homology detection and fold recognition are the central problems in protein classification. In real applications, kernel algorithms that are both accurate and efficient are required for classification of large databases. We explore a class of partial profile alignment kernels to be used with support vector machines (SVMs) for remote homology detection and fold recognition. While existing profile-based kernels use the whole profiles to determine the similarity between pairs of proteins, the partial profile alignment kernels are derived from part of the position specific scoring matrices (PSSMs) in the profiles for alignment. Specifically, at each position in the PSSM, only amino acids in the mutation neighborhood of the corresponding amino acid in the original protein sequence are considered for alignment to remove noise and improve computing efficiency. Our experiments on SCOP bench datasets show that the partial profile alignment kernels achieved overall better classification results for both fold recognition and remote homology detection than profile kernels and profile-alignment kernels. In addition, our algorithm using only a fraction of the profiles saves the cost of computing the kernels significantly, compared to the full-profile alignment methods.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
Kuang, Rui; Gu, Jianying; Cai, Hong; Wang, Yufeng
Improved prediction of malaria degradomes by supervised learning with SVM and profile kernel Journal Article
In: Genetica, vol. 136, no. 1, pp. 189–209, 2008.
@article{kuang2009improved,
title = {Improved prediction of malaria degradomes by supervised learning with SVM and profile kernel},
author = {Rui Kuang and Jianying Gu and Hong Cai and Yufeng Wang},
url = {http://link.springer.com/article/10.1007/s10709-008-9336-9},
doi = {10.1007/s10709-008-9336-9},
year = {2008},
date = {2008-12-06},
journal = {Genetica},
volume = {136},
number = {1},
pages = {189--209},
publisher = {Springer},
abstract = {The spread of drug resistance through malaria parasite populations calls for the development of new therapeutic strategies. However, the seemingly promising genomics-driven target identification paradigm is hampered by the weak annotation coverage. To identify potentially important yet uncharacterized proteins, we apply support vector machines using profile kernels, a supervised discriminative machine learning technique for remote homology detection, as a complement to the traditional alignment based algorithms. In this study, we focus on the prediction of proteases, which have long been considered attractive drug targets because of their indispensable roles in parasite development and infection. Our analysis demonstrates that an abundant and complex repertoire is conserved in five Plasmodium parasite species. Several putative proteases may be important components in networks that mediate cellular processes, including hemoglobin digestion, invasion, trafficking, cell cycle fate, and signal transduction. This catalog of proteases provides a short list of targets for functional characterization and rational inhibitor design.},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
Melvin, Iain; Ie, Eugene; Kuang, Rui; Weston, Jason; Noble, William Stafford; Leslie, Christina
SVM-Fold: a tool for discriminative multi-class protein fold and superfamily recognition Journal Article
In: BMC Bioinformatics, vol. 8, no. 4, 2007.
@article{melvin2007svm,
title = {SVM-Fold: a tool for discriminative multi-class protein fold and superfamily recognition},
author = {Iain Melvin and Eugene Ie and Rui Kuang and Jason Weston and William Stafford Noble and Christina Leslie},
url = {http://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-8-S4-S2},
doi = {10.1186/1471-2105-8-S4-S2},
year = {2007},
date = {2007-05-22},
journal = {BMC Bioinformatics},
volume = {8},
number = {4},
publisher = {BioMed Central},
abstract = {Background
Predicting a protein's structural class from its amino acid sequence is a fundamental problem in computational biology. Much recent work has focused on developing new representations for protein sequences, called string kernels, for use with support vector machine (SVM) classifiers. However, while some of these approaches exhibit state-of-the-art performance at the binary protein classification problem, i.e. discriminating between a particular protein class and all other classes, few of these studies have addressed the real problem of multi-class superfamily or fold recognition. Moreover, there are only limited software tools and systems for SVM-based protein classification available to the bioinformatics community.
Results
We present a new multi-class SVM-based protein fold and superfamily recognition system and web server called SVM-Fold, which can be found at http://svm-fold.c2b2.columbia.edu. Our system uses an efficient implementation of a state-of-the-art string kernel for sequence profiles, called the profile kernel, where the underlying feature representation is a histogram of inexact matching k-mer frequencies. We also employ a novel machine learning approach to solve the difficult multi-class problem of classifying a sequence of amino acids into one of many known protein structural classes. Binary one-vs-the-rest SVM classifiers that are trained to recognize individual structural classes yield prediction scores that are not comparable, so that standard "one-vs-all" classification fails to perform well. Moreover, SVMs for classes at different levels of the protein structural hierarchy may make useful predictions, but one-vs-all does not try to combine these multiple predictions. To deal with these problems, our method learns relative weights between one-vs-the-rest classifiers and encodes information about the protein structural hierarchy for multi-class prediction. In large-scale benchmark results based on the SCOP database, our code weighting approach significantly improves on the standard one-vs-all method for both the superfamily and fold prediction in the remote homology setting and on the fold recognition problem. Moreover, our code weight learning algorithm strongly outperforms nearest-neighbor methods based on PSI-BLAST in terms of prediction accuracy on every structure classification problem we consider.
Conclusion
By combining state-of-the-art SVM kernel methods with a novel multi-class algorithm, the SVM-Fold system delivers efficient and accurate protein fold and superfamily recognition.},
keywords = {},
pubstate = {published},
tppubtype = {article}
}