Publications – Kuang Lab

2007

Melvin, Iain; Ie, Eugene; Kuang, Rui; Weston, Jason; Noble, William Stafford; Leslie, Christina

SVM-Fold: a tool for discriminative multi-class protein fold and superfamily recognition Journal Article

In: BMC bioinformatics, vol. 8, no. 4, 2007.

Tags: Protein Remote Homology Detection, String Kernels

@article{melvin2007svm,

title = {SVM-Fold: a tool for discriminative multi-class protein fold and superfamily recognition},

author = {Iain Melvin and Eugene Ie and Rui Kuang and Jason Weston and William Stafford Noble and Christina Leslie},

url = {http://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-8-S4-S2},

doi = {10.1186/1471-2105-8-S4-S2},

year  = {2007},

date = {2007-05-22},

journal = {BMC bioinformatics},

volume = {8},

number = {4},

publisher = {BioMed Central},

abstract = {Background 

Predicting a protein's structural class from its amino acid sequence is a fundamental problem in computational biology. Much recent work has focused on developing new representations for protein sequences, called string kernels, for use with support vector machine (SVM) classifiers. However, while some of these approaches exhibit state-of-the-art performance at the binary protein classification problem, i.e. discriminating between a particular protein class and all other classes, few of these studies have addressed the real problem of multi-class superfamily or fold recognition. Moreover, there are only limited software tools and systems for SVM-based protein classification available to the bioinformatics community. 

Results 

We present a new multi-class SVM-based protein fold and superfamily recognition system and web server called SVM-Fold, which can be found at http://svm-fold.c2b2.columbia.edu. Our system uses an efficient implementation of a state-of-the-art string kernel for sequence profiles, called the profile kernel, where the underlying feature representation is a histogram of inexact matching k-mer frequencies. We also employ a novel machine learning approach to solve the difficult multi-class problem of classifying a sequence of amino acids into one of many known protein structural classes. Binary one-vs-the-rest SVM classifiers that are trained to recognize individual structural classes yield prediction scores that are not comparable, so that standard "one-vs-all" classification fails to perform well. Moreover, SVMs for classes at different levels of the protein structural hierarchy may make useful predictions, but one-vs-all does not try to combine these multiple predictions. To deal with these problems, our method learns relative weights between one-vs-the-rest classifiers and encodes information about the protein structural hierarchy for multi-class prediction. In large-scale benchmark results based on the SCOP database, our code weighting approach significantly improves on the standard one-vs-all method for both the superfamily and fold prediction in the remote homology setting and on the fold recognition problem. Moreover, our code weight learning algorithm strongly outperforms nearest-neighbor methods based on PSI-BLAST in terms of prediction accuracy on every structure classification problem we consider. 

Conclusion 

By combining state-of-the-art SVM kernel methods with a novel multi-class algorithm, the SVM-Fold system delivers efficient and accurate protein fold and superfamily recognition.},

keywords = {Protein Remote Homology Detection, String Kernels},

pubstate = {published},

tppubtype = {article}

}
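The abstract above notes that raw one-vs-the-rest SVM scores are not directly comparable and that SVM-Fold learns relative weights between the classifiers. The following is a minimal sketch of that idea only, not the SVM-Fold implementation: the function names, the perceptron-style update, the learning rate, and the toy scores are all assumptions made for illustration, and the protein structural hierarchy encoding is left out.

```python
def predict(scores, weights):
    """Return the index of the class with the largest weighted score."""
    weighted = [w * s for w, s in zip(weights, scores)]
    return max(range(len(weighted)), key=weighted.__getitem__)


def learn_weights(examples, n_classes, epochs=10, lr=0.1):
    """Learn one weight per one-vs-the-rest classifier (hypothetical update rule).

    examples: list of (scores, true_class) pairs, where scores[j] is the
    output of binary classifier j on that example.
    """
    weights = [1.0] * n_classes  # all ones reproduces plain one-vs-all
    for _ in range(epochs):
        for scores, true_class in examples:
            guess = predict(scores, weights)
            if guess != true_class:
                # Move the weighted score of the true class up and the
                # weighted score of the wrongly chosen class down.
                weights[true_class] += lr * scores[true_class]
                weights[guess] -= lr * scores[guess]
    return weights


# Toy data: classifier 0 produces inflated scores, so unweighted
# one-vs-all misclassifies the first example.
examples = [([2.0, 0.5], 1), ([1.5, -1.0], 0)]
weights = learn_weights(examples, n_classes=2)
predictions = [predict(scores, weights) for scores, _ in examples]
```

With all weights equal to 1, `predict` reduces to the standard one-vs-all rule; the learned weights shrink the contribution of the classifier whose scores run systematically high.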


2005

Kuang, Rui; Ie, Eugene; Wang, Ke; Wang, Kai; Siddiqi, Mahira; Freund, Yoav; Leslie, Christina

Profile-based string kernels for remote homology detection and motif extraction Journal Article

In: Journal of bioinformatics and computational biology, vol. 3, no. 03, 2005.

Tags: Protein Remote Homology Detection, String Kernels

@article{kuang2005profile,

title = {Profile-based string kernels for remote homology detection and motif extraction},

author = {Rui Kuang and Eugene Ie and Ke Wang and Kai Wang and Mahira Siddiqi and Yoav Freund and Christina Leslie},

url = {http://compbio.cs.umn.edu/paper/jbcb-profile-kernel.pdf},

doi = {10.1142/S021972000500120X},

year  = {2005},

date = {2005-10-02},

journal = {Journal of bioinformatics and computational biology},

volume = {3},

number = {03},

publisher = {World Scientific},

abstract = {We introduce novel profile-based string kernels for use with support vector machines (SVMs) for the problems of protein classification and remote homology detection. These kernels use probabilistic profiles, such as those produced by the PSI-BLAST algorithm, to define position-dependent mutation neighborhoods along protein sequences for inexact matching of k-length subsequences (“k-mers”) in the data. By use of an efficient data structure, the kernels are fast to compute once the profiles have been obtained. For example, the time needed to run PSI-BLAST in order to build the profiles is significantly longer than both the kernel computation time and the SVM training time. We present remote homology detection experiments based on the SCOP database where we show that profile-based string kernels used with SVM classifiers strongly outperform all recently presented supervised SVM methods. We further examine how to incorporate predicted secondary structure information into the profile kernel to obtain a small but significant performance improvement. We also show how we can use the learned SVM classifier to extract “discriminative sequence motifs”—short regions of the original profile that contribute almost all the weight of the SVM classification score—and show that these discriminative motifs correspond to meaningful structural features in the protein data. The use of PSI-BLAST profiles can be seen as a semi-supervised learning technique, since PSI-BLAST leverages unlabeled data from a large sequence database to build more informative profiles. Recently presented “cluster kernels” give general semi-supervised methods for improving SVM protein classification performance. We show that our profile kernel results also outperform cluster kernels while providing much better scalability to large datasets.},

keywords = {Protein Remote Homology Detection, String Kernels},

pubstate = {published},

tppubtype = {article}

}
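The abstract above describes position-dependent mutation neighborhoods used for inexact matching of k-mers. A brute-force sketch of that idea follows. It assumes a profile is given as a list of per-position dictionaries mapping each residue to its emission probability, and that a k-mer belongs to a window's neighborhood when its negative log-likelihood under that window falls below a threshold (here called `sigma`). The function names, the two-letter alphabet, and the threshold value are illustrative, and the enumeration over all k-mers stands in for the efficient data structure mentioned in the abstract.

```python
import math
from collections import Counter
from itertools import product


def profile_features(profile, k, sigma):
    """Count, for each k-mer, how many length-k profile windows
    contain it in their mutation neighborhood."""
    alphabet = sorted(profile[0])
    counts = Counter()
    for start in range(len(profile) - k + 1):
        window = profile[start:start + k]
        for kmer in product(alphabet, repeat=k):
            # Negative log-likelihood of the k-mer under the window's
            # columns; a zero probability gives an infinite cost.
            cost = sum(
                -math.log(col[a]) if col[a] > 0 else math.inf
                for col, a in zip(window, kmer)
            )
            if cost < sigma:
                counts["".join(kmer)] += 1
    return counts


def profile_kernel(p, q, k, sigma):
    """Inner product of the two k-mer count vectors."""
    fp = profile_features(p, k, sigma)
    fq = profile_features(q, k, sigma)
    return sum(count * fq[kmer] for kmer, count in fp.items())


# Toy profile over a two-letter alphabet (real profiles use 20 amino acids).
toy = [
    {"A": 0.9, "B": 0.1},
    {"A": 0.5, "B": 0.5},
    {"A": 0.1, "B": 0.9},
]
features = profile_features(toy, k=2, sigma=1.0)
self_similarity = profile_kernel(toy, toy, k=2, sigma=1.0)
```

Enumerating every k-mer costs time exponential in k, so this form is only practical for tiny examples; it is meant to make the feature definition concrete rather than to be fast.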

2004

Leslie, Christina; Kuang, Rui

Fast string kernels using inexact matching for protein sequences Journal Article

In: Journal of Machine Learning Research, vol. 5, no. Nov, 2004.

Tags: Protein Remote Homology Detection, String Kernels

Kuang, Rui; Ie, Eugene; Wang, Ke; Wang, Kai; Siddiqi, Mahira; Freund, Yoav; Leslie, Christina

Profile-based string kernels for remote homology detection and motif extraction Proceedings Article

In: CSB 2004, IEEE, 2004, ISBN: 0-7695-2194-0.

Tags: Protein Remote Homology Detection, String Kernels

@inproceedings{kuang2005profileb,

title = {Profile-based string kernels for remote homology detection and motif extraction},

author = {Rui Kuang and Eugene Ie and Ke Wang and Kai Wang and Mahira Siddiqi and Yoav Freund and Christina Leslie},

url = {http://compbio.cs.umn.edu/paper/profile-kernel.pdf},

doi = {10.1109/CSB.2004.1332428},

isbn = {0-7695-2194-0},

year  = {2004},

date = {2004-08-19},

booktitle = {CSB 2004},

publisher = {IEEE},

abstract = {We introduce novel profile-based string kernels for use with support vector machines (SVMs) for the problems of protein classification and remote homology detection. These kernels use probabilistic profiles, such as those produced by the PSI-BLAST algorithm, to define position-dependent mutation neighborhoods along protein sequences for inexact matching of k-length subsequences ("k-mers") in the data. By use of an efficient data structure, the kernels are fast to compute once the profiles have been obtained. For example, the time needed to run PSI-BLAST in order to build the profiles is significantly longer than both the kernel computation time and the SVM training time. We present remote homology detection experiments based on the SCOP database where we show that profile-based string kernels used with SVM classifiers strongly outperform all recently presented supervised SVM methods. We also show how we can use the learned SVM classifier to extract "discriminative sequence motifs" - short regions of the original profile that contribute almost all the weight of the SVM classification score - and show that these discriminative motifs correspond to meaningful structural features in the protein data. The use of PSI-BLAST profiles can be seen as a semi-supervised learning technique, since PSI-BLAST leverages unlabeled data from a large sequence database to build more informative profiles. Recently presented "cluster kernels" give general semi-supervised methods for improving SVM protein classification performance. We show that our profile kernel results are comparable to cluster kernels while providing much better scalability to large datasets.},

keywords = {Protein Remote Homology Detection, String Kernels},

pubstate = {published},

tppubtype = {inproceedings}

}

Leslie, Christina; Kuang, Rui; Eskin, Eleazar

Inexact matching string kernels for protein classification Book

MIT Press, Cambridge, MA, 2004, ISBN: 9780262256926.

Tags: Protein Remote Homology Detection, String Kernels

2003

Leslie, Christina; Kuang, Rui

Fast kernels for inexact string matching Proceedings Article

In: 16th Annual Conference on Computational Learning Theory and 7th Kernel Workshop (COLT/Kernel), Springer, 2003, ISBN: 978-3-540-45167-9.

Tags: Protein Remote Homology Detection, String Kernels