Research Projects

  • Mining clinical and genetic markers of human disease
    Identification of clinical and genetic markers of disease can provide crucial information for both disease treatment and etiology. This complex task involves associating high-dimensional patterns such as large-scale gene expressions and single nucleotide polymorphisms (SNPs) with disease-related phenotypes using very few samples. We introduce a new graph-based semi-supervised feature classification algorithm to identify discriminative patterns by learning on bipartite graphs built from clinical variables, gene expressions and SNPs. Instead of performing feature selection or data-mining association analysis, our algorithm directly classifies the feature nodes in a bipartite graph as positive, negative or neutral with network propagation, which captures the interactions between both samples and features (clinical and genetic variables) by exploring the global structure of the graph. Although optimized for classifying the features, our algorithm can also simultaneously classify the test samples for disease prognosis/diagnosis. We apply our algorithm to studying the Rosetta breast cancer dataset and chronic fatigue syndrome on a CAMDA contest dataset. Our algorithm identifies interesting clinical and genetic markers that are consistent with previous studies in the literature, and achieves better overall classification performance than support vector machines and Bayesian networks.
  • String kernels for protein structural classification
    Solved protein structures are usually organized in a hierarchical scheme, such as SCOP and CATH, according to structure and evolutionary relationships. Classification of a new protein based on the hierarchy is crucial for inferring its function or its structure by comparative modeling. To detect subtle sequence similarities between remote homologies more effectively than previous methods, we propose several inexact matching string kernels and profile-based string kernels in use with SVM for this classification problem. The inexact matching kernels -- restricted gappy kernels, substitution kernels, and wildcard kernels -- compute along protein sequences for inexact matching of k-length subsequences ("k-mers"), while profile-based kernels make use of probabilistic profiles to define mutation neighborhoods of k-mers based on k -length windows of profiles. We also use the learned SVM classifier to extract discriminative sequence motifs, which correspond to meaningful structural features in the protein data.
  • Protein ranking by network diffusion
    Protein database search or protein ranking is about retrieving a ranked list of proteins that are homologous to the given query protein in a protein database. Sequence similarity discovered by ranking algorithms can also help infer functional properties of the query and target sequences without requiring any structural information. Therefore, protein ranking is a more flexible and convenient way of studying new proteins than structural classification. We propose a new network diffusion algorithm, called MotifProp, based on protein-motif network, in which edges connect proteins with the k -mer based motif features that they contain. We show the new algorithm can detect more subtle similarities than pairwise comparison methods, such as PSI-BLAST, that is used to initialize the ranking. Despite the complex structure of the protein-motif network, MotifProp can be easily interpreted using those top-ranked motifs and motif-rich regions induced by the propagation, both of which are helpful for discovering conserved structural components in remote homologies.
  • Protein local structure prediction
    3D protein structures can be represented by a series of dihedral torsion angle pairs along the polypeptide backbone. Accurate prediction of protein backbone torsion angles will substantially improve the modeling of local structures of protein sequence segments, especially the loop conformations that do not form regular structures as in alpha-helices or beta-strands. Several kernel methods for structured output learning, which can capture the dependency in output space, are used to predict structural labels of each amino acid residue of a protein. The structural labels are either conformational states or protein blocks, both of which are reduced representations of backbone torsion angles. Our approaches make use of the dependency between the structural labels in a local structure neighborhood along with the input information of amino acid residues to predict the output labels more accurately.
  • Protein domain segmentation
    Multi-domain proteins are often problematic in structural and functional analysis due to the presence of multiple folds. Therefore, accurate domain segmentation is necessary for annotating entire genome sequences. Except for Pfam-based approaches, which are not designed for handling remote homologies, none of the previous works has attempted to tackle the two complementary tasks, which are the prediction of domain boundaries and the classification of  domain structure, in a unified method. We propose to use the positional SVM classification scores from multiple folds to predict domain labels and boundaries simultaneously. Expecting that most of the positive positional classification scores will concentrate on the regions of positive domains that are homologous to the corresponding fold, we are currently exploring various learning techniques to predict domain boundaries as well as domain labels.