Research Projects
-
Mining clinical and genetic markers of human disease
Identification of clinical and genetic markers of disease can provide crucial information for both disease treatment and etiology. This complex task involves associating high-dimensional patterns such as large-scale gene expressions and single nucleotide polymorphisms (SNPs) with disease-related phenotypes using very few samples. We introduce a new graph-based semi-supervised feature classification algorithm to identify discriminative patterns by learning on bipartite graphs built from clinical variables, gene expressions and SNPs. Instead of performing feature selection or data-mining association analysis, our algorithm directly classifies the feature nodes in a bipartite graph as positive, negative or neutral with network propagation, which captures the interactions between both samples and features (clinical and genetic variables) by exploring the global structure of the graph. Although optimized for classifying the features, our algorithm can also simultaneously classify the test samples for disease prognosis/diagnosis. We apply our algorithm to studying the Rosetta breast cancer dataset and chronic fatigue syndrome on a CAMDA contest dataset. Our algorithm identifies interesting clinical and genetic markers that are consistent with previous studies in the literature, and achieves better overall classification performance than support vector machines and Bayesian networks.
-
String kernels for protein structural
classification
Solved protein structures are usually organized in a hierarchical scheme, such
as SCOP and CATH, according to structure and evolutionary relationships.
Classification of a new protein based on the hierarchy is crucial for inferring its
function or its structure by comparative modeling. To detect subtle sequence similarities
between remote homologies more effectively than previous methods, we
propose several inexact matching string kernels and profile-based
string kernels in use with SVM for this classification problem. The
inexact matching kernels -- restricted gappy kernels, substitution kernels, and
wildcard kernels -- compute along protein sequences for inexact matching of
k-length subsequences ("k-mers"), while profile-based kernels
make use of probabilistic profiles to define mutation neighborhoods of
k-mers based on k -length windows of profiles. We also
use the learned SVM classifier to extract discriminative sequence motifs, which
correspond to meaningful structural features in the protein data.
-
Protein ranking by network
diffusion
Protein database
search or protein ranking is about retrieving a ranked list of proteins that are
homologous to the given query protein in a protein database. Sequence similarity
discovered by ranking algorithms can also help infer functional properties of
the query and target sequences without requiring any structural information.
Therefore, protein ranking is a more flexible and convenient way of studying new
proteins than structural classification. We propose a new network diffusion
algorithm, called MotifProp, based on protein-motif network, in
which edges connect proteins with the k -mer based motif features that they contain. We
show the new algorithm can detect more subtle similarities than
pairwise comparison methods, such as PSI-BLAST, that is used to initialize the
ranking. Despite the complex structure of the protein-motif network, MotifProp
can be easily interpreted using those top-ranked motifs and motif-rich regions
induced by the propagation, both of which are helpful for discovering
conserved structural components in remote homologies.
-
Protein local structure
prediction
3D protein structures can be represented by a series
of dihedral torsion angle pairs along the polypeptide backbone.
Accurate prediction of protein backbone torsion angles will substantially
improve the modeling of local structures of protein sequence
segments, especially the loop conformations that do not form regular structures
as in alpha-helices or beta-strands. Several kernel methods for structured output
learning, which can capture the dependency in output space, are used to
predict structural labels of each amino acid residue of a protein. The
structural labels are either conformational states or protein blocks, both of
which are reduced representations of backbone torsion angles. Our approaches
make use of the dependency between the structural labels in a local structure
neighborhood along with the input information of amino acid residues to
predict the output labels more accurately.
- Protein domain
segmentation
Multi-domain proteins are
often
problematic in structural and functional analysis due to the
presence of multiple folds. Therefore, accurate domain segmentation is necessary for
annotating entire genome sequences. Except for Pfam-based approaches, which are not
designed for handling remote homologies, none of the previous works has attempted
to tackle the two complementary tasks, which are the prediction of domain boundaries
and the classification of domain structure, in a unified method. We propose to use
the positional SVM classification scores from multiple folds to predict domain
labels and boundaries simultaneously. Expecting that most of the positive
positional classification scores will concentrate on the regions of positive
domains that are homologous to the corresponding fold, we are currently
exploring various learning techniques to predict domain boundaries as well as
domain labels.
|