|
|
|
|
[Home] [Research] [People] [Publication] Research Projects
1. Mining clinical and genetic
markers of human disease Identification of clinical and genetic markers of disease
can provide crucial information for both disease treatment and etiology. This
complex task involves associating high-dimensional patterns such as large-scale
gene expressions and single nucleotide polymorphisms (SNPs)
with disease-related phenotypes using very few samples. We introduce a new
graph-based semi-supervised feature classification algorithm to identify
discriminative patterns by learning on bipartite graphs built from clinical
variables, gene expressions and SNPs. Instead of
performing feature selection or data-mining association analysis, our
algorithm directly classifies the feature nodes in a bipartite graph as
positive, negative or neutral with network propagation, which captures the
interactions between both samples and features (clinical and genetic
variables) by exploring the global structure of the graph. Although optimized
for classifying the features, our algorithm can also simultaneously classify
the test samples for disease prognosis/diagnosis. We apply our algorithm to
studying the Rosetta breast cancer dataset and chronic fatigue syndrome on a
CAMDA contest dataset. Our algorithm identifies interesting clinical and
genetic markers that are consistent with previous studies in the literature,
and achieves better overall classification performance than support vector
machines and Bayesian networks. 2. String kernels for protein
structural classification Solved protein structures are usually organized in a
hierarchical scheme, such as SCOP and CATH, according to structure and
evolutionary relationships. Classification of a new protein based on the
hierarchy is crucial for inferring its function or its structure by
comparative modeling. To detect subtle sequence similarities between
remote homologies more effectively than previous methods, we propose several
inexact matching string kernels and profile-based string kernels in use with
SVM for this classification problem. The inexact matching kernels --
restricted gappy kernels, substitution kernels, and
wildcard kernels -- compute along protein sequences for inexact matching of k-length
subsequences ("k-mers"), while profile-based
kernels make use of probabilistic profiles to define mutation neighborhoods
of k-mers based on k -length windows
of profiles. We also use the learned SVM classifier to extract discriminative
sequence motifs, which correspond to meaningful structural features in
the protein data. 3. Protein ranking by network
diffusion
4. Protein local structure
prediction 3D protein structures can be represented by a series of
dihedral torsion angle pairs along the polypeptide backbone. Accurate
prediction of protein backbone torsion angles will substantially improve the
modeling of local structures of protein sequence segments, especially the
loop conformations that do not form regular structures as in alpha-helices or
beta-strands. Several kernel methods for structured output learning, which
can capture the dependency in output space, are used to predict structural
labels of each amino acid residue of a protein. The structural labels are
either conformational states or protein blocks, both of which are reduced
representations of backbone torsion angles. Our approaches make use of the
dependency between the structural labels in a local structure neighborhood
along with the input information of amino acid residues to predict the output
labels more accurately. 5.
Protein domain segmentation Multi-domain proteins are often problematic in structural
and functional analysis due to the presence of multiple folds. Therefore, accurate
domain segmentation is necessary for annotating entire genome sequences.
Except for Pfam-based approaches, which are
not designed for handling remote homologies, none of the previous works has
attempted to tackle the two complementary tasks, which are the
prediction of domain boundaries and the classification of domain structure, in a unified method. We
propose to use the positional SVM classification scores from multiple folds
to predict domain labels and boundaries simultaneously. Expecting that most
of the positive positional classification scores will concentrate on the
regions of positive domains that are homologous to the corresponding fold, we
are currently exploring various learning techniques to predict domain
boundaries as well as domain labels. |
|