ROBUST AND EFFICIENT IDENTIFICATION OF BIOMARKERS
BY CLASSIFYING FEATURES
TAEHYUN HWANG1, HUGUES SICOTTE2,
ZE TIAN1, BAOLIN WU3, DENNIS WIGLE4,
JEAN-PIERRE KOCHER2, VIPIN KUMAR1
AND RUI KUANG1
1Department of Computer Science and Engineering, University of Minnesota Twin Cities
2Bioinformatics Core,
3Division of Biostatistics,
4Division of General Thoracic Surgery,
Abstract
Motivation
A central problem in biomarker discovery from large-scale gene expression or
single nucleotide polymorphism (SNP) data is the computational challenge of
accounting for the dependence among all the features. Methods that ignore this
dependence usually identify disease markers that are not reproducible across
independent datasets.
We introduce a new graph-based semi-supervised feature classification algorithm
that identifies discriminative disease markers by learning on bipartite graphs.
The algorithm uses network propagation to directly classify the feature nodes
of a bipartite graph as positive, negative or neutral, capturing the dependence
among both samples and features (clinical and genetic variables) by exploring
bi-cluster structures in the graph.
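To make the idea of network propagation on a bipartite graph concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes a non-negative sample-by-feature matrix, initial labels on samples only (+1, -1, or 0 for unlabeled), a symmetric degree normalization, and a standard iterative label-propagation update. The function names `propagate_bipartite` and `classify_features`, the parameter `alpha`, and the score threshold are all hypothetical choices made for this example.

```python
import numpy as np


def propagate_bipartite(X, y, alpha=0.5, n_iter=500, tol=1e-10):
    """Propagate sample labels across a sample-feature bipartite graph.

    X : (n_samples, n_features) non-negative edge weights (assumed input).
    y : (n_samples,) initial labels, +1 / -1 / 0 (0 = unlabeled).
    Returns (sample_scores, feature_scores).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)

    # Symmetric normalization S = Dr^(-1/2) X Dc^(-1/2); guards against
    # isolated nodes with zero degree.
    row_deg = X.sum(axis=1)
    col_deg = X.sum(axis=0)
    row_deg[row_deg == 0] = 1.0
    col_deg[col_deg == 0] = 1.0
    S = X / np.sqrt(row_deg)[:, None] / np.sqrt(col_deg)[None, :]

    f_samples = y.copy()
    f_features = np.zeros(X.shape[1])
    for _ in range(n_iter):
        # Features receive scores from neighboring samples (no feature priors
        # are assumed in this sketch).
        new_features = alpha * (S.T @ f_samples)
        # Samples receive scores from features and are pulled back toward
        # their initial labels.
        new_samples = alpha * (S @ new_features) + (1.0 - alpha) * y
        delta = max(np.abs(new_features - f_features).max(),
                    np.abs(new_samples - f_samples).max())
        f_features, f_samples = new_features, new_samples
        if delta < tol:
            break
    return f_samples, f_features


def classify_features(scores, threshold=0.01):
    """Map propagated scores to +1 (positive), -1 (negative), 0 (neutral)."""
    scores = np.asarray(scores)
    return np.where(scores > threshold, 1,
                    np.where(scores < -threshold, -1, 0))


# Toy data with two bi-clusters and one uninformative feature (column 4).
X = np.array([[5, 5, 0, 0, 1],
              [5, 5, 0, 0, 1],
              [0, 0, 5, 5, 1],
              [0, 0, 5, 5, 1]])
y = np.array([1, 0, -1, 0])  # samples 1 and 3 are unlabeled test samples

sample_scores, feature_scores = propagate_bipartite(X, y)
feature_labels = classify_features(feature_scores)
```

In this toy setting the features in each bi-cluster inherit the sign of the labeled sample they co-occur with, the uninformative feature stays neutral, and the unlabeled samples receive scores as a by-product, which mirrors the simultaneous sample classification mentioned in the abstract.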
Our algorithm has two key properties: 1) it finds a globally optimal labeling
that captures the dependence among all the features and thus generates highly
replicable results across independent microarray or other high-throughput
datasets; 2) it can handle hundreds of thousands of features and is thus
particularly useful for biomarker identification from large-scale gene
expression and SNP data. In addition, although designed for classifying
features, the algorithm can also simultaneously classify test samples for
disease prognosis/diagnosis.
Results
We applied the network propagation algorithm to three large-scale breast cancer
datasets. Our algorithm achieved classification performance competitive with
SVMs and other baseline methods, and identified several markers with clinical
or biological relevance to the disease. More importantly, our algorithm also
identified highly reproducible marker genes and enriched functions across the
independent datasets.
Supplementary Information and Source Code