Feature Classification on Graphs

ROBUST AND EFFICIENT IDENTIFICATION OF BIOMARKERS

BY CLASSIFYING FEATURES ON GRAPHS

TAEHYUN HWANG¹, HUGUES SICOTTE², ZE TIAN¹, BAOLIN WU³, DENNIS WIGLE⁴,

JEAN-PIERRE KOCHER², VIPIN KUMAR¹ AND RUI KUANG¹

¹Department of Computer Science and Engineering, University of Minnesota Twin Cities

²Bioinformatics Core, Mayo Clinic College of Medicine

³Division of Biostatistics, School of Public Health, University of Minnesota Twin Cities

⁴Division of General Thoracic Surgery, Mayo Clinic Cancer Center

Abstract

Motivation

A central problem in biomarker discovery from large-scale gene expression or single nucleotide polymorphism (SNP) data is the computational challenge of accounting for the dependence among all the features. Methods that ignore this dependence usually identify disease markers that are not reproducible across independent datasets.

We introduce a new graph-based semi-supervised feature classification algorithm that identifies discriminative disease markers by learning on bipartite graphs. Our algorithm uses network propagation to classify the feature nodes in a bipartite graph directly as positive, negative or neutral. By exploring bi-cluster structures in the graph, it captures the dependence among both samples and features (clinical and genetic variables).
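As an illustration only, the idea of propagating sample labels across a sample-feature bipartite graph can be sketched in Python as below. This is a generic label-propagation sketch, not the authors' implementation: the symmetric degree normalization, the update rule, the parameter names (`alpha`, `n_iter`, `tol`), the function names and the threshold used to call a feature neutral are all assumptions made for this example.

```python
import numpy as np


def propagate_bipartite(W, y_samples, alpha=0.5, n_iter=200, tol=1e-10):
    """Generic label propagation on a sample-feature bipartite graph.

    W         : (n_samples, n_features) nonnegative edge weights (assumed input).
    y_samples : initial sample labels, +1 / -1, or 0 for unlabeled samples.
    alpha     : assumed mixing weight between propagated and initial labels.
    Returns (sample_scores, feature_scores).
    """
    W = np.asarray(W, dtype=float)
    y = np.asarray(y_samples, dtype=float)

    # Symmetric degree normalization (an assumption of this sketch).
    d_s = W.sum(axis=1)
    d_f = W.sum(axis=0)
    d_s[d_s == 0] = 1.0  # avoid division by zero for isolated nodes
    d_f[d_f == 0] = 1.0
    S = W / np.sqrt(np.outer(d_s, d_f))

    f_s = y.copy()
    f_f = np.zeros(W.shape[1])
    for _ in range(n_iter):
        # Features receive scores from the samples they are connected to.
        new_f = S.T @ f_s
        # Samples receive scores back from features, anchored to initial labels.
        new_s = alpha * (S @ new_f) + (1.0 - alpha) * y
        delta = max(np.abs(new_f - f_f).max(), np.abs(new_s - f_s).max())
        f_f, f_s = new_f, new_s
        if delta < tol:
            break
    return f_s, f_f


def label_features(feature_scores, threshold=0.1):
    """Map scores to +1 (positive), -1 (negative) or 0 (neutral)."""
    scores = np.asarray(feature_scores, dtype=float)
    return np.where(scores > threshold, 1, np.where(scores < -threshold, -1, 0))


# Toy data: feature 0 is tied to positive samples, feature 1 to negative
# samples, and feature 2 is connected equally to all samples.
W = np.array([[1, 0, 1],
              [1, 0, 1],
              [0, 1, 1],
              [0, 1, 1]])
y = np.array([1, 1, -1, -1])
sample_scores, feature_scores = propagate_bipartite(W, y)
feature_labels = label_features(feature_scores)
```

In this toy graph the first feature ends up positive, the second negative, and the third neutral because its contributions from the two classes cancel. Samples given an initial label of 0 would receive a score through the same iteration, which mirrors the point that test samples can be classified alongside the features.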

Our algorithm has two key features: 1) it can find a globally optimal labeling that captures the dependence among all the features, and thus generates highly replicable results across independent microarray or other high-throughput datasets; 2) it can handle hundreds of thousands of features, and is thus particularly useful for biomarker identification from large-scale gene expression and SNP data. In addition, although designed for classifying features, our algorithm can also simultaneously classify test samples for disease prognosis/diagnosis.

Results

We applied the network propagation algorithm to three large-scale breast cancer datasets. Our algorithm achieved competitive classification performance compared with SVMs and other baseline methods, and identified several markers with clinical or biological relevance to the disease. More importantly, our algorithm also identified highly reproducible marker genes and enriched functions across the independent datasets.

Full Paper [PDF]

Supplementary Document [PDF]

Supplementary Information and Source Code
