SubPatCNV


Background: DNA copy-number variations (CNVs) are genome aberrations that can disrupt normal biological functions and lead to abnormal cell growth and tumorigenesis. Identifying the causal copy-number variations of cancer is an important step in understanding the molecular mechanisms of cancer and developing effective treatments. Existing copy-number variation discovery methods are statistical approaches that calculate a positional summary statistic for the copy-number variation across all patients in the dataset, and thus tend to miss large aberrant copy-number variation regions that occur in patient subsets. Little previous work has focused on customized identification of copy-number variations that appear only in subsets of patients.

Results: We introduce a tool for mining CNV subspace patterns (SubPatCNV), which is able to identify all aberrant CNV regions specific to arbitrary patient subsets larger than a support threshold. SubPatCNV is an approximate association pattern mining algorithm under a spatial constraint on the positional CNV probe features. In experiments on a large-scale bladder cancer dataset, SubPatCNV discovered many large aberrant CNV events in patient subgroups. It also reported CNV regions that were highly specific to clinical variables such as tumor grade or stage and that were enriched with more known oncogenes than the regions found by other existing CNV discovery methods.

Conclusions: Identifying causal CNVs driving cancer development is a difficult problem. SubPatCNV is an easy-to-use, open-source software tool that provides the flexibility to identify aberrant copy-number variation regions specific to patient subgroups of different sizes.
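
The core idea in the Results paragraph, patterns over contiguous probes that are shared by at least a threshold number of patients, can be illustrated with a small Python sketch. This is not the SubPatCNV implementation: it is an exact, brute-force simplification (SubPatCNV itself is described as approximate), Python is not part of the toolbox, and the function name mine_contiguous_patterns and the toy matrix are hypothetical.

```python
def mine_contiguous_patterns(matrix, min_support):
    """Find closed contiguous-probe patterns in a binary probes x samples matrix.

    matrix[p][s] is 1 if probe p (in genomic order) is aberrant in sample s.
    A pattern is (first_probe, last_probe, samples) where every listed sample
    is aberrant at every probe in the range and len(samples) >= min_support.
    Only "closed" patterns are kept: extending the range by one probe on
    either side would lose at least one supporting sample.
    """
    # Set of samples carrying the aberration at each probe.
    rows = [frozenset(s for s, v in enumerate(row) if v) for row in matrix]
    n = len(rows)
    patterns = []
    for start in range(n):
        shared = rows[start]
        for end in range(start, n):
            shared = shared & rows[end]
            if len(shared) < min_support:
                break  # extending further can only shrink the sample set
            left_closed = start == 0 or (shared & rows[start - 1]) != shared
            right_closed = end == n - 1 or (shared & rows[end + 1]) != shared
            if left_closed and right_closed:
                patterns.append((start, end, sorted(shared)))
    return patterns


# Toy data (made up): 4 probes on one chromosome, 4 samples.
toy = [
    [1, 1, 1, 0],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 1],
]
patterns = mine_contiguous_patterns(toy, min_support=2)
for start, end, samples in patterns:
    print(f"probes {start}-{end}: samples {samples}")
```

With a support threshold of 2, the toy matrix yields two patterns: probes 0-1 shared by samples 0, 1 and 2, and the longer region of probes 0-2 shared only by samples 0 and 1. The second pattern is the kind of subset-specific region that a summary statistic over all patients could miss.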


Download the toolbox: HERE

Toolbox Steps

    Preprocessing data

  1. In scripts/ execute create_file_structure.sh to set up the file structure. You will need to edit line 3 to set the dataset name.
  2. Transform the raw data into a matrix-form text file. Matrix form is an n × (m+2) matrix, where n is the number of probes and m is the number of samples. The first two columns must be the chromosome number and the base-pair location of the probe (hence the +2). There should be no additional header or probe labels. Place the text file in datasets/your_dataset/data/
  3. In matlab_code/ execute binarize_data.m on the matrix text file. You will need to edit lines 12 and 13 to set your dataset and text file names. This will create separate files for amplification and deletion CNV events for each chromosome in datasets/your_dataset/data/datafiles/

    Running SubPatCNV algorithm

  4. In scripts/ execute run_experiments.sh to run the SubPatCNV algorithm on your data. You will need to edit line 4 to set the dataset name. The results will be created for each chromosome and for amplification and deletion CNV events in datasets/your_dataset/data/outfiles/

    Visualizing results

  5. You can visualize the results by running any of the scripts in matlab_code/. You will need to edit the first few lines in each for your dataset.
    • num_patterns_figure.m: Plot of the number of discovered patterns with respect to support value.
    • pattern_figures.m: Heatmap of patterns discovered with individual patient clinical variables labeled.
    • chr_pattern_figures.m: Plot of pattern location and subset patient clinical variable associations on a specific chromosome.
    • genome_pattern_figures.m: Plot of pattern location along the genome.
    • pattern_size_dist.m: Plot of the (normalized) pattern size distributions.
    • oncogene_coverage_figure.m: Plot of the oncogene coverage by patterns discovered by SubPatCNV.
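
The preprocessing stage above turns the log2 matrix into separate binary amplification and deletion data for each chromosome. The Python sketch below shows one way such a binarization could work. It is an assumption-laden illustration and not a port of binarize_data.m: the cutoffs of +0.3 and -0.3 are hypothetical (the thresholds the toolbox actually uses are not stated here), and the binarize function and sample rows are made up.

```python
from collections import defaultdict

# Hypothetical log2-ratio cutoffs; the toolbox's real values are not documented here.
AMP_THRESHOLD = 0.3
DEL_THRESHOLD = -0.3


def binarize(rows, amp=AMP_THRESHOLD, dele=DEL_THRESHOLD):
    """Split probe rows into per-chromosome binary amplification/deletion matrices.

    rows: iterable of (chromosome, bp_location, [log2 value per sample]).
    Returns {chromosome: {"amp": [[0/1, ...], ...], "del": [[0/1, ...], ...]}}
    with one inner list per probe, in input order.
    """
    out = defaultdict(lambda: {"amp": [], "del": []})
    for chrom, _bp, values in rows:
        out[chrom]["amp"].append([1 if v > amp else 0 for v in values])
        out[chrom]["del"].append([1 if v < dele else 0 for v in values])
    return dict(out)


# Made-up example: two probes on different chromosomes, three samples.
rows = [
    ("1", 1000, [0.5, -0.5, 0.0]),
    ("2", 500, [0.1, -0.9, 0.4]),
]
binary = binarize(rows)
```

Keeping amplifications and deletions in separate binary matrices mirrors the per-chromosome amplification and deletion files described above, so each event type can be mined independently.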

File formats:

  • log2_matrix_file.txt: Must be a tab-delimited text file where rows are the probes and columns are the samples. The first two columns must be the chromosome and the probe bp location. All columns after that are samples.
  • oncogene_data.txt: Must be a tab-delimited text file that contains information on the oncogenes that the user is interested in analyzing. Rows are oncogenes. Columns are: oncogene name, chromosome, bp start, bp stop.
  • clinical_variables.txt: Must be a tab-delimited text file that contains the clinical variable labels for each sample. The first row of the file must be a header. Rows are samples. Columns are the different clinical variables.
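
To make the three formats above concrete, the following Python sketch parses a tiny in-memory example of each. It is only an illustration: Python is not part of the toolbox, the function names are hypothetical, the sample values and gene name are made up, and the sketch assumes that oncogene_data.txt has no header row (the description above only requires a header for clinical_variables.txt).

```python
import csv
import io


def read_log2_matrix(text):
    """Parse probe rows: chromosome, bp location, then one log2 value per sample."""
    chroms, positions, values = [], [], []
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        if not row:
            continue  # skip blank lines
        if len(row) < 3:
            raise ValueError("each row needs chromosome, bp location and a sample")
        chroms.append(row[0])
        positions.append(int(row[1]))
        values.append([float(v) for v in row[2:]])
    if len({len(v) for v in values}) > 1:
        raise ValueError("rows have differing numbers of sample columns")
    return chroms, positions, values


def read_oncogenes(text):
    """Parse oncogene rows: name, chromosome, bp start, bp stop (no header assumed)."""
    genes = []
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        if row:
            name, chrom, start, stop = row[:4]
            genes.append((name, chrom, int(start), int(stop)))
    return genes


def read_clinical(text):
    """Parse clinical variables: header row first, then one row per sample."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


# Made-up miniature files: 2 probes x 2 samples, 1 gene, 2 clinical variables.
matrix_text = "1\t1000\t0.10\t-0.80\n1\t2000\t0.65\t-0.05\n"
oncogene_text = "GENE_A\t8\t1000\t5000\n"
clinical_text = "grade\tstage\nhigh\tT2\nlow\tTa\n"

chroms, positions, values = read_log2_matrix(matrix_text)
genes = read_oncogenes(oncogene_text)
clinical = read_clinical(clinical_text)
```

A quick consistency check that follows from the format descriptions is that the number of sample columns in the matrix should equal the number of data rows in the clinical variables file (here, 2 and 2).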

Contact Rui Kuang at kuang@cs.umn.edu or Nicholas Johnson at njohnson@cs.umn.edu for questions, comments, or bug reports.