Call for Papers


Functional Clustering of Gene Expression Profiles in Human Cancers Challenge



The World Health Organization's Global Burden of Disease statistics identified cancer as the second leading cause of death worldwide, after cardiovascular disease. Cancer is the fastest growing segment of the disease burden; global cancer deaths are projected to increase from 7.1 million in 2002 to 11.5 million in 2030. Cancer research produces huge quantities of data that serve as a basis for the development of improved diagnostics and therapies. Advanced statistical and machine learning methods are needed to interpret the primary data and to generate the new knowledge required for developing new diagnostic tools, drugs, and vaccines. The aim of this challenge is the identification of functional clusters of genes from gene expression profiles in three major cancers: breast, colon, and lung. Identification of the functional groups and subgroups of genes responsible for the development and spread of these cancers and their subtypes is urgently needed for proper classification and for identifying key processes that can be targeted therapeutically.


Gene expression profiling using microarrays has become a routine research tool in biomedicine. This high-throughput technology allows the researcher to monitor whole-genome gene expression profiles under different experimental conditions or disease phenotypes (including subtypes), as well as in time course experiments. This type of screening helps identify genes with similar expression patterns across conditions or time points. Co-expression of genes often indicates their co-regulation or their participation in related biological pathways or processes. One strategy for identifying groups of genes with similar expression patterns is to use unsupervised machine learning techniques such as clustering. Many clustering algorithms have been developed and applied to the analysis of microarray data, ranging from simple hierarchical clustering to bi-clustering approaches. These analyses usually provide a good starting point for further examination of specific pathways and relevant biological processes. However, the complexity of the problem means that fine-tuning and human intervention are still required to determine useful gene clusters associated with specific biological functions.
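As an illustration of the co-expression clustering described above, the following is a minimal sketch of average-linkage hierarchical clustering using a correlation-based distance (1 minus the Pearson correlation). It is not a method prescribed by the Challenge. The gene names, the toy expression values, and the function names pearson and cluster_genes are hypothetical and exist only for this example.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles.

    Assumes neither profile is constant (a constant profile has zero
    variance and would cause a division by zero).
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def cluster_genes(profiles, n_clusters):
    """Average-linkage agglomerative clustering with distance 1 - r."""
    names = list(profiles)
    # Pairwise distance between every pair of genes.
    dist = {(a, b): 1.0 - pearson(profiles[a], profiles[b])
            for a in names for b in names}
    clusters = [[g] for g in names]  # start with one gene per cluster
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Average distance over all cross-cluster gene pairs.
                total = sum(dist[a, b]
                            for a in clusters[i] for b in clusters[j])
                d = total / (len(clusters[i]) * len(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)  # merge the two closest clusters
    return [sorted(c) for c in clusters]

# Hypothetical toy data: four genes measured in four samples.
toy = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.1, 3.9, 6.2, 8.0],
    "geneC": [4.0, 3.0, 2.0, 1.0],
    "geneD": [8.1, 6.0, 3.9, 2.2],
}
gene_clusters = cluster_genes(toy, n_clusters=2)
print(gene_clusters)
```

In this toy example, genes whose profiles rise together and genes whose profiles fall together end up in separate clusters. The pairwise approach shown here is only practical for small gene sets; an array with tens of thousands of probes would call for an optimized library implementation and some form of gene filtering first.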


Challenge Format:

For this ICMLA 2009 Challenge, we invite participants to develop and submit unsupervised machine learning algorithms that aim to minimize the fine-tuning and human intervention needed to identify groups of genes with biological functions from a set of training data. The true labels and other relevant information will be provided for the training samples. The clustering algorithms developed using these identified groups of genes will then be used to cluster an independent test set (true labels and other relevant information will not be provided for the test samples). Gene expression data, for both the training and test sets, will be provided in two formats: the raw CEL files and the intensity files. The samples were profiled on the Affymetrix Human Genome GeneChip HG_U133 Plus 2.0, which contains ~54,000 probes covering ~20,000 annotated genes. Participants can use either the CEL files (performing their own normalization) or the intensity files as the gene expression matrix for identifying the gene clusters. All the samples used in this Challenge were profiled by the International Genomics Consortium (IGC) for the Expression Project for Oncology (expO) and were deposited in the NCBI Gene Expression Omnibus (GEO) under series accession GSE2109. We acknowledge their efforts in making these valuable data publicly available to the community.



Training set: 400 samples (70 lung cancers, 130 colon cancers, and 200 breast cancers).

Click here to download the training data (signal intensity TRAINING_SET.xls).

For raw CEL files, click (TRAIN_BREAST1.CEL.tar.gz, TRAIN_BREAST2.CEL.tar.gz, TRAIN_LUNG.CEL.tar.gz, TRAIN_COLON.CEL.tar.gz)


Testing set: 250 samples (50 lung cancers, 100 colon cancers, and 100 breast cancers).

Click here to download the testing data (signal intensity TESTING_SET.xls).

For raw CEL files, click (TEST_1.CEL.tar.gz, TEST_2.CEL.tar.gz, TEST_3.CEL.tar.gz, TEST_4.CEL.tar.gz)


Submission and Evaluation:

Short technical papers submitted will be reviewed based on: (i) the clarity and novelty of the algorithm and steps in identifying the gene clusters; (ii) the association of relevant biological function in the groups of genes; (iii) the ability to correctly cluster the independent test samples; and (iv) the discovery of new biological insights in these cancers.
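Criterion (iii) concerns how well the test samples are clustered. The sketch below shows one common way that agreement between a clustering and known tumor types could be quantified, namely cluster purity. This is an illustrative assumption, not the scoring method used by the Challenge reviewers. The cluster assignments, the labels, and the function name cluster_purity are hypothetical.

```python
from collections import Counter

def cluster_purity(cluster_ids, true_labels):
    """Fraction of samples that carry the majority label of their cluster."""
    by_cluster = {}
    for cid, label in zip(cluster_ids, true_labels):
        # Count how often each true label occurs inside each cluster.
        by_cluster.setdefault(cid, Counter())[label] += 1
    # For each cluster, take the size of its largest label group.
    majority_total = sum(counts.most_common(1)[0][1]
                         for counts in by_cluster.values())
    return majority_total / len(true_labels)

# Hypothetical example: eight samples assigned to three clusters.
cluster_ids = [0, 0, 0, 1, 1, 2, 2, 2]
true_labels = ["breast", "breast", "colon", "colon",
               "colon", "lung", "lung", "breast"]
purity = cluster_purity(cluster_ids, true_labels)
print(purity)
```

Purity rewards clusters dominated by a single tumor type but also increases trivially as the number of clusters grows, so it is usually reported together with the number of clusters or alongside a chance-corrected measure.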



Authors of selected papers submitted to this Challenge will be invited to publish longer versions of their papers in a bioinformatics journal (to be determined). The full paper submissions will be subject to regular peer review.



Paper Submission Deadline:                August 15, 2009
Notification of Acceptance:               September 7, 2009
Camera-ready Papers & Pre-registration:   October 1, 2009
The ICMLA Conference:                     December 13-15, 2009


Authors should submit papers through the main conference submission website. Papers must conform to the requirements detailed in the instructions to authors. Accepted papers must be presented by one of the authors in order to be published in the conference proceedings. If you have any questions, please direct them to Dr. Tan (AikChoon.Tan{at}


All challenge submissions will be handled electronically. Detailed instructions for submitting a paper are provided on the conference home page at



ICMLA 2009 Challenge Organizers:

         Aik Choon Tan, Ph.D. University of Colorado Denver School of Medicine, USA

         Vladimir Brusic, Ph.D. Dana-Farber Cancer Institute/Harvard University, USA