The CAMDA Contest Challenges

CAMDA 2018 presents:

CAMDA encourages an open contest, where all analyses of the contest data sets are of interest, not limited to the questions suggested here. There is an online forum for the free discussion of the contest data sets and their analysis, in which you are encouraged to participate.

We look forward to a lively contest!

MetaSUB Forensics Challenge

The MetaSUB International Consortium is building a longitudinal metagenomic map of mass-transit systems and other public spaces across the globe. The consortium maintains a strategic partnership with CAMDA and this year provides data from global City Sampling Days for the first-ever multi-city forensic analyses.

CAMDA delegates receive access to hundreds of novel MetaSUB samples, comprising several gigabases of whole genome shotgun (WGS) metagenomics data. Samples are collected from multiple surfaces in mass-transit systems (handrails, ticket machine screens and keypads, plastic, metal, or wooden benches, etc.). The primary data set covers multiple cities around the world, with tens of samples per city. Together, they form a unique resource for the study of biodiversity within and across geographic locations or surface types.

Three complementary independent test sets will be provided for exploration:

  • About 30 new samples from different cities and surface types already featured in the primary dataset - can you tell which?
  • At least 3 different new 'mystery' cities not featured before. Each new city will be represented by 10 or more samples.
  • About 20 samples from 'mystery' locations, some not featured before, with no information about which samples might come from the same city.

Analysis suggestions:
A key challenge in genomic forensics is the construction of a microbiome fingerprint which will allow the identification of the geographical origin of a sample.

Typical considerations include:

  • How and how well can we exploit metagenomic fingerprints for identifying the origin of a sample?
  • How reliably can we identify single samples?
  • How do multiple samples and/or multiple surface types affect the quality of our predictions?
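As a rough illustration of what a "metagenomic fingerprint" could look like in code, the sketch below assigns a sample to the city whose average taxonomic profile is closest. It is only one of many possible baselines, not a method prescribed by the challenge. The function names, the taxon and city labels, and the counts are all hypothetical, and it assumes each sample has already been summarized as non-empty per-taxon read counts.

```python
from collections import defaultdict
from math import sqrt


def normalize(profile):
    """Scale raw taxon counts to relative abundances that sum to 1."""
    total = sum(profile.values())
    return {taxon: count / total for taxon, count in profile.items()}


def city_centroids(samples):
    """Average the relative-abundance profiles of each city's samples.

    samples: list of (city, {taxon: count}) pairs.
    """
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for city, profile in samples:
        counts[city] += 1
        for taxon, value in normalize(profile).items():
            sums[city][taxon] += value
    return {
        city: {taxon: total / counts[city] for taxon, total in taxa.items()}
        for city, taxa in sums.items()
    }


def distance(a, b):
    """Euclidean distance between two sparse abundance profiles."""
    taxa = set(a) | set(b)
    return sqrt(sum((a.get(t, 0.0) - b.get(t, 0.0)) ** 2 for t in taxa))


def predict_city(profile, centroids):
    """Return the city whose centroid is closest to the query profile."""
    query = normalize(profile)
    return min(centroids, key=lambda city: distance(query, centroids[city]))


# Toy, made-up counts for illustration only (not MetaSUB data).
training = [
    ("city_A", {"taxon_1": 80, "taxon_2": 20}),
    ("city_A", {"taxon_1": 70, "taxon_2": 30}),
    ("city_B", {"taxon_2": 10, "taxon_3": 90}),
    ("city_B", {"taxon_2": 25, "taxon_3": 75}),
]
centroids = city_centroids(training)
mystery_sample = {"taxon_1": 60, "taxon_2": 35, "taxon_3": 5}
predicted = predict_city(mystery_sample, centroids)
```

A nearest-centroid rule like this cannot flag a sample from a city that is absent from the training data; handling the 'mystery' cities would need, for example, a distance threshold or a clustering step.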

The primary data set is available now. The first two sets of mystery samples will be released in Jan 2018, and the final test samples will become available later. Please read and accept the data download agreement for access.

CMap Drug Safety Challenge

Attrition in drug discovery and development due to safety / toxicity issues remains a significant concern, and there are strong efforts to identify and mitigate risk as early as possible. Drug-induced liver injury (DILI) is one of the primary problems in drug development and regulatory clearance due to the poor performance of existing preclinical models. There is a pressing need to evaluate alternative methods for predicting DILI, with great hopes being placed in modern approaches from statistics and machine learning applied to genome-scale profiling data. A critical question thus is whether we can better integrate, understand, and exploit information from cell-based screens like the Broad Institute Connectivity Map (CMap, Science 313, Nature Reviews Cancer 7).

This CAMDA challenge focuses on understanding or predicting drug-induced liver injury in humans from cell-based screens, specifically the CMap gene expression responses of two different cancer cell lines (MCF7 and PC3) to 276 drug compounds. To also support supervised approaches, we provide clinical DILI results as training labels for 190 drugs.

Analysis suggestions:

  • Identification and interpretation of differences in cell-line response across drugs and across cell-line type
  • Prediction of human clinical DILI results from cell-line responses

Everyone will be invited to publish their method and results as a full research paper in the CAMDA Proceedings after the conference. This year, however, in addition to the regular open-ended data analysis contest, we also invite all participants to contribute to an FDA meta-analysis paper of prediction methods that will be published separately. If you choose to take part in this additional track, you need to freeze your solution by 15 April and submit a short description of your approach, its training performance (accuracy, specificity, sensitivity, MCC, …), and your DILI predictions for the 86 unlabeled drugs. The labels for these 86 drugs will be released 16 April, giving you about four weeks to finalize an extended abstract for submission to CAMDA.
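For reference, the performance measures named above can be computed from a binary confusion matrix as in the following minimal sketch. It assumes labels are encoded as 1 (DILI positive) and 0 (DILI negative); the actual label encoding in the challenge files may differ, and the function name and example vectors are hypothetical.

```python
from math import sqrt


def binary_metrics(y_true, y_pred):
    """Accuracy, sensitivity, specificity and MCC for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    # Matthews correlation coefficient; defined as 0 when any margin is empty.
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }


# Made-up labels and predictions for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]
metrics = binary_metrics(y_true, y_pred)
```

MCC is often preferred over accuracy when the two classes are unbalanced, because it only rewards predictions that do well on both positives and negatives.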

Lists of Affymetrix microarray gene expression files (raw .CEL files) for all 276 drug compounds and clinical DILI labels for 190 of these are available for download now. Please read and accept the data download agreement for access. Labels for the remaining 86 drugs will be released 16 April.

Cancer Data Integration Challenge

Examine the power of data integration in real-world clinical settings. Many approaches work well on some data sets yet not on others. Here we challenge you to demonstrate a single unified approach to data integration that matches or outperforms the current state of the art on two different diseases, breast cancer and neuroblastoma.

Breast cancer affects about 3 million women every year (McGuire et al, Cancers 7), and this number is growing fast, especially in developed countries. Can you improve on the large Metabric study (Curtis et al., Nature 486, and Dream Challenge, Margolin et al, Sci Transl Med 5)? The cohort is biologically heterogeneous with all five distinct PAM50 breast cancer subtypes represented. Matched profiles for microarray and copy number data as well as clinical information (survival times, multiple prognostic markers, therapy data) are available for about 2,000 patients.

Neuroblastoma is the most common extracranial solid tumor in children. The base study compared RNA-seq and Agilent microarray gene expression profiles for clinical endpoint prediction in 498 pediatric patients (FDA SEQC - Zhang et al, Genome Biology 16). The published summary data are complemented by raw signal level data for gene expression arrays, RNA-Seq expression profiles, and extended clinical metadata. In addition, we provide matched aCGH data for 145 of these patients for copy number analysis (Fischer lab, Köln - Stigliani et al, Neoplasia 14, Coco et al, IJC 131, Kocak et al, Cell Death Dis 4, Theissen et al, Genes Chromosomes Cancer 53).

Analysis suggestions:

  • Efficient horizontal data integration (inter-type), combining gene expression, CNV/CNA, clinical markers …
  • Efficient vertical data integration (intra-type), e.g., combining the expression profiles from complementary high-throughput technologies (RNA-seq and microarrays), combining information across patients, …
  • Characterization of differences in algorithm performance between the two diseases, identification of possible causes and mitigation strategies.


  • Better survival time prediction by effective data integration or improved models.
  • Advancing our understanding of the mechanisms behind cancer progression or therapy response by effective data integration or novel functional (network/pathway) analysis.
  • Improved cancer subgrouping.
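One simple baseline for the horizontal (inter-type) integration suggested above is "early" integration: standardize each data type separately, then concatenate the features of patients who have matched profiles in every type. The sketch below illustrates only that baseline, under the assumption that each data type is available as a mapping from patient ID to a numeric feature vector. The function names, patient IDs, and values are hypothetical.

```python
from statistics import mean, pstdev


def zscore_columns(block):
    """Standardize each feature across patients.

    block: {patient_id: [feature values]}, same feature order for everyone.
    """
    patients = sorted(block)
    n_features = len(block[patients[0]])
    scaled = {p: [] for p in patients}
    for j in range(n_features):
        column = [block[p][j] for p in patients]
        mu, sd = mean(column), pstdev(column)
        for p in patients:
            # Constant features carry no information; map them to 0.
            scaled[p].append((block[p][j] - mu) / sd if sd else 0.0)
    return scaled


def integrate(*blocks):
    """Concatenate standardized blocks for patients present in every block."""
    shared = set.intersection(*(set(b) for b in blocks))
    scaled = [zscore_columns({p: b[p] for p in shared}) for b in blocks]
    return {p: [v for s in scaled for v in s[p]] for p in sorted(shared)}


# Made-up values for illustration only.
expression = {"P1": [5.0, 1.0], "P2": [7.0, 3.0], "P3": [9.0, 2.0]}
copy_number = {"P1": [2.0], "P2": [4.0], "P4": [3.0]}
combined = integrate(expression, copy_number)  # keeps only P1 and P2
```

Restricting to shared patients is the simplest way to deal with partially matched data, such as aCGH profiles being available for only a subset of the neuroblastoma cohort, but it discards samples; approaches that tolerate missing data types would be a natural refinement.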

Data download:
For neuroblastoma, raw microarray data (expression and CGH arrays) as well as RNA-Seq expression profiles are provided together with a sample description file. Participants who want to use this dataset must read and accept the data download agreement for access. In addition, raw RNA-Seq reads can optionally be made available on completion of an ethical use agreement with the University of Köln.

The breast cancer dataset has been compiled from public sources. Please read and accept the data download agreement for access.


Atul Butte, MD, PhD
Stanford University School of Medicine

Nikolaus Rajewsky, PhD
Max-Delbrück-Center for Molecular Medicine

Terry Speed, PhD
The Walter and Eliza Hall Institute of Medical Research

Sandrine Dudoit, PhD
University of California, Berkeley

John Quackenbush, PhD
Harvard School of Public Health

Eran Segal, PhD
Weizmann Institute of Science

John Storey, PhD
Princeton University

Chris Sander, PhD
Memorial Sloan Kettering Cancer Center

Temple F. Smith, PhD
Boston University

Curtis Huttenhower, PhD
Harvard School of Public Health

Christopher E. Mason, PhD
Weill Cornell Medicine

Lodewyk Wessels, PhD
Netherlands Cancer Institute

Cesare Furlanello, PhD
Fondazione Bruno Kessler

Intention to Submit due: 5 April 2018
FDA meta-analysis opt in due: 15 April 2018
Extended Abstract Proposals due: 13 May 2018
Notification of Accepted Contributions: 24 May 2018
Early Registration Closes: 7 Jun 2018
CAMDA 2018 Conference: 7–8 Jul 2018
ISMB 2018 Conference: 6–10 Jul 2018
Full Paper Submission: 24 Sep 2018

Biology Direct