Supplementary Materials and Methods and Supplementary Tables

Table of Contents:
A. Supplementary Materials and Methods
B. Supplementary Tables (four Excel files, with legends)
C. References

A. Supplementary Materials and Methods

Whole-exome sequencing and identification of somatic mutations

Human samples were collected at the University of Rome “Tor Vergata”, the University of Bologna and the University of Torino (Italy), and genomic DNA was isolated using standard protocols. To identify tumor-specific mutations, we compared each leukemic sample with the corresponding normal DNA, isolated at the time of clinical and molecular remission of the disease. For the murine samples, genomic DNA was isolated from the leukemic spleen of PML-RARA KI mice or from the skin of the same animal (used as matched normal DNA) using the Genomic DNA isolation kit (QIAGEN) according to the manufacturer’s instructions.

Exome capture was performed using the SureSelectXT Human All Exon v.1 kit for hAMLs, the Human All Exon v.2 kit for hAPLs and the SureSelectXT Mouse All Exon kit (Agilent Technologies) for mouse samples, following the manufacturer’s specifications. Whole-exome sequencing was performed on the Illumina GAIIx with 76 bp paired-end reads for 8 samples (hAML#Mi1, hAML#Mi2, hAML#Mi3, hAPL#Mi1, hAPL#Mi2, hAPL#Mi3, hAPL#Mi4, hAPL#Mi5) and on the Illumina HiSeq 2000 platform with 101 bp paired-end reads for the remaining 18 samples (including mouse samples).

Reads were aligned to the reference genomes (hg19 for human and mm9 for mouse) using the Burrows-Wheeler Aligner (BWA) [1]. After preprocessing of the next-generation sequencing data (local realignment, duplicate marking and base quality recalibration) using GATK [2], we obtained a haploid mean coverage of 60X for the Illumina GAIIx and 160X for the Illumina HiSeq 2000.
We identified single nucleotide variants (SNVs) and small insertions/deletions (indels) in our samples using MuTect [3] and the Somatic Indel Detector (included in GATK), respectively. We applied the following additional filters to the resulting variants: i) a minimum read depth of 10 in both the normal and the tumor sample; ii) a minimum of 7 reads supporting the alternative allele; iii) at least 25% of all reads covering the position carrying the variant allele. The identified variants were functionally annotated using ANNOVAR [4]. We excluded from further analysis variants in non-coding regions, synonymous variants and variants in highly repetitive regions.

The identified mutations were validated by PCR amplification with primer pairs designed specifically for each mutation (sequences available upon request), followed by Sanger sequencing. All PCR products were evaluated on a 2% agarose gel, sequenced in both directions using Big Dye Terminator reactions and loaded on an ABI PRISM 3730xl DNA analyzer. Sequences were analyzed using the Sequencing Analysis 5.2 software.

Published variants

To generate a dataset of mutations from all the available AML and APL samples, we combined the results of our sequencing analyses with previously published data. From the published variants, we excluded those with a missing Entrez Gene ID, those located in the mitochondrial genome, in pseudogenes or in intronic regions, and all mutations classified as “silent” or “rna”. For genes found mutated in mouse genomes, we included the human orthologs in the analysis.

Significantly-mutated gene analysis

The massive sequencing of 247 AML genomes led to the identification of a total of 1559 genes targeted by mutations. However, only a few mutations are thought to be the leading cause of clonal expansion (drivers), while the majority appear to be passengers with no role in the development of cancer.
Here we propose a statistical strategy to identify putative driver genes: genes are defined as significantly mutated if they are recurrently mutated in AMLs (i.e. mutated in at least two patients) and have a mutation frequency higher than expected by chance. In detail, for each gene we measured the mutation rate per base, calculated as the ratio between the number of mutations in the gene and the length of its coding sequence. The length of each gene was calculated as the maximum length among all non-overlapping transcript isoforms. We compared this value with the average mutation rate per base observed in AML exomes (computed as the average number of mutations per sample multiplied by the number of samples, divided by the size of the exome), using a one-tailed Poisson test. We used the Benjamini–Hochberg procedure to control the False Discovery Rate (FDR).

Analyzing all the AML samples together, we found 191 genes recurrently targeted by mutations, but only 31 of them were significantly mutated according to our statistical strategy (q-value<0.005, Supplementary Table 4). The validity of our statistical approach is illustrated, for example, by Titin (TTN). TTN has the longest coding sequence in the genome and is recurrently mutated in AMLs. However, it is not identified as significantly mutated, because the analysis accounts for the large target size of this gene.

Since the initiating mutations of AMLs are diverse, considering only the genes that are significantly mutated across all cases together might miss some recurrently mutated genes that play an important role in the pathogenesis of individual cytogenetic subgroups. To identify specific cooperating mutations, we searched for genes with a significantly higher mutation rate than expected by chance in each subgroup, using the same statistical approach described above (for a complete list refer to Supplementary Table 3).
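The significance test described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used in the study: the function names, the input layout (gene name mapped to number of mutations, number of mutated patients and coding-sequence length) and all numbers in the toy example are hypothetical. It also assumes that the expected number of mutations in a gene is the background rate per base multiplied by the coding-sequence length, and that the Benjamini–Hochberg correction is applied over all tested genes before the recurrence criterion is imposed.

```python
import math


def poisson_upper_tail(k, lam):
    """Return P(X >= k) for X ~ Poisson(lam), lam > 0 (one-tailed test)."""
    if k <= 0:
        return 1.0
    # First term P(X = k), computed in log space to avoid overflow.
    term = math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))
    total, i = 0.0, k
    # Sum P(X = i) for i = k, k+1, ... until the terms become negligible.
    while term > 0.0 and (i <= lam or term > total * 1e-17):
        total += term
        i += 1
        term *= lam / i
    return min(total, 1.0)


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values (q-values) in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda j: pvalues[j])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that q-values stay monotone.
    for rank in range(m, 0, -1):
        j = order[rank - 1]
        running_min = min(running_min, pvalues[j] * m / rank)
        qvalues[j] = running_min
    return qvalues


def significantly_mutated(genes, total_mutations, exome_size,
                          min_patients=2, q_cutoff=0.005):
    """genes maps name -> (n_mutations, n_patients, cds_length_in_bases)."""
    # Background rate per base: total mutations (average per sample times
    # number of samples) divided by the size of the exome.
    background = total_mutations / exome_size
    names = sorted(genes)
    # Assumption: expected count for a gene = background rate * CDS length.
    pvalues = [poisson_upper_tail(genes[g][0], background * genes[g][2])
               for g in names]
    qvalues = benjamini_hochberg(pvalues)
    # Keep genes that are both recurrent and below the q-value cutoff.
    return [g for g, q in zip(names, qvalues)
            if genes[g][1] >= min_patients and q < q_cutoff]


# Hypothetical toy numbers, for illustration only (not data from the study).
toy_genes = {
    "GENE_A": (10, 9, 2_000),        # short gene, many mutations
    "LONG_GENE": (12, 11, 100_000),  # very long gene, many mutations
    "GENE_C": (1, 1, 1_500),         # mutated in a single patient
}
hits = significantly_mutated(toy_genes, total_mutations=3_000,
                             exome_size=30_000_000)
print(hits)
```

In this toy example, LONG_GENE carries more mutations than GENE_A but is not reported, because its expected count under the background rate is also large. This is the same effect described above for TTN.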
Specific mutated genes are significantly associated with particular subgroups

While some frequently mutated genes, such as WT1, KRAS, NRAS, IDH2 and SMC3, are common to many AML subgroups, other genes are significantly associated with particular karyotypes (Table 2). To determine whether a gene is significantly associated with a specific category, we used a one-tailed Fisher’s exact test with Benjamini–Hochberg adjustment.

B. Supplementary Tables

Supplementary Table 1. Characteristics of the samples analyzed and genes affected by somatic mutations identified by whole-exome sequencing.
Abbreviations: APL, acute promyelocytic leukemia; AML, acute myeloid leukemia; h, human; m, murine; NK, normal karyotype; wt, wild-type; mut, mutant; bcr, breakpoint cluster region of the PML/RARA oncogene; FAB, French-American-British classification; Dx, diagnosis.

Supplementary Table 2. List of somatic mutations identified by our analysis.
1 Function: exonic or splicing.
2 Exonic Function: whether it is a synonymous SNV, stopgain, non-/frameshift substitution.
3 ESP5400_ALL: allele frequency in 5400 NHLBI-ESP exomes.
4 1000g2012feb_ALL: 1000genome allele frequencies (February 2012 release).
5 dbSNP135: dbSNP reference number.
6 AVSIFT prediction scores.
7 COSMIC site: presence in the COSMIC database.
8 Freq: ratio between the number of alternative reads covering the mutation and its total coverage.
9 Validated: mutation validated by Sanger sequencing.

Supplementary Table 3. Significantly-mutated genes associated with specific AML subgroups.
1 Number of mutations identified for a specific gene in the indicated AML subgroup.
2 Statistical significance of the association of a specific mutated gene with the indicated AML subgroup.
3 Mutated genes exclusively associated with a specific AML subgroup.

Supplementary Table 4. Significantly-mutated genes across all AML samples.
1 Total number of mutations identified for each mutated gene.
2 Significantly-mutated genes (q-value≤0.005).

C. References

1. Li, H. & Durbin, R. Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics 26, 589–595 (2010).
2. McKenna, A., Hanna, M., Banks, E., Sivachenko, A., Cibulskis, K., Kernytsky, A., et al. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 20, 1297–1303 (2010).
3. Cibulskis, K., Lawrence, M. S., Carter, S. L., Sivachenko, A., Jaffe, D., Sougnez, C., et al. Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples. Nat. Biotechnol. 31, 213–219 (2013).
4. Wang, K., Li, M. & Hakonarson, H. ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Res. 38, e164 (2010).