Supporting Methods

Identification of amplicons and amplification hotspots

The criteria used to identify amplified chromosomal segments by array CGH vary considerably between studies. Since it is believed that amplifications arise under selective pressure to replicate the same localized event multiple times, they should result in large ratio changes for regions of the genome less than a chromosome arm in size (Fridlyand et al., 2004). This is in contrast to single copy gains, which result in small ratio changes. Therefore, direct thresholding of array data at a particular ratio is commonly used to identify amplified array elements and distinguish them from low level gains. However, the threshold value differs greatly between studies, ranging from a log2 ratio of 0.4 (Tonon et al., 2005) to as high as 1.0 (Garcia et al., 2005; Garnis et al., 2006). A threshold commonly used by our group and others is a log2 ratio of 0.8 (Choi et al., 2006; Choi et al., 2007; Lockwood et al., 2007).

To determine the stringency of this threshold for the hybridization profiles in our dataset, we calculated the average log2 ratio and standard deviation for each individual profile for a segment of normal copy number (defined as the three unique sets of ten contiguous clones with the minimal sum of |average ratio| + standard deviation). From this calculation, a log2 ratio of 0.8 corresponded to a minimum separation of 13 standard deviations from the normal copy number segment for the sample with the greatest amount of experimental noise (range 13 to 64 standard deviations, median of 29). Therefore, we determined that a direct threshold of 0.8 would represent a very stringent criterion for defining amplified chromosomal segments and would be appropriate for use in our particular dataset.

To identify high level amplifications, the following algorithm was used. First, the array CGH data were filtered to exclude clones with standard deviations between replicate values >0.075.
Next, each clone was determined to be a member of a high level amplification if its log2 ratio was ≥0.8 (Choi et al., 2006; Choi et al., 2007; Lockwood et al., 2007). Clones were then analyzed in the context of moving averages of three, five and seven clone sliding windows; if the central value exceeded 0.8, the clone was also determined to be amplified.

To define the boundaries of amplified regions, we then examined the clones adjacent to those identified as amplified by the method described above. As array elements may display a partial ratio response to a copy number alteration if the clone does not completely overlap with the gain, we analyzed neighbouring clones to identify those with a log2 ratio greater than or equal to 0.4 (one half of the initial threshold). If a clone exhibited this level of partial response, it was also determined to be part of the amplicon. Finally, to prevent outliers from splitting a single amplification event into multiple regions, amplicons were systematically examined and all clones with amplified neighbours on both sides were included in the same amplicon.

To elucidate hotspots of copy number amplification, the base pair positions for each amplicon in every sample were determined as described above. These segments were then mapped to the same genomic coordinate backbone (hg15) in order to determine the incidence of amplification across the entire human genome for all 104 samples. The amplification incidence was then plotted against genomic position using SeeGH Frequency Plot (Coe et al., 2006) to generate a histogram of recurrent amplification. To define the criteria for an amplification hotspot, we determined a statistically significant frequency threshold by modeling the occurrence of amplification using a Poisson distribution. From this calculation, it was found that an incidence of three was statistically significant (p<0.05).
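The Poisson-based frequency threshold can be illustrated with a short calculation. The sketch below is a minimal example, not the procedure actually used: the Poisson mean is not reported in the text, so `assumed_mean` is a hypothetical placeholder, and the function names are introduced here for illustration only. It computes the upper-tail probability P(X >= k) and the smallest incidence whose tail probability falls below a chosen significance level.

```python
import math


def poisson_upper_tail(k: int, mean: float) -> float:
    """Return P(X >= k) for X ~ Poisson(mean)."""
    if k <= 0:
        return 1.0
    # P(X >= k) = 1 - P(X <= k - 1); adequate for the small counts used here.
    cdf = sum(math.exp(-mean) * mean**i / math.factorial(i) for i in range(k))
    return max(0.0, 1.0 - cdf)


def min_significant_incidence(mean: float, alpha: float = 0.05) -> int:
    """Return the smallest count k such that P(X >= k) < alpha."""
    k = 0
    while poisson_upper_tail(k, mean) >= alpha:
        k += 1
    return k


# ASSUMPTION: hypothetical expected number of samples amplified at a given
# locus by chance. This value is NOT taken from the text.
assumed_mean = 0.7

for incidence in (3, 5):
    print(incidence, poisson_upper_tail(incidence, assumed_mean))
print("smallest significant incidence:", min_significant_incidence(assumed_mean))
```

In practice the mean would have to be estimated from the data, for example from the genome-wide average amplification incidence per locus; the choice of estimator changes the resulting threshold.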
To be stringent and reduce the effect of outliers, a threshold of five (4.8%, p<0.0008) was used to define hotspots across the 104 samples. The base pair positions of the array elements flanking the region of minimal amplicon overlap were used to determine the hotspot boundaries. The RefSeq database was then used to determine the genes mapping within the genomic coordinates of each amplification hotspot. The SeeGH software package is available upon request at: http://www.flintbox.ca/.

Statistical analysis of amplification hotspots and fragile site co-localization

Chromosome band locations for fragile sites were obtained from the National Center for Biotechnology Information Entrez Gene database (Bethesda, MD, USA) and sorted as rare or common. Only common fragile sites on autosomes were used in subsequent analysis, and the chromosome bands at which they were located were denoted as fragile site bands (FSBs). Chromosome bands that harboured an amplification hotspot were flagged as hotspot bands (HSBs). The proportion of HSBs co-localized with FSBs was compared to the proportion of HSBs on all autosomal bands using a chi-squared test to determine if there was a statistically significant enrichment of amplification hotspots within fragile site regions.

Functional assessment of amplified genes

The functional and canonical pathway analyses were generated through the use of Ingenuity Pathways Analysis (Ingenuity® Systems, www.ingenuity.com). Functional Analysis identified the biological functions and diseases that were most significant to the data set. Genes from the dataset that met the amplification count cutoff of occurring five or more times and were associated with biological functions and/or diseases in the Ingenuity Pathways Knowledge Base were considered for the analysis. Fisher's exact test was used to calculate a p-value determining the probability that each biological function and/or disease assigned to that data set is due to chance alone.
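To make the enrichment test concrete, the following is a minimal sketch of a one-sided (over-representation) Fisher's exact test computed from the hypergeometric distribution. It is an illustration only: the text does not state which tail the software uses, the function name and its parameters are hypothetical, and the gene counts in the example are invented for demonstration and are not results from this study.

```python
from math import comb


def fisher_exact_enrichment(overlap: int, list_size: int,
                            category_size: int, universe_size: int) -> float:
    """One-sided Fisher's exact p-value for over-representation.

    Returns the probability of observing at least `overlap` category genes
    when `list_size` genes are drawn at random from `universe_size` genes,
    of which `category_size` belong to the category.
    """
    max_overlap = min(list_size, category_size)
    total = comb(universe_size, list_size)
    # Sum the hypergeometric probabilities of every outcome at least as
    # extreme as the one observed.
    tail = sum(
        comb(category_size, x) * comb(universe_size - category_size, list_size - x)
        for x in range(overlap, max_overlap + 1)
    )
    return tail / total


# INVENTED example counts (not from the study): 4 amplified genes, 3 of which
# fall in a functional category of 5 genes, out of a universe of 20 genes.
p_value = fisher_exact_enrichment(overlap=3, list_size=4,
                                  category_size=5, universe_size=20)
print(p_value)
```

The result depends strongly on the reference gene universe, so the same overlap can be significant against one background set and not against another.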
Canonical pathways analysis identified the pathways from the Ingenuity Pathways Analysis library of canonical pathways that were most significant to the data set. The significance of the association between the data set and the canonical pathway was measured in two ways: 1) the ratio of the number of genes from the data set that map to the pathway to the total number of genes that map to the canonical pathway; and 2) a p-value, calculated using Fisher's exact test, giving the probability that the association between the genes in the dataset and the canonical pathway is explained by chance alone.

Integration of genomic and gene expression data

To integrate gene expression with copy number data, we first determined the genes amplified two or more times in the 27 lines. For each gene locus, we scored copy number status and classified each line as amplified or neutral in copy number. Neutral copy number status was defined using aCGH-Smooth (Jong et al., 2004) as previously described (Chari et al., 2006; Lockwood et al., 2007). Amplified copy number status was defined using the criteria described above. We then identified the Affymetrix probe sets corresponding to these genes, and filtered for probes demonstrating a present or marginal quality score in at least 50% of the samples with amplification of the specific gene. Gene expression data were then compared between the amplified and neutral copy number groups for each gene using the Mann-Whitney U test to identify genes that were overexpressed in the amplified samples with a p-value <0.05.

References:

Chari R, Lockwood WW, Coe BP, Chu A, Macey D, Thomson A et al (2006). SIGMA: a system for integrative genomic microarray analysis of cancer genomes. BMC Genomics 7: 324.
Choi JS, Zheng LT, Ha E, Lim YJ, Kim YH, Wang YP et al (2006). Comparative genomic hybridization array analysis and real-time PCR reveals genomic copy number alteration for lung adenocarcinomas. Lung 184: 355-62.
Choi YW, Choi JS, Zheng LT, Lim YJ, Yoon HK, Kim YH et al (2007). Comparative genomic hybridization array analysis and real time PCR reveals genomic alterations in squamous cell carcinomas of the lung. Lung Cancer 55: 43-51.
Coe BP, Lee EH, Chi B, Girard L, Minna JD, Gazdar AF et al (2006). Gain of a region on 7p22.3, containing MAD1L1, is the most frequent event in small-cell lung cancer cell lines. Genes Chromosomes Cancer 45: 11-9.
Fridlyand J, Snijders A, Pinkel D, Albertson DG, Jain AN (2004). Hidden Markov models approach to the analysis of array CGH data. J. Multivar. Anal. 90: 132-53.
Garcia MJ, Pole JC, Chin SF, Teschendorff A, Naderi A, Ozdag H et al (2005). A 1 Mb minimal amplicon at 8p11-12 in breast cancer identifies new candidate oncogenes. Oncogene 24: 5235-45.
Garnis C, Lockwood WW, Vucic E, Ge Y, Girard L, Minna JD et al (2006). High resolution analysis of non-small cell lung cancer cell lines by whole genome tiling path array CGH. Int J Cancer 118: 1556-64.
Jong K, Marchiori E, Meijer G, Vaart AV, Ylstra B (2004). Breakpoint identification and smoothing of array comparative genomic hybridization data. Bioinformatics 20: 3636-7.
Lockwood WW, Coe BP, Williams AC, MacAulay C, Lam WL (2007). Whole genome tiling path array CGH analysis of segmental copy number alterations in cervical cancer cell lines. Int J Cancer 120: 436-43.
Tonon G, Wong KK, Maulik G, Brennan C, Feng B, Zhang Y et al (2005). High-resolution genomic profiles of human lung cancer. Proc Natl Acad Sci U S A 102: 9625-30.