Supporting Methods:
Identification of amplicons and amplification hotspots
The criteria used to identify amplified chromosomal segments by array CGH vary considerably
between studies. Amplifications are believed to arise under selective pressure to replicate the
same localized event multiple times, which should produce large ratio changes for regions of
the genome less than a chromosome arm in size (Fridlyand et al., 2004). Single copy gains, in
contrast, produce small ratio changes. Therefore, direct thresholding
of array data at a particular ratio is commonly used to determine array elements that are
amplified and distinguish them from low level gains. However, the threshold value differs
greatly between studies, ranging from a log2 ratio of 0.4 in some (Tonon et al., 2005) to as high as 1.0 in
others (Garcia et al., 2005; Garnis et al., 2006). A threshold commonly used by our group and
others is a log2 ratio of 0.8 (Choi et al., 2006; Choi et al., 2007; Lockwood et al., 2007). To
determine the stringency of this threshold for the hybridization profiles in our dataset, we
calculated the average log2 ratio and standard deviation for each individual profile for a segment
of normal copy number (defined as the three unique sets of ten contiguous clones with the
minimal sum of |average ratio| + standard deviation). From this calculation, a log2 ratio of 0.8
corresponded to a minimum separation of 13 standard deviations from the normal copy number
segment for the sample with the greatest amount of experimental noise (range 13 to 64 standard
deviations, median of 29). We therefore concluded that a direct threshold of 0.8 represents
a very stringent criterion for defining amplified chromosomal segments and is
appropriate for use in our particular dataset.
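The stringency calculation can be illustrated with a minimal sketch. It assumes that the "three unique sets" are three non-overlapping ten-clone windows, that their values are pooled to estimate the noise of the normal copy number segment, and that separation is measured as (threshold - mean) / standard deviation. None of these details are stated in the text, and the function names are hypothetical.

```python
from statistics import mean, stdev

def quietest_windows(ratios, width=10, n_windows=3):
    """Start indices of the n non-overlapping windows with the lowest |mean| + SD."""
    scored = []
    for start in range(len(ratios) - width + 1):
        window = ratios[start:start + width]
        scored.append((abs(mean(window)) + stdev(window), start))
    scored.sort()
    chosen = []
    for _, start in scored:
        # Assumption: "unique" is read as "non-overlapping".
        if all(abs(start - s) >= width for s in chosen):
            chosen.append(start)
        if len(chosen) == n_windows:
            break
    return sorted(chosen)

def threshold_separation(ratios, threshold=0.8, width=10, n_windows=3):
    """Distance of the threshold from the normal segment, in standard deviations."""
    starts = quietest_windows(ratios, width, n_windows)
    # Assumption: the three windows are pooled into one noise estimate.
    pooled = [r for s in starts for r in ratios[s:s + width]]
    return (threshold - mean(pooled)) / stdev(pooled)
```

Applied to each hybridization profile in turn, the smallest value returned would correspond to the sample with the most experimental noise.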
To identify high level amplifications, the following algorithm was used. First, the array CGH
data were filtered to exclude clones with standard deviations between replicate values >0.075.
Next, each clone was determined to be a member of a high level amplification if its log2 ratio
was ≥0.8 (Choi et al., 2006; Choi et al., 2007; Lockwood et al., 2007). Clones were then
analyzed using moving averages over sliding windows of three, five and seven clones; if the
average of a window exceeded 0.8, its central clone was also determined to be amplified. To define the
boundaries of amplified regions, we then examined clones adjacent to those identified as
amplified by the method described above. As array elements may display partial ratio
response to a copy number alteration if the clone does not completely overlap with the gain, we
analyzed neighbouring clones to identify those that demonstrated a log2 ratio greater than or
equal to 0.4 (one half of the initial threshold). If the clone exhibited this level of partial
response, it was also determined to be part of the amplicon. Finally, to prevent outliers from
splitting a single amplification event into multiple regions, all clones with amplified
neighbours on both sides were included in the same amplicon.
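The steps of the algorithm can be sketched as follows. This is an illustrative reading, not the original implementation: it assumes a single chromosome with clones ordered by position, applies the moving averages to the filtered profile, and extends boundaries by one immediate neighbour on each side only. The function name and signature are hypothetical.

```python
def call_amplified(ratios, rep_sd, high=0.8, partial=0.4, max_sd=0.075):
    """Return the original indices of clones called amplified.

    ratios: log2 ratios ordered by genomic position (one chromosome assumed).
    rep_sd: standard deviation between replicate values for each clone.
    """
    # Step 1: exclude clones with replicate SD > max_sd.
    kept_idx = [i for i, sd in enumerate(rep_sd) if sd <= max_sd]
    values = [ratios[i] for i in kept_idx]
    n = len(values)

    # Step 2: direct threshold on individual clones.
    amp = {i for i, r in enumerate(values) if r >= high}

    # Step 3: moving averages of 3, 5 and 7 clones; call the central clone.
    for width in (3, 5, 7):
        half = width // 2
        for c in range(half, n - half):
            if sum(values[c - half:c + half + 1]) / width > high:
                amp.add(c)

    # Step 4: add immediate neighbours showing a partial response (>= 0.4).
    # Assumption: one pass only, so the extension does not propagate further.
    for i in set(amp):
        for j in (i - 1, i + 1):
            if 0 <= j < n and values[j] >= partial:
                amp.add(j)

    # Step 5: include clones with amplified neighbours on both sides.
    for i in range(1, n - 1):
        if i - 1 in amp and i + 1 in amp:
            amp.add(i)

    return [kept_idx[i] for i in sorted(amp)]
```

Consecutive runs of the returned indices would then form the amplicons whose flanking base pair positions are used downstream.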
To identify hotspots of copy number amplification, the base pair positions of each amplicon in
every sample were determined as described above. These segments were then mapped to the
same genomic coordinate backbone (hg15) in order to determine the incidence of amplification
across the entire human genome for all 104 samples. The amplification incidence was then
plotted against genomic position using SeeGH Frequency Plot (Coe et al., 2006) to generate a
histogram of recurrent amplification. To define the criteria for an amplification hotspot, we
determined a statistically significant frequency threshold by modeling the occurrence of
amplification using a Poisson distribution. From this calculation, it was found that an incidence
of three was statistically significant (p<0.05). To be stringent and reduce the effect of outliers, a
threshold of five (4.8%, p<0.0008) was used to define hotspots across the 104 samples. The
base pair positions of the array elements flanking the region of minimal amplicon overlap were
used to determine the hotspot boundaries. The RefSeq database was then used to determine the
genes mapping within the genomic coordinates of each amplification hotspot. The SeeGH
software package is available upon request at: http://www.flintbox.ca/.
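The choice of a frequency threshold from a Poisson model can be sketched as below. The text does not report the rate parameter or how it was estimated, so `lam` is a hypothetical input (for example, a mean number of amplifications per locus across samples), and the value used in the example call is purely illustrative.

```python
import math

def poisson_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam)."""
    below = sum(math.exp(-lam) * lam ** i / math.factorial(i) for i in range(k))
    return 1.0 - below

def min_significant_count(lam, alpha=0.05):
    """Smallest incidence k whose upper-tail probability falls below alpha."""
    k = 0
    while poisson_tail(k, lam) >= alpha:
        k += 1
    return k

# Illustrative rate only; not a value taken from the study.
example_threshold = min_significant_count(lam=0.7, alpha=0.05)
```

A stricter cutoff, such as the incidence of five used here, simply corresponds to a smaller tail probability under the same model.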
Statistical analysis of amplification hotspots and fragile site co-localization
Chromosome band locations for fragile sites were obtained from the National Center for
Biotechnology Information Entrez Gene database (Bethesda, MD, USA) and classified as rare or
common. Only common fragile sites on autosomes were used in subsequent analysis, and the
chromosome bands at which they were located were denoted fragile site bands (FSBs).
Chromosome bands that harboured an amplification hotspot were flagged as hotspot bands
(HSBs). The proportion of HSBs co-localized with FSBs was compared to the proportion of
HSBs on all autosomal bands using a chi-squared test to determine if there was a statistically
significant enrichment of amplification hotspots within fragile site regions.
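One way to set up such a comparison is a 2x2 contingency table of autosomal bands (FSB or not, HSB or not) tested with Pearson's chi-squared statistic. The sketch below assumes that layout and no continuity correction, neither of which is specified in the text, and the counts in the example call are invented for illustration.

```python
import math

def chi2_2x2(a, b, c, d):
    """Pearson chi-squared test (1 degree of freedom) for the table [[a, b], [c, d]].

    a: FSB and HSB        b: FSB, not HSB
    c: non-FSB and HSB    d: non-FSB, not HSB
    """
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Upper-tail probability of a chi-squared variable with 1 degree of freedom.
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p

# Made-up band counts, for illustration only.
chi2, p = chi2_2x2(20, 10, 10, 20)
```

With small expected counts in any cell, an exact test would be the more cautious choice.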
Functional assessment of amplified genes
The functional and canonical pathway analyses were generated through the use of Ingenuity
Pathways Analysis (Ingenuity® Systems, www.ingenuity.com). Functional Analysis identified
the biological functions and diseases that were most significant to the data set. Genes from the
data set that met the amplification count cutoff (amplified five or more times) and were
associated with biological functions and/or diseases in the Ingenuity Pathways Knowledge Base
were considered for the analysis. Fisher’s exact test was used to calculate a p-value
giving the probability that each biological function and/or disease assigned to the data set
is due to chance alone. Canonical pathways analysis identified the pathways from the Ingenuity
Pathways Analysis library of canonical pathways that were most significant to the data set. The
significance of the association between the data set and the canonical pathway was measured in two
ways: 1) the ratio of the number of genes from the data set that map to the pathway to the
total number of genes that map to the canonical pathway; and 2) a p-value, calculated using Fisher’s
exact test, for the probability that the association between the genes in
the data set and the canonical pathway is explained by chance alone.
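The two pathway measures can be illustrated with a short sketch. It assumes a one-sided (over-representation) Fisher's exact test computed from the hypergeometric distribution against a reference gene set of size N; the exact reference set and test sidedness used by the software are not described here, and the function name is hypothetical.

```python
from math import comb

def pathway_enrichment(k, n, K, N):
    """Ratio and one-sided Fisher's exact p-value for pathway over-representation.

    k: data set genes that map to the pathway
    n: genes in the data set
    K: genes in the canonical pathway
    N: genes in the reference set (assumed background)
    """
    ratio = k / K
    # Probability of observing k or more pathway genes by chance.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1))
    return ratio, tail / comb(N, n)
```

For example, `pathway_enrichment(3, 5, 5, 20)` evaluates a data set of 5 genes, 3 of which fall in a 5-gene pathway, against a background of 20 genes.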
Integration of genomic and gene expression data
To integrate gene expression with copy number data, we first identified genes amplified two or
more times in the 27 lines. For each gene locus, we scored copy number status, classifying each
line as amplified or neutral in copy number. Neutral copy number status was defined using
aCGH-Smooth (Jong et al., 2004) as previously described (Chari et al., 2006; Lockwood et al.,
2007). Amplified copy number status was defined using the criteria described above. We then
identified the Affymetrix probe sets corresponding to these genes, and filtered for probes
demonstrating a present or marginal quality score in at least 50% of the samples with
amplification of the specific gene. Gene expression data were then compared between amplified
and neutral copy number groups for each gene using the Mann-Whitney U test to identify those
that were overexpressed in the amplified samples with a p-value < 0.05.
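The per-gene comparison can be sketched with a one-sided Mann-Whitney U test. The version below uses the normal approximation with a continuity correction and no tie correction, which is an assumption made to keep the example self-contained; an exact test may be preferable for small groups. The probe quality filter is not included, and the function name is hypothetical.

```python
import math

def mann_whitney_greater(amplified, neutral):
    """One-sided Mann-Whitney U test that expression is higher in amplified lines.

    Returns (U, p) using the normal approximation.
    """
    n1, n2 = len(amplified), len(neutral)
    # U counts pairs where the amplified value is larger; ties count one half.
    u = sum(1.0 if a > b else 0.5 if a == b else 0.0
            for a in amplified for b in neutral)
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mu - 0.5) / sigma  # continuity correction
    p = 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail normal probability
    return u, p
```

A gene would be retained as overexpressed when the returned p-value is below 0.05. Where SciPy is available, `scipy.stats.mannwhitneyu(amplified, neutral, alternative="greater")` performs the same kind of test.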
References:
Chari R, Lockwood WW, Coe BP, Chu A, Macey D, Thomson A et al (2006). SIGMA: a system
for integrative genomic microarray analysis of cancer genomes. BMC Genomics 7: 324.
Choi JS, Zheng LT, Ha E, Lim YJ, Kim YH, Wang YP et al (2006). Comparative genomic
hybridization array analysis and real-time PCR reveals genomic copy number alteration for lung
adenocarcinomas. Lung 184: 355-62.
Choi YW, Choi JS, Zheng LT, Lim YJ, Yoon HK, Kim YH et al (2007). Comparative genomic
hybridization array analysis and real time PCR reveals genomic alterations in squamous cell
carcinomas of the lung. Lung Cancer 55: 43-51.
Coe BP, Lee EH, Chi B, Girard L, Minna JD, Gazdar AF et al (2006). Gain of a region on
7p22.3, containing MAD1L1, is the most frequent event in small-cell lung cancer cell lines.
Genes Chromosomes Cancer 45: 11-9.
Fridlyand J, Snijders A, Pinkel D, Albertson DG, Jain AN (2004). Hidden Markov models
approach to the analysis of array CGH data. J Multivar Anal 90: 132-53.
Garcia MJ, Pole JC, Chin SF, Teschendorff A, Naderi A, Ozdag H et al (2005). A 1 Mb minimal
amplicon at 8p11-12 in breast cancer identifies new candidate oncogenes. Oncogene 24: 5235-45.
Garnis C, Lockwood WW, Vucic E, Ge Y, Girard L, Minna JD et al (2006). High resolution
analysis of non-small cell lung cancer cell lines by whole genome tiling path array CGH. Int J
Cancer 118: 1556-64.
Jong K, Marchiori E, Meijer G, Vaart AV, Ylstra B (2004). Breakpoint identification and
smoothing of array comparative genomic hybridization data. Bioinformatics 20: 3636-7.
Lockwood WW, Coe BP, Williams AC, MacAulay C, Lam WL (2007). Whole genome tiling
path array CGH analysis of segmental copy number alterations in cervical cancer cell lines. Int J
Cancer 120: 436-43.
Tonon G, Wong KK, Maulik G, Brennan C, Feng B, Zhang Y et al (2005). High-resolution
genomic profiles of human lung cancer. Proc Natl Acad Sci U S A 102: 9625-30.