Cross-Species DNA Copy Number Analyses Identifies Multiple 1q21-q23 Subtype-Specific
Driver Genes for Breast Cancer
Grace O. Silva1,2,3, Xiaping He3, Joel S. Parker1,3, Michael L. Gatza1,3, Lisa A. Carey3,4, Jack P.
Hou5,6, Stacy L. Moulder7, Paul K. Marcom8, Jian Ma5,9, Jeffrey M. Rosen10, and Charles M.
Perou1,2,3,#
1 Department of Genetics, University of North Carolina, Chapel Hill, NC 27599 USA; email: silvag@email.unc.edu (GOS)
2 Curriculum in Bioinformatics and Computational Biology, University of North Carolina, Chapel Hill, NC 27599 USA
3 Lineberger Comprehensive Cancer Center, University of North Carolina, Chapel Hill, NC 27599 USA; email: xiaping_he@med.unc.edu (XH); parkerjs@email.unc.edu (JSP); mgatza@email.unc.edu (MLG)
4 Department of Medicine, Division of Hematology/Oncology, University of North Carolina, Chapel Hill, NC 27599 USA; email: lisa_carey@email.unc.edu (LAC)
5 Department of Bioengineering, University of Illinois at Urbana-Champaign, Urbana, IL 61801 USA; email: jackhou2@illinois.edu (JPH)
6 Medical Scholars Program, University of Illinois at Urbana-Champaign, Urbana, IL 61801 USA
7 Department of Breast Medical Oncology, Division of Cancer Medicine, University of Texas MD Anderson Cancer Center, Houston, TX 301438 USA; email: smoulder@mdanderson.org (SLM)
8 Department of Medicine, Division of Oncology, Duke University, Durham, NC 27710 USA; email: marco001@mc.duke.edu (PKM)
9 Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, IL 61801 USA; email: jianma@illinois.edu (JM)
10 Department of Molecular and Cellular Biology, Baylor College of Medicine, Houston, TX 77030 USA; email: jrosen@bcm.edu (JMR)
#Corresponding Author
Charles M. Perou, PhD
Lineberger Comprehensive Cancer Center
450 West Drive, CB7295
University of North Carolina, Chapel Hill, NC, 27599, USA
Tel. 919-843-5740, Fax 919-843-5718, cperou@med.unc.edu
Methods
Breast cancer tumor datasets
For these comparative studies, two human datasets and one mouse dataset were used that
contained both gene expression and DNA copy number data (Table 1). The two human datasets
were: (1) tumors collected at the University of North Carolina at Chapel Hill and the Oslo
University Hospital, Radiumhospitalet, Norway (“UNC”, n=159, GSE52173), and (2) The
Cancer Genome Atlas (TCGA) Project dataset [1] (“TCGA”, n=485, https://tcga-data.nci.nih.gov/tcga/). The third dataset contained tumors from numerous mouse mammary
tumor models including GEM mammary models with inactivation of TP53, BRCA1, BRG1, and
over-expression of cMYC, HER2/ERBB2/Neu, PyMT and WNT1 (“mouse”, n = 73, GSE52173)
(Supplementary Table 1). All newly collected human samples from UNC were obtained using IRB-approved protocols, and all mouse samples were collected in accordance with IACUC guidelines. DNA copy
number data was collected from three platforms: Illumina 660-Quad SNP arrays (UNC),
Affymetrix 6.0 SNP arrays (TCGA), and Agilent oligonucleotide-based DNA copy number
arrays (mouse). The publicly available level 3 segmented copy number data for the TCGA
dataset was downloaded through the TCGA data portal (https://tcga-data.nci.nih.gov/tcga/) and
the published PAM50 subtype calls were used [1]. The UNC human expression dataset, along
with the mouse expression dataset, are available at the Gene Expression Omnibus (GEO).
Demographic and clinical characteristics of the UNC tumors are provided in Supplementary
Table 2.
For the UNC human array comparative genomic hybridization copy number studies,
Illumina 660-Quad SNP arrays were used according to the manufacturer’s protocol. Quality
control and data analysis were performed using GenomeStudio Genotyping Module v1.0
software (Illumina, Inc., San Diego, CA). SNP clustering and genotyping were performed on all
samples. Replicates (having a genome-wide genotype correlation >99%) and samples with call
rates <85% were removed. B allele frequencies (the proportion of minor (‘B’) alleles to total (‘A’
+ ‘B’) alleles) were calculated and the base-2 log ratio of the total intensity (‘A’ + ‘B’ alleles) of
the subject over the total intensity of the normal (LRR values) were extracted; in accordance
with the consent form, we have deposited the LRR values into the GEO (GSE52173). Genome-wide
DNA copy number segmentation was determined using the sup-Wald identification of DNA-copy changes method (SWITCHdna) [2]. For each sample, SWITCHdna identifies transition
points within the relative copy number data then averages all values within a segment to
determine that segment’s overall relative copy number, and calculates an associated Z-score for
each segment across all samples [2]. To highlight and plot frequently occurring CNAs within an
assigned group, SWITCHdna implements a post-segmentation plotting function. Specifically,
segmented data from each sample within an assigned class are aligned with one another, the
overlapping gain and loss segments are grouped, and the frequency of occurrence calculated
relative to all samples within the group. The overlapping of segments from samples within the
same group highlights common CNAs across the genome. One caveat is that the resulting group
segments are shorter, so potential driver genes (especially long ones) may span multiple
segments and therefore be filtered out by this strategy. To address this
concern, Supplemental Table 6 contains the subtype-specific significance scores of all genes
tested, and is an additional resource for finding potentially important genes that span multiple
high-frequency segments.
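The two per-SNP quantities defined above can be illustrated with a minimal Python sketch. The function names and the example intensities are hypothetical, and the sketch does not reproduce the normalization performed by GenomeStudio; it only restates the definitions of the B allele frequency and the LRR value.

```python
import math

def b_allele_frequency(a_intensity, b_intensity):
    """Proportion of 'B' allele signal relative to the total ('A' + 'B') signal."""
    total = a_intensity + b_intensity
    if total <= 0:
        raise ValueError("total intensity must be positive")
    return b_intensity / total

def log_r_ratio(subject_total, normal_total):
    """Base-2 log ratio of the subject's total intensity over the normal's."""
    if subject_total <= 0 or normal_total <= 0:
        raise ValueError("intensities must be positive")
    return math.log2(subject_total / normal_total)

# Hypothetical SNP: subject intensities A = 1.5 and B = 0.5; normal total = 1.0
baf = b_allele_frequency(1.5, 0.5)
lrr = log_r_ratio(1.5 + 0.5, 1.0)
```

Under this definition, a positive LRR indicates more total signal in the subject than in the normal reference, and a negative LRR indicates less.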
For the UNC mouse aCGH study, Agilent 244,000 feature microarrays were used
according to the manufacturer’s protocol and a 2-color/sample strategy. All mouse tumor
samples were assayed versus FVB normal mouse DNA and the R/G ratio obtained, which is the
relative measure of DNA copy number (GSE52173). The R/G ratios were first lowess
normalized, and segmentation was then performed using SWITCHdna.
For the UNC human and mouse gene expression studies, Agilent expression microarrays
were used as previously described [3, 4]; in this study and for the UNC human sample set, there
were 30 new gene expression arrays (GSE52173). All microarray and patient clinical data are
available in the UNC Microarray Database (UMD; https://genome.unc.edu/) and GEO. The
probes were filtered by requiring the lowest normalized intensity values in both the sample and
control to be >10. The normalized base-2 log ratios of the Cy5 sample/Cy3 control from probes
mapping to the same gene (Entrez ID) were averaged to create a gene expression matrix.
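The probe filtering and per-gene averaging can be sketched for a single array as follows. This is an illustration under stated assumptions rather than the pipeline actually used: the function name, the tuple layout of the probe records, and the example values are all hypothetical.

```python
from collections import defaultdict

def average_probes_by_gene(probes, min_intensity=10.0):
    """Average log2(Cy5/Cy3) ratios of probes that map to the same Entrez ID.

    probes: iterable of (entrez_id, sample_intensity, control_intensity, log2_ratio).
    A probe is kept only if the lower of its two channel intensities exceeds
    min_intensity.
    """
    by_gene = defaultdict(list)
    for entrez_id, sample_int, control_int, log2_ratio in probes:
        if min(sample_int, control_int) > min_intensity:
            by_gene[entrez_id].append(log2_ratio)
    return {gene: sum(vals) / len(vals) for gene, vals in by_gene.items()}

# Hypothetical probe records for one array
probes = [
    ("672", 120.0, 95.0, 0.5),
    ("672", 80.0, 60.0, 1.5),
    ("7157", 8.0, 50.0, -2.0),   # dropped: sample channel intensity is below 10
    ("7157", 40.0, 30.0, -1.0),
]
gene_values = average_probes_by_gene(probes)
```

Repeating this for every array and collecting the per-gene values column by column would give a gene expression matrix of the kind described.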
Cross-species assessment of subtype-specific changes in genomic DNA copy number
To identify subtype-specific CNAs from segmentation data generated by the various copy
number array platforms, we produced an add-on script to the SWITCHdna method of DNA copy
number change point detection [2]. We created an R suite of functions called SWITCHplus,
which can identify segments of the genome with copy number changes specific for a
user-determined set of tumors, thus providing a supervised method for analyzing copy number data.
SWITCHplus is provided as a source script in R and available for download at:
https://genome.unc.edu/SWITCHplus/. SWITCHplus begins with the input of identified
segments of CNAs from SWITCHdna [2] (or other segmentation tools) and the associated
relative copy number value for each segment across the entire genome. Similar to SWITCHdna,
external supervising information (i.e. subtype or disease group information) was used to
aggregate the data and create a genome-wide CNA frequency landscape plot. Regions not
specifically associated with a subtype/group are then displayed in gray while subtype-specific
CNAs are colored in red for gains, and green for losses.
Using the hg19 or mm9 build annotation, downloaded (October 2012) from the UCSC
genome browser (http://genome.ucsc.edu/) [5], genes were selected if they fell completely within
an identified segment of CNA, and all genes within a given segment were assigned the copy
number value associated with that segment. Next, for each subtype in each data set, we
performed a t-test on each gene’s copy number value from all the samples of that subtype against
the copy number values for that same gene in all other samples not of that subtype. Genes with a
p-value less than 0.05 were selected and labeled as “subtype-specific”. Meanwhile, genes that
met this significance threshold across multiple subtypes were labeled “subtype-associated”.
Note that we did not perform multiple hypothesis testing corrections, as we chose alternative
biologically based filtering criteria (Figure 1) based upon cross-species conservation. However,
to address the false discovery rate (FDR), we permuted the data 1,000 times and calculated an
FDR of 0.12 (not depicted) for a random selection of conserved genes within subtype-specific
segments occurring at a greater than or equal to 15% frequency (stage 3 of the pipeline). By
continuing down the pipeline, we further decrease the false positive genes by filtering out genes
without functional implications (Supplemental Table 3).
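The per-gene, one-subtype-versus-rest test can be sketched in Python as below. Several choices here are assumptions and not taken from SWITCHplus: the text does not state which t-test variant was used, so Welch's unequal-variance form is shown; the p-value is obtained by numerically integrating the Student's t density so that the example needs only the standard library; and all function names, gene names, and values are hypothetical.

```python
import math
from statistics import mean, variance

def t_two_sided_p(t_stat, df, steps=2000):
    """Two-sided Student's t p-value via Simpson integration of the density."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))

    def pdf(x):
        return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))

    upper = abs(t_stat)
    if upper == 0:
        return 1.0
    h = upper / steps  # steps must be even for Simpson's rule
    area = pdf(0.0) + pdf(upper)
    for i in range(1, steps):
        area += (4 if i % 2 else 2) * pdf(i * h)
    area *= h / 3
    return max(0.0, 1.0 - 2.0 * area)

def welch_t_test(in_subtype, others):
    """Welch's t-test (assumed variant); each group needs at least two values."""
    n1, n2 = len(in_subtype), len(others)
    v1, v2 = variance(in_subtype) / n1, variance(others) / n2
    se2 = v1 + v2
    if se2 == 0:
        raise ValueError("both groups have zero variance")
    t_stat = (mean(in_subtype) - mean(others)) / math.sqrt(se2)
    df = se2 ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t_stat, t_two_sided_p(t_stat, df)

def subtype_specific_genes(copy_number, labels, subtype, alpha=0.05):
    """Return genes whose copy number differs between one subtype and the rest."""
    hits = []
    for gene, values in copy_number.items():
        inside = [v for v, lab in zip(values, labels) if lab == subtype]
        outside = [v for v, lab in zip(values, labels) if lab != subtype]
        _, p_value = welch_t_test(inside, outside)
        if p_value < alpha:
            hits.append(gene)
    return hits

# Hypothetical gene-level copy number values for seven samples
labels = ["Basal", "Basal", "Basal", "LumA", "LumA", "LumA", "LumA"]
copy_number = {
    "GENE_A": [0.9, 1.1, 1.0, 0.0, 0.1, -0.1, 0.05],
    "GENE_B": [0.1, -0.1, 0.0, 0.05, -0.05, 0.1, -0.1],
}
basal_specific = subtype_specific_genes(copy_number, labels, "Basal")
```

Running the same function once per subtype and counting how many subtypes flag a given gene would separate the “subtype-specific” genes from the “subtype-associated” ones.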
To compare CNAs cross-species, all genes from a given segment were assigned the mean
segment value; this was repeated for every sample and segment, thereby creating a copy number
gene matrix for each species. Subsequently, the resulting gene lists from each matrix were filtered
to the set of genes shared between the two species. Next, focusing on the mouse
genes present on the intersecting gene list, we used the hg19 gene annotation to directly annotate
the mouse genes into human genomic order for a direct comparison of CNAs between the two
species. Conserved CNAs were identified as copy number segments that contained at least one
subtype-specific altered gene that overlapped with both species when the two genomes were
aligned into the same genomic order.
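A minimal Python sketch of the gene-level assignment and cross-species alignment follows. It assumes that mouse genes have already been mapped to their human gene names, a step the text does not detail; the function names, coordinates, and gene names are hypothetical.

```python
def genes_within_segments(genes, segments):
    """Assign each gene the value of the segment that completely contains it.

    genes: dict of name -> (chrom, start, end)
    segments: list of (chrom, start, end, value)
    Genes that are not fully inside any single segment are left out.
    """
    assigned = {}
    for name, (chrom, g_start, g_end) in genes.items():
        for s_chrom, s_start, s_end, value in segments:
            if chrom == s_chrom and s_start <= g_start and g_end <= s_end:
                assigned[name] = value
                break
    return assigned

def order_by_human_position(mouse_values, human_values, human_genes):
    """Keep genes present in both species, sorted by human chromosome and start.

    Chromosome names are compared as plain strings here, which is enough for
    this single-chromosome example.
    """
    shared = set(mouse_values) & set(human_values)
    return sorted(shared, key=lambda g: (human_genes[g][0], human_genes[g][1]))

# Hypothetical human annotation and segments for one sample
human_genes = {
    "GENE_A": ("chr1", 100, 200),
    "GENE_B": ("chr1", 300, 400),
    "GENE_C": ("chr1", 450, 700),  # spans two segments, so it is not assigned
}
human_segments = [("chr1", 50, 500, 0.8), ("chr1", 501, 900, -0.3)]

# Hypothetical mouse annotation, keyed by the matching human gene name
mouse_genes = {
    "GENE_B": ("chr3", 10, 60),
    "GENE_A": ("chr3", 500, 650),
    "GENE_D": ("chr3", 700, 800),
}
mouse_segments = [("chr3", 0, 1000, 0.5)]

human_values = genes_within_segments(human_genes, human_segments)
mouse_values = genes_within_segments(mouse_genes, mouse_segments)
shared_in_human_order = order_by_human_position(mouse_values, human_values, human_genes)
```

GENE_C shows the caveat for long genes: because it crosses a segment boundary, it receives no value and drops out of the comparison.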
Computational analysis of candidate driver genes within conserved CNAs
In order to identify putative driver alterations within regions of copy number gains or
losses, we began with all the conserved CNAs with a subtype segment frequency of 15% or
greater. To distinguish putative drivers from passengers, three further criteria were used. We first
identified genes within a CNA that demonstrate concordance between the DNA and RNA
expression (i.e. expression of a given gene correlated with the relative copy number of that gene,
be it a loss or gain). A Pearson correlation was used, for each gene, to examine the relationship
between the gene’s LRR copy number value and base-2 log expression value across all patients,
and the resulting p-values were adjusted for multiple testing using the Benjamini-Hochberg
method. Genes with p-values > 0.05 were removed and the remaining concordant genes were
identified by having a positive correlation value. The second criterion filtered for conserved
CNAs that contained genes with a breast cell line RNAi-associated phenotype as published in the
Solimini et al. 2012 RNAi screen on Human Mammary Epithelial Cells [6]; namely if the gene
increased proliferation it was labeled as a growth enhancer and oncogene, or “GO gene”,
whereas a suppressor of tumorigenesis and/or proliferation was labeled as a “STOP gene”. The
third criterion was to identify top ranking genes when scored using DawnRank [7]. In this
method, we used a larger (and inclusive) TCGA cohort (n=815) and DNA copy number changes
were treated as discrete input variables (either amplified, normal, or lost), to determine whether
DNA copy number changes at the gene level perturbed the expression of other genes in the
network. DawnRank ranks the list of perturbed genes, in a single sample, based on the gene's
impact on the expression changes of downstream genes in the network. Genes with a high
downstream impact are considered more likely to be drivers. DawnRank was run for each sample
in the cohort, and an additional cohort-level DawnRank score was calculated using Condorcet
rank aggregation. In many supplemental tables and figures we show these criteria separately for
each CNA, highlighting the genes within a segment that have any or all of these “filtering”
properties.
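The first criterion (DNA/RNA concordance) can be sketched in Python as below. The sketch takes per-gene Pearson correlations and raw p-values as given inputs, applies the Benjamini-Hochberg adjustment, and keeps genes with a positive correlation. It assumes the 0.05 cutoff is applied to the adjusted p-values; the function names and example numbers are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

def concordant_genes(results, alpha=0.05):
    """Keep genes with adjusted p <= alpha and a positive correlation.

    results: list of (gene, pearson_r, raw_p_value)
    """
    adjusted = benjamini_hochberg([p for _, _, p in results])
    return [gene for (gene, r, _), q in zip(results, adjusted)
            if q <= alpha and r > 0]

# Hypothetical per-gene correlation results
results = [
    ("GENE_A", 0.62, 0.001),
    ("GENE_B", -0.40, 0.01),   # significant but negatively correlated
    ("GENE_C", 0.15, 0.20),    # not significant after adjustment
    ("GENE_D", 0.35, 0.03),
]
adjusted = benjamini_hochberg([p for _, _, p in results])
kept = concordant_genes(results)
```

In this example GENE_B is removed despite its small p-value, because a negative correlation means expression does not track copy number in the expected direction.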
References
1. The Cancer Genome Atlas Network (2012) Comprehensive molecular portraits of human breast tumours. Nature 490:61–70. doi: 10.1038/nature11412
2. Weigman VJ, Chao H-H, Shabalin AA, et al. (2012) Basal-like Breast cancer DNA copy number losses identify genes involved in genomic instability, response to therapy, and patient survival. Breast Cancer Res Treat 133:865–880. doi: 10.1007/s10549-011-1846-y
3. Prat A, Parker JS, Karginova O, et al. (2010) Phenotypic and molecular characterization of the claudin-low intrinsic subtype of breast cancer. Breast Cancer Research 12:R68. doi: 10.1186/bcr2635
4. Herschkowitz JI, Zhao W, Zhang M, et al. (2012) Comparative oncogenomics identifies breast tumors enriched in functional tumor-initiating cells. Proceedings of the National Academy of Sciences of the United States of America 109:2778–83. doi: 10.1073/pnas.1018862108
5. Kent WJ, Sugnet CW, Furey TS, et al. (2002) The Human Genome Browser at UCSC. Genome Research 12:996–1006. doi: 10.1101/gr.229102
6. Solimini NL, Xu Q, Mermel CH, et al. (2012) Recurrent hemizygous deletions in cancers may optimize proliferative potential. Science 337:104–109. doi: 10.1126/science.1219580
7. Hou JP, Ma J (2014) DawnRank: discovering personalized driver genes in cancer. Genome Medicine 6:56. doi: 10.1186/s13073-014-0056-8