Cross-Species DNA Copy Number Analyses Identifies Multiple 1q21-q23 Subtype-Specific Driver Genes for Breast Cancer Grace O. Silva1,2,3, Xiaping He3, Joel S. Parker1,3, Michael L. Gatza1,3, Lisa A. Carey3,4, Jack P. Hou5,6, Stacy L. Moulder7, Paul K. Marcom8, Jian Ma5,9, Jeffrey M. Rosen10, and Charles M. Perou1,2,3,# 1 Department of Genetics, University of North Carolina, Chapel Hill, NC 27599 USA email: silvag@email.unc.edu (GOS) 2 Curriculum in Bioinformatics and Computational Biology, University of North Carolina, Chapel Hill, NC 27599 USA 3 Lineberger Comprehensive Cancer Center, University of North Carolina, Chapel Hill, NC 27599 USA email: xiaping_he@med.unc.edu (XH); parkerjs@email.unc.edu (JSP); mgatza@email.unc.edu (MLG) 4 Department of Medicine, Division of Hematology/Oncology, University of North Carolina, Chapel Hill, NC 27599 USA email: lisa_carey@email.unc.edu (LAC) 5 Department of Bioengineering, University of Illinois at Urbana-Champaign, Urbana, IL 61801 USA email: jackhou2@illinois.edu (JPH) 6 Medical Scholars Program, University of Illinois at Urbana-Champaign, Urbana, IL 61801 USA 7 Department of Breast Medical Oncology, Division of Cancer Medicine, University of Texas MD Anderson Cancer Center, Houston, TX 301438 USA email: smoulder@mdanderson.org (SLM) 8 Department of Medicine, Division of Oncology, Duke University, Durham, NC 27710 USA email: marco001@mc.duke.edu (PKM) 9 Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, IL 61801 USA email: jianma@illinois.edu (JM) 10 Department of Molecular and Cellular Biology, Baylor College of Medicine, Houston, TX 77030 USA email: jrosen@bcm.edu (JMR) #Corresponding Author Charles M. Perou, PhD Lineberger Comprehensive Cancer Center 450 West Drive, CB7295 University of North Carolina, Chapel Hill, NC, 27599, USA Tel. 919-843-5740, Fax 919-843-5718, cperou@med.unc.edu 1 Methods Breast cancer tumor datasets For these comparative studies, two human datasets and one mouse dataset were used that contained both gene expression and DNA copy number data (Table 1). The two human datasets were: (1) tumors collected at the University of North Carolina at Chapel Hill and the Oslo University Hospital, Radiumhospitalet, Norway (“UNC”, n=159, GSE52173), and (2) The Cancer Genome Atlas (TCGA) Project dataset [1] (“TCGA”, n=485, https://tcgadata.nci.nih.gov/tcga). The third dataset contained tumors from numerous mouse mammary tumor models including GEM mammary models with inactivation of TP53, BRCA1, BRG1, and over-expression of cMYC, HER2/ERBB2/Neu, PyMT and WNT1 (“mouse”, n = 73, GSE52173) (Supplementary Table 1). All newly collected human samples from UNC were done using IRBapproved protocols, and all mouse samples in accordance with IACUC guidelines. DNA copy number data was collected from three platforms: Illumina 660-Quad SNP arrays (UNC), Affymetrix 6.0 SNP arrays (TCGA), and Agilent oligonucleotide-based DNA copy number arrays (mouse). The publically available level 3 segmented copy number data for the TCGA dataset was downloaded through the TCGA data portal (https://tcga-data.nci.nih.gov/tcga/) and the published PAM50 subtype calls were used [1]. The UNC human expression dataset, along with the mouse expression dataset, are available at the Gene Expression Omnibus (GEO). Demographic and clinical characteristics of the UNC tumors are provided in Supplementary Table 2. For the UNC human array comparative genomic hybridization copy number studies, Illumina 660-Quad SNP arrays were used according to the manufacturer’s protocol. Quality control and data analysis were performed using GenomeStudio Genotyping Module v1.0 software (Illumina, Inc., San Diego, CA). SNP clustering and genotyping were performed on all samples. Replicates (having a genome-wide genotype correlation >99%) and samples with call rates <85% were removed. B allele frequencies (the proportion of minor (‘B’) alleles to total (‘A’ + ‘B’) alleles) were calculated and the base-2 log ratio of the total intensity (‘A’ + ‘B’ alleles) of the subject over the total intensity of the normal (LRR values) were extracted; in accordance with consent form, we have deposited the LRR values into the GEO (GSE52173). Genome-wide DNA copy number segmentation was determined using the sup-Wald identification of DNAcopy changes method (SWITCHdna) [2]. For each sample, SWITCHdna identifies transition 2 points within the relative copy number data then averages all values within a segment to determine that segment’s overall relative copy number, and calculates an associated Z-score for each segment across all samples [2]. To highlight and plot frequently occurring CNAs within an assigned group, SWITCHdna implements a post-segmentation plotting function. Specifically, segmented data from each sample within an assigned class are aligned with one another, the overlapping gain and loss segments are grouped, and the frequency of occurrence calculated relative to all samples within the group. The overlapping of segments from samples within the same group highlights common CNAs across the genome; however, group segments are smaller in length and provide a caveat that potential driver genes (especially those of larger lengths) may span multiple segments and therefore will be filtered out by this strategy. To address this concern, Supplemental Table 6 contains the subtype-specific significance scores of all genes tested, and is an additional resource to search for potentially important genes that span multiple high frequent segments. For the UNC mouse aCGH study, Agilent 244,000 feature microarrays were used according to the manufacturer’s protocol and a 2-color/sample strategy. All mouse tumor samples were assayed versus FVB normal mouse DNA and the R/G ratio obtained, which is the relative measure of DNA copy number (GSE52173). The R/G ratios were first lowess normalized then segmentation performed using SWITCHdna. For the UNC human and mouse gene expression studies, Agilent expression microarrays were used as previously described [3, 4]; in this study and for the UNC human sample set, there were 30 new gene expression arrays (GSE52173). All microarray and patient clinical data are available in the UNC Microarray Database (UMD; https://genome.unc.edu/) and GEO. The probes were filtered by requiring the lowest normalized intensity values in both the sample and control to be >10. The normalized base-2 log ratio of the Cy5 sample/Cy3 control from probes mapping to the same gene (Entrez ID) were averaged to create a gene expression matrix. Cross-species assessment of subtype-specific changes in genomic DNA copy number To identify subtype-specific CNAs from segmentation data generated by the various copy number array platforms, we produced an add-on script to the SWITCHdna method of DNA copy number change point detection [2]. We created an R suite of functions called SWITCHplus, 3 which can identify segments of the genome with copy number changes specific for a user determined set of tumors, thus providing a supervised method for analyzing copy number data. SWITCHplus is provided as a source script in R and available for download at: https://genome.unc.edu/SWITCHplus/. SWITCHplus begins with the input of identified segments of CNAs from SWITCHdna [2] (or other segmentation tools) and the associated relative copy number value for each segment across the entire genome. Similar to SWITCHdna, external supervising information (i.e. subtype or disease group information) was used to aggregate the data and create a genome-wide CNA frequency landscape plot. Regions not specifically associated with a subtype/group are then displayed in gray while subtype-specific CNAs are colored in red for gains, and green for losses. Using the hg19 or mm9 build annotation, downloaded (October 2012) from the UCSC genome browser (http://genome.ucsc.edu/) [5], genes were selected if they fell completely within an identified segment of CNA, and all genes within a given segment were assigned the copy number value associated with that segment. Next, for each subtype in each data set, we performed a t-test on each gene’s copy number value from all the samples of that subtype against the copy number values for that same gene in all other samples not of that subtype. Genes with a p-value less than 0.05 were selected and labeled as “subtype-specific”. Meanwhile, genes that met this significance threshold across multiple subtypes were labeled “subtype-associated”. Note, that we did not perform multiple hypothesis testing corrections as we chose alternative biologically-based filtering criteria (Figure 1) based upon cross-species conservation. However, to address the false discovery rate (FDR), we permuted the data a 1000 times and calculated a FDR of 0.12 (not depicted) for a random selection of conserved genes within subtype-specific segments occurring at a greater than or equal to 15% frequency (stage 3 of the pipeline). By continuing down the pipeline, we further decrease the false positive genes by filtering out genes without functional implications (Supplemental Table 3). To compare CNAs cross-species, all genes from a given segment were assigned the mean segment value and then repeated for every sample and segment thereby creating a copy number gene matrix for each species. Subsequently, the resulting gene lists from each matrix was filtered to a set that was matching and overlapping between the two species. Next, focusing on the mouse genes present on the intersecting gene list, we used the hg19 gene annotation to directly annotate the mouse genes into human genomic order for a direct comparison of CNAs between the two 4 species. Conserved CNAs were identified as copy number segments that contained at least one subtype-specific altered gene that overlapped with both species when the two genomes were aligned into the same genomic order. Computational analysis of candidate driver genes within conserved CNAs In order to identify putative driver alterations within regions of copy number gains or losses, we began with all the conserved CNAs with a subtype segment frequency of 15% or greater. To distinguish putative drivers from passengers, three further criteria were used. We first identified genes within a CNA that demonstrate concordance between the DNA and RNA expression (i.e. expression of a given gene correlated with the relative copy number of that gene, be it a loss or gain). A Pearson correlation was used, for each gene, to examine the relationship between the gene’s LRR copy number value and base-2 log expression value across all patients, and the resulting p-values adjusted for multiple test correlation using the Benjamini Hochberg method. Genes with p-values > 0.05 were removed and the remaining concordant genes were identified by having a positive correlation value. The second criterion filtered for conserved CNAs that contained genes with a breast cell line RNAi-associated phenotype as published in the Solimini et al. 2012 RNAi screen on Human Mammary Epithelial Cells [6]; namely if the gene increased proliferation it was labeled as a growth enhancer and oncogene, or “GO gene”, whereas a suppressor of tumorigenesis and/or proliferation was labeled as a “STOP gene”. The third criterion was to identify top ranking genes when scored using DawnRank [7]. In this method, we used a larger (and inclusive) TCGA cohort (n=815) and DNA copy number changes were treated as discrete input variables (either amplified, normal, or lost), to determine whether DNA copy number changes on a gene level, perturbed the expression of other genes in the network. DawnRank ranks the list of perturbed genes, in a single sample, based on the gene's impact on the expression changes of downstream genes in the network. Genes with a high downstream impact are considered more likely to be drivers. DawnRank was run for each sample in the cohort, and an additional cohort-level DawnRank score was calculated using Condorcet rank aggregation. In many supplemental tables and figures we show these criteria separately for each CNA, highlighting the genes within a segment that have any or all of these “filtering” properties. 5 References 1. 2. 3. 4. 5. 6. 7. Cancer T, Atlas G (2012) Comprehensive molecular portraits of human breast tumours. Nature 490:61–70. doi: 10.1038/nature11412 Weigman VJ, Chao H-H, Shabalin AA, et al. (2012) Basal-like Breast cancer DNA copy number losses identify genes involved in genomic instability, response to therapy, and patient survival. Breast Cancer Res Treat 113:865–880. doi: 10.1007/s10549-011-1846-y Prat A, Parker JS, Karginova O, et al. (2010) Phenotypic and molecular characterization of the claudin-low intrinsic subtype of breast cancer. Breast Cancer Research 12:R68. doi: 10.1186/bcr2635 Herschkowitz JI, Zhao W, Zhang M, et al. (2012) Comparative oncogenomics identifies breast tumors enriched in functional tumor-initiating cells. Proceedings of the National Academy of Sciences of the United States of America 109:2778–83. doi: 10.1073/pnas.1018862108 Kent WJ, Sugnet CW, Furey TS, et al. (2002) The Human Genome Browser at UCSC. Genome Research 12:996–1006. doi: 10.1101/gr.229102 Solimini NL, Xu Q, Mermel CH, et al. (2012) Recurrent hemizygous deletions in cancers may optimize proliferative potential. Science 337:104–109. doi: 10.1126/science.1219580 Hou JP, Ma J (2014) DawnRank: discovering personalized driver genes in cancer. Genome Medicine 6:56. doi: 10.1186/s13073-014-0056-8 6