SNAPR: a bioinformatics pipeline for efficient and accurate RNA-seq alignment and analysis Andrew T. Magis1,2, Cory C. Funk1, and Nathan D. Price1,2§ 1Institute 2Center for Systems Biology, Seattle, WA for Biophysics and Computational Biology, University of Illinois, Urbana, IL Supplemental Section SNAPR Alignment SNAPR incorporates the annotation directly into the alignment process to improve alignment accuracy. Each mate (one for single-end, two for paired-end) is aligned to both the genome and the transcriptome independently, creating a set of unique putative alignment positions for mate(s) to both indices. The final alignment position and subsequent mapping score is determined based on criteria described below. Aligning to both the transcriptome and the genome simultaneously serves a dual purpose: it ensures that all possible alignment positions for a mate are considered, and it allows paired end reads to cross transcriptome/genome boundaries. While an annotation provides critical prior knowledge about the likelihood of alignment positions, gene boundaries are often truncated at the 5’ and 3’ ends. SNAPR can also align paired end reads for which one mate occurs in an intron and the other in an exon, a situation resulting from unannotated exons or sequenced pre-mRNA. The current version of SNAPR is designed for genomes with curated annotations, and does not attempt to find novel splice junctions. SNAPR Error Model The mapping quality (MAPQ) field in the SAM file format specification is defined as −10 log10 𝑋 where X is the probability that the mapping position is incorrect. This value is rounded off to the nearest integer in the SAM file. Accurately calculating MAPQ necessitates the identification of all alternative alignment positions for a read to a given assembly. Genomic read aligners such as BWA and novoalign estimate this value by finding as many alternative alignments positions as possible in a reasonable amount of time while reporting the highest quality alignment. Other factors such as repetitiveness of the genome and base qualities are also taken into account. As reported in the companion paper, SNAP effectively calculates the MAPQ value for genomic alignments such that Area Under the Receiver Operator Characteristic (AUROC) is among the highest of all aligners tested. This accuracy is enabled by SNAP’s ability to rapidly find alternative alignments using the hash index. Transcriptomic alignments present greater difficulty in estimating MAPQ, due to the fact that reads can be spliced in many different ways and therefore accurately estimating alternative alignment positions is difficult. Tophat, for example, only reports five values for MAPQ in the output SAM file: 0, 1, 2, 3 and 50. If two alignments are identified for a read, the probability of any one of the alignments being correct is assumed to be 0.5, irrespective of mismatches, yielding a MAPQ value of −10 log10 0.5 = 3, rounded off to the nearest integer. A MAPQ value of 50 is reported if an alignment is unique. As such, the MAPQ value is not particularly useful in determining whether a given read is accurately aligned. Ultimately, most end users are interested in mapping uniqueness, but alignment uniqueness is not well defined when one allows for mismatches between reads and their target. Without allowing for mismatches, any read containing a sequencing error or derived from a transcript containing any form of a single nucleotide variant (SNV) will be discarded as an imperfect match. Alternatively, allowing for N mismatches in a read of length N enables the alignment of any read to any position, regardless of sequence. Aligners must therefore strike a balance between these two extremes, allowing for a reasonable number of mismatches that preserves the ability to identify variation while constraining alignments enough to yield meaningful information. Compounding the problem is the repetitiveness of the target genome, which leads to high sequence similarity in distal regions, potentially confounding even conservative alignment algorithms. SNAPR makes use of an error model in which the uniqueness of a particular alignment is determined through a confidence threshold that is defined by the user. All alignments are identified within a userspecified edit distance (-d) between the read and the reference, and reverse sorted by the number of mismatches. If the top-scoring alignment has at least n fewer mismatches than the second-best scoring alignment, where n is a confidence interval specified by the user (-c), the alignment is considered unique and is reported with the SNAP-computed MAPQ value. Otherwise, the alignment is considered non-unique and is reported with a MAPQ value of 1. All analyses in this paper were generated using the default confidence interval of 2. Alignment Classification A major feature of the SNAPR pipeline is the manner in which the annotation is incorporated into the categorization of putative alignments. All valid alignments for each read are compared to the list of annotated genes. If both mates align within the boundaries of an annotated gene, that alignment is categorized as an intra-gene alignment. Putative alignments that are not intra-gene but occur on the same chromosome are categorized as intra-chromosomal. Finally, putative alignments that cross chromosome boundaries are categorized as inter-chromosomal. Each read may have valid alignments that occur in one or more of these categories. SNAPR prioritizes alignments in the following order: intra-gene, intra-chromosomal, and inter-chromosomal. If a read generates an alignment that is categorized as intragene within the user-specified error model, no alignments in any of the other categories are considered valid, even if those alignments contain fewer mismatches with the reference genome. Similarly, if a read generates no valid alignment in the intra-gene category, but a valid alignment in the intra-chromosomal category exists, no alignments are considered from the inter-chromosomal category. This categorization scheme is designed to leverage the likelihood of alignments occurring within annotated genes while biasing the algorithm against inter-chromosomal alignments. As a result, any intra- or inter-chromosomal alignments that do pass through these filters are subsequently more likely to result from real biological events. Preparation of genome and transcriptome indices SNAPR indices were generated using the GRCh37 genome assembly and the Ensembl v68 human genome annotation using the following commands: snapr index Homo_sapiens.GRCh37.68.genome.fa genome20 –t16 snapr transcriptome Homo_sapiens.GRCh37.68.gtf Homo_sapiens.GRCh37.68.genome.fa transcriptome20 –t16 – O1000 Preparation of contamination database All sequenced viral genomes (1376 genomes), all sequenced fungal genomes (35 genomes), and all sequenced bacterial genomes (2646 genomes) were downloaded from the NCBI FTP site (ftp.ncbi.nih.gov/genomes). A custom Python script was used to select one bacterial genome from each genus to be part of the contamination database, for a total of 1145 genomes. If all bacterial genomes were included the size of the contaminant database would exceed 232 bases, the current limit on genome size for SNAPR. Note that we do not claim this method to be the most effective way to identify bacterial contaminants. A more targeted approach involving common or expected contaminants would be the most effective, and SNAPR allows the user to specify any contaminant database they create. Our default contaminant database is freely available on our Amazon EC2 snapshot. The genome FASTA files described above were concatenated into a single file. All spaces in the header sequences for each genome were replaced by underscores so they would be part of the SNAPR index rather than truncated. The contamination database was built using the following command: snapr index contamination_db.fa contamination20 –t16 Sample analysis Each sample in BAM format was downloaded directly to an Amazon EC2 cr1.8xlarge instance using CGHub GeneTorrent v3.8.5 and the sample analysis_id. Analysis_ids were obtained as described for each cancer type below. SNAPR genome and transcriptome indices were copied to local (ephemeral) storage from an EBS volume. Each sample was processed using the following command (replacing ${UUID} with the analysis_id of the sample): snapr paired /mnt/snap/genome20 /mnt/snap/transcriptome20 /mnt/snap/Homo_sapiens.GRCh37.68.gtf *.bam -o ${UUID}.snap.bam -t 16 -M -c 2 -d 4 -rg ${UUID} -I -ct /mnt/snap/contamination20 All output files were stored back to a shared EBS volume for downstream processing. No external software packages were used in any analysis step, although some custom Python scripts were used to compile the output for this study. All scripts are provided in our Amazon EC2 snapshot. Generation of Mason Reads The complete Ensembl v68 transcriptome was generated from the GRCh37 human genome assembly using a custom Python script. For each transcript in the annotation, a spliced FASTQ sequence was written to the output file transcriptome.fa The VCF file containing the Venter genome was converted into transcriptome coordinates, such that if a variant overlapped with one or more annotated transcripts, one variant per transcript appears in the output VCF file. In this manner all variants were included in annotated transcripts. For example, the following heterozygous variant occurs in exon 4 of the gene EGFR (Ensembl ID: ENSG00000146648) in the Venter VCF file: CHR 7 POS ID 55214348 . REF C ALT T QUAL 200 FILTER INFO PASS . FORMAT GT GT 1|0 There are eight Ensembl v68 EGFR transcripts that intersect this position, but only seven of them are in exons, so seven variants appear in the output file in transcriptome coordinates. POS fields now reflect the spliced distance from the beginning of the transcript. CHR POS ID REF ALT QUAL FILTER INFO FORMAT 634 . C T 200 PASS . GT 651 . C T 200 PASS . GT 720 . C T 200 PASS . GT 719 . C T 200 PASS . GT 718 . C T 200 PASS . GT 622 . C T 200 PASS . GT 498 . C T 200 PASS . GT Reads were generated using Mason with the following lines of code, using FASTQ file of the entire Ensembl v68 transcriptome as the input ‘genome’: ENST00000442591 ENST00000275493 ENST00000342916 ENST00000344576 ENST00000420316 ENST00000450046 ENST00000454757 GT 1|0 1|0 1|0 1|0 1|0 1|0 1|0 the mason illumina –vcf venter.ensembl.vcf -N 20000000 --matepairs -n 100 -sq --output-file mason.venter.100.transcriptome.fastq transcriptome.fa Genomic reads were generated from the standard Venter VCF coordinates and the GRCh37 genome assembly using the following command: mason illumina -vcf venter.vcf -N 5000000 --mate-pairs -n 100 -sq --output-file mason.venter.100.genome.fastq Homo_sapiens.GRCh37.68.genome.fa Pseudogene reads were generated in the same manner as ordinary transcriptome reads, except only sequences from the Ensembl v68 annotation were used that had a ‘source’ field of one of the following: IG_C_pseudogene IG_J_pseudogene IG_V_pseudogene polymorphic_pseudogene processed_pseudogene pseudogene transcribed_processed_pseudogene transcribed_unprocessed_pseudogene TR_J_pseudogene TR_V_pseudogene unitary_pseudogene unprocessed_pseudogene Alignment with STAR, Tophat2/Bowtie2, and Subjunc STAR STAR v2.3.0e was used for all analyses. The STAR index was built using the GRCh37 human genome assembly and Ensembl v68 annotation using the following command, using an overhang of 99 as indicated by the STAR manual for optimal alignments of 100 bp reads: STAR --runMode genomeGenerate --genomeDir star_db/GRCh37.overhang99 --genomeFastaFiles Homo_sapiens.GRCh37.68.genome.fa --sjdbGTFfile Homo_sapiens.GRCh37.68.gtf --sjdbOverhang 99 --runThreadN 16 STAR alignments were performed using the following command, replacing ${FILE1} and ${FILE} with the proper filenames and ${PREFIX} with the output filename prefix: STAR --genomeDir star_db/GRCh37.overhang99 --readFilesIn ${FILE1} ${FILE2} --runThreadN 16 --outFilterMismatchNmax 8 --outFileNamePrefix ${PREFIX}.star. --outFilterMultimapNmax 10 Subjunc Subread/Subjunc v1.3.6-p1 was used for all analyses. The Subread index was built using the GRCh37 human genome assembly using the following command: subread-buildindex -o GRCh37 Homo_sapiens.GRCh37.68.genome.fa Reads were aligned to this index with the following command, replacing ${FILE1} and ${FILE} with the proper filenames and ${PREFIX} with the output filename prefix: subjunc -i GRCh37 -r ${FILE1} –R ${FILE1} -o ${PREFIX}.sam –T 16 Tophat2 Bowtie2 v2.1.0 and Tophat2 v2.0.9 were used for all analyses. The Bowtie2 index was built for the GRCh37 genome assembly using the following command: bowtie2-build –f Homo_sapiens.GRCh37.68.genome.fa GRCh37.genome Alignments with Tophat2 were performed using the following command (using the most sensitive search option and not considering novel splice junctions), replacing ${FILE1} and ${FILE} with the proper filenames and ${PREFIX} with the output filename prefix: tophat --b2-very-sensitive --output-dir ${PREFIX} -p 16 -r 100 --mate-std-dev 50 -g 10 -N 4 --read-edit-dist 4 -library-type fr-unstranded --no-novel-juncs -G Homo_sapiens.GRCh37.68.gtf --transcriptome-index GRCh37.transcriptome GRCh37.genome ${FILE1} ${FILE2} Sorting and BAM conversion (for STAR, Subjunc, and Tophat2) SAMtools 0.1.19 was used for sorting and BAM conversion. samtools view -uST Homo_sapiens.GRCh37.68.genome.fa ${PREFIX}.sam | samtools sort - ${PREFIX} Read group assignment with Picard (not counted in running times) SNAPR can automatically assign read groups to the output BAM file, as can Tophat, but STAR and Subjunc do not have this functionality. It should be noted that although read group assignment was not included in the running time for the paper, this step is still necessary for many downstream analyses. java -jar AddOrReplaceReadGroups.jar I=${PREFIX}.bam O=${PREFIX}.RG.bam SORT_ORDER=coordinate CREATE_INDEX=true RGPL=Illumina RGLB=Library RGPU=Platform RGID=1 RGSM=${PREFIX} VALIDATION_STRINGENCY=LENIENT Mapping Quality Reassignment (not counted in running times) STAR uses MAPQ=255 to refer to ‘uniquely’ mapped reads, similar to old versions of Tophat. The standard interpretation of MAPQ=255 is ‘unknown’ mapping quality, and GATK will ignore these reads by default. It is therefore necessary to reassign these mapping qualities. It should be noted that although reassignment of mapping qualities was not included in the running time for the paper, this step is still necessary for many downstream analyses. java -jar GenomeAnalysisTK.jar -T PrintReads -rf ReassignOneMappingQuality -RMQT 60 -RMQF 255 -I ${PREFIX}.RG.bam -o ${PREFIX}.RG.MAPQ.bam Variant Calling with Genome Analysis Toolkit UnifiedGenotyper The reader may wonder why the UnifiedGenotyper was used instead of the HaplotypeCaller. At the time of this study, the HaplotypeCaller does not support the ‘N’ CIGAR operator which indicates splicing, and therefore cannot be used for RNAseq data: java -jar GenomeAnalysisTK.jar -R Homo_sapiens.GRCh37.68.genome.fa -T UnifiedGenotyper -I ${PREFIX}.bam -o ${PREFIX}.vcf -stand_call_conf 30.0 stand_emit_conf 30.0 -nt 16 Generation of read counts SNAPR generates gene read counts automatically as part of the alignment process. Read counts for the other three alignment algorithms were generated in the following manner, replacing ${PREFIX} with the appropriate SAM file prefix: STAR and Tophat2/Bowtie2 HTSeq-count 0.5.4p5 was used to generate read counts. STAR and Tophat2 both use a similar scoring scheme for multiple-aligned reads. Calculation of MAPQ is determined by the probability of an alignment being incorrect, assuming all alignments are equally probable. If a read has two alignments, then MAPQ = -10 log (0.5) = 3, rounded off to the nearest integer. If a read has three alignments, then MAPQ = -10 log (0.667) = 2, and so on. Therefore MAPQ values < 4 are considered multiply-aligned, and were filtered out using the ‘–a 4’ option to htseq-count. htseq-count -m union -s no -a 4 -i gene_id -t exon -q ${PREFIX}.sam Homo_sapiens.GRCh37.68.gtf > ${PREFIX}.gene_id.counts.txt Subjunc Subjunc uses it’s own internal application to generate read counts, with no option to filter multiple-aligned reads: featureCounts -p -T 16 -a Homo_sapiens.GRCh37.68.gtf -t exon -g gene_id -i ${PREFIX}.sam -o ${PREFIX}.gene_id.counts.txt Stomach Adenocarcinoma (STAD) 312 RNA-seq samples of stomach adenocarcinoma were identified from the UCSC Cancer Genomics Hub (https://cghub.ucsc.edu) using the following cgquery command: cgquery "platform=ILLUMINA&center_name=BCCAGSC&library_strategy=RNA -Seq&state=live&disease_abbr=STAD" All samples were sequenced to 100bp length on an Illumina sequencer using the paired end protocol at the British Columbia Cancer Agency Genome Sciences Centre (BCCAGSC). Infections and Contaminants As described in the main text, detectable levels (>10 reads) of Epstein-Barr virus type 1 (human herpesvirus 4) were identified in 70 samples of STAD, while 33 samples contained detectable levels of Epstein-Barr virus type 2. 17 samples contained EBV type 1 as the strongest identifiable contaminant all bacteria, viral, and fungal genomes. 41 samples contained detectable levels of cytomegalovirus (human herpesvirus 5). One sample contained cytomegalovirus as the top contaminant out of all bacteria, viral, and fungal genomes. Out of all the cancers tested for this study, STAD contained the highest number of species of bacteria, fungi, and viruses strongly detected in the samples, with 31 different genera or species appearing at least once in the top position of the contaminant report. Propionibacterium, Cupriavidus, Shigella, and Pseudomonas were the top bacterial genera represented in the STAD samples. One of these genera was the top contaminant in 117 of the samples. Genus Propionibacterium contains some of the most common skin bacteria in humans. Genus Shigella contains common human pathogens that infect the host intestine. Genus Pseudomonas contains some of the most common hospital-acquired infectious bacteria. The fungus Thielavia terrestris was also strongly represented in the samples, occurring as the top contaminant in 45 of the STAD samples. Intra- and Inter-gene Fusions SNAPR identified many fusions that involve genes commonly associated with stomach adenocarcinoma, many of which appeared in multiple samples. For brevity we only discuss a few of the most clinically relevant fusions here; complete results can be found in the supplement spreadsheets. Note that all fusions in the spreadsheet are duplicated in order to make visualization easier: all fusions can then be sorted by each gene participating in the fusion to easily visualize clusters of fusions. Therefore there are exactly one half as many fusions predicted as listed in the spreadsheet. There was evidence for ERBB2 (HER2) fusions in many STAD samples, which is known to be mutated, amplified and overexpressed in gastric cancers1,2. One sample contained evidence for ERBB2-RARA gene fusion; RARA (retinoic acid receptor alpha) is commonly fused in acute promyelocytic leukemia. IGF2 (insulin-like growth factor 2) was strongly predicted to be involved in gene fusions in stomach adenocarcinoma. IGF2 is commonly overexpressed in gastric cancers3. There is strong evidence for a PVT1-QRSL1 fusion in one sample, involving tens of thousands of reads. PVT1 is an oncogene frequently involved in translocations in Burkitt’s lymphomas4 Glioblastoma multiforme (GBM) 159 RNA-seq samples of glioblastoma multiforme were identified from the UCSC Cancer Genomics Hub (https://cghub.ucsc.edu) using the following cgquery command: cgquery "platform=ILLUMINA&center_name=UNCLCCC&library_strategy=RNA-Seq&state=live&disease_abbr=GBM" Only the 58 samples samples with identified fusions in Frattini et al were used in downstream analysis of GBM samples for this study5. All samples were sequenced to 76bp length on an Illumina sequencer using the paired end protocol at the University of North Carolina Lineberger Comprehensive Cancer Center (UNC-LCCC). Infections and Contaminants Out of all the cancers tested for this study, GBM contained the fewest number of genera and species in the top position in the contaminant report. The number one contaminant was the fungus Aspergillus oryzae RIB40 contig in 55 out of 58 samples tested. There were also very few viral-mapping reads identified, with only a few species represented. Intra- and Inter-gene Fusions Note that all fusions in the spreadsheet are duplicated in order to make visualization easier: all fusions can then be sorted by each gene participating in the fusion to easily visualize clusters of fusions. Therefore there are exactly one half as many fusions predicted as listed in the spreadsheet. These samples were analyzed in order to compare with those discovered in Frattini et al.5. We discover 86% of the identical fusions as that study, and an additional 6% partial fusions (where one of the two genes is predicted to be fused). EGFR (epidermal growth factor receptor) is involved in many gene fusions in GBM, often fused to SEPT14 (septin 14) and SEC61G. SEPT14 is also commonly fused, as is LANCL2. We also discover the FGFR3-TACC3 fusion in two samples, as identified in Singh et al6. We additionally discover a fusion between FGFR3 and KCND3. Acute Myeloid Leukemia (LAML) Sample Protocol 117 RNA-seq samples of acute myeloid leukemia were identified from the UCSC Cancer Genomics Hub (https://cghub.ucsc.edu) using the following cgquery command: cgquery "platform=ILLUMINA&center_name=BCCAGSC&library_strategy=RNA -Seq&state=live&disease_abbr=LAML" All samples were sequenced to 100bp length on an Illumina sequencer using the paired end protocol at the British Columbia Cancer Agency Genome Sciences Centre (BCCAGSC). Infections and Contaminants LAML stands out from the other cancers as having the strongest evidence for contamination or infection of any of the cancer types by a bacteria genus. Genus Acinetobacter was found as the top contaminant in 10 of the 117 samples tested, containing an average of 5,687,367 reads mapping. Some samples contained over 13 million reads from this genus. This is easily the highest level of infection for any contaminant in any sample tested, two to three orders of magnitude greater than the average contaminant. Genus Acinetobacter contains species that are commonly associated with life-threatening infections in immunocompromised individuals. Other genera represented include Pseudomonas, Acidovorax, and Propionibacterium. Cytomegalovirus is detected in 48 samples, some of them quite strongly. Intra- and Inter-gene Fusions Note that all fusions in the spreadsheet are duplicated in order to make visualization easier: all fusions can then be sorted by each gene participating in the fusion to easily visualize clusters. Therefore there are exactly one half as many fusions predicted as listed in the spreadsheet. Few cancers are as associated with translocations than leukemia, and SNAPR discovers many of the canonical fusions in these samples. CBFB-MYH11 is identified in seven samples, which is commonly associated with acute myeloid leukemia. RUNX1-RUNX1T1 is identified in four samples, also commonly associated with acute myeloid leukemia. BCR-ABL1 is identified in two samples, which is associated with chronic myeloid leukemia, and PML-RARA is identified in seven samples, commonly associated with acute promelocytic leukemia). Ovarian Serous Cystadenocarcinoma (OV) Notes on analysis During analysis of the 808 samples we noticed that there were only 404 actual patient samples but that there were two BAM files for each sample. Each BAM file was assigned its own analysis_id. This accounts for the identical fusions identified for this cancer type. This demonstrates the consistency with which SNAPR is able to identify sample fusions, usually finding identical numbers of spliced and mated reads for each fusion between duplicate samples. Sample Protocol 808 RNA-seq samples of ovarian serous cystadenocarcinoma were identified from the UCSC Cancer Genomics Hub (https://cghub.ucsc.edu) using the following cgquery command: cgquery "platform=ILLUMINA&center_name=BCCAGSC&library_strategy=RNA -Seq&state=live&disease_abbr=OV" All samples were sequenced to 100bp length on an Illumina sequencer using the paired end protocol at the British Columbia Cancer Agency Genome Sciences Centre (BCCAGSC). Infections and Contaminants The top contaminants in OV samples are genera Pseudomonas and Propionibacterium, collectively representing the top contaminant in 741 out of 808 samples tested. Enterobacteria phage lambda is also strongly represented, appearing as the top contaminant in 43 samples. Enterobacteria phages are unusually highly represented in these samples, with detectable levels of the phages in every single sample tested. Many samples contain multiple species of these phages. Several samples of OV contain detectable levels of cytomegalovirus, with two samples containing appreciable levels of the virus. Intra- and Inter-gene Fusions SNAPR identified many fusions that involve genes commonly associated with ovarian cancer, many of which appeared in multiple samples. For brevity we only discuss the most clinically relevant fusions here; complete results can be found in the supplement spreadsheets. Note that all fusions in the spreadsheet are duplicated in order to make visualization easier: all fusions can then be sorted by each gene participating in the fusion to easily visualize clusters of fusions. Therefore there are exactly one half as many fusions predicted as listed in the spreadsheet. IGF2 (insulin-like growth factor 2) was strongly predicted to be involved in gene fusions. High expression of IGF2 has been associated with poor prognosis in advanced serous ovarian cancer7,8. Interestingly, IGF2 is commonly fused with GPX3 (glutathione peroxidase 3), which is commonly overexpressed in serous adenocarcinoma9 and suppressed in papillary serous ovarian cancer10. IGF2 is also commonly fused with H19, a ncRNA that is nearby on chromosome 11. H19 is a tumor suppressor that is maternally imprinted while nearby IGF2 is paternally imprinted7. H19 is also considered a tumor marker for serous ovarian carcinoma11. IGF2 is also commonly fused to BGN (biglycan). There is evidence for MUC16 fusions in eight serous ovarian adenocarcinoma samples. MUC16 is commonly overexpressed in ovarian cancers and has been associated with tumorogenesis and metastasis12. There is evidence for a WNT7A fusion in two samples. WNT7A is associated with tumor growth and progression in ovarian cancer through the WNT/β-catenin pathway13. There is evidence for XIAP (X-linked inhibitor of apoptosis) fusions in two samples, which is associated with evasion of apoptosis in ovarian cancer14. Bibliography 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14. Park, Y. S. et al. Comprehensive analysis of HER2 expression and gene amplification in gastric cancers using immunohistochemistry and in situ hybridization: which scoring system should we use? Hum. Pathol. 43, 413– 422 (2012). Lee, J. W. et al. ERBB2 kinase domain mutation in a gastric cancer metastasis. APMIS 113, 683–687 (2005). Wu, M. S. et al. Loss of imprinting and overexpression of IGF2 gene in gastric adenocarcinoma. Cancer Lett. 120, 9–14 (1997). Carramusa, L. et al. The PVT-1 oncogene is a Myc protein target that is overexpressed in transformed cells. J. Cell. Physiol. 213, 511–518 (2007). Frattini, V. et al. The integrated landscape of driver genomic alterations in glioblastoma. Nat. Genet. 45, 1141–1149 (2013). Singh, D. et al. Transforming fusions of FGFR and TACC genes in human glioblastoma. Science 337, 1231–1235 (2012). Murphy, S. K. et al. Frequent IGF2/H19 domain epigenetic alterations and elevated IGF2 expression in epithelial ovarian cancer. Mol. Cancer Res. 4, 283– 292 (2006). Sayer, R. A. et al. High insulin-like growth factor-2 (IGF-2) gene expression is an independent predictor of poor survival for patients with advanced stage serous epithelial ovarian cancer. Gynecol. Oncol. 96, 355–361 (2005). Lee, H. J. et al. Immunohistochemical evidence for the over-expression of Glutathione peroxidase 3 in clear cell type ovarian adenocarcinoma. Med. Oncol. 28 Suppl 1, S522–7 (2011). Agnani, D. et al. Decreased levels of serum glutathione peroxidase 3 are associated with papillary serous ovarian cancer and disease progression. J Ovarian Res 4, 18 (2011). Tanos, V. et al. Expression of the imprinted H19 oncofetal RNA in epithelial ovarian cancer. Eur. J. Obstet. Gynecol. Reprod. Biol. 85, 7–11 (1999). Thériault, C. et al. MUC16 (CA125) regulates epithelial ovarian cancer cell growth, tumorigenesis and metastasis. Gynecol. Oncol. 121, 434–443 (2011). Yoshioka, S. et al. WNT7A regulates tumor growth and progression in ovarian cancer through the WNT/β-catenin pathway. Mol. Cancer Res. 10, 469–482 (2012). Shaw, T. J., Lacasse, E. C., Durkin, J. P. & Vanderhyden, B. C. Downregulation of XIAP expression in ovarian cancer cells induces cell death in vitro and in vivo. Int. J. Cancer 122, 1430–1434 (2008).