Supplemental Section SNAPR Alignment

SNAPR: a bioinformatics pipeline for efficient and accurate RNA-seq
alignment and analysis
Andrew T. Magis1,2, Cory C. Funk1, and Nathan D. Price1,2§
1Institute for Systems Biology, Seattle, WA
2Center for Biophysics and Computational Biology, University of Illinois, Urbana, IL
Supplemental Section
SNAPR Alignment
SNAPR incorporates the annotation directly into the alignment process to
improve alignment accuracy. Each mate (one for single-end, two for paired-end) is
aligned to both the genome and the transcriptome independently, creating a set of
unique putative alignment positions for mate(s) to both indices. The final alignment
position and subsequent mapping score is determined based on criteria described
below. Aligning to both the transcriptome and the genome simultaneously serves a
dual purpose: it ensures that all possible alignment positions for a mate are
considered, and it allows paired end reads to cross transcriptome/genome
boundaries. While an annotation provides critical prior knowledge about the
likelihood of alignment positions, gene boundaries are often truncated at the 5’ and
3’ ends. SNAPR can also align paired end reads for which one mate occurs in an
intron and the other in an exon, a situation resulting from unannotated exons or
sequenced pre-mRNA. The current version of SNAPR is designed for genomes with
curated annotations, and does not attempt to find novel splice junctions.
SNAPR Error Model
The mapping quality (MAPQ) field in the SAM file format specification is defined as
−10 log10(X), where X is the probability that the mapping position is incorrect. This
value is rounded off to the nearest integer in the SAM file. Accurately calculating
MAPQ necessitates the identification of all alternative alignment positions for a read
to a given assembly. Genomic read aligners such as BWA and Novoalign estimate this
value by finding as many alternative alignment positions as possible in a
reasonable amount of time while reporting the highest quality alignment. Other
factors such as repetitiveness of the genome and base qualities are also taken into
account. As reported in the companion paper, SNAP effectively calculates the MAPQ
value for genomic alignments such that Area Under the Receiver Operator
Characteristic (AUROC) is among the highest of all aligners tested. This accuracy is
enabled by SNAP’s ability to rapidly find alternative alignments using the hash
index. Transcriptomic alignments present greater difficulty in estimating MAPQ, due
to the fact that reads can be spliced in many different ways and therefore accurately
estimating alternative alignment positions is difficult. Tophat, for example, only
reports five values for MAPQ in the output SAM file: 0, 1, 2, 3 and 50. If two
alignments are identified for a read, the probability of any one of the alignments
being correct is assumed to be 0.5, irrespective of mismatches, yielding a MAPQ
value of −10 log10 0.5 = 3, rounded off to the nearest integer. A MAPQ value of 50 is
reported if an alignment is unique. As such, the MAPQ value is not particularly
useful in determining whether a given read is accurately aligned. Ultimately, most
end users are interested in mapping uniqueness, but alignment uniqueness is not
well defined when one allows for mismatches between reads and their target.
Without allowing for mismatches, any read containing a sequencing error or derived
from a transcript containing any form of a single nucleotide variant (SNV) will be
discarded as an imperfect match. Alternatively, allowing for N mismatches in a read
of length N enables the alignment of any read to any position, regardless of
sequence. Aligners must therefore strike a balance between these two extremes,
allowing for a reasonable number of mismatches that preserves the ability to
identify variation while constraining alignments enough to yield meaningful
information. Compounding the problem is the repetitiveness of the target genome,
which leads to high sequence similarity in distal regions, potentially confounding
even conservative alignment algorithms. SNAPR makes use of an error model in
which the uniqueness of a particular alignment is determined through a confidence
threshold that is defined by the user. All alignments are identified within a user-specified edit distance (-d) between the read and the reference, and reverse sorted
by the number of mismatches. If the top-scoring alignment has at least n fewer
mismatches than the second-best scoring alignment, where n is a confidence
interval specified by the user (-c), the alignment is considered unique and is
reported with the SNAP-computed MAPQ value. Otherwise, the alignment is
considered non-unique and is reported with a MAPQ value of 1. All analyses in this
paper were generated using the default confidence interval of 2.
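The uniqueness rule described above can be sketched in a few lines of Python. This is an illustration only, not SNAPR's implementation: the function name, its arguments, and the representation of candidate alignments as a list of mismatch counts are assumptions. The handling of a read with a single candidate or with no candidate is also assumed, since the text does not specify it. The default of 4 for the edit distance mirrors the -d value used in the sample analysis command, and the default of 2 mirrors the stated default for -c.

```python
def classify_alignment(mismatch_counts, snap_mapq, max_edit_distance=4, confidence=2):
    """Return (is_unique, reported_mapq) for one read.

    mismatch_counts   : mismatches of each candidate alignment (hypothetical input)
    snap_mapq         : MAPQ computed by SNAP for the best candidate
    max_edit_distance : plays the role of -d
    confidence        : plays the role of -c
    """
    # Keep only candidates within the allowed edit distance, best first.
    candidates = sorted(m for m in mismatch_counts if m <= max_edit_distance)
    if not candidates:
        return (False, None)       # assumption: read is reported as unaligned
    if len(candidates) == 1:
        return (True, snap_mapq)   # assumption: a lone candidate is unique
    best, second = candidates[0], candidates[1]
    if second - best >= confidence:
        return (True, snap_mapq)   # best beats runner-up by at least `confidence`
    return (False, 1)              # non-unique alignments are reported with MAPQ 1
```

With the default confidence of 2, a read whose two best candidates have 0 and 2 mismatches is reported as unique, whereas candidates with 1 and 2 mismatches yield a non-unique alignment with MAPQ 1.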
Alignment Classification
A major feature of the SNAPR pipeline is the manner in which the annotation
is incorporated into the categorization of putative alignments. All valid alignments
for each read are compared to the list of annotated genes. If both mates align within
the boundaries of an annotated gene, that alignment is categorized as an intra-gene
alignment. Putative alignments that are not intra-gene but occur on the same
chromosome are categorized as intra-chromosomal. Finally, putative alignments that
cross chromosome boundaries are categorized as inter-chromosomal. Each read may
have valid alignments that occur in one or more of these categories. SNAPR
prioritizes alignments in the following order: intra-gene, intra-chromosomal, and
inter-chromosomal. If a read generates an alignment that is categorized as intra-gene within the user-specified error model, no alignments in any of the other
categories are considered valid, even if those alignments contain fewer mismatches
with the reference genome. Similarly, if a read generates no valid alignment in the
intra-gene category, but a valid alignment in the intra-chromosomal category exists,
no alignments are considered from the inter-chromosomal category. This
categorization scheme is designed to leverage the likelihood of alignments occurring
within annotated genes while biasing the algorithm against inter-chromosomal
alignments. As a result, any intra- or inter-chromosomal alignments that do pass
through these filters are subsequently more likely to result from real biological
events.
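The prioritization scheme above can be illustrated with a minimal Python sketch. It is not taken from the SNAPR source: the function names, the tuple layout for mates and genes, and the use of a single coordinate per mate are simplifying assumptions made for the example.

```python
PRIORITY = ["intra-gene", "intra-chromosomal", "inter-chromosomal"]

def categorize(mate1, mate2, genes):
    """Categorize one paired alignment.

    mate1, mate2 : (chromosome, position) tuples (hypothetical layout)
    genes        : list of (chromosome, start, end) annotated gene boundaries
    """
    if mate1[0] != mate2[0]:
        return "inter-chromosomal"
    for chrom, start, end in genes:
        # Both mates must fall within the boundaries of the same gene.
        if chrom == mate1[0] and start <= mate1[1] <= end and start <= mate2[1] <= end:
            return "intra-gene"
    return "intra-chromosomal"

def select_candidates(alignments, genes):
    """Keep only the alignments from the highest-priority non-empty category."""
    by_category = {c: [] for c in PRIORITY}
    for mate1, mate2 in alignments:
        by_category[categorize(mate1, mate2, genes)].append((mate1, mate2))
    for category in PRIORITY:
        if by_category[category]:
            return category, by_category[category]
    return None, []
```

In this sketch an intra-gene candidate suppresses all others regardless of mismatch count, which is the bias toward annotated genes that the text describes.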
Preparation of genome and transcriptome indices
SNAPR indices were generated using the GRCh37 genome assembly and the
Ensembl v68 human genome annotation using the following commands:
snapr index Homo_sapiens.GRCh37.68.genome.fa genome20 -t16
snapr transcriptome Homo_sapiens.GRCh37.68.gtf Homo_sapiens.GRCh37.68.genome.fa transcriptome20 -t16 -O1000
Preparation of contamination database
All sequenced viral genomes (1376 genomes), all sequenced fungal genomes (35
genomes), and all sequenced bacterial genomes (2646 genomes) were downloaded
from the NCBI FTP site (ftp.ncbi.nih.gov/genomes). A custom Python script was
used to select one bacterial genome from each genus to be part of the contamination
database, for a total of 1145 genomes. If all bacterial genomes were included the
size of the contaminant database would exceed 2^32 bases, the current limit on
genome size for SNAPR. Note that we do not claim this method to be the most
effective way to identify bacterial contaminants. A more targeted approach
involving common or expected contaminants would be the most effective, and
SNAPR allows the user to specify any contaminant database they create. Our default
contaminant database is freely available on our Amazon EC2 snapshot.
The genome FASTA files described above were concatenated into a single file. All
spaces in the header sequences for each genome were replaced by underscores so
they would be part of the SNAPR index rather than truncated. The contamination
database was built using the following command:
snapr index contamination_db.fa contamination20 -t16
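The two preparation steps described above (choosing one bacterial genome per genus and replacing spaces in FASTA headers) might look roughly like the following Python sketch. The custom script used for this study is not reproduced here. The function names are hypothetical, and the sketch assumes that the organism name is the text after the last '|' in the header and that the genus is its first word.

```python
def genus_of(header):
    """Extract the genus from a FASTA header.

    Assumes the organism name follows the last '|' and that the genus is
    the first word of that name.
    """
    description = header.lstrip(">").split("|")[-1].strip()
    return description.split()[0]

def pick_one_per_genus(headers):
    """Return the first header seen for each genus, preserving input order."""
    chosen = {}
    for header in headers:
        chosen.setdefault(genus_of(header), header)
    return list(chosen.values())

def sanitize_header(header):
    """Replace spaces with underscores so the full name is kept in the index."""
    return header.rstrip("\n").replace(" ", "_")
```

Selecting the first genome encountered for each genus is an arbitrary choice made for this example; the text does not state which genome was kept.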
Sample analysis
Each sample in BAM format was downloaded directly to an Amazon EC2 cr1.8xlarge
instance using CGHub GeneTorrent v3.8.5 and the sample analysis_id. Analysis_ids
were obtained as described for each cancer type below. SNAPR genome and
transcriptome indices were copied to local (ephemeral) storage from an EBS
volume. Each sample was processed using the following command (replacing
${UUID} with the analysis_id of the sample):
snapr paired /mnt/snap/genome20 /mnt/snap/transcriptome20
/mnt/snap/Homo_sapiens.GRCh37.68.gtf *.bam -o
${UUID}.snap.bam -t 16 -M -c 2 -d 4 -rg ${UUID} -I -ct
/mnt/snap/contamination20
All output files were stored back to a shared EBS volume for downstream
processing. No external software packages were used in any analysis step, although
some custom Python scripts were used to compile the output for this study. All
scripts are provided in our Amazon EC2 snapshot.
Generation of Mason Reads
The complete Ensembl v68 transcriptome was generated from the GRCh37 human
genome assembly using a custom Python script. For each transcript in the
annotation, a spliced FASTA sequence was written to the output file
transcriptome.fa.
The VCF file containing the Venter genome was converted into transcriptome
coordinates, such that if a variant overlapped with one or more annotated
transcripts, one variant per transcript appears in the output VCF file. In this manner
all variants were included in annotated transcripts. For example, the following
heterozygous variant occurs in exon 4 of the gene EGFR (Ensembl ID:
ENSG00000146648) in the Venter VCF file:
CHR  POS       ID  REF  ALT  QUAL  FILTER  INFO  FORMAT  GT
7    55214348  .   C    T    200   PASS    .     GT      1|0
There are eight Ensembl v68 EGFR transcripts that intersect this position, but only
seven of them are in exons, so seven variants appear in the output file in
transcriptome coordinates. POS fields now reflect the spliced distance from the
beginning of the transcript.
CHR              POS  ID  REF  ALT  QUAL  FILTER  INFO  FORMAT  GT
ENST00000442591  634  .   C    T    200   PASS    .     GT      1|0
ENST00000275493  651  .   C    T    200   PASS    .     GT      1|0
ENST00000342916  720  .   C    T    200   PASS    .     GT      1|0
ENST00000344576  719  .   C    T    200   PASS    .     GT      1|0
ENST00000420316  718  .   C    T    200   PASS    .     GT      1|0
ENST00000450046  622  .   C    T    200   PASS    .     GT      1|0
ENST00000454757  498  .   C    T    200   PASS    .     GT      1|0
Reads were generated using Mason with the following command, using the
FASTA file of the entire Ensembl v68 transcriptome as the input ‘genome’:
mason illumina -vcf venter.ensembl.vcf -N 20000000 --mate-pairs -n 100 -sq --output-file mason.venter.100.transcriptome.fastq transcriptome.fa
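The conversion of a genomic variant position into transcriptome coordinates, as described above for the Venter VCF file, can be sketched as follows. This is not the script used in this study. The function name and the exon representation are assumptions, and the handling of minus-strand transcripts (including any reverse complementing of REF and ALT) is an assumption that depends on how transcriptome.fa was written.

```python
def genome_to_transcript_pos(pos, exons, strand="+"):
    """Map a 1-based genomic position to a 1-based spliced transcript position.

    exons  : list of (start, end) genomic coordinates, 1-based and inclusive
    strand : "+" or "-" (transcript orientation)
    Returns None if the position lies in an intron or outside the transcript,
    in which case no variant would be written for that transcript.
    """
    # Walk the exons in transcript order, accumulating spliced length.
    ordered = sorted(exons, reverse=(strand == "-"))
    offset = 0
    for start, end in ordered:
        if start <= pos <= end:
            within = pos - start if strand == "+" else end - pos
            return offset + within + 1
        offset += end - start + 1
    return None
```

Applying such a function to every transcript that overlaps a variant yields one output record per transcript in which the position is exonic, consistent with the seven EGFR records shown above.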
Genomic reads were generated from the standard Venter VCF coordinates and the
GRCh37 genome assembly using the following command:
mason illumina -vcf venter.vcf -N 5000000 --mate-pairs -n
100 -sq --output-file mason.venter.100.genome.fastq
Homo_sapiens.GRCh37.68.genome.fa
Pseudogene reads were generated in the same manner as ordinary transcriptome
reads, except only sequences from the Ensembl v68 annotation were used that had a
‘source’ field of one of the following:
IG_C_pseudogene
IG_J_pseudogene
IG_V_pseudogene
polymorphic_pseudogene
processed_pseudogene
pseudogene
transcribed_processed_pseudogene
transcribed_unprocessed_pseudogene
TR_J_pseudogene
TR_V_pseudogene
unitary_pseudogene
unprocessed_pseudogene
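Selecting pseudogene records from the annotation could be done along the following lines. The sketch assumes a standard tab-separated GTF file in which the ‘source’ field is the second column; the function name is hypothetical, and the script actually used for this study is not shown here.

```python
PSEUDOGENE_SOURCES = {
    "IG_C_pseudogene", "IG_J_pseudogene", "IG_V_pseudogene",
    "polymorphic_pseudogene", "processed_pseudogene", "pseudogene",
    "transcribed_processed_pseudogene", "transcribed_unprocessed_pseudogene",
    "TR_J_pseudogene", "TR_V_pseudogene", "unitary_pseudogene",
    "unprocessed_pseudogene",
}

def is_pseudogene_line(gtf_line):
    """Return True if a GTF record has one of the pseudogene source values."""
    if gtf_line.startswith("#"):
        return False  # skip comment and header lines
    fields = gtf_line.rstrip("\n").split("\t")
    return len(fields) >= 9 and fields[1] in PSEUDOGENE_SOURCES
```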
Alignment with STAR, Tophat2/Bowtie2, and Subjunc
STAR
STAR v2.3.0e was used for all analyses. The STAR index was built using the GRCh37
human genome assembly and Ensembl v68 annotation using the following
command, using an overhang of 99 as indicated by the STAR manual for optimal
alignments of 100 bp reads:
STAR --runMode genomeGenerate --genomeDir
star_db/GRCh37.overhang99 --genomeFastaFiles
Homo_sapiens.GRCh37.68.genome.fa --sjdbGTFfile
Homo_sapiens.GRCh37.68.gtf --sjdbOverhang 99 --runThreadN
16
STAR alignments were performed using the following command, replacing ${FILE1}
and ${FILE2} with the proper filenames and ${PREFIX} with the output filename
prefix:
STAR --genomeDir star_db/GRCh37.overhang99 --readFilesIn
${FILE1} ${FILE2} --runThreadN 16 --outFilterMismatchNmax 8
--outFileNamePrefix ${PREFIX}.star. --outFilterMultimapNmax
10
Subjunc
Subread/Subjunc v1.3.6-p1 was used for all analyses. The Subread index was built
using the GRCh37 human genome assembly using the following command:
subread-buildindex -o GRCh37
Homo_sapiens.GRCh37.68.genome.fa
Reads were aligned to this index with the following command, replacing ${FILE1}
and ${FILE2} with the proper filenames and ${PREFIX} with the output filename
prefix:
subjunc -i GRCh37 -r ${FILE1} -R ${FILE2} -o ${PREFIX}.sam -T 16
Tophat2
Bowtie2 v2.1.0 and Tophat2 v2.0.9 were used for all analyses. The Bowtie2 index
was built for the GRCh37 genome assembly using the following command:
bowtie2-build -f Homo_sapiens.GRCh37.68.genome.fa GRCh37.genome
Alignments with Tophat2 were performed using the following command (using the
most sensitive search option and not considering novel splice junctions), replacing
${FILE1} and ${FILE2} with the proper filenames and ${PREFIX} with the output
filename prefix:
tophat --b2-very-sensitive --output-dir ${PREFIX} -p 16 -r
100 --mate-std-dev 50 -g 10 -N 4 --read-edit-dist 4 --library-type fr-unstranded --no-novel-juncs -G
Homo_sapiens.GRCh37.68.gtf --transcriptome-index
GRCh37.transcriptome GRCh37.genome ${FILE1} ${FILE2}
Sorting and BAM conversion (for STAR, Subjunc, and Tophat2)
SAMtools 0.1.19 was used for sorting and BAM conversion.
samtools view -uST Homo_sapiens.GRCh37.68.genome.fa
${PREFIX}.sam | samtools sort - ${PREFIX}
Read group assignment with Picard (not counted in running times)
SNAPR can automatically assign read groups to the output BAM file, as can Tophat,
but STAR and Subjunc do not have this functionality. It should be noted that
although read group assignment was not included in the running time for the paper,
this step is still necessary for many downstream analyses.
java -jar AddOrReplaceReadGroups.jar I=${PREFIX}.bam
O=${PREFIX}.RG.bam SORT_ORDER=coordinate CREATE_INDEX=true
RGPL=Illumina RGLB=Library RGPU=Platform RGID=1
RGSM=${PREFIX} VALIDATION_STRINGENCY=LENIENT
Mapping Quality Reassignment (not counted in running times)
STAR uses MAPQ=255 to refer to ‘uniquely’ mapped reads, similar to old versions of
Tophat. The standard interpretation of MAPQ=255 is ‘unknown’ mapping quality,
and GATK will ignore these reads by default. It is therefore necessary to reassign
these mapping qualities. It should be noted that although reassignment of mapping
qualities was not included in the running time for the paper, this step is still
necessary for many downstream analyses.
java -jar GenomeAnalysisTK.jar -T PrintReads -rf
ReassignOneMappingQuality -RMQT 60 -RMQF 255 -I
${PREFIX}.RG.bam -o ${PREFIX}.RG.MAPQ.bam
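To make the effect of this reassignment concrete, the following Python sketch applies the same substitution (MAPQ 255 becomes 60) to a single uncompressed SAM record. It only illustrates what the GATK read filter above does and is not a replacement for it; the function name is hypothetical. It relies on MAPQ being the fifth tab-separated field of a SAM alignment line.

```python
def reassign_mapq(sam_line, old=255, new=60):
    """Rewrite MAPQ `old` to `new` in one SAM line; header lines pass through."""
    if sam_line.startswith("@"):
        return sam_line
    fields = sam_line.rstrip("\n").split("\t")
    if int(fields[4]) == old:      # MAPQ is the fifth column of a SAM record
        fields[4] = str(new)
    newline = "\n" if sam_line.endswith("\n") else ""
    return "\t".join(fields) + newline
```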
Variant Calling with Genome Analysis Toolkit UnifiedGenotyper
The reader may wonder why the UnifiedGenotyper was used instead of the
HaplotypeCaller. At the time of this study, the HaplotypeCaller did not support the
‘N’ CIGAR operator, which indicates splicing, and therefore could not be used for RNA-seq data:
java -jar GenomeAnalysisTK.jar -R
Homo_sapiens.GRCh37.68.genome.fa -T UnifiedGenotyper -I
${PREFIX}.bam -o ${PREFIX}.vcf -stand_call_conf 30.0 -stand_emit_conf 30.0 -nt 16
Generation of read counts
SNAPR generates gene read counts automatically as part of the alignment process.
Read counts for the other three alignment algorithms were generated in the
following manner, replacing ${PREFIX} with the appropriate SAM file prefix:
STAR and Tophat2/Bowtie2
HTSeq-count 0.5.4p5 was used to generate read counts. STAR and Tophat2 both
use a similar scoring scheme for multiple-aligned reads. Calculation of MAPQ is
determined by the probability of an alignment being incorrect, assuming all
alignments are equally probable. If a read has two alignments, then MAPQ = -10 log10(0.5)
= 3, rounded off to the nearest integer. If a read has three alignments, then
MAPQ = -10 log10(0.667) = 2, and so on. Therefore reads with MAPQ values < 4 are considered
multiply-aligned, and were filtered out using the ‘-a 4’ option to htseq-count.
htseq-count -m union -s no -a 4 -i gene_id -t exon -q
${PREFIX}.sam Homo_sapiens.GRCh37.68.gtf >
${PREFIX}.gene_id.counts.txt
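The arithmetic behind the MAPQ cutoff of 4 can be checked with a short Python sketch. It assumes, as stated above, that a read with n alignments has a probability of 1 - 1/n of being placed incorrectly; the function name is hypothetical.

```python
import math

def mapq_for_equal_alignments(n):
    """MAPQ for a read with n equally probable alignments (n >= 2)."""
    p_wrong = 1.0 - 1.0 / n                      # chance the reported position is wrong
    return int(round(-10 * math.log10(p_wrong)))

# Two alignments give 3, three give 2, four give 1.
values = [mapq_for_equal_alignments(n) for n in (2, 3, 4)]
```

Every multiply-aligned read therefore receives a MAPQ below 4 under this scheme, which is why a minimum quality of 4 removes them.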
Subjunc
Subjunc uses its own internal application to generate read counts, with no option to
filter multiple-aligned reads:
featureCounts -p -T 16 -a Homo_sapiens.GRCh37.68.gtf -t
exon -g gene_id -i ${PREFIX}.sam -o
${PREFIX}.gene_id.counts.txt
Stomach Adenocarcinoma (STAD)
312 RNA-seq samples of stomach adenocarcinoma were identified from the UCSC
Cancer Genomics Hub (https://cghub.ucsc.edu) using the following cgquery
command:
cgquery "platform=ILLUMINA&center_name=BCCAGSC&library_strategy=RNA-Seq&state=live&disease_abbr=STAD"
All samples were sequenced to 100bp length on an Illumina sequencer using the
paired end protocol at the British Columbia Cancer Agency Genome Sciences Centre
(BCCAGSC).
Infections and Contaminants
As described in the main text, detectable levels (>10 reads) of Epstein-Barr virus
type 1 (human herpesvirus 4) were identified in 70 samples of STAD, while 33
samples contained detectable levels of Epstein-Barr virus type 2. In 17 samples,
EBV type 1 was the strongest identifiable contaminant out of all bacterial, viral,
and fungal genomes. 41 samples contained detectable levels of cytomegalovirus
(human herpesvirus 5). One sample contained cytomegalovirus as the top
contaminant out of all bacterial, viral, and fungal genomes.
Out of all the cancers tested for this study, STAD contained the highest number of
species of bacteria, fungi, and viruses strongly detected in the samples, with 31
different genera or species appearing at least once in the top position of the
contaminant report. Propionibacterium, Cupriavidus, Shigella, and Pseudomonas
were the top bacterial genera represented in the STAD samples. One of these genera
was the top contaminant in 117 of the samples. Genus Propionibacterium contains
some of the most common skin bacteria in humans. Genus Shigella contains
common human pathogens that infect the host intestine. Genus Pseudomonas
contains some of the most common hospital-acquired infectious bacteria. The
fungus Thielavia terrestris was also strongly represented in the samples, occurring
as the top contaminant in 45 of the STAD samples.
Intra- and Inter-gene Fusions
SNAPR identified many fusions that involve genes commonly associated with
stomach adenocarcinoma, many of which appeared in multiple samples. For brevity
we only discuss a few of the most clinically relevant fusions here; complete results
can be found in the supplement spreadsheets. Note that all fusions in the
spreadsheet are duplicated in order to make visualization easier: all fusions can
then be sorted by each gene participating in the fusion to easily visualize clusters of
fusions. Therefore there are exactly one half as many fusions predicted as listed in
the spreadsheet.
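Because every fusion is listed once for each participating gene, recovering the number of distinct fusions amounts to collapsing mirrored entries. A minimal Python sketch is given below. It assumes each spreadsheet row can be reduced to a pair of gene names; the actual spreadsheet layout is not described here, and the function name is hypothetical.

```python
def unique_fusions(rows):
    """Collapse mirrored (gene_a, gene_b) and (gene_b, gene_a) rows into one."""
    seen = set()
    unique = []
    for gene_a, gene_b in rows:
        key = frozenset((gene_a, gene_b))   # order-independent identity of the pair
        if key not in seen:
            seen.add(key)
            unique.append((gene_a, gene_b))
    return unique
```

For fusions recorded per sample, a sample identifier would need to be part of the key as well.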
There was evidence for ERBB2 (HER2) fusions in many STAD samples. ERBB2 is
known to be mutated, amplified, and overexpressed in gastric cancers [1,2]. One
sample contained evidence for an ERBB2-RARA gene fusion; RARA (retinoic acid
receptor alpha) is commonly fused in acute promyelocytic leukemia.
IGF2 (insulin-like growth factor 2) was strongly predicted to be involved in gene
fusions in stomach adenocarcinoma. IGF2 is commonly overexpressed in gastric
cancers [3]. There is strong evidence for a PVT1-QRSL1 fusion in one sample, involving
tens of thousands of reads. PVT1 is an oncogene frequently involved in
translocations in Burkitt’s lymphomas [4].
Glioblastoma multiforme (GBM)
159 RNA-seq samples of glioblastoma multiforme were identified from the UCSC
Cancer Genomics Hub (https://cghub.ucsc.edu) using the following cgquery
command:
cgquery "platform=ILLUMINA&center_name=UNCLCCC&library_strategy=RNA-Seq&state=live&disease_abbr=GBM"
Only the 58 samples with identified fusions in Frattini et al. were used in
downstream analysis of GBM samples for this study [5].
All samples were sequenced to 76bp length on an Illumina sequencer using the
paired end protocol at the University of North Carolina Lineberger Comprehensive
Cancer Center (UNC-LCCC).
Infections and Contaminants
Out of all the cancers tested for this study, GBM contained the fewest
genera and species in the top position in the contaminant report. The number one
contaminant was the fungus Aspergillus oryzae RIB40 contig in 55 out of 58
samples tested. There were also very few viral-mapping reads identified, with only
a few species represented.
Intra- and Inter-gene Fusions
Note that all fusions in the spreadsheet are duplicated in order to make visualization
easier: all fusions can then be sorted by each gene participating in the fusion to
easily visualize clusters of fusions. Therefore there are exactly one half as many
fusions predicted as listed in the spreadsheet.
These samples were analyzed in order to compare the fusions identified by SNAPR with
those reported in Frattini et al. [5]. We recover 86% of the fusions reported in that
study, and an additional 6% as partial fusions (where one of the two genes is predicted
to be fused). EGFR (epidermal growth factor receptor) is involved in many gene fusions
in GBM, often fused to SEPT14 (septin 14) and SEC61G. SEPT14 is also commonly fused, as is
LANCL2. We also discover the FGFR3-TACC3 fusion in two samples, as identified in
Singh et al. [6]. We additionally discover a fusion between FGFR3 and KCND3.
Acute Myeloid Leukemia (LAML)
Sample Protocol
117 RNA-seq samples of acute myeloid leukemia were identified from the UCSC
Cancer Genomics Hub (https://cghub.ucsc.edu) using the following cgquery
command:
cgquery "platform=ILLUMINA&center_name=BCCAGSC&library_strategy=RNA-Seq&state=live&disease_abbr=LAML"
All samples were sequenced to 100bp length on an Illumina sequencer using the
paired end protocol at the British Columbia Cancer Agency Genome Sciences Centre
(BCCAGSC).
Infections and Contaminants
LAML stands out from the other cancer types as having the strongest evidence for
contamination or infection by a bacterial genus. Genus
Acinetobacter was found as the top contaminant in 10 of the 117 samples tested,
with an average of 5,687,367 mapped reads. Some samples contained over
13 million reads from this genus. This is easily the highest level of infection for any
contaminant in any sample tested, two to three orders of magnitude greater than
the average contaminant. Genus Acinetobacter contains species that are commonly
associated with life-threatening infections in immunocompromised individuals.
Other genera represented include Pseudomonas, Acidovorax, and
Propionibacterium. Cytomegalovirus is detected in 48 samples, some of them quite
strongly.
Intra- and Inter-gene Fusions
Note that all fusions in the spreadsheet are duplicated in order to make visualization
easier: all fusions can then be sorted by each gene participating in the fusion to
easily visualize clusters. Therefore there are exactly one half as many fusions
predicted as listed in the spreadsheet.
Few cancers are as strongly associated with translocations as leukemia, and SNAPR
discovers many of the canonical fusions in these samples. CBFB-MYH11, which is
commonly associated with acute myeloid leukemia, is identified in seven samples.
RUNX1-RUNX1T1, also commonly associated with acute myeloid leukemia, is identified
in four samples. BCR-ABL1, which is associated with chronic myeloid leukemia, is
identified in two samples, and PML-RARA, commonly associated with acute
promyelocytic leukemia, is identified in seven samples.
Ovarian Serous Cystadenocarcinoma (OV)
Notes on analysis
During analysis of the 808 samples we noticed that there were only 404 actual
patient samples but that there were two BAM files for each sample. Each BAM file
was assigned its own analysis_id. This accounts for the identical fusions identified
for this cancer type. This demonstrates the consistency with which SNAPR is able to
identify sample fusions, usually finding identical numbers of spliced and mated
reads for each fusion between duplicate samples.
Sample Protocol
808 RNA-seq samples of ovarian serous cystadenocarcinoma were identified from
the UCSC Cancer Genomics Hub (https://cghub.ucsc.edu) using the following
cgquery command:
cgquery "platform=ILLUMINA&center_name=BCCAGSC&library_strategy=RNA-Seq&state=live&disease_abbr=OV"
All samples were sequenced to 100bp length on an Illumina sequencer using the
paired end protocol at the British Columbia Cancer Agency Genome Sciences Centre
(BCCAGSC).
Infections and Contaminants
The top contaminants in OV samples are genera Pseudomonas and
Propionibacterium, collectively representing the top contaminant in 741 out of 808
samples tested. Enterobacteria phage lambda is also strongly represented,
appearing as the top contaminant in 43 samples. Enterobacteria phages are
unusually highly represented in these samples, with detectable levels of the phages
in every single sample tested. Many samples contain multiple species of these
phages. Several samples of OV contain detectable levels of cytomegalovirus, with
two samples containing appreciable levels of the virus.
Intra- and Inter-gene Fusions
SNAPR identified many fusions that involve genes commonly associated with
ovarian cancer, many of which appeared in multiple samples. For brevity we only
discuss the most clinically relevant fusions here; complete results can be found in
the supplement spreadsheets. Note that all fusions in the spreadsheet are duplicated
in order to make visualization easier: all fusions can then be sorted by each gene
participating in the fusion to easily visualize clusters of fusions. Therefore there are
exactly one half as many fusions predicted as listed in the spreadsheet.
IGF2 (insulin-like growth factor 2) was strongly predicted to be involved in gene
fusions. High expression of IGF2 has been associated with poor prognosis in
advanced serous ovarian cancer [7,8]. Interestingly, IGF2 is commonly fused with GPX3
(glutathione peroxidase 3), which is commonly overexpressed in serous
adenocarcinoma [9] and suppressed in papillary serous ovarian cancer [10]. IGF2 is also
commonly fused with H19, a ncRNA that is nearby on chromosome 11. H19 is a
tumor suppressor that is maternally imprinted while nearby IGF2 is paternally
imprinted [7]. H19 is also considered a tumor marker for serous ovarian carcinoma [11].
IGF2 is also commonly fused to BGN (biglycan).
There is evidence for MUC16 fusions in eight serous ovarian adenocarcinoma
samples. MUC16 is commonly overexpressed in ovarian cancers and has been
associated with tumorigenesis and metastasis [12]. There is evidence for a WNT7A
fusion in two samples. WNT7A is associated with tumor growth and progression in
ovarian cancer through the WNT/β-catenin pathway [13]. There is evidence for XIAP
(X-linked inhibitor of apoptosis) fusions in two samples; XIAP is associated with
evasion of apoptosis in ovarian cancer [14].
Bibliography
1. Park, Y. S. et al. Comprehensive analysis of HER2 expression and gene amplification in gastric cancers using immunohistochemistry and in situ hybridization: which scoring system should we use? Hum. Pathol. 43, 413–422 (2012).
2. Lee, J. W. et al. ERBB2 kinase domain mutation in a gastric cancer metastasis. APMIS 113, 683–687 (2005).
3. Wu, M. S. et al. Loss of imprinting and overexpression of IGF2 gene in gastric adenocarcinoma. Cancer Lett. 120, 9–14 (1997).
4. Carramusa, L. et al. The PVT-1 oncogene is a Myc protein target that is overexpressed in transformed cells. J. Cell. Physiol. 213, 511–518 (2007).
5. Frattini, V. et al. The integrated landscape of driver genomic alterations in glioblastoma. Nat. Genet. 45, 1141–1149 (2013).
6. Singh, D. et al. Transforming fusions of FGFR and TACC genes in human glioblastoma. Science 337, 1231–1235 (2012).
7. Murphy, S. K. et al. Frequent IGF2/H19 domain epigenetic alterations and elevated IGF2 expression in epithelial ovarian cancer. Mol. Cancer Res. 4, 283–292 (2006).
8. Sayer, R. A. et al. High insulin-like growth factor-2 (IGF-2) gene expression is an independent predictor of poor survival for patients with advanced stage serous epithelial ovarian cancer. Gynecol. Oncol. 96, 355–361 (2005).
9. Lee, H. J. et al. Immunohistochemical evidence for the over-expression of Glutathione peroxidase 3 in clear cell type ovarian adenocarcinoma. Med. Oncol. 28 Suppl 1, S522–7 (2011).
10. Agnani, D. et al. Decreased levels of serum glutathione peroxidase 3 are associated with papillary serous ovarian cancer and disease progression. J Ovarian Res 4, 18 (2011).
11. Tanos, V. et al. Expression of the imprinted H19 oncofetal RNA in epithelial ovarian cancer. Eur. J. Obstet. Gynecol. Reprod. Biol. 85, 7–11 (1999).
12. Thériault, C. et al. MUC16 (CA125) regulates epithelial ovarian cancer cell growth, tumorigenesis and metastasis. Gynecol. Oncol. 121, 434–443 (2011).
13. Yoshioka, S. et al. WNT7A regulates tumor growth and progression in ovarian cancer through the WNT/β-catenin pathway. Mol. Cancer Res. 10, 469–482 (2012).
14. Shaw, T. J., Lacasse, E. C., Durkin, J. P. & Vanderhyden, B. C. Downregulation of XIAP expression in ovarian cancer cells induces cell death in vitro and in vivo. Int. J. Cancer 122, 1430–1434 (2008).