Supplementary Material (doc 43K)

Supplemental Material and Methods
Bioinformatic Evaluation of Sequencing Data
File Conversion and Alignment
Raw sequences obtained with the SCS 2.6 (qseq files) and the SCS 2.8/SCS 2.9
(bcl files) software (Illumina) were either converted directly to fastq files using
GERALD (Illumina) or first converted from bcl to qseq files using the Off-Line
Basecaller (OLB) software (Illumina) and then converted to fastq files using
GERALD. The fastq files were then aligned to the
reference genome hg19 using the Burrows-Wheeler Aligner (BWA). The output
alignment-files were converted to bam-files and then further processed in order to
exclude identical reads and reads that mapped to more than one position (using the
Picard tool and R/Bioconductor), resulting in uniquely mapped unique reads. These
reads were then compared to the exome design-files (v1 design file: 36Mb; v2
design-file: 44Mb, Nimblegen) to calculate the percentage of reads “on target”, the
coverage “on target” and the distribution of the reads “on target” (uniformity) (see
S_Table 2).
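The "on target" percentage described above can be illustrated with a minimal Python sketch. It is not the Picard or R/Bioconductor code used in the study: the function names, the half-open coordinate convention and the toy intervals are all assumptions made for illustration, and reads are assumed to be already deduplicated and uniquely mapped.

```python
from bisect import bisect_right


def merge_intervals(intervals):
    """Merge overlapping or touching half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


def overlaps_target(targets, starts, read_start, read_end):
    """True if [read_start, read_end) overlaps any merged target interval."""
    i = bisect_right(starts, read_start) - 1
    # Target that begins at or before the read start and extends past it.
    if i >= 0 and targets[i][1] > read_start:
        return True
    # Next target begins before the read ends.
    return i + 1 < len(targets) and targets[i + 1][0] < read_end


def percent_on_target(reads, targets_by_chrom):
    """Percentage of (chrom, start, end) reads overlapping a target region."""
    if not reads:
        return 0.0
    merged = {c: merge_intervals(iv) for c, iv in targets_by_chrom.items()}
    starts = {c: [s for s, _ in iv] for c, iv in merged.items()}
    hits = 0
    for chrom, start, end in reads:
        if chrom in merged and overlaps_target(merged[chrom], starts[chrom], start, end):
            hits += 1
    return 100.0 * hits / len(reads)


# Hypothetical toy data, not taken from the exome design files.
example_targets = {"chr1": [(100, 200), (150, 250), (400, 500)]}
example_reads = [
    ("chr1", 90, 110),   # overlaps the first target
    ("chr1", 250, 300),  # starts exactly where a target ends: off target
    ("chr1", 390, 401),  # overlaps the last target by one base
    ("chr2", 10, 20),    # chromosome without targets
]
example_percent = percent_on_target(example_reads, example_targets)
```

Merging the design intervals first keeps the lookup to a single binary search per read, which matters when a design file holds several hundred thousand regions.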
Single Nucleotide Variant (SNV)-Calling, -Filtering and Biological Interpretation
Unpaired SNV-calling was performed using the Genome Analysis Toolkit (GATK)
pipeline as described,16 followed by custom annotation and filtering steps (Figure 1).
First, all SNVs were called that were not present in the reference genome (hg19).
Thereafter, SNVs were excluded that either appeared in the 1000 genomes database
or the dbSNP database, assuming that these SNVs might be of less importance for
tumorigenesis. Based on the SeattleSeq annotation tool, we also excluded those
SNVs that did not lead to an amino acid exchange (synonymous mutations) and
those that were predicted to have benign structural consequences according to
Polyphen 1. Of note, SeattleSeq lists all transcripts of a gene (different protein length
or accession numbers). Since the same mutation might have different effects on
different transcripts, all of them were listed. In addition, we matched the previously
filtered mutation data of 38 MM, published in the supplement of Chapman and
colleagues, with the mutation data of our cases.5 This increased the number of
primary MM to 43 and helped us to efficiently extract the tumor relevant SNVs from
our six cell lines (Figure 1). Specifically, for our discovery approaches, we focussed
on genes that were affected by a mutation in at least one of our five primary MM or
one of our six cell lines plus at least one of the 38 primary MM.5
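The filtering and matching logic of this paragraph can be sketched as follows. This is an illustrative reading, not the custom annotation code used in the study: the record fields (`in_dbsnp`, `in_1000g`, `effect`, `polyphen`) and the placeholder gene names are hypothetical and do not correspond to actual SeattleSeq column names.

```python
def passes_basic_filters(snv):
    """Keep novel, non-synonymous SNVs that are not predicted benign."""
    return (
        not snv["in_dbsnp"]
        and not snv["in_1000g"]
        and snv["effect"] != "synonymous"
        and snv["polyphen"] != "benign"
    )


def discovery_genes(own_snvs, published_genes):
    """Genes mutated in at least one own sample and at least one published case."""
    own_genes = {snv["gene"] for snv in own_snvs if passes_basic_filters(snv)}
    return own_genes & set(published_genes)


# Hypothetical records; gene names are placeholders, not study results.
own_snvs = [
    {"gene": "GENE_A", "in_dbsnp": False, "in_1000g": False,
     "effect": "missense", "polyphen": "probably-damaging"},
    {"gene": "GENE_B", "in_dbsnp": True, "in_1000g": False,
     "effect": "missense", "polyphen": "possibly-damaging"},
    {"gene": "GENE_C", "in_dbsnp": False, "in_1000g": False,
     "effect": "synonymous", "polyphen": "unknown"},
    {"gene": "GENE_D", "in_dbsnp": False, "in_1000g": False,
     "effect": "missense", "polyphen": "possibly-damaging"},
]
published_genes = {"GENE_A", "GENE_B", "GENE_C"}
selected = discovery_genes(own_snvs, published_genes)
```

In this toy input only `GENE_A` survives: `GENE_B` is a known polymorphism, `GENE_C` is synonymous, and `GENE_D` has no counterpart in the published set.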
Finally, we applied three additional bioinformatics predictors, namely GERP (-11.6 to +5.82),
phastCons (0 to 1) and Polyphen 2 (benign, probably damaging, possibly
damaging) to focus on those genes that are affected by mutations that likely lead to
“functional” changes of the protein. Assuming that the lowest score is equal to 0%
and the highest score is equal to 100%, we chose a threshold of 65% for GERP and
phastCons and excluded those SNVs that were neither probably nor possibly
damaging according to Polyphen 2. The resulting gene lists (79 genes and 193
genes, Figure 1) were then used to discover new tumor relevant pathways.
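The 65% threshold can be made concrete with a short worked calculation. The sketch below assumes the percentage is a linear rescaling of each score range, so that the cutoff equals the lowest score plus 65% of the range width; the function name is hypothetical.

```python
def scaled_threshold(lowest, highest, fraction=0.65):
    """Map a fraction of the score range back onto the raw score scale."""
    return lowest + fraction * (highest - lowest)


# The two score ranges quoted in the text.
cutoff_unit_range = scaled_threshold(0.0, 1.0)       # 0.65
cutoff_wide_range = scaled_threshold(-11.6, 5.82)    # about -0.277
```

Under this reading, the range from 0 to 1 gives a cutoff of 0.65, and the range from -11.6 to +5.82 gives a cutoff of about -0.28, because 65% of the range width of 17.42 is 11.323.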
In detail, the gene lists that were generated by our filtering approach (79 genes and
193 genes, Figure 1) as well as the mutation data from Chapman et al. (1,429 genes)
were entered into the GSEA (Gene Set Enrichment Analysis) database (MSigDB) and
pathway annotations using the C2-collection (3,272 curated gene sets) were
performed.17 The signaling network was determined using the String 9.0 (Search
Tool for the Retrieval of Interacting Genes/Proteins) database (analysis performed
with low- and medium confidence)18 and by manual literature search. Specifically, we
screened a broad panel of adhesion molecules, receptor tyrosine kinases and
downstream effectors and generated a signaling network that only contained
tumor-relevant genes (non-synonymous mutations occurring in primary MM but not in the
corresponding normal tissue and that were damaging according to functional
predictors). Information on protein domains was obtained using the graphical view of
the Nucleotide database from NCBI and the String database.
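To illustrate what annotating a gene list against curated gene sets involves, the following sketch scores the overlap between a gene list and each gene set with a generic hypergeometric over-representation test. This is a swapped-in textbook technique, not the MSigDB implementation used in the study; the function names, the size of the gene universe and the toy gene sets are assumptions.

```python
from math import comb


def hypergeom_pvalue(universe, set_size, list_size, overlap):
    """P(X >= overlap) when list_size genes are drawn from universe genes,
    set_size of which belong to the gene set."""
    upper = min(set_size, list_size)
    total = comb(universe, list_size)
    tail = sum(
        comb(set_size, k) * comb(universe - set_size, list_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


def rank_gene_sets(gene_list, gene_sets, universe):
    """Return (set name, overlap size, p-value) tuples, smallest p-value first."""
    genes = set(gene_list)
    results = []
    for name, members in gene_sets.items():
        members = set(members)
        overlap = len(genes & members)
        p = hypergeom_pvalue(universe, len(members), len(genes), overlap)
        results.append((name, overlap, p))
    return sorted(results, key=lambda r: r[2])


# Hypothetical toy input; names are placeholders, not curated C2 gene sets.
toy_sets = {
    "SET_X": ["G1", "G2", "G3"],
    "SET_Y": ["G7", "G8", "G9"],
}
toy_ranking = rank_gene_sets(["G1", "G2", "G3"], toy_sets, universe=10)
```

With a universe of 10 genes, a three-gene list that fully covers a three-gene set has a p-value of 1/120, while a set with no overlap has a p-value of 1. A real analysis would also need a multiple-testing correction across the gene sets, which is omitted here.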