Low-Level Copy Number Analysis Part 1

Low-Level
Copy Number Analysis
Part 1 - Background
Henrik Bengtsson
Post doc, Department of Statistics,
University of California, Berkeley, USA
CEIT Workshop on SNP arrays,
Dec 15-17, 2008, San Sebastian
Acknowledgments
UC Berkeley:
• James Bullard
• Kasper Hansen
• Elizabeth Purdom
• Terry Speed
Lawrence Berkeley National Labs:
• Amrita Ray
• Paul Spellman
Johns Hopkins, Baltimore:
• Benilton Carvalho
• Rafael Irizarry
WEHI, Melbourne, Australia:
• Mark Robinson
• Ken Simpson
ISREC, Lausanne, Switzerland:
• Pratyaksha “Asa” Wirapati
Affymetrix, California:
• Ben Bolstad
• Simon Cawley
• Steve Chervitz
• Harley Gorrell
• Earl Hubbell
• Luis Jevons
• Chuck Sugnet
• Jim Veitch
• Alan Williams
Detect more and smaller aberrations with fewer errors
[Figure: detection rate of CRMA compared with three other methods (other #1, other #2, other #3)]
Copy number analysis is about finding
"aberrations" in one or several individuals
The HapMap project
- Large project to identify SNPs in Humans (2003-)
The HapMap is a catalog of common genetic variants (SNPs)
that occur in human beings. It describes what these variants
are, where they occur in our DNA, and how they are
distributed among people within populations and among
populations in different parts of the world.
URL: http://www.hapmap.org/
The HapMap project
- 270 normal individuals genotyped by different labs
using various technologies
• 90 CEU individuals (Utah/Europe; 30 trio families)
• 90 YRI individuals (Nigeria; 30 trio families)
• 45 CHB (China; unrelated)
• 45 JPT (Japan; unrelated)
Publicly available:
• High quality data.
• Raw data, e.g. Affymetrix CEL files.
• Genotypes.
• Studied by many groups.
Copy number polymorphism
- People share common CN aberrations (2005-)
The Cancer Genome Atlas (TCGA) project
- Large project for genetic mapping of tumors (2007-)
The TCGA project
- A large number of tissues are studied with many DNA & RNA technologies
• Tumor types:
– brain cancer (glioblastoma multiforme, or GBM),
– lung cancer (squamous cell carcinoma of the lung), and
– ovarian cancer (serous cystadenocarcinoma of the ovary).
• 234 tumors (of 500) characterized.
• Multiple labs in the US
– Broad, Harvard, Stanford, LBNL, …
• High quality data.
• Platforms: Affymetrix, Illumina, Agilent, …
• Gene-, exon-, microRNA- expression, methylation, SNP &
CN, sequencing…
• Raw and summarized data immediately available (publicly),
e.g. Affymetrix CEL files.
Combining copy numbers across platforms & labs
Henrik Bengtsson (UC Berkeley), Amrita Ray (LBNL), Paul Spellman (LBNL), Terry Speed (UC Berkeley)
BACKGROUND:
Whole-genome copy-number (CN) studies are rapidly
expanding, and with this expansion comes a demand for
increased precision and resolution of CN estimates. Several
recent studies have obtained CN estimates from more than
one platform on the same samples, and it is natural to want to
combine the different estimates in order to meet this demand.
PROBLEM:
CN estimates from different platforms show different
degrees of attenuation of the true CN changes. Differences
can also be observed in CN estimates from the same platform
run in different labs, or in the same lab, with different analytical
methods. This is why it is not a straightforward
matter to combine CN estimates from different sources
(platforms, labs, analysis methods, etc.).
(A) Broad, Affymetrix GWS6, n=1800K, 1.59kb/locus, 25-mers
(B) Stanford, Illumina 550K, n=550K, 5.53kb/locus, 50-mers
(C) MSKCC, Agilent 244K, n=236K, 12.7kb/locus, 60-mers
(D) Harvard, Agilent 244K, n=236K, 12.7kb/locus, 60-mers
The smoothed raw CNs from the four sources have similar
CN profiles but different mean levels.
Tumor/normal CNs by four TCGA centers (“sources”)
in a 60Mb region on Chr3 in sample TCGA-02-104.
(The combined set would consist of 2,822K loci with
0.95kb/locus.)
METHOD:
We have developed a single-sample multi-source
normalization that brings full-resolution CN
estimates to the same scale across sources.
Kernel estimators and principal component
curves are used to estimate the non-linear
relationships between the sources. Full-resolution
data is then normalized such that these
relationships become linear.
The normalized estimates are such that for any
underlying CN level, the mean level of the CN
estimates is the same regardless of source. CNs
with consistent mean levels are better suited for
being combined across sources, e.g. existing
segmentation methods may be used to identify
aberrant regions.
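The idea of bringing two sources to the same scale can be illustrated with a minimal R sketch on simulated data. It is a simplified stand-in and not the method described above: it assumes two sources measured at the same loci, uses bin averages in place of kernel estimators, and a lowess curve in place of principal curves. The CN levels, the attenuation function of source B, and the noise levels are all assumptions made for illustration.

```r
set.seed(1)

# Hypothetical truth: 9 segments with different log2 CN ratios, 200 loci each
levels <- seq(-1, 0.6, by = 0.2)
seg <- rep(seq_along(levels), each = 200)
truth <- levels[seg]

# Two simulated sources; source B is attenuated and non-linear (assumption)
yA <- truth + rnorm(length(truth), sd = 0.1)
yB <- 0.6 * truth - 0.1 * truth^2 + rnorm(length(truth), sd = 0.1)

# Smooth both sources by averaging bins of 50 consecutive loci
bin <- rep(seq_len(length(truth) / 50), each = 50)
sA <- as.numeric(tapply(yA, bin, mean))
sB <- as.numeric(tapply(yB, bin, mean))

# Estimate the relationship between the smoothed sources, then use it to
# map the full-resolution estimates of source B onto the scale of source A
fit <- lowess(sB, sA)
yBn <- approx(fit$x, fit$y, xout = yB, rule = 2, ties = mean)$y

# Compare mean levels in one segment (true log2 ratio 0.4) before and after
inSeg <- (seg == 8)
before <- abs(mean(yB[inSeg]) - mean(yA[inSeg]))
after <- abs(mean(yBn[inSeg]) - mean(yA[inSeg]))
cat("Difference in mean level, before:", before, "after:", after, "\n")
```

Smoothing before estimating the curve matters in this sketch: fitting on the noisy full-resolution values would bias the estimated relationship toward a flatter slope.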
Before normalization:
Non-linearity between pairs
The smoothed normalized CNs from the four sources have
similar CN profiles and same mean levels.
After normalization:
Linearity between pairs
Normalized full-resolution CNs for the four sources.
RESULTS:
We use microarray-based CN estimates from The Cancer
Genome Atlas (TCGA) project to illustrate the method. We
show that after normalization the mean levels of randomly
selected CN aberrations are the same across platforms,
and that the normalized and combined data better separate
two CN states at a given resolution.
We conclude that it is possible to combine CNs from
multiple sources such that the resolution becomes
effectively larger, and when multiple platforms are
combined, they also enhance the genome coverage by
complementing each other in different regions.
[Figure panels: (A), (B), (C), (D), (comb+raw), (comb+norm)]
A 400kb region in TCGA-02-104 on Chr 3:
CNs from different sources give different
segmenting results at different precisions.
At any given resolution (amount of smoothing), combined
normalized CNs (solid red) separate two CN states better
than combined raw CNs (dot-dashed red) and better than each of
the individual sources (gray dotted).
With combined normalized CNs, there is more
power to detect change points (CPs) and their
locations are more precise.
Examples of genomic profiles
The Affymetrix
platform
The Affymetrix GeneChip
is a synthesized high-density (single-array) microarray
[Figure: GeneChip layout. Labels: 5 µm, 1.28 cm; 1 million identical 25-mer sequences; 6.5 million probes/chip]
Copy-number probes are used to quantify
the amount of DNA at known loci
CN locus: ...CGTAGCCATCGGTAAGTACTCAATGATAG...
PM: ATCGGTAGCCATTCATGAGTTACTA
CN=1: PM = c
CN=2: PM = 2c
CN=3: PM = 3c
Single Nucleotide Polymorphism (SNP)
Definition:
A sequence variation such that two chromosomes may
differ by a single nucleotide (A, T, C, or G).
Allele A: ...CGTAGCCATCGGTAAGTACTCAATGATAG...
Allele B: ...CGTAGCCATCGGTAGGTACTCAATGATAG...
A person is either AA, AB, or BB at this SNP.
Probes for SNPs
PMA: ATCGGTAGCCATTCATGAGTTACTA
Allele A: ...CGTAGCCATCGGTAAGTACTCAATGATAG...
Allele B: ...CGTAGCCATCGGTAGGTACTCAATGATAG...
PMB: ATCGGTAGCCATCCATGAGTTACTA
(Also MMs, but not in the newer chips, so we will not use these!)
AA: PMA >> PMB
AB: PMA ≈ PMB
BB: PMA << PMB
SNP probes can also be used to
estimate total copy numbers
AA: PM = PMA + PMB = 2c
AB: PM = PMA + PMB = 2c
BB: PM = PMA + PMB = 2c
AAB: PM = PMA + PMB = 3c
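The relation PM = PMA + PMB can be checked with a small idealized R sketch. It assumes noise-free signals and a hypothetical per-copy signal `c0` (the "c" on the slide); real probe signals are noisy and not exactly proportional to copy number.

```r
# Hypothetical signal contributed by one DNA copy (the "c" in PM = 2c)
c0 <- 1000

# Number of copies of allele A and allele B for each genotype
copies <- rbind(A = c(AA = 2, AB = 1, BB = 0, AAB = 2),
                B = c(AA = 0, AB = 1, BB = 2, AAB = 1))

# Idealized allele-specific probe signals
pmA <- c0 * copies["A", ]
pmB <- c0 * copies["B", ]

# Summing the two alleles gives a signal proportional to the total CN
pmTotal <- pmA + pmB
totalCN <- pmTotal / c0
print(totalCN)
```

In this idealized setting, the three diploid genotypes give the same total signal, and the AAB genotype stands out with a total of three copies.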
The Affymetrix assay
- takes 4-5 working days to complete
1. Start with target gDNA (genomic DNA) or mRNA.
2. Obtain labeled single-stranded target DNA fragments
for hybridization to the probes on the chip.
3. After hybridization, washing, and scanning we get a
digital image.
4. The image is summarized across pixels into probe-level
intensities before we begin. This is our "raw data".
Restriction enzymes digest the DNA, which is then
amplified and hybridized
Target DNA fragments find their way to complementary
probes by massively parallel hybridization
Scanning
Image Analysis
Example array:
Dimensions: 1600x1600 cells
Each cell: 3x3 pixels
Dynamic range: 65536 (16-bits) intensity levels
Cell summaries: (mean pixel, stddev pixel, #pixels)
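A minimal R sketch of the cell summaries, assuming a tiny simulated image of 4x4 cells with 3x3 pixels each (the example array above has 1600x1600 cells). The pixel values are random and purely illustrative; the actual image analysis is done by the vendor software, not by this code.

```r
set.seed(42)

cellSize <- 3L   # each cell is 3x3 pixels
nCells <- 4L     # hypothetical small array: 4x4 cells
n <- cellSize * nCells

# Simulated 16-bit image (65536 intensity levels: 0..65535)
img <- matrix(sample(0:65535, n * n, replace = TRUE), nrow = n, ncol = n)

# Map each pixel to the cell it belongs to
cellRow <- (row(img) - 1L) %/% cellSize + 1L
cellCol <- (col(img) - 1L) %/% cellSize + 1L
cellId <- as.vector((cellCol - 1L) * nCells + cellRow)

# Cell summaries: (mean pixel, stddev pixel, #pixels)
pixels <- as.vector(img)
cellMean <- tapply(pixels, cellId, mean)
cellSd <- tapply(pixels, cellId, sd)
cellN <- tapply(pixels, cellId, length)

print(head(data.frame(mean = cellMean, sd = cellSd, n = cellN)))
```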
Preparation + Hybridization + Scanning
-> DAT File(s) [Image, pixel intensities]
-> Image analysis
-> CEL File(s) [Probe Cell Intensity] (workable raw data) + CDF [Chip Description File]
-> Low-level analysis
-> Segmentation
A brief history
of Affymetrix SNP & CN arrays
How did we get here?
Data from 2003 on Chr22 (one of the smaller chromosomes)
[Figure: zoom in on Chr22 at increasing density]
2003: 10,000 loci (x1)
2004: 100,000 loci (x10)
2005: 500,000 loci (x50)
2006: 900,000 loci (x90)
2007: 1,800,000 loci (x180)
Genome-Wide Human SNP Array 6.0
- state-of-the-art array
• > 906,600 SNPs:
– Unbiased selection of 482,000 SNPs:
historical SNPs from the SNP Array 5.0 (== 500K)
– Selection of additional 424,000 SNPs:
• Tag SNPs
• SNPs from chromosomes X and Y
• Mitochondrial SNPs
• Recent SNPs added to the dbSNP database
• SNPs in recombination hotspots
• > 946,000 copy-number probes:
– 202,000 probes targeting 5,677 CNV regions from the Toronto
Database of Genomic Variants. Regions resolve into 3,182
distinct, non-overlapping segments; on average 61 probe sets
per region
– 744,000 probes, evenly spaced along the genome
Rapid increase in density
[Figure: distance between loci and # loci by year (2003-2007). 10K: 294kb; 100K: 26kb; 500K: 6.0kb; 5.0: 3.6kb; 6.0: 1.6kb; next?: 4x further out…]
Affymetrix & Illumina are competing
- we get more bang for the buck (cup)
                                   10K        100K        500K       5.0        6.0
Released                           July 2003  April 2004  Sept 2005  Feb 2007   May 2007
# SNPs                             10,204     116,204     500,568    500,568    934,946
# CNPs                             -          -           -          340,742    946,371
# loci                             10,204     116,204     500,568    841,310    1,878,317
Distance                           294kb      25.8kb      6.0kb      3.6kb      1.6kb
Price / chip set                   65 USD     400 USD     300 USD    175 USD    300 USD
# loci / cup of espresso ($1.35)   116 loci   215 loci    1236 loci  3561 loci  4638 loci
Price source: Affymetrix Pricing Information [http://store.affymetrix.com/] and Berkeley Coffee Shops, Dec 2008.
Affymetrix are moving away from MM probes
- therefore we don’t utilize them
Target DNA:          ...CGTAGCCATCGGTAAGTACTCAATGATAG...
                     |||||||||||||||||||||||||
Perfect match (PM):  ATCGGTAGCCATTCATGAGTTACTA
Mis-match (MM):      ATCGGTAGCCATACATGAGTTACTA
(25 nucleotides)
[Figure: hybridization of the target sequence, other DNA and other sequences to the PM probe, the MM probe and other PMs]
Low-Level
Copy Number Analysis
Part 2 – Simple preprocessing
Recap: Copy-number probes
CN locus: ...CGTAGCCATCGGTAAGTACTCAATGATAG...
PM: ATCGGTAGCCATTCATGAGTTACTA
CN=1: PM = c
CN=2: PM = 2c
CN=3: PM = 3c
Recap: Adding SNP probes gives total CN signal
AA: PM = PMA + PMB = 2c
AB: PM = PMA + PMB = 2c
BB: PM = PMA + PMB = 2c
AAB: PM = PMA + PMB = 3c
Notation
- here and in our papers
Indices:
Arrays/samples: i = 1, 2, …, I
Loci/SNPs/CN units: j = 1, 2, …, J
Replicated probes for SNP: k = 1, 2, …, K
Probe signals:
CN locus: yij = PMij (single-probe units)
SNP allele pair k: (yijkA,yijkB) = (PMijkA,PMijkB)
Summarized signals (“chip effects”):
CN locus: θij
SNP: (θijA, θijB)
A simple way to obtain CN estimates
• Calculate non-polymorphic SNP summaries:
– For each array i=1,…,I and SNP j=1,…,J:
• Probe allele pairs: (PMijkA,PMijkB); k=1,…,K
• For both alleles, summarize across probes by the median:
θijA = mediank {PMijkA}, θijB = mediank {PMijkB}
• Sum both alleles: θij = θijA + θijB
• Calculate reference Rj across all arrays:
– For each SNP j=1,…,J:
• Rj = mediani {θij}
• Calculate CN log-ratios:
– For each array i=1,…,I and SNP j=1,…,J:
• Mij = log2 (θij / Rj)
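The three steps above can be sketched in base R on simulated probe signals. This is only an illustration of the arithmetic: the number of arrays, SNPs and probes, the per-copy signal of 1000, the noise level, and the simulated deletion in the first array are all assumptions, and no real CEL data is read.

```r
set.seed(1)
I <- 20; J <- 100; K <- 3   # arrays, SNPs, replicated probes (hypothetical)

# Simulated true total CN: everything is CN=2, except a one-copy
# deletion covering SNPs 41..60 in array 1
cn <- matrix(2, nrow = J, ncol = I)
cn[41:60, 1] <- 1

# Random genotypes: number of A alleles, the rest are B alleles
nA <- matrix(rbinom(J * I, size = as.vector(cn), prob = 0.5), nrow = J, ncol = I)
nB <- cn - nA

# Probe signals as a J x I x K array with multiplicative noise
simulatePM <- function(n) {
  array(1000 * as.vector(n) * exp(rnorm(J * I * K, sd = 0.1)), dim = c(J, I, K))
}
pmA <- simulatePM(nA)
pmB <- simulatePM(nB)

# Step 1: median across probes for each allele, then sum the alleles
thetaA <- apply(pmA, c(1, 2), median)
thetaB <- apply(pmB, c(1, 2), median)
theta <- thetaA + thetaB

# Step 2: reference R_j = median across arrays for each SNP
R <- apply(theta, 1, median)

# Step 3: log-ratios M_ij = log2(theta_ij / R_j)
M <- log2(theta / R)

cat("Median M inside the deletion (array 1): ", median(M[41:60, 1]), "\n")
cat("Median M outside the deletion (array 1):", median(M[-(41:60), 1]), "\n")
```

Because the reference is a median across arrays, it stays close to the CN=2 level as long as most arrays are normal at a given SNP.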
The software tools make this easy for you
- using aroma.affymetrix package
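library("aroma.affymetrix")
# Set up the raw CEL data set
cs <- AffymetrixCelSet$byName("GSE8605", chipType="Mapping10K_Xba142");
# Probe-level model that averages probes; the alleles are combined
plm <- AvgCnPlm(cs, combineAlleles=TRUE);
fit(plm);
# Extract the chip effects (theta)
ces <- getChipEffectSet(plm);
theta <- extractTheta(ces);
# Reference: median across arrays for each locus
thetaR <- rowMedians(theta);
# CN log-ratios
M <- log2(theta / thetaR);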
Copy number regions are found by
lining up estimates along the chromosome
Example: Log-ratios for one sample on Chromosome 22.
Even without a segmentation algorithm,
we can easily spot a deletion here.
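A minimal R sketch of why a large deletion is easy to spot once the log-ratios are lined up along the chromosome. The data are simulated (500 loci ordered by position, a one-copy loss at loci 200 to 300, Gaussian noise), and a running median with a fixed threshold is used as a crude stand-in for a segmentation method; it is not CBS or GLAD.

```r
set.seed(7)

# Simulated log-ratios for loci already ordered by genomic position
J <- 500
M <- rnorm(J, mean = 0, sd = 0.25)
M[200:300] <- M[200:300] - 1   # one-copy loss: log2(1/2) = -1

# Smooth along the chromosome with a running median over 21 loci
Ms <- as.vector(runmed(M, k = 21))

# Flag loci whose smoothed log-ratio is clearly below zero
isLoss <- (Ms < -0.5)
idx <- which(isLoss)
cat("Flagged loci:", min(idx), "to", max(idx), "\n")
```

The window size and the threshold of -0.5 are arbitrary choices for this simulation; smaller or weaker aberrations need a proper segmentation algorithm.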
If we don’t add up the alleles, we get allele-specific
estimates from which we can get genotypes
Example: (θijA, θijB) for one SNP across all samples
[Figure: three genotype groups labeled BB, AB and AA]
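How genotypes fall out of allele-specific estimates can be sketched in R with a naive rule on the allele B fraction. This is not CRLMM or any other published genotyper; the simulated signals (1000 per allele copy plus a background of 200), the noise level and the cut-offs at 1/3 and 2/3 are assumptions chosen for illustration.

```r
set.seed(3)

# Simulated true genotypes for one SNP across 90 samples
truth <- rep(c("AA", "AB", "BB"), each = 30)
nA <- c(AA = 2, AB = 1, BB = 0)[truth]
nB <- 2 - nA

# Allele-specific chip effects with background and additive noise
thetaA <- 1000 * nA + 200 + rnorm(90, sd = 50)
thetaB <- 1000 * nB + 200 + rnorm(90, sd = 50)

# Fraction of the total signal that comes from allele B
beta <- unname(thetaB / (thetaA + thetaB))

# Naive genotype calls from fixed cut-offs
calls <- cut(beta, breaks = c(-Inf, 1/3, 2/3, Inf),
             labels = c("AA", "AB", "BB"))
print(table(truth, calls))
```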
There are a lot of artifacts in microarray data
- can we do better?
Systematic variation can be added due to:
• Spatial artifacts
• Intensity dependent effects
• Probe-sequence dependent effects
• GC-content effects
• PCR effects
• Lab & people effects
• Non-calibrated scanners
• …?
Spatial artifacts (“extreme”)
http://plmimagegallery.bmbolstad.com/
Intensity dependent artifacts/variation
Lab and people effects/variation
PCR fragment length effects/variation
32.5Mb deletion on chr 11
Before
= 0.246
After
= 0.225
“Wave” patterns along genome
Nucleotide-position effect ()
Probe-sequence effects/variation
- probes respond differently
Position (t)
Low-Level
Copy Number Analysis
Part 3 – aroma.affymetrix
aroma.affymetrix processes
unlimited number of arrays
• Processes unlimited number of arrays:
– Bounded memory algorithms.
– Works toward file system.
– Persistent memory: robust & picks up where last stopped.
• Memory requirements: 1.0-2.0GB RAM.
– Example: RMA on 4500 HG-U133A arrays uses ~500MB of RAM.
– Example: CRMA on 300 SNP6.0 arrays uses ~1.5GB of RAM.
– Example: FIRMA on 200 HuEx-1.0 arrays uses ~1.5GB of RAM.
• Cross platform: Linux/Unix, Windows, OSX.
• Supports most Affymetrix chip types:
– All chip types with a CDF (and some more).
– Custom CDFs.
aroma.affymetrix "implements"
several existing methods
• Calibration and normalization:
– Background correction methods: RMA, gcRMA, ...
– Allelic cross-talk calibration, quantile normalization, spatial normalization, probe-sequence normalization, …
– PCR fragment-length normalization, GC-content normalization.
• Probe-level summarization:
– Multiplicative (dChip), affine, and log-additive (RMA) models. Easy to add new.
• Quality assessment:
– RLE (Relative Log Expressions), NUSE (Normalized Unscaled Standard Error)
– Spatial plots: probe signals, PLM residuals, chip effects, CDF annotations, ...
• Paired & non-paired copy-number analysis:
– All SNP & CN platforms. Multiple chip types.
– CRMA (our methods for estimating raw CNs).
– Allele-specific and/or total CN estimates
– Genotyping via CRLMM
– Segmentation method: CBS & GLAD. Easy to add more.
• Miscellaneous:
– Alternative splicing (exon arrays): Finding Isoforms using RMA (FIRMA)
– Tiling-array analysis: MAT processing
– Resequencing arrays
– Gene expression arrays (of course)
aroma.affymetrix is an
open-source solution
• Community:
– 250+ users worldwide (approx. 10 installations per day)
– Active mailing list (Google Groups; 150+ messages per month)
– Collaborative documentation and vignettes (Google Groups).
• Development:
– ~3 years since start:
– Jan 2006-Oct 2006: Phase I: Identifying API (1-3 person project).
– Oct 2006-Feb 2007: Phase II: Maturing API and testing (10-15 users & developers).
– Feb 2007-Aug 2007: Phase III: Extended real-world testing (30-50 users & developers).
– Aug 2007-Fall 2008: Phase IV: Public release and heaps of CPU mileage (more "wild" use cases).
– Fall 2008-…: Phase V: Third party extensions are coming in. More chip types supported.
– 3-5 active developers. One main maintainer / code coordinator.
– Some external code snippet contributors.
– Lots of validation code - catches existing and future bugs.
– 1,000+ pages code / Rd pages.
• Standards:
– Standard file formats, e.g. reads/writes CEL (via affxparser/Fusion SDK).
– Imports and exports to: APT, GTC, CNAG, CNAT, dChip, Bioconductor etc.
– Strict directory structures and relative pathnames
=> portable scripts, robustness, more automated validations, easier to troubleshoot, simplified support.
– Utilizes existing packages, e.g. preprocessCore, gcRMA, DNAcopy...
Walk-through example
Complete aroma.affymetrix script for copy-number analysis of
270 SNP6.0 samples
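library("aroma.affymetrix")
# Annotation data and raw CEL files
cdf <- AffymetrixCdfFile$byChipType("GenomeWideSNP_6")
csR <- AffymetrixCelSet$byName("HapMap270", cdf=cdf)
# Allelic crosstalk calibration
acc <- AllelicCrosstalkCalibration(csR)
csC <- process(acc)
# Probe-sequence (base-position) normalization
bpn <- BasePositionNormalization(csC)
csN <- process(bpn)
# Probe-level summarization with alleles combined
plm <- AvgCnPlm(csN, combineAlleles=TRUE)
fit(plm)
ces <- getChipEffectSet(plm)
# PCR fragment-length normalization
fln <- FragmentLengthNormalization(ces)
cesN <- process(fln)
# Segmentation (CBS) and HTML report
seg <- CbsModel(cesN)
ce <- ChromosomeExplorer(seg)
process(ce)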
Offline & online dynamic HTML reports
Example: ChromosomeExplorer
Setup is as simple as placing the files in
a strict & standardized directory structure
annotationData/
chipTypes/
GenomeWideSNP_6/
GenomeWideSNP_6.CDF
GenomeWideSNP_6.UGP
GenomeWideSNP_6.UFL
rawData/
HapMap270,CEU/
GenomeWideSNP_6/
*.CEL
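The layout above can be created from R with base functions only. This sketch builds the two directories under a temporary root so that it can be run anywhere; in practice the root would be the working directory of the analysis, and the CDF, UGP, UFL and CEL files still have to be copied in by hand.

```r
# Temporary root directory, used here only to keep the example harmless
root <- tempfile("aromaDemo")

# The two branches of the standardized directory structure
dirs <- c(
  file.path(root, "annotationData", "chipTypes", "GenomeWideSNP_6"),
  file.path(root, "rawData", "HapMap270,CEU", "GenomeWideSNP_6")
)

# Create the directories, including all parent directories
for (d in dirs) {
  dir.create(d, recursive = TRUE)
}

# List what was created, relative to the root
print(list.dirs(root, full.names = FALSE, recursive = TRUE))
```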
No (absolute) pathnames are used
- maximizes portability
annotationData/
chipTypes/
GenomeWideSNP_6/
GenomeWideSNP_6.CDF
GenomeWideSNP_6.UGP
...
cdf <- AffymetrixCdfFile$byChipType("GenomeWideSNP_6")
print(cdf)
AffymetrixCdfFile:
Path: annotationData/chipTypes/GenomeWideSNP_6
Filename: GenomeWideSNP_6.cdf
Filesize: 470.44MB
File format: v4 (binary; XDA)
Chip type: GenomeWideSNP_6
Dimension: 2572x2680
Number of cells: 6892960
Number of units: 1881415
...
The file system is the memory
- data is loaded only when needed
cdf <- AffymetrixCdfFile$byChipType("GenomeWideSNP_6")
csR <- AffymetrixCelSet$byName("HapMap270", cdf=cdf)
print(csR)
AffymetrixCelSet:
Name: HapMap270
Tags: CEU
Path: rawData/HapMap270,CEU/GenomeWideSNP_6
Chip type: GenomeWideSNP_6
Number of arrays: 270
Names: NA06985, NA06991, ..., NA07019
Total file size: 17.7GB
RAM: 0.01MB
Normalized data is stored as CEL files
- import to any software
acc <- AllelicCrosstalkCalibration(csR)
csC <- process(acc)
print(csC)
AffymetrixCelSet:
Name: HapMap270
Tags: CEU,ACC,ra,-XY
Path: probeData/HapMap270,CEU,ACC,ra,-XY/GenomeWideSNP_6
Chip type: GenomeWideSNP_6
Number of arrays: 270
Names: NA06985, NA06991, ..., NA07019
Total file size: 17.7GB
RAM: 0.01MB
files <- getPathnames(csC)
print(files[1])
[1] "probeData/HapMap270,CEU,ACC,ra,-XY/GenomeWideSNP_6/NA06985.CEL"
Data sets (directories) are marked
with unique tags
qn <- QuantileNormalization(csC)
csN <- process(qn)
print(csN)
AffymetrixCelSet:
Name: HapMap270
Tags: CEU,ACC,ra,-XY,QN
Path: probeData/HapMap270,CEU,ACC,ra,-XY,QN/GenomeWideSNP_6
Chip type: GenomeWideSNP_6
Number of arrays: 270
Names: NA06985, NA06991, ..., NA07019
Total file size: 17.7GB
RAM: 0.01MB
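The printed paths suggest a simple convention: the data set name and its tags are joined by commas to form the directory name, followed by the chip type. A minimal R sketch of that convention follows; `dataSetPath()` is a hypothetical helper written for this illustration and is not part of the aroma.affymetrix API.

```r
# Hypothetical helper: build a data set directory from name, tags and chip type
dataSetPath <- function(root, name, tags, chipType) {
  fullname <- paste(c(name, tags), collapse = ",")
  file.path(root, fullname, chipType)
}

# Raw data: name plus the "CEU" tag
rawPath <- dataSetPath("rawData", "HapMap270", "CEU", "GenomeWideSNP_6")

# After crosstalk calibration and quantile normalization: more tags appended
qnPath <- dataSetPath("probeData", "HapMap270",
                      c("CEU", "ACC", "ra", "-XY", "QN"), "GenomeWideSNP_6")

print(rawPath)
print(qnPath)
```

Since each processing step appends its own tags, the directory name records how the data set was produced.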