Qunyuan Zhang
Division of Statistical Genomics
Department of Genetics & Center for Genome Sciences
Washington University School of Medicine
04 - 23 – 2010
GEMS Course: M 21-621 Computational Statistical Genetics
1
Gene Copy Number
The gene copy number (also "copy number variants" or CNVs) is the amount of copies of a particular gene in the genotype of an individual. Recent evidence shows that the gene copy number can be elevated in cancer cells. For instance, the EGFR copy number can be higher than normal in Non-small cell lung cancer. …Elevating the gene copy number of a particular gene can increase the expression of the protein that it encodes.
From Wikipedia www.wikipedia.org
2
DNA Copy Number
A Copy Number Variant (CNV) represents a copy number change involving a
DNA fragment that is ~1 kilobases or larger.
From Nature Reviews Genetics , Feuk et al. 2006
DNA Copy Number ≠ DNA Tandem Repeat Number (e.g. microsatellites)
<10 bases
DNA Copy Number ≠ RNA Copy Number
RNA Copy Number = Gene Expression Level
DNA transcription mRNA
Copy Number is the amount of copies of a particular fragment of nucleic acid molecular chain. It refers to DNA Copy Number in most publications.
3
- restriction fragment length polymorphism (RFLP)
- amplified fragment length polymorphism (AFLP)
- random amplification of polymorphic DNA (RAPD)
- variable number of tandem repeat (VNTR; e.g., mini- and microsatellite)
- single nucleotide polymorphism (SNP)
- presence/absence of transportable elements
…
- structural alterations (deletions, duplications, insertions , inversions … )
DNA copy number variant (CNV)
Association with phenotypes/diseases genes/genetic factors
4
Mutation, LOH, Copy Number Aberration (CNA)
Normal cell CN=2
Homologous repeats
Segmental duplications
Chromosomal rearrangements
Duplicative transpositions
Non-allelic recombinations
…… Tumor cells deletion amplification
CN=0 CN=1 CN=2 CN=3 CN=4
5
Quantitative Polymerase Chain Reaction (Q-PCR) : DNA Amplification
(dNTPs, primers, Taq polymerase, fluorescent dye)
PCR less CN amplification less DNA low fluorescent intensity more CN amplification more DNA high fluorescent intensity
(one fragment each time)
Microarray : DNA Hybridization
(dNTPs, primers, Taq polymerase, fluorescent dye)
PCR less CN amplification less DNA arrayed probes low intensities more CN amplification more DNA arrayed probes high intensities
(multiple/different fragments, mixed pool)
Hybridization
6
Array Comparative Genomic Hybridization (CGH)
Tumor: red intensity
Normal: green intensity more DNA copy number more DNA hybridization higher intensity
Red < Green: Deletion (CN<2)
Red > Green: Amplification (CN>2)
Red = Green: No Alteration (CN=2)
7
SNP Array
Tumor Normal
Affymetrix Mapping
250K StyI chip
~250K probe sets
~250K SNPs probe set (24 probes)
Deletion
CN=2
CN=1
Deletion
CN=2
CN=0
CN>2
Amplification
CN=2 more DNA copy number more DNA hybridization higher intensity
8
Genotyping & Copy Number Calling
CN=0
2 copy deletion, genotype (_//_)
CN=1
1 copy deletion, genotype (_//B)
CN=2
Normal , genotype (A//B)
1 copy amplification, genotype (AA//B)
CN=3
CN=4
2 copy amplification, genotype (AA//BB)
9
BBBB
BB
A_
AB
AABB
AA
10
11
Finished chips (scanner) Raw image data [.DAT files]
(experiment info [ .EXP]) (image processing software) chip description file
[.CDF]
Probe level raw intensity data [.CEL files]
Preprocessing : Background adjustment , Normalization , Summarization
Summarized intensity data
Raw copy number (CN) data [log ratio of tumor/normal intensities]
Significance test of CN changes
Estimation of CN
Smoothing and boundary determination
Concurrent regions among population
Amplification and deletion frequencies among populations
Association analysis
12
Reduces unevenness of a single chip
Makes intensities of different positions on a chip comparable
Before adjustment After adjustment
Corrected Intensity (S’) = Observed Intensity (S) – Background Intensity (B)
For each region i, B(i) = Mean of the lowest 2% intensities in region i
AffyMetrix MAS 5.0
13
Eliminates non-specific hybridization signal
Obtains accurate intensity values for specific hybridization sense or antisense strands
25 oligonucleotide probes quartet probe set
PM only, PM-MM, Ideal MM, etc.
14
Reduces technical variation between chips
Makes intensities from different chips comparable
Before normalization After normalization
S’ =
S – Mean of S
STD of S
S’ ~ N(0,1 )
Base Line Array (linear); Quantile Normalization etc.
15
Combines the multiple probe intensities for each probe set to produce a summarized value for subsequent analyses.
Average methods:
PM only or PM-MM, allele specific or non-specific
Model based method : Li & Wong , 2001
Gene Expression Index
16
before Log transformation
S after Log transformation
Log(S)
S : Summarized raw intensity
S’ : Log transformation, S’ = log
2
(S)
Log ratio of sample i / sample ref.
CN_log2 = log
2
(S i
/S ref
)
CN = 2(S i
/S ref
)
Raw CN
17
Smoothing
Significance test of amplification and deletion
Segmentation
CN estimation
18
… .. … … . . . . .. …… …… .. … … . . . . .. …… … .. …… … ..
Window 1
Window 2
Window 3
Window 4
Window 5
Window 6
Window 7
Window 8
Window 9
Window 10
………..
Window k
………..
Window N
Each window (k ) contains n consecutive SNPs (k, k+1, k+2, k+3, …, k+n-1)
19
Affymetrix
Chrom. 7 Chrom. 7
Mbp
Illumina
Chrom. 7
Mbp
Mbp
Chrom. 7
Mbp
20
Mbp Mbp Mbp
Mbp Mbp
21
Mbp Mbp epidermal growth factor receptor (EGFR)
22
(Break chrom. into CN-homologous pieces)
BioConductor R Packages (www.bioconductor.org)
DNAcopy package, circular binary segmentation (CBS)
GLAD package, adaptive weights smoothing (AWS)
23
1,2,3, ….,i-1 , i , i+1,…,j-1,j , j+1,...n
Given any i , j , n
S i
Z ij
k i
1
( S j x k
, S j
S i
)
( k j
1 j x k
i )
, S n
( S n
1 ( j
i )
1 k n
1
S x k j
( n
j
S i
)
i )
( n
j
i )
Z
C
max Z ij
Iterate until Zc is not significant.
24
Olshen et al. Biostatistics. 2004 Oct;5(4):557-72.
CNAT (www.affymetrix.com); dChip (www.dchip.org) ; CNAG (www.genome.umin.jp) position hidden status
(unknown CN )
…
SNP_i SNP_i+1 SNP_i+2 SNP_i+3 SNP_i+4
…
CN=?
CN=?
CN=?
CN=?
CN=?
observed status
(raw CN = log ratio of intensities) log ratio log ratio log ratio log ratio log ratio
CN estimation: finding a sequence of CN values which maximizes the likelihood of observed raw CN.
Algorithm: Viterbi algorithm (can be Iterative)
Information/assumptions below are needed
Background probabilities: Overall probabilities of possible CN values.
P(CN=x); x=0,1,2,3,4,…, n (usually,n<10)
Transition probabilities: Probabilities of CN values of each SNP conditional on the previous one.
P(CN_i+1=x i
| CN_i=x j
); x=0,1,2,3,4,…, or n
Emission probabilities: Probabilities of observed raw CN values of each SNP conditional on the hidden/unknown/true CN status.
25
P(log ratio<x|CN=y)=f(x|CN=y); x=one of real numbers; y=0,1,2,3,4, …, or n
(An Example)
Black: Normal Intensities, Red: Tumor Intensities, Green: Tumor- Normal
Blue: HMM estimated CNs in Tumor Tissue
CN=4
CN=3
CN=2 CN=1
26
for Single Sample Analysis
•
Hsu et al. 2005. Denoising array-based comparative genomic hybridization data using wavelets.
Biostatistics 6: 211-226.
•
Hupe et al. Analysis of array CGH data: from signal ratio to gain and loss of DNA regions.
Bioinformatics 20: 3413-3422.
•
Jong et al. 2004. Breakpoint identification and smoothing of array comparative genomic hybridization data. Bioinformatics 20: 3636-3637.
•
Lai et al. 2005. Comparative analysis of algorithms for identifying amplifications and deletions in array CGH data. Bioinformatics 21: 3763-3770.
•
Lai et al. 2005. A statistical method to detect chromosomal regions with DNA copy number alterations using SNP-array-based CGH data. Comput Biol Chem 29: 47-54.
•
Olshen et al. Circular binary segmentation for the analysis of array-based DNA copy number data.
Biostatistics 5: 557-572.
•
Picard et al. 2005. A statistical approach for array CGH data analysis. BMC Bioinformatics 6: 27.
•
Shah et al. 2007. Modeling recurrent DNA copy number alterations in array CGH data.
Bioinformatics 23: i450-458.
•
Nilsson et al. Bioinformatics. 2009 Apr 15;25(8):1078-9 . Epub 2009 Feb 19.
27
Common/Reocurrent Region Identification samples
Nature 2007, 450 , 893-898
28
Genome-wide Raw Copy Number Changes
(sliding window plot, averaged over ~400 pairs )
29
Diskin et al. 2006. STAC, Genome Res 16:
1149-1158. Permutation test
30
GISTIC Beroukhim et al. 2007. Proc Natl
Acad Sci U S A 104:
20007-20012
Weir et al. Nature 2007,
450 , 893-898
31
CMDS Method
Q Zhang et al. Bioinformatics, 2009 doi:10.1093/bioinformatics/btp708
32
for Multiple Sample Analysis
•
( GISTIC ) Beroukhim et al. 2007. Assessing the significance of chromosomal aberrations in cancer: methodology and application to glioma. Proc Natl Acad Sci U S A 104: 20007-20012.
•
( STAC ) Diskin et al. 2006. STAC: A method for testing the significance of DNA copy number aberrations across multiple array-CGH experiments. Genome Res 16: 1149-1158.
•
( MSA ) Guttman et al. 2007. Assessing the significance of conserved genomic aberrations using high resolution genomic microarrays. PLoS Genet 3: e143.
•
( GFA ) Lipson et al. 2006. Efficient calculation of interval scores for DNA copy number data analysis. J Comput Biol 13: 215-228.
•
( MAR ) Rouveirol et al. 2006. Computation of recurrent minimal genomic alterations from array-CGH data. Bioinformatics 22: 849-856.
•( CMDS ) Zhang et al. CMDS: a population-based method for identifying recurrent DNA copy number aberrations in cancer from high-resolution data. Bioinformatics , 2009 doi:10.1093/bioinformatics/btp708
33
Nature Genetics 41 , 1061 - 1067 (2009)
34
Science 2007:Vol. 318. pp. 420 - 426
DOI: 10.1126/science.1149504
35
36