Copy Number Variations

CZ5225: Modeling and Simulation in Biology
Lecture 10: Copy Number Variations
Prof. Chen Yu Zong
Tel: 6516-6877
Email: phacyz@nus.edu.sg
http://bidd.nus.edu.sg
Room 08-14, level 8, S16, NUS
Copy number variation (CNV)
What is it?
• A form of human genetic variation: instead of 2 copies of
each region of each chromosome (diploid), some people
have amplifications or losses (> 1kb) in different regions
– this doesn’t include translocations or inversions
• We all have such regions
– the publicly available genome NA15510 has
between 5 & 240 by various estimates
– they are only rarely harmful (but rare things do happen)
Copy-number probes are used to quantify
the amount of DNA at known loci
CN locus: ...CGTAGCCATCGGTAAGTACTCAATGATAG...
PM: ATCGGTAGCCATTCATGAGTTACTA
CN=1: PM = c
CN=2: PM = 2c
CN=3: PM = 3c
Copy number variation
Population genomics
The genomes of two humans differ more in a
structural sense than at the nucleotide level; a
recent paper estimates that on average two of us
differ by
~ 4 - 24 Mb of genetic sequence due to Copy Number
Variation
~ 2.5 Mb due to Single Nucleotide
Polymorphisms
Abundance of CNVs in the
human population
?
Still an open question but probably
thousands, at low allelic frequency
(<20%)
Abundance of deletion CNVs in
the human population
Comparison of overlapping CNVs identified by Conrad et al. (2006) and McCarroll et al. (2006).
Freeman et al. Genome Res 2006
Non-allelic homologous recombination events
between low-copy repeats (LCR-NAHR)
Lupski & Inoue, TIG 2002
Duplications and Deletions of LCRs
mediated by NAHR
LCRs in
direct
orientation
LCRs in
inverted
orientation
Inversions
Intrachromatid recombination
between LCRs
LCRs in direct orientation
Deletion
LCRs in inverted orientation
Inversion
Mechanisms generating
genomic deletions
Copy number variation
Relations to human disease
Responsible for a number of rare genetic conditions.
For example, Down syndrome ( trisomy 21), Cri du chat
syndrome (a partial deletion of 5p).
Implicated in complex diseases. For example:
CCL3L1 CN influences HIV/AIDS susceptibility; also,
some sporadic (non-inherited) CN variants are strongly
associated with autism.
Tumors typically have a lot of chromosomal abnormalities,
including recurrent CN changes.
Evolutionary and medical
implications of CNVs:
CCL3L1 as an example
Gonzalez et al., Science, 2005
When CCL3L1 occupies the CCR5 receptor on
CD4 cells, it blocks HIV's entry.
Copy-number variation of CCL3L1 within
and among human and chimp populations
Gonzalez et al., Science, 2005
CCL3L1 and HIV Infection
Individuals with a high
CCL3L1 gene copy
number relative to their
population average are
more resistant to HIV
infection than those with a
low copy number,
presumably because there
is more ligand to compete
with HIV during binding to
CCR5.
Gonzalez et al., Science, 2005
Trisomy 21
Partial deletion of chr 5p
A cytogeneticist’s story
“The story is about diagnosis of a 3 month old baby with
macrocephaly and some heart problems. The doctors
questioned a couple of syndromes which we tested for and
found negative. Rather than continue this ‘shot in the dark’
approach, we put the case on an array and found a 2Mb
deletion which notably deletes the gene NSD1 on chr 5,
mutations in which are known to cause Sotos syndrome.
This is an overgrowth syndrome and fits with the
macrocephaly.
The bottom line is that we are able to diagnose quicker by
this approach and delineate exactly the underlying genetic
change.”
A cytogeneticist’s story
Chromosome 5
2Mb deletion
Many tumors have gross CN changes
A lung cancer cell line vs matched normal lymphoblast,
from Nannya et al Cancer Res 2005;65:6071-6079
Research into gonad dysfunction:
Human sex reversal
• 20% of 46,XY females have mutations in SRY
• 80% of 46,XY females unexplained!
• 90% of 46,XX males due to translocation of SRY
• 10% of 46,XX males unexplained!
Suggests loss of function and gain of function mutations in
other genes may cause sex reversal. We’re looking at shared
deletions.
Affymetrix SNP chip terminology
Genomic DNA, with the SNP (A/G) in brackets:
TAGCCATCGGTA[A/G]GTACTCAATGAT
Perfect Match probe for Allele A:
ATCGGTAGCCATTCATGAGTTACTA
Perfect Match probe for Allele B:
ATCGGTAGCCATCCATGAGTTACTA
Genotyping: answering the question about the two
copies of the chromosome on which the SNP is located:
Is a sample AA (AA), AB (AG) or BB (GG) at this SNP?
Affymetrix GeneChip
Feature size: 5µ x 5µ
Chip size: 1.28cm x 1.28cm
6.4 million features / chip
> 1 million identical 25 bp probes / feature
GeneChip Mapping Assay Overview
Steps in order:
• 250 ng Genomic DNA
• RE Digestion (Xba)
• Adaptor Ligation
• PCR: One Primer Amplification (Complexity Reduction)
• Fragmentation and Labeling
• Hyb & Wash
• Genotype calls: AA, BB, AB
Principal low-level analysis steps
• Background adjustment and normalization at probe level
These steps are to remove lab/operator/reagent effects
• Combining probe level summaries to probe set level
summary: best done robustly, on many chips at once
This is to remove probe affinity effects and discordant observations
(gross errors/non-responding probes, etc)
• Possibly further rounds of normalization (probe set level)
as lab/cohort/batch/other effects are frequently still visible
• Derive the relevant copy-number quantities
Finally,
quality assessment is an important low-level task.
Preprocessing for total CN using SNP probe
pairs (250K chip)
TT
AT
AA
Modification by H Bengtsson of a method due to A Wirapati developed some
years ago for microsatellite genotyping; similar to the approach used by Illumina.
Background adjustment
and normalization
Outcome similar to that achieved by quantile normalization
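Quantile normalization, the comparison point named above, can be sketched in a few lines. This is a minimal illustration, not the preprocessing actually used in the lecture: the function name `quantile_normalize` and the toy intensities are hypothetical, and ties are broken simply by position.

```python
def quantile_normalize(arrays):
    """Give every chip the same empirical distribution of probe signals.

    arrays: list of equal-length lists, one list of probe signals per chip.
    Returns new lists; within each chip the ranking of probes is preserved.
    """
    n = len(arrays[0])
    if any(len(a) != n for a in arrays):
        raise ValueError("all chips must have the same number of probes")

    # Target distribution: mean of the k-th smallest value across chips.
    sorted_arrays = [sorted(a) for a in arrays]
    target = [sum(s[k] for s in sorted_arrays) / len(arrays) for k in range(n)]

    result = []
    for a in arrays:
        # Probe indices from smallest to largest signal (stable for ties).
        order = sorted(range(n), key=lambda idx: a[idx])
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = target[rank]
        result.append(out)
    return result


# Hypothetical toy data: three chips, four probes each.
chips = [[5.0, 2.0, 3.0, 4.0], [4.0, 1.0, 6.0, 2.0], [3.0, 4.0, 6.0, 8.0]]
normalized = quantile_normalize(chips)
```

After this step every chip holds the same set of values, so chip-wide intensity differences of the kind caused by lab, operator or reagent effects no longer show up in the distributions.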
Low-level analysis problems remain
unsolved; why?
• The feature size keeps shrinking, and so the # features/chip
keeps growing;
• Fewer and fewer features are used for a given
measurement, allowing more measurements to be made
using a single chip
These considerations all place more and more demands on
the low-level analysis: to maintain the quality of existing
measurements, and to obtain good new ones.
SNP probes can be used to
estimate total copy numbers
AA: PM = PMA + PMB = 2c
AB: PM = PMA + PMB = 2c
BB: PM = PMA + PMB = 2c
AAB: PM = PMA + PMB = 3c
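The idea on this slide can be written out as a tiny idealized model. The sketch below assumes each allele copy contributes exactly c to its own PM probe, with no crosstalk, offset or noise; the function name `total_pm` is hypothetical.

```python
def total_pm(genotype: str, c: float = 1.0) -> float:
    """Idealized total signal PM = PMA + PMB for a genotype such as 'AB' or 'AAB'."""
    pm_a = genotype.count("A") * c  # signal from the allele A probe
    pm_b = genotype.count("B") * c  # signal from the allele B probe
    return pm_a + pm_b


for genotype in ("AA", "AB", "BB", "AAB"):
    print(genotype, total_pm(genotype))
```

The three diploid genotypes give the same total, so the sum depends on the number of copies and not on which alleles are present.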
SNP probe tiling strategy
SNP (A/G), 0 position:
TAGCCATCGGTA N GTACTCAATGAT*
PM 0 Allele A: ATCGGTAGCCAT T CATGAGTTACTA
MM 0 Allele A: ATCGGTAGCCAT A CATGAGTTACTA
PM 0 Allele B: ATCGGTAGCCAT C CATGAGTTACTA
MM 0 Allele B: ATCGGTAGCCAT G CATGAGTTACTA
Central probe quartet
SNP probe tiling strategy
SNP (A/G), +4 position:
TAGCCATCGGTA N GTA C TCAATGATCAGCT*
PM +4 Allele A: GTAGCCAT T CAT G AGTTACTAGTCG
MM +4 Allele A: GTAGCCAT T CAT C AGTTACTAGTCG
PM +4 Allele B: GTAGCCAT C CAT G AGTTACTAGTCG
MM +4 Allele B: GTAGCCAT C CAT C AGTTACTAGTCG
+4 offset probe quartet
SNP for Identifying Copy Number Variations
• Using SNP chips to identify change in total copy
number (i.e. CN ≠ 2)
• Outline a new method (CRMA)
• Evaluate and compare it with other methods
• Make some closing remarks on further issues
Copy-number estimation using
Robust Multichip Analysis (CRMA)
CRMA steps:
• Preprocessing (probe signals): allelic crosstalk (or quantile)
• Total CN: PM=PMA+PMB
• Summarization (SNP signals θ): log-additive, PM only
• Post-processing: fragment-length (GC-content)
• Raw total CNs: Mij = log2(θij/θRj), with R = Reference, chip i, probe j
A few details are passed over. Ask me later if you care about them.
Crosstalk between alleles
- adds significant artifacts to signals
Cross-hybridization:
Allele A: TCGGTAAGTACTC
Allele B: TCGGTATGTACTC
AA: PMA >> PMB
AB: PMA ≈ PMB
BB: PMA << PMB
There are six possible allele pairs
• Nucleotides: {A, C, G, T}
• Ordered pairs:
– (A,C), (A,G), (A,T), (C,G), (C,T), (G,T)
• Because different nucleotides bind differently,
the crosstalk from A to C might be very different
from that from A to T.
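The count of six follows from choosing 2 distinct nucleotides out of 4. A short check, with the variable names chosen here for illustration only:

```python
from itertools import combinations

NUCLEOTIDES = "ACGT"

# All pairs of distinct nucleotides, each written in alphabetical order.
allele_pairs = list(combinations(NUCLEOTIDES, 2))
print(len(allele_pairs), allele_pairs)
```

Since crosstalk may differ between pairs, a correction would be estimated separately for each of these six groups of SNPs.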
Crosstalk between alleles
is easy to spot
Example: data from one array
Probe pairs (PMA, PMB) for nucleotide pair (A,T)
[Figure: PMB plotted against PMA, with the AA, AB and BB groups and the offset marked "+"]
Crosstalk between alleles
can be estimated and corrected for
What is done:
• Offset is removed from SNPs and CN units.
• Crosstalk is removed from SNPs.
[Figure: PMB plotted against PMA after correction, with the AA, AB and BB groups and no offset]
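The correction step can be illustrated with a simple linear model. This sketch assumes the observed pair is (PMA, PMB) = offset + W x, where x holds the true allele signals and W is a 2x2 crosstalk matrix. It also assumes the offset and W have already been estimated for the nucleotide pair in question; the estimation itself is not shown, and the slides do not give the exact procedure. The function name and all numbers are hypothetical.

```python
def correct_crosstalk(pm_a, pm_b, offset, w):
    """Undo y = offset + W x for one probe pair.

    offset: (offset_a, offset_b)
    w: ((w_aa, w_ab), (w_ba, w_bb)); w_ab is the leak of allele B signal
       into the allele A probe, w_ba the leak of A into the B probe.
    Returns the corrected (PMA, PMB).
    """
    (w_aa, w_ab), (w_ba, w_bb) = w
    det = w_aa * w_bb - w_ab * w_ba
    if det == 0:
        raise ValueError("crosstalk matrix is singular")
    # Remove the offset, then apply the inverse of the 2x2 matrix.
    ya, yb = pm_a - offset[0], pm_b - offset[1]
    xa = (w_bb * ya - w_ab * yb) / det
    xb = (-w_ba * ya + w_aa * yb) / det
    return xa, xb


# Hypothetical AA sample: true signals (2000, 0), offset 200 on both probes,
# and 20% of the allele A signal leaking into the allele B probe.
offset = (200.0, 200.0)
w = ((1.0, 0.3), (0.2, 1.0))
observed = (200.0 + 2000.0, 200.0 + 0.2 * 2000.0)
print(correct_crosstalk(observed[0], observed[1], offset, w))
```

After correction the AA sample has essentially no allele B signal, which corresponds to the groups lying along the axes with no offset.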
Copy-number estimation using
Robust Multichip Analysis (CRMA)
Preprocessing (probe signals): allelic crosstalk (or quantile)
Already briefly described.
Copy-number estimation using
Robust Multichip Analysis (CRMA)
Total CN: PM=PMA+PMB
That’s it!
Copy-number estimation using
Robust Multichip Analysis (CRMA)
Summarization (SNP signals θ): log-additive, PM only
log2(PMijk) = log2(θij) + log2(φjk) + εijk
Fit using rlm
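To make the log-additive model concrete, here is a sketch that fits chip effects and probe effects for one SNP. It uses median polish as a simple robust stand-in, not the rlm fit named on the slide. The constraint that probe effects have median zero is a choice made for this example, and the function name `median_polish` and the toy values are hypothetical.

```python
import math
from statistics import median


def median_polish(y, n_iter=10):
    """Fit y[i][k] ~ chip[i] + probe[k] on the log2 scale for one SNP.

    y[i][k] is log2(PM) for chip i and probe k. Returns (chip, probe):
    estimates of log2(theta_i) and log2(phi_k), with median(probe) == 0.
    """
    n_chips, n_probes = len(y), len(y[0])
    resid = [row[:] for row in y]
    chip = [0.0] * n_chips
    probe = [0.0] * n_probes
    for _ in range(n_iter):
        # Sweep row medians into the chip effects.
        for i in range(n_chips):
            m = median(resid[i])
            chip[i] += m
            resid[i] = [r - m for r in resid[i]]
        # Sweep column medians into the probe effects.
        for k in range(n_probes):
            m = median(resid[i][k] for i in range(n_chips))
            probe[k] += m
            for i in range(n_chips):
                resid[i][k] -= m
    # Fix the constraint: the typical probe effect goes into the chip effects.
    shift = median(probe)
    return [c + shift for c in chip], [p - shift for p in probe]


# Hypothetical noise-free data: two chips, three probes.
theta = [1000.0, 1500.0]   # chip-level SNP signals
phi = [0.5, 1.0, 2.0]      # probe affinities
y = [[math.log2(t * p) for p in phi] for t in theta]
chip_effects, probe_effects = median_polish(y)
print([2 ** c for c in chip_effects])
```

Because medians are used in place of means, a single discordant probe on one chip has limited influence on the chip-level estimate, which is the motivation for fitting the model robustly.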
Copy-number estimation using
Robust Multichip Analysis (CRMA)
Postprocessing: fragment-length (GC-content)
Longer fragments get less
well amplified by PCR and
so give weaker SNP signals
(100K chip)
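A fragment-length correction can be sketched as fitting the trend of log signal against fragment length and subtracting it. The version below assumes a straight-line trend fitted by ordinary least squares, which is a simplification chosen for brevity; the slides do not specify the curve used, and the function name and numbers are hypothetical.

```python
def remove_length_trend(log_signal, frag_len):
    """Subtract a fitted linear fragment-length trend from log2 SNP signals.

    log_signal[j] and frag_len[j] belong to SNP j on one chip. The overall
    mean signal is preserved; only the dependence on length is removed.
    """
    n = len(log_signal)
    if n != len(frag_len):
        raise ValueError("inputs must have the same length")
    mean_x = sum(frag_len) / n
    mean_y = sum(log_signal) / n
    sxx = sum((x - mean_x) ** 2 for x in frag_len)
    if sxx == 0:
        raise ValueError("fragment lengths are all identical")
    slope = sum((x - mean_x) * (y - mean_y)
                for x, y in zip(frag_len, log_signal)) / sxx
    return [y - slope * (x - mean_x) for x, y in zip(frag_len, log_signal)]


# Hypothetical data: signal drops by 0.001 log2 units per base pair.
lengths = [400.0, 800.0, 1200.0, 1600.0]
signals = [10.0 - 0.001 * length for length in lengths]
print(remove_length_trend(signals, lengths))
```

In this toy case the corrected signals are all equal, since length was the only source of variation.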
Copy-number estimation using
Robust Multichip Analysis (CRMA)
Postprocessing: fragment-length (GC-content)
Longer fragments get less
well amplified by PCR and
so give weaker SNP signals
(500K chip)
Copy-number estimation using
Robust Multichip Analysis (CRMA)
Raw total CNs: Mij = log2(θij/θRj)
Care required with the number and
nature of Reference samples used
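The raw total CN is a log-ratio against a reference, so the choice of reference matters. The sketch below assumes the reference for each SNP is the median signal across all chips, and that the reference is diploid when converting back to a copy number. Both are assumptions made for illustration, and the function names and values are hypothetical.

```python
import math
from statistics import median


def raw_log_ratios(theta):
    """Return M[i][j] = log2(theta[i][j] / reference[j]).

    theta[i][j] is the summarized signal for chip i and SNP j. The reference
    is taken as the per-SNP median across chips (an assumption).
    """
    n_snps = len(theta[0])
    reference = [median(chip[j] for chip in theta) for j in range(n_snps)]
    return [[math.log2(chip[j] / reference[j]) for j in range(n_snps)]
            for chip in theta]


def copy_number_from_log_ratio(m):
    """Convert a log-ratio to a copy number, assuming a diploid reference."""
    return 2.0 * 2.0 ** m


# Hypothetical signals: three chips, two SNPs; chip 3 has a gain at SNP 1.
theta = [[1000.0, 800.0], [1000.0, 820.0], [1500.0, 780.0]]
M = raw_log_ratios(theta)
print(copy_number_from_log_ratio(M[2][0]))
```

With few reference samples, or with references that themselves carry a CN change at a locus, the median is a poor estimate of the diploid signal and the log-ratios shift accordingly.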
Comparison of 4 methods
Step | CRMA | dChip (Li & Wong 2001) | CNAG* (Nannya et al 2005) | CNAT v4 (Affymetrix 2006)
Preprocessing (probe signals) | allelic crosstalk (quantile) | quantile | scale | quantile
Total CN | PM=PMA+PMB | PM=PMA+PMB, MM=MMA+MMB | PM=PMA+PMB | θ=θA+θB
Summarization (SNP signals θ) | log-additive, PM-only | multiplicative, PM-MM | - | “log-additive”, PM-only
Post-processing | fragment-length (GC-content) | - | fragment-length (GC-content) | fragment-length (GC-content)
Raw total CNs | Mij = log2(θij/θRj) | Mij = log2(θij/θRj) | Mij = log2(θij/θRj) | Mij = log2(θij/θRj)
Further bioinformatic issues
• Estimating copy number: needs calibration data
• Segmentation (of chromosomes into constant
copy number regions): an HMM-like algorithm
• Analyzing family CN data: a different HMM
• Incorporating non-polymorphic probes:
independent HMM observations to be weighted
and combined
• Dealing with mixed normal-abnormal samples
• Utilizing poor quality DNA samples
• Estimating allele-specific copy number
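The segmentation item above mentions an HMM-like algorithm without giving details. As a generic illustration only, and not the lecturer's method, here is a Viterbi decode over copy-number states with Gaussian emissions on the log-ratio scale. The state set, noise level `sd`, and `p_stay` are hypothetical parameters, and a diploid reference is assumed.

```python
import math


def viterbi_copy_number(log_ratios, states=(1, 2, 3), sd=0.2, p_stay=0.99):
    """Most likely copy-number state per locus along one chromosome.

    Each state s emits log-ratios around log2(s / 2). Neighbouring loci keep
    the same state with probability p_stay.
    """
    if not log_ratios:
        return []
    means = [math.log2(s / 2) for s in states]
    n_states = len(states)
    log_stay = math.log(p_stay)
    log_move = math.log((1 - p_stay) / (n_states - 1))

    def log_emit(x, mu):
        # Gaussian log density, dropping a constant shared by all states.
        return -0.5 * ((x - mu) / sd) ** 2

    score = [math.log(1 / n_states) + log_emit(log_ratios[0], mu) for mu in means]
    back = []
    for x in log_ratios[1:]:
        new_score, pointers = [], []
        for s in range(n_states):
            cands = [score[p] + (log_stay if p == s else log_move)
                     for p in range(n_states)]
            best = max(range(n_states), key=lambda p: cands[p])
            pointers.append(best)
            new_score.append(cands[best] + log_emit(x, means[s]))
        score = new_score
        back.append(pointers)

    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return [states[s] for s in path]


# Hypothetical log-ratios with a three-locus single-copy deletion.
m = [0.0, 0.05, -0.03, -1.0, -0.95, -1.05, 0.02, 0.0]
print(viterbi_copy_number(m))
```

The high `p_stay` discourages state changes, so isolated noisy loci tend to be absorbed into the surrounding segment, while a run of consistent loci is called as a separate region.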