Mining your
Personal Genome
Jieming Chen
Yale University
CBB752a12
What is Personal Genomics?
• Personal genomics is the branch of genomics
concerned with the sequencing and analysis
of the genome of an individual -- Wikipedia
• Was this not possible before?
- Genetics VS genomics
- Post-Human-Genome-Project (HGP)
genomics
[Figure: timeline marked 2000, 2003, 2006, 2008, 2010. Source: Nature (2010)]
Personal Genomics
1. From basic research, to clinic, then to
the masses
2. Tools to mine your own genome
3. Ethics and Privacy
Increasingly “personalized” genomics…
GENOMICS IN BASIC RESEARCH
Before mass sequencing - mass genotyping
• Genotyping
- Determination of the
genotypes of parts (usually
genetic variations) of
an individual’s genome
using biological assays
• SNP arrays -> Hybrid arrays
- SNP (single nucleotide
polymorphism)
genotyping
- Main players:
Affymetrix VS Illumina
Affymetrix: http://www.affymetrix.com/
Illumina : http://www.illumina.com/
Before mass sequencing - mass genotyping
• Array CGH (comparative genomic hybridization)
- CNV (copy number variation) genotyping
- Main players: Agilent VS Nimblegen
- Main application:
-> detection of genomic abnormalities in cancer
-> detection of large structural aberrations (especially
at the chromosomal level)
SNP arrays
• Affymetrix
• Illumina
• Probes on microarray technology
[Figure: growth in SNP array density for the two vendors. Labels in the figure: 1K, 10K, 50K, 100K, 240K, 250K, 300K, 500K, 550K, 610K, 650K, 1M; Xba, Hind, Nsp, Sty; SNP 5.0, SNP 6.0, Axiom; Omni]
Source: Affymetrix Axiom Solutions, http://www.affymetrix.com/
SNP selection in array design
1) SNP quantity
- limited by microarray technology
2) SNP content
- random probes or probes for ‘tag’ SNPs
- random probes are produced by specific enzymes in
some array technology
- a ‘tag’ SNP is one that represents a group of SNPs in a
genomic region, due to a phenomenon called linkage
disequilibrium (LD).
- LD refers to the non-random association of alleles at 2 or
more loci.
- A haplotype refers to a certain configuration of alleles
that are transmitted together (or assumed to be).
- One can, in theory, predict the larger group of SNPs with
a smaller set of SNPs
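To make the LD idea concrete, here is a minimal Python sketch that computes D and r2 between two biallelic SNPs from phased haplotypes. The function name `ld_r2`, the upper/lower-case allele encoding and the toy haplotype lists are illustrative assumptions, not part of the slides.

```python
from collections import Counter

def ld_r2(haplotypes):
    """Return (D, r2) for two biallelic SNPs.

    `haplotypes` is a list of 2-character strings such as "AB" or "ab":
    the first character is the allele at SNP1, the second the allele at SNP2.
    Upper and lower case mark the two alleles (toy encoding, an assumption).
    """
    n = len(haplotypes)
    counts = Counter(haplotypes)
    p_a = sum(c for h, c in counts.items() if h[0] == "A") / n  # freq of A at SNP1
    p_b = sum(c for h, c in counts.items() if h[1] == "B") / n  # freq of B at SNP2
    p_ab = counts["AB"] / n                                     # freq of haplotype AB
    d = p_ab - p_a * p_b                                        # LD coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("r2 is undefined when a SNP is monomorphic")
    return d, d * d / denom

# No recombinant haplotypes: SNP1 fully predicts ("tags") SNP2.
high_ld = ["AB"] * 5 + ["ab"] * 5
# All four haplotypes equally common: SNP1 says nothing about SNP2.
low_ld = ["AB", "Ab", "aB", "ab"] * 3

d_high, r2_high = ld_r2(high_ld)
d_low, r2_low = ld_r2(low_ld)
print("high LD r2:", r2_high)
print("low LD r2:", r2_low)
```

With r2 close to 1, genotyping one SNP is enough to infer the other, which is the rationale for putting only tag SNPs on an array.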
Linkage Disequilibrium
[Figure: two parental haplotypes, A B and a b, across SNP1 and SNP2 (Parent 1, Parent 2)]
• High LD -> No Recombination (r2 = 1): only A B and a b are passed on, so SNP1 “tags” SNP2
• Low LD -> Recombination: many possibilities (A B, a b, A b, a B, etc…)
ASHG 2008 Hapmap Tutorial: http://hapmap.ncbi.nlm.nih.gov/tutorials.html.en
The International HapMap Project
• Largely exploited the idea of haplotypes and
LD
- reduce cost
(sequencing is expensive)
- capitalize on microarray
technology
• Involved Illumina,
Affymetrix,
>20 institutions worldwide
• HapMap1 (2003) and
HapMap2 (2005)
- 4 populations (270 indiv):
CEU (NW European from Utah),
CHB (Han Chinese from Beijing),
JPT (Japanese from Tokyo),
YRI (Yoruban from Nigeria)
• HapMap3 (2010)
- 11 populations (4+7, 1301 indiv)
The International HapMap Project
www.hapmap.org
• Provided the foundation for future human genomic projects:
- maturation of the microarray technology
- tool development from industry and academia
- the use of common variations in disease studies and
genome-wide association studies (GWAS)
- population-specific genetic differences
- samples
- consent and ethical issues
• Major limitations:
1) coverage (the entire genome is not covered)
2) rare variants are unlikely to be uncovered
3) population-based genome-wide studies
Even with limited information,
genomics is getting “personalized”…
Basic
• Human reference genome refinement
• Human evolution and natural selection
• Comparative genomics
Ancestry of individuals
• Population structure
• Human migration route
• Haplotyping
• Linguistics
Clinical applications
• Pharmacogenetics/genomics
• Disease associations
ETC. ETC. ETC……
HUGO PASNP Consortium (2009), Science
Heralding the personal genomes
• HapMap3 draft 1 came out in 2009 and the
paper was published in 2010
• Venter genome (2007) and Watson genome
(2008)
• Faster, cheaper and more accurate sequencing
technologies -> Transitioning into personal
genomes
• 2009-2011, 1000 Genomes Project sequenced
1092 genomes from 14 different populations
[Figure: timeline with entries dated 2007, 2008 (x3), 2009 (x6) and 2010]
Further into the personal genome
• Beyond simply sequencing the personal genome
• If a family trio is sequenced (mum, dad, child), one can
potentially phase the variations of the child into its
maternal and paternal alleles.*
• Phasing refers to the determination of the haplotype
of an individual’s sequence.
• It can be done experimentally (not feasible for large-scale phasing) or computationally.
• Typical computational phasing algorithms include the
use of HMM (e.g. BEAGLE, Browning & Browning 2007,
AJHG) and EM (e.g. fastPHASE, Scheet & Stephens
2006, AJHG).
*Note that phasing can also be done with unrelated individuals, but you
won’t know which chromosome is maternal and which is paternal
Phasing
Simple example of phased sequence of
the child (as opposed to ‘unphased’,
highlighted black)
[Figure: haplotype columns for Father and Mother (A B c D; a B c d; A B C d; a B c d) and the child’s unphased genotypes (A a, B B, c C, d d)]
Parent 1        Parent 2        Child           Informative to phase child’s genome?
Homozygous      Homozygous      Any             Yes
Homozygous      Heterozygous    Any             Yes
Heterozygous    Homozygous      Any             Yes
Heterozygous    Heterozygous    Homozygous      Yes
Heterozygous    Heterozygous    Heterozygous    No
Adapted from:
http://www.chromosomechronicles.com/2009/09/30/use-familysnp-data-to-phase-your-own-genome/
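The trio rules in the table above can be sketched as a per-site function. This is a minimal illustration that applies Mendelian inheritance one site at a time; it is not how HMM- or EM-based tools such as BEAGLE or fastPHASE work. The function name `phase_site`, the two-character genotype encoding and the toy trio are assumptions made for the example.

```python
def phase_site(father, mother, child):
    """Phase one biallelic site of a child using trio genotypes.

    Each genotype is an unordered 2-character string such as "AA", "Aa" or "aa".
    Returns (paternal_allele, maternal_allele), or None when the site is
    uninformative (all three heterozygous) or violates Mendelian inheritance.
    """
    c1, c2 = child
    options = set()
    # Try both ways of assigning the child's two alleles to the parents.
    for paternal, maternal in ((c1, c2), (c2, c1)):
        if paternal in father and maternal in mother:
            options.add((paternal, maternal))
    # Exactly one consistent assignment means the site can be phased.
    return options.pop() if len(options) == 1 else None

# Hypothetical trio genotypes at four sites (not the figure's data).
father = ["AA", "Bb", "Cc", "dd"]
mother = ["aa", "BB", "Cc", "Dd"]
child = ["Aa", "BB", "Cc", "Dd"]

phased = [phase_site(f, m, c) for f, m, c in zip(father, mother, child)]
for site, result in enumerate(phased, start=1):
    print(site, result if result else "uninformative")
```

The third site, where father, mother and child are all heterozygous, is the only one left unphased, matching the last row of the table.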
Allele-specific binding (ASB) and
expression (ASE)
Possible causes for ASB/ASE
1) Epigenetic effects, e.g. imprinting, where methylation silences a
maternal/paternal gene
2) Genetic variations (such as SNPs) disrupting a binding motif or modifying a
gene on a single parental haplotype
3) Random mono-allelic expression/binding
Clinical examples
1) Angelman Syndrome – maternal gene(s) on chromosome 15 inactivated
or deleted, paternal gene imprinted
2) Prader-Willi Syndrome – paternal gene(s) on chromosome 15 inactivated
or deleted, maternal gene imprinted
Using a phased genome to study ASB and ASE
• Integrate phased sequence with ChIP-seq (binding) and RNA-seq
(expression) data to obtain allele-specific information in binding and
expression (Rozowsky J et al. 2011)
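Once reads are assigned to the maternal or paternal haplotype at a heterozygous site, one common way to flag allele-specific binding or expression is a binomial test against a 50:50 expectation. The sketch below assumes that approach; it is not necessarily the procedure used in the cited study, and the function names and read counts are hypothetical.

```python
from math import comb

def binomial_two_sided_p(k, n, p=0.5):
    """Exact two-sided binomial p-value.

    Sums the probabilities of all outcomes that are no more likely than the
    observed count k. Intended for modest read depths (n up to a few hundred).
    """
    pmf = [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]
    observed = pmf[k]
    total = sum(x for x in pmf if x <= observed * (1 + 1e-9))
    return min(1.0, total)

def allele_specific_p(maternal_reads, paternal_reads):
    """P-value for the null hypothesis that both alleles are equally represented."""
    n = maternal_reads + paternal_reads
    return binomial_two_sided_p(maternal_reads, n)

balanced = allele_specific_p(10, 10)   # no allelic imbalance
skewed = allele_specific_p(30, 10)     # maternal allele dominates
print("balanced site p-value:", balanced)
print("skewed site p-value:", skewed)
```

In a genome-wide scan the resulting p-values would still need a multiple-testing correction, which is left out here.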
“Personalization in progress… Watch this space”
PERSONAL GENOMICS IN CLINICAL
RESEARCH
Personal genomics in the Clinic
Some areas that clinicians are
interested in that genomics can
potentially improve:
• Disease prediction
• Pharmacogenetics/genomics
• Response to therapy
• Patient care (personalized
environmental and
epigenetic information,
patient data privacy etc. etc.)
• Personalized medicine and healthcare
Examples of some genomic technologies in clinical research
1) Genome-Wide Association Studies
2) Exome sequencing
3) Pharmacogenetics/genomics
4) Gene expression profiles via RNA-seq
McCarthy et al. 2008
Genome-Wide Association Studies
(GWAS)
• The first successful GWAS was done at Yale in
2005, for age-related macular degeneration
(AMD) (Klein R. et al. 2005, Science)
- 96 cases, 50 controls, 116K SNPs
Klein R. et al. 2005
GWAS
• Made possible by HapMap and microarray technology
• Hypothesis-free
• Main aims:
1) to find the molecular pathways/mechanisms of complex diseases/traits
2) to find genetic markers that these phenotypes are associated with
• Common-disease-common-variant hypothesis
- phenotypes are results of cumulative effects of a number of common
variants, with at best modest effect sizes
McCarthy et al. 2008, Nature Reviews
GWAS
• Usually SNP-based
• Conduct an association test for each SNP
between cases and controls to see if there is a
significant difference between the 2 cohorts.
Allele    # Cases     # Controls
A         nA,case     nA,ctrl
B         nB,case     nB,ctrl
where
and n is the minor allele frequency.
McCarthy et al. 2008, Nature Reviews
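As an illustration of a per-SNP association test on the 2x2 allele count table, the sketch below uses a Pearson chi-square test with 1 degree of freedom. That is one common choice and an assumption here; the slides do not say which statistic was intended. The function name and the counts are hypothetical.

```python
from math import erfc, sqrt

def allelic_chi_square(n_a_case, n_a_ctrl, n_b_case, n_b_ctrl):
    """Pearson chi-square test on a 2x2 table of allele counts.

    Rows are alleles (A, B), columns are cohorts (cases, controls).
    Returns (chi2, p_value) with 1 degree of freedom.
    """
    observed = [[n_a_case, n_a_ctrl], [n_b_case, n_b_ctrl]]
    row_totals = [sum(row) for row in observed]
    col_totals = [n_a_case + n_b_case, n_a_ctrl + n_b_ctrl]
    total = sum(row_totals)
    if 0 in row_totals or 0 in col_totals:
        raise ValueError("every allele and every cohort needs a non-zero count")
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed[i][j] - expected) ** 2 / expected
    # Upper tail of a chi-square distribution with 1 df is erfc(sqrt(x / 2)).
    p_value = erfc(sqrt(chi2 / 2))
    return chi2, p_value

chi2_null, p_null = allelic_chi_square(50, 50, 50, 50)   # identical cohorts
chi2_assoc, p_assoc = allelic_chi_square(60, 40, 40, 60) # allele A enriched in cases
print(chi2_null, p_null)
print(chi2_assoc, p_assoc)
```

In a real GWAS this test is repeated for every SNP on the array, so the significance threshold has to be adjusted for the number of tests.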
GWAS
Limitations
• Note that even though termed “whole genome”, GWAS
to date have mostly relied on microarray technology
-> they use ‘tag SNPs’, which are in LD with many other SNPs
-> GWAS may not (and typically do not) find the
causative variant.
• High number of false positives with array-based GWAS
-> currently, GWAS variants explain only a small
fraction of the genetic risk of common diseases
• Heading towards sequencing-based GWAS, especially
for uncommon or rare variants
GWAS
Limitations (cont’d)
• Results can be population-specific, e.g. Type 2 diabetes risk allele frequencies
decrease from Sub-Saharan Africa through Europe to East Asia
However, GWAS did provide new insights into novel disease-associated pathways and
mechanisms, for instance in AMD.
Catalog of GWAS
http://www.genome.gov/26525384
Chen R et al. (2012), PLoS Genetics
Pharmacogenetics/genomics
• Pharmacogenetics
- refers to the study of how genetic variations affect
individual patient responses to drugs,
conventionally in a single gene or a small set of genes
• Pharmacogenomics
- refers to the large-scale/genome-wide study of
how genetic variations affect individual patient
responses to drugs
Interethnic variations in drug
responses
• Warfarin is a classic example.
-> a very widely-used anti-coagulant and one of the most well-studied
drugs
-> extremely difficult to dose because of a narrow therapeutic window
-> genes with haplotypes that affect dosage: VKORC1 and CYP2C9
-> Warfarin sensitivity (on average): Asians > Caucasians > African
Americans
Rettie A & Tai G (2006), Molecular Interventions Review
Quantifying interethnic variation in the
genome: an application
• A popular measure in population genetics is
the fixation index, FST, which essentially
measures population differentiation.
Chen J et al. (2010), Pharmacogenomics
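As a small worked example of the fixation index, the sketch below computes FST for a single biallelic SNP in two populations using Wright's definition, FST = (HT - HS) / HT, where HT is the expected heterozygosity of the pooled population and HS is the mean within-population heterozygosity. Equal population weights and the function name are assumptions for illustration; published studies often use other estimators.

```python
def fst_two_populations(p1, p2):
    """Wright's FST for one biallelic SNP in two equally weighted populations.

    p1 and p2 are the frequencies of the same allele in each population.
    """
    p_bar = (p1 + p2) / 2
    h_t = 2 * p_bar * (1 - p_bar)                        # pooled heterozygosity
    h_s = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2    # mean within-population
    if h_t == 0:
        return 0.0  # allele fixed or absent everywhere: no differentiation
    return (h_t - h_s) / h_t

same = fst_two_populations(0.5, 0.5)       # identical frequencies
moderate = fst_two_populations(0.2, 0.8)   # frequencies differ
fixed = fst_two_populations(1.0, 0.0)      # alternate alleles fixed
print(same, moderate, fixed)
```

Values near 0 mean the two populations share similar allele frequencies at the SNP, and values near 1 mean they are strongly differentiated.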
A peek into a potential future
1. Charcot-Marie-Tooth neuropathy (Lupski et al., 2010,
NEJM)
-> Whole genome sequencing of the lead author
himself, who has the disease, and his family
-> found 2 causative mutations associated with the
disease, in a region on chromosome 5 affecting
SH3TC2 (SH3 and tetratricopeptide repeats 2 gene)
2. The Snyder Experiment (Chen R et al., 2012, Cell)
-> integration of genomic, transcriptomic, proteomic,
metabolomic, and autoantibody profiles
of a single healthy individual over a 14-month period
-> revealed a predisposition to Type 2 diabetes despite
having no family history
“knowledge is mightier, IF you wield it right”
EMPOWERING THE MASSES
Personal genomics for the Masses
What can you mine from your own genome?
How can you mine your own genome?
What can you tell from your own genome?
• Disease susceptibility
• Ancestry
• Pharmacogenetics
• Traits
ETC. ETC.
The Bottom-up Pyramid Information
Flow
Public
Clinical research
Basic Research
Personal genomics for the Masses
• Unprecedented accessibility to the public
• Brought about by direct-to-consumer genomic
companies
-> Big 3: DeCode, Navigenics, 23andMe
23andMe
• Genotypes ~1 million SNPs per genome
• Illumina OmniExpress customized microarray
• Ancestry, traits, drug response, disease risks
• Provides your raw data which you can
download
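A downloaded raw data file can be read with a few lines of Python. The sketch below assumes the commonly seen layout of such files: comment lines starting with `#`, then tab-separated columns for rsid, chromosome, position and genotype, with `--` for a no-call. That layout, the function name and the toy rsIDs are assumptions, so check them against your own file.

```python
import csv
import io

def read_raw_genotypes(handle):
    """Parse a tab-separated genotype file into {rsid: (chrom, pos, genotype)}.

    Assumes '#' comment lines and four columns per data line:
    rsid, chromosome, position, genotype.
    """
    genotypes = {}
    data_lines = (line for line in handle if not line.startswith("#"))
    for row in csv.reader(data_lines, delimiter="\t"):
        if len(row) != 4:
            continue  # skip blank or malformed lines
        rsid, chromosome, position, genotype = row
        genotypes[rsid] = (chromosome, int(position), genotype)
    return genotypes

# Toy file contents; the rsIDs and positions are made up for illustration.
sample = io.StringIO(
    "# rsid\tchromosome\tposition\tgenotype\n"
    "rs_toy1\t1\t1000\tAG\n"
    "rs_toy2\t2\t2000\tCC\n"
    "rs_toy3\tX\t3000\t--\n"
)
calls = read_raw_genotypes(sample)
no_calls = [rsid for rsid, (_, _, g) in calls.items() if g == "--"]
print(len(calls), "SNPs read;", len(no_calls), "no-calls")
```

To use it on a real download, replace `sample` with `open("your_file.txt")`, where the file name is whatever your download is called.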
Beyond 23andMe –
Ancestry
Population panels:
HAPMAP (Intl HapMap Consortium 2003, Nature)
HGDP (Li JZ et al. 2008, Science)
PASNP (HUGO PASNP Consortium 2009, Science)
SGVP (Teo YY et al. 2009, Gen. Res.)
Middle East
Inset: http://www.clker.com/clipart-9213.html
Beyond 23andMe –
Ancestry
Chen J et al. (2009), AJHG
PCA in genomic data
-> SNPs in LD can skew PCA
Modified PCA (Price et al. (2006), Nat Genet)
• 0,1,2 represent the genotypes of SNPs (0=AA, 1=AB,
2=BB, assuming biallelic SNPs)
• then instead of normalizing by column, normalize by row
• variables = individuals
• observations = SNPs
• correlation matrix of individuals
• plot PC1 vs PC2 by loadings (variables) instead of by PC
scores (observations)

Genotype matrix (rows = SNPs, columns = samples):
        YOU   CEU1  CEU2  CEU3
SNP1    1     1     1     0
SNP2    2     1     2     0
SNP3    1     2     2     0
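The row-wise normalization and individual-by-individual matrix described above can be sketched with NumPy. This is a simplified illustration in the spirit of the modified PCA, not a reimplementation of the published software: the allele frequency is taken as half the row mean, monomorphic SNPs are dropped, and the function name and toy genotype matrix are assumptions.

```python
import numpy as np

def individual_pcs(genotypes, n_components=2):
    """PCA on individuals from a SNP-by-individual 0/1/2 genotype matrix.

    Each SNP (row) is centred and scaled, then the individual-by-individual
    covariance matrix is eigen-decomposed. Returns (eigenvalues, eigenvectors);
    each eigenvector column holds one coordinate per individual.
    """
    g = np.asarray(genotypes, dtype=float)
    row_mean = g.mean(axis=1, keepdims=True)
    p = row_mean / 2.0                         # simple allele frequency estimate
    keep = ((p > 0) & (p < 1)).ravel()         # drop monomorphic SNPs
    g, row_mean, p = g[keep], row_mean[keep], p[keep]
    x = (g - row_mean) / np.sqrt(p * (1 - p))  # normalize by row (per SNP)
    cov = x.T @ x / x.shape[0]                 # individuals x individuals
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigh returns ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    return eigvals[order], eigvecs[:, order]

# Toy data: 6 SNPs x 4 individuals; columns 0-1 and 2-3 form two groups.
toy = [
    [0, 0, 2, 2],
    [0, 1, 2, 2],
    [2, 2, 0, 0],
    [1, 0, 2, 1],
    [2, 2, 0, 1],
    [0, 0, 1, 2],
]
eigvals, pcs = individual_pcs(toy)
pc1 = pcs[:, 0]
print("PC1 coordinate per individual:", pc1)
```

Plotting the first column of `pcs` against the second, one point per individual, gives the PC1 vs PC2 ancestry plot; in the toy data the two groups fall on opposite sides of PC1.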
PCA interpretation
• Genetic differentiation by geography
• Studies have shown cultural, linguistic and
historical associations with such patterns
Novembre J et al. 2008, Nature
International Stem Cell Consortium (2011), Nat Biotech
Disease status
• Considerations:
-> the population panel on which your results are based
-> how well-studied the disease is
Mendelian diseases
• High penetrance
• Highly likely to be detected,
hence the results are more likely to be true
• Some populations might have a higher rate
Rosner et al. (2009), Annu. Rev. Genomics Hum. Genet.
Drug Response
Ancestry Neanderthal
Tools to mine your own genome
Projects/software from the public
• Dienekes Pontikos - EURO-DNA-CALC
Dienekes Anthropology Blog
http://dienekes.blogspot.com/2008/06/euro-dna-calc-11-released.html
• Dodecad Project
http://dodecad.blogspot.com
• Eurogenes
http://eurogenes.blogspot.com/
Other resources
• Galaxy (http://galaxy.psu.edu/)
• Interpretome (http://esquilax.stanford.edu/)
• SNPTips Firefox browser extension (http://snptips.5amsolutions.com/)
• SNPedia/Promethease (http://www.snpedia.com/index.php/Promethease)
• A comprehensive list of tools to probe 23andMe data.
http://www.23andyou.com/3rdparty
GALAXY
• http://galaxy.psu.edu/
• Web-based platform
• Designed for anybody to use
• Workflow concept
• GALAXY demo
Genomic elements discovery
and annotation
Academia:
• Human Genome Project
• Hapmap
• 1000 Genomes Project
Clinic:
• Disease association
• Pharmacogenetics
• Biomarkers
Industry
-> Expedite the democratization
process
• Navigenics
• 23andMe
• deCodeMe
• Illumina
• Affymetrix
Academia
+ Clinic
Industry
Everybody else
Some Privacy and ethical issues
• Privacy
-> can your identity really be kept anonymous in a
research project?
-> Li et al. 2004, Science
“Our calculations show that measuring as few as 75
statistically independent SNPs would define a small
group that contained the real owner of the DNA.”
• Ethics
-> how much, if at all, of your genomic information do
you own?
-> where do biological relatives stand in all this?
-> genetic discrimination, especially by insurance
companies
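To get a feel for why a few dozen independent SNPs can be identifying (the privacy point quoted above), here is a back-of-the-envelope sketch. It assumes Hardy-Weinberg proportions, fully independent SNPs and the same minor allele frequency at every SNP; these are simplifying assumptions for illustration, not the calculation from the cited paper.

```python
def genotype_match_probability(maf, n_snps):
    """Chance that two unrelated people share the same genotype at every SNP.

    Assumes Hardy-Weinberg proportions and independent SNPs, all with the
    same minor allele frequency `maf`.
    """
    q = 1 - maf
    # Probability that both people are AA, both AB, or both BB at one SNP.
    per_snp = (maf * maf) ** 2 + (2 * maf * q) ** 2 + (q * q) ** 2
    return per_snp ** n_snps

one_snp = genotype_match_probability(0.5, 1)
seventy_five = genotype_match_probability(0.5, 75)
print("match at 1 SNP:", one_snp)
print("match at 75 SNPs:", seventy_five)
```

Under these assumptions the chance of a coincidental match at 75 SNPs is vanishingly small, so a genotype profile of that size behaves much like a unique identifier.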