Li Hsu
Biostatistics and Biomathematics Program
Fred Hutchinson Cancer Research Center
• Cancer facts
• Linkage analysis of family studies
• Genome-wide association studies
• The etiology of cancer is multifactorial, with genetic, environmental, medical, and lifestyle factors interacting to produce a given malignancy.
• Breakthroughs in high-throughput genotyping technologies have made it possible to systematically identify genes that are responsible for disease occurrence.
• BRCA1 (breast cancer 1) is a human gene that belongs to a class of genes known as tumor suppressors, which maintain genomic integrity to prevent uncontrolled proliferation. Variations in the gene have been implicated in a number of hereditary cancers, namely breast, ovarian, and prostate cancer. The BRCA1 gene is located on the long (q) arm of chromosome 17 at 38 Mb.
Probability of developing breast cancer by age, for an average person and for BRCA1 carriers (Chen et al. 2009)

Age    Average person        BRCA1 carrier
50     2.1% (1.7%-2.7%)      18.8% (8.2%-2.3%)
60     4.1% (3.4%-5.0%)      31.3% (14.3%-61.2%)
70     7.2% (6.0%-9.0%)      45.4% (22.7%-74.3%)
80     10.2% (8.4%-12.5%)    54.9% (30.4%-81.4%)
• How was BRCA1 found?
[Figure: pedigree showing each family member's genotype at the marker: 1/2, 1/3, 3/4, 2/4, 3/4, 3/2, 1/4, 1/4, 1/2, 3/2]
Assume disease gene (D) is rare with full penetrance
[Figure: pedigree showing each family member's marker genotype and disease genotype: 3/4 D/d; 1/2 d/d; 1/3 d/D; 3/2 D/d; 3/4 D/d; 2/4 d/d; 1/4 d/d; 1/4 D/d; 1/2 d/d; 3/2 D/d]
• The disease allele (D) was originally on the chromosome carrying marker allele 3
• How often does D co-segregate with allele 3 (non-recombinant)?
– 5 meioses
• How often is D separated from allele 3 (recombinant)?
– 1 meiosis
• Define a parameter θ, the recombination fraction, which measures the distance between allele 3 and D by how frequently they recombine.
• The likelihood function is L(θ) = (1 - θ)^5 θ
• The maximum likelihood estimate is 1/6
• LOD = log10 [ L(1/6) / L(1/2) ] = 0.63
• LOD for 7 such families (LOD scores add across independent families) = 7 × 0.63 = 4.41
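The likelihood, maximum likelihood estimate, and LOD score above can be checked numerically. The following is a minimal sketch, assuming the counts from the pedigree (5 non-recombinant and 1 recombinant meioses); the function names are illustrative and not from the slides.

```python
import math


def likelihood(theta, n_nonrecombinant=5, n_recombinant=1):
    """L(theta) = (1 - theta)^n_nonrecombinant * theta^n_recombinant."""
    return (1 - theta) ** n_nonrecombinant * theta ** n_recombinant


def mle_theta(n_nonrecombinant=5, n_recombinant=1):
    """Closed-form maximizer: the observed fraction of recombinant meioses."""
    return n_recombinant / (n_nonrecombinant + n_recombinant)


def lod_score(theta, n_nonrecombinant=5, n_recombinant=1):
    """log10 likelihood ratio against no linkage (theta = 1/2)."""
    return math.log10(
        likelihood(theta, n_nonrecombinant, n_recombinant)
        / likelihood(0.5, n_nonrecombinant, n_recombinant)
    )


theta_hat = mle_theta()

# Sanity check: a grid search over (0, 0.5] should land next to 1/6.
grid = [i / 1000 for i in range(1, 501)]
theta_grid = max(grid, key=likelihood)

lod_one_family = lod_score(theta_hat)
# Assumes 7 independent families with the same 5:1 pattern (hypothetical).
lod_seven_families = 7 * lod_one_family

print(f"theta_hat = {theta_hat:.4f} (grid search: {theta_grid:.3f})")
print(f"LOD for one family = {lod_one_family:.2f}")
print(f"LOD for seven families = {lod_seven_families:.2f}")
```

Summing the unrounded per-family LOD gives a total slightly above the 4.41 obtained from the rounded value 0.63.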
• Linkage analysis narrowed the location down to a region of about 1 Mb. However, it took another four years before the BRCA1 gene itself was identified.
• Reduced penetrance, phenocopy, and genetic heterogeneity are among the factors that limit the success of linkage analysis.
• Relevance of the findings to the population at large.
• The Human Genome Project began in 1990 and was completed in 2003.
Part of sequence from Chromosome 7
AGACGGAGTTTCACTCTTGTTGCCAACCTGGAGTGCAGTGGCGTGATCTCAGCTCACTGCACACTCCGCTTTC C/T GG
TTTCAAGCGATTCTCCTGCCTCAGCCTCCTGAGTAGCTGGGACTACAGTCACACACCACCACGCCCGGCTAATTTTTG
TATTTTTAGTAGAGTTGGGGTTTCACCATGTTGGCCAGACTGGTCTCGAACTCCTGACCTTGTGATCCGCCAGCCTCT
GCCTCCCAAAGAGCTGGGATTACAGGCGTGAGCCACCGCGCTCGGCCCTTTGCATCAATTTCTACAGCTTGTTTTCTT
TGCCTGGACTTTACAAGTCTTACCTTGTTCTGCCTTCAGATATTTGTGTGGTCTCATTCTGGTGTGCCAGTAGCTAAAA
ATCCATGATTTGCTCTCATCCCACTCCTGTTGTTCATCTCCTCTTATCTGGGGTCAC A/C TATCTCTTCGTGATTGCATTC
TGATCCCCAGTACTTAGCATGTGCGTAACAACTCTGCCTCTGCTTTCCCAGGCTGTTGATGGGGTGCTGTTCATGCCT
CAGAAAAATGCATTGTAAGTTAAATTATTAAAGATTTTAAATATAGGAAAAAAGTAAGCAAACATAAGGAACAAAAAG
GAAAGAACATGTATTCTAATCCATTATTTATTATACAATTAAGAAATTTGGAAACTTTAGATTACACTGCTTTTAGAGAT
GGAGATGTAGTAAGTCTTTTACTCTTTACAAAATACATGTGTTAGCAATTTTGGGAAGAATAGTAACTCACCCGAACA
GTGTAATGTGAATATGTCACTTACTAGAGGAAAGAAGGCACTTGAAAAACATCTCTAAACCGTATAAAAACAATTACA
TCATAATGATGAAAACCCAAGGAATTTTTTTAGAAAACATTACCAGGGCTAATAACAAAGTAGAGCCACATGTCATTT
ATCTTCCCTTTGTGTCTGTGTGAGAATTCTAGAGTTATATTTGTACATAGCATGGAAAAATGAGAGGCTAGTTTATCAA
CTAGTTCATTTTTAAAAGTCTAACACATCCTAGGTATAGGTGAACTGTCCTCCTGCCAATGTATTGCACATTTGTGCCC
AGATCCAGCATAGGGTATGTTTGCCATTTACAAACGTTTATGTCTTAAGAGAGGAAATATGAAGAGCAAAACAGTGCA
TGCTGGAGAGAGAAAGCTGATACAAATATAAATGAAACAATAATTGGAAAAATTGAGAAACTACTCATTTTCTAAATT
ACTCATGTATTTTCCTAGAATTTAAGTCTTTTAATTTTTGATAAATCCCAATGTGAGACAAGATAAGTATTAGTGATGGT
ATGAGTAATTAATATCTGTTATATAATATTCATTTTCATAGTGGAAGAAATAAAATAAAGGTTGTGATGATTGTTGATTA
TTTTTTCTAGAGGGGTTGTCAGGGAAAGAAATTGCTTTTTTTCATTCTCTCTTTCCACTAAGAAAGTTCAACTATTAATT
TAGGCACATACAATAATTACTCCATTCTAAAATGCCAAAAAGGTAATTTAAGAGACTTAAAACTGAAAAGTTTAAGATA
GTCACACTGAACTATATTAAAAAATCCACAGGGTGGTTGGAACTAGGCCTTATATTAAAGAGGCTAAAAATTGCAATA
AGACCACAGGCTTTAAATATGGCTTTAAACTGTGAAAGGTGAAACTAGAATGAATAAAATCCTATAAATTTAAATCAA
AAGAAAGAAACAAACT A/G AAATTAAAGTTAATATACAAGAATATGGTGGCCTGGATCTAGTGAACATATAGTAAAGA
TAAAACAGAATATTTCTGAAAAATCCTGGAAAATCTTTTGGGCTAACCTGAAAACAGTATATTTGAAACTATTTTTAAA
Genome-Wide Association Study
• 550,000 SNPs on an array
• 2000 diseased individuals (colon cancer cases) and 2000 normal individuals
• Genotype all DNAs for 550,000 SNPs
• That is about 2.2 billion genotypes (4,000 × 550,000)!
GWAS on Type 2 Diabetes (Steinthorsdottir et al., 2007, Nature Genetics)
Observed genotype counts:

Genotype    Cases    Controls    Total
AA            751        3107     3858
Aa            539        1887     2426
aa            108         277      385
Total        1398        5271     6669
Expected genotype counts if the genotype is not associated with the disease:

Genotype    Cases    Controls    Total
AA            809        3049     3858
Aa            509        1917     2426
aa             81         305      385
Total        1398        5271     6669

• Expected count of AA among cases if AA is not associated with the disease: first, calculate the frequency of the AA genotype in cases and controls combined: freq = 3858/6669 = 57.85%.
• Among the 1398 cases, we then expect 1398 × 57.85% ≈ 809 individuals with genotype AA.
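The expected-count calculation extends to every cell in the same way: multiply the group size by the pooled genotype frequency. A minimal sketch, using the observed counts from the table; the dictionary layout and function name are illustrative choices, not from the slides.

```python
# Observed genotype counts (Steinthorsdottir et al., 2007, as shown in the table).
observed = {
    "AA": {"cases": 751, "controls": 3107},
    "Aa": {"cases": 539, "controls": 1887},
    "aa": {"cases": 108, "controls": 277},
}


def expected_counts(table):
    """Expected counts under no association: group total x pooled genotype frequency."""
    groups = ["cases", "controls"]
    group_totals = {g: sum(row[g] for row in table.values()) for g in groups}
    grand_total = sum(group_totals.values())
    result = {}
    for genotype, row in table.items():
        pooled_freq = sum(row[g] for g in groups) / grand_total
        result[genotype] = {g: group_totals[g] * pooled_freq for g in groups}
    return result


expected = expected_counts(observed)
for genotype, row in expected.items():
    print(f"{genotype}: cases {row['cases']:.1f}, controls {row['controls']:.1f}")
```

The unrounded expected count for aa controls is about 304.3; the 305 shown in the table appears to come from making the rounded cells add up to the column total.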
GWAS on Type 2 Diabetes
• The chi-square statistic is calculated by taking the difference between the observed and expected count in each cell, squaring it, dividing by the expected count, and summing the results over all cells.
(751 - 809)^2/809 + (3107 - 3049)^2/3049 + …
• Compare the value to a standard chi-square distribution with degrees of freedom (# rows - 1) × (# columns - 1) = (3 - 1) × (2 - 1) = 2.
• The p-value for this SNP is 6.772e-5.
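The chi-square test described above can be reproduced in a few lines. This is a sketch under two assumptions: it uses unrounded expected counts, and it relies on the fact that for exactly 2 degrees of freedom the chi-square upper-tail probability is exp(-x/2), so no statistics library is needed. The function names are illustrative.

```python
import math

# Observed counts: rows are genotypes AA, Aa, aa; columns are cases, controls.
observed = [
    [751, 3107],
    [539, 1887],
    [108, 277],
]


def chi_square_statistic(table):
    """Sum over cells of (observed - expected)^2 / expected."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            stat += (obs - expected) ** 2 / expected
    return stat


def p_value_two_df(stat):
    """Upper-tail probability of a chi-square variable with 2 degrees of freedom."""
    return math.exp(-stat / 2)


degrees_of_freedom = (len(observed) - 1) * (len(observed[0]) - 1)
statistic = chi_square_statistic(observed)
p_value = p_value_two_df(statistic)

print(f"chi-square = {statistic:.2f} on {degrees_of_freedom} df, p = {p_value:.3e}")
```

For tables with other dimensions the closed form no longer applies, and a general chi-square tail function would be needed instead.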
• Too many SNPs!
• Identifying gene-gene and gene-environment interactions is now possible.
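One way to see why "too many SNPs" is a statistical problem: with 550,000 tests, a conventional 0.05 cutoff would flag thousands of SNPs by chance alone. The sketch below is an illustration only; the 0.05 level and the Bonferroni correction are common conventions assumed here, not something the slides prescribe.

```python
n_snps = 550_000  # SNPs on the array, as in the GWAS design above
alpha = 0.05      # assumed per-test significance level

# If no SNP were associated, this many would still pass p < alpha on average.
expected_false_positives = alpha * n_snps

# Bonferroni correction (assumed convention): divide alpha by the number of tests.
bonferroni_threshold = alpha / n_snps

print(f"Expected false positives at p < {alpha}: {expected_false_positives:.0f}")
print(f"Bonferroni per-SNP threshold: {bonferroni_threshold:.2e}")
```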
Germline mutations account for only a small portion of cancer cases.
http://envirocancer.cornell.edu/FactSheet/General/fs48.inheritance.cfm
• The amount of data generated has increased exponentially in the last few years.
• This creates great demand for efficient and valid computational and statistical methods and tools for picking the needles out of the haystack.