An Introduction to Sequence Variation
Chris Lee
Dept. of Chemistry & Biochemistry, UCLA

Types of Polymorphism
• Single nucleotide polymorphisms (SNPs) constitute about 90% of polymorphisms.
• Insertions, deletions.
• Microsatellite repeats: a locus where different numbers of copies of a short repeat sequence are found in different people.
• Gross genetic losses or rearrangements.

Large-scale Polymorphism

Single Nucleotide Polymorphisms
• Each person differs at about 1 in 1000 letters.
• SNPs are responsible for human individuality!
• Some SNPs cause human diseases (e.g. cancer, cystic fibrosis, Alzheimer’s).
• Enormous efforts have been made to identify specific mutations that cause disease.

Single Nucleotide Polymorphism
Mutation can occur as easily as the loss of a single chemical group from one nucleotide base, e.g. the amino group of cytosine.

Creating a Mutation

Genomic Density of SNPs
• Comparing two random chromosomes, expect one SNP per 1000 bp.
• Comparing 40 people (2 chromosomes each), expect 17 million SNPs in the complete human genome (3 billion bp).
• In coding regions (5% of genome) expect 500,000 cSNPs, perhaps 6 per gene.

SNPs: a detailed record of human genetic history
• Each SNP is typically a single mutation event that occurred in a context of certain pre-existing SNPs.
• As time passes this context is gradually lost due to recombination.
[Figure: SNP C is initially created linked to SNPs A, B, D, E, F...; over time the “island” of linkage shrinks.]

A record of the origins, migrations, and mixing of the world’s peoples
• The size of the “island” of strong linkage around a SNP indicates its age (small = old).
• The SNPs it is linked to give a “genetic fingerprint” of the original person it came from.
• In principle each SNP can be used to track all of that person’s descendants.
• Each person has 300,000 common SNPs, a very rich record of their genetic history.

SNPs in lipoprotein lipase (LPL) gene.
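The shrinking “island” of linkage described in the slides above can be illustrated with the standard population-genetics result that linkage disequilibrium D between two loci decays by a factor (1 - r) per generation, where r is the recombination fraction between them. The following is a minimal sketch, assuming random mating and no selection or drift; the function names and the example values of r are illustrative assumptions, not figures from the talk.

```python
import math


def ld_after(d0: float, r: float, generations: int) -> float:
    """Linkage disequilibrium remaining after `generations` of random mating."""
    if not 0.0 <= r <= 0.5:
        raise ValueError("recombination fraction r must lie in [0, 0.5]")
    # Each generation, recombination breaks up a fraction r of the association.
    return d0 * (1.0 - r) ** generations


def generations_to_fraction(r: float, fraction: float) -> float:
    """Generations needed for D to fall to `fraction` of its starting value."""
    if not 0.0 < r <= 0.5 or not 0.0 < fraction <= 1.0:
        raise ValueError("need 0 < r <= 0.5 and 0 < fraction <= 1")
    return math.log(fraction) / math.log(1.0 - r)


if __name__ == "__main__":
    # Hypothetical recombination fractions: tighter linkage (smaller r)
    # takes more generations to lose half of its disequilibrium.
    for r in (1e-5, 1e-4, 1e-3):
        print(r, round(generations_to_fraction(r, 0.5)))
```

Markers that are physically closer have a smaller r and stay linked longer, which is why an old SNP retains strong linkage only to its nearest neighbours.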
SNP genotypes in 71 individuals in the LPL gene
[Figure legend: heterozygote (X/y); homozygote (y/y)]

SNP Allele Frequency

SNP Haplotypes reconstructed from LPL genotype data

SNP Linkage Disequilibrium

The Hunt for Disease Genes
• Currently, finding a disease gene can take years, because there are very few markers, forcing researchers to search dozens of genes.
• SNPs are a powerful tool for discovering genes that cause disease: with a SNP in every gene, one could directly map a disease to a single gene.

Mapping Disease Genes
[Figure: chromosome showing genes, a disease gene, a microsatellite marker, and SNPs]
• Look for genetic linkage of disease to marker.
• Microsatellite markers are too widely spaced to get to the individual gene level.
• There are common SNPs in every gene.

Identification of SNPs
• In 1998, Wang et al. reported ~3000 SNPs.
• Currently about 200,000 SNPs have been identified in total by experiment (in public databases).
• A pharmaceutical industry SNP consortium has been formed to fund identification of 300,000 SNPs to be shared publicly.

SNPs for Pharmacogenomics
• Differences in efficacy and side effects from person to person can be a big problem for drug clinical trials / approval.
• If SNPs that correlate with these differences can be identified, the clinical trial could be limited to patients where efficacy is likely to be best, with least side effects.
• These SNPs would then also have to be tested on prospective patients for the drug.

Single-nucleotide polymorphism in the human mu opioid receptor gene alters β-endorphin binding and activity: Possible implications for opiate addiction
Bond et al. PNAS 95:9608
The mu opioid receptor is the primary site of action for the most commonly used opioids, including morphine, heroin, fentanyl, and methadone. The A118G variant receptor binds β-endorphin, an endogenous opioid that activates the mu opioid receptor, approximately three times more tightly than the most common allelic form of the receptor.
Furthermore, β-endorphin is approximately three times more potent at the A118G variant receptor than at the most common allelic form in agonist-induced activation of G protein-coupled potassium channels.

Comprehensive EST Analysis of Single Nucleotide Polymorphism in the Human Genome
Chris Lee
Dept. of Chemistry & Biochemistry, UCLA

Targeting Functional Polymorphism via Expressed Sequences
• Only 5% of the human genome corresponds to “genes” coding for functional protein.
• Look for functional SNPs by targeting these gene sequence regions.
• Genes are “expressed” by transcription into mRNA, which is spliced, poly-adenylated and translated.
• Purify polyA-mRNA, make cDNA, sequence.

SNP Detection from ESTs
• 1.4 million Expressed Sequence Tag (EST) sequences, 300-500 bp, from 950 people.
• How to put together all the ESTs from the same gene, without mixing up related genes?
• How to distinguish sequencing errors (very common) from genuine Single Nucleotide Polymorphisms?

SNP Detection Approaches
• Experimentally: random sampling of DNA. Very expensive, slow.
• Computationally: find SNPs from existing experimental data. Sort out real SNPs from experimental sequencing errors. Difficult statistical and computational problems.
• This experimental data was sitting around for years...

Distinguishing SNPs from Sequencing Errors
[Figure: aligned reads with two candidate SNPs, labelled A and T]
The frequency and pattern in which a polymorphism is observed must rise above the rate of background, random error. Single-pass read sequences contain many errors, which complicate the reliable detection of SNPs. There are miscalls (N), and frequent letter duplications / losses in runs (repeats of a single letter). These non-uniform error rates are critical in assessing the statistical significance of candidate SNPs like A (not in a run) vs. T (problematic because it involves a GG run).

How to address this?
• Adopt a rigorous statistical approach based on measured frequencies from very large data.
• Bayesian inference: carefully separate observations from hidden states you want to make inferences about.
• “Integrate out” all assumptions by considering all possible values of the assumptions.
• Explicitly measure degree of uncertainty in the predictions due to poor data, ambiguity.

Odds ratio: SNP model vs. sequencing error model
$$\mathrm{SNPscore} = \frac{p(\mathrm{obs} \mid \mathrm{SNP})}{p(\mathrm{obs} \mid \mathrm{err})}$$
Consider both models: are the observations more consistent with a SNP or sequencing error?

Error Model: treat true gene sequence as unknown
$$p(\mathrm{obs} \mid \mathrm{err}) = \sum_{T} p(\mathrm{obs} \mid T)\, p(T)$$
• Treat all sequences T as equally likely (before you consider the actual observations, i.e. the chromatograms).
• Sum error model probability over all possible T.

SNP Model
$$p(\mathrm{obs} \mid \mathrm{SNP}) = \sum_{T} \sum_{T^*} p(\mathrm{obs} \mid T^*, T)\, p(T^* \mid T)\, p(T) \approx \frac{1}{3}\, p(\mathrm{obs} \mid T^*, T)\, p(T^*)$$
• Rather than summing SNP model probability over all possible T, T*, calculate the probability for a specific SNP T* in a specific consensus T.
$$\mathrm{SNPscore} = \frac{\frac{1}{3}\, p(\mathrm{obs} \mid T^*, T)}{\sum_{T} p(\mathrm{obs} \mid T)}$$

Sequencing error model
Treat individual observed sequences i as independent; treat alignment A (what errors occurred) as uncertain.
$$p(\mathrm{obs} \mid T) = \prod_{i} p(\mathrm{obs}_i \mid T) = \prod_{i} \sum_{A} p(\mathrm{obs}_i, A \mid T)$$
Treat true gene sequence T as uncertain: sum over all possible T.
$$p(\mathrm{obs} \mid \mathrm{err}) = \sum_{T} \prod_{i} \sum_{A} p(\mathrm{obs}_i, A \mid T)$$

Hidden Markov Model Discrimination of SNP vs. Error
The match states (M) of a profile are the equivalent of the true population sequence, and deletion (D), insertion (I) and emission probabilities are set to the observed frequencies of sequencing errors conditioned on local sequence context. The sum probability for the SNP model, vs. the sum probability for the error-only model, yields an odds ratio for the SNP.

To assess a putative SNP, consider all alternative possibilities
• Sequencing error: calculate odds ratio SNP vs. error. Use PHRED score, local context.
• Orientation errors: ESTs reported backwards?
• Chimeras, mixed clusters: ESTs may not be properly clustered. Some ESTs chimeric?
• Alignments: all possible ways an EST could have been emitted from true sequence T.
• “True” sequence: all possible T for the gene.

SNP Model: “Local” allele frequency q_z in one person
$$p(\mathrm{obs}_i \mid T^*, T, z) = \sum_{A} p(\mathrm{obs}_i, A \mid T^*, T, q_z)$$
z = 0, 1, 2, …; q_z = z/N, where N = 2 typically.
$$p(\mathrm{obs}_{i \in L} \mid T^*, T) = \sum_{z = 0,1,2} p(z) \prod_{i \in L} p(\mathrm{obs}_i \mid T^*, T, z)$$
$$p(z \mid q) = \binom{N}{z}\, q^{z} (1 - q)^{N - z}$$
Assuming Hardy-Weinberg.

Use Library information: which sequences are from the same person!
Combine observations from all libraries L, and treat population allele frequency q as uncertain (so take the integral over q in (0,1)).
$$p(\mathrm{obs} \mid T^*, T) = \int_0^1 \prod_{L} p(\mathrm{obs}_{i \in L} \mid T^*, T, q)\, p(q)\, dq = \int_0^1 \prod_{L} \left[ \sum_{z = 0,1,2} \binom{2}{z}\, q^{z} (1 - q)^{2 - z} \prod_{i \in L} \sum_{A} p(\mathrm{obs}_i, A \mid T^*, T, z) \right] p(q)\, dq$$

Posterior probability for population allele frequency q
$$p(q \mid T^*, T, \mathrm{obs}) = \frac{\prod_{L} p(\mathrm{obs}_{i \in L} \mid T^*, T, q)}{\int_0^1 \prod_{L} p(\mathrm{obs}_{i \in L} \mid T^*, T, q)\, dq}$$
Gives posterior distribution for q, taking into account all error rates in the observations, amount of sequence and library availability, ambiguities in the sequence, etc.

[Plot: 6 SNP observations from one library; horizontal axis from 0 to 1]
[Plot: 6 SNP observations scattered over all libraries; horizontal axis from 0 to 1]

Alignment Accuracy Challenges
• Automatic Multiple Sequence Alignment of 1000+ sequences is problematic.
• Alignment accuracy is much more of a problem for SNP detection than for simply getting the right consensus. Consensus merely requires that the majority be aligned, whereas even a single alignment error will result in an incorrect SNP prediction.

Sequencing Error Analysis
• We have produced a dataset of 400,000,000 bp where we have reliable consensus, and therefore can identify all the sequencing errors. This could provide “corrected” EST sequences, or alternatively consensus, assembled gene sequences for a large fraction of human genes.
• This also provides detailed statistics on the frequency of different types of sequencing errors, which show a startling variation depending on local sequence context. Background error rates of 0.3% substitution, 0.3% insertion, 0.7% deletion, rise dramatically

Example SNP: GGA C/T CAA
Cluster AA702884
C vs. T polymorphism
Novel SNP, not previously identified.

Automated SNP Detection
• Input Unigene: 1,400,000 human ESTs, 300-500 bp long.
• Word frequency based overlap & orientation detection: try all possible orientations; don’t trust Unigene!
• Reorient ESTs: catch reversals, place in 5’ -> 3’ orientation. Many errors in the reported data, e.g. reversals, in the majority of clusters!
• EST Alignment: accuracy; predict gene consensus & SNPs. 10-5000 ESTs per gene, 80,000 genes, 500-5000 bp long.
• Statistical Assessment of candidate SNPs: >50,000 believable SNPs hidden among >10,000,000 sequencing errors.

Sequence Alignment

Current Status: Results
• 400,000,000 bp aligned w/ reliable consensus.
• 83,000 consensus gene sequences produced.
• 20,000 show significant homology to known proteins, almost all in expected + orientation.
• 75,000 SNPs above LOD score of 3.
• 30,000 SNPs above LOD score of 6.
• Current estimate: 60,000 high frequency SNPs.
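The LOD thresholds in the results above are logarithms of the odds ratio between the SNP model and the sequencing-error model. As a rough illustration of that comparison (not the HMM-based method of the talk), the sketch below scores a single alignment column under strong simplifying assumptions: reads are independent, a substitution error produces the specific alternate base with a flat probability `err / 3`, library structure is ignored, the prior on allele frequency q is uniform, and the logarithm is taken base 10. The function names, the default error rate, and the read counts are hypothetical.

```python
import math


def binom_pmf(k: int, n: int, p: float) -> float:
    """Probability of k successes in n independent trials with success rate p."""
    return math.comb(n, k) * p**k * (1.0 - p) ** (n - k)


def snp_lod(n_reads: int, n_alt: int, err: float = 0.003, grid: int = 1000) -> float:
    """log10 odds of a SNP vs. pure sequencing error for one alignment column."""
    # Error-only model: every alternate call is a miscall to that exact base.
    p_err = binom_pmf(n_alt, n_reads, err / 3.0)

    # SNP model: integrate over allele frequency q (uniform prior, midpoint rule).
    total = 0.0
    for j in range(grid):
        q = (j + 0.5) / grid
        # A read shows the alternate base if it comes from an alternate
        # chromosome and is read correctly, or from a reference chromosome
        # and is miscalled as the alternate base.
        p_alt = q * (1.0 - err) + (1.0 - q) * err / 3.0
        total += binom_pmf(n_alt, n_reads, p_alt)
    # The factor 1/3 is the prior that the SNP is this particular alternate base.
    p_snp = (1.0 / 3.0) * total / grid

    return math.log10(p_snp / p_err)


if __name__ == "__main__":
    for n_alt in (0, 1, 2, 5):
        print(n_alt, round(snp_lod(20, n_alt), 2))
```

A context-dependent error rate, as measured in the error analysis above, would replace the flat `err` argument; a higher local error rate lowers the score for the same counts.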
Megakaryocyte Potentiating Factor (Unigene Cluster Hs.155981)
Hs#S785496   gagg..cccactcccttg.ctggccccagccctgctgan.at.ccccgcctggccaggagcag.gcacgggtggtccccgttccaccccaagagaact
Hs#S1065649  gagggccccactcccttg.ctggccccagccctgctggggat.ccccgcctggccaggagcag.gcacgggtgatccccgttccaccccaagagaact
Hs#S706294   gagggccccactcccttg.ctagtgtcagccctgctggggat.ccccgcctggccaggagcagagcacgggtggtccccattccaccccaagagaact
Hs#S730843   gagggccc.actcccttg.ctggccccagcc.tgctgga.gt.ccccgcctggccaggagcag.gcacgggtggtccccgttccaccccaagagaact
Hs#S751356   gagggccc.actcccttg.ctggccccagccctgctgna.nt.ccccgcctggccaggagcag.gcacgggtggtccccgttccaccccaagagaact
Hs#S786081   gagggccccactcccttg.ctaggac.agcc.tgctggggat.ccccgcctggccaggagcag.gcacgggtggtccccgttccaccccaagagaact
Hs#S417458   gagggccccactcccttg.ctggccccagccctgctggggat.ccccgcctggccaggagcag.gcacgggtgatccccgttccaccccaagagaact
Hs#S751274   gagggccccactccctgggcttggcccagccctgctggggat.ccccgcctggccaggagcag.gcacgggtggtccccgttccaccccaagagaact
Hs#S483955   gagggccccactcccttg.ctggccccagccctgctgga.atancccgcctggccaggagcag.gcacgggtnatccccgttccaccccaagagaact
Hs#S1434119  gagggccccactcccttg.ctggccccagccctgctggggat.ccccgcctggccaggagcag.gcacgggtggtccccgttccaccccaagagaact
Hs#S1065241  aagggccccactcccttg.ctggccccagccctgctggggat.ccccgcctggccaggagcag.gcacgggtggtccccgttccaccccaaaagaact
CONSENS0     gagggccccactcccttg.ctggccccagccctgctggggat.ccccgcctggccaggagcag.gcacgggtggtccccgttccaccccaagagaact
Position ruler: 1970 1980 1990 2000 2010 2020 2030 2040 2050

Chromatographic Evidence
[Figure: chromatogram traces for Hs#S785496 (zu42c08.r1) and Hs#S1065649 (oz03ho7.x1*)]

RFLP Detection of SNPs
[Figure: MboI (GATC) restriction digest gel for a G/A SNP, lanes 1-11; fragments of 86, 67, 35 and 32 nt. Genotypes by lane: GG GG GG AA GG GA GG GA GG AA GA]
Verified 56 of 79 SNPs tested so far.

RFLP Verification on 16-24 DNA Samples
[Chart: % verified (axis 0% to 90%) by score bin: 1-6, 6-20, >20]

Verification Test: Whitehead cSNPs
• Whitehead Institute has systematically searched for SNPs in 106 genes, using 20 Europeans, 10 Africans,
10 Asians.
• On 54 genes, our predicted cSNPs (score>3) are verified by their results at a 70% rate. The Whitehead set may be incomplete.

Gene     #seq  #lib
AHC        16     6
APOD      207    59
AR         18     4
AT3        55    12
BDNF       18     9
CETP       14     8
CGA       122    17
CNTF        4     1
COMT      183    67
CYP11A     64    21
CYP11B2     7     4
DRD1        4     1
F10        29    15
F13A1     141    40
F2         37    14
F5         49    14
F9          7     2
FGB       266    32

Per-gene values under cSNPs and low frequency (predicted, verified): 0 3 0 3 0 2 2 0 0 1 1 0 0 0 8 2 2 3 0 2 0 0 2 1 1 2 2 0 0 5 5 1 1 1 0 3 3 0
Per-gene values under high frequency (predicted, verified): 0 2 2 1 1 0 0 6 1 2 0 0 1 2 1 2 1 4 1 3 4 1 3

Total           predicted  verified  percentage verified
cSNPs                  30        17                  57%
low frequency           9         1                  11%
high frequency         21        16                  76%

Validation Test: HLA-A
• HLA polymorphism has been studied very extensively for the general population, providing a “gold standard” for all true positives.
• 140 distinct HLA-A allele sequences available from Anthony Nolan Foundation database.
Are any of our predicted HLA-A SNPs not independently verified by this data?

HLA-A coding sequence (lowercase) with variant alleles (uppercase) annotated between rows:
T C T T
gatggccgtc atggcgcccc gaaccctcgt cctgctactc tcgggggccc tggccctgac ccagacctgg
C T C T T T A A A T AC C A T A
gcgggctccc actccatgag gtatttcttc acatccgtgt cccggcccgg ccgcggggag ccccgcttca
A T A AC C A A A AA G A T A T A A
tcgccgtggg ctacgtggac gacacgcagt tcgtgcggtt cgacagcgac gccgcgagcc agaggatgga
A A T A T G G A G C AA T T CA A C G AA A
gccgcgggcg ccgtggatag agcaggaggg gccggagtat tgggacgggg agacacggaa tgtgaaggcc
A AA T T CA C A G A AA T A A GG T C C A C T C A
cactcacaga ctgaccgagt ggacctgggg accctgcgcg gctactacaa ccagagcgag gccggttctc
G T C C AG CC T T C A A A G C G T C T C G
acaccatcca gataatgtat ggctgcgacg tggggtcgga cgggcgcttc ctccgcgggt accacaggac
C GG T C C T C G TGGAG T G G C C A T T A A
gcctacgacg gcaaggatta catcgccctg aacgaggacc tgcgctcttg accgcggcgg acatggcggc
T T A A C A A A C G C C A A T G CA CA T T
tcagatcacc aagcgcaagt gggaggcggc ccatgtggcg gagcagttga gagcctacct ggagggcacg
T C C A A T G A CA T T T G C G C A
tgcgtggagt ggctccgcag atacctggag aacgggaagg agacgctgca gcgcacggac gcccccaaga
C C G C GA A C C C A A T A G
cgcatatgac tcaccacgct gtctctgacc atgaggccac cctgaggtgc tgggccctga gcttctaccc
A C C C A A T A G T
tgcggagatc acactgacct ggcagcggga tggggaggac cagacccagg acacggagct cgtggagacc
T C T C C A G TAT A G
aggcctgcag gggatggaac cttccagaag tgggcggctg tggtggtgcc ttctggacag gagcagagat
G TAT A G A A T C TA
acacctgcca tgtgcagcat gagggtctgc ccaagcccct caccctgaga tgggagccgt cttcccagcc
A T C G TA G A C A T G
caccatcccc atcgtgggca tcattgctgg cctggttctc tttggagctg tgatcactgg agctgtggtc
G A T C A C C A T G A G C T A T C
gctgctgtga tgtggaggag gaagagctca gatagaaaag gagggagcta ctctcaggct gcaagcagtg
C G A C T T A A C
acagtgccca gggctctgat gtgtctctca cagcttgtaa agtgtga
A C

HLA-A: 89% Verification Rate
• Of the total 108 SNPs we predicted in the coding region of HLA-A, 96 are independently validated by the known HLA-A allele sequences, and 12 are not.
• By comparison, the NCI CGAP project (based on the same EST data) predicts just 10 SNPs in HLA-A (>90% false negatives!)

Mass Spectrometry Validation
• SNPs change the mass of a DNA fragment.
• Sequenom Inc. has tested more than 1000 of our SNPs using mass spectrometry of pooled DNA samples.
• 80% were detectably polymorphic in samples of 90 people.

Bioinformatics Key to SNPs
[Chart: Estimated Number of Human SNPs found (axis 0 to 50,000), by group: MIT-AFFY, NCI (NIH), Wash. U., UCLA]

EST-based SNP detection similar in reliability to experimental methods

project                 # SNPs  verification  # people  method
Picoult-Newberg et al.     850  63%           18        sequencing & G
Buetow et al.             3000  82%           10 to 90  RFLP
UCLA high-LOD            30000  69%, 79%      8 to 24   RFLP
UCLA LOD 3               75000  57%           8 to 24   RFLP
Halushka et al.            874  79%                     resequenced VDA
Cargill et al.             560  55%, 60%                resequenced VDA, DHPLC

Application to Disease Gene Mapping
• How do SNPs compare with traditional marker sets used for disease gene mapping projects?
• Density: how dense is the marker set, when mapped onto the human genome?
• Ideal: at least one marker per gene (strong linkage disequilibrium within 3kb)
• Ideal: high heterozygosity for good statistics

[Figure: marker map of 1.4 MB from 22q13.1, within contig NT_001454 (14.6 MB, 22q11.2 - q13.3) on chromosome 22 (bands p13 to q13.3); scale 0.1 to 1.4 MB.
MICROSATELLITE track: AFM164ze3, AFM273vd9, AFMa046za5, AFM261ye5.
SNP track, gene or Unigene cluster with the pair of counts shown in parentheses on the slide: Hs.197713 (3/3), Hs.205802 (1/1), Hs.211929 (108/51), Hs.193078 (3/2), Hs.176560 (6/6), EIF3S7 (261/106), Hs.139929 (16/12), Hs.146766 (4/2), Hs.212478 (1/1), Hs.143856 (3/2), NCF4 (2/1), Hs.147244 (1/1), CSF2RB (29/14), TST (118/56), Hs.107692 (5/1), Hs.196536 (3/2), PVALB (66/17), Hs.187027 (2/2), Hs.207456 (1/1), Hs.94810 (27/13), IL2RB (1/1), Hs.194750 (4/3), Hs.196941 (20/15), Hs.22011 (27/14), Hs.118700 (3/2), Hs.177397 (2/2), RAC2 (67/35), Hs.7189 (19/14), Hs.174434 (1/1), Hs.220558 (3/2), Hs.187981 (8/3), MFNG (123/37), Hs.187933 (2/1), Hs.57973 (19/12), MSE55 (26/16), Hs.178824 (1/1), Hs.190885 (9/6), Hs.6071 (48/32), Hs.119913 (1/1), Hs.97858 (3/2), Hs.5790 (71/31), Hs.25744 (20/13).]

Mapping Test: positionally cloned genes
• Positionally cloned genes represent a (somewhat) random sampling of genes.
• They are examples of actual disease-gene mapping targets that typically took years of linkage analysis and chromosome walking to find.
• How good is the coverage and heterozygosity of our SNP marker set for these genes?
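Heterozygosity, named above as a criterion for a good marker, can be related to allele frequency with a short calculation. This is a minimal sketch, assuming a biallelic SNP in Hardy-Weinberg equilibrium, for which the expected heterozygosity is 2q(1 - q) and cannot exceed 0.5; the function names are hypothetical and the frequencies are illustrative.

```python
import math


def expected_heterozygosity(q: float) -> float:
    """Expected fraction of heterozygous individuals for allele frequency q."""
    if not 0.0 <= q <= 1.0:
        raise ValueError("allele frequency q must lie in [0, 1]")
    return 2.0 * q * (1.0 - q)


def minor_allele_frequency(h: float) -> float:
    """Invert H = 2q(1 - q), returning the root with q <= 0.5."""
    if not 0.0 <= h <= 0.5:
        raise ValueError("heterozygosity of a biallelic marker lies in [0, 0.5]")
    return (1.0 - math.sqrt(1.0 - 2.0 * h)) / 2.0


if __name__ == "__main__":
    for q in (0.05, 0.1, 0.25, 0.5):
        print(q, expected_heterozygosity(q))
```

Under this model a marker with heterozygosity near 0.5 has two alleles of nearly equal frequency, which is the most informative case for linkage statistics.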
Gene    Disease                                 SNPs  Heterozygosity
ALD     X-Linked Adrenoleukodystrophy           0
APC     Adenomatous Polyposis Coli              7     0.45, 0.42, 0.40, 0.39, 0.26
CFTR    Cystic Fibrosis                         4     0.48, 0.38, 0.35, 0.35
CHM     Choroideremia                           0
CLC1    Thomsen Disease                         0
DM      Myotonic Dystrophy                      0
DMD     Duchenne Muscular Dystrophy             3     0.47, 0.34, 0.24
FMR1    Fragile X Syndrome                      5     0.21, 0.21, 0.18, 0.18, 0.18
GK      Glycerol Kinase Deficiency              9     0.50, 0.50, 0.49, 0.49, 0.49
GLYRA2  Hyperekplexia                           0
HD      Huntington's Disease                    4     0.50, 0.46, 0.35, 0.35
KRT9    Epidermolytic Palmoplantar Keratoderma  0
MLH1    Hereditary Non-polyposis Colon Cancer   2     0.24, 0.08
MNK     Menkes Syndrome                         0
MSH2    Hereditary Non-polyposis Colon Cancer   2     0.31, 0.28
NDP     Norrie Disease                          1     0.49
NF1     Neurofibromatosis, Type 1               3     0.50, 0.50, 0.47
NF2     Neurofibromatosis, Type 2               3     0.49, 0.48, 0.26
OCRL    Lowe Syndrome                           7     0.50, 0.50, 0.50, 0.46, 0.33
PAX3    Waardenburg Syndrome                    5     0.38, 0.38, 0.38, 0.38, 0.38
PAX6    Aniridia                                0
PKD1    Polycystic Kidney Disease               0
RB1     Retinoblastoma                          3     0.36, 0.28, 0.10
RET     Multiple Endocrine Neoplasia 2A         0
SOD1    Amyotrophic Lateral Sclerosis           11    0.48, 0.35, 0.26, 0.21, 0.16
SRY     Gonadal Dysgenesis                      0
TSC     Tuberous Sclerosis                      0
VHL     Von Hippel-Lindau Disease               13    0.50, 0.50, 0.49, 0.48, 0.48
WND     Wilson Disease                          0
WT1     Wilms Tumor                             3     0.44, 0.43, 0.41

SNP validation tests: β globin
• β globin polymorphism has been studied intensively, identifying 100s of substitutions.
• Verify predicted SNPs against known mutations.
• We detect 21 SNPs in β globin, 17 within exons.

SNPs highly biased towards third codon position
SNP codon distribution
FEATURE    n_SNPs
cod_pos_1  2
cod_pos_2  4
cod_pos_3  11

SNPs Biased towards Silent or Conservative Substitutions
SNP substitution type
FEATURE        n_SNPs
silent         6
conservative   9
non-conserved  2

codon pos.  codon  AA   polymorphism  AA   LOD    f (%)  protein pos.  location      type           disease association
3           CAC    HIS  CAT           HIS  83.1   17     2             surface       silent
2           GAG    GLU  GTG           VAL  240.8  5      6             αβ interface  non-conserved  sickle cell
3           GGC    GLY  GGA           GLY  48.9   9      16            surface       silent
2           AAG    LYS  AGG           ARG  52.9   11     17            surface       conservative
3           AGG    ARG  AGT           SER  13.2   8      30            αβ interface  non-conserved  hemolytic anemia
3           CTG    LEU  CTA           LEU  12.9   7      31            core          silent
3           GTG    VAL  GTC           VAL  12.1   8      33            αβ interface  silent
2           GTC    VAL  GCC           VAL  7.4    8      34            αβ interface  silent
3           CAC    HIS  CAA           GLN  14.3   2      77            surface       conservative
3           GAC    ASP  GAA           GLU  3      2      79            surface       conservative
1           AAG    LYS  GAG           GLU  2.3    2      82            surface       conservative
2           ACC    THR  AAC           ASN  23.2   3      84            surface       conservative
1           CTC    LEU  TTC           PHE  2.1    2      105           αβ interface  conservative   erythrocytosis
3           GTG    VAL  GTT           VAL  4.8    2      113           surface       silent
3           CAC    HIS  CAA           GLN  7.6    5      117           surface       conservative
3           GAA    GLU  GAT           ASP  4.2    4      121           surface       conservative
3           GTG    VAL  GCC           ALA  23.7   2      134           core          conservative

SNPs detect three disease alleles
• Mutations previously identified as causing disease, catalogued by Online Mendelian Inheritance in Man.
• The only two non-conservative amino acid substitutions detected.
• All three at the α-β chain interface.

Verified SNP: Hb Tacoma disrupts α-β interface

His 77 Gln
Exposed, unlikely to disrupt stability

What are the most polymorphic genes in the Human Genome?
• Very large differences in polymorphism levels in different genes.
• Maintaining high levels of diversity (large numbers of alleles) may indicate a selective pressure.
• What can we learn from patterns of polymorphism?
• Why are some genes so polymorphic?

The Most Polymorphic Genes: Five Classes
• Direct interactions with pathogens.
• Very highly expressed genes.
• Genes involved in tumorigenesis and survival/growth of tumors.
• Viral- and transposon-derived sequences.
• Large families of highly similar genes?

Acknowledgements
• Christopher Lee: K. Irizarry, B. Modrek, C. Grasso
• Wing Wong (Statistics): C. Li
• Stan Nelson (Human Genetics): V. Kustanovich, N. Brown