MOLECULAR GENETICS 2005
Tuesday November 29: Human Genetics, part I

Liisa Kauppi (Keeney lab)
RRL-1129
Phone 639 5180
Email kauppiL@mskcc.org

Web resources:
For BLAST searches, gene queries etc.: http://www.ensembl.org/Homo_sapiens/index.html
For comprehensive listing and information on genetic diseases: Online Mendelian Inheritance in Man (OMIM) http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM
SNP database: http://www.ncbi.nlm.nih.gov/SNP/index.html
HapMap project: http://www.hapmap.org/index.html.en

Textbook: Human Molecular Genetics (Strachan and Read), chapters 11 and 12 in the second edition of the book.
See any recent issue of the American Journal of Human Genetics (at www.ajhg.org) for applications of the methods described in this lecture.

Human genetics can be studied in pedigrees or populations. There are limitations, however: unlike with model organisms such as mice or flies, in human genetics you cannot generate mutants, perform controlled matings, or place your subjects in a controlled environment. We have to rely on “natural” mutants, i.e. patients with a phenotype.

For a trait to be amenable to genetic analysis, it has to show some degree of heritability. This can be assessed in pedigrees (first-degree relatives, twins).

There are two basic types of genetic disease: simple Mendelian, with complete correlation between genotype and phenotype (if you have the mutant gene, you will get the disease), and complex or multifactorial, where the risk of getting the disease is modified by the individual's genotype, and other factors, such as other genes and the environment, also influence disease risk.

Mapping disease genes in humans is done using polymorphic DNA markers (single nucleotide polymorphisms [SNPs] and microsatellites). The terminology for single base changes is defined by allele frequency: a single base change occurring in a population at a frequency of >1% is termed a SNP, whereas one occurring at <1% is considered to be a mutation.
Expected genotype frequencies at any bi-allelic locus can be calculated using the Hardy-Weinberg equation. It rests on the following assumptions: a large, randomly mating population, no mutations, no migration between populations, and no selection (all genotypes reproduce with equal success).

Basic relations are:
two alleles at a locus (B and b)
frequency of the B allele = p, frequency of the b allele = q, and p + q = 1

Genotype frequencies are given by the equation:
p^2 (BB) + 2pq (Bb) + q^2 (bb) = 1

Any individual is the result of the union of two gametes, each of which has a probability of containing a specific allele that is equal to the allele frequency. This can be diagrammed as a Punnett square:

            p (B)       q (b)
p (B)       p^2 (BB)    pq (Bb)
q (b)       pq (Bb)     q^2 (bb)

Hardy-Weinberg equilibrium (HWE) explains how recessive traits are maintained in a population, and allows calculation of carrier frequencies for recessive traits (with some caution). Departure from HWE suggests that some of the assumptions are not met; for example, recent migration or selection may be influencing the genotype frequencies.

Family studies

In a simple Mendelian disease, polymorphisms are studied in family members for genetic linkage. If the marker is close to the disease gene on the chromosome, there is a low chance of recombination between them at meiosis and linkage is observed; if the marker and disease gene are far apart (or on different chromosomes), linkage is not observed.

To calculate linkage, the numbers of recombinant (R) and non-recombinant (NR) offspring are counted for each sibship in the family. If the recombination fraction RF (θ), calculated as R/(R+NR), is significantly less than 0.5, there is evidence of linkage.

Numbers of offspring in human families are small. In order to get statistically significant evidence for linkage, it is necessary to combine evidence from many families, and to make probabilistic guesses for families where not all members are available for study (all done on a computer).
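The two calculations described in this section, Hardy-Weinberg genotype frequencies and the recombination fraction, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are made up for this example, the disease incidence of 1 in 2500 is a hypothetical value, the offspring counts are invented, and the exact one-sided binomial test is one simple way to ask whether RF is "significantly less than 0.5" when every offspring can be scored unambiguously as R or NR. It is not the likelihood method used by linkage programs.

```python
from math import comb, sqrt


def hardy_weinberg(p):
    """Expected genotype frequencies (BB, Bb, bb) when the B allele has frequency p."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("allele frequency must lie between 0 and 1")
    q = 1.0 - p
    return p * p, 2.0 * p * q, q * q


def carrier_frequency(incidence):
    """Carrier (Bb) frequency for a recessive trait whose incidence equals q^2.

    Assumes the population is in Hardy-Weinberg equilibrium.
    """
    q = sqrt(incidence)
    return 2.0 * (1.0 - q) * q


def recombination_fraction(r, nr):
    """RF = R / (R + NR) from counts of recombinant and non-recombinant offspring."""
    if r < 0 or nr < 0 or r + nr == 0:
        raise ValueError("need non-negative counts and at least one offspring")
    return r / (r + nr)


def p_value_unlinked(r, nr):
    """Probability of seeing r or fewer recombinants if the loci were unlinked.

    Under no linkage each offspring is recombinant with probability 0.5,
    so this is the lower tail of a binomial(r + nr, 0.5) distribution.
    """
    n = r + nr
    return sum(comb(n, k) for k in range(r + 1)) / 2 ** n


if __name__ == "__main__":
    print("Genotype frequencies for p = 0.7:", hardy_weinberg(0.7))
    # Hypothetical recessive disease affecting 1 in 2500 births.
    print("Carrier frequency:", carrier_frequency(1 / 2500))
    # Invented counts: 2 recombinants among 20 scorable offspring.
    print("RF:", recombination_fraction(2, 18))
    print("P(R <= 2 | unlinked):", p_value_unlinked(2, 18))
```

With the hypothetical incidence of 1 in 2500, q is 0.02 and roughly 1 person in 25 is a carrier, which shows how a rare recessive phenotype can coexist with a fairly common carrier state.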
The lod (logarithm of odds) score Z is a statistic that describes the strength of evidence for linkage, at any chosen value of the RF, given the family data available. It is calculated as follows:

Z = log10 [Likelihood of loci being linked / Likelihood of loci not being linked]

A lod score of 3 or more is considered good evidence for linkage, and a lod score of -2 or less is evidence against linkage. Values between -2 and 3 are inconclusive, and more data must be obtained.

A genome scan consists of genotyping a collection of families with the genetic disease using hundreds of markers across the genome. By using many markers there is a good chance that any unknown gene will be close enough to one or two of them to show genetic linkage. The aim is to find linkage with two markers, one on each side of the disease gene; the disease gene must then lie between the two markers (usually a few Mb apart).

[Figure: Recognizing recombinants. Does the disease segregate with this marker? Three-generation pedigree (I-III) with marker genotypes; of the six offspring scored, five are non-recombinant (NR) and one is recombinant (R). Recombination fraction is 1/6 = 0.167.]

[Figure: Which marker is the disease locus closest to? Multi-point lod scores, chr 3p12-14, Waardenburg syndrome type 2. Hughes et al. (1994) Nature Genet 7, 509-512.]

“Shared segment” analyses

When there is no simple Mendelian inheritance pattern, we can analyze affected people only. When you look only at affected individuals, there is no need to specify the penetrance of the disease. In affected sib pair analysis, the aim is to identify genomic segments that are shared between the affected sibs more often than expected by chance.

Association studies in populations

In populations, cases (patients) and controls can be collected. You then look for alleles that are more prevalent in the patient cohort than in the general population (controls). However, typing many markers in such large numbers of samples would be prohibitively expensive. Luckily, we can make use of a phenomenon called linkage disequilibrium (LD).
LD refers to the correlation between alleles at two loci on a chromosomal segment. LD is a result of this chromosomal segment (haplotype) not having been broken up by meiotic recombination in the population.

A haplotype is the set of alleles at more than one locus that is inherited by an individual from one of its parents. For two loci with two alleles each (A and a, B and b) there are four possible haplotypes: AB, Ab, aB, ab. A diploid individual’s genotype could be, for example, AB on one homologous chromosome and Ab on the other homolog. In other words, the individual has two haplotypes, one inherited from each parent, just as a one-locus genotype contains two alleles from the two parents.

The opposite of LD is linkage equilibrium (or free association), which can be thought of as the two-locus version of the Hardy-Weinberg ratio, but it is a property of haplotypes, not genotypes. At linkage equilibrium, all four haplotypes are found in the population, at the frequencies expected from the frequencies of the individual alleles (each haplotype frequency is the product of its two allele frequencies).

[Figure: Linkage disequilibrium (LD) measures association between two alleles. Mutation creates new variants; initially, the new allele is in LD with nearby alleles (LD value = 1). Recombination reshuffles existing variation and LD diminishes. If enough crossovers take place, the loci are in “free association”. Commonly used LD measures: D' and r^2.]

LD is useful for gene mapping because, if there are extended chromosomal segments where alleles are inherited as “a package” (haplotype blocks), then there is no need to type every SNP in such a segment. It should be sufficient to select a few “tag SNPs” for a haplotype block and genotype only those. Recently, the HapMap project has characterized the extent of haplotype blocks in four human populations (Yoruba from Nigeria, US residents of European ancestry, Japanese, and Han Chinese), and identified SNPs for tagging these blocks.
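To make the LD measures D' and r^2 mentioned above concrete, here is a minimal Python sketch that computes them from the four haplotype frequencies at two bi-allelic loci. The formulas are the standard textbook definitions, not something spelled out in this handout: D = f(AB) - f(A)f(B), D' = |D| / Dmax, and r^2 = D^2 / [f(A)f(a)f(B)f(b)]. The function name `ld_measures` and the example frequencies are invented for illustration.

```python
def ld_measures(hap_freqs):
    """Return (D, D', r^2) from haplotype frequencies keyed 'AB', 'Ab', 'aB', 'ab'."""
    f_AB, f_Ab = hap_freqs["AB"], hap_freqs["Ab"]
    f_aB, f_ab = hap_freqs["aB"], hap_freqs["ab"]
    if abs(f_AB + f_Ab + f_aB + f_ab - 1.0) > 1e-9:
        raise ValueError("haplotype frequencies must sum to 1")

    p_A = f_AB + f_Ab  # frequency of allele A at the first locus
    p_B = f_AB + f_aB  # frequency of allele B at the second locus
    if p_A in (0.0, 1.0) or p_B in (0.0, 1.0):
        raise ValueError("both loci must be polymorphic")

    # D is the excess of the AB haplotype over its linkage-equilibrium expectation.
    d = f_AB - p_A * p_B

    # Dmax is the largest |D| possible given the allele frequencies.
    if d >= 0:
        d_max = min(p_A * (1.0 - p_B), (1.0 - p_A) * p_B)
    else:
        d_max = min(p_A * p_B, (1.0 - p_A) * (1.0 - p_B))

    d_prime = abs(d) / d_max
    r_squared = d * d / (p_A * (1.0 - p_A) * p_B * (1.0 - p_B))
    return d, d_prime, r_squared


if __name__ == "__main__":
    # Invented example: allele A is a newer variant found only on the B background,
    # so one of the four haplotypes (Ab) is absent.
    young = {"AB": 0.1, "Ab": 0.0, "aB": 0.4, "ab": 0.5}
    print("no recombinant haplotype yet:", ld_measures(young))

    # Linkage equilibrium: every haplotype frequency is a product of allele frequencies.
    shuffled = {"AB": 0.25, "Ab": 0.25, "aB": 0.25, "ab": 0.25}
    print("free association:", ld_measures(shuffled))
```

In the first example D' is 1 because only three of the four haplotypes exist, yet r^2 is only about 0.11 because the two alleles differ in frequency. This is why r^2, rather than D', is usually the more relevant quantity when asking how well one SNP can stand in for another as a tag.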
LD-based methods work best when there is a single susceptibility allele at any given disease locus, and generally perform very poorly if there is substantial allelic heterogeneity.

The Common-Disease Common-Variant (CDCV) hypothesis states that disease-predisposing variants exist at relatively high frequency (i.e. >1%) in the population. According to the CDCV hypothesis, common diseases are caused by ancient alleles occurring on specific haplotypes; these haplotypes should be detectable in a case-control study using tagging SNPs. The alternative (and more pessimistic) hypothesis states that disease-predisposing alleles are sporadic new mutations, perhaps around the same genes, on different haplotypes. Different families with a history of the same disease would then owe their condition to different mutation events.

There are examples showing that the CDCV hypothesis holds for at least some common diseases (Alzheimer’s etc.). It is likely, however, that some diseases have underlying allelic heterogeneity. For any gene, there are a number of ways in which SNPs can alter its function or expression (non-synonymous changes, modified splicing, altered promoter function etc.). If a pathway leading to the disease is complex (consider schizophrenia, for example), the number of genes involved is higher, and the likelihood of several SNPs affecting gene function increases.

Finally, it has recently been discovered that all humans carry several low copy number polymorphisms in their genomes (see Mike Wigler’s President’s seminar talk on Dec 14). It is likely that these too will influence susceptibility to common disease.

There is also evidence that microsatellite and insertion/deletion (indel) polymorphisms can influence phenotypes:
- Spinocerebellar ataxia type 10 (SCA10) (OMIM +603516) is caused by the largest tandem repeat seen in the human genome.
The normal population has 10-22 repeats of the pentanucleotide ATTCT in intron 9 of the SCA10 gene; SCA10 patients have 800-4500 repeat units, which make the disease allele up to 22.5 kb larger than the normal one.
- An association between coronary heart disease and a 287-bp indel polymorphism located in intron 16 of the angiotensin converting enzyme (ACE) gene has been reported (OMIM 106180). This indel, known as ACE/ID, is responsible for 50% of the inter-individual variability in plasma ACE concentration.