CSE280a: Algorithmic topics in bioinformatics Vineet Bafna CSE280 Vineet Bafna The scope/syllabus • We will cover topics from the following areas: – – – – • Population genetics Computational Mass Spectrometry Biological networks (emphasis on comparative analysis) ncRNA The focus will be on the use of algorithms for analyzing data in these areas – Some background in algorithms (mathematical maturity) is helpful. – Relevant biology will be discussed on a ‘need to know’ basis CSE280 Vineet Bafna Logisitics • • • • This class has little required homework. All reading is optional, but recommended. 1 Final Exam (20%), and 1 research project (70%) Students help edit class notes in latex (10%) – – • Project Goal: – • • At most one topic per student Groups of <= 2 per topic Address research problems with minimum preparation Lectures will be given by instructors and by students. Most communication is electronic. – CSE280 Check http://www.cse.ucsd.edu/classes/wi08/cse280a Vineet Bafna From an individual to a population • • Individual genomes vary by about 1 in 1000bp. These small variations account for significant phenotype differences. – – • • • Disease susceptibility. Response to drugs How can we understand genetic variation in a population, and its consequences? It took a long time (10-15 yrs) to produce the draft sequence of the human genome. Soon (within 10-15 years), entire populations can have their DNA sequenced. Why do we care? CSE280 Vineet Bafna Population Genetics • • • • Individuals in a species (population) are phenotypically different. Often these differences are inherited (genetic). Understanding the genetic basis of these differences is a key challenge of biology! The analysis of these differences involves many interesting algorithmic questions. We will use these questions to illustrate algorithmic principles, and use algorithms to interpret genetic data. CSE280 Vineet Bafna Population genetics • We are all similar, yet we are different. How substantial are the differences? – – – – – – What are the sources of variation? As mutations arise, they are either neutral and subject to evolutionary drift, or they are (dis-)advantageous and under selective pressure. Can we tell? If you had DNA from many sub-populations, Asian, European, African, can you separate them? How can we detect recombination? Why are some people more likely to get a disease then others? How is disease gene mapping done? Phasing of chromosomes CSE280 Vineet Bafna Computational mass spectrometry • • Mass Spectrometry is a key technology for measuring active proteins, their interactions, and their post-translational modifications Computation plays a key role in interpreting mass spectrometry data. CSE280 Vineet Bafna Back to the genome • • • Recall that the genome contains protein-coding genes (1% of the genome) Another 5% might encode RNA, and regulatory sites How do we find regions of functional interest? CSE280 Vineet Bafna Population genetics basics CSE280 Vineet Bafna Scope of genetics lectures • • Basic terminology Key principles – – – – – – – – – CSE280 Sources of variation HW equilibrium Linkage Coalescent theory Recombination/Ancestral Recombination Graph Haplotypes/Haplotype phasing Population sub-structure Structural polymorphisms Medical genetics basis: Association mapping/pedigree analysis Vineet Bafna Terminology: allele • Allele: A specific variant at a location – The notion of alleles predates the concept of gene, and DNA. – Initially, alleles referred to variants that described a measurable trait (round/wrinkled seed) – Now, an allele might be a nucleotide on a chromosome, with no measurable phenotype. – As we discuss source of variation, we will have different kinds of alleles. CSE280 Vineet Bafna Terminology • Locus: The location of the allele – A nucleotide position. – A genetic marker – A gene – A chromosomal segment CSE280 Vineet Bafna Terminology • • • • Genotype: genetic makeup of (part of) an individual Phenotype: A measurable trait in an organism, often the consequence of a genetic variation Humans are diploid, they have 2 copies of each chromosome. – They may have heterozygosity/homozygosity at a location – Other organisms (plants) have higher forms of ploidy. – Additionally, some sites might have 2 allelic forms, or even many allelic forms. Haplotype: genetic makeup of (part of) a single chromosome CSE280 Vineet Bafna What causes variation in a population? • • • • Mutations (may lead to SNPs) Recombinations Other crossover events (gene conversion) Structural Polymorphisms CSE280 Vineet Bafna Single Nucleotide Polymorphisms • • • Small mutations that are sustained in a population are called SNPs SNPs are the most common source of variation studied The data is a matrix (rows are individuals, columns are loci). Only the variant positions are kept. CSE280 Vineet Bafna A->G Single Nucleotide Polymorphisms Infinite Sites Assumption: Each site mutates at most once CSE280 Vineet Bafna 00000101011 10001101001 01000101010 01000000011 00011110000 00101100110 Short Tandem Repeats GCTAGATCATCATCATCATTGCTAG GCTAGATCATCATCATTGCTAGTTA GCTAGATCATCATCATCATCATTGC GCTAGATCATCATCATTGCTAGTTA GCTAGATCATCATCATTGCTAGTTA GCTAGATCATCATCATCATCATTGC CSE280 Vineet Bafna 4 3 5 3 3 5 STR can be used as a DNA fingerprint • • • Consider a collection of regions with variable length repeats. Variable length repeats will lead to variable length DNA Vector of lengths is a fingerprint 4 3 5 3 3 5 2 3 1 2 1 3 loci CSE280 Vineet Bafna Structural polymorphisms • • • Large scale structural changes (deletions/insertions/inversions) may occur in a population. Copy Number variation Certain diseases (cancers) are marked by an abundance of these events CSE280 Vineet Bafna Personalized genome sequencing • • • • • • • • These variants (of which 1,288,319 were novel) included 3,213,401 single nucleotide polymorphisms (SNPs), 53,823 block substitutions (2–206 bp), 292,102 heterozygous insertion/deletion events (indels)(1–571 bp), 559,473 homozygous indels (1–82,711 bp), 90 inversions, as well as numerous segmental duplications and copy number variation regions. Non-SNP DNA variation accounts for 22% of all events identified in the donor, however they involve 74% of all variant bases. This suggests an important role for non-SNP genetic alterations in defining the diploid genome structure. Moreover, 44% of genes were heterozygous for one or more variants. CSE280 Vineet Bafna PLoS Biology, 2007 Recombination 00000000 11111111 00011111 • CSE280 Vineet Bafna Not all DNA recombines! Human DNA • • • Not all DNA recombines. mtDNA is inherited from the mother, and y-chromosome from the father CSE280 Vineet Bafna http://upload.wikimedia.org/wikipedia/commons/b/b2/Karyotype.png Gene Conversion • Gene Conversion versus single crossover – CSE280 Hard to distinguish in a population Vineet Bafna Topic 1: Basic Principles • In a ‘stable’ population, the distribution of alleles obeys certain laws – • HW Equilibrium – • Not really, and the deviations are interesting (due to mixing in a population) Linkage (dis)-equilibrium – CSE280 Due to recombination Vineet Bafna Hardy Weinberg equilibrium • • • • Consider a locus with 2 alleles, A, a p (respectively, q) is the frequency of A (resp. a) in the population 3 Genotypes: AA, Aa, aa Q: What is the frequency of each genotype If various assumptions are satisfied, (such as random mating, no natural selection), Then • PAA=p2 • PAa=2pq • Paa=q2 CSE280 Vineet Bafna Hardy Weinberg: why? • Assumptions: – – – – – • Diploid Sexual reproduction Random mating Bi-allelic sites Large population size, … Why? Each individual randomly picks his two chromosomes. Therefore, Prob. (Aa) = pq+qp = 2pq, and so on. CSE280 Vineet Bafna Hardy Weinberg: Generalizations • Multiple alleles with frequencies – By HW, 1,2, , H Pr[homozygous genotype i] = i2 Pr[heterozygous genotype i, j] = 2 i j • Multiple loci? CSE280 Vineet Bafna Hardy Weinberg: Implications • • • • The allele frequency does not change from generation to generation. Why? It is observed that 1 in 10,000 caucasians have the disease phenylketonuria. The disease mutation(s) are all recessive. What fraction of the population carries the mutation? Males are 100 times more likely to have the “red’ type of color blindness than females. Why? Conclusion: While the HW assumptions are rarely satisfied, the principle is still important as a baseline assumption, and significant deviations are interesting. CSE280 Vineet Bafna Recombination 00000000 11111111 00011111 CSE280 Vineet Bafna What if there were no recombinations? • • • Life would be simpler Each individual sequence would have a single parent (even for higher ploidy) The genealogical relationship is expressed as a tree. This principle is used to track ancestry of an individual CSE280 Vineet Bafna The Infinite Sites Assumption 00000000 3 00100000 5 8 00100001 • • • CSE280 00101000 The different sites are linked. A 1 in position 8 implies 0 in position 5, and vice versa. Some phenotypes could be linked to the polymorphisms Some of the linkage is Vineet “destroyed” by recombination Bafna Infinite sites assumption and Perfect Phylogeny • • Each site is mutated at most once in the history. All descendants must carry the mutated value, and all others must carry the ancestral value i 1 in position i 0 in position i CSE280 Vineet Bafna Perfect Phylogeny • • Assume an evolutionary model in which no recombination takes place, only mutation. The evolutionary history is explained by a tree in which every mutation is on an edge of the tree. All the species in one sub-tree contain a 0, and all species in the other contain a 1. Such a tree is called a perfect phylogeny. CSE280 Vineet Bafna Handling recombination • • A tree is not sufficient as a sequence may have 2 parents Recombination leads to loss of correlation between columns CSE280 Vineet Bafna Quiz 1 • • • Allele, locus, genotype, haplotype Hardy Weinberg equilibrium? Today: Linkage (dis)-equilibrium CSE280 Vineet Bafna Quiz 2 • Recall that a SNP data-set is a ‘binary’ matrix. – – • Rows are individual (chromosomes) Columns are alleles at a specific locus Suppose you have 2 SNP datasets of a contiguous genomic region. – – – CSE280 One from an African population, and one from a European Population. Can you tell which is which? How long does the genomic region have to be? Vineet Bafna Recombination, and populations • • • • • Think of a population of N individual chromosomes. The population remains stable from generation to generation. Without recombination, each individual has exactly one parent chromosome from the previous generation. With recombinations, each individual is derived from one or two parents. We will formalize this notion later in the context of coalescent theory. CSE280 Vineet Bafna Linkage (Dis)-equilibrium (LD) • • • Consider sites A &B Case 1: No recombination Each new individual chromosome chooses a parent from the existing ‘haplotype’ CSE280 Vineet Bafna A 0 0 0 0 1 1 1 1 B 1 1 0 0 0 0 0 0 1 0 Linkage (Dis)-equilibrium (LD) • • • Consider sites A &B Case 2: diploidy and recombination Each new individual chooses a parent from the existing alleles CSE280 Vineet Bafna A 0 0 0 0 1 1 1 1 B 1 1 0 0 0 0 0 0 1 1 Linkage (Dis)-equilibrium (LD) • Consider sites A &B • Case 1: No recombination Each new individual chooses a parent from the existing ‘haplotype’ – Pr[A,B=0,1] = 0.25 • Linkage disequilibrium Case 2: Extensive recombination Each new individual simply chooses and allele from either site – Pr[A,B=(0,1)]=0.125 • Linkage equilibrium • • • CSE280 Vineet Bafna A 0 0 0 0 1 1 1 1 B 1 1 0 0 0 0 0 0 LD • In the absence of recombination, – – • Correlation between columns The joint probability Pr[A=a,B=b] is different from P(a)P(b) With extensive recombination – CSE280 Pr(a,b)=P(a)P(b) Vineet Bafna Measures of LD • • Consider two bi-allelic sites with alleles marked with 0 and 1 Define – – • • P00 = Pr[Allele 0 in locus 1, and 0 in locus 2] P0* = Pr[Allele 0 in locus 1] Linkage equilibrium if P00 = P0* P*0 D = abs(P00 - P0* P*0) = abs(P01 - P0* P*1) = … CSE280 Vineet Bafna LD over time • With random mating, and fixed recombination rate r between the sites, Linkage Disequilibrium will disappear – – – – CSE280 Let D(t) = LD at time t P(t)00 = (1-r) P(t-1)00 + r P(t-1)0* P(t-1)*0 D(t) = P(t)00 - P(t)0* P(t)*0 = P(t)00 - P(t-1)0* P(t-1)*0 (Why?) D(t) =(1-r) D(t-1) =(1-r)t D(0) Vineet Bafna Other measures of LD • D’ is obtained by dividing D by the largest possible value – – • • • Dmax = max {P1*P*1, P0*P*1, P1*P*0, P0*P*0 } Ex: D’ = abs(P11- P1* P*1)/ Dmax = D/(P1* P0* P*1 P*0)1/2 Let N be the number of individuals Show that 2N is the 2 statistic between the two sites 0 Site 1 0 1 P00N P0*N 1 CSE280 Site 2 Vineet Bafna LD over distance • Assumption – – • • Recombination rate increases linearly with distance LD decays exponentially with distance. The assumption is reasonable, but recombination rates vary from region to region, adding to complexity This simple fact is the basis of disease association mapping. CSE280 Vineet Bafna LD and disease mapping • • • Consider a mutation that is causal for a disease. The goal of disease gene mapping is to discover which gene (locus) carries the mutation. Consider every polymorphism, and check: – – • There might be too many polymorphisms Multiple mutations (even at a single locus) that lead to the same disease Instead, consider a dense sample of polymorphisms that span the genome CSE280 Vineet Bafna LD can be used to map disease genes LD 0 1 1 0 0 1 D N N D D N • • LD decays with distance from the disease allele. By plotting LD, one can short list the region containing the disease gene. CSE280 Vineet Bafna CSE280 Vineet Bafna QuickTime™ and a TIFF (LZW) decompressor QuickTime™ are needed to see this picture.and a TIFF (LZW) decompressor are needed to see this picture. • 269 individuals – – – – • 90 Yorubans 90 Europeans (CEPH) 44 Japanese 45 Chinese ~1M SNPs CSE280 Vineet Bafna LD and disease gene mapping problems • • • Marker density? Complex diseases Population sub-structure CSE280 Vineet Bafna Topic 2: Simulating population data • • We described various population genetic concepts (HW, LD), and their applicability The values of these parameters depend critically upon the population assumptions. – – – – – • What if we do not have infinite populations No random mating (Ex: geographic isolation) Sudden growth Bottlenecks Ad-mixture It would be nice to have a simulation of such a population to test various ideas. How would you do this simulation? CSE280 Vineet Bafna