Stat 877(992) Statistical methods in molecular biology

Course plans
• Team taught: Newton, Larget, Ane, Keles, Kendziorski, Broman, Yandell
• Per instructor homework set (six at 12 pts each)
• Final project, poster presentation (28 pts)

National Research Council Report, 2004
Mathematics and 21st Century Biology
“Progress in the biosciences will increasingly depend on deep and broad integration of mathematical analysis into studies at all levels of biological organization…: molecules, cells, organisms, populations, and ecosystems.”
“The committee regards the interface between mathematics and biology as biology-driven.”

Some definitions [first approximations!]
cell: structural/functional unit of all living organisms
protein: organic compound produced and used by cell
amino acid: protein building block
nucleic acid: chainlike molecule involved in preservation, replication, and expression of hereditary information in every living cell
nucleotide: nucleic acid building block

Example function: oxygen transport
2-3 x 10^13 red blood cells/body
2 x 10^6 new cells/second
95% of dry weight is protein hemoglobin

Hemoglobin
more about hemoglobin

Sequence of amino acids in hemoglobin
• alpha chain (141 amino acids) [2 subunits]
• VLSAADKTNVKAAWSKVGGHAGEYGAEALERMFLGFPTTKTYFPHFDLSHGSAQVKAHGKKVADGLTLAVGHLDDLPGALSDLSNLHAHKLRVDPVNFKLLSHCLLSTLAVHLPNDFTPAVHASLDKFLSSVSTVLTSKYR
• beta chain (146 amino acids) [2 subunits]
• VQLSGEEKAAVLALWDKVNEEEVGGEALGRLLVVYPWTQRFFDSFGDLSNPGAVMGNPKVKAHGKKVLHSFGEGVHHLDNLKGTFAALSELHCDKLHVDPENFRLLGNVLALVVARHFGKDFTPELQASYQKVVAGVANALAHKYH

A few amino acids (among 20 standard)
V = Val = Valine
L = Leu = Leucine
M = Met = Methionine
more about amino acids

Amino acids are concatenated into protein by the translation of information stored in messenger RNA (mRNA)

Ribonucleic acid (RNA)
Nucleotide bases
A = adenine
C = cytosine
U = uracil
G = guanine
single stranded
Met Thr Glu Leu Arg Ser stop
Amino acids are encoded by triples of mRNA nucleotides called codons
more about the genetic code

Translation: mRNA to protein via ribosome & tRNA
Base pairing: A-U, G-C
video podcast of translation

mRNA structure
orientation 5’ to 3’
UTR = untranslated region:
• mRNA stability
• mRNA localization
• translational efficiency

Mature mRNA may have been processed by splicing a primary transcript (pre-mRNA)

Primary transcripts are produced by the transcription of DNA

Deoxyribonucleic acid (DNA)
double stranded
4 nucleotide bases: A, T, G, C
base pairing: A-T, C-G

Transcription: DNA to RNA via RNA polymerase
initiate, elongate, terminate

Central dogma of molecular biology
Replication: DNA copies itself during cell division

More on organization of DNA
Chromosomes are organized structures of DNA and proteins that are found in cells.
Each chromosome contains a single continuous piece of DNA.
In diploid species, chromosomes are paired.
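The transcription and translation steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the DNA sequence is hypothetical, chosen so that the resulting peptide matches the slide's example (Met Thr Glu Leu Arg Ser stop), and the codon table is deliberately partial, covering only the seven codons used here rather than the full 64-codon standard genetic code. The function names are made up for this sketch.

```python
# Hypothetical coding-strand DNA (5' to 3'), not taken from the course notes.
CODING_DNA = "ATGACCGAACTGCGTAGCTAA"

# Partial codon table (assumption: only the codons needed for this example).
CODON_TABLE = {
    "AUG": "Met", "ACC": "Thr", "GAA": "Glu", "CUG": "Leu",
    "CGU": "Arg", "AGC": "Ser", "UAA": "stop",
}

DNA_PAIRS = {"A": "T", "T": "A", "C": "G", "G": "C"}      # DNA-DNA pairing
DNA_TO_RNA = {"A": "U", "T": "A", "C": "G", "G": "C"}     # template DNA -> mRNA


def template_strand(coding_dna):
    """Position-by-position complement of the coding strand (A-T, C-G).

    The result is the template strand written 3' to 5'.
    """
    return "".join(DNA_PAIRS[base] for base in coding_dna)


def transcribe(template_dna):
    """Build the mRNA by pairing each template base with an RNA base."""
    return "".join(DNA_TO_RNA[base] for base in template_dna)


def translate(mrna):
    """Read codons (triples) from the first base until a stop codon."""
    peptide = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "stop":
            break
        peptide.append(amino_acid)
    return peptide


mrna = transcribe(template_strand(CODING_DNA))
peptide = translate(mrna)
print(mrna)
print(" ".join(peptide))
```

Note that the mRNA equals the coding strand with T replaced by U, which is why the coding strand is the one usually quoted for a gene. A real analysis would use a complete codon table and would also need to locate the start codon and handle untranslated regions.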
Human chromosome        total number base pairs
1                       247,200,000
2                       242,750,000
3                       199,450,000
4                       191,260,000
5                       180,840,000
6                       170,900,000
7                       158,820,000
8                       146,270,000
9                       140,440,000
10                      135,370,000
11                      134,450,000
12                      132,290,000
13                      114,130,000
14                      106,360,000
15                      100,340,000
16                       88,820,000
17                       78,650,000
18                       76,120,000
19                       63,810,000
20                       62,440,000
21                       46,940,000
22                       49,530,000
X (sex chromosome)      154,910,000
Y (sex chromosome)       57,740,000

A genome equals the sequence of one full copy
3 Gbp, or 100 yrs at 1 bp/second
Estimates from Sanger’s Vertebrate Genome Annotation (VEGA) database, 7/07

2001: drafts of the human genome sequence published
1% of bases are in exons
24% of bases are in introns

2007: pilot phase of ENCODE project completed
Encyclopedia Of DNA Elements
• majority of bases are transcribed
• extensive transcript overlap
• functions poorly understood

Evolving definition of gene
1860s-1900s: a discrete unit of heredity (Mendel)
1910s: a distinct locus (Morgan)
1940s: the blueprint for a protein (Beadle & Tatum)
1960s: a transcribed code (Watson & Crick)
Genome era: a locatable region of genomic sequence, corresponding to a unit of inheritance, which is associated with regulatory regions, transcribed regions and/or other functional sequence regions

Figure 5, Mark B. Gerstein et al., Genome Res. 2007; 17: 669-681

Post ENCODE
The gene is a union of genomic sequences encoding a coherent set of potentially overlapping functional products
Gerstein et al. 2007

What about Statistics?
Statistics supports the development of genomic resources
• In accommodating sequencing errors for genome assembly
• In rating the significance of sequence matches by alignment algorithms

Statistics supports analyses to determine the function of genes/transcripts/proteins
• Gene regulation
• Gene expression
• Network considerations (many processes/functions)

Example: oxygen transport
According to the Gene Ontology (GO) project, 46 different genes are involved in this biological process

Statistics is critical in analyzing patterns of genomic variation within populations, and in relating this variation to disease states or other phenotypes
• Genomes differ from the reference copy (single nucleotide polymorphisms, structural variants)
• Gene mapping by linkage and association methods

Statistics is critical in analyzing patterns of genomic variation between populations/species
• Phylogenetic analysis

“Nothing in biology makes sense except in the light of evolution” - T. Dobzhansky

Tree of life project

“It is interesting to contemplate a tangled bank, clothed with many plants of many kinds, with birds singing on the bushes, with various insects flitting about, and with worms crawling through the damp earth, and to reflect that these elaborately constructed forms, so different from each other, and dependent upon each other in so complex a manner, have all been produced by laws acting around us. These laws, taken in the largest sense, being Growth with reproduction; Inheritance which is almost implied by reproduction; Variability from the indirect and direct action of the conditions of life, and from use and disuse; a Ratio of Increase so high as to lead to a Struggle for Life, and as a consequence to Natural Selection, entailing Divergence of Character and the Extinction of less improved forms.
Thus, from the war of nature, from famine and death, the most exalted object which we are capable of conceiving, namely, the production of the higher animals, directly follows.” - Charles Darwin