Computational Biology
Introduction, Basic Biology

Nives Skunca
Slides prepared by Dr. Christophe Dessimoz
19/21 September 2012

This week
• Course introduction
• Basic Biology

[Diagram: Reality (Nature), Catalogue and Model f(x), linked by the labels observation, perturbation, formulate/select, validate on real data, estimate by simulation, and prediction. Further labels: recreate life (“synthetic biology”); take it apart (“in vitro”). Image: Georg Dionysius Ehret's illustration of Linnaeus's sexual system of plant classification, 1736.]

Learning Outcomes
• Understand basic concepts of molecular biology
• Understand and apply fundamental models, algorithms, data structures, and computational techniques to answer biological questions
• Wide range of topics, but special focus on biological sequences and their evolutionary context.

Topics
Molecular Genetics, Gene Evolution, Genome Evolution, Mass Spectrometry, Codon Bias
×
Modeling, Dynamic programming, Markov models, Least squares, Maximum Likelihood, Optimization, Heuristics, Simulation

Organization
• Lecture
  • Wed 13-14 (CAB G52), Fri 13-15 (ML F34)
  • Prof. Gonnet will hold the lectures
• Exercises:
  • Thu 14-16 (CAB H56), starting this week
  • If you do not have a nethz account, ask Stefan Zoller as soon as possible.
• Teaching Assistants
  • Stefan Zoller
  • Nives Skunca

Date            Topic                                                              Lecturer
Sept. 19/21     Course Introduction; Basic Molecular Biology                       NS
Sept. 26/28     Markov models/String Alignment I                                   GHG
Oct. 3/5        String Alignment II (indels, estimating distances)                 GHG
Oct. 10/12      Substitution Matrices                                              GHG
Oct. 17/19      Approximate Alignment Methods; Statistics of Pairwise Alignments   GHG
Oct. 24/26      Phylogeny I                                                        GHG
Oct. 31/Nov. 2  Phylogeny II                                                       GHG
Nov. 7/9        Phylogeny III                                                      GHG
Nov. 14/16      Multiple Sequence Alignments                                       AS
Nov. 21/23      Synthetic Evolution; Evaluation of Estimators                      DD/GHG
Nov. 28/30      Current research; Mass profiling                                   Guests/GHG
Dec. 5/7        Orthology/Lateral Gene Transfer                                    NS
Dec. 12/14      Codon bias                                                         SZ
Dec. 19/21      Genome Rearrangements                                              GHG

Course Grade & Credits
• Participation in the exercises is strongly encouraged, but not mandatory
• Written Exam
  • During winter session
  • 3 hours
  • Only support materials are 2 A4 pages (4 sides), personally handwritten.

Course Homepage
http://www.cbrg.ethz.ch/education/CompBiol
• Course details
• Schedule
• Slides
• Exercises

Darwin
• Interpreted language based on Maple
• Environment for bioinformatics, can do sequence management, mathematics, alignments, trees, drawing, etc.
• Available for download for Mac and Linux (http://www.cbrg.ethz.ch/darwin)

Biorecipes
www.biorecipes.com
• A collection of real problems with coded solutions in the Darwin language
• Darwin input in green
• Darwin output in red

Other materials
• Slides can be downloaded from the course homepage.
• Additional notes and references will be made available as well.

Basic Biology
Slides of this part are largely based on material from Dr. Gina Cannarozzi

Basic Principles
• Universality of life on earth: water, carbon-based biochemistry; genetic material; genetic code (largely) universal. → common origin!
• Life is compartmentalized: cells are fundamental units of structure, function, organization
• Self-replicating
• Capable of Darwinian evolution
[Image: Cryptomonadales, scale bar 10 µm. Encyclopedia of Life (eol.org)]

So what is life?
“Living organisms undergo metabolism, maintain homeostasis, possess a capacity to grow, respond to stimuli, reproduce and, through natural selection, adapt to their environment in successive generations.”
• What about endospores? viruses? mules? priests? prions? computer viruses?
• In biology, there are exceptions to almost every rule.
Inside a Cell
• Prokaryote: ~2 µm [Diagram: http://www.osovo.com/diagram/prokaryoticcelldiagram.htm]
• Eukaryote: 10-30 µm [Diagram: http://www.biologycorner.com/resources/cell.gif]

Relevant components
• Ribosomes translate mRNA into proteins.
• Mitochondria (eukaryotes) have their own DNA and are a result of early inclusion of α-proteobacteria into a eukaryotic cell.
• Chloroplasts (plants, protists) have their own DNA as a result of early inclusion of cyanobacteria into a eukaryotic cell.
• Plasmids (bacteria) are short pieces of circular DNA in multiple copies; nonessential; get transferred between bacteria.

Genome
[Figure labels: chromosome, chromatin, histone]
• Genome: all the genetic material of an organism.
• The genome consists of genes and non-coding regions.
• Genes consist of regulatory regions, introns, exons, untranslated regions
http://www.scfbio-iitd.res.in/tutorial/geneorganization.html

Escherichia coli
• 1 circular chromosome
• 1 plasmid (multiple copies)
• ~4.6 million base pairs
• ~3.9 million coding bases (85%)
• 4132 protein-coding genes
• 172 RNA (tRNA, rRNA, etc.)
• 578 pseudogenes

Homo sapiens
• 23 chromosome pairs
• ~3 billion base pairs
• ~50 million coding bases (1.5%)
• ~21,000 protein-coding genes
• ~294,000 exons
• ~60,000 different transcripts
• ~6,000 pseudogenes
• ~4,800 RNA genes
• ~2,900 RNA pseudogenes

DNA
Deoxyribonucleic acid
• Double helix
• Backbones: phosphate and deoxyribose, directed (5’ → 3’), antiparallel
• Connection: 4 bases Adenine, Thymine, Cytosine, Guanine.
• A-T and C-G are paired by hydrogen bonds (relatively weak)
[Figure (Wikipedia): double helix with dimension labels 34 Å (3.4 nm) and 3.3 Å (0.33 nm)]

DNA Bases
• PuRines
• PYrimidines
• C ···· G: 3 H-bonds
• A ···· T: 2 H-bonds
[Figure: Wikipedia]

Hydrogen Bond
• X-H ···· Y where X and Y are electronegative atoms (typically N, O, F)
• Responsible for high boiling point of water (each H2O can have up to 4 H bonds)

“Central dogma of molecular biology”
[Figure: Wikipedia]

DNA Replication
[Figure: Wikipedia]
Polymerase can only add bases from 5’ → 3’ (DNA is read 3’ → 5’)

Movie time!
Replication visualized: http://www.wehi.edu.au/education/wehitv/molecular_visualisations_of_dna/

End of day 1

RNA
• Single stranded (can form structure)
• Uracil instead of Thymine
• mRNA: messenger RNA, for translation
• rRNA: subunit of ribosome
• tRNA: specific for one amino acid, selectively binds to codon via ribosome.
• microRNA: short RNA molecules (~22 nucleotides) which regulate gene function
http://www.pdb.org/pdb/static.do?p=education_discussion/molecule_of_the_month/pdb15_2.html

Transcription
• Transcription factors bind to promoter sites at the 5’ regulatory region.
• RNA polymerase binds to the complex.
• Working together, they open the DNA double helix.
• Genes can be on either strand, but direction of growing mRNA sequence is always 5’ → 3’
[Figure: Roger Kornberg, Nobel Prize Chemistry 2006. The chain shown in grey is RNA polymerase, with the portion that clamps on the DNA shaded in yellow. The DNA helix being unwound and transcribed by RNA polymerase is shown in green and blue, and the growing RNA strand is shown in red. http://med.stanford.edu/featured_topics/nobel/kornberg/release.html]

Post-transcriptional modifications (Eukaryotes)
• 5’ Cap
• Poly-A tail
• Splicing (removal of introns)
Research questions: Where are the introns? Where are the coding sequences? Where are the stop and start of transcription? Where are the binding sites for the transcription factors that control when transcription takes place?

Alternative Splicing
• Humans: >50% of genes have splice variants.
• Dscam gene in D. melanogaster: 95 alternative exons can express 38,016 different mRNAs through alternative splicing.

Translation
[Figure: Wikimedia Commons]

The Genetic Code

Proteins
• Participate in most (all?) cellular processes
• Made of 20 amino acids (+ occasionally a cofactor, such as metal ion, heme, ATP, etc.)
• Encoded in DNA
Alberts et al., “Essential cell biology: an introduction to the molecular biology of the cell”, Garland 1996

Functions of Proteins ...
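The path from DNA to protein covered in the Transcription, Translation and Genetic Code slides can be sketched in a few lines of Python. This is only an illustration, not course material: the function names `transcribe` and `translate` are made up here, the input is assumed to be the coding (sense) strand written 5’ → 3’ with only the letters A, C, G, T, and splicing, capping and the poly-A tail are ignored. The table is the standard genetic code.

```python
# Standard genetic code in the compact "TCAG" ordering: the n-th letter of
# AMINO_ACIDS belongs to the n-th codon when codons are enumerated as
# TTT, TTC, TTA, TTG, TCT, ... ("*" marks a stop codon).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def transcribe(coding_strand: str) -> str:
    """Return the mRNA for a coding strand given 5'->3' (T becomes U).

    Hypothetical helper for illustration; assumes the sense strand is given.
    """
    return coding_strand.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Translate from the first AUG up to (not including) the first stop codon.

    Returns an empty string if there is no AUG. Letters other than A, C, G, U
    raise a KeyError, since ambiguity codes are not handled in this sketch.
    """
    start = mrna.find("AUG")
    if start == -1:
        return ""
    protein = []
    for pos in range(start, len(mrna) - 2, 3):
        codon = mrna[pos:pos + 3].replace("U", "T")
        amino_acid = CODON_TABLE[codon]
        if amino_acid == "*":  # stop codon: translation ends here
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    mrna = transcribe("ATGGCCAAGTAA")
    print(mrna)             # AUGGCCAAGUAA
    print(translate(mrna))  # MAK
```

Building the table from one 64-letter string keeps the example short; a real pipeline would more likely use an established library and also deal with the reverse strand and alternative genetic codes.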
Amino Acids
• Only sidechains differ (red)
• Sidechains have diverse chemical properties (charge, size, pH, hydrophobicity, ...)
[Figure: Wikimedia Commons]

Peptide Bond
[Figure: G. Cannarozzi]

Proteins have a 3D structure
[Figure: Wikimedia Commons]

Biological sequences
How are they identified? Where are they stored?

Next Generation Sequencing

Proteomics
• Unidentified protein extracted from gel
• Split into fragments of 5-10 amino acids (e.g. . . . AEDLISLFMDDM . . .)
• Determine mass using MS (Mass Spectrometry)
• Determine amino acid sequence and compare with sequence database
• Protein identified
[Figure: Jiang Long, Science Creative Quarterly Image Bank]

Growth of sequence databases
[Plot: number of sequences (× 10^7, axis 0 to 2.0) by year, 2000 to 2012, for Protein Data Bank, UniProtKB/Swiss-Prot and UniProtKB/TrEMBL]

Getting Sequences
Ensembl ...
e.g. GenBank File

Evolution

Darwinian Evolution
• Start from an initial population
• Repeat:
  • reproduce and “mutate” randomly
  • natural selection: fittest individuals survive and have descendants → selects “good” mutations
  • sometimes: a “branching” occurs (e.g. speciation, duplication)

Not only the “good” characters survive
• Genetic drift (random sampling)
  • Population bottleneck
  • Founder effect
• Genetic hitchhiking (neutral or mildly deleterious alleles linked to positively selected gene)

Species Evolution
• Speciation: the evolutionary process by which new species arise
• Can occur from geographic isolation or barriers, new niche entered, animal husbandry
[Figure: Diane Dodd’s fruit fly experiment. http://evolution.berkeley.edu/evolibrary/article/_0_0/evo_45]

Genome Rearrangements
• e.g. Human vs. Dog [Figure: Krzywinski et al. Circos: an information aesthetic for comparative genomics. Genome Research (2009) vol. 19 (9) pp. 1639-45]
• Example: recombination among E. coli strains [Figure: Mau et al. Genome Biology 2006 7:R44]
• Whole genome duplications

Gene Evolution
• Point mutations [Figure: Kunkel, 2004, The Journal of Biological Chemistry]
• Point mutations [Figure labels: Purines, Pyrimidines]
• Insertion/deletion
• Lateral Gene Transfer [Figure: Wikipedia; http://www.scq.ubc.ca/attack-of-the-superbugs-antibiotic-resistance/]
• Recombination

Gene Evolution
• Mutation (base substitution)
• Insertion/Deletion
• Transposition (horizontal transfer)
• Recombination
• Gene loss or gene duplication
• Splicing pattern mutations

Evolutionary Distances
How can we quantify the amount of evolution between two subjects?
• Time since divergence
• Number of common traits.
• Edit distance (minimum # of elementary operations to transform one object into the other)
• ...
Desirable properties
• distance estimable without knowing history
• metric properties (e.g. triangle inequality)

Markovian Evolution
Markov Model: every site evolves independently, probability of mutation only depends on present state (no memory), probabilities of mutation are expressed by transition matrix.
M1 =
         A      C      G      T
    A  0.900  0.033  0.033  0.033
    C  0.033  0.900  0.033  0.033
    G  0.033  0.033  0.900  0.033
    T  0.033  0.033  0.033  0.900

After “one unit” of evolution, the probability that an A mutates into a C is given by the corresponding entry in the matrix:
p(A→C | d=1) = M1[A→C] = 0.033

[Figure: tree of life. http://gi.cebitec.uni-bielefeld.de/people/boecker/bilder/tree_of_life_new.gif]

[Figures: historical trees]
• Augustin Augier, Arbre Botanique (1801)
• Lamarck, Philosophie Zoologique, 1809
• Darwin, Notebook B, 1837
• Edward Hitchcock, Elementary Geology, 1840
• Haeckel, The Evolution of Man, 1879

rRNA was used by Woese (1987) to group early life forms into three kingdoms

[Figure: circular tree of life with species-code leaf labels. Legend: Eukaryota, Archaea, Bacteria; Planctomycetes, Fusobacteria, candidate division TG1, Dictyoglomi, Verrucomicrobia, Aquificae, Acidobacteria, Deinococcus-Thermus, Thermotogae, Chloroflexi, Chlamydiae, Chlorobi, Bacteroidetes, Spirochaetes, Tenericutes, Cyanobacteria, Clostridia, Bacilli, Lactobacillales, Actinobacteria, Proteobacteria]

[Figure: enlarged section of the tree with the marker “You are here.”]
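Returning to the Markovian Evolution slide: because the model has no memory, the probabilities after several units of evolution follow from repeated multiplication of the one-unit matrix. The sketch below assumes that d is a whole number of units and that the same matrix M1 applies in every unit; the helper names `mat_mul`, `transition_matrix` and `p` are made up for this illustration and are not taken from the course or from Darwin.

```python
# One-unit transition matrix M1 from the slide. Rows are the current base,
# columns the base after one unit of evolution, both in the order A, C, G, T.
# The slide's rounded entries make each row sum to 0.999 rather than 1.
BASES = "ACGT"
M1 = [
    [0.900, 0.033, 0.033, 0.033],
    [0.033, 0.900, 0.033, 0.033],
    [0.033, 0.033, 0.900, 0.033],
    [0.033, 0.033, 0.033, 0.900],
]


def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [
        [sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def transition_matrix(d):
    """Return M1 multiplied by itself d times (d = 0 gives the identity).

    Assumption of this sketch: d is a non-negative integer number of units.
    """
    n = len(M1)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(d):
        result = mat_mul(result, M1)
    return result


def p(src, dst, d):
    """Probability that base `src` is observed as `dst` after d units."""
    return transition_matrix(d)[BASES.index(src)][BASES.index(dst)]


if __name__ == "__main__":
    print(p("A", "C", 1))  # the slide's value, 0.033
    print(p("A", "C", 2))  # about 0.0616: all two-step paths from A to C
```

For d = 2 the entry sums the four possible paths A→X→C, which is why it is slightly less than twice the one-unit value: some sites mutate and then mutate again.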