#16 - Profiles & HMMs 9/28/07 Lecture 16 #16_Sept28

BCB 444/544, Lecture 16: Profiles & Hidden Markov Models (HMMs)
BCB 444/544 F07 ISU Dobbs

Required Reading (before lecture)
√ Mon & Wed Sept 24 & 26 - Lectures 14 & 15
  Review: Nucleus, Chromosomes, Genes, RNAs, Proteins
  Surprise lecture: No assigned reading
√ Fri Sept 28 - Lecture 16
  Profiles & Hidden Markov Models
  • Chp 6 - pp 79-84
  • Eddy: What is a hidden Markov Model? 2004 Nature Biotechnol 22:1315
    http://www.nature.com/nbt/journal/v22/n10/abs/nbt1004-1315.html
Thurs Sept 27 - Lab 4 & Mon Oct 1 - Lecture 17
  Protein Families, Domains, and Motifs
  • Chp 7 - pp 85-96

Assignments & Announcements
Fri Sept 26
• Exam 1 - Graded & returned in class - Really!
• HW#2 - Graded & returned in class - Really!
• Answer KEYs posted on website
• Grades posted on WebCT
• HomeWork #3 - posted online
  Due: Mon Oct 8 by 5 PM
• HW544Extra #1 - posted online
  Due: Task 1.1 - Mon Oct 1 by noon
       Task 1.2 & Task 2 - Mon Oct 8 by 5 PM

BCB 544 - Extra Required Reading
Mon Sept 24
BCB 544 Extra Required Reading Assignment:
• Pollard KS, Salama SR, Lambert N, Lambot MA, Coppens S, Pedersen JS, Katzman S, King B, Onodera C, Siepel A, Kern AD, Dehay C, Igel H, Ares M Jr, Vanderhaeghen P, Haussler D. (2006) An RNA gene expressed during cortical development evolved rapidly in humans. Nature 443: 167-172.
  http://www.nature.com/nature/journal/v443/n7108/abs/nature05113.html
  doi:10.1038/nature05113
• PDF available on class website - under Required Reading Link

Extra Credit Questions #2-6:
2. What is the size of the dystrophin gene (in kb)? Is it still the largest known human protein?
3. What is the largest protein encoded in human genome (i.e., longest single polypeptide chain)?
4. What is the largest protein complex for which a structure is known (for any organism)?
5. What is the most abundant protein (naturally occurring) on earth?
6. Which state in the US has the largest number of mobile genetic elements (transposons) in its living population?
• For 1 pt total (0.2 pt each): Answer all questions correctly
• For 2 pts total: Prepare a PPT slide with all correct answers
• Choose one option - you can't earn 3 pts!
• Submit to terrible@iastate.edu & to ddobbs@iastate.edu before 9 AM on Mon Oct 1
• Partial credit for incorrect answers? Only if they are truly amusing!

Extra Credit Questions #7 & #8:
Given that each male attending our BCB 444/544 class on a typical day is healthy (let's assume MH=7), and is generating sperm at a rate equal to the average normal rate for reproductively competent males (dSp/dT = ? per minute):
7a. How many rounds of meiosis will occur during our 50 minute class period?
7b. How many total sperm will be produced by our BCB 444/544 class during that class period?
8. How many rounds of meiosis will occur in the reproductively competent females in our class? (assume FH=5)
• For 0.6 pts total (0.2 pt each): Answer all questions correctly
• For 1 pt total: Prepare a PPT slide with all correct answers
• Choose one option - you can't earn more than 1 pt for this!
• Submit to terrible@iastate.edu & to ddobbs@iastate.edu before 9 AM on Mon Oct 1
• Partial credit for incorrect answers? Only if they are truly amusing!

Modeling Metabolic Pathways? see MetNet
http://metnet.vrac.iastate.edu/MetNet_overview.htm

Information flow in the cell?
• DNA -> RNA -> protein:
  • Replication = DNA to DNA - by DNA polymerase
  • Transcription = DNA to RNA - by RNA polymerase
  • Translation = RNA to protein - by ribosomes
• Exceptions/Complications:
  • DNA rearrangements: (by mobile genetic elements, recombination)
  • Reverse transcription: (RNA -> DNA, by reverse transcriptase)
  • Post-transcriptional modifications:
    • RNA splicing (removal of introns, by spliceosome)
    • RNA editing (addition/removal of nucleotides - usually U's)
  • Post-translational modifications:
    • Protein processing

Chromosomes & Genes
Genes in chromatin are not just "beads on a string": they are packaged in complex structures that we don't yet fully understand

Gene regulation
• Transcriptional regulation is primarily mediated by proteins that bind cis-acting elements or DNA sequence signals associated with genes:
  • DNA level (sequence-specific) regulatory signals
    • Promoters, terminators
    • Enhancers, repressors, silencers
  • Chromatin level (global) regulation
    • Heterochromatin (inactive)
    • e.g., X-inactivation in female mammals
• In eukaryotes, genes are often regulated at other levels:
  • Post-transcriptional (RNA transport, splicing, stability)
  • Post-translational (protein localization, folding, stability)

Promoter = DNA sequences required for initiation of transcription; contain TF binding sites, usually "close" to start site
• Transcription factors (TFs) - proteins that regulate transcription
• (In eukaryotes) RNA polymerase binds by recognizing a complex of TFs bound at promoter
• First, TFs must bind TF binding sites (TFBSs) within promoters; then RNA polymerase can bind and initiate transcription of RNA
[Figure labels: Promoter, Gene, Pre-mRNA; ~200 bp; RNAP = RNA polymerase II]

Enhancers & repressors = DNA sequences that regulate initiation of transcription; contain TF binding sites, can be far from start site!
• Enhancers "enhance" transcription
  • Enhancer binding proteins (TFs) interact with RNAP
• Repressors or silencers "repress" transcription
  • Repressor binding proteins (TFs) block transcription
[Figure labels: Enhancer, Repressor, Promoter, Gene; 10-50,000 bp]

Transcription factors (TFs) & their binding sites (TFBSs)
• Transcription factors - trans-acting factors - proteins that either activate or repress transcription, usually by binding DNA (via a DNA binding domain) & interacting with RNA polymerase (via a "trans-activating" domain) to affect rate of transcription initiation
• Promoters, enhancers, and repressors - all contain binding sites for transcription factors
  • Promoters - usually located close to start site; vs
  • Enhancers/Silencers/Repressor sequences - can be close or very far away: located upstream, downstream or even within the coding sequence of genes !!

"Non-coding" DNA? Many genes encode RNA that is not translated
4 Major Classes of RNA:
1. mRNA = messenger RNA
2. tRNA = transfer RNA
3. rRNA = ribosomal RNA
4. "Other" - Lots of these, diverse structures & functions:
   • "Natural" RNAs:
     • siRNA, miRNA, piRNA, snRNA, snoRNA, …
     • ribozymes
   • Artificial RNAs:
     • RNAi
     • antisense RNA
RNA Sequence, Structure & Function
• RNAs can have complex 3D structures (like proteins) & have many important functions in cellular processes
  • Ribosomes contain RNAs & proteins
  • Ribozymes are RNA enzymes capable of RNA cleavage
• RNA molecules are believed to be precursors to DNA-based life
  • Form complementary base pairs and replicate (like DNA)
  • Perform enzymatic functions (like proteins)

Protein Sequence, Structure & Function
• Amino acid sequence determines protein structure
  • But some proteins need help folding ("chaperones") in vivo
• Protein structure determines function
  • But level, timing & location of expression are important
  • Interactions with other proteins, DNA, RNA, & small ligands are also very important!!
• We don't know the "folding code" that determines how proteins fold!
• We don't know the "recognition code" that determines how proteins find and interact with correct partners!

A few Online Resources for: Cell & Molecular Biology
• NCBI Science Primer: What is a cell?
• NCBI Science Primer: What is a genome?
• BioTech's Life Science Dictionary
• NCBI bookshelf

Chp 6 - Profiles & Hidden Markov Models
SECTION II: SEQUENCE ALIGNMENT
Xiong: Chp 6 Profiles & HMMs
• √ Position Specific Scoring Matrices (PSSMs)
• √ PSI-BLAST
TODAY:
• Profiles
• Markov Models & Hidden Markov Models

Algorithms & Software for MSA? #3 (NOT covered on Exam1)
Heuristic Methods - continued
• Progressive alignments (Star Alignment, Clustal)
  • Match closely-related sequences first using a guide tree
  • Others: T-Coffee, DbClustal - see text: can be better than Clustal
• Partial order alignments (POA)
  • Doesn't rely on guide tree; adds sequences in order given
• PRALINE
  • Preprocesses input sequences by building profiles for each
• Iterative methods
  • Idea: optimal solution can be found by repeatedly modifying existing suboptimal solutions (e.g., PRRN)
• Block-based Alignment
  • Multiple re-building attempts to find best alignment (e.g., DIALIGN2 & Match-Box)
• Local alignments
  • Profiles, Blocks, Patterns - more on these soon!

Applications of MSA
• Building phylogenetic trees
• Finding conserved patterns:
  • Regulatory motifs (TF binding sites)
  • Splice sites
  • Protein domains
• Identifying and characterizing protein families
  • Find out which protein domains have same function
• Finding SNPs (single nucleotide polymorphisms) & mRNA isoforms (alternatively spliced forms)
• DNA fragment assembly (in genomic sequencing)

Application: Discover Conserved Patterns
Rationale: if sequences are homologous (derived from a common ancestor), they may be structurally/functionally equivalent
Is there a conserved cis-acting regulatory sequence?
TATA box = transcriptional promoter element

Patterns can also be represented as Sequence Logos
[Figure: Sequence Logo]

Sequence Logos: for Promoter elements (TF Binding Sites)
[Figure: Sequence Logo]
• Example was created from a set of TATA binding sites from TRANSFAC database.
  • http://www.gene-regulation.com/pub/databases.html
• Logo was created by WebLogo.
  • http://weblogo.berkeley.edu/logo.cgi
• Can see TATA-box quite easily.
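A sequence logo like the TATA-box example scales each column by its information content, which for DNA is log2(4) minus the Shannon entropy of the column. The following is a minimal sketch of that calculation only, not of how WebLogo is implemented. The aligned sites in `SITES` are hypothetical (they are not taken from TRANSFAC), the function names are made up for illustration, and the small-sample correction used by real logo tools is omitted.

```python
import math
from collections import Counter

# Hypothetical aligned, ungapped binding sites (illustrative only, NOT from TRANSFAC).
SITES = [
    "TATAAA",
    "TATAAA",
    "TATATA",
    "TATAAG",
]


def column_information(column, alphabet_size=4):
    """Information content (bits) of one column: log2(alphabet size) - entropy."""
    counts = Counter(column)
    total = len(column)
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
    return math.log2(alphabet_size) - entropy


def logo_heights(sites):
    """Per-position information content for equal-length, ungapped sites."""
    return [column_information(col) for col in zip(*sites)]


if __name__ == "__main__":
    for pos, bits in enumerate(logo_heights(SITES), start=1):
        print(f"position {pos}: {bits:.2f} bits")
```

Fully conserved columns reach the DNA maximum of 2 bits; columns with mixed bases score lower, which is why the conserved TATA core stands out in the logo.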
Sequence Logos - for RNA Splicing Sites
Human intron donor and acceptor sites
http://www-lmmb.ncifcrf.gov/~toms/gallery/SequenceLogoSculpture.gif

PSSM vs Profile
• Position-Specific Scoring Matrix: from ungapped MSA
• Profile: from MSA, including gaps

PSI-BLAST Pseudocode
  Convert query to PSSM (or a Profile)
  do {
    BLAST database with PSSM
    Stop if no new homologs are found
    Add new homologs to PSSM
  }
  Print current set of homologs

Note: Xiong textbook distinguishes between PSSMs (which have no gaps) & Profiles (can include gaps). Thus, based on these definitions, PSI-BLAST uses a Profile to iteratively add new homologs - other authors refer to pattern used by PSI-BLAST as a PSSM.

What is a PSSM? Position-Specific Scoring Matrix
(This slide was modified: I added more text to this slide)
Xiong: PSSM = table that contains probability information re: residues at each position of an ungapped MSA
Also, sometimes called: Position Weight Matrix (PWM)

A PSSM is:
• a representation of a motif
• an n by m matrix, where n is size of alphabet & m is length of sequence
• a matrix of scores in which entry at (i, j) is score assigned by PSSM to letter i at the jth position

Example (20 letter alphabet, 8 residue sequence):

      1   2   3   4   5   6   7   8
  A  -1  -2  -1   0  -1  -2   0  -2
  R   5   0   5  -2   1  -3  -2   0
  N   0   6   0   0   0  -3   0   1
  D  -2   1  -2  -1   0  -3  -1  -1
  C  -3  -3  -3  -3  -3  -2  -3  -3
  Q   1   0   1  -2   5  -3  -2   …
  E   0   0   0  -2   2  -3  -2   0
  G  -2   0  -2   6  -2  -3   6  -2
  H   0   1   0   …   …   …   …   …
  I  -3  -3  -3   …   …   …   …   …
  L  -2  -3  -2   …   …   …   …   …
  K   2   0   2  -2   1  -3  -2  -1
  M  -1  -2  -1  -3   0   0  -3  -2
  F  -3  -3  -3  -3  -3   6  -3  -1
  P  -2  -2  -2  -2  -1  -4  -2  -2
  S  -1   1  -1   0   0  -2   0  -1
  T  -1   0  -1  -2  -1  -2  -2  -2
  W  -3  -4  -3  -2  -2   1  -2  -2
  Y  -2  -2  -2  -3  -1   3  -3   2
  V  -3  -3  -3  -3  -2  -1  -3  -3

"K" at position 3 gets a score of 2

PSSM Entries = Log-Odds Scores
1. Estimate probability of observing each residue (probability of A given M, where M is PSSM model)
2. Divide by background probability of observing each residue (probability of A given B, where B is background model)
3. Take log so that can add (rather than multiply) scores

$$\log_2 \left( \frac{\Pr(A \mid M)}{\Pr(A \mid B)} \right)$$

• Numerator: observed frequency of residue "A" under the foreground model (i.e., the PSSM)
• Denominator: frequency of residue "A" under the background model
Note: Assumes positions are independent

Statistics References
• Statistical Inference (Hardcover), George Casella, Roger L. Berger
• StatWeb: A Guide to Basic Statistics for Biologists
  http://www.dur.ac.uk/stat.web/
• Basic Statistics: http://www.statsoft.com/textbook/stbasic.html
  (correlations, tests, frequencies, etc.)
• Electronic Statistics Textbook: StatSoft
  http://www.statsoft.com/textbook/stathome.html
  (from basic statistics to ANOVA to discriminant analysis, clustering, regression data mining, machine learning, etc.)

Sequence Profiles
Goal: to characterize sequences belonging to a class (structural or functional) & determine whether a query sequence also belongs to that class
• DNA or RNA sequences
• Protein sequences
Idea is to provide a "model" of the class against which we can test the new sequence
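The three-step recipe under "PSSM Entries = Log-Odds Scores" can be sketched in a few lines. This is an illustration under stated assumptions, not the procedure PSI-BLAST itself uses: the alignment is a hypothetical ungapped DNA motif (a 4-letter alphabet keeps the example short; the slide's matrix uses 20 amino acids), the background model is assumed uniform, and a pseudocount of 1 is an assumed choice to avoid taking the log of zero. All function and variable names are hypothetical.

```python
import math
from collections import Counter

# Hypothetical ungapped alignment of a short DNA motif (illustrative only).
ALIGNMENT = ["TATAAT", "TATAAT", "TACAAT", "TATGAT"]
ALPHABET = "ACGT"
BACKGROUND = {b: 0.25 for b in ALPHABET}  # assumed uniform background model B
PSEUDOCOUNT = 1.0                          # assumed; avoids log2(0)


def build_pssm(alignment, alphabet=ALPHABET, background=BACKGROUND,
               pseudocount=PSEUDOCOUNT):
    """Return one dict of log2-odds scores per alignment column."""
    n_seqs = len(alignment)
    pssm = []
    for column in zip(*alignment):
        counts = Counter(column)
        scores = {}
        for residue in alphabet:
            # Step 1: estimate Pr(residue | M) from the column counts.
            p_model = (counts[residue] + pseudocount) / (
                n_seqs + pseudocount * len(alphabet))
            # Steps 2 and 3: divide by Pr(residue | B), then take log2.
            scores[residue] = math.log2(p_model / background[residue])
        pssm.append(scores)
    return pssm


def score_sequence(pssm, seq):
    """Add per-position scores; adding is valid because positions are
    assumed independent and the entries are logs."""
    return sum(col[res] for col, res in zip(pssm, seq))


if __name__ == "__main__":
    pssm = build_pssm(ALIGNMENT)
    for candidate in ("TATAAT", "GGGGGG"):
        print(candidate, round(score_sequence(pssm, candidate), 2))
```

A sequence resembling the motif gets a positive total score, while an unrelated one scores below zero, which is the sense in which the matrix "models" the class.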
Protein Sequence Profiles & PSSMs
• Profile - a table that lists frequencies of each amino acid in each position of a protein sequence
• PSSM - a special type of Profile - with no gaps
• Frequencies are calculated from a MSA containing a domain of interest
• Can be used to generate a consensus sequence
• Derived scoring scheme can be used to align a new sequence to the profile
• Profile can be used in database searches (PSI-BLAST) to find new sequences that match the profile
• Profiles can also be used to compute MSAs heuristically (e.g., progressive alignment)

PSI-BLAST Limitations for generating patterns or "motifs"
• With PSSMs, can't have insertions and deletions
• With Profiles, essentially 'add extra columns' to PSSM to allow for gaps
• Better approach (for defining domains)?
  • Profile HMM: elaborated version of a profile
  • Intuitively, a profile that models gaps

Sequence Motifs (Patterns): Types of representations?
• √ Consensus Sequence
• √ Sequence Logo - "enhanced" consensus sequence, in which symbol size ∝ information content (decrease in entropy)
  • Information entropy??? "In information theory, the Shannon entropy or information entropy is a measure of the [decrease in] uncertainty associated with a random variable. Entropy quantifies information in a piece of data." - Wikipedia
  • Check out this interesting website: Tom Schneider, NCIF
    http://www.ccrnp.ncifcrf.gov/~toms/glossary.html#sequence_logo
• √ PSSM - Position-Specific Scoring Matrix
• √ Profiles
• HMMs - Hidden Markov Models

HMMs: an example
Nucleotide frequencies in human genome (%): A ≈ T ≈ 29.5-29.6, C ≈ G ≈ 20.4-20.5

CpG Islands
(Written CpG to distinguish from a C•G base pair)
• CpG dinucleotides are rarer than would be expected from independent probabilities of C and G (given the background frequencies in human genome)
• High CpG frequency is sometimes biologically significant; e.g., sometimes associated with promoter regions ("start sites" for genes)
• CpG island - a region where CpG dinucleotides are much more abundant than elsewhere

Hidden Markov Models - HMMs
Goal: Find most likely explanation for observed variables
Components:
• Observed variables
• Hidden variables
• Emitted symbols
• Emission probabilities
• Transition probabilities
• Graphical representation to illustrate relationships among these

The Occasionally Dishonest Casino
A casino uses a fair die most of the time, but occasionally switches to a "loaded" one
• Fair die: Prob(1) = Prob(2) = . . . = Prob(6) = 1/6
• Loaded die: Prob(1) = Prob(2) = . . . = Prob(5) = 1/10, Prob(6) = ½
• These are emission probabilities
Transition probabilities
• Prob(Fair → Loaded) = 0.01
• Prob(Loaded → Fair) = 0.2
• Transitions between states obey a Markov process
• (more on Markov chains/models/processes a bit later)

An HMM for Occasionally Dishonest Casino
[Figure: two-state diagram (Fair, Loaded) labeled with the transition and emission probabilities listed above]

The Occasionally Dishonest Casino
• Known:
  • Structure of the model
  • Transition probabilities
• Hidden: What casino actually did
  • FFFFFLLLLLLLFFFF...
• Observable: Series of die tosses
  • 3415256664666153...
• What we must infer:
  • When was a fair die used?
  • When was a loaded one used?
  • Answer is a sequence FFFFFFFLLLLLLFFF...

HMM: Making the Inference
• Model assigns a probability to each explanation for the observation, e.g.:
  P(326|FFL) = P(3|F) · P(F→F) · P(2|F) · P(F→L) · P(6|L)
             = 1/6 · 0.99 · 1/6 · 0.01 · ½ ≈ 0.00014
• Maximum Likelihood: Determine which explanation is most likely
  • Find path most likely to have produced observed sequence
• Total Probability: Determine probability that observed sequence was produced by HMM
  • Consider all paths that could have produced the observed sequence

HMM Notation
• $x$ = sequence of symbols emitted by model
  • $x_i$ = symbol emitted at time $i$
• $\pi$ = path, a sequence of states
  • the $i$-th state in $\pi$ is $\pi_i$
• $a_{kr}$ = probability of making a transition from state $k$ to state $r$:
  $$a_{kr} = \Pr(\pi_i = r \mid \pi_{i-1} = k)$$
• $e_k(b)$ = probability that symbol $b$ is emitted when in state $k$:
  $$e_k(b) = \Pr(x_i = b \mid \pi_i = k)$$

Calculating Different Paths to an Observed Sequence
Observed sequence: $x = x_1, x_2, x_3 = 6, 2, 6$

$\pi^{(1)} = FFF$:
$$\Pr(x, \pi^{(1)}) = a_{0F}\, e_F(6)\, a_{FF}\, e_F(2)\, a_{FF}\, e_F(6) = 0.5 \times \tfrac{1}{6} \times 0.99 \times \tfrac{1}{6} \times 0.99 \times \tfrac{1}{6} \approx 0.00227$$

$\pi^{(2)} = LLL$:
$$\Pr(x, \pi^{(2)}) = a_{0L}\, e_L(6)\, a_{LL}\, e_L(2)\, a_{LL}\, e_L(6) = 0.5 \times 0.5 \times 0.8 \times 0.1 \times 0.8 \times 0.5 = 0.008$$

$\pi^{(3)} = LFL$:
$$\Pr(x, \pi^{(3)}) = a_{0L}\, e_L(6)\, a_{LF}\, e_F(2)\, a_{FL}\, e_L(6) = 0.5 \times 0.5 \times 0.2 \times \tfrac{1}{6} \times 0.01 \times 0.5 \approx 0.0000417$$

Identifying the Most Probable Path
The most likely path $\pi^*$ satisfies:
$$\pi^* = \arg\max_{\pi} \Pr(x, \pi)$$
To find $\pi^*$, consider all possible ways the last "symbol" of $x$ could have been emitted.
Let $v_k(i)$ = probability of the path $\pi_1, \ldots, \pi_i$ most likely to emit $x_1, \ldots, x_i$ such that $\pi_i = k$. Then
$$v_k(i) = e_k(x_i) \max_r \left( v_r(i-1)\, a_{rk} \right)$$
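The path calculations and the recurrence for v_k(i) can be checked with a short script. This is a sketch of the standard Viterbi recurrence applied to the casino model, using the transition and emission probabilities from the slides and the start probabilities a_0F = a_0L = 0.5 that appear in the worked paths. The names `joint_probability` and `viterbi` are hypothetical, and plain probabilities are used for readability; for long roll sequences one would normally work with log probabilities to avoid underflow.

```python
STATES = ("F", "L")
START = {"F": 0.5, "L": 0.5}  # a_0k, as used in the worked paths
TRANS = {
    "F": {"F": 0.99, "L": 0.01},
    "L": {"F": 0.2, "L": 0.8},
}
EMIT = {
    "F": {r: 1 / 6 for r in range(1, 7)},
    "L": {**{r: 0.1 for r in range(1, 6)}, 6: 0.5},
}


def joint_probability(rolls, path):
    """Pr(x, pi): product of start, emission and transition terms."""
    prob = START[path[0]] * EMIT[path[0]][rolls[0]]
    for i in range(1, len(rolls)):
        prob *= TRANS[path[i - 1]][path[i]] * EMIT[path[i]][rolls[i]]
    return prob


def viterbi(rolls):
    """Most probable state path and its probability."""
    # v[k] = probability of the best path for the rolls so far that ends in k
    v = {k: START[k] * EMIT[k][rolls[0]] for k in STATES}
    backpointers = []
    for x in rolls[1:]:
        new_v, ptr = {}, {}
        for k in STATES:
            # v_k(i) = e_k(x_i) * max_r( v_r(i-1) * a_rk )
            best_prev = max(STATES, key=lambda r: v[r] * TRANS[r][k])
            ptr[k] = best_prev
            new_v[k] = EMIT[k][x] * v[best_prev] * TRANS[best_prev][k]
        backpointers.append(ptr)
        v = new_v
    # Trace back from the best final state.
    last = max(STATES, key=lambda k: v[k])
    path = [last]
    for ptr in reversed(backpointers):
        path.append(ptr[path[-1]])
    return "".join(reversed(path)), v[last]


if __name__ == "__main__":
    rolls = [6, 2, 6]
    for candidate in ("FFF", "LLL", "LFL"):
        print(candidate, joint_probability(rolls, candidate))
    print("most probable path:", viterbi(rolls))
```

For the rolls 6, 2, 6 the three joint probabilities reproduce the slide values, and the most probable path is LLL with probability 0.008.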
