OCCBIO 2006 Tutorial Fundamental Concepts of Bioinformatics Michael L. Raymer Computer Science, Biomedical Sciences Wright State University Bioinformatics Research Group Part I – Background Some basics of molecular biology, and some of the fundamental problems facing bioinformaticians The Central Dogma of molecular biology OCCBIO 2006 – Fundamental Bioinformatics 3 DNA structure and base pairing Polymer of: • Ribose sugar • Phosphate • Nitrogenous base Four bases • A, C, G, T Base pairing • A—T • G—C OCCBIO 2006 – Fundamental Bioinformatics 4 DNA is an information carrying molecule Arranged into 23 chromosome pairs in the nucleus of each cell Genes: coding information • < 5% of all DNA • Instructions for protein synthesis • Directions on when and where to synthesize proteins (regulatory regions) OCCBIO 2006 – Fundamental Bioinformatics 5 The Genetic Code Redundancy/robustness • • • • Synonymous codons Dual strands Diploidy Amino acid structure (?) OCCBIO 2006 – Fundamental Bioinformatics 6 Transcription DNA transcription RNA translation Protein Messenger RNA Carries instructions for a protein outside of the nucleus to the ribosome The ribosome is a protein complex that synthesizes new proteins OCCBIO 2006 – Fundamental Bioinformatics 8 Prokaryotic gene structure Yeast RNA Polymerase II Darst et al. in 1991 (Cell 66, pp 121-128) 5' UTR 3' UTR 5' 3' Coding Operator: regulation Stop Codon Promoter: RNA polymerase binding OCCBIO 2006 – Fundamental Bioinformatics 9 Regulation of transcription Energy budget Cellular differentiation & tissue function From W. Becker, L. Kleinsmith, and J. Hardin, The World of the Cell, Fourth Edition. Copyright © Addison Wesley Longman, Inc. OCCBIO 2006 – Fundamental Bioinformatics 10 Bioinformatics problems Shotgun sequencing Sequence alignment & multiple alignment • Database searches Phylogenetic tree induction Protein structure determination, modeling, and prediction Ligand screening and docking Many, many more OCCBIO 2006 – Fundamental Bioinformatics 11 Bioinformatics data DNA sequence information • Genome projects, etc. mRNA expression information • Microarrays, SAGE Metabolite concentrations • Mass Spec., NMR Spec., etc. Protein sequence information Protein structure information • X-Ray Crystallography OCCBIO 2006 – Fundamental Bioinformatics 12 Part II – Obtaining Sequences Sanger Sequencing Primer Walking Shotgun Approaches Fragment Assembly Algorithms Outline PCR Sanger Sequencing Primer Walking Shotgun Sequencing • Models • Algorithms • Analysis OCCBIO 2006 – Fundamental Bioinformatics 14 Polymerase chain reaction (PCR) OCCBIO 2006 – Fundamental Bioinformatics 15 Gel electrophoresis OCCBIO 2006 – Fundamental Bioinformatics 16 Sanger sequencing OCCBIO 2006 – Fundamental Bioinformatics 17 Limitations to sequencing You must have a primer of known sequence to initiate PCR Only about 1000nts can be sequenced in a single reaction The sequencing process is slow, so it is beneficial to do as much in parallel as possible • Primer hopping • Shotgun approach OCCBIO 2006 – Fundamental Bioinformatics 18 Shotgun Sequencing OCCBIO 2006 – Fundamental Bioinformatics 19 The Ideal Case Find maximal overlaps between fragments: ACCGT CGTGC TTAC TACCGT --ACCGT-----CGTGC TTAC-----TACCGT— TTACCGTGC Consensus sequence determined by vote OCCBIO 2006 – Fundamental Bioinformatics 20 Quality Metrics The coverage at position i of the target or consensus sequence is the number of fragments that overlap that position Target: No coverage Two contigs OCCBIO 2006 – Fundamental Bioinformatics 21 Quality Metrics Linkage – the degree of overlap between fragments Target: Perfect coverage, poor average linkage poor minimum linkage OCCBIO 2006 – Fundamental Bioinformatics 22 Real World Complications Base call errors Chimeric fragments, contamination (e.g. from the vector) --ACCGT-----CGTGC TTAC-----TGCCGT— TTACCGTGC Base Call Error --ACC-GT-----CAGTGC TTAC------TACC-GT— TTACC-GTGC Insertion Error OCCBIO 2006 – Fundamental Bioinformatics --ACCGT-----CGTGC TTAC-----TAC-GT— TTACCGTGC Deletion Error 23 Unknown Orientation A fragment can come from either strand CACGT ACGT ACTACG GTACT ACTGA CTGA OCCBIO 2006 – Fundamental Bioinformatics CACGT -ACGT --CGTAGT -----AGTAC --------ACTGA ---------CTGA 24 Repeats Direct repeats A X A X B C OCCBIO 2006 – Fundamental Bioinformatics X X C B X D X D 25 Repeats Direct repeats A X A X B D Y Y OCCBIO 2006 – Fundamental Bioinformatics C C X X D B Y E Y E 26 Repeats Inverted repeats X X X X OCCBIO 2006 – Fundamental Bioinformatics 27 Sequence Alignment Models Shortest common superstring • Input: A collection, F, of strings (fragments) • Output: A shortest possible string S such that for every f F, S is a superstring of f. Example: • F = {ACT, CTA, AGT} • S = ACTAGT OCCBIO 2006 – Fundamental Bioinformatics 28 Problems with the SCS model x x x x´ Directionality of fragments must be known No consideration of coverage Some simple consideration of linkage No consideration of base call errors OCCBIO 2006 – Fundamental Bioinformatics 29 Reconstruction Deals with errors and unknown orientation Definitions • f is an approximate substring of S at error level when ds(f, S) | f | Match = 0 • ds = substring edit distance: Reconstruction Mismatch = 1 Gap = 1 • Input: A collection, F, of strings, and a tolerance level, • Output: Shortest possible string, S, such that for every f F : minds f , S , ds f , S f OCCBIO 2006 – Fundamental Bioinformatics 30 Reconstruction Example Input: Output: F = {ATCAT, GTCG, CGAG, TACCA} = 0.25 ATGAT ------CGAC -CGAG ----TACCA ACGATACGAC ATCAT GTCG ds(CGAG, ACGATACGAC) = 1 = 0.25 4 So this output is OK for = 0.25 OCCBIO 2006 – Fundamental Bioinformatics 31 Gaps in Reconstruction Reconstruction allows gaps in fragments: AT-GA----ATCGATAGAC OCCBIO 2006 – Fundamental Bioinformatics ds = 1 32 Limitations of Reconstruction Models errors and unknown orientation Doesn’t handle repeats Doesn’t model coverage Only handles linkage in a very simple way Always produces a single contig OCCBIO 2006 – Fundamental Bioinformatics 33 Contigs Sometimes you just can’t put all of the fragments together into one contiguous sequence: No way to tell how much sequence is missing between them. OCCBIO 2006 – Fundamental Bioinformatics ? No way to tell the order of these two contigs. 34 Multicontig Definitions • A layout, L, is a multiple alignment of the fragments Columns numbered from 1 to |L | • Endpoints of a fragment: l(f) and r(f) • An overlap is a link is no other fragment completely covers the overlap Link OCCBIO 2006 – Fundamental Bioinformatics Not a link 35 Multicontig More definitions • The size of a link is the number of overlapping positions ACGTATAGCATGA GTA CATGATCA ACGTATAG GATCA A link of size 5 • The weakest link is the smallest link in the layout • A t-contig has a weakest link of size t • A collection, F, admits a t-contig if a t-contig can be constructed from the fragments in F OCCBIO 2006 – Fundamental Bioinformatics 36 Perfect Multicontig Input: F, and t Output: a minimum number of collections, Ci, such that every Ci admits a t-contig Let F = {GTAC, TAATG, TGTAA} t=3 t=1 --TAATG TGTAA-- TGTAA------TAATG--------GTAC GTAC OCCBIO 2006 – Fundamental Bioinformatics 37 Handling errors in Multicontig The image of a fragment is the portion of the consensus sequence, S, corresponding to the fragment in the layout S is an -consensus for a collection of fragments when the edit distance from each fragment, f, and its image is at most | f | TATAGCATCAT CGTC CATGATCA ACGGATAG GTCCA ACGTATAGCATGATCA OCCBIO 2006 – Fundamental Bioinformatics An -consensus for = 0.4 38 Definition of Multicontig Input: A collection, F , of strings, an integer t 0, and an error tolerance between 0 and 1 Output: A partition of F into the minimum number of collections Ci such that every Ci admits a t-contig with an -consensus OCCBIO 2006 – Fundamental Bioinformatics 39 Example of Multicontig Let = 0.4, t = 3 TATAGCATCAT ACGTC CATGATCAG ACGGATAG GTCCAG ACGTATAGCATGATCAG OCCBIO 2006 – Fundamental Bioinformatics 40 Algorithms Most of the algorithms to solve the fragment assembly problem are based on a graph model A graph, G, is a collection of edges, e, and vertices, v. • Directed or undirected • Weighted or unweighted We will discuss representations and other issues shortly… OCCBIO 2006 – Fundamental Bioinformatics A directed, unweighted graph 41 The Maximum Overlap Graph The text calls it an overlap multigraph Each directed edge, (u,v) is weighted with the length of the maximal overlap between a suffix of u and a prefix of v TACGA a 1 2 ACCC CTAAAG b 1 c 1 1 d OCCBIO 2006 – Fundamental Bioinformatics GACA 0-weight edges omitted! 42 Paths and Layouts The path dbc leads to the alignment: GACA----------ACCC----------CTAAAG TACGA a 1 2 ACCC CTAAAG c 1 1 d OCCBIO 2006 – Fundamental Bioinformatics b 1 GACA 43 Superstrings Every path that covers every node is a superstring Zero weight edges result in alignments like: GACA-----------GCCC------------TTAAAG Higher weights produce more overlap, and thus shorter strings The shortest common superstring is the highest weight path that covers every node OCCBIO 2006 – Fundamental Bioinformatics 44 Graph formulation of SCS Input: A weighted, directed graph Output: The highest-weight path that touches every node of the graph Does this problem sound familiar? OCCBIO 2006 – Fundamental Bioinformatics 45 The Greedy Algorithm Algorithm greedy Sort edges in decreasing weight order For each edge in this order If the edge does not form a cycle and the edge does not start or end at the same node as another edge in the set then add the edge to the current set End for End Algorithm Figure 4.16, page 125 OCCBIO 2006 – Fundamental Bioinformatics 46 Greedy Example 7 4 5 2 2 2 6 3 1 OCCBIO 2006 – Fundamental Bioinformatics 47 Greedy does not always find the best path GCC 2 ATGC 0 2 TGCAT 3 OCCBIO 2006 – Fundamental Bioinformatics 48 Tools for Shotgun Sequencing OCCBIO 2006 – Fundamental Bioinformatics 49 Common Difficulty Each of these problems is a method for modeling fragment assembly Each of these problems is provably intractable How? OCCBIO 2006 – Fundamental Bioinformatics 50 Embedding problems Suppose I told you that I had found a clever way to model the TSP as a shortest common superstring problem • • Paths between cities are represented as fragments The shortest path is the shortest common superstring of the fragments If this is true, then there are only two possibilities: 1. This problem is just as intractable as TSP 2. TSP is actually a tractable problem! OCCBIO 2006 – Fundamental Bioinformatics 51 NP-Complete Problems There is a collection of problems that computer scientists believe to be intractable • TSP is one of them Each of them has been modeled as one or more of the other NP-complete problems If you solve one, you solve them all A problem, p, is NP-hard if you can model one of these NP-complete problems as an instance of p OCCBIO 2006 – Fundamental Bioinformatics 52 NP-Completeness NP Subset sum 3-SAT OCCBIO 2006 – Fundamental Bioinformatics TSP P 53 P = NP? NP Subset sum 3-SAT NP P OCCBIO 2006 – Fundamental Bioinformatics 54 Part III – Sequence Alignments Needleman-Wunsch Smith-Waterman Dynamic Programming Why align sequences? The draft human genome is available Automated gene finding is possible Gene: AGTACGTATCGTATAGCGTAA • What does it do? One approach: Is there a similar gene in another species? • Align sequences with known genes • Find the gene with the “best” match OCCBIO 2006 – Fundamental Bioinformatics 56 Comparing two sequences Point mutations, easy: ACGTCTGATACGCCGTATAGTCTATCT ACGTCTGATTCGCCCTATCGTCTATCT Indels are difficult, must align sequences: ACGTCTGATACGCCGTATAGTCTATCT CTGATTCGCATCGTCTATCT ACGTCTGATACGCCGTATAGTCTATCT ----CTGATTCGC---ATCGTCTATCT OCCBIO 2006 – Fundamental Bioinformatics 57 Scoring a sequence alignment Match score: +1 Mismatch score: +0 Gap penalty: –1 ACGTCTGATACGCCGTATAGTCTATCT ||||| ||| || |||||||| ----CTGATTCGC---ATCGTCTATCT Matches: 18 × (+1) Mismatches: 2 × 0 Gaps: 7 × (– 1) OCCBIO 2006 – Fundamental Bioinformatics Score = +11 58 DNA Replication Prior to cell division, all the genetic instructions must be “copied” so that each new cell will have a complete set DNA polymerase is the enzyme that copies DNA • Synthesizes in the 5' to 3' direction OCCBIO 2006 – Fundamental Bioinformatics 59 Over time, genes accumulate mutations Environmental factors • Radiation • Oxidation Mistakes in replication or repair Deletions, Duplications Insertions Inversions Point mutations OCCBIO 2006 – Fundamental Bioinformatics 60 Deletions Codon deletion: ACG ATA GCG TAT GTA TAG CCG… • Effect depends on the protein, position, etc. • Almost always deleterious • Sometimes lethal Frame shift mutation: ACG ATA GCG TAT GTA TAG CCG… ACG ATA GCG ATG TAT AGC CG?… • Almost always lethal OCCBIO 2006 – Fundamental Bioinformatics 61 Indels Comparing two genes it is generally impossible to tell if an indel is an insertion in one gene, or a deletion in another, unless ancestry is known: ACGTCTGATACGCCGTATCGTCTATCT ACGTCTGAT---CCGTATCGTCTATCT OCCBIO 2006 – Fundamental Bioinformatics 62 Origination and length penalties We want to find alignments that are evolutionarily likely. Which of the following alignments seems more likely to you? ACGTCTGATACGCCGTATAGTCTATCT ACGTCTGAT-------ATAGTCTATCT ACGTCTGATACGCCGTATAGTCTATCT AC-T-TGA--CG-CGT-TA-TCTATCT We can achieve this by penalizing more for a new gap, than for extending an existing gap OCCBIO 2006 – Fundamental Bioinformatics 63 Scoring a sequence alignment (2) Match/mismatch score: +1/+0 Origination/length penalty: –2/–1 ACGTCTGATACGCCGTATAGTCTATCT ||||| ||| || |||||||| ----CTGATTCGC---ATCGTCTATCT Matches: 18 × (+1) Mismatches: 2 × 0 Origination: 2 × (–2) Length: 7 × (–1) OCCBIO 2006 – Fundamental Bioinformatics Score = +7 64 How can we find an optimal alignment? Finding the alignment is computationally hard: ACGTCTGATACGCCGTATAGTCTATCT CTGAT---TCG—CATCGTC--T-ATCT C(27,7) gap positions = ~888,000 possibilities It’s possible, as long as we don’t repeat our work! Dynamic programming: The Needleman & Wunsch algorithm OCCBIO 2006 – Fundamental Bioinformatics 65 What is the optimal alignment? ACTCG ACAGTAG Match: +1 Mismatch: 0 Gap: –1 OCCBIO 2006 – Fundamental Bioinformatics 66 Needleman-Wunsch: Step 1 Each sequence along one axis Mismatch penalty multiples in first row/column 0 in [1,1] (or [0,0] for the CS-minded) A C A G T A G 0 -1 -2 -3 -4 -5 -6 -7 A -1 1 OCCBIO 2006 – Fundamental Bioinformatics C -2 T -3 C -4 G -5 67 Needleman-Wunsch: Step 2 Vertical/Horiz. move: Score + (simple) gap penalty Diagonal move: Score + match/mismatch score Take the MAX of the three possibilities A C A G T A G 0 -1 -2 -3 -4 -5 -6 -7 A -1 1 OCCBIO 2006 – Fundamental Bioinformatics C -2 T -3 C -4 G -5 68 Needleman-Wunsch: Step 2 (cont’d) Fill out the rest of the table likewise… a a c a g t a g 0 -1 -2 -3 -4 -5 -6 -7 c -1 1 OCCBIO 2006 – Fundamental Bioinformatics t -2 0 c -3 -1 g -4 -2 -5 -3 69 Needleman-Wunsch: Step 2 (cont’d) Fill out the rest of the table likewise… a a c a g t a g 0 -1 -2 -3 -4 -5 -6 -7 c -1 1 0 -1 -2 -3 -4 -5 t -2 0 2 1 0 -1 -2 -3 c -3 -1 1 2 1 1 0 -1 g -4 -2 0 1 2 1 1 0 -5 -3 -1 0 2 2 1 2 The optimal alignment score is calculated in the lower-right corner OCCBIO 2006 – Fundamental Bioinformatics 70 But what is the optimal alignment To reconstruct the optimal alignment, we must determine of where the MAX at each step came from… a a c a g t a g 0 -1 -2 -3 -4 -5 -6 -7 OCCBIO 2006 – Fundamental Bioinformatics c -1 1 0 -1 -2 -3 -4 -5 t -2 0 2 1 0 -1 -2 -3 c -3 -1 1 2 1 1 0 -1 g -4 -2 0 1 2 1 1 0 -5 -3 -1 0 2 2 1 2 71 A path corresponds to an alignment = GAP in top sequence = GAP in left sequence = ALIGN both positions One path from the previous table: Corresponding alignment (start at the end): AC--TCG ACAGTAG OCCBIO 2006 – Fundamental Bioinformatics Score = +2 72 Practice Problem Find an optimal alignment for these two sequences: GCGGTT GCGT Match: +1 Mismatch: 0 Gap: –1 g c g t g 0 -1 -2 -3 -4 OCCBIO 2006 – Fundamental Bioinformatics c -1 g -2 g -3 t -4 t -5 -6 73 Practice Problem Find an optimal alignment for these two sequences: GCGGTT GCGT g c g g t t g c g t 0 -1 -2 -3 -4 -1 1 0 -1 -2 -2 0 2 1 0 -3 -1 1 3 2 GCGGTT GCG-TOCCBIO 2006 – Fundamental Bioinformatics -4 -2 0 2 3 -5 -3 -1 1 3 -6 -4 -2 0 2 Score = +2 74 Semi-global alignment Suppose we are aligning: GCG GGCG Which do you prefer? G-CG -GCG GGCG GGCG g g g c g 0 -1 -2 -3 -4 c -1 1 0 -1 -2 g -2 0 1 1 0 -3 -1 1 1 2 Semi-global alignment allows gaps at the ends for free. OCCBIO 2006 – Fundamental Bioinformatics 75 Semi-global alignment Semi-global alignment allows gaps at the ends for free. g g g c g 0 0 0 0 0 c 0 1 1 0 1 g 0 0 1 2 1 0 1 1 1 3 Initialize first row and column to all 0’s Allow free horizontal/vertical moves in last row and column OCCBIO 2006 – Fundamental Bioinformatics 76 Local alignment Global alignments – score the entire alignment Semi-global alignments – allow unscored gaps at the beginning or end of either sequence Local alignment – find the best matching subsequence CGATG AAATGGA This is achieved by allowing a 4th alternative at each position in the table: zero. OCCBIO 2006 – Fundamental Bioinformatics 77 Local alignment Mismatch = –1 this time c a a a t g g a 0 -1 -2 -3 -4 -5 -6 -7 g -1 0 0 0 0 0 0 0 a -2 0 0 0 0 1 1 0 t -3 0 1 1 0 0 0 2 g -4 0 0 0 2 1 0 1 -5 0 0 0 1 3 2 1 CGATG AAATGGA OCCBIO 2006 – Fundamental Bioinformatics 78 Optimal Substructure in Alignments Consider the alignment: ACGTCTGATACGCCGTATAGTCTATCT ||||| ||| || |||||||| ----CTGATTCGC---ATCGTCTATCT Is it true that the alignment in the boxed region must be optimal? OCCBIO 2006 – Fundamental Bioinformatics 79 A Greedy Strategy Consider this pair of sequences GAGC GAP = 1 CAGC Greedy Approach: G or G or C Leads to GAGC-----CAGC Match = +1 G Better: OCCBIO 2006 – Fundamental Bioinformatics Mismatch = 2 GACG CACG 80 Breaking apart the problem Suppose we are aligning: ACTCG ACAGTAG First position choices: A A +1 CTCG CAGTAG A - -1 CTCG ACAGTAG A -1 ACTCG CAGTAG OCCBIO 2006 – Fundamental Bioinformatics 81 A Recursive Approach to Alignment Choose the best alignment based on these three possibilities: align(seq1, seq2) { if (both sequences empty) {return 0;} if (one string empty) { return(gapscore * num chars in nonempty seq); else { score1 = score(firstchar(seq1),firstchar(seq2)) + align(tail(seq1), tail(seq2)); score2 = align(tail(seq1), seq2) + gapscore; score3 = align(seq1, tail(seq2) + gapscore; return(min(score1, score2, score3)); } } } OCCBIO 2006 – Fundamental Bioinformatics 82 Time Complexity of RecurseAlign What is the recurrence equation for the time needed by RecurseAlign? T (n) 3T (n 1) 3 3 n 3 3 3 … 3 3 9 3 27 n 3 OCCBIO 2006 – Fundamental Bioinformatics 83 RecurseAlign repeats its work A C G T A T C G C G T A T A G A T G C T C T C G OCCBIO 2006 – Fundamental Bioinformatics 84 Dynamic Programming Remember all the subproblem answers along the way: a a c a g t a g 0 -1 -2 -3 -4 -5 -6 -7 c -1 1 0 -1 -2 -3 -4 -5 t -2 0 2 1 0 -1 -2 -3 c -3 -1 1 2 1 1 0 -1 g -4 -2 0 1 2 1 1 0 -5 -3 -1 0 2 2 1 2 This is possible for any problem that exhibits optimal substructure OCCBIO 2006 – Fundamental Bioinformatics 85 Saving Space Note that we can throw away the previous rows of the table as we fill it in: a This row is based only on this one a c a g t a g OCCBIO 2006 – Fundamental Bioinformatics 0 -1 -2 -3 -4 -5 -6 -7 c -1 1 0 -1 -2 -3 -4 -5 t -2 0 2 1 0 -1 -2 -3 c -3 -1 1 2 1 1 0 -1 g -4 -2 0 1 2 1 1 0 -5 -3 -1 0 2 2 1 2 86 Saving Space (2) Each row of the table contains the scores for aligning a prefix of the left-hand sequence with all prefixes of the top sequence: a Scores for aligning aca with all prefixes of actcg a c a g t a g OCCBIO 2006 – Fundamental Bioinformatics 0 -1 -2 -3 -4 -5 -6 -7 c -1 1 0 -1 -2 -3 -4 -5 t -2 0 2 1 0 -1 -2 -3 c -3 -1 1 2 1 1 0 -1 g -4 -2 0 1 2 1 1 0 -5 -3 -1 0 2 2 1 2 87 Divide and Conquer By using a recursive approach, we can use only two rows of the matrix at a time: • Choose the middle character of the top sequence, i • Find out where i aligns to the bottom sequence Needs two vectors of scores • Recursively align the sequences before and after the fixed positions i ACGCTATGCTCATAG CGACGCTCATCG OCCBIO 2006 – Fundamental Bioinformatics 88 Finding where i lines up Find out where i aligns to the bottom sequence Needs two vectors of scores i s: ACGCTATGCTCATAG t: CGACGCTCATCG Assuming i lines up with a character: alignscore = align(ACGCTAT, prefix(t)) + score(G, char from t) + align(CTCATAG, suffix(t)) Which character is best? • Can quickly find out the score for aligning ACGCTAT with every prefix of t. OCCBIO 2006 – Fundamental Bioinformatics 89 Finding where i lines up But, i may also line up with a gap i s: ACGCTATGCTCATAG t: CGACGCTCATCG Assuming i lines up with a gap: alignscore = align(ACGCTAT, prefix(t)) + gapscore + align(CTCATAG, suffix(t)) OCCBIO 2006 – Fundamental Bioinformatics 90 Recursive Call Fix the best position for I Call align recursively for the prefixes and suffixes: i s: ACGCTATGCTCATAG t: CGACGCTCATCG OCCBIO 2006 – Fundamental Bioinformatics 91 Complexity Let len(s) = m and len(t) = n i Space: 2m s: ACGCTATGCTCATAG Time: t: CGACGCTCATCG • Each call to build similarity vector = m´n´ • First call + recursive call: mn mn m T m, n T , 2 2 2 j m j T ,n 2 j m n m j m( n j ) 2m n OCCBIO 2006 – Fundamental Bioinformatics 92 General Gap Penalties Suppose we are no longer using simple gap penalties: • Origination = −2 • Length = −1 Consider the last position of the alignment for ACGTA with ACG We can’t determine the score for G or G unless we know the previous positions! OCCBIO 2006 – Fundamental Bioinformatics 93 Scoring Blocks Now we must score a block at a time A A C --- A TATCCG A C T AC A C T ACC T ------ C G C -- A block is a pair of characters, or a maximal group of gaps paired with characters To score a position, we need to either start a new block or add it to a previous block OCCBIO 2006 – Fundamental Bioinformatics 94 The Algorithm Three tables • a – scores for alignments ending in char-char blocks • b – scores for alignments ending in gaps in the top sequence (s) • c – scores for alignments ending in gaps in the left sequence (t) Scores no longer depend on only three positions, because we can put any number of gaps into the last block OCCBIO 2006 – Fundamental Bioinformatics 95 The Recurrences ai 1, j 1 ai, j p i, j maxbi 1, j 1 ci 1, j 1 ai, j k wk , for1 k j bi, j max ci, j k wk , for1 k j ai k , j wk , for1 k i ci, j max bi k , j wk , for1 k i OCCBIO 2006 – Fundamental Bioinformatics 96 The Optimal Alignment The optimal alignment is found by looking at the maximal value in the lower right of all three arrays The algorithm runs in O(n3) time • Uses O(n2) space OCCBIO 2006 – Fundamental Bioinformatics 97 Part IV – Database Searches BLAST Search statistics Database Searching How can we find a particular short sequence in a database of sequences (or one HUGE sequence)? Problem is identical to local sequence alignment, but on a much larger scale. We must also have some idea of the significance of a database hit. • Databases always return some kind of hit, how much attention should be paid to the result? OCCBIO 2006 – Fundamental Bioinformatics 99 BLAST BLAST – Basic Local Alignment Search Tool An approximation of the Needleman & Wunsch algorithm Sacrifices some search sensitivity for speed OCCBIO 2006 – Fundamental Bioinformatics 100 Scoring Matrices DNA Proteins • Identity • Transition/Transversion A R N D C Q E G H I L K M F P S T W Y V A 2 -2 0 0 -2 0 0 1 -1 -1 -2 -1 -1 -4 1 1 1 -6 -3 0 • PAM • BLOSUM R N D C Q E G H I L K M F P S T W Y V 6 0 -1 -4 1 -1 -3 2 -2 -3 3 0 -4 0 0 -1 2 -4 -2 2 2 -4 1 1 0 2 -2 -3 1 -2 -4 -1 1 0 -4 -2 -2 4 -5 2 3 1 1 -2 -4 0 -3 -6 -1 0 0 -7 -4 -2 4 -5 -5 -3 -3 -2 -6 -5 -5 -4 -3 0 -2 -8 0 -2 4 2 -1 3 -2 -2 1 -1 -5 0 -1 -1 -5 -4 -2 4 0 1 -2 -3 0 -2 -5 -1 0 0 -7 -4 -2 5 -2 -3 -4 -2 -3 -5 -1 1 0 -7 -5 -1 6 -2 -2 0 -2 -2 0 -1 -1 -3 0 -2 5 2 -2 2 1 -2 -1 0 -5 -1 4 6 -3 4 2 -3 -3 -2 -2 -1 2 5 0 -5 -1 0 0 -3 -4 -2 6 0 -2 -2 -1 -4 -2 2 9 -5 -3 -2 0 7 -1 6 1 0 -6 -5 -1 3 1 -2 -3 -1 3 -5 -3 0 17 0 -6 10 2 4 OCCBIO 2006 – Fundamental Bioinformatics 101 The BLAST algorithm Break the search sequence into words • W = 3 for proteins, W = 12 for DNA MCGPFILGTYC CGP MCG, CGP, GPF, PFI, FIL, ILG, LGT, GTY, TYC MCG Include in the search all words that score above a certain value (T) for any search word MCG MCT MCN … CGP MGP CTP … OCCBIO 2006 – Fundamental Bioinformatics … This list can be computed in linear time 102 The Blast Algorithm (2) Search for the words in the database • Word locations can be precomputed and indexed • Searching for a short string in a long string Regular expression matching: FSA HSP (High Scoring Pair) = A match between a query word and the database Find a “hit”: Two non-overlapping HSP’s on a diagonal within distance A Extend the hit until the score falls below a threshold value, X OCCBIO 2006 – Fundamental Bioinformatics 103 OCCBIO 2006 – Fundamental Bioinformatics 104 Results from a BLAST search OCCBIO 2006 – Fundamental Bioinformatics 105 Search Significance Scores A search will always return some hits. How can we determine how “unusual” a particular alignment score is? • ORF’s Assumptions OCCBIO 2006 – Fundamental Bioinformatics 106 Assessing significance requires a distribution Frequency I have an apple of diameter 5”. Is that unusual? Diameter (cm) OCCBIO 2006 – Fundamental Bioinformatics 107 Is a match significant? • Scoring system • Database • Sequence to search for Length Composition Frequency Match scores for aligning my sequence with random sequences. Depends on: Match score How do we determine the random sequences? OCCBIO 2006 – Fundamental Bioinformatics 108 Generating “random” sequences Random uniform model: P(G) = P(A) = P(C) = P(T) = 0.25 • Doesn’t reflect nature Use sequences from a database • Might have genuine homology We want unrelated sequences Random shuffling of sequences • Preserves composition • Removes true homology OCCBIO 2006 – Fundamental Bioinformatics 109 What distribution do we expect to see? The mean of n random (i.i.d.) events tends towards a Gaussian distribution. • Example: Throw n dice and compute the mean. • Distribution of means: n=2 OCCBIO 2006 – Fundamental Bioinformatics n = 1000 110 The extreme value distribution This means that if we get the match scores for our sequence with n other sequences, the mean would follow a Gaussian distribution. The maximum of n (i.i.d.) random events tends towards the extreme value distribution as n grows large. OCCBIO 2006 – Fundamental Bioinformatics 111 Comparing distributions Extreme Value: Gaussian: f x 1 e x e e x OCCBIO 2006 – Fundamental Bioinformatics 1 f x e 2 x 2 2 2 112 Determining P-values If we can estimate and , then we can determine, for a given match score x, the probability that a random match with score x or greater would have occurred in the database. For sequence matches, a scoring system and database can be parameterized by two parameters, K and , related to and . • It would be nice if we could compare hit significance without regard to the database and scoring system used! OCCBIO 2006 – Fundamental Bioinformatics 113 Bit Scores The expected number of hits with score S is: E = Kmn e s • Where m and n are the sequence lengths Normalize the raw score using: S S ln K ln 2 Obtains a “bit score” S’, with a standard set of units. S The new E-value is: E mn 2 OCCBIO 2006 – Fundamental Bioinformatics 114 P values and E values Blast reports E-values E = 5, E = 10 versus P = 0.993 and P = 0.99995 When E < 0.01 P-values and E-values are nearly identical OCCBIO 2006 – Fundamental Bioinformatics 115 BLAST parameters Lowering the neighborhood word threshold (T) allows more distantly related sequences to be found, at the expense of increased noise in the results set. Raising the segment extension cutoff (X) returns longer extensions for each hit. Changing the minimum E-value changes the threshold for reporting a hit. OCCBIO 2006 – Fundamental Bioinformatics 116 Part V – Phylogenies Preliminaries Distance-based methods Parsimony Methods Phylogenetic Trees Hypothesis about the relationship between organisms Can be rooted or unrooted A B B C D E Time A E C D Root OCCBIO 2006 – Fundamental Bioinformatics 118 Tree proliferation 2n 3! N R n2 2 n 2! NU 2n 5! n 3 2 n 3! Species Number of Rooted Trees Number of Unrooted Trees 2 1 1 3 3 1 4 15 3 5 105 15 6 34,459,425 2,027,025 7 213,458,046,767,875 7,905,853,580,625 8 8,200,794,532,637,891,559,375 221,643,095,476,699,771,875 OCCBIO 2006 – Fundamental Bioinformatics 119 Molecular phylogenetics Specific genomic sequence variations (alleles) are much more reliable than phenotypic characteristics More than one gene should be considered OCCBIO 2006 – Fundamental Bioinformatics 120 An ongoing didactic Pheneticists tend to prefer distance based metrics, as they emphasize relationships among data sets, rather than the paths they have taken to arrive at their current states. Cladists are generally more interested in evolutionary pathways, and tend to prefer more evolutionarily based approaches such as maximum parsimony. OCCBIO 2006 – Fundamental Bioinformatics 121 Distance matrix methods Species B A 9 B – C – D – C 8 11 – – D 12 15 10 – E 15 18 13 5 OCCBIO 2006 – Fundamental Bioinformatics 122 UPGMA Similar to average-link clustering Merge the closest two groups • Replace the distances for the new, merged group with the average of the distance for the previous two groups Repeat until all species are joined OCCBIO 2006 – Fundamental Bioinformatics 123 UPGMA Step 1 Species B A 9 B – C – D – C 8 11 – – D 12 15 10 – E 15 18 13 5 Merge D & E D Species B A 9 B – C – C 8 11 – DE OCCBIO 2006 – Fundamental Bioinformatics E 13.5 16.5 11.5 124 UPGMA Step 2 Species B A 9 B – C – C 8 11 – DE Merge A & C A C D E 13.5 16.5 11.5 Species AC DE OCCBIO 2006 – Fundamental Bioinformatics B 10 AC – 16.5 12.5 125 UPGMA Steps 3 & 4 Species AC DE B 10 AC – Merge B & AC A C B D E 16.5 12.5 Merge ABC & DE A C B D E (((A,C)B)(D,E)) OCCBIO 2006 – Fundamental Bioinformatics 126 Parsimony approaches Belong to the broader class of character based methods of phylogenetics Emphasize simpler, and thus more likely evolutionary pathways I: GCGGACG II: GTGGACG A (C or T) C I (C or T) T II OCCBIO 2006 – Fundamental Bioinformatics C I T II 127 Informative and uninformative sites Position Seq 1 2 3 4 5 6 1 G G G G G G 2 G G G A G T 3 G G A T A G 4 G A T C A T For positions 5 & 6, it is possible to select more parsimonious trees – those that invoke less substitutions. OCCBIO 2006 – Fundamental Bioinformatics 128 Parsimony methods Enumerate all possible trees Note the number of substitutions events invoked by each possible tree • Can be weighted by transition/transversion probabilities, etc. Select the most parsimonious OCCBIO 2006 – Fundamental Bioinformatics 129 Branch and Bound methods Key problem – number of possible trees grows enormous as the number of species gets large Branch and bound – a technique that allows large numbers of candidate trees to be rapidly disregarded Requires a “good guess” at the cost of the best tree OCCBIO 2006 – Fundamental Bioinformatics 130 Branch and Bound for TSP Find a minimum cost round-trip path that visits each intermediate city exactly once NP-complete Greedy approach: A,G,E,F,B,D,C,A = 251 OCCBIO 2006 – Fundamental Bioinformatics A 93 20 D 82 B 59 31 57 12 G 17 C 46 82 35 E 68 F 15 131 Search all possible paths A 93 20 All paths D 82 B 59 31 57 12 G 17 C 46 82 35 AG (20) E 68 F AB (46) AC (93) 15 AGF (88) AGFB AGFE Best estimate: 251 OCCBIO 2006 – Fundamental Bioinformatics AGE (55) AGFC ACB (175) ACD ACF ACBE (257) 132 Parsimony – Branch and Bound Use the UPGMA tree for an initial best estimate of the minimum cost (most parsimonious) tree Use branch and bound to explore all feasible trees Replace the best estimate as better trees are found Choose the most parsimonious OCCBIO 2006 – Fundamental Bioinformatics 133 Parsimony example Position Seq 1 2 3 4 5 6 1 G G G G G G 2 G G G A G T 3 G G A T A G 4 G A T C A T Position 5: All trees (1,2) [0] (1,3) [1] (1,4) [1] Etc. OCCBIO 2006 – Fundamental Bioinformatics 134 Part VI – Aligning protein sequences PAM matrices BLOSUM matrices Sequence Alignments Revisited Scoring nucleotide sequence alignments was easier • Match score • Possibly different scores for transitions and transversions For amino acids, there are many more possible substitutions How do we score which substitutions are highly penalized and which are moderately penalized? • Physical and chemical characteristics • Empirical methods OCCBIO 2006 – Fundamental Bioinformatics 136 Scoring Mismatches Physical and chemical characteristics • V I – Both small, both hydrophobic, conservative substitution, small penalty • V K – Small large, hydrophobic charged, large penalty • Requires some expert knowledge and judgement Empirical methods • How often does the substitution V I occur in proteins that are known to be related? Scoring matrices: PAM and BLOSUM OCCBIO 2006 – Fundamental Bioinformatics 137 PAM matrices PAM = “Point Accepted Mutation” interested only in mutations that have been “accepted” by natural selection Starts with a multiple sequence alignment of very similar (>85% identity) proteins. Assumed to be homologous Compute the relative mutability, mi, of each amino acid • e.g. mA = how many times was alanine substituted with anything else? OCCBIO 2006 – Fundamental Bioinformatics 138 Relative mutability ACGCTAFKI GCGCTAFKI ACGCTAFKL GCGCTGFKI GCGCTLFKI ASGCTAFKL ACACTAFKL Across all pairs of sequences, there are 28 A X substitutions There are 10 ALA residues, so mA = 2.8 OCCBIO 2006 – Fundamental Bioinformatics 139 Pam Matrices, cont’d Construct a phylogenetic tree for the sequences in the alignment ACGCTAFKI AG GCGCTAFKI AG GCGCTGFKI FG,A = 3 IL ACGCTAFKL AL GCGCTLFKI CS ASGCTAFKL GA ACACTAFKL Calculate substitution frequences FX,X Substitutions may have occurred either way, so A G also counts as G A. OCCBIO 2006 – Fundamental Bioinformatics 140 Mutation Probabilities Mi,j represents the probability of J I substitution. M ij m j Fij ACGCTAFKI Fij AG GCGCTAFKI i AG GCGCTGFKI M G, A IL ACGCTAFKL AL GCGCTLFKI CS ASGCTAFKL GA ACACTAFKL 2.7 3 = 2.025 4 OCCBIO 2006 – Fundamental Bioinformatics 141 The PAM matrix The entries, Ri,j are the Mi,j values divided by the frequency of occurrence, fi, of residue i. fG = 10 GLY / 63 residues = 0.1587 RG,A = log(2.025/0.1587) = log(12.760) = 1.106 The log is taken so that we can add, rather than multiply entries to get compound probabilities. Log-odds matrix Diagonal entries are 1– mj OCCBIO 2006 – Fundamental Bioinformatics 142 Interpretation of PAM matrices PAM-1 – one substitution per 100 residues (a PAM unit of time) Multiply them together to get PAM-100, etc. “Suppose I start with a given polypeptide sequence M at time t, and observe the evolutionary changes in the sequence until 1% of all amino acid residues have undergone substitutions at time t+n. Let the new sequence at time t+n be called M’. What is the probability that a residue of type j in M will be replaced by i in M’?” OCCBIO 2006 – Fundamental Bioinformatics 143 PAM matrix considerations If Mi,j is very small, we may not have a large enough sample to estimate the real probability. When we multiply the PAM matrices many times, the error is magnified. PAM-1 – similar sequences, PAM-1000 very dissimilar sequences OCCBIO 2006 – Fundamental Bioinformatics 144 BLOSUM matrix Starts by clustering proteins by similarity Avoids problems with small probabilities by using averages over clusters Numbering works opposite • BLOSUM-62 is appropriate for sequences of about 62% identity, while BLOSUM-80 is appropriate for more similar sequences. OCCBIO 2006 – Fundamental Bioinformatics 145 Part VII – Protein Structure Preliminaries Lattice Models Protein Folding Algorithms Illustrations from: C Branden and J Tooze, Introduction to Protein Structure, 2 nd ed. Garland Pub. ISBN 0815302703 The many functions of proteins Mechanoenzymes: myosin, actin Rhodopsin: allows vision Globins: transport oxygen Antibodies: immune system Enzymes: pepsin, renin, carboxypeptidase A Receptors: transmit messages through membranes Vitelogenin: molecular velcro • And hundreds of thousands more… OCCBIO 2006 – Fundamental Bioinformatics 147 Proteins are chains of amino acids Polymer – a molecule composed of repeating units OCCBIO 2006 – Fundamental Bioinformatics 148 Amino acid composition Basic Amino Acid Structure: • The side chain, R, varies for each of the 20 amino acids Side chain R H N C C H Amino group OCCBIO 2006 – Fundamental Bioinformatics O H OH Carboxyl group 149 The Peptide Bond Dehydration synthesis Repeating backbone: N–C –C –N–C –C O O • Convention – start at amino terminus and proceed to carboxy terminus OCCBIO 2006 – Fundamental Bioinformatics 150 Peptidyl polymers A few amino acids in a chain are called a polypeptide. A protein is usually composed of 50 to 400+ amino acids. Since part of the amino acid is lost during dehydration synthesis, we call the units of a protein amino acid residues. carbonyl carbon OCCBIO 2006 – Fundamental Bioinformatics amide nitrogen 151 Side chain properties Recall that the electronegativity of carbon is at about the middle of the scale for light elements • Carbon does not make hydrogen bonds with water easily – hydrophobic • O and N are generally more likely than C to h-bond to water – hydrophilic We group the amino acids into three general groups: • Hydrophobic • Charged (positive/basic & negative/acidic) • Polar OCCBIO 2006 – Fundamental Bioinformatics 152 The Hydrophobic Amino Acids Proline severely limits allowable conformations! OCCBIO 2006 – Fundamental Bioinformatics 153 The Charged Amino Acids OCCBIO 2006 – Fundamental Bioinformatics 154 The Polar Amino Acids OCCBIO 2006 – Fundamental Bioinformatics 155 More Polar Amino Acids And then there’s… OCCBIO 2006 – Fundamental Bioinformatics 156 Planarity of the peptide bond Psi () – the angle of rotation about the C-C bond. Phi () – the angle of rotation about the N-C bond. The planar bond angles and bond lengths are fixed. OCCBIO 2006 – Fundamental Bioinformatics 157 Phi and psi = = 180° is extended conformation : C to N–H : C=O to C OCCBIO 2006 – Fundamental Bioinformatics C=O C N–H 158 The Ramachandran Plot Observed (non-glycine) Calculated Observed (glycine) G. N. Ramachandran – first calculations of sterically allowed regions of phi and psi Note the structural importance of glycine OCCBIO 2006 – Fundamental Bioinformatics 159 Primary & Secondary Structure Primary structure = the linear sequence of amino acids comprising a protein: AGVGTVPMTAYGNDIQYYGQVT… Secondary structure • Regular patterns of hydrogen bonding in proteins result in two patterns that emerge in nearly every protein structure known: the -helix and the -sheet • The location of direction of these periodic, repeating structures is known as the secondary structure of the protein OCCBIO 2006 – Fundamental Bioinformatics 160 The alpha helix 60° OCCBIO 2006 – Fundamental Bioinformatics 161 Properties of the alpha helix 60° Hydrogen bonds between C=O of residue n, and NH of residue n+4 3.6 residues/turn 1.5 Å/residue rise 100°/residue turn OCCBIO 2006 – Fundamental Bioinformatics 162 Properties of -helices 4 – 40+ residues in length Often amphipathic or “dual-natured” • Half hydrophobic and half hydrophilic • Mostly when surface-exposed If we examine many -helices, we find trends… • Helix formers: Ala, Glu, Leu, Met • Helix breakers: Pro, Gly, Tyr, Ser OCCBIO 2006 – Fundamental Bioinformatics 163 The beta strand (& sheet) 135° +135° OCCBIO 2006 – Fundamental Bioinformatics 164 Properties of beta sheets Formed of stretches of 5-10 residues in extended conformation Pleated – each C a bit above or below the previous Parallel/aniparallel, contiguous/non-contiguous OCCBIO 2006 – Fundamental Bioinformatics 165 Parallel and anti-parallel -sheets Anti-parallel is slightly energetically favored Anti-parallel OCCBIO 2006 – Fundamental Bioinformatics Parallel 166 Turns and Loops Secondary structure elements are connected by regions of turns and loops Turns – short regions of non-, non- conformation Loops – larger stretches with no secondary structure. Often disordered. • “Random coil” • Sequences vary much more than secondary structure regions OCCBIO 2006 – Fundamental Bioinformatics 167 Levels of Protein Structure Secondary structure elements combine to form tertiary structure Quaternary structure occurs in multienzyme complexes • Many proteins are active only as homodimers, homotetramers, etc. Disulfide Bonds Two cyteines in close proximity will form a covalent bond Disulfide bond, disulfide bridge, or dicysteine bond. Significantly stabilizes tertiary structure. OCCBIO 2006 – Fundamental Bioinformatics 169 Protein Structure Examples OCCBIO 2006 – Fundamental Bioinformatics 170 Determining Protein Structure There are O(100,000) distinct proteins in the human proteome. 3D structures have been determined for 14,000 proteins, from all organisms • Includes duplicates with different ligands bound, etc. Coordinates are determined by X-ray crystallography OCCBIO 2006 – Fundamental Bioinformatics 171 X-Ray Crystallography ~0.5mm • The crystal is a mosaic of millions of copies of the protein. • As much as 70% is solvent (water)! • May take months (and a “green” thumb) to grow. OCCBIO 2006 – Fundamental Bioinformatics 172 X-Ray diffraction Image is averaged over: • Space (many copies) • Time (of the diffraction experiment) OCCBIO 2006 – Fundamental Bioinformatics 173 Electron Density Maps Resolution is dependent on the quality/regularity of the crystal R-factor is a measure of “leftover” electron density Solvent fitting Refinement OCCBIO 2006 – Fundamental Bioinformatics 174 The Protein Data Bank http://www.rcsb.org/pdb/ ATOM ATOM ATOM ATOM ATOM ATOM ATOM ATOM ATOM ATOM ATOM ATOM ATOM ATOM ATOM ATOM 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 N CA C O CB N CA C O N CA C O CB CG1 CG2 ALA ALA ALA ALA ALA GLY GLY GLY GLY VAL VAL VAL VAL VAL VAL VAL E E E E E E E E E E E E E E E E 1 1 1 1 1 2 2 2 2 3 3 3 3 3 3 3 OCCBIO 2006 – Fundamental Bioinformatics 22.382 22.957 23.572 23.948 23.932 23.656 24.216 25.653 26.258 26.213 27.594 28.569 28.429 27.834 29.259 26.811 47.782 47.648 46.251 45.688 48.787 45.723 44.393 44.308 45.296 43.110 42.879 43.613 43.444 41.363 41.013 40.649 112.975 111.613 111.545 112.603 111.380 110.336 110.087 110.579 110.994 110.521 110.975 110.055 108.822 110.979 111.404 111.850 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 24.09 22.40 21.32 21.54 22.79 19.17 17.35 16.49 15.35 16.21 16.02 15.69 16.43 16.66 17.35 17.03 3APR 3APR 3APR 3APR 3APR 3APR 3APR 3APR 3APR 3APR 3APR 3APR 3APR 3APR 3APR 3APR 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 175 Views of a protein Wireframe OCCBIO 2006 – Fundamental Bioinformatics Ball and stick 176 Views of a protein Spacefill Cartoon CPK colors Carbon = green, black, or grey Nitrogen = blue Oxygen = red Sulfur = yellow Hydrogen = white OCCBIO 2006 – Fundamental Bioinformatics 177 The Protein Folding Problem Central question of molecular biology: “Given a particular sequence of amino acid residues (primary structure), what will the tertiary/quaternary structure of the resulting protein be?” Input: AAVIKYGCAL… Output: 11, 22… = backbone conformation: (no side chains yet) OCCBIO 2006 – Fundamental Bioinformatics 178 Forces driving protein folding It is believed that hydrophobic collapse is a key driving force for protein folding • Hydrophobic core • Polar surface interacting with solvent Minimum volume (no cavities) Disulfide bond formation stabilizes Hydrogen bonds Polar and electrostatic interactions OCCBIO 2006 – Fundamental Bioinformatics 179 Folding help Proteins are, in fact, only marginally stable • Native state is typically only 5 to 10 kcal/mole more stable than the unfolded form Many proteins help in folding • Protein disulfide isomerase – catalyzes shuffling of disulfide bonds • Chaperones – break up aggregates and (in theory) unfold misfolded proteins OCCBIO 2006 – Fundamental Bioinformatics 180 The Hydrophobic Core Hemoglobin A is the protein in red blood cells (erythrocytes) responsible for binding oxygen. The mutation E6V in the chain places a hydrophobic Val on the surface of hemoglobin The resulting “sticky patch” causes hemoglobin S to agglutinate (stick together) and form fibers which deform the red blood cell and do not carry oxygen efficiently Sickle cell anemia was the first identified molecular disease OCCBIO 2006 – Fundamental Bioinformatics 181 Sickle Cell Anemia Sequestering hydrophobic residues in the protein core protects proteins from hydrophobic agglutination. OCCBIO 2006 – Fundamental Bioinformatics 182 Computational Problems in Protein Folding Two key questions: • Evaluation – how can we tell a correctly-folded protein from an incorrectly folded protein? H-bonds, electrostatics, hydrophobic effect, etc. Derive a function, see how well it does on “real” proteins • Optimization – once we get an evaluation function, can we optimize it? Simulated annealing/monte carlo EC Heuristics We’ll talk more about these methods later… OCCBIO 2006 – Fundamental Bioinformatics 183 Fold Optimization Simple lattice models (HPmodels) • Two types of residues: hydrophobic and polar • 2-D or 3-D lattice • The only force is hydrophobic collapse • Score = number of HH contacts OCCBIO 2006 – Fundamental Bioinformatics 184 Scoring Lattice Models H/P model scoring: count noncovalent hydrophobic interactions. Sometimes: • Penalize for buried polar or surface hydrophobic residues OCCBIO 2006 – Fundamental Bioinformatics 185 What can we do with lattice models? For smaller polypeptides, exhaustive search can be used • Looking at the “best” fold, even in such a simple model, can teach us interesting things about the protein folding process For larger chains, other optimization and search methods must be used • Greedy, branch and bound • Evolutionary computing, simulated annealing • Graph theoretical methods OCCBIO 2006 – Fundamental Bioinformatics 186 Learning from Lattice Models The “hydrophobic zipper” effect: Ken Dill ~ 1997 OCCBIO 2006 – Fundamental Bioinformatics 187 Representing a lattice model Absolute directions • UURRDLDRRU Relative directions • LFRFRRLLFFL • Advantage, we can’t have UD or RL in absolute • Only three directions: LRF What about bumps? LFRRR • Bad score • Use a better representation OCCBIO 2006 – Fundamental Bioinformatics 188 Preference-order representation Each position has two “preferences” • If it can’t have either of the two, it will take the “least favorite” path if possible Example: {LR},{FL},{RL}, {FR},{RL},{RL},{FR},{RF} Can still cause bumps: {LF},{FR},{RL},{FL}, {RL},{FL},{RF},{RL}, {FL} OCCBIO 2006 – Fundamental Bioinformatics 189 More realistic models Higher resolution lattices (45° lattice, etc.) Off-lattice models • Local moves • Optimization/search methods and / representations Greedy search Branch and bound EC, Monte Carlo, simulated annealing, etc. OCCBIO 2006 – Fundamental Bioinformatics 190 The Other Half of the Picture Now that we have a more realistic off-lattice model, we need a better energy function to evaluate a conformation (fold). Theoretical force field: • G = Gvan der Waals + Gh-bonds + Gsolvent + Gcoulomb Empirical force fields • Start with a database • Look at neighboring residues – similar to known protein folds? OCCBIO 2006 – Fundamental Bioinformatics 191 Threading: Fold recognition Given: • Sequence: IVACIVSTEYDVMKAAR… • A database of molecular coordinates Map the sequence onto each fold Evaluate • Objective 1: improve scoring function • Objective 2: folding OCCBIO 2006 – Fundamental Bioinformatics 192 Secondary Structure Prediction AGVGTVPMTAYGNDIQYYGQVT… A-VGIVPM-AYGQDIQY-GQVT… AG-GIIP--AYGNELQ--GQVT… AGVCTVPMTA---ELQYYG--T… AGVGTVPMTAYGNDIQYYGQVT… ----hhhHHHHHHhhh--eeEE… OCCBIO 2006 – Fundamental Bioinformatics 193 Secondary Structure Prediction Easier than folding • Current algorithms can prediction secondary structure with 70-80% accuracy Chou, P.Y. & Fasman, G.D. (1974). Biochemistry, 13, 211-222. • Based on frequencies of occurrence of residues in helices and sheets PhD – Neural network based • • Uses a multiple sequence alignment Rost & Sander, Proteins, 1994 , 19, 55-72 OCCBIO 2006 – Fundamental Bioinformatics 194 Chou-Fasman Parameters Name Alanine Arginine Aspartic Acid Asparagine Cysteine Glutamic Acid Glutamine Glycine Histidine Isoleucine Leucine Lysine Methionine Phenylalanine Proline Serine Threonine Tryptophan Tyrosine Valine Abbrv A R D N C E Q G H I L K M F P S T W Y V P(a) 142 98 101 67 70 151 111 57 100 108 121 114 145 113 57 77 83 108 69 106 OCCBIO 2006 – Fundamental Bioinformatics P(b) P(turn) 83 66 93 95 54 146 89 156 119 119 37 74 110 98 75 156 87 95 160 47 130 59 74 101 105 60 138 60 55 152 75 143 119 96 137 96 147 114 170 50 f(i) 0.06 0.07 0.147 0.161 0.149 0.056 0.074 0.102 0.14 0.043 0.061 0.055 0.068 0.059 0.102 0.12 0.086 0.077 0.082 0.062 f(i+1) 0.076 0.106 0.11 0.083 0.05 0.06 0.098 0.085 0.047 0.034 0.025 0.115 0.082 0.041 0.301 0.139 0.108 0.013 0.065 0.048 f(i+2) 0.035 0.099 0.179 0.191 0.117 0.077 0.037 0.19 0.093 0.013 0.036 0.072 0.014 0.065 0.034 0.125 0.065 0.064 0.114 0.028 f(i+3) 0.058 0.085 0.081 0.091 0.128 0.064 0.098 0.152 0.054 0.056 0.07 0.095 0.055 0.065 0.068 0.106 0.079 0.167 0.125 0.053 195 Chou-Fasman Algorithm Identify -helices • 4 out of 6 contiguous amino acids that have P(a) > 100 • Extend the region until 4 amino acids with P(a) < 100 found • Compute P(a) and P(b); If the region is >5 residues and P(a) > P(b) identify as a helix Repeat for -sheets [use P(b)] If an and a region overlap, the overlapping region is predicted according to P(a) and P(b) OCCBIO 2006 – Fundamental Bioinformatics 196 Chou-Fasman, cont’d Identify hairpin turns: • P(t) = f(i) of the residue f(i+1) of the next residue f(i+2) of the following residue f(i+3) of the residue at position (i+3) • Predict a hairpin turn starting at positions where: P(t) > 0.000075 The average P(turn) for the four residues > 100 P(a) < P(turn) > P(b) for the four residues Accuracy 60-65% OCCBIO 2006 – Fundamental Bioinformatics 197 Chou-Fasman Example CAENKLDHVRGPTCILFMTWYNDGP CAENKL – Potential helix (!C and !N) Residues with P(a) < 100: RNCGPSTY • Extend: When we reach RGPT, we must stop • CAENKLDHV: P(a) = 972, P(b) = 843 • Declare alpha helix Identifying a hairpin turn • VRGP: P(t) = 0.000085 • Average P(turn) = 113.25 Avg P(a) = 79.5, Avg P(b) = 98.25 OCCBIO 2006 – Fundamental Bioinformatics 198 Other topics? Tools and languages Forensic DNA Microarray analysis OCCBIO 2006 – Fundamental Bioinformatics 199