Introduction to Bioinformatics Sequence Alignments Sequence Alignments Cornerstone of bioinformatics What is a sequence? • Nucleotide sequence • Amino acid sequence Pairwise and multiple sequence alignments • We will focus on pairwise alignments What alignments can help • Determine function of a newly discovered gene sequence • Determine evolutionary relationships among genes, proteins, and species • Predicting structure and function of protein Intro to Bioinformatics – Sequence Alignment 2 Acknowledgement: This notes is adapted from lecture notes of both Wright State University’s Bioinformatics Program and Professor Laurie Heyer of Davidson College with permission. DNA Replication Prior to cell division, all the genetic instructions must be “copied” so that each new cell will have a complete set DNA polymerase is the enzyme that copies DNA • Reads the old strand in the 3´ to 5´ direction Intro to Bioinformatics – Sequence Alignment 3 Over time, genes accumulate mutations Environmental factors • Radiation • Oxidation Mistakes in replication or repair Deletions, Duplications Insertions, Inversions Translocations Point mutations Intro to Bioinformatics – Sequence Alignment 4 Deletions Codon deletion: ACG ATA GCG TAT GTA TAG CCG… • Effect depends on the protein, position, etc. • Almost always deleterious • Sometimes lethal Frame shift mutation: ACG ATA GCG TAT GTA TAG CCG… ACG ATA GCG ATG TAT AGC CG?… • Almost always lethal Intro to Bioinformatics – Sequence Alignment 5 Indels Comparing two genes it is generally impossible to tell if an indel is an insertion in one gene, or a deletion in another, unless ancestry is known: ACGTCTGATACGCCGTATCGTCTATCT ACGTCTGAT---CCGTATCGTCTATCT Intro to Bioinformatics – Sequence Alignment 6 The Genetic Code Substitutions are mutations accepted by natural selection. Synonymous: CGC CGA Non-synonymous: GAU GAA Intro to Bioinformatics – Sequence Alignment 7 Comparing Two Sequences Point mutations, easy: ACGTCTGATACGCCGTATAGTCTATCT ACGTCTGATTCGCCCTATCGTCTATCT Indels are difficult, must align sequences: ACGTCTGATACGCCGTATAGTCTATCT CTGATTCGCATCGTCTATCT ACGTCTGATACGCCGTATAGTCTATCT ----CTGATTCGC---ATCGTCTATCT Intro to Bioinformatics – Sequence Alignment 8 Why Align Sequences? The draft human genome is available Automated gene finding is possible Gene: AGTACGTATCGTATAGCGTAA • What does it do? One approach: Is there a similar gene in another species? • Align sequences with known genes • Find the gene with the “best” match Intro to Bioinformatics – Sequence Alignment 9 Gaps or No Gaps Examples Intro to Bioinformatics – Sequence Alignment 10 Scoring a Sequence Alignment Given Match score: +1 Mismatch score: +0 Gap penalty: –1 Sequences ACGTCTGATACGCCGTATAGTCTATCT ||||| ||| || |||||||| ----CTGATTCGC---ATCGTCTATCT Matches: 18 × (+1) Mismatches: 2 × 0 Score Gaps: 7 × (– 1) Intro to Bioinformatics – Sequence Alignment = +11 11 Origination and Length Penalties Note that a bioinformatics computational model or algorithm must be “biologically meaningful” or even “biologically significant” We want to find alignments that are evolutionarily likely. Which of the following alignments seems more likely to you? ACGTCTGATACGCCGTATAGTCTATCT ACGTCTGAT-------ATAGTCTATCT ACGTCTGATACGCCGTATAGTCTATCT AC-T-TGA--CG-CGT-TA-TCTATCT We can achieve this by penalizing more for a new gap, than for extending an existing gap Intro to Bioinformatics – Sequence Alignment 12 Scoring a Sequence Alignment (2) Given Match/mismatch score: +1/+0 Gap origination/length penalty: –2/–1 Sequences ACGTCTGATACGCCGTATAGTCTATCT ||||| ||| || |||||||| ----CTGATTCGC---ATCGTCTATCT Matches: 18 × (+1) Mismatches: 2 × 0 Score Origination: 2 × (–2) Length: 7 × (–1) = +7 Caution: Sometime “gap extension” used instead of “gap length” … Intro to Bioinformatics – Sequence Alignment 13 How can we find an optimal alignment? Finding the alignment is computationally hard: ACGTCTGATACGCCGTATAGTCTATCT CTGAT---TCG-CATCGTC--T-ATCT C(27,7) gap positions = ~888,000 possibilities It’s possible, as long as we don’t repeat our work! Dynamic programming: The Needleman & Wunsch algorithm Intro to Bioinformatics – Sequence Alignment 14 Dynamic Programming Technique of solving optimization problems • Find and memorize solutions for subproblems • Use those solutions to build solutions for larger subproblems • Continue until the final solution is found Recursive computation of cost function in a non-recursive fashion Intro to Bioinformatics – Sequence Alignment 15 Dynamic Programming Example • Solving Fibonacci number f(n) = f(n-1) + f(n-2) recursively takes exponential time Because many numbers are recalculated • Solving it using dynamic programming only takes a linear time Use an array f[] to store the numbers f[0] = 1; f[1] = 1; for (i = 2; i <= n; i++) f[i] = f[i – 1] + f[i – 2]; Intro to Bioinformatics – Sequence Alignment 16 Global Sequence Alignment Needleman-Wunsch algorithm Di,j = max{ Di-1,j + d(Ai, –), Di,j-1 + d(–, Bj), Di-1,j-1 + d(Ai, Bj) } B i-1 A i j-1 j Suppose we are aligning: Seq1 a with a … Seq2 a Intro to Bioinformatics – Sequence Alignment 0 -1 a -1 17 Dynamic Programming (DP) Concept Suppose we are aligning: CACGA CGA Intro to Bioinformatics – Sequence Alignment 18 DP – Recursion Perspective Suppose we are aligning: ACTCG ACAGTAG Last position choices: G G +1 ACTC ACAGTA G - -1 ACTC ACAGTAG G -1 ACTCG ACAGTA Intro to Bioinformatics – Sequence Alignment 19 What is the optimal alignment? ACTCG ACAGTAG Match: +1 Mismatch: 0 Gap: –1 Intro to Bioinformatics – Sequence Alignment 20 Needleman-Wunsch: Step 1 Each sequence along one axis Gap penalty multiples in first row/column 0 in [1,1] (or [0,0] for the CS-minded) A C A G T A G 0 -1 -2 -3 -4 -5 -6 -7 A -1 1 Intro to Bioinformatics – Sequence Alignment C -2 T -3 C -4 G -5 21 Needleman-Wunsch: Step 2 Vertical/Horiz. move: Score + (simple) gap penalty Diagonal move: Score + match/mismatch score Take the MAX of the three possibilities A C A G T A G 0 -1 -2 -3 -4 -5 -6 -7 A -1 1 Intro to Bioinformatics – Sequence Alignment C -2 T -3 C -4 G -5 22 Needleman-Wunsch: Step 2 (cont’d) Fill out the rest of the table likewise… a a c a g t a g 0 -1 -2 -3 -4 -5 -6 -7 c -1 1 Intro to Bioinformatics – Sequence Alignment t -2 0 c -3 -1 g -4 -2 -5 -3 23 Needleman-Wunsch: Step 2 (cont’d) Fill out the rest of the table likewise… a a c a g t a g 0 -1 -2 -3 -4 -5 -6 -7 c -1 1 0 -1 -2 -3 -4 -5 t -2 0 2 1 0 -1 -2 -3 c -3 -1 1 2 1 1 0 -1 g -4 -2 0 1 2 1 1 0 -5 -3 -1 0 2 2 1 2 The optimal alignment score is calculated in the lower-right corner Intro to Bioinformatics – Sequence Alignment 24 But what is the optimal alignment To reconstruct the optimal alignment, we must determine of where the MAX at each step came from… a a c a g t a g 0 -1 -2 -3 -4 -5 -6 -7 Intro to Bioinformatics – Sequence Alignment c -1 1 0 -1 -2 -3 -4 -5 t -2 0 2 1 0 -1 -2 -3 c -3 -1 1 2 1 1 0 -1 g -4 -2 0 1 2 1 1 0 -5 -3 -1 0 2 2 1 2 25 A path corresponds to an alignment = GAP in top sequence = GAP in left sequence = ALIGN both positions One path from the previous table: Corresponding alignment (start at the end): AC--TCG ACAGTAG Intro to Bioinformatics – Sequence Alignment Score = +2 26 Algorithm Analysis Brute force approach • If the length of both sequences is n, number of possibility = C(2n, n) = (2n)!/(n!)2 22n / (n)1/2, using Sterling’s approximation of n! = (2n)1/2e-nnn. • O(4n) Dynamic programming • O(mn), where the two sequence sizes are m and n, respectively • O(n2), if m is in the order of n Intro to Bioinformatics – Sequence Alignment 27 Practice Problem Find an optimal alignment for these two sequences: GCGGTT GCGT Match: +1 Mismatch: 0 Gap: –1 g c g t g 0 -1 -2 -3 -4 Intro to Bioinformatics – Sequence Alignment c -1 g -2 g -3 t -4 t -5 -6 28 Practice Problem Find an optimal alignment for these two sequences: GCGGTT GCGT g c g g t t g c g t 0 -1 -2 -3 -4 -1 1 0 -1 -2 -2 0 2 1 0 -3 -1 1 3 2 GCGGTT GCG-TIntro to Bioinformatics – Sequence Alignment -4 -2 0 2 3 -5 -3 -1 1 3 -6 -4 -2 0 2 Score = +2 29 Semi-global alignment Suppose we are aligning: GCG GGCG Which do you prefer? G-CG -GCG GGCG GGCG g g g c g 0 -1 -2 -3 -4 c -1 1 0 -1 -2 g -2 0 1 1 0 -3 -1 1 1 2 Semi-global alignment allows gaps at the ends for free. • Terminal gaps are usually the result of incomplete data acquisition no biologically significant Intro to Bioinformatics – Sequence Alignment 30 Semi-global alignment Semi-global alignment allows gaps at the ends for free. g g g c g 0 0 0 0 0 c 0 1 1 0 1 g 0 0 1 2 1 0 1 1 1 3 Initialize first row and column to all 0’s Allow free horizontal/vertical moves in last row and column Intro to Bioinformatics – Sequence Alignment 31 Local alignment Global alignments – score the entire alignment Semi-global alignments – allow unscored gaps at the beginning or end of either sequence Local alignment – find the best matching subsequence CGATG AAATGGA This is achieved by allowing a 4th alternative at each position in the table: zero. Intro to Bioinformatics – Sequence Alignment 32 Local Sequence Alignment Why local sequence alignment? • Subsequence comparison between a DNA sequence and a genome • Protein function domains • Exons matching Smith-Waterman algorithm Initialization: D1,j = , Di,1 = Di,j = max{ Di-1,j + d(Ai, –), Di,j-1 + d(–,Bj), Di-1,j-1 + d(Ai,Bj), 0} Intro to Bioinformatics – Sequence Alignment 33 Local alignment Score: Match = 1, Mismatch = -1, Gap = -1 c a a a t g g a 0 0 0 0 0 0 0 0 g 0 0 0 0 0 0 0 0 a 0 0 0 0 0 1 1 0 t 0 1 1 1 0 0 0 2 g 0 0 0 0 2 1 0 1 0 0 0 0 1 3 2 1 CGATG AAATGGA Intro to Bioinformatics – Sequence Alignment 34 Local alignment Another example Intro to Bioinformatics – Sequence Alignment 35 More Example Align ATGGCCTC ACGGCTC Mismatch = -3 Gap = -4 Global Alignment: ATGGCCTC ACGGC-TC -A T G G C C T C Intro to Bioinformatics – Sequence Alignment -0 -4 -8 -12 -16 -20 -24 -28 -32 A -4 1 -3 -7 -11 -15 -19 -23 -27 C -8 -3 -2 -6 -10 -10 -14 -18 -22 G -12 -7 -6 -1 -5 -9 -13 -17 -21 G -16 -11 -10 -5 0 -4 -8 -12 -16 C -20 -15 -14 -9 -4 1 -3 -7 -11 T -24 -19 -14 -13 -8 -3 -2 -2 -6 C -28 -23 -18 -17 -12 -7 -2 -5 -1 36 More Example Local Alignment: ATGGCCTC ACGG CTC or ATGGCCTC ACGGC TC Intro to Bioinformatics – Sequence Alignment -A T G G C C T C -0 0 0 0 0 0 0 0 0 A 0 1 0 0 0 0 0 0 0 C 0 0 0 0 0 1 1 0 1 G 0 0 0 1 1 0 0 0 0 G 0 0 0 1 2 0 0 0 0 C 0 0 0 0 0 3 1 0 1 T 0 0 1 0 0 0 0 2 0 C 0 0 0 0 0 1 1 0 3 37 Scoring Matrices for DNA Sequences Transition: A G C T Transversion: a purine (A or G) is replaced by a pyrimadine (C or T) or vice versa Intro to Bioinformatics – Sequence Alignment 38 Scoring Matrices for Protein Sequence PAM 250 Intro to Bioinformatics – Sequence Alignment 39 Scoring Matrices for Protein Sequence PAM (Percent Accepted Mutations) 1 PAM unit can be thought of as the amount of time in which an “average” protein mutates 1 out of every 100 amino acids. Perform multiple sequence alignment on a “family” of proteins that are at least 85% similar. Find the frequency of amino acid i and j are aligned to each other and normalize it. Entry mij in PAM1 represents the probability of amino acid i substituted by amino acid j in 1 PAM unit PAM 2 = PAM 1 × PAM 1 PAM n = (PAM 1)n, e.g., PAM 250 = (PAM 1)250 Questions Why are values in a PAM matrix integers? Shouldn’t a probability be between 0 and 1? Why is PAM250 possible? Shouldn’t a probability be less than or equal to 100%? Intro to Bioinformatics – Sequence Alignment 40 Scoring Matrices for Protein Sequence BLOSUM (BLOcks SUbstitution Matrix) 62 Intro to Bioinformatics – Sequence Alignment 41 Using Protein Scoring Matrices Divergence BLOSUM 80 PAM 1 Closely related Less divergent Less sensitive BLOSUM 62 PAM 120 Looking for BLOSUM 45 PAM 250 Distantly related More divergent More sensitive • Short similar sequences → use less sensitive matrix • Long dissimilar sequences → use more sensitive matrix • Unknown → use range of matrices Comparison • PAM – designed to track evolutionary origin of proteins • BLOSUM – designed to find conserved regions of proteins Intro to Bioinformatics – Sequence Alignment 42 Multiple Sequence Alignment Why multiple sequence alignment (MSA) • Two sequences might not look very similar. • But, some “similarity” emerges with more sequences, however. Is dynamic programming still applicable? CLUSTALW • One of most popular tools for MSA • Heuristic-based approach • Basic 1) Calculate the distance matrix based on pairwise alignments 2) Construct a guide tree NJ (unrooted tree) Mid-point (rooted tree) 3) Progressive alignment using the guide tree 43