Biological Sequence Comparison / Database Homology Searching
Aoife McLysaght
Summer Intern, Compaq Computer Corporation
Ballybrit Business Park, Galway, Ireland

Database Homology Searching
• Use algorithms to increase efficiency and to provide a mathematical basis for searches, which can be translated into statistical significance
• Assumes that sequence, structure and function are inter-related
• BLAST (Basic Local Alignment Search Tool) and FastA (Fast-All)
  – heuristic approximations of the Needleman-Wunsch and Smith-Waterman algorithms
  – reduce computation

Needleman-Wunsch Algorithm
• General algorithm for sequence comparison
• Maximises a similarity score, to give the ‘maximum match’
• Maximum match = largest number of residues of one sequence that can be matched with another, allowing for all possible deletions
• Finds the best GLOBAL alignment of any two sequences
• N-W involves an iterative matrix method of calculation
  – All possible pairs of residues (bases or amino acids), one from each sequence, are represented in a 2-dimensional array
  – All possible alignments (comparisons) are represented by pathways through this array

Needleman-Wunsch Algorithm (cont.)
• Three main steps
  1. Assign similarity values
  2. For each cell, look at all possible pathways back to the beginning of the sequence (allowing insertions and deletions) and give that cell the value of the maximum scoring pathway
  3. Construct an alignment (pathway) back from the highest scoring cell to give the highest scoring alignment

Needleman-Wunsch Algorithm (cont.)
Similarity values
• A numerical value is assigned to every cell in the array depending on the similarity/dissimilarity of the two residues
• These may be simple scores or more complicated, e.g.
related to chemical similarities or frequency of observed substitutions
• The example shown has
  – match = +1
  – mismatch = 0

[Figure: similarity array for MPRCLCQRJNCBA (rows) against PBRCKCRNJCJA (columns); cells pairing identical residues hold 1]

Needleman-Wunsch Algorithm (cont.)
Score pathways through array
• For each cell, we want to know the maximum possible score for an alignment ending at that point
• Search the subrow and subcolumn, as shown, for the highest score
• Add this to the score for the current cell
• Proceed row by row through the array
• A gap penalty may be charged for the introduction of gaps in the alignment (presumed insertions or deletions into one sequence); here it is 0

Partially scored array (? marks the next cell to be scored; the remaining cells are not yet scored):

       P  B  R  C  K  C  R  N  J  C  J  A
    M  0  0  0  0  0  0  0
    P  1  0  0  0  0  0  0
    R  0  1  2  1  1  1  2
    C  0  1  1  3  2  3  2
    L  0  1  1  2  3  3  3
    C  0  1  1  3  3  4  3
    Q  0  1  1  2  3  3  4
    R  0  1  2  2  3  3  ?
    J  0  1  1  2  3  3
    N  0  1  1  2  3  3
    C  0  1  1  3  3  4
    B  0  2  1  2  3  3
    A  0  1  2  2  3  3

H_{i,j} = \max\{ H_{i-1,j-1} + s(a_i,b_j),\ \max_k\{ H_{i-k,j-1} - W_k + s(a_i,b_j) \},\ \max_l\{ H_{i-1,j-l} - W_l + s(a_i,b_j) \} \}

Needleman-Wunsch Algorithm (cont.)
Construct alignment
• The alignment score is cumulative: it is obtained by adding along a path through the array
• The best alignment has the highest score, i.e. the maximum match
• Maximum match = largest sum of cell values over all pathways
• The maximum match will ALWAYS be somewhere in the outer row or column shown
• The alignment is constructed by working backwards from the maximum match

       P  B  R  C  K  C  R  N  J  C  J  A
    M  0  0  0  0  0  0  0  0  0  0  0  0
    P  1  0  0  0  0  0  0  0  0  0  0  0
    R  0  1  2  1  1  1  2  1  1  1  1  1
    C  0  1  1  3  2  3  2  2  2  3  2  2
    L  0  1  1  2  3  3  3  3  3  3  3  3
    C  0  1  1  3  3  4  3  3  3  4  3  3
    Q  0  1  1  2  3  3  4  4  4  4  4  4
    R  0  1  2  2  3  3  5  4  4  4  4  4
    J  0  1  1  2  3  3  4  5  6  5  6  5
    N  0  1  1  2  3  3  4  6  5  6  6  6
    C  0  1  1  3  3  4  4  5  6  7  6  6
    B  0  2  1  2  3  3  4  5  6  6  7  7
    A  0  1  2  2  3  3  4  5  6  6  7  8

Resulting alignment:

    MP-RCLCQR-JNCBA
     | || | | | | |
    -PBRCKC-RNJ-CJA

Needleman-Wunsch Algorithm (cont.)
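The scoring step for the Needleman-Wunsch example can be sketched in Python. This is a minimal illustration only, assuming the example's scores (match = +1, mismatch = 0, gap penalty = 0); the function name nw_maximum_match is hypothetical and does not come from the slides. It fills the array row by row, adds the best value found in the preceding subrow and subcolumn to each cell's similarity value, and reads the maximum match from the outer row and column.

```python
def nw_maximum_match(a, b, match=1, mismatch=0):
    """Score the array row by row with a gap penalty of 0.

    Rows correspond to residues of `a`, columns to residues of `b`.
    Returns (scored array, maximum match).
    """
    rows, cols = len(a), len(b)
    H = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            s = match if a[i] == b[j] else mismatch
            best_prev = 0  # cells in the first row/column have no predecessors
            if i > 0 and j > 0:
                # Subrow: previous row, all columns left of j (includes the diagonal).
                subrow = max(H[i - 1][:j])
                # Subcolumn: previous column, all rows above i.
                subcol = max(H[r][j - 1] for r in range(i))
                best_prev = max(subrow, subcol)
            H[i][j] = s + best_prev
    # The maximum match lies in the outer (last) row or column.
    best = max(max(H[-1]), max(row[-1] for row in H))
    return H, best


H, best = nw_maximum_match("MPRCLCQRJNCBA", "PBRCKCRNJCJA")
print(best)
```

For the two example sequences this yields a maximum match of 8, the value in the bottom-right cell of the completed array.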
Statistical Significance
• Maximum match is a function of sequence relationship and composition
• We would like to know the probability of obtaining the result (maximum match) from a pair of random sequences
• Estimate this experimentally
  – form pairs of random sequences by randomly drawing one member from each set (i.e. the random sequences have the same composition as the real proteins)
  – if the value found for the real proteins is significantly different from that for the random proteins, then the difference is a function of the sequences alone and not of their composition

Smith-Waterman Algorithm
• Instead of looking at each sequence in its entirety, this compares segments of all possible lengths (LOCAL alignments) and chooses whichever maximises the similarity measure
• For every cell the algorithm calculates ALL possible paths leading to it. These paths can be of any length and can contain insertions and deletions

Smith-Waterman Algorithm (cont.)
• Only works effectively when gap penalties are used
• In the example shown
  – match = +1
  – mismatch = -1/3
  – gap = -(1 + k/3) (k = extent of gap)
• Start with all cell values = 0
• Looks in the subcolumn and subrow shown, and in the direct diagonal, for the score that is highest once the alignment score or gap penalty is taken into account

Partially scored array (? marks the next cell to be scored):

         A    A    U    G    C    C    A    U    U    G    A    C    G    G
    C  0.0  0.0  0.0  0.0  1.0  1.0
    A  1.0  1.0  0.0  0.0  0.0  0.7
    G  0.0  0.7  0.8  1.0  0.0  0.0
    C  0.0  0.0  0.3  0.3  2.0  1.0
    C  0.0  0.0  0.0  0.0  1.3  3.0
    U  0.0  0.0  0.0  0.0  0.3  1.7
    C  0.0  0.0  0.0  0.7  1.0    ?
    G  0.0  0.0  0.0  1.0  0.3
    C  0.0  0.0  0.0  0.0  2.0
    U  0.0  0.0  1.0  0.0  0.7
    U  0.0  0.0  1.0  0.7  0.3
    A  1.0  1.0  0.0  0.7  0.3
    G  0.0  0.7  0.7  1.0  0.3

H_{i,j} = \max\{ H_{i-1,j-1} + s(a_i,b_j),\ \max_k\{ H_{i-k,j} - W_k \},\ \max_l\{ H_{i,j-l} - W_l \},\ 0 \}

Smith-Waterman Algorithm (cont.)
• Four possible ways of forming a path
For every residue in the query sequence:
  1. Align with next residue of db sequence … score is previous score plus similarity score for the two residues
  2. Deletion (i.e.
match residue of query with a gap) … score is previous score minus gap penalty dependent on size of gap
  3. Insertion (i.e. match residue of db sequence with a gap) … score is previous score minus gap penalty dependent on size of gap
  4. Stop … score is zero
• Choose whichever of these is the highest

Smith-Waterman Algorithm (cont.)
Construct Alignment
• The score in each cell is the maximum possible score for an alignment of ANY LENGTH ending at those coordinates
• Trace pathway back from highest scoring cell
• This cell can be anywhere in the array
• Align highest scoring segment

         A    A    U    G    C    C    A    U    U    G    A    C    G    G
    C  0.0  0.0  0.0  0.0  1.0  1.0  0.0  0.0  0.0  0.0  0.0  1.0  0.0  0.0
    A  1.0  1.0  0.0  0.0  0.0  0.7  2.0  0.7  0.3  0.0  1.0  0.0  0.7  0.0
    G  0.0  0.7  0.8  1.0  0.0  0.0  0.7  1.7  0.3  1.3  0.0  0.7  1.0  1.7
    C  0.0  0.0  0.3  0.3  2.0  1.0  0.3  0.3  1.3  0.0  1.0  1.0  0.3  0.7
    C  0.0  0.0  0.0  0.0  1.3  3.0  1.7  1.3  1.0  1.0  0.3  2.0  0.7  0.3
    U  0.0  0.0  0.0  0.0  0.3  1.7  2.7  2.7  2.3  1.0  0.7  0.7  1.7  0.3
    C  0.0  0.0  0.0  0.7  1.0  1.3  1.3  2.3  2.3  2.0  0.7  1.7  0.3  1.3
    G  0.0  0.0  0.0  1.0  0.3  1.0  1.0  1.0  2.0  3.3  2.0  1.7  2.7  1.3
    C  0.0  0.0  0.0  0.0  2.0  1.3  0.7  0.7  0.7  2.0  3.0  3.0  1.7  2.3
    U  0.0  0.0  1.0  0.0  0.7  1.7  1.0  1.7  1.7  1.7  1.7  2.7  2.7  1.3
    U  0.0  0.0  1.0  0.7  0.3  0.3  1.3  2.0  2.7  1.3  1.3  1.3  2.3  2.3
    A  1.0  1.0  0.0  0.7  0.3  0.0  1.3  1.0  1.7  2.3  2.3  1.0  1.0  2.0
    G  0.0  0.7  0.7  1.0  0.3  0.0  0.0  1.0  1.0  2.7  2.0  2.0  2.0  2.0

Highest scoring segment:

    GCC-UCG
    GCCAUUG

Differences
• Needleman-Wunsch
  1. Global alignments
  2. Requires alignment score for a pair of residues to be >= 0
  3. No gap penalty required
  4. Score cannot decrease between two cells of a pathway
• Smith-Waterman
  1. Local alignments
  2. Residue alignment score may be positive or negative
  3. Requires a gap penalty to work effectively
  4. Score can increase, decrease or stay level between two cells of a pathway
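The Smith-Waterman scoring described in these slides can be sketched in Python as follows. This is an illustration under stated assumptions, not a definitive implementation: it takes the example's scores (match = +1, mismatch = -1/3) and reads the gap penalty as W_k = 1 + k/3 for a gap of extent k. The function name smith_waterman is hypothetical, and exact fractions are used so that values such as 1/3 do not pick up rounding error. Each cell takes the best of the four options listed above: align, deletion, insertion, or stop.

```python
from fractions import Fraction


def smith_waterman(a, b, match=Fraction(1), mismatch=Fraction(-1, 3)):
    """Score all local alignments of `a` (rows) against `b` (columns).

    Returns (array H with an extra zero row and column, best score,
    1-based (row, column) of the highest scoring cell).
    """
    def gap(k):
        # Assumed gap penalty W_k = 1 + k/3 for a gap of extent k.
        return 1 + Fraction(k, 3)

    H = [[Fraction(0)] * (len(b) + 1) for _ in range(len(a) + 1)]
    best, best_pos = Fraction(0), (0, 0)
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            align = H[i - 1][j - 1] + s                                   # diagonal
            deletion = max(H[i - k][j] - gap(k) for k in range(1, i + 1))   # subcolumn
            insertion = max(H[i][j - l] - gap(l) for l in range(1, j + 1))  # subrow
            H[i][j] = max(align, deletion, insertion, Fraction(0))       # 0 = stop
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    return H, best, best_pos


H, best, pos = smith_waterman("CAGCCUCGCUUAG", "AAUGCCAUUGACGG")
print(float(best), pos)
```

For the example sequences the highest cell is 10/3 (shown as 3.3 in the array), at row 8 (G) and column 10 (G), which is where the highest scoring segment ends. Tracing back from that cell would recover the aligned segment.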