HW2 Computing Edit Distance and Sequence Alignments

HW4 Biological Sequence Alignment
Due on Tuesday, Mar 1
All of the sequence-related programming assignments so far have dealt with English
language strings. The point of that was twofold. First, the implementation details of the
algorithms would be more transparent. Second, in a domain you are familiar with (English),
you could see firsthand how powerful these algorithms are at finding the best match
between a set of known things and a very noisy (mutated) version of one of the known
things.
In this assignment we will make the transition to doing these operations on biological
sequences (specifically proteins). This transition involves two programming elements:
a change from the English alphabet to the 20 letters of the amino acid alphabet, plus the
gap (indel).
1) Redo problem 2, HW 3, but replace the tongue twisters with the list of 10 protein
sequences provided. Use the BLOSUM62 amino acid substitution matrix as the
basis of the cost function and a linear gap penalty of -10.
Test your program on the four query sequences provided. You may start with the T.A.
solution to HW 3. The sequences are in FASTA format. You can use FASTA.java for
reading FASTA files.
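As a reminder of how the substitution score and the linear gap penalty enter the dynamic program, here is a minimal sketch. It assumes that problem 2 of HW 3 was a local alignment; the class name, method name, and the `score` parameter are illustrative and are not taken from the T.A. solution. The scoring function is passed in, so the same loop works with a simple match/mismatch rule or with a BLOSUM62 lookup.

```java
import java.util.function.IntBinaryOperator;

class LocalAlignSketch {
    // Returns the best local alignment score of a and b.
    // score: substitution score for a pair of characters (e.g. a BLOSUM62 lookup).
    // gap:   linear gap penalty, given as a negative number (e.g. -10).
    static int bestLocalScore(String a, String b, IntBinaryOperator score, int gap) {
        // h[i][j] = best score of a local alignment ending at a[i-1], b[j-1].
        // Row 0 and column 0 stay at 0.
        int[][] h = new int[a.length() + 1][b.length() + 1];
        int best = 0;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int diag = h[i - 1][j - 1] + score.applyAsInt(a.charAt(i - 1), b.charAt(j - 1));
                int up = h[i - 1][j] + gap;    // a[i-1] aligned to a gap
                int left = h[i][j - 1] + gap;  // b[j-1] aligned to a gap
                // A local alignment may restart anywhere, so scores never drop below 0.
                h[i][j] = Math.max(0, Math.max(diag, Math.max(up, left)));
                best = Math.max(best, h[i][j]);
            }
        }
        return best;
    }
}
```

Recovering the aligned strings and their index ranges would additionally require a traceback from the cell that holds the best score, which this sketch omits.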
The solution to the first sequence is provided:
The best match is found between query sequence 1 and dictionary
sequence 9.
The alignment score is 244.
AHPDLVNAGGQPCGVLPGAAMFDSAMSFALIRGGHIDACVLGGLQVDEEANLANWVVPGKMVPGMGGA
MDLVTGSRKVIIAMEHCAKDGSA
57..147
ADADLINAGKETVTILPGASFFSSDESFAMIRGGHVDLTMLGAMQVSKYGDLANWMIPGKMVKGMGGA
MDLVSSAKTKVVVTMEHSAKGNA
358..448
Things you’ll have to address in the code to upgrade it to align protein sequences:
A. In lieu of initializing the program with the tongue twisters, initialize the program
with protein sequences. (The program will still compute alignments. It just uses
the alphabet of amino acids instead of English).
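For illustration, loading the sequences might look like the sketch below. The interface of the course's FASTA.java is not shown in this handout, so this is a hypothetical stand-in rather than that class: it takes the text of a FASTA file and returns a map from each header line (without the leading '>') to its sequence, joining sequences that span several lines.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class FastaSketch {
    // Hypothetical helper, not the course's FASTA.java.
    // Parses FASTA text into header -> sequence, keeping the file order.
    static Map<String, String> parse(String fastaText) {
        Map<String, String> records = new LinkedHashMap<>();
        String header = null;
        StringBuilder residues = new StringBuilder();
        for (String rawLine : fastaText.split("\\r?\\n")) {
            String line = rawLine.trim();
            if (line.isEmpty()) {
                continue; // skip blank lines
            }
            if (line.startsWith(">")) {
                // A new record starts; store the previous one, if any.
                if (header != null) {
                    records.put(header, residues.toString());
                }
                header = line.substring(1).trim();
                residues = new StringBuilder();
            } else if (header != null) {
                // Sequence data may be wrapped over several lines.
                residues.append(line.toUpperCase());
            }
        }
        if (header != null) {
            records.put(header, residues.toString());
        }
        return records;
    }
}
```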
You can test your alignment code on the following example:
Given two protein sequences:
VPDPKFSSQTKDKLVSSEVKSAVEQQMNELLAEYLLENPTDAKIVVGKIID
IENPAFTSQTKEQLTTRVKDFGSRCEIPLEYINKIMKTDLATRMFEIADANEENALK
The high scoring pair (the first 16 characters of both sequences) is found using the
BLOSUM-62 matrix:
VPDPKFSSQTKDKLVS
IENPAFTSQTKEQLTT
0..15
0..15
The alignment score is 43.
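Because this high scoring pair contains no gaps, its score is just the sum of the BLOSUM62 entries for the 16 aligned columns. The sketch below checks that arithmetic. To stay short it hard-codes only the BLOSUM62 entries that this example needs; the class and method names are illustrative, and your program should use the full matrix instead.

```java
class UngappedScoreSketch {
    // Excerpt of BLOSUM62: only the pairs that occur in this example.
    // The matrix is symmetric, so each pair is stored once in sorted order.
    static int pairScore(char x, char y) {
        String key = x <= y ? "" + x + y : "" + y + x;
        switch (key) {
            case "IV": return 3;
            case "EP": return -1;
            case "DN": return 1;
            case "PP": return 7;
            case "AK": return -1;
            case "FF": return 6;
            case "ST": return 1;
            case "SS": return 4;
            case "QQ": return 5;
            case "TT": return 5;
            case "KK": return 5;
            case "DE": return 2;
            case "KQ": return 1;
            case "LL": return 4;
            case "TV": return 0;
            default:
                throw new IllegalArgumentException("Pair not in this excerpt: " + key);
        }
    }

    // Sums the substitution scores column by column (no gaps allowed).
    static int ungappedScore(String a, String b) {
        if (a.length() != b.length()) {
            throw new IllegalArgumentException("Strings must have equal length");
        }
        int total = 0;
        for (int i = 0; i < a.length(); i++) {
            total += pairScore(a.charAt(i), b.charAt(i));
        }
        return total;
    }

    public static void main(String[] args) {
        // 3 - 1 + 1 + 7 - 1 + 6 + 1 + 4 + 5 + 5 + 5 + 2 + 1 + 4 + 0 + 1 = 43
        System.out.println(ungappedScore("VPDPKFSSQTKDKLVS", "IENPAFTSQTKEQLTT"));
    }
}
```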
B. Replace the hard-coded cost function with a cost function that looks up the
cost for each amino acid pair in the BLOSUM matrix. To do
so you will have to:
a. Create a 20 x 20 matrix in your code, and initialize it with the BLOSUM
matrix.
b. Determine a way to “address” the cells in the matrix.
The problem here is that the 20 x 20 matrix is indexed by integers, while your
sequences are made of letters. For example, the cost of substituting ‘A’
with ‘C’ is in cell (0, 4).
To save your typing, you may use this piece of Java code, which contains the
amino acid alphabet and the BLOSUM matrix.
You need to create a function that converts an amino acid letter to its index in
the matrix. For example:
int index(char aminoAcid)
The following is a suggested implementation:
- In initialization outside this function, create an array that maps each letter's
integer offset (0 to 25) to the index of the amino acid in the BLOSUM matrix. For
example, the integer Character.toUpperCase('C') - 'A' should
be mapped to 4.
- In this function, convert the amino acid letter to an integer in the range
from 0 to 25. For example: Character.toUpperCase(aminoAcid) - 'A'
- Use the above array to get the index of the amino acid.
You may choose to use other data structures that you see as appropriate, e.g. a hash
table. Note that our alphabet contains only 20 letters of the English alphabet. Your
code should detect illegal inputs, e.g. strings that contain the letter ‘Z’.
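The suggested implementation above can be sketched as follows. The row and column order used here (ARNDCQEGHILKMFPSTWYV) is the usual published order of BLOSUM62 and is an assumption; it is consistent with ‘C’ mapping to 4, but you should use whatever order the provided Java code defines. The class name and the choice of exception are also illustrative.

```java
import java.util.Arrays;

class AminoAcidIndexSketch {
    // Assumed matrix order; replace with the alphabet from the provided code.
    static final String ALPHABET = "ARNDCQEGHILKMFPSTWYV";

    // Maps a letter offset (0 to 25) to its matrix index, or -1 if the
    // letter is not one of the 20 amino acids.
    static final int[] LETTER_TO_INDEX = new int[26];

    static {
        Arrays.fill(LETTER_TO_INDEX, -1);
        for (int i = 0; i < ALPHABET.length(); i++) {
            LETTER_TO_INDEX[ALPHABET.charAt(i) - 'A'] = i;
        }
    }

    static int index(char aminoAcid) {
        int letter = Character.toUpperCase(aminoAcid) - 'A';
        // Reject anything outside A..Z and the six letters that are not amino acids.
        if (letter < 0 || letter >= 26 || LETTER_TO_INDEX[letter] < 0) {
            throw new IllegalArgumentException("Not an amino acid letter: " + aminoAcid);
        }
        return LETTER_TO_INDEX[letter];
    }
}
```

With this layout, index('C') returns 4, and index('Z') throws an exception because ‘Z’ is not in the alphabet.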
c. Instead of testing character equality in the cost function, you’ll have to
look into your BLOSUM matrix and get the value out of the right cell. You
need to create a function that returns the cost of substituting one amino acid
by another. For example:
int cost(char aminoAcid1, char aminoAcid2)
The cost of substituting ‘A’ with ‘C’ is returned by cost('A', 'C').
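A minimal sketch of such a lookup is shown below. To keep it short, it holds only the 5 x 5 corner of BLOSUM62 for the letters A, R, N, D, and C, assuming the usual published row order; your version needs the full 20 x 20 matrix. The class name and the simple indexOf-based index helper are illustrative.

```java
class CostSketch {
    // Excerpt only: the first five letters of the assumed BLOSUM62 order.
    static final String LETTERS = "ARNDC";

    // Top-left 5 x 5 corner of BLOSUM62, rows and columns in LETTERS order.
    static final int[][] SCORES = {
        //  A   R   N   D   C
        {   4, -1, -2, -2,  0 },  // A
        {  -1,  5,  0, -2, -3 },  // R
        {  -2,  0,  6,  1, -3 },  // N
        {  -2, -2,  1,  6, -3 },  // D
        {   0, -3, -3, -3,  9 },  // C
    };

    // Converts a letter to its row/column position, rejecting unknown letters.
    static int index(char aminoAcid) {
        int i = LETTERS.indexOf(Character.toUpperCase(aminoAcid));
        if (i < 0) {
            throw new IllegalArgumentException("Letter not in this excerpt: " + aminoAcid);
        }
        return i;
    }

    // Looks up the substitution score instead of testing character equality.
    static int cost(char aminoAcid1, char aminoAcid2) {
        return SCORES[index(aminoAcid1)][index(aminoAcid2)];
    }
}
```

Here cost('A', 'C') reads cell (0, 4), which holds 0 in BLOSUM62.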
Aside: Speech recognition systems use variations of these same algorithms in a first layer
that matches speech signals against phoneme and/or word dictionaries. Consider that
speech, or sound, is a continuous analog wave. In speech recognition the signals are
sampled (measured) at rates of roughly 30 to 40 kilohertz. The result is a sequence of
numbers. The speech recognition companies have collected the number sequences for
each possible phoneme (syllable in a language). When you talk into such a system, the
sequence representing your voice is compared to a dictionary of the sequences
representing all possible sounds. The alignment of each sound in the dictionary with
your input is computed, and a sequence of phonemes aligned to your voice input emerges.
While this explanation is simplified, it is still an accurate portrayal. (I skipped the step
that sometimes includes a Fourier Transform of the number sequences.) Besides
changing the alphabet to numbers, the biggest difference is that in speech the input
(your voice) is very long and the dictionary entries (phonemes) are short.
Speech recognition is sufficiently hard and, simultaneously, sufficiently intolerant of
errors that a second, “language-model” layer detects whether the sequence of phonemes is
grammatically incorrect. If so, this second layer finds the minimum number of substitutions
of best matches with second- or third-best matches that results in a grammatically correct
sentence. Sometimes this is done with a second layer of these same algorithms. More often,
algorithms based on Markov models are used. These Markov-model algorithms are also
in widespread use in bioinformatics.
Extra Credit: There is active work on creating user interfaces to mp3 players such that
you can pick a song out of your player by humming or singing a short piece of the song.
Suggest how what you’ve learned so far in this class might be used to solve that problem.