HW4 Biological Sequence Alignment
Due on Tuesday, Mar 1

All of the sequence-related programming assignments have, so far, dealt with English language strings. The point of that was twofold. First, the implementation details of the algorithms would be more transparent. Second, in a domain you are familiar with, English, you could see firsthand how powerful these algorithms are at finding the best match between a set of known things and a very noisy (mutated) version of one of them.

In this assignment we will make the transition to doing these operations on biological sequences (specifically proteins). This transition involves two programming elements: a change from the English alphabet to the 20 letters of the amino acid alphabet, plus gap (indel).

1) Redo problem 2, HW 3, but replace the tongue twisters with the list of 10 protein sequences provided. Use the BLOSUM62 amino acid substitution matrix as the basis of the cost function and a linear gap penalty of -10. Test your program on the four query sequences provided. You may start with the T.A. solution to HW 3.

The sequences are in FASTA format. You can use FASTA.java for reading FASTA files.

The solution to the first sequence is provided: the best match is found between query sequence 1 and dictionary sequence 9. The alignment score is 244.

AHPDLVNAGGQPCGVLPGAAMFDSAMSFALIRGGHIDACVLGGLQVDEEANLANWVVPGKMVPGMGGAMDLVTGSRKVIIAMEHCAKDGSA 57..147
ADADLINAGKETVTILPGASFFSSDESFAMIRGGHVDLTMLGAMQVSKYGDLANWMIPGKMVKGMGGAMDLVSSAKTKVVVTMEHSAKGNA 358..448

Things you’ll have to address in the code to upgrade it to align protein sequences:

A. In lieu of initializing the program with the tongue twisters, initialize the program with protein sequences. (The program will still compute alignments. It just uses the alphabet of amino acids instead of English.)
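
For step A, it may help to see how little structure a FASTA file has: a header line starting with '>' followed by one or more lines of sequence. The sketch below is not the provided FASTA.java and makes no claim about its interface. The class name FastaSketch, the method parse, and the record names in main are hypothetical, chosen only for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

final class FastaSketch {
    // Parses FASTA text into header -> sequence, keeping the order of the records.
    // Assumptions: blank lines are skipped, sequence lines that appear before the
    // first header are ignored, and a repeated header replaces the earlier record.
    static Map<String, String> parse(String fastaText) {
        Map<String, String> records = new LinkedHashMap<>();
        String header = null;
        StringBuilder sequence = new StringBuilder();
        for (String rawLine : fastaText.split("\\R")) {
            String line = rawLine.trim();
            if (line.isEmpty()) {
                continue;
            }
            if (line.startsWith(">")) {
                // A new header closes the record collected so far.
                if (header != null) {
                    records.put(header, sequence.toString());
                }
                header = line.substring(1).trim();
                sequence.setLength(0);
            } else if (header != null) {
                // A sequence may be wrapped over several lines; join them.
                sequence.append(line.toUpperCase(Locale.ROOT));
            }
        }
        if (header != null) {
            records.put(header, sequence.toString());
        }
        return records;
    }

    public static void main(String[] args) {
        // Hypothetical two-record input, with the first sequence wrapped over two lines.
        String text = ">example1\nVPDPKFSS\nQTKDKLVS\n>example2\nIENPAFTSQTKEQLTT\n";
        for (Map.Entry<String, String> record : parse(text).entrySet()) {
            System.out.println(record.getKey() + " " + record.getValue());
        }
    }
}
```

Keeping the records in file order means that "dictionary sequence 9" can still be identified by its position in the file.
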
You can test your alignment code on the following example. Given two protein sequences:

VPDPKFSSQTKDKLVSSEVKSAVEQQMNELLAEYLLENPTDAKIVVGKIID
IENPAFTSQTKEQLTTRVKDFGSRCEIPLEYINKIMKTDLATRMFEIADANEENALK

the high scoring pair (the first 16 characters of both sequences) found using the BLOSUM62 matrix is:

VPDPKFSSQTKDKLVS 0..15
IENPAFTSQTKEQLTT 0..15

The alignment score is 43.

B. Replace the hard-coded cost function with a cost function that looks up the cost for each amino acid pair in the BLOSUM matrix. To do so you will have to:

a. Create a 20 x 20 matrix in your code, and initialize it with the BLOSUM matrix.

b. Determine a way to “address” the cells in the matrix. The problem here is that you have a 20 x 20 matrix whose cells are addressed by integer indices, while the amino acids are letters. The cost of substituting 'A' with 'C' is in cell (0, 4). To save you typing, you may use this piece of Java code, which contains the amino acid alphabet and the BLOSUM matrix.

You need to create a function that converts an amino acid letter to its index in the matrix. For example:

int index(char aminoAcid)

The following is a suggested implementation:

In initialization outside this function, create an array that maps each integer in the range from 0 to 25 (a letter's offset from 'A') to the index of that amino acid in the BLOSUM matrix. For example, the integer Character.toUpperCase('C') - 'A' should be mapped to 4.

In this function, convert the amino acid letter to an integer in the range from 0 to 25. For example: Character.toUpperCase(aminoAcid) - 'A'

Use the above array to get the index of the amino acid.

You may choose to use other data structures that you see appropriate, e.g. a hash table. Note that our alphabet contains only 20 letters of the English alphabet. Your code should detect illegal inputs, e.g. strings that contain the letter 'Z'.

c. Instead of testing character equality in the cost function, you’ll have to look into your BLOSUM matrix and get the value out of the right cell. You need to create a function that returns the cost of substituting one amino acid by another.
For example: int cost(char aminoAcid1, char aminoAcid2)
The cost of substituting 'A' with 'C' is returned by cost('A', 'C').

Aside: Speech recognition systems use variations of these same algorithms in a first layer of matching with phoneme and/or word dictionaries of speech signals. Consider that speech, or sound, is a continuous analog wave. In speech recognition, the signals are sampled (measured) at speeds of roughly 30 to 40 kilohertz. The result is a sequence of numbers. The speech recognition companies have collected the number sequences for each possible phoneme (syllable in a language). When you talk into such a system, the sequence representing your voice is compared to a dictionary of the sequences representing all possible sounds. The alignment of each sound in the dictionary with your input is computed, and a sequence of phonemes aligned to your voice input emerges. While this explanation is simplified, it is still an accurate portrayal. (I skipped the step that sometimes includes a Fourier Transform of the number sequences.) Besides changing the alphabet to numbers, the biggest difference is that in speech the input, your voice, is very long, while the dictionary entries, phonemes, are short.

Speech recognition is sufficiently hard and, simultaneously, intolerant of errors, that a “language-model” layer detects whether the sequence of phonemes is grammatically correct. If it is not, this second layer finds the minimum number of substitutions of best matches with second or third best matches that results in a grammatically correct sentence. Sometimes this is done with a second layer of these same algorithms. More often, algorithms based on Markov models are used. These Markov-model algorithms are also in widespread use in bioinformatics.

Extra Credit: There is active work on creating user interfaces to mp3 players such that you can pick a song out of your player by humming or singing a short piece of the song.
Suggest how what you’ve learned so far in this class might be used to solve that problem.
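
Returning to part B, the suggested index and cost functions can be sketched as follows. This is a minimal sketch under stated assumptions, not the T.A. solution. The class name Blosum62Sketch and the helper ungappedScore are hypothetical. The matrix values were transcribed from the standard BLOSUM62 table in the row order ARNDCQEGHILKMFPSTWYV, so check them against the provided Java code before relying on them.

```java
import java.util.Arrays;

final class Blosum62Sketch {
    // Row and column order of the matrix: 'A' is index 0 and 'C' is index 4.
    static final String ALPHABET = "ARNDCQEGHILKMFPSTWYV";

    // Assumption: transcribed from the standard BLOSUM62 table; verify against the course file.
    static final int[][] BLOSUM62 = {
        { 4, -1, -2, -2,  0, -1, -1,  0, -2, -1, -1, -1, -1, -2, -1,  1,  0, -3, -2,  0}, // A
        {-1,  5,  0, -2, -3,  1,  0, -2,  0, -3, -2,  2, -1, -3, -2, -1, -1, -3, -2, -3}, // R
        {-2,  0,  6,  1, -3,  0,  0,  0,  1, -3, -3,  0, -2, -3, -2,  1,  0, -4, -2, -3}, // N
        {-2, -2,  1,  6, -3,  0,  2, -1, -1, -3, -4, -1, -3, -3, -1,  0, -1, -4, -3, -3}, // D
        { 0, -3, -3, -3,  9, -3, -4, -3, -3, -1, -1, -3, -1, -2, -3, -1, -1, -2, -2, -1}, // C
        {-1,  1,  0,  0, -3,  5,  2, -2,  0, -3, -2,  1,  0, -3, -1,  0, -1, -2, -1, -2}, // Q
        {-1,  0,  0,  2, -4,  2,  5, -2,  0, -3, -3,  1, -2, -3, -1,  0, -1, -3, -2, -2}, // E
        { 0, -2,  0, -1, -3, -2, -2,  6, -2, -4, -4, -2, -3, -3, -2,  0, -2, -2, -3, -3}, // G
        {-2,  0,  1, -1, -3,  0,  0, -2,  8, -3, -3, -1, -2, -1, -2, -1, -2, -2,  2, -3}, // H
        {-1, -3, -3, -3, -1, -3, -3, -4, -3,  4,  2, -3,  1,  0, -3, -2, -1, -3, -1,  3}, // I
        {-1, -2, -3, -4, -1, -2, -3, -4, -3,  2,  4, -2,  2,  0, -3, -2, -1, -2, -1,  1}, // L
        {-1,  2,  0, -1, -3,  1,  1, -2, -1, -3, -2,  5, -1, -3, -1,  0, -1, -3, -2, -2}, // K
        {-1, -1, -2, -3, -1,  0, -2, -3, -2,  1,  2, -1,  5,  0, -2, -1, -1, -1, -1,  1}, // M
        {-2, -3, -3, -3, -2, -3, -3, -3, -1,  0,  0, -3,  0,  6, -4, -2, -2,  1,  3, -1}, // F
        {-1, -2, -2, -1, -3, -1, -1, -2, -2, -3, -3, -1, -2, -4,  7, -1, -1, -4, -3, -2}, // P
        { 1, -1,  1,  0, -1,  0,  0,  0, -1, -2, -2,  0, -1, -2, -1,  4,  1, -3, -2, -2}, // S
        { 0, -1,  0, -1, -1, -1, -1, -2, -2, -1, -1, -1, -1, -2, -1,  1,  5, -2, -2,  0}, // T
        {-3, -3, -4, -4, -2, -2, -3, -2, -2, -3, -2, -3, -1,  1, -4, -3, -2, 11,  2, -3}, // W
        {-2, -2, -2, -3, -2, -1, -2, -3,  2, -1, -1, -2, -1,  3, -3, -2, -2,  2,  7, -1}, // Y
        { 0, -3, -3, -3, -1, -2, -2, -3, -3,  3,  1, -2,  1, -1, -2, -2,  0, -3, -1,  4}, // V
    };

    // Maps (letter - 'A'), an integer in 0..25, to a matrix index; -1 marks letters outside the alphabet.
    static final int[] LETTER_TO_INDEX = new int[26];
    static {
        Arrays.fill(LETTER_TO_INDEX, -1);
        for (int i = 0; i < ALPHABET.length(); i++) {
            LETTER_TO_INDEX[ALPHABET.charAt(i) - 'A'] = i;
        }
    }

    // Converts an amino acid letter to its matrix index and rejects illegal input such as 'Z'.
    static int index(char aminoAcid) {
        int letter = Character.toUpperCase(aminoAcid) - 'A';
        if (letter < 0 || letter >= 26 || LETTER_TO_INDEX[letter] < 0) {
            throw new IllegalArgumentException("Not an amino acid letter: " + aminoAcid);
        }
        return LETTER_TO_INDEX[letter];
    }

    // Looks up the substitution score for a pair of amino acids.
    static int cost(char aminoAcid1, char aminoAcid2) {
        return BLOSUM62[index(aminoAcid1)][index(aminoAcid2)];
    }

    // Sums the scores position by position with no gaps: a sanity check, not an aligner.
    static int ungappedScore(String a, String b) {
        if (a.length() != b.length()) {
            throw new IllegalArgumentException("Sequences must have equal length");
        }
        int total = 0;
        for (int i = 0; i < a.length(); i++) {
            total += cost(a.charAt(i), b.charAt(i));
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(cost('A', 'C'));                                          // prints 0
        System.out.println(ungappedScore("VPDPKFSSQTKDKLVS", "IENPAFTSQTKEQLTT"));   // prints 43
    }
}
```

Under these assumptions, cost('A', 'C') reads cell (0, 4), and summing the 16 ungapped positions of the test pair from part A gives 43, which matches the score quoted there. A hash table from letter to index would work equally well. The array is used here only because it follows the suggested implementation.
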