Perl exercise 5 (Due on 28/12/2010)

Don't forget to write well-organized scripts, use meaningful variable names, and write comments about what you are doing in each part of the code. Always debug your scripts to make sure they perform as you planned.

Hashes

1. Write a script that reads a FASTA file (as an example you may use Ecoli.prot.fasta) and creates a hash in which:

   a. The gi number (from the header) is the key, and the value is the ref accession. For example, for the first header:

      >gi|16127995|ref|NP_414542.1| thr operon leader peptide...

      the key should be "16127995" and the value should be "NP_414542.1".
      Ask the user for a gi number and print its ref accession.

   b. The gi number (from the header) is the key, and the value is the full sequence of the protein. For example, for the first (short) protein:

      >gi|16127995|ref|NP_414542.1| thr operon leader peptide...
      MKRISTTITTTITITTGNGAG

      the key should be "16127995" and the value should be "MKRISTTITTTITITTGNGAG".
      Note that for long sequences the value should be the full sequence of the protein, even if it spans multiple lines.
      As before, ask the user for a gi number and print its sequence.

2. Write a script that reads a PTT file (as an example you may use Ecoli.ptt) and creates a hash in which the key is the Synonym (in E.coli this is the value that starts with b followed by four digits) and the value is the Product. For example, for the following line:

      2801..3733 + 310 16127997 thrB b0003 - COG0083E homoserine kinase

   the key should be "b0003" and the value should be "homoserine kinase".
   Ask the user for a Synonym and print its Product description.

   Note: The format of each line in the table is:

      <Start location>..<End location>\t<strand>\t<length in aa>\t<GI(= PID)>\t<Gene Symbol>\t<Synonym>\t<Code (not relevant for E.coli)>\t<COG accession>\t<Product description>

Complex data structures

3. Write a script that reads a PTT file (as an example you may use Ecoli.ptt) and creates a complex data structure:

   a. The key of the outer hash is the Synonym, and an inner hash stores the strand, the length and the product, such that for the following line:

      2801..3733 + 310 16127997 thrB b0003 - COG0083E homoserine kinase

      the value of $pttHash{b0003}->{"strand"} is +
      the value of $pttHash{b0003}->{"length"} is 310
      the value of $pttHash{b0003}->{"product"} is homoserine kinase

      Iterate over your data structure and print the synonym and product of all the proteins longer than 1400 aa that are coded on the positive strand.

   b. Add to the inner hash of 3a a key "location" whose value is a reference to an array that has two elements: the first is the start position and the second is the end position. For example, for the line:

      2801..3733 + 310 16127997 thrB b0003 - COG0083E homoserine kinase

      the value of $pttHash{b0003}->{"location"}->[0] is 2801, and the value of $pttHash{b0003}->{"location"}->[1] is 3733.

      Receive a start value and an end value from the user and print the location, synonym and product of the proteins that start after the start value provided by the user and end before the end value provided by the user.

4. Write a script that reads a FASTA file of mRNA sequences, where the FASTA header line of each sequence has the following format:

      >accession #DE description #LN length #CD coding sequence start..end #TA TATA-box #PA poly-A start #RP repetitive element start..end

   The #TA, #PA and #RP fields are optional (some sequences won't have them), and #RP may appear more than once, but always at the end of the line. (The RP fields describe areas in the sequence that are annotated as repetitive elements.) For example:

      >AF070670 #DE Homo sapiens protein phosphatase 2C alpha 2 mRNA, complete cds. #LN 2100 #CD 360..1334 #TA 328 #PA 2046 #RP 122..133 #RP 1874..1899

   An example of such a file is available on the course webpage (the EHD_nucleotide.richFasta file).

   a. Store all data from the file in a complex data structure of your choice.

   b. Print the DEscription of all sequences with a TAta box.

   c. Ask the user for a length and print all accessions of sequences shorter than that length.

   d*. Print the lengths of the proteins coded by the mRNAs (number of amino acids in the Coding Sequence), and add a note for every protein with two or more RPs.

   e*. Ask the user for a word and print all accessions of sequences whose description contains that word. For example, "phosphatase" appears in the header above.
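
Hint for exercise 3: the nested structure described there (an outer hash keyed by Synonym, an inner hash per protein, and an array reference under "location") could be built from a single PTT line roughly as in the sketch below. This is only one possible approach, not the required solution. The sub name parse_ptt_line is hypothetical and is not part of the exercise. The sketch assumes that columns are tab-separated as in the format note of exercise 2, and that any line whose first column is not of the form <start>..<end> (for example a header line) should simply be skipped.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of the exercise): turn one PTT line into
# a (synonym, record) pair, or return an empty list for non-data lines.
sub parse_ptt_line {
    my ($line) = @_;
    chomp $line;

    # Nine tab-separated columns, as listed in the format note of exercise 2.
    my @fields = split /\t/, $line, 9;
    return unless @fields == 9;

    # The first column is "<start>..<end>"; anything else is assumed
    # to be a header or malformed line and is skipped.
    my ($start, $end) = $fields[0] =~ /^(\d+)\.\.(\d+)$/
        or return;

    my %record = (
        strand   => $fields[1],
        length   => $fields[2],
        product  => $fields[8],
        location => [ $start, $end ],    # reference to a two-element array
    );
    return ($fields[5], \%record);       # column 6 is the Synonym
}

# The example line from the exercise, rebuilt here with real tabs.
my $example = join "\t",
    '2801..3733', '+', '310', '16127997', 'thrB',
    'b0003', '-', 'COG0083E', 'homoserine kinase';

my %pttHash;
my ($synonym, $record) = parse_ptt_line($example);
$pttHash{$synonym} = $record if defined $synonym;

# Access the nested values with the same syntax used in the exercise text.
for my $key (sort keys %pttHash) {
    printf "%s: strand %s, length %d aa, product %s, location %d..%d\n",
        $key,
        $pttHash{$key}->{"strand"},
        $pttHash{$key}->{"length"},
        $pttHash{$key}->{"product"},
        $pttHash{$key}->{"location"}->[0],
        $pttHash{$key}->{"location"}->[1];
}
```

In a full script the same sub would be called once per line while reading the PTT file. The filters asked for in 3a and 3b then become simple conditions on $pttHash{$key}->{"length"}, $pttHash{$key}->{"strand"} and the two "location" elements inside the loop.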