Perl exercise 5
(Due on 01/06/2009)
Remember to write well-organized scripts, use meaningful variable names, and comment
each part of your code to explain what it does. Always test your script on several examples. If you
need more biological sequences, search for appropriate examples in GenBank.
Hashes
1. Change the solution for class exercise 4 question 1 (reading names and expenses from a file):
After reading the input file, ask the user for a name (from STDIN) and print the sum of
expenses for that name (use a hash).
2. Write a script that reads a FASTA file and stores the sequences in a hash. The header line
should be the key and the sequence should be the value. Now ask the user for a header line and
extract the sequence from the hash.
3. Write a script that reads a GenBank record and stores the title of each paper in a hash, with the
last names of the authors as the keys (each title should be stored once for each of its authors).
Now ask the user for an author's last name and print the paper by this author. If an
author appears on more than one paper, print only one of those papers (there is no need to
store all of this author's papers).
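As a rough illustration of the hash pattern these questions rely on, the sketch below reads FASTA records from a filehandle into a hash keyed by header line and then looks one up. The sub name read_fasta and the in-memory sample data are assumptions made for this example; they are not part of the course material, and a real script would open the input file and read the header from STDIN.

```perl
use strict;
use warnings;

# Hypothetical helper: read FASTA records from an open filehandle and
# return a hash reference mapping each header line (without '>') to its sequence.
sub read_fasta {
    my ($fh) = @_;
    my (%seq_of, $header);
    while (my $line = <$fh>) {
        chomp $line;
        next if $line =~ /^\s*$/;          # skip blank lines
        if ($line =~ /^>(.*)/) {
            $header = $1;                  # start of a new record
            $seq_of{$header} = '';
        }
        elsif (defined $header) {
            $seq_of{$header} .= $line;     # a sequence may span several lines
        }
    }
    return \%seq_of;
}

# Made-up sample data, read through an in-memory filehandle.
my $sample = ">seq1 first example\nACGT\nTTGA\n>seq2 second example\nGGCC\n";
open(my $fh, '<', \$sample) or die "Cannot open in-memory file: $!";
my $seq_of = read_fasta($fh);
close $fh;

# exists() distinguishes a missing key from an empty sequence.
my $query = 'seq1 first example';
if (exists $seq_of->{$query}) {
    print "$seq_of->{$query}\n";
}
else {
    print "No sequence with header '$query'\n";
}
```

Checking with exists before printing avoids an "uninitialized value" warning when the user types a header that is not in the file.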
Complex data structures
4. Write a script that reads a file where each line contains a sample number followed by any
number of protein-level measurements, for example:
104 0.4322 0.3992 0.4832
Store an array of measurements for each sample in a hash, where the sample number is the key.
Ask the user for a sample number and print a sorted list of measurements.
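One possible shape for this structure is a hash whose values are array references. The sketch below is only an illustration under assumptions: the sub name parse_measurements is hypothetical, the lines are passed in directly rather than read from a file, and the second sample line is made up.

```perl
use strict;
use warnings;

# Hypothetical helper: turn lines of "sample measurement measurement ..."
# into a hash of array references keyed by sample number.
sub parse_measurements {
    my (@lines) = @_;
    my %measurements_of;
    for my $line (@lines) {
        my ($sample, @values) = split ' ', $line;
        next unless defined $sample;             # skip empty lines
        push @{ $measurements_of{$sample} }, @values;
    }
    return \%measurements_of;
}

my $data = parse_measurements(
    "104 0.4322 0.3992 0.4832",   # the example line from the question
    "105 0.9 0.1",                # made-up second sample
);

# <=> sorts numerically; the default sort would compare the values as strings.
my @sorted = sort { $a <=> $b } @{ $data->{104} };
print "@sorted\n";
```

Using push on the dereferenced hash value lets Perl create the inner array on first use, so no separate initialization step is needed for each sample.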
5. Write a script that reads a FASTA file of mRNA sequences, where the FASTA header line of
each sequence has the following format:
>accession #DE description #LN length #CD coding sequence start..end #TA TATA-box #PA poly-A start #RP repetitive element start..end
The #PA, #TA and #RP fields are optional (some sequences won’t have them), and #RP may
appear more than once, but always at the end of the line. (The #RP fields describe regions of the
sequence that are annotated as repetitive elements.)
For example:
>AF070670 #DE Homo sapiens protein phosphatase 2C alpha 2 mRNA, complete cds. #LN 2100 #CD 360..1334 #TA 328 #PA 2046 #RP 122..133 #RP 1874..1899
An example of such a file is available from the course webpage (the "rich-FASTA" file).
a. Store all data from the file in a complex data structure of your choice.
b. Ask the user for a length and print all accessions of sequences shorter than that length.
c. Print the lengths of the proteins coded by the mRNAs (number of amino acids), and add a
note for every protein with two or more RPs.
d. Ask the user for a word and print all accessions of sequences whose definition contains
that word. For example, "phosphatase" appears in the header above (hint: lesson 6 slides 14-15).
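As a starting point for part (a), the header format described above could be parsed into one hash per sequence. The sketch below is only one option and rests on assumptions: the sub name parse_header and the key names are hypothetical, the header is expected on a single line, and the description is assumed not to contain a '#' followed by two letters and a space.

```perl
use strict;
use warnings;

# Hypothetical helper: parse one rich-FASTA header line into a hash reference.
# Optional fields (TA, PA) are simply absent; RP is always an array reference.
sub parse_header {
    my ($header) = @_;
    my ($accession, $rest) = $header =~ /^>(\S+)\s*(.*)$/
        or return;
    my %record = (accession => $accession, RP => []);

    # Each field is a two-letter tag after '#', followed by its value,
    # which runs up to the next tag or the end of the line.
    while ($rest =~ /#(\w\w)\s+(.*?)\s*(?=#\w\w\s|$)/g) {
        my ($tag, $value) = ($1, $2);
        if ($tag eq 'RP' or $tag eq 'CD') {
            my ($start, $end) = $value =~ /^(\d+)\.\.(\d+)$/ or next;
            if ($tag eq 'RP') { push @{ $record{RP} }, [$start, $end] }
            else              { $record{CD} = [$start, $end] }
        }
        else {
            $record{$tag} = $value;
        }
    }
    return \%record;
}

# The example header from the exercise, joined into a single line.
my $record = parse_header(
    '>AF070670 #DE Homo sapiens protein phosphatase 2C alpha 2 mRNA, '
  . 'complete cds. #LN 2100 #CD 360..1334 #TA 328 #PA 2046 '
  . '#RP 122..133 #RP 1874..1899'
);
printf "%s: length %d, %d repetitive element(s)\n",
    $record->{accession}, $record->{LN}, scalar @{ $record->{RP} };
```

Storing each record under its accession in an outer hash, together with the sequence itself, would give a hash of hashes that parts (b) to (d) can loop over.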