Perl exercise 5
(Due on 28/12/2010)
Don't forget to write well-organized scripts, use meaningful variable names, and write comments about
what you are doing in each part of the code. Always debug your scripts to make sure they perform as you
planned.
Hashes
1. Write a script that reads a FASTA file (as an example you may use Ecoli.prot.fasta) and creates a
hash in which:
a. The gi number (from the header) is the key, and the value is the ref accession.
For example for the first header:
>gi|16127995|ref|NP_414542.1| thr operon leader peptide...
The key should be "16127995" and the value should be "NP_414542.1".
Ask the user for a gi number and print its ref accession.
b. The gi number (from the header) is the key, and the value is the full sequence of the protein.
For example for the first (short) protein:
>gi|16127995|ref|NP_414542.1| thr operon leader peptide...
MKRISTTITTTITITTGNGAG
The key should be "16127995" and the value should be "MKRISTTITTTITITTGNGAG".
Note that for long sequences the value should be the full sequence of the protein (even if it
contains multiple lines).
As before, ask the user for a gi number and print its sequence.
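One possible shape for exercise 1 is sketched below. It is only an illustration, not the expected solution: the input is an in-memory string instead of Ecoli.prot.fasta, the second record is invented to show a multi-line sequence, and the names %accession_of, %sequence_of and $query are hypothetical. A real script would open the file and read the gi number from <STDIN>.

```perl
use strict;
use warnings;

# First record: the example quoted in the exercise.
# Second record: invented, only to demonstrate a sequence spanning two lines.
my $fasta = <<'END';
>gi|16127995|ref|NP_414542.1| thr operon leader peptide...
MKRISTTITTTITITTGNGAG
>gi|00000001|ref|XX_000001.1| invented two-line protein
MKVLAAGIVG
ALTRQWERTY
END

my (%accession_of, %sequence_of);   # gi => ref accession, gi => full sequence
my $current_gi;                     # gi number of the record being read

# Opening a reference to a scalar gives an in-memory filehandle.
open my $in, '<', \$fasta or die "Cannot open in-memory FASTA: $!";
while (my $line = <$in>) {
    chomp $line;
    next if $line eq '';
    if ($line =~ /^>gi\|(\d+)\|ref\|([^|]+)\|/) {
        # Header line: remember the gi number and start an empty sequence.
        $current_gi = $1;
        $accession_of{$current_gi} = $2;
        $sequence_of{$current_gi}  = '';
    }
    elsif (defined $current_gi) {
        # Sequence line: append it, so multi-line sequences are joined.
        $sequence_of{$current_gi} .= $line;
    }
}
close $in;

my $query = '16127995';   # stand-in for a value read from <STDIN> and chomped
if (exists $accession_of{$query}) {
    print "Accession: $accession_of{$query}\n";
    print "Sequence:  $sequence_of{$query}\n";
}
else {
    print "No record with gi number $query\n";
}
```

Checking with exists before printing avoids an "uninitialized value" warning when the user types a gi number that is not in the file.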
2. Write a script that reads a PTT file (as an example you may use Ecoli.ptt) and creates a hash in
which the key is the Synonym (in E.coli it is the value that starts with b followed by 4 digits) and
the value is the Product. For example, for the following line:
2801..3733	+	310	16127997	thrB	b0003	-	COG0083E	homoserine kinase
The key should be "b0003" and the value should be "homoserine kinase".
Ask the user for a Synonym and print its Product description.
Note: The format of each line in the table is:
<Start location>..<End location>\t<strand>\t<length in aa>\t
<GI(= PID)>\t<Gene Symbol>\t<Synonym>\t<Code (not relevant for E.coli)>\t
<COG accession>\t<Product description>
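The tab-separated format above maps naturally onto split. The sketch below is one way to do it, under stated assumptions: the single row is the example from the exercise, rebuilt with explicit tabs rather than read from Ecoli.ptt, and %product_of and $query are hypothetical names. Lines whose first field is not a location (for example, any title lines at the top of a real file) are skipped.

```perl
use strict;
use warnings;

# The example row, rebuilt with explicit tab separators.
my @lines = (
    join("\t", '2801..3733', '+', 310, 16127997, 'thrB', 'b0003',
               '-', 'COG0083E', 'homoserine kinase'),
);

my %product_of;   # Synonym => Product description
for my $line (@lines) {
    chomp $line;
    my @fields = split /\t/, $line;
    # Keep only data rows: nine fields, the first one being start..end.
    next unless @fields == 9 && $fields[0] =~ /^\d+\.\.\d+$/;
    my ($synonym, $product) = @fields[5, 8];
    $product_of{$synonym} = $product;
}

my $query = 'b0003';   # stand-in for a value read from <STDIN> and chomped
if (exists $product_of{$query}) {
    print "$query: $product_of{$query}\n";
}
else {
    print "Unknown synonym $query\n";
}
```

The array slice @fields[5, 8] picks the sixth and ninth columns, which correspond to <Synonym> and <Product description> in the format note.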
Complex data structures
3. Write a script that reads a PTT file (as an example you may use Ecoli.ptt) and creates a complex
data structure:
a. The key of the outer hash is the Synonym, and the inner hash stores the strand, the
length and the product, such that for the following line:
2801..3733	+	310	16127997	thrB	b0003	-	COG0083E	homoserine kinase
the value of $pttHash{b0003}->{"strand"} is +
the value of $pttHash{b0003}->{"length"} is 310
the value of $pttHash{b0003}->{"product"} is homoserine kinase
Iterate over your data structure and print the synonym and product of all the proteins
longer than 1400 aa encoded on the positive strand.
b. Add to the inner hash of 3a a key "location" that holds a reference to an array with
two elements: the first is the start position and the second is the end position. For
example, for the line:
2801..3733	+	310	16127997	thrB	b0003	-	COG0083E	homoserine kinase
the value of $pttHash{b0003}->{"location"}->[0] is 2801, and
the value of $pttHash{b0003}->{"location"}->[1] is 3733.
Receive a start value and an end value from the user and print the location,
synonym and product of the proteins that start after the start value provided by the
user, and end before the end value provided by the user.
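The nested structure of 3a and 3b could be built roughly as follows. This is a sketch under assumptions, not a reference solution: the first row is the example from the exercise, the other two rows are invented so that the filters have something to select, and the bounds in $from and $to stand in for values a real script would read from the user.

```perl
use strict;
use warnings;

my @lines = (
    # Example row from the exercise.
    join("\t", '2801..3733', '+', 310, 16127997, 'thrB', 'b0003',
               '-', 'COG0083E', 'homoserine kinase'),
    # Invented rows, for illustration only.
    join("\t", '5000..9502', '+', 1500, 1, 'fooA', 'b9001',
               '-', '-', 'made-up long protein'),
    join("\t", '9600..14102', '-', 1500, 2, 'fooB', 'b9002',
               '-', '-', 'made-up minus-strand protein'),
);

my %pttHash;   # Synonym => { strand, length, product, location => [start, end] }
for my $line (@lines) {
    my ($location, $strand, $length, $pid, $gene, $synonym, $code, $cog, $product)
        = split /\t/, $line;
    # Skip anything that is not a complete data row.
    next unless defined $product && $location =~ /^(\d+)\.\.(\d+)$/;
    $pttHash{$synonym} = {
        strand   => $strand,
        length   => $length,
        product  => $product,
        location => [ $1, $2 ],   # anonymous array: start, end
    };
}

# 3a: proteins longer than 1400 aa on the positive strand.
my @long_plus = grep {
    $pttHash{$_}{strand} eq '+' && $pttHash{$_}{length} > 1400
} sort keys %pttHash;
print "$_\t$pttHash{$_}{product}\n" for @long_plus;

# 3b: proteins that start after $from and end before $to.
my ($from, $to) = (2000, 10000);   # stand-ins for user input
my @in_range = grep {
    $pttHash{$_}{location}[0] > $from && $pttHash{$_}{location}[1] < $to
} sort keys %pttHash;
for my $synonym (@in_range) {
    my ($start, $end) = @{ $pttHash{$synonym}{location} };
    print "$start..$end\t$synonym\t$pttHash{$synonym}{product}\n";
}
```

Between adjacent subscripts the arrow is optional, so $pttHash{b0003}{location}[0] and $pttHash{b0003}->{"location"}->[0] refer to the same element.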
4. Write a script that reads a FASTA file of mRNA sequences, where the FASTA header line of
each sequence has the following format:
>accession #DE description #LN length #CD coding sequence start..end #TA TATA-box #PA poly-A start #RP repetitive element start..end
The fields of #PA, #TA and #RP are optional (some sequences won’t have them), and #RP may
appear more than once, but always at the end of the line. (The RP fields describe areas in the
sequence that are annotated as repetitive elements)
For example:
>AF070670 #DE Homo sapiens protein phosphatase 2C alpha 2 mRNA, complete cds. #LN 2100 #CD 360..1334 #TA 328 #PA 2046 #RP 122..133 #RP 1874..1899
An example of such a file is available on the course webpage (the EHD_nucleotide.richFasta file).
a. Store all data from the file in a complex data structure of your choice.
b. Print the DEscription of all sequences with a TAta box.
c. Ask the user for a length and print all accessions of sequences shorter than that length.
d*. Print the lengths of the proteins coded by the mRNAs (number of amino acids in the Coding
Sequence), and add a note for every protein with two or more RPs.
e*. Ask the user for a word and print all accessions of sequences whose description contains that
word. For example, "phosphatase" appears in the header above.
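Parsing one header of exercise 4 might look like the sketch below. It handles only the header line (the sequence lines are left out), uses the example header quoted above, and the layout of %record is just one hypothetical choice of data structure. In a full script each record would be stored in an outer hash keyed by accession. The codon count assumes the #CD range is inclusive; whether it also covers the stop codon is not stated in the exercise, so subtract one if it does.

```perl
use strict;
use warnings;

my $header = '>AF070670 #DE Homo sapiens protein phosphatase 2C alpha 2 mRNA, '
           . 'complete cds. #LN 2100 #CD 360..1334 #TA 328 #PA 2046 '
           . '#RP 122..133 #RP 1874..1899';

# Everything before the first " #" is the accession; the rest are tagged fields.
my ($accession, @tagged) = split /\s+#/, $header;
$accession =~ s/^>//;

my %record = (accession => $accession, RP => []);
for my $field (@tagged) {
    next unless $field =~ /^([A-Z]{2})\s+(.*?)\s*$/;
    my ($tag, $value) = ($1, $2);
    if ($tag eq 'RP') {
        # #RP may repeat, so collect every range in an array of [start, end].
        push @{ $record{RP} }, [ split /\.\./, $value ];
    }
    elsif ($tag eq 'CD') {
        $record{CD} = [ split /\.\./, $value ];
    }
    else {
        $record{$tag} = $value;   # DE, LN, TA, PA
    }
}

# b: print the description only when the optional #TA field is present.
print "$record{DE}\n" if exists $record{TA};

# d*: number of codons in the coding sequence (see the assumption above).
my ($cd_start, $cd_end) = @{ $record{CD} };
my $codons = ($cd_end - $cd_start + 1) / 3;
my $note   = @{ $record{RP} } >= 2 ? ' (two or more RPs)' : '';
print "$record{accession}: $codons codons$note\n";

# e*: case-insensitive whole-word search in the description.
my $word     = 'phosphatase';   # stand-in for user input
my $has_word = $record{DE} =~ /\b\Q$word\E\b/i ? 1 : 0;
print "$record{accession}\n" if $has_word;
```

The \Q...\E pair quotes the user's word so that characters such as "." or "+" are matched literally rather than interpreted as regular expression syntax.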