Handout & Homework

Genomics (Ecol 553L) Computational Lab
Week 11: Oct 30, Nov 1
Course webpage: http://genomics.arizona.edu/553/
Topics: Perl subroutines and modules
In-class exercises:
1) Copy /genome/student/ecol553_week12 to your home directory.
2) Rewrite the seq_len.pl script to use Bio::SeqIO and name the new script
bp_seq_len.pl. How many lines of code do you need and how does this
compare to the older seq_len.pl script we wrote?
3) Using bp_seq_len.pl as a starting point, add the rev_com subroutine
from week11, and use Bio::SeqIO to write an output file that contains
reverse-complemented sequences, with each sequence name
prepended with “RC1_”. Name the new sequence file with “RC1_” as
the filename prefix. Name the script bp_seq_rc1.pl.
4) Go to http://doc.bioperl.org and find the documentation for the revcom
method in the Bio::Seq module. Remember to also check the PrimarySeq and
PrimarySeqI modules listed under Included or Inherited modules on the
BioPerl doc pages.
5) Using bp_seq_rc1.pl as a starting point, remove the rev_com subroutine
and figure out how to call a Bio::Seq method to reverse complement the
sequences. Name the new script bp_seq_rc2.pl and prepend each
output sequence name with “RC2_”. Name the new sequence file with
“RC2_” as the filename prefix.
6) Use the Unix diff command to compare the outputs of bp_seq_rc1.pl and
bp_seq_rc2.pl.
7) Add an argument specifying the sequence file format (e.g. fasta,
genbank, etc.) to each of the bp_*.pl scripts to make them more flexible.
How do we pass the format to BioPerl?
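As a starting point for exercises 2 and 7, here is a minimal sketch of reading a sequence file with Bio::SeqIO when the format is supplied by the caller. The helper name seq_lengths is hypothetical and not part of the course scripts. The sketch assumes BioPerl is installed and uses the -file and -format arguments of the Bio::SeqIO constructor.

```perl
use strict;
use warnings;
use Bio::SeqIO;

# Hypothetical helper (not from the course scripts): return one
# [id, length] pair for each record in $file, parsed as $format
# (for example 'fasta' or 'genbank').
sub seq_lengths {
    my ($file, $format) = @_;
    my $in = Bio::SeqIO->new(-file => $file, -format => $format);
    my @rows;
    while (my $seq = $in->next_seq) {
        push @rows, [ $seq->display_id, $seq->length ];
    }
    return @rows;
}
```

In a script, the file name and format could be taken from @ARGV, for example my ($file, $format) = @ARGV; with 'fasta' as a fallback when no format is given.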
Homework
Calculate theta and pi for an input alignment
Create a Perl program named piTheta.pl that takes a file name on the
command line. This file will be a simplified aligned fasta file. Each sequence will
be on a single line so that you can use the following pseudo-code to read in the
file:
Foreach line in the file
    If the first char is not ‘>’
        Push the line onto a sequence array
    End if
End foreach
The sequences are also guaranteed to be the same length.
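The pseudo-code above might be written in Perl roughly as follows. This is a sketch only: the helper name read_alignment is hypothetical, and skipping blank lines and stripping Windows line endings are assumptions the assignment does not state.

```perl
use strict;
use warnings;

# Hypothetical helper: collect the sequence lines of a simplified
# aligned fasta file (one line per sequence), skipping '>' headers.
sub read_alignment {
    my ($file) = @_;
    open(my $fh, '<', $file) or die "Cannot open $file: $!\n";
    my @seqs;
    while (my $line = <$fh>) {
        chomp $line;
        $line =~ s/\r$//;                    # assumption: tolerate Windows line endings
        next if $line eq '';                 # assumption: ignore blank lines
        next if substr($line, 0, 1) eq '>';  # header line, not sequence
        push @seqs, $line;
    }
    close $fh;
    return @seqs;
}
```

A script would typically call it as my @seqs = read_alignment($ARGV[0]); after checking that a file name was supplied.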
You will need to find the columns that are variable, calculate the nucleotide
ratios, etc. It might be helpful to keep an auxiliary array with one element per
column of the alignment, where each element indicates whether that column is
variable or not.
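One possible shape for such an auxiliary array is sketched below. The helper names variable_flags and variable_only are hypothetical. The sketch treats a column as variable when it holds more than one distinct character, compares characters case-insensitively, and counts a gap character as a distinct state; all three choices are assumptions.

```perl
use strict;
use warnings;

# Hypothetical helper: return one flag per alignment column,
# 1 if the column holds more than one distinct character, else 0.
sub variable_flags {
    my @seqs = @_;
    return () unless @seqs;          # empty alignment has no columns
    my $len = length $seqs[0];       # all sequences share this length
    my @flags;
    for my $col (0 .. $len - 1) {
        my %seen;
        $seen{ uc substr($_, $col, 1) }++ for @seqs;
        push @flags, (scalar(keys %seen) > 1) ? 1 : 0;
    }
    return @flags;
}

# Hypothetical helper: keep only the flagged columns of one sequence.
sub variable_only {
    my ($seq, @flags) = @_;
    return join '', map { substr($seq, $_, 1) }
                    grep { $flags[$_] } 0 .. $#flags;
}

# Toy alignment (made-up data) to show the two helpers together.
my @seqs  = ('ACGT', 'ACGA', 'ATGA');
my @flags = variable_flags(@seqs);
print STDERR variable_only($_, @flags), "\n" for @seqs;
```

The same flag array can drive both the count of variable columns and the reduced alignment, so each column is inspected only once.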
Your output to standard out should look like:
pi: <PI>\tvar:<vartheta>
theta: <THETA>\tvar:<vartheta>
To standard error you should output the alignment restricted to its variable
columns.
For a reminder of the formulas for pi and theta, look at Dr. Whiteman's slides from
23 Oct.
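For orientation only, the sketch below computes the two estimators under commonly used textbook definitions, which are an assumption here and may differ from the slides (for instance in per-site scaling). It takes pi as the average number of pairwise differences and theta as S / a_n, where S is the number of variable columns and a_n = 1 + 1/2 + ... + 1/(n-1) for n sequences. The helper name pi_and_theta is hypothetical, and the variance terms are not covered.

```perl
use strict;
use warnings;

# Assumed definitions (textbook forms, not taken from the slides):
#   pi    = total pairwise differences / number of sequence pairs
#   theta = S / a_n, with a_n = 1 + 1/2 + ... + 1/(n-1)
sub pi_and_theta {
    my @seqs = @_;
    my $n = @seqs;
    die "Need at least two sequences\n" if $n < 2;
    my $len   = length $seqs[0];
    my $pairs = $n * ($n - 1) / 2;
    my ($diffs, $S) = (0, 0);
    for my $col (0 .. $len - 1) {
        my %count;
        $count{ uc substr($_, $col, 1) }++ for @seqs;
        next if scalar(keys %count) == 1;      # invariant column
        $S++;
        # Pairs that match at this column: sum of c*(c-1)/2 per character.
        my $same = 0;
        $same += $_ * ($_ - 1) / 2 for values %count;
        $diffs += $pairs - $same;              # pairs that differ here
    }
    my $a_n = 0;
    $a_n += 1 / $_ for 1 .. $n - 1;
    return ($diffs / $pairs, $S / $a_n);
}

# Toy alignment (made-up data).
my ($pi, $theta) = pi_and_theta('ACGT', 'ACGA', 'ATGA');
printf "pi: %.4f\ntheta: %.4f\n", $pi, $theta;
```

Counting characters per column avoids comparing every pair of sequences directly, which keeps the work proportional to the alignment size.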