Perl Laboratory Study Guide – Section I

1. Getting started



Everyone should use ssh2 to connect to Watson, at: watson.ecs.baylor.edu
Everyone is sharing a user space, so create a subdirectory using your name. The remainder of this
work should be done within your subdirectory.
To make sure that everything is working, write a simple perl script. Your script may be named
anything you wish, but must have the .pl extension: for example, test.pl. Watson has vi, pico,
and emacs editors. If you run the script with the ‘perl’ executable, there is no need to make your program executable.
#!/usr/bin/perl -w
$DNA = "ATCGATGA";
print $DNA, "\n";
exit;


Perl compiles at run time. At the command prompt, run the program by typing: perl test.pl
The result should be: ATCGATGA
Question: Is the command ‘perl’ required to run this program? Also, is the first line of the script, ‘#!/usr/bin/perl’, required?
2. Getting help


Perl comes with many built-in help pages. In order to get familiar with these vast resources, explore
the man pages. The main perl man page has a list of some helpful links.
o man perl
o man perlintro
o man perltoc
In addition, perl comes with many built-in functions and the help pages that describe them. Try
finding some common functions.
o perldoc -f reverse
o perldoc -f push
o perldoc -f shift
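As a quick orientation before reading the perldoc pages, the short sketch below shows what these three functions do on a small array. The array name @bases and its contents are made up for illustration and are not part of the lab files.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample data, not part of the lab files.
my @bases = ('A', 'T', 'C');

push @bases, 'G';                 # append to the end: A T C G
my $first = shift @bases;         # remove and return the first element: 'A'
my @backwards = reverse @bases;   # list context: reversed copy, G C T

# In scalar context, reverse joins its arguments and reverses the characters.
my $reversed_string = scalar reverse 'ATCG';   # 'GCTA'

print "first: $first\n";
print "remaining: @bases\n";
print "reversed list: @backwards\n";
print "reversed string: $reversed_string\n";
```

Note that reverse behaves differently in list and scalar context, which is worth checking in perldoc -f reverse.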
3. Getting a FASTA reference file





Because we are going to use FASTA files as practice sets for opening and writing to files, we need to
get a test file. The easiest way to do this is to point your browser to the NCBI homepage:
http://www.ncbi.nlm.nih.gov/
Search Entrez for your favorite gene. (I have many favorites; if you can’t think of one, try prkr or
cos1.)
On the results page, follow the link for the protein database. If one doesn’t exist, pick another gene.
Using the drop-down menu, display the file in FASTA format and save it to your [yourname]/
directory. We will use this file to test our perl scripts.
Repeat this process using a nucleotide file.
4. Printing the contents of a file
The object of this section is to use perl to output the contents of a file to the screen using several
different approaches. In each case, your script should open a filename given at a prompt and should
include error catching. Save each step as a separate file under [yourname]/. Name each file
appropriately: ex4-1.pl, ex4-2.pl, ex4-3.pl, etc.
4-1. Use a while (<FH>) loop.
> protein name | number | length
ACHYTCAHCYACHSGCETYAGCYSTGCA
ACTGACTACSHACSYFLASCHUICECIQUH
4-2. Use an array to produce the same results as 4-1.
4-3. Use an array that concatenates every line into one single line, removing all special end-of-line characters and white space. This line might come in handy: $seq =~ s/[\s\r\n\t]//g;
>proteinname|number|lengthACHYTCAHCYACHSGCETYAGCYSTGCAACTGACTACSHACSYF
LASCHUICECIQUH
4-4. Modify 4-3 to (1) take the file name directly from the command line and (2) create a single
line that does not include the FASTA header line. This method usually requires that you know
something about regular expressions: $seq =~ m/text/i is an example of a regular
expression. Because we know that by definition the first line of a FASTA format must include
a ‘>’, we can write a regular expression that will skip this line: if ($seq !~ /^>/) {…}
ACHYTCAHCYACHSGCETYAGCYSTGCAACTGACTACSHACSYFLASCHUICECIQUH
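The header-skipping idea from 4-4 can be sketched as follows. This is one possible approach, not the required solution: it reads line by line rather than through an array, and it uses an in-memory filehandle with a made-up two-line record so that it runs without a file on disk. The subroutine name read_sequence is hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Read a FASTA record from an open filehandle and return the sequence
# as one string, without the header line or any whitespace.
sub read_sequence {
    my ($fh) = @_;
    my $seq = '';
    while (my $line = <$fh>) {
        next if $line =~ /^>/;   # skip the FASTA header line
        $line =~ s/\s+//g;       # strip newlines, carriage returns, tabs, spaces
        $seq .= $line;
    }
    return $seq;
}

# Hypothetical record, read through an in-memory filehandle.
my $fasta = ">protein name | number | length\nACHYTCAHCY\nACTGACTACS\n";
open(my $in, '<', \$fasta) or die "Error: $!";
my $sequence = read_sequence($in);
close $in;

print "$sequence\n";   # ACHYTCAHCYACTGACTACS
```

To read a real file named on the command line, the open line would become something like open(my $in, '<', $ARGV[0]) or die "Error: $!";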
5. Determining the frequency of nucleotides



To get more comfortable with perl data structures, it helps to write a few short scripts that count bases
or amino acids. Create a script named ex5-1.pl that accomplishes the same task as
the script written for example 4-4. If everything is working correctly, you will type at the prompt:
perl ex5-1.pl filename.fasta and the result will be:
ATCGATCGATCAGTCGATCGATGCATCGATCGCTGATGATCGTCGATCGATCGATCGATCGTACGATC
GATCGATCGATCGATCGATCGCTGACTGATAGCTACGTACGATGACGT
Now, we’re going to alter ex5-1.pl to count the numbers of A’s, G’s, C’s, and T’s. First,
incorporate a command that splits this single string into a large array; for example, @dna =
split('', $string);
Make sure your program initializes its counters.
$A_Number = 0;
$C_Number = 0;
$G_Number = 0;
$T_Number = 0;
$Error = 0;
Loop through the bases, keeping count of the appropriate number of nucleotides.
foreach $base (@dna) {
    if    ($base eq 'A') { ++$A_Number; }
    elsif ($base eq 'C') { ++$C_Number; }
    elsif ($base eq 'G') { ++$G_Number; }
    elsif ($base eq 'T') { ++$T_Number; }
    else {
        print "Error: I don't recognize the bases\n";
        ++$Error;
    }
}


Perl has many built-in shortcuts that make this code shorter, though also harder to read at first.
For example, in the first line above, the loop assigns each element of @dna to the temporary variable
$base, but only because I have specified that variable. If I left out $base, perl would assign each
element to the default variable $_. The loop would then begin foreach (@dna) { and the first test
would read: if ($_ eq 'A') { ++$A_Number; }
Another shortcut is implicit pattern matching. Instead of asking whether $_ is eq to 'A', we could
ask whether the pattern A is found in the string: if ($_ =~ m/A/) {…}. Because $_ is also the
default target of a pattern match, we can leave it out entirely: if (/A/) {…}. This may be a bit
confusing at first. Rewrite ex5-1.pl as ex5-2.pl, using this shorthand approach.
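To make the shorthand concrete, here is a minimal sketch that counts only the A’s in a made-up string, once with the explicit $_ comparison and once with the bare pattern. The variable names $explicit_count and $implicit_count and the test string are assumptions for illustration; your ex5-2.pl will need all four bases and the error counter.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical test sequence, split into single characters.
my @dna = split('', 'GATTACA');

my $explicit_count = 0;
my $implicit_count = 0;

foreach (@dna) {                        # no loop variable, so each element lands in $_
    ++$explicit_count if $_ eq 'A';     # explicit comparison against $_
    ++$implicit_count if /A/;           # bare pattern: matched against $_ by default
}

print "explicit: $explicit_count\n";    # explicit: 3
print "implicit: $implicit_count\n";    # implicit: 3
```

The two tests agree here only because every element is a single character; on longer strings, /A/ matches any string that contains an A, while eq requires the whole string to be 'A'.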
Next, we are going to count the number of bases without looping through an array, i.e. keeping the
sequence as a string. Use this method for script ex5-3.pl.
for ($position = 0; $position < length($dna); ++$position) {
    $base = substr($dna, $position, 1);   # the single character at $position
    while ($base =~ /a/gi) { ++$A_Number; }
    while ($base =~ /g/gi) { ++$G_Number; }
    while ($base =~ /c/gi) { ++$C_Number; }
    while ($base =~ /t/gi) { ++$T_Number; }
    # Use 'if' here: 'while' with !~ would never become false for an
    # unrecognized base and would loop forever.
    if ($base !~ /[acgt]/i) { ++$Error; }
}

Finally, we are going to count the number of A’s, G’s, C’s, and T’s in the DNA string using the
transliteration operator. Remember from lecture: $DNA =~ tr/AGCT/TCGA/; also, tr/// is the
same as y///. In our version of perl, it is easy to use this operator to count the occurrences of any
character, because it returns the number of characters it matched; assign that return value to a
scalar. Create a script, ex5-4.pl, that uses this approach to count and display the number of A’s,
C’s, G’s, and T’s in your DNA sequence.
For example:
$A_Number = $DNA =~ y/A//;
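A slightly fuller sketch of the same idea, using the test string from Section 1, is shown below. The /c (complement) line for counting unrecognized characters is an addition not covered above, so treat it as an optional extra.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $DNA = 'ATCGATGA';

# With an empty replacement list and no /d modifier, tr/// leaves the
# string unchanged and returns the number of matching characters.
my $A_Number = ($DNA =~ tr/A//);
my $C_Number = ($DNA =~ tr/C//);
my $G_Number = ($DNA =~ tr/G//);
my $T_Number = ($DNA =~ tr/T//);

# /c complements the search list: count every character that is not A, C, G, or T.
my $Error = ($DNA =~ tr/ACGT//c);

print "A=$A_Number C=$C_Number G=$G_Number T=$T_Number errors=$Error\n";
# A=3 C=1 G=2 T=2 errors=0
```

The parentheses are not strictly needed, since =~ binds more tightly than =, but they make the order of evaluation easier to see.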
6. Writing out to files



In this section you will learn to write text to a file. First, copy ex5-4.pl to ex6-1.pl
Add a line that takes an output filename from the command line. For example, the command line
should be something like: perl ex6-1.pl infile.fasta outfile.txt
At the end of the script, add a couple of lines that open, and write to, a results file. Below is an
example of what writing to a file might look like. Notice that the outfile name is preceded by ‘>’, which
opens the file for writing: the file is created if it does not exist and overwritten if it does.
open(RESULTFILE, ">$outfile") or die("Error: $!");
print RESULTFILE "The results are overwriting everything that existed in $outfile\n";
close RESULTFILE;
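The example above uses the two-argument form of open with a bareword filehandle. Many perl programmers prefer the three-argument form with a lexical filehandle, which keeps the mode separate from the file name. The sketch below shows that variant, along with ‘>>’ for appending. The counts in %count are made up, and a temporary file from the core File::Temp module stands in for the outfile name that ex6-1.pl would take from the command line.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Stand-in for the output file name given on the command line.
my ($tmp_fh, $outfile) = tempfile(UNLINK => 1);
close $tmp_fh;

# Hypothetical counts, as ex5-4.pl might have produced them.
my %count = (A => 3, C => 1, G => 2, T => 2);

# '>' creates the file, or truncates it if it already exists.
open(my $out, '>', $outfile) or die "Error: cannot write $outfile: $!";
foreach my $base (sort keys %count) {
    print {$out} "$base\t$count{$base}\n";
}
close($out) or die "Error: cannot close $outfile: $!";

# '>>' appends to the end of the file instead of overwriting it.
open($out, '>>', $outfile) or die "Error: cannot append to $outfile: $!";
print {$out} "done\n";
close($out) or die "Error: cannot close $outfile: $!";

# Read the file back to confirm what was written.
open(my $in, '<', $outfile) or die "Error: cannot read $outfile: $!";
my @lines = <$in>;
close $in;
print @lines;
```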

Use this opportunity to explore some of perl’s special variables.
o What does the variable $0 hold?
o Print out the contents of @ARGV
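One way to explore these two variables is a short script like the sketch below; run it with different arguments and compare the output. The variable names $arg_count, $infile, and $outfile are assumptions chosen to match the command line used in this section.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# $0 holds the name of the program as it was invoked.
print "Running: $0\n";

# @ARGV holds the command-line arguments, not including the script name.
my $arg_count = scalar @ARGV;
print "Number of arguments: $arg_count\n";
foreach my $i (0 .. $#ARGV) {
    print "ARGV[$i] = $ARGV[$i]\n";
}

# Copy the first two arguments into named variables;
# either one is undef if it was not supplied.
my ($infile, $outfile) = @ARGV;
print "Usage: perl $0 infile.fasta outfile.txt\n" unless defined $outfile;
```

For example, try: perl thisscript.pl infile.fasta outfile.txt (where thisscript.pl is whatever name you saved the sketch under).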