Perl Laboratory Study Guide – Section I

1. Getting started

Everyone should use ssh2 to connect to Watson, at:

    watson.ecs.baylor.edu

Everyone is sharing a user space, so create a subdirectory using your name. The remainder of this work should be done within your subdirectory.

To make sure that everything is working, write a simple perl script. Your script may be named anything you wish, but must have the .pl extension: for example, test.pl. Watson has the vi, pico, and emacs editors. If you run the script through the ‘perl’ executable, there is no need to make your program executable.

    #!/usr/bin/perl -w
    $DNA = "ATCGATGA";
    print $DNA, "\n";
    exit;

Perl compiles at run time. At the command prompt, run the program by typing:

    perl test.pl

The result should be:

    ATCGATGA

Question: Is the command ‘perl’ required to run this program? Also, is the first line of the script, ‘#!/usr/bin/perl’, required?

2. Getting help

Perl comes with many built-in help pages. To get familiar with these vast resources, explore the man pages. The main perl man page has a list of some helpful links.

    o man perl
    o man perlintro
    o man perltoc

In addition, perl comes with many built-in functions and help pages that describe them. Try looking up some common functions.

    o perldoc -f reverse
    o perldoc -f push
    o perldoc -f shift

3. Getting a FASTA reference file

Because we are going to use FASTA files as practice sets for opening and writing to files, we need to get a test file. The easiest way to do this is to point your browser to the NCBI homepage:

    http://www.ncbi.nlm.nih.gov/

Search Entrez for your favorite gene. (I have many favorites; if you can’t think of one, try prkr or cos1.) On the results page, follow the link for the protein database. If one doesn’t exist, pick another gene. Using the drop-down menu, display the file in FASTA format and save it to your [yourname]/ directory. We will use this file to test our perl scripts. Repeat this process using a nucleotide file.

4. Printing the contents of a file

The object of this section is to use perl to output the contents of a file to the screen using several different approaches. In each case, your script should open a filename given at a prompt and should include error catching. Save each step as a separate file under [yourname]/. Name each file appropriately: ex4-1.pl, ex4-2.pl, ex4-3.pl, etc.

4-1. Use a while (<FH>) loop. The result should look like:

    > protein name | number | length
    ACHYTCAHCYACHSGCETYAGCYSTGCA
    ACTGACTACSHACSYFLASCHUICECIQUH

4-2. Use an array to produce the same results as 4-1.

4-3. Use an array that concatenates every line into one single line, removing all special end-of-line characters and white space. This line might come in handy:

    $seq =~ s/[\s\r\n\t ]//g;

The result should be a single line:

    >proteinname|number|lengthACHYTCAHCYACHSGCETYAGCYSTGCAACTGACTACSHACSYFLASCHUICECIQUH

4-4. Modify 4-3 to (1) take the file name directly from the command line and (2) create a single line that does not include the FASTA header line. This method usually requires that you know something about regular expressions: $seq =~ m/text/i is an example of a regular expression match. Because we know that, by definition, the first line of a FASTA file must begin with a ‘>’, we can write a regular expression that will skip this line:

    if ($seq !~ /^>/) {…}

The result should be:

    ACHYTCAHCYACHSGCETYAGCYSTGCAACTGACTACSHACSYFLASCHUICECIQUH

5. Determining the frequency of nucleotides

To get more comfortable with perl data structures, it helps to write a few short scripts that count bases or amino acids. Create a script named ex5-1.pl that accomplishes the same task as the script written for example 4-4. If everything is working correctly, you will type at the prompt:

    perl ex5-1.pl filename.fasta

and the result will be a single line:

    ATCGATCGATCAGTCGATCGATGCATCGATCGCTGATGATCGTCGATCGATCGATCGATCGTACGATCGATCGATCGATCGATCGATCGCTGACTGATAGCTACGTACGATGACGT

Now, we’re going to alter ex5-1.pl to count the numbers of A’s, G’s, C’s, and T’s.
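As a starting point for the changes that follow, the behaviour described for ex5-1.pl could be sketched roughly as below. This is only one possible approach, not the expected lab solution. The helper name read_fasta_sequence is hypothetical, the sketch builds the string line by line instead of going through an array, and it uses "use strict" with a lexical filehandle, which the handout does not require.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of the handout): read a FASTA file and
# return its sequence as one string, leaving out the header line.
sub read_fasta_sequence {
    my ($filename) = @_;

    # Error catching: stop with the system error message if open fails.
    open(my $fh, '<', $filename)
        or die "Error: cannot open $filename: $!\n";

    my $seq = '';
    while (my $line = <$fh>) {
        next if $line =~ /^>/;   # skip the FASTA header line
        $line =~ s/\s//g;        # drop newlines, carriage returns, tabs, spaces
        $seq .= $line;           # append to the single-line sequence
    }
    close($fh);
    return $seq;
}

# Take the file name directly from the command line:
#   perl ex5-1.pl filename.fasta
if (@ARGV) {
    print read_fasta_sequence($ARGV[0]), "\n";
}
else {
    warn "Usage: perl $0 filename.fasta\n";
}
```

Keeping the file handling in one subroutine means the counting steps can work on the returned string without touching the file again.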
First, incorporate a command that splits this single string into a large array; for example,

    @dna = split('', $string);

Make sure your program initializes its counters.

    $A_Number = 0;
    $C_Number = 0;
    $G_Number = 0;
    $T_Number = 0;
    $Error    = 0;

Loop through the bases, keeping count of each nucleotide.

    foreach $base (@dna) {
        if    ($base eq 'A') { ++$A_Number; }
        elsif ($base eq 'C') { ++$C_Number; }
        elsif ($base eq 'G') { ++$G_Number; }
        elsif ($base eq 'T') { ++$T_Number; }
        else {
            print "Error: I don't recognize the bases\n";
            ++$Error;
        }
    }

Perl has many built-in shortcuts that make this easier to write, but harder to read. For example, in the first line above, the loop assigns each element of @dna to the temporary variable $base, but only because I have specified that variable. If I left out $base, perl would assign each element to the default variable $_. The loop would then begin:

    foreach (@dna) {
        if ($_ eq 'A') { ++$A_Number; }

Another shortcut is the implicit target of pattern matching. Instead of asking whether $_ is eq to ‘A’, we could ask whether the pattern ‘A’ is found in the string: if ($_ =~ m/A/) {…}. Because a pattern match with no explicit target is applied to $_, we can shorten this further: if (/A/) {…}. I know that this may be a bit confusing, so rewrite ex5-1.pl as ex5-2.pl using this shorthand approach.

Next, we are going to count the bases without looping through an array, i.e. keeping the sequence as a string. Use this method for script ex5-3.pl:

    for ($position = 0; $position < length $dna; ++$position) {
        $base = substr($dna, $position, 1);
        # $base holds a single character, so test it once with if/elsif;
        # the /i modifier makes each match case-insensitive.
        if    ($base =~ /a/i) { ++$A_Number; }
        elsif ($base =~ /g/i) { ++$G_Number; }
        elsif ($base =~ /c/i) { ++$C_Number; }
        elsif ($base =~ /t/i) { ++$T_Number; }
        else                  { ++$Error; }
    }

Finally, we are going to count the number of A’s, G’s, C’s, and T’s in the DNA string using the transliterate operator.
Remember from lecture:

    $DNA =~ tr/AGCT/TCGA/;

Also, tr/// is the same as y///. In our version of perl, the operator returns the number of characters it matched, so it is easy to count the occurrences of any character by assigning that return value to a scalar. With an empty replacement list, the string itself is left unchanged. Create a script, ex5-4.pl, that uses this approach to count and display the number of A’s, C’s, G’s, and T’s in your DNA sequence. For example:

    $A_Number = ($DNA =~ y/A//);

6. Writing out to files

In this section you will learn to write text to a file. First, copy ex5-4.pl to ex6-1.pl. Add a line that takes an output filename from the command line. For example, the command line should be something like:

    perl ex6-1.pl infile.fasta outfile.txt

At the end of the script, add a couple of lines that open, and write to, a results file. Below is an example of what writing to a file might look like. Notice that the outfile name is preceded by ‘>’, which opens the file for writing: the file is created if it does not exist and overwritten if it does.

    open(RESULTFILE, ">$outfile") or die("Error: $!");
    print RESULTFILE "The results are overwriting everything that existed in $outfile\n";
    close RESULTFILE;

Use this opportunity to explore some of perl’s special variables.

    o What does the variable $0 hold?
    o Print out the contents of @ARGV
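
The pieces of this section can be put together in a short sketch. It is a minimal illustration under assumptions, not the required form of ex6-1.pl. The subroutine names read_sequence, count_bases, and write_counts are hypothetical, the output format ("A: 3" and so on) is invented for the example, and the three-argument form of open with a lexical filehandle is used in place of the RESULTFILE style shown above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: read a FASTA file into one string, skipping headers.
sub read_sequence {
    my ($infile) = @_;
    open(my $in, '<', $infile) or die "Error: cannot open $infile: $!\n";
    my $dna = '';
    while (my $line = <$in>) {
        next if $line =~ /^>/;   # skip the FASTA header line
        $line =~ s/\s//g;        # remove all white space
        $dna .= $line;
    }
    close($in);
    return $dna;
}

# Hypothetical helper: count each base with tr///. The operator returns
# the number of characters matched; with an empty replacement list the
# string is not changed.
sub count_bases {
    my ($dna) = @_;
    return (
        A => ($dna =~ tr/A//),
        C => ($dna =~ tr/C//),
        G => ($dna =~ tr/G//),
        T => ($dna =~ tr/T//),
    );
}

# Hypothetical helper: write the counts to $outfile. The '>' mode creates
# the file if needed and overwrites anything it already held.
sub write_counts {
    my ($outfile, %count) = @_;
    open(my $out, '>', $outfile) or die "Error: cannot open $outfile: $!\n";
    foreach my $base (qw(A C G T)) {
        print $out "$base: $count{$base}\n";
    }
    close($out) or die "Error: cannot close $outfile: $!\n";
}

# Show the two special variables the section asks about.
print "\$0 holds: $0\n";
print "\@ARGV holds: @ARGV\n";

# Expected command line: perl ex6-1.pl infile.fasta outfile.txt
if (@ARGV == 2) {
    my ($infile, $outfile) = @ARGV;
    my %count = count_bases(read_sequence($infile));
    write_counts($outfile, %count);
    print "Counts written to $outfile\n";
}
else {
    warn "Usage: perl $0 infile.fasta outfile.txt\n";
}
```

Running the script with no arguments is a quick way to see what $0 and an empty @ARGV look like before adding the file names.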