Perl Laboratory Study Guide – Section III

9. Hashes

There are three main data types in perl: scalar variables, arrays and hashes. Hashes provide very fast look-ups of a value by its key. The format is similar to that of an array.

o %hash = ('key' => 'value');
o $value = $hash{'key'};

You may declare a hash using either commas or => between keys and values.

    %array = ( 'key1', 'value1',
               'key2', 'value2',
               'key3', 'value3', );

    %array = ( 'key1' => 'value1',
               'key2' => 'value2',
               'key3' => 'value3', );

The keys and values in a hash may be addressed as arrays.

o @keys = keys %hash;
o @values = values %hash;

In perl there is technically no such animal as a two-dimensional array (called an 'associative array' in this guide, although that term is more commonly the older name for a hash), but we can use them anyway by storing array references inside an array. They work like hashes (sort-of), but finding a particular value in one is much slower because you have to loop over the elements. $array[$i]->[$j] refers to the same element as $array[$i][$j].

For ex9-1.pl, write a script that stores the following values as both an 'associative array' and as a hash: {[1][a], [1][b], [1][c]}, {[2][a],[2][b],[2][c]}, {[3][a],[3][b],[3][c]}. You may, of course, hard code the population of these data types, but populating them using for loops is probably more helpful in the long run.

Modify ex9-1.pl to sort both the keys and values of arrays/hashes using numeric and lexicographic sorting, and print the results to the screen. Don't freak out … this isn't that hard.

o To sort arrays alphabetically:
    @array = sort @array;
o To sort arrays numerically:
    @array = sort { $a <=> $b } @array;
o To print a hash in sorted key order (here each value is drawn as a row of asterisks):
    foreach (sort keys %hash) {
        print "$_\t", "*" x $hash{$_}, "\n";
    }
o To sort keys by their values in ascending order:
    foreach (sort { $hash{$a} <=> $hash{$b} } keys %hash) {
        ……
    }

10. Hashes and the Genetic Code

As you have no doubt figured out from your first assignment, there are numerous ways to return the correct amino acid for a given tri-nucleotide combination (codon).
Here's the most difficult method:

    sub codonReplacement {
        my($codon) = @_;
        return 'S' if ($codon =~ /TCA/i);
        return 'S' if ($codon =~ /TCC/i);
        return 'S' if ($codon =~ /TCG/i);
        …
    }

Here is a better method:

    sub codonReplacement {
        my($codon) = @_;
        return 'A' if ($codon =~ /GC./i);
        return 'C' if ($codon =~ /TG[TC]/i);
        return 'D' if ($codon =~ /GA[TC]/i);
        …
    }

Here is the best method:

    sub codonReplacement {
        my($codon) = @_;
        $codon = uc $codon;    # hash keys are upper case
        my(%genetic_code) = (
            'TCA' => 'S',
            'TCC' => 'S',
            'TCG' => 'S',
            ….
        );
        return $genetic_code{$codon} if (exists $genetic_code{$codon});
    }

The 'best' method is not merely a thought exercise. Accurately reproducing this hash table will make the rest of your semester much easier. This is up to you to finish.

11. A Sample Program

By now I'm certain that you are getting fairly proficient at perl. This is a good thing. Type in the sample program below, ex11-1.pl, and think about the consequences of its findings. (I know that it's a bit long, but you just might learn something.)
A program to simulate the percentage of similar DNA in random sequences

    #!/usr/bin/perl -w
    use strict;

    #declare and initialize variables
    my $percent;
    my @percentages;
    my $result;

    #initialize an array to store DNA
    my @randomDNA = ();

    #Seed the random number generator
    srand(time|$$);

    #Generate ten random DNA sets using a subroutine
    @randomDNA = make_random_DNA_set(10,10,10);

    #iterate through all pairs of sequences
    for (my $k = 0; $k < scalar(@randomDNA) - 1; ++$k) {
        for (my $i = ($k + 1); $i < scalar(@randomDNA); ++$i) {
            $percent = matching_percentage($randomDNA[$k], $randomDNA[$i]);
            push(@percentages, $percent);
        }
    }

    #Average the result
    $result = 0;
    foreach (@percentages) {
        $result += $_;
    }
    $result = ($result / scalar(@percentages)) * 100;
    print "The average percentage of matching positions is ";
    print "$result\n\n";
    exit;

    #Make a random set of DNA
    sub make_random_DNA_set {
        #minimum fragment length, maximum fragment length, number of fragments
        my($minLen, $maxLen, $sizeOfSet) = @_;
        my $length;
        my $dna;
        my @set;
        #create a set of random DNA
        for (my $i = 0; $i < $sizeOfSet; ++$i) {
            #find a random length
            $length = random_length($minLen, $maxLen);
            #make a random DNA fragment
            $dna = make_random_DNA($length);
            #add DNA fragment to @set
            push(@set, $dna);
        }
        return @set;
    }

    #find random length between x and y
    sub random_length {
        my($minLen, $maxLen) = @_;
        return (int(rand($maxLen - $minLen + 1)) + $minLen);
    }

    #pick random nucleotide
    sub randomnucleotide {
        my(@nucleotides) = ('A','C','G','T');
        return randomelement(@nucleotides);
    }

    #randomly select element from array
    sub randomelement {
        my(@array) = @_;
        return $array[rand @array];
    }

    #make_random DNA
    sub make_random_DNA {
        my($length) = @_;
        my $dna = '';
        for (my $i = 0; $i < $length; ++$i) {
            $dna .= randomnucleotide();
        }
        return $dna;
    }

    #matching percentage (returned as a fraction between 0 and 1)
    sub matching_percentage {
        my($string1, $string2) = @_;
        my($length) = length($string1);
        my($position);
        my($count) = 0;
        for ($position = 0; $position < $length; ++$position) {
            if (substr($string1,$position,1) eq substr($string2,$position,1)) {
                ++$count;
            }
        }
        return $count/$length;
    }

1. How do you think the percentage of matching DNA generated from this random generator relates to percentages of matching DNA generated from real sets of DNA randomly extracted from different genomes?
2. How can you modify this script to test your theory?
3. Does the average matching percentage change as the random fragment length increases?
4. Is the subroutine matching_percentage the best method for finding the matching percentage?
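
As a starting point for thinking about the last question, here is a minimal sketch of one possible alternative to matching_percentage. It is not part of ex11-1.pl: the subroutine name matching_fraction_xor and the two test strings are made up for illustration. It assumes both arguments are ordinary strings of equal length and relies on Perl's string XOR operator, which yields a "\0" character at every position where the two strings agree.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical alternative to matching_percentage (the name is illustrative).
# Assumes both arguments are plain strings of equal length.
sub matching_fraction_xor {
    my ($string1, $string2) = @_;
    my $length = length $string1;
    return 0 unless $length;    # avoid dividing by zero on an empty string

    # XOR-ing two strings gives "\0" wherever the characters are identical.
    my $xor = $string1 ^ $string2;

    # tr/// with an empty replacement list only counts the "\0" characters.
    my $count = ($xor =~ tr/\0//);

    return $count / $length;
}

# Two made-up 10-base fragments that differ at positions 5 and 10.
print matching_fraction_xor('ACGTACGTAC', 'ACGTTCGTAA'), "\n";    # prints 0.8
```

Like the original subroutine, this returns a fraction between 0 and 1, so it could be swapped in without changing the averaging code. Whether it is actually "better" is for you to judge: it avoids the explicit loop and the repeated substr calls, but it is harder to read and behaves differently if the two strings are not the same length.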