Procedures: this file contains the Perl and R scripts used to perform several of the analyses in Green et al. 2013. Below is a list of analyses and the scripts to run, in order, to perform them. The scripts themselves contain more detailed explanations of their individual functions in the opening comment lines.

Gap reconstruction - to determine the locations of gaps in the reconstructed ancestral sequences. Start with the structural alignment of extant sequences produced in RNAsalsa - exactly the alignment used to perform bppML.
- use vi or some other text editor to replace all gaps with 0 and all nucleotides with 1
gap_reconstruction_conversion.pl
- run bppancestor on the converted extant sequences, using the F84 substitution model with no rate categories
infer_gaps.pl

Calculate Stem GC Content - to calculate the stem GC content of the extant and ancestral sequences. Start with the output of the RNAsalsa run on the reconstructed sequences (these sequences must have gaps!). This output should include a file of sequences and a file of structures.
concatenate_seqs_and_struc.pl
count_stem_GC.pl
calculate_stem_GC_percent.R
output: stem_GC.txt

Calculate branch lengths independent of GC content - this sequence of actions will calculate branch lengths independent of changes that affect the stem GC content, to be used in analyses of the stem GC content.
concatenate_seqs_and_struc.pl
find_mutation_types.pl
count_mutation_types.pl
make_GC_ind_tree.R - for this step you need two trees from bppML: the original tree produced in PhyML with branch lengths refined in bppML, and a tree output by bppML that lists the labels of each node.

Create all possible mutations: This sequence of scripts will make all possible mutations to the T. maritima ribosome.
simulate_evolution.pl
- run RNAsalsa using the T.
maritima structural constraint file and s1,s2,s3 = 1
concatenate_seqs_and_struc.pl
count_stem_GC.pl
calculate_stem_GC_percent.R


gap_reconstruction.pl

#!/usr/bin/perl
use warnings;
use strict;

#This program will read in a series of sequences represented as 0's and 1's, and convert the 0's to C's and the 1's to A's.
#The sequences must be previously converted, with gaps encoded as 0's and nucleotides encoded as 1's. This allows the numbers to be substituted as nucleotides without any confusion about which C's represent cytosine and which represent gaps during the conversion process (or which A's represent adenine and which represent a nucleotide)
## Input file in clustalw format: numbers_alignment.fna
## Output file in clustalw format: converted_numbers_alignment.fna

my $file="numbers_alignment.fna"; ## input
open(IN,$file) or die "Cannot open $file: $!";
open(OUT,">converted_numbers_alignment.fna") or die "Cannot open output file: $!"; ## output
while(<IN>){
    if ($_=~/(\S{1,10})\s{6}(\S*)\n/){ ## If the line contains sequence, split into two parts
        my $name=$1; ## part one is the name of the sequence
        my $string=$2; ## part two is the sequence
        $string =~ tr/10/AC/; ## translate 1 to A and 0 to C
        print OUT "$name $string\n"; ## Print to outfile
    }
    else{
        print OUT $_; ## If line does not contain sequence, just print that line to the outfile unchanged
    }
}
close(IN);
close(OUT);


infer_gaps.pl

#!/usr/bin/perl
#This program will read in the ancestral sequences in the file ancestral.seqs.fna
#It will read in the reconstructed sequences in converted.output.seqs and use this information to determine the location of gaps in the ancestral sequences
## Original sequence file: ancestral.seqs.fna
## Converted sequence file: converted.output.seqs
use warnings;
use strict;

while(defined(my $file=glob("ancestral.seqs.fna"))){
    print "The file is $file\n";
    open(IN,"<$file");
    my @seqs=<IN>;
    close(IN);
    open(IN,"<converted.output.seqs");
    my @gaps=<IN>;
    close(IN);
    open(OUT,">$file.withgaps");
    for(my $i=0;$i<scalar(@gaps);$i++){
        chomp($gaps[$i]);
        chomp($seqs[$i]);
        if ($gaps[$i]=~/^>/){ # if the line is an annotation line, do not tamper
            print OUT "$gaps[$i]\n";
        }
        else{
            my @gaps_temp=split("",$gaps[$i]);
            my @seqs_temp=split("",$seqs[$i]);
            for(my $j=0;$j<scalar(@gaps_temp);$j++){ # loop through every nucleotide in that line
                if ($gaps_temp[$j] eq "A"){ # if a nucleotide is present (A=nucleotide)
                    print OUT "$seqs_temp[$j]";
                }
                else {
                    if ($gaps_temp[$j] eq "C"){ # If a gap is present (C=gap)
                        print OUT "-";
                    }
                    else{
                        print "Error! Unknown character $gaps_temp[$j]\n";
                    }
                }
            }
            print OUT "\n";
        }
    }
    close(OUT);
}


concatenate_seq_and_struc.pl

#!/usr/bin/perl
use warnings;
use strict;

#This program will match the output secondary structure from RNAsalsa to the sequences in the structural alignment in CLUSTAL format
#It will output those two things together, with the sequence followed by the structure, in one file in fasta format
#It will do this for each set of structures and sequences in this directory, and make a different output file for each set

#Read in files to the arrays @struct and @seq
while(defined(my $file=glob("*struct.aln"))){
    $file =~ m/(.*?)structaln/;
    open(STRUC,"<$1structaln_struct.aln");
    my @struct = <STRUC>;
    close(STRUC);
    open(SEQ,"<$1structaln_sequ.aln");
    my @seq = <SEQ>;
    close(SEQ);

    #Make sure files were read in correctly, and count the length and number of species in each file
    my %names = ();
    foreach (@seq){
#       print "$_";
        if ($_ !~ /^CLUSTAL/){
            if ($_ =~ /^(\w{1,10})\s/){
                my $name = $1;
                if (defined($names{$name})){}
                else {
                    $names{$name} = 1;
                }
            }
        }
    }
    my $species = scalar(keys(%names));
    print "Found $species species\n";
    foreach (@struct){
#       print "$_\n";
        if ($_ !~ /^CLUSTAL/){
            if ($_ =~ /^(\w{1,10})\s/){
                my $name=$1;
                if (defined($names{$name})){}
                else {
                    print "ERROR! The genes in the struct and seq file don't match\n";
                    die;
                }
            }
        }
    }
    my $seq_len = scalar(@seq);
    my $struct_len = scalar(@struct);
    print "The number of lines in the sequence file is $seq_len.\nThe number of lines in the structure file is $struct_len.\n\n";

    ### Make two hashes, each keyed by species name, one containing the structures and one containing the sequences
    ### Then print to outfile
    my %seq_hash = ();
    my %struc_hash = ();
    foreach (keys(%names)){
        $seq_hash{$_}="";
        $struc_hash{$_}="";
    }

    ## This subroutine chomps a string and then splits it at the \s
    ## It returns the first and last item in the array generated by splitting the string
    ## For a line from a clustalw alignment this will be the sequence name and the sequence
    sub splomp {
        my $string = $_[0];
        chomp($string);
        my @work = split(" ",$string);
        my $length = scalar(@work);
        my @final = ($work[0],$work[$length-1]);
        return @final;
    }

    open(OUT,">seqs_and_struc.out");
    my $jcount = 0;
    for(my $i=1;$i<$seq_len;$i++){ #$i starts at 1 in order to skip first line of each file, which contains no info
        if ($seq[$i] !~ /^\w/){} ## Skips blank lines
        else {
            my @seq_strs = splomp($seq[$i]);
            my @struc_strs = splomp($struct[$i]);
            if (defined($seq_hash{$seq_strs[0]})){
                $seq_hash{$seq_strs[0]} = join("",$seq_hash{$seq_strs[0]},$seq_strs[1]);
                $struc_hash{$struc_strs[0]} = join("",$struc_hash{$struc_strs[0]},$struc_strs[1]);
                $jcount++;
            }
            else {
                print "ERROR THE OUTFILE IS INCORRECT\n";
                ## This error indicates that the names of the sequences as extracted by the splomp function differ from the names of the sequences in the original hash of sequences - could be a malfunction in splomp or a problem with infile formatting
            }
        }
    }
    print "Counted $jcount lines!\n";
    foreach my $key (keys(%seq_hash)){
        print OUT ">$key\n$seq_hash{$key}\n$struc_hash{$key}\n";
    }
    close(OUT);
}


count_stem_GC.pl

#!/usr/bin/perl
use warnings;
use strict;

## This script will read in the output of the match_seq_to_struc script, seqs_and_struc.out, and output the
## nucleotide counts of the stem regions and loop regions into the file nucleotide_count.txt

open(IN,"<seqs_and_struc.out");
my @lines = <IN>;
close(IN);

## Define two hashes, one with stem and one with loop nucleotide content
my @nucleotides = ("A","C","G","U","-","N");
my %stem = ();
my %loop = ();
foreach (@nucleotides){
    $stem{$_}=0;
    $loop{$_}=0;
}

open(OUT,">nucleotide_count.txt");
print OUT "File\tStem_-\tStem_A\tStem_N\tStem_C\tStem_G\tStem_U\tLoop_-\tLoop_A\tLoop_N\tLoop_C\tLoop_G\tLoop_U";

## Make three arrays, one each for names, sequences, and structures
my @names = ();
my @seqs=();
my @strucs=();
foreach(@lines){
    if ($_ =~ /^>(\w{1,10})/){
        push(@names,$1);
    }
    else {
        tr/a-z/A-Z/; #Translates lowercase to uppercase (done first so that lowercase t is also converted below)
        s/T/U/g; #Substitutes T with U
        if ($_ =~ /^([\w\-]*)\n/){
#           print "SEQUENCE IS $1\n";
            push(@seqs,$1);
        }
        else {
            $_ =~ /^([\.\-\(\)X]*)\n/;
#           print "STRUCTURE IS $1\n";
            push(@strucs, $1);
        }
    }
}

## Now loop through each array and count nucleotides based on secondary structures, skipping gaps
for(my $i=0;$i<scalar(@names);$i++){ ## Loop through the array of organism names
    my @singles = split("",$seqs[$i]); # Array of nucleotides
    my @notes = split("",$strucs[$i]); # Array of structure characters
    my $line_count = 0;
    foreach my $nuc (@notes){
        if ($nuc =~ /\-/){} ##SKIPS GAPS
        else{
            if ($nuc =~ /\./){
                $loop{$singles[$line_count]}+=1;
            }
            if ($nuc =~ /[\(\)]/){
                $stem{$singles[$line_count]}+=1;
            }
            ## Skips any character that's not ( ) or .
        }
        $line_count++;
    }
    print OUT "\n$names[$i]";
    foreach ("-","A","N","C","G","U"){ ## Fixed order matching the header columns (hash key order is not guaranteed)
        print "$_\t$stem{$_}\n";
        print OUT "\t$stem{$_}";
        $stem{$_} = 0; ## CLEARS THE HASH
    }
    foreach ("-","A","N","C","G","U"){ ## Same fixed order for the loop columns
        print "$_\t$loop{$_}\n";
        print OUT "\t$loop{$_}";
        $loop{$_} = 0; ## CLEARS THE HASH
    }
}
close(OUT);


calculate_stem_GC_percent.R

## This R script will read in the nucleotide count file and calculate the stem and loop GC/GCAT content
### MUST BE 12 COLUMNS IN THE OUTPUT FILE
#INPUT: nucleotide_count.txt
#OUTPUT: stem_GC.txt

files <- list.files(pattern="nucleotide_count\\.txt$")
count <- 0
for(j in seq_along(files)){
    file <- files[[j]]
    count <- count + 1
    data <- read.table(files[[j]],header=TRUE,row.names="File")
    tot.length <- c()
    percent.stem <- c()
    percent.loop <- c()
    stem.gc <- c()
    for (i in 1:nrow(data)){
        tot.length[i] <- sum(data[i,])
        percent.stem[i] <- sum(data[i,1:6])/tot.length[i]   # Columns 1-6 are the stem counts
        percent.loop[i] <- sum(data[i,7:12])/tot.length[i]  # Columns 7-12 are the loop counts
        stem.gc[i] <- sum(data[i,4:5])/sum(data[i,1:6])     # Columns 4-5 are Stem_C and Stem_G. Divide by stem length, NOT total length
    }
    outmat <- cbind(tot.length,percent.stem,percent.loop,stem.gc)
    row.names(outmat) <- row.names(data)
    splits <- strsplit(files[[j]],"_")
    splits <- splits[[1]]
    part <- splits[1:(length(splits)-3)]
    temp <- ""
    for(k in 1:length(part)){
        temp <- paste(temp,part[k],sep="_")
    }
    write.table(outmat, file=paste(temp,"stem_GC.txt",sep=""))
}


find_mutation_types.pl

#!/usr/bin/perl
use warnings;
use strict;

## This script will read in the seqs_and_struc file, and create a hash of sequences and of structures, keyed by the name of the organism.
## Then it will read in the R table that contains ancestor and descendant nodes, and make a hash of ancestors keyed by descendants.
## Then it will loop through each ancestor-descendant pair, find the corresponding sequences and structures for each, and detect positions where there has been a change (in seq, struc, or both)
## It will write an output file for each branch, with each mutation as a new line, that details the ancestral character, ancestral structure, and descendant character and structure.
## This will be in the folder initial_outputs.
## The script count_mutation_types.pl then reads these output files and makes a condensed table summarizing the changes for each branch

################## Make hashes of sequences and structure ##########
open(IN,"<seqs_and_struc.out");
my @lines=<IN>;
close(IN);
my %seqs=();
my %strucs=();
my $name="";
my $count=0;
foreach(@lines){
    if (m/^>(.*)\n/){
        $name=$1;
        $name =~ s/_//g;
        $count=0;
        print "The name is $name\n";
    }
    else{
        chomp;
        if ($count==0){ ## First line after the header is the sequence
            $seqs{$name}=$_;
        }
        else{ ## The following line is the structure
            $strucs{$name}=$_;
        }
        $count++;
    }
}
#foreach(keys(%seqs)){
#    print "$_\n$seqs{$_}\n$strucs{$_}\n";
#}
#exit;

################ Make hashes of ancestors and descendants ########
open(IN,"edge_table.txt");
@lines=<IN>;
close(IN);
my %branches=();
foreach(@lines){
    chomp;
    my @entries = split("\t",$_);
    if ($entries[1]=~m/\d{1,2}\_/){ ## Strip the numeric prefix and the underscores from the descendant name
        my @new = split("_",$entries[1]);
        my $string=$new[1];
        for(my $j=2;$j<scalar(@new);$j++){
            $string = join("",$string,$new[$j]);
        }
        $entries[1] = $string;
    }
    $branches{$entries[1]} = $entries[0]; ## Hash of ancestors keyed by descendants
}
foreach(keys(%branches)){
    print "$_\t$branches{$_}\n";
}

################ Loop through each ancestor-descendant and find mismatches
mkdir("initial_outputs") unless -d "initial_outputs";
foreach my $des (keys(%branches)){
    my $anc = $branches{$des};
    print "$des\t ancestor $anc\n";
    my @a_seq = split("",$seqs{$anc});
    my @d_seq = split("",$seqs{$des});
    my @a_struc = split("",$strucs{$anc});
    my @d_struc = split("",$strucs{$des});
    my %positions =(); # This will store the positions of the changes
    my $i=0;
    foreach my $a_nuc(@a_seq){
        my $d_nuc=$d_seq[$i];
#       print "$a_nuc\t$d_nuc\n";
        if ($a_nuc ne $d_nuc){
            $positions{$i}="M"; #M for mutation
        }
        else{
            $positions{$i} = "S"; #S for same
        }
        $i++;
    }
    $i=0;
    foreach my $a_st(@a_struc){
        my $d_st = $d_struc[$i];
        if($a_st ne $d_st){
            $positions{$i} = "M";
        }
        $i++;
    }
    open(OUT, ">initial_outputs/output.$anc.$des.txt");
    print OUT "Position\tAnces_character\tAnces_structure\tDes_character\tDes_structure\n";
    my $mut_count=0;
    my @key = sort { $a <=> $b } keys(%positions); ## Numeric sort, so positions are written in sequence order
    foreach(@key){
        if ($positions{$_} eq "M"){
            print OUT "$_\t$a_seq[$_]\t$a_struc[$_]\t$d_seq[$_]\t$d_struc[$_]\n";
            $mut_count++;
        }
    }
    print "Found $mut_count mutations between $anc and $des\n";
    close(OUT);
}


count_mutation_types.pl

#!/usr/bin/perl
use warnings;
use strict;

### This script reads the per-branch output files written by find_mutation_types.pl, and counts the number of changes along each branch that increase, decrease, or do not affect the stem GC content. The counts are written to changes_count.txt
my %scores=(
    "A(" => -1, "A)" => -1, "U(" => -1, "U)" => -1,
    "A." => 0, "U." => 0, "C." => 0, "G." => 0,
    "C(" => 1, "G(" => 1, "C)" => 1, "G)" => 1,
    "--" => 0,
);
open(OUT,">changes_count.txt");
print OUT "Branch\tIncrease\tDecrease\tNeutral\n";
while(defined(my $file=glob("initial_outputs/output*"))){
    $file =~ m/outputs\/output\.(.*)\./;
    my $name = $1;
    open(IN,$file);
    my $first = <IN>; # Ignore the first line
    my @lines=<IN>;
    close(IN);
    my $increase=0;
    my $decrease=0;
    my $neutral=0;
    foreach(@lines){
        chomp;
        my @chars = split("\t",$_);
        my $ances = join("",$chars[1],$chars[2]);
        my $des = join("",$chars[3],$chars[4]);
        print "ancestor $ances descendant $des\n";
        my $a_score = $scores{$ances} // 0; # States not listed in %scores are treated as 0
        my $d_score = $scores{$des} // 0;
        ## Each change is counted in exactly one category
        if ($a_score == $d_score){$neutral++;}
        elsif ($a_score < $d_score){$increase++;}
        else {$decrease++;}
    }
    print OUT "$name\t$increase\t$decrease\t$neutral\n";
}
close(OUT);


make_GC_ind_tree.R

## This R script will perform analysis on the table of the number of changes along each branch, and make a new tree with branch lengths based on the # of changes that did not affect the stem GC content that
## occurred on that branch
## input data = changes_count.txt
## input tree = GTR.general.tree
## input tree (for node labels) = names.tree.mod.txt

## Read in the files and the branch lengths
library(ape)
data <- read.table("changes_count.txt",row.names="Branch",header=T)
tree <- read.tree("GTR.general.tree")
tree <- drop.tip(tree,c("Sulfurihyd","Hydrogenob","Aaeolicus_","Chydrogeno","Tpseudetha",
    "Ttengconge","Tkodakaren","Pfuriosus_"),trim.internal=TRUE)
names.tree <- read.tree("names.tree.mod.txt")
names.tree <- drop.tip(names.tree,c("Sulfurihyd","Hydrogenob","Aaeolicus","Chydrogeno","Tpseudetha",
    "Ttengconge","Tkodakaren","Pfuriosus"),trim.internal=TRUE)
tree.names <- c(names.tree$tip.label,names.tree$node.label)

###
## The objects sums and edge.table used below are not created in this script as listed; they must already exist in the R workspace
p <- data[,1]   # Increase
n <- data[,2]   # Decrease
ne <- data[,3]  # Neutral
names(p) <- names(sums)
names(n) <- names(sums)
names(ne) <- names(sums)
diffs <- p-n
names(diffs) <- names(sums)
ratios <- n/p
names(ratios) <- names(diffs)
plot(tree$edge.length,ratios[edge.table[,2]])
plot(ne[edge.table[,2]],diffs[edge.table[,2]],pch=18,cex=.9,col="darkgreen",xlim=c(0,120))
text(ne[edge.table[,2]],diffs[edge.table[,2]],edge.table[,2],cex=.7,pos=4)

### Make a tree with different branch lengths
ne.norm <- ne[edge.table[,2]]/max(ne[edge.table[,2]])
branch.tree <- tree
branch.tree$edge.length <- ne.norm
write.tree(branch.tree,"110809_modified_branch_tree.txt")

###
library(ape)
branch.tree$tip.label <- c("T. maritima str. 2812B","T. maritima str. MSB8","T. cell2","T. sp RQ2",
    "T. neapolitana","T. petrophila","T. napthophila","T. lettingae","T. elfii","T. subterranea",
    "T. hypogea","T. thermarum","Ts. atlanticus","Ts. geolei","Ts. japonicus","Ts. africanus",
    "Ts. melanesiensis","F. islandicum","F. changbaicum","F. nodosum","F. gondowanense","K. olearia",
    "Ms. prima","M. hydrogenitolerans","M. piezophila","M. okinawensis","M. camini","P. mexicana",
    "P. halophila","P. mobilis","P. olearia","P. sibirica")
pdf(file="Modified branch lengths tree")
plot.phylo(branch.tree)
dev.off()


simulate_evolution.pl

#!/usr/bin/perl
use warnings;
use strict;

#This script will make every possible point mutation in the T. maritima ribosomal sequence, and then print them to the file sim_output.fna
## The input file is the fna file of the T. maritima 16S rRNA
## All of the simulated sequences will be output in sim_output.fna
## The modified sequences will also be written to separate files in the outputs directory
## RNAsalsa should be run on sim_output.fna, with s1,s2,s3=1 and a structural constraint for the characterized T. maritima 16S rRNA

open(IN,"tmar_16S_seq.fna");
my @lines = <IN>;
close(IN);
my $string="";
foreach(@lines){ #joins the sequence into a large string
    chomp($_);
    if ($_=~/^>/){}
    else{
        $string=join("",$string,$_);
    }
}
$string=~s/-//g; #removes gaps
my @chars =("A","C","G","T");
my @seq_array = split("",$string);
open(OUT,">sim_output.fna");
mkdir("outputs") unless -d "outputs";
my $count=0; #Index of the position currently being mutated
foreach(@seq_array){
    my @index=(); #1 marks the base already present at this position, 0 marks each alternative base
    my $char_count=0;
    my @temp_seq_array=@seq_array;
    foreach my $char (@chars){
        if ($_ eq $char){
            $index[$char_count]=1;
        }
        else{
            $index[$char_count]=0;
        }
        $char_count++;
    }
    my $ind_count=0;
    foreach my $ind (@index){
        if ($ind==0){
            $temp_seq_array[$count]=$chars[$ind_count];
            my $tr_char=$count+1; #The position of the change made
            my $orig = $seq_array[$count]; #The original base
            my $new=$chars[$ind_count]; #The new base
            print OUT ">T_mar_sim char:$tr_char orig:$orig new:$new\n";
            my $temp_seq=join("",@temp_seq_array);
            print OUT "$temp_seq\n";
            ##This makes an output file for each mutant
            open(SING,">outputs/Tmar_sim.$tr_char.$orig.$new.fna");
            print SING ">T_mariMSB8_constraint\n$string\n";
            print SING ">T_mari.$tr_char.$orig.$new\n$temp_seq\n";
            close(SING);
        }
        $ind_count++;
    }
    $count++;
}
close(OUT);
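

The enumeration carried out by simulate_evolution.pl can be sketched more compactly as follows. This is a minimal illustration under stated assumptions, not the script used in the analysis: the subroutine name point_mutants and the three-base example sequence are hypothetical, and the sketch only builds the list of mutants in memory without writing any of the .fna files.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of the original scripts).
# Returns one array reference per single-base substitution of $seq:
# [1-based position, original base, new base, mutant sequence]
sub point_mutants {
    my ($seq) = @_;
    my @bases = qw(A C G T);
    my @mutants;
    for my $pos (0 .. length($seq) - 1) {
        my $orig = substr($seq, $pos, 1);
        for my $new (@bases) {
            next if $new eq $orig;            # skip the base already present
            my $mutant = $seq;
            substr($mutant, $pos, 1, $new);   # replace one character in the copy
            push @mutants, [$pos + 1, $orig, $new, $mutant];
        }
    }
    return @mutants;
}

# Example with an assumed toy sequence
my @m = point_mutants("ACG");
printf "%d mutants\n", scalar @m;
print join("\t", @$_), "\n" for @m;
```

For a sequence of n bases drawn from A, C, G and T this gives 3n mutants, so the three-base example reports 9 mutants.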