Procedures: this file contains the PERL and R scripts used to perform several of the
analyses in Green et al. 2013. Below is a list of analyses and the scripts to run, in
order, to perform them. The scripts themselves contain more detailed explanations
about their individual functions in the opening comment lines.
Gap reconstruction - to determine the locations of gaps in the reconstructed
ancestral sequences. Start with the structural alignment of extant sequences
produced in RNAsalsa - exactly the alignment used to perform bppML.
- use vi or some other text editor to replace all gaps with 0 and all nucleotides
with 1
gap_reconstruction_conversion.pl
- run bppancestor on the converted extant sequences, using the F84 substitution
model with no rate categories
infer_gaps.pl
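The manual find-and-replace step above can also be scripted. The following is a minimal Perl sketch, not one of the scripts used in the paper. The helper names mask_gaps and mask_line are hypothetical, and the sketch assumes that gaps are written as '-' and that each alignment line has the form name, whitespace, sequence.

```perl
use strict;
use warnings;

# Hypothetical helper: encode one aligned sequence as presence/absence.
# Gaps ('-') become 0; every other character is treated as a nucleotide and becomes 1.
sub mask_gaps {
    my ($aligned) = @_;
    (my $masked = $aligned) =~ s/[^-]/1/g;   # anything that is not a gap -> 1
    $masked =~ tr/-/0/;                      # gaps -> 0
    return $masked;
}

# Hypothetical helper: mask only the sequence column of an alignment line,
# leaving the sequence name and the spacing untouched.
sub mask_line {
    my ($line) = @_;
    chomp $line;
    if ($line =~ /^(\S+)(\s+)([A-Za-z-]+)$/) {
        my ($name, $pad, $aln) = ($1, $2, $3);
        return $name . $pad . mask_gaps($aln);
    }
    return $line;   # header and conservation lines pass through unchanged
}

print mask_line("Tmaritima_      ACG--UUA"), "\n";   # prints: Tmaritima_      11100111
```

Masking only the sequence column avoids corrupting sequence names, which a global replace in a text editor could do.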
Calculate Stem GC Content - to calculate the stem GC content of the extant and
ancestral sequences. Start with the output of the RNAsalsa run on the reconstructed
sequences (these sequences must have gaps!). This output should include a file of
sequences and a file of structures.
concatenate_seqs_and_struc.pl
count_stem_GC.pl
calculate_stem_GC_percent.R
output: stem_GC.txt
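As an illustration of the quantity these scripts compute, here is a minimal Perl sketch of the stem GC fraction for a single sequence and its dot-bracket structure. The function name stem_gc_fraction is hypothetical. Like calculate_stem_GC_percent.R, the sketch divides by the number of paired positions, not by the total length.

```perl
use strict;
use warnings;

# Hypothetical helper: fraction of paired (stem) positions that are G or C.
# $seq is an aligned sequence, $struc its dot-bracket structure of equal length.
sub stem_gc_fraction {
    my ($seq, $struc) = @_;
    die "sequence and structure differ in length\n"
        if length($seq) != length($struc);
    my ($stem, $gc) = (0, 0);
    for my $i (0 .. length($seq) - 1) {
        my $s = substr($struc, $i, 1);
        next unless $s eq '(' || $s eq ')';   # only paired positions count as stem
        my $n = uc substr($seq, $i, 1);
        $stem++;
        $gc++ if $n eq 'G' || $n eq 'C';
    }
    return $stem ? $gc / $stem : 0;           # avoid division by zero when there is no stem
}

# Paired positions hold G, A, U, C, so two of the four are G or C.
print stem_gc_fraction("GAAAUC", "((..))"), "\n";   # prints: 0.5
```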
Calculate branch lengths independent of GC content - this sequence of actions will
calculate branch lengths independent of changes that affect the stem GC content, to
be used in analyses of the stem GC content.
concatenate_seqs_and_struc.pl
find_mutation_types.pl
count_mutation_types.pl
make_GC_ind_tree.R - for this step you need the tree produced by bppML - the
original tree produced in PhyML with branch lengths refined in bppML and a tree
output by bppML that lists the labels of each node.
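The classification behind these branch lengths can be sketched as follows. This is a minimal Perl illustration that mirrors the scoring table in count_mutation_types.pl (paired G or C scores +1, paired A or U scores -1, unpaired positions score 0). The function names are hypothetical, and treating gaps and ambiguous bases as neutral is an assumption of the sketch.

```perl
use strict;
use warnings;

# Hypothetical helper: contribution of one position to the stem GC content.
sub stem_gc_score {
    my ($nuc, $struc) = @_;
    return 0 unless $struc eq '(' || $struc eq ')';   # unpaired positions do not contribute
    return  1 if $nuc eq 'G' || $nuc eq 'C';
    return -1 if $nuc eq 'A' || $nuc eq 'U';
    return 0;   # assumption: gaps and ambiguous bases are treated as neutral
}

# Hypothetical helper: classify one ancestor-to-descendant change.
sub classify_change {
    my ($anc_nuc, $anc_struc, $des_nuc, $des_struc) = @_;
    my $delta = stem_gc_score($des_nuc, $des_struc)
              - stem_gc_score($anc_nuc, $anc_struc);
    return $delta > 0 ? 'increase'
         : $delta < 0 ? 'decrease'
         :              'neutral';
}

print classify_change('A', '(', 'G', '('), "\n";   # prints: increase
print classify_change('A', '.', 'C', '.'), "\n";   # prints: neutral
```

Only changes classified as neutral contribute to the GC-independent branch lengths.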
Create all possible mutations: This sequence of scripts will make all possible
mutations to the T. maritima ribosome.
simulate_evolution.pl
- run RNAsalsa using the T. maritima structural constraint file and s1,s2,s3 =
1
concatenate_seqs_and_struc.pl
count_stem_GC.pl
calculate_stem_GC_percent.R
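To make the size of this simulation concrete: a sequence of length L has 3L single-base substitutions. The following minimal Perl sketch enumerates them for a toy sequence. The function name point_mutants is hypothetical, and the sketch uses the same A, C, G, T alphabet as simulate_evolution.pl.

```perl
use strict;
use warnings;

# Hypothetical helper: return every single-base substitution of $seq.
# Each mutant is a hash with the 1-based position, original base, new base, and sequence.
sub point_mutants {
    my ($seq) = @_;
    my @bases = qw(A C G T);
    my @mutants;
    for my $pos (0 .. length($seq) - 1) {
        my $orig = substr($seq, $pos, 1);
        for my $new (@bases) {
            next if $new eq $orig;             # skip the unmutated base
            my $mutant = $seq;
            substr($mutant, $pos, 1) = $new;   # replace one character in the copy
            push @mutants, { pos => $pos + 1, orig => $orig, new => $new, seq => $mutant };
        }
    }
    return @mutants;
}

my @demo = point_mutants("ACG");
printf "%d mutants\n", scalar @demo;   # prints: 9 mutants
```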
gap_reconstruction.pl
#!/usr/bin/perl
use warnings;
use strict;
# This program will read in a series of sequences represented as 0's and 1's, and
# convert the 0's to C's and the 1's to A's.
# The sequences must be previously converted, with gaps encoded as 0's and
# nucleotides encoded as 1's. This allows the numbers to be substituted as nucleotides
# without any confusion about which C's represent cytosine and which represent gaps
# during the conversion process (or which A's represent adenine and which represent
# any nucleotide).
## Input file in clustalw format: numbers_alignment.fna
## Output file in clustalw format: converted_numbers_alignment.fna
my $file = "numbers_alignment.fna"; ## input
open(IN, "<", $file) or die "Cannot open $file: $!";
open(OUT, ">", "converted_numbers_alignment.fna") or die "Cannot open output file: $!"; ## output
while (<IN>) {
    if ($_ =~ /(\S{1,10})(\s{6})(\S*)\n/) { ## If the line contains sequence, split into name, spacing, and sequence
        my $name = $1;    ## part one is the name of the sequence
        my $spacing = $2; ## the six whitespace characters between name and sequence
        my $string = $3;  ## part two is the sequence
        $string =~ tr/10/AC/; ## translate 1 to A and 0 to C
        print OUT "$name$spacing$string\n"; ## Print to outfile, keeping the original spacing
    }
    else {
        print OUT $_; ## If line does not contain sequence, just print that line to the outfile unchanged
    }
}
close(IN);
close(OUT);
infer_gaps.pl
#!/usr/bin/perl
# This program will read in the ancestral sequences in the file ancestral.seqs.fna
# It will read in the reconstructed sequences in converted.output.seqs and use this
# information to determine the location of gaps in the ancestral sequences
## Original sequence file: ancestral.seqs.fna
## Converted sequence file: converted.output.seqs
use warnings;
use strict;
while (defined(my $file = glob("ancestral.seqs.fna"))) {
    print "The file is $file\n";
    open(IN, "<$file") or die "Cannot open $file: $!";
    my @seqs = <IN>;
    close(IN);
    open(IN, "<converted.output.seqs") or die "Cannot open converted.output.seqs: $!";
    my @gaps = <IN>;
    close(IN);
    open(OUT, ">$file.withgaps") or die "Cannot open $file.withgaps: $!";
    for (my $i = 0; $i < scalar(@gaps); $i++) {
        chomp($gaps[$i]);
        chomp($seqs[$i]);
        if ($gaps[$i] =~ /^>/) { # if the line is an annotation line, do not tamper
            print OUT "$gaps[$i]\n";
        }
        else {
            my @gaps_temp = split("", $gaps[$i]);
            my @seqs_temp = split("", $seqs[$i]);
            for (my $j = 0; $j < scalar(@gaps_temp); $j++) { # loop through every nucleotide in that line
                if ($gaps_temp[$j] eq "A") { # if a nucleotide is present (A=nucleotide)
                    print OUT "$seqs_temp[$j]";
                }
                elsif ($gaps_temp[$j] eq "C") { # if a gap is present (C=gap)
                    print OUT "-";
                }
                else {
                    print "Error! Unknown character $gaps_temp[$j]\n";
                }
            }
            print OUT "\n";
        }
    }
    close(OUT);
}
concatenate_seq_and_struc.pl
#!/usr/bin/perl
use warnings;
use strict;
# This program will match the output secondary structure from RNAsalsa to the
# sequences in the structural alignment in CLUSTAL format
# It will output those two things together, with the sequence followed by the
# structure, in one file in fasta format
# It will do this for each set of structures and sequences in this directory, and make
# a different output file for each set
#Read in files to the arrays @struct and @seq
while(defined(my $file=glob("*struct.aln"))){
$file =~ m/(.*?)structaln/;
open(STRUC,"<$1structaln_struct.aln");
my @struct = <STRUC>;
close(STRUC);
open(SEQ,"<$1structaln_sequ.aln");
my @seq = <SEQ>;
close(SEQ);
#Make sure files were read in correctly, and count the length and number of species
#in each file
my %names = ();
foreach (@seq){
# print "$_";
if ($_ !~ /^CLUSTAL/){
if ($_ =~ /^(\w{1,10})\s/){
my $name = $1;
if (defined($names{$name})){}
else {
$names{$name} = 1;
}
}
}
}
my $species = scalar(keys(%names));
print "Found $species species\n";
foreach (@struct){
# print "$_\n";
if ($_ !~ /^CLUSTAL/){
if ($_ =~ /^(\w{1,10})\s/){
my $name=$1;
if (defined($names{$name})){}
else {
print "ERROR! The genes in the struct and seq file don't match\n";
die;
}
}
}
}
my $seq_len = scalar(@seq);
my $struct_len = scalar(@struct);
print "The number of lines in the sequence file is $seq_len.\nThe number of lines in the structure file is $struct_len.\n\n";
### Make two hashes, each keyed by species name, one containing the structures and
### one containing the sequences
### Then print to outfile
my %seq_hash = ();
my %struc_hash = ();
foreach (keys(%names)){
$seq_hash{$_}="";
$struc_hash{$_}="";
}
## This subroutine chomps a string and then splits it at the \s
## It returns the first and last item in the array generated by splitting the string
## For a line from a clustalw alignment this will be the sequence name and the
## sequence
sub splomp {
my $string = $_[0];
chomp($string);
my @work = split(" ",$string);
my $length = scalar(@work);
my @final = ($work[0],$work[$length-1]);
return @final;
}
open(OUT,">seqs_and_struc.out");
my $jcount = 0;
for(my $i=1;$i<$seq_len;$i++){ #$i starts at 1 in order to skip first line of each file, which contains no info
if ($seq[$i] !~ /^\w/){} ## Skips blank lines
else {
my @seq_strs = &splomp($seq[$i]);
my @struc_strs = &splomp($struct[$i]);
if (defined($seq_hash{$seq_strs[0]})){
$seq_hash{$seq_strs[0]} = join("",$seq_hash{$seq_strs[0]},$seq_strs[1]);
$struc_hash{$struc_strs[0]} = join("",$struc_hash{$struc_strs[0]},$struc_strs[1]);
$jcount ++;
}
else {
## This error indicates that the names of the sequences as extracted by the splomp
## function differ from the names of the sequences in the original hash of sequences -
## could be a malfunction in splomp or a problem with infile formatting
print "ERROR THE OUTFILE IS INCORRECT\n";
}
}
}
print "Counted $jcount lines!\n";
foreach my $key (keys(%seq_hash)){
print OUT ">$key\n$seq_hash{$key}\n$struc_hash{$key}\n";
}
close(OUT);
}
count_stem_GC.pl
#!/usr/bin/perl
use warnings;
use strict;
## This script will read in the output of the match_seq_to_struc script,
## seqs_and_struc.out, and output the nucleotide counts of the stem regions and loop
## regions into the file nucleotide_count.txt
open(IN,"<seqs_and_struc.out");
my @lines = <IN>;
close(IN);
## Define two hashes, one with stem and one with loop nucleotide content
my @nucleotides = ("A","C","G","U","-","N");
my %stem = ();
my %loop = ();
foreach (@nucleotides){
$stem{$_}=0;
$loop{$_}=0;
}
open(OUT,">nucleotide_count.txt");
print OUT "File\tStem_-\tStem_A\tStem_N\tStem_C\tStem_G\tStem_U\tLoop_-\tLoop_A\tLoop_N\tLoop_C\tLoop_G\tLoop_U";
## Make three arrays, one each for names, sequences, and structures
my @names = ();
my @seqs=();
my @strucs=();
foreach(@lines){
if ($_ =~ /^>(\w{1,10})/){
push(@names,$1);
}
else {
tr/a-z/A-Z/; #Translates lowercase to uppercase
s/T/U/g; #Substitutes T with U (after uppercasing, so lowercase t is converted too)
if ($_ =~ /^([\w\-]*)\n/){
# print "SEQUENCE IS $1\n";
push(@seqs,$1);
}
else {
$_ =~ /^([\.\-\(\)X]*)\n/;
# print "STRUCTURE IS $1\n";
push(@strucs, $1);
}
}
}
## Now loop through each array and count nucleotides based on secondary
## structures, skipping gaps
for(my $i=0;$i<scalar(@names);$i++){ ## Loop through the array of organism names
my @singles = split("",$seqs[$i]); # array of nucleotides
my @notes = split("",$strucs[$i]); # Array of structure characters
my $line_count =0;
foreach my $nuc (@notes){
if ($nuc =~ /\-/){} ##SKIPS GAPS
else{
if ($nuc =~ /\./){
$loop{$singles[$line_count]}+=1;
}
if ($nuc =~ /[\(\)]/){
$stem{$singles[$line_count]}+=1;
}
## Skips any character that's not ( ) or .
}
$line_count++;
}
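print OUT "\n$names[$i]";
## Print the counts in the same fixed order as the header line, because the order
## returned by keys() is not guaranteed
foreach ("-","A","N","C","G","U"){
print "$_\t$stem{$_}\n";
print OUT "\t$stem{$_}";
$stem{$_}= 0; ## CLEARS THE HASH
}
foreach ("-","A","N","C","G","U"){
print "$_\t$loop{$_}\n";
print OUT "\t$loop{$_}";
$loop{$_} = 0; #CLEARS THE HASH
}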
}
close(OUT);
calculate_stem_GC_percent.R
## This R script will read in the nucleotide count file and calculate the stem and
## loop GC/GCAT content
### MUST BE 12 COLUMNS IN THE OUTPUT FILE
#INPUT nucleotide_count.txt
#OUTPUT: stem_GC.txt
files <- list.files(pattern="*nucleotide_count.txt");
count <- 0
for(j in 1:length(files)){
file <- files[[j]]
count <- count +1
data <- read.table(files[[j]],header=TRUE,row.names="File")
tot.length <- c()
percent.stem <- c()
percent.loop <- c()
stem.gc <- c()
for (i in 1:nrow(data)){
tot.length[i] <- sum(data[i,])
percent.stem[i] <- sum(data[i,1:6])/tot.length[i]
percent.loop[i] <- sum(data[i,7:12])/tot.length[i]
stem.gc[i] <- sum(data[i,4:5])/sum(data[i,1:6]) # Divide by stem length, NOT total length
}
outmat <- cbind(tot.length,percent.stem,percent.loop,stem.gc)
row.names(outmat) <- row.names(data)
splits <- strsplit(files[[j]],"_")
splits <- splits[[1]]
part <- splits[1:(length(splits)-3)]
temp <- ""
for(k in 1:length(part)){
temp <- paste(temp,part[k],sep="_")
}
write.table(outmat, file=paste(temp,"stem_GC.txt",sep=""))
}
find_mutation_types.pl
#!/usr/bin/perl
use warnings;
use strict;
## This script will read in the seqs_and_struc file, and create a hash of sequences
## and of structures, keyed by the name of the organism.
## Then it will read in the R table that contains ancestor and descendant nodes, and
## make a hash of ancestors keyed by descendants.
## Then it will loop through each ancestor-descendant pair, find the corresponding
## sequences and structures for each, and detect positions where there has been a
## change (in seq, struc, or both)
## It will write an output file for each branch, with each mutation as a new line, that
## details the ancestral character, ancestral structure, and descendant character and
## structure.
## This will be in the folder initial_outputs.
## Then it will read in these output files and make a condensed table summarizing
## the changes for each branch
################## Make hashes of sequences and structure ##########
open(IN,"<seqs_and_struc.out");
my @lines=<IN>;
close(IN);
my %seqs=();
my %strucs=();
my $name="";
my $count=0;
foreach(@lines){
if (m/^>(.*)\n/){
$name=$1;
$name =~ s/_//g;
$count=0;
print "The name is $name\n"
}
else{
if ($count==0){ # the first line after the name is the sequence
chomp;
$seqs{$name}=$_;
$count++;
}
else{ # the following line is the structure
chomp;
$strucs{$name}=$_;
$count++;
}
}
}
#foreach(keys(%seqs)){
# print "$_\n$seqs{$_}\n$strucs{$_}\n";
#}
#exit;
################ Make hashes of ancestors and descendants ########
open(IN,"edge_table.txt");
@lines=<IN>;
close(IN);
my %branches=();
foreach(@lines){
chomp;
my @entries = split("\t",$_);
if ($entries[1]=~m/\d{1,2}\_/){
my @new = split("_",$entries[1]);
my $string=$new[1];
for(my $j=2;$j<scalar(@new);$j++){
$string = join("",$string,$new[$j]);
}
$entries[1] = $string;
}
$branches{$entries[1]} = $entries[0]; ## Hash of ancestors keyed by descendants
}
foreach(keys(%branches)){
print "$_\t$branches{$_}\n";
}
################ Loop through each ancestor-descendant and find mismatches
system("mkdir initial_outputs");
foreach my $d (keys(%branches)){
my $a = $branches{$d};
print "$d\t ancestor $a\n";
my @a_seq = split("",$seqs{$a});
my @d_seq = split("",$seqs{$d});
my @a_struc = split("",$strucs{$a});
my @d_struc = split("",$strucs{$d});
my %positions =(); # This will store the positions of the errors
my $i=0;
foreach my $a_nuc(@a_seq){
my $d_nuc=$d_seq[$i];
# print "$a_nuc\t$d_nuc\n";
if ($a_nuc ne $d_nuc){
$positions{$i}="M"; #M for mutation
}
else{ $positions{$i} = "S";} #S for same
$i++;
}
$i=0;
foreach my $a_st(@a_struc){
my $d_st = $d_struc[$i];
if($a_st ne $d_st){
$positions{$i} = "M";
}
$i++;
}
open(OUT, ">initial_outputs/output.$a.$d.txt");
print OUT
"Position\tAnces_character\tAnces_structure\tDes_character\tDes_structure\n";
my $mut_count=0;
my @key = sort(keys(%positions));
foreach(@key){
if ($positions{$_} eq "M"){
print OUT
"$_\t$a_seq[$_]\t$a_struc[$_]\t$d_seq[$_]\t$d_struc[$_]\n";
$mut_count++;
}
}
print "Found $mut_count mutations between $a and $d\n";
close(OUT);
}
count_mutation_types.pl
#!/usr/bin/perl
use warnings;
use strict;
### This script reads the per-branch output files written by find_mutation_types.pl,
### and counts the number of changes along each branch that increase, decrease, or do
### not affect the stem GC content. The counts are written to changes_count.txt
my %scores=(
"A(" => -1,
"A)" => -1,
"U(" => -1,
"U)" => -1,
"A." => 0,
"U." => 0,
"C." => 0,
"G." => 0,
"C(" => 1,
"G(" => 1,
"C)" => 1,
"G)" => 1,
"--" => 0,
);
open(OUT,">changes_count.txt");
print OUT "Branch\tIncrease\tDecrease\tNeutral\n";
while(defined(my $file=glob("initial_outputs/output*"))){ # folder created by find_mutation_types.pl
$file =~ m/outputs\/output\.(.*)\./;
my $name = $1;
open(IN,$file);
my $first = <IN>; # Ignore the first line
my @lines=<IN>;
close(IN);
my $increase=0;
my $decrease=0;
my $neutral=0;
foreach(@lines){
chomp;
my @chars = split("\t",$_);
my $ances = join("",$chars[1],$chars[2]);
my $des = join("",$chars[3],$chars[4]);
print "ancestor $ances descendant $des\n";
my $a_score = $scores{$ances};
my $d_score = $scores{$des};
if ($a_score == $d_score){$neutral++;} # no net effect on stem GC content
elsif ($a_score < $d_score){$increase++;} # descendant scores higher than ancestor
else {$decrease++;} # descendant scores lower than ancestor
}
print OUT "$name\t$increase\t$decrease\t$neutral\n";
}
make_GC_ind_tree.R
## This R script will perform analysis on the table of the number of changes along each
## branch, and make a new tree with branch lengths based on the # of changes that did
## not affect the stem GC content that occurred on that branch
## input data = changes_count.txt
## input tree = GTR.general.tree
## input tree (for node labels) = names.tree.mod.txt
## Read in the files and the branch lengths
library(ape)
data <- read.table("changes_count.txt",row.names="Branch",header=T)
tree <- read.tree("GTR.general.tree")
tree <- drop.tip(tree,c("Sulfurihyd","Hydrogenob","Aaeolicus_","Chydrogeno","Tpseudetha",
"Ttengconge","Tkodakaren","Pfuriosus_"),trim.internal=TRUE)
names.tree <- read.tree("names.tree.mod.txt")
names.tree <- drop.tip(names.tree,c("Sulfurihyd","Hydrogenob","Aaeolicus","Chydrogeno","Tpseudetha",
"Ttengconge","Tkodakaren","Pfuriosus"),trim.internal=TRUE)
tree.names <- c(names.tree$tip.label,names.tree$node.label)
###
p <- data[,1]
n <- data[,2]
ne <- data[,3]
names(p) <-names(sums)
names(n) <- names(sums)
names(ne) <- names(sums)
diffs <- p-n
names(diffs) <- names(sums)
ratios <- n/p
names(ratios) <- names(diffs)
plot(tree$edge.length,ratios[edge.table[,2]])
plot(ne[edge.table[,2]],diffs[edge.table[,2]],pch=18,cex=.9,col="darkgreen",xlim=c(0
,120))
text(ne[edge.table[,2]],diffs[edge.table[,2]],edge.table[,2],cex=.7,pos=4)
### Make a tree with different branch lengths
ne.norm <- ne[edge.table[,2]]/max(ne[edge.table[,2]])
branch.tree <- tree
branch.tree$edge.length <- ne.norm
write.tree(branch.tree,"110809_modified_branch_tree.txt")
###
library(ape)
branch.tree$tip.label <- c("T. maritima str. 2812B","T. maritima str. MSB8","T. cell2",
"T. sp RQ2","T. neapolitana","T. petrophila","T. napthophila","T. lettingae",
"T. elfii","T. subterranea","T. hypogea","T. thermarum","Ts. atlanticus","Ts. geolei",
"Ts. japonicus","Ts. africanus","Ts. melanesiensis","F. islandicum","F. changbaicum",
"F. nodosum","F. gondowanense","K. olearia","Ms. prima","M. hydrogenitolerans",
"M. piezophila","M. okinawensis","M. camini","P. mexicana","P. halophila",
"P. mobilis","P. olearia","P. sibirica")
pdf(file="Modified branch lengths tree")
plot.phylo(branch.tree)
dev.off()
simulate_evolution.pl
#!/usr/bin/perl
use warnings;
use strict;
# This script will make every possible point mutation in the T. maritima ribosomal
# sequence, and then print them to the file sim_output.fna
## The input file is the fna file of the T. maritima 16S rRNA (tmar_16S_seq.fna)
## All of the simulated sequences will be output in sim_output.fna
## The modified sequences will also be written to separate files in the outputs
## directory
## RNAsalsa should be run on sim_output.fna, with s1,s2,s3=1 and a structural
## constraint for the characterized T. maritima 16S rRNA
open(IN,"tmar_16S_seq.fna");
my @lines = <IN>;
close(IN);
my $string="";
foreach(@lines){ #joins the sequence into a large string
chomp($_);
if ($_=~/^>/){}
else{
$string=join("",$string,$_);
}
}
$string=~s/-//g; #removes gaps
my @chars =("A","C","G","T");
my @seq_array = split("",$string);
open(OUT,">sim_output.fna");
system("mkdir outputs");
my $count=0;
foreach(@seq_array){
my @index=();
my $char_count=0;
my @temp_seq_array=@seq_array;
foreach my $char (@chars){
if ($_ eq $char){
$index[$char_count]=1;
}
else{
$index[$char_count]=0;
}
$char_count++;
}
my $ind_count=0;
foreach my $ind (@index){
if ($ind==0){
$temp_seq_array[$count]=$chars[$ind_count];
my $tr_char=$count+1; #The position of the change made
my $orig = $seq_array[$count]; #The original base
my $new=$chars[$ind_count]; #The new base
print OUT ">T_mar_sim char:$tr_char orig:$orig new:$new\n";
my $temp_seq=join("",@temp_seq_array);
print OUT "$temp_seq\n";
##This makes an output file for each mutant
open(SING,">outputs/Tmar_sim.$tr_char.$orig.$new.fna");
print SING ">T_mariMSB8_constraint\n$string\n";
print SING ">T_mari.$tr_char.$orig.$new\n$temp_seq\n";
close(SING);
}
else{}
$ind_count++;
}
$count++; # advance to the next position in the sequence
}
close(OUT);