Improving the Algorithm for Detection of Chimeras used in the Check Chimera Program
Problem and Rationale:
Statistical analyses of experimental data in bacterial genome research report error inconsistently,
and sometimes erroneously, when chimeras have formed. The Check Chimera program, available through the
Ribosomal Database Project, provides the main method of statistical analysis for many projects involving
the comparison of bacterial genomes. Analysis of the experimental protocols shows that this statistical
analysis relies on an algorithm that compares sequences using maps of seven-base oligonucleotides
present in the database. In this project I hope to correct this error, possibly by improving upon the
program's algorithm and using a more stringent method of analysis.
PCR (Polymerase Chain Reaction) is used to amplify a selected sequence, such as an rRNA gene (rDNA), taken
from a bacterial or other genome. Chimeras are sequences that contain segments from two different portions
of the genome that are usually distal from each other; they can be formed either naturally or through PCR.
Chimeras usually arise from errors that occur when PCR is used on longer stretches of DNA. The formation of
chimeras leads to erroneous conclusions in phylogenetic comparisons between bacterial genomes. It may
suggest a genetic link between bacterial genomes that were previously thought to be only distantly related
when no such link actually exists.
The Check Chimera program is used to determine whether a sequence is composed of two halves that are most
similar to clearly different sequences in the database (i.e., whether the sequence is of chimeric origin). By
analyzing these sequences, it will be possible to statistically analyze any errors in the data that may
result from the appearance of chimeras. The method uses unaligned sequences and is based on
comparisons of maps of oligomers (7-mers for SSU rRNA, 8-mers for LSU rRNA). The program returns
only a statistical analysis, which the user must interpret to decide whether the sequence is chimeric in
origin. This interpretation may be simplified, however, by analyzing the origins of known chimeric sequences.
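To make the oligomer comparison concrete, here is a minimal Python sketch of how the number of oligomers shared by two unaligned sequences might be counted. It treats an oligomer map as a plain set of distinct k-base substrings, which is a simplification assumed for illustration. The function names `oligomer_set` and `shared_oligomers` are hypothetical and are not taken from the Check Chimera source.

```python
def oligomer_set(seq, k=7):
    """Return the set of distinct k-base oligomers found in an unaligned sequence.

    k=7 mirrors the 7-mers described for SSU rRNA; pass k=8 for LSU rRNA.
    A sequence shorter than k yields an empty set.
    """
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_oligomers(a, b, k=7):
    """Count the distinct k-base oligomers that occur in both sequences."""
    return len(oligomer_set(a, k) & oligomer_set(b, k))
```

Because only set membership is compared, no alignment of the two sequences is needed, which matches the description of the method as working on unaligned sequences.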
Pending the agreement of the Ribosomal Database Project, I wish to attempt to alter the source
code used in the Check Chimera program in order to find a more efficient way to determine the
presence of chimeras created by PCR cloning. The present algorithm uses a seven-base oligomer
map as its basis for comparison to show whether a chimera has formed. By altering
the analysis parameters used by the program, I hope to decrease or eliminate this error.
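As a rough illustration of what altering the oligomer length could mean, the Python sketch below counts shared oligomers for several values of k. It is not the program's code; `shared_by_k` and its default range of k values are hypothetical choices made only to show how the parameter could be varied.

```python
def shared_by_k(a, b, ks=(6, 7, 8, 9)):
    """Map each oligomer length k to the number of distinct k-mers shared by a and b."""
    counts = {}
    for k in ks:
        kmers_a = {a[i:i + k] for i in range(len(a) - k + 1)}
        kmers_b = {b[i:i + k] for i in range(len(b) - k + 1)}
        counts[k] = len(kmers_a & kmers_b)
    return counts
```

On the toy inputs `shared_by_k("ACGT", "ACGA", ks=(1, 2, 3, 4))` the shared counts are 3, 2, 1 and 0: longer oligomers match less often by chance, so a larger k is a more stringent comparison but leaves fewer matches to work with.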
Overview of the method
Basic Function of Check Chimera Program
For analysis under most databases, a truly chimeric sequence should consist of fragments that
each have closer database relatives than the full-length sequence does. CHECK_CHIMERA returns
three listings of most similar database sequences: one for each of the two fragments, and one for the
full-length sequence. To determine the fragments, a hypothetical break point is moved through the
submitted sequence, and three numbers are calculated at every 10th position. First, the number of
oligomers shared between the sequence part preceding the break point and its most similar database
sequence is counted. Then the same number is determined for the sequence following the break point,
and finally for the full-length sequence. The sum of the first two numbers minus the third is called the
"oligo-gain". If there is a true break point, the oligo-gain should reach its maximum at that point, and it
would be expected to rise consistently towards that break point as well. CHECK_CHIMERA returns a histogram
in which oligo-gain values (horizontal axis) are plotted against every 10th sequence position (vertical
axis).
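The break point scan can be sketched in Python as follows. This is a simplified illustration, not the CHECK_CHIMERA implementation: it assumes that the oligo-gain is the two fragment counts added together minus the full-length count, that the database is a small in-memory list of sequences, and that oligomer maps are plain sets. All function names are hypothetical.

```python
def oligomers(seq, k=7):
    """Set of distinct k-base oligomers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def best_shared(query, database, k=7):
    """Largest number of oligomers the query shares with any one database sequence."""
    q = oligomers(query, k)
    return max((len(q & oligomers(ref, k)) for ref in database), default=0)


def oligo_gain_profile(query, database, k=7, step=10):
    """Return (position, oligo-gain) pairs for a break point moved along the query.

    Assumed definition: gain = shared(left part) + shared(right part) - shared(full).
    """
    full = best_shared(query, database, k)
    profile = []
    for pos in range(step, len(query), step):
        left = best_shared(query[:pos], database, k)
        right = best_shared(query[pos:], database, k)
        profile.append((pos, left + right - full))
    return profile


def likely_break_point(profile):
    """Pick the (position, gain) pair with the highest oligo-gain."""
    return max(profile, key=lambda item: item[1])
```

This sketch recomputes the database oligomer sets at every position for the sake of brevity; a practical version would build them once and reuse them.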
The break point is assigned to the position with the highest value. A histogram is then generated,
which summarizes the output data. Following the histogram, three ranked
lists of best matches are returned: one for each of the fragments and one for the full-length sequence. If the
histogram values consistently rise toward a maximum and then fall, the sequence is probably
chimeric. The higher the maximum, the more likely it is that the sequence is chimeric in origin. If
the highest S_ab value (a value assigned to both the full-length sequence and the fragments) for the full-length
sequence is lower than those of both fragments, that is a further positive indication.
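The two interpretation rules above could be expressed as simple checks, sketched here in Python. This is only one possible reading: the `tolerance` parameter, the requirement that the peak not sit at either end of the profile, and both function names are my own assumptions, and the S_ab values are taken as given inputs rather than computed.

```python
def rises_then_falls(gains, tolerance=0):
    """True if the values climb to a single interior peak and then descend.

    `tolerance` allows small dips on the way up (or bumps on the way down).
    """
    if len(gains) < 3:
        return False
    peak = max(range(len(gains)), key=gains.__getitem__)
    if peak == 0 or peak == len(gains) - 1:
        return False  # no rise before the peak, or no fall after it
    rising = all(b >= a - tolerance
                 for a, b in zip(gains[:peak], gains[1:peak + 1]))
    falling = all(b <= a + tolerance
                  for a, b in zip(gains[peak:], gains[peak + 1:]))
    return rising and falling


def fragments_match_better(s_ab_full, s_ab_left, s_ab_right):
    """True if the best full-length S_ab is lower than the best S_ab of both fragments."""
    return s_ab_full < s_ab_left and s_ab_full < s_ab_right
```

Neither check is conclusive on its own; they only mirror the visual judgment a user would otherwise make from the histogram and the ranked lists.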
Possible Problems:
The program's analysis, however, has some inherent failings that make
interpretation of the data difficult under any set of parameters. The histogram profile can vary
greatly. A truly chimeric short sequence from a highly conserved rRNA region can lead to small
but still consistently rising values, since the parent sequences do not differ much. High values would be
expected if it spans a variable region; if a sequence includes both conserved and variable
regions, the histogram may appear quite asymmetric and deviate significantly from the ideal
example. The prediction also becomes less certain if one of the fragments is very short. The
signal can weaken to the point of having little meaning when the submitted sequence has no
close relatives in the database.