Improving the Algorithm for Detection of Chimeras Used in the Check Chimera Program

Problem and Rationale:

Statistical analyses of experimental data in bacterial genome research report errors inconsistently when chimeras have formed. The Check Chimera program, available through the Ribosomal Database Project, provides the main method of statistical analysis for many projects that compare bacterial genomes. Analysis of the experimental protocols shows that this analysis relies on an algorithm that compares sequences using the seven-oligonucleotide maps present in the database. In this project I hope to reduce this source of error by improving the program's algorithm and using a more stringent method of analysis.

PCR (Polymerase Chain Reaction) is used to amplify copies of a selected rRNA gene (rDNA) sequence taken from a bacterial or other genome. Chimeras are sequences that contain segments from two different portions of the genome, usually distant from each other; they can form either naturally or through PCR. They usually arise from errors that occur when PCR is used on longer stretches of DNA. The formation of chimeras leads to erroneous conclusions in phylogenetic comparisons between bacterial genomes: it may suggest a genetic link between genomes that were previously thought to be only distantly related, when no such link actually exists.

The Check Chimera program is used to determine whether a sequence is composed of two halves that are most similar to clearly different sequences in the database (i.e., whether the sequence is of chimeric origin). By analyzing these sequences, it will be possible to statistically analyze any errors in the data that arise from the appearance of chimeras. The method uses unaligned sequences and is based on comparisons of maps of oligomers (7-mers for SSU rRNA, 8-mers for LSU rRNA).
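To make the idea of an oligomer map concrete, the following is a minimal sketch of how two unaligned sequences could be compared by the oligomers they share. It is not the Check Chimera source code: the function names oligomer_map and shared_oligomers are hypothetical, and treating a map as the set of distinct oligomers (skipping windows that contain ambiguous bases) is an assumption made for illustration.

```python
def oligomer_map(seq, k=7):
    """Return the set of distinct k-mers in an unaligned sequence.

    Assumption: a "map" is modelled as a plain set of oligomers, and any
    window containing a character other than A, C, G, T or U is skipped.
    """
    seq = seq.upper()
    valid = set("ACGTU")
    return {
        seq[i:i + k]
        for i in range(len(seq) - k + 1)
        if set(seq[i:i + k]) <= valid
    }


def shared_oligomers(a, b, k=7):
    """Count the distinct k-mers that occur in both sequences."""
    return len(oligomer_map(a, k) & oligomer_map(b, k))


if __name__ == "__main__":
    # 7-mers would be used for SSU rRNA and 8-mers for LSU rRNA.
    print(shared_oligomers("ACGTACGTAC", "CGTACGTACG", k=7))
```

Because the comparison works on sets of oligomers rather than on positions, no alignment of the two sequences is needed.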
The program returns only a statistical analysis, which the user must interpret to decide whether or not the sequence is chimeric in origin. This may be simplified, however, by analyzing sequences of known chimeric origin. Pending the agreement of the Ribosomal Database Project, I wish to attempt to alter the source code used in the Check Chimera program in order to find a more efficient way to determine the presence of chimeras created by PCR cloning. The present algorithm uses a seven-base oligomer map as its basis for comparison to show whether a chimera has formed. By altering the parameters the program uses for this analysis, I hope to decrease or eliminate this error.

Overview of the method

Basic Function of Check Chimera Program

For analysis under most databases, a truly chimeric sequence should consist of fragments that each have closer database relatives than the full-length sequence does. CHECK_CHIMERA returns three listings of most similar database sequences: one for each of the two fragments, and one for the full-length sequence. To determine the fragments, a hypothetical break point is moved through the submitted sequence, and three numbers are calculated at every 10th position. First, the number of oligomers shared between the part of the sequence preceding the break point and its most similar database sequence is determined. Then the same number is determined for the part of the sequence following the break point. Finally, the corresponding number for the full-length sequence is determined and compared with the two fragment numbers; the resulting value is called the "oligo-gain". If there is a true break point, the oligo-gain value should reach its maximum at that point, and it would be expected to rise consistently towards that break point as well. CHECK_CHIMERA returns a histogram in which oligo-gain values (horizontal axis) are plotted against every 10th sequence position (vertical axis). The break point is assigned to the position with the highest value.
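The break-point scan described above can be sketched as follows. This is an illustrative reconstruction rather than the program's actual code: the names kmers, best_match, scan_break_points and likely_break_point are hypothetical, the database is modelled as a plain dictionary of name to sequence, and the oligo-gain is assumed to be the two fragment counts added together minus the full-length count.

```python
def kmers(seq, k=7):
    """Set of distinct k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def best_match(query, database, k=7):
    """Return (name, shared_count) for the database entry sharing the most k-mers."""
    q = kmers(query, k)
    return max(
        ((name, len(q & kmers(ref, k))) for name, ref in database.items()),
        key=lambda item: item[1],
    )


def scan_break_points(query, database, k=7, step=10):
    """Move a hypothetical break point through the query every `step` bases.

    Assumption: oligo-gain = left count + right count - full-length count.
    Returns a list of (position, oligo_gain) pairs.
    """
    full = best_match(query, database, k)[1]
    profile = []
    for pos in range(step, len(query), step):
        left = best_match(query[:pos], database, k)[1]
        right = best_match(query[pos:], database, k)[1]
        profile.append((pos, left + right - full))
    return profile


def likely_break_point(profile):
    """The break point is assigned to the position with the highest oligo-gain."""
    return max(profile, key=lambda item: item[1])


if __name__ == "__main__":
    # Toy data with 3-mers so the example stays readable; real use would be k=7 or 8.
    parent_a = "AAACACCCAA"
    parent_b = "GGGTGTTTGG"
    database = {"A": parent_a, "B": parent_b}
    chimera = parent_a + parent_b
    for pos, gain in scan_break_points(chimera, database, k=3, step=5):
        print(f"{pos:4d} {'*' * max(gain, 0)} ({gain})")
```

In this toy case the two parents share no oligomers, so the gain peaks sharply at the junction; with real rRNA sequences the profile would be expected to be much noisier.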
The histogram summarizes the output data. Following the histogram, three ranked lists of best matches are returned: one for each of the fragments and one for the full-length sequence. If the histogram values rise consistently toward a maximum and then fall, the sequence is probably chimeric. The higher the maximum, the more likely it is that the sequence is chimeric in origin. If the highest S_ab value (a value assigned to both the full-length sequence and the fragments) for the full-length sequence is lower than those of both fragments, that is a further positive indication.

Possible Problems:

The program's analysis, however, has some inherent failings that make interpretation of the data difficult under any set of parameters. The histogram profile can vary greatly. A truly chimeric short sequence from a highly conserved rRNA region can produce small but still consistently rising values, since the sequences do not differ much. High values would be expected if the sequence spans a variable region; if a sequence includes both conserved and variable regions, the histogram may appear quite asymmetric and deviate significantly from the ideal example. The prediction also becomes less certain if one of the fragments is very short. Where the submitted sequence has no close relatives in the database, the signal can weaken to the point of carrying little meaning.
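The two interpretation rules given earlier (a histogram that rises consistently to a maximum and then falls, and a full-length S_ab value lower than those of both fragments) could in principle be checked automatically. The sketch below is one possible way to do so under stated assumptions: the function names rises_then_falls and chimera_evidence are hypothetical, the S_ab values are taken as inputs rather than computed, and the tolerance parameter is an invented allowance for the noisy or asymmetric profiles described above.

```python
def rises_then_falls(gains, tolerance=0):
    """True if the values climb to a single interior peak and then decline.

    `tolerance` (an assumption, not a Check Chimera parameter) is the largest
    step against the trend that is still accepted on either side of the peak.
    """
    if len(gains) < 3:
        return False
    peak = gains.index(max(gains))
    if peak == 0 or peak == len(gains) - 1:
        return False  # a maximum at either end gives no rise-and-fall shape
    rising = all(
        later >= earlier - tolerance
        for earlier, later in zip(gains[:peak + 1], gains[1:peak + 1])
    )
    falling = all(
        later <= earlier + tolerance
        for earlier, later in zip(gains[peak:], gains[peak + 1:])
    )
    return rising and falling


def chimera_evidence(gains, s_ab_full, s_ab_left, s_ab_right, tolerance=0):
    """Summarize the two indications; the user still has to interpret them."""
    return {
        "consistent_peak": rises_then_falls(gains, tolerance),
        "peak_gain": max(gains),
        "fragments_fit_better": s_ab_full < min(s_ab_left, s_ab_right),
    }


if __name__ == "__main__":
    # Hypothetical oligo-gain values and S_ab scores, for illustration only.
    print(chimera_evidence([1, 3, 8, 4, 2], 0.60, 0.85, 0.90))
```

Such a check only restates the rules of thumb; it does not remove the weaknesses listed under Possible Problems, such as very short fragments or sequences with no close database relatives.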