Analysis of codon pair frequencies in genomic sequence

Buchan, JR, Aucott, L. and Stansfield, I

Introduction

Perl version 5.005_3 was used to run all programs described. The E. coli genome sequence has been supplied for test purposes in a trimmed format (a sample of 100 ORFs only, in TIGRecoli-trim.txt). The genome has been reformatted to remove all hard returns (paragraph marks) except those that separate one sequence from another. Sequences are in the format:

>sequence name1 ###ATGNNNNNNNNNNNNNNNNNNNN hr
>sequence name2 ###ATGNNNNNNNNNNNNNNNNNNNN hr

Where:
‘ATG’ represents the first codon of each open reading frame.
‘###’ represents marker text defining the start of the open reading frame.
‘hr’ denotes a hard return.
‘N’ denotes any nucleotide.

In the instructions below, bold font indicates a command to enter in a DOS window, whilst italicised comments represent the name of a file (e.g. programs file, spreadsheet, sequence file). Menu commands are underlined.

Generating codoncount data using E. coli as an example

The program codoncount.txt counts all the codon pairs in a genome, ORF by ORF, and records the total observed count of each of the 3904 pair types (61 sense codons x 64 sense and stop codons). It also calculates the expected number of codon pairs as each ORF is processed, and records a cumulative expected codon pair count for the whole genome.

Results are stored in the file named ‘results.txt’, which can be opened in Excel as a tab-delimited .txt file. In Excel the spreadsheet will comprise an upper block of data, 64 rows x 130 columns (starting in cell A3), which is the observed codon pair data, and a lower 64R x 130C block (starting in cell A68), which is the expected codon pair count data.

The data can be converted into a more manageable form using the spreadsheet ‘Codoncount clean.xls’. Select the entire results.txt worksheet, copy it, and paste the data into the ‘Codoncount RAW’ worksheet of Codoncount clean.xls at cell A1.
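The counting step described for codoncount.txt can be illustrated with a short sketch. It is written in Python for illustration only and is not the supplied Perl program. Two points are assumptions, not taken from the program itself: each record is assumed to sit on one line with the ORF following the ‘###’ marker, and the expected count for a pair within one ORF is assumed to be (count of the codon in the 5’ position) x (count of the codon in the 3’ position) / (number of pairs in that ORF). The function names read_orfs and count_pairs are hypothetical.

```python
from collections import Counter
from itertools import product

# All 64 codons, the 3 stop codons, and the 61 sense codons (standard genetic code).
CODONS = {"".join(c) for c in product("TCAG", repeat=3)}
STOPS = {"TAA", "TAG", "TGA"}
SENSE = CODONS - STOPS


def read_orfs(lines):
    """Yield (name, sequence) per record; the ORF is the text after '###'.

    Assumption: one record per line, '>name ###SEQUENCE'.
    """
    for line in lines:
        line = line.strip()
        if line.startswith(">") and "###" in line:
            name, seq = line[1:].split("###", 1)
            yield name.strip(), seq.strip().upper()


def count_pairs(orfs):
    """Return cumulative (observed, expected) codon pair counts over all ORFs."""
    observed, expected = Counter(), Counter()
    for _name, seq in orfs:
        # Split the ORF into complete codons, dropping any trailing 1-2 bases.
        codons = [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]
        # Keep pairs whose 5' codon is a sense codon and whose 3' codon is
        # any valid codon (sense or stop); pairs containing 'N' are skipped.
        pairs = [(p, a) for p, a in zip(codons, codons[1:])
                 if p in SENSE and a in CODONS]
        n = len(pairs)
        if n == 0:
            continue
        observed.update(pairs)
        # ASSUMED expected model: independence of the two positions within the ORF.
        p_counts = Counter(p for p, _ in pairs)
        a_counts = Counter(a for _, a in pairs)
        for p, pc in p_counts.items():
            for a, ac in a_counts.items():
                expected[(p, a)] += pc * ac / n
    return observed, expected


if __name__ == "__main__":
    demo = [">demo gene ###ATGAAAAAATAA"]  # made-up record, not real data
    obs, exp = count_pairs(read_orfs(demo))
    for pair in sorted(obs):
        print(pair, obs[pair], round(exp[pair], 3))
```

Under this assumed model the expected counts for an ORF sum to the number of pairs observed in that ORF, so the observed and expected totals for the genome are directly comparable.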
This data is then referenced in the ‘Cleaned data’ worksheet of Codoncount clean.xls such that observed codon pair data (top) and expected codon pair data (bottom) are listed in 65R x 62C blocks. 5’ P site codons are listed in rows 3 (observed) and 71 (expected), and 3’ A site codons are listed in column A.

Analysing codoncount data in other spreadsheets

Data in Codoncount clean.xls can now be pasted directly into the spreadsheet Residual calcualtor.xls. This normalises expected count data for amino acid pair bias and calculates residual scores for every codon pair.

Copy the observed codon pair data (cells A1:BJ67) from Codoncount clean.xls and Paste special values into the Residual calcualtor.xls ‘observed’ worksheet at cell A1.

Copy the expected codon pair data (cells A69:BJ165) from Codoncount clean.xls and Paste special values into the Residual calcualtor.xls ‘expected’ worksheet at cell A1.

Codonpairindex

The program codonpairindex.txt calculates the mean codon pair residual score (i.e. the Codon Pair Index, CPI) for each ORF in a genome. The program is first ‘loaded’ with a library of residual scores for each potential codon pair. The supplied version of the program contains the E. coli codon pair residual library. The filename of the genome being analysed must be accurately referenced in codonpairindex.txt (the version supplied refers to TIGRecoli-trim.txt). Once the program is run, the mean CPI score for each ORF is written to the file results.txt. This can be opened in Excel.

Random

Random.txt is a program that generates a randomised genome for a given organism based on the codon usage values of that organism. As supplied, Random.txt is set to generate random ORFs with an E. coli codon bias. This is achieved by selecting a random number between 0 and 1. Each codon is assigned to a discrete band of values within this range. The width of that band is proportional to that codon’s biased usage among the 61 sense codons of the genetic code. An ‘if’ statement then determines which of the 61 sense codons will be chosen in response to each random number generation. Gene size and the number of genes generated are specified by editing the program at lines 2 and 5. The text “###” is introduced prior to the gene sequence so that the output genome file is ready for analysis by the programs listed above.
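The band-based codon selection can be illustrated with a short sketch. It is written in Python for illustration only and is not the supplied Perl program. A cumulative table with a binary search stands in for the chain of ‘if’ tests, which gives the same selection rule. The function names, the record name and the toy usage values are all hypothetical. The usage values in particular are made up and are not E. coli codon usage figures.

```python
import random
from bisect import bisect
from itertools import accumulate


def build_bands(usage):
    """Return (codons, upper_edges): band widths are proportional to usage."""
    codons = sorted(usage)
    total = sum(usage[c] for c in codons)
    edges = list(accumulate(usage[c] / total for c in codons))
    edges[-1] = 1.0  # guard against floating-point drift in the last edge
    return codons, edges


def pick_codon(codons, edges, r):
    """Map a number r in [0, 1) to the codon whose band contains it."""
    return codons[bisect(edges, r)]


def random_record(name, n_codons, codons, edges, rng):
    """Build one '>name ###SEQUENCE' record of n_codons randomly drawn codons."""
    seq = "".join(pick_codon(codons, edges, rng.random()) for _ in range(n_codons))
    return ">" + name + " ###" + seq


if __name__ == "__main__":
    # Toy usage table for three codons only (illustrative values, NOT real data);
    # a real run would supply usage values for all 61 sense codons.
    toy_usage = {"AAA": 2, "GAA": 1, "CTG": 1}
    codons, edges = build_bands(toy_usage)
    rng = random.Random(1)  # fixed seed so the demo is repeatable
    for i in range(3):      # number of genes
        print(random_record("random_%d" % (i + 1), 10, codons, edges, rng))
```

In this sketch gene size and gene number are function arguments, whereas the supplied program has them edited in place. Each output record carries the “###” marker, in line with the input format described in the Introduction.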