DNA/PROTEIN SEQUENCE ANALYSIS: PAIRWISE COMPARISON Sequence comparison techniques fall into two categories: pairwise or multiple. Pairwise comparison provides computer tools to directly compare two sequences, either nucleic acid or peptide. These tools are very useful when verifying sequence data, cloning projects, PCR analysis, and many more. Anytime you would like to compare two sequences, these are the tools to use. The programs are similar with respect to input but vary greatly regarding the output formats. This tutorial will cover the following programs: GAP, BESTFIT, FRAMEALIGN, COMPARE, DOTPLOT, and GAPSHOW. The sequences you will use are as follows: M19993 M15034 P08689 gb_ro - rodent database within GenBank gb_ro - rodent database within GenBank UniProt database GAP When you want an alignment that covers the whole length of both sequences, use GAP. Gap uses the algorithm of Needleman and Wunsch to find the alignment of two complete sequences that maximizes the number of matches and minimizes the number of gaps. Gap reads a scoring matrix that contains values for every possible GCG symbol match. Gap finds an alignment with the maximum possible quality where the quality of an alignment is equal to the sum of the values of the matches (each match scored with the scoring matrix) less the gap creation penalty times the number of internal gaps and less the gap extension penalty times the total length of the internal gaps. The alignment found by Gap is, therefore, sensitive to the scoring matrix values and the gap penalties. There is no penalty if either sequence is shifted to the place where the alignment begins unless end gaps are penalized by using the command-line parameter -ENDWeight. Gap considers all possible alignments and gap positions and creates the alignment with the largest number of matched bases and the fewest gaps. You provide a gap creation penalty and a gap extension penalty in units of matched bases. In other words, Gap must make a profit of gap creation penalty number of matches for each gap it inserts. If you choose a gap extension penalty greater than zero, Gap must, in addition, make a profit for each gap inserted of the length of the gap times the gap extension penalty. Gap uses the alignment method of Needleman and Wunsch (J. Mol. Biol. 48; 443-453 (1970)) that has been shown to be equivalent to Sellers. 1 Gap accepts two individual nucleotide sequences or protein sequences as input. The function of Gap depends on whether your input sequence(s) are protein or nucleotide. Programs determine the type of a sequence by the presence of either Type: N or Type: P on the last line of the text heading just above the sequence. Input sequences may not be more than 30,000 symbols long. The program attempts to allocate enough computer memory to align the input sequences. In the worst case, where the two sequences being aligned are unrelated, the allocation is proportional to the product of the lengths of the two input sequences. However, in many cases where the sequences being aligned are more closely related, the computer can determine an optimal alignment using less memory. When memory on your computer is limiting and the program cannot allocate all of the memory it needs to align long sequences, it completes the alignment in whatever memory it can allocate and displays the message *** Alignment is not guaranteed to be optimal ***. Because the criteria used in the calculation for guaranteeing an optimal alignment are very stringent, the alignment often may be optimal even if this message is displayed. For the following exercise your SeqLab screen should look like this. We suggest opening a new list. You can then delete the entire list from directory or save it as a HELP tool when performing additional comparisons. 2 To start the GAP program, highlight the two nucleotide sequences, move to functions and select GAP from the Pairwise Comparison Menu: The following window will appear. The top lines contain the sequences that are being compared. 3 Click the Options… button. For this exercise, select “Set thresholds...” and enter “4” in the window, de-select “Abbreviate large gaps...”, and select both “New sequence file for....” options. Close the Options window and select RUN. 4 For most sequences, the default settings will provide a reasonable alignment. Gap is capable of ignoring a region of excellent similarity or similarity between two sequences if it can produce an alignment with equal or better quality in some other way. BestFit is a better tool to search for weak or unknown similarity or similarity that you suspect is not coextensive along the sequences. 5 It is extremely important that you think formally about what Gap does. Using Gap rather than BestFit implies that you want an alignment where neither sequence is truncated. To see a sample output file, click here. (link to gap_output.txt) GAPSHOW GapShow displays an alignment by making a graph that shows the distribution of similarities and gaps. The two input sequences should be aligned with either Gap or BestFit before they are given to GapShow for display. BestFit and Gap make optimal alignments of sequences by adding gaps to maximize the number of matches. Gap and BestFit normally display the alignments, but they can also write the aligned sequences (with gaps inserted) into new sequence files. GapShow reads these files and plots the distribution of the differences or similarities in the alignment graphically. The sequences are represented by horizontal lines. The horizontal lines have openings at points where there are gaps in either sequence. Regions of interest, such as coding regions, can be shown outside these lines. A large vertical line between the sequences can indicate either a difference or similarity. If differences are being shown, gaps are also depicted with short vertical lines to convey that the gap is a difference between the two sequences. GapShow accepts two aligned individual nucleotide sequences or protein sequences as input. The function of GapShow depends on whether your input sequence(s) are protein or nucleotide. Programs determine the type of a sequence by the presence of either Type: N or Type: P on the last line of the text heading just above the sequence. Enter the two “.gap” sequences from the Output Manager into your Main List. Select both “.gap” sequences in the Main List, move to Functions menu and from the Pairwise Comparison menu select GAPSHOW. 6 The following window will appear. Make sure that your window contains sequences that end with “.gap”. 7 Click on the Options button. When the Options window appears, select “plot for similarities” and “represent the sequences....”. Close the Options window and click on Run in the GAPSHOW window. The output will be loaded into your Output Manager. Open your Output Manager, select the “.figure” file and Display the output. The following graph will be displayed. The thicker bars represent the similarities while the thinner line represents the gaps. 8 The BESTFIT program also will generate “.gap” sequences that can be used to generate a GAPSHOW plot. To select this option, in the BestFit option window, select the “New sequence....” toggles and the sequences will be loaded into your Output Manager. To print the alignment, go to output manager and select the “.pair” output file and click on print. Follow the printing instructions found in the the printing FAQ. To better understand the statistics of the alignment and gain a better understanding of the results obtained for your pairwise comparison, refer to the paper by Needleman and Wunsch (J. Mol. Biol. 48: 443-453 (1970)). BESTFIT BestFit is the most powerful method in the Wisconsin Package(TM) for identifying the best region of similarity between two sequences whose relationship is unknown. BestFit makes an optimal alignment of the best segment of similarity between two sequences. Optimal alignments are found by inserting gaps to maximize the number of matches using the local homology algorithm of Smith and Waterman. BestFit inserts gaps to obtain the optimal alignment of the best region of similarity between two sequences, and then displays the alignment in a format similar to the output from Gap. The 9 sequences can be of very different lengths and have only a small segment of similarity between them. You could take a short RNA sequence, for example, and run it against a whole mitochondrial genome. BestFit accepts two individual nucleotide sequences or protein sequences as input. The function of BestFit depends on whether your input sequence(s) are protein or nucleotide. Programs determine the type of a sequence by the presence of either Type: N or Type: P on the last line of the text heading just above the sequence. If your input sequences are peptide sequences, this program uses a scoring matrix, blosum62.cmp, with comparison values derived from a study of substitutions between amino acid pairs in ungapped block of aligned protein segments as measured by Henikoff and Henikoff (Proc. Natl. Acad. Sci. USA 89; 10915-10919 (1992)). Algorithm: BestFit uses the local homology algorithm of Smith and Waterman (Advances in Applied Mathematics 2; 482-489 (1981)) to find the best segment of similarity between two sequences. BestFit reads a scoring matrix that contains values for every possible GCG symbol match. The program uses these values to construct a path matrix that represents the entire surface of comparison with a score at every position for the best possible alignment to that point. The quality score for the best alignment to any point is equal to the sum of the scoring matrix values of the matches in that alignment, less the gap creation penalty times the number of gaps in that alignment, less the gap extension penalty times the total length of all gaps in that alignment. The gap creation and gap extension penalties are set by you. If the best path to any point has a negative value, a zero is put in that position. After the path matrix is complete, the highest value on the surface of comparison represents the end of the best region of similarity between the sequences. The best path from this highest value backwards to the point where the values revert to zero is the alignment shown by BestFit. This alignment is the best segment of similarity between the two sequences. For nucleic acids, the default scoring matrix has a match value of 10 for each identical symbol comparison and -9 for each non-identical comparison (not considering nucleotide ambiguity symbols for this example). The quality score for a nucleic acid alignment can, therefore, be determined using the following equation: Quality = 10 x TotalMatches + -9 x TotalMismatches - (GapCreationPenalty x GapNumber) (GapExtensionPenalty x TotalLengthOfGaps) The quality score for a protein alignment is calculated in a similar manner. However, while the default nucleic acid scoring matrix has a single value for all non-identical comparisons, the default protein scoring matrix has different values for the various non-identical amino acid comparisons. The quality score for a protein alignment can therefore be determined using the following equation (where Total(AA) is the total number of A-A (Ala-Ala) matches in the 10 alignment, CmpVal(AA) is the value for an A-A comparison in the scoring matrix, Total(AB) is the total number of A-B (Ala-Asx) matches in the alignment, CmpVal(AB) is the value for an AB comparison in the scoring matrix, ...) : Quality = CmpVal(AA) x Total(AA) + CmpVal(AB) x Total(AB) + CmpVal(AC) x Total(AC) + CmpVal(ZZ) x Total(ZZ) - (GapCreationPenalty x GapNumber) - (GapExtensionPenalty x TotalLengthOfGaps) BestFit Always Finds Something BestFit always finds an alignment for any two sequences you compare -- even if there is no significant similarity between them! You must evaluate the results critically to decide if the segment shown is not just a random region of relative similarity. ***** The Segments Shown Obscure Alternative Segments BestFit only shows one segment of similarity, so if there are several, all but one are obscured. You can approach this problem with graphic matrix analysis (see the Compare and DotPlot programs). Alternatively, you can run BestFit on ranges outside the ranges of similarity found in earlier runs to bring other segments out of the shadow of the best segment. The Best Fit is Only One Member of a Family Like all fast gapping algorithms, the alignment displayed is a member of the family of best alignments. This family may have other members of equal quality, but will not have any member with a higher quality. The family is usually significantly different for different choices of gap creation and gap extension penalties. See the CONSIDERATIONS topic in the entry for the Gap program in the Program Manual to learn more about how to assign gap creation and gap extension penalties. Default Gap Penalties are Specific to Each Scoring Matrix BestFit chooses default gap creation and extension penalties that are appropriate for the scoring matrix it reads. If you select a different scoring matrix with the -MATRix command-line parameter, the program will adjust the default gap penalties accordingly. (See Appendix VII for information about how to set the default gap penalties for any scoring matrix.) You can use -GAPweight and -LENgthweight to specify alternative gap penalties if you don't want to accept the default values. 11 The Public Scoring Matrix for Nucleic Acid Comparisons is Very Stringent The scoring matrix swgapdna.cmp awards matches +10 and penalizes mismatches -9 so the segments found may be very brief. This penalty means that the alignment cannot be extended by even two bases to pick one extra match. The scoring matrix used by Smith and Waterman, when local alignments were first described, used -.33 for the mismatch penalty and +1 for the match score. You can use Fetch to copy randomdna.cmp and rename it swgapdna.cmp to use a value of -3 for each mismatch and +10 for each match, or use nwsgapdna.cmp, which has no mismatch penalty at all. Rapid Alignment When possible, BestFit tries to find the optimal alignment very quickly. If this rapid alignment is not unambiguously optimal, BestFit automatically realigns the sequences to calculate the optimal alignment. When this occurs, the monitor of alignment progress on your terminal screen (Aligning...) is displayed twice for a single alignment. To run BESTFIT, select the gb_ro:m15034 and the gb_ro:m19993 sequences from the Main list window, move your cursor to Functions and select BestFit from the Pairwise Comparison Menu: The BestFit window will appear. Click on the Options button. The Option Window is very similar to the GAP Option window. From this window, select “Set thresholds....” and type 12 “4” into the highlighted window, de-select “Abbreviate output.....”, and select both “New sequence...”. Change the name of the output file names by adding a “-1” behind the accession number (ex. M19993-1.gap). This will distinguish the GAP and BestFit result sequences. 13 The results obtained from the BestFit program provides the user with the BEST alignment found. The best alignment may not contain the entire sequence; in fact it usually does not. One difference is that the ends of the genomic sequence have been left off the alignment. Only GAP will provide full length sequences in the output alignment. Once again, for most alignments the default options will provide a reasonable alignment. To view a sample output file, click here. (link to bestfit_output.txt) The .gap files can be used by the GAPSHOW program to provide a graphical representation of the BestFit results. Follow the instructions found in the GAPSHOW section to produce the following figure: 14 FRAMEALIGN To compare a nucleic acid sequence to a peptide sequence use FrameAlign. This is a very useful tool for verifying coding regions during cloning or sequencing projects. FrameAlign creates an optimal alignment of the best segment of similarity (local alignment) between a protein sequence and the codons in all possible reading frames of a nucleotide sequence. Optimal alignments may include reading frame shifts. FrameAlign inserts gaps to obtain the optimal local alignment of the best region of similarity between a protein sequence and the codons in a nucleotide sequence. Because FrameAlign can align the protein to codons in different reading frames of the nucleotide sequence, it can identify sequence similarity even when the nucleotide sequence contains reading frame shifts. In standard sequence alignment programs, you routinely specify gap creation and extension penalties. In addition to these penalties, FrameAlign also allows you to specify a separate frameshift penalty for the creation of gaps that result in reading frame shifts in the nucleotide sequence. (See the ALGORITHM topic for a more detailed explanation of how gaps are penalized). 15 ALGORITHM: FrameAlign aligns a nucleotide sequence with a protein sequence. The alignment procedure is an extension of the local alignment algorithm of Smith and Waterman (Advances in Applied Mathematics 2; 482-489 (1981)) that is modified to determine the score of the best segment of similarity between a protein sequence and the codons in a nucleotide sequence. Scoring Matrix To create the alignments, FrameAlign requires a scoring matrix that contains values for matches between all possible amino acids and codons. FrameAlign derives this amino acid - codon scoring matrix on the fly from a translation table and an amino acid substitution matrix. The translation table contains a list of all possible codons for each amino acid. The amino acid substitution matrix contains match values for the comparison of all possible amino acids. In the derived amino acid - codon scoring matrix, the value of a match between any amino acid and any codon is the value of the match between the amino acid and the translated codon in the amino acid substitution matrix. If a codon contains IUB nucleotide ambiguity symbols, and all possible unambiguous representations of the codon translate to the same amino acid (e.g. MGR always translates to arginine in the standard genetic code), then the value of a match between that codon and any amino acid can be similarly determined. If all possible unambiguous representations of the codon do not translate to the same amino acid, then the value of a match between that codon and any amino acid is 0. FrameAlign chooses default gap creation and extension penalties that are appropriate for the scoring matrix it reads. If you select a different scoring matrix with the -MATRix command-line parameter, the program will adjust the default gap penalties accordingly. You can respond to the program prompts or use –GAPweight and -LENgthweight to specify alternative gap penalties if you don't want to accept the default values. Protein-Nucleotide Alignment FrameAlign uses the values in the amino acid - codon scoring matrix to determine the score of the best alignment between the protein and nucleotide sequences. If you consider a graph, or path matrix, with the nucleotide sequence placed on the X axis and the protein sequence placed on the Y axis, then every point on the path matrix represents the best alignment between the sequences that ends at that point. For any point on the path matrix, the X coordinate is the first nucleotide of the final codon in the alignment, and the Y coordinate is the final amino acid in the alignment. Each possible alignment end point is associated with a path, which is a series of steps (insertions, deletions, matches) through the path matrix required to create the alignment. Each step has its own score, and the scores for all the steps in an alignment path determine the quality score for the alignment. The quality score for an alignment is equal to the sum of the scoring matrix values of the 16 matches in the alignment, minus the gap creation penalty multiplied by the number of gaps in the alignment, minus the frameshift penalty multiplied by the number of gaps in the alignment that change the reading frame, minus the gap extension penalty multiplied by the total length of all gaps in the alignment. (You can set the value for each of the penalties.) Quality = SUM(scoring matrix values of the matches in the alignment) – (gap creation penalty x number of gaps in the alignment) – (frameshift penalty x number of gaps in the alignment that change the reading frame) – (gap extension penalty x total length of all gaps in the alignment) For example, the following protein-nucleotide alignment consists of six steps: 1 UGUUGUGUUCG....UGGUGG 17 ||||||::: |||||| 1 CysCysValArgIleTrpTrp 7 The first two steps are UGU-Cys matches. The third step is an GUU-Val match. The fourth step is a four nucleotide deletion. The last two steps are UGG-Trp matches. The quality score for this alignment is the sum of the scoring matrix values for two UGU-Cys matches, one GUU-Val match, and two UGG-Trp matches, minus one gap creation penalty, minus four gap extension penalties, minus one frameshift penalty. Matches between an amino acid and a partial codon, like CG. for Arg in the above example, do not add any match value to the alignment score. By convention, all gap characters in partial codons are placed at the end of the codon. For example, the partial codon “CG.” in the above example will never be written as “C.G”. If the best alignment ending at any point has a negative value, a zero is put at that position of the path matrix; otherwise, the quality score for the alignment is put at that position. After the path matrix is completely filled, the highest value in the matrix represents the score of the best region of similarity between the sequences (optimal local alignment). This highest value is reported as the comparison score between the nucleotide and protein sequences. The alignment itself can be reconstructed for display by following the best path from this point of highest value backward to the point where the path matrix has a value of zero. FrameAlign Always Finds Something FrameAlign always finds an alignment for any protein and nucleotide sequences you compare, even if there is no significant similarity between them. You must evaluate the results critically to decide if the segment shown is not just a random region of relative similarity. 17 FrameAlign Shows Only a Single Segment of Similarity -FrameAlign shows only one optimal alignment between a protein sequence and a nucleotide sequence. There are reasons why you might want to evaluate several optimal and suboptimal alignments. - If there are several disjoint segments of similarity, the selection of only a single segment for display does not provide a comprehensive view of the relationship between the nucleotide and protein sequences. - The alignments displayed by FrameAlign are sensitive to your choices for the scoring matrix and gap penalties. If you vary these choices even slightly, FrameAlign may calculate different optimal alignments for the same segment of similarity between the sequences. If FrameAlign were able to display multiple and suboptimal alignments of the same region, you would be able to use the variation among the different alignments to determine which portions of the alignments were reliably determined. FrameAlign, GAP and BestFit are very similar programs and contain many of the same optional parameters found on the Options page. To begain using FrameAlign, select the UniProt:p08689 peptide sequence and the gb_ro:m15034 nucleic acid sequence. Move your cursor to the Functions Menu and select FrameAlign from the Pairwise Comparison Menu. 18 The FrameAlign program window will be loaded. Click on the Options button and make sure that the sequences we are analyzing are in the sequence windows. Select “Set thresholds...” and type “4” in the highlighted window. Also, select “New sequence file...” for each sequence (you should change the m15034.gap to m15034-2.gap). Close the option window and click on Run in the main FrameAlign window. The options page will look familiar. The FrameAlign, GAP, and BestFit pages contain many of the same parameters to set. These two sequences are the cDNA and protein sequence for the rat androgen-binding protein (ABP). These sequences produce a perfect alignment. Sequences that you run may not be this perfect, remember that FrameAlign, like BestFit, displays only the best alignment. The output sequences can be used to run other GCG programs that are expecting 2 nucleic acid or 2 peptide sequences (ex. GAPSHOW). To view a sample output file, click here. (link to framealign_output.txt) COMPARE Compare compares two protein or nucleic acid sequences and creates a file of the points of similarity between them for plotting with DotPlot. Compare finds the points using either a 19 window/stringency or a word match criterion. The word comparison is 1,000 times faster than the window/stringency comparison, but somewhat less sensitive. Compare makes a file with the coordinates of each point where two sequences are similar. The sequences are compared in every possible register and a point is added to the file wherever some match criterion for similarity is met. The match criterion can be met in two different ways: 1. The standard way compares two sequences in every register, searching for all the places where a given number of matches (stringency) occur within a given range (window). See Maizel and Lenk (1981) "Enhanced Graphic Matrix Analysis of Nucleic Acid and Protein Sequences" Proc. Natl. Acad. Sci. USA 78; 7665-7669 for a description of the matrix analysis of biological sequences. 2. The other way to find points of similarity is to search for short perfect matches of some set length. Short perfect matches are referred to as words. The word comparison between two sequences is about 1,000 times faster than the window/stringency match described above, but it requires that the sequences contain short perfect matches for any similarity to be found. Word comparison is discussed in detail by Wilbur and Lipman (1983) "Rapid Similarity Searches of Nucleic Acid and Protein Data Banks" Proc. Natl. Acad. Sci. USA 80; 726-730. The authors refer to a word as a k-tuple. Compare does a word comparison if it is run with the command-line parameter -WORdsize. Compare accepts two individual nucleotide sequences or protein sequences as input. The function of Compare depends on whether your input sequence(s) are protein or nucleotide. Programs determine the type of a sequence by the presence of either Type: N or Type: P on the last line of the text heading just above the sequence. No more than 200,000 points may be produced in the plot file. The point files can be quite large and should be deleted as soon as they have been examined. Window must be between 1 and 100, or if word comparison is done, the word size must be between 1 and 25. ALGORITHM: Compare makes a file of every point where two sequences are similar according to a set match criterion. The points are the Cartesian coordinates of each point of similarity in units of the original sequence coordinates. If the window is greater than 1, the point recorded by Compare is in the middle of the window. Window/Stringency Comparisons For window/stringency comparisons, Compare reads a scoring matrix that defines a match value for every possible GCG symbol comparison. Compare then slides the vertical sequence along the horizontal in order to generate every possible register of comparison. For each register, Compare slides a window along the pair of sequences. The 20 match values for each pair of symbols within the window are summed to determine a score for the window at each window position. When the score is greater than or equal to the stringency, then the match criterion has been met and a point is added to the file at the position of the middle of the window on both axes. When the window has no integral center (windows of even length), then Compare rounds the coordinates up. If you have used the command-line parameter -ALL, then points are added to the file at all of the positions within the window that have match values greater than or equal to the average positive non-identical comparison value in the scoring matrix (see -ALL in the OPTIONAL PARAMETERS topic). Word Comparisons For word comparisons, you set a word length. Compare then slides the vertical sequence along the horizontal in order to generate every possible register of comparison. For each register, Compare slides a window whose size is equal to the word length along the pair of sequences. If all of the symbols in the two sequences within the window are identical, Compare puts a point in the file at the middle of the word's position in the two sequences. If the word has no integral center (words of even length), then Compare rounds the coordinates up. Alphabet The parameter alphabet that appears in the output is the number of symbols in the alphabet that could make up each word. The alphabet contains four symbols for nucleic acids and up to 31 for peptide sequences. Try a Word Comparison First Word comparisons are very fast, so run Compare with the command-line parameter WORdsize first. Usually, this pilot run gives you a rough idea what the dot-plot for the more sensitive window/stringency comparison is going to look like. See the two plots in the Program Manual entry for DotPlot for examples of each type of comparison. Setting Window and Stringency A window 21-symbols wide with a stringency of 14 is a good place to start when comparing nucleic acid sequences that have very few ambiguity codes in them. The number of points you get should be of the same magnitude as the number of symbols in your sequences. We have had good results with a window of 30 and a stringency of 11 for peptide sequence comparison. You can use -LIMit to stop the program before the number of points gets unreasonable. Batch Queue 21 Unless you are using the -WORdsize parameter, Compare is one of the few programs in the Wisconsin Package that can take more than a few minutes to run. Therefore, large comparisons should probably be run in the batch queue. You can specify that this program run at a later time in the batch queue by using the command-line parameter -BATch. Run this way, the program prompts you for all the required parameters and then automatically submits itself to the batch or at queue. For more information, see "Using the Batch Queue" in Chapter 3, Using Programs in the User's Guide. Very large comparisons may exceed the CPU limit set by some systems. In practice you should probably limit the range of the sequences compared to about 10,000 for each batch job. Setting Word Size You might try a word size of 6 for nucleic acid sequences of 1,000 bases and perhaps 8 for 10,000 bases. You can start with a word size of 2 or 3 for peptide-sequence comparisons. DotPlot Dot-plotting is the best method in the Wisconsin Package(TM) for comparing two sequences when you suspect that there could be more than one segment of similarity between the two. DotPlot makes a dot-plot with the output file from Compare, FoldRNA, or StemLoop. In SeqLab, the program DotPlot can stand alone and will accept Compare files that have been prepared on the command line. SeqLab integrates the Compare results directly into DotPlot. Therefore, when you run Compare there are two results: 1. A matrix of all the points of comparison between the two query sequences and, 2. A plot that displays the results of Compare and shows regions of extended homology with solid vertical line. To examine the utility of Compare and DotPlot, we will use the gb_ro:m15034 and the gb_ro:m19993 sequences. Select both sequences in your Main List, move your cursor to the Functions Menu and select Compare from the Pairwise Comparison Menu (see next page). 22 The main Compare window will appear. As always, make sure that the sequences you selected from the main list are the sequences the program is going to use for its analysis. 23 From the main Compare window, select “window/stringency comparison” and click Run. As mentioned previously, there are 2 output forms that are loaded into your Output Manager. The first output is in the format of a “XXXX.pnt” file. An example of this file is presented below, but without the plot it is difficult to interpret. This is a list of all the points that contain homology between the 2 query sequences. To see a sample output file of this type, click here. (link to compare_output.txt) The second output format is a “figure” file and contains the graphical output that interpreted the point matrix and created this plot. The solid lines show the regions of homology between the two sequences. The m19993 sequence is the genomic sequence for the ABP gene while the m15034 sequence is the cDNA. Therefore, a total of 8 exons (the # of exons in the ABP cDNA) should be represented in the DotPlot. With this example the program identifies all exons as solid lines. 24 25