BIT 150 - Lab 4
Sequence Assembly and Annotation

A. Sequence assembly

An introduction to Pregap4 and Gap4 (http://staden.sourceforge.net/manual/master_unix_brief.html)

Gap4 is a Genome Assembly Program. Gap4 stores the data for an assembly project in a Gap4 database, but the data must first be passed through preassembly steps via Pregap4. Pregap4 operates on batches of files. These files can be binary trace files (in ABI, ALF or SCF format), Experiment Files, or plain text, and they do not all need to be in the same format.

Objective: Use Pregap4 and Gap4 to assemble a sample of subclones from Triticum monococcum L. BAC clone 322N9, and edit the assembly.

Activities:
Copy the folder ‘08_Lab4’ from Z: to C:. Both Pregap4 and Gap4 are part of the ‘Staden Package’. To open either of them, go to Start/Programs/BioInformatics/Staden Package.

I. Pregap4

1. Open Pregap4:

1.1. The ‘Files to Process’ tab is used to define the files that will be processed and then used by Gap4 for the assembly.
- Go to Add files and Open the folder /08_Lab4/322N9/. In ‘Files of type’ select Any (*.*). Select ALL the files contained in the folder /322N9/, and click on Open.

1.2. In the ‘Configure Modules’ tab you will find:

General Configuration: Sets general parameters that affect several other modules. If you do NOT select the option Get entry names from trace files, the sequences’ names will not be changed.
- Select No.

Estimate Base Accuracies: Analyses the traces at each base call to estimate a confidence value for the called base, by simply taking the area underneath the trace for the called base and dividing it by the highest area under the trace for the three uncalled bases. This is a very simplistic statistic. If another program (such as Phred) is available, it should be used in preference. If the Logarithmic (Phred) scale is selected, the confidence value for the called base will be a log-transformed error probability, -10*log10(Pe), i.e. a value of 20 means a 1/100 error probability (Pe = 10^(-20/10) = 0.01).
- Select Logarithmic (Phred) scale.

Trace Format Conversion: Converts files between the various trace formats. It can read ABI, ALF, SCF, CTF and ZTR formats, and can write SCF, CTF, and ZTR formats. Of these formats, ZTR typically gives the smallest files and is fast because of its own internal compression routines.
- Leave default parameters.

Initialize Experiment Files: Creates an ‘experiment file’ from a trace file. This module is mandatory for many subsequent modules. There are no adjustable parameters for this module.

Augment Experiment Files: Adds further data to the ‘experiment file’, using additional information obtained from external sources. We will not add any extra information.

Quality Clip: Determines where the sequence quality is too poor to be used for reliable assembly. Its default quality evaluation is based on the range of values produced by the Estimate Base Accuracies module.
- Clip mode: The ‘by base call’ mode is equivalent to the uncalled Clip mode. The ‘by confidence’ mode uses Phred-scaled confidence values to determine the quality for clipping.
- Minimum extent: The lowest allowable 5' clip position.
- Maximum extent: The largest allowable 3' clip position.
- Minimum length: If, after quality clipping, the good portion of a sequence is shorter than the specified length, the file will be rejected with the message ‘qclip: Sequence too short’.

The ‘by confidence’ mode:
- Window length: The window length over which the confidence will be averaged.
- Average confidence: The minimum average confidence (over the ‘window length’ bases) for a sequence to be accepted as good-quality sequence.

The ‘by base call’ mode:
- Start offset: The base number from which to start the 5' and 3' good-quality searches.
- 3' window length: The window length in which to count uncalled bases.
- 3' number of uncalled bases: The maximum allowed count of uncalled bases in a single window length.
- 5' window length: The window length in which to count uncalled bases.
- 5' number of uncalled bases: The maximum allowed count of uncalled bases in a single window length.

- Select Clip mode by confidence, and leave the default parameters for the rest of the options.

Sequencing Vector Clip: Uses the vector_clip program to identify and mark the sequencing vector.
- Select Vector-primer subset: Used in conjunction with the Vector-primer filename, to indicate which of the vector-primer pairs listed in this file should be used.
- In Select Vector-primer subset, select pBS/HindIII.

Screen for Unclipped Vector Clip: Identifies undetected segments of sequencing vector. After the sequencing vector has been searched for and marked, any further strong matches to the sequencing vector indicate a possible problem.
- Select this option.

Cloning Vector Clip: Searches for non-sequencing vectors used in the shotgunning process, such as BACs, Cosmids or YACs. Any fragment of this vector could be present in any orientation, so there is no need for the cut sites to be known.
- Unselect this option.

Gap4 Shotgun Assembly: Assembles the processed sequences into Gap4 using Gap4's own assembly engine.
- In Gap4 database name, enter a name for your output (such as ‘Lab4’), and make sure you change the Gap4 database version of the output every time you perform a new assembly with different parameters. Select Create new database, and click on RUN.

1.3. The ‘Textual Output’ tab will show your results in the Output window, whereas any errors that occur during the assembly will be reported in the Error window.

II. Gap4

1. Open Gap4:
Your output file will be stored in /08_Lab4/322N9/ (the same folder where the files processed by Pregap4 are located). Take a look at your output by opening the file with the name you created and the extension ‘.aux’. If you are asked to select a program from a list to open this file, select Gap4.

2. Analyzing the Gap4 assembly:

2.1. Assembly information:
Look at the Gap4 main window:
- How many reads were used to perform the assembly?
- How many contigs were created?
- What is the total length of the assembly?

2.2. Contig information:
In the Gap4 main window, go to Contig selector, which displays your assembly graphically. If you place the mouse on a contig, it will tell you the name of the contig (which is the name of the left-most reading), its length (in bp), and the number of reads included in that contig. The same information can be viewed with View|Contig list.
- What is the length of each contig?
- How many reads were included in each contig?

2.3. Quality of the assembly:

a. In the Gap4 main window, go to View|List consensus confidence, select all contigs, and click on OK. Look at the Confidence Values in your Gap4 Output window. For example, if 50 bases (Frequency) in the consensus have a Confidence Value of 10, we would expect those 50 bases (with an error rate of 1/10) to contain 5 errors; and if 200 bases have a Confidence Value of 20 (an error rate of 1/100), we would expect them to contain 2 errors.
- How many errors are expected to be in the consensus sequence? How many bases in the consensus have a Confidence Value of 10? How many errors would we expect those bases to contain? How many bases were assembled with a Confidence Value of 20 or less?

b. In the Gap4 main window, go to View|Confidence values graph, select all contigs, and click on OK. You can see a graph of the confidence values for the consensus sequence. Double-clicking on a region of the graph opens the Contig Editor, which shows the aligned sequences in that region of the contig. Double-clicking on the consensus sequence displays the chromatograms corresponding to each aligned sequence. You can select the number of Columns and Rows of chromatograms you want to have displayed. Selecting Show confidence shows the confidence value with which each base was called.

c. In the Gap4 main window, go to View|Quality plot. For each base in the consensus sequence, a quality is computed based on the accuracy of the data on each strand. This information is then plotted using color and height to distinguish between the different quality assignments. If you right-click on the plot and go to Information, the percentages of the contig with a given quality will be shown.

Color   Height     Meaning
Grey    0 to 0     OK on both strands, both agree
Blue    0 to 1     OK on plus strand only
Green   -1 to 0    OK on minus strand only
Red     -1 to 1    Bad on both strands
Black   -2 to 2    OK on both strands but they disagree

Open the Quality plot for the larger contig. For how many bases was the quality of the consensus sequence OK on both strands? For how many bases was the quality of the consensus sequence bad on both strands?

2.4. Saving the consensus sequence:
In the Gap4 main window, go to File|Save consensus|Normal, select all contigs, name the consensus by ‘left-most reading’, select FASTA format, give a name to the Output file, and click on OK. You can open the output file with Word.

2.5. Joining contigs:
If it is known that all the reads used in the assembly belong to a single contig even though they were grouped into separate contigs, the separate contigs can be joined. In the Gap4 main window, go to View|Find internal joins. Select Input contigs from all contigs, Use standard consensus, Maximum alignment length to list (bp): 4000, Use hidden data: YES, Word length: 4, Minimum overlap: 20, Maximum percentage mismatch: 30.00, Alignment algorithm: sensitive, Diagonal threshold: 1.0e-8. The Contig Comparator showing the results will be displayed.
- What does this window look like?
Go to View|Display diagonal.
- Can you see another region of overlap besides the main diagonal? What does it mean?
Double-click on this region, and the Join Editor will be displayed. You can edit the overlapping region, and if you decide to join the contigs, click on Join/Quit.
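To make the Minimum overlap and Maximum percentage mismatch settings used above more concrete, the following is a minimal Python sketch of an end-to-end overlap test between two contig consensus sequences. It is only an illustration and not Gap4's search: it assumes an ungapped overlap on the forward strand, and the function names (mismatch_percent, best_end_overlap) are hypothetical.

```python
def mismatch_percent(a: str, b: str) -> float:
    """Percentage of positions at which two equal-length strings differ."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    mismatches = sum(1 for x, y in zip(a, b) if x != y)
    return 100.0 * mismatches / len(a)


def best_end_overlap(left: str, right: str,
                     min_overlap: int = 20, max_mismatch: float = 30.0):
    """Longest ungapped overlap of the 3' end of `left` with the 5' end of `right`.

    Returns (overlap_length, mismatch_percent), or None when no overlap of at
    least `min_overlap` bases stays within `max_mismatch` percent.
    Assumption: no gaps and no reverse complement are considered.
    """
    longest = min(len(left), len(right))
    for length in range(longest, min_overlap - 1, -1):
        pct = mismatch_percent(left[-length:], right[:length])
        if pct <= max_mismatch:
            return length, pct
    return None


if __name__ == "__main__":
    # Toy sequences invented for illustration; they are not from clone 322N9.
    contig_a = "TTGACCATGC" + "ACGTTGCAAGCTAGCTAGGATCCA"
    contig_b = "ACGTTGCAAGCTAGCTAGGATCCA" + "GGCATTAGCA"
    print(best_end_overlap(contig_a, contig_b))
```

Raising the mismatch limit makes more candidate joins appear, which is why each candidate should still be inspected in the Join Editor before joining.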
Take a look at the Contig Selector.
- How many contigs do you have now?
Save this final consensus sequence.

2.6. Inspecting chromatograms and editing:
In the Gap4 main window, go to Edit|Edit contig, type the name of the contig you want to open, and click on OK. The Contig Editor window will be displayed. Go to Edit Modes, and select Mode set 2. You can insert and delete bases, perform searches, create tags, and select primers, among other editing options.

B. Sequence annotation

Objective: Annotate the T. monococcum BAC clone 322N9 fragment assembled in A. Annotate genes and repetitive elements using a combination of graphic and word-based alignment tools, gene-finding programs, and BLAST searches.

Activities:
Open in Word the final consensus sequence you obtained from the assembly performed in A.

1. Dotter

Dotter helps to identify repetitive elements present in a sequence.

1.1. Use Dotter to align the T. monococcum sequence with itself.
- Can you identify any repeats? Are they direct or inverted repeats?

To facilitate annotation, let us divide the sequence into two separate regions: the one where repeats were identified, and the one with no repeats, where genes are more likely to be found.

2. Annotating REPETITIVE ELEMENTS

2.1. Perform a blastn alignment with the region of the sequence where the repeats were identified, using the nucleotide collection database.
- Click on the best blastn alignment. What is the accession/version number?
- Open the flat file in a new tab. Go to the ‘features’ section to see the annotation of the element present in this region of the subject sequence. What is it?
- Using the coordinates of the alignment between your query sequence and the subject sequence from the database, find and highlight this element in your sequence in Word.
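Locating an element by its alignment coordinates can also be done outside Word. The sketch below assumes the consensus was saved as a FASTA file and that the coordinates are the 1-based, inclusive query positions shown in the BLAST alignment; the function names and the file name in the usage note are hypothetical.

```python
def read_fasta(path: str) -> str:
    """Return the first sequence in a FASTA file as one uppercase string."""
    parts = []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if parts:          # a second record starts here; stop reading
                    break
                continue           # skip the header of the first record
            parts.append(line)
    return "".join(parts).upper()


def extract_region(sequence: str, start: int, end: int) -> str:
    """Return bases start..end using 1-based, inclusive coordinates.

    If start > end the two are swapped, so the forward-strand bases are
    always returned (no reverse complement is taken).
    """
    if start > end:
        start, end = end, start
    if start < 1 or end > len(sequence):
        raise ValueError("coordinates fall outside the sequence")
    return sequence[start - 1:end]
```

For example, with a consensus saved as consensus.fasta (a placeholder name), extract_region(read_fasta("consensus.fasta"), start, end) returns the bases to highlight, where start and end are the query coordinates read from your own alignment.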
Now divide the region of the sequence where the repeats were identified into two approximate halves, and align the two halves using blast2sequences to help you identify the LTRs of the repetitive element found. Mark the LTRs in bold. Can you identify the host duplication and the inverted repeats flanking the LTRs? Highlight them in your sequence in Word. Highlight the inverted repeats flanking the LTRs with the same color used to highlight the complete repetitive element, but use a different color to highlight the host duplication, since it is not part of the repetitive element. Annotate the repetitive elements in the T. monococcum sequence in Word.

2.1.1. TREP: BLAST repeats
Go to http://wheat.pw.usda.gov/ggpages/Repeats/blastrepeats3.html . This specialized repeat database allows you to discover and annotate the repetitive elements present in a sequence. Copy the region of the sequence where the repeats were identified and paste it in the TREP window. Select the blastn program and the Cereal repeat sequences, complete set database. Click on Search.
- Do you get any significant hit? What is it? Does it agree with your previous finding?

3. Annotating GENES

3.1. Perform a blastn alignment with the region of the sequence with no repeats, using the est_others database.
- How many exons would you predict for the gene present in this region? Highlight them in your sequence in Word.

3.3.1. Gene prediction programs

3.3.1.1. FGENESH
Go to http://linux1.softberry.com/berry.phtml?topic=fgenesh&group=programs&subgroup=gfind . Copy the region of the sequence with no repeats and paste it in the FGENESH window. Select Monocot plants as organism, and click on Search. Take a look at the predicted genes by clicking on Show picture of predicted genes in PDF file.
- How many genes are predicted? How many exons?
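Once exon coordinates are available from a prediction, one quick sanity check is whether each implied intron begins with GT and ends with AG, the canonical splice-site dinucleotides. The following minimal sketch assumes exons are given as sorted 1-based, inclusive (start, end) pairs on the forward strand; the function names and the toy sequence in the usage note are hypothetical, and a gene on the minus strand would first need to be reverse-complemented.

```python
def introns_from_exons(exons):
    """Return the introns lying between consecutive exons.

    `exons` is a list of (start, end) pairs, 1-based and inclusive,
    sorted by position on the forward strand.
    """
    return [(prev_end + 1, next_start - 1)
            for (_, prev_end), (next_start, _) in zip(exons, exons[1:])]


def has_canonical_splice_sites(sequence: str, intron_start: int, intron_end: int) -> bool:
    """True if the intron starts with GT and ends with AG (forward strand only)."""
    intron = sequence[intron_start - 1:intron_end].upper()
    return len(intron) >= 4 and intron.startswith("GT") and intron.endswith("AG")
```

With a toy sequence such as "ATGGCCGTAAAAAGTTTTAA" and exons [(1, 6), (15, 20)], the single implied intron is (7, 14). Introns that fail the check are not necessarily wrong, since non-canonical splice sites exist, but they deserve a second look against the EST alignment.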
Using a combination of your blastn search and the gene-finding programs, identify the start codon, splice sites, exons, stop codon, and polyA signal, and highlight them in your sequence in Word.

3.3.1.2. You can go to GeneSeqer at http://www.plantgdb.org/PlantGDBcgi/GeneSeqer/PlantGDBgs.cgi, and GENSCAN at http://genes.mit.edu/GENSCAN.html, and compare the results of the different gene prediction programs.
- How many genes/exons are identified by each of the programs?

3.3.2. Gene annotation

3.3.2.1. BLASTP (protein blast) and BLASTX
Go to http://www.ncbi.nlm.nih.gov/BLAST/ and then to BLASTP (protein blast) to perform a search with the translated protein (after translation using GeneTool), or to BLASTX to search the protein database with the cDNA.
- Do you get any significant hit? What is it? Do you find any conserved domain? What is it?

Annotate the gene in the T. monococcum sequence in Word.
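The translation step that precedes the BLASTP search can be illustrated with a short Python sketch. It is not GeneTool's method: it assumes the standard genetic code, a coding sequence that already starts in frame at the start codon, and translation that stops at the first stop codon. The names CODON_TABLE and translate are hypothetical.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order
# (first base changes slowest, third base fastest).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(cds: str) -> str:
    """Translate an in-frame coding sequence up to the first stop codon.

    Codons containing ambiguous bases (for example N) become 'X';
    trailing bases that do not fill a codon are ignored.
    """
    cds = cds.upper().replace("U", "T")
    protein = []
    for i in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE.get(cds[i:i + 3], "X")
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


if __name__ == "__main__":
    # Toy coding sequence invented for illustration.
    print(translate("ATGAAATGGTGA"))  # prints MKW
```

The input should be the joined exons (the cDNA), not the genomic region, since introns would otherwise be translated as if they were coding.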