BIT150 - Lab 2
Sequence Assembly and Database Searches

Copy the directory ‘Z:\10_Lab2’ into your own directory in ‘C:\YourLastname\10_Lab2’

A. Sequence assembly

Objective: Use Pregap4 and Gap4 to assemble a sample of subclones from Triticum monococcum L. BAC clone 322N9, and edit the assembly.

Additional Resource: “An introduction to Pregap4 and Gap4” ( http://staden.sourceforge.net/manual/master_unix_brief.html )

Gap4 is a Genome Assembly Program. Gap4 stores the data for an assembly project in a Gap4 database, but the data must first be passed through preassembly steps via Pregap4. Pregap4 operates on batches of files. These files can be binary trace files (in ABI, ALF or SCF format), Experiment Files, or plain text, and they do not all need to be in the same format. Both Pregap4 and Gap4 are part of the ‘Staden Package’. To open either of them, go to Start/Programs/BioInformatics/Staden Package.

I. Pregap4

1. Open Pregap4:

1.1. The ‘Files to Process’ tab is used to define the files that Pregap4 will process and that Gap4 will then use for the assembly.
- Go to Add files and Open the folder /10_Lab2/322N9/. In ‘Files of type’ select Any (*.*). Select ALL the files contained in the folder /322N9/, and click on Open.

1.2. In the ‘Configure Modules’ tab you will find:

General Configuration: Sets general parameters that affect several other modules. If you do NOT select the option Get entry names from trace files, the sequences’ names will not be changed.
- Select No.

Estimate Base Accuracies: Analyses the traces at each base call to estimate a confidence value for the called base, by simply looking at the area underneath the trace for the called base and dividing this by the highest area under the trace for the three uncalled bases. This is a very simplistic statistic. If another program (such as Phred) is available, then this should be used in preference.
If the Logarithmic (Phred) scale is selected, the confidence value for the called base will be a log-transformed error probability [-10*log10(Pe)], i.e. a value of 20 corresponds to an error probability of 1/100.
- Select Logarithmic (Phred) scale.

Trace Format Conversion: Converts files between the various trace formats. It can read ABI, ALF, SCF, CTF and ZTR formats, and can write SCF, CTF, and ZTR formats. Of these formats, ZTR typically produces the smallest files and is fast thanks to its own internal compression routines.
- Leave default parameters.

Initialize Experiment Files: Creates an ‘experiment file’ from a trace file. This module is mandatory for many subsequent modules. There are no adjustable parameters for this module.

Augment Experiment Files: Adds further data to the ‘experiment file’, with additional information obtained from external sources. We will not add any extra information.

Quality Clip: Determines where the sequence quality is too poor to be used for reliable assembly. Its default quality evaluation is based on the range of values produced by the Estimate Base Accuracies module.
- Clip mode: The ‘by base call’ mode is equivalent to the Uncalled Clip mode. The ‘by confidence’ mode uses Phred-scaled confidence values to determine the quality for clipping.
- Minimum extent: The lowest allowable 5' clip position.
- Maximum extent: The largest allowable 3' clip position.
- Minimum length: If, after quality clipping, the good portion of a sequence is shorter than the specified length, the file will be rejected with the message ‘qclip: Sequence too short’.
The ‘by confidence’ mode:
- Window length: The window length over which the confidence will be averaged.
- Average confidence: The minimum average confidence (over the ‘window length’ bases) for a sequence to be accepted as good-quality sequence.
The ‘by base call’ mode:
- Start offset: The base number to start the 5' and 3' good quality searches from.
- 3' window length: The window length in which to count uncalled bases.
- 3' number of uncalled bases: The maximum allowed count of uncalled bases in a single window length.
- 5' window length: The window length in which to count uncalled bases.
- 5' number of uncalled bases: The maximum allowed count of uncalled bases in a single window length.
- Select Clip mode by confidence, and leave default parameters for the rest of the options.

Sequencing Vector Clip: Uses the vector_clip program to identify and mark the sequencing vector.
- Select Vector-primer subset: Used in conjunction with the Vector-primer filename, to indicate which of the vector-primer pairs listed in this file should be used.
- In Select Vector-primer subset, select pBS/HindIII.

Screen for Unclipped Vector Clip: Identifies undetected segments of sequencing vector. After searching and marking sequencing vector, any further strong matches to the sequencing vector indicate a possible problem.
- Select this option.

Cloning Vector Clip: Searches for non-sequencing vectors used in the shotgunning process, such as BACs, Cosmids or YACs. Any fragment of this vector could be present in any orientation, so there is no need for the cut sites to be known.
- Deselect this option.

Gap4 Shotgun Assembly: Assembles the processed sequences into Gap4 using Gap4's own assembly engine.
- In Gap4 database name, enter a name for your output (such as ‘Lab4’), and make sure you change the Gap4 database version of the output every time you perform a new assembly with different parameters. Select Create new database, and click on RUN.

1.3. The ‘Textual Output’ tab shows your results in the Output window, whereas any errors encountered during the assembly are reported in the Error window.

II. Gap4

1. Open Gap4: Your output file will be stored in /10_Lab2/322N9/ (the same folder where the files processed by Pregap4 are located).
Take a look at your output file by opening the file with the name you created and the extension ‘.aux’. If you are asked to select a program from a list to open this file, select Gap4.

2. Analyzing the Gap4 assembly:

2.1. Assembly information: Look at the Gap4 main window:
- How many reads were used to perform the assembly?
- How many contigs were created?
- What is the total length of the assembly?

2.2. Contig information: In the Gap4 main window, go to Contig selector, which displays your assembly graphically. If you place the mouse over a contig, it will tell you the name of the contig (which is the name of the left-most reading), the length (in bp), and the number of reads included in that contig. The same information can be viewed through View|Contig list in the main Gap4 window.
- What is the length of each contig? How many reads were included in each contig?

2.3. Quality of the assembly:
a. In the Gap4 main window, go to View|List consensus confidence, select all contigs, and click on OK. It will give you the total length and the expected total number of errors. Look at the Confidence Values in your Gap4 Output window. A confidence value of 10 corresponds to an error rate of 1/10; 20 to 1/100; 30 to 1/1000; etc. For example, if 50 bases (Frequency) in the consensus have a Confidence Value of 10, we would expect those 50 bases (with an error rate of 1/10) to contain 5 errors; and if 200 bases have a Confidence Value of 20, we would expect them to contain 2 errors.
b. In the Gap4 main window, go to View|Confidence values graph, select all contigs, and click on OK. You can see a graph of the confidence values for the consensus sequence.
- By double-clicking on one region of the graph, the Contig Editor will open, showing the aligned sequences in that region of the contig.
- By double-clicking on the consensus sequence, the chromatograms corresponding to each aligned sequence will be displayed.
You can select the number of Columns and Rows of chromatograms you want displayed.
- By selecting Show confidence, the confidence value with which each base was called will be shown.

In the Gap4 main window, go to View|Quality plot. For each base in the consensus sequence, a quality is computed based on the accuracy of the data on each strand. This information is then plotted using color and height to distinguish between the different quality assignments. If you right-click on the plot and go to Information, the percentages of the contig with a given quality will be shown.

Color    Height     Meaning
Grey     0 to 0     OK on both strands, both agree
Blue     0 to 1     OK on plus strand only
Green    -1 to 0    OK on minus strand only
Red      -1 to 1    Bad on both strands
Black    -2 to 2    OK on both strands but they disagree

- Open the Quality plot for the larger contig.
- For how many bases was the quality of the consensus sequence OK on both strands?
- For how many bases was the quality of the consensus sequence bad on both strands?

2.4. Saving the consensus sequence: In the Gap4 main window, go to File|Save consensus|Normal, select all contigs, name consensus by ‘left-most reading’, select FASTA format, give a name to the Output file, and click on OK. You can open the output file with Word.

2.5. Joining contigs: If it is known that all the reads used in the assembly belong to a single contig, even though they were grouped into separate contigs, the separate contigs can be joined. In the Gap4 main window, go to View|Find internal joins. Select Input contigs from all contigs, Use standard consensus, Maximum alignment length to list (bp): 4000, Use hidden data: YES, Word length: 4, Minimum overlap: 20, Maximum percentage mismatch: 30.00, Alignment algorithm: sensitive, Diagonal threshold: 1.0e-8. The Contig Comparator showing the results will be displayed.
- What does this window look like? Go to View|Display diagonal.
- Can you also see another region of overlap besides the main diagonal? What does it mean? Double-click on this region, and the Join Editor will be displayed. You can edit the overlapping region, and if you decide to join the contigs, click on Join/Quit. Take a look at the Contig Selector.
- How many contigs do you have now? Save this final consensus sequence.

2.6. Inspecting chromatograms and editing: In the Gap4 main window, go to Edit|Edit contig, write the name of the contig you want to open, and click on OK. The Contig Editor window will be displayed. Go to Edit Modes, and select Mode set 2. You can insert/delete bases, perform searches, create tags, and select primers, among other editing options.

B. BLAST SEARCH

Open the link: http://www.ncbi.nlm.nih.gov/blast/

BLASTN: Go to nucleotide blast (blastn).
3.1. Type in a random 50-bp DNA sequence, choose database Others\Nucleotide collection (nr/nt), optimize for blastn, then click on BLAST. Look at the E values.
3.2. Compare it with a real search using the Acyl Co-A synthetase DNA sequence.

BLASTP: Go to protein blast (blastp).
3.3. Do the same as in 3.1. and 3.2., but now using a randomly generated 50-aa sequence and the Acyl Co-A synthetase protein sequence, selecting algorithm blastp.

BLASTN: Go to nucleotide blast (blastn).
3.4. Choose database Others/Expressed sequence tags (est), in Organism select ‘Hordeum’, and optimize for blastn. Search with the Acyl Co-A synthetase DNA sequence. Good ESTs show alignment to exons in the Acyl Co-A synthetase DNA sequence.

C. ENTREZ

Help file: http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=helpentrez.chapter.EntrezHelp
Open the link: http://www.ncbi.nlm.nih.gov/gquery/gquery.fcgi

4.1 Get information about ‘Acyl CoA synthetase’.
4.2 Click on the number near Nucleotide. History function: Go to ‘History’; you will see the history of all your searches, and you can type the number assigned to each search in the query box for future searches.
4.3.
Boolean operators: Perform the following searches and report the number of Nucleotide records found:
‘Acyl CoA Synthetase’
Acyl CoA Synthetase AND Hordeum
Acyl CoA Synthetase AND Hordeum [ORGANISM]
Acyl CoA Synthetase AND (Hordeum OR Triticum)
Acyl CoA Synthetase AND Hordeum OR Triticum

Since the name of this enzyme can also appear as ‘synthase’, perform your search using truncation: “Acyl CoA” synth*

Boolean operators in Entrez must be entered in UPPERCASE. Restricted searches may also be entered manually by following the term with the name of the field in square brackets “[ ]” (e.g. horse[ORGANISM]).

D. Local BLAST

Running BLAST searches locally is a major advantage when searching unpublished databases or collections of sequences, and it can also be used on portions of larger databases that have been downloaded to a local server. Here we will use the program BioEdit to create a local database file, and to search it by BLAST, using sequences from a next-generation Illumina sequencing run that produced ~80 million reads approximately 100 bp long. This gives us 8,000 megabases (Mb) of sequence. After screening for quality, 95% of these reads were assembled into 29,189 contigs, which collectively span 64.8 Mb. The reads in this example are from a plant pathogen that causes stripe rust in cereal crops, known as Puccinia striiformis. P. striiformis has a genome size of 73 Mb, which means that this database covers approximately 89% of the genome (64.8 Mb / 73 Mb).

Researchers have identified a gene in a related pathogen, Puccinia graminis, which is known to affect the pathogen’s ability to successfully infect the host with stem rust. You would like to identify a homologous gene in the Puccinia striiformis database by BLAST search.

5.1 Create local database for BLAST. Open BioEdit from the Bioinformatics folder. Click Accessory Application, then BLAST, then Create a local nucleotide database file.
Select the file Bioedit_database_PST130_contigs.fsa. Files can be formatted as a .txt file or as a FASTA file.
*Requires that you have permission to write to files in the Windows folder.*

5.2 Perform a local BLAST on your database. Click Accessory Application, then BLAST, then Local BLAST.
- Select the appropriate type of BLAST to be performed.
- Select the newly created Nucleotide Database.
- Paste in the Query sequence you would like to use to search the database.
- Click Do Search.

For this example we will perform a blastn search. Use the following Query sequence:

>Stem_rust_Coding_region_P.graminis
AGAGCATTCATCCATCTCACTGTCCTAGCTGATCCTGCTTCTTCTCACCTCTTCCCTCCAGAACTCGCAACAGCATG
ACTAAGAAATTCGACGATCGGGTGATCCCACTTGGCGAGCCTGCGACCACCAGAGCCAAACTGGTCGGCGTCGCCAT
GGCCTTGTTTGCGGCCTTTGGAGGATTCTTATACGGCTATGACACTGGTTACATATCCGGAACCAAGGAGATGGCTT
ACTGGAAGTCGCTCTTTGGAGATCAGATTGCCGACGGAAGTTACATCCTCACCACTGCCAACGACTCCCTCGTCACT
TCCATCCTGTCAGCAGGAACCTTCACGGGTGCCCTCCTAGCATATCCTTTCGGTGATCGCCTCGGACGAAGATGGGG
TGTCATCGTCGCCTGTCTCATCTTCTGCATCGGCGTGGCTCTTCAGACCGCATCAACTGACATCCCAGTCTTTGCCG
TCGGCCGTGTGTTTGCAGGCTTGGGTGTCGGGATGACCTCATGTCTTGTGCCCATGTACCAATCAGAATGCGCCCCG
AAATGGATTCGAGGTGCTGTGGTGGCCTGCTACCAATGGGCTATCACCATCGGACTCTTGGTTGCAGCAATCGTCGT
CAACGCTACTCAAGACATCAACAATGCCAGCTCCTACCGAATCCCCATCGGGATCCAATTTGTTTGGGCCGTGATCC
TATCCTTAGGACTGTACATTCTACCCGAGTCGCCTAAGTATCTGATCCTCAAGGGGCGGGAAGAAGAAGCCAAGAAG
TCCCTCTCCAGACTGCTCTCCATCCCTGCCACCTCCCCGCAAGTCCTGAGCGAGTATGACGAAGTCTGCGAAAGCCT
GCGCGCCGAGCGCGCCATGGGCACCTCCACTTATGCAGACTGCTTCAAATCTGGACCGGGCAAATACCGGCTCAGGA
CTTTGACCGGGATGGGAATCCAAGCTCTCCAACAACTCACTGGTATCAACTTCATCTTCTATTACGGAACAACTTTC
TTCAAGAACAGTGGAATCAAGGAGGCATTCACAATCACCATTATCACCAATGTTGTCAACGTAGTCATGACGATCCC
CGGGATTTGGCTGGTCGACAAGGCCGGCCGTCGATCGCTGCTATTGACCGGCGCTGCCATCATGTGTGTCTGTGAAT
TCATCGTCGCCATCATCGGACTCAAGCTCGAAAGCTCCAACCTTGCTGGCCAACGCGCCTTGATCTCCCTCGTCTGT
ATCTACATCGGAGCGTTTGCGGCAACTTGGGGTCCCATCGCATGGGTTGTGACGAGTGAGATTTACCCGTTGGCCAT
CCGAGCCAAGGCGATAAACTTCGCCATCGGATACTCGACGCCATATTTAGTGGATGTCGGTCCAGGTAAGGCGGGGC
TCCAATCGAACGTGTTCTTCATCTGGGGCGCATGCTGTGGCTTGTGCTTTTTGTTCACCTTTTTCTGTATTCCGGAG
ACCAAGGGGCTGTCGCTCGAGCAGGTCGATCAACTCTACATGAACAGCTCCATCCTCGGCTCCAATGCCTATCGTCG
GCGACTCGTCAATGGCGAGTTCGACACCCATCAGCCTACGACCCCGCTCGGCTCGATCGTCGAAGACGACCGGGCCG
ACAAGCTCGCTAAAGTCGCTTGATCGATCCCTCCCAATAAACAAAAATTGACACTCTCATATCCTTACCTCCTTATA
ACCAAACGATCCGCTATTATACTAGTAGTAGAACTTGTTCCCCTTTCACCTAATATACTCTCTATTGTGACTTGTTA
GGTTTTTTTTTTGTTTTTTGTTTCTTCTTATGTACCTTAGGTAGGAGTGGGAGGGATCGATAAGCATGGTTGAGCTC
GAGGCGTGGTTTACATACCTGAATCATCACAGACTTCAGTTTCTCTTCTTCTTTCTCCACTTGGCAGCCGGGCTAGC
GGATCCATTCTTCCTTTCTCATCATCCCCATCTACCTTCCTCTCCCTCTCCCTACACACACACAACACAAAAAAAAA
CTTGCTCCAGCCACAACAAAAAAGATATGCCAGTTCTCGCGCTCGTTGTATTATTATCTTATATTGCCCGTCTCTTT
TTTTTGTCTTCGGTTCCTCCTCTTTGTTCTCATCTCTTTCTCTGTTTCCAGCTTTCTTTTATGATCTCTCTATATGA
AAACTCTTGAAATAAAATCAAAATCAAAAATCAAAAATTGATGATGATGATTTCGT

5.3 Look at the high E value hits… Repeat the BLASTN, this time selecting the low complexity filter.

Retrieve matching sequences from your database.
a) Note the matching contig name and its orientation (Plus or Minus) in your search results. Open the database file using Notepad: right-click on the file, click Open With, and select Notepad. (Important! The database file is too large to open with Microsoft Word, and will likely crash that program, causing you to lose any work you haven’t saved. Be safe and use Notepad to open the database file!)
b) Use the Find feature to locate your read in the file (click Edit, then Find, or press Ctrl+F) using the name of the read. Copy it to your MCBS Word Add-In Document. If the read was in the Minus orientation, you will need to reverse complement the read using the Antisense Sequence Manipulation tool from the Add-In.

Here we find PSTcontig_7550 is the best match to our Query, and likely contains the Puccinia striiformis homolog to our known Puccinia graminis gene.
We can use this complete sequence for continued molecular studies of the gene without the need to clone and sequence it.
c) The orientation is Plus/Minus (Query/Contig), so we will need to reverse complement this contig to find the sequence that matches our Query. Use the Antisense tool to reverse complement the sequence, then highlight the portions of the sequence that aligned to our Query sequence in the BLAST search.
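
For those who prefer a script to Notepad and the Add-In, steps a) to c) can be sketched in Python. This is only an illustration, not part of the lab software: the function names find_record and reverse_complement and the toy FASTA text are hypothetical, and the sketch assumes that the first word of each FASTA header line is the contig name.

```python
import io

# Translation table for complementing DNA (upper and lower case, plus N).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_record(handle, name):
    """Return the sequence of the first FASTA record named `name`, or None.

    Assumption: the record name is the first word after '>' in the header.
    """
    found = False
    parts = []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if found:
                break  # reached the next record, so the match is complete
            header = line[1:].split()
            found = bool(header) and header[0] == name
        elif found:
            parts.append(line)  # sequence lines of the matching record
    return "".join(parts) if found else None


# Toy data standing in for the real database file (hypothetical contig names).
toy_fasta = io.StringIO(">contig_A\nAAAC\nGGT\n>contig_B\nTTGCA\n")
contig = find_record(toy_fasta, "contig_A")
print(contig)                      # AAACGGT
print(reverse_complement(contig))  # ACCGTTT
```

To try this on the lab data, one would open Bioedit_database_PST130_contigs.fsa with Python's built-in open() in place of the toy text and search for PSTcontig_7550. Whether the headers in that file match this assumption should be checked first in Notepad.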