BIT150 - Lab 1
Sequence Analysis

Remember:
Door code: 51330
Log in: bit150
Password: jorge
Log on to: HUNTLAB

Introduction

Using Triticum monococcum L. genomic DNA, we subcloned a 4,116-bp Hind III fragment into a pBluescript II SK vector. This vector has M13 forward (F) and reverse (R) primer sites, one on each side of the Hind III cloning site. Commercial primers M13F and M13R were used to start sequencing the cloned fragment (chromatograms M13_F and M13_R). However, the fragment is too long to be sequenced in a single reaction, so primer walking was used to complete the sequence. New primers (F1 and R1) were designed from the sequence obtained with the M13 primers and used to extend the sequence. The new sequences were then used to design primers F2 and R2. Finally, primer F3 was used to close the last gap (Figure 1).

[Figure 1: positions of the sequencing reads M13_F, F1, F2, F3, M13_R, R1 and R2 on the cloned fragment, which is flanked by vector on both sides.]

Sequences are available on the Web (Lab1 chromatograms.zip) and also in the Z: drive/08_Lab1 directory. You can read but not write in the ‘Z:’ drive, whereas you can both read and write in the ‘C:’ drive. In the C: drive, create a directory with your last name within the class directory (BIT150), and copy the directory 08_Lab1 from Z: into it. Take a copy with you for the homework (Hwk1).

Objective: Manually prepare a full-length integrated sequence of the T. monococcum fragment, without vector, without sequencing errors, and without overlapping sequences.

Activities:

1. Use GeneTool to open and inspect the chromatograms

GeneTool is a sequence-editing program that can display chromatograms for viewing and editing. Like all the software that will be used in the class, GeneTool is pre-installed on your lab computers, and a copy of the freeware version is included in the course CD that was distributed in the first class. You can start this and the other programs by clicking on START --> Programs --> Bioinformatics (you can also create shortcuts on the Desktop).
In the GeneTool Launcher, click on Chromatogram Editor. At the bottom, in Files of type, select Other Chromatograms. Then click on the file you want to open, and click on Open. Once the chromatogram is open, browse through the length of the sequence and check its quality. Compare the quality at the beginning and at the end of the sequence with the quality in the middle.

o The chromatogram files are the files with extension <.ab1>.
o Sharp sequence peaks in the chromatogram indicate a high-quality read.
o Determining sequence quality through visual inspection is a highly subjective scoring technique. However, quality-scoring software sometimes performs poorly on otherwise good-quality sequence because of sequencing artifacts, irregular spacing of the peaks, etc. Thus, visual inspection is still considered the most reliable method for deciding on sequence quality. You can use the examples below as guidelines for your own decisions:

[Example chromatograms: High quality; Low quality; Low quality]

2. Use GeneTool to convert the chromatograms into FASTA format and copy them into a Word document. Then find restriction sites within the sequence. Finally, explore other tools in GeneTool

In GeneTool, Chromatogram Editor, click on Transfer --> Text Editor. Copy and paste the sequence into a Word document.

The M13_R, R1 and R2 sequences are in reverse complement orientation relative to the forward sequences (they are sequences from the other DNA strand). To put all the sequences in the same orientation, you need to reverse complement them (e.g. from AGCTT to AAGCT). It might be easier to first align M13_R, R1 and R2 with each other and then, at the end, reverse complement the corrected contig.

To reverse complement a sequence, copy the sequence (Ctrl C) and paste it (Ctrl V) into the Sequence Editor (found in the GeneTool Launcher). Select Edit --> Reverse Complement.
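For reference, the reverse complement operation itself can be sketched in a few lines of Python. This is only an illustration of what Edit --> Reverse Complement does, not part of GeneTool; the function name reverse_complement is made up for this example, and the sketch assumes the sequence contains only A, C, G, T and N.

```python
# Complement of each unambiguous DNA base, plus N for an unknown base.
# Assumption: no other IUPAC ambiguity codes appear in the sequence.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (hypothetical helper)."""
    # Complement every base, then read the result backwards.
    return seq.translate(COMPLEMENT)[::-1]


print(reverse_complement("AGCTT"))   # AAGCT, the example given in the text
print(reverse_complement("AAGCTT"))  # AAGCTT, the Hind III site is palindromic
```

Because the Hind III site AAGCTT reads the same on both strands, it can be recognized in a read regardless of orientation.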
Copy the reverse-complemented sequence and paste it into your Word document (indicate that the sequence has been reverse complemented by adding ‘RC’). Use the Courier New font for a good alignment of sequences. In addition to GeneTool, the program Chromas, provided in the course CD, allows you to view, edit, and reverse complement the chromatogram.

Other uses of GeneTool:

o Genetic codes: In the Sequence Editor, click on Help --> Genetic Code or IUPAC code.
o Restriction sites: With your sequence in the Sequence Editor, click on Analyze --> Find Restriction Sites, select None (to unselect them all), then select Hind III (or any other restriction enzyme), select Unlimited Enzyme Cut Frequency, and click on OK.
o Translate DNA into protein: With your sequence selected in the Sequence Editor, click on Analyze --> Translate. You can change the reading frame and the strand.

To transfer figures from the screen to your Word documents: Press the Shift and Prnt Scrn keys of your keyboard simultaneously to copy the screen to the clipboard. Open Start/Programs/Accessories/Paint and select Edit/Paste (or Ctrl V). Select the cut tool on the left, mark the region you want to cut, copy it (Ctrl C), and paste it into your Word document.

3. Trim the vector sequence from the M13_F and M13_R sequences using NCBI's BLAST-VecScreen (http://www.ncbi.nlm.nih.gov/VecScreen/VecScreen.html)

Copy the M13_F sequence and click on RunVecScreen. Find the region corresponding to vector. Identify the Hind III cloning site (AAGCTT) and eliminate the vector sequence, but not the cloning site. Highlight the cloning site. Repeat the same process with the M13_R sequence. The other sequences (F1, F2, F3, R1 and R2) do not contain vector because their primers were designed within the cloned segment.

4. Use BLAST 2 sequences to align M13_F with F1
http://www.ncbi.nlm.nih.gov/blast/bl2seq/wblast2.cgi (create a bookmark for this site)

Copy the M13_F sequence into the Sequence 1 window and the F1 sequence into the Sequence 2 window. Unselect Filter, and click on Align. Identify the overlapping sequence between M13_F and F1, and highlight it in both sequences in your Word document (remember to use the Courier New font). The two sequences are from the same molecule, and therefore they should be identical; any differences between them are sequencing errors. Examine the chromatograms to decide which base is the correct one at each difference observed in the overlapping sequence. Eliminate the duplicated region and create a combined clean sequence (without vector and without sequencing errors).

5. Repeat the process with the other chromatograms until you assemble a complete clean sequence

The final assembled sequence (without vector, without sequencing errors, and without overlapping sequence) should be sent to the TA electronically as part of Homework 1 (find Hwk1 in the class Schedule Web page). Add F2 in class (15 min), and complete the rest as part of Homework 1.

Sequence Submission to NCBI

Introduction

An important part of working with genomic or protein sequences is the ability to submit them to the central databases, such as GenBank. The software tool developed by NCBI for submitting and updating entries in the GenBank, EMBL, or DDBJ sequence databases is called Sequin, and it is freely available at http://www.ncbi.nlm.nih.gov/Sequin/. For a Sequin tutorial, go to SEQUIN at NCBI.

Objective: Prepare a sequence submission to GenBank using Sequin.

Activities:

1. Use Sequin to prepare a sequence submission to GenBank

For this assignment you have the genomic DNA sequence (.txt) from barley (Hordeum vulgare L.), the protein translation (.txt), and the annotated genomic sequence (.doc) for the Acyl Co-A Synthetase. They are in the subdirectory named ‘Sequin Acyl Co-A Synthetase’ in the Lab1 directory. Download the files into the directory you created in the C: drive.

1.1. Start Sequin.

1.2. Enter your personal information as submitter. Request that the sequence be released 1 year from today.

1.3. Load the ‘proper’ Co-A data file(s) into Sequin. The FASTA genomic DNA sequence is in the file Co-A_DNA.txt and the protein sequence is in Co-A_Protein.txt (note that the protein sequence should NOT have the asterisk representing the stop codon at the end). For both .txt files, note that the SeqID after the ‘>’ symbol in the definition line should not contain any space. The final annotation of the sequence is in the Word document called ‘Final annotation.doc’. Sequin will format and annotate the sequence using automated programs (called macros) to determine exon locations, etc. Check whether the coordinates of the exons are correct using the Tools/Word Count option in your annotated Word document.

1.4. To get the taxonomic information, go to NCBI, select the Taxonomy database and search for Hordeum vulgare. Paste the lineage into your Sequin file in the ORGANISM -> Lineage section:
Eukaryota; Viridiplantae; Streptophyta; Streptophytina; Embryophyta; Tracheophyta; Euphyllophyta; Spermatophyta; Magnoliophyta; Liliopsida; commelinids; Poales; Poaceae; BEP clade; Pooideae; Triticeae; Hordeum

1.5. Click on Search->Validate to check the correctness of the automated annotation. If a submission is invalid, you can correct it manually by clicking on the error shown and completing the requested information.

1.6. Once you have fixed the errors, click on Revalidate.

1.7. To save your document, click on File --> Export GenBank (you will then be able to open this file as a Word document). The complete and validated GenBank file should be submitted as part of Homework 1.
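The two FASTA requirements mentioned in step 1.3 (no space in the SeqID, and no asterisk at the end of the protein sequence) can be checked before loading the files into Sequin. The Python sketch below is one possible way to do it and is not part of Sequin; the function name fasta_problems and the example sequences are made up for illustration. It assumes a single-record FASTA file and treats the SeqID as the text between ‘>’ and the first whitespace, so it can only detect a space placed directly after ‘>’.

```python
def fasta_problems(fasta_text: str, is_protein: bool = False) -> list:
    """Return a list of problems found in a single-record FASTA string.

    Hypothetical helper: it checks only the two points raised in step 1.3.
    """
    lines = [line.rstrip() for line in fasta_text.strip().splitlines()]
    if not lines or not lines[0].startswith(">"):
        return ["first line is not a '>' definition line"]

    problems = []
    defline = lines[0]
    # Assumption: the SeqID runs from '>' to the first whitespace character.
    if len(defline) == 1 or defline[1].isspace():
        problems.append("SeqID is empty or separated from '>' by a space")

    sequence = "".join(lines[1:])
    if not sequence:
        problems.append("no sequence after the definition line")
    # The stop-codon asterisk is only a problem for the protein file.
    if is_protein and sequence.endswith("*"):
        problems.append("protein sequence ends with a stop-codon asterisk")
    return problems


# Made-up records, not the real Co-A sequences.
print(fasta_problems(">Co-A_Protein\nMKTAYIAK\n", is_protein=True))
print(fasta_problems("> Co-A Protein\nMKTAYIAK*\n", is_protein=True))
```

To check the real files, read the contents of Co-A_DNA.txt and Co-A_Protein.txt into strings and pass them to the function, with is_protein=True for the protein file.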