Basic sequence analyses and submission

BIT150 - Lab 1
Sequence Analysis
Remember:
Door code: 51330
Log in: bit150
Password: jorge
Log on to: HUNTLAB
Introduction
Using Triticum monococcum L. genomic DNA we subcloned a 4,116-bp Hind III
fragment into a pBluescript II SK vector. This vector has M13 forward (F) and reverse
(R) sites at both sides of the Hind III cloning site. Commercial primers M13F and M13R
were used to start sequencing the cloned fragment (chromatograms M13_F and M13_R).
However, this sequence is too long to be completed in a single sequencing reaction, so
primer walking was used to complete the sequence. Using the sequence obtained with the
M13 primers, new primers were designed (F1 and R1) and used to extend the sequence.
Then, the new sequences were used to design primers F2 and R2. Finally, primer F3 was
used to close the last gap (Figure 1).
[Figure 1. Diagram of the cloned fragment flanked by vector on both sides, showing the sequencing reads M13_F, F1, F2, F3, R2, R1, and M13_R.]
The sequences are available on the web (Lab1 chromatograms.zip) and in the
08_Lab1 directory on the Z: drive. The Z: drive is read-only, but you can read
and write on the C: drive.
In the class directory (BIT150) on the C: drive, create a directory named with your last
name, and copy the 08_Lab1 directory from Z: into your new directory on C:. Take a copy with you for the
homework (Hwk1).
Objective: Manually prepare a full-length integrated sequence of the T. monococcum
fragment, without vector, without sequencing errors, and without overlapping sequences.
Activities:
1. Use GeneTool to open and inspect the chromatograms
GeneTool is a sequence-editing program that can display chromatograms for viewing
and editing. Like all the software used in this class, GeneTool is pre-installed
on your lab computers, and a copy of the freeware version is included on the course CD
that was distributed in the first class. You can start this and the other programs by
clicking on START --> Programs --> Bioinformatics (you can also create shortcuts on the
Desktop).
In the GeneTool Launcher, click on Chromatogram Editor.
At the bottom, in Files of type, select Other Chromatograms.
Finally, click on the file you want to open, and then click on Open.
Once the chromatogram is open, browse through the length of the sequence and check the
quality of the sequence. Compare the quality at both the beginning and the end of the
sequence, with the quality at the middle of the sequence.
o The chromatogram files are the files with extension <.ab1>.
o Sharp sequence peaks in the chromatogram indicate a high-quality read.
o Determining sequence quality through visual inspection is highly subjective.
However, quality-scoring software sometimes performs poorly on otherwise
good-quality sequence because of sequencing artifacts, irregular peak spacing, etc.
Thus, visual inspection is still considered the most reliable method for judging
quality. You can use the examples below as guidelines for your own decisions:
[Figure: three example chromatograms, labeled High quality, Low quality, and Low quality.]
2. Use GeneTool to convert the chromatograms into FASTA format and copy them
into a Word document. Then, find restriction sites within the sequence. Finally,
explore other tools in GeneTool
In GeneTool, Chromatogram Editor, click on Transfer --> Text Editor. Copy and
paste the sequence into a Word document.
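For reference, the FASTA layout itself is simple: a definition line starting with '>' followed by the sequence on one or more lines. The sketch below is only an illustration of that layout, not what GeneTool does internally; the function name `to_fasta` and the 60-character line width are assumptions made for this example.

```python
def to_fasta(seq_id, sequence, width=60):
    """Format a sequence as a FASTA record (hypothetical helper for illustration)."""
    # Remove spaces and line breaks that often come along when pasting from an editor.
    cleaned = "".join(sequence.split()).upper()
    # Break the sequence into fixed-width lines.
    lines = [cleaned[i:i + width] for i in range(0, len(cleaned), width)]
    return ">" + seq_id + "\n" + "\n".join(lines) + "\n"


# Made-up short sequence, not taken from the lab chromatograms.
print(to_fasta("M13_F", "aagcttggac ctgatccgta"))
```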
The M13_R, R1, and R2 sequences are in reverse complement orientation relative to the
forward sequences (they are sequences from the other DNA strand). To put all the
sequences in the same orientation, you need to reverse complement them (e.g. from
AGCTT to AAGCT). It might be easier to first align M13_R, R1, and R2 and then, at the
end, reverse complement the corrected contig.
To reverse complement a sequence, copy the sequence (Ctrl C) and paste it (Ctrl V) into
the Sequence Editor (that is in the GeneTool Launcher). Select Edit --> Reverse
Complement.
Copy the reverse-complemented sequence and paste it into your Word document (indicate
that the sequence has been reverse complemented by adding ‘RC’). Use the Courier New
font so that the sequences align properly.
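The reverse complement operation can also be written out in a few lines, which may help when checking a result by hand. This is a minimal sketch, independent of GeneTool; the name `reverse_complement` is chosen for this example, and only A, C, G, T, and N are handled.

```python
# Translation table mapping each base to its complement (N stays N).
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Complement every base, then reverse the order of the bases."""
    return seq.translate(COMPLEMENT)[::-1]


print(reverse_complement("AGCTT"))  # AAGCT, the example given in the text
```

Note that the Hind III site AAGCTT is its own reverse complement, so it reads the same on both strands.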
In addition to GeneTool, the program Chromas provided in the course CD allows you to
view, edit, and also reverse complement the chromatogram.
Other uses of GeneTool:
o Genetic codes: In the Sequence Editor, click on Help --> Genetic Code or IUPAC
code.
o Restriction sites: With your sequence in the Sequence Editor, click on Analyze -->
Find Restriction Sites, select None (to unselect them all), then select Hind III (or any
other restriction enzyme), finally select Unlimited Enzyme Cut Frequency, and click on
OK.
o Translate DNA into protein: With your sequence in the Sequence Editor selected,
click on Analyze --> Translate. You can change the reading frame and the strand.
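The restriction-site search and the translation can be sketched in plain Python to make clear what the two menu options compute. This is an illustration under stated assumptions, not GeneTool's implementation: the function names are hypothetical, the site search matches the recognition sequence exactly on the given strand, and the translation uses the standard genetic code with 'X' for any codon containing a non-ACGT letter.

```python
def find_sites(sequence, site="AAGCTT"):
    """Return the 1-based start position of every occurrence of a recognition site."""
    seq = sequence.upper()
    positions = []
    start = seq.find(site)
    while start != -1:
        positions.append(start + 1)          # report 1-based coordinates
        start = seq.find(site, start + 1)    # continue after this hit (allows overlaps)
    return positions


# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = dict(zip(
    (a + b + c for a in BASES for b in BASES for c in BASES),
    AMINO_ACIDS,
))


def translate(sequence, frame=1):
    """Translate a DNA sequence in forward frame 1, 2, or 3 ('*' marks a stop codon)."""
    seq = sequence.upper()
    protein = []
    for i in range(frame - 1, len(seq) - 2, 3):
        protein.append(CODON_TABLE.get(seq[i:i + 3], "X"))
    return "".join(protein)


# Made-up sequences for illustration only.
print(find_sites("AAGCTTGGAAGCTT"))
print(translate("ATGGCCTAA"))
```

To translate the other strand, reverse complement the sequence first and then call `translate` on the result.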
To transfer figures from the screen to your Word documents: Press the
Shift and Print Screen keys simultaneously to copy the screen to the clipboard. Open
Start/Programs/Accessories/Paint and select Edit/Paste (or Ctrl V). Select the cut tool on the
left, mark the region you want to cut, copy it (Ctrl C), and paste it into your Word
document.
3. Trim the vector sequence from the M13_F and M13_R sequences using the
NCBI's BLAST-VecScreen (http://www.ncbi.nlm.nih.gov/VecScreen/VecScreen.html).
Copy the M13_F sequence into the VecScreen query box and click on Run VecScreen. Find the region corresponding
to vector.
Identify the Hind III cloning site: AAGCTT, and eliminate the vector sequence, but not
the cloning site. Highlight the cloning site.
Repeat the same process with the M13_R sequence. The other sequences (F1, F2, F3, R1
and R2) do not have vector because the primers were designed within the cloned
segment.
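The trimming step can be sketched as follows. The sketch assumes that the read is in its original orientation, so the vector lies at the start of the read, and that the first AAGCTT in the read is the cloning site; both assumptions should be confirmed against the VecScreen output. The function name `trim_leading_vector` is hypothetical.

```python
HIND3_SITE = "AAGCTT"


def trim_leading_vector(read, site=HIND3_SITE):
    """Remove everything before the first cloning site, keeping the site itself.

    Assumes the vector is at the start of the read and that the first
    occurrence of the site is the cloning site (check against VecScreen).
    """
    pos = read.upper().find(site)
    if pos == -1:
        raise ValueError("cloning site not found in read")
    return read[pos:]


# Made-up read: nine bases standing in for vector, then the site and the insert.
print(trim_leading_vector("GGTACCGATAAGCTTGGACCT"))
```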
4. Use BLAST 2 sequences to align M13_F with F1
http://www.ncbi.nlm.nih.gov/blast/bl2seq/wblast2.cgi (create a bookmark for this site)
Copy the M13_F sequence in the Sequence 1 window and the F1 sequence in the
Sequence 2 window. Unselect Filter, and click on Align.
Identify the overlapping sequence between M13_F and F1, and highlight it in both
sequences in your Word document (remember to use the Courier New font).
The two sequences are from the same molecule, and therefore they should be identical.
The differences between them are sequencing errors. Examine the chromatograms to
decide which base is the correct one for each difference observed in the overlapping
sequence. Eliminate the duplicated region and create a combined clean sequence (without
vector and without sequencing errors).
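The logic of this step (find the overlap, list the differences, then join the reads with the overlap kept once) can be sketched as below. This is a simplified illustration, not a replacement for BLAST 2 sequences: it only considers ungapped overlaps between the end of one read and the start of the next, so inserted or missing bases are not handled. The function names and the `min_len` and `max_mismatches` parameters are assumptions made for this example.

```python
def find_overlap(left, right, min_len=20, max_mismatches=2):
    """Find the longest ungapped overlap between the end of `left` and the start of `right`.

    Returns (offset, mismatches), where `offset` is the 0-based index in `left`
    at which the overlap begins and `mismatches` lists tuples of
    (1-based position in left, base in left, base in right).
    Returns None if no overlap satisfies the limits.
    """
    for offset in range(len(left)):
        n = len(left) - offset               # overlap length at this offset
        if n > len(right):
            continue                         # right read too short for this offset
        if n < min_len:
            break                            # all remaining overlaps are too short
        mismatches = [
            (offset + i + 1, left[offset + i], right[i])
            for i in range(n)
            if left[offset + i] != right[i]
        ]
        if len(mismatches) <= max_mismatches:
            return offset, mismatches
    return None


def merge(left, right, offset):
    """Join two reads, keeping the overlapping region once (taken from `right`)."""
    return left[:offset] + right


# Made-up reads with one disagreement inside the overlap.
left = "AAGCTTGGACCTGATCCGTA"
right = "CCTGATCAGTATTGACA"
print(find_overlap(left, right, min_len=8, max_mismatches=1))
```

Each reported mismatch is a position to look up in the chromatograms. Only after the correct base has been chosen and the reads corrected should `merge` be used to build the combined sequence.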
5. Repeat the process with the other chromatograms until you assemble a complete
clean sequence
The final assembled sequence (without vector, without sequencing errors, and without
overlapping sequence) should be sent to the TA electronically as part of Homework 1
(find Hwk1 in the class Schedule WEB page). Add F2 in class (15 min), and complete the
rest as part of Homework 1.
Sequence Submission to NCBI
Introduction
An important part of working with genomic or protein sequences is the ability to submit
them to the central databases, such as GenBank. The software tool developed by NCBI
for submitting and updating entries in the GenBank, EMBL, or DDBJ sequence databases
is called Sequin, and it is freely available at http://www.ncbi.nlm.nih.gov/Sequin/. For
a Sequin tutorial, go to SEQUIN at NCBI.
Objective: Prepare a sequence submission to GenBank using Sequin.
Activities:
1. Use Sequin to prepare a sequence submission to GenBank
For this assignment you have the genomic DNA sequence (.txt) from barley (Hordeum
vulgare L.), the protein translation (.txt), and the annotated genomic sequence (.doc) for
the Acyl Co-A Synthetase. They are in the subdirectory named ‘Sequin Acyl Co-A
Synthetase’ within the Lab1 directory.
Download the files into the directory you created on the C: drive.
1.1. Start Sequin.
1.2. Enter your personal information as submitter. Request that the sequence be released 1
year from today.
1.3. Load the ‘proper’ Co-A data file(s) into Sequin. The FASTA genomic DNA
sequence is in the file Co-A_DNA.txt and the protein one in Co-A_Protein.txt (note that
the protein sequence should NOT have the asterisk representing the stop codon at the
end). For both .txt files, note that the SeqID after the ‘>’ symbol in the definition line
should not contain any space.
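The two formatting rules above can be checked before loading the files. The following is a minimal sketch, not part of Sequin; the function name `check_fasta` is hypothetical, and it only catches a space directly after the '>' symbol. A space inside the intended SeqID cannot be detected automatically, because everything after the first space is read as the title.

```python
def check_fasta(text, is_protein=False):
    """Return a list of problems found in a single-record FASTA string."""
    lines = [ln.rstrip() for ln in text.strip().splitlines() if ln.strip()]
    if not lines or not lines[0].startswith(">"):
        return ["first line must start with '>'"]
    problems = []
    defline = lines[0]
    # The SeqID must start immediately after '>'.
    if len(defline) == 1 or defline[1].isspace():
        problems.append("SeqID must follow '>' with no space")
    sequence = "".join(lines[1:])
    if not sequence:
        problems.append("no sequence lines found")
    # The protein sequence must not end with the stop-codon asterisk.
    if is_protein and sequence.endswith("*"):
        problems.append("protein sequence ends with '*'")
    return problems


# Made-up records for illustration only.
print(check_fasta(">Co-A_DNA\nATGGCC\n"))
print(check_fasta(">Co-A_Protein\nMAK*\n", is_protein=True))
```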
The final annotation of the sequence is in the Word document called ‘Final
annotation.doc’. Sequin will format and annotate the sequence using automated programs
(called macros) to determine exon locations, etc. Check if the coordinates of the exons
are correct using the Tools/Word Count option in your annotated Word document.
1.4. To get the taxonomic information, go to NCBI, select the Taxonomy
database, and search for Hordeum vulgare. Paste the lineage into your Sequin file in the
ORGANISM -> Lineage section:
Eukaryota; Viridiplantae; Streptophyta; Streptophytina; Embryophyta; Tracheophyta;
Euphyllophyta; Spermatophyta; Magnoliophyta; Liliopsida; commelinids; Poales;
Poaceae; BEP clade; Pooideae; Triticeae; Hordeum
1.5. Click on Search->Validate to check correctness of the automated annotation. If a
submission is invalid, you can correct it manually by clicking on the shown error and
completing the requested information.
1.6. Once you have fixed the error, click on Revalidate.
1.7. To save your document, click on File --> Export GenBank (then you will be able to
open this file as a Word document). The complete and validated GenBank file should be
submitted as part of Homework 1.