Sequencing analysis

BLAST searches for genes
You have sequenced DNA. The next step is to determine what genes are in your sample of
DNA. To do this, you will compare the sequence of each of the samples to all known DNA
sequences. The sequences we will search are located on a computer accessible through the
Internet, and the program that is used to do this is called BLAST.
BLAST is a program that has been designed to compare DNA sequences from two different
sources, and to determine the similarities between them. In our case the two sources are the
cloned DNA sequence and all known DNA sequences. The BLAST program then reports
sequences that have a “good enough” match. Although the “good enough” value can be set by
you, in general, we want a better than random chance of a match between the two DNA
sequences. The way to determine if the match is good enough is to check the “E” value (expected).
This value is the number of hits (similar sequences) that are expected due to randomness. A match
must have an E value less than 1 to be considered valid, and smaller E values are better. Values
less than 0.1 are fair matches, and values less than 0.01 are good ones. An E value of 0 is an exact or
nearly exact match.
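
The cutoffs above can be written as a small lookup. This is only a sketch: the function name and the label "weak" (for values between 0.1 and 1, which the text calls valid but does not name) are assumptions made for illustration and are not part of BLAST or DNAStar.

```python
def classify_e_value(e_value: float) -> str:
    """Label a BLAST E value using the cutoffs given in this handout."""
    if e_value < 0:
        raise ValueError("E values cannot be negative")
    if e_value == 0:
        return "exact or nearly exact"
    if e_value < 0.01:
        return "good"
    if e_value < 0.1:
        return "fair"
    if e_value < 1:
        return "weak"  # assumed label: valid, but not yet a fair match
    return "not valid"


for e in (0.0, 0.003, 0.05, 0.5, 2.2):
    print(e, "->", classify_e_value(e))
```
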
The results you will get will look similar to those below:
emb|X59788.1|TADNAGD T.acidophilum gene for glucose dehydrogenase
Length = 1676
Score = 3322 bits (1676), Expect = 0.0
Identities = 1676/1676 (100%)
Strand = Plus / Plus
Query: 1 gaattctttcaatgaaacctacttttagacatttttgtgaagaatatgcaccgtaatcag 60
         ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 1 gaattctttcaatgaaacctacttttagacatttttgtgaagaatatgcaccgtaatcag 60

gb|AF045643.1|CELF58H7 Caenorhabditis elegans cosmid F58H7
Length = 43834
Score = 40.1 bits (20), Expect = 2.2
Identities = 23/24 (95%)
Strand = Plus / Minus
Query: 948   ctactcgtttgcaggcgtagatcg 971
             ||||||||||||||||| ||||||
Sbjct: 30557 ctactcgtttgcaggcggagatcg 30534
In the above two examples, the T. acidophilum gene gave an expected value of 0, while the
Caenorhabditis cosmid gave an expected value of 2.2. Since the sequence used for the search
was from T. acidophilum, these results were unsurprising. However, the point is that BLAST will
report results that may not be statistically valid, such as the Caenorhabditis result, and you should
ignore these. In some cases the best match will have an E value of 5 or 6. That essentially says
that no similar genes have been found in the database.
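
To show where the "Identities" and "Strand" lines come from, here is a minimal sketch that recomputes them for the Caenorhabditis hit. The function names are made up for this example, and the percentage is rounded down only because that reproduces the 95% printed in the sample output.

```python
def identity(query: str, subject: str):
    """Return (matches, aligned length, whole-number percent) for two aligned strings."""
    if not query or len(query) != len(subject):
        raise ValueError("aligned sequences must be non-empty and of equal length")
    matches = sum(q == s for q, s in zip(query, subject))
    percent = 100 * matches // len(query)  # rounded down, as in "23/24 (95%)"
    return matches, len(query), percent


def strand(start: int, end: int) -> str:
    """Coordinates that count down mean the match is on the minus strand."""
    return "Plus" if start <= end else "Minus"


query = "ctactcgtttgcaggcgtagatcg"    # query bases 948 to 971
subject = "ctactcgtttgcaggcggagatcg"  # subject bases 30557 down to 30534

print(identity(query, subject))
print(strand(948, 971), "/", strand(30557, 30534))
```

The subject coordinates run from 30557 down to 30534, which is why the report says Plus / Minus: the query matches the reverse strand of the cosmid.
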
For this part of the lab, we are trying to find out which genes are in the DNA you sequenced.
Last printed on: May 7, 2004
Procedure
You will use the DNAStar Lasergene group of programs to analyze the sequences. These
programs have capabilities far beyond what we need for this class. To do this lab, this set of
programs must be present on the computer.
SeqMan II method
1) To start the SeqMan program, select “Start”, then “Programs”, then “DNAStar”, and, last,
“SeqMan”.
2) We will check the stringency of the sequence data to make sure that it is set to “low”. This
value affects how many bases are called “good”: the lower the stringency, the more bases are
called “good”. When in SeqMan, click on “Project”, then “Parameters”.
3) Click on “End trimming” in the left column, then “Low” on the right side under Quality. To
close the Parameters dialog, click on the “Ok” button.
4) We will now load the sequences that you obtained. Click on “Sequence”, then “Add”.
5) In “File of type”, select “ABI files w/o extensions”. Note that due to a bug in the program,
the option “Match all files” actually matches none! Then select the sequence files on drive A
by clicking on “Look in” and selecting the A drive. Next, select the “Add all” button on the
right side of the dialog. The sequence files should move to the right box. Last, to add the
sequences, click on the “Done” button.
6) The files should now be in the “Unassembled sequences” dialog. We will now tell the
program which vector was used in the sequencing. Select all the files by pressing Ctrl-A,
or by selecting “Edit”, then “Select all”.
7) Click on one of the “Set vector” lists, then select the plasmid that was used in your sample.
For clone bank sequencing, select “pSPORTI” or “pSPORTI.seq” from the list that drops
down. If the sample was a cloned PCR product, you would select “pcr4topo_seq” or
“pcr4blunttopo_seq”. This selection tells the program to remove the vector sequences from
your data.
8) Next, select “Trim ends”, then the “Scan all” button. The program will check each of your
files, remove the vector and all bad sequences, and report the good values in the limits section
of the Unassembled sequences dialog.
9) If “Limits” shows “∅”, no good data was found. Otherwise, it lists the range of bases the
program considers good enough to use. Note that if the two numbers in Limits differ by fewer
than 100 bases, the sequence is very poor. Under vector, a ¿ pSPORTI means no vector was
found, while a ? pSPORTI means that the vector was found.
10) Next, click on “Assemble”.
11) Assuming everything is okay, you should see a window containing one or more “Contigs”.
A contig is a sequence assembled from reads that overlap. In your case, each sequence is
probably in a different contig. You will now check the contigs (sequences) against all DNA and
all protein sequences in order to determine what gene(s) were in the DNA that was inserted
into your plasmid.
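
To make the idea of overlapping sequences concrete, the sketch below joins two reads when the end of one exactly matches the start of the other. It is a toy illustration only and is not how SeqMan assembles contigs (a real assembler tolerates mismatches and uses quality values). The function names and the `min_overlap` parameter are assumptions made for this example, and the two reads are short pieces cut from the T. acidophilum sequence shown earlier.

```python
from typing import Optional


def overlap_length(left: str, right: str, min_overlap: int = 3) -> int:
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    for size in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def merge_reads(left: str, right: str, min_overlap: int = 3) -> Optional[str]:
    """Join two reads into one sequence, or return None if they do not overlap."""
    size = overlap_length(left, right, min_overlap)
    if size == 0:
        return None
    return left + right[size:]


read_a = "gaattctttcaatgaaacc"
read_b = "caatgaaacctacttttag"

print(overlap_length(read_a, read_b))
print(merge_reads(read_a, read_b))
```

Reads that share no overlap return None, which corresponds to each sequence ending up in its own contig.
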
12) Select a contig by double-clicking on it. Next, select all of the sequence in the contig by
selecting “Edit” and then “Select all” (or with Ctrl-A). You can look at how the computer
called the sequences by going to “Sequence”, then “Show trace data”. A good sequence has
few or no overlapping peaks and the sequence has few or no Ns in it. N stands for aNy
nucleotide: the program could not distinguish which of the four bases was read by the
sequencer.
13) You will now search for matching DNA sequences by selecting “Net search”, then “Blast
search”. The program listed is “blastn”, and the database is “nr”. This checks the sequence
against all known DNA sequences. Click on the “Ok” button. Note any decent matches to
your sequence. Decent matches are ones that have an expected value less than 1. You can
save them to disk by selecting the items that you want to save and then “Batch save”.
However, some file matches will be very large. It is better to write these down before saving
any others. Write down the organism, gene or clone, E value, and where it matches your
DNA sequence (both the first and last ones that match). Close the window that contains the
matches before checking the next sequence.
14) Repeat step 13, but after clicking on “Blast search”, change the program to “blastx”. This
option will check the translated sequence against all known amino acid sequences. In other
words, it is checking your DNA for any genes related to known proteins.
15) Report the best blastx matches by saving the match to the floppy disk, as above, or by
writing down the matches and expected values in your lab report. Important: Write down the
expected values for each file that you save.
16) Repeat steps 12 through 15 for each sequence that you have.
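
The "translated sequence" in step 14 refers to reading the DNA three bases at a time in all six reading frames (three on each strand) and comparing each translation to known proteins. The sketch below shows that translation step only, assuming the standard genetic code. It does not perform a search, and the function names are made up for this example.

```python
from itertools import product

# Standard genetic code, with codons listed in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(dna: str) -> str:
    """Return the opposite strand, read 5' to 3'."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate codon by codon; '*' is a stop, 'X' an unreadable codon (e.g. with N)."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frames(dna: str) -> dict:
    """Translate all three frames of each strand."""
    forward = dna.upper()
    reverse = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(forward[offset:])
        frames[f"-{offset + 1}"] = translate(reverse[offset:])
    return frames


for name, protein in six_frames("ATGGCCTAA").items():
    print(name, protein)
```
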
Analysis
For your lab report, you should do the following (besides the normal procedures, intro, etc.):
1) Turn in your disk with your sequences.
2) In your analysis, report the best four matches for each of your contigs for each of the blastn
and blastx searches. However, only report them if they are statistically valid. If you don’t
get matches, say so and suggest why.
3) Explain your results!
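
If you have typed your matches into a list, item 2 amounts to filtering and sorting. A minimal sketch, assuming the same cutoff as the procedure (E value less than 1); the `Hit` record and the function name are made up for this example, and the two hits are the ones from the sample output earlier in this handout.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    description: str
    e_value: float


def best_valid_hits(hits: List[Hit], max_e: float = 1.0, limit: int = 4) -> List[Hit]:
    """Keep hits with E value below max_e, smallest E value first, at most `limit`."""
    valid = [hit for hit in hits if hit.e_value < max_e]
    return sorted(valid, key=lambda hit: hit.e_value)[:limit]


hits = [
    Hit("C. elegans cosmid F58H7 (AF045643.1)", 2.2),
    Hit("T. acidophilum glucose dehydrogenase gene (X59788.1)", 0.0),
]

for hit in best_valid_hits(hits):
    print(hit.description, hit.e_value)
```

An empty result means no statistically valid match was found, which is the case you are asked to report and explain.
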