BLAST searches for genes

You have sequenced DNA. The next step is to determine what genes are in your sample of DNA. To do this, you will compare the sequence of each sample to all known DNA sequences. The sequences we will search are located on a computer accessible through the Internet, and the program used to search them is called BLAST. BLAST is designed to compare DNA sequences from two different sources and to determine the similarities between them. In our case, the two sources are the cloned DNA sequence and all known DNA sequences.

BLAST then reports sequences that have a “good enough” match. Although you can set the “good enough” value yourself, in general we want a match between the two DNA sequences that is better than random chance. To determine whether a match is good enough, check the “E” (expected) value. This value is the number of hits (similar sequences) that are expected by chance alone. An E value less than 1 is necessary for a valid match, and smaller E values are better. Values less than 0.1 are fair matches, and values less than 0.01 are good ones. An E value of 0 is an exact or nearly exact match.

The results you get will look similar to those below:

emb|X59788.1|TADNAGD T.acidophilum gene for glucose dehydrogenase
Length = 1676
Score = 3322 bits (1676), Expect = 0.0
Identities = 1676/1676 (100%)
Strand = Plus / Plus

Query: 1  gaattctttcaatgaaacctacttttagacatttttgtgaagaatatgcaccgtaatcag 60
          ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 1  gaattctttcaatgaaacctacttttagacatttttgtgaagaatatgcaccgtaatcag 60

gb|AF045643.1|CELF58H7 Caenorhabditis elegans cosmid F58H7
Length = 43834
Score = 40.1 bits (20), Expect = 2.2
Identities = 23/24 (95%)
Strand = Plus / Minus

Query: 948   ctactcgtttgcaggcgtagatcg 971
             ||||||||||||||||| ||||||
Sbjct: 30557 ctactcgtttgcaggcggagatcg 30534

In the above two examples, the results for the T.
acidophilum gene gave an expected value of 0, while the Caenorhabditis sequence gave an expected value of 2.2. Since the gene used for the search was from T. acidophilum, these results were unsurprising. The point, however, is that BLAST will report some results that are not statistically valid, such as the Caenorhabditis result, and you should ignore these. In some cases the best match will have an E value of 5 or 6, which essentially says that no similar genes were found in the data bank.

For this part of the lab, we are trying to find out which genes are in the DNA you sequenced.

Last printed on: May 7, 2004

Procedure

You will use the DNAStar Lasergene group of programs to analyze the sequences. These programs have capabilities far beyond what we need for this class. To do this lab, this set of programs must be present on the computer.

SeqMan II method

1) To start the SeqMan program, select “Start”, then “Programs”, then “DNAStar”, and, last, “SeqMan”.
2) We will check that the stringency of the sequence data is set to “low”. This value affects how many bases are called “good”: the lower the stringency, the more bases are called “good”. In SeqMan, click on “Project”, then “Parameters”.
3) Click on “End trimming” in the left column, then “Low” on the right side under Quality. To close the Parameters dialog, click on the “Ok” button.
4) We will now load the sequences that you obtained. Click on “Sequence”, then “Add”.
5) In “File of type”, select “ABI files w/o extensions”. Note that, due to a bug in the program, the option “Match all files” actually matches none! Then select the sequence files on drive A by clicking on “Look in” and selecting the A drive. Next, click the “Add all” button on the right side of the dialog. The sequence files should move to the right box. Last, to add the sequences, click on the “Done” button.
6) The files should now be in the “Unassembled sequences” dialog. We will now tell the program which vector was used in the sequencing.
Select all the files by pressing Ctrl-A, or by selecting “Edit”, then “Select all”.
7) Click on one of the “Set vector” lists, then select the plasmid that was used in your sample. For clone bank sequencing, select “pSPORTI” or “pSPORTI.seq” from the list that drops down. If the sample was a cloned PCR product, you would select “pcr4topo_seq” or “pcr4blunttopo_seq”. This selection tells the program to remove the vector sequences from your data.
8) Next, select “Trim ends”, then the “Scan all” button. The program will check each of your files, remove the vector and all bad sequences, and report the good values in the “Limits” section of the “Unassembled sequences” dialog.
9) If “Limits” shows a “∅”, no good data was found. Otherwise, it lists the range of bases judged good enough to use. Note that if the difference between the two numbers in “Limits” is under 100 bases, the sequence is very poor. Under vector, a ¿ pSPORTI means no vector was found, while a ? pSPORTI means that the vector was found.
10) Next, click on “Assemble”.
11) Assuming everything is okay, you should see a window containing one or more “contigs”. A contig is a set of sequences that overlap. In your case, each sequence is probably in a different contig. You will now check the contigs (sequences) against all DNA and all protein sequences in order to determine what gene(s) were in the DNA that was inserted into your plasmid.
12) Select a contig by double-clicking on it. Next, select all of the sequence in the contig by selecting “Edit” and then “Select all” (or with Ctrl-A). You can look at how the computer called the sequence by going to “Sequence”, then “Show trace data”. A good sequence has few or no overlapping peaks and few or no Ns in it. N stands for aNy nucleotide: the program could not distinguish which of the four bases was read by the sequencer.
13) You will now search for matching DNA sequences by selecting “Net search”, then “Blast search”.
The program listed is “blastn”, and the database is “nr”. This checks the sequence against all known DNA sequences. Click on the “Ok” button. Note any decent matches to your sequence. Decent matches are ones that have an expected value less than 1. You can save them to disk by selecting the items that you want to save and then “Batch save”. However, some match files will be very large. It is better to write these down before saving any others. Write down the organism, gene or clone, E value, and where it matches your DNA sequence (both the first and last positions that match). Close the window that contains the matches before checking the next sequence.
14) Repeat step 13, but after clicking on “Blast search”, change the program to “blastx”. This option checks the translated sequence against all known amino acid sequences. In other words, it checks your DNA for any genes related to known proteins.
15) Report the best blastx matches by saving the matches to the floppy disk, as above, or by writing down the matches and expected values in your lab report. Important: Write down the expected values for each file that you save.
16) Repeat steps 12 through 15 for each sequence that you have.

Analysis

For your lab report, you should do the following (besides the normal procedures, intro, etc.):

1) Turn in your disk with your sequences.
2) In your analysis, report the best four matches for each of your contigs for each of the blastn and blastx searches. However, only report them if they are statistically valid. If you don’t get matches, say so and suggest why.
3) Explain your results!
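
The rule in Analysis step 2 (report the best four matches, but only if they are statistically valid) can be sketched in Python as a check on your hand-recorded results. This is only an illustration and not part of the DNAStar software: the names Hit, classify, and best_valid_hits are hypothetical, the label "valid (weak)" for E values between 0.1 and 1 is our own wording, and the two sample hits are the ones shown in the example output earlier in this handout.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One BLAST match as written down in step 13 (hypothetical record)."""
    description: str
    e_value: float


def classify(e_value: float) -> str:
    """Label an E value using the thresholds given in this handout."""
    if e_value == 0:
        return "exact or nearly exact"
    if e_value < 0.01:
        return "good"
    if e_value < 0.1:
        return "fair"
    if e_value < 1:
        return "valid (weak)"  # our own label; the handout only says "valid"
    return "not valid"


def best_valid_hits(hits, limit=4):
    """Keep hits with E < 1, smallest E first, and return at most `limit`."""
    valid = [h for h in hits if h.e_value < 1]
    return sorted(valid, key=lambda h: h.e_value)[:limit]


# The two matches from the example BLAST output in this handout.
hits = [
    Hit("C. elegans cosmid F58H7", 2.2),
    Hit("T. acidophilum gene for glucose dehydrogenase", 0.0),
]

for hit in best_valid_hits(hits):
    print(f"{hit.description}: E = {hit.e_value} ({classify(hit.e_value)})")
```

With these two hits, only the T. acidophilum match passes the E < 1 cutoff; the C. elegans match is dropped, just as the handout says to ignore it.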