CS144
Instructor: Liliana Florea
October 22, 2007

Lab session on gene finding methods

Extract the nucleotide sequence containing the human Bruton's tyrosine kinase (BTK) gene from GenBank (accession number: HSU78027) and save it as file BTK.seq. This is the genomic sequence to be annotated. Also save a copy of its GenBank annotation for later use.

Step 1. Filter "junk" DNA. Run RepeatMasker on the above sequence. (If the job queue is too long, use the repeat-masked sequence from Blackboard/Projects/Gene Finding.)

Step 2. Ab initio gene prediction. Run GenScan on the masked sequence. Familiarize yourself with the output (exons, introns, other elements of the gene structure). Extract the predicted gene and protein sequences, and write down the gene annotations.

Step 3. Gene / protein validation. Identify supporting cDNA and/or protein evidence for any of the predicted genes. For this, perform two Blast searches:
   i) submit the predicted gene sequence to a Blast nucleotide search (blastn) against dbEST (or RefSeq), and
   ii) submit the predicted gene sequence to a blastx search against the nr database of proteins.
Can you make any inference about its function?

Step 4.
a) Comparative gene prediction. Identify matches to known cDNA and protein sequences.
   i) Search the masked genomic sequence against dbEST and/or RefSeq (blastn).
   ii) Search the masked genomic sequence against the protein nr database (blastx).
(if time permits)
b) Select one of the cDNA sequences returned by the Blast search and align it to the original (unmasked) sequence using Sim4. If the Blast server is too slow (>3 min wait time), use the cDNA sequence from the disk.

Step 5. Compare the GenScan annotation with the GenBank record and the Sim4 result. Are they the same? How would you explain the differences?

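For the comparison in Step 5, it can help to line up the exon coordinates from two annotations side by side. The following is a minimal sketch in Python, not part of any of the tools above: the function name compare_exons and all coordinates are hypothetical placeholders, and you would replace them with the exon start/end positions you wrote down from GenScan, the GenBank record, or Sim4. It assumes both annotations use the same coordinate system (1-based, inclusive, same strand, same genomic sequence).

```python
from typing import Dict, List, Tuple

Exon = Tuple[int, int]  # (start, end), assumed 1-based and inclusive


def compare_exons(predicted: List[Exon], reference: List[Exon]) -> Dict[str, list]:
    """Classify exons as exact matches, boundary mismatches, extra, or missed."""
    pred_set, ref_set = set(predicted), set(reference)

    # Exons with identical start and end in both annotations.
    exact = sorted(pred_set & ref_set)

    # Exons that overlap a reference exon but differ in at least one boundary.
    partial = []
    for p in sorted(pred_set - ref_set):
        for r in sorted(ref_set - pred_set):
            if p[0] <= r[1] and r[0] <= p[1]:
                partial.append((p, r))

    matched_pred = {p for p, _ in partial}
    matched_ref = {r for _, r in partial}

    # Predicted exons with no reference counterpart, and vice versa.
    extra = sorted(pred_set - ref_set - matched_pred)
    missed = sorted(ref_set - pred_set - matched_ref)

    return {"exact": exact, "partial": partial, "extra": extra, "missed": missed}


# Made-up coordinates for illustration only (NOT real BTK exons).
genscan_exons = [(100, 200), (350, 480), (900, 1000)]
genbank_exons = [(100, 200), (360, 480), (600, 650)]

result = compare_exons(genscan_exons, genbank_exons)
for category, exons in result.items():
    print(category, exons)
```

Exons in the "partial" group usually point to a shifted splice site, "extra" exons to a possible false prediction or a neighboring gene, and "missed" exons to ones the predictor did not find (short or non-coding exons are common candidates).
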
Short list of useful web sites

RepeatMasker – http://www.repeatmasker.org/cgi-bin/WEBRepeatMasker
GenScan – http://genes.mit.edu/GENSCAN.html
Blast – http://www.ncbi.nlm.nih.gov/BLAST
Sim4 – http://pbil.univ-lyon1.fr/sim4.php