Understanding Basic Local Alignment Search Tool
Muhammad Awais
PhD Biochemistry
08-ARID-1103

DNA Sequencing
Bioinformatics is based on the fact that DNA sequencing is cheap, and is becoming easier and cheaper very quickly.
The Human Genome Project cost roughly $3 billion and took 13 years (1990-2003).
Sequencing James Watson’s genome in 2007 cost $2 million and took 2 months.
Today, you could get your genome sequenced for about $100,000, and it would take a month.

• Extracting information from the genome is difficult.
• Converting a string of ACGTs into knowledge of how the organism works is hard.
• Most of the work is done on the computer, with key confirming experiments done in the “wet lab”.
• The sequence below contains a gene critical for life: the gene that initiates replication of the DNA. Can you spot it?
We are now going to spend some time on what genes look like and how we can find them.

TTGGAAAACATTCATGATTTATGGGATAGAGCTTTAGATCAAATTGAAAAAAAATTAAGCAAACCTAGTTTTGAAACCTG
GCTCAAATCGACAAAAGCTCATGCTTTACAAGGAGACACGCTCATTATTACTGCACCTAATGATTTTGCACGGGACTGGT
TAGAATCTAGGTATTCTAATTTAATTGCTGAAACACTTTATGATCTTACGGGGGAAGAGTTAGATGTAAAATTTATTATT
CCTCCTAACCAGGCCGAGGAAGAATTCGATATTCAAACTCCTAAAAAGAAAGTCAATAAAGACGAAGGAGCAGAATTTCC
TCAAAGCATGCTAAATTCGAAGTATACCTTTGATACATTTGTTATCGGATCTGGAAATCGGTTTGCGCATGCAGCTTCTT
TAGCAGTAGCAGAAGCGCCGGCTAAAGCGTATAATCCGCTTTTTATTTACGGGGGAGTAGGATTAGGCAAAACACACTTA
ATGCACGCCATAGGCCACTATGTGTTAGATCATAATCCTGCCGCGAAAGTCGTGTACTTATCATCTGAAAAATTCACAAA
CGAGTTTATTAACTCTATTCGTGACAATAAAGCAGTAGAATTCCGCAACAAATACCGTAATGTAGATGTTTTACTGATTG
ATGATATTCAATTCTTAGCAGGTAAAGAGCAGACACAAGAAGAATTTTTCCATACGTTTAATACGCTTCACGAAGAAAGC
AAGCAGATTGTCATCTCAAGTGATCGACCGCCGAAAGAAATTCCTACACTTGAAGATCGACTTCGCTCTCGCTTTGAATG
GGGCCTTATTACAGACATCACACCACCAGATTTGGAAACACGAATTGCTATTTTGCGTAAAAAAGCCAAAGCGGACGGCT
TAGTTATTCCAAATGAAGTTATGCTTTATATCGCCAATCAGATTGATTCAAATATTAGAGAATTAGAAGGCGCACTTATT

Genes and Proteins
• Most genes code for proteins: each gene contains the information necessary to make one protein.
• Proteins are the most important type of macromolecule.
– Structure: collagen in skin, keratin in hair, crystallin in the eye.
– Enzymes: all metabolic transformations (building up, rearranging, and breaking down organic compounds) are carried out by enzymes, which are proteins.
– Transport: oxygen in the blood is carried by hemoglobin, and everything that goes into or out of a cell (except water and a few gases) is carried by proteins.

The Genetic Code
• Proteins are long chains of amino acids.
• There are 20 different amino acids coded in DNA.
• There are only 4 DNA bases, so you need 3 DNA bases to code for the 20 amino acids.
– 2 bases give only 4 x 4 = 16 combinations, which is fewer than 20
– 4 x 4 x 4 = 64 possible 3-base combinations (codons)
– Each codon codes for one amino acid (or for a stop signal)
– Most amino acids have more than one possible codon
• Genes start at a start codon and end at a stop codon. Three codons (TAA, TAG, and TGA) are stop codons: all genes end at one of them.
• Start codons are a bit trickier, since they are used in the middle of genes as well as at the beginning.
– In eukaryotes, ATG is almost always the start codon.
– In prokaryotes, ATG, GTG, or TTG can be used as a start codon.
• In bioinformatics, we generally ignore the fact that RNA uses the base uracil (U) in place of T.

Reading Frames
• Since codons consist of 3 bases, there are 3 “reading frames” possible on an RNA (or on one strand of DNA), depending on whether you start reading from the first base, the second base, or the third base.
– The different reading frames give entirely different proteins.
– Double-stranded DNA has 6 reading frames: 3 on each strand.
• Each gene uses a single reading frame, so once the ribosome gets started, it just has to count off groups of 3 bases to produce the proper protein.

Open Reading Frames
• Ribosomes are very obedient to stop codons: when a stop codon is reached, the protein is finished. Thus, all genes end at the first stop codon in their reading frame.
• Since 3 out of the 64 codons are stop codons, random DNA has stop codons very frequently (on average about once every 21 codons).
– Open reading frames (ORFs) are regions with no stop codons.
– All genes reside in long open reading frames.
– Note that stop codons in other reading frames have no effect on the gene.
• The start codon must occur “upstream” in the same reading frame as the stop codon. It is usually near the beginning of the ORF, but it is not necessarily the first possible start codon.
– Determining the exact start codon is not easy or obvious.
– But the first start codon in an open reading frame is usually a reasonable guess.

BLAST (Basic Local Alignment Search Tool)
• The BLAST programs are a set of sequence comparison algorithms, introduced in 1990, that are used to search sequence databases for optimal local alignments to a query.
• BLAST itself is a piece of software that can be run on almost any computer, but the database needed for a good cross-species comparison is quite large.
– The database is called “nr” for “non-redundant”, and it contains at least 20 Gb of sequence data.
• Terminology: your sequence, which you paste into the box on the web site, is the query sequence. Sequences in the database that match yours are called subject sequences.

Global Alignment
• Compares the total length of two sequences

Local Alignment
• Compares segments of sequences
• Finds cases where one sequence is a part of another sequence, or where the two match only in parts
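To make the idea of a local alignment concrete, here is a minimal sketch in Python. It uses exhaustive dynamic programming (the Smith-Waterman method), not the faster heuristic that BLAST itself relies on, and the scoring values (match +2, mismatch -1, gap -2) and the function name are assumptions chosen only for illustration.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best_score, aligned_a, aligned_b) for the best local alignment.

    The scoring values are illustrative assumptions, not BLAST defaults.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the table; a cell never drops below 0, which is what lets the
    # alignment start and end anywhere (local rather than global).
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                0,
                score[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until the score reaches 0.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        pair = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + pair:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    # The short query is found inside the longer sequence.
    print(smith_waterman("ACGTACGT", "TTTTACGTACGTCCCC"))
```

A global alignment would instead force both sequences to be aligned end to end, so the unrelated flanking bases in the longer sequence would all count against the score.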
BLAST terminology
• Query sequence: the input
• Target database: e.g., GenBank or SwissProt
• Output: a sequence list of hits (subject sequences)
• Result: information about the input query sequence, e.g., function
• The aim of a database (BLAST) search is to discover sequence homology on the basis of sequence similarity.
• BLAST returns similar sequences, which are not necessarily biologically related sequences.

blastx
• Searches a protein database using a translated nucleotide query
• Used to find homologous proteins to a nucleotide coding region
• Translates the query sequence in all six reading frames
• Often the first analysis performed with a newly determined nucleotide sequence

tblastn
• Searches a translated nucleotide database using a protein query
• Does six-frame translations of the nucleotide database
• Finds homologous protein-coding regions

tblastx
• Searches a translated nucleotide database using a translated nucleotide query
• Both translations use all six frames
• Useful in identifying potential proteins
• Good tool for identifying novel genes
• Computationally intensive

BLAST Scores
• Results are arranged with the best ones on top.
• The most important score is the Expect value, or E-value, which can be defined as the number of hits you would expect by chance when searching the database with a random sequence of the same length as yours.
– E-values for good hits are usually written something like 3e-42, which is the same as 3 x 10^-42, a very small number.
– Bad hits are very common, and they have E-values in a more familiar form: for example, 0.004 or 1.2.
• In this case we see many hits with good E-values, and the top E-values are all quite similar.
• Before we can conclude that our protein is a homologue of the proteins BLAST matches it with, we would like them to have roughly the same length and a high percentage of identical amino acids:
– The lengths of the query and subject sequences should be within 20% of each other.
– There should be at least 30% identical amino acids.
– In this case we can be quite sure we have a good match.
• BLAST also returns a fourth value, the bit score, which we are going to ignore.

A Sequence to BLAST
• This is a more-or-less randomly chosen gene from Bacillus megaterium.
– Its protein is 174 amino acids long.
• It is written in “fasta” format: the first line starts with > and is immediately followed by an identifier (ORF00135), and then some miscellaneous comments.
• After that the sequence is written without spaces or other marks.

>ORF00135 |chromosome 538197 538721
MKAKLIQYVYDAECRLFKS
VNQHFDRKHLNRFLRLLTH
AGGATFTIVIACLLLFLYPSS
VAYACAFSLAVSHIPVAIAK
KLYPRKRPYIQLKHTKVLE
NPLKDHSFPSGHTTAIFSLVT
PLMIVYPAFAAVLLPLAVMV
GISRIYLGLHYPTDVMVGLI
LGIFSGAVALNIFLT

Protein BLAST
tblastn (Protein to Nucleotide)

Gene Names
• Most genes are named for the function of their protein.
– At some point, some related genes had their function determined through lab work: by examining the effects of mutations in the gene, by isolating and studying the protein produced by the gene, etc.
– Functions include enzymes (names end in –ase), transport across the cell membrane, genetic information processing (DNA->RNA->protein), structural proteins, sporulation and germination, and more!
• Many genes (maybe 1/4 of them in a typical genome) have no known function, although they are found in several different species: these are called conserved hypothetical genes.
• Every new genome has some genes that are unique: they have no matching BLAST hits in the database.
– Are they real genes? Sometimes there is evidence in the form of messenger RNA, but usually we don’t know.
– We call them hypothetical genes.
• “Putative” means that we think we know the gene’s function but we aren’t sure. “Putative” should be followed by the function name.

Summary
1. DNA can be read in 3 different reading frames, a consequence of the genetic code (3 bases = 1 amino acid).
2. Genes are found in long open reading frames, areas where there are no stop codons.
3. BLAST is the tool we use to compare sequences between species.
• BLAST scores (E-values) describe the number of hits expected by chance when searching the database with a random sequence.
4. Gene sequences are conserved between species by natural selection.
• DNA sequences outside of genes are much less conserved.

Internet Links
blast.ncbi.nlm.nih.gov/ (Official)
www.ncbi.nlm.nih.gov/ (Official)
www.ebi.ac.uk/Tools/sss/ncbiblast/ (EU Database)
www.youtube.com/watch?v=rlK-5joOlyU (Video Tutorials)
http://en.wikipedia.org/wiki/BLAST (Text Knowledge)
http://www.dogpile.com/ (Preferred Search Engine)
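
As a recap of summary points 1 and 2, the following Python sketch scans a DNA string for open reading frames in all three frames of both strands. It is only a rough illustration: the function names and the `min_codons` cutoff are assumptions, the start codons are the prokaryotic set named in the slides, and it simply takes the first start codon in each ORF, which the slides describe as a reasonable guess rather than a certainty.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODONS = {"ATG", "GTG", "TTG"}  # prokaryotic start codons from the slides


def reverse_complement(seq):
    """Return the opposite strand, read 5' to 3'."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_orfs(seq, min_codons=30):
    """List ORFs as (strand, frame, start, end, dna) tuples.

    An ORF here runs from the first start codon in a frame to the next
    in-frame stop codon (stop included in the returned DNA). min_codons
    counts the codons before the stop; its default is an arbitrary
    assumption. Coordinates on the "-" strand refer to positions in the
    reverse-complemented sequence, not the original one.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            # Step through the strand one codon at a time in this frame.
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon in START_CODONS:
                    start = i  # first possible start codon in this ORF
                elif start is not None and codon in STOP_CODONS:
                    if (i - start) // 3 >= min_codons:
                        orfs.append((strand, frame, start, i + 3, s[start:i + 3]))
                    start = None  # look for the next ORF in this frame
    return orfs


if __name__ == "__main__":
    for orf in find_orfs("CCATGAAATTTGGGTAACC", min_codons=1):
        print(orf)
```

On a real genome, a much larger `min_codons` value keeps the short ORFs that arise by chance out of the results.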