Introduction to NCBI & Ensembl tools including BLAST and database searching
Incorporating Bioinformatics into the High School Biology Curriculum

Fran Lewitter, Ph.D.
Director, Bioinformatics & Research Computing
Whitehead Institute for Biomedical Research
Lewitter AT wi.mit.edu
http://jura.wi.mit.edu/bio

What I hope you’ll learn
• What do we learn from database searching and sequence alignments
• What tools are available at NCBI
• What tools are available through Ensembl

Lewitter-Whitehead Institute – August 15, 2012

First some background info

Simian sarcoma virus onc gene, v-sis, is derived from the gene (or genes) encoding a platelet-derived growth factor.
Doolittle RF, Hunkapiller MW, Hood LE, Devare SG, Robbins KC, Aaronson SA, Antoniades HN. Science 221:275-277, 1983.

Why do alignments
• Use sequence similarity to infer homology and/or structural similarity between 2 or more genes/proteins
• Identify more conserved regions of a protein, potentially identifying regions of most functional importance
• Compare and contrast homologs (perhaps into groups) based on shared positions or regions
• Infer evolutionary distance from sequence dissimilarity

Evolutionary Basis of Sequence Alignment
• Similarity - observable quantity, such as percent identity
• Homology - conclusion drawn from data that two genes share a common evolutionary history; no metric is associated with this
  – Paralog – genes related by duplication
  – Ortholog – genes related by speciation

Local vs Global Alignment
[Figure: local vs. global alignment. From Mount, Bioinformatics, 2004, pg 71]

Nucleotide vs Protein
• If comparing protein coding genes, use protein sequences because of less noise
• If protein sequences are very similar, it
might be more instructive to use DNA sequences

Example of simple scoring system for nucleic acids
• Match = +1 (ex. A-A, T-T, C-C, G-G)
• Mismatch = -1 (ex. A-T, A-C, etc)
• Gap opening = -5
• Gap extension = -2

 T  C  A  G  A  C  G  A  G  T  G
 T  C  G  G  A  -  -  G  C  T  G
+1 +1 -1 +1 +1 -5 -2 -1 -1 +1 +1 = -4

Possible Alignments
A: T C A G A C G A G T G
B: T C G G A G C T G

I.   T C A G A C G A G T G
     T C G G A - - G C T G     score = -4

II.  T C A G A C G A G T G
     T C G G A - G C - T G     score = -5

III. T C A G A C G A G T G
     T C G G A - G - C T G     score = -5

WIBR Sequence Analysis Course, © Whitehead Institute, February 2005

Scoring for Protein Alignments - Amino Acid Substitution Matrices
• PAM - point accepted mutation based on global alignment [evolutionary model]
• BLOSUM - block substitutions based on local alignments [similarity among conserved sequences]

AA Scoring Matrices
• PAM - point accepted mutation based on global alignment [evolutionary model]
  Log-odds = (pair in homologous proteins) / (pair in unrelated proteins by chance)
• BLOSUM - block substitutions based on local alignments [similarity among conserved sequences]
  Log-odds = (obs freq of aa substitutions) / (freq expected by chance)

Substitution Matrices
[Figure: matrices listed in order of increasing similarity]
BLOSUM (% identity): BLOSUM 30, BLOSUM 62, BLOSUM 80
PAM (% change): PAM 250 (80), PAM 120 (66), PAM 90 (50)

Scoring for BLAST Alignments
Score = 94.0 bits (230), Expect = 6e-19
Identities = 45/101 (44%), Positives = 54/101 (52%), Gaps = 7/101 (6%)

Query: 204 YTGPFCDV----DTKASCYDGRGLSYRGLARTTLSGAPCQPWASEATYRNVTAEQ---AR 256
           Y+  FC      +  + CY G G +YRG    T SGA C PW S      V   Q   A+
Sbjct: 198 YSSEFCSTPACSEGNSDCYFGNGSAYRGTHSLTESGASCLPWNSMILIGKVYTAQNPSAQ 257

Query: 257 NWGLGGHAFCRNPDNDIRPWCFVLNRDRLSWEYCDLAQCQT 297
             GLG H +CRNPD D +PWC VL   RL+WEYCD+  C T
Sbjct: 258 ALGLGKHNYCRNPDGDAKPWCHVLKNRRLTWEYCDVPSCST 298

Based on BLOSUM62
Position 1:   Y - Y = 7
Position 2:   T - S = 1
Position 3:   G - S = 0
Position 4:   P - E = -1
.
.
Position 9:   - - P = -11
Position 10:  - - A = -1
.
.
Sum = 230

Statistical Significance
• Raw Scores - score of an alignment equal to the sum of substitution and gap scores.
• Bit scores - scaled version of an alignment’s raw score that accounts for the statistical properties of the scoring system used.
• E-value - expected number of distinct alignments that would achieve a given score by chance. Lower E-value => more significant.

A formula
$E = K m n \, e^{-\lambda S}$
This is the Expected number of high-scoring segment pairs (HSPs) with score at least S for sequences of length m and n. This is the E value for the score S. The parameters K and λ can be thought of simply as natural scales for the search space size and the scoring system respectively.

What’s significant?
• High confidence - >40% identity for long alignments (Rost, 1999 found that sequence alignments unambiguously distinguish between protein pairs of similar and non-similar structure when the pairwise sequence identity >40%)
• “Twilight zone” – blurry - 20-35% identity
• “Midnight zone” - <20% identity

Tools of Interest
• NCBI (US) - http://www.ncbi.nlm.nih.gov/
  – BLAST
  – Entrez
• EBI/EMBL/WTSI (Europe)
  – Ensembl (http://www.ensembl.org/)

NCBI Home Page
[Figure: NCBI home page]

Ensembl Home Page
[Figure: Ensembl home page]

Hands-On
• Entrez – finding sequences
• BLAST – look for similar sequences
• Ensembl – finding sequences

How is BLAST used @ WIBR?
• Looking for homologs
• Identifying domains in proteins
• Peptide identification
• Genome annotation

Remember
1. Don’t be afraid to look at help pages for each website
   • Click, click, click
2. Any analysis tool will give you results
3. Interpret results you find

Sequence of 1992 Note Biotechniques,
   1 GCGTTGCTGG CGTTTTTCCA TAGGCTCCGC
  31 CCCCCTGACG AGCATCACAA AAATCGACGC
  61 GGTGGCGAAA CCCGACAGGA CTATAAAGAT
     …………..
1371 GTAAAGTCTG GAAACGCGGA AGTCAGCGCC
(Published in 1990)

“Here you see the actual structure of a small fragment of dinosaur DNA,” Wu said. “Notice the sequence is made up of four compounds - adenine, guanine, thymine and cytosine. This amount of DNA probably contains instructions to make a single protein - say, a hormone or an enzyme. The full DNA molecule contains three billion of these bases. If we looked at a screen like this once a second, for eight hours a day, it’d still take more than two years to look at the entire DNA strand. It’s that big.” (page 103)

DinoDNA
"Dinosaur DNA" from Crichton's THE LOST WORLD p.
135
GAATTCCGGAAGCGAGCAAGAGATAAGTCCTGGC
ATCAGATACAGTTGGAGATAAGGACGACGTGTGG
CAGCTCCCGCAGAGGATTCACTGGAAGTGCATTA
CCTATCCCATGGGAGCCATGGAGTTCGTGGCGCT
GGGGGGGCCGGATGCGGGCTCCCCCACTCCGTTC
CCTGATGAAGCCGGAGCCTTCCTGGGGCTGGGGG
GGGGCGAGAGGACGGAGGCGGGGGGGCTGCTGGC
CTCCTACCCCCCCTCAGGCCGCGTGTCCCTGGTG
CCGTGGCAGACACGGGTACTTTGGGGACCCCCCA
GTGGGTGCCGCCCGCCACCCAAATGGAGCCCCCC
CACTACCTGGAGCTGCTGCAACCCCCCCGGGGCA
GCCCCCCCCATCCCTCCTCCGGGCCCCTACTGCC
ACTCAGCAGCGCCTGCGGCCTCTACTACAAAC……
Published in 1995

>Erythroid transcription factor (NF-E1 DNA-binding protein)

Query: 121  MEFVALGGPDAGSPTPFPDEAGAFLGLGGGERTEAGGLLASYPPSGRVSLVPWADTGTLG 300
            MEFVALGGPDAGSPTPFPDEAGAFLGLGGGERTEAGGLLASYPPSGRVSLVPWADTGTLG
Sbjct: 1    MEFVALGGPDAGSPTPFPDEAGAFLGLGGGERTEAGGLLASYPPSGRVSLVPWADTGTLG 60

Query: 301  TPQWVPPATQMEPPHYLELLQPPRGSPPHPSSGPLLPLSSGPPPCEARECVMARKNCGAT 480
            TPQWVPPATQMEPPHYLELLQPPRGSPPHPSSGPLLPLSSGPPPCEARECV    NCGAT
Sbjct: 61   TPQWVPPATQMEPPHYLELLQPPRGSPPHPSSGPLLPLSSGPPPCEARECV----NCGAT 116

Query: 481  ATPLWRRDGTGHYLCNWASACGLYHRLNGQNRPLIRPKKRLLVSKRAGTVCSHERENCQT 660
            ATPLWRRDGTGHYLCN   ACGLYHRLNGQNRPLIRPKKRLLVSKRAGTVCS    NCQT
Sbjct: 117  ATPLWRRDGTGHYLCN---ACGLYHRLNGQNRPLIRPKKRLLVSKRAGTVCS----NCQT 169

Query: 661  STTTLWRRSPMGDPVCNNIHACGLYYKLHQVNRPLTMRKDGIQTRNRKVSSKGKKRRPPG 840
            STTTLWRRSPMGDPVCN   ACGLYYKLHQVNRPLTMRKDGIQTRNRKVSSKGKKRRPPG
Sbjct: 170  STTTLWRRSPMGDPVCN---ACGLYYKLHQVNRPLTMRKDGIQTRNRKVSSKGKKRRPPG 226

Query: 841  GGNPSATAGGGAPMGGGGDPSMPPPPPPPAAAPPQSDALYALGPVVLSGHFLPFGNSGGF 1020
            GGNPSATAGGGAPMGGGGDPSMPPPPPPPAAAPPQSDALYALGPVVLSGHFLPFGNSGGF
Sbjct: 227  GGNPSATAGGGAPMGGGGDPSMPPPPPPPAAAPPQSDALYALGPVVLSGHFLPFGNSGGF 286

Query: 1021 FGGGAGGYTAPPGLSPQI 1074
            FGGGAGGYTAPPGLSPQI
Sbjct: 287  FGGGAGGYTAPPGLSPQI 304
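
Worked example: scoring the "Possible Alignments"

The simple nucleic acid scoring system from the earlier slides (match +1, mismatch -1, gap opening -5, gap extension -2) can be checked with a few lines of Python. This is only a sketch. The function name score_alignment and the constant names are made up for illustration. It assumes, as in the worked column-by-column example on the slide, that the first position of a gap scores -5 and each further position in the same gap scores -2.

```python
# Simple nucleotide scoring scheme from the slides.
MATCH = 1
MISMATCH = -1
GAP_OPEN = -5    # assumption: score for the first position of a gap
GAP_EXTEND = -2  # assumption: score for each further position in that gap


def score_alignment(top: str, bottom: str) -> int:
    """Score two gapped sequences of equal length, column by column."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    total = 0
    prev_gap_row = None  # which row held a gap in the previous column
    for a, b in zip(top, bottom):
        if a == "-" and b == "-":
            raise ValueError("a column cannot be a gap in both sequences")
        if a == "-" or b == "-":
            gap_row = "top" if a == "-" else "bottom"
            # Continuing a gap in the same row is an extension,
            # anything else opens a new gap.
            total += GAP_EXTEND if gap_row == prev_gap_row else GAP_OPEN
            prev_gap_row = gap_row
        else:
            total += MATCH if a == b else MISMATCH
            prev_gap_row = None
    return total


SEQ_A = "TCAGACGAGTG"
ALIGNMENTS = {
    "I": "TCGGA--GCTG",
    "II": "TCGGA-GC-TG",
    "III": "TCGGA-G-CTG",
}

for name, aligned_b in ALIGNMENTS.items():
    print(name, score_alignment(SEQ_A, aligned_b))
```

Running it prints -4 for alignment I and -5 for alignments II and III, the same scores shown on the slide. Alignment I scores best because one gap of length two (-5 -2) costs less than two separate gaps of length one (-5 -5).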