Ollie Bridle BSc. Hons., MA., MPhil.
oliver.bridle@ouls.ox.ac.uk
May 2008

Outline
1. Introduction.
2. Information sources in biology and associated problems.
3. What is bioinformatics?
4. DNA databases.
5. Entrez. (+ exercise)
6. Summary.

Aims
Convince you that these bioinformatics resources are valuable for research.
Give you some important searching strategies.
Show you how to find what you want.
Suggest other resources and further help.

What I Won’t Cover
All the resources available.
Commercial software.
Huge amounts of scientific detail.
Bibliographic and abstract databases: check out some of the other WISER sessions.

About Me…
Trainee librarian.
Formerly a biologist - degrees in Microbiology (BSc) and Microbial Genetics (MPhil).
Much less familiar with animal and population genetics…but…
As far as searching databases goes, similar principles apply.

Information Sources for Research - Key Questions
What is available?
Where do I find it?
How do I search it?

Information Sources for Research
Journals, books, theses, abstracts.
Technical literature (e.g. protocols, equipment handbooks).
Conferences, seminars, meetings and exhibitions.
Molecular biology databases.

Problems with Biological Data
Data collection.
The base of information is large, expanding and diverse.
Organisation and accessibility.
Requirement for special search techniques.
You can’t Google a DNA sequence…yet!
A student/researcher wants the right information quickly!!!

The Good News
Large projects working to organise this information.
Much is freely available over the internet.
University subscribes to many e-journals and bibliographic databases available through Oxlip.

A Definition of Bioinformatics
‘…information technology applied to the management and analysis of biological data’ (Attwood, T. K)
A multidisciplinary subject.

Bioinformatics aims to…
Collect, organise, store, retrieve, analyse
…biological data with the use of computers.

Scope of Bioinformatics
E-journals and bibliographic databases.
Protein structural modelling.
Gene expression studies.
Taxonomy and phylogenetics.
Protein interaction.
DNA/Protein sequence databases.

What is a DNA Sequence?
The DNA double helix is made up of a series of chemical bases strung along a sugar backbone.
There are 4 bases usually represented by the letters A, T, C and G.
The linear sequence in which these bases occur determines all the instructions for building an organism.

What is a Protein Sequence?
Proteins are complex molecules which control most aspects of cell biology.
Constructed of small subunits called amino acids.
There are 20 types of amino acid.
Assembled by ‘reading’ (or translating) the DNA sequence.
Every set of 3 bases (e.g. ATG) corresponds to an amino acid.
So a protein is built up one amino acid at a time according to the DNA blueprint.

In Summary…
DNA Sequence
DNA Molecule
Proteins
Complete Organism

Looking at DNA Sequences I
Analysis of DNA or protein sequences is a frequent requirement of research.
Locating genes within a sequence.
Comparing two sequences for similarity.
Searching for similar genes (orthologues) in other organisms.

Looking at DNA Sequences II
DNA sequences are easily stored, retrieved, compared and manipulated on computers.
Just represent each base as a letter!
Computers can compare two or more sequences and find similar regions.
Much analysis of genetic information now takes place in silico.

Looking at DNA Sequences III
DNA sequences can be determined experimentally.
Software allows biologists to construct and view maps of DNA sequence.
The DNA code of ATCG gets transformed into something much more human friendly.
Artemis is one available map viewer.

Artemis Map Viewer

Using a DNA Sequence
Forensics
Identifying genes of similar function
Medical diagnostics
Determining protein composition
Classification
Identification

DNA Databases
Free access to vast numbers of sequences deposited by researchers all over the world.
Used alongside scientific papers.
Can be searched or ‘mined’ in a variety of ways.

Global Bioinformatics Agencies
International Nucleotide Sequence Database Collaboration:
DNA Data Bank of Japan
European Molecular Biology Laboratory
National Center for Biotechnology Information

NCBI and GenBank
GenBank is NCBI’s DNA database.
Extensive search and deposit capabilities.
(Figure label: 606 sequences)

A Practical Example
A researcher might start with a piece of DNA rather than a literature citation.
Here we will:
1. Search a DNA database using a piece of DNA sequence.
2. Use the results of the search to identify relevant literature.

The Experiment
1) Grow some bugs.
2) Extract the DNA.
3) Amplify up the desired section of DNA.
4) Generate sequence.

A DNA Sequence
The following sequence is in FASTA format.
>G08_CHEV11Fed.seq
GTCGACGCGCAAATGGTTCTATATCCATACCAATAGCAGTATCGTTGCCA
TTATCACGAATGGAATTAAGTAAAGTTTTCATTCTATCAATAGACTCTAA
AACCACATCCATGATATCTGGAGTTATTTTTAACTCGCCATGTCTTGCTT
TGTTTAAAACATCCTCCATGTGGTGAGTTAACTTTGTTAAAACATCAAAA
TTTAAGAAGCTTGATGATCCTTTAACCGTATGTGCAACACGGAAAATTCT
ATTTAATAATTCTAAATCTTCTGGATTTGATTCAAGCTCTACTAAATCAT
GGTCGATTTGCTCAACAAGCTCAAAAGCTTCAACCAAAAAGTCTTCAAGT
ATTTCTTGCATATCTTCCATATTTTACCCCTGTTCTTGAGATTGATGTTT
TTTAATAACCTTTGCAATTTCATTGAAGAAATCGCTAGCGTTAAATTTGA
CAAGATAGCCTTCTCCACCAGCTTCTTGAACACCTTTCTCATTCATAAAT
TCATTTGATAAAGATGAGTTAAAGACTATAGGAATATCTTTAAATCCGGG
ATCTTCTTTAATGCGTGCAGCGGATCCCGGGTACCTGCAGAATTCAGCTG
CGCCCTTTAGTTCCTAAAGGGTTTTTATCAGTGCGACAAACTGGGATTTT
ATTTATTCAGCAAGTCTTGTAATTCATCCAAAAAACGGCAAACATGAAAG
CCGTCACAAACGGCATGATGCACTTGAATCGATAAGGGAATATAGTATTT
TCCGCCCTCCTCATAATACTTCCCAAACGTAAATATCGGCAGTAGATAGT

A BLAST Search
Basic Local Alignment Search Tool
Aimed at finding highly similar sequences in the database.
Let’s see how to submit a sequence query to the GenBank database.

BLAST Search Screen
Enter sequence.
Select database.
Select BLAST type.

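
Before pasting a sequence into the BLAST search screen, it can help to check that the FASTA record is well formed. The following is a minimal sketch, not part of the original session: the function name parse_fasta is hypothetical, only the Python standard library is used, and the record is shortened to the first 100 bases of the example sequence shown earlier.

```python
from collections import Counter


def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples.

    Hypothetical helper for illustration: a line starting with '>' opens a
    new record; all following non-empty lines are joined into its sequence.
    """
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:].strip()
            chunks = []
        elif header is not None:
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Shortened copy of the example record (first two lines of sequence only).
example = """>G08_CHEV11Fed.seq
GTCGACGCGCAAATGGTTCTATATCCATACCAATAGCAGTATCGTTGCCA
TTATCACGAATGGAATTAAGTAAAGTTTTCATTCTATCAATAGACTCTAA
"""

records = parse_fasta(example)
header, sequence = records[0]
base_counts = Counter(sequence)  # how often each base letter occurs

print(header)
print(len(sequence), "bases")
print(dict(base_counts))
```

A quick check like this confirms that the sequence contains only the letters A, T, C and G before it is submitted as a query.
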
BLAST Results I

The Statistics
Guidelines for evaluating stats (data from ‘Introduction to Bioinformatics’, Lesk, A, OUP (2005)):
E ≤ 0.02: sequences probably homologous (i.e. derived from a common ancestor).
E between 0.02 and 1: homology unproven but can’t be ruled out.
E > 1: a match this good would be expected by chance.
Putting the amino acid sequence NELLYTHEELEPHANT into a BLAST protein search produces results! Best match E value = 9

BLAST Results II
Two possible matches.

BLAST Results III
Literature references allow us to go straight to citations in PubMed relevant to the sequence we have found.
Here is the name of the gene!

Evaluating the Data
There are errors in these databases!
Is a BLAST search appropriate?
What is the source of this sequence?
Should I cross reference?
What are the statistics telling me?

Using Accession Numbers
Papers often contain accession numbers.
No database submission = No publication.
Using HTML versions of papers you can link directly to the gene or protein sequence.
Here’s one I made earlier….

Exploring Further
Start with a completely unknown sequence.
Searching for ‘CheV’ in WOS (Web of Science) will not bring up all the relevant papers.
Starting from a DNA sequence you have a new way to search.
‘Having a BLAST with bioinformatics (and avoiding BLASTphemy)’, A. Pertsemlidis and J. W. Fondon III. Genome Biology (2001), 2(10), pp. 1-10

Structure of Entrez
Powerful resource for research.
Entrez is a cross-database search engine.
Records are cross referenced and linked.
Simple ‘one box’ search.
DNA databases
Literature database
Protein databases
Genome projects
Taxonomy databases

Entrez Main Screen

Single Keyword Search
Type keyword into the search box and click ‘GO’.
The number of hits for the search term is shown by each database.
Single keyword searches are limited.
Advanced search techniques refine results and produce fewer irrelevant hits.

Using Boolean Operators
Boolean operators and phrases build complex searches.
Use AND, OR and NOT to join terms.
Chemotaxis AND “Campylobacter jejuni”
Use UPPERCASE for the operators.
A phrase is enclosed in quotation marks.
“Protein glycosylation”

Your Turn!
A little practice using Entrez.
Follow the instructions on the handout.
Shout if you have problems.
10 Minutes

Notes on the Exercise
Using brackets with Boolean operators refines search results.
Care with placing brackets is essential!
The clipboard is helpful for recording results of searches.

Refining Searches and Setting Limits
Within an individual database results may be further refined by setting limits.
The number and type of limits will depend on the database.
Click the ‘limits’ tab from within one of the databases.

Steps in Setting a Limit
1. Select a field to limit the search by.
2. Type in the limiting term in the search box.
3. Select other limiting options, e.g. publication date, database.
4. Hit ‘GO’ to retrieve the results.

Using the History
The history keeps track of previous searches.
You can combine searches and limits quickly and easily.
You can isolate records matching very specific criteria.
A demonstration....

Jumping Between Databases
Records in Entrez are extensively cross linked.
The ‘links’ hyperlink next to each record lets you jump between databases.

Entrez in Summary
We’ve looked at:
Simple and advanced searching.
Accessing and moving between records.
Using the clipboard.
Setting limits.
Using the history.
Sorting results.

Evaluating Entrez I
Advantages
Quickly cross reference many databases.
Elaborate searches can be constructed within each database.
Tools to save and modify searches.
Pools many resources.

Evaluating Entrez II
Disadvantages
Can return many irrelevant results.
Syntax for advanced searching is complicated (many databases = many fields).
Doesn't cover everything!

Summary
Bioinformatics resources help collect, organise and analyse biological data.
Essential resources for biology research.
Bioinformatics databases can be searched in unique ways.
Entrez provides a powerful cross-database searching tool.
Many more resources out there!

And Finally…
Thanks for listening!
Any Questions?

Resources

Search Engines and Software
NCBI BLAST: www.ncbi.nlm.nih.gov/blast/Blast.cgi
Entrez: www.ncbi.nlm.nih.gov/sites/gquery
SRS: Another cross-database search engine for bioinformatics data, similar in principle to Entrez. http://srs.ebi.ac.uk/
EMBOSS Bioinformatics software: A whole suite of free applications for processing many kinds of biological data. http://emboss.sourceforge.net/
ARTEMIS: A free sequence viewer and editor. www.sanger.ac.uk/Software/Artemis/

Sources of Help I
EMBL, DDBJ and NCBI all provide reliable introductory information on bioinformatics. They also have extensive documentation for the databases and bioinformatics tools they support.
Tutorials
Try out the 2can tutorials provided by EMBL: www.ebi.ac.uk/2can/home.html
Entrez Help
The Entrez manual can be viewed on-line or downloaded as a PDF document. www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=helpentrez.chapter.EntrezHelp

Sources of Help II
Subject Guides
Subject librarians have prepared a number of guides to research resources available in a range of scientific fields. www.ouls.ox.ac.uk/rsl/e-resources
Books
A number of books are available through OULS. I’d particularly recommend the following. Search the OLIS catalogue at www.lib.ox.ac.uk/olis/
‘Essential Bioinformatics’ by Jin Xiong (2006), Cambridge University Press.
‘Bioinformatics. Sequence and Genome Analysis, 2nd Edition’ by D. W. Mount (2004), Cold Spring Harbor Laboratory Press.
Courses
Oxford University School of Continuing Education has a bioinformatics programme offering short courses, diplomas and Masters qualifications. http://bioinfomsc.stats.ox.ac.uk/