Ch. 24 Bioinformatics

Bioinformatics is the application of computer science to the field of molecular biology.

Outline
1. Basics of bioinformatics
   a. Definition
   b. Data overload
   c. Top goals for Bioinformatics
2. NCBI website
   a. Entrez: Database Portal
   b. Databases
      i. Entrez Protein
      ii. Entrez Gene
      iii. OMIM
      iv. UniGene
      v. MeSH
      vi. PubMed
3. Other useful sites
   a. European Bioinformatics Institute and Ensembl
   b. ExPASy Sequence Retrieval System
3. Accession numbers & RefSeq
4. FASTA format
5. Pair-wise Sequence Alignment
   a. BLAST
   b. Similarity, identity, and conservation
   c. Homolog, ortholog, and paralog

1.a Definition

Bioinformatics entails the creation and advancement of databases, algorithms, computational and statistical techniques, and theory to solve formal and practical problems arising from the management and analysis of biological data.

1.b Data Overload

Over the past few decades, rapid developments in genomic and other molecular research technologies, together with developments in information technologies, have produced a tremendous amount of information related to molecular biology. Since the mid-nineties the amount of published data has been increasing exponentially, creating a real problem of accessibility. Making this information accessible through online databases became paramount for many reasons, including avoiding the repetition of experiments.

At the beginning of the "genomic revolution", bioinformatics was applied to the creation and maintenance of databases that store biological information such as nucleotide and amino acid sequences. Development of this type of database involved not only design issues but also the development of complex interfaces through which researchers could both access existing data and submit new or revised data. Figure 1.b.1 shows the individual databases, each of which manages data for a specific area of study.
The important thing to note is that molecular biology is very complex and many areas of study are interrelated. A key role of bioinformatics is to create a usable interface that combines many of the databases on one easy-to-use platform. NCBI provides such a platform in Entrez.

Figure 1.b.1

1.c Ten Problems facing Bioinformatics

A survey of bioinformatics such as this one only touches on the difficult and multifaceted nature of this interface-driven field. Below is a list of obstacles, spanning a number of fields of molecular biology, that limit the overarching goals of bioinformatics.

[1] Precise models of where and when transcription will occur in a genome (initiation and termination)
[2] Precise, predictive models of alternative RNA splicing
[3] Precise models of signal transduction pathways
[4] Determining protein:DNA, protein:RNA, and protein:protein recognition codes
[5] Accurate ab initio protein structure prediction
[6] Rational design of small molecule inhibitors of proteins
[7] Mechanistic understanding of protein evolution
[8] Mechanistic understanding of speciation
[9] Development of effective gene ontologies: systematic ways to describe gene and protein function
[10] Education: development of bioinformatics curricula

2. NCBI – National Center for Biotechnology Information

URL: http://www.ncbi.nlm.nih.gov

The NCBI houses genome sequencing data in GenBank and an index of biomedical research articles in PubMed Central and PubMed, as well as other information relevant to biotechnology. All these databases are available online through the Entrez search engine. The NCBI is part of the National Library of Medicine.

2.a Entrez: Database Portal

URL: http://www.ncbi.nlm.nih.gov/gquery/

Entrez is the integrated, text-based search and retrieval system used at NCBI for the major databases, including PubMed, Nucleotide and Protein Sequences, Protein Structures, Complete Genomes, Taxonomy, and others.
Figure 2.a.1 is a schematic diagram of all the databases and of how many articles, sequences, etc. are contained in each.

Figure 2.a.1

The Entrez front page provides, by default, access to the global query. All databases indexed by Entrez can be searched via a single query string, which supports boolean operators and search term tags that limit parts of the search statement to particular fields. The search returns a unified results page that shows the number of hits in each of the databases; each count is also a link to the actual search results for that particular database.

Figure 2.a.2

Entrez also provides a similar interface for searching each particular database and for refining search results. The Limits feature allows the user to narrow a search through a web forms interface. The History feature gives a numbered list of recently performed queries. Results of previous queries can be referred to by number and combined via boolean operators. Search results can be saved temporarily in a Clipboard. Users with a MyNCBI account can save queries indefinitely and, for most databases, can also choose to have new results for saved queries e-mailed to them. Entrez is widely used in the field of biotechnology, including by students worldwide.

The following list of databases is by no means exhaustive; it is intended as a guide to some of the more useful areas of Entrez.

2.a.i Entrez Protein

In this database you can search for particular proteins and retrieve amino acid sequence information. The protein entries in this Entrez database have been compiled from a variety of sources, including SwissProt, PIR, PRF, PDB, and translations from annotated coding regions in GenBank and RefSeq.

2.a.ii Entrez Gene

In this database you can find information relating to genes.
It does not include all known or predicted genes; instead Entrez Gene focuses on the genomes that have been completely sequenced, that have an active research community to contribute gene-specific information, or that are scheduled for intense sequence analysis. The content of Entrez Gene represents the result of curation and automated integration of data from NCBI's Reference Sequence project (RefSeq), from collaborating model organism databases, and from many other databases available from NCBI. Records are assigned unique, stable, and tracked integers as identifiers. The content (nomenclature, map location, gene products and their attributes, markers, phenotypes, and links to citations, sequences, variation details, maps, expression, homologs, protein domains, and external databases) is updated as new information becomes available.

2.a.iii OMIM (Online Mendelian Inheritance in Man)

OMIM is a comprehensive, authoritative, and timely compendium of human genes and genetic phenotypes. The full-text, referenced overviews in OMIM contain information on all known Mendelian disorders and over 12,000 genes. OMIM focuses on the relationship between phenotype and genotype. It is updated daily, and the entries contain copious links to other genetics resources. Each entry is identified by a MIM number.

2.a.iv UniGene

UniGene is an NCBI database of the transcriptome and thus, despite the name, not primarily a database for genes. Each entry is a set of transcripts that appear to stem from the same transcription locus (i.e. gene or expressed pseudogene). The type of cluster that UniGene tries to catalog can be seen in Figure 2.a.3 below. Information on protein similarities, gene expression, cDNA clones, and genomic location is included with each entry.

Figure 2.a.3

2.a.v MeSH

The MeSH Browser is an online vocabulary look-up aid available for use with MeSH (Medical Subject Headings).
It is designed to help quickly locate descriptors of possible interest and to show the hierarchy in which those descriptors appear. Virtually complete MeSH records are available, including the scope notes, annotations, entry vocabulary, history notes, allowable qualifiers, etc. The browser does not link directly to MEDLINE or any other database retrieval system and thus is not a substitute for the PubMed system.

2.a.vi PubMed

PubMed provides access to the MEDLINE database of citations, abstracts, and some full-text articles on life sciences and biomedical topics. For comprehensive, optimal searching in PubMed, it is necessary to have a thorough understanding of its core component, MEDLINE, and especially of the MeSH controlled vocabulary used to index MEDLINE articles.

3. Other websites

3.a European Bioinformatics Institute and Ensembl

URL: http://uswest.ensembl.org/index.html

The Ensembl project was started in 1999, some years before the draft human genome was completed. Even at that early stage it was clear that manual annotation of 3 billion base pairs of sequence would not be able to offer researchers timely access to the latest data. The goal of Ensembl was therefore to annotate the genome automatically, integrate this annotation with other available biological data, and make all of this publicly available via the web. Since the website's launch in July 2000, many more genomes have been added to Ensembl, and the range of available data has expanded to include comparative genomics, variation, and regulatory data. Figure 3.a.1 below shows all the complete genomes that can be found on Ensembl.

Figure 3.a.1

3.b ExPASy Sequence Retrieval System

URL: http://www.expasy.org/

ExPASy (Expert Protein Analysis System) is a proteomics server of the Swiss Institute of Bioinformatics (SIB) that analyzes protein sequences and structures and two-dimensional gel electrophoresis (2-D PAGE).
The server functions in collaboration with the European Bioinformatics Institute. ExPASy also produces the protein sequence knowledgebase, UniProtKB/Swiss-Prot, and its computer-annotated supplement, UniProtKB/TrEMBL.

4 - Accession numbers and RefSeq

An accession number in bioinformatics is a unique identifier given to a DNA or protein sequence record. It allows different versions of that record, and of the associated sequence, to be tracked over time in a single data repository. Because of their relative stability, accession numbers can be used as foreign keys for referring to a sequence object, but not necessarily to a unique sequence. All sequence information repositories implement the concept of an "accession number" but might do so with subtle variations.

RefSeq (accessible via the main page of NCBI) provides an expertly curated accession number that corresponds to the most stable, agreed-upon "reference" version of a sequence.

4 FASTA format

In bioinformatics, FASTA format (a.k.a. Pearson format) is a text-based format for representing either nucleotide sequences or peptide sequences, in which nucleotides or amino acids are represented using single-letter codes. The format also allows for sequence names and comments to precede the sequences.

A sequence in FASTA format is represented as a series of lines, which should be no longer than 120 characters and usually do not exceed 80 characters. This limit was probably meant to allow for preallocation of fixed line sizes in software: at the time, most users relied on DEC VT (or compatible) terminals, which could display 80 or 132 characters per line. Most users preferred the larger font of the 80-character mode, so it became the recommended practice to use 80 characters or fewer (often 70) in FASTA lines.

A sequence in FASTA format begins with a single-line description, followed by lines of sequence data.
The description line is distinguished from the sequence data by a greater-than (">") symbol in the first column. The word following the ">" symbol is the identifier of the sequence, and the rest of the line is the description (both are optional). There should be no space between the ">" and the first letter of the identifier. It is recommended that all lines of text be shorter than 80 characters. The sequence ends when another line starting with ">" appears; this indicates the start of another sequence. A typical FASTA format sequence can be seen in Figure 4.1.

Figure 4.1

5 Pair-wise sequence alignment

In bioinformatics, a sequence alignment is a way of arranging sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences. Aligned sequences of nucleotide or amino acid residues are typically represented as rows within a matrix. Gaps are inserted between the residues so that identical or similar characters are aligned in successive columns.

Pair-wise sequence alignment methods are used to find the best-matching piecewise (local) or global alignments of two query sequences. Pair-wise alignments can only be used between two sequences at a time, but they are efficient to calculate and are often used for methods that do not require extreme precision (such as searching a database for sequences with high similarity to a query). The three primary methods of producing pair-wise alignments are dot-matrix methods, dynamic programming, and word methods; however, multiple sequence alignment techniques can also align pairs of sequences. Although each method has its individual strengths and weaknesses, all three pair-wise methods have difficulty with highly repetitive sequences of low information content, especially where the number of repetitions differs between the two sequences to be aligned.
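As an illustration of the dynamic programming method mentioned above, the following is a minimal sketch of a global pair-wise alignment (the Needleman-Wunsch algorithm). The function name `global_align` and the scoring values (match +1, mismatch -1, gap -2) are assumptions chosen for illustration; they are not taken from this chapter, and real tools use substitution matrices and more refined gap penalties.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment of two sequences by dynamic programming.

    The scoring values are illustrative assumptions, not values from the text.
    Returns (score, aligned_a, aligned_b), with "-" marking a gap.
    """
    rows, cols = len(a) + 1, len(b) + 1

    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap          # a[:i] aligned against gaps only
    for j in range(1, cols):
        score[0][j] = j * gap          # b[:j] aligned against gaps only

    def pair_score(x, y):
        return match if x == y else mismatch

    # Fill the matrix: each cell extends the best of three neighbours.
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + pair_score(a[i - 1], b[j - 1]),  # align pair
                score[i - 1][j] + gap,                                 # gap in b
                score[i][j - 1] + gap,                                 # gap in a
            )

    # Trace back from the bottom-right corner to recover one best alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                score[i][j] == score[i - 1][j - 1] + pair_score(a[i - 1], b[j - 1])):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1

    return score[len(a)][len(b)], "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    print(global_align("ACGT", "AGT"))  # -> (1, 'ACGT', 'A-GT')
```

With these assumed scores, aligning ACGT against AGT places one gap opposite the C, giving three matches and one gap. Changing the gap penalty changes which alignment is considered best, which is why such parameters are exposed when an alignment is run.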
One way of quantifying the utility of a given pair-wise alignment is the 'maximum unique match' (MUM), or the longest subsequence that occurs in both query sequences. Longer MUM sequences typically reflect closer relatedness.

5.1 BLAST

A BLAST search enables a researcher to compare a query sequence with a library or database of sequences, and to identify library sequences that resemble the query sequence above a certain threshold. For example, following the discovery of a previously unknown gene in the mouse, a scientist will typically perform a BLAST search of the human genome to see if humans carry a similar gene; BLAST will identify sequences in the human genome that resemble the mouse gene based on similarity of sequence.

Input: a sequence in FASTA format.

Output: when performing a BLAST on NCBI, the results are given as a graphic showing the hits found, a table showing sequence identifiers for the hits with scoring-related data, and alignments of the sequence of interest against the hits, with the corresponding BLAST scores. The easiest to read and most informative of these is probably the table.

Figure 5.1.1 shows the different types of basic BLAST searches that NCBI offers. There are also specialized BLAST searches that use more finely tuned algorithms.

Figure 5.1.1

To run a BLAST sequence alignment query you have to enter two FASTA sequences, set the gap parameters, set the penalty for a mismatch, and, most importantly, choose the correct algorithm on which the search will be based. There is a series of algorithms to choose from, depending on the size of the sequences and the required accuracy of the comparison. The NCBI website offers quick tips on which BLAST search to run. Figure 5.1.2 shows the typical BLAST page on the NCBI website.

Figure 5.1.2

The image below (Figure 5.1.3) is a typical result of a BLAST sequence analysis. The first line shows the first sequence entered and the second line shows the second.
Whenever there is an exact match between two elements of the strings, a line is drawn between them. When there is not an exact match, a system of one and two dots signifies how closely the replacement amino acid or nucleotide matches the physicochemical profile of the first. One dot between the pair signifies a somewhat similar profile, while two dots signify a very similar profile.

Figure 5.1.3

The dot-based system of similarity, as mentioned before, is based on how closely the sequence matches the original physicochemical profile. This is what is called conservation. Similarity is therefore ultimately based on the extent to which the two sequences are identical and on the conservation of physicochemical properties.

Figure 5.1.4

5.2 Homologs, Paralogs, and Orthologs

The relative similarity between the sequence of a certain protein or gene in one animal and that in another is very important information. Similarity between two sequences can come about through various means. The similarity could come from the fact that the sequences are homologs, in that they share a common ancestor. The similarity could also arise from the two sequences being orthologs, in that the homologous sequences in different species arose from a common ancestral gene during speciation. In this case, similarity in sequence does not necessitate similarity in function. Finally, the similarity could come about because the two sequences are paralogs. In this case, homologous sequences in one species arise from gene duplication. Figure 5.2.1 below shows schematically how these three origins of similarity are related.

Figure 5.2.1
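Returning to the notation of section 5.1 (a line for an exact match, two dots for a very similar residue, one dot for a somewhat similar one), the sketch below builds such a symbol line for two already-aligned protein rows and derives percent identity and percent similarity from it. The residue groupings `STRONG_GROUPS` and `WEAK_GROUPS` are hypothetical and chosen only for illustration; alignment tools typically derive similarity from a substitution matrix. Dividing by the full alignment length, gaps included, is also an assumption, since tools differ on the denominator.

```python
# Hypothetical groupings of amino acids by physicochemical similarity.
# These are illustrative assumptions, not the groups used by any specific tool.
STRONG_GROUPS = [set("ILVM"), set("FYW"), set("KRH"), set("DE"), set("ST"), set("NQ")]
WEAK_GROUPS = [set("ILVMFYW"), set("KRHDENQ"), set("STAG")]


def match_line(row_a, row_b):
    """Return the symbol line drawn between two aligned rows.

    "|" = identical residues, ":" = very similar, "." = somewhat similar,
    " " = dissimilar residues or a gap ("-") in either row.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    symbols = []
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            symbols.append(" ")
        elif x == y:
            symbols.append("|")
        elif any(x in g and y in g for g in STRONG_GROUPS):
            symbols.append(":")
        elif any(x in g and y in g for g in WEAK_GROUPS):
            symbols.append(".")
        else:
            symbols.append(" ")
    return "".join(symbols)


def percent_identity(row_a, row_b):
    """Identical columns as a percentage of alignment length (gaps included)."""
    line = match_line(row_a, row_b)
    return 100.0 * line.count("|") / len(line) if line else 0.0


def percent_similarity(row_a, row_b):
    """Identical or similar columns as a percentage of alignment length."""
    line = match_line(row_a, row_b)
    similar = sum(line.count(s) for s in "|:.")
    return 100.0 * similar / len(line) if line else 0.0


if __name__ == "__main__":
    a, b = "KLVE-A", "RLIDQA"
    print(a)
    print(match_line(a, b))
    print(b)
    print(round(percent_identity(a, b), 1), round(percent_similarity(a, b), 1))
```

Identity counts only exact matches, while similarity also counts conserved substitutions, so similarity is never lower than identity for the same alignment.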