Sequence Databases, Their Use and BLAST
Presented by Dr. Shazzad Hosain, Asst. Prof., EECS, NSU

Why Sequence Databases?
• Electronic databases are fast becoming the lifeblood of the field
• This is because of the power of biomolecular sequence comparison
• Discoveries based solely on sequence homology have become routine

The first success story
• Cellular growth factors
  – Particular proteins or hormones needed to stimulate or sustain the growth of a cell colony
• By the early 1970s it was understood that certain viruses could cause particular cells in culture (in vitro) to grow without bound
• This cancer-like transformation of cultured cells by viruses suggested that viral infection could be a cause of cancer in animals, but the mechanisms were unknown

• An oncogene is a gene that is mutated or expressed at high levels, and thus helps turn a normal cell into a tumor cell
• It was hypothesized that certain genes in the infecting viruses (oncogenes) encode cellular growth factors
• The virus-infected cells would thus produce uncontrolled quantities of the growth factor, allowing the cell colony to grow beyond its normal limits

• The hypothesis is now generally accepted
• However, the link between oncogenes and growth factors did not come from a direct test of this hypothesis
• Instead, it was an unanticipated result of merging two independent sets of data via a computer search

• Simian sarcoma virus is a retrovirus that was known by the early 1970s to cause cancer in a specific species of monkeys
• A retrovirus is an RNA virus that must be converted to DNA in the infected cell before the virus can replicate

• By the early 1970s, a retrovirus was known to cause cancer
• Its oncogene, named v-sis, was isolated and sequenced in 1983
• A partial amino acid sequence of an important growth factor, platelet-derived growth factor (PDGF), was published at about the same time
• When the two sequences were compared:
  – In one region of 31 amino acids, there were 26 exact matches
  – In another region of 39 residues, there were 35 exact matches

A More Recent Story
• The first complete DNA sequence of a free-living organism was reported in 1995
• A total of 1,743 putative coding regions were identified
• Each of these 1,743 strings was then translated to one or more proteins (depending on reading frames)
• These protein sequences were searched for “sufficiently similar” sequences in the protein sequence database Swiss-Prot
• In this way, 1,007 of the putative genes not only matched entries in the database, but matched in such an unambiguous manner that the specific biochemical function could be deduced for each one

Indirect Applications of Database Search
• Clustering similar sequences into sequence families
  – Such families may reveal important conserved biological phenomena that had not been observed by laboratory work and that would be hard to recognize by looking at two sequences alone
• Database search has also been used in many clever ways to tackle both biological and biotechnical problems, such as
  – Sequence assembly in bacteria
  – Multiple sclerosis and database search

Multiple Sclerosis and Database Search
• A disease that is not well understood

• Multiple sclerosis (MS) is an autoimmune disease
  – Meaning that the immune system incorrectly identifies native cells as foreign invaders
• The first line of attack in the immune system is the T-cells, which identify foreign matter
  – Once identified, other elements of the immune system attack and destroy the identified matter

• Recently, specific T-cells were found that identify proteins or protein segments that appear on the surface of myelin cells
• It is natural to hypothesize that those T-cells (which mistakenly identify proteins on the myelin surface as foreign) had previously been generated by the immune system to (correctly) identify similar proteins on the surface of bacteria or viruses
• Put another way, some bacteria or viruses have proteins on their outer surface that are very similar to myelin surface proteins
• How do you test which bacteria or viruses are involved?

• Myelin surface proteins were sequenced
• A search was conducted in the protein databases for highly similar proteins in bacteria and viruses
• About 100 proteins were found
• Laboratory work then verified that the specific T-cells that attack the myelin sheath also attack particular proteins found by the database search

• So the database search not only confirmed the hypothesis, but also identified the particular bacterial and viral proteins that are confused with proteins on the myelin surface
• The hope is that by examining the similarities among those bacterial and viral protein sequences, one might better understand which features of the myelin surface proteins the T-cells use to mistakenly identify myelin cells as foreign

Biological Databases
• Over 1000 biological databases
• Vary in size, quality, coverage, level of interest
• Many of the major ones are covered in the annual Database Issue of Nucleic Acids Research
• What makes a good database?
  – comprehensiveness
  – accuracy
  – being up to date
  – good interface
  – batch search/download
  – API (web services, DAS, etc.)
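
The comparisons in the stories above (v-sis against PDGF, myelin surface proteins against bacterial and viral proteins) rest on some measure of how similar two aligned regions are. As a minimal sketch, and only as an illustration, the simplest such measure is percent identity over a gap-free aligned segment. The function name `percent_identity` and the toy segments are hypothetical; real database searches use scoring matrices and allow gaps, which this sketch ignores.

```python
def percent_identity(a: str, b: str) -> float:
    """Percentage of positions at which two equal-length, gap-free segments match."""
    if len(a) != len(b):
        raise ValueError("segments must have equal length")
    if not a:
        raise ValueError("segments must be non-empty")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


# Toy segments, made up for illustration (not the real v-sis or PDGF sequences).
print(percent_identity("MKTAYIAKQR", "MKTAYLAKQR"))

# The match counts reported for v-sis versus PDGF, expressed as percentages.
for matches, length in [(26, 31), (35, 39)]:
    print(f"{matches}/{length} = {100.0 * matches / length:.1f}%")
```

Expressed this way, the two reported regions are roughly 84% and 90% identical, which is far higher than would be expected for unrelated proteins.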

GenBank, the Granddaddy
• Stores and facilitates retrieval of all DNA sequences ever made public
• It is now maintained by the National Center for Biotechnology Information (NCBI) at the National Library of Medicine (NLM), which is part of the National Institutes of Health (NIH), USA
• The European counterpart is the EMBL Data Library
• The DNA DataBase of Japan (DDBJ) is maintained in Japan
• There is also the Genome Sequence DataBase (GSDB)
  – These four databases share information among themselves
  – Submission to one is effectively a submission to all

[Figure: data exchange between GenBank (NCBI, NIH), EMBL (EBI) and DDBJ (CIB, NIG), each receiving submissions and updates; access tools shown: Entrez, SRS, getentry]

“Ten Important Bioinformatics Databases”
Database      URL                     Content
GenBank       www.ncbi.nlm.nih.gov    nucleotide sequences
Ensembl       www.ensembl.org         human/mouse genome (and others)
PubMed        www.ncbi.nlm.nih.gov    literature references
NR            www.ncbi.nlm.nih.gov    protein sequences
SWISS-PROT    www.expasy.ch           protein sequences
InterPro      www.ebi.ac.uk           protein domains
OMIM          www.ncbi.nlm.nih.gov    genetic diseases
Enzymes       www.chem.qmul.ac.uk     enzymes
PDB           www.rcsb.org/pdb/       protein structures
KEGG          www.genome.ad.jp        metabolic pathways
Source: Bioinformatics for Dummies

Types of Database
• Sequence and Bibliographic
• Genomic
• Clinical and Mutation
• Homologies
• Integrated
Most databases
• are accessible from a web page
• are interlinked

Sequence Databases
Main nucleic acid sequence databases
• EMBL
• GenBank
• DDBJ
Main protein sequence databases
• Swiss-Prot
• also TrEMBL, GenPept
Often integrated with other databases

Integrating Sequence and Bibliographic Databases
Entrez
• Links nucleic acid sequences, protein sequences and MEDLINE
• Powerful and easy to use
SRS = Sequence Retrieval System
• Universal system for searching sequence and other databases
• Available worldwide, including at HGMP (Human Genome Mapping Project)

Genomic Databases
GDB = Human Genome Database
• Repository for mapping and genomic data for the Human Genome Project
• Powerful; links to other
  databases
ACeDB
• Developed to provide access to C. elegans data

Clinical and Mutation Databases
OMIM
• Online Mendelian Inheritance in Man
• Database of disease-linked genes and associated phenotypes
• Links to Entrez, GDB and other databases
HGMD
• Database of sequences and phenotypes of disease-causing mutations
Disease-specific mutation databases

Homology Databases
MGI
• Mapping and gene expression data for the mouse
• Human homologies
• Links to GDB and Entrez
Many other organism-specific databases
• can be used to search for homologs of human genes

An Integrated Database
GeneCards
• Integrated resource of information on human genes and their products
• Major emphasis on human disease
• Links to many kinds of biomedical information
  – Sequence databases
  – OMIM, HGMD, MDB
  – Doctors’ Guide to the Internet

Databases
• Primary (archival)
  – GenBank/EMBL/DDBJ
  – UniProt
  – PDB
  – Medline (PubMed)
  – BIND
• Secondary (curated)
  – RefSeq
  – Taxon
  – UniProt
  – OMIM
  – SGD

NCBI (National Center for Biotechnology Information)
• Over 30 databases including GenBank, PubMed, OMIM, and GEO
• Access all NCBI resources via Entrez (www.ncbi.nlm.nih.gov/Entrez/)

www.ncbi.nlm.nih.gov/GenBank
GenBank® is the NIH genetic sequence database, an annotated collection of all publicly available DNA sequences. As of August 2006, there are approximately 65,369,091,950 bases in 61,132,599 sequence records in the traditional GenBank divisions and 80,369,977,826 bases in 17,960,667 sequence records in the WGS division.

The Reference Sequence (RefSeq) database is a non-redundant collection of richly annotated DNA, RNA, and protein sequences from diverse taxa. Each RefSeq represents a single, naturally occurring molecule from one organism. The goal is to provide a comprehensive, standard dataset that represents sequence information for a species. It should be noted, though, that RefSeq has been built using data from public archival databases only.
RefSeq biological sequences (also known as RefSeqs) are derived from GenBank records but differ in that each RefSeq is a synthesis of information, not an archived unit of primary research data. Similar to a review article in the literature, a RefSeq represents the consolidation of information by a particular group at a particular time.

Microarray data are stored in GEO (NCBI) and ArrayExpress (EBI)

The MOD squad
• Most model organism communities have established organism-specific Model Organism Databases (MODs)
• Many of these databases have different schemas and implementations, although there is movement toward harmonizing many features via the Generic Model Organism Database project

• SGD: yeast (www.yeastgenome.org)
• Wormbase: C. elegans (www.wormbase.org)
• FlyBase: Drosophila (flybase.bio.indiana.edu)
• Zfin: zebrafish (zfin.org)
• and many others (Xenopus, Dictyostelium, Arabidopsis…)

The MOD squad: what about Homo sapiens?
There is no true “model organism” database for human. The two main sources of genome information that have evolved are the UCSC Genome Browser and Ensembl.
• Ensembl: www.ensembl.org
• UCSC: genome.ucsc.edu

[Screenshots: UCSC Browser; Ensembl]

Protein Data Bank (PDB)
[Figure: Protein Data Bank (PDB) holdings, total and yearly]

Real Sequence Database Search
• FASTA – “fast-all”, pronounced “fast-AY”
• BLAST – Basic Local Alignment Search Tool
• Two perspectives for studying these tools
  – Algorithmic
    • Which algorithm and weight matrix do they use?
  – Technical
    • How are they used?
    • How should the search results be interpreted?
    • How can the different parameters be tuned to get meaningful results, and so on?

BLAST
• We will focus on the technical perspective
• Assignment 2
  – Download and run BLAST
  – Prepare a report
    • What is BLAST
    • Different types of BLAST, their insights
    • Search result analysis and so on
  – Submission deadline: August 07

Reference
• Chapter 15, Algorithms on Strings, Trees and Sequences, by Dan Gusfield
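
For the search result analysis part of Assignment 2, the following is a minimal sketch of one way to read and filter BLAST hits. It assumes, and this is not stated in the slides, that BLAST+ was run with tabular output (`-outfmt 6`), whose default columns are listed in `COLUMNS`. The function name `read_hits`, the E-value cutoff of 1e-5, and the sample rows are all hypothetical and made up for illustration; they are not real search results.

```python
import csv
import io

# Default column order of BLAST+ tabular output (-outfmt 6).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def read_hits(handle, max_evalue=1e-5):
    """Return hits with E-value <= max_evalue, best bit score first."""
    hits = []
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(COLUMNS, row))
        for key in ("pident", "evalue", "bitscore"):
            hit[key] = float(hit[key])
        if hit["evalue"] <= max_evalue:
            hits.append(hit)
    return sorted(hits, key=lambda h: h["bitscore"], reverse=True)


# Made-up rows in -outfmt 6 layout, used only to exercise the parser.
SAMPLE = (
    "query1\tsubjA\t83.87\t31\t5\t0\t1\t31\t10\t40\t2e-12\t61.2\n"
    "query1\tsubjB\t35.00\t40\t26\t0\t5\t44\t3\t42\t0.8\t22.1\n"
    "query1\tsubjC\t89.74\t39\t4\t0\t50\t88\t60\t98\t1e-18\t80.5\n"
)

hits = read_hits(io.StringIO(SAMPLE))
for h in hits:
    print(h["sseqid"], h["pident"], h["evalue"], h["bitscore"])
```

With a real results file, `read_hits(open("hits.tsv"))` would be used in place of the in-memory sample, where `hits.tsv` is a placeholder file name. The cutoff is a judgment call: a smaller E-value threshold keeps only hits that are less likely to have arisen by chance.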