Sequence Databases, Their Use
and BLAST
Presented By
Dr. Shazzad Hosain
Asst. Prof. EECS, NSU
Why Sequence Databases?
• Electronic databases are fast becoming the
lifeblood of the field
• Because of the power of biomolecular
sequence comparison
• Discoveries based solely on sequence
homology have become routine
The first success story
• Cellular growth factor
– Particular proteins or hormones needed to stimulate
or continue growth of a cell colony.
• By the early 1970s it was understood that certain
viruses could cause particular cells in culture (in
vitro) to grow without bound
• This cancer-like transformation of cultured cells
by viruses suggested that viral infection could be
a cause of cancer in animals, but the mechanisms
were unknown
The first success story
• An oncogene is a gene that is mutated or expressed
at high levels, and thus helps turn a normal cell
into a tumor cell
• It was hypothesized that certain genes in the
infecting viruses (oncogenes) encode cellular
growth factors
• The virus-infected cells would thus produce
uncontrolled quantities of the growth factor,
allowing the cell colony to grow beyond its
normal limits.
The first success story
• The hypothesis is now generally accepted.
• However, the link between oncogenes and
growth factors did not come from a direct test
of this hypothesis.
• Instead it was an unanticipated result of
merging two independent sets of data via a
computer search.
The first success story
• Simian sarcoma virus is a retrovirus that was
known by the early 1970s to cause cancer in a
specific species of monkeys.
• A retrovirus is an RNA virus whose genome must be
converted to DNA in the infected cell before
the virus can replicate
The first success story
• By the early 1970s a retrovirus was known to cause cancer
• Its oncogene, named v-sis, was isolated and
sequenced in 1983
• A partial amino acid sequence of an important
growth factor, platelet-derived growth factor
(PDGF), was published at about the same time
• When the two sequences were compared:
– In one region of 31 amino acids, 26 matched exactly
– In another region of 39 residues, 35 matched exactly
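To put the two match counts on a common scale, they can be expressed as percent identity. The short sketch below only redoes that arithmetic; the function name `percent_identity` is made up for illustration and is not part of any sequence analysis tool.

```python
def percent_identity(matches, region_length):
    """Return exact matches as a percentage of the compared region."""
    if region_length <= 0:
        raise ValueError("region_length must be positive")
    return 100.0 * matches / region_length

# (exact matches, region length) pairs as reported for v-sis vs. PDGF
regions = [(26, 31), (35, 39)]
for matches, length in regions:
    print(f"{matches}/{length} = {percent_identity(matches, length):.1f}% identity")
```

Both regions come out at roughly 84% and 90% identity, far more than would be expected between unrelated proteins.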
A More Recent Story
• The first complete DNA sequence of a free-living organism
was reported in 1995
• A total of 1,743 putative coding regions were identified
• Each of these 1,743 strings was then translated into one
or more protein sequences (depending on the reading frame)
• These protein sequences were searched for
“sufficiently similar” sequences in the protein
sequence database Swiss-Prot.
• In this way, 1,007 of the putative genes not only
matched entries in the database, but matched in such
an unambiguous manner that the specific biochemical
function could be deduced for each one.
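The step "translated depending on reading frames" can be sketched as follows. This is a minimal illustration assuming the standard genetic code and an uppercase A/C/G/T string; the function names are invented here, and the original study's actual pipeline is not described on the slide.

```python
from itertools import product

# Standard genetic code, listed with bases in T, C, A, G order
# (first base varies slowest, third base fastest).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def reverse_complement(dna):
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def translate(dna, frame=0):
    """Translate one reading frame (0, 1 or 2); unknown codons become 'X'."""
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(c, "X") for c in codons)

def six_frame_translations(dna):
    """Translate all three frames on both strands."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    frames = {}
    for f in range(3):
        frames[f"+{f + 1}"] = translate(dna, f)
        frames[f"-{f + 1}"] = translate(rc, f)
    return frames

for name, protein in six_frame_translations("ATGGCCTAA").items():
    print(name, protein)
```

Each translated frame is a candidate protein string that could then be used as a query against a protein database.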
Indirect Applications of Database Search
• Clustering similar sequences into sequence
families
– Such families may reveal important conserved
biological phenomena that had not been observed
by laboratory work and that would be hard to
recognize by looking at two sequences alone.
• Database search also offers clever ways to tackle
biological and biotechnological problems, such as
– Sequence assembly in bacteria
– Multiple sclerosis and database search
Multiple Sclerosis and Database Search
A disease that is not well understood
Multiple Sclerosis and Database Search
• Multiple Sclerosis (MS) is an autoimmune disease
– Meaning that the immune system incorrectly identifies
native cells as foreign invaders
• The first line of attack in the immune system is the
T-cells, which identify foreign matter.
– Once identified, other elements of the immune system
attack and destroy the identified matter
Multiple Sclerosis and Database Search
• Recently, specific T-cells were found that identify
proteins or protein segments that appear on the surface
of myelin cells
• It is natural to hypothesize that those T-cells
(that mistakenly identify proteins on the myelin surface
as foreign) had previously been generated by the
immune system to (correctly) identify similar proteins on
the surface of bacteria or viruses.
• Put another way, some bacteria or viruses have proteins on
their outer surface that are very similar to myelin surface proteins
• How do you test which bacteria or viruses are involved?
Multiple Sclerosis and Database Search
• Myelin surface proteins were sequenced
• A search was conducted in the protein
databases for highly similar proteins in
bacteria and viruses
• About 100 proteins were found
• Laboratory work then verified that the specific
T-cells that attack the myelin sheath also attack
particular proteins found by the database
search
Multiple Sclerosis and Database Search
• So the database search not only confirmed the
hypothesis, but also identified the particular
bacterial and viral proteins that are confused with
proteins on the myelin surface
• The hope is that by examining the similarities
among those bacterial and viral protein
sequences one might better understand what
features on the myelin surface proteins are used
by the T-cells to mistakenly identify myelin cells
as foreign.
Biological Databases
• Over 1000 biological databases
• Vary in size, quality, coverage, level of interest
• Many of the major ones are covered in the annual Database Issue of
Nucleic Acids Research
• What makes a good database?
– comprehensiveness
– accuracy
– is up to date
– good interface
– batch search/download
– API (web services, DAS, etc.)
GenBank, the Granddaddy
• Stores and facilitates retrieval of all DNA sequences ever
made public
• It is now maintained by the National Center for
Biotechnology Information (NCBI) at the National
Library of Medicine (NLM), which is part of the
National Institutes of Health (NIH), USA
• The European counterpart is the EMBL Data Library
• The DNA DataBase of Japan (DDBJ) is the Japanese counterpart
• A fourth is the Genome Sequence DataBase (GSDB)
– These four databases share information between them
– Submission to one is effectively a submission to all
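Database, host institution and retrieval system:
• GenBank (NCBI, NIH): submissions and updates; retrieval via Entrez
• EMBL (EBI): submissions and updates; retrieval via SRS
• DDBJ (CIB, NIG): submissions and updates; retrieval via getentry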
“Ten Important Bioinformatics Databases”
Database     URL                     Content
GenBank      www.ncbi.nlm.nih.gov    nucleotide sequences
Ensembl      www.ensembl.org         human/mouse genome (and others)
PubMed       www.ncbi.nlm.nih.gov    literature references
NR           www.ncbi.nlm.nih.gov    protein sequences
SWISS-PROT   www.expasy.ch           protein sequences
InterPro     www.ebi.ac.uk           protein domains
OMIM         www.ncbi.nlm.nih.gov    genetic diseases
Enzymes      www.chem.qmul.ac.uk     enzymes
PDB          www.rcsb.org/pdb/       protein structures
KEGG         www.genome.ad.jp        metabolic pathways
Source: Bioinformatics for Dummies
Types of Database
• Sequence and Bibliographic
• Genomic
• Clinical and Mutation
• Homologies
• Integrated
Most databases
• are accessible from a web page
• are interlinked
Sequence Databases
Main nucleic acid sequence databases
• EMBL
• GenBank
• DDBJ
Main protein sequence databases
• Swiss-Prot
• also TrEMBL, GenPept
Often integrated with other databases
Integrating Sequence and Bibliographic
Databases
Entrez
• Links nucleic acid sequences, protein
sequences and MEDLINE
• Powerful and easy to use
SRS = Sequence Retrieval System
• Universal system for searching sequence
and other databases
• Available worldwide including at HGMP
(Human Genome Mapping Project)
Genomic Databases
GDB = Human Genome Database
• Repository for mapping and genomic data
for the Human Genome Project
• Powerful; links to other databases
ACeDB
• Developed to provide access to C. elegans
data
Clinical and Mutation Databases
OMIM
• Online Mendelian Inheritance in Man
• Database of disease-linked genes and associated
phenotypes
• Links to Entrez, GDB and other databases
HGMD
• Database of sequences and phenotypes of
disease-causing mutations
Disease-specific mutation databases
Homology Databases
MGI
• Mapping and gene expression data for the mouse
• Human homologies
• Links to GDB and Entrez
Many other organism-specific databases
• can be used to search for homologs of human genes
An Integrated Database
GeneCards
• Integrated resource of information on human
genes and their products
• Major emphasis on human disease
• Links to many kinds of biomedical information
• Sequence databases
• OMIM, HGMD, MDB
• Doctors’ Guide to the Internet
Databases
• Primary (archival)
– GenBank/EMBL/DDBJ
– UniProt
– PDB
– Medline (PubMed)
– BIND
• Secondary (curated)
– RefSeq
– Taxon
– UniProt
– OMIM
– SGD
NCBI (National Center for Biotechnology Information)
• Over 30 databases including GenBank,
PubMed, OMIM, and GEO
• Access all NCBI resources via Entrez
(www.ncbi.nlm.nih.gov/Entrez/)
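Besides the Entrez web pages, NCBI offers programmatic access through its E-utilities service. The slides do not cover it, so the sketch below is an addition: it only builds an `efetch` request URL and does not contact the network. The helper name `efetch_url` is hypothetical and the accession is just an example identifier.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities service
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def efetch_url(db, accession, rettype="fasta", retmode="text"):
    """Build an efetch URL for one record; no request is sent here."""
    params = {"db": db, "id": accession, "rettype": rettype, "retmode": retmode}
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"

# Example accession, chosen only for illustration
url = efetch_url("nucleotide", "NM_000546")
print(url)
```

Opening the printed URL in a browser, or fetching it with any HTTP client, should return the record in FASTA format, subject to NCBI's usage guidelines.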
www.ncbi.nlm.nih.gov/GenBank
GenBank® is the NIH genetic
sequence database, an annotated
collection of all publicly available
DNA sequences. There are
approximately 65,369,091,950
bases in 61,132,599 sequence
records in the traditional GenBank
divisions and 80,369,977,826
bases in 17,960,667 sequence
records in the WGS division as of
August 2006.
www.ncbi.nlm.nih.gov/GenBank
The Reference Sequence (RefSeq) database is
a non-redundant collection of richly annotated
DNA, RNA, and protein sequences from diverse
taxa. Each RefSeq represents a single, naturally
occurring molecule from one organism. The goal
is to provide a comprehensive, standard dataset
that represents sequence information for a
species. It should be noted, though, that RefSeq
has been built using data from public archival
databases only.
RefSeq biological sequences (also known as
RefSeqs) are derived from GenBank records
but differ in that each RefSeq is a synthesis of
information, not an archived unit of primary
research data. Similar to a review article in the
literature, a RefSeq represents the consolidation
of information by a particular group at a
particular time.
Microarray data are stored in GEO (NCBI) and ArrayExpress (EBI)
The MOD squad
• Most model organism communities have established organism-specific
Model Organism Databases (MODs)
• Many of these databases have different schemas and implementations,
although there is movement toward harmonizing many features via the
Generic Model Organism Database project.
The MOD squad
SGD: yeast (www.yeastgenome.org)
WormBase: C. elegans (www.wormbase.org)
FlyBase: Drosophila (flybase.bio.indiana.edu)
ZFIN: zebrafish (zfin.org)
and many others (Xenopus, Dictyostelium, Arabidopsis…)
The MOD squad: what about Homo sapiens?
There is no true “model organism” database for human. The two main
sources of genome information that have evolved are the UCSC Genome
Browser and Ensembl.
• Ensembl: www.ensembl.org
• UCSC Genome Browser: genome.ucsc.edu
UCSC Browser
Ensembl
Protein Data Bank (PDB)
[Figure: PDB holdings, total and yearly]
Real Sequence Database Search
• FASTA – “FAST-All”, pronounced “fast-AY”
• BLAST – Basic Local Alignment Search Tool
• Two perspectives for studying these tools
– Algorithmic
• Which algorithms and weight matrices do they use?
– Technical
• How are they used?
• How should the search results be interpreted?
• How can the parameters be tuned to get meaningful
results, and so on?
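For a feel of the algorithmic perspective, the seed-and-extend idea behind BLAST-style searches can be sketched as follows. This is a deliberately simplified illustration and not BLAST itself: it seeds on exact length-k words and extends only while characters match exactly, whereas the real tool scores neighborhood words and extensions with a weight matrix. All function names and the toy sequences are invented for this example.

```python
from collections import defaultdict

def index_words(subject, k):
    """Map every length-k word in the subject to its start positions."""
    index = defaultdict(list)
    for i in range(len(subject) - k + 1):
        index[subject[i:i + k]].append(i)
    return index

def extend_hit(query, subject, qi, si, k):
    """Extend an exact word match left and right while characters agree."""
    left = 0
    while (qi - left > 0 and si - left > 0
           and query[qi - left - 1] == subject[si - left - 1]):
        left += 1
    right = k
    while (qi + right < len(query) and si + right < len(subject)
           and query[qi + right] == subject[si + right]):
        right += 1
    # (query start, subject start, length of the ungapped match)
    return qi - left, si - left, left + right

def seed_and_extend(query, subject, k=3):
    """Find shared words (seeds), extend each, and report longest first."""
    index = index_words(subject, k)
    hits = set()
    for qi in range(len(query) - k + 1):
        for si in index.get(query[qi:qi + k], []):
            hits.add(extend_hit(query, subject, qi, si, k))
    return sorted(hits, key=lambda h: -h[2])

# Toy strings, invented for this example
print(seed_and_extend("XXPDGFYY", "QQQPDGFZZ"))
```

Indexing words first is what makes this approach fast: only positions that share a word are examined, instead of aligning the query against every position of every database sequence.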
BLAST
• We will focus on the technical perspective
• Assignment 2
– Download and run BLAST
– Prepare a report
• What is BLAST
• Different types of BLAST, their insights
• Search result analysis and so on.
– Submission deadline, August 07
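For the search result analysis, one possible starting point is to parse tabular BLAST output and keep only the hits below an E-value cutoff. The sketch assumes the command-line BLAST+ programs were run with `-outfmt 6`, whose default twelve columns are listed in the code; the assignment does not prescribe this format. The sample rows and the cutoff are invented for illustration.

```python
import csv
import io

# Default column order of BLAST+ tabular output (-outfmt 6)
BLAST_TAB_FIELDS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
                    "qstart", "qend", "sstart", "send", "evalue", "bitscore"]

def parse_blast_tabular(handle):
    """Yield one dict per hit from tab-separated BLAST output."""
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(BLAST_TAB_FIELDS, row))
        hit["pident"] = float(hit["pident"])
        hit["length"] = int(hit["length"])
        hit["evalue"] = float(hit["evalue"])
        hit["bitscore"] = float(hit["bitscore"])
        yield hit

def significant_hits(hits, max_evalue=1e-5):
    """Keep hits at or below the E-value cutoff, best (smallest) first."""
    kept = (h for h in hits if h["evalue"] <= max_evalue)
    return sorted(kept, key=lambda h: h["evalue"])

# Made-up sample rows, not real search results
sample = (
    "query1\tsubjA\t83.87\t31\t5\t0\t1\t31\t10\t40\t2e-12\t62.4\n"
    "query1\tsubjB\t35.00\t40\t26\t0\t5\t44\t7\t46\t0.8\t21.6\n"
)
for hit in significant_hits(parse_blast_tabular(io.StringIO(sample))):
    print(hit["sseqid"], hit["pident"], hit["evalue"])
```

A smaller E-value means a match of that quality is less likely to arise by chance, so the cutoff is one of the parameters worth discussing in the report.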
Reference
• Chapter 15, Algorithms on Strings, Trees and
Sequences – by Dan Gusfield