bio_DB

Chapter 2
Biological Databases
Databases (DB)
- Database: a computerized archive used to store and organize data in such a way that information can be
retrieved easily via a variety of search criteria
- Record: a single item in the database, also called an entry
- Query: a search that retrieves the whole data record
Types of DB
Flat file format – entries are separated by a delimiter
Database management systems (DBMS) – software programs for organizing, searching, and accessing data
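To make the flat-file idea concrete, here is a minimal Python sketch that splits a text file into entries on a delimiter line and then searches them. The `//` delimiter, the `ID`/`DE` field tags, and both toy records are assumptions chosen for illustration, not data from a real database.

```python
# Toy flat file: the records and the ID/DE tags are illustrative only.
FLAT_FILE = """\
ID   entry_one
DE   first toy record
//
ID   entry_two
DE   second toy record
//
"""

def split_entries(text, delimiter="//"):
    """Return a list of entries (each a list of lines), split on delimiter lines."""
    entries, current = [], []
    for line in text.splitlines():
        if line.strip() == delimiter:
            if current:
                entries.append(current)
            current = []
        else:
            current.append(line)
    if current:  # keep a trailing entry that lacks a final delimiter
        entries.append(current)
    return entries

def find_entries(entries, keyword):
    """A flat file has no index, so a query is a linear scan over every entry."""
    return [entry for entry in entries if any(keyword in line for line in entry)]

entries = split_entries(FLAT_FILE)
hits = find_entries(entries, "second")
print(len(entries), "entries;", len(hits), "contain 'second'")
```

The linear scan in `find_entries` is the reason large collections are usually moved into a database management system.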
Relational DB
use a set of tables to organize data; each table is also called a relation
Structured Query Language (SQL) – DB programming language used to query the tables
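As a small illustration of tables and SQL, the sketch below uses Python's built-in `sqlite3` module with an in-memory database. The two-table schema (`organism`, `gene`) is a hypothetical design for this example; the gene names are the ones these notes use for dUTPase in E. coli and yeast.

```python
import sqlite3

# In-memory database; nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Two relations (tables) linked by the org_id column.
conn.execute("CREATE TABLE organism (org_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE gene ("
    "gene_id INTEGER PRIMARY KEY, symbol TEXT, "
    "org_id INTEGER REFERENCES organism(org_id))"
)
conn.executemany("INSERT INTO organism VALUES (?, ?)",
                 [(1, "E. coli"), (2, "Yeast")])
conn.executemany("INSERT INTO gene VALUES (?, ?, ?)",
                 [(1, "dut", 1), (2, "dut1", 2)])

# An SQL query that joins the two tables to find the gene symbol for one organism.
rows = conn.execute(
    "SELECT gene.symbol FROM gene "
    "JOIN organism ON gene.org_id = organism.org_id "
    "WHERE organism.name = ?",
    ("E. coli",),
).fetchall()
print(rows)
conn.close()
```

Splitting the data across two tables means each organism name is stored once, and the join rebuilds the combined view on demand.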
Object-Oriented DB (OO DB)
- Describe complex hierarchical relationships between data items
- Store data as objects
- Searching means navigating through the objects with the aid of the pointers linking different objects
- Programming languages like C++ are used to create object-oriented databases
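The object-and-pointer idea can be sketched in a few lines. The notes mention C++; Python is used here only to keep the examples in one language, with object references playing the role of pointers. The class names and attributes are hypothetical.

```python
class Organism:
    def __init__(self, name):
        self.name = name
        self.genes = []          # references to Gene objects

class Gene:
    def __init__(self, symbol, organism):
        self.symbol = symbol
        self.organism = organism  # reference back to the owning Organism
        self.protein = None
        organism.genes.append(self)

class Protein:
    def __init__(self, name, gene):
        self.name = name
        self.gene = gene          # reference to the encoding Gene
        gene.protein = self

# Build a tiny hierarchy: organism -> gene -> protein.
ecoli = Organism("Escherichia coli")
dut = Gene("dut", ecoli)
dutpase = Protein("dUTPase", dut)

# A "query" is a walk along the references, with no table joins involved.
organism_of_protein = dutpase.gene.organism.name
print(organism_of_protein)
```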
Biological DB



Primary DB – contain original biological data, raw sequences (seqs)
Secondary DB – processed or manually curated data
Specialized DB – cater to a particular research interest
Primary DB
- GenBank (at NCBI), EMBL, DDBJ, PDB
Secondary DB
- Protein DBs – SWISS-PROT, TrEMBL, PIR
- Protein family classification DBs – Pfam, Blocks, DALI
Specialized DB
- FlyBase, WormBase, AceDB, TAIR
- GenBank EST, microarray DBs, e.g. ArrayExpress at EBI
Interconnection between Biological Databases
- Format incompatibility: databases use three types of structures (flat files, relational, and object oriented)
- Common Object Request Broker Architecture (CORBA) – allows database programs at different locations to
communicate in a network through an “interface broker” without having to understand each other’s database
structure
- eXtensible Markup Language (XML) – each biological record is broken down into small, basic components
that are labeled with a hierarchical nesting of tags. This database structure significantly improves the
distribution and exchange of complex sequence annotations between databases
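A minimal sketch of a record broken into nested, labeled components, parsed with Python's standard `xml.etree.ElementTree`. The tag names below are hypothetical and do not follow any real database schema; the values are taken from the DUT_ECOLI FASTA header shown later in these notes.

```python
import xml.etree.ElementTree as ET

# Hypothetical tag layout, for illustration only.
RECORD_XML = """
<entry accession="P06968">
  <name>DUT_ECOLI</name>
  <protein>
    <recommendedName>Deoxyuridine 5'-triphosphate nucleotidohydrolase</recommendedName>
    <ecNumber>3.6.1.23</ecNumber>
  </protein>
  <organism>Escherichia coli</organism>
</entry>
"""

root = ET.fromstring(RECORD_XML.strip())

# Each component is addressed by its tag path, independent of line layout.
accession = root.get("accession")
ec_number = root.findtext("protein/ecNumber")
organism = root.findtext("organism")
print(accession, ec_number, organism)
```

Because every component carries its own label, a receiving database can pick out the fields it needs without knowing how the sender stores them internally.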
PITFALLS OF BIOLOGICAL DATABASES
- Reliability and redundancy of data
- Errors can be passed on to other databases, causing propagation of errors
- Causes of redundancy: submission of identical or overlapping sequences by the same or different authors,
revision of annotations, dumping of expressed sequence tag (EST) data
- NCBI has created a nonredundant database, called RefSeq
- The SWISS-PROT database also has minimal redundancy for protein sequences
- Sequence-cluster databases such as UniGene also reduce redundancy
- Erroneous or inconsistent annotations – addressed by a controlled vocabulary such as Gene Ontology
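The simplest form of redundancy, identical sequences submitted more than once, can be removed with an exact-match filter. The sketch below is only that: curated resources such as RefSeq rely on far more than exact matching, and the submission identifiers and short sequence fragments here are made up for illustration.

```python
def nonredundant(records):
    """Keep the first identifier seen for each distinct sequence (exact matches only)."""
    first_seen = {}
    for identifier, sequence in records:
        key = sequence.upper()             # treat upper and lower case as the same sequence
        first_seen.setdefault(key, identifier)
    return [(identifier, sequence) for sequence, identifier in first_seen.items()]

# Hypothetical submissions; the fragments are toy data.
records = [
    ("submission_1", "MKKIDVKILD"),
    ("submission_2", "mkkidvkild"),   # same sequence, submitted again
    ("submission_3", "MSKVILKIKR"),
]

unique = nonredundant(records)
print(unique)
```

Overlapping but non-identical sequences pass straight through this filter, which is why sequence-cluster approaches are needed as well.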
INFORMATION RETRIEVAL FROM BIOLOGICAL DATABASES
PubMed/Medline
- to find out about a protein by its name, go to http://www.ncbi.nlm.nih.gov/entrez
- enter the protein name, for example dUTPase
- to save the results: File > Save as
Search PubMed using author names (case insensitive)
Searching PubMed using fields
Field tags: TI = title, AB = abstract, AD = laboratory address, AU = authors, SO = journal abbreviation
Example: a common name such as Down can appear in different contexts, such as an author name, a title or
abstract (Down syndrome), or an address. Restrict the search with a field tag: Down[AU], Down[AB], Down[AD]
Logical operators: AND, OR, NOT (in capital letters)
- for more about the fields, read the Entrez Gene web page
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene
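Field tags and logical operators combine into a single query string. The sketch below only builds that string and a search URL; it does not contact any server. The helper names `fielded` and `combine` are hypothetical, the field tags are the ones listed above, and the URL assumes the NCBI E-utilities `esearch` endpoint with its `db` and `term` parameters.

```python
from urllib.parse import urlencode

ALLOWED_OPERATORS = {"AND", "OR", "NOT"}

def fielded(term, field):
    """Attach a field tag, e.g. fielded('Down', 'AU') gives 'Down[AU]'."""
    return f"{term}[{field}]"

def combine(operator, *terms):
    """Join terms with an upper-case logical operator, wrapped in parentheses."""
    if operator not in ALLOWED_OPERATORS:
        raise ValueError("operator must be AND, OR, or NOT in capital letters")
    return "(" + f" {operator} ".join(terms) + ")"

# Papers by an author named Down, excluding hits that only match an address.
author_query = combine("NOT", fielded("Down", "AU"), fielded("Down", "AD"))

# The exercise from these notes: articles about both microRNA and cancer.
topic_query = combine("AND", "microRNA", "cancer")

# Assumed endpoint: NCBI E-utilities esearch.
BASE_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
search_url = BASE_URL + "?" + urlencode({"db": "pubmed", "term": topic_query})

print(author_query)
print(topic_query)
print(search_url)
```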
Using fields to find experts near you
Exercise
Try to search for articles related to microRNA and cancer.
Abstract and Lab. address
Searching PubMed using limits, for example to find all the review articles about dUTPase
A few more tips about PubMed
- quote a phrase to search for it exactly, such as “down syndrome”
- add initials to last names, for example “Abergel C”
- write down the PubMed Identifier (the PMID number)
- search with alternative names, such as dUTP pyrophosphatase
- try the Related Articles option
Retrieving Protein Sequences
http://www.expasy.org/sprot
Retrieving the protein sequence performing the dUTPase function in E. coli
FASTA format
Number of amino acids, molecular weight
>sp|P06968|DUT_ECOLI Deoxyuridine 5'-triphosphate nucleotidohydrolase (EC 3.6.1.23) (dUTPase) (dUTP pyrophosphatase) - Escherichia coli.
MKKIDVKILDPRVGKEFPLPTYATSGSAGLDLRACLNDAVELAPGDTTLVPTGLAIHIAD
PSLAAMMLPRSGLGHKHGIVLGNLVGLIDSDYQGQLMISVWNRGQDSFTIQPGERIAQMI
FVPVVQAEFNLVEDFDATDRGEGGFGHSGRQ
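A FASTA record is a header line starting with `>` followed by sequence lines. The sketch below parses the record above, counts the amino acids, and estimates the molecular weight. The residue masses are standard average values rounded to two decimals, so the weight is an approximation and may differ slightly from the figure on the Swiss-Prot page; the function names are hypothetical.

```python
# Average residue masses in daltons (rounded); one water is added for the free termini.
RESIDUE_MASS = {
    "G": 57.05, "A": 71.08, "S": 87.08, "P": 97.12, "V": 99.13,
    "T": 101.10, "C": 103.14, "L": 113.16, "I": 113.16, "N": 114.10,
    "D": 115.09, "Q": 128.13, "K": 128.17, "E": 129.12, "M": 131.19,
    "H": 137.14, "F": 147.18, "R": 156.19, "Y": 163.18, "W": 186.21,
}
WATER = 18.02

FASTA = """\
>sp|P06968|DUT_ECOLI Deoxyuridine 5'-triphosphate nucleotidohydrolase (EC 3.6.1.23) (dUTPase) (dUTP pyrophosphatase) - Escherichia coli.
MKKIDVKILDPRVGKEFPLPTYATSGSAGLDLRACLNDAVELAPGDTTLVPTGLAIHIAD
PSLAAMMLPRSGLGHKHGIVLGNLVGLIDSDYQGQLMISVWNRGQDSFTIQPGERIAQMI
FVPVVQAEFNLVEDFDATDRGEGGFGHSGRQ
"""

def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

def approx_molecular_weight(sequence):
    """Sum of average residue masses plus one water molecule."""
    return sum(RESIDUE_MASS[aa] for aa in sequence) + WATER

header, sequence = parse_fasta(FASTA)[0]
accession = header.split("|")[1]      # second '|'-separated field of the header
length = len(sequence)
weight = approx_molecular_weight(sequence)
print(accession, length, round(weight))
```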
Advanced search with field restriction
- gene definition, gene name, and organism
Retrieving a list of protein sequences
Deselect the TrEMBL and TrEMBL-New boxes (computer annotated sequences)
Perform this query on SRS against Swiss-Prot
Gene names differ between species, even when the genes have the same function:
Species           Gene name
E. coli           dut
Yeast             dut1
Vaccinia virus    F2L
Herpes virus      UL50
Retrieving DNA sequences
- retrieving the DNA sequence related to E. coli dUTPase (P06968)
- look under the Cross reference section, EMBL, GenBank, DDBJ and CodingSequence
- the GenBank entry consists of many parts, such as LOCUS, REFERENCE section, FEATURES section,
ORIGIN (Sequence) section
[Figure: FEATURES section of the GenBank entry, marking the promoter, -10 signal, RBS, start, and CDS]
Repeat unit
repeat_unit     831..838
                /note="inverted repeat A"
repeat_unit     844..851
                /note="inverted repeat A'"
inverted repeat: ccgcaaac gtttgcgg
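Two segments form an inverted repeat when one is the reverse complement of the other. The short check below applies that definition to the two 8-base segments listed above; the function name is hypothetical.

```python
# Map each base to its complement, for both lower and upper case.
COMPLEMENT = str.maketrans("acgtACGT", "tgcaTGCA")

def reverse_complement(sequence):
    """Complement every base, then reverse the string."""
    return sequence.translate(COMPLEMENT)[::-1]

repeat_a = "ccgcaaac"        # positions 831..838
repeat_a_prime = "gtttgcgg"  # positions 844..851

is_inverted_repeat = reverse_complement(repeat_a) == repeat_a_prime
print(reverse_complement(repeat_a), is_inverted_repeat)
```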
Using BLAST to compare my protein sequence to other protein sequences
Making a Multiple Sequence Alignment (MSA) with ClustalW
http://pir.georgetown.edu
(DUT_CANAL, DUT_BPT5) evolution distance = 0.33884 + 0.33684 = 0.67568
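The distance between two sequences in the tree view is read here as the sum of the branch lengths along the path connecting them. The sketch below assumes the two numbers above are the branch lengths from DUT_CANAL and DUT_BPT5 to the node they share; that reading of the tree is an assumption, as is the helper name.

```python
# Branch lengths from each leaf to the shared internal node (values from the line above).
branch_lengths = {"DUT_CANAL": 0.33884, "DUT_BPT5": 0.33684}

def pair_distance(leaf_a, leaf_b, lengths):
    """Distance between two leaves that join at the same internal node."""
    return lengths[leaf_a] + lengths[leaf_b]

distance = pair_distance("DUT_CANAL", "DUT_BPT5", branch_lengths)
print(f"{distance:.5f}")
```

When comparing several pairs, the pair with the smallest summed distance is the most closely related.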
Appendix
>sp|P06968|DUT_ECOLI Deoxyuridine 5'-triphosphate nucleotidohydrolase (EC 3.6.1.23) (dUTPase) (dUTP pyrophosphatase) - Escherichia coli.
MKKIDVKILDPRVGKEFPLPTYATSGSAGLDLRACLNDAVELAPGDTTLVPTGLAIHIAD
PSLAAMMLPRSGLGHKHGIVLGNLVGLIDSDYQGQLMISVWNRGQDSFTIQPGERIAQMI
FVPVVQAEFNLVEDFDATDRGEGGFGHSGRQ
For MSA
DUT_AQUAE
DUT_BPT5
DUT_BRAJA
DUT_BUCAI
DUT_CANAL
>DUT_AQUAE
MSKVILKIKRLPHAQDLPLPSYATPHSSGLDLRAAIEKPLKIKPFERVLIPTGLILEIPE
GYEGQVRPRSGLAWKKGLTVLNAPGTIDADYRGEVKVILVNLGNEEVVIERGERIAQLVI
APVQRVEVVEVEEVSQTQRGEGGFGSTGTK
>DUT_BPT5
MIKIKLTHPDCMPKIGSEDAAGMDLRAFFGTNPAADLRAIAPGKSLMIDTGVAVEIPRGW
FGLVVPRSSLGKRHLMIANTAGVIDSDYRGTIKMNLYNYGSEMQTLENFERLCQLVVLPH
YSTHNFKIVDELEETIRGEGGFGSSGSK
>DUT_BRAJA
MSTKVTVELQRLPHAEGLPLPAYQTAEAAGLDLMAAVPEDAPLTLASGRYALVPTGLAIA
LPPGHEAQVRPRSGLAAKHGVTVLNSPGTIDADYRGEIKVILINHGAAAFVIKRGERIAQ
MVIAPVVQAALVPVATLSATDRGAGGFGSTGR
>DUT_BUCAI
MSNIEIKILDSRMKNNFSLPSYATLGSSGLDLRACLDETVKLKAHKTILIPTGIAIYIAN
PNITALILPRSGLGHKKGIVLGNLVGLIDSDYQGQLMISLWNRSDQDFYVNPHDRVAQII
FVPIIRPCFLLVKNFNETSRSKKGFGHSGVSGVI
>DUT_CANAL
MTSEDQSLKKQKLESTQSLKVYLRSPKGKVPTKGSALAAGYDLYSAEAATIPAHGQGLVS
TDISIIVPIGTYGRVAPRSGLAVKHGISTGAGVIDADYRGEVKVVLFNHSEKDFEIKEGD
RIAQLVLEQIVNADIKEISLEELDNTERGEGGFGSTGKN
Asia University
Bioinformatics – assignment
Name: ________________________
Class: _____________________
Point your browser to the NCBI home page.
1. What are the GenBank common names of the following species?
a. Acer rubrum
common name:_____________
b. Orycteropus afer
common name:_____________
Hint : point your browser to NCBI home page and use Taxonomy to search.
2. Use PubMed to find out how many references are related to the protein “mitochondrial cytochrome b”.
Number of items? __________ items
3. How many items are written by the author Yang on mitochondrial cytochrome b?
Hint: search PubMed using fields.
__________ items.
4. Retrieve the following three protein sequences in FASTA format: a. horse pancreatic ribonuclease, b.
minke whale pancreatic ribonuclease, c. kangaroo pancreatic ribonuclease. [Hint: Use pancreatic
ribonuclease as your keyword to search on the SWISS-PROT home page. You may find more than one
sequence for each species; use search results from SWISS-PROT, not from TrEMBL.]
Fill in the details for the protein, pancreatic ribonuclease, in the table below.
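Species                     Horse         Minke whale   Kangaroo
Primary Accession ID        __________    __________    __________
Length of Amino acids       __________    __________    __________
Molecular weight            __________    __________    __________
Number of DISULFID bond     __________    __________    __________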
5. Use the three sequences from question 4 to determine which two of these species are most closely
related by doing a multiple sequence alignment (MSA). [Hint: Point your browser to the PIR web site
(http://pir.georgetown.edu) and do the ClustalW alignment. Then, from the TREE VIEW result,
determine which two species are most closely related. A smaller evolution distance implies they are
closer.]
A. horse and minke whale are most closely related
B. horse and kangaroo are most closely related
C. kangaroo and minke whale are most closely related
Answer: A or B or C