Sequence Analysis (I)
Yuh-Shan Jou (周玉山)
jou@ibms.sinica.edu.tw
Institute of Biomedical Sciences, Academia Sinica
Bioinformatics
• Bioinformatics is the application of information
technology to analyze, process, and manage biological
data.
• Bioinformatics provides computational tools to facilitate
the process of
Data
Information
Knowledge
Discovery
Don’t believe everything you see in DB or even in GenBank!
QC is the most important aspect and concern in Bioinformatics!
Roadmap to Genomics
[Figure: roadmap of the Human Genome project. Markers (EST: Expressed Sequence Tag; STS: Sequence Tagged Site; STR: Short Tandem Repeat) and genomic DNA contigs (cosmid, YAC, BAC or PAC contigs) feed the human genome physical maps and radiation hybrid mapping panels. Database ESTs (dbEST), cDNA sequencing and full-length cDNAs feed the transcriptional map of the human genome. Sequencing technologies lead to sequencing of the human genome. Positional cloning and positional candidate approaches lead to disease markers for diagnosis (with GeneChips), expression patterns and profiles (microarray of genes), and functional genomics.]
A Vision for the Future of Genome Research
Francis S. Collins (National Human Genome Research Institute, NIH, USA)
Nature 422:835 (2003)
International Sequence
Database Collaboration
Entrez
NIH
NCBI
GenBank
•Submissions
•Updates
•Submissions
•Updates
EMBL
CIB
NIG
DDBJ
•Submissions
•Updates
getentry
EBI
SRS
EMBL
www.ensembl.org
Lecture 7.1
http://genome.ucsc.edu
Integration Bioinformatics
Data Bases and Scientific Algorithms
Medline
(Asn.1)
Microarray Data
(RDBMS, Excel)
BLAST
(FASTA)
OMIM
(Text File)
Integration
BioInformatics
KEGG
(HTML Text,
Binary Images)
Entrez/NCBI
(Asn.1)
ClustalW
(FASTA)
PDB
(Oracle, 3D images)
Web Access: www.ncbi.nlm.nih.gov
NCBI Web Traffic
[Figure: NCBI users per day (axis 0 to 600,000) for 1998 to 2005, plotted against US and world Internet users; Christmas and New Year’s Day are annotated on the curve.]
The Entrez System: Text Searches
Types of Databases
• Primary Databases
– Original submissions by experimentalists
– Content controlled by the submitter
• Examples: GenBank, SNP, GEO
• Derivative Databases
– Built from primary data
– Content controlled by third party (NCBI)
• Examples: Refseq, TPA, RefSNP, UniGene, NCBI
Protein, Structure, Conserved Domain
Entrez Nucleotides
Primary
  GenBank / EMBL / DDBJ     49,675,750
Derivative
  RefSeq                       545,503
  Third Party Annotation         4,544
  PDB                            5,561
Total                       50,231,358
Entrez Protein: Derivative Databases
GenPept                   3,950,968
RefSeq                    1,348,072
Third Party Annotation        4,133
Swiss Prot                  170,087
PIR                         282,821
PRF                          12,079
PDB                          61,845
Total                     5,830,005
BLAST nr total            2,336,522
The Growth of GenBank
[Figure: GenBank growth, Jun-82 to Jun-04; base pairs (billions) and records (millions), both axes 0 to 50.]
Release 148:
45.2 million records
49.4 billion nucleotides
Average doubling time ≈ 14 months*
Organization of GenBank:
Traditional Divisions
Records are divided into 17 Divisions.
11 Traditional
6 Bulk
Traditional Divisions:
• Direct Submissions
(Sequin and BankIt)
• Accurate
• Well characterized
PRI (28) Primate
PLN (13) Plant and Fungal
BCT (11) Bacterial and Archaeal
INV (7) Invertebrate
ROD (15) Rodent
VRL (4) Viral
VRT (7) Other Vertebrate
MAM (1) Mammalian
PHG (1) Phage
SYN (1) Synthetic (cloning vectors)
UNA (1) Unannotated
Entrez query: gbdiv_xxx[Properties]
Organization of GenBank:
Bulk Divisions
Records are divided into 17 Divisions.
11 Traditional
6 Bulk
BULK Divisions:
• Batch Submission
(Email and FTP)
• Inaccurate
• Poorly characterized
EST (355) Expressed Sequence Tag
GSS (132) Genome Survey Sequence
HTG (62) High Throughput Genomic
STS (5) Sequence Tagged Site
HTC (6) High Throughput cDNA
PAT (17) Patent
Entrez query: gbdiv_xxx[Properties]
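The gbdiv_xxx[Properties] pattern above can be sketched as a small helper. This is only an illustration: the function name division_query is hypothetical, and the division codes are simply those listed on the two slides.

```python
# Division codes as listed on the "Organization of GenBank" slides.
TRADITIONAL = ["PRI", "PLN", "BCT", "INV", "ROD", "VRL",
               "VRT", "MAM", "PHG", "SYN", "UNA"]
BULK = ["EST", "GSS", "HTG", "STS", "HTC", "PAT"]


def division_query(code):
    """Build the Entrez term for one GenBank division (hypothetical helper)."""
    code = code.upper()
    if code not in TRADITIONAL + BULK:
        raise ValueError(f"unknown division: {code}")
    # The slides write the term in lowercase, e.g. gbdiv_est[Properties].
    return f"gbdiv_{code.lower()}[Properties]"


print(division_query("EST"))
```

The resulting string is meant to be pasted into an Entrez nucleotide search box, optionally combined with other terms.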
File Formats of the
Sequence Databases
Each sequence is represented by
a text record called a flat file.
GenBank/GenPept (useful for scientists)
FASTA (the simplest format)
ASN.1 & XML (useful for programmers)
LOCUS       AY182241                1931 bp    mRNA    linear   PLN 04-MAY-2004
DEFINITION  Malus x domestica (E,E)-alpha-farnesene synthase (AFS1) mRNA,
            complete cds.
ACCESSION   AY182241
VERSION     AY182241.2  GI:32265057
KEYWORDS    .
SOURCE      Malus x domestica (cultivated apple)
  ORGANISM  Malus x domestica
            Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
            Spermatophyta; Magnoliophyta; eudicotyledons; core eudicots;
            rosids; eurosids I; Rosales; Rosaceae; Maloideae; Malus.
REFERENCE   1  (bases 1 to 1931)
  AUTHORS   Pechous,S.W. and Whitaker,B.D.
  TITLE     Cloning and functional expression of an (E,E)-alpha-farnesene
            synthase cDNA from peel tissue of apple fruit
  JOURNAL   Planta 219, 84-94 (2004)
REFERENCE   2  (bases 1 to 1931)
  AUTHORS   Pechous,S.W. and Whitaker,B.D.
  TITLE     Direct Submission
  JOURNAL   Submitted (18-NOV-2002) PSI-Produce Quality and Safety Lab,
            USDA-ARS, 10300 Baltimore Ave. Bldg. 002, Rm. 205, Beltsville, MD
            20705, USA
REFERENCE   3  (bases 1 to 1931)
  AUTHORS   Pechous,S.W. and Whitaker,B.D.
  TITLE     Direct Submission
  JOURNAL   Submitted (25-JUN-2003) PSI-Produce Quality and Safety Lab,
            USDA-ARS, 10300 Baltimore Ave. Bldg. 002, Rm. 205, Beltsville, MD
            20705, USA
  REMARK    Sequence update by submitter
COMMENT     On Jun 26, 2003 this sequence version replaced gi:27804758.
FEATURES             Location/Qualifiers
     source          1..1931
                     /organism="Malus x domestica"
                     /mol_type="mRNA"
                     /cultivar="'Law Rome'"
                     /db_xref="taxon:3750"
                     /tissue_type="peel"
     gene            1..1931
                     /gene="AFS1"
     CDS             54..1784
                     /gene="AFS1"
                     /note="terpene synthase"
                     /codon_start=1
                     /product="(E,E)-alpha-farnesene synthase"
                     /protein_id="AAO22848.2"
                     /db_xref="GI:32265058"
                     /translation="MEFRVHLQADNEQKIFQNQMKPEPEASYLINQRRSANYKPNIWK
                     NDFLDQSLISKYDGDEYRKLSEKLIEEVKIYISAETMDLVAKLELIDSVRKLGLANLF
                     EKEIKEALDSIAAIESDNLGTRDDLYGTALHFKILRQHGYKVSQDIFGRFMDEKGTLE
                     NHHFAHLKGMLELFEASNLGFEGEDILDEAKASLTLALRDSGHICYPDSNLSRDVVHS
                     LELPSHRRVQWFDVKWQINAYEKDICRVNATLLELAKLNFNVVQAQLQKNLREASRWW
                     ANLGIADNLKFARDRLVECFACAVGVAFEPEHSSFRICLTKVINLVLIIDDVYDIYGS
                     EEELKHFTNAVDRWDSRETEQLPECMKMCFQVLYNTTCEIAREIEEENGWNQVLPQLT
                     KVWADFCKALLVEAEWYNKSHIPTLEEYLRNGCISSSVSVLLVHSFFSITHEGTKEMA
                     DFLHKNEDLLYNISLIVRLNNDLGTSAAEQERGDSPSSIVCYMREVNASEETARKNIK
                     GMIDNAWKKVNGKCFTTNQVPFLSSFMNNATNMARVAHSLYKDGDGFGDQEKGPRTHI
                     LSLLFQPLVN"
ORIGIN
1 ttcttgtatc ccaaacatct cgagcttctt gtacaccaaa ttaggtattc actatggaat
61 tcagagttca cttgcaagct gataatgagc agaaaatttt tcaaaaccag atgaaacccg
121 aacctgaagc ctcttacttg attaatcaaa gacggtctgc aaattacaag ccaaatattt
181 ggaagaacga tttcctagat caatctctta tcagcaaata cgatggagat gagtatcgga
241 agctgtctga gaagttaata gaagaagtta agatttatat atctgctgaa acaatggatt
//
A Traditional
GenBank Record
Header
The Flatfile Format
Feature Table
Sequence
The Header
LOCUS
DEFINITION
ACCESSION
VERSION
KEYWORDS
SOURCE
  ORGANISM
REFERENCE
  AUTHORS
  TITLE
  JOURNAL
  REMARK
COMMENT
Header: Locus Line
LOCUS       AY182241                1931 bp    mRNA    linear   PLN 04-MAY-2004

Locus name: AY182241
Length: 1931 bp
Molecule type: mRNA
Division: PLN
Modification Date: 04-MAY-2004
Header: Database Identifiers
ACCESSION   AY182241
VERSION     AY182241.2  GI:32265057

Accession (AY182241)
•Stable
•Reportable
•Universal
Version (AY182241.2)
•Tracks changes in sequence
GI number (GI:32265057)
•NCBI internal use
Header: Organism
SOURCE      Malus x domestica (cultivated apple)
  ORGANISM  Malus x domestica
            Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
            Spermatophyta; Magnoliophyta; eudicotyledons; core eudicots;
            rosids; eurosids I; Rosales; Rosaceae; Maloideae; Malus.

NCBI-controlled taxonomy
The Feature Table
FEATURES             Location/Qualifiers
     source          1..1931
                     /organism="Malus x domestica"
                     /mol_type="mRNA"
                     /cultivar="'Law Rome'"
                     /db_xref="taxon:3750"
                     /tissue_type="peel"
     gene            1..1931
                     /gene="AFS1"
     CDS             54..1784
                     /gene="AFS1"
                     /note="terpene synthase"
                     /codon_start=1
                     /product="(E,E)-alpha-farnesene synthase"
                     /protein_id="AAO22848.2"
                     /db_xref="GI:32265058"
                     /translation="MEFRVHLQADNEQKIFQNQMKPEPEASYLINQRRSANYKPNIWK
                     NDFLDQSLISKYDGDEYRKLSEKLIEEVKIYISAETMDLVAKLELIDSVRKLGLANLF
                     EKEIKEALDSIAAIESDNLGTRDDLYGTALHFKILRQHGYKVSQDIFGRFMDEKGTLE
                     NHHFAHLKGMLELFEASNLGFEGEDILDEAKASLTLALRDSGHICYPDSNLSRDVVHS
                     LELPSHRRVQWFDVKWQINAYEKDICRVNATLLELAKLNFNVVQAQLQKNLREASRWW
                     ANLGIADNLKFARDRLVECFACAVGVAFEPEHSSFRICLTKVINLVLIIDDVYDIYGS
                     EEELKHFTNAVDRWDSRETEQLPECMKMCFQVLYNTTCEIAREIEEENGWNQVLPQLT
                     KVWADFCKALLVEAEWYNKSHIPTLEEYLRNGCISSSVSVLLVHSFFSITHEGTKEMA
                     DFLHKNEDLLYNISLIVRLNNDLGTSAAEQERGDSPSSIVCYMREVNASEETARKNIK
                     GMIDNAWKKVNGKCFTTNQVPFLSSFMNNATNMARVAHSLYKDGDGFGDQEKGPRTHI
                     LSLLFQPLVN"

CDS 54..1784: coding sequence, from start (atg) to stop (tag)
/protein_id and /db_xref="GI:...": GenPept identifiers
/translation: implied protein
The Sequence: 99.99% Accurate
ORIGIN
        1 ttcttgtatc ccaaacatct cgagcttctt gtacaccaaa ttaggtattc actatggaat
       61 tcagagttca cttgcaagct gataatgagc agaaaatttt tcaaaaccag atgaaacccg
      121 aacctgaagc ctcttacttg attaatcaaa gacggtctgc aaattacaag ccaaatattt
      181 ggaagaacga tttcctagat caatctctta tcagcaaata cgatggagat gagtatcgga
      ...
     1741 ggacccacat cctgtcttta ctattccaac ctcttgtaaa ctagtactca tatagtttga
     1801 aataaatagc agcaaaagtt tgcggttcag ttcgtcatgg ataaattaat ctttacagtt
     1861 tgtaacgttg ttgccaaaga ttatgaataa aaagttgtag tttgtcgttt aaaaaaaaaa
     1921 aaaaaaaaaa a
//
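As a quick check of how the CDS location in the feature table maps onto the sequence, the following sketch reads the start and stop codons at positions 54 and 1782 from the two sequence lines shown on the slides. The helper name codon_at is an assumption made for this illustration; it is not part of any NCBI tool.

```python
def codon_at(line_start, line_bases, pos):
    """Return the 3 bases starting at 1-based position `pos`.

    `line_start` is the number printed at the start of the ORIGIN line,
    `line_bases` is the rest of that line (blocks of 10 bases).
    """
    seq = line_bases.replace(" ", "")
    offset = pos - line_start
    return seq[offset:offset + 3]


# ORIGIN lines numbered 1 and 1741, copied from the record above.
first_line = "ttcttgtatc ccaaacatct cgagcttctt gtacaccaaa ttaggtattc actatggaat"
last_cds_line = "ggacccacat cctgtcttta ctattccaac ctcttgtaaa ctagtactca tatagtttga"

cds_start, cds_end = 54, 1784           # from "CDS 54..1784"
start_codon = codon_at(1, first_line, cds_start)
stop_codon = codon_at(1741, last_cds_line, cds_end - 2)
n_codons = (cds_end - cds_start + 1) // 3

print(start_codon, stop_codon, n_codons)
```

The 577 codons correspond to the 576 residues of the /translation qualifier plus the stop codon.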
FASTA Format
>gi|30256|emb|CAA42556.1| c-src-kinase [Homo sapiens]
MSAIQAAWPSGTECIAKYNFHGTAEQDLPFCKGDVLTIVAVTKDPNWYKAKNKVGREGIIPANYVQKREG
VKAGTKLSLMPWFHGKITREQAERLLYPPETGLFLVRESTNYPGDYTLCVSCDGKVEHYRIMYHASKLSI
DEEVYFENLMQLVEHYTSDADGLCTRLIKPKVMEGTVAAQDEFYRSGWALNMKELKLLQTIGKGEFGDVM
LGDYRGNKVAVKCIKNDATAQAFLAEASVMTQLRHSNLVQLLGVIVEEKGGLYIVTEYMAKGSLVDYLRS
RGRSVLGGDCLLKFSLDVCEAMEYLEGNNFVHRDLAARNVLVSEDNVAKVSDFGLTKEASSTQDTGKLPV
KWTAPEALREKKFSTKSDVWSFGILLWEIYSFGRVPYPRIPLKDVVPRVEKGYKMDAPDGCPPAVYEVMK
NCWHLDAAMRPSFLQLREQLEHIKTHELHL

Definition line fields:
>gi|gi number|database identifier|Accession.Version|Locus Name| description [Organism]

Database Identifiers:
gb    GenBank
emb   EMBL
dbj   DDBJ
ref   RefSeq
sp    SWISS-PROT
pdb   Protein Databank
pir   PIR
prf   PRF
tpg   TPA-GenBank
tpe   TPA-EMBL
tpj   TPA-DDBJ
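The definition line fields described above can be pulled apart with a few string operations. This is a minimal sketch, assuming the gi|number|db|accession.version| layout of the example; the function name parse_defline and the returned dictionary keys are made up for illustration, and the locus-name field (empty in the example) is folded into the description.

```python
import re

# Two-to-three letter database tags, as listed on the slide.
DB_NAMES = {
    "gb": "GenBank", "emb": "EMBL", "dbj": "DDBJ", "ref": "RefSeq",
    "sp": "SWISS-PROT", "pdb": "Protein Databank", "pir": "PIR",
    "prf": "PRF", "tpg": "TPA-GenBank", "tpe": "TPA-EMBL", "tpj": "TPA-DDBJ",
}


def parse_defline(line):
    """Split a '>gi|...' FASTA definition line into named fields."""
    _gi_tag, gi, db, acc, rest = line.lstrip(">").split("|", 4)
    accession, _, version = acc.partition(".")
    # The organism is the bracketed text at the end of the line, if present.
    match = re.search(r"\[(.+)\]\s*$", rest)
    organism = match.group(1) if match else None
    description = (rest[:match.start()] if match else rest).strip()
    return {
        "gi": gi,
        "db": db,
        "database": DB_NAMES.get(db, "unknown"),
        "accession": accession,
        "version": version,
        "description": description,
        "organism": organism,
    }


record = parse_defline(">gi|30256|emb|CAA42556.1| c-src-kinase [Homo sapiens]")
print(record)
```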
Abstract Syntax Notation: ASN.1
Seq-entry ::= set {
  class nuc-prot ,
  descr {
    title "Malus x domestica (E,E)-alpha-farnesene synthase (AFS1) mRNA,
 complete cds." ,
    source {
      org {
        taxname "Malus x domestica" ,
        common "cultivated apple" ,
        db {
          {
            db "taxon" ,
            tag
              id 3750 } } ,
        orgname {
          name
            binomial {
              genus "Malus" ,
              species "x domestica" } ,
          mod {
            {
              subtype cultivar ,
GenPept
GenBank
ASN.1
FASTA
Protein
FASTA
Nucleotide
Bulk Divisions
•Batch Submission and htg (email and ftp)
•Inaccurate
•Poorly Characterized
• Expressed Sequence Tag
– 1st pass single read cDNA
• Genome Survey Sequence
– 1st pass single read gDNA
• High Throughput Genomic
– incomplete sequences of genomic clones
• Sequence Tagged Site
– PCR-based mapping reagents
EST Division: Expressed Sequence Tags
gbdiv_est[Properties]
>IMAGE:275615 5' mRNA sequence
GACAGCATTCGGGCCGAGATGTCTCGCTCCGTGGCCTTAGCTGTGCTCGCGCTACTCTCTCTTTCTGG
TGGAGGTATCCAGCGTACTCCAAAGATTCAGGTTTACTCACGTCATCCAGCAGAGAATGGAAAGTCAA
TTCCTGAATTGCTATGTGTCTGGGTTTCATCCATCCGACATTGAAGTTGACTTACTGAAGAATGGAGA
GAATTGAAAAAGTGGAGCATTCAGACTTGTCTTTCAGCAAGGACTGGTCTTTCTATCTCTTGTACTAC
TGAATTCACCCCCACTGAAAAAGATGAGTATGCCTGCCGTGTTGAACCATGTNGACTTTGTCACAGNC
AAGTTNAGTTTAAGTGGGNATCGAGACATGTAAGGCAGGCATCATGGGAGGTTTTGAAGNATGCCGCN
TTGGATTGGGATGAATTCCAAATTTCTGGTTTGCTTGNTTTTTTAATATTGGATATGCTTTTG
>IMAGE:275615 3', mRNA sequence
NNTCAAGTTTTATGATTTATTTAACTTGTGGAACAAAAATAAACCAGATTAACCACAACCATGCCTTA
TTATCAAATGTATAAGANGTAAATATGAATCTTATATGACAAAATGTTTCATTCATTATAACAAATTT
AATAATCCTGTCAATNATATTTCTAAATTTTCCCCCAAATTCTAAGCAGAGTATGTAAATTGGAAGTT
CTTATGCACGCTTAACTATCTTAACAAGCTTTGAGTGCAAGAGATTGANGAGTTCAAATCTGACCAAG
GTTGATGTTGGATAAGAGAATTCTCTGCTCCCCACCTCTANGTTGCCAGCCCTC

[Figure: 30,000 genes in the nucleus; RNA gene products; make cDNA library (80-100,000 unique cDNA clones in library); isolate unique clones; sequence once from each end (5’ and 3’).]
ESTs in Entrez
Total        26 million records
Human        6.0 million
Mouse        4.3 million
Rat          0.7 million
Zebrafish    0.6 million
Wheat        0.6 million
Barley       0.3 million
Maize        0.4 million
Genome Sequencing - HTG, GSS, (WGS)
Whole BAC insert (or genome)
shredding
sequencing
GSS division
or trace archive assembly
cloning isolating
whole genome shotgun assemblies
(traditional division)
Draft Sequence (HTG division)
HTG Division: Rice Draft Sequences
•Unfinished sequences of BACs
•Gaps and unordered pieces
•Finished sequences move to traditional
GenBank division
Whole Genome Shotgun Projects
• Traditional GenBank Divisions
• 200 + projects
– Virus
– Bacteria
– Environmental sequences
– Archaea
– 51 Eukaryotes featuring:
• Cow, Chicken, Rat, Mouse, Dog, Chimpanzee, Human
• Pufferfish (2)
• Honeybee, Anopheles, Fruit Flies (3), Silkworm
• Nematode (C. briggsae)
• Yeasts (8), Aspergillus (2)
• Rice
Zebrafish: WGS
wgs_master[Properties]
Derivative Databases
UniGene
RefSeq
TPA
Primary vs. Derivative
Sequence Databases
[Figure: labs and sequencing centers submit sequences to GenBank, which is updated ONLY by submitters. RefSeq and UniGene are derived from it by curators, genome assembly and algorithms, and are updated continually by NCBI.]
What is UniGene?
A gene-oriented view of sequence entries
•MegaBlast based automated sequence clustering
•Now informed by genome hits
•Nonredundant set of gene oriented clusters
•Each cluster a unique gene
•Information on tissue types and map locations
•Includes known genes and uncharacterized ESTs
•Useful for gene discovery and selection of mapping reagents
EST hits: Human mRNA
Albumin mRNA
5’ EST hits
3’ EST hits
UniGene: Expressed Sequences
Expression Data
RELEASE 11 (May 13, 2005)
AVAILABLE ON THE FTP SITE!
• Forming the “best representative” sequence
• Standardizing nomenclature and record structure
• Adding annotation (references, sequence features)
• Stable reference for, for example, gene identification, polymorphism discovery, and comparative analysis
RefSeq Release 11 includes over 1,425,971 proteins and
2928 organisms.
The release is available by FTP at:
ftp://ftp.ncbi.nih.gov/refseq/release/
RefSeq number is still not fixed.
srcdb_refseq[Properties]
Curated RefSeq Records
RefSeq Nucleotide
LOCUS       ADSS         1368 bp    mRNA    linear   PRI 27-AUG-2002
DEFINITION  Homo sapiens adenylosuccinate synthase (ADSS), mRNA.
ACCESSION   NM_001126
VERSION     NM_001126.1  GI:4557270

RefSeq Protein
LOCUS       ADSS         455 aa             linear   PRI 27-AUG-2002
DEFINITION  adenylosuccinate synthase; Adenylosuccinate synthetase
            (Ade(-)H-complementing) Homo sapiens.
ACCESSION   NP_001117
VERSION     NP_001117.1  GI:4557271
DBSOURCE    REFSEQ: accession NM_001126.1
COMMENT     REVIEWED REFSEQ: This record has been curated by NCBI
            staff. The reference sequence was derived from X66503.1.
            Summary: Adenylosuccinate synthetase catalyzes the first
            committed step in the conversion of IMP to AMP.
X records: Genome Annotation & Inferred or Predicted
vs
N records: Provisional, Reviewed or Validated
RefSeq Accession Numbers
mRNAs and Proteins
NM_123456    Curated mRNA
NP_123456    Curated Protein
NR_123456    Curated non-coding RNA
XM_123456    Predicted mRNA
XP_123456    Predicted Protein
XR_123456    Predicted non-coding RNA
Gene Records
NG_123456    Reference Genomic Sequence
Chromosome
NC_123455    Microbial replicons, organelle
Assemblies
NT_123456    Contig
NW_123456    WGS Supercontig
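The prefix table above lends itself to a simple lookup. The sketch below is only an illustration of that table; the name classify_refseq and the fallback message are assumptions, not an NCBI API.

```python
# Prefix meanings exactly as given on the slide.
REFSEQ_PREFIXES = {
    "NM": "Curated mRNA",
    "NP": "Curated Protein",
    "NR": "Curated non-coding RNA",
    "XM": "Predicted mRNA",
    "XP": "Predicted Protein",
    "XR": "Predicted non-coding RNA",
    "NG": "Reference Genomic Sequence",
    "NC": "Microbial replicons, organelle",
    "NT": "Contig",
    "NW": "WGS Supercontig",
}


def classify_refseq(accession):
    """Return the record type implied by a RefSeq accession prefix."""
    prefix = accession.split("_", 1)[0].upper()
    return REFSEQ_PREFIXES.get(prefix, "not a RefSeq accession in this table")


for acc in ("NM_001126.1", "NP_001117.1", "AY182241.2"):
    print(acc, "->", classify_refseq(acc))
```

A GenBank accession such as AY182241.2 has no underscore-delimited prefix, so it falls through to the fallback message.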
RefSeq Curation Processes
Curated genomic DNA
(NC, NT, NW)
Scanning....
Curated Model mRNA (XM)
Model protein (XP)
(XR)
Curated mRNA (NM)
Protein (NP)
(NR)
http://www.ncbi.nlm.nih.gov/RefSeq/key.html#accessions
RefSeq: NCBI’s Derivative Sequence Database
• Curated transcripts and proteins
– reviewed
– human, mouse, rat, fruit fly, zebrafish, arabidopsis,
microbial genomes (proteins), and more
• Model transcripts and proteins
• Assembled Genomic Regions (contigs)
– human genome
– mouse genome
– rat genome
• Chromosome records
– Human genome
– microbial
– organelle
srcdb_refseq[Properties]
ftp://ftp.ncbi.nih.gov/refseq/release/
RefSeq Benefits
• non-redundancy
• explicitly linked nucleotide and protein sequences
• updates to reflect current sequence data and biology
• data validation
• format consistency
• distinct accession series
• stewardship by NCBI staff and collaborators
Third Party Annotation (TPA) Database
• Annotations of existing GenBank sequences
• Allows for community annotation of genomes
• Direct submissions
– BankIt
– Sequin
tpa[Properties]
TPA record: WGS
Assembly
CDS Feature
TPA protein
Human Nucleotide Sequences
ISDC (GenBank/EMBL/DDBJ)    8,965,327
  PRI                         916,017
    (WGS                      601,855)
  EST                       6,003,916
  GSS                         905,645
  HTG                          18,364
  HTC                          49,373
  STS                         117,870
  PAT                         953,269
RefSeq                         35,934
TPA                               893
Total                       9,002,154
Other NCBI Databases
•dbSNP: nucleotide polymorphism
•GEO: Gene Expression Omnibus
microarray and other expression data
•Gene: gene records
Unifies LocusLink and Microbial Genomes
•Structure: imported structures (PDB)
Cn3D viewer, NCBI curation
•CDD: conserved domain database
Protein families (COGs)
Single domains (PFAM, SMART, CD)
NCBI’s SNP Database
• Primary Database and Derivative (RefSNP)
• Single Nucleotide Polymorphism
• Repeat polymorphisms
• Insertion-Deletion Polymorphisms
• 24 Species
• Over 15 million submissions
Submitted SNP
Hemochromatosis SNP
RefSNP
•Non-redundant
•Computational Analysis
•BLAST hits to genome, mRNA, protein and structure
Sequence Similarity Searching
Basic Local Alignment Search Tool
(BLAST)
Text
Pubmed
Sequence
BLAST
Structure
VAST
Pairwise Alignment Summary
Global
• Best score for aligning
the full length sequences
• Dynamic programming
• Algorithm:
Needleman-Wunsch
• Table cells are allowed
any score
Local
• Best score for aligning
part of sequences
• Dynamic programming
• Algorithm:
Smith-Waterman
• Table cells never score
below zero
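The difference between the two dynamic programming tables can be sketched in a few lines. This is a simplified illustration, not the Align or BLAST implementation: the scores (match +1, mismatch -1, linear gap -2) are arbitrary assumptions, and only the best score is returned, not the alignment itself.

```python
def align_score(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Best alignment score by dynamic programming.

    local=False: Needleman-Wunsch style (full-length sequences,
                 table cells may hold any score).
    local=True:  Smith-Waterman style (part of the sequences,
                 table cells never drop below zero).
    """
    rows, cols = len(a) + 1, len(b) + 1
    table = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment pays for leading gaps.
        for i in range(1, rows):
            table[i][0] = i * gap
        for j in range(1, cols):
            table[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = table[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diag, table[i - 1][j] + gap, table[i][j - 1] + gap)
            if local:
                cell = max(cell, 0)   # the only change for local alignment
            table[i][j] = cell
            best = max(best, cell)
    # Global: score of the last cell. Local: best cell anywhere in the table.
    return best if local else table[-1][-1]


print(align_score("A", "T"))                            # global score can be negative
print(align_score("A", "T", local=True))                # local score is floored at zero
print(align_score("TTACGTTT", "GGACGGG", local=True))   # shared core "ACG"
```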
Global vs Local Alignment
Seq 1
Seq 2
Global alignment
Seq 1
Seq 2
Local alignment
Global Alignment
Align program (Lipman and Pearson)
Global alignment of the full-length sequences (first and last blocks):
human M--------------SAIQ----------------------AAWPSGT------------ECIAKYNFHG
worm  MGSCIGKEDPPPGATSPVHTSSTLGRESLPSHPRIPSIGPIAASSSGNTIDKNQNISQSANFVALFQYDA
human REQLEHI--------KTHELHL
worm  QWKLEDLFNLDSSEYKEASINF

Local alignment of the same pair (Human 15 to 443, Worm 63 to 492):
Human:  15 IAKYNFHGTAEQDLPFCKGDVLTIVAVTKDPNWYKAKNKVGREGIIPANYVQKREGVKAGTKLSLMPWFH 84
Worm:   63 VALFQYDARTDDDLSFKKDDILEILNDTQGDWWFARHKATGRTGYIPSNYVAREKSIES------QPWYF 125
Human:  85 GKITREQAERLLYPP--ETGLFLVRESTNYPGDYTLCVSCDGKVEHYRI-MYHASKLSIDEEVYFENLMQ 151
Worm:  126 GKMRRIDAEKCLLHTLNEHGAFLVRDSESRQHDLSLSVRENDSVKHYRIQLDHGGYF-IARRRPFATLHD 194
Human: 152 LVEHYTSDADGLCTRLIKPKVMEGTVAAQDEFYRSGWALNMKELKLLQTIGKGEFGDVMLGDYRGN-KVA 220
Worm:  195 LIAHYQREADGLCVNLGAPCAKSEAPQTTTFTYDDQWEVDRRSVRLIRQIGAGQFGEVWEGRWNVNVPVA 264
Human: 221 VKCIK-NDATAQAFLAEASVMTQLRHSNLVQLLGVIVEEKGGLYIVTEYMAKGSLVDYLRSRGRSVLGGD 289
Worm:  265 VKKLKAGTADPTDFLAEAQIMKKLRHPKLLSLYAVCTRDE-PILIVTELMQE-NLLTFLQRRGRQCQMPQ 332
Human: 290 CLLKFSLDVCEAMEYLEGNNFVHRDLAARNVLVSEDNVAKVSDFGLT----KEASSTQDTG-KLPVKWTA 353
Worm:  333 -LVEISAQVAAGMAYLEEMNFIHRDLAARNILINNSLSVKIADFGLARILMKENEYEARTGARFPIKWTA 401
Human: 354 PEALREKKFSTKSDVWSFGILLWEIYSFGRVPYPRIPLKDVVPRVEKGYKMDAPDGCPPAVYEVMKNCWH 423
Worm:  402 PEAANYNRFTTKSDVWSFGILLTEIVTFGRLPYPGMTNAEVLQQVDAGYRMPCPAGCPVTLYDIMQQCWR 471
Human: 424 LDAAMRPSFLQLREQLEHI 443
Worm:  472 SDPDKRPTFETLQWKLEDL 492
Basic Local Alignment Search Tool
• Widely used similarity search tool
• Heuristic approach based on Smith-Waterman algorithm
• Finds best local alignments
• Provides statistical significance
• All combinations (DNA/Protein) query and database.
– DNA vs DNA
– DNA translation vs Protein
– Protein vs Protein
– Protein vs DNA translation
– DNA translation vs DNA translation
• www, standalone, and network clients
What BLAST tells you
• BLAST reports surprising alignments
– Different than chance
• Assumptions
– Random sequences
– Constant composition
• Conclusions
– Surprising similarities imply evolutionary
homology
Evolutionary Homology: descent from a common ancestor
Does not always imply similar function
BLAST/FASTA variants for different searches
Program          Query     Database   Comparison      Searching purpose
blastn/fasta     DNA       DNA        DNA level       homologous DNA
blastp/fasta     Protein   Protein    Protein level   homologous protein
blastx/fastx     DNA       Protein    Protein level   New genes from DNA
tblastn/tfasta   Protein   DNA        Protein level   New genes from peptide
tblastx/tfastx   DNA       DNA        Protein level   New genes from DNA
BLAST Web site: http://www.ncbi.nlm.nih.gov/BLAST
FASTA Web sites: http://www2.ebi.ac.uk/fasta3/
or http://www.fasta.genome.ad.jp/
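The table of BLAST variants can be read as a lookup from (query type, database type, comparison level) to a program name. The sketch below encodes only the BLAST column of that table; the function name pick_program is hypothetical.

```python
# (query, database, comparison level) -> BLAST program, from the table above.
PROGRAMS = {
    ("DNA", "DNA", "DNA"): "blastn",
    ("Protein", "Protein", "Protein"): "blastp",
    ("DNA", "Protein", "Protein"): "blastx",
    ("Protein", "DNA", "Protein"): "tblastn",
    ("DNA", "DNA", "Protein"): "tblastx",
}


def pick_program(query, database, level):
    """Return the BLAST program for a query/database/comparison combination."""
    try:
        return PROGRAMS[(query, database, level)]
    except KeyError:
        raise ValueError(f"no BLAST program for {query} vs {database} at {level} level")


print(pick_program("DNA", "Protein", "Protein"))
```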
BLASTN Databases
nr       GenBank, EMBL, DDBJ, PDB and NCBI reference sequences (RefSeq)
htgs     High-throughput genomic sequences (draft)
pat      Patented nucleotide sequences
mito     Mitochondrial sequences
vector   Vector subset of GenBank
month    GenBank, EMBL, DDBJ, PDB from the last 30 days
chrom    Contigs and chromosomes from RefSeq
BLASTP Databases
nr          GenBank CDS translations, RefSeq, PDB, SWISS-PROT, PIR, PRF
swissprot   SWISS-PROT
pat         Patented protein sequences
pdb         Protein Data Bank
month       GenBank CDS translations, PDB, SWISS-PROT, PIR, PRF from the last 30 days
Nucleotide Words
Query: GTACTGGACATGGACCCTACAGGAACGTATACGTAAG
Make a lookup table of words

11-mer (blastn)
GTACTGGACAT
TACTGGACATG
ACTGGACATGG
CTGGACATGGA
TGGACATGGAC
GGACATGGACC
GACATGGACCC
ACATGGACCCT
. . .

28-mer (megablast)
GTACTGGACATGGACCCTACAGGAACGT
. . .
TGGACATGGACCCTACAGGAACGTATAC
. . .
CATGGACCCTACAGGAACGTATACGTAA

WORD SIZE    Def.   Min.
blastn        11      7
megablast     28     12
Protein Words
Query: GTQITVEDLFYNIATRRKALKN
Word size = 3 (default)
Word size can only be 2 or 3
Make a lookup table of words
Words: GTQ, TQI, QIT, ITV, TVE, VED, EDL, DLF, ...
Neighborhood Words (for ITV): LTV, MTV, ISV, LSV, etc.
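The word lists on the nucleotide and protein slides above come from sliding a window of width W along the query one position at a time. A minimal sketch follows; the function name words is an assumption, and neighborhood words (which need a scoring matrix and threshold) are not generated here.

```python
def words(query, w):
    """Return every overlapping word of length w, in query order."""
    return [query[i:i + w] for i in range(len(query) - w + 1)]


dna_query = "GTACTGGACATGGACCCTACAGGAACGTATACGTAAG"
protein_query = "GTQITVEDLFYNIATRRKALKN"

blastn_words = words(dna_query, 11)      # blastn default word size
megablast_words = words(dna_query, 28)   # megablast default word size
protein_words = words(protein_query, 3)  # protein default word size

print(blastn_words[:3])
print(len(blastn_words), len(megablast_words), len(protein_words))
print(protein_words[:8])
```

A query of length L yields L - W + 1 words, so larger word sizes give fewer (and more specific) lookup-table entries.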
Minimum Requirements for a Hit
•Nucleotide BLAST requires one exact match
ATCGCCATGCTTAATTGGGCTT
CATGCTTAATT    exact word match (one match)
•Protein BLAST requires two neighboring matches within 40 aa
GTQITVEDLFYNI
SEI, YYN    neighborhood words (two matches)
BLAST Algorithm
(1) Break the query sequence into words of length W (W default = 11)
(2) Compare the word list to the database
and identify exact matches
(3) For each word match, extend alignment in both
directions
(4) Compute E-value
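Steps (1) to (3) can be sketched as a toy seed-and-extend routine. This is a deliberately simplified illustration, not the BLAST implementation: extension here stops at the first mismatch (real BLAST extends by score), the E-value of step (4) is not computed, and the subject sequence is invented for the example.

```python
def seed_and_extend(query, subject, w=11):
    """Find exact word hits of length w and extend them while bases match.

    Returns sorted (query_start, subject_start, length) tuples, 0-based.
    """
    # (1) Lookup table: word -> positions in the query.
    lookup = {}
    for i in range(len(query) - w + 1):
        lookup.setdefault(query[i:i + w], []).append(i)

    hits = set()
    # (2) Scan the subject for exact matches to any query word.
    for j in range(len(subject) - w + 1):
        for i in lookup.get(subject[j:j + w], []):
            # (3) Extend to the left and right while the bases are identical.
            qs, ss = i, j
            while qs > 0 and ss > 0 and query[qs - 1] == subject[ss - 1]:
                qs -= 1
                ss -= 1
            qe, se = i + w, j + w
            while qe < len(query) and se < len(subject) and query[qe] == subject[se]:
                qe += 1
                se += 1
            hits.add((qs, ss, qe - qs))
    return sorted(hits)


query = "ATCGCCATGCTTAATTGGGCTT"    # query from the slide
subject = "TTTCATGCTTAATTAAA"       # made-up subject containing CATGCTTAATT
print(seed_and_extend(query, subject))
```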
An alignment that BLAST can’t find
1 GAATATATGAAGACCAAGATTGCAGTCCTGCTGGCCTGAACCACGCTATTCTTGCTGTTG
|| | || || || | || || ||
|| | ||| |||||| | | || | ||| |
1 GAGTGTACGATGAGCCCGAGTGTAGCAGTGAAGATCTGGACCACGGTGTACTCGTTGTCG
61 GTTACGGAACCGAGAATGGTAAAGACTACTGGATCATTAAGAACTCCTGGGGAGCCAGTT
| || ||
|| ||| || | |||||| || | |||||| ||||| |
|
61 GCTATGGTGTTAAGGGTGGGAAGAAGTACTGGCTCGTCAAGAACAGCTGGGCTGAATCCT
121 GGGGTGAACAAGGTTATTTCAGGCTTGCTCGTGGTAAAAAC
|||| || ||||| || ||
| | |||| || |||
121 GGGGAGACCAAGGCTACATCCTTATGTCCCGTGACAACAAC
Here no exact word match is longer than 6 bases; for nucleotides
there must be an exact match of at least 7.
An Alignment BLAST Can Make
Solution: compare protein sequences; BLASTX
BLAST 2 Sequences (blastx) output:
Score = 290 bits (741), Expect = 7e-77
Identities = 147/331 (44%), Positives = 206/331 (61%), Gaps = 8/331 (2%)
Frame = +3
Nucleotide vs. Protein BLAST
Comparing ADSS from H. sapiens and A. thaliana
            aaccgggtgacggtggtgctcggtgcgcagtggggcgacgaaggc
H.sapiens:   N  R  V  T  V  V  L  G  A  Q  W  G  D  E  G
             +  +  V  +     V  L  G     Q  W  G  D  E  G
A.thaliana:  S  Q  V  S  G  V  L  G  C  Q  W  G  D  E  G
            agtcaagtatctggtgtactcggttgccaatggggagatgaaggt
BLASTn finds no match, because there are no 7 bp words
BLASTp finds three matching words
Protein searches are generally
more sensitive than nucleotide searches.
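The ADSS comparison above can be reproduced with a short script that translates both DNA fragments and measures the longest exact word they share. The helper names are assumptions for this sketch, and the codon table is the standard genetic code built from the usual TCAG ordering.

```python
from itertools import product

# Standard genetic code, codons enumerated in t, c, a, g order.
BASES = "tcag"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Translate lowercase DNA in frame 1."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def longest_shared_word(a, b):
    """Length of the longest exact substring common to a and b (brute force)."""
    best = 0
    for i in range(len(a)):
        for j in range(len(b)):
            k = 0
            while i + k < len(a) and j + k < len(b) and a[i + k] == b[j + k]:
                k += 1
            best = max(best, k)
    return best


human_dna = "aaccgggtgacggtggtgctcggtgcgcagtggggcgacgaaggc"
plant_dna = "agtcaagtatctggtgtactcggttgccaatggggagatgaaggt"
human_pep, plant_pep = translate(human_dna), translate(plant_dna)

print(human_pep, plant_pep)
print("longest shared DNA word:", longest_shared_word(human_dna, plant_dna))
shared_3mers = sorted({human_pep[i:i + 3] for i in range(len(human_pep) - 2)}
                      & {plant_pep[i:i + 3] for i in range(len(plant_pep) - 2)})
print("identical protein 3-mers:", shared_3mers)
```

The longest shared DNA word falls short of the 7-base minimum, while the peptides still share several identical words of length 3, before neighborhood words are even considered.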
The Flavors of BLAST
• Standard BLAST
– traditional “contiguous” word hit
– position independent scoring
– nucleotide, protein and translations (blastn, blastp,
blastx, tblastn, tblastx)
• Megablast
– optimized for large batch searches
– can use discontiguous words
• PSI-BLAST
– constructs PSSMs automatically; uses as query
– very sensitive protein search
• RPS BLAST
– searches a database of PSSMs
– tool for conserved domain searches
Megablast: NCBI’s Genome Annotator
• Long alignments for similar DNA sequences
• Concatenation of query sequences
• Faster than blastn
• Contiguous Megablast
– exact word match
– Word size 28
• Discontiguous Megablast
– initial word hit with mismatches
– cross-species comparison
MegaBLAST
AI217550
AI251192
AI254381
BE645079

C:\seq\hs.4.fsa
> 1133045 gnl|UG|Hs#S1133045 qd43b11.x1 Homo sapiens cDNA, 3' end
CATGTAAGCCATTTATTGGTTTGTTTTAAAAATATGTATTTTATTTATACATGAAGTTTG
GTGAGAAGTGCTCGATTAGTTCAGACAACATCTGGCACTTGATGTCTGTCCTTCCCTCCT
TTTTCCTACTCTCTTCTCCCCTCCTGCTGGTCATTGTGCAGTTCTGGAAATTAAAAAGGT
GACAGCCAGGCTAAAAGCTAAGGGTTGGGTCTAGCTCACCTCCCACCCCCAACCACACCG
TCTGCAGCCAGCCCCAGGCACCTGTCTCAAAGCTCCCGGGCTGTCCACACACACAAAAAC
CACAGTCTCCTTCCGGCCAGCTGGGCTGGCAGCCCGACCTGC
> 1141828 gnl|UG|Hs#S1141828 qv37f11.x1 Homo sapiens cDNA, 3' end
GAGAAGACGACAGAAGGGGAGAAGAGAGTAGGAAAAAGGAGGGAAGGACAGACATCAAGT
GCCAGATGTTGTCTGAACTAATCGAGCACTTCTCACCAAACTTCATGTATAAATAAAATA
CATATTTTTAAAACAAACCAATAAATGGCTTACATCAAAAAAAAAAAAAAAAAAAAAAAA
GTCGTATCGATGT
> 1145899 gnl|UG|Hs#S1145899 qv33c06.x1 Homo sapiens cDNA, 3' end
GAGAAGACGACAGAAGGGGAGAAGAGAGTAGGAAAAAGGAGGGAAGGACAGACATCAAGT
GCCAGATGTTGTCTGAACTAATCGAGCACTTCTCACCAAACTTCATGTATAAATAAAATA
CATATTTTTAAAACAAACCAATAAATGGCTTACATCAAAAAAAAAAAAAAAAAAAAAAAA
GTCGTATCGATGT
> 2291670 gnl|UG|Hs#S2291670 7e65f04.x1 Homo sapiens cDNA, 3' end
TTTCATGTAAGCCATTTATTGGTTTGTTTTAAAAATATGTATTTTATTTATACATGAAGT
TTGGTGAGAAGTGCTCGATTAGTTCAAACAACATCTGGCACTTGATGTCTGTCCTTCCCT
CCTTTTTCCTACTCTCTTCTCCCCTCCTGCTGGTCATTGTGCAGTTCTGGAAATTAAAAA
GGTGACAGCCAGGCTAAAAGCTAAGGGTTGGGTCTAGCTCACCTCCCACCCCCAACCACA
CCGTCTGCAGCCAGCCCCAGGCACCTGTCTCAAAGCTCCCGGGCTGTCCACACACACAAA
AACCACAGTCTCCTTCCGGCCAGCTGGGCTGGCAGCCCGACCTGCCTCCCAACCGCATTC
CTGCCTGTGTAGCAGGCGGTGAGCACCCAGAAGGGGCACATACCTCTCCAAGCCTTGAAA
GCAAAGCATGGAGATCTACAAAAATAGGATTTCCACTTGGAGAAATGTCGCTGGGACAGT
Templates for Discontiguous Words
W = 11, t = 16, coding:      1101101101101101
W = 11, t = 16, non-coding:  1110010110110111
W = 12, t = 16, coding:      1111101101101101
W = 12, t = 16, non-coding:  1110110110110111
W = 11, t = 18, coding:      101101100101101101
W = 11, t = 18, non-coding:  111010010110010111
W = 12, t = 18, coding:      101101101101101101
W = 12, t = 18, non-coding:  111010110010110111
W = 11, t = 21, coding:      100101100101100101101
W = 11, t = 21, non-coding:  111010010100010010111
W = 12, t = 21, coding:      100101101101100101101
W = 12, t = 21, non-coding:  111010010110010010111
W = word size; # matches in template
t = template length (window size within which the word match is evaluated)
Reference: Ma, B, Tromp, J, Li, M. PatternHunter: faster and more sensitive homology
search. Bioinformatics March, 2002; 18(3):440-5
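A template is applied by keeping only the bases under a 1 and ignoring those under a 0. The sketch below shows this for one window and checks that each template has W ones and length t. The function name spaced_word and the 16-base example window are assumptions for illustration.

```python
# Templates from the slide, keyed by (W, t, kind).
TEMPLATES = {
    (11, 16, "coding"): "1101101101101101",
    (11, 16, "non-coding"): "1110010110110111",
    (12, 16, "coding"): "1111101101101101",
    (12, 16, "non-coding"): "1110110110110111",
    (11, 18, "coding"): "101101100101101101",
    (11, 18, "non-coding"): "111010010110010111",
    (12, 18, "coding"): "101101101101101101",
    (12, 18, "non-coding"): "111010110010110111",
    (11, 21, "coding"): "100101100101100101101",
    (11, 21, "non-coding"): "111010010100010010111",
    (12, 21, "coding"): "100101101101100101101",
    (12, 21, "non-coding"): "111010010110010010111",
}


def spaced_word(window, template):
    """Keep the bases of `window` that sit under a '1' in `template`."""
    if len(window) != len(template):
        raise ValueError("window and template must have the same length")
    return "".join(base for base, bit in zip(window, template) if bit == "1")


# W counts the 1s in a template, t is the template length.
for (w, t, kind), template in TEMPLATES.items():
    assert template.count("1") == w and len(template) == t

example = spaced_word("ACGTACGTACGTACGT", TEMPLATES[(11, 16, "coding")])
print(example)
```

In the 16-base coding template every third position is a 0, so a substitution at those positions (for example, at a codon's third base) does not prevent the initial word hit.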
Scoring Systems - Nucleotides
Identity matrix
      A    G    C    T
A    +1   –3   –3   –3
G    –3   +1   –3   –3
C    –3   –3   +1   –3
T    –3   –3   –3   +1

CAGGTAGCAAGCTTGCATGTCA
|| ||||||||||||  |||||
CACGTAGCAAGCTTG-GTGTCA

raw score = 19-9 = 10
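The raw score above can be recomputed column by column. In this sketch the match and mismatch values come from the identity matrix; the gap penalty of -3 is an assumption inferred from the slide's arithmetic (19 matches, and 9 points lost over two mismatches and one gap).

```python
def raw_score(top, bottom, match=1, mismatch=-3, gap=-3):
    """Sum per-column scores of a gapped nucleotide alignment."""
    score = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            score += gap          # assumed gap penalty, see lead-in
        elif a == b:
            score += match
        else:
            score += mismatch
    return score


top = "CAGGTAGCAAGCTTGCATGTCA"
bottom = "CACGTAGCAAGCTTG-GTGTCA"
matches = sum(a == b for a, b in zip(top, bottom))
print(matches, raw_score(top, bottom))
```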
Scoring Systems - Proteins
Position Independent Matrices
PAM Matrices (Percent Accepted Mutation)
• Derived from observation; small dataset of
alignments
• Implicit model of evolution
• All calculated from PAM1
• PAM250 widely used
BLOSUM Matrices (BLOck SUbstitution Matrices)
• Derived from observation; large dataset of highly
conserved blocks
• Each matrix derived separately from blocks with a
defined percent identity cutoff
• BLOSUM62 - default matrix for BLAST
Position Specific Score Matrices (PSSMs)
PSI- and RPS-BLAST
BLOSUM62
    A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V  X
A   4
R  -1  5
N  -2  0  6
D  -2 -2  1  6
C   0 -3 -3 -3  9
Q  -1  1  0  0 -3  5
E  -1  0  0  2 -4  2  5
G   0 -2  0 -1 -3 -2 -2  6
H  -2  0  1 -1 -3  0  0 -2  8
I  -1 -3 -3 -3 -1 -3 -3 -4 -3  4
L  -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4
K  -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5
M  -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5
F  -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6
P  -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7
S   1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4
T   0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5
W  -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11
Y  -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7
V   0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4
X   0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2  0  0 -2 -1 -1 -1
• Common amino acids have low weights
• Rare amino acids have high weights
• Negative for less likely substitutions
• Positive for more likely substitutions
Gapped Alignments
• Gapping provides more biologically realistic alignments
• Statistical behavior is not completely understood for gapped alignments
• Gapped BLAST parameters must be found by simulations for each matrix
Gap costs: -(a + bk)
a = gap open penalty, b = gap extend penalty, k = number of residues in the gap
For example: a gap of 1 residue receives the score "-(a + b)".
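The affine gap cost can be written as a one-line function. This is a small sketch; the values a = 11 and b = 1 are assumptions chosen for illustration (they are commonly paired with BLOSUM62), and `gap_score` is a hypothetical name.

```python
def gap_score(k: int, a: int, b: int) -> int:
    """Score of a gap of k residues: -(a + b*k), with a = open and b = extend penalty."""
    if k < 1:
        raise ValueError("a gap must span at least one residue")
    return -(a + b * k)


# Assumed penalties for illustration: a = 11, b = 1
print(gap_score(1, a=11, b=1))  # -12, i.e. -(a + b)
print(gap_score(3, a=11, b=1))  # -14
```

Because the opening penalty is paid once, one gap of 3 residues (-14) costs far less than three separate gaps of 1 residue (-36).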
Scores
Simply add the scores
for each pair of aligned residues
and (as necessary) factor in the gaps!
            V    D    S    -    C    Y
            V    E    T    L    C    F
BLOSUM62   +4   +2   +1  -12   +9   +3   = 7
PAM30      +7   +2    0  -10  +10   +2   = 11
Different matrices produce different scores!
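The two totals can be reproduced with a short sketch. The pair scores are the ones printed on the slide; the gap penalties (open 11 / extend 1 for BLOSUM62, open 9 / extend 1 for PAM30) are assumptions chosen so that a one-residue gap costs 12 and 10 as shown. The function names are hypothetical, and only the pairs needed for this example are stored.

```python
# Pair scores taken from the slide (only the pairs used in this example)
BLOSUM62_SUBSET = {("V", "V"): 4, ("D", "E"): 2, ("S", "T"): 1, ("C", "C"): 9, ("Y", "F"): 3}
PAM30_SUBSET = {("V", "V"): 7, ("D", "E"): 2, ("S", "T"): 0, ("C", "C"): 10, ("Y", "F"): 2}


def pair_score(x: str, y: str, matrix: dict) -> int:
    """Look up a symmetric substitution score."""
    return matrix[(x, y)] if (x, y) in matrix else matrix[(y, x)]


def alignment_score(top: str, bottom: str, matrix: dict,
                    gap_open: int, gap_extend: int) -> int:
    """Add pair scores; charge each run of gap columns -(gap_open + gap_extend * length).

    Simplification: adjacent gap columns are treated as one gap even if they
    alternate between the two sequences.
    """
    score, gap_len = 0, 0
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            gap_len += 1
            continue
        if gap_len:
            score -= gap_open + gap_extend * gap_len
            gap_len = 0
        score += pair_score(x, y, matrix)
    if gap_len:  # gap running to the end of the alignment
        score -= gap_open + gap_extend * gap_len
    return score


print(alignment_score("VDS-CY", "VETLCF", BLOSUM62_SUBSET, 11, 1))  # 7
print(alignment_score("VDS-CY", "VETLCF", PAM30_SUBSET, 9, 1))      # 11
```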
Matrix differences
PAM
• Built from global alignments
• Built from a small amount of data
• Based on minimum replacement or maximum parsimony
• Better for finding global alignments and remote homologs
• Higher PAM series means more divergence
BLOSUM
• Built from local alignments
• Built from a vast amount of data
• Based on groups of related sequences counted as one
• Better for finding local alignments
• Lower BLOSUM series means more divergence
Matrices - Rules of thumb
Need different levels of sensitivity?
– Close relationships: low PAM number (e.g. PAM 1) or high BLOSUM number (e.g. BLOSUM 80)
– Distant relationships: high PAM number (e.g. PAM 250) or low BLOSUM number (e.g. BLOSUM 45)
Local Alignment Statistics
High scores of local alignments between two random sequences
follow the Extreme Value Distribution
Expect Value
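E = number of database hits you expect to find by chance
[Figure: distribution of alignment scores; x-axis: Score, y-axis: Alignments]
$E = K m n e^{-\lambda S}$
$E = m n 2^{-S'}$
E = expected number of random hits
S = your score (raw score)
m = query length, n = size of database
K = scale for search space
λ = scale for scoring system
S' = bit score: $S' = (\lambda S - \ln K) / \ln 2$
(applies to ungapped alignments)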
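The two forms of the E-value formula are equivalent, which a short numerical sketch can confirm. The parameter values λ = 0.267 and K = 0.041 are assumptions used only for illustration (they depend on the matrix and gap costs), and m and n are hypothetical search-space sizes.

```python
import math


def bit_score(raw: float, lam: float, k: float) -> float:
    """S' = (lambda * S - ln K) / ln 2"""
    return (lam * raw - math.log(k)) / math.log(2)


def e_value_from_raw(raw: float, m: int, n: int, lam: float, k: float) -> float:
    """E = K * m * n * exp(-lambda * S)"""
    return k * m * n * math.exp(-lam * raw)


def e_value_from_bits(bits: float, m: int, n: int) -> float:
    """E = m * n * 2^(-S')"""
    return m * n * 2.0 ** (-bits)


LAMBDA, K = 0.267, 0.041     # assumed values, for illustration only
m, n = 250, 1_000_000_000    # hypothetical query length and database size
raw = 97

bits = bit_score(raw, LAMBDA, K)
print(round(bits, 1))                               # 42.0
print(e_value_from_raw(raw, m, n, LAMBDA, K))
print(e_value_from_bits(bits, m, n))                # same value, up to rounding
```

The bit score folds K and λ into the score itself, so bit scores from searches with different scoring systems can be compared directly.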
WWW
BLAST
The BLAST homepage
Standard databases
Specialized Databases
BLAST Databases: Nucleic Acid
• nr (nt)
– Traditional GenBank
– NM_ and XM_ RefSeqs
• refseq_rna
• refseq_genomic
– NC_ RefSeqs
• dbest
– EST Division
• est_human, mouse, others
• htgs
– HTG division
• gss
– GSS division
• wgs
– whole genome shotgun
• env_nt
– environmental samples
Options for Advanced Blasting: Nucleotide
Example Entrez Queries
nucleotide all[Filter] NOT mammalia[Organism]
green plants[Organism]
biomol_mrna[Properties]
biomol_genomic[Properties]
Other Advanced
-W 7        word size
-e 10000    expect value
-v 2000     descriptions
-b 2000     alignments
BLAST Databases: Non-redundant protein
• nr (non-redundant protein sequences)
– GenBank CDS translations
– NP_ RefSeqs
– Outside Protein: PIR, Swiss-Prot, PRF
– PDB (sequences from structures)
• pat: protein patents
• env_nr: environmental samples
Advanced Options: Filter
all[Filter] NOT mammals[Organism]
gene_in_mitochondrion[Properties]
2003:2005 [Modification Date]
tpa[Filter]
Nucleotide
Default setting
biomol_mrna[Properties]
biomol_genomic[Properties]
Hides low complexity
for initial word hits only
Masks regions of query
in lower case (pre-masked)
BLAST Formatting Page
BLAST Output: Graphic
Sort by taxonomy
mouse over
BLAST Output: Descriptions
Sorted by e values
3 x 10^-12
link to entrez
Gene Linkout
Default e value cutoff 10
TaxBLAST: Taxonomy Reports
BLAST Output: Alignments
>gi|127552|sp|P23367|MUTL_ECOLI DNA mismatch repair protein mutL
Length = 615
Score = 42.0 bits (97), Expect = 3e-04
Identities = 26/59 (44%), Positives = 33/59 (55%), Gaps = 9/59 (15%)
Query  9   LPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHF-----LHE---ESILEV-QQHIESKL 58
           L  +  P   L LEI P  VDVNVHP KHEV F     +H+   + +L V QQ +E+ L
Sbjct  280 LGADQQPAFVLYLEIDPHQVDVNVHPAKHEVRFHQSRLVHDFIYQGVLSVLQQQLETPL 338
Middle line: letter = identical match; "+" = positive score (conservative substitution); blank = negative substitution; "-" in a sequence = gap
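The Identities and Gaps figures in the MutL hit can be recomputed directly from the two aligned strings. This is a small sketch with a hypothetical function name; Positives is left out because it needs the full BLOSUM62 matrix.

```python
def alignment_stats(query: str, subject: str) -> tuple:
    """Return (identities, gap columns, alignment length) for two aligned strings."""
    if len(query) != len(subject):
        raise ValueError("aligned sequences must have the same length")
    identities = sum(1 for q, s in zip(query, subject) if q == s and q != "-")
    gaps = sum(1 for q, s in zip(query, subject) if q == "-" or s == "-")
    return identities, gaps, len(query)


query = "LPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHF-----LHE---ESILEV-QQHIESKL"
subject = "LGADQQPAFVLYLEIDPHQVDVNVHPAKHEVRFHQSRLVHDFIYQGVLSVLQQQLETPL"

identities, gaps, length = alignment_stats(query, subject)
print(f"Identities = {identities}/{length} ({identities / length:.0%})")  # 26/59 (44%)
print(f"Gaps = {gaps}/{length} ({gaps / length:.0%})")                    # 9/59 (15%)
```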
BLAST Output: Alignments
>gi|730028|sp|P40692|MLH1_HUMAN DNA mismatch repair protein Mlh1
Length = 756
Score = 233 bits (593), Expect = 8e-62
Identities = 117/131 (89%), Positives = 117/131 (89%)
Query: 1   IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESILERVQQHIESKLL 60
           IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESILERVQQHIESKLL
Sbjct: 276 IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESILERVQQHIESKLL 335
Query: 61  GSNSSRMYFTQTLLPGLAGPSGEMVKXXXXXXXXXXXXXXDKVYAHQMVRTDSREQKLDA 120
           GSNSSRMYFTQTLLPGLAGPSGEMVK              DKVYAHQMVRTDSREQKLDA
Sbjct: 336 GSNSSRMYFTQTLLPGLAGPSGEMVKSTTSLTSSSTSGSSDKVYAHQMVRTDSREQKLDA 395
Query: 121 FLQPLSKPLSS 131
           FLQPLSKPLSS
Sbjct: 396 FLQPLSKPLSS 406
X = low complexity sequence filtered
Neighbors: Precomputed BLAST
Nucleotide
Protein
Entrez Related Sequences produces a list of sequences sorted by
BLAST score, but with no alignment details.
Blink – Protein BLAST Alignments
• Lists only 200 hits
• List is nonredundant
PSI-BLAST: Position-Specific Iterated BLAST
• Mining for protein domains
• Confirming relationships among
related proteins
Position-Specific Scoring Matrix (PSSM)
Pos  Res   A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
206  D     0  -2   0   2  -4   2   4  -4  -3  -5  -4   0  -2  -6   1   0  -1  -6  -4  -1
207  G    -2  -1   0  -2  -4  -3  -3   6  -4  -5  -5   0  -2  -3  -2  -2  -1   0  -6  -5
208  V    -1   1  -3  -3  -5  -1  -2   6  -1  -4  -5   1  -5  -6  -4   0  -2  -6  -4  -2
209  I    -3   3  -3  -4  -6   0  -1  -4  -1   2  -4   6  -2  -5  -5  -3   0  -1  -4   0
210  S    -2  -5   0   8  -5  -3  -2  -1  -4  -7  -6  -4  -6  -7  -5   1  -3  -7  -5  -6
211  S     4  -4  -4  -4  -4  -1  -4  -2  -3  -3  -5  -4  -4  -5  -1   4   3  -6  -5  -3
212  C    -4  -7  -6  -7  12  -7  -7  -5  -6  -5  -5  -7  -5   0  -7  -4  -4  -5   0  -4
213  N    -2   0   2  -1  -6   7   0  -2   0  -6  -4   2   0  -2  -5  -1  -3  -3  -4  -3
214  G    -2  -3  -3  -4  -4  -4  -5   7  -4  -7  -7  -5  -4  -4  -6  -3  -5  -6  -6  -6
215  D    -5  -5  -2   9  -7  -4  -1  -5  -5  -7  -7  -4  -7  -7  -5  -4  -4  -8  -7  -7
216  S    -2  -4  -2  -4  -4  -3  -3  -3  -4  -6  -6  -3  -5  -6  -4   7  -2  -6  -5  -5
217  G    -3  -6  -4  -5  -6  -5  -6   8  -6  -8  -7  -5  -6  -7  -6  -4  -5  -6  -7  -7
218  G    -3  -6  -4  -5  -6  -5  -6   8  -6  -7  -7  -5  -6  -7  -6  -2  -4  -6  -7  -7
219  P    -2  -6  -6  -5  -6  -5  -5  -6  -6  -6  -7  -4  -6  -7   9  -4  -4  -7  -7  -6
220  L    -4  -6  -7  -7  -5  -5  -6  -7   0  -1   6  -6   1   0  -6  -6  -5  -5  -4   0
221  N    -1  -6   0  -6  -4  -4  -6  -6  -1   3   0  -5   4  -3  -6  -2  -1  -6  -1   6
222  C     0  -4  -5  -5  10  -2  -5  -5   1  -1  -1  -5   0  -1  -4  -1   0  -5   0   0
223  Q     0   1   4   2  -5   2   0   0   0  -4  -2   1   0   0   0  -1  -1  -3  -3  -4
224  A    -1  -1   1   3  -4  -1   1   4  -3  -4  -3  -1  -2  -2  -3   0  -2  -2  -2  -3
Callouts in the original figure:
• Serine is scored differently in these two positions.
• Active site nucleophile
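Scoring against a PSSM works like scoring against a substitution matrix, except that the row is chosen by position rather than by the query residue. The sketch below uses rows 214 to 219 transcribed from the PSSM on this slide; the function name `pssm_score` and the variant segment are hypothetical.

```python
COLUMNS = "ARNDCQEGHILKMFPSTWYV"  # column order of the PSSM

# Rows 214-219 transcribed from the slide (query residues G D S G G P)
PSSM_ROWS = {
    214: [-2, -3, -3, -4, -4, -4, -5, 7, -4, -7, -7, -5, -4, -4, -6, -3, -5, -6, -6, -6],
    215: [-5, -5, -2, 9, -7, -4, -1, -5, -5, -7, -7, -4, -7, -7, -5, -4, -4, -8, -7, -7],
    216: [-2, -4, -2, -4, -4, -3, -3, -3, -4, -6, -6, -3, -5, -6, -4, 7, -2, -6, -5, -5],
    217: [-3, -6, -4, -5, -6, -5, -6, 8, -6, -8, -7, -5, -6, -7, -6, -4, -5, -6, -7, -7],
    218: [-3, -6, -4, -5, -6, -5, -6, 8, -6, -7, -7, -5, -6, -7, -6, -2, -4, -6, -7, -7],
    219: [-2, -6, -6, -5, -6, -5, -5, -6, -6, -6, -7, -4, -6, -7, 9, -4, -4, -7, -7, -6],
}


def pssm_score(segment: str, start: int) -> int:
    """Ungapped score of a segment placed at PSSM position `start`."""
    total = 0
    for offset, residue in enumerate(segment):
        row = PSSM_ROWS[start + offset]          # the row depends on the position
        total += row[COLUMNS.index(residue)]     # the column depends on the residue
    return total


print(pssm_score("GDSGGP", 214))  # 48
print(pssm_score("GDAGGP", 214))  # 39: alanine at position 216 scores -2 instead of 7
```

With BLOSUM62 a serine aligned to serine always scores 4; in this PSSM the same residue scores 7 at position 216 because that column is highly conserved.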
Position Specific Iterative BLAST:
PSI-BLAST
Create your own PSSM:
Finding protein families
based on your own sequence.
[Figure: PSI-BLAST cycle; labels: query, PSSM, BLOSUM62, Alignment]
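To illustrate how an alignment can be turned into a PSSM, here is a deliberately simplified sketch: each column's residue frequencies are converted to log-odds scores against a background. It assumes uniform background frequencies, a flat pseudocount, and no sequence weighting, so it is not the procedure PSI-BLAST itself uses. The toy alignment and the function name `build_pssm` are hypothetical.

```python
import math
from collections import Counter

ALPHABET = "ARNDCQEGHILKMFPSTWYV"


def build_pssm(aligned: list, pseudocount: float = 1.0) -> list:
    """Return one {residue: score} dict per column of an ungapped alignment."""
    background = 1.0 / len(ALPHABET)  # assumption: uniform background frequencies
    pssm = []
    for column in zip(*aligned):
        counts = Counter(res for res in column if res in ALPHABET)
        total = sum(counts.values()) + pseudocount * len(ALPHABET)
        row = {}
        for res in ALPHABET:
            freq = (counts[res] + pseudocount) / total
            # log-odds in half-bit units, rounded to an integer
            row[res] = round(2 * math.log2(freq / background))
        pssm.append(row)
    return pssm


# Hypothetical toy alignment of four sequences
toy_alignment = ["GDSGGP", "GDSGGP", "GDSGAP", "GNSGGP"]
toy_pssm = build_pssm(toy_alignment)

print(toy_pssm[0]["G"], toy_pssm[0]["A"])                    # 4 -1
print(toy_pssm[1]["D"], toy_pssm[1]["N"], toy_pssm[1]["A"])  # 3 1 -1
```

Conserved columns give one high score and low scores elsewhere, while variable columns spread the scores out. That is what makes a PSSM more sensitive than a single position-independent matrix.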
PSI-BLAST
>gi|113340|sp|P03958|ADA_MOUSE ADENOSINE DEAMINASE (ADENOSINE
MAQTPAFNKPKVELHVHLDGAIKPETILYFGKKRGIALPADTVEELRNIIGMDKPLSLPGF
VIAGCREAIKRIAYEFVEMKAKEGVVYVEVRYSPHLLANSKVDPMPWNQTEGDVTPDDVVD
EQAFGIKVRSILCCMRHQPSWSLEVLELCKKYNQKTVVAMDLAGDETIEGSSLFPGHVEAY
RTVHAGEVGSPEVVREAVDILKTERVGHGYHTIEDEALYNRLLKENMHFEVCPWSSYLTGA
VRFKNDKANYSLNTDDPLIFKSTLDTDYQMTKKDMGFTEEEFKRLNINAAKSSFLPEEEKK
e value cutoff for PSSM
RESULTS: Initial BLASTP
Same results as protein-protein BLAST
Results of First PSSM Search
Other purine nucleotide metabolizing enzymes not found by ordinary BLAST
Third PSSM Search: Convergence
Just below threshold, another
nucleotide metabolism enzyme
Check to add to PSSM
Reverse Position Specific Iterative-BLAST
(a.k.a. RPS-BLAST or CDD Search)
A sequence search of
the Conserved Domain Database (CDD)
containing curated Position-Specific Scoring Matrices.
        10        20        30        40        50        60
....*....|....*....|....*....|....*....|....*....|....*....|
KWEIPREDLTLGKKLGEGAFGEVYKGTLKGkgd---nkSIDVAVKTLKEDASEeqIKEFL
aWEIPRESLRLEVKLGQGCFGEVWMGTWNG--------TTRVAIKTLKPGTMS--PEAFL
RWELPRDRLVLgkPLGEGAFGQVYLAEAIglgkdkpnrvTKVAVKMLKSDAtedkLSLDI
GWALNMKELKLlqTIGKGEFGDVMLGDYRg---------NKVAVKCIKNDAt---AQAFL
KYEIPRTDLTLkhKLGGGQYGEVYEGVWKky-------sLTVAVKTLKEDTm--eVEEFL
KWEIPRSELTIlrKLGRGNFGEVFYGKWRn--------sIDVAVKTLREGTm--sTAAFL
Row labels in the original figure (each row starts at position 1): consensus, 1FGI A, 1BYG A, gi 125135, gi 125702, gi 1174437
PSSM Sources
Pfam    Sanger   7255
SMART   EMBL      663
COG     NCBI     4873
KOG     NCBI     4825
CD      NCBI      645
Reverse Position Specific Iterative-BLAST
(a.k.a. RPS-BLAST or CD Search)
Query: sequence
P03958
Database: PSSMs
Result: TyrKc
Questions:
• Search for p53 protein homologs and annotate them with CDD.
• Can you map the codon 72 SNP onto the 3D protein structure?
Other Areas to Cover
• Genomic Data
• Annotation
• Common Domains prediction WWW
• Other Useful Genome Browsers