Bioinformatics:
Understanding the data in databases
UBC Bioinformatics
Centre
MedGen 505,
January 9, 2003
Francis Ouellette
Director, UBC Bioinformatics Centre
Vancouver, BC, Canada
francis@cmmt.ubc.ca
http://vanbug.org
• Monthly bioinformatics seminar
• The second Thursday of every month
• Attended by academics, industry and government
types.
• Talk followed by beer and pizza.
• Tonight @ 6:00 at the Chan Centre at the BCRI
(CMMT).
• Nat Goodman, a senior research scientist
from the ISB, aka “the IT guy”
Copyright 2002 UBC Bioinformatics Centre
http://bioinformatics.ubc.ca
Bioinformatics is about
understanding how life
works. It is a hypothesis-driven
science.
In bioinformatics, we use
software tools and
biological databases to ask
questions.
At the UBC Bioinformatics
Centre (UBiC) we bring
together scientists who share
the vision of making advances
in computational biology, also
working with bench scientists
to validate the hypotheses we
are generating.
Structure
• Director
• Associate Director
• 6 adjunct faculty
• 4 more to be recruited
• Another recruitment already in progress
• Director of Operation and Strategy
• Chief Soft. Dev.
• Chief Bioinformatics
• Chief Systems
• Chief Training and Support
• Chief Web Development
UBiC: the vision
• Large Scale Bioinformatics: BLAST, IDB, PeGASys
• Basic Research: Gene Identification, Comparative Genomics, Algorithm development
• Support & Training: CBW, WWW, Workshops
Ouellette Lab projects
• Core facility: training and support
• GeneComber: an ab initio gene finding algorithm.
• IDB: the Integral DataBase system
• PeGASys: Parallel genome annotation system
• GeMS: Genomic Mutational Signature Sequences.
http://bioinformatics.ca
Canadian Bioinformatics Workshop Series
Bioinformatics
Genomics
Proteomics
Developing
the Tools
Intro
Programming
Bioinformatics is about bringing
biological themes together with
the help of computer tools and
biological databases.
Computational biology can lead
us to new insights or directions.
BLAST Result
Basic Local Alignment Search Tool
PubMed Text Neighboring
Genetic Analysis
of Cancer in
Families
The Genetic
Predisposition
to Cancer
• Common terms could indicate
similar subject matter
• Statistical method
• Weights based on term
frequencies within document and
within the database as a whole
• Some terms are better than
others
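The weighting idea in the bullets above can be sketched in a few lines. This is only an illustration, assuming a plain TF-IDF weighting and cosine similarity; it is not the actual PubMed neighboring algorithm. The first two toy documents reuse the two titles shown on the slide, and the third is a made-up contrast.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Weight each term by its frequency in the document and its rarity in the collection."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

docs = [
    "genetic analysis of cancer in families",
    "the genetic predisposition to cancer",
    "the transcriptional program of fibroblasts",   # hypothetical contrast document
]
vecs = tfidf_vectors(docs)
```

With these toy documents, the two cancer titles score as closer neighbours than the first and third, even though the latter pair shares the common word "of": rare shared terms count for more than frequent ones.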
Micro-array analysis:
Science Jan 1 1999: 83-87
The Transcriptional Program in the Response
of Human Fibroblasts to Serum
Vishwanath R. Iyer, Michael B. Eisen, Douglas T. Ross, Greg
Schuler, Troy Moore, Jeffrey C. F. Lee,
Jeffrey M. Trent, Louis M. Staudt, James Hudson Jr.,
Mark S. Boguski, Deval Lashkari, Dari Shalon,
David Botstein, Patrick O. Brown
Figure 1
Figure 4
VAST Result
Vector Alignment Search Tool
Ferredoxin
• Halobacterium marismortui
• Chlorella fusca
Computational Biology Analysis
Q  Gln  NH2-C(=O)-CH2-CH2-
R  Arg  NH2-C(=NH2+)-NH-CH2-CH2-CH2-
Structural Interactions
Other interactions occurring
within this structure (blue).
In this case Glutaminyl-tRNA
Synthetase interacting with
AMP.
Positional Cloning
Family Studies
  → (Genetic Mapping) → Chromosome Interval
  → (Physical Mapping) → Large-Insert Clones
  → (Transcript Mapping) → Candidate Genes
  → (Gene Sequencing) → Disease Mutation

Met Val Ser Leu Gln Pro Cys
ATG GTC TCA CTG CAA CCG TGT
                *
ATG GTC TCA CTG TAA CCG TGT
Met Val Ser Leu STOP
Positional Candidate Cloning
Family Studies
  → (Genetic Mapping) → Chromosome Interval
  → (Computer Search) → Candidate Genes
  → (Gene Sequencing) → Disease Mutation

Met Val Ser Leu Gln Pro Cys
ATG GTC TCA CTG CAA CCG TGT
                *
ATG GTC TCA CTG TAA CCG TGT
Met Val Ser Leu STOP
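The codon diagram on these two slides can be replayed in code. The sketch below assumes a deliberately partial codon table holding only the eight codons that appear on the slide; the names `normal` and `mutant` are labels chosen here for the two sequences, not terms from the slide.

```python
# Partial codon table: only the codons that appear on the slide.
CODONS = {
    "ATG": "Met", "GTC": "Val", "TCA": "Ser", "CTG": "Leu",
    "CAA": "Gln", "CCG": "Pro", "TGT": "Cys", "TAA": "STOP",
}

def translate(dna):
    """Translate codon by codon, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(dna) - len(dna) % 3, 3):
        residue = CODONS[dna[i:i + 3]]
        protein.append(residue)
        if residue == "STOP":
            break
    return protein

normal = "ATGGTCTCACTGCAACCGTGT"
mutant = "ATGGTCTCACTGTAACCGTGT"   # C -> T at the first base of the Gln codon

print(" ".join(translate(normal)))
print(" ".join(translate(mutant)))
```

A single base change turns the Gln codon (CAA) into a stop codon (TAA), so the mutant translation ends after Leu.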
What does it mean to do CB?
• Like to work with sequences, structures,
expression arrays, interaction of molecules
and genetic maps.
• Like the whole systems approach
• Like the IT component, and the power it
provides for crunching through lots of data
• Like clear answers
• Like to do Science
Doing CB means to be …
• Database user
• Tool user
• Database developer
• Tool developer
• Training, practicing or developing
• Doing bioinformatics experiments
Bioinformatics experiments:
Sequence → BLAST search → Alignment
Reagents (know your reagents):
• Sequence
• Databases
Method (know your methods):
• P-P: BLASTP
• N-P: BLASTX
• P-N: TBLASTN
• N-N: BLASTN
• N (P) – N (P): TBLASTX
Interpretation (do your controls):
• Similarity
• Hypothesis testing
Nature 409:452
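The method table on this slide (query type against database type) maps directly onto a lookup. The sketch below is a minimal illustration; the function name `choose_blast` and the `translate_both` flag are hypothetical names introduced here, with P standing for protein and N for nucleotide as on the slide.

```python
# (query type, database type) -> BLAST program, following the slide.
BLAST_PROGRAMS = {
    ("P", "P"): "BLASTP",
    ("N", "P"): "BLASTX",
    ("P", "N"): "TBLASTN",
    ("N", "N"): "BLASTN",
}

def choose_blast(query_type, db_type, translate_both=False):
    """Pick a BLAST program; translate_both selects the N (P) - N (P) case."""
    if translate_both:
        if (query_type, db_type) != ("N", "N"):
            raise ValueError("TBLASTX needs a nucleotide query and a nucleotide database")
        return "TBLASTX"
    return BLAST_PROGRAMS[(query_type, db_type)]

print(choose_blast("N", "P"))
```

Keeping the mapping in a dictionary makes the "know your methods" point concrete: the program follows from what the reagents are.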
Part 1. The Databases
1. GenBank: The Nucleotide Sequence Database
2. PubMed: The Bibliographic Database
3. Macromolecular Structure Databases
4. The Taxonomy Project
5. The Single Nucleotide Polymorphism Database
6. The Gene Expression Omnibus (GEO)
7. Online Mendelian Inheritance in Man (OMIM)
8. The NCBI BookShelf: Searchable Biomedical Books
9. PubMed Central (PMC)
10. The SKY/CGH Database
Part 2. Data Flow and Processing
11. Sequin: A Sequence Submission and Editing Tool
12. The Processing of Biological Sequence Data at NCBI
13. Genome Assembly and Annotation Process
Part 3. Querying and Linking the Data
14. The Entrez Search and Retrieval System
15. The BLAST Sequence Analysis Tool
16. LinkOut: Linking to External Resources from Entrez
17. The Reference Sequence (RefSeq) Project
18. LocusLink: A Directory of Genes
19. Using the Map Viewer to Explore Genomes
20. UniGene: A Unified View of the Transcriptome
21. The Clusters of Orthologous Groups (COGs)
Part 4. User Support
22. User Services: Helping You Find Your Way
23. Exercises: Using Map Viewer
Glossary
The challenge of the information space:
                                      Jan 2002
Nucleotide records                  14,976,310
Nucleotides                     15,849,921,438
Protein sequences                    1,793,850
3D structures                           16,500
Interactions                             6,181
Expression data points             >20,000,000
Human Unigene Clusters                  96,109
Maps and Complete Genomes                1,600
Different taxonomy Nodes               229,799
Human dbSNP                          4,116,188
Human RefGenes records                  17,984
bp in Human Contigs > 500 kb     1,154,596,000
PubMed records                      11,692,207
OMIM records                            13,346
The challenge of the information space:
                                      Jan 2003
Nucleotide records                  22,318,883
Nucleotides                     28,507,990,166
Protein sequences                    2,955,588
3D structures                           19,392
Interactions & complexes                 7,119
Expression data points             >40,000,000
Human Unigene Cluster                  115,523
Maps and Complete Genomes                2,698
Different taxonomy Nodes               278,402
Human dbSNP                          4,892,258
Human RefSeq records                    20,008
bp in Human Contigs > 500 kb         1,451,804
PubMed records                      12,319,105
OMIM records                            14,116
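To put the two snapshots side by side, the year-over-year growth can be computed directly from the slide figures. The sketch below copies a few of the rows as given; the helper name `growth_percent` is introduced here for illustration.

```python
# Counts copied from the two slides: (Jan 2002, Jan 2003).
counts = {
    "Nucleotide records": (14_976_310, 22_318_883),
    "Nucleotides": (15_849_921_438, 28_507_990_166),
    "Protein sequences": (1_793_850, 2_955_588),
    "3D structures": (16_500, 19_392),
    "PubMed records": (11_692_207, 12_319_105),
}

def growth_percent(before, after):
    """Percentage increase from one snapshot to the next."""
    return 100.0 * (after - before) / before

growth = {name: growth_percent(*pair) for name, pair in counts.items()}
for name, pct in growth.items():
    print(f"{name:20s} {pct:6.1f}%")
```

On these figures the sequence data grow far faster than the literature: nucleotide records rise by roughly half in a year, while PubMed records rise by only a few percent.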
Databases
• Organized array of information
• Place where you put things in, and (if all is well) you
should be able to get them out again.
• Resource for other databases and tools.
• Simplify the information space by specialization.
• Bonus: Allows you to make discoveries.
Databases
• Information system
• Query system
• Storage System
• Data
Examples shown on the slide: a list you look at, a catalogue, indexed files, boxes, SQL, GenBank flat file, PC binary files, grep, PDB file, interaction record, the UBC library, Unix text files, title of a book, Google, bookshelves, book, Entrez, SRS
“... the more closely and elegantly a model follows a real
phenomenon, the more useful it is in predicting or understanding
the natural phenomenon it mimics.”
Ostell, Wheelan & Kans on the “NCBI data model”
from “Bioinformatics, a Practical Guide to the Analysis of Genes and Proteins.”, Baxevanis and
Ouellette, Eds. 2001
Using the NCBI data model
[Diagram: PubMed (MEDLINE, full text of online journals), GenBank nucleotide sequences, protein sequences, genomes and maps, structures (MMDB, VAST; structure:function), BIND (interaction:function), expression data and SNP data, linked to one another through accession numbers.]
Primary Data
• DNA sequences
• RNA sequences
• Protein sequences
– In most cases protein sequences are
interpreted sequences.
• 3D structures
• Expression data
• Polymorphism data
• Interaction data
Databases: some examples
• Primary (archival)
– DDBJ/EMBL/GenBank
– TrEMBL
– UniProt
– PDB
– Medline
– BIND
• Secondary (curated)
– LocusLink
– RefSeq
– Taxon
– Swiss-Prot
– PROSITE
– OMIM
– SGD
– FlyBase
– GO
What is GenBank?
GenBank is the NIH genetic sequence dataset of
all publicly available DNA and derived protein
sequences, with annotations describing the
biological information these records contain.
http://www.ncbi.nlm.nih.gov/Web/Genbank/index.html
Benson et al., 2002, Nucleic Acids Res. 29:12-17
[Diagram: the three collaborating nucleotide databases, each taking in submissions and updates: GenBank (NCBI, NIH; searched with Entrez), EMBL (EBI; searched with SRS) and DDBJ (CIB, NIG; searched with getentry).]
GenBank Flat File (GBFF)
LOCUS       MUSNGH       1803 bp    mRNA            ROD       29-AUG-1997
DEFINITION  Mouse neuroblastoma and rat glioma hybridoma cell line NG108-15
            cell TA20 mRNA, complete cds.
ACCESSION   D25291
NID         g1850791
KEYWORDS    neurite extension activity; growth arrest; TA20.
SOURCE      Murinae gen. sp. mouse neuroblastma-rat glioma hybridoma
            cell_line:NG108-15 cDNA to mRNA.
  ORGANISM  Murinae gen. sp.
            Eukaryotae; mitochondrial eukaryotes; Metazoa; Chordata;
            Vertebrata; Mammalia; Eutheria; Rodentia; Sciurognathi; Muridae;
            Murinae.
REFERENCE   1  (sites)
  AUTHORS   Tohda,C., Nagai,S., Tohda,M. and Nomura,Y.
  TITLE     A novel factor, TA20, involved in neuronal differentiation: cDNA
            cloning and expression
  JOURNAL   Neurosci. Res. 23 (1), 21-27 (1995)
  MEDLINE   96064354
REFERENCE   3  (bases 1 to 1803)
  AUTHORS   Tohda,C.
  TITLE     Direct Submission
  JOURNAL   Submitted (18-NOV-1993) to the DDBJ/EMBL/GenBank databases. Chihiro
            Tohda, Toyama Medical and Pharmaceutical University, Research
            Institute for Wakan-yaku, Analytical Research Center for
            Ethnomedicines; 2630 Sugitani, Toyama, Toyama 930-01, Japan
            (E-mail:CHIHIRO@ms.toyama-mpu.ac.jp, Tel:+81-764-34-2281(ex.2841),
            Fax:+81-764-34-5057)
COMMENT     On Feb 26, 1997 this sequence version replaced gi:793764.
FEATURES             Location/Qualifiers
     source          1..1803
                     /organism="Murinae gen. sp."
                     /note="source origin of sequence, either mouse or rat, has
                     not been identified"
                     /db_xref="taxon:39108"
                     /cell_line="NG108-15"
                     /cell_type="mouse neuroblastma-rat glioma hybridoma"
     misc_signal     156..163
                     /note="AP-2 binding site"
     GC_signal       647..655
                     /note="Sp1 binding site"
     TATA_signal     694..701
     gene            748..1311
                     /gene="TA20"
     CDS             748..1311
                     /gene="TA20"
                     /function="neurite extensiion activity and growth arrest
                     effect"
                     /codon_start=1
                     /db_xref="PID:d1005516"
                     /db_xref="PID:g793765"
                     /translation="MMKLWVPSRSLPNSPNHYRSFLSHTLHIRYNNSLFISNTHLSRR
                     KLRVTNPIYTRKRSLNIFYLLIPSCRTRLILWIIYIYRNLKHWSTSTVRSHSHSIYRL
                     RPSMRTNIILRCHSYYKPPISHPIYWNNPSRMNLRGLLSRQSHLDPILRFPLHLTIYY
                     RGPSNRSPPLPPRNRIKQPNRIKLRCR"
     polyA_site      1803
BASE COUNT      507 a    458 c    311 g    527 t
ORIGIN
1 tcagtttttt tttttttttt tttttttttt tttttttttt tttttttttg ttgattcatg
61 tccgtttaca tttggtaagt tcacaggcct cagtcaacac aattggactg ctcaggaaat
121 cctccttggt gaccgcagta tacttggcct atgaacccaa gccacctatg gctaggtagg
181 agaagctcaa ctgtagggct gactttggaa gagaatgcac atggctgtat cgacatttca
241 catggtggac ctctggccag agtcagcagg ccgagggttc tcttccgggc tgctccctca
301 ctgcttgact ctgcgtcagt gcgtccatac tgtgggcgga cgttattgct atttgccttc
361 cattctgtac ggcattgcct ccatttagct ggagagggac agagcctggt tctctagggc
421 gtttccattg gggcctggtg acaatccaaa agatgagggc tccaaacacc agaatcagaa
481 ggcccagcgt atttgtaaaa acaccttctg gtgggaatga atggtacagg ggcgtttcag
541 gacaaagaac agcttttctg tcactcccat gagaaccgtc gcaatcactg ttccgaagag
601 gaggagtcca gaatacacgt gtatgggcat gacgattgcc cggagagagg cggagcccat
661 ggaagcagaa agacgaaaaa cacacccatt atttaaaatt attaaccact cattcattga
721 cctacctgcc ccatccaaca tttcatcatg atgaaacttt gggtcccttc taggagtctg
781 cctaatagtc caaatcatta caggtctttt cttagccata cactacacat cagatacaat
841 aacagccttt tcatcagtaa cacacatttg tcgagacgta aattacgggt gactaatccg
901 atatatacac gcaaacggag cctcaatatt ttttatttgc ttattccttc atgtcggacg
961 aggcttatat tatggatcat atacatttat agaaacctga aacattggag tacttctact
1021 gttcgcagtc atagccacag catttatagg ctacgtcctt ccatgaggac aaatatcatt
1081 ctgaggtgcc acagttatta caaacctcct atcagccatc ccatatattg gaacaaccct
1141 agtcgaatga atttgagggg gcttctcagt agacaaagcc accttgaccc gattcttcgc
1201 tttccacttc atcttaccat ttattatcgc ggccctagca atcgttcacc tcctcttcct
1261 ccacgaaaca ggatcaaaca acccaacagg attaaactca gatgcagata aaattccatt
1321 tcacccctac tatacatcaa agatatccta ggtatcctaa tcatattctt aattctcata
1381 accctagtat tatttttccc agacatacta ggagacccag acaactacat accagctaat
1441 ccactaaaca ccccacccca tattaaaccc gaatgatatt tcctatttgc atacgccatt
1501 ctacgctcaa tccccaataa actaggaggt gtcctagcct taatcttatc tatcctaatt
1561 ttagccctaa tacctttcct tcatacctca aagcaacgaa gcctaatatt ccgcccaatc
1621 acacaaattt tgtactgaat cctagtagcc aacctactta tcttaacctg aattgggggc
1681 caaccagtag acacccattt attatcattg gccaactagc ctccatctca tacttctcaa
1741 tcatcttaat tcttatacca atctcaggaa ttatcgaaga caaaatacta aaattatatc
1801 cat
//
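The first line of a flat-file record already carries several of the fields discussed later (locus name, length, molecule type, division, date). A minimal sketch of pulling them apart is shown below. It assumes the seven-field layout seen in the records on these slides and a simple whitespace split; LOCUS lines with a different set of fields would need a more careful parser, and `parse_locus` is a name introduced here.

```python
def parse_locus(line):
    """Split a LOCUS line on whitespace.

    Assumes the layout: LOCUS, name, length, 'bp', molecule type, division, date.
    """
    fields = line.split()
    if len(fields) != 7 or fields[0] != "LOCUS" or fields[3] != "bp":
        raise ValueError("unexpected LOCUS line layout")
    return {
        "name": fields[1],
        "length": int(fields[2]),
        "molecule": fields[4],
        "division": fields[5],
        "date": fields[6],
    }

record = parse_locus(
    "LOCUS       MUSNGH       1803 bp    mRNA            ROD       29-AUG-1997"
)
print(record["division"], record["length"])
```

The declared length can then be checked against the BASE COUNT line of the same record: 507 + 458 + 311 + 527 = 1803.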
Header
•Title
•Taxonomy
•Citation
Features (AA seq)
DNA Sequence
Abstract Syntax Notation (ASN.1)
FASTA
>
>gi|121066|sp|P03069|GCN4_YEAST GENERAL CONTROL PROTEIN GCN4
MSEYQPSLFALNPMGFSPLDGSKSTNENVSASTSTAKPMVGQLIFDKFIKTEEDPI
IKQDTPSNLDFDFALPQTATAPDAKTVLPIPELDDAVVESFFSSSTDSTPMFEYEN
LEDNSKEWTSLFDNDIPVTTDDVSLADKAIESTEEVSLVPSNLEVSTTSFLPTPVL
EDAKLTQTRKVKKPNSVVKKSHHVGKDDESRLDHLGVVAYNRKQRSIPLSPIVPES
SDPAALKRARNTEAARRSRARKLQRMKQLEDKVEELLSKNYHLENEVARLKKLVGE
R
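A FASTA record is simple enough to read by hand: a definition line starting with ">" followed by sequence lines. The sketch below is a minimal reader, not a full-featured parser. The function names are introduced here, and the defline splitter assumes the "gi|number|db|accession|locus description" layout seen in the example above; the sample text is truncated to the first two sequence lines of that record.

```python
def parse_fasta(text):
    """Yield (defline, sequence) pairs from FASTA text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def parse_ncbi_defline(defline):
    """Split a 'gi|number|db|accession|locus description' defline (assumed layout)."""
    ids, _, description = defline.partition(" ")
    parts = ids.split("|")
    return {"gi": int(parts[1]), "db": parts[2], "accession": parts[3],
            "locus": parts[4], "description": description}

# First two sequence lines of the GCN4 record shown above (truncated on purpose).
sample = """>gi|121066|sp|P03069|GCN4_YEAST GENERAL CONTROL PROTEIN GCN4
MSEYQPSLFALNPMGFSPLDGSKSTNENVSASTSTAKPMVGQLIFDKFIKTEEDPI
IKQDTPSNLDFDFALPQTATAPDAKTVLPIPELDDAVVESFFSSSTDSTPMFEYEN
"""
records = list(parse_fasta(sample))
info = parse_ncbi_defline(records[0][0])
```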
Graphical Representation
ASN.1
FASTA
MMDB
EMBL
ASN.1
Graphical
Swiss-Prot
GenBank
GenPept
Outline
• GenBank dissection
– identifiers
– divisions
– format/structure
– features
– file conversions
Organismal Divisions
Code  Division                  Used in which database?
BCT   Bacterial                 DDBJ - GenBank
FUN   Fungal                    EMBL
HUM   Homo sapiens              DDBJ - EMBL
INV   Invertebrate              all
MAM   Other mammalian           all
ORG   Organelle                 EMBL
PHG   Phage                     all
PLN   Plant                     all
PRI   Primate (also see HUM)    all (not same data in all)
PRO   Prokaryotic               EMBL
ROD   Rodent                    all
SYN   Synthetic and chimeric    all
VRL   Viral                     all
VRT   Other vertebrate          all
Functional Divisions
PAT   Patent
EST   Expressed Sequence Tags
STS   Sequence Tagged Site
GSS   Genome Survey Sequence
HTG   High Throughput Genome (unfinished)
HTC   High throughput cDNA (unfinished)
Organismal divisions:
BCT, FUN, INV, MAM, PHG, PLN, PRI, ROD, SYN, VRL, VRT
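The split between functional and organismal divisions on this slide can be written as a small lookup, which is handy when sorting records by the division code on their LOCUS line. The sketch below only uses the codes listed on this slide; `division_kind` is a name introduced here.

```python
# Codes as listed on the slide.
FUNCTIONAL = {
    "PAT": "Patent",
    "EST": "Expressed Sequence Tags",
    "STS": "Sequence Tagged Site",
    "GSS": "Genome Survey Sequence",
    "HTG": "High Throughput Genome (unfinished)",
    "HTC": "High throughput cDNA (unfinished)",
}
ORGANISMAL = {"BCT", "FUN", "INV", "MAM", "PHG", "PLN",
              "PRI", "ROD", "SYN", "VRL", "VRT"}

def division_kind(code):
    """Classify a division code as functional, organismal or unknown."""
    code = code.upper()
    if code in FUNCTIONAL:
        return "functional"
    if code in ORGANISMAL:
        return "organismal"
    return "unknown"

print(division_kind("EST"), division_kind("ROD"))
```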
Guiding Principles
In GenBank, records are grouped for
various reasons: understanding this is key to
using and taking full advantage of this
database.
LOCUS, Accession, NID and protein_id
LOCUS: Unique string of 10 letters and numbers in
the database. Not maintained amongst databases,
and is therefore a poor sequence identifier.
ACCESSION: A unique identifier to that record, citable
entity; does not change when record is updated. A good
record identifier, ideal for citation in publication.
VERSION: New system in which the accession and version together play the
same role as the accession and gi number.
Nucleotide gi: Geninfo identifier (gi), a unique integer
which will change every time the sequence changes.
PID: Protein Identifier: g, e or d prefix to gi number.
Can have one or two on one CDS.
Protein gi: Geninfo identifier (gi), a unique integer which
will change every time the sequence changes.
protein_id: Identifier which has the same
structure and function as the nucleotide Accession.version
numbers, but a slightly different format.
LOCUS, Accession, gi and PID
LOCUS       HSU40282     1789 bp    mRNA            PRI       21-MAY-1998
DEFINITION  Homo sapiens integrin-linked kinase (ILK) mRNA, complete cds.
ACCESSION   U40282
VERSION     U40282.1  GI:3150001
     CDS             157..1515
                     /gene="ILK"
                     /note="protein serine/threonine kinase"
                     /codon_start=1
                     /product="integrin-linked kinase"
                     /protein_id="AAC16892.1"
                     /db_xref="PID:g3150002"
                     /db_xref="GI:3150002"

LOCUS:        HSU40282
ACCESSION:    U40282
VERSION:      U40282.1
GI:           3150001
PID:          g3150002
Protein gi:   3150002
protein_id:   AAC16892.1
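The identifiers on this slide can be pulled out of a VERSION line with one regular expression. This is a sketch under the assumption that the line looks like the one shown here (accession, a dot, a version number, then a GI); the function names are introduced for illustration.

```python
import re

def parse_version(line):
    """Parse a line such as 'VERSION     U40282.1  GI:3150001'."""
    match = re.match(r"VERSION\s+([A-Z_]+\d+)\.(\d+)\s+GI:(\d+)", line)
    if match is None:
        raise ValueError("not a VERSION line in the expected layout")
    accession, version, gi = match.groups()
    return {"accession": accession, "version": int(version), "gi": int(gi)}

def same_record(id_a, id_b):
    """Two Accession.version strings point at the same record if the accessions match."""
    return id_a.split(".")[0] == id_b.split(".")[0]

ids = parse_version("VERSION     U40282.1  GI:3150001")
print(ids)
```

The accession part stays stable across updates, which is why it is the identifier to cite; the version (like the gi) tracks changes to the sequence itself.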
Sample GenBank mRNA Record
LOCUS       HSU40282     1789 bp    mRNA            PRI       21-MAY-1998
DEFINITION  Homo sapiens integrin-linked kinase (ILK) mRNA, complete cds.
ACCESSION   U40282
VERSION     U40282.1  GI:3150001
KEYWORDS    .
SOURCE      human.
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Mammalia;
            Eutheria; Primates; Catarrhini; Hominidae; Homo.
REFERENCE   1  (bases 1 to 1789)
  AUTHORS   Hannigan,G.E., Leung-Hagesteijn,C., Fitz-Gibbon,L., Coppolino,M.G.,
            Radeva,G., Filmus,J., Bell,J.C. and Dedhar,S.
  TITLE     Regulation of cell adhesion and anchorage-dependent growth by a new
            beta 1-integrin-linked protein kinase
  JOURNAL   Nature 379 (6560), 91-96 (1996)
  MEDLINE   96135142
REFERENCE   2  (bases 1 to 1789)
  AUTHORS   Dedhar,S. and Hannigan,G.E.
  TITLE     Direct Submission
  JOURNAL   Submitted (07-NOV-1995) Shoukat Dedhar, Cancer Biology Research,
            Sunnybrook Health Science Centre and University of Toronto, 2075
            Bayview Avenue, North York, Ont. M4N 3M5, Canada
Sample GenBank Record
FEATURES             Location/Qualifiers
     source          1..1789
                     /organism="Homo sapiens"
                     /db_xref="taxon:9606"
                     /chromosome="11"
                     /map="11p15"
                     /cell_line="HeLa"
     gene            1..1789
                     /gene="ILK"
     CDS             157..1515
                     /gene="ILK"
                     /note="protein serine/threonine kinase"
                     /codon_start=1
                     /product="integrin-linked kinase"
                     /protein_id="AAC16892.1"
                     /db_xref="PID:g3150002"
                     /db_xref="GI:3150002"
                     /translation="MDDIFTQCREGNAVAVRLWLDNTENDLNQGDDHGFSPLHWACRE
                     . . . DK"
BASE COUNT      443 a    488 c    480 g    378 t
ORIGIN
        1 gaattcatct gtcgactgct accacgggag ttccccggag aaggatcctg cagcccgagt
< ...>
     1681 ggcgggctca gagctttgtc acttgccaca tggtgtcttc caacatggga gggatcagcc
     1741 ccgcctgtca caataaagtt tattatgaaa aaaaaaaaaa aaaaaaaaa
//
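The CDS feature location gives enough information for a quick sanity check on the coding region. The sketch below handles only a simple "start..end" location (1-based, inclusive) such as the one in this record; joins, complements and partial locations are out of scope, and `cds_summary` is a name introduced here.

```python
def cds_summary(location, codon_start=1):
    """Length arithmetic for a simple 'start..end' CDS location (1-based, inclusive)."""
    start, end = (int(x) for x in location.split(".."))
    nucleotides = end - start + 1 - (codon_start - 1)
    codons, leftover = divmod(nucleotides, 3)
    return {"nucleotides": nucleotides, "codons": codons, "leftover": leftover}

ilk = cds_summary("157..1515")
# 1359 nucleotides = 453 codons with nothing left over; if the last codon
# is a stop codon, the translated protein would be 452 residues long.
print(ilk)
```

A CDS whose length is not a multiple of three (a non-zero `leftover`) is worth a second look before trusting the translation.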
EST: Expressed Sequence Tag
Expressed Sequence Tags are short
(300-500 bp) single reads from mRNA (cDNA)
which are produced in large numbers.
They represent a snapshot of what is expressed
in a given tissue at a given developmental stage.
Also see:
http://www.ncbi.nlm.nih.gov/dbEST/
http://www.ncbi.nlm.nih.gov/UniGene/
LOCUS       AA675481      524 bp    mRNA            EST       28-NOV-1997
DEFINITION  vr72d07.s1 Knowles Solter mouse 2 cell Mus musculus cDNA clone
            IMAGE:1134253 5' similar to TR:G992993 G992993 MYOSIN LIGHT CHAIN
            KINASE. ;, mRNA sequence.
ACCESSION   AA675481
VERSION     AA675481.1  GI:2652718
KEYWORDS    EST.
SOURCE      house mouse
  ORGANISM  Mus musculus
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Mammalia;
            Eutheria; Rodentia; Sciurognathi; Muridae; Murinae; Mus.
...
COMMENT     Contact: Marra M/Mouse EST Project
            WashU-HHMI Mouse EST Project
            Washington University School of Medicine
            4444 Forest Park Parkway, Box 8501, St. Louis, MO 63108
            Tel: 314 286 1800
            Fax: 314 286 1810
            Email: mouseest@watson.wustl.edu
            This clone is available royalty-free through LLNL ; contact the
            IMAGE Consortium (info@image.llnl.gov) for further information.
            MGI:615525
            Possible reversed clone: similarity on wrong strand
            High quality sequence stop: 469.
FEATURES             Location/Qualifiers
     source          1..524
                     /organism="Mus musculus"
                     /strain="B6D2 F1/J"
                     /note="Organ: embryo; Vector: pBluescribe (modified);
                     Site_1: MluI; Site_2: SalI; Cloned unidirectionally from
                     mRNA prepared from 13,500 2-cell stage embryos. Primer:
                     SalI(dT): 5'-CGGTCGACCGTCGACCGTTTTTTTTTTTTTTT-3'. cDNAs
                     were cloned into the MluI/SalI sites of a modified
                     pBluescribe vector using commercial linkers (NEB).
                     Average insert size: 1.2 kb."
                     /db_xref="taxon:10090"
                     /clone="1134253"
                     /clone_lib="Knowles Solter mouse 2 cell"
                     /tissue_type="embryo"
                     /dev_stage="2-cell"
                     /lab_host="DH10B"
BASE COUNT      168 a    111 c    115 g    130 t
ORIGIN
        1 ctcagttgta gacagtgagc cagtcagatt tactgttaaa gtaacaggag aacccaagcc
       61 ggaaattaca tggtggtttg aaggagaaat actgcaggat ggagaagact atcagtacat
      121 cgaaagaggt gaaacttact gcctgtattt accggaaacc ttcccagaag atggaggaga
      181 gtacatgtgt aaggcagtca acaataaagg ctcagcagcg agcacctgca ttcttaccat
      241 tgaaatggat gactactagg cttccctctg tccttgggac tctctctctc gctgcatctc
      301 tgtggagggg ccaaaaagga gaccagaggt gccactataa ctgacttaat ctttccccaa
      361 atcttcctct taagaacttc tcatgcatat caggttcatt accatgctgt gcaaagtcaa
      421 agcatagctg acagaaaagg gaaataaatg tacccattct gtcagaacta agacagaagc
      481 ttcgtattta tagaactaag acttaacata tacagtttgc atga
//
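The BASE COUNT line of a record can be recomputed from its ORIGIN block, which is a cheap consistency check. The sketch below assumes ORIGIN lines made of a position number followed by blocks of ten bases; the function names are introduced here, and the sample row is the last row (position 481) of the EST record, reassembled from the sequence columns shown above.

```python
from collections import Counter

def sequence_from_origin(origin_lines):
    """Drop the position numbers and spaces from ORIGIN lines."""
    return "".join(ch for line in origin_lines for ch in line if ch.isalpha())

def base_count(sequence):
    """Count a, c, g, t and lump everything else (such as n) into 'others'."""
    counts = Counter(sequence.lower())
    others = sum(n for base, n in counts.items() if base not in "acgt")
    return {"a": counts["a"], "c": counts["c"], "g": counts["g"],
            "t": counts["t"], "others": others}

# Numbers printed on the BASE COUNT line of the EST record.
declared = {"a": 168, "c": 111, "g": 115, "t": 130}
declared_total = sum(declared.values())   # should equal the bp on the LOCUS line

last_row = ["      481 ttcgtattta tagaactaag acttaacata tacagtttgc atga"]
row_counts = base_count(sequence_from_origin(last_row))
```

Here the declared counts add up to 524, matching the length on the LOCUS line; running `base_count` over the full ORIGIN block would complete the check.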
STS
Sequence Tagged Sites are operationally
unique sequences, each identifying the
combination of primer pairs used in a PCR
assay that generates a mapping reagent which
maps to a single position within the genome.
Also see: http://www.ncbi.nlm.nih.gov/dbSTS/
http://www.ncbi.nlm.nih.gov/genemap/
GSS: Genome Survey Sequences
Genome Survey Sequences are similar in nature
to ESTs, except that the sequences are genomic
in origin, rather than cDNA (mRNA).
The GSS division contains:
• random "single pass read" genome survey sequences.
• single pass reads from cosmid/BAC/YAC ends (these could
be chromosome specific, but need not be)
• exon trapped genomic sequences
• Alu PCR sequences
Also see: http://www.ncbi.nlm.nih.gov/dbGSS/
LOCUS       FR0029137     445 bp    DNA             GSS       30-JUN-1998
DEFINITION  Fugu rubripes GSS sequence, clone 037G16aE9, genomic survey
            sequence.
ACCESSION   AL031006
VERSION     AL031006.1  GI:3286795
KEYWORDS    GSS; genome survey sequence.
SOURCE      Fugu rubripes.
  ORGANISM  Fugu rubripes
            Eukaryota; Metazoa; Chordata; Vertebrata; Actinopterygii;
            Neopterygii; Teleostei; Euteleostei; Acanthopterygii; Percomorpha;
            Tetraodontiformes; Tetraodontoidei; Tetraodontidae; Fugu.
REFERENCE   1  (bases 1 to 445)
  AUTHORS   Elgar,G., Clark,M., Smith,S., Meek,S., Warner,S., Umrania,Y.,
            Williams,G. and Brenner,S.
  TITLE     Direct Submission
  JOURNAL   Submitted (09-JUN-1998) MRC Human Genome Mapping Project Resource
            Centre, Hinxton, Cambridge, CB10 1SB, UK. Email:
            biohelp@hgmp.mrc.ac.uk
COMMENT     Vector: pBluescript II KS
            V_type: phagemid
            PRIMER: KS
            DESCR:
            One pass dye-terminator sequencing of cosmid cloned genomic
            sequence.
Genome Survey Sequences
FEATURES             Location/Qualifiers
     source          1..445
                     /organism="Fugu rubripes"
                     /db_xref="taxon:31033"
                     /clone_lib="cosmid 037G16"
                     /clone="037G16aE9"
BASE COUNT      124 a     96 c     97 g    126 t      2 others
ORIGIN
        1 atcctgcagt gaggcagaac agggnctgtt tccatttttt gtctgtcagt ttaaacagtg
       61 gtcggccgta aaagtcctcc gaaaacccac aaagcctttg cctatcgttc caaatcttac
      121 atgggtaagt gcaaacattt aactcaagat aagtgccttt gagataacaa aacctctttt
      181 ttcaagagag tcttggaagc gtacacacct acagcgtagc tgtttttacc tcagatgaat
      241 gtctttggna tgagggaggg aaccagatac ctggtgaaaa cccatgcaga cttgcggaga
      301 gcactgtgaa accctctggt actgagccct gaaacttcat gttgtgaggc aacagtgctt
      361 accaaaagtt tatcctgcaa ctgctattta acttctgtta gcctctgttt tggagaccac
      421 atgagttaaa tacggtttgt tgaaa
//
HTG: High Throughput Genome
High Throughput Genome Sequences are
records of unfinished genome sequencing efforts.
Unfinished records have gaps in the
nucleotide sequence, low accuracy, and no
annotations.
Also see:
http://www.ncbi.nlm.nih.gov/HTGS/
Ouellette and Boguski (1997) Genome Res. 7:952-955
HTGS in GenBank
Phase     Acc         gi         Division
phase 1   AC000003    1556454    HTG
phase 2   AC000003    2182283    HTG
phase 3   AC000003    2204282    PRI
HTGS in GenBank
• Unfinished record
– Sequencing is unfinished
– Phase 1 or phase 2
– HTG division
– KEYWORDS: HTG; HTGS_PHASE1 or 2
• Finished record
– Sequencing is finished
– Phase 3
– Organismal division it belongs to: PRI, INV or PLN
– KEYWORDS: HTG
HTGS: phase 1
LOCUS       HSAC000003 120000 bp    DNA             HTG       20-SEP-1996
DEFINITION  *** SEQUENCING IN PROGRESS *** Chromosome 17 genomic sequence; HTGS
            phase 1, 6 unordered pieces.
ACCESSION   AC000003
KEYWORDS    HTG; HTGS_PHASE1.
...
COMMENT     *** WARNING: Phase 1 High Throughput Genome Sequence ***
            * This sequence is unfinished. It consists of 6 contigs for
            * which the order is not known; their order in this record is
            * arbitrary. In some cases, the exact lengths of the gaps
            * between the contigs are also unknown; these gaps are presented
            * as runs of N as a convenience only. When sequencing is complete,
            * the sequence data presented in this record will be replaced
            * by a single finished sequence with the same accession number.
            *        1    22526: contig of 22526 bp in length
            *    22527    23035: gap of unknown length
            *    23036    33919: contig of 10884 bp in length
            *    33920    34427: gap of unknown length
            *    34428    61877: contig of 27450 bp in length
...
//
HTGS Phase 1
            * the sequence data presented in this record will be replaced
            * by a single finished sequence with the same accession number.
            *        1    33214: contig of 33214 bp in length
            *    33215    33250: gap of unknown length
            *    33251    35134: contig of 1884 bp in length
...
gap of unknown length
    33061 ggagagcttc agggagactc tgcggaatag caggttgtaa tcttccggtt cgatagtcga
    33121 taaatgtctg gtttaccttc agccgaaacg cgggagaaat ccagcctgcg tactccacag
    33181 cgagcaattc atgggcaaaa gtgccgccgc cacgnnnnnn nnnnnnnnnn nnnnnnnnnn
    33241 nnnnnnnnnn tagttcatca ccttctggtg gaagccacat tttctctttc ctttctttcc
    33301 ctgtctaccc tccctcttcc ccttcctccc caaatctatc agtaaagacc accttgctgt
    33361 gggcagctag ctgaaagaga ccatctgcct taggaatagc ctacactaga ttcaaactac
    33421 aaagaagcag gttgggggaa agaggaagtg aggatttcaa gtcaagaaag catcctgcct
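Because gaps in unfinished records are written as runs of N, they can be located directly in the sequence and compared with the coordinates in the COMMENT block. The sketch below is a minimal illustration; `find_gaps` is a name introduced here, and the fragment is the pair of rows starting at positions 33181 and 33241, reassembled from the sequence columns on this slide.

```python
import re

def find_gaps(sequence, offset=1):
    """Return (start, end) of each run of 'n', 1-based inclusive, shifted by offset."""
    return [(m.start() + offset, m.end() + offset - 1)
            for m in re.finditer(r"n+", sequence.lower())]

# The two rows that contain the gap; the first base is position 33181.
rows = ("cgagcaattc atgggcaaaa gtgccgccgc cacgnnnnnn nnnnnnnnnn nnnnnnnnnn"
        "nnnnnnnnnn tagttcatca ccttctggtg gaagccacat tttctctttc ctttctttcc")
fragment = rows.replace(" ", "")
gaps = find_gaps(fragment, offset=33181)
print(gaps)
```

The run of N found this way spans 33215 to 33250, the same interval the COMMENT block lists as a gap of unknown length.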
HTGS phase 3
LOCUS       AC000003   122228 bp    DNA             PRI       07-OCT-1997
DEFINITION  Homo sapiens chromosome 17, clone 104H12, complete sequence.
ACCESSION   AC000003
NID         g2204282
KEYWORDS    HTG.
...
COMMENT     The Staden databases, finishing information, and all
            chromatographic files used in the assembly of this clone are
            available from our anonymous ftp site.
            All repeats were identified using RepeatMasker: Smit, A.F.A. &
            Green, P. (1996-1997)
            http://ftp.genome.washington.edu/RM/RepeatMasker.html.
FEATURES             Location/Qualifiers
     source          1..122228
                     /organism="Homo sapiens"
                     /db_xref="taxon:9606"
                     /clone="104H12"
                     /clone_lib="Research Genetics/Cal Tech CITB978SK-B (plates
                     1-194)"
                     /chromosome="17"
     repeat_region   261..370
                     /rpt_family="MLT1B"
Locus Link
http://nar.oupjournals.org/content/vol31/issue1/
Genome Projects: discussion point
• Whole genome assembly
• “Bermuda agreement”
• HTG → Finished
• What is it to be “finished”?
• 1:10,000 error rate?
• How useful is an unfinished genome?
• Reference genomes
• TPA and RefSeq
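As a starting point for the "1:10,000 error rate?" question, it helps to see what that rate means for a single record. The sketch below simply multiplies a sequence length by a per-base error rate; it assumes errors are independent and uniformly spread, which is a simplification, and `expected_errors` is a name introduced here.

```python
def expected_errors(length_bp, error_rate=1 / 10_000):
    """Expected number of erroneous bases at a given per-base error rate."""
    return length_bp * error_rate

# The finished (phase 3) AC000003 record shown earlier is 122228 bp long.
clone_errors = expected_errors(122_228)
print(f"about {clone_errors:.0f} expected errors in 122,228 bp")
```

At one error per 10,000 bases, a finished clone of that size would still be expected to carry on the order of a dozen errors.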
In Closing ...
• Able to recognize various data formats, and know what their
primary use is.
• Know, understand and utilize all types of sequence identifiers.
• Know and understand various feature types present in the
GenBank flat files.
• Know and understand the various GenBank divisions.
Resources
• WWW:
– http://www.ncbi.nlm.nih.gov
– http://www.ddbj.nig.ac.jp/
– http://www.ebi.ac.uk/
– http://www.ncbi.nlm.nih.gov/Web/Genbank/index.html
– http://www.expasy.ch/sprot/
– http://www.rcsb.org/pdb/index.html
– http://www.ncbi.nlm.nih.gov/Omim/
– http://genome-www.stanford.edu/Saccharomyces/
– http://nar.oupjournals.org/content/vol30/issue1/
– http://nar.oupjournals.org/content/vol31/issue1/