#2 - Biological Databases 8/22/07 BCB 444/544 Lecture 2

BCB 444/544
Finish: Lecture 1 - What is Bioinformatics?
Lecture 2: Biological Databases & ISU Resources
#2_Aug22
BCB 444/544 F07 ISU
Dobbs #2 - Biological Databases
8/22/07

BCB 444/544 - Website
http://bindr.gdcb.iastate.edu/bcb544
• Updated Syllabus
• Lecture & Lab Schedules (with Homework Assignments)
• Lecture PPTs & PDFs
• Lab Exercises
• Practice Exams
• Grading Policy
• Project Guidelines, etc.
• Links
• Check regularly for updates!

BCB 444/544 - Computer Lab
Meets in 1304 MBB every week
EXCEPT this week:
1st Lab meets in Library Rm 32
Current schedule: Thurs 1-3 PM
Conflicts? See Drena

Required Reading (must read before lecture)
Wed Aug 22 - for Lecture #2
• Xiong Textbook:
  • Chp 1 - Introduction
  • Chp 2 - Biological Databases
Thurs Aug 23 - for Lab #1:
• Literature Resources for Bioinformatics
  Andrea Dinkelman, see Lab Schedule for URL
Fri Aug 24
• Genomics & Its Impact on Science & Society:
  Genomics & Human Genome Project Primer
  see Lecture Schedule for URL

Assignment #1:
Tell us about you
Due: Today - Wed, Aug 22
1- Complete HW1_Aug20 for Drena

Assignment #2 (& for Fun):
DNA Interactive "Genomes"
http://www.dnai.org/c/index.html
A tutorial on genomic sequencing, gene structure, gene prediction
Howard Hughes Medical Institute (HHMI)
Cold Spring Harbor Laboratory (CSHL)
1. Take the Tour
2. Read about the Project
3. Do some Genome Mining with:
Nothing to turn in - just do it!

#1- What is Bioinformatics? (cont.)
Xiong: Chp 1
1 Introduction
What Is Bioinformatics?
Goal
Scope
Applications
Limitations
New Themes
Further Reading

1st Draft Human Genome:
"Finished" in 2001
Modified from Eric Green

Human Genome Sequencing
Two approaches:
• Public (government) - International Consortium
  (mainly 6 countries, NIH-funded in US)
  • Hierarchical cloning & BAC-to-BAC sequencing
  • Map-based assembly
• Private (industry) - Celera, Craig Venter, CEO
  • Whole genome random "shotgun" sequencing
  • Computational assembly
    (took advantage of public maps & sequences, too)
Guess which human genome they sequenced? Craig's
How many genes? ~ 20,000 (Science, May 2007)

Public Sequencing:
International Consortium
Modified from Eric Green

"Complete" Human Genome Sequence:
What next?
Modified from Eric Green

Comparison of Sequenced Genome Sizes
Plants? Many have much larger genomes than human!
from Eric Green

Next Step after the Complete Sequence?
Understanding Gene Function on a Genomic Scale
• Expression Analysis
• Structural Genomics
• Protein Interactions
• Network Analysis
• Systems Biology
Evolutionary Implications of:
• Intergenic Regions as "Gene Graveyard"
• Introns & Exons
Modified from Mark Gerstein

How can we begin to understand the complete Human Genome Sequence?
from Eric Green

Comparative Genomics:
Compare entire genomes
from Eric Green

Comparing Genomes:
Identifying functional elements
from Eric Green

Gene Expression Data:
the Transcriptome
MicroArray Data
Yeast Expression Data:
• Levels for all 6,000 genes!
• Investigate how all genes respond to changes in environment or, in humans, e.g., how patterns of RNA expression change in normal vs cancerous tissue
Modified from Mark Gerstein
ISU's Biotechnology Facilities include state-of-the-art Microarray Instrumentation

Other "Omes"
Proteome, Metabolome, Glycome, etc.
ISU has state-of-the-art Proteomics Instrumentation
ISU has state-of-the-art Metabolomics Instrumentation

How are "Omes" related?
Systems Biology seeks to integrate all of these to explain the complex behaviors of whole systems (cells, organisms, ecosystems)

Molecular Biology Information:
Integrating Data
Understanding the function of genomes requires integration of many diverse and complex types of information:
• Metabolic pathways
• Regulatory networks
• Whole organism physiology
• Evolution, phylogeny
• Environment, ecology
• Literature (MEDLINE)
Modified from Mark Gerstein

Other Genome-Scale Experiments
Systematic Knockouts:
Make "knockout" (null) mutations in every gene - one at a time - and analyze the resulting phenotypes!
For yeast: 6,000 KO mutants!
2-hybrid Experiments:
For each (and every) protein, identify every other protein with which it interacts!
For yeast: 6000 x 6000 / 2 ~ 18M interactions!!
Modified from Mark Gerstein

Storing & Analyzing Genomic Information:
Exponential Growth of Data Coupled with Development of Fast Computer Technology
• Increases in computer speed & storage capacity have been dramatic
• Improved computing resources & more efficient algorithms have been driving forces in Bioinformatics & Computational Biology
ISU's supercomputer "CyBlue" is among the 100 most powerful computers in the world!
Bioinformatics is born!
& more Bioinformaticists are needed!
Modified from Mark Gerstein
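
As a rough check on the 2-hybrid estimate for yeast (6000 x 6000 / 2, about 18M interactions), the number of distinct protein pairs can be counted directly. This is only an illustrative sketch: the function name `count_pairs` and the option to count a protein against itself are assumptions, not something stated on the slide.

```python
from math import comb

def count_pairs(n_proteins: int, include_self: bool = False) -> int:
    """Number of unordered protein pairs to test in an all-vs-all screen."""
    pairs = comb(n_proteins, 2)   # n * (n - 1) / 2 distinct pairs
    if include_self:
        pairs += n_proteins       # also test each protein against itself
    return pairs

n = 6000  # approximate number of yeast genes, as on the slide
print(count_pairs(n))                     # 17997000
print(count_pairs(n, include_self=True))  # 18003000
print(n * n // 2)                         # 18000000, the slide's approximation
```

Either way of counting gives roughly 18 million tests, so the slide's shortcut is a fair approximation.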

“Informatics” techniques used in Bioinformatics
• Databases
  • Building & querying object-oriented & relational DBs
• String Comparison
  • Text search
  • Alignment
  • Significance statistics
• Pattern Finding
  • Machine Learning
  • Data Mining
  • Statistics
  • Linguistics
• Computational Geometry
  • Robotics
  • Graphics (surfaces, volumes)
  • Comparison & 3D matching
• Simulation & Modeling
  • Newtonian mechanics
  • Electrostatics
  • Numerical algorithms
  • Simulation
  • Network modeling
  • Population modeling
Modified from Mark Gerstein
(Internet picture adapted from D Brutlag, Stanford)

Challenges in Organizing Information:
Redundancy and Multiplicity
• Different protein sequences can assume the same 3-D structure
• Organisms have many similar genes with redundant functions
• A single gene may have several different functions
• Genes & proteins function in complex genetic & regulatory pathways
• How do we organize all this information so that we can make sense of it?
Functional Genomics & Systems Biology:
sequences <> motifs <> genes <> RNAs <> proteins <> structures <> functions <> expression levels <> pathways <> regulatory networks <> functional systems
Modified from Mark Gerstein

One Strategy:
Molecular Parts = Conserved Domains
Modified from Mark Gerstein

"Parts List" approach to bike maintenance:
• Where are the parts located?
• Which are the common parts (bolt, nut, washer, spring, bearing)?
• Which are unique parts (cogs, levers)?
• How flexible and adaptable are parts mechanically?
Global surveys of a finite set of parts from different perspectives
Same logic for pathways, functions, sequence families, blocks, motifs....
Modified from Mark Gerstein

World of macromolecular structures is also finite, providing a valuable simplification
H. sapiens: ~ 20,000 genes
T. pallidum: ~ 2,000 genes
~ 2,000 folds
BUT, what actually happens inside cells or within whole organisms is very complex - providing a challenging complication!
Modified from Mark Gerstein

So, having a list of parts is not enough!
BIG QUESTION?
How do parts work together to form a functional system?
SYSTEMS BIOLOGY
What is a system? Macromolecular complex, pathway, network, cell, tissue, organism, ecosystem…

Exploring the Virtual Cell at ISU
Virtual Cell projects elsewhere...
NCBI's Bookshelf - a great resource!

So, this is Bioinformatics
What is it good for?
Just a few examples…

Designing drugs
• Understanding how proteins bind other molecules
• Structural modeling & ligand docking
• Designing inhibitors or modulators of key proteins
(Figures adapted from Olsen Group Docking Page at Scripps, Dyson NMR Group Web page at Scripps, and from Computational Chemistry Page at Cornell Theory Center.)
Modified from Mark Gerstein

Finding homologs of "new" human genes
Finding WHAT?
Homologs - "same genes" in different organisms
• Human vs Mouse vs Yeast
• Much easier to do experiments on yeast to determine function
• Often, function of an ortholog in at least one organism is known
Modified from Mark Gerstein

Best Sequence Similarity Matches to Date Between Positionally Cloned Human Genes and S. cerevisiae Proteins
Human Disease | MIM # | Human Gene | GenBank Acc# for Human cDNA | BLASTX P-value | Yeast Gene | GenBank Acc# for Yeast cDNA | Yeast Gene Description
Hereditary Non-polyposis Colon Cancer | 120436 | MSH2 | U03911 | 9.2e-261 | MSH2 | M84170 | DNA repair protein
Hereditary Non-polyposis Colon Cancer | 120436 | MLH1 | U07418 | 6.3e-196 | MLH1 | U07187 | DNA repair protein
Cystic Fibrosis | 219700 | CFTR | M28668 | 1.3e-167 | YCF1 | L35237 | Metal resistance protein
Wilson Disease | 277900 | WND | U11700 | 5.9e-161 | CCC2 | L36317 | Probable copper transporter
Glycerol Kinase Deficiency | 307030 | GK | L13943 | 1.8e-129 | GUT1 | X69049 | Glycerol kinase
Bloom Syndrome | 210900 | BLM | U39817 | 2.6e-119 | SGS1 | U22341 | Helicase
Adrenoleukodystrophy, X-linked | 300100 | ALD | Z21876 | 3.4e-107 | PXA1 | U17065 | Peroxisomal ABC transporter
Ataxia Telangiectasia | 208900 | ATM | U26455 | 2.8e-90 | TEL1 | U31331 | PI3 kinase
Amyotrophic Lateral Sclerosis | 105400 | SOD1 | K00065 | 2.0e-58 | SOD1 | J03279 | Superoxide dismutase
Myotonic Dystrophy | 160900 | DM | L19268 | 5.4e-53 | YPK1 | M21307 | Serine/threonine protein kinase
Lowe Syndrome | 309000 | OCRL | M88162 | 1.2e-47 | YIL002C | Z47047 | Putative IPP-5-phosphatase
Neurofibromatosis, Type 1 | 162200 | NF1 | M89914 | 2.0e-46 | IRA2 | M33779 | Inhibitory regulator protein
Choroideremia | 303100 | CHM | X78121 | 2.1e-42 | GDI1 | S69371 | GDP dissociation inhibitor
Diastrophic Dysplasia | 222600 | DTD | U14528 | 7.2e-38 | SUL1 | X82013 | Sulfate permease
Lissencephaly | 247200 | LIS1 | L13385 | 1.7e-34 | MET30 | L26505 | Methionine metabolism
Thomsen Disease | 160800 | CLC1 | Z25884 | 7.9e-31 | GEF1 | Z23117 | Voltage-gated chloride channel
Wilms Tumor | 194070 | WT1 | X51630 | 1.1e-20 | FZF1 | X67787 | Sulphite resistance protein
Achondroplasia | 100800 | FGFR3 | M58051 | 2.0e-18 | IPL1 | U07163 | Serine/threonine protein kinase
Menkes Syndrome | 309400 | MNK | X69208 | 2.1e-17 | CCC2 | L36317 | Probable copper transporter
Modified from Mark Gerstein
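
One way to read the table above: when a human disease gene has a strong yeast match, the known yeast function becomes a working hypothesis for the human gene. The sketch below uses four rows copied from the table; the function name `suggest_function` and the P-value cutoff of 1e-50 are arbitrary assumptions for illustration, not a recommended threshold.

```python
# Toy illustration of annotation transfer from a yeast homolog.
# Rows copied from the table; the cutoff value is an assumption.
HOMOLOGS = {
    # human gene: (yeast gene, BLASTX P-value, yeast gene description)
    "MSH2": ("MSH2", 9.2e-261, "DNA repair protein"),
    "CFTR": ("YCF1", 1.3e-167, "Metal resistance protein"),
    "GK":   ("GUT1", 1.8e-129, "Glycerol kinase"),
    "WT1":  ("FZF1", 1.1e-20,  "Sulphite resistance protein"),
}

def suggest_function(human_gene, max_p=1e-50):
    """Return (yeast gene, description) if the match is strong enough, else None."""
    hit = HOMOLOGS.get(human_gene)
    if hit is None:
        return None                  # gene not in our small table
    yeast_gene, p_value, description = hit
    if p_value > max_p:
        return None                  # match too weak to suggest a function
    return yeast_gene, description

print(suggest_function("GK"))    # ('GUT1', 'Glycerol kinase')
print(suggest_function("WT1"))   # None: 1.1e-20 is above the assumed cutoff
```

A suggestion of this kind is only a starting point for experiments; sequence similarity alone does not prove shared function.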

Comparative Genomics:
Genome/Transcriptome/Proteome/Metabolome
Databases, statistics
• Occurrence of specific genes or features in a genome
  • How many kinases in yeast?
• Compare Tissues
  • Which proteins are expressed in cancer vs normal tissues?
• Diagnostic tools
• Drug target discovery
Modified from Mark Gerstein

Molecular Recognition:
Analyzing & Predicting Macromolecular Interfaces
(in DNA, RNA & protein complexes)
Drena Dobbs, GDCB
Jae-Hyung Lee
Michael Terribilini
Jeff Sander
Pete Zaback
Vasant Honavar, Com S
Feihong Wu
Cornelia Caragea
Fadi Towfic
Jivo Sinapov
Robert Jernigan, BBMB
Taner Sen
Andrzej Kloczkowski
Kai-Ming Ho, Physics

Designing Zinc Finger DNA-binding Proteins to Recognize Specific Sites in Genomic DNA
Drena Dobbs, GDCB
Jeff Sander
Pete Zaback
Dan Voytas, GDCB
Fengli Fu
Les Miller, ComS
Vasant Honavar, ComS
Keith Joung, Harvard

Structure & Function of Human Telomerase:
Predicting structure & functional sites in a clinically important but "recalcitrant" RNP
Cell Biologist:
Biochemist:
Imagined structure:
www.intl-pag.org/
www.chemicon.com
Lingner et al (1997) Science 276: 561-567.
How would a systems biologist study telomerase?

SUMMARY:
#1- What is Bioinformatics?

#2- Biological Databases
Xiong: Chp 2
2 Introduction to Biological Databases
What Is a Database?
Types of Databases
Biological Databases
Pitfalls of Biological Databases
Information Retrieval from Biological Databases
Summary
Further Reading

What is a Database?
Duh!!
OK: we'll skip that!

Types of Databases
3 major types of electronic databases:
1- Flat files - simple text files
• no organization to facilitate retrieval
2- Relational - data organized as tables ("relations")
• shared features among tables allow rapid search
3- Object-oriented - data organized as "objects"
• objects associated hierarchically
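
To make the difference between the first two types concrete, here is a minimal sketch that answers the same question twice: once by scanning a flat file line by line, once by joining two relational tables on a shared key. The three records, the "|" field separator, and the table and function names are all hypothetical; real flat-file formats such as GenBank records are far richer. Object-oriented databases are left out to keep the example short.

```python
import sqlite3

# Hypothetical flat-file records: one entry per line, fields separated by "|".
FLAT_FILE = """\
MSH2|Homo sapiens|DNA repair protein
SOD1|Homo sapiens|Superoxide dismutase
GUT1|Saccharomyces cerevisiae|Glycerol kinase
"""

def search_flat(text, organism):
    """Flat file: every line must be read and parsed; nothing is indexed."""
    hits = []
    for line in text.splitlines():
        gene, org, _description = line.split("|")
        if org == organism:
            hits.append(gene)
    return hits

def build_relational(text):
    """Relational: two tables that share an organism id."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE organism (id INTEGER PRIMARY KEY, name TEXT UNIQUE)")
    db.execute("CREATE TABLE gene (symbol TEXT, description TEXT, "
               "organism_id INTEGER REFERENCES organism(id))")
    for line in text.splitlines():
        gene, org, description = line.split("|")
        db.execute("INSERT OR IGNORE INTO organism (name) VALUES (?)", (org,))
        (org_id,) = db.execute(
            "SELECT id FROM organism WHERE name = ?", (org,)).fetchone()
        db.execute("INSERT INTO gene VALUES (?, ?, ?)", (gene, description, org_id))
    return db

def search_relational(db, organism):
    """The shared key (organism id) lets the database join the tables for us."""
    rows = db.execute(
        "SELECT gene.symbol FROM gene "
        "JOIN organism ON gene.organism_id = organism.id "
        "WHERE organism.name = ? ORDER BY gene.symbol", (organism,))
    return [row[0] for row in rows]

db = build_relational(FLAT_FILE)
print(search_flat(FLAT_FILE, "Homo sapiens"))   # ['MSH2', 'SOD1']
print(search_relational(db, "Homo sapiens"))    # ['MSH2', 'SOD1']
```

Both searches return the same genes; the relational version stores each organism name once and can use indexes, which matters once the tables hold millions of rows.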

Biological Databases
Currently - all 3 types, but MANY flat files
What are goals of biological databases?
1- Information retrieval
2- Knowledge discovery
Important issue:
Interconnectivity

Types of Biological Databases
1- Primary
• "simple" archives of sequences, structures, images, etc.
• raw data, minimal annotations, not always well curated!
2- Secondary
• enhanced with more complete annotation of sequences, structures, images, etc.
• usually curated!
3- Specialized
• focused on a particular research interest or organism
• usually - not always - highly curated

Examples of Biological Databases
1- Primary
• DNA sequences
  • GenBank - US
  • European Molecular Biology Lab - EMBL
  • DNA Data Bank of Japan - DDBJ
• Structures (Protein, DNA, RNA)
  • PDB - Protein Data Bank
  • NDB - Nucleic Acid Database
2- Secondary
• Protein sequences
  • Swiss-Prot, TrEMBL, PIR
  • these recently combined into UniProt
3- Specialized
• Species-specific (or "taxonomic" specific)
  • FlyBase, WormBase, AceDB, PlantDB
• Molecule-specific, disease-specific

Pitfalls of Biological Databases
• Errors!
• Lack of documentation re: quality or reliability of data
• Limited mechanisms for "data checking" or preventing propagation of errors (esp. annotation errors!!)
• Redundancy
• Inconsistency
• Incompatibility (format, terminology, data types, etc.)

Information Retrieval from Biological Databases
2 most popular retrieval systems:
• ENTREZ - NCBI
  • will use a LOT - Introduced in Lab 1
• SRS - Sequence Retrieval System - EBI
  • will use less, similar to ENTREZ
Both:
• Provide access to multiple databases
• Allow complex queries
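
As an illustration of what a "complex query" looks like, the sketch below assembles an Entrez search that combines two field-restricted terms with AND. It targets NCBI's E-utilities ESearch service, the programmatic counterpart of the Entrez web pages, which the lecture itself does not mention. The helper name `entrez_search_url` and the example terms are assumptions, and the code only builds the URL; it does not contact NCBI.

```python
from urllib.parse import urlencode

# Base address of NCBI's ESearch service (part of E-utilities).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def entrez_search_url(db, terms, retmax=20):
    """Build an ESearch URL; `terms` maps Entrez field tags to search values."""
    # Each term is restricted to a field, e.g. kinase[Title], and joined with AND.
    query = " AND ".join(f"{value}[{field}]" for field, value in terms.items())
    return ESEARCH + "?" + urlencode({"db": db, "term": query, "retmax": retmax})

# Example: protein records from yeast whose title mentions "kinase".
url = entrez_search_url(
    "protein",
    {"Organism": "Saccharomyces cerevisiae", "Title": "kinase"},
)
print(url)
```

The same query string, `Saccharomyces cerevisiae[Organism] AND kinase[Title]`, can be typed into the Entrez search box in the browser.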

Web Resources:
Bioinformatics & Computational Biology
• Wikipedia: Bioinformatics
• NCBI - National Center for Biotechnology Information
• ISCB - International Society for Computational Biology
• JCB - Jena Center for Bioinformatics
• UBC - Bioinformatics Links Directory
• UWa - BioMolecules
• Pitt - OBRC Online Bioinformatics Resources Collection
• ISU - Bioinformatics Resources - Andrea Dinkelman
• ISU - YABI = "Yet Another Bioinformatics Index" (from BCB Lab at ISU)

ISU Resources & Experts
ISU Research Centers & Graduate Training Programs:
• BCB Lab - (Student-Led Consulting & Resources)
• BCB - Bioinformatics & Computational Biology
• LH Baker Center - Bioinformatics & Biological Statistics
• CIAG - Center for Integrated Animal Genomics
• CILD - Computational Intelligence, Learning & Discovery
• NSF IGERT Training Grant - Computational Molecular Biology
ISU Facilities:
• Biotechnology - Instrumentation Facilities
• PSI - Plant Sciences Institute
• PSI Centers

SUMMARY:
#2- Biological Databases
BEWARE!