Bioinformatics and Computational Biology

Bioinformatics
& Computational Biology
Podcast for Frontiers in Biology - ISU 7/13/06
Drena Dobbs
Genetics, Development and Cell Biology
Bioinformatics & Computational Biology
Iowa State University
Thanks to Mark Gerstein (Yale)
& Eric Green (NIH)
for many borrowed & modified PPTs
What is Bioinformatics?
(& What is Computational Biology?)
Wikipedia:
• Bioinformatics & computational biology
involve the use of techniques from
mathematics, informatics, statistics, and
computer science (& engineering) to solve
biological problems
What is Bioinformatics?
(& What is Computational Biology?)
Gerstein:
• (Molecular) Bioinformatics is conceptualizing biology in
terms of molecules & applying “informatics”
techniques - derived from disciplines such as mathematics,
computer science, and statistics - to organize and
understand information associated with these
molecules, on a large scale
Modified from Mark Gerstein
What is the Information?
Biological Sequences, Structures, Processes
Central Dogma of Molecular Biology
• DNA sequence
  -> RNA
  -> Protein
  -> Phenotype
• Molecules
  - Sequence, Structure, Function
• Processes
  - Mechanism, Specificity, Regulation
Central Paradigm for Bioinformatics
• Genomic (DNA) Sequence
  -> mRNA & other RNA sequence
  -> Protein sequence
  -> RNA & Protein Structure
  -> RNA & Protein Function
  -> Phenotype
• Large Amounts of Information
  - Standardized
  - Statistical
Modified from Mark Gerstein (idea from D Brutlag, Stanford; graphics from S Strobel)
Explosion of "Omes" & "Omics!"
Genome, Transcriptome, Proteome
• Genome - the complete collection
of DNA (genes and "non-genes") of
an organism
• Transcriptome - the complete
collection of RNAs (mRNAs &
others) expressed in an organism *
• Proteome - the complete
collection of proteins expressed in
an organism *
* Note: the set of specific RNAs or proteins expressed varies
greatly in different cells and tissues -- and critically depends
on the age, developmental stage, disease state, etc. of the
organism
Molecular Biology Information:
DNA & RNA Sequences
Functions:
• Genetic material
• Information transfer (mRNA)
• Protein synthesis (tRNA/mRNA)
• Catalytic & regulatory activities (some very new!)
DNA sequence:
atggcaattaaaattggtatcaatggttttggtcgtat
gcacaacaccgtgatgacattgaagttgtaggtattaa
atggcttatatgttgaaatatgattcaactcacggtcg
aaagatggtaacttagtggttaatggtaaaactatccg
Gcaaacttaaactggggtgcaatcggtgttgatatcgctttaactg
atgaaactgctcgtaaacatatcactgcaggcgcaaaaaaagtt
Information:
• 4 letter alphabet
  - DNA nucleotides: AGCT
  - RNA sequence has "U" instead of "T"
• ~ 1,000 base pairs in a small gene
• ~ 3 x 10^9 bp in a genome (human)
• Where are the genes?
• Which DNA sequences encode mRNA?
• Which DNA sequences are "junk"?
• Which RNA sequences encode protein?
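The sequence facts on this slide (a 4-letter DNA alphabet, with "U" replacing "T" in RNA) can be illustrated with a minimal Python sketch. The function names `transcribe` and `base_counts` are made up for this example, and the code assumes the slide's sequence is the coding strand, so transcription reduces to a simple t -> u substitution.

```python
from collections import Counter

DNA_ALPHABET = set("acgt")

def transcribe(dna: str) -> str:
    """Return the RNA form of a coding-strand DNA sequence (t becomes u)."""
    dna = dna.lower()
    unexpected = set(dna) - DNA_ALPHABET
    if unexpected:
        raise ValueError(f"non-DNA characters: {sorted(unexpected)}")
    return dna.replace("t", "u")

def base_counts(dna: str) -> Counter:
    """Count how often each nucleotide letter occurs."""
    return Counter(dna.lower())

# First line of the DNA sequence shown on the slide
fragment = "atggcaattaaaattggtatcaatggttttggtcgtat"

rna = transcribe(fragment)
counts = base_counts(fragment)

print(rna)
print(dict(counts))
```

Answering the slide's harder questions (where the genes are, which regions encode protein) needs gene-finding methods well beyond this letter-level bookkeeping.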
Molecular Biology Information:
Protein Sequences
Functions: Most cellular functions are performed or
facilitated by proteins
• Biocatalysis
• Cofactor transport/storage
• Mechanical motion/support
• Immune protection
• Regulation of growth and differentiation
Protein sequences:
d1dhfa_ LNCIVAVSQNMGIGKNGDLPWPPLRNEFRYFQRMTT
d8dfr__ LNSIVAVCQNMGIGKDGNLPWPPLRNEYKYFQRMTS
d4dfra_ ISLIAALAVDRVIGMENAMPWNLPADLAWFKRNTL
d3dfr__ TAFLWAQDRDGLIGKDGHLPWHLPDDLHYFRAQTV
Information:
• 20 letter alphabet (amino acids)
  - ACDEFGHIKLMNPQRSTVWY but not BJOUXZ
• ~ 300 aa in an average protein (in bacteria)
• ~ 3 x 10^6 known protein sequences
• What is this protein?
• Which amino acids are most
important -- for folding, activity,
interaction with other proteins?
• Which sequence variations are
harmful (or, beneficial)?
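One simple way to start asking which positions matter is to compare related sequences position by position. The sketch below does this for the first two protein fragments on the slide (d1dhfa_ and d8dfr__). It assumes the two fragments are already aligned with no gaps, which is an assumption made for illustration; `percent_identity` and `check_protein` are hypothetical helper names, not part of any library.

```python
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")

def check_protein(seq: str) -> str:
    """Raise ValueError if seq uses letters outside the 20-letter alphabet."""
    unexpected = set(seq) - AMINO_ACIDS
    if unexpected:
        raise ValueError(f"not amino acid codes: {sorted(unexpected)}")
    return seq

def percent_identity(a: str, b: str) -> float:
    """Percent of positions with the same residue (equal-length input only)."""
    if len(a) != len(b):
        raise ValueError("sequences must be the same length (no gaps handled)")
    same = sum(x == y for x, y in zip(a, b))
    return 100.0 * same / len(a)

# Fragments copied from the slide
d1dhfa = check_protein("LNCIVAVSQNMGIGKNGDLPWPPLRNEFRYFQRMTT")
d8dfr = check_protein("LNSIVAVCQNMGIGKDGNLPWPPLRNEYKYFQRMTS")

# Positions (1-based) where the two fragments differ
differences = [i + 1 for i, (x, y) in enumerate(zip(d1dhfa, d8dfr)) if x != y]

print(f"{percent_identity(d1dhfa, d8dfr):.1f}% identical")
print("differing positions:", differences)
```

Positions that stay the same across many related sequences are candidates for being important to folding or activity, but identity alone does not prove it.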
Molecular Biology Information:
Macromolecular Structures
DNA/RNA/Protein Structures
• How does a protein (or RNA)
sequence fold into an active
3-dimensional structure?
• Can we predict structure
from sequence?
• Can we predict function from
structure (or perhaps, from
sequence alone?)
We don't yet understand the protein folding
code - but we try to engineer proteins anyway!
Molecular Biology Information:
Biological Processes
Functional Genomics
• How do patterns of gene
expression determine
phenotype?
• Which genes and proteins are
required for differentiation
during development?
• How do proteins interact in
biological networks?
• Which genes and pathways have
been most highly conserved
during evolution?
On a Large Scale?
Whole Genome
Sequencing
Genome sequences now
accumulate so quickly that,
in less than a week, a single
laboratory can produce
more bits of data than
Shakespeare managed in a
lifetime, although the latter
make better reading.
-- G A Petsko, Nature 401: 115-116 (1999)
Next Step after the
Sequence?
Understanding Gene
Function on a Genomic
Scale
• Expression Analysis
• Structural Genomics
• Protein Interactions
• Pathway Analysis
• Systems Biology
Evolutionary Implications of:
• Introns & Exons
• Intergenic Regions as "Gene Graveyard"
Gene Expression Data:
the Transcriptome
MicroArray Data
Yeast Expression Data:
• Levels for all 6,000 genes!
• Experiments to investigate
how genes respond to
changes in environment or
how patterns of expression
change in normal vs
cancerous tissue
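A common first step in comparing expression between two conditions is a log2 fold change per gene. The sketch below shows the arithmetic only. The gene names and intensity values are hypothetical placeholders, not data from the yeast experiments on this slide, and the pseudocount of 1 and the cutoff of a 2-fold change are assumptions chosen for illustration.

```python
import math

def log2_fold_change(treated: float, control: float, pseudocount: float = 1.0) -> float:
    """log2 ratio of two expression levels; the pseudocount avoids division by zero."""
    return math.log2((treated + pseudocount) / (control + pseudocount))

# Hypothetical (treated, control) intensities, NOT real measurements
expression = {
    "geneA": (1200.0, 300.0),
    "geneB": (150.0, 600.0),
    "geneC": (410.0, 400.0),
}

changes = {gene: log2_fold_change(t, c) for gene, (t, c) in expression.items()}

# Genes whose level changed at least 2-fold in either direction (|log2 FC| >= 1)
changed = sorted(gene for gene, value in changes.items() if abs(value) >= 1.0)

for gene, value in changes.items():
    print(f"{gene}: {value:+.2f}")
print("changed:", changed)
```

Real microarray analysis also needs normalization and replicate-based statistics before any gene is called differentially expressed.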
Modified from Mark Gerstein (courtesy of J Hager)
ISU's Biotechnology Facilities
include state-of-the-art
Microarray & Proteomics
instrumentation
Other
Whole-Genome
Experiments
Systematic Knockouts:
Make "knockout" (null)
mutations in every gene
- one at a time - and
analyze the resulting
phenotypes!
For yeast:
6,000 KO mutants!
2-hybrid Experiments:
For each (and every)
protein, identify every
other protein with which it
interacts!
For yeast: 6000 x 6000 / 2
~ 18M interactions!!
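The "6000 x 6000 / 2" estimate counts candidate protein pairs to test, not confirmed interactions. A short sketch of the arithmetic follows; `unordered_pairs` is a name made up for this example, and the round figure of 6,000 proteins is taken from the slide.

```python
from math import comb

def unordered_pairs(n: int, include_self: bool = False) -> int:
    """Number of distinct protein pairs among n proteins.

    With include_self=True, each protein paired with itself
    (a possible homodimer) is counted as well.
    """
    return comb(n, 2) + (n if include_self else 0)

n_proteins = 6000  # approximate yeast proteome size used on the slide

print(n_proteins * n_proteins // 2)               # slide's round estimate: 18000000
print(unordered_pairs(n_proteins))                # exact, no self-pairs: 17997000
print(unordered_pairs(n_proteins, True))          # exact, with self-pairs: 18003000
```

Either way the count grows roughly with the square of the number of proteins, which is why exhaustive pairwise screens are so demanding.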
Molecular Biology Information:
Integrating Data
• Understanding the function of genomes
requires integration of many diverse and
complex types of information:
  - Metabolic pathways
  - Regulatory networks
  - Whole organism physiology
  - Evolution, phylogeny
  - Environment, ecology
  - Literature (MEDLINE)
Storing & Analyzing Large-scale Information:
Exponential Growth of Data Matched by
Development of Computer Technology
CPU vs Disk & Net
• Both the increase in
computer speed and the
ability to store large
amounts of information on
computers have been
crucial
• Improved computing
resources have been a
driving force in
Bioinformatics
ISU's supercomputer "CyBlue" is among
the 100 most powerful in the world
(Internet picture adapted from D Brutlag, Stanford)
Bioinformatics is born!
& more Bioinformaticists are needed!
(courtesy of Finn Drablos)
from Mark Gerstein
Weber Cartoon
“Informatics” techniques
in Bioinformatics
• Databases
  - Building, Querying
  - Object-oriented DB
• String Comparison
  - Text search
  - Alignment
  - Significance statistics
• Finding Patterns
  - Machine Learning
  - Data Mining
  - Statistics
  - Linguistics
• Geometry
  - Robotics
  - Graphics (Surfaces, Volumes)
  - Comparison & 3D Matching
• Simulation & Modeling
  - Newtonian Mechanics
  - Electrostatics
  - Numerical Algorithms
  - Simulation
  - Network modeling
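As one concrete instance of the "String Comparison / Alignment" item above, here is a minimal sketch of global alignment scoring by dynamic programming (the Needleman-Wunsch approach). The scoring values (match +1, mismatch -1, gap -1) are assumptions chosen for illustration, and the function returns only the best score, not the alignment itself. Production tools use substitution matrices and more refined gap penalties.

```python
def global_alignment_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -1) -> int:
    """Best global alignment score of a and b under a simple scoring scheme."""
    # prev[j] holds the best score for aligning the processed prefix of a
    # against the first j letters of b; row 0 is all gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i, x in enumerate(a, start=1):
        curr = [i * gap]  # first i letters of a aligned against nothing
        for j, y in enumerate(b, start=1):
            diagonal = prev[j - 1] + (match if x == y else mismatch)
            gap_in_b = prev[j] + gap      # letter x aligned to a gap
            gap_in_a = curr[j - 1] + gap  # letter y aligned to a gap
            curr.append(max(diagonal, gap_in_b, gap_in_a))
        prev = curr
    return prev[-1]

print(global_alignment_score("GATTACA", "GATTACA"))  # 7 matches: 7
print(global_alignment_score("ACGT", "AGT"))         # 3 matches, 1 gap: 2
```

Keeping only two rows of the table keeps memory proportional to the shorter dimension, at the cost of not being able to trace back the alignment.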
Challenges in Organizing Information:
Redundancy and Multiplicity
• Different sequences can have the
same structure
• Organism has many similar genes
• Single gene may have multiple
functions
• Genes and proteins function in genetic
and regulatory pathways
• How do we organize all this
information so that we can make
sense of it?
Integrative Genomics:
genes <> structures <> functions <> pathways <>
expression levels <> regulatory systems <> ....
Molecular Parts = Conserved Domains
"Parts List" approach to bike maintenance:
• What are the shared parts (bolt, nut, washer, spring, bearing) and the unique parts (cogs, levers)?
• What are the common parts - types of parts (nuts & washers)?
• Where are the parts located?
• How many roles can these play?
• How flexible and adaptable are they mechanically?
World of structures is also finite,
providing a valuable simplification
[Figure: numbered genes for two organisms, human (1, 2, 3, ... 20 ...; ~30,000 genes) and T. pallidum (1, 2, 3, ... 15 ...; ~2,000 genes), shown against ~2,000 folds]
Global Surveys of a Finite
Set of Parts from Many
Perspectives
Same logic for pathways, functions,
sequence families, blocks, motifs....
Functions picture from www.fruitfly.org/~suzi (Ashburner); Pathways picture from ecocyc.pangeasystems.com/ecocyc (Karp, Riley). Related
resources: COGS, ProDom, Pfam, Blocks, Domo, WIT, CATH, Scop....
So, this is Bioinformatics
What is it good for?
Application I:
Designing Drugs
• Understanding how proteins bind other molecules
• Docking & structure modeling
• Designing inhibitors
(Figures adapted from Olsen Group Docking Page at Scripps, Dyson NMR Group Web
page at Scripps, and from Computational Chemistry Page at Cornell Theory Center).
Application II:
Finding homologs
Finding WHAT?
Homologs - "same genes" in different organisms
• Human vs. Mouse vs. Yeast
  - Much easier to do experiments on yeast!
Best Sequence Similarity Matches to Date Between Positionally Cloned
Human Genes and S. cerevisiae Proteins

Human Disease | MIM # | Human Gene | GenBank Acc# for Human cDNA | BLASTX P-value | Yeast Gene | GenBank Acc# for Yeast cDNA | Yeast Gene Description
Hereditary Non-polyposis Colon Cancer | 120436 | MSH2 | U03911 | 9.2e-261 | MSH2 | M84170 | DNA repair protein
Hereditary Non-polyposis Colon Cancer | 120436 | MLH1 | U07418 | 6.3e-196 | MLH1 | U07187 | DNA repair protein
Cystic Fibrosis | 219700 | CFTR | M28668 | 1.3e-167 | YCF1 | L35237 | Metal resistance protein
Wilson Disease | 277900 | WND | U11700 | 5.9e-161 | CCC2 | L36317 | Probable copper transporter
Glycerol Kinase Deficiency | 307030 | GK | L13943 | 1.8e-129 | GUT1 | X69049 | Glycerol kinase
Bloom Syndrome | 210900 | BLM | U39817 | 2.6e-119 | SGS1 | U22341 | Helicase
Adrenoleukodystrophy, X-linked | 300100 | ALD | Z21876 | 3.4e-107 | PXA1 | U17065 | Peroxisomal ABC transporter
Ataxia Telangiectasia | 208900 | ATM | U26455 | 2.8e-90 | TEL1 | U31331 | PI3 kinase
Amyotrophic Lateral Sclerosis | 105400 | SOD1 | K00065 | 2.0e-58 | SOD1 | J03279 | Superoxide dismutase
Myotonic Dystrophy | 160900 | DM | L19268 | 5.4e-53 | YPK1 | M21307 | Serine/threonine protein kinase
Lowe Syndrome | 309000 | OCRL | M88162 | 1.2e-47 | YIL002C | Z47047 | Putative IPP-5-phosphatase
Neurofibromatosis, Type 1 | 162200 | NF1 | M89914 | 2.0e-46 | IRA2 | M33779 | Inhibitory regulator protein
Choroideremia | 303100 | CHM | X78121 | 2.1e-42 | GDI1 | S69371 | GDP dissociation inhibitor
Diastrophic Dysplasia | 222600 | DTD | U14528 | 7.2e-38 | SUL1 | X82013 | Sulfate permease
Lissencephaly | 247200 | LIS1 | L13385 | 1.7e-34 | MET30 | L26505 | Methionine metabolism
Thomsen Disease | 160800 | CLC1 | Z25884 | 7.9e-31 | GEF1 | Z23117 | Voltage-gated chloride channel
Wilms Tumor | 194070 | WT1 | X51630 | 1.1e-20 | FZF1 | X67787 | Sulphite resistance protein
Achondroplasia | 100800 | FGFR3 | M58051 | 2.0e-18 | IPL1 | U07163 | Serine/threonine protein kinase
Menkes Syndrome | 309400 | MNK | X69208 | 2.1e-17 | CCC2 | L36317 | Probable copper transporter
Application III:
Genome/Transcriptome/Proteome
Characterization & Comparison
Databases, statistics
• Occurrence of specific genes or
features in a genome
  - How many kinases in yeast?
• Compare Tissues
  - Which proteins are expressed in
cancer vs normal tissues?
• Diagnostic tools
• Drug target discovery
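A question like "How many kinases in yeast?" is, at its simplest, a query over gene annotations. The sketch below runs such a query over only the yeast genes listed in the human/yeast homolog table earlier in this deck, so its answer describes that small table, not the yeast genome. Matching the word "kinase" in a free-text description is a naive assumption made for illustration. Real surveys rely on curated annotation or sequence-family databases.

```python
# Yeast gene -> description, copied from the homolog table in this deck
# (CCC2 appears twice in the table and is listed once here).
yeast_genes = {
    "MSH2": "DNA repair protein",
    "MLH1": "DNA repair protein",
    "YCF1": "Metal resistance protein",
    "CCC2": "Probable copper transporter",
    "GUT1": "Glycerol kinase",
    "SGS1": "Helicase",
    "PXA1": "Peroxisomal ABC transporter",
    "TEL1": "PI3 kinase",
    "SOD1": "Superoxide dismutase",
    "YPK1": "Serine/threonine protein kinase",
    "YIL002C": "Putative IPP-5-phosphatase",
    "IRA2": "Inhibitory regulator protein",
    "GDI1": "GDP dissociation inhibitor",
    "SUL1": "Sulfate permease",
    "MET30": "Methionine metabolism",
    "GEF1": "Voltage-gated chloride channel",
    "FZF1": "Sulphite resistance protein",
    "IPL1": "Serine/threonine protein kinase",
}

def genes_matching(annotations: dict, keyword: str) -> list:
    """Sorted gene names whose description contains keyword (case-insensitive)."""
    keyword = keyword.lower()
    return sorted(g for g, desc in annotations.items() if keyword in desc.lower())

kinases = genes_matching(yeast_genes, "kinase")
print(len(kinases), "of", len(yeast_genes), "listed genes:", kinases)
```

The same keyword query pattern extends to transporters or repair proteins by changing the search term.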
Building “Designer” Zinc Finger DNA-binding Proteins
J Sander, Fengli Fu, J Townsend, R Winfrey
D Wright, K Joung, D Dobbs, D Voytas
Identifying "Missing" Components of
Signal Transduction Pathways
Phil Becraft, GDCB
Antony Chettoor
Drena Dobbs, GDCB
Jae-Hyung Lee
Kai-Ming Ho, Physics
Zhong Gao
Yungok Ihm
Haibo Cao
Cai-zhuang Wang
Designing New HIV Therapies
Susan Carpenter, VMPM
Sijun Liu
Wendy Wood
Drena Dobbs, GDCB
Jae-Hyung Lee
Kai-Ming Ho, Physics & Astronomy
Yungok Ihm
Haibo Cao
Cai-zhuang Wang
Amy Andreotti, BBMB
Bruce Fulton, NMR Facility
Vasant Honavar, Com S
Changhui Yan
Predicting Protein-Protein Interactions from
Amino Acid Sequence
Vasant Honavar, Com S
Changhui Yan
Drena Dobbs, GDCB
Jae-Hyung Lee
Kai-Ming Ho, Physics
Robert Jernigan, BBMB