Talks to IPAM group UCLA September 26 Why biologists need mathematicians

Talks to IPAM group
UCLA
September 26
Why biologists need mathematicians
I. Themes from biological systems
Roger Brent
The Molecular Sciences Institute
A. Why do we care about biology?
1. It’s fascinating
2. It’s us, and it’s a good deal of the rest of our
world
A. Why do we care about biology?
1. It’s fascinating
It’s quite different from the hard world
There are biological functions that don’t have great
counterparts outside
Core functions
-> Self replication
-> Self repair
-> Ordered growth
-> Unusual and un-understood approaches to information
processing
-> Development into great complexity from undifferentiated
state (using stored information to get there)
Biological functions that don’t have good counterparts
in the non-biological world
Neurobiological functions
-> Perception
-> Memory
-> Cognition
-> Self-consciousness
Species and population functions
-> Variation-selection (aka evolution)
-> Distributed coordinated control (immune system)
-> Resilience (ecologies)
A. Why do we care about biology?
2. It’s us, and it’s our world
-> Medicine
-> Agriculture, food, etc. etc.
-> Increasingly, engineering and design
B. Biological systems are built from stored information.
DNA makes RNA makes protein. The DNA, of course, contains
genes. Much of the progress in molecular biology in the mid-20th century came from understanding “the
molecular biology of the gene”.
What is a gene?
-> Definition 1. Operational. Rediscovery of Mendel.
T. H. Morgan. Allele.
-> Definition 2. Coding sequence. Or is that the coding
sequence + the control region?
The first definition launched genetics and is still with us
in popular culture. The second is what, in the main,
lets us work day to day.
C. For reasons that I hope will become apparent, most of
what I talk about over the next two days will deal with non-neurobiological information-handling functions.
These processes are arguably the most important current
preoccupation of molecular and cellular biologists.
Let’s first consider the development of an organism from a
relatively undifferentiated state.
This development depends on the elaboration of the
information stored in the genome.
At the limit, most of the construction of the cell and metazoan
organism is specified by that stored information.
C. Information processing during development
There is a fairly clean division between stored
information (program) and the rest of organism.
Consider the cell or organism as the consequence
of an expressed genome. In the cell or organism,
information is represented at two levels: (1) the DNA and
(2) the dynamic mRNA and protein products.
Information stored in level 2 is a subset of what is in
level 1 and is synonymous, albeit in a different
language. We’ll get back to this.
D. Non-developmental information processing
1. Metabolism
If I were to be giving this talk in 1950, I would be
telling you that the living state could be understood in
terms of metabolism, and that if one wanted to use
mathematics to understand biology more deeply, one
should consider metabolism.
Small molecules schlepped from workstation to
workstation.
There are all kinds of feedback.
This is a tale of rates and fluxes.
Stoffwechsel (metabolism)
Information processing, decision-making
E. Regulation
OK, there is overall information flow from DNA to mRNA
and protein. Especially during development.
However, information is processed in response to changes
inside and outside of the cell.
Regulation is the mother of this information processing.
At the level of the cell, the “primitives”, the
“components” that perform information processing
are subsystems of protein molecules and sites.
Examples of regulatory subsystems.
1) The operon
2) The Bicoid gradient
3) Genetic networks of repressors and activators-usually, not.
4) Signal transduction networks.
1) The operon
Draw PL from phage lambda on board
2) Bicoid protein gradient
3) Genetic networks of repressors and activators,
not--Draw “Boolean network” on board
E. Regulation, continued
4) Signal transduction networks-- yes.
Part of pathway governing
response of yeast to
α-factor.
F. Overview of non-neurobiological information
processing in biology.
1. Information processing is the current metaphor for
understanding how things work. This is as true in
biology as it is in any other human endeavor.
2. Contemporary molecular and cellular biology has its
roots in an attempt to understand transmission and
elaboration of the genetic information.
3. Certainly, during development, there is an overall
flow of information from DNA to mRNA and protein.
F. Non-neurobiological information processing in biology,
cont’d
4. However, many of the interesting things that
happen after development, the decision-making, the phenomena
that cause biological changes to occur, are due to processing of
information coming from inside and outside the cell.
5. Some of this information processing, particularly
information processing that can be slow and stately,
occurs through changes in gene expression. However,
relatively little of it occurs as multiple cycles of gene
activation and repression.
F. Non-neurobiological information processing in biology,
cont’d
6. By contrast, a good deal of cellular information
processing occurs via protein-protein interactions.
[Figure: mammalian G1 -> S circuitry]
F. Non-neurobiological information processing in biology,
cont’d
7. Genomic biological techniques are generating new
kinds of information that promise to heighten our
understanding of core biological functions, including
these information processing functions.
8. In the understanding of the processes that bring
about changes in biological systems, there will be
mathematical opportunities.
9. The challenges and opportunities will likely be both
problem specific and specific to each data type.
Talks to IPAM group
UCLA
September 26 and 27
Why biologists need mathematicians
II. How biological information and knowledge is
created
Roger Brent
The Molecular Sciences Institute
A. Ad hoc experimentation.
(sometimes known as “hypothesis-driven”
research)
Modular transcription factor story.
[Figure labels: LexA, lex op, gene; LexA-GAL4, lex op, gene]
A. Ad hoc biological experimentation, cont’d.
-> Not always hypothesis driven.
-> Can sometimes be reduced to hypothesis-driven, or, more
precisely, in the Francis-Baconian (Novum Organum) sense,
can be reduced to a test that decides between alternatives.
-> Reasoning toward conclusions, and conclusions, typically
expressed qualitatively, in natural language (e.g., English).
-> Strength of conclusions depends on finding right words.
-> Most of what we know we know from this.
B. Genomic research, defined
Genetics, study of genes
Genomics, systematic study of genes (as long as
you can do it in a factory)
Functional genomics, study of what genes do
-- but that is arguably all of biology
-- but wait, as long as you can do it in a factory,
it’s functional genomics
-- Functional genomics is a good deal more than the
study of differences in gene expression
B. Genomic research, defined, continued
This expanded definition of “functional genomics”
would have been OK, but no-- along came proteomics,
and even worse neologisms.
John N. Weinstein has proposed calling all of this “omics”.
I will call all of it “genomic biology”.
C. DNA sequencing
1) Sequencing of organismic (genomic) DNA
2) Sequencing of cDNAs, ESTs
3) Main use taught to you so far has probably been to
identify genes and assign function to them.
4) In all of this, the main inferential tactic is to see that
one thing that looks like another thing may do the
same thing as that other thing -- transitivity.
5) And look. What do we mean by function?
C. DNA polymorphisms, such as SNPs
1) Sequencing
2) PCR
3) Melting
4) Microarrays
5) MALDI-TOF Mass spec
C. DNA polymorphisms
1) Sequencing
Draw on board
C. DNA polymorphisms
2) PCR
Draw on board
C. DNA polymorphisms
3) Melting
draw on board
4) Hybridization, for example to microarrays
PCR + does it hybridize to this sequence?
Expensive and difficult to access now due to
restricted Affymetrix chip technology.
Cost will come down as people with deep pockets
challenge this semi-monopoly.
C. DNA polymorphisms
5) PCR approach +
Matrix Assisted Laser Desorption Ionization
Time Of Flight Mass Spectrometry
(aka “MALDI-TOF mass spec”)
Draw on board
MassArray
MALDI-TOF Mass Spectrometry
[Figure: three stages: desorption/ionization, separation, detection/identification. Courtesy Charles Cantor, Sequenom, Inc.]
MassArray
Allele Discovery: TRSP (-12)
SNP analysis using combined DNA samples
(184 chromosomes) (courtesy Charles Cantor, Sequenom, Inc.)
C allele 5837
T allele 6165
G allele 6190
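The separation step in MALDI-TOF works because heavier ions, accelerated through the same voltage, reach the detector later. The following is a minimal sketch of the ideal linear time-of-flight relation, t = L * sqrt(m / (2 z e U)). It assumes that the three allele values above are peak masses in daltons and that the ions are singly charged; the 20 kV accelerating voltage and 1 m drift length are hypothetical instrument parameters, not values from the slide.

```python
import math

ELEMENTARY_CHARGE = 1.602176634e-19  # coulombs
DALTON = 1.66053906660e-27           # kilograms

def flight_time(mass_da, charge=1, voltage=20_000.0, drift_length=1.0):
    """Ideal linear time of flight in seconds: t = L * sqrt(m / (2 z e U)).

    voltage (volts) and drift_length (meters) are hypothetical defaults.
    """
    mass_kg = mass_da * DALTON
    kinetic_energy = charge * ELEMENTARY_CHARGE * voltage  # joules
    velocity = math.sqrt(2.0 * kinetic_energy / mass_kg)   # meters per second
    return drift_length / velocity

# Assumption: the slide's allele values are peak masses in daltons.
alleles = {"C": 5837.0, "T": 6165.0, "G": 6190.0}
times = {name: flight_time(mass) for name, mass in alleles.items()}
for name, t in times.items():
    print(f"{name} allele: {t * 1e6:.2f} microseconds")
```

Because flight time scales with the square root of mass, the two peaks that lie close together in mass are also the hardest to resolve in time.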
D. Gene expression analysis
1) Methods
-> E. M. Southern
-> Catalog cDNA sequence abundance
-> Affymetrix
-> Brown-Botstein-Davis-DeRisi microspot
-> Suspension array (ala Lynx, Luminex)
2) Quality control problems
-> Surface problems
-> Source of mRNA problems (sample grab, sample
treatment)
-> How you turn mRNA into cDNA
-> How you amplify
D. Gene expression analysis, continued
3) Inferential tactics
-> Two things that are expressed at the same time
or under the same conditions might do the same
thing (“guilt by association”)
-> One thing that is expressed, or that ceases to be
expressed, after another thing is expressed, may
be controlled by the prior-expressed thing.
(post hoc, ergo propter hoc).
-> Seems as if we could do with a few more ideas, no?
(I mean, there are more logical fallacies left.)
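The "guilt by association" tactic above is often implemented as a correlation between expression profiles. Here is a minimal sketch, assuming Pearson correlation as the similarity measure and a 0.9 cutoff; both are illustrative choices. The gene names and expression values are invented for the example and are not measurements.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Made-up log2 expression ratios at seven time points; not real data.
profiles = {
    "geneA": [0.0, 0.2, 1.5, 2.8, 2.5, 1.0, 0.3],
    "geneB": [0.1, 0.3, 1.4, 3.0, 2.4, 1.1, 0.2],
    "geneC": [2.0, 1.8, 0.5, 0.1, 0.0, 0.2, 0.1],
}

def coexpressed_pairs(profiles, threshold=0.9):
    """Return (gene, gene, r) for every pair with correlation >= threshold."""
    names = sorted(profiles)
    pairs = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            r = pearson(profiles[first], profiles[second])
            if r >= threshold:
                pairs.append((first, second, r))
    return pairs

pairs = coexpressed_pairs(profiles)
for first, second, r in pairs:
    print(f"{first} and {second} are co-expressed (r = {r:.3f})")
```

A high correlation only says the two profiles move together. It does not establish shared function, which is exactly the inferential weakness the slide points out.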
Guessing gene function from expression
pattern. Chu et al, 1999
[Figure: expression at time after sporulation induction (hours): 0, 1/2, 2, 5, 7, 9, 11]
E. Protein
-> 2-D gels
-> Two-hybrid
-> Gst-pulldown activity
-> Mass spec ID of proteins in complexes
-> Follow ons. That which can be systematized will
be
2-D Gels
Draw on board
[Figure: two-hybrid system. Labels: “bait” (LexA DNA binding, LexA dimerization, fused moiety); “prey” (nuc loc seq, acid, epitope tag, interacting cDNA ORF); op; act; gene]
Mass production of two-hybrid information: interaction
mating
[Figure: baits and preys brought together by mating; reporter 1: LEU2; reporter 2: lacZ]
Aptamer affinity vs LexAop-GFP reporter
expression. Colas, et al. unpublished
[Figure axes: log KD (M), ticks at -8 and -7; % fluorescence above background (500 nm), 0 to 100]
Finding new functions by coprecipitation and forced
association. Martzen et al, 1999
[Figure labels: Gst, Protein 1, Protein 2, glutathione bead, biochemical activity]
Protein mass spectrometric identification of proteins in
complexes
Won’t show the lurid commercial slide again
Follow-ons
Ways to track information: by changes
in protein phosphorylation, by changes in mass,
etc., etc.
Things we need
[Figure: pathway diagram with components A, B, C, D. Labels: receptors and ligands; cytoplasmic proteins; rate constants (k); sites and regulatory proteins]
Talks to IPAM group
UCLA
September 26 and 27
Why biologists need mathematicians
III. Things to watch: current areas of genomic biology
that seem to be generating mathematical issues
Roger Brent
The Molecular Sciences Institute
A. “Ken’s light”, courtesy Andrea Way
B. Guessing protein function
from interaction pattern.
Lok et al., 1998,
www.molsci.org
[Figure: protein interaction network; labeled nodes include cdk4, Cart1, cycD]
Interactions in our database, Interaction 1.0
Two patterns of protein interactions
Protein Complex
Signal Transduction Pathway
Computer searching for patterns of protein interaction
(Connect The Dots, or CTD v3.0)
Lok’s algorithm finds noncircular patterns first.
New version is more than 10^3 times
faster than the first.
C. Quantitative simulations and models of biological
processes
Part of pathway governing
response of yeast to
α-factor.
Continuous and stochastic reaction computations
Continuous:
reactant --k--> product
$\frac{d[\mathrm{reactant}]}{dt} = -k\,[\mathrm{reactant}]$
$\frac{d[\mathrm{reactant}]}{[\mathrm{reactant}]} = -k\,dt$
$[\mathrm{reactant}]_{t} = [\mathrm{reactant}]_{0}\, e^{-kt}$
(systems of these are solved numerically)

Stochastic:
reactant --k--> product
startloop
$P_{\mathrm{reactant} \to \mathrm{product}}/\Delta t$ is f(#reactants, k)
Use $P_{\mathrm{reactant} \to \mathrm{product}}/\Delta t$ to
compute reaction PDF
Sample reaction PDF to
determine when reaction
next occurs
#reactants = #reactants - 1
#products = #products + 1
go to startloop
(Gillespie, 1977)
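The continuous and stochastic computations on this slide can be compared directly for the single reaction reactant -> product. The following is a minimal sketch, assuming first-order kinetics so that the reaction probability per unit time is k times the number of reactant molecules. The molecule count, rate constant, and number of runs are arbitrary illustrative values.

```python
import math
import random

def gillespie_decay(n_reactant, k, t_end, rng):
    """Simulate reactant -> product one reaction event at a time."""
    t = 0.0
    n_product = 0
    trajectory = [(t, n_reactant, n_product)]
    while n_reactant > 0:
        propensity = k * n_reactant       # reaction probability per unit time
        t += rng.expovariate(propensity)  # sample the waiting time to the next event
        if t > t_end:
            break
        n_reactant -= 1
        n_product += 1
        trajectory.append((t, n_reactant, n_product))
    return trajectory

def deterministic_decay(n0, k, t):
    """Continuous solution: [reactant]_t = [reactant]_0 * exp(-k t)."""
    return n0 * math.exp(-k * t)

rng = random.Random(1977)
n0, k, t_end = 1000, 0.01, 100.0
# Reactant count at t_end is the count after the last event before t_end.
runs = [gillespie_decay(n0, k, t_end, rng)[-1][1] for _ in range(200)]
mean_remaining = sum(runs) / len(runs)
expected = deterministic_decay(n0, k, t_end)
print(f"stochastic mean: {mean_remaining:.1f}, continuous: {expected:.1f}")
```

With many molecules the averaged stochastic runs track the exponential curve closely. The stochastic version matters most when molecule counts are small and individual runs differ noticeably from the mean.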
Output of n-2 version of α-factor simulation. Endy and Lyons, unpublished.
[Figure: vertical axis on a log scale from 10^0 to 10^6; horizontal axis: time (seconds), from -200 to 1000]
Test simulation by assaying
system output in populations of
single cells. Colman-Lerner,
unpublished.
[Figure panels: G1 arrest state; S state. Reporter labels: P Fus1 YFP, P H2-like CFP]
Vary amount of proteins,
protein complexes, etc.
Measure variation. Measure
change in system output.
Issues with ongoing simulation work
-> Devising new experimental methods to constrain
upstream steps of pathway
-> Coping computationally with increase in number of
species
-- Run time generation of code for each possible
species (Levchenko and Bruck, 2000)
-- Computing and keeping track of only those species
that come into being during each simulation
run (“Moleculizer”, Lok, 2000)
D. Fuzzier stuff
1) Finding appropriate symbolic representation
of qualitative knowledge and performing operations
on said symbols,
i.e., computation on qualitative knowledge
[Figure: mammalian G1 -> S circuitry]
Limited-vocabulary information
Geographical
-> Names (Berkeley, Berkeley-Oakland line, Interstate 80, Bay Bridge, San Francisco Bay)
-> Relationships (shares border with, is under, north, right, 5 miles, 11 minutes)
-> Verbs (Go)
-> Modifiers (fastest route, under construction)
-> Locations (37.8 N, 122.3 W, 50m above sea level)
Biological
-> Names (p107, Rb, Raf, ATP, Ovalbumin Estrogen Response Element)
-> Verbs (homodimerizes, ubiquitinates, cleaves, represses)
-> Modifiers (strongly, slowly, most, some)
-> Locations (in cytoplasm, near plasma membrane, in plasma membrane, in nucleus)
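One way to make the biological vocabulary above computable is to store each qualitative statement as a record with fixed slots for name, verb, modifier, and location. The following is a minimal sketch of that idea. The `Assertion` class, the `query` helper, and the entities ProteinX, ProteinZ, and GeneY are hypothetical placeholders, not facts about real molecules.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Assertion:
    """One qualitative statement built from a limited vocabulary."""
    subject: str
    verb: str
    obj: Optional[str] = None
    modifier: Optional[str] = None
    location: Optional[str] = None

# Hypothetical statements using the slide's verbs, modifiers, and locations.
facts = [
    Assertion("ProteinX", "homodimerizes", None, "strongly", "in cytoplasm"),
    Assertion("ProteinX", "represses", "GeneY", "strongly", "in nucleus"),
    Assertion("ProteinZ", "cleaves", "ProteinX", "slowly", "near plasma membrane"),
]

def query(facts, **criteria):
    """Return the assertions whose fields equal every given criterion."""
    return [
        fact for fact in facts
        if all(getattr(fact, key) == value for key, value in criteria.items())
    ]

for fact in query(facts, subject="ProteinX"):
    print(fact.verb, fact.obj, fact.modifier, fact.location)
```

Even this flat representation supports simple operations such as "everything that happens in the nucleus". Richer operations (chaining, contradiction checks) would need a more careful design than this sketch offers.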
Cassini
plumbing
D. Fuzzier stuff, cont’d
2) Coming back to deal with selection and fitness
a) Can evolve in silico. You need a source of variation
and a way to evaluate fitness.
b) Make variants and use simulation to evaluate
their fitness. John Koza, Forrest Bennett III, et al.
1) Create language to describe circuit genomes
(components and connections)
2) Start with 100 random 10-component circuits
3) Use SPICE simulator to simulate circuit behavior
4) Select 4 circuits that best approximate desired
behavior (eg, bandpass filters, amplifiers)
5) Mate, Meiose, Duplicate, Delete, Point Mutate
6) Select 100 random progeny circuits
7) Go to step 3)
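The loop in steps 1) to 7) can be sketched in a few lines. This is a toy stand-in under stated assumptions, not the method of Koza et al.: a genome is simply a list of 10 numbers, the `fitness` function replaces the SPICE simulation with distance to a hypothetical target, only mating and point mutation are implemented (no meiosis, duplication, or deletion), and the 4 selected parents are carried over into the next generation of 100.

```python
import random

GENOME_LENGTH = 10       # stands in for "10-component circuits"
POPULATION_SIZE = 100
PARENTS_KEPT = 4
TARGET = [0.5] * GENOME_LENGTH  # hypothetical desired behavior

def random_genome(rng):
    return [rng.random() for _ in range(GENOME_LENGTH)]

def fitness(genome):
    """Stand-in for a circuit simulator: higher is better, 0 is perfect."""
    return -sum((g - t) ** 2 for g, t in zip(genome, TARGET))

def mate(mother, father, rng):
    """Single-point crossover between two parent genomes."""
    cut = rng.randrange(1, GENOME_LENGTH)
    return mother[:cut] + father[cut:]

def point_mutate(genome, rng, scale=0.05):
    """Perturb one randomly chosen position by Gaussian noise."""
    child = list(genome)
    index = rng.randrange(GENOME_LENGTH)
    child[index] += rng.gauss(0.0, scale)
    return child

def evolve(generations, rng):
    population = [random_genome(rng) for _ in range(POPULATION_SIZE)]
    best_per_generation = []
    for _ in range(generations):
        parents = sorted(population, key=fitness, reverse=True)[:PARENTS_KEPT]
        best_per_generation.append(fitness(parents[0]))
        progeny = [
            point_mutate(mate(rng.choice(parents), rng.choice(parents), rng), rng)
            for _ in range(POPULATION_SIZE - PARENTS_KEPT)
        ]
        population = parents + progeny  # keeping parents means the best never regresses
    return best_per_generation

history = evolve(generations=50, rng=random.Random(0))
print(f"best fitness: first generation {history[0]:.4f}, last {history[-1]:.4f}")
```

Swapping `fitness` for a call into a real simulator is the only change needed to evolve something less trivial, which is why the slide stresses a source of variation plus a way to evaluate fitness.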
Op Amp. Best circuit from generation 109.
Koza et al, 1999.
[Figure: schematic of the evolved op amp: a network of transistors (Q) and resistors (R) with supply nodes POS15 and NEG15, plus RSOURCE, RLOAD, RFEEDBACK, VSOURCE, and VOUT]
D(2), cont’d
c) You have seen that we may have simulations of biological
systems, so we should be able to evolve them in silico.
d) Meanwhile, we also seem to be moving toward
whole-genome experimental methods to quantitate the
contribution individual genes make to the fitness of
an organism under selection. Honest numbers, honest
quantitation.
D(2) Cont’d. Quantitating contribution of individual
genes to fitness. Smith et al, 1996, etc.
Pot of different mutants
-> After 10 gens: organisms with mutations in essential
genes gone from population
-> After 50 gens: organisms with
mutations in significant genes
(5% growth disadvantage) depleted
from population
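The numbers on this slide can be checked with a short calculation. The following is a minimal sketch, assuming exponential growth and reading "5% growth disadvantage" as a growth rate 5% lower than wild type, so that the mutant-to-wild-type ratio falls by a factor of 2^(-s) per wild-type generation. That reading is an assumption; the slide does not define the term.

```python
def relative_abundance(disadvantage, generations):
    """Mutant-to-wild-type ratio, normalized to 1 at generation 0.

    Assumes exponential growth, with the mutant growth rate reduced by
    the fraction `disadvantage` relative to wild type.
    """
    return 2.0 ** (-disadvantage * generations)

for generations in (10, 50):
    ratio = relative_abundance(0.05, generations)
    print(f"5% disadvantage, {generations} generations: {ratio:.3f}")

# A mutant that cannot grow at all (disadvantage 1.0) is diluted about
# a thousandfold in 10 generations.
print(f"no growth, 10 generations: {relative_abundance(1.0, 10):.4f}")
```

Under this model a 5% disadvantage leaves roughly 71% of the starting ratio after 10 generations and about 18% after 50, so the depletion is measurable but demands quantitative detection.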
D(2) Cont’d. Actual detection and quantitation of
numbers relevant to evolution.
Thus, we can almost imagine closing the loop between
“theory” and experiment on one aspect of evolution,
contribution of individual genes to fitness for a given
condition.
There are obvious math/statistical issues herein.
E. Conclusions
1) In your thoughts about what sorts of data are worth
your insights, it would not be good to be biased by the
prevailing genomic hype toward the kinds of data that can
be collected using nucleic acid microarrays.
2) There is a very large world of experimental observation
out there that needs mathematics at all levels, from
basic applied math work on behalf of the innumerate,
to potentially profound insights.
3) Because most of these problems are not articulated, it
requires work to articulate them. In this case, it probably
requires people of goodwill on both sides, people who are
not afraid of embarrassing themselves in a good cause.