From Genome Sequences to
Regulatory Network Phenotypes
(bioinformatic functional genomics:)
• Study the systematic operation of
genes and their products in whole
genome, whole cell contexts.
• Discover the effect of every gene on
growth, expression, & interaction .
• Test quantitative network models.
Growth, Expression, & Interaction
Harvard Center for
Computational Genetics
John Aach
Tim Chen
George Church
Jason Hughes
Jason Johnson
Abby McGuire
Jong Park
Fritz Roth
HMS Genetics
Andy Link, Doug Selinger
Pete Estep, Michael Ching
Martha Bulyk, Sonali Bose
Martin Steffen
Saeed Tavazoie, Annie Chan
Dereth Phillips, Chris Harbison
NCBI: Andrew Neuwald
Affymetrix: David Lockhart, Eric Gentalen
UCSD: Bernhard Palsson
DOE, DARPA, Lipper, NIST, HMR
Sequenced genomes
Science 277: 1433 (1997)
Organism                # Genes   % Unknown function (FUNs)
S. cerevisiae           6034      49%
E. coli                 4288      38%
B. subtilis             4000      42%
Synechocystis sp.       3168      56%
A. fulgidus             2471      52%
H. influenzae           1740      42%
M. thermoautotrophicum  1855      56%
H. pylori               1590      43%
M. jannaschii           1692      54%
B. burgdorferi          863       42%
M. pneumoniae           677       51%
M. genitalium           470       31%
Total                   28848     47%
Choice of Cells
Small genome size: Mycoplasma, Haemophilus, Methanococcus
Energy relevance: Methanobacterium, Synechocystis
Major Pathogens: Mycobacterium, Escherichia, Helicobacter
Biotech Production: Escherichia, Saccharomyces, Homo
Recombinant protein production, in vivo combinatorial chemistry,
BACs, gene delivery, etc.
15 going on 40 complete genomes.
30,000 going on 150,000 complete genes (& intergenic regions).
Smith, et al. (1997) J. Bacteriol. 179:7135-55. Methanobacterium
Blattner, et al. (1997) Science 277, 1453-74. Escherichia
Goffeau, et al. (1996) Science 274, 563-7. Saccharomyces
Metabolic & regulatory
databases
4288 / 4909 E. coli orfs / genes
587 - 804 enzymes
720 - 988 metabolic reactions
436 / 1303 metabolites / compounds
Varma & Palsson (1994) Appl. Env. Micro. 60:3724.
Karp et al. (1998) NAR 26:50. EcoCyc
Selkov, et al. (1997) NAR 25:37. WIT
Robison and Church http://arep.med.harvard.edu
Conceptual Data Model
Biomolecule
Interaction,
Growth,
Expression, &
Database:
Project : TBEID1
Model : TBEID
(C)
Author : John Aach Version: 1.04 7/7/97
Condition Set
Condition Set Number
Description
Comment
(S)
Strain Mix
Strain Mix Number
Strain Mix Name
Description
Preparation Comments
(P)
(S,N)
used in
Strain
Strain Number
ProgenitorInd
Description
Comment
used in
used in
Competition PhenotypeExpt
Starting Cell Count
Starting Cell Density
Protein Preparation Set
Prot Prep Set Number
Description
Comment
input to
used in
Strain Phenotype Expt
Starting Cell Count
Starting Cell Density
DNA Protein BindingExpt
Experiment Info
Experiment Number
Experiment Type
Experimenter Name
Description
Comment
Start Time
End Time
Outcome Comment
Success Code
Sample Size
OpenInd
described by
described by
Protein Protein BindingExpt
described by
described by
has
BIGED
exhibits
exhibits
Experiment Measures Set
Expt Measures Set No
Time of Measurement
Expt Measures Set Type
Description
Comment
Raw Data Sets Descrip
Data Transform Descrip
Outcome Comment
Success Code
Date Recorded
Sample Size
OpenInd
exhibits
has
Results Selection
Results Selection Code
Expt Measures Set Type
Results Selection Description
exhibits
exhibits
exhibits
John Aach
Harvard Center for
Computational Genetics
Growth
Rel Growth Mutant
Std dev Rel Growth Mutant
Winner Mutant Ind
Rel Growth All
Std dev Rel Growth All
Winner All Ind
Footprint
Fraction Occupancy
exhibits
St Dev Frac Occupancy
exhibits
mRNA Expression
mRNA Expression Level
Std dev Express Level
Protein Expression
Cell Fraction
Protein State Exp Level
Std Dev Prot State Level
DNA Seq Binding
DNA Seq Bind Const Num
DNA Sequence
Binding Constant
Std Dev Binding Constant
Non Specific DNA Binding
Non Specific BindingConst
Std Dev Non Spec BindConst
Protein Protein Binding
Binding Level
Std Dev Binding Level
Submodel cross-references:
* = main model,
C = Condition Set Entities,
D = DNA and Protein Elements,
N = Names,
P = Protein Preparation Entities,
S = Strain and Strain Mix Entities
Functional Genomics:
Growth, Expression, & Interaction
Why?
Sampled sequence vs. Completed genomes
Random vs. Engineered mutations & environments
Evolutionary models vs. High-throughput assays
Pure comparative genomics challenge:
15% amino acid identity:
Globins retain heme & oxygen binding functions
100% amino acid identity:
Enolase functions vary from enzymatic to
major vertebrate lens structural component.
Escherichia coli & Saccharomyces cerevisiae
Regulatory and Metabolic Networks
Expression
DNA
kR
kD
Growth rate
RNA
Protein
kP
kI
Interactions
Environments
kc
Metabolites
kD , kD , kD : Initiate, Elongate, Terminate, Fold, Modify, Localize, Degrade
Translating successful strategies: Metrics
(physics envy & killer applications)
Automate          Data quality                      Model quality        Similarity search
X-ray (1960)      diffraction resolution < 0.2 nm   |o-c|/o, R < 0.2     DALI
Sequence (1988)   discrepancy, bp < 0.01%           conserved proteins   BLAST
Function (1999)   completion                        DNAgibbs             CorFun
(growth, expression, & interaction; CorEnvironment)
Ratio of strains over environments, e,
times, t_e, and selection coefficients, s_e:
$R = R_0 \exp(-s_e t_e)$
80% of 34 random yeast insertions have s<0.3% or s>0.3%
at t = 160 generations, e = 1 (rich media); ~50% for t = 15, e = 7.
Should allow comparisons with population allele models.
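As a minimal sketch of how the ratio relation R = R_0 exp(-s t) can be inverted to estimate a selection coefficient from a competitive growth readout, assuming a single environment and a single time point. The function names and the numbers in the usage note are illustrative, not taken from the experiments.

```python
import math

def expected_ratio(r0, s, t):
    """Strain ratio after t generations: R = R0 * exp(-s * t)."""
    return r0 * math.exp(-s * t)

def selection_coefficient(r0, r, t):
    """Invert R = R0 * exp(-s * t) to recover s from two measured ratios."""
    return -math.log(r / r0) / t

# Hypothetical example: a 0.3% per-generation disadvantage over 160 generations
r_final = expected_ratio(1.0, 0.003, 160)
s_est = selection_coefficient(1.0, r_final, 160)
```

With these illustrative inputs, `s_est` recovers the 0.003 used to generate `r_final`; with real tag-readout ratios, measurement noise would set the smallest detectable s.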
Other multiplex competitive growth experiments:
Thatcher, et al. (1998) PNAS 95:253.
Link AJ (1994) thesis; (1997) J Bacteriol 179:6228.
Smith V, et al. (1995) PNAS 92:6479.
Shoemaker D, et al. (1996) Nat Genet 14:450.
Multiplex: Tag(Mix) > Process > Decode
Internal standards, identical conditions, microscale
Multiplex DNA sequencing.
Church GM. Kieffer-Higgins S. (1988) Science. 240:185.
Physical mapping of complex genomes by cosmid multiplex
analysis. Evans GA. Lewis KA. (1989) PNAS 86: 5030.
Multiplexed biochemical assays with biological chips.
Fodor SP, et al. (1993) Nature 364:555.
Lashkari DA, et al. (1995) An automated multiplex
oligonucleotide synthesizer. PNAS 92(17):7912.
Multiplex Competitive Growth Experiments
107 Environments (so far)
Media and additives:
minimal media, yeast extract, synthetic rich, Low N, Low P, NaCl,
urine, pancreatin, Bile, Cholate, triton X-100, 2 acetate,
4 butyrate, 6 hexanoate, homoserine lactone
Combinatorial:
a,H,F,Q,t
g,L,Y,N,S
C,I,W,u,E
M,K,T,D,dap
V,P,R,G,thiamine
a,g,C,M,thiamine
H,L,I,K,V
F,Y,W,T,P
Q,N,u,D,R
t,S,E,dap,G
pyridoxin,nicotinate,biotin,pantothenate,A
pH: 5, 6, 7, 8, 9
Temperature: 25, 30, 37, 45
Genome Engineering
Challenges: Construct any mutant in any background,
multiple mutants, minimizing hitchhiking mutants.
Avoid undesired residual activities and neomorphic
effects on adjacent genes in most deletion, insertion,
nonsense, or antisense alleles.
Full in-frame replacements, computationally track
gene overlaps, primer & genomic repeats.
Link, et al. (1997) J. Bacteriol. 179: 6228-6237. (pKO3)
http://arep.med.harvard.edu
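Crossover PCR in-frame deletions / tag substitutions
[Diagram: gene of interest (ATG to TAA) next to a nearby gene (ATG to TAA); outer primers carry a NotI site and a Bam site; crossover primers carry the tag (c-tag), leaving ATG, tag, TAA in place of the gene of interest.]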
pKO3: in-frame tagged deletions
Plasmid features: repA ts, sacB, tag, cam R, M13 ori
Integration: 43°, Cam
Resolving the cointegrant: 30°, sucrose
Outcomes: 1 = wild type, 2 = mutant (tag)
Primer design for size-tagged PCR
(universal tag primer plus size-tagged primers, 3% agarose)
Deleted Orf   length
ygfX          789
yiaU          518
yhcS          348
ydhB          266
yfiE          194
ygoX          141
pssR          106
Competitive Growth Rate
Tag Readout
Effects of pH in rich media
[Figure: % change from inoculate (-200 to 700) for r' at pH 5, 6, 7, 8, and 9, by strain: pssR, farR, nhaR, ydhB, yhcS, yidP, yhiF, yidL, uw6519.]
Genome Engineering
Current status
5
46
24
20
Highly Expressed Genes
Putative regulatory FUNs
Highly conserved FUNs
Flux Balance Predictions
Link
Phillips
Loferer
in prep.
Flux balance model
with max growth objective:
S.v = b
S = stoichiometric matrix (m x n)
v = vector of n fluxes
b = I/O rate vector
n = 720 metabolic fluxes
m = 436 metabolites
Predict major flux changes: zwf-, zwf- pnt-
& synthetic lethals: zwf- pgi-
[Figure: flux map of central metabolism starting from Glucose, with three predicted flux values at each reaction. Labeled nodes: G6P, 6PGA, 6PG, F6P, FDP, Ru5P, R5P, X5P, S7P, E4P, GA3P, DHAP, DPG, 3PG, 2PG, PEP, Pyr, AcCoA, Ac, For, Cit, Icit, KG, SuccCoA, Succ, Fum, Mal, OAA, ATP, NADH, NADPH, FADH, QH2, H+.]
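The flux balance idea (find fluxes v that satisfy S.v = b and maximize growth) can be sketched on a toy network. Everything below is an assumption made for illustration: the four-reaction network, the bounds, and the brute-force integer grid search, which stands in for the linear programming solver a real 436 x 720 model would need. Exchange fluxes are written as columns of S, so b is zero here.

```python
from itertools import product

# Hypothetical toy network (not the E. coli model):
# v0: uptake -> A,  v1: A -> B,  v2: A -> byproduct,  v3: B -> biomass
S = [
    [1, -1, -1, 0],   # metabolite A
    [0, 1, 0, -1],    # metabolite B
]
BOUNDS = [(0, 10), (0, 6), (0, 10), (0, 10)]

def mass_balance(S, v):
    """Return S.v, the net production rate of each metabolite."""
    return [sum(s * x for s, x in zip(row, v)) for row in S]

def best_flux(S, bounds, b, objective_index):
    """Grid-search integer fluxes with S.v = b, maximizing one flux."""
    best = None
    for v in product(*(range(lo, hi + 1) for lo, hi in bounds)):
        if mass_balance(S, v) != b:
            continue
        if best is None or v[objective_index] > best[objective_index]:
            best = v
    return best

wild_type = best_flux(S, BOUNDS, [0, 0], objective_index=3)

# A knockout is modeled by forcing one reaction's flux to zero.
knockout_bounds = list(BOUNDS)
knockout_bounds[1] = (0, 0)
knockout = best_flux(S, knockout_bounds, [0, 0], objective_index=3)
```

In this toy case the wild type reaches a biomass flux of 6 (limited by the A -> B capacity), while the knockout cannot make biomass at all, which is the pattern a lethal prediction would show.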
Non-coding regions:
E. coli:
11%
Yeast:
25%
Human: 95%
Similarity searching for environments,
growth, expression, & interaction data
and then the
Challenges of DNA sequence motifs:
short motifs & limited alphabet (4)
CorFun = Zg.ZgT / n
n = # environ + genotypes
g = gene sites
(switching n & g gives CorEnv)
(growth, expression, &/or interaction)
[Figure: CorFun matrix showing positive and negative correlation among kdgT, YidX, rspA, mtlA3', mtlA5', o184, ppiA, f105, hrsA, f214, carAB, YiaK, o85, pspA, Yggn. Annotated clusters: catabolite repression (glucose & Crp regulated); log vs. stationary phase regulated.]
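A minimal sketch of the CorFun product, assuming Zg holds one row per gene with measurements standardized across the n conditions, so that Zg.ZgT / n is the matrix of Pearson correlations between genes. The data layout and function names are assumptions; rows with zero variance are not handled.

```python
import math

def zscore_rows(data):
    """Standardize each row to mean 0 and (population) standard deviation 1."""
    z = []
    for row in data:
        n = len(row)
        mean = sum(row) / n
        sd = math.sqrt(sum((x - mean) ** 2 for x in row) / n)
        z.append([(x - mean) / sd for x in row])
    return z

def corfun(data):
    """Gene-by-gene correlation matrix: Z.Z^T / n over n conditions."""
    z = zscore_rows(data)
    n = len(data[0])
    return [[sum(a * b for a, b in zip(zi, zj)) / n for zj in z] for zi in z]

def corenv(data):
    """Condition-by-condition version: swap the roles of genes and conditions."""
    transposed = [list(col) for col in zip(*data)]
    return corfun(transposed)

# Hypothetical measurements: 3 genes x 4 conditions
example = [[1, 2, 3, 4], [2, 4, 6, 8], [4, 3, 2, 1]]
c = corfun(example)
```

Here the first two hypothetical genes are perfectly correlated and the third is perfectly anti-correlated with both, so `c` contains values near +1 and -1.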
Expression data from four cultures
allow three comparisons:
1) glucose, 30°C, mating type a
2) galactose, 30°C, mating type a
3) glucose, 30°C, mating type α
4) glucose, 30°C -> 39°C shock, mating type a
Expression Quantitation Options
1) n-dimensional cDNA or protein displays
2) Computer selected oligomer-arrays
photolithographic or piezoelectric deposition
3) Gridded microarrays from clones
4) Counting 13-bp cDNA tags (SAGE)
(20,000 tags means <800 RNAs have S/N>4)
Lockhart, et al. (1997) Nature Biotechnology 15:1359.
DeRisi, et al. (1997) Science 278:680.
Velculescu, et al. (1997) Cell 88:243.
Galactose Regulatory Network
GAL4
GAL80
Gal4p-Gal80p inactive complex
GALACTOSE
Gal1p
Gal3p
GAL3
Gal4p-Gal80p active complex
?
GCY1
PGM2
MEL1
GAL2
GAL7
Structural Genes For Galactose Metabolism
GAL10
GAL1
GAL3: Fold Change in Expression between
Growth in Galactose and Growth in Glucose
(Median Fold Change is 3.1)
[Figure: bar chart of Fold Change (0 to 30) by Probe Number (1 to 19).]
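A per-gene median fold change across probes can be sketched as follows. The intensity floor and the probe values in the usage note are assumptions for illustration, not the chip processing actually used.

```python
from statistics import median

def median_fold_change(galactose, glucose, floor=1.0):
    """Median over probes of the galactose/glucose intensity ratio.

    Intensities below `floor` are raised to it so that near-zero
    readings do not produce unbounded ratios (an assumed convention).
    """
    folds = [max(a, floor) / max(b, floor) for a, b in zip(galactose, glucose)]
    return median(folds)

# Hypothetical probe intensities for one gene
fc = median_fold_change([30, 62, 10], [10, 20, 10])
```

Taking the median rather than the mean keeps a single outlying probe from dominating the gene-level estimate.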
Relative expression of all genes:
Galactose vs. Glucose
[Figure: histogram of Number of Genes (1 to 10000) by Log of Fold Change (-2.0 to 2.0), with a per-gene table (orf ID/gene, chip, # probes, med FC, cons FC, log expr ratio) headed by YBR020w/GAL1, YBR018c/GAL7, and YBR019c/GAL10.]
To analyze the most induced genes, we...
• Extracted the intergenic DNA sequence upstream
of each translation start using the Saccharomyces
Genome Database.
• Used an algorithm for multiple sequence
alignment to look for sequence motifs conserved
among the most induced (or repressed).
• Looked at the intersection of genes which both
matched a conserved motif and were induced (or
repressed)
Gibbs Motif Sampling Strategy
1 Initialize the alignment by choosing a random subset of all
possible sites as the ‘site’ alignment, and use all remaining
sequences to give a ‘non-site’ alignment.
2 Select a potential site from among all possible sites.
3 If the site is in the alignment, take it out.
4 Calculate the relative likelihood that the potential site
belongs with the site alignment rather than the ‘non-site’
alignment, based on a Bayesian multinomial distribution
model.
5 Randomly choose whether or not to add the site, weighted
by this relative likelihood.
6 Repeat from Step 2.
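The steps above can be sketched in a few lines. This is a simplified illustration, not the DNAGibbs code: it assumes a fixed motif width, a uniform fixed background in place of the 'non-site' alignment, a hypothetical prior site probability, and it ignores overlapping sites and the reverse strand.

```python
import random

BASES = "ACGT"

def all_sites(seqs, width):
    """Enumerate every (sequence index, start) window of the given width."""
    return [(i, p) for i, s in enumerate(seqs) for p in range(len(s) - width + 1)]

def column_counts(seqs, sites, width, pseudo=0.5):
    """Per-column base counts over the aligned sites, with pseudocounts."""
    counts = [{b: pseudo for b in BASES} for _ in range(width)]
    for i, p in sites:
        for k, base in enumerate(seqs[i][p:p + width]):
            counts[k][base] += 1
    return counts

def likelihood_ratio(word, counts, background):
    """P(word | site model) / P(word | background model)."""
    ratio = 1.0
    for k, base in enumerate(word):
        total = sum(counts[k].values())
        ratio *= (counts[k][base] / total) / background[base]
    return ratio

def gibbs_sample(seqs, width, n_iter=2000, prior=0.05, seed=0):
    rng = random.Random(seed)
    sites = all_sites(seqs, width)
    background = {b: 0.25 for b in BASES}  # assumed uniform background
    # Step 1: random subset of sites as the initial 'site' alignment
    aligned = set(rng.sample(sites, max(1, len(sites) // 10)))
    for _ in range(n_iter):
        site = rng.choice(sites)       # Step 2: pick a potential site
        aligned.discard(site)          # Step 3: take it out if present
        counts = column_counts(seqs, aligned, width)
        i, p = site
        ratio = likelihood_ratio(seqs[i][p:p + width], counts, background)  # Step 4
        odds = ratio * prior / (1 - prior)
        if rng.random() < odds / (1 + odds):  # Step 5: weighted random choice
            aligned.add(site)
    return aligned
```

Because sampling is random, repeated runs with different seeds can settle on different alignments; comparing their scores is one way to pick among them.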
‘DNAGibbs’: A Modified Gibbs Motif
Sampler Optimized for DNA searches.
• Either forward or reverse strand of a potential site -- but
not both -- may be added to the alignment.
• Near-optimum sampling method was improved so that it is
faster and tends to result in higher scoring alignments.
• Simultaneous multiple motif searching was replaced with
a more efficient iterative masking approach.
• The model for base frequencies of non-site sequence was
fixed using the average nucleotide frequencies of
S. cerevisiae.
• Now runs on DEC Unix and Windows platforms, in
addition to the formerly supported SGI and Sun Unix
platforms.
Finally, exclude motifs with:
• DNAGibbs (maximum log a posteriori likelihood ratio)
scores less than 5.
• Good matches (Z < 3 sd below the mean of the aligned
positive motifs) with greater than 10% of all yeast genes
(ORFs)
*O.G. Berg & P.H. von Hippel, J. Mol. Biol., 193: 723-750 (1987)
Using the top 10 genes induced in galactose,
DNAGibbs found UASG, the site recognized by Gal4p
[Figure: sequence logo of the motif found; y-axis: Information (Bits)]
Previous UASG consensus: CGYTCGGA-GA-AGT---CCGA
Sequence logos were developed by T.D. Schneider & R.M. Stephens,
Nucleic Acids Res., 18: 6097-6100 (1990).
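The logo's vertical axis can be illustrated with a short calculation of information per aligned position, taken here as 2 bits minus the Shannon entropy of the observed base frequencies. This sketch assumes equal-length aligned sites and omits the small-sample correction; the example sites are made up.

```python
import math

def information_bits(sites):
    """Information content (bits) at each column of equal-length aligned sites."""
    width = len(sites[0])
    bits = []
    for k in range(width):
        column = [s[k] for s in sites]
        entropy = 0.0
        for base in "ACGT":
            p = column.count(base) / len(column)
            if p > 0:
                entropy -= p * math.log2(p)
        bits.append(2.0 - entropy)
    return bits

# Hypothetical aligned sites: two invariant columns, one uninformative column
profile = information_bits(["CGA", "CGC", "CGG", "CGT"])
```

Invariant columns score 2 bits and a column with all four bases equally represented scores 0, which is why conserved positions stand tall in a logo.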
Genes that changed between galactose and glucose by more
than 2-fold and have strong matches to the UASG motif
Gene      Fold Change   Best Z-Score   # of Sites
GAL1      >65           -1.4           5
GAL7      >42           -0.7           2
GAL10     >38           -1.4           5
GCY1      >12           0.5            1
GAL2      >8            0.4            4
YPL066W   >6            -1.1           1
YPL067C   >6            -1.1           1
YMR318C   4             1.1            1
GAL3      >3            2              2
Galactose Regulatory Network
GAL4
GAL80
Gal4p-Gal80p inactive complex
YPL067C
YPL066W
GALACTOSE
?
YMR318C
Gal1p
Gal3p
GAL3
Gal4p-Gal80p active complex
?
GCY1
PGM2
MEL1
GAL2
GAL7
Structural Genes For Galactose Metabolism
GAL10
GAL1
DNAGibbs and mating type
Motif           Score   %ORF   Consensus                   Similarity
mtα-1 (A)       8.9     0.11   ttcctarttng                 P Box
mta-1 (B)       8.5     0.05   anwncwnkmaananantcwtbwtnw
mta-2 (C)       5.0     0.10   aaaycawmawnanwa
mta-3 (D)       28.1    0.31   grnawktacayg                α2-bind, mtα-mta-1
mtα-mta-1 (E)   20.7    0.34   crtgtanntwyc                α2-bind, mta-3
mtα-mta-2 (F)   5.3     0.13   kwtnywnnnknnntgtttsa        PRE, mtα-mta-2
mtα-mta-3 (G)   8.6     0.27   tgamaywwtnaama              PRE, mtα-mta-1
mtα-mta-4 (H)   5.3     0.31   rmtgmcngcma                 Q Box
Expected DNABP consensus:
Site      Consensus          Protein
P Box     tttcctaattaggnan   Mcm1p
Q Box     tcaatgacag         Matα1p
α2-bind   crtgtaawt          Matα2p
PRE       tgaaaca            Ste12p
Ref: Herskowitz, et al., in Gene Expression, E. W. Jones,
et al., Eds. (CSHL Press, NY, 1992), vol. 2: pp. 583-656.
Calibration of 60 E. coli
binding site matrices
[Figure: Z-score (0 to 10) for each matrix: rpoN, cysB, melR, rpoE, flhCD, hipB, tus, araC, rpoH13, ilvY, rpoH14, marR, lacI, carP, deoR, ada, cynR, fhlA, iclR, rhaS, ntrC, galR, fnr, gcvA, lexA, pdhR, arcA, purR, fadR, nagC, torR, cspA, fruR, phoB, metJ, fur, cytR, argR, tyrR, metR, oxyR, ihf, soxS, trpR, glpR, farR, narL, fis, dnaA, crp, rpoS, malT, rpoD19, lrp, hns, ompR, rpoD18, rpoD16, rpoD17, rpoD15.]
Interaction Quantitation Options
Over-expression:
Yeast two-hybrid screens (in vivo complexity)
In vitro chip assays
Martha Bulyk, David Lockhart, Erik Gentalen
Natural levels, environmental regulation:
Subcellular fractionation (unstable)
In vivo footprinting (partners unknown)
In vivo crosslinking
Combinatorial ds-DNA Chips
(chemical, photo & enzymatic synthesis)
[Diagram labels: SiO2 surface, spacer, n-mer, mask, hν, primer (5' to 3'), Polymerase, specific 16-mer; example sequences ACACAC and AACCGG.]
Interaction Quantitation Options
Over-expression:
Yeast two-hybrid screens (in vivo complexity)
In vitro chip assays
Natural levels, environmental regulation:
Subcellular fractionation (unstable)
In vivo footprinting (partners unknown)
In vivo crosslinking
Martin Steffen, Andy Link
Isolate in vivo crosslinked complexes
by nucleic acid
CsCl (or hybridization)
by protein epitope tag
analyze protein by
DNase 2D gel,
trypsin-LC-ESI-MS/MS
analyze DNA/RNA by chip
Link et al. (1997) Electrophoresis 18:1259 & 1314
Rich media log-phase, in vivo crosslink, DNaseI digest
[Figure: 2D gel, pH 4 to 7 by 10 to 100 kDa; labeled spots: grpE, lacI, sspA, efp, ssb, dps, fur, hns, ihfB, purE.]
In vivo crosslinking & footprinting summary
11% of the E. coli genome is non-coding.
About 340 / 4328 proteins are likely DNA-binding
proteins (2 or the top 380 proteins).
24/25 footprinted GATC sites are non-coding.
Odds = 10^-27.
2/3 crosslinked DNA molecules are likely regulatory
binding sites. Odds = 0.04.
8/11 top DNA-crosslinked proteins are
known DNA-binding proteins. Odds = 10^-16.
Thoughts on chips for crosslinked
epitope selections (& generally).
An easy 10-fold enrichment but
with 40,000 fragments means
an expensive 1:4000 Signal:Noise,
if sequencing (or SAGE) were used.
However, spread over a chip, 1:10.
E. coli oligonucleotide chip challenges:
#1) Closely spaced transcripts, e.g. carAB:
(Intergenic 25-mers overlap, start 6 bp apart on average)
P1(pyrimidine)
...
48 bp
...
P2(arginine)
gggtaagcaaatttgcattgcttcatactgactgaatgaattaatatgcaaataaagtg
#2) Repeats, e.g. tufA & tufB DNA. Mismatches: *
.....*.........*..*.......................................................................
..........................................................................................
..........................................................................................
..........................................................................................
..........................................................................................
..........................................................................................
................................................*.........................................
..........................................................................................
..........................................................................................
....................................................................................*.....
..........................................................................................
............................................................*.............................
......*.................*..*........*.......................*.............................
*.............
From Genome Sequences to
Regulatory Network Phenotypes
Summary
Expression: Cell-type & condition clustering plus
DNAGibbs algorithm extracts intergenic binding motifs
for yeast Gal-Glc, Matα-Mata, & 30°C-39°C comparisons.
Interaction: Strong enrichment for low abundance
wild-type & mutant in vivo E. coli DNA-protein contacts
establishes mechanistically anchored intergenic elements.
Growth: Multiplex competitive growth of in-frame
replacements for novel E. coli regulatory genes defines
cellular system integration & environments.
Escherichia coli & Saccharomyces cerevisiae
Regulatory and Metabolic Networks
Population Selection, Flux Balance,
& Gibbs
Expression
DNA
kR
kD
Growth rate
RNA
Protein
kP
kI
Interactions
Environments
kc
Metabolites
kD , kD , kD : Initiate, Elongate, Terminate, Fold, Modify, Localize, Degrade