Introduction
Goal: Identify loci associated with variation in
expression levels
[Figure: nucleus, showing regulators, genomic DNA, mRNA, and target mRNA.]
Cis and Trans regulation
[Figure: a trans-regulator and a cis-regulator acting on the target gene expression phenotype.]
Data
Centre d'Etude du Polymorphisme Humain (CEPH) families are Utah
residents with ancestry from northern and western Europe.
• 14 families with genotype and expression data available for all
parents and a mean of eight offspring (range 7-9)
Method: Linkage analysis
IBD: identical-by-descent
Parents: A1 A2 x A3 A4
• Sibs A1 A3 and A1 A3: IBD=2
• Sibs A1 A3 and A1 A4: IBD=1
• Sibs A1 A3 and A2 A4: IBD=0
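The three IBD cases above can be counted directly when all four parental alleles are distinct, as in the diagrams. A minimal Python sketch; the function name `ibd_count` is illustrative, and it assumes each sib genotype is written as (allele from the first parent, allele from the second parent):

```python
def ibd_count(sib1, sib2):
    """Number of alleles two siblings share identical-by-descent.

    Assumes the four parental alleles are all distinct (A1 A2 x A3 A4) and
    each genotype is a (first_parent_allele, second_parent_allele) tuple.
    """
    shared_first = sib1[0] == sib2[0]
    shared_second = sib1[1] == sib2[1]
    return int(shared_first) + int(shared_second)


print(ibd_count(("A1", "A3"), ("A1", "A3")))  # 2
print(ibd_count(("A1", "A3"), ("A1", "A4")))  # 1
print(ibd_count(("A1", "A3"), ("A2", "A4")))  # 0
```

With real markers the parental alleles are not always distinct, so IBD generally has to be estimated rather than read off as it is here.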
For a particular target gene expression phenotype:
[Figure: t-statistics (vertical axis, 5 to 15) plotted against genetic locus (horizontal axis, SNP 1 to 5).]
Cis- and trans-regulation
Under criterion 1,
• 27/142 (19%) expression phenotypes have only a single
cis-regulator.
• 110/142 (77.5%) expression phenotypes have only a
single trans-regulator.
• 2/142 have a cis- and a trans-acting regulator.
• 3/142 expression phenotypes have two trans-acting regulators.
Under criterion 2,
164/984 (16%) have multiple regulators.
Models of gene expression
regulation are needed
GAL Genes:
Eukaryotic Transcriptional Regulation
• Unlike prokaryotes, eukaryotes do not have genes in
operons (most mRNAs are not polycistronic).
• The GAL genes of S. cerevisiae are the paradigm for
eukaryotic gene regulation
• Galactose is metabolized by GAL gene products:
Galactose -> (Gal1p) -> Gal-1-P
Gal-1-P + UDP-Glu -> (Gal7p) -> Glu-1-P + UDP-Gal
UDP-Gal -> (Gal10p) -> UDP-Glu
Glu-1-P -> (Gal5p) -> Glu-6-P -> Glycolysis
Eukaryotic
Transcription
• Proteins bind to distal elements called ENHANCERS.
• DNA folding allows these elements to be far from the start site
for transcription.
• Proteins bound to the distal sites promote the binding of RNA
polymerase to the proximal elements.
[Figure: distal and proximal elements.]
GAL Genes: A Transcriptional Program
• The response to galactose is very complex, with a
number of genes being turned on or off.
• The central regulator is a protein called Gal4p.
– Gal4p binds to enhancer elements in DNA and activates
transcription under some circumstances.
Gal4p: A Transcriptional Regulator
• Gal4p binds to enhancer elements near genes that it
regulates (e.g., GAL1).
• Gal4p also binds to Gal80p.
– Gal80p inhibits Gal4p, so releasing this inhibition is necessary
for activation of gene expression.
• When galactose is present, the inhibition is relieved and the
Gal4p-Gal80p complex can activate transcription.
– This activation has now been studied at the level of the whole
genome:
[Figure: data from a microarray experiment (Science 290:2306 [2000]).]
Examining Transcriptional Regulation
• MICROARRAYS have become very popular as tools to
study gene regulation.
– A microarray is a small glass slide on which cDNAs of many
(or all) genes in an organism have been spotted.
– cDNA is made from the mRNAs present under certain conditions
(or in a certain tissue) and labeled with fluorescent dyes.
– Then the labeled cDNA is hybridized to the microarray and
the fluorescence is measured.
• There is a nice animation describing this at:
– http://www.bio.davidson.edu/courses/genomics/chip/chip.html
– Does this examine transcriptional regulation?
Examining Transcriptional Regulation
• This basic method was extended for the Gal4p study
that we have been discussing.
– For this study, the researchers tagged the Gal4p protein so they
could purify it from the cell.
– Then, they chemically cross-linked it to DNA and purified it.
– This allowed them to purify the DNA that Gal4p was bound to
in the cell.
– The DNA that Gal4p was bound to in the cell was labeled and
used to probe the microarray.
– Does this examine transcriptional regulation?
Examining Transcriptional Regulation
• This study established several interesting facts:
– Some Gal4p binding sites in the DNA are bound by Gal4p even
in the absence of galactose; others are bound only in
the presence of galactose.
– So the trigger is more complex than simply whether or not the
Gal4p protein can bind.
– This more complex regulation involves Gal80p, an inhibitor.
[Figure: Two possible models for regulation of the Gal4p-Gal80p complex
by galactose. The models differ only in the exact binding sites for Gal80p.]
How do Eukaryotic Transcriptional
Regulators Work?
• There are a few specific types of proteins that act to
increase transcriptional activity:
– Many proteins have an acidic domain.
• Surprisingly, these “acid-blob” proteins often require a
hydrophobic residue embedded in an acidic region.
• Both Gal4p and the herpes simplex virus VP16 protein (a
transcriptional regulator for this virus) have acid blobs.
– Glutamine-rich and Proline-rich transcriptional activation
domains have been characterized.
• These protein regions activate transcription when
fused to other DNA-binding domains.
– Alternatively, they can be recruited by protein-protein
interactions - e.g., a DNA-binding protein binds the enhancer,
and it contains a region that recruits an acid-blob protein.
Using Eukaryotic Transcriptional Regulators
• The yeast 2-hybrid system exploits these features of
eukaryotic transcription factors to examine protein-protein interactions.
– The DNA-binding and transcription activating regions of Gal4p
can be separated.
– Interestingly, if you fuse one protein to the Gal4p DNA-binding
domain (BD) and a second protein that it interacts (physically)
with to the Gal4p transcriptional activating domain (AD), one
can see transcriptional activation:
How do Eukaryotic Transcriptional
Regulators Work?
• Another interesting phenomenon that is sometimes
seen with transcription factors is SQUELCHING.
– Overexpression of transcription activators like Gal4p can
result in a general inhibition of transcriptional activity.
– How does this happen?
– Presumably, specific transcription factors like Gal4p act by
recruiting “basal” transcription factors.
• In fact, some basal factors that physically interact with these
transcription activating domains have been found.
• Basal factors are factors involved in recruiting RNA polymerase II
to a large number of promoters.
– So overexpressing proteins with these transcription activating
domains can actually turn gene expression off, by competing
for these factors.
How do Eukaryotic Transcriptional
Regulators Work?
• At least one way is by altering the packing of DNA into
chromatin.
• The role of chromatin structure in the regulation of
transcription is an area of very active investigation.
• However, two important factors that play clear roles in
transcriptional regulation are known:
– DNA METHYLATION - A subset of cytosine (C) residues are
modified by methylation.
– HISTONE ACETYLATION - Histones can be modified by
acetylation.
Chromatin
• Remember, DNA in eukaryotes packs into CHROMATIN.
• HISTONES form the NUCLEOSOME, which DNA loops around.
• EUCHROMATIN - less compact; actively transcribed.
• HETEROCHROMATIN - more compact; transcriptionally inactive.
– Heterochromatin can be either constitutive or facultative.
DNA Methylation
• Genes that are transcriptionally inactive are often
METHYLATED.
– In eukaryotes, cytosine residues are modified by methylation.
[Figure: chemical structures of CYTOSINE and METHYL-C; methyl-C carries an added CH3 group on the ring.]
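• Typically, the sites of methylation are CG dinucleotides
(vertebrates).
– Because CG reads the same on both strands, the methylated parental
strand can guide methylation of the newly made strand.
– This allows maintenance through replication.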
Histone Acetylation
• HISTONES in transcriptionally active genes are often
ACETYLATED.
• Acetylation is the modification of lysine residues in
histones.
– Reduces positive charge, weakens the interaction with DNA.
– Makes DNA more accessible to RNA polymerase II
• Enzymes that ACETYLATE HISTONES are recruited to
actively transcribed genes.
• Enzymes that remove acetyl groups from histones are
recruited to methylated DNA.
– There are additional types of histone modification as well,
such as methylation of the histones.
Genetic Imprinting
• Remember that DNA methylation can be maintained
through replication.
• This allows the packing of chromatin to be passed on just like a gene sequence.
– However, differences in chromatin packing are not as stable as
gene sequences.
• Heritable but potentially reversible changes in gene
expression are called EPIGENETIC phenomena
– Vertebrates use these differences in chromatin packing to
IMPRINT certain patterns of gene regulation.
– Some genes show MATERNAL IMPRINTING while others show
PATERNAL IMPRINTING.
• The alleles of some genes that are inherited from the
relevant parent are methylated, and therefore are not
expressed.
Prader-Willi &
Angelman Syndromes
• Both of these genetic disorders are caused by
deletion of a region of chromosome 15.
• However, the syndromes differ:
– Prader-Willi Syndrome - obesity, intellectual disability,
short stature. (abbreviated PWS)
– Angelman Syndrome - uncontrollable laughter, jerky
movements, and other motor and mental symptoms.
(abbreviated AS)
• Syndrome that develops depends upon the
parent that provided the mutant chromosome.
[Figure panels: PWS; AS; PWS mouse model; AS mouse model. From Annu Rev Genomics & Hum Genet.]
Prader-Willi & Angelman Syndromes
• Prader-Willi Syndrome - develops when the
abnormal copy of chromosome 15 is inherited
from the father.
• Angelman Syndrome - develops when the
abnormal copy of chromosome 15 is inherited
from the mother.
• The differences reflect the fact that some loci
are IMPRINTED - so only the allele inherited
from one parent is expressed.
– The region contains both maternally and paternally
imprinted genes.
Methylation and Gene Regulation
• For imprinted genes, the pattern of gene
regulation is dependent upon the parent
that donated the chromosome.
– The methylation pattern is “reprogrammed”
in the germ line.
• There are other examples of methylation
changes that regulate gene expression.
– In mammals, one of the two X chromosomes
in females is inactivated.
– The inactivated X is methylated.
THEREFORE, GENE EXPRESSION
IS IMPORTANT
FOR UNDERSTANDING GENETIC
INHERITANCE
Genomics, Bioinformatics, and Gene
Regulation
Marc S. Halfon, Ph.D.
mshalfon@buffalo.edu
Department of Biochemistry
Center of Excellence in Bioinformatics and the Life Sciences
Based on presentation for UB/CCR Summer Program in
Bioinformatics 2004
Genome Sequencing
As of 6/25/04 (As of 7/25/05)
1128 (1496) genome projects:
199 (274) complete (includes 28 (36) eukaryotes)
508 (728) prokaryotic genomes in progress
421 (494) eukaryotic genomes in progress
Nanoarchaeum equitans (archaebacterium; smallest)        500 kb
Bacillus anthracis (anthrax)                           5,228 kb
S. cerevisiae (yeast)                                 12,069 kb
Arabidopsis thaliana                                 115,428 kb
Drosophila melanogaster (fruit fly)                  137,000 kb
Anopheles gambiae (malaria mosquito)                 278,000 kb
Oryza sativa (rice)                                  420,000 kb
Mus musculus (mouse)                               2,493,000 kb
Homo sapiens (human)                               2,900,000 kb
http://www.genomesonline.org/
Genome sequencing helps in:
• identifying new genes (“gene discovery”)
• looking at chromosome organization and structure
• finding gene regulatory sequences
• comparative genomics
These in turn lead to advances in:
•medicine
•agriculture
•biotechnology
•understanding evolution and other basic science questions
Because of the vast amounts of data that are
generated, we need new approaches
•high throughput assays
•robotics
•high speed computing
•statistics
•bioinformatics
What’s in a genome?
Genes (i.e., protein coding)
But. . . only <2% of the human genome encodes proteins
Other than protein coding genes, what is there?
• genes for noncoding RNAs (rRNA, tRNA, miRNAs, etc.)
• structural sequences (scaffold attachment regions)
• regulatory sequences
• “junk” (including transposons, retroviral insertions, etc.)
It’s still uncertain/controversial how much of the genome is
composed of any of these classes
The answers will come from experimentation and
bioinformatics. We will discuss further only gene regulation.
Gene expression must be regulated in:
TIME
Wolpert, L. (2002) Principles of Development New York: Oxford University Press. p. 31
Gene expression must be regulated in:
SPACE
Paddock S.W. (2001). BioTechniques 30: 756 - 761.
Gene expression must be regulated in:
ABUNDANCE
Stern, D. (1998). Nature 396, 463 - 466
What happens when gene regulation goes awry?
• Developmental abnormalities (birth defects)
• Disease
- chronic myeloid leukemia
- rheumatoid arthritis
photo credits: Wolpert, L. (2002) Principles of Development New York: Oxford University Press. pp. 183, 340
Genes can be regulated at many levels
• transcription
• post transcription (RNA stability)
(these first two levels determine the “transcriptome”)
• post transcription (translational control)
• post translation (not considered gene regulation)
Usually, when we speak of gene regulation, we are referring to
transcriptional regulation.
The “Central Dogma”: DNA -> (TRANSCRIPTION) -> RNA -> (TRANSLATION) -> PROTEIN
Looking at the transcriptome:
DNA microarrays
One way of looking at the transcriptome is with DNA
microarrays. With microarrays, the expression of thousands of
genes can be assessed in a single experiment.
cDNAs or oligonucleotides representing all genes in the genome
are deposited on a glass slide using a robotic arrayer:
Benfey, P. and Protopapas, A. Genomics. 2005. New Jersey: Pearson Prentice Hall. pp. 131-2
Exploring the Metabolic and Genetic Control of
Gene Expression on a Genomic Scale
Joseph L. DeRisi, Vishwanath R. Iyer, Patrick O. Brown*
Microarray
• Allows measuring the mRNA level of thousands
of genes in one experiment: a system-level response
• The data generation can be fully automated by
robots
• Common experimental themes:
– Time course (when)
– Tissue type (where)
– Response (under what conditions)
– Perturbation: mutation/knockout, knock-in,
over-expression
Looking at the transcriptome:
DNA microarrays
• Extract mRNA from cell type A and from cell type B.
• Make labeled cDNA from each sample.
• Hybridize both to the microarray.
• Read each spot: more in “A”, more in “B”, or equal in A & B.
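The three outcomes above are usually read from the ratio of the two dye intensities at each spot. A minimal Python sketch, assuming a two-fold cutoff and made-up intensity values; the function name `classify_spot` and the `fold` parameter are illustrative and not taken from the slides:

```python
import math


def classify_spot(signal_a, signal_b, fold=2.0):
    """Classify one spot from the intensities of samples A and B.

    The two-fold threshold is an assumed cutoff, not a value from the slides.
    """
    log_ratio = math.log2(signal_a / signal_b)
    threshold = math.log2(fold)
    if log_ratio >= threshold:
        return "more in A"
    if log_ratio <= -threshold:
        return "more in B"
    return "equal in A & B"


# Hypothetical background-corrected intensities for three spots.
print(classify_spot(4000.0, 1000.0))  # more in A
print(classify_spot(500.0, 2000.0))   # more in B
print(classify_spot(1000.0, 1100.0))  # equal in A & B
```

Working on the log2 scale makes a two-fold increase and a two-fold decrease symmetric around zero.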
Looking at the transcriptome: microarrays
statistical processing and analysis
[Figure: expression matrix with genes as rows and conditions (condition 1, condition 2, condition 3) as columns.]
Which Genes to select?
They have a method
• For each gene (row), compute a score defined by
score = (sample mean of X - sample mean of Y) / (standard deviation of X + standard deviation of Y)
• X = ALL (acute lymphoblastic leukemia) samples, Y = AML (acute myeloid leukemia) samples
• Genes (rows) with the highest scores are selected.
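The score above is easy to compute per gene. A minimal Python sketch with made-up expression values; the gene names and numbers are hypothetical, and the use of the sample (n - 1) standard deviation is an assumption:

```python
from statistics import mean, stdev


def gene_score(x, y):
    """(mean of X - mean of Y) / (sd of X + sd of Y) for one gene."""
    return (mean(x) - mean(y)) / (stdev(x) + stdev(y))


# Hypothetical expression values: gene -> (ALL samples, AML samples).
expression = {
    "geneA": ([8.0, 9.0, 10.0], [1.0, 2.0, 3.0]),
    "geneB": ([5.0, 5.5, 6.0], [5.0, 6.0, 5.5]),
    "geneC": ([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]),
}

scores = {gene: gene_score(x, y) for gene, (x, y) in expression.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)  # genes ordered from highest to lowest score
```

Genes that are higher in AML get large negative scores, so ranking by absolute value would pick up both directions.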
That seems to work well.
• 34 new leukemia samples
• 29 are predicted with 100% accuracy;
5 are weak prediction cases
Seems to work! Improvement?
Study of cell-cycle regulated
genes
• The rate of cell growth and division varies:
yeast (120 min); insect egg (15-30 min); nerve
cell (no division); fibroblast (when healing wounds)
• Regulation: irregular growth causes cancer
• Goal: find which genes are expressed at each stage
of the cell cycle
• Yeast cells; Spellman et al (2000)
• Fourier analysis: cyclic pattern
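One common way to look for a cyclic pattern is to project each time course onto a sine and a cosine at the cell-cycle period and take the combined magnitude. A minimal Python sketch under assumptions: the 120-minute period comes from the yeast figure above, the time points and profiles are made up, and `fourier_score` is an illustrative name rather than the authors' exact procedure:

```python
import math


def fourier_score(values, times, period):
    """Magnitude of the sine/cosine projection at the given period."""
    omega = 2 * math.pi / period
    a = sum(v * math.sin(omega * t) for v, t in zip(values, times))
    b = sum(v * math.cos(omega * t) for v, t in zip(values, times))
    return math.hypot(a, b)


times = list(range(0, 240, 10))  # minutes; two hypothetical 120-min cycles
cyclic = [math.sin(2 * math.pi * t / 120) for t in times]  # periodic gene
flat = [1.0] * len(times)                                  # constant gene

print(fourier_score(cyclic, times, 120))  # large for the periodic profile
print(fourier_score(flat, times, 120))    # close to zero for the flat one
```

Unlike a correlation between two profiles, this score depends on the order and spacing of the time points.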
Yeast Cell Cycle
(adapted from Molecular Cell Biology, Darnell et al)
[Figure: yeast cell cycle, with the most visible event labeled.]
Example of the time curve:
Histone gene HHT2 (ORF: YNL031C)
[Figure: expression time course of the histone gene.]
Why does clustering make sense biologically?
Rationale behind massive gene expression analysis:
Genes with a high degree of expression similarity are likely to be
functionally related and may participate in common pathways.
They may be co-regulated by common upstream regulatory factors.
Simply put,
Profile similarity implies functional association
Proteins rarely work as single units
[Figure: some protein complexes]
Gene profiles and correlation
• Pearson's correlation coefficient, a simple way of describing
the strength of linear association between a pair of random variables,
has become the most popular measure of gene expression similarity.
• It is used in:
1. Cluster analysis: average linkage, self-organizing map, K-means, ...
2. Classification: nearest neighbor, linear discriminant analysis, support vector machine, ...
3. Dimension reduction methods: PCA (SVD)
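Pearson's correlation between two gene profiles, the similarity measure these methods build on, can be written out in a few lines. A minimal Python sketch with made-up profiles; the function name `pearson` and the values are illustrative:

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical expression profiles over five conditions.
gene1 = [1.0, 2.0, 3.0, 4.0, 5.0]
gene2 = [2.0, 4.0, 6.0, 8.0, 10.0]   # same shape, larger scale
gene3 = [5.0, 4.0, 3.0, 2.0, 1.0]    # opposite trend

print(pearson(gene1, gene2))  # 1.0
print(pearson(gene1, gene3))  # -1.0
```

Because the coefficient ignores scale and offset, two genes with very different expression levels can still count as perfectly similar.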
The correlation coefficient (CC) was used by Gauss, Bravais, Edgeworth, …
Its sweeping impact on data analysis is due to
Galton (1822-1911),
“Typical laws of heredity in man”.
Karl Pearson modified and popularized its use.
It is a building block of multivariate analysis, in which
clustering, classification, and dimension reduction are recurrent themes.
As a statistician, how can you
ignore the time order ?
(Isn’t it true that the use of sample
correlation relies on the assumption
that data are I.I.D. ???)
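The question above can be made concrete: the sample correlation does not change when the time points of both profiles are shuffled in the same way, so it carries no information about time order. A minimal Python sketch with made-up time courses; the names and values are illustrative:

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical time courses for two genes (same time points).
x = [0.1, 0.8, 1.5, 0.9, 0.2, -0.6]
y = [0.0, 0.5, 1.2, 1.1, 0.4, -0.3]

# Apply one random reordering of the time points to both profiles.
order = list(range(len(x)))
random.Random(0).shuffle(order)
x_shuffled = [x[i] for i in order]
y_shuffled = [y[i] for i in order]

print(pearson(x, y))
print(pearson(x_shuffled, y_shuffled))  # same value up to rounding
```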
... about probabilities.
Microarrays can show us when and
where genes are expressed. But what
regulates this expression?
Mechanisms of transcriptional regulation
regulation in trans:
transcription factors
regulation in cis :
promoters & enhancers
binding sites
Identifying transcription factor binding sites
Usually, binding sites are first determined
empirically.
Most transcription factors can bind to a range of
similar sequences. We can represent these in either
of two ways, as a consensus sequence, or as a
position weight matrix (PWM).
Once we know the binding site, we can search the
genome to find all of the (predicted) binding sites.
Binding site (motif) representations
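7 characterized binding sites for a certain transcription factor:
TCCGGAAGC
TCCGGATGC
TCCGGATCT
CATGGATGC
CCAGGAAGT
GGTGGATGC
ACCGGATGC
Consensus sequence: TCCGGATGC, with alternatives C at position 1, T at position 3, A at position 7, and T at position 9.
PWM (counts at positions 1-9; the sequence logo is not reproduced here):
   1 2 3 4 5 6 7 8 9
A  1 1 1 0 0 7 2 0 0
T  3 0 2 0 0 0 5 0 2
G  1 1 0 7 7 0 0 6 0
C  2 5 4 0 0 0 0 1 5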
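The count matrix above can be rebuilt from the seven binding sites by tallying each base at each position. A minimal Python sketch; the function name `count_matrix` is illustrative:

```python
# The seven characterized binding sites from the slide.
SITES = [
    "TCCGGAAGC",
    "TCCGGATGC",
    "TCCGGATCT",
    "CATGGATGC",
    "CCAGGAAGT",
    "GGTGGATGC",
    "ACCGGATGC",
]


def count_matrix(sites):
    """Count how often each base occurs at each position."""
    length = len(sites[0])
    counts = {base: [0] * length for base in "ATGC"}
    for site in sites:
        for pos, base in enumerate(site):
            counts[base][pos] += 1
    return counts


counts = count_matrix(SITES)
for base in "ATGC":
    print(base, *counts[base])
```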
Finding binding sites in the genome
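Consensus: [TC]C[TC]GGA[TA][GC][CT] (brackets list the bases allowed at a position)
Consensus sequences make searching easy, e.g. by using regular
expressions in Perl:
# SEQUENCE is assumed to be an already-opened filehandle.
while (my $line = <SEQUENCE>) {
    # A character class lists its members without "|" separators.
    if ($line =~ /[TC]C[TC]GGA[TA][GC][CT]/) {
        print $line;    # report each line containing a match
    }
}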
All positions in the motif are treated the same.
Finding binding sites in the genome
PWM (counts at positions 1-9):
   1 2 3 4 5 6 7 8 9
A  1 1 1 0 0 7 2 0 0
T  3 0 2 0 0 0 5 0 2
G  1 1 0 7 7 0 0 6 0
C  2 5 4 0 0 0 0 1 5
A PWM allows us to assign more importance to more invariant
positions. We can calculate a score based on the probability of a
given nucleotide being in a given position.
TCCGGAAGC scores higher than
TCCGGATCT as GC is preferred
over CT in the last two positions
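One way to turn the counts into a score is a log-odds sum over positions. A minimal Python sketch; the pseudocount of 1 and the uniform background of 0.25 are assumptions for illustration, not values from the slides, and `pwm_score` is an illustrative name:

```python
import math

# Counts from the PWM on the slide (rows: base, columns: positions 1-9).
COUNTS = {
    "A": [1, 1, 1, 0, 0, 7, 2, 0, 0],
    "T": [3, 0, 2, 0, 0, 0, 5, 0, 2],
    "G": [1, 1, 0, 7, 7, 0, 0, 6, 0],
    "C": [2, 5, 4, 0, 0, 0, 0, 1, 5],
}
N_SITES = 7  # number of binding sites the counts were built from


def pwm_score(seq, pseudocount=1.0, background=0.25):
    """Sum of log2(observed frequency / background) over all positions."""
    score = 0.0
    for pos, base in enumerate(seq):
        freq = (COUNTS[base][pos] + pseudocount) / (N_SITES + 4 * pseudocount)
        score += math.log2(freq / background)
    return score


print(pwm_score("TCCGGAAGC"))
print(pwm_score("TCCGGATCT"))  # lower than the score on the line above
```

The pseudocount keeps a base that was never seen at a position from producing a log of zero.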
Finding binding sites in the genome
Binding site motifs can be predicted computationally from
the regulatory regions of genes with similar expression
patterns.
For instance, the promoter regions of genes that cluster in a
microarray experiment can be used.
(How can the promoter regions be extracted? You should know
enough Perl at this point to be able to do this, given a well-annotated sequence database.)
Finding binding sites in the genome
seq1:TTTTTATTTTTCTGAATCACCACTTGATATTGCTTCACAGAACT
seq2:CGGGCGGTGAGGCAGAGAAAGAGACCACTTGAAATGTAGTAATA
seq3:CACTTGAATTTTTCTGCACGCAGTTTTTATTTTTACTTTTCTTG
seq4:CGCGTTCGTTATTTGTTGTTGACCACTTGAATTGATTGCTTTAT
seq5:ATCCCGGTCGAGGTGCACTTGATGTTTTCAATGGAAATGTTGCC
seq6:TCTGCAGATTTATGGCCCAACGCTCATTTAACAATTAAAGTGGG
seq7:GCATTAACTCTCACTTCAAAAAATCATATAAACACCTCTAATAT
seq8:TATATTTTCTCGCCACTTAAATAGTTTTCAATGCCAATGGCAGG
seq9:ATCCTTATCGAAGCACTTGGATTTTAAAGCAATCTTTTGAACAC
A Gibbs sampling algorithm can then find the common subsequences:
Of course, we must now discover which transcription factor binds
this sequence.
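As a much simpler stand-in for Gibbs sampling (this is not a Gibbs sampler), one can count in how many of the nine sequences above each exact 7-mer occurs. A minimal Python sketch; the width of 7 and the exact-match requirement are assumptions, and `kmer_sequence_counts` is an illustrative name:

```python
from collections import Counter

# The nine upstream sequences from the slide.
SEQS = [
    "TTTTTATTTTTCTGAATCACCACTTGATATTGCTTCACAGAACT",
    "CGGGCGGTGAGGCAGAGAAAGAGACCACTTGAAATGTAGTAATA",
    "CACTTGAATTTTTCTGCACGCAGTTTTTATTTTTACTTTTCTTG",
    "CGCGTTCGTTATTTGTTGTTGACCACTTGAATTGATTGCTTTAT",
    "ATCCCGGTCGAGGTGCACTTGATGTTTTCAATGGAAATGTTGCC",
    "TCTGCAGATTTATGGCCCAACGCTCATTTAACAATTAAAGTGGG",
    "GCATTAACTCTCACTTCAAAAAATCATATAAACACCTCTAATAT",
    "TATATTTTCTCGCCACTTAAATAGTTTTCAATGCCAATGGCAGG",
    "ATCCTTATCGAAGCACTTGGATTTTAAAGCAATCTTTTGAACAC",
]


def kmer_sequence_counts(seqs, k):
    """For each k-mer, count the number of sequences that contain it."""
    counts = Counter()
    for seq in seqs:
        kmers = {seq[i:i + k] for i in range(len(seq) - k + 1)}
        counts.update(kmers)  # each sequence contributes at most once
    return counts


counts = kmer_sequence_counts(SEQS, 7)
best, n_seqs = counts.most_common(1)[0]
print(best, n_seqs)
print(counts["CACTTGA"])  # 5: present exactly in seq1 to seq5
```

Exact counting misses the near matches in seq7 to seq9 (for example CACTTCA), which is one reason probabilistic methods such as Gibbs sampling are used instead.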
Finding binding sites in the genome
How meaningful are the sites we find?
• Only experiments can tell us for sure
• However, we can get some hints using statistical analysis
Example 1:
We just found the motif CACTTGA upstream of co-expressed
genes. Is it over-represented in this set compared to a
random selection of genes?
Search 100 random sets of genes.
Find the mean (the expected count) and standard deviation of the motif count.
z = (observed - expected) / standard deviation
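The z-score above takes only a few lines to compute once the motif has been counted in the random gene sets. A minimal Python sketch; the counts are made up (and far fewer than 100 sets are shown), and the sample standard deviation is an assumption:

```python
from statistics import mean, stdev


def z_score(observed, random_counts):
    """(observed - expected) / standard deviation, using random-set counts."""
    expected = mean(random_counts)
    return (observed - expected) / stdev(random_counts)


# Hypothetical motif counts in random gene sets of the same size.
random_counts = [0, 1, 0, 2, 1, 0, 1, 1, 0, 2]
observed = 5  # hypothetical count in the co-expressed set

print(z_score(observed, random_counts))
```

A large positive z suggests the motif is over-represented in the co-expressed set relative to random gene sets.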
Finding binding sites in the genome
Example 2:
Many regulatory regions contain multiple binding sites for the
same transcription factor. Is the motif found an unusually
large number of times in a short stretch of sequence?
Crudely:
Probability of finding a 7 bp motif at a given position: 4^-7 = 1/16,384
i.e., expect only about 1 motif every 16 kb.
Thus, finding several close together is very unlikely.
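The crude estimate above can be extended to ask how unlikely several hits in a short stretch are. A minimal Python sketch using a Poisson approximation; the 500 bp window and the cutoff of three sites are made-up values, and it assumes equal base frequencies, independent positions, and a single strand:

```python
import math

MOTIF_LENGTH = 7
p_site = 0.25 ** MOTIF_LENGTH  # chance of the motif at one position
print(1 / p_site)  # 16384.0


def prob_at_least(k, window, p=p_site, motif_length=MOTIF_LENGTH):
    """Poisson approximation to P(at least k motif hits in a window)."""
    lam = (window - motif_length + 1) * p  # expected number of hits
    below = sum(math.exp(-lam) * lam ** i / math.factorial(i) for i in range(k))
    return 1.0 - below


# Hypothetical question: three or more sites within 500 bp.
print(prob_at_least(3, 500))
```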
Transcription factors, binding sites, and target genes
• Identify transcription factors: genetic screens; one-hybrid assays; sequence motifs/homology
• Identify binding motif: bioinformatics (e.g., Gibbs sampling on microarray data); molecular biology using purified protein or protein extracts
• Find all motifs in genome: computational searching; ChIP-chip
• Identify target genes: computational searching; microarrays; genetic screens
How well does it work?
•Although not always that difficult computationally, these
approaches are complex biologically
•Predicted and in vitro binding data do not always accurately
reflect what takes place in vivo
•Transcription factor binding can be affected by local
concentration, by chromatin structure, and by interactions with
other transcription factors
•Many predicted sites may therefore have no actual role
•Functional testing of predictions is very important
Putting things together:
cis-Regulatory Modules (enhancers)
Gene regulation is combinatorial: several transcription factors bind
simultaneously.
We can search for co-occurrence of binding sites for multiple transcription
factors to try to identify regulatory modules.
Another way to try to find regulatory modules is through
comparative genomics
[Figure: % identity (seq1 vs seq2) plotted along the sequence, with a predicted regulatory element marked.]
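A percent-identity profile of the kind plotted in comparative genomics can be computed with a sliding window over two aligned sequences. A minimal Python sketch; the sequences, the window size, and the function name `window_identity` are made up for illustration, and the input is assumed to be already aligned:

```python
def window_identity(aligned1, aligned2, window):
    """Percent identity in each sliding window of two aligned sequences."""
    if len(aligned1) != len(aligned2):
        raise ValueError("aligned sequences must have equal length")
    # 1 where the two sequences agree on a base, 0 for mismatches and gaps.
    matches = [int(a == b and a != "-") for a, b in zip(aligned1, aligned2)]
    profile = []
    for start in range(len(matches) - window + 1):
        identical = sum(matches[start:start + window])
        profile.append(100.0 * identical / window)
    return profile


# Hypothetical aligned sequences from two species.
seq1 = "ACGTACGTAC"
seq2 = "ACGTTTTTAC"
print(window_identity(seq1, seq2, 4))
# [100.0, 75.0, 50.0, 25.0, 25.0, 50.0, 75.0]
```

Windows with unusually high identity in non-coding sequence are candidates for conserved regulatory elements, which then need functional testing.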
Why bother?
Ultimately, we’d like to be able to describe all of development
in terms of gene expression and regulation.
That is, in every cell, at every time, which genes are on or off,
and why?
Gene Regulatory Networks
Even knowing just a little of this gets incredibly complicated:
Regulatory gene network for sea urchin endomesoderm specification
Davidson et al. (2002) Science 295:1669
But imagine understanding how we go from
here . . .
. . . to here . . .
. . . to here!
[Image sources: http://www.alphascientists.com/embryology_images/cleavage_stage_embryos.html ; http://nobelprize.org/medicine]
Further Reading:
Wasserman, W. W. and A. Sandelin (2004). "Applied Bioinformatics For The
Identification Of Regulatory Elements." Nature Reviews Genetics 5(4): 276-287.
Halfon, M. S. and A. M. Michelson (2002). "Exploring Genetic Regulatory
Networks in Metazoan Development: Methods and Models." Physiol
Genomics 10(3): 131-43.
Davidson, E. H. (2001). Genomic Regulatory Systems. San Diego, Academic
Press.
Carroll, S. B., J. K. Grenier, et al. (2001). From DNA to Diversity. Molecular
Genetics and the Evolution of Animal Design. Massachusetts, Blackwell
Science.