RNA structure

DNA2: Aligning ancient diversity
(Last week)
Comparing types of alignments & algorithms
Dynamic programming (DP)
Multi-sequence alignment
Space-time-accuracy tradeoffs
Finding genes -- motif profiles
Hidden Markov Model (HMM) for CpG Islands
1
RNA1: Structure & Quantitation
• Integration with previous topics (HMM & DP for RNA structure)
• Goals of molecular quantitation (maximal fold-changes, clustering & classification of genes & conditions/cell types, causality)
• Genomics-grade measures of RNA and protein and how we choose and integrate (SAGE, oligo-arrays, gene-arrays)
• Sources of random and systematic errors (reproducibility of RNA source(s), biases in labeling, non-polyA RNAs, effects of array geometry, cross-talk).
• Interpretation issues (splicing, 5' & 3' ends, gene families, small RNAs, antisense, apparent absence of RNA).
• Time series data: causality, mRNA decay, time-warping
2
Discrete & continuous bell-curves
(Graph: probability, .00 to .10, vs. value, 0 to 50; curves: Normal (m=20, s=4.47), Poisson (m=20, s^2=m), Binomial (N=2020, p=.01, m=Np), t-dist (m=20, s=4.47, dof=2), ExtrVal(u=20, L=1/4.47))
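The curve parameters in this graph can be checked numerically. The sketch below, using only the Python standard library, evaluates the Normal, Poisson and Binomial curves at a few points and shows why s=4.47 goes with m=20 (it is the square root of 20). The function names are illustrative and not part of the lecture.

```python
import math

def normal_pdf(x, m, s):
    """Normal density with mean m and standard deviation s."""
    return math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))

def poisson_pmf(k, m):
    """Poisson probability of k events with mean m (log form avoids overflow)."""
    return math.exp(k * math.log(m) - m - math.lgamma(k + 1))

def binomial_pmf(k, n, p):
    """Binomial probability of k successes in n trials."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

m = 20
s = math.sqrt(m)  # Poisson variance equals its mean, so s = sqrt(20)
print("s =", round(s, 2))
for k in (10, 20, 30):
    print(k,
          round(normal_pdf(k, m, s), 4),
          round(poisson_pmf(k, m), 4),
          round(binomial_pmf(k, 2020, 0.01), 4))
```

Near the mean the three curves nearly coincide; the differences show up in the tails, which is where significance calls are made.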
3
Primary to tertiary structure
gggatttagctcagtt gggagagcgccagact gaa gat ttg gag gtcctgtgttcgatcc acagaattcgcacca
4
Non-Watson-Crick bps
(Figure label: -CH3; ref)
5
Modified bases & bps in RNA
(Figure labels: 1, 72; ref)
6
Covariance
(Figure labels: TΨC, 3' acc, anticodon, D-stem)
$M_{ij} = \sum_{x_i, x_j} f_{x_i x_j} \log_2 \left[ f_{x_i x_j} / (f_{x_i} f_{x_j}) \right]$
M = 0 to 2 bits; x = base type
see Durbin et al p. 266-8.
7
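Mutual Information
ACUUAU
CCUUAG
GCUUGC
UCUUGA
i=1, j=6
$M_{ij} = \sum_{x_i, x_j} f_{x_i x_j} \log_2 \left[ f_{x_i x_j} / (f_{x_i} f_{x_j}) \right]$
$M_{1,6} = \sum_{x_1, x_6} f_{x_1 x_6} \log_2 \left[ f_{x_1 x_6} / (f_{x_1} f_{x_6}) \right] = f_{AU} \log_2 \left[ f_{AU} / (f_A f_U) \right] + \dots = 4 \times .25 \log_2 \left[ .25 / (.25 \times .25) \right] = 2$
$M_{1,2} = 4 \times .25 \log_2 \left[ .25 / (.25 \times 1) \right] = 0$
M = 0 to 2 bits; x = base type
see Durbin et al p. 266-8.
See Shannon entropy, multinomial Grendar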
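The two values worked out on this slide can be reproduced directly from the four aligned sequences. This is a minimal sketch using observed frequencies with no pseudocounts; the function name and the 1-based column convention are choices made here for illustration.

```python
import math
from collections import Counter

def mutual_information(alignment, i, j):
    """Mutual information M_ij in bits between 1-based columns i and j."""
    n = len(alignment)
    col_i = [seq[i - 1] for seq in alignment]
    col_j = [seq[j - 1] for seq in alignment]
    f_i = Counter(col_i)               # single-column base counts
    f_j = Counter(col_j)
    f_ij = Counter(zip(col_i, col_j))  # joint counts of base pairs
    total = 0.0
    for (a, b), count in f_ij.items():
        p_ab = count / n
        total += p_ab * math.log2(p_ab / ((f_i[a] / n) * (f_j[b] / n)))
    return total

alignment = ["ACUUAU", "CCUUAG", "GCUUGC", "UCUUGA"]
print(mutual_information(alignment, 1, 6))  # 2.0 (perfectly covarying columns)
print(mutual_information(alignment, 1, 2))  # 0.0 (column 2 is invariant)
```

With only four sequences the estimate is noisy; real alignments need many more sequences before a high M_ij can be taken as evidence of a base pair.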
8
RNA secondary structure prediction
Mathews DH, Sabina J, Zuker M, Turner DH J Mol Biol 1999 May 21;288(5):911-40 Expanded sequence dependence of thermodynamic parameters improves prediction of RNA secondary structure.
Each set of 750 generated structures contains one structure that, on average, has 86% of known base-pairs.
9
Stacked bp & ss
10
Initial 1981 O(N^2) DP methods: Circular Representation of RNA Structure
(Figure: circular diagram, 5' to 3')
Did not handle pseudoknots
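To make the nested (pseudoknot-free) DP idea concrete, here is a sketch of a Nussinov-style base-pair maximization. It is a simplified stand-in and not the energy-based method the slide refers to: it counts pairs instead of minimizing free energy, and as written it takes O(N^3) time and O(N^2) memory. The pairing rules and the `min_loop` parameter are assumptions made for illustration.

```python
def max_base_pairs(seq, min_loop=3):
    """Maximum number of nested base pairs (no pseudoknots) in an RNA sequence."""
    pairs = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}
    n = len(seq)
    if n == 0:
        return 0
    # best[i][j] = best pair count for the subsequence seq[i..j]
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]  # case 1: position j is unpaired
            # case 2: j pairs with some k, leaving at least min_loop bases between
            for k in range(i, j - min_loop):
                if (seq[k], seq[j]) in pairs:
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    return best[0][n - 1]

print(max_base_pairs("GGGAAAUCCC"))  # 3: a three-pair stem closing a loop
```

Because every pair (k, j) splits the problem into an inside and a left-hand part, crossing pairs can never be produced, which is exactly why this family of recursions cannot represent pseudoknots.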
11
RNA pseudoknots, important biologically, but
challenging for structure searches
12
Dynamic programming finally handles RNA pseudoknots too.
Rivas E, Eddy SR J Mol Biol 1999 Feb 5;285(5):2053-68 A dynamic programming algorithm for RNA structure prediction including pseudoknots. (ref)
Worst case complexity of O(N^6) in time and O(N^4) in memory space.
Bioinformatics 2000 Apr;16(4):334-40 (ref)
13
CpG Island (+) in an ocean of (-): first-order Hidden Markov Model
MM = 16, HMM = 64 transition probabilities (adjacent bp)
States: A+ T+ C+ G+ and A- T- C- G-
P(A+|A+)
P(G+|C+) >
14
Small nucleolar (sno)RNA
structure & function
Lowe et al. Science (ref)
15
SnoRNA Search
16
Performance of RNA-fold matching algorithms
Algorithm          CPU bp/sec   True pos.   False pos.
TRNASCAN '91       400          95.1%       0.4x10^-6
TRNASCAN-SE '97    30,000       99.5%       <7x10^-11
SnoRNAs '99                     >93%        <10^-7
(See p. 258, 297 of Durbin et al.; Lowe et al 1999)
17
Putative snoRNA gene disruption effects on rRNA modification
Primer extension pauses at 2'-O-Me positions, forming bands at low dNTP.
Lowe et al. Science 1999 283:1168-71 (ref)
18
RNA (array) & Protein/metabolite (MS) quantitation
RNA measures are closer to genomic regulatory motifs & transcriptional control.
Protein/metabolite measures are closer to flux & growth phenotypes.
20
Integrating 8 regulon measures
• In vitro array binding or selection
• Protein fusions (A-B from A, B)
• In vivo crosslinking & selection (1-hybrid)
• Microarray data: coregulated sets of genes
• Phylogenetic profiles (genomes EC, SC, BS, HI; proteins P1-P7; presence/absence bits: 1 1 0 1 1 0 1 0 1 1 0 1 1 1 1 0 1 0 1 1 0)
• Metabolic pathways (TCA cycle)
• Conserved operons (B. subtilis: purM purN purH purD; E. coli: purM purN, purH purD)
• Known regulons in other organisms
21
Check regulons from conserved operons (chromosomal proximity)
B. subtilis: purE purK purB purC purL purF purM purN purH purD
C. acetobutylicum: purE purC purF purM purN purH purD
In E. coli, each color above is a separate but coregulated operon: purE purK; purH purD; purM purN; purB; purC; purL; purF
E. coli PurR motif
McGuire & Church (2000) Predicting regulons and their cis-regulatory motifs by comparative genomics. Nucleic Acids Research 28:4523-30.
22
Predicting the PurR regulon by piecing together smaller operons
Organisms: E. coli, M. tuberculosis, P. horikoshii, C. jejuni, M. jannaschii, P. furiosus
Operon fragments: purE purK; purM purN; purM purF; purH purN; purH purD; purF purC; purQ purC; purL purH; purM purF; purC purY; purQ purL purY
The above predicts regulon connections among these genes: Y, Q, L, C, F, M, N, H, K, E, D
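Piecing operons together amounts to building a graph in which genes are linked whenever they share an operon in some genome, then reading off the connected groups. The sketch below does this for the operon fragments listed on this slide; the function name and the plain depth-first search are illustrative choices, not the published method.

```python
from collections import defaultdict

# Operon fragments as listed on the slide (organism labels omitted).
operons = [
    ["purE", "purK"], ["purM", "purN"], ["purM", "purF"], ["purH", "purN"],
    ["purH", "purD"], ["purF", "purC"], ["purQ", "purC"], ["purL", "purH"],
    ["purM", "purF"], ["purC", "purY"], ["purQ", "purL", "purY"],
]

def linked_groups(operons):
    """Group genes that are connected through shared operon membership."""
    graph = defaultdict(set)
    for genes in operons:
        for g in genes:
            graph[g].update(x for x in genes if x != g)
    seen, groups = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:  # depth-first walk over one connected component
            g = stack.pop()
            if g in group:
                continue
            group.add(g)
            stack.extend(graph[g] - group)
        seen |= group
        groups.append(sorted(group))
    return groups

for group in linked_groups(operons):
    print(group)
```

Using only these fragments, nine genes fall into one linked group and purE-purK form a second one; further evidence, such as a shared upstream motif, would be needed to join them.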
23
(Whole genome) RNA quantitation
objectives
RNAs showing maximum change
minimum change detectable/meaningful
RNA absolute levels (compare protein levels)
minimum amount detectable/meaningful
Network -- direct causality-- motifs
Classify (e.g. stress, drug effects, cancers)
24
(Sub)cellular inhomogeneity
Dissected tissues have mixed cell types.
Cell-cycle differences in expression.
XIST RNA localized on inactive X-chromosome (see figure)
25
Fluorescent in situ hybridization (FISH)
• Time resolution: 1 msec
• Sensitivity: 1 molecule
• Multiplicity: >24
• Space: 10 nm (3-dimensional, in vivo)
10 nm accuracy with far-field optics, energy-transfer fluorescent beads, nanocrystal quantum dots, closed-loop piezo-scanner (ref)
26
Steady-state population-average RNA quantitation methodology
Microarrays (1): experiment vs. control ORF; ~1000 bp hybridization
• R/G ratios
• R, G values
• quality indicators
Affymetrix (2): ORF; PM, MM; 25-bp hybridization
• Averaged PM-MM
• “presence”
SAGE (3), MPSS (4): ORF SAGE Tag; concatamers; sequence counting
• Counts of SAGE 14 to 22-mers sequence tags for each ORF
1 DeRisi, et al., Science 278:680-686 (1997)
2 Lockhart, et al., Nat Biotech 14:1675-1680 (1996)
3 Velculescu, et al., Serial Analysis of Gene Expression, Science 270:484-487 (1995)
4 Brenner et al.
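As a rough illustration of the "Averaged PM-MM" measure, the sketch below takes the plain mean of perfect-match minus mismatch intensities over the probe pairs of one gene. The intensities are made-up values, and the function is a simplification for illustration; it is not the vendor's algorithm, which also handles outlier probes.

```python
def average_difference(pm, mm):
    """Mean of (PM - MM) over the probe pairs of one gene."""
    if len(pm) != len(mm) or not pm:
        raise ValueError("need equal-length, non-empty PM and MM lists")
    return sum(p - m for p, m in zip(pm, mm)) / len(pm)

# Hypothetical intensities for four probe pairs.
pm = [1200.0, 950.0, 1430.0, 800.0]
mm = [300.0, 410.0, 520.0, 350.0]
print(average_difference(pm, mm))  # 700.0
```

Subtracting MM is meant to remove cross-hybridization and background signal that the 25-mer PM probe would pick up on its own.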
28
Biotinylated RNA
from experiment
GeneChip expression
analysis probe array
Image of hybridized probe array
Each probe cell contains
millions of copies of a specific
oligonucleotide probe
Streptavidin-phycoerythrin conjugate
29
Most RNAs <1 molecule per cell.
Yeast RNA 25-mer array
Wodicka, Lockhart, et al. (1997) Nature Biotech 15:1359-67
Reproducibility confidence intervals to find significant deviations. (ref)
30
(Public) RNA array analyses
Image analysis: dCHIP (Wong/Affy), ArrayTools (NCI-BRB), ArrayViewer (TIGR), F/P-SCAN (NIH), ScanAlyze (Eisen), GAPS (HMS)
Statistics: AFM, AMiADA, Churchill, R-SMA, SAM, VERA-SAM (ISB), R-Bioconductor (1, 2)
Database & management: GeneX, MAPS, AMAD
Cluster & Visualize: CLUSFAVOR, Cluster-TreeView (Eisen), GeneCluster (WI), J-Express (Bergen), PaGE, Plaid, SVDMAN (LANL), TreeArrange (Waterloo), XCluster (Stanford)
Free vs Open: OSI FSF ISCB
31
Statistical models for repeated array data (RNA vs. experiment repeats)
Tusher, Tibshirani and Chu (2001) Significance analysis of microarrays applied to the ionizing radiation response. PNAS 98(9):5116-21.
Selinger, et al. (2000) RNA expression analysis using a 30 base pair resolution Escherichia coli genome array. Nature Biotech. 18, 1262-7.
Li & Wong (2001) Model-based analysis of oligonucleotide arrays: model validation, design issues and standard error application. Genome Biol 2(8):0032
Kuo et al. (2002) Analysis of matched mRNA measurements from two different microarray technologies. Bioinformatics 18(3):405-12
32
“Significant” distributions
(Graph: probability, .00 to .10, vs. value, -30 to 30; curves: Normal (m=0, s=4.47), t-dist (m=0, s=4.47, dof=2), ExtrVal(u=0, L=1/4.47))
t-test t= ( Mean / SD ) * sqrt( N ). Degrees of freedom = N-1
H0: The mean value of the difference =0. If difference distribution is not normal,
use the Wilcoxon Matched-Pairs Signed-Ranks Test.
33
Independent Experiments
Microarray analysis of the transcriptional network controlled by the photoreceptor homeobox gene Crx. Livesey, et al. (2000) Current Biology
34
Are two means the same?
Condition   A1     A2    A3    A4     A5     mean   B1     B2    B3     B4     B5     mean   P=
RNA1        0.1    0.2   0.5   0.7    0.7    0.44   0.05   0.1   0.25   0.35   0.35   0.22   0.15
RNA2        0.95   1     1     1.04   1.06   1.01   1      1.1   1.15   1.2    1.3    1.15   0.03
P from TTEST(B2:B6,B10:B14,2,2), or Wilcoxon-rank if non-normal
RNA1: Two-fold, but not significant
RNA2: 1.14-fold, statistically significant
2-tails = Interesting whichever is larger
Homoscedastic = same variance.
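The spreadsheet call TTEST(..., 2, 2) is a two-tailed, two-sample test assuming equal variances. The sketch below repeats that calculation on the table's values using only the standard library; the p-value comes from numerically integrating the t density with Simpson's rule, which is a choice made here to avoid external dependencies, and the function names are illustrative.

```python
import math
from statistics import mean, variance

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def two_tailed_p(t, df, steps=2000):
    """Two-tailed p-value via Simpson's rule on [0, |t|] (steps must be even)."""
    b = abs(t)
    h = b / steps
    total = t_pdf(0.0, df) + t_pdf(b, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, df)
    return 1 - 2 * (total * h / 3)

def pooled_t_test(a, b):
    """Two-sample t-test with pooled (equal) variance; returns (t, p)."""
    na, nb = len(a), len(b)
    df = na + nb - 2
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / df
    t = (mean(a) - mean(b)) / math.sqrt(pooled_var * (1 / na + 1 / nb))
    return t, two_tailed_p(t, df)

rna1_a, rna1_b = [0.1, 0.2, 0.5, 0.7, 0.7], [0.05, 0.1, 0.25, 0.35, 0.35]
rna2_a, rna2_b = [0.95, 1, 1, 1.04, 1.06], [1, 1.1, 1.15, 1.2, 1.3]
print(round(pooled_t_test(rna1_a, rna1_b)[1], 2))  # 0.15: two-fold, not significant
print(round(pooled_t_test(rna2_a, rna2_b)[1], 2))  # 0.03: 1.14-fold, significant
```

The contrast comes from the spread of the replicates: RNA1 changes two-fold but varies widely within each condition, while RNA2 changes little but very reproducibly.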
35
RNA quantitation
Is less than a 2-fold RNA-ratio ever important?
Yes; 1.5-fold in trisomies.
Why oligonucleotides rather than cDNAs?
Alternative splicing, 5' & 3' ends; gene families.
What about using a subset of the genome
or ratios to a variety of control RNAs?
It makes trouble for later (meta) analyses.
36
37
(Whole genome) RNA quantitation methods
Method                                                 Advantages
Genes immobilized, labeled RNA                         Chip manufacture
RNAs immobilized, labeled genes (Northern gel blot)    RNA sizes
QRT-PCR                                                Sensitivity 1e-10
Reporter constructs                                    No crosshybridization
Fluorescent In Situ Hybridization                      Spatial relations
Tag counting (SAGE)                                    Gene discovery
Differential display & subtraction                     "Selective" discovery
38
Microarray to Northern
39
Genomic oligonucleotide microarrays
295,936 oligonucleotides (including controls)
Intergenic regions: ~6 bp spacing; Genes: ~70 bp spacing
Not polyA (or 3' end) biased
Strengths: Gene family paralogs, RNA fine structure (adjacent promoters), untranslated & antisense RNAs, DNA-protein interactions.
(Figure: E. coli 25-mer array; protein coding 25-mers; non-coding sequences (12% of genome); tRNAs, rRNAs)
Affymetrix: Mei, Gentalen, Johansen, Lockhart (Novartis Inst)
HMS: Church, Bulyk, Cheung, Tavazoie, Petti, Selinger
40
Random & Systematic
Errors in RNA quantitation
• Secondary structure
• Position on array (mixing, scattering)
• Amount of target per spot
• Cross-hybridization
• Unanticipated transcripts
41
Spatial Variation in Control Intensity
(Figure: control intensity vs. array position X, Y; panels: Experiment 1, Experiment 2; Selinger et al)
42
Detection of Antisense and
Untranslated RNAs
Expression Chip
Reverse Complement Chip
b0671 - ORF of unknown function, tiled in the opposite orientation
Crick Strand
Watson Strand (same chip)
“intergenic region 1725” - is actually a small untranslated RNA (csrB)
43
Mapping deviations from
expected repeat ratios
Li & Wong
44
Independent oligos analysis of RNA structure
(Figure: Intensity (PM - MM) / Smax, -1 to 2, vs. Bases from Translation Start, -300 to 400; series: Log, Stationary, Genomic DNA; annotations: Known Hairpin, Known Transcription Start (position -33), Translation Stop (237 bases); Selinger et al)
46
Predicting RNA-RNA interactions
Human RNA splice junctions sequence matrix
http://www-lmmb.ncifcrf.gov/~toms/sequencelogo.html
47
of the human genome using microarray technology. Shoemaker, et al. (2001) Nature 409:922-7.
48
Time courses
• To discriminate primary vs secondary effects we need conditional gene knockouts.
• Conditional control via transcription/translation is slow (>60 sec up & much longer for down regulation).
• Chemical knockouts can be more specific than temperature (ts-mutants).
50
Beyond steady state: mRNA turnover rates (rifampicin time-course)
(Figure: Fraction of Initial (16S normalized), 0 to 1.4, vs. Time (min), 0 to 18; series: cspE Chip, lpp Chip, cspE Northern, lpp Northern; Chip metric = Smax)
Half-life table values (chip, Northern): 2.4 min, 2.9 min; >20 min, >300 min. Transcripts: lpp, cspE.
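A half-life can be estimated from such a time course by assuming first-order decay, fraction(t) = exp(-k t), so that half-life = ln 2 / k. The sketch below fits k by least squares on the log of the fraction remaining. The data points are synthetic, generated from an assumed 2.4 min half-life purely to show the round trip; they are not the measurements behind the figure.

```python
import math

def half_life(times, fractions):
    """Half-life from a least-squares line through ln(fraction) vs. time."""
    ys = [math.log(f) for f in fractions]
    n = len(times)
    t_mean = sum(times) / n
    y_mean = sum(ys) / n
    slope = (sum((t - t_mean) * (y - y_mean) for t, y in zip(times, ys))
             / sum((t - t_mean) ** 2 for t in times))
    # slope is -k for a decaying transcript; a slope >= 0 means no measurable decay
    return math.log(2) / -slope

times = [0, 2, 4, 6, 8]                        # minutes after rifampicin
fractions = [0.5 ** (t / 2.4) for t in times]  # synthetic, noise-free decay
print(round(half_life(times, fractions), 2))   # 2.4
```

For very stable transcripts the fitted slope is close to zero and the estimate becomes unbounded, which is why long half-lives are reported only as lower limits.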
51
TimeWarp: pairs of expression series, discrete or interpolative
(Figure panels a-f: series a (time points t0-t6, indices i, i+1) aligned against series b (time points u0-u4, indices j, j+1))
Aach & Church
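Time warping reuses the dynamic programming idea from sequence alignment: each time point in one series may be matched to one or more time points in the other. The sketch below is a generic discrete dynamic time warping distance for two single-gene series; it illustrates the principle only and is not the TimeWarp program itself (no interpolation, no multi-gene weighting). The series values are hypothetical.

```python
def dtw_distance(a, b):
    """Minimum summed |a_i - b_j| over monotone alignments of series a and b."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best alignment cost of a[:i] against b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # a advances, b point reused
                                 cost[i][j - 1],      # b advances, a point reused
                                 cost[i - 1][j - 1])  # both advance
    return cost[n][m]

series_a = [0.0, 1.0, 2.0, 1.0, 0.0]
series_b = [0.0, 0.0, 1.0, 2.0, 2.0, 1.0, 0.0]  # same shape, stretched in time
print(dtw_distance(series_a, series_b))  # 0.0
```

A distance of zero here shows that the second series is the first one played at a different pace, which is the situation time warping is meant to detect between experiments.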
52
TimeWarp: cell-cycle experiments
53
TimeWarp: alignment example
54