Russell-2009-01-RNA-BCMB - RNA

9 January 2009
1. Introduction
2. Methods / Algorithms
Liming Cai
3. Results - Searching Genomes
4. Results - Telomerase RNAs, Other Utilities
5. Future Plans
Changing Views of RNA
              Information    Catalysis       Regulation
Pre 1944      Protein        Protein         Protein
1944-1982     DNA, RNA       Protein         Protein
Post 1982     DNA, RNA       Protein, RNA    Protein
Post 1998     DNA, RNA       Protein, RNA    Protein, RNA
Post 2008?    80% of Human Genome Transcribed?
Bioinformatic Challenges with ncRNA
Central Paradigm of Bioinformatics, Ability to Predict:
Genome » Expressed Molecules » Phenotype
• Predict structure of an RNA from sequence.
• Align RNAs on the basis of their secondary (2°) structure.
• Find new instances of known ncRNAs in genomes and sequence databases.
• Predict transcripts from genomic sequence (both known and unknown RNAs).
Conserved RNA Secondary Structures
[Figure: covariation between paired bases (C, G, A, U) in a conserved stem]
RNA structure is often more conserved than its primary sequence.
Search for structures, not sequence.
Campbell, Biology
RNA Pseudoknot
Searls, D.B., 1992 The
linguistics of DNA. American
Scientist 80: 579-591.
Many non-coding RNAs have pseudoknots
that are essential to their function.
Telomerase RNA
1. RNA structure is conserved, but
sequence and lengths are not.
2. The pseudoknot is essential to
function.
http://tolweb.org/tree/
Lin, J. et al. (2004) PNAS 101: 14713-14718.
Tzfati, Y. et al. (2003) Genes & Dev. 17:
1779-1788.
Bacterial tmRNAs
-A-B-D-E-F-G-H-g-h-I-J-j-i-K-L-M-N-m-O-o-l-k-n-P-p-Q-R-S-r-q-s-T-U-V-W-X-v-u-t-Z-!-z-1-@-#-2-3-x-w-f-e-d-b-$-4-a-
PK1
PK2
Felden et al. 2001. NAR 29: 1602-1607.
PK3
PK4
Functions in trans-translation
Large Database
of Structures
Available
Goals
• Search genomic sequences for RNA structures, not primary sequence.
   • Predicting the structure of RNAs from primary sequence.
   • Aligning RNAs to each other by structure.
• Study the evolution of RNA structure.
• Eventually, predict new ncRNA from genomic sequence.
Our methods handle pseudoknots as well as stem-loops.
Free Energy Minimization – Zuker
Probabilistic Models – Eddy, Durbin
Tree Decomposition – Cai
Zuker energy minimization: lowest ΔG from stacking.
Durbin et al.
Fig. 10.10
Hidden Markov Models – Sequence Alignment
CUCGCACGGUGCUGA-AUGCCCGUA
CUCGCACAG-GCUGAGAUGCUUGUA
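Consider aligning the second sequence to the first.
At each position: match, insertion, or deletion states.
Match state – 16 possibilities: A/A, A/C, A/G, A/U, G/A …
Insert or Delete states – 4 possibilities: A/-, C/-, G/-, U/-
[Figure: HMM state diagram running from Begin (B) through Match (M), Insert (I), and Delete (D) states at each position to End (E)]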
Hidden Markov Models – Sequence Alignment
CUCGCACGGUGCUGA-AUGCCCGUA
CUCGCACAG-GCUGAGAUGCUUGUA
Transitions: switches between Match, Insert, Delete
Emissions: what each state does in the alignment
   Match: A/A, A/C, A/G, A/U, G/A …
   Insert or Delete: A/-, C/-, G/-, U/-
Probabilities for emissions and state transitions
[Figure: HMM state diagram with Begin (B), Match (M), Insert (I), Delete (D), and End (E) states]
We use HMMs for non-paired regions of RNA structures
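As a small illustration of the match, insert, and delete states, the sketch below reads the state path off the pairwise alignment shown on the slide. The function name `state_path` and the convention used (I for a residue present only in the second sequence, D for a first-sequence position missing from the second) are assumptions made for this example, not part of the original slides.

```python
# Sketch: derive the M/I/D state path implied by a pairwise alignment.
# Names and the I/D convention are illustrative assumptions.

BASES = "ACGU"

# A match state emits one aligned pair of residues: 4 x 4 = 16 possibilities.
MATCH_EMISSIONS = [x + "/" + y for x in BASES for y in BASES]

# An insert or delete state emits one residue against a gap: 4 possibilities.
GAP_EMISSIONS = [x + "/-" for x in BASES]


def state_path(first, second):
    """Return the state path for aligning `second` to `first`."""
    if len(first) != len(second):
        raise ValueError("aligned sequences must have equal length")
    path = []
    for a, b in zip(first, second):
        if a == "-" and b == "-":
            continue          # an all-gap column carries no state
        if a == "-":
            path.append("I")  # residue only in the second sequence
        elif b == "-":
            path.append("D")  # first-sequence position is skipped
        else:
            path.append("M")  # both residues present
    return "".join(path)


first = "CUCGCACGGUGCUGA-AUGCCCGUA"
second = "CUCGCACAG-GCUGAGAUGCUUGUA"
path = state_path(first, second)
print(path)  # MMMMMMMMMDMMMMMIMMMMMMMMM
```

A trained profile HMM would attach probabilities to each transition and emission along such a path; this sketch only recovers the path itself.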
Chomsky Transformational Grammars
Colorless green ideas sleep furiously.
A grammar generates a string of symbols (letters).
Terminals and Non-terminals.
S → aS
S → gS
S → ε
or: S → aS | gS | ε
S ⇒ aS ⇒ agS ⇒ aggS ⇒ agg
We use transformational grammars for
paired regions (stems) of RNA structures
RNA Stem Loop (Durbin et al. p243)
S  → aW1u | cW1g | gW1c | uW1a
W1 → aW2u | cW2g | gW2c | uW2a
W2 → aW3u | cW3g | gW3c | uW3a
W3 → gaaa | gcaa
S ⇒ aW1u ⇒ acW2gu ⇒ acgW3cgu ⇒ acggaaacgu
[Figure: parse tree with nodes S, W1, W2, W3 over the string a c g g a a a c g u]
Parse tree format
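The stem-loop grammar above (three nested base pairs around a gaaa or gcaa loop) can be checked with a few lines of code. This is a minimal sketch of a recognizer for that one toy grammar; the function name `derive` is hypothetical and the code is not a general grammar parser.

```python
# Sketch: recognize strings generated by the three-pair stem-loop grammar
# S -> x W1 y, W1 -> x W2 y, W2 -> x W3 y, W3 -> gaaa | gcaa,
# where (x, y) is one of the pairs a-u, c-g, g-c, u-a.

PAIRS = {("a", "u"), ("c", "g"), ("g", "c"), ("u", "a")}
LOOPS = {"gaaa", "gcaa"}


def derive(seq):
    """Return the derivation steps for `seq`, or None if it is not generated."""
    left, right, inner = "", "", seq
    steps = []
    for level in (1, 2, 3):
        # Each rule peels one base pair off the two ends of the string.
        if len(inner) in (0, 1) or (inner[0], inner[-1]) not in PAIRS:
            return None
        left = left + inner[0]
        right = inner[-1] + right
        inner = inner[1:-1]
        steps.append(left + "W" + str(level) + right)
    if inner not in LOOPS:
        return None
    steps.append(left + inner + right)
    return steps


print(derive("acggaaacgu"))
# ['aW1u', 'acW2gu', 'acgW3cgu', 'acggaaacgu']
```

The nesting of the recursion mirrors the nesting of the base pairs, which is why context-free grammars suit stems but not pseudoknots, whose pairs cross.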
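Profiling Structurally Aligned RNAs
Loops (unpaired): Hidden Markov Models
Stems (paired): Stochastic Grammar
[Figure: combined model over the string a c g g a a a c g u, with grammar nonterminals S, W1, W2, W3 for the stem and HMM states B, M, I, D, E for the loop]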
Searching by Profiling
Durbin & Eddy
Model: Hidden Markov, Stochastic Grammar
Aligned by structure
[Figure: the profile model is aligned to a scanning window (target sequence) that moves along the genome]
Computational Complexity
Memory and/or Time Requirements
Stem Loops     Context Free Grammar         O(N^4)
Pseudoknots    Context Sensitive Grammar    O(N^6)
Filters —
Weinberg & Ruzzo
Searching sequences for RNA structures by Stochastic Grammars is very slow.
• Find a conserved subsequence or simple structure.
• Search genome or database for all matches.
• Screen region around match in detail.
NNNNNNNNNNNNNNNNNNNCGCCNNNNNNNNNNNNNNNNNNNNNNNNNN
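The filtering idea can be sketched as a plain exact-match scan followed by extraction of the surrounding region. The function name `filter_hits` and the `flank` parameter are assumptions for illustration; the filters of Weinberg & Ruzzo and those used in RNATOPS are more elaborate than an exact string match.

```python
# Sketch: a sequence filter that finds every exact occurrence of a short
# conserved motif and reports the surrounding window, which would then be
# passed to the slow structural search. Illustrative only.

def filter_hits(genome, motif, flank):
    """Return (motif_start, window_start, window_end) for each motif match."""
    hits = []
    start = genome.find(motif)
    while start != -1:
        window_start = max(0, start - flank)
        window_end = min(len(genome), start + len(motif) + flank)
        hits.append((start, window_start, window_end))
        start = genome.find(motif, start + 1)  # allow overlapping matches
    return hits


# Toy target: one conserved CGCC in otherwise unknown sequence.
toy_genome = "N" * 19 + "CGCC" + "N" * 26
print(filter_hits(toy_genome, "CGCC", flank=10))  # [(19, 9, 33)]
```

Only the reported windows need the detailed structural screen, so the expensive model runs on a small fraction of the genome.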
Liming’s New Approach:
Conformational Graph / Tree Decomposition
RNA Structure → Graph → Tree
Robertson, Neil; Seymour, Paul D. (1984), "Graph minors III: Planar tree-width", Journal of Combinatorial Theory, Series B 36: 49–64
1. Represent RNA secondary structure as a graph.
2. Use tree decomposition of the structure graph to reduce the complexity of the problem.
3. The structure-sequence alignment problem becomes a subgraph isomorphism problem.
Conformational Graph of an RNA
[Figure: an RNA structure (source: Campbell, Biology) and its structure (mixed) graph, both drawn 5’ to 3’; stems are modeled by CMs and loops by HMMs]
Tree Decomposition of an RNA Graph
Robertson, N. Seymour, P.D. (1986) Graph minors II. Algorithmic aspects of
tree-width, Journal of Algorithms, 7, 309-322.
Graph vertices: a, a’, b, b’, c, c’, d, d’, x, y
Tree bags: {a b a’}, {b a’ c’}, {b b’ c’ y}, {b’ c c’ y}, {d b’ y}, {d d’ b’ x}
t = tree width = (size of the largest bag) - 1
O(k^t m^2 n)
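For the six bags shown on this slide, the tree width can be computed directly from the definition. The sketch below only measures bag sizes; it does not build or verify the tree itself, because the slide text does not preserve which bags are adjacent. The name `tree_width` is an illustrative choice.

```python
# Sketch: tree width of a decomposition = size of the largest bag minus one.
# The bags are copied from the slide; ASCII apostrophes stand in for primes.

BAGS = [
    {"a", "b", "a'"},
    {"b", "a'", "c'"},
    {"b", "b'", "c'", "y"},
    {"b'", "c", "c'", "y"},
    {"d", "b'", "y"},
    {"d", "d'", "b'", "x"},
]


def tree_width(bags):
    """Return max bag size minus one."""
    return max(len(bag) for bag in bags) - 1


vertices = set().union(*BAGS)
print(len(vertices))      # 10 graph vertices are covered by the bags
print(tree_width(BAGS))   # 3
```

A small t matters because it appears in the exponent of the running time, so the cost grows with the complexity of the structure rather than with the genome alone.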
Alignment is Calculated in the Tree Bags
Bags: {a b a’}, {b a’ c’}, {b b’ c’ y}, {b’ c c’ y}, {d b’ y}, {d d’ b’ x}
• Each bag (node) has a dynamic programming table in it.
• Alignment of vertices (stems and loops) between profile model and target.
• Calculated from bottom to top; traceback.
[Figure: the profile model is aligned to a scanning window (target sequence) of the genome; the structure (mixed) graph, drawn 5’ to 3’, combines stem CMs and loop HMMs, and the alignment is computed over the tree bags listed above]
Yong Wu, Zhibin Huang, Joseph Robertson
RNATOPS
1. Takes input alignment of training sequences.
2. Constructs a profile model of the structure.
3. Searches sequences for instances of the structure.
4. Can use filters (sequence or structure).
Cross Validation Tests:
• Genomes where the position is known.
• Leave the ncRNA from the genome to be scanned out of the training set.
• Determine if the program can find the ncRNA.
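The cross-validation procedure above is a leave-one-out loop. The sketch below shows only that loop; `build_profile` and `search` are hypothetical stand-ins supplied by the caller (here replaced by toy functions), not RNATOPS functions, and the toy data are invented for the example.

```python
# Sketch: leave-one-out cross validation over genomes with a known ncRNA
# position. The profile builder and the search routine are passed in;
# the toy versions below are placeholders, not real structural search.

def leave_one_out(examples, build_profile, search):
    """examples maps genome name -> (ncRNA sequence, genome, known start)."""
    found = {}
    for held_out, (_, genome, known_start) in examples.items():
        # Train on every ncRNA except the one from the genome being scanned.
        training = [rna for name, (rna, _, _) in examples.items()
                    if name != held_out]
        profile = build_profile(training)
        found[held_out] = known_start in search(profile, genome)
    return found


def toy_profile(training):
    return training[0][:4]  # toy "profile": a 4-base prefix


def toy_search(profile, genome):
    return [i for i in range(len(genome)) if genome.startswith(profile, i)]


toy_examples = {
    "g1": ("GGCCA", "TTTGGCCATT", 3),
    "g2": ("GGCCU", "AGGCCUAAAA", 1),
    "g3": ("GGCCG", "CCCCCGGCCG", 5),
}
result = leave_one_out(toy_examples, toy_profile, toy_search)
print(result)  # {'g1': True, 'g2': True, 'g3': True}
```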
Bacterial RNAseP B, genomes of ~3 x 10^6 bp
Program        Training Set   Sensitivity   Specificity   Avg. Time
RNATOPS 1.0    9              100%          100%          15 min.
Infernal       9              100%          100%          98 min.

Yeast Telomerase RNA (Saccharomyces), genomes of ~10^7 bp
Program        Training Set   Sensitivity   Specificity   Avg. Time
RNATOPS 1.0    5              100%          100%          6 min.
Infernal       5              100%          100%          442 min.
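The average times reported above can be turned into speed ratios with simple arithmetic. The dictionary layout and the function name `speedup` are illustrative choices; the minutes are the values from the two benchmarks.

```python
# Sketch: ratio of Infernal to RNATOPS 1.0 average search time, using the
# minutes reported for the two cross-validation benchmarks.

AVG_MINUTES = {
    "Bacterial RNAseP B": {"RNATOPS 1.0": 15, "Infernal": 98},
    "Yeast telomerase RNA": {"RNATOPS 1.0": 6, "Infernal": 442},
}


def speedup(times):
    """Return how many times longer Infernal took than RNATOPS 1.0."""
    return times["Infernal"] / times["RNATOPS 1.0"]


for benchmark, times in AVG_MINUTES.items():
    print(benchmark, round(speedup(times), 1))
# Bacterial RNAseP B 6.5
# Yeast telomerase RNA 73.7
```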
RNATOPS-W
Web server software to search sequences for RNA secondary structures, including pseudoknots.
• automatic selection of an HMM (sequence) filter
• interactive selection of a substructure filter
Yingfeng Wang
Input Profile for RNATOPS and -W
Pasta Format (Pairing + Fasta)
>pairs
AAAAA...BBBBBBB...aaaaa.....bbbbbbbAAAAA...aaaaa
>index
11111...1111111...11111.....111111122222...22222
>sequence 1
CGGUGCAGGCUA-GAGACCACCGAUUUUAAUCGUAGCCGGCACCACCG
>sequence 2
CGGUGAAGGCUG-GAGACCACCGUAA--UCGCAGCCGGUGAACCACCG
>sequence 3
CGGUGAAGGCUA-GAGACCACCGUUUUUAAUCGUAGCCGGCACCACCG
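A Pasta file like the one above can be read with a short parser. The sketch assumes one reading of the format that the slide does not spell out: an uppercase letter marks the 5' arm of a stem, the same letter in lowercase marks its 3' arm, and the index line separates stems that reuse a letter. The function names are hypothetical, and the code reports paired columns only; it does not check that the paired bases are complementary.

```python
# Sketch: parse Pasta (Pairing + Fasta) text and list the paired columns of
# each stem. The interpretation of the pairs/index lines is an assumption.
from collections import defaultdict

EXAMPLE = """\
>pairs
AAAAA...BBBBBBB...aaaaa.....bbbbbbbAAAAA...aaaaa
>index
11111...1111111...11111.....111111122222...22222
>sequence 1
CGGUGCAGGCUA-GAGACCACCGAUUUUAAUCGUAGCCGGCACCACCG
>sequence 2
CGGUGAAGGCUG-GAGACCACCGUAA--UCGCAGCCGGUGAACCACCG
"""


def parse_pasta(text):
    """Return (pairs line, index line, {name: aligned sequence})."""
    records, name = {}, None
    for line in text.strip().splitlines():
        line = line.strip()
        if line.startswith(">"):
            name = line[1:].strip()
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records.pop("pairs"), records.pop("index"), records


def stem_columns(pairs, index):
    """Map (letter, index) -> list of (5' column, 3' column), 0-based."""
    left, right = defaultdict(list), defaultdict(list)
    for col, (letter, idx) in enumerate(zip(pairs, index)):
        if letter.isalpha():
            arm = left if letter.isupper() else right
            arm[(letter.upper(), idx)].append(col)
    # The outermost 5' column pairs with the outermost 3' column.
    return {key: list(zip(left[key], reversed(right[key]))) for key in left}


pairs, index, seqs = parse_pasta(EXAMPLE)
stems = stem_columns(pairs, index)
print(sorted(stems))         # [('A', '1'), ('A', '2'), ('B', '1')]
print(stems[("A", "1")][0])  # (0, 22)
```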
Typical RNATOPS-W Output
hit 2
---->genome_seq2(380-415)[379-466]
Plus search result
Hit Positions: 380-415
Alignment score = 2.25878
Alignment to the filter
GTCTCTTGGCCCAGTTGGTTAAGGCACCGTGCTAAT
mmmmmmmmmmmmmmmimmmmmmmmmmmmmmmmmmmm
Extension positions: 379-466
Extension of the hit
GGTCTCTTGGCCCAGTTGGTTAAGGCACCGTGCTAATAACGCGGGGATCA
GCGGTTCGATCCCGCTAGAGACCATAGTGCTGGAGTTG
RNATOPS and RNATOPS-W
• Search genomes for new instances of RNAs on the basis of structure
• Can use filters if appropriate conserved features exist
• Fast and accurate
• Requires a training set
• Version 1.0 does not handle variable structures
Improvements In Process
10x – 100x increase in speed
Allow for presence or absence of stems
Handle variability in stem and loop length
Stem Length Distribution
[Figure: the profile model is aligned to a scanning window of the genome]
Dong Zhang
Leilei Guo
Mike McEachern
TRFolder – helps predict/identify complex
secondary structures in short sequences.
Used to find telomerase RNA structures in
newly sequenced yeast genomes.
Telomerase
RNAs
3’ ACCCCAGACCCACGAC 5’
linear chromosome
Telomeres:
Repeated sequences at the ends of linear chromosomes
Protect interior sequences from degradation
Telomerase:
Protein, RNA - containing the template for the repeat
Implicated:
Aging and Cancer
TRFolder
Useful if you suspect there is a defined
structure within a given sequence fragment
1. Start with ~5000 base segment
2. Predict pseudoknots
3. Input additional stems to predict
4. Input ranges of stem and loop sizes
5. Search for triple helix regions
TRFolder: Telomerase RNAs in Newly Sequenced Yeast Genomes
1. Use template region to find candidate telomerase RNA gene regions.
   3’ ACCCCAGACCCACGAC 5’
2. Verify by checking the known flanking genes from species where it was previously identified.
3. Use TRFolder to predict structural features
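Step 1 amounts to scanning both strands of a genome for the template sequence. The sketch below assumes exact matching and assumes the template on the slide is written 3' to 5', so it is reversed before searching; the function names are hypothetical, and a real search would likely tolerate variation in the template.

```python
# Sketch: locate candidate template regions on either strand of a DNA
# sequence by exact match. Exact matching is a simplifying assumption.

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    return dna.translate(COMPLEMENT)[::-1]


def find_template(genome, template_3to5):
    """Return sorted (position, strand) pairs for each template match."""
    motif = template_3to5[::-1]  # rewrite the template 5' to 3'
    hits = []
    for strand, query in (("+", motif), ("-", reverse_complement(motif))):
        pos = genome.find(query)
        while pos != -1:
            hits.append((pos, strand))
            pos = genome.find(query, pos + 1)
    return sorted(hits)


# Toy genome with one copy on each strand.
forward = "ACCCCAGACCCACGAC"[::-1]
toy = "TT" + forward + "GG" + reverse_complement(forward) + "AA"
print(find_template(toy, "ACCCCAGACCCACGAC"))  # [(2, '+'), (20, '-')]
```

Each hit marks a candidate gene region whose flanking genes can then be checked as in step 2.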
TRFolder: Telomerase RNAs in Newly Sequenced Yeast Genomes
Tested on Known Structures:
Saccharomyces and Kluyveromyces species
Predict New Structure for Known Telomerase RNAs:
Schizosaccharomyces pombe and Candida albicans
Predict New Telomerase RNA Genes and RNA Structure:
Candida glabrata, Candida guilliermondii, Candida tropicalis,
Ashbya gossypii, Debaryomyces hansenii, Pichia stipitis
TRFolder:
Typical Output
The predicted structures of the top 2 candidates for A. gossypii:
Total:70.11 =pk Score:3*13.22 +Triple Score:1*13.77 +boundaryEle Score:
..-302..........-288..-256....-248....-9.....-2.....0..................
.....|.............|.....|.......|.....|......|.....|..................
.....GGAGCGGCUCCUGGA.....UGCCCACUG.....CAUGGGCA.....ACCGCUGAGAGACCCAUAC
.....(((((((((((((((.....((((((.((.....)))))))).....*******************
.......................................................................
Total:69.53 =pk Score:3*13.22 +Triple Score:1*13.77 +boundaryEle Score:
..-267.-262..-256....-248....-9.....-2.....0...........................
.....|....|.....|.......|.....|......|.....|...........................
.....GUCUCA.....UGCCCACUG.....CAUGGGCA.....ACCGCUGAGAGACCCAUACACCACACCG
.....((.(((.....((((((.((.....)))))))).....****************************
.......................................................................
Accepts: Rfam alignment files (Stockholm)
or Pasta format (Pairing + Fasta)
>pairs
AAAAA...BBBBBBB...aaaaa.....bbbbbbbAAAAA...aaaaa
>index
11111...1111111...11111.....111111122222...22222
>sequence 1
CGGUGCAGGCUA-GAGACCACCGAUUUUAAUCGUAGCCGGCACCACCG
>sequence 2
CGGUGAAGGCUG-GAGACCACCGUAA--UCGCAGCCGGUGAACCACCG
>sequence 3
CGGUGAAGGCUA-GAGACCACCGUUUUUAAUCGUAGCCGGCACCACCG
Alignment Utility - RNApasta
Example
Features
Arc Diagram
Partition into 2 Sets By Stem Length
Stacking
Alignment Utility - RNApasta
Java Utility with Graphical Interface
Statistical Analysis of Alignments (examples:)
• Basepairing frequencies by position, region, stem position
• Summaries of stem and loop length distributions
Alignment Editing (examples:)
• Identify non-canonical basepairs
• Extract regions
• Partition sequences into subsets
• Remove gap columns
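As an illustration of the kind of statistic listed above, the sketch below counts base-pair types at one pair of alignment columns and flags non-canonical pairs. It is not RNApasta code (RNApasta is a Java utility); the function names and the toy alignment are invented, and treating G-U wobble pairs as canonical is an assumption.

```python
# Sketch: base-pairing frequencies at one paired position of an alignment.
# Toy data and names are illustrative; G-U counted as canonical by choice.
from collections import Counter

CANONICAL = {"AU", "UA", "GC", "CG", "GU", "UG"}


def pair_frequencies(sequences, left_col, right_col):
    """Count the base pairs formed by two alignment columns (0-based)."""
    counts = Counter()
    for seq in sequences:
        pair = seq[left_col] + seq[right_col]
        if "-" not in pair:  # skip sequences gapped at either column
            counts[pair] += 1
    return counts


def non_canonical(counts):
    return sorted(pair for pair in counts if pair not in CANONICAL)


toy_alignment = ["GCAAAGC", "GGAAACC", "ACAAAGU", "GCAAAGA"]
outer = pair_frequencies(toy_alignment, 0, 6)
print(dict(outer))           # {'GC': 2, 'AU': 1, 'GA': 1}
print(non_canonical(outer))  # ['GA']
```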
• Evolution of RNA Structure (in process)
• Structural Alignment of RNAs (in process)
• Alternative Splicing ?
• Finding All Transcripts ?
Anuj Srivastava
Introduced Tree Decomposition Method
RNATOPS – Search Genomes for ncRNA
TRFolder – Telomerase (& other) RNA Structures
RNApasta – Statistical Analysis & Alignment Editing
Our software:
• Is very fast.
• Handles pseudoknots as well as stem-loops.
http://www.uga.edu/RNA-Informatics/
Liming Cai
Biomedical Information Science
and Technology Initiative (BISTI)