Discovery of Structural and Functional Features in RNA Pseudoknots
Qingfeng Chen and Yi-Ping Phoebe Chen, Senior Member, IEEE
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 21, NO. 7, JULY 2009
Adviser: Yu-Chiang Li
Speaker: Shao-Hsiang Hung
Date: 2009/12/10

Outline
- Introduction
- Material and Methods
- Results
- Conclusion and Discussion

I. Introduction

I. Introduction (1/6)
- Accurately predicting the functions of biological macromolecules is one of the biggest challenges in functional genomics.
- RNA molecules play a central role in a number of biological functions within cells, from the transfer of genetic information from DNA to protein, to enzymatic catalysis.

I. Introduction (2/6)
- To fulfill this range of functions, a simple linear string of RNA nucleotides (uracil, guanine, cytosine, and adenine) forms a variety of complex three-dimensional structures.
- Pseudoknot: an RNA structure formed by base pairing between a loop of an orthodox secondary structure and complementary bases outside that loop.

I. Introduction (3/6)

I. Introduction (4/6)
- PseudoBase is the only online database containing structural, functional, and sequence data of RNA pseudoknots.
- Unfortunately, the analysis of this valuable data set is underdeveloped:
  - Difficulty in modeling
  - Complexity of structural computation

I. Introduction (5/6)
- Association rule mining has been used successfully to discover valuable information in large data sets.
- Limitation: multivalued variables
  - Categorical multivalued variables (such as color {red, blue, green})
  - Quantitative multivalued variables (such as weight {[40, 50], [50, 75]})
- The relationships are captured using a conditional probability matrix $M_{Y|X}$.

I. Introduction (6/6)
- We develop a framework to identify potential top-k covering rule groups in RNA pseudoknots.
- Relationships:
  - Structure-function
  - Structure-category
  - Significant ratios of stems and loops
- Allows users to regulate k and the minsupp threshold and to compare rules within the same group.
- Handles high-dimensional data
- Enhances the understanding of structure-function relationships

II. Material and Methods

II. Material and Methods (1/20)
Pseudoknot data: notation
- S1, S2, L1, L2, and L3: stem 1, stem 2, loop 1, loop 2, and loop 3
- A, G, C, and U: adenine, guanine, cytosine, and uracil
- vr, vt, vf, v3, v5, vo, rr, mr, tm, ri, ap, ot, and ar: viral ribosomal readthrough signals, viral tRNA-like structures, viral ribosomal frameshifting signals, other viral 3'-UTR, other viral 5'-UTR, viral others, rRNA, mRNA, tmRNA, ribozymes, aptamers, others, and artificial molecules
- ss, tc, and fs: self-splicing, translation control, and viral frameshifting

II. Material and Methods (2/20)
- Let X and Y be multivalued attribute variables.
- Let x and y be items.
- p(X) is the probability of X, and p(Y|X) is the conditional probability of Y given X.
- Let minsupp be the minimum support in the context.

II. Material and Methods (3/20)
The data is collected from PseudoBase. Each entry includes:
- Organism
- RNA type
- Bracket view of the structure
- Classification by two stems and three loops
- Nucleotide sequence
- Size

II. Material and Methods (4/20)
- A data set consisting of 225 H-pseudoknots is obtained.

II. Material and Methods (5/20)

II. Material and Methods (6/20)
Partition of attributes
- {class, function, stem, loop, base, ratio, length}; the last one is a quantitative attribute.
- We propose a novel partition that takes into account the properties of pseudoknot data and top-k rule groups.

II. Material and Methods (7/20)

II. Material and Methods (8/20)
- The domain of a quantitative attribute has to be partitioned into intervals. Two choices must be made:
  1) The number of intervals
  2) The size of each interval
- For example, the interval (14, 15] occurs in stem 1, stem 2, and loop 1 but not in loop 3.

II. Material and Methods (9/20)
Definition 1. Suppose a quantitative attribute y is divided into a set of intervals $\{y_1, \ldots, y_n\}$ using the categorical item $x_i$, such that each base interval $y_j$ consists of a single value for $1 \le j \le n$. The partition using $x_i$ is defined as $\{(y_1^i, \max(y_2^i)], \ldots, (\max(y_{m-1}^i), \max(y_m^i)]\}$.
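The base intervals of Definition 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the helper names `base_intervals` and `find_interval` are hypothetical, the single value 0 is stored as the degenerate pair `(0, 0)`, and the upper bound 22 is taken from the stem 1 example in this deck.

```python
def base_intervals(max_size, step=1):
    """Return the item 0 followed by half-open intervals (lo, hi] of width `step`."""
    intervals = [(0, 0)]  # the single value 0, stored as a degenerate interval
    lo = 0
    while lo < max_size:
        hi = min(lo + step, max_size)
        intervals.append((lo, hi))
        lo = hi
    return intervals


def find_interval(value, intervals):
    """Return the interval that contains `value`, or None if there is none."""
    for lo, hi in intervals:
        if (lo == hi == value) or (lo < value <= hi):
            return (lo, hi)
    return None


# Stem 1 sizes range up to 22 in the deck's example.
stem1 = base_intervals(22)
print(len(stem1))               # number of items in the partition
print(find_interval(4, stem1))  # interval that holds a stem of four nucleotides
```

Passing `step=0.5` gives the finer base intervals that the deck later suggests for ratio attributes.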
Table 2 presents the distribution of the sizes of stem 1 and stem 2 of pseudoknots in PseudoBase.

II. Material and Methods (10/20)
Definition 1, example:
$Y_1$ = {0, (0, 1], (1, 2], (2, 3], (3, 4], (4, 5], (5, 6], (6, 7], (7, 8], (8, 9], (9, 10], (10, 11], (11, 12], (12, 13], (13, 14], (14, 15], (15, 16], (16, 17], (17, 18], (18, 19], (19, 20], (20, 21], (21, 22]}

II. Material and Methods (11/20)
Definition 2. Suppose $Y_i = \{y_1^i, \ldots, y_m^i\}$ and $Y_{i+1} = \{y_1^{i+1}, \ldots, y_n^{i+1}\}$ are two adjacent partitions. Let $Y = \emptyset$. The integration of them is defined as [equation]

II. Material and Methods (12/20)
Definition 2, example:
- Stem 1: $Y_1$ = {0, (0, 1], (1, 2], (2, 3], (3, 4], (4, 5], (5, 6], (6, 7], (7, 8], (8, 9], (9, 10], (10, 11], (11, 12], (12, 13], (13, 14], (14, 15], (15, 16], (16, 17], (17, 18], (18, 19], (19, 20], (20, 21], (21, 22]}
- Stem 2: $Y_2$ = {0, (0, 1], (1, 2], (2, 3], (3, 4], (4, 5], (5, 6], (6, 7], (7, 8], (8, 9], (9, 10], (10, 11], (11, 12], (12, 13], (13, 14], (14, 15], . . . , (31, 32], (32, 33]}

II. Material and Methods (13/20)
The integrated partition of $Y_1$ and $Y_2$:
{0, (0, 1], (1, 2], (2, 3], (3, 4], (4, 5], (5, 6], (6, 7], (7, 8], (8, 9], (9, 10], (10, 11], (11, 12], (12, 13], (13, 14], (14, 15], (15, 16], (16, 17], (17, 18], (18, 19], (19, 22], (22, 33]}

II. Material and Methods (14/20)
- In comparison, the values of ratio attributes are positive real numbers rather than integers.
- $|y_i| = 1$ in Definition 1 (Definition 3.1 in the paper) is changed to $|y_i| = 1$ or $|y_i| = 0.5$.
- $|x| = 1$ and $|x_c| = 1$ in Definition 2 (Definition 3.2 in the paper) are changed to $|x| = 1$ and $|x_c| = 1$, or $|x| = 0.5$ and $|x_c| = 0.5$.
- This avoids missing interesting knowledge.

II. Material and Methods (15/20)
Generation of rule groups
- Work out the conditional probabilities for X and Y in the probability matrix $M_{Y|X}$.
- The conditional probability of $Y = y_i$ given $X = x_i$ is $p(y_i \mid x_i) = p(x_i \mid y_i)\, p(y_i) / p(x_i)$.

II. Material and Methods (16/20)
For example, let x be stem 1 and y be the size interval (3, 4] of stem 1.
- By Table 2, n = 225, so $p(x = \text{stem1}) = 225/225 = 1$.
- Also by Table 2, the number of stem 1 entries with four nucleotides, that is, in (3, 4], is 42.
- Hence $p(y = (3, 4],\ x = \text{stem1}) = 42/225 \approx 0.19$.
- So $p(y = (3, 4] \mid x = \text{stem1}) = p(y = (3, 4],\ x = \text{stem1}) / p(x = \text{stem1}) \approx 0.19$.

II. Material and Methods (17/20)
- Compute all conditional probabilities of stem 1, namely $[\,p(y_1 \mid \text{stem1})\ \ p(y_2 \mid \text{stem1})\ \ \ldots\ \ p(y_n \mid \text{stem1})\,]$.
- The rows for stem 2, loop 1, and loop 3 can be computed in the same way.

II. Material and Methods (18/20)
- Suppose $M_{Y|X}$, corresponding to an association AS, consists of a set of rows $\{r_1, \ldots, r_n\}$.
- Let $A = \{A_1, \ldots, A_m\}$ be the complete set of antecedent items of AS.
- Let $C = \{C_1, \ldots, C_k\}$ be the complete set of consequent items of AS.
- Namely, $PS(x) = \{(x, y_j) \mid y_j \in C,\ p(y_j \mid x) > 0\}$.

II. Material and Methods (19/20)
Definition 3 (Rule group). Let $G_x = \{x \rightarrow C_j \mid (x, C_j) \in PS(x)\}$ be a rule group with an antecedent item x and consequent support set C.
Definition 4. Let $R_i: X \rightarrow Y_i$ and $R_j: X \rightarrow Y_j$, with $1 \le k \le k_{max}$. $R_i$ is ranked higher than $R_j$ if $p(Y_i \mid X) > p(Y_j \mid X)$.

II. Material and Methods (20/20)
For example, in Table 2, $k_{max} = 21$:
- top-1 covering rule group = {stem1→(2,3], stem2→(5,6]}
- top-2 covering rule group = {stem1→(2,3], stem1→(3,4], stem2→(5,6], stem2→(4,5]}

III. Results

III. Results (1/4)

III. Results (2/4)

III. Results (3/4)

III. Results (4/4)

IV. Conclusion and Discussion

IV. Conclusion and Discussion (1/2)
- If more rules are considered together, a further understanding of pseudoknot structure and function can be achieved.
- This paper aims to analyze the increasingly available RNA pseudoknot data and to identify interesting patterns in PseudoBase.

IV. Conclusion and Discussion (2/2)
- The obtained rule groups reveal the structural properties of pseudoknots and imply potential structure-function and structure-class relationships in RNA molecules.
- Moreover, the interpretation of the rules demonstrates their biological significance.
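Supplementary sketch: the rule-group generation in II. Material and Methods (15/20) to (20/20) can be illustrated with a small Python example. It is a minimal sketch under assumptions, not the authors' implementation. The function names `conditional_row` and `top_k_rule_group` are hypothetical, and the size lists are made-up toy values, not the Table 2 counts. Only the final line uses numbers from the deck (42 of 225 stem 1 entries in (3, 4]).

```python
from collections import Counter


def conditional_row(sizes):
    """Return {size: p(size | x)} for one structural item x (one matrix row)."""
    counts = Counter(sizes)
    n = len(sizes)
    return {size: count / n for size, count in counts.items()}


def top_k_rule_group(matrix, k, minsupp=0.0):
    """For each antecedent x, keep its k most probable consequents."""
    group = {}
    for x, row in matrix.items():
        # Rank consequents by conditional probability, highest first.
        ranked = sorted(row.items(), key=lambda item: item[1], reverse=True)
        group[x] = [(y, p) for y, p in ranked[:k] if p >= minsupp]
    return group


# Hypothetical toy sizes, NOT the PseudoBase counts from Table 2.
# A size s stands for the base interval (s - 1, s].
toy_sizes = {
    "stem1": [3, 3, 3, 4, 4, 5],
    "stem2": [6, 6, 6, 5, 5, 7],
}
matrix = {x: conditional_row(sizes) for x, sizes in toy_sizes.items()}

top1 = top_k_rule_group(matrix, k=1)
top2 = top_k_rule_group(matrix, k=2)

# The deck's own worked example: 42 of 225 stem 1 entries have four nucleotides.
p_stem1_size4 = 42 / 225
```

Raising `k` widens each group with lower-ranked rules, while `minsupp` drops rules whose conditional probability is too small. Ties keep the order in which sizes first appear, a choice made here only for simplicity.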