Application of AI to molecular biology
Notes for 03/09/2006

Representing gene networks:
- We need to represent the relations among several thousand genes.
- There are several different types of relations (protein 1 causes protein 2 to become active; protein 1 and protein 2 together cause the activation of protein 3; etc.).

Do versus observe:
Assume the following example, in which wet = f(rain, sprinkler). Then:
P(rain | wet) > 0 (observation), while P(rain | do(wet)) = P(rain) (action): wetting the ground ourselves tells us nothing about rain.

Same with genes:
If I observe g3 to be high (active, ...), it might have been the result of g1 and g2. However, if I MAKE g3 high (artificially, ...), this may have no effect on g1 or g2. Thus doing something is different from observing it.

Another example:
In a group of people, 50% were given a treatment and 50% were not. In both groups, 50% of the people recovered, while the other 50% did not (e.g. died). Let:
X = 1: treatment was given
X = 0: treatment was not given
Y = 1: person died
Y = 0: person recovered
Clearly, P(X = 1, Y = 1) = P(X = 0, Y = 1) = P(X = 1, Y = 0) = P(X = 0, Y = 0) = 0.25.

We have 2 queries:
Q1: What is the probability that Joe's death occurred due to the treatment?
Q2: What is the probability that Joe, who died after treatment, would have lived had he not been treated?
To answer these, we will use probabilistic causal models.

Probabilistic causal models:
- directed acyclic graph (DAG) of variables
- variables:
  - exogenous: they have no parents, and their probabilities are given
  - other: they have parents; their probabilities are not given, and their values are given as a function of their parents (e.g. wet = f(rain, sprinkler) in the first example)
- the variables are thus deterministic; the only non-determinism comes from the exogenous variables.

Back to example:
V1 - variable that decides treatment
V2 - variable that decides if one will die or not

Model 1:
Here X = V1 and Y = V2.

V1           0     0     1     1
V2           0     1     0     1
X            0     0     1     1
Y            0     1     0     1
Probability  0.25  0.25  0.25  0.25

Q1: We have V1 = V2 = 1. Thus P(Q1) = 0.25.
Q2: Again, V1 = V2 = 1. Since Y = V2 does not depend on X, Joe would have died either way: P(Y = 0 | do(X = 0)) = 0. Holding V2 = 1, we also have P(Y = 0 | X = 0) = 0.

Model 2:
V1 - variable that decides treatment
V2 - a genetic factor which, if present, kills people who take this treatment or, if absent, kills people who do not take the treatment.
X = V1
Y = X*V2 + (1-X)*(1-V2)

We obtain the same probability distribution over (X, Y) as in Model 1, since each combination of X and Y still has probability 0.25:

V1           0     0     1     1
V2           0     1     0     1
X            0     0     1     1
Y            1     0     0     1
Probability  0.25  0.25  0.25  0.25

However, the answers to the queries may change. Again, V1 = V2 = 1, and now P(Y = 0 | do(X = 0)) = 1: had Joe not been treated, he would have lived.
Thus if I do something, the model may not apply, e.g. I can make X = 0 even if V1 = 1. (It's outside the model.)

Another example:
U - court ordered execution
C - captain gives signal
V - rifleman A is nervous
A - rifleman A shoots
B - rifleman B shoots
D - prisoner dies

Functions:
C = U
A = C or V
B = C
D = A or B

Query Q: The prisoner died. What is the probability that the prisoner would not have died if A had decided not to shoot? (How responsible is A for the death of the prisoner?)

With P(U = 1) = p and P(V = 1) = q:

U  V  A  B  C  D  Prob.
0  0  0  0  0  0  (1-p)(1-q)
0  1  1  0  0  1  (1-p)q
1  0  1  1  1  1  p(1-q)
1  1  1  1  1  1  pq

Only the last 3 rows are relevant (prisoner died). Thus we calculate their new probabilities:
Row 2: (1-p)q / (1 - (1-p)(1-q))
Row 3: p(1-q) / (1 - (1-p)(1-q))
Row 4: pq / (1 - (1-p)(1-q))

Now we set the value of A to zero (rifleman A decides not to shoot). Then we obtain the following new values for A, B, C and D:

   Row 2  Row 3  Row 4
A  0      0      0
B  0      1      1
C  0      1      1
D  0      1      1

The only case in which the prisoner does not die (D = 0) is the first one (row 2), thus the probability of our query is the probability of that row:
P(Q) = (1-p)q / (1 - (1-p)(1-q)).

General algorithm:
1. Make the joint probability distribution by enumerating the exogenous variables.
2. Take the observation into account.
3. Re-compute the joint probability distribution.
4. Based on the query, change the model (take into account 'do').
5. Re-compute the joint probability distribution with respect to the new model.
6. Answer the query, e.g. find P(X | do(Y)).
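
The six steps above can be sketched in Python for models whose exogenous variables are binary and independent. This is only an illustration under stated assumptions, not code from the lecture: the function names (counterfactual, treatment_model_1, treatment_model_2, firing_squad) are hypothetical, the values p = 0.7 and q = 0.2 are arbitrary, and the firing squad model takes C = U, as the rows of the table imply.

```python
from itertools import product


def counterfactual(priors, model, evidence, intervention, query):
    """Return P(query | evidence, do(intervention)) by brute-force enumeration.

    priors: exogenous variable name -> P(variable = 1), assumed independent.
    model: function (exo, do) -> dict with the value of every variable.
    """
    names = list(priors)
    # Steps 1-3: enumerate the exogenous settings, keep those consistent
    # with the observation, and record their unnormalised probabilities.
    consistent = {}
    for values in product([0, 1], repeat=len(names)):
        exo = dict(zip(names, values))
        weight = 1.0
        for name in names:
            weight *= priors[name] if exo[name] else 1.0 - priors[name]
        world = model(exo, {})
        if all(world[var] == val for var, val in evidence.items()):
            consistent[values] = weight
    total = sum(consistent.values())
    if total == 0:
        raise ValueError("the observation has probability zero")
    # Steps 4-6: re-run the model under the intervention and add up the
    # renormalised probability of the settings that satisfy the query.
    answer = 0.0
    for values, weight in consistent.items():
        world = model(dict(zip(names, values)), intervention)
        if all(world[var] == val for var, val in query.items()):
            answer += weight / total
    return answer


def treatment_model_1(exo, do):
    # Model 1: X = V1, Y = V2 (death does not depend on the treatment).
    w = dict(exo)
    w["X"] = do.get("X", w["V1"])
    w["Y"] = do.get("Y", w["V2"])
    return w


def treatment_model_2(exo, do):
    # Model 2: X = V1, Y = X*V2 + (1-X)*(1-V2).
    w = dict(exo)
    w["X"] = do.get("X", w["V1"])
    w["Y"] = do.get("Y", w["X"] * w["V2"] + (1 - w["X"]) * (1 - w["V2"]))
    return w


def firing_squad(exo, do):
    # C = U (assumed from the table), A = C or V, B = C, D = A or B.
    w = dict(exo)
    w["C"] = do.get("C", w["U"])
    w["A"] = do.get("A", int(w["C"] or w["V"]))
    w["B"] = do.get("B", w["C"])
    w["D"] = do.get("D", int(w["A"] or w["B"]))
    return w


# Q2 for Joe (treated and died): would he have lived without treatment?
half = {"V1": 0.5, "V2": 0.5}
joe = {"X": 1, "Y": 1}
q2_model_1 = counterfactual(half, treatment_model_1, joe, {"X": 0}, {"Y": 0})
q2_model_2 = counterfactual(half, treatment_model_2, joe, {"X": 0}, {"Y": 0})

# Firing squad: the prisoner died; would he have lived had A not shot?
p, q = 0.7, 0.2  # illustrative values only
squad = counterfactual({"U": p, "V": q}, firing_squad,
                       {"D": 1}, {"A": 0}, {"D": 0})
print(q2_model_1, q2_model_2, squad)
```

The two treatment models give different answers to Q2 (0 for Model 1, 1 for Model 2) even though they agree on the joint distribution of X and Y, and the firing squad result matches (1-p)q / (1 - (1-p)(1-q)). Enumeration is exponential in the number of exogenous variables, so this is only practical for small examples like these.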