Active Learning as Active Inference

Brigham S. Anderson
www.cs.cmu.edu/~brigham
brigham@cmu.edu
School of Computer Science, Carnegie Mellon University
Copyright © 2006, Brigham S. Anderson

OUTLINE
• New Active Inference Algorithm
• Active Learning
• Background
• Application of new algorithm
• Example application to Hidden Markov Models
• Active sequence selection for Hidden Markov Model learning

[Figure: network of unknown nodes: "Rain Tomorrow", "Who will win American Idol?", "NP = P", "Left iron on"]

Oracle: "I will answer one question. Choose a node."
"Wow! uh… ?"
Rain tomorrow? NP = P? Is the iron on? Do I have cancer? NIPS acceptance? Today's Lotto numbers? etc…

Active Inference
Given:
1. Set of target nodes: X
2. Set of query nodes: Y
3. Probabilistic model: P(X,Y)
4. Uncertainty function: uncertainty(X)
Problem: Choose a node in Y to observe in order to minimize uncertainty(P(X)).
Why is this difficult? For every Y, we must evaluate uncertainty({X_i} | Y).

Why is this useful?
Diagnosis, Active Learning, Optimization, …
How do we quantify the "uncertainty" of a node?

Example
You have the following model of your Cancer state (Cancer is the parent of TestA and TestB):

P(Cancer):      P(no) = 0.95    P(yes) = 0.05

                P(TestA|Cancer)    P(TestB|Cancer)
P(pos | no)     0.50               0.01
P(pos | yes)    0.99               0.50
P(neg | no)     0.50               0.99
P(neg | yes)    0.01               0.50

• Your uncertainty about P(Cancer) is "bad"
• How can we quantify the badness?

The Uncertainty Function
Obvious candidates for uncertainty:
• Entropy
• Variance
• Misclassification risk
[Figure: a flat distribution P(L) has high entropy, high variance, and high misclassification risk; a peaked distribution P(L) has low entropy, low variance, and low misclassification risk]

Notation
• Given that you have not had any tests yet, what is your P(Cancer)?
  $\pi_{Cancer} = P(Cancer) = (0.95, 0.05)^T$
• In general, $\pi_X = P(X) = (p_1, p_2, \ldots, p_k)^T$

Uncertainty
• How surprised will I be?
  Entropy: $-\sum_{i=1}^{k} p_i \log p_i$
• How often will I be wrong if I guess the most likely?
  Expected misclassification: $1 - \max(p_1, p_2, \ldots, p_k)$
• How often will I be wrong if I guess probabilistically?
  "Gini": $-\sum_{i=1}^{k} p_i^2 = -\pi^T \pi$ (the Gini index $1 - \sum_i p_i^2$ with the constant dropped)
  Proposed uncertainty measure

[Figure: Uncertainty Functions for P(Cancer). Curves: ENTROPY, GINI, MISCLASSIFICATION. x-axis: P(Cancer=yes); y-axis: Uncertainty]

[Figure: The ALARM network]

[Figure: Active Inference Performances on ALARM network. Two panels: 0/1 Misclass. Error and Neg. log likelihood, each against Number of Queries. Curves: Random, Info Gain (entropy), E.Misclass, Gini]

[Figure: Active Inference Performances on Randomly Generated Networks, against Number of Queries. Curves: Random, Info Gain, Gini]

Some Nice Gini Properties
• For multinomials, minimizing $-\sum_i p_i^2$ minimizes the sum of eigenvalues of the covariance matrix.
• Can incorporate misclassification costs naturally through a weight matrix $W$ in the quadratic form: $\pi^T W \pi$

GINI Active Inference Problem
Given:
1. Set of target nodes: X
2. Set of query nodes: Y
3. Probabilistic model: P(X,Y)
4. Uncertainty function: gini(X)
Problem: Find the one node in Y expected to minimize gini(X), where
  $gini(X) = gini(X_1) + gini(X_2) + \cdots + gini(X_m) = -\pi_1^T \pi_1 - \pi_2^T \pi_2 - \cdots - \pi_m^T \pi_m$
Can do it in O(N) for polytrees (Anderson & Moore, 2005).

[Figure: Polytrees]

Example Problem
Given:
1. Target node: Cancer
2. Observable nodes: {TestA, TestB}
3. Probabilistic model: P(Cancer,TestA,TestB)
4. Uncertainty function: gini(Cancer)
Problem: Choose the test expected to minimize gini(Cancer) if we perform it.

• In order to know how a test will affect our P(Cancer), we need to know the conditional probabilities between the test results and Cancer.
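As a rough illustration of the Example Problem, the sketch below scores each test by its expected posterior Gini, using the probabilities from the Cancer example. It is a brute-force calculation via Bayes' rule, not the O(N) polytree algorithm from the talk. The function names are illustrative, and it uses the standard Gini index $1 - \sum_i p_i^2$, which differs from $-\sum_i p_i^2$ only by a constant and so ranks the tests identically.

```python
# Brute-force expected-Gini comparison for the Cancer example.
# Probabilities are taken from the slides; function names are illustrative.

prior = {"no": 0.95, "yes": 0.05}

# cpts[test][cancer_state][outcome] = P(test = outcome | Cancer = cancer_state)
cpts = {
    "TestA": {"no": {"pos": 0.50, "neg": 0.50}, "yes": {"pos": 0.99, "neg": 0.01}},
    "TestB": {"no": {"pos": 0.01, "neg": 0.99}, "yes": {"pos": 0.50, "neg": 0.50}},
}


def gini(dist):
    """Standard Gini index 1 - sum(p_i^2) of a discrete distribution."""
    return 1.0 - sum(p * p for p in dist.values())


def expected_gini(prior, cpt):
    """Expected Gini of P(Cancer | test result), averaged over test results."""
    total = 0.0
    for outcome in ("pos", "neg"):
        joint = {c: prior[c] * cpt[c][outcome] for c in prior}
        p_outcome = sum(joint.values())
        posterior = {c: joint[c] / p_outcome for c in joint}
        total += p_outcome * gini(posterior)
    return total


scores = {name: expected_gini(prior, cpt) for name, cpt in cpts.items()}
best_test = min(scores, key=scores.get)
print(scores, best_test)
```

Under these numbers TestB leaves less expected uncertainty about Cancer than TestA, so it is the test this criterion would choose.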
CPT Matrices
Definition. If A and B are discrete random variables, then $C_{A|B}$ is a CPT matrix where the ij-th element is $P(A=i \mid B=j)$.
Theorem. If A and B are discrete random variables, and $C_{A|B}$ is a CPT matrix, then
  $\pi_A = C_{A|B} \, \pi_B$
I.e., inferring one variable's distribution from another is a linear operation given the CPT matrix.

Example with Cancer and TestA (written A):
  $\pi_{Cancer} = C_{Cancer|A} \, \pi_A$
  $= \begin{pmatrix} P(Cancer=0 \mid A=0) & P(Cancer=0 \mid A=1) \\ P(Cancer=1 \mid A=0) & P(Cancer=1 \mid A=1) \end{pmatrix} \begin{pmatrix} P(A=0) \\ P(A=1) \end{pmatrix}$
  $= \begin{pmatrix} P(Cancer=0, A=0) + P(Cancer=0, A=1) \\ P(Cancer=1, A=0) + P(Cancer=1, A=1) \end{pmatrix}$
  $= \begin{pmatrix} P(Cancer=0) \\ P(Cancer=1) \end{pmatrix}$

Imagine that, instead of one cancer node, we have nodes $X_1, X_2, \ldots, X_m$ that we want to determine the gini of:
  $gini(X) = gini(X_1) + gini(X_2) + \cdots + gini(X_m)$
  $= -\pi_{X_1}^T \pi_{X_1} - \cdots - \pi_{X_m}^T \pi_{X_m}$
  $= -(C_{X_1|A} \pi_A)^T (C_{X_1|A} \pi_A) - \cdots - (C_{X_m|A} \pi_A)^T (C_{X_m|A} \pi_A)$
  $= -\pi_A^T \, G_A^{X_1, X_2, \ldots, X_m} \, \pi_A$

• So, we want $G_A^{\{targets\}}$ for each node A in the query nodes.
• How to compute all of these $G_A^{\{targets\}}$ matrices efficiently?
• Can do it with dynamic programming because…
Theorem. For any nodes X, Y, and set of nodes Z, if X and Z are conditionally independent given Y, then
  $G_X^{Z} = C_{Y|X}^T \, G_Y^{Z} \, C_{Y|X}$

Example: chain A, B, C
  $G_A^{A} = I$
  $G_B^{AB} = I + C_{A|B}^T \, G_A^{A} \, C_{A|B}$
  $G_C^{ABC} = I + C_{B|C}^T \, G_B^{AB} \, C_{B|C}$
Polytrees use a similar principle.

Fast Active Inference
• Information gain is quadratic in the number of nodes to compute (there is no way to do message passing).
• Gini is linear in the number of nodes.

[Figure: Time to Compute Gain: Random Polytrees. Curves: Entropy, Gini. Axis: Seconds]

Applications
• Active learning
• Diagnosis
• Optimization of noisy functions

Active LEARNING

Site_id   F1  F2  F3  F4  F5
0         0   0   0   1   0
1         0   1   0   1   0
2         0   0   0   0   0
3         0   0   1   1   1
4         1   0   0   1   0
5         1   1   0   0   1
6         0   0   0   0   0
7         0   0   1   1   0
8         0   1   0   0   1

OFFENSIVE labels as shown on the slide (mostly unknown): ? ? false ? true ? / ? ? false ? ? ?
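The CPT-matrix theorem and the $G$-matrix propagation rule from the preceding slides can be checked numerically on a tiny chain. The sketch below assumes a hypothetical three-node chain X → Y → Z with made-up 2×2 CPT matrices (columns index the conditioning variable, as in the definition of $C_{A|B}$). All names and numbers are illustrative rather than taken from the talk.

```python
# Numeric check of pi_A = C_{A|B} pi_B and G_X^Z = C_{Y|X}^T G_Y^Z C_{Y|X}
# on a hypothetical chain X -> Y -> Z. All numbers are made up.

def matmul(P, Q):
    """Matrix product of two lists-of-rows."""
    return [[sum(P[i][k] * Q[k][j] for k in range(len(Q)))
             for j in range(len(Q[0]))] for i in range(len(P))]


def transpose(P):
    return [list(col) for col in zip(*P)]


def matvec(P, v):
    return [sum(row[k] * v[k] for k in range(len(v))) for row in P]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


# Entry [i][j] is P(child = i | parent = j), so every column sums to 1.
C_Y_given_X = [[0.9, 0.2],
               [0.1, 0.8]]
C_Z_given_Y = [[0.7, 0.4],
               [0.3, 0.6]]
pi_X = [0.6, 0.4]

# Linear inference: push the distribution of X down the chain.
pi_Y = matvec(C_Y_given_X, pi_X)
pi_Z = matvec(C_Z_given_Y, pi_Y)

# G_Y^Z = C_{Z|Y}^T C_{Z|Y}; the theorem then gives G_X^Z without ever
# forming C_{Z|X} explicitly.
G_Y = matmul(transpose(C_Z_given_Y), C_Z_given_Y)
G_X = matmul(transpose(C_Y_given_X), matmul(G_Y, C_Y_given_X))

quadratic_form = dot(pi_X, matvec(G_X, pi_X))   # pi_X^T G_X^Z pi_X
direct = dot(pi_Z, pi_Z)                        # pi_Z^T pi_Z
print(quadratic_form, direct)
```

The two printed values agree, which is the property the dynamic program relies on: the Gini of a distant target can be evaluated as a quadratic form in the local distribution.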
Active Learning Flavors
[Diagram labels: Select Queries, Construct Queries; Pool, Sequential; Myopic, Batch]
Specifically, we're not doing decision processes, POMDPs, or any kind of policy learning.
We're asking: what is the one label you most want to see?

Active Learning
Ө: model parameter(s)
f_i: feature(s) of example i
L_i: label of example i
[Figure: nodes Ө, f1, L1]

Active Learning
[Figure: Ө with label nodes L1 to L5 and feature nodes f1 to f5; some labels are already observed as TRUE or FALSE, and inference updates Ө]
How do we select a node to minimize the uncertainty of the target node, Ө?
At each iteration, we select the one best node to observe that will minimize our expected uncertainty about the Ө node.

Active Learning
• Coincidentally, the Cancer network is analogous to our active learning problem.
  Cancer with TestA, TestB: select a test to minimize uncertainty of Cancer.
  Ө with L1 to L5 (features f1 to f5): select an L to minimize uncertainty of Ө.

Active Learning
• Which page do I show the human expert in order to learn my is-offensive model Ө?
• Which email do I show the user in order to learn my is-spam model Ө?
Active Inference
• Which question do I ask the user in order to infer his preference nodes?
• What question do I ask the user in order to infer his printer-state node(s)?

Active Learning Basics
Uncertainty Sampling     uncertainty(L)
Query by Committee       disagreement(L)
Information Gain         H(Θ) – H(Θ|L)
Gini Gain (new)          Gini(Θ) – Gini(Θ|L)

Uncertainty Sampling (Lewis and Gale, 1994)
BASIC IDEA: choose uncertain labels.
Talk assumption: uncertainty is entropy.

Uncertainty Sampling Example

id   F1  F2  F3  F4  F5   OFFEN.   P(OFFEN)   H(OFFEN)
0    0   0   0   1   0    ?        0.02       0.043
1    0   1   0   1   0    ?        0.01       0.024
2    0   0   0   0   0    ?        0.05       0.086
3    0   0   1   1   1    ?        0.33       0.910
4    1   0   0   1   0    ?        0.01       0.024
5    1   1   0   0   1    ?        0.96       0.073

Example 3 has the highest H(OFFEN), so it is queried; its label comes back FALSE.

Uncertainty Sampling
BASIC IDEA: choose the sample you are most uncertain about.
GOOD: easy
GOOD: sometimes works
BAD: H(L) measures information gained about the sample, not the model. Attracted to noisy samples.
…but at least H(L) upper bounds the information gain of L w.r.t. the model (or anything else).

We can do better than uncertainty sampling.

Query By Committee (QBC) (Seung, Opper, and Sompolinsky, 1992)
IDEA: choose labels your models disagree on.
ASSUMPTION: no noise
ASSUMPTION: perfectly learnable model
E.g., if half your version space says X is true, and the other half says it is false, you're guaranteed to reduce your version space by half if you find out X.

QBC
• Randomly draw 2 models from model space
• Classify the example
• If they disagree, select the example

t   Sex  Age     Test A  Test B  Test C  L_i
1   M    20-30   0       1       1       ?
2   F    20-30   0       1       0       ?
3   F    30-40   1       0       0       ?
4   F    60+     1       1       0       ?
5   M    10-20   0       1       0       ?
6   M    20-30   1       1       1       ?

Three draws are shown on the slides:
  θ1 = FALSE, θ2 = FALSE: the models agree
  θ1 = TRUE,  θ2 = TRUE:  the models agree
  θ1 = TRUE,  θ2 = FALSE: the models disagree, so the example is selected (example 3, whose label is then revealed as FALSE)

Query By Committee (QBC)
IDEA: choose labels your models disagree on.
In the noise-free case, H(L) is entirely due to uncertainty about the model, so it reduces to uncertainty sampling!
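Uncertainty sampling, which QBC reduces to in the noise-free case, can be sketched in a few lines. The example below uses the P(OFFEN) column from the Uncertainty Sampling Example and picks the example whose label has the highest entropy. The function names are illustrative, and entropy is measured in bits here; the choice of logarithm base does not change which example is selected.

```python
import math

# P(OFFEN) for each unlabeled example id, copied from the example table.
p_offensive = {0: 0.02, 1: 0.01, 2: 0.05, 3: 0.33, 4: 0.01, 5: 0.96}


def binary_entropy(p):
    """Entropy in bits of a binary label with P(label = true) = p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def uncertainty_sample(probs):
    """Return the id of the example whose label is most uncertain."""
    return max(probs, key=lambda example_id: binary_entropy(probs[example_id]))


query_id = uncertainty_sample(p_offensive)
print(query_id)
```

Because the score depends only on P(L), a label that is inherently noisy (close to 0.5 under every plausible model) would be chosen again and again, which is the weakness the slides point out.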
If we allow noisy samples and use a model posterior instead of a version space, QBC starts to look exactly like…

Information Gain
• Choose the unlabeled example whose label has the greatest information gain w.r.t. the model.
[Figure: Ө with label nodes L1 to L5 and feature nodes f1 to f5]
  $IG(L; \Theta) = H(\Theta) - H(\Theta \mid L) = H(L) - H(L \mid \Theta)$
Interesting: Uncertainty sampling ≈ Information Gain when H(L|Θ) is small relative to H(L).

Information Gain Example
Assume that our model space consists of two models, θ1 and θ2 …

id   F1  F2  F3  F4  F5   OFFEN.   P(OFFEN|θ1)   P(OFFEN|θ2)   IG(OFFEN; Θ)
0    0   0   0   1   0    ?        0.02          0.02          0.000
1    0   1   0   1   0    ?        0.12          0.01          0.230
2    0   0   0   0   0    ?        0.07          0.05          0.025
3    0   0   1   1   1    ?        0.33          0.33          0.000
4    1   0   0   1   0    ?        0.02          0.01          0.007
5    1   1   0   0   1    ?        0.99          0.96          0.022

Example 1 has the highest information gain, so it is queried; its label comes back FALSE.

Gini Gain
• Use the active inference algorithm from the first part of this talk…
  Target node: Ө
  Query nodes: {L_i}
[Figure: Ө with label nodes L1 to L5 and feature nodes f1 to f5]

Gini Gain
Definition. The Gini gain between two random variables X and Y, denoted GG(Y;X), is defined as
  $GG(Y; X) = gini(Y) - gini(Y \mid X) = -\pi_Y^T \pi_Y - \sum_{x \in dom(X)} P(x) \, gini(Y \mid X = x)$

Active Learning Basics

Method                 Criterion               PRO                                                CON
Uncertainty Sampling   uncertainty(L)          Simple                                             Misled by noise
Query by Committee     disagreement(L)         Simple                                             No good theory for noise
Information Gain       H(Θ) – H(Θ|L)           Information theory-based                           Does not scale well
Gini Gain              Gini(Θ) – Gini(Θ|L)     Scales extremely well. Can use confusion costs.

Interesting Question
• Can we "fix" uncertainty sampling by approximating H(L|Ө)?
  If we can do this, it will approximate information gain.

We're Still Not Happy
• All of the active learning methods used this model:
[Figure: Ө with label nodes L1 to L5 and feature nodes f1 to f5]
…But something seems wrong…

We're Still Not Happy
We usually don't want information about the model…
We want information about the test set labels!
[Figure: Ө connected to a Test Set (labels Z1 to Z3 with features f'1 to f'3) and a Training Set (labels L1 to L4 with features f1 to f4)]

Information Gain Approach
Information Gain:
  $Y^* = \arg\max_Y \; IG(Y; Z_1) + IG(Y; Z_2) + \cdots + IG(Y; Z_m)$
This blows up quadratically, since we're evaluating each L's effect on each Z in the test set.

Gini Gain Approach
• Gini Gain:
  Target nodes: {Z_i}
  Query nodes: {Y_i}
Note that the structure of this problem is a polytree, so the algorithm is O(N).
Work in progress.

The SwitchMaster™ (powered by Hidden Markov Models!)
INPUT: Binary stream of motion / no-motion
OUTPUT: Probability distribution over
• Phone,
• Meeting,
• Computer, and
• Out
E.g., "There is an 86% chance that the user is in a meeting right now."

Hidden Markov Model
Model parameters Ө = {π0, A, B}

π0 = [ P(S0=1)  P(S0=2)  …  P(S0=n) ]

A = [ P(St+1=1|St=1)  …  P(St+1=n|St=1) ]
    [ P(St+1=1|St=2)  …  P(St+1=n|St=2) ]
    [ …                                 ]
    [ P(St+1=1|St=n)  …  P(St+1=n|St=n) ]

B = [ P(O=1|S=1)  …  P(O=m|S=1) ]
    [ P(O=1|S=2)  …  P(O=m|S=2) ]
    [ …                         ]
    [ P(O=1|S=n)  …  P(O=m|S=n) ]

[Figure: hidden states S0 to S3, each emitting an observation O0 to O3]

SwitchMaster HMM

A = [ P(St+1=Phone|St=Phone)    …                     ]
    [ P(St+1=Phone|St=Meeting)  …                     ]
    [ …                         …  P(St+1=Out|St=Out) ]

B = [ P(Ot=1 | St=Phone)    ]
    [ P(Ot=1 | St=Meeting)  ]
    [ P(Ot=1 | St=Computer) ]
    [ P(Ot=1 | St=Out)      ]

[Figure: hidden states S1 to S4, each emitting an observation O1 to O4]

HMM Inference

t   Ot   P(St=Phone)   P(St=Meeting)   P(St=Computer)   P(St=Out)
1   0    1.00          0.0             0.00             0.0
2   0    1.00          0.0             0.00             0.0
3   1    0.0           0.10            0.80             0.10
4   1    0.0           0.11            0.80             0.09
5   1    0.0           0.12            0.80             0.08
6   0    0.0           0.10            0.78             0.12
…   …    …             …               …                …

Active Learning!
"Good Morning Sir! Here's the video footage of yesterday. Could you just go through it and label each frame?"
"Good Morning Sir! Can you tell me what you are doing in this frame of video?"

HMM User Model
observations: Motion sensors, Microphones, Keyboard activity, etc.
states (HIDDEN): Phone, Meeting, Computer, Out
[Figure: hidden states S1 to S4 with observations O1 to O4; below each state, a bar chart of state probabilities for Phone/Meeting/Computer/Out]

HMMs and Active Learning
…Now suppose that our human labels this time step
[Figure: the same HMM with state probabilities for Phone/Meeting/Computer/Out]

HMMs and Active Learning
…No problem, if we know the true state…

HMMs and Active Learning using Evidence
[Figure: hidden states S1 to S4 with observations O1 to O4 and additional nodes L1 to L4]
L1 to L4 are "queryable" observations (costly observations, labels, uncertain labels, tests, etc.)
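The per-time-step state probabilities in the HMM Inference table are the kind of output the forward-backward algorithm produces. The sketch below is a minimal pure-Python forward-backward pass over a SwitchMaster-style model, followed by a simple pick of the time step whose state posterior has the highest Gini index. The parameter values are invented for illustration, and the selection rule is plain per-state uncertainty, not the Gini-gain algorithm of Anderson and Moore (2005).

```python
# Minimal scaled forward-backward for a discrete HMM, in pure Python.
# A[i][j] = P(S_{t+1}=j | S_t=i) and B[i][o] = P(O=o | S=i), matching the
# row conventions on the HMM slide. All parameter values are hypothetical.

def forward_backward(pi0, A, B, obs):
    """Return P(S_t | O_1:T) for every time step t."""
    n, T = len(pi0), len(obs)
    alphas = []
    for t, o in enumerate(obs):
        if t == 0:
            a = [pi0[i] * B[i][o] for i in range(n)]
        else:
            prev = alphas[-1]
            a = [B[j][o] * sum(prev[i] * A[i][j] for i in range(n))
                 for j in range(n)]
        z = sum(a)
        alphas.append([x / z for x in a])   # normalize to avoid underflow

    betas = [[1.0] * n for _ in range(T)]
    for t in range(T - 2, -1, -1):
        o_next = obs[t + 1]
        b = [sum(A[i][j] * B[j][o_next] * betas[t + 1][j] for j in range(n))
             for i in range(n)]
        z = sum(b)
        betas[t] = [x / z for x in b]

    posteriors = []
    for a, b in zip(alphas, betas):
        g = [x * y for x, y in zip(a, b)]
        z = sum(g)
        posteriors.append([x / z for x in g])
    return posteriors


def gini(dist):
    return 1.0 - sum(p * p for p in dist)


states = ["Phone", "Meeting", "Computer", "Out"]
pi0 = [0.25, 0.25, 0.25, 0.25]
A = [[0.85 if i == j else 0.05 for j in range(4)] for i in range(4)]
B = [[0.4, 0.6],     # Phone:    P(no motion), P(motion)
     [0.7, 0.3],     # Meeting
     [0.5, 0.5],     # Computer
     [0.99, 0.01]]   # Out
obs = [0, 0, 1, 1, 1, 0]

posteriors = forward_backward(pi0, A, B, obs)
query_t = max(range(len(obs)), key=lambda t: gini(posteriors[t]))
print(query_t, posteriors[query_t])
```

A Gini-gain learner would instead ask how much observing each queryable node is expected to reduce the uncertainty of all target nodes; the posteriors computed here are only the starting point for that calculation.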
HMMs and Active Learning using Evidence
[Figure: hidden states S1 to S4 with observations O1 to O4 and queryable nodes L1 to L4; state probabilities for Phone/Meeting/Computer/Out under each state]
…Now we choose a measurement…

HMMs and Active Learning
Active Learning: What is the optimal observation, L1, L2, L3, or L4?
Choose L* to minimize uncertainty of the model or the hidden states?

HMMs and Active Learning
[Figure: a longer HMM with states S1 to S7, observations O1 to O7, and queryable nodes L1 to L7]
? hmm… ? ?
hmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm…

HMMs and Active Learning
The SwitchMaster™ is trying to minimize the uncertainty of some target node(s).
…What are its target nodes?

HMM Inference Tasks
States
• Path: Viterbi algorithm
• Individual States: Forward-Backward algorithm
Parameters
• Baum-Welch algorithm

Different entropy-based and Gini-based active learners

          Path                             States                                     Model
Entropy   $H(S_1, S_2, \ldots, S_T)$       $\sum_{t=1}^{T} H(S_t)$                    $H(\Theta)$
Gini      $-\pi_{joint}^T \pi_{joint}$     $-\sum_{t=1}^{T} \pi_{S_t}^T \pi_{S_t}$    $-\pi_{\Theta}^T \pi_{\Theta}$

Efficient myopic algorithms for each of these objective functions are given in Anderson and Moore, 2005.

Active State Learning with Information Gain
  $Y^* = \arg\max_Y \sum_{t=1}^{T} IG(Y; S_t)$
Cost: $O(T^2 M N^2)$
[Figure: states S1 to S4 with queryable nodes L1 to L4]

Active State Learning with Gini
  $Y^* = \arg\max_Y \sum_{t=1}^{T} GG(Y; S_t)$
Cost: $O(T^2 M N^2)$ when computed naively, reduced to $O(T M N^2)$
[Figure: states S1 to S4 with queryable nodes L1 to L4]

Experiment: User Model
States: Emacs-Latex, Emacs-Code, Shell, Email, Other
Observations: Key duration (msec), Key transition time (msec), Key category (alpha, space, enter, punc, edit)
1 keystroke = 1 timestep
20,000 timesteps

[Figure: Results. Curves: Random, Uncertainty sampling, Gini]

Active sequence selection for Hidden Markov Model learning (Anderson, Siddiqi, and Moore, 2006)

Actively Selecting Excerpts
"Good Morning Sir!
Could you please label the following scene from yesterday…"
[Figure: binary observation stream]
OK, which subsequence would be most informative about my model? There are O(T²) of them!
? hmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm…
[Figure: binary observation stream with one excerpt annotated by the expert as P P M M C P M]
Note: the expert annotates each of the states.

Possible applications of "excerpt selection"
• Selecting utterances from audio
• Selecting excerpts from text
• Selecting sequences from DNA

Excerpt Selection
PROBLEM: Find the sequence S = {S_t, S_{t+1}, …, S_{t+k}} to maximize IG(S; Θ).
NOTE: We're not using Gini, we're using information gain!
Trick question: Which subsequence maximizes IG(S; Θ)?

Sequence Selection
We have to include the cost incurred when we force an expert to sit down and label 1000 examples…
  $score(S; \Theta) = IG(S; \Theta) - \alpha |S|$
So there is a constant cost, α, associated with providing each label.
IG(S; Θ) is computed from the entropy of the sequence, H(S). How do we compute H(S)?

What is the Entropy of a Sequence?
• H(S_{1:4}) = H(S1, S2, S3, S4) = ?
The Chain Rule of Entropy:
  H(S1, S2, S3, S4) = H(S1) + H(S2|S1) + H(S3|S1,S2) + H(S4|S1,S2,S3)
…but we have some structural information (the Markov chain S1, S2, S3, S4):
  H(S1, S2, S3, S4) = H(S1) + H(S2|S1) + H(S3|S2) + H(S4|S3)

Entropy of a Sequence
  $H(S_t, S_{t+1}, \ldots, S_{t+k}) = H(S_t) + \sum_{i=t}^{t+k-1} H(S_{i+1} \mid S_i)$
We still get the H(S_t) and H(S_{t+1}|S_t) values from P(S_t | O_{1:T}) and P(S_{t+1} | S_t, O_{1:T}).

Score of a Sequence
  $score(S; \Theta) = H(S) - H(S \mid \Theta) - \alpha |S|$
  $= H(S_t) + \sum_{i=t}^{t+k-1} H(S_{i+1} \mid S_i) - H(S_t \mid \Theta) - \sum_{i=t}^{t+k-1} H(S_{i+1} \mid S_i, \Theta) - \alpha |S|$

How can I find the best excerpt of length k?
[Figure: binary observation stream]

Find Best Sequence of Length k
1. Score each length-k subsequence according to score(S; Ө) = H(S) – H(S|Ө)
2. Select the best one
[Figure: binary observation stream with the best window for k=5 marked ***]
Some simple caching gives O(T).

Yeah, but what if I don't know k?
I want to find the best excerpt of any length.
[Figure: binary observation stream]

Find Best Sequence of Any Length
Hmm…
1. Score all possible intervals
2. Pick the best one
That's O(T²).
We could cleverly cache some of the computation as we go…
But we're still going to be O(T²).

Similar Problem
Find the sequence with largest integral.
[Figure: plot of f(t) against t, taking both positive and negative values, with the unknown best interval marked "? ?? ?"]
Note: a Google interview question.
Can be done using Dynamic Programming in O(T).

DP Intuition
state(t) = the best interval so far, and the best interval ending at t
state(t+1) =
  if f(t+1) + score of best-ending-at-t < 0
  then start a new interval at the next step
  else "keep going"

Find Best Sequence of Any Length
Use DP to find the subsequence that maximizes
  score(S; Ө) = H(S) – H(S|Ө) – α|S|
[Figure: binary observation stream with the best excerpt marked ***]

Not Just HMMs
This active learning algorithm can be applied to any sequential process with the Markov property.
E.g., Kalman filters

SUMMARY
• Linear time active inference using Gini
• Applications to Hidden Markov Models
• Applications to general Active Learning
[Figure: Ө connected to test-set labels Z1 to Z3 (features f'1 to f'3) and training-set labels L1 to L4 (features f1 to f4)]
• Active sequence selection

Future Work
• On-line active learning
• Batch active learning
• Optimization of noisy functions

Selective Sampling Bias?

Related Work
• Label selection for tracking in text HMMs (Scheffer, et al. 2001)
• Nonmyopic label selection for tracking with chain models (Krause & Guestrin, 2005)
• Label selection for model learning in general graphical models (Tong & Koller, 2001)

Backup: Gini of several target nodes
Imagine that, instead of one cancer node we're interested in, we have $X_1, X_2, \ldots, X_m$ that we want to determine the gini of:
  $gini(X) = gini(X_1) + gini(X_2) + \cdots + gini(X_m)$
  $= -\pi_1^T \pi_1 - \pi_2^T \pi_2 - \cdots - \pi_m^T \pi_m$
  $= -(C_{X_1|A} \pi_A)^T (C_{X_1|A} \pi_A) - \cdots - (C_{X_m|A} \pi_A)^T (C_{X_m|A} \pi_A)$
  $= -\pi_A^T C_{X_1|A}^T C_{X_1|A} \pi_A - \cdots - \pi_A^T C_{X_m|A}^T C_{X_m|A} \pi_A$
  $= -\pi_A^T \left( C_{X_1|A}^T C_{X_1|A} + \cdots + C_{X_m|A}^T C_{X_m|A} \right) \pi_A$
  $= -\pi_A^T \, G_A^{X_1, X_2, \ldots, X_m} \, \pi_A$

Backup: DP state and update rules
state(t) = [a*, b*] : best interval so far
           a_temp   : start of best interval ending at t
The quantities compared are sum(a*, b*) and sum(a_temp, t).
Rules:
  if ( sum(a_temp, t-1) + y(t) < 0 ) then a_temp = t+1   (the running interval is dropped; an empty interval has sum 0)
  if ( sum(a_temp, t) > sum(a*, b*) ) then [a*, b*] = [a_temp, t]
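
The two search procedures from the sequence-selection slides can be sketched as follows. This is a simplified illustration that assumes the score of an interval is a plain sum of per-step gains `gains[t]`, with the labeling cost α subtracted at each step; the first-step term H(S_t) of the real score is ignored, and the function and variable names are hypothetical. The any-length search uses Kadane's algorithm, a standard variant of the DP rules above that restarts whenever the running sum before step t is negative and always returns a non-empty interval.

```python
def best_window(gains, k):
    """Best contiguous window of exactly k steps, via prefix sums (O(T))."""
    prefix = [0.0]
    for g in gains:
        prefix.append(prefix[-1] + g)
    start = max(range(len(gains) - k + 1),
                key=lambda a: prefix[a + k] - prefix[a])
    return start, start + k - 1, prefix[start + k] - prefix[start]


def best_interval(gains, alpha=0.0):
    """Best non-empty contiguous interval of any length, in O(T).

    Each step contributes gains[t] - alpha, where alpha is the per-label cost.
    Returns (first index, last index, total score), indices inclusive.
    """
    best = (0, 0, gains[0] - alpha)
    start, running = 0, 0.0
    for t, g in enumerate(gains):
        if running < 0:          # the interval ending at t-1 only hurts: restart
            start, running = t, 0.0
        running += g - alpha
        if running > best[2]:
            best = (start, t, running)
    return best


gains = [1, -2, 3, 4, -1, 2, -6, 1]
print(best_window(gains, 3))
print(best_interval(gains))
print(best_interval(gains, alpha=1.0))
```

Raising α shortens the selected excerpt, since every extra label has to pay for itself; with α = 0 the search simply maximizes the summed gain.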