Clustering and Probability (Chap 7)

Review from Last Lecture
- Defined the K-means problem, which formalizes the notion of clustering.
- Discussed the K-means algorithm.
- Noted that the K-means algorithm is "quite good" at discovering "concepts" from data (based on features).
- Noted the important distinction between "attributes" and "features".

Example of K-means (1)

| Patient   | Measure 1 | Measure 2 |
|-----------|-----------|-----------|
| Patient 1 | 1         | 1         |
| Patient 2 | 2         | 1         |
| Patient 3 | 3         | 4         |
| Patient 4 | 4         | 5         |

Let the initial centroids be C1 = (1,1) and C2 = (2,1).

Example of K-means (2)

| Patient   | Measure 1 | Measure 2 | Dist to C1 (1,1) | Dist to C2 (2,1) | Nearest |
|-----------|-----------|-----------|------------------|------------------|---------|
| Patient 1 | 1         | 1         | 0                | 1                | C1      |
| Patient 2 | 2         | 1         | 1                | 0                | C2      |
| Patient 3 | 3         | 4         | 3.61             | 3.16             | C2      |
| Patient 4 | 4         | 5         | 5                | 4.47             | C2      |

Updated centroids: C1 = (1,1); C2 = ((2+3+4)/3, (1+4+5)/3) = (3, 3.33).

Example of K-means (3)

| Patient   | Measure 1 | Measure 2 | Dist to C1 (1,1) | Dist to C2 (3,3.33) | Nearest |
|-----------|-----------|-----------|------------------|---------------------|---------|
| Patient 1 | 1         | 1         | 0                | 3.07                | C1      |
| Patient 2 | 2         | 1         | 1                | 2.54                | C1      |
| Patient 3 | 3         | 4         | 3.61             | 0.67                | C2      |
| Patient 4 | 4         | 5         | 5                | 1.95                | C2      |

Updated centroids: C1 = ((1+2)/2, (1+1)/2) = (1.5, 1); C2 = ((3+4)/2, (4+5)/2) = (3.5, 4.5).

Example of K-means (4)

| Patient   | Measure 1 | Measure 2 | Dist to C1 (1.5,1) | Dist to C2 (3.5,4.5) | Nearest |
|-----------|-----------|-----------|--------------------|----------------------|---------|
| Patient 1 | 1         | 1         | 0.5                | 4.30                 | C1      |
| Patient 2 | 2         | 1         | 0.5                | 3.81                 | C1      |
| Patient 3 | 3         | 4         | 3.35               | 0.71                 | C2      |
| Patient 4 | 4         | 5         | 4.72               | 0.71                 | C2      |

The assignments are unchanged, so the centroids recompute to C1 = (1.5, 1) and C2 = (3.5, 4.5) again, and the algorithm has converged. (A code sketch of these iterations follows the next slide.)

Example: 2 Clusters

Four points at the corners of a 2 × 4 rectangle centred on the origin (0,0): A(-1,2), B(1,2), C(-1,-2), D(1,-2).
- K-means problem: the optimal solution is centroids (0,2) and (0,-2), with clusters {A,B} and {C,D}.
- K-means algorithm: suppose the initial centroids are (-1,0) and (1,0). Then {A,C} and {B,D} end up as the two clusters; the algorithm converges to this suboptimal solution.

Several Other Issues Regarding Clustering
- How do you select the initial centroids?
- How do you select the right number of clusters?
- How do you deal with non-Euclidean distance/similarity measures?
- Other approaches (hierarchical, spectral, etc.).
- The curse of high dimensionality.
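To make the worked example concrete, here is a minimal K-means sketch (Lloyd's algorithm) in Python with NumPy, run on the patient data and initial centroids from the tables above. The function name and structure are illustrative, not from the lecture, and the sketch assumes no cluster ever becomes empty (true for this data).

```python
import numpy as np

def kmeans(points, centroids, max_iters=100):
    """Lloyd's algorithm: alternate nearest-centroid assignment and
    centroid recomputation until the centroids stop moving."""
    for _ in range(max_iters):
        # Euclidean distance from every point to every centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Each new centroid is the mean of the points assigned to it
        # (assumes no cluster is empty; a robust version must handle that case).
        new_centroids = np.array([points[labels == k].mean(axis=0)
                                  for k in range(len(centroids))])
        if np.allclose(new_centroids, centroids):  # converged
            break
        centroids = new_centroids
    return labels, centroids

# Patient data and initial centroids from the tables above.
points = np.array([[1, 1], [2, 1], [3, 4], [4, 5]], dtype=float)
init = np.array([[1, 1], [2, 1]], dtype=float)
labels, centroids = kmeans(points, init)
print(labels)     # [0 0 1 1]  -> {Patient 1, 2} and {Patient 3, 4}
print(centroids)  # [[1.5 1. ] [3.5 4.5]]
```

Starting the same function from centroids (-1,0) and (1,0) on the four-point rectangle example reproduces the suboptimal {A,C}, {B,D} clustering, illustrating the sensitivity to initialization raised in the issues list.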
Question

| S-Length | S-Width | P-Length | P-Width | Flower         |
|----------|---------|----------|---------|----------------|
| Small    | Medium  | Small    | Medium  | A (Setosa)     |
| Medium   | Medium  | Large    | Large   | O (Versicolor) |
| Medium   | Small   | Small    | Large   | I (Virginica)  |
| Large    | Large   | Medium   | Small   | A              |
| Large    | Small   | Medium   | Small   | ?              |

What should the "prediction" be for the last flower?

Prediction and Probability

When we make predictions we should attach "probabilities" to the prediction. Examples:
- 20% chance it will rain tomorrow.
- 50% chance that the tumor is malignant.
- 60% chance that the stock market will fall by the end of the week.
- 30% chance that the next president of the United States will be a Democrat.
- 0.1% chance that the user will click on a banner ad.

How do we assign probabilities to complex events? Using smart data algorithms... and counting.

Probability Basics

Probability is a deep topic, but in most cases the rules are straightforward to apply.

Terminology:
- Experiment
- Sample space
- Events
- Probability
- Rules of probability
- Conditional probability
- Bayes rule

Probability: Sample Space

Consider an experiment and let S be the space of possible outcomes.
- Experiment: tossing a coin; S = {h, t}.
- Experiment: rolling a pair of dice; S = {(1,1), (1,2), ..., (6,6)}.
- Experiment: a race among three cars 1, 2 and 3; S = {(1,2,3), (1,3,2), (2,1,3), (2,3,1), (3,1,2), (3,2,1)}.

Probabilities

Let the sample space be S = {1, 2, ..., m}. Consider numbers p_i ≥ 0, i = 1, 2, ..., m, with Σ_i p_i = 1. Then p_i is the probability that the outcome of the experiment is i.

Suppose we toss a fair coin. The sample space is S = {h, t}, and p_h = 0.5, p_t = 0.5.

Probability

Experiment: will it rain or not in Sydney? S = {rain, no-rain}.
P(rain) = 138/365 ≈ 0.38; P(no-rain) = 227/365 ≈ 0.62.

How to assign probabilities is a deep philosophical problem. What is the probability that the "green object standing outside my house is a burglar dressed in green"?

Probability

An event A is a set of possible outcomes of the experiment; thus A is a subset of S.

Let A be the event of getting a seven when we roll a pair of dice:
A = {(1,6), (6,1), (2,5), (5,2), (4,3), (3,4)}
P(A) = 6/36 = 1/6

In general, P(A) = Σ_{i ∈ A} p_i.

Probability

The sample space S and events are "sets".
- P(S) = 1; P(∅) = 0.
- Addition: P(A ∪ B) = P(A) + P(B) − P(A ∩ B). Notation: P(A ∩ B) ≡ P(AB) ≡ P(A, B).
- Complement: P(A^c) = 1 − P(A).

Example

Suppose the probability of rain today is 0.4, the probability of rain tomorrow is also 0.4, and the probability of rain on both days is 0.1. What is the probability it does not rain on either day?

S = {(R,R), (R,N), (N,R), (N,N)}. Let A be the event that it rains today and B the event that it rains tomorrow. Then A = {(R,N), (R,R)} and B = {(N,R), (R,R)}.

Rain today or tomorrow (or both): P(A ∪ B) = 0.4 + 0.4 − 0.1 = 0.7.
No rain on either day: 1 − 0.7 = 0.3.

Conditional Probability

One of the most important concepts in all of data mining and machine learning:
P(A|B) = P(AB)/P(B), assuming P(B) ≠ 0.
This is the conditional probability of A given that B has occurred.

Example: the probability it will rain tomorrow given that it has rained today is
P(B|A) = P(AB)/P(A) = 0.1/0.4 = 0.25.

In general, P(A|B) is not equal to P(B|A).

We Need Conditional Probability to Answer...

Recall the iris table from the "Question" slide: given the four labelled flowers, what should the "prediction" be for the flower (Large, Small, Medium, Small)?

Bayes Rule

P(A|B) = P(AB)/P(B); P(B|A) = P(BA)/P(A).
Now P(AB) = P(BA), so P(A|B)P(B) = P(B|A)P(A), and thus

P(A|B) = P(B|A)P(A) / P(B)

This is called Bayes rule.
- It is the basis of almost all prediction.
- Recent theories hypothesize that human memory and action are Bayes rule in action.

Bayes Rule: Prior and Posterior

In P(A|B) = P(B|A)P(A)/P(B), the term P(A) is the prior and P(A|B) is the posterior. In prediction terms:

P(hypothesis | data) = P(data | hypothesis) P(hypothesis) / P(data)

Bayes Rule: Example

The ASX market goes up on 60% of the days of a year; 40% of the time it stays the same or goes down. On days when the ASX is up, there is a 50% chance that the Shanghai index is up; on other days there is a 30% chance that Shanghai goes up. Suppose the Shanghai market is up. What is the probability that the ASX was up?

Define A1 = "ASX is up" and A2 = "ASX is not up"; S1 = "Shanghai is up" and S2 = "Shanghai is not up". We want P(A1|S1).

P(A1) = 0.6; P(A2) = 0.4; P(S1|A1) = 0.5; P(S1|A2) = 0.3.
P(S2|A1) = 1 − P(S1|A1) = 0.5; P(S2|A2) = 1 − P(S1|A2) = 0.7.

By Bayes rule, P(A1|S1) = P(S1|A1)P(A1)/P(S1). How do we calculate P(S1)?

P(S1) = P(S1, A1) + P(S1, A2)                    [key step]
      = P(S1|A1)P(A1) + P(S1|A2)P(A2)
      = 0.5 × 0.6 + 0.3 × 0.4 = 0.42

Finally, P(A1|S1) = P(S1|A1)P(A1)/P(S1) = (0.5 × 0.6)/0.42 ≈ 0.71.
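As a quick check on the ASX/Shanghai arithmetic, a few lines of Python reproduce the law-of-total-probability step and the posterior (variable names are illustrative):

```python
# Quantities given in the example above.
p_a1 = 0.6      # P(A1): ASX up
p_a2 = 0.4      # P(A2): ASX not up
p_s1_a1 = 0.5   # P(S1 | A1): Shanghai up given ASX up
p_s1_a2 = 0.3   # P(S1 | A2): Shanghai up given ASX not up

# Law of total probability: P(S1) = P(S1|A1)P(A1) + P(S1|A2)P(A2).
p_s1 = p_s1_a1 * p_a1 + p_s1_a2 * p_a2
print(p_s1)                    # 0.42

# Bayes rule: P(A1|S1) = P(S1|A1)P(A1) / P(S1).
print(p_s1_a1 * p_a1 / p_s1)   # 0.714...
```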
Example: Iris Flower

Notation: F = Flower; SL = Sepal Length; SW = Sepal Width; PL = Petal Length; PW = Petal Width. The query is Data = (SL = Large, SW = Small, PL = Medium, PW = Small).

P(F = A | Data) = P(Data | F = A) P(F = A) / P(Data)
P(F = O | Data) = P(Data | F = O) P(F = O) / P(Data)
P(F = I | Data) = P(Data | F = I) P(F = I) / P(Data)

Choose the flower with the maximum posterior. (P(Data) is the same in all three, so it can be ignored for the comparison.)

Example: Iris Flower

So how do we compute P(Data|F=A)? This is a non-trivial question, subject to much research. The direct approach is counting: how many times does "Data" appear in the "database" when F = A?

P(Data | F = A) = #(Data, F = A) / #(F = A)

In this case "Data" is a 4-dimensional data vector and each component takes 3 values (small, medium, large), so there are 3^4 = 81 possible combinations. With far fewer rows than combinations, most counts are zero and the direct estimate breaks down.

Example: Iris Flower (Conditional Independence)

P(Data|F=A) = P(SL=Large, SW=Small, PL=Medium, PW=Small | F=A)
            ≈ P(SL=Large | F=A) P(SW=Small | F=A) P(PL=Medium | F=A) P(PW=Small | F=A)

The above is an assumption made to make the computation easier. Surprisingly, evidence suggests that it works reasonably well in practice. This prediction method, which exploits conditional independence, is called the "Naïve Bayes classifier".
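To tie the pieces together, here is a minimal Naïve Bayes sketch in Python over the four-row toy table from the "Question" slide. One addition not in the lecture: on a table this small the raw counting estimates are zero for every class, so the sketch uses add-one (Laplace) smoothing; otherwise the computation is exactly the factorized Bayes rule above.

```python
from collections import Counter

# Toy training table from the slides: (SL, SW, PL, PW) -> flower.
rows = [
    (("Small",  "Medium", "Small",  "Medium"), "A"),  # Setosa
    (("Medium", "Medium", "Large",  "Large"),  "O"),  # Versicolor
    (("Medium", "Small",  "Small",  "Large"),  "I"),  # Virginica
    (("Large",  "Large",  "Medium", "Small"),  "A"),  # Setosa
]
VALUES = 3  # each feature takes 3 values: Small, Medium, Large

class_counts = Counter(label for _, label in rows)
feature_counts = Counter()  # keys: (class, feature index, value)
for features, label in rows:
    for j, v in enumerate(features):
        feature_counts[(label, j, v)] += 1

def score(features, label):
    """Unnormalised posterior: P(label) * prod_j P(feature_j | label),
    with add-one (Laplace) smoothing to avoid zero counts."""
    n = class_counts[label]
    s = n / len(rows)  # prior estimated from class frequencies
    for j, v in enumerate(features):
        s *= (feature_counts[(label, j, v)] + 1) / (n + VALUES)
    return s

query = ("Large", "Small", "Medium", "Small")
scores = {label: score(query, label) for label in class_counts}
print(scores)                       # A: ~0.0064, O: ~0.0010, I: ~0.0020
print(max(scores, key=scores.get))  # 'A' -> prediction: Setosa
```

On this toy table the predicted flower is A (Setosa), answering the question posed at the start of the lecture.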