COMP 4125/7820 Visual Analytics & Decision Support
Lecture 3: Supplement – Data Clustering
Dr. LIU Yang
Some contents and images are from "Jure Leskovec, Anand Rajaraman, and Jeff Ullman, Mining of Massive Datasets (http://www.mmds.org/)" and "Alex Rodriguez and Alessandro Laio, Clustering by fast search and find of density peaks, Science 27 Jun 2014: 344(6191), pp. 1492–1496".

▪ Given a cloud of data points, we want to understand its structure.

The Problem of Clustering
▪ Given a set of points, with a notion of distance between points, group the points into some number of clusters, so that
– Members of a cluster are close/similar to each other
– Members of different clusters are dissimilar

What is Good Clustering?
▪ A good clustering method will produce high-quality clusters
– High intra-class similarity: cohesive within clusters (intra-cluster distances are minimized)
– Low inter-class similarity: distinctive between clusters (inter-cluster distances are maximized)
(Figure: two clusters with intra- and inter-cluster distances marked, plus an outlier.)

What is Similarity?
Similarity is hard to define, but detecting similarity is a typical task in decision making: "we know it when we see it."

Euclidean Distance
▪ The Euclidean distance takes into account both the direction and the magnitude of the vectors.
▪ The Euclidean distance between two n-dimensional vectors x = (x1, x2, …, xn) and y = (y1, y2, …, yn) is:
d_E(x, y) = sqrt((x1 − y1)^2 + (x2 − y2)^2 + … + (xn − yn)^2) = sqrt(Σ_{i=1..n} (xi − yi)^2)
▪ Example: each axis represents an experimental sample, and the coordinate on each axis is the measured expression level of a gene in that sample; plotting several genes across two experiments corresponds to n = 2 in the formula above.

Clustering – Concept
▪ Start with a collection of n objects, each represented by a D-dimensional feature vector xi, i = 1, …, n.
▪ The goal is to assign these n objects to k clusters so that objects within a cluster are more "similar" than objects between clusters.

Hierarchical Clustering
▪ Key operation: repeatedly combine the two nearest clusters.
▪ Three important questions:
– 1) How do you represent a cluster of more than one point?
– 2) How do you determine the "nearness" of clusters?
– 3) When do you stop combining clusters?
▪ (1) How to represent a cluster of many points?
– Key problem: as you merge clusters, how do you represent the "location" of each cluster, to tell which pair of clusters is closest?
– Euclidean case: each cluster has a centroid = the average of its (data) points.
▪ (2) How to determine "nearness" of clusters?
– Measure cluster distances by the distances of their centroids.

Example: Hierarchical Clustering
Data points (o) at (0,0), (1,2), (2,1), (4,1), (5,0), (5,3); centroids (x) appear at (1,1), (1.5,1.5), (4.5,0.5), (4.7,1.3) as clusters merge. The sequence of merges is recorded in a dendrogram.
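As a concrete illustration of centroid-based agglomerative clustering, here is a minimal Python sketch (not from the lecture; the six points are the ones in the example above, and the use of SciPy's hierarchy module is a tooling assumption):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# The six data points from the example above.
X = np.array([[0, 0], [1, 2], [2, 1], [4, 1], [5, 0], [5, 3]])

# Repeatedly merge the two clusters whose centroids are closest.
# Each row of Z records one merge: (cluster i, cluster j, distance, size).
Z = linkage(X, method="centroid")

# Cut the dendrogram into two flat clusters: the left group
# {(0,0),(1,2),(2,1)} and the right group {(4,1),(5,0),(5,3)}.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)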
K-Means Clustering
▪ Given K, find a partition into K clusters that optimizes the chosen partitioning criterion (cost function).
▪ The K-means algorithm is a heuristic method:
– K-means (MacQueen '67): each cluster is represented by the centre of the cluster, and the algorithm converges to stable cluster centers.
– K-means is the simplest partitioning method for clustering analysis and is widely used in data mining applications.

The K-Means Algorithm
1) Pick a number (K) of cluster centers (at random)
2) Assign every item to its nearest cluster center (e.g., using Euclidean distance)
3) Move each cluster center to the mean of its assigned items
4) Repeat steps 2–3 until convergence (change in cluster assignments less than a threshold)

Example: Assigning Clusters
(Figures: data points (o) and centroids (x) after round 1, after round 2, and at the end.)

Example 2
▪ Step 1: pick 3 initial cluster centers k1, k2, k3 (randomly).
▪ Step 2: assign each point to the closest cluster center.
▪ Step 3: move each cluster center to the mean of its cluster.
▪ Step 4: reassign the points that are now closest to a different cluster center. Q: Which points are reassigned? A: three points.
▪ Step 4b: re-compute the cluster means.
▪ Step 5: move the cluster centers to the cluster means, and repeat until stable.

Problems with K-Means
▪ Very sensitive to the initial points.
– http://shabal.in/visuals/kmeans/1.html
– Do many runs of K-means, each with different initial centroids.
– Seed the centroids using a better method than random (e.g., farthest-first sampling).
▪ Must manually choose K.

Example: Picking K
▪ Too few clusters: many long distances to the centroid.
▪ Just right: distances are rather short.
▪ Too many: little improvement in the average distance.

Getting K Right
How to select K?
▪ Try different K, looking at the change in the average distance to the centroid as K increases.
▪ The average falls rapidly until the right K, then changes little.

Visualizing K-Means
https://www.youtube.com/watch?v=zHbxbb2ye3E
https://www.naftaliharris.com/blog/visualizing-k-means-clustering/
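A minimal NumPy sketch of the four K-means steps above (illustrative only, not the lecture's code; random initialization from the data and an assignment-change stopping rule are assumptions):

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: pick K cluster centers at random (here: K distinct data points).
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = None
    for _ in range(max_iter):
        # Step 2: assign every item to its nearest center (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 4: stop when the cluster assignments no longer change.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 3: move each center to the mean of its assigned items.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

labels, centers = kmeans(np.random.rand(200, 2), k=3)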
Density Peaks
Latest work on clustering:
▪ Authors: Alex Rodriguez, Alessandro Laio
▪ Affiliation: SISSA (Scuola Internazionale Superiore di Studi Avanzati), via Bonomea 265, I-34136 Trieste, Italy
▪ Title: Clustering by fast search and find of density peaks
▪ Publication details: Science, 27 Jun 2014: Vol. 344, Issue 6191, pp. 1492–1496

Clustering – Example
▪ Visually, a cluster is a region with a high density of points,
▪ separated from other dense regions.

Inefficiency of Standard Algorithms
▪ The K-means algorithm assigns each point to the closest cluster center. Its variables are the number of centers and their locations.
▪ By construction, it is unable to recognize non-spherical clusters.

What is a Cluster?
▪ Clusters = peaks in the density of points.

Proposed Method
(Figures: for each point the method computes a local density and the distance to the nearest point of higher density; cluster centers stand out as points where both values are anomalously high. "Not even an algorithm" — the centers can be read directly off a simple decision plot.)

The Clustering Approach at Work
(Figures: the approach applied to several synthetic and real datasets.)

Possible Applications
▪ Classification of living organisms
▪ Marketing strategies
▪ Libraries (book sorting)
▪ Google search
▪ Face recognition

Application to Face Recognition
▪ Define a "distance" between faces, based on some stable features.
▪ Sampat, M.P., Wang, Z., Gupta, S., Bovik, A.C., & Markey, M.K. (2009). Complex wavelet structural similarity: A new image similarity index. IEEE Trans. Image Processing, 18(11), 2385–2401.
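The two quantities the density-peaks method builds on can be sketched as follows (my reconstruction from the cited paper, not lecture code: for each point i, a local density rho_i — the number of neighbors within a cutoff d_c — and delta_i, the distance to the nearest point of higher density; candidate cluster centers are points where both are large):

import numpy as np

def density_peaks(X, d_c):
    # Pairwise Euclidean distances.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # rho_i: number of points closer than the cutoff d_c (excluding i itself).
    rho = (d < d_c).sum(axis=1) - 1
    n = len(X)
    delta = np.zeros(n)
    for i in range(n):
        higher = np.where(rho > rho[i])[0]
        if len(higher) == 0:
            # Highest-density point: by convention, take the largest distance.
            delta[i] = d[i].max()
        else:
            delta[i] = d[i, higher].min()
    # Candidate centers: points with anomalously large rho_i AND delta_i,
    # read off a "decision graph" of delta versus rho.
    return rho, delta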
Classification
Dr. LIU Yang
Some contents are from "Jure Leskovec, Anand Rajaraman, and Jeff Ullman, Mining of Massive Datasets (http://www.mmds.org/)".
Topics: K Nearest Neighbor, Decision Tree, Support Vector Machines

Supervised Learning
We would like to do prediction: estimate a function f(x) so that y = f(x), where y can be:
▪ A real number: regression
▪ Categorical: classification (e.g., human vs. animal)
▪ A complex object: a ranking of items, a parse tree, etc.
The data is labeled:
▪ We have many pairs {(x, y)}
▪ x … a vector of binary, categorical, or real-valued features
▪ y … a class ({+1, -1}, or a real number)
Estimate y = f(x) on the training data X, Y, and hope that the same f(x) also works on unseen test data X', Y'.

Eager vs. Lazy Learning
Eager learning
▪ Given a set of training data samples, it constructs a generalization model before receiving new data samples to classify.
▪ Examples: decision trees, rule-based classification, classification by back-propagation, Support Vector Machines (SVM), associative classification.
Lazy learning
▪ Simply stores the training data samples and waits until it is given a test data sample.
▪ Less time in training but more time in predicting.

K Nearest Neighbor (KNN)
A typical lazy learning method:
▪ An algorithm that stores all available cases and classifies new cases based on a similarity measure.
▪ Data samples/instances are represented as points in a Euclidean space; classification is done by comparing the feature vectors of the different points.
Also called:
▪ Memory-Based Reasoning
▪ Example-Based Reasoning
▪ Instance-Based Learning
▪ Case-Based Reasoning
"If it walks like a duck and quacks like a duck, then it's probably a duck": compute the distance from the test data sample to the training samples, and choose the "nearest" training sample.

Nearest Neighbor Example
▪ Consider a two-class problem where each sample consists of two measurements, so it is represented as a two-dimensional vector in Euclidean space.
▪ Which class does the green triangle (the query point) belong to?
▪ 1-NN: for a given query point, assign the class of its nearest neighbor. Here it belongs to Class 1.
▪ For a different query point, the answer can change.
To make nearest neighbor work we need two important things:
▪ A distance metric (e.g., Euclidean)
▪ How many neighbors to look at (one, for 1-NN)

K-Nearest Neighbors
▪ Distance metric: Euclidean
▪ How many neighbors to look at: k
▪ Weighting function (optional): unused
▪ How to fit with the local points: just predict the average output among the k nearest neighbors
▪ Compute the k nearest neighbours and assign the class by majority vote. In the example, the query point belongs to Class 2 when k = 5.

KNN requires three things:
▪ The set of training data samples
▪ A distance metric to compute the distance between samples
▪ The value of k, the number of nearest neighbors to retrieve
To classify an unknown data sample:
▪ Compute its distance to the training samples
▪ Identify the k nearest neighbors
▪ Use the class labels of the nearest neighbors to determine the class label of the unknown sample (e.g., by taking a majority vote)

KNN Example
Given the training data in the table (a customer dataset with attributes age, income, student, credit_rating and class buys_computer; the table itself is a figure on the slide), predict the class of the following new example using k-nearest neighbor with k = 5: {age <= 30, income = medium, student = yes, credit_rating = fair}.
Among the 5 nearest neighbors, 4 are from class Yes and 1 is from class No. Hence, the KNN classifier predicts buys_computer = yes for the new example.
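A minimal KNN classifier sketch (illustrative, not the lecture's code; Euclidean distance and majority vote as described above, with numeric feature vectors assumed):

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=5):
    # Compute the Euclidean distance from the query to every training sample.
    dists = np.linalg.norm(X_train - x_query, axis=1)
    # Identify the k nearest neighbors.
    nearest = np.argsort(dists)[:k]
    # Majority vote over their class labels.
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

Categorical attributes such as income or credit_rating would first be encoded numerically (e.g., one-hot), and ties in the vote are broken arbitrarily here.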
Practical Issues for KNN
What distance measure to use?
▪ Often Euclidean distance is used
▪ Locally adaptive metrics
▪ More complicated with non-numeric data, or when different dimensions have different scales
Choice of k?
▪ Cross-validation
▪ 1-NN often performs well in practice
▪ k-NN is needed for overlapping classes
▪ Re-label all data according to k-NN, then classify with 1-NN

A Limitation of Euclidean Distance
▪ Any two bases ei, ej are assumed to be uncorrelated with each other.
▪ It ignores the relationships among the different coordinates of high-order data.
▪ In the illustrated example, distE(a,b) = 54 > 49 = distE(a,c), which is unreasonable!
Decision Tree
Tree induction: apply an induction algorithm to a training set to learn a model (a decision tree); then apply the model to a test set to deduce the class of unlabeled records.

Training set:
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test set (class unknown):
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?

Example: Model as a Decision Tree
Training data (Tid, Refund, Marital Status, Taxable Income, Cheat):
1   Yes  Single    125K  No
2   No   Married   100K  No
3   No   Single    70K   No
4   Yes  Married   120K  No
5   No   Divorced  95K   Yes
6   No   Married   60K   No
7   Yes  Divorced  220K  No
8   No   Single    85K   Yes
9   No   Married   75K   No
10  No   Single    90K   Yes

Model (splitting attributes): Refund? Yes → NO; No → MarSt? Married → NO; Single or Divorced → TaxInc < 80K → NO, > 80K → YES.
There could be more than one tree that fits the same data! An alternative: MarSt? Married → NO; Single or Divorced → Refund? Yes → NO; No → TaxInc < 80K → NO, > 80K → YES.

Applying the Model
Start from the root of the tree. Test data: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?
▪ Refund is No, so follow the No branch to MarSt.
▪ MarSt is Married, so follow the Married branch.
▪ That branch ends in the leaf NO, so assign Cheat = "No".
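The fitted tree above can be written directly as nested conditionals; a small sketch (illustrative only — the attribute names and the 80K threshold are the ones from the example):

def predict_cheat(refund, marital_status, taxable_income):
    # Root split: Refund
    if refund == "Yes":
        return "No"
    # Next split: Marital Status
    if marital_status == "Married":
        return "No"
    # Single or Divorced: split on Taxable Income at 80K
    return "Yes" if taxable_income > 80 else "No"

print(predict_cheat("No", "Married", 80))  # -> "No", matching the walkthrough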
Building a Decision Tree: Sailing Example
Training data:
#   Outlook  Company  Sailboat  Sail?
1   sunny    big      small     yes
2   sunny    med      small     yes
3   sunny    med      big       yes
4   sunny    no       small     yes
5   sunny    big      big       yes
6   rainy    no       small     no
7   rainy    med      small     yes
8   rainy    big      big       yes
9   rainy    no       big       no
10  rainy    med      big       no

Two unseen examples to classify:
#  Outlook  Company  Sailboat  Sail?
1  sunny    no       big       ?
2  rainy    big      small     ?

Growing the tree attribute by attribute:
▪ Split on outlook first. All sunny examples are "yes", so the sunny branch becomes a leaf labeled yes.
▪ On the rainy branch, split on company: company = no → leaf no; company = big → leaf yes; company = med needs a further split.
▪ For rainy and company = med, split on sailboat: small → leaf yes; big → leaf no.
Classifying the unseen examples with the finished tree:
▪ (sunny, no, big): outlook = sunny → Sail? = Yes.
▪ (rainy, big, small): outlook = rainy, company = big → Sail? = Yes.
Another Example: Weather Data
#   Outlook   Temperature  Humidity  Windy  Play
1   sunny     hot          high      no     N
2   sunny     hot          high      yes    N
3   overcast  hot          high      no     P
4   rainy     moderate     high      no     P
5   rainy     cold         normal    no     P
6   rainy     cold         normal    yes    N
7   overcast  cold         normal    yes    P
8   sunny     moderate     high      no     N
9   sunny     cold         normal    no     P
10  rainy     moderate     normal    no     P
11  sunny     moderate     normal    yes    P
12  overcast  moderate     high      yes    P
13  overcast  hot          normal    no     P
14  rainy     moderate     high      yes    N

A compact tree that fits this data: Outlook? sunny → Humidity (high → N, normal → P); overcast → P; rainy → Windy (yes → N, no → P).
(Figure: an alternative, much larger tree with Temperature at the root also fits the same data; the compact tree is preferred.)

Choosing the Splitting Attribute
Main principle
▪ Select the attribute that partitions the learning set into subsets that are as "pure" as possible.
Various measures of purity
▪ Information-theoretic
▪ Gini index
▪ Chi-square (X^2)
▪ ReliefF
▪ ...
Various improvements
▪ Probability estimates
▪ Normalization
▪ Binarization, subsetting

Information Gain
To classify an object, a certain amount of information is needed:
▪ E, the entropy.
After we have learned the value of attribute A, we only need some remaining amount of information to classify the object:
▪ Ires, the residual information.
Gain(S, A): the expected reduction in entropy due to sorting on A:
▪ Gain(A) = Entropy(S) − Ires(A)
The most "informative" attribute is the one that minimizes Ires, i.e., maximizes Gain.

Calculation of entropy
▪ Entropy(S) = Σ_{i=1..I} −|Si|/|S| · log2(|Si|/|S|)
▪ S = the set of examples
▪ Si = the subset of S with value vi under the target attribute
▪ I = the size of the range of the target attribute
(Figure: for a two-class problem, entropy as a function of |Si|/|S|, peaking at 1 bit when the classes are balanced.)
After applying attribute A, S is partitioned into subsets according to the values v of A. Ires is the weighted sum of the entropies of the subsets:
▪ Ires(A) = Σ_{i=1..k} |Si|/|S| · Entropy(Si), where k is the size of the range of the attribute being tested.

Worked Example: Shapes
Data set — a set of classified objects (5 triangles, 9 squares):
#   Color   Outline  Dot  Shape
1   green   dashed   no   triangle
2   green   dashed   yes  triangle
3   yellow  dashed   no   square
4   red     dashed   no   square
5   red     solid    no   square
6   red     solid    yes  triangle
7   green   solid    no   square
8   green   dashed   no   triangle
9   yellow  solid    yes  square
10  red     solid    no   square
11  green   solid    yes  square
12  yellow  dashed   yes  square
13  yellow  solid    no   square
14  red     dashed   yes  triangle

▪ Class probabilities: p(triangle) = 5/14, p(square) = 9/14
▪ Entropy(S) = −(5/14)·log2(5/14) − (9/14)·log2(9/14) ≈ 0.940 bits

Comparing the attributes at the root:
▪ Gain(Color) = 0.246
▪ Gain(Outline) = 0.151
▪ Gain(Dot) = 0.048
Heuristic: the attribute with the highest gain is chosen, so the root split is on Color. This heuristic is local (local minimization of impurity). Splitting on Color leaves the yellow branch pure (all squares), while the red and green branches still mix both classes; a sketch of the computation follows below.
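A small sketch of the entropy and gain computations above (illustrative; the data literals mirror the shapes table):

import math
from collections import Counter

def entropy(labels):
    # Entropy(S) = sum over classes of -p_i * log2(p_i)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain(rows, labels, attr_index):
    # Gain(A) = Entropy(S) - Ires(A), Ires = weighted subset entropies.
    n = len(labels)
    i_res = 0.0
    for v in set(r[attr_index] for r in rows):
        subset = [y for r, y in zip(rows, labels) if r[attr_index] == v]
        i_res += len(subset) / n * entropy(subset)
    return entropy(labels) - i_res

# (Color, Outline, Dot) -> Shape, from the table above.
rows = [("green","dashed","no"), ("green","dashed","yes"), ("yellow","dashed","no"),
        ("red","dashed","no"), ("red","solid","no"), ("red","solid","yes"),
        ("green","solid","no"), ("green","dashed","no"), ("yellow","solid","yes"),
        ("red","solid","no"), ("green","solid","yes"), ("yellow","dashed","yes"),
        ("yellow","solid","no"), ("red","dashed","yes")]
labels = ["triangle","triangle","square","square","square","triangle","square",
          "triangle","square","square","square","square","square","triangle"]

for i, name in enumerate(["Color", "Outline", "Dot"]):
    print(name, round(gain(rows, labels, i), 3))
# -> Color 0.247, Outline 0.152, Dot 0.048
#    (the slides round these to 0.246, 0.151, 0.048)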
Splitting the Remaining Branches
▪ Green branch (3 triangles, 2 squares; entropy 0.971 bits): Gain(Outline) = 0.971 − 0 = 0.971 bits; Gain(Dot) = 0.971 − 0.951 = 0.020 bits. Split on Outline.
▪ Red branch (2 triangles, 3 squares; entropy 0.971 bits): Gain(Outline) = 0.971 − 0.951 = 0.020 bits; Gain(Dot) = 0.971 − 0 = 0.971 bits. Split on Dot.

The final tree:
▪ Color = red → Dot? yes → triangle; no → square
▪ Color = yellow → square
▪ Color = green → Outline? dashed → triangle; solid → square

Support Vector Machines
We are given a set of n points (vectors) x1, x2, …, xn, such that xi is a vector of length m, and each belongs to one of two classes, labeled "+1" and "-1".
So our training set is: (x1, y1), (x2, y2), …, (xn, yn), with xi in R^m and yi in {+1, −1}.
We want to find a separating hyperplane w·x + b = 0 that separates these points into the two classes: "the positives" (class "+1") and "the negatives" (class "-1"), assuming they are linearly separable. The decision function is then f(x) = sign(w·x + b).

Which Hyperplane to Choose?
▪ There are many possible separating hyperplanes — it could be this one, or this, or this, or maybe…!
▪ Suppose we choose a hyperplane that is close to some sample xi. Now suppose a new point x' that should be in class "-1" is close to xi. Using our classification function f(x), this point is misclassified: poor generalization (poor performance on unseen data).
▪ The hyperplane should be as far as possible from any sample point; this way, new data that is close to the old samples will be classified correctly: good generalization.

Maximizing the Margin
▪ The SVM idea is to maximize the distance between the hyperplane and the closest sample point. For the optimal hyperplane, the distance to the closest negative point equals the distance to the closest positive point.
▪ SVM's goal is to maximize the margin, which is twice the distance d between the separating hyperplane and the closest sample.
▪ Why is this best? It is robust to outliers, and thus has strong generalization ability; it has proved to have better performance on test data, both in practice and in theory.

Support Vectors
▪ Support vectors are the samples closest to the separating hyperplane (hence the name). We will see later that the optimal hyperplane is completely defined by the support vectors.

The Optimization Problem
Our optimization problem so far:
minimize (1/2)·||w||^2, s.t. yi(w^T xi + b) >= 1
We solve this problem by introducing Lagrange multipliers alpha_i associated with the constraints:
minimize L_p(w, b, alpha) = (1/2)·||w||^2 − Σ_{i=1..n} alpha_i · (yi(xi·w + b) − 1), s.t. alpha_i >= 0

Soft Margin
Given a guess of w, b we can compute the sum of the distances of points to their correct zones, and compute the margin width M = 2/sqrt(w·w). Assume R data points, each (xk, yk) with yk = ±1, and slack variables eps_k for points on the wrong side.
The quadratic optimization criterion: minimize (1/2)·w·w + C·Σ_{k=1..R} eps_k
There are R constraints:
▪ w·xk + b >= 1 − eps_k if yk = +1
▪ w·xk + b <= −1 + eps_k if yk = −1
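A minimal soft-margin linear SVM sketch (illustrative; scikit-learn and the toy data are assumptions on my part — C is the penalty on the slack variables above):

import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data: two classes in R^2.
X = np.array([[1, 1], [2, 1], [1, 2], [5, 5], [6, 5], [5, 6]])
y = np.array([-1, -1, -1, 1, 1, 1])

# Soft-margin linear SVM: minimize (1/2)||w||^2 + C * sum(slack).
clf = SVC(kernel="linear", C=1.0).fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
print("w =", w, "b =", b)
print("support vectors:", clf.support_vectors_)  # the points defining the plane
print(clf.predict([[2, 2], [5, 4]]))             # f(x) = sign(w.x + b)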
Examples of Kernel Functions
▪ Linear: K(xi, xj) = xi^T xj
▪ Polynomial of power p: K(xi, xj) = (1 + xi^T xj)^p
▪ Gaussian (radial-basis function network): K(xi, xj) = exp(−||xi − xj||^2 / (2·sigma^2))
▪ Sigmoid: K(xi, xj) = tanh(beta0·xi^T xj + beta1)

Social Network Analysis
Dr. LIU Yang
Jure Leskovec, Anand Rajaraman, and Jeff Ullman, Mining of Massive Datasets (http://www.mmds.org/)

When we think of a social network, we think of Facebook, Twitter, Google+, … The essential characteristics of a social network are:
▪ There is a collection of entities that participate in the network. Typically these entities are people, but they could be something else.
▪ There is at least one relationship between entities of the network. On Facebook, this relationship is called friends. Sometimes the relationship is all-or-nothing: two people are either friends or they are not. However, in other examples of social networks, the relationship could also be discrete or represented by a real number.
▪ There is an assumption of non-randomness or locality. This condition is the hardest to formalize, but the intuition is that relationships tend to cluster: if entity A is related to both B and C, then there is a higher probability than average that B and C are related.

We often think of networks as being organized into modules, clusters, communities — for example, discovering social circles, circles of trust [McAuley, Leskovec: Discovering social circles in ego networks, 2012].

How to find communities? We will work with undirected (unweighted) networks.
▪ Example: the entities are the nodes A through G; the relationship, which we might think of as "friends", is represented by the edges. For instance, B is friends with A, C, and D.

Edge Betweenness
▪ Edge betweenness: the number of shortest paths passing over the edge.
(Figure: a network of two dense groups joined by bridging edges, with betweenness values b = 16 and b = 7.5 marked.)
▪ An important aspect of social networks is that they contain communities of entities that are connected by many edges.
▪ We are finding the edges that are least likely to be inside a community.
▪ As in golf, a high score is bad: it suggests that the edge (a, b) runs between two different communities, i.e., a and b do not belong to the same community.
▪ In the A–G example, the edge (B, D) has the highest betweenness, as should surprise no one. In fact, this edge is on every shortest path from any of A, B, C to any of D, E, F, G, so its betweenness is 3 × 4 = 12. In contrast, the edge (D, F) is on only four shortest paths: those from A, B, C, and D to F.

Computing Betweenness
We want to compute the betweenness of paths starting at node A:
▪ Run breadth-first search starting from A; nodes fall into levels 0, 1, 2, 3, 4 by distance from A.
▪ Count the number of shortest paths from A to all other nodes of the network.
▪ Compute betweenness by working up the tree; if there are multiple shortest paths, count them fractionally.
The algorithm:
▪ Add edge flows: node flow = 1 + Σ(child edges); split the flow up based on the parent values.
▪ Repeat the BFS procedure for each starting node U.
(Figure: e.g., 1+1 paths to H, split evenly; 1+0.5 paths to J, split 1:2; 1 path to K, split evenly.)

Worked Example: Starting from E
Using E as the start node to calculate the betweenness:
▪ Label the root E with 1. At level 1 are the nodes D and F; each has only E as a parent, so they too are labeled 1. Nodes B and G are at level 2.
▪ B has only D as a parent, so B's label is the same as the label of D, which is 1. However, G has parents D and F, so its label is the sum of their labels, or 2. Finally, at level 3, A and C each have only parent B, so their labels are the label of B, which is 1.
▪ A and C, being leaves, get credit 1. Each of these nodes has only one parent, so their credit is given to the edges (B,A) and (B,C), respectively.
▪ At level 2, G is a leaf, so it gets credit 1. B is not a leaf, so it gets credit equal to 1 plus the credits on the DAG edges entering it from below. Since both these edges have credit 1, the credit of B is 3. Intuitively, 3 represents the fact that all shortest paths from E to A, B, and C go through B.
▪ B has only one parent, D, so the edge (D,B) gets the entire credit of B, which is 3.
▪ However, G has two parents, D and F. We therefore need to divide the credit of 1 that G has between the edges (D,G) and (F,G). Both D and F have label 1, representing the fact that there is one shortest path from E to each of these nodes. Thus, we give half the credit of G to each of these edges; i.e., their credit is each 1/(1 + 1) = 0.5.
▪ Now we can assign credits to the nodes at level 1. D gets 1 plus the credits of the edges entering it from below, which are 3 and 0.5; that is, the credit of D is 4.5. The credit of F is 1 plus the credit of the edge (F,G), or 1.5. Finally, the edges (E,D) and (E,F) receive the credit of D and F, respectively, since each of these nodes has only one parent.
▪ The credit on each edge is the contribution to the betweenness of that edge due to shortest paths from E; for example, this contribution for the edge (E,D) is 4.5. To complete the betweenness calculation, we repeat this calculation with every node as the root and sum the contributions. Finally, we divide by 2 to get the true betweenness, since every shortest path is discovered twice, once for each of its endpoints.
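For reference, a minimal sketch of the same computation with NetworkX (a tooling assumption; the edge list is my reconstruction of the A–G example from the description above):

import networkx as nx

# The A-G example network, reconstructed from the text.
G = nx.Graph([("A","B"), ("A","C"), ("B","C"), ("B","D"),
              ("D","E"), ("D","F"), ("D","G"), ("E","F"), ("F","G")])

# Edge betweenness: number of shortest paths over each edge,
# counting multiple shortest paths fractionally.
bet = nx.edge_betweenness_centrality(G, normalized=False)
print(bet[("B", "D")])  # 12.0 -- the bridge between the two communities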
COMP 4125/7820 Visual Analytics & Decision Support
Lecture 7: Network Visualization and Analytics
Dr. MA Jing

Outline
Network Data
Network Visualization
– Node-Link Diagrams
– Adjacency Matrix
– Enclosure
Network Analytics
– Network Centrality

Network Data
Dataset: Networks
The dataset type of networks is well suited for specifying that there is some kind of relationship between two or more items.
– An item in a network is often called a node.
– A link is a relation between two items.
Examples:
▪ Online social networks (e.g., Facebook, LinkedIn): nodes are people; links are friendships.
▪ Gene interaction networks: nodes are genes; links connect genes that have been observed to interact with each other.

Dataset: Networks – Trees
Networks with hierarchical structure are more specifically called trees. In contrast to a general network, trees do not have cycles: each child node has only one parent node pointing to it.

Network (Graph) Data Definition
A network G(V, E) contains a set of vertices (nodes) V together with a set of edges (links) E; networks are also called graphs.
For example:
– V = {A, B, C, D}
– E = {{A, B}, {A, C}, {B, C}, {C, D}}
How to draw a network? The same graph can be drawn in many different ways.
(Figure: two different drawings of the same four-node graph.)

Network Visualization: Node-Link Diagrams
The most common visual encoding idiom:
– Nodes: point marks
– Links: line marks
▪ Triangular vertical node-link layout: vertical spatial position shows depth in the tree.
▪ Spline radial layout: distance to the center encodes depth in the tree.
A network (also called a graph) is very commonly represented as a node-link diagram. The number of hops within a path is used to measure distances. Node-link diagrams are well suited to tasks of analyzing the topology:
– find all possible paths between two nodes
– find shortest paths between two nodes
– find the neighbors of a target node
– find nodes that act as bridges between two groups of nodes

Simple Fixed Graph Layouts
Circular layout, grid layout, random layout.

Force-Directed Placement for Node-Link Diagrams
One of the most widely used idioms for node-link network diagrams. Network elements are positioned according to a simulation of physical forces:
– Nodes push away from each other.
– Links act like springs that draw their endpoint nodes closer to each other.
Nodes are placed randomly in the beginning and their positions are iteratively refined: the pushing and pulling of the simulated spring forces gradually improve the layout.
https://bl.ocks.org/mbostock/4062045

Example of a Force-Directed Layout
This graph shows character co-occurrence in Les Misérables, based on character co-appearance in Victor Hugo's novel.
▪ Nodes represent different characters; edges represent the co-appearance relationship.
▪ Line width encodes an edge attribute (larger line width indicates that characters co-occurred more frequently).
▪ Node color encodes the group information.

More on Force-Directed Placement
▪ Spatial position does not directly encode any attribute of either nodes or links; the placement algorithm uses it indirectly.
▪ Spatial proximity does indicate grouping through a strong perceptual cue, but it is sometimes arbitrary: nodes may be near each other because they are repelled from elsewhere, not because they are closely connected.
▪ Layouts are often non-deterministic because of the randomly chosen initial positions:
– The layout will look different each time it is computed.
– Spatial memory cannot be exploited across different runs.
– Different runs lead to different proximity relationships.

Major Weakness of Force-Directed Placement
A major weakness of force-directed placement is scalability:
– the visual complexity of the layout, and
– the time required to compute it.
Force-directed approaches yield readable layouts quickly for tiny graphs with dozens of nodes. However, the layout quickly degenerates into a hairball of visual clutter:
– Tasks of path following become very difficult even on a few hundred nodes.
– They are essentially impossible with thousands of nodes or more.

Multi-Level Networks: Scaling to Big Networks
The original network is augmented with a derived cluster hierarchy to form a compound network. The cluster hierarchy is computed by coarsening the original network into successively simpler networks that nevertheless attempt to capture the most essential aspects of the original's structure:
– Lay out the simplest version of the network first.
– Improve the layout with the more and more complex versions.
With multilevel scalable force-directed placement, significant cluster structure is visible for large networks (fewer than about 10k nodes). With huge graphs, however, the layout still becomes a "hairball" without much visible structure.
Network Visualization: Adjacency Matrix
Adjacency Matrix of a Network
V = {A, B, C, D}, E = {{A, B}, {A, C}, {B, C}, {C, D}}

    A  B  C  D
A   0  1  1  0
B   1  0  1  0
C   1  1  0  1
D   0  0  1  0

Adjacency Matrix View for a Network
Network data can also be encoded with a matrix view by deriving a table from the original network data. Nodes are labeled in the rows and columns of the matrix, and each cell indicates the link between two nodes.
(Figure: a five-node node-link diagram and its adjacency matrix view.)
Matrix views of networks can achieve very high information density, up to a limit of about one thousand nodes and one million edges.

Example of an Adjacency Matrix View
https://bost.ocks.org/mike/miserables/
▪ Matrix view of character co-occurrence in Les Misérables.
▪ Characters are labeled in the rows and columns of the matrix.
▪ Each colored cell represents two characters that appeared in the same chapter; darker cells indicate characters that co-occurred more frequently.

Node-Link vs. Matrix View
Strengths of node-link layouts
– Extremely intuitive for small networks.
– They particularly shine for tasks that rely on understanding the topological structure of the network: path tracing, searching for neighbors.
– Also very effective for tasks such as getting a general overview or finding similar substructures.
Weaknesses of node-link layouts
– Scalability: they cannot handle large networks.
– Occlusion from edges crossing each other and crossing underneath nodes.
Strengths of matrix views
– Perceptual scalability for both large and dense networks.
– Predictability, stability, and support for reordering:
  Predictability: the view is laid out within a predictable amount of space.
  Stability: adding a new item causes only a small visual change.
– Quickly estimating the number of nodes, and fast node lookup.
Weaknesses of matrix views
– Unfamiliarity: training is needed to interpret a matrix view (e.g., to read off cliques, bicliques, or degree).
– Lack of support for investigating topological structure.
Definitions:
▪ Clique: a set of nodes that is completely interconnected.
▪ Biclique: every node in the first set is connected to every node of the second set.
Tasks of analyzing the topology — e.g., finding all possible paths from node 1 to node 5, or the shortest path from node 1 to node 3 — are much easier in the node-link view.
An empirical study compared node-link and matrix views:
– Node-link views are best for small networks, and matrix views are best for large networks.
– Several tasks became more difficult for node-link views as size increased, but stayed manageable for matrix views:
  estimating the number of nodes and edges;
  finding the most connected node;
  finding a node given its label;
  finding a direct link between two nodes;
  finding a common neighbor of two nodes.
– Finding a path between two nodes is difficult in a matrix view; topological structure tasks such as path tracing are best supported by node-link views.
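A tiny sketch of deriving the adjacency matrix above from the edge list (illustrative):

import numpy as np

V = ["A", "B", "C", "D"]
E = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")]

idx = {v: i for i, v in enumerate(V)}
M = np.zeros((len(V), len(V)), dtype=int)
for u, v in E:
    # Undirected edge: mark both (u, v) and (v, u).
    M[idx[u], idx[v]] = M[idx[v], idx[u]] = 1
print(M)
# [[0 1 1 0]
#  [1 0 1 0]
#  [1 1 0 1]
#  [0 0 1 0]]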
Outline (recap)
Network Data
Node-Link Diagrams: nodes as point marks, links as line marks
Adjacency Matrix: directly shows adjacency relationships
Enclosure: shows hierarchical relationships through nesting

Containment: Hierarchy Marks
Very effective at showing complete information about hierarchical structure.
– Connection marks in node-link diagrams only show pairwise relationships.
Example: treemaps (an alternative to node-link trees)
– Hierarchical relationships are shown with containment rather than connection marks.
– All of the children of a tree node are enclosed within the area allocated to that node, creating a nested layout.

Treemaps
(Figure: a treemap view of a 5161-node computer file system; node size encodes file size.)
▪ Containment marks are not as effective as pairwise connection marks for tasks on topological structure.
▪ Good for tasks that pertain to understanding attribute values at the leaves of the tree.
▪ Very effective for spotting outliers with very large attribute values.

Treemap: A Simple Example
▪ Set parent node values to the sum of their child node values, from the bottom up.
▪ Partition the space based on the current node's value as a portion of its parent node's value, from the top down.
▪ In the example, the root's two children (values 5 and 2) receive 5/7 and 2/7 of the area; the value-2 child splits 1/2 and 1/2, the value-5 child splits 4/5 and 1/5, giving leaves with values 1, 4, 1, 1.

GrouseFlocks
Compound networks: a combination of a network and a tree.
– The nodes of the network are the leaves of the tree.
– The interior nodes of the tree encompass multiple network nodes.
GrouseFlocks: a combined view using containment marks for the associated hierarchy and connection marks for the original network links.

Summary of Network Visualization
Node-link diagrams
– Extremely intuitive for small networks.
– They particularly shine for tasks that rely on understanding the topological structure of the network.
– Cannot handle large networks.
Matrix views
– Can handle large networks effectively.
– Predictability, stability, and support for reordering.
– Quickly estimating the number of nodes, and fast node lookup.
– Training is needed to interpret a matrix view.
– Lack of support for investigating topological structure.
Containment
– Focuses on showing hierarchical structure.

Network Centrality
Node importance: based on the structure of the network, which are the most important nodes?
Why Network Centrality?
Centrality = "importance"
– Which nodes are important based on their network position?
– There are different ways of thinking about "importance".
Rank all nodes in a network (graph):
– Find celebrities or influential people in a social network (Twitter).
– Find "gatekeepers" who connect communities (headhunters love to find them on LinkedIn).
– Find important pages on the web (Google search).
Centrality algorithms are contained in most network (graph) analysis libraries, and are used to help graph analysis, visualization, etc.

Centrality Algorithms
▪ Degree: important nodes have many connections.
▪ Betweenness: important nodes connect other nodes.
▪ Closeness: important nodes are close to other nodes.
▪ PageRank: important nodes are those with many in-links from important nodes.

Degree Centrality (easiest)
Assumption: important nodes have many connections.
▪ Degree = number of neighbors.
▪ Directed graph: in-degree = number of incoming edges; out-degree = number of outgoing edges; total degree = in-degree + out-degree.
▪ In an undirected graph, only degree is defined.
(Figure: X has higher centrality than Y according to the total-degree centrality measure.)
When is degree centrality good? E.g., in a friendship network, he or she who has many friends is most important: people who will do favors for you, people you can talk to or have coffee with.
When is degree not good? It does not capture the ability to bridge between groups, or the likelihood that information originating anywhere in the network reaches you.

Betweenness Centrality
Assumption: important nodes connect other nodes.
Betweenness definition:
betweenness(v) = Σ over pairs (s, t) of (number of shortest paths between s and t that go through v) / (number of shortest paths between s and t)
It quantifies how often a node acts as a bridge that connects two other nodes.
(Figure: betweenness values on a toy seven-node network A–G.)
Complexity: computing the betweenness centrality of all nodes can be very computationally expensive. Depending on the algorithm, this computation can take up to O(N^3) time, where N is the number of nodes in the graph. Approximation algorithms are used for big graphs.

Closeness Centrality
What if it's not so important to have many direct friends, or to be "between" others, but one still wants to be in the "middle" of things, not too far from the center?
Assumption: important nodes are close to other nodes.
Closeness is based on the length of the average shortest path between a vertex and all vertices in the graph:
closeness(v) = 1 / Σ_u d(v, u), where d(v, u) is the shortest distance between vertices v and u.
Example (a five-node chain A–B–C–D–E): the closeness values are A: 0.1, B: 0.143, C: 0.167, D: 0.143, E: 0.1 — the middle node C is the most central.
(Figure: closeness values from 1/10 to 1/16 on a toy seven-node network.)
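A short sketch computing these centralities on the A–E chain example (NetworkX is a tooling assumption; note that its closeness_centrality normalizes by n−1, so the raw 1/Σd form used above is computed by hand):

import networkx as nx

G = nx.path_graph(["A", "B", "C", "D", "E"])

print(nx.degree_centrality(G))        # degree-based importance (normalized)
print(nx.betweenness_centrality(G))   # bridge-based importance

# Raw closeness = 1 / (sum of shortest distances to all other nodes),
# matching the 0.1 / 0.143 / 0.167 values in the example.
for v in G:
    dist = nx.shortest_path_length(G, source=v)
    print(v, round(1 / sum(dist.values()), 3))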
PageRank
Developed by the Google founders (Larry Page and Sergey Brin) to measure the importance of webpages from the hyperlink network structure.
▪ PageRank assigns a score of importance to each node. Important nodes are those with many in-links from important pages.
▪ PageRank can be used for any type of network, but it is mainly useful for directed networks.
▪ A node's PageRank depends on the PageRank of other nodes (a recursive definition).
Brin, Sergey and Lawrence Page (1998). The Anatomy of a Large-Scale Hypertextual Web Search Engine. 7th Intl World Wide Web Conf.

(Simplified) PageRank
Problem: given a directed graph, find its most important nodes.
Let n = the number of nodes in the network and k = the number of steps.
1. Assign all nodes a PageRank of 1/n.
2. Perform the (simplified) PageRank update rule k times.
(Simplified) PageRank update rule:
– Each node gives an equal share of its current PageRank to all the nodes it links to.
– The new PageRank of each node is the sum of all the PageRank it received from other nodes.
(Figures: the update rule applied step by step on an example network for k = 1, 2, ….) What if we continue with k = 4, 5, 6, …? For most networks, the PageRank values converge as k gets larger.

Matrix Model for (Simplified) PageRank Computation
▪ The adjacency matrix A (Aij = 1 if node j points to i).
▪ Transition matrix: normalize the adjacency matrix so that it becomes a column-stochastic matrix P (each column sums to 1). Pij is the probability of arriving at page i from page j.
(Simplified) PageRank algorithm:
▪ Assign all nodes a PageRank of x = [1/n, 1/n, …, 1/n].
▪ For k = 1, 2, …: x_{k+1} = P·x_k

Problem in Simplified PageRank
In the example, for a large enough k, F and G each have a PageRank of 1/2 and all other nodes have PageRank 0. Why?
– In each iteration, whenever PageRank values are received by node F or G, they get "stuck" on F and G.
How do we solve it?

Full PageRank Algorithm
To fix this, we introduce a "damping parameter" α. In each iteration:
– With probability α: choose an outgoing edge at random and follow it to the next node.
– With probability 1 − α: choose a node at random and go to it.
Why does this work?
– It makes the matrix irreducible.
– From any node, there is a non-zero probability of reaching any other node.
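A minimal power-iteration sketch covering both variants (illustrative; the five-node edge list here is made up, not the slides' example graph — setting alpha = 1 recovers the simplified rule x_{k+1} = P·x_k):

import numpy as np

# Hypothetical directed edges (u -> v); NOT the slides' example network.
edges = [(0, 1), (0, 2), (1, 2), (2, 0), (3, 2), (4, 0), (4, 3)]
n, alpha = 5, 0.85

# Column-stochastic transition matrix: P[i, j] = 1/outdeg(j) if j links to i.
P = np.zeros((n, n))
for u, v in edges:
    P[v, u] = 1.0
P /= P.sum(axis=0)  # assumes every node has at least one out-link

x = np.full(n, 1 / n)   # initialize all PageRanks to 1/n
v = np.full(n, 1 / n)   # teleport vector
for _ in range(50):
    x = alpha * (P @ x) + (1 - alpha) * v   # damped update
print(x.round(3))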
The Full PageRank Algorithm
Initialize x0 = [1/n, 1/n, …, 1/n].
For k = 1, 2, …: x_{k+1} = (αP + (1 − α)E/n)·x_k, where E is an n × n matrix of all 1s.
Note that E = e·e^T, where e is a column vector of 1s, and e^T·x_k = 1, so:
x_{k+1} = (αP + (1 − α)e·e^T/n)·x_k = αP·x_k + (1 − α)e·e^T·x_k/n = αP·x_k + (1 − α)v, with v = [1/n, 1/n, …, 1/n].
The final full PageRank algorithm:
▪ Initialize x0 = [1/n, 1/n, …, 1/n].
▪ For k = 1, 2, …: x_{k+1} = αP·x_k + (1 − α)v.
▪ α is often set to 0.85.
Example step (n = 5, α = 0.85): the first entry of x1 is 0.85·(1/3·1/5 + 1·1/5) + 0.15·1/5 ≈ 0.26, and x1 ≈ [0.26, 0.37, 0.17, 0.12, 0.09]; x2 = 0.85·P·x1 + 0.15·v is computed the same way.

Summary of Centrality Algorithms
▪ Degree: important nodes have many connections.
▪ Betweenness: important nodes connect other nodes.
▪ Closeness: important nodes are close to other nodes.
▪ PageRank: important nodes are those with many in-links from important nodes.
The best centrality measure depends on the context of the network.

Case Study: Golden State Warriors' Passing Network
Analyzing the Golden State Warriors' passing network using GraphFrames in Spark:
http://opiateforthemass.es/articles/analyzing-golden-state-warriors-passing-network-using-graphframes-in-spark/
▪ In-degree: Stephen Curry received the most passes.
▪ Out-degree: Draymond Green provides the most passes.
▪ PageRank can be used to compute the importance of the nodes (i.e., players): Curry, Green and Thompson are the top 3 based on the network data.

COMP 4125/7820 Visual Analytics & Decision Support
Lecture 12-1: Evaluate and Visualize Classification Performance
Dr. MA Jing

Classification Evaluation
▪ Why evaluation?
▪ Methods for estimating a classifier's performance
▪ Evaluation metrics
– Accuracy
– Confusion matrix
– Visualizing classification performance using ROC curves and precision-recall curves
– Extensions to multi-class

Why Evaluation?
Multiple methods are available to build a classification model — perceptron, KNN, Support Vector Machines (SVM), deep neural networks, etc. — and for each method, multiple choices are available for the parameter settings. To choose the best model, we need to assess each model's performance, and selecting a meaningful evaluation metric is crucial in classification-related projects.

Evaluation: Accuracy
Accuracy is widely used:
– the fraction of correctly classified samples over the whole set of samples
– = number of correctly classified samples / total number of samples
Example:
Predicted label:  1   -1    1   -1    1
True label:       1    1    1   -1    1
Accuracy = 4/5 = 0.80

Test Sets
How can we get an unbiased estimate of the accuracy of a learned model?
▪ Split the labeled dataset into a training dataset (used to train the classification model) and a test dataset (used to compute the estimated accuracy of the learned model).
▪ When learning a model, you should pretend you do not have the test data yet. Accuracy estimates will be biased if you use the test data during training.
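A small sketch of the train/test protocol (scikit-learn is a tooling assumption; the data is synthetic):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Synthetic two-class data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Hold out a test set; never touch it during training.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
model = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)

# Unbiased accuracy estimate on unseen data.
print(accuracy_score(y_te, model.predict(X_te)))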
Validation (Tuning) Sets
We want to estimate accuracy during the training stage, for tuning the hyper-parameters of a model (e.g., k in KNN):
▪ Split the labeled dataset into a training dataset, a validation dataset, and a test dataset.
▪ Train candidate models on the training dataset, use the validation dataset to select a model, and use the test dataset for the final estimated accuracy.

Limitations of Using a Single Training/Test Partition
▪ We may not have enough data to make sufficiently large training and test sets:
– a larger test set gives us a more reliable estimate of accuracy (i.e., a lower-variance estimate);
– but a larger training set will be more representative of how much data we actually have for the learning process.
▪ A single training set doesn't tell us how sensitive accuracy is to a particular training set.

Cross-Validation
K-fold cross-validation:
– Create K equal-size partitions of the dataset; each partition has N/K samples.
– Train using K−1 partitions; test on the remaining partition.
– Repeat the process K times, each time with a different test partition.
With partitions S1…S5 (K = 5):
Iteration  Training dataset   Test dataset
1          S2, S3, S4, S5     S1
2          S1, S3, S4, S5     S2
3          S1, S2, S4, S5     S3
4          S1, S2, S3, S5     S4
5          S1, S2, S3, S4     S5

Cross-Validation Example
5-fold cross-validation: suppose we have 100 labeled samples and use 5-fold cross-validation to estimate accuracy.
Iteration  Training dataset   Test dataset  Accuracy
1          S2, S3, S4, S5     S1            11/20
2          S1, S3, S4, S5     S2            17/20
3          S1, S2, S4, S5     S3            16/20
4          S1, S2, S3, S5     S4            13/20
5          S1, S2, S3, S4     S5            16/20
Overall accuracy = 73/100 = 73%

Leave-One-Out (LOO) Cross-Validation
A special case of K-fold cross-validation where K = N (the number of training samples):
– Each partition is now a single data sample.
– Train using N−1 data samples, test on the one held-out sample, and repeat N times.
– Can be expensive for large N; typically used when N is small.

When Accuracy is Not Good: Imbalanced Classes
Accuracy may be misleading on a test set with imbalanced classes. Suppose you have a test set with 2 classes containing 1000 samples:
– 10 of them are the positive class;
– 990 of them are the negative class.
You build a classifier, and its accuracy on this test set is 96%. Is that good?
For comparison, suppose we had a "dummy" classifier that didn't look at the features at all, and always just blindly predicted the most frequent class. What is its accuracy?
Answer: accuracy = 990/1000 = 99% > 96%.
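A compact K-fold sketch (scikit-learn as a tooling assumption; 5 folds as in the example):

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X[:, 0] > 0).astype(int)

# 5-fold CV: train on 4 partitions, test on the 5th, rotate 5 times.
scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), X, y, cv=5)
print(scores, scores.mean())  # per-fold accuracies and their average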
▪ Answer
– Accuracy = 990/1000 = 99% > 96%

Evaluation Metrics
▪ Accuracy
▪ Confusion Matrix
▪ Visualizing Classification Performance using ROC Curves and Precision-Recall Curves
▪ Extensions to Multi-Class

Confusion Matrix (Binary Classification Example)
▪ Label 1 = positive class (the class of interest); Label 0 = negative class (everything else)

                   Predicted Positive   Predicted Negative
  True Positive    TP                   FN
  True Negative    FP                   TN

▪ TP = True Positive: positive samples correctly classified as belonging to the positive class
▪ FN = False Negative: positive samples misclassified as belonging to the negative class
▪ FP = False Positive: negative samples misclassified as belonging to the positive class
▪ TN = True Negative: negative samples correctly classified as belonging to the negative class
▪ A confusion matrix provides more information than accuracy alone; many evaluation metrics can be derived from it.

Accuracy and Classification Error from the Confusion Matrix
▪ Accuracy = (TP + TN) / (TP + TN + FP + FN)
▪ Classification Error = 1 − Accuracy

Recall (or True Positive Rate)
▪ What fraction of all positive samples does the classifier correctly identify as positive?
– Recall = TP / (TP + FN)
▪ Recall is also known as:
– True Positive Rate (TPR)
– Sensitivity
– Probability of Detection

Precision
▪ What fraction of positive predictions is correct?
– Precision = TP / (TP + FP)

False Positive Rate
▪ What fraction of all negative instances does the classifier incorrectly identify as positive?
– FPR = FP / (FP + TN)

The Precision-Recall Tradeoff
▪ There is a tradeoff between precision and recall: a stricter classifier gives high precision but lower recall; a looser one gives low precision but high recall.
▪ Recall-oriented machine learning tasks
– Search and information extraction in legal discovery
– Tumor detection
– Often paired with a human expert to filter out false positives
▪ Precision-oriented machine learning tasks
– Search engine ranking, query suggestion
– Document classification
– Many customer-facing tasks (users remember failures)

F1-Score
▪ F1-Score: combines precision and recall into a single number
– F1 = 2 · Precision · Recall / (Precision + Recall)
▪ F-Score: generalizes the F1-score
– Fβ = (1 + β²) · Precision · Recall / (β² · Precision + Recall)
– β allows adjustment of the metric to control the emphasis on recall vs. precision

Evaluation Metrics
▪ Accuracy
▪ Confusion Matrix
▪ Visualizing Classification Performance using ROC Curves and Precision-Recall Curves
▪ Extensions to Multi-Class

Decision Functions of Classifiers
▪ The score value for each test point indicates how confidently the classifier predicts the positive class (large-magnitude positive values) or the negative class (large-magnitude negative values).
▪ Choosing a fixed decision threshold gives a classification rule.
▪ By sweeping the decision threshold through the entire range of possible score values, we get a series of classification outcomes that form a curve.

Predicted Probability of Class Membership
▪ Typical rule: choose the most likely class
– e.g., predict class 1 if its estimated probability exceeds 0.50
▪ Adjusting the threshold affects the predictions of the classifier.
▪ A higher threshold results in a more conservative classifier
– e.g., only predict class 1 if the estimated probability of class 1 is above 70%
– This increases precision: the classifier doesn't predict class 1 as often, but when it does, it gets a high proportion of class 1 instances correct. (A sketch of this threshold effect follows below.)
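A minimal sketch of the threshold effect just described, assuming scikit-learn; the predicted probabilities and labels are made up for illustration.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hypothetical predicted probabilities of class 1, with true labels.
proba  = np.array([0.95, 0.85, 0.72, 0.65, 0.55, 0.45, 0.30, 0.20])
y_true = np.array([1,    1,    1,    0,    1,    0,    0,    0])

for threshold in (0.50, 0.70):
    y_pred = (proba >= threshold).astype(int)
    print(threshold,
          precision_score(y_true, y_pred),   # TP / (TP + FP)
          recall_score(y_true, y_pred))      # TP / (TP + FN)

# Raising the threshold from 0.50 to 0.70 raises precision (0.80 -> 1.00)
# and lowers recall (1.00 -> 0.75) on this toy data.
```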
▪ Not all models provide realistic probability estimates.

Cutoff Table
▪ [Table of records with predicted probabilities, shown at cutoffs 0.25 and 0.75 in the original figure.]
– If the cutoff is 0.75: 8 records are classified as "1"
– If the cutoff is 0.25: 15 records are classified as "1"

Confusion Matrices for Different Cutoffs
▪ [The original slide shows the confusion matrices obtained at cutoff 0.75 and at cutoff 0.25.]

Creating an ROC Curve
▪ Sort the test-set predictions according to the confidence that each instance is positive
▪ Step through the sorted list from high to low confidence
– locate a threshold between instances with opposite classes (keeping instances with the same confidence value on the same side of the threshold)
– compute TPR and FPR for predictions that classify samples above the threshold as positive
– output an (FPR, TPR) coordinate

ROC Curve
▪ X-axis: False Positive Rate; Y-axis: True Positive Rate
▪ A single confusion matrix corresponds to one point on the ROC curve
▪ Top left corner:
– The "ideal" point
– False positive rate is zero
– True positive rate is one

ROC curve examples: random guessing
▪ Suppose we have a test set containing p positive samples and n negative samples
▪ Random prediction: randomly select r samples and predict them as positive, the rest as negative
– Number of true positives: r·p/(p + n)
– Number of false positives: r·n/(p + n)
▪ TPR = TP/(TP + FN) = r/(p + n), since "TP + FN" is the number of positive samples, which equals p
▪ FPR = FP/(FP + TN) = r/(p + n), since "FP + TN" is the number of negative samples, which equals n
▪ Since TPR = FPR for every r, random guessing traces the diagonal of the ROC plot.

ROC curve examples: perfect classifier
▪ Suppose we have a perfect classifier that always assigns a higher score to a randomly chosen positive sample than to a randomly chosen negative sample.
▪ TPR = TP/(TP + FN); FPR = FP/(FP + TN)
▪ [Its ROC curve goes straight to the top left corner: TPR reaches 1 while FPR is still 0.]

Summarizing an ROC Curve in One Number: Area Under the Curve (AUC)
▪ AUC = 0 (worst); AUC = 1 (best)
▪ AUC can be interpreted as:
– The total area under the ROC curve
– The probability that the classifier will assign a higher score to a randomly chosen positive example than to a randomly chosen negative example

Precision-Recall Curves
▪ X-axis: Recall; Y-axis: Precision
▪ Top right corner:
– The "ideal" point
– Precision = 1.0, Recall = 1.0

Precision-Recall Curve examples: perfect classifier
▪ Suppose we have a perfect classifier that always assigns a higher score to a randomly chosen positive sample than to a randomly chosen negative sample.
▪ Recall = TP/(TP + FN); Precision = TP/(TP + FP)

Precision-Recall Curve examples: random guessing
▪ Suppose we have a test set containing p positive samples and n negative samples
▪ Random prediction: randomly select r samples and predict them as positive, the rest as negative
– Number of true positives: r·p/(p + n)
– Number of false positives: r·n/(p + n)
▪ Recall = TP/(TP + FN) = r/(p + n), since "TP + FN" is the number of positive samples, which equals p
▪ Precision = TP/(TP + FP) = p/(p + n), since "TP + FP" is the number of positive predictions, which equals r
▪ So random guessing traces a horizontal line at precision p/(p + n). (A sketch that computes ROC and precision-recall curves on example scores appears after this section.)

Evaluation Metrics
▪ Accuracy
▪ Confusion Matrix
▪ Visualizing Classification Performance using ROC Curves and Precision-Recall Curves
▪ Extensions to Multi-Class

Multi-Class Evaluation
▪ Multi-class evaluation is an extension of the binary case.
– A collection of true vs. predicted binary outcomes, one per class
– Confusion matrices are especially useful
▪ Overall evaluation metrics are averages across classes
– There are different ways to average multi-class results

Multi-Class Confusion Matrix
▪ The numbers on the diagonal represent correct predictions.
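Returning to the ROC and precision-recall discussion above, a minimal scikit-learn sketch that sweeps the decision threshold over a set of scores; the labels and scores are invented for illustration.

```python
import numpy as np
from sklearn.metrics import roc_curve, auc, precision_recall_curve

# Hypothetical true labels and classifier scores for ten test points.
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1, 1, 0])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.5, 0.75, 0.6, 0.3])

# Sweep the decision threshold over the score values, as described above.
fpr, tpr, thresholds = roc_curve(y_true, scores)
print(auc(fpr, tpr))                     # area under the ROC curve

precision, recall, _ = precision_recall_curve(y_true, scores)
# Plotting (recall, precision) shows the tradeoff; the ideal point is (1, 1),
# just as the ideal ROC point is the top left corner (FPR 0, TPR 1).
```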
[Figure: a multi-class confusion matrix with axes labeled "Predicted Digit" and "True Digit".]

Visualize the Multi-Class Confusion Matrix
▪ Standard confusion matrix vs. heat-map confusion matrix
▪ In the standard confusion matrix, it is hard to spot underlying patterns.
▪ In the heat-map confusion matrix, it is much easier to identify insights for this multi-class classification:
– 3 and 8 are often misclassified as each other
– 5 is misclassified as many different numbers

COMP 4125/7820 Visual Analytics & Decision Support
Lecture 12-2: Text Data Analytics and Visualization
Dr. MA Jing

Why Text Analytics?
▪ Text is everywhere
▪ We use documents as the primary information artifact in our lives
▪ Our access to documents has grown tremendously thanks to the Internet
– WWW: webpages, Twitter, Facebook, Wikipedia, blogs, …
– Digital libraries: Google Books, ACM, IEEE, …
– Lyrics, closed captions… (YouTube)
– Police case reports
– Legislation (law)
– Reviews (products, Rotten Tomatoes)
– Medical reports (EHR – electronic health records)
– Job descriptions

Text Analytics Tasks
▪ Topic Modeling
▪ Text Classification
– Spam / Not Spam
▪ Text Similarity
– Information Retrieval
▪ Sentiment Analysis
– Positive/Negative
▪ Entity Extraction
▪ Text Summarization
▪ Machine Translation
▪ Natural Language Generation

Popular Natural Language Processing (NLP) Libraries
▪ Stanford NLP
▪ OpenNLP
▪ NLTK (Python)
▪ Typical capabilities: tokenization, sentence segmentation, part-of-speech tagging, named entity extraction, chunking, parsing

A Typical Text Analytics Pipeline
▪ Raw Text → Preprocessing → Numerical Representation → Analysis

Outline
▪ Text Analytics
– Preprocessing (e.g., tokenization, stemming, removing stop words)
– Document representation (most common: bag-of-words model)
– Word importance (e.g., word count, TF-IDF)
– Latent Semantic Indexing (finds "concepts" among documents and words), which helps with retrieval
▪ Text Visualization

Preprocessing Raw Text
▪ Tokenize
– Tokenization is a process that splits an input sequence into so-called tokens (e.g., words)
– "Hello, I'm Dr. Jones." → ['Hello', 'I', "'m", 'Dr.', 'Jones']
▪ Token normalization
– Stemming: a process of removing and replacing suffixes to get to the root form of the word (i.e., the stem); e.g., cats → cat
– Lemmatization: refers to doing things properly with the use of a vocabulary and morphological analysis; e.g., am, is, are → be
– Lower-casing: Computer → computer
▪ Remove noise
– Remove stop words (e.g., "a", "the", "is")
– Remove punctuation (e.g., the comma, apostrophe, and question mark in "John's car is red, right?")

Document Representation: Bag-of-Words Model
▪ Represent each document as a bag of words, ignoring word ordering. Why? For simplicity.
▪ Unstructured text becomes a vector of numbers
▪ e.g., docs: "I like visualization", "I like data"
– Vocabulary: 1: "I", 2: "like", 3: "data", 4: "visualization"
– "I like visualization" → [1, 1, 0, 1]
– "I like data" → [1, 1, 1, 0]

Document Representation
▪ One possible approach:
– Each entry describes a document
– Attributes describe whether or not a term appears in the document
▪ Another approach:
– Each entry describes a document
– Attributes represent the frequency with which a term appears in the document
– Example: a term-frequency document matrix

TF-IDF
▪ A word's importance score in a document, among N documents
▪ TF-IDF weighting: give higher weight to terms that are rare among the N documents
▪ When to use it? Everywhere you use "word count", you can likely use TF-IDF. (A bag-of-words sketch of the example above follows below; TF-IDF is defined next.)
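A small bag-of-words sketch of the "I like visualization" / "I like data" example, assuming scikit-learn's CountVectorizer.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["I like visualization", "I like data"]   # the two example documents

# Keep one-letter tokens such as "I" (the default pattern drops them) so the
# vocabulary matches the slide's {I, like, data, visualization}; note that
# scikit-learn orders the columns alphabetically, not as in the slide.
vec = CountVectorizer(lowercase=True, token_pattern=r"(?u)\b\w+\b")
X = vec.fit_transform(docs)

print(vec.get_feature_names_out())   # ['data' 'i' 'like' 'visualization']
print(X.toarray())                   # one bag-of-words vector per document
```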
TF-IDF (continued)
▪ TF: term frequency = the number of times the term appears in a document
– high if the term appears many times in the document
– a high value indicates the term is more relevant to the document
▪ IDF: inverse document frequency = log(N / #documents containing that term)
– penalizes "common" words that appear in almost all documents
– a high value indicates the term is more discriminative
▪ Final score = TF × IDF
– higher score → more "characteristic"

A Simple TF-IDF Example
▪ Suppose we have two documents
– Document 1: "This is a sample"
– Document 2: "This is another example"
▪ Term-frequency document matrix:

          this   is   a   another   sample   example
  Doc1    1      1    1   0         1        0
  Doc2    1      1    0   1         0        1

A Simple TF-IDF Example
▪ Inverse Document Frequency: IDF = log(N / #documents containing that term), with base-10 logarithms here

  Term      df   N   idf
  this      2    2   log(2/2) = 0
  is        2    2   log(2/2) = 0
  a         1    2   log(2/1) ≈ 0.301
  another   1    2   log(2/1) ≈ 0.301
  sample    1    2   log(2/1) ≈ 0.301
  example   1    2   log(2/1) ≈ 0.301

▪ IDF is low for words appearing in almost every document.

A Simple TF-IDF Example
▪ Multiplying the TF matrix by the IDF values gives the TF-IDF matrix:

          this   is   a       another   sample   example
  Doc1    0      0    0.301   0         0.301    0
  Doc2    0      0    0       0.301     0        0.301

Vector Space Model
▪ Each document → a vector
▪ Each query → a vector
▪ Searching for documents → finding "similar" vectors
▪ Clustering documents → clustering "similar" vectors

Latent Semantic Indexing (LSI)
▪ Main idea
– map each document into some 'concepts'
– map each term into some 'concepts'
▪ 'Concept': a set of terms, with weights. For example, the DBMS concept: "data" (0.8), "system" (0.5), …
▪ Q: How to search, e.g., for the query "system"?
▪ A: find the corresponding concept(s), and then the corresponding documents
– We may retrieve documents that DON'T contain the term "system" but contain almost everything else ("data", "retrieval")

LSI – Discussion
▪ Great idea
– to derive 'concepts' from documents
– to build a 'thesaurus' (words with similar meanings) automatically
– to reduce dimensionality (down to a few "concepts")
▪ How does LSI work?
– It uses Singular Value Decomposition (SVD)

SVD Definition
▪ A = U Λ V^T
– A: n × m matrix, e.g., n documents, m terms
– U: n × r matrix, e.g., n documents, r concepts
– Λ: r × r diagonal matrix; r is the rank of the matrix, and the diagonal entries give the strength of each 'concept'
– V: m × r matrix, e.g., m terms, r concepts
▪ [The original slides work through an SVD example on a small term-document matrix.]

Case Study: How to Do Queries with LSI?
▪ For example, how to find documents relevant to the query 'data'?
▪ A: map query vectors into 'concept space' using the inner product (cosine similarity) with each 'concept' vector vi
▪ How would the document ('information', 'retrieval') be handled?
– The document ('information', 'retrieval') will be retrieved by the query ('data'), even though it does not contain 'data'!
▪ (A sketch of the TF-IDF computation and an SVD-based concept mapping follows below.)
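A short sketch that reproduces the TF-IDF tables above with NumPy and then maps the documents into 'concept space' with a truncated SVD, the decomposition LSI is built on; scikit-learn's TruncatedSVD stands in for the full SVD worked in the original slides.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

terms = ["this", "is", "a", "another", "sample", "example"]
# Term-frequency matrix from the example: one row per document.
tf = np.array([[1, 1, 1, 0, 1, 0],     # Doc1: "This is a sample"
               [1, 1, 0, 1, 0, 1]])    # Doc2: "This is another example"

N = tf.shape[0]                         # number of documents
df = (tf > 0).sum(axis=0)               # documents containing each term
idf = np.log10(N / df)                  # log10(2/1) ~= 0.301 for rare terms

tfidf = tf * idf                        # reproduces the TF-IDF table above
print(np.round(tfidf, 3))

# LSI via truncated SVD: map each document onto r = 1 'concept'.
lsi = TruncatedSVD(n_components=1).fit_transform(tfidf)
print(lsi)                              # one concept coordinate per document
```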
Outline
▪ Text Analytics
– Preprocessing (e.g., tokenization, stemming, removing stop words)
– Document representation (most common: bag-of-words model)
– Word importance (e.g., word count, TF-IDF)
– Latent Semantic Indexing (finds "concepts" among documents and words), which helps with retrieval
▪ Text Visualization

Text Visualization: Word/Tag Cloud
▪ One of the most intuitive and commonly used techniques for visualizing words.
▪ Font size indicates word frequency. (A minimal generation sketch appears at the end of this section.)
▪ Fails to uncover relationships between words.

Word Tree
▪ A tree-based visualization that captures relationships between words.
▪ Summarizes text data via a syntax tree
– Sentences are aggregated by their shared words and split into branches at the places where the corresponding words in the sentences diverge.
▪ https://www.jasondavies.com/wordtree/

Phrase Net
▪ Phrase Net uses a node-link diagram
– Nodes are keywords
– Links represent relationships among keywords
▪ For example, a user selects a predefined regular expression from a list to extract a pattern ("X and Y")
▪ http://hint.fm/projects/phrasenet/

Topic Model Visualization
▪ A tabular view (left) displays term-topic distributions for an LDA topic model. A bar chart (right) shows the marginal probability of each term.
▪ http://vis.stanford.edu/papers/termite

Text Visualization Example
▪ An analysis of the use of different words across the texts of the U.S. president's annual "State of the Union" address, showing which terms are used how often as the national situation changes. On the left, a visualization of the text is linked to the search box; on the right, a table graphically shows the frequency of term usage across time.

Text Visualization Example
▪ A visualization of the relative popularity of U.S. baby names across time, using a stacked line graph. All matching names are shown in alphabetical order, with the frequency for a given year determining how much space lies between the line for that name and the name below it. The naming frequency changes over time, producing an impression of undulating waves.

More Text Visualization Examples
▪ http://textvis.lnu.se/
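As referenced in the word-cloud slide above, a minimal sketch of generating a word cloud, assuming the third-party wordcloud Python package (pip install wordcloud); the input text is a made-up placeholder.

```python
from wordcloud import WordCloud

# Placeholder corpus; in practice, feed in the raw text to visualize.
text = ("visual analytics decision support clustering text visualization "
        "word cloud word tree phrase net topic model visualization text")

# Font size in the rendered image encodes word frequency, as noted above.
wc = WordCloud(width=800, height=400, background_color="white").generate(text)
wc.to_file("cloud.png")
```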