COMP 4125/7820
Visual Analytics & Decision
Support
Lecture 3: Supplement – Data
Clustering
Dr. LIU Yang
Some contents and images are from "Jure Leskovec, Anand Rajaraman, and Jeff Ullman, Mining of Massive Datasets (http://www.mmds.org/)" and "Alex Rodriguez and Alessandro Laio, Clustering by fast search and find of density peaks, Science 27 Jun 2014: 344(6191), pp. 1492-1496".
1
▪ Given a cloud of data points, we want to understand its structure
2
The Problem of Clustering
▪ Given a set of points, with a notion of
distance between points, group the
points into some number of clusters, so
that
– Members of a cluster are close/similar to each
other
– Members of different clusters are dissimilar
3
What is Good Clustering?
▪ A good clustering method will produce high quality
clusters
– high intra-class similarity: cohesive within clusters
– low inter-class similarity: distinctive between clusters
[Figure: a clustered point set with an outlier; intra-cluster distances are minimized, inter-cluster distances are maximized]
4
What is similarity?
▪ Similarity is hard to define, but detecting similarity is a typical task in decision making: "we know it when we see it."
5
Euclidean Distance
▪ The Euclidean Distance takes into account both the direction and
the magnitude of the vectors
▪ The Euclidean Distance between two n-dimensional vectors
x=(x1,x2,…,xn) and y=(y1,y2,…,yn) is:
d_E(x, y) = sqrt( (x1 − y1)^2 + (x2 − y2)^2 + … + (xn − yn)^2 ) = sqrt( Σ_{i=1..n} (xi − yi)^2 )
▪ Each axis represents an experimental sample
▪ The coordinate on each axis is the measured expression level of a gene in that sample
[Figure: several genes plotted in two experiments (n = 2 in the above formula)]
6
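As a concrete illustration, here is a minimal NumPy sketch of the distance above (the two gene vectors are made up for the example):

```python
import numpy as np

def euclidean_distance(x, y):
    """d_E(x, y) = sqrt(sum_i (x_i - y_i)^2) for two n-dimensional vectors."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.sqrt(np.sum((x - y) ** 2))

# Two genes measured in two experiments (n = 2); values are illustrative
gene_a = [1.0, 2.0]
gene_b = [4.0, 6.0]
print(euclidean_distance(gene_a, gene_b))  # sqrt(9 + 16) = 5.0
```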
Clustering - Concept
▪ Start with a collection of n objects each
represented by a D–dimensional feature
vector xi , i=1, …n.
▪ The goal is to assign these n objects into k clusters so that objects within a cluster are more "similar" than objects between clusters.
7
Hierarchical clustering
8
Hierarchical Clustering
▪ Key operation:
Repeatedly combine
two nearest clusters
▪ Three important questions:
– 1) How do you represent a cluster of more
than one point?
– 2) How do you determine the “nearness” of
clusters?
– 3) When to stop combining clusters?
9
Hierarchical Clustering
▪ Key operation: Repeatedly combine two
nearest clusters
▪ (1) How to represent a cluster of many
points?
– Key problem: As you merge clusters, how do you
represent the “location” of each cluster, to tell which
pair of clusters is closest?
– Euclidean case: each cluster has a
centroid = average of its (data)points
▪ (2) How to determine “nearness” of
clusters?
– Measure cluster distances by distances of centroids
10
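Before the worked example, here is a minimal sketch of the centroid-based merge loop described above (plain Python/NumPy, O(n³); illustrative only, not a production implementation):

```python
import numpy as np

def agglomerative(points, k):
    """Repeatedly combine the two nearest clusters (by centroid distance) until k remain."""
    clusters = [[p] for p in np.asarray(points, dtype=float)]
    while len(clusters) > k:
        centroids = [np.mean(c, axis=0) for c in clusters]
        # find the pair of clusters whose centroids are closest
        pairs = [(i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda ij: np.linalg.norm(centroids[ij[0]] - centroids[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]  # merge cluster j into cluster i
        del clusters[j]
    return clusters

print(agglomerative([(0, 0), (1, 2), (2, 1), (4, 1), (5, 0), (5, 3)], 2))
```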
Example: Hierarchical clustering
[Figure: six data points (o) at (0,0), (1,2), (2,1), (4,1), (5,0), (5,3); centroids (x) at (1,1) and (1.5,1.5) in the left group and at (4.5,0.5) and (4.7,1.3) in the right group, with the merge order shown as a dendrogram]
11
K-means clustering
12
K-means
▪ Given a K, find a partition of K clusters to
optimize the chosen partitioning criterion
(cost function)
▪ The K-means algorithm: a heuristic method
– K-means (MacQueen '67): each cluster is represented by the centre of the cluster, and the algorithm converges to stable centres of clusters.
– K-means is the simplest partitioning method for clustering analysis and is widely used in data mining applications.
13
K-means
1) Pick a number (K) of cluster centers (at
random)
2) Assign every item to its nearest cluster
center (e.g. using Euclidean distance)
3) Move each cluster center to the mean of its
assigned items
4) Repeat steps 2 and 3 until convergence (change in cluster assignments less than a threshold); a minimal sketch follows below.
14
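A minimal NumPy sketch of steps 1–4 (plain random initialization, not the better seeding methods discussed later):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # 1) Pick k cluster centers at random from the data
    centers = X[rng.choice(len(X), size=k, replace=False)]
    assign = np.full(len(X), -1)
    for _ in range(n_iters):
        # 2) Assign every item to its nearest center (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_assign = dists.argmin(axis=1)
        # 3) Move each center to the mean of its assigned items
        for j in range(k):
            if np.any(new_assign == j):
                centers[j] = X[new_assign == j].mean(axis=0)
        # 4) Stop when the assignments no longer change
        if np.array_equal(new_assign, assign):
            break
        assign = new_assign
    return centers, assign
```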
Example: Assigning Clusters
[Figure: data points and centroids (x); clusters after round 1]
15
Example: Assigning Clusters
[Figure: the same points re-assigned; clusters after round 2]
16
Example: Assigning Clusters
[Figure: clusters at the end]
17
Example 2, step 1: Pick 3 initial cluster centers k1, k2, k3 (randomly).
18
Example 2, step 2: Assign each point to the closest cluster center.
19
Example 2, step 3: Move each cluster center to the mean of its cluster.
20
Example 2, step 4: Reassign the points that are now closest to a different cluster center. Q: Which points are reassigned? A: three points.
21-23
Example 2, step 4b: Re-compute the cluster means.
24
Example 2, step 5: Move the cluster centers to the cluster means.
25
Problems with k-Means
▪ Very sensitive to the initial points.
– http://shabal.in/visuals/kmeans/1.html
– Do many runs of k-Means, each with
different initial centroids.
– Seed the centroids using a better
method than random. (e.g. Farthest-first
sampling)
▪ Must manually choose k.
26
Example: Picking k
Too few; many long distances to centroid.
[Figure: the point cloud clustered with too few centroids]
27
Example: Picking k
Just right; distances rather short.
[Figure: the same point cloud with the right number of centroids]
28
Example: Picking k
Too many; little improvement in average distance.
[Figure: the same point cloud with too many centroids]
29
Getting the k right
How to select k?
▪ Try different k, looking at the change in the
average distance to centroid as k
increases
▪ Average falls rapidly until right k, then
changes little
30
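A sketch of this "elbow" procedure, reusing the kmeans sketch above (the synthetic data is made up for the example):

```python
import numpy as np

def average_distance_to_centroid(X, centers, assign):
    X = np.asarray(X, dtype=float)
    return np.mean(np.linalg.norm(X - centers[assign], axis=1))

# synthetic data with 3 true clusters; watch where the curve stops falling rapidly
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc, 1.0, size=(50, 2)) for loc in ([0, 0], [6, 0], [3, 5])])
for k in range(1, 8):
    centers, assign = kmeans(X, k)
    print(k, round(average_distance_to_centroid(X, centers, assign), 3))
```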
Visualizing K-Means
https://www.youtube.com/watch?v=zHbxbb2ye3E
https://www.naftaliharris.com/blog/visualizing-k-means-clustering/
31
Density peaks
32
Latest work on clustering
▪ Authors:
– Alex Rodriguez, Alessandro Laio
▪ Affiliation:
– SISSA (Scuola Internazionale Superiore di Studi
Avanzati), via Bonomea 265, I-34136 Trieste,
Italy
▪ Title:
– Clustering by fast search and find of density
peaks
▪ Publication Details:
– Science 27 Jun 2014: Vol. 344, Issue 6191, pp.
1492-1496
33
Clustering - example
▪ Visually, a cluster is a region with a high density of points, separated from other dense regions.
34
Clustering - example
35
Inefficiency of standard algorithms
▪ K-means algorithm
– It assigns each point to the closest
cluster center. Variables: the
number of centers and their
location
– By construction it is unable to
recognize non-spherical clusters
36
What is a cluster?
▪ Clusters = peaks in the density of points
37
Proposed Method
[Slides 38–41: figures from the paper illustrating the proposed method]
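The slides above are figures from the cited paper; in words, the method computes two quantities for every point i: a local density ρi (the number of points within a cutoff distance d_c) and δi (the minimum distance from i to any point of higher density). Cluster centers are the points where both ρ and δ are anomalously large, and each remaining point is assigned to the same cluster as its nearest higher-density neighbor. A minimal sketch of the ρ/δ computation (naive O(n²); d_c is a user-chosen cutoff):

```python
import numpy as np

def density_peaks_rho_delta(X, d_c):
    """rho_i: number of points within distance d_c of i; delta_i: distance
    from i to the nearest point of higher density (Rodriguez & Laio, 2014)."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    rho = (dist < d_c).sum(axis=1) - 1   # exclude the point itself
    delta = np.empty(n)
    for i in range(n):
        higher = np.where(rho > rho[i])[0]
        # the highest-density point gets its maximum distance by convention
        delta[i] = dist[i].max() if len(higher) == 0 else dist[i, higher].min()
    return rho, delta

# Cluster centers: points where both rho and delta are anomalously large;
# remaining points join the cluster of their nearest higher-density neighbor.
```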
Not even an algorithm…
42
The clustering approach at work
[Slides 43–48: figures showing the method applied to test distributions, with the corresponding decision graphs]
Possible Applications
▪ Classification of
living organisms
▪ Marketing
strategies
▪ Libraries (book
sorting)
▪ Google search
▪ Face recognition
49
Application on face recognition
▪ Define a “distance”
between faces,
based on some
stable features
▪ Sampat, M.P., Wang, Z., Gupta, S., Bovik, A.C., & Markey, M.K. (2009). Complex wavelet structural similarity: A new image similarity index. IEEE Trans. Image Processing, 18(11), 2385-2401.
54
Application on face recognition
[Slides 55–56: face images grouped into clusters by the density-peaks method]
COMP 4125/7820
Visual Analytics & Decision
Support
Supplement – Classification
Dr. LIU Yang
Some contents are from "Jure Leskovec, Anand Rajaraman, and Jeff Ullman, Mining of Massive Datasets (http://www.mmds.org/)"
▪ K Nearest Neighbor
▪ Decision Tree
▪ Support Vector Machines
2
▪ Would like to do prediction: estimate a function f(x) so that y = f(x)
▪ Where y can be:
  ▪ Real number: Regression
  ▪ Categorical: Classification
  ▪ Complex object: ranking of items, parse tree, etc.
▪ Data is labeled: training set (X, Y) and test set (X', Y')
  ▪ Estimate y = f(x) on X, Y; hope that the same f(x) also works on the unseen X', Y'
  ▪ Have many pairs {(x, y)}
    ▪ x … vector of binary, categorical, real-valued features
    ▪ y … class ({+1, -1}, or a real number)
3
[Slide 4 repeats the prediction setup with an illustration: classify an image as Human or Animal]
4
▪ Eager learning
  ▪ When given a set of training data samples, it constructs a generalization model before receiving new data samples to classify
  ▪ Decision tree, rule-based classification, classification by back propagation, Support Vector Machines (SVM), associative classification
▪ Lazy learning
  ▪ Simply stores training data samples and waits until it is given a test data sample
  ▪ Less time in training but more time in predicting
5
▪ A typical lazy learning method
  ▪ An algorithm that stores all available cases and classifies new cases based on a similarity measure
▪ Data samples/instances represented as points in a Euclidean space
  ▪ Classification done by comparing feature vectors of the different points
▪ Also called
  ▪ Memory-Based Reasoning
  ▪ Example-Based Reasoning
  ▪ Instance-Based Learning
  ▪ Case-Based Reasoning
7
▪ If it walks like a duck and quacks like a duck, then it's probably a duck
[Figure: compute the distance from the test data sample to the training samples, then choose the "nearest" training sample]
8
• Consider a two-class problem where each sample consists of two measurements, so that each sample is represented as a two-dimensional vector in the Euclidean space.
[Figure: points from Class 1 and Class 2 in the plane]
9
• Which class does the green triangle belong to?
10
• For a given query point, assign the class of the nearest neighbor to it. It belongs to Class 1.
11
• Which class does the green triangle belong to now?
12
▪ To make Nearest Neighbor work we need 2 important things:
  ▪ Distance metric: Euclidean
  ▪ How many neighbors to look at? One
13
▪ For k-Nearest Neighbor:
  ▪ Distance metric: Euclidean
  ▪ How many neighbors to look at? k
  ▪ Weighting function (optional): unused
  ▪ How to fit with the local points? Just predict the average output among the k nearest neighbors
14
• Compute the k nearest neighbours and assign the class by majority vote. It belongs to Class 2 now (k = 5).
15
• Requires three things:
  • The set of training data samples
  • Distance metric to compute distance between samples
  • The value of k, the number of nearest neighbors to retrieve
• To classify an unknown data sample (see the sketch below):
  • Compute distance to other training samples
  • Identify k nearest neighbors
  • Use class labels of nearest neighbors to determine the class label of the unknown sample (e.g., by taking a majority vote)
16
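A minimal NumPy sketch of the three ingredients above (stored training samples, Euclidean distance, majority vote over the k nearest); the toy points are made up:

```python
import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x_query, k=5):
    X_train = np.asarray(X_train, dtype=float)
    # 1. Compute the distance from the query to every training sample
    dists = np.linalg.norm(X_train - np.asarray(x_query, dtype=float), axis=1)
    # 2. Identify the k nearest neighbors
    nearest = np.argsort(dists)[:k]
    # 3. Majority vote over their class labels
    votes = Counter(np.asarray(y_train)[nearest])
    return votes.most_common(1)[0][0]

# Toy two-class example
X = [(1, 1), (1, 2), (2, 1), (6, 6), (6, 7), (7, 6)]
y = [1, 1, 1, 2, 2, 2]
print(knn_classify(X, y, (2, 2), k=3))  # -> 1
```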
[Slide 17: a training data table with attributes age, income, student, credit_rating and class buys_computer]
▪ Given the training data in the above table, predict the class of the following new example using k-Nearest Neighbor for k = 5:
{age<=30, income=medium, student=yes, credit_rating=fair}
18
Among the 5 nearest neighbors, 4 are from class Yes and 1 from class No. Hence, the k-NN classifier predicts buys_computer = yes for the new example.
19
▪ What distance measure to use?
  ▪ Often Euclidean distance is used
  ▪ Locally adaptive metrics
  ▪ More complicated with non-numeric data, or when different dimensions have different scales
▪ Choice of k?
  ▪ Cross-validation
  ▪ 1-NN often performs well in practice
  ▪ k-NN needed for overlapping classes
  ▪ Re-label all data according to k-NN, then classify with 1-NN
20
• Any two bases ei, ej are assumed to be uncorrelated with each other
• Euclidean distance therefore ignores the relationships among different coordinates of high-order data:
distE(a,b) = 54 > 49 = distE(a,c), which is unreasonable!
22
Training Set:
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test Set:
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?

The Training Set is fed to a Tree Induction algorithm (Induction / Learn Model) to produce a Model, a decision tree; the Model is then applied to the Test Set (Apply Model / Deduction) to predict the missing class labels.
32
Splitting Attributes
Training Data:
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Model: Decision Tree
Refund?
├─ Yes → NO
└─ No → MarSt?
        ├─ Single, Divorced → TaxInc?
        │   ├─ < 80K → NO
        │   └─ > 80K → YES
        └─ Married → NO
33
The same training data fits another model:
MarSt?
├─ Married → NO
└─ Single, Divorced → Refund?
    ├─ Yes → NO
    └─ No → TaxInc?
        ├─ < 80K → NO
        └─ > 80K → YES
There could be more than one tree that fits the same data!
34
Applying the Model to Test Data
Test Data: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?
Start from the root of the tree:
1. Refund = No → follow the "No" branch to MarSt.
2. Marital Status = Married → follow the "Married" branch, which is a leaf.
Assign Cheat to "No".
35-40
Training data:
#   Outlook  Company  Sailboat  Sail?
1   sunny    big      small     yes
2   sunny    med      small     yes
3   sunny    med      big       yes
4   sunny    no       small     yes
5   sunny    big      big       yes
6   rainy    no       small     no
7   rainy    med      small     yes
8   rainy    big      big       yes
9   rainy    no       big       no
10  rainy    med      big       no

Examples to classify:
#   Outlook  Company  Sailboat  Sail?
1   sunny    no       big       ?
2   rainy    big      small     ?

Candidate splitting attributes: outlook, company, sailboat.
41
[Slides 42–50 repeat the training data while the tree is grown one split at a time:]
1. outlook is chosen as the root and split into sunny / rainy.
2. Every sunny example has Sail? = yes, so the sunny branch becomes a leaf: yes.
3. The rainy branch is split on company (no / med / big).
4. company = no → all such examples are no; company = big → all such examples are yes.
5. company = med is split on sailboat: small → yes, big → no.

Final tree:
outlook?
├─ sunny → yes
└─ rainy → company?
    ├─ no → no
    ├─ big → yes
    └─ med → sailboat?
        ├─ small → yes
        └─ big → no
42-50
Classifying the new examples with the final tree:
• Example 1 (outlook = sunny, company = no, sailboat = big): outlook = sunny → Sail? = Yes.
• Example 2 (outlook = rainy, company = big, sailboat = small): outlook = rainy, company = big → Sail? = Yes.
51-52
#   Outlook   Temperature  Humidity  Windy  Play
1   sunny     hot          high      no     N
2   sunny     hot          high      yes    N
3   overcast  hot          high      no     P
4   rainy     moderate     high      no     P
5   rainy     cold         normal    no     P
6   rainy     cold         normal    yes    N
7   overcast  cold         normal    yes    P
8   sunny     moderate     high      no     N
9   sunny     cold         normal    no     P
10  rainy     moderate     normal    no     P
11  sunny     moderate     normal    yes    P
12  overcast  moderate     high      yes    P
13  overcast  hot          normal    no     P
14  rainy     moderate     high      yes    N
53
A decision tree for the weather data:
Outlook?
├─ sunny → Humidity? (high → N, normal → P)
├─ overcast → P
└─ rainy → Windy? (yes → N, no → P)
54
[Slide 55: a much larger tree for the same data, rooted at Temperature, with nested Outlook, Windy, and Humidity tests (and a null leaf); it shows that a poor choice of splitting attribute yields a far more complex tree.]
55
▪ Main principle
  ▪ Select the attribute which partitions the learning set into subsets that are as "pure" as possible
▪ Various measures of purity
  ▪ Information-theoretic
  ▪ Gini index
  ▪ χ²
  ▪ ReliefF
  ▪ ...
▪ Various improvements
  ▪ probability estimates
  ▪ normalization
  ▪ binarization, subsetting
56
▪ To classify an object, a certain amount of information is needed
  ▪ E, entropy
▪ After we have learned the value of attribute A, we only need some remaining amount of information to classify the object
  ▪ Ires, residual information
▪ Gain(S, A): expected reduction in entropy due to sorting on A
  ▪ Gain(A) = Entropy(S) − Ires(A)
▪ The most 'informative' attribute is the one that minimizes Ires, i.e., maximizes Gain
57
▪ Calculation of entropy
  Entropy(S) = − Σ_{i=1..I} (|Si|/|S|) · log2(|Si|/|S|)
  ▪ S = set of examples
  ▪ Si = subset of S with value vi under the target attribute
  ▪ I = size of the range of the target attribute
▪ For a two-class problem:
  [Figure: entropy plotted against the class proportion |Si|/|S|: 0 when all examples belong to one class, peaking at 1 bit when both classes are equally likely]
58
▪ After applying attribute A, S is partitioned into subsets according to the values v of A
▪ Ires is equal to the weighted sum of the amounts of information for the subsets:
  Ires(A) = Σ_{i=1..k} (|Si|/|S|) · Entropy(Si), where k is the size of the range of the attribute we are testing
59
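A minimal sketch of these three quantities (Entropy, Ires, Gain); the final check uses the 5-triangle / 9-square data introduced below:

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy(S) = -sum_i (|Si|/|S|) * log2(|Si|/|S|)."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def residual_information(values, labels):
    """Ires(A) = sum over values v of A of (|Sv|/|S|) * Entropy(Sv)."""
    n = len(labels)
    subsets = {}
    for v, y in zip(values, labels):
        subsets.setdefault(v, []).append(y)
    return sum(len(sub) / n * entropy(sub) for sub in subsets.values())

def gain(values, labels):
    return entropy(labels) - residual_information(values, labels)

# 5 triangles and 9 squares, as in the example below
labels = ["triangle"] * 5 + ["square"] * 9
print(round(entropy(labels), 3))  # ~0.94 bits
```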
#   Color   Outline  Dot  Shape
1   green   dashed   no   triangle
2   green   dashed   yes  triangle
3   yellow  dashed   no   square
4   red     dashed   no   square
5   red     solid    no   square
6   red     solid    yes  triangle
7   green   solid    no   square
8   green   dashed   no   triangle
9   yellow  solid    yes  square
10  red     solid    no   square
11  green   solid    yes  square
12  yellow  dashed   yes  square
13  yellow  solid    no   square
14  red     dashed   yes  triangle
60
Data Set: the same 14 classified objects as in the table above.
61
The data set contains 5 triangles and 9 squares.
• class probabilities: p(triangle) = 5/14, p(square) = 9/14
• entropy: E = −(5/14)·log2(5/14) − (9/14)·log2(9/14) ≈ 0.940 bits
62
Splitting on Color partitions the objects into red, green, and yellow subsets; Ires(Color) is the weighted sum of the three subset entropies.
[Slides 63–65: figures showing the objects routed down the red, green, and yellow branches of the Color? test]
63-65
▪ Attributes
  ▪ Gain(Color) = 0.246
  ▪ Gain(Outline) = 0.151
  ▪ Gain(Dot) = 0.048
▪ Heuristic: the attribute with the highest gain is chosen
▪ This heuristic is local (local minimization of impurity)
66
For the green subset (3 triangles, 2 squares; entropy 0.971 bits):
Gain(Outline) = 0.971 − 0 = 0.971 bits
Gain(Dot) = 0.971 − 0.951 = 0.020 bits
→ the green branch is split on Outline.
67
For the red subset (2 triangles, 3 squares; entropy 0.971 bits):
Gain(Outline) = 0.971 − 0.951 = 0.020 bits
Gain(Dot) = 0.971 − 0 = 0.971 bits
→ the red branch is split on Dot.
68-69
Final tree:
Color?
├─ red → Dot? (yes → triangle, no → square)
├─ yellow → square
└─ green → Outline? (dashed → triangle, solid → square)
70
- We are given a set of n points (vectors) x1, x2, …, xn such that each xi is a vector of length m, and each belongs to one of two classes, labeled "+1" and "-1".
- So our training set is (x1, y1), (x2, y2), …, (xn, yn), with xi ∈ R^m and yi ∈ {+1, −1}.
- We want to find a separating hyperplane w · x + b = 0 that separates these points into the two classes, "the positives" (class "+1") and "the negatives" (class "-1") (assuming that they are linearly separable).
- So the decision function will be f(x) = sign(w · x + b).
72
[Figure: points with yi = +1 and yi = −1 separated by a hyperplane w · x + b = 0; the classifier is f(x) = sign(w · x + b)]
But there are many possibilities for such hyperplanes!
73
Which one should we choose? There are many possible separating hyperplanes; it could be this one, or this, or this, or maybe…!
74
- Suppose we choose the hyperplane (seen below) that is close to some sample xi.
- Now suppose we have a new point x' that should be in class "-1" and is close to xi. Using our classification function f(x), this point is misclassified!
Poor generalization! (Poor performance on unseen data)
75
- The hyperplane should be as far as possible from any sample point.
- This way, new data that is close to the old samples will be classified correctly.
Good generalization!
76
- The SVM idea is to maximize the distance between the hyperplane and the closest sample point.
- In the optimal hyperplane: the distance to the closest negative point = the distance to the closest positive point.
77
- SVM's goal is to maximize the margin, which is twice the distance "d" between the separating hyperplane and the closest sample.
- Why is it the best?
  - Robust to outliers, as we saw, and thus strong generalization ability.
  - It has proved to have better performance on test data in both practice and theory.
78
- Support vectors are the samples closest to the separating hyperplane. (So this is where the name came from!)
- We will see later that the optimal hyperplane is completely defined by the support vectors.
79
- Our optimization problem so far:
  minimize (1/2)·‖w‖²  subject to  yi(wᵀxi + b) ≥ 1
- We will solve this problem by introducing Lagrange multipliers αi associated with the constraints:
  minimize Lp(w, b, α) = (1/2)·‖w‖² − Σ_{i=1..n} αi·( yi(xi · w + b) − 1 ),  subject to αi ≥ 0
80
- Given a guess of w, b we can compute the sum of distances of points to their correct zones, and compute the margin width M = 2/√(w · w).
- Assume R datapoints, each (xk, yk) where yk = ±1.
- What should our quadratic optimization criterion be?
  Minimize (1/2)·w · w + C·Σ_{k=1..R} εk, where the εk are slack variables for points on the wrong side of their margin (e.g., ε2, ε7, ε11 in the figure).
- How many constraints will we have? R.
- What should they be?
  w · xk + b ≥ 1 − εk  if yk = 1
  w · xk + b ≤ −1 + εk if yk = −1
81-82
83
Examples of Kernel Functions
◼ Linear: K(xi, xj) = xiᵀxj
◼ Polynomial of power p: K(xi, xj) = (1 + xiᵀxj)ᵖ
◼ Gaussian (radial-basis function network): K(xi, xj) = exp( −‖xi − xj‖² / (2σ²) )
◼ Sigmoid: K(xi, xj) = tanh(β0·xiᵀxj + β1)
84
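A minimal NumPy sketch of these four kernels (sigma, p, beta0, beta1 are the free parameters from the formulas above):

```python
import numpy as np

def linear_kernel(xi, xj):
    return np.dot(xi, xj)

def polynomial_kernel(xi, xj, p=2):
    return (1.0 + np.dot(xi, xj)) ** p

def gaussian_kernel(xi, xj, sigma=1.0):
    diff = np.asarray(xi, dtype=float) - np.asarray(xj, dtype=float)
    return np.exp(-np.sum(diff ** 2) / (2.0 * sigma ** 2))

def sigmoid_kernel(xi, xj, beta0=1.0, beta1=0.0):
    return np.tanh(beta0 * np.dot(xi, xj) + beta1)

print(gaussian_kernel([0.0, 0.0], [1.0, 1.0]))  # exp(-1) for sigma = 1
```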
85
Dr. LIU Yang
Some contents are from "Jure Leskovec, Anand Rajaraman, and Jeff Ullman, Mining of Massive Datasets (http://www.mmds.org/)"
▪ When we think of a social network, we think of Facebook, Twitter, Google+, …
▪ The essential characteristics of a social network are:
  ▪ There is a collection of entities that participate in the network. Typically, these entities are people, but they could be something else.
  ▪ There is at least one relationship between entities of the network. On Facebook, this relationship is called friends. Sometimes the relationship is all-or-nothing; two people are either friends or they are not. However, in other examples of social networks, the relationship could also be discrete or represented by a real number.
  ▪ There is an assumption of nonrandomness or locality. This condition is the hardest to formalize, but the intuition is that relationships tend to cluster. That is, if entity A is related to both B and C, then there is a higher probability than average that B and C are related.
2
▪ We often think of networks being organized into modules, clusters, communities:
3
4
▪ Discovering social circles, circles of trust: [McAuley, Leskovec: Discovering social circles in ego networks, 2012]
5
How to find communities?
We will work with undirected (unweighted) networks
6
▪ The entities are the nodes A through G.
▪ The relationship, which we might think of as "friends," is represented by the edges.
  ▪ For instance, B is friends with A, C, and D.
7
▪ Edge betweenness: the number of shortest paths passing over the edge
▪ Intuition:
  ▪ An important aspect of social networks is that they contain communities of entities that are connected by many edges.
  ▪ Finding the edges that are least likely to be inside a community.
  ▪ As in golf, a high score is bad. It suggests that the edge (a, b) runs between two different communities; that is, a and b do not belong to the same community.
[Figure: two example edges with betweenness b = 16 and b = 7.5]
8
▪ The edge (B,D) has the highest betweenness, as should surprise no one.
  ▪ In fact, this edge is on every shortest path between any of A, B, and C to any of D, E, F, and G. Its betweenness is therefore 3 × 4 = 12.
▪ In contrast, the edge (D,F) is on only four shortest paths: those from A, B, C, and D to F.
9
10
▪ Want to compute the betweenness of paths starting at node A
▪ Breadth-first search starting from A:
[Figure: the BFS tree rooted at A, with levels 0–4]
11
▪ Count the number of shortest paths from A to all other nodes of the network:
12
▪ Compute betweenness by working up the tree: if there are multiple paths, count them fractionally
The algorithm:
• Add edge flows: node flow = 1 + Σ(child edges); split the flow up based on the parent value
• Repeat the BFS procedure for each starting node U
[Figure: 1+1 paths to H, split evenly; 1+0.5 paths to J, split 1:2; 1 path to K, split evenly]
13-14
▪ Using E as the start node to calculate the betweenness
15
16
▪ Label the root E with 1.
▪ At level 1 are the nodes D and F. Each has only E as a parent, so they too are labeled 1.
▪ Nodes B and G are at level 2. B has only D as a parent, so B's label is the same as the label of D, which is 1. However, G has parents D and F, so its label is the sum of their labels, or 2.
▪ Finally, at level 3, A and C each have only parent B, so their labels are the label of B, which is 1.
17
▪ A and C, being leaves, get credit 1. Each of these nodes has only one parent, so their credit is given to the edges (B,A) and (B,C), respectively.
18
▪ At level 2, G is a leaf, so it gets credit 1. B is not a leaf, so it gets credit equal to 1 plus the credits on the DAG edges entering it from below. Since both these edges have credit 1, the credit of B is 3.
▪ Intuitively, 3 represents the fact that all shortest paths from E to A, B, and C go through B.
19
▪ B has only one parent, D, so the edge (D,B) gets the entire credit of B, which is 3.
20
▪ However, G has two parents, D and F. We therefore need to divide the credit of 1 that G has between the edges (D,G) and (F,G).
21
▪ From the figure in step 2, we observe that both D and F have label 1, representing the fact that there is one shortest path from E to each of these nodes. Thus, we give half the credit of G to each of these edges; i.e., their credit is each 1/(1 + 1) = 0.5.
22
▪ Now, we can assign credits to the nodes at level 1. D gets 1 plus the credits of the edges entering it from below, which are 3 and 0.5. That is, the credit of D is 4.5. The credit of F is 1 plus the credit of the edge (F,G), or 1.5.
▪ Finally, the edges (E,D) and (E,F) receive the credit of D and F, respectively, since each of these nodes has only one parent.
23
▪ The credit on each of the edges is the contribution to the betweenness of that edge due to shortest paths from E. For example, this contribution for the edge (E,D) is 4.5.
▪ To complete the betweenness calculation, we have to repeat this calculation for every node as the root and sum the contributions. Finally, we must divide by 2 to get the true betweenness, since every shortest path will be discovered twice, once for each of its endpoints.
24
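This repeated-BFS credit computation is what library implementations of edge betweenness do. A sketch using NetworkX (the edge list is assumed from the 7-node example figure: the A–B–C triangle bridged through B–D to D's neighbors E, F, G):

```python
import networkx as nx

# Edge list assumed from the example figure
G = nx.Graph([("A", "B"), ("A", "C"), ("B", "C"), ("B", "D"),
              ("D", "E"), ("D", "F"), ("D", "G"), ("E", "F"), ("F", "G")])

# Raw (unnormalized) edge betweenness; NetworkX counts each unordered
# pair of endpoints once, matching the divide-by-2 step described above.
eb = nx.edge_betweenness_centrality(G, normalized=False)
print(sorted(eb.items(), key=lambda kv: -kv[1]))  # ('B','D') scores 12

# Girvan-Newman community detection: repeatedly remove the
# highest-betweenness edge and watch the components split.
communities = next(nx.algorithms.community.girvan_newman(G))
print(communities)  # ({'A','B','C'}, {'D','E','F','G'})
```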
25
COMP 4125/7820
Visual Analytics & Decision
Support
Lecture 7: Network Visualization and
Analytics
Dr. MA Jing
Outline
 Network Data
 Network Visualization
– Node-Link Diagrams
– Adjacency Matrix
– Enclosure
 Network Analytics
– Network Centrality
Network Data
Dataset: Networks
 The dataset type of networks is well suited for specifying that there is some
kind of relationship between two or more items
– An item in a network is often called a node
– A link is a relation between two items
4
Dataset: Networks
 Online Social Network (e.g., Facebook, LinkedIn, etc)
– Nodes: people
– Links: friendship
5
Dataset: Networks
 Gene Interaction Network
– Nodes: genes
– Links: genes have been observed to interact with each other
6
Dataset: Networks - Trees
 Networks with hierarchical structure are more specifically called trees.
 In contrast to a general network, trees do not have cycles: each child node
has only one parent node pointing to it.
Network (Graph) Data Definition
 A Network G(V, E) contains a set of vertices (nodes) V together with a set of
edges (lines) E;
 Networks are also called graphs
 For example
– V = {A, B, C, D}
– E = {{A, B}, {A,C}, {B, C}, {C, D}}
How to draw a network?
V = {A, B, C, D}, E = {{A, B}, {A,C}, {B, C}, {C, D}}
[Figure: two different drawings of the same graph]
Network Visualization
 Node-Link Diagrams
 Adjacency Matrix
 Enclosure
Node-Link Diagrams for Trees
 The most common visual encoding idiom
– Nodes: point marks
– Links: line marks
Triangular vertical node-link layout
- vertical spatial position is used to show the depth in the tree
Spline radial layout
- distance to the center is used to encode the depth in the tree
Node-Link Diagrams for Networks
 A network (also called a graph) is also very commonly represented as a node-link diagram
 The number of hops within a path is used to
measure distances
 Node-link diagrams are well suited for tasks of analyzing the topology
– find all possible paths between two nodes
– find shortest paths between two nodes
– find neighbors from a target node
– find nodes that act as bridges between two groups of nodes
Simple Fixed Graph Layout
Circular Layout
Grid Layout
Random Layout
Force-directed Placement for Node-link Diagram
 One of the most widely used idioms for node-link network diagrams
 Network elements are positioned according to a
simulation of physical forces
– Nodes push away from each other
– Links act like springs that draw their endpoint nodes
closer to each other
 Nodes are placed randomly in the beginning
and their position are iteratively refined
– Pushing and pulling of the simulated spring forces
gradually improve the layout
https://bl.ocks.org/mbostock/4062045
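Besides the D3 demo above, NetworkX exposes the same idiom as a spring layout; a minimal sketch (the graph is the Les Miserables network used in the next example):

```python
import networkx as nx

G = nx.les_miserables_graph()  # character co-occurrence network

# Force-directed placement: nodes repel each other, links act as springs;
# positions start random (seeded here) and are iteratively refined.
pos = nx.spring_layout(G, seed=42, iterations=50)
print(pos["Valjean"])  # the 2D coordinates assigned to one node
```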
Example for Force-directed Layout
 This graph shows character co-occurrence in Les
Miserables.
 Data based on character co-appearance in Victor Hugo's Les Miserables.
 Data:
– Nodes
– Links
Nodes
Links
Example for Force-directed Layout
 Nodes represent different characters.
 Edges represents the co-appearance
relationship.
 Line width encodes an edge attribute (larger line width indicates that characters co-occurred more frequently).
 Node color encodes the group information.
More on Force-directed Placement
 Spatial position does not directly encode any attribute of either nodes or links; the placement algorithm uses it indirectly.
 Spatial proximity does indicate grouping through a strong perceptual cue, but it is sometimes arbitrary
– Nodes may end up near each other because they are repelled from elsewhere, not because they are closely connected.
 Layouts are often non-deterministic because of randomly chosen initial positions
– The layout will look different each time it is computed.
– Spatial memory cannot be exploited across different runs.
– Leads to different proximity relationships each time
Major weakness of force-directed placement
 A major weakness of force-directed placement: Scalability
– Visual Complexity of the layout
– Time required to compute it
 Force-directed approaches yield readable layout quickly for tiny graphs with
dozens of nodes. However, the layout quickly degenerates into a hairball of
visual clutter.
– Tasks of path following become very difficult even on a few hundred
nodes
– Essentially impossible with thousands of nodes or more.
Summary of Force-Directed Placement
Multi-level network: to scale on big network
 The original network is augmented with a derived cluster hierarchy to form a
compound network.
 The cluster hierarchy is computed by coarsening the original network into successively simpler networks that nevertheless attempt to capture the most essential aspects of the original's structure.
– Laying out the simplest version of the networks first
– Improving layout with the more and more complex version
Multilevel scalable force-directed placement
 Significant cluster structure is
visible for large network (less
than 10k nodes)
 However, with a huge graph, it becomes a "hairball" without much visible structure.
Summary of Multilevel Force-Directed Placement
Network Visualization
 Node-Link Diagrams
 Adjacency Matrix
 Enclosure
Adjacency Matrix of A Network
 V = {A, B, C, D}, E = {{A, B}, {A,C}, {B, C}, {C, D}}

    A  B  C  D
A   0  1  1  0
B   1  0  1  0
C   1  1  0  1
D   0  0  1  0
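A small sketch deriving this matrix from the edge list:

```python
import numpy as np

nodes = ["A", "B", "C", "D"]
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")]

idx = {v: i for i, v in enumerate(nodes)}
A = np.zeros((len(nodes), len(nodes)), dtype=int)
for u, v in edges:
    A[idx[u], idx[v]] = 1
    A[idx[v], idx[u]] = 1  # undirected network: the matrix is symmetric
print(A)
```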
Adjacency Matrix View for Network
 Network data can also be encoded with a matrix view by deriving a table from the original network data
 Nodes are labeled in the rows and columns of the matrix; each cell indicates the link between two nodes.
[Figure: a 5-node node-link diagram next to its matrix view, with nodes 1–5 labeling the rows and columns]
Adjacency Matrix View for Larger Network
 Matrix views of networks can achieve very high information density, up to a
limit of one thousand nodes and one million edges
Example of Adjacency Matrix View
https://bost.ocks.org/mike/miserables/
•
Matrix view of character co-occurrence in Les
Miserables.
•
Characters are labeled in rows and columns of the
matrix
•
Each colored cell represents two characters that
appeared in the same chapter; darker cells
indicate characters that co-occurred more
frequently.
Summary of Adjacency Matrix View
Node-link vs. Matrix View
 Strengths of node-link layouts
– Extremely intuitive for small networks.
– They particularly shine for tasks that rely on understanding the topological structure of the network
 Path tracing, search for neighbors
– Also very effective for tasks such as general overview or finding similar substructures.
 Weaknesses of node-link layouts
– Scalability: cannot handle large networks
– Occlusion from edges crossing each other and crossing underneath nodes.
Node-link vs. Matrix View
 Strengths of Matrix View
– Perceptual scalability for both large and dense networks
– Their predictability, stability, and support for reordering
 Predictability: laid out within a predictable amount of space
 Stability: adding a new item causes only a small visual change.
– Quickly estimating the number of nodes and fast node lookup
 Weaknesses of Matrix View
– Unfamiliarity: need training to interpret a matrix view.
 Clique, biclique, degree
– Lack of support for investigating topological structure.
Node-link vs. Matrix View
Clique: Completely
interconnected.
Biclique: every
node in the first set
is connected to
every node of the
second set.
Node-link vs. Matrix View
[Figure: the 5-node example shown both as a node-link diagram and as a matrix]
 Tasks of analyzing the topology
– All possible paths from node 1 to node 5
– Shortest path from node 1 to node 3
Node-link vs. Matrix View
 An empirical study compared node-link and matrix views
– Node-link views are best for small networks and matrix views are best for large networks
– Several tasks became more difficult for node-link views as size increased, but okay for
matrix views
 Estimate the number of nodes and edges
 Finding the most connected node
 Find a node given its label
 Find a direct link between two nodes
 Find a common neighbor between two nodes
– Finding a path between two nodes is difficult in matrix view.
– Topological structure tasks such as path tracing are best supported by node-link views.
Outline
 Network Data (recap)
 Node-Link Diagrams
 Nodes: point marks
 Links: line marks
 Adjacency Matrix
 Directly show adjacency relationships
 Enclosure
 Show hierarchical relationships through
nesting
34
Containment: Hierarchy Marks
 Very effective at showing complete information about
hierarchical structure
– Connection marks in node-link diagrams only show pairwise
relationships
 Example: Tree maps (An alternative to node-link tree)
– Hierarchical relationships are shown with containment rather
than connection mark.
– All of the children of a tree node are enclosed within the area allocated to that node, creating a nested layout.
Treemaps
 A treemap view of a 5161-node
computer file system.
 Node size encodes file size.
 Containment marks are not as effective
as the pair-wise connection marks for
tasks on topological structure.
 Good for tasks that pertain to
understanding attribute values at leaves
of the tree.
 Very effective for spotting the outliers of
very large attribute values.
Treemap: A Simple Example
• Set parent node values to the sum of child node values, from the bottom up
• Partition the space based on the current node's value as a portion of the parent node's value, from the top down
[Figures: the root rectangle is first split 5/7 vs 2/7 between the two subtrees; those regions are then split 4/5 vs 1/5 and 1/2 vs 1/2, so the leaves with values 1, 4, 1, 1 each get an area proportional to their value]
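A sketch of the two passes just described, using simple alternating (slice-and-dice) splits on a nested-list tree; a real treemap library would use a squarified layout, but the proportional partitioning is the same idea:

```python
def node_value(tree):
    """Bottom-up pass: a parent's value is the sum of its children's values."""
    return tree if isinstance(tree, (int, float)) else sum(node_value(c) for c in tree)

def treemap(tree, x, y, w, h, depth=0, out=None):
    """Top-down pass: partition the rectangle among children in proportion to value."""
    if out is None:
        out = []
    if isinstance(tree, (int, float)):
        out.append((x, y, w, h, tree))
        return out
    total = node_value(tree)
    offset = 0.0
    for child in tree:
        frac = node_value(child) / total
        if depth % 2 == 0:   # alternate between horizontal and vertical slicing
            treemap(child, x + offset * w, y, frac * w, h, depth + 1, out)
        else:
            treemap(child, x, y + offset * h, w, frac * h, depth + 1, out)
        offset += frac
    return out

# The slides' example tree: one parent with leaves 1 and 4 (sum 5),
# another with leaves 1 and 1 (sum 2); the root therefore splits 5/7 vs 2/7
for rect in treemap([[1, 4], [1, 1]], 0.0, 0.0, 1.0, 1.0):
    print(rect)
```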
Summary of Treemaps
GrouseFlocks
 Compound networks: a combination of network and tree
– Nodes in the network are the leaves of the tree
– The interior nodes of the tree encompass multiple network nodes
 GrouseFlocks
– Combined view using containment marks for the associated hierarchy and connection marks for the original network links.
Summary of Network Visualization
 Node-link Diagrams
– Extremely intuitive for small networks
– They particularly shine for tasks that rely on understanding the topological structure of the
network
– Can not handle large network
 Matrix View
–
–
–
–
–
Can handle large networks effectively
Their predictability, stability and support for reordering
Quickly estimating the number of nodes and fast node lookup
Need training to interpret matrix view
Lack of supporting for investigating topological structure
 Containment
– Focus on showing hierarchical structure
Network Centrality
Node Importance
Based on the structure of the network,
which are the most important nodes?
Why Network Centrality?
 Centrality = “Importance”
– Which nodes are important based on their network position
– Different ways of thinking about “importance”
 Rank all nodes in a network (graph)
– Find celebrities or influential people in a social network (Twitter)
– Find “gatekeepers” who connect communities (headhunters love to find
them on LinkedIn)
– Important pages on the web (Google search)
 Centrality algorithms are contained in most network (graph) analysis library.
Used them help graph analysis, visualization, etc.
Centrality Algorithms
 Degree
– Important nodes have many connections
 Betweeness
– Important nodes connect other nodes
 Closeness
– Important nodes are close to other nodes
 PageRank
– Important nodes are those with many in-links from important nodes.
Degree Centrality (easiest)
 Assumption:
– Important nodes have many connections
 Degree = number of neighbors
 Directed graph
– In Degree: number of incoming edges
– Out Degree: number of outgoing edges
– Total Degree: In Degree + Out Degree
 Undirected graph, only degree is defined
X has higher centrality than Y
according to total degree
centrality measure
Degree Centrality (Undirected) example
 The nodes with many connections are important.
Degree = number of neighbors
When Degree Centrality is Good
 E.g. Friendship network
 He or She who has many friends is most
important
– People who will do favors for you
– People you can talk to/ have coffee with
When Degree is not good?
 Ability to bridge between group
 Likelihood that information originating
anywhere in the network reaches you
Centrality Algorithms
 Degree
– Important nodes have many connections
 Betweenness
– Important nodes connect other nodes
 Closeness
– Important nodes are close to other nodes
 PageRank
– Important nodes are those with many in-links from important nodes.
Betweenness Centrality
 Assumption
– Important nodes connect other nodes
 Betweenness definition:
betweenness(v) = Σ_{s,t} (number of shortest paths between s and t that go through v) / (number of shortest paths between s and t)
 It quantifies how often a node acts as a bridge that connects two other nodes.
Betweenness on a Toy Network
[Figures: a 7-node network with nodes A–G, followed by the betweenness value computed for each node]
Betweenness Centrality - Complexity
 Computing betweenness centrality of all nodes can be very computationally expensive
 Depending on the algorithm, this computation can take up to O(N³) time, where N is the number of nodes in the graph.
 Approximation algorithms are used for big graph
Centrality Algorithms
 Degree
– Important nodes have many connections
 Betweeness
– Important nodes connect other nodes
 Closeness
– Important nodes are close to other nodes
 PageRank
– Important nodes are those with many in-links from important pages.
Closeness Centrality
 What if it’s not so important to have many direct friends?
 Or be “between” others
 But one still wants to be in the “middle” of things, not too far from the center
Closeness Centrality
 Assumption:
– Important nodes are close to other nodes
 Closeness is based on the length of the average shortest path between a vertex and all vertices in the graph
 Closeness definition: closeness(x) = 1 / Σ_y d(x, y), where d(x, y) is the shortest distance between vertices x and y
Closeness on a toy network
[Figure: a path graph A–B–C–D–E with closeness values A: 0.1, B: 0.143, C: 0.167, D: 0.143, E: 0.1; e.g., for C the distances sum to 2+1+1+2 = 6, so closeness(C) = 1/6 ≈ 0.167]
Closeness Centrality on a Toy Network
[Figure: a tree-shaped network whose leaf nodes have closeness 1/16 and whose interior nodes have closeness 1/11 and 1/10]
Centrality Algorithms
 Degree
– Important nodes have many connections
 Betweeness
– Important nodes connect other nodes
 Closeness
– Important nodes are close to other nodes
 PageRank
– Important nodes are those with many in-links from important nodes.
Pagerank
 Developed by Google founders to measure the
importance of webpages from hyperlink network
structure
 PageRank assigns a score of importance to each node. Important nodes are those with many in-links from important pages.
 PageRank can be used for any type of network,
but it is mainly useful for directed networks
 A node’s PageRank depends on the PageRank of
other nodes.
Larry Page
Sergey Brin
Brin, Sergey and Lawrence Page
(1998). Anatomy of a Large-Scale
Hypertextual Web Search Engine.
7th Intl World Wide Web Conf.
PageRank
 Problem
– Give a directed graph, find its most important
nodes
 Assumption
– Important pages are those with many in-links
from important pages (recursive).
(Simplified) PageRank
 n = number of nodes in the network
 k = number of steps
 1. Assign all nodes a PageRank of 1/n
 2. Perform the (Simplified) PageRank update rule k times
 (Simplified) PageRank Update Rule:
– Each node give an equal share of it current PageRank to all nodes it links
to
– The new PageRank of each node is the sum of all the PageRank it
received from other nodes
(Simplified) PageRank Example
[A sequence of slides walks through Step 1 and Step 2 of the update rule on a small example network.]
What if we continue with k = 4, 5, 6, …?
For most networks, PageRank values converge.
Summary for (Simplified) PageRank
 Steps for (Simplified) PageRank
 1. Assign all nodes a PageRank of 1/n
 2. Perform the (Simplified) PageRank update rule k times
– Each node gives an equal share of its current PageRank to all nodes it links to
– The new PageRank of each node is the sum of all the PageRank it received from other nodes
For most networks, PageRank values converge as k get larger
Matrix Model for (Simplified) PageRank computation
 The adjacency matrix A (Aij = 1 if node j points to i), with "To" nodes A–E as rows and "From" nodes A–E as columns
Transition Matrix
 Normalize the adjacency matrix so that the matrix is a stochastic matrix P (each column sums up to 1)
 Pij: probability of arriving at page i from page j
(Simplified) PageRank Algorithm
 Assign all nodes a PageRank of x = [1/n, 1/n, …, 1/n]
 For k = 1, 2, …: x_{k+1} = P·x_k
[Figure: the matrix-vector multiplication P × x_k = x_{k+1}]
Problem in simplified PageRank
 For a large enough k, F and G each have PageRank of ½ and all other nodes have PageRank 0.
 Why?
– In each iteration, whenever PageRank values are received by node F or G, they get "stuck" on F and G.
 How to solve it?
Full PageRank Algorithm
 To fix this, we introduce a “damping parameter” α
 In each iteration:
– With probability α: choose an outgoing edge at random and follow it to
the next node
– With probability 1- α: Choose a node at random and go to it.
 Why?
– To make the matrix irreducible
– From any node, there’s non-zero probability to reach any other node
Full PageRank Algorithm
 Initialize x0 = [1/n, 1/n, …, 1/n]
 For k = 1, 2, …: x_{k+1} = (αP + (1 − α)E/n)·x_k, where E is an n × n matrix of all 1s
The Final Full PageRank Algorithm
 Note that E = eeᵀ, where e is a column vector of 1's:
x_{k+1} = (αP + (1 − α)eeᵀ/n)·x_k = αP·x_k + (1 − α)eeᵀ·x_k/n = αP·x_k + (1 − α)v, with v = [1/n, 1/n, …, 1/n] (note eᵀx_k = 1)
The Final Full PageRank Algorithm
 Initialize x0 = [1/n, 1/n, …, 1/n]
 For k = 1, 2, …: x_{k+1} = αP·x_k + (1 − α)v
 α is often set as 0.85
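A minimal NumPy sketch of this final power-iteration algorithm (the 3-node transition matrix below is made up for the example):

```python
import numpy as np

def pagerank(P, alpha=0.85, n_iters=100):
    """x_{k+1} = alpha * P @ x_k + (1 - alpha) * v, with v = [1/n, ..., 1/n]."""
    n = P.shape[0]
    x = np.full(n, 1.0 / n)   # initialize x0 = [1/n, ..., 1/n]
    v = np.full(n, 1.0 / n)
    for _ in range(n_iters):
        x = alpha * (P @ x) + (1 - alpha) * v
    return x

# Column-stochastic transition matrix for an illustrative 3-node network
P = np.array([[0.0, 0.5, 1.0],
              [0.5, 0.0, 0.0],
              [0.5, 0.5, 0.0]])
print(pagerank(P))  # PageRank scores, summing to 1
```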
The Full PageRank Algorithm – Step 1
 Initialize x0 = [1/5, 1/5, …, 1/5], α = 0.85
 For k = 1, 2, …: x_{k+1} = αP·x_k + (1 − α)v, v = [1/n, …, 1/n]
 For example, one entry of x1 is 0.85·(1/3·1/5 + 1·1/5) + 0.15·1/5 ≈ 0.26; in full, x1 = 0.85·P·x0 + 0.15·v ≈ [0.26, 0.37, 0.17, 0.12, 0.09]
The Full PageRank Algorithm – Step 2
 x2 = 0.85·P·x1 + 0.15·v
Summary of Centrality Algorithms
 Degree
– Important nodes have many connections
 Betweeness
– Important nodes connect other nodes
 Closeness
– Important nodes are close to other nodes
 PageRank
– Important nodes are those with many in-links from important nodes.
 The best centrality measure depends on the context of the network
Case Study:
Analyzing Golden State Warriors' passing network using GraphFrames in Spark
http://opiateforthemass.es/articles/analyzing-golden-state-warriors-passing-network-using-graphframes-in-spark/
InDegree
Stephen Curry received the most passes.
OutDegree
Draymond Green provides the most passes.
PageRank
 PageRank can be used to compute the importance of the nodes (i.e.,
players)
 Curry, Green and Thompson are the top 3 based on the network data.
COMP 4125/7820
Visual Analytics & Decision
Support
Lecture 12-1: Evaluate and Visualize
Classification Performance
Dr. MA Jing
Classification Evaluation
 Why Evaluation?
 Methods for Estimating a Classifier’s Performance
 Evaluation Metrics
– Accuracy
– Confusion Matrix
– Visualizing Classification Performance using ROC curves and PrecisionRecall Curves
– Extensions to Multi-Class
Why Evaluation?
 Multiple methods are available to build a classification model
– Perceptron
– KNN
– Support Vector Machine (SVM)
– Deep Neural Networks, ….
 For each method, multiple choices are available for parameter settings.
 To choose the best model, we need to assess each model's performance.
 Selecting a meaningful evaluation metric is crucial in classification-related
projects.
Evaluation: Accuracy
 Accuracy is widely used.
– Fraction of correctly classified samples over the whole set of samples
– Number of Correctly Classified Samples / Total Number of Samples

Predicted Label   True Label
1                 1
-1                1
1                 1
-1                -1
1                 1

Accuracy = 4/5 = 0.80
Methods for Estimating a Classifier’s Performance
Test sets
 How can we get an unbiased estimate of the accuracy of a learned model?
Labeled dataset
Training dataset
Training a classification
model
test dataset
Learned Model
When learning a model,
you should pretend you do
not have the test data yet.
Accuracy estimates will be
biased if you used the test
data during the training.
Estimated
accuracy
Validation (Tuning) Sets
 We want to estimate the accuracy during the training stage for tuning the
hyper-parameter of a model (e.g., k in knn)
Labeled dataset
Training dataset
Training dataset
Training classification models
test dataset
validation dataset
select Model
Learned Model
Estimated
accuracy
Limitations of using a single training/test partition
 We may not have enough data to make sufficiently large training and test
sets
– a larger test set gives us a more reliable estimate of accuracy (i.e., a lower
variance estimate)
– but… a larger training set will be more representative of how much data
we actually have for the learning process
 A single training set doesn’t tell us how sensitive accuracy is to a particular
training set
Cross-Validation
 K-fold cross validation
– Create K equal-size partitions of the dataset
– Each partition has N/K samples
– Train using K-1 partitions, test on the remaining partition
– Repeat the process K times, each time with a different test partition

Labeled dataset partitions: S1, S2, S3, S4, S5

Iteration | Training Dataset | Test Dataset
    1     | S2, S3, S4, S5   | S1
    2     | S1, S3, S4, S5   | S2
    3     | S1, S2, S4, S5   | S3
    4     | S1, S2, S3, S5   | S4
    5     | S1, S2, S3, S4   | S5
Cross Validation Example
 5-fold Cross Validation
– Suppose we have 100 labeled samples; we use 5-fold cross validation to
estimate the accuracy, so each partition has 20 samples

Iteration | Training Dataset | Test Dataset | Accuracy
    1     | S2, S3, S4, S5   | S1           | 11/20
    2     | S1, S3, S4, S5   | S2           | 17/20
    3     | S1, S2, S4, S5   | S3           | 16/20
    4     | S1, S2, S3, S5   | S4           | 13/20
    5     | S1, S2, S3, S4   | S5           | 16/20

Overall accuracy = (11+17+16+13+16)/100 = 73/100 = 73%
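A minimal sketch of k-fold cross validation with scikit-learn, run on synthetic data rather than the 100-sample example above.

import numpy as np
from sklearn.model_selection import cross_val_score, KFold
from sklearn.neighbors import KNeighborsClassifier

# Synthetic 2-D dataset: label is 1 when the feature sum exceeds 1.
rng = np.random.RandomState(0)
X = rng.rand(100, 2)
y = (X[:, 0] + X[:, 1] > 1).astype(int)

model = KNeighborsClassifier(n_neighbors=3)
scores = cross_val_score(model, X, y,
                         cv=KFold(n_splits=5, shuffle=True, random_state=0))
print(scores)          # accuracy on each held-out fold
print(scores.mean())   # overall cross-validated accuracy estimate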
Leave-one-out (LOO) Cross Validation
 A special case of K-fold cross validation when K = N (number of training
samples)
– Each partition is now a single data sample
– Train using N-1 data samples, test on the one remaining sample
– Repeat for N times
– Can be expensive for large N. Typically used when N is small.
Evaluation Metrics
Evaluation Metrics
 Accuracy
 Confusion Matrix
 Visualizing Classification Performance using ROC curves and Precision-Recall Curves
 Extensions to Multi-Class
Accuracy
 Accuracy is widely used.
– Fraction of correctly classified samples over the whole set of samples
– Number of Correct Classified Samples / Total Number of Samples
 However, accuracy may be misleading sometimes.
– Consider a test data with imbalanced classes
When Accuracy is Not Good?
 Accuracy with Imbalanced Classes
 Suppose you have a test set with 2 classes containing 1000 samples
– 10 of them are positive class
– 990 of them are negative class
 You build a classifier, and the accuracy on this test set is 96%
 Is it good?
When Accuracy is Not Good
 An imbalanced dataset with 2 classes containing 1000 samples
– 10 of them are positive class
– 990 of them are negative class
 For comparison, suppose we had a “dummy” classifier that didn’t look at the
features at all, and always just blindly predicted the most frequent class
 What is the accuracy?
 Answer
– Accuracy = 990/1000 = 99% > 96%
– The dummy classifier beats the learned classifier, so 96% is actually poor here; accuracy alone is misleading on imbalanced data.
Evaluation Metrics
 Accuracy
 Confusion Matrix
 Visualizing Classification Performance using ROC curves and Precision-Recall Curves
 Extensions to Multi-Class
Confusion Matrix
 Binary Classification Example
• Label 1 = positive class (class of interest)
• Label 0 = negative class (everything else)

                      Predicted Positive | Predicted Negative
True Label Positive |        TP         |        FN
True Label Negative |        FP         |        TN

• TP = True Positive: positive samples correctly classified as belonging to the positive class
• FN = False Negative: positive samples misclassified as belonging to the negative class
• FP = False Positive: negative samples misclassified as belonging to the positive class
• TN = True Negative: negative samples correctly classified as belonging to the negative class
Confusion matrix provides more information than Accuracy. Many evaluation metrics can be derived from it.
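A minimal sketch of building a confusion matrix with scikit-learn, on hypothetical labels; labels=[1, 0] orders the positive class first to match the layout above.

from sklearn.metrics import confusion_matrix

y_true = [1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 0]

# Rows = true label, columns = predicted label, positive class first.
cm = confusion_matrix(y_true, y_pred, labels=[1, 0])
print(cm)
# [[TP FN]
#  [FP TN]]  -> here [[2 1], [1 3]]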
Accuracy and Classification Error from the Confusion Matrix
 Accuracy = (TP + TN) / (TP + TN + FP + FN)
 Classification Error = 1 - Accuracy = (FP + FN) / (TP + TN + FP + FN)
Recall (or True Positive Rate)
 What fraction of all positive samples does the classifier correctly identify as
positive?
 Recall = TP / (TP + FN)
Recall is also known as:
• True Positive Rate (TPR)
• Sensitivity
• Probability of Detection
Precision
 What fraction of positive predictions is correct?
 Precision = TP / (TP + FP)
False Positive Rate
 What fraction of all negative instances does the classifier incorrectly identify
as positive?
 False Positive Rate (FPR) = FP / (FP + TN)
A Graphical Illustration of Precision and Recall
The Precision-Recall Tradeoff
[Figure: two decision thresholds on the same data; one gives high precision with lower recall, the other low precision with high recall]
A tradeoff between precision and recall
 Recall-oriented machine learning tasks
– Search and information extraction in legal discovery
– Tumor detection
– Often paired with a human expert to filter out false positives
 Precision-oriented machine learning tasks
– Search engine ranking, query suggestion
– Document classification
– Many customer-facing tasks (users remember failures)
F1-Score
 F1-Score: combines precision & recall into a single number
F1 = 2 · Precision · Recall / (Precision + Recall)
 F-Score: generalizes the F1-score for combining precision & recall into a single
number
Fβ = (1 + β²) · Precision · Recall / (β² · Precision + Recall)
β allows adjustment of the metric to control the emphasis on recall vs precision
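A minimal sketch of these metrics with scikit-learn, on hypothetical predictions.

from sklearn.metrics import precision_score, recall_score, f1_score, fbeta_score

y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 1, 0, 0, 0, 1]

print(precision_score(y_true, y_pred))        # TP / (TP + FP) = 3/4
print(recall_score(y_true, y_pred))           # TP / (TP + FN) = 3/4
print(f1_score(y_true, y_pred))               # harmonic mean of the two
print(fbeta_score(y_true, y_pred, beta=2))    # beta > 1 emphasizes recall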
Evaluation Metrics
 Accuracy
 Confusion Matrix
 Visualizing Classification Performance using ROC curves and Precision-Recall Curves
 Extensions to Multi-Class
Decision Functions of Classifiers
 The score value for each test point indicates how confidently the classifier
predicts it as the positive class (large-magnitude positive values) or the
negative class (large-magnitude negative values).
 Choosing a fixed decision threshold gives a classification rule.
 By sweeping the decision threshold through the entire range of possible
score values, we get a series of classification outcomes that form a curve.
Predicted Probability of Class Membership
 Typical rule: choose the most likely class
– e.g., classify as class 1 if the predicted probability of class 1 is > 0.50
 Adjusting the threshold affects the predictions of the classifier.
 A higher threshold results in a more conservative classifier
– e.g., only predict class 1 if the estimated probability of class 1 is above 70%
– This increases precision. It doesn't predict class 1 as often, but when it
does, it gets a high proportion of class 1 instances correct.
 Not all models provide realistic probability estimates
Cutoff Table
[Table: records sorted by predicted probability of class "1", with cutoff lines drawn at 0.25 and 0.75]
 If the cutoff is 0.75: 8 records are classified as “1”
 If the cutoff is 0.25: 15 records are classified as “1”
Confusion Matrices for Different Cutoffs
[Figure: the confusion matrices obtained with cutoff 0.75 and with cutoff 0.25]
Creating an ROC curve
 Sort test-set predictions according to confidence that each instance is
positive
 Step through sorted list from high to low confidence
– locate a threshold between instances with opposite classes (keeping
instances with the same confidence value on the same side of threshold)
– compute TPR, FPR for predictions that predict samples above the
threshold as positive
– output (FPR, TPR) coordinate
An example of plotting an ROC curve
[Figure: step-by-step construction of an ROC curve from sorted predictions]
ROC curve
 X-axis: False Positive Rate
 Y-axis: True Positive Rate
 A single confusion matrix corresponds to one
point in ROC Curve
 Top left corner:
– The “ideal” point
– False positive rate is zero
– True positive rate is one
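A minimal sketch of plotting an ROC curve with scikit-learn, using hypothetical scores.

import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

# Hypothetical classifier scores for 10 test samples.
y_true  = [0, 0, 1, 0, 1, 0, 1, 1, 0, 1]
y_score = [0.1, 0.3, 0.35, 0.4, 0.55, 0.6, 0.7, 0.8, 0.85, 0.9]

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # sweeps the decision threshold
print(roc_auc_score(y_true, y_score))              # area under the ROC curve

plt.plot(fpr, tpr)
plt.plot([0, 1], [0, 1], linestyle="--")           # random-guessing diagonal
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.show()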
ROC curve examples: random guessing
 Suppose we have a test set containing p
positive samples and n negative samples
 Random prediction: randomly select r
samples and predict them as positive, the rest
as negative
– Number of true positives: TP = r·p/(p + n)
– Number of false positives: FP = r·n/(p + n)
 TPR = TP/(TP + FN) = r/(p + n)
 FPR = FP/(FP + TN) = r/(p + n)
Note: “TP + FN” is the number of positive samples, which equals p.
“FP + TN” is the number of negative samples, which equals n.
 So TPR = FPR for any r: random guessing traces the diagonal of the ROC plot.
ROC curve examples: perfect classifier
 Suppose we have a perfect
classifier that always assigns a
higher score to a randomly
chosen positive sample than to a
randomly chosen negative
sample.
 TPR = TP/(TP + FN)
 FPR = FP/(FP + TN)
 Its ROC curve passes through the top-left corner: TPR reaches 1 while FPR is still 0.
Summarizing an ROC curve in one number: Area Under the
Curve (AUC)
 AUC = 0 (worst), AUC = 1 (best); random guessing gives AUC = 0.5
 AUC can be interpreted as:
– The total area under the ROC curve.
– The probability that the classifier will
assign a higher score to a randomly
chosen positive example than to a
randomly chosen negative example.
Precision-Recall Curves
 X-axis: Recall
 Y-axis: Precision
 Top right corner:
– The “ideal” point
– Precision = 1.0
– Recall = 1.0
Precision-Recall Curve examples: perfect classifier
 Suppose we have a perfect classifier that always assigns a higher score to a
randomly chosen positive sample than to a randomly chosen negative sample.
 Recall = TP/(TP + FN)
 Precision = TP/(TP + FP)
 Its precision-recall curve passes through the ideal top-right corner.
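A minimal sketch of a precision-recall curve with scikit-learn, reusing the hypothetical scores from the ROC sketch.

import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve

y_true  = [0, 0, 1, 0, 1, 0, 1, 1, 0, 1]
y_score = [0.1, 0.3, 0.35, 0.4, 0.55, 0.6, 0.7, 0.8, 0.85, 0.9]

# Sweeps the threshold, returning one (recall, precision) point per cut.
precision, recall, thresholds = precision_recall_curve(y_true, y_score)

plt.plot(recall, precision)
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.show()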
Precision-Recall Curve examples: random guessing
 Suppose we have a test set containing p positive
samples and n negative samples
 Random prediction: randomly select r samples
and predict them as positive, the rest as negative
– Number of true positives: TP = r·p/(p + n)
– Number of false positives: FP = r·n/(p + n)
 Recall = TP/(TP + FN) = r/(p + n)
 Precision = TP/(TP + FP) = p/(p + n)
Note: “TP + FP” is the number of positive predictions, which equals r.
“TP + FN” is the number of positive samples, which equals p.
 So random guessing gives a flat precision-recall curve at precision = p/(p + n).
Evaluation Metrics
 Accuracy
 Confusion Matrix
 Visualizing Classification Performance using ROC curves and Precision-Recall Curves
 Extensions to Multi-Class
Multi-Class Evaluation
 Multi-class evaluation is an extension of the binary case.
– A collection of true vs predicted binary outcomes, one per class
– Confusion matrices are especially useful
 Overall evaluation metrics are averages across classes
– There are different ways to average multi-class results
Multi-Class Confusion Matrix
 The numbers on the diagonal
represent correct predictions.
[Figure: multi-class confusion matrix of digit predictions; rows = true digit, columns = predicted digit]
Visualize the Multi-Class Confusion Matrix
Standard confusion matrix
Heat-map confusion matrix
 In the standard confusion matrix, it is hard to spot underlying patterns.
 In the heat-map confusion matrix, it is much easier to identify insights for this
multi-class classification, e.g.:
– 3 and 8 are often misclassified as each other
– 5 is misclassified as many different numbers
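A minimal sketch of a heat-map confusion matrix with scikit-learn, on hypothetical digit predictions.

import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

# Made-up digit labels; the display renders the multi-class confusion
# matrix as a heat map so patterns (e.g., 3 vs 8 confusion) stand out.
y_true = [3, 8, 8, 3, 5, 5, 3, 8, 5, 3]
y_pred = [3, 3, 8, 8, 5, 3, 3, 8, 6, 3]

ConfusionMatrixDisplay.from_predictions(y_true, y_pred)
plt.show()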
COMP 4125/7820
Visual Analytics & Decision
Support
Lecture 12-2: Text Data Analytics and
Visualization
Dr. MA Jing
Why Text Analytics
 Text is everywhere
 We use documents as primary information artifact in our lives
 Our access to documents has grown tremendously thanks to the Internet
– WWW: webpages, Twitter, Facebook, Wikipedia, Blogs, ...
– Digital libraries: Google books, ACM, IEEE, ...
– Lyrics, closed captions (e.g., YouTube)
– Police case reports
– Legislation (law)
– Reviews (products, rotten tomatoes)
– Medical reports (EHR - electronic health records)
– Job descriptions
Text Analytics Tasks
 Topic Modeling
 Text Classification
– Spam/ Not Spam
 Text Similarity
– Information Retrieval
 Sentiment Analysis
– Positive/Negative
 Entity Extraction
 Text Summarization
 Machine Translation
 Natural Language Generation
Popular Natural Language Processing (NLP) libraries
 Stanford NLP
 OpenNLP
 NLTK (Python)
Typical capabilities: tokenization, sentence segmentation, part-of-speech
tagging, named entity extraction, chunking, parsing
A Typical Text Analytics Pipeline
Raw Text → Preprocessing → Numerical Representation → Analysis
Outline
 Text Analytics
– Preprocessing (e.g., Tokenize, stemming, remove stop words)
– Document representation (most common: bag-of-words model)
– Word importance (e.g., word count, TF-IDF)
– Latent Semantic Indexing (find “concepts” among documents and words),
which helps with retrieval
 Text Visualization
Preprocessing Raw Text
 Tokenize
– Tokenization is a process that splits an input sequence into so-called tokens (e.g. words)
– “Hello, I’m Dr. Jones.” -> [‘Hello’, ‘I’, ‘m’, ‘Dr.’, ‘Jones’]
 Token Normalization
– Stemming
 A process of removing and replacing suffixes to get to the root form of the word (i.e., the stem)
 cats -> cat
– Lemmatization
 Refers to doing things properly with the use of a vocabulary and morphological analysis
 e.g., am, is, are -> be
– Lowercasing
 Computer -> computer
 Remove Noise
– Remove stop words (e.g., “a”, “the”, “is”)
– Remove punctuation
Example: “John’s car is red, right?”
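A minimal preprocessing sketch with NLTK (one of the libraries listed earlier), applied to the example sentence above.

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize

# One-time downloads for the tokenizer, stop-word list, and WordNet.
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")

text = "John's car is red, right?"
tokens = [t.lower() for t in word_tokenize(text)]   # tokenize + lowercase
tokens = [t for t in tokens if t.isalpha()]         # drop punctuation tokens
tokens = [t for t in tokens if t not in stopwords.words("english")]

stemmer = PorterStemmer()
print([stemmer.stem(t) for t in tokens])            # stemming, e.g., cats -> cat

lemmatizer = WordNetLemmatizer()
print([lemmatizer.lemmatize(t) for t in tokens])    # dictionary-based lemmas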
Document representation: bag-of-words model
 Represent each document as a bag of words,
ignoring words’ ordering. Why? For simplicity.
 Unstructured text becomes a vector of numbers
e.g., docs: “I like visualization”, “I like data”.
1 : “I”
2 : “like”
3 : “data”
4 : “visualization”
“I like visualization” ➡ [1, 1, 0, 1]
“I like data” ➡ [1, 1, 1, 0]
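A minimal bag-of-words sketch with scikit-learn's CountVectorizer; note its vocabulary is ordered alphabetically, so the columns differ from the numbering above.

from sklearn.feature_extraction.text import CountVectorizer

docs = ["I like visualization", "I like data"]

# The default tokenizer drops single-character tokens like "I", so the
# token pattern is relaxed here to keep the slide's full vocabulary.
vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b", lowercase=False)
X = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # ['I' 'data' 'like' 'visualization']
print(X.toarray())                         # one word-count vector per document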
Document Representation
 One possible approach:
– Each entry (row) describes a document
– Attributes describe whether or not a term appears in the document
[Example: binary term-document matrix]
Document Representation
 Another approach:
– Each entry (row) describes a document
– Attributes represent the frequency with which a term appears in the document
[Example: term frequency document matrix]
TF-IDF
A word’s importance score in a document, among N documents
 TF-IDF weighting: give higher weight to terms that are rare among the N documents
 When to use it? Almost everywhere you use “word count”, you can likely use TF-IDF.
 TF: term frequency = # of times the term appears in this document
– high if the term appears many times in this document
– a high value indicates the term is more relevant to this document
 IDF: inverse document frequency = log( N / # of documents containing the term )
– penalizes “common” words that appear in almost every document
– a high value indicates the term is more discriminative
 Final score = TF * IDF
– higher score ➡ more “characteristic”
A Simple TF-IDF Example
 Suppose we have two documents
– Document 1: “This is a sample”
– Document 2: “This is another example”
 Term Frequency Document Matrix

      | this | is | a | another | sample | example
Doc1  |  1   | 1  | 1 |    0    |   1    |    0
Doc2  |  1   | 1  | 0 |    1    |   0    |    1
A Simple TF-IDF Example
Inverse Document Frequency (IDF) = log( N / # of documents containing the term ), base-10 log

Term    | df | N | idf
this    | 2  | 2 | log(2/2) = 0
is      | 2  | 2 | log(2/2) = 0
a       | 1  | 2 | log(2/1) = 0.301
another | 1  | 2 | log(2/1) = 0.301
sample  | 1  | 2 | log(2/1) = 0.301
example | 1  | 2 | log(2/1) = 0.301
IDF is low for words appearing in almost any document.
A Simple TF-IDF Example
 Multiplying each term's TF by its IDF gives the TF-IDF document matrix:

      | this | is |   a   | another | sample | example
Doc1  |  0   | 0  | 0.301 |    0    | 0.301  |    0
Doc2  |  0   | 0  |   0   |  0.301  |   0    |  0.301
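A minimal Python sketch reproducing this worked example by hand (base-10 log, matching the 0.301 values above).

import math

docs = [["this", "is", "a", "sample"],
        ["this", "is", "another", "example"]]
N = len(docs)
vocab = ["this", "is", "a", "another", "sample", "example"]

for term in vocab:
    df = sum(term in doc for doc in docs)       # documents containing the term
    idf = math.log10(N / df)                    # 0 for "this"/"is", 0.301 otherwise
    tfidf = [doc.count(term) * idf for doc in docs]
    print(term, round(idf, 3), [round(w, 3) for w in tfidf])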
Vector Space Model
 Each document ➡ vector
 Each query ➡ vector
 Search for documents ➡ find “similar” vectors
 Cluster documents ➡ cluster “similar” vectors
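A minimal sketch of query-document matching in the vector space model, using cosine similarity over the TF-IDF vectors from the example above; the query vector is a made-up illustration.

import numpy as np

def cosine(u, v):
    # Cosine similarity between two vectors.
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

doc1  = np.array([0.0, 0.0, 0.301, 0.0, 0.301, 0.0])   # TF-IDF row for Doc1
doc2  = np.array([0.0, 0.0, 0.0, 0.301, 0.0, 0.301])   # TF-IDF row for Doc2
query = np.array([0.0, 0.0, 0.301, 0.0, 0.0, 0.0])     # query containing "a"

print(cosine(query, doc1))   # ~0.707: Doc1 is the better match
print(cosine(query, doc2))   # 0.0: Doc2 shares no weighted terms with the query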
Latent Semantic Indexing (LSI)
 Main idea
– map each document into some ‘concepts’
– map each term into some ‘concepts’
 ‘Concept’ : a set of terms, with weights.
 For example, DBMS_concept: “data” (0.8), “system” (0.5), …
Latent Semantic Indexing (LSI)
 Q: How to search, e.g., for a query “system”?
 A: find the corresponding concept(s), and then the corresponding documents
 We may retrieve documents that DON’T contain the term “system”, but that
contain almost everything else (“data”, “retrieval”)
LSI - Discussion
 Great idea
– to derive ‘concepts’ from documents
– to build a ‘thesaurus’ (words with similar meaning) automatically
– to reduce dimensionality (down to few “concepts”)
 How does LSI work?
– Uses Singular Value Decomposition (SVD)
SVD Definition
 A = U Λ V^T
 A: n x m matrix
– e.g., n documents, m terms
 U: n x r matrix
– e.g., n documents, r concepts
 Λ: r x r diagonal matrix
– r: rank of the matrix; the diagonal entries give the strength of each ‘concept’
 V: m x r matrix
– e.g., m terms, r concepts
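A minimal LSI sketch with NumPy; the document-term matrix A below is a made-up example in the spirit of the slides, not the actual figures.

import numpy as np

# Rows = documents, columns = terms. Docs 1-3 use one group of terms,
# docs 4-5 another, so the matrix has two underlying 'concepts'.
A = np.array([[1, 1, 1, 0, 0],
              [2, 2, 2, 0, 0],
              [1, 1, 1, 0, 0],
              [0, 0, 0, 1, 1],
              [0, 0, 0, 2, 2]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2                                  # keep the 2 strongest concepts
U_k, s_k, V_k = U[:, :k], s[:k], Vt[:k].T

query = np.array([1, 0, 0, 0, 0])      # query containing only the first term
concept = query @ V_k                  # fold the query into concept space
scores = U_k @ np.diag(s_k) @ concept  # low-rank similarity to each document
print(scores)   # docs 1-3 score high, even where the exact query term is absent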
SVD - Example
[Figure: SVD of an example document-term matrix into U, Λ, and V]
Case Study: How to do queries with LSI?
 For example, how to find documents relevant to the query ‘data’?
 A: Map the query vector into ‘concept space’, using the inner product (cosine similarity)
with each ‘concept’ vector vi
 How would the document (‘information’, ‘retrieval’) be handled?
 Document (‘information’, ‘retrieval’) will be retrieved by the query (‘data’), even
though it does not contain ‘data’!!
Outline
 Text Analytics
– Preprocessing (e.g., Tokenize, stemming, remove stop words)
– Document representation (most common: bag-of-words model)
– Word importance (e.g., word count, TF-IDF)
– Latent Semantic Indexing (find “concepts” among documents and words),
which helps with retrieval
 Text Visualization
Text Visualization: Word/Tag Cloud
 One of the most intuitive and commonly used techniques for visualizing words.
 Font size indicates the word frequency.
 Fails to uncover relationships between words.
Word Tree
 A tree-based visualization to capture word relationships.
 Word Tree summarizes text data via a syntax tree
– Sentences are aggregated by their shared words and split into branches at the
place where the corresponding words in the sentences diverge.
https://www.jasondavies.com/wordtree/
Phrase Net
 Phrase Net uses a node-link diagram
– Nodes are keywords
– Links represent relationships among keywords
 For example, a user selects a predefined regular expression from a list to
extract a pattern (“X and Y”)
http://hint.fm/projects/phrasenet/
Topic Model Visualization
 A tabular view (left) displays term-topic distributions for an LDA topic model.
 A bar chart (right) shows the marginal probability of each term.
http://vis.stanford.edu/papers/termite
Text Visualization Example
 An analysis of the use of different words across the texts of the U.S.
president's annual “State of the Union” address, showing which terms are used
how often as the national situation changes.
 On the left, a visualization of the text is linked to the search box; a table
showing the frequency of term usage graphically across time is shown on the right.
Text Visualization Example
 A visualization of the relative
popularity of U.S. baby names
across time, using a stacked line
graph.
 All matching names are shown, in
alphabetical order, with the
frequency for a given year
determining how much space lies
between the line for that name and
the name below it.
 The naming frequency changes over
time, producing an impression of
undulating waves.
More Text Visualization Examples
http://textvis.lnu.se/