Stream Clustering
CSE 902
Big Data
Stream analysis
Stream: Continuous flow of data
Challenges
◦ Volume: Not possible to store all the data
◦ One-time access: Not possible to process the data using multiple
passes
◦ Real-time analysis: Certain applications need real-time analysis of
the data
◦ Temporal Locality: Data evolves over time, so the model should be
adaptive.
Stream Clustering
[Figure: article listings grouped into topic clusters]
Stream Clustering
• Online Phase: summarize the data into memory-efficient data structures
• Offline Phase: use a clustering algorithm to find the data partition
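A minimal sketch of this online/offline split; the class name, its parameters (max_summaries, merge_threshold) and the eviction rule are illustrative choices, not from any specific algorithm:

import numpy as np
from sklearn.cluster import KMeans

class TwoPhaseStreamClusterer:
    """Illustrative online/offline split: the online phase keeps a bounded
    set of summaries (here, plain centroids with counts); the offline phase
    clusters the weighted summaries on demand."""

    def __init__(self, max_summaries=100, k=5, merge_threshold=1.0):
        self.max_summaries = max_summaries
        self.k = k
        self.merge_threshold = merge_threshold
        self.centers = []   # summary centroids
        self.counts = []    # number of points absorbed by each summary

    def online_update(self, x):
        # Absorb the point into the nearest summary if it is close enough,
        # otherwise start a new summary (evicting the smallest one if full).
        x = np.asarray(x, dtype=float)
        if self.centers:
            d = [np.linalg.norm(x - c) for c in self.centers]
            j = int(np.argmin(d))
            if d[j] < self.merge_threshold:
                n = self.counts[j]
                self.centers[j] = (n * self.centers[j] + x) / (n + 1)
                self.counts[j] = n + 1
                return
        if len(self.centers) >= self.max_summaries:
            smallest = int(np.argmin(self.counts))
            del self.centers[smallest], self.counts[smallest]
        self.centers.append(x)
        self.counts.append(1)

    def offline_cluster(self):
        # Run a standard clustering algorithm on the weighted summaries.
        km = KMeans(n_clusters=self.k, n_init=10)
        km.fit(np.vstack(self.centers), sample_weight=np.array(self.counts))
        return km.cluster_centers_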
Stream Clustering Algorithms
Data Structures       Examples
Prototypes            STREAM, LSearch
CF-Trees              Scalable k-means, Single-pass k-means
Microcluster Trees    ClusTree, DenStream, HPStream
Grids                 D-Stream, ODAC
Coreset Tree          StreamKM++
Prototypes
• STREAM, LSearch
CF-Trees
• Summarize the data in each CF-vector:
◦ Linear sum of data points
◦ Squared sum of data points
◦ Number of points
• Used in Scalable k-means and Single-pass k-means
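A minimal sketch of such a CF-vector, assuming Euclidean data; the class and method names (CFVector, add, merge, centroid, radius) are illustrative:

import numpy as np

class CFVector:
    """Clustering feature: linear sum, sum of squared norms, and count of
    the points absorbed so far. Supports O(d) insertion and merging."""

    def __init__(self, dim):
        self.ls = np.zeros(dim)   # linear sum of data points
        self.ss = 0.0             # sum of squared norms of data points
        self.n = 0                # number of points

    def add(self, x):
        x = np.asarray(x, dtype=float)
        self.ls += x
        self.ss += float(x @ x)
        self.n += 1

    def merge(self, other):
        self.ls += other.ls
        self.ss += other.ss
        self.n += other.n

    def centroid(self):
        return self.ls / self.n

    def radius(self):
        # RMS deviation of the absorbed points from the centroid:
        # sqrt(ss/n - ||centroid||^2)
        c = self.centroid()
        return np.sqrt(max(self.ss / self.n - float(c @ c), 0.0))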
Microclusters
CF-Trees with “time” element
CluStream
• Linear sum and square sum of timestamps
• Delete old microclusters, or merge microclusters whose timestamps are
close to each other
Sliding Window Clustering
• Timestamp of the most recent data point added to
the vector
• Maintain only the most recent T microclusters
DenStream
• Microclusters are associated with weights based on
recency
• Outliers handled by keeping them in separate (outlier) microclusters
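A minimal sketch of a time-aware microcluster in the spirit of CluStream/DenStream, assuming an exponential decay 2^(-lambda * age); the decay rate and outlier threshold are illustrative parameters:

import numpy as np

class MicroCluster:
    """CF-vector extended with a recency-weighted count: every absorbed
    point contributes a weight that decays as 2^(-decay * age)."""

    def __init__(self, dim, decay=0.01):
        self.decay = decay
        self.ls = np.zeros(dim)
        self.ss = 0.0
        self.weight = 0.0
        self.last_update = 0.0    # timestamp of the most recent point

    def _fade(self, t):
        # Apply exponential decay for the time elapsed since the last update.
        factor = 2.0 ** (-self.decay * (t - self.last_update))
        self.ls *= factor
        self.ss *= factor
        self.weight *= factor
        self.last_update = t

    def add(self, x, t):
        self._fade(t)
        x = np.asarray(x, dtype=float)
        self.ls += x
        self.ss += float(x @ x)
        self.weight += 1.0

    def is_outlier(self, t, min_weight=1.0):
        # Microclusters whose decayed weight stays below a threshold are
        # treated as potential outliers (DenStream-style outlier buffer).
        factor = 2.0 ** (-self.decay * (t - self.last_update))
        return self.weight * factor < min_weight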
Grids
D-Stream
• Assign the data to grids
• Grids weighted by the recency of the points added to them
• Each grid associated with a label
DGClust
• Distributed clustering of sensor data
• Sensors maintain local copies of the grid and communicate grid updates
to a central site
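A minimal D-Stream-style sketch of grid summaries with recency-weighted densities; the cell width, per-step decay factor, and density threshold are illustrative:

import numpy as np
from collections import defaultdict

class DStreamGrid:
    """Map each point to a grid cell and keep a density per cell that
    decays over time; dense cells can later be grouped into clusters."""

    def __init__(self, cell_width=1.0, decay=0.998):
        self.cell_width = cell_width
        self.decay = decay                  # per-time-step decay factor
        self.density = defaultdict(float)
        self.last_update = defaultdict(float)
        self.label = {}                     # cluster label per cell

    def _cell(self, x):
        return tuple(np.floor(np.asarray(x) / self.cell_width).astype(int))

    def add(self, x, t):
        g = self._cell(x)
        # Decay the cell's density for the elapsed time, then add the point.
        dt = t - self.last_update[g]
        self.density[g] = self.density[g] * (self.decay ** dt) + 1.0
        self.last_update[g] = t

    def dense_cells(self, t, threshold=3.0):
        # Cells whose decayed density exceeds the threshold.
        return [g for g, d in self.density.items()
                if d * (self.decay ** (t - self.last_update[g])) >= threshold]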
StreamKM++ (Coresets)
 A weighted set S is a (k, ε)-coreset for a data set D if clustering S approximates
clustering D within an error margin of ε:
$(1 - \varepsilon)\,\mathrm{dist}(D, C) \le \mathrm{dist}_w(S, C) \le (1 + \varepsilon)\,\mathrm{dist}(D, C)$ for any set of $k$ centers $C$
 Maintain the data in buckets $B_1, B_2, \ldots, B_L$. Buckets $B_2$ to $B_L$ each contain
either exactly 0 or exactly m points; $B_1$ can hold anywhere from 0 to m points.
 Merge the data in the buckets using a coreset tree (a simplified sketch of the
bucket scheme is given below).
StreamKM++: A Clustering Algorithm for Data Streams, Ackermann, Journal of Experimental Algorithmics 2012
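A simplified sketch of the bucket scheme under the stated 0-or-m invariant. For brevity, the reduce step below is a plain weighted subsample standing in for the coreset-tree construction that StreamKM++ actually uses; the class and method names are illustrative:

import numpy as np

class CoresetBuckets:
    """Buckets B_1 ... B_L of size m: new points accumulate in B_1; when
    B_1 fills up, its contents are pushed upward, merging and reducing any
    full bucket encountered back to m weighted points."""

    def __init__(self, m, rng=None):
        self.m = m
        self.b1 = []            # B_1: accumulates up to m raw points
        self.buckets = []       # B_2, B_3, ...: each None or (points, weights)
        self.rng = rng or np.random.default_rng(0)

    def _reduce(self, pts, wts):
        # Reduce 2m weighted points to m, preserving the total weight
        # (a stand-in for StreamKM++'s coreset-tree reduction).
        p = wts / wts.sum()
        idx = self.rng.choice(len(pts), size=self.m, replace=False, p=p)
        return pts[idx], np.full(self.m, wts.sum() / self.m)

    def insert(self, x):
        self.b1.append(np.asarray(x, dtype=float))
        if len(self.b1) < self.m:
            return
        carry = (np.vstack(self.b1), np.ones(self.m))
        self.b1 = []
        for i in range(len(self.buckets) + 1):
            if i == len(self.buckets):
                self.buckets.append(None)
            if self.buckets[i] is None:
                self.buckets[i] = carry
                return
            pts = np.vstack([self.buckets[i][0], carry[0]])
            wts = np.concatenate([self.buckets[i][1], carry[1]])
            self.buckets[i] = None
            carry = self._reduce(pts, wts)

    def coreset(self):
        # Union of B_1 and all occupied buckets = current weighted summary
        # (assumes at least one point has been inserted).
        parts = [(np.vstack(self.b1), np.ones(len(self.b1)))] if self.b1 else []
        parts += [b for b in self.buckets if b is not None]
        pts = np.vstack([p for p, _ in parts])
        wts = np.concatenate([w for _, w in parts])
        return pts, wts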
Kernel-based Clustering
Example feature map: $\phi(x, y) = (x^2, \sqrt{2}\,xy, y^2)^T$
Kernel function: $K(a, b) = \phi(a)^T \phi(b)$
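A quick numeric check of this identity for the degree-2 polynomial kernel on 2-D inputs; the function names are illustrative:

import numpy as np

def phi(v):
    # Explicit feature map for 2-D input: (x^2, sqrt(2)*x*y, y^2)
    x, y = v
    return np.array([x * x, np.sqrt(2) * x * y, y * y])

def kernel(a, b):
    # Equivalent kernel: K(a, b) = (a . b)^2 = phi(a) . phi(b)
    return float(np.dot(a, b)) ** 2

a, b = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(np.dot(phi(a), phi(b)), kernel(a, b))   # both print 1.0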
Kernel-based Stream Clustering
 Use non-linear distance measures to define similarity between
data points in the stream
 Challenges
 Quadratic running time complexity
 Expensive to compute cluster centers: in kernel space they cannot be maintained
with linear sums and squared sums, so the CF-vector approach will not work
Stream Kernel k-means (sKKM)
[Diagram: the stream is processed in chunks $X_1, X_2, X_3, \ldots$; kernel k-means on one
chunk produces weights $(w_1, w_2, \ldots, w_k)$ that are passed to weighted kernel k-means
on the next chunk, which outputs the clusters $(C_1, C_2, \ldots, C_k)$]
History from only the preceding data chunk is retained.
Approximation of Kernel k-Means for Streaming Data, Havens, ICPR 2012
Statistical Leverage Scores
Measures the influence of a point in the low-rank approximation
$M = \sum_{i=1}^{k} \lambda_i u_i v_i^T$
Leverage score: $l_i = \sum_{j=1}^{k} \big(u_j^{(i)}\big)^2$, where $u_j^{(i)}$ is the $i$-th entry of $u_j$
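A small numpy sketch of this computation for the rows of a matrix; the function name is illustrative:

import numpy as np

def leverage_scores(M, k):
    """Row leverage scores w.r.t. the best rank-k approximation:
    l_i = squared norm of the i-th row of U_k (top-k left singular vectors)."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    Uk = U[:, :k]
    return np.sum(Uk ** 2, axis=1)

# Rows aligned with the leading singular directions get scores near 1;
# "unimportant" rows get scores near 0.
M = np.array([[10.0, 0.0, 0.0],
              [ 0.0, 0.1, 0.0],
              [ 0.0, 0.0, 0.1]])
print(leverage_scores(M, k=1))   # approximately [1, 0, 0]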
Statistical Leverage Scores
Used to characterize the matrices which can be approximated accurately with a sample of the
entries
[Example matrix M and its SVD: the singular vectors coincide with the standard basis vectors]
Leverage scores are 1, 1, 1 – all rows are equally important, so all the entries of the
matrix need to be sampled
If singular vectors/eigenvectors are spread out (uncorrelated with the standard basis),
then we can approximate the matrix with a small number of samples
Approximate Stream kernel k-means
o Uses statistical leverage score to determine which data points in the stream are
potentially “important”
o Retain the important points and discard the rest
o Use an approximate version of kernel k-means to obtain the clusters – Linear
time complexity
o Bounded amount of memory
Approximate Stream kernel k-means
Importance Sampling
Given the eigendecomposition $K = V_k \Sigma_k V_k^T$ of the current kernel matrix, the
sampling probability for point $x_t$ is
$p_t = \frac{1}{k} \lVert V_k^{(t)} \rVert_2^2$
(the leverage score of $x_t$, scaled by $1/k$).
Kernel matrix construction
$K_t = \begin{pmatrix} K_{t-1} & \varphi \\ \varphi^T & \kappa(x_t, x_t) \end{pmatrix}$ with probability $p_t$, and $K_t = K_{t-1}$ with probability $1 - p_t$,
where $\varphi$ is the vector of kernel similarities between $x_t$ and the previously retained points.
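A simplified sketch of this step, assuming an RBF kernel and recomputing the top-k eigenvectors from scratch rather than updating them incrementally (the incremental update is covered later); the function names and parameters are illustrative:

import numpy as np

def rbf(a, b, gamma=1.0):
    return np.exp(-gamma * np.sum((a - b) ** 2))

def process_point(K, retained, x_t, k, rng):
    """Decide whether to keep x_t. K is the kernel matrix of the points
    retained so far; if x_t is kept, K grows by one row/column."""
    if not retained:
        retained.append(x_t)
        return np.array([[rbf(x_t, x_t)]]), True
    # Kernel similarities between x_t and the retained points.
    phi = np.array([rbf(r, x_t) for r in retained])
    # Candidate kernel matrix including x_t.
    K_new = np.block([[K, phi[:, None]],
                      [phi[None, :], np.array([[rbf(x_t, x_t)]])]])
    # Leverage-score-based sampling probability from the top-k eigenvectors.
    w, V = np.linalg.eigh(K_new)
    Vk = V[:, np.argsort(w)[::-1][:k]]
    p_t = min(1.0, np.sum(Vk[-1] ** 2) / k)   # row of V_k corresponding to x_t
    if rng.random() < p_t:
        retained.append(x_t)
        return K_new, True
    return K, False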
Clustering
• Using kernel k-means to recluster M each time a point is added will be expensive
• Reduce complexity by employing a low-dimensional representation of the data
Kernel k-means:
$\min_{U \in \{0,1\}^{k \times s}} \; \max_{c_j(\cdot) \in \mathcal{H}_\kappa} \; \sum_{j=1}^{k} \sum_{i=1}^{s} \frac{U_{ji}}{s_j} \, \lVert c_j(\cdot) - \kappa(x_i, \cdot) \rVert_{\mathcal{H}_\kappa}^2$
• Constrain the cluster centers to the top k eigenvectors of the kernel matrix
"Approximate" kernel k-means:
$\min_{U \in \{0,1\}^{k \times s}} \; \max_{c_j(\cdot) \in \mathcal{H}_a} \; \sum_{j=1}^{k} \sum_{i=1}^{s} \frac{U_{ji}}{s_j} \, \lVert c_j(\cdot) - \kappa(x_i, \cdot) \rVert_{\mathcal{H}_\kappa}^2$, where $\mathcal{H}_a = \mathrm{span}(v_1, \ldots, v_k)$
Clustering
"Approximate" kernel k-means:
$\min_{U \in \{0,1\}^{k \times s}} \; \max_{c_j(\cdot) \in \mathcal{H}_a} \; \sum_{j=1}^{k} \sum_{i=1}^{s} \frac{U_{ji}}{s_j} \, \lVert c_j(\cdot) - \kappa(x_i, \cdot) \rVert_{\mathcal{H}_\kappa}^2$, where $\mathcal{H}_a = \mathrm{span}(v_1, \ldots, v_k)$
Substituting the optimal centers $c_j(\cdot)$ (each expressed through $V_k \Sigma_k^{1/2}$ and scaled by $1/n_j$) reduces the problem to
$\max_{U \in \{0,1\}^{k \times s}} \mathrm{tr}(U V_k \Sigma_k V_k^T U^T)$
Solve by running k-means on the rows of $V_k \Sigma_k^{1/2}$ – $O(sk^2 l)$ running time complexity
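A minimal sketch of this reduction, assuming the kernel matrix of the s retained points is available; scikit-learn's KMeans stands in for any Lloyd-style solver:

import numpy as np
from sklearn.cluster import KMeans

def approximate_kernel_kmeans(K, k):
    """Cluster the s retained points using only the top-k eigenpairs of
    their kernel matrix: embed each point as a row of V_k * Sigma_k^{1/2}
    and run standard k-means in that k-dimensional space."""
    w, V = np.linalg.eigh(K)                  # eigenvalues in ascending order
    top = np.argsort(w)[::-1][:k]
    Vk, sk = V[:, top], np.maximum(w[top], 0.0)
    embedding = Vk * np.sqrt(sk)              # s x k matrix V_k Sigma_k^{1/2}
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(embedding)
    return labels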
Updating eigenvectors
• Only eigenvectors and eigenvalues of kernel matrix are required for both sampling and
clustering
• Update the eigenvectors and eigenvalues incrementally
$O(sk + k^3)$ running time complexity
$K = V \Sigma V^T$
$K + a a^T = \begin{bmatrix} V & \frac{p}{\lVert p \rVert} \end{bmatrix} \Sigma' \begin{bmatrix} V & \frac{p}{\lVert p \rVert} \end{bmatrix}^T$, where $p = (I - V V^T)\,a$ is the component of $a$ orthogonal to $V$
$\Sigma'$ contains the eigenvalues of a small sparse matrix built from $\Sigma$, $V^T a$, and $p$
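A numpy sketch of a rank-one eigen-update in this spirit, assuming only the top-k eigenpairs of K are stored; the exact small-matrix construction used in the paper may differ:

import numpy as np

def rank_one_eig_update(V, lam, a):
    """Given the top-k eigenpairs (V, lam) of a symmetric K, return
    approximate top-k eigenpairs of K + a a^T without touching K itself."""
    b = V.T @ a                      # component of a inside span(V)
    p = a - V @ b                    # component orthogonal to span(V)
    rho = np.linalg.norm(p)
    k = len(lam)
    if rho > 1e-12:
        Q = np.column_stack([V, p / rho])
        m = np.concatenate([b, [rho]])
        S = np.diag(np.concatenate([lam, [0.0]])) + np.outer(m, m)
    else:
        Q = V
        S = np.diag(lam) + np.outer(b, b)
    w, U = np.linalg.eigh(S)         # small (k+1)x(k+1) eigenproblem: O(k^3)
    top = np.argsort(w)[::-1][:k]
    return Q @ U[:, top], w[top]     # updated eigenvectors and eigenvalues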
Approximate Stream Kernel k-means
Network Traffic Monitoring
 Clustering used to detect intrusions in the network
 Network Intrusion Data set
 TCP dump data from seven weeks of LAN traffic
 10 classes: 9 types of intrusions, 1 class of legitimate traffic.
Algorithm                            Running time (ms per point)    Cluster accuracy (NMI)
Approximate stream kernel k-means    6.6                            14.2
StreamKM++                           0.8                            7.0
sKKM                                 42.1                           13.3
Around 200 points clustered per second
Summary
 Efficient kernel-based stream clustering algorithm - linear running time complexity
 Memory required is bounded
 Real-time clustering is possible
 Limitation: does not account for data evolution