Stream Clustering
CSE 902: Big Data

Stream Analysis
Stream: a continuous flow of data
Challenges
◦ Volume: it is not possible to store all of the data
◦ One-time access: the data cannot be processed with multiple passes
◦ Real-time analysis: certain applications need real-time analysis of the data
◦ Temporal locality: the data evolve over time, so the model should be adaptive

Stream Clustering
[Figure: a stream of articles grouped into topic clusters, with the article listings shown under each topic]
• Online phase
  ◦ Summarize the data into memory-efficient data structures
• Offline phase
  ◦ Use a clustering algorithm to find the data partition

Stream Clustering Algorithms

Data structure        Examples
Prototypes            Stream, LSearch
CF-Trees              Scalable k-means, single-pass k-means
Microcluster trees    ClusTree, DenStream, HPStream
Grids                 D-Stream, ODAC
Coreset tree          StreamKM++

Prototypes
• Stream, LSearch

CF-Trees
• Summarize the data in each CF-vector:
  ◦ Linear sum of the data points
  ◦ Squared sum of the data points
  ◦ Number of points
• Examples: Scalable k-means, single-pass k-means

Microclusters
CF-Trees with a "time" element
• CluStream
  ◦ Linear sum and squared sum of the timestamps
  ◦ Delete old microclusters and merge microclusters whose timestamps are close to each other
• Sliding-window clustering
  ◦ Each vector stores the timestamp of the most recent data point added to it
  ◦ Maintain only the most recent T microclusters
• DenStream
  ◦ Microclusters are assigned weights based on recency
  ◦ Outliers are detected by placing them in separate microclusters

Grids
• D-Stream
  ◦ Assign the data to grid cells
  ◦ Grid cells are weighted by the recency of the points added to them
  ◦ Each grid cell is associated with a cluster label
• DGClust
  ◦ Distributed clustering of sensor data
  ◦ Sensors maintain local copies of the grid and communicate grid updates to a central site

StreamKM++ (Coresets)
A weighted set S is a (k, ε)-coreset for a data set D if the clustering of S approximates the clustering of D within an error margin of ε:

(1 − ε) · dist(D, C) ≤ dist_w(S, C) ≤ (1 + ε) · dist(D, C)

• Maintain the data in buckets B_1, B_2, …, B_L
• Buckets B_2 to B_L contain either exactly 0 or exactly m points; B_1 can hold any number of points between 0 and m
• Merge the data in the buckets using a coreset tree (a minimal sketch of this bucket scheme follows below)

StreamKM++: A Clustering Algorithm for Data Streams, Ackermann et al., Journal of Experimental Algorithmics, 2012
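The sketch below illustrates the merge-and-reduce bucket maintenance described above, assuming Python. The names CoresetBuckets, coreset_reduce, and insert are illustrative, and coreset_reduce is a placeholder (weighted subsampling) standing in for the coreset-tree reduction of Ackermann et al.; it is not the authors' implementation.

```python
import random

def coreset_reduce(points, m):
    """Reduce a list of (point, weight) pairs to m weighted representatives.
    Placeholder for the coreset-tree reduction of StreamKM++: here we simply
    subsample and rescale the weights so that the total weight is preserved."""
    if len(points) <= m:
        return list(points)
    total_w = sum(w for _, w in points)
    sample = random.sample(points, m)
    sample_w = sum(w for _, w in sample)
    scale = total_w / sample_w
    return [(x, w * scale) for (x, w) in sample]

class CoresetBuckets:
    """Buckets B_1..B_L: B_1 holds between 0 and m points,
    every other bucket holds exactly 0 or exactly m points."""
    def __init__(self, m):
        self.m = m
        self.buckets = [[]]            # buckets[0] is B_1

    def insert(self, point):
        self.buckets[0].append((point, 1.0))   # every new point has weight 1
        if len(self.buckets[0]) < self.m:
            return
        # B_1 is full: push its m points up the bucket list, merging and
        # reducing whenever the target bucket is already full (like a carry).
        carry, self.buckets[0] = self.buckets[0], []
        i = 1
        while True:
            if i == len(self.buckets):
                self.buckets.append([])
            if not self.buckets[i]:
                self.buckets[i] = carry
                break
            merged = self.buckets[i] + carry   # 2m points
            self.buckets[i] = []
            carry = coreset_reduce(merged, self.m)
            i += 1

    def coreset(self):
        """Weighted union of all buckets, used by the offline phase."""
        return [p for b in self.buckets for p in b]
```

In StreamKM++ the offline phase then runs k-means++ on the final coreset; here that corresponds to clustering the weighted points returned by coreset().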
Kernel-based Clustering
Example feature map: φ(x, y) = (x², √2·xy, y²)^T
Kernel function: K(a, b) = φ(a)^T φ(b)

Kernel-based Stream Clustering
Use non-linear distance measures to define the similarity between data points in the stream
Challenges
◦ Quadratic running-time complexity
◦ Computing the centers from linear sums and squared sums is no longer possible, so the CF-vector approach will not work

Stream Kernel k-means (sKKM)
[Figure: the stream is processed in chunks X_1, X_2, X_3, …; kernel k-means on one chunk produces weights (w_1, w_2, …, w_k) that are passed to weighted kernel k-means on the next chunk, which outputs the clusters (C_1, C_2, …, C_k)]
• History from only the preceding data chunk is retained
Approximation of Kernel k-Means for Streaming Data, Havens, ICPR 2012

Statistical Leverage Scores
Measure the influence of a point in the low-rank approximation
M = Σ_{i=1}^{k} λ_i u_i v_i^T
Leverage score: l_i = Σ_{j=1}^{k} (u_j(i))², where u_j(i) is the i-th entry of u_j

Statistical Leverage Scores
Used to characterize the matrices that can be approximated accurately from a sample of their entries
[Example: a 3×3 matrix whose singular vectors coincide with the standard basis has leverage scores 1, 1, 1 – all rows are equally important, so all the entries of the matrix need to be sampled]
If the singular vectors/eigenvectors are spread out (uncorrelated with the standard basis), then the matrix can be approximated with a small number of samples

Approximate Stream Kernel k-means
◦ Uses statistical leverage scores to determine which data points in the stream are potentially "important"
◦ Retains the important points and discards the rest
◦ Uses an approximate version of kernel k-means to obtain the clusters – linear time complexity
◦ Bounded amount of memory

Approximate Stream Kernel k-means: Importance Sampling
K = V_k Σ_k V_k^T
Sampling probability: p_t = (1/k) ‖V_k^{(t)}‖_2², where V_k^{(t)} is the row of V_k corresponding to x_t
Kernel matrix construction (a minimal sketch of this sampling step appears after the summary below):
K_t = [ K_{t-1}  φ ; φ^T  κ(x_t, x_t) ]   with probability p_t
K_t = K_{t-1}                             with probability 1 − p_t
where φ is the vector of kernel values between x_t and the retained points

Clustering
• Using kernel k-means to recluster the retained points each time a point is added would be expensive
• Reduce the complexity by employing a low-dimensional representation of the data
Kernel k-means:
min_{U ∈ {0,1}^{k×s}} max_{c_j(·) ∈ H_κ} Σ_{j=1}^{k} Σ_{i=1}^{s} (U_{ji}/s_j) ‖c_j(·) − κ(x_i, ·)‖²_{H_κ}
• Constrain the cluster centers to the span of the top k eigenvectors of the kernel matrix
"Approximate" kernel k-means:
min_{U ∈ {0,1}^{k×s}} max_{c_j(·) ∈ H_a} Σ_{j=1}^{k} Σ_{i=1}^{s} (U_{ji}/s_j) ‖c_j(·) − κ(x_i, ·)‖²_{H_κ},   H_a = span(v_1, …, v_k)

Clustering (continued)
• Substituting the optimal centers c_j(·), which can be expressed through V_k Σ_k^{1/2} and the cluster sizes n_j, reduces the "approximate" kernel k-means problem to
  max_{U ∈ {0,1}^{k×s}} tr(U V_k Σ_k V_k^T U^T)
• Solve by running k-means on the rows of V_k Σ_k^{1/2} – O(sk²l) running-time complexity

Updating Eigenvectors
• Only the eigenvectors and eigenvalues of the kernel matrix are required, for both sampling and clustering
• Update the eigenvectors and eigenvalues incrementally – O(sk + k³) running-time complexity
K = V Σ V^T
K + aa^T = [V  p/‖p‖] Σ' [V  p/‖p‖]^T, where p = (I − VV^T)a is the component of a orthogonal to V
Σ' contains the eigenvalues of a small sparse matrix built from Σ, V^T a, and p

Approximate Stream Kernel k-means: Network Traffic Monitoring
• Clustering is used to detect intrusions in the network
• Network Intrusion data set: TCP dump data from seven weeks of LAN traffic
• 10 classes: 9 types of intrusions and 1 class of legitimate traffic

Algorithm                            Running time (ms per point)   Cluster accuracy (NMI)
Approximate stream kernel k-means    6.6                           14.2
StreamKM++                           0.8                           7.0
sKKM                                 42.1                          13.3

Around 200 points are clustered per second

Summary
• Efficient kernel-based stream clustering algorithm with linear running-time complexity
• The memory required is bounded
• Real-time clustering is possible
• Limitation: does not account for data evolution
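The following is a minimal, illustrative sketch of the leverage-score importance-sampling step referenced above, written in Python/NumPy. It assumes an RBF kernel and recomputes the top-k eigenvectors with a dense eigensolver (np.linalg.eigh) where the algorithm described in these notes uses the incremental O(sk + k³) update; the names ImportanceSampler, offer, max_points, and gamma are illustrative and not from the original paper.

```python
import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    """kappa(a, b) = exp(-gamma * ||a - b||^2)."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

class ImportanceSampler:
    def __init__(self, k, max_points, gamma=1.0):
        self.k = k                    # number of clusters / eigenvectors kept
        self.max_points = max_points  # memory budget, so storage stays bounded
        self.gamma = gamma
        self.points = []              # retained points
        self.K = np.zeros((0, 0))     # kernel matrix over the retained points

    def _leverage_prob(self):
        """p_t = (1/k) * ||row of V_k for the newest point||_2^2."""
        vals, vecs = np.linalg.eigh(self.K)   # dense solve; the algorithm above
        Vk = vecs[:, -self.k:]                # instead updates V_k incrementally
        return float(np.sum(Vk[-1] ** 2) / self.k)

    def offer(self, x):
        """Tentatively add x; keep it with probability given by its leverage score."""
        phi = np.array([rbf_kernel(x, p, self.gamma) for p in self.points])
        kxx = rbf_kernel(x, x, self.gamma)
        # Grow the kernel matrix by one row/column: [[K_{t-1}, phi], [phi^T, kxx]].
        K_new = np.block([[self.K, phi[:, None]],
                          [phi[None, :], np.array([[kxx]])]])
        self.points.append(x)
        old_K, self.K = self.K, K_new
        if len(self.points) <= self.k:
            return True               # always keep the first k points
        p_t = self._leverage_prob()
        if np.random.rand() < p_t and len(self.points) <= self.max_points:
            return True               # keep x; the grown kernel matrix stays
        # Discard x: roll back to the previous kernel matrix.
        self.points.pop()
        self.K = old_K
        return False
```

A fuller implementation would replace the dense eigendecomposition with the rank-one update from the "Updating Eigenvectors" slide and feed the rows of V_k Σ_k^{1/2} to k-means for the clustering step.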