Stream Clustering
CSE 902: Big Data

Stream Analysis
Stream: a continuous flow of data
Challenges
◦ Volume: it is not possible to store all of the data
◦ One-time access: the data cannot be processed with multiple passes
◦ Real-time analysis: certain applications need real-time analysis of the data
◦ Temporal locality: the data evolve over time, so the model should be adaptive

Stream Clustering
[Figure: a stream of articles grouped into topic clusters, with the article listings shown under each topic]
• Online phase
  ◦ Summarize the data into memory-efficient data structures
• Offline phase
  ◦ Use a clustering algorithm to find the data partition

Stream Clustering Algorithms

Data structure        Examples
Prototypes            Stream, LSearch
CF-Trees              Scalable k-means, single-pass k-means
Microcluster trees    ClusTree, DenStream, HPStream
Grids                 D-Stream, ODAC
Coreset tree          StreamKM++

Prototypes
• Stream, LSearch

CF-Trees
• Summarize the data in each CF-vector:
  ◦ Linear sum of the data points
  ◦ Squared sum of the data points
  ◦ Number of points
• Examples: Scalable k-means, single-pass k-means

Microclusters
CF-Trees with a "time" element
• CluStream
  ◦ Linear sum and squared sum of the timestamps
  ◦ Delete old microclusters and merge microclusters whose timestamps are close to each other
• Sliding-window clustering
  ◦ Each vector stores the timestamp of the most recent data point added to it
  ◦ Maintain only the most recent T microclusters
• DenStream
  ◦ Microclusters are assigned weights based on recency
  ◦ Outliers are detected by placing them in separate microclusters

Grids
• D-Stream
  ◦ Assign the data to grid cells
  ◦ Grid cells are weighted by the recency of the points added to them
  ◦ Each grid cell is associated with a cluster label
• DGClust
  ◦ Distributed clustering of sensor data
  ◦ Sensors maintain local copies of the grid and communicate grid updates to a central site

StreamKM++ (Coresets)
A weighted set S is a (k, ε)-coreset for a data set D if the clustering of S approximates the clustering of D within an error margin of ε:

(1 − ε) · dist(D, C) ≤ dist_w(S, C) ≤ (1 + ε) · dist(D, C)

• Maintain the data in buckets B_1, B_2, …, B_L
• Buckets B_2 to B_L contain either exactly 0 or exactly m points; B_1 can hold any number of points between 0 and m
• Merge the data in the buckets using a coreset tree (a minimal sketch of this bucket scheme follows below)

StreamKM++: A Clustering Algorithm for Data Streams, Ackermann et al., Journal of Experimental Algorithmics, 2012
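The sketch below illustrates the merge-and-reduce bucket maintenance described above, assuming Python. The names CoresetBuckets, coreset_reduce, and insert are illustrative, and coreset_reduce is a placeholder (weighted subsampling) standing in for the coreset-tree reduction of Ackermann et al.; it is not the authors' implementation.

```python
import random

def coreset_reduce(points, m):
    """Reduce a list of (point, weight) pairs to m weighted representatives.
    Placeholder for the coreset-tree reduction of StreamKM++: here we simply
    subsample and rescale the weights so that the total weight is preserved."""
    if len(points) <= m:
        return list(points)
    total_w = sum(w for _, w in points)
    sample = random.sample(points, m)
    sample_w = sum(w for _, w in sample)
    scale = total_w / sample_w
    return [(x, w * scale) for (x, w) in sample]

class CoresetBuckets:
    """Buckets B_1..B_L: B_1 holds between 0 and m points,
    every other bucket holds exactly 0 or exactly m points."""
    def __init__(self, m):
        self.m = m
        self.buckets = [[]]            # buckets[0] is B_1

    def insert(self, point):
        self.buckets[0].append((point, 1.0))   # every new point has weight 1
        if len(self.buckets[0]) < self.m:
            return
        # B_1 is full: push its m points up the bucket list, merging and
        # reducing whenever the target bucket is already full (like a carry).
        carry, self.buckets[0] = self.buckets[0], []
        i = 1
        while True:
            if i == len(self.buckets):
                self.buckets.append([])
            if not self.buckets[i]:
                self.buckets[i] = carry
                break
            merged = self.buckets[i] + carry   # 2m points
            self.buckets[i] = []
            carry = coreset_reduce(merged, self.m)
            i += 1

    def coreset(self):
        """Weighted union of all buckets, used by the offline phase."""
        return [p for b in self.buckets for p in b]
```

In StreamKM++ the offline phase then runs k-means++ on the final coreset; here that corresponds to clustering the weighted points returned by coreset().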
Kernel-based Clustering
Example feature map: φ(x, y) = (x², √2·xy, y²)^T
Kernel function: K(a, b) = φ(a)^T φ(b)

Kernel-based Stream Clustering
Use non-linear distance measures to define the similarity between data points in the stream
Challenges
◦ Quadratic running-time complexity
◦ Computing the centers from linear sums and squared sums is no longer possible, so the CF-vector approach will not work

Stream Kernel k-means (sKKM)
[Figure: the stream is processed in chunks X_1, X_2, X_3, …; kernel k-means on one chunk produces weights (w_1, w_2, …, w_k) that are passed to weighted kernel k-means on the next chunk, which outputs the clusters (C_1, C_2, …, C_k)]
• History from only the preceding data chunk is retained
Approximation of Kernel k-Means for Streaming Data, Havens, ICPR 2012

Statistical Leverage Scores
Measure the influence of a point in the low-rank approximation
M = Σ_{i=1}^{k} λ_i u_i v_i^T
Leverage score: l_i = Σ_{j=1}^{k} (u_j(i))², where u_j(i) is the i-th entry of u_j

Statistical Leverage Scores
Used to characterize the matrices that can be approximated accurately from a sample of their entries
[Example: a 3×3 matrix whose singular vectors coincide with the standard basis has leverage scores 1, 1, 1 – all rows are equally important, so all the entries of the matrix need to be sampled]
If the singular vectors/eigenvectors are spread out (uncorrelated with the standard basis), then the matrix can be approximated with a small number of samples

Approximate Stream Kernel k-means
◦ Uses statistical leverage scores to determine which data points in the stream are potentially "important"
◦ Retains the important points and discards the rest
◦ Uses an approximate version of kernel k-means to obtain the clusters – linear time complexity
◦ Bounded amount of memory

Approximate Stream Kernel k-means: Importance Sampling
K = V_k Σ_k V_k^T
Sampling probability: p_t = (1/k) ‖V_k^{(t)}‖_2², where V_k^{(t)} is the row of V_k corresponding to x_t
Kernel matrix construction (a minimal sketch of this sampling step appears after the summary below):
K_t = [ K_{t-1}  φ ; φ^T  κ(x_t, x_t) ]   with probability p_t
K_t = K_{t-1}                             with probability 1 − p_t
where φ is the vector of kernel values between x_t and the retained points

Clustering
• Using kernel k-means to recluster the retained points each time a point is added would be expensive
• Reduce the complexity by employing a low-dimensional representation of the data
Kernel k-means:
min_{U ∈ {0,1}^{k×s}} max_{c_j(·) ∈ H_κ} Σ_{j=1}^{k} Σ_{i=1}^{s} (U_{ji}/s_j) ‖c_j(·) − κ(x_i, ·)‖²_{H_κ}
• Constrain the cluster centers to the span of the top k eigenvectors of the kernel matrix
"Approximate" kernel k-means:
min_{U ∈ {0,1}^{k×s}} max_{c_j(·) ∈ H_a} Σ_{j=1}^{k} Σ_{i=1}^{s} (U_{ji}/s_j) ‖c_j(·) − κ(x_i, ·)‖²_{H_κ},   H_a = span(v_1, …, v_k)

Clustering (continued)
• Substituting the optimal centers c_j(·), which can be expressed through V_k Σ_k^{1/2} and the cluster sizes n_j, reduces the "approximate" kernel k-means problem to
  max_{U ∈ {0,1}^{k×s}} tr(U V_k Σ_k V_k^T U^T)
• Solve by running k-means on the rows of V_k Σ_k^{1/2} – O(sk²l) running-time complexity

Updating Eigenvectors
• Only the eigenvectors and eigenvalues of the kernel matrix are required, for both sampling and clustering
• Update the eigenvectors and eigenvalues incrementally – O(sk + k³) running-time complexity
K = V Σ V^T
K + aa^T = [V  p/‖p‖] Σ' [V  p/‖p‖]^T, where p = (I − VV^T)a is the component of a orthogonal to V
Σ' contains the eigenvalues of a small sparse matrix built from Σ, V^T a, and p

Approximate Stream Kernel k-means: Network Traffic Monitoring
• Clustering is used to detect intrusions in the network
• Network Intrusion data set: TCP dump data from seven weeks of LAN traffic
• 10 classes: 9 types of intrusions and 1 class of legitimate traffic

Algorithm                            Running time (ms per point)   Cluster accuracy (NMI)
Approximate stream kernel k-means    6.6                           14.2
StreamKM++                           0.8                           7.0
sKKM                                 42.1                          13.3

Around 200 points are clustered per second

Summary
• Efficient kernel-based stream clustering algorithm with linear running-time complexity
• The memory required is bounded
• Real-time clustering is possible
• Limitation: does not account for data evolution
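The following is a minimal, illustrative sketch of the leverage-score importance-sampling step referenced above, written in Python/NumPy. It assumes an RBF kernel and recomputes the top-k eigenvectors with a dense eigensolver (np.linalg.eigh) where the algorithm described in these notes uses the incremental O(sk + k³) update; the names ImportanceSampler, offer, max_points, and gamma are illustrative and not from the original paper.

```python
import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    """kappa(a, b) = exp(-gamma * ||a - b||^2)."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

class ImportanceSampler:
    def __init__(self, k, max_points, gamma=1.0):
        self.k = k                    # number of clusters / eigenvectors kept
        self.max_points = max_points  # memory budget, so storage stays bounded
        self.gamma = gamma
        self.points = []              # retained points
        self.K = np.zeros((0, 0))     # kernel matrix over the retained points

    def _leverage_prob(self):
        """p_t = (1/k) * ||row of V_k for the newest point||_2^2."""
        vals, vecs = np.linalg.eigh(self.K)   # dense solve; the algorithm above
        Vk = vecs[:, -self.k:]                # instead updates V_k incrementally
        return float(np.sum(Vk[-1] ** 2) / self.k)

    def offer(self, x):
        """Tentatively add x; keep it with probability given by its leverage score."""
        phi = np.array([rbf_kernel(x, p, self.gamma) for p in self.points])
        kxx = rbf_kernel(x, x, self.gamma)
        # Grow the kernel matrix by one row/column: [[K_{t-1}, phi], [phi^T, kxx]].
        K_new = np.block([[self.K, phi[:, None]],
                          [phi[None, :], np.array([[kxx]])]])
        self.points.append(x)
        old_K, self.K = self.K, K_new
        if len(self.points) <= self.k:
            return True               # always keep the first k points
        p_t = self._leverage_prob()
        if np.random.rand() < p_t and len(self.points) <= self.max_points:
            return True               # keep x; the grown kernel matrix stays
        # Discard x: roll back to the previous kernel matrix.
        self.points.pop()
        self.K = old_K
        return False
```

A fuller implementation would replace the dense eigendecomposition with the rank-one update from the "Updating Eigenvectors" slide and feed the rows of V_k Σ_k^{1/2} to k-means for the clustering step.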