Memory-Constrained Data Mining
Slobodan Vucetic, Assistant Professor
Department of Computer and Information Sciences, Center for Information Science and Technology
Temple University, Philadelphia, USA

Scientific Data Mining Lab
Need (see Nature of March 23, 2006): the amount of data in science grows every year, and computers are shifting from supporting scientists to playing a central role in testing, and even formulating, scientific hypotheses.
Lab mission:
• Develop an interface between data analysis and the applied sciences
• Work on collaborative projects at the interface between computer science and other disciplines (sciences, engineering, business)
• Train students to become computational research scientists
Research tasks: predictive modeling, pattern discovery, summarization.

Scientific Data Mining Lab: Research Challenges
• Spatial and temporal dependency
• High-dimensional data
• Data collection bias
• Data and knowledge fusion from multiple sources
• Large-scale data
• Missing/noisy/unstable attributes
• …

Scientific Data Mining Lab: Current Projects
Data Mining
• Resource-constrained data mining (NSF)
Earth Science Applications
• Estimation of geophysical parameters from satellite data (NSF)
Biomedical Applications
• Gene expression data analysis (NIH, PA Dept. of Health)
• Bioinformatics of protein disorder (PA Dept. of Health)
• Bioinformatics core facility (PA Dept. of Health)
• Text mining and information retrieval (NSF)
• Spatial modeling of disease and infection spread
Spatial and Temporal Knowledge Discovery
• Spatial-temporal data reduction (NSF)
• Analysis of deregulated electricity markets
• Analysis of highway traffic data

Scientific Data Mining Lab: Multiple-Source Spatial-Temporal Data Analysis
Aim: accurate and efficient estimation of geophysical parameters from the MISR and MODIS instruments on the Terra satellite and from ground-based observations (huge data streams).
MISR (Multi-angle Imaging SpectroRadiometer): 9 view angles at the Earth surface (nadir plus ±26.1º, ±45.6º, ±60.0º, ±70.5º), 4 spectral bands, 400-km swath width.
[Figure: MISR viewing geometry.]
Vucetic, S., Han, B., Mi, W., Li, Z., Obradovic, Z., "A Data Mining Approach for the Validation of Aerosol Retrievals," IEEE Geoscience and Remote Sensing Letters, 2008.

Scientific Data Mining Lab: Temporal Data Mining
Aim: analyze price vs. load dependences by discovering semi-stationary segments in multivariate time series.
[Figures: segmentation of the electricity price time series (Apr 1998 - Oct 1999) into regimes; price [$/MWh] vs. load [GWh] scatter by regime.]
  Regime   Size (hours)   Price prediction R2 (local)   (global)   Price volatility
  1        5707           0.79                          0.76       40
  2        4630           0.81                          0.75       19
  3        1425           0.72                          -0.49      9
  4        1191           0.48                          -0.03      56
Result: several pricing regimes existed in the California market.
Vucetic, S., Obradovic, Z. and Tomsovic, K. (2001) "Price-Load Relationships in California's Electricity Market," IEEE Trans. on Power Systems.
Scientific Data Mining Lab: Text Mining: Re-Ranking of Articles Retrieved by a Search Engine
When a topic is difficult to express as a query, keyword search often finds no relevant articles, or returns too many irrelevant ones.
Biomedical example:
• "Apurinic/apyrimidinic endonuclease": 638 citations returned by PubMed
• "Apurinic/apyrimidinic endonuclease disorder": 1 citation (irrelevant) returned
Result: large lift of relevant retrievals in the top 10.
Han, B., Obradovic, Z., Hu, Z.Z., Wu, C.H. and Vucetic, S. (2006) "Substring Selection for Biomedical Document Classification," Bioinformatics.

Scientific Data Mining Lab: Collaborative Filtering
Aim: predict the preferences of an active customer, given his/her preferences on some items and a database of preferences of other customers.
Result: a regression-based collaborative filtering algorithm is superior to the neighbor-based approach; it is two orders of magnitude faster at online prediction, more accurate, and more robust to a small number of observed votes.
Vucetic, S., Obradovic, Z., "Collaborative Filtering Using a Regression-Based Approach," Knowledge and Information Systems, Vol. 7, No. 1, pp. 1-22, 2005.

Scientific Data Mining Lab: Bioinformatics: Protein Disorder Analysis
Aim: understanding protein disorder and its functions.
Results:
• Protein disorder is very common (contrary to a 20th-century belief)
• The fraction of disorder varies widely across genomes
• Different types of disorder exist in proteins
• Disorder is involved in many important functions
[Figure: disordered protein structure, Kissinger et al., 1995.]

Scientific Data Mining Lab: Analysis of Highway Traffic Data
Aim: understand traffic patterns, predict traffic congestion and delays. In progress…

Scientific Data Mining Lab: Spatio-Temporal Disease Modelling
Aim: predict infection or disease risk, given information about population movement.
[Figure 1: illustration of location clusters (by activity type and location type) and the associated infection risks.]
Result: movement information is very useful in predicting infection risk.
Vucetic, S., Sun, H., "Aggregation of Location Attributes for Prediction of Infection Risk," Workshop on Spatial Data Mining: Consolidation and Renewed Bearing, SDM, Bethesda, MD, 2006.

Scientific Data Mining Lab: Resource-Constrained Data Mining
Aim: efficient knowledge discovery from large data by limited-capacity computing devices.
Approach: integration of data mining and data compression.
[Figure 1: (left) noisy checkerboard data; the goal is to discriminate between black and yellow dots, and the achievable accuracy is 90%; (middle) 100 randomly selected examples and the trained prediction model, with 76% accuracy; (right) 100 examples selected by the reservoir algorithm and the trained prediction model, with 88% accuracy.]

Resource-Constrained Data Mining: Motivation
Data mining objective: efficient and accurate algorithms for learning from large data.
Performance measures:
• Accuracy
• Scaling with data size (number of examples, number of attributes)
Mainstream data mining offers many accurate learning algorithms that scale linearly, or even sublinearly, with data size and dimension, in both runtime and space.
Caveat: linear space scaling is often not sufficient, since it implies unbounded growth of memory with data size.
Challenge: how to learn from large, or practically infinite, data sets/streams using limited memory resources.

Resource-Constrained Data Mining: Learning Scenario
• Examples are observed sequentially, in a single pass
• Data stream examples are independent and identically distributed (IID)
• A summary of the data can be stored in a reservoir with fixed memory

Resource-Constrained Data Mining: Approaches
Model-free: the reservoir approach
• Maintain a random sample of size R from the data stream
• Add x_t with probability min(1, R/t) and remove a randomly chosen element
• Caveat: random sampling is often not optimal
Data-free: online algorithms
• Update the model as examples are observed
• Perceptron: w_{t+1} = w_t + (y_t - f(x_t)) x_t, where f(x) = w^T x
• Caveat: sensitive to data ordering
Hybrid: data + model
• Implicitly done with Support Vector Machines (SVMs)

Resource-Constrained Data Mining: Objective
Develop a memory-constrained SVM algorithm.
What is an SVM? A popular data mining algorithm for classification:
• The most accurate on many problems
• Theoretically and practically appealing
• Computationally expensive:
  cubic training time, O(N^3) (e.g., neural nets are O(N))
  quadratic training memory, O(N^2) (e.g., neural nets are O(N))
  linear prediction cost, O(N) (e.g., neural nets are O(1))

Resource-Constrained Data Mining: SVM Overview
Goal: use x1 and x2 to predict class y ∈ {-1, +1}.
Assume a linear prediction function f(x) = w1 x1 + w2 x2 + b; sign(f(x)) is the final prediction.
Challenge: which is better, f1(x) or f2(x)? What is the best choice for f(x)?
Answer: the best f(x) has the most wiggle room, i.e., the largest margin.
Maximizing the margin is equivalent to:
  minimize ||w||^2   such that y_i f(x_i) ≥ 1
What if data are noisy? Introduce slack variables ξ_i:
  minimize ||w||^2 + C Σ_i ξ_i   such that y_i f(x_i) ≥ 1 - ξ_i, ξ_i ≥ 0
What if the problem is nonlinear? Map the input into a feature space, x → Φ(x), and learn a linear function there.

Resource-Constrained Data Mining: SVM Overview
Standard approach: convert to the dual problem,
  min_α W = (1/2) Σ_{i,j} α_i Q_ij α_j - Σ_i α_i + b Σ_i y_i α_i,   subject to 0 ≤ α_i ≤ C,
where Q_ij = y_i y_j Φ(x_i)·Φ(x_j) = y_i y_j K(x_i, x_j), K is the kernel function (e.g., the Gaussian kernel K(x_i, x_j) = exp(-||x_i - x_j||^2 / A)), and the α_i are Lagrange multipliers.
The optimization becomes a quadratic programming problem (minimizing a convex function under linear constraints); the optimal solution is found in O(N^3) time and O(N^2) space.
SVM predictor:
  f(x) = Σ_{i=1}^{N} y_i α_i K(x_i, x) + b
To predict the class of an example x, it must be compared with all training examples with α_i > 0:
• support vectors: 0 < α_i < C (they lie on the margins f(x) = ±1)
• error vectors: α_i = C
• reserve vectors: α_i = 0

Resource-Constrained Data Mining: Incremental-Decremental SVM
The standard SVM solution is "batch": all training data must be available for learning. The alternative is an "online" SVM that can be updated as new training data arrive.
Incremental-Decremental SVM (IDSVM) [Cauwenberghs, Poggio, 2000]: for each new example, the update takes O(Ns^2) time and O(Ns·N) memory, where Ns is the number of support vectors (0 < α_i < C).
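The SVM predictor and the Gaussian kernel described above can be sketched directly. This is a minimal, generic illustration (the function names are mine, and the alphas would come from a trained QP solver), not the lab's implementation:

```python
from math import exp

# Sketch of the SVM decision function f(x) = sum_i y_i * alpha_i * K(x_i, x) + b
# with the Gaussian kernel K(xi, xj) = exp(-||xi - xj||^2 / A). Training (the
# quadratic program) is omitted.

def gaussian_kernel(xi, xj, A=1.0):
    d2 = sum((a - b) ** 2 for a, b in zip(xi, xj))
    return exp(-d2 / A)

def svm_predict(x, support_x, support_y, alpha, b, A=1.0):
    # Only examples with alpha_i > 0 contribute, so only those must be kept;
    # this is why prediction cost grows with the number of support vectors.
    f = b
    for xi, yi, ai in zip(support_x, support_y, alpha):
        if ai > 0:
            f += yi * ai * gaussian_kernel(xi, x, A)
    return 1 if f >= 0 else -1
```

For example, with two stored vectors of opposite class, a query near the positive one is classified +1.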
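For contrast with IDSVM, the two baseline scenarios from the Approaches slide, the reservoir sample and the online perceptron, can be sketched as follows. This is a generic sketch written out from the slide's formulas, not the lab's code:

```python
import random

# A reservoir maintaining a uniform random sample of size R from a stream,
# and the online perceptron update w_{t+1} = w_t + (y_t - f(x_t)) x_t
# with f(x) = w^T x.

def reservoir_update(reservoir, x, t, R):
    """Admit the t-th stream example (t >= 1) with probability min(1, R/t)."""
    if len(reservoir) < R:
        reservoir.append(x)                 # budget not yet filled: always admit
    elif random.random() < R / t:
        reservoir[random.randrange(R)] = x  # evict a randomly chosen example

def perceptron_update(w, x, y):
    f = sum(wi * xi for wi, xi in zip(w, x))
    return [wi + (y - f) * xi for wi, xi in zip(w, x)]
```

The reservoir is the model-free baseline ("Random Sampling" in the results below); the perceptron illustrates the data-free, order-sensitive alternative.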
Since Ns = O(N) in the worst case, IDSVM memory is O(N^2), and the total cost of online training on N examples is O(N^3) time and O(N^2) memory, the same as for batch mode.

Resource-Constrained Data Mining: Memory-Constrained IDSVM
Idea: modify IDSVM by upper-bounding the number of support vectors.
How: the Twin Vector Machine (TVM).
• Define a budget B and a set of pivot vectors q_1, …, q_B
• Quantize each example to its nearest pivot: Q(x) = q_k, k = argmin_{j=1..B} ||x - q_j||
• D = {(x_i, y_i), i = 1…N} becomes Q(D) = {(Q(x_i), y_i), i = 1…N}
The primal problem on D,
  minimize ||w||^2 + C Σ_i ξ_i   such that y_i f(x_i) ≥ 1 - ξ_i, ξ_i ≥ 0, i = 1…N,
becomes, on the quantized data,
  minimize ||w||^2 + C Σ_j (n_j+ ξ_j+ + n_j- ξ_j-)   such that f(q_j) ≥ 1 - ξ_j+, -f(q_j) ≥ 1 - ξ_j-, ξ_j+, ξ_j- ≥ 0, j = 1…B,
where n_j+ and n_j- count the positive and negative examples quantized to q_j.
Training an SVM on Q(D) is equivalent to training on the twin vector set TV = {TV_j, j = 1…B}, where each twin vector is TV_j = {(q_j, +1, n_j+), (q_j, -1, n_j-)}.
Cost drops from O(N^3) to O(B^3) (constant) time, and from O(N^2) to O(B^2) (constant) memory.

Resource-Constrained Data Mining: Online TVM
Online-TVM
Input: data stream D = {(x_i, y_i), i = 1…N}, budget B, kernel function K, slack parameter C
Output: TVM with parameters α_1+, α_1-, …, α_B+, α_B-, and b
1. Initialize TVM = 0, TV = ∅
2. for i = 1 to N
3.   if Beneficial(x_i)
4.     Update-TV
5.     Update-TVM

Beneficial(x_i)
1. if size(TV) < B or |f(x_i)| ≤ m1
2.   return 1
3. else return 0
(m1 defines a buffer around the margins f(x) = ±1: once the budget is filled, only examples close to the decision boundary are considered beneficial.)

Update-TV
  s = size(TV)
  TV_{B+1} = {(x_i, y_i, 1), (x_i, -y_i, 0)}
  if s < B
    TV_{s+1} = TV_{B+1}
  elseif max_{i=1..B} |f(q_i)| > m2
    k = argmax_{i=1..B} |f(q_i)|
    TV_k = TV_{B+1}
  else
    find the best pair TV_i, TV_j to merge
    compute q_new with (**)
    TV_i = {(q_new, +1, s_i+ + s_j+), (q_new, -1, s_i- + s_j-)}
    TV_j = TV_{B+1}
(**)  q_new = ((s_i+ + s_i-) q_i + (s_j+ + s_j-) q_j) / (s_i+ + s_i- + s_j+ + s_j-)

Resource-Constrained Data Mining: Online TVM
Merging heuristics:
• Nearest versus weighted
• Global versus one-sided
• Rejection merging
[Figures: illustrations of OneSideMerge and GlobalMerge.]

Resource-Constrained Data Mining: Results
[Figures: TVM decision boundaries on the noisy checkerboard data after 100, 400, and 10,000 stream examples (budget B = 100).]
[Figures: on the noisy checkerboard, accuracy and CPU time versus length of data stream for TVM (B = 100), IDSVM, LIBSVM, and random sampling; accuracy and CPU time for TVM with budgets 50, 100, and 200; accuracy with and without the buffer (B = 100); accuracy of OneSideMerge versus GlobalMerge on the Adult data set.]
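Three of the TVM building blocks above, the Beneficial test, quantization to the nearest pivot, and the merge rule (**), can be sketched as follows. Representing a twin vector as a (q, n_plus, n_minus) triple is an assumption made here for illustration, not the paper's data structure:

```python
# Sketches of TVM building blocks. A twin vector is stored as a triple
# (q, n_plus, n_minus): pivot, positive count, negative count (an assumption
# for illustration).

def beneficial(f_x, tv_size, B, m1):
    """Admit an example while the budget is unfilled, or if it falls inside
    the buffer |f(x)| <= m1 around the decision boundary."""
    return tv_size < B or abs(f_x) <= m1

def quantize(x, pivots):
    """Q(x) = q_k with k = argmin_j ||x - q_j||."""
    return min(pivots, key=lambda q: sum((a - b) ** 2 for a, b in zip(x, q)))

def merge(tv_i, tv_j):
    """Rule (**): q_new is the count-weighted mean of the two pivots; the
    per-class counts are added."""
    qi, pi, ni = tv_i
    qj, pj, nj = tv_j
    si, sj = pi + ni, pj + nj
    q_new = [(si * a + sj * b) / (si + sj) for a, b in zip(qi, qj)]
    return (q_new, pi + pj, ni + nj)
```

Merging keeps the twin-vector set at size B while preserving the class counts, which is what makes the O(B^2) memory bound hold.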
[Figures: accuracy versus length of data stream (log scale) for TVM, IDSVM, LIBSVM, and random sampling on the Adult, Banana, Checkerboard, Gauss, IJCNN, and Pendigits data sets.]

Resource-Constrained Data Mining: Conclusions
• The memory-constrained SVM is successful: significantly higher accuracy than the baseline, and close to the optimal approach
• Merging heuristics are very important
Future work:
• Further improvements: forgetting, probabilistic merging, use of data compression
• Non-IID streams

Thank You!
More information: http://www.ist.temple.edu/~vucetic/
Collaboration/assistantship contact:
Slobodan Vucetic, CIS Department, IST Center, Temple University
vucetic@ist.temple.edu