ODU-final - ODU Computer Science

Memory-Constrained Data Mining
Slobodan Vucetic
Assistant Professor
Department of Computer and Information Sciences
Center for Information Science and Technology
Temple University, Philadelphia
Scientific Data Mining Lab
Dr. Slobodan Vucetic, Assistant Professor
CIS Department, IST Center,
Temple University, Philadelphia, USA
Need: (see Nature of March 23, 2006)
• Amount of data in science every year
• Shift from computers supporting scientists to computers playing a central role in
the testing, and even the formulation, of scientific hypotheses
Lab Mission:
• Developing an interface between data analysis and applied sciences
• Working on collaborative projects at the interface between computer
science and other disciplines (sciences, engineering, business)
• Training students to become computational research scientists
Research Tasks:
• Predictive Modeling
• Pattern Discovery
• Summarization
Scientific Data Mining Lab:
Research Challenges
• Spatial and temporal dependency
• High-dimensional data
• Data collection bias
• Data and knowledge fusion from multiple sources
• Large-scale data
• Missing/noisy/unstable attributes …
Scientific Data Mining Lab:
Current Projects
Data Mining
• Resource-Constrained Data Mining (NSF)
Earth Science Applications
• Estimation of geophysical parameters from satellite data (NSF)
Biomedical Applications
• Gene expression data analysis (NIH, PA Dept. of Health)
• Bioinformatics of protein disorder (PA Dept. of Health)
• Bioinformatics core facility (PA Dept. of Health)
• Text mining and information retrieval (NSF)
• Spatial modeling of disease and infection spread
Spatial and Temporal Knowledge Discovery
• Spatial-temporal data reduction (NSF)
• Analysis of deregulated electricity markets
• Analysis of highway traffic data
Scientific Data Mining Lab:
Multiple-Source Spatial-Temporal Data
Analysis
Aim: Accurate and efficient estimation of geophysical parameters
from MISR and MODIS instruments on Terra satellite and
ground based observations (huge data streams)
MISR: Multi-angle Imaging SpectroRadiometer
• 9 view angles at Earth surface
• 4 spectral bands
• 400-km swath width
[Figure: the nine MISR cameras at view angles 70.5º, 60.0º, 45.6º, 26.1º, 0.0º, 26.1º, 45.6º, 60.0º, and 70.5º]
Vucetic, S., Han, B., Mi, W., Li, Z., Obradovic, Z., A
Data Mining Approach for the Validation of Aerosol
Retrievals, IEEE Geoscience and Remote Sensing
Letters, 2008.
Scientific Data Mining Lab:
Temporal Data Mining
Aim: analyze price vs. load dependences by discovering semi-
stationary segments in multivariate time series
[Figure: plots of price [$/MWh] and load [GWh] for April 1998 to October 1999, with regimes 1 to 4 marked]

Regime   Size (hours)   Price prediction R^2 (Local)   Price prediction R^2 (Global)   Price volatility
1        5707           0.79                           0.72                            40
2        4630           0.76                           -0.49                           19
3        1425           0.81                           0.48                            9
4        1191           0.75                           -0.03                           56

• Result: several pricing regimes existed in the California market
Vucetic, S., Obradovic, Z. and Tomsovic, K. (2001) “Price-Load Relationships in California's
Electricity Market,” IEEE Trans. on Power Systems.
Scientific Data Mining Lab:
Text Mining: Re-Ranking of Articles Retrieved by a
Search Engine
When a topic is difficult to express as a query, often
• No relevant articles are found by keyword search
• Too many irrelevant articles are returned
Biomedical Example:
• “Apurinic/apyrimidinic endonuclease”: 638 citations returned by PubMed
• “Apurinic/apyrimidinic endonuclease disorder”: 1 citation (irrelevant) returned
Result: Large lift of relevant retrievals in top 10
Han, B., Obradovic, Z., Hu, Z.Z., Wu, C. H. and Vucetic, S. (2006)
“Substring Selection for Biomedical Document Classification,” Bioinformatics.
Scientific Data Mining Lab:
Collaborative filtering
Aim: Predict preferences of an active customer given his/her
preferences on some items and a database of preferences of
other customers
Result: The regression-based collaborative filtering algorithm is superior to the neighbor-based approach: it is two orders of magnitude faster in on-line prediction, more accurate, and
more robust to a small number of observed votes.
Vucetic, S., Obradovic, Z., Collaborative Filtering Using a Regression-Based Approach, Knowledge and
Information Systems, Vol. 7, No. 1, pp. 1-22, 2005.
Scientific Data Mining Lab:
Bioinformatics: Protein Disorder Analysis
Aim: Understanding protein disorder and its functions
Results:
• Protein disorder is very common (contrary to a 20th-century belief)
• The fraction of disorder varies widely across genomes
• Different types of disorder exist in proteins
• Disorder is involved in many important functions
Kissinger et al., 1995
Scientific Data Mining Lab:
Analysis of Highway Traffic Data
Aim: understand traffic patterns, and predict traffic congestion and
delays
In progress…
Scientific Data Mining Lab:
Spatio-Temporal Disease Modelling
Aim: predict infection or disease risk, given the information about population
movement
[Figure 1. Illustration of location clusters and the associated risks. Axes: Activity Type, Location Type, Infection Risk.]
Result: movement information is very useful in prediction of the infection risk
Vucetic, S., Sun, H., Aggregation of Location Attributes for Prediction of Infection Risk, Workshop on
Spatial Data Mining: Consolidation and Renewed Bearing, SDM, Bethesda, MD, 2006.
Scientific Data Mining Lab:
Resource-Constrained Data Mining
Aim:
Efficient knowledge discovery from large data by limited-capacity
computing devices
Approach:
• Integration of data mining and data compression
Figure 1. Left: noisy checkerboard data; the goal is to discriminate between black and yellow dots, and the
achievable accuracy is 90%. Middle: 100 randomly selected examples and the trained prediction model, which has
76% accuracy. Right: 100 examples selected by the reservoir algorithm and the trained prediction model, which has
88% accuracy.
Resource-Constrained Data Mining:
Motivation
• Data mining objective:
  - Efficient and accurate algorithms for learning from large data
• Performance measures:
  - Accuracy
  - Scaling with data size (# examples, # attributes)
• Mainstream data mining:
  - Many accurate learning algorithms that scale linearly or even sublinearly with data size and dimension, in both runtime and space
• Caveat:
  - Linear space scaling is often not sufficient: it implies an
    unbounded growth in memory with data size
• Challenge:
  - How to learn from large, or practically infinite, data sets/streams
    using limited memory resources
Resource-Constrained Data Mining:
Learning Scenario
• Examples are observed
  - sequentially
  - in a single pass
• Data stream examples
  - independent and identically distributed (IID)
• Could store the data summary in
  - reservoir with fixed memory
Resource-Constrained Data Mining:
Approaches
• Model-Free: Reservoir Approach
  - Maintains a random sample of size R from the data stream
  - Add $x_t$ with probability min(1, R/t), removing a randomly chosen stored example
  - Caveat: random sampling is often not optimal
• Data-Free: Online algorithms
  - Update the model as examples are observed
  - Perceptron: $w_{t+1} = w_t + (y_t - f(x_t)) x_t$, where $f(x) = w^T x$
  - Caveat: sensitive to data ordering
• Hybrid: Data + Model
  - Implicitly done with Support Vector Machines (SVMs)
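The two baseline approaches above can be sketched in a few lines. This is a minimal illustration, not the lab's implementation: the function names, the fixed random seed, and the toy inputs are assumptions made for the example. The update rule is applied exactly as written on the slide, with no learning rate.

```python
import random

def reservoir_sample(stream, R, rng=None):
    """Single-pass uniform sample of at most R items (model-free approach)."""
    if rng is None:
        rng = random.Random(0)  # fixed seed: an assumption, for repeatability
    reservoir = []
    for t, x in enumerate(stream, start=1):
        if t <= R:
            reservoir.append(x)               # min(1, R/t) = 1: always add
        elif rng.random() < R / t:            # add x_t with probability R/t
            reservoir[rng.randrange(R)] = x   # remove a randomly chosen item
    return reservoir

def online_update(w, x, y):
    """One step of w <- w + (y - f(x)) x with f(x) = w^T x (data-free approach)."""
    f = sum(wi * xi for wi, xi in zip(w, x))
    return [wi + (y - f) * xi for wi, xi in zip(w, x)]

# Toy usage: memory stays at R items no matter how long the stream is.
sample = reservoir_sample(range(10_000), R=100)
w = online_update([0.0, 0.0], x=[1.0, 2.0], y=1.0)
```

The reservoir keeps data but no model, while the online update keeps a model but no data; the hybrid approach on the slide sits between the two.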
Resource-Constrained Data Mining:
Objective
• Develop a memory-constrained SVM algorithm
• What is SVM?
  - Popular data mining algorithm for classification
  - The most accurate on many problems
  - Theoretically and practically appealing
  - Computationally expensive
    - Cubic training time cost $O(N^3)$ (e.g. neural nets are $O(N)$)
    - Quadratic training memory cost $O(N^2)$ (e.g. neural nets are $O(N)$)
    - Linear prediction cost $O(N)$ (e.g. neural nets are $O(1)$)
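To make the quadratic memory cost concrete, the following back-of-the-envelope sketch estimates the size of a dense N-by-N kernel matrix. It assumes 8-byte floating-point entries and no sparsity or caching tricks; the function name and the example values of N are illustrative and not taken from the slides.

```python
def kernel_matrix_bytes(n_examples, bytes_per_entry=8):
    """Memory for a dense n x n kernel matrix (8-byte entries assumed)."""
    return n_examples * n_examples * bytes_per_entry

# Each 100-fold increase in N multiplies the memory by 10,000.
for n in (1_000, 100_000, 10_000_000):
    gib = kernel_matrix_bytes(n) / 2**30
    print(f"N = {n:>10,}: {gib:,.3f} GiB")
```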
Resource-Constrained Data Mining:
SVM Overview
• Goal:
  - Use $x_1$ and $x_2$ to predict class $y \in \{-1, 1\}$
  - Assume linear prediction function $f(x) = w_1 x_1 + w_2 x_2 + b$
  - sign(f(x)) is the final prediction
• Challenge:
  - Which is better, $f_1(x)$ or $f_2(x)$?
  - What is the best choice for f(x)?
• Answer:
  - The best f(x) has the most wiggle space: it has the largest margin
[Figure: two-class data in the ($x_1$, $x_2$) plane with two candidate separating lines, $f_1(x)$ and $f_2(x)$]
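As a rough illustration of "most wiggle space", the sketch below compares two linear functions by their geometric margin, taken here as the smallest value of $y_i f(x_i) / \|w\|$ over the data. The four data points and the two candidate functions are hypothetical and chosen only for the example.

```python
import math

def geometric_margin(w, b, points, labels):
    """Smallest signed distance y * f(x) / ||w|| over the data set."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(
        y * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
        for x, y in zip(points, labels)
    )

# Hypothetical two-class data in the (x1, x2) plane.
points = [(2.0, 2.0), (3.0, 3.0), (0.0, 0.0), (-1.0, 0.0)]
labels = [1, 1, -1, -1]

margin_f1 = geometric_margin((1.0, 1.0), -2.0, points, labels)  # f1 = x1 + x2 - 2
margin_f2 = geometric_margin((1.0, 0.0), -1.0, points, labels)  # f2 = x1 - 1
best = "f1" if margin_f1 > margin_f2 else "f2"
```

Both functions classify every point correctly, so accuracy alone cannot separate them; the margin does.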
Resource-Constrained Data Mining:
SVM Overview
• Maximizing margin is equivalent to:
  minimize $\|w\|^2$
  such that $y_i f(x_i) \ge 1$
• What if data are noisy?
  minimize $\|w\|^2 + C \sum_i \xi_i$
  such that $y_i f(x_i) \ge 1 - \xi_i$, $\xi_i \ge 0$
• What if problem is nonlinear?
  $x \to \Phi(x)$
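For a fixed linear function, the smallest slack that satisfies the noisy-data constraints is $\xi_i = \max(0, 1 - y_i f(x_i))$, so the objective can be evaluated directly. The sketch below does this for a hypothetical one-dimensional-looking data set; the function name, the data, and the value of C are assumptions for illustration.

```python
def soft_margin_objective(w, b, points, labels, C):
    """Return (||w||^2 + C * sum of slacks, list of slacks) for a linear f."""
    def f(x):
        return sum(wi * xi for wi, xi in zip(w, x)) + b
    # Smallest slack satisfying y_i f(x_i) >= 1 - xi_i with xi_i >= 0.
    slacks = [max(0.0, 1.0 - y * f(x)) for x, y in zip(points, labels)]
    objective = sum(wi * wi for wi in w) + C * sum(slacks)
    return objective, slacks

points = [(2.0, 0.0), (0.5, 0.0), (-1.0, 0.0)]
labels = [1, 1, -1]
objective, slacks = soft_margin_objective((1.0, 0.0), 0.0, points, labels, C=10.0)
```

Only the second point falls inside the margin, so it is the only one that contributes a nonzero slack.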
Resource-Constrained Data Mining:
SVM Overview
• Standard approach: convert to dual problem
  Primal: minimize $\|w\|^2 + C \sum_i \xi_i$ such that $y_i f(x_i) \ge 1 - \xi_i$, $\xi_i \ge 0$
  Dual: $\min_{0 \le \alpha_i \le C} W = \frac{1}{2} \sum_{i,j} \alpha_i Q_{ij} \alpha_j - \sum_i \alpha_i + b \sum_i y_i \alpha_i$
  where $Q_{ij} = y_i y_j \Phi(x_i) \cdot \Phi(x_j) = y_i y_j K(x_i, x_j)$, K is the kernel function
  Gaussian kernel: $K(x_i, x_j) = \exp(-\|x_i - x_j\|^2 / A)$
  $\alpha_i$ are Lagrange multipliers
• Optimization becomes a Quadratic Programming problem
  (minimizing a convex function with linear constraints)
• The optimal solution can be found in $O(N^3)$ time and $O(N^2)$ space
• SVM predictor:
  $f(x) = \sum_{i=1}^{N} y_i \alpha_i K(x_i, x) + b$
  To predict the class of example x, we should compare it with all training
  examples with $\alpha_i > 0$
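The predictor above can be written out directly. The sketch below is a plain-Python illustration using the Gaussian kernel with width parameter A; the training points, multipliers, and offset are made-up values rather than the output of any actual training run.

```python
import math

def gaussian_kernel(u, v, A=1.0):
    """K(u, v) = exp(-||u - v||^2 / A)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-sq_dist / A)

def svm_decision(x, train_x, train_y, alpha, b, A=1.0):
    """f(x) = sum_i y_i * alpha_i * K(x_i, x) + b, skipping alpha_i = 0."""
    total = b
    for x_i, y_i, a_i in zip(train_x, train_y, alpha):
        if a_i > 0:  # only examples with alpha_i > 0 affect the prediction
            total += y_i * a_i * gaussian_kernel(x_i, x, A)
    return total

# Hypothetical trained model: the third example has alpha = 0 and is ignored.
train_x = [(0.0, 0.0), (2.0, 0.0), (5.0, 5.0)]
train_y = [1, -1, 1]
alpha = [1.0, 1.0, 0.0]

f_near_positive = svm_decision((0.0, 0.0), train_x, train_y, alpha, b=0.0)
f_near_negative = svm_decision((2.0, 0.0), train_x, train_y, alpha, b=0.0)
```

The cost of one prediction grows with the number of examples that have $\alpha_i > 0$, which is why bounding that number matters for memory-constrained devices.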
Resource-Constrained Data Mining:
SVM Overview
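$f(x) = \sum_{i=1}^{N} y_i \alpha_i K(x_i, x) + b$
• Support vectors: $0 < \alpha_i < C$
• Error vectors: $\alpha_i = C$
• Reserve vectors: $\alpha_i = 0$
[Figure: decision boundary with the margin contours f(x) = -1 and f(x) = +1]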
Resource-Constrained Data Mining:
Incremental-Decremental SVM
• Standard SVM solution is “batch”, meaning that all training
  data should be available for learning
• Alternative is “online” SVM that can be updated when new
  training data are available
  - Incremental-Decremental SVM [Cauwenberghs, Poggio, 2000]
  - For each new example, the update takes
    - $O(N_s^2)$ time, where $N_s$ is the number of support vectors ($0 < \alpha_i < C$)
    - $O(N_s N)$ memory. Considering $N_s = O(N)$, memory is $O(N^2)$
  - Total cost for online training on N examples is
    - $O(N^3)$ time
    - $O(N^2)$ memory
• The same as for batch mode
Resource-Constrained Data Mining:
Memory-Constrained IDSVM
• Idea
  - Modify IDSVM by upper-bounding the number of support vectors
• How: Twin Vector Machine (TVM)
  - Define budget B and a set of pivot vectors $q_1, \dots, q_B$
  - Quantize each example to its nearest pivot:
    $Q(x) = q_k$, $k = \arg\min_{j=1:B} \|x - q_j\|$
    $D = \{(x_i, y_i), i = 1 \dots N\} \to Q(D) = \{(Q(x_i), y_i), i = 1 \dots N\}$
  - The SVM problem
    minimize $\|w\|^2 + C \sum_i \xi_i$
    such that $y_i f(x_i) \ge 1 - \xi_i$, $\xi_i \ge 0$, $i = 1 \dots N$
    becomes
    minimize $\|w\|^2 + C \sum_j (n_j^+ \xi_j^+ + n_j^- \xi_j^-)$
    such that $f(q_j) \ge 1 - \xi_j^+$, $-f(q_j) \ge 1 - \xi_j^-$, $\xi_j^+, \xi_j^- \ge 0$, $j = 1 \dots B$
  - Training SVM on Q(D) is equivalent to SVM on TV:
    $TV = \{TV_j, j = 1 \dots B\}$ (Twin Vector Set)
    $TV_j = \{(q_j, +1, n_j^+), (q_j, -1, n_j^-)\}$ (Twin Vector)
• $O(N^3) \to O(B^3)$ (constant) time; $O(N^2) \to O(B^2)$ (constant) memory
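The quantization step and the resulting twin vector set can be sketched as follows. This is an illustrative reading of the definitions above, assuming fixed pivots and Euclidean distance; the function names, pivots, and data are hypothetical.

```python
def nearest_pivot(x, pivots):
    """Index k of the pivot closest to x (squared Euclidean distance)."""
    return min(
        range(len(pivots)),
        key=lambda j: sum((a - b) ** 2 for a, b in zip(x, pivots[j])),
    )

def build_twin_vectors(data, pivots):
    """Quantize (x, y) pairs and return TV_j = ((q_j, +1, n_j+), (q_j, -1, n_j-))."""
    counts = [[0, 0] for _ in pivots]          # [n_plus, n_minus] per pivot
    for x, y in data:
        k = nearest_pivot(x, pivots)
        counts[k][0 if y == 1 else 1] += 1
    return [
        ((q, +1, n_plus), (q, -1, n_minus))
        for q, (n_plus, n_minus) in zip(pivots, counts)
    ]

# Hypothetical budget B = 2 and four labeled examples.
pivots = [(0.0, 0.0), (10.0, 10.0)]
data = [((1.0, 0.0), 1), ((0.0, 1.0), -1), ((0.0, 2.0), 1), ((9.0, 9.0), -1)]
twin_vectors = build_twin_vectors(data, pivots)
```

However many examples arrive, the summary holds only B pivots and 2B counts, which is what makes the memory cost constant.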
Resource-Constrained Data Mining:
Online TVM
Online-TVM
• Input: Data stream $D = \{(x_i, y_i), i = 1 \dots N\}$, budget B, kernel
  function K, slack parameter C
• Output: TVM with parameters $\alpha_1^+, \alpha_1^-, \dots, \alpha_B^+, \alpha_B^-$, and b
1. Initialize TVM = 0, TV = ∅
2. for i = 1 to N
3.     if Beneficial($x_i$)
4.         Update-TV
5.         Update-TVM
Resource-Constrained Data Mining:
Online TVM
Beneficial
1. if size(TV) < B or $|f(x_i)| \le m_1$
2.     return 1
3. else
4.     return 0
[Figure: margin diagram marking the f(x) = 0 boundary, the +1 and -1 margins, the buffer regions, and the threshold $m_1$]
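The control flow of Online-TVM and its Beneficial test can be sketched as below. This is a skeleton only: it assumes the test keeps examples whose $|f(x_i)|$ is at most $m_1$, and it takes the decision function, Update-TV, and Update-TVM as caller-supplied functions because their internals are not reproduced here. All names are hypothetical.

```python
def beneficial(f_x, tv_size, budget, m1):
    """True if the budget is not yet full or x lies within the buffer |f(x)| <= m1."""
    return tv_size < budget or abs(f_x) <= m1

def online_tvm(stream, budget, m1, decision_fn, update_tv, update_tvm):
    """Single pass over (x, y) pairs; only beneficial examples trigger updates."""
    tv = []                                   # twin vector set, starts empty
    for x, y in stream:
        if beneficial(decision_fn(x), len(tv), budget, m1):
            update_tv(tv, x, y)               # caller-supplied Update-TV
            update_tvm(tv)                    # caller-supplied Update-TVM
    return tv
```

Examples far from the decision boundary are skipped once the budget is full, so most of a long stream costs only one evaluation of f(x).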
Resource-Constrained Data Mining:
Online TVM
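Update-TV
    s = size(TV)
    $TV_{B+1} = \{(x_i, y_i, 1), (q_i, -y_i, 0)\}$
    if s < B
        $TV_{s+1} = TV_{B+1}$
    elseif $\max_{i=1:B} |f(q_i)| > m_2$
        $k = \arg\max_{i=1:B} |f(q_i)|$
        $TV_k = TV_{B+1}$
    else
        find best pair $TV_i$, $TV_j$ to merge
        use (**) to calculate $q_{new}$
        $TV_i = \{(q_{new}, +1, s_i^+ + s_j^+), (q_{new}, -1, s_i^- + s_j^-)\}$
        $TV_j = TV_{B+1}$
(**) $q_{new} = \frac{(s_i^+ + s_i^-) q_i + (s_j^+ + s_j^-) q_j}{s_i^+ + s_i^- + s_j^+ + s_j^-}$
[Figure: margin diagram marking the f(x) = 0 boundary, the +1 and -1 margins, the buffer regions, and the threshold $m_2$]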
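The merge step (**) is a count-weighted average of the two pivots, with the positive and negative counts added. A minimal sketch follows; for brevity it stores each twin vector as a triple (q, s_plus, s_minus), which is a representation chosen for this example, and it assumes the combined count is nonzero.

```python
def merge_twin_vectors(tv_i, tv_j):
    """Merge two twin vectors given as (q, s_plus, s_minus) triples."""
    q_i, s_i_plus, s_i_minus = tv_i
    q_j, s_j_plus, s_j_minus = tv_j
    weight_i = s_i_plus + s_i_minus           # examples represented by q_i
    weight_j = s_j_plus + s_j_minus           # examples represented by q_j
    total = weight_i + weight_j               # assumed > 0
    q_new = tuple(
        (weight_i * a + weight_j * b) / total for a, b in zip(q_i, q_j)
    )
    return (q_new, s_i_plus + s_j_plus, s_i_minus + s_j_minus)

# Hypothetical pair: equal weights give the midpoint of the two pivots.
merged = merge_twin_vectors(((0.0, 0.0), 3, 1), ((4.0, 8.0), 2, 2))
```

Weighting by counts keeps the merged pivot at the mean of all examples the two twin vectors stood for, so repeated merges do not drift toward sparsely populated pivots.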
Resource-Constrained Data Mining:
Online TVM
Merging Heuristics:
• Nearest versus Weighted
• Global versus One-Sided
• Rejection merging
[Figure: OneSideMerge and GlobalMerge illustrated relative to the +1, 0, and -1 contours]
Resource-Constrained Data Mining:
Results
[Figure: plot panels labeled 100, 400, and 10000; Budget B = 100]
Resource-Constrained Data Mining:
Results
Checkerboard (noisy), Budget B = 100
[Figure, left panel: accuracy versus length of data stream (in log scale) for TVM, IDSVM, LIBSVM, and Random Sampling]
[Figure, right panel: CPU time (in seconds) versus length of data stream for TVM and IDSVM]
Resource-Constrained Data Mining:
Results
Checkerboard (noisy)
[Figure, left panel: accuracy versus length of data stream (in log scale) for TVM with budgets 50, 100, and 200]
[Figure, right panel: CPU time (in seconds) versus length of data stream for TVM with budgets 50, 100, and 200]
Resource-Constrained Data Mining:
Results
[Figure: accuracy versus length of data stream on Checkerboard (noisy) and Adult, comparing TVM with buffer versus without buffer and OneSideMerge versus GlobalMerge; Budget B = 100]
Resource-Constrained Data Mining:
Results
[Figure: accuracy versus length of data stream (in log scale) for TVM, IDSVM, LIBSVM, and Random Sampling on six data sets: Adult, Banana, Checkerboard, Gauss, IJCNN, and Pendigits]
Resource-Constrained Data Mining:
Conclusions
• Memory-Constrained SVM is successful
  - Significantly higher accuracy than baseline
  - Close to the optimal approach
• Merging heuristics are very important
• Future work
  - Further improvements
    - Forgetting
    - Probabilistic merging
    - Use data compression
    - Non-IID streams
Thank You!
More information:
http://www.ist.temple.edu/~vucetic/
Collaboration/assistantship contact:
Slobodan Vucetic
CIS Department, IST Center,
Temple University
vucetic@ist.temple.edu