A General Framework for Mining Concept-Drifting Data Streams with Skewed Distributions

Jing Gao† Wei Fan‡ Jiawei Han† Philip S. Yu‡
†University of Illinois at Urbana-Champaign
‡IBM T. J. Watson Research Center
Introduction (1)
• Data Stream
– Continuously arriving data flow
– Applications: network traffic, credit card transaction flow, phone calling records, etc.
[Figure: a data stream depicted as a continuously arriving sequence of 0/1 records]
Introduction (2)
• Stream Classification
– Construct a classification model based on past records
– Use the model to predict labels for new data
– Help decision making
[Figure: labeled past transactions ("fraud" / "not fraud") train a classification model, which is then asked "Fraud?" for a new record]
Framework
[Figure: a classification model is built from the labeled portion of the stream and used to predict labels ("?") for newly arriving data; a minimal code sketch follows]
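To make the framework concrete, here is a minimal chunk-by-chunk sketch in Python; the chunk representation, base learner, and loop structure are illustrative assumptions rather than the authors' implementation:

```python
from sklearn.tree import DecisionTreeClassifier  # any base learner would do

def stream_classify(chunks):
    """Process the stream chunk by chunk: predict labels for each new chunk
    with the current model, then retrain once that chunk's labels arrive."""
    model = None
    for X, y in chunks:                      # each chunk = (feature matrix, labels)
        if model is not None:
            yield model.predict(X)           # predictions for the newly arrived chunk
        model = DecisionTreeClassifier().fit(X, y)   # rebuild on the most recent labeled data
```

Training on the most recent labeled chunk is the simplest way to track drifting concepts; the ensemble approach later in the deck refines this for skewed class distributions.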
Concept Drifts
• Changes in P(x,y)
– P(x,y) = P(y|x)P(x), where x is the feature vector and y is the class label
– Four cases: no change, feature change (only P(x) changes), conditional change (only P(y|x) changes), and dual change (both change); a toy example follows
– Expected error is not a good indicator of concept drifts
– Training on the most recent data could help reduce expected error
[Figure: evolving class boundaries at time stamps 1, 11, and 21]
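As an illustrative toy example (assumed, not from the paper), the generator below produces chunks exhibiting each drift type by shifting P(x) and/or perturbing P(y|x):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_chunk(n, mean, flip_prob):
    """Toy 1-D generator: x ~ N(mean, 1), y = 1[x > 0] with label noise flip_prob.
    Shifting `mean` changes P(x) (feature change); raising `flip_prob`
    changes P(y|x) (conditional change); changing both is a dual change."""
    x = rng.normal(mean, 1.0, size=n)
    y = (x > 0).astype(int)
    flips = rng.random(n) < flip_prob
    y[flips] = 1 - y[flips]
    return x, y

x0, y0 = sample_chunk(1000, mean=0.0, flip_prob=0.0)   # original concept
x1, y1 = sample_chunk(1000, mean=1.0, flip_prob=0.0)   # feature change only
x2, y2 = sample_chunk(1000, mean=0.0, flip_prob=0.2)   # conditional change only
x3, y3 = sample_chunk(1000, mean=1.0, flip_prob=0.2)   # dual change
```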
Issues in Stream Classification (1)
• Generative Model
– Assumes P(y|x) follows some known distribution
• Descriptive Model
– Lets the data decide
• Stream Data
– Distribution is unknown and evolving, which favors descriptive models
Issues in Stream Classification (2)
• Label Prediction
– Classify x into exactly one class
• Probability Estimation
– Assign x to all classes with different probabilities
• Stream Applications
– Stochastic in nature; prediction confidence information is needed, so probability estimates are preferred (sketch below)
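A minimal sketch of the distinction, assuming a scikit-learn style classifier (purely illustrative; the data and model here are placeholders):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.random((200, 5))
y = (X[:, 0] > 0.8).astype(int)                 # skewed toward class 0

clf = DecisionTreeClassifier(max_depth=3).fit(X, y)
x_new = rng.random((1, 5))

hard_label = clf.predict(x_new)                 # label prediction: one class only
class_probs = clf.predict_proba(x_new)          # probability estimation: P(class | x)
print(hard_label, class_probs)                  # the probabilities carry confidence information
```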
Mining Skewed Data Streams
• Skewed Distribution
– Credit card frauds, network intrusions
• Existing Stream Classification Algorithms
– Evaluated on balanced data
• Problems
– Minority examples are ignored (toy demonstration below)
– The cost of misclassifying minority examples is usually huge
[Figure: a decision tree trained on the skewed data classifies every leaf node as negative]
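A toy demonstration of this failure mode (assumed setup, not from the paper): with roughly 1% positives, a plain classifier can score high accuracy while predicting almost no positives at all.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(10000, 10))
y = (rng.random(10000) < 0.01).astype(int)      # ~1% positive (minority) class

clf = DecisionTreeClassifier(max_depth=2).fit(X, y)
pred = clf.predict(X)

print("accuracy:", (pred == y).mean())          # near 0.99 just by favoring the majority
print("positives predicted:", int(pred.sum()))  # typically 0: the minority class is ignored
```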
Stream Ensemble Approach (1)
Step 1: Sampling
– The current chunk alone contains too few positive examples to form a good training set
– Keep the positive examples seen in previous chunks and combine them with negative examples sampled from the current chunk (sketch below)
[Figure: positive examples accumulated from earlier chunks are merged with the current chunk to build the training set]
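A minimal sketch of this sampling step, assuming NumPy arrays per chunk (the function name and exact splitting scheme are illustrative; see the paper for the precise procedure):

```python
import numpy as np

def build_training_sets(past_positives, X_cur, y_cur, k, seed=0):
    """Pool positives kept from earlier chunks with the current chunk's positives,
    then split the current chunk's negatives into k disjoint random subsets,
    yielding k training sets that all share the pooled positives."""
    rng = np.random.default_rng(seed)
    positives = np.vstack([past_positives, X_cur[y_cur == 1]])   # past_positives: 2-D array (may have 0 rows)
    negatives = X_cur[y_cur == 0]
    negatives = negatives[rng.permutation(len(negatives))]
    for neg in np.array_split(negatives, k):
        X = np.vstack([positives, neg])
        y = np.concatenate([np.ones(len(positives)), np.zeros(len(neg))])
        yield X, y
```

Because the k negative subsets are disjoint, the classifiers trained on them tend to make uncorrelated errors, which is exactly what the later analysis relies on.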
Stream Ensemble Approach (2)
Step 2: Ensemble
– Train k classifiers $C_1, C_2, \ldots, C_k$, one on each of the k training sets
– Average their probability estimates:
$$f_E(x) = \frac{1}{k} \sum_{i=1}^{k} f_i(x)$$
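Continuing the sketch above (same illustrative assumptions), Step 2 simply fits one base classifier per sampled training set and averages their probability estimates:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def train_ensemble(training_sets):
    """Fit one base classifier C_i per training set produced by the sampling step."""
    return [DecisionTreeClassifier(max_depth=5).fit(X, y) for X, y in training_sets]

def ensemble_estimate(models, X):
    """f_E(x) = (1/k) * sum_i f_i(x): average the per-model estimates of P(+|x)."""
    return np.mean([m.predict_proba(X)[:, 1] for m in models], axis=0)
```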
Why does this approach work?
• Incorporation of old positive examples
– Increases the training size, which reduces variance
– Negative examples reflect current concepts, so the increase in boundary bias is small
• Ensemble
– Reduces the variance caused by a single model
– Disjoint sets of negative examples, so the classifiers make uncorrelated errors
• Bagging & Boosting
– Running cost is much higher
– Cannot generate reliable probability estimates for skewed distributions
Analysis
• Error Reduction
– Sampling: the probability estimate decomposes as $f_c(x) = P(c|x) + \beta_c + \eta_c(x)$, where $\beta_c$ is the bias and $\eta_c(x)$ the variance term; the shift of the estimated decision boundary has variance $\sigma_b^2 = (\sigma_{\eta_+}^2 + \sigma_{\eta_-}^2)/s^2$, with $s$ a constant determined by the class posteriors at the true boundary, and a larger training set shrinks the variance terms
– Ensemble: averaging $k$ models gives $f_c^{E}(x) = P(c|x) + \beta_c + \bar{\eta}_c(x)$ with boundary variance
$$\sigma_{b_E}^2 = \frac{1}{k^2} \sum_{i=1}^{k} \sigma_{b_i}^2$$
(a short derivation follows)
• Efficiency Analysis
– Single model: $O\big(d\,(n_p + k\,n_q)\log(n_p + k\,n_q)\big)$
– Ensemble: $O\big(d\,k\,(n_p + n_q)\log(n_p + n_q)\big)$
– Ensemble is more efficient: with $n_p \ll n_q$, the total amount of data scanned is comparable, but each of the $k$ models is trained on a much smaller set, so the $\log$ factor is smaller
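As a brief sketch of where the ensemble variance formula comes from (the standard variance-of-an-average argument; the independence assumption is justified by the disjoint negative subsets):

```latex
% If the per-model boundary errors b_i are uncorrelated with variances \sigma_{b_i}^2,
% the ensemble's boundary error is their average, so
\sigma_{b_E}^2
  = \operatorname{Var}\!\left(\frac{1}{k}\sum_{i=1}^{k} b_i\right)
  = \frac{1}{k^2}\sum_{i=1}^{k} \sigma_{b_i}^2
  \le \frac{1}{k}\,\max_i \sigma_{b_i}^2 .
% Averaging k classifiers with uncorrelated errors therefore shrinks the boundary
% variance by roughly a factor of k compared with any single member.
```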
Experiments
• Measures
– Mean Squared Error: $L = \frac{1}{n} \sum_{i=1}^{n} \big(f(x_i) - P(+\mid x_i)\big)^2$
– ROC Curve
– Recall-Precision Curve
• Methods Compared
– NS: No Sampling + Single Model
– SS: Sampling + Single Model
– SE: Sampling + Ensemble (the proposed approach; measurement sketch below)
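A minimal sketch of computing these measures with scikit-learn (assumed tooling; the arrays below are placeholders for the true labels, the model's estimates $f(x_i)$, and the reference posteriors $P(+\mid x_i)$):

```python
import numpy as np
from sklearn.metrics import roc_curve, precision_recall_curve

# Placeholder arrays: true labels, predicted P(+|x), and reference posteriors P(+|x).
y_true = np.array([0, 0, 1, 0, 1])
f_x = np.array([0.1, 0.3, 0.8, 0.2, 0.6])       # model's probability estimates
p_true = np.array([0.05, 0.2, 0.9, 0.1, 0.7])   # ground-truth posteriors (known for synthetic data)

mse = np.mean((f_x - p_true) ** 2)              # L = (1/n) * sum (f(x_i) - P(+|x_i))^2
fpr, tpr, _ = roc_curve(y_true, f_x)            # points on the ROC curve
prec, rec, _ = precision_recall_curve(y_true, f_x)  # points on the recall-precision curve
```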
Experimental Results (1)
[Figure: Mean Squared Error on Synthetic Data, comparing SE, NS, and SS under feature change (only P(x) changes), conditional change (only P(y|x) changes), and dual change (both P(x) and P(y|x) change)]
Experimental Results (2)
[Figure: Mean Squared Error on Real Data, comparing SE, NS, and SS on Thyroid1, Thyroid2, Opt, Letter, and Covtype]
Experimental Results (3)
[Figures: ROC curve and recall-precision plot on synthetic data]
Experimental Results (4)
[Figures: ROC curve and recall-precision plot on real data]
Experimental Results (5)
[Figure: training time comparison]
Conclusions
• General issues in stream classification
– concept drifts
– descriptive model
– probability estimation
• Mining skewed data streams
– sampling and ensemble techniques
– accurate and efficient
• Wide applications
– graph data
– air force data
Thanks!
• Any questions?