A General Framework for Mining Concept-Drifting Data Streams with Skewed Distributions
Jing Gao†, Wei Fan‡, Jiawei Han†, Philip S. Yu‡
†University of Illinois at Urbana-Champaign  ‡IBM T. J. Watson Research Center

Introduction (1)
• Data Stream
– Continuously arriving data flow
– Applications: network traffic, credit card transaction flows, phone call records, etc.

Introduction (2)
• Stream Classification
– Construct a classification model based on past records
– Use the model to predict labels for new data
– Helps decision making (e.g., deciding whether a transaction is fraud)

Framework
[Figure: data chunks arrive over time; a classification model is built on the labeled chunks and used to predict labels for the incoming chunk]

Concept Drifts
• Changes in P(x,y)
– P(x,y) = P(y|x)P(x), where x is the feature vector and y is the class label
– Four cases: no change, feature change (only P(x) changes), conditional change (only P(y|x) changes), dual change (both change)
– Expected error is not a good indicator of concept drifts
– Training on the most recent data could help reduce expected error
[Figure: decision boundaries at time stamps 1, 11, and 21, illustrating drift]

Issues in Stream Classification (1)
• Generative Model
– Assumes P(y|x) follows some distribution
• Descriptive Model
– Lets the data decide
• Stream Data
– Distribution unknown and evolving, so a descriptive model is preferable

Issues in Stream Classification (2)
• Label Prediction
– Classify x into exactly one class
• Probability Estimation
– Assign x to every class with some probability
• Stream Applications
– Stochastic; prediction confidence information is needed, so probability estimates are preferable

Mining Skewed Data Streams
• Skewed Distribution
– Positive examples are rare: credit card frauds, network intrusions
• Existing Stream Classification Algorithms
– Evaluated on balanced data
• Problems
– Minority examples are ignored; a decision tree may classify every leaf node as negative
– The cost of misclassifying minority examples is usually huge

Stream Ensemble Approach (1)
• Step 1: Sampling
– The current training chunk alone has insufficient positive examples
– Keep the positive examples seen so far and sample the negative examples from the current chunk

Stream Ensemble Approach (2)
• Step 2: Ensemble
– Split the sampled negative examples into k disjoint sets and train classifiers C1, C2, …, Ck
– Average their outputs: f_E(x) = (1/k) Σ_{i=1..k} f_i(x)

Why does this approach work?
• Incorporation of old positive examples
– Increases the training size and reduces variance
– Negative examples reflect current concepts, so the increase in boundary bias is small
• Ensemble
– Reduces the variance caused by a single model
– The disjoint sets of negative examples make the classifiers' errors uncorrelated
• Bagging & Boosting
– Their running cost is much higher
– They cannot generate reliable probability estimates for skewed distributions

Analysis
• Error Reduction
– Sampling: a single model's output decomposes as f_c(x) = P(c|x) + β_c + η_c(x), i.e., the true posterior plus a bias term β_c and a variance term η_c(x)
– Ensemble: the averaged output f_CE(x) = P(c|x) + β_c + η̄_c(x) has variance σ²_E = (1/k²) Σ_{i=1..k} σ²_i, which is smaller than a single model's variance when the errors are uncorrelated
• Efficiency Analysis
– Single model: O(d(n_p + k·n_q) log(n_p + k·n_q))
– Ensemble: O(d·k(n_p + n_q) log(n_p + n_q))
– The ensemble is more efficient: each member is trained on a much smaller sample

Experiments
• Measures
– Mean Squared Error: L = (1/n) Σ_{i=1..n} (f(x_i) − P(+|x_i))²
– ROC Curve
– Recall-Precision Curve
• Baseline Methods
– NS: No Sampling + Single Model
– SS: Sampling + Single Model
– SE: Sampling + Ensemble

Experimental Results (1)
[Figure: mean squared error on synthetic data for SE, NS, and SS under feature change (only P(x) changes), conditional change (only P(y|x) changes), and dual change (both change)]

Experimental Results (2)
[Figure: mean squared error of SE, NS, and SS on the real data sets Thyroid1, Thyroid2, Opt, Letter, and Covtype]

Experimental Results (3)
[Figure: ROC curve and recall-precision plot on synthetic data]

Experimental Results (4)
[Figure: ROC curve and recall-precision plot on real data]

Experimental Results (5)
[Figure: training time comparison]

Conclusions
• General issues in stream classification
– Concept drifts
– Descriptive models
– Probability estimation
• Mining skewed data streams
– Sampling and ensemble techniques
– Accurate and efficient
• Wide applications
– Graph data
– Air force data

Thanks!
• Any questions?
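The sampling-plus-ensemble procedure from the slides can be sketched in a few lines of numpy. This is a minimal illustration, not the authors' implementation: the Gaussian naive Bayes base learner, the `make_chunk` synthetic generator, and all sizes and parameters are assumptions chosen for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_chunk(n_pos, n_neg):
    # Hypothetical synthetic chunk: positives ~ N(2, 1), negatives ~ N(0, 1)
    # on two features. Real streams would supply these chunks.
    xp = rng.normal(2.0, 1.0, size=(n_pos, 2))
    xn = rng.normal(0.0, 1.0, size=(n_neg, 2))
    return xp, xn

class GaussNB:
    """Tiny Gaussian naive Bayes used as an assumed base learner."""
    def fit(self, xp, xn):
        self.mp, self.sp = xp.mean(0), xp.std(0) + 1e-9
        self.mn, self.sn = xn.mean(0), xn.std(0) + 1e-9
        return self
    def prob_pos(self, x):
        # Log-likelihoods under each class; equal priors in each balanced sample.
        lp = -0.5 * (((x - self.mp) / self.sp) ** 2).sum(1) - np.log(self.sp).sum()
        ln = -0.5 * (((x - self.mn) / self.sn) ** 2).sum(1) - np.log(self.sn).sum()
        return 1.0 / (1.0 + np.exp(ln - lp))

# Step 1 (sampling): keep all positive examples seen so far, and split the
# current chunk's negatives into k disjoint subsets.
k = 5
pos_all = np.vstack([make_chunk(20, 0)[0] for _ in range(3)])  # accumulated positives
_, neg_cur = make_chunk(0, 500)
neg_splits = np.array_split(rng.permutation(neg_cur), k)

# Step 2 (ensemble): train one classifier per disjoint negative subset
# and average their probability estimates, f_E(x) = (1/k) sum_i f_i(x).
models = [GaussNB().fit(pos_all, neg) for neg in neg_splits]
def f_E(x):
    return np.mean([m.prob_pos(x) for m in models], axis=0)

x_test = np.vstack(make_chunk(50, 50))  # 50 positives followed by 50 negatives
p = f_E(x_test)
print(p.shape)  # (100,)
```

Averaging the k probability estimates is what yields the variance reduction discussed on the "Why does this approach work?" slide: with uncorrelated errors, the ensemble variance shrinks roughly by a factor of k.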