IJCAI03scan

Inductive Learning in Less Than One Sequential Data Scan

Wei Fan, Haixun Wang, and Philip S. Yu (IBM T.J. Watson)
Shaw-hwa Lo (Columbia University)

Problems
- Many inductive learning algorithms are main-memory-based.
  - When the dataset is bigger than memory, the algorithm "thrashes".
  - Efficiency is very low once thrashing happens.
- For algorithms that are not memory-based:
  - Do we need to see every piece of data? Probably not.
  - Overfitting curve? Not practical.

Basic Idea: One Scan
[Diagram: the dataset is split into Batches 1 to 4; the algorithm learns one model from each batch.]

Loss and Benefit
- Loss function: evaluates performance.
- Benefit matrix: the inverse of the loss function.
- Traditional 0-1 loss:
  - b[x,x] = 1, b[x,y] = 0
- Cost-sensitive loss:
  - Overhead of $90 to investigate a fraud.
  - b[fraud, fraud] = $tranamt - $90.
  - b[fraud, nonfraud] = $0.
  - b[nonfraud, fraud] = -$90.
  - b[nonfraud, nonfraud] = $0.

Probabilistic Modeling
- p(c|x) is the probability that x is an instance of class c.
- e(c|x) = sum over classes c' of b[c', c] p(c'|x) is the expected benefit of predicting c.
- Optimal decision: predict the class c with the largest e(c|x).
- Example:
  - p(fraud|x) = 0.5 and tranamt = $200
  - e(fraud|x) = b[fraud,fraud] p(fraud|x) + b[nonfraud,fraud] p(nonfraud|x)
    = (200 - 90) x 0.5 + (-90) x 0.5 = $10
  - e(nonfraud|x) = b[fraud,nonfraud] p(fraud|x) + b[nonfraud,nonfraud] p(nonfraud|x)
    = 0 x 0.5 + 0 x 0.5 = $0 (always 0)
  - Predict fraud, since we get $10 back.

Combining Multiple Models
- Individual benefits: each base model k produces its own expected benefit e_k(c|x).
- Averaged benefits: E_K(c|x) = (1/K) x (sum over the K models of e_k(c|x)).
- Optimal decision: predict the class c with the largest E_K(c|x).

How about accuracy?

Do we need all K models?
- We stop learning if k (< K) models have the same accuracy as K models with confidence p.
- We end up scanning the dataset less than once.
- Use statistical sampling.

Less than one scan
[Diagram: the algorithm learns one model per batch (Batches 1 to 4). After each model it asks "Accurate enough?"; on "No" it moves on to the next batch, on "Yes" it stops.]

Hoeffding's inequality
- A random variable lies within a range R = a - b.
- After n observations, its mean value is y.
- What is the error of y with confidence p, regardless of the distribution?
- In its standard form: with confidence p, the true mean lies within y +/- eps, where eps = sqrt(R^2 ln(1/(1 - p)) / (2n)).

When can we stop?
- Use k models.
- Find the label with the highest expected benefit, and its Hoeffding error.
- Find the label with the second highest expected benefit, and its Hoeffding error.
- The majority label is still the prediction with confidence p iff the highest expected benefit minus its error exceeds the second highest expected benefit plus its error.

Less Than One Scan Algorithm
- Iterate the process on every instance from a validation set.
- Continue until every instance has the same prediction as the full ensemble with confidence p.

Validation Set
- If the check fails on one example x, we do not need to examine any other example.
  - So we can keep only one example in memory at a time.
- If the prediction of k base models on x is the same as that of K models:
  - It is very likely that k+1 models will also agree with the K models at the same confidence.

Validation Set
- At any time, we only need to keep one data item x from the validation set.
  - It is read sequentially from the validation set.
  - The validation set is read only once.
- What can be a validation set?
  - The training set itself.
  - A separate holdout set.

Amount of Data Scan
- Training set: at most one scan.
- Validation set: one scan.
- Using the training set as the validation set:
  - Once we decide to train a model from a batch, we do not use that batch for validation again.
- How much is used to train models? Less than one scan.

Experiments
- Donation dataset:
  - Total benefits: donated amount minus the overhead of sending solicitations.

Experiment Setup
- Inductive learners:
  - C4.5
  - RIPPER
  - NB
- Number of base models:
  - {8,16,32,64,128,256}
  - Results are reported as the average over these settings.

Baseline Results (with C4.5)
- Single model: $13292.7
- Complete one scan: $14702.9
  - The average over {8,16,32,64,128,256} base models.
- The one-scan ensemble is about $1410 higher than the single model.

Less-than-one scan (with C4.5)
- Full one scan: $14702
- Less-than-one scan: $14828
  - Actually a little higher, by $126.
- How much data is scanned at 99.7% confidence?
 71% Other datasets  Credit card fraud detection  Total benefits:  Recovered fraud amount minus overhead of investigation Results  Baseline single: $733980 (with curtailed probability)  One scan ensemble: $804964  Less than one scan: $804914  Data scan amount: 64% Smoothing effect. Related Work  Ensenbles:  Meta-learning (Chan and Stolfo): 2 scans  Bagging (Breiman) and AdaBoost (Freund and Schapire): multiple  Use of Hoeffding’s inequality:  Aggregate query (Hellerstein et al)  Streaming decision tree (Hulten and Domingos)  Single decision tree, less than one scan  Scalable decision tree:  SPRINT (Shafer et al): multiple scans  BOAT (Gehrke et al): 2 scans Conclusion  Both “one scan” and “less than one scan” have accuracy either similar or higher than the single model.  “Less than one scan” uses approximately 60% – 90% of data for training with loss of accuracy.

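The stopping rule and the less-than-one-scan loop can be sketched as follows. This is a sketch under stated assumptions, not the authors' implementation. It assumes the standard Hoeffding bound eps = sqrt(R^2 ln(1/(1 - p)) / (2n)), a separate holdout validation set, and a benefit range per label supplied by the caller. The names `train`, `ranges`, `prediction_is_stable` and `less_than_one_scan` are hypothetical, and a base model is simply a function from an example to a dict of expected benefits.

```python
import math


def hoeffding_error(value_range, n, confidence):
    """Standard Hoeffding error of a mean of n observations (assumed form)."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / (1.0 - confidence)) / (2.0 * n))


def averaged_benefits(models, x):
    """Average the expected benefits that the k base models assign to x."""
    per_model = [model(x) for model in models]
    return {label: sum(b[label] for b in per_model) / len(per_model)
            for label in per_model[0]}


def prediction_is_stable(models, x, ranges, confidence):
    """True if the top label beats the runner-up after both Hoeffding errors.

    Needs at least two labels; ranges maps each label to its benefit range R.
    """
    avg = averaged_benefits(models, x)
    best, runner_up = sorted(avg, key=avg.get, reverse=True)[:2]
    k = len(models)
    eps_best = hoeffding_error(ranges[best], k, confidence)
    eps_runner_up = hoeffding_error(ranges[runner_up], k, confidence)
    return avg[best] - eps_best > avg[runner_up] + eps_runner_up


def less_than_one_scan(batches, train, validation, ranges, confidence):
    """Train one model per batch until every validation example is stable."""
    models = []
    batch_iter = iter(batches)
    for x in validation:  # read once, one example in memory at a time
        while not models or not prediction_is_stable(models, x, ranges, confidence):
            batch = next(batch_iter, None)
            if batch is None:
                return models  # every batch was used: a full single scan
            models.append(train(batch))
    return models


# Toy demo with made-up data: every base model reports the same benefits.
def toy_train(batch):
    return lambda x: {"fraud": 100.0, "nonfraud": 0.0}


ranges = {"fraud": 200.0, "nonfraud": 0.0}
models = less_than_one_scan(range(32), toy_train, ["x1", "x2", "x3"], ranges, 0.997)
print(len(models), "of 32 batches used")  # prints: 12 of 32 batches used
```

The loop never revisits a validation example that has already passed, which mirrors the slides' argument that an example stable under k models is very likely to stay stable under k+1 models.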