IJCAI03scan

Inductive Learning in Less Than
One Sequential Data Scan
Wei Fan, Haixun Wang, and Philip S. Yu
IBM T.J.Watson
Shaw-hwa Lo
Columbia University
Problems
 Many inductive algorithms are main
memory-based.
 When the dataset is bigger than the memory, the
algorithm will "thrash".
 Efficiency is very low when thrashing happens.
 For algorithms that are not memory-based:
 Do we need to see every piece of data? Probably
not.
 Overfitting curve? Not practical.
Basic Idea: One Scan Algorithm
[Figure: the dataset is divided into Batch 1 to Batch 4; the algorithm is applied to each batch and produces one model per batch.]
Loss and Benefit
 Loss function:
 Evaluate performance.
 Benefit matrix: the inverse of the loss function
 Traditional 0-1 loss
 b[x,x] = 1, b[x,y] = 0
 Cost-sensitive loss
 Overhead of $90 to investigate a fraud.
 b[fraud, fraud] = $tranamt - $90.
 b[fraud, nonfraud] = $0.
 b[nonfraud, fraud] = -$90.
 b[nonfraud, nonfraud] = $0.
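Probabilistic Modeling
 \(p(\ell_i \mid x)\) is the probability that x is an instance of class \(\ell_i\)
 \(e(\ell_i \mid x) = \sum_{j} b[\ell_j, \ell_i] \, p(\ell_j \mid x)\) is the expected benefit of predicting \(\ell_i\)
 Optimal decision: \(\ell^* = \arg\max_{\ell_i} e(\ell_i \mid x)\)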
Example
 p(fraud|x) = 0.5 and tranamt = $200
 e(fraud|x) = b[fraud,fraud] p(fraud|x)
+ b[nonfraud,fraud] p(nonfraud|x)
 = (200 - 90) × 0.5 + (-90) × 0.5 = $10
 e(nonfraud|x) =
b[fraud,nonfraud] p(fraud|x) +
b[nonfraud,nonfraud] p(nonfraud|x)
 = 0 × 0.5 + 0 × 0.5 = $0 (always 0)
 Predict fraud since we get $10 back.
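The arithmetic in this example can be checked with a minimal Python sketch. The function names (`benefit`, `expected_benefit`, `optimal_decision`) are illustrative assumptions and do not come from the slides; only the $90 overhead and the benefit matrix entries do.

```python
OVERHEAD = 90.0  # cost of investigating one transaction, as stated on the slide


def benefit(true_label, predicted, tranamt):
    """Benefit matrix entry b[true_label, predicted] for the fraud example."""
    if predicted == "fraud":
        # Investigating recovers the amount if it really is fraud,
        # and costs the overhead either way.
        return tranamt - OVERHEAD if true_label == "fraud" else -OVERHEAD
    return 0.0  # predicting nonfraud never triggers an investigation


def expected_benefit(predicted, probs, tranamt):
    """Sum of b[true, predicted] * p(true | x) over all true labels."""
    return sum(benefit(t, predicted, tranamt) * p for t, p in probs.items())


def optimal_decision(probs, tranamt):
    """Label with the highest expected benefit."""
    return max(probs, key=lambda label: expected_benefit(label, probs, tranamt))


probs = {"fraud": 0.5, "nonfraud": 0.5}
print(expected_benefit("fraud", probs, 200.0))     # 10.0
print(expected_benefit("nonfraud", probs, 200.0))  # 0.0
print(optimal_decision(probs, 200.0))              # fraud
```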
Combining Multiple Models
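 Individual benefits: \(e_k(\ell_i \mid x) = \sum_{j} b[\ell_j, \ell_i] \, p_k(\ell_j \mid x)\), computed from the k-th model
 Averaged benefits: \(E_K(\ell_i \mid x) = \frac{1}{K} \sum_{k=1}^{K} e_k(\ell_i \mid x)\)
 Optimal decision: \(\arg\max_{\ell_i} E_K(\ell_i \mid x)\)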
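One way to picture the combination step is the sketch below. It assumes each base model has already produced an expected benefit per label for the same example x; the function names and the sample numbers are hypothetical, chosen only for illustration.

```python
def averaged_benefits(per_model_benefits):
    """Average the expected benefit of each label over all base models.

    per_model_benefits: list of dicts mapping label -> expected benefit,
    one dict per base model, all for the same example x.
    """
    k = len(per_model_benefits)
    labels = per_model_benefits[0].keys()
    return {label: sum(m[label] for m in per_model_benefits) / k for label in labels}


def ensemble_decision(per_model_benefits):
    """Pick the label whose averaged expected benefit is highest."""
    avg = averaged_benefits(per_model_benefits)
    return max(avg, key=avg.get)


# Three hypothetical base models that disagree on one transaction.
outputs = [
    {"fraud": 10.0, "nonfraud": 0.0},
    {"fraud": -20.0, "nonfraud": 0.0},
    {"fraud": 40.0, "nonfraud": 0.0},
]
print(averaged_benefits(outputs))
print(ensemble_decision(outputs))
```

Averaging benefits rather than voting on labels lets one confident model outweigh a mildly negative one, which is a design choice of this sketch and not a claim about the authors' implementation.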
How about accuracy?
Do we need all K models?
 We stop learning if k (< K) models
have the same accuracy as K models
with confidence p.
 Ends up scanning the dataset less
than once.
 Use statistical sampling.
Less than one scan
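[Figure: the algorithm trains one model per batch; after each new model it checks "Accurate enough?". If no, it moves on to the next batch; if yes, it stops and the remaining batches are never read.]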
Hoeffding’s inequality
 A random variable whose values lie within a range of width R = a - b.
 After n observations, its mean value
is y.
 How large can the error of y be, with confidence p,
regardless of the distribution?
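The formula itself did not survive on this slide. As a hedged illustration, the sketch below uses the commonly cited one-sided form of Hoeffding's bound, \(\varepsilon = \sqrt{R^2 \ln(1/(1-p)) / (2n)}\); this form is an assumption here and the authors' exact variant may differ. The function name `hoeffding_error` is hypothetical.

```python
import math


def hoeffding_error(value_range, n, confidence):
    """Error bound epsilon for the mean of n observations.

    Assumes the standard one-sided Hoeffding form: with probability at
    least `confidence`, the true mean is no more than epsilon away from
    the observed mean on one side, whatever the underlying distribution.
    """
    delta = 1.0 - confidence
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


# The bound shrinks as more observations are collected.
for n in (10, 100, 1000):
    print(n, hoeffding_error(1.0, n, 0.95))
```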
When can we stop?
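 Use k models
 \(E_k(\ell_a \mid x)\): highest expected benefit
 Hoeffding’s error: \(\varepsilon_k(\ell_a \mid x)\)
 \(E_k(\ell_b \mid x)\): second highest expected benefit
 Hoeffding’s error: \(\varepsilon_k(\ell_b \mid x)\)
 The majority label is still \(\ell_a\) with confidence p iff \(E_k(\ell_a \mid x) - \varepsilon_k(\ell_a \mid x) > E_k(\ell_b \mid x) + \varepsilon_k(\ell_b \mid x)\)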
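Read as an interval comparison, the stopping test on this slide asks whether the best label's lower bound still exceeds the runner-up's upper bound. The sketch below shows that reading; the function name `can_stop` and the dictionary layout are assumptions made for illustration.

```python
def can_stop(avg_benefits, errors):
    """Return True if the top label cannot be overtaken by the runner-up.

    avg_benefits: label -> expected benefit averaged over the k models so far.
    errors: label -> Hoeffding error bound for that averaged benefit.
    """
    ranked = sorted(avg_benefits, key=avg_benefits.get, reverse=True)
    best, second = ranked[0], ranked[1]
    lower_best = avg_benefits[best] - errors[best]
    upper_second = avg_benefits[second] + errors[second]
    return lower_best > upper_second


# Hypothetical numbers: the intervals are disjoint, so the label is settled.
print(can_stop({"fraud": 10.0, "nonfraud": 0.0}, {"fraud": 3.0, "nonfraud": 3.0}))
# Wider error bounds make the intervals overlap, so more models are needed.
print(can_stop({"fraud": 10.0, "nonfraud": 0.0}, {"fraud": 6.0, "nonfraud": 6.0}))
```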
Less Than One Scan Algorithm
 Iterate the process on every instance from a validation set, until every instance has the same prediction as the full ensemble with confidence p.
Validation Set
 If we fail on one example x, we do
not need to examine any other example.
 So we can keep only one example in
memory at a time.
 If the prediction of k base models on x is
the same as that of K models,
 it is very likely that k+1 models will also
give the same prediction as K models with the same
confidence.
Validation Set
 At any time, we only need to keep one
data item x from the validation set.
 It is sequentially read from the
validation set.
 The validation set is read only once.
 What can be a validation set?
 The training set itself
 A separate holdout set.
Amount of Data Scan
 Training set: at most once.
 Validation set: once.
 Using the training set as the validation set:
 Once we decide to train a model from a
batch, we do not use it for validation
again.
 How much is used to train models? Less
than one scan.
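The control flow described on the last few slides can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: `train` is a hypothetical callback that turns a batch into a model, each model maps an example to a dictionary of expected benefits per label, one shared range R is used for every label, and the Hoeffding bound is the standard one-sided form with n equal to the number of models.

```python
import math


def hoeffding_error(value_range, n, confidence):
    # Standard one-sided Hoeffding bound (assumed form).
    return math.sqrt(value_range ** 2 * math.log(1.0 / (1.0 - confidence)) / (2.0 * n))


def confident(models, x, value_range, confidence):
    """True if the current k models already settle the prediction on x."""
    outputs = [m(x) for m in models]
    k = len(outputs)
    avg = {label: sum(o[label] for o in outputs) / k for label in outputs[0]}
    ranked = sorted(avg, key=avg.get, reverse=True)
    eps = hoeffding_error(value_range, k, confidence)
    return avg[ranked[0]] - eps > avg[ranked[1]] + eps


def less_than_one_scan(batches, validation, train, value_range, confidence):
    """Train models batch by batch, stopping once validation is satisfied."""
    batches = iter(batches)
    models = [train(next(batches))]
    for x in validation:  # read once, one item in memory at a time
        while not confident(models, x, value_range, confidence):
            batch = next(batches, None)
            if batch is None:  # data exhausted: this is the full one-scan ensemble
                return models
            models.append(train(batch))
    return models
```

In this sketch an example that fails the test stays in memory while further batches are consumed, and earlier examples are not re-checked, which relies on the slide's observation that agreement with k models is very likely to persist with k+1 models.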
Experiments
 Donation Dataset:
 Total benefits: donated amount minus the
overhead of sending solicitations.
Experiment Setup
 Inductive learners:
 C4.5
 RIPPER
 NB
 Number of base models:
 {8,16,32,64,128,256}
 Results are reported as the average over these settings
Baseline Results (with C4.5)
 Single model: $13292.7
 Complete One Scan: $14702.9
 The average of {8,16,32,64,128,256}
 We are actually $1410 higher than the
single model.
Less-than-one scan (with C4.5)
 Full one scan: $14702
 Less-than-one scan: $14828
 Actually a little higher, $126.
 How much data scanned with 99.7%
confidence?
 71%
Other datasets
 Credit card fraud detection
 Total benefits:
 Recovered fraud amount minus overhead
of investigation
Results
 Baseline single: $733980 (with
curtailed probability)
 One scan ensemble: $804964
 Less than one scan: $804914
 Data scan amount: 64%
Smoothing effect.
Related Work
 Ensembles:
 Meta-learning (Chan and Stolfo): 2 scans
 Bagging (Breiman) and AdaBoost (Freund and
Schapire): multiple scans
 Use of Hoeffding’s inequality:
 Aggregate query (Hellerstein et al.)
 Streaming decision tree (Hulten and Domingos)
 Single decision tree, less than one scan
 Scalable decision tree:
 SPRINT (Shafer et al.): multiple scans
 BOAT (Gehrke et al.): 2 scans
Conclusion
 Both “one scan” and “less than one
scan” have accuracy either similar to or
higher than that of the single model.
 “Less than one scan” uses
approximately 60% – 90% of data for
training without loss of accuracy.