The Weighted Majority Algorithm. Course Retrospective

Machine Learning Theory
Maria Florina Balcan
04/29/10
Plan for today:
- problem of “combining expert advice”
- course retrospective and open questions
Using “expert” advice
Say we want to predict the stock market.
• We solicit n “experts” for their advice. (Will the
market go up or down?)
• We then want to use their advice somehow to
make our prediction.
Can we do nearly as well as best in hindsight?
[“expert” ≡ someone with an opinion. Not necessarily someone
who knows anything.]
Simpler question
• We have n “experts”.
• One of these is perfect (never makes a mistake).
We just don’t know which one.
• Can we find a strategy that makes no more than
lg(n) mistakes?
Answer: sure. Just take majority vote over all
experts that have been correct so far.
Each mistake cuts # available by factor of 2.
Note: this means ok for n to be very large.
“halving algorithm”
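The halving strategy above can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's own code: the function names, the boolean encoding of "up"/"down", and the tie-breaking rule (ties go to True) are assumptions made for the example.

```python
def halving_predict(alive, advice):
    """Majority vote among experts that have been correct so far."""
    ups = sum(1 for i in alive if advice[i])
    return 2 * ups >= len(alive)  # assumption: ties are broken toward True


def halving_update(alive, advice, outcome):
    """Cross off every expert whose advice did not match the outcome."""
    return {i for i in alive if advice[i] == outcome}


def run_halving(advice_rounds, outcomes):
    """Run the halving algorithm; returns (# of our mistakes, surviving experts).

    advice_rounds[t][i] is expert i's boolean prediction at round t.
    Assumes at least one expert is perfect, as on the slide.
    """
    alive = set(range(len(advice_rounds[0])))
    mistakes = 0
    for advice, outcome in zip(advice_rounds, outcomes):
        if halving_predict(alive, advice) != outcome:
            mistakes += 1  # at least half of the live experts were wrong
        alive = halving_update(alive, advice, outcome)
    return mistakes, alive
```

Each time the vote is wrong, at least half of the surviving experts are removed, which is where the lg(n) bound comes from.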
Using “expert” advice
If one expert is perfect, can get ≤ lg(n) mistakes
with halving alg.
But what if none is perfect? Can we do nearly as
well as the best one in hindsight?
Strategy #1:
• Iterated halving algorithm. Same as before, but
once we've crossed off all the experts, restart
from the beginning.
• Makes at most log(n)*[OPT+1] mistakes, where
OPT is #mistakes of the best expert in
hindsight.
Seems wasteful. Constantly forgetting what we've
“learned”. Can we do better?
Weighted Majority Algorithm
Intuition: Making a mistake doesn't completely
disqualify an expert. So, instead of crossing
off, just lower its weight.
Weighted Majority Alg:
– Start with all experts having weight 1.
– Predict based on weighted majority vote.
– Penalize mistakes by cutting weight in half.
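A minimal Python sketch of the three steps above. The function name, the `penalty` parameter, the boolean encoding of predictions, and the tie-breaking rule are assumptions for illustration only.

```python
def weighted_majority(advice_rounds, outcomes, penalty=0.5):
    """Deterministic Weighted Majority; returns (# of our mistakes, final weights).

    advice_rounds[t][i] is expert i's boolean prediction at round t.
    """
    n = len(advice_rounds[0])
    weights = [1.0] * n  # start with all experts having weight 1
    mistakes = 0
    for advice, outcome in zip(advice_rounds, outcomes):
        up = sum(w for w, a in zip(weights, advice) if a)
        down = sum(weights) - up
        prediction = up >= down  # weighted majority vote (ties go to True)
        if prediction != outcome:
            mistakes += 1
        # penalize each expert that was wrong by cutting its weight
        weights = [w * penalty if a != outcome else w
                   for w, a in zip(weights, advice)]
    return mistakes, weights
```

With `penalty=0.5` this matches the rule on the slide; other values are a generalization and not covered by the 2.4(m + lg n) analysis.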
Analysis: do nearly as well as best
expert in hindsight
• M = # mistakes we've made so far.
• m = # mistakes best expert has made so far.
• W = total weight (starts at n).
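• After each mistake, W drops by at least 25%
(at least half the weight was on the wrong side, and that weight is halved).
So, after M mistakes, W is at most $n(3/4)^M$.
• Weight of best expert is $(1/2)^m$. So,
$(1/2)^m \le n(3/4)^M$, i.e. $(4/3)^M \le n\,2^m$, which gives
$M \le 2.4(m + \lg n)$ [a constant ratio].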
Randomized Weighted Majority
2.4(m + lg n) not so good if the best expert makes a
mistake 20% of the time. Can we do better? Yes.
• Instead of taking majority vote, use weights as
probabilities. (e.g., if 70% on up, 30% on down, then pick
70:30) Idea: smooth out the worst case.
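• Also, generalize ½ to $1-\varepsilon$.
The analysis gives, for M = expected # mistakes,
$M \le \frac{m \ln(1/(1-\varepsilon)) + \ln n}{\varepsilon} \approx (1+\varepsilon/2)\,m + \frac{1}{\varepsilon}\ln n$.
[Unlike most worst-case bounds, the numbers are pretty good.]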
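A hedged Python sketch of the randomized variant: pick an expert with probability proportional to its weight, follow its advice, and multiply the weight of every wrong expert by 1 − ε. The function name, the `eps` and `rng` parameters, and the fixed default seed are assumptions for illustration.

```python
import random


def randomized_weighted_majority(advice_rounds, outcomes, eps=0.25, rng=None):
    """Randomized Weighted Majority.

    Returns (realized # mistakes, expected # mistakes, final weights).
    advice_rounds[t][i] is expert i's boolean prediction at round t.
    """
    rng = rng or random.Random(0)  # assumption: fixed seed for reproducibility
    n = len(advice_rounds[0])
    weights = [1.0] * n
    mistakes = 0
    expected_mistakes = 0.0
    for advice, outcome in zip(advice_rounds, outcomes):
        total = sum(weights)
        wrong = sum(w for w, a in zip(weights, advice) if a != outcome)
        expected_mistakes += wrong / total  # F_t: weight fraction that is wrong
        # use the weights as probabilities: follow one randomly chosen expert
        chosen = rng.choices(range(n), weights=weights)[0]
        if advice[chosen] != outcome:
            mistakes += 1
        # penalize wrong experts by a factor (1 - eps) instead of 1/2
        weights = [w * (1 - eps) if a != outcome else w
                   for w, a in zip(weights, advice)]
    return mistakes, expected_mistakes, weights
```

The second return value is the sum of the per-round mistake probabilities, which is the quantity the analysis bounds; the realized count depends on the random draws.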
Analysis
• Say at time t we have fraction $F_t$ of weight on
experts that made a mistake.
• So, we have probability $F_t$ of making a mistake, and
we remove an $\varepsilon F_t$ fraction of the total weight.
– $W_{final} = n(1-\varepsilon F_1)(1-\varepsilon F_2)\cdots$
– $\ln(W_{final}) = \ln(n) + \sum_t \ln(1-\varepsilon F_t) \le \ln(n) - \varepsilon \sum_t F_t$
(using $\ln(1-x) < -x$)
$= \ln(n) - \varepsilon M$.
($\sum_t F_t = E[\#\text{ mistakes}] = M$)
• If best expert makes m mistakes, then $\ln(W_{final}) > \ln((1-\varepsilon)^m)$.
• Now solve: $\ln(n) - \varepsilon M > m \ln(1-\varepsilon)$.
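Solving the last inequality for M gives M < (m ln(1/(1 − ε)) + ln n)/ε. The short Python sketch below evaluates that expression and prints the resulting coefficients for a few values of ε; the helper name `rwm_bound` and the chosen ε values are illustrative assumptions.

```python
import math


def rwm_bound(m, n, eps):
    """Bound on E[# mistakes] from solving ln(n) - eps*M > m*ln(1 - eps) for M."""
    return (m * math.log(1.0 / (1.0 - eps)) + math.log(n)) / eps


# Coefficient on m and on ln(n) for a few penalty settings.
for eps in (0.5, 0.25, 0.125):
    slope = math.log(1.0 / (1.0 - eps)) / eps
    print(f"eps={eps}: M <= {slope:.2f} m + {1 / eps:.0f} ln(n)")
```

For these settings the coefficient on m comes out to roughly 1.39, 1.15, and 1.07, while the additive term grows as 2, 4, and 8 times ln(n): a smaller ε tracks the best expert more closely at the cost of a larger additive term.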
Summarizing
• At most $(1+\varepsilon)$ times worse than best
expert in hindsight, with additive $\varepsilon^{-1}\log(n)$.
• If have prior, can replace additive term
with $\varepsilon^{-1}\log(1/p_i)$. [$\varepsilon^{-1}$ × number of bits]
• Often written in terms of additive loss.
If running T time steps, set $\varepsilon$ to get
additive loss $(2T \log n)^{1/2}$.
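One way to see where the square-root term comes from, sketched under two assumptions that are not spelled out on the slide: the small-ε approximation −ln(1 − ε)/ε ≈ 1 + ε/2, and the trivial bound m ≤ T on the best expert's mistakes. Here log is the natural logarithm.

```latex
\begin{align*}
M - m &\le \frac{\varepsilon}{2}\, m + \frac{\ln n}{\varepsilon}
       \le \frac{\varepsilon}{2}\, T + \frac{\ln n}{\varepsilon}, \\
\varepsilon = \sqrt{\frac{2 \ln n}{T}}
  \;&\Longrightarrow\;
  M - m \le \sqrt{\frac{T \ln n}{2}} + \sqrt{\frac{T \ln n}{2}}
        = \sqrt{2\,T \ln n}.
\end{align*}
```

The chosen ε balances the two terms, so the additive loss grows only as the square root of T.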
What can we use this for?
• Can use to combine multiple algorithms to do nearly
as well as best in hindsight.
• Can apply RWM in situations where experts are
making choices that cannot be combined.
– E.g., repeated game-playing.
– E.g., online shortest path problem
[OK if losses in [0,1]. Replace $F_t$ with $P_t \cdot L_t$ and penalize
expert i by $(1-\varepsilon)^{\mathrm{loss}(i)}$.]
• Extensions:
– “bandit” problem.
– efficient algs for some cases with many experts.
– Sleeping experts / “specialists” setting.
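The generalization to losses in [0,1] mentioned above can be sketched as follows: the algorithm plays the distribution P_t given by the normalized weights, suffers expected loss P_t · L_t, and multiplies expert i's weight by (1 − ε) raised to its loss. The function name and parameters are assumptions for illustration, and this covers only the full-information setting, not the bandit or sleeping-experts extensions.

```python
def rwm_with_losses(loss_rounds, eps=0.25):
    """Multiplicative-weights update for losses in [0, 1].

    loss_rounds[t][i] is the loss of expert i at round t.
    Returns (total expected loss of the algorithm, final weights).
    """
    n = len(loss_rounds[0])
    weights = [1.0] * n
    expected_loss = 0.0
    for losses in loss_rounds:
        total = sum(weights)
        probs = [w / total for w in weights]  # P_t
        expected_loss += sum(p * l for p, l in zip(probs, losses))  # P_t . L_t
        # penalize expert i by (1 - eps) ** loss(i)
        weights = [w * (1 - eps) ** l for w, l in zip(weights, losses)]
    return expected_loss, weights
```

With losses restricted to {0, 1} this reduces to the mistake-counting version, since (1 − ε)^0 = 1 and (1 − ε)^1 = 1 − ε.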
Summary and Open Questions
Machine Learning
Incredibly useful in many domains across computer science,
engineering, and science.
• Image Classification
• Spam Detection
• Document Categorization
• Fraud Detection
• Speech Recognition
• Protein Classification
• Computational Advertising
• Branch Prediction
• Etc
Goals of Machine Learning Theory
Develop and analyze models to understand:
• what kinds of tasks we can hope to learn, and from what
kind of data
• what types of guarantees might we hope to achieve
• prove guarantees for practically successful algs (when will
they succeed, how long will they take?);
• develop new algs that provably meet desired criteria
Example: Supervised Classification
Decide which emails are spam and which are important.
[Figure: emails labeled “not spam” and “spam” as input to supervised classification]
Goal: use emails seen so far to produce good prediction
rule for future data.
Two Main Aspects of Supervised Learning
Algorithm Design. How to optimize?
Automatically generate rules that do well on observed data.
Confidence Bounds, Generalization
Guarantees, Sample Complexity
Confidence for rule effectiveness on future data.
Well understood for passive supervised learning.
Other Protocols for Supervised Learning
• Semi-Supervised Learning
Using cheap unlabeled data in addition to labeled data.
• Active Learning
The algorithm interactively asks for labels of informative examples.
Theoretical understanding severely lacking until a couple of years ago.
Lots of progress recently. We will cover some of these.
• Learning with Membership Queries
• Statistical Query Learning
Topics we covered
• Basic models for supervised learning: PAC and SLT.
• Simple algos and hardness results for supervised
learning.
• Standard Sample Complexity Results (VC dimension)
• Weak-learning vs. Strong-learning
• Classic, state of the art algorithms: AdaBoost and
SVM.
Structure of the Class
• Modern Sample Complexity Results
• Rademacher Complexity
• Margin analysis of Boosting and SVM
• Incorporating Unlabeled Data in the Learning
Process.
• Incorporating Interaction in the Learning Process:
• Active Learning
• Learning with Membership Queries
• Classification noise and the Statistical-Query model
• Learning Real Valued Functions
Open Questions
• In the classic PAC model
• learning decision trees, DNF
• learning functions with a few relevant vars (junta problem)
• Active learning and SSL
• right sample complexity quantities
• interesting positive algorithmic results
• Models and algorithms for exciting new paradigms
• e.g., transfer learning, multi-agent learning, never ending
learning