Connections between Learning
Theory, Game Theory, and
Optimization
Lecture 1, August 24th 2010
Maria Florina (Nina) Balcan
Big Picture
Over the past decades, many important and deep connections have emerged between:
• machine learning theory
• algorithmic game theory
• combinatorial optimization
We will explore such connections, discussing:
• fundamental topics in each area.
• how ideas from each area can shed light on the others.
Outline
Online learning. Combining expert advice.
Regret minimization (no external regret and no internal
regret). Bandit algorithms.
Zero sum games. Nash equilibria.
Experts learning & Minimax theorem.
Nash equilibria and approximate Nash equilibria in general-sum bimatrix games.
Outline
Learning in a distributional setting.
Sample complexity results.
Weak-learning vs. Strong-learning.
Boosting with connections to game theory.
Quality of equilibria (Price of anarchy/stability).
Games with many players. Potential games.
Dynamics in games and the price of learning.
Outline
Mechanism design (MD).
Combinatorial auctions.
[Social welfare; revenue maximization]
Auctions for digital goods.
• Reductions from MD to algorithm design using machine learning.
Algorithmic pricing problems.
• Online learning for designing online pricing schemes.
Outline
Submodularity with connections to game theory and
machine learning.
• Combinatorial auctions with submodular valuations
• Learning submodular functions
• Other optimization problems involving submodularity (ranking, clustering, etc.)
Admin
• Course web page: http://www.cc.gatech.edu/~ninamf/LGO10/
• 3 homework assignments: exercises/problems (pencil-and-paper problem-solving variety). [50%]
• Project: explore a theoretical question, try some experiments, or
read a couple of papers and explain the idea. Writeup and class
presentation. Groups ok. [50%]
• “Algorithmic Game Theory”, Nisan, Roughgarden,
Tardos, Vazirani
• Other papers, surveys, and tutorials
Online learning, minimizing regret, and
combining expert advice.
• “The weighted majority algorithm”, N. Littlestone & M. Warmuth
• “Online Algorithms in Machine Learning” (survey), A. Blum
• Algorithmic Game Theory, Nisan, Roughgarden, Tardos, Vazirani (eds) [Chapter 4]
• Prediction, Learning, and Games, Cesa-Bianchi & Lugosi
Using “expert” advice
Assume we want to predict the stock market.
• We solicit n “experts” for their advice.
• Will the market go up or down?
• We then want to use their advice somehow to make
our prediction, e.g., by taking a (weighted) majority vote.
Can we do nearly as well as the best expert in hindsight?
Note: “expert” ≡ someone with an opinion.
[Not necessarily someone who knows anything.]
Formal model
• There are n experts.
• For each round t = 1, 2, …, T:
• Each expert makes a prediction in {0,1}.
• The learner (using the experts’ predictions) makes a prediction in {0,1}.
• The learner observes the actual outcome. There is a mistake if the predicted outcome is different from the actual outcome.
Can we do nearly as well as best in hindsight?
Weighted Majority Algorithm
Deterministic Majority Algorithm
– Start with all experts having weight 1.
– Predict based on weighted majority vote.
– If the total weight of experts predicting 1 is at least the total weight predicting 0, i.e., ∑_{i: x_i=1} w_i ≥ ∑_{i: x_i=0} w_i, then predict 1; else predict 0.
– Penalize mistakes by cutting weight in half.
Randomized versions of this algorithm can provide
surprisingly strong guarantees
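To make the algorithm concrete, here is a minimal Python sketch of the deterministic version above (function and variable names are our own, not from the lecture):

# Deterministic Weighted Majority: one weight per expert; predict by
# weighted vote; halve the weight of every expert that errs.
def weighted_majority(expert_preds, outcomes):
    # expert_preds: T x n lists of 0/1 predictions; outcomes: T outcomes in {0,1}
    n = len(expert_preds[0])
    w = [1.0] * n
    mistakes = 0
    for preds, y in zip(expert_preds, outcomes):
        vote1 = sum(wi for wi, p in zip(w, preds) if p == 1)
        vote0 = sum(wi for wi, p in zip(w, preds) if p == 0)
        guess = 1 if vote1 >= vote0 else 0
        mistakes += (guess != y)
        w = [wi / 2 if p != y else wi for wi, p in zip(w, preds)]  # penalize mistakes
    return mistakes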
Randomized Weighted Majority (RWM)
• E[# mistakes] ≤ (1+ε)·OPT + (1/ε)·log(n), where OPT is the number of mistakes of the best expert in hindsight.
• Setting ε = √(log(n)/OPT) to balance the two terms (or using guess-and-double) gives
E[# mistakes] ≤ OPT + 2√(OPT·log n).
Note: of course we might not know OPT, so if running for T time steps, use OPT ≤ T and set ε accordingly to get additive loss 2√(T·log n):
E[# mistakes] ≤ OPT + 2√(T·log n).
So regret/T → 0: a no-regret algorithm.
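A sketch of the randomized version these bounds refer to: the halving rule becomes a (1−ε) multiplier, and instead of a deterministic vote we follow a single expert drawn with probability proportional to its weight (a sketch under our own naming, not the lecture's code):

import random

# Randomized Weighted Majority: follow expert i with probability w_i / sum(w);
# multiply the weight of every mistaken expert by (1 - eps).
def rwm(expert_preds, outcomes, eps):
    n = len(expert_preds[0])
    w = [1.0] * n
    mistakes = 0
    for preds, y in zip(expert_preds, outcomes):
        i = random.choices(range(n), weights=w)[0]
        mistakes += (preds[i] != y)
        w = [wi * (1 - eps) if p != y else wi for wi, p in zip(w, preds)]
    return mistakes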
Many other useful extensions
E.g., what if we have n options, not n predictors?
• We’re not combining n experts; we’re choosing one.
• Nice feature of RWM: it can be applied when the “experts” are n different options, e.g., n different ways to drive to work each day, or n different ways to invest our money.
Other generalizations as well.
Other notions of no regret (e.g., no internal regret).
Online Learning, Game Theory, and
Minimax Optimality
“Game Theory, On-line Prediction, and Boosting”,
Freund & Schapire, GEB
Zero Sum Games
Game defined by a matrix M.
Loss matrix M for the row player:

              Rock    Paper   Scissors
  Rock        1/2     1       0
  Paper       0       1/2     1
  Scissors    1       0       1/2
Assume wlog entries in [0,1].
Row player (Mindy) chooses row i.
Column player (Max) chooses column j (simultaneously).
Mindy’s goal: minimize her loss M(i,j).
Max’s goal: maximize this loss (zero sum).
Randomized Play
Mindy chooses a distribution P over rows.
Max chooses a distribution Q over columns
[simultaneously]
Mindy’s expected loss: M(P,Q) = ∑_{i,j} P(i)·M(i,j)·Q(j) = P^T·M·Q.
Overloading notation, for pure strategies i, j and mixed strategies P, Q:
M(P,j) = Mindy’s expected loss when she plays P and Max plays j;
M(i,Q) = Mindy’s expected loss when she plays i and Max plays Q.
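In code, the expected loss M(P,Q) = P^T·M·Q is one line; a quick NumPy check on the rock-paper-scissors matrix above (variable names are ours):

import numpy as np

M = np.array([[0.5, 1.0, 0.0],    # loss matrix for the row player (Mindy)
              [0.0, 0.5, 1.0],
              [1.0, 0.0, 0.5]])
P = np.array([1/3, 1/3, 1/3])     # Mindy's mixed strategy over rows
Q = np.array([1/2, 1/4, 1/4])     # Max's mixed strategy over columns
print(P @ M @ Q)                  # M(P,Q) = sum_{i,j} P(i) M(i,j) Q(j) = 0.5

For uniform P, every column yields expected loss exactly 1/2, so the output is 0.5 regardless of Q: uniform play guarantees the value of rock-paper-scissors.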
Sequential Play
Say Mindy plays before Max. If Mindy chooses P, then Max will pick Q to maximize M(P,Q), so the loss will be L(P) = max_Q M(P,Q) = max_j M(P,j).
So, Mindy should pick P to minimize L(P); the loss will then be min_P max_Q M(P,Q).
Similarly, if Max plays first, the loss will be max_Q min_P M(P,Q).
Minimax Theorem
Playing second cannot be worse than playing first:
max_Q min_P M(P,Q) ≤ min_P max_Q M(P,Q)
(left side: Mindy plays second; right side: Mindy plays first).
Von Neumann’s minimax theorem:
min_P max_Q M(P,Q) = max_Q min_P M(P,Q)
No advantage to playing second!
Optimal Play
Von Neumann’s minimax theorem:
v = min_P max_Q M(P,Q) = max_Q min_P M(P,Q) is the value of the game.
Optimal strategies:
P* = argmin_P max_Q M(P,Q) is Mindy’s min-max strategy;
Q* = argmax_Q min_P M(P,Q) is Max’s max-min strategy.
We will show how to use WM to prove this!
And to also find approximate min-max strategies quickly.
(P*, Q*) is a Nash equilibrium: no player has an incentive to unilaterally deviate. This is the central solution concept we will study.
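As a preview, the WM-style construction can be run directly: the row player updates multiplicative weights against a best-responding column player, and the time-averaged strategies form an approximate minimax pair. A minimal sketch (assuming entries of M in [0,1]; the step size, horizon, and names are our illustrative choices):

import numpy as np

# Approximate (P*, Q*) via multiplicative weights for the row player
# against the column player's best response; average the play.
def approx_minimax(M, T=10000, eps=0.05):
    n_rows, n_cols = M.shape
    w = np.ones(n_rows)
    P_avg = np.zeros(n_rows)
    Q_avg = np.zeros(n_cols)
    for _ in range(T):
        P = w / w.sum()
        j = int(np.argmax(P @ M))        # Max's best response to P
        w *= (1 - eps) ** M[:, j]        # penalize each row by its loss
        P_avg += P
        Q_avg[j] += 1.0
    return P_avg / T, Q_avg / T

On the rock-paper-scissors matrix this recovers strategies close to uniform, with value close to 1/2.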
Games with many players with
interesting structure
"Potential Games", D. Monderer and L, S. Shapley , Games and
Economic Behavior
Fair cost-sharing
Fair cost-sharing: n players in a weighted directed graph G. Player i wants to get from s_i to t_i, and players share the cost of the edges they use with others.
Fair cost-sharing
• n players in a directed graph G; each edge e costs c_e.
• Player i wants to get from s_i to t_i.
• All players share the cost of the edges they use with others.
• Each player wants to minimize his own cost.
Example: all n players travel from s to t, with two parallel s-t edges, one of cost 1 and one of cost n.
Good equilibrium: all use the edge of cost 1 (paying 1/n each).
Bad equilibrium: all use the edge of cost n (paying 1 each).
Inefficiency of equilibria, PoA and PoS
Price of Anarchy (PoA): ratio of the cost of the worst Nash equilibrium to OPT.
[Koutsoupias-Papadimitriou’99]
Price of Stability (PoS): ratio of the cost of the best Nash equilibrium to OPT.
[Anshelevich et al., 2004]
E.g., for fair cost-sharing, the PoS is log(n), whereas the PoA is n.
Significant effort spent on understanding these in CS.
“Algorithmic Game Theory”, Nisan, Roughgarden, Tardos, Vazirani
Congestion games
• Nice general class of games with many players.
• Always have a pure-strategy equilibrium.
• Have a potential function s.t. whenever a player
switches, potential drops by exactly that player’s
improvement.
• We will analyze dynamics in these games: what happens when players follow natural learning dynamics?
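To preview such dynamics, here is a sketch of best-response dynamics in the two-edge fair cost-sharing game above. Fair cost-sharing is an exact potential game with Φ(S) = ∑_e c_e·H(n_e), where H(k) = 1 + 1/2 + … + 1/k, and the assertion below checks that each move decreases Φ by exactly the mover's improvement (the toy setup and names are ours):

# Best-response dynamics on two parallel s-t edges of cost 1 and n.
def H(k):
    return sum(1.0 / i for i in range(1, k + 1))

def run_dynamics(n):
    costs = [1.0, float(n)]
    choice = [i % 2 for i in range(n)]      # arbitrary initial split of players
    phi = lambda: sum(c * H(choice.count(e)) for e, c in enumerate(costs))
    moved = True
    while moved:                            # loop until a pure Nash equilibrium
        moved = False
        for i in range(n):
            cur, other = choice[i], 1 - choice[i]
            pay_now = costs[cur] / choice.count(cur)
            pay_new = costs[other] / (choice.count(other) + 1)
            if pay_new < pay_now - 1e-12:
                before = phi()
                choice[i] = other
                assert abs((before - phi()) - (pay_now - pay_new)) < 1e-9
                moved = True                # potential fell by the improvement
    return choice

Because Φ strictly decreases with every improving move and takes finitely many values, the dynamics must stop at a pure Nash equilibrium.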
Learning in a distributional
setting. [With feature information]
Used all over CS and Science
Image Classification
Document Categorization
Speech Recognition
Protein Classification
Branch Prediction
Fraud Detection
Spam Detection
Example: Supervised Classification
Decide which emails are spam and which are important.
Goal: use emails seen so far to produce good prediction
rule for future data.
Example: Supervised Classification
Represent each message by features (e.g., keywords, spelling, etc.); each example comes with a label (spam or not spam).
Reasonable RULES:
• Predict SPAM if unknown AND (money OR pills)
• Predict SPAM if 2·money + 3·pills − 5·known > 0
[Figure: labeled examples in feature space; the second rule is a linear separator, and the data is linearly separable]
Two Main Aspects of Supervised Learning
Algorithm Design. How to optimize?
Automatically generate rules that do well on observed data.
Optimization has played a significant role here in recent years.
Confidence Bounds, Generalization
Guarantees, Sample Complexity
Confidence for rule effectiveness on future data.
Well understood for passive supervised learning.
Standard Passive Supervised Learning
• X: feature/instance space.
• S = {(x, l)}: set of labeled examples, drawn i.i.d. from a distribution D over X and labeled by a target concept c*.
• Do optimization over S to find a hypothesis h ∈ C.
• Goal: h has small error over D, where err(h) = Pr_{x∼D}[h(x) ≠ c*(x)].
• If c* ∈ C: the realizable case; otherwise: the agnostic case.
Standard Passive Supervised Learning
Classic models: PAC (Valiant), SLT (Vapnik)
• Sample complexity, finite hypothesis spaces, realizable case: m ≥ (1/ε)·(ln|C| + ln(1/δ)) labeled examples suffice so that, with probability at least 1−δ, every h ∈ C consistent with the sample has err(h) ≤ ε.
• In the non-realizable (agnostic) case, replace ε with ε².
• Such ideas/techniques are useful in auction design, learning submodular functions, etc.
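For concreteness, the realizable-case bound can be evaluated directly; a small helper of our own (not lecture code), with the agnostic version included up to constants:

import math

# m >= (1/eps) * (ln|C| + ln(1/delta)) examples suffice (realizable case)
# so that w.p. >= 1 - delta every consistent h in C has err(h) <= eps.
def sample_complexity(num_hypotheses, eps, delta, realizable=True):
    denom = eps if realizable else eps ** 2   # agnostic: eps -> eps^2
    return math.ceil((math.log(num_hypotheses) + math.log(1 / delta)) / denom)

print(sample_complexity(2 ** 20, eps=0.1, delta=0.05))   # -> 169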
Boosting & game theory
• Suppose I have an algorithm A that, for any distribution (weighting function) over a dataset S, can produce a rule h ∈ H that gets < 40% error.
• AdaBoost gives a way to use such an A to drive the error → 0 at a good rate, using weighted votes of the rules produced.
• We can show that this is in principle possible by using
the minimax theorem!
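A compact sketch of the AdaBoost loop just described, with the weak learner A as an assumed black box that returns a rule with weighted error strictly between 0 and 1/2 (our skeleton, with labels in {−1,+1}):

import math

# AdaBoost: reweight examples so A must focus on past mistakes;
# output a weighted vote of the weak rules h_1, ..., h_T.
def adaboost(X, y, weak_learn, rounds):
    m = len(X)
    D = [1.0 / m] * m                      # distribution over the dataset S
    ensemble = []
    for _ in range(rounds):
        h = weak_learn(X, y, D)            # assumed: 0 < weighted error < 1/2
        err = sum(D[i] for i in range(m) if h(X[i]) != y[i])
        alpha = 0.5 * math.log((1 - err) / err)
        D = [D[i] * math.exp(-alpha * y[i] * h(X[i])) for i in range(m)]
        Z = sum(D)
        D = [d / Z for d in D]             # renormalize to a distribution
        ensemble.append((alpha, h))
    return lambda x: 1 if sum(a * h(x) for a, h in ensemble) >= 0 else -1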
Supermarket Pricing Problem
• A supermarket deciding how to price its goods.
Seller’s Goal: set prices to maximize revenue.
• Simple case: customers make separate decisions on each item.
• Harder case: customers buy everything or nothing based on
sum of prices in list.
• Or could be even more complex.
Supermarket Pricing Problem
Algorithmic
• Seller knows the market well.
Incentive Compatible Auction
• Must be in customers’ interest (dominant strategy) to
report truthfully.
Online Pricing
• Customers arrive one at a time, buy what they want at
current prices. Seller modifies prices over time.
• Techniques from learning will be useful here.
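One concrete instance of that connection (a sketch of the idea; the price grid, update rule, and names are our illustrative choices): for a digital good, treat each price on a grid as an "expert" and run a multiplicative-weights update over prices, since after each buyer we can compute the revenue every price would have earned:

import random

# Online posted pricing of a digital good via weights over a price grid.
def online_pricing(valuations, price_grid, eps=0.05):
    w = [1.0] * len(price_grid)
    h = max(price_grid)                   # normalizes per-step revenue to [0,1]
    revenue = 0.0
    for v in valuations:                  # buyer with private valuation v
        k = random.choices(range(len(price_grid)), weights=w)[0]
        if v >= price_grid[k]:
            revenue += price_grid[k]      # buyer purchases at the posted price
        # full-information update: reward each price by its counterfactual revenue
        w = [wi * (1 + eps) ** ((q if v >= q else 0.0) / h)
             for wi, q in zip(w, price_grid)]
    return revenue

This earns revenue close to that of the best fixed grid price in hindsight, in the spirit of the regret bounds from the first lectures.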
Submodular functions
V = {1, 2, …, n}, f : 2^V → R
Submodularity:
f(S) + f(T) ≥ f(S ∩ T) + f(S ∪ T) for all S, T ⊆ V
Equivalently, decreasing marginal values:
f(S ∪ {x}) − f(S) ≥ f(T ∪ {x}) − f(T) for all S ⊆ T ⊆ V, x ∉ T
Examples:
• Vector spaces: let V = {v_1, …, v_n}, each v_i ∈ R^n. For each S ⊆ V, let f(S) = rank(V[S]).
• Concave functions: let h : R → R be concave. For each S ⊆ V, let f(S) = h(|S|).
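Both examples can be checked mechanically. The sketch below verifies decreasing marginal values for the rank example on small random vectors (a test harness of our own, not from the lecture):

import itertools
import numpy as np

rng = np.random.default_rng(0)
V = rng.integers(0, 2, size=(5, 5)).astype(float)    # 5 random vectors in R^5

def f(S):                        # f(S) = rank of the vectors indexed by S
    return 0 if not S else np.linalg.matrix_rank(V[list(S)])

def subsets(ground):
    return itertools.chain.from_iterable(
        itertools.combinations(ground, r) for r in range(len(ground) + 1))

# Check f(S ∪ {x}) - f(S) >= f(T ∪ {x}) - f(T) for all S ⊆ T ⊆ V, x ∉ T.
for T in subsets(range(5)):
    for S in subsets(T):
        for x in set(range(5)) - set(T):
            assert f(set(S) | {x}) - f(S) >= f(set(T) | {x}) - f(T)
print("decreasing marginal values verified")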
Submodular functions
• Strong connection between optimization and submodularity,
e.g., minimization [C’85, GLS’87, IFF’01, S’00, …] and maximization [NWF’78, V’07, …].
• Algorithmic game theory: submodular utility functions.
• Much recent interest in the machine learning community:
tutorials at major conferences (ICML, NIPS, etc.);
www.submodularity.org is a machine learning site.
• Interesting to understand their learnability.