- Lorentz Center

Nonstochastic Multi-Armed Bandits
With Graph-Structured Feedback
Noga Alon, TAU
Nicolo Cesa-Bianchi, Milan
Claudio Gentile, Insubria
Shie Mannor, Technion
Yishay Mansour, TAU and MSR
Ohad Shamir, Weizmann
Nonstochastic sequential
decision-making
• K actions and T time steps
• $\ell_t(a)$ – loss of action a at time t
• At time t
– player picks action $X_t$
– incurs loss $\ell_t(X_t)$
– observes feedback on losses
• Multi-arm bandit: only $\ell_t(X_t)$
• Experts (full information): $\ell_t(j)$ for any j
Nonstochastic sequential
decision-making
• Goal:
– minimize losses
– benchmark: the best single action
• The action j that minimizes the loss
– no stochastic assumptions on losses
• Regret
$$R_T = \underbrace{E\Big[\sum_{t=1}^{T} \ell_t(X_t)\Big]}_{\text{player loss}} - \underbrace{\min_j \sum_{t=1}^{T} \ell_t(j)}_{\text{best action}}$$
• Known regret bounds:
– MAB: $\sqrt{TK}$
– Experts: $\sqrt{T \ln K}$
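As a toy illustration of the regret definition, the sketch below computes the realized (not expected) regret of one sequence of plays against a loss table. The function name `realized_regret` and the sample data are hypothetical and only serve the example.

```python
def realized_regret(losses, plays):
    """losses[t][a] is the loss of action a at step t; plays[t] is the action picked at step t."""
    player_loss = sum(losses[t][plays[t]] for t in range(len(plays)))
    num_actions = len(losses[0])
    # Benchmark: cumulative loss of the best single action in hindsight.
    best_fixed = min(sum(row[a] for row in losses) for a in range(num_actions))
    return player_loss - best_fixed


losses = [[0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
plays = [1, 0, 1]
print(realized_regret(losses, plays))
```

Averaging this quantity over the player's randomization gives the expectation that appears in the definition of $R_T$.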
Motivation – observability
[Figure: two panels labeled "undirected" and "directed"]
undirected observation graph
[Figure: animation of an undirected observation graph; unobserved losses are shown as "?", observed losses as numbers]
• MAB: no edges
• Experts: clique
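The two extreme cases can be written down as adjacency maps. This is a minimal sketch, assuming the observation graph is stored as a dictionary from each action to its set of neighbors; the helper names are hypothetical.

```python
def bandit_graph(k):
    """MAB: no edges, so only the played action is observed."""
    return {a: set() for a in range(k)}


def experts_graph(k):
    """Experts: a clique, so every action is observed."""
    return {a: set(range(k)) - {a} for a in range(k)}


def observed_set(graph, action):
    """Actions whose losses are revealed after playing `action`."""
    return {action} | graph[action]


print(observed_set(bandit_graph(4), 2))
print(observed_set(experts_graph(4), 2))
```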
Modeling
Directed vs Undirected
• Different types of dependencies
• Different measures
– Independent set
– Dominating set
– Max Acyclic Subgraph
Informed vs Uninformed
• When does the learner observe the graph
– Before
– After
• only the neighbors
Our Results
Uninformed setting
• Undirected graph
– Only the neighbors of the node
– Independent sets
$\tilde{O}\big(\sqrt{T \alpha(G) \ln K}\big)$
• Directed graph
– Max Acyclic Subgraph (not tight)
– Random Erdos-Renyi graphs
Informed setting
• Directed graphs
• Regret characterization
– dominating sets and ind. set
• Both expectation and high prob.
EXP3-SET
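• Online Algorithm
$$\Pr[X_t = a] \propto \exp\Big(-\eta \sum_{s=1}^{t-1} \hat\ell_s(a)\Big)$$
where
$$\hat\ell_t(a) = \begin{cases} \dfrac{\ell_t(a)}{\Pr[\ell_t(a) \text{ is observed}]} & \text{if } \ell_t(a) \text{ observed} \\ 0 & \text{otherwise} \end{cases}$$
$$E[\hat\ell_t(a)] = \ell_t(a)$$
• Theorem
$$R_T \le \frac{\ln K}{\eta} + \frac{\eta}{2} \sum_{t=1}^{T} \sum_a \frac{\Pr[X_t = a]}{\Pr[\ell_t(a) \text{ is observed}]} \le \frac{\ln K}{\eta} + \frac{\eta}{2} \sum_{t=1}^{T} \alpha(G_t)$$
With tuned $\eta$: $R_T = O\big(\sqrt{\ln K \sum_{t} \alpha(G_t)}\big)$, which is $O\big(\sqrt{T \alpha(G) \ln K}\big)$ for a fixed graph G.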
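The update on this slide can be sketched in a few lines. This is a minimal illustration for a fixed undirected graph, not the authors' implementation: the function name `exp3_set`, the dictionary representation of the graph (each action mapped to its neighbors, excluding itself), and the toy losses are all assumptions.

```python
import math
import random


def exp3_set(losses, graph, eta, rng):
    """Exp3-SET-style learner on a fixed undirected observation graph.

    losses[t][a] is in [0, 1]; graph[a] is the set of neighbors of a (a itself excluded).
    Returns the list of actions played.
    """
    k = len(graph)
    cum_est = [0.0] * k  # cumulative importance-weighted loss estimates
    plays = []
    for loss_row in losses:
        # Pr[X_t = a] is proportional to exp(-eta * estimate); shift by the minimum for stability.
        low = min(cum_est)
        weights = [math.exp(-eta * (c - low)) for c in cum_est]
        total = sum(weights)
        probs = [w / total for w in weights]
        action = rng.choices(range(k), weights=probs)[0]
        plays.append(action)
        # The played action and its neighbors reveal their losses.
        for a in graph[action] | {action}:
            # a is observed when a itself or one of its neighbors is played.
            p_obs = probs[a] + sum(probs[j] for j in graph[a])
            cum_est[a] += loss_row[a] / p_obs
    return plays


path = {0: {1}, 1: {0, 2}, 2: {1}}          # path graph 0 - 1 - 2
toy_losses = [[0.0, 1.0, 1.0]] * 300        # action 0 is always best
plays = exp3_set(toy_losses, path, eta=0.5, rng=random.Random(0))
print(plays[-100:].count(0))
```

On this toy input the learner should concentrate on action 0 after an initial exploration phase, since the estimates of the other two actions grow whenever they are observed.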
EXP3-Set Regret – key lemma
• Lemma
$$Q = \sum_a \frac{\Pr[X_t = a]}{\Pr[a \text{ observed}]} \le \alpha(G)$$
• Note: MAB: Q = K; Full info.: Q = 1
• Proof: Build an i.s. S
– consider action a with minimal Pr[a observed]
– Add a to S
– Delete a and its neighbors
• Note
$$\sum_{j \in N(a)} \frac{\Pr[X_t = j]}{\Pr[j \text{ observed}]} \le \sum_{j \in N(a)} \frac{\Pr[X_t = j]}{\Pr[a \text{ observed}]} \le 1$$
Dominating set – directed graph
[Figure: dominating set in a directed observation graph]
EXP3-DOM
• Simplified version
– fixed graph G
– D is dominating set
• log approx
• Main modification
– add probabilities to D
• induce observability
• probabilities:
$$p_{a,t} = (1 - \gamma) \frac{w_{a,t}}{W_t} + \frac{\gamma}{|D|} I[a \in D]$$
• Select $X_t$ using $p_t$
• Observe $\ell_t(a)$ for a in $S_{X_t,t}$
• weights
$$w_{a,t+1} = w_{a,t} \exp\big(-\gamma \hat\ell_t(a) / |D|\big)$$
$$\hat\ell_t(a) = \frac{\ell_t(a)}{\Pr[\text{observe } a]} I[a \in S_{X_t,t}]$$
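A minimal sketch of the simplified, fixed-graph version follows. It is an illustration under assumptions, not the authors' code: the graph is given as `out[a]`, the set of actions observed when a is played (assumed to contain a itself), the dominating set comes from a standard greedy set-cover heuristic, and all function names and the toy data are hypothetical.

```python
import math
import random


def greedy_dominating_set(out):
    """Greedy cover: out[a] is the set of actions observed when a is played (a included)."""
    uncovered = set(out)
    dom = []
    while uncovered:
        best = max(out, key=lambda a: len(out[a] & uncovered))
        dom.append(best)
        uncovered -= out[best]
    return dom


def exp3_dom_fixed(losses, out, gamma, rng):
    """Exp3-DOM-style learner on a fixed directed observation graph."""
    actions = sorted(out)
    dom = set(greedy_dominating_set(out))
    log_w = {a: 0.0 for a in actions}
    plays = []
    for loss_row in losses:
        top = max(log_w.values())
        w = {a: math.exp(log_w[a] - top) for a in actions}
        total = sum(w.values())
        # Mix the exponential weights with uniform exploration over the dominating set.
        p = {a: (1 - gamma) * w[a] / total + (gamma / len(dom) if a in dom else 0.0)
             for a in actions}
        x = rng.choices(actions, weights=[p[a] for a in actions])[0]
        plays.append(x)
        for a in out[x]:
            # a is observed whenever some action j with a in out[j] is played.
            p_obs = sum(p[j] for j in actions if a in out[j])
            log_w[a] -= gamma * (loss_row[a] / p_obs) / len(dom)
    return plays


out = {0: {0, 1, 2, 3}, 1: {1}, 2: {2}, 3: {3}}   # action 0 observes everything
toy_losses = [[1.0, 1.0, 0.0, 1.0]] * 1000         # action 2 is always best
plays = exp3_dom_fixed(toy_losses, out, gamma=0.1, rng=random.Random(0))
print(greedy_dominating_set(out), plays[-100:].count(2))
```

Because every action is dominated, each observation probability is at least $\gamma / |D|$, which keeps the importance-weighted estimates bounded.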
EXP3-DOM
• Simple example
• Transitive observability
– tournament
• action 1 observes all actions
– D={1}
• EXP3-DOM
• Sample action 1 with prob γ
– action 1 is the exploration
• Otherwise run a MAB
– specifically EXP3-SET
• Intuition
– action 1 replaces mixture with uniform
Conclusion
• Observability model
– Between MAB and Experts
• more work to be done
• Uninformed setting
– Undirected graph
• Informed setting
– Directed graph
• [Kocák, Neu, Valko and Munos] improved the uninformed setting
Outline
• Model and motivation
• symmetric observability
• non-symmetric observability
EXP3-DOM: key lemma
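• Lemma
– G directed graph,
– $d_i^-$ indegree of i,
– $\alpha = \alpha(G)$
$$\sum_{i=1}^{K} \frac{1}{1 + d_i^-} \le 2\alpha \ln\Big(1 + \frac{K}{\alpha}\Big)$$
• Proof: high level
– shrink graph
• $G_K, G_{K-1}, \ldots$
– delete nodes
• step s:
– delete max indegree node
• From Turan’s theorem ($V_s$, $D_s$: nodes and arcs of $G_s$; $\alpha_s = \alpha(G_s)$)
$$\max_i d_{i,s}^- \ge \frac{|D_s|}{|V_s|} \ge \frac{|V_s|}{2\alpha_s} - \frac{1}{2}$$
• Turan’s Theorem
– undirected graph G(V,E)
$$\alpha(G) \ge \frac{|V|}{\frac{2|E|}{|V|} + 1}$$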
EXP3-DOM: key lemma (proof)
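• Completing the proof (node 1: a max indegree node of $G_K$)
$$\sum_{i=1}^{K} \frac{1}{1 + d_{i,K}^-} = \frac{1}{1 + d_{1,K}^-} + \sum_{i=2}^{K} \frac{1}{1 + d_{i,K}^-} \le \frac{2\alpha}{\alpha + K} + \sum_{i=2}^{K} \frac{1}{1 + d_{i,K}^-}$$
$$\le \frac{2\alpha}{\alpha + K} + \sum_{i=1}^{K-1} \frac{1}{1 + d_{i,K-1}^-} \le \cdots \le \sum_{i=1}^{K} \frac{2\alpha}{\alpha + i} \le 2\alpha \ln\Big(1 + \frac{K}{\alpha}\Big)$$
• Note, due to edge elimination
$$d_{i+1,K}^- \ge d_{i,K-1}^-$$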
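The inequality $\sum_i 1/(1 + d_i^-) \le 2\alpha \ln(1 + K/\alpha)$ can be sanity-checked on small directed graphs. The sketch below assumes arcs are given as distinct (tail, head) pairs over nodes 0..k-1 and takes the independence number of the graph with orientations ignored, computed by brute force; the helper names are hypothetical.

```python
import math
import random
from itertools import combinations


def independence_number(arcs, k):
    """Brute-force independence number, ignoring arc orientation (small k only)."""
    adjacent = {frozenset(arc) for arc in arcs}
    for size in range(k, 0, -1):
        for subset in combinations(range(k), size):
            if all(frozenset(pair) not in adjacent for pair in combinations(subset, 2)):
                return size
    return 0


def lemma_sides(arcs, k):
    """Return (left-hand side, right-hand side) of the indegree lemma."""
    indegree = [0] * k
    for _, head in arcs:
        indegree[head] += 1
    alpha = independence_number(arcs, k)
    lhs = sum(1.0 / (1 + d) for d in indegree)
    rhs = 2 * alpha * math.log(1 + k / alpha)
    return lhs, rhs


rng = random.Random(0)
k = 6
arcs = [(i, j) for i in range(k) for j in range(k) if i != j and rng.random() < 0.3]
lhs, rhs = lemma_sides(arcs, k)
print(lhs <= rhs)
```

For a transitive tournament the indegrees are 0, 1, ..., K-1, so the left-hand side is the harmonic number $H_K$ while $\alpha = 1$ gives $2\ln(1 + K)$ on the right.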
EXP3-DOM: key lemma (modified)
• Lemma (what we really need!)
• G(V,E) directed graph
– $IN_i$ in-neighbors of i
– r size dominating set; and α size ind. set
– p distribution over V
• $p_i \ge \beta$
$$Q = \sum_{i=1}^{K} \frac{p_i}{p_i + \sum_{j \in IN_i} p_j} \le 2\alpha \ln\Bigg(1 + \frac{\lceil K^2/(r\beta) \rceil + K}{\alpha}\Bigg) + 2r$$
EXP3-DOM: changing graphs
• Simple
– all dom. set same size
– approx. same size
• Problem
– different size dom. set
• can be 1 or K
• Solution
– keep log levels
• depend on $\log_2 |D_t|$
– algorithm per level
• Complications
– parameters depend on level
– setting the learning rate
• need a delicate doubling
• Main tech. challenge
– handle dynamic adversary.
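The "keep log levels" idea can be sketched as a small dispatcher that maps the size of the current dominating set to a level and lazily creates one algorithm copy per level. Rounding the logarithm down to an integer is an assumption made here so that the level can index a copy; the names `level`, `num_levels`, and `pick_copy` are hypothetical.

```python
def level(dom_size):
    """Level b = floor(log2(dom_size)) for a dominating set of size dom_size >= 1."""
    return dom_size.bit_length() - 1


def num_levels(k):
    """Number of distinct levels when dominating sets have between 1 and k nodes."""
    return level(k) + 1


def pick_copy(copies, dom_size, make_copy):
    """Return (level, copy) for this round, creating the copy on first use."""
    b = level(dom_size)
    if b not in copies:
        copies[b] = make_copy(b)
    return b, copies[b]


copies = {}
for size in [1, 2, 3, 8, 5]:
    b, state = pick_copy(copies, size, lambda lvl: {"level": lvl, "rounds": 0})
    state["rounds"] += 1
print(sorted(copies), num_levels(8))
```

Within one level all dominating sets have the same size up to a factor of two, which is what lets each copy use parameters tuned to its own level.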
EXP3-DOM
• receive obs. graph
– find dominating set $D_t$
• logarithmic approximation
• Run the right copy
– Let $b_t = \lfloor \log_2 |D_t| \rfloor$
– run copy $b_t$
• log copies
• For Copy $b_t$
– param. depend on $b_t$
• probabilities:
$$p_{a,t} = (1 - \gamma) \frac{w_{a,t}}{W_t} + \frac{\gamma}{|D_t|} I[a \in D_t]$$
• Select $X_t$ using p
• Observe $\ell_t(a)$ for a in $S_{X_t,t}$
• weights
$$w_{a,t+1} = w_{a,t} \exp\big(-\gamma \hat\ell_t(a) / 2^{b_t}\big)$$
$$\hat\ell_t(a) = \frac{\ell_t(a)}{\Pr[\text{observe } a]} I[a \in S_{X_t,t}]$$
EXP3-DOM – main Theorem
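• Theorem: ($T_b$: rounds in which copy b is run)
$$R_T \le \sum_{b=0}^{\log K} \Bigg( \frac{2^b \ln K}{\gamma_b} + \gamma_b \, E\Big[ \sum_{t \in T_b} \Big( 1 + \frac{Q_t^b}{2^{b+1}} \Big) \Big] \Bigg)$$
• tuning $\gamma_b$
$$R_T = O\Bigg( (\ln K) \, E\Big[ \sqrt{ \sum_{t=1}^{T} \big( 4|D_t| + Q_t^{b_t} \big) } \Big] + (\ln K) \ln(KT) \Bigg)$$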
Independent set
• Independent set α(G)
• [Mannor & Shamir 2012]
• Tight Regret
$\sqrt{T \alpha(G) \ln K}$
– α(G) “replaces” K
• Cons:
– requires to observe G
– solves an LP each step