Improved State Estimation in Multiagent Settings with Continuous or Large
Discrete State Spaces
Prashant Doshi
Department of Computer Science
University of Georgia
Athens, GA 30602
pdoshi@cs.uga.edu
Abstract
State estimation in multiagent settings involves updating an agent's belief over the physical states and the space of other agents' models. Performance of the previous approach to state estimation, the interactive particle filter, degrades with large state spaces because it distributes the particles over both the physical state space and the other agents' models. We present an improved method for estimating the state in a class of multiagent settings that are characterized in part by continuous or large discrete state spaces. We factor out the models of the other agents and update the agent's belief over these models as exactly as possible. Simultaneously, we sample particles from the distribution over the large physical state space and project the particles in time. This approach is equivalent to Rao-Blackwellising the interactive particle filter. We focus our analysis on the special class of problems where the nested beliefs are represented using Gaussians, the problem dynamics using conditional linear Gaussians (CLGs), and the observation functions using softmax or CLGs. These distributions adequately represent many realistic applications.
Introduction
In order to act rationally, an agent must track the state of the environment over time. When an agent is acting alone in the environment, it must track the evolution of the physical state; this is usually accomplished using the Bayes filter (Doucet, Freitas, & Gordon 2001). In practice, the filter manifests as the Kalman filter when the dynamics are linear Gaussian and the agent's prior belief is Gaussian, or as the particle filter (PF) (Doucet, Freitas, & Gordon 2001) when no assumptions about the dynamics or prior beliefs are made. In the presence of other agents who themselves act, observe, and update their beliefs, the agent must track not only the physical state but also the possible states of the others. This is because the other agents' actions may affect the evolution of the physical state and the agent's payoffs. One approach is to generalize the Bayes filter to multiagent settings, as shown in (Doshi & Gmytrasiewicz 2005), in which an agent tracks the evolution of the interactive state over time. An interactive state consists of the physical state and models of the other agents that include their beliefs, capabilities, and preferences. In practice, the estimation may be carried out using the interactive PF (I-PF), which generalizes the PF to the multiagent setting (Doshi & Gmytrasiewicz 2005).

Previous applications of the I-PF were confined to simple problems with a very small number of discrete physical states. This is because a large number of particles must be used, at the expense of computational efficiency, to achieve good approximation quality. While this limitation also affects the traditional PF, it is especially acute for the I-PF because the interactive state space from which the particles are sampled tends to be large, as it includes the nested beliefs of the other agents.

This limitation of the I-PF becomes more pronounced in the context of a continuous or large state space, as exhibited by many real-world applications. In this paper, we present an improved method for approximately carrying out state estimation in multiagent settings characterized in part by continuous or large discrete physical state spaces. We factor out some dimensions of the interactive state space and update the belief over these dimensions as exactly as possible, while sampling and propagating the remaining ones. This procedure is equivalent to Rao-Blackwellising (Casella & Robert 1996) the I-PF. Specifically, we factor out the models of other agents and update the agent's belief over these models. We consider the case where the belief densities are represented using Gaussians. As the states may be continuous, we focus on models that represent the transition dynamics using conditional linear Gaussians (CLGs) or deterministic functions, and the observation functions using softmax densities or CLGs. In the update, those distributions that can be handled exactly are handled exactly, while tight approximations are used for the remaining ones. Simultaneously, we sample particles from the distribution over the large physical state space and project the particles in time. Compared with the I-PF in continuous settings, our approach achieves better approximation quality while consuming fewer computational resources, as measured by the number of particles and the runtime.

Our choice of distributions, while somewhat restrictive, is motivated by several reasons: these popular functions are well behaved statistically and allow closed-form posteriors, efficient methods exist for fitting their parameters to data (Jordan 1995), and, as we illustrate using examples, they adequately model several applications such as target tracking and fault diagnosis.
Copyright © 2007, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
Background: Multiagent State Estimation
We consider an agent i that is interacting with one other agent j. The arguments easily generalize to settings with more than two agents. Consider a space of physical states S. We will call agent j's belief over S a 0th level belief, b_{j,0}. Additionally, j can be modeled by specifying its set of actions A_j, its set of observations Ω_j, transition and observation functions T_j and O_j, a reward function R_j, and an optimality criterion OC_j. j's 0th level model is then θ_{j,0} = ⟨b_{j,0}, A_j, Ω_j, T_j, O_j, R_j, OC_j⟩. The 0th level models are therefore POMDPs (the other agent's actions are folded into T, O, and R as noise). Agent i's 1st level beliefs are defined over the physical states of the world and the 0th level models of agent j. This enriched state space has been called a set of interactive states. Thus, let IS_{i,1} denote a set of interactive states defined as IS_{i,1} = S × Θ_{j,0}, with IS_{i,0} = S, where Θ_{j,0} is the set of 0th level models of agent j. Let us rewrite θ_{j,0} as θ_{j,0} = ⟨b_{j,0}, θ̂_j⟩, where θ̂_j ∈ Θ̂_j is agent j's frame. To keep matters simple, we limit our focus to i's singly nested beliefs, though the definition of the interactive state space generalizes to any level.

State estimation in multiagent settings is complex for two reasons. First, the prediction of how the physical state changes must be made based on the predicted actions of the other agent, and the probabilities of the other's actions are based on its models. Second, changes in the other's models have to be included in the update; specifically, the update of the other agent's belief due to its new observation must be included. The updated belief of i, b^t_{i,1}(is^t) = Pr(is^t | a^{t-1}_i, o^t_i, b^{t-1}_{i,1}), is:

$$
b_{i,1}^{t}(is^{t}) = \alpha \int_{is^{t-1}:\,\hat{\theta}_{j}^{t-1}=\hat{\theta}_{j}^{t}} b_{i,1}^{t-1}(is^{t-1}) \sum_{a_{j}^{t-1}} \Pr(a_{j}^{t-1}\mid\theta_{j,0}^{t-1})\, O_{i}(s^{t},a_{i}^{t-1},a_{j}^{t-1},o_{i}^{t})\, T_{i}(s^{t-1},a_{i}^{t-1},a_{j}^{t-1},s^{t}) \sum_{o_{j}^{t}} \delta_{D}\big(SE_{\hat{\theta}_{j}^{t}}(b_{j,0}^{t-1},a_{j}^{t-1},o_{j}^{t}) - b_{j,0}^{t}\big)\, O_{j}(s^{t},a_{i}^{t-1},a_{j}^{t-1},o_{j}^{t})\, d\,is^{t-1} \quad (1)
$$

where α is the normalization constant, δ_D is the Dirac delta function, Pr(a^{t-1}_j | θ^{t-1}_{j,0}) is the probability that a^{t-1}_j is Bayes rational for the agent described by θ^{t-1}_{j,0}, and SE_{θ̂^t_j} stands for the update of the complete belief using the transition and observation functions in the frame θ̂^t_j. In the multiagent state estimation above, i's belief update invokes j's belief update (via the term SE_{θ̂^t_j}(b^{t-1}_{j,0}, a^{t-1}_j, o^t_j)). This recursion in belief nesting bottoms out at the 0th level, where the belief update of the agent reduces to a Bayes filter.

Factoring State Estimation

We decompose the state estimation (Eq. 1) into two factors, one of which represents the update of the belief over the physical states, and the other the update of the belief over the other agent's models conditioned on a physical state:

$$
b_{i,1}^{t}(is^{t}) = \Pr(is^{t}\mid a_{i}^{t-1},o_{i}^{t},b_{i,1}^{t-1}) = \int_{is^{t-1}:\,\hat{\theta}_{j}^{t-1}=\hat{\theta}_{j}^{t}} \sum_{a_{j}^{t-1}} \Pr(is^{t}\mid a_{i}^{t-1},a_{j}^{t-1},o_{i}^{t},is^{t-1})\, \Pr(a_{j}^{t-1}\mid\theta_{j,0}^{t-1})\, b_{i,1}^{t-1}(is^{t-1})\, d\,is^{t-1}
$$

Because changes in the physical state and agent i's observations depend also on the actions of j, we sum over all of j's actions in the second step. As is^t = ⟨s^t, θ^t_{j,0}⟩, we write:

$$
b_{i,1}^{t}(is^{t}) = \int_{is^{t-1}:\,\hat{\theta}_{j}^{t-1}=\hat{\theta}_{j}^{t}} \sum_{a_{j}^{t-1}} \Pr(s^{t}\mid a_{i}^{t-1},a_{j}^{t-1},o_{i}^{t},s^{t-1})\, \Pr(\theta_{j,0}^{t}\mid s^{t},a_{i}^{t-1},a_{j}^{t-1},o_{i}^{t},\theta_{j,0}^{t-1})\, \Pr(a_{j}^{t-1}\mid\theta_{j,0}^{t-1})\, b_{i,1}^{t-1}(is^{t-1})\, d\,is^{t-1} \quad (2)
$$

We may expand the first term of the above equation as:

$$
\Pr(s^{t}\mid a_{i}^{t-1},a_{j}^{t-1},o_{i}^{t},s^{t-1}) = \alpha\, O_{i}(s^{t},a_{i}^{t-1},a_{j}^{t-1},o_{i}^{t})\, T_{i}(s^{t-1},a_{i}^{t-1},a_{j}^{t-1},s^{t})
$$

where α is the normalization constant. The other term in Eq. 2, Pr(θ^t_{j,0} | s^t, a^{t-1}_i, a^{t-1}_j, o^t_i, θ^{t-1}_{j,0}), may be rewritten as:

$$
\Pr(\theta_{j,0}^{t}\mid s^{t},a_{i}^{t-1},a_{j}^{t-1},o_{i}^{t},\theta_{j,0}^{t-1}) = \sum_{o_{j}^{t}} \delta_{D}\big(SE_{\hat{\theta}_{j}^{t}}(b_{j,0}^{t-1},a_{j}^{t-1},o_{j}^{t}) - b_{j,0}^{t}\big)\, O_{j}(s^{t},a_{i}^{t-1},a_{j}^{t-1},o_{j}^{t})
$$

Because agent j's model is private and cannot be observed by i directly, i's observation o^t_i plays no role above.

Representing Prior Beliefs

Let X = {X_1, X_2, ..., X_k} be the set of k continuous-valued variables, where k ∈ N, and let Y be the set of discrete-valued elements of the (hybrid) physical state space S. Let x be an instantiation of X, and analogously for y. Together, X and Y completely describe the physical state space.

For illustration, we use the continuous multiagent tiger problem, a modified version of the persistent multiagent tiger problem discussed in (Rathnas, Doshi, & Gmytrasiewicz 2006). In our continuous version, the tiger is located on a continuous axis, −1 ≤ x ≤ 1. The gold is always located symmetrically (about x = 0) from the tiger's location. Hence, knowing the tiger's location allows one to exactly infer the location of the gold, analogous to the classical version. We assume a discrete action space in which each agent may call out the location of the gold; to keep matters simple, this could be left (OL) or right (OR). Left, for example, could signify that the gold is located at some point x ≤ 0. Each agent may also listen (L), hearing a growl from the left (GL) or right (GR) that noisily informs the agent of the tiger's location. The agent also overhears, noisily, the location if it is called out by the other agent. Once a location has been called out, the tiger (and the gold) persist at their original location in the next time step with a high probability.

We represent agent j's level 0 belief, b^{t-1}_{j,0} ∈ Δ(S), using a factorization of the physical state space: b^{t-1}_{j,0}(x, y) = p_{j,0}(y) p_{j,0}(x|y). While p_{j,0}(y) is a discrete probability distribution over Y, p_{j,0}(x|y) is a collection of multivariate Gaussian densities, each of which is defined over the variables in X. Each Gaussian in the collection may have a different set of parameters specific to the instantiation of Y:

$$
p_{j,0}(x\mid y) = N(\mu_{j,0}^{y}; \Sigma_{j,0}^{y})(x) \quad (3)
$$

where μ^y_{j,0} is the k-element vector of means and Σ^y_{j,0} is the k × k covariance matrix of the Gaussian. Let μ_{j,0} and Σ_{j,0} be the sets of means and covariance matrices, respectively, for every instantiation of y. We show an example level 0 belief of j in the tiger problem in Fig. 1(a). Note that the physical state in the problem consists of a single continuous variable, x, denoting the tiger's location.
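To make the factored level 0 representation of Eq. 3 concrete, the following is a minimal Python sketch of such a belief; the class name and its fields are our own illustrative choices and do not appear in the paper.

```python
import numpy as np

class Level0Belief:
    """Factored level 0 belief b_{j,0}(x, y) = p(y) * N(mu_y, Sigma_y)(x)."""

    def __init__(self, p_y, means, covs):
        # p_y: dict mapping each discrete instantiation y to its probability
        # means: dict mapping y to a length-k mean vector
        # covs: dict mapping y to a k x k covariance matrix
        self.p_y = p_y
        self.means = {y: np.asarray(m, dtype=float) for y, m in means.items()}
        self.covs = {y: np.asarray(c, dtype=float) for y, c in covs.items()}

    def density(self, x, y):
        """Evaluate b_{j,0}(x, y) for a continuous instantiation x and a discrete y."""
        x = np.atleast_1d(np.asarray(x, dtype=float))
        mu, cov = self.means[y], self.covs[y]
        k = mu.shape[0]
        diff = x - mu
        norm = np.sqrt(((2 * np.pi) ** k) * np.linalg.det(cov))
        gauss = np.exp(-0.5 * diff @ np.linalg.solve(cov, diff)) / norm
        return self.p_y[y] * gauss

# Continuous tiger problem: one continuous variable (the tiger location) and a single
# dummy discrete instantiation; j believes the tiger is likely near x = 0.25 (cf. Fig. 1(a)).
belief = Level0Belief(p_y={"tiger": 1.0},
                      means={"tiger": [0.25]},
                      covs={"tiger": [[0.05]]})
print(belief.density([0.25], "tiger"))
```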
Figure 1: (a) A level 0 belief of j according to which j believes that the tiger is likely to be at x = 0.25. (b) Given that the tiger is at location x = 0, i believes that j believes that the tiger is likely at x = 0 (j's beliefs are likely to have a mean of 0 and small variance).

Agent i's level 1 belief, b^{t-1}_{i,1} ∈ Δ(S × Θ^{t-1}_{j,0}), is a distribution over the level 0 beliefs of agent j for each physical state and frame of the other agent. In order to represent this belief, we factor it as b^{t-1}_{i,1}(s^{t-1}, θ^{t-1}_{j,0}) = b^{t-1}_{i,1}(s^{t-1}) b^{t-1}_{i,1}(θ^{t-1}_{j,0} | s^{t-1}). The term b^{t-1}_{i,1}(s^{t-1}) is a distribution over the physical state space analogous to j's level 0 belief, and may be represented similarly. The second factor, b^{t-1}_{i,1}(θ^{t-1}_{j,0} | s^{t-1}), is a distribution over the level 0 beliefs of j conditioned on the physical state (assuming j's frame is known). Because j's belief is represented using a collection of Gaussians, as shown in Eq. 3, that are described by their means and covariance matrices, i's level 1 beliefs are densities over these parameters. We represent i's level 1 belief conditioned on the physical state using a conditional linear Gaussian (CLG) density:

$$
b_{i,1}^{t-1}(\theta_{j,0}^{t-1}\mid s^{t-1}) = N\big([w_{r,0}^{y} + w_{r}^{y}\cdot x]_{r=1}^{k(k+1)|y|};\, \Sigma^{y}\big)(\mu_{j,0}, \Sigma_{j,0}) \quad (4)
$$

The CLG is a density over the means and covariances that parameterize j's level 0 belief. The CLG's own mean is a linear function of the continuous variables in X and is possibly distinct for each value of y. Recall that μ_{j,0} and Σ_{j,0} are sets of as many k-element means and k × k covariances, respectively, as the number of instantiations, |y|. An example CLG for the tiger problem is shown in Fig. 1(b).
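As a rough illustration of Eq. 4, the sketch below evaluates a CLG density over j's belief parameters (μ_{j,0}, Σ_{j,0}) conditioned on the continuous state x. The helper name and weight values are our own, and the single-variable case (k = 1) is assumed, so the parameter vector is simply (mean, variance).

```python
import numpy as np

def clg_density(theta_params, x, w0, w, cov):
    """CLG over j's belief parameters, conditioned on the continuous state x.

    theta_params: vector (mu_j0, sigma_j0) describing j's level 0 Gaussian (k = 1)
    x:            continuous physical state (scalar here)
    w0, w:        intercepts and linear weights giving the CLG mean w0 + w * x
    cov:          covariance of the CLG over (mu_j0, sigma_j0)
    """
    theta = np.asarray(theta_params, dtype=float)
    mean = np.asarray(w0, dtype=float) + np.asarray(w, dtype=float) * x
    diff = theta - mean
    d = theta.shape[0]
    norm = np.sqrt(((2 * np.pi) ** d) * np.linalg.det(cov))
    return float(np.exp(-0.5 * diff @ np.linalg.solve(cov, diff)) / norm)

# i's belief over j's belief when the tiger is at x = 0: j's mean is likely near 0
# with a small variance (cf. Fig. 1(b)); the weights below are illustrative only.
density = clg_density(theta_params=[0.0, 0.1], x=0.0,
                      w0=[0.0, 0.1], w=[0.5, 0.0],
                      cov=[[0.05, 0.0], [0.0, 0.01]])
print(density)
```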
Approximate State Estimation
The I-PF propagates a sampled representation of agent i's nested beliefs over time. Because the particles are sampled from the entire interactive state space, many of them are needed to generate good approximations of the belief. However, this makes the I-PF computationally intensive, and the problem is exacerbated when we consider the large physical state spaces that characterize realistic application settings.
Rao-Blackwellised I-PF (RB-IPF)
We offer a way to alleviate this difficulty. We may simulate the belief update over the physical state space using the traditional PF while updating agent i's belief over j's beliefs, conditioned on the physical state, as exactly as possible. We sample N particles from agent i's belief over the physical state space, b^{t-1}_{i,1}(s^{t-1}), resulting in a set of particles {s^{(1)}, s^{(2)}, ..., s^{(N)}} that together approximate the belief over the large state space. The belief over the complete interactive state space is then given by the following set of particles, {(s^{(n)}, b^{t-1}_{i,1}(θ^{t-1}_{j,0} | s^{(n)}))}_{n=1}^{N}:

$$
b_{i,1}^{t-1}(s^{t-1}, \theta_{j,0}^{t-1}) \approx \frac{1}{N} \sum_{n=1}^{N} \delta_{D}(s^{t-1} - s^{(n)})\, b_{i,1}^{t-1}(\theta_{j,0}^{t-1}\mid s^{(n)}) \quad (5)
$$

We substitute Eq. 5 into Eq. 2 and expand is^{t-1} = ⟨s^{t-1}, θ^{t-1}_{j,0}⟩:

$$
b_{i,1}^{t}(is^{t}) \approx \int_{s^{t-1}} \int_{\theta_{j,0}^{t-1}} \sum_{a_{j}^{t-1}} \Pr(s^{t}\mid a_{i}^{t-1},a_{j}^{t-1},o_{i}^{t},s^{t-1})\, \Pr(\theta_{j,0}^{t}\mid s^{t},a_{i}^{t-1},a_{j}^{t-1},o_{i}^{t},\theta_{j,0}^{t-1})\, \Pr(a_{j}^{t-1}\mid\theta_{j,0}^{t-1})\, \frac{1}{N}\sum_{n=1}^{N}\delta_{D}(s^{t-1}-s^{(n)})\, b_{i,1}^{t-1}(\theta_{j,0}^{t-1}\mid s^{(n)})\, d\theta_{j,0}^{t-1}\, ds^{t-1}
$$

$$
= \frac{1}{N} \sum_{a_{j}^{t-1}} \sum_{n} \int_{s^{t-1}} \delta_{D}(s^{t-1}-s^{(n)})\, \Pr(s^{t}\mid a_{i}^{t-1},a_{j}^{t-1},o_{i}^{t},s^{t-1})\, ds^{t-1} \times \int_{\theta_{j,0}^{t-1}} \Pr(\theta_{j,0}^{t}\mid s^{t},a_{i}^{t-1},a_{j}^{t-1},o_{i}^{t},\theta_{j,0}^{t-1})\, \Pr(a_{j}^{t-1}\mid\theta_{j,0}^{t-1})\, b_{i,1}^{t-1}(\theta_{j,0}^{t-1}\mid s^{(n)})\, d\theta_{j,0}^{t-1}
$$

Because of the delta function, the above becomes:

$$
b_{i,1}^{t}(is^{t}) \approx \frac{1}{N} \sum_{a_{j}^{t-1}} \sum_{n} \Pr(s^{t}\mid a_{i}^{t-1},a_{j}^{t-1},o_{i}^{t},s^{(n)}) \int_{\theta_{j,0}^{t-1}} \Pr(\theta_{j,0}^{t}\mid s^{t},a_{i}^{t-1},a_{j}^{t-1},o_{i}^{t},\theta_{j,0}^{t-1})\, \Pr(a_{j}^{t-1}\mid\theta_{j,0}^{t-1})\, b_{i,1}^{t-1}(\theta_{j,0}^{t-1}\mid s^{(n)})\, d\theta_{j,0}^{t-1}
$$

where:

$$
\Pr(s^{t}\mid a_{i}^{t-1},a_{j}^{t-1},o_{i}^{t},s^{(n)}) = \alpha\, O_{i}(s^{t},a_{i}^{t-1},a_{j}^{t-1},o_{i}^{t})\, T_{i}(s^{(n)},a_{i}^{t-1},a_{j}^{t-1},s^{t})
$$

$$
\Pr(\theta_{j,0}^{t}\mid s^{t},a_{i}^{t-1},a_{j}^{t-1},o_{i}^{t},\theta_{j,0}^{t-1}) = \sum_{o_{j}^{t}} O_{j}(s^{t},a_{i}^{t-1},a_{j}^{t-1},o_{j}^{t})\, \delta_{D}\big(SE_{\hat{\theta}_{j}^{t}}(b_{j,0}^{t-1},a_{j}^{t-1},o_{j}^{t}) - b_{j,0}^{t}\big)
$$

Thus, i's state estimation takes the approximate form:

$$
b_{i,1}^{t}(is^{t}) \approx \frac{\alpha}{N} \sum_{a_{j}^{t-1}} \sum_{n=1}^{N} \rho_{a_{j}^{t-1}}^{(n)}(s^{t})\, \kappa_{a_{j}^{t-1}}^{(n)}(\theta_{j,0}^{t}\mid s^{t}) \quad (6)
$$

where:

$$
\rho_{a_{j}^{t-1}}^{(n)}(s^{t}) \stackrel{def}{=} O_{i}(s^{t},a_{i}^{t-1},a_{j}^{t-1},o_{i}^{t})\, T_{i}(s^{(n)},a_{i}^{t-1},a_{j}^{t-1},s^{t})
$$

$$
\kappa_{a_{j}^{t-1}}^{(n)}(\theta_{j,0}^{t}\mid s^{t}) \stackrel{def}{=} \int_{\theta_{j,0}^{t-1}} \sum_{o_{j}^{t}} O_{j}(s^{t},a_{i}^{t-1},a_{j}^{t-1},o_{j}^{t})\, \delta_{D}\big(SE_{\hat{\theta}_{j}^{t}}(b_{j,0}^{t-1},a_{j}^{t-1},o_{j}^{t}) - b_{j,0}^{t}\big)\, \Pr(a_{j}^{t-1}\mid\theta_{j,0}^{t-1})\, b_{i,1}^{t-1}(\theta_{j,0}^{t-1}\mid s^{(n)})\, d\theta_{j,0}^{t-1} \quad (7)
$$

We estimate the physical state, denoted by ρ^{(n)}_{a^{t-1}_j}, by propagating the particles as in the PF; the estimation of the other agent's models, given by κ^{(n)}_{a^{t-1}_j}, is performed as shown next.
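The overall computation in Eq. 6 can be summarized by the loop sketched below. This is only an illustrative skeleton under our own naming; transition_density, obs_density_i and update_model_belief stand in for T_i, O_i and the κ update of Eq. 7, which the following subsections make concrete.

```python
def rb_ipf_step(particles, model_beliefs, a_i, o_i, actions_j,
                transition_density, obs_density_i, update_model_belief, s_t):
    """One Rao-Blackwellised update (Eq. 6) evaluated at a candidate state s_t.

    particles:      sampled physical states s^(n)
    model_beliefs:  conditional beliefs b_{i,1}(theta_j | s^(n)) paired with the particles
    """
    terms = []
    for s_n, b_n in zip(particles, model_beliefs):
        for a_j in actions_j:
            # rho^(n)_{a_j}: weight the propagated physical-state particle as in the PF
            rho = obs_density_i(s_t, a_i, a_j, o_i) * transition_density(s_n, a_i, a_j, s_t)
            # kappa^(n)_{a_j}: update i's conditional belief over j's models (Eq. 7);
            # the Pr(a_j | theta_j) term is folded into this update, as in the paper
            kappa = update_model_belief(b_n, s_t, a_i, a_j)
            terms.append((rho, kappa))
    total = sum(rho for rho, _ in terms)
    if total == 0:
        raise ValueError("all particle weights are zero")
    # alpha/N normalization: rho weights are normalized across particles and j's actions
    return [(rho / total, kappa) for rho, kappa in terms]
```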
Belief Update over Models
For settings where the physical state space is continuous, we represent both agents' observation functions using Gaussian or softmax (also called logistic) densities (Jordan 1995). The observation function of j represented as a softmax is:

$$
O_{j}(s^{t}, a_{i}^{t-1}, a_{j}^{t-1}, o_{j}^{t} = o_{q}) \stackrel{def}{=} \frac{\exp(w_{q,0}^{y,a} + w_{q}^{y,a}\cdot x)}{\sum_{r=1}^{|\Omega_{j}|} \exp(w_{r,0}^{y,a} + w_{r}^{y,a}\cdot x)}
$$
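A minimal sketch of evaluating such a softmax observation density is shown below; the weight values are illustrative placeholders rather than the paper's fitted parameters.

```python
import numpy as np

def softmax_observation(x, biases, weights):
    """P(o_q | x) for each observation o_q, given per-observation biases w_{q,0}
    and weight vectors w_q applied to the continuous state x."""
    x = np.asarray(x, dtype=float)
    logits = np.asarray(biases, dtype=float) + np.asarray(weights, dtype=float) @ x
    logits -= logits.max()  # numerical stability
    exps = np.exp(logits)
    return exps / exps.sum()

# Continuous tiger problem, action L: two observations (GL, GR) over the tiger
# location x; with these illustrative weights a tiger at x > 0 makes GR more likely.
probs = softmax_observation(x=[0.5], biases=[0.0, 0.0], weights=[[-2.0], [2.0]])
print(dict(zip(["GL", "GR"], probs)))
```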
where the |Ω_j| sets of parameters ⟨w^{y,a}_{r,0}, w^{y,a}_r⟩ of the softmax function are instantiated for the discrete state variables, y, and the discrete actions of both agents, a.

In the continuous multiagent tiger problem, the observation function of agent j when it listens is shown in Fig. 2(a). Agent i's observation function when it listens is obtained by multiplying the softmax densities in Fig. 2(a) by 0.8 if it overhears correctly, and 0.1 otherwise. However, if i calls out any location, it hears neither the growl nor the other agent, and the observation function is O_i(x^t, OL, *, *) = 0.05.

Figure 2: (a) The observation function of j when it listens is a collection of sigmoids. (b) i's transition density when either or both agents call out a location and the tiger was previously at x^{t-1} = 0.25.

We represent the transition functions as CLGs, which are adequate for many realistic applications and additionally allow closed-form posteriors:

$$
T_{j}(s^{t-1}, a_{j}^{t-1}, a_{i}^{t-1}, s^{t}) = N\big(w_{0}^{y,a,y'} + w^{y,a,y'}\cdot x;\, \Sigma^{y,a,y'}\big)(x') \times \Pr(y'\mid x, y, a)
$$

where x', y' constitute s^t, x, y constitute s^{t-1}, Pr(y' | x, y, a) is a discrete probability distribution (softmax if y' depends on x), and the parameters w_0, w^{y,a,y'} may be distinct for each instantiation of Y, A, and Y'.

In the tiger problem, the transition function for the action OL, shown in Fig. 2(b), is T_j(x^{t-1}, OL, *, x^t) = N(x^{t-1}; 0.25)(x^t). Because the mean of the Gaussian is at x^{t-1}, the tiger persists at its previous location with a high probability. The transition function is similar for the other action, OR. If both agents listen, T_j(x^{t-1}, L, L, x^t) = δ_D(x^t − x^{t-1}). Agent i's transition function is analogous.

For clarity, we subdivide the process of updating i's conditional beliefs over j's models (Eq. 7) into three steps. In the first and second steps, we show how j's level 0 belief is updated and how i's nested belief over j's updated beliefs is obtained. The third step explains the effect of j's observation likelihood on the level 1 beliefs.

Step 1: Given that agent j's prior level 0 beliefs are Gaussian, and that we may closely approximate the product of a Gaussian and a softmax function by a Gaussian (Murphy 1999), j's level 0 belief update, SE_{θ̂^t_j}(b^{t-1}_{j,0}, a^{t-1}_j, o^t_j), is carried out analytically. Specifically, we derive a Gaussian that forms a lower bound to the softmax density using variational methods (see (Murphy 1999) for the derivation and (Jordan et al. 1999) for an introduction to variational methods). We note that while the variational Gaussian may not be a tight approximation to the softmax, the product Gaussian is a tight approximation to the product of the softmax and the Gaussian (for example, see (Murphy 1999)). Therefore, analogous to the Kalman filter, j's posterior belief is also a Gaussian whose mean and covariance are functions of the mean and covariance of the prior belief: μ^t_{j,0} = f^μ_{a_j,o_j}(μ^{t-1}_{j,0}, Σ^{t-1}_{j,0}) and Σ^t_{j,0} = f^Σ_{a_j,o_j}(Σ^{t-1}_{j,0}). We may rewrite the above as μ^{t-1}_{j,0} = g^μ_{a_j,o_j}(μ^t_{j,0}, Σ^t_{j,0}) and Σ^{t-1}_{j,0} = g^Σ_{a_j,o_j}(Σ^t_{j,0}), where g^μ_{a_j,o_j} and g^Σ_{a_j,o_j} may be seen as inverses under the assumption that f^μ_{a_j,o_j} and f^Σ_{a_j,o_j} are 1:1 maps. This assumption holds when the transition and observation functions are of the form defined previously.

To illustrate, consider j's level 0 update when it listens (L) and hears a growl from the right (GR):

$$
\Pr(x^{t}\mid a_{j}=L, b_{j,0}^{t-1}) = \int_{-1}^{1} \delta_{D}(x^{t}-x^{t-1})\, N(\mu_{j,0}^{t-1}, \Sigma_{j,0}^{t-1})(x^{t-1})\, dx^{t-1} = N(\mu_{j,0}^{t-1}, \Sigma_{j,0}^{t-1})(x^{t})
$$

When j hears GR,

$$
\Pr(x^{t}\mid o_{j}^{t}=GR, a_{j}=L, b_{j,0}^{t-1}) = \frac{1}{1+e^{-x^{t}}} \times N(\mu_{j,0}^{t-1}, \Sigma_{j,0}^{t-1})(x^{t})
$$

Using the variational method, the product may be closely approximated from below by a Gaussian, given in row 1 of Table 1. Here, λ ∈ R is the variational parameter. For any value of λ the Gaussian remains a tight lower bound; the approximation is exact when λ = x^t. Because x^t is hidden, one way to calculate a good λ is to use an iterative EM scheme given an initial guess of λ (Murphy 1999). The updated means and variances when j calls out a location are exact (row 2, Table 1). Given the functions in Table 1, the inverses, say, g^μ_{L,GL}(μ^t_{j,0}, Σ^t_{j,0}) and g^Σ_{L,GL}(Σ^t_{j,0}), may be obtained.

Table 1: The means and variances of j's updated level 0 belief for the multiagent tiger problem. λ is the variational parameter.

Row 1 (j listens and hears GR):
$$
\mu_{j,0}^{t} = f_{L,GR}^{\mu}(\mu_{j,0}^{t-1}, \Sigma_{j,0}^{t-1}) = \frac{2(0.5\,\Sigma_{j,0}^{t-1} + \mu_{j,0}^{t-1})\,\lambda\,(e^{-2\lambda}-1)}{e^{-2\lambda}(2\lambda-\Sigma_{j,0}^{t-1}) + 2e^{-\lambda}\Sigma_{j,0}^{t-1} - 2\lambda - \Sigma_{j,0}^{t-1}}
\qquad
\Sigma_{j,0}^{t} = f_{L,GR}^{\Sigma}(\Sigma_{j,0}^{t-1}) = \frac{2\,\Sigma_{j,0}^{t-1}\,\lambda\,(1+e^{-\lambda})}{-\Sigma_{j,0}^{t-1}(e^{-\lambda}-1) + 2\lambda(e^{-\lambda}+1)}
$$

Row 2 (j calls out a location, OL/OR):
$$
\mu_{j,0}^{t} = f_{OL/OR,*}^{\mu}(\mu_{j,0}^{t-1}, \Sigma_{j,0}^{t-1}) = \mu_{j,0}^{t-1}
\qquad
\Sigma_{j,0}^{t} = f_{OL/OR,*}^{\Sigma}(\Sigma_{j,0}^{t-1}) = \Sigma_{j,0}^{t-1} + 0.25
$$

Step 2: Turning our attention to the integral in Eq. 7, the term Pr(a^{t-1}_j | θ^{t-1}_{j,0}) is the probability that the action a^{t-1}_j is Bayes rational for the model θ^{t-1}_{j,0}, i.e., a^{t-1}_j ∈ OPT(θ^{t-1}_{j,0}), where OPT is the set of actions that are Bayes rational given the model. Let R_{a^{t-1}_j} ⊆ Θ^{t-1}_{j,0} be the contiguous region of j's models for which the action a^{t-1}_j is Bayes rational (Rathnas, Doshi, & Gmytrasiewicz 2006), i.e., define R_{a^{t-1}_j} such that ∀ θ^{t-1}_{j,0} ∈ R_{a^{t-1}_j}, a^{t-1}_j ∈ OPT(θ^{t-1}_{j,0}). Because Pr(a^{t-1}_j | θ^{t-1}_{j,0}) may be rewritten as (1/|OPT(θ^{t-1}_{j,0})|) δ_D(OPT(θ^{t-1}_{j,0}) − a^{t-1}_j), the integral becomes:

$$
\int_{R_{a_{j}^{t-1}}} \frac{1}{|OPT(\theta_{j,0}^{t-1})|} \sum_{o_{j}^{t}} O_{j}(s^{t},a_{i}^{t-1},a_{j}^{t-1},o_{j}^{t})\, \delta_{D}\big(SE_{\hat{\theta}_{j}^{t}}(b_{j,0}^{t-1},a_{j}^{t-1},o_{j}^{t})-b_{j,0}^{t}\big)\, b_{i,1}^{t-1}(\theta_{j,0}^{t-1}\mid s^{(n)})\, d\theta_{j,0}^{t-1}
$$

Substituting i's level 1 belief (Eq. 4) into the above, we get:

$$
\int_{R_{a_{j}^{t-1}}} \frac{1}{|OPT(\theta_{j,0}^{t-1})|} \sum_{o_{j}^{t}} O_{j}(s^{t},a_{i}^{t-1},a_{j}^{t-1},o_{j}^{t})\, \delta_{D}\big(SE_{\hat{\theta}_{j}^{t}}(b_{j,0}^{t-1},a_{j}^{t-1},o_{j}^{t})-b_{j,0}^{t}\big)\, N\big([w_{r,0}^{y}+w_{r}^{y}\cdot x^{(n)}];\Sigma^{y}\big)(\mu_{j,0}^{t-1},\Sigma_{j,0}^{t-1})\, d\theta_{j,0}^{t-1}
= \sum_{o_{j}^{t}} O_{j}(s^{t},a_{j}^{t-1},a_{i}^{t-1},o_{j}^{t})\, N\big([w_{r,0}^{y}+w_{r}^{y}\cdot x^{(n)}];\Sigma^{y}\big)\big(g_{a_{j},o_{j}}^{\mu}(\mu_{j,0}^{t},\Sigma_{j,0}^{t}),\, g_{a_{j},o_{j}}^{\Sigma}(\Sigma_{j,0}^{t})\big)
$$

if ⟨g^μ_{a_j,o_j}(μ^t_{j,0}, Σ^t_{j,0}), g^Σ_{a_j,o_j}(Σ^t_{j,0})⟩ lies in R_{a^{t-1}_j}, and 0 otherwise. Note that ⟨μ^t_{j,0}, Σ^t_{j,0}⟩ parameterize b^t_{j,0} and that 1/|OPT(·)| is absorbed into the density. We focus on the second term of the previous expression next. Intuitively, this density is i's updated belief that results when the transformations f^μ_{a_j,o_j} and f^Σ_{a_j,o_j} are applied to the variate ⟨μ^{t-1}_{j,0}, Σ^{t-1}_{j,0}⟩ at which a^{t-1}_j is Bayes rational. If the transformations are not linear, the resulting density over ⟨μ^t_{j,0}, Σ^t_{j,0}⟩ may not be Gaussian. In this case, we numerically estimate the Gaussian, N(μ^{(n)}_{a_j,o_j}; Σ^{(n)}_{a_j,o_j})(μ^t_{j,0}, Σ^t_{j,0}), that best fits the density using, say, the maximum likelihood (ML) approach.
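Where the transformed density is not Gaussian, a plausible way to compute the numerical ML fit, sketched below under our own assumptions, is to sample parameter points from the prior CLG restricted to R_{a_j}, push them through the update maps f^μ and f^Σ, and take the sample mean and covariance of the results (this corresponds to the "number of points for the Gaussian fit" varied in the experiments).

```python
import numpy as np

def fit_gaussian_to_transformed(prior_mean, prior_cov, in_region, f_mu, f_sigma,
                                n_points=200, seed=0):
    """ML Gaussian fit to the density over j's updated belief parameters.

    prior_mean, prior_cov: parameters of i's CLG over (mu_j, sigma_j) at particle s^(n)
    in_region(mu, sigma):  membership test for the Bayes-rational region R_{a_j}
    f_mu, f_sigma:         update maps from Table 1 (possibly non-linear)
    """
    rng = np.random.default_rng(seed)
    samples = []
    while len(samples) < n_points:
        mu, sigma = rng.multivariate_normal(prior_mean, prior_cov)
        if sigma > 0 and in_region(mu, sigma):
            samples.append([f_mu(mu, sigma), f_sigma(sigma)])
    samples = np.asarray(samples)
    return samples.mean(axis=0), np.cov(samples, rowvar=False)

# Example with the exact row-2 maps (j calls out a location): the mean is unchanged
# and the variance grows by 0.25; the region test here is an illustrative placeholder.
mean_fit, cov_fit = fit_gaussian_to_transformed(
    prior_mean=[0.0, 0.1], prior_cov=[[0.05, 0.0], [0.0, 0.01]],
    in_region=lambda mu, sigma: abs(mu) < 1.0,
    f_mu=lambda mu, sigma: mu, f_sigma=lambda sigma: sigma + 0.25)
print(mean_fit, cov_fit)
```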
Figure 3: (a) Agent j's horizon 1 policy. As the variance increases, j is less certain about the tiger's location and hence chooses to listen. The superimposed polygon (with the black border) represents the updated mean and variance of j's belief after L, GR. (b) Agent i's belief over j's beliefs given that j listens and hears a GR. (c) A maximum likelihood Gaussian fit to the density in (b). (d) κ^{(n)}_L(θ^t_{j,0} | x^t = 0) as a mixture of two Gaussian densities.

Let agent i's level 1 belief, b^{t-1}_{i,1}(θ^{t-1}_{j,0} | x^{(n)}), when x^{(n)} = 0 be the one shown in Fig. 1(b). For j's action a^{t-1}_j = L, R_L is a polygon (in red) shown in Fig. 3(a). We note that the update of j's beliefs on listening and hearing a GR shifts the (red) polygon right along the mean (because j heard a GR) and reduces the variance. Given j's action of L and an observation of GR, i's predicted belief over j's belief is

$$
N\Big([0, 0];\, \begin{pmatrix} 0.1 & 0.01 \\ 0.01 & 0.1 \end{pmatrix}\Big)\big(g_{L,GR}^{\mu}(\mu_{j,0}^{t}, \Sigma_{j,0}^{t}),\, g_{L,GR}^{\Sigma}(\Sigma_{j,0}^{t})\big)
$$

if the argument of the Gaussian is in R_L, and 0 otherwise. Here, g^μ_{L,GR}(μ^t_{j,0}, Σ^t_{j,0}) and g^Σ_{L,GR}(Σ^t_{j,0}) are the inverses of the functions in row 1 of Table 1. Notice that the functions in row 1 are non-linear. We show the predicted belief in Fig. 3(b), and the ML fit to the predicted belief in Fig. 3(c). If j calls out the location, then i's predicted belief over j's belief is

$$
N\Big([0, 0.25];\, \begin{pmatrix} 0.1 & 0.01 \\ 0.01 & 0.1 \end{pmatrix}\Big)(\mu_{j,0}^{t}, \Sigma_{j,0}^{t})
$$

if ⟨μ^t_{j,0}, Σ^t_{j,0} − 0.25⟩ is in R_{OL} (or R_{OR}), and 0 otherwise.

Step 3: The final step in calculating κ^{(n)}_{a^{t-1}_j}(θ^t_{j,0} | s^t) involves the product

$$
\sum_{o_{j}^{t}} O_{j}(s^{t}, a_{j}^{t-1}, a_{i}^{t-1}, o_{j}^{t})\, N(\mu_{a_{j},o_{j}}^{(n)}, \Sigma_{a_{j},o_{j}}^{(n)})(\mu_{j,0}^{t}, \Sigma_{j,0}^{t})
$$

where N(μ^{(n)}_{a_j,o_j}, Σ^{(n)}_{a_j,o_j}) is the fitted Gaussian density. Notice that κ^{(n)}_{a^{t-1}_j} is a mixture of Gaussians in which O_j(s^t, a^{t-1}_j, a^{t-1}_i, o^t_j) is the weight assigned to each participating Gaussian. This weight is the probability with which j received its observation, o^t_j, on performing action a^{t-1}_j. This action-observation combination forms the subscripts of the mean and variance of the Gaussian.

In the tiger problem, when both agents listen,

$$
\kappa_{L}^{(n)}(\theta_{j,0}^{t}\mid x^{t}) = O_{j}(x^{t}, L, L, GL) \times N\Big([0.04, 0.15];\, \begin{pmatrix} 0.013 & 0.003 \\ 0.003 & 0.013 \end{pmatrix}\Big)(\mu_{j,0}^{t}, \Sigma_{j,0}^{t}) + O_{j}(x^{t}, L, L, GR) \times N\Big([-0.11, 0.15];\, \begin{pmatrix} 0.052 & -0.007 \\ -0.007 & 0.013 \end{pmatrix}\Big)(\mu_{j,0}^{t}, \Sigma_{j,0}^{t})
$$

As j's observation densities are sigmoids, the above becomes

$$
\kappa_{L}^{(n)}(\theta_{j,0}^{t}\mid x^{t}) = \frac{1}{1+e^{x^{t}}} \times N\Big([0.04, 0.15];\, \begin{pmatrix} 0.013 & 0.003 \\ 0.003 & 0.013 \end{pmatrix}\Big)(\mu_{j,0}^{t}, \Sigma_{j,0}^{t}) + \frac{1}{1+e^{-x^{t}}} \times N\Big([-0.11, 0.15];\, \begin{pmatrix} 0.052 & -0.007 \\ -0.007 & 0.013 \end{pmatrix}\Big)(\mu_{j,0}^{t}, \Sigma_{j,0}^{t})
$$

which is a mixture of two Gaussians (κ^{(n)}_L(θ^t_{j,0} | x^t = 0) is shown in Fig. 3(d)).

While we started with a single prior CLG, the posterior κ^{(n)}_{a^{t-1}_j}(θ^t_{j,0} | s^t) is a mixture of multiple conditional Gaussians. In general, there will be at most |Ω_j| distinct Gaussians in the mixture and |A_j||Ω_j| many densities that make up agent i's level 1 belief after one step of the belief update. After t steps, there will be a maximum of (|A_j||Ω_j|)^t distinct densities in the mixture. As the number of mixture components grows exponentially with the updates, more compact representations of the belief are needed.
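The Step 3 mixture can be assembled as in the short sketch below, where each observation of j contributes one fitted Gaussian weighted by its likelihood; the container format is our own, and the exponential growth noted above would show up as this component list lengthening at every update.

```python
def build_kappa_mixture(obs_likelihood, fitted_gaussians):
    """Assemble kappa^(n)_{a_j} as a Gaussian mixture (Step 3).

    obs_likelihood:   dict o_j -> O_j(s^t, a_j, a_i, o_j), the mixture weights
    fitted_gaussians: dict o_j -> (mean, cov) fitted in Step 2 for that (a_j, o_j)
    """
    mixture = []
    for o_j, weight in obs_likelihood.items():
        mean, cov = fitted_gaussians[o_j]
        mixture.append({"weight": weight, "mean": mean, "cov": cov})
    return mixture

# Tiger problem at x^t = 0 with both agents listening: the sigmoid observation
# likelihoods are both 0.5, and the fitted Gaussians are those reported above.
kappa = build_kappa_mixture(
    obs_likelihood={"GL": 0.5, "GR": 0.5},
    fitted_gaussians={"GL": ([0.04, 0.15], [[0.013, 0.003], [0.003, 0.013]]),
                      "GR": ([-0.11, 0.15], [[0.052, -0.007], [-0.007, 0.013]])})
print(len(kappa), "components")
```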
Experiments
We implemented RB-IPF and evaluated its performance on the multiagent tiger problem and a sequential version of the public good problem with punishment (PG) (Fudenberg & Tirole 1991). Our sequential version of PG is from the perspective of agent i. Let x_u ∈ X_u be the quantity of resources in the public pot. We assume in our formulation that X_u is hidden. However, each agent receives an observation of plenty (PY) or meager (MR) symbolizing the state of the public pot. The resources in i's private pot, x_{r,i} ∈ X_{r,i}, are perfectly observable to i. We briefly list our formulation:

• The state space is X_i = X_u × X_{r,i}, which represents the amount of resources in the public pot and the private pot of agent i.
• The actions are A_i = {Contribute (C), Defect (D)}. An agent contributes a fixed amount, x_c, during the contribute action. Let A = A_i × A_j, where A_j = A_i.
• The observations of i are Ω_i = {PY, MR}.
• The transition function, T_i : X_i × A × X_i → [0, 1], is deterministic, as the amounts of the contributions are fixed and known. Note that both agents' actions affect X_u, while X_{r,i} is affected only by i's action.
• The observation function is O_i : X_u × Ω_i → [0, 1].
• The reward function is R_i : X_i × A → R. The reward is determined as follows: R_i(x_i, a_i, a_j) = x_{r,i} + c_i x_u − 1_D(a_i) 1_C(a_j) P − 1_C(a_i) 1_D(a_j) c_p, where c_i (= c_j) is the marginal private return, P is the punishment meted out to the defecting agent, and c_p is the non-zero cost of punishing for the contributing agent. 1_D(·) is an indicator function that is 1 if its argument is D, and 0 otherwise.

The transition function when both agents contribute is T_i(x^{t-1}_i, C, C, x^t_i) = δ_D((x^{t-1}_u + 2x_c) − x^t_u) δ_D((x^{t-1}_{r,i} − x_c) − x^t_{r,i}), where x^{t-1}_i = ⟨x^{t-1}_u, x^{t-1}_{r,i}⟩ and x^t_i = ⟨x^t_u, x^t_{r,i}⟩.
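To make the PG formulation concrete, here is a small sketch of the reward function and the deterministic transition for the case just given (both agents contribute); the defect cases, described below, follow the same pattern. Function names and the numeric constants are our own illustrative choices.

```python
def reward_i(x_u, x_r_i, a_i, a_j, c_i=0.1, P=2.0, c_p=0.5):
    """R_i(x_i, a_i, a_j) = x_{r,i} + c_i * x_u - 1_D(a_i)1_C(a_j)P - 1_C(a_i)1_D(a_j)c_p.

    The values of c_i, P and c_p are illustrative placeholders, not the paper's settings.
    """
    r = x_r_i + c_i * x_u
    if a_i == "D" and a_j == "C":
        r -= P      # i is punished for defecting while j contributes
    if a_i == "C" and a_j == "D":
        r -= c_p    # i pays the cost of punishing the defecting j
    return r

def transition_contribute_both(x_u, x_r_i, x_c=1.0):
    """Deterministic T_i(x^{t-1}, C, C, x^t): both agents add x_c to the public pot,
    and i's private pot decreases by x_c."""
    return x_u + 2 * x_c, x_r_i - x_c

print(reward_i(x_u=10.0, x_r_i=5.0, a_i="C", a_j="D"))
print(transition_contribute_both(x_u=10.0, x_r_i=5.0))
```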
Figure 4: Performance profiles for the tiger and PG problems. (a, c) Estimation accuracy of the RB-IPF is significantly better than the I-PF's given the same number of particles. (b, d) Comparison of the accuracy of the model estimations. Both filters use an identical number of particles over the physical states; the I-PF also uses that number for the models.
If i defects while j contributes, T_i(x^{t-1}_i, D, C, x^t_i) = δ_D((x^{t-1}_u + x_c) − x^t_u) δ_D(x^{t-1}_{r,i} − x^t_{r,i}), and, conversely, T_i(x^{t-1}_i, C, D, x^t_i) = δ_D((x^{t-1}_u + x_c) − x^t_u) δ_D((x^{t-1}_{r,i} − x_c) − x^t_{r,i}). If both defect, the pots remain unchanged. The probability densities for the two observations, PY and MR, are sigmoids for both agents.

We demonstrate that the RB-IPF is statistically more efficient than the I-PF. In other words, the RB-IPF estimates the hidden interactive state more accurately while consuming fewer particles and hence fewer computational resources. In Figs. 4(a) and (c), we show line plots of the L1 error as a function of the number of particles (N) allocated to each filter. The L1 error measures the distance between the approximate and exact posteriors. As it is difficult to carry out the belief update exactly, we used the I-PF with half a million particles to compute the 'exact' beliefs. Each data point is the average of 10 runs of the respective filter. For the multiagent tiger problem, the estimation was performed for the case where agent i listens (L) and hears a growl from the left but does not overhear anything (⟨GL, S⟩). In the PG problem, the beliefs resulting from agent i contributing (C) and perceiving plenty of resources (PY) in the public pot were used for comparison. We obtained similar results for other actions and observations. We used the belief in Fig. 1(b) as the prior of agent i in the tiger problem (b_{i,1}(x) was a Gaussian with mean −0.25 and variance 0.25), and an analogous belief in the PG problem.

Observe that for both problems, the posterior belief generated by the RB-IPF is much closer to the truth than that generated by the I-PF for the same number of particles. One reason for this is that the RB-IPF uses all the particles to estimate the continuous physical state, while in the I-PF the particles are distributed over the estimation of the physical state and the other's model. The second reason is that the RB-IPF updates the belief over the other agent's models more accurately; we demonstrate this using the profiles in Figs. 4(b, d). On average, exponentially more particles are needed for the I-PF to reach an L1 error identical to that of the RB-IPF. This is exemplified by the difference in the run times of the two methods (Fig. 5(a)) for an identical estimation accuracy.
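The L1 error used in these comparisons can be computed as in the brief sketch below, where both posteriors are discretized over a common grid; the grid-based discretization is our own simplification for illustration.

```python
import numpy as np

def l1_error(approx_density, exact_density, grid):
    """L1 distance between an approximate and an 'exact' posterior evaluated on a grid.

    approx_density, exact_density: callables returning the belief density at a grid point
    grid: array of points discretizing the (interactive) state space
    """
    approx = np.array([approx_density(p) for p in grid], dtype=float)
    exact = np.array([exact_density(p) for p in grid], dtype=float)
    approx /= approx.sum()  # normalize the discretized beliefs
    exact /= exact.sum()
    return float(np.abs(approx - exact).sum())

# Illustrative comparison of two Gaussian beliefs over the tiger's location.
grid = np.linspace(-1.0, 1.0, 201)
g = lambda m, v: (lambda x: np.exp(-0.5 * (x - m) ** 2 / v))
print(l1_error(g(-0.25, 0.25), g(-0.2, 0.25), grid))
```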
A source of error in the RB-IPF when updating the belief over the other agent's models is the step of fitting a Gaussian to the piecewise density resulting from j's update of its beliefs. We analyze the sensitivity of the performance of the RB-IPF to this error by varying the number of sample points that are used in calculating the ML Gaussian fit. Fig. 5(b) reveals that, for both problems, as few as 200 sample points produce an estimation accuracy that is not significantly worse than the average obtained by fitting a Gaussian more closely.

Figure 5: (a) For an identical L1 error, the RB-IPF takes less run time than the I-PF (Linux, Xeon 3.4GHz, 4GB RAM). Run times in seconds:

Prob.  | RB-IPF        | I-PF
Tiger  | 50.93 ± 0.96  | 151.05 ± 0.4
PG     | 32.06 ± 0.3   | 101.07 ± 0.45

(b) The plot shows that the accuracy of the RB-IPF is not significantly sensitive to the closeness of the Gaussian fit beyond a reasonable number of points.

Performance of the approach may suffer for deeply nested beliefs. This is because its effectiveness rests on being able to almost exactly update an agent's belief over the other's models, which requires that we update the other's beliefs exactly. However, at levels > 1, while we may recursively call the RB-IPF to update the belief, this update is no longer exact.
References

Casella, G., and Robert, C. 1996. Rao-Blackwellisation of sampling schemes. Biometrika 83:81–94.

Doshi, P., and Gmytrasiewicz, P. J. 2005. Approximating state estimation in multiagent settings using particle filters. In AAMAS.

Doucet, A.; Freitas, N. D.; and Gordon, N. 2001. Sequential Monte Carlo Methods in Practice. Springer Verlag.

Fudenberg, D., and Tirole, J. 1991. Game Theory. MIT Press.

Jordan, M. I.; Ghahramani, Z.; Jaakkola, T.; and Saul, L. K. 1999. An introduction to variational methods for graphical models. Machine Learning 37(2):183–233.

Jordan, M. I. 1995. Why the logistic function? Technical Report 9503, Computational Cognitive Science, MIT.

Murphy, K. 1999. A variational approximation for Bayesian networks with discrete and continuous variables. In UAI.

Rathnas, B.; Doshi, P.; and Gmytrasiewicz, P. 2006. Exact solutions to I-POMDPs using behavioral equivalence. In AAMAS.