From: AAAI Technical Report SS-01-03. Compilation copyright © 2001, AAAI (www.aaai.org). All rights reserved.
Probabilistic Pricebots

Jeffrey O. Kephart
IBM Institute for Advanced Commerce
IBM Thomas J. Watson Research Center
Yorktown Heights, NY 10598 USA
kephart@us.ibm.com

Amy R. Greenwald
Department of Computer Science
Brown University, Box 1910
Providence, RI 02912 USA
amygreen@cs.brown.edu
Abstract

Past research has been concerned with the potential of embedding deterministic pricing algorithms into pricebots: software agents used by on-line sellers to automatically price Internet goods. In this work, probabilistic pricing algorithms based on no-regret learning are explored, in both high-information and low-information settings. It is shown via simulations that the long-run empirical frequencies of prices in a market of no-regret pricebots can converge to equilibria arbitrarily close to an asymmetric Nash equilibrium; symmetric Nash equilibria are typically unstable. Instantaneous price distributions, however, need not converge.
Introduction
Pricebots, agents that employ automated pricing algorithms, are beginning to appear on the Internet. An early example resides at buy.com. This agent monitors its primary competitors' prices and then automatically undercuts the lowest. Driven by the ever-growing use of shopbots, which enhance buyer price sensitivity, we anticipate a proliferation of Internet pricebots, potentially generating complex price dynamics. This paper is concerned with the dynamics of interaction among pricebot algorithms. Ultimately, this line of research aims to identify those pricebot algorithms that are most likely to be profitable, from both an individual and a collective standpoint.
Recently, a simple model of shopbots and pricebots (Greenwald & Kephart 1999) was introduced, and a variety of (mostly deterministic) pricing algorithms were simulated (Greenwald, Kephart, & Tesauro 1999). Motivated in part by a game-theoretic analysis of this model, which yields only mixed-strategy Nash equilibria, this paper explores the use of probabilistic pricing based on no-regret learning (Foster & Vohra 1997; Freund & Schapire 1996), in various informational settings. Among the deterministic algorithms studied previously, one requires complete information about buyer demand and competitors' prices, while a second depends only on the individual pricebot's previous history of prices and profits. One of the early conclusions was that knowledge is power: pricebots that have access to and make use of more information earn greater profits. In the present work it is demonstrated that probabilistic pricebots with low-informational requirements can earn profits comparable to probabilistic pricebots with high-informational requirements, and moreover, the profits earned correspond to the game-theoretic equilibrium values.

This paper is organized as follows. First, we present our model of an economy consisting of shopbots and pricebots. Then, we analyze the model from a game-theoretic point of view. Next, we discuss both the high-information and low-information variants of the price-setting strategies of interest: no external regret learning and no internal regret learning. Then, we describe simulations of collections of pricebots that implement these strategies, in both informed and naive settings. We find that no-regret pricebots can converge to an equilibrium that is arbitrarily close to Nash equilibrium, although instantaneous price distributions need not converge. We conclude by discussing the profitability of deterministic pricing algorithms in comparison with no-regret learning in both high- and low-information settings.
Model
In this section, we summarize a previously-introduced model of shopbots and pricebots (Greenwald & Kephart 1999). Consider an economy in which there is a single homogeneous good that is offered for sale by $S$ sellers and of interest to $B$ buyers, with $B \gg S$. Each buyer $b$ generates purchase orders at random times, with rate $\rho_b$, while each seller $s$ reconsiders (and potentially resets) its price $p_s$ at random times, with rate $\rho_s$. The value of the good to buyer $b$ is $v_b$, and the cost of production for seller $s$ is $c_s$.
A buyer $b$'s utility for the good is a function of price. In particular, $u_b(p) = v_b - p$ if $p < v_b$, and $u_b(p) = 0$ otherwise. In other words, a buyer obtains positive utility if and only if the seller's price is less than the buyer's valuation of the good; otherwise, the buyer's utility is zero. We assume buyers consider the prices offered by sellers using one of the following strategies: (i) Bargain Hunter: the buyer checks the offer price of all sellers, determines the seller with the lowest price, and purchases the good if that lowest price is less than the buyer's valuation. This type of buyer corresponds to those who utilize shopbots. (ii) Any Seller: the buyer selects a seller at random, and purchases the good if the price charged by that seller is less than the buyer's valuation. This type of buyer represents those who are either uninformed, or interested in product characteristics other than price: e.g., quality, convenience. Fraction $w_A$ of buyers employs the any-seller strategy, while fraction $w_B$ behaves as bargain hunters, with $w_A + w_B = 1$.
A seller $s$'s expected profit per unit time $\pi_s$ is a function of the price vector $\vec{p}$, as follows: $\pi_s(\vec{p}) = (p_s - c_s) D_s(\vec{p})$, where $D_s(\vec{p})$ is the rate of demand for the good produced by seller $s$. This rate of demand is determined by the overall buyer rate of demand, the likelihood of the buyers selecting seller $s$ as their potential seller, and the likelihood that seller $s$'s price $p_s$ does not exceed the buyer's valuation $v_b$. If $\rho_B = \sum_b \rho_b$, and if $h_s(\vec{p})$ denotes the probability that seller $s$ is selected, while $g(p_s)$ denotes the fraction of buyers whose valuations satisfy $v_b \ge p_s$, then $D_s(\vec{p}) = \rho_B h_s(\vec{p}) g(p_s)$. Without loss of generality, define the time scale such that $\rho_B = 1$. Now $\pi_s(\vec{p})$ is interpreted as the expected profit for seller $s$ per unit sold systemwide. Moreover, seller $s$'s profit is such that $\pi_s(\vec{p}) = (p_s - c_s) h_s(\vec{p}) g(p_s)$.
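To make the model concrete, the following short Python sketch (ours, not from the paper; all names are illustrative) computes each seller's expected profit per unit sold systemwide under the two buyer types, assuming bargain hunters split ties evenly.

    import numpy as np

    def expected_profits(prices, costs, w_A, w_B, v):
        # Expected profit per unit sold systemwide for each seller (rho_B = 1).
        # Bargain hunters (fraction w_B) buy from the lowest-priced seller,
        # splitting evenly on ties; any-seller buyers (fraction w_A) pick a
        # seller uniformly at random. No buyer pays more than the valuation v.
        prices = np.asarray(prices, dtype=float)
        costs = np.asarray(costs, dtype=float)
        S = len(prices)
        sellable = prices <= v
        profits = np.zeros(S)
        if not sellable.any():
            return profits
        low = prices[sellable].min()
        ties = np.sum(prices == low)
        for s in range(S):
            if not sellable[s]:
                continue  # price above valuation: zero demand
            h = w_A / S  # share of any-seller buyers
            if prices[s] == low:
                h += w_B / ties  # bargain hunters split among tied low sellers
            profits[s] = (prices[s] - costs[s]) * h
        return profits

    # Two sellers, equal costs, half of the buyers using shopbots:
    print(expected_profits([0.8, 0.9], [0.5, 0.5], w_A=0.5, w_B=0.5, v=1.0))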
Analysis
We now present a game-theoretic analysis of the prescribed model viewed as a one-shot game¹, assuming $c_s = c$ for all sellers $s$, and $v_b = v$ for all buyers $b$.² Recall that $B \gg S$; in particular, the number of buyers is assumed to be very large, while the number of sellers is a good deal smaller. In accordance with this assumption, it is reasonable to study the strategic decision-making of the sellers alone, since their relatively small number suggests that the behavior of individual sellers indeed influences market dynamics, whereas the large number of buyers renders the effects of individual buyers' actions negligible.³ Assuming the distribution of buyer behavior is exogenously determined and fixed, and that sellers are profit maximizers, we derive the symmetric mixed strategy Nash equilibrium of the sellers' (and later, pricebots') game, and then we derive an asymmetric variant.

¹The analysis presented in this section applies to the one-shot version of our model, while the simulation results reported in the Simulations section focus on repeated (asynchronous) settings. We consider the Nash equilibrium of the one-shot game, rather than its iterated counterpart, for at least two reasons: (i) the Nash equilibrium of the stage game played repeatedly is in fact a Nash equilibrium of the repeated game; (ii) the Folk Theorem of repeated game theory states that virtually all payoffs in a repeated game correspond to a Nash equilibrium, for sufficiently large values of the discount parameter. We isolate the stage game Nash equilibrium as an equilibrium of particular interest.

²Assuming uniform valuations is tantamount to letting $g(p) = \theta(v - p)$, the step function defined as follows: $\theta(v - p) = 1$ if $p \le v$, and $\theta(v - p) = 0$ otherwise.

A Nash equilibrium is a vector of prices at which sellers maximize individual profits and from which they have no incentive to deviate (Nash 1951). There are no pure strategy (i.e., deterministic) Nash equilibria in the prescribed model whenever $0 < w_A < 1$ and the price quantum is sufficiently small (Greenwald & Kephart 1999). There exist mixed strategy Nash equilibria, however, the symmet-
ric variety of which we derive presently. Let $f(p)$ denote the probability density function according to which all sellers set their equilibrium prices, and let $F(p)$ be the corresponding cumulative distribution function (CDF). In this stochastic setting, the event that any given seller is the low-priced seller occurs with probability $[1 - F(p)]^{S-1}$. We now obtain an expression for expected seller demand:

$$h(p) = w_A \frac{1}{S} + w_B \left[1 - F(p)\right]^{S-1}, \quad \text{for } p \le v. \qquad (1)$$

Note that $h(p)$ is expressed as a function of the scalar price $p$, given that the probability distribution $F(p)$ describes the other sellers' expected prices. Profits $\pi(p) = (p - c) h(p)$, for all prices $p \le v$.

A Nash equilibrium in mixed strategies requires that (i) sellers maximize individual profits, given other sellers' pricing profiles, so that there is no incentive to deviate, and (ii) all prices assigned positive probability yield equal profits; otherwise it would not be optimal to randomize. To evaluate those profits, let $p = v$. Buyers are willing to pay as much as $v$, but no more; thus, $F(v) = 1$. It follows that $h(v) = w_A (1/S)$, and moreover that $\pi(v) = w_A (1/S)(v - c)$ (this implies, incidentally, that total profits are $w_A (v - c)$). Setting $\pi(p) = \pi(v)$ and solving for $F(p)$ yields:

$$F(p) = 1 - \left[ \frac{w_A}{S w_B} \left( \frac{v - c}{p - c} - 1 \right) \right]^{1/(S-1)}, \qquad (2)$$

which implicitly defines $p$ and $F(p)$ in terms of one another. $F(p)$ is only valid in the range $[0, 1]$. As noted previously, the upper boundary of $F(p)$ occurs at $p = v$; the lower boundary $p^*$ is computed by setting $F(p) = 0$ in Eq. 2:

$$p^* = c + \frac{w_A (v - c)}{w_A + w_B S}. \qquad (3)$$

Thus, Eq. 2 is valid in the range $p^* \le p \le v$. A similar derivation of the symmetric mixed strategy equilibrium appears in Varian (1980). Greenwald et al. (1999) present various generalizations.
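As a quick numerical check of Eqs. 2 and 3, the following sketch (ours; names are illustrative) evaluates the symmetric equilibrium CDF and its support, and draws equilibrium prices by inverting Eq. 2 (inverse-transform sampling).

    import numpy as np

    def sym_nash_cdf(p, S, w_A, w_B, c=0.5, v=1.0):
        # Symmetric mixed-strategy Nash CDF of Eq. 2; valid on [p*, v].
        return 1.0 - ((w_A / (S * w_B)) * ((v - c) / (p - c) - 1.0)) ** (1.0 / (S - 1))

    def p_star(S, w_A, w_B, c=0.5, v=1.0):
        # Lower boundary of the equilibrium support (Eq. 3).
        return c + w_A * (v - c) / (w_A + w_B * S)

    def sample_price(S, w_A, w_B, c=0.5, v=1.0, rng=np.random):
        # Draw one equilibrium price by solving F(p) = u for uniform u.
        u = rng.uniform()
        return c + (v - c) / (1.0 + (S * w_B / w_A) * (1.0 - u) ** (S - 1))

    S, w_A, w_B = 2, 0.5, 0.5
    lo = p_star(S, w_A, w_B)  # the support is [p*, v]
    assert abs(sym_nash_cdf(lo, S, w_A, w_B)) < 1e-9
    assert abs(sym_nash_cdf(1.0, S, w_A, w_B) - 1.0) < 1e-9
    print(lo, sample_price(S, w_A, w_B))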
In addition to the symmetric Nash equilibrium, asymmetric Nash equilibria exist, prescribed by the following structure.⁴ For $2 \le \sigma \le S$, $S - \sigma$ sellers deterministically set their prices to the monopolistic price $v$, while the remaining $\sigma$ sellers employ a mixed strategy described by the cumulative distribution function $F_\sigma(p)$, derived as follows. In an asymmetric setting, the event that one of the nondeterministic sellers is lowest-priced occurs with probability $[1 - F_\sigma(p)]^{\sigma-1}$. Now expected demand for the $\sigma$ sellers pricing according to mixed strategies is given by: $h(p) = w_A (1/S) + w_B [1 - F_\sigma(p)]^{\sigma-1}$. Following the argument given in the symmetric case, we set $\pi(p) = (p - c) h(p)$ equal to equilibrium profits $\pi(v)$ and solve for $F_\sigma(p)$, which yields the following mixed strategy distribution (see Fig. 1):

$$F_\sigma(p) = 1 - \left[ \frac{w_A}{S w_B} \left( \frac{v - c}{p - c} - 1 \right) \right]^{1/(\sigma-1)}. \qquad (4)$$

³Nonetheless, in a related study we consider endogenous buyer, as well as seller, decisions.

⁴The symmetric equilibrium is a special case of the asymmetric one in which $S = \sigma$.
Figure 1: CDFs for mixed-strategy components of the asymmetric Nash equilibrium price distribution, for values of $\sigma = 2, 3, 4, 5$; $\sigma = 5$ is the symmetric solution.
Learning
When the widespread adoption of shopbots by buyers forces sellers to become more competitive, sellers may well respond by creating pricebots that automatically set prices so as to maximize profitability. It seems unrealistic, however, to expect that pricebots will simply compute a Nash equilibrium and fix prices accordingly. The real business world is fraught with uncertainties that undermine the validity of game-theoretic analyses: sellers lack perfect knowledge of buyer demands, and they have an incomplete understanding of their competitors' strategies. In order to be deemed profitable, pricebots will need to learn from and adapt to changing market conditions.

In this paper, we study adaptive pricebot algorithms based on variants of no-regret learning, specifically no external regret (Freund & Schapire 1996) and no internal regret (Foster & Vohra 1997), emphasizing the differing levels of information on which the algorithms depend. An algorithm that requires as input the relevant profits at all its possible price points (including the expected profits that would have been obtained by prices that were not set) is referred to as an informed algorithm. Those algorithms which operate in the absence of any information other than that which pertains to the actual set price are referred to as naive algorithms. We also consider responsive variants of (both the informed and naive) no-regret learning algorithms, which learn based on exponentially decayed histories, and are therefore more apt to respond quickly to changes in the environment.
Definitions

Before describing our simulation results, we define no external and no internal regret, and describe the no-regret algorithms of interest. This description is presented in generic game-theoretic terms, from the point of view of an individual player, as if that player were playing a repeated game against nature.⁵ From this perspective, let $\pi^t_i$ denote the payoffs obtained by the player of interest at time $t$ via strategy $i$. (Let $j$, as well as $i$, range over the set of pure strategies $M$.) Mixed strategy weights at time $t$ are given by the probability vector $q^t = (q^t_i)_{1 \le i \le m}$, where $m = |M|$. Finally, the expected payoffs of mixed strategy $q^t$ are denoted by $E[\pi^t_q]$.

Now let $h^t$ be the subset of the history of repeated game $\Gamma^t$ that is known to the agent at time $t$. Let $H^t$ denote the set of all such histories of length $t$, and let $H = \bigcup_t H^t$. A learning algorithm $A$ is a map $A : H \to Q$, where $Q$ is the agent's set of mixed strategies. The agent's mixed strategy at time $t+1$ is contingent on the elements of the history known through time $t$: i.e., $q^{t+1} = A(h^t)$. The no-regret learning algorithms of interest in this study depend only on historic information regarding the payoffs obtained through time $t$, unlike, for example, fictitious play, which explicitly depends on the strategies of all players as well as all payoffs. For an informed player, a history $h^t$ has the form $(i^1, (\pi^1)), \ldots, (i^t, (\pi^t))$, where $(\pi^\tau)$ is the vector of payoffs at time $1 \le \tau \le t$ for all strategies $i$. For a naive player, a history $h^t$ has the form $(j^1, \pi^1_{j^1}), \ldots, (j^t, \pi^t_{j^t})$, which records only the payoffs of the strategy $j^\tau$ that is played at time $1 \le \tau \le t$.

The regret $\rho$ the player feels for playing strategy $i$ rather than strategy $j$ is simply the difference in payoffs obtained by these strategies at time $t$: $\rho^t(j, i) = \pi^t_j - \pi^t_i$. Suppose a player's learning algorithm prescribes that he play mixed strategy $q^t$ at time $t$. Then the regret the player feels toward strategy $j$ is the difference between the payoffs of strategy $j$ and the expected payoffs of mixed strategy $q^t$, namely: $\rho^t(j, q^t) = \pi^t_j - E[\pi^t_q]$. A learning algorithm $A$ is said to exhibit $\epsilon$-no external regret iff for all histories $h^t$, for all strategies $j$,

$$\limsup_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} \rho^t(j, q^t) < \epsilon, \qquad (5)$$

where $q^t = A(h^t)$ for all $1 \le t \le T$. In words, the limit of the sequence of average regrets between the player's sequence of mixed strategies and all possible fixed alternatives is less than $\epsilon$. If the algorithm exhibits $\epsilon$-no external regret for all $\epsilon > 0$, then it is said to exhibit no external regret.

No internal regret can be understood in terms of conditional regrets. Given an algorithm $A$ that generates a sequence of plays $\{i^t\}$, the conditional regret $R^T(j, i)$ the player feels toward strategy $j$ conditioned on strategy $i$ is the sum of the regrets at all times $t$ that the player plays strategy $i$:

$$R^T(j, i) = \sum_{\{1 \le t \le T \mid i^t = i\}} \rho^t(j, i). \qquad (6)$$

An algorithm exhibits no internal regret iff in the limit it yields no conditional regrets on average. Expressed in terms of expectation, for all histories $h^t$, strategies $i, j$, and $\epsilon > 0$,

$$\limsup_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} q^t_i \, \rho^t(j, i) < \epsilon, \qquad (7)$$

where $q^t = A(h^t)$ for all $1 \le t \le T$. It is well-known that an algorithm satisfies the no internal regret property iff its empirical distribution of play converges to correlated equilibrium (see, for example, Foster and Vohra (1997) and Hart and Mas-Colell (1997)). Moreover, no internal regret implies no external regret, and these two properties are equivalent in two-strategy games.

⁵Nature is taken to be a conglomeration of all opponents.
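To ground these definitions, the following sketch (ours; function names are illustrative, and it measures the realized rather than expected form of the regrets) computes average external regret and the conditional regret matrix of Eq. 6 from a completed history of plays and payoffs.

    import numpy as np

    def avg_external_regret(payoffs, played):
        # payoffs: T x m array; payoffs[t, i] is the payoff strategy i earned
        # (or would have earned) at time t. played: length-T array of indices.
        T = len(played)
        realized = payoffs[np.arange(T), played]
        # max over fixed strategies j of (1/T) sum_t (pi_j^t - pi_played^t)
        return np.max(payoffs.mean(axis=0) - realized.mean())

    def conditional_regrets(payoffs, played):
        # R[i, j] = sum over times t with i^t = i of (pi_j^t - pi_i^t), as in Eq. 6.
        T, m = payoffs.shape
        R = np.zeros((m, m))
        for t in range(T):
            i = played[t]
            R[i, :] += payoffs[t, :] - payoffs[t, i]
        return R

    rng = np.random.default_rng(0)
    payoffs = rng.random((1000, 3))
    played = rng.integers(0, 3, size=1000)
    print(avg_external_regret(payoffs, played))
    print(np.maximum(conditional_regrets(payoffs, played), 0.0))  # IR = R^+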
No External Regret Learning
Freund and Schapire (1996) propose an algorithm (NER) that achieves no external regret via a multiplicative updating scheme. Their algorithm depends on the cumulative payoffs achieved by all strategies, including the surmised payoffs of strategies which are not in fact played. Let $R^t_i$ denote the cumulative payoffs obtained through time $t$ via strategy $i$: i.e., $R^t_i = \sum_{\tau=1}^{t} \pi^\tau_i$. Now the weight assigned to strategy $i$ at time $t+1$, for $\alpha > 0$, is given by:

$$q^{t+1}_i = \frac{(1 + \alpha)^{R^t_i}}{\sum_j (1 + \alpha)^{R^t_j}}. \qquad (8)$$

The naive variant of NER, namely NER$_\epsilon$, is obtained by (i) imposing an artificial lower bound on the probability with which strategies are played, in order to ensure that the space of strategies is adequately explored, and (ii) utilizing estimates of cumulative payoffs that depend only on payoffs obtained by strategies that are actually employed. For $\epsilon \in (0, 1]$, let $\hat{q}^t_i = (1 - \epsilon) q^t_i + \epsilon/m$ be the weight assigned by NER$_\epsilon$ to strategy $i$, and let $\hat{\pi}^t_i = 1^t_i \pi^t_i / \hat{q}^t_i$.⁶ Estimated cumulative payoffs are given by $\hat{R}^t_i = \sum_{\tau=1}^{t} \hat{\pi}^\tau_i$. NER$_\epsilon$ updates weights according to the update rule given in Eq. 8, but $\hat{R}^t_i$ is used in place of $R^t_i$. NER$_\epsilon$ is due to Auer et al. (1995). NER can also be made responsive via exponential smoothing. Given $\gamma \in (0, 1]$, NER$_\gamma$ is defined by substituting $\tilde{R}^{t+1}_i$ into Eq. 8, where in informed settings and naive settings, respectively, $\tilde{R}^{t+1}_i = (1 - \gamma)\tilde{R}^t_i + \pi^{t+1}_i$ and $\tilde{R}^{t+1}_i = (1 - \gamma)\tilde{R}^t_i + \hat{\pi}^{t+1}_i$.

⁶$1^t_i$ is the indicator function, which has value 1 if strategy $i$ is employed at time $t$, and 0 otherwise.
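As a concrete illustration, here is a minimal sketch of the informed NER update of Eq. 8 (ours; computing the weights in log space to avoid overflow is our implementation detail, not part of the original description). Setting $\gamma > 0$ yields the responsive variant NER$_\gamma$.

    import numpy as np

    class NER:
        # Informed no-external-regret learner: multiplicative weights of Eq. 8.
        def __init__(self, m, alpha=0.5, gamma=0.0):
            self.R = np.zeros(m)          # (smoothed) cumulative payoffs R_i
            self.log1a = np.log1p(alpha)  # log(1 + alpha)
            self.gamma = gamma            # gamma > 0 gives the responsive NER_gamma

        def weights(self):
            z = self.R * self.log1a       # log of (1 + alpha)^{R_i}
            z -= z.max()                  # stabilize before exponentiating
            w = np.exp(z)
            return w / w.sum()            # Eq. 8

        def update(self, payoffs):
            # payoffs: payoff vector for *all* strategies (informed setting).
            self.R = (1.0 - self.gamma) * self.R + np.asarray(payoffs)

    learner = NER(m=51, alpha=0.5, gamma=1e-5)
    price_index = np.random.choice(51, p=learner.weights())

The naive variant NER$_\epsilon$ would mix these weights with the uniform distribution and feed the importance-weighted payoff estimates $\hat{\pi}^t_i$ into the same update.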
No Internal Regret Learning

We now discuss no internal regret learning (NIR), due to Foster and Vohra (1997), and a simple implementation due to Hart and Mas-Colell (1997). The regret felt by a player at time $t$ is formulated as the difference between the payoffs obtained by utilizing the player's strategy of choice, say $i$, and the payoffs that could have been achieved had strategy $j$ been played instead: $r^t_{i \to j} = q^t_i (\pi^t_j - \pi^t_i)$. The cumulative regret $R^t_{i \to j}$ is the summation of regrets from $i$ to $j$ through time $t$: $R^t_{i \to j} = \sum_{\tau=1}^{t} r^\tau_{i \to j}$. Now internal regret is defined as follows: $IR^t_{i \to j} = (R^t_{i \to j})^+$, where $X^+ = \max\{X, 0\}$. If the regret for having played strategy $j$ rather than strategy $i$ is significant, then the NIR procedure for updating weights increases the probability of playing strategy $i$. According to Hart and Mas-Colell, if strategy $i$ is played at time $t$, then

$$q^{t+1}_j = \frac{1}{\mu} IR^t_{i \to j} \quad \text{for } j \ne i, \qquad q^{t+1}_i = 1 - \sum_{j \ne i} q^{t+1}_j, \qquad (9)$$

where $\mu > (m - 1) \max_{j \in M} IR^t_{i \to j}$ is a normalizing term.

Like NER, NIR depends on complete payoff information at all times $t$, including information that pertains to strategies that are not employed at time $t$. NIR$_\epsilon$, which is applicable in naive settings, depends on an estimate of internal regret that is based only on the payoffs obtained by the strategies that are actually played, and the approximate weights associated with those strategies. An estimated measure of expected regret $\hat{r}^t_{i \to j}$ is given by $\hat{r}^t_{i \to j} = \hat{q}^t_i (\hat{\pi}^t_j - \hat{\pi}^t_i) = (\hat{q}^t_i / \hat{q}^t_j) 1^t_j \pi^t_j - 1^t_i \pi^t_i$, where $\hat{q}^t_i$ and $\hat{q}^t_j$ are defined as in NER$_\epsilon$. NIR$_\epsilon$ updates weights using Eq. 9, with estimated cumulative internal regret $\hat{IR}^t_{i \to j}$, based on $\hat{R}^t_{i \to j}$, in place of $IR^t_{i \to j}$.

No internal regret learning can also be made responsive (NIR$_\gamma$), in both informed and naive cases, via an exponential smoothing of regret. Given $\gamma \in (0, 1]$, exponentially smoothed cumulative regret, denoted $\tilde{R}^t_{i \to j}$, is computed in terms of either $r^t_{i \to j}$ or $\hat{r}^t_{i \to j}$, depending on whether the setting is informed or naive: i.e., $\tilde{R}^{t+1}_{i \to j} = (1 - \gamma)\tilde{R}^t_{i \to j} + r^{t+1}_{i \to j}$ or $\tilde{R}^{t+1}_{i \to j} = (1 - \gamma)\tilde{R}^t_{i \to j} + \hat{r}^{t+1}_{i \to j}$. NIR$_\gamma$ then uses $\tilde{IR}^t_{i \to j} = (\tilde{R}^t_{i \to j})^+$ as its measure of internal regret.
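A compact sketch of the informed update (ours, not the authors' code): it implements Eq. 9, and its choice of the normalizer $\mu$ as twice the required lower bound is an illustrative assumption.

    import numpy as np

    class NIR:
        # Informed no-internal-regret learner (Hart & Mas-Colell update, Eq. 9).
        def __init__(self, m, gamma=0.0):
            self.m = m
            self.R = np.zeros((m, m))     # R[i, j] = cumulative regret from i toward j
            self.gamma = gamma            # gamma > 0 gives the responsive NIR_gamma
            self.q = np.full(m, 1.0 / m)  # current mixed strategy

        def update(self, i, payoffs):
            # i: strategy just played; payoffs: payoff vector for all strategies.
            payoffs = np.asarray(payoffs)
            self.R *= (1.0 - self.gamma)
            self.R[i, :] += self.q[i] * (payoffs - payoffs[i])  # r_{i->j}^t
            IR = np.maximum(self.R[i, :], 0.0)                  # IR_{i->j}^t
            IR[i] = 0.0
            mu = 2.0 * (self.m - 1) * max(IR.max(), 1e-12)      # mu > (m-1) max_j IR
            self.q = IR / mu
            self.q[i] = 1.0 - self.q.sum()                      # remaining mass on i

    learner = NIR(m=51)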
Simulations

This section describes simulations of markets in which anywhere from 2 to 5 adaptive pricebots employ various mixtures of no-regret pricing strategies. At each time step, a pricebot, say $s$, is randomly selected and given the opportunity to set its price, which it does by generating a price according to its current price distribution. Profits are then computed for all pricebots. The profits for pricebot $s$ are taken to be expected profits,⁷ given the current price vector $\vec{p}$:

$$\pi_s(\vec{p}) = \left[ \frac{w_A}{S} + \delta(\lambda_s(\vec{p}), 0)\, \frac{w_B}{\tau_s(\vec{p}) + 1} \right] (p_s - c), \qquad (10)$$

where $\lambda_s(\vec{p})$ is the number of competitors' prices that are less than $p_s$, and $\tau_s(\vec{p})$ denotes the number that are exactly equal to $p_s$. Given its profits, pricebot $s$ uses its respective learning algorithm to update its price distribution. At this point in our simulations, we measure the Kolmogorov-Smirnov (K-S) distance $e$ between the symmetric Nash equilibrium CDF and the empirical CDF, computed as the average of the absolute differences between these CDFs over all $m$ prices $p_i$:

$$e = \frac{1}{m} \sum_{i=1}^{m} \left| F_{\mathrm{symNash}}(p_i) - F_{\mathrm{emp}}(p_i) \right|. \qquad (11)$$

Strictly speaking, Eq. 10 holds only if $p_s$ does not exceed the buyers' valuation $v$; otherwise the seller's profit is zero. For simulation purposes, this property was ensured by constraining the pricebots to set their prices among a discrete set of cardinality $m = 51$, spaced equally in the interval $[c, v]$, where $c = 0.5$ and $v = 1$. The mixture of buyer types was set at $w_A = w_B = 0.5$. Simulations were iterated for 100 million time steps for NIR pricebots and 10 million time steps for NER pricebots.

⁷We have also experimented with profits based on simulated buyer purchases, which introduced noise into the profit function. While the amplitude of the noise decreases as the size of the buyer population increases, such increases also increase simulation times. Expected profits enabled us to explore the behavior in the limit of infinitely large buyer populations without suffering inordinately long running times.
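Putting the pieces together, a bare-bones version of this simulation loop might look as follows (ours; it reuses the hypothetical NIR, sym_nash_cdf, and p_star helpers sketched above, and runs far fewer steps than the actual experiments).

    import numpy as np

    m, S = 51, 3
    prices = np.linspace(0.5, 1.0, m)            # discrete prices in [c, v]
    bots = [NIR(m) for _ in range(S)]            # e.g., three informed NIR pricebots
    idx = np.zeros(S, dtype=int)                 # current price index of each bot
    counts = np.zeros((S, m))                    # empirical play counts

    for _ in range(10**6):                       # the paper runs 10^8 for NIR
        s = np.random.randint(S)                 # one randomly chosen bot resets its price
        idx[s] = np.random.choice(m, p=bots[s].q)
        counts[s, idx[s]] += 1
        others = prices[np.delete(idx, s)]       # competitors' current prices
        lam = (others[None, :] < prices[:, None]).sum(axis=1)   # cheaper competitors
        tau = (others[None, :] == prices[:, None]).sum(axis=1)  # exact ties
        payoffs = (0.5 / S + (lam == 0) * 0.5 / (tau + 1)) * (prices - 0.5)  # Eq. 10
        bots[s].update(idx[s], payoffs)

    emp = np.cumsum(counts[0] / counts[0].sum())             # empirical CDF, bot 0
    grid = np.maximum(prices, p_star(S, 0.5, 0.5) + 1e-9)    # Nash CDF is 0 below p*
    nash = np.clip(sym_nash_cdf(grid, S, 0.5, 0.5), 0.0, 1.0)
    print(np.abs(nash - emp).mean())                         # K-S distance of Eq. 11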
Figure 2: 3 informed NIR$_\gamma$ pricebots, $\gamma = 10^{-6}$. a) Empirical CDF at time 100 million, with symmetric ($\sigma = 3$) and asymmetric ($\sigma = 2$) Nash CDFs superimposed. b) K-S distance between empirical and symmetric Nash CDFs over time. c) Instantaneous average prices over time.
Figure 3: 5 naive NIR$_\epsilon$ pricebots, $\epsilon = 0.05$. a) Empirical CDF at time 100 million, with symmetric Nash CDF superimposed. b) K-S distance between empirical and symmetric Nash CDFs over time. c) Instantaneous average prices over time.
NIR Pricebots
We now present simulation results for no internal regret learning. Our main observation is that NIR pricebots, both informed and naive, converge to Nash equilibrium. This is not entirely surprising, as NIR is known to converge within the set of correlated equilibria (Foster & Vohra 1997), of which Nash equilibria form a (proper) subset. Furthermore, NIR has previously been observed to converge to Nash, rather than correlated, equilibria in games with small numbers of strategies (Greenwald, Friedman, & Shenker 2000). In the present model, where the number of strategies varies between 51 and 501, we again find that NIR converges to Nash equilibrium. The detailed nature of the convergence, however, is quite different between NIR and NIR$_\epsilon$.

First, we consider informed NIR pricebots. In simulations of 2 to 5 NIR pricebots, the empirical price distributions have been observed to evolve to a mixed strategy Nash equilibrium, usually an asymmetric one. A typical example involving 3 informed NIR$_\gamma$ pricebots is shown in Fig. 2a. In this experiment, the responsiveness parameter was set to a relatively small value, $\gamma = 10^{-6}$; the results are qualitatively similar for the ordinary non-responsive form of the learning algorithm: i.e., $\gamma = 0$. The long-run cumulative empirical probability distributions coincide almost perfectly with the theoretical asymmetric Nash equilibrium for $\sigma = 2$: one pricebot always sets its price to 1, while the other 2 play the mixed-strategy equilibrium computed in Eq. 4 with $\sigma = 2$.
Figs. 2b and 2c reveal that the convergence to the asymmetric Nash equilibrium is not as regular as one might suppose: the system experiences a punctuated equilibrium. Fig. 2b displays the K-S distance between the symmetric Nash equilibrium CDF and the empirical CDF. By about time 3 million, the empirical price distributions for the pricebots begin to closely approximate the symmetric Nash equilibrium ($\sigma = 3$). This behavior continues until about time 30.92 million, with the K-S distance hovering close to 0.005 for each pricebot. Quite abruptly, however, at time 30.93 million, the K-S distance for one pricebot quickly rises toward 0.378, while that of the other 2 rises to 0.146. These are precisely the K-S distances between the symmetric Nash distribution and the pure and mixed components of the asymmetric Nash distribution with $\sigma = 2$. The conclusion is clear: at time 30.93 million, there is a spontaneous and abrupt transition from the symmetric ($\sigma = 3$) to the asymmetric ($\sigma = 2$) Nash equilibrium.
Fig. 2c provides some additional insight into the learning dynamics. At a given moment in time, each pricebot maintains an instantaneous model price distribution from which it randomly draws its price. Fig. 2c displays the mean of this model distribution as a function of time. The average prices are seen to be highly volatile. (The actual prices set by the pricebots vary even more wildly with time!) The sudden shift between Nash equilibria is again evident from this viewpoint: at time roughly 30 million, one of the average prices is pinned at 1, while the other prices start fluctuating wildly around 0.736, consistent with $\sigma = 2$. In numerous experiments (of 2 to 5 NIR pricebots), we consistently observed equilibrium shifts, always toward lower values of $\sigma$. Fig. 2c, which portrays instantaneous average prices over time, suggests that volatility decreases with $\sigma$, which partly explains why equilibria shift in this manner. Intuitively, the volatility of an equilibrium consisting of entirely mixed strategies exceeds that of an alternative equilibrium consisting of some pure and some mixed strategies.

The volatility of the average prices suggests that the symmetric mixed-strategy Nash equilibrium is in some sense unstable. At any one moment, the various pricebots are likely to be generating prices according to distributions that diverge substantially from the Nash distribution. Moreover, the instantaneous model distributions drift very quickly over time, with little temporal correlation even on a time scale as short as 10,000 time steps. Even before the shift between equilibria is made, it is evident that the pricebots have often experimented with the deterministic strategy (always set price to 1). Remarkably, the pricebots' learning algorithms exert the right pressure on one another to ensure that, averaged over a long period of time, the distribution of prices actually set by each pricebot is a Nash distribution: typically symmetric at first, becoming more and more asymmetric as time progresses. The time-average profits also reflect this: they are approximately 0.0838 for each pricebot, which is quite close to the theoretical value of 0.0833 for both the symmetric and asymmetric Nash equilibria.
We have also conducted numerous simulations of naive NIR$_\epsilon$ pricebots, finding that the empirical distributions always converge to distributions that become arbitrarily close to the symmetric Nash equilibrium as the exploration parameter $\epsilon \to 0$. Fig. 3a shows the CDFs for 5 NIR$_\epsilon$ pricebots with $\epsilon = 0.05$, all of which overlay the symmetric Nash CDF very closely. The K-S distances, depicted in Fig. 3b, settle in the range of roughly 0.01 to 0.025. The instantaneous average prices computed from the pricebots' model distributions are plotted in Fig. 3c. The volatile behavior of the informed NIR pricebots is somewhat diminished; average prices for NIR$_\epsilon$ pricebots oscillate near the mean price of the symmetric Nash equilibrium, namely 0.81. For $\epsilon = 0.02$, the average NIR$_\epsilon$ pricebots' profits were 0.0498, negligibly lower than the theoretical value of 0.05. In contrast, for $\epsilon = 0.1$, the CDFs for all pricebots are extremely close to one another, but there is a consistent bias away from the symmetric Nash price distribution, one that leads to somewhat lower prices on average. Moreover, the instantaneous average prices (and therefore the underlying price distributions) are considerably less volatile. Increasing the amount of exploration leads to greater consistency and stability, but not surprisingly this also leads to greater deviations from the Nash distribution.
NER Pricebots
In this section, we present simulation results for no external regret pricing (NER, with $\alpha = 0.5$). There are a number of learning algorithms that satisfy the no external regret optimality criterion (the earliest are due to Hannan (1957) and Blackwell (1956)). Since no internal regret and no external regret are equivalent in two-strategy games, no external regret algorithms converge (in empirical frequencies) to correlated equilibrium in two-strategy games. Greenwald et al. (2000) report that many no external regret algorithms converge to Nash equilibrium in games of two strategies.

As illustrated in Fig. 4, the learning dynamics are quite different in the present case, where there are $m = 51$ different strategies. In Fig. 4a, 2 informed NER pricebots (with $\alpha = 0.5$) never settle at a deterministic equilibrium. Moreover, the empirical CDF depicted in Fig. 4b deviates significantly from the Nash CDF; and the K-S distance shown in Fig. 4c oscillates indefinitely with an exponentially increasing period, and is apparently bounded away from zero. This behavior indicates that, unlike NIR pricebots, the long-run empirical behavior of 2 NER pricebots will never reach the symmetric Nash CDF, even after infinite time.

Instead, the 2 NER pricebots engage in cyclical price wars, with their prices highly correlated, behavior not unlike that of myopic (MY) best-response pricebots (Greenwald & Kephart 1999). NER price war cycles differ from MY price war cycles, however, in that the length of NER cycles grows exponentially, whereas the length of MY cycles is constant. This outcome results because NER (non-responsive) pricebots learn from the ever-growing history dating back to time 0, while myopic learning at time $t$ is based only on time $t - 1$. The play between 2 NER pricebots in the present setting is reminiscent of fictitious play in the Shapley game, a 2-player game of 3 strategies for which there is no pure strategy Nash equilibrium (Shapley 1953).
In order to eliminate exponential cycles, we now turn to the responsive algorithm NER$_\gamma$, with responsiveness parameter $\gamma = 10^{-5}$, which smooths the pricebots' observed history. This smoothing effectively limits the previous history to a finite time scale on the order of $1/\gamma$. Price war cycles are still observed, but they quickly converge to a constant period of roughly $S/\gamma$ (see Fig. 5a). In order to compute an empirical CDF, we again used exponential weighting to smooth the empirical play, but rather than using a time scale of 100,000 (which would only remember a portion of the price-war cycle), we lengthened the time scale to 2 million. The smoothed empirical CDF that results (see Fig. 5b) is extremely close to the Nash CDF, with a final K-S distance of only 0.0036 for each pricebot.
Figure 4: 2 informed NER pricebots. a) Actual prices over time. b) Empirical CDF at time 10 million, with symmetric Nash CDF superimposed. c) K-S distance between empirical and symmetric Nash CDFs over time.
Figure 5: 2 informed, responsive NER pricebots; $\gamma = 10^{-5}$. a) Actual prices over time. b) Empirical CDF at time 10 million, with exponential smoothing over a time scale of 2 million. c) K-S distance between empirical and symmetric Nash CDFs.
Fig. 5c depicts the K-S distance between the smoothed empirical distribution and the symmetric Nash equilibrium over time. Both pricebots' errors are plotted, but the values are so highly correlated that only one error function is apparent. The errors diminish in an oscillatory fashion over time, reaching 0.0036 after 10 million time steps. The long-run empirical distribution of play of responsive, informed NER$_\gamma$ pricebots is bounded away from Nash by a small function of $\gamma$, but approaches Nash as $\gamma \to 0$.
Finally, we report on results for naive NER$_\epsilon$ pricebots (not shown). For non-responsive NER$_\epsilon$ pricebots, price-war cycles with exponentially increasing period can still be discerned, despite being obscured somewhat by a uniform peppering of exploratory prices. The empirical CDF is somewhat closer to the Nash equilibrium than for non-responsive informed NER pricebots, with the cumulative K-S distance dropping to a minimum of 0.021 in the course of its oscillatory trajectory. However, it appears that it is not destined to converge to Nash. For NER pricebots that are both naive and responsive, the price correlations are greatly diminished; prices over time appear random to the naked eye. Nonetheless, the empirical CDF (computed just as for the informed NER pricebots) is again quite close to the Nash CDF. The final K-S distances (after 10 million time steps) for the two pricebots are 0.0136 and 0.0182.
Apparently, responsiveness makes an important difference in NER pricing. For non-responsive NER, the price dynamics never approach Nash behavior, but for responsive NER$_\gamma$, the time-averaged play approaches Nash as $\gamma \to 0$. For responsive NER$_\epsilon$, finite values of $\gamma$ and $\epsilon$ lead to near-Nash time-averaged play, which approaches Nash as $\gamma, \epsilon \to 0$.
Conclusion
This paper investigated probabilistic no-regret learning in the context of dynamic on-line pricing. Specifically, simulations were conducted in an economic model of shopbots and pricebots. It was determined that both no internal regret learning and (responsive) no external regret learning converge to Nash equilibrium, in the sense that the long-term empirical frequency of play coincides with Nash. Neither algorithm generated probability distributions which themselves converged to the Nash equilibrium, however.

It remains to simulate heterogeneous mixtures of pricebots, both deterministic and non-deterministic. Preliminary studies to this effect suggest that MY (a.k.a. Cournot best-reply dynamics (Cournot 1838)), a reasonable performer in high-information settings, outperforms informed no-regret learning algorithms (both NIR and NER). MY has no obvious naive analogue, however, and thus far the naive versions of no-regret learning have outperformed any naive implementation of MY. Thus it seems that no-regret learning would be more feasible than classic best-reply dynamics in pricing domains like the Internet, where only limited payoff information is available.

The scope of the results presented in this paper is not limited to e-commerce models such as shopbots and pricebots. Bots could also be used to make networking decisions, for example, such as along which of a number of routes to send a packet or at what rate to transmit. No-regret learning is equally applicable in this scenario (as noted by the inventors of the no external regret learning algorithm (Freund & Schapire 1996)). In future work, it would be of interest to analyze and simulate a game-theoretic model of networking via no-regret learning in an attempt to validate the popular assumption that Nash equilibrium dictates a network's operating point (Shenker 1995).
References
Auer, P.; Cesa-Bianchi, N.; Freund, Y.; and Schapire, R. 1995. Gambling in a rigged casino: The adversarial multi-armed bandit problem. In Proceedings of the 36th Annual Symposium on Foundations of Computer Science, 322-331. ACM Press.

Blackwell, D. 1956. An analog of the minimax theorem for vector payoffs. Pacific Journal of Mathematics 6:1-8.

Cournot, A. 1838. Recherches sur les Principes Mathématiques de la Théorie des Richesses. Hachette.

Foster, D., and Vohra, R. 1997. Regret in the on-line decision problem. Games and Economic Behavior 21:40-55.

Freund, Y., and Schapire, R. 1996. Game theory, on-line prediction, and boosting. In Proceedings of the 9th Annual Conference on Computational Learning Theory, 325-332. ACM Press.

Greenwald, A., and Kephart, J. 1999. Shopbots and pricebots. In Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence, volume 1, 506-511.

Greenwald, A.; Friedman, E.; and Shenker, S. 2000. Learning in network contexts: Results from experimental simulations. Games and Economic Behavior: Special Issue on Economics and Artificial Intelligence.

Greenwald, A.; Kephart, J.; and Tesauro, G. 1999. Strategic pricebot dynamics. In Proceedings of the First ACM Conference on Electronic Commerce, 58-67.

Hannan, J. 1957. Approximation to Bayes risk in repeated plays. In Dresher, M.; Tucker, A.; and Wolfe, P., eds., Contributions to the Theory of Games, volume 3. Princeton University Press. 97-139.

Hart, S., and Mas-Colell, A. 1997. A simple adaptive procedure leading to correlated equilibrium. Technical report, Center for Rationality and Interactive Decision Theory.

Nash, J. 1951. Non-cooperative games. Annals of Mathematics 54:286-295.

Shapley, L. 1953. A value for n-person games. In Kuhn, H., and Tucker, A., eds., Contributions to the Theory of Games, volume II. Princeton University Press. 307-317.

Shenker, S. 1995. Making greed work in networks: A game-theoretic analysis of switch service disciplines. IEEE/ACM Transactions on Networking 3:819-831.

Varian, H. 1980. A model of sales. American Economic Review, Papers and Proceedings 70(4):651-659.