Digitized by the Internet Archive in 2011 with funding from Boston Library Consortium Member Libraries http://www.archive.org/details/rulesofthumbforsOOelli /orking paper department of economics RULES OF THUMB FOR SOCIAL LEARNING Glenn Ellison Drew Fudenberg No. 92-12 June 1992 massachusetts institute of technology 50 memorial drive Cambridge, mass. 02139 RULES OF THUMB FOR SOCIAL LEARNING Glenn Ellison Drew Fudenberg No. 92-12 j une 1992 M.I.T. AUG LIBRARIES 4 1992 REOtsvtu MASS. SF n ,r I T TECH. : 7 ^2 Rules of Thumb for Social Drew Fudenberg 3 Glenn Ellison 2 June 'We thank Learning 18, 1992 Ahbijit Banerjee, Mathias Dewatripont, and seminar participants at University College, London, Nuffield College, Oxford, the Universite Libre de Bruxelles for helpful comments. GREMAQ We r Toulouse, and are also happy to ac- knowledge financial support from a National Science Foundation Graduate Student Fellowship, from NSF grant SES 90-08770, the John Simon Guggenheim Foundation, and Sloan Foundation Graduate Program. 2 Harvard University, Department of Economics 3 MIT, Department of Economics, and IDEI, Toulouse. ABSTRACT This paper considers agents who use the experiences of their neighbors in deciding which of two technologies to use. We consider two learning environments, one where the same technology optimal is for all players the and technology is better for some of them. suppose that a that ignore players all use data lead adjustment to fairly but can be decisions efficient quite slow when first introduced. JEL Classifications: D8, C7 , N53, 030 rules specified which tendency to use the more popular technology. can where each In both environments, exogenously historical other a in may of we thumb incorporate These naive rules the superior long run, technology but is Introduction 1. presents paper This simple two models how of economic agents decide which of two technolgies to use when the relative profitability of the technologies is unknown. In both models, agents base their decisions, at least in part, on the experience of their neighbors; that believe We this is what we mean by "social learning." social learning is frequently important an aspect of the process of technology adoption, where "technology" should be broadly construed: Although our main example concerns adoption the technology agricultural of English the in agricultural revolution, we believe that the models may also be applicable to such choices as parents' decisions whether to send their children to a public or private school. There have been several previous social learning in technology adoption. contagion the process, which models models of role the of Perhaps the earliest is adoption as random a matching process in which players switch to the new technology the first time they meet someone who is using it; yields familiar the "S-shaped curve" for the this process path time of adoption that has been widely used in empirical work. Recent papers by Banerjee [1991a], al [1991] social choices and Smith learning, is in better. [1991b], Bikchandari et [1990] study more sophisticated models of which players must decide The primary question of which interest of in two these models is whether the social learning is sufficient to prevent the population from locking on to the wrong decision. These papers suppose that players observe one another's choices, that players do not observe the payoffs that these but choices generate. 2 In a model where players choose sequentially, later decision makers may thus be led to make the same wrong choice as their predecessors, as the late deciders do not observe that the early ones now regret their choice. Moreover, this can occur even though the players could identify the optimal choice with certainty by pooling their information. The learning previous work environments three ways. in we First, context of technology adoption, that players differ is more it 3 individuals periodically reevaluate their own observe other decision. those of in the natural to suppose payoffs, our paper differs Second, from we believe that, observe their neighbors' do their choices. study well as supposing that in players' outcomes consider we Third, as and the possibility that players may be sufficiently heterogeneous that under full information they would not all make the same choice. 4 In the diffusion of innovations, agricultural for example, different technologies may be appropriate for different soils and climates. In addition environment, style of process our to paper its analysis: is differences these also differs from those a reasons for proceeding in this fashion. we in game played by consider, fully We have several First, Bayesian in some of the learning requires calculations that may be too complicated to be realistic. second motivation for our approach is that, the technology choice the we suppose that players use exogenously specified, and quite simple, "rules of thumb." environments cited learning Instead of assuming that the adoption described by the equilibrium of fully rational agents, the in may be A to the extent that substantially different than decisions previous the have players would we faced, be uncomfortable with the assumption that the technology adoption process motivation to we simply technical expediency: is way easy A somewhat different described by an equilibrium. is considerations various incorporate did not we see feel an are important into a rational-actor, equilibrium, model. paper The environments. learning population simple two has one of homogeneous a between choosing models two competing with the payoff to each technology subject to an aggregate i.i.d. has around first The players of technologies, players structured is shock. opportunity the only some fraction of the Each period, to revise their these choices; players make their choices using simple rules of thumb. Our thumb analysis in which choose players whichever This period. begins with ignore technology rule a will particularly historical all better worked lead the rule of "naive" data in popularity • simply and the of previous the two technologies to fluctuate unless one of the technologies has a higher payoff for all values of the shock. We subsequently "popularity weighting," consider tendency a which rules choose to a incorporate more popular technology even if it was somewhat less profitable last period. Since players observe both the choices and the payoffs of their neighbors, they would have no reason to use popularity weighting if they made that full use of their information. real-world popularity, and decision indeed makers our often results do may However, pay help we feel attention provide to some explanation for this phenomenon. Specifically, in our model the appropriate use of popularity weighting leads players to adopt and stick with the Intuitively, technology. better strategy a which more is popular today is likely to have done well in the past, so that the relative popularity of the technologies can serve as a proxy performance. their historical for Thus fairly is clear that popularity weighting rules can lead to better decisions; we find that one particular choice of popularity weights picks out the This leads us to ask whether better technology in the long run. there are any reasons to believe that this optimal popularity rule is either particularly likely or particularly unlikely to In response, be used. popularity rules we present a model of players choosing which in the optimal rule is the unique symmetric equilibrium. Our second model has a heterogeneous population, with each technology better for some of the players. Thus the question here is not whether the better technology will be adopted, rather whether the new distributed Moreover, . . have players will adopted be the by We suppose that there is a continuum of appropriate players. players technology but uniformly similar over payoffs to a and line, the two that nearby . technologies. 5 suppose that players base their decisions on the we relative performance of the two technologies at locations that This window width, are within one "window width" of their own. which is exogenous in our model, the result an of informational can be thought of as either constraint- players may not observe outcomes at far-away locations- or as the result of a prior belief that players at far-away locations are not sufficiently similar for their experiences to be informative. Once again, players revise their technology choices using do not know exactly how thus and simply technologies in we suppose that players In particular, simple rules of thumb. location compare average the window, their influences relative payoffs, payoffs opposed as of the using to two more sophisticated statistical methods. This second model provides a number of predications about the types and magnitudes of the errors that are likely to be The spatial nature of the process allows some degree of made. even learning social without popularity weighting, and the long-run state of the system is approximately efficient when the window width is small. the However, small window widths imply that system converges more slowly, initial state increasing the is far from which can be costly optimum. the popularity weighting in the if the Roughly speaking, spatial model has about the same effect as decreasing the window width. Before proceeding further, we should acknowledge that our belief that models of bounded rationality are a useful way to study social learning does not mean that we satisfied with the particular rules we consider. are completely In particular, in the first model use of history does not seem so complicated Our purpose is not to argue that any as to be unreasonable. one of these models is particularly compelling, but rather to identify general properties that seem to occur in some of the more obvious formulations. a One recurrent conclusion is that in number of cases the long-run state of the system is efficient, naive. fairly even though the individual decision rules are quite 2 The English Agricultural Revolution . developing Before our we models, would like to use the example of the English agricultural revolution to introduce some of issues the that our models designed to are address. The revolution in question is the improvement in English agriculture This improvement is usually attributed between 1650 and 1850. to the spread of new agricultural practices, and in particular to what is called the "new husbandry," although both the size of the improvement and its causes have become more controversial in The new husbandry refers recent years. new and crops crop rotations which to variety of a arrived England in new from Flanders in the 17th century, based on the idea of growing crops such as clover or turnips instead of leaving the land fallow. These crops rejuvenate could the soil, and hoeing the required had the by-product of eliminating weeds. The they crops themselves can be fed to livestock, which allows larger herds to be over carried winter, providing further supplies which in turn could be used for fertilizer. manure, result is grains. the (ideally) of The net increased production of both livestock and 7 The diffusion of the new husbandry seems to have involved all of the major aspects of our models that we mentioned in the introduction. observe, well as at First, it seems likely that farmers were able to least roughly, their neighbors' output of the choice of crops their neighbors, and as crop rotations. Second, the payoffs to various crops were different at different depending on the soil, locations, farm. climate, and terrain of each There is not yet a consensus on where the new husbandry should have been adopted, but it is clear that it was not a improvement: universal those suited where Plain, the so difficult to harvest and was growth, Midland the of most rot in the field Third, . the to light and were unprofitable in wet clay soils clay soils of Norfolk, like were Turnips a crop it was had limited often left to crop of turnips could be ruined by excessive rain or severe frosts, leading us to incorporate the weather as an annual stochastic shock complicating the learning process. 9 Moving away from the physical description of the situation to (harder our behavior, it resulted from pronouncement to verify) clear seems from central a attempts made along this seems that decentralized plausible that line, the about assumptions adoption final the learning, authority decisions to there (although may agents' opposed as as we discuss farmers the below.) have a were And, it been less sophisticated in their use of past observations than Bayesian learning would suggest. Moreover, with capital markets poorly developed or nonexistent, and starvation a potential concern, it seems plausible determined that primarily farmers' by technology short-term decisions considerations, and were that farmers would be unlikely to experiment with a technology with a lower expected return. The two models we discuss explore two different aspects of the adoption process. a single The homogeneous-population model looks at location in isolation, the adoption process. and focuses on the dynamics of It has been frequently noted that farmers as a group are seemingly very hesitant to try new technologies. These comments do not suggest that all hesitant; for example, Slicher von Bath (op. farmers cit. , are p. equally 243) notes that during the English agricultural revolution, very ancient ways were followed." next to lay fields observation This which crop rotations in with fits "Land tilled in our fraction of the meaning that at each date only some inertia, assumption of population considers changing technologies. The apparent inertia in the diffusion of the new husbandry has been criticized by both contemporary and modern authors as slowing progress, and indeed it does slow the transition from a dominated technology to a new one. world) improve may that (one However, worse in states all of the our analysis shows that inertia performance long-run the is process the of if the performance of the two technologies is subject to sufficiently Given our assumption that players do not large random shocks. track keep of past outcomes, without process the inertia oscillates between the two technologies, while a combination of with inertia tendency a use to most the popular technique permits the learning process to converge to the better of the Our second model examines the idea that players learn from two. their "neighbors" when new a technology may turn out to profitable for some but not all of the potential adopters. the case of farming, but we also believed to have be mind in of (In we take this spatial structure literally, similar, One adjacent.) be the learning who but most "neighbors" from need striking not and be who are geographically frequently noted characteristics of the agricultural revolution is the slow rate at which England the said innovations that the spread. rate was Contemporary only one mile observers per year, in and indeed it did take more than a century for the new husbandry to make its way across the island Many explanations have been proposed for this slow spread, technological including factors like pests institutional factors such as the enclosures, lack of education. role, and diseases, and the farmers' While all of these factors may have played a our analysis suggests that the basic fact of a slow rate of diffusion need not be surprising, as it is can be a natural consequence of social learning, particularly when the difference and when farmers pay attention to the in payoffs is not great, popularity relative technology each of making in their decisions. A second and more spirited debate has surrounded the role of elite landlords and agricultural reformers on the diffusion Ernie [1912] and Mantoux [date] process. The classic studies of portrayed the agricultural revolution as the result of valiant struggle by innovators such as Jethro Tull and Lord Townshend to overcome the have authors ignorance of peasant the attacked this view; a Revisionist farmers. main component of their argument has been that the new techniques were already in use in some areas of England before many of the so-called innovators were born. Our model reinforces the pro-elitist side of this debate by emphasizing that popularizers can have a significant impact in promoting the diffusion of a technique even if they are not the first to develop it. The arrival of new husbandry the in England is often dated to the publication in 1650 of Sir Richard Weston's observations of agricultural practices in Brabant and Flanders . When one examines the spread of the new husbandry as detailed by Kerridge the new husbandry [op. spread cit. out , pp. very 272-279], slowly it appears from a that number of geographically distinct locations. Suffolk, other first adoptions were The in but by 1680 the new husbandry had appeared in several locations over hundred a information from other counties, miles the away. providing By agricultural popularizers may have promoted such subsequent innovations, and sped up the rate of the technology's diffusion. our theoretical model may shed some light on In addition, another which question has received attention less the in Was the new husbandry eventually adopted in all of literature: the areas to which was suited? it As potential guide to a historical research, we discuss the conditions under which naive learning processes of kind the we consider tend generate to efficient long-run outcomes. Beyond these historical questions, our spatial of model, learning raises the basic question of how well "naive" learning perform rules we that homogeneous-population, tradeoff between investigated Our answers here model. rates of in adoption and our focus efficiency first, on the the of long-run equilibrium. 3 A Simple Model of Homogeneous Populations . Before considering heterogeneous simpler case players. a population, in which learning social it is interesting the same technology is systems to with consider optimal for a the all This model can be thought of as describing behavior at single site in the model we consider relative payoffs vary with location. large in (continuum) later on, where the Suppose that there population of players at a single site, is a each of whom must choose whether to use technology f or technology g. 10 In each period, (Given our assumption that players observe one the same payoff. nothing would be changed if we allowed each payoffs, anothers' player's players using the same technology receive all payoff subject be to idiosyncratic to shocks.) suppose that the payoffs to the two technologies at date We t, u. and u^, are related by the equation u^ - u£ = 8 + (1) c , t where 8 is a fixed but unknown constant parameter and the i.i.d. function (In H. and the cumulative distribution later sections the constant 9 will vary with is strictly between In mean zero We will assume that p location.) 0] with shocks are c. initial and = 1 = Prob [u^ - - H(-6) u. £ 1. denoted period, players are using technology g. fraction x a 0, the of After each period, a fraction a of the players have the opportunity to revise their choice. Very low values of might a correspond to system a in which individual player made their choices for the duration of their effective inflow with lifetimes, of replacement describe a system embodies in a capital corresponding revisions intermediate players; which in costly the choice the good of a that will values to might technology not be an is replaced 12 until it wears out. We suppose that the players who are revising their choice can observe the previous period. entire history assumption, we average payoffs However, of technologies both players do not have payoff suppose of that observations. individual To players in the access to the justify this revise their choices too infrequently to want to keep track of each period's results, and more strongly that the market at this particular 11 "location" is too small for a record-keeping agency to provide this service. The simplest behavior rule we consider is the "unweighted" rule under which all players who revise their choice pick the technology which did best in the preceding period. Under this adjustment rule, the evolution of the system is described by f(l-a)x.+ a x (2) t+1 = with probability p = Prob[u^£ u. ], \ l(l-ct)x with probability (l-p)=Prob[u^ t < u£] , so that = E ( x +ll x ) t t (2') ( 1-a x ) + ap t - Note that this specification is symmetric in its treatment of the adoption and discontinuance decisions, which corresponds to the case where the costs of "transition" are small. reading about the English agricultural revolution, 13 Our as well as studies of more recent innovations cited in Rogers and Shoemaker (p. 115) suggest that amount the of discontinuance an is important factor in the diffusion process. The result following theorem 10 of Norman [1968]. (b) of proposition Proposition of x. 1 2 standard; is it follows from e.g. (It is also a consequence of part below.) The system (2) i.e. the time-average is ergodic, converges to its expectation with respect to its unique invariant measure u. Moreover, E (x) = p, and var (x) = p(l-p)a/(2-a) 4 . A Single Location with Popularity Weighting Proposition 1 says that observing the long-run fraction of players using technology g reveals the fraction of the time that 12 been the better choice. g has If the distribution H of is c the arm that is more often better is also the arm symmetric, with the higher expected payoff. (The same conclusion holds so long as the amount of asymmetry in H is small compared to e.) This suggests that if all other players in the population are whichever choosing technology has highest the current score, each player could gain by considering the relative popularity of the two technologies, as well as their recent payoffs. Intuitively, popularity current the provides some information about the past history of the process, and thus can serve as proxy a for Although it. information this not is complete, the complete history of the process is not needed to identify the results better section this of technology. is One way in some that interpret to cases the popularity weighting is a good enough proxy for the history that the system eventually converges to the correct choice. develop now We weighting with only a a fraction period. single a parametric location. the of though, Now, simple a As model above, population updates popularity of we suppose that choice its each instead of choosing the technology which did best last period, the choice rule is (3) "Choose g if u^ - u^ i m(l - 2x . ) fc Under this rule, choices their H(m(l-2x. -9) ; the probability that those players who revise choose g is Prob[ 0+c. 2 m(l - 2x. ) ] ... x . "1 I t+i [ 1 - when all players use rule (3), the fraction using g evolves according to K*) = (l-a)x. +a with probability c with probability (l-a)x. 13 l-H(m(l-2x. H(m(l-2x t ) -6) r -G) The parameter m indexes the amount of popularity weighting; the case When above. this =0 m corresponds to the unweighted case discussed chooses players case both technologies are equally popular; in = 1/2, x. technology the current payoff for any value of m. with As m grows, the highest players become more willing to choose the currently popular technology even if its current payoff is lower use We specification linear the 14 primarily for analytic convenience. popularity weighting of combines nicely with a It second simplifying assumption that we make in this section, that the distribution H of the per-period shocks [-cr, <j] This . allows us explicitly to behavior of the system for any compute the long-run also ensures that the It m. uniform on is c. linear class of weighting rules we consider includes one rule that leads the asymptotic distribution to ft the difference additive u. is that it, x. , another point to note and any rule that compares "f -u. to transformations multiplicative ones: In a of function the of payoff x. invariant is , function, order to preserve the but not the parameter m must be multiplied by the same constant. to see why this must be the case is to to to same decision rule when the payoff functions are multiplied by a constant way the cr. Beyond the presumed linearity in (3) on 15 optimal choice, namely m = about decision rule concentrate note that A, One the expression (l-2x) is unitless, so the parameter m is measured in the same units as the payoff are. Our assumption that the per-period shocks have a uniform distribution makes it easy to determine the long-run behavior of the system. Since the lowest possible value 14 of c. is -cr, the lowest possible observation of u^ sufficiently large that e -a 2 x^ Likewise, increase. to playing x^.) s x. = (cr-m+e)/2cr -(m/cr)x "lock on" to least when x Thus, . downwards < steps, Prob[8+e, s m] so that the 0, < = system probability the upwards step is uniformly bounded away from zero. x^ > x. fraction implies x > o cr is upwards step is minimized at this probability must be at 0, cannot (Note that a x. certain is the (m-e-<x)/2m, x = is certain to increase. f Because the probability of = x. if if or equivalently if , the fraction using technology g (m-0+cr) /2m, = m(l-2x.) i Hence, is 6 -a. -u. of an Similarly, if the probability of a downwards step is uniformly bounded 1, away from . The above shows that (ignoring knife-edge cases) there are four possibilities for the long-run behavior of the system:' If x" < and 1 enough x upward the <0, jumps system is > xg that x. certain so , eventually make to that from position the system converges with probability x" > 1 initial and x > position. If < x converge (with probability 1) 1 if x. i x ,• for x_ e (x , and to x") xg < x. = . the 1, if x Q £ x to 1 system converges to x the , any initial = 1. If from any system will will converge to , the system will also eventually , converge to a steady state, but it has a positive probability of ending up at each of the two steady states of the system. the remaining case, in which x < and x^ > not converge to either steady state. will continue to with fluctuate, 1, the long-run computed following the statement of proposition The above observations do most establish the following claims: 15 of the system will the fraction Instead, the In 2 x. distribution below. work required to Proposition (a) 2 Popularity weighting m = a "optimal" is in the system converges with probability that from any x sense the to the 1 state where everyone uses the better technology. m > (b) is "overweighting", cr with probability 1 in that the system converges to a steady state, but which steady state is selected may depend on the initial condition x_. More precisely, the system converges to the better technology if while for < |0| to with 1 . If x probability converges to x * (m with probability 1. If |0|< m-cr and 1; (m+cr-0) /2m) , the system but both steady states , positive probability. (c) With "underweighting, " m < i.e. cr, the system need not It does converge converge to a steady state. 1) -cr-0) /2m, the system will eventually converge to one of the steady states have the system converges (m+cr-0)/2m, 2 Q if x e( (m-cr-0) /2m, m-cr, the behavior of the system depends on the m-cr initial condition x_ 2 |e| to the better technology if |0| (with probability but for 2 cr-m, |0| < cr-m, the system has a non-degenerate invariant distribution u, with E x = 1/2 + 0/2 (cr-m), var x = acrE xE (1-x) / and [ (2-a)cr-2 (l-a)m] . Proof: (a) x If m= cr, then x g = (2m-0) /2m is less than = -0/ 2m is greater than zero iff < 0. 1 iff © > 0, and The conclusion now follows from the argument in the text. (b) It suffices to check that if and x g < 1, that -0 > m-cr > > m-cr implies x 16 > > then then x and x g > 1, < while for m-cr > x |e|, and x" > < 1 A similar computation shows that when (c) |0| > cr-m, Appendix B establishes that the must converge. when unique invariant distribution cr-m >|©|, the system system has a and computes the corresponding mean and variance. Proposition shows that the system is certain to converge 2 to the correct choice if the popularity weight m equals a, and that the payoff loss from a wrong choice must be small if m is close to this there is any level. Thus particular it interesting is reason to suppose ask whether to that popularity weights equal or close to a are likely to be used, or conversely whether there players to consider different use game a forces are in in the weights. which players model As that a would partial simultaneously drive response, choose the we their individual popularity weights, and show that the optimal weight m = cr is its unique equilibrium outcome. This result is only a partial response, because it supposes more sophistication in the determination of the popularity weights than we find compelling. However, the result does show that popularity weighting need not conflict with individual incentives. To define the payoffs in this game, we suppose that players have a common prior distribution p over e, and that p assigns positive probability to every neighborhood of 9 = xQ e [0,1], 6 e support (p), and m, let u(0, xQ For each 0. , m) be the long-run distribution on x when all players use weighting m. For each value of B, the payoff to the profile m is defined to be the mean of the long-distribution on payoffs, and the overall payoff is the expectation of this 17 value with respect to the prior beliefs is made Our use of the long-run payoff criterion here p. solely for we would convenience: optimal behavior for discount factors near want allow to for the prefer to consider but then we would 0, weightings popularity to be chosen repeatedly (popularity has no relevance in the first period) and would need to consider the distribution of the state each in period separately. Since the profile in which all players choose m = a results this profile is clearly a Nash in the full-information payoffs, equilibrium game. the of Moreover, it is only the symmetric equilibrium, as shown in the following proposition. Proposition probability, every neighborhood If 3 unique the symmetric has positive = of equilibrium of the game in which players simultaneously choose the weight m they give to popularity is for all players to choose m = a. Remarks: (1) We do not know whether there asymmetric are equilibria as well. The long-run behavior of the system when two or more decision rules are used by a non-negligible proportion We should also of the population seems difficult to determine. point out that symmetric no Nash pure-strategy equilibrium exists in the case of a normal distribution that we consider in appendix A. case popularity weighting to be since in that This should not be too surprising, approximated, but permits the full-information payoff not to be attained exactly. However, profiles where all players use a "large" amount of popularity weighting are e-Nash equilibria. (2) If bounded players away are from certain zero, that there are 18 the absolute equilibria value in of 6 is which m can exceed 8. It is difficult to suppose that players using the (3) kinds of naive learning rules we consider would consciously choose their popularity weights to maximize their long-run payoff. We prefer to interpret the eguilibrium assumption here as the result of a long-run adaptive process, but this raises the question of the relative speeds of adjustment of the process determining m and that reflecting learning about the technologies. Fix any profile where all players use some m Proof: idea of the proof is simply that if m too popularity weighting, little would prefer while if m > deviate to and < then give more <x, * The cr. so everyone is using individual each weight to player popularity, each player would prefer to give popularity use <r, a bit less weight. Since there is a "large number" of players, behavior the of deviates. system is unaffected if |8| conclusion will then neighborhood of 8 = Suppose deviating to m' follow from payoff first ^ If state, assumption that m < a, and = m+dm for some small dm > 0. difference sufficiently strong that m' our is |e| larger; that and the every has positive probability. no effect on his long-run payoff if any player is sufficiently small, has no effect on the player's payoff when the single We will show that there is always a deviation that improves the player's payoff when (a) any the aggregate between x. consider a player This deviation has |e| £ <r-m, the two for in this case technologies is converges to the optimal choice, and m yields the first-best long-run payoff. |e| < cr-m, the system does not converge to but as a non-degenerate ergodic distribution. 19 a steady In this deviating to case, leads the player to use g m' instead of f whenever m (l-2x ' t (l-2x m' f g 3 — n u£ u s ) -0 ) £ c. m(l-2x < fc t ), < m(l-2x )-e. i Similarly, the player will now use m(l-2x.) instead of g whenever f < m' (l-2x.) -8. difference the Since - c. or between payoff in and g is f e, the expected change in the player's per-period payoff is du where . / H dm = 9 (2x-l)dH m(l-2x)-e]d(i(x) is the , uniform distribution, and distribution on that was derived in proposition -9 € [-cr,cr] for all x du. /dm = 9/2cr Since Ex for all 9 (b) dH m(l-2x)-e [0,1], e 1/2 for e > 0, e (-(cr-m),0) u and Ex cr-m) (0, < 1/2 for e < 0, m and m' use an m = true if 9 £ m-cr and x Q £ positive particular, if converging to e probability < that there cr, form any x. while using m' = cr is 2 so that both , the 20 converges (l-2x Q )m+cr, to In 0. probability positive of being played in almost every f converges to the probability that x. 9 < If the system does converge 1/2. £ using weighting m leads to period, 1, If e x" = (m+cr-e) /2m. If 0< 9 < m-o- and x Q < x^, or equivalently to 0, and Suppose first cr. with probability 1 > yield the full-information long-run payoff; the same is m' is > a, so that technology g has the higher payoff. the system converges to there du./dm . consider an individual player deviating to m-cr, Since m(l-2x) 2. (2x-l)du(x) = e/2cr [E 2x-l] > 0, invariant = l/2cr, and so Now consider a profile in which players that 9 > the is u. c. probability 2 m-e = of playing (cr+m+e)/2m, g which for all positive 9. is greater than The preceding two paragraphs show that well as m for all positive 1/2 and 0< 9< min(cr, 9, m' does at least as and does strictly better if x n A symmetric argument shows that m-cr). s m' does at least as well as m for all negative 9, and does strictly better if ©* + x z 1/2 and 0> 9> max(-cr, £ -(m-cr)). Hence if the prior assigns positive probability to the neighborhood of 9 = yields m' 0, strictly a initial position x Q higher expected payoff from any . While our formal results concern the eventual steady state the speed of convergence is of some interest as of the system, consider an initial position where x. is well. In particular, small, so that g corresponds to a "new" technology, that 9 > improvement. 9+c. > so 0, that the new technology and suppose in is fact Then the share of technology g increases whenever m(l-2x. ) , since and probability the this of event increases with 9, so does the expected rate of adoption. 17 a an Such correlation between the extent of improvement and the speed of adoption has been noted in the empirical Mansfield [1968] and Rogers and Shoemaker discussions of but has not, [1971], so far as we know, been addressed in the learning literature. Note also decreases, as a that for increases, becomes less informative. fixed so 9, that each More generally, 18 of convergence period's observation speed the in convergence will be slow if the new technology usually does about as well as the old one, but occasionally does much better. Furthermore, technology usually does slightly worse than the occasionally does much better, higher mean payoff but a (i.e. if the new old if the new one, but technology has a lower median) then naive learning rules 21 that look only at the recent relative performance will be biased towards the wrong This choice. consistent insurance, observation that seat belts, with the and vaccinations have is been slow to diffuse. Heterogeneous Population with Linear Technologies 5. Now we turn to the study of heterogeneous populations, which denoted technologies, payoffs, E(u^-u. representing different before, As individuals. we f equal , ) location a may technologies different have locations and to suppose with g, optimal be that Now though, 9. along line, a different 9's. optimal rule (both socially and privately) positive 9 to use that the It will of technology particular, In as at the for players with is choice 9 in has a f, so cut-off or 0. important be of players that so two difference think we only and players with negative 9 to use g, distribution break-point at 9 = are mean the different for there in in the following that the relative advantage of using technology g at location 9 may be correlated with the advantage" "absolute of location 9, e.g. the productivity of the "land." To capture this, we suppose that the payoffs to the technologies have the following linear form: 'u^(9)=9 + /39 +c lt (5) u£(9)=/39 +e 2t< With this parameterization, /3 > implies that technology g does better at "good" locations, while when /3 < 0, g does better at bad ones. In this model, the player's location in 8-space determines 22 his average payoff to the two technologies. variables payoff-relevant the with correlated the observed as We want to think of being unobservable locations. The idea but that is players do not know exactly which aspects of their locations are payoff -relevant, or how these aspects influence their For we reason, this allow the players not do to payoffs. regress the observed payoffs of each technology on the corresponding values Instead, we suppose that players base their decisions on of 9. the average performance of the two technologies at locations in their "observation windows," where the observation window of the player at is the interval [0-w, e+w] We call w the "window . width." We have two interpretations in mind for this model. the parameter location with location, the may 6 performance correspond of to geographical technologies the First, linked to variables such as climate or terrain that are in turn correlated with model the Second, location. describe may adoption decisions at a single village, where players are differentiated by idiosyncratic payoff-relevant characteristics such as wealth and household size. In studying geographic diffusion, for example of an observation window might reflect the farmer only observing the outputs of his neighbors, and the agricultural technology, the window width w might be fairly small. In studying adoption at a single site, the observation window corresponds to the players' beliefs about which other players are sufficiently similar for their experiences to be relevant, and players might well observe the actions window. To and the outcome of extent that others who the 23 relevant are outside of their characteristics are the window widths in this interpretation difficult, to determine, fairly might be large. In both interpretations, players might prefer to weight observations of their immediate neighbors more heavily than those of players who are farther away, within observation the window; attractive when the observation window study of homogeneous population, a may this is we particularly be As large. begin by but still the in analyzing the simple rule where players use whichever technology did better in their window period; last later allow for popularity weighting. will we enrich the model to To define this rule formally, suppose that the distribution of players over locations has a constant density, which we normalize to equal 1, and let u^(8) be the average score realized by those players in the interval who used g at period t, [S-w, 0+w] u?(0) = u. (9) is defined analogously. convention that if every player in the interval used f; -co The with the (unweighted) the average decision rule is for the player at © is then (6) "Play g at period t+1 iff u^(9) -u£(e) In the continuum previous of players sections and s considered we inertia, so " 0. that a the model with fraction a of players using each strategy can never shrink all the way to zero in finite time. suppose that all period. In our study of spatial models, though, we will that there players is at We do so in no each inertia location at individual revise their locations, choices part for reasons of convenience, so each and in part because in rural areas with low population density it seems plausible that a technology could be abandoned by everyone in an observation window after a few bad draws in a row. 24 As a first step in analyzing the decision rule that the noise terms the c . , are identically zero, and c_. Suppose system is deterministic. (6) further that the state of the system is described by a cut-off rule. suppose that there is a suppose so that initial That is, such that all players with 9 9, choose g and all those with 9 < choose 6. f. Then the period t+1 state will be described by a cut-off rule as well. note this, that players players at 9 < 9.- w play see both at hence will play g and played, all 9 in w 9 .+ > the To see only see next period, Players at every 9 f. e s e being g while all [0.-w, 9.+w] and g being played, with f +w r L (0+l)sds/(e+w-0) -g u^(0) = = (/3+1) (9+w+e)/2, and (7) Q u£(0) = Thus for u^(0") 9. £sds/(e+w-e)=/3(e-w+e)/2. | -w < - u£(0") 9' < 9" < © +w, = u^(0') so that if the player at 9' - u£(0' we have ) + (0"-0')/2, plays g in period t+1 then so does Hence the state at period t+1 is described by the player at 9". a t cut-off rule. A steady state cut-off rule must have the property that the player at the steady- state cut-off is and g given his the observations. Thus indifferent between steady state unique solution of u^(0*) = u£(9*), that (0+1) (0*+ w/2) = 0(8*- w/2), and so (8) 6* = -(20+l)w/2. 25 is f the Note that although the optimal cut-off value of When /3 = |3 the payoff to for example, 0, f if /3 discrepancy The well. any = -1/2. identically zero, is Hence when the cut-off is 9. players to for using is g which will tempt players to the left of positive, as payoff average the 0, = 9 the steady state cut-off is only at , while the payoff to g is equal to at is between steady the strictly to adopt g state and the optimum arises from our assumption that players do not directly and hence use only the average payoffs received by observe 9, the technologies two making their decisions. Note in that the maximum steady-state payoff loss at any location is the absolute value of 9 which is small if , /3 is not too (in absolute large value) and the window width w is small. Having determined the steady- state cut-off, we next examine the behavior of the system away from the steady state. It is easy to show that, from an initial cut-off 9 Q will move towards the steady state 9 within w/2 period until it interval system typically the * about 9 For . is ease of of * 9 enters reference, . , the cut-off at a distance of w each Once 9. stable a is 2 -period summarize we within this this cycle as a proposition. Proposition by (6) and From an initial cut-off e Q 4 e = -© t 9. Proof: the system determined evolves according to (7) 0. (9) , +w +26 -w —a * If u^(S.-w) 9.<9 -w/2. * © e[9 -w/2, 6 +w/2) t e.se +w/2. —f * - u. (0.-w)> 0, then all players who observe both technologies being played -i.e. all players in the interval 26 [8. eguation /3(8 we (7), that see A vs or -w), period this the is case .£ Finally, t+1. Similarly, technologies both see if A = 8 + w/2. -/3w ^ 8. who players Substituting 8 = 8 -w into use g in period t+1. - -w,8. +w] if 8 e being [8 -w/2, if (/3 6 -w/2, < e. played choose 8 +w/2), © fc s atisfy e O+l) = t+1 -e (e t+1 +0 -(2/s+i)w = -e t = ^(© © t+1 + t + w) t - w) + l)8. ^ JL so , all f in will t+1 that * t + 28 .» Next we consider the behavior of the model with noise, i.e. non-degenerate i.i.d. random variables. Let z. denote the difference in the two shocks, and let 8. with c.. and c_. = e_. 1t = 8 - e +z. ; et i- s *-he steady state of the system when e_ -e. is Because behavior rule (6) identically equal to for all r. z. depends only on the difference between the payoffs to f and g, and not on their levels, when the shock is z, with the term 8 Proposition shock is z. * 8 = -8. +W 8. . 8. , and the period-t 8 <8 -W/2. +28. t A 8 e[e.-w/2,8 +w/2) A JL 8 £8.+w/2. -w For locations 8 Proof: (9) the period t+1 cut-off is given by , t+1 0. * If the period-t cut-off is 5: system from is the same as that given in equation replaced everywhere by 8. (10) the evolution of the € [8.-w,8.+w], the difference between the average payoffs of the two technologies in 8's observation window (the interval [9.-w, B.+w] is [8 +(2/3+l)w]/2 + 8 fc * •* w/2 implies S+e. that all * players 28. who -z = t (8 for all 8 ) i.e. , +8. £ observe -28.) /2. ^ a 8 -w, e both 27 u£(8,e .) > e u.(8,c Since 6 t > * t - t + w/ 2 technologies .), 6. + implies choose g. Similarly, 9, < + 8. technologies choose w/2 implies that all players who see both Finally, f. period-t cut-off is given by Proposition When 6 e the t+1 * if 8^ = -e are z. e t [6 +28 * -w/2 ,8 ,+w/2) , the . draws i.i.d. from a distribution that has a strictly positive density on a compact support, the distribution invariant generated process dynamic and F, by the has (10) expected a probability distribution at date t converges to F uniformly over probability distributions Appendix Proof: contraction sense the in initial p.. shows C unigue that system the Norman of [1972] is and random a satisfies unigueness condition 2.11 of Futia [1982]. We have not been able Instead, directly. to characterize this distribution we have computed an invariant distribution of the simpler system generated by r (ID e t+1 = A A JL +W0 t~ e t t ®t" w V Note that system (11) e t differs from (10) only when 8. falls in an interval of width w. Normally we will think of the variance of z. as being much larger than the window width; in this case it may be reasonable to guess that the invariant distributions of (10) and (11) are close together. We should point out that the simplified system (11) (10), unlike does not have a unigue invariant distribution: Because all steps have size w, from initial position 8 is , concentrated on the grid 8 ^ + o kw, , the support of (11) and so different conditions lead to different invariant distributions. 28 initial Moreover, the supports for of the date-t distribution are different for t even and between for t systems and (10) qualitative these Despite odd. absolute magnitude the (11), differences the of effect of the initial condition is small when w is small, which supports the conjecture that the two systems are similar. Table 1 below provides further support for this belief by comparing Monte Carlo estimates of the steady-state variance of the variance that is of (10) with the particular invariant distribution of (11) computed proposition in conjectured, As 7. two the variances are close when w is small. To examine that the noise terms distributions invariant the of (11) are are i.i.d with mean z. suppose , and c.d.f. H. A Then 9. A follows a Markov process with the transition from A ©. A JL + w having probability Prob[0 +z. ^ 0. A = ] 1 9. to JL. - H(©. -9 ) The . invariant distribution has a particularly simple form when the z. are uniform on [-cr, cr] and the parameters are such that there is an invariant distribution whose support containing the points 9 Proposition M = is cr/w is 7: * is Suppose the limit of z. <r are uniform on [-cr, cr] 9 +kw) the [ ( (2M! time-average initial condition belongs to the grid 9 ) / (M-k) ! (M+k) distribution ! ] 2 ; when the * ± kw, k ^ M. Remark: Recall that the mean of this distribution variance is crw/2, and that , Then one invariant distribution of (11) * -2M = = an integer. the symmetric grid * and 9 + -cr the binomial Prob(0 this is a and that the distribution is * is 9 , its asymptotically normal as w tends to. zero. Proof : To show that f is an invariant 29 distribution, it is verify sufficient to condition" that that for all probability flow from 9 to reverse direction. and 6 6' 6' "detailed the meets it the , balance (unconditional) equals the probability flow in the Thus, we will verify that = e|e = 9'), Prob(e t+1 = e'|e t = e)= f(9') Prob(e t+1 t or equivalently that f(e) f(9)/f(9') = Prob(9 t+1 =9| 9 fc = 9')/ Prob(9 t+1 = 9' 9 = 9). t | Since the probability of a jump of more than w is zero, it suffices to check this conditions between adjacent states, so * * take 6 = B +kw and 2" 2M |"(2M!)/(M+k!) (M-k) = ) 2 for integer k between some For such states, we have -M/w and (M-l) /w. f(9)/f(9' = 6 +(k+l)w 6' -2M (2M! | ) / (M+k+1) ! (M-k-1) = ll (M+k+1) ! and Prob(9 t+1 =9 1 e t = e')/ Prob(9 t+1 = 9' | 9 = 9) fc |(o-+(k+l)w)/2cr|/(o-kw)/2o- = (M+k+1) w/ (M-k) w, so detailed balance holds. w/o- (10) (11) .5 .21 .25 .1 .048 .05 .05 .024 .025 .001 .00496 .005 TABLE 1: STEADY STATE VARIANCE FOR UNIFORM NOISE 30 = / (M-k) , As one would expect, and that the expected welfare loss the cut-off is t u d© = j Hence, (compared to © = 0) when welfare loss is ©. "2 ©J/2, the in long run the average per-period (using the invariant distribution computed in proposition 1/2 that E(0 2 = ) is Note that the social optimum is the constant © = each period. 0, state because small w corresponds to small steps in decreasing in w, r the variance of the steady steady (E(©)) 1/2 2 + state welfare 1/2 is var(0) = [(2/3+l) decreasing in 2 w. /8 7) is +cr/2]w, so small w, For despite the lack of either memory or popularity weighting, the spatial nature of the process allows the long-run outcome to be . . approximately efficient. 19 While small w's are thus desirable from the viewpoint of the time-average payoff, they entail a significant short-run welfare loss when the initial state is far from the optimum, because in this case the system will take approach the neighborhood of the optimum. reasons: Second, First, in ©. is limited to move a long to This is true for two at most w per period. the presence of noise a typical path is take far more than © /w periods to reach time a likely to neighborhood of © * , because many steps will be in the wrong direction. For a fixed initial condition and social discount factor, the socially optimal window width will trade off the speed of convergence and the steady state variance, with larger w's being optimal the farther the initial condition is from 0. If the social planner does not know the initial condition and/or the location of the social optimum, the size of the optimal w will 31 depend on the planner's prior beliefs. This tradeoff between speed of adjustment and the variance of the steady state seems a natural feature of the sorts of model we consider 20 this At would point we to like make few a observations about how the conclusions might change if the players did keep records of their past observations. * within of <r will play both technologies 9 eventually could they Since players at locations which learn themselves by keeping such records. infinitely technology often, better is for However, a few calculations suggest that this learning process will be fairly slow if the random shock to the payoffs has a sizable common component and w is small. To see this, suppose that the payoffs to each technology are subject to a common shock as well as the tj. shocks we assumed before, so that system idiosyncratic is replaced by (5) V <u^(e)=e + 0e +e + it (5') [u£(8)=00 +C + 7J t . variance the If 2t of is 77 relatively observations of informative, and only observations of both technologies only technology one at date then large, not are t very in the same period will be of much help. Players at locations more than three standard interval 9 * ± 3 deviations (aw/2) 1/2 from rarely - 9 see -that both is, • technologies and hence would need a very long memory to locations 9 closer to 9 often, but * outside learn. many observations to be the played, Players at do see both technologies played more for these players the systematic payoff between the technologies of is smaller, and hence fairly confident 32 one difference it may is require better. Our approximations, informal this is indeed the case, periods required be to reported and better is of the order of a long history would be 2 w / suggest E, that particular that the number of in fairly appendix in confident 2 , which technology is so that when w is small a very required for players to do much better than with our simple rule. Of course, players could use history even advantage the when doing to so is slight slow or to but in these cases it seems less obvious that players develop, would be led to abandon simple rules. Examples of non-linear Technologies 6. Before considering the implications of popularity weighting in a we would heterogeneous population, like to discuss some examples of what can happen without popularity weighting when the payoffs form as a presumed in function of location do not take the equation . (5) Suppose for example linear that "old" technology f has returns that are identically zero, g(0) the while = cos (9), so that regions where g is optimal alternate with (See ^.(^uv^c.^-^ regions where f is.^If there is no noise in the system, and the window width is relatively small, locations © e [-tt/2, technology will optimal. In not this tt/2] then even if all players in adopt the new technology g, spread to the other example there are regions where substantial gentleman it social from having the new technology "tested" at a number locations. the new is gains of diverse (It is for this reason that we would argue that the farmers may have played an important role in the spread of the new husbandry, even though they were not among the first to adopt it.) It may also be interesting 33 to note that when the local g(6)=cos e m ^ o Figure 1 process may fail to spread as widely as it should, random shocks to payoffs increase social welfare, can that increase as the variance of the noise term increases z. = Suppose that the technologies are f(0) zero. welfare can is, from and g(0) = cos(0), and that the initial state has all players to the right using g and players to the left using of the cutoff will move to e 1.) When the support of * = 37T/2 and stay there. sufficiently is z. Without noise, f. (See figure there will large, eventually be enough consecutive draws of very negative the cutoff reaches From this point, tt/2. system may no the longer have a single cutoff, as players to the left of tend to switch to that z. will tt/2 while those to the right switch back to g, f. Essentially, the noise leads the players in region II to use the new technology long enough that it can spread from region ,1 to region III. The next example shows that in certain extreme cases the specification error involved in ignoring how payoffs vary with "location" can allow a technology that is everywhere inferior to This is the case depicted in completely drive out a better one. figure below, 2 which in = f(0) and g(0) = 0-e current cut-off is -a u^(0) = computes + (0-(0-w) players Hence B. ) .= then the player at 0, 0-e observe t - w, both - f u (0) = w technologies - € if e, choose A. e+w] -w, [0 -f u (0) and +(0-(0-w) ) /2, Since u g (0) /2. who at the If . A A. = w > -w all e technology g. and eventually g will take over the entire population. We special: if the should point out these that technologies are quite an inferior technology can only drive out a better one difference in payoffs |f-g| 34 is small compared to the Figure 2 errors caused by estimating the payoffs by their average values These errors are of the magnitude of w df/de the window. in and w dg/de, which bound the difference between the payoffs at Thus, and e+w. (0-w) the difference in payoffs if w is small, |f-g| must be small as well in order for the inferior technology to dominate, everywhere, adopted substantial. location is the payoff and c must be wrong technology the loss each at the example above, (In c, though even hence and the payoff than w less location loss order in not is at for is each g to dominate. small For window widths, substantial more a payoff loss arises when the new technology is not adopted in a region where it is a substantial improvement. example where g = cos(0) and should be adopted example of figure are 2 f = was This the case in so that the regions where g 0, We disconnected. can so that g is better than f also modify the at every location (and so in particular is better on a connected set) and yet a substantial payoff loss results from g failing to spread. the payoffs to the In and g are such that g is much better figure 3 than f in the neighborhood of 9 = 0, but is only slightly better than f , f for extreme 6 values. Hence, if technology g is first introduced at these extreme values, it will be driven out of the population before it can be tried in the center region. 7. Heterogeneous populations and popularity weighting Our analysis of social learning in homogeneous populations showed that popularity weighting could improve the aggregate performance of the learning process, and that the optimal level of popularity weighting consistent is 35 with individual Figure 3 We incentives. now will investigate popularity weighting in our model of the implications of heterogeneous population a with linear technologies. To model popularity weighting, let x.(8) be the fraction of players in the interval [6-w, e+w] who use technology g. spirit of the popularity weighting rule (3), the decision rule used in sections (6) 5 and In the we now modify the 6 and suppose that players use the decision rule "Play g at period t+1 iff u^(0) -uj(8) (12) where, indexes m parameter the before, as m(l-2x. * the (6) importance , ) of popularity in the players' decisions. Since the analysis of this system is quite close to that of the without system results without proof. t weighting, popularity As in section 5, will give the if the state in period so will corresponds to a cut-off rule, we the state period in In addition, without noise terms the system has the same, t+1. = steady-state cut-off 9 unique, However, - (2/3+1) w/2. the introduction of popularity weighting does change the dynamics in two ways. First, the in absence of noise terms, system the converges to the steady-state cut-off from any initial cut-off; the oscillations Second, become described (and relatedly) more common, as proposition in 4 do not arise. movements of less than one window width players are more hesitant to a less popular technology. The following proposition gives a more precise description of the dynamics. Proposition 8 From an initial cut-off 9 by decision rule (12) and payoffs 36 (5) , the system described evolves according to (13) : +w ,6 e if © <© t t = e + t ' t+i 2m-w ) 2m+w / ) e "e t t if 8 6[e -(m+w/2), t t ) if e.> G.+ (m+w/2) © ~w t • Omitted. Proof: and ( e*+(m+w/2)] . The calculations involved are straightforward, similar quite (m+w/2) those to proposition of Note 4. dynamics above reduce to those of proposition that when m = 4 the 0, as they should do. in the absence of noise, To see that, to 6 ... * note that the cut-off moves a from any initial cutoff, full window width so long as \S.-e A |©.-6 [ the system converges > | m + w/2. Eventually then A * (2m-w) m ^ j| / (2m+w) (8. -0 ] then from and w/2, + on -6 0. so that the system converges to 9 ), it .at = a geometric rate. Note also that for a given 8. , the system will move less than a full window width whenever the realization of is in an e. This show that popularity weighting makes interval of 2m + w. the system more "sluggish," and suggests that it will reduce the variance the of long-run determine intuition, and weighting reduces the distribution. extent the variance, we verify To to which characterize this popularity the long-run distribution in one special case. Proposition 9 (a) If the z. are i.i.d. draws from a distribution that has a strictly positive density on a compact support, the dynamic process defined by (5) and (12) has a unique invariant distribution, (b) If the z. are i.i.d. draws from the uniform distribution on 37 and m [-a, a] the invariant distribution 2a, £ on the interval [0 -a- w/2, E f = e (0) +<t f is concentrated +w/2] and satisfies , and var, Proof: (0) = a 2 w/6m. (a) Omitted; proposition argument the that for Appendix D shows that there is a deterministic, finite time T for which the cut-off * to 6. (b) -w/2, very close is is in the interval [0 - cr . and that once this interval is reached, © T + a +w/2], remains in the interval for all subsequent periods T+s. Given a T satisfying these claims, we have * |0 -0 * * | + 1 * -S T+ | (c + w/2) < + cr , 1 T , t , _e m, \ < which is less than m + w/2 from our assumption that m > 2a. Hence we the evolution of from T on is determined by the second case in proposition 8. Writing c = (2m-w) / (2m+w) and applying this rule repeatedly, we , find that T+s - <* -c)' ) / T=0 c r 0* T+s-x Hence, E(0 m+s |0 T ) = (l-c)^c T E(0*) + c s (l-c)0 m and var(0 m+s |0 T ) = (1-c) 2 2T ^ c var(0*) = 2 2 (l-c)cr /3 (l+c)= 2wcr /12m = 38 2 cr w/6m. -» E(0*), Comparing the steady state distributions for m = that we 0, see long-run variance by a for m that popularity weighting > 2<r with reduces the factor of l/3m. The welfare consequences of increasing m for fixed w are similar to those of decreasing w for fixed m: in both cases, the speed distribution state steady at which becomes more system converges the efficient, decreases. while It may the be interesting to note, however, that in this simple model there is one way to convergence without the change (when the altering the parameters speed to initial cutoff is up the rate of from the optimum) far steady-state variance, namely increasing 21 the window width w while holding the ratio of w/m fixed. 8 . Concluding Remarks The various models we have presented suggest that even very naive learning rules can lead to quite efficient long-run social least at states, if the environment is not too highly nonlinear. Moreover, popularity weighting can contribute to this long-run efficiency, and the use of popularity-weighting passes a crude first-cut incentives. test course, Of with consistency of there are many individual other plausible specifications of behavior rules for social learning, interesting to speculate about robustness the so it is of our conclusions. In this light, we would like to report simulation results for one simple modification of popularity weighting that seems to improve the short-run performance changing its long-run behavior. of. the system For this purpose, the homogeneous-population model of section 39 4 , without we return to and now suppose that players give weight to trends in the relative popularity of the two technologies as well as to the popularity itself. More precisely, suppose that players now choose technology g realized the iff difference expression m(l-2x.) -c(x.-x t in u^ exceeds -u. where x.-x._ , ) payoffs is the the trend in popularity. Since the trend variable converges to zero along any path where the system converges to a steady state, the system still converges to the better technology with probability when m However, =cr. optimum, as in the if case the initial when state is far superior technology a l when from the is first introduced, one would expect that responsiveness to trends would help the increase to speed with which the new technology is adopted. To test this intuition, the noise term weighting m = period was second, a c uniformly distributed on In the first, cr. [-cr, cr] and B = .1 .02cr. both, with and popularity the fraction a who adjust each and the mean payoff difference 9 was .5, = we ran two simulations, In both cases, .lcr; in the we counted the number of periods required for the system to move from initial state x n = .05cr to x = .99a. The results, reported in Table 2, show that at least one obvious modification of our assumptions reinforces the impression that naive rules can perform fairly well. a=. 5, e= .1 a=.l, 8=.02 c = 39 940 c = 5 26 710 c = 10 28 Table 2 : 470 Trend Weiaht ina and the Speed 40 22 There are number a of other considered but seem important. extensions structure in we more complex have considered. equilibrium choices networks of not Also, players might be than Finally, the our either that rules of thumb are exogenous, or, are have we Players might use rules of thumb which make some use of historical data. arranged that a static simple results linear suppose in Proposition 3, game. It would be interesting to complement these results with an analysis of dynamic process by which players adjust their rules of a thumb along with their choice of technology. Finally, we should point out that popularity weighting is not always as beneficial as our results might suggest. Consider the problem of children is a poor neighborhood choosing whether to pursue higher education. If students who have done so past tend to move out of the neighborhood, in, the and past residents are underrepresented in the observation windows, then the choice of higher education will appear less popular than it really is, and decisions , choice. . based on popularity may be 23 41 biased against this Appendix Optimal A: Weighting Popularity with Other Distributions To better understand the forces generating proposition 2a- single choice of popularity weight yields the optimal long-run distribution uniformly over all values of e- we show that analogous results obtain when the per-period noise term e. has distribution F with unbounded support. that a Suppose first that a = adjusts every period, values and [Prob(x.)= and hence the state If we let 1. Prob(x 0, t) transition matrix is A = = F(m-e) l-F(m-e) F(-m-e) l-F(-m-e) Since this that the so 1, matrix x. denote the vector s. we have 1], is entire population takes on only the strictly positive, ergodic; the unique invariant distribution u where the =s.A, s. * system the is . . is given by M*(x=0) = F(-m-e) /[F(-m-0)+ l-F(m-e)]. If is F the standard normal then distribution, increases, the ratio F(-m-S)/ 1- F(m-0) converges to and converges to » if 6 < distribution of correct choice. the 0. Hence for system places if near and to oo > 0, 1 on the Moreover, the same is true for any distribution for which the the ratio F(-m-0)/ 1- F(m-S) converges to 0, m the ergodic large m, probability as if e < 0. if 6 > (This is what is meant by saying that the tails of the distribution are "infinitely revealing.") With a more involved argument, we have shown that the same conclusion holds for any a e (0,1) when players use the popularity weighting "if x. -m; if x. <l/2, choose g iff u^-u. (discontinuous) £ u?-u. £ £ choose g iff -m." The details 1/2, available on request, here is the intuition for the result. Note first that when m = » the system is deterministic with are. stable steady states at and 1. If m is finite but very large compared to a and to the distribution, then steps the "wrong way" decreasing steps when x. > are rare (i.e. 1/2 "innovations", and when the distribution is symmetric, transits ) 42 from to and 1/2 innovations. to 1 take both 1/2 same number the tails of the distribution are If then revealing, from m as -» innovations towards likely than a> technology become infinitely more towards the inferior one, and the analysis suggests that the limit Wentzell [1984] of infinitely the better of innovations Freidlin and of the ergodic distributions will be oncentrated on the better technology. To establish this formally, we partitionthe interval into a large (appropriately chosen) number of small subintervals, and approximate the original system by two finite-state Markov whose ergodic distributions will serve as bounds on the ergodic distribution of the original system. We then use the discrete-time, finite-state translation of Freidlin and processes, results Wentzell's Mailath (Kandori, Rob and to confirm the intuition above, [1992], Young the limits of the distributions of ergodic thef inite-state process are concentrated on the subinterval corresponding to the better [1992]) i.e. The above suggests that infinitely-revealing tails are sufficient for there to be a single popularity rule that is choice. approximately optimal for all 6. Moreover, this rule has the nice feature that it need not be tailored to the exact form of Even when the tails are not infinitely the distribution. revealing, however, there is another popularity rule that seems to perform very well, namely "choose g iff u^ With this rule, -1 = (l-a)x + a Prob[0 + c * F E(x |x ^)] t t+1 t fc u. s F (1-x.)." ) = x so that E(x. to _1 + a[F(e+F (x ))-x t t t ], |x.) > x. if and only if 9 > drift towards the correct choice. 0; the system tends Although the system may converge to the wrong technology with positive probability, the simulation results reported in table 2 for the logistic and non-revealing tails) suggest that when a is small the system is very likely to converge to the right choice. Intuitively, when a is small, the system evolves through a series of small steps that allow the We conjecture that there drift to outweigh the random forces. these lines. a may be general result Laplace distributions (which both 43 have Appendix B : Proof of Proposition 2c If cr-m > \9\ , then neither nor 1 is an absorbing state. Our first step is to show that there is a unique invariant distribution. To do so, we first note that the stochastic system is a random contraction in the sense of Norman (4) A [1972]. is a stochastic system in which the random contraction realization of an i.i.d. auxiliary variable (call it u) is used e V to determine which of a family of mappings is used to is a contraction "on average." In send x. to x. ., and each ip <p context, our corresponds cj payoffs, and there are only two maps <p_ (x. ) = (l-a)x. realized the to ip ^(x. : difference = (l-a)x. ) both of which are contractions, , +ct, so that in and is (4) indeed a random contraction. Norman's results then imply that the Markov operator associated with system (4) is quasi-compact. We next note that when the system <cr-m, \9\ satisfies the (4) uniqueness criterion 2.11 of Futia [1982]: for any neighborhood U of the point x = 1/2, and any point x' in [0,1], there is an n such that the probability the system starting at x' in is in U exactly n periods later is strictly positive. ( If m £ <r, the uniqueness condition fails, both as x = and x = 1 are absorbing. The last step is to compute the mean and variance of the invariant distribution E u = (l-a)E^(x) (x) u. , (x. .) we have , + ajp(x)du(x), where p(x) = (cr-m+e)/2<T + which is the m(l-2x. ) Using E (x.) = E (m/cr)x. is the probability that 8+e. probability that x. = . £ (l-a)x.+a. Simple algebra then shows that E x = 1/2 + 8/2 (cr-m). To compute the variance, we first write the identity E (x 2 ) 2 E (x )|(l-a) a = 2 [(l-p(x)) [(l-a)x] +2a(l-a)m/crl + E 2 2 = +p(x) [(l-a)x+a)] Jdu(x) (x) J2a (1-a) (cr-m+0) /2a +a 2 m/<r| (a-m+S) /2a; solving for E (x 2 ) 2 and computing var(x) = E (x - ) -(E (x)) 2 the desired result. APPENDIX C: + PROOF OF PROPOSITION To begin we rewrite (10) 6 in the equivalent form 44 (10') gives min[0 +w,2(0 +z )-0.)]0 + z s© (10') t+1 * *• * " " max[0.-w,2(0 +z )-6 )]8 +z <0 t t To show that the system "random dynamical system" as we note that the auxiliary events (10) is a described by Futia [1982], The probability distribution Q on the z's does not are the z depend on the current state, and so in particular is continuous = <p(0. in the state, and the map p(0,z) defined by 8. z. is . . , easily seen to be continuous in 9 for fixed so that z, ) (10) is as in indeed a random dynamical system. we Next that check is contraction, random a Because the map Q is constant in 0, the of the definition can be taken to equal Futia 's definition 6.2. constant M in part it (a) Next we must show that for all z, and all 0*0', and 0' there is d(<p(0,z), (p(0',z)) ^ d(0,0'), and that for all d(<p(9,z), <p(8',z)) < a positive probability of z such that 0. d(0,0') . To show that d(<p(9,z), tp(9',z)) and all 0' and all same direction (e.g. i i either z, (<p (0, z) -0) * both (a) (<p (0' d(0,0'), we note that for , z) -0' Case a has three <p(6',z)-8'. ) > and 0' 0) or subcases: move in the <p(9,z)-8 (b) either (1) <p moves both locations by w, so that d(<p(0,z), <p(8',z)) = d(0,0'); * +z. moves less than w, and the or (2) the location closer to < <p(9',z)) state farther away moves w, so that d(p(0,z), d(0,0'), or both locations move by less than w, (3) case the two locations are reflected about the point which in * +z., and d(p(0,z), <p(0',z)) = d(0,0'). In implies case (b) * that . ^ that < 0'; 0') = then case b ...... 0', and so d(0, d(0, * + +z) Using the triangle inequality, this implies that d(0 +z, 0'). d(<p(9,z) d(<p(9,z), +z s. * (CI) suppose w.l.o.g. , 0*+z) <p(9' ,z)) - d(0,0') + d(0 +z, ip(0',z)) , i - d(0, +z) - d(0 +z, 0' 0*+z)- d(0, 0*+z)] +[d(0*+z, <p(0',z))- d(0 +z,0')], and inspection of (10') shows that each of the terms in square brackets is non-positive. Thus d(<p(0,z), <p(0',z)) ^ d(0,0') = [d(<p(0,z) for all To z, , 0, show and 0'. that for and all 45 0' there is positive probability that d(tp(8,z), tp(8',z)) d(8 ,8' < * suppose first that 8 ) let 8 , Then 8' < and , sufficiently small c >0 there is a positive probability that z lies in any * sufficiently small neighborhood of 8 - 8 + e -w/2, and for z's in this neighborhood, moves less than w to the left, while 8' moves w, so that d(</>(8,z), <p(8',z)) < d(8,8'). If 6 - 6 s - <r + w/2 * 8' but a 8 > - -w/2, a + w/2. - for argument similar establishes the existence of a range of z's such that both 8 and 8' move to the right, with 8' moving less than 8. Finally, + -8 < a ife-e<-o- w/2 8' and -8 for d (8,8') * £ z's -w/2, then 8' cr in a and &{<p(6 ,z) neighborhood of e+w/2. Thus -8 > w, , ip(8',z)) (10') < is a random contraction. The step last proof the in uniqueness verify to is that (10') which requires that there be a point 8 such that for any neighborhood U of 8 and any 8, there is an n such that when the system begins at 8, it has a positive probability of being in U in period n. It is satisfies Futia's h easy to see that e.g. 8 Z = 8 * condition 2.11, satisfies this condition. APPENDIX D Proof of Proposition 9 part b. To complete the proof, we must show that there exists a deterministic, finite time T such that (i) |e -8] < a +w/2, and that |0m, c -8\ < 0"+ w/2 for all subsequent dates T+s. (ii) -8 |. Define d. = 8 Note that since (8. -8.) and (8.-8.) ~ ^ cr, have the same sign, and \8. - 8 and (8 -8. + 1 e +-) have the same sign whenever d. > a + w/2. Hence, , . | ) \ (Dl) d t+1 = |d t (i) initial condition there exists a finite of the sample path, reached, |8 either d note that d. t+1 ~d -8 t +- (|8 t+1 - 8 t |)| whenever d t > a + w/2. - As a first step towards proving To see this, ( | t+1 = e l T' we show that for any such that regardless a + w/2 or d ~e • proposition 8, s T/ implies that until (Dl) = < , T , t+i min{ w, -e tl' and from [2w/(2m+w)](8 t min{w, \& T such -8*)} > +1 a T '\ T' is * [2w/ (2m+w) ]w/2} until the conditions defining T' are satisfied, the decrease in d. is bounded below by a positive constant which is independent of the sample path. Thus, 46 If d < , it The remaining case case, (Dl) setting T = + w/2, a+ w/2 is implies that d than w - (a+ w/2) = w/2 to complete the proof of To prove claim -9 (ii) T - = , d < |© a < w/2 completes the proof of T" , s i T +1 - +cr. e T'+l ~ In * T' I (i) th ^ s which is less Hence we can set T = T'+l - 6 , | -d t , (i) , note that when |e_ -6 < \ a+ w/2, we m +w/2 from which is less than From proposition 8 we then have the assumption that m > 2a. = (2m-w)/(2m+w) (e - e*) + and since both 9* and © © T T T+1 T lie in the interval [G - a - w/2, 9 +a + w/2], so does 9 The claim now follows from induction on s. have |© T T £ | (a + w/2) [ + a, , ] . Appendix E: This appendix gives a rough computation of how many periods would be needed on average for a player using history to be reasonably sure he knew which technology is better. If we let a = var(e . + var(c_.) ) the variance of u^ ~ u t about (<r /9) observations denote / then the player at 9 will need of both technologies to be fairly confident (about 85%- i.e. 15% chance of a false rejection) that he knew which technology was better. * 1/2 + {cm/ 2) this requires on the order of 2a 2 /crw For c* * 1/2 < 9 * 9 + observations of both technologies. For 9 + (crw/ 2) 1/2 2 3(crw/2) at least 2a /9crw observations of both technologies are required. Our next step is to approximate the frequency with which these observations arrive. To do so, we approximate the steady-state distribution by a normal distribution, and then note that the density of the normal is less than -25/cr at points more than 1 standard deviation away from the mean, and that the density of the normal at the mean is less than .5/cr. Finally, we recall that only players within w of the current period's Thus we conclude that cutoff see both technologies being used. * for players within one standard deviation of 9 the required = <r/w periods; observations arrive on average every [2w-.5/cr] 2 2 2 since these players need about 2a /crw observations, 2cr /w periods are required. For players between 1 and 3 standard deviations away, the required observations arrive about every 2 /9crw observation, so that 2cr/w periods; these players need 2cr 9*6 . . , , . about 2 4<t /9w 2. C periods are required. 47 48 REFERENCES Apodaca, A. [1952] "Corn and Custom: Introduction of Hybrid Corn to Spanish American Farmers in New Mexico," ed. Human Problems in Technological in E.H. Spicer, Russel Sage Foundation, New York. Change, , Banerjee, A. [1991a] "The Economics of Rumours," forthcoming in the Review of Economic Studies Banerjee, A. [1991b] "A Simple Model of Herd Behavior," forthcoming in the Quarterly Journal of Economics. , Bikchandari, S., D. Hirshleif fer, and I. Welch [1991] "A Theory of Fads, Fashion, Custom, and Cultural Change as Informational Cascades," forthcoming in the Journal of Political Economy. Bush, R.R. and F. Mosteller [1955] Stochastic Models for Learning. Wiley, New York. Chambers, J.D. and G.D. Mingay [1966] The Agricultural Revolution 1750-1880, Schocker, New York. Conlisk, J. [1980] "Costly Optimizers Economic Imitators," Journal of Organization. 1:275-293. Cross, J.G. A Theory of University Press [1983] Cambridge Adaptive "Learning, Ellison, G. [1991] and Coordination," mimeo. English R.E.P. [1912] Longmans Green and Co., London. Ernie, versus Cheap Behavior Local Farming Economic and Behavior. Interaction, Past Present and "Invariant Distributions and the C. [1982] Behavior of Markovian Economic Models, Econometrica . Limiting Futia, 50, 377-408 Kandori, M. G. Mailath, and R. Rob "Learning, [1992] Mutation, and Long Run Eguilibrium in Games," mimeo. , Kerridge, E. [1967] Unwin, London. The Agricultural Revolution. George Allan & C. [1990] "Dynamic Choice in a Social Setting," University of Wisconsin SSRI discussion paper 9003. Manski, Mantoux, P. Century The Industrial Revolution in the Harper and Row, N.Y.; reprinted in 1961. [1905] Mingay, G.E. [1977] The Agricultural Revolution. Black, London. Adam & 18th Charles "Some Convergence Theorems for Stochastic Norman, M.F. [1968] Operators," Distance-Diminishing Learning Systems with J. Mathematical Psychology. 5, 61-101. 49 Norman, M.F. Press [1972] Markov Processes and Learning Models Academic Rogers, E. with F. Shoemaker [1971] Communication of Innovations: A Cross-Cultural approach. 2nd edition, New York, the Free Press. "A Two-Armed Bandit Rothschild, M. [1974] Pricing," J^ Econ. Theory. 9, 185-202. Ryan, Theory of and N. Gross [1943] "The Diffusion of Hybrid B. Corn in Two Iowa Communities," Rural Sociology Market Seed Schmalansee, R. [1975] "Alternative Models of Bandit Selection," Theory 10, 333-342. J. Econ . Slicher van Bath, H.S. [1965] The Agrarian History of Europe A.D. Edward Arnold, London. xxxx -1850 "Error Persistence L. [1990] Observational Learning," mimeo. Smith, and Experimental versus "The Turnip, the New Husbandry, and the Timmer, C.P. [1969] Agricultural Revolution," Quarterly Journal of English Economics 83, 375-395. , 50 REFERENCES See Rogers and Shoemaker [1971] for an extensive discussion of on adoption processes, research empirical especially in development. Mansfield [1968] and Ryan and Gross [1943] are classic studies of technology adoption in basic industries and agriculture, respectively. Cross [1983] develops a model of boundedly rational, adaptive choice with a similar information structure. Smith's paper models the pricing decisions of monopolistically competitive firms, for which the assumption of unobserved payoffs seems more plausible. Banerjee [1991a] is a model of investment decisions; Banerjee [1991b] is less explicit about its intended interpretations. Bikchandari et al suggest that their model is appropriate for a wide range of decision problems, including that of technology adoption. 4 Manski [1990] considers heterogeneous populations from an His paper provides a sufficient condition econometric viewpoint. for an individual to be able to obtain a consistent estimate of his own optimal choice by observing the choices and outcomes of others 5 Note that when the players are heterogeneous, a central to know the relative payoffs of the planner would need competing technologies for every player in order to implement the optimum by fiat. Centrally-based agricultural reformers are often hampered by their lack of understanding of the variation For example, Apodaca in farmers' tastes and production costs. [1952] describes how a planner tried to induce a New Mexico The innovation was adopted community to adopt a hybrid corn. and then discontinued despite doubling yields, as the villagers decided the taste and consistency of the corn were inappropriate making tortillas. for (In the second model the environment is complicated enough that a great many periods would be required to obtain good estimates, as we discuss in section 5.) See Mingay [1977] and Slicher von Bath [1965] for more Timmer thorough descriptions of the technology. [1965] discusses the extent of the resulting gains in productivity. 8 Kerridge [1967] pp. 28-34, 339-341. 9 Chambers and Mingay [1966] p. 55. 10 See Timmer [1969]. and Kerridge [1967]. Slicher von Bath [1965] gives a similar figure for the rate of diffusion in 243 p. France. See Timmer [1969] for an excellent summary of this debate. 51 12 ... • The consideration of inertia is further motivated by the empirical evidence that there is typically a substantial lag between the time individuals first learn of the existence of a Ryan and Gross [1943] technology and the time they adopt it. found that farmers in two rural communities on average adopted hybrid seed corn 9 years after they first heard of the innovation. Other studies cited in Rogers and Shoemaker [1971], p. 129, report lags of 2-4 years for the adoption of weed spray Note that the spread of in Iowa and fertilizer in Pakistan. literacy and modern communication media will speed up the rate at which farmers become aware of a new technology's existence, but do not seem to have eliminated the lag between becoming informed and deciding to adopt. If we interpret x. as the probability that a singleindividual chooses g, as opposed to the population fraction, (2) is an example of the linear stochastic learning theory (LSLT) of Bush This theory describes a traditional and Mosteller [1952]. one-player bandit problem, in which only 1 arm is observed in each period; it is also assumed that the only possible outcomes are "reward" and "failure," with the probability of arm i being Our model corresponds to the special case in rewarded being tt. and the two arm's outcomes in a given period which 7T. = 1-tt_ are perfectly negatively correlated. (In the bandit problem, the See Schmalansee [1975] for a joint distribution is irrelevant.) and a discussion of their brief survey of these models, pricing. system to market In our applicability (2) Schmalansee's constants L, L, G, and G all equal a, while Schmalansee's a and a both equal 1, and his /S. and jS_ equal 0. , , Schmalansee argues that in some cases the behavior these models prescribe, with both actions taken infinitely often, is more realistic than that of the optimal solution to the discounted bandit problem. .... 14 The empirical literature suggests that popularity weighting is a factor, but reliable estimates of m are hard to come by. Rogers and Shoemaker (op cit. p. 142) say that "many students of peasant life feel" that innovations must be 20% to 30% better to be adopted; they also cite a President's Science Advisory Committee figure of 50% to 100% From our reading, it is not clear whether these premia reflect popularity weighting or inertia. , . 52 15 We should warn the reader that the results we obtain for the fact uniform case will seem to rely on the that this distribution has compact support: an observation that u^ - u^ > However, compact support a implies that 6 is greater than zero. is not what underlies our conclusions. Appendix A shows that the non-linear rule "only switch if the observed payoff difference is large compared to the popularity" leads to a long-run distribution that places most of its weight on the better choice whenever the distribution of errors is "infinitely revealing in the tails." The appendix also reports simulations of a more complex rule that seems to work well even when the tails are not infinitely revealing. An alternative explanation is to use the fact that the rule m the optimal long-run decision, and that since a is the standard deviation of the per-period payoff differences, rescaling the utility function rescales cr in the same way. = a yields 17 Unless the payoff difference is so extreme that G -cr > m, in which case the rate of adoption is independent of 8. Note that the rate is also an increasing function of 8 when m = 0, s provided that 8 is smaller than cr. 18 However, the correlation is easy to explain as the result of an optimal investment policy under complete information if adopting the innovation requires investing in a capital good. 19 Although our leading example of very small window widths is the English agricultural revolution, small window widths should Anecdotal evidence not be seen as requiring illiterate agents. suggests that farmers often distrust the information of central authorities and experts, and prefer to see how innovations work Ryan and Gross [194 3] found that the out in their neighborhood. experiences of neighbors was an important factor in the adoption of hybrid seed corn by 2 0th century Iowa farmers. 20 Although we have not checked the details, it seems that a large window a rule combination of widths with of proximity-weighted averages could combine quick convergence with a small long-run variance. 21 " However, as w increases the specification bias grows. When w is large, it may be more natural to suppose that players weight the experience of those nearby more than that of those who are farther away but still within their window. . . ' .... . ... 22 Based on estimated standard errors, the first two digits are correct at the .95 level. 23 We thank Roland Benabou for this observation. (See Futia's [198 2 survey for a summary of Norman's results, invariant and other techniques for establishing that the distribution is unique.) 53 18 8 kO MIT LIBRARIES 3 =1060 0074bT3b 1