The Longer Tail: The Changing Shape of Amazon’s Sales Distribution Curve

Erik Brynjolfsson, Yu (Jeffrey) Hu, Michael D. Smith

This Version: September 2010
Available from: ssrn.com/abstract=1679991

Acknowledgements: The authors thank seminar participants at the 2009 Workshop on Information Systems and Economics and the 2008 INFORMS Annual Meetings for valuable comments on this research. Smith acknowledges the National Science Foundation for generous financial support provided through CAREER award IIS-0118767.

Electronic copy available at: http://ssrn.com/abstract=1679991

Abstract

Internet consumers derive significant surplus from increased product variety, and in particular from the “Long Tail” of niche products that can be found on the Internet at retailers like Amazon.com. In this paper we analyze how the shape of Amazon’s sales distribution curve changed from 2000 to 2008, and how this change affects the consumer surplus gains from increased product variety in the online book market. Specifically, in 2008 we collected sales and sales rank data on a broad sample of books sold through Amazon.com and compared them to similar data we gathered in 2000. We then develop a new methodology for fitting the relationship between sales and sales rank and apply it to our data. We find that the Long Tail has grown longer over time, with niche books accounting for a larger share of total sales. Our analyses suggest that by 2008 niche books accounted for 36.7% of Amazon’s sales, and that the consumer surplus generated by niche books increased at least five-fold from 2000 to 2008. We argue that this increase is consistent with the presence of “secondary” supply- and demand-side effects driving the growth of the Long Tail online.
In addition, our new methodology finds that, while power laws are a good first approximation for the rank-sales relationship, the slope is not constant for all book ranks, becoming progressively steeper for more obscure books.

Key Words: Long Tail, electronic commerce, sales distribution, niche products, power law

1. Introduction

The term “The Long Tail” was coined by Wired’s Chris Anderson (Anderson 2004) to describe a phenomenon where niche products account for a much larger proportion of sales in Internet markets than they do in brick-and-mortar markets. This phenomenon has attracted much attention and debate in the popular press (e.g., Gomes 2006, Orlowski 2008) and in the information systems, marketing, and operations management literatures. In an earlier study of the Internet’s Long Tail phenomenon (Brynjolfsson, Hu, and Smith 2003), we found that sales of niche books (books that are not typically stocked in brick-and-mortar bookstores) enhanced consumer surplus by $731 million to $1.03 billion in 2000.

Given the high level of interest in Amazon’s Long Tail, it is important to understand how the shape of online sales distributions and the resulting consumer surplus gains change over time. Will the phenomenon we described in 2003 grow over time, or will it prove to be static or even short-lived?

One school of thought says that the same forces that created Amazon’s Long Tail in the first place may continue to make it longer over time (e.g., Brynjolfsson et al. 2006). First, exposure to niche products could drive consumers to develop a taste for more niche products. Second, with access to “long tail” markets that stock their products, producers could have a greater incentive to create new niche products over time.
Finally, technologies that can drive consumers to niche products (such as search tools, product reviews, product popularity information, and recommendation engines) could improve over time, and consumers could become more familiar with these tools.

In contrast, some have argued that the Long Tail may be a short-lived phenomenon. For instance, early adopters of e-commerce are likely to have very different tastes for products than the mainstream market (Moore 2002). As online commerce attracts more and more mainstream consumers over time, the increase in sales of mainstream popular products could outpace the increase in sales of niche products, reducing the size of the Long Tail. In addition, online search and recommendation tools could be tuned (intentionally or unintentionally) to disproportionately promote popular products (Fleder and Hosanagar 2009). Finally, producers of popular products could employ online marketing strategies to promote their products and counteract the effect of search and recommendation tools in promoting niche products.

This paper analyzes whether the Long Tail phenomenon represents a temporary or permanent shift. We collected Amazon sales and sales rank data in 2008 on a larger and broader sample of books than was available in the 2000 sample used by our 2003 paper. We then match this sample to our 2000 sample to compare changes in the profile of sales over time. Our results suggest that Amazon’s Long Tail grew significantly longer from 2000 to 2008 and that overall consumer surplus gains from product variety at Amazon increased five-fold over the same period.

This paper also contributes to the estimation of sales-rank relationships online by showing that the relationship between Amazon sales and sales rank may not be purely log-linear.
Based on the log-linear curve estimated in our 2003 paper and in Chevalier and Goolsbee (2003), many empirical papers have started to use Amazon sales rank as a proxy for Amazon sales (e.g., Chevalier and Mayzlin 2006, Ghose et al. 2006, Dhar et al. 2009, Carmi et al. 2009). The results in this paper indicate that while Amazon sales rank remains a good proxy for Amazon sales, different slope coefficients should be used to fit such a relationship, especially when the books being studied span a wide spectrum of popular books and niche books. This paper develops a new methodology for fitting the relationship between sales and sales rank and applies it to our 2008 data.

2. Literature

Economic explanations for the existence of superstars and popular products can be traced to Rosen (1981) and Frank and Cook (1995). Brynjolfsson, Hu, and Smith (2006) point out several demand- and supply-side factors that could drive sales to niche products on the Internet, including low inventory costs, demand aggregation, and low consumer search costs caused by search tools and recommendation systems. These demand- and supply-side factors can even reinforce each other. For instance, Cachon, Terwiesch, and Xu (2008) show that low consumer search costs can enhance a retailer’s incentives to provide a large product selection. The reinforcement of these demand- and supply-side factors could drive even more sales toward niche products.

Recently, researchers have explored other factors that could increase the sales of niche products. For example, Tucker and Zhang (2009) find that information about product popularity may disproportionately benefit niche products with narrow appeal, and Oestreicher-Singer and Sundararajan (2009) find that the sales of a product can be influenced by the position of the product in the hyperlinked network of products.

There is also a growing body of literature that empirically examines sales distributions in various product markets.
Brynjolfsson, Hu, and Simester (2007) find that the sales distribution of an Internet channel is less concentrated than that of a catalog channel, using data from a clothing retailer. Elberse and Oberholzer-Gee (2008) find that the demand for niche video products has increased, although they also find that a large number of niche products have almost zero sales. Chellappa et al. (2007) have similar findings for music sales.

However, none of these papers addresses whether the Long Tail phenomenon is a temporary or permanent shift, or how it might change over time. This paper answers these questions by comparing the sales distribution of a similar profile of products over a sufficiently long period of time.

3. Data Analyses

The data for this paper come from a major publisher with annual sales of more than $1 billion. The publisher provided us with their Amazon sales and sales rank data on a sample of 1,598 titles over 10 weeks from June to August 2008. Overall, we have 15,980 observations of Amazon sales and sales ranks. Table 1 compares the summary statistics for our 2000 and 2008 samples. It is clear that our 2008 sample has more observations (15,980 vs. 901) and covers a broader spectrum of books (sales ranks of 71 to 5,350,140 versus 238 to 961,367) than our 2000 sample does.

3.1 Sample Matching

One may argue that our 2008 sample differs from our 2000 sample, and that such a sample selection effect could confound the findings from comparing these two samples. A standard way to control for selection effects is the propensity score matching method suggested by Rosenbaum and Rubin (1983). Rassler (2002) also provides many details on sample matching. The idea is to obtain a new 2008 sample that matches our 2000 sample on observable dimensions. Such a sample matching approach reduces the difference between the two samples in an attempt to control for the selection effect.
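The idea of building a matched sub-sample can be illustrated with a minimal sketch. This is not the propensity score procedure itself: it swaps in greedy one-to-one nearest-neighbour matching on a single covariate, assumed here to be the log of weekly sales rank. The function name `match_nearest` and the synthetic data are hypothetical and only serve to show the mechanics.

```python
import math
import random


def match_nearest(target, pool):
    """Greedy 1:1 nearest-neighbour matching without replacement.

    For each value in `target`, pick the closest still-unused value in
    `pool` and return the list of chosen pool indices (ties go to the
    lowest index).
    """
    if len(pool) < len(target):
        raise ValueError("pool must be at least as large as target")
    available = set(range(len(pool)))
    chosen = []
    for t in target:
        j = min(available, key=lambda k: (abs(pool[k] - t), k))
        chosen.append(j)
        available.remove(j)  # never reuse a matched observation
    return chosen


# Synthetic illustration only: log ranks drawn uniformly over the rank
# ranges reported for the two samples (not the actual data).
random.seed(0)
log_rank_2000 = [random.uniform(math.log(238), math.log(961_367)) for _ in range(50)]
log_rank_2008 = [random.uniform(math.log(71), math.log(5_350_140)) for _ in range(2000)]

idx = match_nearest(log_rank_2000, log_rank_2008)
matched_2008 = [log_rank_2008[j] for j in idx]

# Largest remaining gap in log rank between a 2000 book and its match.
max_gap = max(abs(a - b) for a, b in zip(log_rank_2000, matched_2008))
```

With a single matching variable, matching on the variable itself and matching on a propensity score that is monotone in that variable select similar neighbours, which is why this simplification is a reasonable stand-in for illustration.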
Table 1: Summary Statistics for Our 2008 Sample and Our 2000 Sample

| Sample | Variable | Obs. | Mean | S.D. | Min | Max |
|---|---|---|---|---|---|---|
| 2008 Sample | Weekly Sales | 15,980 | 3.04 | 17.79 | 0 | 950 |
| 2008 Sample | Weekly Sales Rank | 15,980 | 338,238 | 330,780 | 71 | 5,350,140 |
| 2000 Sample | Weekly Sales | 901 | 18.32 | 30.20 | 0 | 480 |
| 2000 Sample | Weekly Sales Rank | 901 | 34,054 | 61,001 | 238 | 961,367 |

Specifically, to conduct this sample matching we use STATA’s propensity score matching module to construct a sub-sample from our 2008 sample that matches our 2000 sample on the basis of weekly sales rank. After using this technique, summary statistics for the 2008 matched sample (reported in Table 2) are comparable to those for our 2000 sample (shown at the bottom of Table 1).

Table 2: Summary Statistics for 2008 Matched Sample

| Variable | Obs. | Mean | S.D. | Min | Max |
|---|---|---|---|---|---|
| Weekly Sales | 901 | 21.32 | 41.30 | 0 | 354 |
| Weekly Sales Rank | 901 | 34,056 | 60,999 | 226 | 961,146 |

3.2 Re-estimating Amazon’s Long Tail

We now estimate the log-linear relationship between Amazon sales and sales rank, using the 2008 matched sample. The linear regression model we use is:

$$ y_i = \beta_0 + \beta_1 x_i \tag{1} $$

where $y_i$ is ln(Weekly Sales) and $x_i$ is ln(Weekly Sales Rank). The results using the 2008 matched sample are reported in Column (1) of Table 3, with the analogous results from our 2000 sample in Column (2). Note that 41 observations in the 2008 matched sample have zero Weekly Sales and 40 observations in the 2000 sample have zero Weekly Sales. These observations are dropped after taking the natural log of Weekly Sales. The coefficient on ln(Weekly Sales Rank) is -0.613 when the 2008 matched sample is used, which is significantly smaller in magnitude than the same coefficient when the 2000 sample is used (-0.871).

Table 3: Results of the Log-linear Regression

| | (1) 2008 Matched Sample | (2) 2000 Sample |
|---|---|---|
| Constant | 8.046** (0.432) | 10.526** (0.156) |
| Ln(Weekly Sales Rank) | -0.613** (0.042) | -0.871** (0.017) |
| Obs. | 860 | 861 |
| R2 | 0.311 | 0.801 |

Robust standard errors are in parentheses; ** significantly different from zero, p<0.01; * p<0.05.

[Figure 1: Amazon’s Long Tail in 2008 vs. in 2000. Two panels plot Weekly Sales against Weekly Sales Rank for 2000 and 2008, first on a linear scale and then on a logarithmic scale.]

The smaller (in absolute value) coefficient in 2008 suggests that there is more weight in the “tail” of the 2008 sales distribution than there was in 2000. This can be seen graphically in Figure 1, which shows the estimated log-linear relationship between Amazon sales and sales rank, with the 2008 results in blue and the 2000 results in red. We plot these two curves first on a normal scale and then on a logarithmic scale. The two curves cross when sales rank is 14,949. This means that popular books (with sales rank below 14,949) tend to sell fewer copies in 2008 than in 2000, while niche titles (with sales rank above 14,949) tend to generate more sales in 2008 than in 2000.

3.3 Toward A More Accurate Method of Estimating Amazon’s Long Tail

The log-linear regression method assumes that the coefficient on Log(Weekly Sales Rank) does not vary as a book’s sales rank increases. To test this assumption, we fit the relationship between Log(Weekly Sales) and Log(Weekly Sales Rank) to a series of splines, rather than just a single line. Such a spline fitting technique allows the slope coefficient to vary as a book’s sales rank increases, leading to a more accurate estimate of the size of Amazon’s Long Tail.

We note that our 2000 sample does not contain any observation with Weekly Sales Rank above 1 million. In our 2008 sample, we have 569 observations with Weekly Sales Rank above 1 million, allowing us to more accurately estimate the shape of Amazon’s Long Tail for books with sales ranks above 1 million. We also note that books with Weekly Sales Rank above 1 million frequently have zero Weekly Sales.
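As a quick check on Section 3.2, the crossing point of the two fitted log-linear curves can be recovered from the Table 3 coefficients alone by setting the two fitted lines equal and solving for ln(rank). The sketch below does this arithmetic; the variable and function names are illustrative, and only the four coefficients come from the paper.

```python
import math

# Coefficients from Table 3: ln(sales) = b0 + b1 * ln(rank)
B0_2008, B1_2008 = 8.046, -0.613   # Column (1), 2008 matched sample
B0_2000, B1_2000 = 10.526, -0.871  # Column (2), 2000 sample

# Equate the two fitted lines and solve for x = ln(rank):
#   B0_2008 + B1_2008 * x = B0_2000 + B1_2000 * x
x_cross = (B0_2000 - B0_2008) / (B1_2008 - B1_2000)
rank_cross = math.exp(x_cross)


def predicted_sales(rank, b0, b1):
    """Weekly sales implied by a log-linear fit at a given sales rank."""
    return math.exp(b0 + b1 * math.log(rank))


# Compare the two curves on either side of the crossing point.
for rank in (1_000, 100_000):
    s2008 = predicted_sales(rank, B0_2008, B1_2008)
    s2000 = predicted_sales(rank, B0_2000, B1_2000)
    side = "2008 higher" if s2008 > s2000 else "2000 higher"
    print(f"rank {rank:>7,}: 2008={s2008:.2f}, 2000={s2000:.2f} ({side})")

print(f"curves cross near rank {rank_cross:,.0f}")
```

The computed crossing rank rounds to the 14,949 reported in the text, with the 2000 curve higher at better (lower) ranks and the 2008 curve higher further into the tail.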
The method in our 2003 paper relies on a linear regression of Log(Weekly Sales) on Log(Weekly Sales Rank). Taking the natural log of Weekly Sales means that all observations with zero sales will be dropped. To utilize these observations, we now use the following negative binomial regression model:

$$ f(y_i \mid X_i) = \frac{e^{-\mu_i} \mu_i^{y_i}}{y_i!}, \quad y_i = 0, 1, 2, 3, \ldots \tag{2} $$

where $y_i$ is Weekly Sales (a non-negative count), $X_i$ is a vector of explanatory variables, and $E(y_i \mid X_i) = \mu_i$ is the conditional mean (Cameron and Trivedi 1998). We model the natural log of the conditional mean as a series of $n$ linear splines with $n-1$ knot points $k_1, k_2, \ldots, k_{n-1}$:

$$ \ln(\mu_i) = \beta_0 + \beta_1 S_1(x_i) + \beta_2 S_2(x_i) + \cdots + \beta_n S_n(x_i) \tag{3} $$

where the $l$-th spline is

$$ S_l(x_i) = (x_i - k_{l-1})\, I(x_i > k_{l-1}) - (x_i - k_l)\, I(x_i > k_l) \quad \text{for } l = 2, 3, \ldots, n-1, $$

while $S_1(x_i) = x_i - (x_i - k_1)\, I(x_i > k_1)$ and $S_n(x_i) = (x_i - k_{n-1})\, I(x_i > k_{n-1})$.

Table 4: Results Using 2008 Sample, New Methodology vs. Old Methodology

| | (1) Negative Binomial Model and Four Splines | (2) Negative Binomial Model and One Spline | (3) Linear Regression and One Spline | (4) Negative Binomial Model and Two Splines | (5) Negative Binomial Model and Three Splines |
|---|---|---|---|---|---|
| Constant | 10.083** (0.211) | 102.189** (0.147) | 7.480** (0.092) | 10.709** (0.168) | 10.268** (0.190) |
| $S_1(x_i)$ | -0.782** (0.019) | -0.977** (0.012) | -0.555** (0.008) | -0.843** (0.015) | -0.800** (0.017) |
| $S_2(x_i)$ | -1.160** (0.066) | | | -1.539** (0.044) | -1.281** (0.055) |
| $S_3(x_i)$ | -1.217** (0.099) | | | | -1.505** (0.079) |
| $S_4(x_i)$ | -1.680** (0.113) | | | | |
| Obs. | 15,980 | 15,980 | 7,668 | 15,980 | 15,980 |

Robust standard errors are in parentheses; ** significantly different from zero, p<0.01; * p<0.05.

In Column (1) of Table 4, we report the estimation results using a negative binomial regression and four splines, with knot points at the 25th, 50th, and 75th percentiles of $x_i$. These results show that the coefficients on all four splines are negative and highly significant.
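To make the spline terms in equation (3) concrete, the following sketch builds the basis $S_1, \ldots, S_n$ for a single value of $x$ and evaluates the log of the conditional mean. The coefficients are those in Column (1) of Table 4, but the knot values are hypothetical placeholders, since the percentile values of $x_i$ are not reported; the function names are likewise illustrative.

```python
def hinge(x, k):
    """(x - k) * I(x > k): zero up to the knot, linear afterwards."""
    return max(x - k, 0.0)


def spline_basis(x, knots):
    """Return [S_1(x), ..., S_n(x)] for n - 1 ascending knot points."""
    n = len(knots) + 1
    basis = []
    for l in range(1, n + 1):
        lower = x if l == 1 else hinge(x, knots[l - 2])
        upper = 0.0 if l == n else hinge(x, knots[l - 1])
        basis.append(lower - upper)
    return basis


def log_mean(x, beta0, betas, knots):
    """ln(mu) = beta0 + sum_l beta_l * S_l(x), as in equation (3)."""
    return beta0 + sum(b * s for b, s in zip(betas, spline_basis(x, knots)))


# Coefficients from Column (1) of Table 4.
BETA0 = 10.083
BETAS = [-0.782, -1.160, -1.217, -1.680]
# Hypothetical knots in ln(rank); the paper uses the 25th, 50th, and 75th
# percentiles of x_i, whose values are not given.
KNOTS = [11.0, 12.0, 13.0]

# Each S_l only "moves" inside its own segment, so the slope of ln(mu)
# within segment l equals beta_l.
for x in (10.5, 11.5, 12.5, 13.5):
    slope = log_mean(x + 0.1, BETA0, BETAS, KNOTS) - log_mean(x, BETA0, BETAS, KNOTS)
    print(f"x={x}: basis={spline_basis(x, KNOTS)}, local slope={slope / 0.1:.3f}")
```

Because the basis terms sum to $x$ itself, setting all coefficients equal reproduces a single log-linear line; letting them differ is what allows the slope to change from one segment to the next.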
In addition, the slope coefficient gradually changes from -0.782 to -1.680, becoming more negative as the book’s sales rank increases. In other words, book sales decrease at an increasingly faster pace as we move from popular books to niche books. Using a negative binomial regression and only one spline would result in a coefficient of -0.977 (shown in Column (2) of Table 4). Such a model would not capture the shape of Amazon’s sales distribution curve accurately. Using the methodology in our 2003 paper (a linear regression and only one spline) would result in an even less accurate shape of Amazon’s Long Tail (shown in Column (3) of Table 4). As discussed earlier, applying a linear regression model would drop all observations with zero Weekly Sales, leading to even more bias in the estimation results.

We have conducted robustness checks on the results shown in Column (1) of Table 4 by using different numbers of splines with different knot points. For instance, we have tried two splines with one knot point at the 50th percentile of $x_i$ (shown in Column (4) of Table 4), and three splines with two knot points at the 33rd and 67th percentiles of $x_i$ (shown in Column (5) of Table 4). We have consistently found results that are qualitatively similar to those in Column (1) of Table 4: the slope coefficient gradually becomes more negative as the book’s sales rank increases.

3.4 Re-estimating The Size of Amazon’s Long Tail in 2008

Figure 2 illustrates the difference between the shape of the sales-rank curve using the new and the old estimation methodology (linear regression and one spline), with the curve using the new methodology in red and the curve using the old methodology in green. With our new data we can see that the methodology used by Brynjolfsson, Hu, and Smith (2003), a linear regression with one spline, could have overestimated the size of Amazon’s Long Tail.
This is because the assumption that the coefficient on Log(Weekly Sales Rank) does not change as a book’s sales rank increases does not seem to hold, and because using a linear regression drops observations with zero weekly sales.

Our new methodology allows us to fit the relationship between Log(Weekly Sales) and Log(Weekly Sales Rank) more accurately. To obtain an accurate estimate of total sales and of the sales generated by books ranked above 100,000, we simply integrate under the curve implied by the estimates in Column (1) of Table 4 and find that books ranked above 100,000 account for 36.7% of Amazon’s total sales in 2008. Using the old methodology (Column (3) of Table 4), we would instead have estimated that books ranked above 100,000 account for 82.57% of Amazon’s total sales in 2008.

We use this and our other calculations to estimate the consumer surplus gain from “Long Tail” books in 2008 and compare it to our calculations for 2000. In our prior work, we used 100,000 as the cutoff point for “niche” books under the argument that the largest book superstores typically carry only that number of unique titles. The physical stock capacity of bookstores has changed little between 2000 and now, and thus we use the same cutoff point to recalculate the consumer surplus generated from selling niche books on the Internet.

[Figure 2: Amazon’s Long Tail in 2008, Using New and Old Methodologies. Two panels plot Weekly Sales against Weekly Sales Rank for the 2008 new-methodology and 2008 old-methodology curves, first on a linear scale and then on a logarithmic scale.]

However, while the stocking capacity of bookstores has remained relatively constant since 2000, several changes have happened in the intervening eight years. First, according to Books in Print, the number of books in print has increased from 2.3 million in 2000 to 3-5 million in 2008.
Second, book industry revenue has climbed from $24.6 billion to $37.3 billion. Third, the share of book purchases through the Internet channel has risen from 6% in 2000 to 21-30% in 2008. Combining these changes with the new estimates of the percentage of sales in the Long Tail, we estimate that selling niche books that are unavailable in brick-and-mortar stores leads to a consumer surplus of $3.93 billion to $5.04 billion in the year 2008. These estimates are about five times the estimates in Brynjolfsson, Hu, and Smith (2003), even though the estimates in that paper are likely to have been overestimates.

4. Conclusions

This paper analyzes how the shape of Amazon’s Long Tail has changed over time. We collect data in 2008 on a larger and broader sample of books than was available in the 2000 sample used by our 2003 paper. Our new data and new methodology (negative binomial regression with a series of splines) allow us to fit the relationship between sales and sales rank with much greater accuracy.

This paper presents two important new findings. First, Amazon’s Long Tail grew significantly longer from 2000 to 2008, and overall consumer surplus gains from product variety at Amazon increased five-fold over the same period. This finding suggests that Amazon’s Long Tail phenomenon, first discussed in our 2003 paper, is likely to be a permanent shift rather than a short-lived phenomenon. Second, while previous research has assumed a constant slope between the log of sales and the log of sales rank, we find that the sales of a book drop faster than a regular power law (or a log-linear curve) indicates, and that the slope becomes steeper as a book’s sales rank increases. This finding suggests that there may be forces that limit Amazon’s ability to sell books that are extremely niche. Future research is needed to better understand the nature of these forces.

References

Anderson, C. 2004. The Long Tail. Wired Magazine 12(10) 170-177.
Brynjolfsson, E., Y. J. Hu, M. D. Smith. 2003. Consumer surplus in the digital economy: Estimating the value of increased product variety at online booksellers. Management Science 49(11) 1580-1596.
Brynjolfsson, E., Y. J. Hu, M. D. Smith. 2006. From niches to riches: The anatomy of the long tail. Sloan Management Review 47(4) 67-71.
Brynjolfsson, E., Y. J. Hu, D. Simester. 2007. Goodbye Pareto principle, hello long tail: The effect of search costs on the concentration of product sales. MIT Sloan Working Paper.
Cachon, G. P., C. Terwiesch, Y. Xu. 2008. On the effects of consumer search and firm entry in a multiproduct competitive market. Marketing Science 27(3) 461-473.
Chellappa, R., B. Konsynski, V. Sambamurthy, S. Shivendu. 2007. An empirical study of the myths and facts of digitization in the music industry. Workshop on Information Systems and Economics, Montreal, Canada.
Chevalier, J., A. Goolsbee. 2003. Price competition online: Amazon versus Barnes and Noble. Quantitative Marketing and Economics 1(2) 203-222.
Chevalier, J., D. Mayzlin. 2006. The effect of word of mouth online: Online book reviews. Journal of Marketing Research 43(3) 345-354.
Dhar, V., G. Oestreicher-Singer, A. Sundararajan, A. Umyarov. 2009. The gestalt in graphs: Prediction using economic networks. Working paper, New York University.
Elberse, A., F. Oberholzer-Gee. 2008. Superstars and underdogs: An examination of the long tail phenomenon in video sales. Working Paper.
Fleder, D., K. Hosanagar. 2009. Blockbuster culture’s next rise and fall: The impact of recommender systems on sales diversity. Management Science 55(5) 697-712.
Frank, R., P. Cook. 1995. The Winner-Take-All Society: Why the Few at the Top Get So Much More Than the Rest of Us. Penguin, New York, NY.
Ghose, A., M. D. Smith, R. Telang. 2006. Internet exchanges for used books: An empirical analysis of product cannibalization and welfare impact. Information Systems Research 17(1) 3-19.
Gomes, L. 2006. It may be a long time before the long tail is wagging the web. Wall Street Journal July 26.
Moore, G. 2002. Crossing the chasm. HarperCollins Publishers, New York, NY.
Oestreicher-Singer, G., A. Sundararajan. 2009. Recommendation networks and the Long Tail of electronic commerce. Working paper, New York University.
Orlowski, A. 2008. Chopping the long tail down to size. The Register Nov 7.
Rassler, S. 2002. Statistical Matching: A Frequentist Theory, Practical Applications and Alternative Bayesian Approaches. Springer, New York, NY.
Rosen, S. 1981. The economics of superstars. American Economic Review 71(5) 845-858.
Rosenbaum, P., D. Rubin. 1983. The central role of the propensity score in observational studies for causal effects. Biometrika 70(1) 41-55.
Tucker, C., J. Zhang. 2009. How does popularity information affect choices? A field experiment. MIT Sloan Working Paper.