The Longer Tail: The Changing Shape of Amazon’s Sales Distribution Curve

Erik Brynjolfsson, Yu (Jeffrey) Hu, Michael D. Smith
This Version: September 2010
Available from: ssrn.com/abstract=1679991
Acknowledgements: The authors thank seminar participants at the 2009 Workshop on Information
Systems and Economics, and the 2008 INFORMS Annual Meetings for valuable comments on this
research. Smith acknowledges the National Science Foundation for generous financial support provided
through CAREER award IIS-0118767.
Abstract
Internet consumers derive significant surplus from increased product variety, and in particular, the “Long
Tail” of niche products that can be found on the Internet at retailers like Amazon.com. In this paper we
analyze how the shape of Amazon’s sales distribution curve has changed from 2000 to 2008, and how this
impacts the resulting consumer surplus gains from increased product variety in the online book market.
Specifically, in 2008 we collected sales and sales rank data on a broad sample of books sold through
Amazon.com and compare it to similar data we gathered in 2000. We then develop a new methodology
for fitting the relationship between sales and sales rank and apply it to our data. We find that the Long
Tail has grown longer over time, with niche books accounting for a larger share of total sales. Our
analyses suggest that by 2008, niche books account for 36.7% of Amazon’s sales and the consumer surplus generated by niche books has increased at least five-fold from 2000 to 2008. We argue that this increase is consistent with the presence of “secondary” supply- and demand-side effects driving the growth
of the Long Tail online. In addition, our new methodology finds that, while power laws are a good first
approximation for the rank-sales relationship, the slope is not constant for all book ranks, becoming progressively steeper for more obscure books.
Key Words: Long Tail, electronic commerce, sales distribution, niche products, power law
1. Introduction
The term “The Long Tail” was coined by Wired’s Chris Anderson (Anderson 2004) to describe a phenomenon where niche products account for a much larger proportion of sales in Internet markets than they
do in brick-and-mortar markets. This phenomenon has captured much attention and debate in the popular
press (e.g., Gomes 2006, Orlowski 2008) and in the information systems, marketing, and operations
management literatures. In an earlier study of the Internet’s Long Tail phenomenon (Brynjolfsson, Hu,
and Smith 2003), we found that sales of niche books (books that are not typically stocked in
brick-and-mortar bookstores) enhanced consumer surplus by $731 million to $1.03 billion in 2000.
Given the high interest level in Amazon’s Long Tail, it is important to understand how the shape of online
sales distributions and the resulting consumer surplus gains will change over time. Will the phenomenon
we described in 2003 increase over time, or will it be a static or even short-lived phenomenon?
One school of thought says that the same forces that created Amazon’s Long Tail in the first place may
continue to make it longer over time (e.g. Brynjolfsson et al. 2006). First, exposure to niche products
could drive consumers to develop a taste for more niche products. Second, by gaining access to “long tail”
markets to stock their products, producers could have an increased incentive to create more new niche
products over time. Finally, technologies that can drive consumers to niche products — such as search
tools, product reviews, product popularity information, and recommendation engines — could improve
over time, and consumers could become more familiar with these tools.
In contrast, some have argued that the Long Tail may be a short-lived phenomenon. For instance, early
adopters of e-commerce are likely to have very different tastes for products than the mainstream market
(Moore 2002). As online commerce attracts more and more mainstream consumers over time, the increase
in sales of mainstream popular products could outpace the increase in sales of niche products, reducing
the size of The Long Tail. In addition, online search and recommendation tools could be tuned (intentionally or unintentionally) to disproportionately promote popular products (Fleder and Hosanagar 2009). Finally, producers of popular products could employ online marketing strategies to promote their products
and counteract the effect of search and recommendation tools in promoting niche products.
This paper analyzes whether the Long Tail phenomenon represents a temporary or permanent shift. We
have collected Amazon sales and sales rank data in 2008 on a larger and broader sample of books than
was available in the 2000 sample used by our 2003 paper. We then match this sample to our 2000 sample
to compare changes in the profile of sales over time. Our results suggest that Amazon’s Long Tail has
gotten significantly longer from 2000 to 2008 and that overall consumer surplus gains from product variety at Amazon increased five-fold from 2000 to 2008.
This paper also makes an important contribution to the estimation of sales-rank relationships online by
showing that the relationship between Amazon sales and sales rank may not be purely log-linear. Based
on the log-linear curve estimated in our 2003 paper and Chevalier and Goolsbee (2003), many empirical
papers have started to use Amazon sales rank as a proxy for Amazon sales (e.g., Chevalier and Mayzlin
2006, Ghose et al. 2006, Dhar et al. 2009, Carmi et al. 2009). The results in this paper indicate that while
Amazon sales rank remains a good proxy for Amazon sales, different slope coefficients should be used to
fit such a relationship, especially when the books being studied span a wide spectrum of popular books
and niche books. This paper develops a new methodology for fitting the relationship between sales and
sales rank and applies it to our 2008 data.
2. Literature
Economic explanations for the existence of superstars and popular products can be traced to Rosen (1981)
and Frank and Cook (1995). Brynjolfsson, Hu, and Smith (2006) point out several demand- and supply-side factors that could drive sales to niche products on the Internet, including low inventory costs,
demand aggregation, and low consumer search costs caused by search tools and recommendation systems.
These demand- and supply-side factors can even reinforce each other. For instance, Cachon, Terwiesch,
and Xu (2008) show that low consumer search costs can enhance a retailer’s incentives to provide a large
product selection. The reinforcement of these demand- and supply-side factors could drive even more
sales toward niche products. Recently, researchers have explored other factors that could increase the
sales of niche products. For example, Tucker and Zhang (2009) find that information about product
popularity may benefit niche products with narrow appeal disproportionately and Oestreicher-Singer and
Sundararajan (2009) find that the sales of a product can be influenced by the position of the product in the
hyperlinked network of products.
There is also a growing body of literature that empirically examines sales distributions in various product
markets. Brynjolfsson, Hu, and Simester (2007) find that the sales distribution of an Internet channel is
less concentrated than that of a catalog channel, using data from a clothing retailer. Elberse and Oberholzer-Gee (2008) find that the demand for niche video products has increased, although they also find that a
large number of niche products have almost zero sales. Chellappa et al. (2007) have similar findings for
music sales. However, none of these papers addresses whether the Long Tail phenomenon is a temporary
or permanent shift, or how it might change over time. This paper answers these questions by comparing
the sales distribution of a similar profile of products over a sufficiently long period of time.
3. Data Analyses
The data for this paper come from a major publisher with annual sales of more than $1 billion. The publisher provided us with their Amazon sales and sales rank data on a sample of 1,598 titles over 10 weeks
from June to August 2008. Overall, we have 15,980 observations of Amazon sales and sales ranks. Table
1 compares the summary statistics for our 2000 and 2008 samples. It is clear that our 2008 sample has
more observations (15,980 vs. 901) and covers a broader spectrum of books (sales ranks of 71 to
5,350,140 versus 238 to 961,367) than our 2000 sample does.
3.1 Sample Matching
One may argue that our 2008 sample differs from our 2000 sample, and such a sample selection effect
could have confounded the findings from comparing these two samples. A standard way to control for
selection effects is the propensity score matching method suggested by Rosenbaum and Rubin (1983).
Rassler (2002) also provides many details on sample matching. The idea is to obtain a new 2008 sample
that matches our 2000 sample on observable dimensions. Such a sample matching approach reduces the
difference between two samples in an attempt to control for the selection effect.
Table 1: Summary Statistics for Our 2008 Sample and Our 2000 Sample

Variable              Obs.      Mean      S.D.      Min    Max
2008 Sample
  Weekly Sales        15,980    3.04      17.79     0      950
  Weekly Sales Rank   15,980    338,238   330,780   71     5,350,140
2000 Sample
  Weekly Sales        901       18.32     30.20     0      480
  Weekly Sales Rank   901       34,054    61,001    238    961,367
Specifically, to conduct this sample matching we use STATA’s propensity score matching module to
construct a sub-sample from our 2008 sample that matches our 2000 sample on the basis of weekly sales
rank. After using this technique, summary statistics for the 2008 matched sample (reported in Table 2) are
comparable to those for our 2000 sample (shown at the bottom of Table 1).
Table 2: Summary Statistics for 2008 Matched Sample

Variable              Obs.      Mean      S.D.      Min    Max
Weekly Sales          901       21.32     41.30     0      354
Weekly Sales Rank     901       34,056    60,999    226    961,146
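The matching idea can be illustrated with a small sketch. This is not STATA's propensity score matching module; it is a simplified stand-in that assumes matching on a single covariate can be approximated by greedy one-to-one nearest-neighbor matching on the log of sales rank. The function name `match_on_log_rank` and the example ranks are hypothetical and are not the paper's data.

```python
import math

def match_on_log_rank(target_ranks, pool_ranks):
    """For each target rank, pick the closest unused pool rank on a log scale.

    Greedy one-to-one nearest-neighbor matching without replacement.
    Returns a list of indices into pool_ranks, one per target rank.
    Assumes the pool has at least as many entries as the target list.
    """
    available = set(range(len(pool_ranks)))
    matched = []
    for r in target_ranks:
        best = min(
            available,
            key=lambda j: abs(math.log(pool_ranks[j]) - math.log(r)),
        )
        matched.append(best)
        available.remove(best)
    return matched

# Made-up ranks for illustration only (not the paper's observations).
ranks_2000 = [238, 5_000, 34_000, 961_367]
ranks_2008 = [71, 226, 4_800, 36_000, 338_000, 961_146, 5_350_140]

idx = match_on_log_rank(ranks_2000, ranks_2008)
matched_2008 = [ranks_2008[j] for j in idx]
print(matched_2008)
```

A matched sub-sample built this way keeps only 2008 observations whose ranks resemble those in the 2000 sample, which is why its summary statistics end up close to the 2000 ones.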
3.2 Re-estimating Amazon’s Long Tail
We now estimate the log-linear relationship between Amazon sales and sales rank, using the 2008
matched sample. The linear regression model we use is:
$$y_i = \beta_0 + \beta_1 x_i \tag{1}$$
where $y_i$ is ln(Weekly Sales), and $x_i$ is ln(Weekly Sales Rank). The results using the 2008 matched sample
are reported in Column (1) of Table 3, with the analogous results from our 2000 sample in Column (2).
Note that 41 observations in the 2008 matched sample have zero Weekly Sales and 40 observations in the
2000 sample have zero Weekly Sales. These observations are dropped after taking the natural log of
Weekly Sales. The coefficient on ln(Weekly Sales Rank) is -0.613 when the 2008 matched sample is used,
which is significantly smaller in size than the same coefficient when the 2000 sample is used (-0.871).
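As a minimal sketch of the estimation step in equation (1), the following fits ln(Weekly Sales) on ln(Weekly Sales Rank) by ordinary least squares and drops zero-sales observations, as described above. The function name `fit_log_linear` is hypothetical, and the data are synthetic points placed exactly on the 2000-sample line, so the fit recovers those coefficients by construction. It does not compute the robust standard errors reported in Table 3.

```python
import math

def fit_log_linear(sales, ranks):
    """OLS fit of ln(sales) = b0 + b1 * ln(rank); zero-sales rows are dropped."""
    pairs = [(math.log(r), math.log(s)) for s, r in zip(sales, ranks) if s > 0]
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1, n

# Synthetic data on the line ln(sales) = 10.526 - 0.871 * ln(rank),
# plus one zero-sales title that the log transform forces us to drop.
ranks = [100, 1_000, 10_000, 100_000, 500_000]
sales = [math.exp(10.526 - 0.871 * math.log(r)) for r in ranks[:-1]] + [0]

b0, b1, n_used = fit_log_linear(sales, ranks)
print(b0, b1, n_used)
```

Of the five synthetic titles only four enter the regression, mirroring how the zero-sales observations fall out of both samples.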
Table 3: Results of the Log-linear Regression

                         2008 Matched Sample   2000 Sample
                         (1)                   (2)
Constant                 8.046**               10.526**
                         (0.432)               (0.156)
Ln(Weekly Sales Rank)    -0.613**              -0.871**
                         (0.042)               (0.017)
Obs.                     860                   861
R2                       0.311                 0.801
Robust standard errors are in parentheses; ** Significantly different from zero, p<0.01; * p<0.05
Figure 1: Amazon’s Long Tail in 2008 vs. in 2000
[Figure omitted: two panels plotting Weekly Sales against Weekly Sales Rank for 2000 and 2008, first on a linear scale and then on a logarithmic scale.]
The smaller coefficient (in absolute value) in 2008 suggests that there is more weight in the “tail” of the 2008 sales distribution
than there was in 2000. This can be seen graphically in Figure 1, which shows the estimated log-linear
relationship between Amazon sales and sales rank, with the 2008 results in blue and 2000 results in red.
We plot these two curves first on a normal scale and then on a logarithmic scale. These two curves cross
when sales rank is 14,949. This means that popular books (with sales rank below 14,949) tend to sell
fewer copies in 2008 than in 2000, while niche titles (with sales rank above 14,949) tend to generate more
sales in 2008 than in 2000.
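The crossing point follows directly from the two fitted lines in Table 3: setting 8.046 - 0.613 x = 10.526 - 0.871 x and solving for x = ln(Weekly Sales Rank). A short sketch of that arithmetic is given below; the variable names are ours.

```python
import math

# Coefficients from Table 3 (constant, slope on ln(Weekly Sales Rank)).
b0_2008, b1_2008 = 8.046, -0.613
b0_2000, b1_2000 = 10.526, -0.871

# The lines intersect where b0_2008 + b1_2008 * x == b0_2000 + b1_2000 * x.
ln_rank_cross = (b0_2000 - b0_2008) / (b1_2008 - b1_2000)
rank_cross = math.exp(ln_rank_cross)

print(round(ln_rank_cross, 3), round(rank_cross))
```

At ranks lower than the crossing point the 2000 line lies above the 2008 line, and at higher ranks the ordering reverses.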
3.3 Toward A More Accurate Method of Estimating Amazon’s Long Tail
The log-linear regression method assumes that the coefficient on Log(Weekly Sales Rank) does not vary
as a book’s sales rank increases. To test this assumption, we fit the relationship between Log(Weekly
Sales) and Log(Weekly Sales Rank) to a series of splines, rather than just a single line. Such a spline fitting
technique allows the slope coefficient to vary as a book’s sales rank increases, leading to a more accurate
estimate of the size of Amazon’s Long Tail.
We note that our 2000 sample does not contain any observation with Weekly Sales Rank above 1 million.
In our 2008 sample, we have 569 observations with Weekly Sales Rank above 1 million, allowing us to
more accurately estimate the shape of Amazon’s Long Tail for books with sales ranks above 1 million.
We also note that books with Weekly Sales Rank above 1 million frequently have zero Weekly Sales. The
method in our 2003 paper relies on a linear regression of Log(Weekly Sales) on Log(Weekly Sales Rank).
Taking the natural log of Weekly Sales means that all observations with zero sales will be dropped. To
utilize these observations, we now use the following negative binomial regression model:
$$f(y_i \mid X_i) = \frac{e^{-\mu_i}\,\mu_i^{y_i}}{y_i!}, \qquad y_i = 0, 1, 2, 3, \ldots \tag{2}$$
where $y_i$ is Weekly Sales (a count, so that observations with zero sales are retained), $X_i$ is a vector of explanatory variables, and $E(y_i \mid X_i) = \mu_i$ is the conditional mean (Cameron and Trivedi 1998). We model the natural log of the conditional mean as a series
of n linear splines with n-1 knot points $k_1, k_2, \ldots, k_{n-1}$:
$$\ln(\mu_i) = \beta_0 + \beta_1 S_1(x_i) + \beta_2 S_2(x_i) + \cdots + \beta_n S_n(x_i) \tag{3}$$
where the l-th spline is
$$S_l(x_i) = (x_i - k_{l-1})\, I(x_i > k_{l-1}) - (x_i - k_l)\, I(x_i > k_l), \qquad l = 2, 3, \ldots, n-1,$$
while $S_1(x_i) = x_i - (x_i - k_1)\, I(x_i > k_1)$ and $S_n(x_i) = (x_i - k_{n-1})\, I(x_i > k_{n-1})$. Here $I(\cdot)$ is the indicator function.
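The spline terms in equation (3) can be sketched in a few lines. The function names `spline_basis` and `log_mean` are hypothetical. The knot values used in the example are made up for illustration, because the percentile values of the knots are not reported here. The coefficients are those later reported in Column (1) of Table 4. The sketch only evaluates the fitted curve; it does not estimate the model.

```python
def spline_basis(x, knots):
    """Return [S_1(x), ..., S_n(x)] for n - 1 sorted knots k_1 < ... < k_{n-1}."""
    if not knots:
        return [x]  # a single spline is just x itself

    def hinge(k):
        # (x - k) * I(x > k)
        return (x - k) if x > k else 0.0

    basis = [x - hinge(knots[0])]                  # S_1
    for lo, hi in zip(knots, knots[1:]):           # S_2 .. S_{n-1}
        basis.append(hinge(lo) - hinge(hi))
    basis.append(hinge(knots[-1]))                 # S_n
    return basis

def log_mean(x, beta0, betas, knots):
    """ln(mu) = beta0 + sum_l beta_l * S_l(x), as in equation (3)."""
    return beta0 + sum(b * s for b, s in zip(betas, spline_basis(x, knots)))

# Hypothetical knots on the ln(rank) scale; coefficients from Table 4, Column (1).
knots = [10.0, 12.0, 13.0]
beta0, betas = 10.083, [-0.782, -1.160, -1.217, -1.680]

basis_at_11 = spline_basis(11.0, knots)
slope_second_segment = (
    log_mean(11.5, beta0, betas, knots) - log_mean(11.0, beta0, betas, knots)
) / 0.5
print(basis_at_11, slope_second_segment)
```

The basis terms always sum to x, so each coefficient is the slope of ln(mu) within its own segment. This is what allows the slope to differ between popular and niche titles.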
Table 4: Results Using 2008 Sample, New Methodology vs. Old Methodology

Column (1): Negative Binomial Model and Four Splines
Column (2): Negative Binomial Model and One Spline
Column (3): Linear Regression and One Spline
Column (4): Negative Binomial Model and Two Splines
Column (5): Negative Binomial Model and Three Splines

             (1)         (2)         (3)         (4)         (5)
Constant     10.083**    102.189**   7.480**     10.709**    10.268**
             (0.211)     (0.147)     (0.092)     (0.168)     (0.190)
S1(xi)       -0.782**    -0.977**    -0.555**    -0.843**    -0.800**
             (0.019)     (0.012)     (0.008)     (0.015)     (0.017)
S2(xi)       -1.160**                            -1.539**    -1.281**
             (0.066)                             (0.044)     (0.055)
S3(xi)       -1.217**                                        -1.505**
             (0.099)                                         (0.079)
S4(xi)       -1.680**
             (0.113)
Obs.         15,980      15,980      7,668       15,980      15,980
Robust standard errors are in parentheses; ** Significantly different from zero, p<0.01; * p<0.05
In Column (1) of Table 4, we report the estimation results using a negative binomial regression and four
splines, with knot points at the 25th, 50th, and 75th percentiles of xi . These results show that the coefficients
on all four splines are negative and highly significant. In addition, the slope coefficient gradually changes
from -0.782 to -1.680, becoming more negative as the book’s sales rank increases. In other words, book
sales decrease at an increasingly faster pace, as we move from popular books to niche books. Using a
negative binomial regression and only one spline would result in a coefficient of -0.977 (shown in Column (2) of Table 4). Such a model would not have captured the accurate shape of Amazon’s sales distribution curve. Using the methodology in our 2003 paper — a linear regression and only one spline —
would result in an even less accurate shape of Amazon’s Long Tail (shown in Column (3) of Table 4). As
discussed earlier, applying a linear regression model would drop all observations with zero Weekly Sales,
leading to even more bias in the estimation results.
We have conducted robustness checks on the results shown in Column (1) of Table 4, by using different
numbers of splines with different knot points. For instance, we have used two splines with one knot
point at the 50th percentile of xi (shown in Column (4) of Table 4), and three splines with two knot points
at the 33rd and 67th percentiles of xi (shown in Column (5) of Table 4). We have consistently found results
that are qualitatively similar to ones in Column (1) of Table 4: the slope coefficient gradually becomes
more negative as the book’s sales rank increases.
3.4 Re-estimating The Size of Amazon’s Long Tail in 2008
Figure 2 illustrates the difference between the shape of the sales-rank curve using the new and the old
estimation methodology (linear regression and one spline), with the curve using the new methodology in
red and the curve using the old methodology in green. With our new data we can see that the methodology
used by Brynjolfsson, Hu, and Smith (2003), a linear regression with one spline, could have overestimated the size of Amazon’s Long Tail. This is because the assumption that the coefficient on
Log(Weekly Sales Rank) does not change as a book’s sales rank increases does not seem to hold, and because using a linear regression drops observations with zero weekly sales.
Our new methodology allows us to fit the relationship between Log(Weekly Sales) and Log(Weekly
Sales Rank) more accurately. To obtain an accurate estimate of the total sales and the sales generated by
books ranked above 100,000, we integrate under the curve implied by the estimates in Column (1) of Table 4 and
find that books ranked above 100,000 account for 36.7% of Amazon’s total sales in 2008. The estimates
in Column (3) of Table 4 using the old methodology would have implied that books ranked above
100,000 account for 82.57% of Amazon’s total sales in 2008.
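To make the "integrate under the curve" step concrete, the sketch below computes the tail share for the simplest case, a single power law with the old-methodology slope of -0.555. It is an illustration under stated assumptions rather than the paper's calculation. The function name `tail_share` is hypothetical. The continuous integral starting at rank 1 stands in for a sum over ranks. The upper limit of 5 million titles is our assumption, taken from the upper end of the books-in-print range cited in this section.

```python
def tail_share(slope, cutoff, max_rank):
    """Share of total sales from ranks above `cutoff` when sales(r) = A * r**slope.

    Uses the closed-form integral of r**slope between 1 and max_rank.
    The scale factor A cancels in the ratio. Requires slope != -1.
    """
    a = slope + 1.0
    total = (max_rank ** a - 1.0) / a
    tail = (max_rank ** a - cutoff ** a) / a
    return tail / total

# Old-methodology slope from Table 4, Column (3); max_rank is an assumption.
share_old = tail_share(slope=-0.555, cutoff=100_000, max_rank=5_000_000)
print(f"{share_old:.1%}")
```

Under these assumptions the share lands in the low 80s percent, the same neighborhood as the old-methodology figure quoted above. The spline-based estimate requires integrating each segment separately with its own slope, which needs the knot locations.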
We use this and our other calculations to estimate the consumer surplus gain from “Long Tail” books in
2008 and compare it to our calculations in 2000. In our prior work, we used 100,000 as the cutoff point
for “niche” books under the argument that the largest book superstores typically only carry that number of
unique titles. The physical stock capacity of bookstores has changed little between 2000 and now, and
thus we used this cutoff point to recalculate the consumer surplus generated from selling niche books on
the Internet.
Figure 2: Amazon’s Long Tail in 2008, Using New and Old Methodologies
[Figure omitted: two panels plotting Weekly Sales against Weekly Sales Rank for the 2008 new-methodology and old-methodology curves, first on a linear scale and then on a logarithmic scale.]
However, while the stocking capacity of bookstores has remained relatively constant since 2000, several
changes have happened in the intervening eight years. First, according to Books in Print, the number of
books in print has increased from 2.3 million in 2000 to 3-5 million in 2008. Second, book industry revenue has climbed from $24.6 billion to $37.3 billion. Third, the share of book purchases through the Internet channel has risen from 6% in 2000 to 21-30% in 2008. Combining these changes with the new estimates of the percentage of sales in the Long Tail, we estimate that selling niche books that are unavailable
in brick-and-mortar stores leads to a consumer surplus of $3.93 billion to $5.04 billion in the year 2008.
These estimates are about five times the estimates in Brynjolfsson, Hu, and Smith (2003), even though
the estimates in that paper are likely to have been overestimates.
4. Conclusions
This paper analyzes how the shape of Amazon’s Long Tail has changed over time. We collect data in
2008 on a larger and broader sample of books than was available in the 2000 sample used by our 2003
paper. Our new data and new methodology – negative binomial regression with a series of splines – allow
us to fit the relationship between sales and sales rank with much greater accuracy. This paper presents two
important new findings. First, Amazon’s Long Tail grew significantly longer from 2000 to 2008, and
overall consumer surplus gains from product variety at Amazon increased five-fold from 2000 to
2008. This finding suggests that Amazon’s Long Tail phenomenon, which was first discussed by our
2003 paper, is likely to be a permanent shift instead of a short-lived phenomenon. Second, while previous
research has assumed a constant slope between the log of sales and the log of sales rank, we find that the
sales of a book drop at a faster speed than a regular power law (or a log-linear curve) indicates and that
the slope becomes steeper as a book’s sales rank increases. This finding suggests that there may be forces
that limit Amazon’s ability to sell books that are extremely niche. Future research is needed in order to
better understand the nature of these forces.
References
Anderson, C. 2004. The Long Tail. Wired Magazine 12(10) 170–177.
Brynjolfsson, E., Y. J. Hu, M. D. Smith. 2003. Consumer surplus in the digital economy: Estimating the
value of increased product variety at online booksellers. Management Science 49(11) 1580-1596.
Brynjolfsson, E., Y. J. Hu, M. D. Smith. 2006. From niches to riches: The anatomy of the long tail. Sloan
Management Review 47(4) 67-71.
Brynjolfsson, E., Y. J. Hu, D. Simester. 2007. Goodbye Pareto principle, hello long tail: The effect of
search costs on the concentration of product sales. MIT Sloan Working Paper.
Cachon, G. P., C. Terwiesch, Y. Xu. 2008. On the effects of consumer search and firm entry in a multiproduct competitive market. Marketing Science 27(3) 461-473.
Chellappa, R., B. Konsynski, V. Sambamurthy, S. Shivendu. 2007. An empirical study of the myths and
facts of digitization in the music industry. Workshop on Information Systems and Economics,
Montreal, Canada.
Chevalier, J., A. Goolsbee. 2003. Price competition online: Amazon versus Barnes and Noble. Quantitative Marketing and Economics 1(2) 203-222.
Chevalier, J., D. Mayzlin. 2006. The effect of word of mouth online: Online book reviews. Journal of
Marketing Research 43(3) 345-354.
Dhar, V., G. Oestreicher-Singer, A. Sundararajan, A. Umyarov. 2009. The gestalt in graphs: Prediction using
economic networks. Working paper, New York University.
Elberse, A., F. Oberholzer-Gee. 2008. Superstars and underdogs: An examination of the long tail phenomenon in video sales. Working Paper.
Frank, R., P. Cook. 1995. The Winner-Take-All Society: Why the Few at the Top Get So Much More Than
the Rest of Us. Penguin, New York, NY.
Fleder, D., K. Hosanagar. 2009. Blockbuster culture’s next rise and fall: The impact of recommender systems on sales diversity. Management Science 55(5) 697-712.
Ghose, A., M.D. Smith, R. Telang. 2006. Internet exchanges for used books: An empirical analysis of
product cannibalization and welfare impact. Information Systems Research 17(1) 3-19.
Gomes, L. 2006. It may be a long time before the long tail is wagging the web. Wall Street Journal July
26.
Moore, G. 2002. Crossing the chasm. HarperCollins Publishers, New York, NY.
Oestreicher-Singer, G., A. Sundararajan. 2009. Recommendation networks and the Long Tail of electronic commerce. Working paper, New York University.
Orlowski, A. 2008. Chopping the long tail down to size. The Register Nov 7.
Rassler, S. 2002. Statistical Matching: A Frequentist Theory, Practical Applications and Alternative
Bayesian Approaches. Springer, New York, NY.
Rosen, S. 1981. The economics of superstars. American Economic Review 71(5) 845-858.
Rosenbaum, P., D. Rubin. 1983. The central role of the propensity score in observational studies for
causal effects. Biometrika 70(1) 41-55.
Tucker, C., J. Zhang. 2009. How does popularity information affect choices? A field experiment. MIT
Sloan Working Paper.