Network Structure and the Long Tail of E-Commerce Demand
(Extended Summary for the 2nd Annual Statistical Challenges in Electronic Commerce Research Symposium)

Gal Oestreicher-Singer, New York University, goestrei@stern.nyu.edu
Arun Sundararajan, New York University, arun@stern.nyu.edu

1. OVERVIEW AND RESEARCH AGENDA

A number of networks affect economic outcomes in electronic commerce. Some of these can describe the relationships between consumers who communicate product information and influence each other's purchasing, others can describe how the demands for different products are related based on shared purchasing patterns, and yet others may describe the patterns of trade between firms.

A good example of such a structure is the network of product pages on an ecommerce site. Each product on an ecommerce site has a network position, which is determined by the products it links to and those that link to it. If one imagines the process of browsing an ecommerce site as being analogous to walking the aisles of a physical store, then the ecommerce aisle structure is this graph of interconnected products, and the network position of a product in this graph is analogous to its virtual shelf placement. A product that is linked to by an intrinsically popular one is likely to enjoy an increase in sales on account of this aspect of its network position. A product linked to by hundreds of others is likely to get more “network traffic” than one linked to by just a few. Thus, both the structure of the networks and the nodes that comprise them seem to matter.

We measure the extent to which the position of a product in such a network structure will affect its demand, based on the idea that the network structure redirects the flow of consumer attention, which results in a redistribution of traffic and demand. The manner in which attention is redistributed can be measured using certain properties of the network structure, and these properties can be associated with observed variations in both individual and aggregate product demand. One specific prediction of our theory is that network structures with common degree distributions will even out traffic between products (nodes), thereby reducing demand inequity between products.

This summary reports on how a particular network structure affects the demand distributions of over 240,000 products within 186 distinct categories on Amazon.com. Briefly, we compute the Weighted PageRank for each node of a composite of 14 daily instances of Amazon.com’s co-purchase network. We characterize the demand distribution of each category using its Gini coefficient. We show that when network structure has a greater influence on a category (when the average Weighted PageRank is higher), its demand distribution displays significantly lower inequity (its Gini coefficient is significantly lower). In other words, the presence of the network structure flattens the ecommerce demand distribution. This provides an additional explanation for the widely documented long tail of ecommerce demand.

2. PRIMARY DATA

Over the last year, we have collected product, pricing, demand and “network” information for a set of books sold on Amazon.com. Each book sold on Amazon.com has an associated webpage. Each page carries a set of “co-purchase links”: the products that were co-purchased most frequently with that product on Amazon.com. This set is listed under the title “Customers who bought this also bought:” and is limited to 5 items.

We gather data about this graph using a Java-based crawler, which starts from a popular book and follows the co-purchase links using a depth-first algorithm. At each page, the crawler records the following data about each book: ASIN, List Price, Sale Price, co-purchases, SalesRank and Category Affiliation¹. The depth-first algorithm terminates only once the entire connected component of the graph is collected.

We gather this data every day. Following the collection of each graph, we track the sales rank of each product on the graph for the next 24 hours. The resulting graphs have between 240,795 and 243,232 distinct nodes, with an average of 242,154 distinct books. The results presented in this paper are based on the daily data collected during two distinct two-week periods, spaced six months apart: 8/10/2005 to 8/23/2005, and 2/10/2006 to 2/23/2006.

¹ Amazon uses a hierarchy of categories to classify its books. The most general partition has 31 distinct categories; the second most general partition has 288 categories. We use the latter to define category affiliation. Of the 288 distinct categories, 186 had at least 100 books.

3. WEIGHTED PAGERANK AND GINI COEFFICIENTS

To obtain a more stable measure of the influence of the network across a two-week window, we construct an average graph for each two-week period: a composite of the 14 daily co-purchase graphs, formed by grouping all the links appearing in the 14 graphs into one weighted graph. The weight on each link represents the fraction of daily graphs in which it appears.

Our measure of the extent to which the network influences a product is based on the PageRank formula (Brin and Page, 1998). While this measure is widely used in ranking algorithms (such as Google’s), we use the fact that, fundamentally, PageRank measures the probability that a “random surfer” will arrive at a hyperlinked page if he were to traverse just the hyperlinks of the network. Thus, a product with a higher PageRank is more likely to get traffic from the network than one with a lower PageRank.

We adapt the original PageRank algorithm of Brin and Page to account for our weighted composite graphs, using the following formula:

$$\mathrm{WeightedPR}(i) = (1-\alpha) + \alpha \sum_{j \in G(i)} \mathrm{Weight}(j,i)\left(\frac{\mathrm{WeightedPR}(j)}{\mathrm{OutDegree}(j)}\right),$$

and we recursively compute the Weighted PageRank for each book in our data set, for each two-week time period. Interestingly, PageRank and SalesRank are not very well correlated, as illustrated in Figure 1.

[Figure 1: SalesRank vs. PageRank (for a sample). Axes: SalesRank, 100 to 1,000,000; PageRank, 1.0E-07 to 1.0E-03.]

Next, we characterize the variation in the demand distribution across categories by computing the Gini coefficient for each category. This is a measure of distributional inequality, a number between 0 and 1, where 0 corresponds with perfect equality (in our case, where all the books in that category have the same demand) and 1 corresponds with perfect inequality (where one book has all the demand, and all other books have zero demand). One first ranks the books in the category in increasing order of their demand and plots the cumulative share of demand against the cumulative share of books, thereby constructing the Lorenz curve. The Gini coefficient is then calculated as the ratio of the area between the “perfect equality” line and the Lorenz curve to the area under the “perfect equality” line, illustrated for two categories in Figure 2. This coefficient is especially suitable for this study because it measures inequality in the demand distribution regardless of the category’s average demand, which facilitates comparing different categories despite their intrinsic differences and independent of their scale.

[Figure 2: Lorenz curves, Gini coefficients for two categories. Panels: Environmental and Natural Resources Law (gini=0.22); Publishing and Books (gini=0.82). Both axes run from 0 to 1.0; the area between the “perfect equality” line and the Lorenz curve is labeled gini/2.]

4. NETWORK STRUCTURE AND DEMAND EQUALITY

We estimate the following model using OLS regression:

$$\mathrm{Log}[\mathrm{Gini}] = a + b_1\,\mathrm{Log}[\mathrm{NumBooks}] + b_2\,\mathrm{Log}[\mathrm{AvgDemand}] + b_3\,\mathrm{Log}[\mathrm{AvgPR}] + b_4\,\mathrm{Log}[\mathrm{VariancePR}] + b_5\,\mathrm{Log}[\mathrm{VariancePrice}]$$

The results of the regression are presented in Table 1. Four variables explain variation in the Gini coefficient, and the explanatory power of the model is extremely high (R² between 83% and 87%). We find that an increase in the average PageRank of products in the category has a negative effect on the Gini coefficient. This confirms a main conjecture: an increase in the extent to which network structure influences demand flattens the distribution of demand, or leads to a longer tail for demand, a phenomenon widely observed in electronic commerce (Anderson, 2004).

Our other salient results are: (1) an increase in the variance in the extent to which the network influences products within a category increases the category’s demand inequity (which makes intuitive sense in the context of our theory of the network “flattening” demand), (2) the number of products in a category is positively associated with demand inequity, and (3) the average demand within a category is associated with an increase in the category’s demand inequity. Further details are available on request.

Table 1: How network structure affects demand inequity
Dependent Variable: Log[Gini]. Entries are estimated values, with standard errors in parentheses.

| Variable | Coefficient | Aug-05 | Feb-06 |
|---|---|---|---|
| Constant | a | -0.85 (0.2)*** | -1.89 (0.57)** |
| Log[NumberOfBooks] | b1 | 0.011 (0.002)*** | 0.036 (0.007)*** |
| Log[AverageDemand] | b2 | 0.11 (0.004)*** | 0.28 (0.01)*** |
| Log[AveragePageRank] | b3 | -0.16 (0.05)** | -0.078 (0.02)*** |
| Log[VariancePageRank] | b4 | 0.02 (0.004)*** | 0.04 (0.01)*** |
| Log[VariancePrice] | b5 | -0.002 (0.002) | -0.007 (0.01) |
| Log[AveragePrice] | b6 | -0.004 (0.007) | -0.003 (0.02) |
| R² | | 87% | 83% |

5. KEY STATISTICAL CHALLENGES

This study analyzes a unique and new time series of massive ecommerce graphs, combining methods from economic theory, econometrics and computer science. Our analysis of this data set raises many new statistical challenges.

First, techniques that can reliably sample the graphs we observe while preserving key network properties would facilitate easier analysis. Second, while the use of a depth-first search algorithm for graph collection allows us to gather the nodes of the giant component efficiently, it does not preserve all the in-links of the graph, omitting those that are not traversed, and statistical methods for assessing the amount of lost information would be valuable. Finally, the methods we use for analyzing our time series of graphs were developed for the econometric analysis of panel data and time series of vector (or Euclidean) data rather than for a time series of graphs.

As the “black box” of network externalities is opened further (Kauffman et al., 2000), the analysis of networked ecommerce data will continue to grow, and thus statistical techniques that are explicitly developed to study dynamic graphs would be of great value. We hope our research agenda and our data set facilitate or motivate addressing some of these challenges.
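The composite graph, Weighted PageRank and Gini coefficient described in Section 3 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used in the study. The function names, the damping value alpha = 0.85, the reading of G(i) as the set of books that link to i, the reading of OutDegree(j) as the number of distinct out-links of j in the composite graph, the fixed-point iteration used for the recursive computation, and the closed-form discrete Gini formula are all choices made for the example.

```python
from collections import defaultdict


def composite_graph(daily_graphs):
    """Merge daily edge lists into one weighted graph.

    Each daily graph is an iterable of (source, target) links. The weight of
    a link is the fraction of daily graphs in which it appears.
    """
    counts = defaultdict(int)
    for edges in daily_graphs:
        for edge in set(edges):  # count each link at most once per day
            counts[edge] += 1
    n_days = len(daily_graphs)
    return {edge: count / n_days for edge, count in counts.items()}


def weighted_pagerank(weights, alpha=0.85, max_iter=200, tol=1e-12):
    """Iterate PR(i) = (1 - alpha) + alpha * sum_j W(j, i) * PR(j) / OutDegree(j).

    `weights` maps (j, i) links to weights in (0, 1]. alpha = 0.85 is an
    assumed damping value; the source does not state the value it uses.
    """
    nodes = {node for edge in weights for node in edge}
    out_degree = defaultdict(int)   # assumed: number of distinct out-links
    in_links = defaultdict(list)    # i -> [(j, weight of link j -> i)]
    for (j, i), w in weights.items():
        out_degree[j] += 1
        in_links[i].append((j, w))

    pr = {node: 1.0 for node in nodes}
    for _ in range(max_iter):
        new_pr = {
            i: (1 - alpha)
            + alpha * sum(w * pr[j] / out_degree[j] for j, w in in_links[i])
            for i in nodes
        }
        delta = max(abs(new_pr[n] - pr[n]) for n in nodes)
        pr = new_pr
        if delta < tol:
            break
    return pr


def gini(demands):
    """Gini coefficient of a list of non-negative demands.

    Uses the standard discrete formula on values sorted in increasing order:
    G = 2 * sum(rank * x) / (n * total) - (n + 1) / n, with rank = 1..n.
    For finite n the maximum attainable value is (n - 1) / n.
    """
    xs = sorted(demands)
    n, total = len(xs), sum(xs)
    if n == 0 or total == 0:
        return 0.0
    ranked_sum = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * ranked_sum / (n * total) - (n + 1) / n
```

As a quick check of the Gini sketch, a category whose four books each have demand 1 gives gini([1, 1, 1, 1]) = 0, while concentrating all demand in one of four books gives 0.75, the maximum (n - 1)/n for n = 4. A two-day toy input such as composite_graph([[("A", "B"), ("B", "A")], [("A", "B")]]) assigns weight 1.0 to the link A→B and 0.5 to B→A, so weighted_pagerank ranks B above A.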