Navigating by the Stars: What Do Online User Ratings Reveal About Product Quality?

Abstract

Consumers increasingly turn to online user ratings to inform purchase decisions, but little is known about whether these ratings are valid indicators of product quality. We developed a database of (1) quality scores from Consumer Reports, (2) user ratings from Amazon.com, (3) selling prices from Amazon.com, and (4) brand perceptions from a proprietary industry survey. Analyses reveal that the average and number of user ratings are only weakly related to quality and far less diagnostic than price (Study 1). Yet, when consumers infer quality, they rely mostly on the average user rating and much less on the number of ratings and price (Study 2). The dissociation between user ratings and quality can be traced in part to the influence of firm marketing actions on users' evaluations. Controlling for quality, average user ratings are higher for more expensive products, higher for brands with a reputation for offering more functional and emotional benefits, and lower for brands with a reputation for affordability (Study 3). We conclude that consumer trust in online user ratings is largely misplaced.

Key Words: online user ratings, quality inferences, consumer learning, brand image, price-quality heuristic

Consumers typically have to make an inference about a product's quality before buying. Traditionally, consumers have drawn on marketer-controlled variables like price and brand name to make such inferences (Rao and Monroe 1989), but the consumer decision context has changed radically over the last several years. One of the most important changes has been the proliferation of user-generated content (Keen 2008). Almost all retailers now provide user reviews and ratings on their websites, and these reviews often play an important role in the purchase process (Mayzlin, Dover, and Chevalier 2012).

Consider, for instance, a purchase decision recently faced by one of the authors: which car seat to buy for his 1-year-old daughter. A search on Amazon.com turns up 156 options that can be sorted by average user rating. At the top of the list is the "Britax Frontier 85," with an average user rating of 4.60 (out of 5). Much further down the list is the "Cosco Highback Booster," with an average rating of 4.10. On its face, the Britax option seems to be of higher quality. This interpretation rests on the assumption that the average user rating accurately reflects product quality, but is this a valid assumption? Consumer Reports provides expert ratings of quality for 106 car seats. These ratings are based on rigorous tests on dimensions like crash protection, ease of use, and fit to vehicle. Surprisingly, the option with the marginal consumer rating on Amazon.com is the top-rated option on Consumer Reports, with a score of 69 out of 100. The option that was so favorably reviewed by consumers on Amazon.com fares much more poorly on Consumer Reports: with a quality score of 53, it is second to worst in the category. This is an anecdotal example, but it highlights the need for research that evaluates the utility of user ratings for inferring quality.

THE RISE OF ONLINE USER RATINGS

Consumers increasingly rely on user ratings to inform their purchase decisions (Grant 2013; Mayzlin, Dover, and Chevalier 2012).
A 2012 survey of 28,000 respondents in 56 countries found that online user ratings are the second most trusted source of brand information (after recommendations from family and friends), with 70% of respondents indicating trust in them (Nielsen 2012). Similarly, the 2012 Local Consumer Review Survey of 2,862 panel members in the U.S., U.K., and Canada found that 72% of respondents said they trust online reviews as much as personal recommendations, and 52% said that positive online reviews make them more likely to use a local business (Anderson 2012). A recent study conducted by Weber Shandwick in conjunction with KRC Research found that 65% of potential consumer electronics purchasers were inspired by customer reviews to purchase a brand that was not in their original consideration set. Moreover, consumer electronics buyers are more than three times as likely to consult user reviews as expert reviews when making a purchase decision (Loechner 2013).

This influence cuts across product and service categories. In addition to electronics, online product ratings have been shown to influence consumers' choice of books (Chevalier and Mayzlin 2006; Forman, Ghose, and Wiesenfeld 2008), movies (Chintagunta, Gopinath, and Venkataraman 2010; Duan, Gu, and Whinston 2008), restaurants (comScore 2007; Luca 2011; Anderson and Magruder 2012), online video games (Zhu and Zhang 2010), bath, fragrance, and beauty products (Moe and Trusov 2011), and hotel, travel, automotive, home, medical, and legal services (comScore 2007). The effects on restaurant patronage are illustrative: a one-star increase in a restaurant's rating on Yelp.com was found to lead to a sales increase of between 5% and 9% (Luca 2011) and a 19% increase in the likelihood of selling out of tables during rush hour (Anderson and Magruder 2012). The importance of consumer ratings to companies is further reflected in the fact that they affect stock market performance (Tirunillai and Tellis 2012).

Because of their influence with consumers, online ratings have become important inputs to managerial decision-making. A recent survey of 1,734 Chief Marketing Officers spanning 19 industries and 64 countries found that 48% of CMOs formally track consumer online ratings (IBM Global Business Services 2011). Managers use consumer online ratings as input for decision-making in areas such as brand building, promotion, customer acquisition, customer retention, product development, and quality assurance (Hu, Pavlou, and Zhang 2006).

ARE USER RATINGS PREDICTIVE OF QUALITY?

The steep rise of user ratings suggests that consumers think they are valuable for making good decisions. A few recent papers in the marketing literature have challenged this assumption (Hu et al. 2006; Koh, Hu, and Clemons 2010), but little work has empirically examined the correspondence between user ratings and product quality, and the work that exists has examined only limited data sets (Chen and Xie 2008). A key objective of the present investigation is to address this void in the literature by analyzing the correspondence between user ratings and technical quality across a wide range of product categories.
Following a long tradition in the fields of marketing (Hardie, Johnson, and Fader 1993; Lichtenstein and Burton 1989; Mitra and Golder 2006; Tellis and Wernerfelt 1987), psychology (Wilson and Schooler 1991), and economics (Bagwell and Riordan 1991), we use the quality scores generated by Consumer Reports as the most accurate indicator of technical product quality (Zeithaml 1988). Consumer Reports is an independent source that is not allied in any way to any group of firms and "has a scientific approach to analyzing quality through blind laboratory studies, which in scope and consistency is unrivaled in the U.S. and in the world" (Tellis and Wernerfelt 1987, p. 244).

There is an intuitive rationale for believing that the average user rating is a good cue to product quality. Use experience is a direct signal of quality and, unlike marketer-controlled sources of information such as advertising, it promises a reliable and unbiased perspective. However, a review of the academic marketing literature reveals a variety of factors that may undermine this rationale. We distinguish between two broad classes of factors that may limit the utility of average user ratings and expand on each of them below. First, there are statistical concerns associated with the distribution of ratings. The average user rating only has value to the extent that it provides a relatively precise and unbiased estimate of the population average. There are several reasons to doubt this is generally the case, some of them documented in previous research and some proposed here. Second, there are psychological reasons that consumers may be unable to provide an unbiased perspective on their use experience. Intrinsic product quality is often hard to discern, and consumers are likely to draw on extrinsic cues when forming their evaluations.

Previous research on the validity of user ratings has focused primarily on statistical concerns associated with the non-representativeness of the distribution of ratings for the population. Hu et al. (2006) found that review writers are more likely to be those who "brag" or "moan" about their product experience, resulting in a bimodal distribution of ratings whose average does not give a good indication of the true population average. They see this non-representativeness as a serious challenge to the validity of online reviews, going so far as to say "… the most commonly used factor in the literature—average review score—is mostly an incorrect proxy of a product's true quality. Making this unwarranted assumption can possibly lead to erroneous conclusions about consumers' online shopping behaviors and wrong marketing decisions" (p. 328). Similarly, Koh, Hu, and Clemons (2010) provide evidence of cross-cultural influences on the propensity to write a review, suggesting that reviews are not representative and, as a consequence, do not validly reflect true quality. Another issue leading to non-representativeness is review manipulation. Firms (or their agents) sometimes post fictitious favorable reviews for their own products and services and/or post fictitious negative reviews for the products and services of their competitors (Mayzlin, Dover, and Chevalier 2012). Additionally, raters are influenced by previously posted ratings, creating herding effects (Moe and Trusov 2011; Muchnik, Aral, and Taylor 2013; Schlosser 2005).

Non-representativeness is not the only statistical concern associated with average user ratings.
A summary statistic can only provide evidence about a population parameter to the extent that it is precise. Precision is a function of two elements: sample size and variability in the distribution. The average user rating should therefore provide a better indication of product quality as the sample size of reviews increases and the variability of the underlying distribution decreases. Unfortunately, average review ratings are often based on small samples (Hu et al. 2006). There are also reasons to suspect that variability tends to be quite high. User experiences are subject to considerable variation. Users may give a poor rating due to a bad experience with shipping, may accidentally review the wrong product, or may blame a product for a failure that is actually due to user error. Some consumers may view the purpose of product reviews differently than others. For instance, some consumers may rate purchase value (quality for the money) rather than product quality, thereby penalizing more costly brands, whereas others may rate quality without considering price. Perhaps most fundamentally, there may be substantial heterogeneity in consumer taste. All of these factors suggest that the average may often be too imprecise to support a good estimate of quality.

Beyond these statistical issues, there are psychological factors that cast doubt on the ability of consumers to discern quality through use experience and post an unbiased rating. Numerous studies have shown that marketing actions have a major influence on perceived quality. When consumers drink from labeled bottles, they rate the taste of their favorite beer brand more favorably than the taste of other brands, but not so when they drink from unlabeled bottles (Allison and Uhl 1964). Moreover, consumer memory is fallible, and post-experience marketing actions can bias product evaluations. Favorable post-experience advertising makes memory for the quality of a bad-tasting orange juice more positive (Braun 1999). Consumers may also engage in motivated reasoning to justify buying certain kinds of products, such as those that are expensive or those made by a favored brand (Jain and Maheswaran 2000; Kunda 1990). This also biases evaluations because a product receives a higher rating by being more expensive or by being manufactured by a favored brand, independent of its quality.

These "top-down" influences of extrinsic cues on quality perceptions are most pronounced when quality is difficult to observe. There is good reason to believe that this is often the case. Product performance on important dimensions is often revealed only under exceptional circumstances. For instance, when considering a car seat, new parents would likely place a high value on crash protection, an attribute that they will hopefully never be in a position to evaluate. More generally, product quality may be difficult to evaluate in many categories (e.g., deck stains, tires, batteries), especially in the short time-course between purchase and review posting, typically only days or weeks.

Though most research in the literature has focused on the average user rating because it is assumed this is what consumers pay the most attention to, there is another feature of the distribution of user ratings that may relate to quality and is readily available to consumers when they evaluate products online: the number of reviews. Retailers frequently use promotional phrases such as "Over 10,000 Sold" as an indication that the product is one of quality and that other consumers see it as such.
The literature on social proof shows that consumers are strongly influenced by the behavior of others (Cialdini 2001). This raises the possibility that the number of user ratings provides information about quality independently of the average rating. To our knowledge, no previous research has addressed this possibility. Thus, a secondary objective of our analyses is to assess the extent to which the number of reviews is predictive of quality.

RESEARCH OBJECTIVES AND OVERVIEW OF STUDIES

The research has three broad goals. First, we aim to understand the predictive validity of user ratings for technical product quality in the marketplace. Second, we aim to evaluate consumer beliefs about the predictive validity of user ratings for technical quality. Third, we test the conjecture that user ratings are contaminated by the biasing influence of marketing actions. Together the studies allow us to assess disconnects between consumer beliefs and reality in order to draw implications for consumer welfare and to make prescriptions for better decision-making.

In Study 1, we examine the extent to which the average and number of user ratings are predictive of product quality. Given the history of studies investigating the relationship between technical product quality as measured by Consumer Reports and price (for reviews, see Lichtenstein and Burton 1989; Tellis and Wernerfelt 1987), we include selling prices in our analysis. Numerous studies dating back to the 1950s indicate that the correlation between price and technical quality is approximately 0.20 to 0.30 (Tellis and Wernerfelt 1987). We anticipate a similar relationship between price and quality in our study, and we use the predictive validity of price as a reference benchmark for evaluating the predictive validity of user ratings. Recognizing that previous research has focused on sample non-representativeness as a challenge to the validity of user ratings, we also examine whether the predictive validity of the average rating varies as a function of its precision. As discussed above, we expect user ratings to be more predictive of quality when they are based on larger samples and/or when the distribution of ratings has lower variability. This study thus advances the literature by documenting an additional challenge related to the statistical precision of the average.

Given the increasing importance of user ratings to consumer decision-making, it is likely that consumers trust user ratings as indicators of quality, but no previous research has tried to quantify this belief or compare its magnitude to the price-quality heuristic. In Study 2, we assess whether consumers have accurate intuitions about the relative predictive validity of the different cues to quality. Using an ecologically valid experimental method, we ask consumers to judge the quality of products that vary in average user rating, number of ratings, and price. This allows us to compute the partial effect of each cue on quality perceptions and to compare the effect sizes to those found in the market data in Study 1. A secondary goal of this study is to assess whether consumers put more faith in user ratings that have greater precision. Many studies have shown that consumers' naïve statistical intuitions are flawed. They often jump to conclusions based on summary statistics from small, inconclusive samples (Obrecht, Chapman, and Gelman 2007; Tversky and Kahneman 1971).
Thus, we predict that consumers will not adjust their quality inferences enough to account for differences in precision. Finally, this study adds to the literature by being the first to evaluate the extent to which consumers rely on price as a cue for quality in an environment where user ratings are simultaneously available.

As reviewed above, product experience often provides only a noisy signal of quality, and a consumer evaluating the product resolves this uncertainty by drawing on prior beliefs, for example, that a higher price means higher quality. Thus, a high-priced product or one from a good brand might get the "benefit of the doubt" in the face of a less than favorable experience. Similarly, a consumer might expect an inexpensive product to fail and be more sensitive to any flaws. A related phenomenon is that consumers may be less likely to admit the faults of products that they paid a lot for or that are manufactured by a favored brand. These processes suggest that user ratings include a signal not just of technical quality but also of extrinsic marketing variables. Variables such as price and brand name that positively correlate with expectations for performance should positively bias user ratings. In Study 3, we supplement the database from Study 1 with brand perception data. We present a new regression analysis that treats the average user rating as the dependent variable and marketing variables (price and brand perceptions) as the independent variables. By also including Consumer Reports quality scores as predictors in the model, we control for technical quality and isolate the contaminating influence of marketing actions on average user ratings.

STUDY 1: THE MARKETPLACE

The goal of Study 1 was to examine the extent to which the average user rating and the number of user ratings are valid cues for product quality in the marketplace and to compare their predictive validities with that of price. For this purpose, we developed a database of quality ratings from Consumer Reports for products across a wide range of product categories and their corresponding average user ratings, number of ratings, and selling prices on Amazon.com.

Consumers typically evaluate products in a comparative context; they want to know how good one product is relative to another, rather than the absolute quality of a product. For instance, a consumer might ask how much of a quality difference can be expected if one product has an average user rating that is 0.50 higher than another, or if it is rated 20 times more frequently than another, or if it costs $50 more than another. In light of this, we used a comparative approach to analyze the data. Rather than regressing the quality rating of an individual product on the three predictive cues, we regressed the difference in quality between each pair of products within a category on the difference in magnitude of the cues between the products.

Data

We visited the website of Consumer Reports (ConsumerReports.org) in February 2012 and extracted quality ratings for all items within all product categories where Consumer Reports provides these data, except for automobiles and wine. This resulted in ratings for 3817 items across 262 product categories. To ensure that product categories were relatively homogeneous (and quality ratings were comparable across items within a category), we defined product categories at the lowest level of abstraction.
For example, Consumer Reports provides product ratings for air conditioners subcategorized by BTUs (e.g., 5,000 to 6,500 as opposed to 7,000 to 8,200). That is, brands are rated only relative to other brands in the subcategory. Thus, we treated each subcategory as a separate product category. We augmented this dataset with current prices and user ratings from Amazon.com collected in June 2012. For each item for which we had a technical quality score from Consumer Reports, we searched the Amazon.com website and recorded all user ratings and the price. We were able to find selling prices and at least one Amazon.com consumer rating for 1669 items across 205 product categories. To ensure that our conclusions were not disproportionately influenced by items with a very small number of user ratings, we retained only items that were rated at least 5 times. This resulted in 1385 items across 192 product categories. Because our goal was to model brand pairs instead of individual brands, we retained only product categories with data for at least three items. The final dataset consisted of 1280 items across 121 product categories.

We then used these data to generate a pair-wise comparative database, which served as the basis for all subsequent analyses. We took each product in a category and compared it to every other item in the category. We then sorted all product pairs in alphabetical order such that product A was the product that came first in alphabetical order and product B was the product that came second. For each product pair, we computed the difference between product A and product B in terms of quality score (ΔQ), average user rating (ΔA), number of user ratings (ΔN), and price (ΔP). To make effect sizes comparable across cues, we transformed all difference scores such that they had a mean of zero and a standard deviation of one. Because the variation in quality, the number of ratings, and selling prices differed substantially across product categories, we standardized these difference scores by product category. Differences in user ratings were relatively homogeneous across categories and were standardized across the whole dataset. The final pair-wise database consisted of 16,489 comparisons across the 121 product categories.

Results

Simple Correlations. As a first analysis of the effects of the three predictive cues on quality, we computed simple correlations between differences in the three predictive cues (ΔA, ΔN, and ΔP) and quality differences (ΔQ), and then averaged these correlations across product categories. All three correlations were highly significant and positive (all p's < .001). The correlation was highest for price (M = 0.29, SD = 0.52), followed by average user rating (M = 0.20, SD = 0.49), and then number of user ratings (M = 0.14, SD = 0.47).

[Footnote: Consumer Reports as well as Amazon.com include brands that are very similar. For instance, the Garmin Nuvi 1350T and the Garmin Nuvi 1370T have the same objective quality according to Consumer Reports. We therefore did all analyses reported in this manuscript twice, once including brand pairs for which the brands have identical objective quality scores and/or identical average user ratings (number of pairs = 16,489) and once excluding these brand pairs (number of pairs = 15,578). The results are virtually identical for all analyses. In the manuscript, we report the results after excluding similar brands.]

Hierarchical Regression Analysis. Simple correlations do not control for additional predictors and therefore do not yield firm conclusions about the relative strength of the three predictors. To more formally compare the extent to which differences in technical quality can be predicted from average user ratings, number of ratings, and price, we estimated the following hierarchical regression model:

\Delta Q_{ij} = [\alpha_0 + \beta_{j,0}] + \Delta A_{ij}[\alpha_{1\Delta A} + \beta_{j,1\Delta A}] + \Delta N_{ij}[\alpha_{1\Delta N} + \beta_{j,1\Delta N}] + \Delta P_{ij}[\alpha_{1\Delta P} + \beta_{j,1\Delta P}] + \varepsilon_{ij}

ΔQij, ΔAij, ΔNij, and ΔPij indicate the difference between product A and product B for product pair i in category j in terms of quality, average user rating, number of ratings, and price, respectively. α0 is the fixed intercept, the βj,0 are random intercepts that vary across categories, the α1's are the fixed slopes, the βj,1's are random slopes that vary across categories, and εij are error terms. Because the random slopes (βj,1's) have a mean of zero, the fixed slopes (α1's) are the expected value of the random slopes. The fixed slopes thus indicate the estimated predictive validity of each cue averaged across product categories.
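To make the construction of the pair-wise comparative database and the estimation of the model above concrete, the following is a minimal sketch in Python (pandas and statsmodels). It assumes a hypothetical DataFrame named `products` with one row per item and columns `category`, `item`, `quality`, `avg_rating`, `n_ratings`, and `price`; these names, and the code itself, are illustrative rather than the authors' original analysis scripts.

```python
from itertools import combinations

import pandas as pd
import statsmodels.formula.api as smf

# products: one row per item, with hypothetical columns
# category, item, quality, avg_rating, n_ratings, price.

def build_pairs(products: pd.DataFrame) -> pd.DataFrame:
    """Create all within-category product pairs and their difference scores."""
    rows = []
    for category, group in products.groupby("category"):
        # Sort alphabetically so product A precedes product B, as described above.
        group = group.sort_values("item")
        for a, b in combinations(group.index, 2):
            rows.append({
                "category": category,
                "dQ": group.at[a, "quality"] - group.at[b, "quality"],
                "dA": group.at[a, "avg_rating"] - group.at[b, "avg_rating"],
                "dN": group.at[a, "n_ratings"] - group.at[b, "n_ratings"],
                "dP": group.at[a, "price"] - group.at[b, "price"],
            })
    return pd.DataFrame(rows)

pairs = build_pairs(products)

# Standardize quality, number-of-ratings, and price differences within category;
# standardize rating differences across the whole dataset.
for col in ["dQ", "dN", "dP"]:
    pairs[col] = pairs.groupby("category")[col].transform(
        lambda x: (x - x.mean()) / x.std()
    )
pairs["dA"] = (pairs["dA"] - pairs["dA"].mean()) / pairs["dA"].std()

# Hierarchical (mixed-effects) regression: fixed slopes for the three cues,
# plus random intercepts and random slopes across product categories.
model = smf.mixedlm(
    "dQ ~ dA + dN + dP",
    data=pairs,
    groups=pairs["category"],
    re_formula="~dA + dN + dP",
)
fit = model.fit()
print(fit.summary())  # the fixed slopes correspond to the alpha_1 terms above
```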
Table 1 shows point estimates and 95% confidence intervals for the fixed slopes of the three predictive cues. Because the variables are standardized, the fixed slopes can be compared in magnitude to determine the relative strength of each cue for predicting quality, controlling for the other cues. The slope of price is highest (B = 0.287, p < .001), followed by number of ratings (B = .148, p < .001) and average user rating (B = .114, p < .001). Thus, the partial effect of price on quality is about twice as large as the effects of average rating and number of ratings. Moreover, the 95% confidence interval for price does not overlap with the 95% confidence intervals for average user rating and number of user ratings, indicating that price explains by far the most variation in quality. The predictive validities of average rating and number of ratings do not differ significantly.

[Insert Table 1 here]

Next, we looked at how the predictive validity of a difference in average user ratings varies as a function of its precision. We measured the precision of the difference in average rating between a product pair using the pooled standard error (SE_P) of the two average ratings. The pooled standard error is a function of the sample sizes (N) and the variation in user ratings (VAR) for products A and B in a pair:

SE_P = \sqrt{\mathrm{VAR}_A / N_A + \mathrm{VAR}_B / N_B}

For each pair-wise comparison in the database, we calculated this precision measure and added it to the regression model above as a main effect and as an interaction with the difference in average user ratings. Based on statistical theory, we predicted that more precise differences in average user ratings would explain more variability in quality. Consistent with this, the regression revealed a significant negative interaction such that a lower standard error was associated with greater predictive validity of the difference in average user rating (B = -0.217, 95% CI: [-0.311; -0.124]).
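As a sketch of how the precision measure and its interaction could be added to the model, the snippet below continues with the hypothetical `pairs` DataFrame from the previous sketch and assumes it also carries each product's rating variance and rating count under illustrative column names (`var_a`, `n_a`, `var_b`, `n_b`); it is an illustration of the analysis described above, not the authors' code.

```python
import numpy as np
import statsmodels.formula.api as smf

# Assumed illustrative columns on the pairs DataFrame:
# var_a, n_a -> variance and number of user ratings for product A
# var_b, n_b -> variance and number of user ratings for product B

# Pooled standard error of the difference in average ratings:
# SE_P = sqrt(VAR_A / N_A + VAR_B / N_B)
pairs["se_pooled"] = np.sqrt(
    pairs["var_a"] / pairs["n_a"] + pairs["var_b"] / pairs["n_b"]
)

# Add the precision measure as a main effect and as an interaction with the
# difference in average user rating (dA * se_pooled expands to
# dA + se_pooled + dA:se_pooled in the formula interface).
moderated = smf.mixedlm(
    "dQ ~ dA * se_pooled + dN + dP",
    data=pairs,
    groups=pairs["category"],
    re_formula="~dA + dN + dP",
)
moderated_fit = moderated.fit()

# A negative dA:se_pooled coefficient means that less precise rating
# differences (larger pooled standard errors) predict quality less well.
print(moderated_fit.fe_params)
```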
To summarize and visualize the results of this regression, the top panel of Figure 1 plots the predictive validity (i.e., the fixed slope) of the average user rating as a function of the precision of the difference in average user rating (i.e., the pooled standard error). The chart also shows the predictive validities of the other cues, which do not vary as a function of the standard error of the average user rating in the model. At the mean level of precision (SE_P = 0.306), the fixed slope for the average user rating is about half that of price and approximately the same as that of the number of ratings. The predictive validity of the average user rating increases with precision, but even at the 5th percentile of the pooled standard error (SE_P = 0.106), the fixed slope is still only about two-thirds that of price. At the 95th percentile (SE_P = 0.639), the average user rating is a very weak predictor of quality, weaker than the number of ratings.

[Insert Figure 1 here]

Correspondence Analysis. The analyses above relate the magnitude of the difference in average user rating to the magnitude of the difference in technical quality. Another way to look at the data is to dichotomize the quality difference and evaluate the extent to which the average user rating can predict which product is of higher quality. If one were to randomly select two products in a category, and product A has a higher average user rating than product B, what is the likelihood that product A is also of higher quality, and how does this likelihood vary as a function of the magnitude of the difference in average user rating? To answer this question, we present a purely descriptive analysis. For each product pair, we created a variable that was coded as "match" if the product with the higher average user rating also had higher quality, and as "mismatch" if the product with the higher average user rating had lower quality. Product pairs with identical quality scores or user ratings were not included in this analysis. To visualize the results, we binned product pairs according to the absolute magnitude of the difference in average user rating (using a bin width of 0.20) and calculated for each bin the proportion of times the average user rating correctly predicted quality. Those results are plotted in the top panel of Figure 2 (solid line). Very few comparisons had rating differences larger than 2, so the data are shown only for differences between 0 and 2 stars, which accounts for approximately 95% of the database.

Averaging across all comparisons, the correspondence between the average user rating and quality is 57%. When the difference in user ratings is smaller than 0.40, correspondence is 50%; these small differences have essentially no predictive value. The percentage increases as the difference in user rating grows larger, but the increase is modest and correspondence never exceeds approximately 70%. Figure 2 also shows the proportion of product pairs in each bin (dashed line). As the difference in average user rating increases, the proportion of product pairs decreases, reflecting that large rating differences between products are relatively rare. This is consistent with previous work showing that the distribution of user ratings tends to be left-skewed, with most products rated as a 4 or 5 (Hu et al. 2006). It is notable that a large share of product comparisons yields small differences in average user ratings, and these have no predictive value at all: approximately 50% of product comparisons consist of rating differences smaller than 0.40, where correspondence with quality is at chance.

[Insert Figure 2 here]
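The binning logic underlying this correspondence analysis can be expressed compactly with pandas. The sketch below again uses the hypothetical `pairs` DataFrame and assumes illustrative columns `raw_dA` and `raw_dQ` holding the unstandardized rating and quality differences retained before standardization; it is meant only to show the procedure described above.

```python
import numpy as np
import pandas as pd

# raw_dA, raw_dQ: unstandardized differences in average user rating and in
# Consumer Reports quality score for each product pair (illustrative names).
desc = pairs.loc[(pairs["raw_dA"] != 0) & (pairs["raw_dQ"] != 0)].copy()

# A pair is a "match" when the product with the higher average user rating
# also has the higher quality score.
desc["match"] = np.sign(desc["raw_dA"]) == np.sign(desc["raw_dQ"])

# Bin pairs by the absolute rating difference in 0.20-star steps (0 to 2 stars)
# and compute the proportion of matches and the share of pairs in each bin.
desc["abs_dA"] = desc["raw_dA"].abs()
desc["bin"] = pd.cut(desc["abs_dA"], bins=np.arange(0, 2.2, 0.2))

summary = desc.groupby("bin").agg(
    correspondence=("match", "mean"),
    share_of_pairs=("match", "size"),
)
summary["share_of_pairs"] = summary["share_of_pairs"] / len(desc)
print(summary)
```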
Discussion

This study was the first to examine how well user ratings can predict technical quality. Across a wide range of product categories, we found that both the average user rating and the number of user ratings are significant, positive predictors of technical quality, but the relationship is weak. The predictive accuracy of the average user rating falls off dramatically when it is based on a smaller sample and/or the underlying distribution has greater variability. We also examined the predictive validity of price and found it to be similar to what has been reported in past research (Tellis and Wernerfelt 1987). Notably, although the correlation between price and quality is modest, price is a much better predictor than either the average or the number of ratings.

STUDY 2: CONSUMER BELIEFS

Study 2 examined the extent to which average user ratings, number of user ratings, and selling prices influence consumer perceptions of technical product quality, with an eye to comparing consumer beliefs to the marketplace relationships revealed in Study 1. This is the first study we are aware of that evaluates how user ratings influence quality perceptions. Moreover, numerous studies have documented a pervasive influence of price on perceptions of quality (Kardes et al. 2004; Broniarczyk and Alba 1994), but none have examined price-quality beliefs in a context where user ratings are also available as a basis for inference.

We asked consumers to search for pairs of products on Amazon.com and then to rate the relative quality of these products. By choosing products that varied in average user rating, number of ratings, and price, we could conduct analyses similar to those in Study 1, this time using consumer perceptions of quality as the dependent variable rather than the actual quality scores from Consumer Reports. To avoid any demand effects due to the experimental paradigm, we designed the search and rating task to be as ecologically valid as possible, and we gave participants minimal training.

Method

Participants and procedure. Three hundred and four respondents from Amazon Mechanical Turk were paid $0.65 to complete the study. Participants were first informed that they would be provided with pairs of products along with links to the URLs for the products at Amazon.com. They would be asked to inspect the product webpages and judge the relative technical quality of the two products. To ensure that participants understood that they were evaluating technical quality as measured by Consumer Reports, we provided the following description on the instruction page: "Expert ratings like those generated by Consumer Reports magazine are generated by engineers and technicians with years and sometimes decades of expertise in their field. They live with the products for several weeks, putting them through a battery of objective tests using scientific measurements, along with subjective tests that replicate the user experience. All models within a category go through exactly the same tests, side by side, so they're judged on a level playing field, and test results can be compared."

After reading the instructions, participants saw the first pair of products. They were asked to click on the links and examine the products. They were then asked to rate "how you feel experts at Consumer Reports magazine would likely rate the quality of the products relative to each other." The rating scale ranged from 1 (Product A would be rated as higher quality) to 10 (Product B would be rated as higher quality). After completing the ratings, participants were provided with the next set of products. Each participant completed this relative quality rating for two products in each of eight product categories.
The two products were sampled at random from a pre-determined set as specified below, and category presentation order was also randomized.

Product Selection. We selected the 10 product categories with the most products from the data used in Study 1. We excluded MP3 players and kitchen knives because many items in these categories were not comparable to one another. For example, in the kitchen knife category, some products referred to single knives while others referred to knife sets. We then selected one product from each brand to use in the study. Because the data from Study 1 were no longer current, we visited the ConsumerReports.org and Amazon.com websites again and re-recorded the technical quality scores, the average user rating, the number of ratings, and the selling price for each of these products. The final set of stimuli resulting from this procedure included 4 printers, 7 digital cameras, 2 GPS navigators, 4 cordless phones, 16 coffee makers, 10 steam irons, 8 food processors, and 7 strollers. To verify the representativeness of these products for the marketplace, we calculated the actual quality differences (the difference in Consumer Reports quality scores, as in Study 1) between each potential product pair and correlated the quality difference with the difference in average rating, the difference in number of ratings, and the difference in price. Results were similar to those for the full database in Study 1. Averaging over the eight product categories, all correlations were positive. The correlation was largest for price (M = 0.36, SD = 0.45), next largest for number of ratings (M = 0.22, SD = 0.45), and smallest for average user rating (M = 0.07, SD = 0.70).

Results

To make the results as comparable as possible to Study 1, we repeated all of the same analyses, this time using relative perceived quality rather than the actual quality difference as the dependent variable. The data consisted of eight judgments of relative quality for each participant. For each of these eight product pairs, we computed the difference between product A and product B in average user rating (ΔA), number of user ratings (ΔN), and price (ΔP). We also standardized all variables using the same procedure as in Study 1.

Simple Correlations. As in Study 1, as a first indication of the results we computed simple correlations between the three predictive cues and the outcome variable, in this case judgments of relative quality. Because of the experimental design, we averaged across participants rather than across product categories. The correlation with perceived quality was highest for average user rating (M = 0.42, SD = 0.41), followed by price (M = 0.20, SD = 0.42) and number of ratings (M = 0.18, SD = 0.38). As in Study 1, all three correlations were significantly positive (all p's < .001). The pattern, however, was markedly different. In Study 1, the price difference was the most potent predictor of actual relative quality; here, the difference in average user rating was the strongest predictor of perceived relative quality.
Hierarchical Regression Analysis. To more formally compare the extent to which judgments of quality are related to average user ratings, number of ratings, and price, we estimated the following hierarchical regression model, analogous to the model used in Study 1:

\Delta Q_{ij} = [\alpha_0 + \beta_{j,0}] + \Delta A_{ij}[\alpha_{1\Delta A} + \beta_{j,1\Delta A}] + \Delta N_{ij}[\alpha_{1\Delta N} + \beta_{j,1\Delta N}] + \Delta P_{ij}[\alpha_{1\Delta P} + \beta_{j,1\Delta P}] + \varepsilon_{ij}

ΔQij, ΔAij, ΔNij, and ΔPij indicate the difference between product A and product B in product category i for participant j in terms of judged quality, average user rating, number of ratings, and price, respectively. α0 is the fixed intercept, the βj,0 are random intercepts that vary across participants, the α1's are the fixed slopes, the βj,1's are random slopes that vary across participants, and εij are error terms.

Table 1 shows point estimates and 95% confidence intervals for the fixed slopes, averaged across participants. Quality judgments are most strongly related to differences in average user rating (B = 0.342, p < .001), followed by differences in price (B = 0.173, p < .001) and differences in number of ratings (B = 0.098, p < .001). The 95% confidence interval for average user rating does not overlap with the 95% confidence intervals for price and number of user ratings, indicating that the average user rating explains by far the most variation in judgments of quality. Again, these results are substantially different from the marketplace relationships revealed in Study 1.

Next, we analyzed whether respondents weighted the difference in average user rating differently depending on its level of precision. We calculated the standard error of the difference in average user rating for each of the product comparisons and added this precision measure to the regression model as a main effect and as an interaction with the difference in average user rating. The interaction effect was not significantly different from zero (95% CI: [-0.088; 0.376]), indicating that consumers fail to moderate their quality inferences as a function of the precision of the difference in average rating.

To summarize and visualize the results of this regression, the bottom panel of Figure 1 plots the predictive validity (i.e., the fixed slope) of each predictive cue as a function of the precision of the difference in average user rating (i.e., the pooled standard error). Note that the effect of average user rating on perceived quality appears to increase with standard error, but this increase is not statistically significant. The figure shows that when consumers infer quality, they rely more on the average user rating than on price at all levels of standard error. This stands in contrast to Study 1, where the market data revealed that the effect of price is larger than the effect of the user rating at all levels of standard error. In other words, consumers substantially overestimate the relation between user ratings and quality and substantially underestimate the relation between price and quality. Their beliefs about the effect of number of ratings on quality are approximately consistent with marketplace reality.
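For completeness, a brief sketch of the Study 2 specification under the same assumptions as the earlier snippets: the only structural change from Study 1 is that the random intercepts and slopes are grouped by participant rather than by product category, with judged relative quality as the outcome. The DataFrame name `judgments` and its columns (`participant`, `judged_dQ`, `dA`, `dN`, `dP`) are illustrative assumptions.

```python
import statsmodels.formula.api as smf

# judgments: one row per participant x product pair, with hypothetical columns
# participant, judged_dQ (perceived relative quality), dA, dN, dP (standardized).
study2 = smf.mixedlm(
    "judged_dQ ~ dA + dN + dP",
    data=judgments,
    groups=judgments["participant"],  # random effects vary across participants
    re_formula="~dA + dN + dP",
)
print(study2.fit().summary())
```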
Correspondence Analysis. As in Study 1, we performed a descriptive analysis, dichotomizing the quality judgment and evaluating the correspondence between differences in user ratings and judgments of quality superiority. We coded each judgment by a participant as a match or a mismatch. A comparison was coded as "match" if the product with the higher average user rating was also judged to be of higher quality and as "mismatch" if the product with the higher average user rating was judged to be of lower quality. We placed all product comparisons with rating differences smaller than 2 into bins of 0.20 stars and calculated the percentage of matches in each bin. Those results are shown with the black line in the bottom panel of Figure 2. The plot also includes a dotted line denoting the proportion of product pairs in the database represented in each bin. The gray line, reproduced from the corresponding graph in Study 1 (see top panel), shows the correspondence between user ratings and quality in the marketplace for reference.

Consumers believe that differences between 0.20 and 0.40 predict quality superiority 65% of the time; the market data (gray line) show that such comparisons have no predictive value. As the difference in average user rating grows larger, consumer judgments of quality follow the average user rating about 80% of the time. This again contrasts with the market data, where correspondence peaks at about 70%. In summary, the black line denoting consumer beliefs lies much higher and is much steeper than the gray line denoting marketplace reality. This indicates that consumers substantially overestimate the predictive validity of differences in user ratings.

Discussion

This study was the first to examine the influence of user ratings and selling prices on judgments of product quality. Consumers perceived a significant positive relation between each of the three cues (price, average user rating, and number of user ratings) and quality. The average user rating is by far the strongest predictor of quality perceptions. We also found that consumers fail to moderate their reliance on the average user rating as a function of its precision. We suspected this would be the case based on previous research on consumers' flawed statistical intuitions.

Although the effect of average ratings on quality inferences may not be surprising in light of the many studies documenting their pervasive influence on consumer and managerial decision-making, it is notable that consumers think the average user rating is so much more predictive of quality than price. The price-quality heuristic is one of the most studied phenomena in the marketing literature and is typically considered to be one of the cues that consumers treat as most diagnostic of quality. Our study is the first to revisit the strength of the perceived price-quality relationship in a multi-cue environment where user ratings are simultaneously present, which is now a common way for consumers to evaluate products. The moderate effect of the number of ratings on judgments of quality could also have been anticipated based on prior research on the influence of social proof. But to our knowledge, consumers' reliance on the number of user ratings as a cue for quality had not been examined in previous research. Our findings suggest that the number of user ratings influences consumer decision making (and ultimately, sales) and may have been understudied. This may be an interesting avenue for future research.

In sum, comparing Studies 1 and 2 reveals substantial mismatches between the actual relationships between the predictive cues and quality in the marketplace and the way consumers use these cues when making inferences about product quality. Consumers substantially overestimate the predictive validity of the average user rating, and they fail to anticipate that less precise averages are less predictive of quality.
They also substantially underestimate the predictive validity of price, presumably because price is overshadowed by user ratings in a context like Amazon.com where both cues are simultaneously available.

STUDY 3: ANTECEDENTS OF AVERAGE USER RATINGS

Study 1 revealed that average user ratings are not very diagnostic of technical quality. We hypothesized that one reason for this is the biasing influence of marketing actions on user evaluations. To test this idea, in Study 3 we supplemented the market data with brand perception data and conducted a new regression that predicts the average user rating as a function of price, brand perceptions, and technical quality. By controlling for quality, we isolate the contaminating influence of extrinsic cues on user ratings. Partial effects of these cues on ratings would provide evidence for our conjecture and would cast further doubt on the utility of user ratings for inferring quality. In Studies 1 and 2 we used a comparative approach because quality inferences from marketplace cues are often made in a multi-option environment. In Study 3 we use a non-comparative approach because consumers typically purchase and evaluate only a single option.

Data

We supplemented the database used in Study 1 with brand perception measures from a proprietary consumer survey. This survey is administered annually to a representative sample of U.S. consumers and asks multiple questions about shopping habits and attitudes toward retailers and brands across numerous product categories. We obtained data from three versions of the survey that together covered most of the product categories examined in Study 1: electronics (e.g., televisions, computers, cell phones), appliances and home improvement (e.g., blenders, refrigerators, power tools), and housewares (e.g., dishes, cookware, knives). For the brand perception portion of the survey, participants were first asked to rate their familiarity with all brands in the category and were then asked further questions about brand attitudes for three brands with which their familiarity was high. All brand perception questions were asked on five-point agree/disagree Likert scales. The brand perception questions differed somewhat across the three versions of the survey, so we retained data only for the 15 brand perception questions that were asked in all three versions. We cleaned the data by removing data from any participants who did not complete the survey or who gave the same response to all brand perception questions. When merging the brand perception data with our existing data set, we were able to obtain brand perception measures for 888 products representing 132 brands across 88 product categories. The data consisted of ratings from 37,953 respondents, with an average of 288 sets of ratings for each brand and substantially more for well-known brands.

For purposes of data reduction, we submitted the average value for each brand on each of the 15 questions to a principal components analysis with a varimax rotation, restricting the number of factors to two. The two factors accounted for 71% of the variance in the data set. We interpreted the first factor as representing perceived functional and emotional benefits (12 items) and the second factor as representing perceived affordability of the brand (3 items). As all inter-item correlations met or exceeded levels advocated in the measurement literature (see Netemeyer, Bearden, and Sharma 2003; Robinson, Shaver, and Wrightsman 1991), we averaged the respective scale items to form two brand perception measures. The individual scale items loading on each of the respective factors are shown in Table 2. The correlation between the two subscales was moderately negative (r = -0.21), suggesting that consumers see brands that provide more benefits as being less affordable.

[Footnote: An unrestricted principal components analysis yields three factors explaining 83% of the variance in the data set. The three factors can be interpreted as brand associations related to functional benefits (7 items), emotional benefits (5 items), and price (3 items). Although they loaded on separate factors, multi-item scales comprised of the respective emotional and functional items were highly correlated (r = 0.76), leading to multicollinearity issues in subsequent regression analyses. Upon inspection of all brand equity items, the price-based items represented sentiments related to sacrificing resources for the purchase (e.g., "is affordable"), while the functional and emotional items represented what is received in the purchase (e.g., "is durable" and "is growing in popularity"). Therefore, we reconducted the principal components analysis using the a priori criterion of restricting the number of factors to two (Hair, Anderson, Tatham, and Black 1995).]
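The data-reduction step can be sketched as follows in Python, using scikit-learn for the principal components extraction and a hand-rolled Kaiser varimax rotation. The DataFrame name `brand_means` (one row per brand, one column per perception item) is an illustrative stand-in for the proprietary survey data, and the code is a sketch of the procedure described above rather than the authors' implementation.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

def varimax(loadings, gamma=1.0, max_iter=100, tol=1e-6):
    """Kaiser varimax rotation of a loading matrix (standard algorithm)."""
    p, k = loadings.shape
    rotation = np.eye(k)
    total = 0.0
    for _ in range(max_iter):
        rotated = loadings @ rotation
        target = rotated ** 3 - (gamma / p) * rotated @ np.diag(np.sum(rotated ** 2, axis=0))
        u, s, vt = np.linalg.svd(loadings.T @ target)
        rotation = u @ vt
        new_total = s.sum()
        if total != 0 and new_total / total < 1 + tol:
            break
        total = new_total
    return loadings @ rotation

# brand_means: one row per brand, one column per brand perception item
# (illustrative stand-in for the 15 survey items).
z = (brand_means - brand_means.mean()) / brand_means.std()  # correlation-based PCA

pca = PCA(n_components=2).fit(z)
raw_loadings = pca.components_.T * np.sqrt(pca.explained_variance_)
rotated = varimax(raw_loadings)

loadings = pd.DataFrame(rotated, index=brand_means.columns,
                        columns=["benefits", "affordability"])
print(pca.explained_variance_ratio_.sum())  # share of variance retained by two components
print(loadings.round(2))

# Items are assigned to the factor on which they load most strongly and
# averaged to form the two brand perception scores used in the regression below.
assignment = loadings.abs().idxmax(axis=1)
brand_scores = pd.DataFrame({
    "BENEFIT": brand_means.loc[:, assignment == "benefits"].mean(axis=1),
    "AFFORDABLE": brand_means.loc[:, assignment == "affordability"].mean(axis=1),
})
```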
[Insert Table 2 here]

Results

We analyzed the average user ratings on Amazon.com with the following hierarchical regression model:

A_{ij} = [\alpha_0 + \beta_{j,0}] + Q_{ij}[\alpha_{1Q} + \beta_{j,1Q}] + P_{ij}[\alpha_{1P} + \beta_{j,1P}] + \mathrm{BENEFIT}_{ij}[\alpha_{1\mathrm{BENEFIT}}] + \mathrm{AFFORDABLE}_{ij}[\alpha_{1\mathrm{AFFORDABLE}}] + \varepsilon_{ij}

Aij indicates the average user rating for product i in product category j, Qij indicates the objective quality of the product, Pij indicates the selling price of the product on Amazon.com, BENEFITij indicates the extent to which the brand has positive functional and emotional associations, and AFFORDABLEij indicates the extent to which the brand is perceived to be affordable. α0 is the fixed intercept, the βj,0 are random intercepts that vary across categories, the α1's are the fixed slopes, the βj,1's are random slopes that vary across categories, and εij are error terms. We standardized all variables by product category before analysis such that they have a mean of zero and a standard deviation of one.
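A sketch of how this specification differs from the earlier ones: quality and price receive both fixed and random (category-varying) slopes, while the two brand perception measures enter with fixed slopes only, which can be expressed by restricting the random-effects formula. The DataFrame name `items` and its column names (`avg_rating`, `quality`, `price`, `benefit`, `affordable`, `category`) are illustrative assumptions, not the authors' code.

```python
import statsmodels.formula.api as smf

# items: one row per product, variables standardized within product category.
study3 = smf.mixedlm(
    "avg_rating ~ quality + price + benefit + affordable",
    data=items,
    groups=items["category"],
    re_formula="~quality + price",  # random intercept plus random slopes
                                    # for quality and price only
)
fit3 = study3.fit()
print(fit3.fe_params)  # fixed effects: the alpha_1 coefficients in the model above
```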
Table 3 shows point estimates and 95% confidence intervals for the fixed slopes. Consistent with Study 1, superior technical quality is associated with a higher rating on Amazon.com (B = 0.111, p < .01). The effect of selling price is significant, positive, and about the same magnitude as the effect of technical quality (B = 0.101, p < .01). The positive effect of selling price on user ratings while controlling for technical quality and brand reputation suggests that user ratings are not indicators of value (i.e., quality – price) but rather that price plays a "positive" role, as an extrinsic cue for quality (i.e., quality + price). In line with the positive effect of selling price on the average user rating, the effect of the brand's reputation for affordability was significant and negative (B = -0.080, p < .05). Brands with a discount reputation are thus rated less favorably by users, controlling for quality. Finally, there was a significant positive effect of the perceived emotional and functional benefits offered by a brand (B = 0.185, p < .001).

[Footnote: Based on theory (see the introduction), we assume here that the flow of influence runs from price to Amazon.com user ratings. An alternative interpretation of the positive effect of price is that Amazon.com raises or lowers its prices in response to user ratings. While we are aware that Amazon.com sets prices based on individual-level data related to the consumer's price sensitivity (e.g., the consumer's previous purchase history or the browser the consumer is using; see The Economist 2012), we are unaware of any source that has alleged that Amazon.com uses user ratings to set prices across consumers. Nevertheless, to gain some insight into this issue, we collected Amazon.com prices for the brands in our dataset at three additional points in time (09-22-2012, 11-22-2012, and 01-22-2013). We scraped the Amazon.com user ratings and prices used in Study 1 from the website on 02-14-2012. If user ratings influence prices, we would expect to find a positive correlation between these ratings and subsequent price changes. That is, higher ratings at time 1 (i.e., 02-14-2012) should be positively related to price changes from time 1 to time 2 (i.e., the difference in price between any of these three additional times and the price on 02-14-2012). Thus, we calculated three price changes and found that these price changes were not significantly related to the Amazon.com user ratings on 02-14-2012 (r_Sep = .01, p > .87; r_Nov = .04, p > .35; r_Jan = -.01, p > .74), which is inconsistent with the reverse causality argument.]

In sum, we find that price affects user ratings in its positive role, both at the product level (i.e., the effect of selling price) and at the brand level (i.e., the effect of a brand's discount reputation). These analyses suggest that higher market prices lead to more favorable user ratings and that perceptions of being a high-priced brand further inflate ratings. Brands with a better reputation for offering benefits also benefit from inflated ratings.

[Insert Table 3 here]

To assess the generalizability of these findings, we obtained user ratings and selling prices from a second source. Subscribers to the Consumer Reports website not only have access to the product quality ratings generated by Consumer Reports' experts but also have the opportunity to rate any of the products they purchase using the same 1-to-5-star rating system as Amazon.com, and these ratings are visible to other subscribers to the website. This sample serves as a strong test of the hypothesis that marketing variables bias user ratings, in the sense that Consumer Reports subscribers are interested in technical quality and may be less likely to use price and brand as cues to quality when evaluating products. ConsumerReports.org had at least one user rating for 602 items in our dataset. Using this set of items, we repeated the analysis above, substituting the average user ratings from Amazon.com with the average user ratings listed on ConsumerReports.org. Table 3 provides point estimates and confidence intervals for the fixed slopes. The results are mostly consistent with those from the Amazon.com user ratings.
The effect of selling price is positive and significant (B = 0.098, p < .05), the effect of a brand's reputation for affordability is negative and significant (B = -0.119, p < .01), and the effect of the perceived emotional and functional benefits offered by a brand is positive and marginally significant (B = 0.081, p = .05). The only major difference between the two sources of user ratings is that the effect of technical quality on ConsumerReports.org user ratings is not significant.

Discussion

Study 3 tested the hypothesis that user ratings are biased by extrinsic marketing variables, including price and brand name. To test this prediction, we regressed average user ratings on technical quality, price, and brand perceptions. Controlling for technical quality, ratings were indeed higher for more expensive products and for brands perceived to deliver more benefits, and lower for brands perceived to be affordable. In fact, substantially more of the variability in user ratings was attributable to price and brand perceptions than to technical quality, suggesting that user ratings are, in large part, determined by the top-down influence of marketing actions.

We also tested the hypothesis with a second sample of user ratings from Consumer Reports subscribers. Surprisingly, these ratings were not related to technical quality at all. This may be because ConsumerReports.org has fewer ratings per product than Amazon.com (median = 4 vs. 67) and, as we have shown, the predictive validity of the average user rating depends in part on having a sufficient sample size. Nonetheless, the biasing effect of marketing variables emerged just as it did with the user ratings from Amazon.com. Evidently, these effects are strong enough to come through even with the limited sample sizes available on ConsumerReports.org. These results are ironic in the sense that Consumer Reports subscribers are particularly interested in technical quality.

We wondered whether consumers anticipate the positive effect of price on user ratings, so we asked 56 Amazon Mechanical Turk workers to imagine the following scenario: Consumer Reports tested the quality of two brands of blenders, one selling for $50 and the other for $75, and the tests revealed that both brands are equal in terms of quality. We then asked respondents which brand they deemed most likely to have the higher Amazon.com user rating. Seventy percent of respondents thought that the less expensive brand would have the higher user rating, 18% thought that the more expensive brand would have the higher user rating, and 13% thought that both brands would be rated equally. In other words, a majority of respondents believed that price feeds into ratings in its negative role, as an indicator of purchase value, contrary to the results of Study 3.

GENERAL DISCUSSION

This research had three broad objectives. First, we sought to evaluate the predictive validity of online user ratings for technical product quality in the marketplace. Previous research has raised concerns about the non-representativeness of the sample of ratings for the population. We raised several other concerns, including those related to the statistical precision of the average and the fact that raters are likely influenced by extrinsic marketing variables. Analysis of secondary data showed that these concerns are well founded. The relationship between average user ratings and quality is weak.
Not controlling for other predictors, having a higher Amazon.com user rating predicts having a higher quality score only 57% of the time. Differences in average user ratings smaller than 0.40 have essentially no predictive value, and larger differences predict quality superiority only about 65% of the time. We also found that the predictive validity of the average user rating depends substantially on its precision, as measured by its standard error. As the sample size increases and/or the variability of the rating distribution decreases, user ratings become better predictors of quality. Unfortunately, for a large proportion of products in the marketplace the ratings lack the precision needed to yield a reliable estimate. The average user rating is about as predictive of quality as the number of ratings and much less predictive than price. Price is a useful benchmark because it has been studied extensively; that research generally highlights the weakness of price as a predictor of quality and cautions that consumers tend to rely too heavily on price as a cue for quality (Gerstner 1985; Kardes et al. 2004; Lichtenstein and Burton 1989).

The second research objective was to examine the extent to which consumers rely on user ratings when making inferences about product quality. Participants searched for pairs of products on Amazon.com that we had chosen to vary in average rating, number of ratings, and price. They then indicated their perception of the relative technical quality of the two products. The key result was that the average user rating was by far the strongest predictor of quality perceptions. Consumers also did not moderate their quality inferences depending on the precision of the difference in average user ratings. The other cues, number of ratings and price, were treated as somewhat informative of quality, but much less so than the average rating. These results stand in stark contrast to the marketplace relationships revealed in Study 1. We conclude that consumers overestimate the predictive validity of the average user rating, underestimate the predictive validity of price in a multi-cue environment that also includes user ratings, and fail to anticipate that the precision of the average user rating determines how strong a quality inference one can make.

The third objective was to test the conjecture that user ratings are contaminated by extrinsic marketing variables, including price and brand name. Controlling for technical quality, Amazon.com user ratings are higher for products that cost more and for brands with a reputation for offering benefits. Ratings are lower for brands with a reputation for affordability. These results generalized to user ratings from subscribers to Consumer Reports, a sample of consumers who are particularly interested in technical quality.

An Illusion of Validity

Why are consumers so far off the mark in their interpretation of user ratings? Classic research on the psychology of prediction may provide an explanation. In describing a judgment fallacy they call the "illusion of validity," Tversky and Kahneman (1974) write: "… people often predict by selecting the outcome that is most representative of the input. The confidence they have in their prediction depends primarily on the degree of representativeness (that is, on the quality of the match between the selected outcome and the input) with little or no regard for the factors that limit predictive accuracy" (p. 1126).
User ratings may be highly representative of quality in the minds of consumers because they are intended to be direct measures of quality. Amazon.com, for instance, provides instructions to review writers that focus on evaluating product features and comparing similar products (see http://www.amazon.com/gp/community-help/customer-reviews-guidelines). Other predictive cues, like price and the number of reviews, bear less resemblance to quality, and consumers may therefore not rely on these cues to the same extent as the average user rating.

Further contributing to an illusion of validity are people's erroneous intuitions about the laws of chance. People tend to believe that the characteristics of a randomly drawn sample (e.g., its mean) are very similar to the characteristics of the overall population (Tversky and Kahneman 1971). For instance, Obrecht et al. (2007) showed that, when presented with distributions of product ratings, consumers displayed virtually no understanding of sampling theory: they were insensitive to both the number of ratings and the standard deviation of those ratings. Because higher standard errors attenuate the correlation between a predictive cue and an outcome, people should weight a cue less as its standard error increases. Participants' failure to moderate their reliance on the average rating as a function of its precision (Study 2) is another instantiation of people's flawed understanding of statistical principles.

We have also shown that user ratings are biased by marketing variables such as price and brand name. People may fail to anticipate such bias because these variables often enter into evaluations outside of awareness (Ferraro, Bettman, and Chartrand 2009; Shiv, Carmon, and Ariely 2005). Because consumers are not aware of drawing on marketing variables when making their own quality evaluations, they fail to anticipate these variables contaminating the evaluations of others.

Implications for Consumer Welfare

We have shown that consumers dramatically overestimate the correspondence between user ratings and expert ratings of technical quality. An important implication may be that consumers who rely on user ratings as a cue for quality are making suboptimal decisions. This conclusion rests on the assumption that expert ratings are better measures of quality, and hence better predictors of consumption utility, than user ratings. Though this assumption is commonly made in the academic literature, it is debatable. One could argue that Consumer Reports experts have a different value function over attributes than consumers do. It could be that user ratings do a better job of predicting consumption utility, which would mean that user ratings facilitate good decisions despite their lack of correspondence to the expert evaluations. This argument is weakened by the fact that the correspondence increases with statistical precision, suggesting that the lack of correspondence is due in part to user ratings being based on unreliable estimates from insufficient sample sizes. It is also weakened by the fact that in many categories the attributes of importance to consumers are not readily evaluable in an ordinary usage context but are evaluated by the testers at Consumer Reports (e.g., the safety of an infant car seat).
More evidence for this idea comes from Mitra and Golder (2006), who show that quality improvements made to products show up first in expert evaluations such as Consumer Reports and are only eventually reflected in consumer sentiment, with a lag of typically five to six years. This is one more reason to doubt that consumers can accurately report quality in the timeframe within which reviews are normally written, often days or weeks after purchase.

Beyond these arguments, we sought to shed new light on the appropriateness of operationalizing Consumer Reports expert ratings as a measure of technical quality by testing whether expert ratings and user ratings can predict a market-based measure of quality. The resale price of used products is often used in marketing contexts as an indicator of product quality. For instance, auto manufacturers commonly tout that a particular car has "the highest resale value of any car in its class." Ginter, Young, and Dickson (1987) calculated the ratio of the average retail price of a used car model to its manufacturer's suggested list price and used this ratio as a dependent variable to gauge the influence of the car model's reliability and performance as measured by Consumer Reports. Their rationale was that cars with better reliability and performance should retain more of their original selling price. We use the same rationale here: products with better performance (quality) should retain more of their original selling price.

To augment our database with retained-value data, we scraped prices from the camelcamelcamel.com website, which reports new and used prices of products sold by third parties on Amazon.com. This allowed us to calculate a measure of retained value: the lowest third-party selling price of the product at the time of the data scraping (January 2013), taken across all third-party sellers on the Amazon.com website, divided by the lowest price of the product sold as new by a third-party seller on Amazon.com (in cases where there were no current third-party sellers of the new or used product, the website reports the most recent used/new price). In order to assess the relative ability of Consumer Reports quality ratings and user ratings to predict retained value, we regressed retained value on Consumer Reports quality scores and user ratings. In the first analysis we used the user ratings from Amazon.com (971 products); in the second analysis we used the user ratings from ConsumerReports.org (625 products). The regression models were specified as follows:

RVij = [α0 + βj,0] + Aij[α1A + βj,1A] + Qij[α1Q + βj,1Q] + εij,

where RVij indicates retained value for product i in product category j, Aij indicates the average user rating on either Amazon.com or ConsumerReports.org, Qij indicates the objective quality of the product, α0 is the fixed intercept, βj,0 are random intercepts that vary across categories, the α1's are the fixed slopes, the βj,1's are random slopes that vary across categories, and εij are error terms. We standardized all variables by product category before analysis such that they have a mean of zero and a standard deviation of one. Table 4 shows point estimates and 95% confidence intervals for the fixed slopes.
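For concreteness, a minimal sketch of this retained-value analysis is shown below, in the same spirit as the earlier sketch. It uses Python with pandas and statsmodels; the input file and column names (lowest_third_party_price, lowest_new_price, rating, quality, category) are hypothetical, and this is not the code used to produce the estimates in Table 4.

    # Sketch: compute retained value, standardize within category, and fit a
    # mixed model with random intercepts and slopes by product category.
    import pandas as pd
    import statsmodels.formula.api as smf

    df = pd.read_csv("retained_value_inputs.csv")  # hypothetical input file
    df["retained_value"] = df["lowest_third_party_price"] / df["lowest_new_price"]

    # Standardize all variables within product category (mean 0, SD 1).
    cols = ["retained_value", "rating", "quality"]
    df[cols] = df.groupby("category")[cols].transform(lambda x: (x - x.mean()) / x.std())

    model = smf.mixedlm(
        "retained_value ~ rating + quality",
        data=df,
        groups=df["category"],
        re_formula="~rating + quality",  # random slopes across categories
    )
    print(model.fit().summary())  # fixed slopes analogous to Table 4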
Consumer Reports quality scores were significantly related to retained value (B = 0.091, p < .01, when controlling for Amazon.com user ratings; B = 0.109, p < .01, when controlling for ConsumerReports.org user ratings). User ratings were not, regardless of their source (B = 0.030, p = .37, for Amazon.com; B = 0.018, p = .66, for ConsumerReports.org). Taken together, these results and arguments suggest that Consumer Reports quality scores are a valid measure of quality and one that matters for consumer welfare.

[Insert Table 4 here]

Regardless of the outcome of these analyses and future similar analyses, it will be difficult to completely rule out the possibility that user ratings capture some elements of the consumption experience that are not captured by more objective sources of data. Consumer perceptions can be driven by top-down influences. A good brand name or a high price can overshadow the intrinsic properties of a product, heighten an experience, and provide pleasure (Shiv, Carmon, and Ariely 2005), even at the neural level (Plassman et al. 2008). Nonetheless, even if consumer ratings are somewhat reflective of consumption utility, the point still stands that there is a disconnect between what user ratings reflect and what consumers think they reflect. Such misperceptions are usually not advantageous.

Limitations and Ideas for Future Research

Given the importance of review ratings to consumer decision-making, further study in this area would be beneficial. One important limitation of our analyses is that we looked only at user ratings. When customers provide reviews, they usually also have the opportunity to write a narrative description of their product experience. Future research might look at how often consumers read these narratives, how they integrate the narrative information with the quantitative ratings, how they choose which narratives to read, and whether the narratives help or hinder consumers' quality inferences.

Another limitation of our data is that we represented quality as a unidimensional score. Consumer Reports rates products on numerous dimensions in addition to giving an overall score. Future studies could look at how different types of attributes predict consumer ratings. We suspect that attributes that are highly evaluable by consumers over a short time horizon (e.g., ease of use) would predict more variance in reviews than hard-to-evaluate dimensions like safety and reliability. The methodological challenge in implementing such a study is that Consumer Reports typically rates a relatively small set of products on a relatively large set of attributes, leaving model fitting with too many parameters relative to the number of observations.

A third limitation of our data is that it does not cover the full range of products and services for which people consult online user reviews. We used a wide range of categories (essentially all categories rated by Consumer Reports), but these categories are primarily durables such as electronics and appliances. Online review ratings are also pervasive in the evaluation of more experiential products like alcoholic beverages (e.g., winespectator.com, beeradvocate.com) and services like restaurants (e.g., Yelp.com), hotels (e.g., tripadvisor.com), and contractors (e.g., angieslist.com). Future research might explore whether the findings we have documented generalize to other types of products and services.

Conclusion

User ratings are a perfect storm of forces that lead to information-processing biases, and these forces conspire to create a powerful illusion of validity in the minds of consumers. Consumers should proceed with much more caution than they currently do when using reviews to inform their decisions.

References
Allison, Ralph I. and Kenneth P. Uhl (1964), "Influence of Beer Brand Identification on Taste Perception," Journal of Marketing Research, 1 (August), 36-39.

Anderson, Myles (2012), "Study: 72% of Consumers Trust Online Reviews as Much as Personal Recommendations," http://searchengineland.com/study-72-of-consumers-trust-online-reviews-as-much-as-personal-recommendations-114152, March 12.

Anderson, Michael and Jeremy Magruder (2012), "Learning from the Crowd: Regression Discontinuity Estimates of the Effects of an Online Review Database," The Economic Journal, 122 (September), 957-989.

Bagwell, Kyle and Michael H. Riordan (1991), "High and Declining Prices Signal Product Quality," American Economic Review, 81 (March), 224-239.

Braun, Kathryn A. (1999), "Postexperience Advertising Effects on Consumer Memory," Journal of Consumer Research, 25 (March), 319-334.

Broniarczyk, Susan and Joseph W. Alba (1994), "The Importance of the Brand in Brand Extension," Journal of Marketing Research, 31 (May), 214-228.

Chen, Yubo and Jinhong Xie (2008), "Online Consumer Review: Word-of-Mouth as a New Element of Marketing Communication Mix," Management Science, 54 (March), 477-491.

Chevalier, Judith A. and Dina Mayzlin (2006), "The Effect of Word of Mouth on Sales: Online Book Reviews," Journal of Marketing Research, 43 (August), 345-354.

Chintagunta, Pradeep K., Shyam Gopinath, and Sriram Venkataraman (2010), "The Effects of Online User Reviews on Movie Box Office Performance: Accounting for Sequential Rollout and Aggregation Across Local Markets," Marketing Science, 29 (September-October), 944-957.

Cialdini, Robert B. (2001), Influence: Science and Practice, Needham Heights, MA: Allyn & Bacon.

comScore (2007), "Online Consumer-Generated Reviews Have Significant Impact on Offline Purchase Behavior," November 29, http://www.comscore.com/Press_Events/Press_Releases/2007/11/Online_Consumer_Reviews_Impact_Offline_Purchasing_Behavior.

Duan, Wenjing, Bin Gu, and Andrew B. Whinston (2008), "The Dynamics of Online Word-of-Mouth and Product Sales – An Empirical Investigation of the Movie Industry," Journal of Retailing, 84 (2), 233-242.

Ferraro, Rosellina, James R. Bettman, and Tanya L. Chartrand (2009), "The Power of Strangers: The Effect of Incidental Consumer Brand Encounters on Brand Choice," Journal of Consumer Research, 35 (February), 729-741.

Forman, Chris, Anindya Ghose, and Batia Wiesenfeld (2008), "Examining the Relationship Between Reviews and Sales: The Role of Reviewer Identity Disclosure in Electronic Markets," Information Systems Research, 19 (September), 291-313.

Gerstner, Eitan (1985), "Do Higher Prices Signal Higher Quality?" Journal of Marketing Research, 22 (May), 209-215.

Ginter, James L., Murray A. Young, and Peter R. Dickson (1987), "A Market Efficiency Study of Used Car Reliability and Prices," Journal of Consumer Affairs, 21 (Winter), 258-276.

Grant, Kelli B. (2013), "10 Things Online Reviewers Won't Say," Wall Street Journal, March 4, http://www.marketwatch.com/Story/story/print?guid=FB144D96-82A0-11E2-B54A-002128040CF6.

Hardie, Bruce G. S., Eric J. Johnson, and Peter S. Fader (1993), "Modeling Loss Aversion and Reference Dependence Effects on Brand Choice," Marketing Science, 12 (4), 378-394.

Hu, Nan, Paul A. Pavlou, and Jennifer Zhang (2006), "Can Online Reviews Reveal a Product's True Quality? Empirical Findings and Analytical Modeling of Online Word-of-Mouth Communication," in Proceedings of the 7th ACM Conference on Electronic Commerce (EC '06), 324-330.
IBM Global Business Services (2011), From Stretched to Strengthened: The Global Chief Marketing Officer Study, Somers, NY: IBM Global Business Services.

Jain, Shailendra Pratap and Durairaj Maheswaran (2000), "Motivated Reasoning: A Depth-of-Processing Perspective," Journal of Consumer Research, 26 (March), 358-371.

Kardes, Frank R., Maria L. Cronley, James J. Kellaris, and Steven S. Posavac (2004), "The Role of Selective Information Processing in Price-Quality Inference," Journal of Consumer Research, 31 (September), 368-374.

Keen, Andrew (2008), The Cult of the Amateur: How Blogs, MySpace, YouTube, and the Rest of Today's User-Generated Media Are Destroying Our Economy, Our Culture, and Our Values, Random House Digital.

Koh, Noi Sian, Nan Hu, and Eric K. Clemons (2010), "Do Online Reviews Reflect a Product's True Perceived Quality? An Investigation of Online Movie Reviews Across Cultures," Electronic Commerce Research and Applications, 9, 374-385.

Kunda, Ziva (1990), "The Case for Motivated Reasoning," Psychological Bulletin, 108, 480-498.

Lichtenstein, Donald R. and Scot Burton (1989), "The Relationship Between Perceived and Objective Price-Quality," Journal of Marketing Research, 26 (November), 429-443.

Loechner, Jack (2013), "Consumer Review Said To Be THE Most Powerful Purchase Influence," Research Brief from the Center for Media Research, http://www.mediapost.com/publications/article/190935/consumer-review-said-to-be-the-most-powerful-purch.html#axzz2Mgmt90tc.

Luca, Michael (2011), "Reviews, Reputation, and Revenue: The Case of Yelp.com," Working Paper 12-016, Harvard Business School, September 16.

Mayzlin, Dina (2006), "Promotional Chat on the Internet," Marketing Science, 25 (2), 155-163.

Mayzlin, Dina, Yaniv Dover, and Judith Chevalier (2012), "Promotional Reviews: An Empirical Investigation of Online Review Manipulation," unpublished working paper, August.

Mitra, Debanjan and Peter N. Golder (2006), "How Does Objective Quality Affect Perceived Quality? Short-Term Effects, Long-Term Effects, and Asymmetries," Marketing Science, 25 (3), 230-247.

Moe, Wendy W. and Michael Trusov (2011), "The Value of Social Dynamics in Online Product Ratings Forums," Journal of Marketing Research, 48 (June), 444-456.

Muchnik, Lev, Sinan Aral, and Sean J. Taylor (2013), "Social Influence Bias: A Randomized Experiment," Science, 341 (August 9), 647-651.

Netemeyer, Richard G., William O. Bearden, and Subhash Sharma (2003), Scale Development in the Social Sciences: Issues and Applications, 1st ed., Palo Alto, CA: Sage Publications.

Nielsen (2012), "Consumer Trust in Online, Social and Mobile Advertising Grows," http://www.nielsen.com/us/en/newswire/2012/consumer-trust-in-online-social-and-mobile-advertising-grows.html.

Obrecht, Natalie, Gretchen B. Chapman, and Rochel Gelman (2007), "Intuitive t-tests: Lay Use of Statistical Information," Psychonomic Bulletin & Review, 14 (6), 1147-1152.

Plassman, Hilke, John O'Doherty, Baba Shiv, and Antonio Rangel (2008), "Marketing Actions Can Modulate Neural Representations of Experienced Pleasantness," Proceedings of the National Academy of Sciences of the USA, 105 (January 22), 1050-1054.

Rao, Akshay R. and Kent B. Monroe (1989), "The Effect of Price, Brand Name, and Store Name on Buyers' Perceptions of Product Quality: An Integrative Review," Journal of Marketing Research, 26 (August), 351-357.

Robinson, John P., Phillip R. Shaver, and Lawrence S. Wrightsman (1991), "Criteria for Scale Selection and Evaluation," in Measures of Personality and Social Psychological Attitudes, eds. J. P.
Robinson, P. R. Shaver, and L. S. Wrightsman, San Diego, CA: Academic Press, 1-15.

Schlosser, Ann (2005), "Posting Versus Lurking: Communicating in a Multiple Audience Context," Journal of Consumer Research, 32 (September), 260-265.

Shiv, Baba, Ziv Carmon, and Dan Ariely (2005), "Placebo Effects of Marketing Actions: Consumers May Get What They Pay For," Journal of Marketing Research, 42 (November), 383-393.

Tellis, Gerard J. and Birger Wernerfelt (1987), "Competitive Price and Quality Under Asymmetric Information," Marketing Science, 6 (Summer), 240-253.

Tirunillai, Seshadri and Gerard J. Tellis (2012), "Does Chatter Really Matter? Dynamics of User-Generated Content and Stock Performance," Marketing Science, 31 (2), 198-240.

The Economist (2012), "Personalising Online Prices: How Deep Are Your Pockets? Businesses Are Offered Software That Spots Which Consumers Will Pay More," http://www.economist.com/node/21557798.

Tversky, Amos and Daniel Kahneman (1971), "Belief in the Law of Small Numbers," Psychological Bulletin, 76 (2), 105-110.

Tversky, Amos and Daniel Kahneman (1974), "Judgment Under Uncertainty: Heuristics and Biases," Science, 185, 1124-1131.

Wilson, Timothy D. and Jonathan W. Schooler (1991), "Thinking Too Much: Introspection Can Reduce the Quality of Preferences and Decisions," Journal of Personality and Social Psychology, 60 (February), 181-192.

Zeithaml, Valarie A. (1988), "Consumer Perceptions of Price, Quality, and Value: A Means-End Model and Synthesis of Evidence," Journal of Marketing, 52 (July), 2-22.

Zhu, Feng and Xiaoquan (Michael) Zhang (2010), "Impact of Online Consumer Reviews on Sales: The Moderating Role of Product and Consumer Characteristics," Journal of Marketing, 74 (March), 133-148.

Table 1. The predictive validities of three cues for technical quality in the marketplace and as perceived by consumers.

                                            Study 1: The Marketplace                    Study 2: Consumer Beliefs
Predictor                                   Point Estimate [95% Confidence Interval]    Point Estimate [95% Confidence Interval]
Difference in Average Rating (α1ΔA)          0.114 [0.050; 0.177]                        0.342 [0.300; 0.384]
Difference in Number of Ratings (α1ΔN)       0.148 [0.086; 0.210]                        0.098 [0.056; 0.139]
Difference in Price (α1ΔP)                   0.287 [0.225; 0.348]                        0.173 [0.132; 0.215]

Table 2. Brand perception measures and factor loadings in Study 3.

Brand Perception Measure                     Benefits    Affordability
Has the features/benefits you want             0.92         -0.08
Is a brand you can trust                       0.88         -0.25
Has high quality products                      0.86         -0.40
Offers real solutions for you                  0.85         -0.03
Is easy to use                                 0.82          0.07
Has the latest trends                          0.82         -0.05
Is durable                                     0.82         -0.34
Offers good value for the money                0.82          0.26
Looks good in my home                          0.80          0.02
Offers coordinated collections of items        0.80         -0.07
Is growing in popularity                       0.75          0.04
Is endorsed by celebrities                     0.32         -0.21
Is affordable                                  0.00          0.95
Is high-priced                                 0.23          0.83
Has a lot of sales or special deals           -0.50          0.80

Table 3. The influence of technical quality, price, and brand perceptions on user ratings.

                                  USER RATINGS FROM AMAZON.COM                 USER RATINGS FROM CONSUMERREPORTS.ORG
Predictors                        Point Estimate [95% Confidence Interval]     Point Estimate [95% Confidence Interval]
Technical Quality (α1Q)            0.111 [0.041; 0.183]                        -0.001 [-0.090; 0.087]
Price (α1P)                        0.101 [0.030; 0.172]                         0.098 [0.011; 0.185]
Benefits (α1BENEFIT)               0.185 [0.119; 0.252]                         0.081 [-0.002; 0.166]
Affordability (α1AFFORDABLE)      -0.080 [-0.147; -0.014]                      -0.119 [-0.202; -0.035]

Table 4. The influence of technical quality and online ratings on retained value.
                                  USER RATINGS FROM AMAZON.COM                 USER RATINGS FROM CONSUMERREPORTS.ORG
Predictors                        Point Estimate [95% Confidence Interval]     Point Estimate [95% Confidence Interval]
Technical Quality (α1Q)            0.091 [0.025; 0.157]                         0.109 [0.028; 0.190]
Average User Rating (α1A)          0.030 [-0.036; 0.096]                        0.018 [-0.063; 0.098]

Figure 1. The predictive validity of the average user rating as a function of its precision, and the predictive validities of the number of user ratings and price, in the marketplace (top panel; Study 1) and as perceived by consumers (bottom panel; Study 2).

Figure 2. The correspondence between average user ratings and actual quality in the marketplace (top panel; Study 1) and perceived quality according to consumers (bottom panel; Study 2).