Navigating by the Stars: What Do Online User Ratings Reveal About Product Quality?
Abstract
Consumers increasingly turn to online user ratings to inform purchase decisions, but little is
known about whether these ratings are valid indicators of product quality. We developed a
database of (1) quality scores from Consumer Reports, (2) user ratings from Amazon.com, (3)
selling prices from Amazon.com, and (4) brand perceptions from a proprietary industry survey.
Analyses reveal that the average and number of user ratings are only weakly related to quality
and far less diagnostic than price (Study 1). Yet, when consumers infer quality, they rely mostly
on the average user rating and much less on the number of ratings and price (Study 2). The
dissociation between user ratings and quality can be traced in part to the influence of firm
marketing actions on users’ evaluations. Controlling for quality, average user ratings are higher
for more expensive products, higher for brands with a reputation for offering more functional
and emotional benefits, and lower for brands with a reputation for affordability (Study 3). We
conclude that consumer trust in online user ratings is largely misplaced.
Key Words: online user ratings, quality inferences, consumer learning, brand image, price-quality heuristic
Consumers typically have to make an inference about a product’s quality before buying.
Traditionally consumers have drawn on marketer-controlled variables like price and brand name
to make such inferences (Rao and Monroe 1989), but the consumer decision context has changed
radically over the last several years. One of the most important changes has been the
proliferation of user-generated content (Keen 2008). Almost all retailers now provide user
reviews and ratings on their websites, and these reviews often play an important role in the
purchase process (Mayzlin, Dover, and Chevalier 2012).
Consider for instance a purchase decision recently faced by one of the authors: which car
seat to buy for his 1-year-old daughter. A search on Amazon.com turns up 156 options that can
be sorted based on average user rating. On top of the list is the “Britax Frontier 85.” It has an
average user rating of 4.60 (out of 5). Looking much further down the list is the “Cosco
Highback Booster” with an average rating of 4.10. On its face the Britax option seems to be of
higher quality. This interpretation rests on the assumption that the average user rating accurately
reflects product quality, but is this a valid assumption? Consumer Reports provides expert ratings
of quality for 106 car seats. These ratings are based on rigorous tests on dimensions like crash
protection, ease of use, and fit to vehicle. Surprisingly, the option with the marginal consumer
rating on Amazon.com is the top rated on Consumer Reports with a score of 69 out of 100. The
option that was so favorably reviewed by consumers on Amazon.com fares much more poorly on
Consumer Reports. With a quality score of 53, it is the second worst in the category. This is an
anecdotal example but it highlights the need for research that evaluates the utility of user ratings
for inferring quality.
THE RISE OF ONLINE USER RATINGS
Consumers increasingly rely on user ratings to inform their purchase decisions (Grant
2013; Mayzlin, Dover, and Chevalier 2012). A 2012 survey of 28,000 respondents in 56
countries found that online user ratings are the second most trusted source of brand information
(after recommendations from family and friends), with 70% of respondents indicating trust in
them (Nielsen 2012). Similarly, the 2012 Local Consumer Review Survey of 2,862 panel
members in the U.S., U.K., and Canada found that 72% of respondents said they trust online
reviews as much as personal recommendations, and 52% said that positive online reviews make
them more likely to use a local business (Anderson 2012). A recent study conducted by Weber
Shandwick in conjunction with KRC research found that 65% of potential consumer electronics
purchasers were inspired by customer reviews to purchase a brand that was not in their original
consideration set. Moreover, consumer electronics buyers are more than three times as likely to
consult user reviews as expert reviews when making a purchase decision (Loechner 2013).
This influence cuts across product and service categories. In addition to electronics,
online product ratings have been shown to influence consumers’ choice of books (Chevalier and
Mayzlin 2006; Forman, Ghose, and Wiesenfeld 2008), movies (Chintagunta, Gopinath, and
Venkataraman 2010; Duan, Gu, and Whinston 2008), restaurants (comScore 2007; Luca 2011;
Anderson and Macgruder 2012), online video games (Zhu and Zhang 2010), bath, fragrance, and
beauty products (Moe and Trusov 2011), hotel, travel, automotive, home, medical, and legal
services (comScore 2007), and more recently, restaurant patronage. For instance, a one star
increase in a restaurant’s rating on Yelp.com was found to lead to a sales increase of between 5%
and 9% (Luca 2011) and a 19% increase in the likelihood of selling out of tables during rush
hour (Anderson and Macgruder 2012). The importance of consumer ratings to companies is
further reflected in the fact that they affect stock market performance (Tirunillai and Tellis 2012).
Because of their influence with consumers, online ratings have become important inputs to
managerial decision-making. A recent survey of 1,734 Chief Marketing Officers spanning 19
industries and 64 countries found that 48% of CMOs formally track consumer online ratings
(IBM Global Business Services 2011). Managers use consumer online ratings as input for
decision-making in areas such as brand building, promotion, customer acquisition, customer
retention, product development, and quality assurance (Hu, Pavlou, and Zhang 2006).
ARE USER RATINGS PREDICTIVE OF QUALITY?
The steep rise of user ratings suggests that consumers think they are valuable for making
good decisions. A few recent papers in the marketing literature have challenged this assumption
(Hu et al. 2006; Koh, Hu, and Clemons 2010), but there has been little work that has empirically examined the correspondence between user ratings and product quality, and what work there is has examined only limited data sets (Chen and Xie 2008). A key objective of the present
investigation is to address this void in the literature by analyzing the correspondence between
user ratings and technical quality across a wide range of product categories. Following a long
tradition in the fields of marketing (Hardie, Johnson, and Fader 1993; Lichtenstein and Burton
1989; Mitra and Golder 2006; Tellis and Wernerfelt 1987), psychology (Wilson and Schooler
1991), and economics (Bagwell and Riordan 1991), we use the quality scores generated by
Consumer Reports as the most accurate indicator of technical product quality (Zeithaml 1988).
Consumer Reports is an independent source that is not allied in any way to any group of firms
and “has a scientific approach to analyzing quality through blind laboratory studies, which in
scope and consistency is unrivaled in the U.S. and in the world” (Tellis and Wernerfelt 1987, p.
244).
There is an intuitive rationale for believing that the average user rating is a good cue to
product quality. Use experience is a direct signal of quality that, unlike marketer-controlled sources of information such as advertising, promises a reliable and unbiased perspective. However, a review of the academic marketing literature reveals a variety of factors
that may undermine this rationale. We will distinguish between two broad classes of factors that
may limit the utility of average user ratings and expand on each of them below. First, there are
statistical concerns associated with the distribution of ratings. The average user rating only has
value to the extent that it provides a relatively precise and unbiased estimate of the population
average. There are several reasons to doubt this is generally the case, some of them documented
in previous research and some proposed here. Second, there are psychological reasons that
consumers may be unable to provide an unbiased perspective on their use experience. Intrinsic
product quality is often hard to discern and consumers are likely to draw on extrinsic cues when
forming their evaluations.
Previous research on the validity of user ratings has focused primarily on statistical
concerns associated with the non-representativeness of the distribution of ratings for the
population. Hu et al. (2006) found that review writers are more likely to be those that “brag” or
“moan” about their product experience, resulting in a bimodal distribution of ratings, whose
average does not give a good indication of the true population average. They see this non-representativeness as a serious challenge to the validity of online reviews, going so far as to say
“… the most commonly used factor in the literature—average review score—is mostly an
incorrect proxy of a product’s true quality. Making this unwarranted assumption can possibly
lead to erroneous conclusions about consumers’ online shopping behaviors and wrong marketing
decisions (p. 328).” Similarly, Koh, Hu, and Clemons (2010) provide evidence of cross-cultural
influences on the propensity to write a review, suggesting a lack of representativeness of reviews,
and a lack of validity in reflecting true quality as a consequence. Another issue leading to non-representativeness is review manipulation. Firms (or their agents) sometimes post fictitious
favorable reviews for their own products and services and/or post fictitious negative reviews for
the products and services of their competitors (Mayzlin, Dover, and Chevalier 2012).
Additionally, raters are influenced by previously posted ratings, creating herding effects (Moe
and Trusov 2011; Muchnik, Aral, and Taylor 2013; Schlosser 2005).
Non-representativeness is not the only statistical concern associated with average user
ratings. A summary statistic can only provide evidence about a population parameter to the
extent that it is precise. Precision is a function of two elements: sample size and variability in the
distribution. The average user rating should therefore provide a better indication of product
quality as the sample size of reviews increases and the variability of the underlying distribution
decreases. Unfortunately, average review ratings are often based on small samples (Hu et al.
2006). There are also reasons to suspect that variability tends to be quite high. User experiences
are subject to a lot of variation. Users may give a poor rating due to a bad experience with
shipping, may accidentally review the wrong product, or may blame a product for a failure that is
actually due to user error. Some consumers may view the purpose of product reviews differently
than others. For instance, some consumers may rate purchase value (quality for the money)
rather than product quality, thereby penalizing more costly brands, whereas others may rate
quality without considering price. Perhaps most fundamentally, there may be substantial
heterogeneity in consumer taste. All of these factors suggest that the average may often be too
imprecise to support a good estimate of quality.
Beyond these statistical issues there are psychological factors that cast doubt on the
ability of consumers to discern quality through use experience and post an unbiased rating.
Numerous studies have shown that marketing actions have a major influence on perceived
quality. When consumers drink from labeled bottles they rate the taste of their favorite beer
brand more favorably than the taste of other brands, but not so when they drink from unlabeled
bottles (Allison and Uhl 1964). Moreover, consumer memory is fallible and post-experience
marketing actions can bias product evaluations. Favorable post-experience advertising makes
memory for the quality of a bad tasting orange juice more positive (Braun 1999). Consumers
may also engage in motivated reasoning to justify buying certain kinds of products such as those
that are expensive or those made by a favored brand (Jain and Maheswaran 2000; Kunda 1990). This, too, biases evaluations: a product receives a higher rating simply because it is more expensive or manufactured by a favored brand, independent of its quality. These “top-down”
influences of extrinsic cues on quality perceptions are most pronounced when quality is difficult
to observe. There is good reason to believe that this is often the case. Product performance on
important dimensions is often revealed only under exceptional circumstances. For instance, when
considering a car seat, new parents would likely place a high value on crash protection, an
attribute that they will hopefully never be in a position to evaluate. More generally, product
quality may be difficult to evaluate in many categories (e.g., deck stains, tires, batteries),
especially in the short time-course between purchase and review posting, typically only days or
weeks.
Though most research in the literature has focused on the average user rating because it is
assumed this is what consumers pay the most attention to, there is another feature of the
distribution of user ratings that may relate to quality and is readily available to consumers when
they evaluate products online: the number of reviews. Retailers frequently use promotional
phrases such as “Over 10,000 Sold” as an indication that the product is one of quality and that
other consumers see it as such. Literature on social proof shows that consumers are strongly
influenced by the behavior of others (Cialdini 2001). This raises the possibility that the number
of user ratings provides information about quality independently of the average rating. To our
knowledge no previous research has addressed this possibility. Thus a secondary objective of our
analyses is to assess the extent to which the number of reviews is predictive of quality.
RESEARCH OBJECTIVES AND OVERVIEW OF STUDIES
The research has three broad goals. First, we aim to understand the predictive validity of user ratings for technical product quality in the marketplace. Second, we aim to evaluate consumer beliefs about the predictive validity of user ratings for technical quality. Third, we test the conjecture that user ratings are contaminated by the biasing influence of marketing actions. Together, the studies allow us to assess disconnects between consumer beliefs and reality in order to draw implications for consumer welfare and to make prescriptions for better decision-making.
In Study 1, we will examine the extent to which the average and number of user ratings
are predictive of product quality. Given the history of studies investigating the relationship
between technical product quality as measured by Consumer Reports and price (for reviews, see
Lichtenstein and Burton 1989, Tellis and Wernerfelt 1987), we include selling prices in our
analysis. Numerous studies dating back to the 1950s indicate that the correlation between price
and technical quality is approximately 0.20 to 0.30 (Tellis and Wernerfelt 1987). We
anticipate a similar relationship between price and quality in our study and we use the predictive
validity of price as a reference benchmark for evaluating the predictive validity of user ratings.
Recognizing that previous research has focused on sample non-representativeness as a challenge
to the validity of user ratings, we will also examine whether the predictive validity of the average
rating varies as a function of its precision. As we discussed above, we expect user ratings to be
more predictive of quality when they are based on larger samples and/or when the distribution of
ratings has lower variability. This study thus advances the literature by documenting an
additional challenge related to the statistical precision of the average.
Given the increasing importance of user ratings to consumer decision-making, it is likely
that consumers trust user ratings as indicators of quality, but no previous research has tried to
quantify this belief or compare its magnitude to the price-quality heuristic. In Study 2, we will
assess whether consumers have accurate intuitions about the relative predictive validity of the
different cues to quality. Using an ecologically valid experimental method we ask consumers to
judge the quality of products that vary in average user rating, number of ratings, and price. This
will allow us to compute the partial effect of each cue on quality perceptions and to compare the
effect sizes to those found in the market data in Study 1. A secondary goal of this study is to
assess whether consumers put more faith in user ratings that have greater precision. Many studies
have shown that consumers’ naïve statistical intuitions are flawed. They often jump to
conclusions based on summary statistics from small, inconclusive samples (Obrecht, Chapman,
and Gelman 2007; Tversky and Kahneman 1971). Thus we predict that consumers will not adjust
their quality inferences enough to account for differences in precision. Finally, this study adds to
the literature by being the first to evaluate the extent to which consumers rely on price as a cue
for quality in an environment where user ratings are simultaneously available.
As we reviewed above, product experience often provides only a noisy signal of quality,
and a consumer evaluating the product resolves this uncertainty by drawing on prior beliefs, for
example, higher price means higher quality. Thus, a high priced product or one from a good
brand might get the “benefit of the doubt” in the face of a less than favorable experience.
Similarly, a consumer might expect an inexpensive product to fail and be more sensitive to any
flaws. A related phenomenon is that consumers may be less likely to admit the faults of products
that they paid a lot for or that are manufactured by a favored brand. These processes suggest that
user ratings include a signal not just of technical quality but also of extrinsic marketing variables.
Variables such as price and brand name that positively correlate with expectations for
performance should positively bias user ratings. In Study 3, we supplement the database from
Study 1 with brand perception data. We present a new regression analysis that treats average user
rating as the dependent variable and marketing variables (price and brand perceptions) as the
independent variables. By also including Consumer Reports quality scores as predictors in the
model, we control for technical quality and isolate the contaminating influence of marketing
actions on average user ratings.
STUDY 1: THE MARKETPLACE
The goal of Study 1 was to examine the extent to which the average user rating and the
number of user ratings are valid cues for product quality in the marketplace and to compare their
predictive validities with that of price. For this purpose, we developed a database of quality
ratings from Consumer Reports for products across a wide range of product categories and their
corresponding average user ratings, number of ratings, and selling prices on Amazon.com.
Consumers typically evaluate products in a comparative context; they want to know how good
one product is relative to another, rather than the absolute quality of a product. For instance, a
consumer might ask how much of a quality difference can be expected if one product has an
average user rating that is 0.50 higher than another, or if it is rated 20 times more frequently than
another, or if it costs $50 more than another. In light of this, we used a comparative approach to
analyze the data. Rather than regressing the quality rating of an individual product on the three
predictive cues, we regressed the difference in quality between each pair of products within a
category on the difference in magnitude of the cues between the products.
Data
We visited the website of Consumer Reports (ConsumerReports.org) in February 2012
and extracted quality ratings for all items within all product categories where Consumer Reports
provides these data, except for automobiles and wine. This resulted in ratings for 3817 items
across 262 product categories. To ensure that product categories were relatively homogeneous
(and quality ratings were comparable across items within a category), we defined product
categories at the lowest level of abstraction. For example, Consumer Reports provides product
ratings for air conditioners subcategorized by BTUs (e.g., 5,000 to 6,500 as opposed to 7,000 to
8,200). That is, brands are only rated relative to other brands in the subcategory. Thus, we treated
each subcategory as a separate product category. We augmented this dataset with current prices
and user ratings from Amazon.com collected in June of 2012. For each item for which we had a
technical quality score from Consumer Reports, we searched the Amazon.com website and
recorded all user ratings and the price. We were able to find selling prices and at least one
Amazon.com consumer rating for 1669 items across 205 product categories. To ensure that our
conclusions were not disproportionally influenced by items with a very small number of user
ratings, we only retained items that were rated at least 5 times. This resulted in 1385 items across
192 product categories. Because our goal was to model brand pairs instead of individual brands,
we only retained product categories with data for at least three items. The final dataset consisted
of 1280 items across 121 product categories.
We then used these data to generate a pair-wise comparative database, which served as
the basis for all subsequent analyses. We took each product in a category and compared it to
every other item in the category. We then sorted all product pairs in alphabetical order such that
product A was the product that came first in alphabetical order and product B was the product
that came second. For each product pair, we computed the difference between product A and
product B in terms of quality score (ΔQ), average user rating (ΔA), number of user ratings (ΔN),
and price (ΔP). To make effect sizes comparable across cues we transformed all difference
scores such that they had a mean of zero and a standard deviation of one. Because the variation in quality, number of ratings, and selling price differed substantially across product categories, we standardized these difference scores within each product category. Differences in user ratings
were relatively homogeneous across categories and were standardized across the whole dataset.
The final pair-wise database consisted of 16,489 comparisons across the 121 product categories.
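For concreteness, the following is a minimal sketch of how such a pair-wise comparative database could be constructed in Python with pandas. The column names (category, product, quality, avg_rating, n_ratings, price) are illustrative assumptions, not the authors' actual variable names.

```python
# Sketch: build the pair-wise comparative database described above.
# Assumes a DataFrame `items` with illustrative columns:
#   category, product, quality, avg_rating, n_ratings, price
from itertools import combinations

import pandas as pd

def build_pairs(items: pd.DataFrame) -> pd.DataFrame:
    rows = []
    for category, grp in items.groupby("category"):
        # Compare every product to every other product within the category,
        # with product A preceding product B in alphabetical order.
        for a, b in combinations(sorted(grp["product"]), 2):
            ra = grp.loc[grp["product"] == a].iloc[0]
            rb = grp.loc[grp["product"] == b].iloc[0]
            rows.append({
                "category": category,
                "dQ": ra["quality"] - rb["quality"],
                "dA": ra["avg_rating"] - rb["avg_rating"],
                "dN": ra["n_ratings"] - rb["n_ratings"],
                "dP": ra["price"] - rb["price"],
            })
    pairs = pd.DataFrame(rows)

    # Standardize dQ, dN, and dP within category; dA across the whole dataset,
    # mirroring the standardization scheme described in the text.
    z = lambda s: (s - s.mean()) / s.std()
    for col in ["dQ", "dN", "dP"]:
        pairs[col] = pairs.groupby("category")[col].transform(z)
    pairs["dA"] = z(pairs["dA"])
    return pairs
```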
Results
Simple Correlations. As a first analysis of the effects of the three predictive cues on quality, we computed simple correlations between differences in the three predictive cues (ΔA, ΔN, and ΔP) and quality differences (ΔQ),¹ and then averaged these correlations across product categories. All three correlations were highly significant and positive (all p’s < .001). The correlation was highest for price (M = 0.29, SD = 0.52), followed by average user rating (M = 0.20, SD = 0.49), and then number of user ratings (M = 0.14, SD = 0.47).

¹ Consumer Reports as well as Amazon.com confound brands that are very similar. For instance, the Garmin Nuvi 1350T and the Garmin Nuvi 1370T have the same objective quality according to Consumer Reports. We therefore ran all analyses reported in this manuscript twice, once including brand pairs for which the brands have identical objective quality scores and/or identical average user ratings (number of pairs = 16,489) and once excluding these brand pairs (number of pairs = 15,578). The results are virtually identical for all analyses. In the manuscript, we report the results after excluding similar brands.
Hierarchical Regression Analysis. Simple correlations do not control for additional
predictors and therefore do not yield firm conclusions about the relative strength of the three
predictors. To more formally compare the extent to which differences in technical quality can be
predicted based on average user ratings, number of ratings, and price, we estimated the following
hierarchical regression model:
ΔQij = [α0 + βj,0] + ΔAij[α1ΔA + βj,1ΔA] + ΔNij[α1ΔN + βj,1ΔN] + ΔPij[α1ΔP + βj,1ΔP] + εij
ΔQij, ΔAij, ΔNij, and ΔPij indicate the difference between product A and product B for product
pair i in category j in terms of quality, average user rating, number of ratings, and price,
respectively. α0 is the fixed intercept, βj,0 are random intercepts that vary across categories, α1’s
are the fixed slopes, βj,1’s are the random slopes that vary across categories, and εij are error
terms. Because the random slopes (βj,1’s) have a mean of zero, the fixed slopes (α1’s) are the
expected value of the random slopes. The fixed slopes thus indicate the estimated predictive
validity of each cue averaged across product categories.
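This model can be estimated with standard mixed-effects software. Below is a minimal sketch using statsmodels, with the hypothetical column names from the pair-wise data frame sketched earlier; the authors do not state which package they used, so this is an illustration rather than their implementation.

```python
# Sketch: random-intercept, random-slope regression of quality differences
# on differences in average rating, number of ratings, and price.
import statsmodels.formula.api as smf

def fit_study1_model(pairs):
    # Random intercept and random slopes for dA, dN, dP across categories.
    model = smf.mixedlm(
        "dQ ~ dA + dN + dP",           # fixed effects (the alpha_1's)
        data=pairs,
        groups=pairs["category"],      # categories j
        re_formula="~dA + dN + dP",    # random intercepts/slopes (the beta_j's)
    )
    result = model.fit()
    print(result.summary())            # fixed slopes and 95% CIs, as in Table 1
    return result
```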
Table 1 shows point estimates and 95% confidence intervals for the fixed slopes of the three predictive cues. Because the variables are standardized, the fixed slopes can be compared
in magnitude to determine the relative strength of each of the cues for predicting quality,
controlling for the other cues. The slope of price is highest (B = 0.287, p < .001), followed by number of ratings (B = 0.148, p < .001) and average user rating (B = 0.114, p < .001). Thus, the partial effect of price on quality is about twice as large as the effect of average rating and number of ratings. Moreover, the 95% confidence interval for price does not overlap with the 95%
confidence intervals for average user rating and number of user ratings, indicating that price by
far explains the most variation in quality. The predictive validities of average rating and number
of ratings do not differ significantly.
[Insert Table 1 here]
Next, we looked at how the predictive validity of a difference in average user ratings
varies as a function of its precision. We measured the precision of the difference in average
rating between a product pair using the pooled standard error (SEP) of the two average ratings.
The pooled standard error is a function of the sample sizes (N) and variation in user ratings
(VAR) for the products A and B in a pair:
SEP = √[(VARA/NA) + (VARB/NB)]
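A minimal sketch of this precision measure, and of how it could be added to the model above as a main effect and interaction, is shown below; the column names (var_a, n_a, var_b, n_b, se_p) are illustrative assumptions.

```python
import numpy as np

def pooled_se(var_a, n_a, var_b, n_b):
    """Pooled standard error of the difference in average ratings
    between products A and B."""
    return np.sqrt(var_a / n_a + var_b / n_b)

# Example (hypothetical column names): add the precision measure and its
# interaction with dA to the mixed model sketched above.
# pairs["se_p"] = pooled_se(pairs["var_a"], pairs["n_a"],
#                           pairs["var_b"], pairs["n_b"])
# smf.mixedlm("dQ ~ dA * se_p + dN + dP", data=pairs,
#             groups=pairs["category"], re_formula="~dA + dN + dP").fit()
```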
For each pair-wise comparison in the database we calculated this precision measure and added it
to the regression model above as a main effect and as an interaction with the difference in
average user ratings. We predicted based on statistical theory that more precise differences in
average user ratings would explain more variability in quality. Consistent with this, the
regression revealed a significant negative interaction such that a lower standard error was
associated with greater predictive validity of the difference in average user rating (B = -0.217,
95% CI: [-0.311; -0.124]).
To summarize and visualize the results of this regression, the top panel in Figure 1 plots
the predictive validity (i.e. the fixed slope) for the average user rating as a function of the
precision of the difference in average user rating (i.e. the pooled standard error). The chart also
shows the predictive validities of the other cues which do not vary as a function of the standard
error of the average user rating in the model. At the mean level of precision (SEp = 0.306) the
fixed slope for average user rating is about half that of price and approximately the same as
number of ratings. The predictive validity of average user rating increases with precision, but even at the 5th percentile of precision (SEp = 0.106), the fixed slope is still only about two thirds
that of price. At the 95th percentile (SEp = 0.639), the average user rating is a very weak predictor
of quality, weaker than the number of ratings.
[Insert Figure 1 here]
Correspondence Analysis. The analyses above relate the magnitude of the difference in
average user rating to the magnitude of the difference in technical quality. Another way to look
at the data is to dichotomize the quality difference and evaluate the extent to which the average
user rating can predict which product is of higher quality. If one were randomly to select two
products in a category, and product A has a higher average user rating than product B, what is
the likelihood that it also has a higher quality and how does this likelihood vary as a function of
the magnitude of the difference in average user rating? To answer this question we present a
purely descriptive analysis. For each product pair we created a variable that was coded as “match”
if the product with the higher average user rating also had higher quality. It was coded as “mismatch” if the product with the higher average user rating had lower quality. Product pairs
with identical quality scores or user ratings were not included in this analysis.
To visualize the results, we binned product pairs according to the absolute magnitude of
the difference in average user rating (using a bin width of 0.20) and calculated for each of the
bins the proportion of times the average user rating correctly predicted quality. Those results are
plotted in the top panel of Figure 2 (solid line). Very few comparisons had rating differences
larger than 2 so the data are only shown for differences between 0 and 2 stars, which accounts
for approximately 95% of the database. Averaging across all comparisons, the correspondence
between the average user rating and quality is 57%. When the difference in user ratings is
smaller than 0.40, correspondence is 50%. These small differences have essentially no predictive
value. This percentage increases as the difference in user rating grows larger but the increase is
modest and correspondence never exceeds approximately 70%. Figure 2 also shows the
proportion of product pairs in each bin (dashed line). As the difference in average user rating
increases the proportion of product pairs decreases, reflecting that large rating differences
between products are relatively rare. This is consistent with previous work showing that the
distribution of user ratings tends to be left skewed, with most products rated as a 4 or 5 (Hu et al.
2006). It is notable that a large plurality of product comparisons yields small differences in
average user ratings and these have no predictive value at all. Approximately 50% of product
comparisons consist of rating differences that are smaller than 0.40 where correspondence with
quality is at chance.
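A minimal sketch of this descriptive binning analysis, again using the hypothetical pair-wise columns introduced earlier (here with unstandardized dQ and dA), might look as follows.

```python
import numpy as np
import pandas as pd

def correspondence_by_bin(pairs_raw: pd.DataFrame, bin_width: float = 0.20):
    """For each bin of |difference in average rating|, compute the share of
    pairs in which the higher-rated product also has the higher quality score.

    `pairs_raw` holds unstandardized dQ and dA; ties on either are dropped."""
    df = pairs_raw[(pairs_raw["dQ"] != 0) & (pairs_raw["dA"] != 0)].copy()
    df["match"] = np.sign(df["dA"]) == np.sign(df["dQ"])
    df["bin"] = (df["dA"].abs() // bin_width) * bin_width   # left edge of bin
    out = df.groupby("bin").agg(
        match_rate=("match", "mean"),                        # solid line
        share_of_pairs=("match", lambda s: len(s) / len(df)) # dashed line
    )
    return out
```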
[Insert Figure 2 here]
Discussion
This study was the first to examine how well user ratings can predict technical quality.
Across a wide range of product categories, we found that both the average user rating and the
number of user ratings are significant, positive predictors of technical quality, but the relationship is weak. The predictive accuracy of the average user rating falls off dramatically when it is based on smaller samples and/or the underlying distribution has greater variability. We also
examined the predictive validity of price and found it to be similar to what has been suggested in
past research (Tellis and Wernerfelt 1987). Notably, although the correlation between price and
quality is modest, it is a much better predictor than either the average or number of ratings.
STUDY 2: CONSUMER BELIEFS
Study 2 examined the extent to which average user ratings, number of user ratings, and
selling prices influence consumer perceptions of technical product quality with an eye to
comparing consumer beliefs to the marketplace relationships revealed in Study 1. This is the first
study we are aware of that evaluates how user ratings influence quality perceptions. Moreover,
numerous studies have documented a pervasive influence of price on perceptions of quality
(Kardes et al. 2004; Broniarczyk and Alba 1994) but none have examined price-quality beliefs in
a context where user ratings are also available as a basis for inference.
We asked consumers to search for pairs of products on Amazon.com and then to rate the
relative quality of these products. By choosing products that varied in terms of average user
rating, number of ratings and price we could conduct similar analyses to Study 1, this time using
consumer perceptions of quality as the dependent variable rather than the actual quality scores
from Consumer Reports. To avoid any demand effects due to the experimental paradigm we
designed the search and rating task to be as ecologically valid as possible and we gave
participants minimal training.
Method
Participants and procedure. Three hundred four respondents from Amazon Mechanical
Turk were paid $0.65 to complete the study. Participants were first informed that they would be
provided with pairs of products along with links to the URLs for the products at Amazon.com.
They would be asked to inspect the product webpages and judge the relative technical quality of
the two products. To ensure that participants understood that they were evaluating technical
quality as measured by Consumer Reports, we provided the following description on the
instruction page: “Expert ratings like those generated by Consumer Reports magazine are
generated by engineers and technicians with years and sometimes decades of expertise in their
field. They live with the products for several weeks, putting them through a battery of objective
tests using scientific measurements, along with subjective tests that replicate the user experience.
All models within a category go through exactly the same tests, side by side, so they're judged on
a level playing field, and test results can be compared.”
After reading the instructions, participants saw the first pair of products. They were asked
to click on the links and examine the products. They were then asked to rate “how you feel
experts at Consumer Reports magazine would likely rate the quality of the products relative to
each other.” The rating scale ranged from 1 (Product A would be rated as higher quality) to 10
(Product B would be rated as higher quality). After completing the ratings, participants were
provided with the next set of products. Each participant completed this relative quality rating for
two products in each of eight product categories. The two products were sampled at random from
a pre-determined set as specified below, and category presentation order was randomized as well.
Product Selection. We selected the 10 product categories with the most products from
data used in Study 1. We excluded MP3 players and kitchen knives because many items in these
categories were not comparable to one another. For example, in the kitchen knife category, some
products referred to single knives while others referred to knife sets. We then selected one
product from each brand to use in the study. Because the data from Study 1 was not current, we
visited the ConsumerReports.org and Amazon.com websites again and re-recorded the technical
quality scores, the average user rating, the number of ratings, and the selling price for each of
these products. The final set of stimuli resulting from this procedure included 4 printers, 7 digital
cameras, 2 GPS navigators, 4 cordless phones, 16 coffee makers, 10 steam irons, 8 food processors, and 7 strollers.
To verify the representativeness of these products for the marketplace, we calculated the
actual quality differences (the difference in Consumer Reports quality scores, as in Study 1)
between each potential product pair and we correlated the quality difference with the difference
in average rating, the difference in number of ratings, and the difference in price. Results were
similar to the full database in Study 1. Averaging over the eight product categories, all
correlations were positive. The correlation was largest for price (M = 0.36, SD = 0.45), next
largest for number of ratings (M = 0.22, SD = 0.45), and smallest for average user rating (M =
0.07, SD = 0.70).
Results
To make the results as comparable as possible to Study 1 we repeated all of the same
analyses, this time using relative perceived quality rather than actual quality difference as the
dependent variable. The data consisted of eight judgments of relative quality for each participant.
For each of these eight product pairs, we computed the difference between product A and
product B in average user rating (ΔA), number of user ratings (ΔN), and price (ΔP). We also
standardized all variables using the same procedure as Study 1.
Simple Correlations. As in Study 1, as a first indication of the results we computed
simple correlations between the three predictive cues and the outcome variable, in this case,
judgments of relative quality. Due to the experimental design, we averaged across participants
rather than across product categories. The correlation with perceived quality was highest for
average user rating (M = 0.42, SD = 0.41), followed by price (M = 0.20, SD = 0.42) and number
of ratings (M = 0.18, SD = 0.38). As in Study 1, all three correlations were significantly positive
(all p’s < .001). The pattern was markedly different however. In Study 1 the price difference was
the most potent predictor of actual relative quality. Here, the difference in average review was
the strongest predictor of perceived relative quality.
Hierarchical Regression Analysis. To more formally compare the extent to which
judgments of quality are related to average user ratings, number of ratings, and price, we
estimated the following hierarchical regression model, analogous to the model used in Study 1:
ΔQij = [α0 + βj,0] + ΔAij[α1ΔA + βj,1ΔA] + ΔNij[α1ΔN + βj,1ΔN] + ΔPij[α1ΔP + βj,1ΔP] + εij
ΔQij, ΔAij, ΔNij, and ΔPij indicate the difference between product A and product B in product
category i for participant j in terms of judged quality, average user rating, number of ratings, and
price, respectively. α0 is the fixed intercept, βj,0 are random intercepts that vary across
participants, α1’s are the fixed slopes, βj,1’s are the random slopes that vary across participants,
and εij are error terms.
Table 1 shows point estimates and 95% confidence intervals for the fixed slopes,
averaged across participants. Quality judgments are most strongly related to differences in
average user rating (B = 0.342, p < .001), followed by differences in price (B = 0.173, p < .001)
and differences in number of ratings (B = 0.098, p < .001). The 95% confidence interval for
average user rating does not overlap with the 95% confidence intervals for price and number of
user ratings, indicating that the average user rating explains the most variation in judgments of
quality by far. Again, these results are substantially different from the marketplace relationships
revealed in Study 1.
Next, we analyzed whether respondents weighted the difference in average user rating
differently depending on its level of precision. We calculated the standard error of the difference
in average user rating for each of the product comparisons and added this precision measure to
the regression model as a main effect and as an interaction with difference in average user rating.
The interaction effect was not significantly different from zero (95% CI: [-0.088; 0.376]),
indicating that consumers fail to moderate their quality inferences as a function of the precision
of the difference in average rating.
To summarize and visualize the results of this regression, the bottom panel in Figure 1
plots the predictive validity (i.e. the fixed slope) for each predictive cue as a function of the
precision of the difference in average user rating (i.e. the pooled standard error). Note that the
effect of average user rating on perceived quality appears to actually increase with standard error,
but this increase is not statistically significant. The figure shows that when consumers infer
quality they rely more on the average user rating than on price at all levels of standard error. This
stands in contrast to Study 1 where the market data revealed that the effect of price is larger than
the effect of user rating at all levels of standard error. In other words, consumers substantially
over-estimate the relation between user ratings and quality, and substantially underestimate the
relation between price and quality. Their beliefs about the effect of number of ratings on quality
are approximately consistent with marketplace reality.
Correspondence Analysis. As in Study 1 we performed a descriptive analysis,
dichotomizing the quality score and evaluating the correspondence between differences in user
ratings and judgments of quality superiority. We coded each judgment by a participant as a
match or a mismatch. It was coded as “match” if the product with the higher average user rating
was also judged to have higher quality. The comparison was coded as “mismatch” if the product
with the higher average user rating was judged to have lower quality. We placed all product
comparisons with rating differences smaller than 2 into bins of 0.20 stars and calculated the
percentage of matches in each bin. Those results are shown with the black line in the bottom
panel of Figure 2. The plot also includes a dotted line denoting the proportion of product pairs in
the database represented in each bin. The gray line, reproduced from the corresponding graph in
Study 1 (see top panel), shows the correspondence between user ratings and quality in the
marketplace for reference. Consumers believe that differences between 0.20 and 0.40 predict quality superiority 65% of the time. The market data (gray line) show that such comparisons have no predictive value. As the difference in average user rating grows larger, consumer judgments of quality follow the average user rating about 80% of the time. This again is in contrast to the market data, where correspondence maxes out at about 70%. In summary, the black line denoting consumer beliefs lies much higher and is much steeper than the gray line denoting marketplace
reality. This indicates that consumers substantially overestimate the predictive validity of
differences in user ratings.
Discussion
This study was the first to examine the influence of user ratings and selling prices on
judgments of product quality. Consumers perceived a significant positive relation of all three cues—price, average user rating, and number of user ratings—to quality. Average user
rating is by far the strongest predictor of quality perceptions. We also found that consumers fail
to moderate their reliance on the average user rating as a function of its precision. We suspected
this would be the case based on previous research on consumers’ flawed statistical intuitions.
Although the effect of average ratings on quality inferences may not be surprising in light
of many studies documenting their pervasive influence on consumer and managerial decision-making, it is notable that consumers think the average user rating is so much more predictive of
quality than price. The price-quality heuristic is one of the most studied phenomena in the
marketing literature and typically considered to be one of the cues that consumers treat as most
diagnostic of quality. Our study is the first to revisit the strength of the perceived price-quality
relationship in a multi-cue environment where user ratings are simultaneously present, which is a
common mode in which consumers evaluate products today. The moderate effect of number of
ratings on judgments of quality could also have been anticipated based on prior research on the
influence of social proof. But to our knowledge, consumers’ reliance on the number of user
ratings as a cue for quality had not been examined in previous research. Our findings suggest that
the number of user ratings influences consumer decision making (and ultimately, sales) and may
have been understudied. This may be an interesting avenue for future research.
In sum, comparing Studies 1 and 2 reveals substantial mismatches between the actual
relationships between the predictive cues and quality in the marketplace and the way consumers
use these cues when making inferences about product quality. Consumers substantially
overestimate the predictive validity of the average user rating and they fail to anticipate that less
precise averages are less predictive of quality. They also substantially underestimate the
predictive validity of price presumably because price gets overshadowed by user ratings in a
context like Amazon.com where both cues are simultaneously available.
STUDY 3: ANTECEDENTS OF AVERAGE USER RATINGS
Study 1 revealed that average user ratings are not very diagnostic of technical quality. We
hypothesized that one reason for this is the biasing influence of marketing actions on user
evaluations. To test this idea in Study 3 we supplemented the market data with brand perception
data and conduct a new regression that predicts average user rating as a function of price, brand
perceptions and technical quality. By controlling for quality we isolate the contaminating
influence of extrinsic cues on user ratings. Partial effects of these cues on ratings would provide
evidence for our conjecture and would cast more doubt on the utility of user ratings for inferring
quality. In Studies 1 and 2 we used a comparative approach because quality inferences from
marketplace cues are often made in a multi-option environment. In Study 3 we use a non-comparative approach because consumers typically purchase and evaluate only a single option.
Data
We supplemented the database used in Study 1 with brand perception measures from a
proprietary consumer survey. This survey is administered to a representative sample of U.S.
consumers annually and asks multiple questions about shopping habits and attitudes toward
retailers and brands across numerous product categories. We obtained data from three versions of
the survey that together covered most of the product categories examined in Study 1: electronics
(e.g. televisions, computers, cell phones), appliances and home improvement (e.g. blenders,
refrigerators, power tools), and housewares (e.g. dishes, cookware, knives). For the brand
perception portion of the survey, participants were first asked to rate familiarity of all brands in
the category and then were asked further about brand attitudes for three brands for which their
familiarity was high. All brand perception questions were asked on five-point agree/disagree
Likert scales. The brand perception questions differed somewhat across the three versions of the
survey, so we retained data only for the 15 brand perception questions that were asked in all
three versions of the survey. We cleaned the data by removing data from any participants who
did not complete the survey or who gave the same response to all brand perception questions.
When merging the brand perception data with our existing data set, we were able to obtain brand
perception measures for 888 products representing 132 brands across 88 product categories. The
data consisted of ratings from 37,953 respondents with an average of 288 sets of ratings for each
brand, and substantially more for well-known brands.
For purposes of data reduction, we submitted the average value for each brand for each of the 15 questions to a principal components analysis with a varimax rotation, restricting the number of factors to two.² The two factors accounted for 71% of the variance in the data set. We interpreted the first factor to represent perceived functional and emotional benefits (12 items) and the second factor to represent perceived affordability of the brand (3 items). As all inter-item correlations met or exceeded levels advocated in the measurement literature (see Netemeyer, Bearden, and Sharma 2003; Robinson, Shaver, and Wrightsman 1991), we averaged the respective scale items to form two brand perception measures. The individual scale items loading on each of the respective factors are shown in Table 2. The correlation between the two subscales was moderately negative (r = -0.21). This suggests that consumers see brands that provide more benefits as being less affordable.

² An unrestricted principal components analysis yields 3 factors explaining 83% of the variance in the dataset. The three factors can be interpreted as brand associations related to functional benefits (7 items), emotional benefits (5 items), and price (3 items). While loading on separate factors, multi-item scales comprised of the respective emotional and functional items were highly correlated (r = 0.76), leading to multicollinearity issues in subsequent regression analyses. Upon inspection of all brand equity items, the price-based items represented sentiments related to sacrificing resources for the purchase (e.g., “is affordable”), while the functional and emotional items represented what is received in the purchase (e.g., “is durable” and “is growing in popularity”). Therefore, we reconducted the principal components analysis using the a priori criterion of restricting the number of factors to two (Hair, Anderson, Tatham, and Black 1995).
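A minimal sketch of this data-reduction step, implementing principal components with a varimax rotation directly in numpy, is shown below. The brand-by-item matrix X and the two-factor restriction mirror the description above; this is an illustration under those assumptions, not the authors' code.

```python
import numpy as np

def varimax(loadings, max_iter=100, tol=1e-6):
    """Varimax rotation of a loading matrix (items x factors)."""
    p, k = loadings.shape
    R = np.eye(k)
    d = 0.0
    for _ in range(max_iter):
        L = loadings @ R
        u, s, vt = np.linalg.svd(
            loadings.T @ (L**3 - (L @ np.diag(np.sum(L**2, axis=0))) / p)
        )
        R = u @ vt
        d_new = np.sum(s)
        if d_new < d * (1 + tol):   # stop when the criterion stops improving
            break
        d = d_new
    return loadings @ R

def two_factor_solution(X, n_factors=2):
    """X: brands x 15 items matrix of average perception scores (illustrative)."""
    Xc = (X - X.mean(axis=0)) / X.std(axis=0)           # standardize items
    corr = np.corrcoef(Xc, rowvar=False)                # 15 x 15 correlations
    eigvals, eigvecs = np.linalg.eigh(corr)
    order = np.argsort(eigvals)[::-1][:n_factors]       # two largest components
    loadings = eigvecs[:, order] * np.sqrt(eigvals[order])
    explained = eigvals[order].sum() / eigvals.sum()    # share of variance
    return varimax(loadings), explained
```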
[Insert Table 2 here]
Results
We analyzed average user ratings on Amazon.com with the following hierarchical
regression model:
Aij = [α0 + βj,0] + Qij[α1Q + βj,1Q] + Pij[α1P + βj,1P] + BENEFITij[α1BENEFIT] +
AFFORDABLEij[α1AFFORDABLE] + εij,
Aij indicates the average user rating for product i in product category j, Qij indicates the objective
quality of the product, Pij indicates the selling price of the product on Amazon.com, BENEFITij
indicates the extent to which the brand has positive functional and emotional associations, and
AFFORDABLEij indicates the extent to which the brand is perceived to be affordable, α0 is the
fixed intercept, βj,0 are random intercepts that vary across categories, α1’s are the fixed slopes,
βj,1’s are the random slopes that vary across categories, and εij are error terms. We standardized
all variables by product category before analysis such that they have a mean of zero and a
standard deviation of one.
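A minimal sketch of this non-comparative model, again using statsmodels mixed-effects syntax with illustrative column names (avg_rating, quality, price, benefit, affordable, category), is given below; per the equation, only quality and price carry random slopes.

```python
import statsmodels.formula.api as smf

def fit_study3_model(products):
    # Average user rating regressed on quality, price, and the two brand
    # perception factors; random intercepts and random slopes for quality
    # and price across product categories.
    model = smf.mixedlm(
        "avg_rating ~ quality + price + benefit + affordable",
        data=products,
        groups=products["category"],
        re_formula="~quality + price",
    )
    return model.fit()
```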
Table 3 shows point estimates and 95% confidence intervals for the fixed slopes.
Consistent with Study 1, superior technical quality is associated with a higher user rating on
Amazon.com (B = 0.111, p < .01). The effect for selling price is significant and positive and
about the same magnitude as the effect of technical quality (B = 0.101, p <.01). The positive
effect of selling price on user ratings while controlling for technical quality and brand reputation
suggests that user ratings are not indicators of value (i.e. quality – price) but rather that price
plays a “positive” role, as an extrinsic cue for quality (i.e., quality + price).³ In line with the
positive effect of selling price on the average user rating, the effect of the brand’s reputation for
affordability was significant and negative (B = -0.080, p <.05). Brands with a discount reputation
are thus rated less favorably by users, controlling for quality. Finally, there was a significant positive effect of the perceived emotional and functional benefits offered by a brand (B = 0.185, p < .001).

³ Based on theory (see introduction), we assume here that the flow of influence runs from price to Amazon.com user ratings. An alternative interpretation for the positive effect of price is that Amazon.com raises/lowers its prices in response to user ratings. While we are aware that Amazon.com sets prices based on individual-level data that relates to the consumer’s price sensitivity (e.g., the consumer’s previous purchase history or the browser the consumer is using; see The Economist 2012), we are unaware of any source that has alleged that Amazon.com uses user ratings to set prices across consumers. Nevertheless, in order to gain some insight into this issue, we collected Amazon.com prices for the brands in our dataset at three additional points in time (09-22-2012, 11-22-2012, and 01-22-2013). We scraped the Amazon.com user ratings and prices used in Study 1 from the website on 02-14-2012. If user ratings influence prices, we would expect to find a positive correlation between these ratings and subsequent price changes. That is, higher ratings at time 1 (i.e., 02-14-2012) should be positively related to price changes from time 1 to time 2 (i.e., the difference in price between any of these three additional times and the price on 02-14-2012). Thus, we calculated three price changes and found that these price changes were not significantly related to the Amazon.com user ratings on 02-14-2012 (rSep = .01, p > .87; rNov = .04, p > .35; rJan = -.01, p > .74), which is inconsistent with the reverse causality argument.
In sum, we find that price affects user ratings in its positive role, both at the product level
(i.e., the effect of selling price) and at the brand level (i.e., the effect of a brand’s discount
reputation). These analyses suggest that higher market prices lead to more favorable user ratings,
and that perceptions of being a high-priced brand further inflate ratings. Brands with a better reputation for offering benefits also receive inflated ratings.
[Insert Table 3 here]
To assess the generalizability of these findings, we obtained user ratings and selling
prices from a second source. Subscribers to the Consumer Reports website not only have access
to the product quality ratings performed by Consumer Reports’ experts but also have the
opportunity to provide product ratings using the same 1-to-5 star rating system as Amazon.com
for any of the products they purchase, and these ratings are visible to other subscribers to the
website. This sample serves as a strong test of the hypothesis that marketing variables bias user
ratings, in the sense that Consumer Reports subscribers are interested in technical quality and
may be less likely to use price and brand as cues to quality when evaluating products.
ConsumerReports.org had at least one user rating for 602 items in our dataset. Using this
set of items, we repeated the analysis above substituting the average user ratings from
Amazon.com with the average user ratings listed on ConsumerReports.org. Table 3 provides
point estimates and confidence intervals for the fixed slopes. The results are mostly consistent
with those from Amazon.com user ratings. The effect for selling price is positive and significant
(B = 0.098, p < .05), the effect of a brand’s reputation for affordability is negative and significant
(B = -0.119, p < .01), and the effect of the perceived emotional and functional benefits offered by
a brand is positive and marginally significant (B = 0.081, p = .05). The only major difference
between the two sources of user ratings is that the effect of technical quality on
ConsumerReports.org user ratings is not significant.
Discussion
Study 3 tested the hypothesis that user ratings are biased by extrinsic marketing variables
including price and brand name. To test this prediction, we regressed average user ratings on
technical quality, price and brand perceptions. Controlling for technical quality, ratings were
indeed higher for more expensive products and for brands perceived to deliver more benefits.
Ratings were lower for brands perceived to be affordable. In fact, substantially more of the
variability in user ratings was attributable to price and brand perceptions than to technical quality,
suggesting that user ratings are, in large part, determined by the top-down influence of marketing
actions.
We also tested the hypothesis with a second sample of user ratings from Consumer
Reports subscribers. Surprisingly, these ratings were not related to technical quality at all. This
may be because ConsumerReports.org has fewer ratings per product than Amazon.com
(MEDIAN = 4 vs. 67) and, as we have shown, the predictive validity of the average user rating
depends in part on having a sufficient sample size. Nonetheless, the biasing effect of marketing
variables did emerge just as with the user ratings from Amazon.com. Evidently, these effects are
strong enough to come through even with the limited samples sizes available on
ConsumerReports.org. These results are ironic in the sense that Consumer Reports subscribers
are particularly interested in technical quality.
We wondered whether consumers anticipate the positive effect of price on user ratings so
we asked 56 Amazon Mechanical Turk workers to imagine the following scenario: Consumer
Reports tested the quality of two brands of blenders, one selling for $50 and the other for $75, and the tests revealed that both brands are equal in terms of quality. We then asked
respondents which brand they deemed most likely to have the higher Amazon.com user rating.
Seventy percent of respondents thought that the less expensive brand would have a higher user
rating, 18% thought that the more expensive brand would have a higher user rating, and 13%
thought that both brands would be rated equally. In other words, a majority of respondents
believed that price feeds into ratings in its negative role, as an indicator of purchase value,
contrary to the results of Study 3.
GENERAL DISCUSSION
This research had three broad objectives. First, we sought to evaluate the predictive validity of online user ratings for technical product quality in the marketplace. Previous research has raised concerns about the non-representativeness of the sample of ratings for the population. We raised several other concerns, including the limited statistical precision of the average rating and the likelihood that raters are influenced by extrinsic marketing variables. Analysis of
secondary data showed that these concerns are well founded. The relationship between average
user ratings and quality is weak. Not controlling for other predictors, having a higher Amazon
user rating predicts having a higher quality score only 57% of the time. Differences in user
ratings lower than 0.40 have essentially no predictive value and larger differences predict quality
superiority only about 65% of the time. We also found that the predictive validity of the average
user ratings depends substantially on its precision as measured by standard error. As sample size
increases and/or the variability of the distribution decreases, user ratings become better
predictors of quality. Unfortunately, the ratings of a large proportion of products in the marketplace lack sufficient precision to yield a reliable estimate. The average user rating is about as predictive of quality as the number of ratings and much less predictive than price. Price is a useful benchmark because it has been studied extensively; that research generally highlights the weakness of price as a predictor of quality and cautions that consumers tend to rely too heavily on price as a cue for quality (Gerstner 1985; Kardes et al. 2004; Lichtenstein and Burton 1989).
The second research objective was to examine the extent to which consumers rely on user
ratings when making inferences about product quality. Consumers searched for pairs of products
on Amazon.com that were chosen to vary in average rating, number of ratings, and price. They
then indicated their perception of the relative technical quality of the two products. The key
result was that the average user rating was by far the strongest predictor of quality perceptions.
Consumers also did not moderate their quality inferences depending on the precision of the
difference in average user rating. The other cues, number of ratings and price, were also treated
as somewhat informative of quality but much less than the average rating. These results stand in
stark contrast to the marketplace relationships revealed in Study 1. We conclude that consumers
overestimate the predictive validity of the average user rating, underestimate the predictive
validity of price in a multi-cue environment that also includes user ratings, and fail to anticipate
that the precision of the average user rating determines how strong a quality inference one can
make.
The third objective was to test the conjecture that user ratings are contaminated by
extrinsic marketing variables including price and brand name. Controlling for technical quality,
Amazon.com user ratings are higher for products that cost more, and for brands with a reputation
for offering benefits. Ratings are lower for brands with a reputation for affordability. These
results generalized to user ratings from subscribers to Consumer Reports, a sample of consumers
who are particularly interested in technical quality.
An Illusion of Validity
Why is it that consumers are so off the mark in their interpretation of user ratings?
Classic research on the psychology of prediction may provide an explanation. In describing a
judgment fallacy they call the “illusion of validity,” Tversky and Kahneman (1974) write: “… people often predict by selecting the outcome that is most representative of the input. The confidence they have in their prediction depends primarily on the degree of representativeness (that is, on the quality of the match between the selected outcome and the input) with little or no regard for the factors that limit predictive accuracy” (p. 1126). User ratings may be highly
representative of quality in the minds of consumers because they are intended to be direct
measures of quality. Amazon.com for instance provides instructions to review writers that focus
on evaluation of product features and comparison between similar products (see:
http://www.amazon.com/gp/community-help/customer-reviews-guidelines). Other predictive
cues, like price and the number of reviews, bear less resemblance to quality and consumers may
therefore not rely on these cues to the same extent as the average user rating.
Further contributing to an illusion of validity are people’s erroneous intuitions about the
laws of chance. People tend to believe that the characteristics of a randomly drawn sample (e.g.,
its mean) are very similar to the characteristics of the overall population. For instance, Obrecht et al. (2007) found that, when presented with distributions of product ratings, consumers showed virtually no understanding of sampling theory. They were insensitive to the number of ratings
and the standard deviation of those ratings. Because higher standard errors attenuate the
correlation between a predictive cue and an outcome, people should weight a cue less as the
standard error increases. Participants’ failure to moderate their reliance on the average rating as a
function of its precision (Study 2) is another instantiation of people’s flawed understanding of
statistical principles.
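To make the role of precision concrete, consider a small illustrative simulation (with assumed rating distributions and noise levels, not our data): as the number of ratings grows, the standard error of the observed average shrinks and its correlation with the true underlying mean rises.

```python
import numpy as np

rng = np.random.default_rng(0)

def rating_validity(n_ratings: int, n_products: int = 2000) -> float:
    """Correlation between products' true mean ratings and the observed
    averages of n_ratings noisy individual ratings (illustrative values only)."""
    true_means = rng.uniform(3.0, 5.0, size=n_products)          # latent "true" rating
    noise = rng.normal(0.0, 1.0, size=(n_products, n_ratings))   # rater-to-rater noise
    observed = np.clip(true_means[:, None] + noise, 1.0, 5.0).mean(axis=1)
    return float(np.corrcoef(true_means, observed)[0, 1])

# Fewer ratings -> larger standard error -> weaker link to the true mean.
for n in (5, 25, 100):
    print(n, round(rating_validity(n), 2))
```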
We have also shown that user ratings are biased by marketing variables such as price and
brand name. People may fail to anticipate such bias because these variables often enter into
evaluations outside of awareness (Ferraro, Bettman, and Chartrand 2009; Shiv, Carmon, and
Ariely 2005). Because consumers are not aware of drawing on marketing variables when making their own quality evaluations, they fail to anticipate that these variables also contaminate the evaluations of others.
Implications for Consumer Welfare
We have shown that consumers dramatically overestimate the correspondence between
user ratings and expert ratings of technical quality. An important implication may be that
consumers who are relying on user ratings as a cue for quality are making suboptimal decisions.
This conclusion rests on the assumption that expert ratings are better measures of quality and
hence better predictors of consumption utility than user ratings. Though this assumption is
commonly made in the academic literature, it is debatable. One could argue that Consumer Reports experts might have a different value function over attributes than consumers.
It could be that user ratings do a better job of predicting consumption utility, which would mean
that user ratings are facilitating good decisions despite their lack of correspondence to the expert
evaluations. This argument is weakened by the fact that the correspondence increases with
statistical precision, suggesting that the lack of correspondence is due in part to user ratings
being based on unreliable estimates from insufficient sample sizes. It is also weakened by the
fact that in many categories the attributes of importance to consumers are not readily evaluable
in an ordinary usage context but are evaluated by the testers at Consumer Reports (e.g., the safety of an infant car seat). Further evidence comes from Mitra and Golder (2006), who show that quality improvements made to products first show up in expert evaluations like those of Consumer Reports and are only eventually reflected in consumer sentiment, with a typical lag of five to six years. This is one more reason to doubt that consumers can accurately
report quality in the timeframe within which reviews are normally written, often days or weeks
after purchase.
Beyond these arguments, we sought to shed new light on the appropriateness of operationalizing Consumer Reports expert ratings as a measure of technical quality by testing whether expert ratings and user ratings can predict a market-based measure of
quality. Resale price of used products is often used in marketing contexts as an indicator of
product quality. For instance, auto manufacturers commonly tout that a particular car has “the
highest resale value of any car in its class.” Ginter, Young, and Dickson (1987) calculated the
ratio of the average retail price of a model of used car to its manufacturer’s suggested list price.
They used this as a dependent variable to gauge the influence of the car model’s reliability and
performance as measured by Consumer Reports. Their rationale was that cars with better
reliability and performance should retain more of their original selling price. We use the same
rationale here; products with better performance (quality) should retain more of their original
selling price. To augment our database with retained value data, we scraped prices from the
camelcamelcamel.com website. This website reports new and used prices of products sold by
third parties on the Amazon.com website. This allowed us to calculate a measure of retained
value: the lowest third-party selling price at the time of the data scraping (January 2013) across all third-party sellers of the product on the Amazon.com website, divided by the lowest price of the product sold as new by a third-party seller on the Amazon.com website.4
In order to assess the relative ability of Consumer Reports quality ratings and user ratings
to predict retained value, we regressed retained value on Consumer Reports quality scores and user
ratings. In the first analysis we used user ratings from Amazon.com (971 products). In the second
analysis we used the user ratings from ConsumerReports.org (625 products). The regression
models were specified as follows:
RVij = [α0 + βj,0] + Aij[α1A + βj,1A] + Qij[α1Q + βj,1Q] + εij,
where RVij indicates retained value for product i in product category j, Aij indicates the average user rating on either Amazon.com or ConsumerReports.org, Qij indicates the objective quality of the product, α0 is the fixed intercept, βj,0 are random intercepts that vary across categories, the α1's are the fixed slopes, the βj,1's are the random slopes that vary across categories, and εij are error terms.
We standardized all variables by product category before analysis such that they have a mean of
zero and a standard deviation of one.
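To make the pipeline concrete, the retained-value measure, the within-category standardization, and the mixed-effects specification can be sketched as follows. Column names are hypothetical and this is an illustration of the model above, not the code used to produce Table 4.

```python
import pandas as pd
import statsmodels.formula.api as smf

def fit_retained_value_model(df: pd.DataFrame):
    """Sketch of the retained-value regression described in the text.

    Assumes hypothetical columns 'category', 'used_price_3p', 'new_price_3p',
    'quality_score', and 'avg_user_rating' (from either ratings source).
    """
    d = df.copy()

    # Retained value: lowest third-party selling price divided by the lowest
    # third-party price of the product sold as new.
    d["retained_value"] = d["used_price_3p"] / d["new_price_3p"]

    # Standardize each variable within its product category (mean 0, SD 1).
    def z(s):
        return (s - s.mean()) / s.std()

    for col in ["retained_value", "avg_user_rating", "quality_score"]:
        d[col] = d.groupby("category")[col].transform(z)

    # Fixed slopes for the average user rating and quality, with random
    # intercepts and random slopes that vary across product categories.
    model = smf.mixedlm(
        "retained_value ~ avg_user_rating + quality_score",
        data=d,
        groups=d["category"],
        re_formula="~ avg_user_rating + quality_score",
    )
    return model.fit()
```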
Table 4 shows point estimates and 95% confidence intervals for the fixed slopes.
4 In cases where there are no current third-party sellers of the new or used product, the website reports the most recent used/new price.
Consumer Reports quality scores were significantly related to retained value (B = 0.091 and p
< .01 when controlling for Amazon.com user ratings; B = 0.109 and p < .01 when controlling for
ConsumerReports.org user ratings). User ratings were not, regardless of their source (B = 0.030
and p = .37 for Amazon.com; B = 0.018 and p = .66 for ConsumerReports.org). Taken together,
these results and arguments suggest that Consumer Reports quality scores are a valid measure of
quality and one that is relevant to consumer welfare.
[Insert Table 4 here]
Regardless of the outcome of these analyses and future similar analyses, it will be
difficult to completely rule out the possibility that user ratings capture some elements of the
consumption experience that are not captured by more objective sources of data. Consumer
perceptions can be driven by top-down influences. A good brand name or a high price can
overshadow the intrinsic properties of a product, heighten an experience, and provide pleasure
(Shiv, Carmon, and Ariely 2005) even at the neural level (Plassmann et al. 2008). Nonetheless,
even if consumer ratings are somewhat reflective of consumption utility, the point still stands
that there is a disconnect between what user ratings reflect and what consumers think they reflect.
Such misperceptions are unlikely to serve consumers well.
Limitations and Ideas for Future Research
Given the importance of review ratings to consumer decision-making, further study in this
area would be beneficial. One important limitation of our analyses is that we looked only at user
ratings. When customers provide reviews, they usually have the opportunity to also write a
narrative description of their product experience. Future research might look at how often
consumers read these narratives, how they integrate the narrative information with the
quantitative ratings, how they choose which narratives to read and whether the narratives help or
hinder consumers’ quality inferences.
Another limitation of our data is that we represented quality as a unidimensional score.
Consumer Reports rates products on numerous dimensions in addition to giving an overall score.
Future studies could look at how different types of attributes predict consumer ratings. We
suspect that attributes that are highly evaluable by consumers over a short time horizon (e.g. ease
of use) would predict more variance in reviews than inevaluable dimensions like safety and
reliability. The methodological challenge in implementing this study is that Consumer Reports
typically rates a relatively small set of products on a relatively large set of attributes. Model fitting therefore suffers from having too many parameters relative to the number of observations.
A third limitation of our data is that it does not cover the full range of products and
services for which people consult online user reviews. We used a wide range of categories—
essentially all categories rated by Consumer Reports—but these categories are primarily durables
such as electronics and appliances. Online review ratings are also pervasive in the evaluation of
more experiential products like alcoholic beverages (e.g., winespectator.com, beeradvocate.com)
and services like restaurants (e.g., Yelp.com), hotels (e.g., tripadvisor.com), and contractors (e.g.,
angieslist.com). Future research might explore whether the findings we have documented
generalize to other types of products and services.
Conclusion
User ratings are a perfect storm of forces that lead to information processing biases, and
these forces conspire to create a powerful illusion of validity in the minds of consumers.
Consumers should proceed with much more caution than they do when using reviews to inform
their decisions.
References
Allison, Ralph I. and Kenneth P. Uhl (1964), “Influence of Beer Brand Identification on Taste
Perception,” Journal of Marketing Research, 1 (August), 36-39.
Anderson, Myles (2012), “Study: 72% of Consumers Trust Online Reviews as Much as Personal
Recommendations,” http://searchengineland.com/study-72-of-consumers-trust-onlinereviews-as-much-as-personal-recommendations-114152, March 12.
Anderson, Michael and Jeremy Magruder (2012), “Learning from the Crowd: Regression
Discontinuity Estimates of the Effects of an Online Review Database,” The Economic
Journal, 122 (September), 957-989.
Bagwell, Kyle and Michael H. Riordan (1991), “High and Declining Prices Signal Product
Quality,” American Economic Review, 81 (March), 224-239.
Braun, Kathryn A. (1999), “Postexperience Advertising Effects on Consumer Memory,” Journal
of Consumer Research, 25 (March), 319-334.
Broniarczyk, Susan and Joseph W. Alba (1994), “The Importance of the Brand in Brand
Extension,” Journal of Marketing Research, 31 (May), 214-228.
Chen, Yubo and Jinhong Xie (2008), “Online Consumer Review: Word-of-Mouth as a New
Element of Marketing Communication Mix,” Management Science, 54 (3), 477-491.
Cialdini, Robert B. (2001), Influence: Science and Practice, Needham Heights, MA: Allyn & Bacon.
Chevalier, Judith A. and Dina Mayzlin (2006), “The Effect of Word of Mouth on Sales: Online
Book Reviews,” Journal of Marketing Research, 43 (August), 345-354.
Chintagunta, Pradeep K., Shyam Gopinath, and Sriram Venkataraman (2010), “The Effects of Online User Reviews on Movie Box Office Performance: Accounting for Sequential Rollout and Aggregation Across Local Markets,” Marketing Science, 29 (September-October), 944-957.
comScore (2007, November 29), “Online Consumer-Generated Reviews Have Significant
Impact on Offline Purchase Behavior,”
http://www.comscore.com/Press_Events/Press_Releases/2007/11/Online_Consumer_Revie
ws_Impact_Offline_Purchasing_Behavior).
Duan, Wenjing, Bin Gu, and Andrew B. Whinston (2008), “The Dynamics of Online Word-of-Mouth and Product Sales – An Empirical Investigation of the Movie Industry,” Journal of
Retailing, 84 (2), 233-242.
Ferraro, Rosellina, James R. Bettman, and Tanya L. Chartrand (2009), “The Power of Strangers:
The Effect of Incidental Consumer Brand Encounters on Brand Choice,” Journal of
Consumer Research, 35 (February), 729-741.
Forman, Chris, Anindya Ghose, and Batia Wiesenfeld (2008), “Examining the Relationship
Between Reviews and Sales: The Role of Reviewer Identity Disclosure in Electronic
Markets,” Information Systems Research, 19 (September), 291-313.
Gerstner, Eitan (1985), “Do Higher Prices Signal Higher Quality?” Journal of Marketing
Research, 22 (May), 209-215.
Ginter, James L., Murray A. Young, and Peter R. Dickson (1987), “A Market Efficiency Study of
Used Car Reliability and Prices,” Journal of Consumer Affairs, 21 (Winter), 258-276.
Grant, Kelli B. (March 4, 2013), “10 Things Online Reviewers Won’t Say,” Wall Street Journal,
http://www.marketwatch.com/Story/story/print?guid=FB144D96-82A0-11E2-B54A002128040CF6.
Hardie, Bruce G. S., Eric J. Johnson, and Peter S. Fader (1993), “Modeling Loss Aversion and
Reference Dependence Effects on Brand Choice,” Marketing Science, 12 (4), 378-394.
Hu, Nan, Paul A. Pavlou, and Jennifer Zhang (2006), “Can Online Reviews Reveal a Product’s True Quality? Empirical Findings and Analytical Modeling of Online Word-of-Mouth Communication,” in Proceedings of the 7th ACM Conference on Electronic Commerce (EC’06, June 11-15), 324-330.
IBM Global Chief Marketing Officer Study (2011), From Stretched to Strengthened, Somers, NY: IBM Global Business Services.
Jain, Shailendra Pratap and Durairaj Maheswaran (2000), “Motivated Reasoning: A Depth-of-Processing Perspective,” Journal of Consumer Research, 26 (March), 358-371.
Kardes, Frank R., Maria L. Cronley, James J. Kellaris, and Steven S. Posavac (2004), “The Role
of Selective Information Processing in Price-Quality Inference,” Journal of Consumer
Research, 31 (September), 368-374.
Keen, Andrew (2008), The Cult of the Amateur: How blogs, MySpace, YouTube, and the rest of
today's user-generated media are destroying our economy, our culture, and our values.
Random House Digital, Inc.
Koh, Noi Sian, Nan Hu, and Eric K. Clemons (2010), “Do Online Reviews Reflect a Product’s True Perceived Quality? An Investigation of Online Movie Reviews Across Cultures,” Electronic Commerce Research and Applications, 9, 374-385.
Kunda, Ziva (1990), “The Case for Motivated Reasoning,” Psychological Bulletin, 108, 480-498.
Lichtenstein, Donald R. and Scot Burton (1989), "The Relationship Between Perceived and
Objective Price-Quality," Journal of Marketing Research, 26 (November), 429-443.
Loechner, Jack (2013), “Consumer Review Said To Be THE Most Powerful Purchase Influence,”
Research Brief from the Center for Media Research,
http://www.mediapost.com/publications/article/190935/consumer-review-said-to-be-the-
most-powerful-purch.html#axzz2Mgmt90tc.
Luca, Michael (2011, September 16), “Reviews, Reputation, and Revenue: The Case of
Yelp.com,” Working Paper 12-016, Harvard Business School.
Mayzlin, Dina (2006), “Promotional Chat on the Internet,” Marketing Science, 25 (2), 155-163.
Mayzlin, Dina, Yaniv Dover, and Judith Chevalier (2012, August), “Promotional Reviews: An
Empirical Investigation of Online Review Manipulation,” Unpublished Working Paper.
Mitra, Debanjan and Peter N. Golder (2006), “How Does Objective Quality Affect Perceived
Quality? Short-Term Effects, Long-Term Effects, and Asymmetries,” Marketing Science, 25
(3), 230-247.
Moe, Wendy W. and Michael Trusov (2011), “The Value of Social Dynamics in Online Product
Ratings Forums,” Journal of Marketing Research, 49 (June), 444-456.
Muchnik, Lev, Sinan Aral, and Sean J. Taylor (2013), “Social Influence Bias: A Randomized Experiment,” Science, 341 (August 9), 647-651.
Netemeyer, Richard G., William O. Bearden, and Subhash Sharma (2003), Scale Development in the Social Sciences: Issues and Applications, 1st ed., Palo Alto, CA: Sage Publications.
Nielsen (2012), “Consumer Trust in Online, Social and Mobile Advertising Grows,”
http://www.nielsen.com/us/en/newswire/2012/consumer-trust-in-online-social-and-mobileadvertising-grows.html.
Obrecht, Natalie, Gretchen B. Chapman, and Rochel Gelman (2007), “Intuitive t-tests: Lay Use
of Statistical Information,” Psychonomic Bulletin & Review, 14 (6), 1147-1152.
Plassmann, Hilke, John O’Doherty, Baba Shiv, and Antonio Rangel (2008), “Marketing Actions Can Modulate Neural Representations of Experienced Pleasantness,” Proceedings of the National Academy of Sciences, 105 (January 22), 1050-1054.
Rao, Akshay R. and Kent B. Monroe (1989), “The Effect of Price, Brand Name, and Store Name
on Buyers’ Perceptions of Product Quality: An Integrative Review,” Journal of Marketing
Research, 26 (August), 351-357.
Robinson, John P., Phillip R. Shaver, and Lawrence S. Wrightsman (1991), “Criteria for Scale
Selection and Evaluation,” in Measures of Personality and Social Psychological Attitudes,
eds. J. P. Robinson, P. R. Shaver, and L. S. Wrightsman, San Diego, CA: Academic Press,
1–15.
Schlosser, Ann (2005), “Posting Versus Lurking: Communicating in a Multiple Audience
Context,” Journal of Consumer Research, 32 (September), 260-265.
Shiv, Baba, Ziv Carmon, and Dan Ariely (2005), “Placebo Effects of Marketing Actions:
Consumers May Get What They Pay For,” Journal of Marketing Research, 42 (November),
383-393.
Tellis, Gerard J. and Birger Wernerfelt (1987), “Competitive Price and Quality Under
Asymmetric Information,” Marketing Science, 6 (Summer), 240-253.
Tirunillai, Seshadri and Gerard J. Tellis (2012), “Does Chatter Really Matter? Dynamics of
User-Generated Content and Stock Performance,” Marketing Science, 31 (2), 198-240.
The Economist (2012), “Personalising Online Prices, How Deep Are Your Pockets? Businesses
are Offered Software That Spots Which Consumers Will Pay More,”
http://www.economist.com/node/21557798.
Tversky, Amos and Daniel Kahneman (1971), “Belief in the Law of Small Numbers,”
Psychological Bulletin, 76 (2), 105-110.
Tversky, Amos, and Daniel Kahneman (1974), "Judgment Under Uncertainty: Heuristics and
Biases," Science 185, 1124-1131.
Wilson, Timothy D. and Jonathan W. Schooler (1991), “Thinking Too Much: Introspection Can
Reduce the Quality of Preferences and Decisions,” Journal of Personality and Social
Psychology, 60 (February), 181-192.
Zeithaml, Valarie A. (1988), “Consumer Perceptions of Price, Quality, and Value: A Means-End
Model and Synthesis of Evidence,” Journal of Marketing, 52 (July), 2-22.
Zhu, Feng and Xiaoquan (Michael) Zhang (2010), “Impact of Online Consumer Reviews on
Sales: The Moderating Role of Product and Consumer Characteristics,” Journal of
Marketing, 74 (March), 133-148.
Table 1. The predictive validities of three cues for technical quality in the marketplace and as perceived by consumers. Entries are point estimates with 95% confidence intervals in brackets.

Predictor | Study 1: The Marketplace | Study 2: Consumer Beliefs
Difference in Average Rating (α1ΔA) | 0.114 [0.050; 0.177] | 0.342 [0.300; 0.384]
Difference in Number of Ratings (α1ΔN) | 0.148 [0.086; 0.210] | 0.098 [0.056; 0.139]
Difference in Price (α1ΔP) | 0.287 [0.225; 0.348] | 0.173 [0.132; 0.215]
Table 2. Brand perception measures and factor loadings in study 3.

Brand Perception Measure | Benefits | Affordability
Has the features/benefits you want | 0.92 | -0.08
Is a brand you can trust | 0.88 | -0.25
Has high quality products | 0.86 | -0.40
Offers real solutions for you | 0.85 | -0.03
Is easy to use | 0.82 | 0.07
Has the latest trends | 0.82 | -0.05
Is durable | 0.82 | -0.34
Offers good value for the money | 0.82 | 0.26
Looks good in my home | 0.80 | 0.02
Offers coordinated collections of items | 0.80 | -0.07
Is growing in popularity | 0.75 | 0.04
Is endorsed by celebrities | 0.32 | -0.21
Is affordable | 0.00 | 0.95
Is high-priced | 0.23 | 0.83
Has a lot of sales or special deals | -0.50 | 0.80
Table 3. The influence of technical quality, price and brand perceptions on user ratings. Entries are point estimates with 95% confidence intervals in brackets.

Predictor | User Ratings from Amazon.com | User Ratings from ConsumerReports.org
Technical Quality (α1Q) | 0.111 [0.041; 0.183] | -0.001 [-0.090; 0.087]
Price (α1P) | 0.101 [0.030; 0.172] | 0.098 [0.011; 0.185]
Benefits (α1BENEFIT) | 0.185 [0.119; 0.252] | 0.081 [-0.002; 0.166]
Affordability (α1AFFORDABLE) | -0.080 [-0.147; -0.014] | -0.119 [-0.202; -0.035]
Table 4. The influence of technical quality and online ratings on retained value. Entries are point estimates with 95% confidence intervals in brackets.

Predictor | User Ratings from Amazon.com | User Ratings from ConsumerReports.org
Technical Quality (α1Q) | 0.091 [0.025; 0.157] | 0.109 [0.028; 0.190]
Average User Rating (α1A) | 0.030 [-0.036; 0.096] | 0.018 [-0.063; 0.098]
Figure 1. The predictive validity of the average user rating as a function of its precision and the
predictive validities of the number of user ratings and price in the marketplace (top panel; Study
1) and as perceived by consumers (bottom panel; Study 2).
Figure 2. The correspondence between average user ratings and actual quality in the marketplace
(top panel; Study 1) and perceived quality according to consumers (bottom panel; Study 2).