Marketing Engineering Notes

Purpose
During the lectures I will cover some material that is not in the readings or that I do not
think is well explained. The purpose of this set of notes is to provide you some information on
these topics. These notes give you a preview of what I plan to talk about in class and also a
review after my lecture. The spreadsheets that are referenced in this note should be available at
http://www.business.utah.edu/~mktbm/mkt6600/. I will not include this full path in these notes.
Response Models
Response models form the heart of marketing engineering and marketing decision making. A response model forecasts the change in a dependent variable, Y, as a function of the
change in one or more independent variables, X. Most commonly, we will look at how changes
in the amount spent on advertising, sales promotions, or the sales force, changes in price, or
changes in the features of a product or service impact sales. Other times we will look at how
changes in the characteristics of a product or service change a person’s preference or probability
of purchasing it. So, typical dependent variables, i.e., Y, include sales, market share, preference,
and probability of choice. Independent variables, i.e., X, include marketing mix elements, product
or service characteristics, and characteristics of the buyer. Page 4 in the Response Models
Technical Note (in the WebCT Technical Notes folder) shows a large number of possible
functional forms of response models. While it is good to know something about each of these
functions, we will deal primarily with four functions: linear, multiplicative, ADBUDG, and logit.
Linear Regression
The most common assumption is that the dependent variable is linearly related to the
independent variable(s). That is, regression is based on an assumption that a set of points can be
adequately represented by a straight line (or a hyperplane when there are several independent
variables), i.e., most of the data points will lie relatively close to the regression line. Consider a
linear equation with two independent variables X1 and X2:
\[ Y_i = a + b_1 X_{1i} + b_2 X_{2i} = a + \sum_{k=1}^{2} b_k X_{ki}, \qquad i = 1, \ldots, n \]
In this equation, a is the intercept. It is the expected value of Yi when both Xs are 0. Each
regression coefficient, bk, (k = 1, 2), is the slope associated with that independent variable, Xki. It
gives the expected change in Yi for a one unit change in Xki holding the effect of the other
variable(s) constant.
Regression finds the combination of estimates of a, b1, and b2 (i.e., \( \hat{a} \), \( \hat{b}_1 \), and \( \hat{b}_2 \)) that
minimize the sum of the squared errors over all n observations, \( \sum_i e_i^2 \), where \( e_i \) is the difference
between the actual and predicted \( Y_i \), i.e., \( e_i = Y_i - \hat{Y}_i \).
\[ Y_i = \hat{a} + \hat{b}_1 X_{1i} + \hat{b}_2 X_{2i} + e_i = \hat{Y}_i + e_i \]
Diagnostics. The variation in Y is called the total sum of squares, TSS, the variation in \( \hat{Y} \)
is called the explained sum of squares, ESS, and the variation in e is called the residual sum of
squares, RSS. (However, sometimes this terminology is just reversed and ESS stands for error
sum of squares and RSS stands for regression sum of squares.)
\[ TSS = \sum_{i=1}^{n} (Y_i - \bar{Y})^2 \]
\[ ESS = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2 \]
\[ RSS = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 \]
Because the regression line always runs through the mean of the data, the following identity
holds:
\[ TSS = \sum_{i=1}^{n} (Y_i - \bar{Y})^2 = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2 + \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2 + \sum_{i=1}^{n} e_i^2 = ESS + RSS \]
R2 is a measure of how much of the variation in the dependent variable, TSS, can be
explained by the regression, ESS. R2 is the ratio of explained to total variance.
\[ R^2 = \frac{ESS}{TSS} = \frac{\sum_i (\hat{Y}_i - \bar{Y})^2}{\sum_i (Y_i - \bar{Y})^2} = 1 - \frac{RSS}{TSS} = 1 - \frac{\sum_i e_i^2}{\sum_i (Y_i - \bar{Y})^2} \]
This equation shows that the same regression weights that minimize the sum of the squared
errors, RSS, also maximize R2. There is no universal standard for a good R2; it depends on the
application. We will see some very successful applications where the R2 is relatively low and
unsuccessful applications with high R2s.
The standard error of the estimate is a measure of how close the points lie to the
regression line,
\[ s_e = \sqrt{\frac{\sum_i e_i^2}{n - k}} \]
where n is the number of observations and k is the number of
parameters. Usually about 2/3 of the observed Ys will lie within ± one se of the regression line
and 95% of the points will lie within ± two ses of the line.
One should examine the t-statistics and p-values for each regression coefficient (weight)
to see if it is statistically significant. P-values should be less than .1 and many people believe
they should be less than .05. This corresponds to t-statistics of approximately 1.8 and 2.0.
The medical advertising data (MedAdv.xls) recorded the response to a series of
advertisements. In the weight loss advertising campaign, between 0 and 4 ads were run every
month for a year and the number of calls each month inquiring about a weight loss program was
recorded. We can run a regression to estimate the relationship between the number of ads run in
a given month and the expected number of calls. The dependent variable is the number of calls
each month and the independent variable is the number of ads. Here are the data for the first four
months:
Weight Loss
Month       Ads   Calls
January     3     113
February    2     98
March       3     147
April       3     115
This resulted in the following regression equation and output:
Callsi = a + b adsi + ei
SUMMARY OUTPUT - Weight Loss Advertising

Regression Statistics
Multiple R           0.95
R Square             0.91
Adjusted R Square    0.90
Standard Error       15.32
Observations         12

ANOVA
                         df    SS          MS          F          Significance F
Regression (Explained)   1     23778.17    23778.17    101.2663   1.5E-06
Residual                 10    2348.08     234.81
Total                    11    26126.25

             Coefficients   Standard Error   t Stat   P-value
Intercept    12.63          6.30             2.01     0.07
Ads          36.10          3.59             10.06    0.00
The high R2 = .91 indicates that most of the data points lie close to the regression line.
The Standard error of the estimate is 15.32. That says about 2/3 of the observations will be
within 15 calls from what is predicted by the line.
The Analysis of Variance (ANOVA) divides the total sum of squares TSS
(26126.25) into the explained sum of squares ESS (23778.17) and residual sum of squares RSS
(2348.08). Note that ESS + RSS = TSS and that R2 = ESS/TSS = 23,778.17/26,126.25 = 1 - (RSS/TSS) = 1 - (2,348.08/26,126.25) = .91.
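As a quick check on the diagnostics, the short Python sketch below recomputes R2, the standard error of the estimate, and the F statistic from the sums of squares in the ANOVA table. The variable names are my own; the numbers are copied from the output above, with n = 12 observations and k = 2 parameters (intercept and slope).

```python
import math

# Values copied from the ANOVA table for the weight loss regression.
ess = 23778.17   # explained (regression) sum of squares
rss = 2348.08    # residual sum of squares
tss = 26126.25   # total sum of squares
n = 12           # observations (months)
k = 2            # estimated parameters: intercept and slope

r_squared = ess / tss                        # equivalently 1 - rss / tss
std_error = math.sqrt(rss / (n - k))         # standard error of the estimate
f_stat = (ess / (k - 1)) / (rss / (n - k))   # regression MS / residual MS

print(f"ESS + RSS = {ess + rss:.2f} (TSS = {tss:.2f})")
print(f"R2 = {r_squared:.2f}")
print(f"Standard error = {std_error:.2f}")
print(f"F = {f_stat:.2f}")
```

The results should agree with the Excel output up to rounding.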
The Intercept of 12.6 has a p-value = .07 < .1 and is significantly different from zero at the 10% level. It
says we should expect 12.6 calls a month if there were no advertising. The slope of 36.10 (also
statistically significant, p-value < 0.01) indicates that on average each ad generates 36.10 more
calls. The sign of the coefficient is positive, indicating that more ads are associated with more
calls, which is what one would expect. We could examine either a plot of the data or the
residuals, i.e., the eis, to see that a linear model does a good job of representing the data.
Linear regression models are popular because they are easy to estimate and are robust.
That means that even when assumptions are violated, regression typically works pretty well. It is
also a good approximation of the phenomena within a certain range.
Forecasting with Linear Response Models. We can use the estimated regression
coefficients to forecast the number of calls that would be generated from a given number of ads
per month by “plugging” the expected number of ads into the regression equation.
Forecast calls = 12.63 + 36.10 (number of ads)
With zero ads, we expect to attract 12.63 calls, with one ad we expect 48.73 calls (12.63 because
of the intercept and 36.1 because of the ad), and each extra ad would generate 36.1 incremental
calls.
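The "plugging in" step can be sketched in a few lines of Python. The function name forecast_calls is only illustrative; the coefficients are the ones from the regression output above.

```python
# Coefficients from the weight loss regression output.
INTERCEPT = 12.63   # expected calls with zero ads
SLOPE = 36.10       # expected extra calls per ad

def forecast_calls(ads):
    """Expected monthly calls for a given number of ads (linear response model)."""
    return INTERCEPT + SLOPE * ads

# The campaign ran between 0 and 4 ads per month.
for ads in range(0, 5):
    print(ads, round(forecast_calls(ads), 2))
```

Forecasts for more than 4 ads would be extrapolations outside the range of the data, so they should be treated with more caution.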
Profit Models. In addition to forecasting sales we can also forecast profits. We will
typically use the following profit model:
Profit = Unit Sales x Margin – Fixed Cost
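Assume there is a linear response model for sales, y, as a function of advertising, x,
which is a fixed cost: \( \hat{y} = \hat{a} + \hat{b}x \). Then profits are:
\[ \text{Profits} = \hat{y} \times \text{margin} - FC_u \times x = (\hat{a} + \hat{b}x) \times \text{margin} - FC_u \times x = \hat{a} \times \text{margin} + (\hat{b} \times \text{margin} - FC_u) \times x \]
where FCu is the cost of one unit of x (for example, the cost of one ad).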
Continuing with the weight loss example, if each call generates $59 in contribution and
each ad costs $1300, our forecast profit = $59 * Forecast calls - $1300 * number of ads.
Each additional ad would generate 36.1 calls and $2130 (= $59 * 36.1) in contribution. It costs
$1300. This campaign generates an incremental profit of $830 per ad. If we do not use our
judgment, the model says we should place an infinite number of ads. With a linear response
model, the optimal action is always going to be to either spend nothing or spend an infinite
amount on advertising. So, we should not take the recommendations literally, but use them to say
what to do directionally rather than exactly, i.e., run a few more or fewer ads.
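The profit calculation for the weight loss example can be sketched as follows. The function and constant names are my own; the $59 contribution per call and $1300 cost per ad come from the text.

```python
INTERCEPT = 12.63        # estimated calls with no advertising
SLOPE = 36.10            # estimated extra calls per ad
MARGIN_PER_CALL = 59.0   # contribution per call, from the text
COST_PER_AD = 1300.0     # cost of one ad, from the text

def forecast_profit(ads):
    """Forecast profit = margin * forecast calls - ad cost * number of ads."""
    calls = INTERCEPT + SLOPE * ads
    return MARGIN_PER_CALL * calls - COST_PER_AD * ads

# With a linear model the incremental profit per ad is a constant.
incremental_profit = SLOPE * MARGIN_PER_CALL - COST_PER_AD

for ads in range(0, 5):
    print(ads, round(forecast_profit(ads), 2))
print("Incremental profit per ad:", round(incremental_profit, 2))
```

Because the incremental profit per ad is constant and positive, profit keeps rising with every added ad, which is why the linear model can only point in a direction and cannot give an optimal spending level.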
Judgmental Calibration. You are more familiar with statistical estimation or calibration;
however, it is also possible to calibrate models judgmentally. While this lacks the objectivity of
statistical estimation, it has the benefit of incorporating the decision maker’s beliefs into the
model and increases the likelihood of using the model. We will use judgmental estimates with
both multiplicative and ADBUDG models later in the course.
A linear model is a two-parameter model (slope and an intercept) when there is one
independent variable. Therefore we need two judgments to calibrate this model. One way is to
ask (1) what are the current levels of the independent (e.g., dollars spent on advertising or
number of advertisements placed) and dependent variables (sales, market share, or calls) and (2)
how much will the dependent variable change with a one-unit increase in the independent
variable. Looking at the first month of the weight loss data, one would say the current number of
ads was 3 and the current number of calls was 113. If we expect to get 36 more calls per
advertisement, we can solve for the intercept.
Calls = a + 36 (Ads) => 113 = a + 36 * 3 or a = 113 – 108 = 5.
If there are two independent variables, we would need to ask for the current levels of both
independent variables and the single dependent variable, as well as how much the dependent variable would change
with a one-unit increase in each of the independent variables.
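The two judgments map directly onto the two parameters. A minimal sketch of the one-variable case, using the first month of the weight loss data (the function name calibrate_linear is hypothetical):

```python
def calibrate_linear(current_x, current_y, change_per_unit):
    """Judgmentally calibrate y = a + b*x from the current point and the
    expected change in y for a one-unit increase in x."""
    slope = change_per_unit
    intercept = current_y - slope * current_x
    return intercept, slope

# Judgments: 3 ads currently produce 113 calls; each ad adds 36 calls.
intercept, slope = calibrate_linear(current_x=3, current_y=113, change_per_unit=36)
print(intercept, slope)
```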
Multinomial Logit (MNL) Models
Linear regression is an appropriate methodology when the dependent variable is
continuous. However when the dependent variable is either a zero – one variable, (for example,
if the dependent variable is brand choice, where the value is one if that brand was chosen and
zero if it was not chosen) or if it is constrained to be between zero and one, (for example, market
share must be between 0 and 100%), it may be better to use a logit model to estimate the
relationship between independent and dependent variables. For example, in choice-based
conjoint analysis, we use a logit model to estimate the relationship between product
characteristics and the probability that a person would choose a product with those
characteristics. Alternatively, we could estimate how merchandising characteristics, like price
promotions, advertising, displays, etc. influence market share or the probability that an individual
would choose a certain brand.
In this case, the dependent variable is going to be a zero – one variable, e.g., one if the
person chose an alternative and a zero if s/he did not choose it. The independent variables are
variables that might influence choice such as product characteristics or merchandising
characteristics. The output of a logit model is an estimated probability of choosing each
alternative. The formula looks more complicated than a linear regression, but is quite similar.
\[ P_i = \frac{A_i}{\sum_{k=1}^{K} A_k} = \frac{e^{\sum_j \beta_j x_{ij}}}{\sum_k e^{\sum_j \beta_j x_{kj}}} \]
where K is the number of brands in the choice set and
\[ A_i = \exp\Big(\sum_j \hat{\beta}_j X_{ij}\Big) = \exp(\hat{\beta}_1 X_{i1} + \hat{\beta}_2 X_{i2} + \ldots + \hat{\beta}_J X_{iJ}) \]
\( P_i \) is the probability of choosing the ith alternative,
\( X_{ij} \) is the amount of the jth attribute contained in the ith alternative, and
\( \hat{\beta}_j \) is the importance of the jth attribute.
Choice-based conjoint is similar to a ratings-based conjoint model; both are used to
understand why people choose or prefer certain alternatives. We assume people choose or like
things because of the benefits they offer or the characteristics they possess – preference is
“caused” by product characteristics. We typically assume a linear function of the characteristics.
Overall preference is a weighted sum of the attribute levels the product possesses.
In either case, the product descriptions are the independent variables and the measure of
preference is the dependent variable. When the dependent variable, e.g., preference, is measured
on a continuous scale (1- 10), we use regression to estimate the importance of the product
attribute when making preference judgments about the brands.
\[ \text{Pref}_i = \sum_j \hat{B}_j X_{ij} + e_i \]
Once we have estimated the importance weights, we can predict the preference or
likelihood of purchasing any competitive product by substituting its perceptions into the
equation. Also, we can estimate the impact of changing the perception of a given attribute; this
may be a change in the physical product or just a new message. In Excel, we have done this with
the sumproduct function of the regression weights and the independent variables.
\[ \text{Pref}_{new} = \sum_j \hat{B}_j X_{(new)j} \]
In ratings-based conjoint, we typically assume the alternative with the highest predicted
preference is the one that is chosen. The multinomial logit model assumes that the probability of
choosing the ith alternative is equal to:
\[ P_i = \frac{e^{\sum_j x_{ij} \beta_j}}{\sum_k e^{\sum_j x_{kj} \beta_j}} \]
There are several differences between logit and regression models. The dependent
variable in a logit model is choice (a 0–1 variable) rather than preference rated on a continuous
(1 to 10) scale. These “choices” can be either stated choices (what the person says s/he would
choose) or revealed choices (what the person actually chose). We have similar independent variables to
ratings-based conjoint.
In ratings-based conjoint, if a product with a particular attribute level (low price or high
gas mileage) is typically preferred (receives a high preference rating) the regression weight
associated with that level will be large and positive. In a logit model, if a product with a
particular attribute level is consistently chosen, its regression weight will also be large and
positive. So we interpret the parameters the same way.
The estimated probability of choice is the exponentiated utility of that object over the
sum of the exponentiated utilities of all of the objects. Because this ratio includes all of the competing
alternatives, logit models allow us to capture competitive effects. The logit model can represent either market
shares at the aggregate level or choice probabilities at the individual level.
Its form may look complicated, but it has two very nice properties. First, Ai is always positive, because it is an exponentiation. Second, the predicted choice probabilities (or market
shares) are all between zero and one and they sum to one.
The models are usually estimated by a procedure called maximum likelihood. The model
finds the parameters that maximize the probability of the observed outcomes. If a person makes a
series of choices, the model finds parameters that make the estimated probabilities of the chosen
alternatives as close to one as possible and the estimated probabilities of the non-chosen
alternatives as close to zero as possible. The parameters are estimated through a search procedure
like Solver, but our logit model does everything automatically.
The MNL model is similar to a preference regression model in that (1) we are trying to
estimate importance weights for product attributes and (2) the independent variables are the
product attributes or characteristics. It differs from the preference regression model in that (1) the
dependent variable is choice, or probability of choice, instead of preference or liking and (2) all of
the alternatives in a given choice set are considered to be part of one observation instead of each
brand constituting a separate observation. The software accomplishes this by asking for the
number of alternatives (per case). Regression minimizes the sum of the squared errors; the MNL
maximizes a likelihood function.
Example of Modeling Transportation Choice. In a simple example we want to determine
the probability that a person will choose a car or mass transit. In the following table there are two
rows for each person, one for each alternative: Auto or Mass. The dependent variable is the
chosen mode of transportation; it is a 0 – 1 variable. The first two people chose mass transit and
the third person chose auto. We could have asked people to choose a transportation mode or we
could have observed what they actually chose.
The independent variables are travel time and a dummy variable (called a brand specific
or alternative specific constant) for the first alternative, auto. The last alternative, mass transit, is
the reference level. The dummy variable is automatically supplied by the program and the
coefficient associated with it is the difference in utility between taking auto and mass transit
when travel time is held constant – some people would rather drive if travel times are similar and
others would rather take mass transportation.
Choice data
Observation   Alternative   Choice   Time   Auto
1             Auto          0        52.9   1
1             Mass          1        4.4    0
2             Auto          0        4.1    1
2             Mass          1        28.5   0
3             Auto          1        4.1    1
3             Mass          0        86.9   0
The output of the model looks like the following for these data:
Coefficient estimates
Variable   Coefficient estimate   Standard error   t-statistic
Time       -0.05                  0.02             -2.57
Auto       -0.24                  0.75             -0.32
The negative coefficient associated with time says that people will tend to choose the
faster transportation mode, i.e., the mode with the smaller travel time. The coefficient associated
with auto is also negative, but insignificant. The insignificance says that, when travel times are equal, we cannot
conclude that people prefer either auto or mass transit.
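To make the maximum likelihood idea concrete, the sketch below computes the log-likelihood of the three choices in the table for a given pair of coefficients. It is only an illustration: the function names are my own, the time coefficient uses the unrounded value -.053 that appears in the forecasting calculation, and the comparison against all-zero coefficients is my addition. An estimation routine would search for the coefficients that make this number as large as possible.

```python
import math

# Each choice set lists (travel time, auto dummy) per alternative and the
# alternative that was chosen, copied from the three-person table.
choice_sets = [
    {"alts": {"auto": (52.9, 1), "mass": (4.4, 0)}, "chosen": "mass"},
    {"alts": {"auto": (4.1, 1), "mass": (28.5, 0)}, "chosen": "mass"},
    {"alts": {"auto": (4.1, 1), "mass": (86.9, 0)}, "chosen": "auto"},
]

def choice_probs(alts, b_time, b_auto):
    """Logit probabilities for one choice set."""
    exp_u = {name: math.exp(b_time * t + b_auto * d) for name, (t, d) in alts.items()}
    total = sum(exp_u.values())
    return {name: value / total for name, value in exp_u.items()}

def log_likelihood(b_time, b_auto):
    """Sum of the log probabilities of the alternatives actually chosen."""
    return sum(
        math.log(choice_probs(cs["alts"], b_time, b_auto)[cs["chosen"]])
        for cs in choice_sets
    )

ll_reported = log_likelihood(-0.053, -0.24)  # coefficients from the output
ll_zero = log_likelihood(0.0, 0.0)           # every alternative equally likely
print(round(ll_reported, 3), round(ll_zero, 3))
```

A log-likelihood closer to zero means the chosen alternatives were given higher probabilities, so the reported coefficients should beat the all-zero benchmark on these three people.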
Forecasting with Logit Models. Just like regression, we use the estimated coefficients and
the logit formula to forecast the probability that a given person would choose either auto or mass
transit. This requires three steps. The logit formula is:
\[ P_i = \frac{e^{\sum_j \beta_j x_{ij}}}{\sum_k e^{\sum_j \beta_j x_{kj}}} \]
1. Calculate \( V_i = \sum_j \beta_j x_{ij} \) for each of the alternatives
2. Exponentiate these calculated Vis, \( A_i = \exp(V_i) = \exp(\sum_j \beta_j x_{ij}) \)
3. Plug them into the logit formula
Like the first person in this data set, assume that auto travel time is 52.9 minutes and mass transit
travel time is 4.4 minutes.
V1auto = -.053 * 52.9 - .24 * 1 = -3.04
V1mass = -.053 * 4.4 - .24 * 0 = -.23
Exp(V1auto) = exp( -3.04) = .0478
Exp(V1mass) = exp(-.23) = .795
Exp(V1auto) + Exp(V1mass) = .0478 + .795 = .842
P1auto = .0478 / .842 = .057
P1mass = .795 / .842 = .944
As soon as we know a person’s auto and mass transit travel times, we can forecast his/her
transportation choice in a similar manner.
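The three steps can be sketched in Python as follows. The function name logit_probabilities is illustrative; the coefficients and travel times are those used in the hand calculation above. Small differences in the third decimal place come from rounding the intermediate values by hand.

```python
import math

B_TIME = -0.053   # coefficient for travel time
B_AUTO = -0.24    # alternative specific constant for auto

def logit_probabilities(utilities):
    """Steps 2 and 3: exponentiate each utility and divide by the sum."""
    exp_u = {alt: math.exp(v) for alt, v in utilities.items()}
    total = sum(exp_u.values())
    return {alt: a / total for alt, a in exp_u.items()}

# Step 1: utilities for the first person (auto 52.9 minutes, mass 4.4 minutes).
utilities = {
    "auto": B_TIME * 52.9 + B_AUTO * 1,
    "mass": B_TIME * 4.4 + B_AUTO * 0,
}
probs = logit_probabilities(utilities)
for alt, p in probs.items():
    print(alt, round(p, 3))
```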
Example of Modeling Detergent Choice. In this example, we will use a logit model to
predict the probability of choosing each of several brands of laundry detergent. We want to see how
effective various marketing mix elements are. Specifically, we measure the effect of price, price
discount, whether the brand was on an end-of-aisle display, and whether it was featured in an ad that week. The model
also includes brand dummy variables for the first three brands to capture perceived product
quality or overall image, and a loyalty variable to capture past purchases. We can use this to see
how many people would purchase due to an end-of-aisle display and see if that is worth the cost
of paying a store to do that.
In this example, 8 people each made 10 purchases of laundry detergent (once each
month) among four brands: Wisk, All, Tide, and Yes. The first column is the consumer number.
The second column is the purchase number (again, one for each month) for that customer (1
through 10). For each purchase there are four brands listed in the third column. The fourth
column is the 0 – 1 variable showing which brand was chosen. After the loyalty column (which I
will not cover in detail), there are four merchandising variables: List Price, Price Discount,
Display, and Feature Ad. These are the marketing mix elements under control of the retailer.
Finally, there is a dummy variable for each of the first three brands. Yes is the reference level.
Choice data
Observation   Month   Brand   Choice   Loyalty   List Price   Discount   Display   Feature   Wisk   All   Tide
1             1       Wisk    0        0.25      3.25         0.63       0         0         1      0     0
1             1       All     0        0.25      3.10         0.71       0         0         0      1     0
1             1       Tide    1        0.25      3.30         0.82       1         1         0      0     1
1             1       Yes     0        0.25      2.95         0.86       0         0         0      0     0
1             2       Wisk    0        0.2       3.25         0.63       0         0         1      0     0
1             2       All     0        0.2       3.10         0.71       0         0         0      1     0
1             2       Tide    0        0.4       3.67         0.82       0         0         0      0     1
1             2       Yes     1        0.2       2.95         0.60       0         1         0      0     0
When we estimate this model, we get the following set of coefficients. The coefficient for list price is negative,
which says people are less likely to buy products at a higher price. The price discount coefficient is positive
and says people are more likely to purchase when there is a bigger discount. Notice that people
are more sensitive to the amount of the discount than to the list price. People are also more likely
to buy when a product is displayed or when it is in a feature ad. The sizes of these two
merchandising coefficients are approximately equal. The positive coefficient for Tide says that
people are significantly more likely to buy Tide than the reference brand Yes, but there is no
significant difference among the probabilities of choosing the other three brands (Wisk, All, and Yes).
Coefficient estimates
Variable     Coefficient estimate   Standard deviation   t-statistic
Loyalty      1.78                   1.22                 1.46
List Price   -3.54                  1.10                 -3.22
Discount     10.58                  2.01                 5.25
Display      1.18                   0.43                 2.72
Feature      1.25                   0.45                 2.77
Wisk         0.37                   0.59                 0.63
All          0.67                   0.56                 1.20
Tide         1.99                   0.69                 2.90
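As an illustration of how these coefficients can be used, the sketch below computes the choice probabilities for consumer 1 in month 1 and then asks what Tide's probability would have been without the end-of-aisle display. The dictionary layout and function name are my own; the coefficients and the data row are copied from the tables above, and the "no display" scenario is a hypothetical what-if, not something in the data.

```python
import math

# Estimated coefficients from the detergent logit output.
COEF = {"loyalty": 1.78, "list_price": -3.54, "discount": 10.58,
        "display": 1.18, "feature": 1.25}
# Brand dummies; Yes is the reference level, so its constant is zero.
BRAND_CONSTANT = {"Wisk": 0.37, "All": 0.67, "Tide": 1.99, "Yes": 0.0}

# Consumer 1, month 1, copied from the choice data.
month1 = {
    "Wisk": {"loyalty": 0.25, "list_price": 3.25, "discount": 0.63, "display": 0, "feature": 0},
    "All":  {"loyalty": 0.25, "list_price": 3.10, "discount": 0.71, "display": 0, "feature": 0},
    "Tide": {"loyalty": 0.25, "list_price": 3.30, "discount": 0.82, "display": 1, "feature": 1},
    "Yes":  {"loyalty": 0.25, "list_price": 2.95, "discount": 0.86, "display": 0, "feature": 0},
}

def choice_probabilities(brands):
    """Multinomial logit probabilities for one choice set."""
    exp_u = {}
    for brand, x in brands.items():
        utility = BRAND_CONSTANT[brand] + sum(COEF[name] * x[name] for name in COEF)
        exp_u[brand] = math.exp(utility)
    total = sum(exp_u.values())
    return {brand: a / total for brand, a in exp_u.items()}

with_display = choice_probabilities(month1)

# Hypothetical what-if: same month, but Tide is not on display.
no_display = {brand: dict(x) for brand, x in month1.items()}
no_display["Tide"]["display"] = 0
without_display = choice_probabilities(no_display)

print("Tide with display:   ", round(with_display["Tide"], 3))
print("Tide without display:", round(without_display["Tide"], 3))
```

The difference between the two probabilities, multiplied by the number of shoppers and the margin, gives a rough estimate of what a display is worth.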
Linearizable Response Functions – Multiplicative Model
Decreasing and Increasing Returns Response Functions. If either theory or an inspection
of the data suggests a nonlinear relationship between one or more of the independent variables
and the dependent variable, you should consider a nonlinear model. The first alternative is a nonlinear model that can be linearized through a simple transformation. This allows you to do the
estimation with linear regression. This is typically easier than nonlinear least squares estimation,
it may be more robust, and the regression module in Excel provides a number of useful
diagnostics like R2 and t-statistics. By far, the most widely used linearizable model is the
multiplicative model. It is the only one we will cover in class.
Multiplicative Model. The multiplicative model is a commonly used model to represent
either an increasing or decreasing returns function. This model is popular because it is a constant
elasticity model, i.e., it models the response in such a way that a given percent change in an
independent variable always produces the same percentage change in the dependent variable. An
increasing returns model (see P5 on Exhibit 2, page 4 of the Response Models Technical Note)
might occur if there are network effects or positive feedback. A decreasing returns model (see P3
on Exhibit 2, page 4 of the Response Models Technical Note) might occur if the impact of a
repeated advertisement declines over time.
The two-variable multiplicative model is written as:
\[ Y = a X_1^{b_1} X_2^{b_2}, \qquad X_i > 0, \; i = 1, 2 \]
It is called multiplicative because the Xs are multiplied together rather than added. If the bs are
between zero and one it is a decreasing returns model and if they are greater than one, it is an increasing
returns model. We estimate the parameters by taking logarithms of the above equation:
\[ \ln(Y) = \ln(a X_1^{b_1} X_2^{b_2}) = \ln(a) + \ln(X_1^{b_1}) + \ln(X_2^{b_2}) = \ln(a) + b_1 \ln(X_1) + b_2 \ln(X_2) \]
To estimate the model, take the logarithms of Y, X1 and X2 and then regress Ln Y on Ln
X1 and Ln X2.
\[ \ln(Y_t) = \hat{\alpha} + \hat{\beta}_1 \ln(X_{1t}) + \hat{\beta}_2 \ln(X_{2t}) + e_t \]
where \( \hat{\alpha} = \ln(a) \), \( \exp(\hat{\alpha}) = a \), and \( \hat{\beta}_i = b_i \), i = 1, 2
Once we have estimated the parameters, we can use the following equation to forecast Y.
\[ \hat{Y} = e^{\hat{\alpha}} X_1^{\hat{\beta}_1} X_2^{\hat{\beta}_2} = a X_1^{b_1} X_2^{b_2} \]
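The log transformation can be sketched for the one-variable case, Y = a X^b, using only ordinary least squares formulas. Everything here is illustrative: the function name is my own and the data are hypothetical, noise-free points generated from Y = 2 X^0.5 so that the recovered parameters can be checked by eye. With real data you would run the same regression in Excel on the logged columns.

```python
import math

def fit_multiplicative(xs, ys):
    """Estimate a and b in Y = a * X**b by regressing ln(Y) on ln(X)."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(log_x, log_y))
    sxx = sum((u - mean_x) ** 2 for u in log_x)
    b = sxy / sxx                  # slope of the log-log regression = elasticity
    alpha = mean_y - b * mean_x    # intercept of the log-log regression = ln(a)
    return math.exp(alpha), b

# Hypothetical, noise-free data generated from Y = 2 * X**0.5.
xs = [1.0, 2.0, 4.0, 8.0, 16.0]
ys = [2.0 * x ** 0.5 for x in xs]

a_hat, b_hat = fit_multiplicative(xs, ys)
print(round(a_hat, 4), round(b_hat, 4))
```

The slope of the logged regression is the elasticity itself, which is why this model is so convenient for price and advertising questions.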
Judgmental Calibration. This is also a two-parameter model when there is one
independent variable, so we must ask at least two questions to judgmentally estimate the
parameters. First, what is the current level of the independent variable (e.g., dollars spent on
advertising or price) and the current level of the dependent variable (sales or market share)?
Second, what will be the percent change in the dependent variable with a one percent increase in
the independent variable? If current sales are expected to be 50 units when the price is $130 and
sales will increase by 1.5% with every one percent decrease in price, we would have the
following model:
\[ \text{Sales} = a\,(\text{Price})^{-b} \;\Rightarrow\; 50 = a\,(130)^{-1.5} \approx a\,(0.000675), \quad \text{or} \quad a = 50 / (130)^{-1.5} \approx 74111 \]
If there are two independent variables, we would need to ask for the current levels of both
independent variables and the single dependent variable, as well as the expected percentage change in the
dependent variable with a one percent change in each of the independent variables.
Forecasting with the Multiplicative Model. The process is the same as with the linear
model: “plug” the forecast values of the independent variables into the equation and solve for the
value of the dependent variable. For example, in the sales example above, sales when price is
dropped to $120 is calculated as follows:
\[ \text{Sales}(\$120) = a\,(\text{Price})^{-b} = 74111\,(120)^{-1.5} = 56.38 \text{ units} \]
Profit Models. We can build profit models just like we did with the linear model. Again,
the general profit function is: Profit = Unit Sales x Margin – Fixed Cost
In the above example, the marketing variable is price, which is not a fixed cost, but enters
into the margin. If the unit cost is $50, then
\[ \text{Profits} = a\,(\text{Price})^{-b} \times \text{Margin} = 74111\,(120)^{-1.5}\,(\$120 - \$50) = 56.38 \times \$70 = \$3946.60 \]
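The calibration, forecast, and profit steps for the price example can be chained together in a short sketch. The function names are my own; the inputs (50 units at $130, an elasticity of -1.5, and a unit cost of $50) are from the text. Results differ from the hand calculation only by rounding.

```python
ELASTICITY = -1.5   # sales rise 1.5% for every 1% price decrease
UNIT_COST = 50.0    # unit cost from the text

def calibrate_multiplicative(current_x, current_y, elasticity):
    """Solve current_y = a * current_x**elasticity for the scale parameter a."""
    return current_y / current_x ** elasticity

# Judgments: 50 units are sold at a price of $130.
a = calibrate_multiplicative(130.0, 50.0, ELASTICITY)

def forecast_sales(price):
    return a * price ** ELASTICITY

def forecast_profit(price):
    # Price is not a fixed cost; it enters through the margin.
    return forecast_sales(price) * (price - UNIT_COST)

print("a =", round(a))
print("Sales at $120 =", round(forecast_sales(120.0), 2))
print("Profit at $120 =", round(forecast_profit(120.0), 2))
```

Because the functions take price as an argument, the same sketch can be used to compare profit at several candidate prices.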
Measuring the Impact of Price and Display on Sales
The cheese data contain weekly unit volume, price, and a measure of display activity on
several key accounts (a city – retailer combination) for approximately 65 weeks. These data are
for a sliced cheese product manufactured by Borden. The measure of display activity is percent
of ACV (all category volume) on display. Later we will look at some soft drink data that have
the same information for both a focal brand and a competitor. The models may look complicated,
so we will build them in steps.
First is a simple model where sales are a function of just price:
\[ S_t = a_0 P_t^{\beta_1} \]
\( a_0 \) = the intercept = value of \( S_t \) when \( P_t = 1 \)
\( \beta_1 \) = the slope associated with \( P_t \) = price elasticity
where St is the unit volume at time t and Pt is the price at time t. In this equation, \( a_0 \) adjusts
for the size of the market. It is the size of the market when all independent variables equal one.
\( \beta_1 \) is the price elasticity – the percent change in volume for a 1% change in price.
To estimate the model, we take natural logarithms of each side of the equation to get an
equivalent model: \( \ln(S_t) = \ln(a_0) + \beta_1 \ln(P_t) \)
We can estimate this model with regression, where ln(St) is the dependent variable and ln(Pt) is
the independent variable:
\[ \ln(S_t) = \hat{\beta}_0 + \hat{\beta}_1 \ln(P_t) + e_t \]
where \( \hat{\beta}_0 = \ln(a_0) \) or \( \exp(\hat{\beta}_0) = a_0 \) and \( \hat{\beta}_1 = \beta_1 \)
Once we have estimated the parameters, the estimated sales volume is: \( \hat{S}_t = \exp(\hat{\beta}_0)\, P_t^{\hat{\beta}_1} \)
Next, assume that display activity affects volume only and not price sensitivity. This
results in the following model:
\[ S_t = a_0\, a_1^{D_t}\, P_t^{\beta_2} \]
\( a_1 \) = multiplier for deal periods
where Dt is the percent of ACV on display and \( a_1 \) is a multiplier for display, i.e., if there
is no display activity (i.e., Dt = 0), the impact of \( a_1 \) is a multiplication by 1; if ACV display is 1,
the sales volume is increased by a factor of \( a_1 \). It is a measure of the percentage change in
volume when there is a display.
This model can be written equivalently in terms of logarithms as:
\[ \ln(S_t) = \ln(a_0) + \ln(a_1)\, D_t + \beta_2 \ln(P_t) \]
We can estimate this model with regression, where ln(St) is the dependent variable and Dt and
ln(Pt) are the independent variables:
\[ \ln(S_t) = \hat{\beta}_0 + \hat{\beta}_1 D_t + \hat{\beta}_2 \ln(P_t) + e_t \]
where \( \hat{\beta}_i = \ln(a_i) \) or \( \exp(\hat{\beta}_i) = a_i \), i = 0, 1, and \( \hat{\beta}_2 = \beta_2 \)
Once we have estimated the regression coefficients, we can forecast sales with the following
equation by plugging in the price and display activity:
\[ \hat{S}_t = \exp(\hat{\beta}_0)\, \exp(\hat{\beta}_1)^{D_t}\, P_t^{\hat{\beta}_2} \]
Next, we can complicate the model even further by assuming that a display impacts not only
volume, but also price sensitivity.
\[ S_t = a_0\, a_1^{D_t}\, P_t^{\beta_2}\, P_t^{D_t \beta_3} = a_0\, a_1^{D_t}\, P_t^{\beta_2 + D_t \beta_3} \]
\( \beta_3 \) = change in price sensitivity due to display activity
As before, \( a_1 \) measures the percentage change in volume due to display activity and \( \beta_3 \) measures
the change in price sensitivity due to display activity. Again, we can write this into an equivalent
model by taking logarithms of both sides:
\[ \ln(S_t) = \ln(a_0) + \ln(a_1)\, D_t + \beta_2 \ln(P_t) + \beta_3 D_t \ln(P_t) \]
We can estimate that model with regression where ln(St) is the dependent variable and Dt, ln(Pt),
and Dt*ln(Pt) are the independent variables:
\[ \ln(S_t) = \hat{\beta}_0 + \hat{\beta}_1 D_t + \hat{\beta}_2 \ln(P_t) + \hat{\beta}_3 D_t \ln(P_t) + e_t \]
where \( \hat{\beta}_i = \ln(a_i) \) or \( \exp(\hat{\beta}_i) = a_i \), i = 0, 1, and \( \hat{\beta}_j = \beta_j \), j = 2, 3
Once we have estimated the regression coefficients, we can forecast sales with the following
equation for any level of price and display activity:
\[ \hat{S}_t = \exp(\hat{\beta}_0)\, \exp(\hat{\beta}_1)^{D_t}\, P_t^{\hat{\beta}_2}\, P_t^{D_t \hat{\beta}_3} = \exp(\hat{\beta}_0)\, \exp(\hat{\beta}_1)^{D_t}\, P_t^{\hat{\beta}_2 + D_t \hat{\beta}_3} \]
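The sketch below shows that the multiplicative forecast and the regression (log) form are the same model written two ways. The coefficient values are hypothetical placeholders chosen only for illustration, not estimates from the cheese data, and the function names are my own.

```python
import math

def forecast_sales(price, display, b0, b1, b2, b3):
    """Multiplicative form: exp(b0) * exp(b1)**D * P**(b2 + D*b3)."""
    return math.exp(b0) * math.exp(b1) ** display * price ** (b2 + display * b3)

def forecast_log_sales(price, display, b0, b1, b2, b3):
    """Regression (log) form: b0 + b1*D + b2*ln(P) + b3*D*ln(P)."""
    log_p = math.log(price)
    return b0 + b1 * display + b2 * log_p + b3 * display * log_p

# Hypothetical coefficients, for illustration only.
coefs = {"b0": 8.0, "b1": 0.5, "b2": -2.0, "b3": -0.3}

price, display = 2.5, 0.4   # hypothetical price and share of ACV on display
direct = forecast_sales(price, display, **coefs)
via_logs = math.exp(forecast_log_sales(price, display, **coefs))
print(round(direct, 2), round(via_logs, 2))
```

Setting display to zero collapses the forecast to exp(b0) * P**b2, the price-only form with the display terms switched off.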
Finally, we look at two brands, i and j, where we will call brand i our own brand and brand j the
other brand. Furthermore, we will model the effects of price and display activity both on our own
brand and on the other brand. Sales of brand i is a function of its own pricing and display
activity as well as the pricing and display activity of the other brand, brand j.
\[ S_{it} = a_0\, a_1^{D_{it}}\, P_{it}^{\beta_2}\, P_{it}^{D_{it} \beta_3}\, a_4^{D_{jt}}\, P_{jt}^{\beta_5}\, P_{jt}^{D_{jt} \beta_6} \]
We expect that \( \beta_2 \) and \( \beta_3 \) will be negative – as its own price increases, its sales will decrease. On
the other hand, if the two brands are competing, we expect \( \beta_5 \) and \( \beta_6 \) to be positive – as the price
of the other brand increases we expect sales of own brand to increase. Similarly, we expect \( \ln(a_1) \)
and \( \ln(a_4) \) to be of opposite signs: display activities of own brand should increase own brand sales
and display activities of the other brand should decrease own brand sales.
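We can write this model as an equivalent model by taking logarithms, estimate it using
regression, and make forecasts once we have estimated the parameters:
Equivalent model:
\[ \ln(S_{it}) = \ln(a_0) + \ln(a_1)\, D_{it} + \beta_2 \ln(P_{it}) + \beta_3 D_{it} \ln(P_{it}) + \ln(a_4)\, D_{jt} + \beta_5 \ln(P_{jt}) + \beta_6 D_{jt} \ln(P_{jt}) \]
Regression model for estimation:
\[ \ln(S_{it}) = \hat{\beta}_0 + \hat{\beta}_1 D_{it} + \hat{\beta}_2 \ln(P_{it}) + \hat{\beta}_3 D_{it} \ln(P_{it}) + \hat{\beta}_4 D_{jt} + \hat{\beta}_5 \ln(P_{jt}) + \hat{\beta}_6 D_{jt} \ln(P_{jt}) + e_t \]
where \( \hat{\beta}_i = \ln(a_i) \) or \( \exp(\hat{\beta}_i) = a_i \), i = 0, 1, 4, and \( \hat{\beta}_j = \beta_j \), j = 2, 3, 5, 6
Forecast:
\[ \hat{S}_{it} = \exp(\hat{\beta}_0)\, \exp(\hat{\beta}_1)^{D_{it}}\, P_{it}^{\hat{\beta}_2}\, P_{it}^{D_{it} \hat{\beta}_3}\, \exp(\hat{\beta}_4)^{D_{jt}}\, P_{jt}^{\hat{\beta}_5}\, P_{jt}^{D_{jt} \hat{\beta}_6} = \exp(\hat{\beta}_0)\, \exp(\hat{\beta}_1)^{D_{it}}\, P_{it}^{\hat{\beta}_2 + D_{it} \hat{\beta}_3}\, \exp(\hat{\beta}_4)^{D_{jt}}\, P_{jt}^{\hat{\beta}_5 + D_{jt} \hat{\beta}_6} \]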
Other Linearizable Models
We will probably not use these models, but they are very similar to the multiplicative
model, so they are briefly mentioned.
Exponential Model. Rather than taking logs of both X and Y, one can take logs of only
one or the other. The exponential model has the following form and can model either increasing
or decreasing returns:
\[ Y = a e^{bX} \]
If we take logs of both sides of this, we have
\[ \ln(Y_i) = \ln(a) + bX_i + e_i = \alpha + bX_i + e_i \quad \text{where } \alpha = \ln(a) \text{ or } a = e^{\alpha} \]
Therefore, if we take the logarithm of Y, but not X, we are estimating this exponential model.
This is one of the curve fitting options in Excel Chart.
Semi-logarithmic Model. It is also possible to take logs of just one or more of the X
variables, i.e.,
\[ Y_i = a + b_1 \ln(X_{1i}) + b_2 X_{2i} + e_i \]
Typically, we might choose this model when we expect one of the independent variables to
display a nonlinear relationship to Y. This might occur if X1 is a size variable, like number of
employees. There may be large differences between small and medium sized companies, but
smaller differences between large and very large companies.
Example. The spreadsheet NonlinearAdvSales.xls provides an example of nonlinear
modeling. It happens to have been done within the chart option of Excel rather than regression;
however, the appropriate columns allow you to run regressions with the multiplicative, exponential,
and semi-logarithmic models. Look at the $R^2$ values and the plots of the residuals to choose the most
appropriate model.
Estimating Nonlinear Models with Solver
It is also possible to estimate response models with Excel’s Solver add-in (Read the Excel
Solver Technical Note in the WebCT Technical Notes folder). Solver searches for values of cells,
or parameters, that maximize or minimize another cell, which is a function of the parameters.
When estimating a response model, we will be searching for parameters (like regression weights)
that minimize the sum of squared errors between the predicted and actual dependent variable.
We can also use Solver to find values of marketing mix elements that maximize profits.
The spreadsheet NonlinearLeastSquares.xls contains two examples of estimating
response models using Solver. The first is the linear regression dealing with the weight-loss
problem we saw earlier (see IntroReg sheet, Sheet1 and Chart1). The second deals with the
ADBUDG model (see IntroADBUDG sheet, Sheet2 and Chart2). In either case, the steps are the
same and are given in the two Intro spreadsheets.
The following description is for the weight-loss advertising example:
1. Select locations for the parameters you want to estimate and put in initial guesses. Select cells
that are contiguous (A3 and B3).
2. Place the independent variable (in this case number of advertisements) in a column (in this
case column B).
3. Place the dependent variable (calls) in a column (C). Calculate the mean of that column.
4. Create a column that uses the parameters to estimate the dependent variable (D).
5. Create a column (E) that is the squared difference between the dependent variable and the
predicted dependent variable, (C - D)^2. Sum this column. That sum, which is the residual (or
error) sum of squares, is the number you want to minimize. Solver should search over
different values of the parameters (A3 and B3) to minimize this cell.
6. This is not required to estimate the parameters, but create a column (F) that is the total sum
of squares, i.e., the squared difference between the dependent variable (C) and its mean. The
purpose of this is to allow you to calculate $R^2$.
7. Calculate $R^2$ = 1 - (residual sum of squares / total sum of squares).
8. To open Solver, click Data then Solver in Excel 2007, or Tools then Solver in Excel 2003. (If Solver has
not been installed in Excel 2003, click on Tools then Add-ins and check Solver. If Solver has
not been installed in Excel 2007, click the Microsoft Office Button, Excel Options, Add-Ins,
Manage Excel Add-ins, Go.) After the Solver dialog box comes up, select the cell to be
minimized (the sum of the squared error column) and choose Min. Select the cells to be searched
over (A3:B3). Add any appropriate restrictions (none are needed in this case). Click Solve.
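The steps above can be sketched in Python. This is only an illustration of the idea, not what Solver does internally: the data are hypothetical (they are not the values in NonlinearLeastSquares.xls), and the search routine is a simple compass search that I am substituting for Solver's optimizer. It tries a step up and down on each parameter, keeps any change that lowers the residual sum of squares, and halves the step when nothing helps.

```python
# Hypothetical data: number of advertisements (x) and calls received (y).
ads = [1, 2, 3, 4, 5, 6]
calls = [12, 15, 21, 24, 31, 33]

def predict(params, x):
    intercept, slope = params
    return intercept + slope * x

def rss(params):
    """Residual sum of squares: the cell Solver is asked to minimize."""
    return sum((y - predict(params, x)) ** 2 for x, y in zip(ads, calls))

def compass_search(objective, start, step=1.0, tol=1e-8):
    """Try +/- step on each parameter; keep improvements, else halve the step."""
    best = list(start)
    best_val = objective(best)
    while step > tol:
        improved = False
        for i in range(len(best)):
            for delta in (step, -step):
                trial = list(best)
                trial[i] += delta
                val = objective(trial)
                if val < best_val:
                    best, best_val, improved = trial, val, True
        if not improved:
            step /= 2.0
    return best, best_val

# Initial guesses of zero play the role of the starting values in A3 and B3.
params, rss_min = compass_search(rss, start=[0.0, 0.0])

mean_calls = sum(calls) / len(calls)
tss = sum((y - mean_calls) ** 2 for y in calls)  # total sum of squares
r_squared = 1 - rss_min / tss
print(params, rss_min, r_squared)
```

Because this model is linear, the search should land on the same intercept and slope that ordinary regression gives; the value of a search approach is that the same code works when predict is replaced by a nonlinear function.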
ADBUDG
ADBUDG is a flexible model that was developed for judgmental data. It can represent
either an s-shaped model where increasing returns occur up to a point, and decreasing returns
after that or a concave model, which always has decreasing returns. The s-shaped model is
appropriate for the situation where there is little response until we spend more than a certain
amount, and then sales increase rapidly for a period, but at some point advertising will become
increasingly less effective.
ADBUDG has four parameters:
$$Y(x) = b + (a - b)\frac{x^c}{x^c + d}$$
b is the minimum value of Y: “what will sales be if you do not do any advertising or
promotion?”
a is the maximum value of Y: “what will sales be if you spend an infinite amount on
advertising?”
c controls the shape of the curve; the curve is concave if 0 < c < 1 and it is s-shaped if c > 1, and
d works with c to control how quickly the curve rises.
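A short Python sketch of the ADBUDG function may make the role of each parameter easier to see. The parameter values below are hypothetical and chosen only to contrast a concave curve (c = 0.5) with an s-shaped one (c = 2).

```python
def adbudg(x, a, b, c, d):
    """ADBUDG response: Y(x) = b + (a - b) * x**c / (x**c + d)."""
    xc = x ** c
    return b + (a - b) * xc / (xc + d)

# Hypothetical parameter values chosen only to show the two shapes.
a, b, d = 100.0, 20.0, 5.0
for c, label in ((0.5, "concave"), (2.0, "s-shaped")):
    curve = [round(adbudg(x, a, b, c, d), 1) for x in (0, 1, 2, 4, 8, 16)]
    print(label, curve)
```

At x = 0 the function returns b, and as x grows it approaches a, which matches the interpretation of the two parameters as the minimum and maximum of Y.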
Statistical Estimation using Solver. This is based on the model in Sheet2 of the
NonLinearLeastSquares.xls spreadsheet.
1. There are four parameters, a through d. They are placed in A6:D6. Initial values are selected: b is set
at a minimum value, a is the maximum value, c is set at 2 (I always do that), and d was set at
20 (it is hard to explain why).
2. The independent variable (marketing effort) is placed in column A.
3. The dependent variable (sales) is in column B. The mean of the column is at the bottom.
4. Forecast sales (Yhat) is in column C. Check how this was calculated: I just plugged in the
ADBUDG function using the parameter cells.
5. Calculate a column of squared errors, (C - B)^2. The sum is at the bottom. Create a column of
TSS by taking the squared difference between the dependent variable and its mean. Sum this
column.
6. Calculate $R^2$.
7. To run Solver, click Tools, then Solver. We want to minimize the sum of the squared errors
by searching over the parameters (A6:D6). Here we should put some constraints on the
parameters: A6:D6 > 0 and B6 < A6.
Judgmental parameter estimation. There are four parameters and they can be uniquely
determined with four estimates. Usually these estimates are in terms of changes from the current
situation. By what percent would sales grow (shrink) if you used a saturation level of (did no)
advertising? By what percent would sales increase if you spent 50% more on advertising? We
assume sales would remain constant if your level of advertising remained constant.
b = y(0), i.e., the percent of current sales that would be retained if advertising were cut to zero
a = y(∞), i.e., sales as a percent of current sales if the advertising level were infinite
y(1) is the sales at the current level
y(1.5) is the percent of current sales you would sell if you spent 50% more on advertising.
Because $1^c = 1$, we can solve for d with the following formula:
$$Y(1) = b + (a - b)\frac{1}{1 + d}$$
Going through some algebra, we can see d is equal to the following:
$$d = \frac{a - y(1)}{y(1) - b} = \frac{y(\infty) - y(1)}{y(1) - y(0)}; \quad \text{when } y(1) = 1,\ d = \frac{y(\infty) - 1}{1 - y(0)}.$$
Assuming that the person also provided an estimate of y(1.5), we can solve for c. After more
algebra:
$$c = \frac{\ln\left(d\,\dfrac{y(1.5) - b}{a - y(1.5)}\right)}{\ln(1.5)}$$
For example, assume a manager estimated that sales would drop to 60% of current without any
advertising, rise to 2X current sales with saturation advertising, and rise to 1.2X current sales
with 1.5X as much advertising. This would generate the following parameters:
a = 2.0
b = .6
$$d = \frac{a - y(1)}{y(1) - b} = \frac{2 - 1}{1 - .6} = \frac{1}{.4} = 2.5$$
$$c = \frac{\ln\left(d\,\dfrac{y(1.5) - b}{a - y(1.5)}\right)}{\ln 1.5} = \frac{\ln\left(2.5 \times \dfrac{1.2 - .6}{2 - 1.2}\right)}{\ln(1.5)} = \frac{\ln\left(2.5 \times \dfrac{.6}{.8}\right)}{.4055} = 1.55$$
The cases in the book (Conglom, Syntex, and Blue Mountain Coffee) all use a slightly
different method. They ask for the estimated change in sales from current sales at the following
four levels of marketing effort: 0, 50% of current, 150% of current, and saturation. The method
implicitly assumes that the current level of marketing effort is going to result in the current level
of sales.
The first three parameters, a, b, and d, are estimated in the very same way as above:
b = y(0)
a = y(∞)
y(1) is sales at the current level
$$d = \frac{a - y(1)}{y(1) - b}; \quad \text{as } y(1) = 1,\ d = \frac{a - 1}{1 - b}.$$
For the other two estimates, a nonlinear least squares procedure is used with the
observations y(.5) and y(1.5) to estimate c. The errors in the two estimates are:
$$e(0.5) = y(0.5) - \hat{y}(0.5) = y(0.5) - \left[b + (a - b)\frac{0.5^c}{0.5^c + d}\right]$$
and
$$e(1.5) = y(1.5) - \hat{y}(1.5) = y(1.5) - \left[b + (a - b)\frac{1.5^c}{1.5^c + d}\right]$$
We use Solver to search for the value of c that minimizes $e(0.5)^2 + e(1.5)^2$.
Both of these procedures are illustrated in the spreadsheet ADBUDGJudmental.xls.
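The one-parameter search for c can be mimicked with a brute-force grid, which I am using here in place of Solver. In this sketch a, b, and d come from the earlier worked example, y(1.5) = 1.2 is the value used there, and y(0.5) = 0.8 is a hypothetical judgment added only for illustration. The grid range and step size are also assumptions.

```python
def adbudg(x, a, b, c, d):
    xc = x ** c
    return b + (a - b) * xc / (xc + d)

# Judgments relative to current sales; y(0.5) = 0.8 is a hypothetical value.
a, b, d = 2.0, 0.6, 2.5
y_half, y_one_and_half = 0.8, 1.2

def sum_sq_error(c):
    """e(0.5)^2 + e(1.5)^2 for a candidate value of c."""
    e_half = y_half - adbudg(0.5, a, b, c, d)
    e_one_and_half = y_one_and_half - adbudg(1.5, a, b, c, d)
    return e_half ** 2 + e_one_and_half ** 2

# Brute-force grid over c from 0.100 to 5.000 in steps of 0.001.
grid = [k / 1000 for k in range(100, 5001)]
c_hat = min(grid, key=sum_sq_error)
print(c_hat, sum_sq_error(c_hat))
```

Each judgment taken alone implies its own value of c, so the least squares answer is a compromise that falls between the two.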
Forecasting with ADBUDG Models. We do this the very same way we did forecasting
with a regression model. The first step is to estimate the four parameters of the ADBUDG model.
This can be done either with marketplace data (see the ADBUDG portion of the
NonLinearLeastSquares.xls spreadsheet) or judgmental data (see the ADBUDGJudmental.xls
spreadsheet). Continuing with the example from the NonLinearLeastSquares.xls spreadsheet, as
with the regression model, we use the estimated coefficients and plug in the expected “marketing
effort” to forecast unit sales:
$$Y(x) = b + (a - b)\frac{x^c}{x^c + d} = 30.3 + (56.7 - 30.3)\frac{x^{2.3}}{x^{2.3} + 20.1}$$
With zero marketing effort we would expect 30.3 sales and with an infinite amount of marketing
effort, we would expect 56.7 sales.
Profit Models. In addition to forecasting sales we can also forecast profits. The general
model is the same as earlier:
Profit = Unit Sales x Margin - Fixed Cost
Continuing with the same example and assuming that the margin is $2 and the cost of a
unit of marketing effort (FC_u) is $1.5:
$$\pi = Y(x) \times \text{margin} - FC_u \times x = \left[b + (a - b)\frac{x^c}{x^c + d}\right] \times \text{margin} - FC_u \times x$$
$$\pi = \left[30.3 + (56.7 - 30.3)\frac{x^{2.3}}{x^{2.3} + 20.1}\right] \times \$2 - \$1.5 \times x$$
The Profit worksheet calculates forecast sales and profits for different levels of marketing effort.
This is graphed in Chart3. Cells H13:H17 of that same sheet allow you to use Solver to find the
level of marketing effort that maximizes profits.
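The profit calculation can also be sketched in Python using the parameter values quoted in this example (a = 56.7, b = 30.3, c = 2.3, d = 20.1, margin of $2, and $1.5 per unit of marketing effort). The grid search below is my stand-in for Solver, and the search range of 0 to 30 units in steps of 0.01 is an assumption.

```python
def sales(x, a=56.7, b=30.3, c=2.3, d=20.1):
    """ADBUDG sales forecast at marketing effort x."""
    xc = x ** c
    return b + (a - b) * xc / (xc + d)

def profit(x, margin=2.0, unit_cost=1.5):
    """Profit = unit sales x margin - cost of the marketing effort."""
    return sales(x) * margin - unit_cost * x

# Grid search over effort levels 0.00 to 30.00 (range and step are assumptions).
levels = [k / 100 for k in range(0, 3001)]
best_x = max(levels, key=profit)
print(best_x, round(sales(best_x), 2), round(profit(best_x), 2))
```

With zero effort, profit is simply 30.3 x $2 = $60.60; the search looks for the point where the extra margin from one more unit of effort no longer covers its $1.5 cost.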
Clustering for Segmentation and Classification
Rather than assuming that the data can be represented by a line (or hyper plane) as in
regression, cluster analysis assumes that the data can be represented by a much smaller set of
points in a space. That is, most of the data points are expected to “cluster” around one of a small
number of points. So this smaller set of points can adequately represent the data, just as a line
can adequately represent the data in a regression.
In most cases, we will be clustering people to form market segments. We can think of
each person in an n-dimensional space, where n is the number of variables on which we have
data. For example, in a demographic segmentation, we could have variables for age, income,
educational level, marital status, and region of the country. If we did an attitudinal segmentation,
each person would be represented by their answers to a number of attitudinal questions. We want
to learn if there is some structure to the data, i.e., are people spread out uniformly or are there
distinct groups or segments? Examples might be high income professionals, liberals,
conservatives, etc.
The Segmentation and Classification program in ME>XL has two options: hierarchical
clustering and k-means. Hierarchical clustering is the default method, but I think k-means is
more valuable. The hierarchical clustering program starts with each point in its own cluster. It
goes through a series of steps. In each step it combines the two clusters that are most similar; the new
cluster is located at the centroid, or average, of the two clusters that have been combined. It
continues through this process until there is only one cluster. At each step the two clusters that
are joined together are the two that would increase the Error Sum of Squares (ESS) by the least
amount; essentially, it combines the two clusters that are closest together. (In regression, RSS, the
residual sum of squares, is the same thing as the error sum of squares.) At each step the ESS is the
total error sum of squares that is associated with that number of clusters. A smaller ESS means a
better fit. As the data are aggregated into fewer and fewer clusters, the ESS will continue to rise.
The k-means program uses the hierarchical solution as a starting point and “optimizes”
that solution by sequentially moving each point to each cluster to see if the fit improves. It
reports fit as a ratio of between to within sum of squares. This ratio is not that meaningful, but it
can be transformed into a number that is similar to an $R^2$.
Because the clusters are described in terms of the means of the points in the cluster, i.e.,
the centroid, the Total Sum of Squares = Between Sum of Squares + Within Sum of Squares or
TSS = BSS + WSS. This is like regression, in which TSS = ESS + RSS, where ESS now stands for the
explained sum of squares (not the error sum of squares used above for hierarchical clustering). Here BSS corresponds to
the explained sum of squares, and WSS corresponds to RSS, the residual sum of squares.
The WSS is the sum of the squared distances from all points represented by a cluster to
the centroid of that cluster. It measures how well the cluster centroid represents that set of points.
It is like $\sum e^2$, the sum of the squared errors in regression, i.e., the sum of the squared distances of the
data points from the line. BSS is a weighted sum of the squared distances between each pair of
group centroids, where the weight is the number of points represented by each cluster. BSS is
bigger when the groups are further apart or are better separated.
In regression $R^2$ = ESS / TSS, where ESS is the explained sum of squares. We would like a similar statistic, but the k-means
clustering program reports only the ratio BSS/WSS. However, we can go through a little algebra
to calculate an $R^2$:
$$\text{Variance Accounted For (VAF)} = R^2 = \frac{BSS}{TSS} = \frac{BSS}{BSS + WSS}$$
If we divide the numerator and denominator of the RHS by WSS, we get:
$$VAF = \frac{BSS}{BSS + WSS} = \frac{BSS/WSS}{BSS/WSS + WSS/WSS} = \frac{\text{ratio}}{\text{ratio} + 1}$$
where ratio = BSS/WSS is the value printed out by the program. Therefore, we can calculate $R^2$ =
ratio/(ratio+1) from the k-means output.
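To see the identity at work, the following Python sketch computes TSS, WSS, and BSS for a small hypothetical data set that has already been split into two clusters. BSS is computed here in the usual way, as the size-weighted squared distance of each cluster centroid from the overall mean, so that TSS = BSS + WSS holds exactly; the data and cluster labels are invented.

```python
# Hypothetical two-variable data already assigned to two clusters.
clusters = {
    "A": [(1.0, 2.0), (2.0, 1.0), (1.5, 1.5)],
    "B": [(8.0, 9.0), (9.0, 8.0), (8.5, 9.5), (9.5, 8.5)],
}

def centroid(points):
    n = len(points)
    return tuple(sum(p[k] for p in points) / n for k in range(len(points[0])))

def sq_dist(p, q):
    return sum((pk - qk) ** 2 for pk, qk in zip(p, q))

all_points = [p for pts in clusters.values() for p in pts]
grand_mean = centroid(all_points)

# Total, within-cluster, and between-cluster sums of squares.
tss = sum(sq_dist(p, grand_mean) for p in all_points)
wss = sum(sq_dist(p, centroid(pts)) for pts in clusters.values() for p in pts)
bss = sum(len(pts) * sq_dist(centroid(pts), grand_mean) for pts in clusters.values())

ratio = bss / wss          # what the k-means program reports
vaf = ratio / (ratio + 1)  # the R^2-like statistic
print(round(tss, 3), round(bss + wss, 3), round(vaf, 4), round(bss / tss, 4))
```

The first two printed numbers should match (TSS = BSS + WSS), as should the last two (ratio/(ratio + 1) = BSS/TSS).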
We can use either the $R^2$ or the ESS (the error sum of squares from the hierarchical clustering) to help determine the proper number of clusters. In
either case, we look for a big improvement in fit up to a certain number of clusters and a small
improvement after that; this is called an “elbow.”
In the PDA data, we had the following Between/Within Sum of Squares ratios (B/W), VAFs, and ESSs for
differing numbers of clusters:
Clusters   B/W      VAF     ESS
1          -        -       2.84
2          .3609    0.27    1.53
3          .6964    0.41    0.83
4          .9594    0.49    0.67
5          1.211    0.55    0.44
6          1.441    0.59    0.31
7          1.639    0.62    0.31
8          1.793    0.64    0.25
Looking first at ESS, when we go from one to two clusters, the ESS drops by 1.31 (= 2.84 - 1.53). It drops by .70 (= 1.53 - .83) when we go from two to three clusters. It drops by .16 going
from three to four clusters, etc.
If there is a clear “elbow” we would choose that number of clusters. For example suppose
the ESSs for one to five clusters were the following: 2.84, 1.53, .83, .80, and .75. We would see
that there is a large drop in ESS as we go from one to three clusters, but little is gained after that.
With real data we do not usually get this clean of a solution, and we must look at other things
such as size and interpretability of the clusters.
Similarly, if we saw $R^2$ values that increased as .27, .41, .49, .55, .59, we might choose the
solution with an $R^2$ of .49, or possibly .55, as the gains get smaller after that.
In our data, we see that little is gained after six clusters and quite a bit is gained for the
first three or four clusters. This says that the proper number is probably between 3 and 6. We
need to look at the size of the clusters and the interpretation of the new clusters to make a determination
of the optimal number.
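The elbow arithmetic is simple enough to script. The sketch below uses the ESS and B/W figures reported for the PDA data, assuming the eight ESS values belong to one through eight clusters and the seven B/W values to two through eight clusters (a one-cluster solution has no between-cluster variation). It prints the drop in ESS for each added cluster and the VAF implied by each ratio.

```python
# ESS for 1..8 clusters and B/W ratio for 2..8 clusters, as reported for the PDA data.
ess = [2.84, 1.53, 0.83, 0.67, 0.44, 0.31, 0.31, 0.25]
bw_ratio = [0.3609, 0.6964, 0.9594, 1.211, 1.441, 1.639, 1.793]

# Drop in ESS when going from k to k + 1 clusters.
drops = [round(ess[k] - ess[k + 1], 2) for k in range(len(ess) - 1)]

# Variance accounted for: ratio / (ratio + 1).
vaf = [round(r / (r + 1), 2) for r in bw_ratio]

print(drops)
print(vaf)
```

Scanning the list of drops for the point where they become small is the numerical version of looking for the elbow in a plot.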
Discriminant Analysis
In addition to clustering, the Segmentation and Classification program also performs a
“discrimination.” Once clusters have been formed on one set of variables, say attitudinal, then
the program attempts to see if there are differences among these clusters in terms of another set
of variables, say demographic. So it may try to determine if there are demographic differences
between liberals and conservatives. This is accomplished through a statistical technique called
discriminant analysis.
Discriminant analysis shares some similarities with both cluster analysis and regression.
Like regression it is a statistical technique that determines the best linear relationship between a
set of independent variables and a dependent variable.
Yi = a + b1 X1i + b2X2i + b3 X3i + ei
Regression finds that linear combination, i.e., that set of a and bs, that best explains the variation
in a dependent variable. It finds that combination of a, b1, b2, and b3 that minimizes $\sum e^2$ or
maximizes $R^2$. The dependent variable is assumed to be interval scaled. Regression assumes that the
relationship between the dependent and independent variables can be represented in terms of a
straight line (or actually a hyperplane in multiple regression).
In discriminant analysis, the Yi is a categorical variable, i.e., group membership.
Categories are just different, e.g., male and female (or the benefit clusters in the PDA
case), but there is no order to them. Discriminant analysis finds that linear combination (or linear
combinations) that best separates groups or, equivalently, that does the best job of predicting
group membership.
For example, if a market is segmented into benefit or needs clusters, we might use
discriminant analysis to see if a linear combination of demographic variables can separate these
groups, i.e., determine which demographic variables best differentiate, or separate these
segments. Stated differently, we want to see if the clusters differ significantly in terms of
demographic variables.
Rather than estimating a line that all points lie close to, it estimates a function such that
the scores of all observations in one group are close to each other and far from the
scores of the other groups. So, rather than interpreting the data in terms of a straight line, we
interpret the data in terms of a single point for each group. Can we adequately represent our data
as a set of points? We want a lot of variation (or distance) between groups, which is called the Between
group Sum of Squares, BSS, and very little variation within groups, called the Within group Sum of
Squares, WSS; i.e., we want each point to be close to the centroid of the group to which it belongs.
Like regression, which attempts to minimize RSS or maximize ESS, discriminant analysis
attempts to maximize a function of the ratio BSS/WSS.
Like clustering, the statistics are based on WSS, within group sum of squares, and BSS,
between group sum of squares. The big difference is that in clustering we do not know which
cluster each observation is in before we start. We do not even know how many clusters there are.
In discriminant analysis, we know which group each observation is in and we want to find out if
there are any differences in a set of independent variables among observations in different
groups.
Suppose we wanted to discriminate between a group of males and a group of females.
We want to try to predict group membership, here a person’s gender, i.e., we want to find a function
of the independent variables that gives high scores to one gender and low scores to the other.
Suppose we measure people in terms of height, weight, shoe size, eye color, grade point average, and
GMAT. The discriminant function would look like the following:
$$Gender_i = \alpha + \beta_1 Ht_i + \beta_2 Wt_i + \beta_3 SS_i + \beta_4 EC_i + \beta_5 GPA_i + \beta_6 GMAT_i$$
If male is coded as one and female as zero, we want a set of $\alpha$ and $\beta$s that give scores
close to one to men and scores close to zero to women. In this case we might expect that $\beta_1$, $\beta_2$,
and $\beta_3$ would be greater than zero, i.e., on average men tend to be physically bigger than women.
If the sample consisted of MBA students, we might not expect a significant difference in GPA or
GMAT between men and women. This function tells us which variables differ significantly
between men and women. We can use this function (called a discriminant function) to predict the
gender of a given person, given knowledge of their height, weight, shoe size, eye color, GPA,
and GMAT. Some larger women would get incorrectly classified as men and some smaller men
would get incorrectly classified as women.
One measure of the quality of a discriminant analysis is the proportion of observations
that are correctly classified. This is kind of like an $R^2$, the amount of explained variance.
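To make the idea concrete, here is a small Python sketch of a two-group discriminant function and its hit rate. It uses Fisher's linear discriminant (weights equal to the inverse of the pooled within-group scatter matrix times the difference in group means), which is one standard way to compute a two-group function; these notes do not say which algorithm ME>XL uses. Only height and weight are included, and all of the data are hypothetical.

```python
# Hypothetical (height in inches, weight in pounds) measurements.
men = [(70, 180), (72, 175), (68, 185), (71, 190), (69, 170)]
women = [(64, 130), (66, 125), (62, 135), (65, 140), (63, 120)]

def mean_vec(rows):
    n = len(rows)
    return [sum(r[k] for r in rows) / n for k in range(2)]

def scatter(rows, m):
    """2x2 within-group sums of squares and cross-products."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for r in rows:
        dev = [r[0] - m[0], r[1] - m[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += dev[i] * dev[j]
    return s

m_men, m_women = mean_vec(men), mean_vec(women)
s_men, s_women = scatter(men, m_men), scatter(women, m_women)
sw = [[s_men[i][j] + s_women[i][j] for j in range(2)] for i in range(2)]

# Weights = inverse(pooled within scatter) times the difference in group means.
diff = [m_men[0] - m_women[0], m_men[1] - m_women[1]]
det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
weights = [(sw[1][1] * diff[0] - sw[0][1] * diff[1]) / det,
           (sw[0][0] * diff[1] - sw[1][0] * diff[0]) / det]

def score(person):
    return weights[0] * person[0] + weights[1] * person[1]

cutoff = (score(m_men) + score(m_women)) / 2  # midpoint between group centroids

def predict_male(person):
    return score(person) > cutoff

hits = sum(predict_male(p) for p in men) + sum(not predict_male(p) for p in women)
hit_rate = hits / (len(men) + len(women))
print(weights, cutoff, hit_rate)
print(predict_male((69, 165)))  # a hypothetical taller, heavier woman
```

In this toy data every in-sample person is classified correctly, but the hypothetical woman who is 69 inches tall and weighs 165 pounds scores above the cutoff and is classified as male, which is the kind of misclassification described above.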
If there are only two groups, we can model their centroids in terms of two points on a
line. That is, the space will be one-dimensional. When we have three groups, we will need to
locate them as points in a two-dimensional space unless one of the groups falls on the line
between the other two groups. This means there will be two linear combinations, or discriminant
functions, one for each dimension in the space. The first explains as much of the variation as
possible, i.e., maximizes BSS/WSS. The second function explains as much of the residual
variation as possible, subject to the constraint that it is orthogonal (perpendicular) to the first
function. The number of dimensions (discriminant functions) can be at most one fewer than the number of
groups. If we have four groups, we can have at most three discriminant functions. One of the
outputs of a discriminant analysis is the amount of variance that is explained by each dimension
as well as the cumulative variance explained by all dimensions up to and including that one. This
can be interpreted like the fit statistics in clustering. You want to balance a small number of
dimensions with as much explanatory power as possible. As in cluster analysis, where the
question is whether one should add one more cluster to the solution, the question with
discriminant analysis is whether another dimension is needed to adequately represent the groups.
In the following table, taken from the four-cluster needs-based PDA solution, only two
discriminant functions are needed to adequately represent the demographic clusters as they
capture about 80% (79.09%) of the variance.
Discriminant   Percent of   Cumulative   Significance
function       variance     percent      level
------------   ----------   ----------   ------------
1              48.49        48.49        .000
2              30.59        79.09        .000
3              20.91        100.00       .015
It would be nice if the discriminant analysis program printed out the actual discriminant
functions, which would be like printing out the regression weights. Unfortunately, ours prints out
the correlations between the independent variables and the discriminant functions. The
correlations are related to the discriminant function weights but are not the same thing. They do
show the direction of each weight and which variables are more important in determining each
function.
Variable      Func1    Func2
----------    ------   ------
PDA           .708     .132
Income        .669     .086
Bus_Week      .635     -.089
Education     .622     -.011
Professnl     .591     .137
M_Gourmet     .456     .064
PC_Mag        .354     -.024
Field&Stre    -.277    .674
Construct     -.187    .660
Sales         -.045    -.512
Emergency     -.265    .424
Service       -.356    -.328
Age           -.076    .030
Again, these correlations are taken from the same PDA discriminant analysis. This says
the first dimension is primarily: PDA, income, education, Professional, etc. and the second
dimension is Field&Stream, Construction, Sales, and Emergency.
Following is a plot of the group centroids from the positioning analysis program. It is
similar to discriminant analysis. The first dimension has been reversed as professional, PDA, etc
are located on the left side of the space.
The first, horizontal, dimension separates the professionals from the other groups and has
all the first discriminant function variables lying on that axis. The second, vertical, dimension is
sales, construction, service, and Field & Stream. This shows where the groups fall relative to
each other demographically.
For example, if a market is segmented into benefit clusters, we might use discriminant
analysis to see if a linear combination of demographic variables can separate these groups, i.e.,
determine which demographic variables best differentiate, or separate these clusters.
Perceptual Mapping
This section will cover perceptual mapping, or positioning analysis using factor analysis.
Cluster analysis tries to group observations (e.g., people) that are similar in groups. Factor
analysis tries to “group” variables that are similar together. If two (or more) variables are highly
correlated, then a single variable could do a fairly good job of representing both. Factor analysis
replaces a set of correlated variables with a linear combination of them that retains as much of
their information (variance) as possible. These linear combinations are called factors. This allows
us to represent a set of objects in a reduced dimensional space. For example, following are a set
of 10 cars (from several years ago) that have been rated on seven attributes:
Attributes / Brands   BMW      Cavalier   Intrepid   Taurus   Accord   Altima   Saturn   Subaru   Camry    VW Passat
Fuel Econ             -0.413   -0.152     -0.891     -0.543   0.413    0.065    0.587    0.021    0.587    0.326
Reliability           0.573    -1.034     -0.73      -0.73    0.921    0.182    -0.034   0.182    0.834    -0.165
Style                 1.43     -1.091     -0.221     -1.004   0.517    0.12     -0.569   -0.134   0.43     0.517
Price                 -1.465   0.969      -0.204     0.404    -0.247   0.273    0.969    -0.334   -0.16    -0.204
Fun to Drive          1.704    -1.078     0.139      -1.034   0.182    0.008    -0.6     0.182    0.095    0.4
Safety                0.652    -0.782     -0.217     0        0.217    0        -0.173   0.26     0.217    -0.173
Space                 -0.543   -0.673     0.326      0.5      0.108    0.108    -0.282   0.63     0.195    -0.369
These numbers have been scaled so the average rating on each attribute is 0.0. Positive
numbers indicate above average ratings and negative numbers represent a lower than average
rating. We can think of these cars as located in a 7-dimensional space. We cannot visualize
things in seven dimensions, but we could plot pairs of dimensions, e.g., plot the cars on the
dimensions of fuel economy and reliability, then fuel economy and style, etc.
The Positioning Analysis program uses factor analysis to derive a smaller number of
dimensions that contains as much information from the original variables as possible. Following
is a correlation table of the attributes:
              Fuel    Reliability   Style   Price   Fun     Safety   Space
Fuel          1.00
Reliability   0.59    1.00
Style         0.18    0.77          1.00
Price         0.23    -0.55         -0.87   1.00
Fun           -0.05   0.61          0.95    -0.93   1.00
Safety        0.06    0.78          0.76    -0.81   0.75    1.00
Space         -0.17   0.09          -0.19   -0.06   -0.16   0.29     1.00
Five of these attributes (Reliability, Style, Price, Fun to Drive, and Safety) are highly
(positively or negatively) correlated. If two variables are positively correlated, then a car that is
perceived to be higher (or lower) than average on one of these attributes is likely to be
perceived as higher (or lower) on the other as well. If two variables are perfectly correlated, then
the second one contains no new information and their sum would contain just as much
information as the two variables by themselves. If several variables are correlated, then their
weighted sum will contain most of the information in all of them individually. In this example,
these five attributes can be represented as a single dimension in a perceptual space without too
much loss of information. Of the other two attributes, Fuel Economy is correlated with Reliability
but with no other attribute, and Space is relatively uncorrelated with any other attribute.
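As a check on the correlation table, the following Python sketch recomputes two of its entries from the attribute ratings, taking the ten brands in the order BMW through VW Passat. The pearson helper is my own; nothing here depends on ME>XL.

```python
import math

# Ratings copied from the attribute table, in brand order BMW ... VW Passat.
style = [1.43, -1.091, -0.221, -1.004, 0.517, 0.12, -0.569, -0.134, 0.43, 0.517]
fun = [1.704, -1.078, 0.139, -1.034, 0.182, 0.008, -0.6, 0.182, 0.095, 0.4]
price = [-1.465, 0.969, -0.204, 0.404, -0.247, 0.273, 0.969, -0.334, -0.16, -0.204]

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

r_style_fun = pearson(style, fun)
r_price_fun = pearson(price, fun)
print(round(r_style_fun, 2), round(r_price_fun, 2))
```

These should come out close to the 0.95 (Style with Fun) and -0.93 (Price with Fun) shown in the table, which is why those attributes end up on the same dimension, pointing in opposite directions.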
In a two-dimensional space, we might guess that the first dimension – the horizontal
dimension will represent these first five attributes and the second dimension will represent some
combination of the other two. The positioning program produces the following perceptual map:
We see that Fun to Drive, Safety, Style, and Reliability all lie close to each other and
Price, which was negatively correlated, points in the opposite direction. The first dimension
accounts for 55.1% of the variation in the data. The second dimension, which is primarily Fuel
Economy, accounts for 20.1% of the variation in the data. Additionally, brands that are similar to
each other are located close to each other in the space, e.g., Camry and Accord are located close
to each other. Brands that are distinct, like BMW and Taurus are located away from other
brands.
Construction of Joint Spaces
The information needed to locate preference vectors or ideal points in the space consists
of consumers’ preferences or purchase likelihoods of each brand. A regression is used to find
the relationship between the brand locations and preferences. Preference, or purchase likelihood,
Pj, is the dependent variable and brand locations, X1j and X2j, are the independent variables.
Preference vectors. The location of a preference vector is determined by a regression that
is shown in the next equation:
$$P_j = \hat{B}_0 + \hat{B}_1 X_{1j} + \hat{B}_2 X_{2j} + e_j$$
This procedure is illustrated with an example involving one person's likelihood of
purchasing each automobile on a 0 to 10 point scale (where 10 means the person was very likely
to purchase the automobile).
Respondents / Brands   BMW   Cavalier   Intrepid   Taurus   Accord   Altima   Saturn   Subaru   Camry   VW Passat
Bill                   6     2          2          4        10       8        2        8        10      8
The brand locations are given in the “Diagnostics” page of the ME>XL output under
Coordinates. The above row of preferences has been special pasted into the last column:
Brand        Dim 1     Dim 2     Bill
BMW          0.6245    0.2516    6
Cavalier     -0.5663   -0.0962   2
Intrepid     -0.1211   0.5908    2
Taurus       -0.3228   0.3792    4
Accord       0.2363    -0.2636   10
Altima       0.0044    -0.071    8
Saturn       -0.2366   -0.4502   2
Subaru       0.1075    0.1541    8
Camry        0.2061    -0.3224   10
VW Passat    0.0681    -0.1723   8
A regression is run with the preferences as the dependent variable and the two
coordinates as the independent variables. The regression gave the following results:
SUMMARY OUTPUT
Regression Statistics
Multiple R          0.75
R Square            0.56
Adjusted R Square   0.44
Standard Error      2.45
Observations        10

            Coefficients   Standard Error   t Stat   P-value
Intercept   6.00           0.78             7.73     0.00
1           6.47           2.45             2.64     0.03
2           -3.46          2.45             -1.41    0.20
The positive sign on the regression weight for X1 and the negative sign on the weight for X2 indicate
that likelihood of purchase increases as one moves to the lower right in the space. Because the
first dimension has a weight that is approximately twice as large in absolute value as that of dimension two (i.e., 6.47
versus -3.46), dimension one is more important when determining likelihood of purchasing.
The preference vector is located by drawing a vector through the origin that decreases
3.46 units vertically for every 6.47 units horizontally. The preference vector for this person is
shown in the following figure. This can be drawn manually by starting at the origin and moving
some multiple of 3.46 down for the same multiple of 6.47 to the right.
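The preference-vector regression can be reproduced outside of Excel. The Python sketch below uses the brand coordinates and Bill's purchase likelihoods from the tables above and solves the normal equations directly; the ols helper is my own small least squares routine, not anything from ME>XL.

```python
def ols(columns, y):
    """Least squares coefficients [intercept, b1, b2, ...] via the normal equations."""
    rows = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    # Solve (X'X) b = X'y by Gauss-Jordan elimination with partial pivoting.
    aug = [xtx[i] + [xty[i]] for i in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(k):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][k] / aug[i][i] for i in range(k)]

# Brand coordinates (BMW ... VW Passat) and Bill's purchase likelihoods.
dim1 = [0.6245, -0.5663, -0.1211, -0.3228, 0.2363, 0.0044, -0.2366, 0.1075, 0.2061, 0.0681]
dim2 = [0.2516, -0.0962, 0.5908, 0.3792, -0.2636, -0.071, -0.4502, 0.1541, -0.3224, -0.1723]
bill = [6, 2, 2, 4, 10, 8, 2, 8, 10, 8]

b0, b1, b2 = ols([dim1, dim2], bill)
slope = b2 / b1  # vertical change per unit of horizontal change along the vector
print(round(b0, 2), round(b1, 2), round(b2, 2), round(slope, 3))
```

The coefficients should match the regression output (about 6.00, 6.47, and -3.46), and the ratio b2/b1 gives the direction of the preference vector through the origin.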
Ideal points. This formulation assumes that the squared distance between a brand (X1j,
X2j) and an ideal point (Y1, Y2) in two dimensions is inversely related to preference for the
brand, i.e., an ideal point is located close to brands preferred by that person.
The following equation models preference for a given brand as a linear function of its
squared distance from the ideal point in a two dimensional perceptual space:
$$P_j = a + b\,d_j^2 = a + b\sum_{i=1}^{2} (Y_i - X_{ij})^2$$
where Pj, a, b, and Xij are defined as before. Yi is the location of the ideal point on the ith
dimension. A negative sign on b indicates that preference for a brand decreases the
further it is from the ideal point (see footnote 1). This relationship between brand preference and its squared
distance from the ideal point can be used to locate the ideal point. Stated differently, we know a
person’s preference for each brand, Pj, and the brand’s location in a perceptual space, Xij, and we
want to find the location of the ideal point, Yi. The above equation is transformed into the
following nonlinear regression equation:
$$P_j = \hat{a} + \hat{b}\sum_{i=1}^{2} (\hat{Y}_i - X_{ij})^2 + e_j$$
where Pj is the preference for the jth brand, ($\hat{Y}_1$, $\hat{Y}_2$) are the coordinates of the ideal point, (X1j,
X2j) are the coordinates of the jth brand, and $\hat{a}$ and $\hat{b}$ are regression weights.
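We can expand the above equation as follows:
$$P_j = \hat{a} + \hat{b}\sum_{i=1}^{2} (\hat{Y}_i - X_{ij})^2 + e_j = \hat{a} + \hat{b}\sum_{i=1}^{2} \hat{Y}_i^2 - 2\hat{b}\sum_{i=1}^{2} \hat{Y}_i X_{ij} + \hat{b}\sum_{i=1}^{2} X_{ij}^2 + e_j$$
$$P_j = \hat{a} + \hat{b}\sum_{i=1}^{2} \hat{Y}_i^2 - 2\hat{b}\hat{Y}_1 X_{1j} - 2\hat{b}\hat{Y}_2 X_{2j} + \hat{b}(X_{1j}^2 + X_{2j}^2) + e_j$$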
Footnote 1: If the sign on b is positive, it indicates that preference increases as a brand moves away
from the ideal point. In this case the ideal point is called an "anti-ideal point" as it indicates a
position of minimum, rather than maximum, preference.
Remember, the goal is to find estimates for ($\hat{Y}_1$, $\hat{Y}_2$). The above equation is rewritten so
preference can be a function of just the X’s as follows:
$$P_j = \hat{B}_0 + \hat{B}_1 X_{1j} + \hat{B}_2 X_{2j} + \hat{B}_3 (X_{1j}^2 + X_{2j}^2) + e_j$$
where $\hat{B}_0 = \hat{a} + \hat{b}\sum_{i=1}^{2} \hat{Y}_i^2$, $\hat{B}_3 = \hat{b}$, and $\hat{B}_i = -2\hat{b}\hat{Y}_i$ for i = 1, 2.
The location of the ideal point on the ith dimension, $\hat{Y}_i$, is given by:
$$\hat{Y}_i = -\hat{B}_i / (2\hat{B}_3).$$
While the math may look complicated, it just shows it is possible to run a regression
where Pj is the dependent variable and $X_{1j}^2 + X_{2j}^2$, $X_{1j}$, and $X_{2j}$ are the three independent
variables. The location of the ideal point is given by the above equation. If $\hat{B}_3$ is negative, then
$\hat{Y}_i$ represents an ideal point, a place of maximum preference. If $\hat{B}_3$ is positive, then $\hat{Y}_i$
represents an anti-ideal point, a point of minimum preference.
Again, this is illustrated with the data from the same person as before. The only
difference is the addition of the $X_{1j}^2 + X_{2j}^2$ term to the previous regression.
Brand        Dim 1     Dim 2     Dim 1^2 + Dim 2^2   Bill
BMW          0.6245    0.2516    0.4533              6
Cavalier     -0.5663   -0.0962   0.3300              2
Intrepid     -0.1211   0.5908    0.3637              2
Taurus       -0.3228   0.3792    0.2480              4
Accord       0.2363    -0.2636   0.1253              10
Altima       0.0044    -0.071    0.0051              8
Saturn       -0.2366   -0.4502   0.2587              2
Subaru       0.1075    0.1541    0.0353              8
Camry        0.2061    -0.3224   0.1464              10
VW Passat    0.0681    -0.1723   0.0343              8
Again, one would expect the ideal point to be in the center of the Japanese cars and
closest to the Accord and Camry. The following regression is run:
SUMMARY OUTPUT
Regression Statistics
Multiple R          0.93
R Square            0.86
Adjusted R Square   0.79
Standard Error      1.50
Observations        10

                    Coefficients   Standard Error   t Stat   P-value
Intercept           8.55           0.86             9.97     0.00
1                   6.18           1.50             4.11     0.01
2                   -0.99          1.65             -0.60    0.57
Dim 1^2 + Dim 2^2   -12.74         3.57             -3.57    0.01
The coefficient associated with the $X_{1j}^2 + X_{2j}^2$ term is negative, so this represents an ideal point.
Applying $\hat{Y}_i = -\hat{B}_i / (2\hat{B}_3)$, the coordinates are given by:
Y1 = -6.18 / {2 * (-12.74)} = .24 and
Y2 = -(-.99) / {2 * (-12.74)} = -.04
This location is different from the one generated by ME>XL plotted in the next figure.
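The ideal-point regression can be reproduced the same way. The Python sketch below adds the squared-distance column to the two coordinates, runs the three-variable regression on Bill's data, and converts the coefficients to ideal-point coordinates using the sign convention Y_i = -B_i / (2 B_3) that follows from expanding the squared distance. The ols helper is my own small least squares routine.

```python
def ols(columns, y):
    """Least squares coefficients [intercept, b1, b2, ...] via the normal equations."""
    rows = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    # Solve (X'X) b = X'y by Gauss-Jordan elimination with partial pivoting.
    aug = [xtx[i] + [xty[i]] for i in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(k):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][k] / aug[i][i] for i in range(k)]

# Brand coordinates (BMW ... VW Passat) and Bill's purchase likelihoods.
dim1 = [0.6245, -0.5663, -0.1211, -0.3228, 0.2363, 0.0044, -0.2366, 0.1075, 0.2061, 0.0681]
dim2 = [0.2516, -0.0962, 0.5908, 0.3792, -0.2636, -0.071, -0.4502, 0.1541, -0.3224, -0.1723]
bill = [6, 2, 2, 4, 10, 8, 2, 8, 10, 8]

# Third predictor: squared distance of each brand from the origin.
sumsq = [x1 ** 2 + x2 ** 2 for x1, x2 in zip(dim1, dim2)]

b0, b1, b2, b3 = ols([dim1, dim2, sumsq], bill)
ideal_1 = -b1 / (2 * b3)
ideal_2 = -b2 / (2 * b3)
kind = "ideal point" if b3 < 0 else "anti-ideal point"
print(round(b0, 2), round(b1, 2), round(b2, 2), round(b3, 2))
print(kind, round(ideal_1, 2), round(ideal_2, 2))
```

The resulting point sits to the right of the origin and slightly below it, in the direction of the Accord and Camry, which is where one would expect this person's ideal point to be.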