Marketing Engineering Notes

Purpose

During the lectures I will cover some material that is not in the readings or that I do not think is well explained there. The purpose of this set of notes is to give you some information on these topics. The notes serve as a preview of what I plan to talk about in class and as a review after my lecture. The spreadsheets referenced in these notes should be available at http://www.business.utah.edu/~mktbm/mkt6600/ and I will not repeat this full path in the notes.

Response Models

Response models form the heart of marketing engineering and marketing decision making. A response model forecasts the change in a dependent variable, Y, as a function of the change in one or more independent variables, X. Most commonly, we will look at how sales respond to changes in the amount spent on advertising, sales promotions, or the sales force, to changes in price, or to changes in the features of a product or service. Other times we will look at how changes in the characteristics of a product or service change a person's preference for it or probability of purchasing it. So, typical dependent variables, i.e., Y, include sales, market share, preference, and probability of choice. Independent variables, i.e., X, include marketing mix elements, product or service characteristics, and characteristics of the buyer.

Page 4 in the Response Models Technical Note (in the WebCT Technical Notes folder) shows a large number of possible functional forms of response models. While it is good to know something about each of these functions, we will deal primarily with four: linear, multiplicative, ADBUDG, and logit.

Linear Regression

The most common assumption is that the dependent variable is linearly related to the independent variable(s).
That is, regression is based on an assumption that a set of points can be adequately represented by a straight line (or a hyperplane when there are several independent variables), i.e., most of the data points will lie relatively close to the regression line. Consider a linear equation with two independent variables X1 and X2:

\[ Y_i = a + b_1 X_{1i} + b_2 X_{2i} = a + \sum_{k=1}^{2} b_k X_{ki}, \qquad i = 1, \dots, n \]

In this equation, a is the intercept. It is the expected value of \(Y_i\) when both Xs are 0. Each regression coefficient, \(b_k\) (k = 1, 2), is the slope associated with that independent variable, \(X_{ki}\). It gives the expected change in \(Y_i\) for a one-unit change in \(X_{ki}\), holding the other variable(s) constant. Regression finds the combination of estimates of a, \(b_1\), and \(b_2\) (i.e., \(\hat{a}\), \(\hat{b}_1\), and \(\hat{b}_2\)) that minimizes the sum of the squared errors over all n observations, \(\sum_i e_i^2\), where \(e_i\) is the difference between the actual and predicted \(Y_i\), i.e., \(e_i = Y_i - \hat{Y}_i\).

\[ Y_i = \hat{a} + \hat{b}_1 X_{1i} + \hat{b}_2 X_{2i} + e_i = \hat{Y}_i + e_i \]

Diagnostics. The variation in \(Y\) is called the total sum of squares, TSS; the variation in \(\hat{Y}\) is called the explained sum of squares, ESS; and the variation in e is called the residual sum of squares, RSS. (Be aware that this terminology is sometimes reversed, with ESS standing for error sum of squares and RSS for regression sum of squares.)

\[ TSS = \sum_{i=1}^{n} (Y_i - \bar{Y})^2, \qquad ESS = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2, \qquad RSS = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 \]

Because the regression line always runs through the mean of the data, the following identity holds:

\[ TSS = \sum_{i=1}^{n} (Y_i - \bar{Y})^2 = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2 + \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2 + \sum_{i=1}^{n} e_i^2 = ESS + RSS \]

R² measures how much of the variation in the dependent variable, TSS, can be explained by the regression, ESS. It is the ratio of explained to total variation:

\[ R^2 = \frac{ESS}{TSS} = \frac{\sum_i (\hat{Y}_i - \bar{Y})^2}{\sum_i (Y_i - \bar{Y})^2} = 1 - \frac{RSS}{TSS} = 1 - \frac{\sum_i e_i^2}{\sum_i (Y_i - \bar{Y})^2} \]

This equation shows that the same regression weights that minimize the sum of the squared errors, RSS, also maximize R².
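The sums of squares above can be checked numerically. The following is a minimal sketch in Python with NumPy (the course itself works in Excel); the data are toy numbers invented for illustration and do not come from any of the course spreadsheets.

```python
import numpy as np

# Toy data, invented for illustration only (not from the course spreadsheets).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 2.0, 1.0, 3.0])
y = np.array([15.0, 40.0, 90.0, 118.0, 160.0, 82.0, 55.0, 125.0])

# Design matrix: a column of ones for the intercept a, then the X variable.
X = np.column_stack([np.ones_like(x), x])

# Least squares picks (a_hat, b_hat) to minimize the sum of squared errors.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ coef          # predicted values
e = y - y_hat             # residuals e_i = Y_i - Yhat_i

tss = np.sum((y - y.mean()) ** 2)      # total sum of squares
ess = np.sum((y_hat - y.mean()) ** 2)  # explained sum of squares
rss = np.sum(e ** 2)                   # residual sum of squares

r2_from_ess = ess / tss
r2_from_rss = 1.0 - rss / tss

print(f"a_hat = {coef[0]:.2f}, b_hat = {coef[1]:.2f}")
print(f"TSS = {tss:.2f}, ESS + RSS = {ess + rss:.2f}")
print(f"R2 = {r2_from_ess:.4f} (ESS/TSS), {r2_from_rss:.4f} (1 - RSS/TSS)")
```

Both ways of computing R² agree because the model includes an intercept; without one, the identity TSS = ESS + RSS need not hold.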
There is no universal standard for a good R²; it depends on the application. We will see some very successful applications where the R² is relatively low and unsuccessful applications with high R²s.

The standard error of the estimate is a measure of how close the points lie to the regression line:

\[ s_e = \sqrt{\frac{\sum_i e_i^2}{n - k}} \]

where n is the number of observations and k is the number of parameters. Usually about 2/3 of the observed Ys will lie within ± one \(s_e\) of the regression line and about 95% of the points will lie within ± two \(s_e\) of the line.

One should examine the t-statistic and p-value for each regression coefficient (weight) to see if it is statistically significant. P-values should be less than .1, and many people believe they should be less than .05. These cutoffs correspond to t-statistics of approximately 1.8 and 2.0 in absolute value (the exact values depend on the degrees of freedom).

The medical advertising data (MedAdv.xls) record the response to a series of advertisements. In the weight loss advertising campaign, between 0 and 4 ads were run every month for a year, and the number of calls each month inquiring about a weight loss program was recorded. We can run a regression to estimate the relationship between the number of ads run in a given month and the expected number of calls. The dependent variable is the number of calls each month and the independent variable is the number of ads.
Here are the data for the first four months:

Weight Loss    Ads    Calls
January          3      113
February         2       98
March            3      147
April            3      115

This resulted in the following regression equation and output:

\[ \text{Calls}_i = a + b \cdot \text{ads}_i + e_i \]

SUMMARY OUTPUT - Weight Loss Advertising

Regression Statistics
Multiple R            0.95
R Square              0.91
Adjusted R Square     0.90
Standard Error       15.32
Observations            12

ANOVA
                          df          SS          MS           F   Significance F
Regression (Explained)     1    23778.17    23778.17    101.2663          1.5E-06
Residual                  10     2348.08      234.81
Total                     11    26126.25

             Coefficients   Standard Error   t Stat   P-value
Intercept           12.63             6.30     2.01      0.07
Ads                 36.10             3.59    10.06      0.00

The high R² = .91 indicates that most of the data points lie close to the regression line. The standard error of the estimate is 15.32. That says about 2/3 of the observations will be within 15 calls of what is predicted by the line. The Analysis of Variance (ANOVA) divides the total sum of squares TSS (26126.25) into the explained sum of squares ESS (23778.17) and the residual sum of squares RSS (2348.08). Note that ESS + RSS = TSS and that R² = ESS/TSS = 23,778.17/26,126.25 = 1 - (RSS/TSS) = 1 - (2,348.08/26,126.25) = .91.

The intercept of 12.6 has a p-value of .07 < .1, so it is significantly different from zero at the .10 level (though not at the .05 level). It says we should expect 12.6 calls a month if there were no advertising. The slope of 36.10 (also statistically significant, p-value < 0.01) indicates that on average each ad generates 36.10 more calls. The sign of the coefficient is positive, indicating that more ads are associated with more calls, which is what one would expect. We could examine either a plot of the data or the residuals, i.e., the \(e_i\)s, to see that a linear model does a good job of representing the data.

Linear regression models are popular because they are easy to estimate and are robust. That means that even when assumptions are violated, regression typically works pretty well. A linear model is also often a good approximation of the phenomenon within a limited range of the data.

Forecasting with Linear Response Models.
We can use the estimated regression coefficients to forecast the number of calls that would be generated from a given number of ads per month by "plugging" the expected number of ads into the regression equation.

Forecast calls = 12.63 + 36.10 (number of ads)

With zero ads, we expect to attract 12.63 calls; with one ad we expect 48.73 calls (12.63 because of the intercept and 36.1 because of the ad); and each extra ad would generate 36.1 incremental calls.

Profit Models. In addition to forecasting sales we can also forecast profits. We will typically use the following profit model:

Profit = Unit Sales x Margin - Fixed Cost

Assume there is a linear response model for sales, y, as a function of advertising, x, which is a fixed cost: \(\hat{y} = \hat{a} + \hat{b}x\). Writing \(FC_u\) for the cost of one unit of x (for example, the cost of one ad), profits are:

\[ \text{Profit} = \hat{y} \cdot \text{margin} - FC_u \, x = (\hat{a} + \hat{b}x) \cdot \text{margin} - FC_u \, x = \hat{a} \cdot \text{margin} + (\hat{b} \cdot \text{margin} - FC_u) \, x \]

Continuing with the weight loss example, if each call generates $59 in contribution and each ad costs $1300, our forecast profit is

Forecast profit = $59 * Forecast calls - $1300 * number of ads

Each additional ad would generate 36.1 calls and $2130 (= $59 * 36.1) in contribution. It costs $1300. This campaign therefore generates an incremental profit of $830 per ad. Because profit is linear in the number of ads, with a positive slope of $830 per ad, the model taken literally says we should place an infinite number of ads. With a linear response model, the optimal action is always going to be either to spend nothing or to spend an infinite amount on advertising. So we should not take the recommendations literally, but use them to say what to do directionally rather than exactly, i.e., run a few more or a few fewer ads.

Judgmental Calibration. You are more familiar with statistical estimation or calibration; however, it is also possible to calibrate models judgmentally. While this lacks the objectivity of statistical estimation, it has the benefit of incorporating the decision maker's beliefs into the model, which increases the likelihood that the model will be used.
We will use judgmental estimates with both multiplicative and ADBUDG models later in the course.

A linear model is a two-parameter model (a slope and an intercept) when there is one independent variable. Therefore we need two judgments to calibrate this model. One way is to ask (1) what are the current levels of the independent variable (e.g., dollars spent on advertising or number of advertisements placed) and the dependent variable (sales, market share, or calls) and (2) how much will the dependent variable change with a one-unit increase in the independent variable. Looking at the first month of the weight loss data, one would say the current number of ads was 3 and the current number of calls was 113. If we expect to get 36 more calls per advertisement, we can solve for the intercept:

Calls = a + 36 (Ads) => 113 = a + 36 * 3, or a = 113 - 108 = 5.

If there are two independent variables, we would need to ask for the current levels of both independent variables and the single dependent variable, as well as how much the dependent variable would change with a one-unit increase in each of the independent variables.

Multinomial Logit (MNL) Models

Linear regression is an appropriate methodology when the dependent variable is continuous. However, when the dependent variable is a zero-one variable (for example, brand choice, where the value is one if that brand was chosen and zero if it was not) or is constrained to be between zero and one (for example, market share must be between 0 and 100%), it may be better to use a logit model to estimate the relationship between the independent and dependent variables. For example, in choice-based conjoint analysis, we use a logit model to estimate the relationship between product characteristics and the probability that a person would choose a product with those characteristics. Alternatively, we could estimate how merchandising characteristics, like price promotions, advertising, displays, etc., influence market share or the probability that an individual would choose a certain brand. In this case, the dependent variable is going to be a zero-one variable, e.g., one if the person chose an alternative and zero if s/he did not choose it. The independent variables are variables that might influence choice, such as product characteristics or merchandising characteristics.

The output of a logit model is an estimated probability of choosing each alternative. The formula looks more complicated than a linear regression, but is quite similar.

\[ P_i = \frac{A_i}{\sum_{k=1}^{K} A_k} = \frac{\exp\left(\sum_j \beta_j x_{ij}\right)}{\sum_{k=1}^{K} \exp\left(\sum_j \beta_j x_{kj}\right)} \]

where K is the number of brands in the choice set and

\[ A_i = \exp\left(\sum_j \hat{\beta}_j X_{ij}\right) = \exp\left(\hat{\beta}_1 X_{i1} + \hat{\beta}_2 X_{i2} + \dots + \hat{\beta}_J X_{iJ}\right) \]

\(P_i\) is the probability of choosing the ith alternative, \(X_{ij}\) is the amount of the jth attribute contained in the ith alternative, and \(\hat{\beta}_j\) is the importance of the jth attribute.

Choice-based conjoint is similar to a ratings-based conjoint model; both are used to understand why people choose or prefer certain alternatives. We assume people choose or like things because of the benefits they offer or the characteristics they possess: preference is "caused" by product characteristics. We typically assume a linear function of the characteristics, so that overall preference is a weighted sum of the attribute levels the product possesses. In either case, the product descriptions are the independent variables and the measure of preference is the dependent variable.

When the dependent variable, e.g., preference, is measured on a continuous scale (1 to 10), we use regression to estimate the importance of each product attribute in preference judgments about the brands.

\[ \text{Pref}_i = \sum_j \hat{B}_j X_{ij} + e_i \]

Once we have estimated the importance weights, we can predict the preference for, or likelihood of purchasing, any competitive product by substituting its perceptions into the equation.
Also, we can estimate the impact of changing the perception of a given attribute; this may be a change in the physical product or just a new message. In Excel, we have done this with the sumproduct function of the regression weights and the independent variables.

\[ \text{Pref}_{new} = \sum_j \hat{B}_j X_{(new)j} \]

In ratings-based conjoint, we typically assume the alternative with the highest predicted preference is the one that is chosen. The multinomial logit model instead assumes that the probability of choosing the ith alternative is equal to:

\[ P_i = \frac{\exp\left(\sum_j \beta_j x_{ij}\right)}{\sum_k \exp\left(\sum_j \beta_j x_{kj}\right)} \]

There are several differences between logit and regression models. The dependent variable in a logit model is choice (a 0-1 variable) rather than preference rated on a continuous (1 to 10) scale. These "choices" can be either stated choices (what the person says s/he would choose) or revealed choices (what they actually chose). The independent variables are similar to those in ratings-based conjoint. In ratings-based conjoint, if a product with a particular attribute level (low price or high gas mileage) is typically preferred (receives a high preference rating), the regression weight associated with that level will be large and positive. In a logit model, if a product with a particular attribute level is consistently chosen, its weight will also be large and positive. So we interpret the parameters the same way.

The estimated probability of choice is the exponentiated utility of that object over the sum of the exponentiated utilities of all of the objects. Because the formula includes all of the competing alternatives, logit models allow us to capture competitive effects. The model can represent either market shares at the aggregate level or choice probabilities at the individual level. Its form may look complicated, but it has two very nice properties. First, \(A_i\) is always positive, because it is an exponentiation. Second, the predicted choice probabilities (or market shares) are all between zero and one and they sum to one.
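The two properties just described can be seen in a small computation. The following Python sketch is only an illustration of the logit formula; the function name `logit_probabilities`, the attribute levels, and the weights are all hypothetical and are not taken from the course software or data.

```python
import numpy as np

def logit_probabilities(X, beta):
    """MNL choice probabilities for one choice set.

    X    : array of shape (K, J), attribute levels of K alternatives
    beta : array of shape (J,), importance weights
    """
    v = X @ beta                 # V_i = sum_j beta_j * x_ij
    # Subtracting the largest V_i leaves the ratios unchanged and
    # keeps exp() from overflowing when utilities are large.
    a = np.exp(v - v.max())      # proportional to A_i = exp(V_i)
    return a / a.sum()           # P_i = A_i / sum_k A_k

# Hypothetical choice set: three alternatives described by
# price and a 0/1 display indicator (values made up for illustration).
X = np.array([[3.0, 1.0],
              [2.5, 0.0],
              [4.0, 0.0]])
beta = np.array([-1.2, 0.8])     # hypothetical weights

p = logit_probabilities(X, beta)
print(p, p.sum())
```

Because every alternative appears in the denominator, improving one alternative's utility necessarily lowers the predicted probabilities of the others, which is how the model captures competitive effects.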
The models are usually estimated by a procedure called maximum likelihood, which finds the parameters that maximize the probability of the observed outcomes. If a person makes a series of choices, the procedure finds parameters that make the estimated probabilities of the chosen alternatives as close to one as possible and the estimated probabilities of the non-chosen alternatives as close to zero as possible. The parameters are estimated through a search procedure like Solver, but our logit software does everything automatically.

The MNL model is similar to a preference regression model in that (1) we are trying to estimate importance weights for product attributes and (2) the independent variables are the product attributes or characteristics. It differs from the preference regression model in that (1) the dependent variable is choice, or probability of choice, instead of preference or liking, and (2) all of the alternatives in a given choice set are considered to be part of one observation, instead of each brand constituting a separate observation. The software accomplishes this by asking for the number of alternatives (per case). Regression minimizes the sum of the squared errors; the MNL maximizes a likelihood function.

Example of Modeling Transportation Choice. In a simple example we want to determine the probability that a person will choose a car or mass transit. In the following table there are two rows for each person, one for each alternative: Auto or Mass. The dependent variable is the chosen mode of transportation; it is a 0-1 variable. The first two people chose mass transit and the third person chose auto. We could have asked people to choose a transportation mode or we could have observed what they actually chose. The independent variables are travel time and a dummy variable (called a brand-specific or alternative-specific constant) for the first alternative, auto. The last alternative, mass transit, is the reference level.
The dummy variable is automatically supplied by the program, and the coefficient associated with it is the difference in utility between taking auto and taking mass transit when travel time is held constant: some people would rather drive if travel times are similar and others would rather take mass transportation.

Choice data

Observation   Alternative   Choice   Time   Auto
1             Auto               0   52.9      1
1             Mass               1    4.4      0
2             Auto               0    4.1      1
2             Mass               1   28.5      0
3             Auto               1    4.1      1
3             Mass               0   86.9      0

The output of the model looks like the following for these data:

Coefficient estimates

Variable   Coefficient estimate   Standard error   t-statistic
Time                      -0.05             0.02         -2.57
Auto                      -0.24             0.75         -0.32

The negative coefficient associated with time says that people will tend to choose the faster transportation mode, i.e., the mode with the smaller travel time. The coefficient associated with auto is also negative, but insignificant. The insignificance says that, at equal travel times, people are indifferent between auto and mass transit.

Forecasting with Logit Models. Just as with regression, we use the estimated coefficients and the logit formula to forecast the probability that a given person would choose either auto or mass transit. The logit formula is:

\[ P_i = \frac{\exp\left(\sum_j \beta_j x_{ij}\right)}{\sum_k \exp\left(\sum_j \beta_j x_{kj}\right)} \]

Forecasting requires three steps:

1. Calculate \(V_i = \sum_j \beta_j x_{ij}\) for each of the alternatives.
2. Exponentiate these calculated \(V_i\)s: \(A_i = \exp(V_i) = \exp\left(\sum_j \beta_j x_{ij}\right)\).
3. Plug them into the logit formula.

As for the first person in this data set, assume that auto travel time is 52.9 minutes and mass transit travel time is 4.4 minutes.

V1auto = -.053 * 52.9 - .24 * 1 = -3.04
V1mass = -.053 * 4.4 - .24 * 0 = -.23

Exp(V1auto) = exp(-3.04) = .0478
Exp(V1mass) = exp(-.23) = .795
Exp(V1auto) + Exp(V1mass) = .0478 + .795 = .842

P1auto = .0478 / .842 = .057
P1mass = .795 / .842 = .944

As soon as we know a person's auto and mass transit travel times, we can forecast his/her choice probabilities in a similar manner.

Example of Modeling Detergent Choice.
In this example, we will use a logit model to predict the probability of choosing each of several brands of laundry detergent. We want to see how effective various marketing mix elements are. Specifically, we measure the effect of price, price discount, whether the brand was on an end-of-aisle display, and whether it was featured in an ad that week. The model also includes brand dummy variables for the first three brands, to capture perceived product quality or overall image, and a loyalty variable to capture past purchases. We can use the model to see how many people would purchase due to an end-of-aisle display and to judge whether that is worth the cost of paying a store for the display.

In this example, 8 people each made 10 purchases of laundry detergent (one each month) among four brands: Wisk, All, Tide, and Yes. The first column is the consumer number. The second column is the purchase number (again, one for each month) for that customer (1 through 10). For each purchase there are four brands listed in the third column. The fourth column is the 0-1 variable showing which brand was chosen. After the loyalty column (which I will not cover in detail), there are four merchandising variables: List Price, Price Discount, Display, and Feature Ad. These are the marketing mix elements under control of the retailer. Finally, there is a dummy variable for each of the first three brands. Yes is the reference level.

Choice data

Consumer   Month   Brand   Choice   Loyalty   List Price   Discount   Display   Feature   Wisk   All   Tide
1          1       Wisk         0      0.25         3.25       0.63         0         0      1     0      0
1          1       All          0      0.25         3.10       0.71         0         0      0     1      0
1          1       Tide         1      0.25         3.30       0.82         1         1      0     0      1
1          1       Yes          0      0.25         2.95       0.86         0         0      0     0      0
1          2       Wisk         0      0.2          3.25       0.63         0         0      1     0      0
1          2       All          0      0.2          3.10       0.71         0         0      0     1      0
1          2       Tide         0      0.4          3.67       0.82         0         0      0     0      1
1          2       Yes          1      0.2          2.95       0.60         0         1      0     0      0

When we estimate this model, we get the set of coefficients tabulated after this discussion. The coefficient for list price is negative. That says people are less likely to buy products at a higher price.
The price discount coefficient is positive, which says people are more likely to purchase when there is a bigger discount. Notice that people are more sensitive to the amount of the discount than to the list price. People are also more likely to buy when a product is displayed or when it is in a feature ad. The sizes of these two merchandising coefficients are approximately equal. The positive coefficient for Tide says that people are significantly more likely to buy Tide than the reference brand Yes, but there is no significant difference between the probabilities of choosing any of the other three brands.

Coefficient estimates

Variable      Coefficient estimate   Standard error   t-statistic
Loyalty                       1.78             1.22          1.46
List Price                   -3.54             1.10         -3.22
Discount                     10.58             2.01          5.25
Display                       1.18             0.43          2.72
Feature                       1.25             0.45          2.77
Wisk                          0.37             0.59          0.63
All                           0.67             0.56          1.20
Tide                          1.99             0.69          2.90

Linearizable Response Functions: Multiplicative Model

Decreasing and Increasing Returns Response Functions. If either theory or an inspection of the data suggests a nonlinear relationship between one or more of the independent variables and the dependent variable, you should consider a nonlinear model. The first alternative is a nonlinear model that can be linearized through a simple transformation. This allows you to do the estimation with linear regression, which is typically easier than nonlinear least squares estimation and may be more robust; in addition, the regression module in Excel provides a number of useful diagnostics like R² and t-statistics. By far the most widely used linearizable model is the multiplicative model. It is the only one we will cover in class.

Multiplicative Model. The multiplicative model is commonly used to represent either an increasing or a decreasing returns function.
This model is popular because it is a constant elasticity model, i.e., it models the response in such a way that a given percent change in an independent variable always produces the same percentage change in the dependent variable. An increasing returns model (see P5 on Exhibit 2, page 4 of the Response Models Technical Note) might occur if there are network effects or positive feedback. A decreasing returns model (see P3 on Exhibit 2, page 4 of the Response Models Technical Note) might occur if the impact of a repeated advertisement declines over time.

The two-variable multiplicative model is written as:

\[ Y = a X_1^{b_1} X_2^{b_2}, \qquad X_i > 0, \quad i = 1, 2 \]

It is called multiplicative because the Xs are multiplied together rather than added. If the bs are between zero and one, it is a decreasing returns model, and if they are greater than one, it is an increasing returns model. We estimate the parameters by taking logarithms of the above equation:

\[ \ln(Y) = \ln\left(a X_1^{b_1} X_2^{b_2}\right) = \ln(a) + \ln\left(X_1^{b_1}\right) + \ln\left(X_2^{b_2}\right) = \ln(a) + b_1 \ln(X_1) + b_2 \ln(X_2) \]

To estimate the model, take the logarithms of Y, X1, and X2 and then regress Ln Y on Ln X1 and Ln X2:

\[ \ln(Y_t) = \hat{\alpha} + \hat{\beta}_1 \ln(X_{1t}) + \hat{\beta}_2 \ln(X_{2t}) + e_t \]

where \(\hat{\alpha} = \ln(a)\), so that \(\exp(\hat{\alpha}) = a\), and \(\hat{\beta}_i = b_i\), i = 1, 2.

Once we have estimated the parameters, we can use the following equation to forecast Y:

\[ \hat{Y} = e^{\hat{\alpha}} X_1^{\hat{\beta}_1} X_2^{\hat{\beta}_2} = a X_1^{b_1} X_2^{b_2} \]

Judgmental Calibration. This is also a two-parameter model when there is one independent variable, so we must ask at least two questions to judgmentally estimate the parameters. First, what is the current level of the independent variable (e.g., dollars spent on advertising or price) and the current level of the dependent variable (sales or market share)? Second, what will be the percent change in the dependent variable with a one percent increase in the independent variable?
If current sales are assumed to be 50 units when the price is $130, and sales will increase by 1.5% with every one percent decrease in price, we would have the following model:

\[ \text{Sales} = a \, (\text{Price})^{-b} \;\Rightarrow\; 50 = a \, (130)^{-1.5} = a \, (0.000675), \quad \text{or} \quad a = 50 / 0.000675 \approx 74{,}111 \]

If there are two independent variables, we would need to ask for the current levels of both independent variables and the single dependent variable, as well as the expected percentage change in the dependent variable with a one percent change in each of the independent variables.

Forecasting with the Multiplicative Model. The process is the same as with the linear model: "plug" the forecast values of the independent variables into the equation and solve for the value of the dependent variable. For example, in the sales example above, sales when the price is dropped to $120 are calculated as follows:

\[ \text{Sales}(120) = a \, (\text{Price})^{-b} = 74{,}111 \times (120)^{-1.5} = 56.38 \text{ units} \]

Profit Models. We can build profit models just like we did with the linear model. Again, the general profit function is:

Profit = Unit Sales x Margin - Fixed Cost

In the above example, the marketing variable is price, which is not a fixed cost but enters into the margin. If the unit cost is $50, then profit at a price of $120 (in dollars) is

\[ \text{Profit} = a \, (\text{Price})^{-b} \times \text{Margin} = 74{,}111 \times (120)^{-1.5} \times (120 - 50) = 56.38 \times 70 = 3{,}946.60 \]

Measuring the Impact of Price and Display on Sales

The cheese data contain weekly unit volume, price, and a measure of display activity for several key accounts (an account is a city-retailer combination) for approximately 65 weeks. These data are for a sliced cheese product manufactured by Borden. The measure of display activity is the percent of ACV (all category volume) on display. Later we will look at some soft drink data that have the same information for both a focal brand and a competitor. The models may look complicated, so we will build them in steps.
First is a simple model where sales are a function of just price:

\[ S_t = a_0 P_t^{\beta_1} \]

where \(S_t\) is the unit volume at time t and \(P_t\) is the price at time t. In this equation, \(a_0\) is the intercept: the value of \(S_t\) when \(P_t = 1\). It adjusts for the size of the market; it is the size of the market when all independent variables equal one. \(\beta_1\), the slope associated with \(P_t\), is the price elasticity: the percent change in volume for a 1% change in price. To estimate the model, we take natural logarithms of each side of the equation to get an equivalent model:

\[ \ln S_t = \ln a_0 + \beta_1 \ln P_t \]

We can estimate this model with regression, where ln(St) is the dependent variable and ln(Pt) is the independent variable:

\[ \ln(S_t) = \hat{\beta}_0 + \hat{\beta}_1 \ln(P_t) + e_t \]

where \(\hat{\beta}_0 = \ln a_0\), or \(\exp(\hat{\beta}_0) = a_0\), and \(\hat{\beta}_1\) estimates \(\beta_1\).

Once we have estimated the parameters, the estimated sales volume is:

\[ \hat{S}_t = \exp(\hat{\beta}_0) \, P_t^{\hat{\beta}_1} \]

Next, assume that display activity affects volume only and not price sensitivity. This results in the following model:

\[ S_t = a_0 \, a_1^{D_t} \, P_t^{\beta_2} \]

where \(D_t\) is the percent of ACV on display and \(a_1\) is a multiplier for display (deal) periods. If there is no display activity (i.e., \(D_t = 0\)), the impact of \(a_1\) is a multiplication by 1; if ACV display is 1, the sales volume is increased by a factor of \(a_1\). It is a measure of the percentage change in volume when there is a display. This model can be written equivalently in terms of logarithms as:

\[ \ln S_t = \ln a_0 + \ln(a_1) D_t + \beta_2 \ln P_t \]

We can estimate this model with regression, where ln(St) is the dependent variable and Dt and ln(Pt) are the independent variables:

\[ \ln(S_t) = \hat{\beta}_0 + \hat{\beta}_1 D_t + \hat{\beta}_2 \ln(P_t) + e_t \]

where \(\hat{\beta}_i = \ln a_i\), or \(\exp(\hat{\beta}_i) = a_i\), for i = 0, 1, and \(\hat{\beta}_2\) estimates \(\beta_2\).

Once we have estimated the regression coefficients, we can forecast sales with the following equation by plugging in the price and display activity:

\[ \hat{S}_t = \exp(\hat{\beta}_0) \, \exp(\hat{\beta}_1)^{D_t} \, P_t^{\hat{\beta}_2} \]

Next, we can complicate the model even further by assuming that a display impacts not only volume, but also price sensitivity.
\[ S_t = a_0 \, a_1^{D_t} \, P_t^{\beta_2} \, P_t^{D_t \beta_3} = a_0 \, a_1^{D_t} \, P_t^{\beta_2 + D_t \beta_3} \]

Here \(\beta_3\) is the change in price sensitivity due to display activity. As before, \(a_1\) measures the percentage change in volume due to display activity. Again, we can write this as an equivalent model by taking logarithms of both sides:

\[ \ln S_t = \ln a_0 + \ln(a_1) D_t + \beta_2 \ln P_t + \beta_3 D_t \ln P_t \]

We can estimate that model with regression, where ln(St) is the dependent variable and Dt, ln(Pt), and Dt*ln(Pt) are the independent variables:

\[ \ln(S_t) = \hat{\beta}_0 + \hat{\beta}_1 D_t + \hat{\beta}_2 \ln(P_t) + \hat{\beta}_3 D_t \ln(P_t) + e_t \]

where \(\hat{\beta}_i = \ln a_i\), or \(\exp(\hat{\beta}_i) = a_i\), for i = 0, 1, and \(\hat{\beta}_j\) estimates \(\beta_j\) for j = 2, 3.

Once we have estimated the regression coefficients, we can forecast sales with the following equation for any level of price and display activity:

\[ \hat{S}_t = \exp(\hat{\beta}_0) \, \exp(\hat{\beta}_1)^{D_t} \, P_t^{\hat{\beta}_2} \, P_t^{D_t \hat{\beta}_3} = \exp(\hat{\beta}_0) \, \exp(\hat{\beta}_1)^{D_t} \, P_t^{\hat{\beta}_2 + D_t \hat{\beta}_3} \]

Finally, we look at two brands, i and j, where we will call brand i our own brand and brand j the other brand. Furthermore, we will model the effects of price and display activity both for our own brand and for the other brand. Sales of brand i are a function of its own pricing and display activity as well as the pricing and display activity of the other brand, brand j.

\[ S_{it} = a_0 \, a_1^{D_{it}} \, P_{it}^{\beta_2} \, P_{it}^{D_{it} \beta_3} \, a_4^{D_{jt}} \, P_{jt}^{\beta_5} \, P_{jt}^{D_{jt} \beta_6} \]

We expect that \(\beta_2\) and \(\beta_3\) will be negative: as the brand's own price increases, its sales will decrease. On the other hand, if the two brands are competing, we expect \(\beta_5\) and \(\beta_6\) to be positive: as the price of the other brand increases, we expect sales of our own brand to increase. Similarly, we expect the display coefficients \(\ln(a_1)\) and \(\ln(a_4)\) to be of opposite signs: display activity of our own brand should increase own-brand sales, and display activity of the other brand should decrease own-brand sales.
We can write this model as an equivalent model by taking logarithms, estimate it using regression, and make forecasts once we have estimated the parameters.

Equivalent model:

\[ \ln S_{it} = \ln a_0 + \ln(a_1) D_{it} + \beta_2 \ln P_{it} + \beta_3 D_{it} \ln P_{it} + \ln(a_4) D_{jt} + \beta_5 \ln P_{jt} + \beta_6 D_{jt} \ln P_{jt} \]

Regression model for estimation:

\[ \ln(S_{it}) = \hat{\beta}_0 + \hat{\beta}_1 D_{it} + \hat{\beta}_2 \ln(P_{it}) + \hat{\beta}_3 D_{it} \ln(P_{it}) + \hat{\beta}_4 D_{jt} + \hat{\beta}_5 \ln(P_{jt}) + \hat{\beta}_6 D_{jt} \ln(P_{jt}) + e_t \]

where \(\hat{\beta}_i = \ln a_i\), or \(\exp(\hat{\beta}_i) = a_i\), for i = 0, 1, 4, and \(\hat{\beta}_j\) estimates \(\beta_j\) for j = 2, 3, 5, 6.

Forecasting equation:

\[ \hat{S}_{it} = \exp(\hat{\beta}_0) \, \exp(\hat{\beta}_1)^{D_{it}} \, P_{it}^{\hat{\beta}_2} \, P_{it}^{D_{it} \hat{\beta}_3} \, \exp(\hat{\beta}_4)^{D_{jt}} \, P_{jt}^{\hat{\beta}_5} \, P_{jt}^{D_{jt} \hat{\beta}_6} = \exp(\hat{\beta}_0) \, \exp(\hat{\beta}_1)^{D_{it}} \, P_{it}^{\hat{\beta}_2 + D_{it} \hat{\beta}_3} \, \exp(\hat{\beta}_4)^{D_{jt}} \, P_{jt}^{\hat{\beta}_5 + D_{jt} \hat{\beta}_6} \]

Other Linearizable Models

We will probably not use these models, but they are very similar to the multiplicative model, so they are briefly mentioned.

Exponential Model. Rather than taking logs of both X and Y, one can take logs of only one or the other. The exponential model has the following form and can model either increasing or decreasing returns:

\[ Y = a e^{bX} \]

If we take logs of both sides of this, we have

\[ \ln Y_i = \ln a + b X_i + e_i = \alpha + b X_i + e_i \]

where \(\alpha = \ln a\), or \(a = e^{\alpha}\). Therefore, if we take the logarithm of Y, but not X, we are estimating this exponential model. This is one of the curve fitting options in Excel Chart.

Semi-logarithmic Model. It is also possible to take logs of just one or more of the X variables, i.e.,

\[ Y_i = a + b_1 \ln X_{1i} + b_2 X_{2i} + e_i \]

Typically, we might choose this model when we expect one of the independent variables to display a nonlinear relationship to Y. This might occur if X1 is a size variable, like number of employees. There may be large differences between small and medium sized companies, but smaller differences between large and very large companies.

Example. The spreadsheet NonlinearAdvSales.xls provides an example of nonlinear modeling. It happens to have been done within the chart option of Excel rather than regression; however, the appropriate columns allow you to run regressions with multiplicative, exponential, and semi-logarithmic models.
Look at the R²s and the plots of the residuals to choose the most appropriate model.

Estimating Nonlinear Models with Solver

It is also possible to estimate response models with Excel's Solver add-in (read the Excel Solver Technical Note in the WebCT Technical Notes folder). Solver searches for values of cells, or parameters, that maximize or minimize another cell, which is a function of the parameters. When estimating a response model, we will be searching for parameters (like regression weights) that minimize the sum of squared errors between the predicted and actual dependent variable. We can also use Solver to find values of marketing mix elements that maximize profits.

The spreadsheet NonlinearLeastSquares.xls contains two examples of estimating response models using Solver. The first is the linear regression dealing with the weight loss problem we saw earlier (see IntroReg sheet, Sheet1 and Chart1). The second deals with the ADBUDG model (see IntroADBUDG sheet, Sheet2 and Chart2). In either case, the steps are the same and are given in the two Intro spreadsheets. The following description is for the weight loss advertising example.

1. Select locations for the parameters you want to estimate and put in initial guesses. Select cells that are contiguous: A3 and B3.
2. Place the independent variable (in this case number of advertisements) in a column (in this case column B).
3. Place the dependent variable (calls) in a column (C). Calculate the mean of that column.
4. Create a column that uses the parameters to estimate the dependent variable (D).
5. Create a column (E) that is the squared difference between the dependent variable and the predicted dependent variable, (C - D)². Sum this column. That sum, which is the residual (or error) sum of squares, is the number you want to minimize. Solver should search over different values of the parameters (A3 and B3) to minimize this cell.
6. This is not required to estimate the parameters, but create a column (F) that is the total sum of squares, i.e., the squared difference between the dependent variable (C) and its mean. The purpose of this is to allow you to calculate R².
7. Calculate R².
8. To use Solver with either Excel 2003 or Excel 2007, click on Data then Solver. (If Solver has not been installed in Excel 2003, click on Tools then Add-ins and click Solver. If Solver has not been installed in Excel 2007, click the Microsoft Office Button, Excel Options, Add-Ins, Manage Excel Add-ins, Go.) After the Solver dialog box comes up, select the cell to be minimized (the sum of the squared error column) and click minimize. Select the cells to be searched over (A3:B3). Add any appropriate restrictions (none are needed in this case). Click Solve.

ADBUDG

ADBUDG is a flexible model that was developed for judgmental data. It can represent either an s-shaped model, where increasing returns occur up to a point and decreasing returns after that, or a concave model, which always has decreasing returns. The s-shaped model is appropriate for the situation where there is little response until we spend more than a certain amount, and then sales increase rapidly for a period, but at some point advertising becomes increasingly less effective. ADBUDG has four parameters:

\[ Y(x) = b + (a - b) \frac{x^c}{x^c + d} \]

b is the minimum value of Y: "what will sales be if you do not do any advertising or promotion?"
a is the maximum value of Y: "what will sales be if you spend an infinite amount on advertising?"
c controls the shape of the curve; the curve is concave if 0 < c < 1 and s-shaped if c > 1.
d works with c to control how quickly the curve rises.

Statistical Estimation using Solver. This is based on the model in Sheet2 of the NonLinearLeastSquares.xls spreadsheet.

1. There are four parameters, a through d. They are placed in A6:D6.
Initial values are selected: b is set at a minimum value, a is the maximum value, c is set at 2 (I always do that), and d was set at 20 (the reason for that choice is hard to explain).
2. The independent variable (marketing effort) is placed in column A.
3. The dependent variable (sales) is in column B. The mean of the column is at the bottom.
4. Forecast sales (Yhat) is in C. Check out how this was calculated: I just plugged in the ADBUDG function using the parameter cells.
5. Calculate a column of squared errors, (C – B)^2. The sum is at the bottom. Create a column of TSS by taking the squared difference between the dependent variable and its mean. Sum this column.
6. Estimate R2.
7. To run Solver, click Tools, then Solver. We want to minimize the sum of the squared errors by searching over the parameters (A6:D6). Here we should put some constraints on the parameters: A6:D6 > 0 and B6 < A6.

Judgmental parameter estimation. There are four parameters and they can be uniquely determined with four estimates. Usually these estimates are in terms of changes from the current situation. By what percent would sales grow (shrink) if you used a saturation level of (did no) advertising? By what percent would sales increase if you spent 50% more on advertising? We assume sales would remain constant if your level of advertising remained constant.

b = y(0), i.e., the percent of current sales that would be retained if advertising were cut to zero
a = y(∞), i.e., the percent of current sales that would be reached if the advertising level were infinite
y(1) is the sales at the current level
y(1.5) is the percent of current sales you would sell if you spent 50% more on advertising.

Because 1^c = 1, we can solve for d with the following formula:
$$Y(1) = b + (a - b)\frac{1}{1 + d}$$
Going through some algebra, we can see d is equal to the following:
$$d = \frac{a - y(1)}{y(1) - b} = \frac{y(\infty) - y(1)}{y(1) - y(0)}; \quad \text{when } y(1) = 1,\ d = \frac{y(\infty) - 1}{1 - y(0)}.$$
Assuming that the person also provided an estimate of y(1.5), we can solve for c. After more algebra:
$$c = \frac{\ln\left[d\,\dfrac{y(1.5) - b}{a - y(1.5)}\right]}{\ln(1.5)}$$
For example, assume a manager judged that sales would drop to 60% of current without any advertising, rise to 2X current sales with saturation advertising, and rise to 1.2X current sales with 1.5X as much advertising. This would generate the following parameters:
a = 2.0
b = .6
$$d = \frac{a - y(1)}{y(1) - b} = \frac{2 - 1}{1 - .6} = \frac{1}{.4} = 2.5$$
$$c = \frac{\ln\left[2.5 \times \dfrac{1.2 - .6}{2 - 1.2}\right]}{\ln(1.5)} = \frac{\ln\left[2.5 \times \dfrac{.6}{.8}\right]}{.4055} = 1.55$$

The cases in the book, Conglom, Syntex, and Blue Mountain Coffee, all use a slightly different method. They ask for the estimated change in sales from current sales at the following four levels of marketing effort: 0, 50% of current, 150% of current, and saturation. The method implicitly assumes that the current level of marketing effort is going to result in the current level of sales. The first three parameters, a, b, and d, are estimated in the very same way as above:
b = y(0)
a = y(∞)
y(1) is sales at the current level
$$d = \frac{a - y(1)}{y(1) - b}; \quad \text{as } y(1) = 1,\ d = \frac{a - 1}{1 - b}.$$
For the other two estimates, y(.5) and y(1.5), a nonlinear least squares procedure is used to estimate c. The errors in the two estimates are:
$$e(0.5) = y(.5) - \hat y(.5) = y(.5) - \left[b + (a - b)\frac{0.5^c}{0.5^c + d}\right]$$
and
$$e(1.5) = y(1.5) - \hat y(1.5) = y(1.5) - \left[b + (a - b)\frac{1.5^c}{1.5^c + d}\right]$$
We use Solver to search for the value of c that minimizes e(0.5)^2 + e(1.5)^2. Both of these procedures are illustrated in the spreadsheet ADBUDGJudmental.xls.

Forecasting with ADBUDG Models. We do this the very same way we did forecasting with a regression model. The first step is to estimate the four parameters of the ADBUDG model. This can be done either with marketplace data (see the ADBUDG portion of the NonLinearLeastSquares.xls spreadsheet) or judgmental data (see the ADBUDGJudmental.xls spreadsheet).
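As a rough check on the judgmental calibration described above, the following Python sketch solves for d and c from the four judgmental answers and confirms that the fitted curve reproduces them. It uses the values from the worked calculation (a = 2.0, b = .6, y(1.5) = 1.2). The function names adbudg and calibrate are made up for this illustration and are not part of the course spreadsheets.

```python
import math


def adbudg(x, a, b, c, d):
    """ADBUDG response: b + (a - b) * x^c / (x^c + d)."""
    return b + (a - b) * x**c / (x**c + d)


def calibrate(y0, y_inf, y_15, y1=1.0):
    """Solve for (a, b, c, d) from four judgmental answers.

    Effort x is measured relative to current spending (x = 1 is current),
    and sales are measured relative to current sales (y1 = 1 by default).
    """
    b = y0                      # sales with no advertising
    a = y_inf                   # sales with saturation advertising
    d = (a - y1) / (y1 - b)     # follows from Y(1) = b + (a - b) / (1 + d)
    c = math.log(d * (y_15 - b) / (a - y_15)) / math.log(1.5)
    return a, b, c, d


# Values from the worked example in the notes
a, b, c, d = calibrate(y0=0.6, y_inf=2.0, y_15=1.2)
print(round(d, 2), round(c, 2))           # 2.5 1.55

# The calibrated curve should pass through the two interior judgments
print(round(adbudg(1.0, a, b, c, d), 3))  # 1.0
print(round(adbudg(1.5, a, b, c, d), 3))  # 1.2
```

Once the four parameters are in hand, the same adbudg function can be evaluated at any planned level of effort to produce a forecast.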
Continuing with the example from the NonLinearLeastSquares.xls spreadsheet, as with the regression model, we use the estimated coefficients and plug in the expected "marketing effort" to forecast unit sales:
$$\hat Y(x) = b + (a - b)\frac{x^c}{x^c + d} = 30.3 + (56.7 - 30.3)\frac{x^{2.3}}{x^{2.3} + 20.1}$$
With zero marketing effort we would expect 30.3 sales and with an infinite amount of marketing effort, we would expect 56.7 sales.

Profit Models. In addition to forecasting sales we can also forecast profits. The general model is the same as earlier:

Profit = Unit Sales x Margin – Fixed Cost

Continuing with the same example and assuming that the margin is $2 and the cost of a unit of marketing effort, FC_u, is $1.5:
$$\text{Profit}(x) = \hat Y(x) \cdot \text{margin} - FC_u\,x = \left[b + (a - b)\frac{x^c}{x^c + d}\right] \cdot \text{margin} - FC_u\,x = \left[30.3 + (56.7 - 30.3)\frac{x^{2.3}}{x^{2.3} + 20.1}\right] \cdot \$2 - \$1.5\,x$$
The Profit worksheet calculates forecast sales and profits for different levels of marketing effort. This is graphed in Chart3. Cells H13:H17 of that same sheet allow you to use Solver to find the level of marketing effort that maximizes profits.

Clustering for Segmentation and Classification

Rather than assuming that the data can be represented by a line (or hyperplane) as in regression, cluster analysis assumes that the data can be represented by a much smaller set of points in a space. That is, most of the data points are expected to "cluster" around one of a small number of points. So this smaller set of points can adequately represent the data, just as a line can adequately represent the data in a regression.

In most cases, we will be clustering people to form market segments. We can think of each person as a point in an n-dimensional space, where n is the number of variables on which we have data. For example, in a demographic segmentation, we could have variables for age, income, educational level, marital status, and region of the country. If we did an attitudinal segmentation, each person would be represented by their answers to a number of attitudinal questions.
We want to learn if there is some structure to the data, i.e., are people spread out uniformly or are there distinct groups or segments. Examples might be high income professionals, liberals, conservatives, etc.

The Segmentation and Classification program in ME>XL has two options: hierarchical clustering and k-means. Hierarchical clustering is the default method, but I think k-means is more valuable.

The hierarchical clustering program starts with each point in its own cluster. It goes through a series of steps. In each step it combines the two clusters that are most similar; the new cluster is located at the centroid, or average, of the two clusters that have been combined. It continues through this process until there is only one cluster. At each step the two clusters that are joined together are the two that would increase the Error Sum of Squares (ESS) by the least amount; essentially, it combines the two clusters that are closest together. (In regression, RSS, residual sum of squares, is the same thing as error sum of squares.) At each step the ESS is the total error sum of squares that is associated with that number of clusters. A smaller ESS means a better fit. As the data are aggregated into fewer and fewer clusters, the ESS will continue to rise.

The k-means program uses the hierarchical solution as a starting point and "optimizes" that solution by sequentially moving each point to each cluster to see if the fit improves. It reports fit as a ratio of between to within sum of squares. This ratio is not that meaningful, but it can be transformed into a number that is similar to an R2. Because the clusters are described in terms of the means of the points in the cluster, i.e., the centroid, the Total Sum of Squares = Between Sum of Squares + Within Sum of Squares, or TSS = BSS + WSS. This is like regression, in which TSS = ESS + RSS. Here BSS corresponds to ESS, the explained sum of squares, and WSS corresponds to RSS, the residual sum of squares.
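The identity TSS = BSS + WSS can be checked numerically. The sketch below uses a small made-up data set (six points in two clusters, not the PDA data) and computes BSS as the size-weighted squared distance of each centroid from the overall mean, which is the usual textbook definition and an assumption about what the ME>XL program does internally. It also shows that, because TSS = BSS + WSS, the share BSS/TSS can be recovered from the ratio BSS/WSS alone.

```python
import numpy as np


def sums_of_squares(points, labels):
    """Return (TSS, BSS, WSS) for points assigned to clusters by labels."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    grand_mean = points.mean(axis=0)
    tss = ((points - grand_mean) ** 2).sum()
    bss = 0.0
    wss = 0.0
    for k in np.unique(labels):
        members = points[labels == k]
        centroid = members.mean(axis=0)
        # Within: squared distances of members to their own centroid
        wss += ((members - centroid) ** 2).sum()
        # Between: cluster size times squared distance of centroid to grand mean
        bss += len(members) * ((centroid - grand_mean) ** 2).sum()
    return tss, bss, wss


# Hypothetical data: two well-separated groups of three points each
points = [[1, 2], [2, 1], [1, 1], [8, 9], [9, 8], [9, 9]]
labels = [0, 0, 0, 1, 1, 1]

tss, bss, wss = sums_of_squares(points, labels)
ratio = bss / wss

print(np.isclose(tss, bss + wss))       # the decomposition holds
print(bss / tss, ratio / (ratio + 1))   # the same share, computed two ways
```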
The WSS is the sum of the squared distances from all points represented by a cluster to the centroid of that cluster. It measures how well the cluster centroid represents that set of points. It is like Σe^2, the sum of the squared errors in regression, i.e., the sum of the squared distances of the data points from the line. BSS is a weighted sum of the squared distances between each pair of group centroids, where the weight is the number of points represented by each cluster. BSS is bigger when the groups are further apart or are better separated.

In regression R2 = ESS / TSS. We would like a similar statistic, but the k-means clustering program reports only the ratio BSS/WSS. However, we can go through a little algebra to calculate an R2:
$$\text{Variance Accounted For (VAF)} = R^2 = \frac{BSS}{TSS} = \frac{BSS}{BSS + WSS}$$
If we divide the numerator and denominator of the RHS by WSS, we get:
$$\text{VAF} = \frac{BSS}{BSS + WSS} = \frac{BSS/WSS}{BSS/WSS + WSS/WSS} = \frac{\text{ratio}}{\text{ratio} + 1}$$
where ratio = BSS/WSS, which is printed out by the program. Therefore, we can calculate R2 = ratio/(ratio + 1) from the k-means output.

We can use either the R2 or the ESS to help determine the proper number of clusters. In either case, we look for a big improvement in fit up to a certain number of clusters and a small improvement after that; this is called an "elbow." In the PDA data, we had the following Between/Within Sum of Squares and ESSs for differing numbers of clusters:

Clusters   B/W      VAF     ESS
1          -        -       2.84
2          .3609    0.27    1.53
3          .6964    0.41    0.83
4          .9594    0.49    0.67
5          1.211    0.55    0.44
6          1.441    0.59    0.31
7          1.639    0.62    0.31
8          1.793    0.64    0.25

Looking first at ESS, when we go from one to two clusters, the ESS drops by 1.31 (= 2.84 - 1.53). It drops by .70 (= 1.53 - .83) when we go from two to three clusters. It drops by .16 going from three to four clusters, etc. If there is a clear "elbow" we would choose that number of clusters. For example, suppose the ESSs for one to five clusters were the following: 2.84, 1.53, .83, .80, and .75.
We would see that there is a large drop in ESS as we go from one to three clusters, but little is gained after that. With real data we do not usually get this clean of a solution, and we must look at other things such as size and interpretability of the clusters. Similarly, if we saw R2s that increased .27, .41, .49, .55, .59, we might choose the solution with an R2 of .49, or possibly .55, as the gains get smaller after that.

In our data, we see that little is gained after six clusters and quite a bit is gained for the first three or four clusters. This says that the proper number is probably between 3 and 6. We need to look at the size of the clusters and the interpretation of the new clusters to make a determination of the optimal number.

Discriminant Analysis

In addition to clustering, the Segmentation and Classification program also performs a "discrimination." Once clusters have been formed on one set of variables, say attitudinal, the program attempts to see if there are differences among these clusters in terms of another set of variables, say demographic. So it may try to determine if there are demographic differences between liberals and conservatives. This is accomplished through a statistical technique called discriminant analysis.

Discriminant analysis shares some similarities with both cluster analysis and regression. Like regression, it is a statistical technique that determines the best linear relationship between a set of independent variables and a dependent variable.
$$Y_i = a + b_1 X_{1i} + b_2 X_{2i} + b_3 X_{3i} + e_i$$
Regression finds that linear combination, i.e., that set of a and bs, that best explains the variation in a dependent variable. It finds the combination of a, b1, b2, and b3 that minimizes Σe^2 or, equivalently, maximizes R2. The dependent variable is assumed to be interval scaled. Regression assumes that the relationship between the dependent and independent variables can be represented in terms of a straight line (or actually a hyperplane in multiple regression).
In discriminant analysis, the Yi is a categorical variable, i.e., group membership. Categorical variables are just different, e.g., male and female (or the benefit clusters in the PDA case), but there is no order to them. Discriminant analysis finds the linear combination (or linear combinations) that best separates groups or, equivalently, that does the best job of predicting group membership.

For example, if a market is segmented into benefit or needs clusters, we might use discriminant analysis to see if a linear combination of demographic variables can separate these groups, i.e., determine which demographic variables best differentiate, or separate, these segments. Stated differently, we want to see if the clusters differ significantly in terms of demographic variables.

Rather than estimating a line that all points lie close to, it estimates a function such that the scores of all observations in one group are close to each other and far from the scores of the other groups. So, rather than interpreting the data in terms of a straight line, we interpret the data in terms of a single point for each group. Can we adequately represent our data as a set of points? We want a lot of variation (or distance) between groups, called Between group Sum of Squares, BSS, and very little variation within groups, called Within group Sum of Squares, WSS, i.e., we want each point to be close to the centroid of the group to which it belongs. Like regression, which attempts to minimize RSS or maximize ESS, discriminant analysis attempts to maximize a function of the ratio BSS/WSS.

Like clustering, the statistics are based on WSS, within group sum of squares, and BSS, between group sum of squares. The big difference is that in clustering we do not know which cluster each observation is in before we start. We do not even know how many clusters there are.
In discriminant analysis, we know which group each observation is in and we want to find out if there are any differences in a set of independent variables among observations in different groups.

Suppose we wanted to discriminate between a group of males and a group of females. We want to try to predict group membership, a person's gender, i.e., we want to find a function of the independent variables that gives high scores to one gender and low scores to the other. Suppose we measure people in terms of height, weight, shoe size, eye color, grade point, and GMAT. The discriminant function would look like the following:
$$\text{Gender}_i = \alpha + \beta_1 \text{Ht}_i + \beta_2 \text{Wt}_i + \beta_3 \text{SS}_i + \beta_4 \text{EC}_i + \beta_5 \text{GPA}_i + \beta_6 \text{GMAT}_i$$
If male is coded as one and female as zero, we want a set of α and βs that gives scores close to one to men and scores close to zero to women. In this case we might expect that β1, β2, and β3 would be greater than zero, i.e., on average men tend to be physically bigger than women. If the sample consisted of MBA students, we might not expect a significant difference in GPA or GMAT between men and women.

This function tells us which variables differ significantly between men and women. We can use this function (called a discriminant function) to predict the gender of a given person, given knowledge of their height, weight, shoe size, eye color, GPA, and GMAT. Some larger women would get incorrectly classified as men and some smaller men would get incorrectly classified as women. One measure of the quality of a discriminant analysis is the proportion of observations that are correctly classified. This is kind of like an R2, the amount of explained variance.

If there are only two groups, we can model their centroids in terms of two points on a line. That is, the space will be one-dimensional. When we have three groups, we will need to locate them as points in a two-dimensional space unless one of the groups falls on the line between the other two groups.
This means there will be two linear combinations, or discriminant functions, one for each dimension in the space. The first explains as much of the variation as possible, i.e., maximizes BSS/WSS. The second function explains as much of the residual variation as possible, subject to the constraint that it is orthogonal (perpendicular) to the first function.

The number of dimensions (discriminant functions) can be at most one fewer than the number of groups. If we have four groups, we can have at most three discriminant functions. One of the outputs of a discriminant analysis is the amount of variance that is explained by each dimension as well as the cumulative variance explained by all dimensions up to and including the last. This can be interpreted like the fit statistics in clustering. You want to balance a small number of dimensions with as much explanatory power as possible. As in cluster analysis, where the question is whether one should add one more cluster to the solution, the question with discriminant analysis is whether another dimension is needed to adequately represent the groups.

In the following table, taken from the four-cluster needs-based PDA solution, only two discriminant functions are needed to adequately represent the demographic clusters, as they capture about 79% of the variance.

Discriminant   Percent of   Cumulative   Significance
function       variance     percent      level
1              48.49        48.49        .000
2              30.59        79.09        .000
3              20.91        100.00       .015

It would be nice if the discriminant analysis program printed out the actual discriminant functions, which would be like printing out the regression weights. Unfortunately, ours prints out the correlations between the independent variables and the discriminant functions. The correlations are related to the discriminant function weights but are not the same thing. They do show the direction of each weight and which weights are more important in determining each function.
Variable      Func1    Func2
PDA            .708     .132
Income         .669     .086
Bus_Week       .635    -.089
Education      .622    -.011
Professnl      .591     .137
M_Gourmet      .456     .064
PC_Mag         .354    -.024
Field&Stre    -.277     .674
Construct     -.187     .660
Sales         -.045    -.512
Emergency     -.265     .424
Service       -.356    -.328
Age           -.076     .030

Again, these correlations are taken from the same PDA discriminant analysis. This says the first dimension is primarily PDA, Income, Education, Professional, etc., and the second dimension is Field & Stream, Construction, Sales, and Emergency.

Following is a plot of the group centroids from the positioning analysis program. It is similar to discriminant analysis. The first dimension has been reversed, as Professional, PDA, etc. are located on the left side of the space.

The first, horizontal, dimension separates the professionals from the other groups and has all the first discriminant function variables lying on that axis. The second, vertical, dimension is Sales, Construction, Service, and Field & Stream. This shows where the groups fall relative to each other demographically. For example, if a market is segmented into benefit clusters, we might use discriminant analysis to see if a linear combination of demographic variables can separate these groups, i.e., determine which demographic variables best differentiate, or separate, these clusters.

Perceptual Mapping

This section will cover perceptual mapping, or positioning analysis, using factor analysis. Cluster analysis tries to group observations (e.g., people) that are similar into groups. Factor analysis tries to "group" variables that are similar together. If two (or more) variables are highly correlated, then a single variable could do a fairly good job of representing both. Factor analysis replaces a set of correlated variables with a linear combination of them that retains as much of their information (variance) as possible. These linear combinations are called factors.
This allows us to represent a set of objects in a reduced dimensional space. For example, following are a set of 10 cars (from several years ago) that have been rated on seven attributes:

Attribute      BMW      Cavalier  Intrepid  Taurus   Accord   Altima   Saturn   Subaru   Camry    VW Passat
Fuel Econ     -0.413    -0.152    -0.891    -0.543    0.413    0.065    0.587    0.021    0.587    0.326
Reliability    0.573    -1.034    -0.73     -0.73     0.921    0.182   -0.034    0.182    0.834   -0.165
Style          1.43     -1.091    -0.221    -1.004    0.517    0.12    -0.569   -0.134    0.43     0.517
Price         -1.465     0.969    -0.204     0.404   -0.247    0.273    0.969   -0.334   -0.16    -0.204
Fun to Drive   1.704    -1.078     0.139    -1.034    0.182    0.008   -0.6      0.182    0.095    0.4
Safety         0.652    -0.782    -0.217     0        0.217    0       -0.173    0.26     0.217   -0.173
Space         -0.543    -0.673     0.326     0.5      0.108    0.108   -0.282    0.63     0.195   -0.369

These numbers have been scaled so the average rating on each attribute is 0.0. Positive numbers indicate above average ratings and negative numbers represent lower than average ratings.

We can think of these cars as located in a 7-dimensional space. We cannot visualize things in seven dimensions, but we could plot pairs of dimensions, e.g., plot the cars on the dimensions of fuel economy and reliability, then fuel economy and style, etc. The Positioning Analysis program uses factor analysis to derive a smaller number of dimensions that contain as much information from the original variables as possible. Following is a correlation table of the attributes:

              Fuel    Reliability  Style   Price   Fun     Safety  Space
Fuel           1
Reliability    0.59    1.00
Style          0.18    0.77        1.00
Price          0.23   -0.55       -0.87    1.00
Fun           -0.05    0.61        0.95   -0.93    1.00
Safety         0.06    0.78        0.76   -0.81    0.75    1.00
Space         -0.17    0.09       -0.19   -0.06   -0.16    0.29    1.00

Five of these attributes (Reliability, Style, Price, Fun to Drive, and Safety) are highly (positively or negatively) correlated.
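The correlation table can be recomputed directly from the ratings. The following Python sketch enters the seven rows of ratings from the table above and uses NumPy's corrcoef, which treats each row as a variable. Doing this in NumPy rather than Excel is my choice for illustration; the variable names are made up.

```python
import numpy as np

attributes = ["Fuel Econ", "Reliability", "Style", "Price",
              "Fun to Drive", "Safety", "Space"]

# Rows are attributes; columns are the ten cars in the order
# BMW, Cavalier, Intrepid, Taurus, Accord, Altima, Saturn, Subaru, Camry, VW Passat
ratings = np.array([
    [-0.413, -0.152, -0.891, -0.543, 0.413, 0.065, 0.587, 0.021, 0.587, 0.326],
    [0.573, -1.034, -0.73, -0.73, 0.921, 0.182, -0.034, 0.182, 0.834, -0.165],
    [1.43, -1.091, -0.221, -1.004, 0.517, 0.12, -0.569, -0.134, 0.43, 0.517],
    [-1.465, 0.969, -0.204, 0.404, -0.247, 0.273, 0.969, -0.334, -0.16, -0.204],
    [1.704, -1.078, 0.139, -1.034, 0.182, 0.008, -0.6, 0.182, 0.095, 0.4],
    [0.652, -0.782, -0.217, 0.0, 0.217, 0.0, -0.173, 0.26, 0.217, -0.173],
    [-0.543, -0.673, 0.326, 0.5, 0.108, 0.108, -0.282, 0.63, 0.195, -0.369],
])

# Pearson correlations between attributes (each row is one variable)
corr = np.corrcoef(ratings)

# Print the lower triangle, one attribute per line
for i, name in enumerate(attributes):
    row = "  ".join(f"{corr[i, j]:5.2f}" for j in range(i + 1))
    print(f"{name:<13}{row}")
```

Attributes whose rows show large positive or negative entries against each other are the ones a single factor can summarize.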
If two variables are positively correlated, then a car that is perceived to be higher (or lower) than average on one of these attributes is likely to be perceived as higher (or lower) on the other as well. If two variables are perfectly correlated, then the second one contains no new information and their sum would contain just as much information as the two variables by themselves. If several variables are correlated, then their weighted sum will contain most of the information in all of them individually. In this example, these five attributes can be represented as a single dimension in a perceptual space without too much loss of information.

Of the other two attributes, Fuel Economy is correlated with Reliability but with no other attributes, and Space is relatively uncorrelated with any other attribute. In a two-dimensional space, we might guess that the first dimension, the horizontal dimension, will represent these first five attributes and the second dimension will represent some combination of the other two. The positioning program produces the following perceptual map:

We see that Fun to Drive, Safety, Style, and Reliability all lie close to each other and Price, which was negatively correlated, points in the opposite direction. The first dimension accounts for 55.1% of the variation in the data. The second dimension, which is primarily Fuel Economy, accounts for 20.1% of the variation in the data. Additionally, brands that are similar to each other are located close to each other in the space, e.g., Camry and Accord are located close to each other. Brands that are distinct, like BMW and Taurus, are located away from other brands.

Construction of Joint Spaces

The information needed to locate preference vectors or ideal points in the space consists of consumers' preferences or purchase likelihoods of each brand. A regression is used to find the relationship between the brand locations and preferences.
Preference, or purchase likelihood, Pj, is the dependent variable and brand locations, X1j and X2j, are the independent variables.

Preference vectors. The location of a preference vector is determined by the regression shown in the next equation:
$$P_j = \hat B_0 + \hat B_1 X_{1j} + \hat B_2 X_{2j} + e_j$$
This procedure is illustrated with an example involving one person's likelihood of purchasing each automobile on a 0 to 10 point scale (where 10 means the person was very likely to purchase the automobile).

Respondent   BMW   Cavalier   Intrepid   Taurus   Accord   Altima   Saturn   Subaru   Camry   VW Passat
Bill         6     2          2          4        10       8        2        8        10      8

The brand locations are given in the "Diagnostics" page of the ME>XL output under Coordinates. The above row of preferences has been special pasted into the last column:

Brand        Dim 1      Dim 2      Bill
BMW           0.6245     0.2516     6
Cavalier     -0.5663    -0.0962     2
Intrepid     -0.1211     0.5908     2
Taurus       -0.3228     0.3792     4
Accord        0.2363    -0.2636    10
Altima        0.0044    -0.071      8
Saturn       -0.2366    -0.4502     2
Subaru        0.1075     0.1541     8
Camry         0.2061    -0.3224    10
VW Passat     0.0681    -0.1723     8

A regression is run with the preferences as the dependent variable and the two coordinates as the independent variables. The regression gave the following results:

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.75
R Square             0.56
Adjusted R Square    0.44
Standard Error       2.45
Observations         10

             Coefficients   Standard Error   t Stat   P-value
Intercept     6.00           0.78             7.73     0.00
1             6.47           2.45             2.64     0.03
2            -3.46           2.45            -1.41     0.20

The positive sign on the regression weight for X1 and the negative sign on X2 indicate that likelihood of purchase increases as one moves to the lower right in the space. Because the first dimension has a weight that is approximately twice as large as that of dimension two (i.e., 6.47 versus -3.46), dimension one is more important when determining likelihood of purchasing. The preference vector is located by drawing a vector through the origin that decreases 3.46 units vertically for every 6.47 units horizontally.
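The preference-vector regression can be reproduced outside Excel. The sketch below fits the same model by ordinary least squares with NumPy, using the coordinates and Bill's ratings from the tables above; the variable names are mine, and the direction angle is an extra convenience not reported by ME>XL.

```python
import numpy as np

# Brand coordinates and Bill's purchase likelihoods, in the order
# BMW, Cavalier, Intrepid, Taurus, Accord, Altima, Saturn, Subaru, Camry, VW Passat
dim1 = np.array([0.6245, -0.5663, -0.1211, -0.3228, 0.2363,
                 0.0044, -0.2366, 0.1075, 0.2061, 0.0681])
dim2 = np.array([0.2516, -0.0962, 0.5908, 0.3792, -0.2636,
                 -0.071, -0.4502, 0.1541, -0.3224, -0.1723])
bill = np.array([6, 2, 2, 4, 10, 8, 2, 8, 10, 8], dtype=float)

# Design matrix with an intercept column, then ordinary least squares
X = np.column_stack([np.ones_like(dim1), dim1, dim2])
coef, *_ = np.linalg.lstsq(X, bill, rcond=None)
b0, b1, b2 = coef

# R-squared = 1 - RSS / TSS
residuals = bill - X @ coef
r2 = 1 - (residuals ** 2).sum() / ((bill - bill.mean()) ** 2).sum()

# Direction of the preference vector, in degrees from the horizontal axis
angle = np.degrees(np.arctan2(b2, b1))

print(np.round(coef, 2), round(r2, 2), round(angle, 1))
```

The fitted weights should come out close to the 6.00, 6.47, and -3.46 in the regression output, and a negative angle corresponds to a vector pointing toward the lower right of the map.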
The preference vector for this person is shown in the following figure. This can be drawn manually by starting at the origin and moving some multiple of 3.46 down for the same multiple of 6.47 to the right.

Ideal points. This formulation assumes that the squared distance between a brand (X1j, X2j) and an ideal point (Y1, Y2) in two dimensions is inversely related to preference for the brand, i.e., an ideal point is located close to brands preferred by that person. The following equation models preference for a given brand as a linear function of its squared distance from the ideal point in a two-dimensional perceptual space:
$$P_j = a + b\,d_j^2 = a + b\sum_{i=1}^{2} (Y_i - X_{ij})^2$$
where Pj, a, b, and Xij are defined as before. Yi is the location of the ideal point on the ith dimension. The negative sign on the b indicates that preference for a brand decreases the further it is from the ideal point (see footnote 1). This relationship between brand preference and its squared distance from the ideal point can be used to locate the ideal point. Stated differently, we know a person's preference for each brand, Pj, and the brand's location in a perceptual space, Xij; we want to find the location of the ideal point, Yi. The above equation is transformed into the following nonlinear regression equation:
$$P_j = \hat a + \hat b\sum_{i=1}^{2} (\hat Y_i - X_{ij})^2 + e_j$$
where Pj is the preference for the jth brand, $(\hat Y_1, \hat Y_2)$ are the coordinates of the ideal point, (X1j, X2j) are the coordinates of the jth brand, and $\hat a$ and $\hat b$ are regression weights. We can expand the above equation as follows:
$$P_j = \hat a + \hat b\sum_{i=1}^{2} (\hat Y_i - X_{ij})^2 + e_j = \hat a + \hat b\sum_{i=1}^{2} \hat Y_i^2 - 2\hat b\sum_{i=1}^{2} \hat Y_i X_{ij} + \hat b\sum_{i=1}^{2} X_{ij}^2 + e_j$$
$$P_j = \hat a + \hat b\sum_{i=1}^{2} \hat Y_i^2 - 2\hat b\hat Y_1 X_{1j} - 2\hat b\hat Y_2 X_{2j} + \hat b\,(X_{1j}^2 + X_{2j}^2) + e_j$$

Footnote 1: If the sign on b is positive, it indicates that preference increases as a brand moves away from the ideal point. In this case the ideal point is called an "anti-ideal point" as it indicates a position of minimum, rather than maximum, preference.
Remember, the goal is to find estimates for $(\hat Y_1, \hat Y_2)$. The above equation is rewritten so preference can be a function of just the X's as follows:
$$P_j = \hat B_0 + \hat B_1 X_{1j} + \hat B_2 X_{2j} + \hat B_3 (X_{1j}^2 + X_{2j}^2) + e_j$$
where $\hat B_0 = \hat a + \hat b\sum_{i=1}^{2} \hat Y_i^2$, $\hat B_3 = \hat b$, and $\hat B_i = -2\hat b\hat Y_i$ for i = 1, 2. The location of the ideal point on the ith dimension, $\hat Y_i$, is given by:
$$\hat Y_i = -\frac{\hat B_i}{2\hat B_3}$$
While the math may look complicated, it just shows it is possible to run a regression where Pj is the dependent variable and X1j^2 + X2j^2, X1j, and X2j are the three independent variables. The location of the ideal point is given by the above equation. If $\hat B_3$ is negative, then Yi represents an ideal point, a place of maximum preference. If $\hat B_3$ is positive, then Yi represents an anti-ideal point, a point of minimum preference.

Again, this is illustrated with the data from the same person as before. The only difference is the addition of the X1j^2 + X2j^2 term to the previous regression.

Brand        Dim 1      Dim 2      Dim 1^2 + Dim 2^2   Bill
BMW           0.6245     0.2516    0.4533               6
Cavalier     -0.5663    -0.0962    0.3300               2
Intrepid     -0.1211     0.5908    0.3637               2
Taurus       -0.3228     0.3792    0.2480               4
Accord        0.2363    -0.2636    0.1253              10
Altima        0.0044    -0.071     0.0051               8
Saturn       -0.2366    -0.4502    0.2587               2
Subaru        0.1075     0.1541    0.0353               8
Camry         0.2061    -0.3224    0.1464              10
VW Passat     0.0681    -0.1723    0.0343               8

Again, one would expect the ideal point to be in the center of the Japanese cars and closest to the Accord and Camry. The following regression is run:

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.93
R Square             0.86
Adjusted R Square    0.79
Standard Error       1.50
Observations         10

                     Coefficients   Standard Error   t Stat   P-value
Intercept             8.55           0.86             9.97     0.00
1                     6.18           1.50             4.11     0.01
2                    -0.99           1.65            -0.60     0.57
Dim 1^2 + Dim 2^2   -12.74           3.57            -3.57     0.01

The coefficient associated with the X1j^2 + X2j^2 term is negative, so this represents an ideal point.
The coordinates are given by:
Y1 = -6.18 / {2 * (-12.74)} ≈ .24 and Y2 = -(-.99) / {2 * (-12.74)} ≈ -.04
This puts the ideal point to the right of the origin and slightly below it, on the side of the map where the Accord and Camry lie. This location is different from the one generated by ME>XL plotted in the next figure.
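The ideal-point calculation can be sketched in Python as follows. The function ideal_point is a hypothetical helper written for this note: it adds the squared-distance column, runs ordinary least squares, and converts the weights to coordinates using Y_i = -B_i / (2 B_3), which follows from expanding a + b Σ(Y_i - X_ij)^2. The data are Bill's ratings and the brand coordinates from the tables above.

```python
import numpy as np


def ideal_point(x1, x2, pref):
    """Regress pref on x1, x2 and x1^2 + x2^2; return (weights, ideal point).

    weights = (B0, B1, B2, B3). The point is an ideal point if B3 < 0
    and an anti-ideal point if B3 > 0.
    """
    x1, x2, pref = (np.asarray(v, dtype=float) for v in (x1, x2, pref))
    X = np.column_stack([np.ones_like(x1), x1, x2, x1**2 + x2**2])
    weights, *_ = np.linalg.lstsq(X, pref, rcond=None)
    b0, b1, b2, b3 = weights
    point = (-b1 / (2 * b3), -b2 / (2 * b3))
    return weights, point


# Brand coordinates and Bill's ratings, in the order
# BMW, Cavalier, Intrepid, Taurus, Accord, Altima, Saturn, Subaru, Camry, VW Passat
dim1 = np.array([0.6245, -0.5663, -0.1211, -0.3228, 0.2363,
                 0.0044, -0.2366, 0.1075, 0.2061, 0.0681])
dim2 = np.array([0.2516, -0.0962, 0.5908, 0.3792, -0.2636,
                 -0.071, -0.4502, 0.1541, -0.3224, -0.1723])
bill = np.array([6, 2, 2, 4, 10, 8, 2, 8, 10, 8], dtype=float)

weights, point = ideal_point(dim1, dim2, bill)
kind = "ideal" if weights[3] < 0 else "anti-ideal"
print(np.round(weights, 2), kind, np.round(point, 2))
```

A quick way to trust the conversion is to generate preferences from a known ideal point, for example P = 9 - 12 d^2 around (0.3, -0.2), and confirm that the function returns that point.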